# Silver EDA

This notebook conducts the EDA needed to determine the steps for transforming the raw dataset into a silver state. Recall that this project uses a pseudo [medallion](https://www.databricks.com/glossary/medallion-architecture) architecture, in which silver-layer data has been transformed into a state ready for analysis and feature engineering. See docs\etl.drawio.svg for more details.

In [1]:
import sys 
import polars as pl
import numpy as np

sys.path.append('..')
from leash_bio_ai.utils.conf import train_dir

train_df = pl.scan_parquet(source=train_dir)

In [2]:
train_df.head(n=5).collect()

| id (i64) | buildingblock1_smiles (str) | buildingblock2_smiles (str) | buildingblock3_smiles (str) | molecule_smiles (str) | protein_name (str) | binds (i64) |
|---|---|---|---|---|---|---|
| 0 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.Br.NCC1CCCN1c1cccnn1"` | `"C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…` | `"BRD4"` | 0 |
| 1 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.Br.NCC1CCCN1c1cccnn1"` | `"C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…` | `"HSA"` | 0 |
| 2 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.Br.NCC1CCCN1c1cccnn1"` | `"C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…` | `"sEH"` | 0 |
| 3 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.NCc1cccc(Br)n1"` | `"C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…` | `"BRD4"` | 0 |
| 4 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.NCc1cccc(Br)n1"` | `"C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…` | `"HSA"` | 0 |


### Dataset Size

The primary obstacle of this project is dataset size, given that it relies on local computing resources. Developing the silver layer will mainly involve reducing dataset size to enable the analysis and feature engineering that lead into modeling. Note the following opportunities for dataset size reduction:
1. Class Imbalance - the binds column is heavily skewed toward the negative class
2. Column Value Duplication - the molecule_smiles column encapsulates the same information as buildingblock1_smiles, buildingblock2_smiles, and buildingblock3_smiles

#### Class Imbalance and Molecule Smiles

In [3]:
n_pos = (train_df
         .filter(pl.col("binds")==1)
         .select(pl.count("id"))
         .collect(streaming=True)
         .item())

n_neg = (train_df
         .filter(pl.col("binds")==0)
         .select(pl.count("id"))
         .collect(streaming=True)
         .item())

pct_pos = round(100 * n_pos / (n_pos + n_neg), 4)

print(f"Positive Protein Bonds Records: {n_pos}")
print(f"Negative Protein Bonds Records: {n_neg}")
print(f"Percent of Total Records that are Positive: {pct_pos}%")

Positive Protein Bonds Records: 1589906
Negative Protein Bonds Records: 293656924
Percent of Total Records that are Positive: 0.5385%


Less than 1% of the training dataset belongs to the positive binds class, making this dataset highly imbalanced. When creating the silver layer, the negative class can be downsampled to reduce dataset size and make analysis, feature engineering, and modeling more feasible on local compute.

Nearly all of the important molecule information is contained in the "molecule_smiles" column. The more unique molecule_smiles values appear in the negative class, the more data we likely need to keep for a future model to generalize well to unseen data. Below, we take the first 200,000 negative-class rows in the train set and count how many of their molecule_smiles values are unique. Because these are the first rows rather than a random sample, the result is only a rough indication. A high level of uniqueness would suggest that uniqueness is common across the data and would motivate retaining more negative-class samples so that a future model sees as many molecule_smiles values as possible.

In [4]:
top_n = (train_df
         .filter(pl.col("binds")==0)
         .select("molecule_smiles")
         .head(n=200000)
         .collect(streaming=True)
         .n_unique())

In [5]:
print(f"{top_n} of 200,000 molecule_smiles sampled are unique.")

66814 of 200,000 molecule_smiles sampled are unique.
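To put this count in context: the preview at the top of the notebook shows each molecule repeated once per protein (BRD4, HSA, sEH). Assuming that pattern holds throughout the sampled rows, which is not verified here, about one third of the rows would be unique even if every molecule were distinct. The arithmetic below compares the observed ratio with that ceiling.

```python
n_rows = 200_000      # rows sampled in the cell above
n_unique = 66_814     # unique molecule_smiles values reported above
n_proteins = 3        # assumption: one row per molecule per protein, as in the preview

unique_ratio = n_unique / n_rows
ceiling_if_all_distinct = 1 / n_proteins

print(round(unique_ratio, 4), round(ceiling_if_all_distinct, 4))
```

Under that assumption, the observed ratio sits close to the ceiling, which would suggest that almost every molecule in the sampled negative rows is distinct.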


In [2]:
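# Attach a uniform random value in [0, 1) to each of the first five rows
sample_df = train_df.head(n=5)
n_rows = sample_df.select(pl.len()).collect().item()
randn = pl.Series(name='randn', values=np.random.uniform(low=0, high=1, size=n_rows))
sample_df.with_columns(randn).collect()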

| id (i64) | buildingblock1_smiles (str) | buildingblock2_smiles (str) | buildingblock3_smiles (str) | molecule_smiles (str) | protein_name (str) | binds (i64) | randn (f64) |
|---|---|---|---|---|---|---|---|
| 0 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.Br.NCC1CCCN1c1cccnn1"` | `"C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…` | `"BRD4"` | 0 | 0.306955 |
| 1 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.Br.NCC1CCCN1c1cccnn1"` | `"C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…` | `"HSA"` | 0 | 0.602447 |
| 2 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.Br.NCC1CCCN1c1cccnn1"` | `"C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…` | `"sEH"` | 0 | 0.815065 |
| 3 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.NCc1cccc(Br)n1"` | `"C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…` | `"BRD4"` | 0 | 0.057577 |
| 4 | `"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…` | `"C#CCOc1ccc(CN)cc1.Cl"` | `"Br.NCc1cccc(Br)n1"` | `"C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…` | `"HSA"` | 0 | 0.857685 |
