# Silver EDA

This notebook conducts the EDA needed to determine the steps for transforming the raw dataset into a silver state. Recall that this project uses a pseudo [medallion](https://www.databricks.com/glossary/medallion-architecture) architecture, in which the silver layer holds data transformed into a state ready for analysis and feature engineering. See docs\etl.drawio.svg for more details.

In [1]:
import sys 
import polars as pl
import numpy as np

sys.path.append('..')
from leash_bio_ai.utils.conf import train_dir

train_df = pl.scan_parquet(source=train_dir)

In [3]:
train_df.head(n=5).collect()

id,buildingblock1_smiles,buildingblock2_smiles,buildingblock3_smiles,molecule_smiles,protein_name,binds
i64,str,str,str,str,str,i64
0,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.Br.NCC1CCCN1c1cccnn1""","""C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…","""BRD4""",0
1,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.Br.NCC1CCCN1c1cccnn1""","""C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…","""HSA""",0
2,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.Br.NCC1CCCN1c1cccnn1""","""C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…","""sEH""",0
3,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.NCc1cccc(Br)n1""","""C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…","""BRD4""",0
4,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.NCc1cccc(Br)n1""","""C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…","""HSA""",0


### Dataset Size

The primary obstacle in this project is dataset size, given that the work relies on local computing resources. Developing the silver layer will mainly involve reducing dataset size to enable the analysis and feature engineering that lead into modeling efforts. Note the following opportunities for dataset size reduction:
1. Class Imbalance - the binary binds column (stored as i64) is heavily skewed toward 0
2. Column Value Duplication - the molecule_smiles column encapsulates the same information as the buildingblock1_smiles, buildingblock2_smiles, and buildingblock3_smiles columns

#### Class Imbalance

In [4]:
print("Positive Protein Bonds Records")
(train_df
 .filter(pl.col("binds")==1)
 .select(pl.count("id"))
 .collect(streaming=True))

Positive Protein Bonds Records


id
u32
1589906


In [5]:
print("Negative Protein Bonds Records")
(train_df
 .filter(pl.col("binds")==0)
 .select(pl.count("id"))
 .collect(streaming=True))

Negative Protein Bonds Records


id
u32
293656924


In [6]:
print("Total Dataset Records")
(train_df
 .select(pl.count("id"))
 .collect(streaming=True))

Total Dataset Records


id
u32
295246830


The class imbalance is heavy: only about 0.54% of records are positive (1,589,906 of 295,246,830). This calls for either upsampling or downsampling to reduce bias in future modeling efforts. Given the goal of reducing dataset size, downsampling is the natural choice, since it makes the dataset both smaller and more class balanced.
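
To make the imbalance concrete, the arithmetic below uses the counts printed above to derive the positive share, the number of negatives per positive, and the fraction of negatives that would be kept for a chosen target ratio. The helper `negative_keep_fraction` and the 1:1 target are hypothetical, used only to illustrate the calculation.

```python
# Record counts taken from the cell outputs above.
n_pos = 1_589_906
n_neg = 293_656_924
n_total = n_pos + n_neg  # matches the total record count of 295,246,830

pos_share = n_pos / n_total    # share of records with binds == 1
neg_per_pos = n_neg / n_pos    # negatives per positive record


def negative_keep_fraction(n_pos: int, n_neg: int, target_neg_per_pos: float) -> float:
    """Fraction of negatives to keep so that negatives:positives hits the target."""
    return min(1.0, target_neg_per_pos * n_pos / n_neg)


# Hypothetical target: one negative per positive.
keep_1to1 = negative_keep_fraction(n_pos, n_neg, target_neg_per_pos=1.0)

print(f"positive share:        {pos_share:.4%}")
print(f"negatives per positive: {neg_per_pos:.1f}")
print(f"keep fraction for 1:1:  {keep_1to1:.4%}")
```

A less aggressive target, such as several negatives per positive, scales the keep fraction linearly.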

In [4]:
print("Unique Molecule Smiles in Random Sample")



Unique Molecule Smiles in Random Sample



In [2]:
# Add a uniform random column to a 5-row preview.
preview = train_df.head(n=5)
n_rows = preview.select(pl.len()).collect().item()
(preview
 .with_columns(pl.Series(name='randn', values=np.random.uniform(low=0, high=1, size=n_rows)))
 .collect())

id,buildingblock1_smiles,buildingblock2_smiles,buildingblock3_smiles,molecule_smiles,protein_name,binds,randn
i64,str,str,str,str,str,i64,f64
0,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.Br.NCC1CCCN1c1cccnn1""","""C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…","""BRD4""",0,0.306955
1,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.Br.NCC1CCCN1c1cccnn1""","""C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…","""HSA""",0,0.602447
2,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.Br.NCC1CCCN1c1cccnn1""","""C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…","""sEH""",0,0.815065
3,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.NCc1cccc(Br)n1""","""C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…","""BRD4""",0,0.057577
4,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.NCc1cccc(Br)n1""","""C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…","""HSA""",0,0.857685
