# Preliminary Data Examination

This notebook takes a first look at the dataset. Its purpose is not to run data analysis or modeling tasks, but to see what the data structure looks like, both to inform and plan future work on the project and to develop a level of comfort with what the data looks like.

The data is stored in parquet files. Given the size of the data, I use polars here to pull in a small chunk (1000 rows) and examine the data types and what observations look like.

In [5]:
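import sys

import polars as pl

# Make the project package importable from the notebook's parent directory.
sys.path.append('..')
from leash_bio_ai.utils.conf import train_dir, test_dir, sample_submission_dir

# Read only the first 1000 rows of each parquet file to keep memory use small.
train_df = pl.read_parquet(source=train_dir, n_rows=1000)
test_df = pl.read_parquet(source=test_dir, n_rows=1000)
# The sample submission is a CSV and is read in full.
sample_sub_df = pl.read_csv(source=sample_submission_dir)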

#### Train and Test Set Structure

In [6]:
train_df.head(n=10)

id,buildingblock1_smiles,buildingblock2_smiles,buildingblock3_smiles,molecule_smiles,protein_name,binds
i64,str,str,str,str,str,i64
0,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.Br.NCC1CCCN1c1cccnn1""","""C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…","""BRD4""",0
1,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.Br.NCC1CCCN1c1cccnn1""","""C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…","""HSA""",0
2,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.Br.NCC1CCCN1c1cccnn1""","""C#CCOc1ccc(CNc2nc(NCC3CCCN3c3c…","""sEH""",0
3,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.NCc1cccc(Br)n1""","""C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…","""BRD4""",0
4,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.NCc1cccc(Br)n1""","""C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…","""HSA""",0
5,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""Br.NCc1cccc(Br)n1""","""C#CCOc1ccc(CNc2nc(NCc3cccc(Br)…","""sEH""",0
6,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""C#CCOc1ccc(CN)cc1.Cl""","""C#CCOc1ccc(CNc2nc(NCc3ccc(OCC#…","""BRD4""",0
7,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""C#CCOc1ccc(CN)cc1.Cl""","""C#CCOc1ccc(CNc2nc(NCc3ccc(OCC#…","""HSA""",0
8,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""C#CCOc1ccc(CN)cc1.Cl""","""C#CCOc1ccc(CNc2nc(NCc3ccc(OCC#…","""sEH""",0
9,"""C#CC[C@@H](CC(=O)O)NC(=O)OCC1c…","""C#CCOc1ccc(CN)cc1.Cl""","""C=C(C)C(=O)NCCN.Cl""","""C#CCOc1ccc(CNc2nc(NCCNC(=O)C(=…","""BRD4""",0


In [7]:
test_df.head(n=10)

id,buildingblock1_smiles,buildingblock2_smiles,buildingblock3_smiles,molecule_smiles,protein_name
i64,str,str,str,str,str
295246830,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""C=Cc1ccc(N)cc1""","""C#CCCC[C@H](Nc1nc(Nc2ccc(C=C)c…","""BRD4"""
295246831,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""C=Cc1ccc(N)cc1""","""C#CCCC[C@H](Nc1nc(Nc2ccc(C=C)c…","""HSA"""
295246832,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""C=Cc1ccc(N)cc1""","""C#CCCC[C@H](Nc1nc(Nc2ccc(C=C)c…","""sEH"""
295246833,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""CC(O)Cn1cnc2c(N)ncnc21""","""C#CCCC[C@H](Nc1nc(Nc2ccc(C=C)c…","""BRD4"""
295246834,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""CC(O)Cn1cnc2c(N)ncnc21""","""C#CCCC[C@H](Nc1nc(Nc2ccc(C=C)c…","""HSA"""
295246835,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""CC(O)Cn1cnc2c(N)ncnc21""","""C#CCCC[C@H](Nc1nc(Nc2ccc(C=C)c…","""sEH"""
295246836,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""CC1(C)CCCC1(O)CN""","""C#CCCC[C@H](Nc1nc(NCC2(O)CCCC2…","""BRD4"""
295246837,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""CC1(C)CCCC1(O)CN""","""C#CCCC[C@H](Nc1nc(NCC2(O)CCCC2…","""HSA"""
295246838,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""CC1(C)CCCC1(O)CN""","""C#CCCC[C@H](Nc1nc(NCC2(O)CCCC2…","""sEH"""
295246839,"""C#CCCC[C@H](NC(=O)OCC1c2ccccc2…","""C=Cc1ccc(N)cc1""","""COC(=O)c1cc(Cl)sc1N""","""C#CCCC[C@H](Nc1nc(Nc2ccc(C=C)c…","""BRD4"""


The train and test datasets are primarily composed of string feature columns holding the SMILES representations of the building block molecules described [here](https://www.kaggle.com/competitions/leash-BELKA/overview). The train set also includes a "binds" column of 64-bit signed integer type, a binary indicator of whether the protein and molecule bound together. Additionally, there appears to be an index column named "id", which is also a 64-bit signed integer.

A few things to consider to better handle this memory-intensive dataset:
* "binds" is a binary target variable, so 64 bits is unnecessary; converting it to an 8-bit integer type will ease the memory burden.
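* The "id" column will get large in value when considering the whole dataset. The test ids shown above are around 295 million, which is still below the 32-bit signed maximum of 2,147,483,647, so a switch to a less memory-intensive data type may be possible, but only once the largest id in the full dataset has been checked.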
* The feature columns are all string types. Future work should consider how to represent them numerically, as an array or through another translation using [RDKit](https://www.rdkit.org/docs/GettingStartedInPython.html).
* The "protein_name" column has 3 distinct string values, described [here](https://www.kaggle.com/competitions/leash-BELKA/data). A simple integer encoding could likely represent each protein_name as an 8-bit integer and be more memory efficient.

#### Sample Submission Structure

In [8]:
sample_sub_df.head(n=10)

id,binds
i64,f64
295246830,0.5
295246831,0.5
295246832,0.5
295246833,0.5
295246834,0.5
295246835,0.5
295246836,0.5
295246837,0.5
295246838,0.5
295246839,0.5


The sample submission dataframe has ids ("id" column) corresponding to the test dataframe, along with an example predicted probability of the molecule and protein binding for each test set row. This is a simple layout for submissions and will be reproduced when model results are generated.