The goal is to create a mapping from SMILES strings to boolean activity labels for the smaller dataset, supplemented with a sample of 2,000 inactive compounds from the bigger dataset.

In [1]:
import pandas as pd
import numpy as np

Start by loading both datasets into memory, skipping rows 1 to 4 of each CSV (the four lines directly below the header).

In [2]:
datatable = pd.read_csv("../data/raw/smaller_dataset/AID_624381_datatable.csv", skiprows=[1, 2, 3, 4])
big_datatable = pd.read_csv("../data/raw/AID_624169_datatable.csv", skiprows=[1, 2, 3, 4])
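
As a small illustration of how `skiprows` behaves with a list, the sketch below reads a toy CSV from memory. The file contents, including the four `meta` rows, are made up for this example and are only an assumption about the kind of non-data lines being skipped; the point is that row 0 is still used as the header and rows 1 to 4 never reach the DataFrame.

```python
import io

import pandas as pd

# Toy CSV (invented for illustration): a header, four non-data rows, two data rows.
raw = io.StringIO(
    "PUBCHEM_SID,PUBCHEM_CID,PUBCHEM_ACTIVITY_OUTCOME\n"
    "meta1,,\n"
    "meta2,,\n"
    "meta3,,\n"
    "meta4,,\n"
    "101,11,Active\n"
    "102,12,Inactive\n"
)

# A list passed to skiprows holds 0-indexed line numbers; line 0 (the header) is kept.
toy = pd.read_csv(raw, skiprows=[1, 2, 3, 4])

print(toy)
print(toy.dtypes)
```

Because the non-data rows are dropped before parsing, the ID columns can be inferred as integers rather than strings.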

In [3]:
datatable.head()

Unnamed: 0,PUBCHEM_RESULT_TAG,PUBCHEM_SID,PUBCHEM_CID,PUBCHEM_ACTIVITY_OUTCOME,PUBCHEM_ACTIVITY_SCORE,PUBCHEM_ACTIVITY_URL,PUBCHEM_ASSAYDATA_COMMENT,Average Activation at 7.6 uM,Standard Deviation,Activation at 7.6 uM [1],Activation at 7.6 uM [2],Activation at 7.6 uM [3]
0,1,56463199,11957566,Active,100,,,179.13,13.903,165.991,198.366,173.028
1,2,56463291,16219016,Active,97,,,174.28,19.2319,193.587,181.207,148.034
2,3,46500349,23640911,Active,95,,,171.02,14.6551,151.428,174.977,186.666
3,4,85273743,55397,Active,95,,,170.53,8.62722,164.413,164.447,182.731
4,5,56463619,5282106,Active,95,,,170.21,12.8275,176.306,181.961,152.367


In [4]:
print(datatable.isnull().sum())
print(big_datatable.isnull().sum())

PUBCHEM_RESULT_TAG                 0
PUBCHEM_SID                        0
PUBCHEM_CID                        0
PUBCHEM_ACTIVITY_OUTCOME           0
PUBCHEM_ACTIVITY_SCORE             0
PUBCHEM_ACTIVITY_URL            2266
PUBCHEM_ASSAYDATA_COMMENT       2266
Average Activation at 7.6 uM       0
Standard Deviation                 0
Activation at 7.6 uM [1]           0
Activation at 7.6 uM [2]           0
Activation at 7.6 uM [3]           0
dtype: int64
PUBCHEM_RESULT_TAG                0
PUBCHEM_SID                       0
PUBCHEM_CID                       1
PUBCHEM_ACTIVITY_OUTCOME          0
PUBCHEM_ACTIVITY_SCORE            0
PUBCHEM_ACTIVITY_URL         364131
PUBCHEM_ASSAYDATA_COMMENT    364131
Activation at 7.6 uM              0
dtype: int64


The data is fairly clean so far: only one substance (SID) in the bigger dataset is missing its corresponding compound ID (CID). We filter that row out and cast both ID columns to integers.

In [5]:
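# Drop the one row without a CID; .copy() avoids a possible SettingWithCopyWarning below
big_datatable = big_datatable[~big_datatable["PUBCHEM_CID"].isnull()].copy()

type_cols = {"PUBCHEM_SID": int, "PUBCHEM_CID": int}

# A column containing a missing value is read as float, so cast the IDs to int
for col, col_type in type_cols.items():
    datatable[col] = datatable[col].astype(col_type)
    big_datatable[col] = big_datatable[col].astype(col_type)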
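
The same filter-then-cast step can also be written with `dropna` and a single `astype` call. This is only a sketch on a toy frame with invented values, not a replacement for the cell above; it shows that a missing CID forces the column to float and that the cast succeeds once the row is gone.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the assay table; the values are made up for illustration.
toy = pd.DataFrame({
    "PUBCHEM_SID": [101, 102, 103],
    "PUBCHEM_CID": [11.0, np.nan, 13.0],  # one missing CID makes the column float
})

# dropna(subset=...) removes only rows whose CID is missing and returns a new frame.
clean = toy.dropna(subset=["PUBCHEM_CID"])

# astype accepts a column-to-dtype mapping, so both IDs are cast in one call.
clean = clean.astype({"PUBCHEM_SID": int, "PUBCHEM_CID": int})

print(clean)
```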

In [6]:
print(datatable.isnull().sum())
print(big_datatable.isnull().sum())

PUBCHEM_RESULT_TAG                 0
PUBCHEM_SID                        0
PUBCHEM_CID                        0
PUBCHEM_ACTIVITY_OUTCOME           0
PUBCHEM_ACTIVITY_SCORE             0
PUBCHEM_ACTIVITY_URL            2266
PUBCHEM_ASSAYDATA_COMMENT       2266
Average Activation at 7.6 uM       0
Standard Deviation                 0
Activation at 7.6 uM [1]           0
Activation at 7.6 uM [2]           0
Activation at 7.6 uM [3]           0
dtype: int64
PUBCHEM_RESULT_TAG                0
PUBCHEM_SID                       0
PUBCHEM_CID                       0
PUBCHEM_ACTIVITY_OUTCOME          0
PUBCHEM_ACTIVITY_SCORE            0
PUBCHEM_ACTIVITY_URL         364130
PUBCHEM_ASSAYDATA_COMMENT    364130
Activation at 7.6 uM              0
dtype: int64


Problem solved.

Now we start building the final dataset by mapping the activity outcome labels to 0 (inactive) and 1 (active).

In [7]:
# Keep only the columns needed for labelling (the activity score is used for filtering later)
cid_to_outcome = datatable[["PUBCHEM_CID", "PUBCHEM_ACTIVITY_OUTCOME"]]
big_cid_to_outcome = big_datatable[["PUBCHEM_CID", "PUBCHEM_ACTIVITY_OUTCOME", "PUBCHEM_ACTIVITY_SCORE"]]

# Encode the outcome labels as 0/1; replace() leaves any other label unchanged
cid_to_outcome = cid_to_outcome.replace({"PUBCHEM_ACTIVITY_OUTCOME": {"Inactive": 0, "Active": 1}})
big_cid_to_outcome = big_cid_to_outcome.replace({"PUBCHEM_ACTIVITY_OUTCOME": {"Inactive": 0, "Active": 1}})
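
Since `replace` silently passes through any label it does not know, it can be worth checking that only the two expected labels occur. The helper below is a hypothetical sketch (the name `encode_outcome` and the constant `LABELS` are not part of the notebook) that maps labels to 0/1 and raises if anything else shows up.

```python
import pandas as pd

# Assumed label set, taken from the mapping used in the notebook cell.
LABELS = {"Inactive": 0, "Active": 1}


def encode_outcome(outcomes: pd.Series) -> pd.Series:
    """Map outcome labels to 0/1, raising if a label is not in LABELS."""
    unknown = [label for label in outcomes.unique() if label not in LABELS]
    if unknown:
        raise ValueError(f"unexpected outcome labels: {unknown}")
    return outcomes.map(LABELS).astype(int)


encoded = encode_outcome(pd.Series(["Active", "Inactive", "Active"]))
print(encoded.tolist())
```

Failing loudly here means a stray label cannot end up as a string inside an otherwise numeric column.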

Let's take a couple of thousand inactive compounds from the bigger dataset that do not appear in the smaller dataset, to supplement it. The first step is to drop every CID the smaller dataset already contains.

In [8]:
print(len(big_cid_to_outcome))
big_cid_to_outcome = big_cid_to_outcome[~big_cid_to_outcome["PUBCHEM_CID"].isin(cid_to_outcome["PUBCHEM_CID"])]
print(len(big_cid_to_outcome))

364130
361862


In [9]:
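# Keep only compounds with an activity score of 0
big_cid_to_outcome = big_cid_to_outcome[big_cid_to_outcome["PUBCHEM_ACTIVITY_SCORE"] == 0]
big_cid_to_outcome = big_cid_to_outcome[["PUBCHEM_CID", "PUBCHEM_ACTIVITY_OUTCOME"]]
# Fixed seed (an arbitrary choice) so the same 2000 compounds are drawn on every run
extra_data = big_cid_to_outcome.sample(n=2000, random_state=0)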
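
The exclusion and sampling steps can be checked on a toy example. The frames below are invented for illustration; the sketch shows the `isin` anti-join, the score filter, and that passing the same `random_state` to `sample` returns the same rows each time (without it, each run draws a different subset).

```python
import pandas as pd

# Toy data (made up): `small` plays the smaller dataset, `big` the bigger one.
small = pd.DataFrame({"PUBCHEM_CID": [1, 2, 3]})
big = pd.DataFrame({
    "PUBCHEM_CID": [2, 3, 4, 5, 6, 7],
    "PUBCHEM_ACTIVITY_SCORE": [0, 0, 0, 10, 0, 0],
})

# Anti-join: keep rows of `big` whose CID does not occur in `small`.
candidates = big[~big["PUBCHEM_CID"].isin(small["PUBCHEM_CID"])]

# Keep only rows with an activity score of 0.
candidates = candidates[candidates["PUBCHEM_ACTIVITY_SCORE"] == 0]

# Sampling twice with the same seed gives identical rows.
first = candidates.sample(n=2, random_state=0)
second = candidates.sample(n=2, random_state=0)

print(candidates["PUBCHEM_CID"].tolist())
print(first.equals(second))
```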

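We now look up the isomeric SMILES for each CID through PubChemPy. Every call to `Compound.from_cid` is a separate request to PubChem, so this cell is slow for thousands of compounds.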

In [10]:
from pubchempy import Compound

def get_smiles(cid: int) -> str:
    """Fetch the isomeric SMILES for a PubChem CID (one network request per call)."""
    comp = Compound.from_cid(cid)
    return comp.isomeric_smiles

cid_to_outcome["SMILES"] = cid_to_outcome["PUBCHEM_CID"].apply(get_smiles)
extra_data["SMILES"] = extra_data["PUBCHEM_CID"].apply(get_smiles)
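
Because each lookup is a network request, a single failure partway through `apply` loses all the work done so far, and a CID that appears more than once is fetched repeatedly. One possible mitigation is sketched below. The helper `lookup_all` is hypothetical and takes the fetch function as an argument, so the demonstration uses a stand-in (`fake_fetch`, returning placeholder strings rather than real SMILES) instead of contacting PubChem.

```python
from typing import Callable, Dict, Iterable, List, Optional


def lookup_all(cids: Iterable[int], fetch: Callable[[int], str]) -> Dict[int, Optional[str]]:
    """Call `fetch` once per distinct CID; store None when a lookup fails."""
    results: Dict[int, Optional[str]] = {}
    for cid in cids:
        if cid in results:
            continue  # already looked up, skip the repeated request
        try:
            results[cid] = fetch(cid)
        except Exception:
            results[cid] = None  # record the failure and keep going
    return results


# Stand-in for a real lookup: records its calls and fails for CID 3.
calls: List[int] = []


def fake_fetch(cid: int) -> str:
    calls.append(cid)
    if cid == 3:
        raise RuntimeError("simulated network error")
    return f"placeholder-{cid}"


smiles_by_cid = lookup_all([1, 3, 1, 2], fake_fetch)
print(smiles_by_cid)
print(calls)
```

In the notebook this could be used as `lookup_all(cid_to_outcome["PUBCHEM_CID"], get_smiles)`, followed by `cid_to_outcome["PUBCHEM_CID"].map(...)` on the returned dictionary; rows left without a SMILES can then be retried or dropped.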

Finally, we save both datasets.

In [11]:
cid_to_outcome = cid_to_outcome[["SMILES", "PUBCHEM_ACTIVITY_OUTCOME"]]
extra_data = extra_data[["SMILES", "PUBCHEM_ACTIVITY_OUTCOME"]]

cid_to_outcome.to_csv("../data/interim/smaller_dataset/SMILES_to_Activity.csv", index=False)
extra_data.to_csv("../data/interim/smaller_dataset/extra_SMILES_to_Activity.csv", index=False)