We want to augment the smaller dataset with negatives drawn from the larger dataset. To get a wide variety of molecules, we cluster them and then sample evenly from each cluster. Start by loading the larger dataset, dropping rows with a missing CID, removing molecules that also appear in the smaller dataset, and keeping only the molecules labeled inactive so that no positives are left over.

In [2]:
import pandas as pd
import rdkit

In [2]:
big_data = pd.read_csv("../data/raw/AID_624169_datatable.csv", skiprows=[1, 2, 3, 4])
print(len(big_data))

364131


In [3]:
# Drop rows that have no PubChem compound ID
big_data = big_data[big_data["PUBCHEM_CID"].notnull()]
print(len(big_data))

364130


In [4]:
small_data = pd.read_csv("../data/raw/smaller_dataset/AID_624381_datatable.csv", skiprows=[1, 2, 3, 4])

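# Cast the ID columns to int so that CIDs compare equal across both tables
type_cols = {"PUBCHEM_SID": int, "PUBCHEM_CID": int}

for col, col_type in type_cols.items():
    big_data[col] = big_data[col].astype(col_type)
    small_data[col] = small_data[col].astype(col_type)

# Drop molecules that already appear in the smaller dataset
big_data = big_data[~big_data["PUBCHEM_CID"].isin(small_data["PUBCHEM_CID"])]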
print(len(big_data))

361862


In [5]:
big_data = big_data[big_data["PUBCHEM_ACTIVITY_OUTCOME"] == "Inactive"]
print(len(big_data))

361714


Next we add the SMILES representation of the molecules to the table. The CID-to-SMILES mapping comes from ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-SMILES.gz

In [6]:
CID_to_SMILES = pd.read_csv("../data/raw/CID-SMILES", sep='\t', header=None, names=["CID", "SMILES"])
smiles = big_data.join(CID_to_SMILES.set_index("CID"), on="PUBCHEM_CID")
# TODO: consider also filtering out very large molecules
smiles.head()

Unnamed: 0,PUBCHEM_RESULT_TAG,PUBCHEM_SID,PUBCHEM_CID,PUBCHEM_ACTIVITY_OUTCOME,PUBCHEM_ACTIVITY_SCORE,PUBCHEM_ACTIVITY_URL,PUBCHEM_ASSAYDATA_COMMENT,Activation at 7.6 uM,SMILES
0,1,842121,6603008,Inactive,0,,,-2.06,CCOCCCNCC(=O)NC1=CC=C(C=C1)OC(F)(F)F.Cl
1,2,842122,6602571,Inactive,0,,,-2.81,COCCN1C(=NN=N1)CN2CCC(CC2)CC3=CC=CC=C3.Cl
2,3,842123,6602616,Inactive,0,,,-1.17,COCCN1C(=NN=N1)CN2CCC(CC2)(C3=CC(=CC=C3)C(F)(F...
3,4,842124,644371,Inactive,2,,,4.85,C1CCCN(CC1)CC(=O)NCCC2=CC=C(C=C2)F.C(=O)(C(=O)O)O
4,5,842125,6603132,Inactive,4,,,9.38,COC1=CC=C(C=C1)C(=O)C(C2=CC=CC=C2)N3CCOCC3.Cl


In [7]:
smiles = smiles["SMILES"]
smiles.to_csv("../data/interim/negative_smiles.csv")
# To free memory, restart the kernel and resume from the next cell

In [3]:
smiles = pd.read_csv("../data/interim/negative_smiles.csv")

The clustering function below is taken from the RDKit Cookbook: https://rdkit.readthedocs.io/en/latest/Cookbook.html#clustering-molecules

In [4]:
from rdkit import Chem
from rdkit.Chem import AllChem

def ClusterFps(fps,cutoff=0.2):
    from rdkit import DataStructs
    from rdkit.ML.Cluster import Butina

    # first generate the distance matrix:
    dists = []
    nfps = len(fps)
    for i in range(1,nfps):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i],fps[:i])
        dists.extend([1-x for x in sims])

    # now cluster the data:
    cs = Butina.ClusterData(dists,nfps,cutoff,isDistData=True)
    return cs
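
To make the layout of `dists` concrete, here is a toy sketch in plain Python that uses sets of "on" bit indices in place of RDKit fingerprints. The helper names `tanimoto` and `condensed_distances` are illustrative and not part of RDKit. The loop mirrors the one in `ClusterFps` above and produces the flattened lower triangle of the distance matrix, n(n-1)/2 values for n fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of 'on' bit indices."""
    union = len(a | b)
    # Convention chosen for this sketch: two empty sets count as identical
    return len(a & b) / union if union else 1.0


def condensed_distances(fps):
    """Flattened lower-triangle distances, in the same order as ClusterFps."""
    dists = []
    for i in range(1, len(fps)):
        # Compare fingerprint i with every earlier fingerprint 0..i-1
        dists.extend(1 - tanimoto(fps[i], fps[j]) for j in range(i))
    return dists


# Toy fingerprints (hypothetical bit sets, not real molecules)
toy_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
toy_dists = condensed_distances(toy_fps)
print(toy_dists)  # order: d(1,0), d(2,0), d(2,1)
```

The length of this list grows quadratically with the number of fingerprints, which is what matters for the full dataset.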

In [5]:
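# Parse each SMILES string into an RDKit molecule
ms = [Chem.MolFromSmiles(x) for x in smiles["SMILES"]]
# Morgan fingerprints with radius 2, folded to 1024 bits
fps = [AllChem.GetMorganFingerprintAsBitVect(x, 2, nBits=1024) for x in ms]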



In [None]:
clusters=ClusterFps(fps,cutoff=0.4)
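
If a clustering call like the one above completes, `Butina.ClusterData` returns a tuple of clusters, each a tuple of indices into `fps`. One possible way to then draw negatives evenly across clusters is sketched below. The function `sample_per_cluster`, its `per_cluster` parameter, and the toy clusters are hypothetical, chosen only for illustration; they are not part of the original analysis.

```python
import random


def sample_per_cluster(clusters, per_cluster, seed=0):
    """Pick up to `per_cluster` member indices from every cluster."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    picked = []
    for members in clusters:
        # Clusters smaller than the quota contribute all of their members
        k = min(per_cluster, len(members))
        picked.extend(rng.sample(list(members), k))
    return picked


# Toy clusters in the tuple-of-tuples shape that ClusterData returns
toy_clusters = ((0, 3, 5, 6), (1, 4), (2,))
picked = sample_per_cluster(toy_clusters, per_cluster=2)
print(sorted(picked))
```

The returned indices can be used to look up the corresponding rows in `smiles`, provided the fingerprint list was built in the same order as that table.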

Even on a server with 450 GB of RAM, the clustering above fails for lack of memory, because `ClusterFps` builds the full list of pairwise distances before clustering starts. Another approach is needed.
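
A quick back-of-the-envelope calculation shows the scale of the problem. The sketch below assumes one distance per pair of molecules and 8 bytes per stored value, which is a lower bound because a Python list of floats needs more than that. The function name `pairwise_count` is illustrative.

```python
def pairwise_count(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_molecules = 361_714  # rows left after filtering, from the output above
n_dists = pairwise_count(n_molecules)
gigabytes = n_dists * 8 / 1e9  # assumes 8 bytes per distance (lower bound)

print(f"{n_dists:,} distances, at least {gigabytes:,.0f} GB")
```

Under these assumptions the distance list alone exceeds the 450 GB available, so any workable alternative has to avoid holding all pairwise distances in memory at once.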