Data is extracted from chembl_24.db (from [ChEMBL database](https://www.ebi.ac.uk/chembl/downloads)). We then parse the data and save data into pickle file for use in deep learning of the structures.

Import packages

In [1]:
import numpy as np
import pandas as pd
import os
import sqlite3
import pickle
import rdkit.Chem as chem

from sklearn.utils import shuffle

Connect database using sqlite3 package

In [2]:
db = sqlite3.connect('../chembl_24.db')
c = db.cursor()

Import doc_id from chembl_24.db, then use doc_id to extract molregno (unique internal Chembl compound identifier)

In [3]:
categories = ['%toxin%', '%fungicidal%', '%nematicidal%', '%herbicidal%', '%insecticidal%']

In [4]:
molregno = dict.fromkeys(categories, None)
for cat in categories:
    # extract doc_id from assays that contain agrochemical and non-agrochemical keywords
    doc_id = c.execute("SELECT doc_id FROM assays where description like '%s'" %cat).fetchall()
    doc_id = [i[0] for i in doc_id]
    
    # extract unique compound identifier from doc_id
    molregno[cat] = c.execute("SELECT molregno FROM compound_records WHERE doc_id IN " + str(tuple(doc_id))).fetchall()
    molregno[cat] = set([i[0] for i in molregno[cat]])
    
    print ("%s" %cat[1:-1], ":", len(molregno[cat]))

toxin : 496979
fungicidal : 4279
nematicidal : 497
herbicidal : 3454
insecticidal : 5201


Dispose of toxin data that overlaps with any agrochemical type

In [5]:
%%time
for agro in categories[1:]:
    intersection = molregno["%toxin%"] & molregno[agro]
    print ("Common to toxins and %ses:" %agro[1:-3], len(intersection))
    molregno["%toxin%"] = molregno["%toxin%"].difference(intersection)
    molregno[agro] = molregno[agro].difference(intersection)

Common to toxins and fungicides: 285
Common to toxins and nematicides: 141
Common to toxins and herbicides: 128
Common to toxins and insecticides: 410
CPU times: user 91.8 ms, sys: 19.6 ms, total: 111 ms
Wall time: 109 ms


Group all agrochemicals into one class and remove duplicates

In [6]:
molregno['agrochemical'] = molregno[categories[1]] | molregno[categories[2]] | molregno[categories[3]] | molregno[categories[4]]

In [7]:
categories = ['%toxin%', 'agrochemical']
for cat in categories:
    molregno[cat] = list(molregno[cat])
    print ("%s" %cat[1:-1], ":", len(molregno[cat]))

toxin : 496015
grochemica : 11996


Then, we get canonical smiles string and compound properties from molregno compound identifier. We check to make sure the smiles strings and properties are mapped exactly.

- mw_freebase = Molecular weight of parent compound
- alogp = Calculated ALogP
- hba = number of hydrogen bond acceptors
- hbd = number of hydrogen bond donors
- psa = polar surface area
- rtb = number of rotatable bonds
- acd_logp = calculated octanol/water partition coefficient using ACDlabs v12.01
- acd_logd = calculated octanol/water distribution coefficient at pH 7.4 using ACDlabs v12.01
- full_mwt = molecular weight of the full compound including any salts
- aromatic_rings = number of aromatic rings
- heavy_atoms = number of heavy (non-hydrogen) atoms
- qed_weighted = weighted quantitative estimate of drug likeness
- mw_monoisotopic = monoisotopic parent molecular weight
- hba_lipinski = number of hydrogen bond acceptros calculated according to the Lipinski's original rules (i.e. N + O count)
- hbd_lipinski = number of hydrogen bond donors calculated according to the Lipinski's original rules (i.e., NH + OH count)

In [8]:
smiles_string = dict.fromkeys(categories, None)
mw_freebase_dict = dict.fromkeys(categories, None)
alogp_dict = dict.fromkeys(categories, None)
hba_dict = dict.fromkeys(categories, None)
hbd_dict = dict.fromkeys(categories, None)
psa_dict = dict.fromkeys(categories, None)
rtb_dict = dict.fromkeys(categories, None)
acd_logp_dict = dict.fromkeys(categories, None)
acd_logd_dict = dict.fromkeys(categories, None)
full_mwt_dict = dict.fromkeys(categories, None)
aromatic_rings_dict = dict.fromkeys(categories, None)
heavy_atoms_dict = dict.fromkeys(categories, None)
qed_weighted_dict = dict.fromkeys(categories, None)
mw_monoisotopic_dict = dict.fromkeys(categories, None)
hba_lipinski_dict = dict.fromkeys(categories, None)
hbd_lipinski_dict = dict.fromkeys(categories, None)

In [9]:
for cat in categories:
    smiles_string[cat], mw_freebase_dict[cat], alogp_dict[cat], hba_dict[cat], hbd_dict[cat], psa_dict[cat] = [], [], [], [], [], []
    rtb_dict[cat], acd_logp_dict[cat], acd_logd_dict[cat], full_mwt_dict[cat], aromatic_rings_dict[cat], heavy_atoms_dict[cat] = [], [], [], [], [], []
    qed_weighted_dict[cat], mw_monoisotopic_dict[cat], hba_lipinski_dict[cat], hbd_lipinski_dict[cat] = [], [], [], []
    for num in molregno[cat]:
        smile = c.execute("SELECT canonical_smiles FROM compound_structures WHERE molregno = " + str(num)).fetchall()
        properties = c.execute("SELECT * FROM compound_properties WHERE molregno = " + str(num)).fetchall()
        if not smile or not properties:
            molregno[cat].remove(num)
        else:
            properties = properties[0]
            smiles_string[cat].append(smile[0])
            
            # assign properties to corresponding dictionaries
            mw_freebase_dict[cat].append(properties[1])
            alogp_dict[cat].append(properties[2])
            hba_dict[cat].append(properties[3])
            hbd_dict[cat].append(properties[4])
            psa_dict[cat].append(properties[5])
            rtb_dict[cat].append(properties[6])
            acd_logp_dict[cat].append(properties[11])
            acd_logd_dict[cat].append(properties[12])
            full_mwt_dict[cat].append(properties[14])
            aromatic_rings_dict[cat].append(properties[15])
            heavy_atoms_dict[cat].append(properties[16])
            qed_weighted_dict[cat].append(properties[17])
            mw_monoisotopic_dict[cat].append(properties[18])
            hba_lipinski_dict[cat].append(properties[20])
            hbd_lipinski_dict[cat].append(properties[21])

Convert smiles string and the properties into a long list and then create another list containing their corresponding categorical name

In [10]:
canonical_smiles, label = [], []
mw_freebase, alogp, hba, hbd, psa, rtb, acd_logp, acd_logd, full_mwt, aromatic_rings, heavy_atoms, qed_weighted, mw_monoisotopic, hba_lipinski, hbd_lipinski = [],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
for i, cat in enumerate(smiles_string):
    canonical_smiles += smiles_string[cat]
    label += [i]*len(smiles_string[cat])
    
    mw_freebase += mw_freebase_dict[cat]
    alogp += alogp_dict[cat]
    hba += hba_dict[cat]
    hbd += hbd_dict[cat]
    psa += psa_dict[cat]
    rtb += rtb_dict[cat]
    acd_logp += acd_logp_dict[cat]
    acd_logd += acd_logd_dict[cat]
    full_mwt += full_mwt_dict[cat]
    aromatic_rings += aromatic_rings_dict[cat]
    heavy_atoms += heavy_atoms_dict[cat]
    qed_weighted += qed_weighted_dict[cat]
    mw_monoisotopic += mw_monoisotopic_dict[cat]
    hba_lipinski += hba_lipinski_dict[cat]
    hbd_lipinski += hbd_lipinski_dict[cat]
    

Stack the two lists together

In [11]:
data = np.column_stack((canonical_smiles, label, mw_freebase, alogp, hba, hbd, psa, rtb, acd_logp, acd_logd, full_mwt, aromatic_rings, heavy_atoms, qed_weighted, mw_monoisotopic, hba_lipinski, hbd_lipinski))

Convert data into pandas dataframe

In [12]:
data = pd.DataFrame(data, columns=['smiles', 'agrochemical', 'mw_freebase', 'alogp', 'hba', 'hbd', 'psa', 'rtb', 'acd_logp', 'acd_logd', 'full_mwt', 'aromatic_rings', 'heavy_atoms', 'qed_weighted', 'mw_monoisotopic', 'hba_lipinski', 'hbd_lipinski'])

Add a column containing RDKit Molecule class

In [13]:
%%time
data['mol'] = data['smiles'].apply(chem.MolFromSmiles)

CPU times: user 2min 55s, sys: 5.52 s, total: 3min
Wall time: 3min 2s


Remove null values

In [14]:
data.dropna(axis=0, inplace=True)

Reset index

In [15]:
data.reset_index(drop=True, inplace=True)

Count number of compounds in agrochemical and non-agrochemical category

In [16]:
data['agrochemical'].value_counts()

0    487474
1     11543
Name: agrochemical, dtype: int64

In [17]:
data

Unnamed: 0,smiles,agrochemical,mw_freebase,alogp,hba,hbd,psa,rtb,acd_logp,acd_logd,full_mwt,aromatic_rings,heavy_atoms,qed_weighted,mw_monoisotopic,hba_lipinski,hbd_lipinski,mol
0,Br\C=C\1/CCC(C(=O)O1)c2cccc3ccccc23,0,317.18,4.5,2,0,26.3,1,4.39,4.39,317.18,2,19,0.72,316.01,2,0,<rdkit.Chem.rdchem.Mol object at 0x1059d9a30>
1,Oc1ccc(cc1NC(=O)c2ccccc2NS(=O)(=O)c3ccc(F)cc3)...,0,533.6,3.76,6,3,132.88,7,5.45,4.87,533.6,3,36,0.4,533.109,9,3,<rdkit.Chem.rdchem.Mol object at 0x12899f620>
2,[Na+].Cc1cc(CC(=O)[O-])n(C)c1C(=O)c2ccc(Cl)cc2,0,291.73,2.85,3,1,59.3,4,3.36,0.27,313.72,2,20,0.88,291.066,4,1,<rdkit.Chem.rdchem.Mol object at 0x12899f530>
3,CCN1C=C(C(=O)O)C(=O)c2ccc(C)nc12,0,232.24,1.42,4,1,72.19,2,0.03,-1.54,232.24,2,17,0.85,232.085,5,1,<rdkit.Chem.rdchem.Mol object at 0x12899f6c0>
4,Oc1cc2C(=O)Oc3c(O)c(O)cc4C(=O)Oc(c1O)c2c34,0,302.19,1.31,8,4,141.34,0,0.24,-3.38,302.19,4,22,0.22,302.006,8,4,<rdkit.Chem.rdchem.Mol object at 0x12899f5d0>
5,CC1(C)[C@@H](N2[C@@H](CC2=O)S1(=O)=O)C(=O)O,0,233.24,-0.79,4,1,91.75,1,0.39,-3.33,233.24,0,15,0.6,233.036,6,1,<rdkit.Chem.rdchem.Mol object at 0x12899f760>
6,CN1CCN(CC1)c2cc3N(C=C(C(=O)O)C(=O)c3cc2F)c4ccc...,0,399.4,2.72,5,1,65.78,3,0.84,-0.46,399.4,3,29,0.73,399.139,6,1,<rdkit.Chem.rdchem.Mol object at 0x12899f670>
7,C[C@]1(Cn2ccnn2)[C@@H](N3[C@@H](CC3=O)S1(=O)=O...,0,300.3,-1.52,7,1,122.46,3,0.6,-3.13,300.3,1,20,0.67,300.053,9,1,<rdkit.Chem.rdchem.Mol object at 0x12899f800>
8,CN1CCN(CC1)c2c(F)cc3C(=O)C(=CN(CCF)c3c2F)C(=O)O,0,369.34,1.7,5,1,65.78,4,1.84,-0.27,369.34,2,26,0.89,369.13,6,1,<rdkit.Chem.rdchem.Mol object at 0x12899f710>
9,Cn1cc(C2=C(C(=O)NC2=O)c3cn(CCCSC(=N)N)c4ccccc3...,0,457.56,3.72,6,3,105.9,6,6.1,4.77,457.56,4,33,0.18,457.157,7,4,<rdkit.Chem.rdchem.Mol object at 0x12899f8a0>


### Save data as pickle file labeled dataset2 (all data)

In [18]:
data.to_pickle("./binary_classification/data/dataset2.pkl")

### Save another dataset: dataset1 (balanced dataset with approximately equal proportion of agro and non-agrochemicals) 

In [19]:
agrochemicals = data.loc[data['agrochemical'] == 1]
nonagrochemicals = data.loc[data['agrochemical'] == 0]

In [20]:
type(agrochemicals)

pandas.core.frame.DataFrame

In [21]:
nonagrochemicals = shuffle(nonagrochemicals)
nonagrochemicals = nonagrochemicals[:15000]

In [22]:
dataset1 = pd.concat([agrochemicals, nonagrochemicals], axis=0)

In [23]:
dataset1

Unnamed: 0,smiles,agrochemical,mw_freebase,alogp,hba,hbd,psa,rtb,acd_logp,acd_logd,full_mwt,aromatic_rings,heavy_atoms,qed_weighted,mw_monoisotopic,hba_lipinski,hbd_lipinski,mol
487474,COc1cc2N(C)C3=C(C(=O)OC3)C(C)(c4cc(OC)c(OC)c(O...,1,441.48,3.3,8,0,75.69,6,0.22,0.22,441.48,2,32,0.63,441.179,8,0,<rdkit.Chem.rdchem.Mol object at 0x12eb5f030>
487475,CN1C2=C(C(=O)OC2)C(C)(c3ccc(Cl)cc3)c4ccc(C)cc14,1,339.82,4.22,3,0,29.54,1,2.1,2.1,339.82,2,24,0.73,339.103,3,0,<rdkit.Chem.rdchem.Mol object at 0x12eb5f080>
487476,CCN1C2=C(C(=O)OC2)C(C)(c3cc(OC)cc(OC)c3)c4cc5O...,1,409.44,3.39,7,0,66.46,4,1.92,1.92,409.44,2,30,0.72,409.152,7,0,<rdkit.Chem.rdchem.Mol object at 0x12eb5f0d0>
487477,CCN1C2=C(C(=O)OC2)C(C)(c3cc(OC)cc(OC)c3)c4cc(O...,1,425.48,3.68,7,0,66.46,6,1.56,1.56,425.48,2,31,0.65,425.184,7,0,<rdkit.Chem.rdchem.Mol object at 0x12eb5f120>
487478,COc1ccc2c(c1)N(C)C3=C(C(=O)OC3)C2(C)c4cc(OC)cc...,1,381.43,3.28,6,0,57.23,4,1.4,1.4,381.43,2,28,0.76,381.158,6,0,<rdkit.Chem.rdchem.Mol object at 0x12eb5f170>
487479,Clc1cc(Cl)cc(c1)C2C3=C(COC3=O)Oc4cc5OCOc5cc24,1,377.18,4.06,5,0,53.99,1,3.96,3.96,377.18,2,25,0.7,375.99,5,0,<rdkit.Chem.rdchem.Mol object at 0x12eb5f1c0>
487480,Clc1cccc(c1)C2C3=C(COC3=O)Oc4cc5OCOc5cc24,1,342.73,3.4,5,0,53.99,1,3.28,3.28,342.73,2,24,0.74,342.029,5,0,<rdkit.Chem.rdchem.Mol object at 0x12eb5f210>
487481,NC(=N)Nc1ccccc1SCc2ccc(Cl)cc2,1,291.81,3.94,2,3,61.9,4,2.86,1.72,291.81,2,19,0.45,291.06,3,4,<rdkit.Chem.rdchem.Mol object at 0x12eb5f260>
487482,Clc1ccc(CSc2ccccc2N=C(NC3CCCCC3)NC4CCCCC4)cc1,1,456.1,7.46,2,2,36.42,6,7.41,5.45,456.1,2,31,0.27,455.216,3,2,<rdkit.Chem.rdchem.Mol object at 0x12eb5f2b0>
487483,Nc1ccccc1SCc2ccc(Cl)cc2,1,249.77,4.21,2,1,26.02,3,3.02,3.01,249.77,2,16,0.65,249.038,1,2,<rdkit.Chem.rdchem.Mol object at 0x12eb5f300>


In [24]:
dataset1.to_pickle("./binary_classification/data/dataset1.pkl")