Data is extracted from chembl_24.db (from the [ChEMBL database](https://www.ebi.ac.uk/chembl/downloads)). We then parse the data and save it to a pickle file for use in deep learning on the structures.

Import packages

In [1]:
import numpy as np
import pandas as pd
import os
import sqlite3
import pickle

from random import shuffle

Connect to the database using the sqlite3 package

In [2]:
db = sqlite3.connect('chembl_24.db')
c = db.cursor()

Select doc_id values from the assays table in chembl_24.db, then use them to look up molregno (the unique internal ChEMBL compound identifier)

In [3]:
files = ['%toxin%', '%fungicidal%', '%nematicidal%', '%herbicidal%', '%insecticidal%']

In [4]:
molregno = dict.fromkeys(files, None)
for file in files:
    # extract doc_id from assays whose description matches the keyword
    # ('%toxin%' is the non-agrochemical class; the others are agrochemical)
    doc_id = c.execute("SELECT doc_id FROM assays WHERE description LIKE ?", (file,)).fetchall()
    doc_id = [i[0] for i in doc_id]

    # extract unique compound identifier from doc_id
    # (join the IDs by hand: str(tuple(...)) yields invalid SQL such as "(5,)" for a single ID)
    id_list = ",".join(str(i) for i in doc_id)
    molregno[file] = c.execute("SELECT molregno FROM compound_records WHERE doc_id IN (" + id_list + ")").fetchall()
    molregno[file] = [i[0] for i in molregno[file]]
    print ("%s" %file[1:-1], ":", len(molregno[file]))

toxin : 541388
fungicidal : 4678
nematicidal : 555
herbicidal : 3715
insecticidal : 5987
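
The two-step lookup above (keyword to doc_id, then doc_id to molregno) can also be written with bound parameters instead of string concatenation. The following is a minimal sketch on a toy in-memory database, not the notebook's actual method: the helper `fetch_in_chunks`, the `chunk_size` value, and the toy rows are all hypothetical. Chunking is used because SQLite limits how many `?` parameters a single statement may carry.

```python
import sqlite3


def fetch_in_chunks(cursor, sql_template, ids, chunk_size=500):
    """Run an `IN (...)` query with bound parameters, a chunk of IDs at a time.

    `sql_template` must contain one `{placeholders}` field.
    (Hypothetical helper, not part of the original notebook.)
    """
    ids = list(dict.fromkeys(ids))  # drop duplicate IDs so no row is fetched twice
    rows = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        query = sql_template.format(placeholders=placeholders)
        rows += cursor.execute(query, chunk).fetchall()
    return [row[0] for row in rows]


# Toy stand-in for chembl_24.db (assumed rows, for illustration only)
demo_db = sqlite3.connect(":memory:")
demo_c = demo_db.cursor()
demo_c.execute("CREATE TABLE assays (doc_id INTEGER, description TEXT)")
demo_c.execute("CREATE TABLE compound_records (molregno INTEGER, doc_id INTEGER)")
demo_c.executemany("INSERT INTO assays VALUES (?, ?)",
                   [(1, "Herbicidal activity against weeds"), (2, "Binding affinity")])
demo_c.executemany("INSERT INTO compound_records VALUES (?, ?)",
                   [(10, 1), (11, 1), (12, 2)])

# Step 1: keyword -> doc_id (LIKE pattern passed as a bound parameter)
demo_doc_ids = [row[0] for row in demo_c.execute(
    "SELECT doc_id FROM assays WHERE description LIKE ?", ("%herbicidal%",)).fetchall()]

# Step 2: doc_id -> molregno
demo_template = "SELECT molregno FROM compound_records WHERE doc_id IN ({placeholders})"
demo_molregno = fetch_in_chunks(demo_c, demo_template, demo_doc_ids)
print(sorted(demo_molregno))
```

In this toy case only one doc_id matches, which is exactly the situation where `str(tuple(doc_id))` would produce `(1,)` and fail to parse as SQL.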


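Discard compounds that appear in more than one category. A shared compound is removed from the earlier category in `files` and kept only in the later one.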

In [5]:
for i, file_a in enumerate(files):
    for j, file_b in enumerate(files):
        if (i >= j):
            continue
        # keep the intersection as a set so the membership test below is O(1)
        intersection = set(molregno[file_a]) & set(molregno[file_b])
        molregno[file_a] = [x for x in molregno[file_a] if x not in intersection]
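
To make the effect of the pairwise loop concrete, here is a small sketch on made-up IDs. The function name `drop_overlaps` and the toy data are hypothetical; the logic mirrors the loop above, where a shared ID is dropped from the earlier category and survives in the later one.

```python
def drop_overlaps(groups, order):
    """Return a copy of `groups` in which every shared ID survives only in
    the category that comes latest in `order`. (Hypothetical helper.)"""
    cleaned = {name: list(groups[name]) for name in order}
    for i, name_a in enumerate(order):
        for name_b in order[i + 1:]:
            shared = set(cleaned[name_a]) & set(cleaned[name_b])
            cleaned[name_a] = [x for x in cleaned[name_a] if x not in shared]
    return cleaned


# Made-up compound IDs: 3 and 4 overlap with toxin, 5 overlaps between the other two
toy_groups = {
    "toxin": [1, 2, 3, 4],
    "fungicidal": [3, 5],
    "herbicidal": [4, 5, 6],
}
toy_cleaned = drop_overlaps(toy_groups, ["toxin", "fungicidal", "herbicidal"])
print(toy_cleaned)
```

Because `'%toxin%'` comes first in `files`, any compound shared with an agrochemical category is removed from the toxin list, which keeps the non-agrochemical class free of agrochemical compounds.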

Next, look up the canonical SMILES string for each molregno compound identifier.

In [6]:
smiles = dict.fromkeys(files, None)
for file in files:
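    # join the IDs by hand: str(tuple(...)) yields invalid SQL such as "(5,)" for a single ID
    id_list = ",".join(str(i) for i in molregno[file])
    smiles[file] = c.execute("SELECT canonical_smiles FROM compound_structures WHERE molregno IN (" + id_list + ")").fetchall()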
    smiles[file] = [i[0] for i in smiles[file]]
    print ("%s" %file[1:-1], ":", len(smiles[file]))

toxin : 493363
fungicidal : 3880
nematicidal : 400
herbicidal : 3338
insecticidal : 5167


Downsample the toxin (non-agrochemical) class to 15,000 randomly chosen compounds, roughly the size of the four agrochemical classes combined (12,785)

In [7]:
shuffle(smiles['%toxin%'])
smiles['%toxin%'] = smiles["%toxin%"][:15000]
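
Since `shuffle` is unseeded, each run keeps a different subset of the toxin class. If a repeatable subset is wanted, one option is a seeded sampler, sketched below. The helper `downsample` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import random


def downsample(items, n, seed=0):
    """Return at most `n` items drawn without replacement, using a fixed seed
    so that repeated runs select the same subset. (Hypothetical helper.)"""
    if len(items) <= n:
        return list(items)
    rng = random.Random(seed)  # private generator; leaves the global random state alone
    return rng.sample(items, n)


# Placeholder strings standing in for SMILES entries
toy_pool = ["mol_%d" % i for i in range(100)]
toy_subset = downsample(toy_pool, 15, seed=42)
print(len(toy_subset))
```

With such a helper, the cell above could become `smiles['%toxin%'] = downsample(smiles['%toxin%'], 15000)`.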

Concatenate the SMILES strings from all categories into one long list, and build a parallel list holding the category name of each entry

In [8]:
canonical_smiles, category = [], []
for file in smiles:
    canonical_smiles += smiles[file]
    category += [file[1:-1]]*len(smiles[file])
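
The two lists are only useful if they stay aligned, that is, entry `k` of one describes entry `k` of the other. The sketch below repeats the same construction on a toy dictionary (made-up entries, hypothetical variable names) and checks the alignment and the per-category counts.

```python
from collections import Counter

# Toy stand-in for the `smiles` dictionary; keys carry the SQL wildcards as above
toy_smiles = {
    "%toxin%": ["CCO", "CCN"],
    "%herbicidal%": ["c1ccccc1"],
}

toy_canonical, toy_category = [], []
for key, strings in toy_smiles.items():
    toy_canonical += strings
    toy_category += [key[1:-1]] * len(strings)  # strip the leading and trailing '%'

# One (smiles, category) pair per compound
toy_rows = list(zip(toy_canonical, toy_category))
toy_counts = Counter(toy_category)
print(toy_rows)
```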

Stack the two lists column-wise into a two-column array

In [9]:
data = np.column_stack((canonical_smiles, category))

Convert the array into a pandas DataFrame

In [10]:
data = pd.DataFrame(data, columns=['smiles', 'category'])

Save the DataFrame as a pickle file

In [11]:
# to_pickle returns None, so do not assign the result back to data
data.to_pickle("./data.pkl")