Now that the necessary data has been retrieved and saved, we can load the pickle file directly and apply deep learning to the compounds.

Import packages

In [1]:
import numpy as np
import pandas as pd
import os
import deepchem as dc
import rdkit

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.



Import structure data

In [2]:
data = pd.read_pickle(os.path.join(os.getcwd(), 'data.pkl'))
print(data.shape)
data.head(1)

(27785, 2)


                                              smiles  category
0  COC(=O)C1=C(C)N(Cc2ccccc2)C(=NC1c3ccc(Br)cc3)N...     toxin


Convert the SMILES strings to RDKit Mol objects

In [3]:
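# Import the Chem submodule explicitly; `import rdkit` alone does not guarantee that rdkit.Chem is loaded
from rdkit import Chem
data['mol'] = data['smiles'].apply(Chem.MolFromSmiles)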

Remove rows with null values (MolFromSmiles returns None for SMILES strings it cannot parse)

In [4]:
data.dropna(axis=0, inplace=True)
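The effect of this step can be illustrated on a toy frame. The sketch below is only an illustration: the SMILES strings, the '<Mol>' placeholder and the failed row are made up, and no RDKit call is involved. It shows that a row whose 'mol' entry is None is removed by dropna. Judging by the counts reported in this notebook (27785 rows after loading, 26442 at the split), 1343 rows are dropped from the real data.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values): the 'mol' column holds
# a placeholder for parsed molecules and None where parsing failed.
toy = pd.DataFrame({
    'smiles': ['CCO', 'not_a_smiles', 'c1ccccc1'],
    'category': ['toxin', 'toxin', 'herbicidal'],
    'mol': ['<Mol>', None, '<Mol>'],
})

n_before = len(toy)
cleaned = toy.dropna(axis=0)  # drop every row that contains a missing value
n_dropped = n_before - len(cleaned)
print(f"dropped {n_dropped} of {n_before} rows")
```

Counting the rows before and after the call in this way is a cheap check that the number of unparsable structures is what you expect.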

Count the number of compounds in each category

In [14]:
data['category'].value_counts()

toxin           14544
insecticidal     4653
fungicidal       3640
herbicidal       3284
nematicidal       321
Name: category, dtype: int64
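The counts show a clear class imbalance. As a rough sketch, the snippet below converts the counts printed above into shares of the total. The numbers are copied from that output; the variable names are illustrative only.

```python
# Category counts copied from the value_counts() output above.
counts = {
    'toxin': 14544,
    'insecticidal': 4653,
    'fungicidal': 3640,
    'herbicidal': 3284,
    'nematicidal': 321,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Print each category with its share of all compounds.
for name, share in shares.items():
    print(f"{name:<13}{share:6.1%}")
```

Toxins make up roughly 55% of the compounds and nematicidal compounds only about 1.2%, which is worth keeping in mind when reading any accuracy figure later on. On the DataFrame itself, `data['category'].value_counts(normalize=True)` gives the same shares directly.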

Shuffle data

In [7]:
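# data.values may be a copy, so shuffling it in place is not guaranteed to reorder the DataFrame;
# sample all rows in random order instead
data = data.sample(frac=1).reset_index(drop=True)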
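If the shuffle needs to be reproducible, one option is a seeded row sample. The sketch below uses a made-up toy frame, and the seed value 0 is an arbitrary assumption. It checks that each SMILES string stays paired with its category after shuffling.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real data.
toy = pd.DataFrame({
    'smiles': ['CCO', 'CCN', 'CCC', 'CCCl'],
    'category': ['a', 'b', 'c', 'd'],
})

# sample(frac=1) returns all rows in random order; random_state fixes that order.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# Whole rows are moved, so each (smiles, category) pair is preserved.
pairs_before = set(zip(toy['smiles'], toy['category']))
pairs_after = set(zip(shuffled['smiles'], shuffled['category']))
print(pairs_before == pairs_after)
```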

Use the DeepChem package to featurize the data. Here we try the graph-convolution representation (ConvMolFeaturizer).

In [46]:
ft = dc.feat.ConvMolFeaturizer()

In [47]:
data['conv_mol'] = ft.featurize(data['mol'])

Split the dataset into training, validation, and test sets (70% / 20% / 10%)

In [55]:
splitter = dc.splits.splitters.RandomSplitter()
train, valid, test = splitter.split(data, frac_train=0.7, frac_valid=0.2, frac_test=0.1)

In [56]:
len(train), len(valid), len(test)

(18509, 5288, 2645)
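The three sizes follow from applying the fractions to the 26442 remaining compounds. The sketch below shows one way a random split by fractions can be computed with NumPy. `random_split_indices` is a hypothetical helper written for illustration, not DeepChem's implementation, and the seed is an arbitrary assumption.

```python
import numpy as np

def random_split_indices(n, frac_train, frac_valid, seed=0):
    """Hypothetical helper: shuffle the indices 0..n-1 and cut them by fraction."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n)
    train_cutoff = int(frac_train * n)
    valid_cutoff = int((frac_train + frac_valid) * n)
    # Whatever remains after the validation cutoff becomes the test set.
    return (shuffled[:train_cutoff],
            shuffled[train_cutoff:valid_cutoff],
            shuffled[valid_cutoff:])

# 26442 is the total number of compounds in the category counts above.
train_idx, valid_idx, test_idx = random_split_indices(26442, 0.7, 0.2)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With these cutoffs the subset sizes match the ones reported above. The resulting index arrays could then be used to select rows, for example with `data.iloc[train_idx]`.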