# Molecules

## Property prediction task

### Task
The task is to predict a target compound’s property from its molecular structure.

Property predictors can be used to screen large collections of molecules in silico to identify candidates with high activity, and these candidates can then be validated in the lab. For COVID-19, the predictor will be applied to compounds already known to be safe (e.g. FDA-approved drugs) to screen for antiviral activity against SARS-CoV-2. The top-ranked molecules will be tested in the lab.

### Data
For all the datasets, a training pair is represented by a molecular structure (SMILES string) and an activity measurement.

E. coli: This dataset consists of 2335 pairs, with a binary activity measurement indicating E. coli growth inhibition. There are 120 molecules that inhibit E. coli growth. The size, quality, and distributional properties of this set are a good proxy for the SARS-CoV-2 screening data that will eventually be available (data).

SARS-CoV 3CLpro: This dataset consists of 290,726 pairs obtained via an assay that measures activity against the SARS-CoV 3CLpro target, which is highly homologous to the corresponding protease in SARS-CoV-2. There are 405 molecules in this dataset which are active against the 3CLpro target (raw data, processed data).
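
Both datasets are heavily imbalanced, which affects how a model should be evaluated and how the data should be split. The short calculation below uses only the counts quoted in the dataset descriptions above; the names `datasets` and `positive_rate` are illustrative and do not come from the original notebook.

```python
# Counts quoted in the dataset descriptions: (active molecules, total pairs).
datasets = {
    "E. coli": (120, 2335),
    "SARS-CoV 3CLpro": (405, 290726),
}

def positive_rate(n_active, n_total):
    """Fraction of molecules labeled active."""
    return n_active / n_total

for name, (n_active, n_total) in datasets.items():
    rate = positive_rate(n_active, n_total)
    # A classifier that always predicts "inactive" would reach accuracy 1 - rate.
    print(f"{name}: {rate:.2%} active, trivial-baseline accuracy {1 - rate:.2%}")
```

With roughly 5% and 0.14% positives respectively, plain accuracy says little about a model; a metric that reflects how well actives are ranked above inactives is likely to be more informative.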

Specific training, validation, and test splits for the above datasets are here.

### Model
Chemprop is a message passing neural network (MPNN). MPNNs are designed to operate on graph-structured objects like molecules, where each atom is represented by a node and each bond is represented by an edge. An MPNN for molecules works by first creating feature vectors for each atom and bond based on simple properties like atom type (carbon, oxygen, etc.) and bond type (single, double, etc.). Then it performs a series of “message passing” steps where a neural network sends information between neighboring atoms and bonds, thereby encoding local chemical information. After a number of these steps, the local chemical information is aggregated to form a single vector representing the entire molecule, which is then processed by a feed-forward neural network that makes the final property prediction. Optionally, the molecule vector created by the MPNN can be augmented with additional chemical information by concatenating it with a chemical fingerprint or descriptor before feeding the combined vector through the feed-forward neural network.
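
The message passing and aggregation steps described above can be sketched in plain Python. This is a toy illustration only, not Chemprop's implementation: it uses a fixed averaging rule in place of learned neural network updates and passes messages between atoms only. The example molecule (the heavy atoms of ethanol, C-C-O), the two-element feature vectors, and all function names are assumptions made for the sketch.

```python
# Toy molecule: heavy atoms of ethanol, C-C-O, as an undirected graph.
atom_features = {
    0: [1.0, 0.0],  # carbon, one-hot over (C, O)
    1: [1.0, 0.0],  # carbon
    2: [0.0, 1.0],  # oxygen
}
bonds = [(0, 1), (1, 2)]  # C-C and C-O

def neighbors(atom, bonds):
    """Atoms directly bonded to `atom`."""
    return [b if a == atom else a for a, b in bonds if atom in (a, b)]

def message_passing_step(hidden, bonds):
    """Update every atom by adding the mean of its neighbors' vectors.

    Assumes every atom has at least one bond.
    """
    updated = {}
    for atom, h in hidden.items():
        nbrs = neighbors(atom, bonds)
        message = [sum(hidden[n][i] for n in nbrs) / len(nbrs) for i in range(len(h))]
        updated[atom] = [h_i + m_i for h_i, m_i in zip(h, message)]
    return updated

def readout(hidden):
    """Aggregate atom vectors into one molecule vector by summation."""
    dim = len(next(iter(hidden.values())))
    return [sum(h[i] for h in hidden.values()) for i in range(dim)]

hidden = atom_features
for _ in range(2):  # two rounds: information travels up to two bonds
    hidden = message_passing_step(hidden, bonds)

molecule_vector = readout(hidden)
print(molecule_vector)
```

In a trained MPNN, the averaging rule would be replaced by learned update functions, and `molecule_vector` would be passed to a feed-forward network to produce the property prediction.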

### Results
See below

### References
[1] Chemprop (https://github.com/chemprop/chemprop) - GitHub repo containing code for the message passing neural network.

[2] Yang, Kevin, et al. “Analyzing Learned Molecular Representations for Property Prediction.” Journal of Chemical Information and Modeling. 59.8 (2019): 3370-3388. (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.9b00237) - Paper describing the message passing neural network applied to a range of molecular properties.

[3] Stokes, Jonathan, et al. “A Deep Learning Approach to Antibiotic Discovery.” Cell. 180.4 (2020): 688-702. (https://www.cell.com/cell/fulltext/S0092-8674(20)30102-1) - Paper describing the application of the message passing neural network to E. coli.

[4] Landrum, Greg. “RDKit: Open-source cheminformatics.” (2006): 2012. (https://www.rdkit.org/) - Open source package for computational chemistry.

In [1]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.model_selection import train_test_split

# RDKit: cheminformatics toolkit used for SMILES parsing and descriptors
from rdkit import Chem, DataStructs, RDConfig
from rdkit.Chem import AllChem, Descriptors, Draw, FragmentCatalog, PandasTools
from rdkit.Chem.Draw import IPythonConsole  # inline molecule rendering in notebooks
from rdkit.Chem.Subshape import SubshapeBuilder, SubshapeAligner, SubshapeObjects

#from utility import FeatureGenerator


In [2]:
def vaid_smile(data):
    """Report how many SMILES strings in data['smiles'] RDKit can canonicalize."""
    data_smiles = data['smiles']
    c_smiles = []
    for ds in data_smiles:
        try:
            # CanonSmiles raises if RDKit cannot parse the string
            cs = Chem.CanonSmiles(ds)
            c_smiles.append(cs)
        except Exception:
            print('Invalid SMILES:', ds)
    print('Completed with ',len(c_smiles),' out of',len(data_smiles),'.')

In [3]:

url = 'https://raw.githubusercontent.com/yangkevin2/coronavirus_data/master/data/AID1706_binarized_sars.csv'
sars = pd.read_csv(url)
url2 = 'https://raw.githubusercontent.com/yangkevin2/coronavirus_data/master/data/ecoli.csv'
ecl = pd.read_csv(url2)
url3 = 'https://gist.githubusercontent.com/GoodmanSciences/c2dd862cd38f21b0ad36b8f96b4bf1ee/raw/1d92663004489a5b6926e944c1b3d9ec5c40900e/Periodic%2520Table%2520of%2520Elements.csv'
pt = pd.read_csv(url3)

In [4]:
pt.head()

     AtomicNumber        Element Symbol  AtomicMass  NumberofNeutrons  \
0               1       Hydrogen      H       1.007                 0   
1               2         Helium     He       4.002                 2   
2               3        Lithium     Li       6.941                 4   
3               4      Beryllium     Be       9.012                 5   
4               5          Boron      B      10.811                 6   

In [6]:
vaid_smile(ecl)
vaid_smile(sars)

Completed with  2335  out of 2335 .
Invalid SMILES: CCCCC1CC1C(=O)NC2=NC=C3C(=N2)CC(CC3=O)(C)C
Completed with  290725  out of 290726 .


In [7]:
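def my_mw(data):
    """Return the exact molecular weight (rounded to 2 decimals) for each SMILES string."""
    data_mw = data['smiles']
    mw_smiles = []
    n_failed = 0
    for ds in data_mw:
        try:
            # MolFromSmiles returns None for unparsable input, which makes ExactMolWt raise
            cs = Descriptors.ExactMolWt(Chem.MolFromSmiles(ds))
            mw_smiles.append(round(cs,2))
        except Exception:
            print('Invalid SMILES:', ds)
            mw_smiles.append(np.nan)  # placeholder keeps the list aligned with the DataFrame rows
            n_failed += 1
    # Report successes against the number of input SMILES, not the output list length
    print('Completed with ',len(mw_smiles) - n_failed,' out of',len(data_mw),'.')
    return mw_smiles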

In [8]:
ecl["mw"] = my_mw(ecl)
sars["mw"] = my_mw(sars)

Completed with  2335  out of 2335 .
Completed with  290726  out of 290726 .


Split the data into training and test sets

In [137]:
sarsy = sars.activity
sarsx = sars.drop('activity',axis=1)
sarsx_train,sarsx_test,sarsy_train,sarsy_test = train_test_split(sarsx,sarsy,test_size=0.3)
sarsx_train.head()

Unnamed: 0,smiles,mw
52989,CC1=CC(=C(C=C1)NC2=CC(=NC3=NC=NN23)C)Cl,273.08
88732,CC(C)NC1=NN=C(S1)SCC2=CC=C(C=C2)C(=O)OC,323.08
30715,CCN(CCCNC(=O)CCNC(=O)CN1C=NC2=CC=CC=C2C1=O)C3=...,435.23
240909,CCOCCCNC(=O)C1=CC(=CC=C1)NC(=O)C2=C(OCCS2)C,364.15
154788,C1CCC(=CC1)CCNC(=S)NCC2=CC=CO2,264.13


In [138]:
ecly = ecl.activity
eclx = ecl.drop('activity',axis=1)
eclx_train,eclx_test,ecly_train,ecly_test = train_test_split(eclx,ecly,test_size=0.3)
eclx_train.head()

Unnamed: 0,smiles,mw
1949,COC(=O)c1cc(OC)c(OC)c(OC)c1,226.08
1933,COc1ccc2c(c1O)-c1c(OC)c(OC)cc3c1C(C2)N(C)CC3,341.16
787,CC(N)C(O)c1cccc(O)c1.O=C(O)C(O)C(O)C(=O)O,317.11
824,CC1CCC2(NC1)OC1CC3C4CCC5CC(O)CCC5(C)C4CCC3(C)C...,451.32
1721,Cn1nnc2c(C(N)=O)ncn2c1=O,194.06
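
Because actives are rare (405 of 290,726 in the SARS-CoV 3CLpro set), a purely random split can leave the test set with very few positives. One common remedy is a stratified split, which keeps the class ratio roughly equal in both parts. The plain-Python sketch below shows the idea on made-up labels; the function name `stratified_split` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 10 actives among 100 molecules (illustrative only).
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=0)

n_test_active = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_active)
```

With the notebook's own variables, scikit-learn offers the same behavior directly: `train_test_split` accepts a `stratify` argument (for example `stratify=sarsy`) and a `random_state` argument for reproducibility.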
