## COVID-19: Computational Drug Discovery [Fingerprint Descriptor Calculation][Part 3]

This is an attempt to find an FDA-approved compound or molecule that will inhibit the function of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). In Part 3, we calculate PubChem fingerprint descriptors, which are essentially a quantitative description of the compounds in the dataset. Each descriptor gives a numerical representation of some physical, chemical, or electromechanical aspect of a given compound. For example, "nN" is the number of nitrogen atoms present in the compound, "nC" is the number of carbon atoms present, etc. The PubChem fingerprint computed here is a set of binary values (PubchemFP0 to PubchemFP880), each marking whether a particular structural feature is present (1) or absent (0) in the compound.

Some of the descriptors are less self-explanatory: the ATS descriptors measure autocorrelation between atoms separated by a given number of bonds, with respect to a certain weighting, such as mass or charge. More detailed descriptions for each descriptor can be found in a spreadsheet at http://www.yapcwsoft.com/dd/padeldescriptor/ by clicking the "1875" link towards the top of the page.
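
To make the idea of an autocorrelation descriptor concrete, here is a minimal sketch of a Moreau-Broto style sum over atom pairs a fixed number of bonds apart. The toy molecule, the use of atomic numbers as weights, and the function names are illustrative assumptions; PaDEL's own ATS values may differ (for example in how weights are scaled or hydrogens are handled).

```python
from collections import deque


def topological_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def autocorrelation(adjacency, weights, lag):
    """Sum w_i * w_j over atom pairs exactly `lag` bonds apart (each pair counted once)."""
    total = 0.0
    for i in adjacency:
        for j, d in topological_distances(adjacency, i).items():
            if d == lag and i <= j:
                total += weights[i] * weights[j]
    return total


# Toy example: heavy atoms of ethanol, C0-C1-O2, weighted by atomic number
# (a stand-in for the mass or charge weightings mentioned above).
ethanol = {0: [1], 1: [0, 2], 2: [1]}
atomic_number = {0: 6, 1: 6, 2: 8}

ats = [autocorrelation(ethanol, atomic_number, lag) for lag in range(3)]
print(ats)  # [136.0, 84.0, 48.0]
```

Each entry corresponds to one lag: lag 0 sums the squared weights, lag 1 covers bonded pairs, and lag 2 covers atoms two bonds apart.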

Finally, we will assemble the fingerprints into a dataset for subsequent model building in Part 4.

An Otsogile Onalepelo Project aka Morena!

### Import required packages

In [1]:
import pandas as pd
from padelpy import from_smiles  # used to calculate PubChem fingerprints

### Load bioactivity data

In [2]:
df = pd.read_csv('bioactivity_preprocessed_data_with_pIC50.csv')

In [3]:
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL480,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,active,369.368,3.51522,1.0,4.0,6.408935
1,CHEMBL178459,Cc1c(-c2cnccn2)ssc1=S,active,226.351,3.30451,0.0,5.0,6.677781
2,CHEMBL3545157,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,active,334.400,3.26220,0.0,5.0,7.096910
3,CHEMBL297453,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,inactive,458.375,2.23320,8.0,11.0,5.801343
4,CHEMBL4303595,O=C1C=Cc2cc(Br)ccc2C1=O,active,237.052,2.22770,0.0,2.0,7.397940
...,...,...,...,...,...,...,...,...
105,CHEMBL376488,COc1nc2ccc(Br)cc2cc1[C@@H](c1ccccc1)[C@@](O)(C...,inactive,555.516,7.13050,1.0,4.0,5.360514
106,CHEMBL154580,C=CC(=O)c1ccc2ccccc2c1,inactive,182.222,3.20850,0.0,1.0,5.906578
107,CHEMBL354349,C[n+]1c2cc(N)ccc2cc2ccc(N)cc21.[Cl-],inactive,259.740,-1.01410,2.0,2.0,5.302771
108,CHEMBL1382627,Nc1ccc(S(=O)(=O)[N-]c2ncccn2)cc1.[Ag+],active,357.143,1.45040,1.0,5.0,6.124939


In [4]:
# Keep only the canonical SMILES and molecule ID columns from the bioactivity data
cols_to_use = ['canonical_smiles','molecule_chembl_id']
df2 = df[cols_to_use]
df2

Unnamed: 0,canonical_smiles,molecule_chembl_id
0,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,CHEMBL480
1,Cc1c(-c2cnccn2)ssc1=S,CHEMBL178459
2,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,CHEMBL3545157
3,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,CHEMBL297453
4,O=C1C=Cc2cc(Br)ccc2C1=O,CHEMBL4303595
...,...,...
105,COc1nc2ccc(Br)cc2cc1[C@@H](c1ccccc1)[C@@](O)(C...,CHEMBL376488
106,C=CC(=O)c1ccc2ccccc2c1,CHEMBL154580
107,C[n+]1c2cc(N)ccc2cc2ccc(N)cc21.[Cl-],CHEMBL354349
108,Nc1ccc(S(=O)(=O)[N-]c2ncccn2)cc1.[Ag+],CHEMBL1382627


In [5]:
# Save the SMILES strings and ChEMBL IDs as a tab-separated file without header or index
df2.to_csv('molecule.smi', sep='\t', index=False, header=False)

### Calculate fingerprint descriptors using PaDEL

In [6]:
canonical_smiles_list = df2["canonical_smiles"].tolist()

In [7]:
canonical_smiles_list

['Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1',
 'Cc1c(-c2cnccn2)ssc1=S',
 'O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1',
 'O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)c(O)c1)c1cc(O)c(O)c(O)c1',
 'O=C1C=Cc2cc(Br)ccc2C1=O',
 'CC(CN1CC(=O)NC(=O)C1)N1CC(=O)NC(=O)C1',
 'Nc1ccc2cc3ccc(N)cc3nc2c1',
 'CCOC(=O)Cc1ccc(-c2ccccc2)cc1',
 'O=[N+]([O-])c1ccc(Sc2cccc[n+]2[O-])c2nonc12',
 'CCCCCCNC(=O)n1cc(F)c(=O)[nH]c1=O',
 'O=C1C(Cl)=C(N2CCOCC2)C(=O)N1c1ccc(Cl)c(Cl)c1',
 'CN1CCN(C(=O)c2ccc(-c3ccc4c(C=O)c(O)ccc4c3)s2)CC1',
 'NC(CO)C(=O)NNCc1ccc(O)c(O)c1O',
 'O=c1c(O)c(-c2cc(O)c(O)c(O)c2)oc2cc(O)cc(O)c12',
 'CCN1CCN(C(c2ccc(C(F)(F)F)cc2)c2ccc3cccnc3c2O)CC1',
 'Cn1sc(=O)n(Cc2ccccc2)c1=O',
 'Oc1cc2c(cc1O)C(c1ccccc1)CNCC2',
 'COC(=O)CC[C@H](NC(=O)[C@H](CC(=O)OC)NC(=O)OCc1ccccc1)C(=O)N[C@H](C(=O)N[C@@H](CC(=O)OC)C(=O)CF)C(C)C',
 'Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1',
 'Oc1cc2c(cc1C(c1ccc(C(F)(F)F)cc1)N1CCOCC1)OCO2',
 'Sc1nnc(Nc2ccccc2)s1',
 'Cc1ccc(C)c(-n2sc3cc(F)ccc3c2=O)c1',
 '[O-][n+]1c

In [8]:
# Calculate the PubChem fingerprint of each compound, one PaDEL call per SMILES string
pub_chem_finger_prints = []
for canonical_smile in canonical_smiles_list:
    fingerprint = from_smiles(canonical_smile, fingerprints=True, descriptors=False)
    pub_chem_finger_prints.append(fingerprint)
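
If a single compound fails during calculation, the loop above stops and all earlier results are lost. The sketch below shows one way to keep going and record which rows failed. The helper `compute_fingerprints`, the stand-in `fake_calculator`, and the assumption that a failed calculation surfaces as a `RuntimeError` are all illustrative rather than taken from padelpy's documentation.

```python
def compute_fingerprints(smiles_list, calculator):
    """Return (fingerprints, failed_indices) for a list of SMILES strings.

    `calculator` is any callable that maps one SMILES string to a dict of
    fingerprint bits, for example a thin wrapper around from_smiles.
    """
    fingerprints, failed = [], []
    for index, smiles in enumerate(smiles_list):
        try:
            fingerprints.append(calculator(smiles))
        except RuntimeError:
            # Assumption: a failed calculation is reported as RuntimeError.
            # A None placeholder keeps the list aligned with the input rows.
            fingerprints.append(None)
            failed.append(index)
    return fingerprints, failed


def fake_calculator(smiles):
    """Stand-in calculator so the sketch runs without Java or PaDEL installed."""
    if not smiles:
        raise RuntimeError("empty SMILES")
    return {"PubchemFP0": "1", "PubchemFP1": "0"}


fps, failed = compute_fingerprints(["CCO", "", "c1ccccc1"], fake_calculator)
print(failed)  # [1]
```

In this notebook the calculator could be `lambda s: from_smiles(s, fingerprints=True, descriptors=False)`. Any rows listed in `failed` would then need to be dropped from both the fingerprints and the bioactivity data before the two are combined.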

In [9]:
pub_chem_finger_prints

[OrderedDict([('PubchemFP0', '1'),
              ('PubchemFP1', '1'),
              ('PubchemFP2', '0'),
              ('PubchemFP3', '0'),
              ('PubchemFP4', '0'),
              ('PubchemFP5', '0'),
              ('PubchemFP6', '0'),
              ('PubchemFP7', '0'),
              ('PubchemFP8', '0'),
              ('PubchemFP9', '1'),
              ('PubchemFP10', '1'),
              ('PubchemFP11', '1'),
              ('PubchemFP12', '1'),
              ('PubchemFP13', '0'),
              ('PubchemFP14', '1'),
              ('PubchemFP15', '1'),
              ('PubchemFP16', '0'),
              ('PubchemFP17', '0'),
              ('PubchemFP18', '1'),
              ('PubchemFP19', '1'),
              ('PubchemFP20', '0'),
              ('PubchemFP21', '0'),
              ('PubchemFP22', '0'),
              ('PubchemFP23', '1'),
              ('PubchemFP24', '1'),
              ('PubchemFP25', '0'),
              ('PubchemFP26', '0'),
              ('PubchemFP27', '0'),
  

In [10]:
pub_chem_finger_prints_df = pd.DataFrame(pub_chem_finger_prints)
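
The output of In [9] shows that each fingerprint bit is returned as a string ('1' or '0'), so a DataFrame built directly from those dicts holds text rather than numbers. A minimal sketch of converting the bits to integers before building the DataFrame follows; the helper name `bits_to_int` and the two-bit example data are hypothetical.

```python
def bits_to_int(fingerprint):
    """Convert one fingerprint dict from '0'/'1' strings to 0/1 integers."""
    return {name: int(bit) for name, bit in fingerprint.items()}


# Two shortened fingerprints in the same shape as the In [9] output.
example_fingerprints = [
    {"PubchemFP0": "1", "PubchemFP1": "1"},
    {"PubchemFP0": "1", "PubchemFP1": "0"},
]

numeric_fingerprints = [bits_to_int(fp) for fp in example_fingerprints]
print(numeric_fingerprints[1])  # {'PubchemFP0': 1, 'PubchemFP1': 0}
```

Passing `numeric_fingerprints` to `pd.DataFrame` would then give integer columns. Since the final dataset is written to CSV, the values would most likely be parsed as numbers when read back in Part 4 anyway, so this step mainly matters if the DataFrame is used in memory.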

In [11]:
pub_chem_finger_prints_df

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
106,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
107,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
108,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### Y variable

Add pIC50 to the fingerprints DataFrame

In [12]:
pub_chem_finger_prints_df['pIC50'] = df['pIC50']
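
Assigning a column from another DataFrame aligns rows by index label, not by position, so the assignment above is only correct when both frames have the same rows in the same order. The sketch below checks this on small made-up frames before combining them; the data values and variable names are illustrative only.

```python
import pandas as pd

# Small stand-ins for the fingerprint and bioactivity DataFrames.
fingerprints = pd.DataFrame({"PubchemFP0": [1, 1, 0]})
activity = pd.DataFrame({"pIC50": [6.4, 6.7, 7.1]})

# Both checks should pass before the target column is attached.
assert len(fingerprints) == len(activity), "row counts differ"
assert fingerprints.index.equals(activity.index), "indexes differ"

fingerprints["pIC50"] = activity["pIC50"]

# Misaligned indexes would show up as missing values in the new column.
missing = int(fingerprints["pIC50"].isna().sum())
print(missing)  # 0
```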

In [14]:
pub_chem_finger_prints_df

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.408935
1,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.677781
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.096910
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.801343
4,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.397940
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.360514
106,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.906578
107,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.302771
108,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.124939


In [15]:
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL480,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,active,369.368,3.51522,1.0,4.0,6.408935
1,CHEMBL178459,Cc1c(-c2cnccn2)ssc1=S,active,226.351,3.30451,0.0,5.0,6.677781
2,CHEMBL3545157,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,active,334.400,3.26220,0.0,5.0,7.096910
3,CHEMBL297453,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,inactive,458.375,2.23320,8.0,11.0,5.801343
4,CHEMBL4303595,O=C1C=Cc2cc(Br)ccc2C1=O,active,237.052,2.22770,0.0,2.0,7.397940
...,...,...,...,...,...,...,...,...
105,CHEMBL376488,COc1nc2ccc(Br)cc2cc1[C@@H](c1ccccc1)[C@@](O)(C...,inactive,555.516,7.13050,1.0,4.0,5.360514
106,CHEMBL154580,C=CC(=O)c1ccc2ccccc2c1,inactive,182.222,3.20850,0.0,1.0,5.906578
107,CHEMBL354349,C[n+]1c2cc(N)ccc2cc2ccc(N)cc21.[Cl-],inactive,259.740,-1.01410,2.0,2.0,5.302771
108,CHEMBL1382627,Nc1ccc(S(=O)(=O)[N-]c2ncccn2)cc1.[Ag+],active,357.143,1.45040,1.0,5.0,6.124939


Alright, this step is complete. Let us export the dataset for model building.

### Export the Dataset

In [16]:
pub_chem_finger_prints_df.to_csv('replicase_polyprotein_1ab_bioactivity_data_2class_pIC50_pubchem_fp.csv', index=False)