# Part 3 - Descriptor Calculation
Eric Kwok

This notebook is based on [Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation](https://github.com/dataprofessor/code/blob/master/python/CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb) by Chanin Nantasenamat.

In this part, we will calculate molecular descriptors that quantitatively describe compounds in the dataset. We will then save them as a dataset for model building in Part 4.

---

## Download PaDEL-Descriptor

In [2]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2020-11-28 17:34:54--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2020-11-28 17:34:55--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.196.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.196.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2020-11-28 17:34:59 (5.63 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2020-11-28 17:35:00--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (github.com)... 192.30.255.113
...
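
PaDEL-Descriptor is a Java program (the downloaded `padel.sh` launches it with `java`), so it can help to confirm that the downloads and a Java runtime are present before going further. The helper below is a minimal sketch; `missing_prerequisites` is a hypothetical name introduced here, not part of the original notebook.

```python
import shutil
from pathlib import Path


def missing_prerequisites(required_files, required_tools=("java",)):
    """Return the files and command-line tools that cannot be found."""
    # Files are checked relative to the current working directory.
    missing = [str(p) for p in map(Path, required_files) if not p.exists()]
    # Tools are looked up on the PATH, the same way the shell would find them.
    missing += [tool for tool in required_tools if shutil.which(tool) is None]
    return missing


# Files this notebook expects after the two downloads above.
problems = missing_prerequisites(["padel.zip", "padel.sh"])
print("Missing:", problems if problems else "nothing")
```

An empty list means the remaining cells have what they need to start.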

In [12]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  ...
  inflating: PaDEL-Descriptor/lib/cdk-1.4.15.jar  
  inflating: __MACOSX/PaDEL-Descriptor/lib/._cdk-1.4.15.jar  
  inflating: PaDEL-Descriptor/lib/ambit2-smarts-2.4.7-SNAPSHOT(5).jar  
  inflating: __MACOSX/PaDEL-Descriptor/lib/._ambit2-smarts-2.4.7-SNAPSHOT(5).jar  
  inflating: PaDEL-Descriptor/lib/ambit2-core-2.4.7-SNAPSHOT(1).jar  
  inflating: __MACOSX/PaDEL-Descriptor/lib/._ambit2-core-2.4.7-SNAPSHOT(1).jar  
  inflating: PaDEL-Descriptor/lib/libPaDEL-Jobs(8).jar  
  inflating: __MACOSX/PaDEL-Descriptor/lib/._libPaDEL-Jobs(8).jar  
  inflating: PaDEL-Descriptor/lib/jgrapht-0.6.0(6).jar  
  inflating: __MACOSX/PaDEL-Descriptor/lib/._jgrapht-0.6.0(6).jar  
  inflating: PaDEL-Descriptor/lib/jama(2).jar  
  inflating: __MACOSX/PaDEL-Descriptor/lib/._jama(2).jar  
  inflating: PaDEL-Descriptor/lib/jama(3).jar  
  inflating: __MACOSX/PaDEL-Descriptor/lib/._jama(3).jar  
  inflating: PaDEL-Descriptor/lib/commons-cli-1.2(1).jar  
  ...

## Load bioactivity data

In [4]:
import pandas as pd

df = pd.read_csv('influenza_a_pIC50.csv')
df

| | molecule_chembl_id | canonical_smiles | bioactivity_class | MW | LogP | NumHDonors | NumHAcceptors | pIC50 |
|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL327097 | CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO | inactive | 252.226 | 0.2740 | 4.0 | 4.0 | 2.397940 |
| 1 | CHEMBL327097 | CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO | inactive | 252.226 | 0.2740 | 4.0 | 4.0 | 2.000000 |
| 2 | CHEMBL324455 | CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-] | inactive | 240.171 | 0.9570 | 3.0 | 5.0 | 3.124939 |
| 3 | CHEMBL324455 | CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-] | inactive | 240.171 | 0.9570 | 3.0 | 5.0 | 3.000000 |
| 4 | CHEMBL321393 | CC(=O)Nc1c(OC(C)=O)cc(C(=O)O)cc1[N+](=O)[O-] | inactive | 282.208 | 1.1767 | 2.0 | 6.0 | 2.301030 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1586 | CHEMBL4286184 | C=C(C)[C@@H]1CC[C@]2(C(=O)N[C@H]3C=C(C(=O)OC)O... | inactive | 869.106 | 5.8766 | 3.0 | 12.0 | 4.000000 |
| 1587 | CHEMBL4294084 | COC(=O)C1=C[C@H](NC(=O)[C@]23CCC(C)(C)C[C@H]2C... | inactive | 758.994 | 3.2791 | 7.0 | 10.0 | 4.392545 |
| 1588 | CHEMBL4282791 | C=C(C)[C@@H]1CC[C@]2(C(=O)N[C@H]3C=C(C(=O)OC)O... | inactive | 742.995 | 4.1642 | 6.0 | 9.0 | 4.000000 |
| 1589 | CHEMBL140 | COc1cc(/C=C/C(=O)CC(=O)/C=C/c2ccc(O)c(OC)c2)ccc1O | intermediate | 368.385 | 3.3699 | 2.0 | 6.0 | 5.173925 |


In [5]:
selection = ['canonical_smiles', 'molecule_chembl_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [6]:
! head molecule.smi

CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO	CHEMBL327097
CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO	CHEMBL327097
CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-]	CHEMBL324455
CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-]	CHEMBL324455
CC(=O)Nc1c(OC(C)=O)cc(C(=O)O)cc1[N+](=O)[O-]	CHEMBL321393
CC(=O)Nc1c(OC(C)=O)cc(C(=O)O)cc1[N+](=O)[O-]	CHEMBL321393
CC(=O)Nc1ccc(C(=O)O)cc1N	CHEMBL109162
CC(=O)Nc1ccc(C(=O)O)cc1N	CHEMBL109162
CC(=O)Nc1ccc(C(=O)O)cc1[N+](=O)[O-]	CHEMBL111082
CC(=O)Nc1ccc(C(=O)O)cc1[N+](=O)[O-]	CHEMBL111082


In [8]:
! cat molecule.smi | wc -l

    1591
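
The `head` output shows that a ChEMBL ID can appear on more than one line of `molecule.smi` (one line per measurement), so the ID alone does not identify a row. If each descriptor row needs to be traced back to one specific measurement, one option is to write a name that also encodes the row position. The sketch below uses a small stand-in table; the `<row>_<ChEMBL ID>` naming scheme and the function name are assumptions made for illustration.

```python
import pandas as pd

# Tiny stand-in for the bioactivity table loaded above (its first three rows).
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL327097", "CHEMBL327097", "CHEMBL324455"],
    "canonical_smiles": [
        "CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO",
        "CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO",
        "CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-]",
    ],
})


def smi_table_with_unique_names(df):
    """Return SMILES plus a name built from the row label and the ChEMBL ID."""
    names = [f"{i}_{cid}" for i, cid in zip(df.index, df["molecule_chembl_id"])]
    return pd.DataFrame({"canonical_smiles": df["canonical_smiles"], "name": names})


smi = smi_table_with_unique_names(df_demo)
# Same tab-separated, header-less layout as molecule.smi; returned as text here
# instead of being written to disk.
smi_text = smi.to_csv(sep="\t", index=False, header=False)
print(smi_text)
```

With unique names, the `Name` column of the descriptor file can later be split on the first underscore to recover both the row position and the ChEMBL ID.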


## Calculate fingerprint descriptors
### Calculate PaDEL descriptors

In [9]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
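
The script runs the PaDEL-Descriptor JAR with a 1 GB heap, removes salts, standardizes nitro groups, and computes the fingerprint type named in the XML file for every structure file in the current directory, writing the result to `descriptors_output.csv`. To try another fingerprint type from the unzipped archive (for example `MACCSFingerprinter.xml`), only the `-descriptortypes` argument needs to change. The sketch below rebuilds the same command in Python so that part can be varied; `padel_command` and its parameter names are hypothetical, while the flags are exactly those in `padel.sh`.

```python
import shlex


def padel_command(descriptor_xml, input_dir="./", output_csv="descriptors_output.csv",
                  jar="./PaDEL-Descriptor/PaDEL-Descriptor.jar", heap="1G"):
    """Build the argument list used in padel.sh, with the variable parts exposed."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", jar,
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", descriptor_xml,
        "-dir", input_dir,
        "-file", output_csv,
    ]


cmd = padel_command("./PaDEL-Descriptor/PubchemFingerprinter.xml")
print(shlex.join(cmd))
# To actually run it from Python: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues if a path ever contains spaces.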


In [13]:
! bash padel.sh

Processing CHEMBL327097 in molecule.smi (1/1591).
Processing CHEMBL111756 in molecule.smi (13/1591).
Processing CHEMBL109939 in molecule.smi (22/1591).
Processing CHEMBL300932 in molecule.smi (29/1591).
Processing CHEMBL222813 in molecule.smi (37/1591).
Processing CHEMBL386285 in molecule.smi (41/1591).
Processing CHEMBL327097 in molecule.smi (2/1591).
...
Processing CHEMBL4279337 in molecule.smi (1583/1591). Average speed: 0.07 s/mol.
Processing CHEMBL4287269 in molecule.smi (1584/1591). Average speed: 0.07 s/mol.
Processing CHEMBL4294489 in molecule.smi (1585/1591). Average speed: 0.07 s/mol.
Processing CHEMBL4283912 in molecule.smi (1586/1591). Average speed: 0.07 s/mol.
...

## Preparing the X and Y Data Matrices
### X data matrix

In [14]:
df_X = pd.read_csv('descriptors_output.csv')
df_X

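| | Name | PubchemFP0 | PubchemFP1 | PubchemFP2 | PubchemFP3 | PubchemFP4 | PubchemFP5 | PubchemFP6 | PubchemFP7 | PubchemFP8 | ... | PubchemFP871 | PubchemFP872 | PubchemFP873 | PubchemFP874 | PubchemFP875 | PubchemFP876 | PubchemFP877 | PubchemFP878 | PubchemFP879 | PubchemFP880 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL109162 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | CHEMBL109162 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | CHEMBL324455 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | CHEMBL324455 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | CHEMBL327097 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1586 | CHEMBL4294489 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1587 | CHEMBL4283912 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1588 | CHEMBL4294084 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1589 | CHEMBL4286184 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |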
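
The `Name` column above starts with CHEMBL109162, whereas `molecule.smi` starts with CHEMBL327097, and the PaDEL log also reports molecules out of sequence. This suggests the descriptor rows are not in input order, which matters later if X and Y are paired by position. A quick way to check is to compare the two ID sequences before dropping `Name`. The sketch below uses the first few IDs shown in this notebook as stand-ins; `first_order_mismatch` is a hypothetical helper name.

```python
import pandas as pd

# First input IDs (as in molecule.smi) and first output names
# (as in descriptors_output.csv), copied from the displays above.
input_ids = pd.Series(["CHEMBL327097", "CHEMBL327097", "CHEMBL324455", "CHEMBL324455"])
output_names = pd.Series(["CHEMBL109162", "CHEMBL109162", "CHEMBL324455", "CHEMBL324455"])


def first_order_mismatch(input_ids, output_names):
    """Return the first position where the two ID sequences differ, or None."""
    a = list(input_ids)
    b = list(output_names)
    for pos, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return pos
    # Equal over the shared prefix: they match only if the lengths agree too.
    return None if len(a) == len(b) else min(len(a), len(b))


print(first_order_mismatch(input_ids, output_names))
```

On the real data the call would be `first_order_mismatch(df['molecule_chembl_id'], df_X['Name'])`; a result of `None` means the rows line up one to one.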


In [15]:
df_X = df_X.drop(columns=['Name'])
df_X

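| | PubchemFP0 | PubchemFP1 | PubchemFP2 | PubchemFP3 | PubchemFP4 | PubchemFP5 | PubchemFP6 | PubchemFP7 | PubchemFP8 | PubchemFP9 | ... | PubchemFP871 | PubchemFP872 | PubchemFP873 | PubchemFP874 | PubchemFP875 | PubchemFP876 | PubchemFP877 | PubchemFP878 | PubchemFP879 | PubchemFP880 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1586 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1587 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1588 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1589 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |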


### Y variable

In [16]:
df_Y = df['pIC50']
df_Y

0       2.397940
1       2.000000
2       3.124939
3       3.000000
4       2.301030
          ...   
1586    4.000000
1587    4.392545
1588    4.000000
1589    5.173925
1590    4.477556
Name: pIC50, Length: 1591, dtype: float64

## Combining X and Y variables

In [17]:
dataset = pd.concat([df_X, df_Y], axis=1)
dataset

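| | PubchemFP0 | PubchemFP1 | PubchemFP2 | PubchemFP3 | PubchemFP4 | PubchemFP5 | PubchemFP6 | PubchemFP7 | PubchemFP8 | PubchemFP9 | ... | PubchemFP872 | PubchemFP873 | PubchemFP874 | PubchemFP875 | PubchemFP876 | PubchemFP877 | PubchemFP878 | PubchemFP879 | PubchemFP880 | pIC50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.397940 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.000000 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.124939 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.000000 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.301030 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1586 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.000000 |
| 1587 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.392545 |
| 1588 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.000000 |
| 1589 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.173925 |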
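
A caveat on this step: `pd.concat(..., axis=1)` pairs rows by index label, which here means by position, and it never looks at the molecule IDs. Because the `Name` column of `descriptors_output.csv` appeared in a different order than `influenza_a_pIC50.csv`, some fingerprints may end up next to the pIC50 of a different compound. A safer alternative is to join on the ID. The sketch below shows the idea on small stand-in frames with made-up bit values; it assumes that rows sharing a ChEMBL ID also share a SMILES string and therefore a fingerprint, and `align_by_id` is a hypothetical helper name.

```python
import pandas as pd

# Stand-ins: bioactivity rows in input order, descriptor rows in a different order.
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL327097", "CHEMBL327097", "CHEMBL324455"],
    "pIC50": [2.397940, 2.000000, 3.124939],
})
fp_demo = pd.DataFrame({
    "Name": ["CHEMBL324455", "CHEMBL327097", "CHEMBL327097"],
    "PubchemFP0": [1, 0, 0],  # made-up bit values, for illustration only
    "PubchemFP1": [0, 1, 1],
})


def align_by_id(df_activity, df_fp, id_col="molecule_chembl_id", name_col="Name"):
    """Attach fingerprints to each activity row by ID, keeping the activity row order."""
    # One fingerprint row per ID; assumes repeated IDs carry identical fingerprints.
    unique_fp = df_fp.drop_duplicates(subset=name_col)
    merged = df_activity[[id_col, "pIC50"]].merge(
        unique_fp, left_on=id_col, right_on=name_col,
        how="left", validate="many_to_one",
    )
    fp_cols = [c for c in df_fp.columns if c != name_col]
    return merged[fp_cols + ["pIC50"]]


dataset_demo = align_by_id(df_demo, fp_demo)
print(dataset_demo)
```

On the real data this needs the descriptor table with its `Name` column still present (for example, re-read from `descriptors_output.csv`). Any ID that PaDEL failed to process would show up as `NaN` in the fingerprint columns after the left join, so a check with `isna()` afterwards is worthwhile.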


In [18]:
dataset.to_csv('influenza_a_pIC50_pubchem_fp.csv', index=False)
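
Before moving on to model building, it can be worth reading the saved file back and checking a few basic properties: the row count, the number of fingerprint columns, whether any values are missing (which `pd.concat` would introduce if X and Y had different lengths), and whether every feature is a 0/1 bit. The sketch below runs on an in-memory stand-in rather than the real file; `check_dataset` is a hypothetical helper.

```python
import io

import pandas as pd


def check_dataset(df, target="pIC50"):
    """Return basic sanity facts about a fingerprint + target table."""
    features = df.drop(columns=[target])
    return {
        "n_rows": len(df),
        "n_features": features.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "binary_features": bool(features.isin([0, 1]).all().all()),
    }


# Stand-in for the saved file, with one deliberately missing pIC50 value.
# With the real data: pd.read_csv('influenza_a_pIC50_pubchem_fp.csv')
csv_text = "PubchemFP0,PubchemFP1,pIC50\n1,1,2.39794\n1,0,\n"
report = check_dataset(pd.read_csv(io.StringIO(csv_text)))
print(report)
```

For the file written above, the expectation would be 1591 rows, no missing values, and only binary features.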