# **Computational Drug Discovery - SOAT-1: Descriptor Calculation and Dataset Preparation (Part 03)**

Khalid El Akri

[*'Chem Code Professor' YouTube channel*](http://youtube.com/@chemcodeprofessor)

In this Jupyter notebook, we build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we build a machine learning model from BindingDB bioactivity data.

In **Part 03**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for model building in Part 04.

---

## **Download PaDEL-Descriptor**

In [1]:
! curl -L -O https://github.com/chemcodeprofessor/data/raw/master/PaDel-Descriptor.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  206k    0  206k    0     0   269k      0 --:--:-- --:--:-- --:--:--  269k


In [2]:
! curl -L -O https://github.com/chemcodeprofessor/data/raw/master/PaDel-Descriptor.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  206k    0  206k    0     0   343k      0 --:--:-- --:--:-- --:--:--  343k


In [3]:
! ls -l

total 61960
-rw-r--r--@  1 akrikhalid  staff     16872 May 29 14:44 IC50.csv
-rw-r--r--   1 akrikhalid  staff   3041472 May 29 14:36 Output3_Sterol O-acyltransferase 1 inhibitors 1703.csv
drwxr-xr-x   2 akrikhalid  staff        64 May 29 14:22 PDFs
drwxr-xr-x  21 akrikhalid  staff       672 May 24 13:06 PaDel-Descriptor
-rw-r--r--@  1 akrikhalid  staff    211054 May 29 16:19 PaDel-Descriptor.sh
-rw-r--r--@  1 akrikhalid  staff    211058 May 29 16:19 PaDel-Descriptor.zip
-rw-r--r--@  1 akrikhalid  staff    129784 May 29 16:15 Sterol O-acyltransferase 1 inhibitors 1703 Part 01.ipynb
-rw-r--r--@  1 akrikhalid  staff    364534 May 29 16:19 Sterol O-acyltransferase 1 inhibitors 1703 Part 02.ipynb
-rw-r--r--@  1 akrikhalid  staff     89689 May 29 10:53 Sterol O-acyltransferase 1 inhibitors 1703 Part 03.ipynb
-rw-r--r--@  1 akrikhalid  staff     76160 May 29 11:06 Sterol O-acyltransferase 1 inhibitors 1703 Part 04.ipynb
-rw-r--r--@  1 akrikhalid  staff  110723
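
In the listing above, `PaDel-Descriptor.sh` and `PaDel-Descriptor.zip` are almost the same size (211054 and 211058 bytes). That is unexpected for a short shell script and suggests that something other than the intended files may have been saved. GitHub `/raw/` links answer with a redirect, and `curl` only follows redirects when `-L` is given. Below is a minimal sketch of a sanity check that could be run before unzipping. `looks_like_zip` is a hypothetical helper and is not part of the original notebook.

```python
import zipfile
from pathlib import Path


def looks_like_zip(path):
    """Return True if `path` exists and is a readable ZIP archive."""
    p = Path(path)
    return p.is_file() and zipfile.is_zipfile(str(p))


# Hypothetical usage with the file name used in this notebook.
if not looks_like_zip("PaDel-Descriptor.zip"):
    print("PaDel-Descriptor.zip is missing or is not a ZIP archive; "
          "consider downloading it again with `curl -L -O <url>`.")
```

If the check fails, `zipfile.ZipFile(...)` in the next section would raise `BadZipFile`, so it is cheaper to catch the problem here.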

## **Unzip PaDel-Descriptor.zip**

In [None]:
import zipfile

with zipfile.ZipFile("PaDel-Descriptor.zip", "r") as zip_ref:
    zip_ref.extractall()

In [4]:
! ls -l

total 61960
-rw-r--r--@  1 akrikhalid  staff     16872 May 29 14:44 IC50.csv
-rw-r--r--   1 akrikhalid  staff   3041472 May 29 14:36 Output3_Sterol O-acyltransferase 1 inhibitors 1703.csv
drwxr-xr-x   2 akrikhalid  staff        64 May 29 14:22 PDFs
drwxr-xr-x  21 akrikhalid  staff       672 May 24 13:06 PaDel-Descriptor
-rw-r--r--@  1 akrikhalid  staff    211054 May 29 16:19 PaDel-Descriptor.sh
-rw-r--r--@  1 akrikhalid  staff    211058 May 29 16:19 PaDel-Descriptor.zip
-rw-r--r--@  1 akrikhalid  staff    129784 May 29 16:15 Sterol O-acyltransferase 1 inhibitors 1703 Part 01.ipynb
-rw-r--r--@  1 akrikhalid  staff    364534 May 29 16:19 Sterol O-acyltransferase 1 inhibitors 1703 Part 02.ipynb
-rw-r--r--@  1 akrikhalid  staff     89689 May 29 10:53 Sterol O-acyltransferase 1 inhibitors 1703 Part 03.ipynb
-rw-r--r--@  1 akrikhalid  staff     76160 May 29 11:06 Sterol O-acyltransferase 1 inhibitors 1703 Part 04.ipynb
-rw-r--r--@  1 akrikhalid  staff  110723

## **Loading SOAT-1 bioactivity data**

Load the curated BindingDB bioactivity data that was pre-processed in Parts 01 and 02 of this machine learning project series. Here we use the **soat_1_bioa_data_preprocessed.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [5]:
import pandas as pd

In [8]:
df = pd.read_csv('soat_1_bioa_data_preprocessed.csv')

In [9]:
df

Unnamed: 0,mol_bdID,mol_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,50648002,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...,active,511.538,4.24260,0.0,8.0,8.853872
1,50648007,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...,active,466.541,4.33440,0.0,6.0,8.769551
2,50647999,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(Cc1cccc...,active,474.520,4.20190,0.0,6.0,8.585027
3,50648003,CC(C)CC(C)N1Cc2c(n(CC(=O)c3ccccc3)c3cc(-c4cccc...,active,468.557,4.43630,0.0,6.0,8.522879
4,51138741,CSc1cc(C)nc(SC)c1NC(=O)CN1CCN(CCSc2nc3ccccc3s2...,active,519.787,4.79192,1.0,9.0,8.522879
...,...,...,...,...,...,...,...,...
1698,50051759,CCCCCCCN(CCOCCSc1nc(-c2ccccc2)c(-c2ccccc2)[nH]...,active,522.759,7.24280,2.0,4.0,6.721246
1699,50004370,COc1ccc(-c2nc(SCCCCCN(Cc3ccccn3)C(=O)Nc3ccc(F)...,active,643.760,8.43080,2.0,6.0,6.721246
1700,50004374,CC(C)NC(=O)N(CCCCCSc1nc(-c2ccccc2)c(-c2ccccc2)...,active,527.738,7.06380,2.0,4.0,6.721246
1701,50004847,CCCCCCCCCCCCn1nnnc1C(NC(=O)c1cccc([N+](=O)[O-]...,active,492.624,6.02160,1.0,7.0,6.721246


In [10]:
selection = ['mol_smiles','mol_bdID']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
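
The `.smi` file written above holds one compound per line: the SMILES string, a tab, then the BindingDB ID, with no header row. Here is a small sketch of the same export on toy data, assuming the same column names. The two SMILES strings are illustrative, and the check for missing SMILES is an added precaution that is not in the original notebook.

```python
import io

import pandas as pd

# Toy stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "mol_bdID": [1, 2],
    "mol_smiles": ["CCO", "c1ccccc1"],
    "pIC50": [5.0, 6.0],
})

# Rows without a SMILES string would produce malformed lines, so drop them first.
smi = toy.dropna(subset=["mol_smiles"])[["mol_smiles", "mol_bdID"]]

# Write to an in-memory buffer instead of a file so the result can be inspected.
buffer = io.StringIO()
smi.to_csv(buffer, sep="\t", index=False, header=False)
print(buffer.getvalue())
```

The column order matters: the SMILES comes first and the identifier second, which is the layout shown in the `molecule.smi` preview.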

In [11]:
! cat molecule.smi | head -6

O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC1)C2=O)c1cccc([N+](=O)[O-])c1	50648002
O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC1)C2=O)c1ccccc1	50648007
O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(Cc1ccccc1)C2=O)c1ccccc1	50647999
CC(C)CC(C)N1Cc2c(n(CC(=O)c3ccccc3)c3cc(-c4ccccc4)nn3c2=O)C1=O	50648003
CSc1cc(C)nc(SC)c1NC(=O)CN1CCN(CCSc2nc3ccccc3s2)CC1	51138741
CN(C)c1ccc(-c2nc(SCCCCCN(CCCCCSc3nc(-c4ccccc4)c(-c4ccccc4)[nH]3)C(=O)Nc3ccc(F)cc3F)[nH]c2-c2ccc(N(C)C)cc2)cc1	50004373


In [12]:
! cat molecule.smi | wc -l

    1703


## **Calculate fingerprint descriptors**

### **Calculate PaDEL descriptors**

In [15]:
! cat PaDel-Descriptor.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
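
The script is a single `java` call. `-removesalt` and `-standardizenitro` clean up the input structures, `-fingerprints` together with `-descriptortypes` selects the PubChem fingerprint set, `-dir` points at the folder containing the input structures (here `molecule.smi`), and `-file` names the output CSV. As a sketch, the same call could be assembled from Python. `build_padel_command` is a hypothetical helper, and the paths are copied from the script above.

```python
import subprocess


def build_padel_command(jar, descriptor_xml, input_dir="./",
                        output_csv="descriptors_output.csv"):
    """Assemble the java call from PaDel-Descriptor.sh as an argument list."""
    return [
        "java", "-Xms1G", "-Xmx1G",          # JVM heap: 1 GB initial and maximum
        "-Djava.awt.headless=true",          # run without a GUI
        "-jar", jar,
        "-removesalt",                       # strip salts from the structures
        "-standardizenitro",                 # standardize nitro groups
        "-fingerprints",                     # calculate fingerprints
        "-descriptortypes", descriptor_xml,  # which fingerprint set to compute
        "-dir", input_dir,                   # where the input structures live
        "-file", output_csv,                 # where to write the results
    ]


cmd = build_padel_command(
    "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    "./PaDEL-Descriptor/PubchemFingerprinter.xml",
)
print(" ".join(cmd))

# To actually run it (requires Java and the PaDEL files to be present):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if a path contains spaces.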


In [16]:
! bash PaDel-Descriptor.sh

Processing 50648002 in molecule.smi (1/1703). 
Processing 50648007 in molecule.smi (2/1703). 
Processing 50647999 in molecule.smi (3/1703). 
Processing 50648003 in molecule.smi (4/1703). 
Processing 50004373 in molecule.smi (6/1703). Average speed: 3.24 s/mol.
Processing 51138741 in molecule.smi (5/1703). Average speed: 6.21 s/mol.
Processing 330423 in molecule.smi (7/1703). Average speed: 2.19 s/mol.
Processing 50648005 in molecule.smi (8/1703). Average speed: 1.76 s/mol.
Processing 330422 in molecule.smi (9/1703). Average speed: 1.49 s/mol.
Processing 50529973 in molecule.smi (10/1703). Average speed: 1.38 s/mol.
Processing 50084919 in molecule.smi (11/1703). Average speed: 1.29 s/mol.
Processing 330420 in molecule.smi (12/1703). Average speed: 1.18 s/mol.
Processing 50648001 in molecule.smi (13/1703). Average speed: 1.06 s/mol.
Processing 50084195 in molecule.smi (14/1703). Average speed: 0.99 s/mol.
Processing 50835199 in molecule.smi (15/1703). Average speed: 0.90 s/mol.
...

Processing 50762094 in molecule.smi (114/1703). Average speed: 0.29 s/mol.
Processing 50051819 in molecule.smi (115/1703). Average speed: 0.28 s/mol.
Processing 50762060 in molecule.smi (116/1703). Average speed: 0.28 s/mol.
Processing 50084942 in molecule.smi (118/1703). Average speed: 0.28 s/mol.
Processing 50009812 in molecule.smi (117/1703). Average speed: 0.28 s/mol.
Processing 50477995 in molecule.smi (119/1703). Average speed: 0.28 s/mol.
Processing 50084932 in molecule.smi (121/1703). Average speed: 0.28 s/mol.
Processing 50762059 in molecule.smi (120/1703). Average speed: 0.28 s/mol.
Processing 50423028 in molecule.smi (122/1703). Average speed: 0.28 s/mol.
Processing 50082538 in molecule.smi (123/1703). Average speed: 0.27 s/mol.
Processing 50046144 in molecule.smi (124/1703). Average speed: 0.27 s/mol.
Processing 50004405 in molecule.smi (125/1703). Average speed: 0.27 s/mol.
Processing 50004377 in molecule.smi (126/1703). Average speed: 0.27 s/mol.
...

Processing 50084180 in molecule.smi (224/1703). Average speed: 0.22 s/mol.
Processing 50084177 in molecule.smi (225/1703). Average speed: 0.22 s/mol.
Processing 50084176 in molecule.smi (226/1703). Average speed: 0.22 s/mol.
Processing 50051746 in molecule.smi (227/1703). Average speed: 0.22 s/mol.
Processing 50623267 in molecule.smi (228/1703). Average speed: 0.22 s/mol.
Processing 51284160 in molecule.smi (229/1703). Average speed: 0.22 s/mol.
Processing 50474039 in molecule.smi (230/1703). Average speed: 0.22 s/mol.
Processing 51138739 in molecule.smi (231/1703). Average speed: 0.22 s/mol.
Processing 50768318 in molecule.smi (232/1703). Average speed: 0.22 s/mol.
Processing 51053311 in molecule.smi (233/1703). Average speed: 0.22 s/mol.
Processing 50768300 in molecule.smi (234/1703). Average speed: 0.22 s/mol.
Processing 50004861 in molecule.smi (235/1703). Average speed: 0.22 s/mol.
Processing 50058726 in molecule.smi (236/1703). Average speed: 0.22 s/mol.
...

Processing 51125432 in molecule.smi (335/1703). Average speed: 0.20 s/mol.
Processing 51068444 in molecule.smi (337/1703). Average speed: 0.20 s/mol.
Processing 50762076 in molecule.smi (336/1703). Average speed: 0.20 s/mol.
Processing 50004380 in molecule.smi (338/1703). Average speed: 0.20 s/mol.
Processing 51053333 in molecule.smi (339/1703). Average speed: 0.20 s/mol.
Processing 50529975 in molecule.smi (340/1703). Average speed: 0.20 s/mol.
Processing 50708855 in molecule.smi (341/1703). Average speed: 0.20 s/mol.
Processing 50082530 in molecule.smi (342/1703). Average speed: 0.20 s/mol.
Processing 50053376 in molecule.smi (343/1703). Average speed: 0.20 s/mol.
Processing 50969055 in molecule.smi (344/1703). Average speed: 0.20 s/mol.
Processing 50061000 in molecule.smi (346/1703). Average speed: 0.20 s/mol.
Processing 51128140 in molecule.smi (345/1703). Average speed: 0.21 s/mol.
Processing 51052033 in molecule.smi (347/1703). Average speed: 0.20 s/mol.
...

Processing 50779801 in molecule.smi (445/1703). Average speed: 0.20 s/mol.
Processing 51055506 in molecule.smi (446/1703). Average speed: 0.20 s/mol.
Processing 51053307 in molecule.smi (447/1703). Average speed: 0.20 s/mol.
Processing 51053330 in molecule.smi (448/1703). Average speed: 0.20 s/mol.
Processing 50474841 in molecule.smi (449/1703). Average speed: 0.20 s/mol.
Processing 50768288 in molecule.smi (450/1703). Average speed: 0.20 s/mol.
Processing 50768308 in molecule.smi (451/1703). Average speed: 0.20 s/mol.
Processing 50779775 in molecule.smi (452/1703). Average speed: 0.20 s/mol.
Processing 50969059 in molecule.smi (453/1703). Average speed: 0.20 s/mol.
Processing 51125422 in molecule.smi (454/1703). Average speed: 0.20 s/mol.
Processing 51128137 in molecule.smi (455/1703). Average speed: 0.20 s/mol.
Processing 51138734 in molecule.smi (456/1703). Average speed: 0.20 s/mol.
Processing 50329418 in molecule.smi (457/1703). Average speed: 0.20 s/mol.
...

Processing 50001664 in molecule.smi (555/1703). Average speed: 0.19 s/mol.
Processing 50061012 in molecule.smi (556/1703). Average speed: 0.19 s/mol.
Processing 50061011 in molecule.smi (557/1703). Average speed: 0.19 s/mol.
Processing 50474848 in molecule.smi (558/1703). Average speed: 0.19 s/mol.
Processing 50084199 in molecule.smi (559/1703). Average speed: 0.19 s/mol.
Processing 50084203 in molecule.smi (560/1703). Average speed: 0.19 s/mol.
Processing 50107729 in molecule.smi (561/1703). Average speed: 0.19 s/mol.
Processing 50107724 in molecule.smi (562/1703). Average speed: 0.19 s/mol.
Processing 50107726 in molecule.smi (563/1703). Average speed: 0.19 s/mol.
Processing 50107713 in molecule.smi (564/1703). Average speed: 0.19 s/mol.
Processing 50566102 in molecule.smi (565/1703). Average speed: 0.19 s/mol.
Processing 50835201 in molecule.smi (566/1703). Average speed: 0.19 s/mol.
Processing 50061009 in molecule.smi (567/1703). Average speed: 0.19 s/mol.
...

Processing 51128136 in molecule.smi (665/1703). Average speed: 0.19 s/mol.
Processing 51128141 in molecule.smi (666/1703). Average speed: 0.19 s/mol.
Processing 51053312 in molecule.smi (667/1703). Average speed: 0.19 s/mol.
Processing 51068459 in molecule.smi (668/1703). Average speed: 0.19 s/mol.
Processing 50768312 in molecule.smi (669/1703). Average speed: 0.19 s/mol.
Processing 50768339 in molecule.smi (670/1703). Average speed: 0.19 s/mol.
Processing 50053377 in molecule.smi (671/1703). Average speed: 0.19 s/mol.
Processing 50004265 in molecule.smi (673/1703). Average speed: 0.19 s/mol.
Processing 51055488 in molecule.smi (672/1703). Average speed: 0.19 s/mol.
Processing 51061835 in molecule.smi (674/1703). Average speed: 0.19 s/mol.
Processing 50623313 in molecule.smi (675/1703). Average speed: 0.19 s/mol.
Processing 50058723 in molecule.smi (676/1703). Average speed: 0.19 s/mol.
Processing 50004363 in molecule.smi (677/1703). Average speed: 0.19 s/mol.
...

Processing 50623283 in molecule.smi (776/1703). Average speed: 0.20 s/mol.
Processing 50009831 in molecule.smi (777/1703). Average speed: 0.20 s/mol.
Processing 50779781 in molecule.smi (778/1703). Average speed: 0.20 s/mol.
Processing 50009830 in molecule.smi (779/1703). Average speed: 0.20 s/mol.
Processing 51346966 in molecule.smi (780/1703). Average speed: 0.20 s/mol.
Processing 50004338 in molecule.smi (781/1703). Average speed: 0.20 s/mol.
Processing 50466127 in molecule.smi (782/1703). Average speed: 0.20 s/mol.
Processing 50779868 in molecule.smi (783/1703). Average speed: 0.20 s/mol.
Processing 50004334 in molecule.smi (784/1703). Average speed: 0.20 s/mol.
Processing 50084927 in molecule.smi (785/1703). Average speed: 0.20 s/mol.
Processing 50623271 in molecule.smi (786/1703). Average speed: 0.20 s/mol.
Processing 51061830 in molecule.smi (787/1703). Average speed: 0.20 s/mol.
Processing 50623295 in molecule.smi (789/1703). Average speed: 0.20 s/mol.
...

Processing 50270330 in molecule.smi (886/1703). Average speed: 0.21 s/mol.
Processing 50779903 in molecule.smi (887/1703). Average speed: 0.21 s/mol.
Processing 50474111 in molecule.smi (888/1703). Average speed: 0.21 s/mol.
Processing 51068438 in molecule.smi (889/1703). Average speed: 0.21 s/mol.
Processing 51055484 in molecule.smi (890/1703). Average speed: 0.21 s/mol.
Processing 50004346 in molecule.smi (891/1703). Average speed: 0.21 s/mol.
Processing 50474019 in molecule.smi (892/1703). Average speed: 0.21 s/mol.
Processing 51055507 in molecule.smi (893/1703). Average speed: 0.21 s/mol.
Processing 50623259 in molecule.smi (894/1703). Average speed: 0.21 s/mol.
Processing 51055478 in molecule.smi (895/1703). Average speed: 0.21 s/mol.
Processing 50779900 in molecule.smi (896/1703). Average speed: 0.21 s/mol.
Processing 50779932 in molecule.smi (897/1703). Average speed: 0.21 s/mol.
Processing 50602933 in molecule.smi (898/1703). Average speed: 0.21 s/mol.
...

Processing 50135708 in molecule.smi (996/1703). Average speed: 0.21 s/mol.
Processing 50779823 in molecule.smi (997/1703). Average speed: 0.21 s/mol.
Processing 50779856 in molecule.smi (998/1703). Average speed: 0.21 s/mol.
Processing 50779857 in molecule.smi (999/1703). Average speed: 0.21 s/mol.
Processing 50779784 in molecule.smi (1000/1703). Average speed: 0.21 s/mol.
Processing 51064 in molecule.smi (1001/1703). Average speed: 0.21 s/mol.
Processing 51105767 in molecule.smi (1002/1703). Average speed: 0.21 s/mol.
Processing 51128120 in molecule.smi (1003/1703). Average speed: 0.21 s/mol.
Processing 51253368 in molecule.smi (1004/1703). Average speed: 0.21 s/mol.
Processing 50623286 in molecule.smi (1005/1703). Average speed: 0.21 s/mol.
Processing 51055489 in molecule.smi (1006/1703). Average speed: 0.21 s/mol.
Processing 50082542 in molecule.smi (1007/1703). Average speed: 0.21 s/mol.
Processing 50779805 in molecule.smi (1008/1703). Average speed: 0.21 s/mol.
...

Processing 50065740 in molecule.smi (1105/1703). Average speed: 0.20 s/mol.
Processing 50107717 in molecule.smi (1106/1703). Average speed: 0.20 s/mol.
Processing 50084207 in molecule.smi (1107/1703). Average speed: 0.20 s/mol.
Processing 50042607 in molecule.smi (1108/1703). Average speed: 0.20 s/mol.
Processing 50058737 in molecule.smi (1109/1703). Average speed: 0.20 s/mol.
Processing 50566101 in molecule.smi (1110/1703). Average speed: 0.20 s/mol.
Processing 50463711 in molecule.smi (1111/1703). Average speed: 0.20 s/mol.
Processing 50084174 in molecule.smi (1112/1703). Average speed: 0.20 s/mol.
Processing 50061004 in molecule.smi (1113/1703). Average speed: 0.20 s/mol.
Processing 50088292 in molecule.smi (1114/1703). Average speed: 0.20 s/mol.
Processing 50001653 in molecule.smi (1116/1703). Average speed: 0.20 s/mol.
Processing 51128119 in molecule.smi (1115/1703). Average speed: 0.20 s/mol.
Processing 50474826 in molecule.smi (1117/1703). Average speed: 0.20 s/mol.
...

Processing 50779881 in molecule.smi (1213/1703). Average speed: 0.20 s/mol.
Processing 50779887 in molecule.smi (1214/1703). Average speed: 0.20 s/mol.
Processing 50193427 in molecule.smi (1215/1703). Average speed: 0.20 s/mol.
Processing 50058738 in molecule.smi (1216/1703). Average speed: 0.20 s/mol.
Processing 50270328 in molecule.smi (1217/1703). Average speed: 0.20 s/mol.
Processing 50779923 in molecule.smi (1218/1703). Average speed: 0.20 s/mol.
Processing 50290222 in molecule.smi (1219/1703). Average speed: 0.20 s/mol.
Processing 50139070 in molecule.smi (1220/1703). Average speed: 0.20 s/mol.
Processing 50974851 in molecule.smi (1221/1703). Average speed: 0.20 s/mol.
Processing 51055468 in molecule.smi (1222/1703). Average speed: 0.20 s/mol.
Processing 50821645 in molecule.smi (1223/1703). Average speed: 0.20 s/mol.
Processing 51055499 in molecule.smi (1224/1703). Average speed: 0.20 s/mol.
Processing 51055467 in molecule.smi (1225/1703). Average speed: 0.20 s/mol.
...

Processing 50367878 in molecule.smi (1321/1703). Average speed: 0.20 s/mol.
Processing 50474836 in molecule.smi (1322/1703). Average speed: 0.20 s/mol.
Processing 50974857 in molecule.smi (1323/1703). Average speed: 0.20 s/mol.
Processing 51055465 in molecule.smi (1324/1703). Average speed: 0.20 s/mol.
Processing 51053340 in molecule.smi (1325/1703). Average speed: 0.20 s/mol.
Processing 51068446 in molecule.smi (1326/1703). Average speed: 0.20 s/mol.
Processing 50193436 in molecule.smi (1327/1703). Average speed: 0.20 s/mol.
Processing 50270325 in molecule.smi (1328/1703). Average speed: 0.20 s/mol.
Processing 50290220 in molecule.smi (1329/1703). Average speed: 0.20 s/mol.
Processing 50779930 in molecule.smi (1330/1703). Average speed: 0.20 s/mol.
Processing 50357590 in molecule.smi (1331/1703). Average speed: 0.20 s/mol.
Processing 51055515 in molecule.smi (1332/1703). Average speed: 0.20 s/mol.
Processing 50193421 in molecule.smi (1333/1703). Average speed: 0.20 s/mol.
...

Processing 330419 in molecule.smi (1429/1703). Average speed: 0.20 s/mol.
Processing 51128134 in molecule.smi (1430/1703). Average speed: 0.20 s/mol.
Processing 50051775 in molecule.smi (1431/1703). Average speed: 0.20 s/mol.
Processing 50708850 in molecule.smi (1432/1703). Average speed: 0.21 s/mol.
Processing 50821456 in molecule.smi (1433/1703). Average speed: 0.21 s/mol.
Processing 50448236 in molecule.smi (1434/1703). Average speed: 0.21 s/mol.
Processing 50477996 in molecule.smi (1435/1703). Average speed: 0.21 s/mol.
Processing 50084944 in molecule.smi (1436/1703). Average speed: 0.21 s/mol.
Processing 50107727 in molecule.smi (1437/1703). Average speed: 0.21 s/mol.
Processing 50051809 in molecule.smi (1438/1703). Average speed: 0.21 s/mol.
Processing 50051822 in molecule.smi (1439/1703). Average speed: 0.21 s/mol.
Processing 50058709 in molecule.smi (1440/1703). Average speed: 0.21 s/mol.
Processing 50058722 in molecule.smi (1441/1703). Average speed: 0.21 s/mol.
...

Processing 50107738 in molecule.smi (1537/1703). Average speed: 0.21 s/mol.
Processing 50768286 in molecule.smi (1538/1703). Average speed: 0.21 s/mol.
Processing 50084930 in molecule.smi (1539/1703). Average speed: 0.21 s/mol.
Processing 50084946 in molecule.smi (1540/1703). Average speed: 0.21 s/mol.
Processing 50084945 in molecule.smi (1541/1703). Average speed: 0.21 s/mol.
Processing 50084941 in molecule.smi (1542/1703). Average speed: 0.21 s/mol.
Processing 50602965 in molecule.smi (1543/1703). Average speed: 0.21 s/mol.
Processing 50448229 in molecule.smi (1544/1703). Average speed: 0.21 s/mol.
Processing 50768347 in molecule.smi (1545/1703). Average speed: 0.21 s/mol.
Processing 50004866 in molecule.smi (1546/1703). Average speed: 0.21 s/mol.
Processing 50768351 in molecule.smi (1547/1703). Average speed: 0.21 s/mol.
Processing 50768349 in molecule.smi (1548/1703). Average speed: 0.21 s/mol.
Processing 50051785 in molecule.smi (1549/1703). Average speed: 0.21 s/mol.
...

Processing 50065742 in molecule.smi (1645/1703). Average speed: 0.21 s/mol.
Processing 50009835 in molecule.smi (1646/1703). Average speed: 0.21 s/mol.
Processing 50051749 in molecule.smi (1647/1703). Average speed: 0.21 s/mol.
Processing 50004850 in molecule.smi (1649/1703). Average speed: 0.21 s/mol.
Processing 50004392 in molecule.smi (1648/1703). Average speed: 0.21 s/mol.
Processing 50623312 in molecule.smi (1651/1703). Average speed: 0.21 s/mol.
Processing 50030319 in molecule.smi (1650/1703). Average speed: 0.21 s/mol.
Processing 50768369 in molecule.smi (1652/1703). Average speed: 0.21 s/mol.
Processing 50779854 in molecule.smi (1653/1703). Average speed: 0.21 s/mol.
Processing 50821658 in molecule.smi (1654/1703). Average speed: 0.21 s/mol.
Processing 50762072 in molecule.smi (1655/1703). Average speed: 0.21 s/mol.
Processing 50762091 in molecule.smi (1656/1703). Average speed: 0.21 s/mol.
Processing 50270326 in molecule.smi (1657/1703). Average speed: 0.21 s/mol.
...

#### The run ends with: Descriptor calculation completed in 6 mins 0.613 secs. Average speed: 0.21 s/mol.

## **Check the descriptor output file `descriptors_output.csv`**

In [17]:
! ls -l

total 25400
-rw-r--r--@  1 akrikhalid  staff    16872 May 29 14:44 IC50.csv
-rw-r--r--   1 akrikhalid  staff  3041472 May 29 14:36 Output3_Sterol O-acyltransferase 1 inhibitors 1703.csv
drwxr-xr-x   2 akrikhalid  staff       64 May 29 14:22 PDFs
drwxr-xr-x  21 akrikhalid  staff      672 May 24 13:06 PaDel-Descriptor
-rw-r--r--@  1 akrikhalid  staff      231 May 24 10:18 PaDel-Descriptor.sh
-rw-r--r--@  1 akrikhalid  staff   211058 May 29 16:19 PaDel-Descriptor.zip
-rw-r--r--@  1 akrikhalid  staff   129784 May 29 16:15 Sterol O-acyltransferase 1 inhibitors 1703 Part 01.ipynb
-rw-r--r--@  1 akrikhalid  staff   364534 May 29 16:21 Sterol O-acyltransferase 1 inhibitors 1703 Part 02.ipynb
-rw-r--r--@  1 akrikhalid  staff   220381 May 29 16:33 Sterol O-acyltransferase 1 inhibitors 1703 Part 03.ipynb
-rw-r--r--@  1 akrikhalid  staff    76160 May 29 11:06 Sterol O-acyltransferase 1 inhibitors 1703 Part 04.ipynb
-rw-r--r--   1 akrikhalid  staff    35645 May 29 0

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [18]:
df2_X = pd.read_csv('descriptors_output.csv')

In [19]:
df2_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,50648003,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,50648007,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,50648002,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,50647999,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,51138741,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1698,50051759,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1699,50004847,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1700,50004374,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1701,50004370,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
df2_X = df2_X.drop(columns = ["Name"])

In [21]:
df2_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1698,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1699,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1700,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1701,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The pIC50 values were already computed during preprocessing, so here we only select the `pIC50` column.

In [22]:
df2_Y = df['pIC50']

In [23]:
df2_Y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 1703 entries, 0 to 1702
Series name: pIC50
Non-Null Count  Dtype  
--------------  -----  
1703 non-null   float64
dtypes: float64(1)
memory usage: 13.4 KB


In [24]:
df2_Y.describe()

count    1703.000000
mean        5.900156
std         1.141998
min         2.943095
25%         5.000000
50%         5.920819
75%         6.795880
max         8.853872
Name: pIC50, dtype: float64
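
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L. Below is a minimal sketch of the conversion, assuming IC50 values are given in nM. The unit is an assumption here; the conversion itself was done in the earlier parts and is not repeated in this notebook. `ic50_nm_to_pic50` is a hypothetical helper name.

```python
import numpy as np


def ic50_nm_to_pic50(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); the input is assumed to be in nM."""
    molar = np.asarray(ic50_nm, dtype=float) * 1e-9  # nM -> mol/L
    return -np.log10(molar)


# 1 nM and 1 uM map to pIC50 values of about 9 and 6, respectively.
print(ic50_nm_to_pic50([1.0, 1000.0]))
```

Under this definition, the maximum pIC50 of 8.853872 in the summary above corresponds to an IC50 of about 1.4 nM.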

## **Combining the X and Y variables**

In [25]:
dataset2 = pd.concat([df2_X,df2_Y], axis=1)
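
Note that `pd.concat(..., axis=1)` aligns rows on the index (here the default row number), not on the compound ID. In the outputs shown earlier, the `Name` column of `descriptors_output.csv` is not in the same order as `mol_bdID` in `df`. Row 0 is 50648003 in the descriptor file but 50648002 in `df`, so a row-number concat can attach a pIC50 value to another compound's fingerprint. Below is a minimal sketch of one way to align by ID with `merge` instead, shown on toy data. The column names follow the notebook, the values are made up, and each ID is assumed to appear exactly once in both tables.

```python
import pandas as pd

# Toy descriptor table in "PaDEL order" (IDs shuffled relative to the activity table).
desc = pd.DataFrame({
    "Name": [103, 101, 102],
    "PubchemFP0": [1, 0, 1],
    "PubchemFP1": [0, 0, 1],
})

# Toy activity table in its original order.
activity = pd.DataFrame({
    "mol_bdID": [101, 102, 103],
    "pIC50": [8.1, 7.2, 6.3],
})

# Join on the compound ID so each fingerprint row gets its own pIC50.
# validate="one_to_one" raises an error if an ID is duplicated on either side.
merged = activity.merge(
    desc, left_on="mol_bdID", right_on="Name", how="inner", validate="one_to_one"
)

# Drop the ID columns and put pIC50 last, matching the layout of dataset2.
aligned = merged.drop(columns=["mol_bdID", "Name"])
aligned = aligned[[c for c in aligned.columns if c != "pIC50"] + ["pIC50"]]
print(aligned)
```

In the notebook, this would mean merging before the `Name` column is dropped from the descriptor table.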

In [26]:
dataset2

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.853872
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.769551
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.585027
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.522879
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.522879
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1698,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.721246
1699,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.721246
1700,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.721246
1701,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.721246


In [27]:
dataset2['pIC50'].describe()

count    1703.000000
mean        5.900156
std         1.141998
min         2.943095
25%         5.000000
50%         5.920819
75%         6.795880
max         8.853872
Name: pIC50, dtype: float64

In [28]:
dataset2.to_csv('soat_1_bioa_data_preprocessed_pIC50_pubchem_fp.csv', index=False)
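
The saved file keeps the fingerprint bits in the leading columns and `pIC50` in the last one. As a sketch of how a file with this layout could be read back and split into features and target, here is a toy example that uses an in-memory CSV with made-up values rather than the real file.

```python
import io

import pandas as pd

# Toy CSV with the same layout as the saved file: fingerprint bits, then pIC50.
csv_text = "PubchemFP0,PubchemFP1,pIC50\n1,0,8.85\n1,1,6.72\n"
data = pd.read_csv(io.StringIO(csv_text))

X = data.drop(columns=["pIC50"])  # feature matrix (fingerprint bits)
y = data["pIC50"]                 # regression target
print(X.shape, y.shape)
```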

In [29]:
! ls -l

total 31344
-rw-r--r--@  1 akrikhalid  staff    16872 May 29 14:44 IC50.csv
-rw-r--r--   1 akrikhalid  staff  3041472 May 29 14:36 Output3_Sterol O-acyltransferase 1 inhibitors 1703.csv
drwxr-xr-x   2 akrikhalid  staff       64 May 29 14:22 PDFs
drwxr-xr-x  21 akrikhalid  staff      672 May 24 13:06 PaDel-Descriptor
-rw-r--r--@  1 akrikhalid  staff      231 May 24 10:18 PaDel-Descriptor.sh
-rw-r--r--@  1 akrikhalid  staff   211058 May 29 16:19 PaDel-Descriptor.zip
-rw-r--r--@  1 akrikhalid  staff   129784 May 29 16:15 Sterol O-acyltransferase 1 inhibitors 1703 Part 01.ipynb
-rw-r--r--@  1 akrikhalid  staff   364534 May 29 16:21 Sterol O-acyltransferase 1 inhibitors 1703 Part 02.ipynb
-rw-r--r--@  1 akrikhalid  staff   220381 May 29 16:33 Sterol O-acyltransferase 1 inhibitors 1703 Part 03.ipynb
-rw-r--r--@  1 akrikhalid  staff    76160 May 29 11:06 Sterol O-acyltransferase 1 inhibitors 1703 Part 04.ipynb
-rw-r--r--   1 akrikhalid  staff    35645 May 29 0