# **Drug Discovery using Machine Learning**


**COVID-19 Drug Discovery using Machine Learning**

Drug discovery for COVID-19 using machine learning (ML) in Python involves several stages, including data collection, feature engineering, model training, and evaluation. ML can be used to predict potential drug candidates by analyzing chemical structures, biological activities, and other relevant data.

**1. Data Collection:** PubChem or ChEMBL: chemical databases containing information on small molecules, their biological activities, and their interactions with target proteins.

**2. Data Preprocessing:** Once you have the data, the next step is preprocessing:

- Cleaning: Handle missing or incomplete data.
- Feature extraction: Extract useful features from molecular structures (such as SMILES strings or molecular descriptors) using cheminformatics libraries.
- Encoding: Convert molecular data into formats suitable for ML models, e.g., SMILES to molecular embeddings.

**3. Feature Engineering:** To predict drug efficacy, molecular features are extracted. You can use:

- Descriptors: Properties of molecules such as molecular weight, logP, and topological polar surface area (TPSA).

**4. Model Training:** Machine learning algorithms can be used to predict the biological activity of molecules against a specific target (e.g., SARS-CoV-2 proteins). You can use:

- Supervised learning: Train models such as Random Forest, Support Vector Machines (SVM), or XGBoost.
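
The encoding step in the preprocessing list can be illustrated with a minimal sketch that turns SMILES strings into fixed-length integer sequences. This is only one simple option, not the method used later in this notebook. The helper names (`tokenize_smiles`, `build_vocab`, `encode`) and the token pattern are assumptions made for illustration; the pattern only treats `Cl`, `Br`, and bracketed atoms as multi-character tokens.

```python
import re

# Hypothetical, simplified tokenizer: bracketed atoms, Cl and Br are single
# tokens; every other character is its own token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|.")

def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)

def build_vocab(smiles_list):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize_smiles(s)})
    return {tok: i + 1 for i, tok in enumerate(tokens)}

def encode(smiles, vocab, max_len):
    """Integer-encode one SMILES string, truncated or right-padded to max_len."""
    ids = [vocab[tok] for tok in tokenize_smiles(smiles)][:max_len]
    return ids + [0] * (max_len - len(ids))

# Toy molecules chosen for illustration only.
smiles_list = ["CCO", "ClC1=CC=CC=C1", "C[NH3+]"]
vocab = build_vocab(smiles_list)
encoded = [encode(s, vocab, max_len=16) for s in smiles_list]

print(tokenize_smiles("ClC1=CC=CC=C1"))
print(encoded[0])
```

Integer sequences like these are typically fed to an embedding layer; descriptor tables such as the one used below avoid this step entirely.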

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
df=pd.read_csv("/content/Covid drug bioactive properties.csv")
df.head()

Unnamed: 0,SMILES,MolecularWeight,XLogP,ExactMass,MonoisotopicMass,TPSA,Complexity,Charge,HBondDonorCount,HBondAcceptorCount,...,FeatureAcceptorCount3D,FeatureDonorCount3D,FeatureAnionCount3D,FeatureCationCount3D,FeatureRingCount3D,FeatureHydrophobeCount3D,ConformerModelRMSD3D,EffectiveRotorCount3D,ConformerCount3D,pIC50
0,ClC1=CC(NC(=O)CSC2=NC=CC(=N2)C2=CSC(=N2)C2=CC=...,473.4,5.6,471.998609,471.998609,121.0,559,0,1,6,...,3,1,0,1,4,0,1.0,7.0,10,-0.477121255
1,CN1N=C(C=C1C(F)(F)F)C1=CC=C(S1)C1=CC=NC(SCC(=O...,510.0,4.9,509.035865,509.035865,126.0,670,0,1,9,...,3,1,0,1,4,0,1.2,8.0,10,-1
2,CSC1=C(C(C)=C(S1)C1=NC(C)=CS1)C1=CC=NC(SCC(=O)...,519.1,6.3,518.013024,518.013024,175.0,627,0,1,8,...,3,1,0,1,4,1,1.0,8.0,10,-1.041392685
3,CSC1=C(C(C)=C(S1)C1=NC(C)=CS1)C1=CC=NC(SCC(=O)...,519.1,6.3,518.013024,518.013024,175.0,635,0,1,8,...,3,1,0,1,4,1,1.2,8.0,10,BLINDED
4,CC1=NC(=CS1)C1=NC(=CS1)C1=NC(SCC(=O)NC2=CC=C(C...,460.0,4.4,459.004901,459.004901,162.0,554,0,1,8,...,4,1,0,1,4,0,1.0,7.0,10,-1.146128036


In [None]:
df.columns

Index(['SMILES', 'MolecularWeight', 'XLogP', 'ExactMass', 'MonoisotopicMass',
       'TPSA', 'Complexity', 'Charge', 'HBondDonorCount', 'HBondAcceptorCount',
       'RotatableBondCount', 'HeavyAtomCount', 'IsotopeAtomCount',
       'AtomStereoCount', 'DefinedAtomStereoCount', 'UndefinedAtomStereoCount',
       'BondStereoCount', 'DefinedBondStereoCount', 'CovalentUnitCount',
       'Volume3D', 'XStericQuadrupole3D', 'YStericQuadrupole3D',
       'ZStericQuadrupole3D', 'FeatureCount3D', 'FeatureAcceptorCount3D',
       'FeatureDonorCount3D', 'FeatureAnionCount3D', 'FeatureCationCount3D',
       'FeatureRingCount3D', 'FeatureHydrophobeCount3D',
       'ConformerModelRMSD3D', 'EffectiveRotorCount3D', 'ConformerCount3D',
       'pIC50'],
      dtype='object')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 34 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   SMILES                    100 non-null    object 
 1   MolecularWeight           100 non-null    float64
 2   XLogP                     100 non-null    float64
 3   ExactMass                 100 non-null    float64
 4   MonoisotopicMass          100 non-null    float64
 5   TPSA                      100 non-null    float64
 6   Complexity                100 non-null    int64  
 7   Charge                    100 non-null    int64  
 8   HBondDonorCount           100 non-null    int64  
 9   HBondAcceptorCount        100 non-null    int64  
 10  RotatableBondCount        100 non-null    int64  
 11  HeavyAtomCount            100 non-null    int64  
 12  IsotopeAtomCount          100 non-null    int64  
 13  AtomStereoCount           100 non-null    int64  
 14  DefinedAtom

In [None]:
df['pIC50'].unique()

# 'BLINDED' is a non-numeric placeholder among the pIC50 values; those rows are removed below.

array(['-0.477121255', '-1', '-1.041392685', 'BLINDED', '-1.146128036',
       '-1.176091259', '-1.477121255', '-1.602059991', '-1.653212514',
       '-1.77815125', '-2', '-2.301029996', '-2.397940009',
       '-2.477121255', '-2.544068044', '-2.602059991', '-2.698970004',
       '-1.394451681', '-1.324282455', '-2.35545152', '-1.587710965',
       '-1.158362492', '0.522878745', '0.045757491', '-0.77815125',
       '-1.079181246', '-1.113943352', '-1.204119983', '-1.397940009',
       '-0.698970004', '-1.255272505', '-1.301029996', '1.200659451',
       '1.22184875', '0.681936665', '0.301029996', '0.785156152',
       '0.156767222', '0.477555766', '0.164943898', '0.906578315',
       '0.966576245', '-1.501196242', '-1.50623436', '-1.542949849',
       '-1.003029471', '-1.710371264', '-1.102433706', '-0.071882007',
       '-0.352182518', '-0.633468456', '-1.072984745', '-1.827369273',
       '-0.996073654', '-1.14176323', '-0.017033339', '-1.2509077',
       '-0.450249108', '-1.91860691

In [None]:
# Count the rows whose pIC50 is the 'BLINDED' placeholder
(df['pIC50'] == 'BLINDED').sum()

9

In [None]:
df = df[df['pIC50']!='BLINDED']
df

Unnamed: 0,SMILES,MolecularWeight,XLogP,ExactMass,MonoisotopicMass,TPSA,Complexity,Charge,HBondDonorCount,HBondAcceptorCount,...,FeatureAcceptorCount3D,FeatureDonorCount3D,FeatureAnionCount3D,FeatureCationCount3D,FeatureRingCount3D,FeatureHydrophobeCount3D,ConformerModelRMSD3D,EffectiveRotorCount3D,ConformerCount3D,pIC50
0,ClC1=CC(NC(=O)CSC2=NC=CC(=N2)C2=CSC(=N2)C2=CC=...,473.4,5.6,471.998609,471.998609,121.0,559,0,1,6,...,3,1,0,1,4,0,1.0,7.0,10,-0.477121255
1,CN1N=C(C=C1C(F)(F)F)C1=CC=C(S1)C1=CC=NC(SCC(=O...,510.0,4.9,509.035865,509.035865,126.0,670,0,1,9,...,3,1,0,1,4,0,1.2,8.0,10,-1
2,CSC1=C(C(C)=C(S1)C1=NC(C)=CS1)C1=CC=NC(SCC(=O)...,519.1,6.3,518.013024,518.013024,175.0,627,0,1,8,...,3,1,0,1,4,1,1.0,8.0,10,-1.041392685
4,CC1=NC(=CS1)C1=NC(=CS1)C1=NC(SCC(=O)NC2=CC=C(C...,460.0,4.4,459.004901,459.004901,162.0,554,0,1,8,...,4,1,0,1,4,0,1.0,7.0,10,-1.146128036
5,ClC1=CC=C(NC(=O)CSC2=NC=CC(=N2)C2=CC(=NO2)C2=C...,422.9,4.4,422.060425,422.060425,106.0,529,0,1,6,...,3,1,0,1,4,0,1.0,7.0,10,-1.176091259
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,IC1=CC=C2N(CC3=CC4=CC=CC=C4S3)C(=O)C(=O)C2=C1,419.2,4.1,418.947700,418.947700,65.6,485,0,0,3,...,2,0,0,0,4,1,0.6,2.2,10,0.022276395
96,ClC1=C2C(=O)C(=O)N(CC3=CC4=CC=CC=C4S3)C2=CC=C1,327.8,4.1,327.012077,327.012077,65.6,485,0,0,3,...,2,0,0,0,4,0,0.6,2.2,10,-1.049218023
97,IC1=CC=C2N(C\C=C\C3=CC4=CC=CC=C4S3)C(=O)C(=O)C...,445.3,4.7,444.963350,444.963350,65.6,552,0,0,3,...,2,0,0,0,4,1,0.6,3.2,10,-1.371067862
98,ClC1=CC=C(NC(=O)C2=CC=C(CN3C(=O)C(=O)C4=CC(I)=...,522.7,4.6,521.930190,521.930190,94.7,643,0,1,4,...,3,1,0,0,4,1,0.8,5.2,10,-1.099335278
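
Filtering out the 'BLINDED' rows leaves `pIC50` as a column of strings (dtype `object`), which is why it is missing from numeric summaries such as `df.describe()`. A minimal sketch of one way to make the column numeric is shown here on a small toy frame; the toy values are copied from the output above and the frame name `demo` is hypothetical.

```python
import pandas as pd

# Toy frame mimicking the situation above: numeric strings plus a 'BLINDED' marker.
demo = pd.DataFrame({"pIC50": ["-0.477121255", "-1", "BLINDED", "0.522878745"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
demo["pIC50"] = pd.to_numeric(demo["pIC50"], errors="coerce")

# Drop the rows that could not be parsed and renumber the index.
demo = demo.dropna(subset=["pIC50"]).reset_index(drop=True)

print(demo.dtypes)
print(demo)
```

With this approach the explicit comparison against 'BLINDED' is not needed, and any other non-numeric marker would be caught as well.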


In [None]:
df.describe()

Unnamed: 0,MolecularWeight,XLogP,ExactMass,MonoisotopicMass,TPSA,Complexity,Charge,HBondDonorCount,HBondAcceptorCount,RotatableBondCount,...,FeatureCount3D,FeatureAcceptorCount3D,FeatureDonorCount3D,FeatureAnionCount3D,FeatureCationCount3D,FeatureRingCount3D,FeatureHydrophobeCount3D,ConformerModelRMSD3D,EffectiveRotorCount3D,ConformerCount3D
count,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,...,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0
mean,376.399231,3.186813,375.865712,375.821811,99.848352,571.582418,0.0,0.67033,5.648352,3.912088,...,8.395604,3.714286,0.648352,0.230769,0.307692,3.230769,0.263736,0.795604,4.945055,8.879121
std,75.117761,1.564339,74.989076,74.916248,32.142334,149.089836,0.0,0.789584,1.695573,2.02566,...,1.679147,1.185896,0.779938,0.496139,0.509734,0.857346,0.490695,0.217006,2.250149,2.678454
min,212.3,-0.6,212.007805,212.007805,34.1,242.0,0.0,0.0,2.0,0.0,...,5.0,1.0,0.0,0.0,0.0,2.0,0.0,0.6,0.0,1.0
25%,313.0,2.0,312.509024,312.509024,83.1,472.0,0.0,0.0,5.0,2.0,...,7.0,3.0,0.0,0.0,0.0,3.0,0.0,0.6,3.2,10.0
50%,372.2,3.4,370.96156,370.96156,96.9,554.0,0.0,0.0,6.0,4.0,...,9.0,4.0,0.0,0.0,0.0,3.0,0.0,0.8,5.0,10.0
75%,427.5,4.1,427.132214,427.132214,120.0,669.5,0.0,1.0,6.0,5.5,...,9.0,4.0,1.0,0.0,1.0,4.0,0.0,1.0,7.0,10.0
max,565.0,7.3,563.814325,561.817275,197.0,960.0,0.0,3.0,10.0,9.0,...,13.0,7.0,3.0,2.0,2.0,6.0,2.0,1.4,10.2,10.0


In [None]:
df = df.dropna(axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 91 entries, 0 to 99
Data columns (total 34 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   SMILES                    91 non-null     object 
 1   MolecularWeight           91 non-null     float64
 2   XLogP                     91 non-null     float64
 3   ExactMass                 91 non-null     float64
 4   MonoisotopicMass          91 non-null     float64
 5   TPSA                      91 non-null     float64
 6   Complexity                91 non-null     int64  
 7   Charge                    91 non-null     int64  
 8   HBondDonorCount           91 non-null     int64  
 9   HBondAcceptorCount        91 non-null     int64  
 10  RotatableBondCount        91 non-null     int64  
 11  HeavyAtomCount            91 non-null     int64  
 12  IsotopeAtomCount          91 non-null     int64  
 13  AtomStereoCount           91 non-null     int64  
 14  DefinedAtomStereo

**Exploratory Data Analysis (EDA)**


In [None]:
df.describe().T

# Charge, IsotopeAtomCount, DefinedAtomStereoCount, UndefinedBondStereoCount and CovalentUnitCount
# are candidates for removal (Charge, for example, is 0 in every row).

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
MolecularWeight,91.0,376.399231,75.117761,212.3,313.0,372.2,427.5,565.0
XLogP,91.0,3.186813,1.564339,-0.6,2.0,3.4,4.1,7.3
ExactMass,91.0,375.865712,74.989076,212.007805,312.509024,370.96156,427.132214,563.814325
MonoisotopicMass,91.0,375.821811,74.916248,212.007805,312.509024,370.96156,427.132214,561.817275
TPSA,91.0,99.848352,32.142334,34.1,83.1,96.9,120.0,197.0
Complexity,91.0,571.582418,149.089836,242.0,472.0,554.0,669.5,960.0
Charge,91.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
HBondDonorCount,91.0,0.67033,0.789584,0.0,0.0,0.0,1.0,3.0
HBondAcceptorCount,91.0,5.648352,1.695573,2.0,5.0,6.0,6.0,10.0
RotatableBondCount,91.0,3.912088,2.02566,0.0,2.0,4.0,5.5,9.0
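
Columns such as `Charge`, whose standard deviation is 0, can also be found programmatically instead of by reading the summary table. The following is a minimal sketch on a toy frame; the values (in particular those for `CovalentUnitCount`) are made up for illustration and the names `demo`, `constant_cols`, and `demo_reduced` are hypothetical.

```python
import pandas as pd

# Toy frame: two informative columns and two constant ones (values are illustrative).
demo = pd.DataFrame({
    "MolecularWeight": [473.4, 510.0, 519.1],
    "Charge": [0, 0, 0],
    "CovalentUnitCount": [1, 1, 1],
    "TPSA": [121.0, 126.0, 175.0],
})

# A column with a single distinct value carries no information for a model.
constant_cols = [col for col in demo.columns if demo[col].nunique(dropna=False) <= 1]
demo_reduced = demo.drop(columns=constant_cols)

print(constant_cols)
print(demo_reduced.columns.tolist())
```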


In [None]:
# Check the existing columns in the DataFrame
print(df.columns)

# Columns flagged above as uninformative
columns_to_drop = ['Charge', 'IsotopeAtomCount', 'DefinedAtomStereoCount', 'CovalentUnitCount']

# Keep only the names that actually exist in df, so that drop() cannot raise a KeyError
columns_to_drop = [col for col in columns_to_drop if col in df.columns]

# Drop the filtered list of columns
df = df.drop(columns=columns_to_drop)

df.shape

Index(['SMILES', 'MolecularWeight', 'XLogP', 'ExactMass', 'MonoisotopicMass',
       'TPSA', 'Complexity', 'HBondDonorCount', 'HBondAcceptorCount',
       'RotatableBondCount', 'HeavyAtomCount', 'AtomStereoCount',
       'UndefinedAtomStereoCount', 'BondStereoCount', 'DefinedBondStereoCount',
       'Volume3D', 'XStericQuadrupole3D', 'YStericQuadrupole3D',
       'ZStericQuadrupole3D', 'FeatureCount3D', 'FeatureAcceptorCount3D',
       'FeatureDonorCount3D', 'FeatureAnionCount3D', 'FeatureCationCount3D',
       'FeatureRingCount3D', 'FeatureHydrophobeCount3D',
       'ConformerModelRMSD3D', 'EffectiveRotorCount3D', 'ConformerCount3D',
       'pIC50'],
      dtype='object')


(91, 30)

In [None]:
# SMILES is a text representation of the structure, not a numeric feature
df = df.drop(columns=['SMILES'])

In [None]:
# Correlation matrix of the remaining columns, rounded to 2 decimals
corrmat = df.corr()
corrmat = corrmat.round(2)
corrmat

Unnamed: 0,MolecularWeight,XLogP,ExactMass,MonoisotopicMass,TPSA,Complexity,HBondDonorCount,HBondAcceptorCount,RotatableBondCount,HeavyAtomCount,...,FeatureAcceptorCount3D,FeatureDonorCount3D,FeatureAnionCount3D,FeatureCationCount3D,FeatureRingCount3D,FeatureHydrophobeCount3D,ConformerModelRMSD3D,EffectiveRotorCount3D,ConformerCount3D,pIC50
MolecularWeight,1.0,0.47,1.0,1.0,0.44,0.71,0.34,0.47,0.67,0.89,...,0.12,0.35,-0.17,0.28,0.41,0.21,0.7,0.72,0.28,-0.25
XLogP,0.47,1.0,0.47,0.47,-0.0,0.02,0.03,-0.06,0.36,0.35,...,-0.46,0.04,-0.03,0.13,0.13,0.39,0.34,0.29,0.01,0.04
ExactMass,1.0,0.47,1.0,1.0,0.44,0.71,0.34,0.47,0.67,0.89,...,0.12,0.35,-0.17,0.28,0.42,0.21,0.7,0.72,0.28,-0.25
MonoisotopicMass,1.0,0.47,1.0,1.0,0.44,0.71,0.34,0.47,0.67,0.89,...,0.12,0.35,-0.17,0.28,0.42,0.21,0.7,0.72,0.28,-0.25
TPSA,0.44,-0.0,0.44,0.44,1.0,0.39,0.58,0.76,0.56,0.46,...,0.47,0.59,0.27,0.25,-0.14,-0.03,0.53,0.63,0.12,-0.3
Complexity,0.71,0.02,0.71,0.71,0.39,1.0,0.33,0.43,0.42,0.85,...,0.42,0.35,-0.22,0.05,0.5,-0.03,0.57,0.53,0.16,-0.34
HBondDonorCount,0.34,0.03,0.34,0.34,0.58,0.33,1.0,0.41,0.59,0.37,...,0.02,0.98,-0.2,0.2,-0.17,0.2,0.56,0.59,0.15,-0.49
HBondAcceptorCount,0.47,-0.06,0.47,0.47,0.76,0.43,0.41,1.0,0.51,0.54,...,0.5,0.43,0.2,0.42,-0.09,-0.31,0.53,0.63,0.18,-0.2
RotatableBondCount,0.67,0.36,0.67,0.67,0.56,0.42,0.59,0.51,1.0,0.7,...,0.06,0.61,-0.22,0.4,0.02,0.09,0.86,0.95,0.51,-0.32
HeavyAtomCount,0.89,0.35,0.89,0.89,0.46,0.85,0.37,0.54,0.7,1.0,...,0.25,0.4,-0.22,0.33,0.53,0.02,0.8,0.78,0.3,-0.33
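
The matrix shows that `MolecularWeight`, `ExactMass`, and `MonoisotopicMass` are almost perfectly correlated, so they carry nearly the same information. One possible way to flag such redundant columns is sketched below on synthetic data. The 0.95 threshold and the variable names are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the descriptor table (values are random, not real molecules).
rng = np.random.default_rng(0)
mw = rng.uniform(200, 600, size=50)
demo = pd.DataFrame({
    "MolecularWeight": mw,
    "ExactMass": mw - 0.5,  # perfectly collinear with MolecularWeight by construction
    "TPSA": rng.uniform(30, 200, size=50),
})

corr = demo.corr().abs()

# Keep only the upper triangle so that each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # arbitrary cut-off for this sketch
redundant = [col for col in upper.columns if (upper[col] > threshold).any()]
demo_reduced = demo.drop(columns=redundant)

print(redundant)
```

For each highly correlated pair, this keeps the column that appears first and drops the later one.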


In [None]:
import plotly.express as px
#import plotly.io as pio
#pio.renderers.default = "notebook_connected"

fig = px.imshow(corrmat, text_auto=True, aspect="auto")
fig.show()

**Machine Learning: Data Split**

In [None]:
target = df['pIC50']
features = df.drop(['pIC50'], axis=1)
target.shape, features.shape

((91,), (91, 28))

In [None]:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()

features_scaled = min_max_scaler.fit_transform(features)
features_scaled

array([[0.7402892 , 0.78481013, 0.73901645, ..., 0.5       , 0.68627451,
        1.        ],
       [0.84406011, 0.69620253, 0.84429379, ..., 0.75      , 0.78431373,
        1.        ],
       [0.86986107, 0.87341772, 0.86981111, ..., 0.5       , 0.78431373,
        1.        ],
       ...,
       [0.66061809, 0.67088608, 0.66216949, ..., 0.        , 0.31372549,
        1.        ],
       [0.88006805, 0.65822785, 0.88094554, ..., 0.25      , 0.50980392,
        1.        ],
       [0.75985257, 0.50632911, 0.76176148, ..., 0.25      , 0.52941176,
        1.        ]])
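
The scaler above is fit on all 91 rows before any train/test split, so the test rows influence the min/max values used for scaling. A common alternative is to put the scaler and the model in one pipeline, which fits the scaler on the training rows only. The sketch below uses synthetic data; the names `X_demo`, `y_demo`, and `pipe` are hypothetical and it is not the code used for the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic descriptors and target, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 500, size=(80, 3))
y_demo = 0.01 * X_demo[:, 0] - 0.02 * X_demo[:, 1] + rng.normal(scale=0.1, size=80)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# fit() learns the min/max from the training rows only; the same values
# are reused when the test rows are transformed inside score()/predict().
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)

print(round(test_r2, 3))
```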

In [None]:
# Regression target for the models below: XLogP (the computed logP descriptor)
y = df['XLogP']
y

Unnamed: 0,XLogP
0,5.6
1,4.9
2,6.3
4,4.4
5,4.4
...,...
95,4.1
96,4.1
97,4.7
98,4.6


In [None]:
# All remaining columns, including pIC50, are used as predictors (unscaled)
X = df.drop('XLogP', axis=1)
X

Unnamed: 0,MolecularWeight,ExactMass,MonoisotopicMass,TPSA,Complexity,HBondDonorCount,HBondAcceptorCount,RotatableBondCount,HeavyAtomCount,AtomStereoCount,...,FeatureAcceptorCount3D,FeatureDonorCount3D,FeatureAnionCount3D,FeatureCationCount3D,FeatureRingCount3D,FeatureHydrophobeCount3D,ConformerModelRMSD3D,EffectiveRotorCount3D,ConformerCount3D,pIC50
0,473.4,471.998609,471.998609,121.0,559,1,6,6,30,0,...,3,1,0,1,4,0,1.0,7.0,10,-0.477121255
1,510.0,509.035865,509.035865,126.0,670,1,9,6,33,0,...,3,1,0,1,4,0,1.2,8.0,10,-1
2,519.1,518.013024,518.013024,175.0,627,1,8,7,32,0,...,3,1,0,1,4,1,1.0,8.0,10,-1.041392685
4,460.0,459.004901,459.004901,162.0,554,1,8,6,29,0,...,4,1,0,1,4,0,1.0,7.0,10,-1.146128036
5,422.9,422.060425,422.060425,106.0,529,1,6,6,29,0,...,3,1,0,1,4,0,1.0,7.0,10,-1.176091259
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,419.2,418.947700,418.947700,65.6,485,0,3,2,22,0,...,2,0,0,0,4,1,0.6,2.2,10,0.022276395
96,327.8,327.012077,327.012077,65.6,485,0,3,2,22,0,...,2,0,0,0,4,0,0.6,2.2,10,-1.049218023
97,445.3,444.963350,444.963350,65.6,552,0,3,3,24,0,...,2,0,0,0,4,1,0.6,3.2,10,-1.371067862
98,522.7,521.930190,521.930190,94.7,643,1,4,4,28,0,...,3,1,0,0,4,1,0.8,5.2,10,-1.099335278


In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

In [None]:
X_train

Unnamed: 0,MolecularWeight,ExactMass,MonoisotopicMass,TPSA,Complexity,HBondDonorCount,HBondAcceptorCount,RotatableBondCount,HeavyAtomCount,AtomStereoCount,...,FeatureAcceptorCount3D,FeatureDonorCount3D,FeatureAnionCount3D,FeatureCationCount3D,FeatureRingCount3D,FeatureHydrophobeCount3D,ConformerModelRMSD3D,EffectiveRotorCount3D,ConformerCount3D,pIC50
85,512.60,512.151826,512.151826,99.3,960,0,7,5,37,0,...,5,0,0,1,6,0,1.2,6.4,10,-0.741939078
21,357.40,357.114713,357.114713,97.7,486,0,6,5,25,1,...,4,0,0,1,3,0,0.8,5.6,10,-2.477121255
25,393.50,393.151098,393.151098,89.4,474,1,5,7,28,0,...,3,1,0,1,3,1,1.0,8.0,10,-2.602059991
7,457.30,456.021452,456.021452,106.0,567,1,6,6,30,0,...,3,1,0,1,4,0,1.0,7.0,10,-1.176091259
4,460.00,459.004901,459.004901,162.0,554,1,8,6,29,0,...,4,1,0,1,4,0,1.0,7.0,10,-1.146128036
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,448.50,448.145678,448.145678,83.1,829,0,5,4,32,0,...,4,0,0,0,5,0,0.8,5.4,10,-0.227886705
96,327.80,327.012077,327.012077,65.6,485,0,3,2,22,0,...,2,0,0,0,4,0,0.6,2.2,10,-1.049218023
75,308.35,308.083078,308.083078,91.9,554,1,5,2,21,1,...,4,1,0,0,3,0,0.6,3.4,10,-0.352182518
26,379.80,379.016924,379.016924,80.2,443,1,8,4,24,0,...,2,1,0,1,2,0,0.8,6.0,10,-2.698970004


In [None]:
X_test

Unnamed: 0,MolecularWeight,ExactMass,MonoisotopicMass,TPSA,Complexity,HBondDonorCount,HBondAcceptorCount,RotatableBondCount,HeavyAtomCount,AtomStereoCount,...,FeatureAcceptorCount3D,FeatureDonorCount3D,FeatureAnionCount3D,FeatureCationCount3D,FeatureRingCount3D,FeatureHydrophobeCount3D,ConformerModelRMSD3D,EffectiveRotorCount3D,ConformerCount3D,pIC50
63,278.65,278.009434,278.009434,85.0,347,0,5,3,19,0,...,4,0,1,0,2,0,0.6,4.0,10,0.164943898
68,475.5,475.141321,475.141321,123.0,810,1,9,7,33,0,...,7,1,0,1,4,0,1.2,8.4,10,-1.50623436
29,338.4,338.115424,338.115424,73.6,613,0,5,2,25,1,...,4,0,0,0,4,0,0.6,2.8,8,-1.324282455
32,278.3,278.094294,278.094294,43.4,533,0,3,0,21,1,...,3,0,0,0,4,0,0.6,0.6,1,-1.158362492
47,365.5,365.086784,365.086784,125.0,544,2,6,5,24,0,...,4,2,0,0,3,1,0.8,5.0,10,-1.0
62,278.65,278.009434,278.009434,85.0,347,0,5,3,19,0,...,4,0,1,0,2,0,0.6,4.0,10,0.477555766
38,317.32,317.047027,317.047027,125.0,565,0,6,2,22,0,...,6,0,1,0,2,0,0.6,3.0,4,-1.113943352
48,413.9,412.972933,412.972933,154.0,578,2,6,5,25,0,...,4,2,0,0,3,1,0.8,6.0,10,-1.176091259
24,424.6,424.193297,424.193297,120.0,773,2,5,6,30,0,...,3,2,0,0,3,1,1.2,8.2,10,-2.544068044
83,412.5,412.145678,412.145678,83.1,730,0,5,4,29,2,...,4,0,0,0,4,0,0.8,5.4,10,-0.450249108


In [None]:
# Linear regression
# Training the model
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
#Applying the model to make a prediction
y_lr_train_pred = lr.predict(X_train)
y_lr_test_pred = lr.predict(X_test)

In [None]:
y_lr_train_pred

array([3.49109851, 2.68906977, 4.87106645, 4.93810468, 4.16833196,
       1.93775958, 3.33580365, 2.22610204, 3.56263138, 3.55400663,
       2.73508738, 4.0651746 , 3.13427485, 2.18000328, 3.92405235,
       3.49800034, 4.5563849 , 3.54038177, 4.7       , 3.64620119,
       4.43902041, 4.47010727, 2.76201263, 5.28745878, 3.6891538 ,
       3.3       , 2.88714504, 1.6       , 2.28978219, 2.29519725,
       0.15543355, 3.39710004, 0.04527225, 3.79588534, 1.80461168,
       5.20743304, 4.242647  , 1.44714758, 1.47794235, 0.61030317,
       2.08903097, 1.25221803, 1.84279595, 3.32651749, 0.77053911,
       4.36489917, 2.98006561, 1.91808892, 3.71300188, 5.60921234,
       4.25962205, 4.52331922, 3.30858358, 6.61297261, 5.20755996,
       2.52503733, 4.1832269 , 2.33172057, 5.89076449, 4.79484977,
       3.45039726, 3.85559524, 1.2934312 , 3.62125306, 3.24033837,
       2.43272856, 3.82620197, 3.99552732, 3.71745784, 1.28967445,
       3.47329221, 0.74288792])

In [None]:
y_lr_test_pred

array([ 2.88123305,  1.03735623,  2.28790853,  2.64397028,  4.0756812 ,
        2.82229087,  1.20536902,  3.30308224,  4.5933003 ,  3.69910518,
        5.99865965,  4.11989386,  1.78079107, -0.40952725,  2.85688337,
        3.16790537,  1.70722053,  0.63256559,  2.2857912 ])

In [None]:
#Evaluate model performance
from sklearn.metrics import mean_squared_error, r2_score

lr_train_mse = mean_squared_error(y_train, y_lr_train_pred)
lr_train_r2 = r2_score(y_train, y_lr_train_pred)

lr_test_mse = mean_squared_error(y_test, y_lr_test_pred)
lr_test_r2 = r2_score(y_test, y_lr_test_pred)

In [None]:
print('LR MSE (Train): ', lr_train_mse)
print('LR R2 (Train): ', lr_train_r2)
print('LR MSE (Test): ', lr_test_mse)
print('LR R2 (Test): ', lr_test_r2)

LR MSE (Train):  0.2625606485590743
LR R2 (Train):  0.8786972050005133
LR MSE (Test):  0.8958665546830812
LR R2 (Test):  0.733293892264067


In [None]:
lr_results = pd.DataFrame(
    [['Linear regression', lr_train_mse, lr_train_r2, lr_test_mse, lr_test_r2]],
    columns=['Method', 'Training MSE', 'Training R2', 'Test MSE', 'Test R2'],
)


In [None]:
lr_results

Unnamed: 0,Method,Training MSE,Training R2,Test MSE,Test R2
0,Linear regression,0.262561,0.878697,0.895867,0.733294


In [None]:
# Random forest
# Training the model
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(max_depth=2, random_state=100)
rf.fit(X_train, y_train)

In [None]:
#Applying the model to make a prediction
y_rf_train_pred = rf.predict(X_train)
y_rf_test_pred = rf.predict(X_test)

In [None]:
#Evaluate model performance
from sklearn.metrics import mean_squared_error, r2_score

rf_train_mse = mean_squared_error(y_train, y_rf_train_pred)
rf_train_r2 = r2_score(y_train, y_rf_train_pred)

rf_test_mse = mean_squared_error(y_test, y_rf_test_pred)
rf_test_r2 = r2_score(y_test, y_rf_test_pred)

In [None]:
rf_results = pd.DataFrame(
    [['Random forest', rf_train_mse, rf_train_r2, rf_test_mse, rf_test_r2]],
    columns=['Method', 'Training MSE', 'Training R2', 'Test MSE', 'Test R2'],
)
rf_results

Unnamed: 0,Method,Training MSE,Training R2,Test MSE,Test R2
0,Random forest,0.437737,0.472318,0.322556,0.459244
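
With only 91 molecules, a single 80/20 split leaves 19 test rows, so the test metrics can change noticeably with the random seed. K-fold cross-validation gives a steadier comparison between models. The sketch below runs it on synthetic data of the same size; the data, the fold count of 5, and the names `candidates` and `cv_r2` are assumptions for illustration, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with the same number of rows as the filtered table.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(91, 6))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=91)

# Shuffled 5-fold split; every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=100)

candidates = {
    "Linear regression": LinearRegression(),
    "Random forest": RandomForestRegressor(max_depth=2, random_state=100),
}

cv_r2 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

On the real data, `X_demo` and `y_demo` would be replaced by the notebook's `X` and `y`.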
