# Use lazypredict to test modelling on multiple ML algorithms

Sometimes we want to compare how well different regressors and classifiers model our data.

Lazypredict is a package that provides a convenient way to do this: https://pypi.org/project/lazypredict/

[Classification Example](#classification-example)

[Regression Example](#regression-example)

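The classifier metrics on the BBB model are much lower than those from the detailed modelling with RandomForestClassifier (see the notebook in the BBB folder). A likely cause is visible in the outputs of the classification example: the `fp` column is read back from the CSV as text such as `[0 0 0 ... 0 0 0]`, so `X` contains a single text column, and LightGBM reports `number of used features: 1`.

The regressors on the logD data, however, showed metrics similar to standalone modelling with HistGradientBoostingRegressor().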

In [19]:
import pandas as pd

from lazypredict.Supervised import LazyClassifier
from lazypredict.Supervised import LazyRegressor
from sklearn.model_selection import train_test_split

import rdkit
print(f"I am RDKit version: {rdkit.__version__}")
import sys
print(f"I am python version {sys.version}")

I am RDKit version: 2024.03.2
I am python version 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:35:25) [Clang 16.0.6 ]


## Classification Example

In [20]:
bbb = pd.read_csv("data/B3DB_classification_cleaned.csv")
print(bbb.shape)
bbb.head(3)

(7797, 6)


| | compound_name | SMILES | BBB+/BBB- | Inchi | mol | fp |
|---|---|---|---|---|---|---|
| 0 | sulphasalazine | `O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O` | 0 | `InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18...` | `<rdkit.Chem.rdchem.Mol object at 0x17a24b060>` | [0 0 0 ... 0 0 0] |
| 1 | moxalactam | `COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)...` | 0 | `InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8...` | `<rdkit.Chem.rdchem.Mol object at 0x17a24b0d0>` | [0 1 0 ... 0 0 0] |
| 2 | clioquinol | `Oc1c(I)cc(Cl)c2cccnc12` | 0 | `InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1...` | `<rdkit.Chem.rdchem.Mol object at 0x17a24b140>` | [0 0 0 ... 0 0 0] |
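
Because the table was read from a CSV file, object columns such as `mol` and `fp` come back as text rather than as RDKit molecules or arrays. The following is a minimal sketch of a check for that, using only the standard library. The helper names `find_stringified_arrays` and `is_truncated_repr` are hypothetical, and the sample values are copied from the preview above.

```python
def find_stringified_arrays(values):
    """Return the indices of entries that are text rather than numeric vectors.

    Works on any iterable, e.g. a list or a pandas Series such as bbb['fp'].
    """
    return [i for i, v in enumerate(values) if isinstance(v, str)]


def is_truncated_repr(value):
    """True if a text entry contains the '...' that NumPy prints for long arrays."""
    return isinstance(value, str) and "..." in value


# Values as they appear in the preview above (text, not arrays).
sample_fp = ["[0 0 0 ... 0 0 0]", "[0 1 0 ... 0 0 0]", "[0 0 0 ... 0 0 0]"]

bad_rows = find_stringified_arrays(sample_fp)
truncated = [is_truncated_repr(v) for v in sample_fp]
print(f"{len(bad_rows)} of {len(sample_fp)} entries are strings")
print(f"all truncated: {all(truncated)}")
```

A truncated string cannot be parsed back into the full bit vector, so in that case the fingerprints have to be recomputed from the SMILES column.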


In [21]:
X = pd.DataFrame(bbb['fp'].tolist())  # put features in a dataframe called X
y = bbb['BBB+/BBB-']  # put the target labels in a series called y

# stratify on y because the classes are imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=7)

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)

100%|██████████| 29/29 [00:02<00:00, 12.19it/s]

[LightGBM] [Info] Number of positive: 4458, number of negative: 2559
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000176 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 15
[LightGBM] [Info] Number of data points in the train set: 7017, number of used features: 1
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.635314 -> initscore=0.555084
[LightGBM] [Info] Start training from score 0.555084
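
The LightGBM log above reports `number of used features: 1`, which is consistent with `X` holding one text column rather than one column per fingerprint bit. One way to get a numeric matrix is to recompute the fingerprints from the `SMILES` column. The sketch below assumes Morgan fingerprints with radius 2 and 2048 bits; the fingerprint type and size used for the original `fp` column are not stated here, and `smiles_to_fingerprints` is a hypothetical helper.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def smiles_to_fingerprints(smiles_list, radius=2, n_bits=2048):
    """Return an (n_molecules, n_bits) array of Morgan fingerprint bits.

    radius and n_bits are assumed defaults, not values taken from the notebook.
    """
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # Fail loudly instead of silently dropping unparsable structures.
            raise ValueError(f"Could not parse SMILES: {smiles!r}")
        rows.append(generator.GetFingerprintAsNumPy(mol))
    return np.vstack(rows)


# Two complete SMILES strings from the preview table.
example_smiles = [
    "O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O",  # sulphasalazine
    "Oc1c(I)cc(Cl)c2cccnc12",  # clioquinol
]
fp_matrix = smiles_to_fingerprints(example_smiles)
print(fp_matrix.shape)
```

With a helper like this, `X = pd.DataFrame(smiles_to_fingerprints(bbb['SMILES']))` would give one numeric column per bit.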





In [22]:
pd.DataFrame(models)

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|
| GaussianNB | 0.65 | 0.6 | 0.6 | 0.64 | 0.0 |
| QuadraticDiscriminantAnalysis | 0.65 | 0.6 | 0.6 | 0.64 | 0.01 |
| PassiveAggressiveClassifier | 0.64 | 0.6 | 0.6 | 0.63 | 0.01 |
| NearestCentroid | 0.63 | 0.6 | 0.6 | 0.63 | 0.01 |
| KNeighborsClassifier | 0.64 | 0.59 | 0.59 | 0.63 | 0.02 |
| LinearSVC | 0.63 | 0.5 | 0.5 | 0.51 | 0.12 |
| CalibratedClassifierCV | 0.63 | 0.5 | 0.5 | 0.51 | 0.02 |
| RidgeClassifierCV | 0.63 | 0.5 | 0.5 | 0.51 | 0.01 |
| RidgeClassifier | 0.63 | 0.5 | 0.5 | 0.51 | 0.0 |
| LinearDiscriminantAnalysis | 0.63 | 0.5 | 0.5 | 0.51 | 0.01 |
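
Several rows above pair an accuracy of 0.63 with a balanced accuracy of 0.5. That pattern is what a model scores when it predicts the majority class for every sample. The arithmetic below uses the class counts from the LightGBM log (4458 positive, 2559 negative in the training set) and assumes the stratified test set has nearly the same class ratio. It illustrates the baseline only and does not show what those models actually predicted.

```python
def majority_baseline(n_pos, n_neg):
    """Accuracy and balanced accuracy of always predicting the larger class."""
    total = n_pos + n_neg
    accuracy = max(n_pos, n_neg) / total
    # Recall is 1.0 on the majority class and 0.0 on the minority class;
    # balanced accuracy is the mean of the per-class recalls.
    recall_majority, recall_minority = 1.0, 0.0
    balanced_accuracy = (recall_majority + recall_minority) / 2
    return accuracy, balanced_accuracy


acc, bal_acc = majority_baseline(n_pos=4458, n_neg=2559)
print(f"accuracy={acc:.3f}, balanced accuracy={bal_acc:.2f}")
```

The resulting accuracy of about 0.635 is close to the 0.63 reported for the linear models, which is why balanced accuracy is the more informative column on an imbalanced dataset.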


## Regression Example

In [23]:
logd = pd.read_csv("data/logD_normalizedDescriptors.csv")
print(f"The dataset contains {logd.shape[0]} endpoints on LogD")
logd.head(3)

The dataset contains 10629 endpoints on LogD


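| | smiles | logD | group | 0 | 2 | 3 | 4 | 5 | 6 | 10 | ... | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 214 | 215 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `O=C(Nc1cnn(Cc2ccccc2)c1)c1[nH]nc2c1CCC1(CC1)C2` | 3.9 | test | 0.69 | 0.07 | 0.82 | 0.8 | 0.17 | 0.37 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 |
| 1 | `Cc1ccc(C)c2c(=O)c3c([nH]c12)CCCC3` | 2.6 | test | 0.69 | 0.13 | 0.85 | 0.77 | 0.13 | 0.21 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | `O=c1ncn(Cc2ccc(F)cc2F)c2cc(F)c(Oc3ncccc3C(F)(F...` | 3.9 | test | 0.83 | 0.01 | 0.42 | 0.42 | 0.08 | 0.52 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.17 |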


In [24]:
X = logd.drop(columns=['smiles', 'logD', 'group'])  # put the descriptor features in a dataframe called X
y = logd.logD  # put the target values in a series called y

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=0)

reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)

 21%|██▏       | 9/42 [00:21<02:03,  3.75s/it]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 71%|███████▏  | 30/42 [01:06<00:15,  1.32s/it]

PoissonRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfPoissonLoss'.


 98%|█████████▊| 41/42 [01:48<00:02,  2.28s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005033 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 18645
[LightGBM] [Info] Number of data points in the train set: 9566, number of used features: 159
[LightGBM] [Info] Start training from score 2.224543


100%|██████████| 42/42 [01:49<00:00,  2.60s/it]
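
The two failures above come from the loss functions in scikit-learn: `GammaRegressor` needs strictly positive targets and `PoissonRegressor` needs non-negative targets, while logD values can be zero or negative. Below is a minimal sketch of a pre-check, using only the standard library. `check_target_range` is a hypothetical helper, and the sample values are made up for illustration.

```python
def check_target_range(y):
    """Report which target-range requirements a sequence of targets satisfies."""
    values = list(y)
    return {
        "gamma_ok": all(v > 0 for v in values),     # Gamma loss needs y > 0
        "poisson_ok": all(v >= 0 for v in values),  # Poisson loss needs y >= 0
        "min": min(values),
    }


example_logd = [3.9, 2.6, 0.0, -1.2]  # illustrative values, not taken from the dataset
report = check_target_range(example_logd)
print(report)
```

Both models can safely be ignored for a target like logD; the remaining regressors are unaffected by these two failures.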


In [25]:
pd.DataFrame(models)

| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.83 | 0.86 | 0.52 | 13.26 |
| XGBRegressor | 0.82 | 0.85 | 0.53 | 0.35 |
| SVR | 0.81 | 0.84 | 0.55 | 4.7 |
| NuSVR | 0.8 | 0.84 | 0.56 | 4.22 |
| LGBMRegressor | 0.8 | 0.84 | 0.56 | 0.63 |
| HistGradientBoostingRegressor | 0.8 | 0.83 | 0.56 | 2.11 |
| RandomForestRegressor | 0.79 | 0.83 | 0.57 | 33.02 |
| MLPRegressor | 0.78 | 0.82 | 0.59 | 6.17 |
| BaggingRegressor | 0.77 | 0.81 | 0.6 | 3.1 |
| GradientBoostingRegressor | 0.67 | 0.73 | 0.72 | 11.38 |
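
Adjusted R-Squared penalises R-Squared for the number of features. The usual formula is 1 - (1 - R²)(n - 1)/(n - p - 1), with n samples and p features. The sketch below applies it with n = 1063 (the 10629 rows minus the 9566 training rows reported in the LightGBM log) and a hypothetical p = 200, since the exact feature count is not shown above. Whether lazypredict uses exactly these values of n and p is an assumption.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("Need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


n_test = 10629 - 9566  # rows left for the test set
p_hypothetical = 200   # assumed feature count, for illustration only
adj = adjusted_r2(0.86, n_test, p_hypothetical)
print(f"R2=0.86 -> adjusted R2={adj:.3f}")
```

The gap between the two columns grows with the number of features relative to the number of test samples, which explains why every adjusted value above sits a few hundredths below its R-Squared.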
