[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eneskelestemur/MolecularModeling/blob/main/Labs/lab05_machine_learning/PyCaret.ipynb)

In [None]:
# packages that need to be installed
%pip install pycaret

# PyCaret

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows.

* [Regression](#regression)
* [Classification](#classification)

## Regression

PyCaret’s Regression Module is a supervised machine learning module that is used for estimating the relationships between a dependent variable (often called the ‘outcome variable’, or ‘target’) and one or more independent variables (often called ‘features’, ‘predictors’, or ‘covariates’). 

The objective of regression is to predict a continuous value, such as a sales amount, a quantity, or a temperature. In this notebook the continuous target is aqueous solubility (`LogS`), predicted from Morgan fingerprints.
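Because the target is continuous, a regression model is judged by how far its predictions fall from the true values. The following is a minimal sketch of three common error metrics (MAE, RMSE, and R2), which are among the columns PyCaret reports when comparing regressors. The helper `regression_metrics` and the sample values are hypothetical and only serve as an illustration; this is not PyCaret's own implementation.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = math.sqrt(mse)                          # root mean squared error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                     # coefficient of determination
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Hypothetical LogS values, made up for illustration only
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Lower MAE and RMSE are better, while R2 closer to 1 indicates that the model explains more of the variance in the target.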

In [None]:
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Load dataset
df = pd.read_csv('data/curated_solubility_dataset.csv')

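# Convert RDKit Mol objects to Morgan fingerprints (radius 3, 2048 bits)
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=2048)
def mols_to_fingerprints(mols, fp_gen=mfpgen):
    feature_vectors = []
    for mol in mols:
        # use the generator passed as an argument rather than the global one
        fp = fp_gen.GetFingerprint(mol)
        feature_vectors.append(fp)
    # each fingerprint becomes one row of 0/1 bits
    return np.array(feature_vectors)

# parse each SMILES string into a Mol, then store its fingerprint as a list of bits
df['Fingerprint'] = mols_to_fingerprints([Chem.MolFromSmiles(smi) for smi in df['SMILES']]).tolist()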

# Prepare features (X) and target (y) for regression
X = np.array(df['Fingerprint'].tolist())
y = df['LogS']

# create a dataframe for pycaret
data = pd.DataFrame(X)
data['target'] = y

# to keep the runtime short, we will only use 10% of the data
data = data.sample(frac=0.1, random_state=42)

# free up memory
import gc
del df, X, y
gc.collect()


In [None]:
import pycaret.regression as reg

# set up the regression experiment: 70/30 train/test split, 5-fold cross-validation
reg_setup = reg.setup(data=data, target='target', session_id=123, train_size=0.7, fold=5)
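The `train_size=0.7` and `fold=5` arguments describe a hold-out split followed by 5-fold cross-validation on the training portion. The sketch below illustrates that idea with plain Python index lists. The helper `holdout_and_folds` is hypothetical; PyCaret delegates the actual splitting to scikit-learn, so treat this as a conceptual illustration rather than its implementation.

```python
import random

def holdout_and_folds(n_rows, train_size=0.7, n_folds=5, seed=123):
    """Split row indices into train/test, then cut the train part into folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)           # reproducible shuffle
    n_train = round(n_rows * train_size)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    # fold i takes every n_folds-th training index, starting at position i
    folds = [train_idx[i::n_folds] for i in range(n_folds)]
    return train_idx, test_idx, folds

train_idx, test_idx, folds = holdout_and_folds(n_rows=100)
for i, val_idx in enumerate(folds):
    val_set = set(val_idx)
    # fit on the other four folds, validate on the current one
    fit_idx = [j for j in train_idx if j not in val_set]
    print(f"fold {i}: fit on {len(fit_idx)} rows, validate on {len(val_idx)} rows")
```

The held-out test rows are never used during cross-validation, so they remain available for a final, unbiased check of the selected model.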

In [None]:
# existing models
reg.models()

In [None]:
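# compare models by cross-validation and keep the top 5
# (LightGBM is excluded; turbo=True skips estimators that are slow to train)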
best5 = reg.compare_models(n_select=5, exclude=['lightgbm'], turbo=True)

In [None]:
# evaluate the best model
best_model = best5[0]
reg.evaluate_model(best_model)

In [None]:
# plot the model
reg.plot_model(best_model, plot='residuals')

In [None]:
# save the model
reg.save_model(best_model, 'reg_model')

## Classification

PyCaret’s Classification Module is a supervised machine learning module that is used for classifying elements into groups. 

The goal is to predict categorical class labels, which are discrete and unordered. Common use cases include predicting customer default (yes or no), predicting customer churn (leave or stay), and diagnosing a disease (positive or negative).

This module can be used for binary or multiclass problems.

In [None]:
# convert the target to a binary label: 1 if LogS < -3, otherwise 0
# (run this cell only once; re-running it on the 0/1 labels would set every label to 0)
data['target'] = (data['target'] < -3).astype(int)
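The thresholding step can be sketched on a few values to show how the boundary is handled and how to check the resulting class balance. The helper `binarize_logs` and the sample values are hypothetical and made up for illustration; the threshold of -3 is the one used in the cell above.

```python
from collections import Counter

def binarize_logs(values, threshold=-3.0):
    """Label a LogS value 1 if it is strictly below the threshold, else 0."""
    return [1 if v < threshold else 0 for v in values]

# Hypothetical LogS values, made up for illustration only
logs_values = [-0.5, -2.9, -3.0, -3.1, -6.2]
labels = binarize_logs(logs_values)
counts = Counter(labels)   # number of samples per class

print(labels)   # [0, 0, 0, 1, 1]
print(counts)
```

A value exactly equal to the threshold falls into class 0 because the comparison is strict. Checking the class counts before training is worthwhile, since a strongly imbalanced target makes accuracy alone a misleading metric.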

In [None]:
# set up the classification experiment: 70/30 train/test split, 5-fold cross-validation
import pycaret.classification as clf

clf_setup = clf.setup(data=data, target='target', session_id=123, train_size=0.7, fold=5)

In [None]:
# existing models
clf.models()

In [None]:
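# compare models by cross-validation and keep the top 5
# (LightGBM is excluded; turbo=True skips estimators that are slow to train)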
best5 = clf.compare_models(n_select=5, exclude=['lightgbm'], turbo=True)

In [None]:
# evaluate the best model
best_model = best5[0]
clf.evaluate_model(best_model)

In [None]:
# plot the model
clf.plot_model(best_model, plot='confusion_matrix')
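To read the confusion matrix plot, it helps to see how its four cells turn into the usual classification scores. The sketch below counts the cells by hand for a binary problem and derives accuracy, precision, recall, and F1. The helpers `confusion_counts` and `classification_scores` and the label lists are hypothetical and for illustration only; they are not part of PyCaret.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    return tn, fp, fn, tp

def classification_scores(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from the four cell counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels, made up for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
scores = classification_scores(tn, fp, fn, tp)
print((tn, fp, fn, tp))
print(scores)
```

This sketch assumes that each class is predicted at least once; otherwise precision or recall would involve a division by zero and would need explicit handling.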

In [None]:
# save the model
clf.save_model(best_model, 'clf_model')