# Avalon Fingerprints for Molecular Property and Reaction Predictions

This notebook utilises **Avalon Fingerprints** to model and predict:
1. **HOMO-LUMO Energy Gap of Organic Molecules**
2. **Reaction Yield for C-N Cross-Coupling Reactions**
3. **Catalytic Enantioselectivity of Thiol-Imine Reactions**

We will use two **supervised learning** algorithms for **regression**:
1. **Random Forest Regressor**
2. **LightGBM Regressor**

The dataset **<sup>1</sup>** we will use consists of molecular SMILES strings and their HOMO-LUMO energy gap in meV.

<div align="center">
    <img src="attachment:65d47bf1-abc9-4824-a165-05d966bb652d.jpg"
     alt="avalon-fingerprints-predictive-model"/>
    <p>
      <b>Fig 1</b> Avalon Fingerprints for predictive modeling schematic. <b><sup>2</sup></b>
    </p>
</div>

In [1]:
import sys
import os
import time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
#-------------------------------------------------------
from rdkit.Chem import AllChem
from rdkit import Chem
from rdkit.DataStructs.cDataStructs import ExplicitBitVect
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import PandasTools
from rdkit.Chem import rdMolDescriptors
from tqdm import tqdm
#-------------------------------------------------------
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_validate, train_test_split
#-------------------------------------------------------
from lightgbm import LGBMRegressor
#-------------------------------------------------------
os.environ['SUP_LEARN_PYTHON_DIR_PATH'] =  os.path.join(os.getcwd(), '../src')
os.environ['SUP_LEARN_DATA_DIR_PATH'] =  os.path.join(os.getcwd(), '../data')
sys.path.append(os.getenv('SUP_LEARN_PYTHON_DIR_PATH'))
sys.path.append(os.getenv('SUP_LEARN_DATA_DIR_PATH'))

from utils import avalon_fingerprints_utils

# 1. HOMO-LUMO Energy Gap

## 1.1 Data Preparation

In [2]:
# Load data
homo_lumo_dataset: pd.DataFrame = pd.read_csv(
    os.path.join(
        os.getenv('SUP_LEARN_DATA_DIR_PATH'), 
        'raw/orbital-energies-input-data.csv'
    )
)

# Add 2D structure column
PandasTools.AddMoleculeColumnToFrame(
    homo_lumo_dataset,
    'SMILES',
    'Structure',
    includeFingerprints=True
)

# Generate Avalon fingerprints
avalon_fpts: np.ndarray = avalon_fingerprints_utils.generate_avalon_fingerprints(
    homo_lumo_dataset['Structure']
)

# Insert into DataFrame. Each row represents a molecule's Avalon fingerprint and
# each column an individual bit
avalon_fpts_dataset: pd.DataFrame = pd.DataFrame(
    avalon_fpts,
    columns=['Bit {}'.format(bit) for bit in range(avalon_fpts.shape[1])]
)

avalon_fpts_dataset.head()

100%|██████████| 2904/2904 [00:00<00:00, 4029.68it/s]


Unnamed: 0,Bit 0,Bit 1,Bit 2,Bit 3,Bit 4,Bit 5,Bit 6,Bit 7,Bit 8,Bit 9,...,Bit 4086,Bit 4087,Bit 4088,Bit 4089,Bit 4090,Bit 4091,Bit 4092,Bit 4093,Bit 4094,Bit 4095
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
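
The helper `avalon_fingerprints_utils.generate_avalon_fingerprints` lives in this project's `src/utils` and is not shown in this notebook. As a rough sketch of what such a helper might do, the example below assumes it wraps RDKit's `pyAvalonTools.GetAvalonFP` with 4096 bits (matching the 4096 bit columns above). The function name `sketch_avalon_fingerprints` and the two example SMILES strings are hypothetical and used for illustration only.

```python
import numpy as np
from rdkit import Chem
from rdkit.Avalon import pyAvalonTools


def sketch_avalon_fingerprints(mols, n_bits=4096):
    """Hypothetical stand-in for the project helper.

    Returns an array with one row per molecule and one column per bit.
    """
    fingerprints = np.zeros((len(mols), n_bits), dtype=np.uint8)
    for row, mol in enumerate(mols):
        # GetAvalonFP returns an ExplicitBitVect of length n_bits
        bit_vect = pyAvalonTools.GetAvalonFP(mol, nBits=n_bits)
        # Set the columns that correspond to the "on" bits to 1
        fingerprints[row, list(bit_vect.GetOnBits())] = 1
    return fingerprints


# Illustrative molecules (ethanol and benzene), not taken from the dataset
example_mols = [Chem.MolFromSmiles(smiles) for smiles in ("CCO", "c1ccccc1")]
example_fpts = sketch_avalon_fingerprints(example_mols)
print(example_fpts.shape)
```

Most bits in a fingerprint of this length are 0 for small molecules, which is consistent with the all-zero columns visible in the preview above.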


## 1.2 Instantiate and Train Models

We will now **instantiate and train our models** on the Avalon fingerprint dataset.

### 1.2.1 LightGBM Regressor (`LGBMRegressor`)

LightGBM, which stands for **Light Gradient Boosting Machine**, is a **highly efficient and fast implementation of gradient boosting for decision tree algorithms**.
* **Gradient Boosting** is a powerful machine learning technique used for **both regression and classification tasks**.
* **It builds an ensemble of trees** in a **sequential manner**, where each new tree is **trained to correct the errors of the previous trees**.
* This is in contrast to **bagging-based algorithms** such as **Random Forest**, where the trees in the ensemble are **built independently and combined**.
* It **combines the predictions of multiple weaker learners** to create a **strong learner predictive model**.
* Gradient boosting uses **gradient descent to minimise the loss function** (*c.f.* `intro_to_supervised_learning.ipynb`). Each new tree is **fit on the negative gradient of the loss function** with respect to the **predictions of the previously built ensemble of trees**.
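
The sequential residual-fitting idea in the bullets above can be sketched in a few lines. This is a minimal illustration, not LightGBM's actual implementation: it assumes a squared-error loss (for which the negative gradient is simply the residual), uses scikit-learn's `DecisionTreeRegressor` as the weak learner, and runs on synthetic data. The variable names, tree depth and learning rate are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1D regression problem (illustrative data only)
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction; the mean minimises the squared error
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_trees):
    # For L = 0.5 * (y - F)^2 the negative gradient w.r.t. F is the residual
    negative_gradient = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, negative_gradient)
    # Take a small step in the direction suggested by the new tree
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```

Each shallow tree on its own is a weak learner; the training error falls because every new tree is fit to what the current ensemble still gets wrong.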

LightGBM has the following key characteristics:
1. **Speed and Efficiency**: LightGBM is **optimised for performance and memory usage**, which typically makes it **faster to train than many other gradient boosting implementations**.
2. **Support for large datasets**: LightGBM can **handle large-scale data** with **millions of instances and features** efficiently.
3. **Accuracy**: It provides **high accuracy** due to the implementation of advanced features like **leaf-wise tree growth** (the leaf with the **maximum reduction in loss is split next**).

The **`LGBMRegressor`** class in the **LightGBM library** is **designed specifically for regression tasks**. It inherits from LightGBM's core functionalities and allows for the building of **powerful regression models**.

### 1.2.2 Random Forest Regressor (`RandomForestRegressor`)

The class `RandomForestRegressor` is a **popular ensemble supervised learning method** from the **scikit-learn library**. It is used for **regression tasks** and, like the LightGBM regressor, **combines the predictions of multiple weaker learners** to create a **strong learner predictive model**.

However, it differs from the LightGBM regressor in that it uses a technique called **bagging** (bootstrap aggregating): each tree is **trained independently on a bootstrap sample of the training data**, and the **predictions of the individual trees are aggregated through averaging to give a final, stronger prediction**.
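
The averaging step can be checked directly. The sketch below fits a small `RandomForestRegressor` on synthetic binary features (illustrative data, not the HOMO-LUMO dataset), averages the predictions of its individual trees by hand via the fitted `estimators_` attribute, and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic binary "fingerprint-like" features (illustrative data only)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 16)).astype(float)
y = X[:, :4].sum(axis=1) + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Collect the prediction of every tree, then average across trees
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's prediction should agree with the hand-computed average
matches = np.allclose(manual_average, forest.predict(X))
print(matches)
```

Because the trees are built independently, they can be trained in parallel, unlike the sequentially built trees of gradient boosting.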

In [None]:
# Instantiate the `LGBMRegressor` model. The `n_estimators` argument specifies the number of boosting
# iterations (i.e. the number of trees in the model). The `random_state` argument sets the seed for the
# random number generator, ensuring the results are reproducible (i.e. that the same random sequences are
# generated each time the code is run)
lgbm_regressor: LGBMRegressor = LGBMRegressor(n_estimators=800, random_state=42)

# Instantiate the `RandomForestRegressor` model. The `random_state` argument sets the seed for the random
# number generator, ensuring the results are reproducible 
rf_regressor: RandomForestRegressor = RandomForestRegressor(random_state=42)
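
As a rough illustration of how a regressor like these might then be trained and scored, the sketch below combines `ShuffleSplit` and `cross_validate` (both imported at the top of this notebook). It is an assumption about the workflow rather than the notebook's actual training code: the data is synthetic, and `X_demo`, `y_demo` and `demo_regressor` are hypothetical names. The same pattern should apply to `lgbm_regressor`, since `LGBMRegressor` follows the scikit-learn estimator interface.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in for a fingerprint matrix and a target property
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 64))
y_demo = X_demo[:, :8].sum(axis=1) + rng.normal(scale=0.1, size=200)

# Three random 80/20 train/test splits
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
demo_regressor = RandomForestRegressor(n_estimators=50, random_state=42)

# Fit on each training split and score on the matching test split
scores = cross_validate(
    demo_regressor,
    X_demo,
    y_demo,
    cv=cv,
    scoring=("r2", "neg_mean_absolute_error"),
)

mean_r2 = scores["test_r2"].mean()
# scikit-learn reports MAE negated so that higher is always better
mean_mae = -scores["test_neg_mean_absolute_error"].mean()
print(f"R2 = {mean_r2:.3f}, MAE = {mean_mae:.3f}")
```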

# References

**[1]** Kühnemund, M. (2020) Marius Kühnemund / FP-DM-tool · GitLab, *GitLab*. Available at: https://zivgitlab.uni-muenster.de/m_kueh11/fp-dm-tool (Accessed: 10 July 2024).<br><br>
**[2]** Goshu, G.M. (2023) Avalon-fingerprints-for-machine-learning/avalon fingerprints for predictive modeling.ipynb at main · GASHAWMG/Avalon-fingerprints-for-machine-learning, *GitHub*. Available at: https://github.com/gashawmg/Avalon-fingerprints-for-machine-learning/blob/main/Avalon%20fingerprints%20for%20predictive%20modeling.ipynb (Accessed: 10 July 2024).<br><br>
