<a href="https://colab.research.google.com/github/bbarthougatica/ChmInf/blob/Regression/cheminf_latin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cheminformatics & drug design - Project**



We chose two datasets from *MoleculeNet*.
We aim to study toxicity and lipophilicity, using the former for a classification task and the latter for a regression task.
For both datasets, MoleculeNet recommends a random split into training, validation and test subsets in an 80/10/10 ratio.
# 1st Data Set: **Lipophilicity**

*   Lipophilicity is an important property of drug molecules that affects both membrane permeability and solubility. The Lipophilicity dataset, curated from the ChEMBL database, provides experimental octanol/water distribution coefficients (logD at pH 7.4) for 4200 compounds. This property influences how a drug is absorbed, distributed, metabolised, and excreted in the body (ADME properties).
*   Task type: Regression
*   Nº Tasks: 1
*   Recommended regression metric: Root-Mean-Square Error
*   Nº Compounds: 4200
*   Prediction target: Lipophilicity


# 2nd Data Set: **SIDER**

*   Database of marketed drugs and adverse drug reactions (ADRs), grouped into 27 system organ classes.
*   Task type: Classification
*   Nº Tasks: 27
*   Recommended classification metric: Area Under the Receiver Operating Characteristic Curve (ROC-AUC)
*   Nº Compounds: 1427
*   Prediction target: Adverse Drug Reaction



*You should perform appropriate data cleaning and preprocessing. This includes evaluating at least two different molecular featurization strategies, such as circular fingerprints (e.g., Morgan) and descriptor-based or graph-based representations.
Document any preprocessing decisions or challenges.*


**The path to a ML model.**

*   Define the task
*   Prepare data & split data
*   Choose the model
*   Train the model
*   Evaluate the model
*   Use the model




In [1]:
# Install all libraries
!pip install numpy scipy matplotlib scikit-learn pandas rdkit xgboost deepchem mordred pycm

import pandas as pd
import numpy as np
import deepchem as dc
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole






Now we load our selected datasets.

# 1. Starting with the regression algorithms

# **1.1) Get the datasets**
In this case, the Lipophilicity dataset for the regression exercise.

In [2]:
# Get tasks from deepchem's functions

# For Regression
tasks_lipo, datasets_lipo, transformers_lipo = dc.molnet.load_lipo(splitter=None)


print(f"Lipo dataset loaded successfully. Sub datasets: {len(datasets_lipo)}")


Lipo dataset loaded successfully. Sub datasets: 1


In [8]:

print("Dataset:", "Lipophilicity")
print("Number of tasks:", len(tasks_lipo))
print("Task names:", tasks_lipo)

Dataset: Lipophilicity
Number of tasks: 1
Task names: ['exp']


In [7]:
datasets_lipo

(<DiskDataset X.shape: (4200, 1024), y.shape: (4200, 1), w.shape: (4200, 1), task_names: ['exp']>,)

In [14]:
dataset_lipo = datasets_lipo[0]

In [16]:
dataset_lipo.ids

array(['Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14',
       'COc1cc(OC)c(cc1NC(=O)CSCC(=O)O)S(=O)(=O)N2C(C)CCc3ccccc23',
       'COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl', ...,
       'COc1cccc2[nH]ncc12', 'Clc1ccc2ncccc2c1C(=O)NCC3CCCCC3',
       'CN1C(=O)C=C(CCc2ccc3ccccc3c2)N=C1N'], dtype=object)

The dataset has no separate SMILES column: the SMILES strings are stored in `.ids`, so we use them to build a pandas DataFrame.

In [19]:
# Auxiliary function to transform the DeepChem Dataset object into a DataFrame
def dataset_to_lipo_df(dataset, split_name):
    # .ids holds the SMILES strings
    smiles = dataset.ids
    # .y holds the target: experimental logD at pH 7.4, z-score normalized by
    # load_lipo's default transformer. The column keeps the name 'log_P'.
    log_p = dataset.y.flatten()

    return pd.DataFrame({
        'smiles': smiles,
        'log_P': log_p,
        'split': split_name
    })

# No split is applied here: splitter=None returns a single dataset
dataset_lipo = datasets_lipo[0] # we unpack the one-element tuple
# DataFrame for the full dataset
df_lipo = dataset_to_lipo_df(dataset_lipo, 'Total')

# Checking
print("\n--- Full DataSet (SMILES and log_P) ---")
print(f"Number of molecules: {len(df_lipo)}")
print("DataFrame Head:")
print(df_lipo.head())


--- Full DataSet (SMILES and log_P) ---
Number of molecules: 4200
DataFrame Head:
                                              smiles     log_P  split
0            Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14  1.125371  Total
1  COc1cc(OC)c(cc1NC(=O)CSCC(=O)O)S(=O)(=O)N2C(C)... -2.798609  Total
2             COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl  1.250074  Total
3  OC[C@H](O)CN1C(=O)C(Cc2ccccc12)NC(=O)c3cc4cc(C...  0.984041  Total
4  Cc1cccc(C[C@H](NC(=O)c2cc(nn2C)C(C)(C)C)C(=O)N...  0.759576  Total


# **1.2) Featurizing**

We'll try two different molecular featurization strategies. The first one is Morgan Fingerprints (circular fingerprints) and the second one is RDKit Descriptors.

The idea is to compare how the lipophilicity predictions differ between a representation based on molecular structure and one based on physicochemical descriptors.

In [21]:
# Separate the SMILES strings (X) and the target variable (Y)
X_smiles = df_lipo['smiles']
Y = df_lipo['log_P']

print("SMILES and target variable successfully extracted.")

SMILES and target variable successfully extracted.


### Morgan Fingerprints Featurizer
We chose this featurizer because it encodes the fragments (circular atom environments) present in a molecule and allows structural similarity between molecules to be computed efficiently. It is also a standard representation for rapid comparison in virtual screening tasks.

In [22]:
# Initialize the Morgan Featurizer (radius=2, size=2048)
morgan_featurizer = dc.feat.CircularFingerprint(size=2048, radius=2)

# featurize the list of SMILES strings
# each row is a molecule's fingerprint
X_morgan = morgan_featurizer.featurize(X_smiles.tolist())

# convert the NumPy array to a Pandas DataFrame for easier handling
df_morgan = pd.DataFrame(X_morgan, columns=[f'fp_{i}' for i in range(2048)])

print(f"Morgan FP dimensions for training: {df_morgan.shape}")
print("Morgan Fingerprints generated and ready.")



Morgan FP dimensions for training: (4200, 2048)
Morgan Fingerprints generated and ready.
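
The structural-similarity use of fingerprints mentioned above is usually expressed through the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below illustrates it on toy bit vectors; the function name `tanimoto` and the example vectors are hypothetical and not part of the notebook.

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprint vectors."""
    a = np.asarray(fp_a, dtype=bool)
    b = np.asarray(fp_b, dtype=bool)
    shared = np.logical_and(a, b).sum()   # bits set in both fingerprints
    total = np.logical_or(a, b).sum()     # bits set in at least one fingerprint
    # Two all-zero fingerprints share nothing; define their similarity as 0.0
    return float(shared) / float(total) if total else 0.0

# Toy 8-bit fingerprints (hypothetical values, for illustration only)
fp1 = [1, 0, 1, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 0, 1, 1, 0]

print(tanimoto(fp1, fp2))  # 3 shared bits / 5 bits in the union = 0.6
```

With the notebook's data, the same function could be applied to two rows of the fingerprint matrix, for example `tanimoto(X_morgan[0], X_morgan[1])`.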


### RDKit Descriptors Featurizer:

This second featurizer was selected because it represents each molecule by computed physicochemical properties, in contrast to the purely structural fingerprints above. Since our task is to predict an experimental lipophilicity value (logD), correlated physicochemical descriptors (such as calculated hydrophobicity) are likely to be very relevant to the model.

In [23]:
# Initialize the RDKit Descriptors Featurizer
desc_featurizer = dc.feat.RDKitDescriptors()

# Featurize the list of SMILES strings
# This may take a bit longer and can potentially produce NaN values
X_desc = desc_featurizer.featurize(X_smiles.tolist())

# Handle NaN values: RDKit can sometimes fail to compute descriptors.
# We replace NaN with 0 for simplicity; imputation (e.g., mean/median) is an alternative.
# Zero-filling is a simple choice that is reasonable when NaNs are sparse.
X_desc[np.isnan(X_desc)] = 0

# Convert the NumPy array to a Pandas DataFrame
df_desc = pd.DataFrame(X_desc, columns=desc_featurizer.descriptors)

print(f"RDKit Descriptors dimensions for training: {df_desc.shape}")
print("RDKit Descriptors generated and ready.")

RDKit Descriptors dimensions for training: (4200, 217)
RDKit Descriptors generated and ready.
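
As an alternative to the zero fill used above, missing descriptor values can be imputed with a per-column median. The following is a minimal sketch on a toy matrix, not the notebook's method; the helper name `impute_with_median` is hypothetical. It also treats infinite values as missing, since `np.isnan` alone does not detect them, and it computes the medians on the training rows only so that no information from the test rows leaks into the features.

```python
import numpy as np

def impute_with_median(X_train, X_other):
    """Fill non-finite entries with column medians computed on X_train only."""
    X_train = np.array(X_train, dtype=float)
    X_other = np.array(X_other, dtype=float)
    # Mask inf as NaN so both kinds of bad values are handled the same way
    X_train[~np.isfinite(X_train)] = np.nan
    X_other[~np.isfinite(X_other)] = np.nan
    medians = np.nanmedian(X_train, axis=0)   # one median per descriptor column
    # Replace each missing entry by the training median of its column
    rows, cols = np.where(np.isnan(X_train))
    X_train[rows, cols] = medians[cols]
    rows, cols = np.where(np.isnan(X_other))
    X_other[rows, cols] = medians[cols]
    return X_train, X_other

# Toy descriptor matrices (hypothetical values, for illustration only)
train = [[1.0, 10.0], [np.nan, 20.0], [3.0, np.inf]]
test = [[np.nan, 5.0]]
train_filled, test_filled = impute_with_median(train, test)
print(train_filled)
print(test_filled)
```

A column that is missing for every training molecule would still yield NaN here and would need to be dropped separately.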


# **1.3) Models**

First we split the data into training and test sets. The chosen models for this project are **Random Forest Regression** as a classic ML model and a **Multi-Layer Perceptron** as a neural-network-based model.

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np

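We split the data randomly, using 80% for training and 20% as a held-out test set for internal evaluation. This is a simplification of the 80/10/10 train/validation/test split recommended on Moleculenet.org: no separate validation set is kept.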

In [25]:
# Y_tr and Y_te hold the target values for the training and test sets
X_morgan_tr, X_morgan_te, Y_tr, Y_te = train_test_split(
    df_morgan, Y, test_size=0.2, random_state=42)

# Split the RDKit Descriptor data. We reuse Y_tr and Y_te: the same random_state
# and row order give the same train/test partition.
X_desc_tr, X_desc_te, _, _ = train_test_split(
    df_desc, Y, test_size=0.2, random_state=42)

print("Data splits for modeling complete.")

Data splits for modeling complete.
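
A random 80/10/10 split, as recommended by MoleculeNet, can be sketched by shuffling row indices once and reusing them for every feature table. This is a minimal illustration, not the notebook's method; the helper name `random_split_indices` and its parameters are hypothetical.

```python
import numpy as np

def random_split_indices(n_samples, frac_train=0.8, frac_valid=0.1, seed=42):
    """Return shuffled train/valid/test index arrays; the remainder is the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # random ordering of all row indices
    n_train = round(frac_train * n_samples)
    n_valid = round(frac_valid * n_samples)
    train_idx = order[:n_train]
    valid_idx = order[n_train:n_train + n_valid]
    test_idx = order[n_train + n_valid:]
    return train_idx, valid_idx, test_idx

# 4200 is the number of compounds in the Lipophilicity dataset
train_idx, valid_idx, test_idx = random_split_indices(4200)
print(len(train_idx), len(valid_idx), len(test_idx))  # 3360 420 420
```

The same index arrays could then select rows from each table, for example `df_morgan.iloc[train_idx]` and `df_desc.iloc[train_idx]`, which guarantees identical partitions without calling the splitter twice.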


# **1.4) Training and evaluating the models**

Model performance was assessed using the R² score and the Root-Mean-Square Error (RMSE, the square root of the Mean Squared Error), as these metrics are appropriate for regression tasks involving continuous targets such as lipophilicity values.

The R² score measures how much of the variance in the experimental data is explained by the model, providing an overall indication of how well predictions align with actual values. The MSE captures the average squared difference between predicted and true values, penalizing larger errors more strongly; its square root, the RMSE, is expressed in the same units as the target.
Metrics like accuracy were not used because they apply to classification problems with discrete categories. In regression, where predictions are continuous, accuracy does not meaningfully describe how close predicted values are to the true ones.

Together, R² and RMSE give a balanced view of model performance, combining interpretability with sensitivity to prediction errors.
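
To make the definitions concrete, the sketch below computes MSE, RMSE and R² directly from their formulas on a few toy values. The function name `regression_metrics` and the numbers are hypothetical; in the notebook itself these quantities come from scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R2) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(residuals ** 2)                    # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Toy values (hypothetical, for illustration only)
y_true = [1.0, 3.0, 5.0, 7.0]
y_pred = [2.0, 3.0, 5.0, 9.0]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"{mse:.3f} {rmse:.3f} {r2:.3f}")  # 1.250 1.118 0.750
```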

## Random Forest

In [26]:
# Random Forest Regressor with Morgan FP
print("Training RF Regressor with Morgan FP")
results_lipo = []

# Initialize the Random Forest model
rf_morgan = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Train the model (Fitting the model to the training features (X) and target (Y))
# Y_tr is the log_P value (target)
rf_morgan.fit(X_morgan_tr, Y_tr)

# Predict on the temporary test set
Y_pred_morgan = rf_morgan.predict(X_morgan_te)

# Evaluate performance (Metrics)
rmse_morgan = np.sqrt(mean_squared_error(Y_te, Y_pred_morgan))
r2_morgan = r2_score(Y_te, Y_pred_morgan)

# We save the results
results_lipo.append({
    'Model': 'Random Forest',
    'Featurizer': 'Morgan Fingerprints',
    'R2 Score': r2_morgan,
    'RMSE': rmse_morgan})

print(f"Morgan FP Results: R² = {r2_morgan:.4f}, RMSE = {rmse_morgan:.4f}")

Training RF Regressor with Morgan FP
Morgan FP Results: R² = 0.5387, RMSE = 0.6863


In [27]:
# Random Forest Regressor with RDKit Descriptors
print("Training RF Regressor with RDKit Descriptors")

# Initialize the model
rf_desc = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Train the model
# X_desc_tr are the RDKit Descriptors (features)
rf_desc.fit(X_desc_tr, Y_tr)

# Predict on the temporary test set
Y_pred_desc = rf_desc.predict(X_desc_te)

# Evaluate performance
rmse_desc = np.sqrt(mean_squared_error(Y_te, Y_pred_desc))
r2_desc = r2_score(Y_te, Y_pred_desc)

# We save the results
results_lipo.append({
    'Model': 'Random Forest',
    'Featurizer': 'RDKit Descriptors',
    'R2 Score': r2_desc,
    'RMSE': rmse_desc})

print(f"RDKit Descriptor Results: R² = {r2_desc:.4f}, RMSE = {rmse_desc:.4f}")

Training RF Regressor with RDKit Descriptors
RDKit Descriptor Results: R² = 0.6553, RMSE = 0.5933


## Multi-Layer Perceptron (MLP)

In [28]:
# MLP Regressor with Morgan FP
print("Training MLP Regressor with Morgan FP")

# Feature Scaling for Neural Networks.
# We fit the scaler ONLY on the training data and then apply it to both train and test sets.
scaler_morgan = StandardScaler()
X_morgan_tr_scaled = scaler_morgan.fit_transform(X_morgan_tr)
X_morgan_te_scaled = scaler_morgan.transform(X_morgan_te)

# Initialize the MLP model
# Architecture: two hidden layers with 100 neurons each (100, 100)
# max_iter is increased to allow for proper convergence
mlp_morgan = MLPRegressor(
    hidden_layer_sizes=(100, 100),
    max_iter=500,
    random_state=42,
    early_stopping=True,            # Stops training early if validation score doesn't improve
    n_iter_no_change=20             # Number of epochs with no improvement before stopping
)

# Train the model
mlp_morgan.fit(X_morgan_tr_scaled, Y_tr)

# Predict on the temporary test set
Y_pred_mlp_morgan = mlp_morgan.predict(X_morgan_te_scaled)

# Evaluate performance (Regression Metrics)
rmse_mlp_morgan = np.sqrt(mean_squared_error(Y_te, Y_pred_mlp_morgan))
r2_mlp_morgan = r2_score(Y_te, Y_pred_mlp_morgan)

# We save the results
results_lipo.append({
    'Model': 'MLP Neural Network',
    'Featurizer': 'Morgan Fingerprints',
    'R2 Score': r2_mlp_morgan,
    'RMSE': rmse_mlp_morgan})

print(f"MLP (Morgan FP) Results: R² = {r2_mlp_morgan:.4f}, RMSE = {rmse_mlp_morgan:.4f}")

Training MLP Regressor with Morgan FP
MLP (Morgan FP) Results: R² = 0.5190, RMSE = 0.7008
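
The scaling step above fits the scaler on the training data only and reuses its statistics for the test data. A minimal from-scratch sketch of that idea follows; the function names and toy values are hypothetical, and the notebook itself relies on scikit-learn's `StandardScaler`. Constant columns, such as fingerprint bits that are never set in the training set, are left unscaled to avoid division by zero.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-column mean and standard deviation on the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0      # constant columns: divide by 1 instead of 0
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any data with statistics learned from the training data."""
    return (X - mean) / std

# Toy feature matrices (hypothetical values); the second column is constant
X_train = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
X_test = np.array([[2.0, 3.0]])

mean, std = fit_standardizer(X_train)
X_train_scaled = apply_standardizer(X_train, mean, std)
X_test_scaled = apply_standardizer(X_test, mean, std)
print(X_test_scaled)
```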


In [29]:
# MLP Regressor with RDKit Descriptors
print("Training MLP Regressor with RDKit Descriptors")

# Feature Scaling for Descriptors
scaler_desc = StandardScaler()
X_desc_tr_scaled = scaler_desc.fit_transform(X_desc_tr)
X_desc_te_scaled = scaler_desc.transform(X_desc_te)

# Initialize the MLP model (using the same architecture as for the Morgan FP model)
mlp_desc = MLPRegressor(
    hidden_layer_sizes=(100, 100),
    max_iter=500,
    random_state=42,
    early_stopping=True,
    n_iter_no_change=20)


# Train the model
mlp_desc.fit(X_desc_tr_scaled, Y_tr)

# Predict on the temporary test set
Y_pred_mlp_desc = mlp_desc.predict(X_desc_te_scaled)

# Evaluate performance
rmse_mlp_desc = np.sqrt(mean_squared_error(Y_te, Y_pred_mlp_desc))
r2_mlp_desc = r2_score(Y_te, Y_pred_mlp_desc)

# We save the results
results_lipo.append({
    'Model': 'MLP Neural Network',
    'Featurizer': 'RDKit Descriptors',
    'R2 Score': r2_mlp_desc,
    'RMSE': rmse_mlp_desc})

print(f"MLP (RDKit Descriptors) Results: R² = {r2_mlp_desc:.4f}, RMSE = {rmse_mlp_desc:.4f}")

Training MLP Regressor with RDKit Descriptors
MLP (RDKit Descriptors) Results: R² = 0.6631, RMSE = 0.5865


To conclude the training and to choose the best model, we display a table with the evaluation results of the 4 different combinations.

In [30]:
# Convert the results list into a readable Pandas DataFrame to display
df_results_lipo = pd.DataFrame(results_lipo)

# Sort the table by R2 Score (or RMSE) to easily see the best performing model
df_results_lipo = df_results_lipo.sort_values(by='R2 Score', ascending=False)

print("Lipophilicity Regression Results")
df_results_lipo

Lipophilicity Regression Results


                Model           Featurizer  R2 Score      RMSE
3  MLP Neural Network    RDKit Descriptors  0.663141  0.586501
1       Random Forest    RDKit Descriptors  0.655312  0.593278
0       Random Forest  Morgan Fingerprints  0.538742  0.686304
2  MLP Neural Network  Morgan Fingerprints  0.519045  0.700805


**Overall performance**

The best-performing configuration was the MLP Neural Network using RDKit Descriptors, achieving an R² of 0.663 and an RMSE of 0.587.
This indicates that the model explains approximately 66% of the variance in the experimental lipophilicity values.
Note that `dc.molnet.load_lipo` normalizes the target by default (the `log_P` column holds z-scores, not raw measurements), so the RMSE of about 0.59 is expressed in standard deviations of the experimental logD, not in logD units.
The RMSE therefore has to be rescaled to the original units before it can be compared with published benchmarks for the Lipophilicity dataset.
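
Because `load_lipo` applies a normalization transformer to the target by default, error metrics computed on these values are in normalized units. The sketch below shows, on toy numbers, how an RMSE in normalized space relates to one in the original units. All values are hypothetical and not taken from the dataset. With the notebook's objects, DeepChem's `dc.trans.undo_transforms(y, transformers_lipo)` should serve the same purpose, though that call was not run here.

```python
import numpy as np

# Hypothetical raw targets in logD units (toy values, not the dataset)
y_raw = np.array([3.5, 1.0, 2.5, -0.5, 4.0])

# z-score normalization, as a normalization transformer would apply it
y_mean, y_std = y_raw.mean(), y_raw.std()
y_norm = (y_raw - y_mean) / y_std

# Toy predictions made in normalized space
y_pred_norm = y_norm + np.array([0.1, -0.2, 0.0, 0.3, -0.1])
rmse_norm = np.sqrt(np.mean((y_norm - y_pred_norm) ** 2))

# Undo the normalization, then compute the RMSE in the original units
y_pred_raw = y_pred_norm * y_std + y_mean
rmse_raw = np.sqrt(np.mean((y_raw - y_pred_raw) ** 2))

# The two values differ exactly by the standard deviation of the raw targets
print(np.isclose(rmse_raw, rmse_norm * y_std))  # True
```

R² is unaffected by this rescaling, because the same linear transformation is applied to both the true and the predicted values.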

**Effect of featurizers**

Models utilizing RDKit Descriptors consistently outperformed those using Morgan Fingerprints, regardless of the model type.
This suggests that physicochemical descriptors, which capture molecular properties directly correlated with lipophilicity (e.g., polar surface area, molecular weight, and hydrophobicity), provide a more informative representation for regression than purely structural fingerprints.




**Effect of model type**

The MLP Neural Network achieved a slightly higher R² and lower RMSE than the Random Forest only when using RDKit Descriptors (R² of 0.663 vs 0.655); with Morgan Fingerprints, the Random Forest performed better (R² of 0.539 vs 0.519).
The differences between model types are small compared with the differences between featurizers, and they come from a single random split, so they should not be over-interpreted.
Random Forests performed competitively, which may reflect that the dataset is not large or complex enough to fully exploit the representational capacity of a neural network.