
# **Cheminformatics & drug design - Project**



We chose two datasets from *MoleculeNet*.
We aim to study toxicity and lipophilicity, using one dataset for a classification task and the other for a regression task.
Both datasets are split into training, validation and test subsets in an 80/10/10 ratio, and random splitting is the recommended method for both.

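
A minimal sketch of what a random 80/10/10 split amounts to, assuming plain index shuffling. The helper `random_split_indices` is hypothetical, written for illustration only, and is not the DeepChem splitter.

```python
import random

def random_split_indices(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/valid/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(frac_train * n_samples)
    n_valid = round(frac_valid * n_samples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test

# 4200 is the number of compounds in the Lipophilicity dataset
train_idx, valid_idx, test_idx = random_split_indices(4200)
print(len(train_idx), len(valid_idx), len(test_idx))  # 3360 420 420
```

The three index lists are disjoint and together cover every compound exactly once.
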
# 1st Data Set: **Lipophilicity**

*   Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. The lipophilicity dataset, curated from the ChEMBL database, provides experimental measurements of the octanol/water distribution coefficient (logD at pH 7.4) for 4200 compounds. This property influences how a drug is absorbed, distributed, metabolised, and excreted in the body (ADME properties).
*   Task type: Regression
*   Nº Tasks: 1
*   Recommended regression metric: Root-Mean-Square Error
*   Nº Compounds: 4200
*   Prediction target: Lipophilicity
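
As a small worked example of the recommended metric, the root-mean-square error is the square root of the mean squared difference between measured and predicted values. The numbers below are hypothetical and only illustrate the arithmetic.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical logD values, for illustration only
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.0, 2.0, 3.0, 2.0]
print(rmse(y_true_demo, y_pred_demo))  # 1.0
```

A single error of 2 log units among four compounds gives a mean squared error of 1, so the RMSE is 1.0. Large individual errors weigh more heavily than in a mean absolute error.
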


# 2nd Data Set: **SIDER**

*   Database of marketed drugs and their adverse drug reactions (ADRs), grouped into 27 system organ classes.
*   Task type: Classification
*   Nº Tasks: 27
*   Recommended classification metric: Area Under Curve of Receiver Operating Characteristics
*   Nº Compounds: 1427
*   Prediction target: Adverse drug reactions
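
With 27 tasks, one ROC-AUC value is obtained per task. The sketch below assumes the per-task values are then averaged into a single score. It computes ROC-AUC from its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative). The function names and toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # AUC is undefined when only one class is present
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def mean_roc_auc(label_columns, score_columns):
    """Average the per-task AUCs, skipping tasks where the AUC is undefined."""
    aucs = [roc_auc(y, s) for y, s in zip(label_columns, score_columns)]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs)

# Two hypothetical tasks (SIDER has 27), four compounds each
labels_demo = [[1, 0, 1, 0], [0, 0, 1, 1]]
scores_demo = [[0.9, 0.1, 0.8, 0.3], [0.2, 0.6, 0.7, 0.4]]
print(mean_roc_auc(labels_demo, scores_demo))  # 0.875
```

The first task is ranked perfectly (AUC 1.0) and the second gets three of four pairs right (AUC 0.75), giving a mean of 0.875.
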



*You should perform appropriate data cleaning and preprocessing. This includes evaluating at least two different molecular featurization strategies, such as circular fingerprints (e.g., Morgan) and descriptor-based or graph-based representations.
Document any preprocessing decisions or challenges.*


**The path to an ML model.**

*   Define the task
*   Prepare data & split data
*   Choose the model
*   Train the model
*   Evaluate the model
*   Use the model
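
The six steps above can be sketched end to end on synthetic data. This is only an illustration: the data are random numbers, and ordinary least squares stands in as a placeholder for the models used later in this notebook.

```python
import numpy as np

# 1. Define the task: regress a continuous target from numeric features (synthetic stand-in data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_weights = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_weights + rng.normal(scale=0.1, size=200)

# 2. Prepare and split the data (80% train, 20% test)
order = rng.permutation(len(X_demo))
train_rows, test_rows = order[:160], order[160:]

# 3. Choose the model: ordinary least squares as a simple placeholder
# 4. Train the model
weights, *_ = np.linalg.lstsq(X_demo[train_rows], y_demo[train_rows], rcond=None)

# 5. Evaluate the model on held-out data
test_pred = X_demo[test_rows] @ weights
test_rmse = float(np.sqrt(np.mean((y_demo[test_rows] - test_pred) ** 2)))
print(f"Test RMSE: {test_rmse:.3f}")

# 6. Use the model on a new sample
new_sample = np.array([[0.2, -0.1, 1.0]])
print("Prediction:", float((new_sample @ weights)[0]))
```
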




In [11]:
# Install all libraries
!pip install numpy scipy matplotlib scikit-learn pandas rdkit xgboost deepchem mordred pycm

import pandas as pd
import numpy as np
import deepchem as dc
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole




Now we load our selected datasets.

# 1. Starting with the regression algorithms

# **1.1) Get datasets**
Here we load the "Lipophilicity" dataset for the regression exercise.

In [2]:
# Get tasks from deepchem's functions

# For Regression
tasks_lipo, datasets_lipo, transformers_lipo = dc.molnet.load_lipo()


print(f"Lipo dataset loaded successfully. Sub-datasets: {len(datasets_lipo)}")



Lipo dataset loaded successfully. Sub-datasets: 3


In [3]:

print("Dataset:", "Lipophilicity")
print("Number of tasks (regression targets):", len(tasks_lipo))
print("Example task names:", tasks_lipo[:5])

Dataset: Lipophilicity
Number of tasks (regression targets): 1
Example task names: ['exp']


The dataset object has no separate SMILES column. The SMILES strings are stored in `.ids`, so we use them to build a pandas DataFrame.

In [6]:
# Auxiliary function to transform the DeepChem Dataset object into a DataFrame
def dataset_to_lipo_df(dataset, split_name):
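    # .ids holds the SMILES strings
    smiles = dataset.ids
    # .y holds the target: experimental logD at pH 7.4 (the column keeps the name log_P
    # used throughout this notebook). The values printed below appear to be normalized
    # by the loader's default transformer rather than raw logD.
    log_p = dataset.y.flatten()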

    return pd.DataFrame({
        'smiles': smiles,
        'log_P': log_p,
        'split': split_name
    })

# Unpack the three splits returned by the loader
train_dataset, valid_dataset, test_dataset = datasets_lipo

# DataFrame for the training set
df_lipo_train = dataset_to_lipo_df(train_dataset, 'train')

# Checking
print("\n--- Train DataSet (SMILES and log_P) ---")
print(f"Number of molecules: {len(df_lipo_train)}")
print("DataFrame Head:")
print(df_lipo_train.head())


--- Train DataSet (SMILES and log_P) ---
Number of molecules: 3360
DataFrame Head:
                                              smiles     log_P  split
0                  CC(C)NCC(O)COc1ccc(COCCOC(C)C)cc1 -1.703482  train
1                                Cc1ccc(NC(=N)N)cc1C -2.892589  train
2                   CC(C)C(=O)NCCNCC(O)COc1ccc(O)cc1 -1.909924  train
3  C[C@@](O)(C(=O)Nc1ccc(cc1Cl)S(=O)(=O)NCC=C)C(F...  0.749051  train
4                         CC(C)NCC(O)COc1ccccc1OCC=C -1.620905  train


# **1.2) Featurizing**

We'll try two molecular featurization strategies: Morgan (circular) fingerprints and RDKit descriptors.

The idea is to compare how well lipophilicity is predicted from a structure-based representation versus a descriptor-based (physicochemical) one.

In [7]:
# Separate the SMILES strings (X) and the target variable (Y)
X_smiles_train = df_lipo_train['smiles']
Y_train = df_lipo_train['log_P']

print("SMILES and target variable successfully extracted.")

SMILES and target variable successfully extracted.


### Morgan Fingerprints Featurizer
We chose this featurizer because it encodes each molecule as a set of circular substructure fragments, which makes structural similarity between molecules cheap to compute. It is also a standard representation for rapid comparison in virtual screening tasks.
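
To illustrate why fingerprints make similarity cheap to compute, here is a minimal sketch of the Tanimoto (Jaccard) coefficient, a common similarity measure for bit fingerprints. It assumes each fingerprint is given as the set of its on-bit indices. The indices below are made up and are not computed from real molecules.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

# Hypothetical on-bit indices, for illustration only
fp_a = {1, 5, 42, 300}
fp_b = {1, 5, 42, 777, 1024}
print(tanimoto(fp_a, fp_b))  # 0.5
```

The two fingerprints share 3 bits out of 6 distinct on-bits, so the similarity is 0.5.
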

In [9]:
# Initialize the Morgan Featurizer (radius=2, size=2048)
morgan_featurizer = dc.feat.CircularFingerprint(size=2048, radius=2)

# featurize the list of SMILES strings
# each row is a molecule's fingerprint
X_morgan_train = morgan_featurizer.featurize(X_smiles_train.tolist())

# convert the NumPy array to a Pandas DataFrame for easier handling
df_morgan_train = pd.DataFrame(X_morgan_train, columns=[f'fp_{i}' for i in range(2048)])

print(f"Morgan FP dimensions for training: {df_morgan_train.shape}")
print("Morgan Fingerprints generated and ready.")



Morgan FP dimensions for training: (3360, 2048)
Morgan Fingerprints generated and ready.


### RDKit Descriptors Featurizer

We selected this second featurizer because it describes each molecule through computed physicochemical properties, in contrast to the structure-based fingerprints above. Since our task is to predict lipophilicity (logD), correlated physicochemical descriptors, such as the calculated logP, are likely to be very relevant to the model.

In [12]:
# Initialize the RDKit Descriptors Featurizer
desc_featurizer = dc.feat.RDKitDescriptors()

# Featurize the list of SMILES strings
# This may take a bit longer and can potentially produce NaN values
X_desc_train = desc_featurizer.featurize(X_smiles_train.tolist())

# Handle NaN values: RDKit can sometimes fail to compute descriptors.
# We replace NaN with 0 for simplicity, or we could use imputation (e.g., mean/median).
# Using 0 is a common practice when NaNs are sparse.
X_desc_train[np.isnan(X_desc_train)] = 0

# Convert the NumPy array to a Pandas DataFrame
df_desc_train = pd.DataFrame(X_desc_train, columns=desc_featurizer.descriptors)

print(f"RDKit Descriptors dimensions for training: {df_desc_train.shape}")
print("RDKit Descriptors generated and ready.")

RDKit Descriptors dimensions for training: (3360, 217)
RDKit Descriptors generated and ready.
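
The cell above replaces NaN values with 0 and mentions imputation as an alternative. A minimal sketch of that alternative follows, assuming per-column medians computed on the training rows only and treating infinite values as missing. The helper `impute_with_train_median` and the toy arrays are hypothetical.

```python
import numpy as np

def impute_with_train_median(X_train, X_other):
    """Replace NaN/inf in both arrays with per-column medians computed on X_train only."""
    X_train = np.array(X_train, dtype=float)
    X_other = np.array(X_other, dtype=float)
    # Treat infinities like missing values
    X_train[~np.isfinite(X_train)] = np.nan
    X_other[~np.isfinite(X_other)] = np.nan
    medians = np.nanmedian(X_train, axis=0)
    # Columns that are entirely NaN have no median; fall back to 0 for them
    medians = np.where(np.isnan(medians), 0.0, medians)
    X_train = np.where(np.isnan(X_train), medians, X_train)
    X_other = np.where(np.isnan(X_other), medians, X_other)
    return X_train, X_other

X_tr_demo = np.array([[1.0, np.nan], [3.0, 10.0], [np.inf, 20.0]])
X_te_demo = np.array([[np.nan, 5.0]])
X_tr_filled, X_te_filled = impute_with_train_median(X_tr_demo, X_te_demo)
print(X_tr_filled)
print(X_te_filled)
```

Taking the medians from the training rows alone keeps information from held-out data out of the preprocessing step.
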


# **1.3) Models**

First we split the featurized training data into a training subset and a hold-out subset for internal evaluation.

In [None]:
from sklearn.model_selection import train_test_split

# Split the Morgan FP data (80% train, 20% hold-out for internal evaluation)
# Y_train is the target
X_morgan_tr, X_morgan_te, Y_tr, Y_te = train_test_split(
    df_morgan_train, Y_train, test_size=0.2, random_state=42)

# Split the RDKit Descriptor data. We reuse Y_tr and Y_te since the split is the same.
X_desc_tr, X_desc_te, _, _ = train_test_split(
    df_desc_train, Y_train, test_size=0.2, random_state=42)

print("Data splits for modeling complete.")
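
The cell above relies on two separate calls producing the same row partition because they share `random_state` and the number of rows. A minimal sketch that makes this alignment explicit is shown below: one shuffled index is applied to every array. The helper `split_aligned` and the toy arrays are hypothetical.

```python
import numpy as np

def split_aligned(arrays, test_fraction=0.2, seed=42):
    """Split several row-aligned arrays with one shared shuffled index."""
    n_rows = len(arrays[0])
    if any(len(a) != n_rows for a in arrays):
        raise ValueError("all arrays must have the same number of rows")
    order = np.random.default_rng(seed).permutation(n_rows)
    n_test = int(round(test_fraction * n_rows))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return [(np.asarray(a)[train_idx], np.asarray(a)[test_idx]) for a in arrays]

# Toy stand-ins for the two feature tables and the target
features_a = np.arange(10).reshape(10, 1)        # e.g. fingerprint rows
features_b = np.arange(10).reshape(10, 1) * 100  # e.g. descriptor rows
target_demo = np.arange(10)

(a_tr, a_te), (b_tr, b_te), (t_tr, t_te) = split_aligned([features_a, features_b, target_demo])
print(a_te.ravel(), b_te.ravel(), t_te)
```

In scikit-learn the same effect can be obtained in one call, since `train_test_split` accepts several arrays at once and splits them consistently.
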

## Random Forest

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

Data splits for modeling complete.


## Multi-Layer Perceptron

In [None]:
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np

# **1.4) Training and evaluating the models**


## Random Forest

In [16]:
# Random Forest Regressor with Morgan FP
print("Training RF Regressor with Morgan FP")

# Initialize the Random Forest model
rf_morgan = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Train the model (Fitting the model to the training features (X) and target (Y))
# X_morgan_tr are the Morgan Fingerprints (features)
# Y_tr is the log_P value (target)
rf_morgan.fit(X_morgan_tr, Y_tr)

# Predict on the temporary test set
Y_pred_morgan = rf_morgan.predict(X_morgan_te)

# Evaluate performance (Metrics)
rmse_morgan = np.sqrt(mean_squared_error(Y_te, Y_pred_morgan))
r2_morgan = r2_score(Y_te, Y_pred_morgan)

print(f"Morgan FP Results: R² = {r2_morgan:.4f}, RMSE = {rmse_morgan:.4f}")

Training RF Regressor with Morgan FP
Morgan FP Results: R² = 0.5292, RMSE = 0.6704
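
For reference, the R² value printed above is the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with hypothetical numbers:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical values, for illustration only
r2_true_demo = [0.0, 2.0, 4.0, 6.0]
r2_pred_demo = [1.0, 4.0, 4.0, 6.0]
print(r_squared(r2_true_demo, r2_pred_demo))  # 0.75
```

Here the residual sum of squares is 5 and the total sum of squares is 20, so R² is 0.75. A model that always predicts the mean of the true values scores 0.
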


In [17]:
# Random Forest Regressor with RDKit Descriptors
print("Training RF Regressor with RDKit Descriptors")

# Initialize the model
rf_desc = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Train the model
# X_desc_tr are the RDKit Descriptors (features)
rf_desc.fit(X_desc_tr, Y_tr)

# Predict on the temporary test set
Y_pred_desc = rf_desc.predict(X_desc_te)

# Evaluate performance
rmse_desc = np.sqrt(mean_squared_error(Y_te, Y_pred_desc))
r2_desc = r2_score(Y_te, Y_pred_desc)

print(f"RDKit Descriptor Results: R² = {r2_desc:.4f}, RMSE = {rmse_desc:.4f}")

Training RF Regressor with RDKit Descriptors
RDKit Descriptor Results: R² = 0.6691, RMSE = 0.5620
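
A note on units: the target values shown in the DataFrame head look normalized (z-scored) rather than raw logD. Assuming that is the case, the RMSE values above are in units of the target's standard deviation. The sketch below, on hypothetical numbers, shows that an RMSE on z-scored values converts back to the original scale by multiplying by that standard deviation.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical raw logD values and predictions, for illustration only
raw_true = [0.5, 1.5, 2.5, 3.5]
raw_pred = [0.7, 1.2, 2.9, 3.1]

# z-score both with the mean and standard deviation of the raw targets
mu = statistics.fmean(raw_true)
sigma = statistics.pstdev(raw_true)
norm_true = [(v - mu) / sigma for v in raw_true]
norm_pred = [(v - mu) / sigma for v in raw_pred]

rmse_raw = rmse(raw_true, raw_pred)
rmse_norm = rmse(norm_true, norm_pred)

# The two differ by exactly the scaling factor sigma
print(rmse_raw, rmse_norm * sigma)
```

In the notebook itself, the transformer objects returned by the loader in `transformers_lipo` would be the place to look for the actual scaling. That has not been checked here.
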


## Multi-Layer Perceptron

In [None]:
# MLP Regressor with Morgan FP
print("Training MLP Regressor with Morgan FP")

# Feature Scaling for Neural Networks.
# We fit the scaler ONLY on the training data and then apply it to both train and test sets.
scaler_morgan = StandardScaler()
X_morgan_tr_scaled = scaler_morgan.fit_transform(X_morgan_tr)
X_morgan_te_scaled = scaler_morgan.transform(X_morgan_te)

# Initialize the MLP model
# Architecture: two hidden layers with 100 neurons each (100, 100)
# max_iter is increased to allow for proper convergence
mlp_morgan = MLPRegressor(
    hidden_layer_sizes=(100, 100),
    max_iter=500,
    random_state=42,
    early_stopping=True,            # Stops training early if validation score doesn't improve
    n_iter_no_change=20             # Number of epochs with no improvement before stopping
)

# Train the model
mlp_morgan.fit(X_morgan_tr_scaled, Y_tr)

# Predict on the temporary test set
Y_pred_mlp_morgan = mlp_morgan.predict(X_morgan_te_scaled)

# Evaluate performance (Regression Metrics)
rmse_mlp_morgan = np.sqrt(mean_squared_error(Y_te, Y_pred_mlp_morgan))
r2_mlp_morgan = r2_score(Y_te, Y_pred_mlp_morgan)

print(f"MLP (Morgan FP) Results: R² = {r2_mlp_morgan:.4f}, RMSE = {rmse_mlp_morgan:.4f}")
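
The scaling step in the cell above fits the scaler on the training rows only. A minimal NumPy sketch of the same idea follows, with hypothetical helper names. The mean and standard deviation come from the training data and are reused unchanged on the hold-out data. Constant columns get a scale of 1 to avoid division by zero.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-column mean and standard deviation computed on the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # constant columns: avoid division by zero
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale X with statistics that were computed elsewhere (on the training rows)."""
    return (X - mean) / std

X_train_demo = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
X_test_demo = np.array([[6.0, 1.0]])

col_mean, col_std = fit_standardizer(X_train_demo)
X_train_scaled_demo = apply_standardizer(X_train_demo, col_mean, col_std)
X_test_scaled_demo = apply_standardizer(X_test_demo, col_mean, col_std)
print(X_train_scaled_demo)
print(X_test_scaled_demo)
```

The constant-column case matters for fingerprints, where some bits may never be set in the training subset.
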

In [None]:
# MLP Regressor with RDKit Descriptors
print("Training MLP Regressor with RDKit Descriptors")

# Feature Scaling for Descriptors
scaler_desc = StandardScaler()
X_desc_tr_scaled = scaler_desc.fit_transform(X_desc_tr)
X_desc_te_scaled = scaler_desc.transform(X_desc_te)

# Initialize the MLP model (using the same architecture)
mlp_desc = MLPRegressor(
    hidden_layer_sizes=(100, 100),
    max_iter=500,
    random_state=42,
    early_stopping=True,
    n_iter_no_change=20
)

# Train the model
mlp_desc.fit(X_desc_tr_scaled, Y_tr)

# Predict on the temporary test set
Y_pred_mlp_desc = mlp_desc.predict(X_desc_te_scaled)

# Evaluate performance
rmse_mlp_desc = np.sqrt(mean_squared_error(Y_te, Y_pred_mlp_desc))
r2_mlp_desc = r2_score(Y_te, Y_pred_mlp_desc)

print(f"MLP (RDKit Descriptors) Results: R² = {r2_mlp_desc:.4f}, RMSE = {rmse_mlp_desc:.4f}")
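
Both MLP cells set `early_stopping=True` with `n_iter_no_change=20`. The sketch below is a simplified illustration of that stopping rule, not scikit-learn's implementation. Training stops once the validation score has failed to improve on the best score so far by more than a tolerance for a given number of consecutive epochs. The function name, the `tol` default, and the score sequence are assumptions made for the example.

```python
def early_stopping_epoch(validation_scores, patience=20, tol=1e-4):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores, start=1):
        if score > best_score + tol:
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        best_score = max(best_score, score)
        if epochs_without_improvement >= patience:
            return epoch
    return None

# Hypothetical validation scores per epoch (higher is better)
val_scores_demo = [0.10, 0.30, 0.45, 0.46, 0.46, 0.45, 0.46]
print(early_stopping_epoch(val_scores_demo, patience=3))  # 7
```

The score stops improving after epoch 4, so with a patience of 3 the rule triggers at epoch 7. With scikit-learn's `early_stopping=True`, the validation score is computed on a fraction of the training data that is set aside automatically.
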