<a href="https://colab.research.google.com/github/bbarthougatica/ChmInf/blob/Classification/cheminf_latin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cheminformatics & drug design - Project**

We chose two datasets from *MoleculeNet*.
We aim to study toxicity and lipophilicity, using a classification task for the former and a regression task for the latter.
Both datasets are split into training, validation and test subsets in an 80/10/10 ratio, and random splitting is recommended for both.
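
As a minimal sketch of what a random 80/10/10 split does (the helper name `random_split_indices` is hypothetical and is not part of DeepChem), the indices can be shuffled once and cut into three parts:

```python
import numpy as np

def random_split_indices(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into train/valid/test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    train_idx = shuffled[:n_train]
    valid_idx = shuffled[n_train:n_train + n_valid]
    test_idx = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train_idx, valid_idx, test_idx

# 1427 is the number of SIDER compounds
train_idx, valid_idx, test_idx = random_split_indices(1427)
print(len(train_idx), len(valid_idx), len(test_idx))  # 1141 142 144
```

The seed is an arbitrary choice made here for reproducibility; a different seed gives a different but equally valid random split.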
# 1st Data Set: **Lipophilicity**

*   Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. The lipophilicity dataset, curated from the ChEMBL database, provides experimental values of the octanol/water distribution coefficient (logD at pH 7.4) for 4200 compounds. This property influences how a drug is absorbed, distributed, metabolised, and excreted in the body (ADME properties).
*   Task type: Regression
*   Nº Tasks: 1
*   Recommended regression metric: Root-Mean-Square Error
*   Nº Compounds: 4200
*   Prediction target: Lipophilicity
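
For reference, the recommended metric can be sketched in a few lines. The helper `rmse` and the numbers below are illustrative assumptions, not values from the dataset:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up logD values, for illustration only
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(rmse(y_true, y_pred))  # square root of 4/3, about 1.155
```

Because the errors are squared before averaging, a single large miss (here the third compound) dominates the score.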


# 2nd Data Set: **SIDER**

* Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.
*   Task type: Classification
*   Nº Tasks: 27
*   Recommended classification metric: Area Under Curve of Receiver Operating Characteristics
*   Nº Compounds: 1427
*   Prediction target: Adverse Drug Reaction
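
With 27 tasks, one common way to report ROC-AUC is to compute it per task and average. The sketch below assumes that convention; the helper `mean_roc_auc` and the toy values are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(y_true, y_score):
    """Average the per-task ROC-AUC over the columns of a multi-label problem."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = []
    for task in range(y_true.shape[1]):
        # ROC-AUC is undefined when only one class is present, so skip such tasks
        if len(np.unique(y_true[:, task])) < 2:
            continue
        aucs.append(roc_auc_score(y_true[:, task], y_score[:, task]))
    return float(np.mean(aucs))

# Toy example with 4 molecules and 2 tasks (values are made up)
y_true = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y_score = np.array([[0.1, 0.9], [0.4, 0.8], [0.35, 0.7], [0.8, 0.2]])
print(mean_roc_auc(y_true, y_score))  # 0.75
```

In each toy task, three of the four positive/negative pairs are ranked correctly, which gives 0.75 per task and 0.75 on average.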


In [1]:
# Install all libraries
!pip install numpy scipy matplotlib scikit-learn pandas rdkit xgboost deepchem mordred pycm

import pandas as pd
import deepchem as dc
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole







Now we load our selected datasets.

## 2. Classification algorithms

# **2.1) Get Databases**
Here we load "SIDER" for the classification exercise. By default, the featurizer is ECFP (extended-connectivity fingerprints), also known as circular or Morgan fingerprints.

In [2]:
# Circular Fingerprints Featurizer (Morgan FP with size=2048, radius=8)
featurizer = dc.feat.CircularFingerprint(size=2048, radius=8)

# Load SIDER dataset, applying the custom featurizer during loading
tasks, datasets, transformers = dc.molnet.load_sider(featurizer=featurizer)

# Unpack the datasets
train_dataset, valid_dataset, test_dataset = datasets

print("SIDER dataset loaded successfully.")
print("Dataset:", "SIDER")
print("Number of tasks (side effect categories):", len(tasks))
print("Example task names:", tasks[:5])
print(f"Number of classification tasks: {len(tasks)}")
print(f"Molecules in Training set: {len(train_dataset)}")



SIDER dataset loaded successfully.
Dataset: SIDER
Number of tasks (side effect categories): 27
Example task names: ['Hepatobiliary disorders', 'Metabolism and nutrition disorders', 'Product issues', 'Eye disorders', 'Investigations']
Number of classification tasks: 27
Molecules in Training set: 1141


We can list all 27 side-effect tasks in this dataset.

In [3]:
tasks

['Hepatobiliary disorders',
 'Metabolism and nutrition disorders',
 'Product issues',
 'Eye disorders',
 'Investigations',
 'Musculoskeletal and connective tissue disorders',
 'Gastrointestinal disorders',
 'Social circumstances',
 'Immune system disorders',
 'Reproductive system and breast disorders',
 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)',
 'General disorders and administration site conditions',
 'Endocrine disorders',
 'Surgical and medical procedures',
 'Vascular disorders',
 'Blood and lymphatic system disorders',
 'Skin and subcutaneous tissue disorders',
 'Congenital, familial and genetic disorders',
 'Infections and infestations',
 'Respiratory, thoracic and mediastinal disorders',
 'Psychiatric disorders',
 'Renal and urinary disorders',
 'Pregnancy, puerperium and perinatal conditions',
 'Ear and labyrinth disorders',
 'Cardiac disorders',
 'Nervous system disorders',
 'Injury, poisoning and procedural complications']

In [4]:

# Convert to DataFrame
X_train = train_dataset.X      # 2048-bit molecular fingerprints (size set in the featurizer)
y_train = train_dataset.y      # Side effect binary labels
ids_train = train_dataset.ids  # SMILES

df = pd.DataFrame(y_train, columns=tasks)
df.insert(0, "SMILES", ids_train)

df.head()

Unnamed: 0,SMILES,Hepatobiliary disorders,Metabolism and nutrition disorders,Product issues,Eye disorders,Investigations,Musculoskeletal and connective tissue disorders,Gastrointestinal disorders,Social circumstances,Immune system disorders,...,"Congenital, familial and genetic disorders",Infections and infestations,"Respiratory, thoracic and mediastinal disorders",Psychiatric disorders,Renal and urinary disorders,"Pregnancy, puerperium and perinatal conditions",Ear and labyrinth disorders,Cardiac disorders,Nervous system disorders,"Injury, poisoning and procedural complications"
0,C(CNCCNCCNCCN)N,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0
1,Cl[Tl],0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,C[N+](C)(C)CC(CC(=O)O)O,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
3,C(CC(=O)O)CN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,C(CC(=O)O)C(=O)CN,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,...,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0


In this project, our goal is to develop a machine learning model that predicts physiological side effects of drug molecules based on their chemical structure. Each molecule is represented by molecular descriptors derived from its SMILES notation, and the target output consists of 27 binary labels corresponding to physiological side effect categories. This represents a multi-label classification problem, where each compound can be associated with multiple side effects simultaneously.

In drug discovery, predicting adverse side effects before clinical testing can:

*   Reduce the risk of toxicity in early drug candidates
*   Help prioritize safer compounds
*   Save significant time and cost in development

So, our proposed model essentially acts as a computational toxicity screener, a decision-support tool for medicinal chemists.

From the previous model we know that the dataset contains no invalid SMILES, so we don't have to clean the data.

# **2.2) Featurizing**

We'll try three different molecular featurization strategies.

In the first case, we used circular fingerprints, which are binary.

For the second featurizer we chose **RDKitDescriptors**. As said before, these descriptors complement the fingerprints nicely: they add global molecular information through physicochemical descriptors (molecular weight, logP, H-bond donors/acceptors, etc.), they are easier to interpret and lower-dimensional than bit-based fingerprints, and they are often better suited to property prediction or toxicity-related endpoints.

As a third featurizer, we chose **MACCSKeysFingerprint**.

### RDKit Descriptors
This featurizer will allow us to compare against the previous results from the Random Forest classification model, which also used this featurizer.

In [9]:
# Featurizing: RDKit Descriptors for SIDER
import numpy as np  # needed below for the NaN handling

rdkit_desc_featurizer = dc.feat.RDKitDescriptors()

# RDKit Descriptors using the SMILES strings
X_desc_sider_train = rdkit_desc_featurizer.featurize(ids_train.tolist())

# Handle NaN values: replace with 0
X_desc_sider_train[np.isnan(X_desc_sider_train)] = 0

# Get the descriptor names
descriptor_names = rdkit_desc_featurizer.descriptors

# Create the final DataFrame for this feature set
df_desc_sider_train = pd.DataFrame(X_desc_sider_train, columns=descriptor_names)



### MACCSKeysFingerprint
Since we already use circular fingerprints, we want to counterbalance them with a simpler, more direct featurizer. That is why we chose MACCS keys as a low-complexity representation (166 structural keys; the featurizer returns 167 bits because bit 0 is unused). If the model works well with them, we can conclude that its predictions rest on the fundamental chemical or structural patterns that MACCS keys encode.

In [7]:
import numpy as np

# MACCS Keys Fingerprints
print("Featurizing: MACCS Keys Fingerprints for SIDER")

# Initialize the MACCS Keys Featurizer.
maccs_featurizer = dc.feat.MACCSKeysFingerprint()

# Featurize the SMILES
X_maccs_sider_train = maccs_featurizer.featurize(ids_train.tolist())


# Create a DataFrame for the MACCS Keys (one column per bit, taken from the array width)
maccs_size = X_maccs_sider_train.shape[1]
df_maccs_sider_train = pd.DataFrame(X_maccs_sider_train, columns=[f'maccs_{i}' for i in range(maccs_size)])

print(f"MACCS Keys Fingerprints dimensions for SIDER: {df_maccs_sider_train.shape}")

Featurizing: MACCS Keys Fingerprints for SIDER




MACCS Keys Fingerprints dimensions for SIDER: (1141, 167)


# **2.3) Models**

First we split the data into training and testing groups. The chosen models for this project are a **Random Forest classifier** as a classic ML model and a **Multi-Layer Perceptron** as a neural network-based model, as was used in the regression algorithms.

In [None]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=0,
    n_jobs=-1 # allows for parallel computation
)
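
scikit-learn's `RandomForestClassifier` accepts a 2D label matrix directly, which suits the multi-label setup here. The sketch below uses synthetic data and hypothetical names (`X_demo`, `Y_demo`, `rf_demo`), not the SIDER arrays, to show how per-task probabilities could be collected for ROC-AUC scoring:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 64))  # stand-in for binary fingerprints
Y_demo = X_demo[:, :3].copy()                # three synthetic labels, each tied to one bit

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo[:150], Y_demo[:150])

# For multi-output targets, predict_proba returns one (n_samples, n_classes) array per task
proba_per_task = rf_demo.predict_proba(X_demo[150:])

# Keep only the probability of the positive class for each task
Y_score = np.column_stack([p[:, 1] for p in proba_per_task])
print(Y_score.shape)  # (50, 3)
```

Indexing `p[:, 1]` assumes that both classes occur in the training labels of every task; a task with a single class would yield a one-column array and need separate handling.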

In [None]:
# MLP Classifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
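
One way these three imports could be combined is a pipeline that scales the features before the MLP and scores the predictions with macro-averaged ROC-AUC. This is a sketch on synthetic data; the names `X_demo`, `Y_demo`, `mlp_demo` and the hyperparameters are assumptions, not the settings used in this project:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
# Two synthetic binary labels derived from the features
Y_demo = np.column_stack([X_demo[:, 0] > 0, X_demo[:, 1] + X_demo[:, 2] > 0]).astype(int)

# The scaler is fitted on the training rows only, then reused at prediction time
mlp_demo = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
mlp_demo.fit(X_demo[:200], Y_demo[:200])

# With a 2D label matrix, predict_proba returns an array of shape (n_samples, n_labels)
Y_score = mlp_demo.predict_proba(X_demo[200:])
auc = roc_auc_score(Y_demo[200:], Y_score, average="macro")
print(Y_score.shape, round(auc, 3))
```

Scaling matters mostly for continuous inputs such as RDKit descriptors; binary fingerprints are already on a common scale.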


We split the Morgan fingerprint data, using 80% for training and 20% for testing (internal evaluation), as recommended on moleculenet.org.
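
A minimal sketch of such an 80/20 split, assuming scikit-learn's `train_test_split` and placeholder arrays (`X_fp`, `Y_lab`) in place of the real fingerprints and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_fp = rng.integers(0, 2, size=(100, 2048))  # placeholder for Morgan fingerprints
Y_lab = rng.integers(0, 2, size=(100, 27))   # placeholder for the 27 binary labels

# test_size=0.2 keeps 80% of the rows for training; random_state fixes the shuffle
X_tr, X_te, Y_tr, Y_te = train_test_split(X_fp, Y_lab, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (80, 2048) (20, 2048)
```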

# **2.4) Training and evaluating the models**
