<a href="https://colab.research.google.com/github/bbarthougatica/ChmInf/blob/Toxic_data/cheminf_latin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cheminformatics & drug design - Project**


RDKit is an open-source software toolkit for cheminformatics, designed to assist in the analysis and design of small molecules and chemical compounds. It provides a set of libraries and tools for the manipulation and analysis of molecular structures, molecular descriptors, molecular fingerprints, molecular similarity, molecular visualization, and more. The toolkit is widely used in academia, as well as in the pharmaceutical, biotech, and chemical industries for a variety of tasks such as virtual screening, lead optimization, and chemical database management.

In [1]:
# Install all libraries
!pip install numpy scipy matplotlib scikit-learn pandas rdkit xgboost deepchem mordred pycm

import pandas as pd
import deepchem as dc
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole





Now we load the selected dataset.

By default, the featurizer is ECFP (extended-connectivity fingerprints), also known as circular fingerprints or Morgan fingerprints.

In [51]:
# sparse=False returns dense bit vectors that can be used directly as a numeric feature matrix
featurizer = dc.feat.CircularFingerprint(size=2048, radius=8, sparse=False)
tasks, datasets, transformers = dc.molnet.load_sider(featurizer=featurizer)
train_dataset, valid_dataset, test_dataset = datasets

In [52]:
print("Dataset:", "SIDER")
print("Number of tasks (side effect categories):", len(tasks))
print("Example task names:", tasks[:5])

Dataset: SIDER
Number of tasks (side effect categories): 27
Example task names: ['Hepatobiliary disorders', 'Metabolism and nutrition disorders', 'Product issues', 'Eye disorders', 'Investigations']


In [53]:
print([type(d) for d in datasets])

[<class 'deepchem.data.datasets.DiskDataset'>, <class 'deepchem.data.datasets.DiskDataset'>, <class 'deepchem.data.datasets.DiskDataset'>]


`load_sider` returns the task names, a tuple of three datasets (training, validation and test) and the transformers applied to the data, which is why we unpack `datasets` before use. In our case, we keep all the data in a single dataset for now.

In [54]:
tasks

['Hepatobiliary disorders',
 'Metabolism and nutrition disorders',
 'Product issues',
 'Eye disorders',
 'Investigations',
 'Musculoskeletal and connective tissue disorders',
 'Gastrointestinal disorders',
 'Social circumstances',
 'Immune system disorders',
 'Reproductive system and breast disorders',
 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)',
 'General disorders and administration site conditions',
 'Endocrine disorders',
 'Surgical and medical procedures',
 'Vascular disorders',
 'Blood and lymphatic system disorders',
 'Skin and subcutaneous tissue disorders',
 'Congenital, familial and genetic disorders',
 'Infections and infestations',
 'Respiratory, thoracic and mediastinal disorders',
 'Psychiatric disorders',
 'Renal and urinary disorders',
 'Pregnancy, puerperium and perinatal conditions',
 'Ear and labyrinth disorders',
 'Cardiac disorders',
 'Nervous system disorders',
 'Injury, poisoning and procedural complications']

We are now working with a `DiskDataset`, so we convert it to a pandas DataFrame.

In [57]:
# Convert to DataFrame
X = train_dataset.X      # Circular fingerprint features (size=2048, as set in the featurizer)
y = train_dataset.y      # Binary side-effect labels, one column per task
ids = train_dataset.ids  # SMILES strings


In [58]:
df = pd.DataFrame(y, columns=tasks)
df.insert(0, "SMILES", ids)

df.head()

Unnamed: 0,SMILES,Hepatobiliary disorders,Metabolism and nutrition disorders,Product issues,Eye disorders,Investigations,Musculoskeletal and connective tissue disorders,Gastrointestinal disorders,Social circumstances,Immune system disorders,...,"Congenital, familial and genetic disorders",Infections and infestations,"Respiratory, thoracic and mediastinal disorders",Psychiatric disorders,Renal and urinary disorders,"Pregnancy, puerperium and perinatal conditions",Ear and labyrinth disorders,Cardiac disorders,Nervous system disorders,"Injury, poisoning and procedural complications"
0,C(CNCCNCCNCCN)N,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0
1,Cl[Tl],0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,C[N+](C)(C)CC(CC(=O)O)O,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
3,C(CC(=O)O)CN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,C(CC(=O)O)C(=O)CN,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,...,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0



We chose two datasets from the *Physiology Collection*.
We aim to study toxicity and adverse drug reactions, and to use both datasets for classification tasks.
Both datasets are split into training, validation and test subsets in an 80/10/10 ratio, and random splitting is recommended.

In this file, we look at the SIDER dataset.
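
A random 80/10/10 split can be sketched with scikit-learn as follows. This is only an illustration on row indices, not the split that DeepChem's loader performs internally; `random_split_80_10_10` is a hypothetical helper name, and 1427 is the number of SIDER compounds.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def random_split_80_10_10(n_samples, seed=0):
    """Return index arrays for a random 80/10/10 train/valid/test split."""
    indices = np.arange(n_samples)
    # First set aside 20% of the rows, then halve that part into valid and test.
    train_idx, rest_idx = train_test_split(indices, test_size=0.2, random_state=seed)
    valid_idx, test_idx = train_test_split(rest_idx, test_size=0.5, random_state=seed)
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = random_split_80_10_10(1427)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The MoleculeNet loaders in DeepChem also accept a `splitter` argument (for example `splitter="random"`), which is probably the more direct route inside this notebook.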


# Data Set: **SIDER**

*   Database of marketed drugs and adverse drug reactions (ADRs), grouped into 27 system organ classes.
*   Task type: Classification
*   Nº Tasks: 27
*   Recommended metric: area under the receiver operating characteristic curve (ROC-AUC)
*   Nº Compounds: 1427
*   Prediction target: Adverse drug reaction
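
With 27 tasks, ROC-AUC is usually computed per task and then averaged. The sketch below shows one way to do that with scikit-learn. The helper `mean_roc_auc` is hypothetical, and the labels and scores are random placeholders rather than SIDER data or model output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_roc_auc(y_true, y_score):
    """Average ROC-AUC over tasks, skipping tasks with only one class present."""
    aucs = []
    for task in range(y_true.shape[1]):
        if len(np.unique(y_true[:, task])) < 2:
            continue  # ROC-AUC is undefined when a task has a single class
        aucs.append(roc_auc_score(y_true[:, task], y_score[:, task]))
    return float(np.mean(aucs))


rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, 27))  # placeholder binary labels
y_score = rng.random(size=(200, 27))         # placeholder predicted scores
print(f"Mean ROC-AUC for random scores: {mean_roc_auc(y_true, y_score):.3f}")
```

Random scores should land near 0.5, which gives a baseline to compare a trained model against.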



*You should perform appropriate data cleaning and preprocessing. This includes evaluating at least two different molecular featurization strategies, such as circular fingerprints (e.g., Morgan) and descriptor-based or graph-based representations.
Document any preprocessing decisions or challenges.*
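
As a possible second featurization strategy next to circular fingerprints, a descriptor-based representation can be sketched with RDKit. The function name `descriptor_features` and the choice of five descriptors are assumptions made for illustration only. The first two SMILES are taken from the table above; the third is deliberately invalid to show how unparsable entries are dropped during cleaning.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors


def descriptor_features(smiles_list):
    """Compute a few RDKit descriptors per SMILES, dropping unparsable entries."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # RDKit returns None when a SMILES cannot be parsed
            continue
        rows.append({
            "SMILES": Chem.MolToSmiles(mol),  # canonical SMILES
            "MolWt": Descriptors.MolWt(mol),
            "LogP": Descriptors.MolLogP(mol),
            "TPSA": Descriptors.TPSA(mol),
            "HBD": Descriptors.NumHDonors(mol),
            "HBA": Descriptors.NumHAcceptors(mol),
        })
    return pd.DataFrame(rows)


features = descriptor_features(
    ["C(CC(=O)O)CN", "C[N+](C)(C)CC(CC(=O)O)O", "not_a_smiles"]
)
print(features)
```

Unlike fingerprint bits, descriptors live on very different scales, so they would typically be standardized before training most models.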


**The path to a ML model.**

*   Define the task
*   Prepare data & split data
*   Choose the model
*   Train the model
*   Evaluate the model
*   Use the model




| Property     | Description                                                                      |
| ------------ | -------------------------------------------------------------------------------- |
| **Type**     | Supervised learning                                                              |
| **Sub-type** | Multi-label classification                                                       |
| **Input**    | Molecular features derived from SMILES                                           |
| **Output**   | Binary vector of 27 side effect indicators                                       |
| **Goal**     | Predict the presence (1) or absence (0) of each side effect for unseen molecules |


In this project, our goal is to develop a machine learning model that predicts physiological side effects of drug molecules based on their chemical structure.
Each molecule is represented by molecular descriptors derived from its SMILES notation, and the target output consists of 27 binary labels corresponding to physiological side effect categories.
This represents a multi-label classification problem, where each compound can be associated with multiple side effects simultaneously.
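
A minimal sketch of such a multi-label setup, assuming scikit-learn and random placeholder data in place of the real fingerprints and labels, could look like this. The model choice (one random forest per task) is an illustrative assumption, not the model selected for this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(100, 64))  # placeholder binary fingerprints
Y_demo = rng.integers(0, 2, size=(100, 27))  # placeholder 27-task label matrix

# MultiOutputClassifier fits one independent classifier per label column.
model = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=20, random_state=0)
)
model.fit(X_demo[:80], Y_demo[:80])

# Each prediction is a vector of 27 binary side-effect indicators.
Y_pred = model.predict(X_demo[80:])
print(Y_pred.shape)
```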

In drug discovery, predicting adverse side effects before clinical testing can:

*   Reduce the risk of toxicity in early drug candidates
*   Help prioritize safer compounds
*   Save significant time and cost in development

So, our proposed model essentially acts as a computational toxicity screener, a decision-support tool for medicinal chemists.

In [48]:
df.shape, df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1427 entries, 0 to 1426
Data columns (total 28 columns):
 #   Column                                                               Non-Null Count  Dtype  
---  ------                                                               --------------  -----  
 0   SMILES                                                               1427 non-null   object 
 1   Hepatobiliary disorders                                              1427 non-null   float64
 2   Metabolism and nutrition disorders                                   1427 non-null   float64
 3   Product issues                                                       1427 non-null   float64
 4   Eye disorders                                                        1427 non-null   float64
 5   Investigations                                                       1427 non-null   float64
 6   Musculoskeletal and connective tissue disorders                      1427 non-null   float64
 7   Gastro

((1427, 28), None)

In [49]:
print(df.isna().sum().sum(), "missing values total")
print(df.describe())

0 missing values total
       Hepatobiliary disorders  Metabolism and nutrition disorders  \
count              1427.000000                         1427.000000   
mean                  0.520673                            0.697968   
std                   0.499748                            0.459300   
min                   0.000000                            0.000000   
25%                   0.000000                            0.000000   
50%                   1.000000                            1.000000   
75%                   1.000000                            1.000000   
max                   1.000000                            1.000000   

       Product issues  Eye disorders  Investigations  \
count     1427.000000    1427.000000     1427.000000   
mean         0.015417       0.613875        0.806587   
std          0.123247       0.487030        0.395112   
min          0.000000       0.000000        0.000000   
25%          0.000000       0.000000        1.000000   
50%       
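
Because the labels are 0/1, each column mean in the `describe()` output is the fraction of positive compounds for that task (about 0.015 for "Product issues" versus about 0.81 for "Investigations"). A more compact view of this class balance can be sketched as follows. The small frame `demo` only holds the first five rows of three label columns, copied from the `df.head()` output, so the numbers are illustrative and not dataset-wide statistics.

```python
import pandas as pd

# First five rows of three label columns, copied from the df.head() output.
demo = pd.DataFrame({
    "Hepatobiliary disorders": [1.0, 0.0, 0.0, 0.0, 1.0],
    "Metabolism and nutrition disorders": [1.0, 0.0, 1.0, 0.0, 0.0],
    "Product issues": [0.0, 0.0, 0.0, 0.0, 0.0],
})

prevalence = demo.mean().sort_values()  # fraction of positives per task
labels_per_compound = demo.sum(axis=1)  # number of positive labels per molecule

print(prevalence)
print(labels_per_compound.tolist())
```

On the full DataFrame, the same two lines applied to `df[tasks]` would show which tasks are strongly imbalanced.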

We begin by properly preparing our data.

In [62]:
from sklearn.feature_selection import VarianceThreshold

# VarianceThreshold needs a numeric matrix, so we pass the fingerprint
# array X rather than df, whose SMILES column holds strings.
# This assumes X is a dense array (featurizer created with sparse=False).
selector = VarianceThreshold(threshold=0.0)
X_clean = selector.fit_transform(X)

print("Original number of features:", X.shape[1])
print("After removing zero-variance features:", X_clean.shape[1])
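
`VarianceThreshold` expects a purely numeric matrix, so string columns such as SMILES have to be left out before fitting. The toy sketch below shows what zero-variance removal does on a small fingerprint-like array; `X_toy` is made-up data for illustration only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy 4 x 4 "fingerprint" matrix: column 0 is always 0 and column 3 is always 1.
X_toy = np.array([
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
])

selector_toy = VarianceThreshold(threshold=0.0)
X_toy_clean = selector_toy.fit_transform(X_toy)

print(X_toy_clean.shape)                    # (4, 2)
print(selector_toy.get_support().tolist())  # [False, True, True, False]
```

Both constant columns are dropped, whether the constant is 0 or 1, since neither carries information that distinguishes molecules.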