<a href="https://colab.research.google.com/github/bbarthougatica/ChmInf/blob/Toxic_data/cheminf_latin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cheminformatics & drug design - Project**


RDKit is an open-source software toolkit for cheminformatics, designed to assist in the analysis and design of small molecules and chemical compounds. It provides a set of libraries and tools for the manipulation and analysis of molecular structures, molecular descriptors, molecular fingerprints, molecular similarity, molecular visualization, and more. The toolkit is widely used in academia, as well as in the pharmaceutical, biotech, and chemical industries for a variety of tasks such as virtual screening, lead optimization, and chemical database management.

In [1]:
# Install all libraries
!pip install numpy scipy matplotlib scikit-learn pandas rdkit xgboost deepchem mordred pycm

import pandas as pd
import deepchem as dc
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole





Now we load our selected datasets.

By default, the featurizer is ECFP (extended-connectivity fingerprints), also known as circular fingerprints or Morgan fingerprints.
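
To make the idea of a circular fingerprint concrete, here is a minimal sketch that builds Morgan bit vectors directly with RDKit. The 1024-bit size matches the feature width reported by `train_dataset.X.shape` in this notebook; the radius of 2 and the variable names are assumptions for illustration, and the exact bits may differ from what DeepChem's featurizer produces.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Two SMILES strings taken from the SIDER rows shown in this notebook.
smiles_list = ["C(CC(=O)O)CN", "C[N+](C)(C)CC(CC(=O)O)O"]

# Assumed settings: radius 2, 1024 bits.
n_bits = 1024
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=n_bits)

fingerprints = []
for smiles in smiles_list:
    mol = Chem.MolFromSmiles(smiles)  # returns None if the SMILES cannot be parsed
    bit_vector = generator.GetFingerprint(mol)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(bit_vector, arr)  # copy the bits into a NumPy array
    fingerprints.append(arr)

X_example = np.stack(fingerprints)
print(X_example.shape)  # (2, 1024)
```

Each row is one molecule and each column is one hashed substructure bit, which is the same layout as the `X` matrix returned by the loader.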

In [2]:
tasks, datasets, transformers = dc.molnet.load_sider(featurizer='ECFP')






In [3]:
print("Dataset:", "SIDER")
print("Number of tasks (side effect categories):", len(tasks))
print("Example task names:", tasks[:5])

Dataset: SIDER
Number of tasks (side effect categories): 27
Example task names: ['Hepatobiliary disorders', 'Metabolism and nutrition disorders', 'Product issues', 'Eye disorders', 'Investigations']


In [11]:
print([type(d) for d in datasets])

[<class 'deepchem.data.datasets.DiskDataset'>, <class 'deepchem.data.datasets.DiskDataset'>, <class 'deepchem.data.datasets.DiskDataset'>]


`load_sider` returns the datasets as a tuple of three `DiskDataset` objects (training, validation, and test), so we unpack them before use.

In [12]:
train_dataset, valid_dataset, test_dataset = datasets

In [18]:
tasks

['Hepatobiliary disorders',
 'Metabolism and nutrition disorders',
 'Product issues',
 'Eye disorders',
 'Investigations',
 'Musculoskeletal and connective tissue disorders',
 'Gastrointestinal disorders',
 'Social circumstances',
 'Immune system disorders',
 'Reproductive system and breast disorders',
 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)',
 'General disorders and administration site conditions',
 'Endocrine disorders',
 'Surgical and medical procedures',
 'Vascular disorders',
 'Blood and lymphatic system disorders',
 'Skin and subcutaneous tissue disorders',
 'Congenital, familial and genetic disorders',
 'Infections and infestations',
 'Respiratory, thoracic and mediastinal disorders',
 'Psychiatric disorders',
 'Renal and urinary disorders',
 'Pregnancy, puerperium and perinatal conditions',
 'Ear and labyrinth disorders',
 'Cardiac disorders',
 'Nervous system disorders',
 'Injury, poisoning and procedural complications']

In [16]:
train_dataset.X.shape

(1141, 1024)

In [19]:
y

array([[1., 1., 0., ..., 1., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 1., 1., 1.],
       ...,
       [1., 1., 0., ..., 1., 1., 1.],
       [1., 1., 0., ..., 1., 1., 1.],
       [0., 1., 0., ..., 1., 1., 1.]])

The data is stored as DeepChem `DiskDataset` objects, so we convert the training set to a pandas DataFrame for easier inspection.

In [17]:
# Convert to DataFrame
X = train_dataset.X
y = train_dataset.y
ids = train_dataset.ids

# Create DataFrame
df = pd.DataFrame(X)
df['SMILES'] = ids

# Add label columns
for i, task in enumerate(tasks):
    df[task] = y[:, i]

# Move SMILES to front
cols = ['SMILES'] + [c for c in df.columns if c != 'SMILES']
df = df[cols]

# Show first rows
print(df.head())

                    SMILES    0    1    2    3    4    5    6    7    8  ...  \
0          C(CNCCNCCNCCN)N  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
1                   Cl[Tl]  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
2  C[N+](C)(C)CC(CC(=O)O)O  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
3             C(CC(=O)O)CN  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
4        C(CC(=O)O)C(=O)CN  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   

   Congenital, familial and genetic disorders  Infections and infestations  \
0                                         0.0                          0.0   
1                                         0.0                          1.0   
2                                         0.0                          1.0   
3                                         0.0                          0.0   
4                                         0.0                          1.0   

   Respiratory, thoracic and mediastinal disorders

In [21]:
df = pd.DataFrame(train_dataset.y, columns=tasks)
df.insert(0, "SMILES", train_dataset.ids)

# Inspect the result
df.head()

Unnamed: 0,SMILES,Hepatobiliary disorders,Metabolism and nutrition disorders,Product issues,Eye disorders,Investigations,Musculoskeletal and connective tissue disorders,Gastrointestinal disorders,Social circumstances,Immune system disorders,...,"Congenital, familial and genetic disorders",Infections and infestations,"Respiratory, thoracic and mediastinal disorders",Psychiatric disorders,Renal and urinary disorders,"Pregnancy, puerperium and perinatal conditions",Ear and labyrinth disorders,Cardiac disorders,Nervous system disorders,"Injury, poisoning and procedural complications"
0,C(CNCCNCCNCCN)N,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0
1,Cl[Tl],0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,C[N+](C)(C)CC(CC(=O)O)O,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
3,C(CC(=O)O)CN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,C(CC(=O)O)C(=O)CN,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,...,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0



We chose two datasets from the *Physiology Collection*.
We aim to study toxicity and adverse drug reactions, treating both as classification tasks.
Both datasets are split into training, validation, and test subsets in an 80/10/10 ratio; random splitting is recommended for both.
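
As a rough illustration of what an 80/10/10 random split does, the sketch below shuffles row indices with NumPy and cuts them at the two boundaries. The function name and seed are hypothetical. In DeepChem the loaders take a `splitter` argument (for example `splitter='random'`), and the loader default may be a different strategy, so it is worth checking which one is actually in use.

```python
import numpy as np

def random_split_indices(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle row indices and cut them into train/valid/test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    train_end = int(frac_train * n_samples)
    valid_end = int((frac_train + frac_valid) * n_samples)
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]

# 1427 is the SIDER compound count listed in this notebook.
train_idx, valid_idx, test_idx = random_split_indices(1427)
print(len(train_idx), len(valid_idx), len(test_idx))  # 1141 143 143
```

The training size of 1141 agrees with the first dimension of `train_dataset.X.shape` reported earlier.
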
# 1st Data Set: **Tox21**

*   Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways.
*   Task type: Classification
*   Nº Tasks: 12
*   Recommended metric: area under the receiver operating characteristic curve (ROC-AUC)
*   Nº Compounds: 7831
*   Prediction target: Toxicity


# 2nd Data Set: **SIDER**

*   Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.
*   Task type: Classification
*   Nº Tasks: 27
*   Recommended metric: area under the receiver operating characteristic curve (ROC-AUC)
*   Nº Compounds: 1427
*   Prediction target: Adverse drug reaction
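
Since both datasets are multi-task, ROC-AUC is usually computed per task and then averaged. The following is a minimal sketch of that idea using scikit-learn; the helper name `mean_roc_auc` and the toy labels and scores are made up for illustration and are not results from Tox21 or SIDER.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(y_true, y_score):
    """Average ROC-AUC over tasks (columns), skipping single-class tasks."""
    aucs = []
    for task in range(y_true.shape[1]):
        labels = y_true[:, task]
        if len(np.unique(labels)) < 2:
            continue  # ROC-AUC is undefined when only one class is present
        aucs.append(roc_auc_score(labels, y_score[:, task]))
    return float(np.mean(aucs))

# Toy labels and scores for 4 compounds and 2 tasks (illustrative only).
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
y_score = np.array([[0.9, 0.2], [0.1, 0.6], [0.8, 0.7], [0.3, 0.4]])
print(mean_roc_auc(y_true, y_score))  # 0.875
```

Here the first task is ranked perfectly (AUC 1.0) and the second gets three of four positive/negative pairs right (AUC 0.75), giving a mean of 0.875. DeepChem offers a similar wrapper through `dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)`.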



*You should perform appropriate data cleaning and preprocessing. This includes evaluating at least two different molecular featurization strategies, such as circular fingerprints (e.g., Morgan) and descriptor-based or graph-based representations.
Document any preprocessing decisions or challenges.*
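
One possible way to approach the cleaning step and a descriptor-based featurization is sketched below with RDKit. It drops SMILES that fail to parse, removes duplicates by canonical SMILES, and computes a handful of descriptors. The function name, the choice of descriptors, and the deliberately invalid input string are assumptions for illustration, not a prescribed preprocessing pipeline.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def clean_and_describe(smiles_list):
    """Drop unparsable or duplicate SMILES and compute a few RDKit descriptors."""
    rows, seen = [], set()
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # skip SMILES that RDKit cannot parse
        canonical = Chem.MolToSmiles(mol)
        if canonical in seen:
            continue  # skip duplicates written with a different atom order
        seen.add(canonical)
        rows.append({
            "SMILES": canonical,
            "MolWt": Descriptors.MolWt(mol),
            "LogP": Descriptors.MolLogP(mol),
            "TPSA": Descriptors.TPSA(mol),
            "HBD": Descriptors.NumHDonors(mol),
            "HBA": Descriptors.NumHAcceptors(mol),
        })
    return pd.DataFrame(rows)

# The first two strings encode the same molecule; the third is intentionally invalid.
raw = ["C(CC(=O)O)CN", "NCCCC(=O)O", "not_a_smiles", "C(CNCCNCCNCCN)N"]
descriptor_df = clean_and_describe(raw)
print(descriptor_df)
```

Unlike the sparse fingerprint bits, these columns are continuous and on different scales, so models that are sensitive to feature scale would likely need standardization. For a graph-based alternative, the DeepChem loaders also accept `featurizer='GraphConv'`.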


**The path to a ML model.**

*   Define the task
*   Prepare data & split data
*   Choose the model
*   Train the model
*   Evaluate the model
*   Use the model
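
The steps above can be sketched end to end on synthetic data. This is only a toy walk-through under stated assumptions: the bit vectors are random, the label simply copies bit 0, and a scikit-learn random forest stands in for whichever model is eventually chosen.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Define the task: binary classification from fingerprint-like bit vectors.
#    The data is synthetic: the label is a copy of bit 0.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 32))
y = X[:, 0]

# 2. Prepare and split the data (80% train, 20% test to keep the sketch short).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Choose the model, and 4. train it.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# 5. Evaluate the model with ROC-AUC on the held-out set.
test_scores = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print(f"Test ROC-AUC: {test_auc:.3f}")

# 6. Use the model on a new, unseen bit vector.
new_compound = np.zeros((1, 32), dtype=int)
new_compound[0, 0] = 1
prediction = model.predict(new_compound)
print(prediction)
```

With real data, step 2 would also include the validation subset for model selection, and the single-label target would be replaced by the 12 or 27 task columns.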


