**Preparing the data for Machine Learning**

The most important step in this project is transforming the molecular-structure data so that machine learning algorithms can interpret it. The padelpy library was used to perform this transformation.

In [1]:
import os
import pandas as pd

In [9]:
df = pd.read_csv('hepg2_3class_pic50.csv')
df

,molecule_id,smiles,bioactivity,pic50
0,CHEMBL39380,COC(=O)CC(O)CP(=O)([O-])CCc1c(Cl)cc(Cl)cc1OCc1...,active,1.000000
1,CHEMBL39003,CC(C)(C)C(=O)OCOC(=O)CC(O)CP(=O)(CCc1c(Cl)cc(C...,active,1.000000
2,CHEMBL39102,O=C(O)C[C@H](O)CP(=O)(O)CCc1c(Cl)cc(Cl)cc1OCc1...,intermediate,-0.000000
3,CHEMBL16120,CCN(CC#Cc1ccc(N=[N+]=[N-])cc1)Cc1cccc(OCc2cc(-...,intermediate,-0.322219
4,CHEMBL267332,CCN(CC#Cc1cccc(N=[N+]=[N-])c1)Cc1cccc(OCc2cc(-...,active,0.267606
...,...,...,...,...
22047,CHEMBL5220792,C=C(C)CC[C@H]1C[C@]23C[C@@H](CC=C(C)C)C(C)(C)[...,intermediate,-1.352183
22048,CHEMBL5219051,C=C(C)[C@H]1C[C@]23C[C@H](CC=C(C)C)C(C)(C)[C@]...,intermediate,-1.342423
22049,CHEMBL5219808,C=C(C)[C@H]1C[C@]23C[C@H](CC=C(C)C)C(C)(C)[C@]...,intermediate,-1.041393
22050,CHEMBL4520920,CC(C)=CC[C@@H]1C[C@]23C[C@@H](CC=C(C)C)C(C)(C)...,intermediate,-0.633468


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22052 entries, 0 to 22051
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   molecule_id  22052 non-null  object 
 1   smiles       22052 non-null  object 
 2   bioactivity  22052 non-null  object 
 3   pic50        22052 non-null  float64
dtypes: float64(1), object(3)
memory usage: 689.3+ KB


In [4]:
selection = ['smiles','molecule_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index = False, header=False)

In [4]:
pip install padelpy

Note: you may need to restart the kernel to use updated packages.


In [5]:
df = pd.read_csv('molecule.smi')
df.head()

,COC(=O)CC(O)CP(=O)([O-])CCc1c(Cl)cc(Cl)cc1OCc1ccccc1.[Na+]\tCHEMBL39380
0,CC(C)(C)C(=O)OCOC(=O)CC(O)CP(=O)(CCc1c(Cl)cc(C...
1,O=C(O)C[C@H](O)CP(=O)(O)CCc1c(Cl)cc(Cl)cc1OCc1...
2,CCN(CC#Cc1ccc(N=[N+]=[N-])cc1)Cc1cccc(OCc2cc(-...
3,CCN(CC#Cc1cccc(N=[N+]=[N-])c1)Cc1cccc(OCc2cc(-...
4,CCN(CC#Cc1ccccc1)Cc1cccc(OCc2cc(-c3ccsc3)cs2)c...
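
In the preview above, the first record has been promoted to a column header, with the SMILES string and the ChEMBL ID fused by a literal `\t`. This is what `pd.read_csv` produces when a tab-separated file without a header row is read with the default comma separator and header inference. A minimal sketch of reading such a file back, assuming the two-column layout written earlier (SMILES first, ID second). The helper `read_smi`, the variable `df_smi`, and the in-memory sample records are illustrative names and values, not part of the original notebook; using a separate variable also leaves the original `df` untouched.

```python
import io
import pandas as pd

def read_smi(source):
    """Read a tab-separated, header-less SMILES file into a two-column DataFrame."""
    return pd.read_csv(source, sep='\t', header=None,
                       names=['smiles', 'molecule_id'])

# In-memory stand-in for 'molecule.smi' (illustrative records only).
sample = io.StringIO("CCO\tMOL_A\nc1ccccc1\tMOL_B\n")
df_smi = read_smi(sample)
print(df_smi)
```

With the real file, the call would be `read_smi('molecule.smi')`.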


PaDEL is a program written in Java that converts SMILES ("Simplified Molecular Input Line Entry System") strings into fingerprints or descriptors, which are mathematical representations of a given molecular structure. With the fingerprints, an ML model can be applied.

The padelpy library provides a Python interface to the PaDEL program. In this work, the fingerprint type used by PubChem was chosen; the XML file "PubchemFingerprinter.xml" had to be kept in the same folder as this notebook for the function in the cell below to work.

In [6]:
from padelpy import padeldescriptor

fingerprint = 'Pubchem'

fingerprint_output_file = ''.join([fingerprint,'.csv'])
fingerprint_descriptortypes = 'PubchemFingerprinter.xml'

padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes = fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)
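
Because the descriptor-type XML and the SMILES file must both be reachable from the notebook's working directory, it can help to check for them before launching the PaDEL run. The following is a small sketch using only the standard library; `require_file` is a hypothetical helper introduced here for illustration and is not part of padelpy.

```python
from pathlib import Path

def require_file(path):
    """Return the resolved path, or raise a clear error if the file is missing."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"Expected '{p}' (working directory: {Path.cwd()})")
    return p.resolve()

# Possible usage before calling padeldescriptor (file names from the cell above):
# require_file('PubchemFingerprinter.xml')
# require_file('molecule.smi')
```

Failing early with an explicit message is usually easier to diagnose than an error raised from inside the Java process.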

In [6]:
df_X = pd.read_csv('x_ml_data.csv')
df_X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22052 entries, 0 to 22051
Columns: 882 entries, Name to PubchemFP880
dtypes: float64(881), object(1)
memory usage: 148.4+ MB


In [7]:
df_X.head()

,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL39380,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CHEMBL39003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,CHEMBL39102,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CHEMBL16120,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,CHEMBL267332,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
df_y = df['pic50']
df_y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 22052 entries, 0 to 22051
Series name: pic50
Non-Null Count  Dtype  
--------------  -----  
22052 non-null  float64
dtypes: float64(1)
memory usage: 172.4 KB


So that future tests can be carried out, OneHotEncoder was used to take advantage of the "bioactivity" column, generating numeric values that ML algorithms can use.

In [13]:
from sklearn.preprocessing import OneHotEncoder

In [16]:
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')
bioactivity_ohe = ohe.fit_transform(df[['bioactivity']])
bioactivity_ohe.head()

,bioactivity_active,bioactivity_inactive,bioactivity_intermediate
0,1.0,0.0,0.0
1,1.0,0.0,0.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0


In [19]:
df_final_rg = pd.concat([df_X,bioactivity_ohe, df_y], axis=1)
df_final_rg.to_csv('x_y_ml_rg_data.csv', index = False)

In [20]:
df_final_rg.head()

,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,bioactivity_active,bioactivity_inactive,bioactivity_intermediate,pic50
0,CHEMBL39380,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,CHEMBL39003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,CHEMBL39102,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.0
3,CHEMBL16120,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.322219
4,CHEMBL267332,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.267606
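
`pd.concat(..., axis=1)` pairs rows by index label, so the result is only meaningful if the fingerprint table lists the molecules in the same order as the source DataFrame. A hedged sketch of checking that assumption, with a key-based merge as an alternative, on toy data. The frames `df_toy` and `df_X_toy`, the IDs, and the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for df (activities) and df_X (fingerprints); values are illustrative.
df_toy = pd.DataFrame({'molecule_id': ['M1', 'M2', 'M3'],
                       'pic50': [1.0, -0.3, 0.2]})
df_X_toy = pd.DataFrame({'Name': ['M2', 'M1', 'M3'],   # deliberately out of order
                         'FP0': [0.0, 1.0, 1.0]})

# Positional check: True only if both tables list the same IDs in the same order.
same_order = df_X_toy['Name'].tolist() == df_toy['molecule_id'].tolist()

# Key-based join: pairs rows by ID, independent of row order.
merged = (df_X_toy
          .merge(df_toy, left_on='Name', right_on='molecule_id',
                 how='inner', validate='one_to_one')
          .drop(columns='molecule_id'))

print(same_order)
print(merged)
```

The `validate='one_to_one'` argument makes the merge raise an error if an ID appears more than once on either side, which would otherwise silently duplicate rows.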


In [18]:
df_final_cl = pd.concat([df_X, df['bioactivity']], axis=1)
df_final_cl.head()

,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,bioactivity
0,CHEMBL39380,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,active
1,CHEMBL39003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,active
2,CHEMBL39102,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,intermediate
3,CHEMBL16120,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,intermediate
4,CHEMBL267332,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,active


In [19]:
df_final_cl.to_csv('x_y_ml_cl.csv', index=False)