# **Bioinformatics and Data Analysis Project - Computational Drug Discovery**

## **Downloading Bioactivity Data**

In this Jupyter notebook, we work toward a machine learning model built on bioactivity data from ChEMBL.

This part covers data collection and preprocessing from the ChEMBL database.

---


## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.


## **Installing the libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL database.

In [1]:
! pip install chembl_webresource_client






## **Importing the libraries**

In [2]:
import pandas as pd
from chembl_webresource_client.new_client import new_client



## **Target protein search**

### **Target search for dengue**

In [3]:
target = new_client.target
target_query = target.search('dengue')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Dengue virus,Dengue virus,15.0,False,CHEMBL613757,[],ORGANISM,12637
1,[],dengue virus type 4,dengue virus type 4,11.0,False,CHEMBL613728,[],ORGANISM,11070
2,[],dengue virus type 1,dengue virus type 1,11.0,False,CHEMBL613360,[],ORGANISM,11053
3,[],dengue virus type 2,dengue virus type 2,11.0,False,CHEMBL613966,[],ORGANISM,11060
4,[],dengue virus type 3,dengue virus type 3,11.0,False,CHEMBL612717,[],ORGANISM,11069
5,"[{'xref_id': 'P29990', 'xref_name': None, 'xre...",Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,9.0,False,CHEMBL5980,"[{'accession': 'P29990', 'component_descriptio...",SINGLE PROTEIN,31634


**Select and retrieve bioactivity data for *dengue virus type 3* (fifth entry)**

Type 3 (DENV-3), the most common serotype in Brazil over the last 15 years, shows greater virulence, meaning it causes more severe symptoms than the other serotypes.

We assign the fifth entry (index 4, the organism-level target *dengue virus type 3*) to the variable ***selected_target***.

In [4]:
selected_target = targets.target_chembl_id[4]
selected_target

'CHEMBL612717'
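
Selecting by position works as long as the search keeps returning results in the same order. As a minimal sketch, the same target can be picked by its `pref_name` instead. The `targets_demo` table below is a hand-typed stand-in for the search result above, and the `_demo` names are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Stand-in for the search result above (only the columns needed here).
targets_demo = pd.DataFrame({
    "pref_name": ["Dengue virus", "dengue virus type 4", "dengue virus type 1",
                  "dengue virus type 2", "dengue virus type 3",
                  "Dengue virus type 2 NS3 protein"],
    "target_chembl_id": ["CHEMBL613757", "CHEMBL613728", "CHEMBL613360",
                         "CHEMBL613966", "CHEMBL612717", "CHEMBL5980"],
    "target_type": ["ORGANISM"] * 5 + ["SINGLE PROTEIN"],
})

# Select by name rather than by position, so a reordered result
# cannot silently change the chosen target.
match = targets_demo.loc[
    targets_demo["pref_name"] == "dengue virus type 3", "target_chembl_id"
]
if len(match) != 1:
    raise ValueError("expected exactly one matching target")

selected_target_demo = match.iloc[0]
print(selected_target_demo)
```

The explicit length check makes the cell fail loudly if the name ever matches zero or several targets.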

Here we retrieve only the bioactivity data for *dengue virus type 3* (CHEMBL612717) that are reported as IC$_{50}$ values. The `standard_value` column expresses them in nM (nanomolar).

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [6]:
df = pd.DataFrame.from_dict(res)

In [7]:
df.head(3)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,12674455,[],CHEMBL2341562,Antiviral activity against Dengue virus 3 stra...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,,100.0
1,,,18326333,[],CHEMBL4135359,Antiviral activity against Fluc-tagged DENV3 i...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,,0.73
2,,,19235373,[],CHEMBL4398331,Antiviral activity against DENV3 Bolivia infec...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,15.0,6.8
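
In the output above, the `units` column shows uM next to `value`, while the `standard_value` column used later holds the same measurements in nM (100.0 uM appears as 100000.0). ChEMBL supplies the standardized column itself; the sketch below only illustrates the arithmetic. `TO_NM` and `to_nanomolar` are hypothetical helpers, not part of any library.

```python
# Conversion factors to nanomolar (hypothetical helper table for illustration).
TO_NM = {"nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def to_nanomolar(value, units):
    """Convert a concentration to nM; raises KeyError for unknown units."""
    return float(value) * TO_NM[units]


# The three (value, units) pairs shown in the first rows of the table.
for value, units in [(100.0, "uM"), (0.73, "uM"), (6.8, "uM")]:
    print(value, units, "->", round(to_nanomolar(value, units), 6), "nM")
```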


Finally, we save the resulting bioactivity data to the CSV file **dengue_bioactivity_data.csv**.

In [8]:
df.to_csv('dengue_bioactivity_data.csv', index=False)

## **Handling missing data**
Any compound with a missing value in the **standard_value** column is dropped.

In [9]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,12674455,[],CHEMBL2341562,Antiviral activity against Dengue virus 3 stra...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,,100.0
1,,,18326333,[],CHEMBL4135359,Antiviral activity against Fluc-tagged DENV3 i...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,,0.73
2,,,19235373,[],CHEMBL4398331,Antiviral activity against DENV3 Bolivia infec...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,15.0,6.8
3,,,19235452,[],CHEMBL4398397,Antiviral activity against DENV3 97 infected i...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,10.4,6.9
4,,,19235453,[],CHEMBL4398397,Antiviral activity against DENV3 97 infected i...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,10.8,6.5
5,,,19235470,[],CHEMBL4398414,Antiviral activity against DENV3 infected in h...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,,13.07
6,,,19440701,[],CHEMBL4431626,Antiviral activity against Dengue virus 3 infe...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,1.0,0.5
7,,,24708214,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5109272,Antiviral activity against DENV3 infected in A...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,,5.0
8,,,24789044,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5131959,Antiviral activity against DENV 3 infected in ...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,,5.0
9,,,24866747,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5151807,Antiviral activity against DENV-3 infected ham...,F,,,BAO_0000190,...,dengue virus type 3,dengue virus type 3,11069,,,IC50,uM,UO_0000065,,1.66


This dataset appears to have no missing values in **standard_value**, but the code cell above can be reused for bioactivity data from another target.
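
For targets where values are missing, a slightly more defensive version of the step may help. The sketch below uses toy data (the identifiers and the "not determined" string are made up) and assumes the values may arrive as strings, which the later `float(i)` calls suggest. It coerces the column to numbers, drops unusable rows, and renumbers the index.

```python
import pandas as pd

# Toy data: one missing value and one non-numeric string (both hypothetical).
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "standard_value": ["730.0", None, "not determined", "5000.0"],
})

# Coerce to numbers first; anything that cannot be parsed becomes NaN.
demo["standard_value"] = pd.to_numeric(demo["standard_value"], errors="coerce")

# Drop rows without a usable value and renumber the index from 0.
clean = demo.dropna(subset=["standard_value"]).reset_index(drop=True)
print(clean)
```

Resetting the index matters for any later step that combines columns by row label.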

## **Preprocessing the bioactivity data**

### **Labeling compounds as active, inactive, or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values at or below 1,000 nM are considered **active**, those at or above 10,000 nM are considered **inactive**, and those between 1,000 and 10,000 nM are labeled **intermediate**.

In [10]:
bioactivity_class = []
for i in df2.standard_value:
    value = float(i)  # IC50 in nM
    if value >= 10000:
        bioactivity_class.append("inactive")      # at or above 10,000 nM
    elif value <= 1000:
        bioactivity_class.append("active")        # at or below 1,000 nM
    else:
        bioactivity_class.append("intermediate")  # between the two thresholds
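
The thresholds above can also be wrapped in a small function, which makes the boundary cases easy to check. This is a sketch only: `classify_ic50` and its parameter names are hypothetical, and the default thresholds simply mirror the loop above.

```python
def classify_ic50(value_nm, active_max=1000.0, inactive_min=10000.0):
    """Label an IC50 value (in nM) using the same rules as the loop above."""
    value = float(value_nm)
    if value >= inactive_min:
        return "inactive"
    if value <= active_max:
        return "active"
    return "intermediate"


# Three values from the dataset plus the two boundary values.
values = [100000.0, 730.0, 6800.0, 1000.0, 10000.0]
labels = [classify_ic50(v) for v in values]
print(labels)
```

Note that both boundaries are inclusive: exactly 1,000 nM counts as active and exactly 10,000 nM as inactive.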

### **Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) and bioactivity_class into a DataFrame**

In [11]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL2332247,O=C(CNCc1ccc(C(=O)N2CCCCC2)cc1)NC(=O)COc1ccccc1,100000.0
1,CHEMBL4175102,COc1ccc(-c2cc(-c3ccc4cc(F)ccc4n3)n(-c3ccc(S(N)...,730.0
2,CHEMBL4526128,O=C(N[C@@H](Cc1ccc(O)cc1)C(=O)O)c1cc(-c2ccccc2...,6800.0
3,CHEMBL4443913,CCCCCCCCCCCCNCCNC(=O)C[C@@]1(O)C[C@@H](O)[C@@H...,6900.0
4,CHEMBL4572441,CCCCCCCCCCCCCCNCCNC(=O)C[C@]1(O)C[C@@H](O)[C@@...,6500.0
5,CHEMBL1138,O=C1[C@H](CC[C@H](O)c2ccc(F)cc2)[C@@H](c2ccc(O...,13070.0
6,CHEMBL506569,CC(=O)O[C@H]1C[C@H](O[C@H]2[C@@H](O)C[C@H](O[C...,500.0
7,CHEMBL5176406,FC(F)(F)Oc1cccc(Nc2ccnc(Nc3cccc(OC(F)(F)F)c3)n...,5000.0
8,CHEMBL5182661,CC(=O)c1cccc(-c2cc(Nc3ccc(OC(F)(F)F)cc3)ncn2)c1,5000.0
9,CHEMBL263291,CC[C@H](C)[C@H]1O[C@]2(CC[C@@H]1C)C[C@@H]1C[C@...,1660.0


In [12]:
# Reuse df3's index so the labels stay aligned even if rows were dropped earlier
bioactivity_class = pd.Series(bioactivity_class, index=df3.index, name='bioactivity_class')
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL2332247,O=C(CNCc1ccc(C(=O)N2CCCCC2)cc1)NC(=O)COc1ccccc1,100000.0,inactive
1,CHEMBL4175102,COc1ccc(-c2cc(-c3ccc4cc(F)ccc4n3)n(-c3ccc(S(N)...,730.0,active
2,CHEMBL4526128,O=C(N[C@@H](Cc1ccc(O)cc1)C(=O)O)c1cc(-c2ccccc2...,6800.0,intermediate
3,CHEMBL4443913,CCCCCCCCCCCCNCCNC(=O)C[C@@]1(O)C[C@@H](O)[C@@H...,6900.0,intermediate
4,CHEMBL4572441,CCCCCCCCCCCCCCNCCNC(=O)C[C@]1(O)C[C@@H](O)[C@@...,6500.0,intermediate
5,CHEMBL1138,O=C1[C@H](CC[C@H](O)c2ccc(F)cc2)[C@@H](c2ccc(O...,13070.0,inactive
6,CHEMBL506569,CC(=O)O[C@H]1C[C@H](O[C@H]2[C@@H](O)C[C@H](O[C...,500.0,active
7,CHEMBL5176406,FC(F)(F)Oc1cccc(Nc2ccnc(Nc3cccc(OC(F)(F)F)c3)n...,5000.0,intermediate
8,CHEMBL5182661,CC(=O)c1cccc(-c2cc(Nc3ccc(OC(F)(F)F)cc3)ncn2)c1,5000.0,intermediate
9,CHEMBL263291,CC[C@H](C)[C@H]1O[C@]2(CC[C@@H]1C)C[C@@H]1C[C@...,1660.0,intermediate
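
`pd.concat(..., axis=1)` matches rows by index label, not by position. Here no rows were dropped, so the labels run from 0 to 9 and everything lines up. The toy example below (made-up two-row data) sketches what can go wrong after filtering, when the surviving rows keep non-consecutive labels.

```python
import pandas as pd

# After filtering, the surviving rows keep their original labels (0 and 2).
kept = pd.DataFrame({"standard_value": [730.0, 5000.0]}, index=[0, 2])
labels = ["active", "intermediate"]

# A Series built from a plain list gets a fresh 0..n-1 index, so concat
# aligns on labels and leaves gaps in both columns.
misaligned = pd.concat(
    [kept, pd.Series(labels, name="bioactivity_class")], axis=1
)

# Reusing the DataFrame's index keeps each label on its own row.
aligned = pd.concat(
    [kept, pd.Series(labels, index=kept.index, name="bioactivity_class")],
    axis=1,
)

print(misaligned)
print(aligned)
```

Passing the DataFrame's index to the Series, or resetting the index right after filtering, avoids the problem.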


Save the DataFrame to a CSV file.

In [13]:
df4.to_csv('dengue_bioactivity_data_preprocessed.csv', index=False)
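
A quick way to check that a file written with `index=False` reads back cleanly is a round trip through `read_csv`. The sketch below writes to an in-memory buffer instead of disk so it has no side effects, and uses two rows copied from the table above as a stand-in for the full DataFrame.

```python
import io

import pandas as pd

# Two rows taken from the preprocessed table, as a small stand-in.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL4175102", "CHEMBL1138"],
    "standard_value": [730.0, 13070.0],
    "bioactivity_class": ["active", "inactive"],
})

# Write to an in-memory buffer, then read it back.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

With `index=False` the row labels are not written, so the restored table has exactly the three data columns.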

In [14]:
import os
sorted(os.listdir('.'))  # portable directory listing; `ls -l` fails in the Windows shell



---