# **ChEMBL database filtering**

To build a model that predicts the IC50 of molecules, the ChEMBL database is searched for molecules that inhibit alpha-synuclein and have a reported IC50 value.

**Install all needed libraries**

In [1]:
!pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-24.1.2-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl.metadata (3.1 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-24.1.2-py3-none-any.whl (66 kB)

**Importing libraries**

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
from chembl_webresource_client.new_client import new_client

## **Target search**

### **Using the API, search for Homo sapiens alpha-synuclein**

In [5]:
target = new_client.target
target_query = target.search('alpha-synuclein')
targets = pd.DataFrame.from_dict(target_query)
targets.head()

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P37840', 'xref_name': None, 'xre...",Homo sapiens,Alpha-synuclein,23.0,False,CHEMBL6152,"[{'accession': 'P37840', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Synphilin-1,23.0,False,CHEMBL1926494,"[{'accession': 'Q9Y6H5', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Mus musculus,Alpha-synuclein,23.0,False,CHEMBL5169100,"[{'accession': 'O55042', 'component_descriptio...",SINGLE PROTEIN,10090
3,"[{'xref_id': 'O00418', 'xref_name': None, 'xre...",Homo sapiens,Serine/threonine-protein kinase EEF2K,13.0,False,CHEMBL5026,"[{'accession': 'O00418', 'component_descriptio...",SINGLE PROTEIN,9606
4,[],Rattus norvegicus,Eukaryotic elongation factor 2 kinase,13.0,False,CHEMBL3325307,"[{'accession': 'P70531', 'component_descriptio...",SINGLE PROTEIN,10116
...,...,...,...,...,...,...,...,...,...
1437,[],Cavia porcellus,Natriuretic peptides A,1.0,False,CHEMBL3097985,"[{'accession': 'P27596', 'component_descriptio...",SINGLE PROTEIN,10141
1438,[],Homo sapiens,Voltage-gated potassium channel subunit Kv7.1/...,1.0,False,CHEMBL3430890,"[{'accession': 'P51787', 'component_descriptio...",PROTEIN COMPLEX,9606
1439,[],Homo sapiens,Cardiac myosin,1.0,False,CHEMBL3831286,"[{'accession': 'P12883', 'component_descriptio...",PROTEIN COMPLEX,9606
1440,[],Homo sapiens,"Amiloride-sensitive sodium channel, ENaC mRNA",1.0,False,CHEMBL4834389,"[{'accession': 'ENSG00000166828', 'component_d...",NUCLEIC-ACID,9606


In [6]:
targets.columns

Index(['cross_references', 'organism', 'pref_name', 'score',
       'species_group_flag', 'target_chembl_id', 'target_components',
       'target_type', 'tax_id'],
      dtype='object')

In [7]:
targets = targets[(targets.pref_name == 'Alpha-synuclein') & (targets.organism == 'Homo sapiens')]
targets.head()

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P37840', 'xref_name': None, 'xre...",Homo sapiens,Alpha-synuclein,23.0,False,CHEMBL6152,"[{'accession': 'P37840', 'component_descriptio...",SINGLE PROTEIN,9606


In [8]:
targets.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 0 to 0
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   cross_references    1 non-null      object 
 1   organism            1 non-null      object 
 2   pref_name           1 non-null      object 
 3   score               1 non-null      float64
 4   species_group_flag  1 non-null      bool   
 5   target_chembl_id    1 non-null      object 
 6   target_components   1 non-null      object 
 7   target_type         1 non-null      object 
 8   tax_id              1 non-null      int64  
dtypes: bool(1), float64(1), int64(1), object(6)
memory usage: 73.0+ bytes


### **Selecting and extracting bioactivity data**

In [9]:
selected_target = targets.target_chembl_id.iloc[0]
selected_target

'CHEMBL6152'
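
Selecting by position rather than by index label keeps this step working even when the matching row is not the first search hit. The following is a minimal sketch on a toy frame, not the notebook's own method: the helper `pick_single_target` is hypothetical, and the rows are a reordered subset of the search results shown earlier.

```python
import pandas as pd

# Toy stand-in for the search results (a reordered subset of the rows above).
toy_targets = pd.DataFrame(
    {
        "organism": ["Homo sapiens", "Homo sapiens", "Mus musculus"],
        "pref_name": ["Synphilin-1", "Alpha-synuclein", "Alpha-synuclein"],
        "target_chembl_id": ["CHEMBL1926494", "CHEMBL6152", "CHEMBL5169100"],
    }
)


def pick_single_target(frame, pref_name, organism):
    """Return the ChEMBL ID of the one row matching name and organism.

    Hypothetical helper: raises ValueError unless exactly one row matches.
    """
    mask = (frame["pref_name"] == pref_name) & (frame["organism"] == organism)
    hits = frame.loc[mask, "target_chembl_id"]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one match, found {len(hits)}")
    # .iloc selects by position, so it does not depend on the index label.
    return hits.iloc[0]


selected = pick_single_target(toy_targets, "Alpha-synuclein", "Homo sapiens")
print(selected)
```

In this toy frame the human alpha-synuclein row carries index label 1, so a label lookup with `[0]` on the filtered column would fail, while the positional lookup still returns the ID.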

In [11]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df_types = pd.DataFrame.from_dict(res).type.unique()
df_types

array(['IC50'], dtype=object)

In [16]:
df = pd.DataFrame.from_dict(res)
df.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,10978896,[],CHEMBL2065090,Binding affinity to human alpha-synuclein expr...,B,,,BAO_0000190,...,Homo sapiens,Alpha-synuclein,9606,,,IC50,nM,UO_0000065,,26000.0
1,,,10978897,[],CHEMBL2065090,Binding affinity to human alpha-synuclein expr...,B,,,BAO_0000190,...,Homo sapiens,Alpha-synuclein,9606,,,IC50,nM,UO_0000065,,26000.0
2,,,10978898,[],CHEMBL2065090,Binding affinity to human alpha-synuclein expr...,B,,,BAO_0000190,...,Homo sapiens,Alpha-synuclein,9606,,,IC50,nM,UO_0000065,,16000.0
3,,,10978899,[],CHEMBL2065090,Binding affinity to human alpha-synuclein expr...,B,,,BAO_0000190,...,Homo sapiens,Alpha-synuclein,9606,,,IC50,nM,UO_0000065,,507.1
4,,,10978900,[],CHEMBL2065090,Binding affinity to human alpha-synuclein expr...,B,,,BAO_0000190,...,Homo sapiens,Alpha-synuclein,9606,,,IC50,nM,UO_0000065,,1440.0


In [17]:
df.shape

(94, 46)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 46 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   action_type                56 non-null     object
 1   activity_comment           0 non-null      object
 2   activity_id                94 non-null     int64 
 3   activity_properties        94 non-null     object
 4   assay_chembl_id            94 non-null     object
 5   assay_description          94 non-null     object
 6   assay_type                 94 non-null     object
 7   assay_variant_accession    0 non-null      object
 8   assay_variant_mutation     0 non-null      object
 9   bao_endpoint               94 non-null     object
 10  bao_format                 94 non-null     object
 11  bao_label                  94 non-null     object
 12  canonical_smiles           94 non-null     object
 13  data_validity_comment      3 non-null      object
 14  data_validit
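
Because the goal is to keep only molecules with a usable IC50, it can help to quantify missing entries and drop incomplete rows before modelling. The sketch below assumes a toy table: the column names (`canonical_smiles`, `type`, `units`, `value`) come from the output above, but the SMILES strings and the pattern of missing entries are made up for illustration. It also assumes the measurement may arrive as text, which is why it is coerced to a number first.

```python
import pandas as pd

# Hypothetical miniature of the activity table; the contents are illustrative.
toy = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", None, "c1ccccc1", "CC(=O)O"],
        "type": ["IC50"] * 4,
        "units": ["nM", "nM", "nM", None],
        "value": ["26000.0", "16000.0", None, "507.1"],
    }
)

# Fraction of missing entries per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Coerce the measurement to a number; unparseable entries become NaN.
toy["value"] = pd.to_numeric(toy["value"], errors="coerce")

# Keep only rows that have both a structure and a numeric IC50.
usable = toy.dropna(subset=["canonical_smiles", "value"]).reset_index(drop=True)
print(usable.shape)
```

Restricting `dropna` to the two columns the model needs avoids discarding rows only because an unrelated, mostly empty column such as `activity_comment` is missing.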

In [19]:
df['assay_description'].value_counts()

assay_description,count
Binding affinity to human alpha-synuclein expressed in Escherichia coli BL21 (DE3) cells after 1 hr by thioflavin T fluorescence assay,19
Inhibition of human alpha-synuclein filament formation expressed in Escherichia coli BL21(DE3) cells incubated for 72 hrs by thioflavin S based fluorescence assay,19
Inhibition of alpha-synuclein fibril formation (unknown origin) incubated for 24 hrs to 7 days by thioflavin S based fluorescence assay,12
Inhibition of alpha-synuclein (unknown origin) aggregation incubated for 3 days by thioflavin T based fluorescence assay,7
Inhibition of alpha-synuclein aggregation (unknown origin) incubated for 8 days by thioflavin S based fluorescence assay,6
Inhibition of alpha-synuclein aggregation (unknown origin) expressed in Escherichia coli BL21 (DE3) incubated for 3 days by thioflavin T fluorescence assay,6
Inhibition of wild type human alpha-synuclein fibrillization expressed in Escherichia coli BL21(DE3)pLysS by thioflavin-T based fluorescence assay,4
Inhibition of alpha-synuclein (unknown origin) self-aggregation by fluorescence polarization assay,4
Inhibition of alpha-synuclein fibril formation (unknown origin) incubated for 6 days by thioflavin S based fluorescence assay,4
Inhibition of wild type alpha-synuclein aggregation (unknown origin) expressed in Escherichia coli BL21 cells incubated for 30 days by thioflavin T based fluorescence assay,4
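
The counts above show that the IC50 values come from several different assay protocols. One way to inspect this heterogeneity is to summarise the records per assay. This is a sketch under assumptions: the assay IDs, shortened descriptions, and values in the toy frame are placeholders, and the median is only one possible summary.

```python
import pandas as pd

# Toy records; IDs, descriptions, and values are illustrative placeholders.
toy = pd.DataFrame(
    {
        "assay_chembl_id": ["A1", "A1", "A2", "A2", "A2", "A3"],
        "assay_description": [
            "ThT binding",
            "ThT binding",
            "ThS filament",
            "ThS filament",
            "ThS filament",
            "FP self-aggregation",
        ],
        "value": [26000.0, 16000.0, 507.1, 1440.0, 900.0, 120.0],
    }
)

# One row per assay: its description, record count, and median reported value.
per_assay = (
    toy.groupby("assay_chembl_id")
    .agg(
        description=("assay_description", "first"),
        n_records=("value", "size"),
        median_value=("value", "median"),
    )
    .sort_values("n_records", ascending=False)
)
print(per_assay)
```

Grouping on `assay_chembl_id` rather than on the free-text description avoids splitting one assay into several groups when descriptions differ only in wording.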


Save the raw data to **bioactivity_data_raw.csv**.

In [29]:
df.to_csv('bioactivity_data_raw.csv', index=False)
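
A quick round trip can confirm that the file was written as expected. The sketch below uses a small made-up frame and a temporary directory so that it does not touch the real output file; the column names `molecule` and `value` are placeholders.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for the bioactivity table.
toy = pd.DataFrame({"molecule": ["m1", "m2"], "value": [26000.0, 507.1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bioactivity_data_raw.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare shape and column order between the original and the reloaded frame.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Note that list-valued columns such as `activity_properties` are written as their string representation, so after reloading they are plain strings rather than Python lists.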