# Computational Drug Discovery
## Download Bioactivity Data

In this Jupyter notebook, we will build a real-life data science project that you can include in your data science portfolio. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

### ChEMBL Database

The [ChEMBL Database](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2.4 million compounds. It is compiled from more than 89,000 documents and 1.6 million assays, and the data span 15,000 targets, 2,000 cells, and 48,000 indications. [Data as of Nov 05, 2024; ChEMBL version 34].

### Installing libraries

*pip install chembl-webresource-client*

### Importing libraries

In [1]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search for Target protein

### Target search for coronavirus

In [2]:
# Target search for coronavirus
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Coronavirus,Coronavirus,17.0,False,CHEMBL613732,[],ORGANISM,11119
1,[],Feline coronavirus,Feline coronavirus,14.0,False,CHEMBL612744,[],ORGANISM,12663
2,[],Murine coronavirus,Murine coronavirus,14.0,False,CHEMBL5209664,[],ORGANISM,694005
3,[],Canine coronavirus,Canine coronavirus,14.0,False,CHEMBL5291668,[],ORGANISM,11153
4,[],Human coronavirus 229E,Human coronavirus 229E,13.0,False,CHEMBL613837,[],ORGANISM,11137
5,[],Human coronavirus OC43,Human coronavirus OC43,13.0,False,CHEMBL5209665,[],ORGANISM,31631
6,"[{'xref_id': 'P0C6U8', 'xref_name': None, 'xre...",SARS coronavirus,SARS coronavirus 3C-like proteinase,10.0,False,CHEMBL3927,"[{'accession': 'P0C6U8', 'component_descriptio...",SINGLE PROTEIN,227859
7,[],Middle East respiratory syndrome-related coron...,Middle East respiratory syndrome-related coron...,9.0,False,CHEMBL4296578,[],ORGANISM,1335626
8,"[{'xref_id': 'P0C6X7', 'xref_name': None, 'xre...",SARS coronavirus,Replicase polyprotein 1ab,4.0,False,CHEMBL5118,"[{'accession': 'P0C6X7', 'component_descriptio...",SINGLE PROTEIN,227859
9,[],Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,4.0,False,CHEMBL4523582,"[{'accession': 'P0DTD1', 'component_descriptio...",SINGLE PROTEIN,2697049


### Select and retrieve bioactivity data for Replicase polyprotein 1ab (index 9)

We will assign the entry at index 9 (the last row of the table, which corresponds to the SARS-CoV-2 target protein, Replicase polyprotein 1ab) to the selected_target variable.

In [3]:
selected_target = targets.target_chembl_id[9]
selected_target

'CHEMBL4523582'
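Selecting by position depends on the order of the search results, which may change between ChEMBL releases. As a minimal sketch, assuming the results are available as a list of dictionaries, the target could instead be picked by organism and name. The `results` list below is a few rows copied from the table above and reduced to the relevant keys; `pick_target` is a hypothetical helper, not part of the ChEMBL client.

```python
# A few rows copied from the search output above (only the relevant keys)
results = [
    {"organism": "Coronavirus", "pref_name": "Coronavirus",
     "target_chembl_id": "CHEMBL613732", "target_type": "ORGANISM"},
    {"organism": "SARS coronavirus", "pref_name": "Replicase polyprotein 1ab",
     "target_chembl_id": "CHEMBL5118", "target_type": "SINGLE PROTEIN"},
    {"organism": "Severe acute respiratory syndrome coronavirus 2",
     "pref_name": "Replicase polyprotein 1ab",
     "target_chembl_id": "CHEMBL4523582", "target_type": "SINGLE PROTEIN"},
]


def pick_target(rows, organism, pref_name):
    """Return the ChEMBL ID of the single row matching organism and name."""
    matches = [
        row["target_chembl_id"]
        for row in rows
        if row["organism"] == organism and row["pref_name"] == pref_name
    ]
    # Fail loudly instead of silently picking the wrong target
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


selected = pick_target(
    results,
    "Severe acute respiratory syndrome coronavirus 2",
    "Replicase polyprotein 1ab",
)
print(selected)  # CHEMBL4523582
```

Matching on both fields matters here because two different organisms share the name Replicase polyprotein 1ab.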

Here, we will retrieve only the bioactivity data for Replicase polyprotein 1ab (CHEMBL4523582) that are reported as IC50 values. The standardized values (standard_value column) are given in nM (nanomolar).

In [4]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [5]:
df = pd.DataFrame.from_dict(res)
df.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04


In [6]:
df.shape

(1538, 46)

In [7]:
df.standard_type.unique()

array(['IC50'], dtype=object)

Finally, we will save the resulting bioactivity data to a CSV file, **RP1ab_bioactivity_data_raw.csv**.

In [None]:
#df.to_csv('RP1ab_bioactivity_data_raw.csv', index=False)

### Handling missing data

If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [8]:
# Keep only rows that have both a standard_value and a canonical_smiles entry
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1533,"{'action_type': 'INHIBITOR', 'description': 'N...",,25099146,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5260693,Inhibition of MBP tagged recombinant SARS-CoV-...,B,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.46
1534,"{'action_type': 'INHIBITOR', 'description': 'N...",,25099147,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5260694,Inhibition of SARS-CoV-2 RdRP using ATP substr...,B,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,3.22
1535,"{'action_type': 'INHIBITOR', 'description': 'N...",,25099148,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5260694,Inhibition of SARS-CoV-2 RdRP using ATP substr...,B,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.64
1536,"{'action_type': 'INHIBITOR', 'description': 'N...",,25099151,[],CHEMBL5260697,Inhibition of recombinant SARS-CoV-2 main prot...,B,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.7


In [9]:
len(df2.canonical_smiles.unique())

1136

In [10]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1530,"{'action_type': 'INHIBITOR', 'description': 'N...",,25096893,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5260257,Inhibition of full length SARS-CoV-2 3CL prote...,B,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,nM,UO_0000065,,18.06
1531,"{'action_type': 'INHIBITOR', 'description': 'N...",,25096894,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5260257,Inhibition of full length SARS-CoV-2 3CL prote...,B,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,nM,UO_0000065,,22.42
1532,"{'action_type': 'INHIBITOR', 'description': 'N...",,25099145,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5260693,Inhibition of MBP tagged recombinant SARS-CoV-...,B,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,8.06
1536,"{'action_type': 'INHIBITOR', 'description': 'N...",,25099151,[],CHEMBL5260697,Inhibition of recombinant SARS-CoV-2 main prot...,B,,,BAO_0000190,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.7
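The two cleaning steps above (dropping rows with missing fields, then keeping one row per SMILES string) can be tried on a small table first. This is a minimal sketch on made-up records, not real ChEMBL rows; the names `toy`, `clean` and `unique` are illustrative only.

```python
import pandas as pd

# Made-up records for illustration (not real ChEMBL data)
toy = pd.DataFrame({
    "molecule_chembl_id": ["M1", "M2", "M3", "M4", "M5"],
    "canonical_smiles": ["CCO", None, "CCO", "c1ccccc1", "CCN"],
    "standard_value": ["390.0", "210.0", "80.0", None, "40.0"],
})

# Drop rows that are missing either field (M2 has no SMILES, M4 has no value)
clean = toy.dropna(subset=["standard_value", "canonical_smiles"])

# Keep the first row for each SMILES string (M3 repeats the SMILES of M1)
unique = clean.drop_duplicates(["canonical_smiles"])

print(unique.molecule_chembl_id.tolist())  # ['M1', 'M5']
```

Note that `drop_duplicates` keeps the first occurrence by default, so any later measurements for the same structure are discarded.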


### Data pre-processing of the bioactivity data

Select the three columns of interest (molecule_chembl_id, canonical_smiles, standard_value) into a new DataFrame. The bioactivity class is added in the next section.

In [11]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL480,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,390.0
1,CHEMBL178459,Cc1c(-c2cnccn2)ssc1=S,210.0
2,CHEMBL3545157,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,80.0
3,CHEMBL297453,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,1580.0
4,CHEMBL4303595,O=C1C=Cc2cc(Br)ccc2C1=O,40.0
...,...,...,...
1530,CHEMBL5286307,CC(C)(C)[C@H](NC(=O)C(F)(F)F)C(=O)N1[C@@H]2CCC...,18.06
1531,CHEMBL5282079,CC(C)(C)[C@H](NS(=O)(=O)C(F)(F)F)C(=O)N1C[C@H]...,22.42
1532,CHEMBL5275584,Cn1cnc2c1c(=O)n(CC(=O)Nc1ccc(S(=O)(=O)Nc3ncccn...,8060.0
1536,CHEMBL5281103,Cn1c(=O)c2c(nc3n(CC#Cc4ccc(Cl)c(Cl)c4)ccn23)n(...,4700.0
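In the tables above, the value column is given in the reported units (uM or nM), while standard_value appears to be in nM throughout: 0.39 uM shows up as 390.0, and 18.06 nM stays 18.06. A minimal sketch of that conversion follows; `TO_NM` and `to_nanomolar` are hypothetical names, and the unit list is an assumption, not something taken from ChEMBL.

```python
# Multipliers from common concentration units to nanomolar (nM)
TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}


def to_nanomolar(value, units):
    """Convert a concentration to nM; raises KeyError for unknown units."""
    return float(value) * TO_NM[units]


# Values taken from the first and one of the last rows of the tables above
print(to_nanomolar(0.39, "uM"))
print(to_nanomolar(18.06, "nM"))
```

A check like this is one way to confirm that standard_value can be compared across rows regardless of the units in which each assay was originally reported.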


Save the DataFrame to a CSV file.

In [None]:
# The labeling section reads this file back in, so it must be written first
df3.to_csv('RP1ab_02_bioactivity_data_preprocessed.csv', index=False)

### Labeling compounds as either being active, inactive or intermediate

The **standard_value** column holds IC50 values in nM. Compounds with values of 1,000 nM or less are considered active, those with values of 10,000 nM or more are considered inactive, and those with values in between are labeled intermediate.

In [12]:
df4 = pd.read_csv('RP1ab_02_bioactivity_data_preprocessed.csv')

In [13]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
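The loop above can also be expressed as a small function, which makes the cut-offs easy to test on their own. This is a sketch using the same thresholds; `classify_ic50` is a hypothetical name introduced here for illustration.

```python
def classify_ic50(value_nm):
    """Label an IC50 value (in nM) using the same cut-offs as the loop above."""
    value_nm = float(value_nm)
    if value_nm >= 10000:
        return "inactive"
    if value_nm <= 1000:
        return "active"
    return "intermediate"


# Boundary values are included in the active and inactive classes
labels = [classify_ic50(v) for v in (390.0, 1000, 1580.0, 10000, "40.0")]
print(labels)  # ['active', 'active', 'intermediate', 'inactive', 'active']
```

With pandas, the same function could be applied to the whole column with `df4.standard_value.apply(classify_ic50)`, which returns a Series of labels.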

In [14]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL480,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,390.00,active
1,CHEMBL178459,Cc1c(-c2cnccn2)ssc1=S,210.00,active
2,CHEMBL3545157,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,80.00,active
3,CHEMBL297453,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,1580.00,intermediate
4,CHEMBL4303595,O=C1C=Cc2cc(Br)ccc2C1=O,40.00,active
...,...,...,...,...
1131,CHEMBL5286307,CC(C)(C)[C@H](NC(=O)C(F)(F)F)C(=O)N1[C@@H]2CCC...,18.06,active
1132,CHEMBL5282079,CC(C)(C)[C@H](NS(=O)(=O)C(F)(F)F)C(=O)N1C[C@H]...,22.42,active
1133,CHEMBL5275584,Cn1cnc2c1c(=O)n(CC(=O)Nc1ccc(S(=O)(=O)Nc3ncccn...,8060.00,intermediate
1134,CHEMBL5281103,Cn1c(=O)c2c(nc3n(CC#Cc4ccc(Cl)c(Cl)c4)ccn23)n(...,4700.00,intermediate


Save the DataFrame to a CSV file.

In [None]:
#df5.to_csv('RP1ab_03_bioactivity_data_curated.csv', index=False)
