## COVID-19: Computational Drug Discovery [Download Bioactivity Data][Part 1]

This is an attempt to find an FDA-approved compound or molecule that will inhibit the function of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2).

An Otsogile Onalepelo Project aka Morena!

## ChEMBL Database
ChEMBL is a database of curated bioactivity data for more than 2 million compounds. It is the data source for this project.

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.
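
The install cell itself is not shown above. As a minimal sketch (not part of the original notebook), the check below uses only the standard library to test whether the package is importable before the imports in the next section run. The helper name `is_installed` is hypothetical; the package name `chembl_webresource_client` is taken from the import statement in the next cell.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Package name taken from the import used in this notebook.
if not is_installed("chembl_webresource_client"):
    print("Package missing. In a notebook cell, run: ! pip install chembl_webresource_client")
```

This only reports whether the package is present; it does not install anything.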

## **Importing libraries**

In [1]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search for Target protein

A target here refers to the protein or organism that the drug will act on. The compound comes into contact with the target protein or organism and modulates its activity, either activating or inhibiting it. For this project, I will be working with SARS-CoV-2.

### **Target search for coronavirus**

In [2]:
# Target: search for coronavirus
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Coronavirus,Coronavirus,17.0,False,CHEMBL613732,[],ORGANISM,11119
1,[],SARS coronavirus,SARS coronavirus,15.0,False,CHEMBL612575,[],ORGANISM,227859
2,[],Feline coronavirus,Feline coronavirus,15.0,False,CHEMBL612744,[],ORGANISM,12663
3,[],Human coronavirus 229E,Human coronavirus 229E,13.0,False,CHEMBL613837,[],ORGANISM,11137
4,"[{'xref_id': 'P0C6U8', 'xref_name': None, 'xre...",SARS coronavirus,SARS coronavirus 3C-like proteinase,10.0,False,CHEMBL3927,"[{'accession': 'P0C6U8', 'component_descriptio...",SINGLE PROTEIN,227859
5,[],Middle East respiratory syndrome-related coron...,Middle East respiratory syndrome-related coron...,9.0,False,CHEMBL4296578,[],ORGANISM,1335626
6,"[{'xref_id': 'P0C6X7', 'xref_name': None, 'xre...",SARS coronavirus,Replicase polyprotein 1ab,4.0,False,CHEMBL5118,"[{'accession': 'P0C6X7', 'component_descriptio...",SINGLE PROTEIN,227859
7,[],Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,4.0,False,CHEMBL4523582,"[{'accession': 'P0DTD1', 'component_descriptio...",SINGLE PROTEIN,2697049


## **Select and retrieve bioactivity data for *SARS Coronavirus 2 Replicase polyprotein 1ab* (index 7)**

Replicase polyprotein 1ab is the protein responsible for replication and transcription of the viral RNA genome. We will assign the ChEMBL ID of the entry at index 7 (the eighth row, which corresponds to the target protein *Replicase polyprotein 1ab* of SARS-CoV-2) to the ***selected_target*** variable.

In [3]:
# ChEMBL ID that uniquely identifies the selected target
selected_target = targets.target_chembl_id[7]
selected_target

'CHEMBL4523582'
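
Selecting by position works for the search result shown above, but the row order returned by a search is not guaranteed to stay the same. A possible alternative, sketched here under that assumption, is to select the row by its `organism` and `pref_name` values. The `targets_demo` frame is a hand-made stand-in containing three rows copied from the output above; it is not a live query.

```python
import pandas as pd

# Toy stand-in for the `targets` DataFrame (rows copied from the search output).
targets_demo = pd.DataFrame(
    {
        "organism": [
            "SARS coronavirus",
            "SARS coronavirus",
            "Severe acute respiratory syndrome coronavirus 2",
        ],
        "pref_name": [
            "SARS coronavirus 3C-like proteinase",
            "Replicase polyprotein 1ab",
            "Replicase polyprotein 1ab",
        ],
        "target_chembl_id": ["CHEMBL3927", "CHEMBL5118", "CHEMBL4523582"],
    }
)

# Match on both organism and protein name, since two organisms share the protein name.
mask = (
    (targets_demo["organism"] == "Severe acute respiratory syndrome coronavirus 2")
    & (targets_demo["pref_name"] == "Replicase polyprotein 1ab")
)
matches = targets_demo.loc[mask, "target_chembl_id"]
selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

Matching on both columns matters here because *Replicase polyprotein 1ab* appears for both SARS coronavirus and SARS-CoV-2.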

Retrieve only the bioactivity data for *Replicase polyprotein 1ab* (CHEMBL4523582) that are reported as IC$_{50}$ values in nM (nanomolar) units.

In [4]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [5]:
df = pd.DataFrame.from_dict(res)

In [6]:
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08


Let's see which standard type values are present in the DataFrame above.

In [7]:
df.standard_type.unique()

array(['IC50'], dtype=object)

For this dataset it would not have mattered if we had gone ahead without filtering, because it contains only IC50 measurements. For other datasets this might not be the case: EC50 or % activity values could be mixed in with IC50 values, and such a dataset would have to be filtered. The filter simply standardizes our dataset so that we do not end up with a mixture of bioactivity measurement types.
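
To illustrate the mixed case described above, here is a small sketch on made-up data. The molecule IDs and values are hypothetical, and the filtering is done in pandas after download rather than through the web service.

```python
import pandas as pd

# Hypothetical records mixing several measurement types.
mixed = pd.DataFrame(
    {
        "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C", "MOL_D"],  # made-up IDs
        "standard_type": ["IC50", "EC50", "IC50", "Inhibition"],
        "standard_value": [390.0, 120.0, 1580.0, 45.0],
    }
)

# Inspect which measurement types are present before deciding how to filter.
print(mixed["standard_type"].value_counts())

# Keep only the IC50 rows so that all values are comparable.
ic50_only = mixed[mixed["standard_type"] == "IC50"].reset_index(drop=True)
print(ic50_only)
```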

## Data Exploration
If any compound has a missing value in the **standard_value**, **canonical_smiles** or **molecule_chembl_id** column, drop it. These are the features that will be extracted from this bioactivity data.

Standard value is the potency of the drug: the lower the number, the better. A low value means the half-maximal inhibitory concentration (IC50) is low. In other words, only a small concentration of the drug is needed to elicit 50% inhibition of the target protein. A high value means more of the drug is required to produce the same 50% inhibition. As a rough analogy, it is the difference between taking 5 ml of a medication and having to take 5 l of it, which is impractical.
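
The "lower is better" idea can be made concrete with a short sketch. The compound names below are hypothetical; the three values are taken from the `value` column shown earlier, which is in µM. Multiplying by 1000 to obtain nM is consistent with the first row of the data, where a `value` of 0.39 µM corresponds to a `standard_value` of 390.0 in the final table, but treat the conversion as an illustration rather than a description of how ChEMBL computes the column.

```python
def micromolar_to_nanomolar(value_um: float) -> float:
    """Convert a concentration from micromolar to nanomolar (1 uM = 1000 nM)."""
    return value_um * 1000.0


# Hypothetical compound names; IC50 values in uM copied from the table above.
ic50_um = {"compound_a": 0.39, "compound_b": 1.58, "compound_c": 0.04}
ic50_nm = {name: micromolar_to_nanomolar(v) for name, v in ic50_um.items()}

# Rank from most potent (lowest IC50) to least potent (highest IC50).
ranked = sorted(ic50_nm, key=ic50_nm.get)
print(ranked)
```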


Canonical SMILES: this is a text encoding of the chemical structure of the compound or molecule, describing its atoms and how they are connected. A molecule or compound here means a drug. These are capable of producing a modulatory activity; in other words, they exert some effect (inhibition or activation) on the target protein. We are going to use the canonical SMILES as input to calculate the molecular descriptors. Molecular descriptors are what distinguish one molecule from another.

Molecule ID: this is self-explanatory. Each compound is identified by a ChEMBL molecule ID.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 45 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   activity_comment           117 non-null    object
 1   activity_id                117 non-null    int64 
 2   activity_properties        117 non-null    object
 3   assay_chembl_id            117 non-null    object
 4   assay_description          117 non-null    object
 5   assay_type                 117 non-null    object
 6   assay_variant_accession    0 non-null      object
 7   assay_variant_mutation     0 non-null      object
 8   bao_endpoint               117 non-null    object
 9   bao_format                 117 non-null    object
 10  bao_label                  117 non-null    object
 11  canonical_smiles           110 non-null    object
 12  data_validity_comment      0 non-null      object
 13  data_validity_description  0 non-null      object
 14  document_c

In [9]:
df.standard_value.notna()

0      True
1      True
2      True
3      True
4      True
       ... 
112    True
113    True
114    True
115    True
116    True
Name: standard_value, Length: 117, dtype: bool

In [10]:
df.isna().sum()

activity_comment               0
activity_id                    0
activity_properties            0
assay_chembl_id                0
assay_description              0
assay_type                     0
assay_variant_accession      117
assay_variant_mutation       117
bao_endpoint                   0
bao_format                     0
bao_label                      0
canonical_smiles               7
data_validity_comment        117
data_validity_description    117
document_chembl_id             0
document_journal             117
document_year                  0
ligand_efficiency            117
molecule_chembl_id             0
molecule_pref_name            53
parent_molecule_chembl_id      0
pchembl_value                  3
potential_duplicate            0
qudt_units                     0
record_id                      0
relation                       0
src_id                         0
standard_flag                  0
standard_relation              0
standard_text_value          117
standard_t

In [11]:
# Export the DataFrame so it can be viewed as a spreadsheet
df.to_csv('bioactivity_raw_data.csv', index=False)

In [12]:
df.describe()

Unnamed: 0,activity_id,document_year,potential_duplicate,record_id,src_id,standard_flag
count,117.0,117.0,117.0,117.0,117.0,117.0
mean,19964260.0,2020.0,0.0,3346515.0,52.0,1.0
std,33.91902,0.0,0.0,2366.436,0.0,0.0
min,19964200.0,2020.0,0.0,3341963.0,52.0,1.0
25%,19964230.0,2020.0,0.0,3344721.0,52.0,1.0
50%,19964260.0,2020.0,0.0,3346744.0,52.0,1.0
75%,19964290.0,2020.0,0.0,3348347.0,52.0,1.0
max,19964320.0,2020.0,0.0,3350610.0,52.0,1.0


From the data exploration above, some features have missing values. The columns shown have two data types, int64 and object. Since we do not need all the features for this project, we will only drop rows with missing values in canonical_smiles, standard_value and molecule_chembl_id. These are the only features we keep; the rest are dropped.


## Handling Missing Values

In [13]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112,Dtt Insensitive,19964311,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.24
113,Dtt Insensitive,19964312,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.98
114,Dtt Insensitive,19964313,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.75
115,Dtt Insensitive,19964314,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.88


In [14]:
df3 = df2[df2.canonical_smiles.notna()]
df3

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111,Dtt Insensitive,19964310,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.36
112,Dtt Insensitive,19964311,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.24
113,Dtt Insensitive,19964312,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.98
114,Dtt Insensitive,19964313,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.75


There were 7 missing values for canonical_smiles (see the `df.isna().sum()` output), and those rows were dropped from the DataFrame above.

In [15]:
df4 = df3[df3.molecule_chembl_id.notna()]
df4

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111,Dtt Insensitive,19964310,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.36
112,Dtt Insensitive,19964311,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.24
113,Dtt Insensitive,19964312,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.98
114,Dtt Insensitive,19964313,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.75
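
The three `notna()` steps above could also be written as a single call. The sketch below shows this with `DataFrame.dropna(subset=...)` on a tiny made-up frame; the molecule IDs are hypothetical and the values are stored as strings as an assumption.

```python
import pandas as pd

# Columns that must be present for a row to be usable downstream.
required = ["standard_value", "canonical_smiles", "molecule_chembl_id"]

# Made-up rows: one complete, one without SMILES, one without a standard value.
demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
        "canonical_smiles": ["CCO", None, "c1ccccc1"],
        "standard_value": ["390.0", "210.0", None],
    }
)

# Drop any row that has a missing value in at least one required column.
cleaned = demo.dropna(subset=required)
print(cleaned)
```

Applied to the real data, the equivalent would be `df.dropna(subset=required)`, which should leave the same rows as `df4`.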


## Data pre-processing of the bioactivity data

### Labeling compounds as either active or inactive
The bioactivity data are reported as IC50 values. IC50 is a quantitative measure that indicates how much of a particular inhibitory substance (e.g. a drug) is needed to inhibit, in vitro, a given biological process or biological component by 50%.

Compounds with values below 1000 nM will be considered **active**, while those at or above 1000 nM will be considered **inactive**.

In [16]:
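# Label each compound by its IC50 (standard_value, in nM)
bioactivity_class = []
for i in df4.standard_value:
  if float(i) >= 1000:
    # At or above the 1000 nM threshold
    bioactivity_class.append("inactive")
  else:
    # Below the 1000 nM threshold
    bioactivity_class.append("active")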
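
The labeling rule can also be wrapped in a small function, which makes the behaviour at exactly 1000 nM easy to check. This is a sketch; the names `classify_ic50` and `ACTIVITY_THRESHOLD_NM` are hypothetical, and the inputs are passed as strings on the assumption that `standard_value` is stored as text (which is why the loop above calls `float`).

```python
ACTIVITY_THRESHOLD_NM = 1000.0  # cut-off used in this notebook


def classify_ic50(standard_value, threshold=ACTIVITY_THRESHOLD_NM):
    """Return 'active' for IC50 values below the threshold, otherwise 'inactive'."""
    return "active" if float(standard_value) < threshold else "inactive"


# Two values from the dataset plus the boundary case.
examples = ["390.0", "1580.0", "1000.0"]
labels = [classify_ic50(v) for v in examples]
print(labels)
```

A value of exactly 1000 nM is labeled inactive, matching the `>= 1000` check in the loop above.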

### Copy *molecule_chembl_id* to a list

In [17]:
# list() collects the column values in order, same result as appending in a loop
mol_cid = list(df4.molecule_chembl_id)

### Copy *canonical_smiles* to a list

In [18]:
# list() collects the column values in order, same result as appending in a loop
canonical_smiles = list(df4.canonical_smiles)

### Copy *standard_value* to a list

In [19]:
# list() collects the column values in order, same result as appending in a loop
standard_value = list(df4.standard_value)

### **Combine the 4 lists into a dataframe**

In [20]:
# Zip the four lists row-wise, then build a DataFrame with named columns
data_tuples = list(zip(mol_cid, canonical_smiles, standard_value, bioactivity_class))
df5 = pd.DataFrame(data_tuples, columns=['molecule_chembl_id', 'canonical_smiles', 'standard_value', 'bioactivity_class'])
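
An alternative to building four lists and zipping them is to select the three columns directly and add the class label as a new column. The sketch below uses a toy frame, `df4_demo`, with three rows copied from the final table and one extra column to show that unneeded columns are left behind. Storing `standard_value` as strings is an assumption.

```python
import pandas as pd

# Toy stand-in for df4: three rows from the dataset plus one column we do not need.
df4_demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL480", "CHEMBL178459", "CHEMBL154580"],
        "canonical_smiles": [
            "Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1",
            "Cc1c(-c2cnccn2)ssc1=S",
            "C=CC(=O)c1ccc2ccccc2c1",
        ],
        "standard_value": ["390.0", "210.0", "1240.0"],
        "assay_type": ["F", "F", "F"],
    }
)

# Keep only the three features of interest; copy() avoids modifying df4_demo.
columns = ["molecule_chembl_id", "canonical_smiles", "standard_value"]
df5_demo = df4_demo[columns].copy()

# Add the label column using the same 1000 nM rule as above.
df5_demo["bioactivity_class"] = [
    "inactive" if float(v) >= 1000 else "active" for v in df5_demo["standard_value"]
]
print(df5_demo)
```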

In [21]:
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL480,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,390.0,active
1,CHEMBL178459,Cc1c(-c2cnccn2)ssc1=S,210.0,active
2,CHEMBL3545157,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,80.0,active
3,CHEMBL297453,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,1580.0,inactive
4,CHEMBL4303595,O=C1C=Cc2cc(Br)ccc2C1=O,40.0,active
...,...,...,...,...
105,CHEMBL376488,COc1nc2ccc(Br)cc2cc1[C@@H](c1ccccc1)[C@@](O)(C...,4360.0,inactive
106,CHEMBL154580,C=CC(=O)c1ccc2ccccc2c1,1240.0,inactive
107,CHEMBL354349,C[n+]1c2cc(N)ccc2cc2ccc(N)cc21.[Cl-],4980.0,inactive
108,CHEMBL1382627,Nc1ccc(S(=O)(=O)[N-]c2ncccn2)cc1.[Ag+],750.0,active


In [22]:
df5.bioactivity_class.unique()

array(['active', 'inactive'], dtype=object)

Let's save our data for Part 2.

In [23]:
df5.to_csv('bioactivity_preprocessed_data.csv', index=False)