## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.

[Data as of March 25, 2020; ChEMBL version 26].

---

## **Importing libraries**

In [1]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

A target is the protein whose activity the drug is meant to regulate.

In [2]:
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets

| | cross_references | organism | pref_name | score | species_group_flag | target_chembl_id | target_components | target_type | tax_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | Coronavirus | Coronavirus | 17.0 | False | CHEMBL613732 | [] | ORGANISM | 11119 |
| 1 | [] | Feline coronavirus | Feline coronavirus | 14.0 | False | CHEMBL612744 | [] | ORGANISM | 12663 |
| 2 | [] | Murine coronavirus | Murine coronavirus | 14.0 | False | CHEMBL5209664 | [] | ORGANISM | 694005 |
| 3 | [] | Canine coronavirus | Canine coronavirus | 14.0 | False | CHEMBL5291668 | [] | ORGANISM | 11153 |
| 4 | [] | Human coronavirus 229E | Human coronavirus 229E | 13.0 | False | CHEMBL613837 | [] | ORGANISM | 11137 |
| 5 | [] | Human coronavirus OC43 | Human coronavirus OC43 | 13.0 | False | CHEMBL5209665 | [] | ORGANISM | 31631 |
| 6 | [{'xref_id': 'P0C6U8', 'xref_name': None, 'xre... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 10.0 | False | CHEMBL3927 | [{'accession': 'P0C6U8', 'component_descriptio... | SINGLE PROTEIN | 227859 |
| 7 | [] | Middle East respiratory syndrome-related coron... | Middle East respiratory syndrome-related coron... | 9.0 | False | CHEMBL4296578 | [] | ORGANISM | 1335626 |
| 8 | [{'xref_id': 'P0C6X7', 'xref_name': None, 'xre... | SARS coronavirus | Replicase polyprotein 1ab | 4.0 | False | CHEMBL5118 | [{'accession': 'P0C6X7', 'component_descriptio... | SINGLE PROTEIN | 227859 |
| 9 | [] | Severe acute respiratory syndrome coronavirus 2 | Replicase polyprotein 1ab | 4.0 | False | CHEMBL4523582 | [{'accession': 'P0DTD1', 'component_descriptio... | SINGLE PROTEIN | 2697049 |


### **Select and retrieve bioactivity data for *SARS coronavirus 3C-like proteinase* (row index 6)**

In [3]:
# The selected target's index changed from 4 (as presented in class) to 6.
# For a robust pipeline, it would be better to select the 3C-like proteinase with a filter other than the dataframe index.
selected_target = targets.target_chembl_id[6] 
selected_target

'CHEMBL3927'
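The comment in the cell above suggests selecting the target by something more stable than its row position. One possible sketch, using a small stand-in for the search results (three of the rows shown earlier, rebuilt by hand) and filtering on `pref_name` and `target_type`. The exact-one-match check is an added assumption about how strict the pipeline should be.

```python
import pandas as pd

# Hand-built stand-in for three rows of the search results shown earlier.
targets = pd.DataFrame({
    "pref_name": [
        "Coronavirus",
        "SARS coronavirus 3C-like proteinase",
        "Replicase polyprotein 1ab",
    ],
    "target_type": ["ORGANISM", "SINGLE PROTEIN", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL613732", "CHEMBL3927", "CHEMBL5118"],
})

# Select by content instead of by position in the dataframe.
mask = (
    (targets["target_type"] == "SINGLE PROTEIN")
    & targets["pref_name"].str.contains("3C-like proteinase", regex=False)
)
matches = targets.loc[mask, "target_chembl_id"]

# Fail loudly if the search ever returns zero or several matching rows.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target = matches.iloc[0]
print(selected_target)
```

With this approach the selection keeps working if ChEMBL reorders the search results again.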

Here, we will retrieve only the bioactivity data for *coronavirus 3C-like proteinase* (CHEMBL3927) that are reported as IC$_{50}$ values in nM (nanomolar) units.

In [4]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [5]:
df = pd.DataFrame.from_dict(res)

In [6]:
df.head()

| | action_type | activity_comment | activity_id | activity_properties | assay_chembl_id | assay_description | assay_type | assay_variant_accession | assay_variant_mutation | bao_endpoint | ... | target_organism | target_pref_name | target_tax_id | text_value | toid | type | units | uo_units | upper_value | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 1480935 | [] | CHEMBL829584 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 7.2 |
| 1 | | | 1480936 | [] | CHEMBL829584 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 9.4 |
| 2 | | | 1481061 | [] | CHEMBL830868 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 13.5 |
| 3 | | | 1481065 | [] | CHEMBL829584 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 13.11 |
| 4 | | | 1481066 | [] | CHEMBL829584 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 2.0 |


Here we'll be looking at the `standard_value` column, where the lower the value, the more potent the compound. IC50 measures the concentration needed to inhibit 50% of the target protein's activity.
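The rows shown above report `value` in uM, while the `standard_value` figures used later in this notebook are in nM. For these five rows the two agree under a plain uM to nM conversion, which can be sketched as follows. The column name `value_nm` is a hypothetical name used only for this illustration, and the sketch is not a description of how ChEMBL standardizes values internally.

```python
import pandas as pd

UM_TO_NM = 1000  # 1 micromolar = 1,000 nanomolar

# The five reported IC50 values from the df.head() output above.
reported = pd.DataFrame({
    "value": [7.2, 9.4, 13.5, 13.11, 2.0],
    "units": ["uM"] * 5,
})

# Hypothetical helper column: the same measurements expressed in nM.
reported["value_nm"] = reported["value"] * UM_TO_NM
print(reported)
```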

## **Saving bioactivity data locally**
Technically we could work with just the pandas dataframe, but for the sake of the class I'll be saving it locally.
The original class converts it into a CSV. I'll be converting it to a Parquet file instead.

In [7]:
df.to_parquet('datasets/3c_like_proteinase_bioactivity.parquet')

---

## **Pre-processing data**

Handling missing values: dropping rows where standard_value is NaN

In [8]:
df_cleaned = df[df.standard_value.notna()]
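As a small illustration of this step, the boolean mask above gives the same result as `dropna` restricted to that one column. The toy dataframe below uses placeholder IDs and string values. Both are assumptions made for the example and are not data from ChEMBL.

```python
import pandas as pd

# Toy dataframe with hypothetical placeholder IDs and one missing value.
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C"],
    "standard_value": ["7200.0", None, "2000.0"],
})

# Approach used in the notebook: keep rows where standard_value is present.
by_mask = toy[toy.standard_value.notna()]

# Equivalent spelling: drop rows that are missing in that column only.
by_dropna = toy.dropna(subset=["standard_value"])

print(by_mask.equals(by_dropna))
print(len(toy) - len(by_dropna), "row(s) dropped")
```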

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values between 1,000 and 10,000 nM will be referred to as **intermediate**.

*Note*

In this part of the class notebook, the professor looped over each column he wanted, collected the values into lists, and combined them into a new dataframe. I think it is much easier and cleaner to just subset the cleaned dataframe.

In [9]:
important_features = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']

# Make a copy of the dataframe slice to avoid SettingWithCopyWarning
subset_df = df_cleaned[important_features].copy()

# Define a function for bioactivity classification
def classify_bioactivity(standard_value):
    '''
    IC50 ≥ 10,000 nM: The compound is classified as "inactive". Requires a high concentration to inhibit the target.
    1,000 nM < IC50 < 10,000 nM: The compound is considered "intermediate", showing moderate bioactivity.
    IC50 ≤ 1,000 nM: The compound is classified as "active" due to its low IC50. Potent and effective at lower concentrations.
    '''
    if float(standard_value) >= 10000:
        return "inactive"
    elif float(standard_value) <= 1000:
        return "active"
    else:
        return "intermediate"

# Apply the function to the 'standard_value' column and create a new column for the classification
subset_df.loc[:, 'bioactivity_class'] = subset_df['standard_value'].apply(classify_bioactivity)

# Converting 'standard_value' to float
subset_df['standard_value'] = pd.to_numeric(subset_df['standard_value'], errors='coerce')

In [10]:
display(subset_df.head())

| | molecule_chembl_id | canonical_smiles | standard_value | bioactivity_class |
|---|---|---|---|---|
| 0 | CHEMBL187579 | Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21 | 7200.0 | intermediate |
| 1 | CHEMBL188487 | O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21 | 9400.0 | intermediate |
| 2 | CHEMBL185698 | O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21 | 13500.0 | inactive |
| 3 | CHEMBL426082 | O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21 | 13110.0 | inactive |
| 4 | CHEMBL187717 | O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-] | 2000.0 | intermediate |
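The same classification rule can also be written without a row-by-row `apply`. The sketch below is an alternative (not the approach used in the cell above) that relies on `numpy.select`. It assumes the values are already numeric and that missing values were dropped beforehand, since a NaN would fall through to the default label.

```python
import numpy as np
import pandas as pd

# The five standard_value entries shown above, plus the two boundary values.
values = pd.Series([7200.0, 9400.0, 13500.0, 13110.0, 2000.0, 1000.0, 10000.0])

# Same thresholds as classify_bioactivity: >= 10,000 nM inactive, <= 1,000 nM active.
conditions = [values >= 10000, values <= 1000]
labels = ["inactive", "active"]

# Anything matching neither condition is labeled "intermediate".
classes = pd.Series(
    np.select(conditions, labels, default="intermediate"),
    index=values.index,
)
print(classes.tolist())
```

On a few thousand rows the difference in speed is unlikely to matter, so the choice between the two is mostly about readability.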


**Why these columns?**
+ *molecule_chembl_id:* The ChEMBL identifier of the molecule.
+ *canonical_smiles:* A string representation of the molecular structure. It encodes atoms, bonds and overall connectivity.
+ *standard_value:* The IC50 value in nM, a measure of the compound's effectiveness in inhibiting a biological process.
+ *bioactivity_class:* The label (active, intermediate or inactive) derived from `standard_value`.

In [11]:
subset_df.to_parquet('datasets/3c_like_proteinase_bioactivity_clean.parquet')