<a href="https://colab.research.google.com/github/almunawaroh2020-maker/Drug-discovery-AI-course-2026/blob/main/Assignment_2_qsar_data_curation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **AI And Biotechnology/Bioinformatics**

## **AI and Drug Discovery Course: QSAR Modeling**
This notebook demonstrates how to collect and preprocess bioactivity data from ChEMBL for QSAR modeling.

# **Part 1: Data Collection & Curation**

**First, mount Google Drive in Google Colab so that files on Drive can be accessed from within the notebook.**

This allows us to:
* Save datasets
* Reload data across sessions
* Organize project files




In [116]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


**Now create a "data" folder inside the "Colab Notebooks" folder on Google Drive.**

In [117]:
! mkdir "/content/gdrive/My Drive/Colab Notebooks/data"

mkdir: cannot create directory ‘/content/gdrive/My Drive/Colab Notebooks/data’: File exists
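
The "File exists" message above is harmless: it only means the folder was already created in an earlier run. As a minimal sketch of a repeatable alternative, the Python standard library can create the folder without complaining when it already exists. The temporary directory below is a stand-in for the Drive path so that the snippet runs outside Colab as well.

```python
import os
import tempfile

# A temporary directory stands in for "/content/gdrive/My Drive/Colab Notebooks"
base_dir = tempfile.mkdtemp()
data_dir = os.path.join(base_dir, "data")

# exist_ok=True makes the call safe to repeat: no error if the folder already exists
os.makedirs(data_dir, exist_ok=True)
os.makedirs(data_dir, exist_ok=True)  # second call does nothing

print(os.path.isdir(data_dir))
```

In Colab, `base_dir` would be replaced by the mounted Drive folder used in this notebook.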


## Install and Import Required Libraries
We install the ChEMBL web service client package so that we can retrieve bioactivity data.

In [118]:
!pip install chembl_webresource_client



# Import Libraries
* `pandas` for data handling
* `new_client` from `chembl_webresource_client` for accessing the ChEMBL database

In [119]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

# Step 1: Search for Target Protein

## **Target Identification (KIT)**
Search ChEMBL for the KIT target and select the most relevant entry.


In [121]:
target = new_client.target
target_query = target.search("KIT")
targets = pd.DataFrame.from_dict(target_query)
targets.head(20)


Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,Mast/stem cell growth factor receptor Kit,21.0,False,CHEMBL2034798,"[{'accession': 'P05532', 'component_descriptio...",SINGLE PROTEIN,10090
1,[],Canis lupus familiaris,Mast/stem cell growth factor receptor Kit,21.0,False,CHEMBL5303563,"[{'accession': 'O97799', 'component_descriptio...",SINGLE PROTEIN,9615
2,[],Homo sapiens,Mast/stem cell growth factor receptor Kit,20.0,False,CHEMBL1936,"[{'accession': 'P10721', 'component_descriptio...",SINGLE PROTEIN,9606
3,[],Homo sapiens,Protein cereblon/Stem cell growth factor receptor,20.0,False,CHEMBL4630731,"[{'accession': 'P10721', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
4,[],Homo sapiens,Kit ligand,19.0,False,CHEMBL2346489,"[{'accession': 'P21583', 'component_descriptio...",SINGLE PROTEIN,9606
5,[],Homo sapiens,von Hippel-Lindau disease tumor suppressor/KIT,19.0,False,CHEMBL4523731,"[{'accession': 'P10721', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
6,[],Homo sapiens,VEGF-receptor 2 and stem cell growth factor re...,18.0,False,CHEMBL2111428,"[{'accession': 'P10721', 'component_descriptio...",SELECTIVITY GROUP,9606


**Select the human KIT target for bioactivity data retrieval**

In [122]:
selected_target = targets.target_chembl_id[2]
selected_target

'CHEMBL1936'
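
Selecting the target by position (`[2]`) depends on the order of the search results, which may change between ChEMBL releases. The sketch below shows one way to select the same entry by its attributes instead. It uses a small hand-built table (`targets_demo`, a hypothetical name) that copies a few columns from the search result shown above, so it runs without a network connection.

```python
import pandas as pd

# Toy copy of selected columns from the KIT search result shown above
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Canis lupus familiaris", "Homo sapiens",
                 "Homo sapiens", "Homo sapiens"],
    "pref_name": ["Mast/stem cell growth factor receptor Kit",
                  "Mast/stem cell growth factor receptor Kit",
                  "Mast/stem cell growth factor receptor Kit",
                  "Protein cereblon/Stem cell growth factor receptor",
                  "Kit ligand"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN", "SINGLE PROTEIN",
                    "PROTEIN-PROTEIN INTERACTION", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL2034798", "CHEMBL5303563", "CHEMBL1936",
                         "CHEMBL4630731", "CHEMBL2346489"],
})

# Keep only the human, single-protein entry with the expected preferred name
mask = (
    (targets_demo["organism"] == "Homo sapiens")
    & (targets_demo["target_type"] == "SINGLE PROTEIN")
    & (targets_demo["pref_name"] == "Mast/stem cell growth factor receptor Kit")
)
selected_demo = targets_demo.loc[mask, "target_chembl_id"].iloc[0]
print(selected_demo)  # CHEMBL1936
```

The same mask could be applied to the real `targets` DataFrame, since it contains these columns.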

**Now retrieve only the bioactivity data for the target Mast/stem cell growth factor receptor Kit (CHEMBL1936) with reported IC50 values in nM (nanomolar) units.**

In [124]:
activity = new_client.activity
results = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [125]:
df1 = pd.DataFrame.from_dict(results)
df1.head(5)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,653060,[],CHEMBL820421,Inhibition of c-Kit autophosphorylation in int...,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,nM,UO_0000065,,100.0
1,,,750646,[],CHEMBL702237,Inhibition of KIT kinase activity,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,nM,UO_0000065,,10000.0
2,,,866062,[],CHEMBL766073,Inhibition of chimeric PDGF receptor with c-ki...,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,uM,UO_0000065,,0.021
3,,,872531,[],CHEMBL766073,Inhibition of chimeric PDGF receptor with c-ki...,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,uM,UO_0000065,,0.17
4,,,872563,[],CHEMBL766073,Inhibition of chimeric PDGF receptor with c-ki...,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,uM,UO_0000065,,0.006


In [127]:
df1.standard_type.unique()

array(['IC50'], dtype=object)

**Finally, save the resulting bioactivity data to a CSV file, mastcell_kit_raw_data.csv.**

In [129]:
df1.to_csv('mastcell_kit_raw_data.csv', index=False)

**Now copy the "mastcell_kit_raw_data.csv" file to the "data" folder on Google Drive.**

In [131]:
! cp mastcell_kit_raw_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [133]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

total 8515
-rw------- 1 root root  366047 Jan 26 08:17 bioactivity_preprocessed_data.csv
-rw------- 1 root root 4176208 Jan 26 08:43 bioactivity_raw_data.csv
-rw------- 1 root root 4176208 Jan 26 08:43 mastcell_kit_raw_data.csv


In [135]:
! head mastcell_kit_raw_data.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,,653060,[],CHEMBL820421,Inhibition of c-Kit autophosphorylation in intact cells,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1cc2c(Oc3ccc(Nc4ccc(C(C)(C)C)cc4)cc3)ccnc2cc1OCCNCCO,,,CHEMBL1146677,Bioorg Med Chem Lett,2004,"{'bei': '13.95', 'le': '0.26', 'lle': '0.96', 'sei': '8.25'}",CHEMBL352308,,CHEMBL35

# **Step 3: Bioactivity Data Preprocessing (IC50)**
**Clean the IC50 bioactivity data retrieved for the selected KIT target.**

**Inspect Missing Values**

In [136]:
df1["standard_type"].isna().sum()

np.int64(0)
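
The check above covers only `standard_type`. Because the later steps rely on `standard_value`, `standard_units` and `canonical_smiles`, it can help to count missing entries in those columns too. The sketch below does this on a small made-up table (`activity_demo`, a hypothetical name with invented values), not on the real ChEMBL data.

```python
import pandas as pd

# Made-up records with one missing entry per column, for illustration only
activity_demo = pd.DataFrame({
    "standard_value": ["100.0", None, "21.0", "170.0"],
    "standard_units": ["nM", None, "nM", "nM"],
    "canonical_smiles": ["CCO", "c1ccccc1", None, "CCN"],
})

# Count missing entries in every column used by the later curation steps
key_columns = ["standard_value", "standard_units", "canonical_smiles"]
missing_counts = activity_demo[key_columns].isna().sum()
print(missing_counts)
```

Running the same two lines on `df1` would show how many rows each later filter is expected to remove.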

**Filter Rows with Valid Bioactivity Values**

In [137]:
df2 = df1[df1["standard_value"].notna()]
df2.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,653060,[],CHEMBL820421,Inhibition of c-Kit autophosphorylation in int...,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,nM,UO_0000065,,100.0
1,,,750646,[],CHEMBL702237,Inhibition of KIT kinase activity,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,nM,UO_0000065,,10000.0
2,,,866062,[],CHEMBL766073,Inhibition of chimeric PDGF receptor with c-ki...,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,uM,UO_0000065,,0.021
3,,,872531,[],CHEMBL766073,Inhibition of chimeric PDGF receptor with c-ki...,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,uM,UO_0000065,,0.17
4,,,872563,[],CHEMBL766073,Inhibition of chimeric PDGF receptor with c-ki...,B,,,BAO_0000190,...,Homo sapiens,Mast/stem cell growth factor receptor Kit,9606,,,IC50,uM,UO_0000065,,0.006


**Normalize IC50 Units and Remove Duplicates**

In [161]:
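df2 = df2.copy()
# The ChEMBL client may return values as strings, so convert before doing arithmetic
df2['standard_value'] = pd.to_numeric(df2['standard_value'], errors='coerce')
# Convert uM to nM; rows with other (or missing) units keep their original value
is_um = df2['standard_units'].fillna('').str.lower() == 'um'
df2['standard_value_nM'] = df2['standard_value'].where(~is_um, df2['standard_value'] * 1000)
# Remove duplicate structures, keeping the first record for each SMILES
df2 = df2.drop_duplicates(subset='canonical_smiles', keep='first')
df2[['molecule_chembl_id', 'canonical_smiles', 'standard_value',
     'standard_units', 'standard_value_nM']].head()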

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,standard_units,standard_value_nM
0,CHEMBL352308,COc1cc2c(Oc3ccc(Nc4ccc(C(C)(C)C)cc4)cc3)ccnc2c...,100.0,nM,100.0
1,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,10000.0,nM,10000.0
2,CHEMBL330863,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,21.0,nM,21.0
3,CHEMBL124660,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,170.0,nM,170.0
4,CHEMBL126699,COc1cc2c(N3CCN(C(=O)Nc4ccc(C#N)cc4)CC3)ncnc2cc...,6.0,nM,6.0
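
The normalization above handles only µM and leaves every other unit untouched. As a sketch of a more general approach, a lookup table of conversion factors can map several molar units to nM and flag anything unrecognized as missing. The set of units in `TO_NM` and the example values are assumptions chosen for illustration.

```python
import pandas as pd

# Conversion factors to nM; this list of units is an assumption, extend as needed
TO_NM = {"pm": 1e-3, "nm": 1.0, "um": 1e3, "mm": 1e6}

# Made-up values in mixed units, stored as strings as an API might return them
units_demo = pd.DataFrame({
    "standard_value": ["0.021", "100.0", "5", "3.2"],
    "standard_units": ["uM", "nM", "mM", "ug.mL-1"],
})

values = pd.to_numeric(units_demo["standard_value"], errors="coerce")
# Units missing from TO_NM map to NaN, so unconvertible rows become NaN
factors = units_demo["standard_units"].str.lower().map(TO_NM)
units_demo["standard_value_nM"] = values * factors
print(units_demo)
```

Rows whose `standard_value_nM` is NaN can then be reviewed or dropped explicitly instead of being silently treated as nM.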


**Assign Bioactivity Classes**
Define active, intermediate, and inactive classes based on IC50 values.


In [162]:
bioactivity_class = []
# Classify on the nM-normalized values: >= 10,000 nM inactive, <= 1,000 nM active
for value in df2.standard_value_nM:
    value = float(value)
    if value >= 10000:
        bioactivity_class.append("inactive")
    elif value <= 1000:
        bioactivity_class.append("active")
    else:
        bioactivity_class.append("intermediate")
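
The same thresholds can also be applied without an explicit loop. The sketch below uses `numpy.select` on a few made-up IC50 values (in nM) that cover both boundaries, so the behavior at exactly 1,000 nM and 10,000 nM is visible.

```python
import numpy as np
import pandas as pd

# Made-up IC50 values in nM, including both threshold values
ic50_nM = pd.Series([6.0, 100.0, 1000.0, 5000.0, 10000.0])

# Conditions are checked in order; anything matching neither is "intermediate"
conditions = [ic50_nM >= 10000, ic50_nM <= 1000]
labels = ["inactive", "active"]
classes = pd.Series(
    np.select(conditions, labels, default="intermediate"),
    index=ic50_nM.index,
)
print(classes.tolist())
# ['active', 'active', 'active', 'intermediate', 'inactive']
```

Because the result keeps the index of the input Series, it can be assigned directly as a new column of the source DataFrame.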

**Extract Relevant Columns**

In [163]:
molecule_ids = df2.molecule_chembl_id.tolist()
canonical_smiles = df2.canonical_smiles.tolist()
# Use the nM-normalized values so that the saved IC50 values share one unit
standard_values = df2.standard_value_nM.tolist()

In [164]:
data = list(zip(
    molecule_ids,
    canonical_smiles,
    standard_values,
    bioactivity_class,
))

**Create Preprocessed bioactivity Dataset**

In [165]:

df3 = pd.DataFrame(
    data,
    columns=[
        "molecule_chembl_id",
        "canonical_smiles",
        "standard_value",
        "bioactivity_class",
    ]
)
df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL352308,COc1cc2c(Oc3ccc(Nc4ccc(C(C)(C)C)cc4)cc3)ccnc2c...,100.0,active
1,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,10000.0,inactive
2,CHEMBL330863,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,21.0,active
3,CHEMBL124660,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,170.0,active
4,CHEMBL126699,COc1cc2c(N3CCN(C(=O)Nc4ccc(C#N)cc4)CC3)ncnc2cc...,6.0,active


**Remove Compounds without Valid SMILES**. Drop rows with **NaN**, **empty** or **None** SMILES values.

In [166]:
df3 = df3.dropna(subset=["canonical_smiles"])
df3 = df3[df3["canonical_smiles"].str.lower() != "none"]
df3 = df3[df3["canonical_smiles"].str.strip() != ""]
df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL352308,COc1cc2c(Oc3ccc(Nc4ccc(C(C)(C)C)cc4)cc3)ccnc2c...,100.0,active
1,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,10000.0,inactive
2,CHEMBL330863,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,21.0,active
3,CHEMBL124660,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,170.0,active
4,CHEMBL126699,COc1cc2c(N3CCN(C(=O)Nc4ccc(C#N)cc4)CC3)ncnc2cc...,6.0,active
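
To see what each of the three filters removes, the sketch below applies an equivalent combined mask to a small made-up table that contains one example of every invalid case. The molecule IDs are hypothetical placeholders.

```python
import pandas as pd

# Made-up rows: valid, NaN/None, the literal string "None", whitespace only, valid
smiles_demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D", "E"],  # hypothetical IDs
    "canonical_smiles": ["CCO", None, "None", "   ", "c1ccccc1"],
})

# Treat missing values as empty strings, then strip surrounding whitespace
cleaned = smiles_demo["canonical_smiles"].fillna("").str.strip()
# A SMILES is kept only if it is non-empty and not the text "none"
valid = (cleaned != "") & (cleaned.str.lower() != "none")
smiles_clean = smiles_demo[valid].reset_index(drop=True)

print(len(smiles_demo), "->", len(smiles_clean))  # 5 -> 2
```

This only screens out missing or placeholder strings; it does not verify that the remaining SMILES are chemically parseable.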


**Save Preprocessed Bioactivity Data.** Save the cleaned dataset to CSV and copy to Google Drive.

In [167]:
df3.to_csv("KIT_preprocessed_data.csv", index=False)

!cp KIT_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"
!ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_preprocessed_data.csv  KIT_preprocessed_data.csv
bioactivity_raw_data.csv	   mastcell_kit_raw_data.csv
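
Since the saved file is meant to be reloaded in later sessions, a quick round-trip check can confirm that it reads back as expected. The sketch below writes two of the rows shown above to a temporary folder (a stand-in for the Drive path) and reloads them with `pd.read_csv`.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the preprocessed table shown earlier in this notebook
saved_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL352308", "CHEMBL115220"],
    "canonical_smiles": [
        "COc1cc2c(Oc3ccc(Nc4ccc(C(C)(C)C)cc4)cc3)ccnc2cc1OCCNCCO",
        "O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1",
    ],
    "standard_value": [100.0, 10000.0],
    "bioactivity_class": ["active", "inactive"],
})

# A temporary folder stands in for the "data" folder on Google Drive
csv_path = os.path.join(tempfile.mkdtemp(), "KIT_preprocessed_data.csv")
saved_demo.to_csv(csv_path, index=False)

# Reload and confirm that the shape and column names survived the round trip
reloaded = pd.read_csv(csv_path)
print(reloaded.shape)  # (2, 4)
print(list(reloaded.columns) == list(saved_demo.columns))  # True
```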


## **End of Part 1: Data Collection and Curation**