### __1. Data acquisition__

This notebook acquires bioactivity data from ChEMBL for the malaria drug target Plasmodium falciparum dihydroorotate dehydrogenase (PfDHODH). The data are retrieved directly from the ChEMBL database with chembl_webresource_client.new_client. I then use pandas, NumPy, and RDKit to manipulate and analyze the data, starting with filtering and possibly extending to feature engineering later.

First, import the libraries and modules needed to search ChEMBL for the PfDHODH target and its associated compound data.

In [26]:
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical operations
from chembl_webresource_client.new_client import new_client # For accessing ChEMBL database
from rdkit.Chem import PandasTools # For handling chemical structures
from zipfile import ZipFile # For handling zip files

##### __1.1 Search for the target, PfDHODH, in ChEMBL__



In [27]:
target_protein = new_client.target # Handle for the ChEMBL target resource
target_query = target_protein.search('Dihydroorotate dehydrogenase').filter(target_type='SINGLE PROTEIN').filter(organism='Plasmodium falciparum') # Search by name, then keep single-protein targets from P. falciparum
target_query # Display the result: a list-like query set with one dictionary per matching target

[{'cross_references': [{'xref_id': 'Q54A96', 'xref_name': None, 'xref_src': 'canSAR-Target'}], 'organism': 'Plasmodium falciparum', 'pref_name': 'Dihydroorotate dehydrogenase', 'score': 24.0, 'species_group_flag': False, 'target_chembl_id': 'CHEMBL3486', 'target_components': [{'accession': 'Q54A96', 'component_description': 'Dihydroorotate dehydrogenase (quinone), mitochondrial', 'component_id': 1807, 'component_type': 'PROTEIN', 'relationship': 'SINGLE PROTEIN', 'target_component_synonyms': [{'component_synonym': 'dhod', 'syn_type': 'GENE_SYMBOL'}, {'component_synonym': 'Dihydroorotate dehydrogenase (quinone), mitochondrial', 'syn_type': 'UNIPROT'}, {'component_synonym': '1.3.5.2', 'syn_type': 'EC_NUMBER'}], 'target_component_xrefs': [{'xref_id': 'GO:0005737', 'xref_name': 'cytoplasm', 'xref_src_db': 'GoComponent'}, {'xref_id': 'GO:0005743', 'xref_name': 'mitochondrial inner membrane', 'xref_src_db': 'GoComponent'}, {'xref_id': 'GO:0005886', 'xref_name': 'plasma membrane', 'xref_src_d

In [28]:
# Convert the query result to a DataFrame and display it
data = pd.DataFrame.from_dict(target_query) # One row per target dictionary returned by the query
data # Display the DataFrame

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'Q54A96', 'xref_name': None, 'xre...",Plasmodium falciparum,Dihydroorotate dehydrogenase,24.0,False,CHEMBL3486,"[{'accession': 'Q54A96', 'component_descriptio...",SINGLE PROTEIN,5833
1,"[{'xref_id': 'Q6KCB6', 'xref_name': None, 'xre...",Plasmodium falciparum,Dihydrolipoyl dehydrogenase,9.0,False,CHEMBL6147,"[{'accession': 'Q6KCB6', 'component_descriptio...",SINGLE PROTEIN,5833
2,[],Plasmodium falciparum,Lactate dehydrogenase,9.0,False,CHEMBL6071,"[{'accession': 'Q0PJ46', 'component_descriptio...",SINGLE PROTEIN,5833
3,"[{'xref_id': 'Q6VVP7', 'xref_name': None, 'xre...",Plasmodium falciparum,Malate dehydrogenase,9.0,False,CHEMBL6072,"[{'accession': 'Q6VVP7', 'component_descriptio...",SINGLE PROTEIN,5833
4,[],Plasmodium falciparum,Glucose-6-phosphate 1-dehydrogenase,8.0,False,CHEMBL2169726,"[{'accession': 'Q25856', 'component_descriptio...",SINGLE PROTEIN,5833
5,[],Plasmodium falciparum,Glyceraldehyde-3-phosphate dehydrogenase,8.0,False,CHEMBL3559677,"[{'accession': 'Q8T6B1', 'component_descriptio...",SINGLE PROTEIN,5833
6,"[{'xref_id': 'Q965D6', 'xref_name': None, 'xre...",Plasmodium falciparum,3-oxoacyl-acyl-carrier protein reductase,2.0,False,CHEMBL4513,"[{'accession': 'Q965D6', 'component_descriptio...",SINGLE PROTEIN,5833
7,"[{'xref_id': 'Q965D5', 'xref_name': None, 'xre...",Plasmodium falciparum,Enoyl-acyl-carrier protein reductase,2.0,False,CHEMBL4150,"[{'accession': 'Q965D5', 'component_descriptio...",SINGLE PROTEIN,5833


The first row (index 0) is our target protein, PfDHODH, and its ChEMBL ID is CHEMBL3486.

In [29]:
# Take the 'target_chembl_id' of the first row in the DataFrame as our target
our_target = data.iloc[0]['target_chembl_id'] # Relies on PfDHODH being the top-scoring hit (row 0)
our_target # Display the selected ChEMBL target ID

'CHEMBL3486'
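Picking the target with `iloc[0]` works here because PfDHODH is the top-scoring hit, but it depends on the order in which ChEMBL returns results. As a minimal sketch of a more defensive alternative, the target can be selected by its exact `pref_name` instead. The small `targets` DataFrame and the helper `pick_target_id` below are hypothetical stand-ins for illustration, not part of the original notebook; only a few rows and columns of the search result are reproduced.

```python
import pandas as pd

# Hypothetical stand-in for the search result above (subset of rows and columns).
targets = pd.DataFrame(
    {
        "pref_name": [
            "Dihydroorotate dehydrogenase",
            "Dihydrolipoyl dehydrogenase",
            "Lactate dehydrogenase",
        ],
        "organism": ["Plasmodium falciparum"] * 3,
        "target_chembl_id": ["CHEMBL3486", "CHEMBL6147", "CHEMBL6071"],
    }
)


def pick_target_id(df: pd.DataFrame, name: str) -> str:
    """Return the ChEMBL ID of the single row whose pref_name equals `name`."""
    is_match = df["pref_name"].str.lower() == name.lower()
    matches = df.loc[is_match, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly instead of silently picking the wrong protein.
        raise ValueError(f"expected exactly one match for {name!r}, found {len(matches)}")
    return matches.iloc[0]


selected_id = pick_target_id(targets, "Dihydroorotate dehydrogenase")
print(selected_id)
```

In the notebook itself, the same helper could be applied to `data` in place of the stand-in DataFrame, so that a change in result ordering raises an error rather than selecting a different dehydrogenase.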

##### __1.2 Retrieve bioactivity data of compounds for the PfDHODH target__

In [30]:
target_activity = new_client.activity # Handle for the ChEMBL activity resource
target_activity # With no filter applied, this previews activity records from all of ChEMBL, not only PfDHODH

[{'action_type': None, 'activity_comment': None, 'activity_id': 31863, 'activity_properties': [], 'assay_chembl_id': 'CHEMBL663853', 'assay_description': 'Inhibitory concentration against human DNA topoisomerase II, alpha mediated relaxation of pBR322; no measurable activity', 'assay_type': 'B', 'assay_variant_accession': None, 'assay_variant_mutation': None, 'bao_endpoint': 'BAO_0000190', 'bao_format': 'BAO_0000357', 'bao_label': 'single protein format', 'canonical_smiles': 'c1ccc(-c2nc3c(-c4nc5ccccc5o4)cccc3o2)cc1', 'data_validity_comment': None, 'data_validity_description': None, 'document_chembl_id': 'CHEMBL1137930', 'document_journal': 'Bioorg Med Chem Lett', 'document_year': 2004, 'ligand_efficiency': None, 'molecule_chembl_id': 'CHEMBL113081', 'molecule_pref_name': None, 'parent_molecule_chembl_id': 'CHEMBL113081', 'pchembl_value': None, 'potential_duplicate': 0, 'qudt_units': 'http://www.openphacts.org/units/Nanomolar', 'record_id': 206172, 'relation': '>', 'src_id': 1, 'standa

In [31]:
# Query the activity data for our specific target protein and keep only records with standard type 'IC50'
target_query = target_activity.filter(target_chembl_id=our_target).filter(standard_type='IC50')
target_query # Display the result: a list-like query set with one dictionary per activity record

[{'action_type': None, 'activity_comment': None, 'activity_id': 1662473, 'activity_properties': [], 'assay_chembl_id': 'CHEMBL863916', 'assay_description': 'Binding affinity to Plasmodium falciparum DHODH', 'assay_type': 'B', 'assay_variant_accession': None, 'assay_variant_mutation': None, 'bao_endpoint': 'BAO_0000190', 'bao_format': 'BAO_0000357', 'bao_label': 'single protein format', 'canonical_smiles': 'CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O', 'data_validity_comment': None, 'data_validity_description': None, 'document_chembl_id': 'CHEMBL1148537', 'document_journal': 'Bioorg Med Chem Lett', 'document_year': 2006, 'ligand_efficiency': {'bei': '13.19', 'le': '0.24', 'lle': '0.04', 'sei': '7.59'}, 'molecule_chembl_id': 'CHEMBL199572', 'molecule_pref_name': None, 'parent_molecule_chembl_id': 'CHEMBL199572', 'pchembl_value': '4.37', 'potential_duplicate': 0, 'qudt_units': 'http://www.openphacts.org/units/Nanomolar', 'record_id': 414629, 'relation': '=', 'src_id': 1, 'standard_flag': 1

In [32]:
## Convert the target query result to a pandas DataFrame for easier data manipulation and analysis
df = pd.DataFrame.from_dict(target_query)
df # Displaying the DataFrame

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1662473,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,42.6
1,,,1662477,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,142.6
2,,,1662479,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,93.4
3,,,1662481,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,153.5
4,,,1662493,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,200.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
597,,,18951322,[],CHEMBL4331522,Inhibition of Plasmodium falciparum recombinan...,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,250.0
598,,,18951323,[],CHEMBL4331522,Inhibition of Plasmodium falciparum recombinan...,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,250.0
599,,,18951324,[],CHEMBL4331522,Inhibition of Plasmodium falciparum recombinan...,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,250.0
600,,,19145102,[],CHEMBL4378107,Inhibition of Plasmodium falciparum DHODH expr...,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,0.01


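A total of 602 bioactivity records, each with a potency value (IC50), a chemical structure, and other metadata, are available for the PfDHODH target. The most relevant columns are the standardized potency (standard_value), the compound ID (molecule_chembl_id), and the structure (canonical_smiles).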
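Each row returned by the activity query is a measurement, so the same molecule can appear more than once if it was tested in several assays. The sketch below shows one way to compare the number of records with the number of distinct compounds. The `records` DataFrame is illustrative only: its last row deliberately repeats a molecule ID with a made-up value and is not taken from the ChEMBL results.

```python
import pandas as pd

# Illustrative records; the fourth row is an invented repeat measurement.
records = pd.DataFrame(
    {
        "molecule_chembl_id": [
            "CHEMBL199572",
            "CHEMBL199574",
            "CHEMBL372561",
            "CHEMBL199572",
        ],
        "standard_value": [42600.0, 142600.0, 93400.0, 40000.0],
    }
)

n_records = len(records)                                  # number of measurements
n_compounds = records["molecule_chembl_id"].nunique()     # number of distinct molecules

# Molecules that were measured more than once.
counts = records["molecule_chembl_id"].value_counts()
repeated = counts[counts > 1]

print(f"{n_records} records for {n_compounds} distinct compounds")
print(repeated)
```

Running the same three lines on `df` would show whether the PfDHODH records contain repeat measurements that need to be aggregated or deduplicated in a later step.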

In [33]:
# Select and display only the 'standard_value', 'molecule_chembl_id', and 'canonical_smiles' columns from the DataFrame
df[['standard_value', 'molecule_chembl_id', 'canonical_smiles']]

Unnamed: 0,standard_value,molecule_chembl_id,canonical_smiles
0,42600.0,CHEMBL199572,CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O
1,142600.0,CHEMBL199574,O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1
2,93400.0,CHEMBL372561,CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O
3,153500.0,CHEMBL370865,O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1
4,200000.0,CHEMBL199575,CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O
...,...,...,...
597,250000.0,CHEMBL4569109,Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1
598,250000.0,CHEMBL4568957,Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1
599,250000.0,CHEMBL4449622,Cn1nc(O)c(C(N)=O)c1COc1ccccc1
600,10.0,CHEMBL1956285,Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...
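The `standard_value` column spans several orders of magnitude (10 to 250000), so potency is often easier to compare on a negative log scale. The sketch below converts IC50 to pIC50, assuming `standard_value` is in nanomolar, which is consistent with the first record (42.6 uM reported as 42600.0, with a `pchembl_value` of 4.37). The `sample` DataFrame reuses three values from the table above; the `pIC50` column name is my own choice, not something defined in the notebook.

```python
import numpy as np
import pandas as pd

# Three rows copied from the table above; values are stored as strings here
# to show the numeric conversion step.
sample = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL199572", "CHEMBL199575", "CHEMBL1956285"],
        "standard_value": ["42600.0", "200000.0", "10.0"],
    }
)

# Convert to numbers; anything unparseable becomes NaN instead of raising.
ic50_nm = pd.to_numeric(sample["standard_value"], errors="coerce")

# pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM).
# Assumes all values are positive and expressed in nM.
sample["pIC50"] = 9 - np.log10(ic50_nm)

print(sample)
```

Higher pIC50 means a more potent compound, so the 10 nM inhibitor ends up at the top of the scale and the 200000 nM compound near the bottom.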


##### __Save the raw data to a CSV file__

In [34]:
df.to_csv('../data/chembl_dataset/00_PfDHODH_raw_data.csv', index=False) # Save the DataFrame to a CSV file
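`to_csv` fails if the output directory does not exist yet, which can happen on a fresh clone of the project. As a small sketch, the helper below creates the parent directory before writing. The function name `save_csv` and the `demo` DataFrame are hypothetical, and the example writes into a temporary directory so it can be run without touching the project's data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(frame: pd.DataFrame, path) -> Path:
    """Write `frame` to `path` as CSV, creating missing parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    frame.to_csv(path, index=False)
    return path


# Small demo frame with two rows copied from the table above.
demo = pd.DataFrame(
    {
        "standard_value": [42600.0, 10.0],
        "molecule_chembl_id": ["CHEMBL199572", "CHEMBL1956285"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(demo, Path(tmp) / "data" / "chembl_dataset" / "00_PfDHODH_raw_data.csv")
    reloaded = pd.read_csv(out)  # read back to confirm the file was written

print(reloaded)
```

In the notebook, calling `save_csv(df, '../data/chembl_dataset/00_PfDHODH_raw_data.csv')` would write the same file as the cell above while tolerating a missing folder.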