# **Bioinformatics Project - Computational Drug Discovery [Part 1]**

Building a machine learning model using the ChEMBL bioactivity data <br>
Part 1: Data Collection and Pre-Processing from the ChEMBL Database.

---

## **ChEMBL Database**

[*ChEMBL*](https://www.ebi.ac.uk/chembl/) is a curated database of bioactivity data covering more than 2 million compounds.

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client


## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for AKT**



In [None]:
# Search ChEMBL for targets matching 'akt'
target = new_client.target
target_query = target.search('akt')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],AKT8 murine leukemia virus,AKT kinase-transforming protein,18.0,False,CHEMBL3627590,"[{'accession': 'P31748', 'component_descriptio...",SINGLE PROTEIN,11790
1,[],Homo sapiens,AKT/p21CIP1,15.0,False,CHEMBL3038463,"[{'accession': 'P38936', 'component_descriptio...",PROTEIN COMPLEX,9606
2,[],Mus musculus,RAC-alpha serine/threonine-protein kinase,13.0,False,CHEMBL5859,"[{'accession': 'P31750', 'component_descriptio...",SINGLE PROTEIN,10090
3,"[{'xref_id': 'AKT1S1', 'xref_name': None, 'xre...",Homo sapiens,Proline-rich AKT1 substrate 1,13.0,False,CHEMBL1255161,"[{'accession': 'Q96B36', 'component_descriptio...",SINGLE PROTEIN,9606
4,"[{'xref_id': 'P31751', 'xref_name': None, 'xre...",Homo sapiens,Serine/threonine-protein kinase AKT2,11.0,False,CHEMBL2431,"[{'accession': 'P31751', 'component_descriptio...",SINGLE PROTEIN,9606
5,"[{'xref_id': 'P31749', 'xref_name': None, 'xre...",Homo sapiens,Serine/threonine-protein kinase AKT,11.0,False,CHEMBL4282,"[{'accession': 'P31749', 'component_descriptio...",SINGLE PROTEIN,9606
6,"[{'xref_id': 'Q60823', 'xref_name': None, 'xre...",Mus musculus,RAC-beta serine/threonine-protein kinase,11.0,False,CHEMBL5382,"[{'accession': 'Q60823', 'component_descriptio...",SINGLE PROTEIN,10090
7,[],Homo sapiens,Serine/threonine-protein kinase AKT,11.0,False,CHEMBL2111353,"[{'accession': 'P31751', 'component_descriptio...",PROTEIN FAMILY,9606
8,[],Homo sapiens,Serine/threonine-protein kinase Rac alpha/beta,11.0,False,CHEMBL4106175,"[{'accession': 'P31751', 'component_descriptio...",PROTEIN FAMILY,9606
9,"[{'xref_id': 'Q9Y243', 'xref_name': None, 'xre...",Homo sapiens,Serine/threonine-protein kinase AKT3,10.0,False,CHEMBL4816,"[{'accession': 'Q9Y243', 'component_descriptio...",SINGLE PROTEIN,9606


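### **Select and retrieve bioactivity data for *AKT1*, *AKT2* and *AKT3* (entry indices 4, 5 and 9)**

We assign the entries at index 4 (*AKT2*, CHEMBL2431), index 5 (*AKT*, UniProt accession P31749, i.e. AKT1, CHEMBL4282) and index 9 (*AKT3*, CHEMBL4816) to the variables ***selected_target1***, ***selected_target2*** and ***selected_target3***.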

In [None]:
selected_target1 = targets.target_chembl_id[4]
selected_target2 = targets.target_chembl_id[5]
selected_target3 = targets.target_chembl_id[9]



print(selected_target1, selected_target2, selected_target3)



CHEMBL2431 CHEMBL4282 CHEMBL4816
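
Selecting rows by position depends on the order in which the search returns its hits, which may differ between ChEMBL releases. As a minimal sketch, the same three IDs can be picked by filtering on the columns shown in the search output. The `toy_targets` frame below is a hypothetical stand-in built from a few of the rows above so that the snippet runs without network access; the substring `"kinase AKT"` is an assumption that happens to match the three human AKT entries in that output.

```python
import pandas as pd

# Hypothetical stand-in for `targets`, copied from a few rows of the search output.
toy_targets = pd.DataFrame({
    "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens",
                 "Homo sapiens", "Homo sapiens"],
    "pref_name": [
        "RAC-alpha serine/threonine-protein kinase",
        "Serine/threonine-protein kinase AKT2",
        "Serine/threonine-protein kinase AKT",
        "Serine/threonine-protein kinase AKT",
        "Serine/threonine-protein kinase AKT3",
    ],
    "target_chembl_id": ["CHEMBL5859", "CHEMBL2431", "CHEMBL4282",
                         "CHEMBL2111353", "CHEMBL4816"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN", "SINGLE PROTEIN",
                    "PROTEIN FAMILY", "SINGLE PROTEIN"],
})

# Keep human single-protein targets whose preferred name mentions an AKT kinase.
mask = (
    (toy_targets["organism"] == "Homo sapiens")
    & (toy_targets["target_type"] == "SINGLE PROTEIN")
    & toy_targets["pref_name"].str.contains("kinase AKT", regex=False)
)
selected_ids = toy_targets.loc[mask, "target_chembl_id"].tolist()
print(selected_ids)
```

The protein-family entry (CHEMBL2111353) shares its preferred name with AKT1, which is why the `target_type` condition is part of the mask.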


Here, we retrieve only the bioactivity records for the three selected *AKT* targets that are reported as IC50 values.

In [None]:
activity = new_client.activity
res1 = activity.filter(target_chembl_id=selected_target1).filter(standard_type="IC50")
res2 = activity.filter(target_chembl_id=selected_target2).filter(standard_type="IC50")
res3 = activity.filter(target_chembl_id=selected_target3).filter(standard_type="IC50")



In [None]:
dfa = pd.DataFrame.from_dict(res1)
dfb = pd.DataFrame.from_dict(res2)
dfc = pd.DataFrame.from_dict(res3)
df = pd.concat([dfa,dfb,dfc])
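
Note that `pd.concat` keeps the row labels of each input frame by default, so the combined frame can contain repeated index labels. The sketch below illustrates this with two small toy frames (made-up values standing in for `dfa` and `dfb`) and shows `ignore_index=True` as one way to get a fresh 0..n-1 index.

```python
import pandas as pd

# Toy frames standing in for dfa and dfb (values are made up).
toy_a = pd.DataFrame({"molecule_chembl_id": ["A1", "A2"]})
toy_b = pd.DataFrame({"molecule_chembl_id": ["B1", "B2"]})

# Default behaviour: each frame keeps its own labels, so 0 and 1 appear twice.
stacked = pd.concat([toy_a, toy_b])

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([toy_a, toy_b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for the filtering steps in this notebook, but a unique index makes label-based lookups such as `df.loc[0]` return a single row.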

In [None]:
df


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1421464,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4nc3-c3c...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL260397,,CHEMBL260397,,False,http://www.openphacts.org/units/Nanomolar,365301,>,1,True,>,,IC50,nM,,50000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,50000.0
1,,1421467,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,O=c1[nH]c2ccccc2n1C1CCN(Cc2ccc(-c3nc4cc5[nH]cn...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL258844,,CHEMBL258844,,False,http://www.openphacts.org/units/Nanomolar,365306,>,1,True,>,,IC50,nM,,2200.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,2200.0
2,,1421470,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,Cc1nc(-c2ccc(CN3CCC(n4c(=O)[nH]c5ccccc54)CC3)c...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL178397,,CHEMBL178397,,False,http://www.openphacts.org/units/Nanomolar,365307,>,1,True,>,,IC50,nM,,20000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,20000.0
3,,1421473,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,CC(C)Cc1nc(-c2ccccc2)c(-c2ccc(CN3CCC(n4c(O)nc5...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL178468,,CHEMBL178468,,False,http://www.openphacts.org/units/Nanomolar,365311,>,1,True,>,,IC50,nM,,20000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,20000.0
4,,1421869,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4cc3-c3c...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL359864,,CHEMBL359864,,False,http://www.openphacts.org/units/Nanomolar,365316,>,1,True,>,,IC50,nM,,50000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,50000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352,,19476745,[],CHEMBL4479693,Inhibition of full-length recombinant human Hi...,B,,,BAO_0000190,BAO_0000019,assay format,Cn1cc(-c2cnc3c(-c4csc(C(=O)N[C@@H]5CCCC[C@@H]5...,,,CHEMBL4477252,Bioorg Med Chem Lett,2016.0,"{'bei': '15.38', 'le': '0.30', 'lle': '3.62', ...",CHEMBL4568087,,CHEMBL4568087,6.48,False,http://www.openphacts.org/units/Nanomolar,3256973,=,1,True,=,,IC50,nM,,330.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,IC50,nM,UO_0000065,,330.0
353,,19476845,[],CHEMBL4479693,Inhibition of full-length recombinant human Hi...,B,,,BAO_0000190,BAO_0000019,assay format,Cc1sc(C(=O)N[C@@H]2[C@H](N)CCCC2(F)F)cc1-c1cnn...,,,CHEMBL4477252,Bioorg Med Chem Lett,2016.0,"{'bei': '13.20', 'le': '0.27', 'lle': '1.96', ...",CHEMBL4552628,,CHEMBL4552628,5.62,False,http://www.openphacts.org/units/Nanomolar,3256981,=,1,True,=,,IC50,nM,,2400.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,IC50,nM,UO_0000065,,2400.0
354,,20143321,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4512147,Akt3 [PKBgamma(AKT3LGY1)] Takeda global kinase...,B,,,BAO_0000190,BAO_0000357,single protein format,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,,,CHEMBL4507326,,,"{'bei': '12.46', 'le': '0.26', 'lle': '2.88', ...",CHEMBL4549667,,CHEMBL4549667,6.00,False,http://www.openphacts.org/units/Nanomolar,3359780,=,54,True,=,,IC50,nM,,1000.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,pIC50,,UO_0000065,,6.0
355,,20143683,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4512434,Akt3 [PKBgamma(AKT3LGY1)] Takeda global kinase...,B,,,BAO_0000190,BAO_0000357,single protein format,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,,,CHEMBL4507327,,,"{'bei': '14.91', 'le': '0.27', 'lle': '3.65', ...",CHEMBL4088216,,CHEMBL4088216,6.00,False,http://www.openphacts.org/units/Nanomolar,3359782,=,54,True,=,,IC50,nM,,1000.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,pIC50,,UO_0000065,,6.0


Finally, we save the resulting bioactivity data to the CSV file **akt_01_bioactivity_data_raw.csv**.

In [None]:
df.to_csv('akt_01_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in either the **standard_value** or the **canonical_smiles** column, drop it.

In [None]:
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1421464,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4nc3-c3c...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL260397,,CHEMBL260397,,False,http://www.openphacts.org/units/Nanomolar,365301,>,1,True,>,,IC50,nM,,50000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,50000.0
1,,1421467,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,O=c1[nH]c2ccccc2n1C1CCN(Cc2ccc(-c3nc4cc5[nH]cn...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL258844,,CHEMBL258844,,False,http://www.openphacts.org/units/Nanomolar,365306,>,1,True,>,,IC50,nM,,2200.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,2200.0
2,,1421470,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,Cc1nc(-c2ccc(CN3CCC(n4c(=O)[nH]c5ccccc54)CC3)c...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL178397,,CHEMBL178397,,False,http://www.openphacts.org/units/Nanomolar,365307,>,1,True,>,,IC50,nM,,20000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,20000.0
3,,1421473,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,CC(C)Cc1nc(-c2ccccc2)c(-c2ccc(CN3CCC(n4c(O)nc5...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL178468,,CHEMBL178468,,False,http://www.openphacts.org/units/Nanomolar,365311,>,1,True,>,,IC50,nM,,20000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,20000.0
4,,1421869,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4cc3-c3c...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL359864,,CHEMBL359864,,False,http://www.openphacts.org/units/Nanomolar,365316,>,1,True,>,,IC50,nM,,50000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,50000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352,,19476745,[],CHEMBL4479693,Inhibition of full-length recombinant human Hi...,B,,,BAO_0000190,BAO_0000019,assay format,Cn1cc(-c2cnc3c(-c4csc(C(=O)N[C@@H]5CCCC[C@@H]5...,,,CHEMBL4477252,Bioorg Med Chem Lett,2016.0,"{'bei': '15.38', 'le': '0.30', 'lle': '3.62', ...",CHEMBL4568087,,CHEMBL4568087,6.48,False,http://www.openphacts.org/units/Nanomolar,3256973,=,1,True,=,,IC50,nM,,330.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,IC50,nM,UO_0000065,,330.0
353,,19476845,[],CHEMBL4479693,Inhibition of full-length recombinant human Hi...,B,,,BAO_0000190,BAO_0000019,assay format,Cc1sc(C(=O)N[C@@H]2[C@H](N)CCCC2(F)F)cc1-c1cnn...,,,CHEMBL4477252,Bioorg Med Chem Lett,2016.0,"{'bei': '13.20', 'le': '0.27', 'lle': '1.96', ...",CHEMBL4552628,,CHEMBL4552628,5.62,False,http://www.openphacts.org/units/Nanomolar,3256981,=,1,True,=,,IC50,nM,,2400.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,IC50,nM,UO_0000065,,2400.0
354,,20143321,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4512147,Akt3 [PKBgamma(AKT3LGY1)] Takeda global kinase...,B,,,BAO_0000190,BAO_0000357,single protein format,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,,,CHEMBL4507326,,,"{'bei': '12.46', 'le': '0.26', 'lle': '2.88', ...",CHEMBL4549667,,CHEMBL4549667,6.00,False,http://www.openphacts.org/units/Nanomolar,3359780,=,54,True,=,,IC50,nM,,1000.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,pIC50,,UO_0000065,,6.0
355,,20143683,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4512434,Akt3 [PKBgamma(AKT3LGY1)] Takeda global kinase...,B,,,BAO_0000190,BAO_0000357,single protein format,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,,,CHEMBL4507327,,,"{'bei': '14.91', 'le': '0.27', 'lle': '3.65', ...",CHEMBL4088216,,CHEMBL4088216,6.00,False,http://www.openphacts.org/units/Nanomolar,3359782,=,54,True,=,,IC50,nM,,1000.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,pIC50,,UO_0000065,,6.0
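
The two boolean filters can also be written as a single `dropna` call with a `subset` argument. The following sketch uses a small toy frame with made-up values (not ChEMBL data) to check that both forms keep the same rows.

```python
import pandas as pd

# Toy data (made-up values) with a gap in each column of interest.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [120.0, 50.0, None, 3000.0],
})

# Two-step boolean filtering, mirroring the notebook cell.
two_step = toy[toy.standard_value.notna()]
two_step = two_step[two_step.canonical_smiles.notna()]

# Single call: a row is dropped if either listed column is missing.
one_step = toy.dropna(subset=["standard_value", "canonical_smiles"])

print(one_step)
```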


In [None]:
len(df2.canonical_smiles.unique())

3464

In [None]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1421464,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4nc3-c3c...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL260397,,CHEMBL260397,,False,http://www.openphacts.org/units/Nanomolar,365301,>,1,True,>,,IC50,nM,,50000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,50000.0
1,,1421467,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,O=c1[nH]c2ccccc2n1C1CCN(Cc2ccc(-c3nc4cc5[nH]cn...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL258844,,CHEMBL258844,,False,http://www.openphacts.org/units/Nanomolar,365306,>,1,True,>,,IC50,nM,,2200.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,2200.0
2,,1421470,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,Cc1nc(-c2ccc(CN3CCC(n4c(=O)[nH]c5ccccc54)CC3)c...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL178397,,CHEMBL178397,,False,http://www.openphacts.org/units/Nanomolar,365307,>,1,True,>,,IC50,nM,,20000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,20000.0
3,,1421473,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,CC(C)Cc1nc(-c2ccccc2)c(-c2ccc(CN3CCC(n4c(O)nc5...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL178468,,CHEMBL178468,,False,http://www.openphacts.org/units/Nanomolar,365311,>,1,True,>,,IC50,nM,,20000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,20000.0
4,,1421869,[],CHEMBL832703,Inhibitory concentration against Akt3 kinase,B,,,BAO_0000190,BAO_0000357,single protein format,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4cc3-c3c...,,,CHEMBL1141976,Bioorg. Med. Chem. Lett.,2005.0,,CHEMBL359864,,CHEMBL359864,,False,http://www.openphacts.org/units/Nanomolar,365316,>,1,True,>,,IC50,nM,,50000.0,CHEMBL2431,Homo sapiens,Serine/threonine-protein kinase AKT2,9606,,,IC50,nM,UO_0000065,,50000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
164,,3219713,[],CHEMBL1110411,Inhibition of PKBgamma in presence of 100 uM ATP,B,,,BAO_0000190,BAO_0000357,single protein format,N=C(N)NCCC[C@@H](NC(=O)[C@@H](CCCNC(=N)N)NC(=O...,,,CHEMBL1154334,Bioorg. Med. Chem. Lett.,2009.0,"{'bei': '5.24', 'le': None, 'lle': None, 'sei'...",CHEMBL1077375,,CHEMBL1077375,7.92,False,http://www.openphacts.org/units/Nanomolar,896504,=,1,True,=,,IC50,nM,,12.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,IC50,nM,UO_0000065,,12.0
165,,3219714,[],CHEMBL1110411,Inhibition of PKBgamma in presence of 100 uM ATP,B,,,BAO_0000190,BAO_0000357,single protein format,N=C(N)NCCC[C@@H](NC(=O)CCCCCNC(=O)[C@@H](CCCCN...,,,CHEMBL1154334,Bioorg. Med. Chem. Lett.,2009.0,"{'bei': '4.50', 'le': None, 'lle': None, 'sei'...",CHEMBL1077376,,CHEMBL1077376,7.38,False,http://www.openphacts.org/units/Nanomolar,896505,=,1,True,=,,IC50,nM,,42.0,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,IC50,nM,UO_0000065,,42.0
212,,10863369,[],CHEMBL2024945,Displacement of fluorescent-ARC-583/ARC-1042/A...,B,Q9Y243,S472D,BAO_0000190,BAO_0000019,assay format,N=C(N)NCCC[C@@H](NC(=O)[C@@H](CCCNC(=N)N)NC(=O...,,,CHEMBL2021921,Bioorg. Med. Chem. Lett.,2012.0,"{'bei': '5.49', 'le': None, 'lle': None, 'sei'...",CHEMBL2023714,,CHEMBL2023714,7.14,False,http://www.openphacts.org/units/Nanomolar,1639325,=,1,True,=,,IC50,nM,,72.44,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,pIC50,,UO_0000065,,7.14
213,,10863370,[],CHEMBL2024945,Displacement of fluorescent-ARC-583/ARC-1042/A...,B,Q9Y243,S472D,BAO_0000190,BAO_0000019,assay format,N=C(N)NCCC[C@@H](NC(=O)[C@@H](CCCNC(=N)N)NC(=O...,,,CHEMBL2021921,Bioorg. Med. Chem. Lett.,2012.0,"{'bei': '5.11', 'le': '0.10', 'lle': '4.50', '...",CHEMBL2023842,,CHEMBL2023842,5.04,False,http://www.openphacts.org/units/Nanomolar,1639480,=,1,True,=,,IC50,nM,,9120.11,CHEMBL4816,Homo sapiens,Serine/threonine-protein kinase AKT3,9606,,,pIC50,,UO_0000065,,5.04
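
`drop_duplicates` keeps the first row for each repeated SMILES string and discards the later ones, so when a compound has several IC50 measurements only the first one encountered survives. A minimal sketch on toy data (made-up values) shows the effect:

```python
import pandas as pd

# Toy data (made-up values): the SMILES "CCO" appears twice.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCN", "CCO"],
    "standard_value": [100.0, 2500.0, 900.0],
})

# keep="first" is the pandas default: later rows with a repeated SMILES are dropped.
toy_nr = toy.drop_duplicates(["canonical_smiles"])

print(len(toy.canonical_smiles.unique()))  # number of distinct SMILES
print(toy_nr)
```

Because the first occurrence wins, the retained value depends on row order, which here follows the order in which the three targets were concatenated.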


## **Data pre-processing of the bioactivity data**

### **Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL260397,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4nc3-c3c...,50000.0
1,CHEMBL258844,O=c1[nH]c2ccccc2n1C1CCN(Cc2ccc(-c3nc4cc5[nH]cn...,2200.0
2,CHEMBL178397,Cc1nc(-c2ccc(CN3CCC(n4c(=O)[nH]c5ccccc54)CC3)c...,20000.0
3,CHEMBL178468,CC(C)Cc1nc(-c2ccccc2)c(-c2ccc(CN3CCC(n4c(O)nc5...,20000.0
4,CHEMBL359864,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4cc3-c3c...,50000.0
...,...,...,...
164,CHEMBL1077375,N=C(N)NCCC[C@@H](NC(=O)[C@@H](CCCNC(=N)N)NC(=O...,12.0
165,CHEMBL1077376,N=C(N)NCCC[C@@H](NC(=O)CCCCCNC(=O)[C@@H](CCCCN...,42.0
212,CHEMBL2023714,N=C(N)NCCC[C@@H](NC(=O)[C@@H](CCCNC(=N)N)NC(=O...,72.44
213,CHEMBL2023842,N=C(N)NCCC[C@@H](NC(=O)[C@@H](CCCNC(=N)N)NC(=O...,9120.11


Save the DataFrame to a CSV file

In [None]:
df3.to_csv('akt_02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity values are IC50 measurements (**standard_value**, shown here in nM). Compounds with an IC50 of 1,000 nM or less are considered **active**, those with an IC50 of 10,000 nM or more are considered **inactive**, and those with values strictly between 1,000 and 10,000 nM are labelled **intermediate**.

In [None]:
df4 = pd.read_csv('akt_02_bioactivity_data_preprocessed.csv')

In [None]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
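
The same thresholds can be wrapped in a small function, which makes the boundary behaviour easy to check. This is only a sketch: `classify_ic50` is a hypothetical helper name, not part of the notebook or of any library, and the sample values are taken from the **standard_value** column shown earlier.

```python
def classify_ic50(value_nm):
    """Return the activity label for an IC50 value given in nM."""
    value_nm = float(value_nm)
    if value_nm >= 10000:
        return "inactive"
    if value_nm <= 1000:
        return "active"
    return "intermediate"

# Sample IC50 values (nM) from the standard_value column.
labels = [classify_ic50(v) for v in [50000.0, 2200.0, 12.0, 9120.11]]
print(labels)
```

With pandas, such a function could be applied to the whole column via `df4.standard_value.apply(classify_ic50)`, which yields the same labels as the loop.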

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL260397,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4nc3-c3c...,50000.00,inactive
1,CHEMBL258844,O=c1[nH]c2ccccc2n1C1CCN(Cc2ccc(-c3nc4cc5[nH]cn...,2200.00,intermediate
2,CHEMBL178397,Cc1nc(-c2ccc(CN3CCC(n4c(=O)[nH]c5ccccc54)CC3)c...,20000.00,inactive
3,CHEMBL178468,CC(C)Cc1nc(-c2ccccc2)c(-c2ccc(CN3CCC(n4c(O)nc5...,20000.00,inactive
4,CHEMBL359864,Oc1nc2ccccc2n1C1CCN(Cc2ccc(-c3nc4ccccc4cc3-c3c...,50000.00,inactive
...,...,...,...,...
3459,CHEMBL1077375,N=C(N)NCCC[C@@H](NC(=O)[C@@H](CCCNC(=N)N)NC(=O...,12.00,active
3460,CHEMBL1077376,N=C(N)NCCC[C@@H](NC(=O)CCCCCNC(=O)[C@@H](CCCCN...,42.00,active
3461,CHEMBL2023714,N=C(N)NCCC[C@@H](NC(=O)[C@@H](CCCNC(=N)N)NC(=O...,72.44,active
3462,CHEMBL2023842,N=C(N)NCCC[C@@H](NC(=O)[C@@H](CCCNC(=N)N)NC(=O...,9120.11,intermediate


Save the DataFrame to a CSV file

In [None]:
df5.to_csv('akt_03_bioactivity_data_curated.csv', index=False)

In [None]:
! zip akt.zip *.csv

  adding: akt_01_bioactivity_data_raw.csv (deflated 93%)
  adding: akt_02_bioactivity_data_preprocessed.csv (deflated 83%)
  adding: akt_03_bioactivity_data_curated.csv (deflated 84%)


In [None]:
! ls -l

total 5216
-rw-r--r-- 1 root root 4365555 Jul  9 16:08 akt_01_bioactivity_data_raw.csv
-rw-r--r-- 1 root root  277192 Jul  9 16:11 akt_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root  305447 Jul  9 16:11 akt_03_bioactivity_data_curated.csv
-rw-r--r-- 1 root root  384531 Jul  9 16:12 akt.zip
drwxr-xr-x 1 root root    4096 Jul  1 13:42 sample_data


---