<a href="https://colab.research.google.com/github/ahrbadr/test/blob/master/CDD_ML_Part_1_Bioactivity_Data_Beta_Amyloid.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Computational Drug Discovery [Part 1]: Download Bioactivity Data**



In **Part 1**, we collect bioactivity data from the ChEMBL Database and pre-process it.



---

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client



## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Beta_Amyloid**

In [None]:
# Target search for Beta_Amyloid
target = new_client.target
target_query = target.search('beta amyloid')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P05067', 'xref_name': None, 'xre...",Homo sapiens,Beta amyloid A4 protein,19.0,False,CHEMBL2487,"[{'accession': 'P05067', 'component_descriptio...",SINGLE PROTEIN,9606.0
1,[],Homo sapiens,Amyloid beta-binding alcohol dehydrogenase,19.0,False,CHEMBL4295598,"[{'accession': 'Q2L8D9', 'component_descriptio...",SINGLE PROTEIN,9606.0
2,[],Mus musculus,Amyloid-beta A4 protein,19.0,False,CHEMBL4523942,"[{'accession': 'P12023', 'component_descriptio...",SINGLE PROTEIN,10090.0
3,"[{'xref_id': 'P37840', 'xref_name': None, 'xre...",Homo sapiens,Alpha-synuclein,18.0,False,CHEMBL6152,"[{'accession': 'P37840', 'component_descriptio...",SINGLE PROTEIN,9606.0
4,[],Rattus norvegicus,Amyloid beta A4 protein,18.0,False,CHEMBL3638365,"[{'accession': 'P08592', 'component_descriptio...",SINGLE PROTEIN,10116.0
...,...,...,...,...,...,...,...,...,...
1100,[],Homo sapiens,"3',5'-cyclic phosphodiesterase",1.0,False,CHEMBL2363066,"[{'accession': 'O76074', 'component_descriptio...",PROTEIN FAMILY,9606.0
1101,[],Homo sapiens,Caspase,1.0,False,CHEMBL3831289,"[{'accession': 'P49662', 'component_descriptio...",PROTEIN FAMILY,9606.0
1102,[],Homo sapiens,mTORC2,1.0,False,CHEMBL4523999,"[{'accession': 'P42345', 'component_descriptio...",PROTEIN COMPLEX,9606.0
1103,"[{'xref_id': 'C3TDZ2', 'xref_name': None, 'xre...",Escherichia coli,3-oxoacyl-[acyl-carrier-protein] synthase 3,0.0,False,CHEMBL1795135,"[{'accession': 'C3TDZ2', 'component_descriptio...",SINGLE PROTEIN,562.0


### **Select and retrieve bioactivity data for *Human Beta_Amyloid* (first entry)**

We assign the first entry (which corresponds to the target protein, *Human Beta_Amyloid*) to the ***selected_target*** variable.

In [None]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL2487'
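Picking the target by row position works only as long as the search keeps returning the human protein first. As a minimal sketch, the same target can be selected by explicit criteria instead. The small DataFrame below is a hypothetical stand-in built from a few rows of the search output above, so the example runs without a network call.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` DataFrame: a few rows and
# columns copied from the search output shown above.
targets = pd.DataFrame({
    "organism": ["Homo sapiens", "Homo sapiens", "Mus musculus", "Homo sapiens"],
    "pref_name": [
        "Beta amyloid A4 protein",
        "Amyloid beta-binding alcohol dehydrogenase",
        "Amyloid-beta A4 protein",
        "Alpha-synuclein",
    ],
    "target_chembl_id": ["CHEMBL2487", "CHEMBL4295598", "CHEMBL4523942", "CHEMBL6152"],
    "target_type": ["SINGLE PROTEIN"] * 4,
})

# Select by organism, target type and preferred name rather than by position.
mask = (
    (targets["organism"] == "Homo sapiens")
    & (targets["target_type"] == "SINGLE PROTEIN")
    & (targets["pref_name"] == "Beta amyloid A4 protein")
)
matches = targets.loc[mask, "target_chembl_id"]

# Fail loudly if the criteria match zero or several targets.
if len(matches) != 1:
    raise ValueError(f"expected exactly one matching target, found {len(matches)}")

selected_target = matches.iloc[0]
print(selected_target)
```

With the real search results, the same mask can be applied directly to `targets`.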

Here, we retrieve only the bioactivity data for *Human Beta_Amyloid* (CHEMBL2487) that are reported as IC50 values.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,357577,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '17.02', 'le': '0.34', 'lle': '1.98', ...",CHEMBL311039,,CHEMBL311039,5.30,False,http://www.openphacts.org/units/Nanomolar,132837,=,1,True,=,,IC50,nM,,5000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,5.0
1,,357580,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '18.60', 'le': '0.40', 'lle': '2.33', ...",CHEMBL450926,,CHEMBL450926,5.57,False,http://www.openphacts.org/units/Nanomolar,132839,=,1,True,=,,IC50,nM,,2700.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,2.7
2,,358965,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '18.45', 'le': '0.37', 'lle': '2.42', ...",CHEMBL310242,,CHEMBL310242,5.75,False,http://www.openphacts.org/units/Nanomolar,132841,=,1,True,=,,IC50,nM,,1800.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,1.8
3,,368887,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '15.12', 'le': '0.32', 'lle': '1.13', ...",CHEMBL74874,,CHEMBL74874,4.96,False,http://www.openphacts.org/units/Nanomolar,132840,=,1,True,=,,IC50,nM,,11000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,11.0
4,,375954,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '13.43', 'le': '0.33', 'lle': '1.06', ...",CHEMBL75183,,CHEMBL75183,5.00,False,http://www.openphacts.org/units/Nanomolar,132838,=,1,True,=,,IC50,nM,,10000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1229,Not Active,20120045,[],CHEMBL4510291,APP40 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,CC(C)(C)OC(=O)N1CCCC1CNC1CCC(c2cc(F)ccc2F)(S(=...,,,CHEMBL4507288,,,"{'bei': '10.18', 'le': '0.21', 'lle': '-0.42',...",CHEMBL4558518,,CHEMBL4558518,5.80,False,http://www.openphacts.org/units/Nanomolar,3359706,=,54,True,=,,IC50,nM,,1600.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,nM,UO_0000065,,1600.0
1230,Active,20120642,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4510293,APP42 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,COc1cc(-c2cn(C3CCc4c(F)cccc4N(CC(F)(F)F)C3=O)n...,,,CHEMBL4507289,,,"{'bei': '14.59', 'le': '0.28', 'lle': '2.84', ...",CHEMBL3609637,,CHEMBL3609637,7.51,False,http://www.openphacts.org/units/Nanomolar,3359707,=,54,True,=,,IC50,nM,,31.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,nM,UO_0000065,,31.0
1231,Active,20120643,[],CHEMBL4510294,APP40 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,COc1cc(-c2cn(C3CCc4c(F)cccc4N(CC(F)(F)F)C3=O)n...,,,CHEMBL4507289,,,"{'bei': '13.38', 'le': '0.25', 'lle': '2.21', ...",CHEMBL3609637,,CHEMBL3609637,6.88,False,http://www.openphacts.org/units/Nanomolar,3359707,=,54,True,=,,IC50,nM,,131.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,nM,UO_0000065,,131.0
1232,Not Active,20120647,[],CHEMBL4510293,APP42 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,COc1cc(-c2cn(C3CCc4ccccc4N(CC(F)(F)F)C3=O)nn2)...,,,CHEMBL4507289,,,,CHEMBL4534005,,CHEMBL4534005,,False,http://www.openphacts.org/units/Nanomolar,3359708,>,54,True,>,,IC50,nM,,10000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,10.0
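The table above carries both a `standard_value` column (IC50 in nM) and a `pchembl_value` column. As a rough sketch of how the two relate, pChEMBL is the negative base-10 logarithm of the molar concentration. The helper name `ic50_nm_to_pchembl` is made up for this illustration, and agreement with the stored column to the last digit is not guaranteed for every row.

```python
import math

def ic50_nm_to_pchembl(ic50_nm: float) -> float:
    """Convert an IC50 given in nM to -log10 of the molar concentration."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M
    return -math.log10(ic50_nm * 1e-9)

# A few standard_value entries taken from the table above.
for ic50_nm in (5000.0, 10000.0, 31.0):
    print(ic50_nm, round(ic50_nm_to_pchembl(ic50_nm), 2))
```

For these three rows the rounded results match the `pchembl_value` entries shown in the table (5.30, 5.00 and 7.51).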


Finally, we save the raw bioactivity data to the CSV file **BetaAmyloid_01_bioactivity_data_raw.csv**.

In [None]:
df.to_csv('BetaAmyloid_01_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [None]:
# Keep only rows that have both a standard_value and a canonical_smiles
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2

  


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,357577,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '17.02', 'le': '0.34', 'lle': '1.98', ...",CHEMBL311039,,CHEMBL311039,5.30,False,http://www.openphacts.org/units/Nanomolar,132837,=,1,True,=,,IC50,nM,,5000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,5.0
1,,357580,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '18.60', 'le': '0.40', 'lle': '2.33', ...",CHEMBL450926,,CHEMBL450926,5.57,False,http://www.openphacts.org/units/Nanomolar,132839,=,1,True,=,,IC50,nM,,2700.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,2.7
2,,358965,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '18.45', 'le': '0.37', 'lle': '2.42', ...",CHEMBL310242,,CHEMBL310242,5.75,False,http://www.openphacts.org/units/Nanomolar,132841,=,1,True,=,,IC50,nM,,1800.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,1.8
3,,368887,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '15.12', 'le': '0.32', 'lle': '1.13', ...",CHEMBL74874,,CHEMBL74874,4.96,False,http://www.openphacts.org/units/Nanomolar,132840,=,1,True,=,,IC50,nM,,11000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,11.0
4,,375954,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '13.43', 'le': '0.33', 'lle': '1.06', ...",CHEMBL75183,,CHEMBL75183,5.00,False,http://www.openphacts.org/units/Nanomolar,132838,=,1,True,=,,IC50,nM,,10000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1229,Not Active,20120045,[],CHEMBL4510291,APP40 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,CC(C)(C)OC(=O)N1CCCC1CNC1CCC(c2cc(F)ccc2F)(S(=...,,,CHEMBL4507288,,,"{'bei': '10.18', 'le': '0.21', 'lle': '-0.42',...",CHEMBL4558518,,CHEMBL4558518,5.80,False,http://www.openphacts.org/units/Nanomolar,3359706,=,54,True,=,,IC50,nM,,1600.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,nM,UO_0000065,,1600.0
1230,Active,20120642,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4510293,APP42 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,COc1cc(-c2cn(C3CCc4c(F)cccc4N(CC(F)(F)F)C3=O)n...,,,CHEMBL4507289,,,"{'bei': '14.59', 'le': '0.28', 'lle': '2.84', ...",CHEMBL3609637,,CHEMBL3609637,7.51,False,http://www.openphacts.org/units/Nanomolar,3359707,=,54,True,=,,IC50,nM,,31.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,nM,UO_0000065,,31.0
1231,Active,20120643,[],CHEMBL4510294,APP40 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,COc1cc(-c2cn(C3CCc4c(F)cccc4N(CC(F)(F)F)C3=O)n...,,,CHEMBL4507289,,,"{'bei': '13.38', 'le': '0.25', 'lle': '2.21', ...",CHEMBL3609637,,CHEMBL3609637,6.88,False,http://www.openphacts.org/units/Nanomolar,3359707,=,54,True,=,,IC50,nM,,131.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,nM,UO_0000065,,131.0
1232,Not Active,20120647,[],CHEMBL4510293,APP42 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,COc1cc(-c2cn(C3CCc4ccccc4N(CC(F)(F)F)C3=O)nn2)...,,,CHEMBL4507289,,,,CHEMBL4534005,,CHEMBL4534005,,False,http://www.openphacts.org/units/Nanomolar,3359708,>,54,True,>,,IC50,nM,,10000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,10.0
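The missing-value filter can also be written as a single `dropna` call. The sketch below uses a hypothetical toy DataFrame with the same column names as the ChEMBL result. The numeric cast is a precaution in case the values arrive as strings, which is an assumption about the input and not something the output above confirms.

```python
import pandas as pd

# Hypothetical toy data reusing the column names of the ChEMBL result.
df = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": ["5000.0", "2700.0", None, "31.0"],
})

# Drop rows that lack either a measured value or a structure.
df2 = df.dropna(subset=["standard_value", "canonical_smiles"]).copy()

# Make sure the activity values are numeric before any comparison.
df2["standard_value"] = pd.to_numeric(df2["standard_value"])

print(df2)
```

Rows `B` (no SMILES) and `C` (no value) are removed, and `.copy()` avoids pandas warnings when the filtered frame is modified afterwards.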


In [None]:
len(df2.canonical_smiles.unique())

929

In [None]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,357577,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '17.02', 'le': '0.34', 'lle': '1.98', ...",CHEMBL311039,,CHEMBL311039,5.30,False,http://www.openphacts.org/units/Nanomolar,132837,=,1,True,=,,IC50,nM,,5000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,5.0
1,,357580,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '18.60', 'le': '0.40', 'lle': '2.33', ...",CHEMBL450926,,CHEMBL450926,5.57,False,http://www.openphacts.org/units/Nanomolar,132839,=,1,True,=,,IC50,nM,,2700.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,2.7
2,,358965,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '18.45', 'le': '0.37', 'lle': '2.42', ...",CHEMBL310242,,CHEMBL310242,5.75,False,http://www.openphacts.org/units/Nanomolar,132841,=,1,True,=,,IC50,nM,,1800.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,1.8
3,,368887,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '15.12', 'le': '0.32', 'lle': '1.13', ...",CHEMBL74874,,CHEMBL74874,4.96,False,http://www.openphacts.org/units/Nanomolar,132840,=,1,True,=,,IC50,nM,,11000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,11.0
4,,375954,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,,,CHEMBL1133739,J. Med. Chem.,2000.0,"{'bei': '13.43', 'le': '0.33', 'lle': '1.06', ...",CHEMBL75183,,CHEMBL75183,5.00,False,http://www.openphacts.org/units/Nanomolar,132838,=,1,True,=,,IC50,nM,,10000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1224,,19258715,[],CHEMBL4403499,Inhibition of beta amyloid in human SH-SY5Y cells,B,,,BAO_0000190,BAO_0000219,cell-based format,O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23,Outside typical range,Values for this activity type are unusually la...,CHEMBL4402535,Eur J Med Chem,2019.0,,CHEMBL6246,ELLAGIC ACID,CHEMBL6246,,False,http://www.openphacts.org/units/Nanomolar,3215485,=,1,True,=,,IC50,nM,,300000.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,uM,UO_0000065,,300.0
1226,Active,20120042,[],CHEMBL4510290,APP42 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,O=S(=O)(NC1CCC(c2cc(F)ccc2F)(S(=O)(=O)c2ccc(Cl...,,,CHEMBL4507288,,,,CHEMBL1091513,,CHEMBL1091513,,False,http://www.openphacts.org/units/Nanomolar,3359705,<,54,True,<,,IC50,nM,,0.5,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,nM,UO_0000065,,0.5
1229,Not Active,20120045,[],CHEMBL4510291,APP40 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,CC(C)(C)OC(=O)N1CCCC1CNC1CCC(c2cc(F)ccc2F)(S(=...,,,CHEMBL4507288,,,"{'bei': '10.18', 'le': '0.21', 'lle': '-0.42',...",CHEMBL4558518,,CHEMBL4558518,5.80,False,http://www.openphacts.org/units/Nanomolar,3359706,=,54,True,=,,IC50,nM,,1600.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,nM,UO_0000065,,1600.0
1230,Active,20120642,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4510293,APP42 inhibition assay,B,,,BAO_0000190,BAO_0000357,single protein format,COc1cc(-c2cn(C3CCc4c(F)cccc4N(CC(F)(F)F)C3=O)n...,,,CHEMBL4507289,,,"{'bei': '14.59', 'le': '0.28', 'lle': '2.84', ...",CHEMBL3609637,,CHEMBL3609637,7.51,False,http://www.openphacts.org/units/Nanomolar,3359707,=,54,True,=,,IC50,nM,,31.0,CHEMBL2487,Homo sapiens,Beta amyloid A4 protein,9606,,,IC50,nM,UO_0000065,,31.0
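`drop_duplicates` keeps the first row it sees for each SMILES string and discards the rest, so a compound measured in several assays ends up with whichever value happens to come first. The sketch below illustrates this on hypothetical toy data; the 31 nM and 131 nM pair mirrors the two measurements of CHEMBL3609637 in the table above. `keep="first"` is the pandas default and is spelled out only for clarity.

```python
import pandas as pd

# Hypothetical toy data: molecule M1 has two IC50 measurements.
df2 = pd.DataFrame({
    "molecule_chembl_id": ["M1", "M1", "M2", "M3"],
    "canonical_smiles": ["CCO", "CCO", "CCN", "CCC"],
    "standard_value": [31.0, 131.0, 1600.0, 10000.0],
})

# Number of distinct structures before de-duplication.
n_unique = df2["canonical_smiles"].nunique()

# Keep the first row per structure; later measurements are discarded.
df2_nr = df2.drop_duplicates(subset=["canonical_smiles"], keep="first")

# Value retained for the duplicated molecule.
kept_value = df2_nr.loc[df2_nr["molecule_chembl_id"] == "M1", "standard_value"].iloc[0]

print(n_unique, len(df2_nr), kept_value)
```

After de-duplication the row count equals the number of unique SMILES strings, which is a quick consistency check for the real data as well.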


## **Data pre-processing of the bioactivity data**

### **Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL311039,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,5000.0
1,CHEMBL450926,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,2700.0
2,CHEMBL310242,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,1800.0
3,CHEMBL74874,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,11000.0
4,CHEMBL75183,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,10000.0
...,...,...,...
1224,CHEMBL6246,O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23,300000.0
1226,CHEMBL1091513,O=S(=O)(NC1CCC(c2cc(F)ccc2F)(S(=O)(=O)c2ccc(Cl...,0.5
1229,CHEMBL4558518,CC(C)(C)OC(=O)N1CCCC1CNC1CCC(c2cc(F)ccc2F)(S(=...,1600.0
1230,CHEMBL3609637,COc1cc(-c2cn(C3CCc4c(F)cccc4N(CC(F)(F)F)C3=O)n...,31.0


Save the DataFrame to a CSV file.

In [None]:
df3.to_csv('BetaAmyloid_02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, and those with an IC50 of 10,000 nM or more are labeled **inactive**. Compounds with values between 1,000 and 10,000 nM are labeled **intermediate**.

In [None]:
df4 = pd.read_csv('BetaAmyloid_02_bioactivity_data_preprocessed.csv')

In [None]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL311039,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,5000.0,intermediate
1,CHEMBL450926,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,2700.0,intermediate
2,CHEMBL310242,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,1800.0,intermediate
3,CHEMBL74874,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,11000.0,inactive
4,CHEMBL75183,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,10000.0,inactive
...,...,...,...,...
924,CHEMBL6246,O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23,300000.0,inactive
925,CHEMBL1091513,O=S(=O)(NC1CCC(c2cc(F)ccc2F)(S(=O)(=O)c2ccc(Cl...,0.5,active
926,CHEMBL4558518,CC(C)(C)OC(=O)N1CCCC1CNC1CCC(c2cc(F)ccc2F)(S(=...,1600.0,intermediate
927,CHEMBL3609637,COc1cc(-c2cn(C3CCc4c(F)cccc4N(CC(F)(F)F)C3=O)n...,31.0,active
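The labeling loop above can also be expressed in vectorized form. The following sketch uses `numpy.select` with the same cut-offs as the loop (10,000 nM and above is inactive, 1,000 nM and below is active) on a hypothetical toy DataFrame whose values are taken from the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using standard_value entries from the table above.
df4 = pd.DataFrame({
    "standard_value": [5000.0, 2700.0, 11000.0, 10000.0, 0.5, 31.0],
})

values = df4["standard_value"].astype(float)

# Conditions are checked in order; anything matching neither is intermediate.
conditions = [values >= 10000, values <= 1000]
choices = ["inactive", "active"]
df4["class"] = np.select(conditions, choices, default="intermediate")

print(df4)
```

This writes the labels straight into a new column, so no separate list, Series, or `pd.concat` step is needed.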


Save the DataFrame to a CSV file.

In [None]:
df5.to_csv('BetaAmyloid_03_bioactivity_data_curated.csv', index=False)

In [None]:
! zip BetaAmyloid.zip *.csv

  adding: acetylcholinesterase_01_bioactivity_data_raw.csv (deflated 92%)
  adding: acetylcholinesterase_02_bioactivity_data_preprocessed.csv (deflated 81%)
  adding: acetylcholinesterase_03_bioactivity_data_curated.csv (deflated 82%)
  adding: BetaAmyloid_01_bioactivity_data_raw.csv (deflated 92%)
  adding: BetaAmyloid_02_bioactivity_data_preprocessed.csv (deflated 81%)
  adding: BetaAmyloid_03_bioactivity_data_curated.csv (deflated 82%)
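As the output above shows, `*.csv` also picks up CSV files left in the working directory by other runs. One possible way to archive only the files for this target is to filter by filename prefix with Python's standard `zipfile` module. The helper `zip_csvs` and the placeholder files below are hypothetical and exist only to make the sketch runnable on its own.

```python
import tempfile
import zipfile
from pathlib import Path

def zip_csvs(directory: Path, prefix: str, archive_name: str) -> list:
    """Zip only the CSV files in `directory` whose names start with `prefix`."""
    archive_path = directory / archive_name
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for csv_path in sorted(directory.glob(f"{prefix}*.csv")):
            # Store the bare filename, not the full path.
            zf.write(csv_path, arcname=csv_path.name)
        return zf.namelist()

# Demonstration in a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    for name in (
        "BetaAmyloid_01_bioactivity_data_raw.csv",
        "acetylcholinesterase_01_bioactivity_data_raw.csv",
    ):
        (workdir / name).write_text("placeholder\n")

    archived = zip_csvs(workdir, "BetaAmyloid_", "BetaAmyloid.zip")
    print(archived)
```

In the notebook itself, the same function could be called with `Path(".")` as the directory.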


In [None]:
! ls -l

total 2148
-rw-r--r-- 1 root root 754885 Jun 20 21:37 acetylcholinesterase_01_bioactivity_data_raw.csv
-rw-r--r-- 1 root root  72657 Jun 20 21:37 acetylcholinesterase_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root  81186 Jun 20 21:37 acetylcholinesterase_03_bioactivity_data_curated.csv
-rw-r--r-- 1 root root 181904 Jun 21 00:08 acetylcholinesterase.zip
-rw-r--r-- 1 root root 754885 Jun 20 23:58 BetaAmyloid_01_bioactivity_data_raw.csv
-rw-r--r-- 1 root root  72657 Jun 21 00:03 BetaAmyloid_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root  81186 Jun 21 00:08 BetaAmyloid_03_bioactivity_data_curated.csv
-rw-r--r-- 1 root root 181904 Jun 21 00:09 BetaAmyloid.zip
drwxr-xr-x 1 root root   4096 Jun 15 13:37 sample_data


---