<a href="https://colab.research.google.com/github/djava387/Snap-game-java/blob/master/1_DatacollectionPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The aim of this project is to identify inhibitors of the glucocorticoid receptor from the ChEMBL database. The project consists of three parts:

1. Data Pre-processing: In this step, the data is cleaned and pre-processed for further analysis.

2. Identification of compounds with potential to inhibit the glucocorticoid receptor: In this step, candidate inhibitors are identified.

3. Analysis of the results: In this step, the identified compounds are analysed, and the results are presented.


The glucocorticoid receptor (GR) is a transcription factor that is activated by the corticosteroid hormones, including cortisol. When GR is activated, it can bind to DNA and change the transcription of genes. GR is important for a variety of physiological processes, including stress response, inflammation, and energy metabolism. GR is also a drug target for the treatment of a variety of diseases, including rheumatoid arthritis, asthma, and cancer.

The activity of GR can be modulated by small-molecule drugs that bind to the receptor and either activate or inhibit it. Several small-molecule modulators of GR are known, but there is still a lot to learn about how they work and what their potential uses might be.

The goal of this project was to identify novel small-molecule modulators of GR using a high-throughput screening (HTS) approach. The HTS approach allows for the screening of many compounds in a short amount of time. The results of the screen were then analysed to identify the most promising compounds.

In this project, the GR protein was expressed in a cell line and then screened against a library of over 2 million compounds. The screen identified several promising compounds, including some that had not been reported to modulate GR before. The results of this study provide a starting point for further exploration of the potential uses of these compounds.

Using the glucocorticoid receptor as the target, the first step is to extract features from the compound library: molecular weight, Lipinski's rule-of-five descriptors, number of hydrogen bond donors, number of hydrogen bond acceptors, and number of rotatable bonds. The second step is to pre-process the data using these features, as follows:

1. Convert all values to the same scale
2. Remove any outliers
3. Standardize the data
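
The three steps above can be sketched as follows. This is only an illustration using the Python standard library, and it rests on assumptions the text does not spell out: "same scale" is read as converting every concentration to nM, outliers are dropped with a 1.5 × IQR rule, and standardization is a z-score. The function names, the unit table, and the sample values are hypothetical.

```python
import statistics

# Hypothetical unit table: multipliers that convert a concentration to nM.
TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}

def to_common_scale(values_with_units):
    """Step 1: express every (value, unit) pair in nM."""
    return [value * TO_NM[unit] for value, unit in values_with_units]

def remove_outliers(values, k=1.5):
    """Step 2: drop points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def standardize(values):
    """Step 3: z-score (subtract the mean, divide by the standard deviation)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

# Made-up measurements in mixed units; the last one is an obvious outlier.
raw = [(100.0, "nM"), (77.0, "nM"), (0.32, "uM"), (250.0, "nM"),
       (10.0, "nM"), (5.0, "mM")]

in_nm = to_common_scale(raw)
kept = remove_outliers(in_nm)
z_scores = standardize(kept)
print(len(in_nm), len(kept))
```

Because IC50 values often span several orders of magnitude, outlier rules like this one are frequently applied on a log scale instead; which variant fits this project is a judgment call the text leaves open.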

After pre-processing, the data is used to train a model that predicts the biological activity of a molecule. The model is trained on a fingerprint computed for each compound. The fingerprint is a string of characters that represents the molecular structure of a molecule. The model is trained to predict the pIC50 value of a molecule, which is a measure of how potent the molecule is in vitro.
The model is a quantitative structure-activity relationship (QSAR) model: a Random Forest regressor built on the descriptor output computed with the PaDEL-Descriptor software. Several regression models were also compared. The main objective is to support drug discovery by reducing reliance on slow and inaccurate trial-and-error methods, which can take years, and to help design more effective drugs with fewer side effects using pIC50 values as a guide.
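
The link between IC50 and pIC50 can be made concrete with a short sketch. pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, so a lower IC50 (a more potent compound) gives a higher pIC50. The helper name `pic50_from_nm` is hypothetical; the sample inputs are IC50 values that appear in the ChEMBL output later in this notebook.

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)

for ic50 in (100.0, 77.0, 10.0):
    print(ic50, round(pic50_from_nm(ic50), 2))
```

Rounded to two decimals these give 7.0, 7.11 and 8.0, which agree with the `pchembl_value` column reported by ChEMBL for the same records.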


# **Step 1: Data Collection and Pre-processing from ChEMBL Database**

# **Installing libraries**
Install the ChEMBL web service package to retrieve bioactivity data from the ChEMBL database.

In [None]:
! pip install chembl_webresource_client



# **Importing libraries**

In [None]:
# Import libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

# **Search for Target protein**
Search the ChEMBL target table for the glucocorticoid receptor.

In [None]:
# Target search for Glucocorticoid receptor
target = new_client.target
target_query = target.search('Glucocorticoid receptor')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P04150', 'xref_name': None, 'xre...",Homo sapiens,Glucocorticoid receptor,20.0,False,CHEMBL2034,"[{'accession': 'P04150', 'component_descriptio...",SINGLE PROTEIN,9606.0
1,"[{'xref_id': 'P06536', 'xref_name': None, 'xre...",Rattus norvegicus,Glucocorticoid receptor,20.0,False,CHEMBL3368,"[{'accession': 'P06536', 'component_descriptio...",SINGLE PROTEIN,10116.0
2,"[{'xref_id': 'P06537', 'xref_name': None, 'xre...",Mus musculus,Glucocorticoid receptor,20.0,False,CHEMBL3144,"[{'accession': 'P06537', 'component_descriptio...",SINGLE PROTEIN,10090.0
3,[],Canis lupus familiaris,Glucocorticoid receptor,20.0,False,CHEMBL4105966,"[{'accession': 'F1Q298', 'component_descriptio...",SINGLE PROTEIN,9615.0
4,[],Sus scrofa,Glucocorticoid receptor,20.0,False,CHEMBL4295956,"[{'accession': 'Q9N1U3', 'component_descriptio...",SINGLE PROTEIN,9823.0
...,...,...,...,...,...,...,...,...,...
2176,[],Homo sapiens,VHL/Protein-tyrosine phosphatase 2C,1.0,False,CHEMBL4630742,"[{'accession': 'Q06124', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606.0
2177,[],Homo sapiens,26S proteasome,0.0,False,CHEMBL2364701,"[{'accession': 'Q99460', 'component_descriptio...",PROTEIN COMPLEX,9606.0
2178,[],Homo sapiens,Aminopeptidase,0.0,False,CHEMBL3831223,"[{'accession': 'P15144', 'component_descriptio...",PROTEIN FAMILY,9606.0
2179,[],Homo sapiens,80S Ribosome,0.0,False,CHEMBL3987582,"[{'accession': 'P08865', 'component_descriptio...",PROTEIN NUCLEIC-ACID COMPLEX,9606.0


# **Select and retrieve bioactivity data for Human Glucocorticoid receptor (first entry)**
Assign the first entry (which corresponds to the target protein, human Glucocorticoid receptor) to the **selected_target** variable.

In [None]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL2034'
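
Selecting by position relies on the human receptor being ranked first in the search result. As a hedged alternative, the target can be picked by matching its fields explicitly. The sketch below works on plain dictionaries that mirror the first three rows of the output above (only the fields needed for the match are kept); `pick_target` is a hypothetical helper, not part of the ChEMBL client.

```python
# Records mirror the first rows of the search result shown above.
records = [
    {"organism": "Homo sapiens", "pref_name": "Glucocorticoid receptor",
     "target_type": "SINGLE PROTEIN", "target_chembl_id": "CHEMBL2034"},
    {"organism": "Rattus norvegicus", "pref_name": "Glucocorticoid receptor",
     "target_type": "SINGLE PROTEIN", "target_chembl_id": "CHEMBL3368"},
    {"organism": "Mus musculus", "pref_name": "Glucocorticoid receptor",
     "target_type": "SINGLE PROTEIN", "target_chembl_id": "CHEMBL3144"},
]

def pick_target(records, organism, pref_name, target_type="SINGLE PROTEIN"):
    """Return the ChEMBL ID of the single record matching all three fields."""
    matches = [r["target_chembl_id"] for r in records
               if r["organism"] == organism
               and r["pref_name"] == pref_name
               and r["target_type"] == target_type]
    if len(matches) != 1:
        raise LookupError(f"expected one match, found {len(matches)}")
    return matches[0]

selected = pick_target(records, "Homo sapiens", "Glucocorticoid receptor")
print(selected)
```

Raising an error when zero or several records match makes a changed search ranking fail loudly instead of silently selecting a different species.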

Retrieve only the bioactivity data for human Glucocorticoid receptor (CHEMBL2034) that are reported as IC50 values.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,32672,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccsc1)Oc1ccc(...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '17.97', 'le': '0.34', 'lle': '-0.05',...",CHEMBL134277,,CHEMBL134277,7.00,False,http://www.openphacts.org/units/Nanomolar,253290,=,1,True,=,,IC50,nM,,100.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,100.0
1,,39901,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CCc1ccccc1/C=C1\Oc2ccc(F)cc2-c2ccc3c(c21)C(C)=...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '17.29', 'le': '0.31', 'lle': '-0.44',...",CHEMBL415524,,CHEMBL415524,7.11,False,http://www.openphacts.org/units/Nanomolar,253298,=,1,True,=,,IC50,nM,,77.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,77.0
2,,45859,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1N(C)C)O...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '15.23', 'le': '0.28', 'lle': '-0.57',...",CHEMBL336353,,CHEMBL336353,6.50,False,http://www.openphacts.org/units/Nanomolar,253301,=,1,True,=,,IC50,nM,,320.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,320.0
3,,47127,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1)Oc1c(F...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '17.22', 'le': '0.31', 'lle': '-0.39',...",CHEMBL413309,,CHEMBL413309,6.60,False,http://www.openphacts.org/units/Nanomolar,253313,=,1,True,=,,IC50,nM,,250.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,250.0
4,,48443,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC(=O)O[C@]1(C(C)=O)CC[C@H]2[C@@H]3C[C@H](C)C4...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '20.70', 'le': '0.39', 'lle': '3.34', ...",CHEMBL717,MEDROXYPROGESTERONE ACETATE,CHEMBL717,8.00,False,http://www.openphacts.org/units/Nanomolar,253311,=,1,True,=,,IC50,nM,,10.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4163,,19273170,[],CHEMBL4407269,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CN(C)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O)C...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '17.23', 'le': '0.32', 'lle': '1.69', ...",CHEMBL4163069,,CHEMBL4163069,8.12,False,http://www.openphacts.org/units/Nanomolar,3218403,=,1,True,=,,IC50,nM,,7.5,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,7.5
4164,,19273171,[],CHEMBL4407269,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '17.49', 'le': '0.32', 'lle': '2.07', ...",CHEMBL4571052,,CHEMBL4571052,7.80,False,http://www.openphacts.org/units/Nanomolar,3218416,=,1,True,=,,IC50,nM,,16.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,16.0
4165,,19273172,[],CHEMBL4407269,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CCC4=C3[...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '19.68', 'le': '0.36', 'lle': '3.05', ...",CHEMBL1276308,MIFEPRISTONE,CHEMBL1276308,8.46,False,http://www.openphacts.org/units/Nanomolar,3218423,=,1,True,=,,IC50,nM,,3.5,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,3.5
4166,Active,20722949,[],CHEMBL4509807,Fluorescence polarization assay,B,,,BAO_0000190,BAO_0000357,single protein format,CCS(=O)(=O)c1cc2cc(C[C@](O)(CC(C)(C)c3ccc(F)cc...,,,CHEMBL4507277,,2021.0,,CHEMBL3358954,BI 653048,CHEMBL3358954,,False,http://www.openphacts.org/units/Nanomolar,3359683,,54,True,,,IC50,nM,,55.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,55.0


Save the raw bioactivity data to the CSV file `glucocorticoid _01_bioactivity_data_raw.csv`.

In [None]:
df.to_csv('glucocorticoid _01_bioactivity_data_raw.csv', index=False)

# **Handling missing data**
Drop records with a missing value in the standard_value or canonical_smiles column.

In [None]:
# Keep only rows that have both a measured value and a structure
df2 = df.dropna(subset=['standard_value', 'canonical_smiles'])
df2

  


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,32672,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccsc1)Oc1ccc(...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '17.97', 'le': '0.34', 'lle': '-0.05',...",CHEMBL134277,,CHEMBL134277,7.00,False,http://www.openphacts.org/units/Nanomolar,253290,=,1,True,=,,IC50,nM,,100.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,100.0
1,,39901,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CCc1ccccc1/C=C1\Oc2ccc(F)cc2-c2ccc3c(c21)C(C)=...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '17.29', 'le': '0.31', 'lle': '-0.44',...",CHEMBL415524,,CHEMBL415524,7.11,False,http://www.openphacts.org/units/Nanomolar,253298,=,1,True,=,,IC50,nM,,77.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,77.0
2,,45859,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1N(C)C)O...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '15.23', 'le': '0.28', 'lle': '-0.57',...",CHEMBL336353,,CHEMBL336353,6.50,False,http://www.openphacts.org/units/Nanomolar,253301,=,1,True,=,,IC50,nM,,320.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,320.0
3,,47127,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1)Oc1c(F...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '17.22', 'le': '0.31', 'lle': '-0.39',...",CHEMBL413309,,CHEMBL413309,6.60,False,http://www.openphacts.org/units/Nanomolar,253313,=,1,True,=,,IC50,nM,,250.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,250.0
4,,48443,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC(=O)O[C@]1(C(C)=O)CC[C@H]2[C@@H]3C[C@H](C)C4...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '20.70', 'le': '0.39', 'lle': '3.34', ...",CHEMBL717,MEDROXYPROGESTERONE ACETATE,CHEMBL717,8.00,False,http://www.openphacts.org/units/Nanomolar,253311,=,1,True,=,,IC50,nM,,10.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4163,,19273170,[],CHEMBL4407269,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CN(C)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O)C...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '17.23', 'le': '0.32', 'lle': '1.69', ...",CHEMBL4163069,,CHEMBL4163069,8.12,False,http://www.openphacts.org/units/Nanomolar,3218403,=,1,True,=,,IC50,nM,,7.5,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,7.5
4164,,19273171,[],CHEMBL4407269,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '17.49', 'le': '0.32', 'lle': '2.07', ...",CHEMBL4571052,,CHEMBL4571052,7.80,False,http://www.openphacts.org/units/Nanomolar,3218416,=,1,True,=,,IC50,nM,,16.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,16.0
4165,,19273172,[],CHEMBL4407269,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CCC4=C3[...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '19.68', 'le': '0.36', 'lle': '3.05', ...",CHEMBL1276308,MIFEPRISTONE,CHEMBL1276308,8.46,False,http://www.openphacts.org/units/Nanomolar,3218423,=,1,True,=,,IC50,nM,,3.5,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,3.5
4166,Active,20722949,[],CHEMBL4509807,Fluorescence polarization assay,B,,,BAO_0000190,BAO_0000357,single protein format,CCS(=O)(=O)c1cc2cc(C[C@](O)(CC(C)(C)c3ccc(F)cc...,,,CHEMBL4507277,,2021.0,,CHEMBL3358954,BI 653048,CHEMBL3358954,,False,http://www.openphacts.org/units/Nanomolar,3359683,,54,True,,,IC50,nM,,55.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,55.0


In [None]:
len(df2.canonical_smiles.unique())

1826

In [None]:
# Keep the first record for each unique SMILES string
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,32672,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccsc1)Oc1ccc(...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '17.97', 'le': '0.34', 'lle': '-0.05',...",CHEMBL134277,,CHEMBL134277,7.00,False,http://www.openphacts.org/units/Nanomolar,253290,=,1,True,=,,IC50,nM,,100.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,100.0
1,,39901,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CCc1ccccc1/C=C1\Oc2ccc(F)cc2-c2ccc3c(c21)C(C)=...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '17.29', 'le': '0.31', 'lle': '-0.44',...",CHEMBL415524,,CHEMBL415524,7.11,False,http://www.openphacts.org/units/Nanomolar,253298,=,1,True,=,,IC50,nM,,77.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,77.0
2,,45859,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1N(C)C)O...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '15.23', 'le': '0.28', 'lle': '-0.57',...",CHEMBL336353,,CHEMBL336353,6.50,False,http://www.openphacts.org/units/Nanomolar,253301,=,1,True,=,,IC50,nM,,320.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,320.0
3,,47127,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1)Oc1c(F...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '17.22', 'le': '0.31', 'lle': '-0.39',...",CHEMBL413309,,CHEMBL413309,6.60,False,http://www.openphacts.org/units/Nanomolar,253313,=,1,True,=,,IC50,nM,,250.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,250.0
4,,48443,[],CHEMBL685254,Inhibition of human glucocorticoid receptor at...,B,,,BAO_0000190,BAO_0000357,single protein format,CC(=O)O[C@]1(C(C)=O)CC[C@H]2[C@@H]3C[C@H](C)C4...,,,CHEMBL1145087,J. Med. Chem.,2003.0,"{'bei': '20.70', 'le': '0.39', 'lle': '3.34', ...",CHEMBL717,MEDROXYPROGESTERONE ACETATE,CHEMBL717,8.00,False,http://www.openphacts.org/units/Nanomolar,253311,=,1,True,=,,IC50,nM,,10.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4153,,19273009,[],CHEMBL4407257,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CN(C)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O)C...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '15.32', 'le': '0.28', 'lle': '0.71', ...",CHEMBL4476413,,CHEMBL4476413,7.47,False,http://www.openphacts.org/units/Nanomolar,3218410,=,1,True,=,,IC50,nM,,33.6,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,33.6
4154,,19273010,[],CHEMBL4407257,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CC(C)C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '15.22', 'le': '0.28', 'lle': '0.84', ...",CHEMBL4468879,,CHEMBL4468879,7.21,False,http://www.openphacts.org/units/Nanomolar,3218411,=,1,True,=,,IC50,nM,,62.0,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,62.0
4155,,19273011,[],CHEMBL4407257,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CN(C)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O)C...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '16.00', 'le': '0.29', 'lle': '1.43', ...",CHEMBL4527872,,CHEMBL4527872,7.54,False,http://www.openphacts.org/units/Nanomolar,3218412,=,1,True,=,,IC50,nM,,28.5,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,28.5
4158,,19273014,[],CHEMBL4407257,Antagonist activity at human glucocorticoid re...,B,,,BAO_0000190,BAO_0000219,cell-based format,CCN(CC)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O...,,,CHEMBL4406804,J Med Chem,2019.0,"{'bei': '15.66', 'le': '0.29', 'lle': '0.71', ...",CHEMBL4576596,,CHEMBL4576596,7.86,False,http://www.openphacts.org/units/Nanomolar,3218405,=,1,True,=,,IC50,nM,,13.9,CHEMBL2034,Homo sapiens,Glucocorticoid receptor,9606,,,IC50,nM,UO_0000065,,13.9
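
The `drop_duplicates` call above keeps the first row for each SMILES string, so when a compound has been measured several times only its first IC50 survives. One alternative, which this notebook does not use, is to collapse repeated measurements to their median. The sketch below contrasts the two on plain Python data; the SMILES strings and IC50 values are made up for illustration, and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Illustrative (SMILES, IC50 in nM) pairs; the values are made up.
measurements = [("CCO", 100.0), ("c1ccccc1", 40.0),
                ("CCO", 300.0), ("CCO", 200.0)]

def keep_first(rows):
    """Mimic drop_duplicates: keep the first value seen for each SMILES."""
    first = {}
    for smiles, value in rows:
        first.setdefault(smiles, value)
    return first

def aggregate_median(rows):
    """Alternative: collapse repeated measurements to their median."""
    grouped = defaultdict(list)
    for smiles, value in rows:
        grouped[smiles].append(value)
    return {smiles: median(values) for smiles, values in grouped.items()}

print(keep_first(measurements))
print(aggregate_median(measurements))
```

Keeping the first record is simpler and reproduces the notebook's numbers; the median is less sensitive to which assay happens to be listed first.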


# **Data pre-processing of the bioactivity data**
Select three columns (molecule_chembl_id, canonical_smiles, standard_value) into a new DataFrame. The bioactivity class is added in the labeling step.

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL134277,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccsc1)Oc1ccc(...,100.0
1,CHEMBL415524,CCc1ccccc1/C=C1\Oc2ccc(F)cc2-c2ccc3c(c21)C(C)=...,77.0
2,CHEMBL336353,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1N(C)C)O...,320.0
3,CHEMBL413309,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1)Oc1c(F...,250.0
4,CHEMBL717,CC(=O)O[C@]1(C(C)=O)CC[C@H]2[C@@H]3C[C@H](C)C4...,10.0
...,...,...,...
4153,CHEMBL4476413,CN(C)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O)C...,33.6
4154,CHEMBL4468879,CC(C)C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C...,62.0
4155,CHEMBL4527872,CN(C)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O)C...,28.5
4158,CHEMBL4576596,CCN(CC)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O...,13.9


Save the DataFrame to a CSV file.

In [None]:
df3.to_csv('glucocorticoid _02_bioactivity_data_preprocessed.csv', index=False)

# **Labeling active, inactive or intermediate compounds**
The bioactivity data are IC50 values in nM. Compounds with an IC50 of 1,000 nM or less are considered **active**, those with an IC50 of 10,000 nM or more are **inactive**, and values in between are **intermediate**.

In [None]:
df4 = pd.read_csv('glucocorticoid _02_bioactivity_data_preprocessed.csv')

In [None]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
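
The loop above can also be written as a small function, which makes the behaviour at the two cut-offs easy to check. This is a sketch with the same thresholds; `classify_ic50` and its parameter names are hypothetical.

```python
def classify_ic50(ic50_nm, active_max=1000.0, inactive_min=10000.0):
    """Label an IC50 (nM) using the same cut-offs as the loop above."""
    value = float(ic50_nm)
    if value >= inactive_min:
        return "inactive"
    if value <= active_max:
        return "active"
    return "intermediate"

# Check the values around both boundaries.
for value in (10.0, 1000.0, 1000.1, 9999.9, 10000.0):
    print(value, classify_ic50(value))
```

With pandas, the same function could be applied to the whole column with `df4.standard_value.apply(classify_ic50)`.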

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL134277,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccsc1)Oc1ccc(...,100.0,active
1,CHEMBL415524,CCc1ccccc1/C=C1\Oc2ccc(F)cc2-c2ccc3c(c21)C(C)=...,77.0,active
2,CHEMBL336353,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1N(C)C)O...,320.0,active
3,CHEMBL413309,CC1=CC(C)(C)Nc2ccc3c(c21)/C(=C/c1ccccc1)Oc1c(F...,250.0,active
4,CHEMBL717,CC(=O)O[C@]1(C(C)=O)CC[C@H]2[C@@H]3C[C@H](C)C4...,10.0,active
...,...,...,...,...
1821,CHEMBL4476413,CN(C)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O)C...,33.6,active
1822,CHEMBL4468879,CC(C)C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C...,62.0,active
1823,CHEMBL4527872,CN(C)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O)C...,28.5,active
1824,CHEMBL4576596,CCN(CC)c1ccc([C@H]2C[C@@]3(C)[C@@H](CC[C@@]3(O...,13.9,active


Save the curated DataFrame to a CSV file.

In [None]:
df5.to_csv('glucocorticoid _03_bioactivity_data_curated.csv', index=False)

In [None]:
! zip glucocorticoid.zip *.csv

updating: glucocorticoid _01_bioactivity_data_raw.csv (deflated 92%)
updating: glucocorticoid _02_bioactivity_data_preprocessed.csv (deflated 84%)
updating: glucocorticoid _03_bioactivity_data_curated.csv (deflated 85%)


In [None]:
! ls -l

total 2436
-rw-r--r-- 1 root root 2158927 Feb  1 10:56 'glucocorticoid _01_bioactivity_data_raw.csv'
-rw-r--r-- 1 root root  155449 Feb  1 10:58 'glucocorticoid _02_bioactivity_data_preprocessed.csv'
-rw-r--r-- 1 root root  170273 Feb  1 10:58 'glucocorticoid _03_bioactivity_data_curated.csv'
drwxr-xr-x 1 root root    4096 Jan  7 14:33  sample_data
