# Notebook to Extract Drug-Target Interactions Based on ChEMBL Data

### Authors: Barbara Zdrazil, Lina Heinzke
### 12/2022

This notebook extracts data from ChEMBL to compile a data set of drug-target and clinical candidate-target associations, together with comparator compounds for the respective targets.

The notebook is based on initial work by Anne Hersey, Patricia Bento, Emma Manners, Paul Leeson, and Andrew Leach:  
*Target-Based Evaluation of “Drug-Like” Properties and Ligand Efficiencies  
Paul D. Leeson, A. Patricia Bento, Anna Gaulton, Anne Hersey, Emma J. Manners, Chris J. Radoux, and Andrew R. Leach  
J. Med. Chem. 2021, 64, 11, 7210–7230  
[DOI: 10.1021/acs.jmedchem.1c00416](https://doi.org/10.1021/acs.jmedchem.1c00416)*


More documentation on the initial data set compilation can be found here ("Ligand Efficiency"): https://www.ebi.ac.uk/seqdb/confluence/pages/viewpage.action?spaceKey=CHEMBL&title=Anne%27s+Notes


In [1]:
import pandas as pd
import sqlite3
import numpy as np
from tqdm import tqdm
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem import PandasTools
from rdkit.Chem.Scaffolds import MurckoScaffold

In [2]:
#### notebook settings
pd.options.display.max_rows = 100
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 100

# Get Data From ChEMBL

In [3]:
chembl_version = '26'
base_path = '/Users/heinzke/Documents/PhD/Projects/drug_target_dataset_curation/'
path_results = base_path+'results/'
path_sqlite3_database = base_path+'data/chembl_'+chembl_version+'/chembl_'+chembl_version+'_sqlite/chembl_'+chembl_version+'.db'
chembl_con = sqlite3.connect(path_sqlite3_database)

Initial query for activities and related assay, mutation, target, and docs information.

In [4]:
sql = '''
SELECT act.molregno as salt_molregno, act.pchembl_value, act.standard_type, 
    md2.molregno as parent_molregno, md2.chembl_id as parent_chemblid, md2.pref_name as parent_pref_name,
    md2.max_phase, md2.usan_year, md2.first_approval,
    md2.prodrug, md2.oral, md2.parenteral, md2.topical, md2.black_box_warning, 
    ass.assay_type, ass.tid, 
    vs.mutation,
    td.pref_name as target_pref_name, td.target_type, td.organism, td.chembl_id as target_chembl_id,
    docs.year, docs.journal
FROM activities act
LEFT JOIN molecule_hierarchy mh 
    ON act.molregno = mh.molregno
LEFT JOIN molecule_dictionary md2 
    ON mh.parent_molregno = md2.molregno 
INNER JOIN assays ass 
    ON  act.assay_id = ass.assay_id
LEFT JOIN variant_sequences vs
    ON ass.variant_id = vs.variant_id
INNER JOIN target_dictionary td
    ON ass.tid = td.tid
INNER JOIN docs
    ON act.doc_id = docs.doc_id
WHERE act.potential_duplicate = 0
    and act.standard_relation = '='
    and data_validity_comment is null
    and td.tid <>22226   -- exclude unchecked targets
    and td.target_type like '%PROTEIN%'
'''

df_mols = pd.read_sql_query(sql, con=chembl_con)
df_mols['tid_mutation'] = np.where(df_mols['mutation'].notnull(), 
                                   df_mols['tid'].astype('str')+'_'+df_mols['mutation'], 
                                   df_mols['tid'].astype('str'))
df_mols

Unnamed: 0,salt_molregno,pchembl_value,standard_type,parent_molregno,parent_chemblid,parent_pref_name,max_phase,usan_year,first_approval,prodrug,oral,parenteral,topical,black_box_warning,assay_type,tid,mutation,target_pref_name,target_type,organism,target_chembl_id,year,journal,tid_mutation
0,252199,5.40,IC50,252199,CHEMBL357278,,0,,,-1,0,0,0,0,B,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,2004.0,Bioorg. Med. Chem. Lett.,10483
1,253534,4.77,IC50,253534,CHEMBL357119,,0,,,-1,0,0,0,0,B,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,2004.0,Bioorg. Med. Chem. Lett.,10483
2,253199,6.75,IC50,253199,CHEMBL152968,,0,,,-1,0,0,0,0,B,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,2004.0,Bioorg. Med. Chem. Lett.,10483
3,253199,5.22,IC50,253199,CHEMBL152968,,0,,,-1,0,0,0,0,A,12594,,Cytochrome P450 1A2,SINGLE PROTEIN,Homo sapiens,CHEMBL3356,2004.0,Bioorg. Med. Chem. Lett.,12594
4,253199,4.43,IC50,253199,CHEMBL152968,,0,,,-1,0,0,0,0,A,17045,,Cytochrome P450 3A4,SINGLE PROTEIN,Homo sapiens,CHEMBL340,2004.0,Bioorg. Med. Chem. Lett.,17045
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3343627,2317951,7.89,Ki,2317951,CHEMBL4278500,,0,,,-1,0,0,0,0,B,134,,Vasopressin V1a receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL1889,2018.0,J Med Chem,134
3343628,2325859,,Bmax,2325859,CHEMBL4286411,,0,,,-1,0,0,0,0,B,11522,,Cholecystokinin B receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL298,2018.0,J Med Chem,11522
3343629,198115,,Bmax,198115,CHEMBL120632,TETRAGASTRIN,0,,,0,0,0,0,0,B,11522,,Cholecystokinin B receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL298,2018.0,J Med Chem,11522
3343630,2317531,,Bmax,2317531,CHEMBL4278080,,0,,,-1,0,0,0,0,B,11522,,Cholecystokinin B receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL298,2018.0,J Med Chem,11522


Query and calculate the first appearance of a compound in the literature based on ChEMBL data.

In [5]:
# first appearance of a compound in the literature 
# information about salts is aggregated in the parent
sql = '''
SELECT DISTINCT docs.year, mh.parent_molregno
FROM docs
LEFT JOIN compound_records cr
    ON docs.doc_id = cr.doc_id
INNER JOIN molecule_hierarchy mh 
    ON cr.molregno = mh.molregno   -- cr.molregno = salt_molregno
WHERE docs.year is not null
'''

df_docs = pd.read_sql_query(sql, con=chembl_con)
df_docs['first_publication_cpd'] = df_docs.groupby('parent_molregno')['year'].transform('min')
df_docs = df_docs[['parent_molregno', 'first_publication_cpd']].drop_duplicates()
df_docs

Unnamed: 0,parent_molregno,first_publication_cpd
0,4941,1974
1,921,1974
2,1005421,1976
3,1750777,1976
4,1750778,1976
...,...,...
1482279,2329285,2018
1482280,2317951,2018
1482281,2325859,2018
1482283,2317531,2018
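
The "first appearance" logic can be checked on a toy database. The sketch below builds an in-memory SQLite database with heavily reduced versions of the three tables used in the query (only the joined columns; all IDs and years are invented) and computes the earliest year per parent directly in SQL with `MIN ... GROUP BY`. Under these assumptions it should give the same result as the `groupby(...).transform('min')` step above.

```python
import sqlite3

# In-memory toy database; table and column names follow the query above,
# but the rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE docs (doc_id INTEGER, year INTEGER);
CREATE TABLE compound_records (doc_id INTEGER, molregno INTEGER);
CREATE TABLE molecule_hierarchy (molregno INTEGER, parent_molregno INTEGER);

INSERT INTO docs VALUES (1, 1990), (2, 1985), (3, NULL), (4, 2001);
-- molregno 11 is a salt form of parent 10; molregno 20 is its own parent
INSERT INTO compound_records VALUES (1, 10), (2, 11), (3, 10), (4, 20);
INSERT INTO molecule_hierarchy VALUES (10, 10), (11, 10), (20, 20);
""")

rows = con.execute("""
SELECT mh.parent_molregno, MIN(docs.year) AS first_publication_cpd
FROM docs
INNER JOIN compound_records cr ON docs.doc_id = cr.doc_id
INNER JOIN molecule_hierarchy mh ON cr.molregno = mh.molregno
WHERE docs.year IS NOT NULL
GROUP BY mh.parent_molregno
ORDER BY mh.parent_molregno
""").fetchall()
con.close()

# The salt (molregno 11) contributes its 1985 record to parent 10
first_publication = dict(rows)
print(first_publication)
```

The publication year of the salt form counts towards its parent, which is the aggregation the comment in the cell above refers to.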


In [6]:
df_mols = df_mols.merge(df_docs, on = 'parent_molregno', how='left')
df_mols

Unnamed: 0,salt_molregno,pchembl_value,standard_type,parent_molregno,parent_chemblid,parent_pref_name,max_phase,usan_year,first_approval,prodrug,oral,parenteral,topical,black_box_warning,assay_type,tid,mutation,target_pref_name,target_type,organism,target_chembl_id,year,journal,tid_mutation,first_publication_cpd
0,252199,5.40,IC50,252199,CHEMBL357278,,0,,,-1,0,0,0,0,B,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,2004.0,Bioorg. Med. Chem. Lett.,10483,2004.0
1,253534,4.77,IC50,253534,CHEMBL357119,,0,,,-1,0,0,0,0,B,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,2004.0,Bioorg. Med. Chem. Lett.,10483,2004.0
2,253199,6.75,IC50,253199,CHEMBL152968,,0,,,-1,0,0,0,0,B,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,2004.0,Bioorg. Med. Chem. Lett.,10483,2004.0
3,253199,5.22,IC50,253199,CHEMBL152968,,0,,,-1,0,0,0,0,A,12594,,Cytochrome P450 1A2,SINGLE PROTEIN,Homo sapiens,CHEMBL3356,2004.0,Bioorg. Med. Chem. Lett.,12594,2004.0
4,253199,4.43,IC50,253199,CHEMBL152968,,0,,,-1,0,0,0,0,A,17045,,Cytochrome P450 3A4,SINGLE PROTEIN,Homo sapiens,CHEMBL340,2004.0,Bioorg. Med. Chem. Lett.,17045,2004.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3343627,2317951,7.89,Ki,2317951,CHEMBL4278500,,0,,,-1,0,0,0,0,B,134,,Vasopressin V1a receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL1889,2018.0,J Med Chem,134,2018.0
3343628,2325859,,Bmax,2325859,CHEMBL4286411,,0,,,-1,0,0,0,0,B,11522,,Cholecystokinin B receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL298,2018.0,J Med Chem,11522,2018.0
3343629,198115,,Bmax,198115,CHEMBL120632,TETRAGASTRIN,0,,,0,0,0,0,0,B,11522,,Cholecystokinin B receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL298,2018.0,J Med Chem,11522,1987.0
3343630,2317531,,Bmax,2317531,CHEMBL4278080,,0,,,-1,0,0,0,0,B,11522,,Cholecystokinin B receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL298,2018.0,J Med Chem,11522,2018.0


Set correct types.

In [7]:
df_mols = df_mols.astype({
    'year': 'Int64',
    'usan_year': 'Int64',
    'first_approval': 'Int64',
    'first_publication_cpd': 'Int64'
})
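
The year columns appear as floats (e.g. `2004.0`) in the output above because missing values force a float dtype. A minimal sketch with invented values shows what the cast to the nullable `Int64` dtype does:

```python
import numpy as np
import pandas as pd

# Toy column: one missing year forces a float64 dtype (values are made up)
years = pd.Series([2004.0, np.nan, 2018.0], name="year")
print(years.dtype)

# The nullable integer dtype keeps whole years and marks gaps as <NA>
years_int = years.astype("Int64")
print(years_int.dtype)
print(years_int.tolist())
```

Note the capital `I`: the lowercase `int64` dtype cannot hold missing values, so that cast would fail on a column with gaps.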

In [8]:
# df_mols.to_csv(path_results+"ChEMBL"+chembl_version+"_initial_query.csv", sep = ';', index = False)

In [9]:
############### TESTING: method to save dataset size at any given point to array ###############
# list with sizes of the full dataset
all_lengths = []
# list with sizes of the dataset restricted to rows with pchembl values
all_lengths_pchembl = []

def calculate_dataset_sizes(data):
    now_mols = set(data["parent_molregno"]) 
    now_targets = set(data["tid_mutation"]) 
    now_pairs = set(data['parent_molregno_tid_mutation']) 
    
    if 'DTI' in data.columns:
        now_drugs = set(data[data["DTI"] == "D_DT"]["parent_molregno"]) 
        now_drug_targets = set(data[data["DTI"] == "D_DT"]["tid_mutation"]) 
        now_drug_pairs = set(data[data["DTI"] == "D_DT"]['parent_molregno_tid_mutation'])
    else: 
        now_drugs = set(data[data["max_phase"] == 4]["parent_molregno"]) 
        now_drug_targets = set(data[data["max_phase"] == 4]["tid_mutation"]) 
        now_drug_pairs = set(data[data["max_phase"] == 4]['parent_molregno_tid_mutation'])

    return [len(now_mols), len(now_drugs), len(now_targets), len(now_drug_targets), len(now_pairs), len(now_drug_pairs)]

def add_dataset_sizes(data, label, output=False):
    data_test = data.copy()
    data_test['parent_molregno_tid_mutation'] = data_test.agg('{0[parent_molregno]}_{0[tid_mutation]}'.format, axis=1)
    
    all_lengths.append([label] + calculate_dataset_sizes(data_test))
    
    # only data with pchembl value
    if 'pchembl_value' in data_test.columns:
        data_pchembl = data_test[~data_test['pchembl_value'].isnull()]
    else:
        data_pchembl = data_test[~data_test['pchembl_value_mean'].isnull()]
    all_lengths_pchembl.append([label] + calculate_dataset_sizes(data_pchembl))

In [10]:
############### TESTING: initial query ###############
add_dataset_sizes(df_mols, "init", True)

# Calculate Mean, Median, and Max *pchembl* Values for Each Compound-Target Pair

The following values are set to summarise the information for compound-target pairs:  

|||
| :----------- | :----------- |
| *pchembl_value_mean* | mean pchembl value for a compound-target pair |
| *pchembl_value_max* | maximum pchembl value for a compound-target pair |
| *pchembl_value_median* | median pchembl value for a compound-target pair |
| *first_publication_target_cpd_pair* | first publication in ChEMBL with this compound-target pair |
| *first_publication_target_cpd_pair_w_pchembl* | first publication in ChEMBL with this compound-target pair and an associated pchembl value |

The values are calculated once for a table with binding and functional assay data and once for a table with only binding assay data. The two tables are then combined into one table for further handling, in which they are distinguished by the column *only_binding* (binding and functional assay data = False; only binding data = True).

In [11]:
# summarise the information for binding and functional assays
df_mols_all = df_mols[(df_mols['assay_type'] == 'B') | (df_mols['assay_type'] == 'F')].copy()
df_mols_all['pchembl_value_mean'] = df_mols_all.groupby(['parent_molregno', 'tid_mutation'])['pchembl_value'].transform('mean')
df_mols_all['pchembl_value_max'] = df_mols_all.groupby(['parent_molregno', 'tid_mutation'])['pchembl_value'].transform('max')
df_mols_all['pchembl_value_median'] = df_mols_all.groupby(['parent_molregno', 'tid_mutation'])['pchembl_value'].transform('median')
df_mols_all['first_publication_target_cpd_pair'] = df_mols_all.groupby(['parent_molregno', 'tid_mutation'])['year'].transform('min')
df_mols_all_first_publication_pchembl = df_mols_all[~df_mols_all['pchembl_value'].isnull()].groupby(['parent_molregno', 'tid_mutation'])['year'].min().reset_index().rename(columns={"year": "first_publication_target_cpd_pair_w_pchembl"})
df_mols_all = df_mols_all.merge(df_mols_all_first_publication_pchembl, on=['parent_molregno', 'tid_mutation'], how='left')

In [12]:
# summarise the information for only binding assays
df_mols_binding = df_mols[df_mols['assay_type'] == 'B'].copy()
df_mols_binding['pchembl_value_mean'] = df_mols_binding.groupby(['parent_molregno', 'tid_mutation'])['pchembl_value'].transform('mean')
df_mols_binding['pchembl_value_max'] = df_mols_binding.groupby(['parent_molregno', 'tid_mutation'])['pchembl_value'].transform('max')
df_mols_binding['pchembl_value_median'] = df_mols_binding.groupby(['parent_molregno', 'tid_mutation'])['pchembl_value'].transform('median')
df_mols_binding['first_publication_target_cpd_pair'] = df_mols_binding.groupby(['parent_molregno', 'tid_mutation'])['year'].transform('min')
df_mols_binding_first_publication_pchembl = df_mols_binding[~df_mols_binding['pchembl_value'].isnull()].groupby(['parent_molregno', 'tid_mutation'])['year'].min().reset_index().rename(columns={"year": "first_publication_target_cpd_pair_w_pchembl"})
df_mols_binding = df_mols_binding.merge(df_mols_binding_first_publication_pchembl, on=['parent_molregno', 'tid_mutation'], how='left')
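
The two cells above apply the same steps to two different subsets. As a sketch only, the shared logic could be wrapped in a helper; `summarise_pairs` and `PAIR` are hypothetical names that do not appear in the notebook, and the toy rows are invented.

```python
import pandas as pd

PAIR = ["parent_molregno", "tid_mutation"]

def summarise_pairs(df):
    """Add per-pair pchembl statistics and first-publication years."""
    out = df.copy()
    for stat in ("mean", "max", "median"):
        out[f"pchembl_value_{stat}"] = (
            out.groupby(PAIR)["pchembl_value"].transform(stat)
        )
    out["first_publication_target_cpd_pair"] = (
        out.groupby(PAIR)["year"].transform("min")
    )
    # earliest year among rows that actually carry a pchembl value
    first_w_pchembl = (
        out[out["pchembl_value"].notnull()]
        .groupby(PAIR)["year"].min()
        .rename("first_publication_target_cpd_pair_w_pchembl")
        .reset_index()
    )
    return out.merge(first_w_pchembl, on=PAIR, how="left")

# Invented toy activities: three measurements for pair (1, "10"), one for (2, "10")
toy = pd.DataFrame({
    "parent_molregno": [1, 1, 1, 2],
    "tid_mutation": ["10", "10", "10", "10"],
    "pchembl_value": [6.0, 8.0, None, None],
    "year": [2005, 2003, 2001, 2010],
    "assay_type": ["B", "F", "B", "B"],
})
summary_all = summarise_pairs(toy[toy["assay_type"].isin(["B", "F"])])
summary_binding = summarise_pairs(toy[toy["assay_type"] == "B"])
print(summary_all)
print(summary_binding)
```

In the toy data, the earliest record for pair (1, "10") is from 2001 but has no pchembl value, so the two first-publication columns differ for that pair.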

In [13]:
# combine table based on all assay data (only_binding = False)
# and based on only binding assays (only_binding = True)
df_mols_all['only_binding'] = False
df_mols_binding['only_binding'] = True
df_combined = pd.concat([df_mols_all, df_mols_binding])
# drop all salt related information (information based on parent wanted)
# as well as the underlying information for the aggregated values
df_combined = df_combined.drop(columns=['salt_molregno', 
                                        'year', 'journal', 'pchembl_value', 'standard_type', 'assay_type']).drop_duplicates()
df_combined

Unnamed: 0,parent_molregno,parent_chemblid,parent_pref_name,max_phase,usan_year,first_approval,prodrug,oral,parenteral,topical,black_box_warning,tid,mutation,target_pref_name,target_type,organism,target_chembl_id,tid_mutation,first_publication_cpd,pchembl_value_mean,pchembl_value_max,pchembl_value_median,first_publication_target_cpd_pair,first_publication_target_cpd_pair_w_pchembl,only_binding
0,252199,CHEMBL357278,,0,,,-1,0,0,0,0,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,10483,2004,5.400,5.40,5.400,2004,2004,False
1,253534,CHEMBL357119,,0,,,-1,0,0,0,0,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,10483,2004,4.770,4.77,4.770,2004,2004,False
2,253199,CHEMBL152968,,0,,,-1,0,0,0,0,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,10483,2004,6.750,6.75,6.750,2004,2004,False
3,933,CHEMBL268439,,0,,,-1,0,0,0,0,10989,,Carbonic anhydrase XIII,SINGLE PROTEIN,Mus musculus,CHEMBL2186,10989,1999,8.700,8.70,8.700,2004,2004,False
4,606480,CHEMBL608018,,0,,,-1,0,0,0,0,105567,,Adenosine A1 receptor,SINGLE PROTEIN,Cavia porcellus,CHEMBL2304404,105567,2004,,,,2004,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1886001,2326437,CHEMBL4286989,,0,,,-1,0,0,0,0,134,,Vasopressin V1a receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL1889,134,2018,5.890,5.89,5.890,2018,2018,True
1886002,2321411,CHEMBL4281963,,0,,,-1,0,0,0,0,134,,Vasopressin V1a receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL1889,134,2018,9.400,9.70,9.400,2018,2018,True
1886006,2322887,CHEMBL4283439,,0,,,-1,0,0,0,0,134,,Vasopressin V1a receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL1889,134,2018,6.025,6.52,6.025,2018,2018,True
1886012,2325859,CHEMBL4286411,,0,,,-1,0,0,0,0,11522,,Cholecystokinin B receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL298,11522,2018,7.875,8.41,7.875,2018,2018,True


Only drugs or clinical candidates are allowed to have no pchembl value. Comparator compounds (max_phase = 0) without a pchembl value are discarded.

In [14]:
# Keep only compounds with pchembl_value or clinical compounds / drugs
# I.e. clinical compounds / drugs are not required to have a pchembl value
df_combined = df_combined[(~df_combined['pchembl_value_mean'].isnull()) | (df_combined['max_phase'] > 0)].copy()
df_combined

Unnamed: 0,parent_molregno,parent_chemblid,parent_pref_name,max_phase,usan_year,first_approval,prodrug,oral,parenteral,topical,black_box_warning,tid,mutation,target_pref_name,target_type,organism,target_chembl_id,tid_mutation,first_publication_cpd,pchembl_value_mean,pchembl_value_max,pchembl_value_median,first_publication_target_cpd_pair,first_publication_target_cpd_pair_w_pchembl,only_binding
0,252199,CHEMBL357278,,0,,,-1,0,0,0,0,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,10483,2004,5.400,5.40,5.400,2004,2004,False
1,253534,CHEMBL357119,,0,,,-1,0,0,0,0,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,10483,2004,4.770,4.77,4.770,2004,2004,False
2,253199,CHEMBL152968,,0,,,-1,0,0,0,0,10483,,Palmitoyl-CoA oxidase,SINGLE PROTEIN,Rattus norvegicus,CHEMBL4632,10483,2004,6.750,6.75,6.750,2004,2004,False
3,933,CHEMBL268439,,0,,,-1,0,0,0,0,10989,,Carbonic anhydrase XIII,SINGLE PROTEIN,Mus musculus,CHEMBL2186,10989,1999,8.700,8.70,8.700,2004,2004,False
9,82960,CHEMBL54530,,0,,,-1,0,0,0,0,11643,,DNA topoisomerase III,SINGLE PROTEIN,Bacillus subtilis (strain 168),CHEMBL4320,11643,1980,4.720,4.72,4.720,1984,1984,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1886001,2326437,CHEMBL4286989,,0,,,-1,0,0,0,0,134,,Vasopressin V1a receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL1889,134,2018,5.890,5.89,5.890,2018,2018,True
1886002,2321411,CHEMBL4281963,,0,,,-1,0,0,0,0,134,,Vasopressin V1a receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL1889,134,2018,9.400,9.70,9.400,2018,2018,True
1886006,2322887,CHEMBL4283439,,0,,,-1,0,0,0,0,134,,Vasopressin V1a receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL1889,134,2018,6.025,6.52,6.025,2018,2018,True
1886012,2325859,CHEMBL4286411,,0,,,-1,0,0,0,0,11522,,Cholecystokinin B receptor,SINGLE PROTEIN,Homo sapiens,CHEMBL298,11522,2018,7.875,8.41,7.875,2018,2018,True
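
The effect of this filter can be seen on a few invented rows. The sketch below applies the same condition to a toy frame; only the comparator compound without a pchembl value is dropped.

```python
import pandas as pd

# Invented rows: two comparators (max_phase 0), one drug, one clinical candidate
toy = pd.DataFrame({
    "parent_molregno": [1, 2, 3, 4],
    "max_phase": [0, 0, 4, 2],
    "pchembl_value_mean": [6.5, None, None, 7.1],
})

# keep rows with a pchembl value OR with a clinical phase above 0
keep = toy["pchembl_value_mean"].notnull() | (toy["max_phase"] > 0)
filtered = toy[keep].copy()
print(filtered)
```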


# Extract Drug-Target Interactions With Disease Relevance From the drug_mechanism Table

Extract the known drug-target interactions from ChEMBL. These are used to determine whether the compound-target pairs from the activities query above are known drug-target interactions. Only entries with a disease_efficacy of 1 are taken into account, i.e., the target is believed to play a role in the efficacy of the drug.

In [15]:
# disease_efficacy NUMBER
# Flag to show whether the target assigned is believed to play a role in the efficacy of the drug in the indication(s)
# for which it is approved (1 = yes, 0 = no)
sql = '''
SELECT DISTINCT mh.parent_molregno, dm.tid
FROM drug_mechanism dm
INNER JOIN molecule_hierarchy mh
    ON dm.molregno = mh.molregno
INNER JOIN molecule_dictionary md
    ON mh.parent_molregno = md.molregno
WHERE dm.disease_efficacy = 1
    and dm.tid is not null
'''

df_dti = pd.read_sql_query(sql, con=chembl_con)
df_dti

Unnamed: 0,parent_molregno,tid
0,1124,11060
1,675068,10193
2,1125,10193
3,1085,10193
4,1124,10193
...,...,...
4530,1304559,101019
4531,1304559,100417
4532,2336099,11540
4533,2146132,100097


Query the target_relations table for related target ids to increase the number of target ids that are covered by data from the drug_mechanism table.
The following mappings are considered:

||||
|:------|:-----:|-----|
|protein family |-[superset of]->| single protein|
|protein complex |-[superset of]->| single protein|
|protein complex group |-[superset of]->| single protein|
|single protein |-[equivalent to]->| single protein|
|chimeric protein |-[superset of]->| single protein|
|protein-protein interaction |-[superset of]->| single protein|

In [16]:
sql = '''
SELECT tr.tid, tr.relationship, tr.related_tid, 
    td1.pref_name as pref_name_1, td1.target_type as target_type_1, td1.organism as organism_1, 
    td2.pref_name as pref_name_2, td2.target_type as target_type_2, td2.organism as organism_2 
FROM target_relations tr
INNER JOIN target_dictionary td1
    ON tr.tid = td1.tid
INNER JOIN target_dictionary td2
    ON tr.related_tid = td2.tid
'''

df_related_targets = pd.read_sql_query(sql, con=chembl_con)
df_related_targets.head()

Unnamed: 0,tid,relationship,related_tid,pref_name_1,target_type_1,organism_1,pref_name_2,target_type_2,organism_2
0,10193,SUBSET OF,104764,Carbonic anhydrase I,SINGLE PROTEIN,Homo sapiens,Carbonic anhydrase,PROTEIN FAMILY,Homo sapiens
1,12071,SUBSET OF,109746,Cyclin-dependent kinase 1,SINGLE PROTEIN,Homo sapiens,Cyclin-dependent kinase,PROTEIN FAMILY,Homo sapiens
2,12071,SUBSET OF,104709,Cyclin-dependent kinase 1,SINGLE PROTEIN,Homo sapiens,Cyclin-dependent kinase 1/cyclin B,PROTEIN COMPLEX,Homo sapiens
3,12071,SUBSET OF,107893,Cyclin-dependent kinase 1,SINGLE PROTEIN,Homo sapiens,CDK1/Cyclin A,PROTEIN COMPLEX,Homo sapiens
4,12071,SUBSET OF,117095,Cyclin-dependent kinase 1,SINGLE PROTEIN,Homo sapiens,Cyclin-dependent kinase 1/G1/S-specific cyclin-D1,PROTEIN COMPLEX,Homo sapiens


In [17]:
protein_family_mapping = df_related_targets[(df_related_targets["target_type_1"] == "PROTEIN FAMILY") 
                    & (df_related_targets["target_type_2"] == "SINGLE PROTEIN")
                    & (df_related_targets["relationship"] == "SUPERSET OF")]

protein_complex_mapping = df_related_targets[(df_related_targets["target_type_1"] == "PROTEIN COMPLEX") 
                    & (df_related_targets["target_type_2"] == "SINGLE PROTEIN")
                    & (df_related_targets["relationship"] == "SUPERSET OF")]

protein_complex_group_mapping = df_related_targets[(df_related_targets["target_type_1"] == "PROTEIN COMPLEX GROUP") 
                    & (df_related_targets["target_type_2"] == "SINGLE PROTEIN")
                    & (df_related_targets["relationship"] == "SUPERSET OF")]

single_protein_mapping = df_related_targets[(df_related_targets["target_type_1"] == "SINGLE PROTEIN") 
                    & (df_related_targets["target_type_2"] == "SINGLE PROTEIN")
                    & (df_related_targets["relationship"] == "EQUIVALENT TO")]

chimeric_protein_mapping = df_related_targets[(df_related_targets["target_type_1"] == "CHIMERIC PROTEIN") 
                    & (df_related_targets["target_type_2"] == "SINGLE PROTEIN")
                    & (df_related_targets["relationship"] == "SUPERSET OF")]

ppi_mapping = df_related_targets[(df_related_targets["target_type_1"] == "PROTEIN-PROTEIN INTERACTION") 
                    & (df_related_targets["target_type_2"] == "SINGLE PROTEIN")
                    & (df_related_targets["relationship"] == "SUPERSET OF")]

relevant_mappings = pd.concat([protein_family_mapping, 
                               protein_complex_mapping, 
                               protein_complex_group_mapping,
                               single_protein_mapping, 
                               chimeric_protein_mapping, 
                               ppi_mapping])
relevant_mappings.head()

Unnamed: 0,tid,relationship,related_tid,pref_name_1,target_type_1,organism_1,pref_name_2,target_type_2,organism_2
234,104303,SUPERSET OF,10688,Muscarinic acetylcholine receptor,PROTEIN FAMILY,Rattus norvegicus,Muscarinic acetylcholine receptor M5,SINGLE PROTEIN,Rattus norvegicus
236,104303,SUPERSET OF,12485,Muscarinic acetylcholine receptor,PROTEIN FAMILY,Rattus norvegicus,Muscarinic acetylcholine receptor M4,SINGLE PROTEIN,Rattus norvegicus
237,104303,SUPERSET OF,12566,Muscarinic acetylcholine receptor,PROTEIN FAMILY,Rattus norvegicus,Muscarinic acetylcholine receptor M3,SINGLE PROTEIN,Rattus norvegicus
239,104303,SUPERSET OF,10647,Muscarinic acetylcholine receptor,PROTEIN FAMILY,Rattus norvegicus,Muscarinic acetylcholine receptor M1,SINGLE PROTEIN,Rattus norvegicus
240,104303,SUPERSET OF,12076,Muscarinic acetylcholine receptor,PROTEIN FAMILY,Rattus norvegicus,Muscarinic acetylcholine receptor M2,SINGLE PROTEIN,Rattus norvegicus
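
The six filters above differ only in the source target type and the relationship. As an alternative sketch, the mapping table can be encoded once as a set of allowed combinations; `ALLOWED` and `select_relevant_mappings` are hypothetical names, and the toy rows only partly reuse IDs from the output above.

```python
import pandas as pd

# (target_type_1, relationship) combinations from the mapping table;
# target_type_2 is SINGLE PROTEIN in every case
ALLOWED = {
    ("PROTEIN FAMILY", "SUPERSET OF"),
    ("PROTEIN COMPLEX", "SUPERSET OF"),
    ("PROTEIN COMPLEX GROUP", "SUPERSET OF"),
    ("SINGLE PROTEIN", "EQUIVALENT TO"),
    ("CHIMERIC PROTEIN", "SUPERSET OF"),
    ("PROTEIN-PROTEIN INTERACTION", "SUPERSET OF"),
}

def select_relevant_mappings(df_related):
    """Keep rows whose (target_type_1, relationship) pair is allowed."""
    pairs = zip(df_related["target_type_1"], df_related["relationship"])
    is_allowed = pd.Series(
        [pair in ALLOWED for pair in pairs], index=df_related.index, dtype=bool
    )
    to_single = df_related["target_type_2"] == "SINGLE PROTEIN"
    return df_related[is_allowed & to_single]

# Toy relations: rows 0 and 3 should be rejected, rows 1 and 2 kept
toy_relations = pd.DataFrame({
    "tid": [10193, 104303, 1, 3],
    "relationship": ["SUBSET OF", "SUPERSET OF", "EQUIVALENT TO", "SUPERSET OF"],
    "related_tid": [104764, 10688, 2, 4],
    "target_type_1": ["SINGLE PROTEIN", "PROTEIN FAMILY",
                      "SINGLE PROTEIN", "PROTEIN FAMILY"],
    "target_type_2": ["PROTEIN FAMILY", "SINGLE PROTEIN",
                      "SINGLE PROTEIN", "PROTEIN FAMILY"],
})
relevant = select_relevant_mappings(toy_relations)
print(relevant[["tid", "relationship", "related_tid"]])
```

Unlike the `pd.concat` version, this keeps the rows in their original order instead of grouping them by target type, which should not matter for the merge that follows.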


Combine the drug-target-interactions (DTI) and target ids (dti_tids) from the drug mechanism table with the information based on the mapped target ids.

In [18]:
# drug-target-interactions (DTI) and target ids (dti_tids) based on drug_mechanism table
DTIs_original = set(df_dti.agg('{0[parent_molregno]}_{0[tid]}'.format, axis=1))
dti_tids_original = set(df_dti['tid'])

# drug-target-interactions (DTI) and target ids (dti_tids) based on mapped target ids
df_dti_add_targets = df_dti.merge(relevant_mappings, on = 'tid', how = 'inner')
DTIs_mapped = set(df_dti_add_targets.agg('{0[parent_molregno]}_{0[related_tid]}'.format, axis=1))
dti_tids_mapped = set(df_dti_add_targets['related_tid'].astype("int"))

# combined drug-target-interactions (DTI) and target ids (dti_tids) based on drug_mechanism table and mapped target ids
DTIs_set = DTIs_original.union(DTIs_mapped)
dti_tids_set = dti_tids_original.union(dti_tids_mapped)
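
To illustrate what the mapping adds, the following plain-Python sketch repeats the set construction on invented IDs: a mechanism annotated on a protein family is propagated to the single proteins the family contains.

```python
# Toy data (invented IDs): drug 1 acts on protein family 500,
# drug 2 acts on single protein 30
dti_pairs = {(1, 500), (2, 30)}
# family 500 is a SUPERSET OF the single proteins 10 and 20
mappings = {500: [10, 20]}

# pairs and targets taken directly from the mechanism annotations
dtis_original = {f"{mol}_{tid}" for mol, tid in dti_pairs}
tids_original = {tid for _, tid in dti_pairs}

# pairs and targets gained through the target mappings
dtis_mapped = {
    f"{mol}_{related}"
    for mol, tid in dti_pairs
    for related in mappings.get(tid, [])
}
tids_mapped = {
    related for _, tid in dti_pairs for related in mappings.get(tid, [])
}

dtis_set = dtis_original | dtis_mapped
dti_tids_set = tids_original | tids_mapped
print(sorted(dtis_set))
print(sorted(dti_tids_set))
```

Drug 1 is now also linked to the single proteins 10 and 20, so activity data measured on those proteins can be matched to the family-level annotation.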

# DTI (Drug-Target Interaction) Classification

Every compound-target pair is assigned a DTI (drug-target interaction) annotation.  

The assignment is based on three questions:
- Is the compound-target pair in the drug_mechanism table, i.e., is it a known compound-target interaction?
- What is the max_phase of the compound, i.e., is it a drug or a clinical compound?
- Is the target in the drug_mechanism table, i.e., is it a therapeutic target?

The assignments are based on the following table:

|pair in drug_mechanism table?|max_phase?|therapeutic target?|DTI annotation|explanation|
|:-----:|:-----:|:-----:|:-----:|:-----|
|yes|4|-|D_DT|drug - drug target|
|yes|3|-|C3_DT|clinical candidate in phase 3 - drug target|
|yes|2|-|C2_DT|clinical candidate in phase 2 - drug target|
|yes|1|-|C1_DT|clinical candidate in phase 1 - drug target|
|yes|0|-|C0_DT|clinical candidate in phase 0 - drug target|
|no|-|yes|DT|drug target|
|no|-|no|NDT|not drug target|
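
The decision table above can be read as a small function. The sketch below is only an illustration of the rules for a single pair, not the vectorised implementation used in the notebook; `classify_dti` is a hypothetical name, and integer max_phase values from 0 to 4 are assumed.

```python
# Annotation per clinical phase for pairs found in the drug_mechanism table
PHASE_LABELS = {4: "D_DT", 3: "C3_DT", 2: "C2_DT", 1: "C1_DT", 0: "C0_DT"}

def classify_dti(in_drug_mechanism, max_phase, therapeutic_target):
    """Return the DTI annotation for one compound-target pair."""
    if in_drug_mechanism:
        # known interaction: the label depends only on the clinical phase
        return PHASE_LABELS[max_phase]
    # unknown interaction: the label depends only on the target
    return "DT" if therapeutic_target else "NDT"

print(classify_dti(True, 4, True))    # approved drug on its known target
print(classify_dti(False, 0, True))   # comparator compound on a drug target
print(classify_dti(False, 0, False))  # pair that is discarded later
```

For pairs that are not in the drug_mechanism table, max_phase is ignored, so an approved drug measured against a target other than its annotated one is labelled `DT` or `NDT`.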




Identify which targets are drug targets (i.e., are they in the drug_mechanism table?) and add the field *therapeutic_target*, which indicates whether the target is a known drug target.

In [19]:
df_combined['therapeutic_target'] = df_combined['tid'].isin(dti_tids_set)

Add a field *cpd_target_pair* that reflects the compound-target association. 

In [20]:
df_combined['cpd_target_pair'] = df_combined.agg('{0[parent_molregno]}_{0[tid]}'.format, axis=1)

Assign the annotations based on the table above.

In [21]:
df_combined.loc[(df_combined['cpd_target_pair'].isin(DTIs_set) & (df_combined['max_phase'] == 4)), 'DTI'] = "D_DT"
df_combined.loc[(df_combined['cpd_target_pair'].isin(DTIs_set) & (df_combined['max_phase'] == 3)), 'DTI'] = "C3_DT"
df_combined.loc[(df_combined['cpd_target_pair'].isin(DTIs_set) & (df_combined['max_phase'] == 2)), 'DTI'] = "C2_DT"
df_combined.loc[(df_combined['cpd_target_pair'].isin(DTIs_set) & (df_combined['max_phase'] == 1)), 'DTI'] = "C1_DT"
df_combined.loc[(df_combined['cpd_target_pair'].isin(DTIs_set) & (df_combined['max_phase'] == 0)), 'DTI'] = "C0_DT"
df_combined.loc[((~df_combined['cpd_target_pair'].isin(DTIs_set)) 
                 & (df_combined['therapeutic_target'] == True)), 'DTI'] = "DT"
# if target is not a therapeutic target, 'cpd_target_pair' cannot be in DTIs_set
# (~df_combined['cpd_target_pair'].isin(DTIs_set)) is included for clarity
df_combined.loc[((~df_combined['cpd_target_pair'].isin(DTIs_set)) 
                 & (df_combined['therapeutic_target'] == False)), 'DTI'] = "NDT"

In [22]:
############### TESTING: before discarding NDT rows ###############
add_dataset_sizes(df_combined, "pre DTI")

Discard rows that were annotated with NDT, i.e., compound-target pairs that are not in the drug_mechanism table and whose target is not in the drug_mechanism table either (the compound is not a comparator compound).

In [23]:
# discard NDT rows
df_combined = df_combined[(df_combined['DTI'].isin(['D_DT', 'C3_DT', 'C2_DT', 'C1_DT', 'C0_DT', 'DT']))]

In [24]:
############### TESTING: after discarding NDT rows ###############
add_dataset_sizes(df_combined, "post DTI")

# Add Compound Properties

Add compound properties and structures based on the compound_properties table and the compound_structures table. 

In [25]:
sql = '''
SELECT DISTINCT mh.parent_molregno, 
    cp.mw_freebase, cp.alogp, cp.hba, cp.hbd, cp.psa, cp.rtb, cp.ro3_pass, cp.num_ro5_violations, 
    cp.cx_most_apka, cp.cx_most_bpka, cp.cx_logp, cp.cx_logd, cp.molecular_species, cp.full_mwt, 
    cp.aromatic_rings, cp.heavy_atoms, cp.qed_weighted, cp.mw_monoisotopic, cp.full_molformula, 
    cp.hba_lipinski, cp.hbd_lipinski, cp.num_lipinski_ro5_violations, 
    struct.standard_inchi, struct.standard_inchi_key, struct.canonical_smiles
FROM compound_properties cp
INNER JOIN molecule_hierarchy mh
    ON cp.molregno = mh.parent_molregno
INNER JOIN compound_structures struct
    ON mh.parent_molregno = struct.molregno
'''

df_cpd_props = pd.read_sql_query(sql, con=chembl_con)
df_cpd_props.head()

Unnamed: 0,parent_molregno,mw_freebase,alogp,hba,hbd,psa,rtb,ro3_pass,num_ro5_violations,cx_most_apka,cx_most_bpka,cx_logp,cx_logd,molecular_species,full_mwt,aromatic_rings,heavy_atoms,qed_weighted,mw_monoisotopic,full_molformula,hba_lipinski,hbd_lipinski,num_lipinski_ro5_violations,standard_inchi,standard_inchi_key,canonical_smiles
0,477782,506.37,3.04,8.0,2.0,116.43,8.0,N,1.0,,6.5,2.16,2.11,NEUTRAL,506.37,2.0,27.0,0.53,506.0485,C17H23IN4O4S,8.0,3.0,1.0,"InChI=1S/C17H23IN4O4S/c1-10(2)11-7-14(25-3)12(18)8-13(11)26-15-9-21-17(22-16(15)19)20-5-6-27(4,2...",AAAAEENPAALFRN-UHFFFAOYSA-N,COc1cc(C(C)C)c(Oc2cnc(NCCS(C)(=O)=O)nc2N)cc1I
1,2237474,927.28,7.03,11.0,7.0,252.91,41.0,N,4.0,4.13,,8.43,5.36,ACID,927.28,0.0,65.0,0.02,926.6555,C49H90N4O12,16.0,8.0,4.0,InChI=1S/C49H90N4O12/c1-5-8-10-12-14-16-18-20-22-24-26-28-30-37(31-29-27-25-23-21-19-17-15-13-11...,AAAAJHGLNDAXFP-VNKVACROSA-N,CCCCCCCCCCCCCCC(CCCCCCCCCCCCCC)C(=O)OC[C@H]1OC(O)[C@H](NC(C)=O)[C@@H](OCC(=O)N[C@@H](CC)C(=O)N[C...
2,412019,271.32,1.72,2.0,2.0,65.2,1.0,N,0.0,13.43,,0.77,0.77,NEUTRAL,271.32,2.0,20.0,0.83,271.1321,C15H17N3O2,5.0,2.0,0.0,"InChI=1S/C15H17N3O2/c1-8-7-16-14(19)13-12(8)10-6-9(15(20)18(2)3)4-5-11(10)17-13/h4-6,8,17H,7H2,1...",AAAAKTROWFNLEP-UHFFFAOYSA-N,CC1CNC(=O)c2[nH]c3ccc(C(=O)N(C)C)cc3c21
3,26284,323.35,2.13,4.0,1.0,71.53,3.0,N,0.0,,4.73,1.13,1.13,NEUTRAL,323.35,2.0,24.0,0.94,323.127,C18H17N3O3,6.0,1.0,0.0,InChI=1S/C18H17N3O3/c1-11(22)20-10-17-16-8-14-7-12(13-3-2-6-19-9-13)4-5-15(14)21(16)18(23)24-17/...,AAAATQFUBIBQIS-IRXDYDNUSA-N,CC(=O)NC[C@@H]1OC(=O)N2c3ccc(-c4cccnc4)cc3C[C@@H]12
4,299040,613.14,7.15,9.0,2.0,112.4,11.0,N,2.0,12.55,8.81,5.82,4.4,BASE,613.14,5.0,43.0,0.15,612.171,C32H29ClN6O3S,9.0,2.0,2.0,InChI=1S/C32H29ClN6O3S/c1-4-41-28-16-25-22(15-26(28)37-30(40)10-7-13-39(2)3)32(20(17-34)18-35-25...,AAAAZQPHATYWOK-JXMROGBWSA-N,CCOc1cc2ncc(C#N)c(Nc3ccc(OCc4nc5ccccc5s4)c(Cl)c3)c2cc1NC(=O)/C=C/CN(C)C


Combine the initial query with the compound properties.

In [26]:
df_combined = df_combined.merge(df_cpd_props, on = 'parent_molregno', how = 'inner')

In [27]:
############### TESTING: compound props ###############
add_dataset_sizes(df_combined, "cpd props")

# Calculate Ligand Efficiency (LE) Metrics

Calculate the ligand efficiency metrics for the compounds based on the mean pchembl value of each compound-target pair, using the following ligand efficiency (LE) formulas:

$\text{LE} = \frac{\Delta\text{G}}{\text{HA}}$
where $\Delta\text{G} = -RT \ln(K_d)$, $-RT \ln(K_i)$, or $-RT \ln(IC_{50})$ and HA is the number of heavy atoms.

With $T = 298$ K, $R = 0.00199$ kcal/(mol·K), and $\ln(10) \approx 2.303$, this becomes:

$\text{LE}=\frac{2.303 \cdot 298 \cdot 0.00199 \cdot \text{pchembl_value_mean}} {\text{heavy_atoms}}$


$\text{BEI}=\frac{\text{pchembl_value_mean} \cdot 1000} {\text{mw_freebase}}$

$\text{SEI}=\frac{\text{pchembl_value_mean} \cdot 100} {\text{psa}}$

$\text{LLE}=\text{pchembl_value_mean}-\text{alogp}$

In [28]:
df_combined['LE'] = df_combined['pchembl_value_mean']/df_combined['heavy_atoms']*(2.303*298*0.00199)
df_combined['BEI'] = df_combined['pchembl_value_mean']*1000/df_combined["mw_freebase"]
df_combined['SEI'] = df_combined['pchembl_value_mean']*100/df_combined["psa"]
df_combined['LLE'] = df_combined['pchembl_value_mean']-df_combined["alogp"]
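
As a worked example of the formulas above, the sketch below computes the four metrics for a single compound. The property values are taken from the first row of the compound properties output above (parent_molregno 477782); the pchembl value of 7.0 is an assumption chosen for illustration, and `efficiency_metrics` is a hypothetical helper name.

```python
R_KCAL = 0.00199   # gas constant in kcal/(mol*K), as used in the notebook
T_KELVIN = 298     # temperature in K, as used in the notebook
LN10 = 2.303       # rounded ln(10), as used in the notebook

def efficiency_metrics(pchembl, heavy_atoms, mw_freebase, psa, alogp):
    """Return LE, BEI, SEI, and LLE for one compound-target pair."""
    return {
        "LE": pchembl / heavy_atoms * (LN10 * T_KELVIN * R_KCAL),
        "BEI": pchembl * 1000 / mw_freebase,
        "SEI": pchembl * 100 / psa,
        "LLE": pchembl - alogp,
    }

metrics = efficiency_metrics(
    pchembl=7.0,        # assumed mean pchembl value (illustration only)
    heavy_atoms=27,
    mw_freebase=506.37,
    psa=116.43,
    alogp=3.04,
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For these inputs the metrics come out at roughly LE 0.35, BEI 13.82, SEI 6.01, and LLE 3.96.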

# Add Compound Descriptors

Add relevant compound descriptors using built-in RDKit methods. 

In [29]:
# # add a column with RDKit molecules, used to calculate the descriptors
# PandasTools.AddMoleculeColumnToFrame(df_combined,'canonical_smiles','mol',includeFingerprints=False)

# df_combined.loc[:,'fraction_csp3'] = df_combined['mol'].apply(Descriptors.FractionCSP3)
# df_combined.loc[:,'num_aliphatic_carbocycles'] = df_combined['mol'].apply(Descriptors.NumAliphaticCarbocycles)
# df_combined.loc[:,'num_aliphatic_heterocycles'] = df_combined['mol'].apply(Descriptors.NumAliphaticHeterocycles)
# df_combined.loc[:,'num_aliphatic_rings'] = df_combined['mol'].apply(Descriptors.NumAliphaticRings)
# df_combined.loc[:,'num_aromatic_carbocycles'] = df_combined['mol'].apply(Descriptors.NumAromaticCarbocycles)
# df_combined.loc[:,'num_aromatic_heterocycles'] = df_combined['mol'].apply(Descriptors.NumAromaticHeterocycles)
# df_combined.loc[:,'num_aromatic_rings'] = df_combined['mol'].apply(Descriptors.NumAromaticRings)
# df_combined.loc[:,'num_heteroatoms'] = df_combined['mol'].apply(Descriptors.NumHeteroatoms)
# df_combined.loc[:,'num_saturated_carbocycles'] = df_combined['mol'].apply(Descriptors.NumSaturatedCarbocycles)
# df_combined.loc[:,'num_saturated_heterocycles'] = df_combined['mol'].apply(Descriptors.NumSaturatedHeterocycles)
# df_combined.loc[:,'num_saturated_rings'] = df_combined['mol'].apply(Descriptors.NumSaturatedRings)
# df_combined.loc[:,'ring_count'] = df_combined['mol'].apply(Descriptors.RingCount)
# df_combined.loc[:,'num_stereocentres'] = df_combined['mol'].apply(Chem.rdMolDescriptors.CalcNumAtomStereoCenters)

# # drop the column with RDKit molecules
# df_combined = df_combined.drop(['mol'] , axis=1)

Add descriptors for aromaticity, using an RDKit-based method.

In [30]:
def calculate_aromatic_atoms(smiles_set):
    # map every SMILES to its number of aromatic atoms:
    # all atoms, carbon atoms, nitrogen atoms and heteroatoms
    aromatic_atoms_dict = dict()
    aromatic_c_dict = dict()
    aromatic_n_dict = dict()
    aromatic_hetero_dict = dict()
    
    for smiles in tqdm(smiles_set):
        mol = Chem.MolFromSmiles(smiles)
        # atomic numbers of all aromatic atoms in the molecule
        aromatic_nums = [atom.GetAtomicNum() for atom in mol.GetAtoms() if atom.GetIsAromatic()]
        aromatic_atoms_dict[smiles] = len(aromatic_nums)
        aromatic_c_dict[smiles] = sum(num == 6 for num in aromatic_nums)
        aromatic_n_dict[smiles] = sum(num == 7 for num in aromatic_nums)
        # heteroatoms: everything except carbon (6) and hydrogen (1)
        aromatic_hetero_dict[smiles] = sum(num not in (1, 6) for num in aromatic_nums)
        
    return aromatic_atoms_dict, aromatic_c_dict, aromatic_n_dict, aromatic_hetero_dict

In [31]:
# smiles_set = set(df_combined["canonical_smiles"])
# aromatic_atoms_dict, aromatic_c_dict, aromatic_n_dict, aromatic_hetero_dict = calculate_aromatic_atoms(list(smiles_set))

# df_combined['aromatic_atoms'] = df_combined['canonical_smiles'].map(aromatic_atoms_dict)
# df_combined['aromatic_c'] = df_combined['canonical_smiles'].map(aromatic_c_dict)
# df_combined['aromatic_n'] = df_combined['canonical_smiles'].map(aromatic_n_dict)
# df_combined['aromatic_hetero'] = df_combined['canonical_smiles'].map(aromatic_hetero_dict)

# Add ATC Classifications (Level 1)

Query ATC classifications from the atc_classification and molecule_atc_classification tables.

In [32]:
sql = '''
SELECT DISTINCT mh.parent_molregno, atc.level1, level1_description
FROM atc_classification atc
INNER JOIN molecule_atc_classification matc
    ON atc.level5 = matc.level5
INNER JOIN molecule_hierarchy mh
    ON matc.molregno = mh.molregno
'''

atc_levels = pd.read_sql_query(sql, con=chembl_con)
atc_levels["l1_full"] = atc_levels["level1"] + "_" + atc_levels["level1_description"]
atc_levels

Unnamed: 0,parent_molregno,level1,level1_description,l1_full
0,454514,J,ANTIINFECTIVES FOR SYSTEMIC USE,J_ANTIINFECTIVES FOR SYSTEMIC USE
1,675285,L,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L_ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS
2,88739,N,NERVOUS SYSTEM,N_NERVOUS SYSTEM
3,1379143,B,BLOOD AND BLOOD FORMING ORGANS,B_BLOOD AND BLOOD FORMING ORGANS
4,1152067,P,"ANTIPARASITIC PRODUCTS, INSECTICIDES AND REPELLENTS","P_ANTIPARASITIC PRODUCTS, INSECTICIDES AND REPELLENTS"
...,...,...,...,...
3765,1376478,N,NERVOUS SYSTEM,N_NERVOUS SYSTEM
3766,366274,S,SENSORY ORGANS,S_SENSORY ORGANS
3767,229629,L,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L_ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS
3768,85649,N,NERVOUS SYSTEM,N_NERVOUS SYSTEM
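
The join logic of the query above can be tried out on a toy in-memory SQLite database. This is only a sketch: the three tables are reduced to the columns used in the query, all rows are made up, and the `ORDER BY` clause is added here solely to make the printed result deterministic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- simplified stand-ins for the ChEMBL tables (only the columns used in the query)
CREATE TABLE atc_classification (level5 TEXT, level1 TEXT, level1_description TEXT);
CREATE TABLE molecule_atc_classification (molregno INTEGER, level5 TEXT);
CREATE TABLE molecule_hierarchy (molregno INTEGER, parent_molregno INTEGER);

-- made-up rows: molecules 11 and 12 share parent 10, molecule 20 is its own parent
INSERT INTO atc_classification VALUES
    ('J01AA01', 'J', 'ANTIINFECTIVES FOR SYSTEMIC USE'),
    ('N02BA01', 'N', 'NERVOUS SYSTEM');
INSERT INTO molecule_atc_classification VALUES
    (11, 'J01AA01'), (12, 'J01AA01'), (20, 'N02BA01');
INSERT INTO molecule_hierarchy VALUES (11, 10), (12, 10), (20, 20);
""")

sql = '''
SELECT DISTINCT mh.parent_molregno, atc.level1, level1_description
FROM atc_classification atc
INNER JOIN molecule_atc_classification matc
    ON atc.level5 = matc.level5
INNER JOIN molecule_hierarchy mh
    ON matc.molregno = mh.molregno
ORDER BY mh.parent_molregno
'''

rows = con.execute(sql).fetchall()
con.close()
print(rows)
```

Because the annotations are mapped to *parent_molregno* and `DISTINCT` is used, the two salt forms 11 and 12 collapse into a single row for parent 10.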


Combine ATC level annotations for the same parent_molregno into one description.

In [33]:
between_str_join = ' | '
atc_levels['atc_level1'] = atc_levels.groupby(['parent_molregno'])['l1_full'].transform(lambda x: between_str_join.join(sorted(x)))
atc_levels = atc_levels[['parent_molregno', 'atc_level1']].drop_duplicates()
atc_levels

Unnamed: 0,parent_molregno,atc_level1
0,454514,J_ANTIINFECTIVES FOR SYSTEMIC USE
1,675285,L_ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS
2,88739,N_NERVOUS SYSTEM
3,1379143,B_BLOOD AND BLOOD FORMING ORGANS
4,1152067,"P_ANTIPARASITIC PRODUCTS, INSECTICIDES AND REPELLENTS"
...,...,...
3765,1376478,N_NERVOUS SYSTEM
3766,366274,S_SENSORY ORGANS
3767,229629,L_ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS
3768,85649,N_NERVOUS SYSTEM
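
The effect of the `groupby` / `transform` / `drop_duplicates` combination above is easiest to see on a compound with two level 1 annotations. The following sketch uses a made-up table `toy_atc` with the same column names as *atc_levels*.

```python
import pandas as pd

# made-up annotations: parent 10 has two ATC level 1 classes, parent 20 has one
toy_atc = pd.DataFrame({
    "parent_molregno": [10, 10, 20],
    "l1_full": [
        "N_NERVOUS SYSTEM",
        "J_ANTIINFECTIVES FOR SYSTEMIC USE",
        "N_NERVOUS SYSTEM",
    ],
})

between_str_join = " | "
# every row of a group receives the sorted, joined description of the whole group
toy_atc["atc_level1"] = toy_atc.groupby("parent_molregno")["l1_full"].transform(
    lambda x: between_str_join.join(sorted(x))
)
# the rows of a group are now identical in these two columns, so keep one per parent
toy_atc = toy_atc[["parent_molregno", "atc_level1"]].drop_duplicates()
print(toy_atc)
```

Sorting before joining makes the combined description independent of the row order returned by the query.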


Add the information to the full table (df_combined).

In [34]:
df_combined = df_combined.merge(atc_levels, on='parent_molregno', how = 'left')

# Add Scaffold SMILES

Add the scaffold SMILES for every molecule. For the column *scaffold_w_stereo* the stereochemistry is taken into account. For the column *scaffold_wo_stereo* the stereochemistry information is removed before calculating the scaffold. Molecules without rings are skipped, so both columns stay empty (NaN) for them.

In [35]:
# note: this takes a few minutes to calculate for all molecules
def calculate_scaffolds(smiles_set):
    scaffolds_dict = dict()
    scaffolds_no_stereo_dict = dict()
    for smiles in tqdm(smiles_set):
        mol = Chem.MolFromSmiles(smiles)
        if Chem.rdMolDescriptors.CalcNumRings(mol) == 0:
            continue

        scaffold = MurckoScaffold.GetScaffoldForMol(mol)
        scaffolds_dict[smiles] = Chem.MolToSmiles(scaffold)
        
        # repeat after removing stereochemistry
        Chem.RemoveStereochemistry(mol)
        scaffold_no_stereo = MurckoScaffold.GetScaffoldForMol(mol)
        scaffolds_no_stereo_dict[smiles] = Chem.MolToSmiles(scaffold_no_stereo)
        
    return scaffolds_dict, scaffolds_no_stereo_dict

In [36]:
# smiles_set = set(df_combined["canonical_smiles"])
# scaffolds_dict, scaffolds_no_stereo_dict = calculate_scaffolds(smiles_set)

# df_combined["scaffold_w_stereo"] = df_combined['canonical_smiles'].map(scaffolds_dict)
# df_combined['scaffold_wo_stereo'] = df_combined['canonical_smiles'].map(scaffolds_no_stereo_dict)

# Add Target Class Annotations

Add information about level 1 and level 2 target class annotations in ChEMBL.

In [37]:
sql = '''
SELECT DISTINCT tc.tid, pc.*, pfc.*
FROM protein_classification pc
-- join several tables to get the corresponding target id
INNER JOIN component_class cc
    ON pc.protein_class_id = cc.protein_class_id
INNER JOIN component_sequences cs
    ON cc.component_id = cs.component_id
INNER JOIN target_components tc
    ON cs.component_id = tc.component_id
-- join the protein_family_classification table for a faster way to traverse the hierarchy
INNER JOIN protein_family_classification pfc 
    ON  pc.protein_class_id = pfc.protein_class_id
'''

df_target_classes = pd.read_sql_query(sql, con=chembl_con)
# only interested in the target ids that are in the current dataset
current_tids = set(df_combined['tid'])
df_target_classes = df_target_classes[df_target_classes['tid'].isin(current_tids)]
df_target_classes

Unnamed: 0,tid,protein_class_id,parent_id,pref_name,short_name,protein_class_desc,definition,class_level,protein_class_id.1,protein_class_desc.1,l1,l2,l3,l4,l5,l6,l7,l8
0,1,646,1,Hydrolase,Hydrolase,enzyme hydrolase,A group of enzymes that catalyze the hydrolysis of a chemical bond,2,646,enzyme hydrolase,Enzyme,Hydrolase,,,,,,
1,2,1133,1104,ABCC subfamily,MRP,transporter ntpase atp binding cassette mrp,A sequence-related subfamily of ATP-BINDING CASSETTE TRANSPORTERS that actively transport organi...,4,1133,transporter ntpase atp binding cassette mrp,Transporter,Primary active transporter,ATP-binding cassette,ABCC subfamily,,,,
2,3,104,1065,Phosphodiesterase 5A,PDE_5A,enzyme phosphodiesterase pde_5 pde_5a,,4,104,enzyme phosphodiesterase pde_5 pde_5a,Enzyme,Phosphodiesterase,Phosphodiesterase 5,Phosphodiesterase 5A,,,,
3,4,1583,1019,Voltage-gated calcium channel,VG CA,ion channel vgc vg ca,Voltage-dependent cell membrane glycoproteins selectively permeable to calcium ions. They are ca...,3,1583,ion channel vgc vg ca,Ion channel,Voltage-gated ion channel,Voltage-gated calcium channel,,,,,
5,6,10,1,Oxidoreductase,Reductase,enzyme reductase,The class of all enzymes catalyzing oxidoreduction reactions. The substrate that is oxidized is ...,2,10,enzyme reductase,Enzyme,Oxidoreductase,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8425,117043,404,1693,TKL protein kinase STKR Type 1 subfamily,Type1,enzyme kinase protein kinase tkl stkr type1,,6,404,enzyme kinase protein kinase tkl stkr type1,Enzyme,Kinase,Protein Kinase,TKL protein kinase group,TKL protein kinase STKR family,TKL protein kinase STKR Type 1 subfamily,,
8426,117043,601,0,Unclassified protein,Unclassified,unclassified,,1,601,unclassified,Unclassified protein,,,,,,,
8677,117219,601,0,Unclassified protein,Unclassified,unclassified,,1,601,unclassified,Unclassified protein,,,,,,,
8760,117303,10,1,Oxidoreductase,Reductase,enzyme reductase,The class of all enzymes catalyzing oxidoreduction reactions. The substrate that is oxidized is ...,2,10,enzyme reductase,Enzyme,Oxidoreductase,,,,,,


Summarise the information for a target id with several assigned target classes of level 1 into one description. If a target id has more than one assigned target class, the target class 'Unclassified protein' is discarded.

In [38]:
level = 'l1'
between_str_join = '|'
target_classes_level = df_target_classes[['tid', level]].drop_duplicates().dropna()

# remove 'Unclassified protein' from targets with more than one target class, level 1
more_than_one = target_classes_level.groupby(['tid'])[level].count()
target_classes_level = target_classes_level[
    (target_classes_level['tid'].isin(more_than_one[more_than_one == 1].index.tolist())) 
    | ((target_classes_level['tid'].isin(more_than_one[more_than_one > 1].index.tolist())) 
       & (target_classes_level[level] != 'Unclassified protein'))]

target_classes_level['target_class_l1'] = target_classes_level.groupby(['tid'])[level].transform(lambda x: between_str_join.join(sorted(x)))
target_classes_level = target_classes_level[['tid', 'target_class_l1']].drop_duplicates()

df_combined = df_combined.merge(target_classes_level, on='tid', how = 'left')
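
The rule for 'Unclassified protein' described above can be illustrated on a small made-up table. The sketch below is a condensed, equivalent formulation and not the notebook's code: a row is kept if its target has only one class or if the class is not 'Unclassified protein'. The names `toy_classes`, `n_classes` and `summary` are chosen for this example only.

```python
import pandas as pd

# made-up level 1 annotations for three targets
toy_classes = pd.DataFrame({
    "tid": [1, 1, 2, 3, 3],
    "l1": ["Enzyme", "Unclassified protein", "Unclassified protein",
           "Transporter", "Ion channel"],
})

# number of level 1 classes per target, aligned with the original rows
n_classes = toy_classes.groupby("tid")["l1"].transform("count")
# keep single-class targets as they are; otherwise drop 'Unclassified protein'
keep = (n_classes == 1) | (toy_classes["l1"] != "Unclassified protein")
toy_classes = toy_classes[keep]

# one row per target with the sorted classes joined by '|'
summary = (
    toy_classes.groupby("tid")["l1"]
    .agg(lambda x: "|".join(sorted(x)))
    .rename("target_class_l1")
    .reset_index()
)
print(summary)
```

Target 1 ends up as 'Enzyme', target 2 keeps 'Unclassified protein' because it has no other class, and target 3 becomes 'Ion channel|Transporter'.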

Repeat the summary step for target classes of level 2.

In [39]:
level = 'l2'
target_classes_level = df_target_classes[['tid', level]].drop_duplicates().dropna()
target_classes_level['target_class_l2'] = target_classes_level.groupby(['tid'])[level].transform(lambda x: between_str_join.join(sorted(x)))
target_classes_level = target_classes_level[['tid', 'target_class_l2']].drop_duplicates()

df_combined = df_combined.merge(target_classes_level, on='tid', how = 'left')

Show the targets that have more than one target class assigned to them.  
These could be reassigned by hand if a single target class is preferable.

In [40]:
############### TESTING: which targets have more than one level 1 target class assigned to them? ###############
test = df_combined[(df_combined['target_class_l1'].str.contains('|', regex=False))][['tid', 'target_pref_name', 'target_type', 'target_class_l1', 'target_class_l2']].drop_duplicates()
print("#Instances with >1 level 1 target class:", len(test))
test

#Instances with >1 level 1 target class: 20


Unnamed: 0,tid,target_pref_name,target_type,target_class_l1,target_class_l2
718,104295,Cyclin-dependent kinase 4/cyclin D1,PROTEIN COMPLEX,Enzyme|Other cytosolic protein,Kinase
10383,104811,Bcr/Abl fusion protein,CHIMERIC PROTEIN,Enzyme|Other cytosolic protein,Kinase
10463,100128,Breakpoint cluster region protein,SINGLE PROTEIN,Enzyme|Other cytosolic protein,Kinase
11602,104841,Serotonin (5-HT) receptor,PROTEIN FAMILY,Ion channel|Membrane receptor,Family A G protein-coupled receptor|Ligand-gated ion channel
12255,104737,Sulfonylurea receptors; K-ATP channels,PROTEIN COMPLEX GROUP,Ion channel|Transporter,Primary active transporter|Voltage-gated ion channel
12800,104717,Gamma-secretase,PROTEIN COMPLEX,Enzyme|Ion channel,Other ion channel|Protease
13939,106197,26S proteasome,PROTEIN COMPLEX,Enzyme|Other cytosolic protein,Protease
15600,104758,Potassium-transporting ATPase,PROTEIN COMPLEX,Enzyme|Transporter,Hydrolase|Primary active transporter
15986,104782,"Sulfonylurea receptor 2, Kir6.2",PROTEIN COMPLEX,Ion channel|Transporter,Primary active transporter|Voltage-gated ion channel
18079,105734,Voltage-gated calcium channel,PROTEIN COMPLEX GROUP,Auxiliary transport protein|Ion channel,Calcium channel auxiliary subunit alpha2delta family|Calcium channel auxiliary subunit beta fami...


In [41]:
############### TESTING: which targets have more than one level 2 target class assigned to them? ###############
# df_combined_test = df_combined[~(df_combined['target_class_l2'].isnull())]
test = df_combined[(~df_combined['target_class_l2'].isnull()) & (df_combined['target_class_l2'].str.contains('|', regex=False))][['tid', 'target_pref_name', 'target_type', 'target_class_l1', 'target_class_l2']].drop_duplicates()
print("#Instances with >1 level 2 target class:", len(test))
test

#Instances with >1 level 2 target class: 10


Unnamed: 0,tid,target_pref_name,target_type,target_class_l1,target_class_l2
11602,104841,Serotonin (5-HT) receptor,PROTEIN FAMILY,Ion channel|Membrane receptor,Family A G protein-coupled receptor|Ligand-gated ion channel
12255,104737,Sulfonylurea receptors; K-ATP channels,PROTEIN COMPLEX GROUP,Ion channel|Transporter,Primary active transporter|Voltage-gated ion channel
12800,104717,Gamma-secretase,PROTEIN COMPLEX,Enzyme|Ion channel,Other ion channel|Protease
15600,104758,Potassium-transporting ATPase,PROTEIN COMPLEX,Enzyme|Transporter,Hydrolase|Primary active transporter
15986,104782,"Sulfonylurea receptor 2, Kir6.2",PROTEIN COMPLEX,Ion channel|Transporter,Primary active transporter|Voltage-gated ion channel
18079,105734,Voltage-gated calcium channel,PROTEIN COMPLEX GROUP,Auxiliary transport protein|Ion channel,Calcium channel auxiliary subunit alpha2delta family|Calcium channel auxiliary subunit beta fami...
18294,104770,Sodium/potassium-transporting ATPase,PROTEIN COMPLEX GROUP,Enzyme|Ion channel|Transporter,Hydrolase|Other ion channel|Primary active transporter
19073,29,Sodium/potassium-transporting ATPase alpha-1 chain,SINGLE PROTEIN,Enzyme|Transporter,Hydrolase|Primary active transporter
35115,104852,"Sulfonylurea receptor 1, Kir6.2",PROTEIN COMPLEX,Ion channel|Transporter,Primary active transporter|Voltage-gated ion channel
651570,322,DNA (cytosine-5)-methyltransferase 3A,SINGLE PROTEIN,Epigenetic regulator,Reader|Writer
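
Both testing cells above pass `regex=False` to `str.contains`. The following sketch, using a made-up Series, shows why this matters: as a regular expression, `'|'` is an alternation of two empty patterns and therefore matches every string. The `na=False` argument is an addition for this example and treats missing annotations as non-matching.

```python
import pandas as pd

# made-up target class annotations, including a missing value
classes = pd.Series(["Enzyme|Transporter", "Enzyme", None], dtype="object")

# literal search for the separator character
literal = classes.str.contains("|", regex=False, na=False)
# the same pattern interpreted as a regular expression matches any string
as_regex = classes.str.contains("|", regex=True, na=False)

print(literal.tolist())
print(as_regex.tolist())
```

Only the literal search singles out the entry with two classes; the regular expression variant also flags the single-class entry.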


# Get Relevant Subsets of the Data

Calculate different subsets of the data based on binding and functional data in ChEMBL.

In [42]:
# function to calculate and return the different subsets of interest
def get_data_subsets(min_nof_cpds, data):
    # Restrict the dataset to targets with at least *min_nof_cpds* compounds with a pchembl value.
    comparator_counts = data[~data['pchembl_value_mean'].isnull()].groupby(['tid_mutation'])['parent_molregno'].count()
    targets_w_enough_cpds = comparator_counts[comparator_counts >= min_nof_cpds].index.tolist()
    df_enough_cpds = data.query('tid_mutation in @targets_w_enough_cpds')
    
    # Restrict the dataset further to targets with at least one compound-target pair labelled as 'D_DT', 'C3_DT', 'C2_DT', 'C1_DT' or 'C0_DT', 
    # i.e. a compound-target pair with a known interaction.
    c_dt_d_dt_targets = set(df_enough_cpds[df_enough_cpds['DTI'].isin(['D_DT', 'C3_DT', 'C2_DT', 'C1_DT', 'C0_DT'])].tid_mutation.to_list())
    df_c_dt_d_dt = df_enough_cpds.query('tid_mutation in @c_dt_d_dt_targets')
    
    # Restrict the dataset further to targets with at least one compound-target pair labelled as 'D_DT', 
    # i.e. a known drug-target interaction. 
    d_dt_targets = set(df_enough_cpds[df_enough_cpds['DTI'] == 'D_DT'].tid_mutation.to_list())
    df_d_dt = df_enough_cpds.query('tid_mutation in @d_dt_targets')
    
    return df_enough_cpds, df_c_dt_d_dt, df_d_dt
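
How the three subsets nest can be checked on a toy table. The sketch below uses a condensed re-statement of the function (`get_data_subsets_sketch`, a hypothetical name) so that it runs on its own; the data and the placeholder label 'other' for pairs without a known interaction are made up.

```python
import pandas as pd

KNOWN_INTERACTIONS = ["D_DT", "C3_DT", "C2_DT", "C1_DT", "C0_DT"]


def get_data_subsets_sketch(min_nof_cpds, data):
    # targets with at least min_nof_cpds rows that have a pchembl value
    counts = (data.dropna(subset=["pchembl_value_mean"])
                  .groupby("tid_mutation")["parent_molregno"].count())
    enough = counts[counts >= min_nof_cpds].index
    df_enough = data[data["tid_mutation"].isin(enough)]

    # of those, targets with at least one pair labelled as a known interaction
    c_dt_d_dt = df_enough.loc[df_enough["DTI"].isin(KNOWN_INTERACTIONS), "tid_mutation"].unique()
    # of those, targets with at least one known drug-target interaction
    d_dt = df_enough.loc[df_enough["DTI"] == "D_DT", "tid_mutation"].unique()

    return (df_enough,
            df_enough[df_enough["tid_mutation"].isin(c_dt_d_dt)],
            df_enough[df_enough["tid_mutation"].isin(d_dt)])


# made-up data: target D has a drug but only one row with a pchembl value
toy = pd.DataFrame({
    "tid_mutation": ["A", "A", "A", "B", "B", "C", "C", "D", "D"],
    "parent_molregno": [1, 2, 3, 1, 4, 5, 6, 7, 8],
    "pchembl_value_mean": [7.1, 6.0, 5.5, 8.2, 6.3, 5.0, 6.1, 9.0, None],
    "DTI": ["D_DT", "other", "other", "C2_DT", "other", "other", "other", "D_DT", "other"],
})

enough, c_dt_d_dt, d_dt = get_data_subsets_sketch(2, toy)
for name, subset in [("enough cpds", enough), ("c_dt and d_dt", c_dt_d_dt), ("d_dt", d_dt)]:
    print(name, sorted(subset["tid_mutation"].unique()))
```

With a threshold of 2, targets A, B and C pass the size filter, A and B have a known interaction, and only A has a known drug-target interaction. Target D is dropped in the first step even though it has a drug, because rows without a pchembl value do not count towards the threshold.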

### Binding and Functional Assays

In [43]:
# consider binding and functional assays
min_nof_cpds = 100
df_combined_all = df_combined[(df_combined['only_binding'] == False)]
df_combined_all_enough_cpds, df_combined_all_c_dt_d_dt, df_combined_all_d_dt = get_data_subsets(min_nof_cpds, df_combined_all)

In [44]:
# df_combined_all.to_csv(path_results+"ChEMBL"+chembl_version+"_DTI_all_assays.csv", sep = ";", index = False)
# df_combined_all.to_excel(path_results+"ChEMBL"+chembl_version+"_DTI_all_assays.xlsx", index = False)

# df_combined_all_enough_cpds.to_csv(path_results+"ChEMBL"+chembl_version+"_DTI_all_assays_" + str(min_nof_cpds) + ".csv", sep = ";", index = False)
# df_combined_all_enough_cpds.to_excel(path_results+"ChEMBL"+chembl_version+"_DTI_all_assays_" + str(min_nof_cpds) + ".xlsx", index = False)

# df_combined_all_c_dt_d_dt.to_csv(path_results+"ChEMBL"+chembl_version+"_DTI_all_assays_" + str(min_nof_cpds) + "_c_dt_d_dt.csv", sep = ";", index = False)
# df_combined_all_c_dt_d_dt.to_excel(path_results+"ChEMBL"+chembl_version+"_DTI_all_assays_" + str(min_nof_cpds) + "_c_dt_d_dt.xlsx", index = False)

# df_combined_all_d_dt.to_csv(path_results+"ChEMBL"+chembl_version+"_DTI_all_assays_" + str(min_nof_cpds) + "_d_dt.csv", sep = ";", index = False)
# df_combined_all_d_dt.to_excel(path_results+"ChEMBL"+chembl_version+"_DTI_all_assays_" + str(min_nof_cpds) + "_d_dt.xlsx", index = False)

In [45]:
############### TESTING: binding and functional assays ###############
add_dataset_sizes(df_combined_all, "all assays")
add_dataset_sizes(df_combined_all_enough_cpds, "all, >= 100")
add_dataset_sizes(df_combined_all_c_dt_d_dt, "all, >= 100, c_dt and d_dt")
add_dataset_sizes(df_combined_all_d_dt, "all, >= 100, d_dt")

### Only Binding Assays

In [46]:
# consider only binding assays and therapeutic targets
min_nof_cpds = 100
df_combined_B = df_combined[(df_combined['only_binding'] == True)]
df_combined_B_enough_cpds, df_combined_B_c_dt_d_dt, df_combined_B_d_dt = get_data_subsets(min_nof_cpds, df_combined_B)

In [47]:
# df_combined_B.to_csv(path_results+"ChEMBL"+chembl_version+"_DTI_binding.csv", sep = ";", index = False)
# df_combined_B.to_excel(path_results+"ChEMBL"+chembl_version+"_DTI_binding.xlsx", index = False)

# df_combined_B_enough_cpds.to_csv(path_results+"ChEMBL"+chembl_version+"_DTI_binding_" + str(min_nof_cpds) + ".csv", sep = ";", index = False)
# df_combined_B_enough_cpds.to_excel(path_results+"ChEMBL"+chembl_version+"_DTI_binding_" + str(min_nof_cpds) + ".xlsx", index = False)

# df_combined_B_c_dt_d_dt.to_csv(path_results+"ChEMBL"+chembl_version+"_DTI_binding_" + str(min_nof_cpds) + "_c_dt_d_dt.csv", sep = ";", index = False)
# df_combined_B_c_dt_d_dt.to_excel(path_results+"ChEMBL"+chembl_version+"_DTI_binding_" + str(min_nof_cpds) + "_c_dt_d_dt.xlsx", index = False)

# df_combined_B_d_dt.to_csv(path_results+"ChEMBL"+chembl_version+"_DTI_binding_" + str(min_nof_cpds) + "_d_dt.csv", sep = ";", index = False)
# df_combined_B_d_dt.to_excel(path_results+"ChEMBL"+chembl_version+"_DTI_binding_" + str(min_nof_cpds) + "_d_dt.xlsx", index = False)

In [48]:
############### TESTING: binding assays ###############
add_dataset_sizes(df_combined_B, "binding")
add_dataset_sizes(df_combined_B_enough_cpds, "b, >= 100")
add_dataset_sizes(df_combined_B_c_dt_d_dt, "b, >= 100, c_dt and d_dt")
add_dataset_sizes(df_combined_B_d_dt, "b, >= 100, d_dt")

# Testing: Overview of Dataset Size at Different Points in the Pipeline

In [49]:
############### TESTING: development of the full dataset size ###############
print("Size of full dataset at different points")
pd.DataFrame(all_lengths,
                   columns=['type', 'mols', 'drugs', 'targets', 'drug_targets', 'cpd_target', 'drug_target'])

Size of full dataset at different points


Unnamed: 0,type,mols,drugs,targets,drug_targets,cpd_target,drug_target
0,init,1098855,1754,8312,3360,2579716,29759
1,pre DTI,921555,752,7410,588,2085879,1628
2,post DTI,477626,752,1472,588,723010,1628
3,cpd props,477185,750,1472,588,722308,1626
4,all assays,477185,750,1472,588,722308,1626
5,"all, >= 100",470770,687,567,309,707355,1226
6,"all, >= 100, c_dt and d_dt",399135,687,435,309,602331,1226
7,"all, >= 100, d_dt",292793,687,309,309,439024,1226
8,binding,384806,716,1424,573,579084,1559
9,"b, >= 100",377893,656,539,296,563611,1166


In [50]:
############### TESTING: development of the dataset size (pchembl values required) ###############
print("Size of dataset with pchembl values at different points")
pd.DataFrame(all_lengths_pchembl,
                   columns=['type', 'mols', 'drugs', 'targets', 'drug_targets', 'cpd_target', 'drug_target'])

Size of dataset with pchembl values at different points


Unnamed: 0,type,mols,drugs,targets,drug_targets,cpd_target,drug_target
0,init,923862,1637,6893,2569,2094029,21601
1,pre DTI,921348,696,6843,484,2074104,1420
2,post DTI,477417,696,1368,484,718807,1420
3,cpd props,476976,694,1368,484,718105,1418
4,all assays,476976,694,1368,484,718105,1418
5,"all, >= 100",470582,653,567,296,704036,1155
6,"all, >= 100, c_dt and d_dt",398941,653,435,296,599271,1155
7,"all, >= 100, d_dt",292577,653,309,296,437090,1155
8,binding,384577,660,1327,471,575043,1362
9,"b, >= 100",377682,617,539,283,560469,1098
