# Target Fixes

The principle behind this exercise is that the genes in the list can correspond to several targets, but each target should only correspond to one gene in the list.
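
As a minimal sketch of how this principle can be checked, the snippet below counts the distinct gene symbols attached to each target ID. The toy table, its gene names and its target IDs are invented for illustration; only the column names `symbol` and `chembl_id` are taken from the targets table used in this notebook.

```python
import pandas as pd

# Toy stand-in for the targets table (illustrative values only).
toy = pd.DataFrame({
    'symbol':    ['GENE_A', 'GENE_A', 'GENE_B', 'GENE_C'],
    'chembl_id': ['T1',     'T2',     'T3',     'T3'],
})

# Number of distinct gene symbols attached to each target ID.
genes_per_target = toy.groupby('chembl_id')['symbol'].nunique()

# Targets that break the principle, i.e. are claimed by more than one gene.
violations = genes_per_target[genes_per_target > 1].index.tolist()

print(violations)
```

Here GENE_A legitimately maps to two targets, whereas T3 is shared by two genes and is therefore reported.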

The idea here is that the genes are either single-protein targets or are the 'central' or 'active' component of a complex (_i.e._ where the active site lies), with other components being regarded as regulatory or ancillary.

Thus, data for ChEMBL targets for a gene will be merged (in the first instance, at least) with some confidence that the active compounds are targeting the same site.
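
One possible reading of this merging step, sketched on invented data: sum the per-target counts for each gene over the targets that are not flagged for exclusion. Summing `n_active` and `n_total` is an assumption made for illustration only; a real merge would presumably work at the level of individual compounds, where the same compound may appear against several targets.

```python
import pandas as pd

# Toy stand-in for the targets table (illustrative values only).
toy = pd.DataFrame({
    'symbol':    ['GENE_A', 'GENE_A', 'GENE_A', 'GENE_B'],
    'chembl_id': ['T1',     'T2',     'T4',     'T3'],
    'exclude':   [0,        0,        1,        0],
    'n_active':  [100,      20,       7,        5],
    'n_total':   [300,      50,       9,        40],
})

# Pool the counts per gene, ignoring targets flagged for exclusion.
pooled = (toy[toy['exclude'] == 0]
          .groupby('symbol')[['n_active', 'n_total']]
          .sum())

print(pooled)
```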

This is obviously a great simplification: regulatory subunits are likely to affect binding site conformation, extra binding sites may be introduced in complexes as opposed to isolated proteins, _etc._
However, it seems like a necessary simplification at this time.

Note that protein families were deliberately _not_ included in the original [target mapping](2_ChEMBL_targets.ipynb#target_mapping), only single proteins and protein complexes.

However, there appears to have been some mislabelling of protein families as protein complexes. 

There may also be benefits from pooling data for protein families, even where some isoform-specific data is available. This is under investigation.

In [1]:
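import pandas as pd
from addict import Dict as adict
from IPython.display import HTML
from sqlalchemy import create_engine
from sqlalchemy.dialects.oracle import VARCHAR2

from local_utils.file_utils import backup_file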

### Configuration

In [2]:
# ChEMBL connection...

engine = create_engine(open('database.txt').read().strip())

### Reload targets

In [3]:
targets = pd.read_pickle('chembl_targets.pkl')

targets.shape

(377, 11)

### Fixes due to incorrect target synonyms

In [4]:
targets[(
      ( (targets.symbol == 'ADRA1A') & (targets.pref_name == 'Alpha-1d adrenergic receptor') )
    | ( (targets.symbol == 'CHRNA1') & targets.pref_name.str.contains('alpha2') )
)]

Unnamed: 0,symbol,approved_name,targets,n_target,chembl_id,target_type,pref_name,species,exclude,n_active,n_total
31,ADRA1A,adrenoceptor alpha 1A,α1A-adrenergic receptor;1|Adrenergic α1A;2.1|Adrenergic α1a receptor (Al1a);2.2|alpha1A;3,2,CHEMBL223,SINGLE PROTEIN,Alpha-1d adrenergic receptor,Human,0,869,2163
33,ADRA1A,adrenoceptor alpha 1A,α1A-adrenergic receptor;1|Adrenergic α1A;2.1|Adrenergic α1a receptor (Al1a);2.2|alpha1A;3,2,CHEMBL326,SINGLE PROTEIN,Alpha-1d adrenergic receptor,Rat,0,519,1702
119,CHRNA1,"cholinergic receptor, nicotinic, alpha 1 (muscle)",Acetylcholine receptor subunit α1 or α4;1|Nicotinic acetylcholine;2.1|Nicotinic receptor (central);2.2,2,CHEMBL3038458,PROTEIN COMPLEX,Nicotinic acetylcholine receptor alpha2/beta2,Human,0,3,12
120,CHRNA1,"cholinergic receptor, nicotinic, alpha 1 (muscle)",Acetylcholine receptor subunit α1 or α4;1|Nicotinic acetylcholine;2.1|Nicotinic receptor (central);2.2,3,CHEMBL3038459,PROTEIN COMPLEX,Nicotinic acetylcholine receptor alpha2/beta4,Human,0,5,12


In [5]:
targets.loc[(
      ( (targets.symbol == 'ADRA1A') & (targets.pref_name == 'Alpha-1d adrenergic receptor') )
    | ( (targets.symbol == 'CHRNA1') & targets.pref_name.str.contains('alpha2') )
), 'exclude'] = 1

### Other fixes

The target [CHEMBL4872](https://www.ebi.ac.uk/chembl/target/inspect/CHEMBL4872) ([KCNE1](https://en.wikipedia.org/wiki/KCNE1)) is associated with only four data points, all inactive. KCNE1 is also a component (subsidiary to [KCNQ1](https://en.wikipedia.org/wiki/KvLQT1)) of [CHEMBL2221347](https://www.ebi.ac.uk/chembl/target/inspect/CHEMBL2221347), for which a non-trivial amount of data is available.

Thus, KCNE1 is removed for simplicity.

In [6]:
targets.query("(symbol == 'KCNE1') | (chembl_id == 'CHEMBL2221347')").sort_values(['symbol', 'chembl_id'], ascending=[False, True])

Unnamed: 0,symbol,approved_name,targets,n_target,chembl_id,target_type,pref_name,species,exclude,n_active,n_total
237,KCNQ1,"potassium channel, voltage gated KQT-like subfamily Q, member 1",Potassium voltage-gated channel KQT-like member 1 and minimal potassium channel MinK;1|KCNQ1;2.2|IKs;4,1,CHEMBL2221347,PROTEIN COMPLEX,"Voltage-gated potassium channel, IKs; KCNQ1(Kv7.1)/KCNE1(MinK)",Human,0,25,67
225,KCNE1,"potassium channel, voltage gated subfamily E regulatory beta subunit 1",Potassium voltage-gated channel KQT-like member 1 and minimal potassium channel MinK;1,1,CHEMBL2221347,PROTEIN COMPLEX,"Voltage-gated potassium channel, IKs; KCNQ1(Kv7.1)/KCNE1(MinK)",Human,0,25,67
226,KCNE1,"potassium channel, voltage gated subfamily E regulatory beta subunit 1",Potassium voltage-gated channel KQT-like member 1 and minimal potassium channel MinK;1,2,CHEMBL4872,SINGLE PROTEIN,Voltage-gated potassium channel beta subunit Mink,Human,0,0,4


In [7]:
targets.loc[targets.eval("symbol == 'KCNE1'"), 'exclude'] = 1

<a name="duplicates"></a>
### Duplicated ChEMBL Target IDs

In [8]:
subset = targets.query("exclude == 0")

duplicated = subset['chembl_id'][subset['chembl_id'].duplicated()].tolist()

duplicated

['CHEMBL3038488', 'CHEMBL2095189', 'CHEMBL3038454', 'CHEMBL2111459']

In [9]:
df = targets.query("chembl_id in @duplicated").sort_values(['chembl_id', 'symbol'])

df

Unnamed: 0,symbol,approved_name,targets,n_target,chembl_id,target_type,pref_name,species,exclude,n_active,n_total
284,PDGFRA,"platelet-derived growth factor receptor, alpha polypeptide",PDGFRs;5,1,CHEMBL2095189,PROTEIN COMPLEX,Platelet-derived growth factor receptor,Human,0,234,439
286,PDGFRB,"platelet-derived growth factor receptor, beta polypeptide",PDGFRs;5,1,CHEMBL2095189,PROTEIN COMPLEX,Platelet-derived growth factor receptor,Human,0,234,439
322,ROCK1,"Rho-associated, coiled-coil containing protein kinase 1",ROCK;5,1,CHEMBL2111459,PROTEIN COMPLEX,Rho-associated protein kinase,Human,0,69,84
325,ROCK2,"Rho-associated, coiled-coil containing protein kinase 2",ROCK;5,1,CHEMBL2111459,PROTEIN COMPLEX,Rho-associated protein kinase,Human,0,69,84
301,PRKAA1,"protein kinase, AMP-activated, alpha 1 catalytic subunit",AMPK;5,2,CHEMBL3038454,PROTEIN COMPLEX,AMPK alpha1/alpha2,Human,0,0,12
307,PRKAA2,"protein kinase, AMP-activated, alpha 2 catalytic subunit",AMPK;5,1,CHEMBL3038454,PROTEIN COMPLEX,AMPK alpha1/alpha2,Human,0,0,12
234,KCNJ3,"potassium channel, inwardly rectifying subfamily J, member 3",IKAch;4,2,CHEMBL3038488,PROTEIN COMPLEX,Kir3.1/Kir3.4,Human,0,49,65
235,KCNJ5,"potassium channel, inwardly rectifying subfamily J, member 5",IKAch;4,1,CHEMBL3038488,PROTEIN COMPLEX,Kir3.1/Kir3.4,Human,0,49,65


Note that these are all actually 'protein families' mislabelled as 'protein complexes' (note also that they are all Human targets).

Examining _all_ the ChEMBL Target IDs for those symbols which have duplicated ChEMBL IDs (excluding the duplicated ChEMBL IDs and taking only the Human targets)...

In [10]:
symbols = df['symbol'].tolist()

HTML(targets.query("symbol in @symbols and chembl_id not in @duplicated and species == 'Human'").sort_values(['symbol', 'target_type'], ascending=[True, False]).to_html())

Unnamed: 0,symbol,approved_name,targets,n_target,chembl_id,target_type,pref_name,species,exclude,n_active,n_total
233,KCNJ3,"potassium channel, inwardly rectifying subfamily J, member 3",IKAch;4,1,CHEMBL3038489,PROTEIN COMPLEX,Kir3.1/Kir3.2,Human,0,50,65
285,PDGFRA,"platelet-derived growth factor receptor, alpha polypeptide",PDGFRs;5,2,CHEMBL2007,SINGLE PROTEIN,Platelet-derived growth factor receptor alpha,Human,0,370,1415
287,PDGFRB,"platelet-derived growth factor receptor, beta polypeptide",PDGFRs;5,2,CHEMBL1913,SINGLE PROTEIN,Platelet-derived growth factor receptor beta,Human,0,833,2246
305,PRKAA1,"protein kinase, AMP-activated, alpha 1 catalytic subunit",AMPK;5,6,CHEMBL4045,SINGLE PROTEIN,"AMP-activated protein kinase, alpha-1 subunit",Human,0,295,1430
300,PRKAA1,"protein kinase, AMP-activated, alpha 1 catalytic subunit",AMPK;5,1,CHEMBL2111345,PROTEIN COMPLEX,AMP-activated protein kinase (AMPK) alpha-1/beta-1/gamma-1,Human,0,4,203
302,PRKAA1,"protein kinase, AMP-activated, alpha 1 catalytic subunit",AMPK;5,3,CHEMBL3038451,PROTEIN COMPLEX,AMPK alpha1/beta1/gamma2,Human,0,0,1
303,PRKAA1,"protein kinase, AMP-activated, alpha 1 catalytic subunit",AMPK;5,4,CHEMBL3038452,PROTEIN COMPLEX,AMPK alpha1/beta1/gamma3,Human,0,0,1
304,PRKAA1,"protein kinase, AMP-activated, alpha 1 catalytic subunit",AMPK;5,5,CHEMBL3038453,PROTEIN COMPLEX,AMPK alpha1/beta2/gamma1,Human,0,0,2
311,PRKAA2,"protein kinase, AMP-activated, alpha 2 catalytic subunit",AMPK;5,5,CHEMBL2116,SINGLE PROTEIN,"AMP-activated protein kinase, alpha-2 subunit",Human,0,17,501
308,PRKAA2,"protein kinase, AMP-activated, alpha 2 catalytic subunit",AMPK;5,2,CHEMBL3038455,PROTEIN COMPLEX,AMPK alpha2/beta1/gamma1,Human,0,14,17


Thus, for all these genes except KCNJ3, there exists a 'single protein' or (genuine) 'protein complex' for the gene, at least one of which has a fair number of actives associated with it.
In the case of PRKAA2, numbers are not large, but the protein-family targets are not associated with _any_ actives.


Thus, removing the duplicated 'protein family' targets will not remove these genes from the panel.

The exception is KCNJ3, which is only represented by CHEMBL3038489, 'Kir3.1/Kir3.2'. This, like the duplicated CHEMBL3038488, is a 'protein family' (and also mislabelled as a 'protein complex'); it is not itself duplicated because Kir3.2 (KCNJ6) is not in the gene list. Thus, removing these targets (_i.e._ CHEMBL3038488 and CHEMBL3038489) will remove KCNJ3 and KCNJ5 entirely. However, as there are no 'single protein' targets associated with these genes at present, this is unavoidable (if the single-gene-to-target paradigm is to be adhered to).
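
The consequence described above, that dropping a set of targets can remove a gene from the panel altogether, can be checked mechanically. The following is a sketch on invented data: gene names and `pref_name` values are hypothetical, and only the column names mirror the targets table.

```python
import pandas as pd

# Toy stand-in for the targets table (illustrative values only).
toy = pd.DataFrame({
    'symbol':    ['GENE_A',   'GENE_A',     'GENE_B',     'GENE_C'],
    'pref_name': ['A single', 'A/C family', 'B/X family', 'A/C family'],
})

# Hypothetical list of family-type targets to be dropped.
to_drop = ['A/C family', 'B/X family']

# Rows that survive the drop.
kept = toy[~toy['pref_name'].isin(to_drop)]

# Genes that no longer have any target once the drop is applied.
lost = sorted(set(toy['symbol']) - set(kept['symbol']))

print(lost)
```

GENE_A survives because it still has a single-protein target, while GENE_B and GENE_C are only represented by family-type targets and so disappear, which is the situation described for KCNJ3 and KCNJ5.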

In [11]:
pref_names_to_drop = ['Kir3.1/Kir3.4', 'Kir3.1/Kir3.2', 'Platelet-derived growth factor receptor', 'AMPK alpha1/alpha2', 'Rho-associated protein kinase']

targets.loc[targets.eval("pref_name in @pref_names_to_drop"), 'exclude'] = 1

In [12]:
targets.shape

(377, 11)

<a name="multi"></a>
## Multi-target symbols

See also [here](extras/multi_target_symbols.ipynb) for an alternative approach.

In [13]:
cols = targets.columns.values.tolist()

def f(symbol, group):
            
    html = ''
        
    records = [adict(zip(cols, x)) for x in group.to_records(index=False)]
    
    if len(records) > 1:
                
        html += '<table>'

        for x in sorted(records,  key=lambda x: x.target_type, reverse=True):

            html += '<tr> <td><a target="_blank" href="https://www.ebi.ac.uk/chembl/target/inspect/{}">{}</a></td> <td>{}</td> <td>{}</td> <td>{}</td> </tr>'.format(* [x.chembl_id]*2 + [x.target_type, x.pref_name, int(x.n_active)])

        html += '</table>'
            
    return {'symbol': symbol, 'targets': html, 'count': len(records)}

def multi_target_symbols(df):

    return pd.DataFrame([f(x, y) for x, y in df.groupby('symbol')], columns=['symbol', 'targets', 'count']).query("count > 1").reset_index(drop=True)

In [15]:
multi_target_symbols(targets.query("species == 'Human' and chembl_id not in @duplicated and exclude == 0").sort_values('symbol'))

Unnamed: 0,symbol,count
0,AKT1,2
1,CALCR,4
2,CALCRL,4

0,1,2,3
CHEMBL4282,SINGLE PROTEIN,Serine/threonine-protein kinase AKT,1339
CHEMBL3038463,PROTEIN COMPLEX,AKT/p21CIP1,0

0,1,2,3
CHEMBL1832,SINGLE PROTEIN,Calcitonin receptor,14
CHEMBL2111189,PROTEIN COMPLEX,"Amylin receptor AMY1, CALCR/RAMP1",1
CHEMBL2364173,PROTEIN COMPLEX,Amylin receptor AMY2; CALCR/RAMP2,0
CHEMBL2111190,PROTEIN COMPLEX,Amylin receptor AMY3; CALCR/RAMP3,1

0,1,2,3
CHEMBL3798,SINGLE PROTEIN,Calcitonin gene-related peptide type 1 receptor,634
CHEMBL2109232,PROTEIN COMPLEX,Adrenomedullin receptor AM1; CALCRL/RAMP2,1
CHEMBL2111191,PROTEIN COMPLEX,"Adrenomedullin receptor, AM2; CALCRL/RAMP3",1
CHEMBL2107838,PROTEIN COMPLEX,"Calcitonin-gene-related peptide receptor, CALCRL/RAMP1",70

0,1,2,3
CHEMBL301,SINGLE PROTEIN,Cyclin-dependent kinase 2,1526
CHEMBL3038469,PROTEIN COMPLEX,CDK2/Cyclin A,89
CHEMBL3038470,PROTEIN COMPLEX,CDK2/Cyclin A1,2
CHEMBL2094128,PROTEIN COMPLEX,Cyclin-dependent kinase 2/cyclin A,896
CHEMBL2094126,PROTEIN COMPLEX,Cyclin-dependent kinase 2/cyclin E,595
CHEMBL1907605,PROTEIN COMPLEX,Cyclin-dependent kinase 2/cyclin E1,307

0,1,2,3
CHEMBL331,SINGLE PROTEIN,Cyclin-dependent kinase 4,381
CHEMBL3038472,PROTEIN COMPLEX,CDK4/Cyclin D3,0
CHEMBL2095942,PROTEIN COMPLEX,Cyclin-dependent kinase 4/cyclin D,141
CHEMBL1907601,PROTEIN COMPLEX,Cyclin-dependent kinase 4/cyclin D1,544
CHEMBL3301385,PROTEIN COMPLEX,Cyclin-dependent kinase 4/cyclin D2,0

0,1,2,3
CHEMBL4808,SINGLE PROTEIN,Acetylcholine receptor protein alpha chain,12
CHEMBL1907588,PROTEIN COMPLEX,Acetylcholine receptor; alpha1/beta1/delta/gamma,48

0,1,2,3
CHEMBL1882,SINGLE PROTEIN,Neuronal acetylcholine receptor protein alpha-4 subunit,138
CHEMBL1907589,PROTEIN COMPLEX,Neuronal acetylcholine receptor; alpha4/beta2,741
CHEMBL1907591,PROTEIN COMPLEX,Neuronal acetylcholine receptor; alpha4/beta4,29
CHEMBL3038461,PROTEIN COMPLEX,Nicotinic acetylcholine receptor alpha4/beta2/alpha5,10

0,1,2,3
CHEMBL217,SINGLE PROTEIN,Dopamine D2 receptor,4266
CHEMBL3038478,PROTEIN COMPLEX,Dopamine receptor D2L/neurotensin receptor NTS1,0

0,1,2,3
CHEMBL2015,SINGLE PROTEIN,Glutamate (NMDA) receptor subunit zeta 1,77
CHEMBL1907604,PROTEIN COMPLEX,Glutamate NMDA receptor; GRIN1/GRIN2A,68
CHEMBL1907603,PROTEIN COMPLEX,Glutamate NMDA receptor; GRIN1/GRIN2B,171
CHEMBL3038504,PROTEIN COMPLEX,Ionotropic glutamate receptor NMDA 1/2C,0
CHEMBL3038505,PROTEIN COMPLEX,Ionotropic glutamate receptor NMDA 1/2D,0

0,1,2,3
CHEMBL2971,SINGLE PROTEIN,Tyrosine-protein kinase JAK2,1346
CHEMBL3301390,PROTEIN COMPLEX,JAK1/JAK2/TYK2,5
CHEMBL3038492,PROTEIN COMPLEX,JAK2/JAK1,8
CHEMBL3301392,PROTEIN COMPLEX,JAK2/TYK2,3

0,1,2,3
CHEMBL1886,SINGLE PROTEIN,"Potassium channel, inwardly rectifying, subfamily J, member 11",45
CHEMBL2096972,PROTEIN COMPLEX,"Sulfonylurea receptor 1, Kir6.2",26
CHEMBL2095198,PROTEIN COMPLEX,"Sulfonylurea receptor 2, Kir6.2",15

0,1,2,3
CHEMBL1866,SINGLE PROTEIN,Voltage-gated potassium channel subunit Kv7.1,30
CHEMBL2221347,PROTEIN COMPLEX,"Voltage-gated potassium channel, IKs; KCNQ1(Kv7.1)/KCNE1(MinK)",25

0,1,2,3
CHEMBL2842,SINGLE PROTEIN,FK506 binding protein 12,1372
CHEMBL2221341,PROTEIN COMPLEX,Mammalian target of Rapamycin (mTORC1),0

0,1,2,3
CHEMBL233,SINGLE PROTEIN,Mu opioid receptor,2984
CHEMBL3301384,PROTEIN COMPLEX,CCR5/mu opioid receptor complex,0

0,1,2,3
CHEMBL4005,SINGLE PROTEIN,PI3-kinase p110-alpha subunit,2077
CHEMBL2111367,PROTEIN COMPLEX,PI3-kinase p110-alpha/p85-alpha,124

0,1,2,3
CHEMBL235,SINGLE PROTEIN,Peroxisome proliferator-activated receptor gamma,2386
CHEMBL2111394,PROTEIN COMPLEX,RXR alpha/PPAR gamma,30

0,1,2,3
CHEMBL4045,SINGLE PROTEIN,"AMP-activated protein kinase, alpha-1 subunit",295
CHEMBL2111345,PROTEIN COMPLEX,AMP-activated protein kinase (AMPK) alpha-1/beta-1/gamma-1,4
CHEMBL3038451,PROTEIN COMPLEX,AMPK alpha1/beta1/gamma2,0
CHEMBL3038452,PROTEIN COMPLEX,AMPK alpha1/beta1/gamma3,0
CHEMBL3038453,PROTEIN COMPLEX,AMPK alpha1/beta2/gamma1,0

0,1,2,3
CHEMBL2116,SINGLE PROTEIN,"AMP-activated protein kinase, alpha-2 subunit",17
CHEMBL3038455,PROTEIN COMPLEX,AMPK alpha2/beta1/gamma1,14
CHEMBL3038456,PROTEIN COMPLEX,AMPK alpha2/beta2/gamma1,0
CHEMBL3038457,PROTEIN COMPLEX,AMPK alpha2/beta2/gamma3,0


Most of these are straightforward cases where there is the gene product associated with the symbol and one or more complexes also involving other subunits (_e.g._ AKT or the CDKs).
While these subunits may well alter the SAR of the core entity, treating them together seems like an acceptable compromise for now.

In other cases, this assumption may be more questionable, in that the complex might have radically different SAR. Examples of this could be...

* CALCR:   Calcitonin _vs._ the Amylin receptors


* CALCRL:  Calcitonin gene-related peptide type 1 receptor _vs._ Adrenomedullin receptors


* KCNJ11:  Kir6.2 _vs._ the Sulfonylurea receptors (_NB_ the ChEMBL targets do not seem to align well with those described [here](https://en.wikipedia.org/wiki/Sulfonylurea_receptor))


* OPRM1:   Mu opioid receptor _vs._ CCR5/mu opioid receptor complex


* PPARG:   PPAR gamma _vs._ RXR alpha/PPAR gamma

However, in all cases except KCNJ11, the amount of data for the single-protein form far outweighs that for the other complexes, and no further action will be taken at present.
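
The comparison can be made explicit using the active counts shown in the per-symbol tables above. This is only a sketch: it uses the number of actives as a proxy for 'amount of data', and it assumes that the tables for KCNJ11, OPRM1 and PPARG are the ones whose target names match the list above.

```python
import pandas as pd

# (symbol, target_type, n_active), copied from the per-symbol tables above.
rows = [
    ('CALCR',  'SINGLE PROTEIN',  14),
    ('CALCR',  'PROTEIN COMPLEX', 1),
    ('CALCR',  'PROTEIN COMPLEX', 0),
    ('CALCR',  'PROTEIN COMPLEX', 1),
    ('CALCRL', 'SINGLE PROTEIN',  634),
    ('CALCRL', 'PROTEIN COMPLEX', 1),
    ('CALCRL', 'PROTEIN COMPLEX', 1),
    ('CALCRL', 'PROTEIN COMPLEX', 70),
    ('KCNJ11', 'SINGLE PROTEIN',  45),
    ('KCNJ11', 'PROTEIN COMPLEX', 26),
    ('KCNJ11', 'PROTEIN COMPLEX', 15),
    ('OPRM1',  'SINGLE PROTEIN',  2984),
    ('OPRM1',  'PROTEIN COMPLEX', 0),
    ('PPARG',  'SINGLE PROTEIN',  2386),
    ('PPARG',  'PROTEIN COMPLEX', 30),
]
counts = pd.DataFrame(rows, columns=['symbol', 'target_type', 'n_active'])

# Actives attributed to the single-protein target, and to all targets, per gene.
single = counts[counts['target_type'] == 'SINGLE PROTEIN'].groupby('symbol')['n_active'].sum()
total = counts.groupby('symbol')['n_active'].sum()

# Fraction of each gene's actives that come from the single-protein form.
share = single / total

print(share)
```

On these figures KCNJ11 has the lowest single-protein share (45 of 86 actives), while the other four genes sit well above it.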

In the case of KCNJ11, looking at some assay descriptions suggests the assignments to different targets might be a bit arbitrary, and thus these will also be kept together for now.

By contrast, the target 'Dopamine receptor D2L/neurotensin receptor NTS1' for DRD2 is an error: this is in fact a selectivity ratio, not a complex, and should be excluded...

In [16]:
targets.loc[targets.eval("pref_name == 'Dopamine receptor D2L/neurotensin receptor NTS1'"), 'exclude'] = 1

The JAK2 'protein complexes' appear to be another case of protein families being mislabelled.
Again, as there is a single-protein target (CHEMBL2971) with a significant amount of data associated with it, the (mislabelled) complex targets are superfluous and can be removed.

In [17]:
HTML(targets[targets.symbol.str.contains('JAK|TYK')].to_html())

Unnamed: 0,symbol,approved_name,targets,n_target,chembl_id,target_type,pref_name,species,exclude,n_active,n_total
212,JAK2,Janus kinase 2,JAK2;5,1,CHEMBL3301390,PROTEIN COMPLEX,JAK1/JAK2/TYK2,Human,0,5,5
213,JAK2,Janus kinase 2,JAK2;5,2,CHEMBL3038492,PROTEIN COMPLEX,JAK2/JAK1,Human,0,8,14
214,JAK2,Janus kinase 2,JAK2;5,3,CHEMBL3301392,PROTEIN COMPLEX,JAK2/TYK2,Human,0,3,5
215,JAK2,Janus kinase 2,JAK2;5,4,CHEMBL2971,SINGLE PROTEIN,Tyrosine-protein kinase JAK2,Human,0,1346,2987
216,JAK2,Janus kinase 2,JAK2;5,1,CHEMBL1075225,SINGLE PROTEIN,Tyrosine-protein kinase JAK2,Rat,0,0,1


In [18]:
pref_names_to_drop = ['JAK1/JAK2/TYK2', 'JAK2/JAK1', 'JAK2/TYK2']

targets.loc[targets.eval("pref_name in @pref_names_to_drop"), 'exclude'] = 1

### Excluded

In [19]:
excluded = targets.query('exclude == 1')

excluded.shape

(19, 11)

In [20]:
HTML(excluded.to_html())

Unnamed: 0,symbol,approved_name,targets,n_target,chembl_id,target_type,pref_name,species,exclude,n_active,n_total
31,ADRA1A,adrenoceptor alpha 1A,α1A-adrenergic receptor;1|Adrenergic α1A;2.1|Adrenergic α1a receptor (Al1a);2.2|alpha1A;3,2,CHEMBL223,SINGLE PROTEIN,Alpha-1d adrenergic receptor,Human,1,869,2163
33,ADRA1A,adrenoceptor alpha 1A,α1A-adrenergic receptor;1|Adrenergic α1A;2.1|Adrenergic α1a receptor (Al1a);2.2|alpha1A;3,2,CHEMBL326,SINGLE PROTEIN,Alpha-1d adrenergic receptor,Rat,1,519,1702
119,CHRNA1,"cholinergic receptor, nicotinic, alpha 1 (muscle)",Acetylcholine receptor subunit α1 or α4;1|Nicotinic acetylcholine;2.1|Nicotinic receptor (central);2.2,2,CHEMBL3038458,PROTEIN COMPLEX,Nicotinic acetylcholine receptor alpha2/beta2,Human,1,3,12
120,CHRNA1,"cholinergic receptor, nicotinic, alpha 1 (muscle)",Acetylcholine receptor subunit α1 or α4;1|Nicotinic acetylcholine;2.1|Nicotinic receptor (central);2.2,3,CHEMBL3038459,PROTEIN COMPLEX,Nicotinic acetylcholine receptor alpha2/beta4,Human,1,5,12
147,DRD2,dopamine receptor D2,Dopamine receptor D2;1,1,CHEMBL3038478,PROTEIN COMPLEX,Dopamine receptor D2L/neurotensin receptor NTS1,Human,1,0,1
212,JAK2,Janus kinase 2,JAK2;5,1,CHEMBL3301390,PROTEIN COMPLEX,JAK1/JAK2/TYK2,Human,1,5,5
213,JAK2,Janus kinase 2,JAK2;5,2,CHEMBL3038492,PROTEIN COMPLEX,JAK2/JAK1,Human,1,8,14
214,JAK2,Janus kinase 2,JAK2;5,3,CHEMBL3301392,PROTEIN COMPLEX,JAK2/TYK2,Human,1,3,5
225,KCNE1,"potassium channel, voltage gated subfamily E regulatory beta subunit 1",Potassium voltage-gated channel KQT-like member 1 and minimal potassium channel MinK;1,1,CHEMBL2221347,PROTEIN COMPLEX,"Voltage-gated potassium channel, IKs; KCNQ1(Kv7.1)/KCNE1(MinK)",Human,1,25,67
226,KCNE1,"potassium channel, voltage gated subfamily E regulatory beta subunit 1",Potassium voltage-gated channel KQT-like member 1 and minimal potassium channel MinK;1,2,CHEMBL4872,SINGLE PROTEIN,Voltage-gated potassium channel beta subunit Mink,Human,1,0,4


_N.B._ symbol/chembl_id pairs are now unique...

In [21]:
targets[targets[['symbol', 'chembl_id']].duplicated()]

Unnamed: 0,symbol,approved_name,targets,n_target,chembl_id,target_type,pref_name,species,exclude,n_active,n_total


### Save/Restore

File now includes exclusion flag.

In [22]:
backup_file('chembl_targets.pkl')

targets.to_pickle('chembl_targets.pkl')

In [33]:
# Update table in RDBMS...

targets[['symbol', 'chembl_id', 'exclude']].to_sql('tt_temp', engine, if_exists='replace', index=False, dtype={'symbol': VARCHAR2(10), 'chembl_id': VARCHAR2(20)})

# Update main targets table with exclusion info...

engine.execute("""
update
  tt_chembl_targets a
set
  a.exclude = (
    select
      b.exclude
    from
      tt_temp b
    where
      a.symbol = b.symbol and a.chembl_id = b.chembl_id
  )
""")

# Clean up...

engine.execute("drop table tt_temp")

# Check update worked...

pd.read_sql("select * from tt_chembl_targets where exclude = 1", engine).shape

(19, 9)
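
A correlated update of this form sets `exclude` to NULL for any row of the main table that has no matching row in the temporary table. That should not matter here if every symbol/chembl_id pair is present in both tables, but a guard makes the behaviour explicit. The sketch below illustrates the guarded form using an in-memory SQLite database, which is an assumption for illustration only (the notebook's own database is reached via `database.txt`); the table names are reused from above and the rows are invented.

```python
import sqlite3

con = sqlite3.connect(':memory:')

# Toy versions of the two tables (illustrative rows only).
con.executescript("""
create table tt_chembl_targets (symbol text, chembl_id text, exclude integer);
create table tt_temp (symbol text, chembl_id text, exclude integer);
insert into tt_chembl_targets values ('GENE_A', 'T1', 0), ('GENE_B', 'T2', 0);
insert into tt_temp values ('GENE_A', 'T1', 1);
""")

# Correlated update, restricted to rows that have a match in tt_temp,
# so that unmatched rows keep their existing value instead of becoming NULL.
con.execute("""
update tt_chembl_targets
set exclude = (
    select b.exclude from tt_temp b
    where tt_chembl_targets.symbol = b.symbol
      and tt_chembl_targets.chembl_id = b.chembl_id
)
where exists (
    select 1 from tt_temp b
    where tt_chembl_targets.symbol = b.symbol
      and tt_chembl_targets.chembl_id = b.chembl_id
)
""")

result = con.execute(
    "select symbol, exclude from tt_chembl_targets order by symbol"
).fetchall()

print(result)
```

GENE_B has no row in `tt_temp` and keeps its original flag; without the `where exists` clause it would have been set to NULL.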