Testing Guney's `toolbox` package for network-based proximity between drug targets and disease genes.

Chosen targets:
* Hydroxychloroquine targets (TLR7 and TLR9)
* two of the AD disease genes

AD disease genes:
* Guney AD genes: from Guney et al
* Knowledge based AD genes: from the DISEASES database
* High confidence AD genes (seed): knowledge based + TWAS + incipient proteomic signature

Both choices of targets are rather arbitrary. We expect the two AD disease genes, which by construction belong to the AD gene set they are tested against, to be more proximal to that set than the Hydroxychloroquine targets (which are not AD genes by any definition used in this notebook). The calculations below support this quantitatively.
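
To make the expectation concrete, here is a minimal sketch of a "closest" distance on a hypothetical toy network. It assumes the proximity measure is the mean, over the source nodes, of the shortest path length to the nearest node of the target set; the function names and the toy graph are illustrative and not part of `toolbox`. Source nodes that belong to the target set have distance 0 by construction.

```python
from collections import deque

def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def closest_distance(adjacency, nodes_from, nodes_to):
    """Mean, over nodes_from, of the distance to the nearest node in nodes_to."""
    total = 0
    for source in nodes_from:
        dist = shortest_path_lengths(adjacency, source)
        total += min(dist[target] for target in nodes_to if target in dist)
    return total / len(nodes_from)

# Hypothetical toy network: the path A - B - C - D - E
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
disease = ["A", "B"]

d_self = closest_distance(toy, disease[:1], disease)   # source is itself a disease gene
d_other = closest_distance(toy, ["D", "E"], disease)   # disjoint "drug targets"
print(d_self, d_other)
```

The raw distance alone does not indicate significance; it has to be compared with a reference distribution, which is what the z-score reported further down does.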

In [30]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload
from toolbox import wrappers
import pandas as pd


## Preparations

### GeneID -- Symbol mapping

`id_mapping_file` comes from [this file](ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz) at NCBI; see [parse_ncbi.py](https://github.com/attilagk/guney_code/blob/master/parse_ncbi.py) for details.

In [52]:
id_mapping_file = '../../resources/PPI/geneid_to_symbol.txt'
id_symbol = pd.read_csv(id_mapping_file, sep='\t', index_col='GeneID')
id_symbol = id_symbol.set_index(id_symbol.index.astype('str'))
id_symbol

GeneID,Symbol
1,A1BG
2,A2M
3,A2MP1
9,NAT1
10,NAT2
...,...
8923215,trnD
8923216,trnP
8923217,trnA
8923218,COX1


### PPI networks

I read two PPI networks:
1. `network_guney` from Guney et al 2016
1. `network_cheng` from Cheng et al 2019

Below is the number of binary interactions (one per line) in each network file:

In [3]:
%%bash
wc -l ../../resources/proximity/data/network/network.sif ../../resources/PPI/Cheng2019/network.sif

 141296 ../../resources/proximity/data/network/network.sif
 217160 ../../resources/PPI/Cheng2019/network.sif
 358456 total


In [4]:
network_guney = wrappers.get_network('../../resources/proximity/data/network/network.sif', only_lcc = True)
network_cheng = wrappers.get_network('../../resources/PPI/Cheng2019/network.sif', only_lcc = True)
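
The `only_lcc = True` argument presumably restricts each network to its largest connected component, so that a shortest path exists between any two remaining nodes. The following is a minimal standard-library sketch of that idea on a hypothetical edge list; it is not the implementation used by `wrappers.get_network`.

```python
from collections import deque

def largest_connected_component(edges):
    """Return the node set of the largest connected component of an undirected edge list."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Collect every node reachable from `start` by breadth-first search.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Hypothetical edge list with two components: {1, 2, 3} and {4, 5}
toy_edges = [("1", "2"), ("2", "3"), ("4", "5")]
lcc = largest_connected_component(toy_edges)
print(sorted(lcc))
```

Genes outside the largest component are dropped from the network, which is one reason why some disease genes may later be reported as missing from the network's nodes.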

I will use Hydroxychloroquine's targets as the drug target set. See [this page](https://go.drugbank.com/drugs/DB01611) on DrugBank.

In [5]:
%%bash
echo TLR7 > ../../results/2021-08-04-guney-tools/Hydroxychloroquine-targets
echo TLR9 >> ../../results/2021-08-04-guney-tools/Hydroxychloroquine-targets

In [6]:
Hydroxychloroquine_targets = wrappers.convert_to_geneid(file_name='../../results/2021-08-04-guney-tools/Hydroxychloroquine-targets', id_type='symbol', id_mapping_file=id_mapping_file)
Hydroxychloroquine_targets

set()


{'51284', '54106'}

### AD gene sets
#### Guney AD genes

In [7]:
%%bash
grep 'alzheimer disease' ../../resources/proximity/data/disease/disease_genes.tsv | \
tr '\t' '\n' | sed -n '/^[0-9]\+/ p' > ../../results/2021-08-04-guney-tools/AD-genes-guney

In [8]:
with open('../../results/2021-08-04-guney-tools/AD-genes-guney') as f:
    AD_genes_guney = f.readlines()
AD_genes_guney = [x.strip('\n') for x in AD_genes_guney]

Making sure that the gene set is a subset of the network's nodes

In [9]:
def remove_genes_notin_network(genes, network):
    """Keep only the genes that are nodes of `network`; also return the dropped genes."""
    newgenes = [g for g in genes if g in network.nodes]
    restgenes = set(genes).difference(newgenes)
    print(len(genes) - len(newgenes), 'genes removed from', len(genes))
    return newgenes, restgenes

AD_genes_guney, AD_genes_guney_removed = remove_genes_notin_network(AD_genes_guney, network_cheng)

5 genes removed from 36


With the older, smaller `network_guney`, two additional genes are removed, so I won't use `network_guney`.

In [10]:
AD_genes_guney1, AD_genes_guney_removed1 = remove_genes_notin_network(AD_genes_guney, network_guney)

2 genes removed from 31


#### Knowledge based AD genes

This gene set is similar in size to `AD_genes_guney` but comes from a different source of evidence, arguably a more reliable one, since knowledge bases contain manually curated genes supported by experimental evidence.

In [14]:
AD_genes_knowledge = wrappers.convert_to_geneid(file_name='../../results/2021-07-01-high-conf-ADgenes/AD-genes-knowledge', id_type='symbol', id_mapping_file=id_mapping_file)

{'MT-ND1', 'MT-ND2'}


In [15]:
AD_genes_knowledge, AD_genes_knowledge_removed = remove_genes_notin_network(AD_genes_knowledge, network_cheng)

0 genes removed from 24


#### High confidence AD genes (seed)

In [20]:
AD_genes_seed = wrappers.convert_to_geneid(file_name='../../results/2021-07-01-high-conf-ADgenes/AD-genes-seed', id_type='symbol', id_mapping_file=id_mapping_file)

{'X84075', 'MT-ND1', 'ENSG00000270081.1', 'ENSG00000260911', 'MT-ND2'}


In [21]:
AD_genes_seed, AD_genes_seed_removed = remove_genes_notin_network(AD_genes_seed, network_cheng)

11 genes removed from 103
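
The sets printed by `wrappers.convert_to_geneid` above appear to be the symbols that could not be mapped to a GeneID; this reading is an assumption based on the output, not on the function's documentation. A minimal sketch of such a conversion, with a hypothetical helper name and a three-entry excerpt of the GeneID table, could look like this:

```python
def symbols_to_geneids(symbols, symbol_to_id):
    """Map gene symbols to GeneIDs; return (mapped GeneIDs, unmapped symbols)."""
    mapped = {symbol_to_id[s] for s in symbols if s in symbol_to_id}
    unmapped = {s for s in symbols if s not in symbol_to_id}
    return mapped, unmapped

# Small excerpt of the Symbol -> GeneID mapping (GeneIDs kept as strings, as in the notebook)
symbol_to_id = {"APP": "351", "APOE": "348", "PSEN1": "5663"}

mapped, unmapped = symbols_to_geneids(["APP", "APOE", "MT-ND1"], symbol_to_id)
print(sorted(mapped), sorted(unmapped))
```

Returning the unmapped symbols explicitly makes it easier to see which entries of a gene set need their symbols fixed before the proximity calculation.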


## Results

### Overlap between AD gene sets

The intersection contains seven genes, including APP, APOE, PSEN1 and PSEN2.

In [54]:
id_symbol.loc[list(set(AD_genes_guney).intersection(set(AD_genes_knowledge))), ]

GeneID,Symbol
10347,ABCA7
5664,PSEN2
55103,RALGPS2
5663,PSEN1
1191,CLU
351,APP
348,APOE


### Proximity

Now the proximity calculation between `Hydroxychloroquine_targets` and `AD_genes_guney`

In [11]:
res_guney = wrappers.calculate_proximity(network=network_cheng, nodes_from=Hydroxychloroquine_targets, nodes_to=AD_genes_guney)

In [12]:
res_guney

(2.0, 0.14785208756232124, (1.962, 0.25701361831622854))

The distance is $d = 2.0$ and $z \approx 0.148$, so we can conclude that `Hydroxychloroquine_targets` are not significantly proximal to `AD_genes_guney`. Note that $z < -2$ would indicate statistically significant proximity.
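
As a sanity check, the z-score can be recomputed from the other elements of the result. This sketch assumes the returned tuple has the layout `(d, z, (mean, sd))`, where `mean` and `sd` describe the distances obtained for randomized node sets; the layout is inferred from the printed output, and `z_score` is an illustrative helper.

```python
def z_score(d, mean, sd):
    """Standardize the observed distance against the reference distribution."""
    return (d - mean) / sd

# Result tuple copied from the output of the previous cell
res_guney = (2.0, 0.14785208756232124, (1.962, 0.25701361831622854))
d, z_reported, (mean, sd) = res_guney

z = z_score(d, mean, sd)
is_proximal = z < -2   # significance rule stated in the text
print(z, z_reported, is_proximal)
```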

In [13]:
res_guney_prox = wrappers.calculate_proximity(network=network_cheng, nodes_from=AD_genes_guney[:2], nodes_to=AD_genes_guney)

In [27]:
res_guney_prox

(0.0, -5.0814613401165225, (2.0165, 0.39683466330450523))



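By contrast, the first two genes of `AD_genes_guney` are, as expected, significantly proximal to that gene set ($d = 0$, $z \approx -5.08$). Thus the small `AD_genes_guney` gene set (31 genes) does not support Hydroxychloroquine's repurposing for AD. Below I study other AD gene sets that might be better than `AD_genes_guney`.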

In [16]:
res_knowledge = wrappers.calculate_proximity(network=network_cheng, nodes_from=Hydroxychloroquine_targets, nodes_to=AD_genes_knowledge)

In [17]:
res_knowledge

(2.5, 2.1595243805696294, (1.9645, 0.2479712684969773))

### Changing targets

In [18]:
res_knowledge_prox = wrappers.calculate_proximity(network=network_cheng, nodes_from=AD_genes_knowledge[:2], nodes_to=AD_genes_knowledge)

In [19]:
res_knowledge_prox

(0.0, -4.425650685559612, (1.6095, 0.363675335979772))

In [22]:
res_seed = wrappers.calculate_proximity(network=network_cheng, nodes_from=Hydroxychloroquine_targets, nodes_to=AD_genes_seed)

In [23]:
res_seed

(2.0, 0.6532017683875009, (1.8015, 0.30388772597786834))

### Changing targets

In [24]:
res_seed_prox = wrappers.calculate_proximity(network=network_cheng, nodes_from=AD_genes_seed[:2], nodes_to=AD_genes_seed)

In [25]:
res_seed_prox

(0.0, -5.403031491966403, (1.7585, 0.32546543595288274))
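
To compare the six calculations side by side, the distances and z-scores printed above can be collected in one structure and checked against the $z < -2$ rule. The dictionary layout and the labels are illustrative; the numbers are copied from the outputs above.

```python
# (source nodes, AD gene set) -> (d, z), copied from the cell outputs above
results = {
    ("Hydroxychloroquine targets", "AD_genes_guney"): (2.0, 0.14785208756232124),
    ("first two AD genes", "AD_genes_guney"): (0.0, -5.0814613401165225),
    ("Hydroxychloroquine targets", "AD_genes_knowledge"): (2.5, 2.1595243805696294),
    ("first two AD genes", "AD_genes_knowledge"): (0.0, -4.425650685559612),
    ("Hydroxychloroquine targets", "AD_genes_seed"): (2.0, 0.6532017683875009),
    ("first two AD genes", "AD_genes_seed"): (0.0, -5.403031491966403),
}

Z_THRESHOLD = -2  # z below this value is taken as significant proximity

for (source, gene_set), (d, z) in results.items():
    verdict = "proximal" if z < Z_THRESHOLD else "not proximal"
    print(f"{source:28s} {gene_set:20s} d={d:.1f} z={z:6.2f} {verdict}")

significant = [key for key, (d, z) in results.items() if z < Z_THRESHOLD]
```

With all three AD gene sets, only the AD genes themselves pass the threshold, while the Hydroxychloroquine targets do not.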

## TODO

* Fix gene symbols in the high confidence AD gene set
* Download and parse DrugBank. See [drugbank-downloader](https://pypi.org/project/drugbank-downloader/)
* Deploy `wrappers.calculate_proximity` to cloud

