# Validation of In-House Network
This notebook maps the drug-disease pairs (DrugBank-MeSH) extracted from ClinicalTrials.gov (16-04-2020)
to UMLS identifiers (as used by DisGeNet) in order to validate the In-House Network.

In [1]:
import getpass
import sys
import time

import pandas as pd

In [2]:
getpass.getuser()

'danieldomingo'

In [3]:
sys.version

'3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) \n[Clang 6.0 (clang-600.0.57)]'

In [4]:
time.asctime()

'Wed May  6 14:24:50 2020'

Load clinical trials information from ClinicalTrials.gov

In [5]:
clinical_trials = pd.read_csv(
    "../data/DrugBank-MeSH-slim-counts.tsv",
    sep="\t",
)
print(clinical_trials.shape)

(59797, 3)


In [6]:
clinical_trials.head()

Unnamed: 0,drugbank_id,condition,n_trials
0,DB00001,D008175,1
1,DB00001,D013921,3
2,DB00001,D013927,1
3,DB00001,D016638,1
4,DB00001,D055752,1


In [7]:
len(clinical_trials.condition.unique())

2871

Mappings between UMLS (DisGeNet ids) and MeSH. Note that these mappings are subject to change with new releases of DisGeNET.

In [8]:
mappings_disgenet = pd.read_csv(
    "http://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz",
    sep="\t",
)

In [9]:
mappings_disgenet = mappings_disgenet[mappings_disgenet.vocabulary == "MSH"]

In [10]:
umls_to_mesh = {
    row['diseaseId']: row['code']
    for _, row in mappings_disgenet.iterrows()
}
print(f"# mappings between UMLS and MeSH: {len(umls_to_mesh)}")

# mappings between UMLS and MeSH: 10057
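Because the dictionary above keeps a single MeSH code per UMLS identifier, any CUI listed with several MeSH codes silently retains only the last one. The sketch below uses a toy table (the identifiers are made up, and whether the real DisGeNET file contains such cases is not checked here) to show an equivalent construction without `iterrows` and a quick way to list CUIs that appear on more than one row.

```python
import pandas as pd

# Toy stand-in for the MSH rows of the mappings file (illustrative values only)
toy_mappings = pd.DataFrame({
    "diseaseId": ["C0000001", "C0000001", "C0000002"],
    "code": ["D000001", "D000002", "D000003"],
})

# Same last-value-wins behaviour as the row-wise dict comprehension
toy_umls_to_mesh = dict(zip(toy_mappings["diseaseId"], toy_mappings["code"]))

# CUIs that appear on more than one row, i.e. candidates for one-to-many mappings
multi_mapped = toy_mappings.loc[
    toy_mappings["diseaseId"].duplicated(keep=False), "diseaseId"
].unique()

print(toy_umls_to_mesh)
print(list(multi_mapped))
```

If `multi_mapped` turns out to be non-empty on the real data, the count printed above underestimates the number of UMLS-MeSH links.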


Read In-House Network

In [11]:
inhouse_network_df = pd.read_csv(
    "../../networks/data/custom_network.tsv",
    sep='\t'
)

Read DisGeNet Nodes in In-House Network

In [12]:
target_disgenet_nodes = {
    target.replace("UMLS:", "") # Replace the prefix
    for target in inhouse_network_df.target.unique()
    if target.startswith("UMLS:")
}
print(f"DisGeNet targets: {len(target_disgenet_nodes)}")

DisGeNet targets: 4125


In [13]:
mesh_to_umls = {
    # MeSH (ClinicalTrials): UMLS (DisGeNet id)
    umls_to_mesh[umls]: umls
    for umls in target_disgenet_nodes
    if umls in umls_to_mesh
}
print(f"Diseases mapped from UMLS to MeSH: {len(mesh_to_umls)}")

Diseases mapped from UMLS to MeSH: 2645
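Inverting the mapping has a similar caveat: if two UMLS nodes in the network share a MeSH code, only one of them survives as the value for that code. A minimal sketch of how such collisions could be surfaced, using hypothetical toy inputs in place of `umls_to_mesh` and `target_disgenet_nodes`:

```python
from collections import defaultdict

# Illustrative values only; in the notebook these would be
# umls_to_mesh and target_disgenet_nodes
toy_umls_to_mesh = {
    "C0000001": "D000001",
    "C0000002": "D000001",
    "C0000003": "D000002",
}
toy_targets = {"C0000001", "C0000002", "C0000003", "C0000004"}

# Group every network CUI under its MeSH code instead of overwriting
mesh_to_umls_all = defaultdict(set)
for umls in toy_targets:
    if umls in toy_umls_to_mesh:
        mesh_to_umls_all[toy_umls_to_mesh[umls]].add(umls)

# MeSH codes claimed by more than one CUI
collisions = {
    mesh: sorted(cuis)
    for mesh, cuis in mesh_to_umls_all.items()
    if len(cuis) > 1
}
print(collisions)
```

An empty `collisions` dictionary on the real data would mean the one-to-one inversion loses nothing.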


Read DrugBank Nodes in In-House Network

In [14]:
source_drug_nodes = {
    source.replace("drugbank:", "") # Replace the prefix
    for source in inhouse_network_df.source.unique()
    if source.startswith("drugbank:")
}
print(f"DrugBank sources: {len(source_drug_nodes)}")

DrugBank sources: 1395


Get Drug-Disease pairs that are in the In-House Network

In [15]:
pairs_for_validation = pd.DataFrame([
    # filtered drugbank and mapped UMLS terms
    {'drugbank_id': row['drugbank_id'], 'umls_cui': mesh_to_umls[row['condition']], 'n_trials': row['n_trials']}
    for _, row in clinical_trials.iterrows()
    if row['condition'] in mesh_to_umls and row['drugbank_id'] in source_drug_nodes
])
print(pairs_for_validation.shape)

(10101, 3)
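The same filter can also be expressed with vectorized pandas operations, which tends to be faster than `iterrows` on tables of this size. This is a sketch on toy data (all identifiers below are made up), not a replacement for the cell above:

```python
import pandas as pd

# Toy inputs (illustrative values, not taken from the real files)
toy_trials = pd.DataFrame({
    "drugbank_id": ["DB00001", "DB00004", "DB00004"],
    "condition": ["D000001", "D000001", "D000009"],
    "n_trials": [1, 3, 2],
})
toy_mesh_to_umls = {"D000001": "C0000001"}
toy_drugs = {"DB00004"}

# Keep rows whose drug is in the network and whose MeSH code has a UMLS mapping
mask = (
    toy_trials["drugbank_id"].isin(toy_drugs)
    & toy_trials["condition"].isin(list(toy_mesh_to_umls))
)

# Translate MeSH codes to UMLS CUIs and keep the three output columns
toy_pairs = (
    toy_trials.loc[mask]
    .assign(umls_cui=lambda df: df["condition"].map(toy_mesh_to_umls))
    [["drugbank_id", "umls_cui", "n_trials"]]
    .reset_index(drop=True)
)
print(toy_pairs)
```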


In [16]:
pairs_for_validation.head()

Unnamed: 0,drugbank_id,umls_cui,n_trials
0,DB00004,C0001418,1
1,DB00004,C0006142,3
2,DB00004,C0007131,1
3,DB00004,C1306837,1
4,DB00004,C0740457,2


Export drug-disease pairs for validation of the In-House Network

In [17]:
pairs_for_validation.to_csv("validation_custom.tsv", sep="\t", index=False)

Export source nodes for validation of the In-House Network (drugs)

In [18]:
pd.Series(
    [f"drugbank:{drug}" for drug in pairs_for_validation.drugbank_id.unique()] # Add prefix
).to_csv("../data/source_nodes_custom.tsv", sep="\t", index=False)

Export target nodes for validation of the In-House Network (diseases)

In [19]:
pd.Series(
    [f"UMLS:{disease}" for disease in pairs_for_validation.umls_cui.unique()] # Add prefix
).to_csv("../data/target_nodes_custom.tsv", sep="\t", index=False)

## Analysis of the disease-gene associations
In the following cells, we analyze the number of gene associations for each disease (target node) to assess whether some genes act as funnels, that is, whether most paths to a disease pass through the same gene.

In [20]:
disgenet = pd.read_csv(
    "http://www.disgenet.org/static/disgenet_ap1/files/downloads/curated_gene_disease_associations.tsv.gz",
    sep="\t",
)

In [21]:
disgenet.head(1)

Unnamed: 0,geneId,geneSymbol,DSI,DPI,diseaseId,diseaseName,diseaseType,diseaseClass,diseaseSemanticType,score,EI,YearInitial,YearFinal,NofPmids,NofSnps,source
0,1,A1BG,0.857,0.172,C0019209,Hepatomegaly,phenotype,C06;C23,Finding,0.3,,2017.0,2017.0,1,0,CTD_human


In [22]:
disgenet = disgenet[disgenet.score > 0.6]

In [23]:
disgenet.shape

(3618, 16)

In [24]:
diseases_to_validate = set(pairs_for_validation.umls_cui.unique())  # set for fast membership tests

print(f"Number of diseases to validate in the In-House Network: {len(diseases_to_validate)}")

# Last relations for all paths (all interactions between a gene and disease)
last_relations = pd.DataFrame([
    {'umls_cui': row['diseaseId'], 'hgnc_symbol': row['geneSymbol']}
    for _, row in disgenet.iterrows()
    # Every DisGeNet row is a gene-disease association; keep those whose disease is in the validation set
    if row['diseaseId'] in diseases_to_validate
])

Number of diseases to validate in the In-House Network: 414


In [25]:
last_relations.shape

(739, 2)

In [26]:
last_relations.groupby('umls_cui').count()

umls_cui,hgnc_symbol
C0001418,1
C0001815,2
C0001973,3
C0002066,1
C0002395,4
...,...
C3489396,1
C3495427,1
C3536984,1
C3714756,34
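To follow up on the funnel question, the counts can be looked at from both sides: genes per disease and diseases per gene. The sketch below runs on a toy stand-in for `last_relations` (the gene symbols and CUIs are placeholders), and it assumes that the curated gene-disease associations are the only final edges into a disease node.

```python
import pandas as pd

# Toy stand-in for last_relations (illustrative values only)
toy_relations = pd.DataFrame({
    "umls_cui": ["C0000001", "C0000001", "C0000002", "C0000003"],
    "hgnc_symbol": ["GENE_A", "GENE_B", "GENE_A", "GENE_A"],
})

# Number of distinct genes per disease, largest first
genes_per_disease = (
    toy_relations.groupby("umls_cui")["hgnc_symbol"]
    .nunique()
    .sort_values(ascending=False)
)

# Diseases with exactly one associated gene: under the assumption above,
# every path into them has to end on that gene
single_gene_diseases = genes_per_disease[genes_per_disease == 1].index.tolist()

# Number of distinct diseases per gene, largest first
diseases_per_gene = (
    toy_relations.groupby("hgnc_symbol")["umls_cui"]
    .nunique()
    .sort_values(ascending=False)
)

print(genes_per_disease)
print(single_gene_diseases)
print(diseases_per_gene)
```

Genes at the top of `diseases_per_gene` are the first candidates to inspect as potential funnels.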
