<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Downloading-files" data-toc-modified-id="Downloading-files-1">Downloading files</a></span></li><li><span><a href="#Check-for-duplicates" data-toc-modified-id="Check-for-duplicates-2">Check for duplicates</a></span></li><li><span><a href="#Remove-rows" data-toc-modified-id="Remove-rows-3">Remove rows</a></span></li><li><span><a href="#Pre-NodeNorming" data-toc-modified-id="Pre-NodeNorming-4">Pre-NodeNorming</a></span><ul class="toc-item"><li><span><a href="#Genes" data-toc-modified-id="Genes-4.1">Genes</a></span></li><li><span><a href="#Diseases" data-toc-modified-id="Diseases-4.2">Diseases</a></span></li><li><span><a href="#Adding-NodeNorm-data,-removing-rows" data-toc-modified-id="Adding-NodeNorm-data,-removing-rows-4.3">Adding NodeNorm data, removing rows</a></span></li></ul></li><li><span><a href="#Generating-documents" data-toc-modified-id="Generating-documents-5">Generating documents</a></span><ul class="toc-item"><li><span><a href="#Rows-not-included" data-toc-modified-id="Rows-not-included-5.1">Rows not included</a></span></li><li><span><a href="#Columns-not-included" data-toc-modified-id="Columns-not-included-5.2">Columns not included</a></span></li><li><span><a href="#File:-List-of-TRAPI-edges" data-toc-modified-id="File:-List-of-TRAPI-edges-5.3">File: List of TRAPI edges</a></span></li><li><span><a href="#File:-KGX-edges" data-toc-modified-id="File:-KGX-edges-5.4">File: KGX edges</a></span></li><li><span><a href="#File:-KGX-nodes" data-toc-modified-id="File:-KGX-nodes-5.5">File: KGX nodes</a></span></li></ul></li><li><span><a href="#Notes" data-toc-modified-id="Notes-6">Notes</a></span></li></ul></div>

# Notebook for DISEASES parser development

In [1]:
## not for parser. for notebook only 

## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Downloading files

**Using entire files for now, which allows me to check for duplicates.** 

In the parser, may want to use a generator approach and ingest large chunks (ex: 1000-2000 lines) at a time. This balances a smaller memory footprint against the fact that NodeNorming in large batches is faster. 

(pandas [read_table](https://pandas.pydata.org/docs/reference/api/pandas.read_table.html) can return an iterator over rows/chunks! See the `iterator` and `chunksize` parameters.)
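
A minimal sketch of the chunked approach, assuming the parser would read the headerless TSV with `chunksize`. The rows below are made-up placeholders in the same 7-column layout (not real DISEASES data), and the chunk size of 2 is only there to keep the example small.

```python
import io
import pandas as pd

columns = ["gene_id", "gene_name", "disease_id", "disease_name",
           "z_score", "confidence_score", "url"]

# Made-up rows in the same headerless, tab-separated layout (placeholders only)
sample_tsv = "\n".join(
    f"ENSP{i:011d}\tGENE{i}\tDOID:{i}\tdisease {i}\t{i}.5\t{i}.25\thttps://example.org/{i}"
    for i in range(5)
) + "\n"

chunk_sizes = []
# With chunksize set, read_table returns an iterator of DataFrames
# instead of loading everything at once
with pd.read_table(io.StringIO(sample_tsv), names=columns, chunksize=2) as reader:
    for chunk in reader:
        # In the parser, filtering / NodeNorming / document writing would go here
        chunk_sizes.append(len(chunk))

print(chunk_sizes)
```

In the real parser the `io.StringIO` object would be replaced by the file path or URL, and `chunksize` would be something like 1000-2000.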

In [2]:
## put in parser (format)
import pathlib
import pandas as pd

## don't put in parser. Just for this notebook
from pprint import pprint

## unsure on putting into parser: more for notebook viewing/debugging...
pd.options.display.max_columns = None
pd.set_option('display.max_colwidth', None)

In [3]:
## useful function for exploring data
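def check_if_contains(df, column_name, patterns):
    """For each pattern (treated as a case-insensitive regex), print the pattern
    and the shape of the matching rows, if any rows match."""
    for i in patterns:
        temp = df[df[column_name].str.contains(pat=i, case=False)]
        if temp.size > 0:
            print(f'"{i}"')
            print(temp.shape)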

<div class="alert alert-block alert-danger">
    
This notebook was originally written using data files downloaded 2025-04-14 between 4:23-4:24 PM Pacific time (23:23-23:24 UTC+0) from https://diseases.jensenlab.org/Downloads. 

In [4]:
## using on-demand from links

textmining_path = "https://download.jensenlab.org/human_disease_textmining_filtered.tsv"
knowledge_path = "https://download.jensenlab.org/human_disease_knowledge_filtered.tsv"
experiments_path = "https://download.jensenlab.org/human_disease_experiments_filtered.tsv"

In [None]:
## put in parser (format)

## paths to raw data files
# base_file_path = pathlib.Path.home().joinpath("Desktop", "DISEASES_files", "2025_09_04")

# textmining_path = base_file_path.joinpath("human_disease_textmining_filtered.tsv")
# knowledge_path = base_file_path.joinpath("human_disease_knowledge_filtered.tsv")

# textmining_path
# knowledge_path

In [5]:
## put in parser (format)

## download files
## files have no headers: adding based on https://diseases.jensenlab.org/Downloads
textmining_header = ["gene_id", "gene_name", "disease_id", "disease_name", 
                     "z_score", "confidence_score", "url"]
knowledge_header = ["gene_id", "gene_name", "disease_id", "disease_name", 
                    "source_db", "evidence_type", "confidence_score"]


df_textmining = pd.read_table(textmining_path, sep="\t", names=textmining_header)
df_knowledge = pd.read_table(knowledge_path, sep="\t", names=knowledge_header)

<div class="alert alert-block alert-info">

No missing values

In [6]:
df_textmining.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288294 entries, 0 to 288293
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   gene_id           288294 non-null  object 
 1   gene_name         288294 non-null  object 
 2   disease_id        288294 non-null  object 
 3   disease_name      288294 non-null  object 
 4   z_score           288294 non-null  float64
 5   confidence_score  288294 non-null  float64
 6   url               288294 non-null  object 
dtypes: float64(2), object(5)
memory usage: 113.6 MB


In [7]:
df_textmining

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,z_score,confidence_score,url
0,18S_rRNA,18S_rRNA,DOID:9643,Babesiosis,7.245,3.623,https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=18S_rRNA&type2=-26&id2=DOID:9643
1,18S_rRNA,18S_rRNA,DOID:3733,Theileriasis,6.381,3.191,https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=18S_rRNA&type2=-26&id2=DOID:3733
2,18S_rRNA,18S_rRNA,DOID:12365,Malaria,6.298,3.149,https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=18S_rRNA&type2=-26&id2=DOID:12365
3,18S_rRNA,18S_rRNA,DOID:9640,Sarcocystosis,6.139,3.069,https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=18S_rRNA&type2=-26&id2=DOID:9640
4,18S_rRNA,18S_rRNA,DOID:1733,Cryptosporidiosis,6.114,3.057,https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=18S_rRNA&type2=-26&id2=DOID:1733
...,...,...,...,...,...,...,...
288289,snoU13,snoU13,DOID:0110084,Arrhythmogenic right ventricular dysplasia 13,3.773,1.886,https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=snoU13&type2=-26&id2=DOID:0110084
288290,snoU13,snoU13,DOID:0110408,Retinitis pigmentosa 11,3.385,1.692,https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=snoU13&type2=-26&id2=DOID:0110408
288291,snoU13,snoU13,DOID:1849,Cannabis dependence,3.226,1.613,https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=snoU13&type2=-26&id2=DOID:1849
288292,snoU13,snoU13,DOID:0060775,Microvillus inclusion disease,3.145,1.573,https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=snoU13&type2=-26&id2=DOID:0060775


In [8]:
df_knowledge.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7775 entries, 0 to 7774
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   gene_id           7775 non-null   object
 1   gene_name         7775 non-null   object
 2   disease_id        7775 non-null   object
 3   disease_name      7775 non-null   object
 4   source_db         7775 non-null   object
 5   evidence_type     7775 non-null   object
 6   confidence_score  7775 non-null   int64 
dtypes: int64(1), object(6)
memory usage: 2.8 MB


In [9]:
df_knowledge

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score
0,ABHD11-AS1,ABHD11-AS1,DOID:1928,Williams-Beuren syndrome,MedlinePlus,CURATED,5
1,ENSP00000001146,CYP26B1,DOID:2340,Craniosynostosis,UniProtKB-KW,CURATED,4
2,ENSP00000003084,CFTR,DOID:0111862,Congenital bilateral absence of vas deferens,MedlinePlus,CURATED,5
3,ENSP00000003084,CFTR,DOID:1485,Cystic fibrosis,MedlinePlus,CURATED,5
4,ENSP00000005226,USH1C,DOID:0050439,Usher syndrome,MedlinePlus,CURATED,5
...,...,...,...,...,...,...,...
7770,hsa-miR-145-5p,hsa-miR-145-5p,DOID:0090016,Chromosome 5q deletion syndrome,MedlinePlus,CURATED,5
7771,hsa-miR-146a-5p,hsa-miR-146a-5p,DOID:0090016,Chromosome 5q deletion syndrome,MedlinePlus,CURATED,5
7772,hsa-miR-184,hsa-miR-184,DOID:10126,Keratoconus,MedlinePlus,CURATED,5
7773,hsa-miR-590-5p,hsa-miR-590-5p,DOID:1928,Williams-Beuren syndrome,MedlinePlus,CURATED,5


## Check for duplicates

<div class="alert alert-block alert-info">

Duplicates found in knowledge file. This is a DISEASES-side problem.
    
Ideas for how to handle duplicates (including if they show up in text-mining file later): 
* ingest entire file and remove duplicates
* **doing this right now**: check a set of "already-done" edges (gene_id, disease_id, source_db for knowledge) right before creating document (earlier steps will cut down number of rows to create documents with). 
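
A rough sketch of the "already-done" set idea, assuming rows arrive as dicts in separate chunks. The helper name `iter_new_rows` and the chunk contents are hypothetical; the key fields are the ones listed above for the knowledge file.

```python
def iter_new_rows(rows, seen, key_fields):
    """Yield only rows whose key is not in `seen`; `seen` is updated in place."""
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            continue  # duplicate edge: skip, don't create a document
        seen.add(key)
        yield row


# The set lives outside the chunk loop so it remembers edges across chunks
seen_edges = set()
key_fields = ("gene_id", "disease_id", "source_db")

chunk_1 = [
    {"gene_id": "ENSP00000269703", "disease_id": "DOID:0060655", "source_db": "MedlinePlus"},
    {"gene_id": "ENSP00000269703", "disease_id": "DOID:0060655", "source_db": "MedlinePlus"},
]
chunk_2 = [
    {"gene_id": "ENSP00000269703", "disease_id": "DOID:0060655", "source_db": "MedlinePlus"},
    {"gene_id": "ENSP00000291295", "disease_id": "DOID:2843", "source_db": "UniProtKB-KW"},
]

kept = [row
        for chunk in (chunk_1, chunk_2)
        for row in iter_new_rows(chunk, seen_edges, key_fields)]
print(len(kept))
```

Using a tuple as the key avoids building a joined string, but any hashable key built from the same fields would work.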

In [10]:
df_textmining[df_textmining.duplicated(subset=["gene_id", "disease_id"], keep=False)].shape

(0, 7)

In [11]:
df_knowledge[df_knowledge.duplicated(keep=False)].shape

(16, 7)

In [12]:
## double-checking if specific column subset can work to check for duplicates
df_knowledge[df_knowledge.duplicated(subset=["gene_id", "disease_id", "source_db"], keep=False)].shape
## same count so it can

## print for manual review
df_knowledge[df_knowledge.duplicated(keep=False)]

(16, 7)

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score
1618,ENSP00000269703,CYP4F22,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
1619,ENSP00000269703,CYP4F22,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
1668,ENSP00000272895,ABCA12,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
1669,ENSP00000272895,ABCA12,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
1994,ENSP00000291295,CALM3,DOID:2843,Long QT syndrome,UniProtKB-KW,CURATED,4
1995,ENSP00000291295,CALM3,DOID:2843,Long QT syndrome,UniProtKB-KW,CURATED,4
2579,ENSP00000311687,NIPAL4,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
2580,ENSP00000311687,NIPAL4,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
3436,ENSP00000342392,MESP2,DOID:0050568,Spondylocostal dysostosis,MedlinePlus,CURATED,5
3437,ENSP00000342392,MESP2,DOID:0050568,Spondylocostal dysostosis,MedlinePlus,CURATED,5


## Remove rows

<div class="alert alert-block alert-info">

Removing:
* knowledge's UniProtKB-KW data: doesn't seem accurate and likely has licensing issues (OMIM). Analysis is in the Slack canvas
* rows that don't have ENSP IDs in gene_id column: can't NodeNorm them
* rows that don't have DOID IDs in disease_id column: can't NodeNorm them

    Not removing duplicates here! The plan is to ingest the file in chunks, so a different method is needed (checking a set of already-done edges). 

In [13]:
## calculate stats before removing

nrows_textmining_original = df_textmining.shape[0]
nrows_knowledge_original = df_knowledge.shape[0]

In [14]:
## put into parser (don't do stats, print)

## remove knowledge's UniProtKB-KW data
df_knowledge = df_knowledge[df_knowledge["source_db"] != "UniProtKB-KW"]

## stats
nrows_knowledge_filtered = df_knowledge.shape[0]
print(f"knowledge: {nrows_knowledge_original - nrows_knowledge_filtered} rows removed")

knowledge: 3624 rows removed


In [15]:
## put into parser (don't do stats, print)

## remove rows that don't have ENSP IDs
df_textmining = df_textmining[df_textmining["gene_id"].str.startswith("ENSP")]
df_knowledge = df_knowledge[df_knowledge["gene_id"].str.startswith("ENSP")]


## stats
nrows_textmining_filtered = df_textmining.shape[0]
print(f"textmining: {nrows_textmining_original - nrows_textmining_filtered} rows removed")
## don't rewrite variable yet, so you can compare previous to now
print(f"knowledge: {nrows_knowledge_filtered - df_knowledge.shape[0]} rows removed")

## update this variable
nrows_knowledge_filtered = df_knowledge.shape[0]

textmining: 16492 rows removed
knowledge: 41 rows removed


In [16]:
## put into parser (don't do stats, print)

## remove rows that don't have DOID IDs
df_textmining = df_textmining[df_textmining["disease_id"].str.startswith("DOID:")]
df_knowledge = df_knowledge[df_knowledge["disease_id"].str.startswith("DOID:")]


## stats
## don't rewrite variable yet, so you can compare previous to now
print(f"textmining: {nrows_textmining_filtered - df_textmining.shape[0]} rows removed")
print(f"knowledge: {nrows_knowledge_filtered - df_knowledge.shape[0]} rows removed")

## update this variable
nrows_textmining_filtered = df_textmining.shape[0]
nrows_knowledge_filtered = df_knowledge.shape[0]

textmining: 42 rows removed
knowledge: 71 rows removed


In [17]:
## code for reviewing dataframes, to make sure all rows were removed properly

## textmining
df_textmining.shape
df_textmining[~ df_textmining["disease_id"].str.contains("DOID")].shape

(271760, 7)

(0, 7)

In [18]:
## knowledge
df_knowledge.shape
df_knowledge[~ df_knowledge["disease_id"].str.contains("DOID")].shape
df_knowledge[df_knowledge["source_db"] == "UniProtKB-KW"].shape

(4039, 7)

(0, 7)

(0, 7)

In [19]:
## AmyCo data still in database (wasn't all under AmyCo disease IDs)
df_knowledge[df_knowledge["source_db"] == "AmyCo"].shape

(186, 7)

## Pre-NodeNorming

In [23]:
## put into parser (format)

## adding Translator/biolink prefixes to gene ID, for NodeNorming
## don't need to do DOID, already in right format

## loop for notebook only, to do both dataframes 
for df in [df_textmining, df_knowledge]:
    df["gene_id"] = "ENSEMBL:" + df["gene_id"]

In [24]:
## double-check how they look

df_knowledge

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score
2,ENSEMBL:ENSP00000003084,CFTR,DOID:0111862,Congenital bilateral absence of vas deferens,MedlinePlus,CURATED,5
3,ENSEMBL:ENSP00000003084,CFTR,DOID:1485,Cystic fibrosis,MedlinePlus,CURATED,5
4,ENSEMBL:ENSP00000005226,USH1C,DOID:0050439,Usher syndrome,MedlinePlus,CURATED,5
5,ENSEMBL:ENSP00000005226,USH1C,DOID:0050563,Nonsyndromic deafness,MedlinePlus,CURATED,5
11,ENSEMBL:ENSP00000013807,ERCC1,DOID:0050427,Xeroderma pigmentosum,MedlinePlus,CURATED,5
...,...,...,...,...,...,...,...
7728,ENSEMBL:ENSP00000500990,SNCA,DOID:4752,Multiple system atrophy,MedlinePlus,CURATED,5
7729,ENSEMBL:ENSP00000501092,TFAP2A,DOID:0050691,Branchiooculofacial syndrome,MedlinePlus,CURATED,5
7730,ENSEMBL:ENSP00000501092,TFAP2A,DOID:10629,Microphthalmia,MedlinePlus,CURATED,5
7731,ENSEMBL:ENSP00000501092,TFAP2A,DOID:12270,Coloboma,MedlinePlus,CURATED,5


Querying NodeNorm: send unique values (no duplicates!) in large batches, then generate a mapping dict to use. 
<br>
__Not querying 1-by-1 or 1 row at a time: that is much slower__ and would involve sending duplicate IDs (unless a saved dict is kept outside the loop and checked). 
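
A minimal sketch of the "unique values, large batches, one mapping dict" pattern. To keep it runnable offline, the request is replaced by a hypothetical `fake_fetch` stand-in; `chunked` and `build_mapping` are illustration-only names, not part of the parser.

```python
from itertools import islice


def chunked(iterable, n):
    """Yield lists of up to n items from iterable."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch


def build_mapping(curies, fetch_batch, batch_size=1000):
    """Query each unique CURIE once, in batches, and merge results into one dict."""
    mapping = {}
    unique_curies = sorted(set(curies))  # de-duplicate before sending anything
    for batch in chunked(unique_curies, batch_size):
        mapping.update(fetch_batch(batch))
    return mapping


calls = []


def fake_fetch(batch):
    # Stand-in for the real POST request; records how many IDs each call received
    calls.append(len(batch))
    return {curie: {"primary_id": curie} for curie in batch}


# A column with repeated IDs, as in the dataframes
column = ["DOID:1485", "DOID:1485", "DOID:2843", "DOID:10126", "DOID:2843"]
mapping = build_mapping(column, fake_fetch, batch_size=2)
print(calls, len(mapping))
```

Five rows turn into three unique IDs and two requests, which is the saving the note above is describing.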

<div class="alert alert-block alert-danger">

Set the NodeNorm URL you want to use. 

In [25]:
## put into parser (format)

import requests

## from BioThings annotator code: for interoperability between diff Python versions
try:
    from itertools import batched  # new in Python 3.12
except ImportError:
    from itertools import islice

    def batched(iterable, n):
        # batched('ABCDEFG', 3) → ABC DEF G
        if n < 1:
            raise ValueError("n must be at least one")
        iterator = iter(iterable)
        while batch := tuple(islice(iterator, n)):
            yield batch

## dev instance
# nodenorm_url = "https://nodenormalization-sri.renci.org/get_normalized_nodes"
nodenorm_url = "https://nodenorm.ci.transltr.io/get_normalized_nodes"

### Genes

In [26]:
## put into parser (set unique on curie list)

## get set of unique CURIEs to put into NodeNorm
geneIDs_textmining = set(df_textmining["gene_id"].unique())
geneIDs_knowledge = set(df_knowledge["gene_id"].unique())
geneIDs_all = geneIDs_textmining | geneIDs_knowledge

len(geneIDs_textmining)
len(geneIDs_knowledge)
len(geneIDs_all)

16858

2567

16861

In [27]:
## put into parser (partial?)

gene_nodenorm_mapping = {}

## set up variables to catch potential mapping failures
stats_gene_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
}

Allowed types are Gene and Protein (depending on whether gene/protein conflation worked for the ENSP ID).
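
The checks applied to each NodeNorm response value could be pulled into one helper that takes the allowed categories as a parameter, so the same logic can serve other ID types. This is only a sketch: `classify_entry` is a hypothetical name, and the entries below are hand-written to mimic the fields this notebook reads (`type`, `id.identifier`, `id.label`), not real NodeNorm output.

```python
def classify_entry(entry, allowed_categories):
    """Return (status, record) for one NodeNorm response value.

    status is "ok" or one of the failure keys used in the stats dicts.
    """
    if entry is None:
        return "nodenorm_returned_none", None
    category = entry["type"][0]
    if category not in allowed_categories:
        return "wrong_category", None
    label = entry["id"].get("label")
    if not label:
        return "no_label", None
    record = {
        "primary_id": entry["id"]["identifier"],
        "primary_label": label,
        "category": category,
    }
    return "ok", record


gene_categories = {"biolink:Gene", "biolink:Protein"}

# Hand-written example entries (illustrative only)
good = {"id": {"identifier": "NCBIGene:1080", "label": "CFTR"}, "type": ["biolink:Gene"]}
unlabeled = {"id": {"identifier": "ENSEMBL:ENSP00000000000"}, "type": ["biolink:Protein"]}
other = {"id": {"identifier": "X:1", "label": "x"}, "type": ["biolink:Disease"]}

results = [classify_entry(e, gene_categories)[0] for e in (good, unlabeled, other, None)]
print(results)
```

Returning a status string keeps the bookkeeping (appending to the failure lists) in the caller, which makes the helper easy to test on its own.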

In [28]:
## put into parser (don't need to iterate through batch, just use all in chunk you're on)

## larger batches are quicker
for batch in batched(geneIDs_all, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    ## stop on HTTP errors instead of trying to parse an error body
    r.raise_for_status()
    response = r.json()
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] in ["biolink:Gene", "biolink:Protein"]:
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"],
                                ## saving category: currently required for KGX nodes file output
                                "category": v["type"][0]
                               }
                        }
                        gene_nodenorm_mapping.update(temp)
                    else:
                        stats_gene_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_gene_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_gene_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_gene_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [29]:
len(gene_nodenorm_mapping)

stats_gene_mapping_failures["unexpected_error"]

len(stats_gene_mapping_failures["nodenorm_returned_none"])
len(stats_gene_mapping_failures["wrong_category"])
len(stats_gene_mapping_failures["no_label"])

16478

{}

319

0

64

In [30]:
## calculate stats: number of rows affected by each type of mapping failure
stats_gene_mapping_failures.update({
    "nrows_textmining_none": df_textmining[df_textmining["gene_id"].isin(stats_gene_mapping_failures["nodenorm_returned_none"])].shape[0],
    "nrows_knowledge_none": df_knowledge[df_knowledge["gene_id"].isin(stats_gene_mapping_failures["nodenorm_returned_none"])].shape[0],

    "n_rows_textmining_wrongcategory": df_textmining[df_textmining["gene_id"].isin(stats_gene_mapping_failures["wrong_category"].keys())].shape[0],
    "n_rows_knowledge_wrongcategory": df_knowledge[df_knowledge["gene_id"].isin(stats_gene_mapping_failures["wrong_category"].keys())].shape[0],

    
    "n_rows_textmining_nolabel": df_textmining[df_textmining["gene_id"].isin(stats_gene_mapping_failures["no_label"])].shape[0],
    "n_rows_knowledge_nolabel": df_knowledge[df_knowledge["gene_id"].isin(stats_gene_mapping_failures["no_label"])].shape[0],
})

In [31]:
stats_gene_mapping_failures["nrows_textmining_none"]
stats_gene_mapping_failures["nrows_knowledge_none"]
stats_gene_mapping_failures["n_rows_textmining_wrongcategory"]
stats_gene_mapping_failures["n_rows_knowledge_wrongcategory"]
stats_gene_mapping_failures["n_rows_textmining_nolabel"]
stats_gene_mapping_failures["n_rows_knowledge_nolabel"]

4231

55

0

0

1445

11

### Diseases

In [32]:

## get set of unique CURIEs to put into NodeNorm
diseaseIDs_textmining = set(df_textmining["disease_id"].unique())
diseaseIDs_knowledge = set(df_knowledge["disease_id"].unique())
diseaseIDs_all = diseaseIDs_textmining | diseaseIDs_knowledge

len(diseaseIDs_textmining)
len(diseaseIDs_knowledge)
len(diseaseIDs_all)

5556

1037

5671

In [33]:

disease_nodenorm_mapping = {}

## set up variables to catch mapping failures
stats_disease_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [34]:


## larger batches are quicker
for batch in batched(diseaseIDs_all, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    ## stop on HTTP errors instead of trying to parse an error body
    r.raise_for_status()
    response = r.json()
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Disease":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        disease_nodenorm_mapping.update(temp)
                    else:
                        stats_disease_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_disease_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_disease_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_disease_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [35]:
len(disease_nodenorm_mapping)

stats_disease_mapping_failures["unexpected_error"]

len(stats_disease_mapping_failures["nodenorm_returned_none"])
len(stats_disease_mapping_failures["wrong_category"])
len(stats_disease_mapping_failures["no_label"])

5641

{}

30

0

0

In [36]:
## calculate stats: number of rows affected by each type of mapping failure
stats_disease_mapping_failures.update({
    "nrows_textmining_none": df_textmining[df_textmining["disease_id"].isin(stats_disease_mapping_failures["nodenorm_returned_none"])].shape[0],
    "nrows_knowledge_none": df_knowledge[df_knowledge["disease_id"].isin(stats_disease_mapping_failures["nodenorm_returned_none"])].shape[0],

    "n_rows_textmining_wrongcategory": df_textmining[df_textmining["disease_id"].isin(stats_disease_mapping_failures["wrong_category"].keys())].shape[0],
    "n_rows_knowledge_wrongcategory": df_knowledge[df_knowledge["disease_id"].isin(stats_disease_mapping_failures["wrong_category"].keys())].shape[0],

    
    "n_rows_textmining_nolabel": df_textmining[df_textmining["disease_id"].isin(stats_disease_mapping_failures["no_label"])].shape[0],
    "n_rows_knowledge_nolabel": df_knowledge[df_knowledge["disease_id"].isin(stats_disease_mapping_failures["no_label"])].shape[0],
})

In [37]:
stats_disease_mapping_failures["nrows_textmining_none"]
stats_disease_mapping_failures["nrows_knowledge_none"]
stats_disease_mapping_failures["n_rows_textmining_wrongcategory"]
stats_disease_mapping_failures["n_rows_knowledge_wrongcategory"]
stats_disease_mapping_failures["n_rows_textmining_nolabel"]
stats_disease_mapping_failures["n_rows_knowledge_nolabel"]

628

10

0

0

0

0

### Adding NodeNorm data, removing rows

<div class="alert alert-block alert-info">

Removing rows that lack NodeNorm data because of the small number of failures during the NodeNorm process: 
* NodeNorm returned none
* NodeNorm didn't have primary label

In [38]:
## put into parser (format): 

## text-mining
df_textmining["gene_nodenorm_id"] = [gene_nodenorm_mapping[i]["primary_id"] 
                                     if gene_nodenorm_mapping.get(i) 
                                     else pd.NA
                                     for i in df_textmining["gene_id"]]

df_textmining["gene_nodenorm_label"] = [gene_nodenorm_mapping[i]["primary_label"] 
                                        if gene_nodenorm_mapping.get(i) 
                                        else pd.NA
                                        for i in df_textmining["gene_id"]]

df_textmining["gene_nodenorm_category"] = [gene_nodenorm_mapping[i]["category"] 
                                           if gene_nodenorm_mapping.get(i) 
                                           else pd.NA
                                           for i in df_textmining["gene_id"]]

df_textmining["disease_nodenorm_id"] = [disease_nodenorm_mapping[i]["primary_id"] 
                                        if disease_nodenorm_mapping.get(i) 
                                        else pd.NA
                                        for i in df_textmining["disease_id"]]

df_textmining["disease_nodenorm_label"] = [disease_nodenorm_mapping[i]["primary_label"] 
                                           if disease_nodenorm_mapping.get(i) 
                                           else pd.NA
                                           for i in df_textmining["disease_id"]]

In [39]:

## knowledge
df_knowledge["gene_nodenorm_id"] = [gene_nodenorm_mapping[i]["primary_id"] 
                                    if gene_nodenorm_mapping.get(i) 
                                    else pd.NA
                                    for i in df_knowledge["gene_id"]]

df_knowledge["gene_nodenorm_label"] = [gene_nodenorm_mapping[i]["primary_label"] 
                                       if gene_nodenorm_mapping.get(i) 
                                       else pd.NA
                                       for i in df_knowledge["gene_id"]]

df_knowledge["gene_nodenorm_category"] = [gene_nodenorm_mapping[i]["category"] 
                                          if gene_nodenorm_mapping.get(i) 
                                          else pd.NA
                                          for i in df_knowledge["gene_id"]]

df_knowledge["disease_nodenorm_id"] = [disease_nodenorm_mapping[i]["primary_id"] 
                                       if disease_nodenorm_mapping.get(i) 
                                       else pd.NA
                                       for i in df_knowledge["disease_id"]]

df_knowledge["disease_nodenorm_label"] = [disease_nodenorm_mapping[i]["primary_label"] 
                                          if disease_nodenorm_mapping.get(i) 
                                          else pd.NA
                                          for i in df_knowledge["disease_id"]]
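
As a possible alternative to the list comprehensions above, the same columns could be filled with `Series.map` and a flat lookup dict per field. This is a sketch on a tiny made-up dataframe and mapping (the IDs are placeholders); note that unmapped values come back as `NaN` rather than `pd.NA`, which `dropna` handles the same way.

```python
import pandas as pd

# Made-up mapping in the same shape as gene_nodenorm_mapping
gene_mapping = {
    "ENSEMBL:ENSP00000003084": {
        "primary_id": "NCBIGene:1080",
        "primary_label": "CFTR",
        "category": "biolink:Gene",
    }
}

df = pd.DataFrame({"gene_id": ["ENSEMBL:ENSP00000003084", "ENSEMBL:ENSP00000000000"]})

fields_to_columns = [
    ("primary_id", "gene_nodenorm_id"),
    ("primary_label", "gene_nodenorm_label"),
    ("category", "gene_nodenorm_category"),
]

for field, column in fields_to_columns:
    # Flatten the nested mapping to {curie: value} so Series.map can use it directly
    lookup = {curie: record[field] for curie, record in gene_mapping.items()}
    df[column] = df["gene_id"].map(lookup)  # IDs missing from lookup become NaN

# Rows without NodeNorm data are dropped, as in the next step of the notebook
df = df.dropna(subset=["gene_nodenorm_id", "gene_nodenorm_label"]).reset_index(drop=True)
print(df.shape)
```

The loop over `fields_to_columns` also removes the need to repeat the same block for each dataframe.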

In [None]:
## code chunk to review df

# df_textmining
# df_knowledge

In [40]:
## put into parser: 

## using subset just in case (original data didn't have any NAs that would be
##   dropped by accident, but still) 

df_textmining.dropna(subset=["gene_nodenorm_id", "gene_nodenorm_label", 
                            "disease_nodenorm_id", "disease_nodenorm_label"],
                    ignore_index=True, inplace=True)

df_knowledge.dropna(subset=["gene_nodenorm_id", "gene_nodenorm_label", 
                            "disease_nodenorm_id", "disease_nodenorm_label"],
                    ignore_index=True, inplace=True)

In [41]:
## didn't do stats for before/after

df_textmining.shape
df_knowledge.shape

(265468, 12)

(3964, 12)

## Generating documents

In [42]:
## put in parser!!
## want jsonlines format

import jsonlines

### Rows not included

<div class="alert alert-block alert-info">

* knowledge's UniProtKB-KW data
* No ENSP in gene_id columns (seemed to be names, couldn't NodeNorm)
* No DOID in disease_id columns (can't NodeNorm AmyCo)
* NodeNorm mapping failures for gene or disease IDs
* duplicates: will check when generating docs, and skip any edge that was already created

### Columns not included

<div class="alert alert-block alert-info">

- gene_name
- disease_name
- evidence_type (knowledge file only): same for all rows, not needed

In [43]:
df_textmining.columns

df_knowledge.columns

Index(['gene_id', 'gene_name', 'disease_id', 'disease_name', 'z_score',
       'confidence_score', 'url', 'gene_nodenorm_id', 'gene_nodenorm_label',
       'gene_nodenorm_category', 'disease_nodenorm_id',
       'disease_nodenorm_label'],
      dtype='object')

Index(['gene_id', 'gene_name', 'disease_id', 'disease_name', 'source_db',
       'evidence_type', 'confidence_score', 'gene_nodenorm_id',
       'gene_nodenorm_label', 'gene_nodenorm_category', 'disease_nodenorm_id',
       'disease_nodenorm_label'],
      dtype='object')

### File: List of TRAPI edges 

In [None]:
## code chunk for testing parts of inner code

# with jsonlines.open('DISEASES_trapi_edges.jsonl', mode='w', compact=True) as trapi_writer:
#     knowledge_tally = set()
    
#     ## using itertuples because it's faster, preserves datatypes
#     for row in df_knowledge.itertuples(index=False):

#             break

In [44]:
## uncomment to write files

## put into parser (format): 
## separate functions for textmined vs not, trapi edges vs kgx??
## will want the tally to live outside the reading data -> writing loop, so it can
##   keep track of what edges were already done

## if we need to create two different output file formats at the same time, could initialize
##   two separate writers and use both at once 


# with jsonlines.open('DISEASES_trapi_edges.jsonl', mode='w', compact=True) as trapi_writer:
    
# ## text-mined data: 
#     textmining_tally = set()
    
#     ## using itertuples because it's faster, preserves datatypes
#     for row in df_textmining.itertuples(index=False):
#         ## construct row abbreviation
#         temp = row.gene_id + "_" + row.disease_id
        
#         if temp not in textmining_tally:
#             textmining_tally.add(temp)

#             document = {
#                 "subject": row.gene_nodenorm_id,
#                 "predicate": "biolink:occurs_together_in_literature_with",
#                 "object": row.disease_nodenorm_id,
#                 "sources": [
#                     {
#                         "resource_id": "infores:diseases",
#                         "resource_role": "primary_knowledge_source",
#                         "source_record_urls": [row.url]
#                     }
#                 ],
#                 "attributes": [
#                     {
#                         "attribute_type_id": "biolink:knowledge_level",
#                         "value": "statistical_association"
#                     },
#                     {
#                         "attribute_type_id": "biolink:agent_type",
#                         "value": "text_mining_agent"
#                     },
#                     {  ## no original attribute names since file had no header
#                         "attribute_type_id": "biolink:z_score",
#                         "value": row.z_score
#                     },
#                     {
#                         "attribute_type_id": "biolink:has_confidence_score",
#                         "value": row.confidence_score
#                     },
#                     {
#                         "attribute_type_id": "biolink:original_subject",
#                         "value": row.gene_id
#                     },
#                     {
#                         "attribute_type_id": "biolink:original_object",
#                         "value": row.disease_id
#                     },
#                 ]
#             }
#             ## assign the return value so it isn't printed (avoid shadowing the built-in `bytes`)
#             n_written = trapi_writer.write(document)
#         else:
#             ## won't write the document
#             print(f"duplicate row encountered: {temp}")

# ## knowledge data: parser - separate function (diff row abbreviation, document)
#     knowledge_tally = set()
    
#     ## using itertuples because it's faster, preserves datatypes
#     for row in df_knowledge.itertuples(index=False):
#         ## construct row abbreviation: needs source_db!
#         temp = row.gene_id + "_" + row.disease_id + "_" + row.source_db
        
#         if temp not in knowledge_tally:
#             knowledge_tally.add(temp)

#             document = {
#                 ## simple assignments: no if
#                 "subject": row.gene_nodenorm_id,
#                 "predicate": "biolink:associated_with",
#                 "object": row.disease_nodenorm_id,
#                 "attributes": [
#                     {
#                         "attribute_type_id": "biolink:knowledge_level",
#                         "value": "knowledge_assertion"
#                     },
#                     {
#                         "attribute_type_id": "biolink:agent_type",
#                         "value": "manual_agent"
#                     },
#                     {  ## no original attribute names since file had no header
#                         "attribute_type_id": "biolink:has_confidence_score",
#                         "value": row.confidence_score
#                     },
#                     {
#                         "attribute_type_id": "biolink:original_subject",
#                         "value": row.gene_id
#                     },
#                     {
#                         "attribute_type_id": "biolink:original_object",
#                         "value": row.disease_id
#                     },
#                 ]
#             }
#             ## sources: depends on source_db value
#             if row.source_db == "MedlinePlus":
#                 document["sources"] = [
#                     {
#                         "resource_id": "infores:medlineplus",
#                         "resource_role": "primary_knowledge_source"
#                     },
#                     {
#                         "resource_id": "infores:diseases",
#                         "resource_role": "aggregator_knowledge_source",
#                         "upstream_resource_ids": ["infores:medlineplus"]
#                     }
#                 ]
#             elif row.source_db == "AmyCo":
#                 document["sources"] = [
#                     {
#                         "resource_id": "infores:amyco",
#                         "resource_role": "primary_knowledge_source"
#                     },
#                     {
#                         "resource_id": "infores:diseases",
#                         "resource_role": "aggregator_knowledge_source",
#                         "upstream_resource_ids": ["infores:amyco"]
#                     }
#                 ]
#             else:
#                 raise ValueError(f"Unexpected source_db value during source mapping: {row.source_db}. Adjust parser.")
#             ## assign the return value so it isn't printed (avoid shadowing the built-in `bytes`)
#             n_written = trapi_writer.write(document)
#         else:
#             ## won't write the document
#             print(f"duplicate row encountered: {temp}")

duplicate row encountered: ENSEMBL:ENSP00000269703_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000272895_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000311687_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000342392_DOID:0050568_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000379845_DOID:0060407_MedlinePlus
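
The comment at the top of this cell suggests opening two writers at once when both output formats are needed. Below is a minimal sketch of that idea, not the parser's actual code. It assumes the standard-library `json` module as a stand-in for `jsonlines`; `write_two_formats` and the example file names are hypothetical, and the single toy edge reuses IDs from the duplicate table shown in this notebook.

```python
import json
import os
import tempfile


def write_two_formats(documents, trapi_path, kgx_path):
    """Write each (trapi_doc, kgx_doc) pair to its own JSON Lines file.

    Hypothetical helper: returns the number of pairs written.
    """
    count = 0
    # a single `with` statement can manage both file handles
    with open(trapi_path, "w", encoding="utf-8") as trapi_file, \
         open(kgx_path, "w", encoding="utf-8") as kgx_file:
        for trapi_doc, kgx_doc in documents:
            # compact separators keep each record on one short line
            trapi_file.write(json.dumps(trapi_doc, separators=(",", ":")) + "\n")
            kgx_file.write(json.dumps(kgx_doc, separators=(",", ":")) + "\n")
            count += 1
    return count


# toy input: one edge expressed in both shapes (illustrative only)
example_pairs = [
    (
        {"subject": "NCBIGene:18", "predicate": "biolink:associated_with",
         "object": "MONDO:0011147",
         "attributes": [{"attribute_type_id": "biolink:agent_type",
                         "value": "manual_agent"}]},
        {"subject": "NCBIGene:18", "predicate": "biolink:associated_with",
         "object": "MONDO:0011147", "agent_type": "manual_agent"},
    )
]

with tempfile.TemporaryDirectory() as tmp_dir:
    trapi_out = os.path.join(tmp_dir, "example_trapi_edges.jsonl")
    kgx_out = os.path.join(tmp_dir, "example_kgx_edges.jsonl")
    pairs_written = write_two_formats(example_pairs, trapi_out, kgx_out)
    with open(kgx_out, encoding="utf-8") as handle:
        kgx_lines = handle.read().splitlines()
```

Building both documents in one pass means the duplicate check only has to run once per row.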


In [45]:
## duplicates: comparing to printout

df_knowledge[df_knowledge.duplicated(subset=["gene_id", "disease_id", "source_db"], keep=False)]

index,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score,gene_nodenorm_id,gene_nodenorm_label,gene_nodenorm_category,disease_nodenorm_id,disease_nodenorm_label
812,ENSEMBL:ENSP00000269703,CYP4F22,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,NCBIGene:126410,CYP4F22,biolink:Gene,MONDO:0017265,autosomal recessive congenital ichthyosis
813,ENSEMBL:ENSP00000269703,CYP4F22,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,NCBIGene:126410,CYP4F22,biolink:Gene,MONDO:0017265,autosomal recessive congenital ichthyosis
838,ENSEMBL:ENSP00000272895,ABCA12,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,UniProtKB:Q86UK0-1,glucosylceramide transporter ABCA12 isoform h1 (human),biolink:Protein,MONDO:0017265,autosomal recessive congenital ichthyosis
839,ENSEMBL:ENSP00000272895,ABCA12,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,UniProtKB:Q86UK0-1,glucosylceramide transporter ABCA12 isoform h1 (human),biolink:Protein,MONDO:0017265,autosomal recessive congenital ichthyosis
1293,ENSEMBL:ENSP00000311687,NIPAL4,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,UniProtKB:Q0D2K0-1,magnesium transporter NIPA4 isoform h1 (human),biolink:Protein,MONDO:0017265,autosomal recessive congenital ichthyosis
1294,ENSEMBL:ENSP00000311687,NIPAL4,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,UniProtKB:Q0D2K0-1,magnesium transporter NIPA4 isoform h1 (human),biolink:Protein,MONDO:0017265,autosomal recessive congenital ichthyosis
1717,ENSEMBL:ENSP00000342392,MESP2,DOID:0050568,Spondylocostal dysostosis,MedlinePlus,CURATED,5,NCBIGene:145873,MESP2,biolink:Gene,MONDO:0000359,spondylocostal dysostosis
1718,ENSEMBL:ENSP00000342392,MESP2,DOID:0050568,Spondylocostal dysostosis,MedlinePlus,CURATED,5,NCBIGene:145873,MESP2,biolink:Gene,MONDO:0000359,spondylocostal dysostosis
2811,ENSEMBL:ENSP00000379845,ABAT,DOID:0060407,Chromosome 18q deletion syndrome,MedlinePlus,CURATED,5,NCBIGene:18,ABAT,biolink:Gene,MONDO:0011147,chromosome 18q deletion syndrome
2812,ENSEMBL:ENSP00000379845,ABAT,DOID:0060407,Chromosome 18q deletion syndrome,MedlinePlus,CURATED,5,NCBIGene:18,ABAT,biolink:Gene,MONDO:0011147,chromosome 18q deletion syndrome


In [46]:
## number of docs that should be created
df_textmining.shape[0] + df_knowledge.shape[0] - 5  ## total rows, minus the 5 duplicate knowledge rows found above

269427
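
One way to confirm that the written file matches this expected count is to count the records in the output. The sketch below assumes one JSON document per line; `count_jsonl_records` is a hypothetical helper, and the demonstration uses a temporary file with two toy records rather than the real output.

```python
import json
import os
import tempfile


def count_jsonl_records(path):
    """Count non-empty lines in a JSON Lines file, checking that each parses as JSON."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                count += 1
    return count


# toy demonstration with a temporary file instead of the real output
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_edges.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('{"subject":"NCBIGene:18","object":"MONDO:0011147"}\n')
        handle.write('{"subject":"NCBIGene:145873","object":"MONDO:0000359"}\n')
    demo_count = count_jsonl_records(demo_path)
```

In the notebook, the result for `DISEASES_trapi_edges.jsonl` could then be compared against the number computed above.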

### File: KGX edges

In [None]:
# df_textmining

In [None]:
## code chunk for testing parts of inner code

# with jsonlines.open('DISEASES_kgx_edges.jsonl', mode='w', compact=True) as kgx_edges_writer:
#     textmining_tally = set()
    
#     ## using itertuples because it's faster, preserves datatypes
#     for row in df_textmining.itertuples(index=False):
#         ## construct row abbreviation
#         temp = row.gene_id + "_" + row.disease_id
        
        
#         if row.gene_id == "ENSEMBL:ENSP00000000412":
#             break    

In [47]:
## uncomment to write files

# with jsonlines.open('DISEASES_kgx_edges.jsonl', mode='w', compact=True) as kgx_edges_writer:
    
# ## text-mined data: parser - separate function (diff row abbreviation, document)
#     textmining_tally = set()
    
#     ## using itertuples because it's faster, preserves datatypes
#     for row in df_textmining.itertuples(index=False):
#         ## construct row abbreviation
#         temp = row.gene_id + "_" + row.disease_id
        
#         if temp not in textmining_tally:
#             textmining_tally.add(temp)

#             document = {
#                 "subject": row.gene_nodenorm_id,
#                 "predicate": "biolink:occurs_together_in_literature_with",
#                 "object": row.disease_nodenorm_id,
#                 "sources": [
#                     {
#                         "resource_id": "infores:diseases",
#                         "resource_role": "primary_knowledge_source",
#                         "source_record_urls": [row.url]
#                     }
#                 ],
#                 "knowledge_level": "statistical_association",
#                 "agent_type": "text_mining_agent",
#                 "z_score": row.z_score,
#                 "has_confidence_score": row.confidence_score,
#                 "original_subject": row.gene_id,
#                 "original_object": row.disease_id,
#             }
            
#             ## assign the return value so it isn't printed (avoid shadowing the built-in `bytes`)
#             n_written = kgx_edges_writer.write(document)
#         else:
#             ## won't write the document
#             print(f"duplicate row encountered: {temp}")

# ## knowledge data: parser - separate function (diff row abbreviation, document)
#     knowledge_tally = set()
    
#     ## using itertuples because it's faster, preserves datatypes
#     for row in df_knowledge.itertuples(index=False):
#         ## construct row abbreviation: needs source_db!
#         temp = row.gene_id + "_" + row.disease_id + "_" + row.source_db
        
#         if temp not in knowledge_tally:
#             knowledge_tally.add(temp)

#             document = {
#                 ## simple assignments: no if
#                 "subject": row.gene_nodenorm_id,
#                 "predicate": "biolink:associated_with",
#                 "object": row.disease_nodenorm_id,
#                 "knowledge_level": "knowledge_assertion",
#                 "agent_type": "manual_agent",
#                 "has_confidence_score": row.confidence_score,
#                 "original_subject": row.gene_id,
#                 "original_object": row.disease_id,
#             }
            
#             ## sources: depends on source_db value
#             if row.source_db == "MedlinePlus":
#                 document["sources"] = [
#                     {
#                         "resource_id": "infores:medlineplus",
#                         "resource_role": "primary_knowledge_source"
#                     },
#                     {
#                         "resource_id": "infores:diseases",
#                         "resource_role": "aggregator_knowledge_source",
#                         "upstream_resource_ids": ["infores:medlineplus"]
#                     }
#                 ]
#             elif row.source_db == "AmyCo":
#                 document["sources"] = [
#                     {
#                         "resource_id": "infores:amyco",
#                         "resource_role": "primary_knowledge_source"
#                     },
#                     {
#                         "resource_id": "infores:diseases",
#                         "resource_role": "aggregator_knowledge_source",
#                         "upstream_resource_ids": ["infores:amyco"]
#                     }
#                 ]
#             else:
#                 raise ValueError(f"Unexpected source_db value during source mapping: {row.source_db}. Adjust parser.")
            
#             ## assign the return value so it isn't printed (avoid shadowing the built-in `bytes`)
#             n_written = kgx_edges_writer.write(document)
#         else:
#             ## won't write the document
#             print(f"duplicate row encountered: {temp}")

duplicate row encountered: ENSEMBL:ENSP00000269703_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000272895_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000311687_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000342392_DOID:0050568_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000379845_DOID:0060407_MedlinePlus
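
The KGX edge documents above carry the same information as the TRAPI ones. The difference is that each entry in the TRAPI `attributes` list becomes a top-level key with the `biolink:` prefix dropped. Below is a minimal sketch of that mapping. `trapi_edge_to_kgx` is a hypothetical helper, and it assumes each `attribute_type_id` appears at most once per edge.

```python
def trapi_edge_to_kgx(trapi_doc):
    """Flatten a TRAPI-style edge dict into a KGX-style edge dict.

    Assumption: attribute_type_id values look like "biolink:<name>" and
    each appears only once, so no values are overwritten.
    """
    # copy everything except the attributes list (subject, predicate, object, sources, ...)
    kgx_doc = {key: value for key, value in trapi_doc.items() if key != "attributes"}
    for attribute in trapi_doc.get("attributes", []):
        # "biolink:knowledge_level" -> "knowledge_level"
        name = attribute["attribute_type_id"].split(":", 1)[-1]
        kgx_doc[name] = attribute["value"]
    return kgx_doc


# toy edge built from one row of the duplicates table shown earlier
example_trapi = {
    "subject": "NCBIGene:18",
    "predicate": "biolink:associated_with",
    "object": "MONDO:0011147",
    "attributes": [
        {"attribute_type_id": "biolink:knowledge_level", "value": "knowledge_assertion"},
        {"attribute_type_id": "biolink:agent_type", "value": "manual_agent"},
        {"attribute_type_id": "biolink:has_confidence_score", "value": 5},
        {"attribute_type_id": "biolink:original_subject", "value": "ENSEMBL:ENSP00000379845"},
        {"attribute_type_id": "biolink:original_object", "value": "DOID:0060407"},
    ],
}
example_kgx = trapi_edge_to_kgx(example_trapi)
```

With a helper like this, the document-building logic would only need to exist once, and the second format could be derived from the first.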


### File: KGX nodes

KGX nodes require id and category; name and other properties (essentially node attributes) are optional. 
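
A small check of those requirements might look like the sketch below. `check_kgx_node` is a hypothetical helper, and treating category as a list follows the form used when writing nodes in this notebook rather than a full reading of the KGX specification.

```python
def check_kgx_node(node):
    """Return a list of problems found in a KGX-style node dict (empty if none)."""
    problems = []
    if not node.get("id"):
        problems.append("missing id")
    category = node.get("category")
    if not category:
        problems.append("missing category")
    elif not isinstance(category, list):
        # assumption: category is written as a list, as in this notebook
        problems.append("category should be a list")
    return problems


# node values taken from the gene table in this notebook
good_node = {"id": "NCBIGene:381", "name": "ARF5", "category": ["biolink:Gene"]}
# deliberately incomplete node for illustration
bad_node = {"id": "MONDO:0004992", "name": "cancer"}

good_problems = check_kgx_node(good_node)
bad_problems = check_kgx_node(bad_node)
```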

In [48]:
## get unique list of NodeNormed nodes

## need category 
nodenormed_genes_final = pd.concat([df_textmining[["gene_nodenorm_id", "gene_nodenorm_label", "gene_nodenorm_category"]], 
                         df_knowledge[["gene_nodenorm_id", "gene_nodenorm_label", "gene_nodenorm_category"]]], 
                        ignore_index=True).drop_duplicates()
nodenormed_diseases_final = pd.concat([df_textmining[["disease_nodenorm_id", "disease_nodenorm_label"]], 
                            df_knowledge[["disease_nodenorm_id", "disease_nodenorm_label"]]], 
                           ignore_index=True).drop_duplicates()

nodenormed_genes_final
nodenormed_diseases_final

index,gene_nodenorm_id,gene_nodenorm_label,gene_nodenorm_category
0,NCBIGene:381,ARF5,biolink:Gene
3,NCBIGene:4074,M6PR,biolink:Gene
10,NCBIGene:2288,FKBP4,biolink:Gene
34,UniProtKB:Q9NR63-1,cytochrome P450 26B1 isoform 1 (human),biolink:Protein
50,UniProtKB:Q7L592-1,"protein arginine methyltransferase NDUFAF7, mitochondrial isoform 1 (human)",biolink:Protein
...,...,...,...
265462,UniProtKB:Q149M9-3,NACHT domain- and WD repeat-containing protein 1 isoform h3 (human),biolink:Protein
265463,UniProtKB:Q86U70-1,LIM domain-binding protein 1 isoform 1 (human),biolink:Protein
265929,UniProtKB:P41219-1,peripherin isoform h1 (human),biolink:Protein
268348,UniProtKB:Q96JH8-4,Ras-associating and dilute domain-containing protein isoform h4 (human),biolink:Protein


index,disease_nodenorm_id,disease_nodenorm_label
0,MONDO:0009271,geroderma osteodysplastica
1,MONDO:0004992,cancer
2,MONDO:0005071,nervous system disorder
3,MONDO:0009650,Mucolipidosis 2
4,MONDO:0018931,"mucolipidosis type III, alpha/beta"
...,...,...
269183,MONDO:0011152,PHGDH deficiency
269198,MONDO:0100428,progressive bulbar palsy of childhood
269209,MONDO:0012791,"mitochondrial DNA depletion syndrome, encephalomyopathic form with methylmalonic aciduria"
269409,MONDO:0009949,pyruvate carboxylase deficiency disease


In [49]:
## uncomment to write files

# with jsonlines.open('DISEASES_kgx_nodes.jsonl', mode='w', compact=True) as kgx_nodes_writer:
    
#     ## using itertuples because it's faster, preserves datatypes
#     for row in nodenormed_genes_final.itertuples(index=False):
#         ## assign the return value so it isn't printed (avoid shadowing the built-in `bytes`)
#         n_written = kgx_nodes_writer.write({
#             "id": row.gene_nodenorm_id,
#             "name": row.gene_nodenorm_label,
#             "category": [row.gene_nodenorm_category]
#         })

#     ## using itertuples because it's faster, preserves datatypes
#     for row in nodenormed_diseases_final.itertuples(index=False):
#         ## assign the return value so it isn't printed (avoid shadowing the built-in `bytes`)
#         n_written = kgx_nodes_writer.write({
#             "id": row.disease_nodenorm_id,
#             "name": row.disease_nodenorm_label,
#             ## hard-coded because during pre-NodeNorm process, only kept entities with this primary category
#             "category": ["biolink:Disease"]
#         })

In [50]:
nodenormed_genes_final.shape[0] + nodenormed_diseases_final.shape[0]

21954

## Notes

* __Some edges will look like duplicates because the NodeNormed triple (subject, predicate, object) is the same, even though the original IDs/data are different. Leaving them this way for now, but will need to decide what to do with them (is merging possible?).__
* confidence_score is sometimes a float (text-mining data) and sometimes an int (knowledge data). Not a concern for now. 
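
To see how often the first note applies, the written edges could be grouped by triple. This is a minimal sketch: `find_repeated_triples` is a hypothetical helper, and the `EX:` identifiers and scores are made-up placeholders, not values from the DISEASES data. The loop also casts scores to float, which is one possible way to handle the mixed int/float types from the second note.

```python
from collections import defaultdict


def find_repeated_triples(edges):
    """Group edges by (subject, predicate, object); keep groups with more than one edge."""
    groups = defaultdict(list)
    for edge in edges:
        triple = (edge["subject"], edge["predicate"], edge["object"])
        groups[triple].append(edge)
    return {triple: group for triple, group in groups.items() if len(group) > 1}


# made-up KGX-style edges: the first two share a triple but differ in original_subject
example_edges = [
    {"subject": "EX:gene1", "predicate": "biolink:associated_with", "object": "EX:disease1",
     "original_subject": "EX:protein_a", "has_confidence_score": 5},
    {"subject": "EX:gene1", "predicate": "biolink:associated_with", "object": "EX:disease1",
     "original_subject": "EX:protein_b", "has_confidence_score": 4.5},
    {"subject": "EX:gene2", "predicate": "biolink:associated_with", "object": "EX:disease1",
     "original_subject": "EX:protein_c", "has_confidence_score": 4},
]

repeated = find_repeated_triples(example_edges)
for triple, group in repeated.items():
    original_ids = sorted(edge["original_subject"] for edge in group)
    # cast to float so int and float scores can be compared uniformly
    scores = [float(edge["has_confidence_score"]) for edge in group]
    print(triple, original_ids, scores)
```

Whether such groups should be merged, and how their attributes would be combined, is left open here as in the note above.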