<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Exploring-data" data-toc-modified-id="Exploring-data-1">Exploring data</a></span><ul class="toc-item"><li><span><a href="#Downloading-files" data-toc-modified-id="Downloading-files-1.1">Downloading files</a></span></li><li><span><a href="#Check-for-duplicates" data-toc-modified-id="Check-for-duplicates-1.2">Check for duplicates</a></span></li><li><span><a href="#gene_id-column" data-toc-modified-id="gene_id-column-1.3">gene_id column</a></span><ul class="toc-item"><li><span><a href="#Text-mining" data-toc-modified-id="Text-mining-1.3.1">Text-mining</a></span></li><li><span><a href="#Knowledge" data-toc-modified-id="Knowledge-1.3.2">Knowledge</a></span></li></ul></li><li><span><a href="#gene_name-column" data-toc-modified-id="gene_name-column-1.4">gene_name column</a></span></li><li><span><a href="#disease_id" data-toc-modified-id="disease_id-1.5">disease_id</a></span><ul class="toc-item"><li><span><a href="#Text-mining" data-toc-modified-id="Text-mining-1.5.1">Text-mining</a></span></li><li><span><a href="#Knowledge" data-toc-modified-id="Knowledge-1.5.2">Knowledge</a></span></li></ul></li><li><span><a href="#Other-columns" data-toc-modified-id="Other-columns-1.6">Other columns</a></span><ul class="toc-item"><li><span><a href="#Text-mining" data-toc-modified-id="Text-mining-1.6.1">Text-mining</a></span></li><li><span><a href="#Knowledge" data-toc-modified-id="Knowledge-1.6.2">Knowledge</a></span></li></ul></li></ul></li><li><span><a href="#Remove-rows" data-toc-modified-id="Remove-rows-2">Remove rows</a></span></li><li><span><a href="#Pre-NodeNorming" data-toc-modified-id="Pre-NodeNorming-3">Pre-NodeNorming</a></span><ul class="toc-item"><li><span><a href="#Genes" data-toc-modified-id="Genes-3.1">Genes</a></span></li><li><span><a href="#Diseases" data-toc-modified-id="Diseases-3.2">Diseases</a></span></li><li><span><a href="#Adding-NodeNorm-data,-removing-rows" 
data-toc-modified-id="Adding-NodeNorm-data,-removing-rows-3.3">Adding NodeNorm data, removing rows</a></span></li><li><span><a href="#Exploring-&quot;duplicates&quot;-from-NodeNorming" data-toc-modified-id="Exploring-&quot;duplicates&quot;-from-NodeNorming-3.4">Exploring "duplicates" from NodeNorming</a></span></li></ul></li><li><span><a href="#Generating-documents" data-toc-modified-id="Generating-documents-4">Generating documents</a></span><ul class="toc-item"><li><span><a href="#Rows-not-included" data-toc-modified-id="Rows-not-included-4.1">Rows not included</a></span></li><li><span><a href="#Columns-not-included" data-toc-modified-id="Columns-not-included-4.2">Columns not included</a></span></li><li><span><a href="#File:-List-of-TRAPI-edges" data-toc-modified-id="File:-List-of-TRAPI-edges-4.3">File: List of TRAPI edges</a></span></li><li><span><a href="#File:-KGX-edges" data-toc-modified-id="File:-KGX-edges-4.4">File: KGX edges</a></span></li><li><span><a href="#File:-KGX-nodes" data-toc-modified-id="File:-KGX-nodes-4.5">File: KGX nodes</a></span></li></ul></li><li><span><a href="#Notes" data-toc-modified-id="Notes-5">Notes</a></span></li></ul></div>

# Notebook for DISEASES parser development

In [1]:
## not for parser. for notebook only 

## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Exploring data

### Downloading files

Going to load entire files for review. 

In the parser, it may be better to use a generator and ingest large chunks (e.g., 1000-2000 lines) at a time. This balances a smaller memory footprint against the fact that NodeNorming in large batches is faster. 

(pandas [read_table](https://pandas.pydata.org/docs/reference/api/pandas.read_table.html) has an iterator for rows/chunks! see iterator/chunksize parameters)
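
A minimal sketch of the generator idea, using only the standard library instead of pandas. The function name `iter_chunks` and the sample rows are made up for illustration; the column names match the text-mining header used later in this notebook.

```python
import csv
import io
from itertools import islice

# Column names follow the text-mining header used later in this notebook.
TEXTMINING_HEADER = ["gene_id", "gene_name", "disease_id", "disease_name",
                     "z_score", "confidence_score", "url"]


def iter_chunks(handle, header, chunk_size=1000):
    """Yield lists of row dicts, with at most chunk_size rows per list."""
    reader = csv.DictReader(handle, fieldnames=header, delimiter="\t")
    # islice pulls the next chunk_size rows; an empty list ends the loop
    while chunk := list(islice(reader, chunk_size)):
        yield chunk


# Tiny in-memory stand-in for the real TSV file (all values are made up).
sample = io.StringIO(
    "ENSP1\tG1\tDOID:1\tD1\t3.5\t1.75\turl1\n"
    "ENSP2\tG2\tDOID:2\tD2\t4.0\t2.0\turl2\n"
    "ENSP3\tG3\tDOID:3\tD3\t5.0\t2.5\turl3\n"
)
chunk_sizes = [len(chunk) for chunk in iter_chunks(sample, TEXTMINING_HEADER, chunk_size=2)]
print(chunk_sizes)
```

Each chunk can then be filtered and NodeNormed as one batch. Whether this or the pandas `chunksize` route is the better fit for the parser is an open choice.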

In [2]:
## put in parser (format)
import pathlib
import pandas as pd

## don't put in parser. Just for this notebook
from pprint import pprint

## unsure on putting into parser: more for notebook viewing/debugging...
pd.options.display.max_columns = None

In [3]:
## useful function for exploring data
def check_if_contains(df, column_name, patterns):
    for i in patterns:
        temp = df[df[column_name].str.contains(pat=i, case=False)]
        if temp.size > 0:
            print(f'"{i}"')
            print(temp.shape)

<div class="alert alert-block alert-danger">
    
This notebook was originally written using data files downloaded 2025-04-14 between 4:23-4:24 PM Pacific time (23:23-23:24 UTC+0) from https://diseases.jensenlab.org/Downloads. 

In [4]:
## put in parser (format)

## paths to raw data files

base_file_path = pathlib.Path.home().joinpath("Desktop", "DISEASES_files", "2025_05_02")

textmining_path = base_file_path.joinpath("human_disease_textmining_filtered.tsv")
knowledge_path = base_file_path.joinpath("human_disease_knowledge_filtered.tsv")

textmining_path
knowledge_path

PosixPath('/Users/colleenxu/Desktop/DISEASES_files/2025_05_02/human_disease_textmining_filtered.tsv')

PosixPath('/Users/colleenxu/Desktop/DISEASES_files/2025_05_02/human_disease_knowledge_filtered.tsv')

In [5]:
## put in parser (format)

## download files

## files have no headers: adding based on https://diseases.jensenlab.org/Downloads
textmining_header = ["gene_id", "gene_name", "disease_id", "disease_name", 
                     "z_score", "confidence_score", "url"]
knowledge_header = ["gene_id", "gene_name", "disease_id", "disease_name", 
                    "source_db", "evidence_type", "confidence_score"]


df_textmining = pd.read_table(textmining_path, sep="\t", names=textmining_header)
df_knowledge = pd.read_table(knowledge_path, sep="\t", names=knowledge_header)

<div class="alert alert-block alert-info">

No missing values

In [6]:
df_textmining.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288571 entries, 0 to 288570
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   gene_id           288571 non-null  object 
 1   gene_name         288571 non-null  object 
 2   disease_id        288571 non-null  object 
 3   disease_name      288571 non-null  object 
 4   z_score           288571 non-null  float64
 5   confidence_score  288571 non-null  float64
 6   url               288571 non-null  object 
dtypes: float64(2), object(5)
memory usage: 113.7 MB


In [7]:
df_textmining

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,z_score,confidence_score,url
0,18S_rRNA,18S_rRNA,DOID:9643,Babesiosis,7.230,3.615,https://diseases.jensenlab.org/Entity?document...
1,18S_rRNA,18S_rRNA,DOID:3733,Theileriasis,6.363,3.181,https://diseases.jensenlab.org/Entity?document...
2,18S_rRNA,18S_rRNA,DOID:12365,Malaria,6.293,3.146,https://diseases.jensenlab.org/Entity?document...
3,18S_rRNA,18S_rRNA,DOID:9640,Sarcocystosis,6.142,3.071,https://diseases.jensenlab.org/Entity?document...
4,18S_rRNA,18S_rRNA,DOID:1733,Cryptosporidiosis,6.108,3.054,https://diseases.jensenlab.org/Entity?document...
...,...,...,...,...,...,...,...
288566,snoU13,snoU13,DOID:0110084,Arrhythmogenic right ventricular dysplasia 13,3.774,1.887,https://diseases.jensenlab.org/Entity?document...
288567,snoU13,snoU13,DOID:0110408,Retinitis pigmentosa 11,3.386,1.693,https://diseases.jensenlab.org/Entity?document...
288568,snoU13,snoU13,DOID:1849,Cannabis dependence,3.227,1.613,https://diseases.jensenlab.org/Entity?document...
288569,snoU13,snoU13,DOID:0060775,Microvillus inclusion disease,3.146,1.573,https://diseases.jensenlab.org/Entity?document...


In [8]:
df_knowledge.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7638 entries, 0 to 7637
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   gene_id           7638 non-null   object
 1   gene_name         7638 non-null   object
 2   disease_id        7638 non-null   object
 3   disease_name      7638 non-null   object
 4   source_db         7638 non-null   object
 5   evidence_type     7638 non-null   object
 6   confidence_score  7638 non-null   int64 
dtypes: int64(1), object(6)
memory usage: 2.7 MB


In [9]:
df_knowledge

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score
0,ABHD11-AS1,ABHD11-AS1,DOID:1928,Williams-Beuren syndrome,MedlinePlus,CURATED,5
1,ENSP00000001146,CYP26B1,DOID:2340,Craniosynostosis,UniProtKB-KW,CURATED,4
2,ENSP00000003084,CFTR,DOID:0111862,Congenital bilateral absence of vas deferens,MedlinePlus,CURATED,5
3,ENSP00000003084,CFTR,DOID:1485,Cystic fibrosis,MedlinePlus,CURATED,5
4,ENSP00000005226,USH1C,DOID:0050439,Usher syndrome,MedlinePlus,CURATED,5
...,...,...,...,...,...,...,...
7633,hsa-miR-145-5p,hsa-miR-145-5p,DOID:0090016,Chromosome 5q deletion syndrome,MedlinePlus,CURATED,5
7634,hsa-miR-146a-5p,hsa-miR-146a-5p,DOID:0090016,Chromosome 5q deletion syndrome,MedlinePlus,CURATED,5
7635,hsa-miR-184,hsa-miR-184,DOID:10126,Keratoconus,MedlinePlus,CURATED,5
7636,hsa-miR-590-5p,hsa-miR-590-5p,DOID:1928,Williams-Beuren syndrome,MedlinePlus,CURATED,5


### Check for duplicates

<div class="alert alert-block alert-info">

Duplicates found in knowledge file. This is a DISEASES-side problem.
    
Ideas for how to handle duplicates (including if they show up in text-mining file later): 
* ingest entire file and remove duplicates
* **doing this right now**: check a set of "already-done" edges (gene_id, disease_id, source_db for knowledge) right before creating document (earlier steps will cut down number of rows to create documents with). 
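
The "already-done" check could look roughly like the sketch below. It assumes rows arrive as dicts, chunk by chunk; the helper name `iter_new_edges` is hypothetical, and the sample rows reuse ID combinations that appear among the duplicated rows reviewed in this section.

```python
def iter_new_edges(rows, seen):
    """Yield rows whose (gene_id, disease_id, source_db) key is not in `seen` yet."""
    for row in rows:
        key = (row["gene_id"], row["disease_id"], row["source_db"])
        if key in seen:
            continue  # edge already handled, possibly in an earlier chunk
        seen.add(key)
        yield row


# The set lives outside the per-chunk loop so it persists across chunks.
seen_edges = set()
chunk_1 = [
    {"gene_id": "ENSP00000291295", "disease_id": "DOID:2843", "source_db": "UniProtKB-KW"},
    {"gene_id": "ENSP00000291295", "disease_id": "DOID:2843", "source_db": "UniProtKB-KW"},
]
chunk_2 = [
    {"gene_id": "ENSP00000291295", "disease_id": "DOID:2843", "source_db": "UniProtKB-KW"},
    {"gene_id": "ENSP00000342392", "disease_id": "DOID:0050568", "source_db": "MedlinePlus"},
]
kept_1 = list(iter_new_edges(chunk_1, seen_edges))
kept_2 = list(iter_new_edges(chunk_2, seen_edges))
print(len(kept_1), len(kept_2))
```

For the text-mining file, the key would presumably drop `source_db` and use only (gene_id, disease_id).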

In [10]:
df_textmining[df_textmining.duplicated(subset=["gene_id", "disease_id"], keep=False)].shape

(0, 7)

In [11]:
df_knowledge[df_knowledge.duplicated(keep=False)].shape

(16, 7)

In [12]:
## double-checking if specific column subset can work to check for duplicates
df_knowledge[df_knowledge.duplicated(subset=["gene_id", "disease_id", "source_db"], keep=False)].shape
## same count so it can

## print for manual review
df_knowledge[df_knowledge.duplicated(keep=False)]

(16, 7)

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score
1600,ENSP00000269703,CYP4F22,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
1601,ENSP00000269703,CYP4F22,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
1649,ENSP00000272895,ABCA12,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
1650,ENSP00000272895,ABCA12,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
1970,ENSP00000291295,CALM3,DOID:2843,Long QT syndrome,UniProtKB-KW,CURATED,4
1971,ENSP00000291295,CALM3,DOID:2843,Long QT syndrome,UniProtKB-KW,CURATED,4
2542,ENSP00000311687,NIPAL4,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
2543,ENSP00000311687,NIPAL4,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5
3374,ENSP00000342392,MESP2,DOID:0050568,Spondylocostal dysostosis,MedlinePlus,CURATED,5
3375,ENSP00000342392,MESP2,DOID:0050568,Spondylocostal dysostosis,MedlinePlus,CURATED,5


### gene_id column

<div class="alert alert-block alert-info">

**A small portion of the data doesn't have ENSP IDs (text-mining and knowledge)!** Instead, the values are names (same as gene_name). **Drop these rows from further parsing (can't NodeNorm)** 
    
I checked this "no ENSP" data for other ID namespaces mentioned in the paper, but couldn't find any matches. Other ID namespaces mentioned in [paper's Materials and methods](https://academic.oup.com/database/article/doi/10.1093/database/baac019/6554833?login=false#344427091) -> Dictionaries section:

> The human gene dictionary was obtained from STRING v11.0 (27) and is based on information from **Ensembl (34), UniProtKB (12) and HGNC (35)** databases.

In [13]:
## check for delimiters (between prefix and ID, between IDs)

delimiters = [",", ";", ":"]

print("text-mining")
check_if_contains(df_textmining, "gene_id", delimiters)

print("knowledge")
check_if_contains(df_knowledge, "gene_id", delimiters)

## didn't print any stats, so not found

text-mining
knowledge


In [14]:
## double-check work on 1 value
df_textmining[df_textmining["gene_id"].str.contains(":", case=False)]

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,z_score,confidence_score,url


In [15]:
## looked at "_": sometimes it can be a delimiter between prefix and ID
## text-mining: there are values, but all were names

# df_textmining[df_textmining["gene_id"].str.contains("_", case=False)]["gene_id"].unique()

# len(df_textmining[df_textmining["gene_id"].str.contains("_", case=False)]["gene_id"].unique())

In [16]:
## no values

df_knowledge[df_knowledge["gene_id"].str.contains("_")].shape

(0, 7)

#### Text-mining

In [17]:
## how much data doesn't have ENSP ID
df_no_ENSP_textmining = df_textmining[~ df_textmining["gene_id"].str.startswith("ENSP")].copy()
df_no_ENSP_textmining.shape

## vs entire dataset
df_no_ENSP_textmining.shape[0] / df_textmining.shape[0]

(16505, 7)

0.05719562949845965

In [18]:
## when data doesn't have ENSP ID, is ID == name?
df_no_ENSP_textmining[df_no_ENSP_textmining["gene_id"] == 
                      df_no_ENSP_textmining["gene_name"]].shape

## yes: shape is the same

(16505, 7)

In [19]:
## thousands of unique values
df_no_ENSP_textmining["gene_id"].unique()

array(['18S_rRNA', '28S_rRNA', '45S_rRNA', ..., 'pRNA', 'snoU13',
       'snoU18'], shape=(3961,), dtype=object)

In [20]:
## look for other ID namespaces mentioned in paper

possible_gene_namespaces = ["ENSG", "ensembl", "uniprotkb", "uniprot", "hgnc"]
## remember function uses case=False (not sensitive)
check_if_contains(df_no_ENSP_textmining, "gene_id", possible_gene_namespaces)

## none: nothing prints

In [21]:
## double-check work on 1 value
df_textmining[df_textmining["gene_id"].str.contains("hgnc", case=False)]

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,z_score,confidence_score,url


#### Knowledge

In [22]:
## quick look at knowledge data: short enough for manual review

## how much data doesn't have ENSP ID
df_knowledge[~ df_knowledge["gene_id"].str.startswith("ENSP")].shape

## vs entire dataset
df_knowledge[~ df_knowledge["gene_id"].str.startswith("ENSP")].shape[0] / df_knowledge.shape[0]

## unique values: can see that gene_id is a name (== gene_name), not an ID
df_knowledge[~ df_knowledge["gene_id"].str.startswith("ENSP")][["gene_id", "gene_name"]].drop_duplicates()

(41, 7)

0.00536789735532862

Unnamed: 0,gene_id,gene_name
0,ABHD11-AS1,ABHD11-AS1
7598,H19,H19
7602,KCNQ1OT1,KCNQ1OT1
7603,MIR17HG,MIR17HG
7604,MT-RNR1,MT-RNR1
7605,MT-TF,MT-TF
7606,MT-TH,MT-TH
7608,MT-TI,MT-TI
7610,MT-TK,MT-TK
7612,MT-TL1,MT-TL1


### gene_name column

Didn't find IDs in this column that aren't in `gene_id`. 

Don't need this column for Translator output

In [23]:
delimiters
possible_gene_namespaces

[',', ';', ':']

['ENSG', 'ensembl', 'uniprotkb', 'uniprot', 'hgnc']

In [24]:
## look for delimiters

print("text-mining")
check_if_contains(df_textmining, "gene_name", delimiters)

print("knowledge")
check_if_contains(df_knowledge, "gene_name", delimiters)

## didn't print any stats, so not found

text-mining
knowledge


In [25]:
## look for other ID namespaces

print("text-mining")
check_if_contains(df_textmining, "gene_name", possible_gene_namespaces)

print("knowledge")
check_if_contains(df_knowledge, "gene_name", possible_gene_namespaces)

## didn't print any stats, so not found

text-mining
knowledge


In [26]:
## for textmining only, are there rows with ENSP in gene_name but not gene_id?
## not for knowledge: already reviewed manually

df_no_ENSP_textmining[df_no_ENSP_textmining["gene_name"].str.contains("ENSP", case=False)]
## no, empty df

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,z_score,confidence_score,url


### disease_id

<div class="alert alert-block alert-info">

**A small portion of the data doesn't have DOID IDs!** Instead, they're AmyCo IDs (not usable). **Drop these rows from further parsing (can't NodeNorm)** 

In [27]:
## check for delimiters (between prefix and ID, between IDs)
## custom set since DOID: has :

print("text-mining")
check_if_contains(df_textmining, "disease_id", [",", ";", "_"])

print("knowledge")
check_if_contains(df_knowledge, "disease_id", [",", ";", "_"])

## didn't print any stats, so not found

text-mining
knowledge


#### Text-mining

In [28]:
## how much data doesn't have DOID ID
df_no_DOID_textmining = df_textmining[~ df_textmining["disease_id"].str.startswith("DOID:")].copy()
df_no_DOID_textmining.shape

## vs entire dataset
df_no_DOID_textmining.shape[0] / df_textmining.shape[0]

(42, 7)

0.00014554477061104546

In [29]:
## look at non-DOID data: unique values
df_no_DOID_textmining[["disease_id", "disease_name"]].drop_duplicates()

Unnamed: 0,disease_id,disease_name
51380,AmyCo:23,AH Amyloidosis
59407,AmyCo:35,Calcifying Epithelial Odontogenic Tumor
61404,AmyCo:73,Senile Seminal Vesicle Amyloidosis
79414,AmyCo:30,Wild-type transthyretin-related Amyloidosis
96613,AmyCo:27,Leukocyte chemotactic factor 2 Amyloidosis
127630,AmyCo:29,AA Amyloidosis
176518,AmyCo:37,Isolated Atrial Amyloidosis
187314,AmyCo:20,Hereditary lysozyme Amyloidosis
246160,AmyCo:24,AHL Amyloidosis
254995,AmyCo:8,Immunoglobulin Light-chain Amyloidosis


#### Knowledge

In [30]:
## how much data doesn't have DOID ID
df_no_DOID_knowledge = df_knowledge[~ df_knowledge["disease_id"].str.startswith("DOID:")].copy()
df_no_DOID_knowledge.shape

## vs entire dataset
df_no_DOID_knowledge.shape[0] / df_knowledge.shape[0]

(71, 7)

0.009295627127520294

In [31]:
## look at non-DOID data: unique values
df_no_DOID_knowledge[["disease_id", "disease_name"]].drop_duplicates()

Unnamed: 0,disease_id,disease_name
351,AmyCo:26,Apolipoprotein C-III associated Amyloidosis
508,AmyCo:16,Localized insulin-derived Amyloidosis
509,AmyCo:17,Apolipoprotein A-I associated Amyloidosis
511,AmyCo:63,Enfuvirtide-induced Amyloidosis
518,AmyCo:30,Wild-type transthyretin-related Amyloidosis
731,AmyCo:22,Apolipoprotein A-II associated Amyloidosis
732,AmyCo:23,AH Amyloidosis
733,AmyCo:24,AHL Amyloidosis
734,AmyCo:25,Apolipoprotein C-II associated Amyloidosis
736,AmyCo:27,Leukocyte chemotactic factor 2 Amyloidosis


### Other columns

Quick checks on what other data looks like. 

Didn't review disease_name: already reviewed non-DOID rows, don't need in Translator output 

#### Text-mining

In [32]:
## minimum being 3 means filtering was probably done, to remove anything with low z-score
df_textmining["z_score"].describe()

## confidence_score looks like z_score / 2 (1 point = 2 standard deviations above mean), capped at 4
df_textmining["confidence_score"].describe()

count    288571.000000
mean          3.900408
std           0.825318
min           3.000000
25%           3.279000
50%           3.664000
75%           4.280000
max          10.055000
Name: z_score, dtype: float64

count    288571.000000
mean          1.950045
std           0.411813
min           1.500000
25%           1.639000
50%           1.832000
75%           2.140000
max           4.000000
Name: confidence_score, dtype: float64
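
A quick sanity check of the apparent relationship between the two columns. The rule "half the z-score, capped at 4" is an assumption inferred from the outputs in this notebook, not something confirmed by the DISEASES documentation; the pairs are copied from the `df_textmining` preview shown earlier.

```python
# (z_score, confidence_score) pairs copied from the df_textmining preview
pairs = [
    (7.230, 3.615), (6.363, 3.181), (6.293, 3.146), (6.142, 3.071), (6.108, 3.054),
    (3.774, 1.887), (3.386, 1.693), (3.227, 1.613), (3.146, 1.573),
]


def expected_confidence(z_score, cap=4.0):
    """Assumed rule: half the z-score, capped at `cap`."""
    return min(z_score / 2, cap)


TOLERANCE = 0.001  # the file stores 3 decimals, so allow for rounding
mismatches = [(z, c) for z, c in pairs if abs(expected_confidence(z) - c) > TOLERANCE]
print(mismatches)
```

An empty list means the previewed rows are consistent with the assumed rule. Running the same comparison over the full column would be the stronger check.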

In [33]:
## url: check for delimiters (multiple urls)
## custom set since DOID: (:), some names have _

check_if_contains(df_textmining, "url", [",", ";"])

## didn't print any stats, so not found

In [34]:
## url: check for whitespace - none

df_textmining[df_textmining["url"].str.isspace()]

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,z_score,confidence_score,url


#### Knowledge

In [35]:
df_knowledge["source_db"].value_counts()

source_db
MedlinePlus     3770
UniProtKB-KW    3611
AmyCo            257
Name: count, dtype: int64

In [36]:
df_knowledge["evidence_type"].value_counts()
## always the same value

evidence_type
CURATED    7638
Name: count, dtype: int64

In [37]:
df_knowledge["confidence_score"].value_counts()

confidence_score
4    3817
5    3770
3      35
2      16
Name: count, dtype: int64

## Remove rows

<div class="alert alert-block alert-info">

Removing:
* knowledge's UniProtKB-KW data: doesn't seem accurate and likely has licensing issues (OMIM). Analysis is in the Slack canvas
* rows that don't have ENSP IDs in gene_id column: can't NodeNorm them
* rows that don't have DOID IDs in disease_id column: can't NodeNorm them

Not removing duplicates! Because planning to ingest file in chunks, so need other method (check set of already-done). 
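
For chunked ingestion, the three removal rules could be combined into one row-level predicate, roughly as sketched below. The function name `keep_row` is hypothetical, and rows are assumed to be dicts keyed by the column names used in this notebook.

```python
def keep_row(row):
    """Return True if the row passes all three removal rules listed above."""
    # Rule 1: drop knowledge rows from UniProtKB-KW (text-mining rows have no source_db)
    if row.get("source_db") == "UniProtKB-KW":
        return False
    # Rule 2: gene_id must be an ENSP ID, otherwise it can't be NodeNormed
    if not row["gene_id"].startswith("ENSP"):
        return False
    # Rule 3: disease_id must be a DOID, otherwise it can't be NodeNormed
    if not row["disease_id"].startswith("DOID:"):
        return False
    return True


sample_rows = [
    {"gene_id": "ABHD11-AS1", "disease_id": "DOID:1928", "source_db": "MedlinePlus"},
    {"gene_id": "ENSP00000001146", "disease_id": "DOID:2340", "source_db": "UniProtKB-KW"},
    {"gene_id": "ENSP00000003084", "disease_id": "DOID:1485", "source_db": "MedlinePlus"},
    # illustrative combination of a valid gene ID with an AmyCo disease ID
    {"gene_id": "ENSP00000003084", "disease_id": "AmyCo:23"},
]
results = [keep_row(row) for row in sample_rows]
print(results)
```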

In [38]:
## calculate stats before removing

nrows_textmining_original = df_textmining.shape[0]
nrows_knowledge_original = df_knowledge.shape[0]

In [39]:
## put into parser (don't do stats, print)

## remove knowledge's UniProtKB-KW data
df_knowledge = df_knowledge[df_knowledge["source_db"] != "UniProtKB-KW"]

## stats
nrows_knowledge_filtered = df_knowledge.shape[0]
print(f"knowledge: {nrows_knowledge_original - nrows_knowledge_filtered} rows removed")

knowledge: 3611 rows removed


In [40]:
## put into parser (don't do stats, print)

## remove rows that don't have ENSP IDs
df_textmining = df_textmining[df_textmining["gene_id"].str.startswith("ENSP")]
df_knowledge = df_knowledge[df_knowledge["gene_id"].str.startswith("ENSP")]


## stats
nrows_textmining_filtered = df_textmining.shape[0]
print(f"textmining: {nrows_textmining_original - nrows_textmining_filtered} rows removed")
## don't rewrite variable yet, so you can compare previous to now
print(f"knowledge: {nrows_knowledge_filtered - df_knowledge.shape[0]} rows removed")

## update this variable
nrows_knowledge_filtered = df_knowledge.shape[0]

textmining: 16505 rows removed
knowledge: 41 rows removed


In [41]:
## put into parser (don't do stats, print)

## remove rows that don't have DOID IDs
df_textmining = df_textmining[df_textmining["disease_id"].str.startswith("DOID:")]
df_knowledge = df_knowledge[df_knowledge["disease_id"].str.startswith("DOID:")]


## stats
## don't rewrite variable yet, so you can compare previous to now
print(f"textmining: {nrows_textmining_filtered - df_textmining.shape[0]} rows removed")
print(f"knowledge: {nrows_knowledge_filtered - df_knowledge.shape[0]} rows removed")

## update this variable
nrows_textmining_filtered = df_textmining.shape[0]
nrows_knowledge_filtered = df_knowledge.shape[0]

textmining: 42 rows removed
knowledge: 71 rows removed


In [42]:
## code for reviewing dataframes, to make sure rows were removed properly

# df_textmining
# df_textmining[~ df_textmining["disease_id"].str.contains("DOID")]

# df_knowledge
# df_knowledge[~ df_knowledge["disease_id"].str.contains("DOID")]
# df_knowledge[df_knowledge["source_db"] == "UniProtKB-KW"]

## AmyCo data still in dataframe (not all AmyCo rows used AmyCo disease IDs)
df_knowledge[df_knowledge["source_db"] == "AmyCo"]

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score
63,ENSP00000167586,KRT14,DOID:0050430,Multiple endocrine neoplasia type 2A,AmyCo,CURATED,2
64,ENSP00000167586,KRT14,DOID:0050639,Primary cutaneous amyloidosis,AmyCo,CURATED,4
66,ENSP00000167586,KRT14,DOID:2513,Basal cell carcinoma,AmyCo,CURATED,4
70,ENSP00000167586,KRT14,DOID:6498,Seborrheic keratosis,AmyCo,CURATED,4
71,ENSP00000167586,KRT14,DOID:7039,Borst-Jadassohn intraepidermal carcinoma,AmyCo,CURATED,3
...,...,...,...,...,...,...,...
7400,ENSP00000497221,ITM2B,DOID:0070029,ITM2B-related cerebral amyloid angiopathy 1,AmyCo,CURATED,4
7401,ENSP00000497221,ITM2B,DOID:0070030,ITM2B-related cerebral amyloid angiopathy 2,AmyCo,CURATED,4
7458,ENSP00000497910,B2M,DOID:0050747,DOID:0050747,AmyCo,CURATED,4
7586,ENSP00000500990,SNCA,DOID:12217,Lewy body dementia,AmyCo,CURATED,4


## Pre-NodeNorming

In [43]:
## put into parser (format)

## adding Translator/biolink prefixes to gene ID, for NodeNorming
## don't need to do DOID, already in right format

## loop for notebook only, to do both dataframes 
for df in [df_textmining, df_knowledge]:
    df["gene_id"] = "ENSEMBL:" + df["gene_id"]

In [44]:
## double-check how they look

df_knowledge

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score
2,ENSEMBL:ENSP00000003084,CFTR,DOID:0111862,Congenital bilateral absence of vas deferens,MedlinePlus,CURATED,5
3,ENSEMBL:ENSP00000003084,CFTR,DOID:1485,Cystic fibrosis,MedlinePlus,CURATED,5
4,ENSEMBL:ENSP00000005226,USH1C,DOID:0050439,Usher syndrome,MedlinePlus,CURATED,5
5,ENSEMBL:ENSP00000005226,USH1C,DOID:0050563,Nonsyndromic deafness,MedlinePlus,CURATED,5
11,ENSEMBL:ENSP00000013807,ERCC1,DOID:0050427,Xeroderma pigmentosum,MedlinePlus,CURATED,5
...,...,...,...,...,...,...,...
7590,ENSEMBL:ENSP00000500990,SNCA,DOID:14330,Parkinson's disease,MedlinePlus,CURATED,5
7592,ENSEMBL:ENSP00000500990,SNCA,DOID:4752,Multiple system atrophy,MedlinePlus,CURATED,5
7593,ENSEMBL:ENSP00000501092,TFAP2A,DOID:0050691,Branchiooculofacial syndrome,MedlinePlus,CURATED,5
7594,ENSEMBL:ENSP00000501092,TFAP2A,DOID:12270,Coloboma,MedlinePlus,CURATED,5


Querying NodeNorm: send unique values (no duplicates!) in large batches -> generate mapping dict to use. 
<br>
__Not querying 1-by-1 or 1 row at a time: much slower__ and would involve sending duplicate IDs (unless saved dict is kept outside loop and checked) 
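
When the file is ingested in chunks, the "saved dict kept outside the loop" idea might look like the sketch below. `fake_nodenorm_lookup` is a hypothetical stand-in for one batched request and makes no network call; the `FAKE:` identifiers are placeholders, not real NodeNorm output.

```python
def fake_nodenorm_lookup(curies):
    """Stand-in for one batched NodeNorm request (hypothetical, no network call)."""
    return {curie: {"primary_id": f"FAKE:{curie[-4:]}"} for curie in curies}


mapping_cache = {}   # persists across chunks
lookup_calls = []    # records what each "request" would have sent

chunks = [
    ["ENSEMBL:ENSP00000003084", "ENSEMBL:ENSP00000003084", "ENSEMBL:ENSP00000005226"],
    ["ENSEMBL:ENSP00000005226", "ENSEMBL:ENSP00000013807"],
]
for chunk_ids in chunks:
    # de-duplicate within the chunk and skip IDs resolved in earlier chunks
    new_ids = sorted({curie for curie in chunk_ids if curie not in mapping_cache})
    if new_ids:
        lookup_calls.append(new_ids)
        mapping_cache.update(fake_nodenorm_lookup(new_ids))

print(lookup_calls)
```

Each ID is sent once, even when it repeats within a chunk or shows up again in a later one.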

<div class="alert alert-block alert-danger">

Set the NodeNorm URL you want to use. 

In [45]:
## put into parser (format)

import requests

## from BioThings annotator code: for interoperability between diff Python versions
try:
    from itertools import batched  # new in Python 3.12
except ImportError:
    from itertools import islice

    def batched(iterable, n):
        # batched('ABCDEFG', 3) → ABC DEF G
        if n < 1:
            raise ValueError("n must be at least one")
        iterator = iter(iterable)
        while batch := tuple(islice(iterator, n)):
            yield batch

## dev instance
nodenorm_url = "https://nodenormalization-sri.renci.org/get_normalized_nodes"
# nodenorm_url = "https://nodenorm.ci.transltr.io/get_normalized_nodes"

### Genes

In [46]:
## put into parser (set unique on curie list)

## get set of unique CURIEs to put into NodeNorm
geneIDs_textmining = set(df_textmining["gene_id"].unique())
geneIDs_knowledge = set(df_knowledge["gene_id"].unique())
geneIDs_all = geneIDs_textmining | geneIDs_knowledge

len(geneIDs_textmining)
len(geneIDs_knowledge)
len(geneIDs_all)

16847

2535

16850

In [47]:
## put into parser (partial?)

gene_nodenorm_mapping = {}

## set up variables to catch potential mapping failures
stats_gene_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
}

Allowed categories are `biolink:Gene` and `biolink:Protein` (depending on whether gene/protein conflation worked for the ENSP ID)
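
The per-ID checks (recognized by NodeNorm, allowed category, has a primary label) could also be written as a flat helper, as sketched here. The name `classify_entry` is hypothetical, and the example entries are made up; they only mimic the response fields this notebook reads (`type`, `id.identifier`, `id.label`).

```python
ALLOWED_GENE_CATEGORIES = {"biolink:Gene", "biolink:Protein"}


def classify_entry(entry, allowed_categories):
    """Return (status, payload) for one value of the NodeNorm response dict."""
    if entry is None:
        return "nodenorm_returned_none", None
    category = entry["type"][0]
    if category not in allowed_categories:
        return "wrong_category", category
    label = entry["id"].get("label")
    if not label:
        return "no_label", None
    mapping = {
        "primary_id": entry["id"]["identifier"],
        "primary_label": label,
        "category": category,
    }
    return "ok", mapping


# Made-up entries shaped like the fields accessed in this notebook
examples = {
    "recognized": {"id": {"identifier": "NCBIGene:1080", "label": "CFTR"},
                   "type": ["biolink:Gene"]},
    "unlabeled": {"id": {"identifier": "ENSEMBL:ENSP00000215727"},
                  "type": ["biolink:Protein"]},
    "other_category": {"id": {"identifier": "X:1", "label": "x"},
                       "type": ["biolink:Disease"]},
    "unknown": None,
}
statuses = {name: classify_entry(entry, ALLOWED_GENE_CATEGORIES)[0]
            for name, entry in examples.items()}
print(statuses)
```

The status strings match the keys of `stats_gene_mapping_failures`, so a caller could use them directly to record failures.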

In [48]:
## put into parser (don't need to iterate through batch, just use all in chunk you're on)

## larger batches are quicker
for batch in batched(geneIDs_all, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    response = r.json()
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] in ["biolink:Gene", "biolink:Protein"]:
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"],
                                ## saving category: currently required for KGX nodes file output
                                "category": v["type"][0]
                               }
                        }
                        gene_nodenorm_mapping.update(temp)
                    else:
                        stats_gene_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_gene_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_gene_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_gene_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [49]:
len(gene_nodenorm_mapping)

stats_gene_mapping_failures["unexpected_error"]

len(stats_gene_mapping_failures["nodenorm_returned_none"])
len(stats_gene_mapping_failures["wrong_category"])
len(stats_gene_mapping_failures["no_label"])

16477

{}

315

0

58

In [50]:
## code block for reviewing failures

# stats_gene_mapping_failures["nodenorm_returned_none"][270:280]
stats_gene_mapping_failures["no_label"][50:58]

['ENSEMBL:ENSP00000479693',
 'ENSEMBL:ENSP00000328848',
 'ENSEMBL:ENSP00000222388',
 'ENSEMBL:ENSP00000305742',
 'ENSEMBL:ENSP00000380432',
 'ENSEMBL:ENSP00000377616',
 'ENSEMBL:ENSP00000215727',
 'ENSEMBL:ENSP00000388552']

<div class="alert alert-block alert-info">

**2025-05-02 data:**

**Reviewed ENSEMBL ENSP mapping failures**

315 cases where NodeNorm returned None (didn't recognize/resolve ID). __I checked some (20)__ (0:10, 270:280). __Conclusion: DISEASES problem, using outdated IDs__

* 17: ID is deprecated. Language: "no longer in the database but it has been mapped to 1 deprecated identifier"
* 3: ID has been replaced? Language: "no longer in the database.It has been mapped to 1 current identifier". (ENSP00000074304, ENSP00000480798, ENSP00000482552)

<br>
    
58 cases where NodeNorm didn't have a primary label. __I checked some (18)__ (0:10, 50:58). __Conclusion: NodeNorm issue sometimes, other times a complex biology/data situation__

All were legit IDs, but I guess NodeNorm isn't mapping them to their genes. Some thoughts:
* noticed many genes were overlapping locus/readthrough gene
* But some of these genes look "fairly regular" and the ENSP ID is the "Ensembl Canonical". So it's strange that NodeNorm isn't mapping them to their genes -> messaged NodeNorm
  * ENSP00000419081 for CALML4 ENSG00000129007 
  * ENSP00000211076 for TPSD1 ENSG00000095917 (not overlapping!)
  * ENSP00000361746 for EPPIN ENSG00000101448
  * ENSP00000215727 for SERPIND1 ENSG00000099937 (not overlapping!)
  * ENSP00000348089 for ERCC6 ENSG00000225830 (not overlapping!)
  * ENSP00000361232 for DDX31 ENSG00000125485 (not overlapping!)

In [51]:
## calculate stats: number of rows affected by each type of mapping failure
stats_gene_mapping_failures.update({
    "nrows_textmining_none": df_textmining[df_textmining["gene_id"].isin(stats_gene_mapping_failures["nodenorm_returned_none"])].shape[0],
    "nrows_knowledge_none": df_knowledge[df_knowledge["gene_id"].isin(stats_gene_mapping_failures["nodenorm_returned_none"])].shape[0],

    "n_rows_textmining_wrongcategory": df_textmining[df_textmining["gene_id"].isin(stats_gene_mapping_failures["wrong_category"].keys())].shape[0],
    "n_rows_knowledge_wrongcategory": df_knowledge[df_knowledge["gene_id"].isin(stats_gene_mapping_failures["wrong_category"].keys())].shape[0],

    
    "n_rows_textmining_nolabel": df_textmining[df_textmining["gene_id"].isin(stats_gene_mapping_failures["no_label"])].shape[0],
    "n_rows_knowledge_nolabel": df_knowledge[df_knowledge["gene_id"].isin(stats_gene_mapping_failures["no_label"])].shape[0],
})

In [52]:
stats_gene_mapping_failures["nrows_textmining_none"]
stats_gene_mapping_failures["nrows_knowledge_none"]
stats_gene_mapping_failures["n_rows_textmining_wrongcategory"]
stats_gene_mapping_failures["n_rows_knowledge_wrongcategory"]
stats_gene_mapping_failures["n_rows_textmining_nolabel"]
stats_gene_mapping_failures["n_rows_knowledge_nolabel"]

4232

52

0

0

1419

8

### Diseases

In [53]:

## get set of unique CURIEs to put into NodeNorm
diseaseIDs_textmining = set(df_textmining["disease_id"].unique())
diseaseIDs_knowledge = set(df_knowledge["disease_id"].unique())
diseaseIDs_all = diseaseIDs_textmining | diseaseIDs_knowledge

len(diseaseIDs_textmining)
len(diseaseIDs_knowledge)
len(diseaseIDs_all)

5556

1036

5671

In [54]:

disease_nodenorm_mapping = {}

## set up variables to catch mapping failures
stats_disease_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [55]:


## larger batches are quicker
for batch in batched(diseaseIDs_all, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    ## stop early on HTTP errors instead of parsing an error body
    r.raise_for_status()
    response = r.json()
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Disease":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        disease_nodenorm_mapping.update(temp)
                    else:
                        stats_disease_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_disease_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_disease_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_disease_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')
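
The nested `if`/`else` checks above can also be written as one small function that returns a status and a payload. This is only a sketch: `classify_nodenorm_entry` is a hypothetical helper name, and the example response mimics the shape the loop above reads (`type`, `id.identifier`, `id.label`) rather than a real NodeNorm reply.

```python
def classify_nodenorm_entry(entry, expected_category):
    """Sort one NodeNorm response value into a (status, payload) pair.

    status is "ok", "nodenorm_returned_none", "wrong_category" or "no_label".
    """
    # the notebook treats None as "NodeNorm didn't have info on this ID"
    if entry is None:
        return "nodenorm_returned_none", None
    # the notebook uses the first entry of "type" as the category
    category = entry["type"][0]
    if category != expected_category:
        return "wrong_category", category
    # a missing or empty label also counts as a failure
    label = entry["id"].get("label")
    if not label:
        return "no_label", None
    return "ok", {"primary_id": entry["id"]["identifier"], "primary_label": label}


# hand-written example in the shape the loop expects (not a real API response)
example_response = {
    "DOID:162": {
        "id": {"identifier": "MONDO:0004992", "label": "cancer"},
        "type": ["biolink:Disease"],
    },
    "DOID:1607": None,
}
results = {
    curie: classify_nodenorm_entry(entry, "biolink:Disease")
    for curie, entry in example_response.items()
}
```

Flat early returns keep each failure type on its own line, which may make the gene and disease loops easier to share in the parser.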

In [56]:
len(disease_nodenorm_mapping)

stats_disease_mapping_failures["unexpected_error"]

len(stats_disease_mapping_failures["nodenorm_returned_none"])
len(stats_disease_mapping_failures["wrong_category"])
len(stats_disease_mapping_failures["no_label"])

5641

{}

30

0

0

In [57]:
## code used to review mapping failures 

stats_disease_mapping_failures["nodenorm_returned_none"][0:10]

['DOID:1607',
 'DOID:0080365',
 'DOID:1920',
 'DOID:2383',
 'DOID:0060321',
 'DOID:4377',
 'DOID:11595',
 'DOID:11282',
 'DOID:10986',
 'DOID:14070']

<div class="alert alert-block alert-info">

**2025-05-02 data:**

**Reviewed Disease DOID mapping failures**

30 cases where NodeNorm returned None (didn't recognize/resolve the ID). __I checked some (first 10)__:

All were legitimate IDs. I don't understand why NodeNorm returned None -> messaged NodeNorm

In [58]:
## calculate stats: number of rows affected by each type of mapping failure
stats_disease_mapping_failures.update({
    "nrows_textmining_none": df_textmining[df_textmining["disease_id"].isin(stats_disease_mapping_failures["nodenorm_returned_none"])].shape[0],
    "nrows_knowledge_none": df_knowledge[df_knowledge["disease_id"].isin(stats_disease_mapping_failures["nodenorm_returned_none"])].shape[0],

    "n_rows_textmining_wrongcategory": df_textmining[df_textmining["disease_id"].isin(stats_disease_mapping_failures["wrong_category"].keys())].shape[0],
    "n_rows_knowledge_wrongcategory": df_knowledge[df_knowledge["disease_id"].isin(stats_disease_mapping_failures["wrong_category"].keys())].shape[0],

    
    "n_rows_textmining_nolabel": df_textmining[df_textmining["disease_id"].isin(stats_disease_mapping_failures["no_label"])].shape[0],
    "n_rows_knowledge_nolabel": df_knowledge[df_knowledge["disease_id"].isin(stats_disease_mapping_failures["no_label"])].shape[0],
})

In [59]:
stats_disease_mapping_failures["nrows_textmining_none"]
stats_disease_mapping_failures["nrows_knowledge_none"]
stats_disease_mapping_failures["n_rows_textmining_wrongcategory"]
stats_disease_mapping_failures["n_rows_knowledge_wrongcategory"]
stats_disease_mapping_failures["n_rows_textmining_nolabel"]
stats_disease_mapping_failures["n_rows_knowledge_nolabel"]

624

10

0

0

0

0
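
The row counts per failure type are computed the same way for genes and diseases, so the pattern could be pulled into one function. A minimal sketch, assuming a hypothetical helper `count_affected_rows` that accepts any iterable of IDs (a DataFrame column would work too); the toy inputs are made up for illustration and are not the real counts.

```python
def count_affected_rows(ids, failures):
    """Count how many rows carry an ID from each failure group.

    ids: iterable of IDs, one per row (for example a DataFrame column).
    failures: dict mapping failure name -> collection of failed IDs
              (a list, a set, or a dict keyed by ID).
    """
    # sets make the membership test fast; set(dict) keeps only the keys
    failure_sets = {name: set(failed) for name, failed in failures.items()}
    counts = {name: 0 for name in failure_sets}
    for row_id in ids:
        for name, failed in failure_sets.items():
            if row_id in failed:
                counts[name] += 1
    return counts


# toy data for illustration only
toy_ids = ["DOID:1607", "DOID:162", "DOID:1607", "DOID:1920"]
toy_failures = {
    "nodenorm_returned_none": ["DOID:1607", "DOID:1920"],
    "wrong_category": {},
    "no_label": [],
}
toy_counts = count_affected_rows(toy_ids, toy_failures)
```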

### Adding NodeNorm data, removing rows

<div class="alert alert-block alert-info">

Removing rows that lack NodeNorm data. Only a small number of rows are affected by these NodeNorm failures: 
* NodeNorm returned None
* NodeNorm didn't have a primary label (genes only) 

In [60]:
## put into parser (format): 

## text-mining
df_textmining["gene_nodenorm_id"] = [gene_nodenorm_mapping[i]["primary_id"] 
                                     if gene_nodenorm_mapping.get(i) 
                                     else pd.NA
                                     for i in df_textmining["gene_id"]]

df_textmining["gene_nodenorm_label"] = [gene_nodenorm_mapping[i]["primary_label"] 
                                        if gene_nodenorm_mapping.get(i) 
                                        else pd.NA
                                        for i in df_textmining["gene_id"]]

df_textmining["gene_nodenorm_category"] = [gene_nodenorm_mapping[i]["category"] 
                                           if gene_nodenorm_mapping.get(i) 
                                           else pd.NA
                                           for i in df_textmining["gene_id"]]

df_textmining["disease_nodenorm_id"] = [disease_nodenorm_mapping[i]["primary_id"] 
                                        if disease_nodenorm_mapping.get(i) 
                                        else pd.NA
                                        for i in df_textmining["disease_id"]]

df_textmining["disease_nodenorm_label"] = [disease_nodenorm_mapping[i]["primary_label"] 
                                           if disease_nodenorm_mapping.get(i) 
                                           else pd.NA
                                           for i in df_textmining["disease_id"]]

In [61]:

## knowledge
df_knowledge["gene_nodenorm_id"] = [gene_nodenorm_mapping[i]["primary_id"] 
                                    if gene_nodenorm_mapping.get(i) 
                                    else pd.NA
                                    for i in df_knowledge["gene_id"]]

df_knowledge["gene_nodenorm_label"] = [gene_nodenorm_mapping[i]["primary_label"] 
                                       if gene_nodenorm_mapping.get(i) 
                                       else pd.NA
                                       for i in df_knowledge["gene_id"]]

df_knowledge["gene_nodenorm_category"] = [gene_nodenorm_mapping[i]["category"] 
                                          if gene_nodenorm_mapping.get(i) 
                                          else pd.NA
                                          for i in df_knowledge["gene_id"]]

df_knowledge["disease_nodenorm_id"] = [disease_nodenorm_mapping[i]["primary_id"] 
                                       if disease_nodenorm_mapping.get(i) 
                                       else pd.NA
                                       for i in df_knowledge["disease_id"]]

df_knowledge["disease_nodenorm_label"] = [disease_nodenorm_mapping[i]["primary_label"] 
                                          if disease_nodenorm_mapping.get(i) 
                                          else pd.NA
                                          for i in df_knowledge["disease_id"]]
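
The list comprehensions above all follow one pattern: look up an ID in a mapping, take one field, and fall back to a missing value. A sketch of that pattern as a function, assuming a hypothetical helper `lookup_field`; in the notebook the `missing` argument would be `pd.NA`, and the unmapped ID below is invented for the example.

```python
def lookup_field(mapping, ids, field, missing=None):
    """Return mapping[i][field] for each ID, or `missing` when the ID has no mapping."""
    return [mapping[i][field] if i in mapping else missing for i in ids]


# one real pairing from the tables in this notebook, plus an invented unmapped ID
toy_mapping = {
    "ENSEMBL:ENSP00000000233": {
        "primary_id": "NCBIGene:381",
        "primary_label": "ARF5",
        "category": "biolink:Gene",
    }
}
toy_ids = ["ENSEMBL:ENSP00000000233", "ENSEMBL:ENSP_not_mapped"]
toy_primary_ids = lookup_field(toy_mapping, toy_ids, "primary_id")
toy_labels = lookup_field(toy_mapping, toy_ids, "primary_label")
```

With this helper, each new column becomes one call such as `lookup_field(gene_nodenorm_mapping, df_knowledge["gene_id"], "primary_id", missing=pd.NA)`.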

In [None]:
## code chunk to review df

# df_textmining
# df_knowledge

In [62]:
## put into parser: 

## using subset just in case (original data didn't have any NAs that would be
##   dropped by accident, but still) 

df_textmining.dropna(subset=["gene_nodenorm_id", "gene_nodenorm_label", 
                            "disease_nodenorm_id", "disease_nodenorm_label"],
                    ignore_index=True, inplace=True)

df_knowledge.dropna(subset=["gene_nodenorm_id", "gene_nodenorm_label", 
                            "disease_nodenorm_id", "disease_nodenorm_label"],
                    ignore_index=True, inplace=True)

In [63]:
## didn't do stats for before/after

df_textmining.shape
df_knowledge.shape

(265761, 12)

(3846, 12)

### Exploring "duplicates" from NodeNorming

In [66]:
df_knowledge_noTotalDups = df_knowledge.drop_duplicates()

df_knowledge_noTotalDups[
    df_knowledge_noTotalDups.duplicated(
        subset=["gene_nodenorm_id", "disease_nodenorm_id", "source_db"], 
        keep=False)].sort_values(by=["gene_nodenorm_id", "disease_nodenorm_id"])

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score,gene_nodenorm_id,gene_nodenorm_label,gene_nodenorm_category,disease_nodenorm_id,disease_nodenorm_label
2216,ENSEMBL:ENSP00000361021,PTEN,DOID:0050657,Bannayan-Riley-Ruvalcaba syndrome,MedlinePlus,CURATED,5,UniProtKB:P60484-1,"phosphatidylinositol 3,4,5-trisphosphate 3-pho...",biolink:Protein,MONDO:0016063,Cowden disease
2224,ENSEMBL:ENSP00000361021,PTEN,DOID:6457,Cowden syndrome,MedlinePlus,CURATED,5,UniProtKB:P60484-1,"phosphatidylinositol 3,4,5-trisphosphate 3-pho...",biolink:Protein,MONDO:0016063,Cowden disease
340,ENSEMBL:ENSP00000251595,HBA2,DOID:1099,Alpha thalassemia,MedlinePlus,CURATED,5,UniProtKB:P69905,HBA_HUMAN Hemoglobin subunit alpha (sprot),biolink:Protein,MONDO:0011399,alpha thalassemia spectrum
1388,ENSEMBL:ENSP00000322421,HBA1,DOID:1099,Alpha thalassemia,MedlinePlus,CURATED,5,UniProtKB:P69905,HBA_HUMAN Hemoglobin subunit alpha (sprot),biolink:Protein,MONDO:0011399,alpha thalassemia spectrum
117,ENSEMBL:ENSP00000221700,CYP4F2,DOID:0080665,Warfarin resistance,MedlinePlus,CURATED,5,UniProtKB:P78329-1,cytochrome P450 4F2 isoform h1 (human),biolink:Protein,MONDO:0007390,coumarin resistance
118,ENSEMBL:ENSP00000221700,CYP4F2,DOID:0080666,Warfarin sensitivity,MedlinePlus,CURATED,5,UniProtKB:P78329-1,cytochrome P450 4F2 isoform h1 (human),biolink:Protein,MONDO:0007390,coumarin resistance
2539,ENSEMBL:ENSP00000370083,SMN1,DOID:12377,Spinal muscular atrophy,MedlinePlus,CURATED,5,UniProtKB:Q16637-1,survival motor neuron protein isoform 1 (human),biolink:Protein,MONDO:0001516,spinal muscular atrophy
2542,ENSEMBL:ENSP00000370119,SMN2,DOID:12377,Spinal muscular atrophy,MedlinePlus,CURATED,5,UniProtKB:Q16637-1,survival motor neuron protein isoform 1 (human),biolink:Protein,MONDO:0001516,spinal muscular atrophy
2693,ENSEMBL:ENSP00000378426,VKORC1,DOID:0080665,Warfarin resistance,MedlinePlus,CURATED,5,UniProtKB:Q9BQB6-1,vitamin K epoxide reductase complex subunit 1 ...,biolink:Protein,MONDO:0007390,coumarin resistance
2694,ENSEMBL:ENSP00000378426,VKORC1,DOID:0080666,Warfarin sensitivity,MedlinePlus,CURATED,5,UniProtKB:Q9BQB6-1,vitamin K epoxide reductase complex subunit 1 ...,biolink:Protein,MONDO:0007390,coumarin resistance


In [67]:
df_textmining[
    df_textmining.duplicated(
        subset=["gene_nodenorm_id", "disease_nodenorm_id"], 
        keep=False)].sort_values(by=["gene_nodenorm_id", "disease_nodenorm_id"]).shape

df_textmining[
    df_textmining.duplicated(
        subset=["gene_nodenorm_id", "disease_nodenorm_id"], 
        keep=False)].sort_values(by=["gene_nodenorm_id", "disease_nodenorm_id"])

(1122, 12)

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,z_score,confidence_score,url,gene_nodenorm_id,gene_nodenorm_label,gene_nodenorm_category,disease_nodenorm_id,disease_nodenorm_label
12467,ENSEMBL:ENSP00000222553,NAMPT,DOID:11612,Polycystic ovary syndrome,4.966,2.483,https://diseases.jensenlab.org/Entity?document...,NCBIGene:10135,NAMPT,biolink:Gene,MONDO:0008487,Androgen excess
12484,ENSEMBL:ENSP00000222553,NAMPT,DOID:11613,Hyperandrogenism,3.953,1.977,https://diseases.jensenlab.org/Entity?document...,NCBIGene:10135,NAMPT,biolink:Gene,MONDO:0008487,Androgen excess
34107,ENSEMBL:ENSP00000256958,SLCO1B1,DOID:0080666,Warfarin sensitivity,4.099,2.050,https://diseases.jensenlab.org/Entity?document...,NCBIGene:10599,SLCO1B1,biolink:Gene,MONDO:0007390,coumarin resistance
34135,ENSEMBL:ENSP00000256958,SLCO1B1,DOID:0080665,Warfarin resistance,3.236,1.618,https://diseases.jensenlab.org/Entity?document...,NCBIGene:10599,SLCO1B1,biolink:Gene,MONDO:0007390,coumarin resistance
80319,ENSEMBL:ENSP00000299198,CKB,DOID:1575,Rheumatic disease,4.728,2.364,https://diseases.jensenlab.org/Entity?document...,NCBIGene:1152,CKB,biolink:Gene,MONDO:0005554,Rheumatism
...,...,...,...,...,...,...,...,...,...,...,...,...
85769,ENSEMBL:ENSP00000303178,CDY1B,DOID:12336,Male infertility,5.439,2.719,https://diseases.jensenlab.org/Entity?document...,UniProtKB:Q9Y6F8-2,testis-specific chromodomain protein Y 1 isofo...,biolink:Protein,MONDO:0005372,male infertility
85573,ENSEMBL:ENSP00000302968,CDY1,DOID:1921,Klinefelter syndrome,4.214,2.107,https://diseases.jensenlab.org/Entity?document...,UniProtKB:Q9Y6F8-2,testis-specific chromodomain protein Y 1 isofo...,biolink:Protein,MONDO:0006823,Klinefelter syndrome
85771,ENSEMBL:ENSP00000303178,CDY1B,DOID:1921,Klinefelter syndrome,3.726,1.863,https://diseases.jensenlab.org/Entity?document...,UniProtKB:Q9Y6F8-2,testis-specific chromodomain protein Y 1 isofo...,biolink:Protein,MONDO:0006823,Klinefelter syndrome
85575,ENSEMBL:ENSP00000302968,CDY1,DOID:11383,Cryptorchidism,3.720,1.860,https://diseases.jensenlab.org/Entity?document...,UniProtKB:Q9Y6F8-2,testis-specific chromodomain protein Y 1 isofo...,biolink:Protein,MONDO:0009047,cryptorchidism
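
These "duplicates" appear because several original IDs normalize to the same NodeNorm ID (for example HBA1 and HBA2 both map to UniProtKB:P69905). A sketch of how the original ID pairs behind each normalized pair could be listed, assuming rows are plain dicts (such as those from `df.to_dict("records")`) and a hypothetical helper name `group_by_normalized_pair`; the three rows are copied from the tables in this notebook.

```python
from collections import defaultdict


def group_by_normalized_pair(rows):
    """Map each (gene_nodenorm_id, disease_nodenorm_id) pair to its original ID pairs."""
    groups = defaultdict(set)
    for row in rows:
        key = (row["gene_nodenorm_id"], row["disease_nodenorm_id"])
        groups[key].add((row["gene_id"], row["disease_id"]))
    # keep only normalized pairs reached from more than one original pair
    return {key: originals for key, originals in groups.items() if len(originals) > 1}


rows = [
    {"gene_id": "ENSEMBL:ENSP00000251595", "disease_id": "DOID:1099",
     "gene_nodenorm_id": "UniProtKB:P69905", "disease_nodenorm_id": "MONDO:0011399"},
    {"gene_id": "ENSEMBL:ENSP00000322421", "disease_id": "DOID:1099",
     "gene_nodenorm_id": "UniProtKB:P69905", "disease_nodenorm_id": "MONDO:0011399"},
    {"gene_id": "ENSEMBL:ENSP00000269703", "disease_id": "DOID:0060655",
     "gene_nodenorm_id": "NCBIGene:126410", "disease_nodenorm_id": "MONDO:0017265"},
]
collapsed = group_by_normalized_pair(rows)
```

Here `collapsed` holds only the alpha thalassemia pair, with both original ENSP IDs listed under it.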


## Generating documents

In [69]:
## put in parser!!
## want jsonlines format

import jsonlines

### Rows not included

<div class="alert alert-block alert-info">

* knowledge's UniProtKB-KW data
* No ENSP in gene_id columns (seemed to be names, couldn't NodeNorm)
* No DOID in disease_id columns (can't NodeNorm AmyCo)
* NodeNorm mapping failures for gene or disease IDs
* duplicates: checked when generating docs; a doc is not created if one was already written for the same row

### Columns not included

<div class="alert alert-block alert-info">

- gene_name
- disease_name
- evidence_type: same for all rows, not needed

In [68]:
df_textmining.columns

df_knowledge.columns

Index(['gene_id', 'gene_name', 'disease_id', 'disease_name', 'z_score',
       'confidence_score', 'url', 'gene_nodenorm_id', 'gene_nodenorm_label',
       'gene_nodenorm_category', 'disease_nodenorm_id',
       'disease_nodenorm_label'],
      dtype='object')

Index(['gene_id', 'gene_name', 'disease_id', 'disease_name', 'source_db',
       'evidence_type', 'confidence_score', 'gene_nodenorm_id',
       'gene_nodenorm_label', 'gene_nodenorm_category', 'disease_nodenorm_id',
       'disease_nodenorm_label'],
      dtype='object')

### File: List of TRAPI edges 

In [73]:
## code chunk for testing parts of inner code

with jsonlines.open('DISEASES_trapi_edges.jsonl', mode='w', compact=True) as trapi_writer:
    knowledge_tally = set()
    
    ## using itertuples because it's faster, preserves datatypes
    for row in df_knowledge.itertuples(index=False):
        ## construct row abbreviation: needs source_db!
        temp = row.gene_id + "_" + row.disease_id + "_" + row.source_db
        
        if temp not in knowledge_tally:
            knowledge_tally.add(temp)

            document = {
                ## simple assignments: no if
                "subject": row.gene_nodenorm_id,
                "predicate": "biolink:genetically_associated_with",
                "object": row.disease_nodenorm_id,
                "attributes": [
                    {
                        "attribute_type_id": "biolink:knowledge_level",
                        "value": "knowledge_assertion"
                    },
                    {
                        "attribute_type_id": "biolink:agent_type",
                        "value": "manual_agent"
                    },
                    {   ## needs data-modeling/TRAPI validation review
                        "attribute_type_id": "SEPIO:0000168",
                        "value": row.confidence_score
                    },
                    {
                        "attribute_type_id": "biolink:original_subject",
                        "original_attribute_name": "gene_id",  ## original column name
                        "value": row.gene_id
                    },
                    {
                        "attribute_type_id": "biolink:original_object",
                        "original_attribute_name": "disease_id",  ## original column name
                        "value": row.disease_id
                    },
                ]
            }
            ## sources: depends on source_db value
            if row.source_db == "MedlinePlus":
                document["sources"] = [
                    {
                        "resource_id": "infores:medlineplus",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:diseases",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": ["infores:medlineplus"]
                    }
                ]
            elif row.source_db == "AmyCo":
                document["sources"] = [
                    {   ## not in infores registry yet!
                        "resource_id": "infores:amyco",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:diseases",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": ["infores:amyco"]
                    }
                ]
            else:
                raise ValueError(f"Unexpected source_db value during source mapping: {row.source_db}. Adjust parser.")
            
            ## assign the return value so it doesn't print (avoid shadowing the builtin `bytes`)
            _ = trapi_writer.write(document)
        else:
            ## won't write the document
            print(f"duplicate row encountered: {temp}")
            break

duplicate row encountered: ENSEMBL:ENSP00000269703_DOID:0060655_MedlinePlus
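
The `if`/`elif` branches that fill `document["sources"]` differ only in the primary source, so a lookup table is one way to shorten them. A minimal sketch, assuming hypothetical names `PRIMARY_SOURCE_BY_DB` and `build_sources`; the infores IDs are the ones used in the cell above.

```python
PRIMARY_SOURCE_BY_DB = {
    "MedlinePlus": "infores:medlineplus",
    "AmyCo": "infores:amyco",  # not in the infores registry yet, per the notebook
}


def build_sources(source_db, aggregator="infores:diseases"):
    """Build the TRAPI-style sources list for one knowledge row."""
    try:
        primary = PRIMARY_SOURCE_BY_DB[source_db]
    except KeyError:
        # same message as the notebook's else-branch
        raise ValueError(
            f"Unexpected source_db value during source mapping: {source_db}. Adjust parser."
        ) from None
    return [
        {"resource_id": primary, "resource_role": "primary_knowledge_source"},
        {
            "resource_id": aggregator,
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": [primary],
        },
    ]


medlineplus_sources = build_sources("MedlinePlus")
```

Adding another source database then means adding one dictionary entry rather than another branch.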


In [76]:
## put into parser (format): 
## separate functions for textmined vs not, trapi edges vs kgx??
## will want the tally to live outside the reading data -> writing loop, so it can
##   keep track of what edges were already done

## if two different output file formats are needed at the same time, could open
##   two separate writers and use both at once 


with jsonlines.open('DISEASES_trapi_edges.jsonl', mode='w', compact=True) as trapi_writer:
    
## text-mined data: 
    textmining_tally = set()
    
    ## using itertuples because it's faster, preserves datatypes
    for row in df_textmining.itertuples(index=False):
        ## construct row abbreviation
        temp = row.gene_id + "_" + row.disease_id
        
        if temp not in textmining_tally:
            textmining_tally.add(temp)

            document = {
                "subject": row.gene_nodenorm_id,
                "predicate": "biolink:occurs_together_in_literature_with",
                "object": row.disease_nodenorm_id,
                "sources": [
                    {
                        "resource_id": "infores:diseases",
                        "resource_role": "primary_knowledge_source",
                        "source_record_urls": [row.url]
                    }
                ],
                "attributes": [
                    {
                        "attribute_type_id": "biolink:knowledge_level",
                        "value": "statistical_association"
                    },
                    {
                        "attribute_type_id": "biolink:agent_type",
                        "value": "text_mining_agent"
                    },
                    {   ## needs data-modeling/TRAPI validation review
                        "attribute_type_id": "STATO:0000104",
                        "value": row.z_score
                    },
                    {   ## needs data-modeling/TRAPI validation review
                        "attribute_type_id": "SEPIO:0000168",
                        "value": row.confidence_score
                    },
                    {
                        "attribute_type_id": "biolink:original_subject",
                        "original_attribute_name": "gene_id",  ## original column name
                        "value": row.gene_id
                    },
                    {
                        "attribute_type_id": "biolink:original_object",
                        "original_attribute_name": "disease_id",  ## original column name
                        "value": row.disease_id
                    },
                ]
            }
            ## assign the return value so it doesn't print (avoid shadowing the builtin `bytes`)
            _ = trapi_writer.write(document)
        else:
            ## won't write the document
            print(f"duplicate row encountered: {temp}")

## knowledge data: parser - separate function (diff row abbreviation, document)
    knowledge_tally = set()
    
    ## using itertuples because it's faster, preserves datatypes
    for row in df_knowledge.itertuples(index=False):
        ## construct row abbreviation: needs source_db!
        temp = row.gene_id + "_" + row.disease_id + "_" + row.source_db
        
        if temp not in knowledge_tally:
            knowledge_tally.add(temp)

            document = {
                ## simple assignments: no if
                "subject": row.gene_nodenorm_id,
                "predicate": "biolink:genetically_associated_with",
                "object": row.disease_nodenorm_id,
                "attributes": [
                    {
                        "attribute_type_id": "biolink:knowledge_level",
                        "value": "knowledge_assertion"
                    },
                    {
                        "attribute_type_id": "biolink:agent_type",
                        "value": "manual_agent"
                    },
                    {   ## needs data-modeling/TRAPI validation review
                        "attribute_type_id": "SEPIO:0000168",
                        "value": row.confidence_score
                    },
                    {
                        "attribute_type_id": "biolink:original_subject",
                        "original_attribute_name": "gene_id",  ## original column name
                        "value": row.gene_id
                    },
                    {
                        "attribute_type_id": "biolink:original_object",
                        "original_attribute_name": "disease_id",  ## original column name
                        "value": row.disease_id
                    },
                ]
            }
            ## sources: depends on source_db value
            if row.source_db == "MedlinePlus":
                document["sources"] = [
                    {
                        "resource_id": "infores:medlineplus",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:diseases",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": ["infores:medlineplus"]
                    }
                ]
            elif row.source_db == "AmyCo":
                document["sources"] = [
                    {   ## not in infores registry yet!
                        "resource_id": "infores:amyco",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:diseases",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": ["infores:amyco"]
                    }
                ]
            else:
                raise ValueError(f"Unexpected source_db value during source mapping: {row.source_db}. Adjust parser.")
            ## assign the return value so it doesn't print (avoid shadowing the builtin `bytes`)
            _ = trapi_writer.write(document)
        else:
            ## won't write the document
            print(f"duplicate row encountered: {temp}")

duplicate row encountered: ENSEMBL:ENSP00000269703_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000272895_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000311687_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000342392_DOID:0050568_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000379845_DOID:0060407_MedlinePlus
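
The tally logic in both loops can be separated from the document building. A sketch under stated assumptions: `iter_unique_rows` is a hypothetical helper, rows are namedtuples like those from `itertuples`, and the key is a tuple of fields rather than a `"_"`-joined string (a swap from the notebook's approach, made so that field values containing `"_"` cannot collide). The three rows are taken from the duplicate check in this notebook.

```python
from collections import namedtuple


def iter_unique_rows(rows, key_fields):
    """Yield (row, is_duplicate) pairs, where duplicates share all key fields."""
    seen = set()
    for row in rows:
        key = tuple(getattr(row, field) for field in key_fields)
        if key in seen:
            yield row, True
        else:
            seen.add(key)
            yield row, False


Row = namedtuple("Row", ["gene_id", "disease_id", "source_db"])
toy_rows = [
    Row("ENSEMBL:ENSP00000269703", "DOID:0060655", "MedlinePlus"),
    Row("ENSEMBL:ENSP00000269703", "DOID:0060655", "MedlinePlus"),
    Row("ENSEMBL:ENSP00000342392", "DOID:0050568", "MedlinePlus"),
]
flags = [
    is_duplicate
    for _, is_duplicate in iter_unique_rows(toy_rows, ["gene_id", "disease_id", "source_db"])
]
```

Because the set lives inside the generator, one call covers one DataFrame; a tally shared across files would need to be passed in instead.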


In [77]:
df_knowledge[df_knowledge.duplicated(subset=["gene_id", "disease_id", "source_db"], keep=False)]

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,source_db,evidence_type,confidence_score,gene_nodenorm_id,gene_nodenorm_label,gene_nodenorm_category,disease_nodenorm_id,disease_nodenorm_label
798,ENSEMBL:ENSP00000269703,CYP4F22,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,NCBIGene:126410,CYP4F22,biolink:Gene,MONDO:0017265,autosomal recessive congenital ichthyosis
799,ENSEMBL:ENSP00000269703,CYP4F22,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,NCBIGene:126410,CYP4F22,biolink:Gene,MONDO:0017265,autosomal recessive congenital ichthyosis
823,ENSEMBL:ENSP00000272895,ABCA12,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,UniProtKB:Q86UK0-1,glucosylceramide transporter ABCA12 isoform h1...,biolink:Protein,MONDO:0017265,autosomal recessive congenital ichthyosis
824,ENSEMBL:ENSP00000272895,ABCA12,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,UniProtKB:Q86UK0-1,glucosylceramide transporter ABCA12 isoform h1...,biolink:Protein,MONDO:0017265,autosomal recessive congenital ichthyosis
1261,ENSEMBL:ENSP00000311687,NIPAL4,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,UniProtKB:Q0D2K0-1,magnesium transporter NIPA4 isoform h1 (human),biolink:Protein,MONDO:0017265,autosomal recessive congenital ichthyosis
1262,ENSEMBL:ENSP00000311687,NIPAL4,DOID:0060655,Autosomal recessive congenital ichthyosis,MedlinePlus,CURATED,5,UniProtKB:Q0D2K0-1,magnesium transporter NIPA4 isoform h1 (human),biolink:Protein,MONDO:0017265,autosomal recessive congenital ichthyosis
1662,ENSEMBL:ENSP00000342392,MESP2,DOID:0050568,Spondylocostal dysostosis,MedlinePlus,CURATED,5,NCBIGene:145873,MESP2,biolink:Gene,MONDO:0000359,spondylocostal dysostosis
1663,ENSEMBL:ENSP00000342392,MESP2,DOID:0050568,Spondylocostal dysostosis,MedlinePlus,CURATED,5,NCBIGene:145873,MESP2,biolink:Gene,MONDO:0000359,spondylocostal dysostosis
2731,ENSEMBL:ENSP00000379845,ABAT,DOID:0060407,Chromosome 18q deletion syndrome,MedlinePlus,CURATED,5,NCBIGene:18,ABAT,biolink:Gene,MONDO:0011147,chromosome 18q deletion syndrome
2732,ENSEMBL:ENSP00000379845,ABAT,DOID:0060407,Chromosome 18q deletion syndrome,MedlinePlus,CURATED,5,NCBIGene:18,ABAT,biolink:Gene,MONDO:0011147,chromosome 18q deletion syndrome


In [78]:
## number of docs that should be created
df_textmining.shape[0] + df_knowledge.shape[0] - 5  ## add together, remove duplicates

269602
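
One way to confirm that the written file holds the expected number of documents is to count its lines after writing. A minimal sketch, assuming a hypothetical helper `count_jsonl_documents` and that each document sits on its own line, as in JSON Lines output.

```python
import json


def count_jsonl_documents(path):
    """Count the non-empty lines in a JSON Lines file, checking that each parses."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError if the line is not valid JSON
                count += 1
    return count
```

Calling `count_jsonl_documents('DISEASES_trapi_edges.jsonl')` after the writing cell and comparing the result with the number above would catch a mismatch between rows and written documents.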

### File: KGX edges

In [79]:
df_textmining

Unnamed: 0,gene_id,gene_name,disease_id,disease_name,z_score,confidence_score,url,gene_nodenorm_id,gene_nodenorm_label,gene_nodenorm_category,disease_nodenorm_id,disease_nodenorm_label
0,ENSEMBL:ENSP00000000233,ARF5,DOID:0111266,Geroderma osteodysplasticum,4.775,2.388,https://diseases.jensenlab.org/Entity?document...,NCBIGene:381,ARF5,biolink:Gene,MONDO:0009271,geroderma osteodysplastica
1,ENSEMBL:ENSP00000000233,ARF5,DOID:162,Cancer,3.214,1.607,https://diseases.jensenlab.org/Entity?document...,NCBIGene:381,ARF5,biolink:Gene,MONDO:0004992,cancer
2,ENSEMBL:ENSP00000000233,ARF5,DOID:863,Nervous system disease,3.055,1.528,https://diseases.jensenlab.org/Entity?document...,NCBIGene:381,ARF5,biolink:Gene,MONDO:0005071,nervous system disorder
3,ENSEMBL:ENSP00000000412,M6PR,DOID:0080070,Mucolipidosis II alpha/beta,5.533,2.767,https://diseases.jensenlab.org/Entity?document...,NCBIGene:4074,M6PR,biolink:Gene,MONDO:0009650,Mucolipidosis 2
4,ENSEMBL:ENSP00000000412,M6PR,DOID:0080071,Mucolipidosis III alpha/beta,4.227,2.113,https://diseases.jensenlab.org/Entity?document...,NCBIGene:4074,M6PR,biolink:Gene,MONDO:0018931,"mucolipidosis type III, alpha/beta"
...,...,...,...,...,...,...,...,...,...,...,...,...
265756,ENSEMBL:ENSP00000501277,LDB1,DOID:0081445,Sickle cell disease,4.322,2.161,https://diseases.jensenlab.org/Entity?document...,UniProtKB:Q86U70-1,LIM domain-binding protein 1 isoform 1 (human),biolink:Protein,MONDO:0011382,sickle cell anemia
265757,ENSEMBL:ENSP00000501277,LDB1,DOID:2355,Anemia,4.298,2.149,https://diseases.jensenlab.org/Entity?document...,UniProtKB:Q86U70-1,LIM domain-binding protein 1 isoform 1 (human),biolink:Protein,MONDO:0002280,anemia
265758,ENSEMBL:ENSP00000501277,LDB1,DOID:162,Cancer,4.214,2.107,https://diseases.jensenlab.org/Entity?document...,UniProtKB:Q86U70-1,LIM domain-binding protein 1 isoform 1 (human),biolink:Protein,MONDO:0004992,cancer
265759,ENSEMBL:ENSP00000501277,LDB1,DOID:9467,nail-patella syndrome,3.957,1.978,https://diseases.jensenlab.org/Entity?document...,UniProtKB:Q86U70-1,LIM domain-binding protein 1 isoform 1 (human),biolink:Protein,MONDO:0008061,nail-patella syndrome


In [80]:
## code chunk for testing parts of inner code

with jsonlines.open('DISEASES_kgx_edges.jsonl', mode='w', compact=True) as kgx_edges_writer:
    textmining_tally = set()
    
    ## using itertuples because it's faster, preserves datatypes
    for row in df_textmining.itertuples(index=False):
        ## construct row abbreviation
        temp = row.gene_id + "_" + row.disease_id
        
        if temp not in textmining_tally:
            textmining_tally.add(temp)

            document = {
                "subject": row.gene_nodenorm_id,
                "predicate": "biolink:occurs_together_in_literature_with",
                "object": row.disease_nodenorm_id,
                "primary_knowledge_source": "infores:diseases",
                ## taken from Sierra. This should be disease source's source_record_urls
                "pks_record_urls": [row.url],  
                "knowledge_level": "statistical_association",
                "agent_type": "text_mining_agent",
                "STATO:0000104": row.z_score,  ## needs data-modeling/TRAPI validation review
                "SEPIO:0000168": row.confidence_score,  ## needs data-modeling/TRAPI validation review
                "original_subject": row.gene_id,
                "original_object": row.disease_id,
            }
            ## assign the return value so it doesn't print (avoid shadowing the builtin `bytes`)
            _ = kgx_edges_writer.write(document)
        else:
            ## won't write the document
            print(f"duplicate row encountered: {temp}")
        
        if row.gene_id == "ENSEMBL:ENSP00000000412":
            break    

In [81]:


with jsonlines.open('DISEASES_kgx_edges.jsonl', mode='w', compact=True) as kgx_edges_writer:
    
## text-mined data: parser - separate function (diff row abbreviation, document)
    textmining_tally = set()
    
    ## using itertuples because it's faster, preserves datatypes
    for row in df_textmining.itertuples(index=False):
        ## construct row abbreviation
        temp = row.gene_id + "_" + row.disease_id
        
        if temp not in textmining_tally:
            textmining_tally.add(temp)

            document = {
                "subject": row.gene_nodenorm_id,
                "predicate": "biolink:occurs_together_in_literature_with",
                "object": row.disease_nodenorm_id,
                "primary_knowledge_source": "infores:diseases",
                ## taken from Sierra. This should be disease source's source_record_urls
                "pks_record_urls": [row.url],  
                "knowledge_level": "statistical_association",
                "agent_type": "text_mining_agent",
                "STATO:0000104": row.z_score,  ## needs data-modeling/TRAPI validation review
                "SEPIO:0000168": row.confidence_score,  ## needs data-modeling/TRAPI validation review
                "original_subject": row.gene_id,
                "original_object": row.disease_id,
            }
            ## assign to _ so the return value isn't echoed (and the builtin bytes isn't shadowed)
            _ = kgx_edges_writer.write(document)
        else:
            ## won't write the document
            print(f"duplicate row encountered: {temp}")

    ## knowledge data: parser - separate function (different row abbreviation, document)
    knowledge_tally = set()
    
    ## using itertuples because it's faster, preserves datatypes
    for row in df_knowledge.itertuples(index=False):
        ## construct row abbreviation: needs source_db!
        temp = row.gene_id + "_" + row.disease_id + "_" + row.source_db
        
        if temp not in knowledge_tally:
            knowledge_tally.add(temp)

            document = {
                ## simple assignments: no if
                "subject": row.gene_nodenorm_id,
                "predicate": "biolink:genetically_associated_with",
                "object": row.disease_nodenorm_id,
                "knowledge_level": "knowledge_assertion",
                "agent_type": "manual_agent",
                "SEPIO:0000168": row.confidence_score,  ## needs data-modeling/TRAPI validation review
                "original_subject": row.gene_id,
                "original_object": row.disease_id,
            }
            ## sources: depends on source_db value
            if row.source_db == "MedlinePlus":
                document["primary_knowledge_source"] = "infores:medlineplus"
                document["aggregator_knowledge_source"] = "infores:diseases"
            elif row.source_db == "AmyCo":
                ## not in infores registry yet!
                document["primary_knowledge_source"] = "infores:amyco"
                document["aggregator_knowledge_source"] = "infores:diseases"                
            else:
                raise ValueError(f"Unexpected source_db value during source mapping: {row.source_db}. Adjust parser.")
            ## assign to _ so the return value isn't echoed (and the builtin bytes isn't shadowed)
            _ = kgx_edges_writer.write(document)
        else:
            ## won't write the document
            print(f"duplicate row encountered: {temp}")

duplicate row encountered: ENSEMBL:ENSP00000269703_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000272895_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000311687_DOID:0060655_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000342392_DOID:0050568_MedlinePlus
duplicate row encountered: ENSEMBL:ENSP00000379845_DOID:0060407_MedlinePlus
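
The source_db branching in the knowledge loop could also be written as a lookup table, which keeps the mapping in one place if more sources show up later. This is a minimal sketch, not the parser's actual code: SOURCE_DB_TO_INFORES and source_fields are hypothetical names, and the infores IDs are the ones used in the cell above (infores:amyco is still not in the registry).

```python
# Mapping taken from the if/elif branches in the knowledge loop
SOURCE_DB_TO_INFORES = {
    "MedlinePlus": "infores:medlineplus",
    "AmyCo": "infores:amyco",  # not in infores registry yet!
}


def source_fields(source_db):
    """Return the source properties for one knowledge edge (hypothetical helper)."""
    if source_db not in SOURCE_DB_TO_INFORES:
        raise ValueError(
            f"Unexpected source_db value during source mapping: {source_db}. Adjust parser."
        )
    return {
        "primary_knowledge_source": SOURCE_DB_TO_INFORES[source_db],
        "aggregator_knowledge_source": "infores:diseases",
    }


# Usage: merge into an edge document that is being built
document = {"predicate": "biolink:genetically_associated_with"}
document.update(source_fields("MedlinePlus"))
```

An unexpected source_db still raises ValueError, same as the else branch.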


### File: KGX nodes

Each KGX node requires id and category; name and other properties (basically node attributes) are optional.
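
To make the requirement concrete, here is a minimal sketch of one node record and a check for the required fields. REQUIRED_NODE_FIELDS and missing_node_fields are hypothetical names for illustration, not part of KGX or any library. The example node reuses the nail-patella syndrome row shown in the dataframe output earlier.

```python
import json

# Fields every node record must carry, per the note above
REQUIRED_NODE_FIELDS = ("id", "category")


def missing_node_fields(node):
    """Return the required fields that are absent or empty (hypothetical helper)."""
    return [field for field in REQUIRED_NODE_FIELDS if not node.get(field)]


example_node = {
    "id": "MONDO:0008061",
    "name": "nail-patella syndrome",   # optional
    "category": ["biolink:Disease"],   # written as a list, as in the cells below
}

assert missing_node_fields(example_node) == []

# One JSON object per line; compact separators, similar in spirit to compact=True
line = json.dumps(example_node, separators=(",", ":"))
print(line)
```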

In [85]:
## get unique list of NodeNormed nodes

## genes need the category column: NodeNorm returned both biolink:Gene and biolink:Protein
nodenormed_genes_final = pd.concat(
    [df_textmining[["gene_nodenorm_id", "gene_nodenorm_label", "gene_nodenorm_category"]],
     df_knowledge[["gene_nodenorm_id", "gene_nodenorm_label", "gene_nodenorm_category"]]],
    ignore_index=True,
).drop_duplicates()
## diseases don't need it: category is hard-coded to biolink:Disease when writing the nodes
nodenormed_diseases_final = pd.concat(
    [df_textmining[["disease_nodenorm_id", "disease_nodenorm_label"]],
     df_knowledge[["disease_nodenorm_id", "disease_nodenorm_label"]]],
    ignore_index=True,
).drop_duplicates()

nodenormed_genes_final
## vs len mapping 16477
nodenormed_diseases_final
## vs len mapping 5641

Unnamed: 0,gene_nodenorm_id,gene_nodenorm_label,gene_nodenorm_category
0,NCBIGene:381,ARF5,biolink:Gene
3,NCBIGene:4074,M6PR,biolink:Gene
10,NCBIGene:2288,FKBP4,biolink:Gene
34,UniProtKB:Q9NR63-1,cytochrome P450 26B1 isoform 1 (human),biolink:Protein
50,UniProtKB:Q7L592-1,"protein arginine methyltransferase NDUFAF7, mi...",biolink:Protein
...,...,...,...
265755,UniProtKB:Q149M9-3,NACHT domain- and WD repeat-containing protein...,biolink:Protein
265756,UniProtKB:Q86U70-1,LIM domain-binding protein 1 isoform 1 (human),biolink:Protein
266215,UniProtKB:P41219-1,peripherin isoform h1 (human),biolink:Protein
268558,UniProtKB:Q96JH8-4,Ras-associating and dilute domain-containing p...,biolink:Protein


Unnamed: 0,disease_nodenorm_id,disease_nodenorm_label
0,MONDO:0009271,geroderma osteodysplastica
1,MONDO:0004992,cancer
2,MONDO:0005071,nervous system disorder
3,MONDO:0009650,Mucolipidosis 2
4,MONDO:0018931,"mucolipidosis type III, alpha/beta"
...,...,...
269365,MONDO:0011152,PHGDH deficiency
269378,MONDO:0100428,progressive bulbar palsy of childhood
269389,MONDO:0012791,"mitochondrial DNA depletion syndrome, encephal..."
269584,MONDO:0009949,pyruvate carboxylase deficiency disease


In [83]:


with jsonlines.open('DISEASES_kgx_nodes.jsonl', mode='w', compact=True) as kgx_nodes_writer:
    
    ## using itertuples because it's faster, preserves datatypes
    for row in nodenormed_genes_final.itertuples(index=False):
        ## assign to _ so the return value isn't echoed (and the builtin bytes isn't shadowed)
        _ = kgx_nodes_writer.write({
            "id": row.gene_nodenorm_id,
            "name": row.gene_nodenorm_label,
            "category": [row.gene_nodenorm_category]
        })

    ## using itertuples because it's faster, preserves datatypes
    for row in nodenormed_diseases_final.itertuples(index=False):
        ## assign to _ so the return value isn't echoed (and the builtin bytes isn't shadowed)
        _ = kgx_nodes_writer.write({
            "id": row.disease_nodenorm_id,
            "name": row.disease_nodenorm_label,
            ## hard-coded because during pre-NodeNorm process, only kept entities with this primary category
            "category": ["biolink:Disease"]
        })

In [86]:
nodenormed_genes_final.shape[0] + nodenormed_diseases_final.shape[0]

21952

## Notes

* The KGX version is missing original_attribute_name and upstream_resource_ids.
* __Uses infores:amyco, which isn't in an infores catalog release yet.__
* __Will create edges that look like duplicates: the triple is the same, but the original IDs/data differ. Leaving it this way for now, but will need to consider what to do (is merging possible?).__
* confidence_score is sometimes a float (from text-mining) and sometimes an int (from knowledge). Not worried about this for now.
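
To see how many edges fall into the "looks like a duplicate" case, one option is to group the written edges by triple. This is a minimal sketch under stated assumptions: repeated_triples is a hypothetical helper, the toy edges use made-up IDs, and the real input would be the records from DISEASES_kgx_edges.jsonl. It only reports the groups; it does not decide how (or whether) to merge them.

```python
from collections import defaultdict


def repeated_triples(edges):
    """Group edge dicts by (subject, predicate, object); keep groups with >1 edge."""
    groups = defaultdict(list)
    for edge in edges:
        triple = (edge["subject"], edge["predicate"], edge["object"])
        groups[triple].append(edge)
    return {triple: group for triple, group in groups.items() if len(group) > 1}


# Toy edges (made-up IDs): two original gene IDs that normalized to the same node
toy_edges = [
    {"subject": "NCBIGene:1", "predicate": "biolink:genetically_associated_with",
     "object": "MONDO:1", "original_subject": "ENSEMBL:A", "SEPIO:0000168": 5},
    {"subject": "NCBIGene:1", "predicate": "biolink:genetically_associated_with",
     "object": "MONDO:1", "original_subject": "ENSEMBL:B", "SEPIO:0000168": 4},
    {"subject": "NCBIGene:2", "predicate": "biolink:genetically_associated_with",
     "object": "MONDO:1", "original_subject": "ENSEMBL:C", "SEPIO:0000168": 5},
]

repeats = repeated_triples(toy_edges)
for triple, group in repeats.items():
    originals = [edge["original_subject"] for edge in group]
    print(triple, "<-", originals)
```

If the int/float mix in confidence_score ever matters, wrapping the value in float() when building the document would make the type consistent.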