<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-data" data-toc-modified-id="Loading-data-1">Loading data</a></span></li><li><span><a href="#Check,-remove-duplicates" data-toc-modified-id="Check,-remove-duplicates-2">Check, remove duplicates</a></span></li><li><span><a href="#Column-level-transforms" data-toc-modified-id="Column-level-transforms-3">Column-level transforms</a></span></li><li><span><a href="#Confidence-values" data-toc-modified-id="Confidence-values-4">Confidence values</a></span><ul class="toc-item"><li><span><a href="#Removing-rows-+-stats" data-toc-modified-id="Removing-rows-+-stats-4.1">Removing rows + stats</a></span></li></ul></li><li><span><a href="#Pre-NodeNorming" data-toc-modified-id="Pre-NodeNorming-5">Pre-NodeNorming</a></span><ul class="toc-item"><li><span><a href="#Exploring:-Genes" data-toc-modified-id="Exploring:-Genes-5.1">Exploring: Genes</a></span><ul class="toc-item"><li><span><a href="#HGNC" data-toc-modified-id="HGNC-5.1.1">HGNC</a></span></li><li><span><a href="#OMIM" data-toc-modified-id="OMIM-5.1.2">OMIM</a></span></li><li><span><a href="#Comparing-HGNC-vs-OMIM" data-toc-modified-id="Comparing-HGNC-vs-OMIM-5.1.3">Comparing HGNC vs OMIM</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-5.1.4">Conclusions</a></span></li></ul></li><li><span><a href="#Exploring:-Diseases" data-toc-modified-id="Exploring:-Diseases-5.2">Exploring: Diseases</a></span><ul class="toc-item"><li><span><a href="#OMIM/orphanet" data-toc-modified-id="OMIM/orphanet-5.2.1">OMIM/orphanet</a></span></li><li><span><a href="#MONDO" data-toc-modified-id="MONDO-5.2.2">MONDO</a></span></li><li><span><a href="#Comparing-OMIM/orphanet-vs-MONDO" data-toc-modified-id="Comparing-OMIM/orphanet-vs-MONDO-5.2.3">Comparing OMIM/orphanet vs MONDO</a></span></li><li><span><a href="#Checking-MONDO-data" data-toc-modified-id="Checking-MONDO-data-5.2.4">Checking MONDO data</a></span></li><li><span><a href="#Conclusions" 
data-toc-modified-id="Conclusions-5.2.5">Conclusions</a></span></li></ul></li></ul></li><li><span><a href="#Stats-on-rows-removed-during-NodeNorming" data-toc-modified-id="Stats-on-rows-removed-during-NodeNorming-6">Stats on rows removed during NodeNorming</a></span></li><li><span><a href="#Adding-NodeNorm-data,-removing-rows" data-toc-modified-id="Adding-NodeNorm-data,-removing-rows-7">Adding NodeNorm data, removing rows</a></span></li><li><span><a href="#Generating-documents" data-toc-modified-id="Generating-documents-8">Generating documents</a></span><ul class="toc-item"><li><span><a href="#Rows-not-included" data-toc-modified-id="Rows-not-included-8.1">Rows not included</a></span></li><li><span><a href="#Columns-not-included" data-toc-modified-id="Columns-not-included-8.2">Columns not included</a></span></li><li><span><a href="#Generating-now!" data-toc-modified-id="Generating-now!-8.3">Generating now!</a></span></li></ul></li><li><span><a href="#Checking-documents" data-toc-modified-id="Checking-documents-9">Checking documents</a></span></li><li><span><a href="#BioThings-Parser-notes" data-toc-modified-id="BioThings-Parser-notes-10">BioThings Parser notes</a></span></li></ul></div>

# Notebook for parser development

In [1]:
## not for parser. for notebook only 

## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Loading data

__Current approach__ is to load all files into 1 pandas dataframe. Then I can...

1. check the duplicates situation: records found in multiple panel files. I can check whether the same record looks different between files or not (by checking duplicates using all columns vs key columns). -> Raise errors if yes
2. remove duplicates before generating documents
3. Do some tasks column-wise over all the data, rather than while iterating over rows

Notes:
* There are a few existing BioThings parsers that also use `pandas` to load the entire raw data file at once: https://github.com/search?q=repo%3Abiothings%2Fpending.api%20pandas&type=code
* But there are other parsers that use `csv` to load the file **one row at a time** (generator): https://github.com/search?q=repo%3Abiothings%2Fpending.api+csv+reader&type=code

---

If I did the generator approach (load files 1 by 1, 1 row at a time), I'd have to modify how I do things:
1. Don't do the duplicates check. But try to mitigate potential "duplicate" issues: 
   - Sort all delimited strings
   - Use a hash of all column values (when they're all strings) for `_id`. Want rows with all the same values to produce the same hash
2. Either leave to BioThings toolset to remove duplicates, or could save a running set of `_id` hashes to check if row was already encountered -> not create duplicate docs
3. Do the tasks on single rows/chunks (pandas [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) has an iterator for rows/chunks! see iterator/chunksize parameters)
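
A minimal sketch of the row-at-a-time alternative described above, using only the standard library. The list of delimited columns, the function names, and the toy file contents are assumptions for illustration, not part of the actual parser; real files would be opened with `gzip.open(path, "rt", newline="")` instead of `io.StringIO`.

```python
import csv
import hashlib
import io

# Assumption: columns that hold "; "-delimited lists (guessed from the data preview).
DELIMITED_COLUMNS = ("previous gene symbols", "phenotypes", "publications", "panel")


def normalize_row(row):
    """Sort delimited values so that list order cannot change the hash."""
    out = dict(row)
    for col in DELIMITED_COLUMNS:
        if out.get(col):
            parts = sorted(part.strip() for part in out[col].split(";"))
            out[col] = "; ".join(parts)
    return out


def row_hash(row):
    """Hash every column value, visiting columns in a fixed (sorted) order."""
    joined = "\x1f".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def iter_unique_rows(file_handles):
    """Yield (_id, row) pairs, skipping rows whose hash was already seen."""
    seen = set()
    for handle in file_handles:
        for row in csv.DictReader(handle):
            row = normalize_row(row)
            _id = row_hash(row)
            if _id in seen:
                continue
            seen.add(_id)
            yield _id, row


# Toy data: the same record appears in two "files" with a different list order.
file_a = io.StringIO("g2p id,publications\nG2P00001,18423520; 25574057\n")
file_b = io.StringIO(
    "g2p id,publications\nG2P00001,25574057; 18423520\nG2P00002,21240275\n"
)
unique = list(iter_unique_rows([file_a, file_b]))
```

With this design the running `seen` set is the only state kept across files, so memory grows with the number of unique records rather than with the total file size.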

In [2]:
## put into parser: DONE
import pathlib
import pandas as pd

## don't put in parser. Just for this notebook
import glob
from pprint import pprint

## unsure on putting into parser: more for notebook viewing/debugging...
pd.options.display.max_columns = None

<div class="alert alert-block alert-danger">

Adjust the code block below for path/pattern for data files. 
    
This notebook was originally written using data files from the 2025-02-28 release to FTP site 

In [3]:
## put into parser (format): DONE

base_file_path = pathlib.Path.home().joinpath("Desktop", "EBIgene2pheno_files", 
                                              "From_FTP", "2025-03-28")

## uses pathlib's Path.glob, which produces a generator. 
## cast into list so parser code can check if paths were actually matched or not
all_file_paths = list(base_file_path.glob("*.csv.gz"))
all_file_paths

[PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-03-28/DDG2P_2025-03-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-03-28/HearingLossG2P_2025-03-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-03-28/SkinG2P_2025-03-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-03-28/CancerG2P_2025-03-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-03-28/CardiacG2P_2025-03-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-03-28/SkeletalG2P_2025-03-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-03-28/EyeG2P_2025-03-28.csv.gz')]

In [4]:
## an example: pathlib's Path.glob produces a generator
## vs glob.glob produces a list (a relative pattern is resolved against the cwd)
base_file_path.glob("*2025-02-28.csv.gz")
glob.glob("*2025-02-28.csv.gz")

<generator object Path.glob at 0x10ebd7010>

[]

In [5]:
## put into parser (format): DONE

## using generator expression (think list/dict comprehension) within pd.concat to load files 1 at a time
## ingesting all columns as str for now
df = pd.concat((pd.read_csv(f, dtype=str) for f in all_file_paths), ignore_index=True)

## make column names snake-case - usable with itertuples later
df.columns = df.columns.str.replace(" ", "_")

In [6]:
df["date_of_last_review"].info(memory_usage="deep")

<class 'pandas.core.series.Series'>
RangeIndex: 4726 entries, 0 to 4725
Series name: date_of_last_review
Non-Null Count  Dtype 
--------------  ----- 
4726 non-null   object
dtypes: object(1)
memory usage: 341.7 KB


In [7]:
## change this column to datetime, saves memory
df["date_of_last_review"] = pd.to_datetime(df["date_of_last_review"])
df["date_of_last_review"].info(memory_usage="deep")

<class 'pandas.core.series.Series'>
RangeIndex: 4726 entries, 0 to 4725
Series name: date_of_last_review
Non-Null Count  Dtype              
--------------  -----              
4726 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 37.1 KB
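
One possible way to fold the datetime conversion into the loading step is to convert each file right after it is read, before concatenating. This is only a sketch under assumptions: the helper name `read_g2p_file` and the toy CSV contents are made up, and it does not make `read_csv` itself parse the dates.

```python
import io

import pandas as pd

DATE_COLUMN = "date of last review"


def read_g2p_file(source):
    """Read one file with every column as str, then convert the date column."""
    frame = pd.read_csv(source, dtype=str)
    # utc=True yields a timezone-aware datetime column
    frame[DATE_COLUMN] = pd.to_datetime(frame[DATE_COLUMN], utc=True)
    return frame


# Toy stand-ins for two data files
sample_a = io.StringIO(
    "g2p id,date of last review\nG2P00001,2019-09-26 16:23:46+00:00\n"
)
sample_b = io.StringIO(
    "g2p id,date of last review\nG2P00002,2025-01-28 23:09:54+00:00\n"
)
toy_df = pd.concat(
    (read_g2p_file(f) for f in (sample_a, sample_b)), ignore_index=True
)
```

In the notebook, the generator expression would iterate over `all_file_paths` instead of the toy buffers.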


In [8]:
## I couldn't figure out how to import + ingest column as datetime in 1 step 
## this is what I tried that didn't work

## worked with pandas 2.0.3, but didn't work with pandas 2.2.3: ingested as str
# df = pd.concat((pd.read_csv(f, dtype=str, parse_dates=["date of last review"]) 
#                 for f in all_file_paths), ignore_index=True)

## doesn't work
# df = pd.concat((pd.read_csv(f, dtype=str, parse_dates=["date of last review"], 
#                            date_format="%Y-%m-%d %H:%M:%S%:z") 
#                 for f in all_file_paths), ignore_index=True)
## throws an error
# df = pd.concat((pd.read_csv(f, dtype=str, parse_dates=[["date of last review"]], 
#                            date_format="%Y-%m-%d %H:%M:%S%:z") 
#                 for f in all_file_paths), ignore_index=True)
## throws an error
# df = pd.concat((pd.read_csv(f, dtype={"date of last review": pd.datetime64[ns, tz]})
#                 for f in all_file_paths), ignore_index=True)

In [9]:
df.shape
df.head()

(4726, 21)

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review
0,G2P00001,HMX1,142992,5017,H6; NKX5-3,HMX1-related oculoauricular syndrome,612109,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000568; HP:0000589; HP:0000007; HP:0000482...,18423520; 25574057; 29140751,DD; Eye,,2019-09-26 16:23:46+00:00
1,G2P00002,SLX4,613278,23845,BTBD12; FANCP; KIAA1784; KIAA1987,SLX4-related Fanconi anemia,613951,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000347; HP:0000007; HP:0001903; HP:0002984...,21240275; 21240277,DD,,2025-01-28 23:09:54+00:00
2,G2P00003,ARG1,608313,663,,ARG1-related argininemia,207800,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000752; HP:0000737; HP:0000007; HP:0008339...,10502833; 1598908; 7649538; 1463019; 2365823,DD,,2015-07-22 16:14:07+00:00
3,G2P00004,ATR,601215,882,FRP1; MEC1; SCKL; SCKL1,ATR-related Seckel syndrome,210600,,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,HP:0000347; HP:0010230; HP:0001249; HP:0002750...,,DD; Skeletal,,2025-01-27 14:24:27+00:00
4,G2P00005,FANCB,300515,3583,FAAP95; FAB; FLJ34064,FANCB-related Fanconi anemia,300514,,monoallelic_X_hemizygous,,definitive,absent gene product,,loss of function,inferred,,HP:0000924; HP:0001871; HP:0001701; HP:0000083...,16679491; 21910217; 36135330; 32106311; 307922...,DD; Skin,,2024-08-20 14:13:58+00:00


In [10]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4726 entries, 0 to 4725
Data columns (total 21 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   g2p_id                              4726 non-null   object             
 1   gene_symbol                         4726 non-null   object             
 2   gene_mim                            4724 non-null   object             
 3   hgnc_id                             4726 non-null   object             
 4   previous_gene_symbols               4251 non-null   object             
 5   disease_name                        4726 non-null   object             
 6   disease_mim                         3576 non-null   object             
 7   disease_MONDO                       641 non-null    object             
 8   allelic_requirement                 4726 non-null   object             
 9   cross_cutting_modifier              632 n

## Check, remove duplicates

There are duplicate rows in this dataframe because the record (gene + disease + more) is in several panels (disease falls into multiple categories). This was explored in the data-playground notebook. 

We want to drop those duplicates. 
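However, I was concerned that the delimited-string values could differ (only in list order) for the same record in different files. 
If that happened, the rows would share a `g2p_id` but wouldn't count as duplicates across all columns, so they wouldn't be dropped. That's what this check catches. 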
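
A small toy illustration of what the check is meant to catch, and how sorting delimited strings would resolve it. The dataframe contents and the `sort_delimited` helper are made up for this sketch.

```python
import pandas as pd

# Same record (same g2p_id) from two files, differing only in list order
toy = pd.DataFrame({
    "g2p_id": ["G2P00001", "G2P00001", "G2P00002"],
    "publications": ["1; 2", "2; 1", "3"],
})

n_dup_key = int(toy.duplicated(subset=["g2p_id"], keep=False).sum())
n_dup_all = int(toy.duplicated(keep=False).sum())
# n_dup_key != n_dup_all here, which is the situation the check raises on


def sort_delimited(value, sep="; "):
    """Sort the items of a delimited string; leave missing values untouched."""
    if pd.isna(value):
        return value
    return sep.join(sorted(value.split(sep)))


toy["publications"] = toy["publications"].map(sort_delimited)
n_dup_all_sorted = int(toy.duplicated(keep=False).sum())
```

After sorting, the two counts agree, so `drop_duplicates` would remove the repeated record.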

In [11]:
## put into parser (format): DONE

n_duplicates_column_combo = df[df.duplicated(subset=["g2p_id"], keep=False)].shape

n_duplicates_all_columns = df[df.duplicated(keep=False)].shape

## for testing
# n_duplicates_all_columns = (1, 1)


if n_duplicates_column_combo != n_duplicates_all_columns: 
    raise AssertionError("The data format has changed, and the assumptions about duplicates/key columns may " \
                          "no longer hold. Re-explore the data and adjust the parser.")

In [12]:
## put into parser (format): DONE

## drop duplicates
df.drop_duplicates(inplace=True, ignore_index=True)

In [13]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3659 entries, 0 to 3658
Data columns (total 21 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   g2p_id                              3659 non-null   object             
 1   gene_symbol                         3659 non-null   object             
 2   gene_mim                            3657 non-null   object             
 3   hgnc_id                             3659 non-null   object             
 4   previous_gene_symbols               3287 non-null   object             
 5   disease_name                        3659 non-null   object             
 6   disease_mim                         2572 non-null   object             
 7   disease_MONDO                       564 non-null    object             
 8   allelic_requirement                 3659 non-null   object             
 9   cross_cutting_modifier              454 n

## Column-level transforms

Based on data-playground "Notes on parsing data to create documents" section
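
The panel renaming in this section can also be written token by token, which avoids substring collisions and stays safe if the cell is accidentally run twice. A sketch, assuming panel values are joined with `"; "` as in the data preview; the function name is hypothetical.

```python
# Keyword -> full panel name, as shown on the G2P website for a single record.
# "Hearing loss" is intentionally absent, so it passes through unchanged.
PANEL_NAMES = {
    "DD": "Developmental disorders",
    "Cancer": "Cancer disorders",
    "Cardiac": "Cardiac disorders",
    "Eye": "Eye disorders",
    "Skeletal": "Skeletal disorders",
    "Skin": "Skin disorders",
}


def expand_panels(value, sep="; "):
    """Replace each panel keyword with its full name; keep NA values as-is."""
    if not isinstance(value, str):
        return value
    return sep.join(PANEL_NAMES.get(token, token) for token in value.split(sep))


once = expand_panels("DD; Eye")
twice = expand_panels(once)
```

It could be applied with `df["panel"] = df["panel"].map(expand_panels)`.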

In [14]:
## double-checking how to add prefixes to OMIM vs orphanet IDs

df_diseasemim = df.copy()

## done to preserve NA
df_diseasemim["disease_mim"] = [i if pd.isna(i) \
                                else "OMIM:" + i if i.isnumeric() \
                                else i \
                                for i in df_diseasemim["disease_mim"]]

df_diseasemim["disease_mim"] = df_diseasemim["disease_mim"].str.replace("Orphanet", "orphanet")

In [15]:
df_diseasemim[df_diseasemim["disease_mim"].str.contains("OMIM:", na=False)].shape

df_diseasemim[df_diseasemim["disease_mim"].str.contains("orphanet:", na=False)].shape

## add up row counts. If the sum == disease_mim's non-null count in info above, you're good 
## right now 2563 + 9 = 2572 == 2572, so good

(2563, 21)

(9, 21)

In [16]:
## put into parser (format): DONE

## COLUMN-LEVEL TRANSFORMS

## adding Translator/biolink prefixes to IDs
df["gene_mim"] = "OMIM:" + df["gene_mim"]
df["hgnc_id"] = "HGNC:" + df["hgnc_id"]
df["disease_mim"] = df["disease_mim"].str.replace("Orphanet", "orphanet")
## done to preserve NA
df["disease_mim"] = [i if pd.isna(i)
                     else "OMIM:" + i if i.isnumeric()
                     else i
                     for i in df["disease_mim"]]

## strip whitespace
df["disease_name"] = df["disease_name"].str.strip()
df["comments"] = df["comments"].str.strip()

## create new columns
## UI really wants resource website urls like this. May need to adjust over time as website changes
df["g2p_record_url"] = "https://www.ebi.ac.uk/gene2phenotype/lgd/" +  df["g2p_id"]

## replace panel keywords with full names shown on G2P website for single record
## keeping "Hearing loss" as-is, changing all other values
df["panel"] = df["panel"].str.replace("DD", "Developmental disorders")
df["panel"] = df["panel"].str.replace("Cancer", "Cancer disorders")
df["panel"] = df["panel"].str.replace("Cardiac", "Cardiac disorders")
df["panel"] = df["panel"].str.replace("Eye", "Eye disorders")
df["panel"] = df["panel"].str.replace("Skeletal", "Skeletal disorders")
df["panel"] = df["panel"].str.replace("Skin", "Skin disorders")

In [17]:
## checking on column-level transforms

df.head()
# df["g2p_record_url"].unique()[0:100]

# df[df["disease_mim"].str.contains("orphanet", na=False)]  ## 9 rows, so that's correct
# df[df["panel"].str.contains("Hearing", na=False)]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
0,G2P00001,HMX1,OMIM:142992,HGNC:5017,H6; NKX5-3,HMX1-related oculoauricular syndrome,OMIM:612109,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000568; HP:0000589; HP:0000007; HP:0000482...,18423520; 25574057; 29140751,Developmental disorders; Eye disorders,,2019-09-26 16:23:46+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00001
1,G2P00002,SLX4,OMIM:613278,HGNC:23845,BTBD12; FANCP; KIAA1784; KIAA1987,SLX4-related Fanconi anemia,OMIM:613951,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000347; HP:0000007; HP:0001903; HP:0002984...,21240275; 21240277,Developmental disorders,,2025-01-28 23:09:54+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00002
2,G2P00003,ARG1,OMIM:608313,HGNC:663,,ARG1-related argininemia,OMIM:207800,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000752; HP:0000737; HP:0000007; HP:0008339...,10502833; 1598908; 7649538; 1463019; 2365823,Developmental disorders,,2015-07-22 16:14:07+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00003
3,G2P00004,ATR,OMIM:601215,HGNC:882,FRP1; MEC1; SCKL; SCKL1,ATR-related Seckel syndrome,OMIM:210600,,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,HP:0000347; HP:0010230; HP:0001249; HP:0002750...,,Developmental disorders; Skeletal disorders,,2025-01-27 14:24:27+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00004
4,G2P00005,FANCB,OMIM:300515,HGNC:3583,FAAP95; FAB; FLJ34064,FANCB-related Fanconi anemia,OMIM:300514,,monoallelic_X_hemizygous,,definitive,absent gene product,,loss of function,inferred,,HP:0000924; HP:0001871; HP:0001701; HP:0000083...,16679491; 21910217; 36135330; 32106311; 307922...,Developmental disorders; Skin disorders,,2024-08-20 14:13:58+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00005


## Confidence values

**2024-04-15:**

Every row/record has 1 confidence value, representing how confident the curators are that "this gene has a causal role in this disease". The definitions of the possible values are provided [here (under G2P Confidence Category)](https://www.ebi.ac.uk/gene2phenotype/about/terminology). 


**CURRENT DEFINITIONS** (including in case they change later)

> **definitive**: The role of this gene in this particular disease has been repeatedly demonstrated in both the research and clinical diagnostic settings, and has been upheld over time (at least 2 independent publication over 3 years' time). No convincing evidence has emerged that contradicts the role of the gene in the specified disease. (previously labelled as confirmed) The strength of evidence within publications as well as their number and publication dates is taken into account. In practice, this usually means at least 4 publications over 5 years. Typically this will also include convincing bioinformatic or functional evidence of causation, making it very unlikely that this gene-disease association would ever be refuted.
>
>**strong**: The role of this gene as a monogenic cause of disease has been repeatedly and independently demonstrated providing very strong convincing evidence in humans and no conflicting evidence for this gene's role in this disease. (previously labelled as probable).
>
>**moderate**: There is moderate evidence in humans to support a causal role for this gene in this disease with no contradictory evidence. The body of evidence is not large (e.g possibly only one key paper) but appears convincing enough that the gene-disease pair is likely to be validated with additional evidence in the near future.
>
>**"limited"**: Little human evidence exists to support a causal role for this gene in this disease, but not all evidence has been refuted. For example, there may be a collection of rare missense variants in humans but without convincing functional impact, segregation data that could either arise by chance (e.g across one or two meioses) or does not implicate a single gene, or functional data without direct recapitulation of the phenotype. Overall, the body of evidence does not meet contemporary criteria for claiming a valid association with disease. The majority are probably false associations. (previously labelled as possible).
>
>**"disputed"**: "Although evidence has been reported, other evidence of equal weight disputes the claim."
>
>**"refuted"**: "There has been an assertion of a gene-disease association in the literature, but new valid evidence has arisen that refutes the entire original body of evidence."

<div class="alert alert-block alert-success">

**2024-04-15:**

My thinking is...
1. rows with **"refuted"** and **"disputed"** values **should not be used to create edges for Translator**, because there's strong evidence that there ISN'T an association (negation) based on the definitions. 
2. rows with **"limited"** confidence can be kept because I interpret the definition as saying there is AN association - it's just not causal (as far as we know) and it's unclear how "real"/important it is. So these rows should have a predicate weaker than "causes"/"contributes to" -> **using "related to" for now**. 
3. keep rows with **"moderate", "strong", "definitive"** values, because there's moderate-definitive evidence that a gene DOES HAVE a causal role in this disease -> **using "causes" for now**

    
Plus: use subject_form_or_variant_qualifier *genetic_variant_form*. Okay because every row has an allelic_requirement value, and those [terms](https://www.ebi.ac.uk/gene2phenotype/about/terminology) are for the gene's mutations that possibly cause the disease. 
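
The decisions above can be sketched as a lookup table. The snake-case predicate strings and the function name are assumptions for illustration; the final predicate choices are still open (see the data-modeling notes).

```python
# Confidence value -> predicate, following the three rules listed above
PREDICATE_BY_CONFIDENCE = {
    "definitive": "causes",
    "strong": "causes",
    "moderate": "causes",
    "limited": "related_to",
}

# Rows with these values are not turned into edges at all
EXCLUDED_CONFIDENCE = {"refuted", "disputed"}


def predicate_for(confidence):
    """Return the predicate for a confidence value, or None if the row is excluded."""
    if confidence in EXCLUDED_CONFIDENCE:
        return None
    if confidence not in PREDICATE_BY_CONFIDENCE:
        # An unknown value suggests the data's terminology changed
        raise ValueError(f"Unexpected confidence value: {confidence!r}")
    return PREDICATE_BY_CONFIDENCE[confidence]
```

Raising on unknown values keeps a new confidence category from silently producing edges with the wrong predicate.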

<div class="alert alert-block alert-danger">

Data-modeling notes: options for gene-disease associations are confusing 
* can "causes / contributes to" be used? Maybe it makes more sense to use them with qualifiers on gene/protein (form or variant, aspect)...but are we allowed to use qualifiers here?
* what's the diff between "associated with" and "genetically associated with"? 
* "gene associated with condition" is child of "genetically associated with", but seems to be more general - basically a "related to". Also would look weird in UI, right? 

In [18]:
df["confidence"].value_counts()

confidence
definitive    2047
strong         853
limited        518
moderate       240
refuted          1
Name: count, dtype: int64

**2025-03-28 data:** 
No "disputed" values, only 1 "refuted" row to remove

### Removing rows + stats

In [19]:
## put into parser (format): DONE

## calculate stats before removing

n_rows_original = df.shape[0]
n_rows_refuted = df[df["confidence"] == "refuted"].shape[0]
n_rows_disputed = df[df["confidence"] == "disputed"].shape[0]

In [20]:
## put into parser (format): DONE

## remove rows, calculate stats after

df = df[~ df["confidence"].isin(["refuted", "disputed"])].reset_index(drop=True)
n_rows_after_confidence = df.shape[0]

In [21]:
## put into parser (format): DONE

## Print stats

print(f"{n_rows_original} unique rows/records in original dataset\n")

print(f"Removing rows based on confidence:")
print(f"{n_rows_refuted}: 'refuted'")
print(f"{n_rows_disputed}: 'disputed'\n")

print(f"{n_rows_after_confidence} rows afterwards")

3659 unique rows/records in original dataset

Removing rows based on confidence:
1: 'refuted'
0: 'disputed'

3658 rows afterwards


## Pre-NodeNorming

Querying NodeNorm: send unique values (no duplicates!) from entire column in large batches -> generate mapping dict to use. 
<br>
__Not querying 1-by-1 or 1 row at a time: much slower__ and would involve sending duplicate IDs (unless saved dict is kept outside loop and checked) 
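
A sketch of the "unique values, large batches" idea without any network call. The request-body keys match the ones used later in this notebook; the function name and the toy IDs are made up.

```python
def build_nodenorm_requests(values, batch_size=1000):
    """Yield one request body per batch of unique, non-missing CURIEs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least one")
    # dict.fromkeys removes duplicates while keeping first-seen order;
    # the isinstance check drops missing values (None / NaN)
    unique = list(dict.fromkeys(v for v in values if isinstance(v, str)))
    for start in range(0, len(unique), batch_size):
        yield {"curies": unique[start:start + batch_size], "conflate": True}


toy_ids = ["HGNC:1", "HGNC:1", None, "HGNC:2", "HGNC:3"]
toy_requests = list(build_nodenorm_requests(toy_ids, batch_size=2))
```

Each yielded dict could then be sent with `requests.post(nodenorm_url, json=body)`.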

Not going to use NameResolver: I'm not optimistic it would work anyway. My manual process for getting "better" disease IDs is to take the gene IDs, find the diseases they're linked to in OMIM and Monarch, and check whether those match the data's disease name / phenotypes / publications. That is more involved than a NameResolver lookup.

<div class="alert alert-block alert-danger">

Set the NodeNorm URL you want to use. 

In [22]:
## put into parser (format): DONE

import requests

## from BioThings annotator code: for interoperability between diff Python versions
# try:
#     from itertools import batched  # new in Python 3.12
# except ImportError:
#     from itertools import islice

#     def batched(iterable, n):
#         # batched('ABCDEFG', 3) → ABC DEF G
#         if n < 1:
#             raise ValueError("n must be at least one")
#         iterator = iter(iterable)
#         while batch := tuple(islice(iterator, n)):
#             yield batch

## doing to test that this works
from itertools import islice

def batched(iterable, n):
    # batched('ABCDEFG', 3) → ABC DEF G
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch

nodenorm_url = "https://nodenorm.ci.transltr.io/get_normalized_nodes"

### Exploring: Genes

**2025-03-28 data:** Every row has at least 1 gene ID (the HGNC column has no missing values), so no rows will be dropped for lacking a gene ID to use in the pre-NodeNorming. 

In [23]:
df[["gene_symbol", "hgnc_id", "gene_mim"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3658 entries, 0 to 3657
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   gene_symbol  3658 non-null   object
 1   hgnc_id      3658 non-null   object
 2   gene_mim     3656 non-null   object
dtypes: object(3)
memory usage: 85.9+ KB


In [24]:
df[df["gene_mim"].isna()]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
954,G2P00977,ZNF599,,HGNC:26408,FLJ30663,ZNF599-related NOT IN OMIM,,,monoallelic_autosomal,,limited,uncertain,,undetermined,inferred,,,,Developmental disorders,,2015-07-22 16:15:03+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00977
3451,G2P02168,MFSD6L,,HGNC:26656,FLJ35773,MFSD6L-related congenital cataract,,MONDO:0005129,biallelic_autosomal,,limited,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,undetermined,inferred,,HP:0000007; HP:0010864,22935719.0,Eye disorders,,2017-08-29 09:35:13+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02168


In [25]:
## saving stats on data with no gene IDs, just in case

stats_no_gene_IDs = {
    "n_rows": df[df["gene_mim"].isna() & df["hgnc_id"].isna()].shape[0],
    "n_names": len(df[df["gene_mim"].isna() & df["hgnc_id"].isna()]["gene_symbol"].unique())
}

stats_no_gene_IDs["n_rows"]
stats_no_gene_IDs["n_names"]

0

0

#### HGNC

__Running Gene HGNC IDs through NodeNorm__


Catching potential mapping failures for later stats report

In [26]:
## saving stats on data with no HGNC IDs, just in case

n_rows_no_hgnc = df["hgnc_id"].isna().sum()

In [27]:
## get set of unique CURIEs to put into NodeNorm
hgnc = df["hgnc_id"].dropna().unique()
len(hgnc)

3000

In [28]:
hgnc_nodenorm_mapping = {}

## set up variables to catch potential mapping failures
stats_hgnc_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [29]:
## larger batches are quicker
for batch in batched(hgnc, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
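    r = requests.post(nodenorm_url, json=req_body)
    ## fail loudly on HTTP errors instead of trying to parse an error page as JSON
    r.raise_for_status()
    response = r.json()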
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Gene":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        hgnc_nodenorm_mapping.update(temp)
                    else:
                        stats_hgnc_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_hgnc_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_hgnc_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_hgnc_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [30]:
len(hgnc_nodenorm_mapping)

stats_hgnc_mapping_failures

3000

{'unexpected_error': {},
 'nodenorm_returned_none': [],
 'wrong_category': {},
 'no_label': []}

#### OMIM

__Running Gene OMIM IDs through NodeNorm__

Catching potential mapping failures for later stats report. 

Pasted, adjusted from HGNC code blocks above.
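
Since the same sorting logic is pasted for each ID type, it could be pulled into one helper. This is a sketch only: the function name is hypothetical, the toy response is hand-written rather than real NodeNorm output, and the `unexpected_error` bucket is left out for brevity.

```python
def sort_nodenorm_response(response, expected_category="biolink:Gene"):
    """Split a NodeNorm response into kept mappings and categorized failures."""
    mapping = {}
    failures = {"nodenorm_returned_none": [], "wrong_category": {}, "no_label": []}
    for curie, info in response.items():
        if info is None:
            # NodeNorm had no info on this ID
            failures["nodenorm_returned_none"].append(curie)
        elif info["type"][0] != expected_category:
            failures["wrong_category"][curie] = info["type"][0]
        elif not info["id"].get("label"):
            failures["no_label"].append(curie)
        else:
            mapping[curie] = {
                "primary_id": info["id"]["identifier"],
                "primary_label": info["id"]["label"],
            }
    return mapping, failures


# Hand-written toy response using placeholder IDs
toy_response = {
    "EX:1": {"id": {"identifier": "EX:A", "label": "gene A"}, "type": ["biolink:Gene"]},
    "EX:2": None,
    "EX:3": {"id": {"identifier": "EX:C", "label": "thing C"}, "type": ["biolink:Disease"]},
    "EX:4": {"id": {"identifier": "EX:D"}, "type": ["biolink:Gene"]},
}
toy_mapping, toy_failures = sort_nodenorm_response(toy_response)
```

The same helper could serve the disease columns by passing a different `expected_category`.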

In [31]:
## get set of unique CURIEs to put into NodeNorm
gene_omim = df["gene_mim"].dropna().unique()
len(gene_omim)

2998

In [32]:
gene_omim_nodenorm_mapping = {}

## set up variables to catch potential mapping failures
stats_gene_omim_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [33]:
## larger batches are quicker
for batch in batched(gene_omim, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
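    r = requests.post(nodenorm_url, json=req_body)
    ## fail loudly on HTTP errors instead of trying to parse an error page as JSON
    r.raise_for_status()
    response = r.json()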
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Gene":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        gene_omim_nodenorm_mapping.update(temp)
                    else:
                        stats_gene_omim_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_gene_omim_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_gene_omim_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_gene_omim_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [34]:
len(gene_omim_nodenorm_mapping)

stats_gene_omim_mapping_failures

2997

{'unexpected_error': {},
 'nodenorm_returned_none': ['OMIM:621003'],
 'wrong_category': {},
 'no_label': []}

In [35]:
## from looking at 2025-03-28 data
df[df["gene_mim"] == "OMIM:621003"]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
2765,G2P03714,SP9,OMIM:621003,HGNC:30690,ZNF990,SP9-related neurodevelopmental disorder with/w...,,,monoallelic_autosomal,,moderate,altered gene product structure,frameshift_variant_NMD_escaping; missense_variant,undetermined,inferred,,,38288683,Developmental disorders,Discussions during curation agreed that there ...,2025-03-05 11:14:46+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03714


**2025-03-28 data:** `OMIM:621003` is a valid gene ID [(OMIM webpage)](https://omim.org/entry/621003), so this is a NodeNorm error.

#### Comparing HGNC vs OMIM

In [36]:
## if row has both IDs, look for diff in mappings from each ID
for row in df[["gene_mim", "hgnc_id"]].itertuples(index=False):
    ## has both IDs
    if pd.notna(row.gene_mim) and pd.notna(row.hgnc_id):
        ## if have NodeNorm mappings for both
        if gene_omim_nodenorm_mapping.get(row.gene_mim) and \
        hgnc_nodenorm_mapping.get(row.hgnc_id):
            ## check if mappings are diff
            if gene_omim_nodenorm_mapping[row.gene_mim]["primary_id"] != \
            hgnc_nodenorm_mapping[row.hgnc_id]["primary_id"]:
                print(row)

## 2025-03-28 data: nothing prints, so there are no mismatches
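
The check above could also be written as a reusable function that collects mismatches instead of printing them, so the same logic can be applied to other pairs of ID columns. This is a sketch only: the name `find_mapping_mismatches` and every `EXAMPLE:` ID are hypothetical placeholders. The `OMIM:621003` / `HGNC:30690` pair is from this notebook and gets skipped because the OMIM ID has no NodeNorm mapping.

```python
def find_mapping_mismatches(id_pairs, mapping_a, mapping_b):
    """Return the (id_a, id_b) pairs whose NodeNorm primary IDs differ.

    Pairs are skipped when either ID is missing or has no kept mapping.
    """
    mismatches = []
    for id_a, id_b in id_pairs:
        entry_a = mapping_a.get(id_a)
        entry_b = mapping_b.get(id_b)
        ## missing values (None/NaN) and failed mappings are never dict keys
        if entry_a is None or entry_b is None:
            continue
        if entry_a["primary_id"] != entry_b["primary_id"]:
            mismatches.append((id_a, id_b))
    return mismatches


## toy inputs: EXAMPLE IDs are made up for illustration
toy_pairs = [
    ("EXAMPLE:A1", "EXAMPLE:B1"),   ## both map to the same entity
    ("EXAMPLE:A2", "EXAMPLE:B2"),   ## map to different entities
    ("OMIM:621003", "HGNC:30690"),  ## first ID has no mapping -> skipped
    (None, "EXAMPLE:B1"),           ## missing ID -> skipped
]
toy_mapping_a = {
    "EXAMPLE:A1": {"primary_id": "EXAMPLE:X", "primary_label": "x"},
    "EXAMPLE:A2": {"primary_id": "EXAMPLE:Y", "primary_label": "y"},
}
toy_mapping_b = {
    "EXAMPLE:B1": {"primary_id": "EXAMPLE:X", "primary_label": "x"},
    "EXAMPLE:B2": {"primary_id": "EXAMPLE:Z", "primary_label": "z"},
    "HGNC:30690": {"primary_id": "EXAMPLE:W", "primary_label": "w"},
}

toy_mismatches = find_mapping_mismatches(toy_pairs, toy_mapping_a, toy_mapping_b)
```

In the notebook, the pairs could come from `df[["gene_mim", "hgnc_id"]].itertuples(index=False, name=None)`, which yields plain tuples.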

In [37]:
## look for differences in name between NodeNormed and original data

for row in df[["gene_symbol", "hgnc_id"]].itertuples(index=False):
    ## works because both columns have no missing values and there are no failed mappings
    ## if this changes, this code block needs to be adjusted
    if row.gene_symbol != hgnc_nodenorm_mapping[row.hgnc_id]["primary_label"]:
        print(f"G2P name {row.gene_symbol}, ID {row.hgnc_id}")
        print(f'NodeNorm name {hgnc_nodenorm_mapping[row.hgnc_id]["primary_label"]}, ID {hgnc_nodenorm_mapping[row.hgnc_id]["primary_id"]}')
        print("\n")

G2P name MT-TP, ID HGNC:7494
NodeNorm name TRNP, ID NCBIGene:4571


G2P name CENPJ, ID HGNC:17272
NodeNorm name CPAP, ID NCBIGene:55835


G2P name CCDC103, ID HGNC:32700
NodeNorm name DNAAF19, ID NCBIGene:388389


G2P name MT-TL1, ID HGNC:7490
NodeNorm name TRNL1, ID NCBIGene:4567


G2P name MT-ND1, ID HGNC:7455
NodeNorm name ND1, ID NCBIGene:4535


G2P name MT-ND4, ID HGNC:7459
NodeNorm name ND4, ID NCBIGene:4538


G2P name MT-ATP6, ID HGNC:7414
NodeNorm name ATP6, ID NCBIGene:4508


G2P name MT-ND5, ID HGNC:7461
NodeNorm name ND5, ID NCBIGene:4540


G2P name MT-ND6, ID HGNC:7462
NodeNorm name ND6, ID NCBIGene:4541




**2025-03-28 data:** 

Review of mismatched names:
* 2: NodeNorm is correct. CENPJ should be CPAP, and CCDC103 should be DNAAF19
* 7: the rest are mitochondrial genes (G2P symbols starting with `MT-`). For these, the main NCBIGene name seems to match the G2P name, not the NodeNorm label -> messaged NodeNorm
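
The split between mitochondrial and other mismatches can be reproduced with a quick prefix check. A small sketch, assuming the `MT-` prefix (HGNC's convention for mitochondrially encoded genes) is enough to identify them; the list is copied by hand from the printed output above, and the variable names are made up for this example.

```python
## (G2P name, NodeNorm name) pairs copied from the printed mismatches
name_mismatches = [
    ("MT-TP", "TRNP"),
    ("CENPJ", "CPAP"),
    ("CCDC103", "DNAAF19"),
    ("MT-TL1", "TRNL1"),
    ("MT-ND1", "ND1"),
    ("MT-ND4", "ND4"),
    ("MT-ATP6", "ATP6"),
    ("MT-ND5", "ND5"),
    ("MT-ND6", "ND6"),
]

## assumption: an "MT-" prefix on the G2P symbol marks a mitochondrial gene
mito_mismatches = [pair for pair in name_mismatches if pair[0].startswith("MT-")]
other_mismatches = [pair for pair in name_mismatches if not pair[0].startswith("MT-")]

print(len(mito_mismatches), len(other_mismatches))
```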

#### Conclusions

<div class="alert alert-block alert-success">

**2025-03-28 data:** 
    
__Exploration__

* no HGNC mapping failures; 1 gene OMIM mapping failure (`OMIM:621003`, a NodeNorm error)
* when rows have both OMIM and HGNC IDs, there were no differences in NodeNorm mapping ("mismatches")
    
__Decision: Use HGNC ID column to generate NodeNorm values__

* fewer missing values (none right now)
* HGNC IDs are probably only genes (whereas OMIM terms can be multiple types)

### Exploring: Diseases

There are many more missing IDs for Disease, compared to Gene. 

As mentioned at the beginning of the "Pre-NodeNorming" section, I won't be using NameResolver right now. 

__This means all rows w/o any disease IDs will be removed__ because they cannot be pre-NodeNormed. 

In [38]:
df[["disease_name", "disease_mim", "disease_MONDO"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3658 entries, 0 to 3657
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   disease_name   3658 non-null   object
 1   disease_mim    2571 non-null   object
 2   disease_MONDO  564 non-null    object
dtypes: object(3)
memory usage: 85.9+ KB


In [39]:
df[df["disease_mim"].isna() & df["disease_MONDO"].isna()]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
15,G2P00016,NAA10,OMIM:300013,HGNC:18704,ARD1; ARD1A; DXS707; TE2,NAA10-related nonpecific severe intellectual d...,,,monoallelic_X_heterozygous,,definitive,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,undetermined,inferred,,,25099252,Developmental disorders,,2015-07-22 16:14:09+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00016
20,G2P00021,COL1A1,OMIM:120150,HGNC:2197,OI4,COL1A1-related osteogenesis imperfecta spectrum,,,monoallelic_autosomal,restricted mutation set,definitive,altered gene product structure,,dominant negative,inferred,,HP:0002808; HP:0000347; HP:0005622; HP:0001075...,9295084; 3082886; 18409203; 2295701; 1988452; ...,Developmental disorders; Skin disorders; Skele...,,2025-01-15 11:51:09+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00021
32,G2P00033,SCN2A,OMIM:182390,HGNC:10588,HBSCI; HBSCII; NAV1.2; SCN2A1; SCN2A2,SCN2A-related nonspecific severe intellectual ...,,,monoallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,30062040,Developmental disorders,,2015-07-22 16:14:11+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00033
56,G2P00057,FLNA,OMIM:300017,HGNC:3754,ABP-280; FLN; FLN1; OPD1; OPD2,FLNA-related epileptic encephalopathy,,,monoallelic_X_heterozygous,,definitive,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,undetermined,inferred,,,23934111,Developmental disorders; Skin disorders,,2025-01-27 22:44:48+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00057
83,G2P00086,PRRT2,OMIM:614386,HGNC:30500,DKFZP547J199; DSPB3; DYT10; EKD1; FICCA; FLJ25...,PRRT2-related intellectual developmental disorder,,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,21937992,Developmental disorders,,2015-07-22 16:14:14+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00086
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3620,G2P02428,IPO13,OMIM:610411,HGNC:16853,IMP13; KIAA0724; RANBP13,"IPO13-related ocular coloboma, microphthalmia,...",,,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,HP:0000007; HP:0000568; HP:0000482; HP:0000612,29700284,Eye disorders,,2018-05-25 14:49:39+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02428
3623,G2P02432,MAFB,OMIM:608968,HGNC:6408,KRML,MAFB-related focal segmental glomerulosclerosi...,,,monoallelic_autosomal,,strong,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,undetermined,inferred,,,29779709,Eye disorders,,2018-05-30 15:12:32+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02432
3629,G2P02540,IFT88,OMIM:600595,HGNC:20606,D13S1056E; HTG737; MGC26259; TG737; TTC10,IFT88-related non-syndromic retinal degeneration,,,biallelic_autosomal,,limited,absent gene product,,loss of function,inferred,,HP:0000007; HP:0000662; HP:0007947; HP:0008035...,29978320,Eye disorders,,2025-01-16 11:09:04+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02540
3645,G2P02823,MFRP,OMIM:606227,HGNC:18121,C1QTNF5; FLJ30570; NNO2; RD6,MFRP-related non-syndromic retinitis pigmenta,,,biallelic_autosomal,,definitive,uncertain,,undetermined,inferred,,HP:0000556; HP:0000510,24474277; 22605927; 22142163,Eye disorders,,2024-10-25 14:34:52+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02823


In [40]:
## saving stats on data with no disease IDs

stats_no_disease_IDs = {
    "n_rows": df[df["disease_mim"].isna() & df["disease_MONDO"].isna()].shape[0],
    "n_names": len(df[df["disease_mim"].isna() & df["disease_MONDO"].isna()]["disease_name"].unique())
}

stats_no_disease_IDs["n_rows"]
stats_no_disease_IDs["n_names"]

642

629

#### OMIM/orphanet

__Running OMIM/orphanet IDs through NodeNorm__

Catching mapping failures for later stats report

Pasted, adjusted from HGNC code blocks above.

In [41]:
## put into parser (format): DONE

## get set of unique CURIEs to put into NodeNorm
disease_OmOr = df["disease_mim"].dropna().unique()
len(disease_OmOr)

2402

In [42]:
## put into parser (format): DONE

OmOr_nodenorm_mapping = {}

## set up variables to catch mapping failures
stats_OmOr_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [43]:
## put into parser (format): DONE

## larger batches are quicker
for batch in batched(disease_OmOr, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    response = r.json()
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Disease":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        OmOr_nodenorm_mapping.update(temp)
                    else:
                        stats_OmOr_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_OmOr_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_OmOr_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_OmOr_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [44]:
## put into parser (format): DONE

## calculate stats: number of rows affected by each type of mapping failure
stats_OmOr_mapping_failures.update({
    "n_rows_none": df[df["disease_mim"].isin(stats_OmOr_mapping_failures["nodenorm_returned_none"])].shape[0],
    "n_rows_wrong_category": df[df["disease_mim"].isin(stats_OmOr_mapping_failures["wrong_category"].keys())].shape[0],
    "n_rows_no_label": df[df["disease_mim"].isin(stats_OmOr_mapping_failures["no_label"])].shape[0]
})

In [45]:
len(OmOr_nodenorm_mapping)

stats_OmOr_mapping_failures["unexpected_error"]

len(stats_OmOr_mapping_failures["nodenorm_returned_none"])
len(stats_OmOr_mapping_failures["wrong_category"])
len(stats_OmOr_mapping_failures["no_label"])

2334

{}

39

26

3

In [46]:
## code used to review mapping failures 

# stats_OmOr_mapping_failures["nodenorm_returned_none"]

stats_OmOr_mapping_failures["wrong_category"].keys()

# stats_OmOr_mapping_failures["no_label"]

dict_keys(['OMIM:300153', 'OMIM:300171', 'OMIM:188400', 'OMIM:611579', 'OMIM:603707', 'OMIM:601789', 'OMIM:603360', 'OMIM:609413', 'OMIM:300197', 'OMIM:603164', 'OMIM:123580', 'OMIM:601758', 'OMIM:601498', 'OMIM:601791', 'OMIM:170993', 'OMIM:601893', 'OMIM:606525', 'OMIM:602859', 'OMIM:612082', 'OMIM:601757', 'OMIM:300204', 'OMIM:600112', 'OMIM:104155', 'OMIM:618615', 'OMIM:614281', 'OMIM:603718'])

In [47]:
## code used to review mapping failures 

# df[df["disease_mim"] == "OMIM:613180"]

<div class="alert alert-block alert-info">

**2025-02-28 data:**    
    
__Reviewed Disease OMIM/orphanet NodeNorm mapping failures__

All were OMIM IDs, none were orphanet. 
    
39 cases where NodeNorm returned None (didn't recognize/resolve ID). __I checked some (10)__:
* 5: ID has been replaced/moved to a diff ID (OMIM:607236, OMIM:608890, OMIM:613180, OMIM:300706, OMIM:300141) -> emailed EBI gene2pheno
* 3: ID doesn't exist (OMIM:249163, OMIM:319029, OMIM:237145) -> emailed EBI gene2pheno
* 1: NodeNorm error - this is a valid disease ID that it should recognize (OMIM:133700) -> messaged NodeNorm
* 1: valid ID, but it doesn't seem to be a disease. There may be better IDs out there (OMIM:601884) -> messaged NodeNorm, emailed EBI gene2pheno

26 cases where NodeNorm category was something else (currently, always Gene). I checked all: 
* 25: NodeNorm is correct, this is a gene -> emailed EBI gene2pheno
* 1: NodeNorm error - this is a valid disease ID (OMIM:188400) -> messaged NodeNorm
    
3 cases where NodeNorm didn't have a primary label. I checked all:
* 2: NodeNorm error - these are valid disease IDs with labels (OMIM:620987, OMIM:620964) -> messaged NodeNorm
* 1: valid ID, but it doesn't seem to be a disease (OMIM:300129). EBI gene2pheno shouldn't use, not sure it should be in NodeNorm -> messaged NodeNorm, emailed EBI gene2pheno
    
</div>

In [48]:
## code used to check for orphanet mapping failures 

for i in stats_OmOr_mapping_failures["nodenorm_returned_none"]:
    if "orphanet" in i:
        print(i)
        
for i in stats_OmOr_mapping_failures["wrong_category"].keys():
    if "orphanet" in i:
        print(i)

for i in stats_OmOr_mapping_failures["no_label"]:
    if "orphanet" in i:
        print(i)

**2025-02-28 data:** 

No orphanet IDs had mapping failures, but I checked all (9) orphanet mappings anyway. They looked fine.
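
The three loops used for this check could be folded into one helper. The sketch below is an alternative, not the code that was run: the name `failures_with_prefix` is hypothetical, it matches on a CURIE prefix with `startswith` rather than the substring test used above, and `toy_failures` holds only a few of the failed IDs mentioned in the review notes.

```python
def failures_with_prefix(failures, prefix):
    """List the failed CURIEs that start with `prefix`, per failure type."""
    hits = {}
    for failure_type in ("nodenorm_returned_none", "wrong_category", "no_label"):
        ## iterating a list gives its items, iterating a dict gives its keys,
        ## so this works for both kinds of failure container
        hits[failure_type] = [
            curie for curie in failures[failure_type] if curie.startswith(prefix)
        ]
    return hits


## toy input: a small subset of the failures listed in the review above
toy_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": ["OMIM:133700", "OMIM:601884"],
    "wrong_category": {"OMIM:188400": "biolink:Gene"},
    "no_label": ["OMIM:620987", "OMIM:620964", "OMIM:300129"],
}

orphanet_hits = failures_with_prefix(toy_failures, "orphanet:")
omim_hits = failures_with_prefix(toy_failures, "OMIM:")
```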

In [49]:
df[df["disease_mim"].str.contains("orphanet", na=False)]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
1958,G2P02555,SMAD6,OMIM:602931,HGNC:6772,HST17432; MADH6; MADH7,SMAD6-related non-syndromic craniosynostosis,orphanet:139390,,monoallelic_autosomal,,limited,absent gene product,,loss of function,inferred,,HP:0001363,27606499; 28808027,Developmental disorders,,2019-04-17 12:18:34+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02555
1967,G2P02564,TONSL,OMIM:604546,HGNC:7801,IKBR; NFKBIL2,TONSL-related sponastrime dysplasia,orphanet:93357,,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,HP:0100255; HP:0002650; HP:0005281; HP:0004322,30773277; 30773278,Developmental disorders,,2018-11-07 09:53:40+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02564
2791,G2P01819,RET,OMIM:164761,HGNC:9967,CDHF12; CDHR16; HSCR1; MEN2A; MEN2B; MTC1; PTC...,RET-related medullary thyroid carcinoma,orphanet:1332,,monoallelic_autosomal,,definitive,uncertain,,undetermined,inferred,,HP:0002865,10323403; 11454140; 14602786; 15240641; 950672...,Skin disorders; Cancer disorders,,2017-09-01 16:19:16+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P01819
2912,G2P02669,INSR,OMIM:147670,HGNC:6091,CD220,INSR-related leprechaunism,orphanet:508,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000924; HP:0008897; HP:0003202; HP:0003162...,8105179; 7815442,Skin disorders,,2019-09-16 15:47:00+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02669
3142,G2P03354,SEC23B,OMIM:610512,HGNC:10702,CDA-II; CDAII; CDAN2; HEMPAS,SEC23B-related Cowden syndrome,orphanet:201,,monoallelic_autosomal,,limited,uncertain,,gain of function,inferred,,HP:0012846; HP:0005584; HP:0500009; HP:0012114...,26522472,Cancer disorders,,2022-11-30 08:49:50+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03354
3315,G2P02001,RCBTB1,OMIM:607867,HGNC:18243,CLLD7; CLLL7; FLJ10716,RCBTB1-related familial exudative vitreoretino...,orphanet:891,,monoallelic_autosomal,,limited,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,undetermined,inferred,,HP:0001141; HP:0012231; HP:0003829; HP:0000541...,26908610,Eye disorders,,2017-06-11 18:14:48+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02001
3495,G2P02222,REV3L,OMIM:602776,HGNC:9968,POLZ; REV3,REV3L-related Moebius syndrome,orphanet:570,,monoallelic_autosomal,,limited,uncertain,,undetermined,inferred,,,26068067,Eye disorders,,2017-08-30 12:00:00+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02222
3537,G2P02286,TMEM98,OMIM:615949,HGNC:24529,DKFZP564K1964,TMEM98-related nanophthalmos,orphanet:35612,,monoallelic_autosomal,,strong,uncertain,,undetermined,inferred,,,24852644; 26392740,Eye disorders,,2017-08-31 13:26:51+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02286
3618,G2P02426,TENM3,OMIM:610083,HGNC:29944,KIAA1455; ODZ3; TEN-M3; TEN3,TENM3-related colobomatous microphthalmia,orphanet:98938,,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,HP:0000565; HP:0000568; HP:0000589; HP:0000567...,29753094; 27103084; 22766609,Eye disorders,,2018-05-25 09:59:48+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02426


In [50]:
OmOr_nodenorm_mapping["orphanet:1332"]

{'primary_id': 'MONDO:0015277',
 'primary_label': 'medullary thyroid gland carcinoma'}

<div class="alert alert-block alert-success">

**2025-03-28 data:** 
    
I decided <b>not to try using MONDO mappings when the OMIM mapping failed</b>, because there are only a few cases where those rows even have MONDO IDs to use. 

* nodenorm_returned_none (39): none have MONDO
* wrong_category (26): only 2 have MONDO
* no_label (3): none have MONDO

In [51]:
## code used to check how many rows have OMIM failure + MONDO ID 

df[df["disease_mim"].isin(stats_OmOr_mapping_failures["nodenorm_returned_none"]) & 
   df["disease_MONDO"].notna()].shape

df[df["disease_mim"].isin(stats_OmOr_mapping_failures["wrong_category"].keys()) & 
   df["disease_MONDO"].notna()].shape

df[df["disease_mim"].isin(stats_OmOr_mapping_failures["no_label"]) & 
   df["disease_MONDO"].notna()].shape

(0, 22)

(2, 22)

(0, 22)

#### MONDO

__Running MONDO IDs through NodeNorm__

Catching potential mapping failures for later stats report

Pasted, adjusted from Disease OMIM/orphanet code blocks above.

In [52]:
## get set of unique CURIEs to put into NodeNorm
mondo = df["disease_MONDO"].dropna().unique()
len(mondo)

383

In [53]:
mondo_nodenorm_mapping = {}

## set up variables to catch mapping failures
stats_mondo_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [54]:
## larger batches are quicker
for batch in batched(mondo, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    response = r.json()
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Disease":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        mondo_nodenorm_mapping.update(temp)
                    else:
                        stats_mondo_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_mondo_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_mondo_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_mondo_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [55]:
len(mondo_nodenorm_mapping)

stats_mondo_mapping_failures

383

{'unexpected_error': {},
 'nodenorm_returned_none': [],
 'wrong_category': {},
 'no_label': []}

#### Comparing OMIM/orphanet vs MONDO

In [56]:
## if row has both IDs, look for diff in mappings from each ID

## list of tuples (omim/orpha, mondo)
mismatches = []

for row in df[["disease_mim", "disease_MONDO"]].itertuples(index=False):
    ## has both IDs
    if pd.notna(row.disease_mim) and pd.notna(row.disease_MONDO):
        ## if have NodeNorm mappings for both
        if OmOr_nodenorm_mapping.get(row.disease_mim) and \
        mondo_nodenorm_mapping.get(row.disease_MONDO):
            ## check if mappings are diff
            if OmOr_nodenorm_mapping[row.disease_mim]["primary_id"] != \
            mondo_nodenorm_mapping[row.disease_MONDO]["primary_id"]:
                mismatches.append((row.disease_mim, row.disease_MONDO))

print(f"There's {len(mismatches)} mismatches between the OMIM/orphanet and MONDO NodeNorm mappings.")

There's 22 mismatches between the OMIM/orphanet and MONDO NodeNorm mappings.


In [57]:
## code chunk to review mismatches 1 by 1
mismatches[21]

('OMIM:175800', 'MONDO:0006602')

In [58]:
## code chunk to review mismatches 1 by 1

OmOr_nodenorm_mapping["OMIM:175800"]
mondo_nodenorm_mapping["MONDO:0006602"]

{'primary_id': 'MONDO:0008290',
 'primary_label': 'porokeratosis 1, Mibelli type'}

{'primary_id': 'MONDO:0006602', 'primary_label': 'porokeratosis'}
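
For the one-by-one review, a small formatter can show both mappings for a mismatch pair together. This is only a sketch: `describe_mismatch` is a hypothetical name, and the two toy dicts hold just the entries shown in the output above.

```python
def describe_mismatch(pair, mapping_a, mapping_b):
    """Return a two-line summary of the NodeNorm mappings for one mismatch pair."""
    id_a, id_b = pair
    entry_a = mapping_a[id_a]
    entry_b = mapping_b[id_b]
    return (
        f"{id_a} -> {entry_a['primary_id']} ({entry_a['primary_label']})\n"
        f"{id_b} -> {entry_b['primary_id']} ({entry_b['primary_label']})"
    )


## toy inputs: only the entries printed above
toy_OmOr_mapping = {
    "OMIM:175800": {
        "primary_id": "MONDO:0008290",
        "primary_label": "porokeratosis 1, Mibelli type",
    }
}
toy_mondo_mapping = {
    "MONDO:0006602": {"primary_id": "MONDO:0006602", "primary_label": "porokeratosis"}
}

summary = describe_mismatch(
    ("OMIM:175800", "MONDO:0006602"), toy_OmOr_mapping, toy_mondo_mapping
)
print(summary)
```

Looping this over `mismatches` with the full mapping dicts would print every pair in one pass.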

In [59]:
## code chunk to review mismatches 1 by 1

df[df["disease_mim"] == "OMIM:300696"]
df[df["disease_MONDO"] == "MONDO:0013812"]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
2523,G2P03293,FHL1,OMIM:300163,HGNC:3702,BA535K18.1; FHL1B; FLH1A; KYO-T; MGC111107; SL...,FHL1-related Emery-Dreifuss muscular dystrophy,OMIM:300696,MONDO:0010680,monoallelic_X_hemizygous,,definitive,decreased gene product level; absent gene prod...,stop_gained_NMD_escaping; stop_lost; splice_re...,loss of function,inferred,,HP:0003306; HP:0003704; HP:0003805; HP:0003676...,18179888; 19687455; 30681346; 19716112; 201868...,Developmental disorders; Cardiac disorders,Expert review done on 12/01/2022; FHL1-related...,2024-03-26 10:33:21+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03293


Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
122,G2P00126,ACTB,OMIM:102630,HGNC:132,,ACTB-related Baraitser-Winter syndrome,OMIM:243310,MONDO:0013812,monoallelic_autosomal,restricted mutation set,definitive,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,gain of function,inferred,,HP:0001274; HP:0002162; HP:0008897; HP:0005487...,22366783; 25052316; 27625340; 38592426; 34970860,Developmental disorders; Eye disorders,,2015-07-22 16:14:17+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00126
1609,G2P01638,ACTG1,OMIM:102560,HGNC:144,ACTG; DFNA20; DFNA26,ACTG1-related Baraitser-Winter syndrome,OMIM:614583,MONDO:0013812,monoallelic_autosomal,restricted mutation set,definitive,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,gain of function,inferred,,HP:0000159; HP:0001167; HP:0001274; HP:0000465...,22366783; 25052316; 27096712; 27240540; 29024830,Developmental disorders; Eye disorders,,2015-07-22 16:15:41+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P01638


<div class="alert alert-block alert-info">    

**2025-02-28 data:** 

__Review of OMIM vs MONDO NodeNorm mismatches (22)__

None were orphanet.    
    
---

__19: OMIM's mapping is better__

> __6: Mondo ID is related but wrong__ -> emailed EBI gene2pheno w/ example
> * 'OMIM:243310', 'MONDO:0013812': omim is correct syndrome 1, but mondo is syndrome 2 (diff gene)
> * 'OMIM:613575', 'MONDO:0044314': omim is correct 55, but mondo is 78 (diff gene)
> * 'OMIM:101000', 'MONDO:0008075': omim is correct type of schwannomatosis (NF2/type 2), vs mondo is a sibling. 
>   * NodeNorm should map omim to MONDO:0007039 but isn't -> messaged NodeNorm
> * 'OMIM:613987', __'MONDO:0009136'__: omim is correct recessive 2, but mondo is recessive 1 (diff gene? Confusing because Monarch page links to gene NHP2 but OMIM page doesn't)
>   * NodeNorm should map omim to MONDO:0013519 but isn't -> messaged NodeNorm  
> * 'OMIM:613988', 'MONDO:0009136': omim is correct recessive 3, but mondo is recessive 1 (diff gene)
>   * NodeNorm should map omim to MONDO:0013520 but isn't -> messaged NodeNorm
> * 'OMIM:616353', 'MONDO:0009136': omim is correct recessive 6, but mondo is recessive 1 (diff gene)
>   * NodeNorm should map omim to MONDO:0014600 but isn't -> messaged NodeNorm

> __13: Mondo ID is too general__ (can see on Monarch website) -> emailed EBI gene2pheno w/ example
> * 'OMIM:300696', 'MONDO:0010680': omim maps to MONDO:0010401, child of the mondo
> * 'OMIM:304120', 'MONDO:0019027': omim maps to MONDO:0010571 (syndrome type 2), child of the mondo (syndrome)
> * 'OMIM:610019', 'MONDO:0005129': omim maps to MONDO:0012395 (cataract 18), child of the mondo (cataract)
> * 'OMIM:611726', 'MONDO:0016295': omim maps to MONDO:0012721, child of the mondo 
> * 'OMIM:602668', 'MONDO:0016107': omim maps to MONDO:0011266 (type 2), child of the mondo
> * 'OMIM:203200', 'MONDO:0018910': omim maps to MONDO:0008746 (type 2), child of the mondo
> * 'OMIM:614328', 'MONDO:0017411': omim maps to MONDO:0013693 (type 1), child of the mondo
> * 'OMIM:175800', 'MONDO:0006602': omim maps to MONDO:0008290 (1, mibelli type), grandchild of the mondo
> * 'OMIM:614073', **'MONDO:0019312'**: omim maps to MONDO:0013556 (syndrome 4), child of the mondo (syndrome)
> * 'OMIM:614074', 'MONDO:0019312': omim maps to MONDO:0013557 (syndrome 5), child of the mondo (syndrome)
> * 'OMIM:614075', 'MONDO:0019312': omim maps to MONDO:0013558 (syndrome 6), child of the mondo (syndrome)
> * 'OMIM:614076', 'MONDO:0019312': omim maps to MONDO:0013559 (syndrome 7), child of the mondo (syndrome)
> * 'OMIM:614077', 'MONDO:0019312': omim maps to MONDO:0013560 (syndrome 8), child of the mondo (syndrome)

    
**1: MONDO's mapping is better**
<br>
Omim ID is slightly off -> __TELL EBI GENE2PHENO?__
* 'OMIM:613723', 'MONDO:0009181': mondo matches the disease name and phenotypes listed in the record better than the omim 


**1: Unsure**
* 'OMIM:158350', 'MONDO:0017623': omim is for Cowden syndrome 1, mondo is for PTEN hamartoma tumor syndrome. These are very similar, so I'm not sure which one is better. -> __TELL EBI GENE2PHENO?__
  * There's also another record w/ just the OMIM ID. I think the two rows should be merged. -> __TELL EBI GENE2PHENO?__


**1: NodeNorm error** -> messaged NodeNorm
* 'OMIM:224230', 'MONDO:0009136': both are recessive 1, NodeNorm should map to same entity

**Other rows reviewed:**
* 'OMIM:614583', 'MONDO:0013812': map to same correct entity

**2025-02-28 data:** 

The preliminary decision is to use disease OMIM/orphanet IDs because:
* fewer missing values
* more accurate in cases where there's also a MONDO ID

#### Checking MONDO data

Above, I decided the OMIM/orphanet disease IDs were better. 

However, I wondered whether the MONDO IDs match the disease name in rows that have no OMIM/orphanet ID. If so, they could be used for NodeNorming, and fewer rows would be dropped for lacking a disease ID that can be pre-NodeNormed. 

In [60]:
## get the data that has MONDO, doesn't have OMIM/orphanet

df_mondo_only = df[df["disease_mim"].isna() & df["disease_MONDO"].notna()].copy()

mondo_only = df_mondo_only["disease_MONDO"].dropna().unique()

In [61]:
## saving stats on data with only MONDO ID

stats_mondo_only = {
    "n_rows": df_mondo_only.shape[0],
    ## NOTE: despite the key name, this counts unique MONDO IDs, not disease names
    "n_names": len(mondo_only)
}

stats_mondo_only["n_rows"]
stats_mondo_only["n_names"]

445

278

In [62]:
## code chunk used to review some of the data

# df_mondo_only[df_mondo_only["disease_MONDO"] == mondo_only[240]]

df_mondo_only[df_mondo_only["panel"].str.contains("Skeletal", na=False)]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
3240,G2P03497,POP1,OMIM:602486,HGNC:30129,,POP1-related anauxetic dysplasia,,MONDO:0011773,biallelic_autosomal,,definitive,decreased gene product level,,undetermined,inferred,,,21455487; 27380734; 28067412,Skeletal disorders,,2023-12-20 09:04:04+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03497


<div class="alert alert-block alert-info">    

**2025-02-28 data:** 

__Reviewed some of the data__

Method: reviewed individual MONDO IDs, covering all panels (only 1 skeletal row, no hearing rows). The sample was the 3 IDs from the earlier review (related to mismatches) + `mondo_only` idx 0-240, step 10 + the skeletal row. 
    
__Summary__
* 37 rows (29 unique MONDO)
* __~16%__ were wrong (6/37) 
* Could tell EBI gene2pheno of issues but they are similar to those listed in mismatch mapping section
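
The "idx 0-240, step 10" part of the method can be sketched as below, assuming the indices are positions in `mondo_only` and that 240 is included. The stand-in list `mondo_only_example` is made up; only its length (278) comes from the stats above.

```python
## stand-in for mondo_only: 278 placeholder IDs (real IDs not reproduced here)
mondo_only_example = [f"EXAMPLE:{i:07d}" for i in range(278)]

## every 10th position from 0 through 240 (stop value 241 so 240 is included)
sample_positions = list(range(0, 241, 10))
sampled_ids = [mondo_only_example[i] for i in sample_positions]

print(len(sampled_ids))
```

This gives 25 positions. Together with the 3 IDs from the earlier review and the 1 skeletal ID, that would be consistent with the 29 unique MONDO IDs in the summary, if none of them overlap.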

__Details__

__6 MONDO is related but wrong__
* "MONDO:0009136" for "RTEL1-related dyskeratosis congenita" (two rows): mondo is recessive 1, which is wrong. Should be recessive 5 MONDO:0014076/OMIM:615190 (old/synonym name is dominant 4) 
* "MONDO:0044314" for 4 rows "CLN3-related retinal dystrophy", "GUCA1B-", "PRPS1-", "SNRNP200-": mondo is type 78 (specifically for ARHGEF18), which is wrong. Should instead be:
  * CLN3 and PRPS1: a more general term like MONDO:0004580 (retinal degeneration) -> MONDO:0019118 (inherited retinal dystrophy) -> MONDO:0019200 (retinitis pigmentosa)
  * GUCA1B: type 48, MONDO:0013447
  * SNRNP200: type 33, MONDO:0012477
* "MONDO:0013522" for "TERC-related dyskeratosis congenita": mondo is for type 3 (specifically for TINF2, that row is in "Great" section). Should be type 1 MONDO:0007485/OMIM:127550. (confusing because Monarch's page of type 1 includes TINF2 and TERT too, but OMIM page only includes TERC)


__4 MONDO is too general__ 
* "MONDO:0020341" (periventricular nodular heterotopia) for "ERMARD-related periventricular heterotopia". The ERMARD-specific version is a child term: MONDO:0014240/OMIM:615544 (type 6)
* "MONDO:0018965" (Alport syndrome) for "COL4A5-related Alport syndrome". The COL4A5-specific version is a child term: MONDO:0010520/OMIM:301050  (X-linked)
* "MONDO:0024676" for "REST-related Wilms tumour": The REST-specific version is a **related** term: MONDO:0014779/OMIM:616806 (type 6)
* "MONDO:0011773" for "POP1-related anauxetic dysplasia": the POP1-specific version is a child term: MONDO:0054561/OMIM:617396 (type 2)


__4 Unsure -> TELL EBI GENE2PHENO?__
* "MONDO:0005129" for "CYP51A1-related congenital cataract": mondo is cataract, which is not wrong but kinda general. MONDO:0033853 seems better (correlated with gene, matches phenotypes, orphanet ref uses one of the ref papers) 
* "MONDO:0018869" for "TMTC3-related cobblestone lissencephaly": while the mondo (cobblestone lissencephaly) sounds correct, it isn't linked to this gene. In contrast, a sibling disease is linked to the gene, matches the phenotypes, and uses the same paper: MONDO:0014992/OMIM:617255 (lissencephaly 8)
* "MONDO:0100100" for "SELENON-related myopathy": while mondo has exact name match, it's not directly linked to gene. Instead, its child disease is directly linked to gene MONDO:0011271/OMIM:602771 (rigid spine muscular dystrophy 1)
* "MONDO:0020367" for "MYOC-related juvenile open angle glaucoma": while mondo is almost-exact name match, it's not directly linked to gene. Instead, its child disease is directly linked to gene MONDO:0007664/OMIM:137750 (glaucoma 1, open angle, A) 


__5 Okay (using general term is fine)__
* "MONDO:0005129" for 3 other rows "WDR87-related congenital cataract", "AKR1E2-", "MFSD6L-": couldn't find better mappings. 
* "MONDO:0015469" for "DHRS3 related craniosynostosis": couldn't find better mapping
* "MONDO:0024676" (childhood kidney Wilms tumor) for "CTR9-related Wilms tumour", "TRIM28-": couldn't find better mapping. TRIM28 is correlated to parent term (kidney Wilms tumor). 


__18 Great__
* "MONDO:0012506" for "DSC2-related arrhythmogenic right ventricular cardiomyopathy"
* "MONDO:0011001" for "SCN5A-related Brugada syndrome"
* "MONDO:0013262" for "MYH7-related dilated cardiomyopathy"
* "MONDO:0013369" for "TNNI3-related hypertrophic cardiomyopathy"
* "MONDO:0010946" for "PRKAG2-related cardiomyopathy"
* "MONDO:0014143" for "RIT1-related Noonan syndrome"
* "MONDO:0010015" for "PXDN-related anterior segment dysgenesis with sclerocornea"
* "MONDO:0014214" for "DYNC2I1-related short-rib polydactyly"
* "MONDO:0013522" for "TINF2-related dyskeratosis congenita"
* "MONDO:0032876" for "WASF1-related intellectual disability with seizures"
* "MONDO:0859164" for "UNC45A-related osteootohepatoenteric syndrome"
* "MONDO:0018772" for "SLC30A7-related Joubert syndrome": using general term is fine since there isn't any established subtype of Joubert syndrome for this gene
* "MONDO:0010215" for "ERCC4-related xeroderma pigmentosum, group F"
* "MONDO:0009735" for "SPINK5-related Netherton syndrome"
* "MONDO:0007808" for "KRT1-related ichthyosis hystrix, Curth-Macklin type"
* "MONDO:0007566" for "TGFBR1-related multiple self-healing squamous epithelioma"
* "MONDO:0008285" for "PDGFRA-related gastrointestinal stromal tumor/GIST-plus syndrome, somatic or familial"
* "MONDO:0010912" for "TUBB3-related fibrosis of extraocular muscles, congenital"

#### Conclusions

<div class="alert alert-block alert-success">

**2025-02-28 data:** 
    
__Exploration__

* some rows have no disease IDs
* a few NodeNorm mapping failures for OMIM IDs (several different kinds): ~2.8% = 68 failed IDs / (2401 unique values in column - 9 orphanet)
  * no NodeNorm mapping failures for MONDO IDs
* when rows have both OMIM and MONDO IDs, there are sometimes differences in NodeNorm mapping ("mismatches"). __In these cases, OMIM IDs were much more accurate__
* __MONDO IDs can be inaccurate__ - see the blue review boxes
  * In contrast, it was much rarer to find an inaccurate OMIM ID mapping (found 1 case)


__Decision: Use OMIM ID column to generate NodeNorm values__

* fewer missing values
* seems to be more accurate (for successful NodeNorm mappings)
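
The ~2.8% failure rate quoted in the Exploration notes can be reproduced from the counts given there. This is only a worked check with the numbers hard-coded from the notes; the variable names are made up for illustration.

```python
# counts copied from the Exploration notes (hard-coded for illustration)
n_failed_ids = 68        # OMIM IDs with a NodeNorm mapping failure
n_unique_values = 2401   # unique values in the disease_mim column
n_orphanet = 9           # orphanet IDs, excluded from the OMIM count

failure_rate = n_failed_ids / (n_unique_values - n_orphanet)
print(f"{failure_rate:.1%}")
```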

## Stats on rows removed during NodeNorming

This section prints the statistics on rows in the original data that were removed. 

(Uses variables generated during the previous section "Pre-NodeNorming")

<div class="alert alert-block alert-success">

**2025-03-28 data:** 

Genes: No rows removed due to lack of IDs for NodeNorming or NodeNorm mapping issues.

In [63]:
## partial put into parser (format): DONE

print("Gene Pre-NodeNorming\n")

## no gene IDs
print(f'{stats_no_gene_IDs["n_rows"]} row(s) with no gene IDs')

## no HGNC IDs: key column for NodeNorming
print(f'{n_rows_no_hgnc} row(s) with no HGNC IDs')

## HGNC NodeNorm issues: none, but showing anyway
print("\n")
print("HGNC NodeNorm mapping failures:")

print(f'IDs with no data in NodeNorm: {len(stats_hgnc_mapping_failures["nodenorm_returned_none"])}')
print(f'IDs with the wrong NodeNormed category: {len(stats_hgnc_mapping_failures["wrong_category"])}')
print(f'IDs with no label in NodeNorm: {len(stats_hgnc_mapping_failures["no_label"])}')

Gene Pre-NodeNorming

0 row(s) with no gene IDs
0 row(s) with no HGNC IDs


HGNC NodeNorm mapping failures:
IDs with no data in NodeNorm: 0
IDs with the wrong NodeNormed category: 0
IDs with no label in NodeNorm: 0


<div class="alert alert-block alert-success">

**2025-03-28 data:** 
    
__Diseases: many rows removed__ due to lack of IDs for NodeNorming or NodeNorm mapping issues.

In [64]:
stats_OmOr_mapping_failures.keys()

dict_keys(['unexpected_error', 'nodenorm_returned_none', 'wrong_category', 'no_label', 'n_rows_none', 'n_rows_wrong_category', 'n_rows_no_label'])

In [65]:
## partial put into parser (format): DONE

print("Disease Pre-NodeNorming\n")

## no disease IDs
print(f'{stats_no_disease_IDs["n_rows"]} row(s) with no disease IDs '
      f'(= {stats_no_disease_IDs["n_names"]} unique diseases)')

## plus the rows that only lack OMIM IDs: key column for NodeNorming
print(f'+ {stats_mondo_only["n_rows"]} row(s) with no OMIM ID '
      f'(= {stats_mondo_only["n_names"]} unique diseases)')

## OMIM NodeNorm issues
print("\n")
print("OMIM NodeNorm mapping failures:")

print(f'{stats_OmOr_mapping_failures["n_rows_none"]} row(s) for '
      f'{len(stats_OmOr_mapping_failures["nodenorm_returned_none"])} '
      f'IDs with no data in NodeNorm')

print(f'{stats_OmOr_mapping_failures["n_rows_wrong_category"]} row(s) for '
      f'{len(stats_OmOr_mapping_failures["wrong_category"])} '
      f'IDs with the wrong NodeNormed category')

print(f'{stats_OmOr_mapping_failures["n_rows_no_label"]} row(s) for '
      f'{len(stats_OmOr_mapping_failures["no_label"])} '
      f'IDs with no label in NodeNorm')

Disease Pre-NodeNorming

642 row(s) with no disease IDs (= 629 unique diseases)
+ 445 row(s) with no OMIM ID (= 278 unique diseases)


OMIM NodeNorm mapping failures:
41 row(s) for 39 IDs with no data in NodeNorm
27 row(s) for 26 IDs with the wrong NodeNormed category
3 row(s) for 3 IDs with no label in NodeNorm


<div class="alert alert-block alert-success">
    
__Totals__

In [71]:
## put into parser (format): DONE

n_rows_before_nodenorm = df.shape[0]
n_rows_nodenorm_removed = (
    stats_no_disease_IDs["n_rows"]
    + stats_mondo_only["n_rows"]
    + stats_OmOr_mapping_failures["n_rows_none"]
    + stats_OmOr_mapping_failures["n_rows_wrong_category"]
    + stats_OmOr_mapping_failures["n_rows_no_label"]
)
n_rows_after_nodenorm = n_rows_before_nodenorm - n_rows_nodenorm_removed

print(f"{n_rows_before_nodenorm} rows/records before Pre-NodeNorming\n")

print(f"{n_rows_nodenorm_removed} rows removed during Disease NodeNorming process\n")

print(f"{n_rows_after_nodenorm} rows/records left ({n_rows_after_nodenorm/n_rows_before_nodenorm:.1%})")

3658 rows/records before Pre-NodeNorming

1158 rows removed during Disease NodeNorming process

2500 rows/records left (68.3%)


## Adding NodeNorm data, removing rows

Using gene HGNC and disease OMIM/orphanet IDs for pre-NodeNorming

In [67]:
## put into parser (format): DONE

## Gene: assumes no missing values
df["gene_nodenorm_id"] = [hgnc_nodenorm_mapping[i]["primary_id"] for i in df["hgnc_id"]]
df["gene_nodenorm_label"] = [hgnc_nodenorm_mapping[i]["primary_label"] for i in df["hgnc_id"]]

df["disease_nodenorm_id"] = [OmOr_nodenorm_mapping[i]["primary_id"] 
                             if OmOr_nodenorm_mapping.get(i) 
                             else pd.NA
                             for i in df["disease_mim"]]

df["disease_nodenorm_label"] = [OmOr_nodenorm_mapping[i]["primary_label"] 
                                if OmOr_nodenorm_mapping.get(i) 
                                else pd.NA
                                for i in df["disease_mim"]]

In [68]:
df.head()

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url,gene_nodenorm_id,gene_nodenorm_label,disease_nodenorm_id,disease_nodenorm_label
0,G2P00001,HMX1,OMIM:142992,HGNC:5017,H6; NKX5-3,HMX1-related oculoauricular syndrome,OMIM:612109,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000568; HP:0000589; HP:0000007; HP:0000482...,18423520; 25574057; 29140751,Developmental disorders; Eye disorders,,2019-09-26 16:23:46+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00001,NCBIGene:3166,HMX1,MONDO:0012802,oculoauricular syndrome
1,G2P00002,SLX4,OMIM:613278,HGNC:23845,BTBD12; FANCP; KIAA1784; KIAA1987,SLX4-related Fanconi anemia,OMIM:613951,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000347; HP:0000007; HP:0001903; HP:0002984...,21240275; 21240277,Developmental disorders,,2025-01-28 23:09:54+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00002,NCBIGene:84464,SLX4,MONDO:0013499,Fanconi anemia complementation group P
2,G2P00003,ARG1,OMIM:608313,HGNC:663,,ARG1-related argininemia,OMIM:207800,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0000752; HP:0000737; HP:0000007; HP:0008339...,10502833; 1598908; 7649538; 1463019; 2365823,Developmental disorders,,2015-07-22 16:14:07+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00003,NCBIGene:383,ARG1,MONDO:0008814,Argininemia
3,G2P00004,ATR,OMIM:601215,HGNC:882,FRP1; MEC1; SCKL; SCKL1,ATR-related Seckel syndrome,OMIM:210600,,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,HP:0000347; HP:0010230; HP:0001249; HP:0002750...,,Developmental disorders; Skeletal disorders,,2025-01-27 14:24:27+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00004,NCBIGene:545,ATR,MONDO:0008869,Seckel syndrome 1
4,G2P00005,FANCB,OMIM:300515,HGNC:3583,FAAP95; FAB; FLJ34064,FANCB-related Fanconi anemia,OMIM:300514,,monoallelic_X_hemizygous,,definitive,absent gene product,,loss of function,inferred,,HP:0000924; HP:0001871; HP:0001701; HP:0000083...,16679491; 21910217; 36135330; 32106311; 307922...,Developmental disorders; Skin disorders,,2024-08-20 14:13:58+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00005,NCBIGene:2187,FANCB,MONDO:0010351,Fanconi anemia complementation group B


In [69]:
## put into parser (change in-place): DONE

df_only_nodenormed = df.dropna(subset=["gene_nodenorm_id", "gene_nodenorm_label", 
                                       "disease_nodenorm_id", "disease_nodenorm_label"],
                              ignore_index=True).copy()

In [72]:
## same! so it works as expected

df_only_nodenormed.shape

n_rows_after_nodenorm

(2500, 26)

2500

In [73]:
df_only_nodenormed.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 26 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   g2p_id                              2500 non-null   object             
 1   gene_symbol                         2500 non-null   object             
 2   gene_mim                            2500 non-null   object             
 3   hgnc_id                             2500 non-null   object             
 4   previous_gene_symbols               2247 non-null   object             
 5   disease_name                        2500 non-null   object             
 6   disease_mim                         2500 non-null   object             
 7   disease_MONDO                       117 non-null    object             
 8   allelic_requirement                 2500 non-null   object             
 9   cross_cutting_modifier              283 n

## Generating documents

<div class="alert alert-block alert-danger">

Currently generating as list of TRAPI edges WITH special field `_id` for BioThings. 

### Rows not included

<div class="alert alert-block alert-info">

See section "Stats on rows removed during NodeNorming"
* No IDs in disease_mim column (can't NodeNorm)
* NodeNorm mapping failures for disease_mim column IDs 
* confidence column value == "refuted" or "disputed"

### Columns not included

<div class="alert alert-block alert-info">

See data-playground for details

<br>

Seem **easier** to get into Translator, potentially useful: 
- **molecular_mechanism**: could affect subject_form_or_variant_qualifier's value? Can just keep genetic_variant_form for "undetermined". But kinda tricky to pick biolink terms for some rare values (dominant negative, undetermined non-loss-of-function) - and would be loss of info to drop them. 
- **confidence**: 
   - there's biolink association-slot *has confidence level*. But there's also a biolink entity *confidence level* that's supposed to have values from CIO. 
   - Are G2P's terms okay? Or are they supposed to be mapped to ontology terms like CIO/SEPIO? (That may be a loss of info compared to G2P's term definitions.)
- **allelic_requirement**: I thought there was a biolink-term to put this on an edge, but I can't find it now. The values can be converted into HPO "mode of inheritance terms" if necessary (see data-playground notebook for mapping table)

<br>

Seem harder to get into Translator, potentially useful: 
- **molecular_mechanism_categorisation**: "qualifies" the molecular_mechanism (seems to say how molecular mechanism was decided: "inferred" or "evidence") 
- **cross_cutting_modifier**: additional info on inheritance. Limited set of terms BUT "; "- delimited. Some terms may map to "HPO inheritance qualifier terms" (didn't try). Lots of missing data (NA)
- **variant_consequence**: row can have multiple values ("; "- delimited). Limited set of terms already mapped to SO.
- **variant_types**: row can have multiple values ("; "- delimited). Medium set of terms already mapped to SO. Lots of missing data (NA)
- **molecular_mechanism_evidence**: treat as free text? very complicated string 
- **comments**: treat as free text
    
<br>

Can ignore: 
- g2p_id: helpful in looking up data in original resource. But kinda accounted for with g2p_record_url?
- gene_mim
- disease_MONDO
- gene_symbol
- previous_gene_symbols
- disease_name
- phenotypes: "reported by the publication". Unclear whether they fit in the gene-disease association or in a different edge (gene-phenotype, phenotype-disease)
- panel: pretty specific, original resource's way of organizing data
- gene_nodenorm_label: can get elsewhere?
- disease_nodenorm_label: can get elsewhere?
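
Several of the columns listed above hold "; "-delimited values and have many NAs. If they get added later, each cell would need to become a list first. A minimal sketch: `split_multivalued` is a hypothetical helper (not in the current parser), and treating every non-string value as missing is an assumption.

```python
def split_multivalued(value, delimiter=";"):
    """Split a delimited cell into a list of stripped terms.

    Assumption: anything that isn't a string (None, NaN, pd.NA) is missing.
    """
    if not isinstance(value, str):
        return []
    return [term.strip() for term in value.split(delimiter) if term.strip()]


# example string uses variant_types terms displayed elsewhere in this notebook
print(split_multivalued("inframe_insertion; inframe_deletion"))
print(split_multivalued(float("nan")))
```

Returning an empty list for missing values keeps the "if" handling out of the document-building loop, but the parser could just as well skip the attribute entirely.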

In [74]:
## code chunk to review data

df["molecular_mechanism"].value_counts()

molecular_mechanism
loss of function                     2167
undetermined                         1172
gain of function                      185
dominant negative                     131
undetermined non-loss-of-function       3
Name: count, dtype: int64

In [75]:
## code chunk to review data
## checking date of last review

df[df["g2p_id"] == "G2P03538"]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url,gene_nodenorm_id,gene_nodenorm_label,disease_nodenorm_id,disease_nodenorm_label
3174,G2P03538,NPAT,OMIM:601448,HGNC:7896,E14; P220,NPAT-related cancer,,,monoallelic_autosomal,,moderate,decreased gene product level,stop_gained,loss of function,inferred,,,38778081,Cancer disorders,,2025-03-14 12:04:00+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03538,NCBIGene:4863,NPAT,,


In [76]:
## code chunk to review data

df["molecular_mechanism_evidence"].value_counts()[0:5]

## df.info()

molecular_mechanism_evidence
34965576 -> models: non-human model organism                                                                                                   2
37126546 -> models: non-human model organism; rescue: non-human model organism                                                                 2
39480921 -> function: protein expression; models: non-human model organism                                                                     2
29992740 -> functional_alteration: non patient cells                                                                                           1
37951597 -> function: biochemical; functional_alteration: patient cells; models: non-human model organism; rescue: non-human model organism    1
Name: count, dtype: int64
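
If molecular_mechanism_evidence isn't treated as free text, the simplest strings shown above could be split like this. This is only a sketch: `parse_evidence` is a hypothetical helper, it assumes the leading number is a PMID, and it assumes a single `PMID -> category: value; category: value` entry per string. Rows citing several publications may use a layout this doesn't handle.

```python
def parse_evidence(entry):
    """Parse one 'PMID -> category: value; category: value' string.

    Assumption: exactly one publication per string.
    """
    pmid, _, details = entry.partition("->")
    evidence = {}
    for item in details.split(";"):
        category, _, value = item.partition(":")
        if category.strip():
            # a category could in principle repeat, so collect values in a list
            evidence.setdefault(category.strip(), []).append(value.strip())
    return {"publication": "PMID:" + pmid.strip(), "evidence": evidence}


# example string copied from the value_counts output above
example = "37126546 -> models: non-human model organism; rescue: non-human model organism"
parsed = parse_evidence(example)
print(parsed)
```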

### Generating now!

In [77]:
## code chunk for testing parts of inner code

for row in df.itertuples(index=False):
    document = {
        "_id": row.g2p_id,
        "subject": row.gene_nodenorm_id,
        "sources": [
            {
                "resource_id": "infores:ebi-gene2phenotype",
                "resource_role": "primary_knowledge_source",
                "source_record_urls": [row.g2p_record_url]
            }
        ],
        "attributes": [
            {
                "attribute_type_id": "biolink:original_subject",
                "value": row.hgnc_id
            }
        ]
    }
    if pd.notna(row.publications):
        document["attributes"].append(
            {
                "attribute_type_id": "biolink:publications",
                "value": ["PMID:" + i.strip() for i in row.publications.split(";")]
            }
        )
    document
    break

{'_id': 'G2P00001',
 'subject': 'NCBIGene:3166',
 'sources': [{'resource_id': 'infores:ebi-gene2phenotype',
   'resource_role': 'primary_knowledge_source',
   'source_record_urls': ['https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00001']}],
 'attributes': [{'attribute_type_id': 'biolink:original_subject',
   'value': 'HGNC:5017'},
  {'attribute_type_id': 'biolink:publications',
   'value': ['PMID:18423520', 'PMID:25574057', 'PMID:29140751']}]}

In [78]:
## put into parser (format): DONE
##   don't save in array, yield each document instead

## GENERATING DOCS, saving in array
documents = []

## using itertuples because it's faster, preserves datatypes
for row in df_only_nodenormed.itertuples(index=False):
    ## simple assignments: no NA or "if"
    document = {
        "_id": row.g2p_id,
        "subject": row.gene_nodenorm_id,
        "qualifiers": [  ## needs data-modeling/TRAPI validation review
            {
                "qualifier_type_id": "biolink:subject_form_or_variant_qualifier",
                "qualifier_value": "genetic_variant_form"
            }
        ],
        "object": row.disease_nodenorm_id,
        "sources": [
            {
                "resource_id": "infores:ebi-gene2phenotype",
                "resource_role": "primary_knowledge_source",
                "source_record_urls": [row.g2p_record_url]
            }
        ],
        "attributes": [
            {
                "attribute_type_id": "biolink:knowledge_level",
                "value": "knowledge_assertion"
            },
            {
                "attribute_type_id": "biolink:agent_type",
                "value": "manual_agent"
            },
            {
                "attribute_type_id": "biolink:original_subject",
                "original_attribute_name": "hgnc id",  ## original column name
                "value": row.hgnc_id
            },
            {   ## currently, after NodeNorming, no NAs in OMIM/orphanet column
                "attribute_type_id": "biolink:original_object",
                "original_attribute_name": "disease mim",  ## original column name
                "value": row.disease_mim
            },
            {   ## needs data-modeling/TRAPI validation review
                ## EBI gene2pheno website calls this "Last Updated"/"Last Updated On"
                "attribute_type_id": "biolink:update_date",
                "original_attribute_name": "date of last review",  ## original column name
                "value": str(row.date_of_last_review)
            },
        ]
    }
    
    ## more complex assignments ("if", handling NA). When value is NA, list comprehension with split won't work
    ## predicate
    if row.confidence == "limited":
        document["predicate"] = "biolink:related_to"
    elif row.confidence in ["moderate", "strong", "definitive"]:
        document["predicate"] = "biolink:causes"
    else:
        raise ValueError(f"Unexpected confidence value during predicate mapping: {row.confidence}. Adjust parser.")
    ## publications
    if pd.notna(row.publications):
        document["attributes"].append(
            {
                "attribute_type_id": "biolink:publications",
                "value": ["PMID:" + i.strip() for i in row.publications.split(";")]
            }
        )
    
    documents.append(document)

## Checking documents

In [79]:
len(documents)

2500

In [82]:
df_only_nodenormed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 26 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   g2p_id                              2500 non-null   object             
 1   gene_symbol                         2500 non-null   object             
 2   gene_mim                            2500 non-null   object             
 3   hgnc_id                             2500 non-null   object             
 4   previous_gene_symbols               2247 non-null   object             
 5   disease_name                        2500 non-null   object             
 6   disease_mim                         2500 non-null   object             
 7   disease_MONDO                       117 non-null    object             
 8   allelic_requirement                 2500 non-null   object             
 9   cross_cutting_modifier              283 n

In [90]:
## code chunk for finding rows -> then look up corresponding doc by idx
# df_only_nodenormed[df_only_nodenormed["disease_mim"].str.contains("orphanet", na=False)]
df_only_nodenormed[df_only_nodenormed["confidence"] == "limited"]
# df_only_nodenormed[df_only_nodenormed["publications"].isna()]
# df_only_nodenormed[~df_only_nodenormed["publications"].str.contains(";", na=True)]



# df_only_nodenormed[df_only_nodenormed["previous_gene_symbols"].isna()]
# df_only_nodenormed[df_only_nodenormed["disease_MONDO"].notna()]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url,gene_nodenorm_id,gene_nodenorm_label,disease_nodenorm_id,disease_nodenorm_label
8,G2P00010,TRIM32,OMIM:602290,HGNC:16380,BBS11; HT2A; LGMD2H; TATIP,TRIM32-related Bardet-Biedl syndrome,OMIM:615988,,biallelic_autosomal,,limited,uncertain,,undetermined,inferred,,HP:0000148; HP:0002251; HP:0002167; HP:0000518...,16606853,Developmental disorders; Eye disorders,,2025-01-21 22:12:35+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00010,NCBIGene:22954,TRIM32,MONDO:0014439,Bardet-Biedl syndrome 11
26,G2P00030,ERLIN2,OMIM:611605,HGNC:1356,C8ORF2; ERLIN-2; NET32; SPFH2; SPG18,ERLIN2-related intellectual developmental diso...,OMIM:611225,,biallelic_autosomal,,limited,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,undetermined,inferred,,,21937992,Developmental disorders,,2025-01-16 13:37:48+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00030,NCBIGene:11160,ERLIN2,MONDO:0012639,hereditary spastic paraplegia 18
34,G2P00040,GJA1,OMIM:121014,HGNC:4274,CX43; GJAL; ODD; ODDD; ODOD; SDTY3,GJA1-related Hallermann-Streiff syndrome,OMIM:234100,,biallelic_autosomal,,limited,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,undetermined,inferred,,HP:0000347; HP:0000588; HP:0003307; HP:0000233...,14981729; 14974090,Developmental disorders; Eye disorders; Skin d...,,2023-05-24 09:07:26+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00040,NCBIGene:2697,GJA1,MONDO:0009318,Hallermann-Streiff syndrome
43,G2P00051,SUMO1,OMIM:601912,HGNC:12502,GMP1; OFC10; PIC1; SMT3C; SMT3H3; SUMO-1; UBL1,SUMO1-related cleft lip with or without cleft ...,OMIM:608874,,monoallelic_autosomal,,limited,absent gene product,,loss of function,inferred,,HP:0100333; HP:0100334,16990542,Developmental disorders,,2025-01-27 16:20:50+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00051,NCBIGene:7341,SUMO1,MONDO:0012142,orofacial cleft 5
69,G2P00081,STIM1,OMIM:605921,HGNC:11386,D11S4896E; GOK,STIM1-related tubular-aggregate myopathy,OMIM:160565,,monoallelic_autosomal,restricted mutation set,limited,altered gene product structure,,gain of function,inferred,,HP:0003388; HP:0003552; HP:0003581; HP:0002522...,23332920,Developmental disorders,,2015-07-22 16:14:14+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00081,NCBIGene:6786,STIM1,MONDO:0024531,"myopathy, tubular aggregate, 1"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2479,G2P02409,SEMA4A,OMIM:607292,HGNC:10729,CORD10; FLJ12287; SEMAB; SEMB,SEMA4A-related retinitis pigmentosa,OMIM:610282,,biallelic_autosomal,,limited,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,undetermined,inferred,,HP:0000556,16199541; 28805479,Eye disorders,,2019-10-30 15:30:20+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02409,NCBIGene:64218,SEMA4A,MONDO:0012463,retinitis pigmentosa 35
2482,G2P02425,MARK3,OMIM:602678,HGNC:6897,CTAK1; KP78; PAR-1A,MARK3-related visual impairment and progressiv...,OMIM:618283,,biallelic_autosomal,,limited,altered gene product structure,inframe_insertion; inframe_deletion; missense_...,undetermined,inferred,,HP:0000568; HP:0000007; HP:0007663; HP:0000540...,29771303,Eye disorders,,2018-05-25 09:40:38+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02425,NCBIGene:4140,MARK3,MONDO:0032655,visual impairment and progressive phthisis bulbi
2484,G2P02427,AHR,OMIM:600253,HGNC:348,BHLHE76,AHR-related retinitis pigmentosa,OMIM:618345,,biallelic_autosomal,,limited,absent gene product,,loss of function,inferred,,HP:0000556; HP:0000007; HP:0000512; HP:0000510,29726989,Eye disorders,,2025-01-14 14:22:05+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02427,NCBIGene:196,AHR,MONDO:0032689,retinitis pigmentosa 85
2485,G2P02430,ESPN,OMIM:606351,HGNC:13281,DFNB36,ESPN-related Usher syndrome,OMIM:614504,,biallelic_autosomal,,limited,uncertain,,undetermined,inferred,,HP:0000662; HP:0000510; HP:0000529; HP:0000543...,29572253,Eye disorders,,2018-05-25 16:39:18+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02430,NCBIGene:83715,ESPN,MONDO:0013788,Usher syndrome type 3B


In [91]:
pprint(documents[34])

# documents[416]

{'_id': 'G2P00040',
 'attributes': [{'attribute_type_id': 'biolink:knowledge_level',
                 'value': 'knowledge_assertion'},
                {'attribute_type_id': 'biolink:agent_type',
                 'value': 'manual_agent'},
                {'attribute_type_id': 'biolink:original_subject',
                 'original_attribute_name': 'hgnc id',
                 'value': 'HGNC:4274'},
                {'attribute_type_id': 'biolink:original_object',
                 'original_attribute_name': 'disease mim',
                 'value': 'OMIM:234100'},
                {'attribute_type_id': 'biolink:update_date',
                 'original_attribute_name': 'date of last review',
                 'value': '2023-05-24 09:07:26+00:00'},
                {'attribute_type_id': 'biolink:publications',
                 'value': ['PMID:14981729', 'PMID:14974090']}],
 'object': 'MONDO:0009318',
 'predicate': 'biolink:related_to',
 'qualifiers': [{'qualifier_type_id': 'biolink:subject_form_o

## BioThings Parser notes

Fine to use raise/assert in parser (raise is technically better programming behavior: https://realpython.com/python-assert-statement/#understanding-common-pitfalls-of-assert)
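
One practical reason to prefer `raise`: `assert` statements are skipped when Python runs with `-O`, while an explicit `raise` always executes. A minimal sketch that repackages the predicate logic from the document-generation cell as a function; `confidence_to_predicate` is a hypothetical name, not necessarily how the parser is organized.

```python
def confidence_to_predicate(confidence):
    """Map a G2P confidence value to a predicate, failing loudly otherwise."""
    if confidence == "limited":
        return "biolink:related_to"
    if confidence in ("moderate", "strong", "definitive"):
        return "biolink:causes"
    # raise (not assert) so the check still runs under `python -O`
    raise ValueError(
        f"Unexpected confidence value during predicate mapping: {confidence}. Adjust parser."
    )


print(confidence_to_predicate("strong"))

try:
    confidence_to_predicate("refuted")
except ValueError as error:
    print(error)
```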


My notes on parser:
* adding prefixes to gene/disease IDs is good for pre-NodeNorming steps
* keeping different gene/disease ID namespaces as separate fields right now is good for the current BTE/x-bte-annotation system
  * Also, with the current code, the original subject will always be HGNC and the original object will always be disease OMIM


My notes on syntax:
* use `yield` when you want to "return" within a "for loop" (`return` only happens once, then the for-loop/function execution exits)
  * that's how it's used in the main execution, when you're iterating over csv rows to generate documents
* use `yield from {function}` to get the data from a generator (created by `yield` being used in the function)
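
A small sketch of both keywords together. The function names and the in-memory CSV are made up for illustration (the real parser reads the downloaded file and builds full documents); the IDs are copied from the first rows shown earlier.

```python
import csv
import io


def rows_to_documents(rows):
    """Yield one document per row instead of collecting them in a list."""
    for row in rows:
        yield {"_id": row["g2p_id"], "subject": row["hgnc_id"]}


def load_data(csv_text):
    """Pass along everything the inner generator produces."""
    reader = csv.DictReader(io.StringIO(csv_text))
    yield from rows_to_documents(reader)


sample = "g2p_id,hgnc_id\nG2P00001,HGNC:5017\nG2P00002,HGNC:23845\n"
docs = list(load_data(sample))
print(docs)
```

Nothing runs until the generator is consumed (here by `list`), so documents are produced one at a time rather than held in memory all at once.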