<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-data" data-toc-modified-id="Loading-data-1">Loading data</a></span></li><li><span><a href="#Check,-remove-duplicates" data-toc-modified-id="Check,-remove-duplicates-2">Check, remove duplicates</a></span></li><li><span><a href="#Column-level-transforms" data-toc-modified-id="Column-level-transforms-3">Column-level transforms</a></span></li><li><span><a href="#Confidence-values" data-toc-modified-id="Confidence-values-4">Confidence values</a></span><ul class="toc-item"><li><span><a href="#Removing-rows-+-stats" data-toc-modified-id="Removing-rows-+-stats-4.1">Removing rows + stats</a></span></li></ul></li><li><span><a href="#Pre-NodeNorming" data-toc-modified-id="Pre-NodeNorming-5">Pre-NodeNorming</a></span><ul class="toc-item"><li><span><a href="#Exploring:-Genes" data-toc-modified-id="Exploring:-Genes-5.1">Exploring: Genes</a></span><ul class="toc-item"><li><span><a href="#HGNC" data-toc-modified-id="HGNC-5.1.1">HGNC</a></span></li><li><span><a href="#OMIM" data-toc-modified-id="OMIM-5.1.2">OMIM</a></span></li><li><span><a href="#Comparing-HGNC-vs-OMIM" data-toc-modified-id="Comparing-HGNC-vs-OMIM-5.1.3">Comparing HGNC vs OMIM</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-5.1.4">Conclusions</a></span></li></ul></li><li><span><a href="#Exploring:-Diseases" data-toc-modified-id="Exploring:-Diseases-5.2">Exploring: Diseases</a></span><ul class="toc-item"><li><span><a href="#OMIM/orphanet" data-toc-modified-id="OMIM/orphanet-5.2.1">OMIM/orphanet</a></span></li><li><span><a href="#MONDO" data-toc-modified-id="MONDO-5.2.2">MONDO</a></span></li><li><span><a href="#Comparing-OMIM/orphanet-vs-MONDO" data-toc-modified-id="Comparing-OMIM/orphanet-vs-MONDO-5.2.3">Comparing OMIM/orphanet vs MONDO</a></span></li><li><span><a href="#Checking-MONDO-data" data-toc-modified-id="Checking-MONDO-data-5.2.4">Checking MONDO data</a></span></li><li><span><a href="#Conclusions" 
data-toc-modified-id="Conclusions-5.2.5">Conclusions</a></span></li></ul></li></ul></li><li><span><a href="#Stats-on-rows-removed-during-NodeNorming" data-toc-modified-id="Stats-on-rows-removed-during-NodeNorming-6">Stats on rows removed during NodeNorming</a></span></li><li><span><a href="#Adding-NodeNorm-data,-removing-rows" data-toc-modified-id="Adding-NodeNorm-data,-removing-rows-7">Adding NodeNorm data, removing rows</a></span></li><li><span><a href="#Generating-documents" data-toc-modified-id="Generating-documents-8">Generating documents</a></span><ul class="toc-item"><li><span><a href="#Rows-not-included" data-toc-modified-id="Rows-not-included-8.1">Rows not included</a></span></li><li><span><a href="#Columns-not-included" data-toc-modified-id="Columns-not-included-8.2">Columns not included</a></span></li><li><span><a href="#BioThings-type-parser" data-toc-modified-id="BioThings-type-parser-8.3">BioThings-type parser</a></span></li><li><span><a href="#File:-List-of-TRAPI-edges" data-toc-modified-id="File:-List-of-TRAPI-edges-8.4">File: List of TRAPI edges</a></span></li><li><span><a href="#File:-KGX-edges" data-toc-modified-id="File:-KGX-edges-8.5">File: KGX edges</a></span></li><li><span><a href="#File:-KGX-nodes" data-toc-modified-id="File:-KGX-nodes-8.6">File: KGX nodes</a></span></li></ul></li><li><span><a href="#Checking-documents" data-toc-modified-id="Checking-documents-9">Checking documents</a></span></li><li><span><a href="#BioThings-Parser-notes" data-toc-modified-id="BioThings-Parser-notes-10">BioThings Parser notes</a></span></li></ul></div>

# Notebook for EBI gene2pheno parser development

In [1]:
## not for parser. for notebook only 

## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Loading data

__Current approach__: load all files into one pandas DataFrame. Then I can:

1. Check for duplicates: records found in multiple panel files. Comparing the duplicates found using all columns vs. only the key columns shows whether the same record looks different between files. If it does, raise an error.
2. Remove duplicates before generating documents.
3. Do some tasks column-wise over all the data, rather than while iterating over rows.

Notes:
* There are a few existing BioThings parsers that also use `pandas` to load the entire raw data file at once: https://github.com/search?q=repo%3Abiothings%2Fpending.api%20pandas&type=code
* But there are other parsers that use `csv` to load the file **one row at a time** (generator): https://github.com/search?q=repo%3Abiothings%2Fpending.api+csv+reader&type=code

---

If I did the generator approach (load files 1 by 1, 1 row at a time), I'd have to modify how I do things:
1. Don't do the duplicates check. But try to mitigate potential "duplicate" issues: 
   - Sort all delimited strings
   - Use a hash of all column values (when they're all strings) for `_id`. Want rows with all the same values to produce the same hash
2. Either leave to BioThings toolset to remove duplicates, or could save a running set of `_id` hashes to check if row was already encountered -> not create duplicate docs
3. Do the tasks on single rows/chunks (pandas [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) has an iterator for rows/chunks! see iterator/chunksize parameters)
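
A minimal sketch of the generator approach described above, using only the standard library. It assumes gzipped CSV input and `"; "` as the delimiter inside multi-valued columns. The function names (`read_rows`, `normalize_row`, `row_hash`, `unique_docs`), the `list_columns` parameter, and the choice of SHA-256 are all illustrative assumptions, not part of the parser or the BioThings toolset.

```python
import csv
import gzip
import hashlib


def read_rows(paths):
    """Yield raw rows as dicts: files 1 at a time, 1 row at a time."""
    for path in paths:
        with gzip.open(path, mode="rt", newline="", encoding="utf-8") as f:
            yield from csv.DictReader(f)


def normalize_row(row, list_columns, delimiter="; "):
    """Sort the delimited strings so list order can't change the hash."""
    normalized = dict(row)
    for col in list_columns:
        value = normalized.get(col)
        if value:  ## skip missing/empty values
            normalized[col] = delimiter.join(sorted(value.split(delimiter)))
    return normalized


def row_hash(row):
    """Hash all column names + values: identical rows give identical hashes."""
    joined = "\x1f".join(f"{col}={row[col]}" for col in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def unique_docs(rows, list_columns):
    """Yield one doc per unique row, skipping rows already encountered."""
    seen = set()  ## running set of _id hashes
    for row in rows:
        row = normalize_row(row, list_columns)
        _id = row_hash(row)
        if _id in seen:
            continue  ## duplicate -> don't create another doc
        seen.add(_id)
        yield {"_id": _id, **row}
```

Usage would look like `unique_docs(read_rows(all_file_paths), list_columns=[...])`, where the list of delimited columns has to be chosen by hand: free-text columns such as the comments can also contain `"; "` and should not be sorted.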

In [2]:
## put into parser: DONE
import pathlib
import pandas as pd

## don't put in parser. Just for this notebook
import glob
from pprint import pprint

## unsure on putting into parser: more for notebook viewing/debugging...
pd.options.display.max_columns = None

<div class="alert alert-block alert-danger">

Adjust the path and glob pattern in the code block below to match the location of your data files.

This notebook was originally written using data files from the 2025-02-28 static release on the [FTP site](https://ftp.ebi.ac.uk/pub/databases/gene2phenotype/G2P_data_downloads/).
The latest data can be downloaded from the [website](https://www.ebi.ac.uk/gene2phenotype/download).

</div>

In [3]:
## put into parser (format): DONE

base_file_path = pathlib.Path.home().joinpath("Desktop", "EBIgene2pheno_files", 
                                              "From_FTP", "2025-06-27")

## uses pathlib's Path.glob, which produces a generator. 
## cast into list so parser code can check if paths were actually matched or not
all_file_paths = list(base_file_path.glob("*.csv.gz"))
all_file_paths

[PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-06-27/CardiacG2P_2025-06-27.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-06-27/SkeletalG2P_2025-06-27.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-06-27/DDG2P_2025-06-27.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-06-27/SkinG2P_2025-06-27.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-06-27/CancerG2P_2025-06-27.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-06-27/EyeG2P_2025-06-27.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/2025-06-27/EarG2P_2025-06-27.csv.gz')]

In [4]:
## an example: pathlib's Path.glob returns a generator,
## while glob.glob returns a list (a relative pattern is matched against the current working directory)
base_file_path.glob("*2025-02-28.csv.gz")
glob.glob("*2025-02-28.csv.gz")

<generator object Path.glob at 0x109293010>

[]

In [5]:
## put into parser (format): DONE

## using generator expression (think list/dict comprehension) within pd.concat to load files 1 at a time
## ingesting all columns as str for now
df = pd.concat((pd.read_csv(f, dtype=str) for f in all_file_paths), ignore_index=True)

## make column names snake-case - usable with itertuples later
df.columns = df.columns.str.replace(" ", "_")

In [6]:
df["date_of_last_review"].info(memory_usage="deep")

<class 'pandas.core.series.Series'>
RangeIndex: 4844 entries, 0 to 4843
Series name: date_of_last_review
Non-Null Count  Dtype 
--------------  ----- 
4844 non-null   object
dtypes: object(1)
memory usage: 350.2 KB


In [7]:
## change this column to datetime, saves memory
df["date_of_last_review"] = pd.to_datetime(df["date_of_last_review"])
df["date_of_last_review"].info(memory_usage="deep")

<class 'pandas.core.series.Series'>
RangeIndex: 4844 entries, 0 to 4843
Series name: date_of_last_review
Non-Null Count  Dtype              
--------------  -----              
4844 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 38.0 KB


In [8]:
## I couldn't figure out how to import + ingest column as datetime in 1 step 
## this is what I tried that didn't work

## worked with pandas 2.0.3, but didn't work with pandas 2.2.3: ingested as str
# df = pd.concat((pd.read_csv(f, dtype=str, parse_dates=["date of last review"]) 
#                 for f in all_file_paths), ignore_index=True)

## doesn't work
# df = pd.concat((pd.read_csv(f, dtype=str, parse_dates=["date of last review"], 
#                            date_format="%Y-%m-%d %H:%M:%S%:z") 
#                 for f in all_file_paths), ignore_index=True)
## throws an error
# df = pd.concat((pd.read_csv(f, dtype=str, parse_dates=[["date of last review"]], 
#                            date_format="%Y-%m-%d %H:%M:%S%:z") 
#                 for f in all_file_paths), ignore_index=True)
## throws an error
# df = pd.concat((pd.read_csv(f, dtype={"date of last review": pd.datetime64[ns, tz]})
#                 for f in all_file_paths), ignore_index=True)

In [9]:
df.shape
df.head()

(4844, 21)

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review
0,G2P00124,KCNE1,176261,6240,ISK; JLNS2; LQT5; MINK,KCNE1-related Jervell and Lange-Nielsen syndrome,612347.0,MONDO:0012871,biallelic_autosomal,potential secondary finding,strong,altered gene product structure,missense_variant; inframe_insertion; inframe_d...,undetermined,inferred,,HP:0001657; HP:0001279; HP:0000007; HP:0000407,30461122,DD; Cardiac,KCNE1-related JLNS is due to altered gene prod...,2024-04-05 12:05:01+00:00
1,G2P00841,PTPN11,176876,9644,BPTP3; NS1; PTP2C; SH-PTP2; SHP-2; SHP2,PTPN11-related Noonan syndrome with multiple l...,151100.0,,monoallelic_autosomal,,definitive,altered gene product structure,missense_variant; inframe_deletion; inframe_in...,undetermined,inferred,,HP:0001709; HP:0000957; HP:0004409; HP:0001639...,27484170; 26377839; 25917897; 25884655; 248207...,DD; Skin; Cardiac,Expert review done on 12/01/2022; Noonan syndr...,2025-01-21 14:56:43+00:00
2,G2P03247,DSC2,125645,3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; NMD_triggerin...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:36:09+00:00
3,G2P03248,DSC2,125645,3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; NMD_triggerin...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:35:19+00:00
4,G2P03249,DSG2,125671,3049,CDHF5,DSG2-related arrhythmogenic right ventricular ...,,MONDO:0012434,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; splice_accept...,undetermined,inferred,,,21636032; 33831308; 33917638; 34400560; 240707...,Cardiac,Expert review done on 05/01/2022; DSG2-related...,2024-03-20 09:40:18+00:00


In [10]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4844 entries, 0 to 4843
Data columns (total 21 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   g2p_id                              4844 non-null   object             
 1   gene_symbol                         4844 non-null   object             
 2   gene_mim                            4841 non-null   object             
 3   hgnc_id                             4844 non-null   object             
 4   previous_gene_symbols               4476 non-null   object             
 5   disease_name                        4844 non-null   object             
 6   disease_mim                         4036 non-null   object             
 7   disease_MONDO                       3028 non-null   object             
 8   allelic_requirement                 4844 non-null   object             
 9   cross_cutting_modifier              649 n

## Check, remove duplicates

There are duplicate rows in this dataframe because the same record (gene + disease + more) appears in several panel files when its disease falls into multiple categories. This was explored in the data-playground notebook.

We want to drop those duplicates.
However, I was concerned that the delimited-string values could differ (only in list order) for the same record in different files, in which case the rows would not be detected as exact duplicates.
That's what this check is for: it compares the duplicates found by key column (`g2p_id`) against the duplicates found using all columns.
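
To make the logic of this check concrete, here is a toy example. The rows and IDs are invented for illustration (not real G2P records), and sorting the delimited strings at the end is just one possible way to handle a mismatch.

```python
import pandas as pd

## invented rows: G2P2 appears twice, differing only in list order
toy = pd.DataFrame({
    "g2p_id": ["G2P1", "G2P1", "G2P2", "G2P2", "G2P3"],
    "panel": ["DD; Skin", "DD; Skin", "Eye; Skin", "Skin; Eye", "Eye"],
})

## rows whose key appears more than once
n_dup_key = toy.duplicated(subset=["g2p_id"], keep=False).sum()
## rows that are identical to another row across ALL columns
n_dup_all = toy.duplicated(keep=False).sum()

## the counts disagree (4 vs 2): this is the situation the check guards against
print(n_dup_key, n_dup_all)

## sorting each delimited string makes the G2P2 rows identical again
toy["panel"] = toy["panel"].str.split("; ").map(lambda parts: "; ".join(sorted(parts)))
n_dup_all_sorted = toy.duplicated(keep=False).sum()
print(n_dup_all_sorted)
```

When the two counts match, as they do for the real data, `drop_duplicates` on all columns is enough and no sorting is needed.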

In [11]:
## put into parser (format): DONE

n_duplicates_column_combo = df[df.duplicated(subset=["g2p_id"], keep=False)].shape

n_duplicates_all_columns = df[df.duplicated(keep=False)].shape

## for testing
# n_duplicates_all_columns = (1, 1)


if n_duplicates_column_combo != n_duplicates_all_columns: 
    raise AssertionError("The data format has changed, and the assumptions about duplicates/key columns may " \
                          "no longer hold. Re-explore the data and adjust the parser.")

In [12]:
## put into parser (format): DONE

## drop duplicates
df.drop_duplicates(inplace=True, ignore_index=True)

In [13]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3707 entries, 0 to 3706
Data columns (total 21 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   g2p_id                              3707 non-null   object             
 1   gene_symbol                         3707 non-null   object             
 2   gene_mim                            3704 non-null   object             
 3   hgnc_id                             3707 non-null   object             
 4   previous_gene_symbols               3424 non-null   object             
 5   disease_name                        3707 non-null   object             
 6   disease_mim                         2944 non-null   object             
 7   disease_MONDO                       2187 non-null   object             
 8   allelic_requirement                 3707 non-null   object             
 9   cross_cutting_modifier              462 n

## Column-level transforms

Based on data-playground "Notes on parsing data to create documents" section

In [14]:
## double-checking how to add prefixes to OMIM vs orphanet IDs

df_diseasemim = df.copy()

## done to preserve NA
df_diseasemim["disease_mim"] = [i if pd.isna(i) \
                                else "OMIM:" + i if i.isnumeric() \
                                else i \
                                for i in df_diseasemim["disease_mim"]]

df_diseasemim["disease_mim"] = df_diseasemim["disease_mim"].str.replace("Orphanet", "orphanet")

In [15]:
df_diseasemim[df_diseasemim["disease_mim"].str.contains("OMIM:", na=False)].shape

df_diseasemim[df_diseasemim["disease_mim"].str.contains("orphanet:", na=False)].shape

## add up row count. If == num non-null in info above, you're good 
## right now 2944 == 2944, so good

(2943, 21)

(1, 21)

In [16]:
## put into parser (format): DONE

## COLUMN-LEVEL TRANSFORMS

## adding Translator/biolink prefixes to IDs
df["gene_mim"] = "OMIM:" + df["gene_mim"]
df["hgnc_id"] = "HGNC:" + df["hgnc_id"]
df["disease_mim"] = df["disease_mim"].str.replace("Orphanet", "orphanet")
## done to preserve NA
df["disease_mim"] = [i if pd.isna(i)
                     else "OMIM:" + i if i.isnumeric()
                     else i
                     for i in df["disease_mim"]]

## strip whitespace
df["disease_name"] = df["disease_name"].str.strip()
df["comments"] = df["comments"].str.strip()

## create new columns
## UI really wants resource website urls like this. May need to adjust over time as website changes
df["g2p_record_url"] = "https://www.ebi.ac.uk/gene2phenotype/lgd/" +  df["g2p_id"]

## replace panel keywords with full names shown on G2P website for single record
## keeping "Hearing loss" as-is, changing all other values
df["panel"] = df["panel"].str.replace("DD", "Developmental disorders")
df["panel"] = df["panel"].str.replace("Cancer", "Cancer disorders")
df["panel"] = df["panel"].str.replace("Cardiac", "Cardiac disorders")
df["panel"] = df["panel"].str.replace("Eye", "Eye disorders")
df["panel"] = df["panel"].str.replace("Skeletal", "Skeletal disorders")
df["panel"] = df["panel"].str.replace("Skin", "Skin disorders")
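
The chained `str.replace` calls above work on substrings, so running the cell twice would turn "Cancer disorders" into "Cancer disorders disorders". As a hedged alternative, here is a sketch that maps whole panel keywords instead. The names `PANEL_NAMES` and `expand_panels` are hypothetical, and `"; "` is assumed to be the delimiter.

```python
## keyword -> full name shown on the G2P website ("Hearing loss" is kept as-is)
PANEL_NAMES = {
    "DD": "Developmental disorders",
    "Cancer": "Cancer disorders",
    "Cardiac": "Cardiac disorders",
    "Eye": "Eye disorders",
    "Skeletal": "Skeletal disorders",
    "Skin": "Skin disorders",
}


def expand_panels(value, delimiter="; "):
    """Map each panel keyword to its full name; unknown values pass through."""
    return delimiter.join(PANEL_NAMES.get(p, p) for p in value.split(delimiter))


print(expand_panels("DD; Skin; Cardiac"))
print(expand_panels("Hearing loss"))
```

It could be applied with `df["panel"].map(expand_panels)`, assuming the column has no missing values. Because only exact keywords are mapped, applying it a second time leaves the values unchanged.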

In [17]:
## checking on column-level transforms

df.head()
# df["g2p_record_url"].unique()[0:100]

# df[df["disease_mim"].str.contains("orphanet", na=False)]  ## should match the orphanet row count above (1 row)
# df[df["panel"].str.contains("Hearing", na=False)]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
0,G2P00124,KCNE1,OMIM:176261,HGNC:6240,ISK; JLNS2; LQT5; MINK,KCNE1-related Jervell and Lange-Nielsen syndrome,OMIM:612347,MONDO:0012871,biallelic_autosomal,potential secondary finding,strong,altered gene product structure,missense_variant; inframe_insertion; inframe_d...,undetermined,inferred,,HP:0001657; HP:0001279; HP:0000007; HP:0000407,30461122,Developmental disorders; Cardiac disorders,KCNE1-related JLNS is due to altered gene prod...,2024-04-05 12:05:01+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00124
1,G2P00841,PTPN11,OMIM:176876,HGNC:9644,BPTP3; NS1; PTP2C; SH-PTP2; SHP-2; SHP2,PTPN11-related Noonan syndrome with multiple l...,OMIM:151100,,monoallelic_autosomal,,definitive,altered gene product structure,missense_variant; inframe_deletion; inframe_in...,undetermined,inferred,,HP:0001709; HP:0000957; HP:0004409; HP:0001639...,27484170; 26377839; 25917897; 25884655; 248207...,Developmental disorders; Skin disorders; Cardi...,Expert review done on 12/01/2022; Noonan syndr...,2025-01-21 14:56:43+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00841
2,G2P03247,DSC2,OMIM:125645,HGNC:3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; NMD_triggerin...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac disorders,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:36:09+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03247
3,G2P03248,DSC2,OMIM:125645,HGNC:3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; NMD_triggerin...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac disorders,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:35:19+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03248
4,G2P03249,DSG2,OMIM:125671,HGNC:3049,CDHF5,DSG2-related arrhythmogenic right ventricular ...,,MONDO:0012434,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; splice_accept...,undetermined,inferred,,,21636032; 33831308; 33917638; 34400560; 240707...,Cardiac disorders,Expert review done on 05/01/2022; DSG2-related...,2024-03-20 09:40:18+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03249


## Confidence values

**2025-07-22:**

Every row/record has 1 confidence value, representing how confident the curators are that "this gene has a causal role in this disease". The definitions of the possible values are provided [here](https://www.ebi.ac.uk/gene2phenotype/about/terminology#g2p-confidence-section). 


**CURRENT DEFINITIONS** (copied here in case they change later)

> **definitive**: The role of this gene in this particular disease has been repeatedly demonstrated in both the research and clinical diagnostic settings, and has been upheld over time (at least 2 independent publication over 3 years' time). No convincing evidence has emerged that contradicts the role of the gene in the specified disease. (previously labelled as confirmed) The strength of evidence within publications as well as their number and publication dates is taken into account. In practice, this usually means at least 4 publications over 5 years. Typically this will also include convincing bioinformatic or functional evidence of causation, making it very unlikely that this gene-disease association would ever be refuted.
>
>**strong**: The role of this gene as a monogenic cause of disease has been repeatedly and independently demonstrated providing very strong convincing evidence in humans and no conflicting evidence for this gene's role in this disease. (previously labelled as probable).
>
>**moderate**: There is moderate evidence in humans to support a casual role for this gene in this disease with no contradictory evidence. The body of evidence is not large (e.g possibly only one key paper) but appears convincing enough that the gene-disease pair is likely to be validated with additional evidence in the near future.
>
>**limited**: Little human evidence exists to support a casual role for this gene in this disease, but not all evidence has been refuted. For example, there may be a collection of rare missense variants in humans but without convincing functional impact, segregration data that could either arise by chance (e.g across one or two meioses) or does not implicate a single gene, or functional data without direct recapitulation of the phenotype. Overall, the body of evidence does not meet contemporary criteria for claiming a valid association with disease. The majority are probably false associations. (previously labelled as possible).
>
>**disputed**: Although evidence has been reported, other evidence of equal weight disputes the claim.
>
>**refuted**: There has been an assertion of a gene-disease association in the literature, but new valid evidence has arisen that refutes the entire original body of evidence.

<div class="alert alert-block alert-success">

**2025-07-22:**

After input from Sierra and Matt:
1. **FOR NOW**, rows with **"refuted"** or **"disputed"** values **should not be used** to create edges for Translator. These values mean there's strong evidence that there ISN'T an association (negation). **This decision can be revisited** once Translator can model/handle negation better.
2. Rows with the **"limited"** value **should not be used** to create edges for Translator. Sierra and Matt pointed out the last sentence of the definition: "The majority are probably false associations. (previously labelled as possible)." The reasoning is that these may not be "real" associations.
3. Keep rows with **"moderate", "strong", "definitive"** values, because there's moderate-to-definitive evidence that the gene DOES HAVE a causal role in the disease. These can use the strong predicate "causes".

Other **data-modeling decision**: **FOR NOW**, using `subject_form_or_variant_qualifier` (for Gene). Values are `genetic_variant_form` or a descendant, based on the allelic_requirement values. The allelic requirement [terms](https://www.ebi.ac.uk/gene2phenotype/about/terminology#allelic-requirement-section) describe the genotype (of gene variants) linked to the disease.

</div>

<div class="alert alert-block alert-danger">

Data-modeling notes: the options for gene-disease associations are confusing.
* "causes / contributes to" makes more sense when qualifiers on the gene/protein are used.
* What's the difference between "associated with" and "genetically associated with"?
* "gene associated with condition" is a child of "genetically associated with", but seems to be more general: basically a "related to". It would also look weird in the UI, right?

</div>

In [18]:
df["confidence"].value_counts()

confidence
definitive    2064
strong         862
limited        525
moderate       255
refuted          1
Name: count, dtype: int64

### Removing rows + stats

In [19]:
## put into parser (format): 

## calculate stats before removing

n_rows_original = df.shape[0]
n_rows_refuted = df[df["confidence"] == "refuted"].shape[0]
n_rows_disputed = df[df["confidence"] == "disputed"].shape[0]
n_rows_limited = df[df["confidence"] == "limited"].shape[0]

In [20]:
## put into parser (format): 

## remove rows, calculate stats after

df = df[~ df["confidence"].isin(["refuted", "disputed", "limited"])].reset_index(drop=True)
n_rows_after_confidence = df.shape[0]

In [22]:
## put into parser (format): 

## Print stats

print(f"{n_rows_original} unique rows/records in original dataset\n")

print("Removing rows based on confidence:")
print(f"{n_rows_refuted}: 'refuted'")
print(f"{n_rows_disputed}: 'disputed'")
print(f"{n_rows_limited}: 'limited'\n")


print(f"{n_rows_after_confidence} rows afterwards")

3707 unique rows/records in original dataset

Removing rows based on confidence:
1: 'refuted'
0: 'disputed'
525: 'limited'

3181 rows afterwards


## Pre-NodeNorming

Querying NodeNorm: send unique values (no duplicates!) from entire column in large batches -> generate mapping dict to use. 
<br>
__Not querying 1-by-1 or 1 row at a time: much slower__ and would involve sending duplicate IDs (unless saved dict is kept outside loop and checked) 
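
As a sketch of the "unique values in large batches" idea, the snippet below builds the request bodies without sending anything. `build_request_bodies` is a hypothetical helper; the `batched` function and the `curies`/`conflate` keys mirror the code used later in this notebook.

```python
from itertools import islice


def batched(iterable, n):
    """batched('ABCDEFG', 3) -> ABC DEF G (same helper as in the cell below)."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch


def build_request_bodies(column_values, batch_size=1000):
    """Drop missing values + duplicates, then split into request bodies."""
    ## dict.fromkeys removes duplicates while keeping first-seen order;
    ## the isinstance check skips None and NaN (pandas' missing value)
    unique_curies = list(
        dict.fromkeys(v for v in column_values if isinstance(v, str))
    )
    return [
        {"curies": list(batch), "conflate": True}
        for batch in batched(unique_curies, batch_size)
    ]


values = ["HGNC:6240", "HGNC:9644", None, "HGNC:6240", float("nan"), "HGNC:3036"]
bodies = build_request_bodies(values, batch_size=2)
print(len(bodies))  ## 3 unique IDs in batches of 2 -> 2 request bodies
```

Each body would then be sent with one `requests.post` call, so the number of requests depends on the number of unique IDs rather than the number of rows.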

Not going to use NameResolver: I'm not optimistic it would work anyway. My manual process for getting "better" disease IDs is to take the gene IDs, find the diseases they're linked to in OMIM and Monarch, and check whether those match the data's disease name / phenotypes / publications. This is more complicated than just using NameResolver.

<div class="alert alert-block alert-danger">

Set the NodeNorm URL you want to use.

</div>

In [23]:
## put into parser (format): DONE

import requests

## from BioThings annotator code: for interoperability between diff Python versions
# try:
#     from itertools import batched  # new in Python 3.12
# except ImportError:
#     from itertools import islice

#     def batched(iterable, n):
#         # batched('ABCDEFG', 3) → ABC DEF G
#         if n < 1:
#             raise ValueError("n must be at least one")
#         iterator = iter(iterable)
#         while batch := tuple(islice(iterator, n)):
#             yield batch

## doing to test that this works
from itertools import islice

def batched(iterable, n):
    # batched('ABCDEFG', 3) → ABC DEF G
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch

nodenorm_url = "https://nodenorm.ci.transltr.io/get_normalized_nodes"

### Exploring: Genes

**2025-06-27 data:** Every row has at least 1 gene ID (the HGNC column has no missing values). So no rows will be removed for lacking a gene ID to use in the pre-NodeNorming.

In [24]:
df[["gene_symbol", "hgnc_id", "gene_mim"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3181 entries, 0 to 3180
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   gene_symbol  3181 non-null   object
 1   hgnc_id      3181 non-null   object
 2   gene_mim     3180 non-null   object
dtypes: object(3)
memory usage: 74.7+ KB


In [25]:
df[df["gene_mim"].isna()]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
2477,G2P03745,RNU2-2P,,HGNC:10152,,RNU2-2P-related neurodevelopmental disorder wi...,,,monoallelic_autosomal,,strong,altered gene product structure,ncRNA,undetermined,inferred,,,40210679; 40442284,Developmental disorders,Gene now called RNU2-2. Recurrent variants: n....,2025-06-04 09:34:09+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03745


In [26]:
## saving stats on data with no gene IDs, just in case

stats_no_gene_IDs = {
    "n_rows": df[df["gene_mim"].isna() & df["hgnc_id"].isna()].shape[0],
    "n_names": len(df[df["gene_mim"].isna() & df["hgnc_id"].isna()]["gene_symbol"].unique())
}

stats_no_gene_IDs["n_rows"]
stats_no_gene_IDs["n_names"]

0

0

#### HGNC

__Running Gene HGNC IDs through NodeNorm__


Catching potential mapping failures for later stats report

In [27]:
## saving stats on data with no HGNC IDs, just in case

n_rows_no_hgnc = df["hgnc_id"].isna().sum()

In [28]:
## get set of unique CURIEs to put into NodeNorm
hgnc = df["hgnc_id"].dropna().unique()
len(hgnc)

2627

In [29]:
hgnc_nodenorm_mapping = {}

## set up variables to catch potential mapping failures
stats_hgnc_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [30]:
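## larger batches are quicker
for batch in batched(hgnc, 1000):
    ## batched() yields tuples -> cast to list for the JSON body
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    ## stop on HTTP errors instead of trying to parse an error response
    r.raise_for_status()
    response = r.json()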
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Gene":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        hgnc_nodenorm_mapping.update(temp)
                    else:
                        stats_hgnc_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_hgnc_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_hgnc_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_hgnc_mapping_failures["unexpected_error"].update({k: v})
            print("Encountered an unexpected error.")
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [31]:
len(hgnc_nodenorm_mapping)

stats_hgnc_mapping_failures

2627

{'unexpected_error': {},
 'nodenorm_returned_none': [],
 'wrong_category': {},
 'no_label': []}

#### OMIM

__Running Gene OMIM IDs through NodeNorm__

Catching potential mapping failures for later stats report. 

Pasted, adjusted from HGNC code blocks above.
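
Since this loop is reused for each ID column with only the expected category and the variable names changed, the classification step could be factored into one function. This is a minimal sketch, not the notebook's actual code: `classify_nodenorm_response` is a hypothetical name, and it assumes the response has the shape the loop relies on (each CURIE maps to `None` or to a dict with `type` and `id` keys). The entries in `fake_response` are made up, only to exercise each branch.

```python
def classify_nodenorm_response(response, expected_category):
    """Split one NodeNorm response dict into kept mappings and failures.

    Hypothetical helper: mirrors the nested if/else logic of the batch loop.
    """
    mapping = {}
    failures = {
        "unexpected_error": {},
        "nodenorm_returned_none": [],
        "wrong_category": {},
        "no_label": [],
    }
    for curie, info in response.items():
        try:
            if info is None:
                ## NodeNorm didn't have info on this ID
                failures["nodenorm_returned_none"].append(curie)
            elif info["type"][0] != expected_category:
                failures["wrong_category"][curie] = info["type"][0]
            elif not info["id"].get("label"):
                failures["no_label"].append(curie)
            else:
                mapping[curie] = {
                    "primary_id": info["id"]["identifier"],
                    "primary_label": info["id"]["label"],
                }
        except (KeyError, IndexError, TypeError, AttributeError):
            ## response entry didn't have the assumed shape
            failures["unexpected_error"][curie] = info
    return mapping, failures


## made-up response entries, one per branch
fake_response = {
    "X:1": {"type": ["biolink:Gene"], "id": {"identifier": "NCBIGene:1", "label": "A"}},
    "X:2": None,
    "X:3": {"type": ["biolink:Disease"], "id": {"identifier": "MONDO:1", "label": "B"}},
    "X:4": {"type": ["biolink:Gene"], "id": {"identifier": "NCBIGene:4"}},
    "X:5": {"id": {}},
}
mapping, failures = classify_nodenorm_response(fake_response, "biolink:Gene")
```

In a batch loop, the returned `mapping` would be merged with `dict.update` and the failure lists and dicts extended per batch.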

In [32]:
## get set of unique CURIEs to put into NodeNorm
gene_omim = df["gene_mim"].dropna().unique()
len(gene_omim)

2626

In [33]:
gene_omim_nodenorm_mapping = {}

## set up variables to catch potential mapping failures
stats_gene_omim_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [34]:
## larger batches are quicker
for batch in batched(gene_omim, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    ## fail loudly on HTTP errors instead of parsing an error body as results
    r.raise_for_status()
    response = r.json()
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Gene":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        gene_omim_nodenorm_mapping.update(temp)
                    else:
                        stats_gene_omim_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_gene_omim_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_gene_omim_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_gene_omim_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')
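
The loop depends on a `batched` helper defined earlier in the notebook. If that is `itertools.batched`, it needs Python 3.12 or newer. On older versions, an equivalent could be sketched as below (`batched_compat` is a hypothetical name). It yields tuples, which is why the loop casts each batch to a list.

```python
from itertools import islice


def batched_compat(iterable, n):
    """Yield successive tuples of up to n items (stand-in for itertools.batched)."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while True:
        batch = tuple(islice(iterator, n))
        if not batch:
            return
        yield batch


## the last batch is shorter when the length isn't a multiple of n
batches = list(batched_compat(range(5), 2))
```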

In [35]:
len(gene_omim_nodenorm_mapping)

stats_gene_omim_mapping_failures

2626

{'unexpected_error': {},
 'nodenorm_returned_none': [],
 'wrong_category': {},
 'no_label': []}

#### Comparing HGNC vs OMIM

In [36]:
## if row has both IDs, look for diff in mappings from each ID
for row in df[["gene_mim", "hgnc_id"]].itertuples(index=False):
    ## has both IDs
    if pd.notna(row.gene_mim) and pd.notna(row.hgnc_id):
        ## if have NodeNorm mappings for both
        if gene_omim_nodenorm_mapping.get(row.gene_mim) and \
        hgnc_nodenorm_mapping.get(row.hgnc_id):
            ## check if mappings are diff
            if gene_omim_nodenorm_mapping[row.gene_mim]["primary_id"] != \
            hgnc_nodenorm_mapping[row.hgnc_id]["primary_id"]:
                print(row)

## 2025-06-27 data: nothing prints, so there are no mismatches

In [37]:
## look for differences in name between NodeNormed and original data

for row in df[["gene_symbol", "hgnc_id"]].itertuples(index=False):
    ## works because both columns have no missing values and there are no failed mappings
    ## if this changes, need to adjust this code block
    if row.gene_symbol != hgnc_nodenorm_mapping[row.hgnc_id]["primary_label"]:
        print(f"G2P name {row.gene_symbol}, ID {row.hgnc_id}")
        print(f'NodeNorm name {hgnc_nodenorm_mapping[row.hgnc_id]["primary_label"]}, ID {hgnc_nodenorm_mapping[row.hgnc_id]["primary_id"]}')
        print("\n")

G2P name MT-TP, ID HGNC:7494
NodeNorm name TRNP, ID NCBIGene:4571


G2P name CENPJ, ID HGNC:17272
NodeNorm name CPAP, ID NCBIGene:55835


G2P name CCDC103, ID HGNC:32700
NodeNorm name DNAAF19, ID NCBIGene:388389


G2P name CCDC115, ID HGNC:28178
NodeNorm name VMA22, ID NCBIGene:84317


G2P name TMEM199, ID HGNC:18085
NodeNorm name VMA12, ID NCBIGene:147007


G2P name RNU2-2P, ID HGNC:10152
NodeNorm name RNU2-2, ID NCBIGene:26855


G2P name MT-ND1, ID HGNC:7455
NodeNorm name ND1, ID NCBIGene:4535


G2P name MT-ND4, ID HGNC:7459
NodeNorm name ND4, ID NCBIGene:4538


G2P name MT-ATP6, ID HGNC:7414
NodeNorm name ATP6, ID NCBIGene:4508


G2P name MT-ND5, ID HGNC:7461
NodeNorm name ND5, ID NCBIGene:4540


G2P name MT-ND6, ID HGNC:7462
NodeNorm name ND6, ID NCBIGene:4541




**2025-03-28 data:** 

Review of mismatched names:
* NodeNorm is correct for two: CENPJ should be CPAP, and CCDC103 should be DNAAF19
* The rest look like mitochondrial genes. For these, the main NCBIGene name seems to match the G2P name rather than the NodeNorm label -> messaged NodeNorm
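
The name check above assumes every row has an HGNC ID with a successful mapping. A variant that tolerates missing mappings and collects mismatches instead of printing them might look like this sketch. `find_label_mismatches` is a hypothetical name; the CENPJ pair is taken from the output above, and the second row only stands in for an ID with no mapping.

```python
def find_label_mismatches(rows, mapping):
    """Return (symbol, curie, nodenorm_label) for rows whose symbol differs.

    rows: iterable of (gene_symbol, hgnc_id) pairs.
    mapping: dict shaped like hgnc_nodenorm_mapping.
    Rows without a mapping are skipped rather than raising KeyError.
    """
    mismatches = []
    for symbol, curie in rows:
        entry = mapping.get(curie)
        if entry is None:
            continue
        if symbol != entry["primary_label"]:
            mismatches.append((symbol, curie, entry["primary_label"]))
    return mismatches


sample_mapping = {
    "HGNC:17272": {"primary_id": "NCBIGene:55835", "primary_label": "CPAP"},
}
## second row's ID is deliberately absent from sample_mapping
sample_rows = [("CENPJ", "HGNC:17272"), ("GAA", "HGNC:4065")]
result = find_label_mismatches(sample_rows, sample_mapping)
```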

#### Conclusions

<div class="alert alert-block alert-success">

**2025-03-28 data:** 
    
__Exploration__

* no mapping failures
* when rows have both OMIM and HGNC IDs, there were no differences in NodeNorm mapping ("mismatches")
    
__Decision: Use HGNC ID column to generate NodeNorm values__

* fewer missing values (none right now)
* HGNC IDs are probably all genes, whereas the OMIM ID namespace covers multiple kinds of entities

### Exploring: Diseases

There are many more missing IDs for Disease, compared to Gene. 

As mentioned at the beginning of the "Pre-NodeNorming" section, I won't be using NameResolver right now. 

__This means all rows w/o any disease IDs will be removed__ because they cannot be pre-NodeNormed. 

In [38]:
df[["disease_name", "disease_mim", "disease_MONDO"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3181 entries, 0 to 3180
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   disease_name   3181 non-null   object
 1   disease_mim    2638 non-null   object
 2   disease_MONDO  1878 non-null   object
dtypes: object(3)
memory usage: 74.7+ KB


In [39]:
df[df["disease_mim"].isna() & df["disease_MONDO"].isna()]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
69,G2P03445,GAA,OMIM:606800,HGNC:4065,,GAA-related Pompe disease,,,biallelic_autosomal,restricted mutation set,definitive,decreased gene product level; altered gene pro...,,undetermined,inferred,,,30681346; 31254424; 1652892; 8094613; 7981676;...,Cardiac disorders,Pompe disease is inherited as an autosomal rec...,2024-03-26 10:53:54+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03445
77,G2P03717,ACTN2,OMIM:102573,HGNC:164,,ACTN2-related cardiac and skeletal myopathy,,,monoallelic_autosomal,,definitive,altered gene product level; decreased gene pro...,stop_gained; frameshift_variant; missense_vari...,undetermined,inferred,,,17097056; 20022194; 25173926; 25224718; 275322...,Cardiac disorders,Pathogenic variants in ACTN2 are definitively ...,2025-03-07 13:46:46+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03717
78,G2P03718,FHOD3,OMIM:609691,HGNC:26178,FHOS2; FLJ22297; FLJ22717; KIAA1695,FHOD3-related hypertrophic cardiomyopathy,,,monoallelic_autosomal,,definitive,altered gene product structure,splice_acceptor_variant_NMD_escaping; missense...,undetermined,inferred,,,19706596; 29907873; 30442288; 30898215; 317428...,Cardiac disorders,Pathogenic variants in FHOD3 are definitively ...,2025-03-07 13:48:22+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03718
157,G2P00448,IGF2,OMIM:147470,HGNC:5466,C11ORF43; FLJ44734; IGF-II,IGF2-related Beckwith-Wiedemann syndrome,,,monoallelic_autosomal,imprinted region; restricted mutation set,definitive,altered gene product structure,,gain of function,inferred,,HP:0001548; HP:0000269; HP:0002240; HP:0002667...,,Developmental disorders; Skeletal disorders,,2023-05-24 09:07:28+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00448
216,G2P00872,PIK3CA,OMIM:171834,HGNC:8975,PI3K,PIK3CA-related overgrowth spectrum disorder wi...,,,monoallelic_autosomal,typically mosaic; restricted mutation set,definitive,altered gene product structure,,gain of function,inferred,,HP:0000494; HP:0001744; HP:0002667; HP:0001852...,22658544; 22729224,Developmental disorders; Skeletal disorders,,2024-12-11 11:40:22+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00872
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3143,G2P02826,RPE65,OMIM:180069,HGNC:10294,BCO3; LCA2; RD12; RP20,RPE65-related retinal dystrophy,,,monoallelic_autosomal,,definitive,altered gene product structure,missense_variant; inframe_deletion; inframe_in...,undetermined,inferred,,HP:0001139; HP:0000556,27307694; 21654732; 29947567,Eye disorders,,2019-10-30 14:45:12+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02826
3153,G2P03582,MYO6,OMIM:600970,HGNC:7605,DFNA22; DFNB37; KIAA0389,MYO6-related nonsyndromic genetic hearing loss,,,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; splice_accept...,undetermined,inferred,,,18348273; 23485424; 25999546; 12687499; 24105371,Ear,,2024-11-28 14:52:17+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03582
3154,G2P03583,MYO6,OMIM:600970,HGNC:7605,DFNA22; DFNB37; KIAA0389,MYO6-related nonsyndromic genetic hearing loss,,,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; splice_accept...,undetermined,inferred,,,18348273; 23485424; 25999546; 24105371,Ear,,2024-11-28 14:47:17+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03583
3155,G2P01747,CDC14A,OMIM:603504,HGNC:1718,CDC14; CDC14A1; CDC14A2; DFNB105; DFNB32,CDC14A-related deafness,,,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,,27259055,Ear,,2025-04-08 17:02:31+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P01747


In [40]:
## saving stats on data with no disease IDs

stats_no_disease_IDs = {
    "n_rows": df[df["disease_mim"].isna() & df["disease_MONDO"].isna()].shape[0],
    "n_names": len(df[df["disease_mim"].isna() & df["disease_MONDO"].isna()]["disease_name"].unique())
}

stats_no_disease_IDs["n_rows"]
stats_no_disease_IDs["n_names"]

248

243

#### OMIM/orphanet

__Running OMIM/orphanet IDs through NodeNorm__

Catching mapping failures for later stats report

Pasted, adjusted from HGNC code blocks above.

In [41]:
## put into parser (format): DONE

## get set of unique CURIEs to put into NodeNorm
disease_OmOr = df["disease_mim"].dropna().unique()
len(disease_OmOr)

2552

In [42]:
## put into parser (format): DONE

OmOr_nodenorm_mapping = {}

## set up variables to catch mapping failures
stats_OmOr_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [43]:
## put into parser (format): DONE

## larger batches are quicker
for batch in batched(disease_OmOr, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    ## fail loudly on HTTP errors instead of parsing an error body as results
    r.raise_for_status()
    response = r.json()
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Disease":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        OmOr_nodenorm_mapping.update(temp)
                    else:
                        stats_OmOr_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_OmOr_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_OmOr_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_OmOr_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [44]:
## put into parser (format): DONE

## calculate stats: number of rows affected by each type of mapping failure
stats_OmOr_mapping_failures.update({
    "n_rows_none": df[df["disease_mim"].isin(stats_OmOr_mapping_failures["nodenorm_returned_none"])].shape[0],
    "n_rows_wrong_category": df[df["disease_mim"].isin(stats_OmOr_mapping_failures["wrong_category"].keys())].shape[0],
    "n_rows_no_label": df[df["disease_mim"].isin(stats_OmOr_mapping_failures["no_label"])].shape[0]
})

In [45]:
len(OmOr_nodenorm_mapping)

stats_OmOr_mapping_failures["unexpected_error"]

len(stats_OmOr_mapping_failures["nodenorm_returned_none"])
len(stats_OmOr_mapping_failures["wrong_category"])
len(stats_OmOr_mapping_failures["no_label"])

2544

{}

4

3

1

In [46]:
## code used to review mapping failures 

stats_OmOr_mapping_failures["nodenorm_returned_none"]

stats_OmOr_mapping_failures["wrong_category"]

stats_OmOr_mapping_failures["no_label"]

['OMIM:601884', 'OMIM:133701', 'OMIM:133700', 'OMIM:150800']

{'OMIM:188400': 'biolink:Gene',
 'OMIM:123580': 'biolink:Gene',
 'OMIM:300204': 'biolink:Gene'}

['OMIM:621034']
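
Each failed ID is then checked by hand against its source page. A small sketch for generating those links is below. `review_url` is a hypothetical helper; the two URL patterns are the ones used for the OMIM and Monarch links in this notebook, and other prefixes are deliberately left unhandled.

```python
def review_url(curie):
    """Return a page to check a CURIE by hand, or None for unhandled prefixes."""
    prefix, _, local_id = curie.partition(":")
    if prefix == "OMIM":
        return f"https://omim.org/entry/{local_id}"
    if prefix == "MONDO":
        return f"https://monarchinitiative.org/{curie}"
    return None


## the IDs where NodeNorm returned None
failed = ["OMIM:601884", "OMIM:133701", "OMIM:133700", "OMIM:150800"]
links = {curie: review_url(curie) for curie in failed}
```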

In [48]:
## code used to review mapping failures 

df[df["disease_mim"] == "OMIM:133701"]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
162,G2P00486,EXT2,OMIM:608210,HGNC:3513,SOTV,EXT2-related multiple exostoses,OMIM:133701,MONDO:0007586,monoallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0004322; HP:0003276; HP:0000006; HP:0002812...,9326317,Developmental disorders; Cancer disorders; Ske...,,2023-05-24 09:07:28+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00486


<div class="alert alert-block alert-info">

**Update 2025-06-27: Fewer mapping failures!**
    
4 cases where NodeNorm returned None: 
* OMIM:601884 - [valid ID](https://omim.org/entry/601884), but it doesn't seem to be a disease (previously reviewed, reported to EBI gene2pheno, NodeNorm)
* OMIM:133701 - [valid disease ID](https://omim.org/entry/133701), NodeNorm issue
* OMIM:133700 - [valid disease ID](https://omim.org/entry/133700), NodeNorm issue (previously reviewed, reported)
* OMIM:150800 - [valid disease ID](https://omim.org/entry/150800), NodeNorm issue

   
3 cases where NodeNorm category was something else (currently, always Gene): 
* OMIM:188400 - [valid disease ID](https://omim.org/entry/188400), NodeNorm error (previously reviewed, reported)
* OMIM:123580 - [confirmed to be a gene ID](https://omim.org/entry/123580), EBI gene2pheno error
* OMIM:300204 - [confirmed to be a gene ID](https://omim.org/entry/300204), EBI gene2pheno error
    
    
1 case where NodeNorm didn't have a primary label: 
* OMIM:621034 - [valid disease ID](https://omim.org/entry/621034), NodeNorm error

<div class="alert alert-block alert-success">

**POSSIBLE REVISIT**: Could re-analyze disease_MONDO column data to see if it's more reliable. 
    
**2025-03-28 data:** 
    
I decided <b>not to try using MONDO mappings when the OMIM mapping failed</b>, because only a few of those rows even have MONDO IDs to use. 

In [50]:
## code used to check how many rows have OMIM failure + MONDO ID 

df[df["disease_mim"].isin(stats_OmOr_mapping_failures["nodenorm_returned_none"]) & 
   df["disease_MONDO"].notna()].shape

df[df["disease_mim"].isin(stats_OmOr_mapping_failures["wrong_category"].keys()) & 
   df["disease_MONDO"].notna()].shape

df[df["disease_mim"].isin(stats_OmOr_mapping_failures["no_label"]) & 
   df["disease_MONDO"].notna()].shape

(2, 22)

(0, 22)

(0, 22)

#### MONDO

__Running MONDO IDs through NodeNorm__

Catching potential mapping failures for later stats report

Pasted, adjusted from Disease OMIM/orphanet code blocks above.

In [51]:
## get set of unique CURIEs to put into NodeNorm
mondo = df["disease_MONDO"].dropna().unique()
len(mondo)

1735

In [52]:
mondo_nodenorm_mapping = {}

## set up variables to catch mapping failures
stats_mondo_mapping_failures = {
    "unexpected_error": {},
    "nodenorm_returned_none": [],
    "wrong_category": {},
    "no_label": []
    
}

In [53]:
## larger batches are quicker
for batch in batched(mondo, 1000):
    ## returns tuples -> cast to list
    req_body = {
        "curies": list(batch),
        "conflate": True,
    }
    r = requests.post(nodenorm_url, json=req_body)
    ## fail loudly on HTTP errors instead of parsing an error body as results
    r.raise_for_status()
    response = r.json()
    
    ## not doing dict comprehension. allows easier review, logic writing
    for k,v in response.items():
        ## catch unexpected errors
        try:
            ## if NodeNorm didn't have info on this ID, v will be None
            if v is not None:
                ## don't keep mapping if category is not the expected one
                if v["type"][0] == "biolink:Disease":
                    ## also throw out mapping if no primary label found
                    if v["id"].get("label"):
                        temp = {
                            k: {"primary_id": v["id"]["identifier"],
                                "primary_label": v["id"]["label"]
                               }
                        }
                        mondo_nodenorm_mapping.update(temp)
                    else:
                        stats_mondo_mapping_failures["no_label"].append(k)
#                         print(f"{k}: NodeNorm didn't find primary label. Not keeping this mapping.")
                else:
                    stats_mondo_mapping_failures["wrong_category"].update({k: v["type"][0]})
#                     print(f'{k}: NodeNorm found different category {v["type"][0]}. Not keeping this mapping.')
            else:
                stats_mondo_mapping_failures["nodenorm_returned_none"].append(k)
#                 print(f"{k}: NodeNorm didn't recognize this ID")
        except Exception:
            stats_mondo_mapping_failures["unexpected_error"].update({k: v})
            print('Encountered an unexpected error.')
            print(f'NodeNorm response key: {k}')
            print(f'NodeNorm response value: {v}')

In [54]:
len(mondo_nodenorm_mapping)

stats_mondo_mapping_failures

1734

{'unexpected_error': {},
 'nodenorm_returned_none': ['MONDO:0976124'],
 'wrong_category': {},
 'no_label': []}

**2025-06-27:**

[MONDO:0976124](https://monarchinitiative.org/MONDO:0976124) is a valid disease ID, so this is a NodeNorm issue. 

In [52]:
df[df["disease_MONDO"] == "MONDO:0976124"]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
2817,G2P03584,TCP1,OMIM:186980,HGNC:11655,CCT1; CCTA; D6S230E,TCP1-related neurodevelopmental disorder with ...,,MONDO:0976124,monoallelic_autosomal,typically de novo,moderate,decreased gene product level; absent gene prod...,missense_variant; stop_gained; frameshift_variant,loss of function,evidence,39480921 -> functional_alteration: non patient...,,39480921,Developmental disorders,Gene also known as CCT1.,2025-02-27 15:14:14+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03584


#### Comparing OMIM/orphanet vs MONDO

**POSSIBLE REVISIT**: Could re-analyze disease_MONDO column data to see if it's more reliable. 

In [None]:


## if row has both IDs, look for diff in mappings from each ID

## list of tuples (omim/orpha, mondo)
mismatches = []

for row in df[["disease_mim", "disease_MONDO"]].itertuples(index=False):
    ## has both IDs
    if pd.notna(row.disease_mim) and pd.notna(row.disease_MONDO):
        ## if have NodeNorm mappings for both
        if OmOr_nodenorm_mapping.get(row.disease_mim) and \
        mondo_nodenorm_mapping.get(row.disease_MONDO):
            ## check if mappings are diff
            if OmOr_nodenorm_mapping[row.disease_mim]["primary_id"] != \
            mondo_nodenorm_mapping[row.disease_MONDO]["primary_id"]:
                mismatches.append((row.disease_mim, row.disease_MONDO))

print(f"There are {len(mismatches)} mismatches between the OMIM/orphanet and MONDO NodeNorm mappings.")
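
The same compare-two-mappings pattern is used for genes (HGNC vs OMIM) and for diseases (OMIM/orphanet vs MONDO). A generic version could be sketched as below, assuming plain `(id_a, id_b)` pairs such as those from `itertuples(index=False)`. `find_id_mismatches` is a hypothetical name. Unlike the loop above, it reports each distinct pair once, so its count is of unique pairs rather than rows. The sample mappings are the two entries reviewed in this section.

```python
import math


def _is_missing(value):
    """True for None or NaN, the two ways a missing ID can show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_id_mismatches(pairs, mapping_a, mapping_b):
    """Return distinct (id_a, id_b) pairs whose NodeNorm primary IDs differ."""
    mismatches = []
    seen = set()
    for id_a, id_b in pairs:
        if _is_missing(id_a) or _is_missing(id_b):
            continue
        entry_a = mapping_a.get(id_a)
        entry_b = mapping_b.get(id_b)
        ## only compare when both IDs were mapped
        if entry_a is None or entry_b is None:
            continue
        if entry_a["primary_id"] != entry_b["primary_id"] and (id_a, id_b) not in seen:
            seen.add((id_a, id_b))
            mismatches.append((id_a, id_b))
    return mismatches


omor_sample = {
    "OMIM:300696": {"primary_id": "MONDO:0010401",
                    "primary_label": "X-linked myopathy with postural muscle atrophy"},
}
mondo_sample = {
    "MONDO:0010680": {"primary_id": "MONDO:0010680",
                      "primary_label": "X-linked Emery-Dreifuss muscular dystrophy"},
}
## a repeated pair and a row with a missing OMIM ID
pairs = [
    ("OMIM:300696", "MONDO:0010680"),
    ("OMIM:300696", "MONDO:0010680"),
    (float("nan"), "MONDO:0010680"),
]
found = find_id_mismatches(pairs, omor_sample, mondo_sample)
```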

In [57]:
## code chunk to review mismatches 1 by 1
mismatches[0]

('OMIM:300696', 'MONDO:0010680')

In [58]:
## code chunk to review mismatches 1 by 1

OmOr_nodenorm_mapping["OMIM:300696"]
mondo_nodenorm_mapping["MONDO:0010680"]

{'primary_id': 'MONDO:0010401',
 'primary_label': 'X-linked myopathy with postural muscle atrophy'}

{'primary_id': 'MONDO:0010680',
 'primary_label': 'X-linked Emery-Dreifuss muscular dystrophy'}

In [59]:
## code chunk to review mismatches 1 by 1

df[df["disease_mim"] == "OMIM:300696"]
df[df["disease_MONDO"] == "MONDO:0010680"]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
48,G2P03293,FHL1,OMIM:300163,HGNC:3702,BA535K18.1; FHL1B; FLH1A; KYO-T; MGC111107; SL...,FHL1-related Emery-Dreifuss muscular dystrophy,OMIM:300696,MONDO:0010680,monoallelic_X_hemizygous,,definitive,decreased gene product level; absent gene prod...,stop_gained_NMD_escaping; missense_variant; fr...,loss of function,inferred,,HP:0003691; HP:0003701; HP:0003704; HP:0001419...,18179888; 19687455; 30681346; 19716112; 201868...,Developmental disorders; Cardiac disorders,Expert review done on 12/01/2022; FHL1-related...,2024-03-26 10:33:21+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03293


Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
48,G2P03293,FHL1,OMIM:300163,HGNC:3702,BA535K18.1; FHL1B; FLH1A; KYO-T; MGC111107; SL...,FHL1-related Emery-Dreifuss muscular dystrophy,OMIM:300696,MONDO:0010680,monoallelic_X_hemizygous,,definitive,decreased gene product level; absent gene prod...,stop_gained_NMD_escaping; missense_variant; fr...,loss of function,inferred,,HP:0003691; HP:0003701; HP:0003704; HP:0001419...,18179888; 19687455; 30681346; 19716112; 201868...,Developmental disorders; Cardiac disorders,Expert review done on 12/01/2022; FHL1-related...,2024-03-26 10:33:21+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03293


<div class="alert alert-block alert-info">    

**2025-02-28 data:** 

__Review of OMIM vs MONDO NodeNorm mismatches (22)__

None were orphanet.    
    
---

__19: OMIM's mapping is better__

> __6: Mondo ID is related but wrong__ -> emailed EBI gene2pheno w/ example
> * 'OMIM:243310', 'MONDO:0013812': omim is correct syndrome 1, but mondo is syndrome 2 (diff gene)
> * 'OMIM:613575', 'MONDO:0044314': omim is correct 55, but mondo is 78 (diff gene)
> * 'OMIM:101000', 'MONDO:0008075': omim is correct type of schwannomatosis (NF2/type 2), vs mondo is a sibling. 
>   * NodeNorm should map omim to MONDO:0007039 but isn't -> messaged NodeNorm
> * 'OMIM:613987', __'MONDO:0009136'__: omim is correct recessive 2, but mondo is recessive 1 (diff gene? Confusing because Monarch page links to gene NHP2 but OMIM page doesn't)
>   * NodeNorm should map omim to MONDO:0013519 but isn't -> messaged NodeNorm  
> * 'OMIM:613988', 'MONDO:0009136': omim is correct recessive 3, but mondo is recessive 1 (diff gene)
>   * NodeNorm should map omim to MONDO:0013520 but isn't -> messaged NodeNorm
> * 'OMIM:616353', 'MONDO:0009136': omim is correct recessive 6, but mondo is recessive 1 (diff gene)
>   * NodeNorm should map omim to MONDO:0014600 but isn't -> messaged NodeNorm

> __13: Mondo ID is too general__ (can see on Monarch website) -> emailed EBI gene2pheno w/ example
> * 'OMIM:300696', 'MONDO:0010680': omim maps to MONDO:0010401, child of the mondo
> * 'OMIM:304120', 'MONDO:0019027': omim maps to MONDO:0010571 (syndrome type 2), child of the mondo (syndrome)
> * 'OMIM:610019', 'MONDO:0005129': omim maps to MONDO:0012395 (cataract 18), child of the mondo (cataract)
> * 'OMIM:611726', 'MONDO:0016295': omim maps to MONDO:0012721, child of the mondo 
> * 'OMIM:602668', 'MONDO:0016107': omim maps to MONDO:0011266 (type 2), child of the mondo
> * 'OMIM:203200', 'MONDO:0018910': omim maps to MONDO:0008746 (type 2), child of the mondo
> * 'OMIM:614328', 'MONDO:0017411': omim maps to MONDO:0013693 (type 1), child of the mondo
> * 'OMIM:175800', 'MONDO:0006602': omim maps to MONDO:0008290 (1, mibelli type), grandchild of the mondo
> * 'OMIM:614073', **'MONDO:0019312'**: omim maps to MONDO:0013556 (syndrome 4), child of the mondo (syndrome)
> * 'OMIM:614074', 'MONDO:0019312': omim maps to MONDO:0013557 (syndrome 5), child of the mondo (syndrome)
> * 'OMIM:614075', 'MONDO:0019312': omim maps to MONDO:0013558 (syndrome 6), child of the mondo (syndrome)
> * 'OMIM:614076', 'MONDO:0019312': omim maps to MONDO:0013559 (syndrome 7), child of the mondo (syndrome)
> * 'OMIM:614077', 'MONDO:0019312': omim maps to MONDO:0013560 (syndrome 8), child of the mondo (syndrome)

    
**1: MONDO's mapping is better**
<br>
Omim ID is slightly off -> __TELL EBI GENE2PHENO?__
* 'OMIM:613723', 'MONDO:0009181': mondo matches the disease name and phenotypes listed in the record better than the omim 


**1: Unsure**
* 'OMIM:158350', 'MONDO:0017623': omim is for Cowden syndrome 1, mondo is for PTEN hamartoma tumor syndrome. These are very similar, so I'm not sure which one is better. -> __TELL EBI GENE2PHENO?__
  * There's also another record w/ just the OMIM ID. I think the two rows should be merged. -> __TELL EBI GENE2PHENO?__


**1: NodeNorm error** -> messaged NodeNorm
* 'OMIM:224230', 'MONDO:0009136': both are recessive 1, NodeNorm should map to same entity

**Other rows reviewed:**
* 'OMIM:614583', 'MONDO:0013812': map to same correct entity

**2025-02-28 data:** 

The prelim decision is to use disease OMIM/orphanet IDs because:
* fewer missing values
* more accurate in rows that also have a MONDO ID

#### Checking MONDO data

Above, I decided the OMIM/orphanet disease IDs were better. 

However, I wondered whether the MONDO IDs match the disease name in rows that have no OMIM/orphanet ID. If they do, they could be used for NodeNorming, and fewer rows would be dropped for lacking a pre-NodeNormed disease. 
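
The idea being explored can be written as a per-row rule: prefer the OMIM/orphanet mapping, and only consider the MONDO mapping when the row has no OMIM/orphanet ID at all. This is a sketch of that idea, not a final decision; `pick_disease_mapping` is a hypothetical name. In line with the earlier decision, a row whose OMIM/orphanet ID failed to map does not fall back to MONDO.

```python
import math


def _is_missing(value):
    """True for None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pick_disease_mapping(disease_mim, disease_mondo, omor_mapping, mondo_mapping):
    """Return the NodeNorm entry to use for one row, or None to drop the row."""
    if not _is_missing(disease_mim):
        ## OMIM/orphanet ID present: use its mapping, no MONDO fallback on failure
        return omor_mapping.get(disease_mim)
    if not _is_missing(disease_mondo):
        ## MONDO-only row
        return mondo_mapping.get(disease_mondo)
    ## no disease IDs at all
    return None


omor_sample = {
    "OMIM:300696": {"primary_id": "MONDO:0010401",
                    "primary_label": "X-linked myopathy with postural muscle atrophy"},
}
mondo_sample = {
    "MONDO:0010680": {"primary_id": "MONDO:0010680",
                      "primary_label": "X-linked Emery-Dreifuss muscular dystrophy"},
}

both_ids = pick_disease_mapping("OMIM:300696", "MONDO:0010680", omor_sample, mondo_sample)
mondo_only_row = pick_disease_mapping(None, "MONDO:0010680", omor_sample, mondo_sample)
## OMIM:133701 is one of the IDs NodeNorm returned None for
failed_omim = pick_disease_mapping("OMIM:133701", "MONDO:0007586", omor_sample, mondo_sample)
```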

In [60]:
## get the data that has MONDO, doesn't have OMIM/orphanet

df_mondo_only = df[df["disease_mim"].isna() & df["disease_MONDO"].notna()].copy()

mondo_only = df_mondo_only["disease_MONDO"].dropna().unique()

In [61]:
## saving stats on data with only MONDO ID

stats_mondo_only = {
    "n_rows": df_mondo_only.shape[0],
    ## note: this counts unique MONDO IDs, not unique disease names
    "n_names": len(mondo_only)
}

stats_mondo_only["n_rows"]
stats_mondo_only["n_names"]

295

181

In [62]:
## code chunk used to review some of the data

# df_mondo_only[df_mondo_only["disease_MONDO"] == mondo_only[240]]

df_mondo_only[df_mondo_only["panel"].str.contains("Skeletal", na=False)]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url
129,G2P00284,ALPL,OMIM:171760,HGNC:438,HOPS; TNALP; TNAP; TNSALP,ALPL-related hypophosphatasia,,MONDO:0018570,biallelic_autosomal,,definitive,altered gene product structure,missense_variant; inframe_deletion; inframe_in...,undetermined,inferred,,HP:0002979; HP:0001945; HP:0002659; HP:0002150...,3174660,Developmental disorders; Skeletal disorders,,2025-01-29 08:10:33+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00284
387,G2P02554,PRRX1,OMIM:167420,HGNC:9142,PHOX1; PMX1,PRRX1-related craniosynostosis,,MONDO:0015469,monoallelic_autosomal,,moderate,decreased gene product level; altered gene pro...,missense_variant; stop_gained; frameshift_variant,undetermined,inferred,,HP:0001363,37154149,Developmental disorders; Skeletal disorders,,2024-03-22 10:30:40+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P02554
406,G2P03725,MIR140,OMIM:611894,HGNC:31527,MIRN140,"MIR140-related spondyloepiphyseal dysplasia, N...",,MONDO:0032835,monoallelic_autosomal,,moderate,altered gene product structure,ncRNA,gain of function,evidence,30804514 -> function: protein expression; func...,HP:0034281; HP:0001156; HP:0003498; HP:0011800...,30804514,Developmental disorders; Skeletal disorders,,2025-05-07 10:53:46+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03725


<div class="alert alert-block alert-info">    

**2025-02-28 data:** 

__Reviewed some of the data__

Method: reviewed individual MONDO IDs, covering all panels (only 1 skeletal, no hearing). The sample was 3 IDs from the earlier review (related to the mismatches), plus indices 0-240 in steps of 10, plus skeletal. 
    
__Summary__
* 37 rows (29 unique MONDO)
* __~16%__ were wrong (6/37) 
* Could tell EBI gene2pheno of issues but they are similar to those listed in mismatch mapping section
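
The index-based part of the sample and the error rate can be reproduced with a few lines. This is a sketch with a stand-in list, since the 2025-02-28 version of `mondo_only` is not shown here; `systematic_sample` is a hypothetical name.

```python
def systematic_sample(items, start, stop, step):
    """Return items at indices start, start+step, ... up to and including stop."""
    return [items[i] for i in range(start, stop + 1, step) if i < len(items)]


## stand-in IDs; the real input would be the mondo_only array
stand_in_ids = [f"ID_{i}" for i in range(250)]
sample = systematic_sample(stand_in_ids, 0, 240, 10)

## error rate from the summary: 6 wrong out of 37 reviewed rows
error_rate = round(6 / 37 * 100, 1)
```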

__Details__

__6 MONDO is related but wrong__
* "MONDO:0009136" for "RTEL1-related dyskeratosis congenita" (two rows): mondo is recessive 1, which is wrong. Should be recessive 5 MONDO:0014076/OMIM:615190 (old/synonym name is dominant 4) 
* "MONDO:0044314" for 4 rows "CLN3-related retinal dystrophy", "GUCA1B-", "PRPS1-", "SNRNP200-": mondo is type 78 (specifically for ARHGEF18), which is wrong. Should instead be:
  * CLN3 and PRPS1: a more general term like MONDO:0004580 (retinal degeneration) -> MONDO:0019118 (inherited retinal dystrophy) -> MONDO:0019200 (retinitis pigmentosa)
  * GUCA1B: type 48, MONDO:0013447
  * SNRNP200: type 33, MONDO:0012477
* "MONDO:0013522" for "TERC-related dyskeratosis congenita": mondo is for type 3 (specifically for TINF2, that row is in "Great" section). Should be type 1 MONDO:0007485/OMIM:127550. (confusing because Monarch's page of type 1 includes TINF2 and TERT too, but OMIM page only includes TERC)


__4 MONDO is too general__ 
* "MONDO:0020341" (periventricular nodular heterotopia) for "ERMARD-related periventricular heterotopia". The ERMARD-specific version is a child term: MONDO:0014240/OMIM:615544 (type 6)
* "MONDO:0018965" (Alport syndrome) for "COL4A5-related Alport syndrome". The COL4A5-specific version is a child term: MONDO:0010520/OMIM:301050  (X-linked)
* "MONDO:0024676" for "REST-related Wilms tumour": The REST-specific version is a **related** term: MONDO:0014779/OMIM:616806 (type 6)
* "MONDO:0011773" for "POP1-related anauxetic dysplasia": the POP1-specific version is a child term: MONDO:0054561/OMIM:617396 (type 2)


__4 Unsure -> TELL EBI GENE2PHENO?__
* "MONDO:0005129" for "CYP51A1-related congenital cataract": mondo is cataract, which is not wrong but kinda general. MONDO:0033853 seems better (correlated with gene, matches phenotypes, orphanet ref uses one of the ref papers) 
* "MONDO:0018869" for "TMTC3-related cobblestone lissencephaly": while the mondo (cobblestone lissencephaly) sounds correct, it isn't linked to this gene. VS another sibling disease is linked to the gene, matches phenotypes, uses same paper: MONDO:0014992/OMIM:617255 (lissencephaly 8)
* "MONDO:0100100" for "SELENON-related myopathy": while mondo has exact name match, it's not directly linked to gene. Instead, its child disease is directly linked to gene MONDO:0011271/OMIM:602771 (rigid spine muscular dystrophy 1)
* "MONDO:0020367" for "MYOC-related juvenile open angle glaucoma": while mondo is almost-exact name match, it's not directly linked to gene. Instead, its child disease is directly linked to gene MONDO:0007664/OMIM:137750 (glaucoma 1, open angle, A) 


__5 Okay (using general term is fine)__
* "MONDO:0005129" for 3 other rows "WDR87-related congenital cataract", "AKR1E2-", "MFSD6L-": couldn't find better mappings. 
* "MONDO:0015469" for "DHRS3 related craniosynostosis": couldn't find better mapping
* "MONDO:0024676" (childhood kidney Wilms tumor) for "CTR9-related Wilms tumour", "TRIM28-": couldn't find better mapping. TRIM28 is correlated to parent term (kidney Wilms tumor). 


__18 Great__
* "MONDO:0012506" for "DSC2-related arrhythmogenic right ventricular cardiomyopathy"
* "MONDO:0011001" for "SCN5A-related Brugada syndrome"
* "MONDO:0013262" for "MYH7-related dilated cardiomyopathy"
* "MONDO:0013369" for "TNNI3-related hypertrophic cardiomyopathy"
* "MONDO:0010946" for "PRKAG2-related cardiomyopathy"
* "MONDO:0014143" for "RIT1-related Noonan syndrome"
* "MONDO:0010015" for "PXDN-related anterior segment dysgenesis with sclerocornea"
* "MONDO:0014214" for "DYNC2I1-related short-rib polydactyly"
* "MONDO:0013522" for "TINF2-related dyskeratosis congenita"
* "MONDO:0032876" for "WASF1-related intellectual disability with seizures"
* "MONDO:0859164" for "UNC45A-related osteootohepatoenteric syndrome"
* "MONDO:0018772" for "SLC30A7-related Joubert syndrome": using general term is fine since there isn't any established subtype of Joubert syndrome for this gene
* "MONDO:0010215" for "ERCC4-related xeroderma pigmentosum, group F"
* "MONDO:0009735" for "SPINK5-related Netherton syndrome"
* "MONDO:0007808" for "KRT1-related ichthyosis hystrix, Curth-Macklin type"
* "MONDO:0007566" for "TGFBR1-related multiple self-healing squamous epithelioma"
* "MONDO:0008285" for "PDGFRA-related gastrointestinal stromal tumor/GIST-plus syndrome, somatic or familial"
* "MONDO:0010912" for "TUBB3-related fibrosis of extraocular muscles, congenital"

#### Conclusions

<div class="alert alert-block alert-success">

**2025-02-28 data:** 
    
__Exploration__

* some rows have no disease IDs
* a few NodeNorm mapping failures for OMIM IDs (several different kinds): ~2.8% = 68 failures / (2401 unique values in column - 9 orphanet)
  * no NodeNorm mapping failures for MONDO IDs
* when rows have both OMIM and MONDO IDs, there are sometimes differences in NodeNorm mapping ("mismatches"). __In these cases, OMIM IDs were much more accurate__
* __MONDO IDs can be inaccurate__ - see the blue review boxes
  * In contrast, it was much rarer to find an inaccurate OMIM ID mapping (found 1 case)


__Decision: Use OMIM ID column to generate NodeNorm values__

* fewer missing values
* seems to be more accurate (for successful NodeNorm mappings)
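
The ~2.8% OMIM failure rate quoted in the exploration notes can be reproduced in a few lines. This is only a worked check: the counts are the ones stated in the notes, and the variable names are illustrative, not variables from this notebook.

```python
# Counts copied from the exploration notes; names are illustrative only.
n_failures = 68
n_unique_values = 2401  # unique values in the disease_mim column
n_orphanet = 9          # orphanet IDs, excluded from the OMIM denominator

failure_rate = n_failures / (n_unique_values - n_orphanet)
print(f"{failure_rate:.1%}")
```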

## Stats on rows removed during NodeNorming

This section prints the statistics on rows in the original data that were removed. 

(Uses variables generated during the previous section "Pre-NodeNorming")

<div class="alert alert-block alert-success">

**2025-03-28 data:** 

Genes: No rows removed due to lack of IDs for NodeNorming or NodeNorm mapping issues.

In [63]:
## partial put into parser (format): DONE

print("Gene Pre-NodeNorming\n")

## no gene IDs
print(f'{stats_no_gene_IDs["n_rows"]} row(s) with no gene IDs')

## no HGNC IDs: key column for NodeNorming
print(f'{n_rows_no_hgnc} row(s) with no HGNC IDs')

## HGNC NodeNorm issues: none, but showing anyway
print("\n")
print("HGNC NodeNorm mapping failures:")

print(f'IDs with no data in NodeNorm: {len(stats_hgnc_mapping_failures["nodenorm_returned_none"])}')
print(f'IDs with the wrong NodeNormed category: {len(stats_hgnc_mapping_failures["wrong_category"])}')
print(f'IDs with no label in NodeNorm: {len(stats_hgnc_mapping_failures["no_label"])}')

Gene Pre-NodeNorming

0 row(s) with no gene IDs
0 row(s) with no HGNC IDs


HGNC NodeNorm mapping failures:
IDs with no data in NodeNorm: 0
IDs with the wrong NodeNormed category: 0
IDs with no label in NodeNorm: 0


<div class="alert alert-block alert-success">

**2025-03-28 data:** 
    
__Diseases: many rows removed__ due to lack of IDs for NodeNorming or NodeNorm mapping issues.

In [64]:
stats_OmOr_mapping_failures.keys()

dict_keys(['unexpected_error', 'nodenorm_returned_none', 'wrong_category', 'no_label', 'n_rows_none', 'n_rows_wrong_category', 'n_rows_no_label'])

In [65]:
## partial put into parser (format): DONE

print("Disease Pre-NodeNorming\n")

## no disease IDs
print(f'{stats_no_disease_IDs["n_rows"]} row(s) with no disease IDs '
      f'(= {stats_no_disease_IDs["n_names"]} unique diseases)')

## plus the rows that only lack OMIM IDs: key column for NodeNorming
print(f'+ {stats_mondo_only["n_rows"]} row(s) with no OMIM ID '
      f'(= {stats_mondo_only["n_names"]} unique diseases)')

## OMIM NodeNorm issues
print("\n")
print("OMIM NodeNorm mapping failures:")

print(f'{stats_OmOr_mapping_failures["n_rows_none"]} row(s) for '
      f'{len(stats_OmOr_mapping_failures["nodenorm_returned_none"])} '
      f'IDs with no data in NodeNorm')

print(f'{stats_OmOr_mapping_failures["n_rows_wrong_category"]} row(s) for '
      f'{len(stats_OmOr_mapping_failures["wrong_category"])} '
      f'IDs with the wrong NodeNormed category')

print(f'{stats_OmOr_mapping_failures["n_rows_no_label"]} row(s) for '
      f'{len(stats_OmOr_mapping_failures["no_label"])} '
      f'IDs with no label in NodeNorm')

Disease Pre-NodeNorming

248 row(s) with no disease IDs (= 243 unique diseases)
+ 295 row(s) with no OMIM ID (= 181 unique diseases)


OMIM NodeNorm mapping failures:
4 row(s) for 4 IDs with no data in NodeNorm
3 row(s) for 3 IDs with the wrong NodeNormed category
1 row(s) for 1 IDs with no label in NodeNorm


<div class="alert alert-block alert-success">
    
__Totals__

In [66]:
## put into parser (format): DONE

n_rows_before_nodenorm = df.shape[0]
n_rows_nodenorm_removed = (
    stats_no_disease_IDs["n_rows"]
    + stats_mondo_only["n_rows"]
    + stats_OmOr_mapping_failures["n_rows_none"]
    + stats_OmOr_mapping_failures["n_rows_wrong_category"]
    + stats_OmOr_mapping_failures["n_rows_no_label"]
)
n_rows_after_nodenorm = n_rows_before_nodenorm - n_rows_nodenorm_removed

print(f"{n_rows_before_nodenorm} rows/records before Pre-NodeNorming\n")

print(f"{n_rows_nodenorm_removed} rows removed during Disease NodeNorming process\n")

print(f"{n_rows_after_nodenorm} rows/records left ({n_rows_after_nodenorm/n_rows_before_nodenorm:.1%})")

3181 rows/records before Pre-NodeNorming

551 rows removed during Disease NodeNorming process

2630 rows/records left (82.7%)
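
As a cross-check, the totals can be recomputed from the per-category counts in the "Disease Pre-NodeNorming" output. This is a minimal sketch with the counts hard-coded from that output; the dictionary and its key names are made up for illustration and are not the notebook's variables.

```python
# Row counts copied from the printed "Disease Pre-NodeNorming" output.
# Key names are hypothetical labels, not the notebook's stats keys.
removed_by_reason = {
    "no_disease_ids": 248,
    "no_omim_id": 295,
    "nodenorm_returned_none": 4,
    "wrong_category": 3,
    "no_label": 1,
}

n_rows_before = 3181
n_rows_removed = sum(removed_by_reason.values())
n_rows_after = n_rows_before - n_rows_removed

print(n_rows_removed, n_rows_after, f"{n_rows_after / n_rows_before:.1%}")
```

The subtraction is only valid if the categories do not overlap: a row counted under two reasons would be subtracted twice. The later comparison against `df_only_nodenormed.shape` is what actually confirms this for the real data.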


## Adding NodeNorm data, removing rows

Using gene HGNC and disease OMIM/orphanet IDs for pre-NodeNorming

In [67]:
## put into parser (format): DONE

## Gene: assumes no missing values
df["gene_nodenorm_id"] = [hgnc_nodenorm_mapping[i]["primary_id"] for i in df["hgnc_id"]]
df["gene_nodenorm_label"] = [hgnc_nodenorm_mapping[i]["primary_label"] for i in df["hgnc_id"]]

## Disease: IDs without an entry in OmOr_nodenorm_mapping (including missing values) get NA
df["disease_nodenorm_id"] = [OmOr_nodenorm_mapping[i]["primary_id"] 
                             if OmOr_nodenorm_mapping.get(i) 
                             else pd.NA
                             for i in df["disease_mim"]]

df["disease_nodenorm_label"] = [OmOr_nodenorm_mapping[i]["primary_label"] 
                                if OmOr_nodenorm_mapping.get(i) 
                                else pd.NA
                                for i in df["disease_mim"]]
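
The guarded lookup used for the disease columns can be factored into one small helper. The sketch below uses plain Python rather than pandas and returns `None` instead of `pd.NA`. The helper name `lookup_field` and the toy mapping are hypothetical; the mapping shape (`{id: {"primary_id": ..., "primary_label": ...}}`) is assumed from how `OmOr_nodenorm_mapping` is indexed in the cell above.

```python
def lookup_field(mapping, source_id, field):
    """Return mapping[source_id][field], or None when the ID has no entry."""
    entry = mapping.get(source_id)
    return entry[field] if entry else None


# Toy mapping; the single entry is copied from the first row of df.head().
toy_mapping = {
    "OMIM:612347": {
        "primary_id": "MONDO:0012871",
        "primary_label": "Jervell and Lange-Nielsen syndrome 2",
    }
}

# None stands in for a missing disease_mim value; the last ID is made up.
disease_ids = ["OMIM:612347", None, "OMIM:000000"]
nodenorm_ids = [lookup_field(toy_mapping, i, "primary_id") for i in disease_ids]
print(nodenorm_ids)
```

Because missing and unmapped IDs both end up as empty values, a later `dropna` on the NodeNorm columns removes both kinds of rows in one step.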

In [68]:
df.head()

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url,gene_nodenorm_id,gene_nodenorm_label,disease_nodenorm_id,disease_nodenorm_label
0,G2P00124,KCNE1,OMIM:176261,HGNC:6240,ISK; JLNS2; LQT5; MINK,KCNE1-related Jervell and Lange-Nielsen syndrome,OMIM:612347,MONDO:0012871,biallelic_autosomal,potential secondary finding,strong,altered gene product structure,missense_variant; inframe_insertion; inframe_d...,undetermined,inferred,,HP:0001657; HP:0001279; HP:0000007; HP:0000407,30461122,Developmental disorders; Cardiac disorders,KCNE1-related JLNS is due to altered gene prod...,2024-04-05 12:05:01+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00124,NCBIGene:3753,KCNE1,MONDO:0012871,Jervell and Lange-Nielsen syndrome 2
1,G2P00841,PTPN11,OMIM:176876,HGNC:9644,BPTP3; NS1; PTP2C; SH-PTP2; SHP-2; SHP2,PTPN11-related Noonan syndrome with multiple l...,OMIM:151100,,monoallelic_autosomal,,definitive,altered gene product structure,missense_variant; inframe_deletion; inframe_in...,undetermined,inferred,,HP:0001709; HP:0000957; HP:0004409; HP:0001639...,27484170; 26377839; 25917897; 25884655; 248207...,Developmental disorders; Skin disorders; Cardi...,Expert review done on 12/01/2022; Noonan syndr...,2025-01-21 14:56:43+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P00841,NCBIGene:5781,PTPN11,MONDO:0100082,LEOPARD syndrome 1
2,G2P03247,DSC2,OMIM:125645,HGNC:3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; NMD_triggerin...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac disorders,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:36:09+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03247,NCBIGene:1824,DSC2,,
3,G2P03248,DSC2,OMIM:125645,HGNC:3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; NMD_triggerin...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac disorders,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:35:19+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03248,NCBIGene:1824,DSC2,,
4,G2P03249,DSG2,OMIM:125671,HGNC:3049,CDHF5,DSG2-related arrhythmogenic right ventricular ...,,MONDO:0012434,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,stop_gained; frameshift_variant; splice_accept...,undetermined,inferred,,,21636032; 33831308; 33917638; 34400560; 240707...,Cardiac disorders,Expert review done on 05/01/2022; DSG2-related...,2024-03-20 09:40:18+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03249,NCBIGene:1829,DSG2,,


In [69]:
## put into parser (change in-place): DONE

df_only_nodenormed = df.dropna(subset=["gene_nodenorm_id", "gene_nodenorm_label", 
                                       "disease_nodenorm_id", "disease_nodenorm_label"],
                              ignore_index=True).copy()

In [70]:
## same! so it works as expected

df_only_nodenormed.shape

n_rows_after_nodenorm

(2630, 26)

2630

In [71]:
df_only_nodenormed.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2630 entries, 0 to 2629
Data columns (total 26 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   g2p_id                              2630 non-null   object             
 1   gene_symbol                         2630 non-null   object             
 2   gene_mim                            2630 non-null   object             
 3   hgnc_id                             2630 non-null   object             
 4   previous_gene_symbols               2430 non-null   object             
 5   disease_name                        2630 non-null   object             
 6   disease_mim                         2630 non-null   object             
 7   disease_MONDO                       1581 non-null   object             
 8   allelic_requirement                 2630 non-null   object             
 9   cross_cutting_modifier              287 n

In [None]:
## code block to review NodeNorm "duplicate edges" - from DISEASES

# df_textmining_nodenorm_dups = df_textmining[
#     df_textmining.duplicated(
#         subset=["gene_nodenorm_id", "disease_nodenorm_id"], 
#         keep=False)].copy()

# df_textmining_nodenorm_dups.sort_values(by=["disease_nodenorm_id", "gene_nodenorm_id"],
#                                         inplace=True)

# df_textmining_nodenorm_dups.shape

# df_textmining_nodenorm_dups

## Generating documents

### Rows not included

<div class="alert alert-block alert-info">

See section "Stats on rows removed during NodeNorming"
* No IDs in disease_mim column 
* NodeNorm mapping failures for disease_mim column IDs
* confidence column value == ("refuted" OR "disputed" OR "limited")

### Columns not included

<div class="alert alert-block alert-info">

See data-playground for details

<br>

Seem **easier** to get into Translator, potentially useful: 

- disease_MONDO
- **confidence**: 
   - there's biolink association-slot *has confidence level*. But there's also a biolink entity *confidence level* that's supposed to have values from CIO. 
   - Are G2P's terms okay? Or are they supposed to be mapped to ontology terms like CIO/SEPIO? (That may lose information compared to G2P's term definitions.)
- **allelic_requirement**: biolink-model PR to create edge property for this. The values will likely need to be converted into HPO "mode of inheritance terms" (see data-playground notebook for mapping table)

<br>

Seem harder to get into Translator, potentially useful: 
- **molecular_mechanism_categorisation**: "qualifies" the molecular_mechanism (seems to say how molecular mechanism was decided: "inferred" or "evidence").
  - tricky since it's like "how knowledge was obtained" for a specific part of edge (I'm using molecular_mechanism to adjust the subject qualifier) 
- **cross_cutting_modifier**: additional info on inheritance. Limited set of terms BUT "; "-delimited. Some terms may map to "HPO inheritance qualifier terms" (didn't try). Lots of missing data (NA). 
  - would be a new edge/node property or qualifier. But complicated because EBI gene2pheno has custom terms, not just from HPO inheritance qualifiers. 
- **variant_consequence**: row can have multiple values ("; "-delimited). Limited set of terms already mapped to SO.
  - seems like aspect qualifier, but this can be a list for a gene-disease edge, and I'm not sure how to handle this (not that comfortable splitting into multiple edges)
- **variant_types**: row can have multiple values ("; "-delimited). Medium set of terms already mapped to SO. Lots of missing data (NA)
  - would be a new edge/node property or qualifier (somewhat modeled as predicates, for variant-gene relationships).
- **molecular_mechanism_evidence**: treat as free text? very complicated string 
- **comments**: treat as free text
    
<br>

Can ignore: 
- gene_mim
- gene_symbol
- previous_gene_symbols
- disease_name
- phenotypes: "reported by the publication". Unclear whether they fit in the gene-disease association or a different edge (gene-phenotype, phenotype-disease)
- panel: pretty specific, original resource's way of organizing data
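
Several of the columns listed above (cross_cutting_modifier, variant_consequence, variant_types) are "; "-delimited and often empty. If they were ever added to the documents, they would probably need to be split into lists first. A minimal sketch, assuming missing values arrive as `None` or float NaN (a `pd.NA` would need `pd.isna` instead); the function name is hypothetical:

```python
def split_multi_value(value, sep=";"):
    """Split a delimited cell into a list of stripped terms; empty for missing values."""
    # `value != value` is only True for NaN, so this covers None and float NaN.
    if value is None or value != value:
        return []
    return [part.strip() for part in str(value).split(sep) if part.strip()]


# Shortened from a variant_types value shown in df.head().
print(split_multi_value("missense_variant; inframe_insertion"))
print(split_multi_value(float("nan")))
```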

In [72]:
## code chunk to review data

df["molecular_mechanism"].value_counts()

molecular_mechanism
loss of function                     2015
undetermined                          872
gain of function                      172
dominant negative                     119
undetermined non-loss-of-function       3
Name: count, dtype: int64

In [73]:
## code chunk to review data
## checking date of last review

df[df["g2p_id"] == "G2P03538"]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,g2p_record_url,gene_nodenorm_id,gene_nodenorm_label,disease_nodenorm_id,disease_nodenorm_label
2825,G2P03538,NPAT,OMIM:601448,HGNC:7896,E14; P220,NPAT-related cancer,,,monoallelic_autosomal,,moderate,decreased gene product level,stop_gained,loss of function,inferred,,,38778081,Cancer disorders,,2025-03-14 12:04:00+00:00,https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03538,NCBIGene:4863,NPAT,,


In [74]:
## code chunk to review data

df["molecular_mechanism_evidence"].value_counts()[0:5]

## df.info()

molecular_mechanism_evidence
39480921 -> function: protein expression; models: non-human model organism                                                           2
34965576 -> models: non-human model organism                                                                                         2
37126546 -> models: non-human model organism; rescue: non-human model organism                                                       2
10611230 -> function: biochemical, protein expression; functional_alteration: non patient cells; models: non-human model organism    1
34040189 -> models: non-human model organism; rescue: non-human model organism                                                       1
Name: count, dtype: int64
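
For reference, the single-publication strings shown above seem to follow a `<pmid> -> <category>: <value>, <value>; <category>: <value>` pattern. The sketch below parses only that pattern. The delimiters are inferred from the five examples in this output, strings covering several publications are not handled (their format is not visible here), and the function name is hypothetical.

```python
def parse_mechanism_evidence(text):
    """Parse one '<pmid> -> cat: v1, v2; cat: v' string into a dict."""
    pmid, arrow, rest = text.partition(" -> ")
    if not arrow:
        raise ValueError(f"No ' -> ' separator found in: {text!r}")
    evidence = {}
    for part in rest.split(";"):
        category, _, values = part.partition(":")
        evidence[category.strip()] = [v.strip() for v in values.split(",")]
    return {"publication": "PMID:" + pmid.strip(), "evidence": evidence}


# One of the values from the output above.
example = (
    "10611230 -> function: biochemical, protein expression; "
    "functional_alteration: non patient cells; models: non-human model organism"
)
parsed = parse_mechanism_evidence(example)
print(parsed)
```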

In [76]:
## want jsonlines format

import jsonlines

### BioThings-type parser 

Includes `_id` set to g2p_id value

In [None]:
## code chunk for testing parts of inner code

for row in df.itertuples(index=False):
    document = {
        "_id": row.g2p_id,
        "subject": row.gene_nodenorm_id,
        "sources": [
            {
                "resource_id": "infores:ebi-gene2phenotype",
                "resource_role": "primary_knowledge_source",
                "source_record_urls": [row.g2p_record_url]
            }
        ],
        "attributes": [
            {
                "attribute_type_id": "biolink:original_subject",
                "value": row.hgnc_id
            }
        ]
    }
    if pd.notna(row.publications):
        document["attributes"].append(
            {
                "attribute_type_id": "biolink:publications",
                "value": ["PMID:" + i.strip() for i in row.publications.split(";")]
            }
        )
    ## a bare expression inside a for-loop isn't displayed by the notebook, so print it
    print(document)
    break

In [None]:
## put into parser (format): DONE
##   don't save in array, yield each document instead

## GENERATING DOCS, saving in array
documents = []

## using itertuples because it's faster, preserves datatypes
for row in df_only_nodenormed.itertuples(index=False):
    ## simple assignments: no NA or "if"
    document = {
        "_id": row.g2p_id,
        "subject": row.gene_nodenorm_id,
        "qualifiers": [  ## needs data-modeling/TRAPI validation review
            {
                "qualifier_type_id": "biolink:subject_form_or_variant_qualifier",
                "qualifier_value": "genetic_variant_form"
            }
        ],
        "object": row.disease_nodenorm_id,
        "sources": [
            {
                "resource_id": "infores:ebi-gene2phenotype",
                "resource_role": "primary_knowledge_source",
                "source_record_urls": [row.g2p_record_url]
            }
        ],
        "attributes": [
            {
                "attribute_type_id": "biolink:knowledge_level",
                "value": "knowledge_assertion"
            },
            {
                "attribute_type_id": "biolink:agent_type",
                "value": "manual_agent"
            },
            {
                "attribute_type_id": "biolink:original_subject",
                "original_attribute_name": "hgnc id",  ## original column name
                "value": row.hgnc_id
            },
            {   ## currently, after NodeNorming, no NAs in OMIM/orphanet column
                "attribute_type_id": "biolink:original_object",
                "original_attribute_name": "disease mim",  ## original column name
                "value": row.disease_mim
            },
            {   ## needs data-modeling/TRAPI validation review
                ## EBI gene2pheno website calls this "Last Updated"/"Last Updated On"
                "attribute_type_id": "biolink:update_date",
                "original_attribute_name": "date of last review",  ## original column name
                "value": str(row.date_of_last_review)
            },
        ]
    }
    
    ## more complex assignments ("if", handling NA). When value is NA, list comprehension with split won't work
    ## predicate
    if row.confidence == "limited":
        document["predicate"] = "biolink:related_to"
    elif row.confidence in ["moderate", "strong", "definitive"]:
        document["predicate"] = "biolink:causes"
    else:
        raise ValueError(f"Unexpected confidence value during predicate mapping: {row.confidence}. Adjust parser.")
    ## publications
    if pd.notna(row.publications):
        document["attributes"].append(
            {
                "attribute_type_id": "biolink:publications",
                "value": ["PMID:" + i.strip() for i in row.publications.split(";")]
            }
        )
    
    documents.append(document)
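
The publications handling above (skip missing values, split on ";", add the `PMID:` prefix) can be isolated in a small helper, which makes it easier to test on its own. This is a sketch in plain Python: the function name is hypothetical, and unlike the cell above it also drops empty fragments (for example from a trailing ";").

```python
def to_pmid_curies(publications):
    """Turn '27484170; 26377839' into ['PMID:27484170', 'PMID:26377839']."""
    # Anything that isn't a string (None, NaN, pd.NA) is treated as missing.
    if not isinstance(publications, str):
        return []
    return ["PMID:" + p.strip() for p in publications.split(";") if p.strip()]


print(to_pmid_curies("27484170; 26377839"))
print(to_pmid_curies(float("nan")))
```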

### File: List of TRAPI edges

This code isn't in parser.py

Doesn't have `_id`! Doesn't include original_attribute_name for original_subject, original_object, update_date.

In [73]:
df_only_nodenormed.columns

Index(['g2p_id', 'gene_symbol', 'gene_mim', 'hgnc_id', 'previous_gene_symbols',
       'disease_name', 'disease_mim', 'disease_MONDO', 'allelic_requirement',
       'cross_cutting_modifier', 'confidence', 'variant_consequence',
       'variant_types', 'molecular_mechanism',
       'molecular_mechanism_categorisation', 'molecular_mechanism_evidence',
       'phenotypes', 'publications', 'panel', 'comments',
       'date_of_last_review', 'g2p_record_url', 'gene_nodenorm_id',
       'gene_nodenorm_label', 'disease_nodenorm_id', 'disease_nodenorm_label'],
      dtype='object')

In [77]:
## wrapped with file writer, otherwise contents very similar to before
## commented out original_attribute_name

with jsonlines.open('EBIgene2pheno_trapi_edges.jsonl', mode='w', compact=True) as trapi_writer:

    ## using itertuples because it's faster, preserves datatypes
    for row in df_only_nodenormed.itertuples(index=False):
        
        ## simple assignments: no NA or "if"
        document = {
            "subject": row.gene_nodenorm_id,
            "predicate": "biolink:causes",
            "object": row.disease_nodenorm_id,
            "sources": [
                {
                    "resource_id": "infores:ebi-gene2phenotype",
                    "resource_role": "primary_knowledge_source",
                    "source_record_urls": [row.g2p_record_url]
                }
            ],
            "attributes": [
                {
                    "attribute_type_id": "biolink:knowledge_level",
                    "value": "knowledge_assertion"
                },
                {
                    "attribute_type_id": "biolink:agent_type",
                    "value": "manual_agent"
                },
                {
                    "attribute_type_id": "biolink:original_subject",
#                     "original_attribute_name": "hgnc id",  ## original column name
                    "value": row.hgnc_id
                },
                {   ## currently, after NodeNorming, no NAs in OMIM/orphanet column
                    "attribute_type_id": "biolink:original_object",
#                     "original_attribute_name": "disease mim",  ## original column name
                    "value": row.disease_mim
                },
                {   ## needs data-modeling/TRAPI validation review
                    ## EBI gene2pheno website calls this "Last Updated"/"Last Updated On"
                    "attribute_type_id": "biolink:update_date",
#                     "original_attribute_name": "date of last review",  ## original column name
                    "value": str(row.date_of_last_review)
                },
            ]
        }

        ## more complex assignments ("if", handling NA). When value is NA, list comprehension with split won't work
        ## publications
        if pd.notna(row.publications):
            document["attributes"].append(
                {
                    "attribute_type_id": "biolink:publications",
                    "value": ["PMID:" + i.strip() for i in row.publications.split(";")]
                }
            )
            
        ## qualifier
        if row.molecular_mechanism == "loss of function":
            document["qualifiers"] = [
                {
                    "qualifier_type_id": "biolink:subject_form_or_variant_qualifier",
                    "qualifier_value": "loss_of_function_variant_form"
                }
            ]
        elif row.molecular_mechanism == "undetermined":
            document["qualifiers"] = [
                {
                    "qualifier_type_id": "biolink:subject_form_or_variant_qualifier",
                    "qualifier_value": "genetic_variant_form"
                }
            ]
        elif row.molecular_mechanism == "gain of function":
            document["qualifiers"] = [
                {
                    "qualifier_type_id": "biolink:subject_form_or_variant_qualifier",
                    "qualifier_value": "gain_of_function_variant_form"
                }
            ]
        elif row.molecular_mechanism == "dominant negative":
            document["qualifiers"] = [
                {
                    "qualifier_type_id": "biolink:subject_form_or_variant_qualifier",
                    "qualifier_value": "dominant_negative_variant_form"
                }
            ]
        elif row.molecular_mechanism == "undetermined non-loss-of-function":
            document["qualifiers"] = [
                {
                    "qualifier_type_id": "biolink:subject_form_or_variant_qualifier",
                    "qualifier_value": "non_loss_of_function_variant_form"
                }
            ]
        else:
            raise ValueError(f"Unexpected molecular_mechanism value during qualifier mapping: {row.molecular_mechanism}. Adjust parser.")

        ## assign the return value so it isn't displayed (named `_` to avoid shadowing the built-in `bytes`)
        _ = trapi_writer.write(document)
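
The if/elif chain for the qualifier maps one fixed set of strings onto another, so it can also be written as a lookup table. A sketch of that alternative, using exactly the value pairs from the cell above; the names `MECHANISM_TO_QUALIFIER` and `mechanism_qualifiers` are hypothetical:

```python
# molecular_mechanism value -> subject_form_or_variant_qualifier value
MECHANISM_TO_QUALIFIER = {
    "loss of function": "loss_of_function_variant_form",
    "undetermined": "genetic_variant_form",
    "gain of function": "gain_of_function_variant_form",
    "dominant negative": "dominant_negative_variant_form",
    "undetermined non-loss-of-function": "non_loss_of_function_variant_form",
}


def mechanism_qualifiers(molecular_mechanism):
    """Build the TRAPI-style qualifiers list, failing loudly on unknown values."""
    try:
        value = MECHANISM_TO_QUALIFIER[molecular_mechanism]
    except KeyError:
        raise ValueError(
            "Unexpected molecular_mechanism value during qualifier mapping: "
            f"{molecular_mechanism}. Adjust parser."
        ) from None
    return [
        {
            "qualifier_type_id": "biolink:subject_form_or_variant_qualifier",
            "qualifier_value": value,
        }
    ]


print(mechanism_qualifiers("gain of function"))
```

With a table, adding a new G2P term is a one-line change, and the same dictionary could be reused for the flat KGX field.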

### File: KGX edges

This code isn't in parser.py

Doesn't include original_attribute_name for original_subject, original_object, update_date

In [78]:

with jsonlines.open('EBIgene2pheno_kgx_edges.jsonl', mode='w', compact=True) as kgx_edges_writer:

    ## using itertuples because it's faster, preserves datatypes
    for row in df_only_nodenormed.itertuples(index=False):
        
        ## simple assignments: no NA or "if"
        document = {
            "subject": row.gene_nodenorm_id,
            "predicate": "biolink:causes",
            "object": row.disease_nodenorm_id,
            "sources": [
                {
                    "resource_id": "infores:ebi-gene2phenotype",
                    "resource_role": "primary_knowledge_source",
                    "source_record_urls": [row.g2p_record_url]
                }
            ],
            "knowledge_level": "knowledge_assertion",
            "agent_type": "manual_agent",
            "original_subject": row.hgnc_id,
            ## currently, after NodeNorming, no NAs in OMIM/orphanet column
            "original_object": row.disease_mim,
            ## needs data-modeling/TRAPI validation review
            ## EBI gene2pheno website calls this "Last Updated"/"Last Updated On"
            "update_date": str(row.date_of_last_review),
        }

        ## more complex assignments ("if", handling NA). When value is NA, list comprehension with split won't work
        ## publications
        if pd.notna(row.publications):
            document["publications"] = ["PMID:" + i.strip() for i in row.publications.split(";")]

        ## qualifier
        if row.molecular_mechanism == "loss of function":
            document["subject_form_or_variant_qualifier"] = "loss_of_function_variant_form"
        elif row.molecular_mechanism == "undetermined":
            document["subject_form_or_variant_qualifier"] = "genetic_variant_form"
        elif row.molecular_mechanism == "gain of function":
            document["subject_form_or_variant_qualifier"] = "gain_of_function_variant_form"
        elif row.molecular_mechanism == "dominant negative":
            document["subject_form_or_variant_qualifier"] = "dominant_negative_variant_form"
        elif row.molecular_mechanism == "undetermined non-loss-of-function":
            document["subject_form_or_variant_qualifier"] = "non_loss_of_function_variant_form"
        else:
            raise ValueError(f"Unexpected molecular_mechanism value during qualifier mapping: {row.molecular_mechanism}. Adjust parser.")

            
        ## assign the return value so it isn't displayed (named `_` to avoid shadowing the built-in `bytes`)
        _ = kgx_edges_writer.write(document)
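
For quick checks of a written file, JSON Lines can also be produced and read back with only the standard library. The sketch below does not use the `jsonlines` package. It writes one compact JSON object per line, which is assumed to be comparable to what `compact=True` produces. The helper names are hypothetical, and an in-memory buffer stands in for the real file.

```python
import io
import json


def write_jsonl(documents, handle):
    """Write one compact JSON object per line."""
    for doc in documents:
        handle.write(json.dumps(doc, separators=(",", ":")) + "\n")


def read_jsonl(handle):
    """Read every non-empty line back into a dict."""
    return [json.loads(line) for line in handle if line.strip()]


# Toy edge using IDs from the first row of df.head().
edges = [
    {"subject": "NCBIGene:3753", "predicate": "biolink:causes", "object": "MONDO:0012871"}
]

buffer = io.StringIO()
write_jsonl(edges, buffer)
buffer.seek(0)
round_tripped = read_jsonl(buffer)
print(round_tripped == edges)
```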

### File: KGX nodes

This code isn't in parser.py

Each node requires `id` and `category`. `name` and other properties (basically node attributes) are optional. 

In [79]:
nodenormed_genes_final = df_only_nodenormed[["gene_nodenorm_id", "gene_nodenorm_label"]].drop_duplicates()
nodenormed_diseases_final = df_only_nodenormed[["disease_nodenorm_id", "disease_nodenorm_label"]].drop_duplicates()

nodenormed_genes_final
nodenormed_diseases_final

Unnamed: 0,gene_nodenorm_id,gene_nodenorm_label
0,NCBIGene:3753,KCNE1
1,NCBIGene:5781,PTPN11
2,NCBIGene:3728,JUP
3,NCBIGene:801,CALM1
4,NCBIGene:2273,FHL1
...,...,...
2621,NCBIGene:374462,PTPRQ
2622,NCBIGene:5269,SERPINB6
2623,NCBIGene:7007,TECTA
2625,NCBIGene:286262,TPRN


Unnamed: 0,disease_nodenorm_id,disease_nodenorm_label
0,MONDO:0012871,Jervell and Lange-Nielsen syndrome 2
1,MONDO:0100082,LEOPARD syndrome 1
2,MONDO:0011017,Naxos disease
3,MONDO:0013966,catecholaminergic polymorphic ventricular tach...
4,MONDO:0010401,X-linked myopathy with postural muscle atrophy
...,...,...
2625,MONDO:0013215,autosomal recessive nonsyndromic hearing loss 79
2626,MONDO:0013963,autosomal recessive nonsyndromic hearing loss 93
2627,MONDO:0007395,craniofacial-deafness-hand syndrome
2628,MONDO:0011519,autosomal dominant nonsyndromic hearing loss 23


In [80]:

with jsonlines.open('EBIgene2pheno_kgx_nodes.jsonl', mode='w', compact=True) as kgx_nodes_writer:
    
    ## using itertuples because it's faster, preserves datatypes
    for row in nodenormed_genes_final.itertuples(index=False):
        ## assign the return value so it isn't displayed (named `_` to avoid shadowing the built-in `bytes`)
        _ = kgx_nodes_writer.write({
            "id": row.gene_nodenorm_id,
            "name": row.gene_nodenorm_label,            
            ## hard-coded because during pre-NodeNorm process, only kept entities with this primary category
            "category": ["biolink:Gene"]
            
        })

    ## using itertuples because it's faster, preserves datatypes
    for row in nodenormed_diseases_final.itertuples(index=False):
        ## assign the return value so it isn't displayed (named `_` to avoid shadowing the built-in `bytes`)
        _ = kgx_nodes_writer.write({
            "id": row.disease_nodenorm_id,
            "name": row.disease_nodenorm_label,
            ## hard-coded because during pre-NodeNorm process, only kept entities with this primary category
            "category": ["biolink:Disease"]
        })

In [81]:
nodenormed_genes_final.shape[0] + nodenormed_diseases_final.shape[0]

4795
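
The node list can also be built directly from the edge rows, keyed on node ID. Keying on the ID alone guarantees one record per ID even if the same ID ever appeared with two different labels, a case that `drop_duplicates` on (ID, label) pairs would keep as two rows. This is a plain-Python sketch with a hypothetical function name and toy rows copied from `df.head()`.

```python
def collect_nodes(rows):
    """Return one KGX-style node record per unique ID, in first-seen order."""
    nodes = {}
    for row in rows:
        candidates = (
            (row["gene_nodenorm_id"], row["gene_nodenorm_label"], "biolink:Gene"),
            (row["disease_nodenorm_id"], row["disease_nodenorm_label"], "biolink:Disease"),
        )
        for node_id, name, category in candidates:
            # setdefault keeps the first record seen for each ID
            nodes.setdefault(node_id, {"id": node_id, "name": name, "category": [category]})
    return list(nodes.values())


kcne1_row = {
    "gene_nodenorm_id": "NCBIGene:3753",
    "gene_nodenorm_label": "KCNE1",
    "disease_nodenorm_id": "MONDO:0012871",
    "disease_nodenorm_label": "Jervell and Lange-Nielsen syndrome 2",
}
ptpn11_row = {
    "gene_nodenorm_id": "NCBIGene:5781",
    "gene_nodenorm_label": "PTPN11",
    "disease_nodenorm_id": "MONDO:0100082",
    "disease_nodenorm_label": "LEOPARD syndrome 1",
}

# The first row is repeated on purpose to show that duplicates collapse.
nodes = collect_nodes([kcne1_row, ptpn11_row, kcne1_row])
print(len(nodes))
```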

## Checking documents

In [None]:
len(documents)

In [None]:
df_only_nodenormed.info()

In [None]:
## code chunk for finding rows -> then look up corresponding doc by idx
# df_only_nodenormed[df_only_nodenormed["disease_mim"].str.contains("orphanet", na=False)]
# df_only_nodenormed[df_only_nodenormed["confidence"] == "limited"]
# df_only_nodenormed[df_only_nodenormed["publications"].isna()]
# df_only_nodenormed[~df_only_nodenormed["publications"].str.contains(";", na=True)]



# df_only_nodenormed[df_only_nodenormed["previous_gene_symbols"].isna()]
# df_only_nodenormed[df_only_nodenormed["disease_MONDO"].notna()]

In [None]:
pprint(documents[34])

# documents[416]

## BioThings Parser notes

It's fine to use raise/assert in the parser. `raise` is technically the better choice, because `assert` statements are skipped when Python runs with optimizations enabled (`-O`): https://realpython.com/python-assert-statement/#understanding-common-pitfalls-of-assert
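
To illustrate the difference: a check written with an explicit `raise` always runs, while the same check written as an `assert` disappears under `python -O`. The sketch below mirrors the confidence-to-predicate mapping from the "BioThings-type parser" cell; the function name is hypothetical.

```python
def predicate_for_confidence(confidence):
    """Map a G2P confidence value to a predicate, failing loudly on unknown values."""
    if confidence == "limited":
        return "biolink:related_to"
    if confidence in ("moderate", "strong", "definitive"):
        return "biolink:causes"
    # An `assert False, ...` here would be stripped under `python -O`;
    # an explicit raise is not.
    raise ValueError(
        f"Unexpected confidence value during predicate mapping: {confidence}. Adjust parser."
    )


print(predicate_for_confidence("strong"))
```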


My notes on parser:
* adding prefixes to gene/disease IDs is good for pre-NodeNorming steps
* keeping different gene/disease ID namespaces as separate fields right now is good for current BTE/x-bte-annotation system
  * Also, original subject will always be HGNC, original object will always be disease OMIM with current code


My notes on syntax:
* use `yield` when you want to "return" a value on each pass through a for-loop (`return` only happens once, then execution exits the for-loop/function)
  * that's how it's used in the main execution, when iterating over csv rows to generate documents
* use `yield from {function}` to get the data from a generator (created by using `yield` inside the function)
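
A small sketch of both syntax notes. The function names and the two toy rows are hypothetical (the IDs come from `df.head()`); the real parser's function names and row format may differ.

```python
def rows_to_documents(rows):
    """Generator: hands back one document per row instead of building a list."""
    for row in rows:
        yield {"_id": row["g2p_id"], "subject": row["gene_nodenorm_id"]}


def load_data(rows):
    """Outer generator: passes along everything the inner generator yields."""
    yield from rows_to_documents(rows)


rows = [
    {"g2p_id": "G2P00124", "gene_nodenorm_id": "NCBIGene:3753"},
    {"g2p_id": "G2P00841", "gene_nodenorm_id": "NCBIGene:5781"},
]

# Nothing runs until the generator is consumed, e.g. by list() or a for-loop.
documents = list(load_data(rows))
print(documents)
```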