# Exploring EBI gene2pheno data for parser

## Downloading data

The latest data can be downloaded from https://www.ebi.ac.uk/gene2phenotype/download, but I couldn't figure out the direct links for the individual downloads.

Static release files (currently 2025-02-28) can be downloaded from the FTP site http://ftp.ebi.ac.uk/pub/databases/gene2phenotype/ (in the G2P_data_downloads subfolder). It is supposed to update on a monthly basis, according to the [README](http://ftp.ebi.ac.uk/pub/databases/gene2phenotype/README). However, I notice that the FTP site is sometimes slow on my computer (Firefox browser). 

I'm using all disorder panels/files:
* Cancer
* Cardiac
* Developmental disorders (DD)
* Eye
* Hearing
* Skeletal
* Skin

## Load data into separate pandas dataframes

In [17]:
## import packages for exploring here

## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pathlib
import pandas as pd
import glob

pd.options.display.max_columns = None

In [104]:
## Construct paths for files

## adjust based on where files are stored
base_file_path = pathlib.Path.home().joinpath('Desktop', 'EBIgene2pheno_files', 'From_Website')

## ID columns are read as str so pandas doesn't parse them as numbers
id_dtypes = {"gene mim": str, "hgnc id": str, "disease mim": str}

cancer = pd.read_csv(base_file_path.joinpath('G2P_Cancer_2025-03-13.csv'), dtype=id_dtypes)
cardiac = pd.read_csv(base_file_path.joinpath('G2P_Cardiac_2025-03-13.csv'), dtype=id_dtypes)
developmental = pd.read_csv(base_file_path.joinpath('G2P_DD_2025-03-13.csv'), dtype=id_dtypes)
eye = pd.read_csv(base_file_path.joinpath('G2P_Eye_2025-03-13.csv'), dtype=id_dtypes)
hearing = pd.read_csv(base_file_path.joinpath('G2P_Hearing loss_2025-03-13.csv'), dtype=id_dtypes)
skeletal = pd.read_csv(base_file_path.joinpath('G2P_Skeletal_2025-03-13.csv'), dtype=id_dtypes)
skin = pd.read_csv(base_file_path.joinpath('G2P_Skin_2025-03-13.csv'), dtype=id_dtypes)

## saving for possible future use
# unique_filename_substring = [
#     'Cancer',
#     'Cardiac',
#     'DD',
#     'Eye',
#     'Hearing loss',
#     'Skeletal',
#     'Skin'
# ]

In [4]:
## way to get all paths for all files without knowing their actual names
for i in base_file_path.glob("*.csv"):
    print(i)

/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Skin_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Hearing loss_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Cardiac_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Skeletal_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Cancer_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_DD_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Eye_2025-03-13.csv


In [5]:
## checking that each file's columns are the same: they are (only cancer vs skin shown here)
cancer.columns == skin.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])
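The comparison above only covers cancer vs skin. To check every dataframe in one go, something like the following could work. `all_columns_match` is a hypothetical helper, and the two tiny frames only stand in for the real panels:

```python
import pandas as pd


def all_columns_match(frames: dict) -> bool:
    """True if every dataframe has the same columns, in the same order."""
    columns = [df.columns for df in frames.values()]
    # Index.equals returns a single bool instead of an element-wise array
    return all(cols.equals(columns[0]) for cols in columns[1:])


# toy stand-ins for the real panel dataframes
toy_frames = {
    "Cancer": pd.DataFrame(columns=["g2p id", "gene symbol"]),
    "Skin": pd.DataFrame(columns=["g2p id", "gene symbol"]),
}
print(all_columns_match(toy_frames))
```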

In [79]:
cardiac.columns

Index(['g2p id', 'gene symbol', 'gene mim', 'hgnc id', 'previous gene symbols',
       'disease name', 'disease mim', 'disease MONDO', 'allelic requirement',
       'cross cutting modifier', 'confidence', 'variant consequence',
       'variant types', 'molecular mechanism',
       'molecular mechanism categorisation', 'molecular mechanism evidence',
       'phenotypes', 'publications', 'panel', 'comments',
       'date of last review'],
      dtype='object')

In [80]:
## several gene-disease pairs appear twice: the paired rows differ in allelic requirement (monoallelic vs biallelic)
cardiac[cardiac.duplicated(subset=['gene symbol', 'disease name'], keep=False)]

Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review
2,G2P03247,DSC2,125645,3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,splice_region_variant; inframe_insertion; spli...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:36:09+00:00
3,G2P03248,DSC2,125645,3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,splice_region_variant; inframe_insertion; spli...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:35:19+00:00
4,G2P03249,DSG2,125671,3049,CDHF5,DSG2-related arrhythmogenic right ventricular ...,,MONDO:0012434,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_insertion; splice_donor_variant; infra...,undetermined,inferred,,,21636032; 33831308; 33917638; 34400560; 240707...,Cardiac,Expert review done on 05/01/2022; DSG2-related...,2024-03-20 09:40:18+00:00
5,G2P03250,DSG2,125671,3049,CDHF5,DSG2-related arrhythmogenic right ventricular ...,,MONDO:0012434,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_insertion; splice_donor_variant; infra...,undetermined,inferred,,,21636032; 33831308; 33917638; 34400560; 240707...,Cardiac,Expert review done on 05/01/2022; DSG2-related...,2024-03-20 09:37:58+00:00
6,G2P03251,DSP,125647,3052,DPI; DPII; KPPS2; PPKS2,DSP-related arrhythmogenic right ventricular c...,,MONDO:0011831,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_insertion; splice_donor_variant; infra...,undetermined,inferred,,,32372669; 23137101; 21636032; 33831308; 313199...,Cardiac,Expert review done on 05/01/2022; DSP-related ...,2024-03-20 09:42:53+00:00
7,G2P03252,DSP,125647,3052,DPI; DPII; KPPS2; PPKS2,DSP-related arrhythmogenic right ventricular c...,,MONDO:0011831,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_insertion; splice_donor_variant; infra...,undetermined,inferred,,,32372669; 23137101; 21636032; 33831308; 313199...,Cardiac,Expert review done on 05/01/2022; DSP-related ...,2024-03-20 09:42:01+00:00
17,G2P03262,CASQ2,114251,1513,PDIB2,CASQ2-related catecholaminergic polymorphic ve...,,MONDO:0012762,monoallelic_autosomal,,moderate,decreased gene product level; altered gene pro...,missense_variant; splice_region_variant; splic...,undetermined,inferred,,,12386154; 16908766; 21618644; 15176429; 326936...,Cardiac,Expert review done on 19/05/2021; CASQ2 is an ...,2024-03-13 11:49:22+00:00
18,G2P03263,CASQ2,114251,1513,PDIB2,CASQ2-related catecholaminergic polymorphic ve...,,MONDO:0012762,biallelic_autosomal,,definitive,absent gene product; altered gene product stru...,missense_variant; splice_region_variant; splic...,undetermined,inferred,,,12386154; 16908766; 21618644; 15176429; 326936...,Cardiac,Expert review done on 19/05/2021; CASQ2 is an ...,2022-04-25 12:45:21+00:00
40,G2P03285,KCNQ1,607542,6294,JLNS1; KCNA8; KCNA9; KV7.1; KVLQT1; LQT; LQT1,KCNQ1-related long QT syndrome,,MONDO:0100316,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,frameshift_variant_NMD_triggering; inframe_ins...,undetermined,inferred,,,31983240; 9020846; 23392653; 23591039; 2704115...,Cardiac,Expert review done on 16/06/2021; Pathogenic v...,2024-03-13 11:56:22+00:00
41,G2P03286,KCNQ1,607542,6294,JLNS1; KCNA8; KCNA9; KV7.1; KVLQT1; LQT; LQT1,KCNQ1-related long QT syndrome,,MONDO:0100316,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,frameshift_variant_NMD_triggering; inframe_ins...,undetermined,inferred,,,31983240; 9020846; 23392653; 23591039; 2704115...,Cardiac,Expert review done on 16/06/2021; Pathogenic v...,2024-03-13 11:57:14+00:00
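
In the rows above, each repeated gene-disease pair differs in "allelic requirement". A small sketch of that check, using the two DSC2 record IDs from the output; the disease name is a placeholder because the real one is truncated in the display:

```python
import pandas as pd

# two DSC2 rows from the output above; "disease name" is a placeholder string
toy = pd.DataFrame(
    {
        "g2p id": ["G2P03247", "G2P03248"],
        "gene symbol": ["DSC2", "DSC2"],
        "disease name": ["placeholder disease", "placeholder disease"],
        "allelic requirement": ["monoallelic_autosomal", "biallelic_autosomal"],
    }
)

pair = ["gene symbol", "disease name"]
# both rows count as duplicates when only the gene-disease pair is compared...
dup_on_pair = toy.duplicated(subset=pair, keep=False).sum()
# ...but none do once allelic requirement is part of the key
dup_on_triple = toy.duplicated(subset=pair + ["allelic requirement"], keep=False).sum()
print(dup_on_pair, dup_on_triple)
```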


In [105]:
cardiac.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 21 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   g2p id                              80 non-null     object 
 1   gene symbol                         80 non-null     object 
 2   gene mim                            80 non-null     object 
 3   hgnc id                             80 non-null     object 
 4   previous gene symbols               74 non-null     object 
 5   disease name                        80 non-null     object 
 6   disease mim                         6 non-null      object 
 7   disease MONDO                       73 non-null     object 
 8   allelic requirement                 80 non-null     object 
 9   cross cutting modifier              22 non-null     object 
 10  confidence                          80 non-null     object 
 11  variant consequence                 80 non-null

In [82]:
cardiac["g2p id"].nunique()

80

In [115]:
cardiac["disease MONDO"].nunique()
cardiac["disease MONDO"].unique()

61

array([nan, 'MONDO:0012506', 'MONDO:0012434', 'MONDO:0011831',
       'MONDO:0012180', 'MONDO:0011459', 'MONDO:0011017', 'MONDO:0013966',
       'MONDO:0017990', 'MONDO:0013529', 'MONDO:0014191', 'MONDO:0011001',
       'MONDO:0012762', 'MONDO:0012312', 'MONDO:0012314', 'MONDO:0012313',
       'MONDO:0000453', 'MONDO:0013479', 'MONDO:0011482', 'MONDO:0005021',
       'MONDO:0007269', 'MONDO:0013262', 'MONDO:0012362', 'MONDO:0013168',
       'MONDO:0012745', 'MONDO:0007267', 'MONDO:0011400', 'MONDO:0012799',
       'MONDO:0007268', 'MONDO:0012112', 'MONDO:0012111', 'MONDO:0013369',
       'MONDO:0013367', 'MONDO:0100316', 'MONDO:0011377', 'MONDO:0019171',
       'MONDO:0054838', 'MONDO:0011076', 'MONDO:0010680', 'MONDO:0012289',
       'MONDO:0010526', 'MONDO:0010281', 'MONDO:0010946', 'MONDO:0024540',
       'MONDO:0008919', 'MONDO:0008647', 'MONDO:0014548', 'MONDO:0014550',
       'MONDO:0007266', 'MONDO:0008222', 'MONDO:0010979', 'MONDO:0008104',
       'MONDO:0012690', 'MONDO:001414

In [116]:
# cardiac[cardiac["disease mim"] == "612347"]  ## compare to a str, since the column is read as str

cardiac[cardiac["disease mim"].isna() & cardiac["disease MONDO"].isna()].shape

# cancer[cancer["comments"].notna()]

# cardiac["disease mim"].astype(str).unique()

(4, 21)

### Organizing data into documents/individual association objects:

#### NOW ON: 
cardiac: allelic requirement


#### Gene subject section:
- "gene symbol": str. "HGNC-assigned gene symbol" (according to DataDownloadFormat txt file)
  - DOC TRANSFORM: hgnc_symbol
- "gene mim": OMIM ID for gene. NodeNorm does recognize these
  - DF READ: force dtype to be str
  - DF CHECK: 
    - Unsure: add current Translator prefix
  - DOC TRANSFORM: omim
- "hgnc id"
  - DF READ: force dtype to be str
  - DF CHECK: same as "gene mim"
- "previous gene symbols": "; "-delimited.
  - DOC TRANSFORM:
    - CHECK for NA. Only assign field if not NA
    - Turn into array of str: ";"-split and then strip whitespace 
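
The gene notes above could translate into something like this. It is only a sketch: the `hgnc` and `previous_gene_symbols` field names are assumptions (the notes only name `hgnc_symbol` and `omim`), and no Translator prefix is added since that is still undecided. The example values are the DSP row from the cardiac output.

```python
import math


def is_missing(value) -> bool:
    """True for None or NaN (how pandas represents empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_gene(row: dict) -> dict:
    """Build the gene part of a document from one dataframe row given as a dict."""
    gene = {
        "hgnc_symbol": row["gene symbol"],
        "omim": row["gene mim"],
        "hgnc": row["hgnc id"],  # hypothetical field name
    }
    previous = row.get("previous gene symbols")
    if not is_missing(previous):
        # ";"-split, then strip whitespace around each symbol
        symbols = [symbol.strip() for symbol in previous.split(";")]
        gene["previous_gene_symbols"] = symbols  # hypothetical field name
    return gene


example_gene = build_gene(
    {
        "gene symbol": "DSP",
        "gene mim": "125647",
        "hgnc id": "3052",
        "previous gene symbols": "DPI; DPII; KPPS2; PPKS2",
    }
)
print(example_gene)
```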

#### Disease object section:
- "disease name":
  - DOC TRANSFORM: name
- "disease mim": OMIM ID for disease. But there are some Orphanet IDs mixed in (start with "Orphanet:")
  - DF READ: force dtype to be str
  - DF CHECK:
    - Unsure: add/replace current Translator prefix
  - DOC TRANSFORM: omim, orphanet
    - CHECK for NA (many rows are). Only assign field if not NA
    - CHECK for "Orphanet:": if yes, assign to orphanet field instead
- "disease MONDO": MONDO ID for disease (already Translator-formatted CURIE) 
  - DOC TRANSFORM: mondo
    - CHECK for NA (many rows are). Only assign field if not NA
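
A possible shape for the disease part, following the checks listed above. This is a sketch under assumptions: IDs are stored as they appear in the file (no prefix added or replaced), and the Orphanet ID in the second example is a made-up placeholder, not a real record.

```python
import math


def is_missing(value) -> bool:
    """True for None or NaN (how pandas represents empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_disease(row: dict) -> dict:
    """Build the disease part of a document from one dataframe row given as a dict."""
    disease = {"name": row["disease name"]}
    mim = row.get("disease mim")
    if not is_missing(mim):
        # Orphanet IDs are mixed into this column; route them to their own field
        key = "orphanet" if mim.startswith("Orphanet:") else "omim"
        disease[key] = mim
    mondo = row.get("disease MONDO")
    if not is_missing(mondo):
        disease["mondo"] = mondo
    return disease


# KCNQ1 row from the cardiac output: no "disease mim", but a MONDO ID
kcnq1_disease = build_disease(
    {
        "disease name": "KCNQ1-related long QT syndrome",
        "disease mim": float("nan"),
        "disease MONDO": "MONDO:0100316",
    }
)
# made-up row to show the Orphanet branch
orphanet_disease = build_disease(
    {"disease name": "placeholder disease", "disease mim": "Orphanet:0000", "disease MONDO": None}
)
print(kcnq1_disease, orphanet_disease)
```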
    
#### Association section:
- "g2p id": resource's unique stable ID for each record/row (according to DataDownloadFormat txt file, G2P webpage for single record)
  - DF TRANSFORM:
    - create a new column with the G2P webpage URL for each record, like https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03386. This may be brittle/work to maintain, but I know the UI really wants stuff like this
  - DOC TRANSFORM: g2p_record_id, g2p_record_url
    - use to create
- "allelic requirement": required genotype (according to G2P webpage for single record). str (categorical). [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) says these are synonyms of HPO "mode of inheritance" terms, which seem to be from the [inheritance part of the ontology](https://hpo.jax.org/browse/term/HP:0000005). I've included a mapping table in the Reference section.
  - DF TRANSFORM:
    - optional: turn dtype into 'category'
    - very optional: create new columns for HPO mapping. Probably based on static mapping dict, which would be brittle/work to maintain
- "cross cutting modifier": additional info relevant to gene-disease inheritance (according to G2P webpage for single record). Can be "; "-delimited, otherwise would be categorical. [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) says these are HPO inheritance qualifier terms "when available" ([this part of ontology](https://hpo.jax.org/browse/term/HP:0034335)). I didn't try mapping these.
  - DOC TRANSFORM:
    - CHECK for NA (many rows are). Only assign field if not NA
    - Turn into array of str: ";"-split and then strip whitespace 
- "confidence": confidence that the association is real (according to G2P webpage for panel). [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) has definitions of terms. str (categorical)
  - DF TRANSFORM:
    - optional: turn dtype into 'category'
- "variant consequence": consequences of reported variants on product (protein or RNA) (according to G2P webpage for single record). Can be "; "-delimited, otherwise would be categorical. [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) has definitions of terms and SO mappings. 
  - DF TRANSFORM:
    - very optional: create new columns for SO mapping. Probably based on static mapping dict, which would be brittle/work to maintain
  - DOC TRANSFORM:
    - Turn into array of str: ";"-split and then strip whitespace 
- "variant types": associated with gene-disease pair (according to G2P webpage for single record). Can be "; "-delimited, otherwise would be categorical. [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) has definitions of terms and SO mappings (large table!). 
  - DF TRANSFORM:
    - very optional: create new columns for SO mapping. Probably based on static mapping dict, which would be brittle/work to maintain
  - DOC TRANSFORM:
    - CHECK for NA (many rows are). Only assign field if not NA
    - Turn into array of str: ";"-split and then strip whitespace 
- "molecular mechanism": mechanism of how the gene's variants cause disease (according to G2P webpage for single record). str (categorical). [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) has definitions of terms.
  - DF TRANSFORM:
    - optional: turn dtype into 'category'
- "molecular mechanism categorisation": seems to say how molecular mechanism was decided. Shows up as "support" on G2P webpage for single record. possible values: "inferred" or "evidence". str (categorical). 
  - DF TRANSFORM:
    - optional: turn dtype into 'category'
- "molecular mechanism evidence": complicated structure. Looks like " & "-delimited, then primary_key (publication ID) -> object (diff key: value pairs split by "; ", values can be ", "-delimited lists). 
  - DOC TRANSFORM:
    - CHECK for NA (many rows are). Only assign field if not NA
    - Turn into array of str: "&"-split and then strip whitespace. Not going to do more parsing for now.  
- "phenotypes": HPO IDs (already Translator-formatted CURIE) for phenotypes reported by the publications (according to G2P webpage for single record). Can be "; "-delimited
  - based on G2P website, these are organized/assigned to specific gene-disease records...so I'm keeping them as a part of the gene-disease association, rather than creating different objects for gene-phenotype (if I did this, it would be a separate function)
  - DOC TRANSFORM:
    - CHECK for NA. Only assign field if not NA
    - Turn into array of str: ";"-split and then strip whitespace 
- "publications": these are PMIDs with no prefix (based on G2P webpage for single record). Can be "; "-delimited
  - DOC TRANSFORM: pmids
    - CHECK for NA. Only assign field if not NA
    - Turn into array of str: ";"-split and then strip whitespace 
- "panel": the G2P panels this record is assigned to (according to DataDownloadFormat txt file). Can be "; "-delimited, otherwise would be categorical.
  - DOC TRANSFORM: g2p_panels
    - map (str.replace) values to match G2P webpage for single record
      - "DD" = "Developmental disorders"
      - add " disorders" to end: "Cancer", "Eye", "Skeletal", "Skin"
    - Turn into array of str: ";"-split and then strip whitespace 
- "comments": additional comments from curation team (according to DataDownloadFormat txt file). Rare, appears to be free-text.
  - DOC TRANSFORM: curator_comments
    - CHECK for NA (many rows are). Only assign field if not NA
- "date of last review":
  - DF TRANSFORM:
    - optional: turn dtype into datetime https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime
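
A rough sketch of the DOC TRANSFORM steps for the association fields above. Assumptions: the row dict already has missing cells removed (for example via `row.dropna().to_dict()`), only a subset of columns is handled, and every output field name other than `g2p_record_id`, `g2p_record_url`, `pmids`, `g2p_panels`, and `curator_comments` is hypothetical. The example uses the first KCNQ1 row from the cardiac output with only the first two PMIDs visible there.

```python
G2P_RECORD_URL = "https://www.ebi.ac.uk/gene2phenotype/lgd/"  # URL pattern from the notes above

# download-file panel value -> name shown on the G2P record webpage
PANEL_NAMES = {
    "DD": "Developmental disorders",
    "Cancer": "Cancer disorders",
    "Eye": "Eye disorders",
    "Skeletal": "Skeletal disorders",
    "Skin": "Skin disorders",
}

# "; "-delimited column -> output field (mostly hypothetical names)
LIST_FIELDS = {
    "cross cutting modifier": "cross_cutting_modifiers",
    "variant consequence": "variant_consequences",
    "variant types": "variant_types",
    "phenotypes": "phenotypes",
    "publications": "pmids",
    "panel": "g2p_panels",
}


def split_field(value: str, delimiter: str = ";") -> list:
    """Split on the delimiter and strip whitespace; no delimiter gives a one-element list."""
    return [part.strip() for part in value.split(delimiter)]


def build_association(row: dict) -> dict:
    """Build association fields from a row dict that contains only non-missing cells."""
    g2p_id = row["g2p id"]
    assoc = {
        "g2p_record_id": g2p_id,
        "g2p_record_url": G2P_RECORD_URL + g2p_id,
        "allelic_requirement": row["allelic requirement"],
        "confidence": row["confidence"],
    }
    for column, field in LIST_FIELDS.items():
        if column in row:
            assoc[field] = split_field(row[column])
    if "g2p_panels" in assoc:
        # panels not in the mapping (e.g. "Cardiac") are kept unchanged
        assoc["g2p_panels"] = [PANEL_NAMES.get(p, p) for p in assoc["g2p_panels"]]
    if "molecular mechanism evidence" in row:
        # only the top-level "&"-split; the nested structure is left unparsed
        assoc["molecular_mechanism_evidence"] = split_field(row["molecular mechanism evidence"], "&")
    if "comments" in row:
        assoc["curator_comments"] = row["comments"]
    return assoc


example_assoc = build_association(
    {
        "g2p id": "G2P03285",
        "allelic requirement": "monoallelic_autosomal",
        "confidence": "definitive",
        "publications": "31983240; 9020846",
        "panel": "Cardiac",
    }
)
print(example_assoc)
```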









#### Don't use:



### Reference

#### Allelic requirement terms
| G2P | HPO name | HPO ID | Exact synonym? | 
| :- | :- | :- | :- |
| biallelic_autosomal | Autosomal recessive inheritance | HP:0000007 | Yes |
| monoallelic_autosomal | Autosomal dominant inheritance | HP:0000006 | Yes |
| biallelic_PAR | Pseudoautosomal recessive inheritance | HP:0034341 | Yes |
| monoallelic_PAR | Pseudoautosomal dominant inheritance | HP:0034340 | Yes |
| mitochondrial | Mitochondrial inheritance | HP:0001427 | Yes |
| monoallelic_Y_hemizygous | Y-linked inheritance | HP:0001450 | Yes |
| monoallelic_X | X-linked inheritance | HP:0001417 | No (my guess) |
| monoallelic_X_hemizygous | X-linked recessive inheritance | HP:0001419 | No (my guess) |
| monoallelic_X_heterozygous | X-linked dominant inheritance | HP:0001423 | No (my guess) |
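
If the HPO mapping ever gets added as columns/fields, the table above could be kept as a static dict. This is just the table transcribed, so it carries the same "my guess" caveats for the X-linked terms; `ALLELIC_REQUIREMENT_TO_HPO` and `hpo_for` are illustrative names:

```python
# G2P allelic requirement -> (HPO ID, HPO name, exact synonym?) copied from the table above
ALLELIC_REQUIREMENT_TO_HPO = {
    "biallelic_autosomal": ("HP:0000007", "Autosomal recessive inheritance", True),
    "monoallelic_autosomal": ("HP:0000006", "Autosomal dominant inheritance", True),
    "biallelic_PAR": ("HP:0034341", "Pseudoautosomal recessive inheritance", True),
    "monoallelic_PAR": ("HP:0034340", "Pseudoautosomal dominant inheritance", True),
    "mitochondrial": ("HP:0001427", "Mitochondrial inheritance", True),
    "monoallelic_Y_hemizygous": ("HP:0001450", "Y-linked inheritance", True),
    # the three X-linked mappings are guesses, not exact synonyms
    "monoallelic_X": ("HP:0001417", "X-linked inheritance", False),
    "monoallelic_X_hemizygous": ("HP:0001419", "X-linked recessive inheritance", False),
    "monoallelic_X_heterozygous": ("HP:0001423", "X-linked dominant inheritance", False),
}


def hpo_for(allelic_requirement: str):
    """Return (HPO ID, HPO name, exact) or None when the term is not in the table."""
    return ALLELIC_REQUIREMENT_TO_HPO.get(allelic_requirement)


print(hpo_for("biallelic_autosomal"))
```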

General Ideas:
* each row = unique gene-disease-allelic requirement combo => turn into 1 association object
  * "G2P records are Locus-Genotype-Mechanism-Disease-Evidence (LGMDE) threads describing gene-disease associations" -> this is what a unique row is from resource POV (ref: homepage table "Total LGMDE Records" info (i) button)
* sometimes disease doesn't have any IDs for it => still make document/object but won't use in Translator unless it has ID
* confidence is important to filter on for Translator! => will still make document/object but don't use! 
  * don't use "limited", "disputed", "refuted": Translator doesn't handle negation ("refuted") or low/conflicting evidence ("disputed" or "limited")
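
The two "make it but don't use it" rules above could be expressed as one predicate applied downstream of the parser. A sketch only: the document layout (a nested `disease` dict and a top-level `confidence` key) and the name `usable_in_translator` are assumptions.

```python
# confidence values Translator shouldn't use (negation or low/conflicting evidence)
EXCLUDED_CONFIDENCE = {"limited", "disputed", "refuted"}
# assumed ID fields inside the document's "disease" dict
DISEASE_ID_FIELDS = ("omim", "orphanet", "mondo")


def usable_in_translator(doc: dict) -> bool:
    """True if the disease has at least one ID and the confidence is not excluded."""
    disease = doc.get("disease", {})
    has_disease_id = any(field in disease for field in DISEASE_ID_FIELDS)
    return has_disease_id and doc.get("confidence") not in EXCLUDED_CONFIDENCE


print(usable_in_translator({"disease": {"mondo": "MONDO:0100316"}, "confidence": "definitive"}))
```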



  
* if I loaded all data into 1 dataframe, there would be duplicate rows because the same gene-disease record is in several panels (aka the disease falls into multiple categories). Note that panel is a "; "-delimited string => dropping duplicates will be needed (but be careful to include all fields that matter for a unique row)
* str split method will create single-element arrays if no delimiter found
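
The last two points can be tried on toy data. The IDs and gene symbols below are made up, and the sketch assumes a record repeated across panel files is identical in every column (which should be verified on the real files before relying on it):

```python
import pandas as pd

# toy rows: record G2P00001 shows up in both panel files with the same panel string
dd_toy = pd.DataFrame(
    {"g2p id": ["G2P00001"], "gene symbol": ["GENE1"], "panel": ["DD; Skin"]}
)
skin_toy = pd.DataFrame(
    {
        "g2p id": ["G2P00001", "G2P00002"],
        "gene symbol": ["GENE1", "GENE2"],
        "panel": ["DD; Skin", "Skin"],
    }
)

combined = pd.concat([dd_toy, skin_toy], ignore_index=True)
# compare on every column, so rows that differ anywhere are both kept
deduped = combined.drop_duplicates(ignore_index=True)

# a value with no delimiter still becomes a (single-element) list
panels = deduped["panel"].str.split("; ")
print(panels.tolist())
```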

## BioThings Parser requirements

* document/object MUST have `_id` key (primary ID for BioThings database - Elasticsearch / MongoDB?) 

* parser feeds into upload step -> function should be a generator that iterates through all records -> for each, yield the document/object as json



* output needs to be json dump
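
Put together, the requirements above suggest a generator along these lines. This is a bare-bones sketch, not a confirmed BioThings interface: the function name `load_data`, its `data_folder` argument, and using "g2p id" as `_id` are all assumptions. It uses the stdlib `csv` module instead of pandas to stay dependency-free, and it does not yet handle records repeated across panel files or any of the field transforms.

```python
import csv
import json
import pathlib


def load_data(data_folder):
    """Yield one json-serializable document per csv row found in `data_folder`."""
    for path in sorted(pathlib.Path(data_folder).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # DictReader gives "" for empty cells, so drop those keys
                doc = {key: value for key, value in row.items() if value}
                # assumption: the record's stable G2P ID is used as the primary ID
                doc["_id"] = doc.pop("g2p id")
                yield doc
```

Each yielded document can then be dumped with `json.dumps(doc)`, for example `for doc in load_data(base_file_path): print(json.dumps(doc))`.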