# Exploring EBI gene2pheno data for parser

## Downloading data

The latest data can be downloaded from https://www.ebi.ac.uk/gene2phenotype/download, but I can't figure out the direct links behind the downloads.

Static release files (currently 2025-02-28) can be downloaded from the FTP site http://ftp.ebi.ac.uk/pub/databases/gene2phenotype/ (in the G2P_data_downloads subfolder). According to the [README](http://ftp.ebi.ac.uk/pub/databases/gene2phenotype/README), they are supposed to update monthly. However, the FTP site is sometimes slow on my computer (Firefox browser).

I'm using all disorder panels/files:
* Cancer
* Cardiac
* Developmental disorders (DD)
* Eye
* Hearing
* Skeletal
* Skin

## Load data into separate pandas dataframes

In [17]:
## import packages for exploring here

## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pathlib
import pandas as pd
import glob

pd.options.display.max_columns = None

In [3]:
## Construct paths for files

## adjust based on where files are stored
base_file_path = pathlib.Path.home().joinpath('Desktop', 'EBIgene2pheno_files', 'From_Website')

cancer = pd.read_csv(base_file_path.joinpath('G2P_Cancer_2025-03-13.csv'))
cardiac = pd.read_csv(base_file_path.joinpath('G2P_Cardiac_2025-03-13.csv'))
developmental = pd.read_csv(base_file_path.joinpath('G2P_DD_2025-03-13.csv'))
eye = pd.read_csv(base_file_path.joinpath('G2P_Eye_2025-03-13.csv'))
hearing = pd.read_csv(base_file_path.joinpath('G2P_Hearing loss_2025-03-13.csv'))
skeletal = pd.read_csv(base_file_path.joinpath('G2P_Skeletal_2025-03-13.csv'))
skin = pd.read_csv(base_file_path.joinpath('G2P_Skin_2025-03-13.csv'))

## saving for possible future use
# unique_filename_substring = [
#     'Cancer',
#     'Cardiac',
#     'DD',
#     'Eye',
#     'Hearing loss',
#     'Skeletal',
#     'Skin'
# ]

In [4]:
## way to get all paths for all files without knowing their actual names
for i in base_file_path.glob("*.csv"):
    print(i)

/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Skin_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Hearing loss_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Cardiac_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Skeletal_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Cancer_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_DD_2025-03-13.csv
/Users/colleenxu/Desktop/EBIgene2pheno_files/From_Website/G2P_Eye_2025-03-13.csv
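
The glob approach can be taken one step further to load every file without hardcoding the date. This is a minimal sketch, assuming the filenames keep the `G2P_<panel>_<YYYY-MM-DD>.csv` pattern seen above; `load_panels` and `FILENAME_PATTERN` are hypothetical names, not part of any library.

```python
import pathlib
import re

import pandas as pd

# Assumed filename pattern: G2P_<panel>_<YYYY-MM-DD>.csv
FILENAME_PATTERN = re.compile(r"^G2P_(?P<panel>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.csv$")


def load_panels(folder):
    """Read every matching CSV in `folder` into a dict keyed by panel name."""
    frames = {}
    for path in sorted(pathlib.Path(folder).glob("*.csv")):
        match = FILENAME_PATTERN.match(path.name)
        if match is None:
            continue  # skip files that don't follow the assumed pattern
        frames[match.group("panel")] = pd.read_csv(path)
    return frames
```

Calling `load_panels(base_file_path)` would give keys like `"Cancer"` and `"Hearing loss"`, which matches the commented-out `unique_filename_substring` list.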


In [5]:
## checking that the files' columns are the same (only comparing cancer vs skin here): they are
cancer.columns == skin.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])
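
The cell above only compares two of the seven dataframes. A small helper could run the same check across all of them. This is a sketch; `columns_match` is a hypothetical name, and it expects a dict of dataframes keyed by panel name.

```python
import pandas as pd


def columns_match(frames):
    """Return True if every dataframe has the same column names in the same order."""
    reference = None
    for frame in frames.values():
        columns = list(frame.columns)
        if reference is None:
            reference = columns  # first dataframe sets the expected columns
        elif columns != reference:
            return False
    return True
```

Comparing plain lists also sidesteps the error that an elementwise `==` on two `Index` objects raises when their lengths differ.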

In [6]:
cancer.columns

Index(['g2p id', 'gene symbol', 'gene mim', 'hgnc id', 'previous gene symbols',
       'disease name', 'disease mim', 'disease MONDO', 'allelic requirement',
       'cross cutting modifier', 'confidence', 'variant consequence',
       'variant types', 'molecular mechanism',
       'molecular mechanism categorisation', 'molecular mechanism evidence',
       'phenotypes', 'publications', 'panel', 'comments',
       'date of last review'],
      dtype='object')

In [18]:
## there's 1 gene-disease pair duplicate
cancer[cancer.duplicated(subset=['gene symbol', 'disease name'], keep=False)]

Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review
92,G2P03361,RTEL1,608833,15888,BK3184A7.3; C20ORF41; DKFZP434C013; KIAA1088; ...,RTEL1-related dyskeratosis congenita,,MONDO:0009136,monoallelic_autosomal,,moderate,absent gene product; altered gene product stru...,,loss of function,inferred,,HP:0030438; HP:0030442; HP:0004808; HP:0012182,23329068; 23453664,DD; Skin; Cancer,,2024-12-24 10:10:42+00:00
107,G2P03386,RTEL1,608833,15888,BK3184A7.3; C20ORF41; DKFZP434C013; KIAA1088; ...,RTEL1-related dyskeratosis congenita,,MONDO:0009136,biallelic_autosomal,,definitive,absent gene product; altered gene product stru...,,loss of function,evidence,23329068 -> functional_alteration: patient cel...,HP:0030438; HP:0030442; HP:0004808; HP:0012182,23329068; 23453664; 23591994; 23959892,DD; Skin; Cancer,,2025-03-05 15:53:21+00:00


In [7]:
cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 21 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   g2p id                              125 non-null    object
 1   gene symbol                         125 non-null    object
 2   gene mim                            125 non-null    int64 
 3   hgnc id                             125 non-null    int64 
 4   previous gene symbols               121 non-null    object
 5   disease name                        125 non-null    object
 6   disease mim                         68 non-null     object
 7   disease MONDO                       43 non-null     object
 8   allelic requirement                 125 non-null    object
 9   cross cutting modifier              5 non-null      object
 10  confidence                          125 non-null    object
 11  variant consequence                 125 non-null    object

In [19]:
cancer.head()

Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review
0,G2P00027,BRIP1,605882,20473,BACH1; FANCJ; OF,BRIP1-related Fanconi anemia,609054,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0003221; HP:0009778; HP:0001511; HP:0001263...,16116424,DD; Cancer,,2025-01-28 23:03:26+00:00
1,G2P00049,XPC,613208,12816,RAD4; XPCC,XPC-related xeroderma pigmentosum,278720,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0004334; HP:0012056; HP:0000491; HP:0002671...,8298653; 14662655; 11511294; 11121128; 9804340...,DD; Skin; Cancer,,2017-09-01 16:19:16+00:00
2,G2P00165,NOP10,606471,14378,MGC70651; NOLA3; NOP10P,NOP10-related dyskeratosis congenita,224230,MONDO:0009136,biallelic_autosomal,,moderate,absent gene product; altered gene product stru...,missense_variant; inframe_deletion; inframe_in...,undetermined,inferred,,HP:0009926; HP:0000691; HP:0002745; HP:0002206...,17507419,DD; Cancer,,2015-07-22 16:14:19+00:00
3,G2P00177,DDB2,600811,2718,DDBB; FLJ34321; UV-DDB2; XPE,"DDB2-related xeroderma pigmentosum, group E, d...",278740,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0004334; HP:0000491; HP:0002671; HP:0000509...,8798680; 10469312; 12812979,DD; Skin; Cancer,,2017-09-01 16:19:15+00:00
4,G2P00181,ERCC3,133510,3435,BTF2; RAD25; SSL2; XPB,"ERCC3-related xeroderma pigmentosum, group B",610651,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0004315; HP:0001480; HP:0002135; HP:0011400...,16947863; 8408834; 4811796,DD; Skin; Cancer,,2024-08-20 14:10:55+00:00


In [37]:
cancer["disease name"].nunique()

124

In [36]:
cancer["disease mim"].nunique()
cancer["disease mim"].unique()

64

array(['609054', '278720', '224230', '278740', '610651', '162200',
       '158350', '615272', '133701', '162300', '613988', '133700',
       '614038', '278780', '613987', '276300', '191100', '305000',
       '208900', '613325', '175050', '278700', '109400', '610832',
       '609053', '613254', nan, '614327', '174900', '137215', '601626',
       '609265', '132700', '613244', '150800', '135150', '131100',
       '605074', '101000', '614337', '160980', '180200', 'Orphanet:1332',
       '171400', '148500', '601399', '601650', '606764', '606864',
       '605373', '151623', '193300', '613013', '145001', '171300',
       '162091', '614165', 'Orphanet:201', '616553', '613014', '175200',
       '278750', '228550', '613989', '616353'], dtype=object)

In [39]:
cancer[cancer["disease mim"] == "Orphanet:201"]

Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review
85,G2P03354,SEC23B,610512,10702,CDA-II; CDAII; CDAN2; HEMPAS,SEC23B-related Cowden syndrome,Orphanet:201,,monoallelic_autosomal,,limited,uncertain,,gain of function,inferred,,HP:0012056; HP:0500009; HP:0012114; HP:0012846...,26522472,Cancer,,2022-11-30 08:49:50+00:00


### Organizing data into documents/individual association objects:

NOW ON:
* "disease mim" -> sometimes there are Orphanet IDs mixed in (Orphanet:1332, Orphanet:201). I checked both IDs already: these are correct mappings for the disease, just not always OMIM.



Gene subject section:
* "gene symbol": str, "HGNC-assigned gene symbol" => hgnc_gene_symbol 
  * optional check for ;-delimiter, raise error if there is one?
* "gene mim": OMIM ID for gene. NodeNorm does recognize these => gene_omim_id. Turn into str
  * optional check if loaded as int (means format is correct): if not, raise error and report unique values so we can see what's going on
* "hgnc id" => turn into str
  * optional check: same as "gene mim" 
* "previous gene symbols": looks like "; "-delimited. => turn into array of str. Probably ";"-split and then strip whitespace. 
  * CHECK for NA, don't build field
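
The gene subject notes above could be sketched roughly like this. The output keys `hgnc_id` and `previous_gene_symbols` are my placeholder names (only `hgnc_gene_symbol` and `gene_omim_id` are decided above), and both function names are hypothetical.

```python
import pandas as pd


def check_integer_column(df, column):
    """Raise if `column` was not loaded as an integer dtype, reporting its unique values."""
    if not pd.api.types.is_integer_dtype(df[column]):
        values = df[column].unique().tolist()
        raise ValueError(f"{column!r} was not loaded as int; unique values: {values}")


def build_gene_subject(row):
    """Build the gene part of a document from one row (a pandas Series)."""
    symbol = row["gene symbol"]
    if ";" in symbol:
        raise ValueError(f"Unexpected ';' in gene symbol: {symbol!r}")
    gene = {
        "hgnc_gene_symbol": symbol,
        "gene_omim_id": str(row["gene mim"]),
        "hgnc_id": str(row["hgnc id"]),  # placeholder key name
    }
    previous = row["previous gene symbols"]
    if pd.notna(previous):
        # ";"-split, then strip whitespace; the field is skipped entirely when NA
        gene["previous_gene_symbols"] = [s.strip() for s in previous.split(";")]
    return gene
```

`check_integer_column` is meant to run once per dataframe for "gene mim" and "hgnc id" before iterating over rows.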


Association section:
* "g2p id": str, resource's unique stable ID for each record/row => g2p_record_id
  * USE TO GENERATE NEW FIELD: resource webpage url for record/association. Like https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03386. => probably best method: create new column. Then when iterating by row, simple assignment to g2p_record_url
    * this may be brittle/work to maintain, but I know UI really wants stuff like this -> create new column to generate these urls
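
A minimal sketch of the new-column approach, assuming the URL pattern stays `https://www.ebi.ac.uk/gene2phenotype/lgd/<g2p id>` as in the example above. `add_record_url` and `G2P_RECORD_URL_PREFIX` are hypothetical names.

```python
import pandas as pd

# Assumption: record pages keep this URL pattern
G2P_RECORD_URL_PREFIX = "https://www.ebi.ac.uk/gene2phenotype/lgd/"


def add_record_url(df):
    """Return a copy of `df` with a 'g2p_record_url' column built from 'g2p id'."""
    out = df.copy()
    out["g2p_record_url"] = G2P_RECORD_URL_PREFIX + out["g2p id"]
    return out
```

Keeping the prefix in one constant means only one line needs to change if the website's URL scheme changes.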
  
Disease object section:
* "disease name"
* "disease mim": OMIM ID for disease. But there are some Orphanet IDs mixed in => disease_omim_id, disease_orphanet_id
  * CHECK if loaded as a number (format/all omim IDs; pandas will load it as float rather than int if there are any NAs): if not, raise warning and report unique values so we can see what's going on
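
One possible way to route each "disease mim" value into the right field, as a sketch. It assumes the only non-OMIM values look like `Orphanet:<digits>` (the two seen in the Cancer file), and that the output keeps the bare number without a prefix. `split_disease_id` is a hypothetical name.

```python
import numbers
import warnings

import pandas as pd


def split_disease_id(value):
    """Map one 'disease mim' value to disease_omim_id or disease_orphanet_id."""
    if pd.isna(value):
        return {}  # no ID for this disease
    if isinstance(value, numbers.Real) and float(value).is_integer():
        text = str(int(value))  # column loaded as int/float, e.g. 609054.0
    else:
        text = str(value).strip()
    if text.isdigit():
        return {"disease_omim_id": text}
    prefix = "Orphanet:"
    if text.startswith(prefix) and text[len(prefix):].isdigit():
        return {"disease_orphanet_id": text[len(prefix):]}
    # anything else is unexpected: warn so the value can be inspected
    warnings.warn(f"Unrecognized 'disease mim' value: {value!r}")
    return {}
```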

Don't use:

General Ideas:
* each row = unique gene-disease-allelic requirement combo => turn into 1 association object
  * "G2P records are Locus-Genotype-Mechanism-Disease-Evidence (LGMDE) threads describing gene-disease associations" -> this is what a unique row is from resource POV (ref: homepage table "Total LGMDE Records" info (i) button)
* sometimes disease doesn't have any IDs for it => still make document/object but won't use in Translator unless it has ID

  
* if I load all data into 1 dataframe, there will be duplicate rows because the gene-disease pair is in several panels (aka disease falls into multiple categories). Note that "panel" is a "; "-delimited string (e.g. "DD; Skin; Cancer") => dropping duplicates will be needed (but be careful to include all fields that matter for a unique row)
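
The single-dataframe idea might look like the sketch below. It assumes a record that appears in several panel files is identical in all of them (so full-row de-duplication is enough), then uses "g2p id" as a sanity check. `combine_panels` is a hypothetical name.

```python
import pandas as pd


def combine_panels(frames):
    """Concatenate panel dataframes and drop rows that are exact duplicates."""
    combined = pd.concat(list(frames), ignore_index=True)
    # compare on every column, so no field that matters is left out
    combined = combined.drop_duplicates().reset_index(drop=True)
    # sanity check: each G2P record ID should now appear exactly once
    repeated = combined[combined.duplicated(subset="g2p id", keep=False)]
    if not repeated.empty:
        ids = sorted(repeated["g2p id"].unique())
        raise ValueError(f"Same g2p id with differing rows: {ids}")
    return combined
```

If the sanity check ever fails, the same record differs between panel files and the assumption needs revisiting.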

## BioThings Parser requirements

* parser feeds into upload step -> function should be a generator that iterates through all records -> for each, yield the document/object as json


* output needs to be json dump (object -> `_id` key (should , then value is the document) 
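
A rough sketch of the generator shape described above. Using the G2P record ID as `_id` is my assumption here, not something decided yet; `parse_records` and the `disease_name` key are hypothetical names, and only a few fields are included to keep it short.

```python
import json

import pandas as pd


def parse_records(df):
    """Yield one JSON-serializable document per row of `df`."""
    for _, row in df.iterrows():
        yield {
            "_id": row["g2p id"],  # assumption: G2P record ID as the document ID
            "g2p_record_id": row["g2p id"],
            "hgnc_gene_symbol": row["gene symbol"],
            "gene_omim_id": str(row["gene mim"]),
            "disease_name": row["disease name"],  # placeholder key name
        }
```

Each yielded document can be checked with `json.dumps(doc)` to confirm it serializes cleanly.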