# Notebook for parser development

In [1]:
## not for parser. for notebook only 

## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Loading data

Current approach: load all files into 1 pandas dataframe (currently (4714, 21)). Then I can:
1. check the duplicates situation (key columns vs all columns) and raise errors if need be
2. remove duplicates before generating documents
3. do some tasks column-wise over all the data, rather than while iterating over rows
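
A minimal sketch of steps 1 and 2 on toy data. Treating `g2p id` as the only key column and the helper name `check_and_drop_duplicates` are assumptions for illustration, not something decided yet.

```python
import pandas as pd

# Toy data standing in for the real files. Column names follow the G2P header,
# but the values are made up.
toy_df = pd.DataFrame(
    {
        "g2p id": ["G2P00001", "G2P00001", "G2P00002", "G2P00003"],
        "gene symbol": ["AAA", "AAA", "BBB", "CCC"],
        "confidence": ["strong", "strong", "definitive", "limited"],
        "panel": ["DD", "DD", "Cardiac", "Skin"],
    }
)
key_columns = ["g2p id"]  # hypothetical choice of key columns


def check_and_drop_duplicates(df, key_columns):
    """Drop rows that are identical in every column, then raise if any
    key still appears more than once (same key, conflicting values)."""
    deduped = df.drop_duplicates(ignore_index=True)
    conflicting = deduped[deduped.duplicated(subset=key_columns, keep=False)]
    if not conflicting.empty:
        raise ValueError(
            f"{len(conflicting)} rows share {key_columns} but differ in other columns"
        )
    return deduped


clean_df = check_and_drop_duplicates(toy_df, key_columns)
```

Rows that match on every column are dropped silently, while rows that match only on the key columns stop execution so they can be inspected by hand.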

If I used the generator approach (load files 1 by 1, 1 row at a time), I'd have to modify how I do things:
1. Skip the duplicate check/raising errors, but try to mitigate potential "duplicate" issues:
  * Sort all delimited strings
  * Use a hash of all column values (when they're all strings) for `_id`, so rows with all the same values produce the same hash
2. Either leave it to the BioThings toolset to remove duplicates, or keep a set of the `_id` hashes seen so far and skip rows whose hash is already in it.
3. Do the tasks on single rows/chunks (pandas [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) has an iterator for rows/chunks! see iterator/chunksize parameters)
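
A rough sketch of what points 1 and 2 of the generator approach could look like, using only the standard library. The helper names (`normalize_delimited`, `row_to_id`, `unique_docs`), the choice of SHA-256, and the `;` delimiter are assumptions. It also assumes missing values have already been turned into empty strings.

```python
import hashlib


def normalize_delimited(value, delimiter=";"):
    """Sort the items of a delimited string so item order doesn't affect the hash."""
    items = sorted(item.strip() for item in value.split(delimiter) if item.strip())
    return f"{delimiter} ".join(items)


def row_to_id(row):
    """Build a deterministic _id from a dict of column name -> string value.

    Columns are visited in sorted order and each value is normalized first.
    """
    parts = [f"{col}={normalize_delimited(row[col])}" for col in sorted(row)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def unique_docs(rows):
    """Yield one document per distinct row, skipping _id values already seen."""
    seen_ids = set()
    for row in rows:
        doc_id = row_to_id(row)
        if doc_id in seen_ids:
            continue
        seen_ids.add(doc_id)
        yield {"_id": doc_id, **row}


rows = [
    {"gene symbol": "KCNE1", "panel": "DD; Cardiac"},
    {"gene symbol": "KCNE1", "panel": "Cardiac; DD"},  # same items, different order
    {"gene symbol": "DSC2", "panel": "Cardiac"},
]
docs = list(unique_docs(rows))
```

Here the first two rows get the same `_id`, so only the first one is kept. The normalization is used for hashing only, so the yielded document still carries the original column values.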


Notes:
* There are a few existing parsers that use `pandas` to load the entire raw data file at once: https://github.com/search?q=repo%3Abiothings%2Fpending.api%20pandas&type=code
* But there are other existing parsers that use `csv` to load the file **one row at a time** (generator): https://github.com/search?q=repo%3Abiothings%2Fpending.api+csv+reader&type=code
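
For comparison, a sketch of the one-row-at-a-time style using `csv.DictReader` and `gzip`. The function name `iter_rows` and the demo file are made up. With pandas, the analogous option would be `read_csv(..., chunksize=...)`, which yields dataframes chunk by chunk.

```python
import csv
import gzip
import pathlib
import tempfile


def iter_rows(folder_path, data_file_pattern):
    """Yield each row of every matching gzipped csv file as a dict of strings."""
    for file_path in sorted(pathlib.Path(folder_path).glob(data_file_pattern)):
        # "rt" opens the gzip stream in text mode; newline="" is what the csv module expects
        with gzip.open(file_path, mode="rt", newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)


# Small demonstration on a throwaway file (made-up file name, two real-looking rows)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = pathlib.Path(tmp_dir, "DemoG2P.csv.gz")
    with gzip.open(demo_path, mode="wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["g2p id", "gene symbol"])
        writer.writerow(["G2P00124", "KCNE1"])
        writer.writerow(["G2P03247", "DSC2"])
    demo_rows = list(iter_rows(tmp_dir, "*.csv.gz"))
```

`csv.DictReader` returns every value as a string, so there is no `.0` problem with IDs, but any column-wise work has to be redone per row.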

In [2]:
## put into parser: import statements
import pathlib
import pandas as pd
import glob


## unsure on putting into parser: more for notebook viewing/debugging...
pd.options.display.max_columns = None

In [3]:
## notebook code chunk

base_file_path = pathlib.Path.home().joinpath('Desktop', 'EBIgene2pheno_files', 'From_FTP')

## using list works to check if paths were actually found or not
all_file_paths = list(base_file_path.glob("*.csv.gz"))
all_file_paths

## force columns to be str type (if the IDs become float, it introduces .0 to values which is a mistake)
df = pd.concat((pd.read_csv(f, dtype=str) for f in all_file_paths), ignore_index=True)

df.shape
df.head()

[PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/CardiacG2P_2025-02-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/SkeletalG2P_2025-02-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/DDG2P_2025-02-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/SkinG2P_2025-02-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/Hearing_lossG2P_2025-02-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/CancerG2P_2025-02-28.csv.gz'),
 PosixPath('/Users/colleenxu/Desktop/EBIgene2pheno_files/From_FTP/EyeG2P_2025-02-28.csv.gz')]

(4714, 21)

Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review
0,G2P00124,KCNE1,176261,6240,ISK; JLNS2; LQT5; MINK,KCNE1-related Jervell and Lange-Nielsen syndrome,612347.0,,biallelic_autosomal,potential secondary finding,strong,altered gene product structure,missense_variant; inframe_deletion; stop_gaine...,undetermined,inferred,,HP:0000407; HP:0001657; HP:0000007; HP:0001279,30461122,DD; Cardiac,KCNE1-related JLNS is due to altered gene prod...,2024-04-05 12:05:01+00:00
1,G2P00841,PTPN11,176876,9644,BPTP3; NS1; PTP2C; SH-PTP2; SHP-2; SHP2,PTPN11-related Noonan syndrome with multiple l...,151100.0,,monoallelic_autosomal,,definitive,altered gene product structure,missense_variant; inframe_deletion; inframe_in...,undetermined,inferred,,HP:0000325; HP:0002996; HP:0000957; HP:0001709...,27484170; 26377839; 25917897; 25884655; 248207...,DD; Skin; Cardiac,Expert review done on 12/01/2022; Noonan syndr...,2025-01-21 14:56:43+00:00
2,G2P03247,DSC2,125645,3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_deletion; splice_region_variant; misse...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:36:09+00:00
3,G2P03248,DSC2,125645,3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_deletion; splice_region_variant; misse...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:35:19+00:00
4,G2P03249,DSG2,125671,3049,CDHF5,DSG2-related arrhythmogenic right ventricular ...,,MONDO:0012434,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_deletion; missense_variant; stop_gaine...,undetermined,inferred,,,21636032; 33831308; 33917638; 34400560; 240707...,Cardiac,Expert review done on 05/01/2022; DSG2-related...,2024-03-20 09:40:18+00:00


In [3]:
## parser function

def _load_multiple_csv_into_one_df(folder_path, data_file_pattern, dtype_settings):
    """
    Assumptions:
    * all files are csv with the same columns/header so they can be concatenated together easily.
      Compression is fine: pandas can infer it and do on-the-fly decompression: see compression parameter
      in https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv
    
    folder_path: path to folder containing data files. Can be a str (e.g. from os.path) or a pathlib.Path.
    data_file_pattern: str glob pattern (not a regex) that matches all data filenames.
    dtype_settings: pandas dtype or dict of column_name:dtype that pandas should use when reading csv
        
    output: pandas dataframe containing data from all files
    """
    ## turn generator into a list, so then can test if it's empty or not
    ## (wrapping in pathlib.Path so str paths also work with .glob)
    all_file_paths = list(pathlib.Path(folder_path).glob(data_file_pattern))
    
    if all_file_paths:
        return pd.concat((pd.read_csv(f, dtype=dtype_settings) for f in all_file_paths), ignore_index=True)
    else:
        raise FileNotFoundError(f"Can't find files in `{folder_path}` matching `{data_file_pattern}`")

In [8]:
## trying out function: some of this can be used in main execution later

base_file_path = pathlib.Path.home().joinpath('Desktop', 'EBIgene2pheno_files', 'From_FTP')

## Using pattern "*.csv.gz" to get files for all panels. Current plan is to get .csv.gz files from FTP site
## `http://ftp.ebi.ac.uk/pub/databases/gene2phenotype/`

## force columns to be str type (if the IDs become float, it introduces .0 to values which is a mistake)

df_from_funct = _load_multiple_csv_into_one_df(folder_path=base_file_path, 
                                               data_file_pattern="*.csv.gz", 
                                               dtype_settings=str)

df_from_funct.shape
df_from_funct.info()

(4714, 21)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4714 entries, 0 to 4713
Data columns (total 21 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   g2p id                              4714 non-null   object
 1   gene symbol                         4714 non-null   object
 2   gene mim                            4712 non-null   object
 3   hgnc id                             4714 non-null   object
 4   previous gene symbols               4241 non-null   object
 5   disease name                        4714 non-null   object
 6   disease mim                         3574 non-null   object
 7   disease MONDO                       638 non-null    object
 8   allelic requirement                 4714 non-null   object
 9   cross cutting modifier              629 non-null    object
 10  confidence                          4714 non-null   object
 11  variant consequence                 4693 non-null   obje

In [13]:
# Look at DF notes: decide whether to make variable for dtype_settings or keep ingesting all as strings.
## -> compare memory usage? 

df_diff_dtypes = _load_multiple_csv_into_one_df(folder_path=base_file_path, 
                                               data_file_pattern="*.csv.gz", 
                                               dtype_settings={
                                                   "gene mim": str, 
                                                   "hgnc id": str, 
                                                   "disease mim": str,
                                                   
                                                   "allelic requirement": "category",
                                                   "confidence": "category"
                                               })

df_diff_dtypes.shape
df_diff_dtypes.info()
df_diff_dtypes.head()

(4714, 21)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4714 entries, 0 to 4713
Data columns (total 21 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   g2p id                              4714 non-null   object
 1   gene symbol                         4714 non-null   object
 2   gene mim                            4712 non-null   object
 3   hgnc id                             4714 non-null   object
 4   previous gene symbols               4241 non-null   object
 5   disease name                        4714 non-null   object
 6   disease mim                         3574 non-null   object
 7   disease MONDO                       638 non-null    object
 8   allelic requirement                 4714 non-null   object
 9   cross cutting modifier              629 non-null    object
 10  confidence                          4714 non-null   object
 11  variant consequence                 4693 non-null   obje

Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review
0,G2P00124,KCNE1,176261,6240,ISK; JLNS2; LQT5; MINK,KCNE1-related Jervell and Lange-Nielsen syndrome,612347.0,,biallelic_autosomal,potential secondary finding,strong,altered gene product structure,missense_variant; inframe_deletion; stop_gaine...,undetermined,inferred,,HP:0000407; HP:0001657; HP:0000007; HP:0001279,30461122,DD; Cardiac,KCNE1-related JLNS is due to altered gene prod...,2024-04-05 12:05:01+00:00
1,G2P00841,PTPN11,176876,9644,BPTP3; NS1; PTP2C; SH-PTP2; SHP-2; SHP2,PTPN11-related Noonan syndrome with multiple l...,151100.0,,monoallelic_autosomal,,definitive,altered gene product structure,missense_variant; inframe_deletion; inframe_in...,undetermined,inferred,,HP:0000325; HP:0002996; HP:0000957; HP:0001709...,27484170; 26377839; 25917897; 25884655; 248207...,DD; Skin; Cardiac,Expert review done on 12/01/2022; Noonan syndr...,2025-01-21 14:56:43+00:00
2,G2P03247,DSC2,125645,3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_deletion; splice_region_variant; misse...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:36:09+00:00
3,G2P03248,DSC2,125645,3036,CDHF2; DSC3,DSC2-related arrhythmogenic right ventricular ...,,MONDO:0012506,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_deletion; splice_region_variant; misse...,undetermined,inferred,,,31028357; 23911551; 21636032; 33831308; 263105...,Cardiac,Expert review done on 05/01/2022; DSC2-related...,2024-03-20 09:35:19+00:00
4,G2P03249,DSG2,125671,3049,CDHF5,DSG2-related arrhythmogenic right ventricular ...,,MONDO:0012434,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,inframe_deletion; missense_variant; stop_gaine...,undetermined,inferred,,,21636032; 33831308; 33917638; 34400560; 240707...,Cardiac,Expert review done on 05/01/2022; DSG2-related...,2024-03-20 09:40:18+00:00
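
One way to answer the memory question is `DataFrame.memory_usage(deep=True)`. A small sketch on toy data follows. The repeated values are made up to mimic a low-cardinality column such as `allelic requirement`. The real comparison would call it on `df_from_funct` and `df_diff_dtypes` instead.

```python
import pandas as pd

# Toy frame: a low-cardinality string column repeated many times
values = ["monoallelic_autosomal", "biallelic_autosomal"] * 1000
as_str = pd.DataFrame({"allelic requirement": values})
as_category = as_str.astype({"allelic requirement": "category"})

# deep=True measures the stored strings themselves, not just references to them
bytes_str = as_str.memory_usage(deep=True).sum()
bytes_category = as_category.memory_usage(deep=True).sum()
print(f"str: {bytes_str} bytes, category: {bytes_category} bytes")
```

The category version stores each distinct string once plus small integer codes, so it should come out smaller for columns with few distinct values. Whether the saving matters at roughly 4,700 rows is a separate question.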


Questions I could ask Johnathan:
* Can I write functions so they raise errors / stop execution? I see other parsers with "raise" statements
* What path should the actual parser use for the data files (csv.gz)? Does it depend on how the dumper works?
* What does the main execution section look like? Sepid mAB doesn't have one (just a load_data function), whereas other parsers have `if __name__ == "__main__":`