**Table of contents**<a id='toc0_'></a>    
- [Gene Symbol Capture Data Generation](#toc1_)    
    - [Annotating the relationship between a gene symbol and a gene concept using a descriptive database](#toc1_1_1_)    
  - [Download gene records from ENSG, HGNC, and NCBI](#toc1_2_)    
      - [This subset file was created in the alias-primary collision analysis notebook by the following modifications:](#toc1_2_1_1_)    
  - [Combine data from all sources](#toc1_3_)    
  - [Group the associated data by primary gene symbol- alias symbol pairs](#toc1_4_)    
      - [This ensures that there are no duplicate primary gene symbol- alias symbol pairs while preserving the sources in which each pair occurs](#toc1_4_1_1_)    
- [Ortholog Capture](#toc2_)    
  - [Download an Ensembl Biomart export file with the Gene Name and the Ortholog Gene Name](#toc2_1_)    
  - [Make all of the gene symbols all caps](#toc2_2_)    
  - [Match aliases to orthologs!](#toc2_3_)    
    - [Drosophila melanogaster](#toc2_3_1_)    
    - [Mouse](#toc2_3_2_)    
    - [(1)Abingdon island giant tortoise](#toc2_3_3_)    
    - [(1)African ostrich](#toc2_3_4_)    
    - [(1)Algerian mouse](#toc2_3_5_)    
    - [(2)Alpaca](#toc2_3_6_)    
    - [(2)Alpine marmot](#toc2_3_7_)    
    - [(2)Amazon molly](#toc2_3_8_)    
    - [(2)American bison](#toc2_3_9_)    
    - [(2)American black bear](#toc2_3_10_)    
    - [(3)American mink](#toc2_3_11_)    
    - [(3)Arabian camel](#toc2_3_12_)    
    - [(3)Arctic ground squirrel](#toc2_3_13_)    
    - [(3)Argentine black and white tegu](#toc2_3_14_)    
    - [(3)Armadillo](#toc2_3_15_)    
    - [(3)Asian bonytongue](#toc2_3_16_)    
    - [(4)Atlantic cod](#toc2_3_17_)    
    - [(4)Atlantic herring](#toc2_3_18_)    
    - [(4)Atlantic salmon](#toc2_3_19_)    
    - [(4)Australian saltwater crocodile](#toc2_3_20_)    
    - [(4)Ballan wrasse](#toc2_3_21_)    
    - [(5)Barramundi perch](#toc2_3_22_)    
    - [(5)Beluga whale](#toc2_3_23_)    
    - [(5)Bicolor damselfish](#toc2_3_24_)    
    - [(5)Black snub-nosed monkey](#toc2_3_25_)    
    - [(5)Blue whale](#toc2_3_26_)    
    - [(6) Blue-ringed sea krait](#toc2_3_27_)    
    - [(6) Burton's mouthbrooder](#toc2_3_28_)    
    - [(6) C.intestinalis](#toc2_3_29_)    
    - [(6) C.savignyi](#toc2_3_30_)    
    - [(6) Caenorhabditis elegans (Nematode, N2)](#toc2_3_31_)    
    - [(7) Cat](#toc2_3_32_)    
    - [(7) Chacoan peccary](#toc2_3_33_)    
    - [(7) Channel bull blenny](#toc2_3_34_)    
    - [(7) Channel catfish](#toc2_3_35_)    
    - [(7) Chicken](#toc2_3_36_)    
    - [(8) Chimpanzee](#toc2_3_37_)    
    - [(8) Chinese hamster CHOK1GS](#toc2_3_38_)    
    - [(8) Chinese medaka](#toc2_3_39_)    
    - [(8) Chinese softshell turtle](#toc2_3_40_)    
    - [(8) Chinook salmon](#toc2_3_41_)    
    - [(9) Climbing perch](#toc2_3_42_)    
    - [(9) Clown anemonefish](#toc2_3_43_)    
    - [(9) Coelacanth](#toc2_3_44_)    
    - [(9) Coho salmon](#toc2_3_45_)    
    - [(9) Collared flycatcher](#toc2_3_46_)    
    - [(10) Common canary](#toc2_3_47_)    
    - [(10) Common carp](#toc2_3_48_)    
    - [(10) Common wall lizard](#toc2_3_49_)    
    - [(10) Common wombat](#toc2_3_50_)    
    - [(10) Coquerel's sifaka](#toc2_3_51_)    
  - [Convert ortholog_analysis_dfs to csv for use in other notebooks](#toc2_4_)    
- [HGNC Previous Symbol Capture](#toc3_)    
  - [Download the HGNC custom download including the gene symbol, ID, and previous symbols](#toc3_1_)    
  - [Remove all genes with no previous symbols](#toc3_2_)    
  - [Explode the previous symbols so that it is only one symbol per row](#toc3_3_)    
  - [Make all of the gene symbols all caps](#toc3_4_)    
  - [Match aliases to previous symbols!](#toc3_5_)    
- [FLJ Clone Name Capture](#toc4_)    
  - [Download the FLJ database file including FLJ IDs](#toc4_1_)    
- [Gene Family Symbol Capture](#toc5_)    
- [Disorder/Disease Symbol Capture](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Gene Symbol Capture Data Generation](#toc0_)

### <a id='toc1_1_1_'></a>[Annotating the relationship between a gene symbol and a gene concept using a descriptive database](#toc0_)

Each relationship type gets its own table. Orthologs are a special case: each group of species gets its own table. The subset created for a relationship contains only the symbols for which that relationship is TRUE.

In [3]:
import pandas as pd
import numpy as np
import re

In [4]:
def remove_nan_from_set(s):
    """Remove null instances from set
    
    :param s: selected set
    :return: set with no null values
    """
    return {x for x in s if pd.notna(x)}

In [5]:
def read_subset_genes_csv(location):
    """Read a CSV of primary gene symbol- alias symbol pairs into a df and upper-case both symbols

    :param location: file location
    :return: a df of gene records
    """
    
    subset_genes_xxxx_df = pd.read_csv(
        location, index_col=[0],dtype={"NCBI_ID": str,"HGNC_ID":str})
    subset_genes_xxxx_df['primary_gene_symbol'] = subset_genes_xxxx_df['gene_symbol'].str.upper()
    subset_genes_xxxx_df.drop(["gene_symbol"], axis=1, inplace=True)
    subset_genes_xxxx_df['alias_symbol'] = subset_genes_xxxx_df['alias_symbol'].str.upper()
    return subset_genes_xxxx_df

In [6]:
def convert_all_columns_to_uppercase(df):
    """Convert gene symbols to all-caps. Different species have differing capitalization requirements and this will standardize.

    :param df: DataFrame containing gene symbols of unknown capitalizations
    :return: a DataFrame with all gene symbols all-caps
    """
    df = df.copy()  # work on a copy so a sliced input is not mutated (avoids SettingWithCopyWarning)
    for column in df.columns:
        if df[column].dtype == 'object':  # Check if the column type is object
            df[column] = df[column].str.upper()

    return df

In [7]:
def combine_rows(series):
    """Combine duplicate rows.

    :param series: a Pandas Series containing values from a DataFrame column.
                    this Series may contain NaN values, and the function will
                    return the first non-null value, or None if all values are NaN.
    :return: combined value from the Series, or None if the Series is empty or contains only NaNs.
    """
    return series.ffill().bfill().drop_duplicates().values[0] if not series.dropna().empty else None
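
As a rough illustration of what this helper is used for, the sketch below collapses duplicate `Gene name` rows by keeping the first non-null value per column. The helper name `first_non_null` and the toy data are assumptions made for this example; it is intended to behave like `combine_rows`, not to replace it.

```python
import numpy as np
import pandas as pd


def first_non_null(series):
    """Return the first non-null value in the Series, or None if there is none."""
    non_null = series.dropna()
    return non_null.iloc[0] if not non_null.empty else None


# Hypothetical toy table with duplicated human gene names
toy_df = pd.DataFrame({
    "Gene name": ["5S_RRNA", "5S_RRNA", "A1BG", "A1BG"],
    "Mouse gene name": [np.nan, np.nan, np.nan, "A1BG"],
})

# One row per gene name; each other column keeps its first non-null value
collapsed = toy_df.groupby("Gene name", as_index=False).agg(first_non_null)
print(collapsed)
```

Genes whose ortholog column is entirely empty stay null after collapsing, so they can never produce a match later on.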

In [8]:
def make_col_ortholog_match(recording_df, source_df, animal: str):
    """Check for ortholog matches in the primary gene symbol- alias symbol pairs. 
    Adds a T/F column for each pair: T if the alias is an ortholog from the specified animal and F if not.

    :param recording_df: df that contains the primary gene symbol- alias symbol pairs
    :param source_df: df that contains the orthologs and their associated human genes
    :param animal: the animal from which the orthologs are being checked
    :return: a copy of recording_df with an added '{animal} Match' T/F column
    """
    df = recording_df.copy()
    df[f'{animal} Match'] = df.apply(lambda row: 
                            any((source_df['Gene name'] == row['primary_gene_symbol']) 
                                & 
                                (source_df[f'{animal} gene name'] == row['alias_symbol'])), axis=1)
    print(f"Added column: {animal} Match")
    return df
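
The row-wise `apply` above compares every pair against the whole source table, which can be slow on tens of thousands of pairs. A possible alternative, sketched here under the assumption that the column names match those used in this notebook, is a left merge with `indicator=True`. The function name `add_ortholog_match_column` and the toy data are hypothetical and not part of the original analysis.

```python
import pandas as pd


def add_ortholog_match_column(recording_df, source_df, animal):
    """Vectorized sketch: flag pairs whose alias equals the animal's ortholog name."""
    ortholog_col = f"{animal} gene name"
    # Unique (human gene, ortholog) pairs, renamed to line up with recording_df
    pairs = (
        source_df[["Gene name", ortholog_col]]
        .dropna()
        .drop_duplicates()
        .rename(columns={"Gene name": "primary_gene_symbol",
                         ortholog_col: "alias_symbol"})
    )
    # A left merge keeps every recording row once because the pairs are unique
    merged = recording_df.merge(
        pairs, on=["primary_gene_symbol", "alias_symbol"],
        how="left", indicator=True,
    )
    out = recording_df.copy()
    out[f"{animal} Match"] = (merged["_merge"] == "both").to_numpy()
    return out


# Hypothetical toy data (already upper-cased)
rec = pd.DataFrame({
    "primary_gene_symbol": ["A1BG", "TP53", "TP53"],
    "alias_symbol": ["A1B", "TRP53", "P53"],
})
src = pd.DataFrame({
    "Gene name": ["TP53", "A1BG", "TP53"],
    "Mouse gene name": ["TRP53", None, "TRP53"],
})
result = add_ortholog_match_column(rec, src, "Mouse")
print(result)
```

Whether this is faster in practice depends on the data, so it is worth checking on a small slice that it yields the same column as `make_col_ortholog_match` before switching.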

In [9]:
def match_alias_to_ortholog(og_recording_df, source_df):
    """Apply the make_col_ortholog_match function to all animal columns in the DataFrame.

    :param og_recording_df: DataFrame containing the primary gene symbol- alias symbol pairs
    :param source_df: DataFrame containing the orthologs and their associated human genes
    :return: a DataFrame with all match columns added, and a dict of the number of True matches per animal
    """
    source_df = source_df.dropna(subset=['Gene name'])

    source_df = convert_all_columns_to_uppercase(source_df)
    
    source_df = source_df.groupby('Gene name', as_index=False).agg(combine_rows)

    recording_df = og_recording_df.copy()
    recording_df.columns = recording_df.columns.str.strip().str.replace(r'\s+', ' ', regex=True)
    source_df.columns = source_df.columns.str.strip().str.replace(r'\s+', ' ', regex=True)

    animal_columns = [col for col in source_df.columns if 'gene name' in col and col != 'Gene name']
    
    true_counts = {}

    for animal in animal_columns:
        # Extract the animal name
        animal_name = animal.replace(' gene name', '')
        recording_df = make_col_ortholog_match(recording_df, source_df, animal_name)

        true_count = recording_df[f'{animal_name} Match'].sum()
        true_counts[animal_name] = true_count
    return recording_df, true_counts

## <a id='toc1_2_'></a>[Download gene records from ENSG, HGNC, and NCBI](#toc0_)

#### <a id='toc1_2_1_1_'></a>[This subset file was created in the alias-primary collision analysis notebook by the following modifications:](#toc0_)
 - Gene records with no aliases were removed.
 - Primary gene symbol- alias symbol pairs where the alias was an exact match to the primary symbol were removed.
 - Primary gene symbol- alias symbol pairs that were duplicated were removed.
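
The three modifications listed above might look roughly like the following. This is a minimal sketch on hypothetical toy records; the actual steps in the alias-primary collision analysis notebook may differ (for example in how case is handled).

```python
import pandas as pd

# Hypothetical raw records; column names mirror those read in this notebook
raw_df = pd.DataFrame({
    "gene_symbol": ["A1BG", "A1BG", "A1BG", "TP53", "ZZZ3"],
    "alias_symbol": ["A1B", "A1B", "A1BG", "P53", None],
})

subset_df = (
    raw_df
    # 1. remove gene records with no alias
    .dropna(subset=["alias_symbol"])
    # 2. remove pairs where the alias is an exact match to the primary symbol
    .loc[lambda d: d["alias_symbol"] != d["gene_symbol"]]
    # 3. remove duplicated primary symbol / alias pairs
    .drop_duplicates(subset=["gene_symbol", "alias_symbol"])
    .reset_index(drop=True)
)
print(subset_df)
```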

In [8]:
subset_genes_ensg_df = read_subset_genes_csv("../output/subset_genes_ensg_df.csv")
subset_genes_hgnc_df = read_subset_genes_csv("../output/subset_genes_hgnc_df.csv")
subset_genes_ncbi_df = read_subset_genes_csv("../output/subset_genes_ncbi_df.csv")

## <a id='toc1_3_'></a>[Combine data from all sources](#toc0_)

In [9]:
subset_genes_df = pd.concat([subset_genes_ensg_df, subset_genes_hgnc_df, subset_genes_ncbi_df], axis=0)
subset_genes_df

Unnamed: 0,ENSG_ID,alias_symbol,HGNC_ID,NCBI_ID,primary_gene_symbol
0,ENSG00000210049,MTTF,7481,,MT-TF
1,ENSG00000210049,TRNF,7481,,MT-TF
2,ENSG00000211459,12S,7470,,MT-RNR1
3,ENSG00000211459,MOTS-C,7470,,MT-RNR1
4,ENSG00000211459,MTRNR1,7470,,MT-RNR1
...,...,...,...,...,...
190961,,GLTC,56861,131840634,GLTC1
193342,,ARBAG,40248,132532400,GABRA6-AS1
193377,,ARGI,56890,133395150,LNCARGI
193378,,MP31,55481,133834869,MLDHR


## <a id='toc1_4_'></a>[Group the associated data by primary gene symbol- alias symbol pairs](#toc0_)

#### <a id='toc1_4_1_1_'></a>[This ensures that there are no duplicate primary gene symbol- alias symbol pairs while preserving the sources in which each pair occurs](#toc0_)

In [10]:
subset_genes_df = subset_genes_df.rename(columns={'gene_symbol': 'primary_gene_symbol'})

In [11]:
subset_genes_df = subset_genes_df.groupby(['primary_gene_symbol',"alias_symbol"], as_index=False).agg({
    "HGNC_ID": lambda x: set(x),
    'ENSG_ID': lambda x: set(x),
    'NCBI_ID': lambda x: set(x),
})
subset_genes_df

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID
0,A-GAMMA3'E,A-GAMMA-E,{nan},{nan},{109951028}
1,A1BG,A1B,{5},{ENSG00000121410},{1}
2,A1BG,ABG,{5},{ENSG00000121410},{1}
3,A1BG,GAB,{5},{ENSG00000121410},{1}
4,A1BG,HYST2477,{5},{ENSG00000121410},{1}
...,...,...,...,...,...
86546,ZZEF1,FLJ10821,{29027},{ENSG00000074755},{23140}
86547,ZZEF1,KIAA0399,{29027},{ENSG00000074755},{23140}
86548,ZZEF1,ZZZ4,{29027},{ENSG00000074755},{23140}
86549,ZZZ3,ATAC1,{24523},{ENSG00000036549},{26009}


In [12]:
subset_genes_df['NCBI_ID'] = subset_genes_df['NCBI_ID'].apply(remove_nan_from_set)
subset_genes_df['ENSG_ID'] = subset_genes_df['ENSG_ID'].apply(remove_nan_from_set)
subset_genes_df['HGNC_ID'] = subset_genes_df['HGNC_ID'].apply(remove_nan_from_set)
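
To make the grouping step concrete, here is a small sketch on hypothetical toy rows: identical pairs coming from different sources collapse into one row, and the ID sets record which sources supplied the pair once the null entries are stripped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: the pair (A1BG, A1B) appears once per source
toy = pd.DataFrame({
    "primary_gene_symbol": ["A1BG", "A1BG", "A1BG"],
    "alias_symbol": ["A1B", "A1B", "ABG"],
    "HGNC_ID": ["5", np.nan, "5"],
    "NCBI_ID": [np.nan, "1", np.nan],
})

# Collapse duplicate pairs, gathering every ID seen for the pair into a set
grouped = toy.groupby(["primary_gene_symbol", "alias_symbol"], as_index=False).agg({
    "HGNC_ID": lambda x: set(x),
    "NCBI_ID": lambda x: set(x),
})

# Drop the NaN placeholders left by sources that lack the pair
for col in ["HGNC_ID", "NCBI_ID"]:
    grouped[col] = grouped[col].apply(lambda s: {x for x in s if pd.notna(x)})

print(grouped)
```

An empty set then means that the corresponding source did not contribute an ID for that pair.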

In [13]:
subset_genes_df.loc[subset_genes_df["primary_gene_symbol"]=="BICDL3P"]

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID
6181,BICDL3P,ABHD11-AS1,{18289},{ENSG00000225969},{}
6182,BICDL3P,LINC00035,{18289},{ENSG00000225969},{}
6183,BICDL3P,NCRNA00035,{18289},{ENSG00000225969},{171022}
6184,BICDL3P,WBSCR26,{18289},{ENSG00000225969},{}


In [14]:
subset_genes_df.to_hdf(
    "../output/subset_genes_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['primary_gene_symbol', 'alias_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], dtype='object')]

  subset_genes_df.to_hdf(


In [15]:
subset_genes_df.to_csv(
    "../output/subset_genes_df.csv", index=True
)
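
One caveat when the CSV is read back in another notebook: set-valued columns are written as text. The sketch below shows one possible way to restore them with a converter. The helper `parse_set` and the in-memory buffer are assumptions for illustration, not part of the original workflow.

```python
import ast
import io

import pandas as pd


def parse_set(text):
    """Turn the CSV text form of a set back into a Python set."""
    if text == "set()":  # how an empty set is written out
        return set()
    return ast.literal_eval(text)


# Hypothetical two-row frame with a set-valued column
df = pd.DataFrame({
    "alias_symbol": ["A1B", "WBSCR26"],
    "NCBI_ID": [{"1"}, set()],
})

# Round trip through CSV using an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"NCBI_ID": parse_set})
print(restored)
```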

# <a id='toc2_'></a>[Ortholog Capture](#toc0_)

## <a id='toc2_1_'></a>[Download an Ensembl Biomart export file with the Gene Name and the Ortholog Gene Name](#toc0_)

In [16]:
mur_dros_ortho_df = pd.read_csv(
    "../input/ensg_mart_export_dros_murin_ortho.txt", sep=",", index_col=[0])
mur_dros_ortho_df

Unnamed: 0_level_0,Drosophila melanogaster (Fruit fly) gene name,Drosophila melanogaster (Fruit fly) gene stable ID,Mouse gene stable ID,Mouse gene name,Gene name
Gene stable ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ENSG00000210049,,,,,MT-TF
ENSG00000211459,,,,,MT-RNR1
ENSG00000210077,,,,,MT-TV
ENSG00000210082,,,,,MT-RNR2
ENSG00000209082,,,,,MT-TL1
...,...,...,...,...,...
ENSG00000232679,,,,,LINC01705
ENSG00000200033,,,ENSMUSG00000088001,Gm22883,RNU6-403P
ENSG00000228437,,,,,LINC02474
ENSG00000229463,,,,,LYST-AS1


## <a id='toc2_2_'></a>[Make all of the gene symbols all caps](#toc0_)

Different species follow different gene nomenclature conventions. <br>
For example, mouse genes have the first letter capitalized but the rest lowercase.<br>
All symbols must be all caps so that the alias and ortholog names can be compared directly.
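
A minimal sketch of why the capitalization matters, using two hypothetical rows in the style of the Biomart export: a direct comparison fails until both sides are upper-cased.

```python
import pandas as pd

# Hypothetical rows; mouse symbols capitalize only the first letter
ortho = pd.DataFrame({
    "Gene name": ["RNU6-403P", "TP53"],
    "Mouse gene name": ["Gm22883", "Trp53"],
})
alias = "TRP53"  # aliases in subset_genes_df are already upper-cased

# Case-sensitive comparison misses the ortholog
found_before = (ortho["Mouse gene name"] == alias).any()

normalized = ortho.copy()
for column in ["Gene name", "Mouse gene name"]:
    normalized[column] = normalized[column].str.upper()

# After normalization the same comparison succeeds
found_after = (normalized["Mouse gene name"] == alias).any()
print(found_before, found_after)
```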

In [17]:
mur_dros_ortho_df = convert_all_columns_to_uppercase(mur_dros_ortho_df)

## <a id='toc2_3_'></a>[Match aliases to orthologs!](#toc0_)

### <a id='toc2_3_1_'></a>[Drosophila melanogaster](#toc0_)

In [18]:
fruitfly_df = make_col_ortholog_match(subset_genes_df, mur_dros_ortho_df,"Drosophila melanogaster (Fruit fly)")
print(len(fruitfly_df))

KeyboardInterrupt: 

### <a id='toc2_3_2_'></a>[Mouse](#toc0_)

In [None]:
mouse_df = make_col_ortholog_match(subset_genes_df, mur_dros_ortho_df,"Mouse")
print(len(mouse_df))

KeyboardInterrupt: 

In [None]:
mouse_df

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,source,Mouse Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},{NCBI},False
1,A1BG,A1B,{5},{ENSG00000121410},{1},{NCBI},False
2,A1BG,ABG,{5},{ENSG00000121410},{1},{NCBI},False
3,A1BG,GAB,{5},{ENSG00000121410},{1},{NCBI},False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},{NCBI},False
...,...,...,...,...,...,...,...
86546,ZZEF1,FLJ10821,{29027},{ENSG00000074755},{23140},"{ENSG, HGNC}",False
86547,ZZEF1,KIAA0399,{29027},{ENSG00000074755},{23140},"{ENSG, HGNC}",False
86548,ZZEF1,ZZZ4,{29027},{ENSG00000074755},{23140},"{NCBI, ENSG, HGNC}",False
86549,ZZZ3,ATAC1,{24523},{ENSG00000036549},{26009},"{NCBI, ENSG, HGNC}",False


### <a id='toc2_3_3_'></a>[(1)Abingdon island giant tortoise](#toc0_)
### <a id='toc2_3_4_'></a>[(1)African ostrich](#toc0_)
### <a id='toc2_3_5_'></a>[(1)Algerian mouse](#toc0_)

In [None]:
ortholog_set_1_df = pd.read_csv(
    "../input/ortholog_set_1_df.txt", sep=",")

In [None]:
ortholog_analysis_1_df, ortholog_analysis_1_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_1_df)
print(ortholog_analysis_1_counts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].str.upper()


Added column: Abingdon island giant tortoise Match
Added column: African ostrich Match
Added column: Algerian mouse Match
{'Abingdon island giant tortoise': np.int64(1), 'African ostrich': np.int64(1), 'Algerian mouse': np.int64(162)}


In [None]:
ortholog_analysis_1_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Abingdon island giant tortoise Match,African ostrich Match,Algerian mouse Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False


### <a id='toc2_3_6_'></a>[(2)Alpaca](#toc0_)
### <a id='toc2_3_7_'></a>[(2)Alpine marmot](#toc0_)
### <a id='toc2_3_8_'></a>[(2)Amazon molly](#toc0_)
### <a id='toc2_3_9_'></a>[(2)American bison](#toc0_)
### <a id='toc2_3_10_'></a>[(2)American black bear](#toc0_)

In [None]:
ortholog_set_2_df = pd.read_csv(
    "../input/ortholog_set_2_df.txt", sep=",")

In [None]:
ortholog_analysis_2_df, ortholog_analysis_2_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_2_df)
print(ortholog_analysis_2_counts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].str.upper()


Added column: Alpaca Match
Added column: Alpine marmot Match
Added column: Amazon molly Match
Added column: American bison Match
Added column: American black bear Match
{'Alpaca': np.int64(13), 'Alpine marmot': np.int64(6), 'Amazon molly': np.int64(96), 'American bison': np.int64(277), 'American black bear': np.int64(1)}


In [None]:
ortholog_analysis_2_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Alpaca Match,Alpine marmot Match,Amazon molly Match,American bison Match,American black bear Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False,False,False


### <a id='toc2_3_11_'></a>[(3)American mink](#toc0_)
### <a id='toc2_3_12_'></a>[(3)Arabian camel](#toc0_)
### <a id='toc2_3_13_'></a>[(3)Arctic ground squirrel](#toc0_)
### <a id='toc2_3_14_'></a>[(3)Argentine black and white tegu](#toc0_)
### <a id='toc2_3_15_'></a>[(3)Armadillo](#toc0_)
### <a id='toc2_3_16_'></a>[(3)Asian bonytongue](#toc0_)

In [None]:
ortholog_set_3_df = pd.read_csv(
    "../input/ortholog_set_3_df.txt", sep=",")

Unnamed: 0,Gene stable ID,Gene name,American mink gene name,Arabian camel gene name,Arctic ground squirrel gene name,Argentine black and white tegu gene name,Armadillo gene name,Asian bonytongue gene name
0,ENSG00000210049,MT-TF,,,,,,
1,ENSG00000211459,MT-RNR1,,,,,,
2,ENSG00000210077,MT-TV,,,,,,
3,ENSG00000210082,MT-RNR2,,,,,,
4,ENSG00000209082,MT-TL1,,,,,,
...,...,...,...,...,...,...,...,...
139603,ENSG00000235358,SCMH1-DT,,,,,,
139604,ENSG00000228067,LINC01740,,,,,,
139605,ENSG00000293271,SLC44A3-AS1,,,,,,
139606,ENSG00000310526,WASH7P,,,,,,


In [None]:
ortholog_analysis_3_df, ortholog_analysis_3_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_3_df)
print(ortholog_analysis_3_counts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].str.upper()


Added column: American mink Match
Added column: Arabian camel Match
Added column: Arctic ground squirrel Match
Added column: Argentine black and white tegu Match
Added column: Armadillo Match
Added column: Asian bonytongue Match
{'American mink': np.int64(15), 'Arabian camel': np.int64(32), 'Arctic ground squirrel': np.int64(2), 'Argentine black and white tegu': np.int64(4), 'Armadillo': np.int64(28), 'Asian bonytongue': np.int64(96)}


In [None]:
ortholog_analysis_3_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,American mink Match,Arabian camel Match,Arctic ground squirrel Match,Argentine black and white tegu Match,Armadillo Match,Asian bonytongue Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False,False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False,False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False,False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False,False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False,False,False,False


### <a id='toc2_3_17_'></a>[(4)Atlantic cod](#toc0_)
### <a id='toc2_3_18_'></a>[(4)Atlantic herring](#toc0_)
### <a id='toc2_3_19_'></a>[(4)Atlantic salmon](#toc0_)
### <a id='toc2_3_20_'></a>[(4)Australian saltwater crocodile](#toc0_)
### <a id='toc2_3_21_'></a>[(4)Ballan wrasse](#toc0_)

In [None]:
ortholog_set_4_df = pd.read_csv(
    "../input/ortholog_set_4_df.txt", sep=",")
ortholog_set_4_df.head()

Unnamed: 0,Gene stable ID,Gene name,Atlantic cod gene name,Atlantic herring gene name,Atlantic salmon gene name,Australian saltwater crocodile gene name,Ballan wrasse gene name
0,ENSG00000263418,5S_rRNA,,,,,
1,ENSG00000265816,5S_rRNA,,,,,
2,ENSG00000266035,5S_rRNA,,,,,
3,ENSG00000266615,5S_rRNA,,,,,
4,ENSG00000266653,5S_rRNA,,,,,
5,ENSG00000266726,5S_rRNA,,,,,
6,ENSG00000273928,5S_rRNA,,,,,
7,ENSG00000275780,5S_rRNA,,,,,
8,ENSG00000275999,5S_rRNA,,,,,
9,ENSG00000276861,5S_rRNA,,,,,


In [None]:
ortholog_set_4_df = ortholog_set_4_df.groupby('Gene name', as_index=False).agg(combine_rows)

In [None]:
ortholog_analysis_4_df, ortholog_analysis_4_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_4_df)
print(ortholog_analysis_4_counts)

Added column: Atlantic cod Match
Added column: Atlantic herring Match
Added column: Atlantic salmon Match
Added column: Australian saltwater crocodile Match
Added column: Ballan wrasse Match
{'Atlantic cod': np.int64(57), 'Atlantic herring': np.int64(211), 'Atlantic salmon': np.int64(167), 'Australian saltwater crocodile': np.int64(16), 'Ballan wrasse': np.int64(97)}


In [None]:
ortholog_analysis_4_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Atlantic cod Match,Atlantic herring Match,Atlantic salmon Match,Australian saltwater crocodile Match,Ballan wrasse Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False,False,False


### <a id='toc2_3_22_'></a>[(5)Barramundi perch](#toc0_)
### <a id='toc2_3_23_'></a>[(5)Beluga whale](#toc0_)
### <a id='toc2_3_24_'></a>[(5)Bicolor damselfish](#toc0_)
### <a id='toc2_3_25_'></a>[(5)Black snub-nosed monkey](#toc0_)
### <a id='toc2_3_26_'></a>[(5)Blue whale](#toc0_)


In [None]:
ortholog_set_5_df = pd.read_csv(
    "../input/ortholog_set_5_df.txt", sep=",")

In [None]:
ortholog_analysis_5_df, ortholog_analysis_5_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_5_df)
print(ortholog_analysis_5_counts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].str.upper()


Added column: Blue whale Match
Added column: Black snub-nosed monkey Match
Added column: Bicolor damselfish Match
Added column: Beluga whale Match
Added column: Barramundi perch Match
{'Blue whale': np.int64(8), 'Black snub-nosed monkey': np.int64(14), 'Bicolor damselfish': np.int64(93), 'Beluga whale': np.int64(28), 'Barramundi perch': np.int64(93)}


In [None]:
ortholog_analysis_5_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Blue whale Match,Black snub-nosed monkey Match,Bicolor damselfish Match,Beluga whale Match,Barramundi perch Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False,False,False


### <a id='toc2_3_27_'></a>[(6) Blue-ringed sea krait](#toc0_)
### <a id='toc2_3_28_'></a>[(6) Burton's mouthbrooder](#toc0_)
### <a id='toc2_3_29_'></a>[(6) C.intestinalis](#toc0_)
### <a id='toc2_3_30_'></a>[(6) C.savignyi](#toc0_)
### <a id='toc2_3_31_'></a>[(6) Caenorhabditis elegans (Nematode, N2)](#toc0_)

In [None]:
ortholog_set_6_df = pd.read_csv(
    "../input/ortholog_set_6_df.txt", sep=",")

In [None]:
ortholog_analysis_6_df, ortholog_analysis_6_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_6_df)
print(ortholog_analysis_6_counts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].str.upper()


Added column: Caenorhabditis elegans (Nematode, N2) Match
Added column: C.savignyi Match
Added column: C.intestinalis Match
Added column: Burton's mouthbrooder Match
Added column: Blue-ringed sea krait Match
{'Caenorhabditis elegans (Nematode, N2)': np.int64(101), 'C.savignyi': np.int64(9), 'C.intestinalis': np.int64(34), "Burton's mouthbrooder": np.int64(88), 'Blue-ringed sea krait': np.int64(2)}


In [None]:
ortholog_analysis_6_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,"Caenorhabditis elegans (Nematode, N2) Match",C.savignyi Match,C.intestinalis Match,Burton's mouthbrooder Match,Blue-ringed sea krait Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False,False,False


### <a id='toc2_3_32_'></a>[(7) Cat](#toc0_)
### <a id='toc2_3_33_'></a>[(7) Chacoan peccary](#toc0_)
### <a id='toc2_3_34_'></a>[(7) Channel bull blenny](#toc0_)
### <a id='toc2_3_35_'></a>[(7) Channel catfish](#toc0_)
### <a id='toc2_3_36_'></a>[(7) Chicken](#toc0_)

In [None]:
ortholog_set_7_df = pd.read_csv(
    "../input/ortholog_set_7_df.txt", sep=",")

In [None]:
ortholog_analysis_7_df, ortholog_analysis_7_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_7_df)
print(ortholog_analysis_7_counts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].str.upper()


Added column: Chicken Match
Added column: Channel catfish Match
Added column: Channel bull blenny Match
Added column: Chacoan peccary Match
Added column: Cat Match
{'Chicken': np.int64(154), 'Channel catfish': np.int64(172), 'Channel bull blenny': np.int64(129), 'Chacoan peccary': np.int64(10), 'Cat': np.int64(30)}


In [None]:
ortholog_analysis_7_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Chicken Match,Channel catfish Match,Channel bull blenny Match,Chacoan peccary Match,Cat Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False,False,False


### <a id='toc2_3_37_'></a>[(8) Chimpanzee](#toc0_)
### <a id='toc2_3_38_'></a>[(8) Chinese hamster CHOK1GS](#toc0_)
### <a id='toc2_3_39_'></a>[(8) Chinese medaka](#toc0_)
### <a id='toc2_3_40_'></a>[(8) Chinese softshell turtle](#toc0_)
### <a id='toc2_3_41_'></a>[(8) Chinook salmon](#toc0_)

In [None]:
ortholog_set_8_df = pd.read_csv(
    "../input/ortholog_set_8_df.txt", sep=",")

In [None]:
ortholog_analysis_8_df, ortholog_analysis_8_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_8_df)
print(ortholog_analysis_8_counts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].str.upper()


Added column: Chinook salmon Match
Added column: Chinese softshell turtle Match
Added column: Chinese medaka Match
Added column: Chinese hamster CHOK1GS Match
Added column: Chimpanzee Match
{'Chinook salmon': np.int64(210), 'Chinese softshell turtle': np.int64(2), 'Chinese medaka': np.int64(85), 'Chinese hamster CHOK1GS': np.int64(180), 'Chimpanzee': np.int64(44)}


In [None]:
ortholog_analysis_8_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Chinook salmon Match,Chinese softshell turtle Match,Chinese medaka Match,Chinese hamster CHOK1GS Match,Chimpanzee Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False,False,False


### <a id='toc2_3_42_'></a>[(9) Climbing perch](#toc0_)
### <a id='toc2_3_43_'></a>[(9) Clown anemonefish](#toc0_)
### <a id='toc2_3_44_'></a>[(9) Coelacanth](#toc0_)
### <a id='toc2_3_45_'></a>[(9) Coho salmon](#toc0_)
### <a id='toc2_3_46_'></a>[(9) Collared flycatcher](#toc0_)

In [None]:
ortholog_set_9_df = pd.read_csv(
    "../input/ortholog_set_9_df.txt", sep=",")

In [None]:
ortholog_analysis_9_df, ortholog_analysis_9_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_9_df)
print(ortholog_analysis_9_counts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].str.upper()


Added column: Climbing perch Match
Added column: Clown anemonefish Match
Added column: Coelacanth Match
Added column: Coho salmon Match
Added column: Collared flycatcher Match
{'Climbing perch': np.int64(198), 'Clown anemonefish': np.int64(203), 'Coelacanth': np.int64(35), 'Coho salmon': np.int64(84), 'Collared flycatcher': np.int64(13)}


In [None]:
ortholog_analysis_9_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Climbing perch Match,Clown anemonefish Match,Coelacanth Match,Coho salmon Match,Collared flycatcher Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False,False,False


### <a id='toc2_3_47_'></a>[(10) Common canary](#toc0_)
### <a id='toc2_3_48_'></a>[(10) Common carp](#toc0_)
### <a id='toc2_3_49_'></a>[(10) Common wall lizard](#toc0_)
### <a id='toc2_3_50_'></a>[(10) Common wombat](#toc0_)
### <a id='toc2_3_51_'></a>[(10) Coquerel's sifaka](#toc0_)

In [None]:
ortholog_set_10_df = pd.read_csv(
    "../input/ortholog_set_10_df.txt", sep=",")

In [None]:
ortholog_analysis_10_df, ortholog_analysis_10_counts = match_alias_to_ortholog(subset_genes_df, ortholog_set_10_df)
print(ortholog_analysis_10_counts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].str.upper()


Added column: Common canary Match
Added column: Common carp Match
Added column: Common wall lizard Match
Added column: Common wombat Match
Added column: Coquerel's sifaka Match
{'Common canary': np.int64(6), 'Common carp': np.int64(103), 'Common wall lizard': np.int64(10), 'Common wombat': np.int64(17), "Coquerel's sifaka": np.int64(14)}


In [None]:
ortholog_analysis_10_df.head()

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Common canary Match,Common carp Match,Common wall lizard Match,Common wombat Match,Coquerel's sifaka Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,False,False,False,False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,False,False,False,False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,False,False,False,False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,False,False,False,False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,False,False,False,False


## <a id='toc2_4_'></a>[Convert ortholog_analysis_dfs to csv for use in other notebooks](#toc0_)

In [None]:
ortholog_analysis_dfs_dict = {
    key: value for key, value in globals().items()
    if key.startswith('ortholog_analysis')  
    and key.endswith('_df')                
    and isinstance(value, pd.DataFrame)    
}

ortholog_analysis_dfs_dict

{'ortholog_analysis_1_df':       primary_gene_symbol  alias_symbol  HGNC_ID            ENSG_ID  \
 0              A-GAMMA3'E     A-GAMMA-E       {}                 {}   
 1                    A1BG           A1B      {5}  {ENSG00000121410}   
 2                    A1BG           ABG      {5}  {ENSG00000121410}   
 3                    A1BG           GAB      {5}  {ENSG00000121410}   
 4                    A1BG      HYST2477      {5}  {ENSG00000121410}   
 ...                   ...           ...      ...                ...   
 86546               ZZEF1      FLJ10821  {29027}  {ENSG00000074755}   
 86547               ZZEF1      KIAA0399  {29027}  {ENSG00000074755}   
 86548               ZZEF1          ZZZ4  {29027}  {ENSG00000074755}   
 86549                ZZZ3         ATAC1  {24523}  {ENSG00000036549}   
 86550                ZZZ3  DKFZP564I052  {24523}  {ENSG00000036549}   
 
            NCBI_ID  Abingdon island giant tortoise Match  \
 0      {109951028}                            

In [None]:
output_dir = "../output/"

for name, df in ortholog_analysis_dfs_dict.items():
    file_path = f"{output_dir}{name}.csv"
    
    df.to_csv(file_path, index=True)
    print(f"Saved: {file_path}")


Saved: output/ortholog_analysis_1_df.csv
Saved: output/ortholog_analysis_2_df.csv
Saved: output/ortholog_analysis_3_df.csv
Saved: output/ortholog_analysis_4_df.csv
Saved: output/ortholog_analysis_5_df.csv
Saved: output/ortholog_analysis_6_df.csv
Saved: output/ortholog_analysis_7_df.csv
Saved: output/ortholog_analysis_8_df.csv
Saved: output/ortholog_analysis_9_df.csv
Saved: output/ortholog_analysis_10_df.csv
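
Selecting frames through `globals()` depends on variable naming and on which cells have been run. One possible alternative, sketched here with hypothetical toy frames, a hypothetical `analysis_frames` registry, and a temporary directory standing in for `../output/`, is to keep the frames in an explicit dict and build the paths with `pathlib`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical registry, filled in as each ortholog batch is analysed.
analysis_frames = {
    "ortholog_analysis_1_df": pd.DataFrame({"alias_symbol": ["A1B"], "Mouse Match": [False]}),
    "ortholog_analysis_2_df": pd.DataFrame({"alias_symbol": ["ABG"], "Alpaca Match": [False]}),
}

# Temporary directory used only for this sketch; the notebook writes to ../output/.
output_dir = Path(tempfile.mkdtemp())

saved_paths = []
for name, frame in analysis_frames.items():
    file_path = output_dir / f"{name}.csv"
    frame.to_csv(file_path, index=True)
    saved_paths.append(file_path)
    print(f"Saved: {file_path}")
```

An explicit dict makes the set of exported frames visible in one place and independent of whatever else is defined in the session.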


# <a id='toc3_'></a>[HGNC Previous Symbol Capture](#toc0_)

## <a id='toc3_1_'></a>[Download the HGNC custom download including the gene symbol, ID, and previous symbols](#toc0_)

In [None]:
hgnc_previous_symbols_df = pd.read_csv(
    "../input/HGNC_previous_symbols20241010.txt", sep="\t")
hgnc_previous_symbols_df

Unnamed: 0,HGNC ID,Approved symbol,Previous symbols
0,HGNC:5,A1BG,
1,HGNC:37133,A1BG-AS1,"NCRNA00181, A1BGAS, A1BG-AS"
2,HGNC:24086,A1CF,
3,HGNC:6,A1S9T,
4,HGNC:7,A2M,
...,...,...,...
49077,HGNC:25820,ZYG11B,ZYG11
49078,HGNC:13200,ZYX,
49079,HGNC:51695,ZYXP1,
49080,HGNC:29027,ZZEF1,


## <a id='toc3_2_'></a>[Remove all genes with no previous symbols](#toc0_)

In [None]:
hgnc_previous_symbols_df = hgnc_previous_symbols_df.dropna(subset=["Previous symbols"])
hgnc_previous_symbols_df

Unnamed: 0,HGNC ID,Approved symbol,Previous symbols
1,HGNC:37133,A1BG-AS1,"NCRNA00181, A1BGAS, A1BG-AS"
6,HGNC:23336,A2ML1,CPAMD9
9,HGNC:8,A2MP1,A2MP
12,HGNC:30005,A3GALT2,A3GALT2P
13,HGNC:18149,A4GALT,P1
...,...,...,...
49063,HGNC:23528,ZSWIM8,KIAA0913
49065,HGNC:34495,ZSWIM9,C19orf68
49066,HGNC:21224,ZUP1,"C6orf113, ZUFSP"
49071,HGNC:13197,ZWS1,ZWS


## <a id='toc3_3_'></a>[Explode the previous symbols so that it is only one symbol per row](#toc0_)

In [None]:
# .assign returns a new DataFrame, so adding the column to the filtered frame
# does not trigger a SettingWithCopyWarning.
hgnc_previous_symbols_df = hgnc_previous_symbols_df.assign(
    previous_symbol=hgnc_previous_symbols_df['Previous symbols'].str.split(',').apply(lambda x: [s.strip() for s in x])
)
hgnc_previous_symbols_df = hgnc_previous_symbols_df.explode('previous_symbol')
hgnc_previous_symbols_df = hgnc_previous_symbols_df.drop(columns=['Previous symbols'])
hgnc_previous_symbols_df

Unnamed: 0,HGNC ID,Approved symbol,previous_symbol
1,HGNC:37133,A1BG-AS1,NCRNA00181
1,HGNC:37133,A1BG-AS1,A1BGAS
1,HGNC:37133,A1BG-AS1,A1BG-AS
6,HGNC:23336,A2ML1,CPAMD9
9,HGNC:8,A2MP1,A2MP
...,...,...,...
49065,HGNC:34495,ZSWIM9,C19orf68
49066,HGNC:21224,ZUP1,C6orf113
49066,HGNC:21224,ZUP1,ZUFSP
49071,HGNC:13197,ZWS1,ZWS


In [None]:
hgnc_previous_symbols_df.to_hdf(
    "../output/hgnc_previous_symbols_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['HGNC ID', 'Approved symbol', 'previous_symbol'], dtype='object')]

  hgnc_previous_symbols_df.to_hdf(


## <a id='toc3_4_'></a>[Make all of the gene symbols all caps](#toc0_)

Gene nomenclature conventions differ between sources and species. <br>
For example, mouse gene symbols capitalize only the first letter.<br>
Converting every symbol to upper case makes the matching below case-insensitive.

In [None]:
hgnc_previous_symbols_df["Approved symbol"] = hgnc_previous_symbols_df["Approved symbol"].str.upper()
hgnc_previous_symbols_df["previous_symbol"] = hgnc_previous_symbols_df["previous_symbol"].str.upper()

## <a id='toc3_5_'></a>[Match aliases to previous symbols!](#toc0_)

In [None]:
previous_symbol_analysis_df = subset_genes_df.copy()
previous_symbol_analysis_df

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028}
1,A1BG,A1B,{5},{ENSG00000121410},{1}
2,A1BG,ABG,{5},{ENSG00000121410},{1}
3,A1BG,GAB,{5},{ENSG00000121410},{1}
4,A1BG,HYST2477,{5},{ENSG00000121410},{1}
...,...,...,...,...,...
86546,ZZEF1,FLJ10821,{29027},{ENSG00000074755},{23140}
86547,ZZEF1,KIAA0399,{29027},{ENSG00000074755},{23140}
86548,ZZEF1,ZZZ4,{29027},{ENSG00000074755},{23140}
86549,ZZZ3,ATAC1,{24523},{ENSG00000036549},{26009}


In [None]:
previous_symbol_analysis_df["Previous Symbol Match"] = previous_symbol_analysis_df.apply(lambda row: 
                        any((hgnc_previous_symbols_df['Approved symbol'] == row['primary_gene_symbol']) 
                            & 
                            (hgnc_previous_symbols_df["previous_symbol"] == row['alias_symbol'])), axis=1)
previous_symbol_match_subset_genes_df = previous_symbol_analysis_df[previous_symbol_analysis_df["Previous Symbol Match"]]
previous_symbol_match_subset_genes_df

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Previous Symbol Match
5,A1BG-AS1,A1BG-AS,{37133},{ENSG00000268895},{503538},True
6,A1BG-AS1,A1BGAS,{37133},{ENSG00000268895},{503538},True
8,A1BG-AS1,NCRNA00181,{37133},{ENSG00000268895},{503538},True
18,A2ML1,CPAMD9,{23336},{ENSG00000166535},{144568},True
22,A2MP1,A2MP,{8},"{ENSG00000291190, ENSG00000256069}",{3},True
...,...,...,...,...,...,...
86516,ZSWIM8,KIAA0913,{23528},{ENSG00000214655},{23053},True
86517,ZSWIM9,C19ORF68,{34495},{ENSG00000185453},{374920},True
86519,ZUP1,C6ORF113,{21224},{ENSG00000153975},{221302},True
86522,ZUP1,ZUFSP,{21224},{ENSG00000153975},{221302},True
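
The row-wise `apply` above rescans all of `hgnc_previous_symbols_df` for every alias row. A minimal sketch of an equivalent lookup, assuming both frames are already upper-cased: collect the (approved symbol, previous symbol) pairs into a set once, then test each (primary, alias) pair for membership. The toy frames and the names `previous_df`, `alias_df`, and `known_pairs` are illustrative stand-ins, not objects from this notebook.

```python
import pandas as pd

# Toy stand-in for hgnc_previous_symbols_df (one previous symbol per row).
previous_df = pd.DataFrame({
    "Approved symbol": ["A1BG-AS1", "A1BG-AS1", "A2ML1"],
    "previous_symbol": ["NCRNA00181", "A1BGAS", "CPAMD9"],
})

# Toy stand-in for subset_genes_df.
alias_df = pd.DataFrame({
    "primary_gene_symbol": ["A1BG", "A1BG-AS1", "A2ML1", "A2ML1"],
    "alias_symbol": ["A1B", "A1BGAS", "CPAMD9", "FLJ25179"],
})

# Build the set of known (approved, previous) pairs once.
known_pairs = set(zip(previous_df["Approved symbol"], previous_df["previous_symbol"]))

# A pair matches only when the alias is a previous symbol of that same gene.
alias_df["Previous Symbol Match"] = [
    (primary, alias) in known_pairs
    for primary, alias in zip(alias_df["primary_gene_symbol"], alias_df["alias_symbol"])
]

print(alias_df)
```

Set membership is a constant-time check per row, so this avoids comparing every alias row against the full HGNC table.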


In [None]:
previous_symbol_match_subset_genes_df.head(20)

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Previous Symbol Match
5,A1BG-AS1,A1BG-AS,{37133},{ENSG00000268895},{503538},True
6,A1BG-AS1,A1BGAS,{37133},{ENSG00000268895},{503538},True
8,A1BG-AS1,NCRNA00181,{37133},{ENSG00000268895},{503538},True
18,A2ML1,CPAMD9,{23336},{ENSG00000166535},{144568},True
22,A2MP1,A2MP,{8},"{ENSG00000291190, ENSG00000256069}",{3},True
23,A3GALT2,A3GALT2P,{30005},{ENSG00000184389},{127550},True
30,A4GALT,P1,{18149},{ENSG00000128274},{53947},True
47,AACSP1,AACSL,{18226},"{ENSG00000291019, ENSG00000250420}",{729522},True
63,AAMDC,C11ORF67,{30205},{ENSG00000087884},{28971},True
70,AAR2,C20ORF4,{15886},{ENSG00000131043},{25980},True


In [None]:
previous_symbol_match_subset_genes_df.loc[previous_symbol_match_subset_genes_df["alias_symbol"] == "FWP007"]

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Previous Symbol Match


In [None]:
previous_symbol_match_subset_genes_df.to_hdf(
    "../output/previous_symbol_match_subset_genes_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['primary_gene_symbol', 'alias_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], dtype='object')]

  previous_symbol_match_subset_genes_df.to_hdf(


In [None]:
previous_symbol_analysis_df.to_hdf(
    "../output/previous_symbol_analysis_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['primary_gene_symbol', 'alias_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], dtype='object')]

  previous_symbol_analysis_df.to_hdf(


# <a id='toc4_'></a>[FLJ Clone Name Capture](#toc0_)

## <a id='toc4_1_'></a>[Download the FLJ database file including FLJ IDs](#toc0_)
- https://flj.lifesciencedb.jp/top/sys_info/02_about_database/accession_no/download_v032.html 
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2780955/ 
- https://www.ncbi.nlm.nih.gov/nuccore/AK027222?report=GenBank

In [None]:
with open("../input/Conversion_table_FLJ(1).txt", 'r') as file:
    for i, line in enumerate(file):
        print(f"Line {i}: {line.strip()}")

Line 0: Conversion table of Accession No/ FLJ ID/ Clone ID/ Sequence ID/ Another Sequence ID
Line 1: (Data by Extracting System of Accurate ORFs)
Line 2: 
Line 3: Accesion No	FLJ ID	Clone ID	Sequence ID	Another Sequence ID
Line 4: AK075326	PSEC0001(FLJ91001)	NT2RM1000066	C-NT2RM1000066
Line 5: AK172724	PSEC0002(FLJ91002)	NT2RM1000295	C-NT2RM1000295
Line 6: AK075327	PSEC0003(FLJ91003)	NT2RM1000361	C-NT2RM1000361
Line 7: AK075328	PSEC0004(FLJ91004)	NT2RM1000558	C-NT2RM1000558
Line 8: AK075329	PSEC0005(FLJ91005)	NT2RM1000566	C-NT2RM1000566
Line 9: AK075330	PSEC0006(FLJ91006)	NT2RM1000630	C-NT2RM1000630
Line 10: AK075331	PSEC0007(FLJ91007)	NT2RM1000634	C-NT2RM1000634
Line 11: AK075332	PSEC0008(FLJ91008)	NT2RM1000726	C-NT2RM1000726
Line 12: AK075333	PSEC0009(FLJ91009)	NT2RM1000731	C-NT2RM1000731
Line 13: AK075334	PSEC0011(FLJ91010)	NT2RM1000779	C-NT2RM1000779
Line 14: AK075335	PSEC0012(FLJ91011)	NT2RM1000853	C-NT2RM1000853
Line 15: AK075336	PSEC0013(FLJ91012)	NT2RM1000960	C-NT2RM1000960
Lin

In [None]:
clone_symbols_df = pd.read_csv(
    "../input/Conversion_table_FLJ(1).txt", 
    sep="\t", 
    skiprows=3
)
clone_symbols_df

Unnamed: 0,Accesion No,FLJ ID,Clone ID,Sequence ID,Another Sequence ID
0,AK075326,PSEC0001(FLJ91001),NT2RM1000066,C-NT2RM1000066,
1,AK172724,PSEC0002(FLJ91002),NT2RM1000295,C-NT2RM1000295,
2,AK075327,PSEC0003(FLJ91003),NT2RM1000361,C-NT2RM1000361,
3,AK075328,PSEC0004(FLJ91004),NT2RM1000558,C-NT2RM1000558,
4,AK075329,PSEC0005(FLJ91005),NT2RM1000566,C-NT2RM1000566,
...,...,...,...,...,...
30321,AK057825,FLJ25096,CBR00778,C-CBR00778,
30322,AK000479,FLJ20472,KAT07023,C-KAT07023,
30323,AK125921,FLJ43933,TESTI4013685,C-TESTI4013685,
30324,AK125959,FLJ43971,TESTI4017901,C-TESTI4017901,


Extract the values from the `FLJ ID` column so that each row holds a single ID. Entries such as `PSEC0001(FLJ91001)` carry two IDs and become two rows.

In [None]:
extracted_ids = clone_symbols_df['FLJ ID'].str.extract(r'([^()]+)\((.+?)\)')

result_rows = []

for index, row in clone_symbols_df.iterrows():
    flj_id = row['FLJ ID']
    if pd.notnull(extracted_ids.iloc[index, 0]):
        result_rows.append({'Accesion No': row['Accesion No'], 'ID': extracted_ids.iloc[index, 0]})  
        result_rows.append({'Accesion No': row['Accesion No'], 'ID': extracted_ids.iloc[index, 1]})  
    else:
        result_rows.append({'Accesion No': row['Accesion No'], 'ID': flj_id})

result_df = pd.DataFrame(result_rows)
result_df

Unnamed: 0,Accesion No,ID
0,AK075326,PSEC0001
1,AK075326,FLJ91001
2,AK172724,PSEC0002
3,AK172724,FLJ91002
4,AK075327,PSEC0003
...,...,...
30581,AK057825,FLJ25096
30582,AK000479,FLJ20472
30583,AK125921,FLJ43933
30584,AK125959,FLJ43971
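
The `iterrows` loop above can also be expressed without an explicit loop. This is a minimal sketch, assuming every `FLJ ID` value is either a bare ID or an ID followed by a second ID in parentheses; `toy_df` and `flat_df` are hypothetical names, and the rows are copied from the table shown earlier.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Accesion No": ["AK075326", "AK057825"],  # column name spelled as in the source file
    "FLJ ID": ["PSEC0001(FLJ91001)", "FLJ25096"],
})

# findall returns every run of characters that is not a parenthesis,
# so "PSEC0001(FLJ91001)" yields two IDs and "FLJ25096" yields one.
flat_df = (
    toy_df.assign(ID=toy_df["FLJ ID"].str.findall(r"[^()]+"))
    .explode("ID")            # one row per extracted ID
    .drop(columns="FLJ ID")
    .reset_index(drop=True)
)
flat_df["ID"] = flat_df["ID"].str.strip()

print(flat_df)
```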


In [None]:
result_df["ID"] = result_df["ID"].str.strip()

In [None]:
result_df.to_hdf(
    "../output/flj_clone_symbols_df.h5", key='df', mode='w'
)

In [None]:
clone_symbol_analysis_df = subset_genes_df.copy()
clone_symbol_analysis_df["Clone Symbol Match"] = clone_symbol_analysis_df['alias_symbol'].isin(result_df['ID'])

clone_symbol_match_subset_genes_df = clone_symbol_analysis_df[clone_symbol_analysis_df["Clone Symbol Match"]]
clone_symbol_match_subset_genes_df.head(20)

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Clone Symbol Match
7,A1BG-AS1,FLJ23569,{37133},{ENSG00000268895},{503538},True
19,A2ML1,FLJ25179,{23336},{ENSG00000166535},{144568},True
45,AACS,FLJ12389,{21298},{ENSG00000081760},{65985},True
56,AAGAB,FLJ11506,{25662},{ENSG00000103591},{79719},True
65,AAMDC,FLJ21035,{30205},{ENSG00000087884},{28971},True
132,ABCA11P,FLJ14297,{31},{ENSG00000251595},{79963},True
138,ABCA13,FLJ33876,{14638},{ENSG00000179869},{154664},True
139,ABCA13,FLJ33951,{14638},{ENSG00000179869},{154664},True
141,ABCA15P,FLJ41766,{34405},{ENSG00000189149},{400508},True
348,ABHD1,FLJ36128,{17553},{ENSG00000143994},{84696},True


In [None]:
clone_symbol_analysis_df

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Clone Symbol Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False
...,...,...,...,...,...,...
86546,ZZEF1,FLJ10821,{29027},{ENSG00000074755},{23140},True
86547,ZZEF1,KIAA0399,{29027},{ENSG00000074755},{23140},False
86548,ZZEF1,ZZZ4,{29027},{ENSG00000074755},{23140},False
86549,ZZZ3,ATAC1,{24523},{ENSG00000036549},{26009},False


In [None]:
clone_symbol_analysis_df.to_hdf(
    "../output/clone_symbol_analysis_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['primary_gene_symbol', 'alias_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], dtype='object')]

  clone_symbol_analysis_df.to_hdf(


In [None]:
clone_symbol_match_subset_genes_df.to_hdf(
    "../output/clone_symbol_match_subset_genes_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['primary_gene_symbol', 'alias_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], dtype='object')]

  clone_symbol_match_subset_genes_df.to_hdf(


# <a id='toc5_'></a>[Gene Family Symbol Capture](#toc0_)

Gene group IDs and abbreviations come from the HGNC `family.csv` table, available at https://storage.googleapis.com/public-download-files/hgnc/csv/csv/genefamily_db_tables/family.csv or from https://www.genenames.org/download/gene-groups/#!/#tocAnchor-1-1

In [None]:
hgnc_genefamilies_df = pd.read_csv(
    "../input/hgnc_genefamily.csv", sep=",")
hgnc_genefamilies_df

Unnamed: 0,id,abbreviation,name,external_note,pubmed_ids,desc_comment,desc_label,desc_source,desc_go,typical_gene
0,3,FSCN,Fascin family,,21618240,,,,,FSCN1
1,4,ABHD,Abhydrolase domain containing,,23328280,,,,,ABHD1
2,6,ZYG11,ZYG11 cell cycle regulator family,,,,,,,
3,8,ZP,Zona pellucida glycoproteins,,15760956,There are four major zona pellucida glycoprote...,Zona pellucida glycoproteins,Wikipedia|https://en.wikipedia.org/wiki/Zona p...,,ZP1
4,10,VNN,Vanin family,,22155241,,,,,VNN1
...,...,...,...,...,...,...,...,...,...,...
1810,3338,,WICH complex,,21326359,Chromatin remodeling complex required for main...,WICH complex,Complex portal|https://www.ebi.ac.uk/complexpo...,,
1811,3339,,NoRC complex,,,NoRC remodels nucleosomes at the rDNA promoter...,NoRC complex,Complex Portal|https://www.ebi.ac.uk/complexpo...,,
1812,3340,,RSF complex,,,A nucleosome remodeling complex that participa...,RSF complex,Complex Portal|https://www.ebi.ac.uk/complexpo...,,
1813,3341,,ATP-dependent chromatin remodeling complexes,,19355820,,,,,


In [None]:
# Select and rename in one chained expression; rename returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by an in-place rename on a slice.
hgnc_genefamilies_df = hgnc_genefamilies_df[["id", "abbreviation"]].rename(columns={'id': 'Gene group ID'})

The HGNC ID, approved symbol, and gene group ID for each gene come from HGNC: https://storage.googleapis.com/public-download-files/hgnc/csv/csv/genefamily_db_tables/gene_has_family.csv, or gene_has_family.csv from https://www.genenames.org/download/gene-groups/#!/#tocAnchor-1-1, or an HGNC Custom Download with these three fields selected

In [None]:
hgnc_gene_groupid_df = pd.read_csv(
    "../input/hgnc_id_symbol_genegroupid.txt", sep="\t")
hgnc_gene_groupid_df

Unnamed: 0,HGNC ID,Approved symbol,Gene group ID
0,HGNC:5,A1BG,594
1,HGNC:37133,A1BG-AS1,1987
2,HGNC:24086,A1CF,725
3,HGNC:6,A1S9T,
4,HGNC:7,A2M,2148
...,...,...,...
49078,HGNC:25820,ZYG11B,6|1492
49079,HGNC:13200,ZYX,1402|1691
49080,HGNC:51695,ZYXP1,
49081,HGNC:29027,ZZEF1,91|863


In [None]:
hgnc_gene_groupid_df['HGNC ID'] = hgnc_gene_groupid_df['HGNC ID'].str.replace('^HGNC:', '', regex=True)
hgnc_gene_groupid_df['Gene group ID'] = hgnc_gene_groupid_df['Gene group ID'].str.split('|')
hgnc_gene_groupid_df = hgnc_gene_groupid_df.explode('Gene group ID')

In [None]:
hgnc_gene_groupid_df = hgnc_gene_groupid_df.dropna(subset=['Gene group ID'])
hgnc_gene_groupid_df['Gene group ID'] = hgnc_gene_groupid_df['Gene group ID'].astype(int)

In [None]:
hgnc_gene_group_root_df = hgnc_gene_groupid_df.merge(hgnc_genefamilies_df, on='Gene group ID', how='left')
hgnc_gene_group_root_df

Unnamed: 0,HGNC ID,Approved symbol,Gene group ID,abbreviation
0,5,A1BG,594,
1,37133,A1BG-AS1,1987,
2,24086,A1CF,725,RBM
3,7,A2M,2148,
4,27057,A2M-AS1,1987,
...,...,...,...,...
31399,29027,ZZEF1,91,ZZZ
31400,29027,ZZEF1,863,
31401,24523,ZZZ3,91,ZZZ
31402,24523,ZZZ3,532,


In [None]:
# The IDs are strings at this point (the "HGNC:" prefix was stripped with str.replace above),
# so restoring the prefix only needs a null check.
hgnc_gene_group_root_df["HGNC ID"] = hgnc_gene_group_root_df["HGNC ID"].apply(
    lambda x: f"HGNC:{x}" if pd.notna(x) else x
)

In [None]:
# .copy() makes the filtered frame independent of the original,
# so the column assignments in the next cell do not raise SettingWithCopyWarning.
hgnc_gene_group_root_df = hgnc_gene_group_root_df.dropna(subset=['abbreviation']).copy()

In [None]:
hgnc_gene_group_root_df["abbreviation"] = hgnc_gene_group_root_df["abbreviation"].str.upper()
hgnc_gene_group_root_df["Approved symbol"] = hgnc_gene_group_root_df["Approved symbol"].str.upper()

In [None]:
hgnc_gene_group_root_df.to_hdf(
    "../output/hgnc_gene_group_root_df.h5", key='df', mode='w'
)

In [None]:
gene_group_analysis_df = subset_genes_df.copy()

Add the 'Prefix Gene Group Symbol Match' column with True/False

In [None]:
gene_group_analysis_df["Prefix Gene Group Symbol Match"] = gene_group_analysis_df.apply(
    lambda row: any(
        row['alias_symbol'].startswith(abbreviation)  # Check if alias_symbol starts with abbreviation
        for abbreviation in hgnc_gene_group_root_df.loc[
            hgnc_gene_group_root_df['Approved symbol'] == row['primary_gene_symbol'], 
            'abbreviation'
        ]  # Loop through all abbreviations for the gene_symbol
    ), axis=1
)

Add the 'Matching Abbreviation' column with the first matching abbreviation (or an empty string if no match)

In [None]:
gene_group_analysis_df["Matching Abbreviation"] = gene_group_analysis_df.apply(
    lambda row: next((
        abbreviation for abbreviation in hgnc_gene_group_root_df.loc[
            hgnc_gene_group_root_df['Approved symbol'] == row['primary_gene_symbol'], 
            'abbreviation'
        ] 
        if row['alias_symbol'].startswith(abbreviation)  # Check if alias_symbol starts with abbreviation
    ),  ""), axis=1
)

In [None]:

gene_group_prefix_match_subset_genes_df = gene_group_analysis_df[gene_group_analysis_df["Prefix Gene Group Symbol Match"]]
gene_group_analysis_df


Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Prefix Gene Group Symbol Match,Matching Abbreviation
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,
...,...,...,...,...,...,...,...
86546,ZZEF1,FLJ10821,{29027},{ENSG00000074755},{23140},False,
86547,ZZEF1,KIAA0399,{29027},{ENSG00000074755},{23140},False,
86548,ZZEF1,ZZZ4,{29027},{ENSG00000074755},{23140},True,ZZZ
86549,ZZZ3,ATAC1,{24523},{ENSG00000036549},{26009},True,ATAC
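
The two `apply` cells above filter `hgnc_gene_group_root_df` twice for every alias row. A minimal sketch of one way to share that lookup, assuming abbreviations are non-empty upper-case strings: group the abbreviations by approved symbol once, find the first matching prefix, and derive the Boolean column from it. The toy frames and the names `group_df`, `alias_df`, `abbreviations_by_symbol`, and `first_matching_abbreviation` are hypothetical.

```python
import pandas as pd

# Toy stand-in for hgnc_gene_group_root_df.
group_df = pd.DataFrame({
    "Approved symbol": ["ZZEF1", "ZZZ3", "ZZZ3"],
    "abbreviation": ["ZZZ", "ZZZ", "ATAC"],
})

# Toy stand-in for subset_genes_df.
alias_df = pd.DataFrame({
    "primary_gene_symbol": ["ZZEF1", "ZZEF1", "ZZZ3"],
    "alias_symbol": ["FLJ10821", "ZZZ4", "ATAC1"],
})

# Map each approved symbol to the list of its gene group abbreviations.
abbreviations_by_symbol = (
    group_df.groupby("Approved symbol")["abbreviation"].agg(list).to_dict()
)


def first_matching_abbreviation(primary_symbol, alias_symbol):
    """Return the first group abbreviation that prefixes the alias, or ""."""
    for abbreviation in abbreviations_by_symbol.get(primary_symbol, []):
        if alias_symbol.startswith(abbreviation):
            return abbreviation
    return ""


alias_df["Matching Abbreviation"] = [
    first_matching_abbreviation(primary, alias)
    for primary, alias in zip(alias_df["primary_gene_symbol"], alias_df["alias_symbol"])
]
# A row matches exactly when some abbreviation was found.
alias_df["Prefix Gene Group Symbol Match"] = alias_df["Matching Abbreviation"] != ""

print(alias_df)
```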


In [None]:
gene_group_prefix_match_subset_genes_df.to_hdf(
    "../output/gene_group_prefix_match_subset_genes_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['primary_gene_symbol', 'alias_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID',
       'Matching Abbreviation'],
      dtype='object')]

  gene_group_prefix_match_subset_genes_df.to_hdf(


In [None]:
gene_group_analysis_df.to_hdf(
    "../output/gene_group_analysis_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['primary_gene_symbol', 'alias_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID',
       'Matching Abbreviation'],
      dtype='object')]

  gene_group_analysis_df.to_hdf(


# <a id='toc6_'></a>[Disorder/Disease Symbol Capture](#toc0_)

OMIM downloads: https://omim.org/downloads/

In [11]:
mim2gene_df = pd.read_csv(
    "../input/mim2gene.txt", 
    sep="\t", 
    comment="#",
    names=["gene_MIM_number", "MIM Entry Type", "Entrez Gene ID (NCBI)", "Approved Gene Symbol (HGNC)", "Ensembl Gene ID (Ensembl)"],
    dtype={"Entrez Gene ID (NCBI)": "Int64"}
    )

mim2gene_df.drop(columns=['MIM Entry Type'], inplace=True)

mim2gene_df

Unnamed: 0,gene_MIM_number,Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
0,100050,,,
1,100070,100329167,,
2,100100,,,
3,100200,,,
4,100300,,,
...,...,...,...,...
28979,621039,,,
28980,621040,,,
28981,621041,,,
28982,621042,,,


In [12]:
mim2gene_df["Entrez Gene ID (NCBI)"] = mim2gene_df["Entrez Gene ID (NCBI)"].apply(
    lambda x: f"GENE ID:{int(x)}" if pd.notna(x) and x == int(x) else f"GENE ID:{x}" if pd.notna(x) else x
)

In [83]:
mim2gene_df.loc[mim2gene_df["Approved Gene Symbol (HGNC)"]=="A2ML1"]

Unnamed: 0,gene_MIM_number,Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
18573,610627,GENE ID:144568,A2ML1,ENSG00000166535


In [14]:
mimTitles_df = pd.read_csv(
    "../input/mimTitles.txt", 
    sep="\t", 
    comment="#",
    names=["Prefix", "phenotype_MIM_number", "phenotype_preferred_title_symbol", "phenotype_alternative_titles_symbols", "phenotype__included_titles_symbols"]
    )
mimTitles_df.head()

Unnamed: 0,Prefix,phenotype_MIM_number,phenotype_preferred_title_symbol,phenotype_alternative_titles_symbols,phenotype__included_titles_symbols
0,,100050,"AARSKOG SYNDROME, AUTOSOMAL DOMINANT",,
1,Percent,100070,"AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1; AAA1","ANEURYSM, ABDOMINAL AORTIC; AAA;; ABDOMINAL AO...",
2,Number Sign,100100,PRUNE BELLY SYNDROME; PBS,"ABDOMINAL MUSCLES, ABSENCE OF, WITH URINARY TR...",
3,,100200,ABDUCENS PALSY,,
4,Number Sign,100300,ADAMS-OLIVER SYNDROME 1; AOS1,"AOS;; ABSENCE DEFECT OF LIMBS, SCALP, AND SKUL...","APLASIA CUTIS CONGENITA, CONGENITAL HEART DEFE..."


In [15]:
mimTitles_df[mimTitles_df['phenotype__included_titles_symbols'].str.contains("INCLUDED", na=False)]

Unnamed: 0,Prefix,phenotype_MIM_number,phenotype_preferred_title_symbol,phenotype_alternative_titles_symbols,phenotype__included_titles_symbols
4,Number Sign,100300,ADAMS-OLIVER SYNDROME 1; AOS1,"AOS;; ABSENCE DEFECT OF LIMBS, SCALP, AND SKUL...","APLASIA CUTIS CONGENITA, CONGENITAL HEART DEFE..."
8,Plus,100650,ALDEHYDE DEHYDROGENASE 2 FAMILY; ALDH2,"ALDEHYDE DEHYDROGENASE 2;; ALDH, LIVER MITOCHO...","SUBLINGUAL NITROGLYCERIN, SUSCEPTIBILITY TO PO..."
30,Number Sign,101200,APERT SYNDROME,"ACROCEPHALOSYNDACTYLY, TYPE I; ACS1;; ACS I","APERT-CROUZON DISEASE, INCLUDED;; ACROCEPHALOS..."
31,Number Sign,101400,SAETHRE-CHOTZEN SYNDROME; SCS,"ACROCEPHALOSYNDACTYLY, TYPE III; ACS3;; ACS II...",SAETHRE-CHOTZEN SYNDROME WITH EYELID ANOMALIES...
32,Number Sign,101600,PFEIFFER SYNDROME,"ACROCEPHALOSYNDACTYLY, TYPE V; ACS5;; ACS V;; ...","CRANIOFACIAL-SKELETAL-DERMATOLOGIC DYSPLASIA, ..."
...,...,...,...,...,...
25884,Number Sign,618057,"DRUG METABOLISM, ALTERED, CES1-RELATED",,"CARBOXYLESTERASE 1 DEFICIENCY, INCLUDED;; MONO..."
25974,Number Sign,618147,INTELLECTUAL DEVELOPMENTAL DISORDER WITH HYPER...,,"CHROMOSOME 14q32 DELETION SYNDROME, INCLUDED"
26086,Asterisk,618259,LONG INTERGENIC NONCODING RNA 1565; LINC01565,lincRNA 1565;; GR6,"GR6/EVI1 FUSION GENE, INCLUDED"
26186,Asterisk,618359,ZINC FINGER PROTEIN 197; ZNF197,ZINC FINGER PROTEIN 20; ZNF20,VHL-ASSOCIATED KRAB-A DOMAIN-CONTAINING PROTEI...


In [84]:
# Drop rows whose Prefix is Asterisk (*, gene records) or Caret (^, entries removed from OMIM or moved to another entry)

In [16]:
mimTitles_df = mimTitles_df[mimTitles_df['Prefix'] != "Asterisk"]

In [17]:
mimTitles_df = mimTitles_df[mimTitles_df['Prefix'] != "Caret"]

In [18]:
df = mimTitles_df.copy()
df_preffered = df[['Prefix', 'phenotype_MIM_number', 'phenotype_preferred_title_symbol']].copy()
df_preffered['titles_symbols'] = df_preffered['phenotype_preferred_title_symbol']
df_preffered

Unnamed: 0,Prefix,phenotype_MIM_number,phenotype_preferred_title_symbol,titles_symbols
0,,100050,"AARSKOG SYNDROME, AUTOSOMAL DOMINANT","AARSKOG SYNDROME, AUTOSOMAL DOMINANT"
1,Percent,100070,"AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1; AAA1","AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1; AAA1"
2,Number Sign,100100,PRUNE BELLY SYNDROME; PBS,PRUNE BELLY SYNDROME; PBS
3,,100200,ABDUCENS PALSY,ABDUCENS PALSY
4,Number Sign,100300,ADAMS-OLIVER SYNDROME 1; AOS1,ADAMS-OLIVER SYNDROME 1; AOS1
...,...,...,...,...
26266,Number Sign,618440,OCULOSKELETODENTAL SYNDROME; OCSKD,OCULOSKELETODENTAL SYNDROME; OCSKD
26269,Number Sign,618443,NEURODEVELOPMENTAL DISORDER WITH OR WITHOUT VA...,NEURODEVELOPMENTAL DISORDER WITH OR WITHOUT VA...
26274,Number Sign,618449,"CILIARY DYSKINESIA, PRIMARY, 41; CILD41","CILIARY DYSKINESIA, PRIMARY, 41; CILD41"
26276,Number Sign,618451,"NEURODEGENERATION, EARLY-ONSET, WITH CHOREOATH...","NEURODEGENERATION, EARLY-ONSET, WITH CHOREOATH..."


In [19]:
df = mimTitles_df.copy()
df_alt = df[['Prefix', 'phenotype_MIM_number', 'phenotype_alternative_titles_symbols']].copy()
df_alt['titles_symbols'] = df_alt['phenotype_alternative_titles_symbols']
df_alt

Unnamed: 0,Prefix,phenotype_MIM_number,phenotype_alternative_titles_symbols,titles_symbols
0,,100050,,
1,Percent,100070,"ANEURYSM, ABDOMINAL AORTIC; AAA;; ABDOMINAL AO...","ANEURYSM, ABDOMINAL AORTIC; AAA;; ABDOMINAL AO..."
2,Number Sign,100100,"ABDOMINAL MUSCLES, ABSENCE OF, WITH URINARY TR...","ABDOMINAL MUSCLES, ABSENCE OF, WITH URINARY TR..."
3,,100200,,
4,Number Sign,100300,"AOS;; ABSENCE DEFECT OF LIMBS, SCALP, AND SKUL...","AOS;; ABSENCE DEFECT OF LIMBS, SCALP, AND SKUL..."
...,...,...,...,...
26266,Number Sign,618440,"CATARACTS, EARLY-ONSET, WITH SKELETAL AND DENT...","CATARACTS, EARLY-ONSET, WITH SKELETAL AND DENT..."
26269,Number Sign,618443,,
26274,Number Sign,618449,,
26276,Number Sign,618451,,


In [20]:
df = mimTitles_df.copy()
df_inc = df[['Prefix', 'phenotype_MIM_number', 'phenotype__included_titles_symbols']].copy()
df_inc['titles_symbols'] = df_inc['phenotype__included_titles_symbols']
df_inc

Unnamed: 0,Prefix,phenotype_MIM_number,phenotype__included_titles_symbols,titles_symbols
0,,100050,,
1,Percent,100070,,
2,Number Sign,100100,,
3,,100200,,
4,Number Sign,100300,"APLASIA CUTIS CONGENITA, CONGENITAL HEART DEFE...","APLASIA CUTIS CONGENITA, CONGENITAL HEART DEFE..."
...,...,...,...,...
26266,Number Sign,618440,,
26269,Number Sign,618443,,
26274,Number Sign,618449,,
26276,Number Sign,618451,,


In [21]:
df_combined = pd.concat([df_preffered, df_alt, df_inc], ignore_index=True)
df_combined.drop(["phenotype_preferred_title_symbol","phenotype_alternative_titles_symbols", "phenotype__included_titles_symbols"], axis=1, inplace=True)
df_combined

Unnamed: 0,Prefix,phenotype_MIM_number,titles_symbols
0,,100050,"AARSKOG SYNDROME, AUTOSOMAL DOMINANT"
1,Percent,100070,"AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1; AAA1"
2,Number Sign,100100,PRUNE BELLY SYNDROME; PBS
3,,100200,ABDUCENS PALSY
4,Number Sign,100300,ADAMS-OLIVER SYNDROME 1; AOS1
...,...,...,...
26743,Number Sign,618440,
26744,Number Sign,618443,
26745,Number Sign,618449,
26746,Number Sign,618451,


In [22]:
df_combined.dropna(subset=['titles_symbols'], inplace=True)
df_combined

Unnamed: 0,Prefix,phenotype_MIM_number,titles_symbols
0,,100050,"AARSKOG SYNDROME, AUTOSOMAL DOMINANT"
1,Percent,100070,"AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1; AAA1"
2,Number Sign,100100,PRUNE BELLY SYNDROME; PBS
3,,100200,ABDUCENS PALSY
4,Number Sign,100300,ADAMS-OLIVER SYNDROME 1; AOS1
...,...,...,...
26442,Percent,617955,"FETAL HYDANTOIN SYNDROME, INCLUDED; FHS, INCLUDED"
26449,Number Sign,617966,"EZETIMIBE, RESPONSE TO, INCLUDED"
26496,Number Sign,618057,"CARBOXYLESTERASE 1 DEFICIENCY, INCLUDED;; MONO..."
26548,Number Sign,618147,"CHROMOSOME 14q32 DELETION SYNDROME, INCLUDED"


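Split each `;;`-separated entry into its title and symbol, which are separated by a single `;`. Entries that do not contain exactly one `;` (a title with no symbol, or a title with several symbols) are skipped.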

In [23]:
def split_name_symbol_pairs(value):
    pairs = value.split(';;')

    result = []
    for pair in pairs:
        parts = pair.strip().split(';')
        if len(parts) == 2:
            result.append(parts)
    return result
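
To make the parsing rule concrete, the sketch below repeats the function so that it runs on its own and applies it to three title strings that appear in the OMIM tables shown earlier. The variable names `title_only`, `one_pair`, and `mixed` are illustrative.

```python
def split_name_symbol_pairs(value):
    # Same logic as the notebook function above, repeated so the sketch is self-contained.
    pairs = value.split(';;')
    result = []
    for pair in pairs:
        parts = pair.strip().split(';')
        if len(parts) == 2:
            result.append(parts)
    return result


# A title without a symbol yields no pairs.
title_only = split_name_symbol_pairs("ABDUCENS PALSY")
# A single "title; symbol" entry yields one pair (the symbol keeps its leading space).
one_pair = split_name_symbol_pairs("PRUNE BELLY SYNDROME; PBS")
# With two ";;"-separated entries, only the one that has a symbol is kept.
mixed = split_name_symbol_pairs("ACROCEPHALOSYNDACTYLY, TYPE I; ACS1;; ACS I")

print(title_only)  # []
print(one_pair)    # [['PRUNE BELLY SYNDROME', ' PBS']]
print(mixed)       # [['ACROCEPHALOSYNDACTYLY, TYPE I', ' ACS1']]
```

The leading space on each symbol is expected at this stage; it is stripped when the pairs are expanded into columns.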


In [24]:
split_data = df_combined['titles_symbols'].apply(split_name_symbol_pairs)
split_data

0                                                       []
1        [[AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1,  AAA1]]
2                           [[PRUNE BELLY SYNDROME,  PBS]]
3                                                       []
4                       [[ADAMS-OLIVER SYNDROME 1,  AOS1]]
                               ...                        
26442    [[FETAL HYDANTOIN SYNDROME, INCLUDED,  FHS, IN...
26449                                                   []
26496                                                   []
26548                                                   []
26721                                                   []
Name: titles_symbols, Length: 13988, dtype: object

In [25]:
df_reset = df_combined.reset_index(drop=True)
split_data_reset = split_data.reset_index(drop=True)

Put titles and symbols into separate columns

In [26]:
expanded_rows = []

for row, titles_symbols in zip(df_reset.iterrows(), split_data_reset):
    _, row_data = row  # Unpack the row from iterrows()
    col1_value = row_data['Prefix']
    col2_value = row_data['phenotype_MIM_number']
    
    # Emit one output row per [title, symbol] pair; records with no pairs emit nothing
    for title_symbol in titles_symbols:
        title = title_symbol[0].strip()
        symbol = title_symbol[1].strip()
        # Repeat Prefix and phenotype_MIM_number on every expanded row
        expanded_rows.append([col1_value, col2_value, title, symbol])


In [27]:
expanded_df = pd.DataFrame(expanded_rows, columns=['Prefix', 'phenotype_MIM_number', 'pheno_title', 'pheno_symbol'])
expanded_df

Unnamed: 0,Prefix,phenotype_MIM_number,pheno_title,pheno_symbol
0,Percent,100070,"AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1",AAA1
1,Number Sign,100100,PRUNE BELLY SYNDROME,PBS
2,Number Sign,100300,ADAMS-OLIVER SYNDROME 1,AOS1
3,Plus,100650,ALDEHYDE DEHYDROGENASE 2 FAMILY,ALDH2
4,Number Sign,100800,ACHONDROPLASIA,ACH
...,...,...,...,...
7922,Number Sign,616553,"DYSKERATOSIS CONGENITA, AUTOSOMAL RECESSIVE 7,...","DKCB7, INCLUDED"
7923,Number Sign,617047,"CARDIOMYOPATHY, FAMILIAL RESTRICTIVE, 5, INCLUDED","RCM5, INCLUDED"
7924,Number Sign,617347,LOW DENSITY LIPOPROTEIN CHOLESTEROL LEVEL QUAN...,"LDLCQ5, INCLUDED"
7925,Number Sign,617562,"JOUBERT SYNDROME 29, INCLUDED","JBTS29, INCLUDED"
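
The loop-based expansion above could also be sketched with `DataFrame.explode`, which produces one row per list element. This is an alternative sketch, not the notebook's method; the toy frames below stand in for `df_reset` and `split_data_reset`, with values copied from the outputs above.

```python
import pandas as pd

# Toy stand-ins for df_reset and split_data_reset.
df_reset = pd.DataFrame({
    "Prefix": ["Percent", None, "Number Sign"],
    "phenotype_MIM_number": [100070, 100200, 100100],
})
split_data_reset = pd.Series([
    [["AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1", " AAA1"]],
    [],
    [["PRUNE BELLY SYNDROME", " PBS"]],
])

expanded = (
    df_reset.assign(pair=split_data_reset)
    .explode("pair")           # one row per [title, symbol] pair; empty lists become NaN
    .dropna(subset=["pair"])   # drop records that had no pair
    .reset_index(drop=True)
)
# Pull title and symbol out of each pair and trim surrounding whitespace.
expanded["pheno_title"] = expanded["pair"].str[0].str.strip()
expanded["pheno_symbol"] = expanded["pair"].str[1].str.strip()
expanded = expanded.drop(columns="pair")
print(expanded)
```

Both approaches rely on `df_reset` and `split_data_reset` sharing the same positional index, which is what the `reset_index(drop=True)` calls in the previous cell guarantee.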


In `pheno_symbol`, there are values that are not just symbols

In [28]:
comma_df = expanded_df[expanded_df["pheno_symbol"].str.contains(",")]
comma_df

Unnamed: 0,Prefix,phenotype_MIM_number,pheno_title,pheno_symbol
5985,Number Sign,106210,"ANIRIDIA II, FORMERLY","AN2, FORMERLY"
6084,Number Sign,122000,"CORNEAL ENDOTHELIAL DYSTROPHY 1, AUTOSOMAL DOM...","CHED1, FORMERLY"
6125,Number Sign,129500,"ECTODERMAL DYSPLASIA, HIDROTIC, 2, FORMERLY","HED2, FORMERLY"
6127,Number Sign,130000,"EHLERS-DANLOS SYNDROME, TYPE I, FORMERLY","EDS1, FORMERLY"
6128,Number Sign,130010,"EHLERS-DANLOS SYNDROME, TYPE II, FORMERLY","EDS2, FORMERLY"
...,...,...,...,...
7922,Number Sign,616553,"DYSKERATOSIS CONGENITA, AUTOSOMAL RECESSIVE 7,...","DKCB7, INCLUDED"
7923,Number Sign,617047,"CARDIOMYOPATHY, FAMILIAL RESTRICTIVE, 5, INCLUDED","RCM5, INCLUDED"
7924,Number Sign,617347,LOW DENSITY LIPOPROTEIN CHOLESTEROL LEVEL QUAN...,"LDLCQ5, INCLUDED"
7925,Number Sign,617562,"JOUBERT SYNDROME 29, INCLUDED","JBTS29, INCLUDED"


The non-symbol vocabulary present in `pheno_symbol` is: DIGENIC, INCLUDED, FORMERLY, ICHTHYOSIS, CONGENITAL, WITH TRICHOTHIODYSTROPHY (identified via manual review)

These terms always occur after a comma (except ICHTHYOSIS)

In [29]:
comma_df[comma_df["pheno_symbol"].str.contains("DIGENIC")]

Unnamed: 0,Prefix,phenotype_MIM_number,pheno_title,pheno_symbol
7368,Number Sign,611818,"LONG QT SYNDROME 2/9, DIGENIC, INCLUDED","LQT2/9, DIGENIC, INCLUDED"
7684,Number Sign,192500,"LONG QT SYNDROME 1/2, DIGENIC, INCLUDED","LQT1/2, DIGENIC, INCLUDED"
7795,Number Sign,603830,"LONG QT SYNDROME 2/3, DIGENIC, INCLUDED","LQT2/3, DIGENIC, INCLUDED"
7796,Number Sign,603830,"LONG QT SYNDROME 3/6, DIGENIC, INCLUDED","LQT3/6, DIGENIC, INCLUDED"
7876,Number Sign,613688,"LONG QT SYNDROME 1/2, DIGENIC, INCLUDED","LQT1/2, DIGENIC, INCLUDED"
7877,Number Sign,613688,"LONG QT SYNDROME 2/3, DIGENIC, INCLUDED","LQT2/3, DIGENIC, INCLUDED"
7878,Number Sign,613688,"LONG QT SYNDROME 2/5, DIGENIC, INCLUDED","LQT2/5, DIGENIC, INCLUDED"
7879,Number Sign,613688,"LONG QT SYNDROME 2/9, DIGENIC, INCLUDED","LQT2/9, DIGENIC, INCLUDED"
7880,Number Sign,613693,"LONG QT SYNDROME 3/6, DIGENIC, INCLUDED","LQT3/6, DIGENIC, INCLUDED"
7881,Number Sign,613695,"LONG QT SYNDROME 2/5, DIGENIC, INCLUDED","LQT2/5, DIGENIC, INCLUDED"


Remove the first comma and everything after it in the `pheno_symbol` column

In [30]:
expanded_df['pheno_symbol'] = expanded_df['pheno_symbol'].str.split(',').str[0]
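
A small sketch of what this truncation does, using symbol strings that appear in the outputs above. The `qualifier` series is a hypothetical addition (not part of the notebook) showing how the dropped text could be kept in a separate column if it were needed later.

```python
import pandas as pd

symbols = pd.Series(["AAA1", "AN2, FORMERLY", "LQT2/9, DIGENIC, INCLUDED"])

# Same operation as the notebook cell: keep only the text before the first comma.
cleaned = symbols.str.split(",").str[0]

# Hypothetical extra: keep whatever followed the first comma (NaN when there is none).
qualifier = symbols.str.split(",", n=1).str[1].str.strip()

print(cleaned.tolist())
print(qualifier.tolist())
```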

In [31]:
cleaned_mimTitles_df = expanded_df.copy()

In [32]:
cleaned_mimTitles_df[cleaned_mimTitles_df['pheno_symbol'].str.contains("A2M", na=False)]

Unnamed: 0,Prefix,phenotype_MIM_number,pheno_title,pheno_symbol
3983,,614036,ALPHA-2-MACROGLOBULIN DEFICIENCY,A2MD


In [33]:
cleaned_mimTitles_df.loc[cleaned_mimTitles_df["pheno_symbol"]=="A2MD"]

Unnamed: 0,Prefix,phenotype_MIM_number,pheno_title,pheno_symbol
3983,,614036,ALPHA-2-MACROGLOBULIN DEFICIENCY,A2MD


In [36]:
cleaned_mimTitles_df.loc[cleaned_mimTitles_df["phenotype_MIM_number"]==166760]

Unnamed: 0,Prefix,phenotype_MIM_number,pheno_title,pheno_symbol
600,Number Sign,166760,"OTITIS MEDIA, SUSCEPTIBILITY TO",OMS


In [37]:
cleaned_mimTitles_df.loc[cleaned_mimTitles_df["pheno_symbol"]=="OMS"]


Unnamed: 0,Prefix,phenotype_MIM_number,pheno_title,pheno_symbol
600,Number Sign,166760,"OTITIS MEDIA, SUSCEPTIBILITY TO",OMS


In [63]:
morbidmap_df = pd.read_csv(
    "../input/morbidmap.txt", 
    sep="\t", 
    comment="#",
    names=["Phenotype", "gene_symbols", "gene_MIM_number", "gene_cyto_location"]
    )
morbidmap_df.head(50)

Unnamed: 0,Phenotype,gene_symbols,gene_MIM_number,gene_cyto_location
0,"17,20-lyase deficiency, isolated, 202110 (3)","CYP17A1, CYP17, P450C17",609300,10q24.32
1,"17-alpha-hydroxylase/17,20-lyase deficiency, 2...","CYP17A1, CYP17, P450C17",609300,10q24.32
2,"2-aminoadipic 2-oxoadipic aciduria, 204750 (3)","DHTKD1, KIAA1630, AMOXAD, CMT2Q",614984,10p14
3,"2-methylbutyrylglycinuria, 610006 (3)","ACADSB, SBCAD",600301,10q26.13
4,"3-M syndrome 1, 273750 (3)","CUL7, 3M1",609577,6p21.1
5,"3-M syndrome 2, 612921 (3)","OBSL1, KIAA0657, 3M2",610991,2q35
6,"3-M syndrome 3, 614205 (3)","CCDC8, 3M3",614145,19q13.32
7,"3-Methylcrotonyl-CoA carboxylase 1 deficiency,...","MCCC1, MCCA",609010,3q27.1
8,"3-Methylcrotonyl-CoA carboxylase 2 deficiency,...","MCCC2, MCCB",609014,5q13.2
9,"3-hydroxyacyl-CoA dehydrogenase deficiency, 23...","HADHSC, SCHAD, HHF4",601609,4q25


In [64]:
morbidmap_df[morbidmap_df['gene_symbols'].str.contains("OMS", na=False)]

Unnamed: 0,Phenotype,gene_symbols,gene_MIM_number,gene_cyto_location
514,Abdominal obesity-metabolic syndrome (2),AOMS2,605572,17p12
515,Abdominal obesity-metabolic syndrome 1 (2),"AOMS1, SYNX",605552,3q27
516,"Abdominal obesity-metabolic syndrome 3, 615812...","DYRK1B, MIRK, AOMS3",604556,19q13.2
5687,"Stromme syndrome, 243605 (3)","CENPF, CILD31, STROMS",600236,1q41
7130,"{Otitis media, susceptibility to} (2)",OMS,166760,10q26.3


In [68]:
morbidmap_df[morbidmap_df['gene_symbols'].str.contains("A2ML1", na=False)]

Unnamed: 0,Phenotype,gene_symbols,gene_MIM_number,gene_cyto_location


In [69]:
morbidmap_df[morbidmap_df["gene_MIM_number"] == 610627]

Unnamed: 0,Phenotype,gene_symbols,gene_MIM_number,gene_cyto_location


In [70]:
def split_phenotype_on_last_comma(phenotype):
    parts = phenotype.rsplit(',', 1)
    return parts[0], parts[1] if len(parts) > 1 else ''


morbidmap_df[['phenotype', 'MIM_number/phenotype_mapping_key']] = morbidmap_df['Phenotype'].apply(lambda x: pd.Series(split_phenotype_on_last_comma(x)))


morbidmap_df.drop(columns=['Phenotype'], inplace=True)


In [71]:
morbidmap_df

Unnamed: 0,gene_symbols,gene_MIM_number,gene_cyto_location,phenotype,MIM_number/phenotype_mapping_key
0,"CYP17A1, CYP17, P450C17",609300,10q24.32,"17,20-lyase deficiency, isolated",202110 (3)
1,"CYP17A1, CYP17, P450C17",609300,10q24.32,"17-alpha-hydroxylase/17,20-lyase deficiency",202110 (3)
2,"DHTKD1, KIAA1630, AMOXAD, CMT2Q",614984,10p14,2-aminoadipic 2-oxoadipic aciduria,204750 (3)
3,"ACADSB, SBCAD",600301,10q26.13,2-methylbutyrylglycinuria,610006 (3)
4,"CUL7, 3M1",609577,6p21.1,3-M syndrome 1,273750 (3)
...,...,...,...,...,...
7408,"CCR5, CMKBR5, CCCKR5, IDDM22",601373,3p21.31,"{West nile virus, susceptibility to}",610379 (3)
7409,"REST, NRSF, WT6, GINGF5, HGF5",600571,4q12,"{Wilms tumor 6, susceptibility to}",616806 (3)
7410,"POU6F2, WTSL, WT5",609062,7p14.1,{Wilms tumor susceptibility-5},601583 (3)
7411,"NOD2, CARD15, IBD1, CD, YAOS, BLAUS",605956,16q12.1,{Yao syndrome},617321 (3)


In [73]:
morbidmap_df.loc[morbidmap_df["gene_MIM_number"]==166760]

Unnamed: 0,gene_symbols,gene_MIM_number,gene_cyto_location,phenotype,MIM_number/phenotype_mapping_key
7130,OMS,166760,10q26.3,{Otitis media,susceptibility to} (2)
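
The row above shows a limitation of splitting on the last comma: when a phenotype string has no MIM number, the comma inside the phenotype name is used as the split point. A more defensive parser could be sketched with a regular expression, assuming each string has the layout `name[, six-digit MIM number] (mapping key)`. The pattern and the `parse_phenotype` helper are illustrative assumptions, not part of the notebook.

```python
import re

# Assumed layout: "<phenotype name>[, <6-digit MIM number>] (<1-digit mapping key>)"
PHENOTYPE_RE = re.compile(
    r"^(?P<name>.*?)(?:,\s*(?P<mim>\d{6}))?\s*\((?P<key>\d)\)\s*$"
)


def parse_phenotype(text):
    """Return (name, phenotype_MIM_number, mapping_key); missing parts are ''."""
    match = PHENOTYPE_RE.match(text)
    if match is None:
        # Unexpected layout: return the text unchanged rather than guessing.
        return text, "", ""
    return match.group("name").strip(), match.group("mim") or "", match.group("key")


examples = [
    "17,20-lyase deficiency, isolated, 202110 (3)",
    "3-M syndrome 1, 273750 (3)",
    "{Otitis media, susceptibility to} (2)",
    "Abdominal obesity-metabolic syndrome (2)",
]
parsed = [parse_phenotype(example) for example in examples]
for item in parsed:
    print(item)
```

With this layout assumption, entries without a phenotype MIM number come back with an empty string in the second position instead of a fragment of the name.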


In [62]:
morbidmap_df[morbidmap_df['gene_symbols'].str.contains("OMS", na=False)]

Unnamed: 0,gene_symbols,gene_MIM_number,gene_cyto_location,phenotype,MIM_number/phenotype_mapping_key
514,AOMS2,605572,17p12,Abdominal obesity-metabolic syndrome (2),
515,"AOMS1, SYNX",605552,3q27,Abdominal obesity-metabolic syndrome 1 (2),
516,"DYRK1B, MIRK, AOMS3",604556,19q13.2,Abdominal obesity-metabolic syndrome 3,615812 (3)
5687,"CENPF, CILD31, STROMS",600236,1q41,Stromme syndrome,243605 (3)
7130,OMS,166760,10q26.3,{Otitis media,susceptibility to} (2)


In [56]:
morbidmap_df[morbidmap_df["gene_MIM_number"] == 610627]

Unnamed: 0,gene_symbols,gene_MIM_number,gene_cyto_location,phenotype,MIM_number/phenotype_mapping_key


In [75]:
def split_phenotype_description(description):
    # Separate the leading text (the MIM number) from an optional trailing "(n)" mapping key
    match = re.match(r"(.*?)(\s*\(\d+\))?$", description)
    if match:
        return match.group(1).strip(), match.group(2).strip() if match.group(2) else ''
    else:
        return description, ''


morbidmap_df[['phenotype_MIM_number', 'phenotype_mapping_key']] = morbidmap_df['MIM_number/phenotype_mapping_key'].apply(
    lambda x: pd.Series(split_phenotype_description(x))
)
morbidmap_df.drop(columns=['MIM_number/phenotype_mapping_key', 'gene_cyto_location','phenotype','phenotype_mapping_key', 'gene_symbols'], inplace=True)


In [76]:
morbidmap_df

Unnamed: 0,gene_MIM_number,phenotype_MIM_number
0,609300,202110
1,609300,202110
2,614984,204750
3,600301,610006
4,609577,273750
...,...,...
7408,601373,610379
7409,600571,616806
7410,609062,601583
7411,605956,617321


In [77]:
morbidmap_df.loc[morbidmap_df["gene_MIM_number"] == "166760"]

Unnamed: 0,gene_MIM_number,phenotype_MIM_number
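
The empty result here appears consistent with a dtype mismatch rather than a missing record: the earlier lookup with the integer `166760` returned a row, so `gene_MIM_number` holds integers, and comparing it with the string `"166760"` matches nothing. A minimal sketch on a toy frame (values copied from the outputs above):

```python
import pandas as pd

toy = pd.DataFrame({"gene_MIM_number": [609300, 166760]})

# Comparing an integer column with a string yields all False, not an error.
by_string = toy.loc[toy["gene_MIM_number"] == "166760"]
# Comparing with an integer, or casting the column first, finds the row.
by_int = toy.loc[toy["gene_MIM_number"] == 166760]
by_cast = toy.loc[toy["gene_MIM_number"].astype(str) == "166760"]

print(len(by_string), len(by_int), len(by_cast))
```

Checking `df.dtypes` before filtering or merging on an identifier column is a cheap way to catch this kind of silent mismatch.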


In [78]:
cleaned_mimTitles_df['phenotype_MIM_number'] = cleaned_mimTitles_df['phenotype_MIM_number'].astype(str) 

In [79]:
morbidmap_and_mimTitles_df = pd.merge(morbidmap_df, cleaned_mimTitles_df, on='phenotype_MIM_number', how='left')
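
Because this is a left merge, rows from `morbidmap_df` with no matching title-symbol pair are kept with empty `Prefix`, `pheno_title`, and `pheno_symbol`. One way to count them is the `indicator` option of `merge`; the sketch below uses toy frames with values copied from the outputs in this notebook.

```python
import pandas as pd

left = pd.DataFrame({
    "gene_MIM_number": [609300, 614984],
    "phenotype_MIM_number": ["202110", "204750"],
})
right = pd.DataFrame({
    "phenotype_MIM_number": ["204750"],
    "pheno_symbol": ["AMOXAD"],
})

# indicator=True adds a "_merge" column: "both" for matched rows, "left_only" otherwise.
merged = left.merge(right, on="phenotype_MIM_number", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged["_merge"].value_counts())
print(unmatched)
```

The key columns on both sides need the same dtype for the join to pair rows up, which is what the `astype(str)` cast in the previous cell takes care of.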

In [80]:
morbidmap_and_mimTitles_df

Unnamed: 0,gene_MIM_number,phenotype_MIM_number,Prefix,pheno_title,pheno_symbol
0,609300,202110,,,
1,609300,202110,,,
2,614984,204750,Number Sign,2-AMINOADIPIC 2-OXOADIPIC ACIDURIA,AMOXAD
3,600301,610006,Number Sign,SHORT/BRANCHED-CHAIN ACYL-CoA DEHYDROGENASE DE...,SBCADD
4,609577,273750,Number Sign,THREE M SYNDROME 1,3M1
...,...,...,...,...,...
9357,600571,616806,Number Sign,WILMS TUMOR 6,WT6
9358,609062,601583,Number Sign,WILMS TUMOR 5,WT5
9359,609062,601583,Number Sign,"WILMS TUMOR, SUSCEPTIBILITY TO",WTSL
9360,605956,617321,Number Sign,YAO SYNDROME,YAOS


In [82]:
morbidmap_and_mimTitles_df.loc[morbidmap_and_mimTitles_df["phenotype_MIM_number"] == "166760"]

Unnamed: 0,gene_MIM_number,phenotype_MIM_number,Prefix,pheno_title,pheno_symbol


In [None]:
morbidmap_and_mimTitles_df.loc[morbidmap_and_mimTitles_df["pheno_symbol"]=="A2MD"]

Unnamed: 0,gene_MIM_number,phenotype_MIM_number,Prefix,pheno_title,pheno_symbol
744,103950,614036,,ALPHA-2-MACROGLOBULIN DEFICIENCY,A2MD


In [None]:
mim2gene_and_morbidmap_and_mimTitles_df = pd.merge(morbidmap_and_mimTitles_df, mim2gene_df, on='gene_MIM_number', how='left')

In [None]:
mim2gene_and_morbidmap_and_mimTitles_df

Unnamed: 0,gene_MIM_number,phenotype_MIM_number,Prefix,pheno_title,pheno_symbol,Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
0,609300,202110,,,,GENE ID:1586,CYP17A1,ENSG00000148795
1,609300,202110,,,,GENE ID:1586,CYP17A1,ENSG00000148795
2,614984,204750,Number Sign,2-AMINOADIPIC 2-OXOADIPIC ACIDURIA,AMOXAD,GENE ID:55526,DHTKD1,ENSG00000181192
3,600301,610006,Number Sign,SHORT/BRANCHED-CHAIN ACYL-CoA DEHYDROGENASE DE...,SBCADD,GENE ID:36,ACADSB,ENSG00000196177
4,609577,273750,Number Sign,THREE M SYNDROME 1,3M1,GENE ID:9820,CUL7,ENSG00000044090
...,...,...,...,...,...,...,...,...
9357,600571,616806,Number Sign,WILMS TUMOR 6,WT6,GENE ID:5978,REST,ENSG00000084093
9358,609062,601583,Number Sign,WILMS TUMOR 5,WT5,GENE ID:11281,POU6F2,ENSG00000106536
9359,609062,601583,Number Sign,"WILMS TUMOR, SUSCEPTIBILITY TO",WTSL,GENE ID:11281,POU6F2,ENSG00000106536
9360,605956,617321,Number Sign,YAO SYNDROME,YAOS,GENE ID:64127,NOD2,ENSG00000167207


In [None]:
mim2gene_and_morbidmap_and_mimTitles_df.loc[mim2gene_and_morbidmap_and_mimTitles_df["pheno_symbol"]=="A2MD"]

Unnamed: 0,gene_MIM_number,phenotype_MIM_number,Prefix,pheno_title,pheno_symbol,Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
744,103950,614036,,ALPHA-2-MACROGLOBULIN DEFICIENCY,A2MD,GENE ID:2,A2M,ENSG00000175899


In [None]:
mim2gene_and_morbidmap_and_mimTitles_df.dropna(subset=['pheno_symbol'], inplace=True)

In [None]:
gene2disease_df = mim2gene_and_morbidmap_and_mimTitles_df.copy()

In [None]:
gene2disease_df.to_hdf(
    "../output/gene2disease_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['phenotype_MIM_number', 'Prefix', 'pheno_title', 'pheno_symbol',
       'Entrez Gene ID (NCBI)', 'Approved Gene Symbol (HGNC)',
       'Ensembl Gene ID (Ensembl)'],
      dtype='object')]

  gene2disease_df.to_hdf(


In [None]:
disease_analysis_df = subset_genes_df.copy()
disease_analysis_df

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028}
1,A1BG,A1B,{5},{ENSG00000121410},{1}
2,A1BG,ABG,{5},{ENSG00000121410},{1}
3,A1BG,GAB,{5},{ENSG00000121410},{1}
4,A1BG,HYST2477,{5},{ENSG00000121410},{1}
...,...,...,...,...,...
86546,ZZEF1,FLJ10821,{29027},{ENSG00000074755},{23140}
86547,ZZEF1,KIAA0399,{29027},{ENSG00000074755},{23140}
86548,ZZEF1,ZZZ4,{29027},{ENSG00000074755},{23140}
86549,ZZZ3,ATAC1,{24523},{ENSG00000036549},{26009}


In [None]:
print(gene2disease_df.dtypes)

gene_MIM_number                 int64
phenotype_MIM_number           object
Prefix                         object
pheno_title                    object
pheno_symbol                   object
Entrez Gene ID (NCBI)          object
Approved Gene Symbol (HGNC)    object
Ensembl Gene ID (Ensembl)      object
dtype: object


In [None]:
print(disease_analysis_df.dtypes)

primary_gene_symbol    object
alias_symbol           object
HGNC_ID                object
ENSG_ID                object
NCBI_ID                object
dtype: object


In [None]:
disease_analysis_df["Prefix Disease Symbol Match"] = disease_analysis_df.apply(
    lambda row: any(
        row['alias_symbol'].startswith(phenotype_symbol)  # Check if alias_symbol starts with phenotype symbol
        for phenotype_symbol in gene2disease_df.loc[
            gene2disease_df['Approved Gene Symbol (HGNC)'] == row['primary_gene_symbol'], 
            'pheno_symbol'
        ]  # Loop through all abbreviations for the gene_symbol
    ), axis=1
)
disease_analysis_df

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Prefix Disease Symbol Match
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False
1,A1BG,A1B,{5},{ENSG00000121410},{1},False
2,A1BG,ABG,{5},{ENSG00000121410},{1},False
3,A1BG,GAB,{5},{ENSG00000121410},{1},False
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False
...,...,...,...,...,...,...
86546,ZZEF1,FLJ10821,{29027},{ENSG00000074755},{23140},False
86547,ZZEF1,KIAA0399,{29027},{ENSG00000074755},{23140},False
86548,ZZEF1,ZZZ4,{29027},{ENSG00000074755},{23140},False
86549,ZZZ3,ATAC1,{24523},{ENSG00000036549},{26009},False


Add the 'Matching Phenotype Symbol' column with the first matching phenotype symbol (or an empty string if there is no match)

In [None]:
disease_analysis_df["Matching Phenotype Symbol"] = disease_analysis_df.apply(
    lambda row: next((
        symbol for symbol in gene2disease_df.loc[
            gene2disease_df['Approved Gene Symbol (HGNC)'] == row['primary_gene_symbol'], 
            'pheno_symbol'
        ] 
        if row['alias_symbol'].startswith(symbol)  # Check if alias_symbol starts with abbreviation
    ),  ""), axis=1
)
disease_analysis_df

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Prefix Disease Symbol Match,Matching Phenotype Symbol
0,A-GAMMA3'E,A-GAMMA-E,{},{},{109951028},False,
1,A1BG,A1B,{5},{ENSG00000121410},{1},False,
2,A1BG,ABG,{5},{ENSG00000121410},{1},False,
3,A1BG,GAB,{5},{ENSG00000121410},{1},False,
4,A1BG,HYST2477,{5},{ENSG00000121410},{1},False,
...,...,...,...,...,...,...,...
86546,ZZEF1,FLJ10821,{29027},{ENSG00000074755},{23140},False,
86547,ZZEF1,KIAA0399,{29027},{ENSG00000074755},{23140},False,
86548,ZZEF1,ZZZ4,{29027},{ENSG00000074755},{23140},False,
86549,ZZZ3,ATAC1,{24523},{ENSG00000036549},{26009},False,
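
The two row-wise `apply` calls above each filter `gene2disease_df` once per alias row. A possible speed-up, sketched here on toy data, is to build a gene-to-symbols lookup once and derive both columns from a single pass. The names `symbols_by_gene` and `first_prefix_match` are illustrative, and the sketch assumes no `pheno_symbol` is an empty string (an empty string would match every alias under `startswith`).

```python
import pandas as pd

# Toy stand-ins for gene2disease_df and disease_analysis_df.
gene2disease = pd.DataFrame({
    "Approved Gene Symbol (HGNC)": ["A2M", "AAGAB", "AAGAB"],
    "pheno_symbol": ["A2MD", "PPKP1A", "PPKP1"],
})
aliases = pd.DataFrame({
    "primary_gene_symbol": ["A1BG", "A2M", "AAGAB", "AAGAB"],
    "alias_symbol": ["A1B", "A2MD", "PPKP1", "PPKP1A"],
})

# Build the gene -> list of phenotype symbols lookup once.
symbols_by_gene = (
    gene2disease.groupby("Approved Gene Symbol (HGNC)")["pheno_symbol"]
    .agg(list)
    .to_dict()
)


def first_prefix_match(row):
    """Return the first phenotype symbol that the alias starts with, else ''."""
    for symbol in symbols_by_gene.get(row["primary_gene_symbol"], []):
        if row["alias_symbol"].startswith(symbol):
            return symbol
    return ""


aliases["Matching Phenotype Symbol"] = aliases.apply(first_prefix_match, axis=1)
aliases["Prefix Disease Symbol Match"] = aliases["Matching Phenotype Symbol"] != ""
print(aliases)
```

As in the notebook, the reported symbol is the first match in row order, so an alias such as `PPKP1A` can be reported against either `PPKP1` or `PPKP1A` depending on how the phenotype symbols are ordered.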


In [None]:
disease_analysis_df["Prefix Disease Symbol Match"].value_counts()

Prefix Disease Symbol Match
False    81968
True      4583
Name: count, dtype: int64

In [None]:
disease_analysis_df.loc[disease_analysis_df["alias_symbol"]=="A2MD"]

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Prefix Disease Symbol Match,Matching Phenotype Symbol
14,A2M,A2MD,{7},{ENSG00000175899},{2},True,A2MD


In [None]:
disease_analysis_df.loc[disease_analysis_df["Prefix Disease Symbol Match"]]

Unnamed: 0,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,NCBI_ID,Prefix Disease Symbol Match,Matching Phenotype Symbol
14,A2M,A2MD,{7},{ENSG00000175899},{2},True,A2MD
39,AAAS,AAASB,{13666},{ENSG00000094914},{8086},True,AAAS
59,AAGAB,PPKP1,{25662},{ENSG00000103591},{79719},True,PPKP1
60,AAGAB,PPKP1A,{25662},{ENSG00000103591},{79719},True,PPKP1A
76,AARS1,CMT2N,{20},{ENSG00000090861},{16},True,CMT2N
...,...,...,...,...,...,...,...
86136,ZNF711,MRX97,{13128},{ENSG00000147180},{7552},True,MRX97
86347,ZNHIT3,PEHO,{12309},{ENSG00000273611},{9326},True,PEHO
86372,ZP1,OOMD1,{13187},{ENSG00000149506},{22917},True,OOMD1
86379,ZP3,OOMD3,{13189},{ENSG00000188372},{7784},True,OOMD3


In [None]:
disease_prefix_match_subset_genes_df = disease_analysis_df[disease_analysis_df["Prefix Disease Symbol Match"]]

In [None]:
disease_prefix_match_subset_genes_df.to_hdf(
    "../output/disease_prefix_match_subset_genes_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['primary_gene_symbol', 'alias_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID',
       'Matching Phenotype Symbol'],
      dtype='object')]

  disease_prefix_match_subset_genes_df.to_hdf(


In [None]:
disease_analysis_df.to_hdf(
    "../output/disease_analysis_df.h5", key='df', mode='w'
)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['primary_gene_symbol', 'alias_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID',
       'Matching Phenotype Symbol'],
      dtype='object')]

  disease_analysis_df.to_hdf(
