**Table of contents**<a id='toc0_'></a>    
- [Gene Symbol Capture Transformation](#toc1_)    
    - [Transforming symbol capture data into a table to be used as a sqlite file for a gene symbol relationship look up tool](#toc1_1_1_)    
  - [Set Up](#toc1_2_)    
    - [Import packages](#toc1_2_1_)    
    - [Define Functions](#toc1_2_2_)    
  - [Download gene records](#toc1_3_)    
    - [Ensembl gene records](#toc1_3_1_)    
      - [Combine symbols (primary and aliases) into one column](#toc1_3_1_1_)    
      - [Add "GENE ID:" as a prefix for NCBI IDs](#toc1_3_1_2_)    
      - [Combine concept ids( HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc1_3_1_3_)    
    - [HGNC gene records](#toc1_3_2_)    
      - [Combine symbols (primary and aliases) into one column](#toc1_3_2_1_)    
      - [Add "GENE ID:" as a prefix for NCBI IDs](#toc1_3_2_2_)    
      - [Combine concept ids( HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc1_3_2_3_)    
    - [NCBI gene records](#toc1_3_3_)    
      - [Extract necessary cross references](#toc1_3_3_1_)    
      - [Add "ENSG" as a prefix for Ensembl IDs](#toc1_3_3_2_)    
      - [Split aliases to one per row](#toc1_3_3_3_)    
      - [Combine symbols (primary and aliases) into one column](#toc1_3_3_4_)    
      - [Add "GENE ID:" as a prefix for NCBI IDs](#toc1_3_3_5_)    
      - [Add "HGNC:" as a prefix for HGNC IDs](#toc1_3_3_6_)    
      - [Combine concept ids( HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc1_3_3_7_)    
      - [Remove records with no associated gene symbols](#toc1_3_3_8_)    
  - [Combine gene records from Ensembl, HGNC, and NCBI](#toc1_4_)    
      - [Dropped duplicate rows and rows with no gene symbols.](#toc1_4_1_1_)    
  - [Add Ortholog relationship label](#toc1_5_)    
    - [Import ortholog sets](#toc1_5_1_)    
    - [Combine all species orthologs into one table](#toc1_5_2_)    
    - [Add Ortholog label to the table of gene records](#toc1_5_3_)    
  - [Add HGNC Previous Symbol relationship label](#toc1_6_)    
    - [Import the expired HGNC symbols](#toc1_6_1_)    
    - [Add HGNC Previous Symbol label to the table of gene records](#toc1_6_2_)    
  - [Add FLJ Clone Symbol label](#toc1_7_)    
    - [Import the FLJ Clone symbols](#toc1_7_1_)    
    - [Add FLJ Clone Symbol label to the table of gene records](#toc1_7_2_)    
  - [Add Gene Family Symbol label](#toc1_8_)    
    - [Import the Gene Family Root symbols](#toc1_8_1_)    
  - [Add Disease Symbol label](#toc1_9_)    
    - [Import the Disease root symbols](#toc1_9_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Gene Symbol Capture Transformation](#toc0_)

### <a id='toc1_1_1_'></a>[Transforming symbol capture data into a table to be used as a sqlite file for a gene symbol relationship look up tool](#toc0_)

This table is denormalized relative to the table built in symbol_capture_generation.ipynb.

## <a id='toc1_2_'></a>[Set Up](#toc0_)

### <a id='toc1_2_1_'></a>[Import packages](#toc0_)

In [93]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import glob
import os
import plotly.express as px

### <a id='toc1_2_2_'></a>[Define Functions](#toc0_)

In [94]:
def combine_columns(df, columns_to_combine, columns_to_keep, new_name, columns_to_drop):
    """Combine multiple columns into one while keeping associated data attached.
    Use this function when the columns to combine are easier to list 
    than the columns not to combine.
    
    :param df: The DataFrame containing the columns to be combined
    :param columns_to_combine: List of column names to combine into one
    :param columns_to_keep: List of columns to keep in the final DataFrame
    :param new_name: The name of the new combined column
    :param columns_to_drop: List of columns to drop from the final DataFrame
    :return: A new DataFrame with combined columns and selected columns retained
    """
    og_df = df.copy()

    combined_dfs = []

    # Loop through each column in columns_to_combine and create a new DataFrame
    for col in columns_to_combine:
        temp_df = og_df[list(set([col] + columns_to_keep))].copy()
        temp_df[new_name] = temp_df[col]
        combined_dfs.append(temp_df)

    df_combined = pd.concat(combined_dfs, ignore_index=True)
    df_combined.drop(columns_to_drop, axis=1, inplace=True)
    df_combined.drop_duplicates(inplace=True)
    
    return df_combined
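As a rough illustration of what this reshaping does, the sketch below applies a condensed restatement of the same logic to a toy table. The helper name `stack_columns`, the gene symbols, and the identifier are all made up for the example; it is not a replacement for `combine_columns`.

```python
import pandas as pd


def stack_columns(df, columns_to_combine, columns_to_keep, new_name, columns_to_drop):
    """Condensed, hypothetical restatement of combine_columns for illustration."""
    parts = []
    for col in columns_to_combine:
        # dict.fromkeys removes repeated column names while preserving their order
        part = df[list(dict.fromkeys([col] + columns_to_keep))].copy()
        part[new_name] = part[col]
        parts.append(part)
    stacked = pd.concat(parts, ignore_index=True)
    return stacked.drop(columns=columns_to_drop).drop_duplicates()


# Toy input: one gene with two aliases (all values are invented)
toy = pd.DataFrame(
    {
        "primary_gene_symbol": ["GENE1", "GENE1"],
        "alias_symbol": ["G1A", "G1B"],
        "HGNC_ID": ["HGNC:0", "HGNC:0"],
    }
)

stacked = stack_columns(
    toy,
    ["primary_gene_symbol", "alias_symbol"],
    ["primary_gene_symbol", "HGNC_ID"],
    "symbol",
    ["alias_symbol"],
)
print(stacked)
```

The primary symbol and each alias end up as separate rows in the `symbol` column, each still attached to the primary symbol and identifier.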

In [95]:
def combine_columns_except(df, new_name, cols_not_combine):
    """Combine multiple columns into one while keeping associated data attached.
    Use this function when the columns not to combine are easier to list 
    than the columns to combine.

    :param df: The DataFrame containing the columns to be combined
    :param new_name: The name of the new combined column
    :param cols_not_combine: List of columns TO NOT combine from the original DataFrame
    :return: A new DataFrame with combined columns and selected columns retained    
    """

    og_df = df.copy()
    
    columns_to_keep = [col for col in df.columns if col in cols_not_combine]
    columns_to_combine = [col for col in df.columns if col not in columns_to_keep]

    # Loop through each column in columns_to_combine and create a new DataFrame
    combined_df = []
    for col in columns_to_combine:
        temp_df = og_df[list(set([col] + columns_to_keep))].copy()
        temp_df[new_name] = temp_df[col]
        combined_df.append(temp_df)

    df_combined = pd.concat(combined_df, ignore_index=True)
    df_combined.drop(columns=columns_to_combine, axis=1, inplace=True)
    df_combined.drop_duplicates(inplace=True)
    
    return df_combined


In [96]:
def combine_columns_except_multiple(dfs, new_name,cols_not_combine):
    """Apply the combine_columns_except function to several DataFrames and then combine all of those dfs into one.

    :param dfs: The df(s) with the columns to combine (list or dictionary of DataFrames)
    :param new_name: The name of the new combined column
    :param cols_not_combine: List of columns TO NOT combine from the original DataFrame
    :return: A new DataFrame with combined dfs that had combined columns and selected columns retained
    """

    # Convert to list if the input is a dictionary
    if isinstance(dfs, dict):
        dfs = list(dfs.values())
    combined_dfs = []

    # Combine multiple columns into one while keeping associated data attached
    for df in dfs:
        combined_df = combine_columns_except(df, new_name, cols_not_combine)
        combined_dfs.append(combined_df)
    
    # Combine multiple dfs into one
    final_combined_df = pd.concat(combined_dfs, ignore_index=True)

    final_combined_df.drop_duplicates(inplace=True)
    final_combined_df = final_combined_df.dropna(subset=[new_name])

    
    return final_combined_df
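For readers more familiar with pandas reshaping, roughly the same result can be sketched with `DataFrame.melt`. This is an assumption-laden illustration, not the function used in this notebook: the species column names and all values below are invented, and the column order of the result may differ from what `combine_columns_except_multiple` returns.

```python
import pandas as pd

id_cols = ["Gene stable ID", "Gene name"]

# Toy ortholog tables; the species column names and all values are made up
set_1 = pd.DataFrame(
    {
        "Gene stable ID": ["ENSG_TOY_1"],
        "Gene name": ["GENE1"],
        "Species A gene name": ["Gene1"],
        "Species B gene name": ["gene1"],
    }
)
set_2 = pd.DataFrame(
    {
        "Gene stable ID": ["ENSG_TOY_1"],
        "Gene name": ["GENE1"],
        "Species C gene name": [None],
    }
)

# Melt every non-id column of each table into a single "Ortholog" column
melted = [
    df.melt(id_vars=id_cols, value_name="Ortholog").drop(columns="variable")
    for df in (set_1, set_2)
]

# Stack the tables, then drop empty orthologs and repeated rows
combined = (
    pd.concat(melted, ignore_index=True)
    .dropna(subset=["Ortholog"])
    .drop_duplicates()
)
print(combined)
```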

In [97]:
def add_relationship_and_source_if_symbol(
    destination_df, relationship_df, destination_df_cols, relationship_df_cols, relationship_val, source_val, combine_with=None
):
    """To add a relationship and source description for gene symbols based on a dataset.
    
    :param destination_df: The DataFrame that is getting the relationships added to it
    :param relationship_df: The DataFrame that is the source of the relationships
    :param destination_df_cols: The columns used to match gene symbols to the relationship_df
    (usually primary gene concept symbol, an identifier, and alternate gene symbol)
    :param relationship_df_cols: The columns used to match gene symbols to the destination_df
    (usually primary gene concept symbol, an identifier, and alternate gene symbol)
    :param relationship_val: The relationship that the presence of the alternate symbol
    in relationship_df indicates
    :param source_val: The label indicating where the dataset in relationship_df came from
    :param combine_with: A DataFrame produced by an earlier call to this function for a different relationship category
    :return: A new DataFrame with descriptive relationship and source labels

    Notes: 
    - The cols variables need to be in the correct order. The columns to be compared need 
    to be in the same positions in the lists. 
    - There cannot be any shared column names between relationship_df and destination_df.

    """
    # Create uppercase columns for merging
    for df, cols, suffix in [(destination_df, destination_df_cols, "_upper"), (relationship_df, relationship_df_cols, "_upper")]:
        df[[col + suffix for col in cols]] = df[cols].apply(lambda x: x.str.upper())
    
    # Perform the merge
    merged_df = destination_df.merge(
        relationship_df,
        left_on=[col + "_upper" for col in destination_df_cols],
        right_on=[col + "_upper" for col in relationship_df_cols],
        how="left",
        indicator=True,
    )

    # Assign relationships and sources
    merged_df["relationship"] = merged_df["_merge"].map(lambda x: relationship_val if x == "both" else "")
    merged_df["source"] = merged_df["_merge"].map(lambda x: source_val if x == "both" else "")
    
    # Drop extra columns and remove duplicates
    merged_df.drop(
        columns=[col + "_upper" for col in (destination_df_cols + relationship_df_cols)] + ["_merge"] + relationship_df.columns.tolist(),
        inplace=True
    )
    merged_df.drop_duplicates(inplace=True)

    destination_df.drop(
        columns= [col + "_upper" for col in (destination_df_cols)],
        inplace=True
    )
    relationship_df.drop(
    columns= [col + "_upper" for col in (relationship_df_cols)],
    inplace=True
    )
    # Optionally combine with an existing DataFrame
    if combine_with is not None:
        combined_df = pd.concat([combine_with, merged_df], ignore_index=True).drop_duplicates()
        return combined_df
    
    return merged_df
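The core of the function above is a case-insensitive left merge whose `_merge` indicator decides which rows receive a label. A minimal sketch of that idea on toy data follows; the symbols, the column names in `previous`, and the `"HGNC"` source label are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data: every symbol here is invented
genes = pd.DataFrame(
    {
        "symbol": ["g1old", "G1A"],
        "primary_gene_symbol": ["GENE1", "GENE1"],
    }
)
previous = pd.DataFrame(
    {
        "prev_symbol": ["G1OLD"],
        "approved_symbol": ["GENE1"],
    }
)

# Uppercase helper columns make the comparison case-insensitive
left = genes.assign(
    symbol_upper=genes["symbol"].str.upper(),
    primary_upper=genes["primary_gene_symbol"].str.upper(),
)
right = previous.assign(
    prev_upper=previous["prev_symbol"].str.upper(),
    approved_upper=previous["approved_symbol"].str.upper(),
)

merged = left.merge(
    right,
    left_on=["symbol_upper", "primary_upper"],
    right_on=["prev_upper", "approved_upper"],
    how="left",
    indicator=True,
)

# Rows found in both tables get the label; all others get an empty string
matched = (merged["_merge"] == "both").to_numpy()
merged["relationship"] = np.where(matched, "HGNC Previous Symbol", "")
merged["source"] = np.where(matched, "HGNC", "")

labeled = merged[["symbol", "primary_gene_symbol", "relationship", "source"]]
print(labeled)
```

Because the merge is a left join, every row of the destination table survives, whether or not it matched.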


In [98]:
def add_relationship_and_source_if_prefix(
    destination_df, relationship_df, exact_cols_df1, exact_cols_df2,
    prefix_col_df1, prefix_col_df2, relationship_val, source_val,
    combine_with=None, suffix="_upper"
):
    """
    Compare two DataFrames to find matching rows based on:
    1. Exact matches for specified columns (which can differ between DataFrames).
    2. A single prefix match for a specific column.
    Add relationship and source descriptions for gene symbols.

    :param destination_df: The DataFrame getting relationships added.
    :param relationship_df: The DataFrame as the source of the relationships.
    :param exact_cols_df1: Columns in destination_df requiring exact matches.
    :param exact_cols_df2: Columns in relationship_df requiring exact matches.
    :param prefix_col_df1: Column in destination_df for prefix matching.
    :param prefix_col_df2: Column in relationship_df for prefix matching.
    :param relationship_val: The relationship to assign if a match is found.
    :param source_val: The source label to assign if a match is found.
    :param combine_with: Optionally combine with an existing DataFrame.
    :param suffix: Suffix for new uppercase columns.
    :return: DataFrame with descriptive relationship and source labels.
    """
    # Create uppercase columns for merging, named with the `suffix` argument
    for df, cols in [(destination_df, exact_cols_df1 + [prefix_col_df1]), (relationship_df, exact_cols_df2 + [prefix_col_df2])]:
        df[[col + suffix for col in cols]] = df[cols].apply(lambda x: x.str.upper())
    
    # Merge DataFrames on exact columns
    merged_df = destination_df.merge(
        relationship_df,
        left_on=[col + suffix for col in exact_cols_df1],
        right_on=[col + suffix for col in exact_cols_df2],
        how="left",
        indicator=True,
    )

    # Identify prefix matches and assign relationships
    merged_df["Prefix Match"] = merged_df.apply(
        lambda row: row[prefix_col_df1 + suffix].startswith(
            row[prefix_col_df2 + suffix]
        ) if pd.notna(row[prefix_col_df2 + suffix]) else False,
        axis=1
    )

    # Assign relationships and sources
    merged_df["relationship"] = merged_df.apply(
        lambda row: relationship_val if row["_merge"] == "both" and row["Prefix Match"] else "", axis=1
    )
    merged_df["source"] = merged_df.apply(
        lambda row: source_val if row["_merge"] == "both" and row["Prefix Match"] else "", axis=1
    )

    # Drop extra columns and remove duplicates
    merged_df.drop(
        columns=[col + suffix for col in exact_cols_df1 + exact_cols_df2 + [prefix_col_df1, prefix_col_df2]] 
                + ["_merge", "Prefix Match"] 
                + relationship_df.columns.tolist(),
        inplace=True, errors='ignore'
    )
    merged_df.drop_duplicates(inplace=True)

    destination_df.drop(
        columns=[col + suffix for col in exact_cols_df1 + [prefix_col_df1]], 
        inplace=True, 
        errors='ignore')
    
    relationship_df.drop(
    columns=[col + suffix for col in exact_cols_df2 + [prefix_col_df2]], 
    inplace=True, 
    errors='ignore')

    # Optionally combine with an existing DataFrame
    if combine_with is not None:
        combined_df = pd.concat([combine_with, merged_df], ignore_index=True).drop_duplicates()
        return combined_df

    return merged_df
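The prefix variant can be pictured as an exact-match join followed by a `startswith` test. The sketch below shows that idea on toy data, assuming a hypothetical table of root symbols keyed by a concept identifier; the column names, identifiers, symbols, and the `"Root Symbol"` label are invented. Unlike the function above, this sketch also tolerates missing symbols on the destination side.

```python
import pandas as pd

# Toy data: identifiers, symbols, and column names are made up
genes = pd.DataFrame(
    {
        "concept_id": ["ID:1", "ID:1", "ID:2"],
        "symbol": ["abc1", "XYZ9", "ABC7"],
    }
)
roots = pd.DataFrame({"root_concept_id": ["ID:1"], "root_symbol": ["ABC"]})

# Step 1: exact match on the identifier columns
merged = genes.merge(
    roots, left_on="concept_id", right_on="root_concept_id", how="left"
)

# Step 2: case-insensitive prefix test; missing values never match
prefix_match = [
    isinstance(sym, str)
    and isinstance(root, str)
    and sym.upper().startswith(root.upper())
    for sym, root in zip(merged["symbol"], merged["root_symbol"])
]

merged["relationship"] = ["Root Symbol" if hit else "" for hit in prefix_match]
print(merged[["concept_id", "symbol", "relationship"]])
```

Only the first row is labeled: the second row matches the identifier but not the prefix, and the third has the prefix but no identifier match.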



## <a id='toc1_3_'></a>[Download gene records](#toc0_)

### <a id='toc1_3_1_'></a>[Ensembl gene records](#toc0_)

In [99]:
mini_ensg_df = pd.read_csv(
    "../input/ensg_biomart_gene20240626.txt", sep="\t",dtype={"NCBI gene (formerly Entrezgene) ID": pd.Int64Dtype()}
)
mini_ensg_df = mini_ensg_df.rename(
    columns={
        "HGNC ID": "HGNC_ID",
        "Gene Synonym": "alias_symbol",
        "Gene name": "primary_gene_symbol",
        "Gene stable ID": "ENSG_ID",
        "NCBI gene (formerly Entrezgene) ID": "NCBI_ID",
    }
)
mini_ensg_df.head()

Unnamed: 0,ENSG_ID,primary_gene_symbol,alias_symbol,HGNC_ID,NCBI_ID
0,ENSG00000210049,MT-TF,MTTF,HGNC:7481,
1,ENSG00000210049,MT-TF,TRNF,HGNC:7481,
2,ENSG00000211459,MT-RNR1,12S,HGNC:7470,
3,ENSG00000211459,MT-RNR1,MOTS-C,HGNC:7470,
4,ENSG00000211459,MT-RNR1,MTRNR1,HGNC:7470,


#### <a id='toc1_3_1_1_'></a>[Combine symbols (primary and aliases) into one column](#toc0_)

This creates a generic symbol column that holds every symbol (primary and alias) associated with a gene concept.

In [100]:
ensg_combined_symbols_df = combine_columns(mini_ensg_df, ['primary_gene_symbol','alias_symbol'], ['primary_gene_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], "symbol", "alias_symbol")
ensg_combined_symbols_df.head()

Unnamed: 0,HGNC_ID,NCBI_ID,primary_gene_symbol,ENSG_ID,symbol
0,HGNC:7481,,MT-TF,ENSG00000210049,MT-TF
2,HGNC:7470,,MT-RNR1,ENSG00000211459,MT-RNR1
5,HGNC:7500,,MT-TV,ENSG00000210077,MT-TV
7,HGNC:7471,,MT-RNR2,ENSG00000210082,MT-RNR2
10,HGNC:7490,,MT-TL1,ENSG00000209082,MT-TL1


#### <a id='toc1_3_1_2_'></a>[Add "GENE ID:" as a prefix for NCBI IDs](#toc0_)

The prefix indicates that the number is specifically an NCBI identifier.

In [101]:
ensg_combined_symbols_df["NCBI_ID"] = ensg_combined_symbols_df["NCBI_ID"].apply(
    lambda x: f"GENE ID:{int(x)}" if pd.notna(x) and x == int(x) else f"GENE ID:{x}" if pd.notna(x) else x
)
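The same prefixing can probably be written without a row-wise `apply` by casting the nullable integer column to pandas' string dtype first. This is a sketch under the assumption that the column holds `Int64` values, as requested in the `read_csv` call above; missing IDs stay missing.

```python
import pandas as pd

# Toy column of nullable integer IDs (718 is the C3 gene ID shown in this notebook)
ncbi_ids = pd.Series([718, pd.NA, 1], dtype="Int64")

# Casting to the string dtype keeps <NA> as missing, so only real IDs get the prefix
prefixed = "GENE ID:" + ncbi_ids.astype("string")
print(prefixed)
```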

In [102]:
ensg_combined_symbols_df.loc[ensg_combined_symbols_df["primary_gene_symbol"]== "C3"]

Unnamed: 0,HGNC_ID,NCBI_ID,primary_gene_symbol,ENSG_ID,symbol
90547,HGNC:1318,GENE ID:718,C3,ENSG00000125730,C3
207687,HGNC:1318,GENE ID:718,C3,ENSG00000125730,ARMD9
207688,HGNC:1318,GENE ID:718,C3,ENSG00000125730,C3A
207689,HGNC:1318,GENE ID:718,C3,ENSG00000125730,C3B
207690,HGNC:1318,GENE ID:718,C3,ENSG00000125730,CPAMD1


Make a table for concordance analysis

In [103]:
ensg_combined_symbols_df.to_csv(
    "../output/ensg_combined_symbols_df.csv", index=True
)

#### <a id='toc1_3_1_3_'></a>[Combine concept ids( HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc0_)

In [104]:
ensg_combined_concept_ids_df = combine_columns(ensg_combined_symbols_df, ["HGNC_ID", "ENSG_ID", "NCBI_ID"], ['primary_gene_symbol',"symbol"], "concept_id", ["HGNC_ID", "ENSG_ID", "NCBI_ID"])
ensg_combined_concept_ids_df.head()

Unnamed: 0,symbol,primary_gene_symbol,concept_id
0,MT-TF,MT-TF,HGNC:7481
1,MT-RNR1,MT-RNR1,HGNC:7470
2,MT-TV,MT-TV,HGNC:7500
3,MT-RNR2,MT-RNR2,HGNC:7471
4,MT-TL1,MT-TL1,HGNC:7490


In [105]:
ensg_combined_concept_ids_df.to_csv(
    "../output/ensg_combined_concept_ids_df.csv", index=True
)

Create a table for Cytoscape

In [106]:
ensg_combined_hgnc_ncbi_concept_ids_df = combine_columns(ensg_combined_symbols_df, ["HGNC_ID", "NCBI_ID"], ["symbol","ENSG_ID"], "xref_concept_id", ["HGNC_ID", "NCBI_ID"])
ensg_combined_symbol_xref_concept_ids_df = combine_columns(ensg_combined_hgnc_ncbi_concept_ids_df, ["symbol", "xref_concept_id"], ["ENSG_ID"], "target", ["symbol", "xref_concept_id"])
ensg_combined_symbol_xref_concept_ids_df["source"] = ensg_combined_symbol_xref_concept_ids_df["ENSG_ID"]
ensg_combined_symbol_xref_concept_ids_df = ensg_combined_symbol_xref_concept_ids_df.drop(columns =["ENSG_ID"])
ensg_combined_symbol_xref_concept_ids_df.loc[ensg_combined_symbol_xref_concept_ids_df["source"]== "ENSG00000125730"]

Unnamed: 0,target,source
54793,C3,ENSG00000125730
139127,ARMD9,ENSG00000125730
139128,C3A,ENSG00000125730
139129,C3B,ENSG00000125730
139130,CPAMD1,ENSG00000125730
360628,HGNC:1318,ENSG00000125730
505628,GENE ID:718,ENSG00000125730
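The edge list above pairs each Ensembl ID (`source`) with every symbol and cross-referenced identifier (`target`). One possible way to sketch the same shape in a single step is with `DataFrame.melt`; the identifiers and symbols below are invented, and this is an illustration rather than the code used in this notebook.

```python
import pandas as pd

# Toy records: all identifiers and symbols are made up
records = pd.DataFrame(
    {
        "ENSG_ID": ["ENSG_TOY_1", "ENSG_TOY_1"],
        "symbol": ["GENE1", "G1A"],
        "HGNC_ID": ["HGNC:0", "HGNC:0"],
        "NCBI_ID": ["GENE ID:0", "GENE ID:0"],
    }
)

# Every non-ENSG column becomes a target node linked to its ENSG source node
edges = (
    records.melt(id_vars="ENSG_ID", value_name="target")
    .rename(columns={"ENSG_ID": "source"})
    .loc[:, ["target", "source"]]
    .dropna(subset=["target"])
    .drop_duplicates()
    .reset_index(drop=True)
)
print(edges)
```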


### <a id='toc1_3_2_'></a>[HGNC gene records](#toc0_)

In [107]:
mini_hgnc_df = pd.read_csv(
    "../input/hgnc_biomart_gene20240626.txt", sep="\t",dtype={"NCBI gene ID": pd.Int64Dtype()}
)
mini_hgnc_df = mini_hgnc_df.rename(
    columns={
        "HGNC ID": "HGNC_ID",
        "Approved symbol": "primary_gene_symbol",
        "Alias symbol": "alias_symbol",
        "Ensembl gene ID": "ENSG_ID",
        "NCBI gene ID": "NCBI_ID",
    }
)
mini_hgnc_df.head()

Unnamed: 0,HGNC_ID,alias_symbol,NCBI_ID,ENSG_ID,primary_gene_symbol
0,HGNC:5,,1,ENSG00000121410,A1BG
1,HGNC:37133,FLJ23569,503538,ENSG00000268895,A1BG-AS1
2,HGNC:24086,ACF,29974,ENSG00000148584,A1CF
3,HGNC:24086,ASP,29974,ENSG00000148584,A1CF
4,HGNC:24086,ACF64,29974,ENSG00000148584,A1CF


#### <a id='toc1_3_2_1_'></a>[Combine symbols (primary and aliases) into one column](#toc0_)

This creates a generic symbol column that holds every symbol (primary and alias) associated with a gene concept.

In [108]:
hgnc_combined_symbols_df = combine_columns(mini_hgnc_df, ['primary_gene_symbol','alias_symbol'], ['primary_gene_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], "symbol", "alias_symbol")
hgnc_combined_symbols_df.head()

Unnamed: 0,HGNC_ID,NCBI_ID,primary_gene_symbol,ENSG_ID,symbol
0,HGNC:5,1,A1BG,ENSG00000121410,A1BG
1,HGNC:37133,503538,A1BG-AS1,ENSG00000268895,A1BG-AS1
2,HGNC:24086,29974,A1CF,ENSG00000148584,A1CF
7,HGNC:7,2,A2M,ENSG00000175899,A2M
10,HGNC:27057,144571,A2M-AS1,ENSG00000245105,A2M-AS1


#### <a id='toc1_3_2_2_'></a>[Add "GENE ID:" as a prefix for NCBI IDs](#toc0_)

The prefix indicates that the number is specifically an NCBI identifier.

In [109]:
hgnc_combined_symbols_df["NCBI_ID"] = hgnc_combined_symbols_df["NCBI_ID"].apply(
    lambda x: f"GENE ID:{int(x)}" if pd.notna(x) and x == int(x) else f"GENE ID:{x}" if pd.notna(x) else x
)
hgnc_combined_symbols_df.head()

Unnamed: 0,HGNC_ID,NCBI_ID,primary_gene_symbol,ENSG_ID,symbol
0,HGNC:5,GENE ID:1,A1BG,ENSG00000121410,A1BG
1,HGNC:37133,GENE ID:503538,A1BG-AS1,ENSG00000268895,A1BG-AS1
2,HGNC:24086,GENE ID:29974,A1CF,ENSG00000148584,A1CF
7,HGNC:7,GENE ID:2,A2M,ENSG00000175899,A2M
10,HGNC:27057,GENE ID:144571,A2M-AS1,ENSG00000245105,A2M-AS1


In [110]:
hgnc_combined_symbols_df.loc[hgnc_combined_symbols_df["primary_gene_symbol"] == "C3"]

Unnamed: 0,HGNC_ID,NCBI_ID,primary_gene_symbol,ENSG_ID,symbol
5028,HGNC:1318,GENE ID:718,C3,ENSG00000125730,C3
72611,HGNC:1318,GENE ID:718,C3,ENSG00000125730,CPAMD1
72612,HGNC:1318,GENE ID:718,C3,ENSG00000125730,ARMD9
72613,HGNC:1318,GENE ID:718,C3,ENSG00000125730,C3a
72614,HGNC:1318,GENE ID:718,C3,ENSG00000125730,C3b


Make a table for concordance analysis

In [111]:
hgnc_combined_symbols_df.to_csv(
    "../output/hgnc_combined_symbols_df.csv", index=True
)

#### <a id='toc1_3_2_3_'></a>[Combine concept ids( HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc0_)

In [112]:
hgnc_combined_concept_ids_df = combine_columns(hgnc_combined_symbols_df, ["HGNC_ID", "ENSG_ID", "NCBI_ID"], ['primary_gene_symbol',"symbol"], "concept_id", ["HGNC_ID", "ENSG_ID", "NCBI_ID"])
hgnc_combined_concept_ids_df.head()

Unnamed: 0,symbol,primary_gene_symbol,concept_id
0,A1BG,A1BG,HGNC:5
1,A1BG-AS1,A1BG-AS1,HGNC:37133
2,A1CF,A1CF,HGNC:24086
3,A2M,A2M,HGNC:7
4,A2M-AS1,A2M-AS1,HGNC:27057


In [113]:
hgnc_combined_concept_ids_df.to_csv(
    "../output/hgnc_combined_concept_ids_df.csv", index=True
)

Create a table for Cytoscape

In [114]:
hgnc_combined_hgnc_ncbi_concept_ids_df = combine_columns(hgnc_combined_symbols_df, ["ENSG_ID", "NCBI_ID"], ["symbol","HGNC_ID"], "xref_concept_id", ["ENSG_ID", "NCBI_ID"])
hgnc_combined_hgnc_ncbi_concept_ids_df.head()
hgnc_combined_symbol_xref_concept_ids_df = combine_columns(hgnc_combined_hgnc_ncbi_concept_ids_df, ["symbol", "xref_concept_id"], ["HGNC_ID"], "target", ["symbol", "xref_concept_id"])
hgnc_combined_symbol_xref_concept_ids_df["source"] = hgnc_combined_symbol_xref_concept_ids_df["HGNC_ID"]
hgnc_combined_symbol_xref_concept_ids_df = hgnc_combined_symbol_xref_concept_ids_df.drop(columns =["HGNC_ID"])
hgnc_combined_symbol_xref_concept_ids_df.loc[hgnc_combined_symbol_xref_concept_ids_df["source"]== "HGNC:1318"]

Unnamed: 0,target,source
3034,C3,HGNC:1318
50674,CPAMD1,HGNC:1318
50675,ARMD9,HGNC:1318
50676,C3a,HGNC:1318
50677,C3b,HGNC:1318
225709,ENSG00000125730,HGNC:1318
338838,GENE ID:718,HGNC:1318


### <a id='toc1_3_3_'></a>[NCBI gene records](#toc0_)

In [115]:
mini_ncbi_df = pd.read_csv("../input/Homo_sapiens.gene_info20240627", sep="\t")

In [116]:
mini_ncbi_df = mini_ncbi_df[
["GeneID", "Symbol", "Synonyms", "dbXrefs"]
]
mini_ncbi_df = mini_ncbi_df.rename(
    columns={"GeneID": "NCBI_ID", "Symbol": "primary_gene_symbol", "Synonyms": "alias_symbol"}
)
mini_ncbi_df

Unnamed: 0,NCBI_ID,primary_gene_symbol,alias_symbol,dbXrefs
0,1,A1BG,A1B|ABG|GAB|HYST2477,MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410...
1,2,A2M,A2MD|CPAMD5|FWP007|S863-7,MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899...
2,3,A2MP1,A2MP,HGNC:HGNC:8|Ensembl:ENSG00000291190|AllianceGe...
3,9,NAT1,AAC1|MNAT|NAT-1|NATI,MIM:108345|HGNC:HGNC:7645|Ensembl:ENSG00000171...
4,10,NAT2,AAC2|NAT-2|PNAT,MIM:612182|HGNC:HGNC:7646|Ensembl:ENSG00000156...
...,...,...,...,...
193451,8923215,trnD,-,-
193452,8923216,trnP,-,-
193453,8923217,trnA,-,-
193454,8923218,COX1,-,-


#### <a id='toc1_3_3_1_'></a>[Extract necessary cross references](#toc0_)

In [117]:
mini_ncbi_df = mini_ncbi_df.assign(
    MIM=np.nan,
    HGNC_ID=np.nan,
    ENSG_ID=np.nan,
    AllianceGenome=np.nan,
    MIRbase=np.nan,
    IMGTgene_db=np.nan,
    dash=np.nan,
    unknown=np.nan,
)

In [118]:
# Cast the placeholder columns to object so they can hold string values
xref_cols = ["MIM", "HGNC_ID", "ENSG_ID", "AllianceGenome", "MIRbase", "IMGTgene_db", "dash", "unknown"]
mini_ncbi_df[xref_cols] = mini_ncbi_df[xref_cols].astype(object)

index_pos = 0

print(len(mini_ncbi_df))
while index_pos < len(mini_ncbi_df):
    xrefs = mini_ncbi_df.loc[index_pos, "dbXrefs"].split("|")

    for xref in xrefs:
        xref = xref.lower()
        # Assign with .loc so each value is written to mini_ncbi_df itself,
        # not to a temporary copy (chained assignment)
        if xref.startswith("mim:"):
            xref = xref.replace("mim:", "")
            mini_ncbi_df.loc[index_pos, "MIM"] = xref
        elif xref.startswith("hgnc:hgnc:"):
            xref = xref.replace("hgnc:hgnc:", "")
            mini_ncbi_df.loc[index_pos, "HGNC_ID"] = xref
        elif xref.startswith("ensembl:"):
            xref = xref.replace("ensembl:", "")
            mini_ncbi_df.loc[index_pos, "ENSG_ID"] = xref
        elif xref.startswith("alliancegenome:"):
            xref = xref.replace("alliancegenome:", "")
            mini_ncbi_df.loc[index_pos, "AllianceGenome"] = xref
        elif xref.startswith("mirbase"):
            xref = xref.replace("mirbase:", "")
            mini_ncbi_df.loc[index_pos, "MIRbase"] = xref
        elif xref.startswith("imgt/gene-db:"):
            xref = xref.replace("imgt/gene-db:", "")
            mini_ncbi_df.loc[index_pos, "IMGTgene_db"] = xref
        elif xref.startswith("-"):
            mini_ncbi_df.loc[index_pos, "dash"] = xref
        else:
            mini_ncbi_df.loc[index_pos, "unknown"] = xref

    index_pos += 1

print(index_pos)

193456

193456
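For a table of this size, a vectorized extraction may be considerably faster than a row-by-row loop. The sketch below pulls only the HGNC and Ensembl cross-references with `Series.str.extract`. The input strings are abbreviated toy values modeled on the `dbXrefs` format shown above, and the patterns are case-sensitive, which is an assumption about the raw file rather than something the loop relies on.

```python
import pandas as pd

# Abbreviated toy dbXrefs strings in the pipe-delimited format shown above
xrefs = pd.Series(
    [
        "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410",
        "HGNC:HGNC:8|Ensembl:ENSG00000291190",
        "-",
    ]
)

# One capture group per cross-reference; rows without a match become NaN
parsed = pd.DataFrame(
    {
        "HGNC_ID": xrefs.str.extract(r"HGNC:HGNC:(\d+)", expand=False),
        "ENSG_ID": xrefs.str.extract(r"Ensembl:(ENSG\d+)", expand=False),
    }
)
print(parsed)
```

With this approach the Ensembl IDs keep their original casing, so no later step is needed to restore the "ENSG" prefix.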


#### <a id='toc1_3_3_2_'></a>[Add "ENSG" as a prefix for Ensembl IDs](#toc0_)

The cross-references were lowercased during extraction, so this step restores the uppercase "ENSG" prefix that marks the value as an Ensembl identifier.

In [119]:
mini_ncbi_df["ENSG_ID"] = mini_ncbi_df["ENSG_ID"].str.replace("ensg", "ENSG", 1)

In [120]:
mini_ncbi_df = mini_ncbi_df.drop(
    [
        "AllianceGenome",
        "MIRbase",
        "IMGTgene_db",
        "dash",
        "unknown",
        "dbXrefs",
        "MIM",
    ],
    axis=1,
)
mini_ncbi_df

Unnamed: 0,NCBI_ID,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B|ABG|GAB|HYST2477,5,ENSG00000121410
1,2,A2M,A2MD|CPAMD5|FWP007|S863-7,7,ENSG00000175899
2,3,A2MP1,A2MP,8,ENSG00000291190
3,9,NAT1,AAC1|MNAT|NAT-1|NATI,7645,ENSG00000171428
4,10,NAT2,AAC2|NAT-2|PNAT,7646,ENSG00000156006
...,...,...,...,...,...
193451,8923215,trnD,-,,
193452,8923216,trnP,-,,
193453,8923217,trnA,-,,
193454,8923218,COX1,-,,


#### <a id='toc1_3_3_3_'></a>[Split aliases to one per row](#toc0_)

In [121]:
mini_ncbi_df['alias_symbol'] = mini_ncbi_df['alias_symbol'].str.split('|')
mini_ncbi_df = mini_ncbi_df.explode('alias_symbol')
mini_ncbi_df

Unnamed: 0,NCBI_ID,primary_gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B,5,ENSG00000121410
0,1,A1BG,ABG,5,ENSG00000121410
0,1,A1BG,GAB,5,ENSG00000121410
0,1,A1BG,HYST2477,5,ENSG00000121410
1,2,A2M,A2MD,7,ENSG00000175899
...,...,...,...,...,...
193451,8923215,trnD,-,,
193452,8923216,trnP,-,,
193453,8923217,trnA,-,,
193454,8923218,COX1,-,,
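The split-then-explode pattern is easier to see on a small table. In this toy sketch (invented symbols), a pipe-delimited alias string becomes one row per alias, while rows with the `-` placeholder pass through unchanged.

```python
import pandas as pd

# Toy input: pipe-delimited aliases, with "-" marking a gene without aliases
toy = pd.DataFrame(
    {
        "primary_gene_symbol": ["GENE1", "GENE2"],
        "alias_symbol": ["G1A|G1B", "-"],
    }
)

# Split each string into a list, then give every list element its own row
toy["alias_symbol"] = toy["alias_symbol"].str.split("|")
exploded = toy.explode("alias_symbol")
print(exploded)
```

Note that `explode` repeats the original index label for every row produced from the same record, which is why the output above shows index `0` several times.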


#### <a id='toc1_3_3_4_'></a>[Combine symbols (primary and aliases) into one column](#toc0_)

This creates a generic symbol column that holds every symbol (primary and alias) associated with a gene concept.

In [122]:
ncbi_combined_symbols_df = combine_columns(mini_ncbi_df, ['primary_gene_symbol','alias_symbol'], ['primary_gene_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], "symbol", "alias_symbol")
ncbi_combined_symbols_df.head()

Unnamed: 0,HGNC_ID,NCBI_ID,primary_gene_symbol,ENSG_ID,symbol
0,5,1,A1BG,ENSG00000121410,A1BG
4,7,2,A2M,ENSG00000175899,A2M
8,8,3,A2MP1,ENSG00000291190,A2MP1
9,7645,9,NAT1,ENSG00000171428,NAT1
13,7646,10,NAT2,ENSG00000156006,NAT2


#### <a id='toc1_3_3_5_'></a>[Add "GENE ID:" as a prefix for NCBI IDs](#toc0_)

In [123]:
ncbi_combined_symbols_df["NCBI_ID"] = ncbi_combined_symbols_df["NCBI_ID"].apply(
    lambda x: f"GENE ID:{int(x)}" if pd.notna(x) and x == int(x) else f"GENE ID:{x}" if pd.notna(x) else x
)

#### <a id='toc1_3_3_6_'></a>[Add "HGNC:" as a prefix for HGNC IDs](#toc0_)

In [124]:
ncbi_combined_symbols_df["HGNC_ID"] = ncbi_combined_symbols_df["HGNC_ID"].apply(
    lambda x: f"HGNC:{int(x)}" if pd.notna(x) and x == int(x) else f"HGNC:{x}" if pd.notna(x) else x
)

In [125]:
ncbi_combined_symbols_df.loc[ncbi_combined_symbols_df["primary_gene_symbol"]== "C3"]

Unnamed: 0,HGNC_ID,NCBI_ID,primary_gene_symbol,ENSG_ID,symbol
2043,HGNC:1318,GENE ID:718,C3,ENSG00000125730,C3
241972,HGNC:1318,GENE ID:718,C3,ENSG00000125730,AHUS5
241973,HGNC:1318,GENE ID:718,C3,ENSG00000125730,ARMD9
241974,HGNC:1318,GENE ID:718,C3,ENSG00000125730,ASP
241975,HGNC:1318,GENE ID:718,C3,ENSG00000125730,C3a
241976,HGNC:1318,GENE ID:718,C3,ENSG00000125730,C3b
241977,HGNC:1318,GENE ID:718,C3,ENSG00000125730,CPAMD1
241978,HGNC:1318,GENE ID:718,C3,ENSG00000125730,HEL-S-62p


In [126]:
ncbi_combined_symbols_df.to_csv(
    "../output/ncbi_combined_symbols_df.csv", index=True
)

#### <a id='toc1_3_3_7_'></a>[Combine concept ids( HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc0_)

In [127]:
ncbi_combined_concept_ids_df = combine_columns(ncbi_combined_symbols_df, ["HGNC_ID", "ENSG_ID", "NCBI_ID"], ['primary_gene_symbol',"symbol"], "concept_id", ["HGNC_ID", "ENSG_ID", "NCBI_ID"])
ncbi_combined_concept_ids_df.head()

Unnamed: 0,symbol,primary_gene_symbol,concept_id
0,A1BG,A1BG,HGNC:5
1,A2M,A2M,HGNC:7
2,A2MP1,A2MP1,HGNC:8
3,NAT1,NAT1,HGNC:7645
4,NAT2,NAT2,HGNC:7646


#### <a id='toc1_3_3_8_'></a>[Remove records with no associated gene symbols](#toc0_)

In [128]:
ncbi_combined_concept_ids_df.loc[ncbi_combined_concept_ids_df["symbol"] == "-"]

Unnamed: 0,symbol,primary_gene_symbol,concept_id
193480,-,AAMP,HGNC:18
193580,-,ACAT2,HGNC:94
193595,-,ACLS,
193655,-,ACTBP2,HGNC:135
193656,-,ACTBP3,HGNC:136
...,...,...,...
1300150,-,trnD,GENE ID:8923215
1300151,-,trnP,GENE ID:8923216
1300152,-,trnA,GENE ID:8923217
1300153,-,COX1,GENE ID:8923218


In [129]:
ncbi_combined_concept_ids_df.to_csv(
    "../output/ncbi_combined_concept_ids_df.csv", index=True
)

Create a table for Cytoscape

In [130]:
ncbi_combined_hgnc_ncbi_concept_ids_df = combine_columns(ncbi_combined_symbols_df, ["ENSG_ID", "HGNC_ID"], ["symbol","NCBI_ID"], "xref_concept_id", ["ENSG_ID", "HGNC_ID"])
ncbi_combined_hgnc_ncbi_concept_ids_df.head()
ncbi_combined_symbol_xref_concept_ids_df = combine_columns(ncbi_combined_hgnc_ncbi_concept_ids_df, ["symbol", "xref_concept_id"], ["NCBI_ID"], "target", ["symbol", "xref_concept_id"])
ncbi_combined_symbol_xref_concept_ids_df["source"] = ncbi_combined_symbol_xref_concept_ids_df["NCBI_ID"]
ncbi_combined_symbol_xref_concept_ids_df = ncbi_combined_symbol_xref_concept_ids_df.drop(columns =["NCBI_ID"])
ncbi_combined_symbol_xref_concept_ids_df.loc[ncbi_combined_symbol_xref_concept_ids_df["source"]== "GENE ID:718"]

Unnamed: 0,target,source
586,C3,GENE ID:718
195499,AHUS5,GENE ID:718
195500,ARMD9,GENE ID:718
195501,ASP,GENE ID:718
195502,C3a,GENE ID:718
195503,C3b,GENE ID:718
195504,CPAMD1,GENE ID:718
195505,HEL-S-62p,GENE ID:718
572761,ENSG00000125730,GENE ID:718
1006127,HGNC:1318,GENE ID:718


## <a id='toc1_4_'></a>[Combine gene records from Ensembl, HGNC, and NCBI](#toc0_)

#### <a id='toc1_4_1_1_'></a>[Dropped duplicate rows and rows with no gene symbols.](#toc0_)

In [131]:
genes_df = pd.concat([ensg_combined_concept_ids_df, hgnc_combined_concept_ids_df, ncbi_combined_concept_ids_df], ignore_index=True)
genes_df.drop_duplicates(inplace=True)
genes_df = genes_df.dropna(subset=["symbol","primary_gene_symbol"])
genes_df = genes_df[~genes_df['symbol'].isin(["-", "", " "])].copy()
genes_df

Unnamed: 0,symbol,primary_gene_symbol,concept_id
0,MT-TF,MT-TF,HGNC:7481
1,MT-RNR1,MT-RNR1,HGNC:7470
2,MT-TV,MT-TV,HGNC:7500
3,MT-RNR2,MT-RNR2,HGNC:7471
4,MT-TL1,MT-TL1,HGNC:7490
...,...,...,...
1679976,AANCR,LOC127891700,GENE ID:127891700
1686878,PBEF,NAMPT-AS1,GENE ID:128266843
1687669,DUSP13,DUSP13A,GENE ID:128854680
1737082,NPIPA5L,NPIPA6,GENE ID:131675794
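The effect of the three cleanup steps (drop duplicates, drop missing symbols, drop placeholder symbols) can be checked on a small example. The two toy tables below stand in for per-source tables; all symbols and identifiers are invented.

```python
import pandas as pd

# Toy per-source tables with a shared row, a "-" placeholder, and a missing symbol
source_a = pd.DataFrame(
    {
        "symbol": ["GENE1", "G1A"],
        "primary_gene_symbol": ["GENE1", "GENE1"],
        "concept_id": ["HGNC:0", "HGNC:0"],
    }
)
source_b = pd.DataFrame(
    {
        "symbol": ["GENE1", "-", None],
        "primary_gene_symbol": ["GENE1", "GENE1", "GENE1"],
        "concept_id": ["HGNC:0", "GENE ID:0", "GENE ID:0"],
    }
)

combined = pd.concat([source_a, source_b], ignore_index=True).drop_duplicates()
combined = combined.dropna(subset=["symbol", "primary_gene_symbol"])
combined = combined[~combined["symbol"].isin(["-", "", " "])].copy()
print(combined)
```

Of the five input rows, only two remain: the repeated row, the `-` placeholder, and the missing symbol are all removed.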


In [132]:
genes_df.loc[genes_df["primary_gene_symbol"] == "C3"]

Unnamed: 0,symbol,primary_gene_symbol,concept_id
31739,C3,C3,HGNC:1318
97785,ARMD9,C3,HGNC:1318
97786,C3A,C3,HGNC:1318
97787,C3B,C3,HGNC:1318
97788,CPAMD1,C3,HGNC:1318
170125,C3,C3,ENSG00000125730
254456,ARMD9,C3,ENSG00000125730
254457,C3A,C3,ENSG00000125730
254458,C3B,C3,ENSG00000125730
254459,CPAMD1,C3,ENSG00000125730


Create a table for Cytoscape

In [133]:
cytoscape_genes_df = pd.concat([ensg_combined_symbol_xref_concept_ids_df, hgnc_combined_symbol_xref_concept_ids_df, ncbi_combined_symbol_xref_concept_ids_df], ignore_index=True)
cytoscape_genes_df.drop_duplicates(inplace=True)
cytoscape_genes_df = cytoscape_genes_df.dropna(subset=["source","target"])
cytoscape_genes_df = cytoscape_genes_df[~cytoscape_genes_df['target'].isin(["-", "", " "])].copy()
cytoscape_genes_df.loc[cytoscape_genes_df["source"].isin(["ENSG00000125730","HGNC:1318","GENE ID:718"])]

Unnamed: 0,target,source
54792,C3,ENSG00000125730
139123,ARMD9,ENSG00000125730
139124,C3A,ENSG00000125730
139125,C3B,ENSG00000125730
139126,CPAMD1,ENSG00000125730
194947,HGNC:1318,ENSG00000125730
236618,GENE ID:718,ENSG00000125730
249742,C3,HGNC:1318
297382,CPAMD1,HGNC:1318
297383,ARMD9,HGNC:1318


In [134]:
cytoscape_genes_df.to_csv('../output/cytoscape_genes_df.csv', index=True)

## <a id='toc1_5_'></a>[Add Ortholog relationship label](#toc0_)

### <a id='toc1_5_1_'></a>[Import ortholog sets](#toc0_)

The ortholog sets were created in symbol_capture_generation_df.ipynb.

In [135]:
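folder_path = "../input/"
num_files = 10

# Paths to the ortholog set files: ortholog_set_1_df.txt ... ortholog_set_10_df.txt
file_names = [os.path.join(folder_path, f'ortholog_set_{i}_df.txt') for i in range(1, num_files + 1)]

# Load each file into a dict keyed by table name
ortholog_set_dfs = {}

for i, file_name in enumerate(file_names, start=1):
    df = pd.read_csv(file_name, index_col=None)
    ortholog_set_dfs[f'ortholog_set_{i}_df'] = df
    # Also expose each table as a notebook-level variable (ortholog_set_1_df, ...)
    globals()[f'ortholog_set_{i}_df'] = df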

### <a id='toc1_5_2_'></a>[Combine all species orthologs into one table](#toc0_)

Combine the species columns in each table into a single `Ortholog` column, then combine all tables into one table.

TODO: Add qualifier and qualifier value columns to indicate species. These could also be used to indicate disease later.

In [136]:
combined_ortholog_set_df = combine_columns_except_multiple(ortholog_set_dfs, 'Ortholog', ["Gene name","Gene stable ID"])
combined_ortholog_set_df

Unnamed: 0,Gene stable ID,Gene name,Ortholog
769,ENSG00000251925,SNORA70,SNORA70
815,ENSG00000251796,SNORA70,SNORA70
1281,ENSG00000027001,MIPEP,MIPEP
1297,ENSG00000102753,KPNA3,KPNA3
1303,ENSG00000165475,CRYL1,CRYL1
...,...,...,...
1174150,ENSG00000186115,CYP4F2,CYP4F162
1174151,ENSG00000186115,CYP4F2,CYP4F155
1174152,ENSG00000186115,CYP4F2,CYP4F163
1174219,ENSG00000134716,CYP2J2,CYP2J89

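A minimal sketch of the wide-to-long reshaping described above, using `DataFrame.melt`. This is not the notebook's `combine_columns_except_multiple` function. The species column names (`Species A gene name`, `Species B gene name`) are hypothetical placeholders, and the rows are illustrative.

```python
import pandas as pd

# Hypothetical wide table: one ortholog column per species (column names are assumptions).
ortholog_set = pd.DataFrame({
    "Gene stable ID": ["ENSG00000186115", "ENSG00000134716"],
    "Gene name": ["CYP4F2", "CYP2J2"],
    "Species A gene name": ["CYP4F162", None],
    "Species B gene name": ["CYP4F155", "CYP2J89"],
})

id_cols = ["Gene stable ID", "Gene name"]

long_df = (
    ortholog_set
    .melt(id_vars=id_cols, value_name="Ortholog")  # every non-id column becomes rows
    .drop(columns="variable")                      # discard the originating column name
    .dropna(subset=["Ortholog"])                   # drop genes with no ortholog in a species
    .drop_duplicates()
)

print(long_df)
```

Several tables could be handled by melting each one this way and passing the results to `pd.concat`. Keeping the `variable` column instead of dropping it would be one way to retain the species, which relates to the TODO above.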

### <a id='toc1_5_3_'></a>[Add Ortholog label to the table of gene records](#toc0_)

A match on identifier, primary gene symbol, and alternate symbol between the gene records table and the relationship table (the ortholog table in this case) indicates that the alternate symbol is an ortholog symbol.

In [137]:
ortholog_capture_df = add_relationship_and_source_if_symbol(
    genes_df,
    combined_ortholog_set_df,
    ["concept_id", "primary_gene_symbol", "symbol"],
    ["Gene stable ID", "Gene name", "Ortholog"],
    "Ortholog Symbol",
    "Ensembl",
    combine_with=None
)
ortholog_capture_df

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
806429,AANCR,LOC127891700,GENE ID:127891700,,
806430,PBEF,NAMPT-AS1,GENE ID:128266843,,
806431,DUSP13,DUSP13A,GENE ID:128854680,,
806432,NPIPA5L,NPIPA6,GENE ID:131675794,,

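The matching rule can be sketched with a left merge and `indicator=True`. The helper `label_exact_matches` below is hypothetical and is not the notebook's `add_relationship_and_source_if_symbol`. It only illustrates tagging a row when all three key columns match a row of the relationship table. The reference rows are illustrative.

```python
import pandas as pd

def label_exact_matches(records, reference, record_cols, reference_cols,
                        relationship, source):
    """Hypothetical helper: tag rows of `records` whose key columns all
    match some row of `reference` (columns paired by position)."""
    # Unique key combinations, renamed to the record table's column names.
    keys = reference[reference_cols].drop_duplicates()
    keys = keys.rename(columns=dict(zip(reference_cols, record_cols)))
    # A left merge on unique keys keeps the row count and order of `records`.
    merged = records.merge(keys, on=record_cols, how="left", indicator=True)
    matched = (merged["_merge"] == "both").to_numpy()
    out = records.copy()
    out["relationship"] = None
    out["source"] = None
    out.loc[matched, "relationship"] = relationship
    out.loc[matched, "source"] = source
    return out

records = pd.DataFrame({
    "symbol": ["ABITRAM", "CG-8", "FAM206A"],
    "primary_gene_symbol": ["ABITRAM"] * 3,
    "concept_id": ["ENSG00000119328"] * 3,
})
reference = pd.DataFrame({
    "Gene stable ID": ["ENSG00000119328"] * 2,
    "Gene name": ["ABITRAM"] * 2,
    "Ortholog": ["ABITRAM", "FAM206A"],
})

labeled = label_exact_matches(
    records, reference,
    ["concept_id", "primary_gene_symbol", "symbol"],
    ["Gene stable ID", "Gene name", "Ortholog"],
    "Ortholog Symbol", "Ensembl",
)
print(labeled)
```

Because the identifier is part of the key, only rows whose `concept_id` uses the same identifier scheme as the relationship table (Ensembl IDs here) can match.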

## <a id='toc1_6_'></a>[Add HGNC Previous Symbol relationship label](#toc0_)

### <a id='toc1_6_1_'></a>[Import the expired HGNC symbols](#toc0_)

This uses the table generated in symbol_capture_generation.ipynb.

In [138]:
hgnc_previous_symbols_df = pd.read_hdf(
    "../output/hgnc_previous_symbols_df.h5", key='df'
    )
hgnc_previous_symbols_df

Unnamed: 0,HGNC ID,Approved symbol,previous_symbol
1,HGNC:37133,A1BG-AS1,NCRNA00181
1,HGNC:37133,A1BG-AS1,A1BGAS
1,HGNC:37133,A1BG-AS1,A1BG-AS
6,HGNC:23336,A2ML1,CPAMD9
9,HGNC:8,A2MP1,A2MP
...,...,...,...
49065,HGNC:34495,ZSWIM9,C19orf68
49066,HGNC:21224,ZUP1,C6orf113
49066,HGNC:21224,ZUP1,ZUFSP
49071,HGNC:13197,ZWS1,ZWS


### <a id='toc1_6_2_'></a>[Add HGNC Previous Symbol label to the table of gene records](#toc0_)

A match on identifier, primary gene symbol, and alternate symbol between the gene records table and the relationship table (the HGNC previous symbol table in this case) indicates that the alternate symbol is an expired symbol.

The input is ortholog_capture_df, so the expired-symbol labels are added alongside the ortholog labels to build one complete table.

In [139]:
expired_capture_df = add_relationship_and_source_if_symbol(
    ortholog_capture_df,
    hgnc_previous_symbols_df,
    ["concept_id", "primary_gene_symbol", "symbol"],
    ["HGNC ID", "Approved symbol", "previous_symbol"],
    "HGNC Previous Symbol",
    "HGNC",
    ortholog_capture_df
)
expired_capture_df

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
1376807,bHLHb9,ARMCX5-GPRASP2,ENSG00000286237,,
1377107,NPY4R,NPY4R2,ENSG00000264717,,
1377127,OPN1MW,OPN1MW3,ENSG00000269433,,
1377597,TMSB15B,TMSB15C,ENSG00000269226,,


Example of a gene concept whose alternate symbols include both ortholog symbols and expired symbols:

In [140]:
expired_capture_df.loc[expired_capture_df["primary_gene_symbol"] == "ABITRAM"]

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
11767,ABITRAM,ABITRAM,HGNC:1364,,
56017,C9ORF6,ABITRAM,HGNC:1364,,
56018,CG-8,ABITRAM,HGNC:1364,,
56019,FAM206A,ABITRAM,HGNC:1364,,
56020,FLJ20457,ABITRAM,HGNC:1364,,
56021,SIMIATE,ABITRAM,HGNC:1364,,
111432,ABITRAM,ABITRAM,ENSG00000119328,Ortholog Symbol,Ensembl
162597,C9ORF6,ABITRAM,ENSG00000119328,,
162598,CG-8,ABITRAM,ENSG00000119328,,
162599,FAM206A,ABITRAM,ENSG00000119328,Ortholog Symbol,Ensembl


## <a id='toc1_7_'></a>[Add FLJ Clone Symbol label](#toc0_)

### <a id='toc1_7_1_'></a>[Import the FLJ Clone symbols](#toc0_)

This uses the table generated in symbol_capture_generation.ipynb.

In [141]:
flj_clone_symbols_df = pd.read_hdf(
    "../output/flj_clone_symbols_df.h5", key='df'
    )
flj_clone_symbols_df

Unnamed: 0,Accesion No,ID
0,AK075326,PSEC0001
1,AK075326,FLJ91001
2,AK172724,PSEC0002
3,AK172724,FLJ91002
4,AK075327,PSEC0003
...,...,...
30581,AK057825,FLJ25096
30582,AK000479,FLJ20472
30583,AK125921,FLJ43933
30584,AK125959,FLJ43971


### <a id='toc1_7_2_'></a>[Add FLJ Clone Symbol label to the table of gene records](#toc0_)

In [142]:
flj_clone_capture_df = add_relationship_and_source_if_symbol(
    expired_capture_df,
    flj_clone_symbols_df,
    ["symbol"],
    ["ID"],
    "FLJ Clone Symbol",
    "FLJ Human cDNA Database",
    expired_capture_df
)
flj_clone_capture_df

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
1418459,FLJ45513,FLJ45513,GENE ID:729220,FLJ Clone Symbol,FLJ Human cDNA Database
1583243,PSEC0257,GOLM1,GENE ID:51280,FLJ Clone Symbol,FLJ Human cDNA Database
1585225,PSEC0146,CYSLTR2,GENE ID:57105,FLJ Clone Symbol,FLJ Human cDNA Database
1587331,PSEC0198,ARMC10,GENE ID:83787,FLJ Clone Symbol,FLJ Human cDNA Database


In [143]:
flj_clone_capture_df.loc[flj_clone_capture_df["primary_gene_symbol"] == "MIR99AHG"]

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
5714,MIR99AHG,MIR99AHG,HGNC:1274,,
48440,C21ORF34,MIR99AHG,HGNC:1274,,
48441,C21ORF35,MIR99AHG,HGNC:1274,,
48442,DILA1,MIR99AHG,HGNC:1274,,
48443,FLJ38295,MIR99AHG,HGNC:1274,,
48444,LINC00478,MIR99AHG,HGNC:1274,,
48445,MONC,MIR99AHG,HGNC:1274,,
104760,MIR99AHG,MIR99AHG,ENSG00000215386,,
154390,C21ORF34,MIR99AHG,ENSG00000215386,,
154391,C21ORF35,MIR99AHG,ENSG00000215386,,


## <a id='toc1_8_'></a>[Add Gene Family Symbol label](#toc0_)

### <a id='toc1_8_1_'></a>[Import the Gene Family Root symbols](#toc0_)

This uses the table generated in symbol_capture_generation.ipynb.

In [163]:
hgnc_gene_group_root_df = pd.read_hdf(
    "../output/hgnc_gene_group_root_df.h5", key='df'
    )
hgnc_gene_group_root_df

Unnamed: 0,HGNC ID,Approved symbol,Gene group ID,abbreviation
2,HGNC:24086,A1CF,725,RBM
12,HGNC:13666,AAAS,1051,NUP
13,HGNC:13666,AAAS,362,WDR
14,HGNC:21298,AACS,40,ACS
15,HGNC:17,AADAC,464,LIP
...,...,...,...,...
31395,HGNC:25820,ZYG11B,6,ZYG11
31396,HGNC:25820,ZYG11B,1492,ARMH
31399,HGNC:29027,ZZEF1,91,ZZZ
31401,HGNC:24523,ZZZ3,91,ZZZ


In [166]:
hgnc_gene_group_root_df.loc[hgnc_gene_group_root_df["abbreviation"]== "PCDH"]

Unnamed: 0,HGNC ID,Approved symbol,Gene group ID,abbreviation


In [146]:
gene_group_capture_df = add_relationship_and_source_if_prefix(
    flj_clone_capture_df,
    hgnc_gene_group_root_df,
    ["primary_gene_symbol","concept_id"],
    ["Approved symbol","HGNC ID"],
    "symbol",
    "abbreviation",
    "Prefix Gene Group Symbol",
    "HGNC",
    flj_clone_capture_df,
    "_upper"
)
gene_group_capture_df

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
1397814,SNORD3P2,SNORD3F,HGNC:52239,Prefix Gene Group Symbol,HGNC
1397900,SCGB1B3,SCGB1B3P,HGNC:20943,Prefix Gene Group Symbol,HGNC
1397901,LINC02815,LINC02814,HGNC:54346,Prefix Gene Group Symbol,HGNC
1397929,OR4M2,OR4M2B,HGNC:55109,Prefix Gene Group Symbol,HGNC

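A rough sketch of prefix-based labeling: a symbol is tagged when it belongs to the same gene concept as a root abbreviation and starts with that abbreviation. The helper `label_prefix_matches` is hypothetical and is not the notebook's `add_relationship_and_source_if_prefix`. The case-insensitive comparison is an assumption. The toy rows reuse the ABTB3 values shown in this section.

```python
import pandas as pd

def label_prefix_matches(records, roots, record_keys, root_keys,
                         symbol_col, root_col, relationship, source):
    """Hypothetical helper: return rows of `records` whose symbol starts
    with a root abbreviation listed for the same gene concept."""
    # Unique (gene keys, abbreviation) pairs, renamed to the record key names.
    pairs = roots[root_keys + [root_col]].dropna().drop_duplicates()
    pairs = pairs.rename(columns=dict(zip(root_keys, record_keys)))
    # One row per (record, candidate abbreviation) for the same gene concept.
    candidates = records.merge(pairs, on=record_keys, how="inner")
    # Assumed case-insensitive prefix test.
    is_prefix = [
        str(sym).upper().startswith(str(root).upper())
        for sym, root in zip(candidates[symbol_col], candidates[root_col])
    ]
    hits = candidates.loc[is_prefix, list(records.columns)].drop_duplicates().copy()
    hits["relationship"] = relationship
    hits["source"] = source
    return hits

records = pd.DataFrame({
    "symbol": ["ABTB3", "ABTB2B", "BTBD11", "FLJ33957"],
    "primary_gene_symbol": ["ABTB3"] * 4,
    "concept_id": ["HGNC:23844"] * 4,
})
roots = pd.DataFrame({
    "HGNC ID": ["HGNC:23844", "HGNC:23844"],
    "Approved symbol": ["ABTB3", "ABTB3"],
    "abbreviation": ["BTBD", "ANKRD"],
})

hits = label_prefix_matches(
    records, roots,
    ["primary_gene_symbol", "concept_id"],
    ["Approved symbol", "HGNC ID"],
    "symbol", "abbreviation",
    "Prefix Gene Group Symbol", "HGNC",
)
print(hits)
```

In this toy input only `BTBD11` starts with one of the listed abbreviations (`BTBD`). How the notebook's function merges such hits back into the full table is not shown here.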

In [147]:
gene_group_capture_df.loc[gene_group_capture_df["primary_gene_symbol"] == "ABTB3"]

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
18494,ABTB3,ABTB3,HGNC:23844,,
65024,ABTB2B,ABTB3,HGNC:23844,,
65025,BTBD11,ABTB3,HGNC:23844,,
65026,FLJ33957,ABTB3,HGNC:23844,,
119610,ABTB3,ABTB3,ENSG00000151136,Ortholog Symbol,Ensembl
174548,ABTB2B,ABTB3,ENSG00000151136,,
174549,BTBD11,ABTB3,ENSG00000151136,Ortholog Symbol,Ensembl
174550,FLJ33957,ABTB3,ENSG00000151136,,
236590,ABTB3,ABTB3,GENE ID:121551,,
285731,ABTB2B,ABTB3,GENE ID:121551,,


In [148]:
hgnc_gene_group_root_df.loc[hgnc_gene_group_root_df["Approved symbol"] == "ABTB3"]

Unnamed: 0,HGNC ID,Approved symbol,Gene group ID,abbreviation
144,HGNC:23844,ABTB3,861,BTBD
145,HGNC:23844,ABTB3,403,ANKRD


## <a id='toc1_9_'></a>[Add Disease Symbol label](#toc0_)

### <a id='toc1_9_1_'></a>[Import the Disease root symbols](#toc0_)

This uses the table generated in symbol_capture_generation.ipynb.

In [149]:
gene2disease_df = pd.read_hdf(
    "../output/gene2disease_df.h5", key='df'
    )
gene2disease_df

Unnamed: 0,gene_MIM_number,phenotype_MIM_number,Prefix,pheno_title,pheno_symbol,Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
2,614984,204750,Number Sign,2-AMINOADIPIC 2-OXOADIPIC ACIDURIA,AMOXAD,GENE ID:55526,DHTKD1,ENSG00000181192
3,614984,204750,Number Sign,ALPHA-AMINOADIPIC AND ALPHA-KETOADIPIC ACIDURIA,AAKAD,GENE ID:55526,DHTKD1,ENSG00000181192
4,614984,204750,Number Sign,2-AMINOADIPIC 2-OXOADIPIC ACIDURIA,AMOXAD,GENE ID:55526,DHTKD1,ENSG00000181192
5,600301,610006,Number Sign,SHORT/BRANCHED-CHAIN ACYL-CoA DEHYDROGENASE DE...,SBCADD,GENE ID:36,ACADSB,ENSG00000196177
6,609577,273750,Number Sign,THREE M SYNDROME 1,3M1,GENE ID:9820,CUL7,ENSG00000044090
...,...,...,...,...,...,...,...,...
17642,606636,606579,Number Sign,VITILIGO-ASSOCIATED MULTIPLE AUTOIMMUNE DISEAS...,VAMAS1,GENE ID:22861,NLRP1,ENSG00000091592
17643,606636,606579,Number Sign,VITILIGO,VTLG,GENE ID:22861,NLRP1,ENSG00000091592
17644,606636,606579,Number Sign,"SYSTEMIC LUPUS ERYTHEMATOSUS, VITILIGO-RELATED",SLEV1,GENE ID:22861,NLRP1,ENSG00000091592
17646,600571,616806,Number Sign,WILMS TUMOR 6,WT6,GENE ID:5978,REST,ENSG00000084093


Match using the Ensembl gene ID (ENSG):

In [150]:
disease_capture_ensg_df = add_relationship_and_source_if_prefix(
    gene_group_capture_df,
    gene2disease_df,
    ["primary_gene_symbol","concept_id"],
    ["Approved Gene Symbol (HGNC)","Ensembl Gene ID (Ensembl)"],
    "symbol",
    "pheno_symbol",
    "Prefix Disease Symbol",
    "OMIM",
    gene_group_capture_df,
    "_upper"
)
disease_capture_ensg_df

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
1458142,NIID,NOTCH2NLC,ENSG00000286219,Prefix Disease Symbol,OMIM
1458147,OPDM3,NOTCH2NLC,ENSG00000286219,Prefix Disease Symbol,OMIM
1458184,OPML1,NUTM2B-AS1,ENSG00000225484,Prefix Disease Symbol,OMIM
1458246,CFZS2,MYMX,ENSG00000262179,Prefix Disease Symbol,OMIM


Match using the NCBI gene ID, starting from the Ensembl-matched table:

In [151]:
disease_capture_ncbi_df = add_relationship_and_source_if_prefix(
    disease_capture_ensg_df,
    gene2disease_df,
    ["primary_gene_symbol","concept_id"],
    ["Approved Gene Symbol (HGNC)","Entrez Gene ID (NCBI)"],
    "symbol",
    "pheno_symbol",
    "Prefix Disease Symbol",
    "OMIM",
    disease_capture_ensg_df,
    "_upper"
)
disease_capture_ncbi_df

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
1647455,NIID,NOTCH2NLC,GENE ID:100996717,Prefix Disease Symbol,OMIM
1647460,OPDM3,NOTCH2NLC,GENE ID:100996717,Prefix Disease Symbol,OMIM
1647504,OPML1,NUTM2B-AS1,GENE ID:101060691,Prefix Disease Symbol,OMIM
1647598,CFZS2,MYMX,GENE ID:101929726,Prefix Disease Symbol,OMIM


In [152]:
disease_capture_ncbi_df.loc[disease_capture_ncbi_df["symbol"] == "ASP"]

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
54029,ASP,ATG5,HGNC:589,,
67023,ASP,ASIP,HGNC:745,,
80200,ASP,A1CF,HGNC:24086,,
80300,ASP,ASPA,HGNC:756,,
85797,ASP,ASPM,HGNC:19048,,
89014,ASP,TMPRSS11D,HGNC:24059,,
95157,ASP,ROPN1L,HGNC:24060,,
160344,ASP,ATG5,ENSG00000057663,,
176827,ASP,ASIP,ENSG00000101440,,
195197,ASP,A1CF,ENSG00000148584,,


In [154]:
disease_capture_ncbi_df.loc[disease_capture_ncbi_df["primary_gene_symbol"] == "C3"]

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
31738,C3,C3,HGNC:1318,,
84983,ARMD9,C3,HGNC:1318,,
84984,C3A,C3,HGNC:1318,,
84985,C3B,C3,HGNC:1318,,
84986,CPAMD1,C3,HGNC:1318,,
136425,C3,C3,ENSG00000125730,Ortholog Symbol,Ensembl
200590,ARMD9,C3,ENSG00000125730,,
200591,C3A,C3,ENSG00000125730,,
200592,C3B,C3,ENSG00000125730,,
200593,CPAMD1,C3,ENSG00000125730,,


In [155]:
filtered_df = disease_capture_ncbi_df[disease_capture_ncbi_df["symbol"].str.startswith("FWP", na=False)]

In [156]:
filtered_df

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
68257,FWP007,A2M,HGNC:7,,
178292,FWP007,A2M,ENSG00000175899,,
289000,FWP007,A2M,GENE ID:2,,
546135,FWP010,CD2BP2,HGNC:1656,,
546663,FWP006,MORF4L1,HGNC:16989,,
553023,FWP009,OCEL1,HGNC:26221,,
554892,FWP005,ADAT3,HGNC:25151,,
562713,FWP004,RPL13AP25,HGNC:36981,,
579278,FWP010,CD2BP2,ENSG00000169217,,
579805,FWP006,MORF4L1,ENSG00000185787,,


In [157]:
disease_capture_ncbi_df.to_csv('../output/disease_capture_ncbi_df.csv', index=False)

In [158]:
edges_df = disease_capture_ncbi_df[disease_capture_ncbi_df["concept_id"].notna()]

In [159]:
edges_df.to_csv('../output/edges_df.csv', index=True)

In [160]:
disease_capture_ncbi_df = disease_capture_ncbi_df.sort_values(by='primary_gene_symbol')
disease_capture_ncbi_df.head(50)

Unnamed: 0,symbol,primary_gene_symbol,concept_id,relationship,source
746992,12S rRNA,12S rRNA,GENE ID:6775087,,
747033,12S rRNA,12S rRNA,GENE ID:8923213,,
534242,12S rRNA,12S rRNA,,,
747039,16S rRNA,16S rRNA,GENE ID:8923219,,
534254,16S rRNA,16S rRNA,,,
123434,5S_rRNA,5S_rRNA,ENSG00000285609,,
132338,5S_rRNA,5S_rRNA,ENSG00000277488,Ortholog Symbol,Ensembl
811283,5S_rRNA,5S_rRNA,ENSG00000276861,,
127691,5S_rRNA,5S_rRNA,ENSG00000265816,,
110296,5S_rRNA,5S_rRNA,ENSG00000283454,,
