**Table of contents**<a id='toc0_'></a>    
- [Gene Symbol Capture Transformation](#toc1_)    
    - [Transforming symbol capture data into a table to be used as a sqlite file for a gene symbol relationship look up tool](#toc1_1_1_)    
  - [Set Up](#toc1_2_)    
    - [Import packages](#toc1_2_1_)    
    - [Define Functions](#toc1_2_2_)    
  - [Download gene records](#toc1_3_)    
    - [Ensembl gene records](#toc1_3_1_)    
      - [Combine symbols (primary and aliases) into one column](#toc1_3_1_1_)    
      - [Add "GENE ID:" as a prefix for NCBI IDs](#toc1_3_1_2_)    
      - [Combine concept ids( HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc1_3_1_3_)    
    - [HGNC gene records](#toc1_3_2_)    
      - [Combine symbols (primary and aliases) into one column](#toc1_3_2_1_)    
      - [Add "GENE ID:" as a prefix for NCBI IDs](#toc1_3_2_2_)    
      - [Combine concept ids( HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc1_3_2_3_)    
    - [NCBI gene records](#toc1_3_3_)    
      - [Extract necessary cross references](#toc1_3_3_1_)    
      - [Add "ENSG" as a prefix for Ensembl IDs](#toc1_3_3_2_)    
      - [Split aliases to one per row](#toc1_3_3_3_)    
      - [Combine symbols (primary and aliases) into one column](#toc1_3_3_4_)    
      - [Add "GENE ID:" as a prefix for NCBI IDs](#toc1_3_3_5_)    
      - [Add "HGNC:" as a prefix for HGNC IDs](#toc1_3_3_6_)    
      - [Combine concept ids( HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc1_3_3_7_)    
      - [Remove records with no associated gene symbols](#toc1_3_3_8_)    
  - [Combine gene records from Ensembl, HGNC, and NCBI](#toc1_4_)    
      - [Dropped duplicate rows and rows with no gene symbols.](#toc1_4_1_1_)    
  - [Add Ortholog relationship label](#toc1_5_)    
    - [Import ortholog sets](#toc1_5_1_)    
    - [Combine all species orthologs into one table](#toc1_5_2_)    
    - [Add Ortholog label to the table of gene records](#toc1_5_3_)    
  - [Add HGNC Previous Symbol relationship label](#toc1_6_)    
    - [Import the expired HGNC symbols](#toc1_6_1_)    
    - [Add HGNC Previous Symbol label to the table of gene records](#toc1_6_2_)    
  - [Add FLJ Clone Symbol label](#toc1_7_)    
    - [Import the FLJ Clone symbols](#toc1_7_1_)    
    - [Add FLJ Clone Symbol label to the table of gene records](#toc1_7_2_)    
  - [Add Gene Family Symbol label](#toc1_8_)    
    - [Import the Gene Family Root symbols](#toc1_8_1_)    
  - [Add Disease Symbol label](#toc1_9_)    
    - [Import the Disease root symbols](#toc1_9_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Gene Symbol Capture Transformation](#toc0_)

### <a id='toc1_1_1_'></a>[Transforming symbol capture data into a table to be used as a sqlite file for a gene symbol relationship look up tool](#toc0_)

This table is denormalized compared to the table in symbol_capture_generation.ipynb.
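
As a rough illustration of the intended use, the sketch below loads a few rows in the denormalized layout into an in-memory SQLite database and looks up a symbol. The table name `gene_symbols` and the choice of sample rows are assumptions for illustration only; the column names mirror the final table built in this notebook.

```python
import sqlite3

# Sample rows in the denormalized layout:
# (symbol, gene_symbol, concept_id, relationship, source)
rows = [
    ("C3", "C3", "HGNC:1318", "", ""),
    ("CPAMD1", "C3", "HGNC:1318", "", ""),
    ("FAM206A", "ABITRAM", "ENSG00000119328", "Ortholog Symbol", "Ensembl"),
]

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(
    "CREATE TABLE gene_symbols ("
    "symbol TEXT, gene_symbol TEXT, concept_id TEXT, relationship TEXT, source TEXT)"
)
conn.executemany("INSERT INTO gene_symbols VALUES (?, ?, ?, ?, ?)", rows)

# Case-insensitive lookup of every record attached to a symbol
lookup = conn.execute(
    "SELECT gene_symbol, concept_id, relationship FROM gene_symbols "
    "WHERE UPPER(symbol) = UPPER(?)",
    ("cpamd1",),
).fetchall()
print(lookup)
```

Because every row already carries its symbol, primary symbol, identifier, and label, a lookup like this needs no joins.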

## <a id='toc1_2_'></a>[Set Up](#toc0_)

### <a id='toc1_2_1_'></a>[Import packages](#toc0_)

In [380]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import glob
import os
import plotly.express as px

### <a id='toc1_2_2_'></a>[Define Functions](#toc0_)

In [381]:
def combine_columns(df, columns_to_combine, columns_to_keep, new_name, columns_to_drop):
    """Combine multiple columns into one while keeping associated data attached.
    Use this function when the columns to combine are easier to list 
    than the columns not to combine.
    
    :param df: The DataFrame containing the columns to be combined
    :param columns_to_combine: List of column names to combine into one
    :param columns_to_keep: List of columns to keep in the final DataFrame
    :param new_name: The name of the new combined column
    :param columns_to_drop: List of columns to drop from the final DataFrame
    :return: A new DataFrame with combined columns and selected columns retained
    """
    og_df = df.copy()

    combined_dfs = []

    # Loop through each column in columns_to_combine and create a new DataFrame
    for col in columns_to_combine:
        temp_df = og_df[list(set([col] + columns_to_keep))].copy()
        temp_df[new_name] = temp_df[col]
        combined_dfs.append(temp_df)

    df_combined = pd.concat(combined_dfs, ignore_index=True)
    df_combined.drop(columns_to_drop, axis=1, inplace=True)
    df_combined.drop_duplicates(inplace=True)
    
    return df_combined
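
To make the reshaping concrete, the sketch below repeats the same stack-and-deduplicate idea on a two-row toy table. The data and variable names are made up for illustration; it is not a replacement for `combine_columns`.

```python
import pandas as pd

# Toy table: one gene with two aliases (illustrative values only)
toy_df = pd.DataFrame(
    {
        "HGNC_ID": ["HGNC:1", "HGNC:1"],
        "gene_symbol": ["GENE1", "GENE1"],
        "alias_symbol": ["ALIAS_A", "ALIAS_B"],
    }
)

# Stack one copy of the kept columns per source column, as combine_columns does
pieces = []
for col in ["gene_symbol", "alias_symbol"]:
    piece = toy_df[["HGNC_ID", "gene_symbol"]].copy()
    piece["symbol"] = toy_df[col]
    pieces.append(piece)

# Duplicate rows (the primary symbol repeated once per alias) collapse to one
stacked = pd.concat(pieces, ignore_index=True).drop_duplicates()
print(stacked)
```

The primary symbol ends up as one row and each alias as its own row, all still attached to the same identifier.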

In [382]:
def combine_columns_except(df, new_name, cols_not_combine):
    """Combine multiple columns into one while keeping associated data attached.
    Use this function when the columns not to combine are easier to list 
    than the columns to combine.

    :param df: The DataFrame containing the columns to be combined
    :param new_name: The name of the new combined column
    :param cols_not_combine: List of columns TO NOT combine from the original DataFrame
    :return: A new DataFrame with combined columns and selected columns retained    
    """

    og_df = df.copy()
    
    columns_to_keep = [col for col in df.columns if col in cols_not_combine]
    columns_to_combine = [col for col in df.columns if col not in columns_to_keep]

    # Loop through each column in columns_to_combine and create a new DataFrame
    combined_df = []
    for col in columns_to_combine:
        temp_df = og_df[list(set([col] + columns_to_keep))].copy()
        temp_df[new_name] = temp_df[col]
        combined_df.append(temp_df)

    df_combined = pd.concat(combined_df, ignore_index=True)
    df_combined.drop(columns=columns_to_combine, inplace=True)
    df_combined.drop_duplicates(inplace=True)
    
    return df_combined
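
A similar long format can also be produced with `DataFrame.melt`, which may be easier to read for wide ortholog tables. This is an alternative sketch, assuming the name of the column each value came from does not need to be kept; the species column names are invented, and the `dropna` step mirrors what `combine_columns_except_multiple` does afterwards.

```python
import pandas as pd

# Toy ortholog table; the species column names are hypothetical
wide_df = pd.DataFrame(
    {
        "Gene stable ID": ["ENSG0001", "ENSG0002"],
        "Gene name": ["GENE1", "GENE2"],
        "Mouse gene name": ["Gene1", "Gene2"],
        "Zebrafish gene name": ["gene1a", None],
    }
)

id_cols = ["Gene stable ID", "Gene name"]

# melt stacks every non-id column into a single value column
long_df = (
    wide_df.melt(id_vars=id_cols, value_name="Ortholog")
    .drop(columns="variable")
    .dropna(subset=["Ortholog"])
    .drop_duplicates()
    .reset_index(drop=True)
)
print(long_df)
```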


In [383]:
def combine_columns_except_multiple(dfs, new_name, cols_not_combine):
    """Apply the combine_columns_except function to several DataFrames and then combine all of those dfs into one.

    :param dfs: The df(s) with the columns to combine (list or dictionary of DataFrames)
    :param new_name: The name of the new combined column
    :param cols_not_combine: List of columns TO NOT combine from the original DataFrame
    :return: A new DataFrame with combined dfs that had combined columns and selected columns retained
    """

    # Convert to list if the input is a dictionary
    if isinstance(dfs, dict):
        dfs = list(dfs.values())
    combined_dfs = []

    # Combine multiple columns into one while keeping associated data attached
    for df in dfs:
        combined_df = combine_columns_except(df, new_name, cols_not_combine)
        combined_dfs.append(combined_df)
    
    # Combine multiple dfs into one
    final_combined_df = pd.concat(combined_dfs, ignore_index=True)

    final_combined_df.drop_duplicates(inplace=True)
    final_combined_df = final_combined_df.dropna(subset=[new_name])

    
    return final_combined_df

In [384]:
def add_relationship_and_source_if_symbol(
    destination_df, relationship_df, destination_df_cols, relationship_df_cols, relationship_val, source_val, combine_with=None
):
    """Add a relationship and source description for gene symbols based on a dataset.
    
    :param destination_df: The DataFrame that is getting the relationships added to it
    :param relationship_df: The DataFrame that is the source of the relationships
    :param destination_df_cols: The columns used to match gene symbols to the relationship_df
    (usually primary gene concept symbol, an identifier, and alternate gene symbol)
    :param relationship_df_cols: The columns used to match gene symbols to the destination_df
    (usually primary gene concept symbol, an identifier, and alternate gene symbol)
    :param relationship_val: Based on the relationship_df, the relationship that the presence of
    the alternate symbol in the dataset would indicate
    :param source_val: The label to indicate where the dataset in relationship_df came from
    :param combine_with: A DataFrame that was a result of this function with a different relationship category(s)
    :return: A new DataFrame with descriptive relationship and source labels

    Notes: 
    - The cols variables need to be in the correct order. The columns to be compared need 
    to be in the same positions in the lists. 
    - There cannot be any shared column names between relationship_df and destination_df.

    """
    # Create uppercase columns for merging
    for df, cols, suffix in [(destination_df, destination_df_cols, "_upper"), (relationship_df, relationship_df_cols, "_upper")]:
        df[[col + suffix for col in cols]] = df[cols].apply(lambda x: x.str.upper())
    
    # Perform the merge
    merged_df = destination_df.merge(
        relationship_df,
        left_on=[col + "_upper" for col in destination_df_cols],
        right_on=[col + "_upper" for col in relationship_df_cols],
        how="left",
        indicator=True,
    )

    # Assign relationships and sources
    merged_df["relationship"] = merged_df["_merge"].map(lambda x: relationship_val if x == "both" else "")
    merged_df["source"] = merged_df["_merge"].map(lambda x: source_val if x == "both" else "")
    
    # Drop extra columns and remove duplicates
    merged_df.drop(
        columns=[col + "_upper" for col in (destination_df_cols + relationship_df_cols)] + ["_merge"] + relationship_df.columns.tolist(),
        inplace=True
    )
    merged_df.drop_duplicates(inplace=True)

    destination_df.drop(
        columns=[col + "_upper" for col in destination_df_cols],
        inplace=True
    )
    relationship_df.drop(
        columns=[col + "_upper" for col in relationship_df_cols],
        inplace=True
    )
    # Optionally combine with an existing DataFrame
    if combine_with is not None:
        combined_df = pd.concat([combine_with, merged_df], ignore_index=True).drop_duplicates()
        return combined_df
    
    return merged_df
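
The core of the function is a left merge with `indicator=True` on upper-cased keys. A minimal sketch of that idea on toy data follows; the frames, values, and the `id_key`/`sym_key` helper columns are invented, and only the matching step is shown.

```python
import pandas as pd

# Toy gene records and a toy relationship table (illustrative values)
records = pd.DataFrame(
    {"concept_id": ["HGNC:1", "HGNC:1"], "symbol": ["Abc1", "XYZ9"]}
)
previous = pd.DataFrame({"HGNC ID": ["HGNC:1"], "previous_symbol": ["ABC1"]})

# Upper-case copies of the keys so the comparison ignores case
left = records.assign(
    id_key=records["concept_id"].str.upper(),
    sym_key=records["symbol"].str.upper(),
)
right = previous.assign(
    id_key=previous["HGNC ID"].str.upper(),
    sym_key=previous["previous_symbol"].str.upper(),
)[["id_key", "sym_key"]]

merged = left.merge(right, on=["id_key", "sym_key"], how="left", indicator=True)

# "_merge" is "both" only where the key also exists in the relationship table
merged["relationship"] = [
    "HGNC Previous Symbol" if flag == "both" else "" for flag in merged["_merge"]
]
labeled = merged[["concept_id", "symbol", "relationship"]]
print(labeled)
```

Rows that find no partner keep an empty label rather than being dropped, which is why the merge is a left join.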


In [385]:
def add_relationship_and_source_if_prefix(
    destination_df, relationship_df, exact_cols_df1, exact_cols_df2,
    prefix_col_df1, prefix_col_df2, relationship_val, source_val,
    combine_with=None, suffix="_upper"
):
    """
    Compare two DataFrames to find matching rows based on:
    1. Exact matches for specified columns (which can differ between DataFrames).
    2. A single prefix match for a specific column.
    Add relationship and source descriptions for gene symbols.

    :param destination_df: The DataFrame getting relationships added.
    :param relationship_df: The DataFrame as the source of the relationships.
    :param exact_cols_df1: Columns in destination_df requiring exact matches.
    :param exact_cols_df2: Columns in relationship_df requiring exact matches.
    :param prefix_col_df1: Column in destination_df for prefix matching.
    :param prefix_col_df2: Column in relationship_df for prefix matching.
    :param relationship_val: The relationship to assign if a match is found.
    :param source_val: The source label to assign if a match is found.
    :param combine_with: Optionally combine with an existing DataFrame.
    :param suffix: Suffix for new uppercase columns.
    :return: DataFrame with descriptive relationship and source labels.
    """
    # Create uppercase columns for merging (named with the suffix argument)
    for df, cols in [(destination_df, exact_cols_df1 + [prefix_col_df1]), (relationship_df, exact_cols_df2 + [prefix_col_df2])]:
        df[[col + suffix for col in cols]] = df[cols].apply(lambda x: x.str.upper())
    
    # Merge DataFrames on exact columns
    merged_df = destination_df.merge(
        relationship_df,
        left_on=[col + suffix for col in exact_cols_df1],
        right_on=[col + suffix for col in exact_cols_df2],
        how="left",
        indicator=True,
    )

    # Identify prefix matches; rows missing either value never match
    merged_df["Prefix Match"] = merged_df.apply(
        lambda row: row[prefix_col_df1 + suffix].startswith(row[prefix_col_df2 + suffix])
        if pd.notna(row[prefix_col_df1 + suffix]) and pd.notna(row[prefix_col_df2 + suffix])
        else False,
        axis=1
    )

    # Assign relationships and sources
    merged_df["relationship"] = merged_df.apply(
        lambda row: relationship_val if row["_merge"] == "both" and row["Prefix Match"] else "", axis=1
    )
    merged_df["source"] = merged_df.apply(
        lambda row: source_val if row["_merge"] == "both" and row["Prefix Match"] else "", axis=1
    )

    # Drop extra columns and remove duplicates
    merged_df.drop(
        columns=[col + suffix for col in exact_cols_df1 + exact_cols_df2 + [prefix_col_df1, prefix_col_df2]] 
                + ["_merge", "Prefix Match"] 
                + relationship_df.columns.tolist(),
        inplace=True, errors='ignore'
    )
    merged_df.drop_duplicates(inplace=True)

    destination_df.drop(
        columns=[col + suffix for col in exact_cols_df1 + [prefix_col_df1]],
        inplace=True,
        errors='ignore'
    )

    relationship_df.drop(
        columns=[col + suffix for col in exact_cols_df2 + [prefix_col_df2]],
        inplace=True,
        errors='ignore'
    )

    # Optionally combine with an existing DataFrame
    if combine_with is not None:
        combined_df = pd.concat([combine_with, merged_df], ignore_index=True).drop_duplicates()
        return combined_df

    return merged_df
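
The prefix test itself reduces to `str.startswith` on upper-cased values. The sketch below shows it on invented data with a hypothetical gene-family root `ABC`; the helper `is_prefix_match` is made up for this example and illustrates the matching rule only, not the full function.

```python
import pandas as pd

# Toy merged rows: each symbol paired with a candidate family root
pairs = pd.DataFrame(
    {
        "symbol": ["abc12", "XABC1", "ABC3"],
        "abbreviation": ["ABC", "ABC", None],
    }
)


def is_prefix_match(symbol, root):
    """Return True when both values exist and symbol starts with root, ignoring case."""
    if pd.isna(symbol) or pd.isna(root):
        return False
    return symbol.upper().startswith(root.upper())


pairs["prefix_match"] = [
    is_prefix_match(s, r) for s, r in zip(pairs["symbol"], pairs["abbreviation"])
]
print(pairs)
```

Only the first row matches: the second contains the root but does not start with it, and the third has no root to compare against.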



## <a id='toc1_3_'></a>[Download gene records](#toc0_)

### <a id='toc1_3_1_'></a>[Ensembl gene records](#toc0_)

In [386]:
mini_ensg_df = pd.read_csv(
    "../output/mini_ensg_df.csv", sep=",", index_col=[0]
)
mini_ensg_df

Unnamed: 0,ENSG_ID,NCBI_ID,HGNC_ID,alias_symbol,gene_symbol
0,ENSG00000210049,,HGNC:7481,MTTF,MT-TF
1,ENSG00000210049,,HGNC:7481,TRNF,MT-TF
2,ENSG00000211459,,HGNC:7470,12S,MT-RNR1
3,ENSG00000211459,,HGNC:7470,MOTS-C,MT-RNR1
4,ENSG00000211459,,HGNC:7470,MTRNR1,MT-RNR1
...,...,...,...,...,...
133058,ENSG00000197989,GENE ID:85028,HGNC:30062,LINC00100,SNHG12
133059,ENSG00000197989,GENE ID:85028,HGNC:30062,PNAS-123,SNHG12
133060,ENSG00000229388,,HGNC:52502,LINC01715,TAF12-DT
133062,ENSG00000274978,GENE ID:26824,HGNC:10108,RNU11-1,RNU11


#### <a id='toc1_3_1_1_'></a>[Combine symbols (primary and aliases) into one column](#toc0_)

This produces a single generic symbol column representing all symbols associated with a gene concept.

In [387]:
ensg_combined_symbols_df = combine_columns(mini_ensg_df, ['gene_symbol','alias_symbol'], ['gene_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], "symbol", "alias_symbol")
ensg_combined_symbols_df.head()

Unnamed: 0,HGNC_ID,ENSG_ID,NCBI_ID,gene_symbol,symbol
0,HGNC:7481,ENSG00000210049,,MT-TF,MT-TF
2,HGNC:7470,ENSG00000211459,,MT-RNR1,MT-RNR1
5,HGNC:7500,ENSG00000210077,,MT-TV,MT-TV
7,HGNC:7471,ENSG00000210082,,MT-RNR2,MT-RNR2
10,HGNC:7490,ENSG00000209082,,MT-TL1,MT-TL1


In [388]:
ensg_combined_symbols_df.loc[ensg_combined_symbols_df["gene_symbol"]== "C3"]

Unnamed: 0,HGNC_ID,ENSG_ID,NCBI_ID,gene_symbol,symbol
66246,HGNC:1318,ENSG00000125730,GENE ID:718,C3,C3
161270,HGNC:1318,ENSG00000125730,GENE ID:718,C3,ARMD9
161271,HGNC:1318,ENSG00000125730,GENE ID:718,C3,C3A
161272,HGNC:1318,ENSG00000125730,GENE ID:718,C3,C3B
161273,HGNC:1318,ENSG00000125730,GENE ID:718,C3,CPAMD1


Make a table for concordance analysis

In [389]:
ensg_combined_symbols_df.to_csv(
    "../output/ensg_combined_symbols_df.csv", index=True
)

#### <a id='toc1_3_1_3_'></a>[Combine concept IDs (HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc0_)

In [390]:
ensg_combined_concept_ids_df = combine_columns(ensg_combined_symbols_df, ["HGNC_ID", "ENSG_ID", "NCBI_ID"], ['gene_symbol',"symbol"], "concept_id", ["HGNC_ID", "ENSG_ID", "NCBI_ID"])
ensg_combined_concept_ids_df.head()

Unnamed: 0,symbol,gene_symbol,concept_id
0,MT-TF,MT-TF,HGNC:7481
1,MT-RNR1,MT-RNR1,HGNC:7470
2,MT-TV,MT-TV,HGNC:7500
3,MT-RNR2,MT-RNR2,HGNC:7471
4,MT-TL1,MT-TL1,HGNC:7490


In [391]:
ensg_combined_concept_ids_df.to_csv(
    "../output/ensg_combined_concept_ids_df.csv", index=True
)

Create a table for Cytoscape

In [392]:
ensg_combined_hgnc_ncbi_concept_ids_df = combine_columns(ensg_combined_symbols_df, ["HGNC_ID", "NCBI_ID"], ["symbol","ENSG_ID"], "xref_concept_id", ["HGNC_ID", "NCBI_ID"])
ensg_combined_symbol_xref_concept_ids_df = combine_columns(ensg_combined_hgnc_ncbi_concept_ids_df, ["symbol", "xref_concept_id"], ["ENSG_ID"], "target", ["symbol", "xref_concept_id"])
ensg_combined_symbol_xref_concept_ids_df["source"] = ensg_combined_symbol_xref_concept_ids_df["ENSG_ID"]
ensg_combined_symbol_xref_concept_ids_df = ensg_combined_symbol_xref_concept_ids_df.drop(columns =["ENSG_ID"])
ensg_combined_symbol_xref_concept_ids_df.loc[ensg_combined_symbol_xref_concept_ids_df["source"]== "ENSG00000125730"]

Unnamed: 0,target,source
32889,C3,ENSG00000125730
109956,ARMD9,ENSG00000125730
109957,C3A,ENSG00000125730
109958,C3B,ENSG00000125730
109959,CPAMD1,ENSG00000125730
315553,HGNC:1318,ENSG00000125730
454621,GENE ID:718,ENSG00000125730
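
The source/target layout above is an edge list: each row links one Ensembl ID to a symbol or a cross-referenced identifier. A compact sketch of the same reshaping with `melt` on a toy table is shown here; the values are invented and this is only one possible way to build such a table.

```python
import pandas as pd

# Toy record table with one gene and two symbols (illustrative values)
toy_df = pd.DataFrame(
    {
        "ENSG_ID": ["ENSG0001", "ENSG0001"],
        "symbol": ["GENE1", "ALIAS_A"],
        "HGNC_ID": ["HGNC:1", "HGNC:1"],
        "NCBI_ID": ["GENE ID:10", "GENE ID:10"],
    }
)

# Every non-ENSG value becomes a target node; the ENSG ID is the source node
edges = (
    toy_df.melt(id_vars=["ENSG_ID"], value_name="target")
    .rename(columns={"ENSG_ID": "source"})[["target", "source"]]
    .dropna(subset=["target"])
    .drop_duplicates()
    .reset_index(drop=True)
)
print(edges)
```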


### <a id='toc1_3_2_'></a>[HGNC gene records](#toc0_)

In [393]:
mini_hgnc_df = pd.read_csv(
    "../output/mini_hgnc_df.csv", sep=",", index_col=[0]
)
mini_hgnc_df

Unnamed: 0,HGNC_ID,gene_symbol,alias_symbol,NCBI_ID,ENSG_ID
0,HGNC:100,ASIC1,BNaC2,GENE ID:41,ENSG00000110881
0,HGNC:100,ASIC1,hBNaC2,GENE ID:41,ENSG00000110881
1,HGNC:10000,RGS4,,GENE ID:5999,ENSG00000117152
2,HGNC:10001,RGS5,,GENE ID:8490,ENSG00000143248
3,HGNC:10002,RGS6,,GENE ID:9628,ENSG00000182732
...,...,...,...,...,...
44232,HGNC:9997,RGS16,RGS-r,GENE ID:6004,ENSG00000143333
44233,HGNC:9998,RGS2,,GENE ID:5997,ENSG00000116741
44234,HGNC:9999,RGS3,C2PA,GENE ID:5998,ENSG00000138835
44234,HGNC:9999,RGS3,FLJ20370,GENE ID:5998,ENSG00000138835


#### <a id='toc1_3_2_1_'></a>[Combine symbols (primary and aliases) into one column](#toc0_)

This produces a single generic symbol column representing all symbols associated with a gene concept.

In [394]:
hgnc_combined_symbols_df = combine_columns(mini_hgnc_df, ['gene_symbol','alias_symbol'], ['gene_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], "symbol", "alias_symbol")
hgnc_combined_symbols_df.head()

Unnamed: 0,HGNC_ID,ENSG_ID,NCBI_ID,gene_symbol,symbol
0,HGNC:100,ENSG00000110881,GENE ID:41,ASIC1,ASIC1
2,HGNC:10000,ENSG00000117152,GENE ID:5999,RGS4,RGS4
3,HGNC:10001,ENSG00000143248,GENE ID:8490,RGS5,RGS5
4,HGNC:10002,ENSG00000182732,GENE ID:9628,RGS6,RGS6
5,HGNC:10003,ENSG00000182901,GENE ID:6000,RGS7,RGS7


In [395]:
hgnc_combined_symbols_df.loc[hgnc_combined_symbols_df["gene_symbol"] == "C3"]

Unnamed: 0,HGNC_ID,ENSG_ID,NCBI_ID,gene_symbol,symbol
5726,HGNC:1318,ENSG00000125730,GENE ID:718,C3,C3
71933,HGNC:1318,ENSG00000125730,GENE ID:718,C3,CPAMD1
71934,HGNC:1318,ENSG00000125730,GENE ID:718,C3,ARMD9
71935,HGNC:1318,ENSG00000125730,GENE ID:718,C3,C3a
71936,HGNC:1318,ENSG00000125730,GENE ID:718,C3,C3b


Make a table for concordance analysis

In [396]:
hgnc_combined_symbols_df.to_csv(
    "../output/hgnc_combined_symbols_df.csv", index=True
)

#### <a id='toc1_3_2_3_'></a>[Combine concept IDs (HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc0_)

In [397]:
hgnc_combined_concept_ids_df = combine_columns(hgnc_combined_symbols_df, ["HGNC_ID", "ENSG_ID", "NCBI_ID"], ['gene_symbol',"symbol"], "concept_id", ["HGNC_ID", "ENSG_ID", "NCBI_ID"])
hgnc_combined_concept_ids_df.head()

Unnamed: 0,symbol,gene_symbol,concept_id
0,ASIC1,ASIC1,HGNC:100
1,RGS4,RGS4,HGNC:10000
2,RGS5,RGS5,HGNC:10001
3,RGS6,RGS6,HGNC:10002
4,RGS7,RGS7,HGNC:10003


In [398]:
hgnc_combined_concept_ids_df.to_csv(
    "../output/hgnc_combined_concept_ids_df.csv", index=True
)

Create a table for Cytoscape

In [399]:
hgnc_combined_hgnc_ncbi_concept_ids_df = combine_columns(hgnc_combined_symbols_df, ["ENSG_ID", "NCBI_ID"], ["symbol","HGNC_ID"], "xref_concept_id", ["ENSG_ID", "NCBI_ID"])
hgnc_combined_hgnc_ncbi_concept_ids_df.head()
hgnc_combined_symbol_xref_concept_ids_df = combine_columns(hgnc_combined_hgnc_ncbi_concept_ids_df, ["symbol", "xref_concept_id"], ["HGNC_ID"], "target", ["symbol", "xref_concept_id"])
hgnc_combined_symbol_xref_concept_ids_df["source"] = hgnc_combined_symbol_xref_concept_ids_df["HGNC_ID"]
hgnc_combined_symbol_xref_concept_ids_df = hgnc_combined_symbol_xref_concept_ids_df.drop(columns =["HGNC_ID"])
hgnc_combined_symbol_xref_concept_ids_df.loc[hgnc_combined_symbol_xref_concept_ids_df["source"]== "HGNC:1318"]

Unnamed: 0,target,source
2727,C3,HGNC:1318
49961,CPAMD1,HGNC:1318
49962,ARMD9,HGNC:1318
49963,C3a,HGNC:1318
49964,C3b,HGNC:1318
222997,ENSG00000125730,HGNC:1318
333430,GENE ID:718,HGNC:1318


### <a id='toc1_3_3_'></a>[NCBI gene records](#toc0_)

In [400]:
mini_ncbi_df = pd.read_csv(
    "../output/mini_ncbi_df.csv", sep=",", index_col=[0]
)
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,GENE ID:1,A1BG,A1B,HGNC:5,ENSG00000121410
0,GENE ID:1,A1BG,ABG,HGNC:5,ENSG00000121410
0,GENE ID:1,A1BG,GAB,HGNC:5,ENSG00000121410
0,GENE ID:1,A1BG,HYST2477,HGNC:5,ENSG00000121410
1,GENE ID:2,A2M,A2MD,HGNC:7,ENSG00000175899
...,...,...,...,...,...
193502,GENE ID:141732005,ADCY2-AS1,,HGNC:40064,
193503,GENE ID:141732006,NSG2-AS1,,HGNC:41074,
193504,GENE ID:141732007,ST18-AS1,,HGNC:58430,
193505,GENE ID:141732008,MICAL2-AS1,,HGNC:58437,


#### <a id='toc1_3_3_4_'></a>[Combine symbols (primary and aliases) into one column](#toc0_)

This produces a single generic symbol column representing all symbols associated with a gene concept.

In [401]:
ncbi_combined_symbols_df = combine_columns(mini_ncbi_df, ['gene_symbol','alias_symbol'], ['gene_symbol', 'HGNC_ID', 'ENSG_ID', 'NCBI_ID'], "symbol", "alias_symbol")
ncbi_combined_symbols_df.head()

Unnamed: 0,HGNC_ID,ENSG_ID,NCBI_ID,gene_symbol,symbol
0,HGNC:5,ENSG00000121410,GENE ID:1,A1BG,A1BG
4,HGNC:7,ENSG00000175899,GENE ID:2,A2M,A2M
8,HGNC:7645,ENSG00000171428,GENE ID:9,NAT1,NAT1
12,HGNC:7646,ENSG00000156006,GENE ID:10,NAT2,NAT2
15,HGNC:15,,GENE ID:11,NATP,NATP


In [402]:
ncbi_combined_symbols_df.loc[ncbi_combined_symbols_df["gene_symbol"]== "C3"]

Unnamed: 0,HGNC_ID,ENSG_ID,NCBI_ID,gene_symbol,symbol
2032,HGNC:1318,ENSG00000125730,GENE ID:718,C3,C3
93409,HGNC:1318,ENSG00000125730,GENE ID:718,C3,AHUS5
93410,HGNC:1318,ENSG00000125730,GENE ID:718,C3,ARMD9
93411,HGNC:1318,ENSG00000125730,GENE ID:718,C3,ASP
93412,HGNC:1318,ENSG00000125730,GENE ID:718,C3,C3a
93413,HGNC:1318,ENSG00000125730,GENE ID:718,C3,C3b
93414,HGNC:1318,ENSG00000125730,GENE ID:718,C3,CPAMD1
93415,HGNC:1318,ENSG00000125730,GENE ID:718,C3,HEL-S-62p


In [403]:
ncbi_combined_symbols_df.to_csv(
    "../output/ncbi_combined_symbols_df.csv", index=True
)

#### <a id='toc1_3_3_7_'></a>[Combine concept IDs (HGNC_ID, ENSG_ID, NCBI_ID) into one column](#toc0_)

In [404]:
ncbi_combined_concept_ids_df = combine_columns(ncbi_combined_symbols_df, ["HGNC_ID", "ENSG_ID", "NCBI_ID"], ['gene_symbol',"symbol"], "concept_id", ["HGNC_ID", "ENSG_ID", "NCBI_ID"])
ncbi_combined_concept_ids_df.head()

Unnamed: 0,symbol,gene_symbol,concept_id
0,A1BG,A1BG,HGNC:5
1,A2M,A2M,HGNC:7
2,NAT1,NAT1,HGNC:7645
3,NAT2,NAT2,HGNC:7646
4,NATP,NATP,HGNC:15


#### <a id='toc1_3_3_8_'></a>[Remove records with no associated gene symbols](#toc0_)

In [405]:
ncbi_combined_concept_ids_df.loc[ncbi_combined_concept_ids_df["symbol"] == "-"]

Unnamed: 0,symbol,gene_symbol,concept_id


In [406]:
ncbi_combined_concept_ids_df.to_csv(
    "../output/ncbi_combined_concept_ids_df.csv", index=True
)

Create a table for Cytoscape

In [407]:
ncbi_combined_hgnc_ncbi_concept_ids_df = combine_columns(ncbi_combined_symbols_df, ["ENSG_ID", "HGNC_ID"], ["symbol","NCBI_ID"], "xref_concept_id", ["ENSG_ID", "HGNC_ID"])
ncbi_combined_hgnc_ncbi_concept_ids_df.head()
ncbi_combined_symbol_xref_concept_ids_df = combine_columns(ncbi_combined_hgnc_ncbi_concept_ids_df, ["symbol", "xref_concept_id"], ["NCBI_ID"], "target", ["symbol", "xref_concept_id"])
ncbi_combined_symbol_xref_concept_ids_df["source"] = ncbi_combined_symbol_xref_concept_ids_df["NCBI_ID"]
ncbi_combined_symbol_xref_concept_ids_df = ncbi_combined_symbol_xref_concept_ids_df.drop(columns =["NCBI_ID"])
ncbi_combined_symbol_xref_concept_ids_df.loc[ncbi_combined_symbol_xref_concept_ids_df["source"]== "GENE ID:718"]

Unnamed: 0,target,source
566,C3,GENE ID:718
46671,AHUS5,GENE ID:718
46672,ARMD9,GENE ID:718
46673,ASP,GENE ID:718
46674,C3a,GENE ID:718
46675,C3b,GENE ID:718
46676,CPAMD1,GENE ID:718
46677,HEL-S-62p,GENE ID:718
271848,ENSG00000125730,GENE ID:718
407864,HGNC:1318,GENE ID:718


## <a id='toc1_4_'></a>[Combine gene records from Ensembl, HGNC, and NCBI](#toc0_)

#### <a id='toc1_4_1_1_'></a>[Dropped duplicate rows and rows with no gene symbols.](#toc0_)

In [408]:
genes_df = pd.concat([ensg_combined_concept_ids_df, hgnc_combined_concept_ids_df, ncbi_combined_concept_ids_df], ignore_index=True)
genes_df.drop_duplicates(inplace=True)
genes_df = genes_df.dropna(subset=["symbol","gene_symbol"])
genes_df = genes_df[~genes_df['symbol'].isin(["-", "", " "])].copy()
genes_df

Unnamed: 0,symbol,gene_symbol,concept_id
0,MT-TF,MT-TF,HGNC:7481
1,MT-RNR1,MT-RNR1,HGNC:7470
2,MT-TV,MT-TV,HGNC:7500
3,MT-RNR2,MT-RNR2,HGNC:7471
4,MT-TL1,MT-TL1,HGNC:7490
...,...,...,...
1113278,AGPS,IFT70A-AS1,GENE ID:139281660
1113279,PDE11A,IFT70A-AS1,GENE ID:139281660
1113286,lnc-BCAT1,BCAT1-DT,GENE ID:139281667
1113359,lnc-OB1,LNCOB1,GENE ID:139440214
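
The filter above removes exact matches for `"-"`, `""`, and `" "`. A slightly more tolerant variant, sketched here on invented rows, strips whitespace before comparing so that longer runs of spaces are caught as well. This is an assumption about what counts as an empty symbol, not a change the notebook makes.

```python
import pandas as pd

# Toy symbols including placeholders that carry no information
toy_df = pd.DataFrame(
    {
        "symbol": ["GENE1", "-", "  ", None, "ALIAS_A"],
        "gene_symbol": ["GENE1", "GENE2", "GENE3", "GENE4", "GENE1"],
    }
)

# Drop missing values first, then anything that is empty or "-" after stripping
non_null = toy_df.dropna(subset=["symbol", "gene_symbol"])
is_placeholder = non_null["symbol"].str.strip().isin(["-", ""])
clean_df = non_null[~is_placeholder].copy()
print(clean_df)
```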


In [409]:
genes_df.loc[genes_df["gene_symbol"] == "C3"]

Unnamed: 0,symbol,gene_symbol,concept_id
27731,C3,C3,HGNC:1318
92145,ARMD9,C3,HGNC:1318
92146,C3A,C3,HGNC:1318
92147,C3B,C3,HGNC:1318
92148,CPAMD1,C3,HGNC:1318
148511,C3,C3,ENSG00000125730
225574,ARMD9,C3,ENSG00000125730
225575,C3A,C3,ENSG00000125730
225576,C3B,C3,ENSG00000125730
225577,CPAMD1,C3,ENSG00000125730


Create a table for Cytoscape

In [410]:
cytoscape_genes_df = pd.concat([ensg_combined_symbol_xref_concept_ids_df, hgnc_combined_symbol_xref_concept_ids_df, ncbi_combined_symbol_xref_concept_ids_df], ignore_index=True)
cytoscape_genes_df.drop_duplicates(inplace=True)
cytoscape_genes_df = cytoscape_genes_df.dropna(subset=["source","target"])
cytoscape_genes_df = cytoscape_genes_df[~cytoscape_genes_df['target'].isin(["-", "", " "])].copy()
cytoscape_genes_df.loc[cytoscape_genes_df["source"].isin(["ENSG00000125730","HGNC:1318","GENE ID:718"])]

Unnamed: 0,target,source
32889,C3,ENSG00000125730
109952,ARMD9,ENSG00000125730
109953,C3A,ENSG00000125730
109954,C3B,ENSG00000125730
109955,CPAMD1,ENSG00000125730
168375,HGNC:1318,ENSG00000125730
210289,GENE ID:718,ENSG00000125730
225945,C3,HGNC:1318
273179,CPAMD1,HGNC:1318
273180,ARMD9,HGNC:1318


In [411]:
cytoscape_genes_df.to_csv('../output/cytoscape_genes_df.csv', index=True)

## <a id='toc1_5_'></a>[Add Ortholog relationship label](#toc0_)

### <a id='toc1_5_1_'></a>[Import ortholog sets](#toc0_)

The ortholog sets were created in symbol_capture_generation_df.ipynb.

In [412]:
folder_path = "../input/"  
num_files = 10

file_names = [os.path.join(folder_path, f'ortholog_set_{i}_df.txt') for i in range(1, num_files + 1)]

ortholog_set_dfs = {}

for i, file_name in enumerate(file_names, start=1):
    df = pd.read_csv(file_name, index_col=None)  

    ortholog_set_dfs[f'ortholog_set_{i}_df'] = df
    
    globals()[f'ortholog_set_{i}_df'] = df
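
Writing into `globals()` works in a notebook but hides where names come from. A sketch that keeps everything in the dictionary is shown below. It writes two tiny stand-in files to a temporary folder so it can run on its own; the file contents and the helper name `load_ortholog_sets` are invented, and only the `ortholog_set_{i}_df.txt` naming follows the cell above.

```python
import os
import tempfile

import pandas as pd


def load_ortholog_sets(folder_path, num_files):
    """Read ortholog_set_{i}_df.txt for i in 1..num_files into a dict of DataFrames."""
    frames = {}
    for i in range(1, num_files + 1):
        path = os.path.join(folder_path, f"ortholog_set_{i}_df.txt")
        frames[f"ortholog_set_{i}_df"] = pd.read_csv(path, index_col=None)
    return frames


# Stand-in input files so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    for i in range(1, 3):
        with open(os.path.join(tmp_dir, f"ortholog_set_{i}_df.txt"), "w") as handle:
            handle.write(f"Gene stable ID,Gene name\nENSG000{i},GENE{i}\n")
    ortholog_sets = load_ortholog_sets(tmp_dir, num_files=2)

print(sorted(ortholog_sets))
```

The resulting dictionary can be passed directly to `combine_columns_except_multiple`, which already accepts a dictionary of DataFrames.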

### <a id='toc1_5_2_'></a>[Combine all species orthologs into one table](#toc0_)

Combine all species columns in each table into one column, then combine all tables into one table.

TODO: Add qualifier and qualifier value columns to indicate species. These could also be used to indicate disease later.

In [413]:
combined_ortholog_set_df = combine_columns_except_multiple(ortholog_set_dfs, 'Ortholog', ["Gene name","Gene stable ID"])
combined_ortholog_set_df

Unnamed: 0,Gene stable ID,Gene name,Ortholog
769,ENSG00000251925,SNORA70,SNORA70
815,ENSG00000251796,SNORA70,SNORA70
1281,ENSG00000027001,MIPEP,MIPEP
1297,ENSG00000102753,KPNA3,KPNA3
1303,ENSG00000165475,CRYL1,CRYL1
...,...,...,...
1174150,ENSG00000186115,CYP4F2,CYP4F162
1174151,ENSG00000186115,CYP4F2,CYP4F155
1174152,ENSG00000186115,CYP4F2,CYP4F163
1174219,ENSG00000134716,CYP2J2,CYP2J89


### <a id='toc1_5_3_'></a>[Add Ortholog label to the table of gene records](#toc0_)

A match of an identifier, primary gene symbol, and alternate symbol between the gene records table and the relationship table (ortholog in this case) indicates that the alternate symbol is an ortholog.

In [414]:
ortholog_capture_df = add_relationship_and_source_if_symbol(
    genes_df,
    combined_ortholog_set_df,
    ["concept_id", "gene_symbol", "symbol"],
    ["Gene stable ID", "Gene name", "Ortholog"],
    "Ortholog Symbol",
    "Ensembl",
    combine_with=None
)
ortholog_capture_df

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
507297,AGPS,IFT70A-AS1,GENE ID:139281660,,
507298,PDE11A,IFT70A-AS1,GENE ID:139281660,,
507299,lnc-BCAT1,BCAT1-DT,GENE ID:139281667,,
507300,lnc-OB1,LNCOB1,GENE ID:139440214,,


## <a id='toc1_6_'></a>[Add HGNC Previous Symbol relationship label](#toc0_)

### <a id='toc1_6_1_'></a>[Import the expired HGNC symbols](#toc0_)

This uses the table generated in symbol_capture_generation.ipynb.

In [415]:
hgnc_previous_symbols_df = pd.read_hdf(
    "../output/hgnc_previous_symbols_df.h5", key='df'
    )
hgnc_previous_symbols_df

Unnamed: 0,HGNC ID,Approved symbol,previous_symbol
1,HGNC:37133,A1BG-AS1,NCRNA00181
1,HGNC:37133,A1BG-AS1,A1BGAS
1,HGNC:37133,A1BG-AS1,A1BG-AS
6,HGNC:23336,A2ML1,CPAMD9
9,HGNC:8,A2MP1,A2MP
...,...,...,...
49065,HGNC:34495,ZSWIM9,C19orf68
49066,HGNC:21224,ZUP1,C6orf113
49066,HGNC:21224,ZUP1,ZUFSP
49071,HGNC:13197,ZWS1,ZWS


### <a id='toc1_6_2_'></a>[Add HGNC Previous Symbol label to the table of gene records](#toc0_)

A match of an identifier, primary gene symbol, and alternate symbol between the gene records table and the relationship table (HGNC previous symbol in this case) indicates that the alternate symbol is an expired symbol.

The ortholog_capture_df is used as the starting table so that the expired-symbol labels are added alongside the ortholog labels, giving one complete table.

In [416]:
expired_capture_df = add_relationship_and_source_if_symbol(
    ortholog_capture_df,
    hgnc_previous_symbols_df,
    ["concept_id", "gene_symbol", "symbol"],
    ["HGNC ID", "Approved symbol", "previous_symbol"],
    "HGNC Previous Symbol",
    "HGNC",
    ortholog_capture_df
)
expired_capture_df

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
925322,bHLHb9,ARMCX5-GPRASP2,ENSG00000286237,,
925646,NPY4R,NPY4R2,ENSG00000264717,,
925665,OPN1MW,OPN1MW3,ENSG00000269433,,
926486,TMSB15B,TMSB15C,ENSG00000269226,,


Example of a gene concept that has alternate symbols representing orthologs and expired symbols:

In [417]:
expired_capture_df.loc[expired_capture_df["gene_symbol"] == "ABITRAM"]

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
9609,ABITRAM,ABITRAM,HGNC:1364,,
54076,C9ORF6,ABITRAM,HGNC:1364,,
54077,CG-8,ABITRAM,HGNC:1364,,
54078,FAM206A,ABITRAM,HGNC:1364,,
54079,FLJ20457,ABITRAM,HGNC:1364,,
54080,SIMIATE,ABITRAM,HGNC:1364,,
109097,ABITRAM,ABITRAM,ENSG00000119328,Ortholog Symbol,Ensembl
160346,C9ORF6,ABITRAM,ENSG00000119328,,
160347,CG-8,ABITRAM,ENSG00000119328,,
160348,FAM206A,ABITRAM,ENSG00000119328,Ortholog Symbol,Ensembl


## <a id='toc1_7_'></a>[Add FLJ Clone Symbol label](#toc0_)

### <a id='toc1_7_1_'></a>[Import the FLJ Clone symbols](#toc0_)

This uses the table generated in symbol_capture_generation.ipynb.

In [418]:
flj_clone_symbols_df = pd.read_hdf(
    "../output/flj_clone_symbols_df.h5", key='df'
    )
flj_clone_symbols_df

Unnamed: 0,Accesion No,ID
0,AK075326,PSEC0001
1,AK075326,FLJ91001
2,AK172724,PSEC0002
3,AK172724,FLJ91002
4,AK075327,PSEC0003
...,...,...
30581,AK057825,FLJ25096
30582,AK000479,FLJ20472
30583,AK125921,FLJ43933
30584,AK125959,FLJ43971


### <a id='toc1_7_2_'></a>[Add FLJ Clone Symbol label to the table of gene records](#toc0_)

In [419]:
flj_clone_capture_df = add_relationship_and_source_if_symbol(
    expired_capture_df,
    flj_clone_symbols_df,
    ["symbol"],
    ["ID"],
    "FLJ Clone Symbol",
    "FLJ Human cDNA Database",
    expired_capture_df
)
flj_clone_capture_df

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
966336,FLJ45513,FLJ45513,GENE ID:729220,FLJ Clone Symbol,FLJ Human cDNA Database
985046,PSEC0257,GOLM1,GENE ID:51280,FLJ Clone Symbol,FLJ Human cDNA Database
987046,PSEC0146,CYSLTR2,GENE ID:57105,FLJ Clone Symbol,FLJ Human cDNA Database
989157,PSEC0198,ARMC10,GENE ID:83787,FLJ Clone Symbol,FLJ Human cDNA Database


In [420]:
flj_clone_capture_df.loc[flj_clone_capture_df["gene_symbol"] == "MIR99AHG"]

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
6762,MIR99AHG,MIR99AHG,HGNC:1274,,
50375,C21ORF34,MIR99AHG,HGNC:1274,,
50376,C21ORF35,MIR99AHG,HGNC:1274,,
50377,DILA1,MIR99AHG,HGNC:1274,,
50378,FLJ38295,MIR99AHG,HGNC:1274,,
50379,LINC00478,MIR99AHG,HGNC:1274,,
50380,MONC,MIR99AHG,HGNC:1274,,
106056,MIR99AHG,MIR99AHG,ENSG00000215386,,
156548,C21ORF34,MIR99AHG,ENSG00000215386,,
156549,C21ORF35,MIR99AHG,ENSG00000215386,,


## <a id='toc1_8_'></a>[Add Gene Family Symbol label](#toc0_)

### <a id='toc1_8_1_'></a>[Import the Gene Family Root symbols](#toc0_)

This step uses the table generated in symbol_capture_generation.ipynb.

In [422]:
hgnc_gene_group_root_df = pd.read_hdf(
    "../output/hgnc_gene_group_root_df.h5", key="df"
)
hgnc_gene_group_root_df

Unnamed: 0,HGNC ID,Approved symbol,Gene group ID,abbreviation
2,HGNC:24086,A1CF,725,RBM
12,HGNC:13666,AAAS,1051,NUP
13,HGNC:13666,AAAS,362,WDR
14,HGNC:21298,AACS,40,ACS
15,HGNC:17,AADAC,464,LIP
...,...,...,...,...
31395,HGNC:25820,ZYG11B,6,ZYG11
31396,HGNC:25820,ZYG11B,1492,ARMH
31399,HGNC:29027,ZZEF1,91,ZZZ
31401,HGNC:24523,ZZZ3,91,ZZZ


In [423]:
hgnc_gene_group_root_df.loc[hgnc_gene_group_root_df["abbreviation"] == "PCDH"]

Unnamed: 0,HGNC ID,Approved symbol,Gene group ID,abbreviation


In [424]:
gene_group_capture_df = add_relationship_and_source_if_prefix(
    flj_clone_capture_df,
    hgnc_gene_group_root_df,
    ["gene_symbol", "concept_id"],
    ["Approved symbol", "HGNC ID"],
    "symbol",
    "abbreviation",
    "Prefix Gene Group Symbol",
    "HGNC",
    flj_clone_capture_df,
    "_upper"
)
gene_group_capture_df

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
948149,TRNAL49,TRL-CAA1-1,HGNC:38583,Prefix Gene Group Symbol,HGNC
948165,SNORD3P2,SNORD3F,HGNC:52239,Prefix Gene Group Symbol,HGNC
948199,SCGB1B3,SCGB1B3P,HGNC:20943,Prefix Gene Group Symbol,HGNC
948212,OR4M2,OR4M2B,HGNC:55109,Prefix Gene Group Symbol,HGNC
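
The prefix-based steps use `add_relationship_and_source_if_prefix`, also defined in the Set Up section. One plausible reading of the call above is: pair each record with the gene group abbreviations registered for the same gene, then label the record when its upper-cased symbol starts with one of those abbreviations. The sketch below illustrates that reading on invented toy data. The function name `label_if_prefix_match`, the case-insensitive comparison, and the in-place fill are assumptions for this example and may not match the notebook's helper.

```python
import pandas as pd


def label_if_prefix_match(records, lookup, left_on, right_on,
                          symbol_col, prefix_col, relationship, source):
    """Hypothetical, simplified stand-in; not the notebook's actual helper."""
    # Pair each record with the prefixes registered for the same gene.
    pairs = records.reset_index().merge(
        lookup[right_on + [prefix_col]].drop_duplicates(),
        left_on=left_on, right_on=right_on, how="inner",
    )
    # Case-insensitive test: does the symbol start with the prefix?
    is_hit = pd.Series(
        [
            str(sym).upper().startswith(str(pre).upper())
            for sym, pre in zip(pairs[symbol_col], pairs[prefix_col])
        ],
        index=pairs.index,
        dtype=bool,
    )
    hit_rows = set(pairs.loc[is_hit, "index"])
    out = records.copy()
    # Label hit rows that do not carry a relationship yet.
    fill = out.index.isin(hit_rows) & out["relationship"].isna().to_numpy()
    out.loc[fill, "relationship"] = relationship
    out.loc[fill, "source"] = source
    return out


# Invented toy data: gene XYZ1 belongs to a group whose root abbreviation is XYZ.
toy_records = pd.DataFrame({
    "symbol": ["XYZ1", "ABC9", "xyz1b"],
    "gene_symbol": ["XYZ1", "XYZ1", "XYZ1"],
    "concept_id": ["TOY:1", "TOY:1", "TOY:1"],
    "relationship": [None, None, None],
    "source": [None, None, None],
})
toy_roots = pd.DataFrame({
    "HGNC ID": ["TOY:1"],
    "Approved symbol": ["XYZ1"],
    "abbreviation": ["XYZ"],
})

prefix_labeled = label_if_prefix_match(
    toy_records,
    toy_roots,
    ["gene_symbol", "concept_id"],
    ["Approved symbol", "HGNC ID"],
    "symbol",
    "abbreviation",
    "Prefix Gene Group Symbol",
    "HGNC",
)
print(prefix_labeled)
```

Restricting the comparison to abbreviations of the same gene means a symbol is only labeled when the prefix belongs to a group that gene is actually a member of, not whenever it happens to start with any known abbreviation.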


In [425]:
gene_group_capture_df.loc[gene_group_capture_df["gene_symbol"] == "ABTB3"]

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
12848,ABTB3,ABTB3,HGNC:23844,,
58080,ABTB2B,ABTB3,HGNC:23844,,
58081,BTBD11,ABTB3,HGNC:23844,,
58082,FLJ33957,ABTB3,HGNC:23844,,
112704,ABTB3,ABTB3,ENSG00000151136,Ortholog Symbol,Ensembl
164674,ABTB2B,ABTB3,ENSG00000151136,,
164675,BTBD11,ABTB3,ENSG00000151136,Ortholog Symbol,Ensembl
164676,FLJ33957,ABTB3,ENSG00000151136,,
231270,ABTB3,ABTB3,GENE ID:121551,,
279107,ABTB2B,ABTB3,GENE ID:121551,,


In [426]:
hgnc_gene_group_root_df.loc[hgnc_gene_group_root_df["Approved symbol"] == "ABTB3"]

Unnamed: 0,HGNC ID,Approved symbol,Gene group ID,abbreviation
144,HGNC:23844,ABTB3,861,BTBD
145,HGNC:23844,ABTB3,403,ANKRD


## <a id='toc1_9_'></a>[Add Disease Symbol label](#toc0_)

### <a id='toc1_9_1_'></a>[Import the Disease root symbols](#toc0_)

This step uses the table generated in symbol_capture_generation.ipynb.

In [427]:
gene2disease_df = pd.read_hdf(
    "../output/gene2disease_df.h5", key="df"
)
gene2disease_df

Unnamed: 0,gene_MIM_number,phenotype_MIM_number,Prefix,pheno_title,pheno_symbol,Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
2,614984,204750,Number Sign,2-AMINOADIPIC 2-OXOADIPIC ACIDURIA,AMOXAD,GENE ID:55526,DHTKD1,ENSG00000181192
3,614984,204750,Number Sign,ALPHA-AMINOADIPIC AND ALPHA-KETOADIPIC ACIDURIA,AAKAD,GENE ID:55526,DHTKD1,ENSG00000181192
4,614984,204750,Number Sign,2-AMINOADIPIC 2-OXOADIPIC ACIDURIA,AMOXAD,GENE ID:55526,DHTKD1,ENSG00000181192
5,600301,610006,Number Sign,SHORT/BRANCHED-CHAIN ACYL-CoA DEHYDROGENASE DE...,SBCADD,GENE ID:36,ACADSB,ENSG00000196177
6,609577,273750,Number Sign,THREE M SYNDROME 1,3M1,GENE ID:9820,CUL7,ENSG00000044090
...,...,...,...,...,...,...,...,...
17642,606636,606579,Number Sign,VITILIGO-ASSOCIATED MULTIPLE AUTOIMMUNE DISEAS...,VAMAS1,GENE ID:22861,NLRP1,ENSG00000091592
17643,606636,606579,Number Sign,VITILIGO,VTLG,GENE ID:22861,NLRP1,ENSG00000091592
17644,606636,606579,Number Sign,"SYSTEMIC LUPUS ERYTHEMATOSUS, VITILIGO-RELATED",SLEV1,GENE ID:22861,NLRP1,ENSG00000091592
17646,600571,616806,Number Sign,WILMS TUMOR 6,WT6,GENE ID:5978,REST,ENSG00000084093


First pass: match disease symbols to gene records by Ensembl gene ID (ENSG).

In [428]:
disease_capture_ensg_df = add_relationship_and_source_if_prefix(
    gene_group_capture_df,
    gene2disease_df,
    ["gene_symbol", "concept_id"],
    ["Approved Gene Symbol (HGNC)", "Ensembl Gene ID (Ensembl)"],
    "symbol",
    "pheno_symbol",
    "Prefix Disease Symbol",
    "OMIM",
    gene_group_capture_df,
    "_upper"
)
disease_capture_ensg_df

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
1006908,NIID,NOTCH2NLC,ENSG00000286219,Prefix Disease Symbol,OMIM
1006913,OPDM3,NOTCH2NLC,ENSG00000286219,Prefix Disease Symbol,OMIM
1006948,OPML1,NUTM2B-AS1,ENSG00000225484,Prefix Disease Symbol,OMIM
1007028,CFZS2,MYMX,ENSG00000262179,Prefix Disease Symbol,OMIM


Second pass: match by NCBI gene ID, starting from the table produced by the Ensembl pass.

In [429]:
disease_capture_ncbi_df = add_relationship_and_source_if_prefix(
    disease_capture_ensg_df,
    gene2disease_df,
    ["gene_symbol", "concept_id"],
    ["Approved Gene Symbol (HGNC)", "Entrez Gene ID (NCBI)"],
    "symbol",
    "pheno_symbol",
    "Prefix Disease Symbol",
    "OMIM",
    disease_capture_ensg_df,
    "_upper"
)
disease_capture_ncbi_df

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
0,MT-TF,MT-TF,HGNC:7481,,
1,MT-RNR1,MT-RNR1,HGNC:7470,,
2,MT-TV,MT-TV,HGNC:7500,,
3,MT-RNR2,MT-RNR2,HGNC:7471,,
4,MT-TL1,MT-TL1,HGNC:7490,,
...,...,...,...,...,...
1049642,NIID,NOTCH2NLC,GENE ID:100996717,Prefix Disease Symbol,OMIM
1049647,OPDM3,NOTCH2NLC,GENE ID:100996717,Prefix Disease Symbol,OMIM
1049684,OPML1,NUTM2B-AS1,GENE ID:101060691,Prefix Disease Symbol,OMIM
1049790,CFZS2,MYMX,GENE ID:101929726,Prefix Disease Symbol,OMIM
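
A quick tally of rows per relationship label is one way to check that each labeling pass added what was expected. The sketch below runs on an invented toy table; `toy_df` is a hypothetical stand-in for `disease_capture_ncbi_df`, and its rows are not taken from the real output.

```python
import pandas as pd

# Toy stand-in for disease_capture_ncbi_df; the rows are invented for illustration.
toy_df = pd.DataFrame({
    "symbol": ["DIS1", "DIS2", "GENE1", "ALIAS1"],
    "relationship": ["Prefix Disease Symbol", "Prefix Disease Symbol", None, None],
    "source": ["OMIM", "OMIM", None, None],
})

# dropna=False keeps the unlabeled rows in the tally.
label_counts = toy_df["relationship"].value_counts(dropna=False)
unlabeled = int(toy_df["relationship"].isna().sum())
print(label_counts)
print("unlabeled rows:", unlabeled)
```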


In [430]:
disease_capture_ncbi_df.loc[disease_capture_ncbi_df["symbol"] == "ASP"]

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
51016,ASP,ATG5,HGNC:589,,
62839,ASP,ASIP,HGNC:745,,
73662,ASP,ASPA,HGNC:756,,
78175,ASP,A1CF,HGNC:24086,,
81337,ASP,ASPM,HGNC:19048,,
85088,ASP,TMPRSS11D,HGNC:24059,,
88675,ASP,ROPN1L,HGNC:24060,,
157199,ASP,ATG5,ENSG00000057663,,
170048,ASP,ASIP,ENSG00000101440,,
183761,ASP,ASPA,ENSG00000108381,,


In [432]:
disease_capture_ncbi_df.loc[disease_capture_ncbi_df["gene_symbol"] == "C3"]

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
27731,C3,C3,HGNC:1318,,
81662,ARMD9,C3,HGNC:1318,,
81663,C3A,C3,HGNC:1318,,
81664,C3B,C3,HGNC:1318,,
81665,CPAMD1,C3,HGNC:1318,,
131943,C3,C3,ENSG00000125730,Ortholog Symbol,Ensembl
196571,ARMD9,C3,ENSG00000125730,,
196572,C3A,C3,ENSG00000125730,,
196573,C3B,C3,ENSG00000125730,,
196574,CPAMD1,C3,ENSG00000125730,,


In [433]:
filtered_df = disease_capture_ncbi_df[
    disease_capture_ncbi_df["symbol"].str.startswith("FWP", na=False)
]

In [434]:
filtered_df

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
61957,FWP007,A2M,HGNC:7,,
168923,FWP007,A2M,ENSG00000175899,,
283042,FWP007,A2M,GENE ID:2,,
395700,FWP010,CD2BP2,HGNC:1656,,
396231,FWP006,MORF4L1,HGNC:16989,,
402655,FWP009,OCEL1,HGNC:26221,,
404548,FWP005,ADAT3,HGNC:25151,,
412313,FWP004,RPL13AP25,HGNC:36981,,
426581,FWP010,CD2BP2,ENSG00000169217,,
427115,FWP006,MORF4L1,ENSG00000185787,,


In [435]:
disease_capture_ncbi_df.to_csv('../output/disease_capture_ncbi_df.csv', index=False)

In [436]:
edges_df = disease_capture_ncbi_df[disease_capture_ncbi_df["concept_id"].notna()]

In [437]:
edges_df.to_csv('../output/edges_df.csv', index=True)
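
Since the stated goal of this notebook is a table that can back a SQLite lookup tool, the sketch below shows one way such a table could be written to SQLite and queried by symbol. It is a minimal illustration under assumptions: the table name `symbol_capture`, the index name, the in-memory database, and the toy rows are all hypothetical and not part of the notebook.

```python
import sqlite3

import pandas as pd

# Invented toy rows with the same columns as the capture table.
toy_capture_df = pd.DataFrame({
    "symbol": ["GENE1", "OLD1"],
    "gene_symbol": ["GENE1", "GENE1"],
    "concept_id": ["TOY:1", "TOY:1"],
    "relationship": [None, "HGNC Previous Symbol"],
    "source": [None, "HGNC"],
})

# ":memory:" keeps the example self-contained; a file path would persist the database.
conn = sqlite3.connect(":memory:")
toy_capture_df.to_sql("symbol_capture", conn, if_exists="replace", index=False)

# An index on the symbol column speeds up lookups by symbol.
conn.execute("CREATE INDEX idx_symbol_capture_symbol ON symbol_capture (symbol)")

# Parameterized query: which gene concept does a given symbol point to?
hits = pd.read_sql_query(
    "SELECT gene_symbol, concept_id, relationship, source "
    "FROM symbol_capture WHERE symbol = ?",
    conn,
    params=("OLD1",),
)
conn.close()
print(hits)
```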

In [438]:
disease_capture_ncbi_df = disease_capture_ncbi_df.sort_values(by='gene_symbol')
disease_capture_ncbi_df.head(50)

Unnamed: 0,symbol,gene_symbol,concept_id,relationship,source
124249,5S_rRNA,5S_rRNA,ENSG00000277488,Ortholog Symbol,Ensembl
99091,5S_rRNA,5S_rRNA,ENSG00000278457,Ortholog Symbol,Ensembl
37,5S_rRNA,5S_rRNA,,,
117096,5S_rRNA,5S_rRNA,ENSG00000278779,,
117097,5S_rRNA,5S_rRNA,ENSG00000277395,,
112706,5S_rRNA,5S_rRNA,ENSG00000273928,,
506494,5S_rRNA,5S_rRNA,ENSG00000277488,,
136323,5S_rRNA,5S_rRNA,ENSG00000276861,Ortholog Symbol,Ensembl
511937,5S_rRNA,5S_rRNA,ENSG00000276861,,
123395,5S_rRNA,5S_rRNA,ENSG00000275999,,
