**Table of contents**<a id='toc0_'></a>    
- [ENSG](#toc1_)    
    - [How many total unique gene records are there in Ensembl](#toc1_1_1_)    
    - [Identify alias-alias collision symbols](#toc1_1_2_)    
    - [Create a histogram displaying how frequent the number of gene records sharing an alias is](#toc1_1_3_)    
- [HGNC](#toc2_)    
    - [How many total unique gene records are there in HGNC](#toc2_1_1_)    
    - [Identify alias-alias collision symbols](#toc2_1_2_)    
    - [Create a histogram displaying how frequent the number of gene records sharing an alias is](#toc2_1_3_)    
- [NCBI Info](#toc3_)    
    - [How many total unique gene records are there in NCBI Gene](#toc3_1_1_)    
    - [Identify alias-alias collision symbols](#toc3_1_2_)    
    - [Create a histogram displaying how frequent the number of gene records sharing an alias is](#toc3_1_3_)    
- [Merge to create Alias-Alias Collision Table - On Primary Gene Symbol](#toc4_)    
- [Merge to create Alias-Alias Collision Table - On Alias Symbol](#toc5_)    
- [How many unique primary gene symbols are there?](#toc6_)    
  - [Per Source](#toc6_1_)    
  - [All sources](#toc6_2_)    
    - [How many symbols appear in all sources?](#toc6_2_1_)    
    - [How many unique symbols are found between all sources?](#toc6_2_2_)    
- [How many unique aliases are there?](#toc7_)    
  - [Per Source](#toc7_1_)    
  - [All sources](#toc7_2_)    
    - [How many aliases appear in all sources?](#toc7_2_1_)    
    - [How many unique aliases are found between all sources?](#toc7_2_2_)    
- [How many gene records have an alias that is shared?](#toc8_)    
  - [Per Source](#toc8_1_)    
  - [All Sources](#toc8_2_)    
    - [How many gene records have at least one shared alias in all sources?](#toc8_2_1_)    
    - [How many unique gene records that have at least one shared alias are found between all sources?](#toc8_2_2_)    
- [How many alias symbols are being shared?](#toc9_)    
  - [Per Source](#toc9_1_)    
  - [All Sources](#toc9_2_)    
    - [How many aliases are shared in all sources?](#toc9_2_1_)    
    - [How many unique shared aliases are found between all sources?](#toc9_2_2_)    
- [How many gene concept-alias relationships are there?](#toc10_)    
  - [Per Source](#toc10_1_)    
  - [All Sources](#toc10_2_)    
    - [How many unique gene-alias pairs are found between all sources?](#toc10_2_1_)    
      - [Remove duplicate concept-alias pairs](#toc10_2_1_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

In [2]:
def create_aa_collision_df(
    subset_genes_xxxx_df: pd.DataFrame, merged_alias_xxxx_df: pd.DataFrame, source: str
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Create dfs of alias-alias collision symbols and save them as csv files

    :param subset_genes_xxxx_df: Processed df of gene records
    :param merged_alias_xxxx_df: Df of gene records with each record's aliases merged into one string
    :param source: Representation of the source of the gene records
    :return: The first rows of the input df, of the collision df organized by gene,
        and of the collision df organized by collision symbol
    """

    #Create df with genes that have an alias that can be found in another gene's alias set
    aa_collision_gene_xxxx_df = subset_genes_xxxx_df.copy()
    aa_collision_gene_xxxx_df["alias_duplicates"] = aa_collision_gene_xxxx_df.duplicated(
        subset="alias_symbol", keep=False
    )
    aa_collision_gene_xxxx_df = aa_collision_gene_xxxx_df[aa_collision_gene_xxxx_df["alias_duplicates"]]
    aa_collision_gene_xxxx_df = aa_collision_gene_xxxx_df.rename(
        columns={"alias_symbol": "collision"})
    aa_collision_gene_xxxx_df = aa_collision_gene_xxxx_df.drop(["alias_duplicates"], axis=1)
    aa_collision_gene_xxxx_df = aa_collision_gene_xxxx_df.sort_values("collision")

    #Add a source tag for future merging efforts
    aa_collision_gene_xxxx_df["source"] = str(source)

    #Attach the merged alias string of each record to its collision rows
    aa_collision_gene_xxxx_df = pd.merge(
        aa_collision_gene_xxxx_df, merged_alias_xxxx_df,
        on=["ENSG_ID", "gene_symbol", "HGNC_ID"], how="left"
    )
    aa_collision_gene_xxxx_df = aa_collision_gene_xxxx_df[
        ["gene_symbol", "alias_symbol", "ENSG_ID", "HGNC_ID", "NCBI_ID", "collision", "source"]
    ]

    #Lower-case the source in file names so they match the paths read back later in the notebook
    source_tag = str(source).lower()

    #Convert the df into a csv
    aa_collision_gene_xxxx_df.to_csv(f'../created_files/aa_collision_gene_{source_tag}_df.csv', index=True)

    #Create a secondary collision df that prioritizes the collision symbol
    aa_collision_alias_xxxx_df = aa_collision_gene_xxxx_df[
        ["collision", "gene_symbol", "ENSG_ID", "source"]
    ]
    aa_collision_alias_xxxx_df = aa_collision_alias_xxxx_df.map(str)
    aa_collision_alias_xxxx_df = (
        aa_collision_alias_xxxx_df.groupby("collision")
        .agg({"ENSG_ID": ", ".join, "gene_symbol": ", ".join, "source": "first"})
        .reset_index()
    )

    #Convert the df into a csv and save
    aa_collision_alias_xxxx_df.to_csv(f'../created_files/aa_collision_alias_{source_tag}_df.csv', index=True)

    return subset_genes_xxxx_df.head(), aa_collision_gene_xxxx_df.head(), aa_collision_alias_xxxx_df.head()
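
The collision step relies on `duplicated(..., keep=False)`, which flags every row that shares an alias rather than only the later repeats. A minimal sketch on a toy table (the five rows are illustrative, borrowed from symbols that appear in the outputs of this notebook, and `toy_df`, `is_shared`, and `by_alias` are names introduced here):

```python
import pandas as pd

# Toy long-format table: one row per (gene, alias) pair; illustrative rows only
toy_df = pd.DataFrame(
    {
        "gene_symbol": ["KLRG1", "KLRG1", "SLC25A5", "SLC25A5", "GNAI3"],
        "alias_symbol": ["2F1", "MAFA", "2F1", "ANT2", "87U6"],
    }
)

# keep=False marks all rows whose alias occurs more than once, not just the repeats
is_shared = toy_df.duplicated(subset="alias_symbol", keep=False)
collisions = toy_df[is_shared].rename(columns={"alias_symbol": "collision"})

# One row per collision symbol, listing the genes that share it
by_alias = (
    collisions.groupby("collision")["gene_symbol"].agg(", ".join).reset_index()
)
print(by_alias)
```

With `keep="first"` (the pandas default) only the second `2F1` row would be flagged, and the first gene carrying the alias would be dropped from the collision table.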

In [3]:
def create_aa_collision_histogram(aa_collision_gene_xxxx_df: pd.DataFrame, source: str, xxxx_alias_count: int):
    """Create a histogram of the frequencies at which aliases are shared

    :param aa_collision_gene_xxxx_df: A df of alias-alias collisions organized by primary gene symbol
    :param source: Representation of the source of the gene records
    :param xxxx_alias_count: Number of aliases total in the source
    :return: A plotly bar chart of the percentage of aliases that are shared between 2 genes, 3 genes, and so on
    """

    #Lower-case the source in file names so they match the paths read back later in the notebook
    source_tag = str(source).lower()

    #Count the number of times each shared alias is used
    aa_collision_xxxx_count_df = aa_collision_gene_xxxx_df.pivot_table(
        index=["collision"], aggfunc="size"
    )
    aa_collision_xxxx_count_df = aa_collision_xxxx_count_df.reset_index()
    aa_collision_xxxx_count_df.rename(columns={0: "num_gene_records"}, inplace=True)
    aa_collision_xxxx_count_df = aa_collision_xxxx_count_df.sort_values(
        "num_gene_records", ascending=False)

    #Convert to csv
    aa_collision_xxxx_count_df.to_csv(f'../created_files/aa_collision_{source_tag}_count_df.csv', index=True)

    #Count how many aliases are shared by exactly 2 gene records, 3 gene records, and so on
    aa_collision_xxxx_distribution_df = aa_collision_xxxx_count_df.pivot_table(
        index=["num_gene_records"], aggfunc="size"
    )
    aa_collision_xxxx_distribution_df = aa_collision_xxxx_distribution_df.reset_index()
    aa_collision_xxxx_distribution_df.rename(columns={0: "num_collision_symbol"}, inplace=True)
    aa_collision_xxxx_distribution_df["percent_collision_symbol"] = (
        aa_collision_xxxx_distribution_df["num_collision_symbol"] / xxxx_alias_count
    ) * 100

    #Convert to csv
    aa_collision_xxxx_distribution_df.to_csv(f'../created_files/aa_collision_{source_tag}_distribution_df.csv', index=True)

    #Create histogram df
    xxxx_alias_count_histogram_df = aa_collision_xxxx_distribution_df.drop(
        "num_collision_symbol", axis=1)

    #Convert to csv
    xxxx_alias_count_histogram_df.to_csv(f'../created_files/{source_tag}_alias_count_histogram_df.csv', index=True)

    return px.bar(xxxx_alias_count_histogram_df, x="num_gene_records", y="percent_collision_symbol")
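
The histogram function counts twice: first how many gene records use each shared alias, then how many aliases fall into each of those group sizes. A minimal sketch of the same idea using `value_counts` instead of `pivot_table` (the toy rows and `toy_alias_count` are assumptions made for illustration, not values from the data):

```python
import pandas as pd

# Hypothetical collision rows: one row per (collision symbol, gene record)
toy_collisions = pd.DataFrame(
    {"collision": ["A1", "A1", "A1", "2F1", "2F1", "ZRC1", "ZRC1"]}
)
toy_alias_count = 20  # assumed total number of unique aliases in the source

# Step 1: how many gene records use each shared alias
per_alias = toy_collisions["collision"].value_counts().rename("num_gene_records")

# Step 2: how many aliases are shared by exactly 2, 3, ... gene records
distribution = (
    per_alias.value_counts()
    .sort_index()
    .rename_axis("num_gene_records")
    .reset_index(name="num_collision_symbol")
)

# Percentages are relative to all unique aliases, not only the shared ones
distribution["percent_collision_symbol"] = (
    distribution["num_collision_symbol"] / toy_alias_count * 100
)
print(distribution)
```

Because the denominator is the count of all unique aliases, the bars add up to the overall share of aliases that collide, not to 100%.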



# <a id='toc1_'></a>[ENSG](#toc0_)

In [4]:
mini_ensg_df = pd.read_csv(
    "../created_files/mini_ensg_df.csv",
    dtype={"HGNC_ID": pd.Int64Dtype(), "NCBI_ID": pd.Int64Dtype()},
)
mini_ensg_df

Unnamed: 0.1,Unnamed: 0,ENSG_ID,gene_symbol,alias_symbol,HGNC_ID,NCBI_ID
0,0,ENSG00000210049,MT-TF,MTTF,7481,
1,1,ENSG00000210049,MT-TF,TRNF,7481,
2,2,ENSG00000211459,MT-RNR1,12S,7470,
3,3,ENSG00000211459,MT-RNR1,MOTS-C,7470,
4,4,ENSG00000211459,MT-RNR1,MTRNR1,7470,
...,...,...,...,...,...,...
117135,117135,ENSG00000200033,RNU6-403P,,47366,
117136,117136,ENSG00000228437,LINC02474,LNCSLCC1,53417,
117137,117137,ENSG00000228437,LINC02474,RP11-400N13.2,53417,
117138,117138,ENSG00000229463,LYST-AS1,LYST-IT2,41320,


In [5]:
subset_genes_ensg_df = pd.read_csv(
    "../created_files/subset_genes_ensg_df.csv", index_col=[0])
subset_genes_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,alias_symbol,HGNC_ID,NCBI_ID
0,ENSG00000210049,MT-TF,MTTF,7481.0,
1,ENSG00000210049,MT-TF,TRNF,7481.0,
2,ENSG00000211459,MT-RNR1,12S,7470.0,
3,ENSG00000211459,MT-RNR1,MOTS-C,7470.0,
4,ENSG00000211459,MT-RNR1,MTRNR1,7470.0,
...,...,...,...,...,...
117133,ENSG00000232679,LINC01705,ERLR,52493.0,105372950.0
117134,ENSG00000232679,LINC01705,RP11-400N13.3,52493.0,105372950.0
117136,ENSG00000228437,LINC02474,LNCSLCC1,53417.0,
117137,ENSG00000228437,LINC02474,RP11-400N13.2,53417.0,
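
The IDs in this table display as `7481.0` because a column with missing values is read as float by default, whereas `mini_ensg_df` was read with nullable integer dtypes. A minimal sketch of the difference, using two rows shaped like the ones above and an in-memory CSV (`csv_text`, `default_df`, and `nullable_df` are names introduced here):

```python
import io

import pandas as pd

# Two rows shaped like the table above; the second NCBI_ID is missing
csv_text = (
    "ENSG_ID,HGNC_ID,NCBI_ID\n"
    "ENSG00000232679,52493,105372950\n"
    "ENSG00000228437,53417,\n"
)

# Default parsing: the column with a missing value falls back to float64
default_df = pd.read_csv(io.StringIO(csv_text))

# Nullable integer dtype keeps whole-number IDs and represents the gap as <NA>
nullable_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"HGNC_ID": pd.Int64Dtype(), "NCBI_ID": pd.Int64Dtype()},
)

print(default_df.dtypes)
print(nullable_df.dtypes)
```

Reading the two tables with matching dtypes would also keep the ID columns directly comparable between data frames.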


In [6]:
merged_alias_ensg_df = pd.read_csv(
    "../created_files/merged_alias_ensg_df.csv", index_col=[0])
merged_alias_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858.0,"T245,TM4SF6,TSPAN-6"
1,ENSG00000000005,TNMD,17757.0,"BRICD4,CHM1L,MYODULIN,TEM,TENDIN"
2,ENSG00000000419,DPM1,3005.0,"CDGIE,MPDS"
3,ENSG00000000457,SCYL3,19285.0,"PACE-1,PACE1"
4,ENSG00000000460,FIRRM,25565.0,"APOLO1,C1ORF112,FLIP,FLJ10706,MEICA1"
...,...,...,...,...
70609,ENSG00000293596,,,
70610,ENSG00000293597,LINC00970,48730.0,
70611,ENSG00000293599,,,
70612,ENSG00000293600,,,


### <a id='toc1_1_1_'></a>[How many total unique gene records are there in Ensembl](#toc0_)

By ENSG ID

In [7]:
ensg_gene_id_set = set(mini_ensg_df["ENSG_ID"])
len(ensg_gene_id_set)

70611

### <a id='toc1_1_2_'></a>[Identify alias-alias collision symbols](#toc0_)

In [8]:
create_aa_collision_df(subset_genes_ensg_df, merged_alias_ensg_df, source="ENSG")

(           ENSG_ID gene_symbol alias_symbol  HGNC_ID  NCBI_ID
 0  ENSG00000210049       MT-TF         MTTF   7481.0      NaN
 1  ENSG00000210049       MT-TF         TRNF   7481.0      NaN
 2  ENSG00000211459     MT-RNR1          12S   7470.0      NaN
 3  ENSG00000211459     MT-RNR1       MOTS-C   7470.0      NaN
 4  ENSG00000211459     MT-RNR1       MTRNR1   7470.0      NaN,
   gene_symbol                                       alias_symbol  \
 0       KLRG1                            2F1,CLEC15A,MAFA,MAFA-L   
 1     SLC25A5                                     2F1,ANT2,T2,T3   
 2      S100A8        60B8AG,CAGA,CFAG,CGLA,MRP-8,MRP8,P8,S100-A8   
 3      S100A9  60B8AG,CAGB,CFAG,CGLB,LIAG,MAC387,MIF,MRP-14,M...   
 4       GNAI3                                               87U6   
 
            ENSG_ID  HGNC_ID  NCBI_ID collision source  
 0  ENSG00000139187   6380.0  10219.0       2F1   ENSG  
 1  ENSG00000005022  10991.0    292.0       2F1   ENSG  
 2  ENSG00000143546  10498.0   627

In [9]:
subset_genes_ensg_df = pd.read_csv(
    "../created_files/subset_genes_ensg_df.csv", index_col=[0])
subset_genes_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,alias_symbol,HGNC_ID,NCBI_ID
0,ENSG00000210049,MT-TF,MTTF,7481.0,
1,ENSG00000210049,MT-TF,TRNF,7481.0,
2,ENSG00000211459,MT-RNR1,12S,7470.0,
3,ENSG00000211459,MT-RNR1,MOTS-C,7470.0,
4,ENSG00000211459,MT-RNR1,MTRNR1,7470.0,
...,...,...,...,...,...
117133,ENSG00000232679,LINC01705,ERLR,52493.0,105372950.0
117134,ENSG00000232679,LINC01705,RP11-400N13.3,52493.0,105372950.0
117136,ENSG00000228437,LINC02474,LNCSLCC1,53417.0,
117137,ENSG00000228437,LINC02474,RP11-400N13.2,53417.0,


In [10]:
subset_genes_ensg_df.loc[subset_genes_ensg_df["gene_symbol"] == "NPY6R"]

Unnamed: 0,ENSG_ID,gene_symbol,alias_symbol,HGNC_ID,NCBI_ID
112937,ENSG00000226306,NPY6R,NPY1RL,7959.0,
112938,ENSG00000226306,NPY6R,NPY6RP,7959.0,
112939,ENSG00000226306,NPY6R,PP2,7959.0,
112944,ENSG00000293504,NPY6R,Y2B,,4888.0


In [11]:
aa_collision_gene_ensg_df = pd.read_csv(
    "../created_files/aa_collision_gene_ensg_df.csv", index_col=[0])
aa_collision_gene_ensg_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID,HGNC_ID,NCBI_ID,collision,source
0,KLRG1,"2F1,CLEC15A,MAFA,MAFA-L",ENSG00000139187,6380.0,10219.0,2F1,ENSG
1,SLC25A5,"2F1,ANT2,T2,T3",ENSG00000005022,10991.0,292.0,2F1,ENSG
2,S100A8,"60B8AG,CAGA,CFAG,CGLA,MRP-8,MRP8,P8,S100-A8",ENSG00000143546,10498.0,6279.0,60B8AG,ENSG
3,S100A9,"60B8AG,CAGB,CFAG,CGLB,LIAG,MAC387,MIF,MRP-14,M...",ENSG00000163220,10499.0,6280.0,60B8AG,ENSG
4,GNAI3,87U6,ENSG00000065135,4387.0,2773.0,87U6,ENSG
...,...,...,...,...,...,...,...
3674,SLC30A10,"DKFZP547M236,ZNT-10,ZNT10,ZNT8,ZRC1",ENSG00000196660,25355.0,55532.0,ZNT8,ENSG
3675,SLC30A10,"DKFZP547M236,ZNT-10,ZNT10,ZNT8,ZRC1",ENSG00000196660,25355.0,55532.0,ZRC1,ENSG
3676,SLC30A1,"ZNT1,ZRC1",ENSG00000170385,11012.0,7779.0,ZRC1,ENSG
3677,ZYG11B,"FLJ13456,ZYG11",ENSG00000162378,25820.0,79699.0,ZYG11,ENSG


In [12]:
aa_collision_alias_ensg_df = pd.read_csv(
    "../created_files/aa_collision_alias_ensg_df.csv", index_col=[0])
aa_collision_alias_ensg_df

Unnamed: 0,collision,ENSG_ID,gene_symbol,source
0,2F1,"ENSG00000139187, ENSG00000005022","KLRG1, SLC25A5",ENSG
1,60B8AG,"ENSG00000143546, ENSG00000163220","S100A8, S100A9",ENSG
2,87U6,"ENSG00000065135, ENSG00000206832","GNAI3, RNU6V",ENSG
3,9G8,"ENSG00000115875, ENSG00000164609","SRSF7, SLU7",ENSG
4,A1,"ENSG00000163918, ENSG00000033627, ENSG00000049...","RFC4, ATP6V0A1, RFC2, RFC1",ENSG
...,...,...,...,...
1612,ZIP4,"ENSG00000120498, ENSG00000285243","TEX11, SLC39A4",ENSG
1613,ZNF422,"ENSG00000165512, ENSG00000172943","ZNF22, PHF8",ENSG
1614,ZNT8,"ENSG00000164756, ENSG00000196660","SLC30A8, SLC30A10",ENSG
1615,ZRC1,"ENSG00000196660, ENSG00000170385","SLC30A10, SLC30A1",ENSG


### <a id='toc1_1_3_'></a>[Create a histogram displaying how frequent the number of gene records sharing an alias is](#toc0_)

In [13]:
ensg_alias_symbol_set = set(subset_genes_ensg_df["alias_symbol"])
ensg_alias_count = len(ensg_alias_symbol_set)

In [14]:
create_aa_collision_histogram(aa_collision_gene_ensg_df, "ENSG", ensg_alias_count)

In [15]:
aa_collision_ensg_count_df = pd.read_csv(
    "../created_files/aa_collision_ensg_count_df.csv", index_col=[0])

In [16]:
aa_collision_ensg_distribution_df = pd.read_csv(
    "../created_files/aa_collision_ensg_distribution_df.csv", index_col=[0])

In [17]:
ensg_alias_count_histogram_df = pd.read_csv(
    "../created_files/ensg_alias_count_histogram_df.csv", index_col=[0])

# <a id='toc2_'></a>[HGNC](#toc0_)

In [18]:
mini_hgnc_df = pd.read_csv(
    "../created_files/mini_hgnc_df.csv",
    dtype={"HGNC_ID": pd.Int64Dtype(), "NCBI_ID": pd.Int64Dtype()},
)
mini_hgnc_df

Unnamed: 0.1,Unnamed: 0,HGNC_ID,alias_symbol,NCBI_ID,ENSG_ID,gene_symbol
0,0,5,,1,ENSG00000121410,A1BG
1,1,37133,FLJ23569,503538,ENSG00000268895,A1BG-AS1
2,2,24086,ACF,29974,ENSG00000148584,A1CF
3,3,24086,ASP,29974,ENSG00000148584,A1CF
4,4,24086,ACF64,29974,ENSG00000148584,A1CF
...,...,...,...,...,...,...
67578,67578,29027,KIAA0399,23140,ENSG00000074755,ZZEF1
67579,67579,29027,ZZZ4,23140,ENSG00000074755,ZZEF1
67580,67580,29027,FLJ10821,23140,ENSG00000074755,ZZEF1
67581,67581,24523,DKFZP564I052,26009,ENSG00000036549,ZZZ3


In [19]:
subset_genes_hgnc_df = pd.read_csv(
    "../created_files/subset_genes_hgnc_df.csv", index_col=[0])
subset_genes_hgnc_df

Unnamed: 0,HGNC_ID,alias_symbol,NCBI_ID,ENSG_ID,gene_symbol
1,37133,FLJ23569,503538.0,ENSG00000268895,A1BG-AS1
2,24086,ACF,29974.0,ENSG00000148584,A1CF
3,24086,ASP,29974.0,ENSG00000148584,A1CF
4,24086,ACF64,29974.0,ENSG00000148584,A1CF
5,24086,ACF65,29974.0,ENSG00000148584,A1CF
...,...,...,...,...,...
67578,29027,KIAA0399,23140.0,ENSG00000074755,ZZEF1
67579,29027,ZZZ4,23140.0,ENSG00000074755,ZZEF1
67580,29027,FLJ10821,23140.0,ENSG00000074755,ZZEF1
67581,24523,DKFZP564I052,26009.0,ENSG00000036549,ZZZ3


In [20]:
merged_alias_hgnc_df = pd.read_csv(
    "../created_files/merged_alias_hgnc_df.csv", index_col=[0])
merged_alias_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,"T245,TSPAN-6"
1,ENSG00000000005,TNMD,17757,"myodulin,ChM1L,tendin,TEM,BRICD4"
2,ENSG00000000419,DPM1,3005,"MPDS,CDGIE"
3,ENSG00000000457,SCYL3,19285,"PACE-1,PACE1"
4,ENSG00000000460,FIRRM,25565,"FLJ10706,Apolo1,FLIP,MEICA1"
...,...,...,...,...
45641,,ZNF97,13173,
45642,,ZNFP1,13181,
45643,,ZPAXP,51635,ZPX1P
45644,,ZRK,13193,


### <a id='toc2_1_1_'></a>[How many total unique gene records are there in HGNC](#toc0_)

By HGNC ID

In [21]:
hgnc_gene_id_set = set(mini_hgnc_df["HGNC_ID"])
len(hgnc_gene_id_set)

45646

### <a id='toc2_1_2_'></a>[Identify alias-alias collision symbols](#toc0_)

In [22]:
create_aa_collision_df(subset_genes_hgnc_df, merged_alias_hgnc_df, source="HGNC")

(   HGNC_ID alias_symbol   NCBI_ID          ENSG_ID gene_symbol
 1    37133     FLJ23569  503538.0  ENSG00000268895    A1BG-AS1
 2    24086          ACF   29974.0  ENSG00000148584        A1CF
 3    24086          ASP   29974.0  ENSG00000148584        A1CF
 4    24086        ACF64   29974.0  ENSG00000148584        A1CF
 5    24086        ACF65   29974.0  ENSG00000148584        A1CF,
   gene_symbol                                       alias_symbol  \
 0       KLRG1                            MAFA,2F1,MAFA-L,CLEC15A   
 1     SLC25A5                                          T2,2F1,T3   
 2      S100A8                  P8,MRP8,MRP-8,60B8AG,CGLA,S100-A8   
 3      S100A9  P14,MIF,NIF,LIAG,MRP14,MAC387,60B8AG,CGLB,MRP-...   
 4       RNU6V                                          87U6,LH87   
 
            ENSG_ID  HGNC_ID  NCBI_ID collision source  
 0  ENSG00000139187     6380  10219.0       2F1   HGNC  
 1  ENSG00000005022    10991    292.0       2F1   HGNC  
 2  ENSG00000143546    10498

In [23]:
subset_genes_hgnc_df = pd.read_csv(
    "../created_files/subset_genes_hgnc_df.csv", index_col=[0])

In [24]:
subset_genes_hgnc_df.loc[subset_genes_hgnc_df["gene_symbol"] == "NPY6R"]

Unnamed: 0,HGNC_ID,alias_symbol,NCBI_ID,ENSG_ID,gene_symbol
36566,7959,PP2,4888.0,ENSG00000226306,NPY6R
36567,7959,NPY1RL,4888.0,ENSG00000226306,NPY6R
36568,7959,NPY6RP,4888.0,ENSG00000226306,NPY6R


In [25]:
aa_collision_gene_hgnc_df = pd.read_csv(
    "../created_files/aa_collision_gene_hgnc_df.csv", index_col=[0])

In [26]:
aa_collision_alias_hgnc_df = pd.read_csv(
    "../created_files/aa_collision_alias_hgnc_df.csv", index_col=[0])

### <a id='toc2_1_3_'></a>[Create a histogram displaying how frequent the number of gene records sharing an alias is](#toc0_)

In [27]:
hgnc_alias_symbol_set = set(subset_genes_hgnc_df["alias_symbol"])
hgnc_alias_count = len(hgnc_alias_symbol_set)

In [28]:
create_aa_collision_histogram(aa_collision_gene_hgnc_df, "HGNC", hgnc_alias_count)

In [29]:
aa_collision_hgnc_count_df = pd.read_csv(
    "../created_files/aa_collision_hgnc_count_df.csv", index_col=[0])

In [30]:
aa_collision_hgnc_distribution_df = pd.read_csv(
    "../created_files/aa_collision_hgnc_distribution_df.csv", index_col=[0])

In [31]:
hgnc_alias_count_histogram_df = pd.read_csv(
    "../created_files/hgnc_alias_count_histogram_df.csv", index_col=[0])

# <a id='toc3_'></a>[NCBI Info](#toc0_)

In [32]:
mini_ncbi_df = pd.read_csv(
    "../created_files/mini_ncbi_df.csv",
    dtype={"HGNC_ID": pd.Int64Dtype(), "NCBI_ID": pd.Int64Dtype()},
)
mini_ncbi_df

Unnamed: 0.1,Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,0,1,A1BG,A1B,5,ENSG00000121410
1,0,1,A1BG,ABG,5,ENSG00000121410
2,0,1,A1BG,GAB,5,ENSG00000121410
3,0,1,A1BG,HYST2477,5,ENSG00000121410
4,1,2,A2M,A2MD,7,ENSG00000175899
...,...,...,...,...,...,...
239924,193451,8923215,trnD,-,,
239925,193452,8923216,trnP,-,,
239926,193453,8923217,trnA,-,,
239927,193454,8923218,COX1,-,,


In [33]:
subset_genes_ncbi_df = pd.read_csv(
    "../created_files/subset_genes_ncbi_df.csv", index_col=[0])
subset_genes_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B,5.0,ENSG00000121410
0,1,A1BG,ABG,5.0,ENSG00000121410
0,1,A1BG,GAB,5.0,ENSG00000121410
0,1,A1BG,HYST2477,5.0,ENSG00000121410
1,2,A2M,A2MD,7.0,ENSG00000175899
...,...,...,...,...,...
190961,131840634,GLTC1,GLTC,56861.0,
193342,132532400,GABRA6-AS1,ARBAG,40248.0,
193377,133395150,LNCARGI,ARGI,56890.0,
193378,133834869,MLDHR,MP31,55481.0,


In [34]:
merged_alias_ncbi_df = pd.read_csv(
    "../created_files/merged_alias_ncbi_df.csv", index_col=[0])
merged_alias_ncbi_df


Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.



Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858.0,"T245,TM4SF6,TSPAN-6"
1,ENSG00000000005,TNMD,17757.0,"BRICD4,CHM1L,TEM"
2,ENSG00000000419,DPM1,3005.0,"CDGIE,MPDS"
3,ENSG00000000457,SCYL3,19285.0,"PACE-1,PACE1"
4,ENSG00000000460,FIRRM,25565.0,"Apolo1,C1orf112,FLIP,MEICA1"
...,...,...,...,...
193316,,trnS,,"-,-,-,-"
193317,,trnT,,"-,-"
193318,,trnV,,"-,-"
193319,,trnW,,"-,-"


### <a id='toc3_1_1_'></a>[How many total unique gene records are there in NCBI Gene](#toc0_)

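By ENSG ID (this counts the distinct Ensembl IDs cross-referenced in the NCBI Gene table, not the distinct NCBI IDs)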

In [35]:
ncbi_gene_id_set = set(mini_ncbi_df["ENSG_ID"])
len(ncbi_gene_id_set)

36803
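
The tail of `mini_ncbi_df` shows rows with an empty `ENSG_ID`, so a count built from `set(...)` can include the missing value as an element of its own. A minimal sketch, on made-up rows, of how `nunique` makes that choice explicit (`ids` and the two count variables are names introduced here):

```python
import numpy as np
import pandas as pd

# Hypothetical ID column: two distinct IDs, one repeated, plus two missing values
ids = pd.Series(
    ["ENSG00000121410", "ENSG00000175899", np.nan, "ENSG00000121410", np.nan]
)

# Count distinct values, treating "missing" as one extra category
count_with_missing = ids.nunique(dropna=False)

# Count distinct real IDs only (the pandas default)
count_without_missing = ids.nunique()

# The set-based equivalent of the second count
count_from_set = len(set(ids.dropna()))

print(count_with_missing, count_without_missing, count_from_set)
```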

### <a id='toc3_1_2_'></a>[Identify alias-alias collision symbols](#toc0_)

In [36]:
create_aa_collision_df(subset_genes_ncbi_df, merged_alias_ncbi_df, source="NCBI")

(   NCBI_ID gene_symbol alias_symbol  HGNC_ID          ENSG_ID
 0        1        A1BG          A1B      5.0  ENSG00000121410
 0        1        A1BG          ABG      5.0  ENSG00000121410
 0        1        A1BG          GAB      5.0  ENSG00000121410
 0        1        A1BG     HYST2477      5.0  ENSG00000121410
 1        2         A2M         A2MD      7.0  ENSG00000175899,
   gene_symbol                                       alias_symbol  \
 0        PTEN  10q23del,BZS,CWS1,DEC,GLM2,MHAM,MMAC1,PTEN1,PT...   
 1      BMPR1A     10q23del,ACVRLK3,ALK-3,ALK3,BMPR-1A,CD292,SKR5   
 2      ALOX15                       12-LOX,15-LOX,15-LOX-1,LOG15   
 3      ALOX12                               12-LOX,12S-LOX,LOG12   
 4      AKR1C1  2-ALPHA-HSD,20-ALPHA-HSD,C9,DD1,DD1/DD2,DDH,DD...   
 
            ENSG_ID  HGNC_ID  NCBI_ID     collision source  
 0  ENSG00000171862   9588.0     5728      10Q23DEL   NCBI  
 1  ENSG00000107779   1076.0      657      10Q23DEL   NCBI  
 2  ENSG00000161905   

In [37]:
subset_genes_ncbi_df = pd.read_csv(
    "../created_files/subset_genes_ncbi_df.csv", index_col=[0])
subset_genes_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B,5.0,ENSG00000121410
0,1,A1BG,ABG,5.0,ENSG00000121410
0,1,A1BG,GAB,5.0,ENSG00000121410
0,1,A1BG,HYST2477,5.0,ENSG00000121410
1,2,A2M,A2MD,7.0,ENSG00000175899
...,...,...,...,...,...
190961,131840634,GLTC1,GLTC,56861.0,
193342,132532400,GABRA6-AS1,ARBAG,40248.0,
193377,133395150,LNCARGI,ARGI,56890.0,
193378,133834869,MLDHR,MP31,55481.0,


In [38]:
aa_collision_gene_ncbi_df = pd.read_csv(
    "../created_files/aa_collision_gene_ncbi_df.csv", index_col=[0])
aa_collision_gene_ncbi_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID,HGNC_ID,NCBI_ID,collision,source
0,PTEN,"10q23del,BZS,CWS1,DEC,GLM2,MHAM,MMAC1,PTEN1,PT...",ENSG00000171862,9588.0,5728,10Q23DEL,NCBI
1,BMPR1A,"10q23del,ACVRLK3,ALK-3,ALK3,BMPR-1A,CD292,SKR5",ENSG00000107779,1076.0,657,10Q23DEL,NCBI
2,ALOX15,"12-LOX,15-LOX,15-LOX-1,LOG15",ENSG00000161905,433.0,246,12-LOX,NCBI
3,ALOX12,"12-LOX,12S-LOX,LOG12",ENSG00000108839,429.0,239,12-LOX,NCBI
4,AKR1C1,"2-ALPHA-HSD,20-ALPHA-HSD,C9,DD1,DD1/DD2,DDH,DD...",ENSG00000187134,384.0,1645,20-ALPHA-HSD,NCBI
...,...,...,...,...,...,...,...
8986,SLC30A10,"HMDPC,HMNDYT1,ZNT10,ZNT8,ZRC1,ZnT-10",ENSG00000196660,25355.0,55532,ZRC1,NCBI
8987,PEX13,"NALD,PBD11A,PBD11B,ZWS",ENSG00000162928,8855.0,5194,ZWS,NCBI
8988,PEX1,"HMLR1,PBD1A,PBD1B,ZWS,ZWS1",ENSG00000127980,8850.0,5189,ZWS,NCBI
8989,ZYG11B,ZYG11,ENSG00000162378,25820.0,79699,ZYG11,NCBI


In [39]:
aa_collision_alias_ncbi_df = pd.read_csv(
    "../created_files/aa_collision_alias_ncbi_df.csv", index_col=[0])

### <a id='toc3_1_3_'></a>[Create a histogram displaying how frequent the number of gene records sharing an alias is](#toc0_)

In [40]:
ncbi_alias_symbol_set = set(subset_genes_ncbi_df["alias_symbol"])
ncbi_alias_count = len(ncbi_alias_symbol_set)

In [41]:
create_aa_collision_histogram(aa_collision_gene_ncbi_df, "NCBI", ncbi_alias_count)

In [42]:
aa_collision_ncbi_count_df = pd.read_csv(
    "../created_files/aa_collision_ncbi_count_df.csv", index_col=[0])
aa_collision_ncbi_count_df

Unnamed: 0,collision,num_gene_records
3636,VH,37
1409,H4-16,14
1414,H4C14,13
1419,H4C4,13
1413,H4C13,13
...,...,...
1351,GT334,2
1352,GTB,2
1353,GTF2IRD2A,2
1354,GTK,2


In [43]:
aa_collision_ncbi_distribution_df = pd.read_csv(
    "../created_files/aa_collision_ncbi_distribution_df.csv", index_col=[0])
aa_collision_ncbi_distribution_df

Unnamed: 0,num_gene_records,num_collision_symbol,percent_collision_symbol
0,2,2961,4.305594
1,3,444,0.645621
2,4,148,0.215207
3,5,59,0.085792
4,6,31,0.045077
5,7,21,0.030536
6,8,9,0.013087
7,9,15,0.021812
8,10,2,0.002908
9,11,3,0.004362


In [44]:
ncbi_alias_count_histogram_df = pd.read_csv(
    "../created_files/ncbi_alias_count_histogram_df.csv", index_col=[0])
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,percent_collision_symbol
0,2,4.305594
1,3,0.645621
2,4,0.215207
3,5,0.085792
4,6,0.045077
5,7,0.030536
6,8,0.013087
7,9,0.021812
8,10,0.002908
9,11,0.004362


In [45]:
len(subset_genes_ensg_df)

57275

In [46]:
len(subset_genes_hgnc_df)

44542

In [47]:
len(subset_genes_ncbi_df)

74051

# <a id='toc4_'></a>[Merge to create Alias-Alias Collision Table - On Primary Gene Symbol](#toc0_)

In [48]:
merged_aa_collision_gene_df = pd.concat(
    [
        aa_collision_gene_hgnc_df[["gene_symbol", "alias_symbol", "ENSG_ID", "collision", "source"]],
        aa_collision_gene_ncbi_df[["gene_symbol", "alias_symbol", "ENSG_ID", "collision", "source"]],
        aa_collision_gene_ensg_df[["gene_symbol", "alias_symbol", "ENSG_ID", "collision", "source"]],
    ]
)
merged_aa_collision_gene_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID,collision,source
0,KLRG1,"MAFA,2F1,MAFA-L,CLEC15A",ENSG00000139187,2F1,HGNC
1,SLC25A5,"T2,2F1,T3",ENSG00000005022,2F1,HGNC
2,S100A8,"P8,MRP8,MRP-8,60B8AG,CGLA,S100-A8",ENSG00000143546,60B8AG,HGNC
3,S100A9,"P14,MIF,NIF,LIAG,MRP14,MAC387,60B8AG,CGLB,MRP-...",ENSG00000163220,60B8AG,HGNC
4,RNU6V,"87U6,LH87",ENSG00000206832,87U6,HGNC
...,...,...,...,...,...
3674,SLC30A10,"DKFZP547M236,ZNT-10,ZNT10,ZNT8,ZRC1",ENSG00000196660,ZNT8,ENSG
3675,SLC30A10,"DKFZP547M236,ZNT-10,ZNT10,ZNT8,ZRC1",ENSG00000196660,ZRC1,ENSG
3676,SLC30A1,"ZNT1,ZRC1",ENSG00000170385,ZRC1,ENSG
3677,ZYG11B,"FLJ13456,ZYG11",ENSG00000162378,ZYG11,ENSG


In [49]:
merged_aa_collision_gene_df.to_csv(
    "../created_files/merged_aa_collision_gene_df.csv", index=False
)

In [50]:
merged_aa_collision_gene_df.loc[merged_aa_collision_gene_df.collision == "ALP"]

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID,collision,source
82,ASRGL1,"FLJ22316,ALP1,ALP",ENSG00000162174,ALP,HGNC
83,ATHS,ALP,,ALP,HGNC
84,ATRNL1,"KIAA0534,FLJ45344,ALP",ENSG00000107518,ALP,HGNC
85,PDLIM3,ALP,ENSG00000154553,ALP,HGNC
86,CCL27,"ALP,ILC,CTACK,skinkine,ESkine,PESKY,CTAK",ENSG00000213927,ALP,HGNC
87,SLPI,"HUSI-I,ALK1,ALP,BLPI,HUSI,WAP4,WFDC4",ENSG00000124107,ALP,HGNC
279,ASRGL1,"ALP,ALP1,CRASH",ENSG00000162174,ALP,NCBI
280,ATRNL1,"ALP,bA338L11.1,bA454H24.1",ENSG00000107518,ALP,NCBI
281,ATHS,ALP,,ALP,NCBI
282,NAT10,"ALP,Kre33,NET43",ENSG00000135372,ALP,NCBI


# <a id='toc5_'></a>[Merge to create Alias-Alias Collision Table - On Alias Symbol](#toc0_)

In [51]:
merged_aa_collision_alias_df = pd.concat(
    [
        aa_collision_alias_hgnc_df[["collision", "gene_symbol", "ENSG_ID", "source"]],
        aa_collision_alias_ncbi_df[["collision", "gene_symbol", "ENSG_ID", "source"]],
        aa_collision_alias_ensg_df[["collision", "gene_symbol", "ENSG_ID", "source"]],
    ]
)
merged_aa_collision_alias_df

Unnamed: 0,collision,gene_symbol,ENSG_ID,source
0,2F1,"KLRG1, SLC25A5","ENSG00000139187, ENSG00000005022",HGNC
1,60B8AG,"S100A8, S100A9","ENSG00000143546, ENSG00000163220",HGNC
2,87U6,"RNU6V, GNAI3","ENSG00000206832, ENSG00000065135",HGNC
3,9G8,"SRSF7, SLU7","ENSG00000115875, ENSG00000164609",HGNC
4,A1,"ATP6V0A1, RFC1, RFC4, RFC2","ENSG00000033627, ENSG00000035928, ENSG00000163...",HGNC
...,...,...,...,...
1612,ZIP4,"TEX11, SLC39A4","ENSG00000120498, ENSG00000285243",ENSG
1613,ZNF422,"ZNF22, PHF8","ENSG00000165512, ENSG00000172943",ENSG
1614,ZNT8,"SLC30A8, SLC30A10","ENSG00000164756, ENSG00000196660",ENSG
1615,ZRC1,"SLC30A10, SLC30A1","ENSG00000196660, ENSG00000170385",ENSG


In [52]:
merged_aa_collision_alias_df["gene_symbol"] = merged_aa_collision_alias_df[
    "gene_symbol"
].str.split(", ")
merged_aa_collision_alias_df["gene_symbol_count"] = [
    len(c) for c in merged_aa_collision_alias_df["gene_symbol"]
]
merged_aa_collision_alias_df = merged_aa_collision_alias_df.sort_values(
    by="gene_symbol_count", ascending=False
)
merged_aa_collision_alias_df

Unnamed: 0,collision,gene_symbol,ENSG_ID,source,gene_symbol_count
3636,VH,"[IGHV4-59, IGHV4-39, IGHV4-34, IGHV4-28, I...","ENSG00000224373, ENSG00000211959, ENSG00000211...",NCBI,37
1409,H4-16,"[H4C4, H4C8, H4C16, H4C3, H4C11, H4C5, H...","ENSG00000277157, ENSG00000158406, ENSG00000197...",NCBI,14
1412,H4C12,"[H4C3, H4C16, H4C14, H4C13, H4C2, H4C9, ...","ENSG00000197061, ENSG00000197837, ENSG00000270...",NCBI,13
1413,H4C13,"[H4C1, H4C15, H4C16, H4C9, H4C8, H4C4, H...","ENSG00000278637, ENSG00000270276, ENSG00000197...",NCBI,13
1423,H4C9,"[H4C11, H4C6, H4C15, H4C16, H4C1, H4C5, ...","ENSG00000197238, ENSG00000274618, ENSG00000270...",NCBI,13
...,...,...,...,...,...
1151,FEB3B,"[SCN9A, GEFSP7]","ENSG00000169432, nan",NCBI,2
1150,FE,"[GTF2E1, GTF2E2]","ENSG00000153767, ENSG00000197265",NCBI,2
1149,FDH,"[ADH5, ALDH1L1]","ENSG00000197894, ENSG00000144908",NCBI,2
1148,FCT3A,"[FUT6, FUT4]","ENSG00000156413, ENSG00000196371",NCBI,2
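
The `gene_symbol` strings were joined with `", "` in `create_aa_collision_df`, so the delimiter used when splitting them back affects the list contents, although not the counts. A minimal sketch on two strings taken from the tables above (`genes` and the derived names are introduced here):

```python
import pandas as pd

genes = pd.Series(["KLRG1, SLC25A5", "RFC4, ATP6V0A1, RFC2, RFC1"])

# Splitting on "," alone leaves a leading space on every symbol after the first
split_comma = genes.str.split(",")

# Splitting on ", " returns clean symbols
split_comma_space = genes.str.split(", ")

# The number of symbols per collision is the same either way
counts = split_comma_space.str.len()

print(split_comma[0], split_comma_space[0], counts.tolist())
```

Clean symbols matter if the lists are later compared against another table, since `" SLC25A5"` and `"SLC25A5"` are different strings.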


In [53]:
merged_aa_collision_alias_df.loc[merged_aa_collision_alias_df["collision"] == "ASP"]

Unnamed: 0,collision,gene_symbol,ENSG_ID,source,gene_symbol_count
236,ASP,"[ROPN1L, TMPRSS11D, ASPM, A1CF, ATG5, ASP...","ENSG00000145491, ENSG00000153802, ENSG00000066...",NCBI,8
112,ASP,"[ATG5, ROPN1L, ASPA, A1CF, TMPRSS11D, ASI...","ENSG00000057663, ENSG00000145491, ENSG00000108...",ENSG,7
82,ASP,"[TMPRSS11D, ROPN1L, ATG5, A1CF, ASPM, ASP...","ENSG00000153802, ENSG00000145491, ENSG00000057...",HGNC,7


In [54]:
merged_aa_collision_alias_df.to_csv(
    "../created_files/merged_aa_collision_alias_df.csv", index=True, quoting=0
)

In [55]:
aa_collision_set = set(merged_aa_collision_alias_df["collision"].tolist())
len(aa_collision_set)

3824

# <a id='toc6_'></a>[How many unique primary gene symbols are there?](#toc0_)

## <a id='toc6_1_'></a>[Per Source](#toc0_)

In [56]:
ensg_gene_symbol_set = set(mini_ensg_df["gene_symbol"])
ensg_gene_symbol_count = len(ensg_gene_symbol_set)

In [57]:
hgnc_gene_symbol_set = set(mini_hgnc_df["gene_symbol"])
hgnc_gene_symbol_count = len(hgnc_gene_symbol_set)

In [58]:
ncbi_gene_symbol_set = set(mini_ncbi_df["gene_symbol"])
ncbi_gene_symbol_count = len(ncbi_gene_symbol_set)

In [59]:
#The index labels must follow the same order as the counts listed below
unique_primary_symbol_summary_index = "ENSG", "HGNC", "NCBI"
unique_primary_symbol_summary = {
    "Number of Unique Primary Gene Symbols": [
        ensg_gene_symbol_count,
        hgnc_gene_symbol_count,
        ncbi_gene_symbol_count,
    ]
}
unique_primary_symbol_summary_df = pd.DataFrame(
    unique_primary_symbol_summary, index = unique_primary_symbol_summary_index
)
unique_primary_symbol_summary_df

Unnamed: 0,Number of Unique Primary Gene Symbols
ENSG,41068
HGNC,45646
NCBI,193303


## <a id='toc6_2_'></a>[All sources](#toc0_)

### <a id='toc6_2_1_'></a>[How many symbols appear in all sources?](#toc0_)

In [60]:
all_sources_unique_primary_symbol_set = (
    ensg_gene_symbol_set
    & hgnc_gene_symbol_set
    & ncbi_gene_symbol_set
)
all_sources_unique_primary_symbol_count = len(all_sources_unique_primary_symbol_set)
all_sources_unique_primary_symbol_count

40885

### NCBI has more than four times as many unique primary gene symbols as either of the other sources. Why? What are they?

In [61]:
#NCBI symbols that are not shared by all three sources (some may still appear in one other source)
only_ncbi_gene_symbol_set = ncbi_gene_symbol_set - all_sources_unique_primary_symbol_set

In [62]:
len(only_ncbi_gene_symbol_set)

152418
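
This count is the NCBI total minus the three-way intersection (193,303 − 40,885 = 152,418), so it also includes symbols that NCBI shares with just one other source. A minimal sketch of the distinction on toy sets (the membership of each toy set is hypothetical and chosen only to illustrate the set operations):

```python
# Hypothetical toy sets; membership is illustrative, not taken from the data
ensg = {"TSPAN6", "TNMD", "NPY6R"}
hgnc = {"TSPAN6", "TNMD", "NPY6R", "ATHS"}
ncbi = {"TSPAN6", "TNMD", "ATHS", "LOC124629355", "trnD"}

# Symbols present in every source
in_all = ensg & hgnc & ncbi

# What the notebook computes: NCBI symbols missing from at least one other source
ncbi_not_in_all = ncbi - in_all

# A stricter alternative: NCBI symbols found in no other source
ncbi_exclusive = ncbi - (ensg | hgnc)

print(sorted(ncbi_not_in_all), sorted(ncbi_exclusive))

# Arithmetic check on the counts reported in this notebook
reported_difference = 193303 - 40885
```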

In [63]:
only_ncbi_gene_symbol_set

{'LOC124629355',
 'LOC127400380',
 'LOC107986892',
 'LOC130060677',
 'LOC126807094',
 'LOC129936972',
 'LOC130063141',
 'LOC130063388',
 'LOC126806480',
 'LOC127397924',
 'LOC122094896',
 'LOC127269131',
 'LOC127405279',
 'LOC127406171',
 'LOC127818039',
 'TRE-CTC14-1',
 'LOC127818437',
 'LOC105371757',
 'LOC127814858',
 'LOC127896547',
 'LOC127458894',
 'LOC129930396',
 'LOC129937862',
 'LOC127274129',
 'LOC127827796',
 'LOC110121180',
 'LOC129934742',
 'LOC127828433',
 'LOC126653324',
 'LOC111562374',
 'LOC126860544',
 'RPL39P21',
 'LOC127888921',
 'LOC130066430',
 'LOC129994486',
 'LOC126861204',
 'LOC127898482',
 'LOC105371738',
 'LOC127402240',
 'LOC129997725',
 'LOC110121221',
 'LOC124900455',
 'LOC129993843',
 'LOC130066184',
 'LOC129664372',
 'LOC130003735',
 'LOC127821134',
 'LOC132089873',
 'MRPS23P1',
 'LOC127829607',
 'LOC126807309',
 'LOC130004267',
 'LOC129662820',
 'LOC127459256',
 'LOC127893233',
 'LOC121725162',
 'LOC127267116',
 'LOC129662923',
 'LOC130002001',
 'LOC1

#### Most of these NCBI symbols (147,913 of 152,418, about 97%) begin with "LOC"

In [64]:
filtered_set = {gene for gene in only_ncbi_gene_symbol_set if not gene.startswith('LOC')}
len(filtered_set)

4505
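
The share of "LOC"-prefixed symbols follows from the two counts above: 152,418 - 4,505 = 147,913 symbols, or about 97%. A minimal sketch of that arithmetic as a reusable helper is shown below. The function name `prefix_share` and the toy set are assumptions for illustration; two of the toy symbols are taken from the output above and the "LOC" entries are made up.

```python
def prefix_share(symbols, prefix):
    """Return (matching count, total count, percentage) for symbols starting with prefix."""
    matching = sum(1 for symbol in symbols if symbol.startswith(prefix))
    total = len(symbols)
    pct = 100 * matching / total if total else 0.0
    return matching, total, pct


# Toy input, not the real NCBI set.
toy_symbols = {"LOC0000001", "LOC0000002", "LOC0000003", "TRE-CTC14-1", "RPL39P21"}

matching, total, pct = prefix_share(toy_symbols, "LOC")
print(f"{matching}/{total} ({pct:.0f}%) start with 'LOC'")
```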

### <a id='toc6_2_2_'></a>[How many unique symbols are found between all sources?](#toc0_)

In [65]:
bw_all_sources_unique_primary_symbol_df = pd.concat(
    [
        mini_ensg_df[["alias_symbol", "gene_symbol"]],
        mini_hgnc_df[["alias_symbol", "gene_symbol"]],
        mini_ncbi_df[["alias_symbol", "gene_symbol"]],
    ]
)

In [66]:
bw_all_sources_unique_primary_symbol_set = set(bw_all_sources_unique_primary_symbol_df["gene_symbol"])
bw_all_sources_unique_primary_symbol_count = len(bw_all_sources_unique_primary_symbol_set)
bw_all_sources_unique_primary_symbol_count

194866
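
Taking the set of the concatenated `gene_symbol` column is equivalent to taking the union of the three per-source sets. As a sanity check, the union size can also be recovered from the per-source and overlap sizes by inclusion-exclusion. The sketch below illustrates this with made-up toy sets; none of the values come from the notebook.

```python
# Toy sets (made-up membership, for illustration only).
toy_ensg = {"A", "B", "C"}
toy_hgnc = {"B", "C", "D"}
toy_ncbi = {"C", "D", "E", "F"}

# Direct union, the set-based equivalent of concat followed by set().
union_count = len(toy_ensg | toy_hgnc | toy_ncbi)

# Inclusion-exclusion: add singles, subtract pairwise overlaps, add back the triple overlap.
inclusion_exclusion = (
    len(toy_ensg) + len(toy_hgnc) + len(toy_ncbi)
    - len(toy_ensg & toy_hgnc)
    - len(toy_ensg & toy_ncbi)
    - len(toy_hgnc & toy_ncbi)
    + len(toy_ensg & toy_hgnc & toy_ncbi)
)

print(union_count, inclusion_exclusion)  # 6 6
```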

# <a id='toc7_'></a>[How many unique aliases are there?](#toc0_)

## <a id='toc7_1_'></a>[Per Source](#toc0_)

In [67]:
# Index labels must follow the same order as the counts listed below.
unique_alias_summary_index = ("ENSG", "HGNC", "NCBI")
unique_alias_summary = {
    "Number of Unique Aliases": [
        ensg_alias_count,
        hgnc_alias_count,
        ncbi_alias_count,
    ]
}
unique_alias_summary_df = pd.DataFrame(
    unique_alias_summary, index=unique_alias_summary_index
)
unique_alias_summary_df

Source,Number of Unique Aliases
ENSG,55213
HGNC,42918
NCBI,68771


## <a id='toc7_2_'></a>[All sources](#toc0_)

### <a id='toc7_2_1_'></a>[How many aliases appear in all sources?](#toc0_)

In [68]:
all_sources_unique_alias_set = (
    ensg_alias_symbol_set
    & hgnc_alias_symbol_set
    & ncbi_alias_symbol_set
)
all_sources_unique_alias_count = len(all_sources_unique_alias_set)
all_sources_unique_alias_count

29984
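
The three-way intersection shows what every source agrees on, but not which pairs of sources overlap most. A possible extension, sketched here with made-up toy alias sets and hypothetical variable names, computes each pairwise overlap alongside the three-way one.

```python
from itertools import combinations

# Toy alias sets per source (made-up values, for illustration only).
toy_alias_sets = {
    "ENSG": {"a1", "a2", "a3"},
    "HGNC": {"a2", "a3", "a4"},
    "NCBI": {"a3", "a4", "a5"},
}

# Size of the overlap for every unordered pair of sources.
pairwise_overlap = {
    (left, right): len(toy_alias_sets[left] & toy_alias_sets[right])
    for left, right in combinations(toy_alias_sets, 2)
}

# Aliases present in all sources.
three_way_overlap = set.intersection(*toy_alias_sets.values())

for pair, count in pairwise_overlap.items():
    print(pair, count)
print("all sources:", len(three_way_overlap))
```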

### <a id='toc7_2_2_'></a>[How many unique aliases are found between all sources?](#toc0_)

In [69]:
bw_all_sources_unique_alias_df = pd.concat(
    [
        subset_genes_ensg_df[["alias_symbol", "gene_symbol"]],
        subset_genes_hgnc_df[["alias_symbol", "gene_symbol"]],
        subset_genes_ncbi_df[["alias_symbol", "gene_symbol"]],
    ]
)

In [70]:
bw_all_sources_unique_alias_set = set(bw_all_sources_unique_alias_df["alias_symbol"])
bw_all_sources_unique_alias_count = len(bw_all_sources_unique_alias_set)
bw_all_sources_unique_alias_count

81138

# <a id='toc8_'></a>[How many gene records have an alias that is shared?](#toc0_)

## <a id='toc8_1_'></a>[Per Source](#toc0_)

In [71]:
ensg_aa_collision_primary_symbol_set = set(aa_collision_gene_ensg_df["gene_symbol"])
ensg_aa_collision_primary_symbol_count = len(ensg_aa_collision_primary_symbol_set)

In [72]:
hgnc_aa_collision_primary_symbol_set = set(aa_collision_gene_hgnc_df["gene_symbol"])
hgnc_aa_collision_primary_symbol_count = len(hgnc_aa_collision_primary_symbol_set)

In [73]:
ncbi_aa_collision_primary_symbol_set = set(aa_collision_gene_ncbi_df["gene_symbol"])
ncbi_aa_collision_primary_symbol_count = len(ncbi_aa_collision_primary_symbol_set)

In [74]:
# Index labels must follow the same order as the counts listed below.
aa_collision_primary_symbol_summary_index = ("ENSG", "HGNC", "NCBI")
aa_collision_primary_symbol_summary = {
    "Number of Gene Records With a Shared Alias": [
        ensg_aa_collision_primary_symbol_count,
        hgnc_aa_collision_primary_symbol_count,
        ncbi_aa_collision_primary_symbol_count,
    ]
}
aa_collision_primary_symbol_summary_df = pd.DataFrame(
    aa_collision_primary_symbol_summary, index=aa_collision_primary_symbol_summary_index
)
aa_collision_primary_symbol_summary_df

Source,Number of Gene Records With a Shared Alias
ENSG,3113
HGNC,2530
NCBI,6078
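
For readers who want the underlying idea in isolation, the sketch below derives both the shared aliases and the gene records that carry them from a list of (alias, gene) pairs. It assumes that a collision means one alias attached to more than one gene record, which is how this notebook appears to use the term. The pairs are toy data and the variable names are hypothetical.

```python
from collections import defaultdict

# Toy (alias_symbol, gene_symbol) pairs, for illustration only.
toy_pairs = [
    ("ASP", "ASPM"),
    ("ASP", "ATG5"),
    ("X1", "GENE1"),
    ("X2", "GENE1"),
]

# Group gene records by the alias they carry.
genes_by_alias = defaultdict(set)
for alias, gene in toy_pairs:
    genes_by_alias[alias].add(gene)

# An alias is shared when more than one gene record uses it.
shared_aliases = {alias for alias, genes in genes_by_alias.items() if len(genes) > 1}

# Gene records that have at least one shared alias.
genes_with_shared_alias = set().union(
    *(genes_by_alias[alias] for alias in shared_aliases)
)

print(sorted(shared_aliases))           # ['ASP']
print(sorted(genes_with_shared_alias))  # ['ASPM', 'ATG5']
```

Note that `GENE1` has two aliases but neither is used by another gene record, so it is not counted.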


## <a id='toc8_2_'></a>[All Sources](#toc0_)

### <a id='toc8_2_1_'></a>[How many gene records have at least one shared alias in all sources?](#toc0_)

In [75]:
all_sources_aa_collision_genes = (
    ensg_aa_collision_primary_symbol_set
    & hgnc_aa_collision_primary_symbol_set
    & ncbi_aa_collision_primary_symbol_set
)
len(all_sources_aa_collision_genes)

2319

### <a id='toc8_2_2_'></a>[How many unique gene records that have at least one shared alias are found between all sources?](#toc0_)

In [76]:
bw_all_sources_aa_collision_df = pd.concat(
    [
        aa_collision_gene_ensg_df[["collision", "gene_symbol"]],
        aa_collision_gene_hgnc_df[["collision", "gene_symbol"]],
        aa_collision_gene_ncbi_df[["collision", "gene_symbol"]],
    ]
)

In [77]:
bw_all_sources_aa_collision_genes_set = set(bw_all_sources_aa_collision_df["gene_symbol"])
bw_all_sources_aa_collision_genes_count = len(bw_all_sources_aa_collision_genes_set)
bw_all_sources_aa_collision_genes_count

6257

# <a id='toc9_'></a>[How many alias symbols are being shared?](#toc0_)

## <a id='toc9_1_'></a>[Per Source](#toc0_)

In [78]:
ensg_aa_collision_set = set(aa_collision_gene_ensg_df["collision"])
ensg_aa_collision_count = len(ensg_aa_collision_set)

In [79]:
hgnc_aa_collision_set = set(aa_collision_gene_hgnc_df["collision"])
hgnc_aa_collision_count = len(hgnc_aa_collision_set)

In [80]:
ncbi_aa_collision_set = set(aa_collision_gene_ncbi_df["collision"])
ncbi_aa_collision_count = len(ncbi_aa_collision_set)

In [81]:
# Index labels must follow the same order as the counts listed below.
aa_collision_alias_symbol_summary_index = ("ENSG", "HGNC", "NCBI")
aa_collision_alias_symbol_summary = {
    "Number of Shared Aliases": [
        ensg_aa_collision_count,
        hgnc_aa_collision_count,
        ncbi_aa_collision_count,
    ]
}
aa_collision_alias_symbol_summary_df = pd.DataFrame(
    aa_collision_alias_symbol_summary, index=aa_collision_alias_symbol_summary_index
)
aa_collision_alias_symbol_summary_df

Source,Number of Shared Aliases
ENSG,1617
HGNC,1250
NCBI,3711


## <a id='toc9_2_'></a>[All Sources](#toc0_)

### <a id='toc9_2_1_'></a>[How many aliases are shared in all sources?](#toc0_)

In [82]:
all_sources_aa_collision_aliases = (
    ensg_aa_collision_set
    & hgnc_aa_collision_set
    & ncbi_aa_collision_set
)
len(all_sources_aa_collision_aliases)

1131

### <a id='toc9_2_2_'></a>[How many unique shared aliases are found between all sources?](#toc0_)

In [83]:
bw_all_sources_aa_collision_aliases_set = set(bw_all_sources_aa_collision_df["collision"])
bw_all_sources_aa_collision_aliases_count = len(bw_all_sources_aa_collision_aliases_set)
bw_all_sources_aa_collision_aliases_count

3824

# <a id='toc10_'></a>[How many gene concept-alias relationships are there?](#toc0_)

## <a id='toc10_1_'></a>[Per Source](#toc0_)

In [84]:
ensg_primary_alias_pair_count = len(subset_genes_ensg_df)

In [85]:
hgnc_primary_alias_pair_count = len(subset_genes_hgnc_df)

In [86]:
ncbi_primary_alias_pair_count = len(subset_genes_ncbi_df)

In [87]:
# Index labels must follow the same order as the counts listed below.
primary_alias_pairs_summary_index = ("ENSG", "HGNC", "NCBI")
primary_alias_pairs_summary = {
    "Number of Unique Gene Concept-Alias Pairs": [
        ensg_primary_alias_pair_count,
        hgnc_primary_alias_pair_count,
        ncbi_primary_alias_pair_count,
    ]
}
primary_alias_pairs_summary_df = pd.DataFrame(
    primary_alias_pairs_summary, index=primary_alias_pairs_summary_index
)
primary_alias_pairs_summary_df

Source,Number of Unique Gene Concept-Alias Pairs
ENSG,57275
HGNC,44542
NCBI,74051


## <a id='toc10_2_'></a>[All Sources](#toc0_)

### <a id='toc10_2_1_'></a>[How many unique gene-alias pairs are found between all sources?](#toc0_)

In [88]:
bw_all_sources_primary_alias_pairs_df = pd.concat(
    [
        subset_genes_ensg_df[["alias_symbol", "gene_symbol"]],
        subset_genes_hgnc_df[["alias_symbol", "gene_symbol"]],
        subset_genes_ncbi_df[["alias_symbol", "gene_symbol"]],
    ]
)

In [89]:
len(bw_all_sources_primary_alias_pairs_df)

175868

#### <a id='toc10_2_1_1_'></a>[Remove duplicate concept-alias pairs](#toc0_)

In [90]:
bw_all_sources_primary_alias_pairs_df = bw_all_sources_primary_alias_pairs_df.drop_duplicates(
    subset=["gene_symbol", "alias_symbol"], keep="first"
)

In [91]:
len(bw_all_sources_primary_alias_pairs_df)

86774
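
Dropping duplicates discards the information about which sources contributed each pair. If that is of interest, one option is to tag each row with its source before concatenating and then group by pair. The sketch below uses made-up toy frames; the `source` column and all `toy_*` names are assumptions and do not exist in the notebook.

```python
import pandas as pd

# Toy per-source pair tables (made-up rows, for illustration only).
toy_ensg_pairs = pd.DataFrame(
    {"alias_symbol": ["ASP", "X1"], "gene_symbol": ["ASPM", "GENE1"]}
)
toy_hgnc_pairs = pd.DataFrame(
    {"alias_symbol": ["ASP", "X2"], "gene_symbol": ["ASPM", "GENE1"]}
)
toy_ncbi_pairs = pd.DataFrame(
    {"alias_symbol": ["ASP", "X1"], "gene_symbol": ["ASPM", "GENE1"]}
)

# Tag each row with its source, then stack the three tables.
toy_combined = pd.concat(
    [
        toy_ensg_pairs.assign(source="ENSG"),
        toy_hgnc_pairs.assign(source="HGNC"),
        toy_ncbi_pairs.assign(source="NCBI"),
    ],
    ignore_index=True,
)

# Same deduplication as above: one row per (gene_symbol, alias_symbol) pair.
toy_deduplicated = toy_combined.drop_duplicates(
    subset=["gene_symbol", "alias_symbol"], keep="first"
)

# For each unique pair, list the sources that report it.
toy_sources_per_pair = toy_combined.groupby(["gene_symbol", "alias_symbol"])[
    "source"
].agg(lambda sources: sorted(set(sources)))

print(len(toy_combined), len(toy_deduplicated))
print(toy_sources_per_pair)
```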