**Table of contents**<a id='toc0_'></a>    
- [ENSG](#toc1_)    
    - [How many total unique gene records are there in Ensembl](#toc1_1_1_)    
    - [Identify alias-alias collision symbols](#toc1_1_2_)    
    - [Create a histogram of the number of gene records sharing an alias](#toc1_1_3_)    
- [HGNC](#toc2_)    
    - [How many total unique gene records are there in HGNC](#toc2_1_1_)    
    - [Identify alias-alias collision symbols](#toc2_1_2_)    
    - [Create a histogram of the number of gene records sharing an alias](#toc2_1_3_)    
- [NCBI Info](#toc3_)    
    - [How many total unique gene records are there in NCBI Gene](#toc3_1_1_)    
    - [Identify alias-alias collision symbols](#toc3_1_2_)    
    - [Create a histogram of the number of gene records sharing an alias](#toc3_1_3_)    
- [Merge to create Alias-Alias Collision Table- On Primary Gene Symbol](#toc4_)    
- [Merge to create Alias-Alias Collision Table- On Alias Symbol](#toc5_)    
- [How many unique primary gene symbols are there?](#toc6_)    
  - [Per Source](#toc6_1_)    
  - [All sources](#toc6_2_)    
    - [How many symbols appear in all sources?](#toc6_2_1_)    
    - [How many unique symbols are found between all sources?](#toc6_2_2_)    
- [How many unique aliases are there?](#toc7_)    
  - [Per Source](#toc7_1_)    
  - [All sources](#toc7_2_)    
    - [How many aliases appear in all sources?](#toc7_2_1_)    
    - [How many unique aliases are found between all sources?](#toc7_2_2_)    
- [How many gene records have an alias that is shared?](#toc8_)    
  - [Per Source](#toc8_1_)    
  - [All Sources](#toc8_2_)    
    - [How many gene records have at least one shared alias in all sources?](#toc8_2_1_)    
    - [How many unique gene records that have at least one shared alias are found between all sources?](#toc8_2_2_)    
- [How many alias symbols are being shared?](#toc9_)    
  - [Per Source](#toc9_1_)    
  - [All Sources](#toc9_2_)    
    - [How many aliases are shared in all sources?](#toc9_2_1_)    
    - [How many unique shared aliases are found between all sources?](#toc9_2_2_)    
- [How many gene concept-alias relationships are there?](#toc10_)    
  - [Per Source](#toc10_1_)    
  - [All Sources](#toc10_2_)    
    - [How many unique gene-alias pairs are found between all sources?](#toc10_2_1_)    
      - [Remove duplicate concept-alias pairs](#toc10_2_1_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [229]:
import pandas as pd
import numpy as np
import plotly.express as px

In [230]:
def create_aa_collision_df(
    mini_xxxx_df: pd.DataFrame, source: str, split_on_character: str
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Create dfs of alias-alias collision symbols and save them as csv files

    :param mini_xxxx_df: Processed df of gene records
    :param source: Representation of the source of the gene records
    :param split_on_character: Character that is used to separate alias symbols in the mini_xxxx_df
    :return: The first rows of two dfs: genes that share an alias with another gene,
        and shared aliases with the genes that use them
    """

    #Drop genes without any aliases
    subset_genes_xxxx_df = mini_xxxx_df.replace("-", np.nan)
    subset_genes_xxxx_df = subset_genes_xxxx_df.dropna(subset=["alias_symbol"])

    #Make each value in the alias symbol column a set
    subset_genes_xxxx_df["alias_symbol"] = subset_genes_xxxx_df["alias_symbol"].astype(str)
    subset_genes_xxxx_df["alias_symbol"] = [x.split(split_on_character) for x in subset_genes_xxxx_df.alias_symbol]
    subset_genes_xxxx_df["alias_symbol"] = np.where(
    subset_genes_xxxx_df.alias_symbol == "", "", subset_genes_xxxx_df.alias_symbol.map(set)
    )

    #Explode the alias sets so that there is one alias per row
    subset_genes_xxxx_df = subset_genes_xxxx_df.explode("alias_symbol")
    
    #Remove duplicate instances of primary gene symbol- alias pairs
        #(occur because the same primary gene symbol may have multiple different ENSG IDs, 
        #see gene RFLNB for example)
    subset_genes_xxxx_df = subset_genes_xxxx_df.drop_duplicates(
    subset=["gene_symbol", "alias_symbol"], keep="first"
    )

    #Convert the df into a csv and save
    #(lower-case the source so the file name matches the path used when the file is read back)
    subset_genes_xxxx_df.to_csv(f'../created_files/subset_genes_{source.lower()}_df.csv', index=True) 

    #Create df with genes that have an alias that can be found in another gene's alias set
    aa_collision_gene_xxxx_df = subset_genes_xxxx_df.copy()
    aa_collision_gene_xxxx_df["alias_duplicates"] = aa_collision_gene_xxxx_df.duplicated(
    subset="alias_symbol", keep=False
    )
    aa_collision_gene_xxxx_df = aa_collision_gene_xxxx_df[aa_collision_gene_xxxx_df["alias_duplicates"] == True]
    aa_collision_gene_xxxx_df = aa_collision_gene_xxxx_df.drop(["alias_duplicates"], axis=1)
    aa_collision_gene_xxxx_df = aa_collision_gene_xxxx_df.sort_values("alias_symbol")

    #Add a source tag for future merging efforts
    aa_collision_gene_xxxx_df["source"] = str(source)

    #Convert the df into a csv
    aa_collision_gene_xxxx_df.to_csv(f'../created_files/aa_collision_gene_{source.lower()}_df.csv', index=True)

    #Create a secondary collision df that prioritizes the collision symbol
    aa_collision_alias_xxxx_df = aa_collision_gene_xxxx_df[
    ["alias_symbol", "gene_symbol", "ENSG_ID", "source"]
    ]
    aa_collision_alias_xxxx_df = aa_collision_alias_xxxx_df.map(str)
    aa_collision_alias_xxxx_df = (
    aa_collision_alias_xxxx_df.groupby("alias_symbol")
    .agg({"ENSG_ID": ", ".join, "gene_symbol": ", ".join, "source": "first"})
    .reset_index()
    )

    #Convert the df into a csv and save
    aa_collision_alias_xxxx_df.to_csv(f'../created_files/aa_collision_alias_{source.lower()}_df.csv', index=True)

    return aa_collision_gene_xxxx_df.head(), aa_collision_alias_xxxx_df.head()
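
The core of the function above is an explode followed by a `duplicated(keep=False)` check. The following is a minimal sketch of that idea on made-up data. The gene symbols and aliases are hypothetical toy values, and the whitespace `strip` is an assumption about how aliases might be normalised, not something the function above does.

```python
import pandas as pd

# Toy records; gene symbols and aliases are illustrative, not real data.
toy_df = pd.DataFrame(
    {
        "gene_symbol": ["GENE1", "GENE2", "GENE3"],
        "alias_symbol": ["AAA, BBB", "BBB, CCC", "DDD"],
    }
)

# Split each alias string into a list and strip surrounding whitespace
# (the strip is an assumption made for this sketch)
toy_df["alias_symbol"] = toy_df["alias_symbol"].str.split(",").map(
    lambda aliases: [alias.strip() for alias in aliases]
)

# One alias per row
toy_long_df = toy_df.explode("alias_symbol")

# keep=False flags every row whose alias appears more than once
toy_collision_df = toy_long_df[
    toy_long_df.duplicated(subset="alias_symbol", keep=False)
]

print(toy_collision_df)
```

Only the rows for `GENE1` and `GENE2` remain, because `BBB` is the only alias attached to more than one gene.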

In [231]:
def create_aa_collision_histogram(aa_collision_gene_xxxx_df: pd.DataFrame, source: str, xxxx_alias_count: int):
    """Create a histogram of the frequencies at which aliases are shared

    :param aa_collision_gene_xxxx_df: A df of alias-alias collisions organized by primary gene symbol
    :param source: Representation of the source of the gene records
    :param xxxx_alias_count: Number of aliases total in the source
    :return: A histogram of the percentage of aliases that are shared between 2 genes, 3 genes, and so on
    """
    
    #Count the number of times each shared alias is used
    aa_collision_xxxx_count_df = aa_collision_gene_xxxx_df.pivot_table(
    index=["alias_symbol"], aggfunc="size"
    )
    aa_collision_xxxx_count_df = aa_collision_xxxx_count_df.reset_index()
    aa_collision_xxxx_count_df.rename(columns={0: "num_gene_records"}, inplace=True)
    aa_collision_xxxx_count_df = aa_collision_xxxx_count_df.sort_values(
        "num_gene_records", ascending=False)

    #Convert to csv
    aa_collision_xxxx_count_df.to_csv(f'../created_files/aa_collision_{source.lower()}_count_df.csv', index=True)

    #Count the frequency at which aliases are shared 
    aa_collision_xxxx_distribution_df = aa_collision_xxxx_count_df.pivot_table(
    index=["num_gene_records"], aggfunc="size"
    )
    aa_collision_xxxx_distribution_df = aa_collision_xxxx_distribution_df.reset_index()
    aa_collision_xxxx_distribution_df.rename(columns={0: "num_alias_symbol"}, inplace=True)
    aa_collision_xxxx_distribution_df["percent_alias_symbol"] = (
        aa_collision_xxxx_distribution_df["num_alias_symbol"] / xxxx_alias_count
    ) * 100

    #Convert to csv
    aa_collision_xxxx_distribution_df.to_csv(f'../created_files/aa_collision_{source.lower()}_distribution_df.csv', index=True)

    #Create histogram df 
    xxxx_alias_count_histogram_df = aa_collision_xxxx_distribution_df.drop(
    "num_alias_symbol", axis=1)

    #Convert to csv
    xxxx_alias_count_histogram_df.to_csv(f'../created_files/{source.lower()}_alias_count_histogram_df.csv', index=True)

    return px.bar(xxxx_alias_count_histogram_df, x="num_gene_records", y="percent_alias_symbol")
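
The two counting steps in the function above (gene records per shared alias, then aliases per sharing level) can also be sketched with `groupby` and `value_counts`. This is an illustrative alternative on toy data, not the notebook's method. The table contents and `toy_alias_total` are hypothetical.

```python
import pandas as pd

# Toy collision table: one row per (gene, shared alias); values are illustrative.
toy_collision_df = pd.DataFrame(
    {
        "gene_symbol": ["G1", "G2", "G3", "G4", "G5"],
        "alias_symbol": ["X", "X", "Y", "Y", "Y"],
    }
)
toy_alias_total = 4  # hypothetical total number of unique aliases in the source

# Number of gene records sharing each alias
genes_per_alias = toy_collision_df.groupby("alias_symbol").size()

# Number of aliases shared by exactly k gene records
toy_distribution_df = (
    genes_per_alias.value_counts()
    .sort_index()
    .rename_axis("num_gene_records")
    .reset_index(name="num_alias_symbol")
)

# Express each count as a percentage of all aliases in the source
toy_distribution_df["percent_alias_symbol"] = (
    toy_distribution_df["num_alias_symbol"] / toy_alias_total * 100
)

print(toy_distribution_df)
```

Here one alias is shared by two gene records and one by three, so each sharing level accounts for 25% of the four toy aliases.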



# <a id='toc1_'></a>[ENSG](#toc0_)

In [232]:
mini_ensg_df = pd.read_csv(
    "../created_files/mini_ensg_df.csv",
    dtype={"HGNC_ID": pd.Int64Dtype(), "NCBI_ID": pd.Int64Dtype()},
)
mini_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TM4SF6, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"BRICD4, CHM1L, MYODULIN, TEM, TENDIN"
2,ENSG00000000419,DPM1,3005,8813,"CDGIE, MPDS"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"APOLO1, C1ORF112, FLIP, FLJ10706, MEICA1"
...,...,...,...,...,...
75829,ENSG00000293596,,,105372654,
75830,ENSG00000293597,LINC00970,48730,101978719,
75831,ENSG00000293599,,,,
75832,ENSG00000293600,,,131768270,


### <a id='toc1_1_1_'></a>[How many total unique gene records are there in Ensembl](#toc0_)

By ENSG ID

In [233]:
ensg_gene_id_set = set(mini_ensg_df["ENSG_ID"])
len(ensg_gene_id_set)

70611

### <a id='toc1_1_2_'></a>[Identify alias-alias collision symbols](#toc0_)

In [234]:
create_aa_collision_df(mini_ensg_df, source="ENSG", split_on_character= ",")

(               ENSG_ID gene_symbol  HGNC_ID  NCBI_ID alias_symbol source
 8000   ENSG00000140379      BCL2A1      991      597         ACC1   ENSG
 58193  ENSG00000275176       ACACA       84       31         ACC1   ENSG
 1354   ENSG00000076555       ACACB       85       32         ACC2   ENSG
 8000   ENSG00000140379      BCL2A1      991      597         ACC2   ENSG
 2085   ENSG00000097021       ACOT7    24157    11332          ACT   ENSG,
   alias_symbol                           ENSG_ID      gene_symbol source
 0         ACC1  ENSG00000140379, ENSG00000275176    BCL2A1, ACACA   ENSG
 1         ACC2  ENSG00000076555, ENSG00000140379    ACACB, BCL2A1   ENSG
 2          ACT  ENSG00000097021, ENSG00000196136  ACOT7, SERPINA3   ENSG
 3       AGPAT9  ENSG00000138678, ENSG00000153395    GPAT3, LPCAT1   ENSG
 4         AIP1  ENSG00000136848, ENSG00000187391    DAB2IP, MAGI2   ENSG)

In [235]:
subset_genes_ensg_df = pd.read_csv(
    "../created_files/subset_genes_ensg_df.csv", index_col=[0])

In [236]:
aa_collision_gene_ensg_df = pd.read_csv(
    "../created_files/aa_collision_gene_ensg_df.csv", index_col=[0])

In [237]:
aa_collision_alias_ensg_df = pd.read_csv(
    "../created_files/aa_collision_alias_ensg_df.csv", index_col=[0])

### <a id='toc1_1_3_'></a>[Create a histogram of the number of gene records sharing an alias](#toc0_)

In [238]:
ensg_alias_symbol_set = set(subset_genes_ensg_df["alias_symbol"])
ensg_alias_count = len(ensg_alias_symbol_set)

In [239]:
create_aa_collision_histogram(aa_collision_gene_ensg_df, "ENSG", ensg_alias_count)

In [240]:
aa_collision_ensg_count_df = pd.read_csv(
    "../created_files/aa_collision_ensg_count_df.csv", index_col=[0])

In [241]:
aa_collision_ensg_distribution_df = pd.read_csv(
    "../created_files/aa_collision_ensg_distribution_df.csv", index_col=[0])

In [242]:
ensg_alias_count_histogram_df = pd.read_csv(
    "../created_files/ensg_alias_count_histogram_df.csv", index_col=[0])

# <a id='toc2_'></a>[HGNC](#toc0_)

In [243]:
mini_hgnc_df = pd.read_csv(
    "../created_files/mini_hgnc_df.csv",
    dtype={"HGNC_ID": pd.Int64Dtype(), "NCBI_ID": pd.Int64Dtype()},
)
mini_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"myodulin, ChM1L, tendin, TEM, BRICD4"
2,ENSG00000000419,DPM1,3005,8813,"MPDS, CDGIE"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"FLJ10706, Apolo1, FLIP, MEICA1"
...,...,...,...,...,...
45641,,ZNF97,13173,,
45642,,ZNFP1,13181,,
45643,,ZPAXP,51635,105373450,ZPX1P
45644,,ZRK,13193,,


### <a id='toc2_1_1_'></a>[How many total unique gene records are there in HGNC](#toc0_)

By HGNC ID

In [244]:
hgnc_gene_id_set = set(mini_hgnc_df["HGNC_ID"])
len(hgnc_gene_id_set)

45646

### <a id='toc2_1_2_'></a>[Identify alias-alias collision symbols](#toc0_)

In [245]:
create_aa_collision_df(mini_hgnc_df, source="HGNC", split_on_character= ",")

(               ENSG_ID gene_symbol  HGNC_ID  NCBI_ID alias_symbol source
 75     ENSG00000005022     SLC25A5    10991      292          2F1   HGNC
 7761   ENSG00000139187       KLRG1     6380    10219          2F1   HGNC
 8398   ENSG00000143546      S100A8    10498     6279       60B8AG   HGNC
 10916  ENSG00000163220      S100A9    10499     6280       60B8AG   HGNC
 9226   ENSG00000149735       GPHA2    18054   170589           A2   HGNC,
   alias_symbol                                            ENSG_ID  \
 0          2F1                   ENSG00000005022, ENSG00000139187   
 1       60B8AG                   ENSG00000143546, ENSG00000163220   
 2           A2  ENSG00000149735, ENSG00000160226, ENSG00000108823   
 3         ACC2                   ENSG00000140379, ENSG00000076555   
 4         ACS2                   ENSG00000164398, ENSG00000197142   
 
             gene_symbol source  
 0        SLC25A5, KLRG1   HGNC  
 1        S100A8, S100A9   HGNC  
 2  GPHA2, CFAP410, SGCA   HGNC  
 3         BCL2A1, ACACB   HGNC  
 4          ACSL6, ACSL5   HGNC  )

In [246]:
subset_genes_hgnc_df = pd.read_csv(
    "../created_files/subset_genes_hgnc_df.csv", index_col=[0])

In [247]:
aa_collision_gene_hgnc_df = pd.read_csv(
    "../created_files/aa_collision_gene_hgnc_df.csv", index_col=[0])

In [248]:
aa_collision_alias_hgnc_df = pd.read_csv(
    "../created_files/aa_collision_alias_hgnc_df.csv", index_col=[0])

### <a id='toc2_1_3_'></a>[Create a histogram of the number of gene records sharing an alias](#toc0_)

In [249]:
hgnc_alias_symbol_set = set(subset_genes_hgnc_df["alias_symbol"])
hgnc_alias_count = len(hgnc_alias_symbol_set)

In [250]:
create_aa_collision_histogram(aa_collision_gene_hgnc_df, "HGNC", hgnc_alias_count)

In [251]:
aa_collision_hgnc_count_df = pd.read_csv(
    "../created_files/aa_collision_hgnc_count_df.csv", index_col=[0])

In [252]:
aa_collision_hgnc_distribution_df = pd.read_csv(
    "../created_files/aa_collision_hgnc_distribution_df.csv", index_col=[0])

In [253]:
hgnc_alias_count_histogram_df = pd.read_csv(
    "../created_files/hgnc_alias_count_histogram_df.csv", index_col=[0])

# <a id='toc3_'></a>[NCBI Info](#toc0_)

In [254]:
mini_ncbi_df = pd.read_csv(
    "../created_files/mini_ncbi_df.csv",
    dtype={"HGNC_ID": pd.Int64Dtype(), "NCBI_ID": pd.Int64Dtype()},
)
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B|ABG|GAB|HYST2477,5,ENSG00000121410
1,2,A2M,A2MD|CPAMD5|FWP007|S863-7,7,ENSG00000175899
2,3,A2MP1,A2MP,8,ENSG00000291190
3,9,NAT1,AAC1|MNAT|NAT-1|NATI,7645,ENSG00000171428
4,10,NAT2,AAC2|NAT-2|PNAT,7646,ENSG00000156006
...,...,...,...,...,...
193451,8923215,trnD,-,,
193452,8923216,trnP,-,,
193453,8923217,trnA,-,,
193454,8923218,COX1,-,,


### <a id='toc3_1_1_'></a>[How many total unique gene records are there in NCBI Gene](#toc0_)

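By ENSG ID (this counts the distinct Ensembl IDs cross-referenced by NCBI Gene records, not distinct NCBI IDs)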

In [255]:
ncbi_gene_id_set = set(mini_ncbi_df["ENSG_ID"])
len(ncbi_gene_id_set)

36803

### <a id='toc3_1_2_'></a>[Identify alias-alias collision symbols](#toc0_)

In [256]:
create_aa_collision_df(mini_ncbi_df, source="NCBI", split_on_character= "|")

(      NCBI_ID gene_symbol alias_symbol  HGNC_ID          ENSG_ID source
 4525     5728        PTEN     10q23del     9588  ENSG00000171862   NCBI
 537       657      BMPR1A     10q23del     1076  ENSG00000107779   NCBI
 199       239      ALOX12       12-LOX      429  ENSG00000108839   NCBI
 205       246      ALOX15       12-LOX      433  ENSG00000161905   NCBI
 245       292     SLC25A5          2F1    10991  ENSG00000005022   NCBI,
   alias_symbol                           ENSG_ID     gene_symbol source
 0     10q23del  ENSG00000171862, ENSG00000107779    PTEN, BMPR1A   NCBI
 1       12-LOX  ENSG00000108839, ENSG00000161905  ALOX12, ALOX15   NCBI
 2          2F1  ENSG00000005022, ENSG00000139187  SLC25A5, KLRG1   NCBI
 3  3-alpha-HSD  ENSG00000198610, ENSG00000073737   AKR1C4, DHRS9   NCBI
 4        35DAG  ENSG00000102683, ENSG00000170624      SGCG, SGCD   NCBI)

In [257]:
subset_genes_ncbi_df = pd.read_csv(
    "../created_files/subset_genes_ncbi_df.csv", index_col=[0])

In [258]:
aa_collision_gene_ncbi_df = pd.read_csv(
    "../created_files/aa_collision_gene_ncbi_df.csv", index_col=[0])

In [259]:
aa_collision_alias_ncbi_df = pd.read_csv(
    "../created_files/aa_collision_alias_ncbi_df.csv", index_col=[0])

### <a id='toc3_1_3_'></a>[Create a histogram of the number of gene records sharing an alias](#toc0_)

In [260]:
ncbi_alias_symbol_set = set(subset_genes_ncbi_df["alias_symbol"])
ncbi_alias_count = len(ncbi_alias_symbol_set)

In [261]:
create_aa_collision_histogram(aa_collision_gene_ncbi_df, "NCBI", ncbi_alias_count)

In [262]:
aa_collision_ncbi_count_df = pd.read_csv(
    "../created_files/aa_collision_ncbi_count_df.csv", index_col=[0])

In [263]:
aa_collision_ncbi_distribution_df = pd.read_csv(
    "../created_files/aa_collision_ncbi_distribution_df.csv", index_col=[0])

In [264]:
ncbi_alias_count_histogram_df = pd.read_csv(
    "../created_files/ncbi_alias_count_histogram_df.csv", index_col=[0])

# <a id='toc4_'></a>[Merge to create Alias-Alias Collision Table- On Primary Gene Symbol](#toc0_)

In [265]:
merged_aa_collision_gene_df = pd.concat(
    [
        aa_collision_gene_hgnc_df[["gene_symbol", "ENSG_ID", "alias_symbol", "source"]],
        aa_collision_gene_ncbi_df[["gene_symbol", "ENSG_ID", "alias_symbol", "source"]],
        aa_collision_gene_ensg_df[["gene_symbol", "ENSG_ID", "alias_symbol", "source"]],
    ]
)
merged_aa_collision_gene_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
75,SLC25A5,ENSG00000005022,2F1,HGNC
7761,KLRG1,ENSG00000139187,2F1,HGNC
8398,S100A8,ENSG00000143546,60B8AG,HGNC
10916,S100A9,ENSG00000163220,60B8AG,HGNC
9226,GPHA2,ENSG00000149735,A2,HGNC
...,...,...,...,...
2565,USP14,ENSG00000101557,TGT,ENSG
6615,UBE2G1,ENSG00000132388,UBC7,ENSG
15955,UBE2G2,ENSG00000184787,UBC7,ENSG
5350,PLAU,ENSG00000122861,UPA,ENSG


In [266]:
merged_aa_collision_gene_df.to_csv(
    "../created_files/merged_aa_collision_gene_df.csv", index=False
)

In [267]:
merged_aa_collision_gene_df.loc[merged_aa_collision_gene_df.alias_symbol == "ALP"]

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
9748,PDLIM3,ENSG00000154553,ALP,HGNC
21413,CCL27,ENSG00000213927,ALP,HGNC
41314,ATHS,,ALP,HGNC
209,ALPP,ENSG00000163283,ALP,NCBI
9843,ATRNL1,ENSG00000107518,ALP,NCBI
5223,SLPI,ENSG00000124107,ALP,NCBI
15068,ASRGL1,ENSG00000162174,ALP,NCBI
8497,CCL27,ENSG00000213927,ALP,NCBI
391,ATHS,,ALP,NCBI
10410,PDLIM3,ENSG00000154553,ALP,NCBI
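
A lookup like the one above lists every (gene, source) row for an alias. To see at a glance which source attaches the alias to which gene, the rows can be cross-tabulated. The following is a sketch on hypothetical toy data, not on the real merged table.

```python
import pandas as pd

# Toy stand-in for the merged long table; all values are illustrative.
toy_merged_df = pd.DataFrame(
    {
        "gene_symbol": ["G1", "G2", "G1", "G3"],
        "alias_symbol": ["X", "X", "X", "X"],
        "source": ["HGNC", "HGNC", "NCBI", "NCBI"],
    }
)

# Rows for a single alias of interest
toy_alias_rows = toy_merged_df.loc[toy_merged_df["alias_symbol"] == "X"]

# Gene-by-source presence table: True where the source links the alias to the gene
toy_presence_df = pd.crosstab(
    toy_alias_rows["gene_symbol"], toy_alias_rows["source"]
).astype(bool)

print(toy_presence_df)
```

In this toy case only `G1` carries the alias in both sources, while `G2` and `G3` each carry it in one.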


# <a id='toc5_'></a>[Merge to create Alias-Alias Collision Table- On Alias Symbol](#toc0_)

In [268]:
merged_aa_collision_alias_df = pd.concat(
    [
        aa_collision_alias_hgnc_df[["alias_symbol", "gene_symbol", "ENSG_ID", "source"]],
        aa_collision_alias_ncbi_df[["alias_symbol", "gene_symbol", "ENSG_ID", "source"]],
        aa_collision_alias_ensg_df[["alias_symbol", "gene_symbol", "ENSG_ID", "source"]],
    ]
)
merged_aa_collision_alias_df

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
0,2F1,"SLC25A5, KLRG1","ENSG00000005022, ENSG00000139187",HGNC
1,60B8AG,"S100A8, S100A9","ENSG00000143546, ENSG00000163220",HGNC
2,A2,"GPHA2, CFAP410, SGCA","ENSG00000149735, ENSG00000160226, ENSG00000108823",HGNC
3,ACC2,"BCL2A1, ACACB","ENSG00000140379, ENSG00000076555",HGNC
4,ACS2,"ACSL6, ACSL5","ENSG00000164398, ENSG00000197142",HGNC
...,...,...,...,...
1144,TCRBV15S1,"TRBV15, TRBV24-1","ENSG00000276819, ENSG00000211750",ENSG
1145,TCRGV5P,"TRGV5P, TRGV6","ENSG00000228668, ENSG00000226212",ENSG
1146,TGT,"QTRT1, USP14","ENSG00000213339, ENSG00000101557",ENSG
1147,UBC7,"UBE2G1, UBE2G2","ENSG00000132388, ENSG00000184787",ENSG


In [269]:
# Split on ", " (the separator used when the symbols were joined) to avoid leading spaces
merged_aa_collision_alias_df["gene_symbol"] = merged_aa_collision_alias_df[
    "gene_symbol"
].str.split(", ")
merged_aa_collision_alias_df["gene_symbol_count"] = [
    len(c) for c in merged_aa_collision_alias_df["gene_symbol"]
]
merged_aa_collision_alias_df = merged_aa_collision_alias_df.sort_values(
    by="gene_symbol_count", ascending=False
)
merged_aa_collision_alias_df

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
3305,VH,"[IGHV4-4, IGHV3-7, IGHV3-66, IGHM, IGHV4-2...","ENSG00000276775, ENSG00000211938, ENSG00000211...",NCBI,36
1303,H4-16,"[H4C12, H4C5, H4C6, H4C1, H4C16, H4C15, ...","ENSG00000273542, ENSG00000276966, ENSG00000274...",NCBI,14
1317,H4C9,"[H4C2, H4C4, H4C3, H4C15, H4C12, H4C16, ...","ENSG00000278705, ENSG00000277157, ENSG00000197...",NCBI,13
1304,H4C1,"[H4C14, H4C12, H4C3, H4C16, H4C6, H4C2, ...","ENSG00000270882, ENSG00000273542, ENSG00000197...",NCBI,13
1305,H4C11,"[H4C1, H4C3, H4C5, H4C4, H4C8, H4C6, H4C...","ENSG00000278637, ENSG00000197061, ENSG00000276...",NCBI,13
...,...,...,...,...,...
1240,GST3,"[GSTP1, CHST4]","ENSG00000084207, ENSG00000140835",NCBI,2
1238,GST1,"[GSPT1, GSTM1]","ENSG00000103342, ENSG00000134184",NCBI,2
1237,GST,"[GSTK1, SLCO6A1]","ENSG00000197448, ENSG00000205359",NCBI,2
1236,GSP,"[GNAS, GSM1]","ENSG00000087460, nan",NCBI,2


In [270]:
merged_aa_collision_alias_df.loc[merged_aa_collision_alias_df["alias_symbol"] == "ASP"]

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
222,ASP,"[ASPA, ROPN1L, TMPRSS11D, ASPM, ATG5, C3,...","ENSG00000108381, ENSG00000145491, ENSG00000153...",NCBI,8
394,ASP,"[ASIP, ROPN1L, ATG5, ASPA]","ENSG00000101440, ENSG00000145491, ENSG00000057...",HGNC,4
864,ASP,"[TMPRSS11D, ASPM, ROPN1L]","ENSG00000153802, ENSG00000066279, ENSG00000145491",ENSG,3


In [271]:
merged_aa_collision_alias_df.to_csv(
    "../created_files/merged_aa_collision_alias_df.csv", index=True, quoting=0
)

In [272]:
aa_collision_set = set(merged_aa_collision_alias_df["alias_symbol"].tolist())
len(aa_collision_set)

4494

# <a id='toc6_'></a>[How many unique primary gene symbols are there?](#toc0_)

## <a id='toc6_1_'></a>[Per Source](#toc0_)

In [273]:
ensg_gene_symbol_set = set(mini_ensg_df["gene_symbol"])
ensg_gene_symbol_count = len(ensg_gene_symbol_set)

In [274]:
hgnc_gene_symbol_set = set(mini_hgnc_df["gene_symbol"])
hgnc_gene_symbol_count = len(hgnc_gene_symbol_set)

In [275]:
ncbi_gene_symbol_set = set(mini_ncbi_df["gene_symbol"])
ncbi_gene_symbol_count = len(ncbi_gene_symbol_set)
ncbi_gene_symbol_count

193303

In [276]:
# Index order must match the order of the counts in the list below
unique_primary_symbol_summary_index = "ENSG", "HGNC", "NCBI"
unique_primary_symbol_summary = {
    "Number of Unique Primary Gene Symbols": [
        ensg_gene_symbol_count,
        hgnc_gene_symbol_count,
        ncbi_gene_symbol_count,
    ]
}
unique_primary_symbol_summary_df = pd.DataFrame(
    unique_primary_symbol_summary, index = unique_primary_symbol_summary_index
)
unique_primary_symbol_summary_df

Unnamed: 0,Number of Unique Primary Gene Symbols
ENSG,41068
HGNC,45646
NCBI,193303
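
One way to keep source labels and counts aligned in a summary table like this is to key the counts by source name and let pandas derive the index. The following is a sketch only. The counts are hypothetical placeholders, not the notebook's results.

```python
import pandas as pd

# Placeholder counts keyed by source; the numbers are illustrative only.
toy_counts_by_source = {"ENSG": 3, "HGNC": 5, "NCBI": 8}

# The dict keys become the index, so each label stays attached to its count
toy_summary_df = pd.Series(
    toy_counts_by_source, name="Number of Unique Primary Gene Symbols"
).to_frame()

print(toy_summary_df)
```

Because each label is written next to its value, reordering the entries cannot silently swap two sources.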


## <a id='toc6_2_'></a>[All sources](#toc0_)

### <a id='toc6_2_1_'></a>[How many symbols appear in all sources?](#toc0_)

In [277]:
all_sources_unique_primary_symbol_set = (
    ensg_gene_symbol_set
    & hgnc_gene_symbol_set
    & ncbi_gene_symbol_set
)
all_sources_unique_primary_symbol_count = len(all_sources_unique_primary_symbol_set)
all_sources_unique_primary_symbol_count

40885

### <a id='toc6_2_2_'></a>[How many unique symbols are found between all sources?](#toc0_)

In [278]:
bw_all_sources_unique_primary_symbol_df = pd.concat(
    [
        mini_ensg_df[["alias_symbol", "gene_symbol"]],
        mini_hgnc_df[["alias_symbol", "gene_symbol"]],
        mini_ncbi_df[["alias_symbol", "gene_symbol"]],
    ]
)

In [279]:
bw_all_sources_unique_primary_symbol_set = set(bw_all_sources_unique_primary_symbol_df["gene_symbol"])
bw_all_sources_unique_primary_symbol_count = len(bw_all_sources_unique_primary_symbol_set)
bw_all_sources_unique_primary_symbol_count

194866
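
The two questions in this section correspond to a set intersection (symbols present in every source) and a set union (symbols present in at least one source). A toy sketch with made-up symbols:

```python
# Hypothetical symbol sets for three sources
toy_ensg = {"A", "B", "C"}
toy_hgnc = {"B", "C", "D"}
toy_ncbi = {"C", "D", "E"}

# Symbols that appear in all three sources
in_all_sources = toy_ensg & toy_hgnc & toy_ncbi

# Unique symbols found in at least one source
in_any_source = toy_ensg | toy_hgnc | toy_ncbi

print(len(in_all_sources), len(in_any_source))
```

When the sets are built from DataFrame columns, missing values end up in the set as NaN unless they are dropped first (for example with `.dropna()`).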

# <a id='toc7_'></a>[How many unique aliases are there?](#toc0_)

## <a id='toc7_1_'></a>[Per Source](#toc0_)

In [280]:
# Index order must match the order of the counts in the list below
unique_alias_summary_index = "ENSG", "HGNC", "NCBI"
unique_alias_summary = {
    "Number of Unique Aliases": [
        ensg_alias_count,
        hgnc_alias_count,
        ncbi_alias_count,
    ]
}
unique_alias_summary_df = pd.DataFrame(
    unique_alias_summary, index = unique_alias_summary_index
)
unique_alias_summary_df

Unnamed: 0,Number of Unique Aliases
ENSG,55938
HGNC,43770
NCBI,69156


## <a id='toc7_2_'></a>[All sources](#toc0_)

### <a id='toc7_2_1_'></a>[How many aliases appear in all sources?](#toc0_)

In [281]:
all_sources_unique_alias_set = (
    ensg_alias_symbol_set
    & hgnc_alias_symbol_set
    & ncbi_alias_symbol_set
)
all_sources_unique_alias_count = len(all_sources_unique_alias_set)
all_sources_unique_alias_count

5574

### <a id='toc7_2_2_'></a>[How many unique aliases are found between all sources?](#toc0_)

In [282]:
bw_all_sources_unique_alias_df = pd.concat(
    [
        subset_genes_ensg_df[["alias_symbol", "gene_symbol"]],
        subset_genes_hgnc_df[["alias_symbol", "gene_symbol"]],
        subset_genes_ncbi_df[["alias_symbol", "gene_symbol"]],
    ]
)

In [283]:
bw_all_sources_unique_alias_set = set(bw_all_sources_unique_alias_df["alias_symbol"])
bw_all_sources_unique_alias_count = len(bw_all_sources_unique_alias_set)
bw_all_sources_unique_alias_count

125379

# <a id='toc8_'></a>[How many gene records have an alias that is shared?](#toc0_)

## <a id='toc8_1_'></a>[Per Source](#toc0_)

In [284]:
ensg_aa_collision_primary_symbol_set = set(aa_collision_gene_ensg_df["gene_symbol"])
ensg_aa_collision_primary_symbol_count = len(ensg_aa_collision_primary_symbol_set)

In [285]:
hgnc_aa_collision_primary_symbol_set = set(aa_collision_gene_hgnc_df["gene_symbol"])
hgnc_aa_collision_primary_symbol_count = len(hgnc_aa_collision_primary_symbol_set)

In [286]:
ncbi_aa_collision_primary_symbol_set = set(aa_collision_gene_ncbi_df["gene_symbol"])
ncbi_aa_collision_primary_symbol_count = len(ncbi_aa_collision_primary_symbol_set)

In [287]:
# Index order must match the order of the counts in the list below
aa_collision_primary_symbol_summary_index = "ENSG", "HGNC", "NCBI"
aa_collision_primary_symbol_summary = {
    "Number of Gene Records With a Shared Alias": [
        ensg_aa_collision_primary_symbol_count,
        hgnc_aa_collision_primary_symbol_count,
        ncbi_aa_collision_primary_symbol_count,
    ]
}
aa_collision_primary_symbol_summary_df = pd.DataFrame(
    aa_collision_primary_symbol_summary, index = aa_collision_primary_symbol_summary_index
)
aa_collision_primary_symbol_summary_df

Unnamed: 0,Number of Gene Records With a Shared Alias
ENSG,2224
HGNC,1356
NCBI,5732


## <a id='toc8_2_'></a>[All Sources](#toc0_)

### <a id='toc8_2_1_'></a>[How many gene records have at least one shared alias in all sources?](#toc0_)

In [288]:
all_sources_aa_collision_genes = (
    ensg_aa_collision_primary_symbol_set
    & hgnc_aa_collision_primary_symbol_set
    & ncbi_aa_collision_primary_symbol_set
)
len(all_sources_aa_collision_genes)

995

### <a id='toc8_2_2_'></a>[How many unique gene records that have at least one shared alias are found between all sources?](#toc0_)

In [289]:
bw_all_sources_aa_collision_df = pd.concat(
    [
        aa_collision_gene_ensg_df[["alias_symbol", "gene_symbol"]],
        aa_collision_gene_hgnc_df[["alias_symbol", "gene_symbol"]],
        aa_collision_gene_ncbi_df[["alias_symbol", "gene_symbol"]],
    ]
)

In [290]:
bw_all_sources_aa_collision_genes_set = set(bw_all_sources_aa_collision_df["gene_symbol"])
bw_all_sources_aa_collision_genes_count = len(bw_all_sources_aa_collision_genes_set)
bw_all_sources_aa_collision_genes_count

5977

# <a id='toc9_'></a>[How many alias symbols are being shared?](#toc0_)

## <a id='toc9_1_'></a>[Per Source](#toc0_)

In [291]:
ensg_aa_collision_set = set(aa_collision_gene_ensg_df["alias_symbol"])
ensg_aa_collision_count = len(ensg_aa_collision_set)

In [292]:
hgnc_aa_collision_set = set(aa_collision_gene_hgnc_df["alias_symbol"])
hgnc_aa_collision_count = len(hgnc_aa_collision_set)

In [293]:
ncbi_aa_collision_set = set(aa_collision_gene_ncbi_df["alias_symbol"])
ncbi_aa_collision_count = len(ncbi_aa_collision_set)

In [294]:
# Index order must match the order of the counts in the list below
aa_collision_alias_symbol_summary_index = "ENSG", "HGNC", "NCBI"
aa_collision_alias_symbol_summary = {
    "Number of Shared Aliases": [
        ensg_aa_collision_count,
        hgnc_aa_collision_count,
        ncbi_aa_collision_count,
    ]
}
aa_collision_alias_symbol_summary_df = pd.DataFrame(
    aa_collision_alias_symbol_summary, index = aa_collision_alias_symbol_summary_index
)
aa_collision_alias_symbol_summary_df

Unnamed: 0,Number of Shared Aliases
ENSG,1149
HGNC,673
NCBI,3476


## <a id='toc9_2_'></a>[All Sources](#toc0_)

### <a id='toc9_2_1_'></a>[How many aliases are shared in all sources?](#toc0_)

In [295]:
all_sources_aa_collision_aliases = (
    ensg_aa_collision_set
    & hgnc_aa_collision_set
    & ncbi_aa_collision_set
)
len(all_sources_aa_collision_aliases)

95

### <a id='toc9_2_2_'></a>[How many unique shared aliases are found between all sources?](#toc0_)

In [296]:
bw_all_sources_aa_collision_aliases_set = set(bw_all_sources_aa_collision_df["alias_symbol"])
bw_all_sources_aa_collision_aliases_count = len(bw_all_sources_aa_collision_aliases_set)
bw_all_sources_aa_collision_aliases_count

4494

# <a id='toc10_'></a>[How many gene concept-alias relationships are there?](#toc0_)

## <a id='toc10_1_'></a>[Per Source](#toc0_)

In [297]:
ensg_primary_alias_pair_count = len(subset_genes_ensg_df)

In [298]:
hgnc_primary_alias_pair_count = len(subset_genes_hgnc_df)

In [299]:
ncbi_primary_alias_pair_count = len(subset_genes_ncbi_df)

In [300]:
# Index order must match the order of the counts in the list below
primary_alias_pairs_summary_index = "ENSG", "HGNC", "NCBI"
primary_alias_pairs_summary = {
    "Number of Unique Gene Concept-Alias Pairs": [
        ensg_primary_alias_pair_count,
        hgnc_primary_alias_pair_count,
        ncbi_primary_alias_pair_count,
    ]
}
primary_alias_pairs_summary_df = pd.DataFrame(
    primary_alias_pairs_summary, index=primary_alias_pairs_summary_index
)
primary_alias_pairs_summary_df

Unnamed: 0,Number of Unique Gene Concept-Alias Pairs
ENSG,57362
HGNC,44584
NCBI,74053


## <a id='toc10_2_'></a>[All Sources](#toc0_)

### <a id='toc10_2_1_'></a>[How many unique gene-alias pairs are found between all sources?](#toc0_)

In [301]:
bw_all_sources_primary_alias_pairs_df = pd.concat(
    [
        subset_genes_ensg_df[["alias_symbol", "gene_symbol"]],
        subset_genes_hgnc_df[["alias_symbol", "gene_symbol"]],
        subset_genes_ncbi_df[["alias_symbol", "gene_symbol"]],
    ]
)

In [302]:
len(bw_all_sources_primary_alias_pairs_df)

175999

#### <a id='toc10_2_1_1_'></a>[Remove duplicate concept-alias pairs](#toc0_)

In [303]:
bw_all_sources_primary_alias_pairs_df = bw_all_sources_primary_alias_pairs_df.drop_duplicates(
    subset=["gene_symbol", "alias_symbol"], keep="first"
)

In [304]:
len(bw_all_sources_primary_alias_pairs_df)

131864
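
The drop in row count after de-duplication reflects gene-alias pairs that more than one source reports. A toy sketch of the same concat-then-`drop_duplicates` pattern, using hypothetical two-row tables in place of the per-source DataFrames:

```python
import pandas as pd

# Two toy sources; the pair (G1, X) is reported by both. Values are illustrative.
toy_source_a = pd.DataFrame({"gene_symbol": ["G1", "G2"], "alias_symbol": ["X", "Y"]})
toy_source_b = pd.DataFrame({"gene_symbol": ["G1", "G3"], "alias_symbol": ["X", "Y"]})

# Stack the sources, then keep one row per (gene, alias) pair
toy_pairs_df = pd.concat([toy_source_a, toy_source_b], ignore_index=True)
toy_unique_pairs_df = toy_pairs_df.drop_duplicates(
    subset=["gene_symbol", "alias_symbol"], keep="first"
)

print(len(toy_pairs_df), len(toy_unique_pairs_df))
```

The alias `Y` survives twice because it is paired with two different genes. Only the repeated (`G1`, `X`) pair is collapsed.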