# <a id='toc1_'></a>[DGIdb ambiguous claims](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [DGIdb ambiguous claims](#toc1_)    
    - [Analyzing similarities b/w alias-primary and alias-alias collisions](#toc1_1_1_)    
      - [Load merged alias-alias and alias-primary collision sets. Symbols are the ambiguous symbols.](#toc1_1_1_1_)    
      - [How many of the alias-primary collisions are also alias-alias collisions?](#toc1_1_1_2_)    
      - [How many unique ambiguous symbols are there b/w alias-primary and alias-alias collisions?](#toc1_1_1_3_)    
    - [Primary exploration of DGIdb gene content using collisions](#toc1_1_2_)    
      - [Load gene claim data from DGIdb](#toc1_1_2_1_)    
      - [How many claims are placed in a gene group with a different label?](#toc1_1_2_2_)    
      - [How many claims are not normalized?](#toc1_1_2_3_)    
      - [Create set of gene groups (gene_name) and gene claims (gene_claim_name)](#toc1_1_2_4_)    
      - [Load HGNC, Ensembl (ENSG), and NCBI gene and alias sets](#toc1_1_2_5_)    
      - [How many unique primary gene symbols are there b/w HGNC, Ensembl, and NCBI?](#toc1_1_2_6_)    
      - [How many unique alias symbols are there b/w HGNC, Ensembl, and NCBI?](#toc1_1_2_7_)    
      - [How many unique group names are not primary gene symbols?](#toc1_1_2_8_)    
      - [How many unique claims are not primary gene symbols?](#toc1_1_2_9_)    
      - [Load the collision sets from each data source](#toc1_1_2_10_)    
      - [How many unique group names are primary gene symbols?](#toc1_1_2_11_)    
      - [How many of the groups that are labeled with a primary gene symbol are also alias-alias collisions?](#toc1_1_2_12_)    
      - [How many unique claim symbols are collisions?](#toc1_1_2_13_)    
      - [How many unique groups are labeled with collisions?](#toc1_1_2_14_)    
      - [How many unique claims are not primary gene symbols?](#toc1_1_2_15_)    
      - [How many unique claims that were not primary symbols are collisions?](#toc1_1_2_16_)    
      - [How many unique claims that are primary gene symbols are also collisions?](#toc1_1_2_17_)    
      - [How many of the gene groups that are not primary gene symbols are alias-alias collisions from HGNC?](#toc1_1_2_18_)    
      - [How many claims are ambiguous symbols?](#toc1_1_2_19_)    
      - [How many gene group labels are ambiguous symbols?](#toc1_1_2_20_)    
      - [How many claims are primary symbols?](#toc1_1_2_21_)    
      - [How many gene group labels are primary symbols?](#toc1_1_2_22_)    
      - [How many gene group labels are alias symbols?](#toc1_1_2_23_)    
      - [How many claims are alias symbols?](#toc1_1_2_24_)    
      - [How many claims that were normalized into non-primary symbol labeled groups are alias symbols?](#toc1_1_2_25_)    
      - [How many not normalized claims are alias symbols?](#toc1_1_2_26_)    
      - [How many gene group labels are not primary, alias symbols, or null?](#toc1_1_2_27_)    
      - [How many claims are not primary, alias symbols, or null?](#toc1_1_2_28_)    
      - [How many claims are primary and alias symbols?](#toc1_1_2_29_)    
      - [How many primary symbol claims are normalized into non-primary gene group labels?](#toc1_1_2_30_)    
    - [Summary](#toc1_1_3_)    
      - [Normalization Rates](#toc1_1_3_1_)    
      - [Types of Claim Symbols](#toc1_1_3_2_)    
      - [Types of Normalizations (gene group labels claims are being normalized into)](#toc1_1_3_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

### <a id='toc1_1_1_'></a>[Analyzing similarities b/w alias-primary and alias-alias collisions](#toc0_)

#### <a id='toc1_1_1_1_'></a>[Load merged alias-alias and alias-primary collision sets. Symbols are the ambiguous symbols.](#toc0_)

Input: merged_alias_gene_intersections.csv (total_alias_gene_intersections.ipynb), merged_alias_overlap_df_2.csv (total_alias_overlap.ipynb)

Output: merged_alias_primary_collision_set, merged_alias_alias_collision_set

In [2]:
merged_alias_primary_collisions_df = pd.read_csv(
    "../output/merged_alias_primary_collisions_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [3]:
merged_alias_primary_collision_set = set(
    merged_alias_primary_collisions_df["collision"]
)
len(merged_alias_primary_collision_set)

1602
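The `na_values=["", "NULL"], keep_default_na=False` arguments used in the loading cells matter for symbol data: with pandas' default NA parsing, a symbol spelled `NA` or `NaN` would be read as a missing value. A minimal sketch on made-up in-memory data (the `collision` column name follows the notebook; the rows and the `gene_id` column are hypothetical):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a collision file.
csv_text = "collision,gene_id\nNA,1\nNULL,2\nKRAS,3\n,4\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["", "NULL"],  # only these two spellings count as missing
    keep_default_na=False,   # so a literal "NA" symbol stays a string
)

# Drop missing entries before building the set so NaN never becomes a member.
collision_set = set(df["collision"].dropna())
print(sorted(collision_set))  # ['KRAS', 'NA']
```

Calling `dropna()` before `set()` is an optional extra step that the notebook cells do not take; it only matters if the column can contain missing values.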

In [4]:
merged_alias_alias_collisions_df = pd.read_csv(
    "../output/merged_aa_collision_alias_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [5]:
merged_alias_alias_collision_set = set(
    merged_alias_alias_collisions_df["collision"].tolist()
)
len(merged_alias_alias_collision_set)

3809

In [6]:
merged_alias_alias_collision_list = merged_alias_alias_collisions_df["collision"].tolist()

In [7]:
merged_alias_alias_collision_list[1]

'H4-16'

#### <a id='toc1_1_1_2_'></a>[How many of the alias-primary collisions are also alias-alias collisions?](#toc0_)

In [8]:
print(
    len(
        merged_alias_alias_collision_set.intersection(
            merged_alias_primary_collision_set
        )
    )
)

220


#### <a id='toc1_1_1_3_'></a>[How many unique ambiguous symbols are there b/w alias-primary and alias-alias collisions?](#toc0_)

In [9]:
ambiguous_symbol_set = merged_alias_alias_collision_set.union(
    merged_alias_primary_collision_set
)
print(len(ambiguous_symbol_set))

5191
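The three counts reported so far can be cross-checked with the inclusion-exclusion identity, |A ∪ B| = |A| + |B| − |A ∩ B|. A small sketch using the numbers printed above, followed by the same identity on made-up toy sets:

```python
# Counts reported by the notebook cells above.
n_alias_alias = 3809
n_alias_primary = 1602
n_both = 220

# |A ∪ B| = |A| + |B| - |A ∩ B|
n_union = n_alias_alias + n_alias_primary - n_both
print(n_union)  # 5191

# The same identity on small, made-up sets.
a = {"KRAS", "HBB", "TAF"}
b = {"TAF", "CRF"}
assert len(a | b) == len(a) + len(b) - len(a & b)
```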


In [10]:
ambiguous_symbol_set

{'GUSBP1',
 'HGFL',
 'KFSD',
 'SOAT',
 'TRNAMI2',
 'CHL',
 'CAPS',
 'BD-4',
 'KV11.1',
 'ZNF32',
 'PRR20C',
 'SHP1',
 'RAX',
 'PXMP1',
 'SPANXE',
 'NOP9',
 'CAS2',
 'ZNF2',
 'KAT1',
 'TRY1',
 'HET',
 'TRF-GAA1-1',
 'GCD10',
 'HRG',
 'EEF1AKMT4',
 'FRA3B',
 'C10ORF77',
 'AR',
 'GUCA2',
 'MRS',
 'U3',
 'CSN3',
 'SSB1',
 'HNRPG',
 'BETA2',
 'NET5',
 'GLUT10',
 'H2BFA',
 'SCA31',
 'LNCBRM',
 'XBP1',
 'PAAT',
 'HARS2',
 'GST2',
 'PGG/HS',
 'ARIP1',
 'GALNT16',
 'SSX',
 'NF1L4',
 'GCR2',
 'IMP4',
 'H-PLK',
 'BTR',
 'HBB',
 'FSIP2-AS2',
 'TAF',
 'CHL1',
 'API5L1',
 'AGM1',
 'LINC02254',
 'BAM',
 'RGP3',
 'PI',
 'C20ORF197',
 'GAGE-9',
 'CRF',
 'COP',
 'KRTAP2.1A',
 'LINC01558',
 'BBP',
 'CLIC5',
 'TRIM34',
 'TCRAV7S1',
 'NECC1',
 'LINC02110',
 'L18',
 'SGS',
 'LCAD',
 'TRN',
 'ACA59',
 'P240',
 'NOP',
 'FGF2',
 'KRAS',
 'FAPP2',
 'ACTP1',
 'PRSSL1',
 'C9ORF36',
 'U5B1',
 'ERV1',
 'DSK2',
 'ME1',
 'SMN-AS1',
 'CREB3',
 'TAF2C2',
 'AGS2',
 'UGT1-01',
 'CRHSP24',
 'SAC2',
 'FBLP1',
 'RP66',
 'RN

In [11]:
ambiguous_symbol_set = set(item.strip() for item in ambiguous_symbol_set)

print(len(ambiguous_symbol_set))

5191


In [12]:
with open("../output/ambiguous_symbol_set.txt", "w") as file:
    for item in ambiguous_symbol_set:
        file.write(f"{item.strip()}\n")

In [13]:
with open("../output/ambiguous_symbol_set.txt", "r") as file:
    # read each line, strip newline characters, and convert to a set
    ambiguous_symbol_set = set(line.strip() for line in file)
len(ambiguous_symbol_set)

5191
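The write/read round trip above can be sketched in a self-contained form. The symbols below are a made-up sample, and writing them in sorted order and skipping blank lines on read are optional choices, not something the notebook does:

```python
import tempfile
from pathlib import Path

symbols = {"KRAS", "HBB", "PGG/HS", "KV11.1"}  # made-up sample

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "ambiguous_symbol_set.txt"

    # Sorting makes the file deterministic across runs (set order is not).
    path.write_text("".join(f"{s}\n" for s in sorted(symbols)), encoding="utf-8")

    # Skip blank lines so a stray empty line never becomes the symbol "".
    restored = {
        line.strip()
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    }

assert restored == symbols
```

A one-symbol-per-line file only round-trips cleanly when no symbol contains a newline or depends on leading or trailing whitespace.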

In [14]:
ambiguous_symbol_set

{'GUSBP1',
 'TRNAMI2',
 'CHL',
 'CAPS',
 'BD-4',
 'PRR20C',
 'SHP1',
 'PXMP1',
 'SPANXE',
 'TRY1',
 'TRF-GAA1-1',
 'HRG',
 'EEF1AKMT4',
 'C10ORF77',
 'U3',
 'SSB1',
 'HNRPG',
 'BETA2',
 'H2BFA',
 'HARS2',
 'GST2',
 'PGG/HS',
 'GALNT16',
 'IMP4',
 'H-PLK',
 'BTR',
 'HBB',
 'FSIP2-AS2',
 'TAF',
 'API5L1',
 'LINC02254',
 'RGP3',
 'PI',
 'C20ORF197',
 'CRF',
 'KRTAP2.1A',
 'BBP',
 'CLIC5',
 'TRIM34',
 'LINC02110',
 'TRN',
 'ACA59',
 'P240',
 'FGF2',
 'KRAS',
 'ACTP1',
 'PRSSL1',
 'ME1',
 'DSK2',
 'SMN-AS1',
 'CREB3',
 'TAF2C2',
 'CRHSP24',
 'UGT1-01',
 'SAC2',
 'DIA2',
 'LINC01017',
 'MDCR',
 'TEL2',
 'IGER',
 'RALBP1',
 'P50',
 'MYH14',
 'PAB1',
 'H2AC11',
 'H3.1',
 'SIP1',
 'FADS6',
 'VAT1',
 'C7ORF28A',
 'M-ABC1',
 'ADK',
 'HCCA2',
 'OI12',
 'SPO',
 'TRF-GAA1-3',
 'GLUR2',
 'PTC',
 'ASRT2',
 'TTTY21B',
 'TTP',
 'IL1RA',
 'P52',
 'DBC1',
 'RNH1',
 'CCA1',
 'RNF35',
 'CPD',
 'CAF1A',
 'SCCA1',
 'TRE-TTC1-1',
 'SPRK',
 'PK1',
 'LAK',
 'H3F3AP1',
 'HBD-8',
 'RIP',
 'RRP4',
 'TRA-TGC3-2',
 '

### <a id='toc1_1_2_'></a>[Primary exploration of DGIdb gene content using collisions](#toc0_)

#### <a id='toc1_1_2_1_'></a>[Load gene claim data from DGIdb](#toc0_)

In [15]:
dgidb_gene_df = pd.read_csv(
    "../input/dgidb_genes_JUNE.tsv", sep="\t", na_values=["", "NULL"], keep_default_na=False
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14
...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20


In [16]:
num_total_claims = len(dgidb_gene_df)

In [17]:
dgidb_gene_df.loc[dgidb_gene_df["gene_claim_name"] == "HES1"]

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
30772,HES1,Gene Symbol,hgnc:5192,HES1,Pharos,10-Apr-24
76844,HES1,Gene Symbol,hgnc:5192,HES1,HGNC,20240408
76845,HES1,Gene Symbol,hgnc:5192,HES1,NCBI,20240410
76846,HES1,Gene Symbol,hgnc:5192,HES1,Ensembl,111
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17


#### <a id='toc1_1_2_2_'></a>[How many claims are placed in a gene group with a different label?](#toc0_)

In [18]:
dgidb_gene_df.query("gene_name != gene_claim_name")

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14
...,...,...,...,...,...,...
79999,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,FoundationOneGenes,9/3/20
80014,ENSEMBL:ENSG00000185821,Ensembl Gene ID,hgnc:31305,OR6C76,RussLampel,26-Jul-11
80053,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,CarisMolecularIntelligence,9/4/20
80166,SEPT6,Gene Symbol,hgnc:15848,SEPTIN6,CarisMolecularIntelligence,9/4/20


#### <a id='toc1_1_2_3_'></a>[How many claims are not normalized?](#toc0_)

First, check for claims without a symbol, name, or identifier. None are expected, and the empty result below confirms that there are none.

In [19]:
no_claim_symbols_df = dgidb_gene_df[dgidb_gene_df["gene_claim_name"].isnull()]
no_claim_symbols_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version


In [20]:
no_name_symbols_df = dgidb_gene_df[dgidb_gene_df["gene_name"].isnull()]
num_not_normalized_claims = len(no_name_symbols_df)
num_not_normalized_claims

2144

In [21]:
num_normalized_claims = num_total_claims - num_not_normalized_claims
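As a worked example of the arithmetic behind the normalization rate: the total of 80234 rows is inferred from the `value_counts()` outputs later in the notebook (each pair of counts sums to it), and 2144 is the number of rows with a null `gene_name` reported above.

```python
num_total_claims = 80234          # rows in dgidb_gene_df (sum of any value_counts output)
num_not_normalized_claims = 2144  # rows with a null gene_name

num_normalized_claims = num_total_claims - num_not_normalized_claims
normalization_rate = num_normalized_claims / num_total_claims

print(num_normalized_claims)         # 78090
print(f"{normalization_rate:.2%}")   # 97.33%
```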

#### <a id='toc1_1_2_4_'></a>[Create set of gene groups (gene_name) and gene claims (gene_claim_name)](#toc0_)

In [22]:
dgidb_name_set = set(dgidb_gene_df["gene_name"])
len(dgidb_name_set)

12001

In [23]:
dgidb_gene_claim_name_set = set(dgidb_gene_df["gene_claim_name"])
len(dgidb_gene_claim_name_set)

26739

#### <a id='toc1_1_2_5_'></a>[Load HGNC, Ensembl (ENSG), and NCBI gene and alias sets](#toc0_)

Input: mini_hgnc_df, mini_ensg_df, mini_ncbi_df (total_alias_gene_intersections.ipynb)

Output: x_gene_symbol_set, x_alias_symbol_set where x is hgnc, ensg, or ncbi

In [24]:
mini_hgnc_df = pd.read_csv(
    "../output/mini_hgnc_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [25]:
mini_ensg_df = pd.read_csv(
    "../output/mini_ensg_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [26]:
mini_ncbi_df = pd.read_csv(
    "../output/mini_ncbi_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [27]:
hgnc_gene_symbol_set = set(mini_hgnc_df["gene_symbol"])
len(hgnc_gene_symbol_set)

45646

In [28]:
ensg_gene_symbol_set = set(mini_ensg_df["gene_symbol"])
len(ensg_gene_symbol_set)

41164

In [29]:
ncbi_gene_symbol_set = set(mini_ncbi_df["gene_symbol"])
len(ncbi_gene_symbol_set)

45727

In [30]:
hgnc_alias_symbol_set = set(mini_hgnc_df["alias_symbol"])
len(hgnc_alias_symbol_set)

43187

In [31]:
ensg_alias_symbol_set = set(mini_ensg_df["alias_symbol"])
len(ensg_alias_symbol_set)

55413

In [32]:
ncbi_alias_symbol_set = set(mini_ncbi_df["alias_symbol"])
len(ncbi_alias_symbol_set)

69565

#### <a id='toc1_1_2_6_'></a>[How many unique primary gene symbols are there b/w HGNC, Ensembl, and NCBI?](#toc0_)

In [33]:
hgnc_ensg_gene_symbol_set = hgnc_gene_symbol_set.union(ensg_gene_symbol_set)

In [34]:
hgnc_ensg_ncbi_gene_symbol_set = hgnc_ensg_gene_symbol_set.union(ncbi_gene_symbol_set)
len(hgnc_ensg_ncbi_gene_symbol_set)

47368

#### <a id='toc1_1_2_7_'></a>[How many unique alias symbols are there b/w HGNC, Ensembl, and NCBI?](#toc0_)

In [35]:
hgnc_ensg_alias_symbol_set = hgnc_alias_symbol_set.union(ensg_alias_symbol_set)

In [36]:
hgnc_ensg_ncbi_alias_symbol_set = hgnc_ensg_alias_symbol_set.union(
    ncbi_alias_symbol_set
)
len(hgnc_ensg_ncbi_alias_symbol_set)

93528
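The two-step unions above can also be written as a single call, since `set.union` accepts any number of iterables. A minimal sketch on made-up toy sets (the memberships are illustrative, not taken from the real sources):

```python
hgnc = {"KRAS", "HBB"}
ensg = {"KRAS", "SEM1"}
ncbi = {"HBB", "KIT"}

# One call, no intermediate set needed.
all_symbols = set().union(hgnc, ensg, ncbi)

# Same result as the two-step form used in the notebook.
assert all_symbols == hgnc.union(ensg).union(ncbi)
print(sorted(all_symbols))  # ['HBB', 'KIT', 'KRAS', 'SEM1']
```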

#### <a id='toc1_1_2_8_'></a>[How many unique group names are not primary gene symbols?](#toc0_)

In [37]:
name_ensg_notmatch = dgidb_name_set.difference(ensg_gene_symbol_set)
len(name_ensg_notmatch)

297

In [38]:
name_hgnc_notmatch = dgidb_name_set.difference(hgnc_gene_symbol_set)
len(name_hgnc_notmatch)

25

In [39]:
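# NOTE: str(float("nan")) is "nan", so comparing against "NaN" does not drop the
# missing label and the count below is unchanged; pd.notna(x) would filter it out.
cleaned_name_hgnc_notmatch = [x for x in name_hgnc_notmatch if str(x) != "NaN"]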
len(cleaned_name_hgnc_notmatch)

25

In [40]:
name_ncbi_notmatch = dgidb_name_set.difference(ncbi_gene_symbol_set)
len(name_ncbi_notmatch)

135

In [41]:
name_ncbi_hgnc_notmatch = name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(name_ncbi_hgnc_notmatch)

18

How many groups are labeled with a symbol not found in the sets of primary symbols from HGNC, NCBI, or ENSG?

In [42]:
name_ncbi_hgnc_ensg_notmatch = name_ncbi_hgnc_notmatch.difference(ensg_gene_symbol_set)
len(name_ncbi_hgnc_ensg_notmatch)

18

In [43]:
name_ncbi_hgnc_ensg_notmatch

{'ACT',
 'ATP5A2',
 'BTNL1',
 'CYB5P1',
 'GAPDHL17',
 'HEMC',
 'LOC102723996',
 'LOC102724428',
 'LOC105376944',
 'LOC112268384',
 'LOC112694756',
 'LOC344967',
 'LOC408186',
 'LOC653303',
 'OA1',
 'USP17L',
 'mCR',
 nan}
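The set above still contains `nan`, which comes from claims with no gene group. A short sketch of why a string comparison against `"NaN"` leaves it in place and how `pd.notna` removes it; the three-element set is a made-up subset of the output above:

```python
import pandas as pd

nan = float("nan")
unmatched = {"ACT", "OA1", nan}  # made-up subset of the set shown above

# str() of a float NaN is lowercase, so comparing against "NaN" filters nothing.
print(str(nan))  # nan
kept_by_string_test = [x for x in unmatched if str(x) != "NaN"]

# pd.notna handles float NaN, None and pd.NA alike.
cleaned = {x for x in unmatched if pd.notna(x)}

print(len(kept_by_string_test), len(cleaned))  # 3 2
```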

In [44]:
len(dgidb_name_set)

12001

#### <a id='toc1_1_2_9_'></a>[How many unique claims are not primary gene symbols?](#toc0_)

In [45]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_notmatch)

15030

In [46]:
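# NOTE: no claim names are missing (see the empty result earlier), and str(nan) is
# "nan" rather than "NaN", so this filter leaves the set unchanged.
cleaned_gene_claim_name_ensg_notmatch = [
    x for x in gene_claim_name_ensg_notmatch if str(x) != "NaN"
]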
len(cleaned_gene_claim_name_ensg_notmatch)

15030

In [47]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_notmatch)

14755

In [48]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_notmatch)

14867

In [49]:
gene_claim_name_ncbi_hngc_notmatch = gene_claim_name_ncbi_notmatch.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_ncbi_hngc_notmatch)

14748

How many unique claims are not found in the sets of primary symbols from HGNC, NCBI, or ENSG?

In [50]:
gene_claim_name_ncbi_hngc_ensg_notmatch = gene_claim_name_ncbi_hngc_notmatch.difference(
    ensg_gene_symbol_set
)
num_not_primary_claims = len(gene_claim_name_ncbi_hngc_ensg_notmatch)
num_not_primary_claims

14748
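Chaining `difference` calls, as in the cells above, is equivalent to removing the union of the three primary-symbol sets in one step. A minimal sketch on made-up toy sets (memberships are illustrative only):

```python
claims = {"KRAS", "SHFM1", "TR2", "HES1", "NGFIBA"}
hgnc = {"KRAS", "HES1"}
ncbi = {"KRAS"}
ensg = {"HES1", "SHFM1"}

# Chained form, as in the cells above.
chained = claims.difference(ncbi).difference(hgnc).difference(ensg)

# set.difference also accepts several iterables at once.
one_step = claims.difference(ncbi, hgnc, ensg)

assert chained == one_step == claims - (ncbi | hgnc | ensg)
print(sorted(one_step))  # ['NGFIBA', 'TR2']
```

Because the result does not depend on the order in which sources are removed, the final count is the number of claims absent from every source.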

#### <a id='toc1_1_2_10_'></a>[Load the collision sets from each data source](#toc0_)

Input: aa_collision_x_df.csv (total_alias_overlap.ipynb)

Output: x_alias_alias_collision_set

In [51]:
aa_collision_hgnc_df = pd.read_csv(
    "../output/aa_collision_gene_hgnc_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [52]:
hgnc_alias_alias_collision_set = set(aa_collision_hgnc_df["collision"])
len(hgnc_alias_alias_collision_set)

1250

In [53]:
aa_collision_ncbi_df = pd.read_csv(
    "../output/aa_collision_gene_ncbi_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [54]:
ncbi_alias_alias_collision_set = set(aa_collision_ncbi_df["collision"])
len(ncbi_alias_alias_collision_set)

3698

In [55]:
aa_collision_ensg_df = pd.read_csv(
    "../output/aa_collision_gene_ensg_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [56]:
ensg_alias_alias_collision_set = set(aa_collision_ensg_df["collision"])
len(ensg_alias_alias_collision_set)

1615

#### <a id='toc1_1_2_11_'></a>[How many unique group names are primary gene symbols?](#toc0_)

In [57]:
name_hgnc_match = dgidb_name_set.intersection(hgnc_gene_symbol_set)
len(name_hgnc_match)

11976

In [58]:
name_ensg_match = dgidb_name_set.intersection(ensg_gene_symbol_set)
len(name_ensg_match)

11704

In [59]:
name_ncbi_match = dgidb_name_set.intersection(ncbi_gene_symbol_set)
len(name_ncbi_match)

11866

In [60]:
name_ncbi_ensg_match = name_ncbi_match.intersection(ensg_gene_symbol_set)
len(name_ncbi_ensg_match)

11669

In [61]:
name_ncbi_ensg_hgnc_match = name_ncbi_ensg_match.intersection(hgnc_gene_symbol_set)
len(name_ncbi_ensg_hgnc_match)

11668

#### <a id='toc1_1_2_12_'></a>[How many of the groups that are labeled with a primary gene symbol are also alias-alias collisions?](#toc0_)

In [62]:
name_ncbi_match_aacollision = name_ncbi_match.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_match_aacollision)

135

In [63]:
name_ensg_match_aacollision = name_ensg_match.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_match_aacollision)

41

In [64]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_notmatch_aacollision)

3

#### <a id='toc1_1_2_13_'></a>[How many unique claim symbols are collisions?](#toc0_)

In [65]:
gene_claim_name_ensg_aacollision_match = dgidb_gene_claim_name_set.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_aacollision_match)

98

In [66]:
gene_claim_name_hgnc_aacollision_match = dgidb_gene_claim_name_set.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_aacollision_match)

77

In [67]:
gene_claim_name_ncbi_aacollision_match = dgidb_gene_claim_name_set.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_aacollision_match)

238

#### <a id='toc1_1_2_14_'></a>[How many unique groups are labeled with collisions?](#toc0_)

In [68]:
name_ensg_aacollision_match = dgidb_name_set.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_aacollision_match)

44

In [69]:
name_hgnc_aacollision_match = dgidb_name_set.intersection(
    hgnc_alias_alias_collision_set
)
len(name_hgnc_aacollision_match)

42

In [70]:
name_ncbi_aacollision_match = dgidb_name_set.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_aacollision_match)

137

#### <a id='toc1_1_2_15_'></a>[How many unique claims are not primary gene symbols?](#toc0_)

In [71]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_notmatch)

14755

In [72]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_notmatch)

15030

In [73]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_notmatch)

14867

#### <a id='toc1_1_2_16_'></a>[How many unique claims that were not primary symbols are collisions?](#toc0_)

In [74]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_notmatch_aacollision)

36

In [75]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_notmatch_aacollision)

103

In [76]:
gene_claim_name_ensg_notmatch_aacollision = gene_claim_name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_notmatch_aacollision)

57

#### <a id='toc1_1_2_17_'></a>[How many unique claims that are primary gene symbols are also collisions?](#toc0_)

In [77]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_match)

11984

In [78]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_match_aacollision)

41

In [79]:
gene_claim_name_ncbi_match = dgidb_gene_claim_name_set.intersection(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_match)

11872

In [80]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_match_aacollision)

135

In [81]:
gene_claim_name_ensg_match = dgidb_gene_claim_name_set.intersection(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_match)

11709

In [82]:
gene_claim_name_ensg_match_aacollision = gene_claim_name_ensg_match.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_match_aacollision)

41

#### <a id='toc1_1_2_18_'></a>[How many of the gene groups that are not primary gene symbols are alias-alias collisions from HGNC?](#toc0_)

In [83]:
name_hngc_notmatch_aacollision = name_hgnc_notmatch.intersection(
    hgnc_alias_alias_collision_set
)
len(name_hngc_notmatch_aacollision)

1

In [84]:
name_ncbi_notmatch_aacollision = name_ncbi_notmatch.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_notmatch_aacollision)

2

In [85]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_notmatch_aacollision)

3
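The per-source cells above repeat the same four set operations for HGNC, NCBI, and Ensembl. One possible way to collect them into a single table is sketched below. The `summarize` helper and the toy inputs are hypothetical; in the notebook the inputs would be the `dgidb_*`, `*_gene_symbol_set`, and `*_alias_alias_collision_set` objects.

```python
import pandas as pd


def summarize(labels, primary_by_source, collisions_by_source):
    """Count matches, non-matches and collision overlaps for each source."""
    rows = []
    for source, primary in primary_by_source.items():
        collisions = collisions_by_source[source]
        match = labels & primary
        notmatch = labels - primary
        rows.append(
            {
                "source": source,
                "match": len(match),
                "notmatch": len(notmatch),
                "match_aacollision": len(match & collisions),
                "notmatch_aacollision": len(notmatch & collisions),
            }
        )
    return pd.DataFrame(rows).set_index("source")


# Made-up toy inputs for illustration only.
labels = {"KRAS", "HBB", "ACT", "OA1"}
primary_by_source = {"hgnc": {"KRAS", "HBB"}, "ncbi": {"KRAS", "HBB", "ACT"}}
collisions_by_source = {"hgnc": {"ACT", "HBB"}, "ncbi": {"OA1"}}

summary = summarize(labels, primary_by_source, collisions_by_source)
print(summary)
```

For every source, `match + notmatch` equals the number of labels, which gives a quick consistency check on the table.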

#### <a id='toc1_1_2_19_'></a>[How many claims are ambiguous symbols?](#toc0_)

In [86]:
dgidb_gene_df["claim_ambiguous_status"] = dgidb_gene_df["gene_claim_name"].isin(
    ambiguous_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False
...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True


In [87]:
dgidb_gene_df["claim_ambiguous_status"].value_counts()

claim_ambiguous_status
False    75538
True      4696
Name: count, dtype: int64

In [88]:
ambiguous_claim_df = dgidb_gene_df.loc[dgidb_gene_df["claim_ambiguous_status"]]
num_ambiguous_claim = len(ambiguous_claim_df)
num_ambiguous_claim

4696

In [89]:
ambiguous_claim_df.loc[ambiguous_claim_df["gene_claim_name"] == "TR2"]

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status
18,TR2,NCBI Gene Name,hgnc:7971,NR2C1,BaderLab,Feb-14,True


#### <a id='toc1_1_2_20_'></a>[How many gene group labels are ambiguous symbols?](#toc0_)

In [90]:
dgidb_gene_df["name_ambiguous_status"] = dgidb_gene_df["gene_name"].isin(
    ambiguous_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False
...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False


In [91]:
dgidb_gene_df["name_ambiguous_status"].value_counts()

name_ambiguous_status
False    74661
True      5573
Name: count, dtype: int64

In [92]:
ambiguous_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_ambiguous_status"]]
num_ambiguous_name = len(ambiguous_name_df)
num_ambiguous_name

5573

#### <a id='toc1_1_2_21_'></a>[How many claims are primary symbols?](#toc0_)

In [93]:
dgidb_gene_df["claim_primary_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_gene_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False


In [94]:
dgidb_gene_df["claim_primary_status"].value_counts()

claim_primary_status
True     64185
False    16049
Name: count, dtype: int64

In [95]:
primary_claim_df = dgidb_gene_df.loc[dgidb_gene_df["claim_primary_status"]]
num_primary_claim = len(primary_claim_df)
num_primary_claim

64185

#### <a id='toc1_1_2_22_'></a>[How many gene group labels are primary symbols?](#toc0_)

In [96]:
dgidb_gene_df["name_primary_status"] = (
    dgidb_gene_df["gene_name"].astype(str).isin(hgnc_ensg_ncbi_gene_symbol_set)
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True,True
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True,True
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False,True


In [97]:
dgidb_gene_df["name_primary_status"].value_counts()

name_primary_status
True     78055
False     2179
Name: count, dtype: int64

In [98]:
primary_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_primary_status"]]
num_primary_name = len(primary_name_df)
num_primary_name

78055

In [99]:
not_primary_group_name_df = dgidb_gene_df.loc[~dgidb_gene_df["name_primary_status"]]
not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
74562,ATP5A2,Gene Symbol,ncbigene:499,ATP5A2,NCBI,20240410,False,False,False,False
74841,LOC112268384,Gene Symbol,ncbigene:112268384,LOC112268384,NCBI,20240410,False,False,False,False
74842,LOC112268384,Gene Symbol,ncbigene:112268384,LOC112268384,GO,10-Apr-24,False,False,False,False
75375,LOC102723996,Gene Symbol,ncbigene:102723996,LOC102723996,NCBI,20240410,False,False,False,False


#### <a id='toc1_1_2_23_'></a>[How many gene group labels are alias symbols?](#toc0_)

In [100]:
dgidb_gene_df["name_alias_status"] = dgidb_gene_df["gene_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True,True,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True,True,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False,True,False


In [101]:
dgidb_gene_df["name_alias_status"].value_counts()

name_alias_status
False    72168
True      8066
Name: count, dtype: int64

In [102]:
alias_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_alias_status"]]
num_alias_name = len(alias_name_df)
num_alias_name

8066
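In the table above, rows whose `gene_name` is missing still show `name_alias_status` as `True`. One plausible explanation (an assumption, not verified against the input files) is that the combined alias set contains NaN, which `Series.isin` matches against missing labels. A minimal sketch of a guard, on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows: the second claim has no gene group (missing gene_name).
df = pd.DataFrame({"gene_name": ["HES1", np.nan, "KIT"]})

# An alias set built from a column with missing values contains NaN too.
alias_set = {"HES1", np.nan}

# Removing missing values from the lookup set, and additionally requiring a
# non-null label, keeps unnormalized rows from being counted as aliases.
clean_alias_set = {s for s in alias_set if pd.notna(s)}
df["name_alias_status"] = df["gene_name"].notna() & df["gene_name"].isin(clean_alias_set)

print(df["name_alias_status"].tolist())  # [True, False, False]
```

If this is what happens in the notebook, the `True` count above would include unnormalized rows, so it may be worth comparing against a count restricted to non-null `gene_name` values.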

In [103]:
print("Calmbp1" in hgnc_ensg_ncbi_alias_symbol_set)

True


#### <a id='toc1_1_2_24_'></a>[How many claims are alias symbols?](#toc0_)

In [104]:
dgidb_gene_df["claim_alias_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True,True,False,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True,True,False,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False,True,False,True


In [105]:
dgidb_gene_df["claim_alias_status"].value_counts()

claim_alias_status
False    74602
True      5632
Name: count, dtype: int64

In [106]:
alias_claims_df = dgidb_gene_df.loc[dgidb_gene_df["claim_alias_status"]]
num_alias_claims = len(alias_claims_df)
num_alias_claims

5632

#### <a id='toc1_1_2_25_'></a>[How many claims that were normalized into non-primary symbol labeled groups are alias symbols?](#toc0_)

In [107]:
dgidb_gene_df["claim_alias_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
alias_claim_not_primary_group_name_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & ~dgidb_gene_df["name_primary_status"]
]
alias_claim_not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status
3251,USP17L,Gene Symbol,ncbigene:100862847,USP17L,dGene,27-Jun-13,False,False,False,False,True,True
5920,ALDA,Gene Name,ncbigene:112694756,LOC112694756,DrugBank,5.1.12,False,False,False,False,False,True
27217,SIK1B,Gene Symbol,ncbigene:102724428,LOC102724428,GO,10-Apr-24,False,False,True,False,False,True
31884,SIK1B,Gene Symbol,ncbigene:102724428,LOC102724428,Pharos,10-Apr-24,False,False,True,False,False,True
38737,USP17L,Gene Symbol,ncbigene:100862847,USP17L,NCBI,20240410,False,False,False,False,True,True
55006,ACT,Gene Symbol,ncbigene:389036,ACT,NCBI,20240410,True,True,False,False,True,True
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False,False,False,True,True
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False,False,False,True,True
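
The filter above combines two boolean columns built with `Series.isin`. The following is a minimal sketch of the same pattern on hypothetical toy data (the symbols and every `toy_*` name are made up for illustration and are not DGIdb records). It highlights one subtlety: a null group label is never in the primary set, so an alias claim that was not normalized would also pass `~name_primary_status` unless null labels are excluded explicitly. In the real data this does not matter, because the next cell finds no alias claims with a null `gene_name`.

```python
import pandas as pd

# Hypothetical toy claims; None marks a claim that was not normalized.
toy_df = pd.DataFrame(
    {
        "gene_claim_name": ["ABC1", "XYZ", "OLD2", "ALT3"],
        "gene_name": ["ABC1", "LOC999", "NEW2", None],
    }
)
toy_primary_set = {"ABC1", "NEW2"}  # assumed primary symbols
toy_alias_set = {"XYZ", "OLD2", "ALT3"}  # assumed alias symbols

toy_df["claim_alias_status"] = toy_df["gene_claim_name"].isin(toy_alias_set)
toy_df["name_primary_status"] = toy_df["gene_name"].isin(toy_primary_set)

# Same mask as the notebook cell: alias claim, non-primary group label.
toy_hits = toy_df.loc[toy_df["claim_alias_status"] & ~toy_df["name_primary_status"]]

# Stricter variant that also requires the claim to have been normalized.
toy_hits_normalized = toy_hits.loc[toy_hits["gene_name"].notna()]

print(toy_hits["gene_claim_name"].tolist())
print(toy_hits_normalized["gene_claim_name"].tolist())
```

The first list includes the un-normalized `ALT3` row, and the second drops it.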


#### <a id='toc1_1_2_26_'></a>[How many not normalized claims are alias symbols?](#toc0_)

In [108]:
alias_claim_null_name_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & dgidb_gene_df["gene_name"].isnull()
]
len(alias_claim_null_name_df)

0

#### <a id='toc1_1_2_27_'></a>[How many gene group labels are not primary, alias symbols, or null?](#toc0_)

In [109]:
other_name_df = dgidb_gene_df.loc[
    ~dgidb_gene_df["name_alias_status"]
    & ~dgidb_gene_df["name_primary_status"]
    & ~dgidb_gene_df["gene_name"].isnull()
]
num_other_name = len(other_name_df)
num_other_name

29

#### <a id='toc1_1_2_28_'></a>[How many claims are neither primary nor alias symbols?](#toc0_)

In [110]:
other_claim_df = dgidb_gene_df.loc[
    ~dgidb_gene_df["claim_alias_status"] & ~dgidb_gene_df["claim_primary_status"]
]
num_other_claim = len(other_claim_df)
num_other_claim

15284

#### <a id='toc1_1_2_29_'></a>[How many claims are primary and alias symbols?](#toc0_)

In [111]:
primaryandalias_claim_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & dgidb_gene_df["claim_primary_status"]
]
num_primaryandalias_claim = len(primaryandalias_claim_df)
num_primaryandalias_claim

4867

#### <a id='toc1_1_2_30_'></a>[How many primary symbol claims are normalized into non-primary gene group labels?](#toc0_)

In [112]:
claim_true_name_false_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_primary_status"] & ~dgidb_gene_df["name_primary_status"]
]
claim_true_name_false_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status
27217,SIK1B,Gene Symbol,ncbigene:102724428,LOC102724428,GO,10-Apr-24,False,False,True,False,False,True
31884,SIK1B,Gene Symbol,ncbigene:102724428,LOC102724428,Pharos,10-Apr-24,False,False,True,False,False,True


In [113]:
len(claim_true_name_false_df)

2

### <a id='toc1_1_3_'></a>[Summary](#toc0_)

#### <a id='toc1_1_3_1_'></a>[Normalization Rates](#toc0_)

In [114]:
normalization_index = "Normalized", "Not Normalized", "Total"
normalization_summary = {
    "Number of Claims": [
        num_normalized_claims,
        num_not_normalized_claims,
        num_total_claims,
    ]
}
normalization_summary_df = pd.DataFrame(
    normalization_summary, index=normalization_index
)
normalization_summary_df

Unnamed: 0,Number of Claims
Normalized,78090
Not Normalized,2144
Total,80234
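
As a quick consistency check on the table above, the two categories should partition the claims, and the counts can be turned into rates. This is a sketch that hard-codes the counts reported in the table so it runs on its own; the `*_rate` names are introduced here for illustration.

```python
# Counts copied from the normalization summary table above.
num_normalized_claims = 78090
num_not_normalized_claims = 2144
num_total_claims = 80234

# Every claim is either normalized or not, so the two counts must sum to the total.
assert num_normalized_claims + num_not_normalized_claims == num_total_claims

normalized_rate = num_normalized_claims / num_total_claims
not_normalized_rate = num_not_normalized_claims / num_total_claims
print(f"Normalized: {normalized_rate:.2%}, Not Normalized: {not_normalized_rate:.2%}")
```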


#### <a id='toc1_1_3_2_'></a>[Types of Claim Symbols](#toc0_)

- Primary and Alias are not mutually exclusive: some primary symbols are also used as aliases (alias-primary collisions).
- Other contains only claims that are neither Primary nor Alias symbols.
- Ambiguous symbols can be Primary or Alias symbols, so the Ambiguous row overlaps with those rows rather than adding to them.

In [115]:
claim_index = "Primary", "Alias", "Other", "Ambiguous", "Total"
claim_summary = {
    "Number of Claims": [
        num_primary_claim,
        num_alias_claims,
        num_other_claim,
        num_ambiguous_claim,
        num_total_claims,
    ]
}
claim_summary_df = pd.DataFrame(claim_summary, index=claim_index)
claim_summary_df

Unnamed: 0,Number of Claims
Primary,64185
Alias,5632
Other,15284
Ambiguous,4696
Total,80234
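
Because Primary and Alias overlap, the rows of this table do not sum to Total directly. A short inclusion-exclusion check, sketched here with the counts reported above plus the 4867 claims found earlier to be both primary and alias symbols, confirms that the categories are consistent. The name `num_primary_or_alias_claim` is introduced here for illustration.

```python
# Counts copied from the notebook outputs above.
num_primary_claim = 64185
num_alias_claims = 5632
num_other_claim = 15284
num_primaryandalias_claim = 4867  # claims flagged as both primary and alias
num_total_claims = 80234

# Inclusion-exclusion: |P or A| = |P| + |A| - |P and A|
num_primary_or_alias_claim = (
    num_primary_claim + num_alias_claims - num_primaryandalias_claim
)

# "Other" is the complement of (Primary or Alias), so together they cover all claims.
assert num_primary_or_alias_claim + num_other_claim == num_total_claims
print(num_primary_or_alias_claim)
```

The Ambiguous row is left out of the check on purpose, since ambiguous symbols are already counted under Primary or Alias.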


#### <a id='toc1_1_3_3_'></a>[Types of Normalizations (gene group labels claims are being normalized into)](#toc0_)

In [116]:
gene_group_index = "Primary", "Alias", "Null", "Other", "Ambiguous", "Total"
gene_group_summary = {
    "Number of Claims": [
        num_primary_name,
        num_alias_name,
        num_not_normalized_claims,
        num_other_name,
        num_ambiguous_name,
        num_total_claims,
    ]
}
gene_group_summary_df = pd.DataFrame(gene_group_summary, index=gene_group_index)
gene_group_summary_df

Unnamed: 0,Number of Claims
Primary,78055
Alias,8066
Null,2144
Other,29
Ambiguous,5573
Total,80234
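
The notebook does not report how many group labels are both primary and alias symbols, but the table constrains that number. The sketch below derives the implied overlap under two assumptions that are not verified here: that Primary, Alias, Null, and Other together cover every claim, and that null labels are flagged as neither primary nor alias. The `implied_*` names are introduced for illustration, and the result is an inference from the table, not a measured value.

```python
# Counts copied from the gene group summary table above.
num_primary_name = 78055
num_alias_name = 8066
num_not_normalized_claims = 2144  # null group labels
num_other_name = 29
num_total_claims = 80234

# Assumption: Null and Other are disjoint from Primary and Alias, so
# total = |P or A| + Null + Other and |P or A| = |P| + |A| - |P and A|.
implied_primary_or_alias_name = (
    num_total_claims - num_not_normalized_claims - num_other_name
)
implied_primaryandalias_name = (
    num_primary_name + num_alias_name - implied_primary_or_alias_name
)
print(implied_primaryandalias_name)
```

Running `(dgidb_gene_df["name_alias_status"] & dgidb_gene_df["name_primary_status"]).sum()` on the real frame would confirm or refute the implied value directly.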
