# <a id='toc1_'></a>[DGIdb ambiguous claims](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [DGIdb ambiguous claims](#toc1_)    
    - [Analyzing similarities b/w alias-primary and alias-alias collisions](#toc1_1_1_)    
      - [Load merged alias-alias and alias-primary collision sets. Symbols are the ambiguous symbols.](#toc1_1_1_1_)    
      - [How many of the alias-primary collisions are also alias-alias collisions?](#toc1_1_1_2_)    
      - [How many unique ambiguous symbols are there b/w alias-primary and alias-alias collisions?](#toc1_1_1_3_)    
    - [Primary exploration of DGIdb gene content using collisions](#toc1_1_2_)    
      - [Load gene claim data from DGIdb](#toc1_1_2_1_)    
      - [How many claims are placed in a gene group with a different label?](#toc1_1_2_2_)    
      - [How many claims are not normalized?](#toc1_1_2_3_)    
      - [Create set of gene groups (gene_name) and gene claims (gene_claim_name)](#toc1_1_2_4_)    
      - [Load HGNC, Ensembl (ENSG), and NCBI gene and alias sets](#toc1_1_2_5_)    
      - [How many unique primary gene symbols are there b/w HGNC, Ensembl, and NCBI?](#toc1_1_2_6_)    
      - [How many unique alias symbols are there b/w HGNC, Ensembl, and NCBI?](#toc1_1_2_7_)    
      - [How many unique group names are not primary gene symbols?](#toc1_1_2_8_)    
      - [How many unique claims are not primary gene symbols?](#toc1_1_2_9_)    
      - [Load the collision sets from each data source](#toc1_1_2_10_)    
      - [How many unique group names are primary gene symbols?](#toc1_1_2_11_)    
      - [How many of the groups that are labeled with a primary gene symbol are also alias-alias collisions?](#toc1_1_2_12_)    
      - [How many unique claim symbols are collisions?](#toc1_1_2_13_)    
      - [How many unique groups are labeled with collisions?](#toc1_1_2_14_)    
      - [How many unique claims are not primary gene symbols?](#toc1_1_2_15_)    
      - [How many unique claims that were not primary symbols are collisions?](#toc1_1_2_16_)    
      - [How many unique claims that are primary gene symbols are also collisions?](#toc1_1_2_17_)    
      - [How many of the gene groups that are not primary gene symbols are alias-alias collisions from HGNC?](#toc1_1_2_18_)    
      - [How many claims are ambiguous symbols?](#toc1_1_2_19_)    
      - [How many gene group labels are ambiguous symbols?](#toc1_1_2_20_)    
      - [How many claims are primary symbols?](#toc1_1_2_21_)    
      - [How many gene group labels are primary symbols?](#toc1_1_2_22_)    
      - [How many gene group labels are alias symbols?](#toc1_1_2_23_)    
      - [How many claims are alias symbols?](#toc1_1_2_24_)    
      - [How many claims that were normalized into non-primary symbol labeled groups are alias symbols?](#toc1_1_2_25_)    
      - [How many not normalized claims are alias symbols?](#toc1_1_2_26_)    
      - [How many gene group labels are not primary, alias symbols, or null?](#toc1_1_2_27_)    
      - [How many claims are not primary, alias symbols, or null?](#toc1_1_2_28_)    
      - [How many claims are primary and alias symbols?](#toc1_1_2_29_)    
      - [How many primary symbol claims are normalized into non-primary gene group labels?](#toc1_1_2_30_)    
    - [Summary](#toc1_1_3_)    
      - [Normalization Rates](#toc1_1_3_1_)    
      - [Types of Claim Symbols](#toc1_1_3_2_)    
      - [Types of Normalizations (gene group labels claims are being normalized into)](#toc1_1_3_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [241]:
import pandas as pd
import numpy as np
import plotly.express as px

### <a id='toc1_1_1_'></a>[Analyzing similarities b/w alias-primary and alias-alias collisions](#toc0_)

#### <a id='toc1_1_1_1_'></a>[Load merged alias-alias and alias-primary collision sets. Symbols are the ambiguous symbols.](#toc0_)

Input: merged_alias_primary_collisions_df.csv (total_alias_gene_intersections.ipynb), merged_aa_collision_alias_df.csv (total_alias_overlap.ipynb)

Output: merged_alias_primary_collision_set, merged_alias_alias_collision_set

In [242]:
merged_alias_primary_collisions_df = pd.read_csv(
    "../created_files/merged_alias_primary_collisions_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [243]:
merged_alias_primary_collision_set = set(
    merged_alias_primary_collisions_df["collision"]
)
len(merged_alias_primary_collision_set)

1677

In [244]:
merged_alias_alias_collisions_df = pd.read_csv(
    "../created_files/merged_aa_collision_alias_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [245]:
merged_alias_alias_collision_set = set(
    merged_alias_alias_collisions_df["collision"].tolist()
)
len(merged_alias_alias_collision_set)

3824

#### <a id='toc1_1_1_2_'></a>[How many of the alias-primary collisions are also alias-alias collisions?](#toc0_)

In [246]:
print(
    len(
        merged_alias_alias_collision_set.intersection(
            merged_alias_primary_collision_set
        )
    )
)

271


#### <a id='toc1_1_1_3_'></a>[How many unique ambiguous symbols are there b/w alias-primary and alias-alias collisions?](#toc0_)

In [247]:
ambiguous_symbol_set = merged_alias_alias_collision_set.union(
    merged_alias_primary_collision_set
)
print(len(ambiguous_symbol_set))

5230
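
As a sanity check, the three counts above agree with inclusion-exclusion: the size of the union should equal the sum of the two set sizes minus their overlap. A minimal sketch using the reported counts, plus two small made-up sets (the toy symbols are illustrative only):

```python
# Counts reported by the cells above.
n_alias_primary = 1677
n_alias_alias = 3824
n_both = 271

# |A union B| = |A| + |B| - |A intersect B|
expected_union = n_alias_primary + n_alias_alias - n_both
print(expected_union)  # 5230, matching len(ambiguous_symbol_set)

# The same identity on two toy sets (hypothetical symbols).
toy_a = {"AAA1", "BBB2", "CCC3"}
toy_b = {"CCC3", "DDD4"}
toy_identity_holds = len(toy_a | toy_b) == len(toy_a) + len(toy_b) - len(toy_a & toy_b)
print(toy_identity_holds)
```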


In [248]:
ambiguous_symbol_set

{'TAT',
 'SAR',
 'DELTA',
 'PFD',
 'MK2',
 'CAC1',
 'DAND1',
 'CAP3',
 'SPC3',
 'NKHC',
 'PAC1',
 'HBAB',
 'ZNF20',
 'PCT',
 'ELP1',
 'NRF2',
 'PAK5',
 'CEP1',
 'FAM90A17P',
 'PCGF4',
 'DEF6',
 'GAR1',
 'HXBL',
 'MRT1',
 'AIS1',
 'BIG3',
 'PHR1',
 'FOP',
 'TCRAV11S1',
 'NBPF',
 'MIO',
 'EFTUD1',
 'MCSP',
 'MIP-1-BETA',
 'GR',
 'CBS',
 'ELF1',
 'GCP-1',
 'TAT1',
 'CRAM',
 'CANP',
 'MGAT1',
 'ME2',
 'YMR292W',
 'TTL',
 'HL-2',
 'CT4.7',
 'TRB',
 'TCRAV24S1',
 'SMILE',
 'SPACDR',
 'HGPS',
 'NPIPA8',
 'FAM18B2',
 'CNIH2',
 'CBLC',
 'CPAD',
 'OR17-16',
 'AST',
 'KIR',
 'WDR82',
 'ADAM18',
 'NSCL2',
 'MUC5',
 'D3',
 'HSD11',
 'NET4',
 'TSPY8',
 'SES2',
 'POP1',
 'SCYL3',
 'OR11H12',
 'U7',
 'ECGP',
 'TRR-TCT3-1',
 'E3',
 'KRTAP4-7',
 'COAS3',
 'NOS',
 'HMCS',
 'TRP-AGG2-5',
 'HCAP',
 'AMD1',
 'OR2A7',
 'TARG1',
 'OS4',
 'ARG',
 'CDS1',
 'ECA2',
 'POLG',
 'MIHC',
 'SMUC',
 'OCC-1',
 'TM',
 'SFPQ',
 'BMH',
 'CRP1',
 'PFHB1',
 'CHAMP',
 'COD1',
 'PRP',
 'DEL16P11.2',
 'MYT1',
 'EFA6R',
 'NY-REN

In [249]:
ambiguous_symbol_set = set(item.strip() for item in ambiguous_symbol_set)

print(len(ambiguous_symbol_set))

5230


In [250]:
with open("created_files/ambiguous_symbol_set.txt", "w") as file:
    for item in ambiguous_symbol_set:
        file.write(f"{item.strip()}\n")

In [251]:
with open("created_files/ambiguous_symbol_set.txt", "r") as file:
    # Read each line, strip newline characters, and convert to a set
    ambiguous_symbol_set = set(line.strip() for line in file)
len(ambiguous_symbol_set)

5230
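
The write/read round trip above can be checked directly. The sketch below uses a temporary directory and a small made-up set instead of the real file; `save_symbols` and `load_symbols` are hypothetical helper names, not functions from this notebook. Sorting before writing is an optional choice that keeps the file stable across runs.

```python
import tempfile
from pathlib import Path


def save_symbols(symbols, path):
    """Write one stripped symbol per line, sorted for a reproducible file."""
    with open(path, "w") as handle:
        for symbol in sorted(symbols):
            handle.write(f"{symbol.strip()}\n")


def load_symbols(path):
    """Read one symbol per line back into a set, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


toy_symbols = {"TAT", "SAR", "NRF2"}  # illustrative subset
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "ambiguous_symbol_set.txt"
    save_symbols(toy_symbols, out_path)
    round_trip = load_symbols(out_path)

print(round_trip == toy_symbols)
```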

In [252]:
ambiguous_symbol_set

{'TAT',
 'SAR',
 'PFD',
 'DAND1',
 'CAC1',
 'CAP3',
 'SPC3',
 'PAC1',
 'ZNF20',
 'PCT',
 'ELP1',
 'NRF2',
 'PAK5',
 'PCGF4',
 'HXBL',
 'MRT1',
 'AIS1',
 'FOP',
 'EFTUD1',
 'MIO',
 'MIP-1-BETA',
 'ELF1',
 'TAT1',
 'CRAM',
 'MGAT1',
 'ME2',
 'SMILE',
 'HGPS',
 'FAM18B2',
 'CNIH2',
 'CBLC',
 'CPAD',
 'WDR82',
 'AST',
 'ADAM18',
 'MUC5',
 'D3',
 'HSD11',
 'SCYL3',
 'ECGP',
 'TRR-TCT3-1',
 'E3',
 'COAS3',
 'TRP-AGG2-5',
 'HCAP',
 'AMD1',
 'OR2A7',
 'MIHC',
 'OCC-1',
 'TM',
 'BMH',
 'CRP1',
 'PFHB1',
 'CHAMP',
 'COD1',
 'PRP',
 'DEL16P11.2',
 'MYT1',
 'NY-REN-2',
 'FADS1',
 'MS4A8',
 'GL',
 'SPANX',
 'PLAP-1',
 'CRD',
 'DGCR6',
 'APT2',
 'HAS1',
 'SSB2',
 'P160',
 'MRX92',
 'BPAD',
 'POTE2BETA',
 'GABPB2',
 'ILT5',
 'ICBP90',
 'PAP-ALPHA',
 'EXOC1',
 'CAP2',
 'LCAD',
 'CYP2D7AP',
 'AD10',
 'OR7E75P',
 'HCG',
 'IBP1',
 'ZBP1',
 'TSPY1',
 'AK3L2',
 'TRNAI-AAU',
 'TIM',
 'PSS3',
 'TP2',
 'DRP3',
 'AG2',
 'CK20',
 'GT1',
 '2F1',
 'MRPS12',
 'MORF4LP4',
 'PSM',
 'ADCY3',
 'FABP5P1',
 'GST1',
 'CP

### <a id='toc1_1_2_'></a>[Primary exploration of DGIdb gene content using collisions](#toc0_)

#### <a id='toc1_1_2_1_'></a>[Load gene claim data from DGIdb](#toc0_)

In [253]:
dgidb_gene_df = pd.read_csv(
    "downloaded_files/dgidb_genes_JUNE.tsv", sep="\t", na_values=["", "NULL"], keep_default_na=False
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14
...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20


In [254]:
num_total_claims = len(dgidb_gene_df)

In [255]:
dgidb_gene_df.loc[dgidb_gene_df["gene_claim_name"] == "HES1"]

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
30772,HES1,Gene Symbol,hgnc:5192,HES1,Pharos,10-Apr-24
76844,HES1,Gene Symbol,hgnc:5192,HES1,HGNC,20240408
76845,HES1,Gene Symbol,hgnc:5192,HES1,NCBI,20240410
76846,HES1,Gene Symbol,hgnc:5192,HES1,Ensembl,111
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17


#### <a id='toc1_1_2_2_'></a>[How many claims are placed in a gene group with a different label?](#toc0_)

In [256]:
dgidb_gene_df.query("gene_name != gene_claim_name")

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14
...,...,...,...,...,...,...
79999,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,FoundationOneGenes,9/3/20
80014,ENSEMBL:ENSG00000185821,Ensembl Gene ID,hgnc:31305,OR6C76,RussLampel,26-Jul-11
80053,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,CarisMolecularIntelligence,9/4/20
80166,SEPT6,Gene Symbol,hgnc:15848,SEPTIN6,CarisMolecularIntelligence,9/4/20
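
Note that the result above includes rows whose `gene_name` is null (rows 0 to 4), because a comparison against NaN evaluates as "not equal". A minimal sketch, on a made-up three-row frame, of restricting the comparison to claims that were actually normalized (boolean masks are used here in place of `query`):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two relevant columns (values are illustrative).
toy_df = pd.DataFrame(
    {
        "gene_claim_name": ["NGFIBA", "SEPT5", "KIT"],
        "gene_name": [np.nan, "SEPTIN5", "KIT"],
    }
)

# NaN != "NGFIBA" is True, so the un-normalized row is included here.
different_label = toy_df[toy_df["gene_name"] != toy_df["gene_claim_name"]]

# Require a non-null group label to keep only relabelled, normalized claims.
relabelled = toy_df[
    toy_df["gene_name"].notna()
    & (toy_df["gene_name"] != toy_df["gene_claim_name"])
]
print(len(different_label), len(relabelled))
```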


#### <a id='toc1_1_2_3_'></a>[How many claims are not normalized?](#toc0_)

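First check for claims without a symbol, name, or identifier in `gene_claim_name` (there should not be any). Claims with a null `gene_name` are the ones that were not normalized.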

In [257]:
no_claim_symbols_df = dgidb_gene_df[dgidb_gene_df["gene_claim_name"].isnull()]
no_claim_symbols_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version


In [258]:
no_name_symbols_df = dgidb_gene_df[dgidb_gene_df["gene_name"].isnull()]
num_not_normalized_claims = len(no_name_symbols_df)
num_not_normalized_claims

2144

In [259]:
num_normalized_claims = num_total_claims - num_not_normalized_claims
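
For reference, the same subtraction can be turned into a normalization rate. The sketch below plugs in the 2144 un-normalized claims found above and the total of 80234 claims that appears in the summary tables of this notebook; the short variable names are illustrative.

```python
total = 80234           # total rows in dgidb_gene_df (from the summary table)
not_normalized = 2144   # rows with a null gene_name

normalized = total - not_normalized
normalization_rate = normalized / total

print(normalized)                   # 78090
print(f"{normalization_rate:.2%}")  # 97.33%
```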

#### <a id='toc1_1_2_4_'></a>[Create set of gene groups (gene_name) and gene claims (gene_claim_name)](#toc0_)

In [260]:
dgidb_name_set = set(dgidb_gene_df["gene_name"])
len(dgidb_name_set)

12001

In [261]:
dgidb_gene_claim_name_set = set(dgidb_gene_df["gene_claim_name"])
len(dgidb_gene_claim_name_set)

26739

#### <a id='toc1_1_2_5_'></a>[Load HGNC, Ensembl (ENSG), and NCBI gene and alias sets](#toc0_)

Input: mini_hgnc_df, mini_ensg_df, mini_ncbi_df (total_alias_gene_intersections.ipynb)

Output: x_gene_symbol_set, x_alias_symbol_set where x is hgnc, ensg, or ncbi

In [262]:
mini_hgnc_df = pd.read_csv(
    "../created_files/mini_hgnc_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [263]:
mini_ensg_df = pd.read_csv(
    "../created_files/mini_ensg_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [264]:
mini_ncbi_df = pd.read_csv(
    "../created_files/mini_ncbi_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [265]:
hgnc_gene_symbol_set = set(mini_hgnc_df["gene_symbol"])
len(hgnc_gene_symbol_set)

45646

In [266]:
ensg_gene_symbol_set = set(mini_ensg_df["gene_symbol"])
len(ensg_gene_symbol_set)

41068

In [267]:
ncbi_gene_symbol_set = set(mini_ncbi_df["gene_symbol"])
len(ncbi_gene_symbol_set)

193303

In [268]:
hgnc_alias_symbol_set = set(mini_hgnc_df["alias_symbol"])
len(hgnc_alias_symbol_set)

43187

In [269]:
ensg_alias_symbol_set = set(mini_ensg_df["alias_symbol"])
len(ensg_alias_symbol_set)

55276

In [270]:
ncbi_alias_symbol_set = set(mini_ncbi_df["alias_symbol"])
len(ncbi_alias_symbol_set)

69158

#### <a id='toc1_1_2_6_'></a>[How many unique primary gene symbols are there b/w HGNC, Ensembl, and NCBI?](#toc0_)

In [271]:
hgnc_ensg_gene_symbol_set = hgnc_gene_symbol_set.union(ensg_gene_symbol_set)

In [272]:
hgnc_ensg_ncbi_gene_symbol_set = hgnc_ensg_gene_symbol_set.union(ncbi_gene_symbol_set)
len(hgnc_ensg_ncbi_gene_symbol_set)

194866

#### <a id='toc1_1_2_7_'></a>[How many unique alias symbols are there b/w HGNC, Ensembl, and NCBI?](#toc0_)

In [273]:
hgnc_ensg_alias_symbol_set = hgnc_alias_symbol_set.union(ensg_alias_symbol_set)

In [274]:
hgnc_ensg_ncbi_alias_symbol_set = hgnc_ensg_alias_symbol_set.union(
    ncbi_alias_symbol_set
)
len(hgnc_ensg_ncbi_alias_symbol_set)

93036
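
The pairwise unions above can also be written as a single `set.union` call, which avoids the intermediate `hgnc_ensg_*` sets. A minimal sketch on made-up symbol sets (the variable names and contents are illustrative, not the notebook's data):

```python
hgnc_toy = {"KIT", "HES1"}
ensg_toy = {"KIT", "IRF1"}
ncbi_toy = {"HES1", "SEM1"}

# Two chained unions, as in the cells above.
pairwise = hgnc_toy.union(ensg_toy).union(ncbi_toy)

# One call taking all three sets at once.
single_call = set().union(hgnc_toy, ensg_toy, ncbi_toy)

print(single_call == pairwise)
print(len(single_call))
```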

#### <a id='toc1_1_2_8_'></a>[How many unique group names are not primary gene symbols?](#toc0_)

In [275]:
name_ensg_notmatch = dgidb_name_set.difference(ensg_gene_symbol_set)
len(name_ensg_notmatch)

298

In [276]:
name_hgnc_notmatch = dgidb_name_set.difference(hgnc_gene_symbol_set)
len(name_hgnc_notmatch)

25

In [277]:
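# Note: str(float("nan")) is "nan" (lowercase), so this comparison does not
# remove missing values; the count below is unchanged at 25.
cleaned_name_hgnc_notmatch = [x for x in name_hgnc_notmatch if str(x) != "NaN"]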
len(cleaned_name_hgnc_notmatch)

25

In [278]:
name_ncbi_notmatch = dgidb_name_set.difference(ncbi_gene_symbol_set)
len(name_ncbi_notmatch)

97

In [279]:
name_ncbi_hgnc_notmatch = name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(name_ncbi_hgnc_notmatch)

9

How many groups are labeled with a symbol not found in the sets of primary symbols from HGNC, NCBI, or ENSG?

In [280]:
name_ncbi_hgnc_ensg_notmatch = name_ncbi_hgnc_notmatch.difference(ensg_gene_symbol_set)
len(name_ncbi_hgnc_ensg_notmatch)

8

In [281]:
len(dgidb_name_set)

12001

#### <a id='toc1_1_2_9_'></a>[How many unique claims are not primary gene symbols?](#toc0_)

In [282]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_notmatch)

15029

In [283]:
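# Note: str(float("nan")) is "nan" (lowercase), so this comparison does not
# remove missing values; the count below is unchanged at 15029.
cleaned_gene_claim_name_ensg_notmatch = [
    x for x in gene_claim_name_ensg_notmatch if str(x) != "NaN"
]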
len(cleaned_gene_claim_name_ensg_notmatch)

15029

In [284]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_notmatch)

14755

In [285]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_notmatch)

14828

In [286]:
gene_claim_name_ncbi_hngc_notmatch = gene_claim_name_ncbi_notmatch.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_ncbi_hngc_notmatch)

14738

How many unique claims are not found in the sets of primary symbols from HGNC, NCBI, or ENSG?

In [287]:
gene_claim_name_ncbi_hngc_ensg_notmatch = gene_claim_name_ncbi_hngc_notmatch.difference(
    ensg_gene_symbol_set
)
num_not_primary_claims = len(gene_claim_name_ncbi_hngc_ensg_notmatch)
num_not_primary_claims

14735
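
Chaining `difference` across the three sources is equivalent to taking one difference against their union (the notebook already builds that union as `hgnc_ensg_ncbi_gene_symbol_set`). A minimal sketch with made-up sets to illustrate the equivalence:

```python
claims_toy = {"KIT", "SHFM1", "TR2", "NGFIBA"}
hgnc_toy = {"KIT"}
ncbi_toy = {"KIT", "SEM1"}
ensg_toy = {"IRF1"}

# Step-by-step removal, as in the cells above.
chained = claims_toy.difference(ncbi_toy).difference(hgnc_toy).difference(ensg_toy)

# One difference against the union of all primary symbol sets.
one_step = claims_toy - (hgnc_toy | ncbi_toy | ensg_toy)

print(chained == one_step)
print(sorted(one_step))
```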

#### <a id='toc1_1_2_10_'></a>[Load the collision sets from each data source](#toc0_)

Input: aa_collision_x_count_df.csv (total_alias_overlap.ipynb)

Output: x_alias_alias_collision_set

In [288]:
aa_collision_hgnc_df = pd.read_csv(
    "../created_files/aa_collision_HGNC_count_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [289]:
hgnc_alias_alias_collision_set = set(aa_collision_hgnc_df["collision"])
len(hgnc_alias_alias_collision_set)

1250

In [290]:
aa_collision_ncbi_df = pd.read_csv(
    "../created_files/aa_collision_NCBI_count_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [291]:
ncbi_alias_alias_collision_set = set(aa_collision_ncbi_df["collision"])
len(ncbi_alias_alias_collision_set)

3711

In [292]:
aa_collision_ensg_df = pd.read_csv(
    "../created_files/aa_collision_ENSG_count_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [293]:
ensg_alias_alias_collision_set = set(aa_collision_ensg_df["collision"])
len(ensg_alias_alias_collision_set)

1617
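
The three load-and-convert steps above follow the same pattern, so they could be wrapped in a small helper. The sketch below is only an illustration under assumptions: `load_collision_set` is a hypothetical name, and the CSV is an in-memory stand-in with a `collision` column rather than the real `aa_collision_*_count_df.csv` files.

```python
import io

import pandas as pd


def load_collision_set(csv_source, column="collision"):
    """Read a collision table and return its non-null symbols as a set."""
    df = pd.read_csv(csv_source, na_values=["", "NULL"], keep_default_na=False)
    return set(df[column].dropna())


# Made-up rows; "NA" stays a string because keep_default_na=False.
toy_csv = io.StringIO("collision,count\nTR2,2\nACT,3\nNA,2\n")
toy_collision_set = load_collision_set(toy_csv)
print(sorted(toy_collision_set))
```

With `keep_default_na=False`, only the values listed in `na_values` are parsed as missing, so a symbol spelled `NA` survives as text.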

#### <a id='toc1_1_2_11_'></a>[How many unique group names are primary gene symbols?](#toc0_)

In [294]:
name_hgnc_match = dgidb_name_set.intersection(hgnc_gene_symbol_set)
len(name_hgnc_match)

11976

In [295]:
name_ensg_match = dgidb_name_set.intersection(ensg_gene_symbol_set)
len(name_ensg_match)

11703

In [296]:
name_ncbi_match = dgidb_name_set.intersection(ncbi_gene_symbol_set)
len(name_ncbi_match)

11904

In [297]:
name_ncbi_ensg_match = name_ncbi_match.intersection(ensg_gene_symbol_set)
len(name_ncbi_ensg_match)

11685

In [298]:
name_ncbi_ensg_hgnc_match = name_ncbi_ensg_match.intersection(hgnc_gene_symbol_set)
len(name_ncbi_ensg_hgnc_match)

11684

#### <a id='toc1_1_2_12_'></a>[How many of the groups that are labeled with a primary gene symbol are also alias-alias collisions?](#toc0_)

In [299]:
name_ncbi_match_aacollision = name_ncbi_match.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_match_aacollision)

136

In [300]:
name_ensg_match_aacollision = name_ensg_match.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_match_aacollision)

42

In [301]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_notmatch_aacollision)

3

#### <a id='toc1_1_2_13_'></a>[How many unique claim symbols are collisions?](#toc0_)

In [302]:
gene_claim_name_ensg_aacollision_match = dgidb_gene_claim_name_set.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_aacollision_match)

98

In [303]:
gene_claim_name_hgnc_aacollision_match = dgidb_gene_claim_name_set.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_aacollision_match)

77

In [304]:
gene_claim_name_ncbi_aacollision_match = dgidb_gene_claim_name_set.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_aacollision_match)

240

#### <a id='toc1_1_2_14_'></a>[How many unique groups are labeled with collisions?](#toc0_)

In [305]:
name_ensg_aacollision_match = dgidb_name_set.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_aacollision_match)

45

In [306]:
name_hgnc_aacollision_match = dgidb_name_set.intersection(
    hgnc_alias_alias_collision_set
)
len(name_hgnc_aacollision_match)

42

In [307]:
name_ncbi_aacollision_match = dgidb_name_set.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_aacollision_match)

138

#### <a id='toc1_1_2_15_'></a>[How many unique claims are not primary gene symbols?](#toc0_)

In [308]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_notmatch)

14755

In [309]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_notmatch)

15029

In [310]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_notmatch)

14828

#### <a id='toc1_1_2_16_'></a>[How many unique claims that were not primary symbols are collisions?](#toc0_)

In [311]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_notmatch_aacollision)

36

In [312]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_notmatch_aacollision)

104

In [313]:
gene_claim_name_ensg_notmatch_aacollision = gene_claim_name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_notmatch_aacollision)

56

#### <a id='toc1_1_2_17_'></a>[How many unique claims that are primary gene symbols are also collisions?](#toc0_)

In [314]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_match)

11984

In [315]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_match_aacollision)

41

In [316]:
gene_claim_name_ncbi_match = dgidb_gene_claim_name_set.intersection(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_match)

11911

In [317]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_match_aacollision)

136

In [318]:
gene_claim_name_ensg_match = dgidb_gene_claim_name_set.intersection(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_match)

11710

In [319]:
gene_claim_name_ensg_match_aacollision = gene_claim_name_ensg_match.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_match_aacollision)

42

#### <a id='toc1_1_2_18_'></a>[How many of the gene groups that are not primary gene symbols are alias-alias collisions from HGNC?](#toc0_)

In [320]:
name_hngc_notmatch_aacollision = name_hgnc_notmatch.intersection(
    hgnc_alias_alias_collision_set
)
len(name_hngc_notmatch_aacollision)

1

In [321]:
name_ncbi_notmatch_aacollision = name_ncbi_notmatch.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_notmatch_aacollision)

2

In [322]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_notmatch_aacollision)

3

#### <a id='toc1_1_2_19_'></a>[How many claims are ambiguous symbols?](#toc0_)

In [323]:
dgidb_gene_df["claim_ambiguous_status"] = dgidb_gene_df["gene_claim_name"].isin(
    ambiguous_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False
...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True


In [324]:
dgidb_gene_df["claim_ambiguous_status"].value_counts()

claim_ambiguous_status
False    75358
True      4876
Name: count, dtype: int64

In [325]:
ambiguous_claim_df = dgidb_gene_df.loc[dgidb_gene_df["claim_ambiguous_status"]]
num_ambiguous_claim = len(ambiguous_claim_df)
num_ambiguous_claim

4876
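
The flag-then-count pattern used here (and in the following sections) can be condensed: summing a boolean column counts the `True` rows, and its mean gives the fraction. A minimal sketch on made-up data:

```python
import pandas as pd

toy_df = pd.DataFrame({"gene_claim_name": ["TR2", "KIT", "SHFM1", "IRF1"]})
toy_ambiguous = {"TR2", "SHFM1"}  # illustrative ambiguous symbols

toy_df["claim_ambiguous_status"] = toy_df["gene_claim_name"].isin(toy_ambiguous)

# Sum counts the flagged rows; mean gives the share of flagged rows.
num_flagged = int(toy_df["claim_ambiguous_status"].sum())
fraction_flagged = float(toy_df["claim_ambiguous_status"].mean())
print(num_flagged, fraction_flagged)
```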

In [326]:
ambiguous_claim_df.loc[ambiguous_claim_df["gene_claim_name"] == "TR2"]

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status
18,TR2,NCBI Gene Name,hgnc:7971,NR2C1,BaderLab,Feb-14,True


#### <a id='toc1_1_2_20_'></a>[How many gene group labels are ambiguous symbols?](#toc0_)

In [327]:
dgidb_gene_df["name_ambiguous_status"] = dgidb_gene_df["gene_name"].isin(
    ambiguous_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False
...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False


In [328]:
dgidb_gene_df["name_ambiguous_status"].value_counts()

name_ambiguous_status
False    74459
True      5775
Name: count, dtype: int64

In [329]:
ambiguous_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_ambiguous_status"]]
num_ambiguous_name = len(ambiguous_name_df)
num_ambiguous_name

5775

#### <a id='toc1_1_2_21_'></a>[How many claims are primary symbols?](#toc0_)

In [330]:
dgidb_gene_df["claim_primary_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_gene_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False


In [331]:
dgidb_gene_df["claim_primary_status"].value_counts()

claim_primary_status
True     64209
False    16025
Name: count, dtype: int64

In [332]:
primary_claim_df = dgidb_gene_df.loc[dgidb_gene_df["claim_primary_status"]]
num_primary_claim = len(primary_claim_df)
num_primary_claim

64209

#### <a id='toc1_1_2_22_'></a>[How many gene group labels are primary symbols?](#toc0_)

In [333]:
dgidb_gene_df["name_primary_status"] = (
    dgidb_gene_df["gene_name"].astype(str).isin(hgnc_ensg_ncbi_gene_symbol_set)
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True,True
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True,True
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False,True


In [334]:
dgidb_gene_df["name_primary_status"].value_counts()

name_primary_status
True     78074
False     2160
Name: count, dtype: int64

In [335]:
primary_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_primary_status"]]
num_primary_name = len(primary_name_df)
num_primary_name

78074

In [336]:
not_primary_group_name_df = dgidb_gene_df.loc[~dgidb_gene_df["name_primary_status"]]
not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
57108,NCBIGENE:2614,NCBI Gene ID,ncbigene:2614,GAPDHL17,GuideToPharmacology,2024.1,False,False,False,False
57552,CYB5P1,Gene Symbol,ncbigene:1529,CYB5P1,NCBI,20240410,False,False,False,False
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False,False,False
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False,False,False


#### <a id='toc1_1_2_23_'></a>[How many gene group labels are alias symbols?](#toc0_)

In [337]:
dgidb_gene_df["name_alias_status"] = dgidb_gene_df["gene_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True,True,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True,True,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False,True,False


In [338]:
dgidb_gene_df["name_alias_status"].value_counts()

name_alias_status
False    72260
True      7974
Name: count, dtype: int64

In [339]:
alias_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_alias_status"]]
num_alias_name = len(alias_name_df)
num_alias_name

7974
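
In the table above, rows 0 to 4 have a null `gene_name` yet `name_alias_status` is `True`. One possible explanation is that the alias set itself contains a missing value; this is an inference from the displayed output, not something the notebook verifies. A minimal sketch, on made-up data, of dropping missing values before building the lookup set so that null labels are not flagged:

```python
import numpy as np
import pandas as pd

toy_alias_df = pd.DataFrame({"alias_symbol": ["TR2", np.nan, "SHFM1"]})
toy_names = pd.Series(["SEM1", np.nan, "TR2"], name="gene_name")

# dropna() keeps NaN out of the lookup set.
clean_alias_set = set(toy_alias_df["alias_symbol"].dropna())

name_alias_status = toy_names.isin(clean_alias_set)
print(name_alias_status.tolist())
```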

In [340]:
print("Calmbp1" in hgnc_ensg_ncbi_alias_symbol_set)

True


#### <a id='toc1_1_2_24_'></a>[How many claims are alias symbols?](#toc0_)

In [341]:
dgidb_gene_df["claim_alias_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True,True,False,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True,True,False,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False,True,False,True


In [342]:
dgidb_gene_df["claim_alias_status"].value_counts()

claim_alias_status
False    74684
True      5550
Name: count, dtype: int64

In [343]:
alias_claims_df = dgidb_gene_df.loc[dgidb_gene_df["claim_alias_status"]]
num_alias_claims = len(alias_claims_df)
num_alias_claims

5550

#### <a id='toc1_1_2_25_'></a>[How many claims that were normalized into non-primary symbol labeled groups are alias symbols?](#toc0_)

In [344]:
dgidb_gene_df["claim_alias_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
alias_claim_not_primary_group_name_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & ~dgidb_gene_df["name_primary_status"]
]
alias_claim_not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status
3251,USP17L,Gene Symbol,ncbigene:100862847,USP17L,dGene,27-Jun-13,False,False,False,False,True,True
38737,USP17L,Gene Symbol,ncbigene:100862847,USP17L,NCBI,20240410,False,False,False,False,True,True
55006,ACT,Gene Symbol,ncbigene:389036,ACT,NCBI,20240410,True,True,False,False,True,True
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False,False,False,True,True
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False,False,False,True,True


#### <a id='toc1_1_2_26_'></a>[How many not normalized claims are alias symbols?](#toc0_)

In [345]:
alias_claim_null_name_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & dgidb_gene_df["gene_name"].isnull()
]
len(alias_claim_null_name_df)

0

#### <a id='toc1_1_2_27_'></a>[How many gene group labels are not primary, alias symbols, or null?](#toc0_)

In [346]:
other_name_df = dgidb_gene_df.loc[
    ~dgidb_gene_df["name_alias_status"]
    & ~dgidb_gene_df["name_primary_status"]
    & ~dgidb_gene_df["gene_name"].isnull()
]
num_other_name = len(other_name_df)
num_other_name

10

#### <a id='toc1_1_2_28_'></a>[How many claims are not primary, alias symbols, or null?](#toc0_)

In [347]:
other_claim_df = dgidb_gene_df.loc[
    ~dgidb_gene_df["claim_alias_status"] & ~dgidb_gene_df["claim_primary_status"]
]
num_other_claim = len(other_claim_df)
num_other_claim

15270

#### <a id='toc1_1_2_29_'></a>[How many claims are primary and alias symbols?](#toc0_)

In [348]:
primaryandalias_claim_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & dgidb_gene_df["claim_primary_status"]
]
num_primaryandalias_claim = len(primaryandalias_claim_df)
num_primaryandalias_claim

4795
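
The claim counts reported so far are mutually consistent: claims that are primary or alias (primary plus alias minus both) together with the "other" claims add up to the total. A short check using the reported numbers, followed by a `pd.crosstab` sketch on made-up flags that yields all four combinations in one table:

```python
import pandas as pd

# Counts reported in the cells above.
num_primary, num_alias, num_both, num_other = 64209, 5550, 4795, 15270
total_from_parts = num_primary + num_alias - num_both + num_other
print(total_from_parts)  # 80234

# Toy flags (illustrative) cross-tabulated in one step.
toy_flags = pd.DataFrame(
    {
        "claim_primary_status": [True, True, False, False, True],
        "claim_alias_status": [True, False, True, False, False],
    }
)
toy_table = pd.crosstab(
    toy_flags["claim_primary_status"], toy_flags["claim_alias_status"]
)
print(toy_table)
```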

#### <a id='toc1_1_2_30_'></a>[How many primary symbol claims are normalized into non-primary gene group labels?](#toc0_)

In [349]:
claim_true_name_false_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_primary_status"] & ~dgidb_gene_df["name_primary_status"]
]
claim_true_name_false_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status


In [350]:
len(claim_true_name_false_df)

0

### <a id='toc1_1_3_'></a>[Summary](#toc0_)

#### <a id='toc1_1_3_1_'></a>[Normalization Rates](#toc0_)

In [351]:
normalization_index = "Normalized", "Not Normalized", "Total"
normalization_summary = {
    "Number of Claims": [
        num_normalized_claims,
        num_not_normalized_claims,
        num_total_claims,
    ]
}
normalization_summary_df = pd.DataFrame(
    normalization_summary, index=normalization_index
)
normalization_summary_df

Unnamed: 0,Number of Claims
Normalized,78090
Not Normalized,2144
Total,80234


#### <a id='toc1_1_3_2_'></a>[Types of Claim Symbols](#toc0_)

- Primary and Alias are not mutually exclusive, since some primary symbols are also used as aliases (alias-primary collisions)
- Other does not include any Primary or Alias symbols
- Ambiguous symbols can be either Primary or Alias

In [352]:
claim_index = "Primary", "Alias", "Other", "Ambiguous", "Total"
claim_summary = {
    "Number of Claims": [
        num_primary_claim,
        num_alias_claims,
        num_other_claim,
        num_ambiguous_claim,
        num_total_claims,
    ]
}
claim_summary_df = pd.DataFrame(claim_summary, index=claim_index)
claim_summary_df

Unnamed: 0,Number of Claims
Primary,64209
Alias,5550
Other,15270
Ambiguous,4876
Total,80234
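
A percentage column can make the table easier to read. The sketch below rebuilds the table from the counts shown above; the `Percent of Total` column name is an illustrative addition. Because the categories overlap, the percentages are not expected to sum to 100.

```python
import pandas as pd

claim_counts = pd.DataFrame(
    {"Number of Claims": [64209, 5550, 15270, 4876, 80234]},
    index=["Primary", "Alias", "Other", "Ambiguous", "Total"],
)

total = claim_counts.loc["Total", "Number of Claims"]
claim_counts["Percent of Total"] = (
    100 * claim_counts["Number of Claims"] / total
).round(2)
print(claim_counts)
```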


#### <a id='toc1_1_3_3_'></a>[Types of Normalizations (gene group labels claims are being normalized into)](#toc0_)

In [353]:
gene_group_index = "Primary", "Alias", "Null", "Other", "Ambiguous", "Total"
gene_group_summary = {
    "Number of Claims": [
        num_primary_name,
        num_alias_name,
        num_not_normalized_claims,
        num_other_name,
        num_ambiguous_name,
        num_total_claims,
    ]
}
gene_group_summary_df = pd.DataFrame(gene_group_summary, index=gene_group_index)
gene_group_summary_df

Unnamed: 0,Number of Claims
Primary,78074
Alias,7974
Null,2144
Other,10
Ambiguous,5775
Total,80234
