# <a id='toc1_'></a>[DGIdb ambiguous claims](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [DGIdb ambiguous claims](#toc1_)    
    - [Analyzing similarities b/w alias-primary and alias-alias collisions](#toc1_1_1_)    
      - [Load merged alias-alias and alias-primary collision sets. Symbols are the ambiguous symbols.](#toc1_1_1_1_)    
      - [How many of the alias-primary collisions are also alias-alias collisions?](#toc1_1_1_2_)    
      - [How many unique ambiguous symbols are there b/w alias-primary and alias-alias collisions?](#toc1_1_1_3_)    
    - [Primary exploration of DGIdb gene content using collisions](#toc1_1_2_)    
      - [Load gene claim data from DGIdb](#toc1_1_2_1_)    
      - [How many claims are placed in a gene group with a different label?](#toc1_1_2_2_)    
      - [How many claims are not normalized?](#toc1_1_2_3_)    
      - [Create set of gene groups (gene_name) and gene claims (gene_claim_name)](#toc1_1_2_4_)    
      - [Load HGNC, Ensembl (ENSG), and NCBI gene and alias sets](#toc1_1_2_5_)    
      - [How many unique primary gene symbols are there b/w HGNC, Ensembl, and NCBI?](#toc1_1_2_6_)    
      - [How many unique alias symbols are there b/w HGNC, Ensembl, and NCBI?](#toc1_1_2_7_)    
      - [How many unique group names are not primary gene symbols?](#toc1_1_2_8_)    
      - [How many unique claims are not primary gene symbols?](#toc1_1_2_9_)    
      - [Load the collision sets from each data source](#toc1_1_2_10_)    
      - [How many unique group names are primary gene symbols?](#toc1_1_2_11_)    
      - [How many of the groups that are labeled with a primary gene symbol are also alias-alias collisions?](#toc1_1_2_12_)    
      - [How many unique claim symbols are collisions?](#toc1_1_2_13_)    
      - [How many unique groups are labeled with collisions?](#toc1_1_2_14_)    
      - [How many unique claims are not primary gene symbols?](#toc1_1_2_15_)    
      - [How many unique claims that were not primary symbols are collisions?](#toc1_1_2_16_)    
      - [How many unique claims that are primary gene symbols are also collisions?](#toc1_1_2_17_)    
      - [How many of the gene groups that are not primary gene symbols are alias-alias collisions from HGNC?](#toc1_1_2_18_)    
      - [How many claims are ambiguous symbols?](#toc1_1_2_19_)    
      - [How many gene group labels are ambiguous symbols?](#toc1_1_2_20_)    
      - [How many claims are primary symbols?](#toc1_1_2_21_)    
      - [How many gene group labels are primary symbols?](#toc1_1_2_22_)    
      - [How many gene group labels are alias symbols?](#toc1_1_2_23_)    
      - [How many claims are alias symbols?](#toc1_1_2_24_)    
      - [How many claims that were normalized into non-primary symbol labeled groups are alias symbols?](#toc1_1_2_25_)    
      - [How many not normalized claims are alias symbols?](#toc1_1_2_26_)    
      - [How many gene group labels are not primary, alias symbols, or null?](#toc1_1_2_27_)    
      - [How many claims are not primary, alias symbols, or null?](#toc1_1_2_28_)    
      - [How many claims are primary and alias symbols?](#toc1_1_2_29_)    
      - [How many primary symbol claims are normalized into non-primary gene group labels?](#toc1_1_2_30_)    
    - [Summary](#toc1_1_3_)    
      - [Normalization Rates](#toc1_1_3_1_)    
      - [Types of Claim Symbols](#toc1_1_3_2_)    
      - [Types of Normalizations (gene group labels claims are being normalized into)](#toc1_1_3_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [153]:
import pandas as pd
import numpy as np
import plotly.express as px

### <a id='toc1_1_1_'></a>[Analyzing similarities b/w alias-primary and alias-alias collisions](#toc0_)

#### <a id='toc1_1_1_1_'></a>[Load merged alias-alias and alias-primary collision sets. Symbols are the ambiguous symbols.](#toc0_)

Input: merged_alias_gene_intersections.csv (total_alias_gene_intersections.ipynb), merged_alias_overlap_df_2.csv (total_alias_overlap.ipynb)

Output: merged_alias_primary_collision_set, merged_alias_alias_collision_set

In [154]:
merged_alias_primary_collisions_df = pd.read_csv(
    "merged_alias_gene_intersections.csv", na_values=["", "NULL"], keep_default_na=False
)
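Every `read_csv` call in this notebook passes `na_values=["", "NULL"]` with `keep_default_na=False`. The sketch below, on a made-up three-row CSV (the `GENE1` to `GENE3` rows are hypothetical), shows what that combination does: only empty fields and the literal `NULL` become missing values, while a string such as `NA` survives as text. Presumably this protects short symbols that look like pandas' default missing-value markers, although the notebook does not state the reason.

```python
import io

import pandas as pd

# Hypothetical CSV: one alias is the literal string "NA",
# one is empty, and one is the literal string "NULL".
csv_text = "gene_symbol,alias_symbol\nGENE1,NA\nGENE2,\nGENE3,NULL\n"

# Settings used throughout the notebook.
strict_df = pd.read_csv(
    io.StringIO(csv_text), na_values=["", "NULL"], keep_default_na=False
)
print(strict_df["alias_symbol"].isna().tolist())  # [False, True, True]

# pandas defaults: "NA" is also treated as missing.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["alias_symbol"].isna().tolist())  # [True, True, True]
```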

In [155]:
merged_alias_primary_collision_set = set(
    merged_alias_primary_collisions_df["intersect_point"]
)
len(merged_alias_primary_collision_set)

1532

In [156]:
merged_alias_alias_collisions_df = pd.read_csv(
    "merged_alias_overlap_df_2.csv", na_values=["", "NULL"], keep_default_na=False
)

In [157]:
merged_alias_alias_collision_set = set(
    merged_alias_alias_collisions_df["alias_symbol"].tolist()
)
len(merged_alias_alias_collision_set)

4494

#### <a id='toc1_1_1_2_'></a>[How many of the alias-primary collisions are also alias-alias collisions?](#toc0_)

In [158]:
print(
    len(
        merged_alias_alias_collision_set.intersection(
            merged_alias_primary_collision_set
        )
    )
)

161


#### <a id='toc1_1_1_3_'></a>[How many unique ambiguous symbols are there b/w alias-primary and alias-alias collisions?](#toc0_)

In [159]:
ambiguous_symbol_set = merged_alias_alias_collision_set.union(
    merged_alias_primary_collision_set
)
print(len(ambiguous_symbol_set))

5865
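The union size can be cross-checked against the two set sizes and their overlap using inclusion-exclusion, |A ∪ B| = |A| + |B| − |A ∩ B|. The sketch below plugs in the counts reported above and repeats the identity on toy sets; the toy symbols are illustrative only.

```python
# Counts reported by the cells above.
n_alias_primary = 1532  # alias-primary collisions
n_alias_alias = 4494    # alias-alias collisions
n_both = 161            # symbols present in both sets

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
n_union = n_alias_primary + n_alias_alias - n_both
print(n_union)  # 5865

# The same identity on small illustrative sets.
a = {"TR2", "OA1", "ACT"}
b = {"ACT", "P39"}
identity_holds = len(a | b) == len(a) + len(b) - len(a & b)
print(identity_holds)  # True
```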


In [160]:
ambiguous_symbol_set

{'LINC00444',
 'L17',
 'GPAT',
 ' GID2',
 'UCN2',
 'HSP70.1',
 'TRNAV22',
 ' PNUTL3',
 ' DBS',
 'SYM1',
 'DEFB3',
 'H2A',
 'SCYL1',
 'LRP1',
 'STAG3L1, STAG3L3',
 ' RIP5',
 ' TCRBV3S1',
 ' P22',
 'LMN2',
 'PARPL',
 'MRXS32',
 'PIT1',
 ' NBS1',
 'MADH6',
 'SIK1B',
 'CXorf11',
 'LINC02921',
 'FAH',
 'POP4',
 'ARTC1',
 'CMPD1',
 'HLA-DRB',
 'OSCP',
 'CARDINAL',
 'NUDT10',
 'C2orf47',
 'G10',
 'NAT5',
 'P39',
 ' CT1',
 'CPAD',
 'p110',
 'NSP2',
 'DIA2',
 'HDGF2',
 'SPR',
 'TP53TG3F',
 'ASB-3',
 'TMEFF1',
 'PKR2',
 'C11ORF4',
 ' S10',
 ' M-ABC2',
 ' SGT2',
 'ACP',
 'F3',
 ' ESA',
 'GC1',
 ' TAZ',
 'MPP1',
 'HLP',
 'XLOC_000303',
 'OR11-3',
 'MIR-129b',
 ' MIC1',
 'ISWI',
 'PGD2',
 ' PTP',
 'HLA-DPB1',
 'ARX',
 'ETDA',
 'AP-1',
 'SCCMS',
 'TDH',
 'TTF2',
 'MLL4',
 'REF1',
 'MS4A4',
 'CEBPZ',
 'CRD',
 ' H3',
 'HBP',
 'TRX-CAT1-4',
 'TSARG1',
 ' TIG1',
 'H2AC16',
 'MIG7',
 'U2',
 'BCRP2',
 'ZNF688',
 'NABC1',
 'CSRP2BP',
 'HN1L',
 'HCG18',
 'ERV1',
 'KRTAP2-1',
 'SKD1',
 'p200',
 'BCDS1',
 'CD

In [161]:
ambiguous_symbol_set = set(item.strip() for item in ambiguous_symbol_set)

print(len(ambiguous_symbol_set))

5050
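Stripping whitespace shrinks the set from 5865 to 5050 entries, a drop of 815. A plausible reading is that many symbols existed both with and without a leading space (the raw output shows entries such as `' SGT2'`), so they were counted twice. The sketch below reproduces the effect on a small hypothetical set. If that reading is right, counts computed before stripping, such as the overlap above, may also be affected.

```python
# Hypothetical raw set: two symbols appear with and without a leading space.
raw = {"SGT2", " SGT2", "TAZ", " TAZ", "GPAT"}

# Stripping merges the padded and unpadded variants into one entry each.
stripped = {symbol.strip() for symbol in raw}

print(len(raw), len(stripped))  # 5 3
```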


In [162]:
with open("ambiguous_symbol_set.txt", "w") as file:
    for item in ambiguous_symbol_set:
        file.write(f"{item.strip()}\n")

In [163]:
with open("ambiguous_symbol_set.txt", "r") as file:
    # Read each line, strip newline characters, and convert to a set
    ambiguous_symbol_set = set(line.strip() for line in file)
len(ambiguous_symbol_set)

5050
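Because Python sets have no stable order, the file written above can list symbols in a different order on each run. A minimal sketch of an alternative, assuming a stable file is wanted for diffing or version control, is to sort before writing; the three symbols and the temporary directory are stand-ins for the real set and path.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for ambiguous_symbol_set.
symbols = {"TR2", "OA1", "ACT"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ambiguous_symbol_set.txt"

    # Write one symbol per line in sorted order so the file is reproducible.
    path.write_text("".join(f"{symbol}\n" for symbol in sorted(symbols)))

    # Read it back, skipping any blank lines.
    restored = {
        line.strip() for line in path.read_text().splitlines() if line.strip()
    }

print(restored == symbols)  # True
```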

In [164]:
ambiguous_symbol_set

{'L17',
 'GPAT',
 'TRNAV22',
 'SCYL1',
 'H2A',
 'LRP1',
 'STAG3L1, STAG3L3',
 'LMN2',
 'LINC02921',
 'FAH',
 'CXorf11',
 'ARTC1',
 'CMPD1',
 'HLA-DRB',
 'OSCP',
 'C2orf47',
 'P39',
 'CPAD',
 'DIA2',
 'HDGF2',
 'TMEFF1',
 'PKR2',
 'ACP',
 'F3',
 'NG2',
 'GC1',
 'MPP1',
 'HLP',
 'OR11-3',
 'TDH',
 'REF1',
 'MS4A4',
 'CEBPZ',
 'CRD',
 'HBP',
 'TRX-CAT1-4',
 'TSARG1',
 'H2AC16',
 'CSRP2BP',
 'HN1L',
 'HCG18',
 'ERV1',
 'KRTAP2-1',
 'SKD1',
 'RRP17',
 'p200',
 'CD277',
 'p56',
 'CAK1',
 'HBM',
 'SF',
 'AD1',
 'CBWD3',
 'TRR-CCG1-2',
 'LBP',
 'ZNF2',
 'SA',
 'HCI',
 'ROX',
 'ZNF141',
 'HM1',
 'TRW-CCA3-1',
 'SGT2',
 'TIH1',
 'DXS648E',
 'SWI2',
 'UBL5',
 'GCAP-II',
 'CIC',
 'NAPA',
 'BCR',
 'LOXL',
 'TRAC-1',
 'TBC1D3D',
 'CHED',
 'TNFRSF11B',
 'GAL7',
 'GLUT6',
 'TMS',
 'OR17-24',
 'ROC2',
 'BAF190',
 'BCS1',
 'GOT1',
 'PDZD5C',
 'AIP',
 'TRI-AAT5-3',
 'P54',
 'P38',
 'TTTY7B',
 'TRD-GTC2-11',
 'COT',
 'ABP-280',
 'CEACAM22P',
 'CTRP5',
 'BETA2',
 'C2orf27B',
 'CED12',
 'NTT',
 'OF',
 'RGP1

### <a id='toc1_1_2_'></a>[Primary exploration of DGIdb gene content using collisions](#toc0_)

#### <a id='toc1_1_2_1_'></a>[Load gene claim data from DGIdb](#toc0_)

In [165]:
dgidb_gene_df = pd.read_csv(
    "dgidb_genes_JUNE.tsv", sep="\t", na_values=["", "NULL"], keep_default_na=False
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14
...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20


In [166]:
num_total_claims = len(dgidb_gene_df)

In [167]:
dgidb_gene_df.loc[dgidb_gene_df["gene_claim_name"] == "HES1"]

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
30772,HES1,Gene Symbol,hgnc:5192,HES1,Pharos,10-Apr-24
76844,HES1,Gene Symbol,hgnc:5192,HES1,HGNC,20240408
76845,HES1,Gene Symbol,hgnc:5192,HES1,NCBI,20240410
76846,HES1,Gene Symbol,hgnc:5192,HES1,Ensembl,111
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17


#### <a id='toc1_1_2_2_'></a>[How many claims are placed in a gene group with a different label?](#toc0_)

In [168]:
dgidb_gene_df.query("gene_name != gene_claim_name")

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14
...,...,...,...,...,...,...
79999,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,FoundationOneGenes,9/3/20
80014,ENSEMBL:ENSG00000185821,Ensembl Gene ID,hgnc:31305,OR6C76,RussLampel,26-Jul-11
80053,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,CarisMolecularIntelligence,9/4/20
80166,SEPT6,Gene Symbol,hgnc:15848,SEPTIN6,CarisMolecularIntelligence,9/4/20


#### <a id='toc1_1_2_3_'></a>[How many claims are not normalized?](#toc0_)

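First, check for claims that have no symbol, name, or identifier in `gene_claim_name`; none are expected. A claim counts as not normalized when its `gene_name` (gene group label) is null.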

In [169]:
no_claim_symbols_df = dgidb_gene_df[dgidb_gene_df["gene_claim_name"].isnull()]
no_claim_symbols_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version


In [170]:
no_name_symbols_df = dgidb_gene_df[dgidb_gene_df["gene_name"].isnull()]
num_not_normalized_claims = len(no_name_symbols_df)
num_not_normalized_claims

2144

In [171]:
num_normalized_claims = num_total_claims - num_not_normalized_claims

#### <a id='toc1_1_2_4_'></a>[Create set of gene groups (gene_name) and gene claims (gene_claim_name)](#toc0_)

In [172]:
dgidb_name_set = set(dgidb_gene_df["gene_name"])
len(dgidb_name_set)

12001

In [173]:
dgidb_gene_claim_name_set = set(dgidb_gene_df["gene_claim_name"])
len(dgidb_gene_claim_name_set)

26739

#### <a id='toc1_1_2_5_'></a>[Load HGNC, Ensembl (ENSG), and NCBI gene and alias sets](#toc0_)

Input: mini_hgnc_df, mini_ensg_df, mini_ncbi_df (total_alias_gene_intersections.ipynb)

Output: x_gene_symbol_set, x_alias_symbol_set where x is hgnc, ensg, or ncbi

In [174]:
mini_hgnc_df = pd.read_csv(
    "Downloaded_files/mini_hgnc_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [175]:
mini_ensg_df = pd.read_csv(
    "Downloaded_files/mini_ensg_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [176]:
mini_ncbi_df = pd.read_csv(
    "Downloaded_files/mini_ncbi_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [177]:
hgnc_gene_symbol_set = set(mini_hgnc_df["gene_symbol"])
len(hgnc_gene_symbol_set)

45646

In [178]:
ensg_gene_symbol_set = set(mini_ensg_df["gene_symbol"])
len(ensg_gene_symbol_set)

41068

In [179]:
ncbi_gene_symbol_set = set(mini_ncbi_df["gene_symbol"])
len(ncbi_gene_symbol_set)

193303

In [180]:
hgnc_alias_symbol_set = set(mini_hgnc_df["alias_symbol"])
len(hgnc_alias_symbol_set)

22583

In [181]:
ensg_alias_symbol_set = set(mini_ensg_df["alias_symbol"])
len(ensg_alias_symbol_set)

24717

In [182]:
ncbi_alias_symbol_set = set(mini_ncbi_df["alias_symbol"])
len(ncbi_alias_symbol_set)

27486

#### <a id='toc1_1_2_6_'></a>[How many unique primary gene symbols are there b/w HGNC, Ensembl, and NCBI?](#toc0_)

In [183]:
hgnc_ensg_gene_symbol_set = hgnc_gene_symbol_set.union(ensg_gene_symbol_set)

In [184]:
hgnc_ensg_ncbi_gene_symbol_set = hgnc_ensg_gene_symbol_set.union(ncbi_gene_symbol_set)
len(hgnc_ensg_ncbi_gene_symbol_set)

194866

#### <a id='toc1_1_2_7_'></a>[How many unique alias symbols are there b/w HGNC, Ensembl, and NCBI?](#toc0_)

In [185]:
hgnc_ensg_alias_symbol_set = hgnc_alias_symbol_set.union(ensg_alias_symbol_set)

In [186]:
hgnc_ensg_ncbi_alias_symbol_set = hgnc_ensg_alias_symbol_set.union(
    ncbi_alias_symbol_set
)
len(hgnc_ensg_ncbi_alias_symbol_set)

63935

#### <a id='toc1_1_2_8_'></a>[How many unique group names are not primary gene symbols?](#toc0_)

In [187]:
name_ensg_notmatch = dgidb_name_set.difference(ensg_gene_symbol_set)
len(name_ensg_notmatch)

298

In [188]:
name_hgnc_notmatch = dgidb_name_set.difference(hgnc_gene_symbol_set)
len(name_hgnc_notmatch)

25

In [189]:
cleaned_name_hgnc_notmatch = [x for x in name_hgnc_notmatch if str(x) != "NaN"]
len(cleaned_name_hgnc_notmatch)

25
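The cleaned count equals the uncleaned count (25 and 25), which is what one would see if the `str(x) != "NaN"` test never removes anything: Python renders a float NaN as the lowercase string `nan`. The sketch below shows the difference on a hypothetical two-element set and uses `math.isnan` as one possible replacement; whether the real set contains a NaN is not confirmed by the output above.

```python
import math

# Hypothetical set holding one symbol and one missing value.
values = {"KIT", float("nan")}

print(str(float("nan")))  # nan

# Comparing against "NaN" keeps the missing value, because the text is "nan".
kept_by_string_test = [x for x in values if str(x) != "NaN"]

# An explicit NaN check removes it.
kept_by_isnan = [
    x for x in values if not (isinstance(x, float) and math.isnan(x))
]

print(len(kept_by_string_test), len(kept_by_isnan))  # 2 1
```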

In [190]:
name_ncbi_notmatch = dgidb_name_set.difference(ncbi_gene_symbol_set)
len(name_ncbi_notmatch)

97

In [191]:
name_ncbi_hgnc_notmatch = name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(name_ncbi_hgnc_notmatch)

9

How many groups are labeled with a symbol not found in the sets of primary symbols from HGNC, NCBI, or ENSG?

In [192]:
name_ncbi_hgnc_ensg_notmatch = name_ncbi_hgnc_notmatch.difference(ensg_gene_symbol_set)
len(name_ncbi_hgnc_ensg_notmatch)

8
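The chain of `difference` calls above removes one source at a time. Set algebra guarantees the same result as subtracting the union of all three sources in one step, so the combined `hgnc_ensg_ncbi_gene_symbol_set` built earlier could be used directly. The sketch below checks the equivalence on small hypothetical sets.

```python
# Hypothetical group labels and per-source primary symbol sets.
labels = {"KIT", "OA1", "XYZ1"}
hgnc = {"KIT"}
ncbi = {"KIT", "OA1"}
ensg = {"KIT"}

# Step-by-step removal, as in the cells above.
chained = labels.difference(ncbi).difference(hgnc).difference(ensg)

# Single subtraction of the union of all sources.
direct = labels - (hgnc | ncbi | ensg)

print(chained == direct, sorted(direct))  # True ['XYZ1']
```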

In [193]:
len(dgidb_name_set)

12001

#### <a id='toc1_1_2_9_'></a>[How many unique claims are not primary gene symbols?](#toc0_)

In [194]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_notmatch)

15029

In [195]:
cleaned_gene_claim_name_ensg_notmatch = [
    x for x in gene_claim_name_ensg_notmatch if str(x) != "NaN"
]
len(cleaned_gene_claim_name_ensg_notmatch)

15029

In [196]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_notmatch)

14755

In [197]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_notmatch)

14828

In [198]:
gene_claim_name_ncbi_hngc_notmatch = gene_claim_name_ncbi_notmatch.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_ncbi_hngc_notmatch)

14738

How many unique claims are not primary symbols in any of HGNC, NCBI, or Ensembl?

In [199]:
gene_claim_name_ncbi_hngc_ensg_notmatch = gene_claim_name_ncbi_hngc_notmatch.difference(
    ensg_gene_symbol_set
)
num_not_primary_claims = len(gene_claim_name_ncbi_hngc_ensg_notmatch)
num_not_primary_claims

14735

#### <a id='toc1_1_2_10_'></a>[Load the collision sets from each data source](#toc0_)

Input: aa_collision_x_df.csv (total_alias_overlap.ipynb)

Output: x_alias_alias_collision_set

In [200]:
aa_collision_hgnc_df = pd.read_csv(
    "created_files/aa_collision_hgnc_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [201]:
hgnc_alias_alias_collision_set = set(aa_collision_hgnc_df["alias_symbol"])
len(hgnc_alias_alias_collision_set)

673

In [202]:
aa_collision_ncbi_df = pd.read_csv(
    "created_files/aa_collision_ncbi_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [203]:
ncbi_alias_alias_collision_set = set(aa_collision_ncbi_df["alias_symbol"])
len(ncbi_alias_alias_collision_set)

3476

In [204]:
aa_collision_ensg_df = pd.read_csv(
    "created_files/aa_collision_ensg_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [205]:
ensg_alias_alias_collision_set = set(aa_collision_ensg_df["alias_symbol"])
len(ensg_alias_alias_collision_set)

1149

#### <a id='toc1_1_2_11_'></a>[How many unique group names are primary gene symbols?](#toc0_)

In [206]:
name_hgnc_match = dgidb_name_set.intersection(hgnc_gene_symbol_set)
len(name_hgnc_match)

11976

In [207]:
name_ensg_match = dgidb_name_set.intersection(ensg_gene_symbol_set)
len(name_ensg_match)

11703

In [208]:
name_ncbi_match = dgidb_name_set.intersection(ncbi_gene_symbol_set)
len(name_ncbi_match)

11904

In [209]:
name_ncbi_ensg_match = name_ncbi_match.intersection(ensg_gene_symbol_set)
len(name_ncbi_ensg_match)

11685

In [210]:
name_ncbi_ensg_hgnc_match = name_ncbi_ensg_match.intersection(hgnc_gene_symbol_set)
len(name_ncbi_ensg_hgnc_match)

11684
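Note that the chained intersections count labels that are primary symbols in all three sources (11684). Labels that are primary in at least one source form a larger group: since 8 of the 12001 unique labels matched no source, that count would be 12001 − 8 = 11993. The sketch below contrasts the two questions on hypothetical sets.

```python
# Hypothetical group labels and per-source primary symbol sets.
labels = {"KIT", "HES1", "OA1"}
hgnc = {"KIT", "HES1"}
ncbi = {"KIT", "HES1"}
ensg = {"KIT"}

# Primary symbol in every source (what the chained intersections compute).
in_all = labels & hgnc & ncbi & ensg

# Primary symbol in at least one source.
in_any = labels & (hgnc | ncbi | ensg)

print(sorted(in_all), sorted(in_any))  # ['KIT'] ['HES1', 'KIT']
```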

#### <a id='toc1_1_2_12_'></a>[How many of the groups that are labeled with a primary gene symbol are also alias-alias collisions?](#toc0_)

In [211]:
name_ncbi_match_aacollision = name_ncbi_match.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_match_aacollision)

124

In [212]:
name_ensg_match_aacollision = name_ensg_match.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_match_aacollision)

8

In [213]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_notmatch_aacollision)

1

#### <a id='toc1_1_2_13_'></a>[How many unique claim symbols are collisions?](#toc0_)

In [214]:
gene_claim_name_ensg_aacollision_match = dgidb_gene_claim_name_set.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_aacollision_match)

23

In [215]:
gene_claim_name_hgnc_aacollision_match = dgidb_gene_claim_name_set.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_aacollision_match)

21

In [216]:
gene_claim_name_ncbi_aacollision_match = dgidb_gene_claim_name_set.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_aacollision_match)

221

#### <a id='toc1_1_2_14_'></a>[How many unique groups are labeled with collisions?](#toc0_)

In [217]:
name_ensg_aacollision_match = dgidb_name_set.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_aacollision_match)

9

In [218]:
name_hgnc_aacollision_match = dgidb_name_set.intersection(
    hgnc_alias_alias_collision_set
)
len(name_hgnc_aacollision_match)

8

In [219]:
name_ncbi_aacollision_match = dgidb_name_set.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_aacollision_match)

126
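The per-source cells in this part of the notebook repeat one pattern: intersect a DGIdb set with each source's collision set and take the length. A compact alternative, sketched below, keeps the collision sets in a dictionary and loops over it. The function name `collision_counts` and the toy symbols are hypothetical, not part of the notebook.

```python
def collision_counts(symbols, collisions_by_source):
    """Count how many symbols fall in each source's collision set."""
    return {
        source: len(symbols & collisions)
        for source, collisions in collisions_by_source.items()
    }


# Hypothetical stand-ins for the DGIdb name set and the collision sets.
group_names = {"KIT", "ACT", "OA1"}
collisions = {
    "hgnc": {"ACT"},
    "ncbi": {"ACT", "OA1", "P39"},
    "ensg": set(),
}

counts = collision_counts(group_names, collisions)
print(counts)  # {'hgnc': 1, 'ncbi': 2, 'ensg': 0}
```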

#### <a id='toc1_1_2_15_'></a>[How many unique claims are not primary gene symbols?](#toc0_)

In [220]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_notmatch)

14755

In [221]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_notmatch)

15029

In [222]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_notmatch)

14828

#### <a id='toc1_1_2_16_'></a>[How many unique claims that were not primary symbols are collisions?](#toc0_)

In [223]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_notmatch_aacollision)

13

In [224]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_notmatch_aacollision)

97

In [225]:
gene_claim_name_ensg_notmatch_aacollision = gene_claim_name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_notmatch_aacollision)

15

#### <a id='toc1_1_2_17_'></a>[How many unique claims that are primary gene symbols are also collisions?](#toc0_)

In [226]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_match)

11984

In [227]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_match_aacollision)

8

In [228]:
gene_claim_name_ncbi_match = dgidb_gene_claim_name_set.intersection(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_match)

11911

In [229]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_match_aacollision)

124

In [230]:
gene_claim_name_ensg_match = dgidb_gene_claim_name_set.intersection(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_match)

11710

In [231]:
gene_claim_name_ensg_match_aacollision = gene_claim_name_ensg_match.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_match_aacollision)

8

#### <a id='toc1_1_2_18_'></a>[How many of the gene groups that are not primary gene symbols are alias-alias collisions from HGNC?](#toc0_)

In [232]:
name_hngc_notmatch_aacollision = name_hgnc_notmatch.intersection(
    hgnc_alias_alias_collision_set
)
len(name_hngc_notmatch_aacollision)

0

In [233]:
name_ncbi_notmatch_aacollision = name_ncbi_notmatch.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_notmatch_aacollision)

2

In [234]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_notmatch_aacollision)

1

#### <a id='toc1_1_2_19_'></a>[How many claims are ambiguous symbols?](#toc0_)

In [235]:
dgidb_gene_df["claim_ambiguous_status"] = dgidb_gene_df["gene_claim_name"].isin(
    ambiguous_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False
...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True


In [236]:
dgidb_gene_df["claim_ambiguous_status"].value_counts()

claim_ambiguous_status
False    75819
True      4415
Name: count, dtype: int64

In [237]:
ambiguous_claim_df = dgidb_gene_df.loc[dgidb_gene_df["claim_ambiguous_status"]]
num_ambiguous_claim = len(ambiguous_claim_df)
num_ambiguous_claim

4415
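The 4415 figure counts rows (claims), not unique symbols: a symbol claimed by several sources contributes once per row, as `HES1` does with its five rows shown earlier. This differs from the set-intersection cells above, which count each symbol once. The sketch below illustrates the distinction with a hypothetical list of claims.

```python
# Hypothetical claim column: one symbol is claimed by several sources.
claims = ["HES1", "HES1", "KIT", "TR2", "HES1"]
ambiguous = {"HES1", "TR2"}

# Row-level flag, analogous to Series.isin followed by counting True values.
row_flags = [claim in ambiguous for claim in claims]
n_ambiguous_rows = sum(row_flags)

# Unique-symbol count, analogous to a set intersection.
n_ambiguous_unique = len(set(claims) & ambiguous)

print(n_ambiguous_rows, n_ambiguous_unique)  # 4 2
```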

In [238]:
ambiguous_claim_df.loc[ambiguous_claim_df["gene_claim_name"] == "TR2"]

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status
18,TR2,NCBI Gene Name,hgnc:7971,NR2C1,BaderLab,Feb-14,True


#### <a id='toc1_1_2_20_'></a>[How many gene group labels are ambiguous symbols?](#toc0_)

In [239]:
dgidb_gene_df["name_ambiguous_status"] = dgidb_gene_df["gene_name"].isin(
    ambiguous_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False
...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False


In [240]:
dgidb_gene_df["name_ambiguous_status"].value_counts()

name_ambiguous_status
False    75009
True      5225
Name: count, dtype: int64

In [241]:
ambiguous_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_ambiguous_status"]]
num_ambiguous_name = len(ambiguous_name_df)
num_ambiguous_name

5225

#### <a id='toc1_1_2_21_'></a>[How many claims are primary symbols?](#toc0_)

In [242]:
dgidb_gene_df["claim_primary_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_gene_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False


In [243]:
dgidb_gene_df["claim_primary_status"].value_counts()

claim_primary_status
True     64209
False    16025
Name: count, dtype: int64

In [244]:
primary_claim_df = dgidb_gene_df.loc[dgidb_gene_df["claim_primary_status"]]
num_primary_claim = len(primary_claim_df)
num_primary_claim

64209

#### <a id='toc1_1_2_22_'></a>[How many gene group labels are primary symbols?](#toc0_)

In [245]:
dgidb_gene_df["name_primary_status"] = (
    dgidb_gene_df["gene_name"].astype(str).isin(hgnc_ensg_ncbi_gene_symbol_set)
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True,True
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True,True
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False,True


In [246]:
dgidb_gene_df["name_primary_status"].value_counts()

name_primary_status
True     78074
False     2160
Name: count, dtype: int64

In [247]:
primary_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_primary_status"]]
num_primary_name = len(primary_name_df)
num_primary_name

78074

In [248]:
not_primary_group_name_df = dgidb_gene_df.loc[~dgidb_gene_df["name_primary_status"]]
not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
57108,NCBIGENE:2614,NCBI Gene ID,ncbigene:2614,GAPDHL17,GuideToPharmacology,2024.1,False,False,False,False
57552,CYB5P1,Gene Symbol,ncbigene:1529,CYB5P1,NCBI,20240410,False,False,False,False
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False,False,False
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False,False,False


#### <a id='toc1_1_2_23_'></a>[How many gene group labels are alias symbols?](#toc0_)

In [249]:
dgidb_gene_df["name_alias_status"] = dgidb_gene_df["gene_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True,True,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True,True,False
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True,True,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False,True,False


In [250]:
dgidb_gene_df["name_alias_status"].value_counts()

name_alias_status
False    77224
True      3010
Name: count, dtype: int64

In [251]:
alias_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_alias_status"]]
num_alias_name = len(alias_name_df)
num_alias_name

3010
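In the preview above, rows 0 to 4 have no `gene_name` yet show `name_alias_status` as True. That suggests the alias set contains a missing value and that `isin` matches null labels against it, so the 3010 count may include not-normalized rows. A minimal sketch of one way to guard against this, assuming null labels should never count as aliases, is to combine `isin` with `notna`; the series and alias set below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical group labels: two rows were not normalized (null label).
labels = pd.Series(["KIT", np.nan, "OA1", np.nan], name="gene_name")

# Hypothetical alias set that happens to contain a missing value.
alias_symbols = {"OA1", "TR2", np.nan}

# Unguarded flag: null labels may be matched against the NaN in the set.
naive_flag = labels.isin(alias_symbols)

# Guarded flag: a label must be present and be a known alias.
guarded_flag = labels.notna() & labels.isin(alias_symbols)

print(guarded_flag.tolist())  # [False, False, True, False]
```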

In [252]:
print("Calmbp1" in hgnc_ensg_ncbi_alias_symbol_set)

False


#### <a id='toc1_1_2_24_'></a>[How many claims are alias symbols?](#toc0_)

In [253]:
dgidb_gene_df["claim_alias_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,False,False,True,True,False,False
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True,True,True,False,False
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,False,False,True,True,False,False
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,True,False,False,True,False,False


In [254]:
dgidb_gene_df["claim_alias_status"].value_counts()

claim_alias_status
False    79378
True       856
Name: count, dtype: int64

In [255]:
alias_claims_df = dgidb_gene_df.loc[dgidb_gene_df["claim_alias_status"]]
num_alias_claims = len(alias_claims_df)
num_alias_claims

856

#### <a id='toc1_1_2_25_'></a>[How many claims that were normalized into non-primary symbol labeled groups are alias symbols?](#toc0_)

In [256]:
dgidb_gene_df["claim_alias_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
alias_claim_not_primary_group_name_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & ~dgidb_gene_df["name_primary_status"]
]
alias_claim_not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status
55006,ACT,Gene Symbol,ncbigene:389036,ACT,NCBI,20240410,True,True,False,False,True,True
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False,False,False,True,True
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False,False,False,True,True


#### <a id='toc1_1_2_26_'></a>[How many not normalized claims are alias symbols?](#toc0_)

In [257]:
alias_claim_null_name_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & dgidb_gene_df["gene_name"].isnull()
]
len(alias_claim_null_name_df)

0

#### <a id='toc1_1_2_27_'></a>[How many gene group labels are not primary, alias symbols, or null?](#toc0_)

In [258]:
other_name_df = dgidb_gene_df.loc[
    ~dgidb_gene_df["name_alias_status"]
    & ~dgidb_gene_df["name_primary_status"]
    & ~dgidb_gene_df["gene_name"].isnull()
]
num_other_name = len(other_name_df)
num_other_name

12

#### <a id='toc1_1_2_28_'></a>[How many claims are not primary, alias symbols, or null?](#toc0_)

In [259]:
other_claim_df = dgidb_gene_df.loc[
    ~dgidb_gene_df["claim_alias_status"] & ~dgidb_gene_df["claim_primary_status"]
]
num_other_claim = len(other_claim_df)
num_other_claim

15899

#### <a id='toc1_1_2_29_'></a>[How many claims are primary and alias symbols?](#toc0_)

In [260]:
primaryandalias_claim_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & dgidb_gene_df["claim_primary_status"]
]
num_primaryandalias_claim = len(primaryandalias_claim_df)
num_primaryandalias_claim

730

#### <a id='toc1_1_2_30_'></a>[How many primary symbol claims are normalized into non-primary gene group labels?](#toc0_)

In [261]:
claim_true_name_false_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_primary_status"] & ~dgidb_gene_df["name_primary_status"]
]
claim_true_name_false_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_ambiguous_status,name_ambiguous_status,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status


In [262]:
len(claim_true_name_false_df)

0

### <a id='toc1_1_3_'></a>[Summary](#toc0_)

#### <a id='toc1_1_3_1_'></a>[Normalization Rates](#toc0_)

In [263]:
normalization_index = "Normalized", "Not Normalized", "Total"
normalization_summary = {
    "Number of Claims": [
        num_normalized_claims,
        num_not_normalized_claims,
        num_total_claims,
    ]
}
normalization_summary_df = pd.DataFrame(
    normalization_summary, index=normalization_index
)
normalization_summary_df

| | Number of Claims |
|---|---|
| Normalized | 78090 |
| Not Normalized | 2144 |
| Total | 80234 |
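Expressed as rates, the counts in this table correspond to roughly 97.33% of claims normalized and 2.67% not normalized. The short calculation below derives both figures from the reported totals.

```python
# Counts reported in the normalization summary.
num_total_claims = 80234
num_not_normalized_claims = 2144
num_normalized_claims = num_total_claims - num_not_normalized_claims

# Percentages rounded to two decimal places.
pct_normalized = round(100 * num_normalized_claims / num_total_claims, 2)
pct_not_normalized = round(100 * num_not_normalized_claims / num_total_claims, 2)

print(num_normalized_claims, pct_normalized, pct_not_normalized)
# 78090 97.33 2.67
```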


#### <a id='toc1_1_3_2_'></a>[Types of Claim Symbols](#toc0_)

- Primary and Alias symbols are not exclusive as some primary symbols are used as aliases (alias-primary collisions)
- Other does not include any Primary or Alias symbols
- Ambiguous symbols can be either Primary or Alias

In [264]:
claim_index = "Primary", "Alias", "Other", "Ambiguous", "Total"
claim_summary = {
    "Number of Claims": [
        num_primary_claim,
        num_alias_claims,
        num_other_claim,
        num_ambiguous_claim,
        num_total_claims,
    ]
}
claim_summary_df = pd.DataFrame(claim_summary, index=claim_index)
claim_summary_df

| | Number of Claims |
|---|---|
| Primary | 64209 |
| Alias | 856 |
| Other | 15899 |
| Ambiguous | 4415 |
| Total | 80234 |
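Because Primary and Alias overlap, the rows of this table do not sum to the total directly. Subtracting the 730 claims counted earlier as both primary and alias reconciles the categories: Primary + Alias − both + Other equals the total number of claims. The check below uses only the counts reported in this notebook.

```python
# Counts reported in the cells above.
num_primary_claim = 64209
num_alias_claims = 856
num_primaryandalias_claim = 730  # claims flagged as both primary and alias
num_other_claim = 15899
num_total_claims = 80234

# Primary and Alias overlap, so the shared claims are subtracted once.
reconciled_total = (
    num_primary_claim
    + num_alias_claims
    - num_primaryandalias_claim
    + num_other_claim
)

print(reconciled_total == num_total_claims)  # True
```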


#### <a id='toc1_1_3_3_'></a>[Types of Normalizations (gene group labels claims are being normalized into)](#toc0_)

In [266]:
gene_group_index = "Primary", "Alias", "Null", "Other", "Ambiguous", "Total"
gene_group_summary = {
    "Number of Claims": [
        num_primary_name,
        num_alias_name,
        num_not_normalized_claims,
        num_other_name,
        num_ambiguous_name,
        num_total_claims,
    ]
}
gene_group_summary_df = pd.DataFrame(gene_group_summary, index=gene_group_index)
gene_group_summary_df

| | Number of Claims |
|---|---|
| Primary | 78074 |
| Alias | 3010 |
| Null | 2144 |
| Other | 12 |
| Ambiguous | 5225 |
| Total | 80234 |
