# <a id='toc1_'></a>[DGIdb ambiguous claims](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [DGIdb ambiguous claims](#toc1_)    
    - [Analyzing similarities b/w alias-primary and alias-alias collisions](#toc1_1_1_)    
      - [Load merged alias-alias and alias-primary collision sets. Symbols are the ambiguous symbols.](#toc1_1_1_1_)    
      - [How many of the alias-primary collisions are also alias-alias collisions?](#toc1_1_1_2_)    
      - [How many unique ambiguous symbols are there b/w alias-primary and alias-alias collisions?](#toc1_1_1_3_)    
    - [Primary exploration of DGIdb gene content using collisions](#toc1_1_2_)    
      - [Load gene claim data from DGIdb](#toc1_1_2_1_)    
      - [How many claims are placed in a gene group with a different label?](#toc1_1_2_2_)    
      - [How many claims are not normalized?](#toc1_1_2_3_)    
      - [Create set of gene groups (gene_name) and gene claims (gene_claim_name)](#toc1_1_2_4_)    
      - [Load HGNC, Ensembl (ENSG), and NCBI gene and alias sets](#toc1_1_2_5_)    
      - [How many unique primary gene symbols are there b/w HGNC, Ensembl, and NCBI?](#toc1_1_2_6_)    
      - [How many unique alias symbols are there b/w HGNC, Ensembl, and NCBI?](#toc1_1_2_7_)    
      - [How many unique group names are not primary gene symbols?](#toc1_1_2_8_)    
      - [How many unique claims are not primary gene symbols?](#toc1_1_2_9_)    
      - [Load the collision sets from each data source](#toc1_1_2_10_)    
      - [How many unique group names are primary gene symbols?](#toc1_1_2_11_)    
      - [How many of the groups that are labeled with a primary gene symbol are also alias-alias collisions?](#toc1_1_2_12_)    
      - [How many unique claim symbols are collisions?](#toc1_1_2_13_)    
      - [How many unique groups are labeled with collisions?](#toc1_1_2_14_)    
      - [How many unique claims are not primary gene symbols?](#toc1_1_2_15_)    
      - [How many unique claims that were not primary symbols are collisions?](#toc1_1_2_16_)    
      - [How many unique claims that are primary gene symbols are also collisions?](#toc1_1_2_17_)    
      - [How many of the gene groups that are not primary gene symbols are alias-alias collisions from HGNC?](#toc1_1_2_18_)    
      - [How many claims are ambiguous symbols?](#toc1_1_2_19_)    
      - [How many gene group labels are ambiguous symbols?](#toc1_1_2_20_)    
      - [How many claims are primary symbols?](#toc1_1_2_21_)    
      - [How many gene group labels are primary symbols?](#toc1_1_2_22_)    
      - [How many gene group labels are alias symbols?](#toc1_1_2_23_)    
      - [How many claims are alias symbols?](#toc1_1_2_24_)    
      - [How many claims that were normalized into non-primary symbol labeled groups are alias symbols?](#toc1_1_2_25_)    
      - [How many not normalized claims are alias symbols?](#toc1_1_2_26_)    
      - [How many gene group labels are not primary, alias symbols, or null?](#toc1_1_2_27_)    
      - [How many claims are not primary, alias symbols, or null?](#toc1_1_2_28_)    
      - [How many claims are primary and alias symbols?](#toc1_1_2_29_)    
      - [How many primary symbol claims are normalized into non-primary gene group labels?](#toc1_1_2_30_)    
    - [Summary](#toc1_1_3_)    
      - [Normalization Rates](#toc1_1_3_1_)    
      - [Types of Claim Symbols](#toc1_1_3_2_)    
      - [Types of Normalizations (gene group labels claims are being normalized into)](#toc1_1_3_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px

### <a id='toc1_1_1_'></a>[Analyzing similarities b/w alias-primary and alias-alias collisions](#toc0_)

#### <a id='toc1_1_1_1_'></a>[Load merged alias-alias and alias-primary collision sets. Symbols are the ambiguous symbols.](#toc0_)

Input: merged_alias_gene_intersections.csv (total_alias_gene_intersections.ipynb), merged_alias_overlap_df_2.csv (total_alias_overlap.ipynb)

Output: merged_alias_primary_collision_set, merged_alias_alias_collision_set

In [None]:
merged_alias_primary_collisions_df = pd.read_csv(
    "merged_alias_gene_intersections.csv", na_values=["", "NULL"], keep_default_na=False
)

In [None]:
merged_alias_primary_collision_set = set(
    merged_alias_primary_collisions_df["intersect_point"]
)
len(merged_alias_primary_collision_set)

In [None]:
merged_alias_alias_collisions_df = pd.read_csv(
    "merged_alias_overlap_df_2.csv", na_values=["", "NULL"], keep_default_na=False
)

In [None]:
merged_alias_alias_collision_set = set(
    merged_alias_alias_collisions_df["alias_symbol"].tolist()
)
len(merged_alias_alias_collision_set)

#### <a id='toc1_1_1_2_'></a>[How many of the alias-primary collisions are also alias-alias collisions?](#toc0_)

In [None]:
print(
    len(
        merged_alias_alias_collision_set.intersection(
            merged_alias_primary_collision_set
        )
    )
)

#### <a id='toc1_1_1_3_'></a>[How many unique ambiguous symbols are there b/w alias-primary and alias-alias collisions?](#toc0_)

In [None]:
ambiguous_symbol_set = merged_alias_alias_collision_set.union(
    merged_alias_primary_collision_set
)
print(len(ambiguous_symbol_set))
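
The two counts above are linked by inclusion-exclusion: the size of the union equals the sum of the two set sizes minus the size of their intersection. A minimal sketch with toy sets follows; the symbols in it are hypothetical placeholders, not values read from the CSV files.

```python
# Toy collision sets (hypothetical symbols, not taken from the real data)
alias_alias = {"TR2", "P53", "CAP"}
alias_primary = {"CAP", "HES1"}

# Symbols that are both an alias-alias and an alias-primary collision
shared = alias_alias.intersection(alias_primary)

# Every ambiguous symbol, counted once
ambiguous = alias_alias.union(alias_primary)

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
expected_size = len(alias_alias) + len(alias_primary) - len(shared)
print(len(shared), len(ambiguous), expected_size)
```

If the printed union size and the inclusion-exclusion size ever disagree on the real sets, that would point to stray whitespace or duplicate spellings in one of the inputs.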

In [None]:
ambiguous_symbol_set

In [None]:
ambiguous_symbol_set = set(item.strip() for item in ambiguous_symbol_set)

print(len(ambiguous_symbol_set))

In [None]:
with open("ambiguous_symbol_set.txt", "w") as file:
    for item in ambiguous_symbol_set:
        file.write(f"{item.strip()}\n")

In [None]:
with open("ambiguous_symbol_set.txt", "r") as file:
    # Read each line, strip newline characters, and convert to a set
    ambiguous_symbol_set = set(line.strip() for line in file)
len(ambiguous_symbol_set)

In [None]:
ambiguous_symbol_set

### <a id='toc1_1_2_'></a>[Primary exploration of DGIdb gene content using collisions](#toc0_)

#### <a id='toc1_1_2_1_'></a>[Load gene claim data from DGIdb](#toc0_)

In [None]:
dgidb_gene_df = pd.read_csv(
    "dgidb_genes_JUNE.tsv", sep="\t", na_values=["", "NULL"], keep_default_na=False
)
dgidb_gene_df

In [None]:
num_total_claims = len(dgidb_gene_df)

In [None]:
dgidb_gene_df.loc[dgidb_gene_df["gene_claim_name"] == "HES1"]

#### <a id='toc1_1_2_2_'></a>[How many claims are placed in a gene group with a different label?](#toc0_)

In [None]:
dgidb_gene_df.query("gene_name != gene_claim_name")
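
One thing to keep in mind when reading this result: a `!=` comparison in pandas is true whenever one side is missing, so claims with a null `gene_name` are returned together with claims that were placed under a different label. The sketch below shows the effect on a toy frame (the rows are hypothetical; only the column names come from this notebook) and one possible way to count relabeled claims alone.

```python
import numpy as np
import pandas as pd

# Toy frame: one claim kept under its own label, one relabeled, one not normalized
toy = pd.DataFrame(
    {
        "gene_claim_name": ["HES1", "TR2", "XYZ"],
        "gene_name": ["HES1", "NR2C1", np.nan],
    }
)

# Plain inequality: the row with a missing gene_name also compares as "different"
different_or_null = toy["gene_name"] != toy["gene_claim_name"]

# Restrict to rows that actually have a group label
different_only = toy["gene_name"].notna() & different_or_null

print(int(different_or_null.sum()), int(different_only.sum()))
```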

#### <a id='toc1_1_2_3_'></a>[How many claims are not normalized?](#toc0_)

First, check for claims that have no symbol, name, or identifier (there should not be any). A claim counts as not normalized when its `gene_name` is null.

In [None]:
no_claim_symbols_df = dgidb_gene_df[dgidb_gene_df["gene_claim_name"].isnull()]
no_claim_symbols_df

In [None]:
no_name_symbols_df = dgidb_gene_df[dgidb_gene_df["gene_name"].isnull()]
num_not_normalized_claims = len(no_name_symbols_df)
num_not_normalized_claims

In [None]:
num_normalized_claims = num_total_claims - num_not_normalized_claims

#### <a id='toc1_1_2_4_'></a>[Create set of gene groups (gene_name) and gene claims (gene_claim_name)](#toc0_)

In [None]:
dgidb_name_set = set(dgidb_gene_df["gene_name"])
len(dgidb_name_set)

In [None]:
dgidb_gene_claim_name_set = set(dgidb_gene_df["gene_claim_name"])
len(dgidb_gene_claim_name_set)

#### <a id='toc1_1_2_5_'></a>[Load HGNC, Ensembl (ENSG), and NCBI gene and alias sets](#toc0_)

Input: mini_hgnc_df, mini_ensg_df, mini_ncbi_df (total_alias_gene_intersections.ipynb)

Output: x_gene_symbol_set, x_alias_symbol_set where x is hgnc, ensg, or ncbi

In [None]:
mini_hgnc_df = pd.read_csv(
    "Downloaded_files/mini_hgnc_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [None]:
mini_ensg_df = pd.read_csv(
    "Downloaded_files/mini_ensg_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [None]:
mini_ncbi_df = pd.read_csv(
    "Downloaded_files/mini_ncbi_df.csv", na_values=["", "NULL"], keep_default_na=False
)

In [None]:
hgnc_gene_symbol_set = set(mini_hgnc_df["gene_symbol"])
len(hgnc_gene_symbol_set)

In [None]:
ensg_gene_symbol_set = set(mini_ensg_df["gene_symbol"])
len(ensg_gene_symbol_set)

In [None]:
ncbi_gene_symbol_set = set(mini_ncbi_df["gene_symbol"])
len(ncbi_gene_symbol_set)

In [None]:
hgnc_alias_symbol_set = set(mini_hgnc_df["alias_symbol"])
len(hgnc_alias_symbol_set)

In [None]:
ensg_alias_symbol_set = set(mini_ensg_df["alias_symbol"])
len(ensg_alias_symbol_set)

In [None]:
ncbi_alias_symbol_set = set(mini_ncbi_df["alias_symbol"])
len(ncbi_alias_symbol_set)

#### <a id='toc1_1_2_6_'></a>[How many unique primary gene symbols are there b/w HGNC, Ensembl, and NCBI?](#toc0_)

In [None]:
hgnc_ensg_gene_symbol_set = hgnc_gene_symbol_set.union(ensg_gene_symbol_set)

In [None]:
hgnc_ensg_ncbi_gene_symbol_set = hgnc_ensg_gene_symbol_set.union(ncbi_gene_symbol_set)
len(hgnc_ensg_ncbi_gene_symbol_set)

#### <a id='toc1_1_2_7_'></a>[How many unique alias symbols are there b/w HGNC, Ensembl, and NCBI?](#toc0_)

In [None]:
hgnc_ensg_alias_symbol_set = hgnc_alias_symbol_set.union(ensg_alias_symbol_set)

In [None]:
hgnc_ensg_ncbi_alias_symbol_set = hgnc_ensg_alias_symbol_set.union(
    ncbi_alias_symbol_set
)
len(hgnc_ensg_ncbi_alias_symbol_set)

#### <a id='toc1_1_2_8_'></a>[How many unique group names are not primary gene symbols?](#toc0_)

In [None]:
name_ensg_notmatch = dgidb_name_set.difference(ensg_gene_symbol_set)
len(name_ensg_notmatch)

In [None]:
name_hgnc_notmatch = dgidb_name_set.difference(hgnc_gene_symbol_set)
len(name_hgnc_notmatch)

In [None]:
# Drop the missing label: a float NaN stringifies as "nan", so test with pd.notna
cleaned_name_hgnc_notmatch = [x for x in name_hgnc_notmatch if pd.notna(x)]
len(cleaned_name_hgnc_notmatch)
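
Because the data is loaded with `na_values=["", "NULL"]`, an empty group label ends up in the set as a float NaN rather than as a string. The following sketch, using a hypothetical three-element set, shows how such a value stringifies and one way to drop it with `pd.notna`.

```python
import numpy as np
import pandas as pd

# Toy label set with one missing value (hypothetical symbols)
labels = {"HES1", "NR2C1", np.nan}

# A float NaN stringifies in lowercase, so string comparisons are case-sensitive
nan_as_text = str(np.nan)

# pd.notna is True for ordinary strings and False for NaN/None
cleaned = sorted(x for x in labels if pd.notna(x))

print(nan_as_text, cleaned)
```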

In [None]:
name_ncbi_notmatch = dgidb_name_set.difference(ncbi_gene_symbol_set)
len(name_ncbi_notmatch)

In [None]:
name_ncbi_hgnc_notmatch = name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(name_ncbi_hgnc_notmatch)

How many groups are labeled with a symbol not found in the sets of primary symbols from HGNC, NCBI, or ENSG?

In [None]:
name_ncbi_hgnc_ensg_notmatch = name_ncbi_hgnc_notmatch.difference(ensg_gene_symbol_set)
len(name_ncbi_hgnc_ensg_notmatch)
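
Chaining `difference` calls one source at a time gives the same result as subtracting the union of all three primary symbol sets in a single step. A small sketch with hypothetical toy sets:

```python
# Toy sets (hypothetical symbols standing in for the real sources)
names = {"HES1", "NR2C1", "FOO", "BAR"}
hgnc = {"HES1"}
ncbi = {"NR2C1"}
ensg = {"HES1", "FOO"}

# Step-by-step removal, as done in the cells above
chained = names.difference(ncbi).difference(hgnc).difference(ensg)

# Equivalent single step against the combined primary set
direct = names - (hgnc | ncbi | ensg)

print(chained == direct, sorted(direct))
```

The single-step form can reuse `hgnc_ensg_ncbi_gene_symbol_set`, which is already built earlier in this notebook.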

In [None]:
len(dgidb_name_set)

#### <a id='toc1_1_2_9_'></a>[How many unique claims are not primary gene symbols?](#toc0_)

In [None]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_notmatch)

In [None]:
# Drop missing claims: a float NaN stringifies as "nan", so test with pd.notna
cleaned_gene_claim_name_ensg_notmatch = [
    x for x in gene_claim_name_ensg_notmatch if pd.notna(x)
]
len(cleaned_gene_claim_name_ensg_notmatch)

In [None]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_notmatch)

In [None]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_notmatch)

In [None]:
gene_claim_name_ncbi_hngc_notmatch = gene_claim_name_ncbi_notmatch.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_ncbi_hngc_notmatch)

How many unique claims are not primary symbols in HGNC, NCBI, or ENSG?

In [None]:
gene_claim_name_ncbi_hngc_ensg_notmatch = gene_claim_name_ncbi_hngc_notmatch.difference(
    ensg_gene_symbol_set
)
num_not_primary_claims = len(gene_claim_name_ncbi_hngc_ensg_notmatch)
num_not_primary_claims

#### <a id='toc1_1_2_10_'></a>[Load the collision sets from each data source](#toc0_)

Input: aa_collision_x_df.csv (total_alias_overlap.ipynb)

Output: x_alias_alias_collision_set

In [None]:
aa_collision_hgnc_df = pd.read_csv(
    "created_files/aa_collision_hgnc_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [None]:
hgnc_alias_alias_collision_set = set(aa_collision_hgnc_df["alias_symbol"])
len(hgnc_alias_alias_collision_set)

In [None]:
aa_collision_ncbi_df = pd.read_csv(
    "created_files/aa_collision_ncbi_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [None]:
ncbi_alias_alias_collision_set = set(aa_collision_ncbi_df["alias_symbol"])
len(ncbi_alias_alias_collision_set)

In [None]:
aa_collision_ensg_df = pd.read_csv(
    "created_files/aa_collision_ensg_df.csv",
    na_values=["", "NULL"],
    keep_default_na=False,
)

In [None]:
ensg_alias_alias_collision_set = set(aa_collision_ensg_df["alias_symbol"])
len(ensg_alias_alias_collision_set)

#### <a id='toc1_1_2_11_'></a>[How many unique group names are primary gene symbols?](#toc0_)

In [None]:
name_hgnc_match = dgidb_name_set.intersection(hgnc_gene_symbol_set)
len(name_hgnc_match)

In [None]:
name_ensg_match = dgidb_name_set.intersection(ensg_gene_symbol_set)
len(name_ensg_match)

In [None]:
name_ncbi_match = dgidb_name_set.intersection(ncbi_gene_symbol_set)
len(name_ncbi_match)

In [None]:
name_ncbi_ensg_match = name_ncbi_match.intersection(ensg_gene_symbol_set)
len(name_ncbi_ensg_match)

In [None]:
name_ncbi_ensg_hgnc_match = name_ncbi_ensg_match.intersection(hgnc_gene_symbol_set)
len(name_ncbi_ensg_hgnc_match)
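
Note that chaining `intersection` across the three sources counts group names that are primary symbols in all of them, which can be fewer than the names that are primary in at least one source. The toy sets below are hypothetical and only illustrate the difference between the two readings.

```python
# Toy sets (hypothetical symbols standing in for the real sources)
names = {"HES1", "NR2C1", "FOO", "BAR"}
hgnc = {"HES1", "NR2C1"}
ncbi = {"HES1", "FOO"}
ensg = {"HES1"}

# Primary symbol in every source (what the chained intersection computes)
in_all_sources = names & hgnc & ncbi & ensg

# Primary symbol in at least one source
in_any_source = names & (hgnc | ncbi | ensg)

print(sorted(in_all_sources), sorted(in_any_source))
```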

#### <a id='toc1_1_2_12_'></a>[How many of the groups that are labeled with a primary gene symbol are also alias-alias collisions?](#toc0_)

In [None]:
name_ncbi_match_aacollision = name_ncbi_match.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_match_aacollision)

In [None]:
name_ensg_match_aacollision = name_ensg_match.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_match_aacollision)

In [None]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_notmatch_aacollision)

#### <a id='toc1_1_2_13_'></a>[How many unique claim symbols are collisions?](#toc0_)

In [None]:
gene_claim_name_ensg_aacollision_match = dgidb_gene_claim_name_set.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_aacollision_match)

In [None]:
gene_claim_name_hgnc_aacollision_match = dgidb_gene_claim_name_set.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_aacollision_match)

In [None]:
gene_claim_name_ncbi_aacollision_match = dgidb_gene_claim_name_set.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_aacollision_match)

#### <a id='toc1_1_2_14_'></a>[How many unique groups are labeled with collisions?](#toc0_)

In [None]:
name_ensg_aacollision_match = dgidb_name_set.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_aacollision_match)

In [None]:
name_hgnc_aacollision_match = dgidb_name_set.intersection(
    hgnc_alias_alias_collision_set
)
len(name_hgnc_aacollision_match)

In [None]:
name_ncbi_aacollision_match = dgidb_name_set.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_aacollision_match)

#### <a id='toc1_1_2_15_'></a>[How many unique claims are not primary gene symbols?](#toc0_)

In [None]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_notmatch)

In [None]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_notmatch)

In [None]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_notmatch)

#### <a id='toc1_1_2_16_'></a>[How many unique claims that were not primary symbols are collisions?](#toc0_)

In [None]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_notmatch_aacollision)

In [None]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_notmatch_aacollision)

In [None]:
gene_claim_name_ensg_notmatch_aacollision = gene_claim_name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_notmatch_aacollision)

#### <a id='toc1_1_2_17_'></a>[How many unique claims that are primary gene symbols are also collisions?](#toc0_)

In [None]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(
    hgnc_gene_symbol_set
)
len(gene_claim_name_hgnc_match)

In [None]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(
    hgnc_alias_alias_collision_set
)
len(gene_claim_name_hgnc_match_aacollision)

In [None]:
gene_claim_name_ncbi_match = dgidb_gene_claim_name_set.intersection(
    ncbi_gene_symbol_set
)
len(gene_claim_name_ncbi_match)

In [None]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(
    ncbi_alias_alias_collision_set
)
len(gene_claim_name_ncbi_match_aacollision)

In [None]:
gene_claim_name_ensg_match = dgidb_gene_claim_name_set.intersection(
    ensg_gene_symbol_set
)
len(gene_claim_name_ensg_match)

In [None]:
gene_claim_name_ensg_match_aacollision = gene_claim_name_ensg_match.intersection(
    ensg_alias_alias_collision_set
)
len(gene_claim_name_ensg_match_aacollision)

#### <a id='toc1_1_2_18_'></a>[How many of the gene groups that are not primary gene symbols are alias-alias collisions from HGNC?](#toc0_)

In [None]:
name_hgnc_notmatch_aacollision = name_hgnc_notmatch.intersection(
    hgnc_alias_alias_collision_set
)
len(name_hgnc_notmatch_aacollision)

In [None]:
name_ncbi_notmatch_aacollision = name_ncbi_notmatch.intersection(
    ncbi_alias_alias_collision_set
)
len(name_ncbi_notmatch_aacollision)

In [None]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(
    ensg_alias_alias_collision_set
)
len(name_ensg_notmatch_aacollision)

#### <a id='toc1_1_2_19_'></a>[How many claims are ambiguous symbols?](#toc0_)

In [None]:
dgidb_gene_df["claim_ambiguous_status"] = dgidb_gene_df["gene_claim_name"].isin(
    ambiguous_symbol_set
)
dgidb_gene_df

In [None]:
dgidb_gene_df["claim_ambiguous_status"].value_counts()

In [None]:
ambiguous_claim_df = dgidb_gene_df.loc[dgidb_gene_df["claim_ambiguous_status"]]
num_ambiguous_claim = len(ambiguous_claim_df)
num_ambiguous_claim

In [None]:
ambiguous_claim_df.loc[ambiguous_claim_df["gene_claim_name"] == "TR2"]

#### <a id='toc1_1_2_20_'></a>[How many gene group labels are ambiguous symbols?](#toc0_)

In [None]:
dgidb_gene_df["name_ambiguous_status"] = dgidb_gene_df["gene_name"].isin(
    ambiguous_symbol_set
)
dgidb_gene_df

In [None]:
dgidb_gene_df["name_ambiguous_status"].value_counts()

In [None]:
ambiguous_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_ambiguous_status"]]
num_ambiguous_name = len(ambiguous_name_df)
num_ambiguous_name

#### <a id='toc1_1_2_21_'></a>[How many claims are primary symbols?](#toc0_)

In [None]:
dgidb_gene_df["claim_primary_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_gene_symbol_set
)
dgidb_gene_df

In [None]:
dgidb_gene_df["claim_primary_status"].value_counts()

In [None]:
primary_claim_df = dgidb_gene_df.loc[dgidb_gene_df["claim_primary_status"]]
num_primary_claim = len(primary_claim_df)
num_primary_claim

#### <a id='toc1_1_2_22_'></a>[How many gene group labels are primary symbols?](#toc0_)

In [None]:
dgidb_gene_df["name_primary_status"] = (
    dgidb_gene_df["gene_name"].astype(str).isin(hgnc_ensg_ncbi_gene_symbol_set)
)
dgidb_gene_df

In [None]:
dgidb_gene_df["name_primary_status"].value_counts()

In [None]:
primary_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_primary_status"]]
num_primary_name = len(primary_name_df)
num_primary_name

In [None]:
not_primary_group_name_df = dgidb_gene_df.loc[~dgidb_gene_df["name_primary_status"]]
not_primary_group_name_df

#### <a id='toc1_1_2_23_'></a>[How many gene group labels are alias symbols?](#toc0_)

In [None]:
dgidb_gene_df["name_alias_status"] = dgidb_gene_df["gene_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
dgidb_gene_df

In [None]:
dgidb_gene_df["name_alias_status"].value_counts()

In [None]:
alias_name_df = dgidb_gene_df.loc[dgidb_gene_df["name_alias_status"]]
num_alias_name = len(alias_name_df)
num_alias_name

In [None]:
print("Calmbp1" in hgnc_ensg_ncbi_alias_symbol_set)

#### <a id='toc1_1_2_24_'></a>[How many claims are alias symbols?](#toc0_)

In [None]:
dgidb_gene_df["claim_alias_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
dgidb_gene_df

In [None]:
dgidb_gene_df["claim_alias_status"].value_counts()

In [None]:
alias_claims_df = dgidb_gene_df.loc[dgidb_gene_df["claim_alias_status"]]
num_alias_claims = len(alias_claims_df)
num_alias_claims

#### <a id='toc1_1_2_25_'></a>[How many claims that were normalized into non-primary symbol labeled groups are alias symbols?](#toc0_)

In [None]:
dgidb_gene_df["claim_alias_status"] = dgidb_gene_df["gene_claim_name"].isin(
    hgnc_ensg_ncbi_alias_symbol_set
)
alias_claim_not_primary_group_name_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & ~dgidb_gene_df["name_primary_status"]
]
alias_claim_not_primary_group_name_df

#### <a id='toc1_1_2_26_'></a>[How many not normalized claims are alias symbols?](#toc0_)

In [None]:
alias_claim_null_name_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & dgidb_gene_df["gene_name"].isnull()
]
len(alias_claim_null_name_df)

#### <a id='toc1_1_2_27_'></a>[How many gene group labels are not primary, alias symbols, or null?](#toc0_)

In [None]:
other_name_df = dgidb_gene_df.loc[
    ~dgidb_gene_df["name_alias_status"]
    & ~dgidb_gene_df["name_primary_status"]
    & ~dgidb_gene_df["gene_name"].isnull()
]
num_other_name = len(other_name_df)
num_other_name

#### <a id='toc1_1_2_28_'></a>[How many claims are not primary, alias symbols, or null?](#toc0_)

In [None]:
other_claim_df = dgidb_gene_df.loc[
    ~dgidb_gene_df["claim_alias_status"] & ~dgidb_gene_df["claim_primary_status"]
]
num_other_claim = len(other_claim_df)
num_other_claim

#### <a id='toc1_1_2_29_'></a>[How many claims are primary and alias symbols?](#toc0_)

In [None]:
primaryandalias_claim_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_alias_status"] & dgidb_gene_df["claim_primary_status"]
]
num_primaryandalias_claim = len(primaryandalias_claim_df)
num_primaryandalias_claim

#### <a id='toc1_1_2_30_'></a>[How many primary symbol claims are normalized into non-primary gene group labels?](#toc0_)

In [None]:
claim_true_name_false_df = dgidb_gene_df.loc[
    dgidb_gene_df["claim_primary_status"] & ~dgidb_gene_df["name_primary_status"]
]
claim_true_name_false_df

In [None]:
len(claim_true_name_false_df)

### <a id='toc1_1_3_'></a>[Summary](#toc0_)

#### <a id='toc1_1_3_1_'></a>[Normalization Rates](#toc0_)

In [None]:
normalization_index = "Normalized", "Not Normalized", "Total"
normalization_summary = {
    "Number of Claims": [
        num_normalized_claims,
        num_not_normalized_claims,
        num_total_claims,
    ]
}
normalization_summary_df = pd.DataFrame(
    normalization_summary, index=normalization_index
)
normalization_summary_df

#### <a id='toc1_1_3_2_'></a>[Types of Claim Symbols](#toc0_)

- Primary and Alias symbols are not exclusive as some primary symbols are used as aliases (alias-primary collisions)
- Other does not include any Primary or Alias symbols
- Ambiguous symbols can be either Primary or Alias
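
Because the categories overlap, the Primary, Alias, and Other rows are not expected to add up to the Total row. The sketch below reproduces the flagging approach on a toy frame; the symbols and the `is_*` column names are hypothetical, chosen only to show how a claim that is both primary and alias is counted twice.

```python
import pandas as pd

# Toy symbol sets: "CAP" is both a primary symbol and an alias (hypothetical)
primary = {"HES1", "CAP"}
alias = {"CAP", "TR2"}

claims = pd.DataFrame({"gene_claim_name": ["HES1", "CAP", "TR2", "XYZ"]})
claims["is_primary"] = claims["gene_claim_name"].isin(primary)
claims["is_alias"] = claims["gene_claim_name"].isin(alias)
claims["is_other"] = ~claims["is_primary"] & ~claims["is_alias"]

# Per-category counts, plus the claims counted in both Primary and Alias
counts = claims[["is_primary", "is_alias", "is_other"]].sum()
both = int((claims["is_primary"] & claims["is_alias"]).sum())

# Subtracting the doubly counted claims recovers the total
reconciled_total = int(counts.sum()) - both
print(counts.to_dict(), both, reconciled_total, len(claims))
```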

In [None]:
claim_index = "Primary", "Alias", "Other", "Ambiguous", "Total"
claim_summary = {
    "Number of Claims": [
        num_primary_claim,
        num_alias_claims,
        num_other_claim,
        num_ambiguous_claim,
        num_total_claims,
    ]
}
claim_summary_df = pd.DataFrame(claim_summary, index=claim_index)
claim_summary_df

#### <a id='toc1_1_3_3_'></a>[Types of Normalizations (gene group labels claims are being normalized into)](#toc0_)

In [None]:
gene_group_index = "Primary", "Alias", "Null", "Other", "Ambiguous", "Total"
gene_group_summary = {
    "Number of Claims": [
        num_primary_name,
        num_alias_name,
        num_not_normalized_claims,
        num_other_name,
        num_ambiguous_name,
        num_total_claims,
    ]
}
gene_group_summary_df = pd.DataFrame(gene_group_summary, index=gene_group_index)
gene_group_summary_df