**Table of contents**<a id='toc0_'></a>    
- [ENSG](#toc1_)    
    - [Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc1_1_1_)    
    - [How many total unique gene records are there](#toc1_1_2_)    
    - [Drop rows with NAN in alias_symbol](#toc1_1_3_)    
    - [Make each row in alias_symbol a set:](#toc1_1_4_)    
    - [Explode the alias sets so that it is one per row](#toc1_1_5_)    
    - [How many total unique aliases are there](#toc1_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc1_1_7_)    
    - [Sort alias symbols alphabetically](#toc1_1_8_)    
    - [Number of records with an alias that is shared](#toc1_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc1_1_10_)    
      - [Save as csv](#toc1_1_10_1_)    
    - [Put columns in different order to emphasize alias symbols instead of gene records](#toc1_1_11_)    
    - [Merge rows with matching alias symbols](#toc1_1_12_)    
- [HGNC](#toc2_)    
    - [Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc2_1_1_)    
    - [How many total unique gene records are there](#toc2_1_2_)    
    - [Drop rows with NAN in alias_symbol](#toc2_1_3_)    
    - [Make each row in alias_symbol a set:](#toc2_1_4_)    
    - [Explode the alias sets so that it is one per row](#toc2_1_5_)    
    - [How many total unique aliases are there](#toc2_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc2_1_7_)    
    - [Sort alias symbols alphabetically](#toc2_1_8_)    
    - [Number of records with an alias that is shared](#toc2_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc2_1_10_)    
      - [Save as csv](#toc2_1_10_1_)    
    - [Put columns in different order to emphasize alias symbols instead of gene records](#toc2_1_11_)    
    - [Merge rows with matching alias symbols](#toc2_1_12_)    
- [NCBI Info](#toc3_)    
    - [Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc3_1_1_)    
    - [How many total unique gene records are there](#toc3_1_2_)    
    - [Drop rows with - in alias_symbol](#toc3_1_3_)    
    - [Make each row in alias_symbol a set:](#toc3_1_4_)    
    - [Explode the alias sets so that it is one per row](#toc3_1_5_)    
    - [How many unique aliases are there](#toc3_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc3_1_7_)    
    - [Sort alias symbols alphabetically](#toc3_1_8_)    
    - [Number of records with an alias that is shared](#toc3_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc3_1_10_)    
    - [Put columns in different order to emphasize alias symbols instead of gene records](#toc3_1_11_)    
    - [Merge rows with matching alias symbols](#toc3_1_12_)    
- [Merge to create Alias Overlap Table 1 - Gene Symbol](#toc4_)    
- [Merge to create Alias Overlap Table 2 - Alias Symbol](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px

# <a id='toc1_'></a>[ENSG](#toc0_)

In [4]:
mini_ensg_df = pd.read_csv("../Alias_gene_intersections/Created_files/ensg_alias_df.csv")

Note: duplicate gene symbols can have different ENSG ids

In [5]:
duplicateENSG_ID2 = mini_ensg_df[mini_ensg_df.duplicated('ensg_id', keep=False)]
duplicateENSG_ID2

Unnamed: 0.1,Unnamed: 0,gene_type,ref_id,gene_name,ensg_id,gene_synonyms,gene_symbol,entrez_id
38,1366,rRNA,495150,5.8S ribosomal RNA [Source:RFAM;Acc:RF00002],ENSG00000278294,,5_8S_rRNA,124907156
39,1367,rRNA,495150,5.8S ribosomal RNA [Source:RFAM;Acc:RF00002],ENSG00000278294,,5_8S_rRNA,124907485
40,1368,rRNA,495150,5.8S ribosomal RNA [Source:RFAM;Acc:RF00002],ENSG00000278294,,5_8S_rRNA,124908250
55,1386,protein_coding,2874124,"killer cell immunoglobulin like receptor, two ...",ENSG00000276779,"103AS, 15.212, CD158D",KIR2DL4,3805
56,1387,protein_coding,2874124,"killer cell immunoglobulin like receptor, two ...",ENSG00000276779,"103AS, 15.212, CD158D",KIR2DL4,124900568
...,...,...,...,...,...,...,...,...
52151,76329,snRNA,2949399,"RNA, variant U1 small nuclear 29 [Source:HGNC ...",ENSG00000273768,,RNVU1-29,124905574
52152,76330,snRNA,2949399,"RNA, variant U1 small nuclear 29 [Source:HGNC ...",ENSG00000273768,,RNVU1-29,124905808
52153,76331,snRNA,2949399,"RNA, variant U1 small nuclear 29 [Source:HGNC ...",ENSG00000273768,,RNVU1-29,124905809
52196,76395,protein_coding,2919059,phosphodiesterase 4D interacting protein [Sour...,ENSG00000178104,"CMYA2, KIAA0454, KIAA0477, MMGL",PDE4DIP,9659


In [6]:
duplicateENSG_ID2.to_csv('../ensg_duplicateENSG_ID2.csv', index=True)

### <a id='toc1_1_1_'></a>[Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc0_)

In [7]:
mini_ensg_df = mini_ensg_df.drop(['gene_type', 'ref_id', 'gene_name', 'Unnamed: 0'], axis=1)
mini_ensg_df = mini_ensg_df.rename(columns = {'gene_synonyms':'alias_symbol', 'ensg_id':'ENSG_ID' })
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
0,ENSG00000210049,"MTTF, trnF",MT-TF,0
1,ENSG00000211459,"12S, MOTS-c, MTRNR1",MT-RNR1,0
2,ENSG00000210077,"MTTV, trnV",MT-TV,0
3,ENSG00000210082,"16S, HN, MTRNR2",MT-RNR2,0
4,ENSG00000209082,"MTTL1, TRNL1",MT-TL1,0
...,...,...,...,...
52224,ENSG00000206764,,RNU6-152P,0
52225,ENSG00000231684,,EIF1P3,0
52226,ENSG00000264470,hsa-mir-4794,MIR4794,100616338
52227,ENSG00000162437,"FLJ10770, KIAA1579",RAVER2,55225


Note: a single ENSG ID can appear with several different Entrez IDs

In [8]:
duplicateENSG_ID = mini_ensg_df[mini_ensg_df.duplicated('ENSG_ID', keep=False)]
duplicateENSG_ID

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
38,ENSG00000278294,,5_8S_rRNA,124907156
39,ENSG00000278294,,5_8S_rRNA,124907485
40,ENSG00000278294,,5_8S_rRNA,124908250
55,ENSG00000276779,"103AS, 15.212, CD158D",KIR2DL4,3805
56,ENSG00000276779,"103AS, 15.212, CD158D",KIR2DL4,124900568
...,...,...,...,...
52151,ENSG00000273768,,RNVU1-29,124905574
52152,ENSG00000273768,,RNVU1-29,124905808
52153,ENSG00000273768,,RNVU1-29,124905809
52196,ENSG00000178104,"CMYA2, KIAA0454, KIAA0477, MMGL",PDE4DIP,9659
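
The duplicate check above can be taken one step further by counting how many distinct Entrez IDs each repeated ENSG ID carries. The following is a minimal sketch on made-up data; the IDs and the names `toy`, `dup_rows`, and `entrez_per_ensg` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy data with hypothetical IDs, not taken from the real file.
toy = pd.DataFrame({
    "ENSG_ID": ["ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C", "ENSG_C"],
    "entrez_id": [1, 2, 3, 4, 4],
})

# keep=False flags every member of a duplicated group, not only the repeats.
dup_rows = toy[toy.duplicated("ENSG_ID", keep=False)]

# Number of distinct Entrez IDs attached to each repeated ENSG ID.
entrez_per_ensg = dup_rows.groupby("ENSG_ID")["entrez_id"].nunique()
print(entrez_per_ensg)
```

A value greater than 1 in `entrez_per_ensg` marks an ENSG ID that maps to more than one Entrez ID.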


In [9]:
duplicateENSG_ID.to_csv('../ensg_duplicateENSG_ID.csv', index=True)

### <a id='toc1_1_2_'></a>[How many total unique gene records are there](#toc0_)

By ENSG ID

In [10]:
ensg_gene_id_set = set(mini_ensg_df['ENSG_ID'])
len(ensg_gene_id_set)

46806

By gene symbol

In [11]:
ensg_gene_symbol_set = set(mini_ensg_df['gene_symbol'])
len(ensg_gene_symbol_set)

40353
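
The gap between the two counts suggests that some gene symbols are attached to more than one ENSG ID. A small sketch of how that could be checked directly, using made-up rows (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical rows: GENE1 is attached to two different ENSG IDs.
toy = pd.DataFrame({
    "ENSG_ID": ["E1", "E2", "E3", "E3"],
    "gene_symbol": ["GENE1", "GENE1", "GENE2", "GENE2"],
})

# nunique() is equivalent to len(set(...)) here; it ignores missing values by default.
n_ids = toy["ENSG_ID"].nunique()
n_symbols = toy["gene_symbol"].nunique()

# Distinct ENSG IDs per gene symbol; values above 1 explain the gap.
ids_per_symbol = toy.groupby("gene_symbol")["ENSG_ID"].nunique()
multi_id_symbols = ids_per_symbol[ids_per_symbol > 1]
print(n_ids, n_symbols)
print(multi_id_symbols)
```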

Using IGHV as an example:

In [12]:
type(mini_ensg_df.gene_symbol[0])

str

In [13]:
# ensg_IGHV_alias_df = mini_ensg_df.loc[mini_ensg_df['gene_symbol'].str.contains("IGHV", case=False)]

In [14]:
ensg_IGHV_alias_df = mini_ensg_df[mini_ensg_df['gene_symbol'].str.contains('IGHV')]

In [15]:
ensg_IGHV_alias_df.to_csv('../ensg_IGHV_alias_df.csv')

### <a id='toc1_1_3_'></a>[Drop genes with no aliases](#toc0_)

In [16]:
mini_ensg_df = mini_ensg_df.dropna(subset=['alias_symbol'])
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
0,ENSG00000210049,"MTTF, trnF",MT-TF,0
1,ENSG00000211459,"12S, MOTS-c, MTRNR1",MT-RNR1,0
2,ENSG00000210077,"MTTV, trnV",MT-TV,0
3,ENSG00000210082,"16S, HN, MTRNR2",MT-RNR2,0
4,ENSG00000209082,"MTTL1, TRNL1",MT-TL1,0
...,...,...,...,...
52220,ENSG00000198216,"BII, CACH6, CACNL1A6, Cav2.3",CACNA1E,777
52221,ENSG00000179930,FLJ46813,ZNF648,127665
52226,ENSG00000264470,hsa-mir-4794,MIR4794,100616338
52227,ENSG00000162437,"FLJ10770, KIAA1579",RAVER2,55225


### <a id='toc1_1_4_'></a>[Make each row in alias_symbol a set:](#toc0_)
    convert to a list 
    make a set

In [17]:
mini_ensg_df['alias_symbol'] = mini_ensg_df['alias_symbol'].astype(str)
mini_ensg_df['alias_symbol'] = [x.split(',') for x in mini_ensg_df.alias_symbol]
mini_ensg_df['alias_symbol']=np.where(mini_ensg_df.alias_symbol=='','',mini_ensg_df.alias_symbol.map(set))
mini_ensg_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ensg_df['alias_symbol'] = mini_ensg_df['alias_symbol'].astype(str)

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
0,ENSG00000210049,"{MTTF, trnF}",MT-TF,0
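
The synonyms displayed above appear to be separated by a comma followed by a space. If that holds for the raw file (an assumption, not something verified here), splitting on `','` alone leaves a leading space on every alias after the first, so `' trnF'` and `'trnF'` would be treated as different aliases. A minimal sketch of the difference on made-up values:

```python
import pandas as pd

# Hypothetical aliases; BBB is shared by both genes.
toy = pd.DataFrame({
    "gene_symbol": ["GENE1", "GENE2"],
    "alias_symbol": ["AAA, BBB", "BBB"],
})

# Plain split keeps the space that follows each comma.
naive = [set(x.split(",")) for x in toy["alias_symbol"]]

# Stripping each piece makes 'BBB' compare equal across rows.
stripped = [{a.strip() for a in x.split(",")} for x in toy["alias_symbol"]]

print(naive)
print(stripped)
```

With the naive split the two genes share no alias; after stripping, `BBB` is correctly detected in both sets.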


### <a id='toc1_1_5_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [18]:
mini_ensg_df = mini_ensg_df.explode('alias_symbol')
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
0,ENSG00000210049,MTTF,MT-TF,0
0,ENSG00000210049,trnF,MT-TF,0
1,ENSG00000211459,MTRNR1,MT-RNR1,0
1,ENSG00000211459,12S,MT-RNR1,0
1,ENSG00000211459,MOTS-c,MT-RNR1,0
...,...,...,...,...
52221,ENSG00000179930,FLJ46813,ZNF648,127665
52226,ENSG00000264470,hsa-mir-4794,MIR4794,100616338
52227,ENSG00000162437,KIAA1579,RAVER2,55225
52227,ENSG00000162437,FLJ10770,RAVER2,55225
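
As an illustration of what `explode` does to the index, here is a small sketch on made-up data. It uses lists instead of sets so that the row order is deterministic; the names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "gene_symbol": ["GENE1", "GENE2"],
    "alias_symbol": [["AAA", "BBB"], ["CCC"]],
})

# Each list element becomes its own row; the original index label is repeated.
long_df = toy.explode("alias_symbol")

# Optional: give every row a unique index again.
long_reset = long_df.reset_index(drop=True)
print(long_df)
```

The repeated index labels (0, 0, 1 here) are why the tables above show the same row number several times.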


### <a id='toc1_1_6_'></a>[How many total unique aliases are there](#toc0_)

In [19]:
ensg_alias_symbol_set = set(mini_ensg_df['alias_symbol'])
ensg_alias_len = len(ensg_alias_symbol_set)
ensg_alias_len

54926

### <a id='toc1_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [20]:
mini_ensg_df['alias_duplicates'] = mini_ensg_df.duplicated(subset= 'alias_symbol', keep=False)
mini_ensg_df = mini_ensg_df[mini_ensg_df['alias_duplicates'] == True]
mini_ensg_df = mini_ensg_df.drop(['alias_duplicates'], axis=1)
mini_ensg_df.head(5)

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
5,ENSG00000198888,ND1,MT-ND1,4535
43,ENSG00000281486,G2SYN,SNTG2,54221
43,ENSG00000281486,SYN5,SNTG2,54221
44,ENSG00000262826,FLJ21919,INTS3,65123
44,ENSG00000262826,C1orf60,INTS3,65123


### <a id='toc1_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [21]:
mini_ensg_df = mini_ensg_df.sort_values('alias_symbol')
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
80,ENSG00000278074,15.212,KIR2DL4,3805
631,ENSG00000277964,15.212,KIR2DL4,124900568
811,ENSG00000278430,15.212,KIR2DL4,3805
671,ENSG00000277362,15.212,KIR2DL4,3805
1300,ENSG00000276979,15.212,KIR2DL4,3805
...,...,...,...,...
8105,ENSG00000100764,p56,PSMC1,5700
7724,ENSG00000283997,promethin,LDAF1,57146
37151,ENSG00000011638,promethin,LDAF1,57146
27801,ENSG00000290418,psiTPTE22,TPTEP1,387590
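
In the sorted rows above, an alias such as 15.212 is repeated across several records that all carry the same gene symbol. Whether such cases should count as shared depends on the question being asked. The sketch below, on made-up data with hypothetical names, contrasts "alias appears on more than one row" with "alias appears under more than one gene symbol":

```python
import pandas as pd

toy = pd.DataFrame({
    "ENSG_ID": ["E1", "E2", "E3", "E4", "E5"],
    "gene_symbol": ["GENE1", "GENE1", "GENE2", "GENE3", "GENE4"],
    "alias_symbol": ["AAA", "AAA", "BBB", "BBB", "CCC"],
})

# Same test as in the notebook: alias occurs on more than one row.
repeated = toy[toy.duplicated(subset="alias_symbol", keep=False)]

# Stricter test: alias is used by more than one distinct gene symbol.
symbols_per_alias = toy.groupby("alias_symbol")["gene_symbol"].transform("nunique")
shared_across_symbols = toy[symbols_per_alias > 1]

print(repeated)
print(shared_across_symbols)
```

Here `AAA` is repeated but belongs to a single gene symbol, so only `BBB` survives the stricter filter.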


In [22]:
#ensg_CD158b_alias_count_df.to_csv('../hgnc_CD158b_alias_count_df.csv')

### <a id='toc1_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [23]:
ensg_gene_symbol_set_wsharedalias = set(mini_ensg_df['gene_symbol'])
len(ensg_gene_symbol_set_wsharedalias)

3680

In [24]:
mini_ensg_df['source'] = 'ENSG'
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id,source
80,ENSG00000278074,15.212,KIR2DL4,3805,ENSG
631,ENSG00000277964,15.212,KIR2DL4,124900568,ENSG
811,ENSG00000278430,15.212,KIR2DL4,3805,ENSG
671,ENSG00000277362,15.212,KIR2DL4,3805,ENSG
1300,ENSG00000276979,15.212,KIR2DL4,3805,ENSG
...,...,...,...,...,...
8105,ENSG00000100764,p56,PSMC1,5700,ENSG
7724,ENSG00000283997,promethin,LDAF1,57146,ENSG
37151,ENSG00000011638,promethin,LDAF1,57146,ENSG
27801,ENSG00000290418,psiTPTE22,TPTEP1,387590,ENSG


### <a id='toc1_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [25]:
ensg_dup_alias_count_df = mini_ensg_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
ensg_dup_alias_count_df

alias_symbol
 15.212      82
 5T4-AG       2
 A1AT         2
 A3GALT1      2
 AAP          2
             ..
p42           2
p55           2
p56           2
promethin     2
psiTPTE22     2
Length: 5307, dtype: int64

In [26]:
ensg_dup_alias_count_df = ensg_dup_alias_count_df.reset_index()
ensg_dup_alias_count_df

Unnamed: 0,alias_symbol,0
0,15.212,82
1,5T4-AG,2
2,A1AT,2
3,A3GALT1,2
4,AAP,2
...,...,...
5302,p42,2
5303,p55,2
5304,p56,2
5305,promethin,2


In [27]:
ensg_dup_alias_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
ensg_dup_alias_count_df

Unnamed: 0,alias_symbol,num_gene_records
0,15.212,82
1,5T4-AG,2
2,A1AT,2
3,A3GALT1,2
4,AAP,2
...,...,...
5302,p42,2
5303,p55,2
5304,p56,2
5305,promethin,2
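
The three steps above (count, reset the index, rename column `0`) can also be written as one chain. This is only a sketch of an equivalent formulation on made-up data, not the notebook's own code:

```python
import pandas as pd

toy = pd.DataFrame({"alias_symbol": ["AAA", "AAA", "AAA", "BBB", "BBB"]})

# groupby(...).size() counts rows per alias; reset_index(name=...) names the
# count column directly, so no separate rename is needed.
counts = (
    toy.groupby("alias_symbol")
    .size()
    .reset_index(name="num_gene_records")
    .sort_values("num_gene_records", ascending=False)
)
print(counts)
```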


In [28]:
ensg_dup_alias_count_df = ensg_dup_alias_count_df.sort_values('num_gene_records', ascending=False)
ensg_dup_alias_count_df.head(30)

Unnamed: 0,alias_symbol,num_gene_records
4853,RN5S3,218
4849,RN5S2,216
4844,RN5S16,216
4845,RN5S17,216
4835,RN5S1,216
4865,RNA5-8N2,211
4837,RN5S11,108
4839,RN5S12,108
4840,RN5S13,108
4842,RN5S14,108


In [29]:
ensg_alias_alias_collision_set = set(ensg_dup_alias_count_df['alias_symbol'])
len(ensg_alias_alias_collision_set)

5307

In [30]:
ensg_dup_alias_count_df.to_csv('../ensg_dup_alias_count_df.csv', index=True)

In [31]:
ensg_alias_count_histogram_df = ensg_dup_alias_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
ensg_alias_count_histogram_df

num_gene_records
2      3738
3       394
4       173
5       112
6       160
7       278
8       258
9        17
10       50
11        6
12        4
13       11
14        8
15        5
16       15
18        1
21        2
25        4
26        1
28        1
30        4
31        6
32       10
33        9
35        8
41        2
42        4
43        5
53        1
82        3
108      11
211       1
216       4
218       1
dtype: int64

In [32]:
ensg_alias_count_histogram_df = ensg_alias_count_histogram_df.reset_index()
ensg_alias_count_histogram_df

Unnamed: 0,num_gene_records,0
0,2,3738
1,3,394
2,4,173
3,5,112
4,6,160
5,7,278
6,8,258
7,9,17
8,10,50
9,11,6


In [33]:
ensg_alias_count_histogram_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
ensg_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol
0,2,3738
1,3,394
2,4,173
3,5,112
4,6,160
5,7,278
6,8,258
7,9,17
8,10,50
9,11,6


In [34]:
ensg_alias_count_histogram_df['percent_alias_symbol'] = ((ensg_alias_count_histogram_df['num_alias_symbol'] / ensg_alias_len) * 100)
ensg_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,3738,6.80552
1,3,394,0.717329
2,4,173,0.314969
3,5,112,0.203911
4,6,160,0.291301
5,7,278,0.506136
6,8,258,0.469723
7,9,17,0.030951
8,10,50,0.091032
9,11,6,0.010924


In [35]:
ensg_alias_count_histogram_df = ensg_alias_count_histogram_df.drop('num_alias_symbol', axis=1)
ensg_alias_count_histogram_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,6.80552
1,3,0.717329
2,4,0.314969
3,5,0.203911
4,6,0.291301
5,7,0.506136
6,8,0.469723
7,9,0.030951
8,10,0.091032
9,11,0.010924
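
The percentage column is the number of aliases at each sharing level divided by the total number of unique aliases, times 100. A compact sketch of the same calculation on made-up counts; `total_unique_aliases` stands in for `ensg_alias_len` and its value here is hypothetical:

```python
import pandas as pd

# Hypothetical per-alias counts: three aliases shared by 2 records, one by 3.
counts = pd.DataFrame({
    "alias_symbol": ["AAA", "BBB", "CCC", "DDD"],
    "num_gene_records": [2, 2, 2, 3],
})
total_unique_aliases = 8  # hypothetical stand-in for ensg_alias_len

# How many aliases fall into each sharing level.
hist = (
    counts.groupby("num_gene_records")
    .size()
    .reset_index(name="num_alias_symbol")
)

# Share of all unique aliases at each level, in percent.
hist["percent_alias_symbol"] = hist["num_alias_symbol"] / total_unique_aliases * 100
print(hist)
```

Because the denominator is all unique aliases (shared or not), the percentages do not sum to 100.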


In [36]:
#px.bar(ensg_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

In [37]:
ensg_dup_alias_count_df.to_csv('../ensg_alias_overlap_count.csv', index=True)

#### <a id='toc1_1_10_1_'></a>[Save as csv](#toc0_)

In [38]:
#mini_ensg_df_explode.to_csv('../ensg_alias_overlap.csv', index=False)

### <a id='toc1_1_11_'></a>[Put columns in different order to emphasize alias symbols instead of gene records](#toc0_)

In [39]:
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id,source
80,ENSG00000278074,15.212,KIR2DL4,3805,ENSG
631,ENSG00000277964,15.212,KIR2DL4,124900568,ENSG
811,ENSG00000278430,15.212,KIR2DL4,3805,ENSG
671,ENSG00000277362,15.212,KIR2DL4,3805,ENSG
1300,ENSG00000276979,15.212,KIR2DL4,3805,ENSG
...,...,...,...,...,...
8105,ENSG00000100764,p56,PSMC1,5700,ENSG
7724,ENSG00000283997,promethin,LDAF1,57146,ENSG
37151,ENSG00000011638,promethin,LDAF1,57146,ENSG
27801,ENSG00000290418,psiTPTE22,TPTEP1,387590,ENSG


In [40]:
mini_ensg_df_2 = mini_ensg_df[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
mini_ensg_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
80,15.212,ENSG00000278074,KIR2DL4,ENSG
631,15.212,ENSG00000277964,KIR2DL4,ENSG
811,15.212,ENSG00000278430,KIR2DL4,ENSG
671,15.212,ENSG00000277362,KIR2DL4,ENSG
1300,15.212,ENSG00000276979,KIR2DL4,ENSG
...,...,...,...,...
8105,p56,ENSG00000100764,PSMC1,ENSG
7724,promethin,ENSG00000283997,LDAF1,ENSG
37151,promethin,ENSG00000011638,LDAF1,ENSG
27801,psiTPTE22,ENSG00000290418,TPTEP1,ENSG


### <a id='toc1_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [41]:
mini_ensg_df_2 = mini_ensg_df_2.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [42]:
mini_ensg_df_2 = mini_ensg_df_2.astype(str)
mini_ensg_df_2


Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
80,15.212,ENSG00000278074,KIR2DL4,ENSG
6019,5T4-AG,ENSG00000283085,TPBG,ENSG
21101,A1AT,ENSG00000197249,SERPINA1,ENSG
29537,A3GALT1,ENSG00000175164,ABO,ENSG
2816,AAP,ENSG00000276838,SERPINF2,ENSG
...,...,...,...,...
50647,p55,ENSG00000197170,PSMD12,ENSG
23567,p56,ENSG00000227211,H3P45,ENSG
8105,p56,ENSG00000100764,PSMC1,ENSG
7724,promethin,ENSG00000283997,LDAF1,ENSG


In [43]:
mini_ensg_df_2 = mini_ensg_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
mini_ensg_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,15.212,ENSG00000278074,KIR2DL4,ENSG
1,5T4-AG,ENSG00000283085,TPBG,ENSG
2,A1AT,ENSG00000197249,SERPINA1,ENSG
3,A3GALT1,ENSG00000175164,ABO,ENSG
4,AAP,ENSG00000276838,SERPINF2,ENSG
...,...,...,...,...
5302,p42,"ENSG00000100519, ENSG00000227443","PSMC6, H3P30",ENSG
5303,p55,"ENSG00000117461, ENSG00000197170","PIK3R3, PSMD12",ENSG
5304,p56,"ENSG00000227211, ENSG00000100764","H3P45, PSMC1",ENSG
5305,promethin,ENSG00000283997,LDAF1,ENSG
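
To make the merge step concrete, here is a small sketch of the same `groupby` and `agg` pattern on made-up rows (all values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "alias_symbol": ["AAA", "AAA", "BBB"],
    "ENSG_ID": ["E1", "E2", "E3"],
    "gene_symbol": ["GENE1", "GENE2", "GENE3"],
    "source": ["DEMO", "DEMO", "DEMO"],
})

# One row per alias: IDs and symbols are joined into comma-separated strings,
# while the source label is taken from the first row of each group.
merged = (
    toy.groupby("alias_symbol")
    .agg({"ENSG_ID": ", ".join, "gene_symbol": ", ".join, "source": "first"})
    .reset_index()
)
print(merged)
```

`', '.join` only works on strings, which is why the columns are converted to `str` before grouping.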


# <a id='toc2_'></a>[HGNC](#toc0_)

In [44]:
df1 = pd.read_csv("../hgnc_filtered.csv")

### <a id='toc2_1_1_'></a>[Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc0_)

In [45]:
mini_hgnc_df = df1.drop(['Unnamed: 0', 'hgnc_id', 'locus_type', 'name', 'mane_select', 'locus_group', 'entrez_id', 'agr', 'refseq_accession', 'alias_name', 'ENSEMBLtrans', 'NA', 'unknown'], axis=1)
mini_hgnc_df = mini_hgnc_df.rename(columns = {'ensembl_gene_id':'ENSG_ID', 'symbol':'gene_symbol'})
mini_hgnc_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
0,A1BG,ENSG00000121410,
1,A1BG-AS1,ENSG00000268895,FLJ23569
2,A1CF,ENSG00000148584,ACF;ASP;ACF64;ACF65;APOBEC1CF
3,A2M,ENSG00000175899,FWP007;S863-7;CPAMD5
4,A2M-AS1,ENSG00000245105,
...,...,...,...
43159,ZYG11B,ENSG00000162378,FLJ13456
43160,ZYX,ENSG00000159840,
43161,ZYXP1,ENSG00000274572,
43162,ZZEF1,ENSG00000074755,KIAA0399;ZZZ4;FLJ10821


### <a id='toc2_1_2_'></a>[How many total unique gene records are there](#toc0_)

In [46]:
hgnc_gene_symbol_set = set(mini_hgnc_df['gene_symbol'])
len(hgnc_gene_symbol_set)

43164

Looking at IGHV as an example:

In [47]:
hgnc_IGHV_alias_df = mini_hgnc_df[mini_hgnc_df['gene_symbol'].str.contains('IGHV')]

In [48]:
type(mini_hgnc_df.gene_symbol[0])

str

In [49]:
hgnc_IGHV_alias_df.to_csv('../hgnc_IGHV_alias_df.csv')

In [50]:
hgnc_IGHV_alias_count_df = mini_hgnc_df.loc[mini_hgnc_df['gene_symbol'] == "IGHV1-24" ]
hgnc_IGHV_alias_count_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
12539,IGHV1-24,ENSG00000211950,


In [51]:
hgnc_IGHV_alias_count_df = mini_hgnc_df.loc[mini_hgnc_df['ENSG_ID'] == "ENSG00000211950" ]
hgnc_IGHV_alias_count_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
12539,IGHV1-24,ENSG00000211950,


### <a id='toc2_1_3_'></a>[Drop genes with no aliases](#toc0_)

In [52]:
mini_hgnc_df = mini_hgnc_df.dropna(subset=['alias_symbol'])
mini_hgnc_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
1,A1BG-AS1,ENSG00000268895,FLJ23569
2,A1CF,ENSG00000148584,ACF;ASP;ACF64;ACF65;APOBEC1CF
3,A2M,ENSG00000175899,FWP007;S863-7;CPAMD5
5,A2ML1,ENSG00000166535,FLJ25179;p170
9,A3GALT2,ENSG00000184389,IGBS3S;IGB3S
...,...,...,...
43156,ZXDC,ENSG00000070476,MGC11349;FLJ13861
43157,ZYG11A,ENSG00000203995,ZYG11
43159,ZYG11B,ENSG00000162378,FLJ13456
43162,ZZEF1,ENSG00000074755,KIAA0399;ZZZ4;FLJ10821


### <a id='toc2_1_4_'></a>[Make each row in alias_symbol a set:](#toc0_)
    convert to a list 
    make a set

In [53]:
mini_hgnc_df['alias_symbol'] = mini_hgnc_df['alias_symbol'].astype(str)
mini_hgnc_df['alias_symbol'] = [x.split(';') for x in mini_hgnc_df.alias_symbol]
mini_hgnc_df['alias_symbol']=np.where(mini_hgnc_df.alias_symbol=='','',mini_hgnc_df.alias_symbol.map(set))
mini_hgnc_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_hgnc_df['alias_symbol'] = mini_hgnc_df['alias_symbol'].astype(str)

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
1,A1BG-AS1,ENSG00000268895,{FLJ23569}
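
The warning printed for this cell is pandas' SettingWithCopyWarning: the frame being modified was produced by filtering another frame, so pandas cannot tell whether the assignment is meant to reach the original. One common way to avoid it is to take an explicit copy right after filtering. A minimal sketch on made-up data (names and values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gene_symbol": ["GENE1", "GENE2", "GENE3"],
    "alias_symbol": ["AAA;BBB", np.nan, "CCC"],
})

# .copy() makes the filtered frame independent of `toy`,
# so later column assignments are unambiguous.
with_alias = toy.dropna(subset=["alias_symbol"]).copy()

# Split on the separator and turn each list into a set.
with_alias["alias_symbol"] = with_alias["alias_symbol"].str.split(";").map(set)
print(with_alias)
```

The original `toy` frame is left untouched by the assignment.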


### <a id='toc2_1_5_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [54]:
mini_hgnc_df = mini_hgnc_df.explode(column="alias_symbol")
mini_hgnc_df.head(5)

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
1,A1BG-AS1,ENSG00000268895,FLJ23569
2,A1CF,ENSG00000148584,ASP
2,A1CF,ENSG00000148584,ACF65
2,A1CF,ENSG00000148584,ACF64
2,A1CF,ENSG00000148584,ACF


Looking at CD158b as an example:

In [55]:
mini_hgnc_df.loc[mini_hgnc_df['gene_symbol'] == "CD158b" ]

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol


In [56]:
hgnc_CD158b_alias_count_df = mini_hgnc_df.loc[mini_hgnc_df['alias_symbol'] == "CD158b" ]
hgnc_CD158b_alias_count_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol


In [57]:
hgnc_CD158b_alias_count_df.to_csv('../hgnc_CD158b_alias_count_df.csv')

### <a id='toc2_1_6_'></a>[How many total unique aliases are there](#toc0_)

In [58]:
hgnc_alias_symbol_set = set(mini_hgnc_df['alias_symbol'])
hgnc_alias_len = len(hgnc_alias_symbol_set)
hgnc_alias_len

41589

### <a id='toc2_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [59]:
mini_hgnc_df['alias_duplicates'] = mini_hgnc_df.duplicated(subset= 'alias_symbol', keep=False)
mini_hgnc_df = mini_hgnc_df[mini_hgnc_df['alias_duplicates'] == True]
mini_hgnc_df = mini_hgnc_df.drop(['alias_duplicates'], axis=1)
mini_hgnc_df.head(5)

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
2,A1CF,ENSG00000148584,ASP
22,AAGAB,ENSG00000103591,p34
65,ABCB8,ENSG00000197150,M-ABC1
67,ABCB10,ENSG00000135776,M-ABC2
68,ABCB10P1,ENSG00000274099,M-ABC2


### <a id='toc2_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [60]:
mini_hgnc_df = mini_hgnc_df.sort_values('alias_symbol')
mini_hgnc_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
35226,SLC25A5,ENSG00000005022,2F1
13978,KLRG1,ENSG00000139187,2F1
33987,S100A8,ENSG00000143546,60B8AG
33988,S100A9,ENSG00000163220,60B8AG
31335,RNU6V,ENSG00000206832,87U6
...,...,...,...
38078,TEX28P2,ENSG00000277008,pTEX
26200,PPP4R3A,ENSG00000100796,smk1
26203,PPP4R3C,ENSG00000224960,smk1
36661,SPATA2,ENSG00000158480,tamo


### <a id='toc2_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [61]:
hgnc_gene_symbol_set_wsharedalias = set(mini_hgnc_df['gene_symbol'])
len(hgnc_gene_symbol_set_wsharedalias)

2084

In [62]:
mini_hgnc_df['source'] = 'HGNC'
mini_hgnc_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
35226,SLC25A5,ENSG00000005022,2F1,HGNC
13978,KLRG1,ENSG00000139187,2F1,HGNC
33987,S100A8,ENSG00000143546,60B8AG,HGNC
33988,S100A9,ENSG00000163220,60B8AG,HGNC
31335,RNU6V,ENSG00000206832,87U6,HGNC
...,...,...,...,...
38078,TEX28P2,ENSG00000277008,pTEX,HGNC
26200,PPP4R3A,ENSG00000100796,smk1,HGNC
26203,PPP4R3C,ENSG00000224960,smk1,HGNC
36661,SPATA2,ENSG00000158480,tamo,HGNC


### <a id='toc2_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [63]:
hgnc_dup_alias_count_df = mini_hgnc_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
hgnc_dup_alias_count_df

alias_symbol
2F1       2
60B8AG    2
87U6      2
9G8       2
A1        3
         ..
p97       4
pH2A/f    2
pTEX      2
smk1      2
tamo      2
Length: 1040, dtype: int64

In [64]:
hgnc_dup_alias_count_df = hgnc_dup_alias_count_df.reset_index()
hgnc_dup_alias_count_df

Unnamed: 0,alias_symbol,0
0,2F1,2
1,60B8AG,2
2,87U6,2
3,9G8,2
4,A1,3
...,...,...
1035,p97,4
1036,pH2A/f,2
1037,pTEX,2
1038,smk1,2


In [65]:
print(hgnc_dup_alias_count_df.columns)

Index(['alias_symbol', 0], dtype='object')


In [66]:
hgnc_dup_alias_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
hgnc_dup_alias_count_df

Unnamed: 0,alias_symbol,num_gene_records
0,2F1,2
1,60B8AG,2
2,87U6,2
3,9G8,2
4,A1,3
...,...,...
1035,p97,4
1036,pH2A/f,2
1037,pTEX,2
1038,smk1,2


In [67]:
hgnc_dup_alias_count_df = hgnc_dup_alias_count_df.sort_values('num_gene_records', ascending=False)
hgnc_dup_alias_count_df.head(30)

Unnamed: 0,alias_symbol,num_gene_records
68,ASP,7
653,PAP,7
935,U4,7
118,CAP,6
568,MYM,6
267,F379,6
27,AIP1,6
1012,p40,5
31,ALP,5
571,NAP1,5


In [68]:
hgnc_alias_alias_collision_set = set(hgnc_dup_alias_count_df['alias_symbol'])
len(hgnc_alias_alias_collision_set)

1040

In [69]:
hgnc_dup_alias_count_df.to_csv('../hgnc_dup_alias_count_df.csv', index=True)

In [70]:
hgnc_alias_count_histogram_df = hgnc_dup_alias_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
hgnc_alias_count_histogram_df

num_gene_records
2    858
3    125
4     38
5     12
6      4
7      3
dtype: int64

In [71]:
hgnc_alias_count_histogram_df = hgnc_alias_count_histogram_df.reset_index()
hgnc_alias_count_histogram_df

Unnamed: 0,num_gene_records,0
0,2,858
1,3,125
2,4,38
3,5,12
4,6,4
5,7,3


In [72]:
hgnc_alias_count_histogram_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
hgnc_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol
0,2,858
1,3,125
2,4,38
3,5,12
4,6,4
5,7,3


In [73]:
hgnc_alias_count_histogram_df['percent_alias_symbol'] = ((hgnc_alias_count_histogram_df['num_alias_symbol'] / hgnc_alias_len) * 100)
hgnc_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,858,2.063046
1,3,125,0.30056
2,4,38,0.09137
3,5,12,0.028854
4,6,4,0.009618
5,7,3,0.007213


In [74]:
hgnc_alias_count_histogram_df = hgnc_alias_count_histogram_df.drop('num_alias_symbol', axis=1)
hgnc_alias_count_histogram_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,2.063046
1,3,0.30056
2,4,0.09137
3,5,0.028854
4,6,0.009618
5,7,0.007213


In [75]:
#px.bar(hgnc_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

#### <a id='toc2_1_10_1_'></a>[Save as csv](#toc0_)

In [76]:
#mini_hgnc_df_explode.to_csv('../hgnc_alias_overlap.csv', index=False)

### <a id='toc2_1_11_'></a>[Put columns in different order to emphasize alias symbols instead of gene records](#toc0_)

In [77]:
mini_hgnc_df_2 = mini_hgnc_df.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [78]:
mini_hgnc_df_2 = mini_hgnc_df_2[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
mini_hgnc_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
35226,2F1,ENSG00000005022,SLC25A5,HGNC
13978,2F1,ENSG00000139187,KLRG1,HGNC
33987,60B8AG,ENSG00000143546,S100A8,HGNC
33988,60B8AG,ENSG00000163220,S100A9,HGNC
31335,87U6,ENSG00000206832,RNU6V,HGNC
...,...,...,...,...
38078,pTEX,ENSG00000277008,TEX28P2,HGNC
26200,smk1,ENSG00000100796,PPP4R3A,HGNC
26203,smk1,ENSG00000224960,PPP4R3C,HGNC
36661,tamo,ENSG00000158480,SPATA2,HGNC


### <a id='toc2_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [79]:
mini_hgnc_df_2 = mini_hgnc_df_2.astype(str)
mini_hgnc_df_2


Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
35226,2F1,ENSG00000005022,SLC25A5,HGNC
13978,2F1,ENSG00000139187,KLRG1,HGNC
33987,60B8AG,ENSG00000143546,S100A8,HGNC
33988,60B8AG,ENSG00000163220,S100A9,HGNC
31335,87U6,ENSG00000206832,RNU6V,HGNC
...,...,...,...,...
38078,pTEX,ENSG00000277008,TEX28P2,HGNC
26200,smk1,ENSG00000100796,PPP4R3A,HGNC
26203,smk1,ENSG00000224960,PPP4R3C,HGNC
36661,tamo,ENSG00000158480,SPATA2,HGNC


In [80]:
mini_hgnc_df_2 = mini_hgnc_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
mini_hgnc_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,2F1,"ENSG00000005022, ENSG00000139187","SLC25A5, KLRG1",HGNC
1,60B8AG,"ENSG00000143546, ENSG00000163220","S100A8, S100A9",HGNC
2,87U6,"ENSG00000206832, ENSG00000065135","RNU6V, GNAI3",HGNC
3,9G8,"ENSG00000164609, ENSG00000115875","SLU7, SRSF7",HGNC
4,A1,"ENSG00000035928, ENSG00000163918, ENSG00000049541","RFC1, RFC4, RFC2",HGNC
...,...,...,...,...
1035,p97,"ENSG00000165280, ENSG00000110321, ENSG00000179...","VCP, EIF4G2, GEMIN4, CFDP1",HGNC
1036,pH2A/f,"ENSG00000196787, ENSG00000234816","H2AC11, H2AC5P",HGNC
1037,pTEX,"ENSG00000274962, ENSG00000277008","TEX28P1, TEX28P2",HGNC
1038,smk1,"ENSG00000100796, ENSG00000224960","PPP4R3A, PPP4R3C",HGNC


In [81]:
#mini_hgnc_df_2.to_csv('../hgnc_alias_overlap_2.csv', index=False)

# <a id='toc3_'></a>[NCBI Info](#toc0_)

In [82]:
df2 = pd.read_csv("../ncbi_info_20220719_filtered.csv")


### <a id='toc3_1_1_'></a>[Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc0_)

In [83]:
mini_ncbi_df = df2.drop(['Unnamed: 0', '#tax_id','GeneID', 'dbXrefs', 'description', 'type_of_gene', 'Symbol_from_nomenclature_authority', 'Full_name_from_nomenclature_authority', 'Other_designations', 'MIM', 'HGNC', 'AllianceGenome','MIRbase', 'IMGTgene_db', 'dash', 'unknown'], axis=1)
mini_ncbi_df = mini_ncbi_df.rename(columns = {'Symbol':'gene_symbol','Synonyms':'alias_symbol', 'ENSEMBL':'ENSG_ID'})
mini_ncbi_df['ENSG_ID'] = mini_ncbi_df['ENSG_ID'].astype(str)
mini_ncbi_df['ENSG_ID'] = mini_ncbi_df['ENSG_ID'].apply(str.upper)
mini_ncbi_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,A1B|ABG|GAB|HYST2477,ENSG00000121410
1,A2M,A2MD|CPAMD5|FWP007|S863-7,ENSG00000175899
2,A2MP1,A2MP,ENSG00000256069
3,NAT1,AAC1|MNAT|NAT-1|NATI,ENSG00000171428
4,NAT2,AAC2|NAT-2|PNAT,ENSG00000156006
...,...,...,...
75495,trnD,-,NAN
75496,trnP,-,NAN
75497,trnA,-,NAN
75498,COX1,-,NAN
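
In the output above, records without an Ensembl ID show the string `NAN`. That is what casting a missing value to a string and upper-casing it produces, and it means those rows no longer register as missing. The sketch below, on made-up values, imitates that effect and contrasts it with the `.str.upper()` accessor, which leaves missing values missing; the IDs are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ENSG_ID": ["ensg00000000001", np.nan]})

# Imitates "cast to str, then upper": the missing value becomes the text 'NAN'.
stringified = pd.Series([str(v).upper() for v in toy["ENSG_ID"]])

# The .str accessor upper-cases real strings and keeps missing values as NaN.
preserved = toy["ENSG_ID"].str.upper()

print(stringified.tolist())
print(preserved.isna().tolist())
```

Keeping real missing values makes it possible to filter them later with `isna()` or `dropna()` instead of comparing against the text `'NAN'`.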


### <a id='toc3_1_2_'></a>[How many total unique gene records are there](#toc0_)

In [84]:
ncbi_gene_symbol_set = set(mini_ncbi_df['gene_symbol'])
len(ncbi_gene_symbol_set)

75346

### <a id='toc3_1_3_'></a>[Drop genes with no aliases](#toc0_)

In [85]:
mini_ncbi_df = mini_ncbi_df.replace("-", np.nan)
mini_ncbi_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,A1B|ABG|GAB|HYST2477,ENSG00000121410
1,A2M,A2MD|CPAMD5|FWP007|S863-7,ENSG00000175899
2,A2MP1,A2MP,ENSG00000256069
3,NAT1,AAC1|MNAT|NAT-1|NATI,ENSG00000171428
4,NAT2,AAC2|NAT-2|PNAT,ENSG00000156006
...,...,...,...
75495,trnD,,NAN
75496,trnP,,NAN
75497,trnA,,NAN
75498,COX1,,NAN


In [86]:
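# .copy() makes the filtered frame independent, so later column assignments do not raise SettingWithCopyWarning
mini_ncbi_df = mini_ncbi_df.dropna(subset=['alias_symbol']).copy()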
mini_ncbi_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,A1B|ABG|GAB|HYST2477,ENSG00000121410
1,A2M,A2MD|CPAMD5|FWP007|S863-7,ENSG00000175899
2,A2MP1,A2MP,ENSG00000256069
3,NAT1,AAC1|MNAT|NAT-1|NATI,ENSG00000171428
4,NAT2,AAC2|NAT-2|PNAT,ENSG00000156006
...,...,...,...
71686,LOC124906931,WASH,NAN
71738,LOC124906983,WASH,NAN
72857,LOC124908102,WASH,NAN
74876,POLGARF,ORF-Y|POLG,NAN


### <a id='toc3_1_4_'></a>[Make each row in alias_symbol a set:](#toc0_)
    convert to a list
    make a set

In [87]:
# Split each pipe-delimited alias string into a list, then collapse the list to a set
mini_ncbi_df['alias_symbol'] = mini_ncbi_df['alias_symbol'].astype(str).str.split('|')
mini_ncbi_df['alias_symbol'] = mini_ncbi_df['alias_symbol'].map(set)
mini_ncbi_df.head(1)

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,"{HYST2477, A1B, ABG, GAB}",ENSG00000121410


### <a id='toc3_1_5_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [88]:
mini_ncbi_df = mini_ncbi_df.explode(column="alias_symbol")
mini_ncbi_df.head(5)

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,HYST2477,ENSG00000121410
0,A1BG,A1B,ENSG00000121410
0,A1BG,ABG,ENSG00000121410
0,A1BG,GAB,ENSG00000121410
1,A2M,S863-7,ENSG00000175899
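
The split, set, and explode steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual data: `toy_df`, `exploded_df`, and the rows in them are hypothetical.

```python
import pandas as pd

# Toy stand-in for the NCBI table; the rows are illustrative only.
toy_df = pd.DataFrame({
    "gene_symbol": ["A1BG", "A2MP1", "NOALIAS"],
    "alias_symbol": ["A1B|ABG|A1B", "A2MP", None],
})

# Drop genes without aliases; .copy() keeps later assignments warning-free.
toy_df = toy_df.dropna(subset=["alias_symbol"]).copy()

# Split on '|' and convert to a set, which removes repeats within a record.
toy_df["alias_symbol"] = toy_df["alias_symbol"].str.split("|").map(set)

# One row per (gene, alias) pair; the original index label is repeated.
# Sets are unordered, so alias order within a gene is not guaranteed.
exploded_df = toy_df.explode("alias_symbol")
print(exploded_df)
```

Because the index label is repeated after `explode`, a `reset_index(drop=True)` may be worth adding if unique row labels matter downstream.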


In [89]:
#ncbi_CD158b_alias_count_df.to_csv('../ncbi_CD158b_alias_count_df.csv')

### <a id='toc3_1_6_'></a>[How many unique aliases are there](#toc0_)

In [90]:
ncbi_alias_symbol_set = set(mini_ncbi_df['alias_symbol'])
ncbi_alias_len = len(ncbi_alias_symbol_set)
ncbi_alias_len

67454

### <a id='toc3_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [91]:
mini_ncbi_df['alias_duplicates'] = mini_ncbi_df.duplicated(subset= 'alias_symbol', keep=False)
mini_ncbi_df = mini_ncbi_df[mini_ncbi_df['alias_duplicates'] == True]
mini_ncbi_df = mini_ncbi_df.drop(['alias_duplicates'], axis=1)
mini_ncbi_df.head(5)

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,A1B,ENSG00000121410
3,NAT1,AAC1,ENSG00000171428
3,NAT1,NAT-1,ENSG00000171428
4,NAT2,AAC2,ENSG00000156006
6,SERPINA3,ACT,ENSG00000196136
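
A small sketch of why `keep=False` matters here, assuming a toy frame (`toy_df` and `shared_df` are hypothetical names). With the default `keep='first'`, the first gene carrying a shared alias would be left out of the result.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "gene_symbol": ["PTEN", "BMPR1A", "A1BG", "ALOX12"],
    "alias_symbol": ["10q23del", "10q23del", "ABG", "12-LOX"],
})

# keep=False flags every member of a duplicate group, not only the repeats.
shared_mask = toy_df.duplicated(subset="alias_symbol", keep=False)
shared_df = toy_df[shared_mask]

# For comparison: the default flags only the second and later occurrences.
repeats_only_mask = toy_df.duplicated(subset="alias_symbol")
print(shared_df)
```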


### <a id='toc3_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [92]:
mini_ncbi_df = mini_ncbi_df.sort_values('alias_symbol')
mini_ncbi_df.head(5)

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
4534,PTEN,10q23del,ENSG00000171862
537,BMPR1A,10q23del,ENSG00000107779
199,ALOX12,12-LOX,ENSG00000108839
205,ALOX15,12-LOX,ENSG00000161905
245,SLC25A5,2F1,ENSG00000005022


### <a id='toc3_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [93]:
ncbi_gene_symbol_set_wsharedalias = set(mini_ncbi_df['gene_symbol'])
len(ncbi_gene_symbol_set_wsharedalias)

5670

In [94]:
mini_ncbi_df['source'] = 'NCBI Info'
mini_ncbi_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID,source
4534,PTEN,10q23del,ENSG00000171862,NCBI Info
537,BMPR1A,10q23del,ENSG00000107779,NCBI Info
199,ALOX12,12-LOX,ENSG00000108839,NCBI Info
205,ALOX15,12-LOX,ENSG00000161905,NCBI Info
245,SLC25A5,2F1,ENSG00000005022,NCBI Info
...,...,...,...,...
18205,PPP4R3C,smk1,ENSG00000224960,NCBI Info
12929,PPP4R3A,smk1,ENSG00000100796,NCBI Info
13546,PPP4R3B,smk1,ENSG00000275052,NCBI Info
17549,SPATA2L,tamo,ENSG00000158792,NCBI Info


### <a id='toc3_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [95]:
ncbi_dup_alias_count_df = mini_ncbi_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
ncbi_dup_alias_count_df

alias_symbol
10q23del       2
12-LOX         2
2F1            2
3-alpha-HSD    2
35DAG          2
              ..
pTEX           2
polymerase     3
rpL7a          2
smk1           3
tamo           2
Length: 3427, dtype: int64

In [96]:
ncbi_dup_alias_count_df.to_csv('../ncbi_alias_overlap_count.csv', index=True)

In [97]:
ncbi_dup_alias_count_df = ncbi_dup_alias_count_df.reset_index()
ncbi_dup_alias_count_df

Unnamed: 0,alias_symbol,0
0,10q23del,2
1,12-LOX,2
2,2F1,2
3,3-alpha-HSD,2
4,35DAG,2
...,...,...
3422,pTEX,2
3423,polymerase,3
3424,rpL7a,2
3425,smk1,3


In [98]:
ncbi_dup_alias_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
ncbi_dup_alias_count_df

Unnamed: 0,alias_symbol,num_gene_records
0,10q23del,2
1,12-LOX,2
2,2F1,2
3,3-alpha-HSD,2
4,35DAG,2
...,...,...
3422,pTEX,2
3423,polymerase,3
3424,rpL7a,2
3425,smk1,3
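
The count, reset, and rename cells above could be collapsed into one chain. A minimal sketch on toy data, assuming the same column names; `toy_df` and `alias_count_df` are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "gene_symbol": ["PPP4R3A", "PPP4R3B", "PPP4R3C", "PTEN", "BMPR1A"],
    "alias_symbol": ["smk1", "smk1", "smk1", "10q23del", "10q23del"],
})

# groupby(...).size() counts rows per alias; reset_index(name=...) names
# the count column directly, so no separate rename step is needed.
alias_count_df = (
    toy_df.groupby("alias_symbol")
    .size()
    .reset_index(name="num_gene_records")
    .sort_values("num_gene_records", ascending=False)
)
print(alias_count_df)
```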


In [99]:
ncbi_dup_alias_count_df = ncbi_dup_alias_count_df.sort_values('num_gene_records', ascending=False)
ncbi_dup_alias_count_df.head(30)

Unnamed: 0,alias_symbol,num_gene_records
3260,VH,36
1284,H4-16,14
1297,H4C9,13
1286,H4C11,13
1287,H4C12,13
1288,H4C13,13
1296,H4C8,13
1289,H4C14,13
1290,H4C15,13
1291,H4C2,13


In [100]:
ncbi_alias_alias_collision_set = set(ncbi_dup_alias_count_df['alias_symbol'])
len(ncbi_alias_alias_collision_set)

3427

In [101]:
ncbi_dup_alias_count_df.to_csv('../ncbi_dup_alias_count_df.csv', index=True)

In [102]:
ncbi_alias_count_histogram_df = ncbi_dup_alias_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
ncbi_alias_count_histogram_df

num_gene_records
2     2748
3      404
4      142
5       51
6       24
7       17
8        7
9       15
10       2
11       1
12       1
13      13
14       1
36       1
dtype: int64

In [103]:
ncbi_alias_count_histogram_df = ncbi_alias_count_histogram_df.reset_index()
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,0
0,2,2748
1,3,404
2,4,142
3,5,51
4,6,24
5,7,17
6,8,7
7,9,15
8,10,2
9,11,1


In [104]:
ncbi_alias_count_histogram_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol
0,2,2748
1,3,404
2,4,142
3,5,51
4,6,24
5,7,17
6,8,7
7,9,15
8,10,2
9,11,1


In [105]:
ncbi_alias_count_histogram_df['percent_alias_symbol'] = ((ncbi_alias_count_histogram_df['num_alias_symbol'] / ncbi_alias_len) * 100)
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,2748,4.073887
1,3,404,0.598927
2,4,142,0.210514
3,5,51,0.075607
4,6,24,0.03558
5,7,17,0.025202
6,8,7,0.010377
7,9,15,0.022237
8,10,2,0.002965
9,11,1,0.001482


In [106]:
ncbi_alias_count_histogram_df = ncbi_alias_count_histogram_df.drop('num_alias_symbol', axis=1)
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,4.073887
1,3,0.598927
2,4,0.210514
3,5,0.075607
4,6,0.03558
5,7,0.025202
6,8,0.010377
7,9,0.022237
8,10,0.002965
9,11,0.001482
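
A compact sketch of the histogram and percentage steps, under the assumption that the denominator is the total number of unique aliases (as with `ncbi_alias_len` above). The frame and the value of `total_unique_aliases` are made up for illustration.

```python
import pandas as pd

alias_count_df = pd.DataFrame({
    "alias_symbol": ["a", "b", "c", "d"],
    "num_gene_records": [2, 2, 3, 2],
})
total_unique_aliases = 8  # hypothetical denominator

# Count how many aliases are shared by exactly n gene records.
histogram_df = (
    alias_count_df["num_gene_records"]
    .value_counts()
    .sort_index()
    .rename_axis("num_gene_records")
    .reset_index(name="num_alias_symbol")
)

# Express each bin as a percentage of all unique aliases.
histogram_df["percent_alias_symbol"] = (
    histogram_df["num_alias_symbol"] / total_unique_aliases * 100
)
print(histogram_df)
```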


In [107]:
# px.bar(ncbi_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

### <a id='toc3_1_11_'></a>[Put columns in different order to emphasize alias symbols instead of gene records](#toc0_)

In [108]:
mini_ncbi_df_2 = mini_ncbi_df.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [109]:
# Reorder columns on the de-duplicated frame from the previous cell
mini_ncbi_df_2 = mini_ncbi_df_2[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
4534,10q23del,ENSG00000171862,PTEN,NCBI Info
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
245,2F1,ENSG00000005022,SLC25A5,NCBI Info
...,...,...,...,...
18205,smk1,ENSG00000224960,PPP4R3C,NCBI Info
12929,smk1,ENSG00000100796,PPP4R3A,NCBI Info
13546,smk1,ENSG00000275052,PPP4R3B,NCBI Info
17549,tamo,ENSG00000158792,SPATA2L,NCBI Info


### <a id='toc3_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [110]:
# Cast every column to string so the values can be joined in the groupby below
mini_ncbi_df_2 = mini_ncbi_df_2.astype(str)
mini_ncbi_df_2


Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
4534,10q23del,ENSG00000171862,PTEN,NCBI Info
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
245,2F1,ENSG00000005022,SLC25A5,NCBI Info
...,...,...,...,...
18205,smk1,ENSG00000224960,PPP4R3C,NCBI Info
12929,smk1,ENSG00000100796,PPP4R3A,NCBI Info
13546,smk1,ENSG00000275052,PPP4R3B,NCBI Info
17549,tamo,ENSG00000158792,SPATA2L,NCBI Info


In [111]:
mini_ncbi_df_2['ENSG_ID'] = mini_ncbi_df_2['ENSG_ID'].str.replace('NAN','nan')
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
4534,10q23del,ENSG00000171862,PTEN,NCBI Info
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
245,2F1,ENSG00000005022,SLC25A5,NCBI Info
...,...,...,...,...
18205,smk1,ENSG00000224960,PPP4R3C,NCBI Info
12929,smk1,ENSG00000100796,PPP4R3A,NCBI Info
13546,smk1,ENSG00000275052,PPP4R3B,NCBI Info
17549,tamo,ENSG00000158792,SPATA2L,NCBI Info


In [112]:
mini_ncbi_df_2 = mini_ncbi_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,10q23del,"ENSG00000171862, ENSG00000107779","PTEN, BMPR1A",NCBI Info
1,12-LOX,"ENSG00000108839, ENSG00000161905","ALOX12, ALOX15",NCBI Info
2,2F1,"ENSG00000005022, ENSG00000139187","SLC25A5, KLRG1",NCBI Info
3,3-alpha-HSD,"ENSG00000198610, ENSG00000073737","AKR1C4, DHRS9",NCBI Info
4,35DAG,"ENSG00000170624, ENSG00000102683","SGCD, SGCG",NCBI Info
...,...,...,...,...
3422,pTEX,"ENSG00000274962, ENSG00000277008","TEX28P1, TEX28P2",NCBI Info
3423,polymerase,"nan, nan, nan","ERVK-9, ERVK-11, ERVK-19",NCBI Info
3424,rpL7a,"ENSG00000240522, ENSG00000213272","RPL7AP10, RPL7AP9",NCBI Info
3425,smk1,"ENSG00000224960, ENSG00000100796, ENSG00000275052","PPP4R3C, PPP4R3A, PPP4R3B",NCBI Info
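
A minimal sketch of the merge step on four rows taken from the table above. `toy_df` and `merged_df` are hypothetical names; the aggregation mirrors the `groupby(...).agg(...)` call in the cell.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "alias_symbol": ["10q23del", "10q23del", "12-LOX", "12-LOX"],
    "ENSG_ID": ["ENSG00000171862", "ENSG00000107779",
                "ENSG00000108839", "ENSG00000161905"],
    "gene_symbol": ["PTEN", "BMPR1A", "ALOX12", "ALOX15"],
    "source": ["NCBI Info"] * 4,
})

# ', '.join concatenates the string values within each alias group;
# it requires every value to already be a string, hence the earlier cast.
merged_df = toy_df.groupby("alias_symbol", as_index=False).agg(
    {"ENSG_ID": ", ".join, "gene_symbol": ", ".join, "source": "first"}
)
print(merged_df)
```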


# <a id='toc4_'></a>[Merge to create Alias Overlap Table 1 - Gene Symbol](#toc0_)

In [113]:
merged_alias_overlap_df_1 = pd.concat([mini_hgnc_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']],mini_ncbi_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']], mini_ensg_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']]])
merged_alias_overlap_df_1

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
35226,SLC25A5,ENSG00000005022,2F1,HGNC
13978,KLRG1,ENSG00000139187,2F1,HGNC
33987,S100A8,ENSG00000143546,60B8AG,HGNC
33988,S100A9,ENSG00000163220,60B8AG,HGNC
31335,RNU6V,ENSG00000206832,87U6,HGNC
...,...,...,...,...
8105,PSMC1,ENSG00000100764,p56,ENSG
7724,LDAF1,ENSG00000283997,promethin,ENSG
37151,LDAF1,ENSG00000011638,promethin,ENSG
27801,TPTEP1,ENSG00000290418,psiTPTE22,ENSG


In [114]:
merged_alias_overlap_df_1.to_csv('../merged_alias_overlap_df_1.csv', index=False)

In [199]:
merged_alias_overlap_df_1.loc[merged_alias_overlap_df_1.gene_symbol == 'NAP1']

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source


In [200]:
merged_alias_overlap_df_1.loc[merged_alias_overlap_df_1['alias_symbol'] == "NAP1" ]

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
5768,CXCL8,ENSG00000169429,NAP1,HGNC
21850,NAPSA,ENSG00000131400,NAP1,HGNC
221,ACOT8,ENSG00000101473,NAP1,HGNC
2070,AZI2,ENSG00000163512,NAP1,HGNC
21830,NAP1L1,ENSG00000187109,NAP1,HGNC
20405,TAB3,ENSG00000157625,NAP1,NCBI Info
7804,ACOT8,ENSG00000101473,NAP1,NCBI Info
11818,TRMO,ENSG00000136932,NAP1,NCBI Info
3677,NAP1L1,ENSG00000187109,NAP1,NCBI Info
14180,AZI2,ENSG00000163512,NAP1,NCBI Info


In [117]:
merged_alias_overlap_df_1['source'].value_counts()

source
ENSG         20879
NCBI Info     8247
HGNC          2348
Name: count, dtype: int64

# <a id='toc5_'></a>[Merge to create Alias Overlap Table 2 - Alias Symbol](#toc0_)

In [118]:
merged_alias_overlap_df_2 = pd.concat([mini_hgnc_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']],mini_ncbi_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']], mini_ensg_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']]])
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
0,2F1,"SLC25A5, KLRG1","ENSG00000005022, ENSG00000139187",HGNC
1,60B8AG,"S100A8, S100A9","ENSG00000143546, ENSG00000163220",HGNC
2,87U6,"RNU6V, GNAI3","ENSG00000206832, ENSG00000065135",HGNC
3,9G8,"SLU7, SRSF7","ENSG00000164609, ENSG00000115875",HGNC
4,A1,"RFC1, RFC4, RFC2","ENSG00000035928, ENSG00000163918, ENSG00000049541",HGNC
...,...,...,...,...
5302,p42,"PSMC6, H3P30","ENSG00000100519, ENSG00000227443",ENSG
5303,p55,"PIK3R3, PSMD12","ENSG00000117461, ENSG00000197170",ENSG
5304,p56,"H3P45, PSMC1","ENSG00000227211, ENSG00000100764",ENSG
5305,promethin,LDAF1,ENSG00000283997,ENSG


In [119]:
merged_alias_overlap_df_2.loc[merged_alias_overlap_df_2['alias_symbol'] == "ALP" ]

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
31,ALP,"PDLIM3, ATRNL1, CCL27, ASRGL1, SLPI","ENSG00000154553, ENSG00000107518, ENSG00000213...",HGNC
115,ALP,"CCL27, ATHS, PDLIM3, ATRNL1, ALPP, NAT10, SLPI...","ENSG00000213927, nan, ENSG00000154553, ENSG000...",NCBI Info
3245,ALP,"CCL27, ASRGL1, PDLIM3, ATRNL1","ENSG00000213927, ENSG00000162174, ENSG00000154...",ENSG


In [120]:
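# Note: len(c) is the character length of the comma-joined string, so this column is a rough proxy for the number of gene symbols, not an exact count
merged_alias_overlap_df_2['gene_symbol_count'] = [len(c) for c in merged_alias_overlap_df_2['gene_symbol']]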
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
0,2F1,"SLC25A5, KLRG1","ENSG00000005022, ENSG00000139187",HGNC,14
1,60B8AG,"S100A8, S100A9","ENSG00000143546, ENSG00000163220",HGNC,14
2,87U6,"RNU6V, GNAI3","ENSG00000206832, ENSG00000065135",HGNC,12
3,9G8,"SLU7, SRSF7","ENSG00000164609, ENSG00000115875",HGNC,11
4,A1,"RFC1, RFC4, RFC2","ENSG00000035928, ENSG00000163918, ENSG00000049541",HGNC,16
...,...,...,...,...,...
5302,p42,"PSMC6, H3P30","ENSG00000100519, ENSG00000227443",ENSG,12
5303,p55,"PIK3R3, PSMD12","ENSG00000117461, ENSG00000197170",ENSG,14
5304,p56,"H3P45, PSMC1","ENSG00000227211, ENSG00000100764",ENSG,12
5305,promethin,LDAF1,ENSG00000283997,ENSG,5
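
The values in `gene_symbol_count` above (14, 16, 5, ...) match the character length of the joined string. If the number of gene symbols per alias is wanted instead, one option is to split on the `', '` separator first. A sketch on three rows from the table; `toy_df` and the `gene_symbol_chars` column are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "alias_symbol": ["A1", "2F1", "promethin"],
    "gene_symbol": ["RFC1, RFC4, RFC2", "SLC25A5, KLRG1", "LDAF1"],
})

# Character length of the joined string (what len(c) measures).
toy_df["gene_symbol_chars"] = toy_df["gene_symbol"].str.len()

# Number of symbols: split on the ', ' separator, then count the pieces.
toy_df["gene_symbol_count"] = toy_df["gene_symbol"].str.split(", ").str.len()
print(toy_df)
```

The two measures usually rank rows similarly, but long symbols (for example `LOC...` identifiers) inflate the character length.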


In [121]:
merged_alias_overlap_df_2 = merged_alias_overlap_df_2.sort_values(by='gene_symbol_count', ascending= False)
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
3260,VH,"IGHV4-59, IGHV3-64, IGHV2-70, IGHV3-38, IGHV1-...","ENSG00000224373, ENSG00000223648, ENSG00000274...",NCBI Info,347
1284,H4-16,"H4C6, H4C13, H4C11, H4C14, H4C1, H4C5, H4C12, ...","ENSG00000274618, ENSG00000275126, ENSG00000197...",NCBI Info,88
1192,GPCR,"GPR166P, VN1R17P, MRGPRX1, GPR151, MRGPRX4, OX...","ENSG00000220349, nan, ENSG00000170255, ENSG000...",NCBI Info,85
868,DUX10,"DUX4L1, LOC107987486, LOC107987485, LOC1249064...","ENSG00000280757, nan, nan, nan, nan, nan, ENSG...",NCBI Info,85
869,DUX4,"DUX4L2, DUX4L1, LOC107987486, LOC107987485, LO...","ENSG00000280457, ENSG00000280757, nan, nan, na...",NCBI Info,84
...,...,...,...,...,...
3932,DOM3Z,DXO,ENSG00000206346,ENSG,3
1682,Nop1,FBL,ENSG00000105202,ENSG,3
4078,FLJ12614,NXN,ENSG00000281300,ENSG,3
4742,OSC,LSS,ENSG00000281289,ENSG,3


In [122]:
merged_alias_overlap_df_2['source'].value_counts()

source
ENSG         5307
NCBI Info    3427
HGNC         1040
Name: count, dtype: int64

In [123]:
merged_alias_overlap_df_2.to_csv('../merged_alias_overlap_df_2.csv', index=True)

# DGIdb ambiguous query

In [124]:
dgidb_gene_df = pd.read_csv("../dgidb_genes_df.tsv", sep='\t')
dgidb_gene_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version
0,MN1,Gene Symbol,hgnc:7180,MN1,CarisMolecularIntelligence,5-Jun-23
1,P2RX7,Gene Symbol,hgnc:8537,P2RX7,ChEMBL,32
2,CHRNA7,Gene Symbol,hgnc:1960,CHRNA7,ChEMBL,32
3,MAPK8,Gene Symbol,hgnc:6881,MAPK8,ChEMBL,32
4,MAPK10,Gene Symbol,hgnc:6872,MAPK10,ChEMBL,32
...,...,...,...,...,...,...
73884,MAPKAPK5,Gene Name,hgnc:6889,MAPKAPK5,DTC,5-Jun-23
73885,PTGER3,Gene Name,hgnc:9595,PTGER3,DTC,5-Jun-23
73886,CD1D,Gene Name,hgnc:1637,CD1D,DTC,5-Jun-23
73887,SLC26A1,Gene Symbol,hgnc:10993,SLC26A1,IDG,15-Jul-19


In [125]:
symbol_col_comparison = dgidb_gene_df['name'] == dgidb_gene_df['gene_claim_name']
symbol_col_comparison.value_counts()

True     62221
False    11668
Name: count, dtype: int64
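
The `False` count above mixes two cases: rows where the two symbols differ, and rows where `gene_claim_name` is missing (a comparison with a missing value is always `False`). A sketch of separating them, using a hypothetical three-row `toy_df`:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "name": ["MN1", "HSP90", "RPLP"],
    "gene_claim_name": ["MN1", "HSP90AA1", None],
})

# Element-wise equality; any comparison involving a missing value is False.
same = toy_df["name"] == toy_df["gene_claim_name"]

# Separate "claim name missing" from "both present but different".
missing_claim = toy_df["gene_claim_name"].isna()
different = ~same & ~missing_claim
print(same.sum(), missing_claim.sum(), different.sum())
```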

In [126]:
symbol_col_comparison

0        True
1        True
2        True
3        True
4        True
         ... 
73884    True
73885    True
73886    True
73887    True
73888    True
Length: 73889, dtype: bool

In [127]:
dgidb_gene_df.dtypes

name                 object
nomenclature         object
concept_id           object
gene_claim_name      object
source_db_name       object
source_db_version    object
dtype: object

In [128]:
dgidb_gene_df.query('name != gene_claim_name')

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version
25,HSP90,Gene Symbol,hgnc:5253,HSP90AA1,ChEMBL,32
100,RPLP,Gene Symbol,,,ChEMBL,32
167,SEPT6,Gene Symbol,hgnc:15848,SEPTIN6,CarisMolecularIntelligence,5-Jun-23
169,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,CarisMolecularIntelligence,5-Jun-23
213,WISP3,Gene Symbol,hgnc:12771,CCN6,CarisMolecularIntelligence,5-Jun-23
...,...,...,...,...,...,...
73631,ENSEMBL:ENSG00000185821,Ensembl Gene ID,hgnc:31305,OR6C76,RussLampel,26-Jul-11
73642,C11ORF30,Gene Symbol,hgnc:18071,EMSY,FoundationOneGenes,5-Jun-23
73727,ENSEMBL:ENSG00000183921,Ensembl Gene ID,hgnc:35414,SDR42E2,RussLampel,26-Jul-11
73872,MR,NCBI Gene Name,hgnc:7979,NR3C2,BaderLab,14-Feb


In [129]:
no_claim_symbols_df = dgidb_gene_df[dgidb_gene_df['gene_claim_name'].isnull()]
no_claim_symbols_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version
100,RPLP,Gene Symbol,,,ChEMBL,32
353,EMBA,Gene Symbol,,,ChEMBL,32
355,RPLQ,Gene Symbol,,,ChEMBL,32
402,ILES,Gene Symbol,,,ChEMBL,32
439,MURA,Gene Symbol,,,ChEMBL,32
...,...,...,...,...,...,...
26386,CCNK-CDK13_HUMAN,Gene Symbol,,,GO,5-Jun-23
26387,INO80_HUMAN-1,Gene Symbol,,,GO,5-Jun-23
26388,CCNC-CDK3_HUMAN,Gene Symbol,,,GO,5-Jun-23
26573,TRYPTASE_B2_HUMAN,Gene Symbol,,,GO,5-Jun-23


In [130]:
dgidb_name_set = set(dgidb_gene_df['name'])
len(dgidb_name_set)

21619

In [131]:
dgidb_gene_claim_name_set = set(dgidb_gene_df['gene_claim_name'])
len(dgidb_gene_claim_name_set)

11287

In [132]:
name_ensg_notmatch = dgidb_name_set.difference(ensg_gene_symbol_set)
len(name_ensg_notmatch)

10422

In [133]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(ensg_gene_symbol_set)
len(gene_claim_name_ensg_notmatch)

107

In [134]:
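# Intended to drop NaN entries; str(float('nan')) is 'nan' (lowercase), so the 'NaN' comparison leaves them in and the count stays at 107
cleaned_gene_claim_name_ensg_notmatch = [x for x in gene_claim_name_ensg_notmatch if str(x) != 'NaN']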
len(cleaned_gene_claim_name_ensg_notmatch)

107
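
A minimal sketch of filtering missing values out of a set, assuming the goal of the cell above is to drop NaN entries. The set contents are illustrative; `pd.isna` is used instead of a string comparison.

```python
import pandas as pd

values = {"HSP90", "RPLP", float("nan")}

# str(float('nan')) is 'nan', so comparing against 'NaN' filters nothing.
kept_by_string_check = [x for x in values if str(x) != "NaN"]

# pd.isna handles float NaN and None, and returns False for ordinary strings.
cleaned = [x for x in values if not pd.isna(x)]
print(len(kept_by_string_check), len(cleaned))
```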

In [135]:
name_hgnc_notmatch = dgidb_name_set.difference(hgnc_gene_symbol_set)
len(name_hgnc_notmatch)

10374

In [136]:
name_hngc_notmatch_aacollision = name_hgnc_notmatch.intersection(hgnc_alias_alias_collision_set)
len(name_hngc_notmatch_aacollision)

29

In [137]:
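# Intended to drop NaN entries; str(float('nan')) is 'nan' (lowercase), so the 'NaN' comparison removes nothing
cleaned_name_hgnc_notmatch = [x for x in name_hgnc_notmatch if str(x) != 'NaN']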
len(cleaned_name_hgnc_notmatch)

10374

In [138]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_notmatch)

66

In [139]:
name_ncbi_notmatch = dgidb_name_set.difference(ncbi_gene_symbol_set)
len(name_ncbi_notmatch)

10351

In [140]:
name_ncbi_notmatch_aacollision = name_ncbi_notmatch.intersection(ncbi_alias_alias_collision_set)
len(name_ncbi_notmatch_aacollision)

96

In [141]:
name_ncbi_hgnc_notmatch = name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(name_ncbi_hgnc_notmatch)

10335

In [142]:
name_ncbi_hgnc_ensg_notmatch = name_ncbi_hgnc_notmatch.difference(ensg_gene_symbol_set)
len(name_ncbi_hgnc_ensg_notmatch)

10331

In [143]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(ncbi_gene_symbol_set)
len(gene_claim_name_ncbi_notmatch)

63

In [144]:
gene_claim_name_ncbi_hngc_notmatch = gene_claim_name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(gene_claim_name_ncbi_hngc_notmatch)

47

In [145]:
gene_claim_name_ncbi_hngc_ensg_notmatch = gene_claim_name_ncbi_hngc_notmatch.difference(ensg_gene_symbol_set)
len(gene_claim_name_ncbi_hngc_ensg_notmatch)

44

In [146]:
name_hgnc_match = dgidb_name_set.intersection(hgnc_gene_symbol_set)
len(name_hgnc_match)

11245

In [147]:
name_hgnc_match_aacollision = name_hgnc_match.intersection(hgnc_alias_alias_collision_set)
len(name_hgnc_match_aacollision)

29

In [148]:
name_ensg_match = dgidb_name_set.intersection(ensg_gene_symbol_set)
len(name_ensg_match)

11197

In [149]:
name_ncbi_match = dgidb_name_set.intersection(ncbi_gene_symbol_set)
len(name_ncbi_match)

11268

In [150]:
name_ncbi_match_aacollision = name_ncbi_match.intersection(ncbi_alias_alias_collision_set)
len(name_ncbi_match_aacollision)

112

In [151]:
name_ncbi_ensg_match = name_ncbi_match.intersection(ensg_gene_symbol_set)
len(name_ncbi_ensg_match)

11178

In [152]:
name_ncbi_ensg_hgnc_match = name_ncbi_ensg_match.intersection(hgnc_gene_symbol_set)
len(name_ncbi_ensg_hgnc_match)

11176
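
The match and not-match cells in this section all follow the same set-algebra pattern. A sketch with small hypothetical sets (the membership shown here is illustrative, not taken from the real tables):

```python
# Hypothetical stand-ins for dgidb_name_set, a primary gene symbol set,
# and an alias-alias collision set.
dgidb_names = {"MN1", "HSP90", "SEPT6", "MR"}
primary_symbols = {"MN1", "NR3C2", "SEPTIN6"}
collision_aliases = {"MR", "ALP"}

matched = dgidb_names & primary_symbols      # same as .intersection()
unmatched = dgidb_names - primary_symbols    # same as .difference()

# Unmatched names that are also ambiguous aliases.
unmatched_collisions = unmatched & collision_aliases

# Every name is either matched or unmatched, never both.
assert len(matched) + len(unmatched) == len(dgidb_names)
print(matched, unmatched, unmatched_collisions)
```

The final `assert` is a cheap consistency check; the same identity holds for the real counts above (for example 11245 + 10374 = 21619 for HGNC).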

In [153]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_match)

11221

In [154]:
gene_claim_name_ensg_match = dgidb_gene_claim_name_set.intersection(ensg_gene_symbol_set)
len(gene_claim_name_ensg_match)

11180

In [155]:
gene_claim_name_ensg_aacollision_match = dgidb_gene_claim_name_set.intersection(ensg_alias_alias_collision_set)
len(gene_claim_name_ensg_aacollision_match)

14

In [156]:
gene_claim_name_hgnc_aacollision_match = dgidb_gene_claim_name_set.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_aacollision_match)

30

In [157]:
gene_claim_name_ncbi_aacollision_match = dgidb_gene_claim_name_set.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_aacollision_match)

113

In [158]:
name_ensg_aacollision_match = dgidb_name_set.intersection(ensg_alias_alias_collision_set)
len(name_ensg_aacollision_match)

43

In [159]:
name_hgnc_aacollision_match = dgidb_name_set.intersection(hgnc_alias_alias_collision_set)
len(name_hgnc_aacollision_match)

58

In [160]:
name_ncbi_aacollision_match = dgidb_name_set.intersection(ncbi_alias_alias_collision_set)
len(name_ncbi_aacollision_match)

208

In [161]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_notmatch)

66

In [162]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_notmatch_aacollision)

1

In [163]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(ensg_gene_symbol_set)
len(gene_claim_name_ensg_notmatch)

107

In [164]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(ncbi_gene_symbol_set)
len(gene_claim_name_ncbi_notmatch)

63

In [165]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_notmatch_aacollision)

1

In [166]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_match)

11221

In [167]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_match_aacollision)

29

In [168]:
gene_claim_name_ncbi_match = dgidb_gene_claim_name_set.intersection(ncbi_gene_symbol_set)
len(gene_claim_name_ncbi_match)

11224

In [169]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_match_aacollision)

112

In [170]:
name_ensg_match_aacollision = name_ensg_match.intersection(ensg_alias_alias_collision_set)
len(name_ensg_match_aacollision)

13

In [171]:
len(gene_claim_name_ncbi_hngc_ensg_notmatch)

44

In [172]:
len(name_ncbi_hgnc_ensg_notmatch)


10331

In [173]:
len(dgidb_name_set)

21619

In [174]:
len(dgidb_gene_claim_name_set)

11287

In [175]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(ensg_alias_alias_collision_set)
len(name_ensg_notmatch_aacollision)

30

In [176]:
gene_claim_name_ensg_match_aacollision = gene_claim_name_ensg_match.intersection(ensg_alias_alias_collision_set)
len(gene_claim_name_ensg_match_aacollision)

13

In [177]:
gene_claim_name_ensg_notmatch_aacollision = gene_claim_name_ensg_notmatch.intersection(ensg_alias_alias_collision_set)
len(gene_claim_name_ensg_notmatch_aacollision)

1

In [178]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_notmatch_aacollision)

1

In [179]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_match_aacollision)

112

In [180]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_match_aacollision)

29

In [181]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_notmatch_aacollision)

1

Pull out rows where the claim symbol (`gene_claim_name`) matches a primary HGNC gene symbol but the corresponding group symbol (`name`) does not. Check these rows for recurring patterns of error.


In [182]:
dgidb_gene_df['hgnc_claim_match_status'] = dgidb_gene_df['gene_claim_name'].isin(hgnc_gene_symbol_set)
dgidb_gene_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version,hgnc_claim_match_status
0,MN1,Gene Symbol,hgnc:7180,MN1,CarisMolecularIntelligence,5-Jun-23,True
1,P2RX7,Gene Symbol,hgnc:8537,P2RX7,ChEMBL,32,True
2,CHRNA7,Gene Symbol,hgnc:1960,CHRNA7,ChEMBL,32,True
3,MAPK8,Gene Symbol,hgnc:6881,MAPK8,ChEMBL,32,True
4,MAPK10,Gene Symbol,hgnc:6872,MAPK10,ChEMBL,32,True
...,...,...,...,...,...,...,...
73884,MAPKAPK5,Gene Name,hgnc:6889,MAPKAPK5,DTC,5-Jun-23,True
73885,PTGER3,Gene Name,hgnc:9595,PTGER3,DTC,5-Jun-23,True
73886,CD1D,Gene Name,hgnc:1637,CD1D,DTC,5-Jun-23,True
73887,SLC26A1,Gene Symbol,hgnc:10993,SLC26A1,IDG,15-Jul-19,True


In [183]:
dgidb_gene_df['hgnc_name_match_status'] = dgidb_gene_df['name'].isin(hgnc_gene_symbol_set)
dgidb_gene_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version,hgnc_claim_match_status,hgnc_name_match_status
0,MN1,Gene Symbol,hgnc:7180,MN1,CarisMolecularIntelligence,5-Jun-23,True,True
1,P2RX7,Gene Symbol,hgnc:8537,P2RX7,ChEMBL,32,True,True
2,CHRNA7,Gene Symbol,hgnc:1960,CHRNA7,ChEMBL,32,True,True
3,MAPK8,Gene Symbol,hgnc:6881,MAPK8,ChEMBL,32,True,True
4,MAPK10,Gene Symbol,hgnc:6872,MAPK10,ChEMBL,32,True,True
...,...,...,...,...,...,...,...,...
73884,MAPKAPK5,Gene Name,hgnc:6889,MAPKAPK5,DTC,5-Jun-23,True,True
73885,PTGER3,Gene Name,hgnc:9595,PTGER3,DTC,5-Jun-23,True,True
73886,CD1D,Gene Name,hgnc:1637,CD1D,DTC,5-Jun-23,True,True
73887,SLC26A1,Gene Symbol,hgnc:10993,SLC26A1,IDG,15-Jul-19,True,True


In [184]:
claim_true_name_false_df = dgidb_gene_df.loc[dgidb_gene_df['hgnc_claim_match_status'] & ~dgidb_gene_df['hgnc_name_match_status']]
claim_true_name_false_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version,hgnc_claim_match_status,hgnc_name_match_status
25,HSP90,Gene Symbol,hgnc:5253,HSP90AA1,ChEMBL,32,True,False
167,SEPT6,Gene Symbol,hgnc:15848,SEPTIN6,CarisMolecularIntelligence,5-Jun-23,True,False
169,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,CarisMolecularIntelligence,5-Jun-23,True,False
213,WISP3,Gene Symbol,hgnc:12771,CCN6,CarisMolecularIntelligence,5-Jun-23,True,False
294,MLL2,Gene Name,hgnc:7133,KMT2D,CGI,5-Jun-23,True,False
...,...,...,...,...,...,...,...,...
73631,ENSEMBL:ENSG00000185821,Ensembl Gene ID,hgnc:31305,OR6C76,RussLampel,26-Jul-11,True,False
73642,C11ORF30,Gene Symbol,hgnc:18071,EMSY,FoundationOneGenes,5-Jun-23,True,False
73727,ENSEMBL:ENSG00000183921,Ensembl Gene ID,hgnc:35414,SDR42E2,RussLampel,26-Jul-11,True,False
73872,MR,NCBI Gene Name,hgnc:7979,NR3C2,BaderLab,14-Feb,True,False
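
One possible way to look for error patterns in the mismatched rows is to tally them by nomenclature and source. This is a sketch only: the five rows are copied from the tables above, `hgnc_symbols` is an illustrative subset of the real HGNC set, and `pattern_counts` is a hypothetical name.

```python
import pandas as pd

hgnc_symbols = {"MN1", "HSP90AA1", "SEPTIN6", "SEPTIN5", "OR6C76"}

toy_df = pd.DataFrame({
    "name": ["MN1", "HSP90", "SEPT6", "SEPT5", "ENSEMBL:ENSG00000185821"],
    "nomenclature": ["Gene Symbol", "Gene Symbol", "Gene Symbol",
                     "Gene Symbol", "Ensembl Gene ID"],
    "gene_claim_name": ["MN1", "HSP90AA1", "SEPTIN6", "SEPTIN5", "OR6C76"],
    "source_db_name": ["CarisMolecularIntelligence", "ChEMBL",
                       "CarisMolecularIntelligence",
                       "CarisMolecularIntelligence", "RussLampel"],
})

claim_ok = toy_df["gene_claim_name"].isin(hgnc_symbols)
name_ok = toy_df["name"].isin(hgnc_symbols)

# Rows where the claim name is a primary symbol but the group name is not.
mismatch_df = toy_df[claim_ok & ~name_ok]

# Tally mismatches by nomenclature and source to surface recurring patterns.
pattern_counts = (
    mismatch_df.groupby(["nomenclature", "source_db_name"])
    .size()
    .sort_values(ascending=False)
)
print(pattern_counts)
```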
