**Table of contents**<a id='toc0_'></a>    
- [ENSG](#toc1_)    
    - [Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc1_1_1_)    
    - [How many total unique gene records are there](#toc1_1_2_)    
    - [Drop rows with NAN in alias_symbol](#toc1_1_3_)    
    - [Make each row in alias_symbol a set:](#toc1_1_4_)    
    - [Explode the alias sets so that it is one per row](#toc1_1_5_)    
    - [How many total unique aliases are there](#toc1_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc1_1_7_)    
    - [Sort alias symbols alphabetically](#toc1_1_8_)    
    - [Number of records with an alias that is shared](#toc1_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc1_1_10_)    
      - [Save as csv](#toc1_1_10_1_)    
    - [Put columns in different order to emphasize alias symbols instead of gene records](#toc1_1_11_)    
    - [Merge rows with matching alias symbols](#toc1_1_12_)    
- [HGNC](#toc2_)    
    - [Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc2_1_1_)    
    - [How many total unique gene records are there](#toc2_1_2_)    
    - [Drop rows with NAN in alias_symbol](#toc2_1_3_)    
    - [Make each row in alias_symbol a set:](#toc2_1_4_)    
    - [Explode the alias sets so that it is one per row](#toc2_1_5_)    
    - [How many total unique aliases are there](#toc2_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc2_1_7_)    
    - [Sort alias symbols alphabetically](#toc2_1_8_)    
    - [Number of records with an alias that is shared](#toc2_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc2_1_10_)    
      - [Save as csv](#toc2_1_10_1_)    
    - [Put columns in different order to emphasize alias symbols instead of gene records](#toc2_1_11_)    
    - [Merge rows with matching alias symbols](#toc2_1_12_)    
- [NCBI Info](#toc3_)    
    - [Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc3_1_1_)    
    - [How many total unique gene records are there](#toc3_1_2_)    
    - [Drop rows with - in alias_symbol](#toc3_1_3_)    
    - [Make each row in alias_symbol a set:](#toc3_1_4_)    
    - [Explode the alias sets so that it is one per row](#toc3_1_5_)    
    - [How many unique aliases are there](#toc3_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc3_1_7_)    
    - [Sort alias symbols alphabetically](#toc3_1_8_)    
    - [Number of records with an alias that is shared](#toc3_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc3_1_10_)    
    - [Put columns in different order to emphasize alias symbols instead of gene records](#toc3_1_11_)    
    - [Merge rows with matching alias symbols](#toc3_1_12_)    
- [Merge to create Alias Overlap Table 1 - Gene Symbol](#toc4_)    
- [Merge to create Alias Overlap Table 2 - Alias Symbol](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
#new_alias-alias_collision_records

# <a id='toc1_'></a>[ENSG](#toc0_)

In [2]:
mini_ensg_df = pd.read_csv('Downloaded_files/mini_ensg_df.csv', dtype={'HGNC_ID': pd.Int64Dtype(), 'NCBI_ID': pd.Int64Dtype()})
mini_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TM4SF6, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"BRICD4, CHM1L, MYODULIN, TEM, TENDIN"
2,ENSG00000000419,DPM1,3005,8813,"CDGIE, MPDS"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"APOLO1, C1ORF112, FLIP, FLJ10706, MEICA1"
...,...,...,...,...,...
75829,ENSG00000293596,,,105372654,
75830,ENSG00000293597,LINC00970,48730,101978719,
75831,ENSG00000293599,,,,
75832,ENSG00000293600,,,131768270,


### <a id='toc1_1_2_'></a>[How many total unique gene records are there](#toc0_)

By ENSG ID

In [3]:
ensg_gene_id_set = set(mini_ensg_df['ENSG_ID'])
len(ensg_gene_id_set)

70611

By gene symbol

In [4]:
ensg_gene_symbol_set = set(mini_ensg_df['gene_symbol'])
len(ensg_gene_symbol_set)

41068
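
The two counts above build a Python `set` from each column. A possible alternative, sketched here on an invented four-row frame (the IDs and symbols are illustrative and not taken from `mini_ensg_df`), is `Series.nunique`, which by default leaves missing values out of the count:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for mini_ensg_df; all values are invented for illustration.
toy_df = pd.DataFrame({
    "ENSG_ID": ["ENSG01", "ENSG02", "ENSG03", "ENSG04"],
    "gene_symbol": ["AAA", "AAA", np.nan, np.nan],
})

# nunique() ignores NaN by default, so records without a symbol are not counted.
n_ids = toy_df["ENSG_ID"].nunique()
n_symbols = toy_df["gene_symbol"].nunique()

# dropna=False counts NaN as one additional category.
n_symbols_with_nan = toy_df["gene_symbol"].nunique(dropna=False)
```

Whether records without a gene symbol should count toward the total is a design choice; the `dropna` flag makes that choice explicit.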

### <a id='toc1_1_3_'></a>[Drop genes with no aliases](#toc0_)

In [5]:
# Keep only rows with at least one alias; missing aliases are read in as NaN.
mini_ensg_df = mini_ensg_df.dropna(subset=['alias_symbol'])
mini_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TM4SF6, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"BRICD4, CHM1L, MYODULIN, TEM, TENDIN"
2,ENSG00000000419,DPM1,3005,8813,"CDGIE, MPDS"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"APOLO1, C1ORF112, FLIP, FLJ10706, MEICA1"
...,...,...,...,...,...
75796,ENSG00000293549,HCG22,,285834,PBMUCL2
75798,ENSG00000293551,PRAMEF22,34393,653606,PRAMEF3L
75801,ENSG00000293555,FAM169BP,26835,283777,"FAM169B, FLJ39743, KIAA0888L"
75828,ENSG00000293595,SLC25A3P1,26869,163742,FLJ40434


### <a id='toc1_1_4_'></a>[Make each row in alias_symbol a set:](#toc0_)
    convert to a list
    make a set

In [6]:
mini_ensg_df['alias_symbol'] = mini_ensg_df['alias_symbol'].astype(str)
mini_ensg_df['alias_symbol'] = [x.split(',') for x in mini_ensg_df.alias_symbol]
mini_ensg_df['alias_symbol']=np.where(mini_ensg_df.alias_symbol=='','',mini_ensg_df.alias_symbol.map(set))
mini_ensg_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ensg_df['alias_symbol'] = mini_ensg_df['alias_symbol'].astype(str)

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"{ TSPAN-6, T245, TM4SF6}"
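
The warning printed above (pandas' `SettingWithCopyWarning`) arises because `mini_ensg_df` was produced by filtering another DataFrame before one of its columns was reassigned. One common way to avoid it, sketched here on invented data rather than the real file, is to take an explicit `.copy()` of the filtered frame:

```python
import numpy as np
import pandas as pd

# Toy frame; gene symbols and aliases are invented for illustration.
toy_df = pd.DataFrame({
    "gene_symbol": ["AAA", "BBB", "CCC"],
    "alias_symbol": ["A1, A2", np.nan, "C1"],
})

# dropna removes rows whose alias_symbol is missing; .copy() makes the result an
# independent frame, so assigning to its columns does not touch toy_df.
with_alias_df = toy_df.dropna(subset=["alias_symbol"]).copy()
with_alias_df["alias_symbol"] = with_alias_df["alias_symbol"].astype(str)
```

The original `toy_df` keeps all three rows, and the filtered copy can be modified freely.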


### <a id='toc1_1_5_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [7]:
mini_ensg_df = mini_ensg_df.explode('alias_symbol')
mini_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,TSPAN-6
0,ENSG00000000003,TSPAN6,11858,7105,T245
0,ENSG00000000003,TSPAN6,11858,7105,TM4SF6
1,ENSG00000000005,TNMD,17757,64102,BRICD4
1,ENSG00000000005,TNMD,17757,64102,CHM1L
...,...,...,...,...,...
75801,ENSG00000293555,FAM169BP,26835,283777,FLJ39743
75828,ENSG00000293595,SLC25A3P1,26869,163742,FLJ40434
75833,ENSG00000293604,ORAI1,25896,84876,TMEM142A
75833,ENSG00000293604,ORAI1,25896,84876,CRACM1
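
As the output above shows, `explode` repeats the original row label once per alias (label `0` appears three times for TSPAN6). A small sketch on invented data, using lists instead of sets so the order is deterministic, shows the repeated labels and how `reset_index(drop=True)` would give each row a unique label if that is wanted:

```python
import pandas as pd

# Toy frame; lists are used instead of sets so the exploded order is predictable.
toy_df = pd.DataFrame({
    "gene_symbol": ["AAA", "BBB"],
    "alias_symbol": [["A1", "A2"], ["B1"]],
})

# One output row per list element; the source row label is repeated.
exploded_df = toy_df.explode("alias_symbol")

# Optional: replace the repeated labels with a fresh 0..n-1 index.
exploded_reset_df = exploded_df.reset_index(drop=True)
```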


In [8]:
mini_ensg_df.loc[mini_ensg_df['alias_symbol'] == "CFM1" ]

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
15693,ENSG00000183688,RFLNB,28705,359845,CFM1
66337,ENSG00000283979,RFLNB,28705,359845,CFM1


### <a id='toc1_1_6_'></a>[How many unique aliases are there](#toc0_)

In [9]:
ensg_alias_symbol_set = set(mini_ensg_df['alias_symbol'])
ensg_alias_len = len(ensg_alias_symbol_set)
ensg_alias_len

55938

### Remove duplicate instances of a primary gene symbol-alias pair

Example:

In [10]:
mini_ensg_df.loc[mini_ensg_df['gene_symbol'] == "RFLNB" ]

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
15693,ENSG00000183688,RFLNB,28705,359845,MGC45871
15693,ENSG00000183688,RFLNB,28705,359845,FAM101B
15693,ENSG00000183688,RFLNB,28705,359845,REFILINB
15693,ENSG00000183688,RFLNB,28705,359845,CFM1
66337,ENSG00000283979,RFLNB,28705,359845,MGC45871
66337,ENSG00000283979,RFLNB,28705,359845,FAM101B
66337,ENSG00000283979,RFLNB,28705,359845,REFILINB
66337,ENSG00000283979,RFLNB,28705,359845,CFM1


In [11]:
mini_ensg_df = mini_ensg_df.drop_duplicates(subset=['gene_symbol', 'alias_symbol'], keep='first')

In [12]:
mini_ensg_df.loc[mini_ensg_df['gene_symbol'] == "RFLNB" ]

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
15693,ENSG00000183688,RFLNB,28705,359845,MGC45871
15693,ENSG00000183688,RFLNB,28705,359845,FAM101B
15693,ENSG00000183688,RFLNB,28705,359845,REFILINB
15693,ENSG00000183688,RFLNB,28705,359845,CFM1


### <a id='toc1_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [13]:
mini_ensg_df['alias_duplicates'] = mini_ensg_df.duplicated(subset= 'alias_symbol', keep=False)
aa_collision_ensg_df = mini_ensg_df[mini_ensg_df['alias_duplicates'] == True]
aa_collision_ensg_df = aa_collision_ensg_df.drop(['alias_duplicates'], axis=1)
aa_collision_ensg_df.head(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ensg_df['alias_duplicates'] = mini_ensg_df.duplicated(subset= 'alias_symbol', keep=False)


Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
4,ENSG00000000460,FIRRM,25565,55732,FLIP
8,ENSG00000001084,GCLC,4311,2729,GCS
12,ENSG00000001497,LAS1L,25726,81887,LAS1
13,ENSG00000001561,ENPP4,3359,22875,AP3AASE
15,ENSG00000001626,CFTR,1884,1080,DJ760C5.1
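
The helper column `alias_duplicates` is added and then dropped again. A shorter variant, sketched on invented data, uses the boolean mask from `duplicated(keep=False)` directly; `keep=False` marks every occurrence of a repeated alias rather than only the later ones:

```python
import pandas as pd

# Toy frame: alias X1 is shared by two genes, Y1 and Z1 are unique.
toy_df = pd.DataFrame({
    "gene_symbol": ["AAA", "BBB", "CCC", "CCC"],
    "alias_symbol": ["X1", "X1", "Y1", "Z1"],
})

# True for every row whose alias appears more than once in the column.
shared_mask = toy_df.duplicated(subset="alias_symbol", keep=False)

# Index with the mask directly; no temporary column is needed.
collision_df = toy_df[shared_mask]
```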


### <a id='toc1_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [14]:
aa_collision_ensg_df = aa_collision_ensg_df.sort_values('alias_symbol')
aa_collision_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
58193,ENSG00000275176,ACACA,84,31,ACC1
8000,ENSG00000140379,BCL2A1,991,597,ACC1
8000,ENSG00000140379,BCL2A1,991,597,ACC2
1354,ENSG00000076555,ACACB,85,32,ACC2
2085,ENSG00000097021,ACOT7,24157,11332,ACT
...,...,...,...,...,...
24134,ENSG00000213339,QTRT1,23797,81890,TGT
6615,ENSG00000132388,UBE2G1,12482,7326,UBC7
15955,ENSG00000184787,UBE2G2,12483,7327,UBC7
11780,ENSG00000165828,PRAP1,23304,118471,UPA


In [15]:
#ensg_CD158b_alias_count_df.to_csv('../hgnc_CD158b_alias_count_df.csv')

### <a id='toc1_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [16]:
ensg_alias_alias_collision_primary_symbol_set = set(aa_collision_ensg_df['gene_symbol'])
len(ensg_alias_alias_collision_primary_symbol_set)

2224

In [17]:
aa_collision_ensg_df['source'] = 'ENSG'
aa_collision_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol,source
58193,ENSG00000275176,ACACA,84,31,ACC1,ENSG
8000,ENSG00000140379,BCL2A1,991,597,ACC1,ENSG
8000,ENSG00000140379,BCL2A1,991,597,ACC2,ENSG
1354,ENSG00000076555,ACACB,85,32,ACC2,ENSG
2085,ENSG00000097021,ACOT7,24157,11332,ACT,ENSG
...,...,...,...,...,...,...
24134,ENSG00000213339,QTRT1,23797,81890,TGT,ENSG
6615,ENSG00000132388,UBE2G1,12482,7326,UBC7,ENSG
15955,ENSG00000184787,UBE2G2,12483,7327,UBC7,ENSG
11780,ENSG00000165828,PRAP1,23304,118471,UPA,ENSG


In [18]:
aa_collision_ensg_df.loc[aa_collision_ensg_df['alias_symbol'] == "RN5S3" ]

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol,source


### <a id='toc1_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [19]:
aa_collision_ensg_count_df = aa_collision_ensg_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
aa_collision_ensg_count_df = aa_collision_ensg_count_df.reset_index()
aa_collision_ensg_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
aa_collision_ensg_count_df = aa_collision_ensg_count_df.sort_values('num_gene_records', ascending=False)
aa_collision_ensg_count_df.head(5)

Unnamed: 0,alias_symbol,num_gene_records
1091,MT1,10
1060,HOX1,10
1061,HOX2,9
411,P40,9
392,P18,8
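
The count table above is built with `pivot_table(..., aggfunc='size')` followed by a rename of column `0`. An equivalent formulation, sketched on invented data, is `groupby(...).size()` with `reset_index(name=...)`, which names the count column in one step:

```python
import pandas as pd

# Toy collision table: alias X1 is used by three genes, Y1 by two.
collision_df = pd.DataFrame({
    "alias_symbol": ["X1", "X1", "X1", "Y1", "Y1"],
    "gene_symbol": ["AAA", "BBB", "CCC", "DDD", "EEE"],
})

count_df = (
    collision_df.groupby("alias_symbol")
    .size()                                   # rows per alias
    .reset_index(name="num_gene_records")     # Series -> DataFrame with a named count column
    .sort_values("num_gene_records", ascending=False)
)
```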


In [20]:
ensg_alias_alias_collision_set = set(aa_collision_ensg_count_df['alias_symbol'])
len(ensg_alias_alias_collision_set)

1149

In [21]:
aa_collision_ensg_count_df.to_csv('../aa_collision_ensg_count_df.csv', index=True)

In [22]:
aa_collision_ensg_distribution_df = aa_collision_ensg_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
aa_collision_ensg_distribution_df = aa_collision_ensg_distribution_df.reset_index()
aa_collision_ensg_distribution_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
aa_collision_ensg_distribution_df['percent_alias_symbol'] = ((aa_collision_ensg_distribution_df['num_alias_symbol'] / ensg_alias_len) * 100)
aa_collision_ensg_distribution_df.head()

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,980,1.75194
1,3,117,0.20916
2,4,31,0.055418
3,5,5,0.008938
4,6,10,0.017877
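
As a worked check of the `percent_alias_symbol` column, the first row can be recomputed by hand from numbers shown in the outputs above: 980 aliases shared by exactly two gene records, 1149 shared aliases in total, and 55938 unique ENSG aliases. The variable names below mirror the notebook's, but the values are typed in by hand for illustration:

```python
# Values copied from the outputs shown earlier in this section.
num_alias_symbol = 980      # aliases shared by exactly two gene records
total_shared_aliases = 1149 # aliases shared by two or more gene records
ensg_alias_len = 55938      # unique ENSG aliases

# Share of all unique aliases that are used by exactly two gene records.
percent_two_records = num_alias_symbol / ensg_alias_len * 100

# Share of all unique aliases that are used by more than one gene record.
percent_any_collision = total_shared_aliases / ensg_alias_len * 100
```

The first value reproduces the 1.75194 in the table; the second suggests that roughly 2% of unique ENSG aliases are shared by more than one gene record.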


In [23]:
ensg_alias_count_histogram_df = aa_collision_ensg_distribution_df.drop('num_alias_symbol', axis=1)
ensg_alias_count_histogram_df.head()

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,1.75194
1,3,0.20916
2,4,0.055418
3,5,0.008938
4,6,0.017877


In [24]:
#px.bar(ensg_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

In [25]:
aa_collision_ensg_distribution_df.to_csv('../aa_collision_ensg_distribution_df.csv', index=True)

#### <a id='toc1_1_10_1_'></a>[Save as csv](#toc0_)

In [26]:
#mini_ensg_df_explode.to_csv('../ensg_alias_overlap.csv', index=False)

### <a id='toc1_1_11_'></a>[Put columns in different order to emphasize alias symbols instead of gene records](#toc0_)

In [27]:
aa_collision_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol,source
58193,ENSG00000275176,ACACA,84,31,ACC1,ENSG
8000,ENSG00000140379,BCL2A1,991,597,ACC1,ENSG
8000,ENSG00000140379,BCL2A1,991,597,ACC2,ENSG
1354,ENSG00000076555,ACACB,85,32,ACC2,ENSG
2085,ENSG00000097021,ACOT7,24157,11332,ACT,ENSG
...,...,...,...,...,...,...
24134,ENSG00000213339,QTRT1,23797,81890,TGT,ENSG
6615,ENSG00000132388,UBE2G1,12482,7326,UBC7,ENSG
15955,ENSG00000184787,UBE2G2,12483,7327,UBC7,ENSG
11780,ENSG00000165828,PRAP1,23304,118471,UPA,ENSG


In [28]:
aa_collision_ensg_df_2 = aa_collision_ensg_df[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
aa_collision_ensg_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
58193,ACC1,ENSG00000275176,ACACA,ENSG
8000,ACC1,ENSG00000140379,BCL2A1,ENSG
8000,ACC2,ENSG00000140379,BCL2A1,ENSG
1354,ACC2,ENSG00000076555,ACACB,ENSG
2085,ACT,ENSG00000097021,ACOT7,ENSG
...,...,...,...,...
24134,TGT,ENSG00000213339,QTRT1,ENSG
6615,UBC7,ENSG00000132388,UBE2G1,ENSG
15955,UBC7,ENSG00000184787,UBE2G2,ENSG
11780,UPA,ENSG00000165828,PRAP1,ENSG


### <a id='toc1_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [29]:
aa_collision_ensg_df_2 = aa_collision_ensg_df_2.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [30]:
# DataFrame.map replaces applymap, which is deprecated as of pandas 2.1.
aa_collision_ensg_df_2 = aa_collision_ensg_df_2.map(str)
aa_collision_ensg_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
58193,ACC1,ENSG00000275176,ACACA,ENSG
8000,ACC1,ENSG00000140379,BCL2A1,ENSG
8000,ACC2,ENSG00000140379,BCL2A1,ENSG
1354,ACC2,ENSG00000076555,ACACB,ENSG
2085,ACT,ENSG00000097021,ACOT7,ENSG
...,...,...,...,...
24134,TGT,ENSG00000213339,QTRT1,ENSG
6615,UBC7,ENSG00000132388,UBE2G1,ENSG
15955,UBC7,ENSG00000184787,UBE2G2,ENSG
11780,UPA,ENSG00000165828,PRAP1,ENSG


In [31]:
aa_collision_ensg_df_2 = aa_collision_ensg_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
aa_collision_ensg_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,ACC1,"ENSG00000275176, ENSG00000140379","ACACA, BCL2A1",ENSG
1,ACC2,"ENSG00000140379, ENSG00000076555","BCL2A1, ACACB",ENSG
2,ACT,"ENSG00000097021, ENSG00000196136","ACOT7, SERPINA3",ENSG
3,AGPAT9,"ENSG00000153395, ENSG00000138678","LPCAT1, GPAT3",ENSG
4,AIP1,"ENSG00000136848, ENSG00000187391","DAB2IP, MAGI2",ENSG
...,...,...,...,...
1144,TCRBV15S1,"ENSG00000276819, ENSG00000211750","TRBV15, TRBV24-1",ENSG
1145,TCRGV5P,"ENSG00000228668, ENSG00000226212","TRGV5P, TRGV6",ENSG
1146,TGT,"ENSG00000101557, ENSG00000213339","USP14, QTRT1",ENSG
1147,UBC7,"ENSG00000132388, ENSG00000184787","UBE2G1, UBE2G2",ENSG
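
The merge step relies on `', '.join`, which only accepts strings; that is why every cell is converted to `str` beforehand. A small sketch on invented data shows both the failure on a missing value and the aggregation itself:

```python
import pandas as pd

# str.join raises TypeError when an item is not a string (for example NaN).
try:
    ", ".join(["ENSG01", float("nan")])
    join_failed = False
except TypeError:
    join_failed = True

# Toy alias/gene pairs; IDs and symbols are invented for illustration.
pairs_df = pd.DataFrame({
    "alias_symbol": ["X1", "X1", "Y1", "Y1"],
    "ENSG_ID": ["ENSG01", "ENSG02", "ENSG03", "ENSG04"],
    "gene_symbol": ["AAA", "BBB", "CCC", "DDD"],
    "source": ["ENSG"] * 4,
})

# One row per alias; IDs and symbols are concatenated in their original row order.
merged_df = pairs_df.groupby("alias_symbol", as_index=False).agg(
    {"ENSG_ID": ", ".join, "gene_symbol": ", ".join, "source": "first"}
)
```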


# <a id='toc2_'></a>[HGNC](#toc0_)

In [32]:
mini_hgnc_df = pd.read_csv('Downloaded_files/mini_hgnc_df.csv', dtype={'HGNC_ID': pd.Int64Dtype(), 'NCBI_ID': pd.Int64Dtype()})
mini_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"myodulin, ChM1L, tendin, TEM, BRICD4"
2,ENSG00000000419,DPM1,3005,8813,"MPDS, CDGIE"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"FLJ10706, Apolo1, FLIP, MEICA1"
...,...,...,...,...,...
45641,,ZNF97,13173,,
45642,,ZNFP1,13181,,
45643,,ZPAXP,51635,105373450,ZPX1P
45644,,ZRK,13193,,


### <a id='toc2_1_2_'></a>[How many total unique gene records are there](#toc0_)

By HGNC ID

In [33]:
hgnc_gene_id_set = set(mini_hgnc_df['HGNC_ID'])
len(hgnc_gene_id_set)

45646

By gene symbol

In [34]:
hgnc_gene_symbol_set = set(mini_hgnc_df['gene_symbol'])
len(hgnc_gene_symbol_set)

45646

### <a id='toc2_1_3_'></a>[Drop genes with no aliases](#toc0_)

In [35]:
# Keep only rows with at least one alias; missing aliases are read in as NaN.
mini_hgnc_df = mini_hgnc_df.dropna(subset=['alias_symbol'])
mini_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"myodulin, ChM1L, tendin, TEM, BRICD4"
2,ENSG00000000419,DPM1,3005,8813,"MPDS, CDGIE"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"FLJ10706, Apolo1, FLIP, MEICA1"
...,...,...,...,...,...
45632,,ZNF78L2,13152,,pT3
45636,,ZNF88,13163,,HPF8
45638,,ZNF94,13170,,F11465
45643,,ZPAXP,51635,105373450,ZPX1P


### <a id='toc2_1_4_'></a>[Make each row in alias_symbol a set:](#toc0_)
    convert to a list
    make a set

In [36]:
mini_hgnc_df['alias_symbol'] = mini_hgnc_df['alias_symbol'].astype(str)
mini_hgnc_df['alias_symbol'] = [x.split(',') for x in mini_hgnc_df.alias_symbol]
mini_hgnc_df['alias_symbol']=np.where(mini_hgnc_df.alias_symbol=='','',mini_hgnc_df.alias_symbol.map(set))
mini_hgnc_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_hgnc_df['alias_symbol'] = mini_hgnc_df['alias_symbol'].astype(str)

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"{T245, TSPAN-6}"
1,ENSG00000000005,TNMD,17757,64102,"{myodulin, ChM1L, BRICD4, TEM, tendin}"
2,ENSG00000000419,DPM1,3005,8813,"{MPDS, CDGIE}"
3,ENSG00000000457,SCYL3,19285,57147,"{PACE-1, PACE1}"
4,ENSG00000000460,FIRRM,25565,55732,"{ Apolo1, MEICA1, FLIP, FLJ10706}"
...,...,...,...,...,...
45632,,ZNF78L2,13152,,{pT3}
45636,,ZNF88,13163,,{HPF8}
45638,,ZNF94,13170,,{F11465}
45643,,ZPAXP,51635,105373450,{ZPX1P}
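
Splitting on `','` alone keeps the space that follows each comma, which is why some set members above display with a leading space (for example ` Apolo1`). A minimal sketch of whitespace-safe splitting follows; `split_aliases` is a hypothetical helper, not part of the notebook, and the input string is copied from the FIRRM row above:

```python
def split_aliases(alias_string, sep=","):
    """Split a delimited alias string into a set of whitespace-stripped aliases."""
    return {part.strip() for part in alias_string.split(sep) if part.strip()}


raw = "FLJ10706, Apolo1, FLIP, MEICA1"

# Naive split: every alias after the first keeps its leading space.
naive = set(raw.split(","))

# Stripped split: aliases compare equal regardless of their position in the list.
cleaned = split_aliases(raw)
```

With stripped aliases, `FLIP` listed first for one gene and third for another would compare equal, which matters for the duplicate detection that follows. Applying this to the notebook would change the reported counts, so it is offered only as something to consider.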


### <a id='toc2_1_5_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [37]:
mini_hgnc_df = mini_hgnc_df.explode('alias_symbol')
mini_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,T245
0,ENSG00000000003,TSPAN6,11858,7105,TSPAN-6
1,ENSG00000000005,TNMD,17757,64102,myodulin
1,ENSG00000000005,TNMD,17757,64102,ChM1L
1,ENSG00000000005,TNMD,17757,64102,BRICD4
...,...,...,...,...,...
45638,,ZNF94,13170,,F11465
45643,,ZPAXP,51635,105373450,ZPX1P
45645,,ZWINTAS,13196,,NCRNA00018
45645,,ZWINTAS,13196,,MPP5


### <a id='toc2_1_6_'></a>[How many total unique aliases are there](#toc0_)

In [38]:
hgnc_alias_symbol_set = set(mini_hgnc_df['alias_symbol'])
hgnc_alias_len = len(hgnc_alias_symbol_set)
hgnc_alias_len

43770

### Remove duplicate instances of a primary gene symbol-alias pair

In [39]:
mini_hgnc_df = mini_hgnc_df.drop_duplicates(subset=['gene_symbol', 'alias_symbol'], keep='first')
mini_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,T245
0,ENSG00000000003,TSPAN6,11858,7105,TSPAN-6
1,ENSG00000000005,TNMD,17757,64102,myodulin
1,ENSG00000000005,TNMD,17757,64102,ChM1L
1,ENSG00000000005,TNMD,17757,64102,BRICD4
...,...,...,...,...,...
45638,,ZNF94,13170,,F11465
45643,,ZPAXP,51635,105373450,ZPX1P
45645,,ZWINTAS,13196,,NCRNA00018
45645,,ZWINTAS,13196,,MPP5


### <a id='toc2_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [40]:
mini_hgnc_df['alias_duplicates'] = mini_hgnc_df.duplicated(subset= 'alias_symbol', keep=False)
aa_collision_hgnc_df = mini_hgnc_df[mini_hgnc_df['alias_duplicates'] == True]
aa_collision_hgnc_df = aa_collision_hgnc_df.drop(['alias_duplicates'], axis=1)
aa_collision_hgnc_df.head(5)

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
4,ENSG00000000460,FIRRM,25565,55732,FLIP
8,ENSG00000001084,GCLC,4311,2729,GCS
13,ENSG00000001561,ENPP4,3359,22875,AP3Aase
22,ENSG00000002549,LAP3,18449,51056,LAP
39,ENSG00000003402,CFLAR,1876,8837,FLIP


### <a id='toc2_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [41]:
aa_collision_hgnc_df = aa_collision_hgnc_df.sort_values('alias_symbol')
aa_collision_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
7761,ENSG00000139187,KLRG1,6380,10219,2F1
75,ENSG00000005022,SLC25A5,10991,292,2F1
10916,ENSG00000163220,S100A9,10499,6280,60B8AG
8398,ENSG00000143546,S100A8,10498,6279,60B8AG
10424,ENSG00000160226,CFAP410,1260,755,A2
...,...,...,...,...,...
10537,ENSG00000161011,SQSTM1,11280,8878,p62
17080,ENSG00000196787,H2AC11,4737,8969,pH2A/f
28269,ENSG00000234816,H2AC5P,4728,10341,pH2A/f
39337,ENSG00000274962,TEX28P1,33356,728447,pTEX


### <a id='toc2_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [42]:
hgnc_alias_alias_collision_primary_symbol_set = set(aa_collision_hgnc_df['gene_symbol'])
len(hgnc_alias_alias_collision_primary_symbol_set)

1356

In [43]:
aa_collision_hgnc_df['source'] = 'HGNC'
aa_collision_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol,source
7761,ENSG00000139187,KLRG1,6380,10219,2F1,HGNC
75,ENSG00000005022,SLC25A5,10991,292,2F1,HGNC
10916,ENSG00000163220,S100A9,10499,6280,60B8AG,HGNC
8398,ENSG00000143546,S100A8,10498,6279,60B8AG,HGNC
10424,ENSG00000160226,CFAP410,1260,755,A2,HGNC
...,...,...,...,...,...,...
10537,ENSG00000161011,SQSTM1,11280,8878,p62,HGNC
17080,ENSG00000196787,H2AC11,4737,8969,pH2A/f,HGNC
28269,ENSG00000234816,H2AC5P,4728,10341,pH2A/f,HGNC
39337,ENSG00000274962,TEX28P1,33356,728447,pTEX,HGNC


### <a id='toc2_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [44]:
aa_collision_hgnc_count_df = aa_collision_hgnc_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
aa_collision_hgnc_count_df = aa_collision_hgnc_count_df.reset_index()
aa_collision_hgnc_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
aa_collision_hgnc_count_df = aa_collision_hgnc_count_df.sort_values('num_gene_records', ascending=False)
aa_collision_hgnc_count_df.head(5)

Unnamed: 0,alias_symbol,num_gene_records
639,U3,8
642,U4,6
189,MYM,6
446,F379,6
638,U2,5


In [45]:
hgnc_alias_alias_collision_set = set(aa_collision_hgnc_count_df['alias_symbol'])
len(hgnc_alias_alias_collision_set)

673

In [46]:
aa_collision_hgnc_count_df.to_csv('../aa_collision_hgnc_count_df.csv', index=True)

In [47]:
hgnc_alias_count_distribution_df = aa_collision_hgnc_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
hgnc_alias_count_distribution_df = hgnc_alias_count_distribution_df.reset_index()
hgnc_alias_count_distribution_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
hgnc_alias_count_distribution_df['percent_alias_symbol'] = ((hgnc_alias_count_distribution_df['num_alias_symbol'] / hgnc_alias_len) * 100)
hgnc_alias_count_distribution_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,574,1.311401
1,3,70,0.159927
2,4,22,0.050263
3,5,3,0.006854
4,6,3,0.006854
5,8,1,0.002285
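
A quick consistency check on the distribution table: the `num_alias_symbol` column should add up to the number of shared aliases reported earlier (673), and dividing by the 43770 unique HGNC aliases gives the overall share. The numbers below are typed in from the outputs above for illustration:

```python
# num_gene_records -> num_alias_symbol, copied from the table above.
num_alias_symbol = {2: 574, 3: 70, 4: 22, 5: 3, 6: 3, 8: 1}
hgnc_alias_len = 43770  # unique HGNC aliases, from the earlier output

# Total number of aliases that are shared by two or more gene records.
total_shared_aliases = sum(num_alias_symbol.values())

# Overall share of unique aliases that collide.
percent_any_collision = total_shared_aliases / hgnc_alias_len * 100
```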


In [48]:
hgnc_alias_count_distribution_df = hgnc_alias_count_distribution_df.drop('num_alias_symbol', axis=1)
hgnc_alias_count_distribution_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,1.311401
1,3,0.159927
2,4,0.050263
3,5,0.006854
4,6,0.006854
5,8,0.002285


In [49]:
#px.bar(hgnc_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

#### <a id='toc2_1_10_1_'></a>[Save as csv](#toc0_)

In [50]:
#mini_hgnc_df_explode.to_csv('../hgnc_alias_overlap.csv', index=False)

### <a id='toc2_1_11_'></a>[Put columns in different order to emphasize alias symbols instead of gene records](#toc0_)

In [51]:
aa_collision_hgnc_df_2 = aa_collision_hgnc_df.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [52]:
aa_collision_hgnc_df_2 = aa_collision_hgnc_df_2[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
aa_collision_hgnc_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
7761,2F1,ENSG00000139187,KLRG1,HGNC
75,2F1,ENSG00000005022,SLC25A5,HGNC
10916,60B8AG,ENSG00000163220,S100A9,HGNC
8398,60B8AG,ENSG00000143546,S100A8,HGNC
10424,A2,ENSG00000160226,CFAP410,HGNC
...,...,...,...,...
10537,p62,ENSG00000161011,SQSTM1,HGNC
17080,pH2A/f,ENSG00000196787,H2AC11,HGNC
28269,pH2A/f,ENSG00000234816,H2AC5P,HGNC
39337,pTEX,ENSG00000274962,TEX28P1,HGNC


### <a id='toc2_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [53]:
# DataFrame.map replaces applymap, which is deprecated as of pandas 2.1.
aa_collision_hgnc_df_2 = aa_collision_hgnc_df_2.map(str)
aa_collision_hgnc_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
7761,2F1,ENSG00000139187,KLRG1,HGNC
75,2F1,ENSG00000005022,SLC25A5,HGNC
10916,60B8AG,ENSG00000163220,S100A9,HGNC
8398,60B8AG,ENSG00000143546,S100A8,HGNC
10424,A2,ENSG00000160226,CFAP410,HGNC
...,...,...,...,...
10537,p62,ENSG00000161011,SQSTM1,HGNC
17080,pH2A/f,ENSG00000196787,H2AC11,HGNC
28269,pH2A/f,ENSG00000234816,H2AC5P,HGNC
39337,pTEX,ENSG00000274962,TEX28P1,HGNC


In [54]:
aa_collision_hgnc_df_2 = aa_collision_hgnc_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
aa_collision_hgnc_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,2F1,"ENSG00000139187, ENSG00000005022","KLRG1, SLC25A5",HGNC
1,60B8AG,"ENSG00000163220, ENSG00000143546","S100A9, S100A8",HGNC
2,A2,"ENSG00000160226, ENSG00000108823, ENSG00000149735","CFAP410, SGCA, GPHA2",HGNC
3,ACC2,"ENSG00000076555, ENSG00000140379","ACACB, BCL2A1",HGNC
4,ACS2,"ENSG00000197142, ENSG00000164398","ACSL5, ACSL6",HGNC
...,...,...,...,...
668,p55,"ENSG00000197170, ENSG00000075618, ENSG00000117...","PSMD12, FSCN1, PIK3R3, H3P44",HGNC
669,p56,"ENSG00000123106, ENSG00000227211","CCDC91, H3P45",HGNC
670,p62,"ENSG00000213024, ENSG00000161011","NUP62, SQSTM1",HGNC
671,pH2A/f,"ENSG00000196787, ENSG00000234816","H2AC11, H2AC5P",HGNC


In [55]:
#mini_hgnc_df_2.to_csv('../hgnc_alias_overlap_2.csv', index=False)

# <a id='toc3_'></a>[NCBI Info](#toc0_)

In [56]:
mini_ncbi_df = pd.read_csv('Downloaded_files/mini_ncbi_df.csv', dtype={'HGNC_ID': pd.Int64Dtype(), 'NCBI_ID': pd.Int64Dtype()})
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B|ABG|GAB|HYST2477,5,ENSG00000121410
1,2,A2M,A2MD|CPAMD5|FWP007|S863-7,7,ENSG00000175899
2,3,A2MP1,A2MP,8,ENSG00000291190
3,9,NAT1,AAC1|MNAT|NAT-1|NATI,7645,ENSG00000171428
4,10,NAT2,AAC2|NAT-2|PNAT,7646,ENSG00000156006
...,...,...,...,...,...
193451,8923215,trnD,-,,
193452,8923216,trnP,-,,
193453,8923217,trnA,-,,
193454,8923218,COX1,-,,


### <a id='toc3_1_2_'></a>[How many total unique gene records are there](#toc0_)

By ENSG ID

In [57]:
ncbi_gene_id_set = set(mini_ncbi_df['ENSG_ID'])
len(ncbi_gene_id_set)

36803

By gene symbol

In [58]:
ncbi_gene_symbol_set = set(mini_ncbi_df['gene_symbol'])
len(ncbi_gene_symbol_set)

193303

### <a id='toc3_1_3_'></a>[Drop genes with no aliases](#toc0_)

In [59]:
mini_ncbi_df = mini_ncbi_df.replace("-", np.nan)
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B|ABG|GAB|HYST2477,5,ENSG00000121410
1,2,A2M,A2MD|CPAMD5|FWP007|S863-7,7,ENSG00000175899
2,3,A2MP1,A2MP,8,ENSG00000291190
3,9,NAT1,AAC1|MNAT|NAT-1|NATI,7645,ENSG00000171428
4,10,NAT2,AAC2|NAT-2|PNAT,7646,ENSG00000156006
...,...,...,...,...,...
193451,8923215,trnD,,,
193452,8923216,trnP,,,
193453,8923217,trnA,,,
193454,8923218,COX1,,,


In [60]:
mini_ncbi_df = mini_ncbi_df.dropna(subset=['alias_symbol'])
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B|ABG|GAB|HYST2477,5,ENSG00000121410
1,2,A2M,A2MD|CPAMD5|FWP007|S863-7,7,ENSG00000175899
2,3,A2MP1,A2MP,8,ENSG00000291190
3,9,NAT1,AAC1|MNAT|NAT-1|NATI,7645,ENSG00000171428
4,10,NAT2,AAC2|NAT-2|PNAT,7646,ENSG00000156006
...,...,...,...,...,...
190958,131696449,LOC131696449,PKD1P1-NPIPA5L,,
190961,131840634,GLTC1,GLTC,56861,
193342,132532400,GABRA6-AS1,ARBAG,40248,
193377,133395150,LNCARGI,ARGI,56890,
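
`mini_ncbi_df.replace("-", np.nan)` replaces the placeholder in every column, not only in `alias_symbol`. If other columns could legitimately hold `-`, a narrower variant restricts the replacement to the alias column. This is a sketch on invented data; the third row's gene symbol `-` is a made-up edge case:

```python
import numpy as np
import pandas as pd

# Toy frame; the "-" gene symbol in the last row is an invented edge case.
toy_df = pd.DataFrame({
    "gene_symbol": ["A1BG", "trnD", "-"],
    "alias_symbol": ["A1B|ABG", "-", "X1"],
})

# Replace the placeholder only where it is the entire alias_symbol value.
toy_df["alias_symbol"] = toy_df["alias_symbol"].replace("-", np.nan)

# Drop records that have no alias; other columns are left untouched.
with_alias_df = toy_df.dropna(subset=["alias_symbol"])
```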


### <a id='toc3_1_4_'></a>[Make each row in alias_symbol a set:](#toc0_)
    convert to a list
    make a set

In [61]:
mini_ncbi_df['alias_symbol'] = mini_ncbi_df['alias_symbol'].astype(str)
mini_ncbi_df['alias_symbol'] = [x.split('|') for x in mini_ncbi_df.alias_symbol]
mini_ncbi_df['alias_symbol']=np.where(mini_ncbi_df.alias_symbol=='','',mini_ncbi_df.alias_symbol.map(set))
mini_ncbi_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ncbi_df['alias_symbol'] = mini_ncbi_df['alias_symbol'].astype(str)

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,"{A1B, HYST2477, ABG, GAB}",5,ENSG00000121410
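
NCBI aliases are pipe-delimited with no surrounding spaces, so the list comprehension can also be written with the vectorized `Series.str.split`. The sketch below uses invented rows and skips the set conversion, which is an assumption that aliases are not repeated within one record:

```python
import pandas as pd

# Toy frame with pipe-delimited aliases, modeled on the rows shown above.
toy_df = pd.DataFrame({
    "gene_symbol": ["A1BG", "A2MP1"],
    "alias_symbol": ["A1B|ABG|GAB", "A2MP"],
})

# A single-character separator is treated literally, so "|" is not read as regex alternation.
toy_df["alias_symbol"] = toy_df["alias_symbol"].str.split("|")

# One row per alias, with a fresh 0..n-1 index.
long_df = toy_df.explode("alias_symbol").reset_index(drop=True)
```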


### <a id='toc3_1_5_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [62]:
mini_ncbi_df = mini_ncbi_df.explode(column="alias_symbol")
mini_ncbi_df.head(5)

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B,5,ENSG00000121410
0,1,A1BG,HYST2477,5,ENSG00000121410
0,1,A1BG,ABG,5,ENSG00000121410
0,1,A1BG,GAB,5,ENSG00000121410
1,2,A2M,A2MD,7,ENSG00000175899


In [63]:
#ncbi_CD158b_alias_count_df.to_csv('../ncbi_CD158b_alias_count_df.csv')

### <a id='toc3_1_6_'></a>[How many unique aliases are there](#toc0_)

In [64]:
ncbi_alias_symbol_set = set(mini_ncbi_df['alias_symbol'])
ncbi_alias_len = len(ncbi_alias_symbol_set)
ncbi_alias_len

69157

### Remove duplicate instances of a primary gene symbol and alias pair

In [65]:
mini_ncbi_df = mini_ncbi_df.drop_duplicates(subset=['gene_symbol', 'alias_symbol'], keep='first')
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B,5,ENSG00000121410
0,1,A1BG,HYST2477,5,ENSG00000121410
0,1,A1BG,ABG,5,ENSG00000121410
0,1,A1BG,GAB,5,ENSG00000121410
1,2,A2M,A2MD,7,ENSG00000175899
...,...,...,...,...,...
190961,131840634,GLTC1,GLTC,56861,
193342,132532400,GABRA6-AS1,ARBAG,40248,
193377,133395150,LNCARGI,ARGI,56890,
193378,133834869,MLDHR,MP31,55481,


### <a id='toc3_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [66]:
# keep=False flags every row whose alias appears in more than one row
shared_alias_mask = mini_ncbi_df.duplicated(subset='alias_symbol', keep=False)
mini_ncbi_df = mini_ncbi_df[shared_alias_mask]
mini_ncbi_df.head(5)

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B,5,ENSG00000121410
3,9,NAT1,NAT-1,7645,ENSG00000171428
3,9,NAT1,AAC1,7645,ENSG00000171428
4,10,NAT2,AAC2,7646,ENSG00000156006
6,12,SERPINA3,ACT,16,ENSG00000196136
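
The `keep=False` argument is what makes this step return every record involved in a shared alias rather than only the repeats. A short sketch on made-up rows shows the difference from the default `keep='first'`:

```python
import pandas as pd

# Illustrative rows only; not taken from the NCBI file.
toy_df = pd.DataFrame({
    "gene_symbol": ["GENE1", "GENE2", "GENE3", "GENE4"],
    "alias_symbol": ["X1", "X1", "Y2", "Z3"],
})

# keep=False marks every occurrence of a repeated alias
shared_mask = toy_df.duplicated(subset="alias_symbol", keep=False)
shared_df = toy_df[shared_mask]

# With the default keep='first', the first occurrence (GENE1) is not flagged
later_only_df = toy_df[toy_df.duplicated(subset="alias_symbol")]
```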


### <a id='toc3_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [67]:
mini_ncbi_df = mini_ncbi_df.sort_values('alias_symbol')
mini_ncbi_df.head(5)

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
537,657,BMPR1A,10q23del,1076,ENSG00000107779
4525,5728,PTEN,10q23del,9588,ENSG00000171862
199,239,ALOX12,12-LOX,429,ENSG00000108839
205,246,ALOX15,12-LOX,433,ENSG00000161905
7974,10219,KLRG1,2F1,6380,ENSG00000139187


### <a id='toc3_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [68]:
ncbi_alias_alias_collision_primary_symbol_set = set(mini_ncbi_df['gene_symbol'])
len(ncbi_alias_alias_collision_primary_symbol_set)

5732

In [69]:
mini_ncbi_df['source'] = 'NCBI Info'
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,source
537,657,BMPR1A,10q23del,1076,ENSG00000107779,NCBI Info
4525,5728,PTEN,10q23del,9588,ENSG00000171862,NCBI Info
199,239,ALOX12,12-LOX,429,ENSG00000108839,NCBI Info
205,246,ALOX15,12-LOX,433,ENSG00000161905,NCBI Info
7974,10219,KLRG1,2F1,6380,ENSG00000139187,NCBI Info
...,...,...,...,...,...,...
18172,139420,PPP4R3C,smk1,33146,ENSG00000224960,NCBI Info
12905,55671,PPP4R3A,smk1,20219,ENSG00000100796,NCBI Info
13522,57223,PPP4R3B,smk1,29267,ENSG00000275052,NCBI Info
7631,9825,SPATA2,tamo,14681,ENSG00000158480,NCBI Info


### <a id='toc3_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [70]:
ncbi_dup_alias_count_df = mini_ncbi_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
ncbi_dup_alias_count_df

alias_symbol
10q23del       2
12-LOX         2
2F1            2
3-alpha-HSD    2
35DAG          2
              ..
polymerase     3
psiSSX8        2
rpL7a          2
smk1           3
tamo           2
Length: 3476, dtype: int64

In [71]:
ncbi_dup_alias_count_df.to_csv('../ncbi_alias_overlap_count.csv', index=True)

In [72]:
ncbi_dup_alias_count_df = ncbi_dup_alias_count_df.reset_index()
ncbi_dup_alias_count_df

Unnamed: 0,alias_symbol,0
0,10q23del,2
1,12-LOX,2
2,2F1,2
3,3-alpha-HSD,2
4,35DAG,2
...,...,...
3471,polymerase,3
3472,psiSSX8,2
3473,rpL7a,2
3474,smk1,3


In [73]:
ncbi_dup_alias_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
ncbi_dup_alias_count_df

Unnamed: 0,alias_symbol,num_gene_records
0,10q23del,2
1,12-LOX,2
2,2F1,2
3,3-alpha-HSD,2
4,35DAG,2
...,...,...
3471,polymerase,3
3472,psiSSX8,2
3473,rpL7a,2
3474,smk1,3
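
The count, reset, and rename cells above could be collapsed into one chain. This is only a sketch of an alternative, not the notebook's method: it swaps `pivot_table(aggfunc='size')` for `groupby(...).size()` and names the count column in `reset_index`. The rows are made up.

```python
import pandas as pd

# Illustrative rows only; not taken from the NCBI file.
toy_df = pd.DataFrame({
    "gene_symbol": ["G1", "G2", "G3", "G4", "G5"],
    "alias_symbol": ["X1", "X1", "Y2", "Y2", "Y2"],
})

alias_count_df = (
    toy_df.groupby("alias_symbol")
    .size()                                   # rows per alias
    .reset_index(name="num_gene_records")     # name the count column directly
    .sort_values("num_gene_records", ascending=False)
)
```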


In [74]:
ncbi_dup_alias_count_df = ncbi_dup_alias_count_df.sort_values('num_gene_records', ascending=False)
ncbi_dup_alias_count_df.head(30)

Unnamed: 0,alias_symbol,num_gene_records
3305,VH,36
1303,H4-16,14
1306,H4C12,13
1305,H4C11,13
1316,H4C8,13
1317,H4C9,13
1312,H4C3,13
1310,H4C16,13
1304,H4C1,13
1315,H4C6,13


In [75]:
ncbi_alias_alias_collision_set = set(ncbi_dup_alias_count_df['alias_symbol'])
len(ncbi_alias_alias_collision_set)

3476

In [76]:
ncbi_dup_alias_count_df.to_csv('../ncbi_dup_alias_count_df.csv', index=True)

In [77]:
ncbi_alias_count_histogram_df = ncbi_dup_alias_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
ncbi_alias_count_histogram_df

num_gene_records
2     2786
3      413
4      140
5       54
6       23
7       17
8        8
9       15
10       2
11       1
12       1
13      14
14       1
36       1
dtype: int64

In [78]:
ncbi_alias_count_histogram_df = ncbi_alias_count_histogram_df.reset_index()
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,0
0,2,2786
1,3,413
2,4,140
3,5,54
4,6,23
5,7,17
6,8,8
7,9,15
8,10,2
9,11,1


In [79]:
ncbi_alias_count_histogram_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol
0,2,2786
1,3,413
2,4,140
3,5,54
4,6,23
5,7,17
6,8,8
7,9,15
8,10,2
9,11,1


In [80]:
ncbi_alias_count_histogram_df['percent_alias_symbol'] = ((ncbi_alias_count_histogram_df['num_alias_symbol'] / ncbi_alias_len) * 100)
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,2786,4.028515
1,3,413,0.597192
2,4,140,0.202438
3,5,54,0.078083
4,6,23,0.033258
5,7,17,0.024582
6,8,8,0.011568
7,9,15,0.02169
8,10,2,0.002892
9,11,1,0.001446


In [81]:
ncbi_alias_count_histogram_df = ncbi_alias_count_histogram_df.drop('num_alias_symbol', axis=1)
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,4.028515
1,3,0.597192
2,4,0.202438
3,5,0.078083
4,6,0.033258
5,7,0.024582
6,8,0.011568
7,9,0.02169
8,10,0.002892
9,11,0.001446
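
Note that the denominator here is every unique NCBI alias (`ncbi_alias_len`), not only the shared ones, so the percentages add up to the overall shared fraction (about 5%) rather than to 100%. The arithmetic can be checked directly from the counts printed above:

```python
# Counts copied from the outputs above: key = number of gene records sharing
# an alias, value = number of aliases with that many records
aliases_per_record_count = {
    2: 2786, 3: 413, 4: 140, 5: 54, 6: 23, 7: 17, 8: 8,
    9: 15, 10: 2, 11: 1, 12: 1, 13: 14, 14: 1, 36: 1,
}
ncbi_alias_len = 69157  # unique NCBI aliases, from the earlier count

# Total shared aliases; should agree with the 3476 reported earlier
total_shared = sum(aliases_per_record_count.values())

# Percent of all unique aliases that fall in each bucket
percent_by_record_count = {
    n: count / ncbi_alias_len * 100
    for n, count in aliases_per_record_count.items()
}
percent_shared_overall = total_shared / ncbi_alias_len * 100
```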


In [82]:
# px.bar(ncbi_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

### <a id='toc3_1_11_'></a>[Put columns in different order to emphasize alias symbols instead of gene records](#toc0_)

In [83]:
mini_ncbi_df_2 = mini_ncbi_df.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [84]:
# Select from the de-duplicated frame so the previous cell's result is not discarded
mini_ncbi_df_2 = mini_ncbi_df_2[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
4525,10q23del,ENSG00000171862,PTEN,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
7974,2F1,ENSG00000139187,KLRG1,NCBI Info
...,...,...,...,...
18172,smk1,ENSG00000224960,PPP4R3C,NCBI Info
12905,smk1,ENSG00000100796,PPP4R3A,NCBI Info
13522,smk1,ENSG00000275052,PPP4R3B,NCBI Info
7631,tamo,ENSG00000158480,SPATA2,NCBI Info


### <a id='toc3_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [85]:
mini_ncbi_df_2 = mini_ncbi_df_2.astype(str)  # applymap is deprecated in recent pandas
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
4525,10q23del,ENSG00000171862,PTEN,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
7974,2F1,ENSG00000139187,KLRG1,NCBI Info
...,...,...,...,...
18172,smk1,ENSG00000224960,PPP4R3C,NCBI Info
12905,smk1,ENSG00000100796,PPP4R3A,NCBI Info
13522,smk1,ENSG00000275052,PPP4R3B,NCBI Info
7631,tamo,ENSG00000158480,SPATA2,NCBI Info


In [86]:
mini_ncbi_df_2['ENSG_ID'] = mini_ncbi_df_2['ENSG_ID'].str.replace('NAN','nan')
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
4525,10q23del,ENSG00000171862,PTEN,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
7974,2F1,ENSG00000139187,KLRG1,NCBI Info
...,...,...,...,...
18172,smk1,ENSG00000224960,PPP4R3C,NCBI Info
12905,smk1,ENSG00000100796,PPP4R3A,NCBI Info
13522,smk1,ENSG00000275052,PPP4R3B,NCBI Info
7631,tamo,ENSG00000158480,SPATA2,NCBI Info


In [87]:
mini_ncbi_df_2 = mini_ncbi_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,10q23del,"ENSG00000107779, ENSG00000171862","BMPR1A, PTEN",NCBI Info
1,12-LOX,"ENSG00000108839, ENSG00000161905","ALOX12, ALOX15",NCBI Info
2,2F1,"ENSG00000139187, ENSG00000005022","KLRG1, SLC25A5",NCBI Info
3,3-alpha-HSD,"ENSG00000198610, ENSG00000073737","AKR1C4, DHRS9",NCBI Info
4,35DAG,"ENSG00000102683, ENSG00000170624","SGCG, SGCD",NCBI Info
...,...,...,...,...
3471,polymerase,"nan, nan, nan","ERVK-11, ERVK-19, ERVK-9",NCBI Info
3472,psiSSX8,"ENSG00000241207, nan","SSX18P, SSXP8",NCBI Info
3473,rpL7a,"ENSG00000213272, ENSG00000240522","RPL7AP9, RPL7AP10",NCBI Info
3474,smk1,"ENSG00000224960, ENSG00000100796, ENSG00000275052","PPP4R3C, PPP4R3A, PPP4R3B",NCBI Info
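
The merge relies on `str.join`, which raises a `TypeError` on a missing value, which is why the columns are converted to strings first. A small sketch of the same groupby-and-join on made-up rows follows. The identifiers are placeholders, and missing IDs are filled with the literal string `'nan'` to mirror the `nan` entries visible in the output above.

```python
import pandas as pd

# Illustrative rows only; IDs and symbols are placeholders.
toy_df = pd.DataFrame({
    "alias_symbol": ["X1", "X1", "Y2", "Y2", "Y2"],
    "ENSG_ID": ["ENSG_A", "ENSG_B", "ENSG_C", None, "ENSG_E"],
    "gene_symbol": ["G1", "G2", "G3", "G4", "G5"],
    "source": ["NCBI Info"] * 5,
})

# str.join cannot handle a missing value, so make it an explicit string
toy_df["ENSG_ID"] = toy_df["ENSG_ID"].fillna("nan")

merged_df = (
    toy_df.groupby("alias_symbol")
    .agg({"ENSG_ID": ", ".join, "gene_symbol": ", ".join, "source": "first"})
    .reset_index()
)
```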


# <a id='toc4_'></a>[Merge to create Alias Overlap Table 1 - Gene Symbol](#toc0_)

In [88]:
# merged_alias_overlap_df_1 = pd.concat([mini_hgnc_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']],mini_ncbi_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']], mini_ensg_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']]])
# merged_alias_overlap_df_1

In [89]:
# merged_alias_overlap_df_1.to_csv('../merged_alias_overlap_df_1.csv', index=False)

In [90]:
# merged_alias_overlap_df_1.loc[merged_alias_overlap_df_1.gene_symbol == 'HSP90AA1']

In [91]:
# merged_alias_overlap_df_1.loc[merged_alias_overlap_df_1.alias_symbol == 'Hsp90' ]

In [92]:
# merged_alias_overlap_df_1['source'].value_counts()

# <a id='toc5_'></a>[Merge to create Alias Overlap Table 2 - Alias Symbol](#toc0_)

In [93]:
merged_alias_overlap_df_2 = pd.concat([aa_collision_hgnc_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']],mini_ncbi_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']], aa_collision_ensg_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']]])
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
0,2F1,"KLRG1, SLC25A5","ENSG00000139187, ENSG00000005022",HGNC
1,60B8AG,"S100A9, S100A8","ENSG00000163220, ENSG00000143546",HGNC
2,A2,"CFAP410, SGCA, GPHA2","ENSG00000160226, ENSG00000108823, ENSG00000149735",HGNC
3,ACC2,"ACACB, BCL2A1","ENSG00000076555, ENSG00000140379",HGNC
4,ACS2,"ACSL5, ACSL6","ENSG00000197142, ENSG00000164398",HGNC
...,...,...,...,...
1144,TCRBV15S1,"TRBV15, TRBV24-1","ENSG00000276819, ENSG00000211750",ENSG
1145,TCRGV5P,"TRGV5P, TRGV6","ENSG00000228668, ENSG00000226212",ENSG
1146,TGT,"USP14, QTRT1","ENSG00000101557, ENSG00000213339",ENSG
1147,UBC7,"UBE2G1, UBE2G2","ENSG00000132388, ENSG00000184787",ENSG


In [94]:
merged_alias_overlap_df_2.loc[merged_alias_overlap_df_2['alias_symbol'] == "ASP" ]

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
394,ASP,"ROPN1L, ASPA, ATG5, ASIP","ENSG00000145491, ENSG00000108381, ENSG00000057...",HGNC
222,ASP,"A1CF, ATG5, ASPM, TMPRSS11D, ROPN1L, ASIP, ASP...","ENSG00000148584, ENSG00000057663, ENSG00000066...",NCBI Info
864,ASP,"ROPN1L, ASPM, TMPRSS11D","ENSG00000145491, ENSG00000066279, ENSG00000153802",ENSG


In [95]:
# The symbols were joined with ', ', so split on the same separator to avoid leading spaces
merged_alias_overlap_df_2['gene_symbol'] = merged_alias_overlap_df_2['gene_symbol'].str.split(", ")
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
0,2F1,"[KLRG1, SLC25A5]","ENSG00000139187, ENSG00000005022",HGNC
1,60B8AG,"[S100A9, S100A8]","ENSG00000163220, ENSG00000143546",HGNC
2,A2,"[CFAP410, SGCA, GPHA2]","ENSG00000160226, ENSG00000108823, ENSG00000149735",HGNC
3,ACC2,"[ACACB, BCL2A1]","ENSG00000076555, ENSG00000140379",HGNC
4,ACS2,"[ACSL5, ACSL6]","ENSG00000197142, ENSG00000164398",HGNC
...,...,...,...,...
1144,TCRBV15S1,"[TRBV15, TRBV24-1]","ENSG00000276819, ENSG00000211750",ENSG
1145,TCRGV5P,"[TRGV5P, TRGV6]","ENSG00000228668, ENSG00000226212",ENSG
1146,TGT,"[USP14, QTRT1]","ENSG00000101557, ENSG00000213339",ENSG
1147,UBC7,"[UBE2G1, UBE2G2]","ENSG00000132388, ENSG00000184787",ENSG


In [96]:
merged_alias_overlap_df_2['gene_symbol_count'] = [len(c) for c in merged_alias_overlap_df_2['gene_symbol']]
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
0,2F1,"[KLRG1, SLC25A5]","ENSG00000139187, ENSG00000005022",HGNC,2
1,60B8AG,"[S100A9, S100A8]","ENSG00000163220, ENSG00000143546",HGNC,2
2,A2,"[CFAP410, SGCA, GPHA2]","ENSG00000160226, ENSG00000108823, ENSG00000149735",HGNC,3
3,ACC2,"[ACACB, BCL2A1]","ENSG00000076555, ENSG00000140379",HGNC,2
4,ACS2,"[ACSL5, ACSL6]","ENSG00000197142, ENSG00000164398",HGNC,2
...,...,...,...,...,...
1144,TCRBV15S1,"[TRBV15, TRBV24-1]","ENSG00000276819, ENSG00000211750",ENSG,2
1145,TCRGV5P,"[TRGV5P, TRGV6]","ENSG00000228668, ENSG00000226212",ENSG,2
1146,TGT,"[USP14, QTRT1]","ENSG00000101557, ENSG00000213339",ENSG,2
1147,UBC7,"[UBE2G1, UBE2G2]","ENSG00000132388, ENSG00000184787",ENSG,2
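
Because the merged strings were built with `', '.join`, the separator used when splitting matters: the element count is the same either way, but splitting on a bare comma leaves a leading space on every symbol after the first. A plain-Python illustration (`GENE3` is a made-up placeholder):

```python
joined = "KLRG1, SLC25A5, GENE3"  # GENE3 is a placeholder symbol

# Splitting on a bare comma keeps the space that followed each comma
split_on_comma = joined.split(",")

# Splitting on the same separator used for joining gives clean symbols
split_on_comma_space = joined.split(", ")

# Stripping each piece is an alternative when the separator is inconsistent
stripped = [piece.strip() for piece in joined.split(",")]
```

The counts in `gene_symbol_count` are unaffected, but membership tests such as `'SLC25A5' in symbols` only behave as expected on the clean lists.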


In [97]:
merged_alias_overlap_df_2 = merged_alias_overlap_df_2.sort_values(by='gene_symbol_count', ascending= False)
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
3305,VH,"[IGHV4-34, IGHV3-43, IGHV4-59, IGHV4-61, I...","ENSG00000211956, ENSG00000232216, ENSG00000224...",NCBI Info,36
1303,H4-16,"[H4C9, H4C14, H4C13, H4C5, H4C16, H4C2, ...","ENSG00000276180, ENSG00000270882, ENSG00000275...",NCBI Info,14
1317,H4C9,"[H4C1, H4C2, H4C6, H4C4, H4C8, H4C15, H4...","ENSG00000278637, ENSG00000278705, ENSG00000274...",NCBI Info,13
1304,H4C1,"[H4C2, H4C12, H4C9, H4C8, H4C11, H4C14, ...","ENSG00000278705, ENSG00000273542, ENSG00000276...",NCBI Info,13
1305,H4C11,"[H4C13, H4C1, H4C2, H4C9, H4C8, H4C4, H4...","ENSG00000275126, ENSG00000278637, ENSG00000278...",NCBI Info,13
...,...,...,...,...,...
1240,GST3,"[CHST4, GSTP1]","ENSG00000140835, ENSG00000084207",NCBI Info,2
1238,GST1,"[GSTM1, GSPT1]","ENSG00000134184, ENSG00000103342",NCBI Info,2
1237,GST,"[GSTK1, SLCO6A1]","ENSG00000197448, ENSG00000205359",NCBI Info,2
1236,GSP,"[GNAS, GSM1]","ENSG00000087460, nan",NCBI Info,2


In [98]:
merged_alias_overlap_df_2.loc[merged_alias_overlap_df_2['alias_symbol'] == "CFM1" ]

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count


In [99]:
merged_alias_overlap_df_2['gene_symbol_count'].value_counts()

gene_symbol_count
2     4340
3      600
4      193
5       62
6       36
7       18
9       17
13      14
8       10
10       4
36       1
14       1
12       1
11       1
Name: count, dtype: int64

In [100]:
merged_alias_overlap_df_2.to_csv('merged_alias_overlap_df_2.csv', index=True)

In [101]:
aa_collision_set = set(merged_alias_overlap_df_2['alias_symbol'].tolist())

In [102]:
common_aa_collisions = ensg_alias_alias_collision_primary_symbol_set & hgnc_alias_alias_collision_primary_symbol_set & ncbi_alias_alias_collision_primary_symbol_set
common_aa_collisions

{'A1CF',
 'ABCC2',
 'ABCC6',
 'ABCC8',
 'ABHD14B',
 'ABL1',
 'ACACB',
 'ACKR1',
 'ACOD1',
 'ACOT7',
 'ACOT8',
 'ACP2',
 'ACP6',
 'ACSL5',
 'ACSL6',
 'ACTR2',
 'ACTR3',
 'ACTR3B',
 'ACTR3C',
 'ACTRT1',
 'ADAM10',
 'AGAP13P',
 'AGER',
 'AGFG1',
 'AGTR1',
 'AHI1',
 'AHSA1',
 'AIFM2',
 'AIMP1',
 'AIMP2',
 'AKR1A1',
 'AKR1B1',
 'AKR1C4',
 'ALDH9A1',
 'ALK',
 'ALPG',
 'ALPI',
 'ALPP',
 'AMBP',
 'AMPD1',
 'AMT',
 'ANAPC1',
 'ANAPC10',
 'ANGPTL1',
 'ANGPTL2',
 'ANGPTL4',
 'ANKRD1',
 'ANKS4B',
 'ANPEP',
 'ANXA2',
 'ANXA2P3',
 'AP1M2',
 'AP2M1',
 'AP3B1',
 'AP3M2',
 'AP4S1',
 'APEX1',
 'APOBEC2',
 'APOBEC3A',
 'APPBP2',
 'ARFRP1',
 'ARHGEF2',
 'ARHGEF28',
 'ARHGEF5',
 'ARID2',
 'ARL6IP1',
 'ARMH1',
 'ARSF',
 'ASAP1',
 'ASAP2',
 'ASCC1',
 'ASIP',
 'ASPA',
 'ASPM',
 'ASRGL1',
 'ATF7IP',
 'ATG5',
 'ATN1',
 'ATP2C1',
 'ATP6V0A1',
 'ATP6V0A2',
 'ATP6V0A4',
 'ATP6V0D1',
 'ATP6V1B1',
 'ATP6V1B2',
 'ATP6V1G1',
 'ATP6V1G3',
 'ATP8A2',
 'ATRAID',
 'ATRNL1',
 'AURKA',
 'AZI2',
 'AZIN1',
 'AZIN2',
 'B3GNT2'

In [103]:
len(common_aa_collisions)

995

# DGIdb ambiguous query

In [104]:
merged_alias_primary_collisions_df = pd.read_csv("merged_alias_gene_intersections.csv", na_values=['', 'NULL'], keep_default_na= False)
merged_alias_primary_collisions_df

Unnamed: 0,gene_symbol,alias_symbol,intersect_point,source
0,SOAT2,ACAT2,ACAT2,HGNC
1,ACTBP8,ACTBP2,ACTBP2,HGNC
2,APPL1,APPL,APPL,HGNC
3,AKR1B1,AR,AR,HGNC
4,B3GNTL1,B3GNT8,B3GNT8,HGNC
...,...,...,...,...
2021,USP21,"USP23, USP16",USP16,ENSG
2022,USP25,USP21,USP21,ENSG
2023,VDAC1P5,"VDAC5P, VDAC3",VDAC3,ENSG
2024,XBP1P1,"XBPP1, XBP1",XBP1,ENSG


In [105]:
ag_collision_set = set(merged_alias_primary_collisions_df["intersect_point"])

In [106]:
dgidb_gene_df = pd.read_csv("dgidb_genes_JUNE.tsv", sep='\t', na_values=['', 'NULL'], keep_default_na= False)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14
...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20


In [107]:
dgidb_gene_df.query('gene_name != gene_claim_name')

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14
...,...,...,...,...,...,...
79999,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,FoundationOneGenes,9/3/20
80014,ENSEMBL:ENSG00000185821,Ensembl Gene ID,hgnc:31305,OR6C76,RussLampel,26-Jul-11
80053,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,CarisMolecularIntelligence,9/4/20
80166,SEPT6,Gene Symbol,hgnc:15848,SEPTIN6,CarisMolecularIntelligence,9/4/20


Claims without a symbol/name/identifier (there shouldn't be any, hooray)

In [108]:
no_claim_symbols_df = dgidb_gene_df[dgidb_gene_df['gene_claim_name'].isnull()]
no_claim_symbols_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version


Claims that were not normalized (`gene_name` is null)

In [109]:
no_name_symbols_df = dgidb_gene_df[dgidb_gene_df['gene_name'].isnull()]
no_name_symbols_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14
...,...,...,...,...,...,...
29008,URS00006E35E8_9606,Gene Symbol,,,GO,10-Apr-24
29009,URS0000EA3BA1_9606,Gene Symbol,,,GO,10-Apr-24
29198,TRYPTASE_B2_HUMAN,Gene Symbol,,,GO,10-Apr-24
29199,TRYPTASE_B1_HUMAN,Gene Symbol,,,GO,10-Apr-24


In [110]:
dgidb_name_set = set(dgidb_gene_df['gene_name'])
len(dgidb_name_set)

12001

In [111]:
dgidb_gene_claim_name_set = set(dgidb_gene_df['gene_claim_name'])
len(dgidb_gene_claim_name_set)

26739

In [112]:
hgnc_ensg_gene_symbol_set = hgnc_gene_symbol_set.union(ensg_gene_symbol_set)

In [113]:
hgnc_ensg_ncbi_gene_symbol_set = hgnc_ensg_gene_symbol_set.union(ncbi_gene_symbol_set)

In [114]:
hgnc_ensg_alias_symbol_set = hgnc_alias_symbol_set.union(ensg_alias_symbol_set)

In [115]:
hgnc_ensg_ncbi_alias_symbol_set = hgnc_ensg_alias_symbol_set.union(ncbi_alias_symbol_set)

In [116]:
print(len(aa_collision_set.intersection(ag_collision_set)))

161


In [117]:
ambiguous_symbol_set = aa_collision_set.union(ag_collision_set)
print(len(ambiguous_symbol_set))

5865


In [118]:
ambiguous_symbol_set = set(item.strip() for item in ambiguous_symbol_set)

print(len(ambiguous_symbol_set))

5050
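
A set can only shrink after `strip()` if some entries differed solely by surrounding whitespace. The toy sets below (made-up contents) show both the union size identity and that collapse:

```python
# Made-up symbol sets; note the leading space on one entry
aa_toy = {"ASP", "smk1", " H4C8"}
ag_toy = {"ASP", "H4C8", "AR"}

union_toy = aa_toy | ag_toy
overlap_toy = aa_toy & ag_toy

# |A or B| = |A| + |B| - |A and B|
expected_union_size = len(aa_toy) + len(ag_toy) - len(overlap_toy)

# ' H4C8' and 'H4C8' become the same entry once whitespace is stripped
cleaned_toy = {item.strip() for item in union_toy}
```

Stripping before taking the union and intersection would presumably give counts that are easier to compare across steps.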


In [119]:
ambiguous_symbol_set

{'TF',
 'MYO1C',
 'CPAD',
 'CYPH',
 'PRAP',
 'TRMT1',
 'LPCAT4',
 'GPRASP3',
 'APC, PC',
 'PBS',
 'H2B/s',
 'CEACAM3',
 'GCP2',
 'SR',
 'H2BC10, H2BC4, H2BC6, H2BC7',
 'HRES-1',
 'CMT2B2',
 'DAGK5',
 'HSD17',
 'FRAXE',
 'PWCR',
 'KCNJN1',
 'SPC3',
 'SKD2',
 'rpL7a',
 'PRR23D2',
 'PP',
 'U7',
 'HLA-DQB1',
 'MFT',
 'CCA1',
 'FAM90A16P',
 'FABP5P1',
 'P65',
 'p23',
 'HCA3',
 'GCCD2',
 'H2BC4, H2BC6, H2BC7, H2BC8',
 'DSC1, DSC2',
 'NET4',
 'ZNT8',
 'RK',
 'CLS',
 'p67',
 'DHX40P1, TBC1D3P1',
 'FAM25C',
 'CAB',
 'FAM28A',
 'TSPY',
 'HRH1',
 'ZIP2',
 'U3a',
 'OFC2',
 'M6A',
 'LIP1',
 'TK2',
 'SCNM1',
 'MAGEE1',
 'TTY11',
 'CRP1',
 'EPO',
 'UBH1',
 'TER',
 'NIPA2',
 'NET2',
 'HZF3',
 'H4C8',
 'CHC',
 'HE',
 'AG1',
 'ACF',
 'GGF2',
 '464.2',
 'RAX',
 'SMG',
 'RCK',
 'FRMPD2',
 'CXorf51B',
 'p33',
 'USE1',
 'PRX',
 'H4',
 'DOC-1',
 'HBK',
 'SAST',
 'FADS3',
 'MCP2',
 'TIP',
 'H2B/g',
 'STAG3L1, STAG3L3',
 'S31III125',
 'NBPF',
 'OR7-21',
 'MRXSMP',
 'SRC1',
 'CIP1',
 'eIF-2gA',
 'SYT11',
 'U13'

In [120]:
with open('ambiguous_symbol_set.txt', 'w') as file:
    for item in ambiguous_symbol_set:
        file.write(f"{item.strip()}\n")

In [121]:
with open('ambiguous_symbol_set.txt', 'r') as file:
    # Read each line and strip the trailing newline before adding it to the set
    ambiguous_symbol_set = set(line.strip() for line in file)
len(ambiguous_symbol_set)

5050
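
Iterating over a text file yields lines that still end in a newline, so a write-then-read round trip reproduces the original set only if each line is stripped on read. The set length is the same either way, which makes the difference easy to miss. A self-contained sketch using a temporary directory and made-up symbols:

```python
import os
import tempfile

symbols = {"ASP", "smk1", "H4C8"}  # made-up example contents

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "symbols.txt")

    # Write one symbol per line
    with open(path, "w") as handle:
        for item in sorted(symbols):
            handle.write(f"{item}\n")

    # Reading without stripping keeps the trailing newline on each entry
    with open(path) as handle:
        raw_lines = set(handle)

    # Stripping each line restores the original symbols
    with open(path) as handle:
        restored = {line.strip() for line in handle}
```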

# May or may not be nonsense

In [122]:
name_ensg_notmatch = dgidb_name_set.difference(ensg_gene_symbol_set)
len(name_ensg_notmatch)

298

In [123]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(ensg_gene_symbol_set)
len(gene_claim_name_ensg_notmatch)

15029

In [124]:
cleaned_gene_claim_name_ensg_notmatch = [x for x in gene_claim_name_ensg_notmatch if str(x) != 'NaN']
len(cleaned_gene_claim_name_ensg_notmatch)

15029

In [125]:
name_hgnc_notmatch = dgidb_name_set.difference(hgnc_gene_symbol_set)
len(name_hgnc_notmatch)

25

In [126]:
name_hngc_notmatch_aacollision = name_hgnc_notmatch.intersection(hgnc_alias_alias_collision_set)
len(name_hngc_notmatch_aacollision)

0

In [127]:
cleaned_name_hgnc_notmatch = [x for x in name_hgnc_notmatch if str(x) != 'NaN']
len(cleaned_name_hgnc_notmatch)

25

In [128]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_notmatch)

14755

In [129]:
name_ncbi_notmatch = dgidb_name_set.difference(ncbi_gene_symbol_set)
len(name_ncbi_notmatch)

97

In [130]:
name_ncbi_notmatch_aacollision = name_ncbi_notmatch.intersection(ncbi_alias_alias_collision_set)
len(name_ncbi_notmatch_aacollision)

2

In [131]:
name_ncbi_hgnc_notmatch = name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(name_ncbi_hgnc_notmatch)

9

In [132]:
name_ncbi_hgnc_ensg_notmatch = name_ncbi_hgnc_notmatch.difference(ensg_gene_symbol_set)
len(name_ncbi_hgnc_ensg_notmatch)

8

In [133]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(ncbi_gene_symbol_set)
len(gene_claim_name_ncbi_notmatch)

14828

In [134]:
gene_claim_name_ncbi_hngc_notmatch = gene_claim_name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(gene_claim_name_ncbi_hngc_notmatch)

14738

In [135]:
gene_claim_name_ncbi_hngc_ensg_notmatch = gene_claim_name_ncbi_hngc_notmatch.difference(ensg_gene_symbol_set)
len(gene_claim_name_ncbi_hngc_ensg_notmatch)

14735

In [136]:
name_hgnc_match = dgidb_name_set.intersection(hgnc_gene_symbol_set)
len(name_hgnc_match)

11976

In [137]:
name_hgnc_match_aacollision = name_hgnc_match.intersection(hgnc_alias_alias_collision_set)
len(name_hgnc_match_aacollision)

8

In [138]:
name_ensg_match = dgidb_name_set.intersection(ensg_gene_symbol_set)
len(name_ensg_match)

11703

In [139]:
name_ncbi_match = dgidb_name_set.intersection(ncbi_gene_symbol_set)
len(name_ncbi_match)

11904

In [140]:
name_ncbi_match_aacollision = name_ncbi_match.intersection(ncbi_alias_alias_collision_set)
len(name_ncbi_match_aacollision)

124

In [141]:
name_ncbi_ensg_match = name_ncbi_match.intersection(ensg_gene_symbol_set)
len(name_ncbi_ensg_match)

11685

In [142]:
name_ncbi_ensg_hgnc_match = name_ncbi_ensg_match.intersection(hgnc_gene_symbol_set)
len(name_ncbi_ensg_hgnc_match)

11684

In [143]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_match)

11984

In [144]:
gene_claim_name_ensg_match = dgidb_gene_claim_name_set.intersection(ensg_gene_symbol_set)
len(gene_claim_name_ensg_match)

11710

In [145]:
gene_claim_name_ensg_aacollision_match = dgidb_gene_claim_name_set.intersection(ensg_alias_alias_collision_set)
len(gene_claim_name_ensg_aacollision_match)

23

In [146]:
gene_claim_name_hgnc_aacollision_match = dgidb_gene_claim_name_set.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_aacollision_match)

21

In [147]:
gene_claim_name_ncbi_aacollision_match = dgidb_gene_claim_name_set.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_aacollision_match)

221

In [148]:
name_ensg_aacollision_match = dgidb_name_set.intersection(ensg_alias_alias_collision_set)
len(name_ensg_aacollision_match)

9

In [149]:
name_hgnc_aacollision_match = dgidb_name_set.intersection(hgnc_alias_alias_collision_set)
len(name_hgnc_aacollision_match)

8

In [150]:
name_ncbi_aacollision_match = dgidb_name_set.intersection(ncbi_alias_alias_collision_set)
len(name_ncbi_aacollision_match)

126

In [151]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_notmatch)

14755

In [152]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_notmatch_aacollision)

13

In [153]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(ensg_gene_symbol_set)
len(gene_claim_name_ensg_notmatch)

15029

In [154]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(ncbi_gene_symbol_set)
len(gene_claim_name_ncbi_notmatch)

14828

In [155]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_notmatch_aacollision)

97

In [156]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_match)

11984

In [157]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_match_aacollision)

8

In [158]:
gene_claim_name_ncbi_match = dgidb_gene_claim_name_set.intersection(ncbi_gene_symbol_set)
len(gene_claim_name_ncbi_match)

11911

In [159]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_match_aacollision)

124

In [160]:
name_ensg_match_aacollision = name_ensg_match.intersection(ensg_alias_alias_collision_set)
len(name_ensg_match_aacollision)

8

In [161]:
len(gene_claim_name_ncbi_hngc_ensg_notmatch)

14735

In [162]:
len(name_ncbi_hgnc_ensg_notmatch)


8

In [163]:
len(dgidb_name_set)

12001

In [164]:
len(dgidb_gene_claim_name_set)

26739

In [165]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(ensg_alias_alias_collision_set)
len(name_ensg_notmatch_aacollision)

1

In [166]:
gene_claim_name_ensg_match_aacollision = gene_claim_name_ensg_match.intersection(ensg_alias_alias_collision_set)
len(gene_claim_name_ensg_match_aacollision)

8

In [167]:
gene_claim_name_ensg_notmatch_aacollision = gene_claim_name_ensg_notmatch.intersection(ensg_alias_alias_collision_set)
len(gene_claim_name_ensg_notmatch_aacollision)

15

In [168]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_notmatch_aacollision)

97

In [169]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_match_aacollision)

124

In [170]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_match_aacollision)

8

In [171]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_notmatch_aacollision)

13
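
The cells in this section repeat the same three set operations for each combination of DGIdb column and source. As a consistency check, each match and not-match pair above sums to the size of its query set (for example 11976 + 25 = 12001 for `gene_name` against HGNC). A hypothetical helper, `summarize_overlap`, sketches how the pattern could be written once. The function name and the toy sets are assumptions for illustration only.

```python
def summarize_overlap(query_symbols, primary_symbols, collision_aliases):
    """Count how a set of query symbols splits against a source's primary
    symbols, and how many in each part are also alias-alias collisions."""
    match = query_symbols & primary_symbols
    notmatch = query_symbols - primary_symbols
    return {
        "match": len(match),
        "notmatch": len(notmatch),
        "match_aacollision": len(match & collision_aliases),
        "notmatch_aacollision": len(notmatch & collision_aliases),
    }


# Placeholder symbols; not real gene data
query = {"GENE_A", "GENE_B", "ALIAS_X", "ALIAS_Y"}
primary = {"GENE_A", "GENE_B", "GENE_C"}
collisions = {"GENE_A", "ALIAS_X"}

summary = summarize_overlap(query, primary, collisions)
```

In the notebook this could be called once per source, e.g. with `dgidb_name_set`, `hgnc_gene_symbol_set`, and `hgnc_alias_alias_collision_set`.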

# Pull out claim symbols that match a primary gene symbol while the corresponding group symbol does not. Check for patterns in the modes of error


In [172]:
dgidb_gene_df['claim_primary_status'] = dgidb_gene_df['gene_claim_name'].isin(hgnc_ensg_ncbi_gene_symbol_set)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False
...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,True
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,True
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,False


In [173]:
dgidb_gene_df['claim_primary_status'].value_counts()

claim_primary_status
True     64209
False    16025
Name: count, dtype: int64

In [174]:
dgidb_gene_df['name_primary_status'] = dgidb_gene_df['gene_name'].astype(str).isin(hgnc_ensg_ncbi_gene_symbol_set)
dgidb_gene_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False
...,...,...,...,...,...,...,...,...
80229,KIT,Gene Symbol,hgnc:6342,KIT,Oncomine,v3,True,True
80230,HES1,Gene Name,hgnc:5192,HES1,NCI,14-Sep-17,True,True
80231,IRF1,Gene Symbol,hgnc:6116,IRF1,Tempus,11-Nov-18,True,True
80232,SHFM1,Gene Name,hgnc:10845,SEM1,DTC,9/2/20,False,True


In [175]:
dgidb_gene_df['name_primary_status'].value_counts()

name_primary_status
True     78074
False     2160
Name: count, dtype: int64
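
Both status columns are membership flags built with `Series.isin`. The two `value_counts` outputs above each sum to the same row total (64209 + 16025 = 78074 + 2160 = 80234). A minimal sketch on placeholder data, showing that a missing `gene_name` is simply flagged `False`:

```python
import pandas as pd

# Placeholder symbols; not real gene data
primary_symbols = {"GENE_A", "GENE_B"}

toy_df = pd.DataFrame({
    "gene_claim_name": ["GENE_A", "ALIAS_X", "GENE_B"],
    "gene_name": ["GENE_A", None, "GENE_B"],
})

# True where the value is one of the primary symbols
toy_df["claim_primary_status"] = toy_df["gene_claim_name"].isin(primary_symbols)

# A missing gene_name is not in the set, so it is flagged False
toy_df["name_primary_status"] = toy_df["gene_name"].isin(primary_symbols)
```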

In [176]:
not_primary_group_name_df =  dgidb_gene_df.loc[~dgidb_gene_df['name_primary_status']]
not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False
...,...,...,...,...,...,...,...,...
57108,NCBIGENE:2614,NCBI Gene ID,ncbigene:2614,GAPDHL17,GuideToPharmacology,2024.1,False,False
57552,CYB5P1,Gene Symbol,ncbigene:1529,CYB5P1,NCBI,20240410,False,False
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False


In [177]:
not_primary_group_name_df['name_alias_status'] = dgidb_gene_df['gene_name'].isin(hgnc_ensg_ncbi_alias_symbol_set)
not_primary_group_name_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  not_primary_group_name_df['name_alias_status'] = dgidb_gene_df['gene_name'].isin(hgnc_ensg_ncbi_alias_symbol_set)


Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status,name_alias_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False
...,...,...,...,...,...,...,...,...,...
57108,NCBIGENE:2614,NCBI Gene ID,ncbigene:2614,GAPDHL17,GuideToPharmacology,2024.1,False,False,False
57552,CYB5P1,Gene Symbol,ncbigene:1529,CYB5P1,NCBI,20240410,False,False,False
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False,True
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False,True
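
Adding a column to a frame that was produced by boolean filtering can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` right after filtering, is shown here. The names `toy_df`, `toy_alias_set`, and `toy_subset` and all of their values are made up for illustration and are not taken from the DGIdb data.

```python
import pandas as pd

# Toy stand-in for dgidb_gene_df (illustrative values only).
toy_df = pd.DataFrame(
    {
        "gene_name": ["IRF1", None, "OA1"],
        "name_primary_status": [True, False, False],
    }
)
toy_alias_set = {"OA1"}  # hypothetical alias set

# .copy() makes the filtered frame independent of toy_df,
# so adding a column to it is unambiguous.
toy_subset = toy_df.loc[~toy_df["name_primary_status"]].copy()
toy_subset["name_alias_status"] = toy_subset["gene_name"].isin(toy_alias_set)
print(toy_subset)
```

The original frame is left untouched; only the copy gains the new column.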


In [178]:
not_primary_group_name_df['name_alias_status'].value_counts()

name_alias_status
False    2154
True        6
Name: count, dtype: int64

In [179]:
print("Calmbp1" in hgnc_ensg_ncbi_alias_symbol_set)

True


In [180]:
# assign() returns a new DataFrame, so the column is not set on a slice of dgidb_gene_df
# (direct item assignment here raised pandas' SettingWithCopyWarning).
not_primary_group_name_df = not_primary_group_name_df.assign(
    claim_alias_status=not_primary_group_name_df['gene_claim_name'].isin(hgnc_ensg_ncbi_alias_symbol_set)
)
not_primary_group_name_df


Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
57108,NCBIGENE:2614,NCBI Gene ID,ncbigene:2614,GAPDHL17,GuideToPharmacology,2024.1,False,False,False,False
57552,CYB5P1,Gene Symbol,ncbigene:1529,CYB5P1,NCBI,20240410,False,False,False,False
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False,True,True
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False,True,True


In [181]:
alias_not_primary_group_name_df = not_primary_group_name_df.loc[not_primary_group_name_df['name_alias_status']]
alias_not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status
3251,USP17L,Gene Symbol,ncbigene:100862847,USP17L,dGene,27-Jun-13,False,False,True,True
23020,ACTIN,Gene Name,ncbigene:389036,ACT,TTD,2020.06.01,False,False,True,False
38737,USP17L,Gene Symbol,ncbigene:100862847,USP17L,NCBI,20240410,False,False,True,True
55006,ACT,Gene Symbol,ncbigene:389036,ACT,NCBI,20240410,False,False,True,True
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False,True,True
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False,True,True


In [182]:
# assign() returns a new DataFrame, so the column is not set on a slice of dgidb_gene_df
# (direct item assignment here raised pandas' SettingWithCopyWarning).
not_primary_group_name_df = not_primary_group_name_df.assign(
    name_null_status=not_primary_group_name_df['gene_name'].isnull()
)
not_primary_group_name_df


Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status,name_null_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
57108,NCBIGENE:2614,NCBI Gene ID,ncbigene:2614,GAPDHL17,GuideToPharmacology,2024.1,False,False,False,False,False
57552,CYB5P1,Gene Symbol,ncbigene:1529,CYB5P1,NCBI,20240410,False,False,False,False,False
66195,OA1,Gene Symbol,ncbigene:474285,OA1,NCBI,20240410,False,False,True,True,False
66196,OA1,Gene Symbol,ncbigene:474285,OA1,dGene,27-Jun-13,False,False,True,True,False


In [183]:
not_primary_group_name_df['name_null_status'].value_counts()

name_null_status
True     2144
False      16
Name: count, dtype: int64

In [184]:
null_not_primary_group_name_df = not_primary_group_name_df.loc[not_primary_group_name_df['name_null_status']]
null_not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status,name_null_status
0,NGFIBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
1,NGFIBB,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
2,DAX,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
3,REV-ERBA,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
4,COUP2,NCBI Gene Name,,,BaderLab,Feb-14,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
29008,URS00006E35E8_9606,Gene Symbol,,,GO,10-Apr-24,False,False,False,False,True
29009,URS0000EA3BA1_9606,Gene Symbol,,,GO,10-Apr-24,False,False,False,False,True
29198,TRYPTASE_B2_HUMAN,Gene Symbol,,,GO,10-Apr-24,False,False,False,False,True
29199,TRYPTASE_B1_HUMAN,Gene Symbol,,,GO,10-Apr-24,False,False,False,False,True


In [185]:
null_not_primary_group_name_df['gene_claim_name'].value_counts()

gene_claim_name
ATPF                 2
RPSK                 2
RPSE                 2
RPSD                 2
RPSC                 2
                    ..
YCEI                 1
CEFE                 1
NFSB                 1
LIGA                 1
TRYPTASE_B1_HUMAN    1
Name: count, Length: 2053, dtype: int64

In [186]:
null_not_primary_group_name_df.loc[null_not_primary_group_name_df['claim_alias_status']]

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status,name_null_status


In [187]:
other_not_primary_group_name_df = not_primary_group_name_df.loc[~not_primary_group_name_df['name_alias_status'] & ~not_primary_group_name_df['name_null_status']]
other_not_primary_group_name_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status,name_alias_status,claim_alias_status,name_null_status
22269,MELANOCORTIN RECEPTOR,Gene Name,ncbigene:359995,mCR,TTD,2020.06.01,False,False,False,False,False
25653,NCBIGENE:697,NCBI Gene ID,ncbigene:697,BTNL1,GuideToPharmacology,2024.1,False,False,False,False,False
26457,NCBIGENE:1529,NCBI Gene ID,ncbigene:1529,CYB5P1,GuideToPharmacology,2024.1,False,False,False,False,False
26608,NCBIGENE:499,NCBI Gene ID,ncbigene:499,ATP5A2,GuideToPharmacology,2024.1,False,False,False,False,False
54783,mCR,Gene Symbol,ncbigene:359995,mCR,NCBI,20240410,False,False,False,False,False
56710,BTNL1,Gene Symbol,ncbigene:697,BTNL1,NCBI,20240410,False,False,False,False,False
57107,GAPDHL17,Gene Symbol,ncbigene:2614,GAPDHL17,NCBI,20240410,False,False,False,False,False
57108,NCBIGENE:2614,NCBI Gene ID,ncbigene:2614,GAPDHL17,GuideToPharmacology,2024.1,False,False,False,False,False
57552,CYB5P1,Gene Symbol,ncbigene:1529,CYB5P1,NCBI,20240410,False,False,False,False,False
74562,ATP5A2,Gene Symbol,ncbigene:499,ATP5A2,NCBI,20240410,False,False,False,False,False
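
The three subsets above (null gene name: 2144 rows, gene name is an alias: 6 rows, neither: 10 rows) partition the 2160 non-primary rows. As a rough sketch, the same split can be expressed as a single categorical column, which makes the partition easy to check with one `value_counts()`. The frame `toy` and the column name `name_category` are hypothetical and only stand in for `not_primary_group_name_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for not_primary_group_name_df (illustrative values only).
toy = pd.DataFrame(
    {
        "gene_name": [None, "OA1", "mCR"],
        "name_alias_status": [False, True, False],
        "name_null_status": [True, False, False],
    }
)

# np.select picks the label of the first condition that is True for each row;
# rows matching neither condition fall through to the default.
toy["name_category"] = np.select(
    [toy["name_null_status"], toy["name_alias_status"]],
    ["null", "alias"],
    default="other",
)

counts = toy["name_category"].value_counts()
print(counts)
```

Because every row receives exactly one label, the counts always sum to the number of rows.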


In [188]:
claim_true_name_false_df = dgidb_gene_df.loc[dgidb_gene_df['claim_primary_status'] & ~dgidb_gene_df['name_primary_status']]
claim_true_name_false_df

Unnamed: 0,gene_claim_name,nomenclature,concept_id,gene_name,source_db_name,source_db_version,claim_primary_status,name_primary_status


In [189]:
len(claim_true_name_false_df)

0
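
The empty result means no row has a primary `gene_claim_name` paired with a non-primary `gene_name`. One way to keep that property from silently changing when the data is refreshed is to state it as an assertion. This is only a sketch on a made-up frame `toy`, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for dgidb_gene_df (illustrative values only).
toy = pd.DataFrame(
    {
        "claim_primary_status": [True, False, False],
        "name_primary_status": [True, True, False],
    }
)

# Rows where the claim name is primary but the grouped gene name is not.
violations = toy.loc[toy["claim_primary_status"] & ~toy["name_primary_status"]]

# Fail loudly if any such row ever appears.
assert violations.empty, f"{len(violations)} rows have a primary claim but a non-primary name"
```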

# Creating a YAML file for each collision

In [190]:
# merged_alias_overlap_df_2

In [191]:
# data = {}

# for row in merged_alias_overlap_df_2.itertuples():
#     print(row)
#     break

In [192]:
# import yaml
# import os

# folder_path = 'new_alias-alias_collision_records'
# os.makedirs(folder_path, exist_ok=True)

# data = []

# for row in merged_alias_overlap_df_2.itertuples():
    
#     collision_record = {
#         "collision_symbol": row.alias_symbol,
#     }

#     collision_group = []

#     len_gene_symbols = len(row.gene_symbol)
#     ensg_ids = [r.strip() for r in row.ENSG_ID.split(",")]
#     len_ensg_ids = len(ensg_ids)

#     # if len_gene_symbols != len_ensg_ids:
#     #     print(row)
#     for i in range(0, len_gene_symbols):
#         collision_group_item = {
#             "gene_symbol": row.gene_symbol[i],
#             "ensg_id": ensg_ids[i].upper()
#         }
#         collision_group.append(collision_group_item)

#     collision_record["collision_group"] = collision_group
#     data.append(collision_record)

#     file_path = os.path.join(folder_path, f"{(row.alias_symbol.replace('/', '_'))}_collision_record.yaml")

#     with open(file_path, "w") as wf:
#         yaml.dump(collision_record, wf, default_flow_style=False)

# data
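
The record-assembly step of the commented-out cell can be sketched as a small pure-Python helper. It assumes, as that cell implies, that `gene_symbol` holds a list and `ENSG_ID` holds a comma-separated string. The function names `build_collision_record` and `collision_file_name` and the sample values are hypothetical. Serialization is left out here; the cell's own `yaml.dump` call would take the returned dict.

```python
def build_collision_record(alias_symbol, gene_symbols, ensg_id_field):
    """Pair each gene symbol with its ENSG ID for one shared alias."""
    # Split the comma-separated ID string and normalize to upper case.
    ensg_ids = [part.strip().upper() for part in ensg_id_field.split(",")]

    # Refuse to pair up lists of different lengths instead of mis-aligning them.
    if len(gene_symbols) != len(ensg_ids):
        raise ValueError(
            f"{alias_symbol}: {len(gene_symbols)} gene symbols "
            f"but {len(ensg_ids)} ENSG IDs"
        )

    return {
        "collision_symbol": alias_symbol,
        "collision_group": [
            {"gene_symbol": symbol, "ensg_id": ensg_id}
            for symbol, ensg_id in zip(gene_symbols, ensg_ids)
        ],
    }


def collision_file_name(alias_symbol):
    """File name for a record; '/' is replaced so it is not read as a path separator."""
    return f"{alias_symbol.replace('/', '_')}_collision_record.yaml"


# Made-up example values, not real gene records.
record = build_collision_record(
    "ABC1", ["GENE_A", "GENE_B"], "ensg00000000001, ensg00000000002"
)
print(record)
print(collision_file_name("AB/C1"))
```

Raising on a length mismatch replaces the commented-out `print(row)` check, so a malformed row stops the run rather than producing a truncated or misaligned record.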