**Table of contents**<a id='toc0_'></a>    
- [ENSG](#toc1_)    
    - [How many total unique gene records are there](#toc1_1_1_)    
    - [Drop genes with no aliases](#toc1_1_2_)    
    - [Make each row in alias_symbol a set:](#toc1_1_3_)    
    - [Explode the alias sets so that it is one per row](#toc1_1_4_)    
    - [How many unique aliases are there](#toc1_1_5_)    
    - [Remove the duplicate instances of a primary gene symbol-alias pair](#toc1_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc1_1_7_)    
    - [Sort alias symbols alphabetically](#toc1_1_8_)    
    - [Number of records with an alias that is shared](#toc1_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc1_1_10_)    
      - [Save as csv](#toc1_1_10_1_)    
    - [Put columns in different order to emphasize alias symbols instead of gene records](#toc1_1_11_)    
    - [Merge rows with matching alias symbols](#toc1_1_12_)    
- [HGNC](#toc2_)    
    - [How many total unique gene records are there](#toc2_1_1_)    
    - [Drop genes with no aliases](#toc2_1_2_)    
    - [Make each row in alias_symbol a set:](#toc2_1_3_)    
    - [Explode the alias sets so that it is one per row](#toc2_1_4_)    
    - [How many total unique aliases are there](#toc2_1_5_)    
    - [Remove the duplicate instances of a primary gene symbol-alias pair](#toc2_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc2_1_7_)    
    - [Sort alias symbols alphabetically](#toc2_1_8_)    
    - [Number of records with an alias that is shared](#toc2_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc2_1_10_)    
      - [Save as csv](#toc2_1_10_1_)    
    - [Put columns in different order to emphasize alias symbols instead of gene records](#toc2_1_11_)    
    - [Merge rows with matching alias symbols](#toc2_1_12_)    
- [NCBI Info](#toc3_)    
    - [How many total unique gene records are there](#toc3_1_1_)    
    - [Drop genes with no aliases](#toc3_1_2_)    
    - [Make each row in alias_symbol a set:](#toc3_1_3_)    
    - [Explode the alias sets so that it is one per row](#toc3_1_4_)    
    - [How many unique aliases are there](#toc3_1_5_)    
    - [Remove the duplicate instances of a primary gene symbol-alias pair](#toc3_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc3_1_7_)    
    - [Sort alias symbols alphabetically](#toc3_1_8_)    
    - [Number of records with an alias that is shared](#toc3_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc3_1_10_)    
    - [Put columns in different order to emphasize alias symbols instead of gene records](#toc3_1_11_)    
    - [Merge rows with matching alias symbols](#toc3_1_12_)    
- [Merge to create Alias Overlap Table 1 - Gene Symbol](#toc4_)    
- [Merge to create Alias Overlap Table 2 - Alias Symbol](#toc5_)    
- [Common Records with Collisions](#toc6_)    
- [How many gene concept-alias relationships are there?](#toc7_)    
  - [Per Source](#toc7_1_)    
  - [Between All Sources](#toc7_2_)    
    - [Remove duplicate concept-alias pairs](#toc7_2_1_)    


In [1630]:
import pandas as pd
import numpy as np
import plotly.express as px

# <a id='toc1_'></a>[ENSG](#toc0_)

In [1631]:
mini_ensg_df = pd.read_csv('../created_files/mini_ensg_df.csv', dtype={'HGNC_ID': pd.Int64Dtype(), 'NCBI_ID': pd.Int64Dtype()})
mini_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TM4SF6, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"BRICD4, CHM1L, MYODULIN, TEM, TENDIN"
2,ENSG00000000419,DPM1,3005,8813,"CDGIE, MPDS"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"APOLO1, C1ORF112, FLIP, FLJ10706, MEICA1"
...,...,...,...,...,...
75829,ENSG00000293596,,,105372654,
75830,ENSG00000293597,LINC00970,48730,101978719,
75831,ENSG00000293599,,,,
75832,ENSG00000293600,,,131768270,


### <a id='toc1_1_1_'></a>[How many total unique gene records are there](#toc0_)

By ENSG ID

In [1632]:
ensg_gene_id_set = set(mini_ensg_df['ENSG_ID'])
len(ensg_gene_id_set)

70611

By gene symbol

In [1633]:
ensg_gene_symbol_set = set(mini_ensg_df['gene_symbol'])
len(ensg_gene_symbol_set)

41068
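
The ENSG table includes rows with no gene symbol, so the `set` count above includes the missing value as one element. A minimal sketch of the difference between `set` and `nunique()`, using a made-up frame (`toy_df` and its values are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for mini_ensg_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "ENSG_ID": ["ENSG1", "ENSG2", "ENSG3", "ENSG4"],
    "gene_symbol": ["TSPAN6", "RFLNB", "RFLNB", np.nan],
})

# set() keeps the missing value as one extra element.
count_with_nan = len(set(toy_df["gene_symbol"]))

# nunique() skips missing values by default (dropna=True).
count_without_nan = toy_df["gene_symbol"].nunique()

print(count_with_nan, count_without_nan)
```

Which count is appropriate depends on whether a record without a symbol should be treated as a gene symbol of its own.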

### <a id='toc1_1_2_'></a>[Drop genes with no aliases](#toc0_)

In [1634]:
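# str.contains returns NaN for missing aliases, so `== False` drops those rows; .copy() avoids SettingWithCopyWarning later.
mini_ensg_df = mini_ensg_df[mini_ensg_df["alias_symbol"].str.contains("NaN") == False].copy()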
mini_ensg_df



Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TM4SF6, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"BRICD4, CHM1L, MYODULIN, TEM, TENDIN"
2,ENSG00000000419,DPM1,3005,8813,"CDGIE, MPDS"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"APOLO1, C1ORF112, FLIP, FLJ10706, MEICA1"
...,...,...,...,...,...
75796,ENSG00000293549,HCG22,,285834,PBMUCL2
75798,ENSG00000293551,PRAMEF22,34393,653606,PRAMEF3L
75801,ENSG00000293555,FAM169BP,26835,283777,"FAM169B, FLJ39743, KIAA0888L"
75828,ENSG00000293595,SLC25A3P1,26869,163742,FLJ40434
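
A possibly clearer way to express the same filter is `dropna` on the alias column. This is a sketch on a made-up frame (`toy_df` and `with_alias_df` are hypothetical names), not a drop-in replacement verified against the real file:

```python
import numpy as np
import pandas as pd

# Toy frame: one gene has no alias.
toy_df = pd.DataFrame({
    "gene_symbol": ["TSPAN6", "LINC00970", "DPM1"],
    "alias_symbol": ["T245, TM4SF6", np.nan, "CDGIE, MPDS"],
})

# Keep only rows that have an alias; .copy() makes the result an
# independent frame, so later column assignments do not trigger
# SettingWithCopyWarning.
with_alias_df = toy_df.dropna(subset=["alias_symbol"]).copy()

# Adding a column to the copy leaves the original frame untouched.
with_alias_df["n_chars"] = with_alias_df["alias_symbol"].str.len()
print(with_alias_df)
```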


### <a id='toc1_1_3_'></a>[Make each row in alias_symbol a set:](#toc0_)
    convert to a list 
    make a set

In [1635]:
mini_ensg_df['alias_symbol'] = mini_ensg_df['alias_symbol'].astype(str)
mini_ensg_df['alias_symbol'] = [x.split(',') for x in mini_ensg_df.alias_symbol]
mini_ensg_df['alias_symbol']=np.where(mini_ensg_df.alias_symbol=='','',mini_ensg_df.alias_symbol.map(set))
mini_ensg_df.head(1)


Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"{ TM4SF6, T245, TSPAN-6}"
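
The displayed set `{ TM4SF6, T245, TSPAN-6}` suggests that splitting on `','` leaves the space after each comma attached to the alias. If so, ` FLIP` and `FLIP` would be treated as different aliases. A minimal sketch of splitting with whitespace stripped, on a single toy row (whether the real file needs this is an assumption based on that output):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "gene_symbol": ["TSPAN6"],
    "alias_symbol": ["T245, TM4SF6, TSPAN-6"],
})

# Plain split on ',' keeps the space that follows each comma.
raw_sets = toy_df["alias_symbol"].map(lambda s: set(s.split(",")))

# Stripping each piece removes that space.
clean_sets = toy_df["alias_symbol"].map(
    lambda s: {alias.strip() for alias in s.split(",")}
)

print(raw_sets.iloc[0])
print(clean_sets.iloc[0])
```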


### <a id='toc1_1_4_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [1636]:
mini_ensg_df = mini_ensg_df.explode('alias_symbol')
mini_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,TM4SF6
0,ENSG00000000003,TSPAN6,11858,7105,T245
0,ENSG00000000003,TSPAN6,11858,7105,TSPAN-6
1,ENSG00000000005,TNMD,17757,64102,TEM
1,ENSG00000000005,TNMD,17757,64102,MYODULIN
...,...,...,...,...,...
75801,ENSG00000293555,FAM169BP,26835,283777,KIAA0888L
75828,ENSG00000293595,SLC25A3P1,26869,163742,FLJ40434
75833,ENSG00000293604,ORAI1,25896,84876,FLJ14466
75833,ENSG00000293604,ORAI1,25896,84876,TMEM142A


In [1637]:
mini_ensg_df.loc[mini_ensg_df['alias_symbol'] == "CFM1" ]

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
15693,ENSG00000183688,RFLNB,28705,359845,CFM1
66337,ENSG00000283979,RFLNB,28705,359845,CFM1


### <a id='toc1_1_5_'></a>[How many unique aliases are there](#toc0_)

In [1638]:
ensg_alias_symbol_set = set(mini_ensg_df['alias_symbol'])
ensg_alias_len = len(ensg_alias_symbol_set)
ensg_alias_len

55938

### <a id='toc1_1_6_'></a>[Remove the duplicate instances of a primary gene symbol-alias pair](#toc0_)

Example:

In [1639]:
mini_ensg_df.loc[mini_ensg_df['gene_symbol'] == "RFLNB" ]

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
15693,ENSG00000183688,RFLNB,28705,359845,CFM1
15693,ENSG00000183688,RFLNB,28705,359845,MGC45871
15693,ENSG00000183688,RFLNB,28705,359845,FAM101B
15693,ENSG00000183688,RFLNB,28705,359845,REFILINB
66337,ENSG00000283979,RFLNB,28705,359845,CFM1
66337,ENSG00000283979,RFLNB,28705,359845,MGC45871
66337,ENSG00000283979,RFLNB,28705,359845,FAM101B
66337,ENSG00000283979,RFLNB,28705,359845,REFILINB


In [1640]:
mini_ensg_df = mini_ensg_df.drop_duplicates(subset=['gene_symbol', 'alias_symbol'], keep='first')

In [1641]:
mini_ensg_df.loc[mini_ensg_df['gene_symbol'] == "RFLNB" ]

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
15693,ENSG00000183688,RFLNB,28705,359845,CFM1
15693,ENSG00000183688,RFLNB,28705,359845,MGC45871
15693,ENSG00000183688,RFLNB,28705,359845,FAM101B
15693,ENSG00000183688,RFLNB,28705,359845,REFILINB
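
The effect of `drop_duplicates` with a two-column `subset` can be seen on a cut-down version of the RFLNB rows (only two of the four aliases are kept here to keep the sketch short):

```python
import pandas as pd

# Two ENSG IDs map to the same gene symbol and carry the same aliases.
toy_df = pd.DataFrame({
    "ENSG_ID": ["ENSG00000183688", "ENSG00000183688",
                "ENSG00000283979", "ENSG00000283979"],
    "gene_symbol": ["RFLNB"] * 4,
    "alias_symbol": ["CFM1", "FAM101B", "CFM1", "FAM101B"],
})

# Rows are compared on (gene_symbol, alias_symbol) only; ENSG_ID is ignored,
# so the second ENSG ID's rows count as repeats and are dropped.
deduped_df = toy_df.drop_duplicates(
    subset=["gene_symbol", "alias_symbol"], keep="first"
)
print(deduped_df)
```

Because `keep='first'` retains the first occurrence, the surviving rows carry only the first ENSG ID for that symbol.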


In [1642]:
mini_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,TM4SF6
0,ENSG00000000003,TSPAN6,11858,7105,T245
0,ENSG00000000003,TSPAN6,11858,7105,TSPAN-6
1,ENSG00000000005,TNMD,17757,64102,TEM
1,ENSG00000000005,TNMD,17757,64102,MYODULIN
...,...,...,...,...,...
75744,ENSG00000293496,TMED11P,,100379220,p24a1
75744,ENSG00000293496,TMED11P,,100379220,p24alpha1
75752,ENSG00000293504,NPY6R,,4888,Y2B
75790,ENSG00000293543,DUSP13A,56772,128854680,BEDP


In [1643]:
ensg_concept_alias_pair_count = len(mini_ensg_df)

### <a id='toc1_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [1644]:
mini_ensg_df['alias_duplicates'] = mini_ensg_df.duplicated(subset= 'alias_symbol', keep=False)
aa_collision_ensg_df = mini_ensg_df[mini_ensg_df['alias_duplicates'] == True]
aa_collision_ensg_df = aa_collision_ensg_df.drop(['alias_duplicates'], axis=1)
aa_collision_ensg_df.head(5)



Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
4,ENSG00000000460,FIRRM,25565,55732,FLIP
8,ENSG00000001084,GCLC,4311,2729,GCS
12,ENSG00000001497,LAS1L,25726,81887,LAS1
13,ENSG00000001561,ENPP4,3359,22875,AP3AASE
15,ENSG00000001626,CFTR,1884,1080,MRP7
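
The `keep=False` argument is what marks every row of a shared alias, not just the repeats. A small sketch with illustrative rows (`toy_df` is a stand-in, not the real table):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "gene_symbol": ["FIRRM", "CFLAR", "TSPAN6"],
    "alias_symbol": ["FLIP", "FLIP", "T245"],
})

# keep=False flags all rows whose alias appears more than once.
is_shared = toy_df.duplicated(subset="alias_symbol", keep=False)

# keep='first' would leave the first occurrence unflagged.
is_repeat_only = toy_df.duplicated(subset="alias_symbol", keep="first")

collision_df = toy_df[is_shared]
print(collision_df)
```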


### <a id='toc1_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [1645]:
aa_collision_ensg_df = aa_collision_ensg_df.sort_values('alias_symbol')
aa_collision_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
8000,ENSG00000140379,BCL2A1,991,597,ACC1
58193,ENSG00000275176,ACACA,84,31,ACC1
1354,ENSG00000076555,ACACB,85,32,ACC2
8000,ENSG00000140379,BCL2A1,991,597,ACC2
2085,ENSG00000097021,ACOT7,24157,11332,ACT
...,...,...,...,...,...
2565,ENSG00000101557,USP14,12612,9097,TGT
6615,ENSG00000132388,UBE2G1,12482,7326,UBC7
15955,ENSG00000184787,UBE2G2,12483,7327,UBC7
11780,ENSG00000165828,PRAP1,23304,118471,UPA


In [1646]:
aa_collision_ensg_df.to_csv('../created_files/aa_collision_ensg_df.csv', index=True)

In [1647]:
#ensg_CD158b_alias_count_df.to_csv('../hgnc_CD158b_alias_count_df.csv')

### <a id='toc1_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [1648]:
ensg_alias_alias_collision_primary_symbol_set = set(aa_collision_ensg_df['gene_symbol'])
len(ensg_alias_alias_collision_primary_symbol_set)

2224

In [1649]:
aa_collision_ensg_df['source'] = 'ENSG'
aa_collision_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol,source
8000,ENSG00000140379,BCL2A1,991,597,ACC1,ENSG
58193,ENSG00000275176,ACACA,84,31,ACC1,ENSG
1354,ENSG00000076555,ACACB,85,32,ACC2,ENSG
8000,ENSG00000140379,BCL2A1,991,597,ACC2,ENSG
2085,ENSG00000097021,ACOT7,24157,11332,ACT,ENSG
...,...,...,...,...,...,...
2565,ENSG00000101557,USP14,12612,9097,TGT,ENSG
6615,ENSG00000132388,UBE2G1,12482,7326,UBC7,ENSG
15955,ENSG00000184787,UBE2G2,12483,7327,UBC7,ENSG
11780,ENSG00000165828,PRAP1,23304,118471,UPA,ENSG


In [1650]:
aa_collision_ensg_df.loc[aa_collision_ensg_df['alias_symbol'] == "RN5S3" ]

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol,source


### <a id='toc1_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [1651]:
aa_collision_ensg_count_df = aa_collision_ensg_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
aa_collision_ensg_count_df = aa_collision_ensg_count_df.reset_index()
aa_collision_ensg_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
aa_collision_ensg_count_df = aa_collision_ensg_count_df.sort_values('num_gene_records', ascending=False)
aa_collision_ensg_count_df.head(5)

Unnamed: 0,alias_symbol,num_gene_records
1091,MT1,10
1060,HOX1,10
1061,HOX2,9
411,P40,9
392,P18,8


In [1652]:
ensg_alias_alias_collision_set = set(aa_collision_ensg_count_df['alias_symbol'])
len(ensg_alias_alias_collision_set)

1149
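
A sketch of the same kind of count using `groupby(...).size()`, which names the count column directly instead of renaming column `0`. The rows are illustrative, and `GENE_X` is a hypothetical third gene added so the counts differ:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "gene_symbol": ["BCL2A1", "ACACA", "GENE_X", "UBE2G1", "UBE2G2"],
    "alias_symbol": ["ACC1", "ACC1", "ACC1", "UBC7", "UBC7"],
})

# Count rows per alias and sort so the most widely shared alias comes first.
count_df = (
    toy_df.groupby("alias_symbol")
    .size()
    .reset_index(name="num_gene_records")
    .sort_values("num_gene_records", ascending=False)
)
print(count_df)
```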

In [1653]:
aa_collision_ensg_count_df.to_csv('../created_files/aa_collision_ensg_count_df.csv', index=True)

In [1654]:
aa_collision_ensg_distribution_df = aa_collision_ensg_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
aa_collision_ensg_distribution_df = aa_collision_ensg_distribution_df.reset_index()
aa_collision_ensg_distribution_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
aa_collision_ensg_distribution_df['percent_alias_symbol'] = ((aa_collision_ensg_distribution_df['num_alias_symbol'] / ensg_alias_len) * 100)
aa_collision_ensg_distribution_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,980,1.75194
1,3,117,0.20916
2,4,31,0.055418
3,5,5,0.008938
4,6,10,0.017877
5,7,1,0.001788
6,8,1,0.001788
7,9,2,0.003575
8,10,2,0.003575
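
As a check on the arithmetic, the percentages can be recomputed from the counts in this table and the unique alias total (55938). The per-bin counts also sum to 1149, the number of shared aliases found earlier. The sketch below only re-derives numbers already shown in this notebook:

```python
import pandas as pd

# Values copied from the outputs in this notebook.
ensg_alias_len = 55938
distribution_df = pd.DataFrame({
    "num_gene_records": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "num_alias_symbol": [980, 117, 31, 5, 10, 1, 1, 2, 2],
})

# Percentage of all unique aliases that fall in each bin.
distribution_df["percent_alias_symbol"] = (
    distribution_df["num_alias_symbol"] / ensg_alias_len * 100
)

# Total number of aliases shared by two or more gene records.
total_shared_aliases = int(distribution_df["num_alias_symbol"].sum())
total_percent = total_shared_aliases / ensg_alias_len * 100

print(distribution_df)
print(total_shared_aliases, round(total_percent, 3))
```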


In [1655]:
ensg_alias_count_histogram_df = aa_collision_ensg_distribution_df.drop('num_alias_symbol', axis=1)
ensg_alias_count_histogram_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,1.75194
1,3,0.20916
2,4,0.055418
3,5,0.008938
4,6,0.017877
5,7,0.001788
6,8,0.001788
7,9,0.003575
8,10,0.003575


In [1656]:
#px.bar(ensg_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

In [1657]:
aa_collision_ensg_distribution_df.to_csv('../created_files/aa_collision_ensg_distribution_df.csv', index=True)

#### <a id='toc1_1_10_1_'></a>[Save as csv](#toc0_)

In [1658]:
#mini_ensg_df_explode.to_csv('../ensg_alias_overlap.csv', index=False)

### <a id='toc1_1_11_'></a>[Put columns in different order to emphasize alias symbols instead of gene records](#toc0_)

In [1659]:
aa_collision_ensg_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol,source
8000,ENSG00000140379,BCL2A1,991,597,ACC1,ENSG
58193,ENSG00000275176,ACACA,84,31,ACC1,ENSG
1354,ENSG00000076555,ACACB,85,32,ACC2,ENSG
8000,ENSG00000140379,BCL2A1,991,597,ACC2,ENSG
2085,ENSG00000097021,ACOT7,24157,11332,ACT,ENSG
...,...,...,...,...,...,...
2565,ENSG00000101557,USP14,12612,9097,TGT,ENSG
6615,ENSG00000132388,UBE2G1,12482,7326,UBC7,ENSG
15955,ENSG00000184787,UBE2G2,12483,7327,UBC7,ENSG
11780,ENSG00000165828,PRAP1,23304,118471,UPA,ENSG


In [1660]:
aa_collision_ensg_df_2 = aa_collision_ensg_df[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
aa_collision_ensg_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
8000,ACC1,ENSG00000140379,BCL2A1,ENSG
58193,ACC1,ENSG00000275176,ACACA,ENSG
1354,ACC2,ENSG00000076555,ACACB,ENSG
8000,ACC2,ENSG00000140379,BCL2A1,ENSG
2085,ACT,ENSG00000097021,ACOT7,ENSG
...,...,...,...,...
2565,TGT,ENSG00000101557,USP14,ENSG
6615,UBC7,ENSG00000132388,UBE2G1,ENSG
15955,UBC7,ENSG00000184787,UBE2G2,ENSG
11780,UPA,ENSG00000165828,PRAP1,ENSG


### <a id='toc1_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [1661]:
aa_collision_ensg_df_2 = aa_collision_ensg_df_2.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [1662]:
aa_collision_ensg_df_2 = aa_collision_ensg_df_2.astype(str)
aa_collision_ensg_df_2



Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
8000,ACC1,ENSG00000140379,BCL2A1,ENSG
58193,ACC1,ENSG00000275176,ACACA,ENSG
1354,ACC2,ENSG00000076555,ACACB,ENSG
8000,ACC2,ENSG00000140379,BCL2A1,ENSG
2085,ACT,ENSG00000097021,ACOT7,ENSG
...,...,...,...,...
2565,TGT,ENSG00000101557,USP14,ENSG
6615,UBC7,ENSG00000132388,UBE2G1,ENSG
15955,UBC7,ENSG00000184787,UBE2G2,ENSG
11780,UPA,ENSG00000165828,PRAP1,ENSG


In [1663]:
aa_collision_ensg_df_2 = aa_collision_ensg_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
aa_collision_ensg_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,ACC1,"ENSG00000140379, ENSG00000275176","BCL2A1, ACACA",ENSG
1,ACC2,"ENSG00000076555, ENSG00000140379","ACACB, BCL2A1",ENSG
2,ACT,"ENSG00000097021, ENSG00000196136","ACOT7, SERPINA3",ENSG
3,AGPAT9,"ENSG00000138678, ENSG00000153395","GPAT3, LPCAT1",ENSG
4,AIP1,"ENSG00000187391, ENSG00000136848","MAGI2, DAB2IP",ENSG
...,...,...,...,...
1144,TCRBV15S1,"ENSG00000211750, ENSG00000276819","TRBV24-1, TRBV15",ENSG
1145,TCRGV5P,"ENSG00000226212, ENSG00000228668","TRGV6, TRGV5P",ENSG
1146,TGT,"ENSG00000213339, ENSG00000101557","QTRT1, USP14",ENSG
1147,UBC7,"ENSG00000132388, ENSG00000184787","UBE2G1, UBE2G2",ENSG
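
How the `groupby` plus `', '.join` step collapses rows can be seen on four rows taken from the sorted table. Every value must already be a string for `join` to work, which is why the frame is cast to `str` first:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "alias_symbol": ["ACC1", "ACC1", "UBC7", "UBC7"],
    "ENSG_ID": ["ENSG00000140379", "ENSG00000275176",
                "ENSG00000132388", "ENSG00000184787"],
    "gene_symbol": ["BCL2A1", "ACACA", "UBE2G1", "UBE2G2"],
    "source": ["ENSG"] * 4,
})

# One output row per alias; IDs and symbols are joined in row order.
merged_df = (
    toy_df.groupby("alias_symbol")
    .agg({"ENSG_ID": ", ".join, "gene_symbol": ", ".join, "source": "first"})
    .reset_index()
)
print(merged_df)
```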


# <a id='toc2_'></a>[HGNC](#toc0_)

In [1664]:
mini_hgnc_df = pd.read_csv('../created_files/mini_hgnc_df.csv', dtype={'HGNC_ID': pd.Int64Dtype(), 'NCBI_ID': pd.Int64Dtype()})
mini_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"myodulin, ChM1L, tendin, TEM, BRICD4"
2,ENSG00000000419,DPM1,3005,8813,"MPDS, CDGIE"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"FLJ10706, Apolo1, FLIP, MEICA1"
...,...,...,...,...,...
45641,,ZNF97,13173,,
45642,,ZNFP1,13181,,
45643,,ZPAXP,51635,105373450,ZPX1P
45644,,ZRK,13193,,


### <a id='toc2_1_1_'></a>[How many total unique gene records are there](#toc0_)

By HGNC ID

In [1665]:
hgnc_gene_id_set = set(mini_hgnc_df['HGNC_ID'])
len(hgnc_gene_id_set)

45646

By gene symbol

In [1666]:
hgnc_gene_symbol_set = set(mini_hgnc_df['gene_symbol'])
len(hgnc_gene_symbol_set)

45646

### <a id='toc2_1_2_'></a>[Drop genes with no aliases](#toc0_)

In [1667]:
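# str.contains returns NaN for missing aliases, so `== False` drops those rows; .copy() avoids SettingWithCopyWarning later.
mini_hgnc_df = mini_hgnc_df[mini_hgnc_df["alias_symbol"].str.contains("NaN") == False].copy()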
mini_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"T245, TSPAN-6"
1,ENSG00000000005,TNMD,17757,64102,"myodulin, ChM1L, tendin, TEM, BRICD4"
2,ENSG00000000419,DPM1,3005,8813,"MPDS, CDGIE"
3,ENSG00000000457,SCYL3,19285,57147,"PACE-1, PACE1"
4,ENSG00000000460,FIRRM,25565,55732,"FLJ10706, Apolo1, FLIP, MEICA1"
...,...,...,...,...,...
45632,,ZNF78L2,13152,,pT3
45636,,ZNF88,13163,,HPF8
45638,,ZNF94,13170,,F11465
45643,,ZPAXP,51635,105373450,ZPX1P


### <a id='toc2_1_3_'></a>[Make each row in alias_symbol a set:](#toc0_)
    convert to a list 
    make a set

In [1668]:
mini_hgnc_df['alias_symbol'] = mini_hgnc_df['alias_symbol'].astype(str)
mini_hgnc_df['alias_symbol'] = [x.split(',') for x in mini_hgnc_df.alias_symbol]
mini_hgnc_df['alias_symbol']=np.where(mini_hgnc_df.alias_symbol=='','',mini_hgnc_df.alias_symbol.map(set))
mini_hgnc_df


Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,"{T245, TSPAN-6}"
1,ENSG00000000005,TNMD,17757,64102,"{ TEM, BRICD4, ChM1L, tendin, myodulin}"
2,ENSG00000000419,DPM1,3005,8813,"{MPDS, CDGIE}"
3,ENSG00000000457,SCYL3,19285,57147,"{ PACE1, PACE-1}"
4,ENSG00000000460,FIRRM,25565,55732,"{ Apolo1, FLJ10706, FLIP, MEICA1}"
...,...,...,...,...,...
45632,,ZNF78L2,13152,,{pT3}
45636,,ZNF88,13163,,{HPF8}
45638,,ZNF94,13170,,{F11465}
45643,,ZPAXP,51635,105373450,{ZPX1P}


### <a id='toc2_1_4_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [1669]:
mini_hgnc_df = mini_hgnc_df.explode('alias_symbol')
mini_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,T245
0,ENSG00000000003,TSPAN6,11858,7105,TSPAN-6
1,ENSG00000000005,TNMD,17757,64102,TEM
1,ENSG00000000005,TNMD,17757,64102,BRICD4
1,ENSG00000000005,TNMD,17757,64102,ChM1L
...,...,...,...,...,...
45638,,ZNF94,13170,,F11465
45643,,ZPAXP,51635,105373450,ZPX1P
45645,,ZWINTAS,13196,,MPP5
45645,,ZWINTAS,13196,,MPHOSPH5


### <a id='toc2_1_5_'></a>[How many total unique aliases are there](#toc0_)

In [1670]:
hgnc_alias_symbol_set = set(mini_hgnc_df['alias_symbol'])
hgnc_alias_len = len(hgnc_alias_symbol_set)
hgnc_alias_len

43770
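
The HGNC aliases are mixed-case (`Apolo1`, `myodulin`), while the ENSG file shows the same aliases in upper case (`APOLO1`, `MYODULIN`). A sketch of how exact and case-insensitive set comparisons differ. Whether aliases should be compared case-insensitively is an assumption here, not something this notebook decides:

```python
# Toy alias sets for one gene, modelled on the FIRRM rows.
ensg_aliases = {"APOLO1", "FLIP", "MEICA1"}
hgnc_aliases = {"Apolo1", "FLIP", "MEICA1"}

# Exact comparison treats APOLO1 and Apolo1 as different aliases.
exact_overlap = ensg_aliases & hgnc_aliases

# Upper-casing both sides first makes the comparison case-insensitive.
upper_overlap = {a.upper() for a in ensg_aliases} & {a.upper() for a in hgnc_aliases}

print(sorted(exact_overlap))
print(sorted(upper_overlap))
```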

### <a id='toc2_1_6_'></a>[Remove the duplicate instances of a primary gene symbol-alias pair](#toc0_)

In [1671]:
mini_hgnc_df = mini_hgnc_df.drop_duplicates(subset=['gene_symbol', 'alias_symbol'], keep='first')
mini_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
0,ENSG00000000003,TSPAN6,11858,7105,T245
0,ENSG00000000003,TSPAN6,11858,7105,TSPAN-6
1,ENSG00000000005,TNMD,17757,64102,TEM
1,ENSG00000000005,TNMD,17757,64102,BRICD4
1,ENSG00000000005,TNMD,17757,64102,ChM1L
...,...,...,...,...,...
45638,,ZNF94,13170,,F11465
45643,,ZPAXP,51635,105373450,ZPX1P
45645,,ZWINTAS,13196,,MPP5
45645,,ZWINTAS,13196,,MPHOSPH5


In [1672]:
hgnc_concept_alias_pair_count = len(mini_hgnc_df)

### <a id='toc2_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [1673]:
mini_hgnc_df['alias_duplicates'] = mini_hgnc_df.duplicated(subset= 'alias_symbol', keep=False)
aa_collision_hgnc_df = mini_hgnc_df[mini_hgnc_df['alias_duplicates'] == True]
aa_collision_hgnc_df = aa_collision_hgnc_df.drop(['alias_duplicates'], axis=1)
aa_collision_hgnc_df.head(5)

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
4,ENSG00000000460,FIRRM,25565,55732,FLIP
8,ENSG00000001084,GCLC,4311,2729,GCS
13,ENSG00000001561,ENPP4,3359,22875,AP3Aase
22,ENSG00000002549,LAP3,18449,51056,LAP
39,ENSG00000003402,CFLAR,1876,8837,FLIP


### <a id='toc2_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [1674]:
aa_collision_hgnc_df = aa_collision_hgnc_df.sort_values('alias_symbol')
aa_collision_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol
75,ENSG00000005022,SLC25A5,10991,292,2F1
7761,ENSG00000139187,KLRG1,6380,10219,2F1
8398,ENSG00000143546,S100A8,10498,6279,60B8AG
10916,ENSG00000163220,S100A9,10499,6280,60B8AG
9226,ENSG00000149735,GPHA2,18054,170589,A2
...,...,...,...,...,...
10537,ENSG00000161011,SQSTM1,11280,8878,p62
17080,ENSG00000196787,H2AC11,4737,8969,pH2A/f
28269,ENSG00000234816,H2AC5P,4728,10341,pH2A/f
39337,ENSG00000274962,TEX28P1,33356,728447,pTEX


### <a id='toc2_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [1675]:
hgnc_alias_alias_collision_primary_symbol_set = set(aa_collision_hgnc_df['gene_symbol'])
len(hgnc_alias_alias_collision_primary_symbol_set)

1356

In [1676]:
aa_collision_hgnc_df['source'] = 'HGNC'
aa_collision_hgnc_df

Unnamed: 0,ENSG_ID,gene_symbol,HGNC_ID,NCBI_ID,alias_symbol,source
75,ENSG00000005022,SLC25A5,10991,292,2F1,HGNC
7761,ENSG00000139187,KLRG1,6380,10219,2F1,HGNC
8398,ENSG00000143546,S100A8,10498,6279,60B8AG,HGNC
10916,ENSG00000163220,S100A9,10499,6280,60B8AG,HGNC
9226,ENSG00000149735,GPHA2,18054,170589,A2,HGNC
...,...,...,...,...,...,...
10537,ENSG00000161011,SQSTM1,11280,8878,p62,HGNC
17080,ENSG00000196787,H2AC11,4737,8969,pH2A/f,HGNC
28269,ENSG00000234816,H2AC5P,4728,10341,pH2A/f,HGNC
39337,ENSG00000274962,TEX28P1,33356,728447,pTEX,HGNC


In [1677]:
aa_collision_hgnc_df.to_csv('../created_files/aa_collision_hgnc_df.csv', index=True)

### <a id='toc2_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [1678]:
aa_collision_hgnc_count_df = aa_collision_hgnc_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
aa_collision_hgnc_count_df = aa_collision_hgnc_count_df.reset_index()
aa_collision_hgnc_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
aa_collision_hgnc_count_df = aa_collision_hgnc_count_df.sort_values('num_gene_records', ascending=False)
aa_collision_hgnc_count_df.head(5)

Unnamed: 0,alias_symbol,num_gene_records
639,U3,8
642,U4,6
189,MYM,6
446,F379,6
638,U2,5


In [1679]:
hgnc_alias_alias_collision_set = set(aa_collision_hgnc_count_df['alias_symbol'])
len(hgnc_alias_alias_collision_set)

673

In [1680]:
aa_collision_hgnc_count_df.to_csv('../created_files/aa_collision_hgnc_count_df.csv', index=True)

In [1681]:
aa_collision_hgnc_distribution_df = aa_collision_hgnc_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
aa_collision_hgnc_distribution_df = aa_collision_hgnc_distribution_df.reset_index()
aa_collision_hgnc_distribution_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
aa_collision_hgnc_distribution_df['percent_alias_symbol'] = ((aa_collision_hgnc_distribution_df['num_alias_symbol'] / hgnc_alias_len) * 100)
aa_collision_hgnc_distribution_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,574,1.311401
1,3,70,0.159927
2,4,22,0.050263
3,5,3,0.006854
4,6,3,0.006854
5,8,1,0.002285


In [1682]:
aa_collision_hgnc_distribution_df = aa_collision_hgnc_distribution_df.drop('num_alias_symbol', axis=1)
aa_collision_hgnc_distribution_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,1.311401
1,3,0.159927
2,4,0.050263
3,5,0.006854
4,6,0.006854
5,8,0.002285


In [1683]:
aa_collision_hgnc_distribution_df.to_csv('../created_files/aa_collision_hgnc_distribution_df.csv', index=True)

In [1684]:
#px.bar(hgnc_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

#### <a id='toc2_1_10_1_'></a>[Save as csv](#toc0_)

In [1685]:
#mini_hgnc_df_explode.to_csv('../hgnc_alias_overlap.csv', index=False)

### <a id='toc2_1_11_'></a>[Put columns in different order to emphasize alias symbols instead of gene records](#toc0_)

In [1686]:
aa_collision_hgnc_df_2 = aa_collision_hgnc_df.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [1687]:
aa_collision_hgnc_df_2 = aa_collision_hgnc_df_2[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
aa_collision_hgnc_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
75,2F1,ENSG00000005022,SLC25A5,HGNC
7761,2F1,ENSG00000139187,KLRG1,HGNC
8398,60B8AG,ENSG00000143546,S100A8,HGNC
10916,60B8AG,ENSG00000163220,S100A9,HGNC
9226,A2,ENSG00000149735,GPHA2,HGNC
...,...,...,...,...
10537,p62,ENSG00000161011,SQSTM1,HGNC
17080,pH2A/f,ENSG00000196787,H2AC11,HGNC
28269,pH2A/f,ENSG00000234816,H2AC5P,HGNC
39337,pTEX,ENSG00000274962,TEX28P1,HGNC


### <a id='toc2_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [1688]:
aa_collision_hgnc_df_2 = aa_collision_hgnc_df_2.astype(str)
aa_collision_hgnc_df_2


Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
75,2F1,ENSG00000005022,SLC25A5,HGNC
7761,2F1,ENSG00000139187,KLRG1,HGNC
8398,60B8AG,ENSG00000143546,S100A8,HGNC
10916,60B8AG,ENSG00000163220,S100A9,HGNC
9226,A2,ENSG00000149735,GPHA2,HGNC
...,...,...,...,...
10537,p62,ENSG00000161011,SQSTM1,HGNC
17080,pH2A/f,ENSG00000196787,H2AC11,HGNC
28269,pH2A/f,ENSG00000234816,H2AC5P,HGNC
39337,pTEX,ENSG00000274962,TEX28P1,HGNC


In [1689]:
aa_collision_hgnc_df_2 = aa_collision_hgnc_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
aa_collision_hgnc_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,2F1,"ENSG00000005022, ENSG00000139187","SLC25A5, KLRG1",HGNC
1,60B8AG,"ENSG00000143546, ENSG00000163220","S100A8, S100A9",HGNC
2,A2,"ENSG00000149735, ENSG00000160226, ENSG00000108823","GPHA2, CFAP410, SGCA",HGNC
3,ACC2,"ENSG00000140379, ENSG00000076555","BCL2A1, ACACB",HGNC
4,ACS2,"ENSG00000164398, ENSG00000197142","ACSL6, ACSL5",HGNC
...,...,...,...,...
668,p55,"ENSG00000197170, ENSG00000075618, ENSG00000117...","PSMD12, FSCN1, PIK3R3, H3P44",HGNC
669,p56,"ENSG00000123106, ENSG00000227211","CCDC91, H3P45",HGNC
670,p62,"ENSG00000213024, ENSG00000161011","NUP62, SQSTM1",HGNC
671,pH2A/f,"ENSG00000196787, ENSG00000234816","H2AC11, H2AC5P",HGNC
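
The ENSG and HGNC sections run the same sequence of steps, differing only in the input frame and source label. A sketch of how those steps could be wrapped in one function. `find_alias_collisions` is a hypothetical helper, not part of this notebook, and it strips whitespace around each alias, which the cells above do not do:

```python
import pandas as pd


def find_alias_collisions(df, sep, source):
    """Hypothetical helper: rows whose alias is used by more than one gene symbol."""
    out = df.dropna(subset=["alias_symbol"]).copy()
    # Split on the source-specific separator and strip stray whitespace
    # (the stripping is an assumption, see the note above).
    out["alias_symbol"] = out["alias_symbol"].map(
        lambda s: [alias.strip() for alias in str(s).split(sep)]
    )
    # One alias per row, then one row per (gene symbol, alias) pair.
    out = out.explode("alias_symbol")
    out = out.drop_duplicates(subset=["gene_symbol", "alias_symbol"])
    # Keep every row whose alias appears for more than one gene symbol.
    shared = out.duplicated(subset="alias_symbol", keep=False)
    return out[shared].sort_values("alias_symbol").assign(source=source)


# Toy input; the values are illustrative only.
toy_hgnc_df = pd.DataFrame({
    "gene_symbol": ["FIRRM", "CFLAR", "TSPAN6", "ZNF97"],
    "alias_symbol": ["FLJ10706, FLIP", "CASH, FLIP", "T245", None],
})
collisions = find_alias_collisions(toy_hgnc_df, sep=",", source="HGNC")
print(collisions)
```

The same function could be called with `sep="|"` for a pipe-delimited alias column.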


In [1690]:
#mini_hgnc_df_2.to_csv('../hgnc_alias_overlap_2.csv', index=False)

# <a id='toc3_'></a>[NCBI Info](#toc0_)

In [1691]:
mini_ncbi_df = pd.read_csv('../created_files/mini_ncbi_df.csv', dtype={'HGNC_ID': pd.Int64Dtype(), 'NCBI_ID': pd.Int64Dtype()})
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B|ABG|GAB|HYST2477,5,ENSG00000121410
1,2,A2M,A2MD|CPAMD5|FWP007|S863-7,7,ENSG00000175899
2,3,A2MP1,A2MP,8,ENSG00000291190
3,9,NAT1,AAC1|MNAT|NAT-1|NATI,7645,ENSG00000171428
4,10,NAT2,AAC2|NAT-2|PNAT,7646,ENSG00000156006
...,...,...,...,...,...
193451,8923215,trnD,-,,
193452,8923216,trnP,-,,
193453,8923217,trnA,-,,
193454,8923218,COX1,-,,


### <a id='toc3_1_1_'></a>[How many total unique gene records are there](#toc0_)

By ENSG ID

In [1692]:
ncbi_gene_id_set = set(mini_ncbi_df['ENSG_ID'])
len(ncbi_gene_id_set)

36803

By gene symbol

In [1693]:
ncbi_gene_symbol_set = set(mini_ncbi_df['gene_symbol'])
len(ncbi_gene_symbol_set)

193303

### <a id='toc3_1_2_'></a>[Drop genes with no aliases](#toc0_)

In [1694]:
mini_ncbi_df = mini_ncbi_df.replace("-", np.nan)
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B|ABG|GAB|HYST2477,5,ENSG00000121410
1,2,A2M,A2MD|CPAMD5|FWP007|S863-7,7,ENSG00000175899
2,3,A2MP1,A2MP,8,ENSG00000291190
3,9,NAT1,AAC1|MNAT|NAT-1|NATI,7645,ENSG00000171428
4,10,NAT2,AAC2|NAT-2|PNAT,7646,ENSG00000156006
...,...,...,...,...,...
193451,8923215,trnD,,,
193452,8923216,trnP,,,
193453,8923217,trnA,,,
193454,8923218,COX1,,,


In [1695]:
mini_ncbi_df = mini_ncbi_df.dropna(subset=['alias_symbol'])
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B|ABG|GAB|HYST2477,5,ENSG00000121410
1,2,A2M,A2MD|CPAMD5|FWP007|S863-7,7,ENSG00000175899
2,3,A2MP1,A2MP,8,ENSG00000291190
3,9,NAT1,AAC1|MNAT|NAT-1|NATI,7645,ENSG00000171428
4,10,NAT2,AAC2|NAT-2|PNAT,7646,ENSG00000156006
...,...,...,...,...,...
190958,131696449,LOC131696449,PKD1P1-NPIPA5L,,
190961,131840634,GLTC1,GLTC,56861,
193342,132532400,GABRA6-AS1,ARBAG,40248,
193377,133395150,LNCARGI,ARGI,56890,
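
The `-` placeholder could also be turned into a missing value at read time with the `na_values` argument of `pd.read_csv`, limited to the alias column. A sketch using an in-memory CSV (`csv_text` is made up to mirror the table layout; the real file is not read here):

```python
import io

import pandas as pd

# Toy CSV text in the same layout as the NCBI table; values are illustrative.
csv_text = (
    "NCBI_ID,gene_symbol,alias_symbol\n"
    "1,A1BG,A1B|ABG\n"
    "8923215,trnD,-\n"
)

# Treat '-' as missing, but only in the alias_symbol column.
ncbi_df = pd.read_csv(io.StringIO(csv_text), na_values={"alias_symbol": ["-"]})

# Rows without an alias can then be dropped directly.
ncbi_df = ncbi_df.dropna(subset=["alias_symbol"])
print(ncbi_df)
```

Unlike `replace("-", np.nan)` on the whole frame, this leaves a literal `-` in any other column untouched.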


In [1696]:
mini_ncbi_df.loc[mini_ncbi_df['gene_symbol'] == "TMEM37" ]

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
18270,140738,TMEM37,PR|PR1,18216,ENSG00000171227


### <a id='toc3_1_3_'></a>[Make each row in alias_symbol a set:](#toc0_)
    convert to a list 
    make a set

In [1697]:
mini_ncbi_df = mini_ncbi_df.copy()  # explicit copy so the assignments below do not raise SettingWithCopyWarning
mini_ncbi_df['alias_symbol'] = mini_ncbi_df['alias_symbol'].astype(str)
mini_ncbi_df['alias_symbol'] = [x.split('|') for x in mini_ncbi_df.alias_symbol]  # pipe-delimited string -> list
mini_ncbi_df['alias_symbol'] = np.where(mini_ncbi_df.alias_symbol == '', '', mini_ncbi_df.alias_symbol.map(set))  # list -> set
mini_ncbi_df.head(1)

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,"{A1B, GAB, ABG, HYST2477}",5,ENSG00000121410


### <a id='toc3_1_4_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [1698]:
mini_ncbi_df = mini_ncbi_df.explode(column="alias_symbol")
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B,5,ENSG00000121410
0,1,A1BG,GAB,5,ENSG00000121410
0,1,A1BG,ABG,5,ENSG00000121410
0,1,A1BG,HYST2477,5,ENSG00000121410
1,2,A2M,FWP007,7,ENSG00000175899
...,...,...,...,...,...
190961,131840634,GLTC1,GLTC,56861,
193342,132532400,GABRA6-AS1,ARBAG,40248,
193377,133395150,LNCARGI,ARGI,56890,
193378,133834869,MLDHR,MP31,55481,
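
The split, set, and explode steps above can be sketched end to end on a hypothetical two-row frame (the repeated `A1B` alias is made up to show what the set conversion does):

```python
import pandas as pd

# Hypothetical toy records; "A1B" is repeated on purpose
toy_df = pd.DataFrame({
    "gene_symbol": ["A1BG", "A2MP1"],
    "alias_symbol": ["A1B|ABG|A1B", "A2MP"],
})

# Split on the pipe, then convert to a set so repeats within one record collapse
toy_df["alias_symbol"] = toy_df["alias_symbol"].str.split("|").map(set)

# explode() emits one row per set element and repeats the other columns;
# the row order within a record is not guaranteed for sets
exploded = toy_df.explode("alias_symbol")

pairs = sorted(zip(exploded["gene_symbol"], exploded["alias_symbol"]))
print(pairs)
```

Because sets are unordered, the aliases of one gene can come out in a different order than they appear in the source string, which matches the shuffled order visible in the output above.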


In [1699]:
#ncbi_CD158b_alias_count_df.to_csv('../ncbi_CD158b_alias_count_df.csv')

### <a id='toc3_1_5_'></a>[How many unique aliases are there](#toc0_)

In [1700]:
ncbi_alias_symbol_set = set(mini_ncbi_df['alias_symbol'])
ncbi_alias_len = len(ncbi_alias_symbol_set)
ncbi_alias_len

69157

### <a id='toc3_1_6_'></a>[Remove the duplicate instances of a primary gene symbol-alias pair](#toc0_)

In [1701]:
mini_ncbi_df = mini_ncbi_df.drop_duplicates(subset=['gene_symbol', 'alias_symbol'], keep='first')
mini_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B,5,ENSG00000121410
0,1,A1BG,GAB,5,ENSG00000121410
0,1,A1BG,ABG,5,ENSG00000121410
0,1,A1BG,HYST2477,5,ENSG00000121410
1,2,A2M,FWP007,7,ENSG00000175899
...,...,...,...,...,...
190961,131840634,GLTC1,GLTC,56861,
193342,132532400,GABRA6-AS1,ARBAG,40248,
193377,133395150,LNCARGI,ARGI,56890,
193378,133834869,MLDHR,MP31,55481,


In [1702]:
ncbi_concept_alias_pair_count = len(mini_ncbi_df)

### <a id='toc3_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [1703]:
mini_ncbi_df = mini_ncbi_df.copy()  # explicit copy so adding a column does not raise SettingWithCopyWarning
# keep=False flags every row whose alias occurs more than once, not just the repeats
mini_ncbi_df['alias_duplicates'] = mini_ncbi_df.duplicated(subset='alias_symbol', keep=False)
aa_collision_ncbi_df = mini_ncbi_df[mini_ncbi_df['alias_duplicates']]
aa_collision_ncbi_df = aa_collision_ncbi_df.drop(['alias_duplicates'], axis=1)
aa_collision_ncbi_df.head(5)

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
0,1,A1BG,A1B,5,ENSG00000121410
3,9,NAT1,NAT-1,7645,ENSG00000171428
3,9,NAT1,AAC1,7645,ENSG00000171428
4,10,NAT2,AAC2,7646,ENSG00000156006
6,12,SERPINA3,ACT,16,ENSG00000196136
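
The choice of `keep=False` in the cell above matters. A small sketch on three gene-alias pairs taken from the outputs in this section shows the difference from the default `keep='first'` (the frame name `pairs` is only illustrative):

```python
import pandas as pd

pairs = pd.DataFrame({
    "gene_symbol": ["PTEN", "BMPR1A", "A1BG"],
    "alias_symbol": ["10q23del", "10q23del", "HYST2477"],
})

# Default keep='first': only the second and later occurrences are flagged
first_only = pairs.duplicated(subset="alias_symbol")
# keep=False: every member of a group of repeated aliases is flagged
all_members = pairs.duplicated(subset="alias_symbol", keep=False)

shared = pairs[all_members]
print(first_only.tolist(), all_members.tolist())
print(shared["gene_symbol"].tolist())
```

With the default setting, PTEN would be missing from the collision table even though it shares `10q23del` with BMPR1A.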


### <a id='toc3_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [1704]:
aa_collision_ncbi_df = aa_collision_ncbi_df.sort_values('alias_symbol')
aa_collision_ncbi_df.head(5)

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID
4525,5728,PTEN,10q23del,9588,ENSG00000171862
537,657,BMPR1A,10q23del,1076,ENSG00000107779
199,239,ALOX12,12-LOX,429,ENSG00000108839
205,246,ALOX15,12-LOX,433,ENSG00000161905
245,292,SLC25A5,2F1,10991,ENSG00000005022


### <a id='toc3_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [1705]:
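# Note: this set is built from mini_ncbi_df (every record with at least one alias), not from aa_collision_ncbi_df
ncbi_alias_alias_collision_primary_symbol_set = set(mini_ncbi_df['gene_symbol'])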
len(ncbi_alias_alias_collision_primary_symbol_set)

27580

In [1706]:
aa_collision_ncbi_df['source'] = 'NCBI Info'
aa_collision_ncbi_df

Unnamed: 0,NCBI_ID,gene_symbol,alias_symbol,HGNC_ID,ENSG_ID,source
4525,5728,PTEN,10q23del,9588,ENSG00000171862,NCBI Info
537,657,BMPR1A,10q23del,1076,ENSG00000107779,NCBI Info
199,239,ALOX12,12-LOX,429,ENSG00000108839,NCBI Info
205,246,ALOX15,12-LOX,433,ENSG00000161905,NCBI Info
245,292,SLC25A5,2F1,10991,ENSG00000005022,NCBI Info
...,...,...,...,...,...,...
18172,139420,PPP4R3C,smk1,33146,ENSG00000224960,NCBI Info
13522,57223,PPP4R3B,smk1,29267,ENSG00000275052,NCBI Info
12905,55671,PPP4R3A,smk1,20219,ENSG00000100796,NCBI Info
7631,9825,SPATA2,tamo,14681,ENSG00000158480,NCBI Info


In [1707]:
aa_collision_ncbi_df.to_csv('../created_files/aa_collision_ncbi_df.csv', index=True)

### <a id='toc3_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [1708]:
aa_collision_ncbi_count_df = aa_collision_ncbi_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
aa_collision_ncbi_count_df = aa_collision_ncbi_count_df.reset_index()
aa_collision_ncbi_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
aa_collision_ncbi_count_df = aa_collision_ncbi_count_df.sort_values('num_gene_records', ascending=False)
aa_collision_ncbi_count_df.head(5)

Unnamed: 0,alias_symbol,num_gene_records
3305,VH,36
1303,H4-16,14
1306,H4C12,13
1305,H4C11,13
1316,H4C8,13
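
The count table above is produced with `pivot_table(..., aggfunc='size')`. As an alternative sketch (not the notebook's method), `groupby(...).size()` yields the same per-alias counts and names the count column directly, so no rename of column `0` is needed. The toy frame reuses a few gene-alias pairs from the outputs above.

```python
import pandas as pd

collisions = pd.DataFrame({
    "gene_symbol": ["PPP4R3A", "PPP4R3B", "PPP4R3C", "PTEN", "BMPR1A"],
    "alias_symbol": ["smk1", "smk1", "smk1", "10q23del", "10q23del"],
})

# One row per alias with the number of gene records that use it
counts = (
    collisions.groupby("alias_symbol")
    .size()
    .reset_index(name="num_gene_records")
    .sort_values("num_gene_records", ascending=False)
)

count_by_alias = dict(zip(counts["alias_symbol"], counts["num_gene_records"]))
print(count_by_alias)
```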


In [1709]:
ncbi_alias_alias_collision_set = set(aa_collision_ncbi_count_df['alias_symbol'])
len(ncbi_alias_alias_collision_set)

3476

In [1710]:
aa_collision_ncbi_count_df.to_csv('../created_files/aa_collision_ncbi_count_df.csv', index=True)

In [1711]:
aa_collision_ncbi_distribution_df = aa_collision_ncbi_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
aa_collision_ncbi_distribution_df = aa_collision_ncbi_distribution_df.reset_index()
aa_collision_ncbi_distribution_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
aa_collision_ncbi_distribution_df['percent_alias_symbol'] = ((aa_collision_ncbi_distribution_df['num_alias_symbol'] / ncbi_alias_len) * 100)
aa_collision_ncbi_distribution_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,2786,4.028515
1,3,413,0.597192
2,4,140,0.202438
3,5,54,0.078083
4,6,23,0.033258
5,7,17,0.024582
6,8,8,0.011568
7,9,15,0.02169
8,10,2,0.002892
9,11,1,0.001446
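
The `percent_alias_symbol` column divides by `ncbi_alias_len`, the number of all unique NCBI aliases, not by the number of shared aliases, so its rows are not expected to sum to 100. A short arithmetic sketch using values copied from the outputs above (the variable names other than `ncbi_alias_len` are illustrative):

```python
# Values copied from the outputs in this section
ncbi_alias_len = 69157          # unique NCBI alias symbols
num_alias_shared_by_two = 2786  # aliases attached to exactly 2 gene records
total_shared_aliases = 3476     # aliases attached to 2 or more gene records

# Share of all unique aliases that map to exactly two gene records
percent_two = num_alias_shared_by_two / ncbi_alias_len * 100
# Share of all unique aliases that collide at all
percent_any = total_shared_aliases / ncbi_alias_len * 100

print(round(percent_two, 6))
print(round(percent_any, 2))
```

The first value reproduces the first row of the distribution table; the second gives the overall fraction of NCBI aliases involved in any collision, roughly 5%.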


In [1712]:
aa_collision_ncbi_distribution_df = aa_collision_ncbi_distribution_df.drop('num_alias_symbol', axis=1)
aa_collision_ncbi_distribution_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,4.028515
1,3,0.597192
2,4,0.202438
3,5,0.078083
4,6,0.033258
5,7,0.024582
6,8,0.011568
7,9,0.02169
8,10,0.002892
9,11,0.001446


In [1713]:
aa_collision_ncbi_distribution_df.to_csv('../created_files/aa_collision_ncbi_distribution_df.csv', index=True)

In [1714]:
# px.bar(ncbi_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

### <a id='toc3_1_11_'></a>[Put columns in different order to emphasize alias symbols instead of gene records](#toc0_)

In [1715]:
aa_collision_ncbi_df_2 = aa_collision_ncbi_df.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [1716]:
aa_collision_ncbi_df_2 = aa_collision_ncbi_df_2[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]  # reorder the de-duplicated frame from the previous cell
aa_collision_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
4525,10q23del,ENSG00000171862,PTEN,NCBI Info
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
245,2F1,ENSG00000005022,SLC25A5,NCBI Info
...,...,...,...,...
18172,smk1,ENSG00000224960,PPP4R3C,NCBI Info
13522,smk1,ENSG00000275052,PPP4R3B,NCBI Info
12905,smk1,ENSG00000100796,PPP4R3A,NCBI Info
7631,tamo,ENSG00000158480,SPATA2,NCBI Info


### <a id='toc3_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [1717]:
aa_collision_ncbi_df_2 = aa_collision_ncbi_df_2.map(str)  # DataFrame.map replaces the deprecated applymap (pandas >= 2.1)
aa_collision_ncbi_df_2


Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
4525,10q23del,ENSG00000171862,PTEN,NCBI Info
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
245,2F1,ENSG00000005022,SLC25A5,NCBI Info
...,...,...,...,...
18172,smk1,ENSG00000224960,PPP4R3C,NCBI Info
13522,smk1,ENSG00000275052,PPP4R3B,NCBI Info
12905,smk1,ENSG00000100796,PPP4R3A,NCBI Info
7631,tamo,ENSG00000158480,SPATA2,NCBI Info


In [1718]:
aa_collision_ncbi_df_2['ENSG_ID'] = aa_collision_ncbi_df_2['ENSG_ID'].str.replace('NAN','nan')
aa_collision_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
4525,10q23del,ENSG00000171862,PTEN,NCBI Info
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
245,2F1,ENSG00000005022,SLC25A5,NCBI Info
...,...,...,...,...
18172,smk1,ENSG00000224960,PPP4R3C,NCBI Info
13522,smk1,ENSG00000275052,PPP4R3B,NCBI Info
12905,smk1,ENSG00000100796,PPP4R3A,NCBI Info
7631,tamo,ENSG00000158480,SPATA2,NCBI Info


In [1719]:
aa_collision_ncbi_df_2 = aa_collision_ncbi_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
aa_collision_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,10q23del,"ENSG00000171862, ENSG00000107779","PTEN, BMPR1A",NCBI Info
1,12-LOX,"ENSG00000108839, ENSG00000161905","ALOX12, ALOX15",NCBI Info
2,2F1,"ENSG00000005022, ENSG00000139187","SLC25A5, KLRG1",NCBI Info
3,3-alpha-HSD,"ENSG00000198610, ENSG00000073737","AKR1C4, DHRS9",NCBI Info
4,35DAG,"ENSG00000102683, ENSG00000170624","SGCG, SGCD",NCBI Info
...,...,...,...,...
3471,polymerase,"nan, nan, nan","ERVK-11, ERVK-19, ERVK-9",NCBI Info
3472,psiSSX8,"ENSG00000241207, nan","SSX18P, SSXP8",NCBI Info
3473,rpL7a,"ENSG00000213272, ENSG00000240522","RPL7AP9, RPL7AP10",NCBI Info
3474,smk1,"ENSG00000224960, ENSG00000275052, ENSG00000100796","PPP4R3C, PPP4R3B, PPP4R3A",NCBI Info
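
The merge step above collapses all rows that share an alias into one row with comma-separated values. A minimal sketch on four rows taken from the outputs above, assuming missing ENSG IDs should appear as the text `nan` as they do in the table (`rows` and `merged` are illustrative names):

```python
import pandas as pd

rows = pd.DataFrame({
    "alias_symbol": ["10q23del", "10q23del", "psiSSX8", "psiSSX8"],
    "ENSG_ID": ["ENSG00000171862", "ENSG00000107779", "ENSG00000241207", None],
    "gene_symbol": ["PTEN", "BMPR1A", "SSX18P", "SSXP8"],
})

# str.join only accepts strings, so fill the missing ID with the text "nan" first
rows["ENSG_ID"] = rows["ENSG_ID"].fillna("nan")

# One row per alias; the values of each group are joined in their original order
merged = (
    rows.groupby("alias_symbol")
    .agg({"ENSG_ID": ", ".join, "gene_symbol": ", ".join})
    .reset_index()
)

genes_by_alias = dict(zip(merged["alias_symbol"], merged["gene_symbol"]))
ids_by_alias = dict(zip(merged["alias_symbol"], merged["ENSG_ID"]))
print(genes_by_alias)
```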


# <a id='toc4_'></a>[Merge to create Alias Overlap Table 1 - Gene Symbol](#toc0_)

In [1720]:
merged_alias_overlap_df_1 = pd.concat([aa_collision_hgnc_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']],aa_collision_ncbi_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']], aa_collision_ensg_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']]])
merged_alias_overlap_df_1

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
75,SLC25A5,ENSG00000005022,2F1,HGNC
7761,KLRG1,ENSG00000139187,2F1,HGNC
8398,S100A8,ENSG00000143546,60B8AG,HGNC
10916,S100A9,ENSG00000163220,60B8AG,HGNC
9226,GPHA2,ENSG00000149735,A2,HGNC
...,...,...,...,...
2565,USP14,ENSG00000101557,TGT,ENSG
6615,UBE2G1,ENSG00000132388,UBC7,ENSG
15955,UBE2G2,ENSG00000184787,UBC7,ENSG
11780,PRAP1,ENSG00000165828,UPA,ENSG


In [1721]:
merged_alias_overlap_df_1.to_csv('../created_files/merged_alias_overlap_df_1.csv', index=False)

In [1722]:
merged_alias_overlap_df_1.loc[merged_alias_overlap_df_1.gene_symbol == 'TAS1R2']

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
14516,TAS1R2,ENSG00000179002,TR2,HGNC
15238,TAS1R2,ENSG00000179002,TR2,NCBI Info
14718,TAS1R2,ENSG00000179002,TR2,ENSG


In [1723]:
merged_alias_overlap_df_1.loc[merged_alias_overlap_df_1.alias_symbol == 'TR2']

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
15238,TAS1R2,ENSG00000179002,TR2,NCBI Info
5747,NR2C1,ENSG00000120798,TR2,NCBI Info
17003,TXNRD3,ENSG00000197763,TR2,NCBI Info
6795,TNFRSF14,ENSG00000157873,TR2,NCBI Info
16635,DEPDC7,ENSG00000121690,TR2,NCBI Info


In [1724]:
merged_alias_overlap_df_1['source'].value_counts()

source
NCBI Info    8372
ENSG         2573
HGNC         1487
Name: count, dtype: int64

# <a id='toc5_'></a>[Merge to create Alias Overlap Table 2 - Alias Symbol](#toc0_)

In [1725]:
merged_alias_overlap_df_2 = pd.concat([aa_collision_hgnc_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']],aa_collision_ncbi_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']], aa_collision_ensg_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']]])
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
0,2F1,"SLC25A5, KLRG1","ENSG00000005022, ENSG00000139187",HGNC
1,60B8AG,"S100A8, S100A9","ENSG00000143546, ENSG00000163220",HGNC
2,A2,"GPHA2, CFAP410, SGCA","ENSG00000149735, ENSG00000160226, ENSG00000108823",HGNC
3,ACC2,"BCL2A1, ACACB","ENSG00000140379, ENSG00000076555",HGNC
4,ACS2,"ACSL6, ACSL5","ENSG00000164398, ENSG00000197142",HGNC
...,...,...,...,...
1144,TCRBV15S1,"TRBV24-1, TRBV15","ENSG00000211750, ENSG00000276819",ENSG
1145,TCRGV5P,"TRGV6, TRGV5P","ENSG00000226212, ENSG00000228668",ENSG
1146,TGT,"QTRT1, USP14","ENSG00000213339, ENSG00000101557",ENSG
1147,UBC7,"UBE2G1, UBE2G2","ENSG00000132388, ENSG00000184787",ENSG


In [1726]:
merged_alias_overlap_df_2.loc[merged_alias_overlap_df_2['alias_symbol'] == "ASP" ]

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
394,ASP,"ATG5, ASIP, ASPA, ROPN1L","ENSG00000057663, ENSG00000101440, ENSG00000108...",HGNC
222,ASP,"ROPN1L, ASPM, ASIP, ASPA, TMPRSS11D, A1CF, ATG...","ENSG00000145491, ENSG00000066279, ENSG00000101...",NCBI Info
864,ASP,"ROPN1L, TMPRSS11D, ASPM","ENSG00000145491, ENSG00000153802, ENSG00000066279",ENSG


In [1727]:
merged_alias_overlap_df_2.to_csv('../created_files/merged_alias_overlap_df_2.csv', index=True, quoting=0)

In [1728]:
merged_alias_overlap_df_2['gene_symbol'] = merged_alias_overlap_df_2['gene_symbol'].str.split(", ")  # symbols were joined with ', ', so split on the same separator
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
0,2F1,"[SLC25A5, KLRG1]","ENSG00000005022, ENSG00000139187",HGNC
1,60B8AG,"[S100A8, S100A9]","ENSG00000143546, ENSG00000163220",HGNC
2,A2,"[GPHA2, CFAP410, SGCA]","ENSG00000149735, ENSG00000160226, ENSG00000108823",HGNC
3,ACC2,"[BCL2A1, ACACB]","ENSG00000140379, ENSG00000076555",HGNC
4,ACS2,"[ACSL6, ACSL5]","ENSG00000164398, ENSG00000197142",HGNC
...,...,...,...,...
1144,TCRBV15S1,"[TRBV24-1, TRBV15]","ENSG00000211750, ENSG00000276819",ENSG
1145,TCRGV5P,"[TRGV6, TRGV5P]","ENSG00000226212, ENSG00000228668",ENSG
1146,TGT,"[QTRT1, USP14]","ENSG00000213339, ENSG00000101557",ENSG
1147,UBC7,"[UBE2G1, UBE2G2]","ENSG00000132388, ENSG00000184787",ENSG


In [1729]:
merged_alias_overlap_df_2['gene_symbol_count'] = [len(c) for c in merged_alias_overlap_df_2['gene_symbol']]
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
0,2F1,"[SLC25A5, KLRG1]","ENSG00000005022, ENSG00000139187",HGNC,2
1,60B8AG,"[S100A8, S100A9]","ENSG00000143546, ENSG00000163220",HGNC,2
2,A2,"[GPHA2, CFAP410, SGCA]","ENSG00000149735, ENSG00000160226, ENSG00000108823",HGNC,3
3,ACC2,"[BCL2A1, ACACB]","ENSG00000140379, ENSG00000076555",HGNC,2
4,ACS2,"[ACSL6, ACSL5]","ENSG00000164398, ENSG00000197142",HGNC,2
...,...,...,...,...,...
1144,TCRBV15S1,"[TRBV24-1, TRBV15]","ENSG00000211750, ENSG00000276819",ENSG,2
1145,TCRGV5P,"[TRGV6, TRGV5P]","ENSG00000226212, ENSG00000228668",ENSG,2
1146,TGT,"[QTRT1, USP14]","ENSG00000213339, ENSG00000101557",ENSG,2
1147,UBC7,"[UBE2G1, UBE2G2]","ENSG00000132388, ENSG00000184787",ENSG,2
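
The gene symbols were joined with `', '` earlier, so the separator passed to `str.split` affects the list elements but not `gene_symbol_count`. A small sketch using the `A2` row from the table above (the Series name `joined` is illustrative):

```python
import pandas as pd

joined = pd.Series(["GPHA2, CFAP410, SGCA"])

# Splitting on the comma alone keeps the leading space on later elements
by_comma = joined.str.split(",").iloc[0]
# Splitting on comma plus space returns clean symbols
by_comma_space = joined.str.split(", ").iloc[0]

print(by_comma)
print(by_comma_space)
print(len(by_comma), len(by_comma_space))
```

The leading spaces only matter if the list elements are later compared against bare gene symbols, for example in a membership test.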


In [1730]:
merged_alias_overlap_df_2 = merged_alias_overlap_df_2.sort_values(by='gene_symbol_count', ascending= False)
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
3305,VH,"[IGHV3-64, IGHV1-24, IGHV3-74, IGHV3-7, IG...","ENSG00000223648, ENSG00000211950, ENSG00000224...",NCBI Info,36
1303,H4-16,"[H4C16, H4C5, H4C1, H4C3, H4C4, H4C8, H4...","ENSG00000197837, ENSG00000276966, ENSG00000278...",NCBI Info,14
1317,H4C9,"[H4C12, H4C5, H4C1, H4C3, H4C6, H4C4, H4...","ENSG00000273542, ENSG00000276966, ENSG00000278...",NCBI Info,13
1304,H4C1,"[H4C2, H4C12, H4C8, H4C16, H4C9, H4C5, H...","ENSG00000278705, ENSG00000273542, ENSG00000158...",NCBI Info,13
1305,H4C11,"[H4C8, H4C6, H4C9, H4C13, H4C14, H4C5, H...","ENSG00000158406, ENSG00000274618, ENSG00000276...",NCBI Info,13
...,...,...,...,...,...
1240,GST3,"[CHST4, GSTP1]","ENSG00000140835, ENSG00000084207",NCBI Info,2
1238,GST1,"[GSTM1, GSPT1]","ENSG00000134184, ENSG00000103342",NCBI Info,2
1237,GST,"[GSTK1, SLCO6A1]","ENSG00000197448, ENSG00000205359",NCBI Info,2
1236,GSP,"[GNAS, GSM1]","ENSG00000087460, nan",NCBI Info,2


In [1731]:
merged_alias_overlap_df_2.loc[merged_alias_overlap_df_2['alias_symbol'] == "TR2" ]

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
3161,TR2,"[TAS1R2, NR2C1, TXNRD3, TNFRSF14, DEPDC7]","ENSG00000179002, ENSG00000120798, ENSG00000197...",NCBI Info,5


In [1732]:
merged_alias_overlap_df_2['gene_symbol_count'].value_counts()

gene_symbol_count
2     4340
3      600
4      193
5       62
6       36
7       18
9       17
13      14
8       10
10       4
36       1
14       1
12       1
11       1
Name: count, dtype: int64

In [1733]:
aa_collision_set = set(merged_alias_overlap_df_2['alias_symbol'].tolist())

# <a id='toc6_'></a>[Common Records with Collisions](#toc0_)

In [1734]:
common_aa_collisions = ensg_alias_alias_collision_primary_symbol_set & hgnc_alias_alias_collision_primary_symbol_set & ncbi_alias_alias_collision_primary_symbol_set
common_aa_collisions

{'CUX1',
 'UNC119',
 'PAPPA',
 'POPDC2',
 'CCL16',
 'ZFP36L1',
 'SPCS3',
 'NAA20',
 'F2R',
 'RAB5IF',
 'ARFRP1',
 'PLXNB1',
 'RMDN1',
 'EREG',
 'GSTM1',
 'SEC14L2',
 'PPP1R13B',
 'ATP6V1B2',
 'CMC4',
 'ATP6V1G1',
 'SMN2',
 'MUC16',
 'IGKV1-12',
 'RNU4-6P',
 'LINC00670',
 'GLUD1',
 'LIPF',
 'SKIC2',
 'IL36RN',
 'CDK2AP2',
 'SHOX',
 'CACNA1C',
 'MSH2',
 'YWHAQ',
 'FANCD2',
 'MRO',
 'TOR1AIP1',
 'LAMP3',
 'DUOXA1',
 'OR7E4P',
 'HDLBP',
 'RPS14',
 'TRBV14',
 'NAPSA',
 'CCDC26',
 'SNORA73A',
 'IL17F',
 'HHIP',
 'LY6G6F-LY6G6D',
 'PDE7A',
 'SPATA2L',
 'HHLA2',
 'POTEM',
 'YLPM1',
 'TSHZ1',
 'DYNC1H1',
 'ZMYM1',
 'SNORA59B',
 'ASIP',
 'TRBV27',
 'ELMO2',
 'HDAC8',
 'SNRPN',
 'GTF2E1',
 'H3P5',
 'SCN1A',
 'HEATR5B',
 'POLE4',
 'ACP2',
 'TFAMP1',
 'GFER',
 'MAGEC3',
 'SLC25A24',
 'WDR1',
 'CCL2',
 'ZNF197',
 'LAMTOR3',
 'ZNF585B',
 'UHRF2',
 'IGKJ2',
 'PSPN',
 'RPSA',
 'EIF2B1',
 'LILRB3',
 'C17orf49',
 'H3P30',
 'ATRNL1',
 'PSMC3',
 'FOLH1',
 'SNORD13P3',
 'MPHOSPH10',
 'TNFRSF25',
 'MTUS1',
 

In [1735]:
len(common_aa_collisions)

1039
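
The `&` operator used above is plain Python set intersection: a symbol is kept only if it appears in all three sets. A toy sketch follows; the membership of each set is made up for illustration and does not reflect the real per-source sets.

```python
# Hypothetical per-source symbol sets for illustration only
ensg_symbols = {"TAS1R2", "ASIP", "PRAP1"}
hgnc_symbols = {"TAS1R2", "ASIP", "GPHA2"}
ncbi_symbols = {"TAS1R2", "ASIP", "PTEN"}

# Chained & keeps only the symbols present in every set
common = ensg_symbols & hgnc_symbols & ncbi_symbols

print(sorted(common))
```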

# <a id='toc7_'></a>[How many gene concept-alias relationships are there?](#toc0_)

## <a id='toc7_1_'></a>[Per Source](#toc0_)

In [1736]:
concept_alias_pairs_summary_index= 'HGNC', 'ENSG', 'NCBI'
concept_alias_pairs_summary = {'Number of Unique Gene Concept-Alias Pairs': [hgnc_concept_alias_pair_count, ensg_concept_alias_pair_count, ncbi_concept_alias_pair_count]}
concept_alias_pairs_summary_df = pd.DataFrame(concept_alias_pairs_summary, index= concept_alias_pairs_summary_index )
concept_alias_pairs_summary_df

Unnamed: 0,Number of Unique Gene Concept-Alias Pairs
HGNC,44584
ENSG,57362
NCBI,74053


## <a id='toc7_2_'></a>[Between All Sources](#toc0_)

In [1737]:
all_sources_concept_alias_pairs_df = pd.concat([mini_hgnc_df[['alias_symbol', 'gene_symbol']],mini_ensg_df[['alias_symbol', 'gene_symbol']], mini_ncbi_df[['alias_symbol', 'gene_symbol']]])

In [1738]:
len(all_sources_concept_alias_pairs_df)

175999

### <a id='toc7_2_1_'></a>[Remove duplicate concept-alias pairs](#toc0_)

In [1739]:
all_sources_concept_alias_pairs_df = all_sources_concept_alias_pairs_df.drop_duplicates(subset=['gene_symbol', 'alias_symbol'], keep='first')

In [1740]:
len(all_sources_concept_alias_pairs_df)

131864
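
The concatenated length of 175999 is the sum of the three per-source counts (44584 + 57362 + 74053), and `drop_duplicates` then removes pairs that more than one source reports. A toy sketch of the same two steps, with hypothetical per-source frames (`hgnc_pairs`, `ensg_pairs`, `ncbi_pairs` are illustrative names and contents):

```python
import pandas as pd

# Hypothetical per-source pairs for illustration only
hgnc_pairs = pd.DataFrame({"alias_symbol": ["TR2", "2F1"], "gene_symbol": ["TAS1R2", "KLRG1"]})
ensg_pairs = pd.DataFrame({"alias_symbol": ["TR2"], "gene_symbol": ["TAS1R2"]})
ncbi_pairs = pd.DataFrame({"alias_symbol": ["TR2", "TR2"], "gene_symbol": ["TAS1R2", "NR2C1"]})

# Stack the sources; the length is the sum of the individual lengths
stacked = pd.concat([hgnc_pairs, ensg_pairs, ncbi_pairs])

# Keep one copy of each (gene_symbol, alias_symbol) pair across sources
unique_pairs = stacked.drop_duplicates(subset=["gene_symbol", "alias_symbol"], keep="first")

print(len(stacked), len(unique_pairs))
```

The same alias can still appear several times after this step, as `TR2` does here, because uniqueness is enforced on the pair and not on the alias alone.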