**Table of contents**<a id='toc0_'></a>    
- [ENSG](#toc1_)    
    - [Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc1_1_1_)    
    - [How many total unique gene records are there](#toc1_1_2_)    
    - [Drop rows with NAN in alias_symbol](#toc1_1_3_)    
    - [Make each row in alias_symbol a set:](#toc1_1_4_)    
    - [Explode the alias sets so that it is one per row](#toc1_1_5_)    
    - [How many total unique aliases are there](#toc1_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc1_1_7_)    
    - [Sort alias symbols alphabetically](#toc1_1_8_)    
    - [Number of records with an alias that is shared](#toc1_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc1_1_10_)    
      - [Save as csv](#toc1_1_10_1_)    
    - [Put columns in different order to ephasize alias symbols instead of gene records](#toc1_1_11_)    
    - [Merge rows with matching alias symbols](#toc1_1_12_)    
- [HGNC](#toc2_)    
    - [Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc2_1_1_)    
    - [How many total unique gene records are there](#toc2_1_2_)    
    - [Drop rows with NAN in alias_symbol](#toc2_1_3_)    
    - [Make each row in alias_symbol a set:](#toc2_1_4_)    
    - [Explode the alias sets so that it is one per row](#toc2_1_5_)    
    - [How many total unique aliases are there](#toc2_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc2_1_7_)    
    - [Sort alias symbols alphabetically](#toc2_1_8_)    
    - [Number of records with an alias that is shared](#toc2_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc2_1_10_)    
      - [Save as csv](#toc2_1_10_1_)    
    - [Put columns in different order to ephasize alias symbols instead of gene records](#toc2_1_11_)    
    - [Merge rows with matching alias symbols](#toc2_1_12_)    
- [NCBI Info](#toc3_)    
    - [Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc3_1_1_)    
    - [How many total unique gene records are there](#toc3_1_2_)    
    - [Drop rows with - in alias_symbol](#toc3_1_3_)    
    - [Make each row in alias_symbol a set:](#toc3_1_4_)    
    - [Explode the alias sets so that it is one per row](#toc3_1_5_)    
    - [How many unique aliases are there](#toc3_1_6_)    
    - [Pull out all the rows that have an alias symbol that can be found elsewhere](#toc3_1_7_)    
    - [Sort alias symbols alphabetically](#toc3_1_8_)    
    - [Number of records with an alias that is shared](#toc3_1_9_)    
    - [Count the number of times each multi-use alias is used](#toc3_1_10_)    
    - [Put columns in different order to ephasize alias symbols instead of gene records](#toc3_1_11_)    
    - [Merge rows with matching alias symbols](#toc3_1_12_)    
- [Merge to create Alias Overlap Table 1 - Gene Symbol](#toc4_)    
- [Merge to create Alias Overlap Table 2 - Alias Symbol](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [229]:
import pandas as pd
import numpy as np
import plotly.express as px

# <a id='toc1_'></a>[ENSG](#toc0_)

In [230]:
mini_ensg_df = pd.read_csv("ensg_alias_df.csv")

Note: duplicate gene symbols can have different ENSG ids

In [231]:
duplicateENSG_ID2 = mini_ensg_df[mini_ensg_df.duplicated('ensg_id', keep=False)]
duplicateENSG_ID2

Unnamed: 0.1,Unnamed: 0,gene_type,ref_id,gene_name,ensg_id,gene_synonyms,gene_symbol,entrez_id
38,1366,rRNA,495150,5.8S ribosomal RNA [Source:RFAM;Acc:RF00002],ENSG00000278294,,5_8S_rRNA,124907156
39,1367,rRNA,495150,5.8S ribosomal RNA [Source:RFAM;Acc:RF00002],ENSG00000278294,,5_8S_rRNA,124907485
40,1368,rRNA,495150,5.8S ribosomal RNA [Source:RFAM;Acc:RF00002],ENSG00000278294,,5_8S_rRNA,124908250
55,1386,protein_coding,2874124,"killer cell immunoglobulin like receptor, two ...",ENSG00000276779,"103AS, 15.212, CD158D",KIR2DL4,3805
56,1387,protein_coding,2874124,"killer cell immunoglobulin like receptor, two ...",ENSG00000276779,"103AS, 15.212, CD158D",KIR2DL4,124900568
...,...,...,...,...,...,...,...,...
52151,76329,snRNA,2949399,"RNA, variant U1 small nuclear 29 [Source:HGNC ...",ENSG00000273768,,RNVU1-29,124905574
52152,76330,snRNA,2949399,"RNA, variant U1 small nuclear 29 [Source:HGNC ...",ENSG00000273768,,RNVU1-29,124905808
52153,76331,snRNA,2949399,"RNA, variant U1 small nuclear 29 [Source:HGNC ...",ENSG00000273768,,RNVU1-29,124905809
52196,76395,protein_coding,2919059,phosphodiesterase 4D interacting protein [Sour...,ENSG00000178104,"CMYA2, KIAA0454, KIAA0477, MMGL",PDE4DIP,9659


In [232]:
duplicateENSG_ID2.to_csv('../ensg_duplicateENSG_ID2.csv', index=True)

### <a id='toc1_1_1_'></a>[Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc0_)

In [233]:
mini_ensg_df = mini_ensg_df.drop(['gene_type', 'ref_id', 'gene_name', 'Unnamed: 0'], axis=1)
mini_ensg_df = mini_ensg_df.rename(columns = {'gene_synonyms':'alias_symbol', 'ensg_id':'ENSG_ID' })
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
0,ENSG00000210049,"MTTF, trnF",MT-TF,0
1,ENSG00000211459,"12S, MOTS-c, MTRNR1",MT-RNR1,0
2,ENSG00000210077,"MTTV, trnV",MT-TV,0
3,ENSG00000210082,"16S, HN, MTRNR2",MT-RNR2,0
4,ENSG00000209082,"MTTL1, TRNL1",MT-TL1,0
...,...,...,...,...
52224,ENSG00000206764,,RNU6-152P,0
52225,ENSG00000231684,,EIF1P3,0
52226,ENSG00000264470,hsa-mir-4794,MIR4794,100616338
52227,ENSG00000162437,"FLJ10770, KIAA1579",RAVER2,55225


Note: duplicate ENSG ids can have different entrez ids

In [234]:
duplicateENSG_ID = mini_ensg_df[mini_ensg_df.duplicated('ENSG_ID', keep=False)]
duplicateENSG_ID

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
38,ENSG00000278294,,5_8S_rRNA,124907156
39,ENSG00000278294,,5_8S_rRNA,124907485
40,ENSG00000278294,,5_8S_rRNA,124908250
55,ENSG00000276779,"103AS, 15.212, CD158D",KIR2DL4,3805
56,ENSG00000276779,"103AS, 15.212, CD158D",KIR2DL4,124900568
...,...,...,...,...
52151,ENSG00000273768,,RNVU1-29,124905574
52152,ENSG00000273768,,RNVU1-29,124905808
52153,ENSG00000273768,,RNVU1-29,124905809
52196,ENSG00000178104,"CMYA2, KIAA0454, KIAA0477, MMGL",PDE4DIP,9659


In [235]:
duplicateENSG_ID.to_csv('../ensg_duplicateENSG_ID.csv', index=True)

### <a id='toc1_1_2_'></a>[How many total unique gene records are there](#toc0_)

By ENSG ID

In [236]:
ensg_gene_id_set = set(mini_ensg_df['ENSG_ID'])
len(ensg_gene_id_set)

46806

By gene symbol

In [237]:
ensg_gene_symbol_set = set(mini_ensg_df['gene_symbol'])
len(ensg_gene_symbol_set)

40353

Using IGHV as an example:

In [238]:
type(mini_ensg_df.gene_symbol[0])

str

In [239]:
# ensg_IGHV_alias_df = mini_ensg_df.loc[mini_ensg_df['gene_symbol'].str.contains("IGHV", case=False)]

In [240]:
ensg_IGHV_alias_df = mini_ensg_df[mini_ensg_df['gene_symbol'].str.contains('IGHV')]

In [241]:
ensg_IGHV_alias_df.to_csv('../hgnc_IGHV_alias_df.csv')

### <a id='toc1_1_3_'></a>[Drop genes with no aliases](#toc0_)

In [242]:
mini_ensg_df = mini_ensg_df[mini_ensg_df["alias_symbol"].str.contains("NaN") == False]
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
0,ENSG00000210049,"MTTF, trnF",MT-TF,0
1,ENSG00000211459,"12S, MOTS-c, MTRNR1",MT-RNR1,0
2,ENSG00000210077,"MTTV, trnV",MT-TV,0
3,ENSG00000210082,"16S, HN, MTRNR2",MT-RNR2,0
4,ENSG00000209082,"MTTL1, TRNL1",MT-TL1,0
...,...,...,...,...
52220,ENSG00000198216,"BII, CACH6, CACNL1A6, Cav2.3",CACNA1E,777
52221,ENSG00000179930,FLJ46813,ZNF648,127665
52226,ENSG00000264470,hsa-mir-4794,MIR4794,100616338
52227,ENSG00000162437,"FLJ10770, KIAA1579",RAVER2,55225


### <a id='toc1_1_4_'></a>[Make each row in alias_symbol a set:](#toc0_)
    covert to a list 
    make a set

In [243]:
mini_ensg_df['alias_symbol'] = mini_ensg_df['alias_symbol'].astype(str)
mini_ensg_df['alias_symbol'] = [x.split(',') for x in mini_ensg_df.alias_symbol]
mini_ensg_df['alias_symbol']=np.where(mini_ensg_df.alias_symbol=='','',mini_ensg_df.alias_symbol.map(set))
mini_ensg_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ensg_df['alias_symbol'] = mini_ensg_df['alias_symbol'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ensg_df['alias_symbol'] = [x.split(',') for x in mini_ensg_df.alias_symbol]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ensg_df['alias_symbol']=np.where(mini_ensg_df.a

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
0,ENSG00000210049,"{MTTF, trnF}",MT-TF,0


### <a id='toc1_1_5_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [244]:
mini_ensg_df = mini_ensg_df.explode('alias_symbol')
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
0,ENSG00000210049,MTTF,MT-TF,0
0,ENSG00000210049,trnF,MT-TF,0
1,ENSG00000211459,12S,MT-RNR1,0
1,ENSG00000211459,MOTS-c,MT-RNR1,0
1,ENSG00000211459,MTRNR1,MT-RNR1,0
...,...,...,...,...
52221,ENSG00000179930,FLJ46813,ZNF648,127665
52226,ENSG00000264470,hsa-mir-4794,MIR4794,100616338
52227,ENSG00000162437,FLJ10770,RAVER2,55225
52227,ENSG00000162437,KIAA1579,RAVER2,55225


### <a id='toc3_1_6_'></a>[How many unique aliases are there](#toc0_)

In [245]:
ensg_alias_symbol_set = set(mini_ensg_df['alias_symbol'])
ensg_alias_len = len(ensg_alias_symbol_set)
ensg_alias_len

54926

### <a id='toc3_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [246]:
mini_ensg_df['alias_duplicates'] = mini_ensg_df.duplicated(subset= 'alias_symbol', keep=False)
mini_ensg_df = mini_ensg_df[mini_ensg_df['alias_duplicates'] == True]
mini_ensg_df = mini_ensg_df.drop(['alias_duplicates'], axis=1)
mini_ensg_df.head(5)

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
5,ENSG00000198888,ND1,MT-ND1,4535
43,ENSG00000281486,G2SYN,SNTG2,54221
43,ENSG00000281486,SYN5,SNTG2,54221
44,ENSG00000262826,C1orf60,INTS3,65123
44,ENSG00000262826,INT3,INTS3,65123


### <a id='toc3_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [247]:
mini_ensg_df = mini_ensg_df.sort_values('alias_symbol')
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id
672,ENSG00000277362,15.212,KIR2DL4,124900568
8516,ENSG00000284365,15.212,KIR2DL4,124900568
8515,ENSG00000284365,15.212,KIR2DL4,3805
8349,ENSG00000284206,15.212,KIR2DL4,124900568
8348,ENSG00000284206,15.212,KIR2DL4,3805
...,...,...,...,...
23567,ENSG00000227211,p56,H3P45,0
7724,ENSG00000283997,promethin,LDAF1,57146
37151,ENSG00000011638,promethin,LDAF1,57146
27797,ENSG00000100181,psiTPTE22,TPTEP1,0


In [248]:
#ensg_CD158b_alias_count_df.to_csv('../hgnc_CD158b_alias_count_df.csv')

### <a id='toc3_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [249]:
ensg_gene_symbol_set_wsharedalias = set(mini_ensg_df['gene_symbol'])
len(ensg_gene_symbol_set_wsharedalias)

3680

In [250]:
mini_ensg_df['source'] = 'ENSG'
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id,source
672,ENSG00000277362,15.212,KIR2DL4,124900568,ENSG
8516,ENSG00000284365,15.212,KIR2DL4,124900568,ENSG
8515,ENSG00000284365,15.212,KIR2DL4,3805,ENSG
8349,ENSG00000284206,15.212,KIR2DL4,124900568,ENSG
8348,ENSG00000284206,15.212,KIR2DL4,3805,ENSG
...,...,...,...,...,...
23567,ENSG00000227211,p56,H3P45,0,ENSG
7724,ENSG00000283997,promethin,LDAF1,57146,ENSG
37151,ENSG00000011638,promethin,LDAF1,57146,ENSG
27797,ENSG00000100181,psiTPTE22,TPTEP1,0,ENSG


In [251]:
mini_ensg_df.loc[mini_ensg_df['alias_symbol'] == "RN5S3" ]

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id,source
9998,ENSG00000285168,RN5S3,RNA5S3,124905833,ENSG
9983,ENSG00000285168,RN5S3,RNA5S3,124905818,ENSG
9982,ENSG00000285168,RN5S3,RNA5S3,124905817,ENSG
9981,ENSG00000285168,RN5S3,RNA5S3,124905816,ENSG
9980,ENSG00000285168,RN5S3,RNA5S3,124905815,ENSG
...,...,...,...,...,...
43674,ENSG00000199337,RN5S3,RNA5S3,124905767,ENSG
43673,ENSG00000199337,RN5S3,RNA5S3,124905766,ENSG
43672,ENSG00000199337,RN5S3,RNA5S3,124905765,ENSG
43671,ENSG00000199337,RN5S3,RNA5S3,124905764,ENSG


### <a id='toc3_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [252]:
ensg_dup_alias_count_df = mini_ensg_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
ensg_dup_alias_count_df

alias_symbol
 15.212      82
 5T4-AG       2
 A1AT         2
 A3GALT1      2
 AAP          2
             ..
p42           2
p55           2
p56           2
promethin     2
psiTPTE22     2
Length: 5307, dtype: int64

In [253]:
ensg_dup_alias_count_df = ensg_dup_alias_count_df.reset_index()
ensg_dup_alias_count_df

Unnamed: 0,alias_symbol,0
0,15.212,82
1,5T4-AG,2
2,A1AT,2
3,A3GALT1,2
4,AAP,2
...,...,...
5302,p42,2
5303,p55,2
5304,p56,2
5305,promethin,2


In [254]:
ensg_dup_alias_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
ensg_dup_alias_count_df

Unnamed: 0,alias_symbol,num_gene_records
0,15.212,82
1,5T4-AG,2
2,A1AT,2
3,A3GALT1,2
4,AAP,2
...,...,...
5302,p42,2
5303,p55,2
5304,p56,2
5305,promethin,2


In [255]:
ensg_dup_alias_count_df = ensg_dup_alias_count_df.sort_values('num_gene_records', ascending=False)
ensg_dup_alias_count_df.head(30)

Unnamed: 0,alias_symbol,num_gene_records
4853,RN5S3,218
4849,RN5S2,216
4844,RN5S16,216
4845,RN5S17,216
4835,RN5S1,216
4865,RNA5-8N2,211
4837,RN5S11,108
4839,RN5S12,108
4840,RN5S13,108
4842,RN5S14,108


In [256]:
ensg_alias_alias_collision_set = set(ensg_dup_alias_count_df['alias_symbol'])
len(ensg_alias_alias_collision_set)

5307

In [257]:
ensg_dup_alias_count_df.to_csv('../ensg_dup_alias_count_df.csv', index=True)

In [258]:
ensg_alias_count_histogram_df = ensg_dup_alias_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
ensg_alias_count_histogram_df

num_gene_records
2      3738
3       394
4       173
5       112
6       160
7       278
8       258
9        17
10       50
11        6
12        4
13       11
14        8
15        5
16       15
18        1
21        2
25        4
26        1
28        1
30        4
31        6
32       10
33        9
35        8
41        2
42        4
43        5
53        1
82        3
108      11
211       1
216       4
218       1
dtype: int64

In [259]:
ensg_alias_count_histogram_df = ensg_alias_count_histogram_df.reset_index()
ensg_alias_count_histogram_df

Unnamed: 0,num_gene_records,0
0,2,3738
1,3,394
2,4,173
3,5,112
4,6,160
5,7,278
6,8,258
7,9,17
8,10,50
9,11,6


In [260]:
ensg_alias_count_histogram_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
ensg_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol
0,2,3738
1,3,394
2,4,173
3,5,112
4,6,160
5,7,278
6,8,258
7,9,17
8,10,50
9,11,6


In [261]:
ensg_alias_count_histogram_df['percent_alias_symbol'] = ((ensg_alias_count_histogram_df['num_alias_symbol'] / ensg_alias_len) * 100)
ensg_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,3738,6.80552
1,3,394,0.717329
2,4,173,0.314969
3,5,112,0.203911
4,6,160,0.291301
5,7,278,0.506136
6,8,258,0.469723
7,9,17,0.030951
8,10,50,0.091032
9,11,6,0.010924


In [262]:
ensg_alias_count_histogram_df = ensg_alias_count_histogram_df.drop('num_alias_symbol', axis=1)
ensg_alias_count_histogram_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,6.80552
1,3,0.717329
2,4,0.314969
3,5,0.203911
4,6,0.291301
5,7,0.506136
6,8,0.469723
7,9,0.030951
8,10,0.091032
9,11,0.010924


In [263]:
#px.bar(ensg_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

In [264]:
ensg_dup_alias_count_df.to_csv('../ensg_alias_overlap_count.csv', index=True)

#### <a id='toc1_1_10_1_'></a>[Save as csv](#toc0_)

In [265]:
#mini_ensg_df_explode.to_csv('../ensg_alias_overlap.csv', index=False)

### <a id='toc1_1_11_'></a>[Put columns in different order to ephasize alias symbols instead of gene records](#toc0_)

In [266]:
mini_ensg_df

Unnamed: 0,ENSG_ID,alias_symbol,gene_symbol,entrez_id,source
672,ENSG00000277362,15.212,KIR2DL4,124900568,ENSG
8516,ENSG00000284365,15.212,KIR2DL4,124900568,ENSG
8515,ENSG00000284365,15.212,KIR2DL4,3805,ENSG
8349,ENSG00000284206,15.212,KIR2DL4,124900568,ENSG
8348,ENSG00000284206,15.212,KIR2DL4,3805,ENSG
...,...,...,...,...,...
23567,ENSG00000227211,p56,H3P45,0,ENSG
7724,ENSG00000283997,promethin,LDAF1,57146,ENSG
37151,ENSG00000011638,promethin,LDAF1,57146,ENSG
27797,ENSG00000100181,psiTPTE22,TPTEP1,0,ENSG


In [267]:
mini_ensg_df_2 = mini_ensg_df[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
mini_ensg_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
672,15.212,ENSG00000277362,KIR2DL4,ENSG
8516,15.212,ENSG00000284365,KIR2DL4,ENSG
8515,15.212,ENSG00000284365,KIR2DL4,ENSG
8349,15.212,ENSG00000284206,KIR2DL4,ENSG
8348,15.212,ENSG00000284206,KIR2DL4,ENSG
...,...,...,...,...
23567,p56,ENSG00000227211,H3P45,ENSG
7724,promethin,ENSG00000283997,LDAF1,ENSG
37151,promethin,ENSG00000011638,LDAF1,ENSG
27797,psiTPTE22,ENSG00000100181,TPTEP1,ENSG


### <a id='toc1_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [268]:
mini_ensg_df_2 = mini_ensg_df_2.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [269]:
mini_ensg_df_2 = mini_ensg_df_2.applymap(str)
mini_ensg_df_2


  mini_ensg_df_2 = mini_ensg_df_2.applymap(str)


Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
672,15.212,ENSG00000277362,KIR2DL4,ENSG
26945,5T4-AG,ENSG00000146242,TPBG,ENSG
12461,A1AT,ENSG00000277377,SERPINA1,ENSG
5321,A3GALT1,ENSG00000281879,ABO,ENSG
2816,AAP,ENSG00000276838,SERPINF2,ENSG
...,...,...,...,...
50103,p55,ENSG00000117461,PIK3R3,ENSG
8105,p56,ENSG00000100764,PSMC1,ENSG
23567,p56,ENSG00000227211,H3P45,ENSG
7724,promethin,ENSG00000283997,LDAF1,ENSG


In [270]:
mini_ensg_df_2 = mini_ensg_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
mini_ensg_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,15.212,ENSG00000277362,KIR2DL4,ENSG
1,5T4-AG,ENSG00000146242,TPBG,ENSG
2,A1AT,ENSG00000277377,SERPINA1,ENSG
3,A3GALT1,ENSG00000281879,ABO,ENSG
4,AAP,ENSG00000276838,SERPINF2,ENSG
...,...,...,...,...
5302,p42,"ENSG00000100519, ENSG00000227443","PSMC6, H3P30",ENSG
5303,p55,"ENSG00000197170, ENSG00000117461","PSMD12, PIK3R3",ENSG
5304,p56,"ENSG00000100764, ENSG00000227211","PSMC1, H3P45",ENSG
5305,promethin,ENSG00000283997,LDAF1,ENSG


# <a id='toc2_'></a>[HGNC](#toc0_)

In [271]:
df1 = pd.read_csv("hgnc_filtered.csv")

### <a id='toc2_1_1_'></a>[Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc0_)

In [272]:
mini_hgnc_df = df1.drop(['Unnamed: 0', 'hgnc_id', 'locus_type', 'name', 'mane_select', 'locus_group', 'entrez_id', 'agr', 'refseq_accession', 'alias_name', 'ENSEMBLtrans', 'NA', 'unknown'], axis=1)
mini_hgnc_df = mini_hgnc_df.rename(columns = {'ensembl_gene_id':'ENSG_ID', 'symbol':'gene_symbol'})
mini_hgnc_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
0,A1BG,ENSG00000121410,
1,A1BG-AS1,ENSG00000268895,FLJ23569
2,A1CF,ENSG00000148584,ACF;ASP;ACF64;ACF65;APOBEC1CF
3,A2M,ENSG00000175899,FWP007;S863-7;CPAMD5
4,A2M-AS1,ENSG00000245105,
...,...,...,...
43159,ZYG11B,ENSG00000162378,FLJ13456
43160,ZYX,ENSG00000159840,
43161,ZYXP1,ENSG00000274572,
43162,ZZEF1,ENSG00000074755,KIAA0399;ZZZ4;FLJ10821


### <a id='toc2_1_2_'></a>[How many total unique gene records are there](#toc0_)

In [273]:
hgnc_gene_symbol_set = set(mini_hgnc_df['gene_symbol'])
len(hgnc_gene_symbol_set)

43164

Looking at IGHV as an example:

In [274]:
hgnc_IGHV_alias_df = mini_hgnc_df[mini_hgnc_df['gene_symbol'].str.contains('IGHV')]

In [275]:
type(mini_hgnc_df.gene_symbol[0])

str

In [276]:
hgnc_IGHV_alias_df.to_csv('../hgnc_IGHV_alias_df.csv')

In [277]:
hgnc_IGHV_alias_count_df = mini_hgnc_df.copy
hgnc_IGHV_alias_count_df = mini_hgnc_df.loc[mini_hgnc_df['gene_symbol'] == "IGHV1-24" ]
hgnc_IGHV_alias_count_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
12539,IGHV1-24,ENSG00000211950,


In [278]:
hgnc_IGHV_alias_count_df = mini_hgnc_df.copy
hgnc_IGHV_alias_count_df = mini_hgnc_df.loc[mini_hgnc_df['ENSG_ID'] == "ENSG00000211950" ]
hgnc_IGHV_alias_count_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
12539,IGHV1-24,ENSG00000211950,


### <a id='toc2_1_3_'></a>[Drop genes with no aliases](#toc0_)

In [279]:
mini_hgnc_df = mini_hgnc_df[mini_hgnc_df["alias_symbol"].str.contains("NaN") == False]
mini_hgnc_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
1,A1BG-AS1,ENSG00000268895,FLJ23569
2,A1CF,ENSG00000148584,ACF;ASP;ACF64;ACF65;APOBEC1CF
3,A2M,ENSG00000175899,FWP007;S863-7;CPAMD5
5,A2ML1,ENSG00000166535,FLJ25179;p170
9,A3GALT2,ENSG00000184389,IGBS3S;IGB3S
...,...,...,...
43156,ZXDC,ENSG00000070476,MGC11349;FLJ13861
43157,ZYG11A,ENSG00000203995,ZYG11
43159,ZYG11B,ENSG00000162378,FLJ13456
43162,ZZEF1,ENSG00000074755,KIAA0399;ZZZ4;FLJ10821


### <a id='toc2_1_4_'></a>[Make each row in alias_symbol a set:](#toc0_)
    covert to a list 
    make a set

In [280]:
mini_hgnc_df['alias_symbol'] = mini_hgnc_df['alias_symbol'].astype(str)
mini_hgnc_df['alias_symbol'] = [x.split(';') for x in mini_hgnc_df.alias_symbol]
mini_hgnc_df['alias_symbol']=np.where(mini_hgnc_df.alias_symbol=='','',mini_hgnc_df.alias_symbol.map(set))
mini_hgnc_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_hgnc_df['alias_symbol'] = mini_hgnc_df['alias_symbol'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_hgnc_df['alias_symbol'] = [x.split(';') for x in mini_hgnc_df.alias_symbol]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_hgnc_df['alias_symbol']=np.where(mini_hgnc_df.a

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
1,A1BG-AS1,ENSG00000268895,{FLJ23569}


### <a id='toc2_1_5_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [281]:
mini_hgnc_df = mini_hgnc_df.explode(column="alias_symbol")
mini_hgnc_df.head(5)

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
1,A1BG-AS1,ENSG00000268895,FLJ23569
2,A1CF,ENSG00000148584,ASP
2,A1CF,ENSG00000148584,ACF
2,A1CF,ENSG00000148584,ACF64
2,A1CF,ENSG00000148584,ACF65


Looking at CD158b as an example:

In [282]:
mini_hgnc_df.loc[mini_hgnc_df['gene_symbol'] == "CD158b" ]

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol


In [283]:
hgnc_CD158b_alias_count_df = mini_hgnc_df.copy
hgnc_CD158b_alias_count_df = mini_hgnc_df.loc[mini_hgnc_df['alias_symbol'] == "CD158b" ]
hgnc_CD158b_alias_count_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol


In [284]:
hgnc_CD158b_alias_count_df.to_csv('../hgnc_CD158b_alias_count_df.csv')

### <a id='toc2_1_6_'></a>[How many total unique aliases are there](#toc0_)

In [285]:
hgnc_alias_symbol_set = set(mini_hgnc_df['alias_symbol'])
hgnc_alias_len = len(hgnc_alias_symbol_set)
hgnc_alias_len

41589

### <a id='toc2_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [286]:
mini_hgnc_df['alias_duplicates'] = mini_hgnc_df.duplicated(subset= 'alias_symbol', keep=False)
mini_hgnc_df = mini_hgnc_df[mini_hgnc_df['alias_duplicates'] == True]
mini_hgnc_df = mini_hgnc_df.drop(['alias_duplicates'], axis=1)
mini_hgnc_df.head(5)

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
2,A1CF,ENSG00000148584,ASP
22,AAGAB,ENSG00000103591,p34
65,ABCB8,ENSG00000197150,M-ABC1
67,ABCB10,ENSG00000135776,M-ABC2
68,ABCB10P1,ENSG00000274099,M-ABC2


### <a id='toc2_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [287]:
mini_hgnc_df = mini_hgnc_df.sort_values('alias_symbol')
mini_hgnc_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol
35226,SLC25A5,ENSG00000005022,2F1
13978,KLRG1,ENSG00000139187,2F1
33987,S100A8,ENSG00000143546,60B8AG
33988,S100A9,ENSG00000163220,60B8AG
31335,RNU6V,ENSG00000206832,87U6
...,...,...,...
38078,TEX28P2,ENSG00000277008,pTEX
26200,PPP4R3A,ENSG00000100796,smk1
26203,PPP4R3C,ENSG00000224960,smk1
36662,SPATA2L,ENSG00000158792,tamo


### <a id='toc2_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [288]:
hgnc_gene_symbol_set_wsharedalias = set(mini_hgnc_df['gene_symbol'])
len(hgnc_gene_symbol_set_wsharedalias)

2084

In [289]:
mini_hgnc_df['source'] = 'HGNC'
mini_hgnc_df

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
35226,SLC25A5,ENSG00000005022,2F1,HGNC
13978,KLRG1,ENSG00000139187,2F1,HGNC
33987,S100A8,ENSG00000143546,60B8AG,HGNC
33988,S100A9,ENSG00000163220,60B8AG,HGNC
31335,RNU6V,ENSG00000206832,87U6,HGNC
...,...,...,...,...
38078,TEX28P2,ENSG00000277008,pTEX,HGNC
26200,PPP4R3A,ENSG00000100796,smk1,HGNC
26203,PPP4R3C,ENSG00000224960,smk1,HGNC
36662,SPATA2L,ENSG00000158792,tamo,HGNC


### <a id='toc2_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [290]:
hgnc_dup_alias_count_df = mini_hgnc_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
hgnc_dup_alias_count_df

alias_symbol
2F1       2
60B8AG    2
87U6      2
9G8       2
A1        3
         ..
p97       4
pH2A/f    2
pTEX      2
smk1      2
tamo      2
Length: 1040, dtype: int64

In [291]:
hgnc_dup_alias_count_df = hgnc_dup_alias_count_df.reset_index()
hgnc_dup_alias_count_df

Unnamed: 0,alias_symbol,0
0,2F1,2
1,60B8AG,2
2,87U6,2
3,9G8,2
4,A1,3
...,...,...
1035,p97,4
1036,pH2A/f,2
1037,pTEX,2
1038,smk1,2


In [292]:
print(hgnc_dup_alias_count_df.columns)

Index(['alias_symbol', 0], dtype='object')


In [293]:
hgnc_dup_alias_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
hgnc_dup_alias_count_df

Unnamed: 0,alias_symbol,num_gene_records
0,2F1,2
1,60B8AG,2
2,87U6,2
3,9G8,2
4,A1,3
...,...,...
1035,p97,4
1036,pH2A/f,2
1037,pTEX,2
1038,smk1,2


In [294]:
hgnc_dup_alias_count_df = hgnc_dup_alias_count_df.sort_values('num_gene_records', ascending=False)
hgnc_dup_alias_count_df.head(30)

Unnamed: 0,alias_symbol,num_gene_records
68,ASP,7
653,PAP,7
935,U4,7
118,CAP,6
568,MYM,6
267,F379,6
27,AIP1,6
1012,p40,5
31,ALP,5
571,NAP1,5


In [295]:
hgnc_alias_alias_collision_set = set(hgnc_dup_alias_count_df['alias_symbol'])
len(hgnc_alias_alias_collision_set)

1040

In [296]:
hgnc_dup_alias_count_df.to_csv('../hgnc_dup_alias_count_df.csv', index=True)

In [297]:
hgnc_alias_count_histogram_df = hgnc_dup_alias_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
hgnc_alias_count_histogram_df

num_gene_records
2    858
3    125
4     38
5     12
6      4
7      3
dtype: int64

In [298]:
hgnc_alias_count_histogram_df = hgnc_alias_count_histogram_df.reset_index()
hgnc_alias_count_histogram_df

Unnamed: 0,num_gene_records,0
0,2,858
1,3,125
2,4,38
3,5,12
4,6,4
5,7,3


In [299]:
hgnc_alias_count_histogram_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
hgnc_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol
0,2,858
1,3,125
2,4,38
3,5,12
4,6,4
5,7,3


In [300]:
hgnc_alias_count_histogram_df['percent_alias_symbol'] = ((hgnc_alias_count_histogram_df['num_alias_symbol'] / hgnc_alias_len) * 100)
hgnc_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,858,2.063046
1,3,125,0.30056
2,4,38,0.09137
3,5,12,0.028854
4,6,4,0.009618
5,7,3,0.007213


In [301]:
hgnc_alias_count_histogram_df = hgnc_alias_count_histogram_df.drop('num_alias_symbol', axis=1)
hgnc_alias_count_histogram_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,2.063046
1,3,0.30056
2,4,0.09137
3,5,0.028854
4,6,0.009618
5,7,0.007213


In [302]:
#px.bar(hgnc_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

#### <a id='toc2_1_10_1_'></a>[Save as csv](#toc0_)

In [303]:
#mini_hgnc_df_explode.to_csv('../hgnc_alias_overlap.csv', index=False)

### <a id='toc2_1_11_'></a>[Put columns in different order to ephasize alias symbols instead of gene records](#toc0_)

In [304]:
mini_hgnc_df_2 = mini_hgnc_df.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [305]:
mini_hgnc_df_2 = mini_hgnc_df[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
mini_hgnc_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
35226,2F1,ENSG00000005022,SLC25A5,HGNC
13978,2F1,ENSG00000139187,KLRG1,HGNC
33987,60B8AG,ENSG00000143546,S100A8,HGNC
33988,60B8AG,ENSG00000163220,S100A9,HGNC
31335,87U6,ENSG00000206832,RNU6V,HGNC
...,...,...,...,...
38078,pTEX,ENSG00000277008,TEX28P2,HGNC
26200,smk1,ENSG00000100796,PPP4R3A,HGNC
26203,smk1,ENSG00000224960,PPP4R3C,HGNC
36662,tamo,ENSG00000158792,SPATA2L,HGNC


### <a id='toc2_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [306]:
mini_hgnc_df_2 = mini_hgnc_df_2.applymap(str)
mini_hgnc_df_2

  mini_hgnc_df_2 = mini_hgnc_df_2.applymap(str)


Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
35226,2F1,ENSG00000005022,SLC25A5,HGNC
13978,2F1,ENSG00000139187,KLRG1,HGNC
33987,60B8AG,ENSG00000143546,S100A8,HGNC
33988,60B8AG,ENSG00000163220,S100A9,HGNC
31335,87U6,ENSG00000206832,RNU6V,HGNC
...,...,...,...,...
38078,pTEX,ENSG00000277008,TEX28P2,HGNC
26200,smk1,ENSG00000100796,PPP4R3A,HGNC
26203,smk1,ENSG00000224960,PPP4R3C,HGNC
36662,tamo,ENSG00000158792,SPATA2L,HGNC


In [307]:
mini_hgnc_df_2 = mini_hgnc_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
mini_hgnc_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,2F1,"ENSG00000005022, ENSG00000139187","SLC25A5, KLRG1",HGNC
1,60B8AG,"ENSG00000143546, ENSG00000163220","S100A8, S100A9",HGNC
2,87U6,"ENSG00000206832, ENSG00000065135","RNU6V, GNAI3",HGNC
3,9G8,"ENSG00000164609, ENSG00000115875","SLU7, SRSF7",HGNC
4,A1,"ENSG00000035928, ENSG00000163918, ENSG00000049541","RFC1, RFC4, RFC2",HGNC
...,...,...,...,...
1035,p97,"ENSG00000179409, ENSG00000110321, ENSG00000153...","GEMIN4, EIF4G2, CFDP1, VCP",HGNC
1036,pH2A/f,"ENSG00000234816, ENSG00000196787","H2AC5P, H2AC11",HGNC
1037,pTEX,"ENSG00000274962, ENSG00000277008","TEX28P1, TEX28P2",HGNC
1038,smk1,"ENSG00000100796, ENSG00000224960","PPP4R3A, PPP4R3C",HGNC


In [308]:
#mini_hgnc_df_2.to_csv('../hgnc_alias_overlap_2.csv', index=False)

# <a id='toc3_'></a>[NCBI Info](#toc0_)

In [309]:
df2 = pd.read_csv("ncbi_info_20220719_filtered.csv")

  df2 = pd.read_csv("ncbi_info_20220719_filtered.csv")


### <a id='toc3_1_1_'></a>[Drop all columns besides ENSG_ID, gene_symbol, and alias_symbol](#toc0_)

In [310]:
mini_ncbi_df = df2.drop(['Unnamed: 0', '#tax_id','GeneID', 'dbXrefs', 'description', 'type_of_gene', 'Symbol_from_nomenclature_authority', 'Full_name_from_nomenclature_authority', 'Other_designations', 'MIM', 'HGNC', 'AllianceGenome','MIRbase', 'IMGTgene_db', 'dash', 'unknown'], axis=1)
mini_ncbi_df = mini_ncbi_df.rename(columns = {'Symbol':'gene_symbol','Synonyms':'alias_symbol', 'ENSEMBL':'ENSG_ID'})
mini_ncbi_df['ENSG_ID'] = mini_ncbi_df['ENSG_ID'].astype(str)
mini_ncbi_df['ENSG_ID'] = mini_ncbi_df['ENSG_ID'].apply(str.upper)
mini_ncbi_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,A1B|ABG|GAB|HYST2477,ENSG00000121410
1,A2M,A2MD|CPAMD5|FWP007|S863-7,ENSG00000175899
2,A2MP1,A2MP,ENSG00000256069
3,NAT1,AAC1|MNAT|NAT-1|NATI,ENSG00000171428
4,NAT2,AAC2|NAT-2|PNAT,ENSG00000156006
...,...,...,...
75495,trnD,-,NAN
75496,trnP,-,NAN
75497,trnA,-,NAN
75498,COX1,-,NAN


### <a id='toc3_1_2_'></a>[How many total unique gene records are there](#toc0_)

In [311]:
ncbi_gene_symbol_set = set(mini_ncbi_df['gene_symbol'])
len(ncbi_gene_symbol_set)

75346

### <a id='toc3_1_3_'></a>[Drop genes with no aliases](#toc0_)

In [312]:
mini_ncbi_df = mini_ncbi_df.replace("-", np.nan)
mini_ncbi_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,A1B|ABG|GAB|HYST2477,ENSG00000121410
1,A2M,A2MD|CPAMD5|FWP007|S863-7,ENSG00000175899
2,A2MP1,A2MP,ENSG00000256069
3,NAT1,AAC1|MNAT|NAT-1|NATI,ENSG00000171428
4,NAT2,AAC2|NAT-2|PNAT,ENSG00000156006
...,...,...,...
75495,trnD,,NAN
75496,trnP,,NAN
75497,trnA,,NAN
75498,COX1,,NAN


In [313]:
mini_ncbi_df = mini_ncbi_df.dropna(subset=['alias_symbol'])
mini_ncbi_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,A1B|ABG|GAB|HYST2477,ENSG00000121410
1,A2M,A2MD|CPAMD5|FWP007|S863-7,ENSG00000175899
2,A2MP1,A2MP,ENSG00000256069
3,NAT1,AAC1|MNAT|NAT-1|NATI,ENSG00000171428
4,NAT2,AAC2|NAT-2|PNAT,ENSG00000156006
...,...,...,...
71686,LOC124906931,WASH,NAN
71738,LOC124906983,WASH,NAN
72857,LOC124908102,WASH,NAN
74876,POLGARF,ORF-Y|POLG,NAN


### <a id='toc3_1_4_'></a>[Make each row in alias_symbol a set:](#toc0_)
    covert to a list 
    make a set

In [314]:
mini_ncbi_df['alias_symbol'] = mini_ncbi_df['alias_symbol'].astype(str)
mini_ncbi_df['alias_symbol'] = [x.split('|') for x in mini_ncbi_df.alias_symbol]
mini_ncbi_df['alias_symbol']=np.where(mini_ncbi_df.alias_symbol=='','',mini_ncbi_df.alias_symbol.map(set))
mini_ncbi_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ncbi_df['alias_symbol'] = mini_ncbi_df['alias_symbol'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ncbi_df['alias_symbol'] = [x.split('|') for x in mini_ncbi_df.alias_symbol]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mini_ncbi_df['alias_symbol']=np.where(mini_ncbi_df.a

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,"{ABG, A1B, HYST2477, GAB}",ENSG00000121410


### <a id='toc3_1_5_'></a>[Explode the alias sets so that it is one per row](#toc0_)

In [315]:
mini_ncbi_df = mini_ncbi_df.explode(column="alias_symbol")
mini_ncbi_df.head(5)

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,ABG,ENSG00000121410
0,A1BG,A1B,ENSG00000121410
0,A1BG,HYST2477,ENSG00000121410
0,A1BG,GAB,ENSG00000121410
1,A2M,CPAMD5,ENSG00000175899


In [316]:
#ncbi_CD158b_alias_count_df.to_csv('../ncbi_CD158b_alias_count_df.csv')

### <a id='toc3_1_6_'></a>[How many unique aliases are there](#toc0_)

In [317]:
ncbi_alias_symbol_set = set(mini_ncbi_df['alias_symbol'])
ncbi_alias_len = len(ncbi_alias_symbol_set)
ncbi_alias_len

67454

### <a id='toc3_1_7_'></a>[Pull out all the rows that have an alias symbol that can be found elsewhere](#toc0_)

In [318]:
mini_ncbi_df['alias_duplicates'] = mini_ncbi_df.duplicated(subset= 'alias_symbol', keep=False)
mini_ncbi_df = mini_ncbi_df[mini_ncbi_df['alias_duplicates'] == True]
mini_ncbi_df = mini_ncbi_df.drop(['alias_duplicates'], axis=1)
mini_ncbi_df.head(5)

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
0,A1BG,A1B,ENSG00000121410
3,NAT1,AAC1,ENSG00000171428
3,NAT1,NAT-1,ENSG00000171428
4,NAT2,AAC2,ENSG00000156006
6,SERPINA3,ACT,ENSG00000196136


### <a id='toc3_1_8_'></a>[Sort alias symbols alphabetically](#toc0_)

In [319]:
mini_ncbi_df = mini_ncbi_df.sort_values('alias_symbol')
mini_ncbi_df.head(5)

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID
4534,PTEN,10q23del,ENSG00000171862
537,BMPR1A,10q23del,ENSG00000107779
199,ALOX12,12-LOX,ENSG00000108839
205,ALOX15,12-LOX,ENSG00000161905
245,SLC25A5,2F1,ENSG00000005022


### <a id='toc3_1_9_'></a>[Number of records with an alias that is shared](#toc0_)

In [320]:
ncbi_gene_symbol_set_wsharedalias = set(mini_ncbi_df['gene_symbol'])
len(ncbi_gene_symbol_set_wsharedalias)

5670

In [321]:
mini_ncbi_df['source'] = 'NCBI Info'
mini_ncbi_df

Unnamed: 0,gene_symbol,alias_symbol,ENSG_ID,source
4534,PTEN,10q23del,ENSG00000171862,NCBI Info
537,BMPR1A,10q23del,ENSG00000107779,NCBI Info
199,ALOX12,12-LOX,ENSG00000108839,NCBI Info
205,ALOX15,12-LOX,ENSG00000161905,NCBI Info
245,SLC25A5,2F1,ENSG00000005022,NCBI Info
...,...,...,...,...
12929,PPP4R3A,smk1,ENSG00000100796,NCBI Info
13546,PPP4R3B,smk1,ENSG00000275052,NCBI Info
18205,PPP4R3C,smk1,ENSG00000224960,NCBI Info
17549,SPATA2L,tamo,ENSG00000158792,NCBI Info


### <a id='toc3_1_10_'></a>[Count the number of times each multi-use alias is used](#toc0_)

In [322]:
ncbi_dup_alias_count_df = mini_ncbi_df.pivot_table(index = ['alias_symbol'], aggfunc ='size')
ncbi_dup_alias_count_df

alias_symbol
10q23del       2
12-LOX         2
2F1            2
3-alpha-HSD    2
35DAG          2
              ..
pTEX           2
polymerase     3
rpL7a          2
smk1           3
tamo           2
Length: 3427, dtype: int64

In [323]:
ncbi_dup_alias_count_df.to_csv('../ncbi_alias_overlap_count.csv', index=True)

In [324]:
ncbi_dup_alias_count_df = ncbi_dup_alias_count_df.reset_index()
ncbi_dup_alias_count_df

Unnamed: 0,alias_symbol,0
0,10q23del,2
1,12-LOX,2
2,2F1,2
3,3-alpha-HSD,2
4,35DAG,2
...,...,...
3422,pTEX,2
3423,polymerase,3
3424,rpL7a,2
3425,smk1,3


In [325]:
ncbi_dup_alias_count_df.rename(columns={0:'num_gene_records'}, inplace=True )
ncbi_dup_alias_count_df

Unnamed: 0,alias_symbol,num_gene_records
0,10q23del,2
1,12-LOX,2
2,2F1,2
3,3-alpha-HSD,2
4,35DAG,2
...,...,...
3422,pTEX,2
3423,polymerase,3
3424,rpL7a,2
3425,smk1,3


In [326]:
ncbi_dup_alias_count_df = ncbi_dup_alias_count_df.sort_values('num_gene_records', ascending=False)
ncbi_dup_alias_count_df.head(30)

Unnamed: 0,alias_symbol,num_gene_records
3260,VH,36
1284,H4-16,14
1297,H4C9,13
1286,H4C11,13
1287,H4C12,13
1288,H4C13,13
1296,H4C8,13
1289,H4C14,13
1290,H4C15,13
1291,H4C2,13


In [327]:
ncbi_alias_alias_collision_set = set(ncbi_dup_alias_count_df['alias_symbol'])
len(ncbi_alias_alias_collision_set)

3427

In [328]:
ncbi_dup_alias_count_df.to_csv('../ncbi_dup_alias_count_df.csv', index=True)

In [329]:
ncbi_alias_count_histogram_df = ncbi_dup_alias_count_df.pivot_table(index = ['num_gene_records'], aggfunc ='size')
ncbi_alias_count_histogram_df

num_gene_records
2     2748
3      404
4      142
5       51
6       24
7       17
8        7
9       15
10       2
11       1
12       1
13      13
14       1
36       1
dtype: int64

In [330]:
ncbi_alias_count_histogram_df = ncbi_alias_count_histogram_df.reset_index()
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,0
0,2,2748
1,3,404
2,4,142
3,5,51
4,6,24
5,7,17
6,8,7
7,9,15
8,10,2
9,11,1


In [331]:
ncbi_alias_count_histogram_df.rename(columns={0:'num_alias_symbol'}, inplace=True )
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol
0,2,2748
1,3,404
2,4,142
3,5,51
4,6,24
5,7,17
6,8,7
7,9,15
8,10,2
9,11,1


In [332]:
ncbi_alias_count_histogram_df['percent_alias_symbol'] = ((ncbi_alias_count_histogram_df['num_alias_symbol'] / ncbi_alias_len) * 100)
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,num_alias_symbol,percent_alias_symbol
0,2,2748,4.073887
1,3,404,0.598927
2,4,142,0.210514
3,5,51,0.075607
4,6,24,0.03558
5,7,17,0.025202
6,8,7,0.010377
7,9,15,0.022237
8,10,2,0.002965
9,11,1,0.001482


In [333]:
ncbi_alias_count_histogram_df = ncbi_alias_count_histogram_df.drop('num_alias_symbol', axis=1)
ncbi_alias_count_histogram_df

Unnamed: 0,num_gene_records,percent_alias_symbol
0,2,4.073887
1,3,0.598927
2,4,0.210514
3,5,0.075607
4,6,0.03558
5,7,0.025202
6,8,0.010377
7,9,0.022237
8,10,0.002965
9,11,0.001482


In [334]:
# px.bar(ncbi_alias_count_histogram_df, x='num_gene_records', y='percent_alias_symbol')

### <a id='toc3_1_11_'></a>[Put columns in different order to ephasize alias symbols instead of gene records](#toc0_)

In [335]:
mini_ncbi_df_2 = mini_ncbi_df.drop_duplicates(subset = ["alias_symbol", "gene_symbol"], keep = 'first')

In [336]:
mini_ncbi_df_2 = mini_ncbi_df[['alias_symbol', 'ENSG_ID', 'gene_symbol', 'source']]
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
4534,10q23del,ENSG00000171862,PTEN,NCBI Info
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
245,2F1,ENSG00000005022,SLC25A5,NCBI Info
...,...,...,...,...
12929,smk1,ENSG00000100796,PPP4R3A,NCBI Info
13546,smk1,ENSG00000275052,PPP4R3B,NCBI Info
18205,smk1,ENSG00000224960,PPP4R3C,NCBI Info
17549,tamo,ENSG00000158792,SPATA2L,NCBI Info


### <a id='toc3_1_12_'></a>[Merge rows with matching alias symbols](#toc0_)

In [337]:
mini_ncbi_df_2 = mini_ncbi_df_2.applymap(str)
mini_ncbi_df_2

  mini_ncbi_df_2 = mini_ncbi_df_2.applymap(str)


Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
4534,10q23del,ENSG00000171862,PTEN,NCBI Info
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
245,2F1,ENSG00000005022,SLC25A5,NCBI Info
...,...,...,...,...
12929,smk1,ENSG00000100796,PPP4R3A,NCBI Info
13546,smk1,ENSG00000275052,PPP4R3B,NCBI Info
18205,smk1,ENSG00000224960,PPP4R3C,NCBI Info
17549,tamo,ENSG00000158792,SPATA2L,NCBI Info


In [338]:
mini_ncbi_df_2['ENSG_ID'] = mini_ncbi_df_2['ENSG_ID'].str.replace('NAN','nan')
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
4534,10q23del,ENSG00000171862,PTEN,NCBI Info
537,10q23del,ENSG00000107779,BMPR1A,NCBI Info
199,12-LOX,ENSG00000108839,ALOX12,NCBI Info
205,12-LOX,ENSG00000161905,ALOX15,NCBI Info
245,2F1,ENSG00000005022,SLC25A5,NCBI Info
...,...,...,...,...
12929,smk1,ENSG00000100796,PPP4R3A,NCBI Info
13546,smk1,ENSG00000275052,PPP4R3B,NCBI Info
18205,smk1,ENSG00000224960,PPP4R3C,NCBI Info
17549,tamo,ENSG00000158792,SPATA2L,NCBI Info


In [339]:
mini_ncbi_df_2 = mini_ncbi_df_2.groupby('alias_symbol').agg({'ENSG_ID': ', '.join, 
                             'gene_symbol': ', '.join, 
                             'source':'first' }).reset_index()
mini_ncbi_df_2

Unnamed: 0,alias_symbol,ENSG_ID,gene_symbol,source
0,10q23del,"ENSG00000171862, ENSG00000107779","PTEN, BMPR1A",NCBI Info
1,12-LOX,"ENSG00000108839, ENSG00000161905","ALOX12, ALOX15",NCBI Info
2,2F1,"ENSG00000005022, ENSG00000139187","SLC25A5, KLRG1",NCBI Info
3,3-alpha-HSD,"ENSG00000198610, ENSG00000073737","AKR1C4, DHRS9",NCBI Info
4,35DAG,"ENSG00000170624, ENSG00000102683","SGCD, SGCG",NCBI Info
...,...,...,...,...
3422,pTEX,"ENSG00000274962, ENSG00000277008","TEX28P1, TEX28P2",NCBI Info
3423,polymerase,"nan, nan, nan","ERVK-11, ERVK-9, ERVK-19",NCBI Info
3424,rpL7a,"ENSG00000213272, ENSG00000240522","RPL7AP9, RPL7AP10",NCBI Info
3425,smk1,"ENSG00000100796, ENSG00000275052, ENSG00000224960","PPP4R3A, PPP4R3B, PPP4R3C",NCBI Info


# <a id='toc4_'></a>[Merge to create Alias Overlap Table 1 - Gene Symbol](#toc0_)

In [340]:
merged_alias_overlap_df_1 = pd.concat([mini_hgnc_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']],mini_ncbi_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']], mini_ensg_df[['gene_symbol', 'ENSG_ID', 'alias_symbol', 'source']]])
merged_alias_overlap_df_1

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
35226,SLC25A5,ENSG00000005022,2F1,HGNC
13978,KLRG1,ENSG00000139187,2F1,HGNC
33987,S100A8,ENSG00000143546,60B8AG,HGNC
33988,S100A9,ENSG00000163220,60B8AG,HGNC
31335,RNU6V,ENSG00000206832,87U6,HGNC
...,...,...,...,...
23567,H3P45,ENSG00000227211,p56,ENSG
7724,LDAF1,ENSG00000283997,promethin,ENSG
37151,LDAF1,ENSG00000011638,promethin,ENSG
27797,TPTEP1,ENSG00000100181,psiTPTE22,ENSG


In [341]:
merged_alias_overlap_df_1.to_csv('../merged_alias_overlap_df_1.csv', index=False)

In [342]:
merged_alias_overlap_df_1.loc[merged_alias_overlap_df_1.gene_symbol == 'H4-16']

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source


In [343]:
merged_alias_overlap_df_1.loc[merged_alias_overlap_df_1.alias_symbol == 'H4-16' ]

Unnamed: 0,gene_symbol,ENSG_ID,alias_symbol,source
6491,H4C13,ENSG00000275126,H4-16,NCBI Info
6487,H4C3,ENSG00000197061,H4-16,NCBI Info
6483,H4C4,ENSG00000277157,H4-16,NCBI Info
6486,H4C11,ENSG00000197238,H4-16,NCBI Info
6482,H4C1,ENSG00000278637,H4-16,NCBI Info
6493,H4C14,ENSG00000270882,H4-16,NCBI Info
6484,H4C6,ENSG00000274618,H4-16,NCBI Info
6485,H4C12,ENSG00000273542,H4-16,NCBI Info
17448,H4C16,ENSG00000197837,H4-16,NCBI Info
6430,H4C9,ENSG00000276180,H4-16,NCBI Info


In [344]:
merged_alias_overlap_df_1['source'].value_counts()

source
ENSG         20879
NCBI Info     8247
HGNC          2348
Name: count, dtype: int64

# <a id='toc5_'></a>[Merge to create Alias Overlap Table 2 - Alias Symbol](#toc0_)

In [345]:
merged_alias_overlap_df_2 = pd.concat([mini_hgnc_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']],mini_ncbi_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']], mini_ensg_df_2[['alias_symbol', 'gene_symbol', 'ENSG_ID', 'source']]])
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
0,2F1,"SLC25A5, KLRG1","ENSG00000005022, ENSG00000139187",HGNC
1,60B8AG,"S100A8, S100A9","ENSG00000143546, ENSG00000163220",HGNC
2,87U6,"RNU6V, GNAI3","ENSG00000206832, ENSG00000065135",HGNC
3,9G8,"SLU7, SRSF7","ENSG00000164609, ENSG00000115875",HGNC
4,A1,"RFC1, RFC4, RFC2","ENSG00000035928, ENSG00000163918, ENSG00000049541",HGNC
...,...,...,...,...
5302,p42,"PSMC6, H3P30","ENSG00000100519, ENSG00000227443",ENSG
5303,p55,"PSMD12, PIK3R3","ENSG00000197170, ENSG00000117461",ENSG
5304,p56,"PSMC1, H3P45","ENSG00000100764, ENSG00000227211",ENSG
5305,promethin,LDAF1,ENSG00000283997,ENSG


In [346]:
merged_alias_overlap_df_2.loc[merged_alias_overlap_df_2['alias_symbol'] == "RN5S3" ]

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
4853,RN5S3,RNA5S3,ENSG00000285168,ENSG


In [347]:
merged_alias_overlap_df_2['gene_symbol'] = merged_alias_overlap_df_2['gene_symbol'].str.split(",")
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source
0,2F1,"[SLC25A5, KLRG1]","ENSG00000005022, ENSG00000139187",HGNC
1,60B8AG,"[S100A8, S100A9]","ENSG00000143546, ENSG00000163220",HGNC
2,87U6,"[RNU6V, GNAI3]","ENSG00000206832, ENSG00000065135",HGNC
3,9G8,"[SLU7, SRSF7]","ENSG00000164609, ENSG00000115875",HGNC
4,A1,"[RFC1, RFC4, RFC2]","ENSG00000035928, ENSG00000163918, ENSG00000049541",HGNC
...,...,...,...,...
5302,p42,"[PSMC6, H3P30]","ENSG00000100519, ENSG00000227443",ENSG
5303,p55,"[PSMD12, PIK3R3]","ENSG00000197170, ENSG00000117461",ENSG
5304,p56,"[PSMC1, H3P45]","ENSG00000100764, ENSG00000227211",ENSG
5305,promethin,[LDAF1],ENSG00000283997,ENSG


In [348]:
merged_alias_overlap_df_2['gene_symbol_count'] = [len(c) for c in merged_alias_overlap_df_2['gene_symbol']]
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
0,2F1,"[SLC25A5, KLRG1]","ENSG00000005022, ENSG00000139187",HGNC,2
1,60B8AG,"[S100A8, S100A9]","ENSG00000143546, ENSG00000163220",HGNC,2
2,87U6,"[RNU6V, GNAI3]","ENSG00000206832, ENSG00000065135",HGNC,2
3,9G8,"[SLU7, SRSF7]","ENSG00000164609, ENSG00000115875",HGNC,2
4,A1,"[RFC1, RFC4, RFC2]","ENSG00000035928, ENSG00000163918, ENSG00000049541",HGNC,3
...,...,...,...,...,...
5302,p42,"[PSMC6, H3P30]","ENSG00000100519, ENSG00000227443",ENSG,2
5303,p55,"[PSMD12, PIK3R3]","ENSG00000197170, ENSG00000117461",ENSG,2
5304,p56,"[PSMC1, H3P45]","ENSG00000100764, ENSG00000227211",ENSG,2
5305,promethin,[LDAF1],ENSG00000283997,ENSG,1


In [349]:
merged_alias_overlap_df_2 = merged_alias_overlap_df_2.sort_values(by='gene_symbol_count', ascending= False)
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
3260,VH,"[IGHV1-24, IGHV3-43, IGHM, IGHV3-9, IGHV3-...","ENSG00000211950, ENSG00000232216, ENSG00000211...",NCBI Info,36
1284,H4-16,"[H4C13, H4C3, H4C4, H4C11, H4C1, H4C14, ...","ENSG00000275126, ENSG00000197061, ENSG00000277...",NCBI Info,14
1287,H4C12,"[H4C15, H4C5, H4C3, H4C1, H4C4, H4C6, H4...","ENSG00000270276, ENSG00000276966, ENSG00000197...",NCBI Info,13
1294,H4C5,"[H4C6, H4C4, H4C12, H4C16, H4C11, H4C14, ...","ENSG00000274618, ENSG00000277157, ENSG00000273...",NCBI Info,13
1292,H4C3,"[H4C14, H4C4, H4C1, H4C8, H4C5, H4C15, H...","ENSG00000270882, ENSG00000277157, ENSG00000278...",NCBI Info,13
...,...,...,...,...,...
1937,PNUTL4,[SEPTIN9],ENSG00000282302,ENSG,1
1936,PNOCR,[OPRL1],ENSG00000125510,ENSG,1
1935,PNMA4,[MOAP1],ENSG00000165943,ENSG,1
1932,PMSR3,[PMS2P3],ENSG00000291092,ENSG,1


In [350]:
merged_alias_overlap_df_2['source'].value_counts()

source
ENSG         5307
NCBI Info    3427
HGNC         1040
Name: count, dtype: int64

In [351]:
merged_alias_overlap_df_2.to_csv('../merged_alias_overlap_df_2.csv', index=True)

# DGIdb ambiguous query

In [352]:
dgidb_gene_df = pd.read_csv("dgidb_genes_df.tsv", sep='\t')
dgidb_gene_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version
0,MN1,Gene Symbol,hgnc:7180,MN1,CarisMolecularIntelligence,5-Jun-23
1,P2RX7,Gene Symbol,hgnc:8537,P2RX7,ChEMBL,32
2,CHRNA7,Gene Symbol,hgnc:1960,CHRNA7,ChEMBL,32
3,MAPK8,Gene Symbol,hgnc:6881,MAPK8,ChEMBL,32
4,MAPK10,Gene Symbol,hgnc:6872,MAPK10,ChEMBL,32
...,...,...,...,...,...,...
73884,MAPKAPK5,Gene Name,hgnc:6889,MAPKAPK5,DTC,5-Jun-23
73885,PTGER3,Gene Name,hgnc:9595,PTGER3,DTC,5-Jun-23
73886,CD1D,Gene Name,hgnc:1637,CD1D,DTC,5-Jun-23
73887,SLC26A1,Gene Symbol,hgnc:10993,SLC26A1,IDG,15-Jul-19


In [353]:
symbol_col_comparison = dgidb_gene_df['name'] == dgidb_gene_df['gene_claim_name']
symbol_col_comparison.value_counts()

True     62221
False    11668
Name: count, dtype: int64

In [354]:
symbol_col_comparison

0        True
1        True
2        True
3        True
4        True
         ... 
73884    True
73885    True
73886    True
73887    True
73888    True
Length: 73889, dtype: bool

In [355]:
dgidb_gene_df.dtypes

name                 object
nomenclature         object
concept_id           object
gene_claim_name      object
source_db_name       object
source_db_version    object
dtype: object

In [356]:
dgidb_gene_df.query('name != gene_claim_name')

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version
25,HSP90,Gene Symbol,hgnc:5253,HSP90AA1,ChEMBL,32
100,RPLP,Gene Symbol,,,ChEMBL,32
167,SEPT6,Gene Symbol,hgnc:15848,SEPTIN6,CarisMolecularIntelligence,5-Jun-23
169,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,CarisMolecularIntelligence,5-Jun-23
213,WISP3,Gene Symbol,hgnc:12771,CCN6,CarisMolecularIntelligence,5-Jun-23
...,...,...,...,...,...,...
73631,ENSEMBL:ENSG00000185821,Ensembl Gene ID,hgnc:31305,OR6C76,RussLampel,26-Jul-11
73642,C11ORF30,Gene Symbol,hgnc:18071,EMSY,FoundationOneGenes,5-Jun-23
73727,ENSEMBL:ENSG00000183921,Ensembl Gene ID,hgnc:35414,SDR42E2,RussLampel,26-Jul-11
73872,MR,NCBI Gene Name,hgnc:7979,NR3C2,BaderLab,14-Feb


In [357]:
no_claim_symbols_df = dgidb_gene_df[dgidb_gene_df['gene_claim_name'].isnull()]
no_claim_symbols_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version
100,RPLP,Gene Symbol,,,ChEMBL,32
353,EMBA,Gene Symbol,,,ChEMBL,32
355,RPLQ,Gene Symbol,,,ChEMBL,32
402,ILES,Gene Symbol,,,ChEMBL,32
439,MURA,Gene Symbol,,,ChEMBL,32
...,...,...,...,...,...,...
26386,CCNK-CDK13_HUMAN,Gene Symbol,,,GO,5-Jun-23
26387,INO80_HUMAN-1,Gene Symbol,,,GO,5-Jun-23
26388,CCNC-CDK3_HUMAN,Gene Symbol,,,GO,5-Jun-23
26573,TRYPTASE_B2_HUMAN,Gene Symbol,,,GO,5-Jun-23


In [358]:
no_name_symbols_df = dgidb_gene_df[dgidb_gene_df['name'].isnull()]
no_name_symbols_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version
463,,Gene Symbol,hgnc:12811,XK,ChEMBL,32
4643,,Gene Name,hgnc:12811,XK,DrugBank,5.1.9


In [359]:
dgidb_name_set = set(dgidb_gene_df['name'])
len(dgidb_name_set)

21619

In [360]:
dgidb_gene_claim_name_set = set(dgidb_gene_df['gene_claim_name'])
len(dgidb_gene_claim_name_set)

11287

In [361]:
name_ensg_notmatch = dgidb_name_set.difference(ensg_gene_symbol_set)
len(name_ensg_notmatch)

10422

In [362]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(ensg_gene_symbol_set)
len(gene_claim_name_ensg_notmatch)

107

In [363]:
cleaned_gene_claim_name_ensg_notmatch = [x for x in gene_claim_name_ensg_notmatch if str(x) != 'NaN']
len(cleaned_gene_claim_name_ensg_notmatch)

107

In [364]:
name_hgnc_notmatch = dgidb_name_set.difference(hgnc_gene_symbol_set)
len(name_hgnc_notmatch)

10374

In [365]:
name_hngc_notmatch_aacollision = name_hgnc_notmatch.intersection(hgnc_alias_alias_collision_set)
len(name_hngc_notmatch_aacollision)

29

In [366]:
cleaned_name_hgnc_notmatch = [x for x in name_hgnc_notmatch if str(x) != 'NaN']
len(cleaned_name_hgnc_notmatch)

10374

In [367]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_notmatch)

66

In [368]:
name_ncbi_notmatch = dgidb_name_set.difference(ncbi_gene_symbol_set)
len(name_ncbi_notmatch)

10351

In [369]:
name_ncbi_notmatch_aacollision = name_ncbi_notmatch.intersection(ncbi_alias_alias_collision_set)
len(name_ncbi_notmatch_aacollision)

96

In [370]:
name_ncbi_hgnc_notmatch = name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(name_ncbi_hgnc_notmatch)

10335

In [371]:
name_ncbi_hgnc_ensg_notmatch = name_ncbi_hgnc_notmatch.difference(ensg_gene_symbol_set)
len(name_ncbi_hgnc_ensg_notmatch)

10331

In [372]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(ncbi_gene_symbol_set)
len(gene_claim_name_ncbi_notmatch)

63

In [373]:
gene_claim_name_ncbi_hngc_notmatch = gene_claim_name_ncbi_notmatch.difference(hgnc_gene_symbol_set)
len(gene_claim_name_ncbi_hngc_notmatch)

47

In [374]:
gene_claim_name_ncbi_hngc_ensg_notmatch = gene_claim_name_ncbi_hngc_notmatch.difference(ensg_gene_symbol_set)
len(gene_claim_name_ncbi_hngc_ensg_notmatch)

44

In [375]:
name_hgnc_match = dgidb_name_set.intersection(hgnc_gene_symbol_set)
len(name_hgnc_match)

11245

In [376]:
name_hgnc_match_aacollision = name_hgnc_match.intersection(hgnc_alias_alias_collision_set)
len(name_hgnc_match_aacollision)

29

In [377]:
name_ensg_match = dgidb_name_set.intersection(ensg_gene_symbol_set)
len(name_ensg_match)

11197

In [378]:
name_ncbi_match = dgidb_name_set.intersection(ncbi_gene_symbol_set)
len(name_ncbi_match)

11268

In [379]:
name_ncbi_match_aacollision = name_ncbi_match.intersection(ncbi_alias_alias_collision_set)
len(name_ncbi_match_aacollision)

112

In [380]:
name_ncbi_ensg_match = name_ncbi_match.intersection(ensg_gene_symbol_set)
len(name_ncbi_ensg_match)

11178

In [381]:
name_ncbi_ensg_hgnc_match = name_ncbi_ensg_match.intersection(hgnc_gene_symbol_set)
len(name_ncbi_ensg_hgnc_match)

11176

In [382]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_match)

11221

In [383]:
gene_claim_name_ensg_match = dgidb_gene_claim_name_set.intersection(ensg_gene_symbol_set)
len(gene_claim_name_ensg_match)

11180

In [384]:
gene_claim_name_ensg_aacollision_match = dgidb_gene_claim_name_set.intersection(ensg_alias_alias_collision_set)
len(gene_claim_name_ensg_aacollision_match)

14

In [385]:
gene_claim_name_hgnc_aacollision_match = dgidb_gene_claim_name_set.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_aacollision_match)

30

In [386]:
gene_claim_name_ncbi_aacollision_match = dgidb_gene_claim_name_set.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_aacollision_match)

113

In [387]:
name_ensg_aacollision_match = dgidb_name_set.intersection(ensg_alias_alias_collision_set)
len(name_ensg_aacollision_match)

43

In [388]:
name_hgnc_aacollision_match = dgidb_name_set.intersection(hgnc_alias_alias_collision_set)
len(name_hgnc_aacollision_match)

58

In [389]:
name_ncbi_aacollision_match = dgidb_name_set.intersection(ncbi_alias_alias_collision_set)
len(name_ncbi_aacollision_match)

208

In [390]:
gene_claim_name_hgnc_notmatch = dgidb_gene_claim_name_set.difference(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_notmatch)

66

In [391]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_notmatch_aacollision)

1

In [392]:
gene_claim_name_ensg_notmatch = dgidb_gene_claim_name_set.difference(ensg_gene_symbol_set)
len(gene_claim_name_ensg_notmatch)

107

In [393]:
gene_claim_name_ncbi_notmatch = dgidb_gene_claim_name_set.difference(ncbi_gene_symbol_set)
len(gene_claim_name_ncbi_notmatch)

63

In [394]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_notmatch_aacollision)

1

In [395]:
gene_claim_name_hgnc_match = dgidb_gene_claim_name_set.intersection(hgnc_gene_symbol_set)
len(gene_claim_name_hgnc_match)

11221

In [396]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_match_aacollision)

29

In [397]:
gene_claim_name_ncbi_match = dgidb_gene_claim_name_set.intersection(ncbi_gene_symbol_set)
len(gene_claim_name_ncbi_match)

11224

In [398]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_match_aacollision)

112

In [399]:
name_ensg_match_aacollision = name_ensg_match.intersection(ensg_alias_alias_collision_set)
len(name_ensg_match_aacollision)

13

In [400]:
len(gene_claim_name_ncbi_hngc_ensg_notmatch)

44

In [401]:
len(name_ncbi_hgnc_ensg_notmatch)


10331

In [402]:
len(dgidb_name_set)

21619

In [403]:
len(dgidb_gene_claim_name_set)

11287

In [404]:
name_ensg_notmatch_aacollision = name_ensg_notmatch.intersection(ensg_alias_alias_collision_set)
len(name_ensg_notmatch_aacollision)

30

In [405]:
gene_claim_name_ensg_match_aacollision = gene_claim_name_ensg_match.intersection(ensg_alias_alias_collision_set)
len(gene_claim_name_ensg_match_aacollision)

13

In [406]:
gene_claim_name_ensg_notmatch_aacollision = gene_claim_name_ensg_notmatch.intersection(ensg_alias_alias_collision_set)
len(gene_claim_name_ensg_notmatch_aacollision)

1

In [407]:
gene_claim_name_ncbi_notmatch_aacollision = gene_claim_name_ncbi_notmatch.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_notmatch_aacollision)

1

In [408]:
gene_claim_name_ncbi_match_aacollision = gene_claim_name_ncbi_match.intersection(ncbi_alias_alias_collision_set)
len(gene_claim_name_ncbi_match_aacollision)

112

In [409]:
gene_claim_name_hgnc_match_aacollision = gene_claim_name_hgnc_match.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_match_aacollision)

29

In [410]:
gene_claim_name_hgnc_notmatch_aacollision = gene_claim_name_hgnc_notmatch.intersection(hgnc_alias_alias_collision_set)
len(gene_claim_name_hgnc_notmatch_aacollision)

1

Pull out instances of claim symbols that match to a primary gene symbol and the corresponding group symbols not matching to a primary gene symbol. Check for patterns of modes of error


In [411]:
dgidb_gene_df['hgnc_claim_match_status'] = dgidb_gene_df['gene_claim_name'].isin(hgnc_gene_symbol_set)
dgidb_gene_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version,hgnc_claim_match_status
0,MN1,Gene Symbol,hgnc:7180,MN1,CarisMolecularIntelligence,5-Jun-23,True
1,P2RX7,Gene Symbol,hgnc:8537,P2RX7,ChEMBL,32,True
2,CHRNA7,Gene Symbol,hgnc:1960,CHRNA7,ChEMBL,32,True
3,MAPK8,Gene Symbol,hgnc:6881,MAPK8,ChEMBL,32,True
4,MAPK10,Gene Symbol,hgnc:6872,MAPK10,ChEMBL,32,True
...,...,...,...,...,...,...,...
73884,MAPKAPK5,Gene Name,hgnc:6889,MAPKAPK5,DTC,5-Jun-23,True
73885,PTGER3,Gene Name,hgnc:9595,PTGER3,DTC,5-Jun-23,True
73886,CD1D,Gene Name,hgnc:1637,CD1D,DTC,5-Jun-23,True
73887,SLC26A1,Gene Symbol,hgnc:10993,SLC26A1,IDG,15-Jul-19,True


In [412]:
dgidb_gene_df['hgnc_name_match_status'] = dgidb_gene_df['name'].isin(hgnc_gene_symbol_set)
dgidb_gene_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version,hgnc_claim_match_status,hgnc_name_match_status
0,MN1,Gene Symbol,hgnc:7180,MN1,CarisMolecularIntelligence,5-Jun-23,True,True
1,P2RX7,Gene Symbol,hgnc:8537,P2RX7,ChEMBL,32,True,True
2,CHRNA7,Gene Symbol,hgnc:1960,CHRNA7,ChEMBL,32,True,True
3,MAPK8,Gene Symbol,hgnc:6881,MAPK8,ChEMBL,32,True,True
4,MAPK10,Gene Symbol,hgnc:6872,MAPK10,ChEMBL,32,True,True
...,...,...,...,...,...,...,...,...
73884,MAPKAPK5,Gene Name,hgnc:6889,MAPKAPK5,DTC,5-Jun-23,True,True
73885,PTGER3,Gene Name,hgnc:9595,PTGER3,DTC,5-Jun-23,True,True
73886,CD1D,Gene Name,hgnc:1637,CD1D,DTC,5-Jun-23,True,True
73887,SLC26A1,Gene Symbol,hgnc:10993,SLC26A1,IDG,15-Jul-19,True,True


In [413]:
claim_true_name_false_df = dgidb_gene_df.loc[dgidb_gene_df['hgnc_claim_match_status'] & ~dgidb_gene_df['hgnc_name_match_status']]
claim_true_name_false_df

Unnamed: 0,name,nomenclature,concept_id,gene_claim_name,source_db_name,source_db_version,hgnc_claim_match_status,hgnc_name_match_status
25,HSP90,Gene Symbol,hgnc:5253,HSP90AA1,ChEMBL,32,True,False
167,SEPT6,Gene Symbol,hgnc:15848,SEPTIN6,CarisMolecularIntelligence,5-Jun-23,True,False
169,SEPT5,Gene Symbol,hgnc:9164,SEPTIN5,CarisMolecularIntelligence,5-Jun-23,True,False
213,WISP3,Gene Symbol,hgnc:12771,CCN6,CarisMolecularIntelligence,5-Jun-23,True,False
294,MLL2,Gene Name,hgnc:7133,KMT2D,CGI,5-Jun-23,True,False
...,...,...,...,...,...,...,...,...
73631,ENSEMBL:ENSG00000185821,Ensembl Gene ID,hgnc:31305,OR6C76,RussLampel,26-Jul-11,True,False
73642,C11ORF30,Gene Symbol,hgnc:18071,EMSY,FoundationOneGenes,5-Jun-23,True,False
73727,ENSEMBL:ENSG00000183921,Ensembl Gene ID,hgnc:35414,SDR42E2,RussLampel,26-Jul-11,True,False
73872,MR,NCBI Gene Name,hgnc:7979,NR3C2,BaderLab,14-Feb,True,False


In [414]:
merged_alias_overlap_df_2

Unnamed: 0,alias_symbol,gene_symbol,ENSG_ID,source,gene_symbol_count
3260,VH,"[IGHV1-24, IGHV3-43, IGHM, IGHV3-9, IGHV3-...","ENSG00000211950, ENSG00000232216, ENSG00000211...",NCBI Info,36
1284,H4-16,"[H4C13, H4C3, H4C4, H4C11, H4C1, H4C14, ...","ENSG00000275126, ENSG00000197061, ENSG00000277...",NCBI Info,14
1287,H4C12,"[H4C15, H4C5, H4C3, H4C1, H4C4, H4C6, H4...","ENSG00000270276, ENSG00000276966, ENSG00000197...",NCBI Info,13
1294,H4C5,"[H4C6, H4C4, H4C12, H4C16, H4C11, H4C14, ...","ENSG00000274618, ENSG00000277157, ENSG00000273...",NCBI Info,13
1292,H4C3,"[H4C14, H4C4, H4C1, H4C8, H4C5, H4C15, H...","ENSG00000270882, ENSG00000277157, ENSG00000278...",NCBI Info,13
...,...,...,...,...,...
1937,PNUTL4,[SEPTIN9],ENSG00000282302,ENSG,1
1936,PNOCR,[OPRL1],ENSG00000125510,ENSG,1
1935,PNMA4,[MOAP1],ENSG00000165943,ENSG,1
1932,PMSR3,[PMS2P3],ENSG00000291092,ENSG,1


In [415]:
data = {}

for row in merged_alias_overlap_df_2.itertuples():
    print(row)
    break

Pandas(Index=3260, alias_symbol='VH', gene_symbol=['IGHV1-24', ' IGHV3-43', ' IGHM', ' IGHV3-9', ' IGHV3-16', ' IGHV2-5', ' IGHV3-20', ' IGHV3-21', ' IGHV1-58', ' IGHV1-45', ' IGHV3-35', ' IGHV3-33', ' IGHV3-11', ' IGHV3-30', ' IGHV3-7', ' IGHV3-38', ' IGHV3-15', ' IGHV3-74', ' SLC7A4', ' IGHV6-1', ' IGHV5-51', ' IGHV4-61', ' IGHV4-59', ' IGHV4-39', ' IGHV4-28', ' IGHV4-4', ' IGHV2-26', ' IGHV3-73', ' IGHV3-72', ' IGHV4-34', ' IGHV3-66', ' IGHV3-48', ' IGHV3-53', ' IGHV3-64', ' IGHV3-49', ' IGHV2-70'], ENSG_ID='ENSG00000211950, ENSG00000232216, ENSG00000211899, nan, ENSG00000211944, ENSG00000211937, ENSG00000211946, ENSG00000211947, ENSG00000211968, ENSG00000211961, ENSG00000211957, ENSG00000211955, ENSG00000211941, ENSG00000270550, ENSG00000211938, ENSG00000211958, ENSG00000211943, ENSG00000224650, ENSG00000099960, ENSG00000211933, ENSG00000211966, ENSG00000211970, ENSG00000224373, ENSG00000211959, ENSG00000211952, ENSG00000276775, ENSG00000211951, ENSG00000211976, ENSG00000225698, EN

In [416]:
import yaml
import os

folder_path = 'new_alias-alias_collision_records'
os.makedirs(folder_path, exist_ok=True)

data = []

for row in merged_alias_overlap_df_2.itertuples():
    
    collision_record = {
        "collision_symbol": row.alias_symbol,
    }

    collision_group = []

    len_gene_symbols = len(row.gene_symbol)
    ensg_ids = [r.strip() for r in row.ENSG_ID.split(",")]
    len_ensg_ids = len(ensg_ids)

    # if len_gene_symbols != len_ensg_ids:
    #     print(row)
    for i in range(0, len_gene_symbols):
        collision_group_item = {
            "gene_symbol": row.gene_symbol[i],
            "ensg_id": ensg_ids[i].upper()
        }
        collision_group.append(collision_group_item)

    collision_record["collision_group"] = collision_group
    data.append(collision_record)

    file_path = os.path.join(folder_path, f"{(row.alias_symbol.replace('/', '_'))}_collision_record.yaml")

    with open(file_path, "w") as wf:
        yaml.dump(collision_record, wf, default_flow_style=False)

data

[{'collision_symbol': 'VH',
  'collision_group': [{'gene_symbol': 'IGHV1-24',
    'ensg_id': 'ENSG00000211950'},
   {'gene_symbol': ' IGHV3-43', 'ensg_id': 'ENSG00000232216'},
   {'gene_symbol': ' IGHM', 'ensg_id': 'ENSG00000211899'},
   {'gene_symbol': ' IGHV3-9', 'ensg_id': 'NAN'},
   {'gene_symbol': ' IGHV3-16', 'ensg_id': 'ENSG00000211944'},
   {'gene_symbol': ' IGHV2-5', 'ensg_id': 'ENSG00000211937'},
   {'gene_symbol': ' IGHV3-20', 'ensg_id': 'ENSG00000211946'},
   {'gene_symbol': ' IGHV3-21', 'ensg_id': 'ENSG00000211947'},
   {'gene_symbol': ' IGHV1-58', 'ensg_id': 'ENSG00000211968'},
   {'gene_symbol': ' IGHV1-45', 'ensg_id': 'ENSG00000211961'},
   {'gene_symbol': ' IGHV3-35', 'ensg_id': 'ENSG00000211957'},
   {'gene_symbol': ' IGHV3-33', 'ensg_id': 'ENSG00000211955'},
   {'gene_symbol': ' IGHV3-11', 'ensg_id': 'ENSG00000211941'},
   {'gene_symbol': ' IGHV3-30', 'ensg_id': 'ENSG00000270550'},
   {'gene_symbol': ' IGHV3-7', 'ensg_id': 'ENSG00000211938'},
   {'gene_symbol': ' IGH