## Tests for New HGNC Gene Table Updates

The following tests validate a new version of the HGNC gene table by comparing it against the HGNC complete set and the previous version of the table.

#### Test 1: Records Count
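Ensure the number of records in `gene_info` is greater than or equal to the number of records in `hgnc_complete_set` (after miRNA rows and rows without an `entrez_id` are removed from the complete set).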

#### Test 2: Deleted Rows Check
Verify that the rows deleted in `gene_info.txt` are due to duplicates.
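
This check relies on reshaping the pipe-delimited symbol columns into one row per symbol. A minimal sketch of that reshaping follows, using toy data; the helper name `explode_symbols` and the sample rows are illustrative assumptions, not part of the real pipeline.

```python
import pandas as pd


def explode_symbols(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return one row per (hgnc_id, entrez_id, symbol) across the given columns."""
    parts = []
    for col in columns:
        part = df[["hgnc_id", "entrez_id", col]].copy()
        # Split "A1|A2" into ["A1", "A2"], then give each element its own row.
        part[col] = part[col].str.split("|")
        part = part.explode(col).rename(columns={col: "symbol"})
        part["source_column"] = col  # remember where each symbol came from
        parts.append(part)
    long_df = pd.concat(parts, ignore_index=True)
    # Empty strings come from rows that had no value in that column.
    return long_df[long_df["symbol"] != ""].reset_index(drop=True)


# Toy data (hypothetical, not taken from the real HGNC files).
toy = pd.DataFrame({
    "hgnc_id": ["HGNC:1", "HGNC:2"],
    "entrez_id": ["10", "20"],
    "symbol": ["AAA", "BBB"],
    "alias_symbol": ["A1|A2", ""],
})
long_toy = explode_symbols(toy, ["symbol", "alias_symbol"])
print(long_toy)
```

Keeping a `source_column` label on every row makes it possible to tell later whether a symbol was a main symbol or an alias.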

#### Test 3: Duplicate Gene Symbols
Identify any duplicate gene symbols present in the `gene_info` table.

#### Test 4: Duplicate Entrez IDs
Identify any duplicate `entrez_id` values present in the `gene_info` table.
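
Tests 3 and 4 both reduce to counting how often each value occurs and keeping the values seen more than once. A small sketch on toy data, where the helper name `find_duplicates` and the sample rows are assumptions made for illustration:

```python
import pandas as pd


def find_duplicates(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the values of `column` that occur more than once, with their counts."""
    counts = df.groupby(column, as_index=False).size()
    counts.columns = [column, "freq"]
    return counts[counts["freq"] > 1].reset_index(drop=True)


# Toy data (hypothetical): "AAA" appears twice, every entrez_id is unique.
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB", "AAA", "CCC"],
    "entrez_id": ["10", "20", "30", "40"],
})
dup_symbols = find_duplicates(toy, "symbol")
dup_entrez = find_duplicates(toy, "entrez_id")
print(dup_symbols)
print(dup_entrez.empty)
```

An empty result means the column has no duplicates; a non-empty result lists the values to investigate.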

#### Test 5: Ambiguous Symbol Dropping
Check that ambiguous symbols in the `gene_info` table are resolved according to the following priority order, keeping the highest-priority occurrence and dropping the rest:
- `main_symbol` > `previous_symbol` > `alias_symbol`
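
The priority order above can be illustrated with a sort followed by `drop_duplicates`. This is a sketch under stated assumptions: the `PRIORITY` mapping, the helper name `resolve_by_priority`, and the toy rows are hypothetical, and ties between two rows at the same priority level are not handled here.

```python
import pandas as pd

# Lower number = higher priority (assumed encoding of the order above).
PRIORITY = {"symbol": 1, "prev_symbol": 2, "alias_symbol": 3}


def resolve_by_priority(long_df: pd.DataFrame) -> pd.DataFrame:
    """Keep, for each symbol, only the row from the highest-priority source column."""
    ranked = long_df.assign(priority=long_df["source_column"].map(PRIORITY))
    ranked = ranked.sort_values(["symbol", "priority"])
    kept = ranked.drop_duplicates(subset="symbol", keep="first")
    return kept.drop(columns="priority").reset_index(drop=True)


# Toy data (hypothetical): the same symbol is claimed by three genes.
toy = pd.DataFrame({
    "hgnc_id": ["HGNC:1", "HGNC:2", "HGNC:3"],
    "symbol": ["XYZ", "XYZ", "XYZ"],
    "source_column": ["alias_symbol", "symbol", "prev_symbol"],
})
winner = resolve_by_priority(toy)
print(winner)
```

In this toy case only the row for `HGNC:2` survives, because it uses `XYZ` as its main symbol.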

#### Test 6: Comparison of Current and Old Versions
Compare the current and old versions of `gene_info.txt` and list the symbols and Entrez IDs that differ between the two files but are not recorded in `gene_updates.md`.
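
One way to sketch such a version comparison is an outer merge with `indicator=True`, which labels each row as present in the old file only, the new file only, or both. The frames and the `documented` set below are toy assumptions standing in for the two `gene_info.txt` versions and the entries of `gene_updates.md`.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the old and new gene_info tables.
old = pd.DataFrame({"entrez_id": ["10", "20"], "symbol": ["AAA", "BBB"]})
new = pd.DataFrame({"entrez_id": ["10", "30"], "symbol": ["AAA", "CCC"]})

# Symbols already explained in the change log (hypothetical content).
documented = {"CCC"}

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
diff = old.merge(new, on=["entrez_id", "symbol"], how="outer", indicator=True)
only_old = set(diff.loc[diff["_merge"] == "left_only", "symbol"])
only_new = set(diff.loc[diff["_merge"] == "right_only", "symbol"])

# Changes in either direction that the change log does not mention.
undocumented = (only_old | only_new) - documented
print(undocumented)
```

The `_merge` column avoids having to test for empty strings or missing values on the unmatched side of the join.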


In [None]:
import pandas as pd

hgnc_complete_set_file_path = "hgnc_complete_set_oct_16_2024.txt"
gene_info_file_path = "gene_info-Oct-24-2024.txt"
gene_updates_list_path = "gene_updates_list.txt"
prev_gene_info_path = "prev_gene_info.txt"

complete_set = pd.read_csv(hgnc_complete_set_file_path, sep='\t', dtype=str).fillna("")
gene_info = pd.read_csv(gene_info_file_path, sep='\t', dtype=str).fillna("")
gene_updates_list = pd.read_csv(gene_updates_list_path, sep='\t', dtype=str).fillna("")
prev_gene_info = pd.read_csv(prev_gene_info_path, sep='\t', dtype=str).fillna("")

In [27]:
### Test1: Records count
# count of gene_info_records >= hgnc_complete_set records 

# Prepare complete_set by removing rows where:
#   - locus_type is 'RNA, micro' (miRNA rows)
#   - entrez_id is empty

complete_set = complete_set[complete_set['locus_type'] != 'RNA, micro']
complete_set = complete_set[complete_set['entrez_id'] != '']

complete_set = complete_set[['hgnc_id', 'symbol', 'alias_symbol', 'prev_symbol', 'entrez_id']]
gene_info = gene_info[['hgnc_id', 'symbol', 'synonyms', 'entrez_id']]

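# gene_info.txt also includes supplementary symbols (412 main supp + alias supp),
# so count(gene_info.txt) should be >= count(complete set)

if gene_info.shape[0] >= complete_set.shape[0]:
    print(f'The counts are :\n\tgene_info : {gene_info.shape[0]}\n\tcomplete_set : {complete_set.shape[0]}')
else:
    # Report the failing case explicitly instead of printing nothing
    print(f'gene_info has fewer records than complete_set :\n\tgene_info : {gene_info.shape[0]}\n\tcomplete_set : {complete_set.shape[0]}')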


The counts are :
	gene_info : 42331
	complete_set : 41900


In [28]:
### Test2: Verify that the rows deleted in gene_info.txt were removed because of duplicates.

# Explode symbols by '|' and compare
# Create DataFrames with hgnc_id, entrez_id, symbol columns
# Add the split alias and previous symbols as new rows to build a long-format table.
cs_dfs = []
for col in ['symbol', 'alias_symbol', 'prev_symbol']:
    temp_df = complete_set[['hgnc_id', 'entrez_id', col]].copy()
    temp_df[col] = temp_df[col].str.split('|')
    exploded_df = temp_df.explode(col).rename(columns={col: 'symbol'}).reset_index(drop=True)
    cs_dfs.append(exploded_df)
cs_exploded = pd.concat(cs_dfs, ignore_index=True)
cs_exploded = cs_exploded[cs_exploded['symbol'] != '']

gi_dfs = []
for col in ['symbol', 'synonyms']:
    temp_df = gene_info[['hgnc_id', 'entrez_id', col]].copy()
    temp_df[col] = temp_df[col].str.split('|')
    exploded_df = temp_df.explode(col).rename(columns={col: 'symbol'}).reset_index(drop=True)
    gi_dfs.append(exploded_df)
gi_exploded = pd.concat(gi_dfs, ignore_index=True)
gi_exploded = gi_exploded[gi_exploded['symbol'] != '']

#1. Remove rows where hgnc_id == '' in both DataFrames
#2. Join the two tables on hgnc_id, entrez_id, symbol
#3. Drop rows that have no match in the complete set (these rows come from the supp files)
#4. Rows that have no match in gene_info should have been dropped due to ambiguity; investigate any that were not

gi_exploded_renamed = gi_exploded.rename(columns={
    'hgnc_id': 'hgnc_id_gi',
    'entrez_id': 'entrez_id_gi',
    'symbol': 'symbol_gi'
})
merged_result = pd.merge(cs_exploded, gi_exploded_renamed, left_on=['hgnc_id', 'entrez_id', 'symbol'], right_on=['hgnc_id_gi', 'entrez_id_gi', 'symbol_gi'], how='outer')
merged_result = merged_result.fillna("")

merged_result = merged_result[merged_result['hgnc_id'] != ""]
merged_result = merged_result[
    ~((merged_result['hgnc_id_gi'] == "") & (merged_result['entrez_id_gi'] != "") & (merged_result['symbol_gi'] != ""))
]

# identify symbols in complete set that are not in gene info. 
# freq of those symbols in complete set - freq of these symbols in gene_info = 1 (always)
unmatched_rows = merged_result[merged_result['hgnc_id_gi'] == ""]
unmatched_symbols = unmatched_rows.groupby('symbol', as_index=False).size()
unmatched_symbols.columns = ['symbol', 'gi_freq']

# gene freq in complete set
cs_gene_freq = cs_exploded.groupby('symbol', as_index=False).size()
cs_gene_freq.columns = ['symbol', 'cs_freq']

merged_freq = pd.merge(unmatched_symbols, cs_gene_freq, on='symbol', how='left')
merged_freq['net_freq'] = merged_freq['cs_freq'] - merged_freq['gi_freq']
net_freq_greater_than_1 = merged_freq[merged_freq['net_freq'] > 1]
if not net_freq_greater_than_1.empty:
    print("There are unmatched symbols with a net frequency greater than 1.")
else:
    print("No unmatched symbols with a net frequency greater than 1.")

No unmatched symbols with a net frequency greater than 1.


In [29]:
### Test3: identify any duplicate gene symbols in the gene_info table

gene_info_duplicates = gi_exploded.groupby('symbol', as_index=False).size()
gene_info_duplicates.columns = ['symbol', 'gi_freq']
gene_info_freq_gt_1 = gene_info_duplicates[gene_info_duplicates['gi_freq'] > 1]
if not gene_info_freq_gt_1.empty:
    print("There are ambiguous symbols")
    print(gene_info_freq_gt_1)
else:
    print("No ambiguous symbols")

There are ambiguous symbols
      symbol  gi_freq
31582   H3.X        2
91196   ZASP        2


In [30]:
### Test4: identify any duplicate entrez_ids in the gene_info table

gene_info_duplicates = gene_info.groupby('entrez_id', as_index=False).size()
gene_info_duplicates.columns = ['entrez_id', 'gi_freq']
gene_info_freq_gt_1 = gene_info_duplicates[gene_info_duplicates['gi_freq'] > 1]
if not gene_info_freq_gt_1.empty:
    print("There are ambiguous entrez_id")
    print(gene_info_freq_gt_1)
else:
    print("No ambiguous entrez_id")

No ambiguous entrez_id


In [31]:
### Test5: Check that the ambiguous symbols in gene_info are dropped according to the order
# main_symbol > previous_symbol > alias_symbol

cs_dfs = []
for col in ['symbol', 'alias_symbol', 'prev_symbol']:
    temp_df = complete_set[['hgnc_id', 'entrez_id', col]].copy()
    temp_df[col] = temp_df[col].str.split('|')
    exploded_df = temp_df.explode(col).rename(columns={col: 'symbol'}).reset_index(drop=True)
    exploded_df['source_column'] = col
    cs_dfs.append(exploded_df)
cs_exploded = pd.concat(cs_dfs, ignore_index=True)
cs_exploded = cs_exploded[cs_exploded['symbol'] != '']

# Define the desired order for the 'source_column'
sort_order = {'symbol': 1, 'prev_symbol': 2, 'alias_symbol': 3}
cs_exploded['sort_order'] = cs_exploded['source_column'].map(sort_order)
cs_exploded = cs_exploded.sort_values(by=['symbol', 'sort_order']).reset_index(drop=True)

# Drop duplicates in the 'symbol' column, keeping the first occurrence in the sorted order
cs_exploded = cs_exploded.drop_duplicates(subset='symbol', keep='first').drop(columns='sort_order')

# Join gene_info.txt file to complete set
merged_df = pd.merge(cs_exploded, gi_exploded_renamed, left_on=['hgnc_id', 'entrez_id', 'symbol'], right_on=['hgnc_id_gi', 'entrez_id_gi', 'symbol_gi'], how='left')

# A left merge leaves NaN (not "") in the *_gi columns of unmatched rows, so test with isna()
unmatched_rows = merged_df[merged_df['hgnc_id_gi'].isna()]
unmatched_symbols = unmatched_rows.groupby('symbol', as_index=False).size()
unmatched_symbols.columns = ['symbol', 'gi_freq']
if not unmatched_symbols.empty:
    print("Some symbols in gene_info were not created following the priority order: main_symbol > previous_symbol > alias_symbol.")
    print(unmatched_symbols)
else:
    print("All symbols in gene_info were created following the priority order: main_symbol > previous_symbol > alias_symbol.")


All symbols in gene_info were created following the priority order: main_symbol > previous_symbol > alias_symbol.


In [32]:
### Test6: Compare the current and old versions of gene_info.txt
# and list symbols missing in the new file along with the gene_updates.md file.

# Input file: two columns, `gene` and `entrez`, holding all gene symbols and Entrez IDs
# from the gene_updates.md file, merged together in no particular order.
# Missing values were already filled with "" on load, so discard the empty string.
gene_set = set(gene_updates_list['gene']) - {""}
entrez_set = set(gene_updates_list['entrez']) - {""}
prev_gene_info = prev_gene_info[['hgnc_id', 'symbol', 'synonyms', 'entrez_id']]

prev_gi_dfs = []
for col in ['symbol', 'synonyms']:
    temp_df = prev_gene_info[['hgnc_id', 'entrez_id', col]].copy()
    temp_df[col] = temp_df[col].str.split('|')
    exploded_df = temp_df.explode(col).rename(columns={col: 'symbol'}).reset_index(drop=True)
    exploded_df['source_column'] = col
    prev_gi_dfs.append(exploded_df)
prev_gi_exploded = pd.concat(prev_gi_dfs, ignore_index=True)
prev_gi_exploded = prev_gi_exploded[prev_gi_exploded['symbol'] != '']

gi_dfs = []
for col in ['symbol', 'synonyms']:
    temp_df = gene_info[['hgnc_id', 'entrez_id', col]].copy()
    temp_df[col] = temp_df[col].str.split('|')
    exploded_df = temp_df.explode(col).rename(columns={col: 'symbol'}).reset_index(drop=True)
    exploded_df['source_column'] = col
    gi_dfs.append(exploded_df)
gi_exploded = pd.concat(gi_dfs, ignore_index=True)
gi_exploded = gi_exploded[gi_exploded['symbol'] != '']

gi_exploded_renamed = gi_exploded.rename(columns={
    'hgnc_id': 'hgnc_id_gi',
    'entrez_id': 'entrez_id_gi',
    'symbol': 'symbol_gi'
})

merged_result1 = pd.merge(prev_gi_exploded, gi_exploded_renamed, left_on=['hgnc_id', 'entrez_id', 'symbol'], right_on=['hgnc_id_gi', 'entrez_id_gi', 'symbol_gi'], how='outer', suffixes=('_df1', '_df2'))
merged_result1 = merged_result1.fillna("")
merged_result1.to_csv('b.txt', sep='\t')

#1. Select rows where symbol == '' in prev -> gene_updates.md should list the symbol
prev_symbol_check = merged_result1[(merged_result1['symbol'] == "") & (merged_result1['source_column_df2'] == 'symbol')]
prev_symbol_check_set = set(prev_symbol_check['symbol_gi']) - {""}
gene_difference = prev_symbol_check_set - gene_set
print(f"There are {len(gene_difference)} genes in the new gene_info.txt file that are missing from the old version and not listed in gene_updates.md.")
print(gene_difference)

#2. Select rows where entrez_id == '' in prev -> gene_updates.md should list the id
prev_entrez_check = merged_result1[(merged_result1['entrez_id'] == "") & (merged_result1['source_column_df2'] == 'symbol')]
prev_entrez_check_set = set(prev_entrez_check['entrez_id_gi']) - {""}
entrez_difference = prev_entrez_check_set - entrez_set
print(f"\n\nThere are {len(entrez_difference)} entrez_ids in the new gene_info.txt file that are missing from the old version and not listed in gene_updates.md.")
print(entrez_difference)

There are 173 genes in the new gene_info.txt file that are missing from the old version and not listed in gene_updates.md.
{'DNAJC2P1', 'ACO2P1', 'TIALD', 'KCNK10-AS1', 'LINC03139', 'DNAJC9P1', 'PYGO2-AS1', 'IL9RP6', 'GABRA6-AS1', 'SNRPNP2', 'WTAPP2', 'PIEZO1-AS1', 'BBLNP1', 'MIR550A3HG', 'EFCAB2-AS1', 'ERVE-5', 'ADAMTS16-AS1', 'ACER3-AS1', 'SNRPA1P2', 'LDLRAD4-AS2', 'DCAF12-AS1', 'DLGAP1-AS6', 'LINC03125', 'TGILR', 'DENRP1', 'TM7SF3-AS1', 'CYP3A4-AS1', 'NIPAL1P1', 'LNCEGFL7OS', 'MLDHR', 'CANT1P1', 'PERPP3', 'SYNCRIPP1', 'ATP5MFP7', 'CMPK1P2', 'PAGE4P1', 'ATP5PFP2', 'RPL35P10', 'CDK5RAP3P1', 'MPPE1P2', 'CDCA7P1', 'ATP6V1FP1', 'ANAPC13P1', 'CPSF6P1', 'KLHL5P1', 'C14orf119P1', 'GLYR1P1', 'RNF228', 'DENRP4', 'RNF4BP', 'ZNF143-AS1', 'KLHL12P1', 'SSBP3P6', 'ADAMTS7P5', 'SKP1P4', 'ALDOC-AS1', 'HNRNPA3P17', 'RPS8P11', 'PHF10P2', 'CRK-AS1', 'LNCARGI', 'TGFBRAP1-AS1', 'USP10P3', 'LIRIL2R', 'WDR5CP', 'SSBP3P5', 'ACO2P2', 'RBBP8-AS1', 'YY1-DT', 'TSHZ2-AS1', 'GLYATL1-AS1', 'TCEAL9P1', 'SSBP3P3', '