# HGNC Gene List Validation

This notebook demonstrates how to merge gene lists and validate human gene symbols
using the official HGNC (HUGO Gene Nomenclature Committee) nomenclature.

The validation logic is implemented in a reusable Python script located in:

scripts/gene_symbols/validate_hgnc_symbols.py

The script merges gene lists and queries the official HGNC REST API to validate human gene symbols and to detect previous (deprecated) symbols or aliases.

Endpoints used:

- `/fetch/symbol/{symbol}`: checks whether an approved HGNC gene symbol exists.
- `/search/{query}`: detects previous (deprecated) symbols and aliases.

The queries require an internet connection, and the results depend on the current state of the HGNC database.
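
As a rough illustration of how these two endpoints could be called, the sketch below builds the request URLs and reads the symbols out of a JSON reply. It is not the code in `validate_hgnc_symbols.py`: the helper names are hypothetical, the fielded search form (`/search/{field}/{value}` with fields such as `prev_symbol` or `alias_symbol`) and the `response` / `docs` / `symbol` layout of the reply are assumptions based on the public HGNC REST documentation and should be checked against it.

```python
import json
import urllib.parse
import urllib.request

HGNC_BASE_URL = "https://rest.genenames.org"  # public HGNC REST service


def build_fetch_url(symbol: str) -> str:
    """URL for /fetch/symbol/{symbol}."""
    return f"{HGNC_BASE_URL}/fetch/symbol/{urllib.parse.quote(symbol)}"


def build_search_url(field: str, value: str) -> str:
    """URL for a fielded search (assumed fields: prev_symbol, alias_symbol)."""
    return f"{HGNC_BASE_URL}/search/{field}/{urllib.parse.quote(value)}"


def get_json(url: str, timeout: float = 10.0) -> dict:
    """Ask the service for JSON output (requires an internet connection)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


def extract_symbols(payload: dict) -> list:
    """Collect the 'symbol' field of every document in a reply (assumed layout)."""
    docs = payload.get("response", {}).get("docs", [])
    return [doc["symbol"] for doc in docs if "symbol" in doc]
```

With these helpers, `extract_symbols(get_json(build_fetch_url("BRAF")))` would return a non-empty list when the symbol is approved, provided the reply has the assumed layout.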

For gene list merging only, refer to the second and third cells.
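
The merging step itself needs no network access. The script's own `merge_gene_lists` is not reproduced in this notebook, so the following is only a minimal sketch of what such a merge might look like, under the assumptions that the inputs are comma-separated strings and that the result is a sorted list of unique symbols (inferred from the recorded outputs further down). The name `merge_gene_lists_sketch` is hypothetical.

```python
def merge_gene_lists_sketch(*gene_lists: str) -> list:
    """Merge comma-separated gene lists into a sorted list of unique symbols."""
    unique = set()
    for text in gene_lists:
        for token in text.split(","):
            symbol = token.strip()
            if symbol:  # skip empty entries such as trailing commas
                unique.add(symbol)
    return sorted(unique)


merged = merge_gene_lists_sketch("BRAF, ACTB, BLM", "BLM, CBL,")
print(merged)  # ['ACTB', 'BLM', 'BRAF', 'CBL']
```

The sketch leaves letter case untouched, since it is not known whether the real function normalizes it.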

In [None]:
# Environment setup to allow importing project scripts

import sys
from pathlib import Path

# Resolve the project root (repo root).
# This assumes the notebook is run from the project's root directory.
# Path().resolve().parents[2] is avoided because it raises an IndexError
# when the working directory has fewer than three parent directories.
PROJECT_ROOT = Path().resolve()
sys.path.append(str(PROJECT_ROOT))

# Import reusable functions from the scripts module
from scripts.gene_symbols.validate_hgnc_symbols import (
    merge_gene_lists,
    validate_gene_list
)
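
The setup cell above only works when the notebook is started from the repository root. One possible way to drop that assumption is to walk up from the working directory until a folder containing `scripts` is found. This is a sketch rather than part of the project: `find_project_root` and its `marker` parameter are hypothetical names.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "scripts") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")
```

`PROJECT_ROOT = find_project_root(Path())` could then replace the fixed assignment, at the cost of failing with an explicit error when no such folder exists.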

## Example: Merging and validating two gene lists

The following example uses predefined gene lists to ensure reproducibility.

In [None]:
# Example gene lists
genes_1= "ACTB, ACTG1, AMMECR1, ARCN1, ATR, B3GAT3, BCSTL, BLM, BMP2, BRAF, CBL, CCDC8, CDC45, CDC6, CDKN1C, CDT1, CENPJ, CEP152, CEP63, COL27A1, CREBBP, CUL7, DHCR7, DONSON, EP300, FGD1, FGFR3, FN1, GH1, GHR, GHRHR, GHSR, GLI2, GNAS, HDAC8, HESX1, HMGA2, HRAS, IDUA, IGF1, IGF1R, IGF2, IGFALS, INSR, IRS1, KRAS, LARP7, LFNG, LHX3, LHX4, LZTR1, MAP2K1, MAP2K2, NIPBL, NOTCH2, NRAS, OBSL1, ORC1, ORC4, ORC6, OSGEP, OTX2, PCNT, PISD, PITX2, PLAG1, POC1A, POP1, POU1F1, PPP3CA, PRMT7, PROP1, PTPN11, PUF60, RAD21, RAF1, RALA, RASA2, RBBP8, RIT1, RNU4ATAC, RRAS, RTTN, SGMS2, SHOC2, SHOX, SMARCA2, SMARCE1, SMC1A, SMC3, SOS1, SOX11, SOX2, SOX3, SRCAP, STAT5B, TALDO1, TBX19, TBX2, TBX3, TOP3A, TRIM37, TRMT10A, XRCC4"
genes_2= "ACAN, ACTB, ACTG1, ALMS1, AMMECR1, ANKRD11, ARCN1, ARID1A, ARID1B, ATR, ATRIP, B3GAT3, BLM, BMP2, BRAF, BRF1, BTK, CBL, CCDC8, CDC45, CDC6, CDT1, CENPJ, CEP152, CEP63, COL10A1, COL11A1, COL11A2, COL1A1, COL27A1, COL2A1, COL9A1, COL9A2, COL9A3, COMP, CREBBP, CRIPT, CUL7, DHCR7, DNA2, DONSON, DVL1, EP300, ERCC6, ERCC8, EVC, EVC2, FANCA, FANCC, FANCG, FBN1, FGD1, FGFR3, FN1, GH1, GHR, GHRHR, GHSR, GLI2, GLI3, GNAS, HDAC8, HESX1, HRAS, HSPG2, IDUA, IGF1, IGF1R, IGF2, IGFALS, IHH, INSR, KDM6A, KMT2D, KRAS, LARP7, LFNG, LHX3, LHX4, LIG4, LMNA, LZTR1, MAP2K1, MAP2K2, MATN3, MRAS, NBN, NF1, NIPBL, NOTCH2, NPPC, NRAS, NSMCE2, OBSL1, ORC1, ORC4, ORC6, OSGEP, OTX2, PCNT, PDE4D, PIK3R1, PISD, PLK4, POC1A, POP1, POU1F1, PPP1CB, PPP3CA, PRKAR1A, PRMT7, PROP1, PTH1R, PTPN11, PUF60, RAD21, RAF1, RALA, RASA2, RBBP8, RIT1, RNU4ATAC, ROR2, RPS6KA3, RRAS, RTTN, SGMS2, SHOC2, SHOX, SMARCA2, SMARCA4, SMARCAL1, SMARCB1, SMARCE1, SMC1A, SMC3, SOS1, SOS2, SOX11, SOX2, SOX3, SOX9, SPRED1, SRCAP, STAT5B, TALDO1, TBX2, TBX3, TOP3A, TRIM37, TRMT10A, WNT5A, XRCC4"

# Merge lists
genes_unidos = merge_gene_lists(genes_1, genes_2)

print(f"Total number of unique gene symbols: {len(genes_unidos)}")

# Validate against HGNC
genes_aprobados, genes_renombrados, genes_no_encontrados = validate_gene_list(genes_unidos)
print(f"Your new list of {len(genes_aprobados)} genes is: {', '.join(genes_aprobados)}")

if genes_renombrados:
    print("\nIMPORTANT: The following gene symbols were updated according to the HGNC database:")
    for viejo, nuevo in genes_renombrados:
        print(f"{viejo} -> {nuevo}")

if genes_no_encontrados:
    print("\nThe following gene symbols were not found in the HGNC database or are invalid:")
    print(", ".join(genes_no_encontrados))
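
The three values returned by `validate_gene_list` suggest a simple classification: a symbol is either approved, replaced by a current symbol, or not found. The sketch below shows one way this logic could be organized, with the two HGNC lookups passed in as functions so that it runs offline. It is an assumption about the structure, not the script's actual implementation, and `classify_symbols`, `is_approved` and `find_current` are hypothetical names. The stand-in data reuses symbols that appear in this notebook's recorded output.

```python
from typing import Callable, Iterable, Optional


def classify_symbols(
    symbols: Iterable[str],
    is_approved: Callable[[str], bool],
    find_current: Callable[[str], Optional[str]],
):
    """Split symbols into approved, renamed (old, new) pairs, and not found."""
    approved, renamed, not_found = [], [], []
    for symbol in symbols:
        if is_approved(symbol):
            approved.append(symbol)
            continue
        current = find_current(symbol)
        if current is None:
            not_found.append(symbol)
        else:
            # Keep the current symbol in the approved list and record the change.
            approved.append(current)
            renamed.append((symbol, current))
    return approved, renamed, not_found


# Offline stand-ins for the two HGNC lookups (illustrative data only).
approved_set = {"BRAF", "CPAP"}
previous_to_current = {"CENPJ": "CPAP"}

result = classify_symbols(
    ["BRAF", "CENPJ", "BCSTL"],
    is_approved=approved_set.__contains__,
    find_current=previous_to_current.get,
)
print(result)  # (['BRAF', 'CPAP'], [('CENPJ', 'CPAP')], ['BCSTL'])
```

Passing the lookups as arguments keeps the classification testable without an internet connection; in real use they would wrap the REST queries.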

## Optional: Interactive usage

The following cells accept interactive input.  
They are optional and intended for local execution only.

In [None]:
# Use this cell if you only need to merge your gene lists without validating HGNC symbols.
genes1 = input("Please write your first list: ")
genes2 = input("Now, please write your second list: ")
genesALL_ = merge_gene_lists(genes1, genes2)
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"Your new list of {len(genesALL_)} genes is: {', '.join(genesALL_)}")

Please write your first list: POP1, POU1F1, PPP1CB, PPP3CA, PRKAR1A, PRMT7, PROP1, PTH1R, PTPN11, PUF60, RAD21, RAF1, RALA, RASA2, RBBP8, RIT1, RNU4ATAC, ROR2, RPS6KA3, RRAS, RTTN, SGMS2, SHOC2, SHOX, SMARCA2, SMARCA4, SMARCAL1, SMARCB1, SMARCE1, SMC1A, SMC3, SOS1, SOS2, SOX11, SOX2, SOX3, SOX9, SPRED1, SRCAP, STAT5B
Now, please write your second list: B3GAT3, BCSTL, BLM, BMP2, BRAF, CBL, CCDC8, CDC45, CDC6, CDKN1C, CDT1, CENPJ, CEP152, CEP63, COL27A1, CREBBP, CUL7, DHCR7, DONSON, EP300
Your new list of 60 genes is: B3GAT3, BCSTL, BLM, BMP2, BRAF, CBL, CCDC8, CDC45, CDC6, CDKN1C, CDT1, CENPJ, CEP152, CEP63, COL27A1, CREBBP, CUL7, DHCR7, DONSON, EP300, POP1, POU1F1, PPP1CB, PPP3CA, PRKAR1A, PRMT7, PROP1, PTH1R, PTPN11, PUF60, RAD21, RAF1, RALA, RASA2, RBBP8, RIT1, RNU4ATAC, ROR2, RPS6KA3, RRAS, RTTN, SGMS2, SHOC2, SHOX, SMARCA2, SMARCA4, SMARCAL1, SMARCB1, SMARCE1, SMC1A, SMC3, SOS1, SOS2, SOX11, SOX2, SOX3, SOX9, SPRED1, SRCAP, STAT5B


In [None]:
genes_a = input("Please write your first list: ")
genes_b = input("Now, please write your second list: ")
genes_ab_unidos = merge_gene_lists(genes_a, genes_b)
print(f"Total number of unique gene symbols: {len(genes_ab_unidos)}. Validation may take a few seconds to complete.")
# Validate against HGNC
genes_ab_aprobados, genes_ab_renombrados, genes_ab_no_encontrados = validate_gene_list(genes_ab_unidos)
print(f"Your new list of {len(genes_ab_aprobados)} genes is: {', '.join(genes_ab_aprobados)}")

if genes_ab_renombrados:
    print("\nIMPORTANT: The following gene symbols were updated according to the HGNC database:")
    for viejo, nuevo in genes_ab_renombrados:
        print(f"{viejo} -> {nuevo}")

if genes_ab_no_encontrados:
    print("\nThe following gene symbols were not found in the HGNC database or are invalid:")
    print(", ".join(genes_ab_no_encontrados))

Total number of unique gene symbols: 32. Validation may take a few seconds to complete.
Your new list of  30 genes is: ACTB, ACTG1, AMMECR1, ANKRD11, ARCN1, ARID1A, ARID1B, ATR, ATRIP, B3GAT3, BLM, BMP2, BRAF, BRF1, BTK, CBL, CCDC8, CDC45, CDC6, CDKN1C, CDT1, CPAP, CEP152, CEP63, COL10A1, COL11A1, COL27A1, CREBBP, CUL7, DHCR7

IMPORTANT: The following gene symbols were updated according to HGNC:
CENPJ -> CPAP
The following gene symbols were not found in the HGNC database or are invalid: BCSTL, DONS
