# Expanding collision acronyms
One problem for Aim 1 in Anastasia's dissertation is that once a gene-alias collision pair has been identified, we have to determine what the collision symbol actually represents in the context of the parent symbol. For example, CAP is listed as an alias for BRD4 but this symbol collides with so many other CAP aliases. In the context of BRD4, CAP actually refers to 'chromosome associated protein'. What cap stands for differs across different parent gene symbols. While this can be manually curated for a small set of genes, there exist over 100,000 gene-alias pairs to consider and so a programmatic approach will be needed. Additionally, a separate but related problem will be to programmatically identify the type of collision(?).

## Test Run w/ Anastasia Gene Lists

Given some sample gene lists, grab a uniprot identifier and use it to query for alternative names and pmids related to the parent gene. Once the alternative names and pmids are grabbed, use the PMID to query pubmed for the publication abstract. Use the obtained abstract and spaCy to do a quick check for symbol first letter matching (i.e. do triplets of words from an abstract start with C A P to match CAP). More sophisticated methods can be employed to check for correct expansion of words.

In [50]:
import spacy
from tqdm import tqdm
import pandas as pd
import requests
from acronym_parser import mark_acronyms

def grab_uniprot_id(gene_symbol):
    query = f'https://normalize.cancervariants.org/gene/normalize?q={gene_symbol}'
    r = requests.get(query)
    try:
        uniprot_id = None
        for extension in r.json()['gene_descriptor']['extensions']:
            if extension['name'] == 'associated_with':
                uniprot_id = [value.replace('uniprot:','') for value in extension['value'] if value.startswith('uniprot:')]
                pass    
    except:
        uniprot_id = 'Yikes'
    return uniprot_id

def uniprot_request(uniprot_id):
    uniprot_data = requests.get(f'https://rest.uniprot.org/uniprotkb/{uniprot_id}?format=json')
    return uniprot_data    

def grab_alternative_names(uniprot_data):
    alt_names = []
    try:
        for name in uniprot_data.json()['proteinDescription']['alternativeNames']:
            alt_names.append(name['fullName']['value'])
    except:
        alt_names = None
    return(alt_names)

def grab_pmids(uniprot_data):
    pmids_to_check = []
    for pmid in uniprot_data.json()['references']:
        pmids_to_check.append(pmid['citation']['id'])
    return(pmids_to_check)

def get_pubmed_abstract(pmid, api_key='YOUR_API_KEY'):
    """ Fetch the abstract for a given PMID from PubMed. """
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {
        'db': 'pubmed',
        'id': pmid,
        'retmode': 'xml'
        # 'api_key': api_key
    }
    
    response = requests.get(url, params=params)
    
    if response.status_code == 200:
        # Parse the XML response
        from xml.etree import ElementTree
        root = ElementTree.fromstring(response.text)
        
        # Find the AbstractText element and return its text
        abstract_text = root.find('.//AbstractText')
        if abstract_text is not None:
            return abstract_text.text
        else:
            return "No abstract found."
    else:
        return f"Error: {response.status_code}"

def find_matching_groups(abstract, gene_symbol):
    # Load the English tokenizer, tagger, parser, NER, and word vectors
    nlp = spacy.load("en_core_web_sm")
    
    # Process the abstract text with spaCy
    doc = nlp(abstract)

    abstract = abstract.replace('-',' ')
    # Create a list of words (tokens) from the document
    words = [token.text for token in doc if not token.is_punct and not token.is_space]
    
    # Determine the length of the gene symbol
    gene_length = len(gene_symbol)

    # Initialize a list to store matching groups
    matching_groups = []
    
    # Loop through the words to form groups of size gene_length
    for i in range(len(words) - gene_length + 1):
        # Select the current group of words
        group = words[i:i + gene_length]
        # Check if the first letter of each word in the group matches the letters in the gene symbol
        if all(group[j][0].lower() == gene_symbol[j].lower() for j in range(gene_length)):
            # If the group matches, add it to the list as a string
            matching_groups.append(' '.join(group))
    
    return matching_groups

In [51]:
data = {'collision_symbol': ['CAP','CAP','MYM','ASP','ASP','ASP','ALP','ALP','ALP'],
        'parent_symbol': ['BRD4','LNPEP','ZMYM1','ASIP','TMPRSS11D','ASPM','ATHS','CCL27','ALPI'],
        'gene_alias_pair': ['BRD4-CAP','LNPEP-CAP','ZMYM1-MYM','ASIP-ASP','TMPRSS11D-ASP','ASPM-ASP','ATHS-ALP','CCL27-ALP','ALPI-ALP'],
        'correct_alias_expansion': ["Chromosome Associated Protein","Cystinyl Aminopeptidase","Myeloproloferative syndrome and mental retardation","Agouti Signaling Protein","Adrenal secretory serine protease","Drosophila abnormal spindle","Atherogenic Lipoprotein Phenotype","Alkaline Phosphatase","Antileukoproteinase"],
        'collision_association': ['Protein Product','Protein Product','Disease', 'Protein Product', 'Protein Product', 'Ortholog', 'Phenotype', 'Protein Product', 'Protein Product'],
        'curation_difficulty': ['easy','medium','hard','easy','medium','hard','easy','medium','hard']}
df = pd.DataFrame(data)
df

Unnamed: 0,collision_symbol,parent_symbol,gene_alias_pair,correct_alias_expansion,collision_association,curation_difficulty
0,CAP,BRD4,BRD4-CAP,Chromosome Associated Protein,Protein Product,easy
1,CAP,LNPEP,LNPEP-CAP,Cystinyl Aminopeptidase,Protein Product,medium
2,MYM,ZMYM1,ZMYM1-MYM,Myeloproloferative syndrome and mental retarda...,Disease,hard
3,ASP,ASIP,ASIP-ASP,Agouti Signaling Protein,Protein Product,easy
4,ASP,TMPRSS11D,TMPRSS11D-ASP,Adrenal secretory serine protease,Protein Product,medium
5,ASP,ASPM,ASPM-ASP,Drosophila abnormal spindle,Ortholog,hard
6,ALP,ATHS,ATHS-ALP,Atherogenic Lipoprotein Phenotype,Phenotype,easy
7,ALP,CCL27,CCL27-ALP,Alkaline Phosphatase,Protein Product,medium
8,ALP,ALPI,ALPI-ALP,Antileukoproteinase,Protein Product,hard


In [54]:
# Main 
import time

# Initialize
df['test_results_alt_names'] = None
df['test_results_pmids'] = None
df['test_results_pmcs'] = None
df['uniprot_id'] = df['parent_symbol'].apply(grab_uniprot_id)

# Grab PMIDS and Alternative Names
for index,row in df.iterrows():
    print(f'Retrieving ref data for {row["parent_symbol"]}')
    if not row['uniprot_id']:
        continue
    data = uniprot_request(row['uniprot_id'][0])    
    alt_names = grab_alternative_names(data)
    df.at[index,'test_results_alt_names'] = alt_names
    pmids = grab_pmids(data)

    # Grab Abstracts from PubMed
    abstracts_to_check = []
    pmcs_to_check = []
    for pmid in tqdm(pmids):
        entry = {}
        if pmid.startswith('CI'):
            continue
        else:
            time.sleep(1) # Pubmed will throw back "too many requests" error unless you space them out or find a way to batch them
            # Abstracts
            abstract = get_pubmed_abstract(pmid)
            entry[pmid] = abstract
            abstracts_to_check.append(entry)
            # PMC IDs
            pmcid = get_pmc_id(pmid)
            entry[pmid] = pmcid 
            pmcs_to_check.append(entry)

    # Check Abstracts using spaCy
    print(f'Checking abstracts using spaCy {row["parent_symbol"]}')    
    possible_matches = []
    for abstract_dict in tqdm(abstracts_to_check):
        keys = list(abstract_dict.keys())
        possible_match = {}
        for key in keys:
            result = find_matching_groups(abstract_dict[key],row['collision_symbol'])
            possible_match[key] = result
        possible_matches.append(possible_match)

    df.at[index,'test_results_pmids'] = possible_matches

    # Use Acronym Parser 
    print(f'Checking PMC using acronym_parser {row["parent_symbol"]}')
    possible_matches = []
    for pmc_dict in tqdm(pmcs_to_check):
        keys = list(pmc_dict.keys())
        possible_match = {}
        for key in keys:
            pmcid = pmc_dict[key]
            url = f'https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/{pmcid}/unicode'
            r = requests.get(url)
            acronyms = mark_acronyms(r.text)
            possible_match[key] = acronyms
        possible_matches.append(possible_match)

    df.at[index,'test_results_pmcs'] = possible_matches
    
df

Retrieving ref data for BRD4


100%|██████████| 58/58 [01:28<00:00,  1.53s/it]


Checking abstracts using spaCy BRD4


100%|██████████| 56/56 [00:12<00:00,  4.59it/s]


Checking PMC using acronym_parser BRD4


100%|██████████| 56/56 [00:19<00:00,  2.88it/s]


Retrieving ref data for LNPEP


100%|██████████| 15/15 [00:21<00:00,  1.44s/it]


Checking abstracts using spaCy LNPEP


100%|██████████| 14/14 [00:03<00:00,  4.66it/s]


Checking PMC using acronym_parser LNPEP


100%|██████████| 14/14 [00:06<00:00,  2.15it/s]


Retrieving ref data for ZMYM1


100%|██████████| 5/5 [00:06<00:00,  1.23s/it]


Checking abstracts using spaCy ZMYM1


100%|██████████| 4/4 [00:00<00:00,  4.35it/s]


Checking PMC using acronym_parser ZMYM1


100%|██████████| 4/4 [00:01<00:00,  3.49it/s]


Retrieving ref data for ASIP


100%|██████████| 7/7 [00:10<00:00,  1.54s/it]


Checking abstracts using spaCy ASIP


100%|██████████| 7/7 [00:01<00:00,  4.52it/s]


Checking PMC using acronym_parser ASIP


100%|██████████| 7/7 [00:01<00:00,  3.67it/s]


Retrieving ref data for TMPRSS11D


100%|██████████| 5/5 [00:07<00:00,  1.56s/it]


Checking abstracts using spaCy TMPRSS11D


100%|██████████| 5/5 [00:01<00:00,  4.74it/s]


Checking PMC using acronym_parser TMPRSS11D


100%|██████████| 5/5 [00:01<00:00,  4.00it/s]


Retrieving ref data for ASPM


100%|██████████| 19/19 [00:27<00:00,  1.45s/it]


Checking abstracts using spaCy ASPM


100%|██████████| 18/18 [00:04<00:00,  4.48it/s]


Checking PMC using acronym_parser ASPM


100%|██████████| 18/18 [00:04<00:00,  4.04it/s]


Retrieving ref data for ATHS
Retrieving ref data for CCL27


100%|██████████| 6/6 [00:07<00:00,  1.28s/it]


Checking abstracts using spaCy CCL27


100%|██████████| 5/5 [00:01<00:00,  4.55it/s]


Checking PMC using acronym_parser CCL27


100%|██████████| 5/5 [00:01<00:00,  3.88it/s]


Retrieving ref data for ALPI


100%|██████████| 12/12 [00:17<00:00,  1.45s/it]


Checking abstracts using spaCy ALPI


100%|██████████| 11/11 [00:02<00:00,  4.57it/s]


Checking PMC using acronym_parser ALPI


100%|██████████| 11/11 [00:02<00:00,  4.97it/s]


Unnamed: 0,collision_symbol,parent_symbol,gene_alias_pair,correct_alias_expansion,collision_association,curation_difficulty,test_results_alt_names,test_results_pmids,uniprot_id,test_results_pmcs
0,CAP,BRD4,BRD4-CAP,Chromosome Associated Protein,Protein Product,easy,[Protein HUNK1],"[{'11733348': []}, {'15057824': []}, {'1548933...",[O60885],"[{'11733348': {}}, {'15057824': {}}, {'1548933..."
1,CAP,LNPEP,LNPEP-CAP,Cystinyl Aminopeptidase,Protein Product,medium,"[Insulin-regulated membrane aminopeptidase, In...","[{'8550619': []}, {'9177475': []}, {'10759854'...",[Q9UIQ6],"[{'8550619': {}}, {'9177475': {}}, {'10759854'..."
2,MYM,ZMYM1,ZMYM1-MYM,Myeloproloferative syndrome and mental retarda...,Disease,hard,,"[{'17974005': []}, {'16710414': []}, {'2575529...",[Q5SVZ6],[{'17974005': {'NMD': {'nonsense mediated deca...
3,ASP,ASIP,ASIP-ASP,Agouti Signaling Protein,Protein Product,easy,[Agouti switch protein],"[{'7937887': []}, {'7757071': []}, {'11780052'...",[P42127],"[{'7937887': {}}, {'7757071': {}}, {'11780052'..."
4,ASP,TMPRSS11D,TMPRSS11D-ASP,Adrenal secretory serine protease,Protein Product,medium,[Airway trypsin-like protease],"[{'9565616': []}, {'15489334': []}, {'9070615'...",[O60235],"[{'9565616': {}}, {'15489334': {}}, {'9070615'..."
5,ASP,ASPM,ASPM-ASP,Drosophila abnormal spindle,Ortholog,hard,[Abnormal spindle protein homolog],"[{'12355089': []}, {'14704186': []}, {'1597272...",[Q8IZT6],"[{'12355089': {}}, {'14704186': {}}, {'1597272..."
6,ALP,ATHS,ATHS-ALP,Atherogenic Lipoprotein Phenotype,Phenotype,easy,,,[],
7,ALP,CCL27,CCL27-ALP,Alkaline Phosphatase,Protein Product,medium,"[CC chemokine ILC, Cutaneous T-cell-attracting...","[{'10556532': []}, {'10588729': []}, {'1072569...",[Q9Y4X3],"[{'10556532': {}}, {'10588729': {}}, {'1072569..."
8,ALP,ALPI,ALPI-ALP,Antileukoproteinase,Protein Product,hard,,"[{'3468508': []}, {'3469665': []}, {'2841341':...",[P09923],"[{'3468508': {}}, {'3469665': {}}, {'2841341':..."


In [4]:
spot_check = 3

print(df['test_results_alt_names'][spot_check])

print(df['test_results_pmids'][spot_check])
print(df['test_results_pmcs'][spot_check])


['Agouti switch protein']
[{'7937887': []}, {'7757071': ['Agouti Signaling Protein']}, {'11780052': []}, {'15489334': ['a saturation point']}, {'15701517': ['agouti signaling protein']}, {'11833005': ['agouti signaling protein', 'agouti signaling protein']}, {'36536132': ['agouti signaling protein', 'a similar phenotype']}]
