### Programming for Biomedical Informatics
#### Week 3 Assignment - Gene ID Conversion

In this weekly mini assignment you will practice converting between different accession types using NCBI eUtils and BioMart.

- make sure that you've installed Biopython and signed up for a free NCBI account so that you can get an API key
- if you decide that you would like to try BioMart too, remember that you can (and should!) practice with the BioMart web interface first so that you understand the correct parameters to use

We've included the basic code below, based on the weekly snippets in ```./notebooks/week3``` on GitHub. Feel free to explore and try different things.

In [None]:
# using NCBI-NLM eUtils to get gene IDs from gene symbols
from Bio import Entrez

# load your API key from the file
with open('../api_keys/ncbi.txt', 'r') as file:
    api_key = file.read().strip()

# load your email from the file
with open('../api_keys/ncbi_email.txt', 'r') as file:
    email = file.read().strip()

Entrez.api_key = api_key
Entrez.email = email

## function to get the gene ids using eUtils
def get_gene_ids(gene_symbols, organism="Homo sapiens"):
    """
    Convert a list of gene symbols into NCBI Gene IDs.

    Parameters:
    gene_symbols (list): List of gene symbols to search for.
    organism (str): Organism name to restrict search (default is "Homo sapiens").

    Returns:
    dict: A dictionary mapping gene symbols to NCBI Gene IDs.
    """
    gene_ids = {}
    # parenthesise the OR-joined symbols so the organism filter applies to every symbol
    symbol_query = " OR ".join([f"{symbol}[Gene]" for symbol in gene_symbols])
    search_string = f"({symbol_query}) AND {organism}[Organism]"

    # get the full id list with an eSearch
    handle = Entrez.esearch(db="gene", term=search_string, retmax=len(gene_symbols))
    ids = Entrez.read(handle)['IdList']
    handle.close()

    # nothing matched, so there is nothing to summarise
    if not ids:
        return gene_ids

    # next use an esummary search to make sure we match the gene symbol to the correct gene id
    handle = Entrez.esummary(db='gene', id=",".join(ids))
    geneRecords = Entrez.read(handle)
    
    # iterate through to gather our information
    for geneRecord in geneRecords['DocumentSummarySet']['DocumentSummary']:
        symbol = geneRecord['Name']
        gene_id = geneRecord.attributes['uid']
        gene_ids[symbol] = gene_id
    
    handle.close()
     
    return gene_ids

In [None]:
'''In last week's assignment you looked up the genes associated with Alzheimer's disease using the KEGG API.
Now, let's parse the list you recovered to extract the gene symbol. This is the first element after
the hsa:12344-style identifier in last week's mapping:

e.g.

hsa:10000	AKT3, MPPH, MPPH2, PKB-GAMMA, PKBG, PRKBG, RAC-PK-gamma, RAC-gamma, STK-2; AKT serine/threonine kinase 3
hsa:100137049	PLA2G4B, HsT16992, cPLA2-beta; phospholipase A2 group IVB
hsa:10125	RASGRP1, CALDAG-GEFI, CALDAG-GEFII, IMD64, RASGRP; RAS guanyl releasing protein 1
....

This can be done in Python by first splitting each line on ```tab``` and then splitting the second field on ```,``` (or on ```;``` when a gene has no synonyms) to get the gene symbol.
Save the gene symbols to a file called alzheimers_genes.txt.
'''

# pseudocode1
# Use code from last week (solution for that posted on GitHub) to get the gene symbols data as above
'''###YOUR CODE HERE###'''
# Importing the required libraries
import requests

# Function to download KEGG pathway data
def download_kegg_pathway_genes(pathway_id):
    # URL for the pathway data
    data_url = f"http://rest.kegg.jp/link/hsa/{pathway_id}"
    
    # Make the HTTP request for the pathway data
    response = requests.get(data_url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Write the data content to a file
        with open(f"{pathway_id}.txt", 'w') as file:
            file.write(response.text)
        print(f"Pathway data saved as {pathway_id}.txt")
    else:
        print("Failed to retrieve pathway data. Status code:", response.status_code)

# Example pathway ID for Alzheimer's disease
pathway_id = "hsa05010"

# Call the function to download the pathway gene list
download_kegg_pathway_genes(pathway_id)

In [None]:
# pseudocode2
# Parse the data to extract the gene symbol
'''###YOUR CODE HERE###'''
import pandas as pd
df = pd.read_csv('hsa05010.txt', sep='\t', header=None)

# the second column contains the KEGG gene ids; slice it (e.g. df.iloc[:5, 1]) for a quicker test run
gene_ids = df.iloc[:, 1]

# Function to fetch gene details
def fetch_gene_details(gene_ids):
    gene_details = dict()
    for gene_id in gene_ids:
        # URL for the gene entry
        data_url = f"http://rest.kegg.jp/list/{gene_id}"

        # Make the HTTP request for the gene entry
        response = requests.get(data_url)

        # Check if the request was successful
        if response.status_code == 200:
            # strip the newline character from the response text
            details = response.text.strip()
            # add the gene details to the dictionary
            gene_details[gene_id] = details
            print(details)
        else:
            print("Failed to retrieve gene data. Status code:", response.status_code)
    return gene_details

# Call the function to fetch gene details
# this will take a while to run (c. 15 minutes) because it makes one request per gene
gene_details = fetch_gene_details(gene_ids)
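
One request per gene is what makes the cell above slow. A possible way to cut the number of requests is to ask KEGG for several entries at once by joining ids with `+`. The sketch below only builds the batched URLs and splits a batched response back into per-gene lines; `build_batch_urls`, `split_batch_response` and `BATCH_SIZE` are hypothetical names, and the limit of 10 entries per request is an assumption that should be checked against the KEGG REST documentation.

```python
from typing import Dict, List

KEGG_LIST_URL = "http://rest.kegg.jp/list/"  # same endpoint as fetch_gene_details
BATCH_SIZE = 10  # assumption: maximum number of entries KEGG accepts per request


def build_batch_urls(gene_ids: List[str], batch_size: int = BATCH_SIZE) -> List[str]:
    """Group KEGG gene ids into '+'-joined batches and return one URL per batch."""
    ids = list(gene_ids)
    urls = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        urls.append(KEGG_LIST_URL + "+".join(batch))
    return urls


def split_batch_response(text: str) -> Dict[str, str]:
    """Map each KEGG id to its full tab-separated line, mirroring gene_details."""
    details = {}
    for line in text.strip().split("\n"):
        if line:
            details[line.split("\t")[0]] = line
    return details
```

Each URL can then be fetched with `requests.get` exactly as in `fetch_gene_details`, ideally with a short pause between requests, and the response text passed to `split_batch_response`.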

In [None]:
# save gene_details as a pickle file
import pickle
with open('gene_details.pkl', 'wb') as file:
    pickle.dump(gene_details, file)

In [None]:
# extract the first element after the hsa:12344 number
gene_symbols = []
for gene_id in gene_details:
    gene_symbol_synonym = gene_details[gene_id].split('\t')[1]
    # split on either a comma or a semi-colon
    gene_symbol = gene_symbol_synonym.split(',')[0].split(';')[0]
    gene_symbols.append(gene_symbol)

print(gene_symbols)
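
The parsing step can be checked without any network access by running it on the three example lines quoted in the task description. This is a minimal sketch; `first_symbol` and `SAMPLE_LINES` are names introduced here for illustration and are not part of the course snippets.

```python
# the three example lines from the task description, with a tab after the hsa id
SAMPLE_LINES = [
    "hsa:10000\tAKT3, MPPH, MPPH2, PKB-GAMMA, PKBG, PRKBG, RAC-PK-gamma, RAC-gamma, STK-2; AKT serine/threonine kinase 3",
    "hsa:100137049\tPLA2G4B, HsT16992, cPLA2-beta; phospholipase A2 group IVB",
    "hsa:10125\tRASGRP1, CALDAG-GEFI, CALDAG-GEFII, IMD64, RASGRP; RAS guanyl releasing protein 1",
]


def first_symbol(kegg_line: str) -> str:
    """Return the primary gene symbol from one tab-separated KEGG gene line."""
    symbols_and_name = kegg_line.split("\t")[1]
    # the symbol list ends at ';' and synonyms are separated by ','
    return symbols_and_name.split(";")[0].split(",")[0].strip()


print([first_symbol(line) for line in SAMPLE_LINES])
# → ['AKT3', 'PLA2G4B', 'RASGRP1']
```

Splitting on `;` before `,` means a gene with no synonyms still yields just its symbol rather than the symbol plus its description.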

In [None]:
# pseudocode3
# Save the gene symbols to a file called alzheimers_genes.txt
'''###YOUR CODE HERE###'''
with open('alzheimers_genes.txt', 'w') as file:
    for gene_symbol in gene_symbols:
        file.write(f"{gene_symbol}\n")

In [None]:
test = get_gene_ids(['AKT3', 'FRAT1', 'CASP12', 'NDUFC2-KCTD14', 'PPIF', 'ADAM10', 'CDK5', 'PSMD14', 'TPTEP2-CSNK1E', 'APC2', 'RTN3'])
print(test)

In [61]:
'''Now use the function defined above to fetch the NCBI Gene IDs for the gene symbols you extracted above. Use Pretty Table to print out a table with the first column being the gene symbol and the second column being the NCBI Gene ID.'''

# pseudocode4
# Use the function defined above to fetch the NCBI Gene IDs for the gene symbols you extracted above
'''###YOUR CODE HERE###'''
gene_ids = get_gene_ids(gene_symbols[0:10])

# save the gene_ids as a pickle file
with open('gene_ids.pkl', 'wb') as file:
    pickle.dump(gene_ids, file)

In [62]:
# pseudocode5
# Use Pretty Table to print out a table with the first column being the gene symbol and the second column being the NCBI Gene ID
'''###YOUR CODE HERE###'''
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Gene Symbol", "NCBI Gene ID"]

for gene_symbol, gene_id in gene_ids.items():
    table.add_row([gene_symbol, str(gene_id)])

print(table)

+---------------+--------------+
|  Gene Symbol  | NCBI Gene ID |
+---------------+--------------+
|     ADAM10    |     102      |
|      CDK5     |     1020     |
|      AKT3     |    10000     |
|     PSMD14    |    10213     |
|      PPIF     |    10105     |
|     ANAPC2    |    29882     |
|     CASP12    |  100506742   |
|      APC2     |    10297     |
|     FRAT1     |    10023     |
| TPTEP2-CSNK1E |  102800317   |
+---------------+--------------+
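
`get_gene_ids` keys its result by the symbol that NCBI returns rather than by the symbol that was requested, and `retmax` caps the number of hits, so the returned symbols need not line up one-to-one with the input list. A small check along the following lines can make any mismatch visible. `compare_symbols` is a hypothetical helper, and the example input is illustrative only: the ids are copied from the table above, not produced by a real query.

```python
from typing import Dict, List, Tuple


def compare_symbols(
    requested: List[str], found: Dict[str, str]
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Split a symbol-to-id mapping into exact matches, extras, and missing symbols."""
    requested_set = set(requested)
    matched = {s: gid for s, gid in found.items() if s in requested_set}
    extras = sorted(s for s in found if s not in requested_set)
    missing = sorted(s for s in requested if s not in found)
    return matched, extras, missing


# illustrative input only, not the result of a real query
requested = ["AKT3", "APC2", "RTN3"]
found = {"AKT3": "10000", "APC2": "10297", "ANAPC2": "29882"}

matched, extras, missing = compare_symbols(requested, found)
print(matched)  # → {'AKT3': '10000', 'APC2': '10297'}
print(extras)   # → ['ANAPC2']
print(missing)  # → ['RTN3']
```

Symbols that end up in `missing` could be retried in a separate, smaller query.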


In [None]:
'''If you would like, you can repeat the above process with BioMart. You could also try retrieving other information fields from BioMart, for example the gene name, description, chromosome, and start and end position. Use one of the two BioMart snippets on GitHub to help you.'''
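
As a starting point for the optional BioMart exercise, the sketch below builds the XML query that the BioMart web interface can export from its results page. It is not one of the course snippets: `build_biomart_query` is a hypothetical helper, and the dataset, filter and attribute names (`hsapiens_gene_ensembl`, `external_gene_name`, `entrezgene_id`, and so on) are assumptions that should be confirmed in the web interface before use.

```python
from typing import List
from xml.sax.saxutils import quoteattr

# assumption: Ensembl's public BioMart service endpoint
BIOMART_URL = "http://www.ensembl.org/biomart/martservice"


def build_biomart_query(
    symbols: List[str],
    attributes: List[str],
    dataset: str = "hsapiens_gene_ensembl",
) -> str:
    """Return a BioMart XML query that filters on gene symbol and requests the given attributes."""
    symbol_list = ",".join(symbols)
    filter_xml = f"<Filter name=\"external_gene_name\" value={quoteattr(symbol_list)}/>"
    attribute_xml = "".join(f"<Attribute name={quoteattr(name)}/>" for name in attributes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1">'
        f'<Dataset name={quoteattr(dataset)} interface="default">'
        f"{filter_xml}{attribute_xml}"
        "</Dataset></Query>"
    )


query = build_biomart_query(
    ["AKT3", "CDK5"],
    ["external_gene_name", "entrezgene_id", "chromosome_name",
     "start_position", "end_position", "description"],
)
print(query)
```

The resulting string can be sent with `requests.get(BIOMART_URL, params={"query": query})`, and the tab-separated response read with pandas in the same way as the KEGG file earlier in this notebook.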