# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Data Retrieval 4 - Getting essential gene information**
Welcome to the fourth Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we will write code in Python that allows us to retrieve and perform simple analyses with data available on the WormBase sites.

This tutorial will deal with getting the essential gene information for WormBase genes, replicating the results from the SimpleMine utility. Given a list of WormBase Gene IDs, we can extract all required information associated with the gene using the WormBase RESTful API. Let's get started!

We start by loading the libraries that are required for this tutorial.

In [None]:
import requests
import sys
import pandas as pd
pd.set_option('display.max_columns', None) #for ensuring full view of the dataframe generated

We initialise the columns for the dataframe. The description of each column can be found preceding the function that assigns the value to that column. You can comment out any of the columns if they are not required for your study.

In [None]:
GeneInfo = pd.DataFrame(columns = ['WormBase Gene ID',
                                   'Public Name', 
                                   'Species', 
                                   'Sequence Name',
                                   'Other Name', 
                                   'Transcript', 
                                   'Operon', 
                                   'Protein Domain', 
                                   'UniProt', 
                                   'Reference UniProt ID', 
                                   'TreeFam', 
                                   'RefSeq_mRNA', 
                                   'RefSeq_protein', 
                                   'Genetic Map Position', 
                                   'RNAi Phenotype Observed',
                                   'Allele Phenotype Observed',
                                   'Coding_exon Non_silent Allele', 
                                   'Interacting Gene', 
                                   'Expr_pattern Tissue', 
                                   'Genomic Study Tissue',
                                   'Expr_pattern LifeStage',
                                   'Genomic Study LifeStage',
                                   'Disease Info',
                                   'Human Ortholog',
                                   'Gene Ontology Association', 
                                   'Automated Description',
                                   'Reference' 
                                    ])

The next step is to initialize the list of Gene IDs. This can be done either by typing in a list manually or by loading a csv file that has a single column containing the required gene IDs.
Uncomment the required line in the cell below.

GeneID - Unique Gene identifiers used by WormBase

In [None]:
GeneID = ['WBGene00001648', 'WBGene00012578', 'WBGene00021277']
#GeneID = pd.read_csv('GeneID_list.csv', header=None)[0]
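Before querying the API, it can help to check that every entry in the list looks like a WormBase gene ID. The following is a minimal sketch, assuming IDs consist of the prefix `WBGene` followed by exactly eight digits (as in the examples above); `split_gene_ids` is a hypothetical helper and not part of any WormBase library.

```python
import re

# Assumed ID format: the prefix 'WBGene' followed by exactly eight digits.
WBGENE_PATTERN = re.compile(r'WBGene\d{8}')

def split_gene_ids(ids):
    """Return (valid, invalid) lists after trimming whitespace from each ID."""
    valid, invalid = [], []
    for raw in ids:
        gene = str(raw).strip()
        if WBGENE_PATTERN.fullmatch(gene):
            valid.append(gene)
        else:
            invalid.append(gene)
    return valid, invalid

# The last entry is a gene name rather than a gene ID, so it is reported as invalid.
valid_ids, invalid_ids = split_gene_ids(['WBGene00001648', ' WBGene00012578', 'daf-16'])
print(valid_ids)
print(invalid_ids)
```

Entries reported as invalid can then be corrected by hand before any request is sent.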

We finally access the WormBase RESTful API (http://rest.wormbase.org/index.html) to get the necessary information for the genes. We use the gene widgets and gene fields operations to access the data based on the gene ID.

Remember to comment out, in the cells that follow, any fields that you commented out earlier while initialising the empty dataframe.
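Every function in this notebook builds its URL by string concatenation and reads the response without checking the HTTP status. One possible way to centralise that logic is sketched below. `build_url` and `fetch_json` are hypothetical helpers introduced only for illustration, and the 30-second timeout is an arbitrary choice. The `get` argument is expected to behave like `requests.get`.

```python
BASE_URL = 'http://rest.wormbase.org/rest'

def build_url(kind, gene, name):
    """Build a 'field' or 'widget' URL for a gene, following the pattern used in this notebook."""
    if kind not in ('field', 'widget'):
        raise ValueError("kind must be 'field' or 'widget'")
    return f'{BASE_URL}/{kind}/gene/{gene}/{name}'

def fetch_json(url, get, timeout=30):
    """Fetch `url` with the supplied `get` callable (for example requests.get) and decode the JSON.

    HTTP error statuses raise immediately instead of surfacing later as a missing key.
    """
    res = get(url, timeout=timeout)
    res.raise_for_status()
    return res.json()

print(build_url('field', 'WBGene00001648', 'name'))
```

With `requests` imported as above, a call would look like `fetch_json(build_url('field', gene, 'name'), requests.get)`. Passing the `get` callable in as an argument also makes the helper easy to exercise without network access.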

Public Name - Official gene names specified by WormBase

In [None]:
def publicName(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/name')
    return res.json()['name']['data']['label']

Species - Each gene can only be associated with one species

In [None]:
def species(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/taxonomy')
    return res.json()['taxonomy']['data']['genus'] + ' ' + res.json()['taxonomy']['data']['species']

Sequence name - Sequence name of the gene

In [None]:
def sequenceName(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/sequence_name')
    return res.json()['sequence_name']['data']

Other Name - All other names that have been used for the gene in publications

In [None]:
def otherName(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/other_names')
    return ', '.join(res.json()['other_names']['data'])

Transcript - Transcript names of the gene

In [None]:
def transcript(gene):
    res = requests.get('http://rest.wormbase.org/rest/widget/gene/' + gene + '/sequences')
    return res.json()['fields']['gene_models']['data']['table'][0]['model'][0]['id']

Operon - The operon (a cluster of genes transcribed together as a single unit) that the gene belongs to, if any

In [None]:
def operon(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/operon')
    return 'N.A.' if res.json()['operon']['data'] is None else res.json()['operon']['data']['label']

Protein Domains associated with the gene

In [None]:
def proteinDomain(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/protein_domains')
    return [val['id'] for key, val in res.json()['protein_domains']['data'].items() if 'id' in val]

UniProt - Official Protein Identifiers used by the UniProt database

In [None]:
def uniprot(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs')
    
    if 'TrEMBL' in res.json()['xrefs']['data']:
        return res.json()['xrefs']['data']['TrEMBL']['UniProtAcc']['ids']  
    
    elif 'SwissProt' in res.json()['xrefs']['data']:
        return res.json()['xrefs']['data']['SwissProt']['UniProtAcc']['ids']
    
    else:
        return 'N.A.'

Reference UniProt ID - Unique UniProt ID for each gene 

In [None]:
def refUniprotId(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs')
    return res.json()['xrefs']['data']['UniProt_GCRP']['UniProtAcc']['ids']

TreeFam - Official gene identifiers used by the TreeFam database

In [None]:
def treeFam(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/treefam')
    
    if res.json()['treefam']['data'] is None:
        return 'N.A.'
    else:
        return res.json()['treefam']['data']

RefSeq mRNA - Sequence IDs used by the RefSeq database

In [None]:
def refSeqmRNA(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs')
    return res.json()['xrefs']['data']['RefSeq']['mRNA']['ids']

RefSeq Protein - Protein sequence IDs used by the RefSeq database

In [None]:
def refSeqProtein(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs')
    return res.json()['xrefs']['data']['RefSeq']['protein']['ids']
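Several of the cross-reference lookups above index straight into the response, so a gene that lacks, say, a RefSeq entry would raise a `KeyError`. A small helper that walks the nested dictionaries and falls back to `'N.A.'` is sketched below. `dig` is a hypothetical name, and the sample payload is invented to mirror the key layout used above; it is not real WormBase data.

```python
def dig(data, *keys, default='N.A.'):
    """Follow `keys` through nested dicts; return `default` if any step is missing or null."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return default if current is None else current

# Invented payload shaped like the 'xrefs' responses indexed above (placeholder ID).
sample = {'xrefs': {'data': {'RefSeq': {'mRNA': {'ids': ['EXAMPLE_ID']}}}}}

print(dig(sample, 'xrefs', 'data', 'RefSeq', 'mRNA', 'ids'))
print(dig(sample, 'xrefs', 'data', 'UniProt_GCRP', 'UniProtAcc', 'ids'))
```

The first call returns the list of IDs; the second finds no `UniProt_GCRP` key and returns the default instead of raising.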

GeneticMapPosition - Chromosome and chromosomal position of the gene

In [None]:
def geneticMapPosition(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/genetic_position')
    return res.json()['genetic_position']['data'][0]['chromosome'] + ':' + \
           str(res.json()['genetic_position']['data'][0]['position'])

RNAi Phenotype Observed - RNAi phenotype ontology names

In [None]:
def rnaiPhen(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/phenotype')
    final = []
    for i in res.json()['phenotype']['data']:
        if 'RNAi' in i['evidence']:
            final.append(i['phenotype']['label'])
    return set(final)

Allele Phenotype Observed - Allele phenotype ontology names

In [None]:
def allelePhen(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/phenotype')
    final  = []
    for i in res.json()['phenotype']['data']:
        if 'Allele' in i['evidence']:
            final.append(i['phenotype']['label'])
    return set(final)
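`rnaiPhen` and `allelePhen` request the same `phenotype` field and differ only in the evidence key they test, so the records could be fetched once and split in a single pass. The sketch below assumes each record carries a `phenotype` label and an `evidence` mapping, as the two functions above imply. `split_phenotypes` is a hypothetical helper and the sample records are invented.

```python
def split_phenotypes(records):
    """Return (rnai_labels, allele_labels) as sets from a list of phenotype records."""
    rnai, allele = set(), set()
    for record in records:
        label = record['phenotype']['label']
        evidence = record.get('evidence') or {}  # tolerate a missing or null evidence entry
        if 'RNAi' in evidence:
            rnai.add(label)
        if 'Allele' in evidence:
            allele.add(label)
    return rnai, allele

# Invented records that mimic the structure read by rnaiPhen and allelePhen.
sample_records = [
    {'phenotype': {'label': 'example phenotype A'}, 'evidence': {'RNAi': []}},
    {'phenotype': {'label': 'example phenotype B'}, 'evidence': {'Allele': []}},
    {'phenotype': {'label': 'example phenotype C'}, 'evidence': {'RNAi': [], 'Allele': []}},
]
rnai_labels, allele_labels = split_phenotypes(sample_records)
print(sorted(rnai_labels))
print(sorted(allele_labels))
```

In the notebook, the input would be `res.json()['phenotype']['data']` from a single request, which halves the number of calls made for these two columns.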

Coding_exon Non_silent Allele - List of alleles that fall in any coding exon

In [None]:
def nonSilent(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/alleles')
    final = []
    for i in res.json()['alleles']['data']:
        if 'effects' in i:
            final.append(i['variation']['label'] + '|' + i['effects'][0])
    return set(final)

Interacting Gene - Experimentally confirmed gene interactions

In [None]:
def interactingGene(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/interaction_details')
    final = []
    for i in res.json()['interaction_details']['data']['edges_all']:
        final.append(i['affected']['label'])
    return set(final)

Expr_pattern Tissue - Anatomical expression based on GFP, immunoprecipitation, in-situ

In [None]:
def exprPatternTissue(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/expressed_in')
    final = []
    for i in res.json()['expressed_in']['data']:
        if i['details'][0]['text']['class'] == 'expr_pattern':
            final.append(i['ontology_term']['label'])
    return set(final)

Genomic Study Tissue - Tissue enrichment based on microarray, RNA-Seq, proteomics studies

In [None]:
def genomicStudyTissue(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/expressed_in')
    final = []
    for i in res.json()['expressed_in']['data']:
        if i['details'][0]['text']['class'] == 'expression_cluster':
            final.append(i['ontology_term']['label'])
    return set(final)

Expr_pattern LifeStage - Developmental expression based on GFP, immunoprecipitation, in-situ

In [None]:
def exprPatternLifeStage(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/expressed_during')
    final = []
    if res.json()['expressed_during']['data'] is None:
        return 'N.A.'
    else:
        for i in res.json()['expressed_during']['data']:
            final.append(i['ontology_term']['label'])
    return set(final)

Genomic Study LifeStage - Developmental expression based on microarray, RNA-Seq, proteomics studies

In [None]:
def genomicStudyLifeStage(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/fpkm_expression_summary_ls')
    final = []
    if res.json()['fpkm_expression_summary_ls']['data']['table']['fpkm']['data'] is None:
        return 'N.A.'
    else:
        for i in res.json()['fpkm_expression_summary_ls']['data']['table']['fpkm']['data']:
            final.append(i['life_stage']['label'])
    return set(final)

Disease Info - Diseases associated with the gene

In [None]:
def disease(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/human_diseases')
    final = []
    if res.json()['human_diseases']['data'] is None:
        return 'N.A.'
    else:
        for i in res.json()['human_diseases']['data']['potential_model']:
            final.append(i['label'])
    return set(final)

Human Ortholog - Human orthologs of the gene

In [None]:
def humanOrtholog(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/other_orthologs')
    final = []
    if res.json()['other_orthologs']['data'] is None:
        return 'N.A.'
    else:
        for i in res.json()['other_orthologs']['data']:
            if (i['ortholog']['taxonomy'] == 'h_sapiens'):
                final.append(i['ortholog']['id'] + '|' + '; '.join([j['id'] for j in i['method']]))
    return set(final)

Gene Ontology Association - Gene ontology terms annotated to the gene

In [None]:
def geneOntology(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/gene_ontology')
    final = []
    if res.json()['gene_ontology']['data'] is None:
        return 'N.A.'
    else:
        for i in res.json()['gene_ontology']['data']['Biological_process']:
            final.append('BP_' + i['term_description']['label'])
        for i in res.json()['gene_ontology']['data']['Molecular_function']:
            final.append('MF_' + i['term_description']['label'])
        for i in res.json()['gene_ontology']['data']['Cellular_component']:
            final.append('CC_' + i['term_description']['label'])
        for i in res.json()['gene_ontology']['data']['Cellular_component']:
            if i['with'] is not None and i['with'][0]['label'][:7] != 'Panther':
                final.append('CC_' + i['with'][0]['label'])
    return set(final)

Automated Description - Up-to-date gene description

In [None]:
def automatedDesc(gene):
    res = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/concise_description')
    return res.json()['concise_description']['data']['text']

Reference - Primary research articles that studied the gene

In [None]:
def reference(gene):
    res = requests.get('http://rest.wormbase.org/rest/widget/gene/' + gene + '/references')
    final = []
    if res.json()['fields']['references']['data'] is None:
        return 'N.A.'
    else:
        for i in res.json()['fields']['references']['data']['results']:
            final.append(i['name']['id'])
    return set(final)

In [None]:
# DataFrame.append was removed in pandas 2.0, so collect one dict per gene
# and build the dataframe in a single step after the loop.
rows = []
for gene in GeneID:
    rows.append({'WormBase Gene ID': gene,
                 'Public Name': publicName(gene),
                 'Species': species(gene),
                 'Sequence Name': sequenceName(gene),
                 'Other Name': otherName(gene),
                 'Transcript': transcript(gene),
                 'Operon': operon(gene),
                 'Protein Domain': proteinDomain(gene),
                 'UniProt': uniprot(gene),
                 'Reference UniProt ID': refUniprotId(gene),
                 'TreeFam': treeFam(gene),
                 'RefSeq_mRNA': refSeqmRNA(gene),
                 'RefSeq_protein': refSeqProtein(gene),
                 'Genetic Map Position': geneticMapPosition(gene),
                 'RNAi Phenotype Observed': rnaiPhen(gene),
                 'Allele Phenotype Observed': allelePhen(gene),
                 'Coding_exon Non_silent Allele': nonSilent(gene),
                 'Interacting Gene': interactingGene(gene),
                 'Expr_pattern Tissue': exprPatternTissue(gene),
                 'Genomic Study Tissue': genomicStudyTissue(gene),
                 'Expr_pattern LifeStage': exprPatternLifeStage(gene),
                 'Genomic Study LifeStage': genomicStudyLifeStage(gene),
                 'Disease Info': disease(gene),
                 'Human Ortholog': humanOrtholog(gene),
                 'Gene Ontology Association': geneOntology(gene),
                 'Automated Description': automatedDesc(gene),
                 'Reference': reference(gene)})

# Reuse the column list defined earlier so the column order is preserved
GeneInfo = pd.DataFrame(rows, columns=GeneInfo.columns)
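The loop above stops at the first gene for which any helper raises an exception, for example when an expected key is absent from a response. One way to keep going is to wrap each call so that a failed lookup yields `'N.A.'` instead. This is a minimal sketch: `safe_call` is a hypothetical wrapper, and the two stand-in functions exist only to demonstrate it. Network errors are deliberately not caught, so connection problems still surface.

```python
def safe_call(func, gene, default='N.A.'):
    """Call func(gene); on a missing key, bad index, or null payload return `default`."""
    try:
        return func(gene)
    except (KeyError, IndexError, TypeError) as err:
        print(f'{func.__name__}({gene}) failed: {err!r}')
        return default

# Stand-in lookups used only to demonstrate the wrapper.
def working_lookup(gene):
    return gene.lower()

def broken_lookup(gene):
    return {}['missing']  # simulates a response without the expected key

print(safe_call(working_lookup, 'WBGene00001648'))
print(safe_call(broken_lookup, 'WBGene00001648'))
```

In the loop, an entry such as `refSeqmRNA(gene)` would become `safe_call(refSeqmRNA, gene)`.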

In [None]:
GeneInfo

The dataframe can be written into a csv file to save the essential gene information and for any analyses that you want to perform later.

In [None]:
GeneInfo.to_csv('EssentialGeneInformation.csv')
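Many cells in the dataframe hold Python lists or sets, which `to_csv` writes out as their Python representation (brackets, braces and quotes included). If plain delimited text is preferred, each collection can be joined into a string first. The helper below is a sketch; `flatten_cell` and the `'; '` separator are illustrative choices, not something SimpleMine or WormBase prescribes.

```python
def flatten_cell(value, sep='; '):
    """Join a list, tuple or set into a sorted, delimited string; leave other values unchanged."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return sep.join(sorted(str(item) for item in value))
    return value

print(flatten_cell({'pharynx', 'intestine'}))
print(flatten_cell('N.A.'))
```

Applied column by column before saving, for example `GeneInfo[col] = GeneInfo[col].map(flatten_cell)` for each `col` in `GeneInfo.columns`, this produces a csv file that reads cleanly in a spreadsheet. Sorting the items keeps the output stable between runs, since sets have no fixed order.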

This is the end of the tutorial on replicating SimpleMine results using the WormBase RESTful API to get the essential gene information. The data are up to date, quick to extract, and easier to handle than the results from SimpleMine.

This tutorial is also the end of the Data Retrieval series. In the next tutorial, we will implement and test some simple utilities that can help us work with the data we have retrieved until now.

Acknowledgements:
- WormBase RESTful API (http://rest.wormbase.org/index.html#/)