# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Data Retrieval 4 - Getting essential gene information**
Welcome to the fourth Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we write Python code to retrieve data available on the WormBase sites and perform simple analyses with it.

This tutorial will deal with getting the essential gene information for WormBase genes, replicating the results from the SimpleMine utility. Given a list of WormBase Gene IDs, we can extract all required information associated with the gene using the WormBase RESTful API. Let's get started!

We start by loading the libraries that are required for this tutorial.

In [1]:
import requests
import pandas as pd
pd.set_option('display.max_columns', None) #show every column when the dataframe is displayed

We initialise the columns for the dataframe. All columns have a small description next to them. You can comment out any of the columns if they are not required for your study.

In [2]:
EssentialGeneInfo = pd.DataFrame(columns = ['WormBase Gene ID', #Unique Gene identifiers used by WormBase
                                            'Public Name', #Official gene names specified by WormBase
                                            'Species', #Each gene can only be associated with one species
                                            'Sequence Name', #Sequence name of the gene
                                            'Other Name', #All names that have been used by the gene in publications
                                            'Transcript', #Transcript names of the gene
                                            'Operon', #Operon (cluster of co-transcribed genes) that the gene belongs to, if any
                                            'Protein Domain', #Protein Domains associated with the gene
                                            'UniProt', #Official Protein Identifiers used by the UniProt database
                                            'Reference UniProt ID', #Unique UniProt ID for each gene to act as a "reference" UniProt ID
                                            'TreeFam', #Official gene identifiers used by the TreeFam database
                                            'RefSeq_mRNA', #Sequence IDs used by the RefSeq database
                                            'RefSeq_protein', #Protein IDs used by the RefSeq database
                                            'Genetic Map Position', #Display Chromosome and chromosomal position of the gene
                                            'RNAi Phenotype Observed', #Display the RNAi phenotype ontology names
                                            'Allele Phenotype Observed', #Display the Allele phenotype ontology names
                                            'Coding_exon Non_silent Allele', #Display a list of alleles that fall in any coding exon
                                            'Interacting Gene', #Display experimentally confirmed gene interactions
                                            'Expr_pattern Tissue', #Anatomical expression based on GFP, immunoprecipitation, In_situ
                                            'Genomic Study Tissue', #Tissue enrichment based on the microarray, RNA-Seq, and proteomics studies
                                            'Expr_pattern LifeStage', #Developmental expression based on GFP, immunoprecipitation, In_situ
                                            'Genomic Study LifeStage', #Developmental expression based on the microarray, RNA-Seq, and proteomics studies
                                            'Disease Info', #Display the disease names associated with the gene
                                            'Human Ortholog', #Display the human orthologs of the gene
                                            'Gene Ontology Association', #Display the names of gene ontology terms that were annotated to the gene
                                            'Automated Description', #Up-to-date gene description 
                                            'Reference' #Primary research articles that studied the gene
                                           ])

The next step is to initialise the list of Gene IDs. This can be done either by typing in a list manually or by loading a CSV file that has a single column containing the required gene IDs.
Uncomment the line you need in the cell below.

In [3]:
GeneID = ['WBGene00001648', 'WBGene00012578', 'WBGene00021277']
#GeneID = pd.read_csv('GeneID_list.csv', header=None)[0]
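
Before sending any requests, it can help to check that the list really contains gene IDs, especially when it comes from a CSV file. The sketch below is not part of the original notebook: the helper name `clean_gene_ids` is hypothetical, and the pattern (`WBGene` followed by eight digits) is an assumption based on the IDs used in this tutorial.

```python
import re

# Assumed pattern: "WBGene" followed by exactly eight digits
WBGENE_PATTERN = re.compile(r'^WBGene\d{8}$')

def clean_gene_ids(raw_ids):
    """Strip whitespace, drop duplicates (keeping first-seen order) and reject malformed IDs."""
    cleaned = []
    seen = set()
    for raw in raw_ids:
        gene_id = str(raw).strip()
        if not WBGENE_PATTERN.match(gene_id):
            raise ValueError(f'Not a WormBase Gene ID: {gene_id!r}')
        if gene_id not in seen:
            seen.add(gene_id)
            cleaned.append(gene_id)
    return cleaned

example_ids = clean_gene_ids(['WBGene00001648', ' WBGene00012578 ', 'WBGene00001648'])
print(example_ids)  # ['WBGene00001648', 'WBGene00012578']
```

Failing early on a malformed ID is a design choice here; skipping bad entries with a warning would work just as well for long lists.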

Finally, we access the WormBase RESTful API to get the necessary information for the genes. We use the gene widget and gene field operations to access the data based on the gene ID (see http://rest.wormbase.org/index.html).

Remember to also comment out here any fields that you commented out earlier when initialising the empty dataframe.
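
The cell that follows requests the same URL several times within a single line (for example, `xrefs` is downloaded once for each lookup). As a minimal sketch, and not the notebook's own method, the URL pattern can be wrapped in a cached helper so each field or widget is downloaded only once per gene. The names `build_url`, `fetch_json` and `get_field` are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import requests
from functools import lru_cache

BASE_URL = 'http://rest.wormbase.org/rest'

def build_url(kind, gene, name):
    """Build a URL for kind 'field' or 'widget', following the pattern used in this tutorial."""
    return f'{BASE_URL}/{kind}/gene/{gene}/{name}'

@lru_cache(maxsize=None)
def fetch_json(kind, gene, name):
    """Download one field or widget and cache the parsed JSON for repeated lookups."""
    response = requests.get(build_url(kind, gene, name), timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return response.json()

def get_field(gene, name):
    """Return the 'data' entry of a gene field, mirroring the access pattern of the tutorial."""
    return fetch_json('field', gene, name)[name]['data']
```

With this in place, an expression such as `get_field(gene, 'xrefs')` could replace each repeated `requests.get(...).json()['xrefs']['data']` call, and only the first use would reach the network.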

In [4]:
for gene in GeneID:
    WormBaseGeneID = gene
    PublicName = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/name').json()['name']['data']['label']
    Species = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/taxonomy').json()['taxonomy']['data']['genus'] + ' ' + requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/taxonomy').json()['taxonomy']['data']['species']
    SequenceName = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/sequence_name').json()['sequence_name']['data']
    OtherName = ', '.join(requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/other_names').json()['other_names']['data'])
    Transcript = requests.get('http://rest.wormbase.org/rest/widget/gene/' + gene + '/sequences').json()['fields']['gene_models']['data']['table'][0]['model'][0]['id']
    Operon = 'N.A.' if requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/operon').json()['operon']['data'] is None else requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/operon').json()['operon']['data']['label']
    ProteinDomain = [val['id'] for key, val in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/protein_domains').json()['protein_domains']['data'].items() if 'id' in val]
    Uniprot = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs').json()['xrefs']['data']['TrEMBL']['UniProtAcc']['ids'] if 'TrEMBL' in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs').json()['xrefs']['data'] else requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs').json()['xrefs']['data']['SwissProt']['UniProtAcc']['ids'] if 'SwissProt' in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs').json()['xrefs']['data'] else 'N.A.'
    RefUniprotID = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs').json()['xrefs']['data']['UniProt_GCRP']['UniProtAcc']['ids']
    TreeFam = 'N.A.' if requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/treefam').json()['treefam']['data'] is None else requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/treefam').json()['treefam']['data']
    RefSeqmRNA = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs').json()['xrefs']['data']['RefSeq']['mRNA']['ids']
    RefSeqProtein = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/xrefs').json()['xrefs']['data']['RefSeq']['protein']['ids']
    GeneticMapPosition = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/genetic_position').json()['genetic_position']['data'][0]['chromosome'] + ':' + str(requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/genetic_position').json()['genetic_position']['data'][0]['position'])
    RNAiPhen = sorted(list(set([i['phenotype']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/phenotype').json()['phenotype']['data'] if 'RNAi' in i['evidence']])))
    AllelePhen = sorted(list(set([i['phenotype']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/phenotype').json()['phenotype']['data'] if 'Allele' in i['evidence']])))
    NonSilent = sorted(list(set([i['variation']['label'] + '|' + i['effects'][0] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/alleles').json()['alleles']['data'] if 'effects' in i])))
    InteractingGene = sorted(list(set([i['affected']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/interaction_details').json()['interaction_details']['data']['edges_all']])))
    ExprPatternTissue = sorted([i['ontology_term']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/expressed_in').json()['expressed_in']['data'] if i['details'][0]['text']['class'] == 'expr_pattern'])
    GenomicStudyTissue = sorted([i['ontology_term']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/expressed_in').json()['expressed_in']['data'] if i['details'][0]['text']['class'] == 'expression_cluster'])
    ExprPatternLifeStage = 'N.A.' if requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/expressed_during').json()['expressed_during']['data'] is None else sorted([i['ontology_term']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/expressed_during').json()['expressed_during']['data']])
    GenomicStudyLifeStage = 'N.A.' if requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/fpkm_expression_summary_ls').json()['fpkm_expression_summary_ls']['data']['table']['fpkm']['data'] is None else sorted(list(set([i['life_stage']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/fpkm_expression_summary_ls').json()['fpkm_expression_summary_ls']['data']['table']['fpkm']['data']])))
    Disease = 'N.A.' if requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/human_diseases').json()['human_diseases']['data'] is None else sorted([i['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/human_diseases').json()['human_diseases']['data']['potential_model']])
    HumanOrtholog = 'N.A.' if requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/other_orthologs').json()['other_orthologs']['data'] is None else sorted([i['ortholog']['id'] + '|' + '; '.join([j['id'] for j in i['method']]) for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/other_orthologs').json()['other_orthologs']['data'] if (i['ortholog']['taxonomy'] == 'h_sapiens')])
    GeneOntology = 'N.A.' if requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/gene_ontology').json()['gene_ontology']['data'] is None else sorted(list(set([i['term_description']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/gene_ontology').json()['gene_ontology']['data']['Biological_process']] + [i['term_description']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/gene_ontology').json()['gene_ontology']['data']['Molecular_function']] + [i['term_description']['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/gene_ontology').json()['gene_ontology']['data']['Cellular_component']] + [i['with'][0]['label'] for i in requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/gene_ontology').json()['gene_ontology']['data']['Cellular_component'] if i['with'] is not None and i['with'][0]['label'][:7] != 'Panther'])))
    AutomatedDesc = requests.get('http://rest.wormbase.org/rest/field/gene/' + gene + '/concise_description').json()['concise_description']['data']['text']
    Reference = 'N.A.' if requests.get('http://rest.wormbase.org/rest/widget/gene/' + gene + '/references').json()['fields']['references']['data'] is None else [i['name']['id'] for i in requests.get('http://rest.wormbase.org/rest/widget/gene/' + gene + '/references').json()['fields']['references']['data']['results']]
    # DataFrame.append was removed in pandas 2.0; concatenate a one-row dataframe instead
    EssentialGeneInfo = pd.concat([EssentialGeneInfo, pd.DataFrame([{'WormBase Gene ID': WormBaseGeneID,
                                            'Public Name': PublicName,
                                            'Species': Species,
                                            'Sequence Name': SequenceName,
                                            'Other Name': OtherName,
                                            'Transcript': Transcript,
                                            'Operon': Operon,
                                            'Protein Domain': ProteinDomain,
                                            'UniProt': Uniprot,
                                            'Reference UniProt ID': RefUniprotID,
                                            'TreeFam': TreeFam,
                                            'RefSeq_mRNA': RefSeqmRNA,
                                            'RefSeq_protein': RefSeqProtein,
                                            'Genetic Map Position': GeneticMapPosition,
                                            'RNAi Phenotype Observed': RNAiPhen,
                                            'Allele Phenotype Observed': AllelePhen,
                                            'Coding_exon Non_silent Allele': NonSilent,
                                            'Interacting Gene': InteractingGene,
                                            'Expr_pattern Tissue': ExprPatternTissue,
                                            'Genomic Study Tissue': GenomicStudyTissue,
                                            'Expr_pattern LifeStage': ExprPatternLifeStage,
                                            'Genomic Study LifeStage': GenomicStudyLifeStage,
                                            'Disease Info': Disease,
                                            'Human Ortholog': HumanOrtholog,
                                            'Gene Ontology Association': GeneOntology,
                                            'Automated Description': AutomatedDesc,
                                            'Reference': Reference}])],
                                            ignore_index = True)
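
Several lookups in the loop above index straight into the nested JSON (for example the `RefSeq` entries under `xrefs`) and would raise a `KeyError` for a gene that lacks that entry. One possible way to guard against this, offered as a sketch and not as part of the original notebook, is a small helper that walks a path of keys and falls back to `'N.A.'`. The name `dig` and the sample dictionary are hypothetical; the sample is hand-written and is not real WormBase output.

```python
def dig(data, *path, default='N.A.'):
    """Follow keys and list indexes through nested JSON; return default if any step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return default if current is None else current

# Hand-written stand-in for an xrefs response (placeholder value, not a real accession)
sample = {'xrefs': {'data': {'RefSeq': {'mRNA': {'ids': ['EXAMPLE_ID']}}}}}

print(dig(sample, 'xrefs', 'data', 'RefSeq', 'mRNA', 'ids'))         # ['EXAMPLE_ID']
print(dig(sample, 'xrefs', 'data', 'TrEMBL', 'UniProtAcc', 'ids'))   # N.A.
```

Applied to a parsed response, `dig(response, 'xrefs', 'data', 'RefSeq', 'mRNA', 'ids')` would stand in for the chained square-bracket lookups while keeping the same `'N.A.'` convention the loop already uses.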

In [5]:
EssentialGeneInfo

Unnamed: 0,WormBase Gene ID,Public Name,Species,Sequence Name,Other Name,Transcript,Operon,Protein Domain,UniProt,Reference UniProt ID,TreeFam,RefSeq_mRNA,RefSeq_protein,Genetic Map Position,RNAi Phenotype Observed,Allele Phenotype Observed,Coding_exon Non_silent Allele,Interacting Gene,Expr_pattern Tissue,Genomic Study Tissue,Expr_pattern LifeStage,Genomic Study LifeStage,Disease Info,Human Ortholog,Gene Ontology Association,Automated Description,Reference
0,WBGene00001648,goa-1,Caenorhabditis elegans,C26C6.2,"unc-109, CELE_C26C6.2",C26C6.2.1,N.A.,"[INTERPRO:IPR001408, INTERPRO:IPR027417, INTER...",[P51875],[P51875],[],[NM_059707.6],[NP_492108.1],I:2.098785,"[P0 spindle position defective early emb, anti...","[acetylcholine synaptic transmission variant, ...","[n1134|Missense, n499|Missense, sa734|Nonsense...","[F10D7.5, F38H4.3, F47F2.1, F53A10.2, W02B3.8,...","[ALML, ALMR, AVM, BDUL, BDUR, CANL, CANR, HSNL...","[AFD, ASER, AVK, DA neuron, NSM, OLL, PVD, RIS...","[adult Ce, embryo Ce, larva Ce, postembryonic Ce]","[1-cell embryo Ce, 1-day post-L4 adult hermaph...",[developmental and epileptic encephalopathy 17],[HGNC:4389|EnsEMBL-Compara; OrthoFinder; Round...,[G protein-coupled acetylcholine receptor acti...,"Enables several functions, including G protein...","[WBPaper00014801, WBPaper00034090, WBPaper0003..."
1,WBGene00012578,ccct-1,Caenorhabditis elegans,Y37H9A.3,CELE_Y37H9A.3,Y37H9A.3.1,CEOP1933,"[INTERPRO:IPR035892, INTERPRO:IPR039725, INTER...",[Q9U2M8],[Q9U2M8],[TF314229],[NM_001373656.2],[NP_001359623.1],I:21.6609,[sterile],"[lethal, sterile]",[tm2372|Splice site],"[C30F12.4, F07F6.8, F59D12.2, R119.2, ccct-1, ...",[],"[NSM, germ line, intestine]",N.A.,"[1-cell embryo Ce, 1-day post-L4 adult hermaph...",[autosomal recessive non-syndromic intellectua...,[HGNC:29386|EnsEMBL-Compara; OrthoFinder; Inpa...,"[DNA-binding transcription factor activity, RN...",Is predicted to enable DNA-binding transcripti...,"[WBPaper00038491, WBPaper00055090]"
2,WBGene00021277,ddx-10,Caenorhabditis elegans,Y23H5B.6,"CELE_Y23H5B.6, Y74C10A_153.a, Y23H5B.c",Y23H5B.6a.1,CEOP1048,"[INTERPRO:IPR027417, INTERPRO:IPR025313, INTER...","[Q9N478, A0A061ADT4, A0A061ACL9]",[Q9N478],[],"[NM_001306361.3, NM_058588.6, NM_001306360.3]","[NP_001293290.1, NP_490989.1, NP_001293289.1]",I:-8.47867,"[embryonic lethal, larval arrest, late larval ...",[],[],"[B0280.9, B0511.6, F57B10.8, F58B3.4, K01G5.5,...",[],"[AVE, body wall muscle cell, germ line, intest...",N.A.,"[1-cell embryo Ce, 1-day post-L4 adult hermaph...",N.A.,[HGNC:2735|EnsEMBL-Compara; OrthoFinder; Round...,"[ATP binding, RNA binding, RNA helicase activi...",Is predicted to enable ATP binding activity; R...,"[WBPaper00038491, WBPaper00055090]"


The dataframe can be written into a csv file to save the essential gene information and for any analyses that you want to perform later.

In [6]:
EssentialGeneInfo.to_csv('EssentialGeneInformation.csv')
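
Many cells in this dataframe hold Python lists, which `to_csv` writes as their string representation, and the default index becomes an extra unnamed column when the file is read back. The sketch below shows one possible way to handle both points: join each list into a delimited string and pass `index=False`. The helper name `flatten_lists` and the `'; '` separator are assumptions, and the small `demo` frame reuses values from the output above so the example runs without network access.

```python
import io
import pandas as pd

def flatten_lists(frame, sep='; '):
    """Return a copy in which every list cell is joined into one delimited string."""
    def join_if_list(value):
        return sep.join(str(item) for item in value) if isinstance(value, list) else value
    return frame.apply(lambda column: column.map(join_if_list))

demo = pd.DataFrame({'WormBase Gene ID': ['WBGene00012578'],
                     'UniProt': [['Q9U2M8']],
                     'Allele Phenotype Observed': [['lethal', 'sterile']]})

buffer = io.StringIO()  # in-memory stand-in for a file on disk
flatten_lists(demo).to_csv(buffer, index=False)  # index=False avoids an extra unnamed index column
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

For the full table, the equivalent call would be `flatten_lists(EssentialGeneInfo).to_csv('EssentialGeneInformation.csv', index=False)`.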

This is the end of the tutorial on replicating SimpleMine results using the WormBase RESTful API to get the essential gene information. The data is up to date, quick to extract, and easier to handle than the results from SimpleMine.

This tutorial is also the end of the Data Retrieval series. In the next tutorial, we will implement and test some simple utilities that can help us work with the data we have retrieved until now.
