# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Data Retrieval 2 - Getting data from WormMine**
Welcome to the second Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we will write Python code to retrieve data available on the WormBase sites and perform simple analyses with it.

This tutorial deals with the WormBase data available from WormMine (http://intermine.wormbase.org/tools/wormmine/begin.do).
We will explore the site and the intermine Python package, and extract data of interest. Let's get started!

We start by installing and loading the libraries that are required for this tutorial. 

In [1]:
!pip install intermine
import intermine
from intermine import registry
from intermine.webservice import Service
import pandas as pd



`registry.getInfo(mine)` fetches the information about a particular mine, such as its description, version, and associated organisms.

In [2]:
registry.getInfo("WormMine")

Description: Intermine data mining platform for C. elegans and related nematodes
URL: http://intermine.wormbase.org/tools/wormmine/
API Version: 31
Release Version: 280
InterMine Version: 4.2.0
Organisms: 
C. elegans
Neighbours: 
Animals
AGR


`registry.getData(mine)` lists the data sets registered for the mine.
Note that this list is not completely representative of the data available on WormMine. Check the WormMine site for the complete data!

In [3]:
registry.getData("WormMine")

Name: C. elegans genomic annotations (GFF3 Gene)
Name: Disease Ontology
Name: FlyMine intergenic regions
Name: GO
Name: GO Annotation data set
Name: Panther orthologue and paralogue predictions
Name: WormBase B. malayi protein sequence (Fasta)
Name: WormBase C. brenneri protein sequence (Fasta)
Name: WormBase C. briggsae protein sequence (Fasta)
Name: WormBase C. elegans CDS sequence (Fasta)
Name: WormBase C. elegans chromosome sequence (Fasta)
Name: WormBase C. elegans protein sequence (Fasta)
Name: WormBase C. elegans transcript sequence (Fasta)
Name: WormBase C. japonica protein sequence (Fasta)
Name: WormBase C. remanei protein sequence (Fasta)
Name: WormBase O. volvulus protein sequence (Fasta)
Name: WormBase P. pacificus protein sequence (Fasta)
Name: WormBase S. ratti protein sequence (Fasta)
Name: WormBase T. muris protein sequence (Fasta)
Name: WormBaseAcedbConverter
Name: anatomy


The `new_query` method of the `Service` class creates a query object.

In [4]:
service = Service("http://intermine.wormbase.org/tools/wormmine/service")
query=service.new_query()

### Performing simple queries on the WormMine database

Let's query the WormMine database to extract the commonName, genus, name, shortName, species, and taxonId of all organisms whose data is available:

In [5]:
query=service.new_query("Organism")
query.select("commonName", "genus", "name", "shortName", "species","taxonId")

#Insert first 10 rows of the query results into a dataframe and display the output!
#Collect the rows in a list first, since DataFrame.append was removed in pandas 2.0
rows = []
for row in query.rows(start=0,size=10):
    info = {'commonName':row[0], 'genus':row[1], 'name':row[2], 'shortName':row[3], 'species':row[4], 
            'taxonId':row[5]}
    rows.append(info)
organisms_data = pd.DataFrame(rows, columns = ["commonName", "genus", "name", "shortName", "species","taxonId"])
organisms_data

Unnamed: 0,commonName,genus,name,shortName,species,taxonId
0,Norway rat,Rattus,Rattus norvegicus,R. norvegicus,norvegicus,10116
1,agent of lymphatic filariasis,Brugia,Brugia malayi,B. malayi,malayi,6279
2,baker's yeast,Saccharomyces,Saccharomyces cerevisiae,S. cerevisiae,cerevisiae,4932
3,barber pole worm,Haemonchus,Haemonchus contortus,H. contortus,contortus,6289
4,fruit fly,Drosophila,Drosophila melanogaster,D. melanogaster,melanogaster,7227
5,house mouse,Mus,Mus musculus,M. musculus,musculus,10090
6,human,Homo,Homo sapiens,H. sapiens,sapiens,9606
7,pig roundworm,Ascaris,Ascaris suum,A. suum,suum,6253
8,pine wood nematode,Bursaphelenchus,Bursaphelenchus xylophilus,B. xylophilus,xylophilus,6326
9,roundworm,Caenorhabditis,Caenorhabditis elegans,C. elegans,elegans,6239
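
The `start` and `size` arguments to `query.rows()` select a window of the result set, so a larger result set can be walked one page at a time. The sketch below shows that paging pattern in isolation. The names `iter_pages` and `fetch_page` are hypothetical, and `fetch_page` only stands in for a call such as `query.rows(start=start, size=size)`: it slices a local list so that the example runs without a network connection.

```python
def iter_pages(fetch_page, page_size):
    """Yield rows page by page until a page comes back short or empty."""
    start = 0
    while True:
        page = list(fetch_page(start, page_size))
        yield from page
        if len(page) < page_size:
            break
        start += page_size


# Stand-in for the remote result set (made-up data, 25 rows).
fake_results = [f"row-{i}" for i in range(25)]


def fetch_page(start, size):
    # With a real query this could be: query.rows(start=start, size=size)
    return fake_results[start:start + size]


all_rows = list(iter_pages(fetch_page, page_size=10))
print(len(all_rows))
```

Stopping on a short page avoids one extra request in most cases, at the cost of a final empty request when the total is an exact multiple of the page size.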


Let's query the WormMine database to extract the automatedDescription, biotype, briefDescription, length, operon, primaryIdentifier, secondaryIdentifier, and symbol of all genes whose data is available:

In [6]:
query=service.new_query("Gene")
query.select("automatedDescription", "biotype", "briefDescription", "length", "operon", "primaryIdentifier", 
             "secondaryIdentifier", "symbol")

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'automatedDescription':row[0], 'biotype':row[1], 'briefDescription':row[2], 'length':row[3], 
            'operon':row[4], 'primaryIdentifier':row[5], 'secondaryIdentifier':row[6], 'symbol':row[7]}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["automatedDescription", "biotype", "briefDescription", "length", "operon", 
                                           "primaryIdentifier", "secondaryIdentifier", "symbol"])
genes_data

Unnamed: 0,automatedDescription,biotype,briefDescription,length,operon,primaryIdentifier,secondaryIdentifier,symbol
0,'Skinny hedgehog' (SKI1) encodes an enzyme tha...,SO:0001217,,,,HGNC:18270,,Hhat
1,(6-4)-photolyase (phr6-4) encodes an enzyme th...,SO:0001217,,,,FB:FBgn0016054,,phr6-4
2,"1,3-beta-glucanosyltransferase; has similarity...",SO:0001217,,,,SGD:S000005390,,Gas5
3,"1,3-beta-glucanosyltransferase; involved with ...",SO:0001217,,,,SGD:S000005492,,Gas4
4,"1,3-beta-glucanosyltransferase; involved with ...",SO:0001217,,,,SGD:S000004335,,Gas2
5,"1,4-Alpha-Glucan Branching Enzyme (AGBE) encod...",SO:0001217,,,,FB:FBgn0053138,,AGBE
6,1-acyl-sn-glycerol-3-phosphate acyltransferase...,SO:0001217,,,,SGD:S000002210,,SLC1
7,1-phosphatidylinositol-3-phosphate 5-kinase; v...,SO:0001217,,,,SGD:S000001915,,FAB1
8,"14-3-3 protein, major isoform; controls proteo...",SO:0001217,,,,SGD:S000000979,,BMH1
9,"14-3-3 protein, minor isoform; controls proteo...",SO:0001217,,,,SGD:S000002506,,BMH2


Let's create a query object to query the description of all Gene Ontology (GO) terms from the WormMine database. We can then add extra columns (fields) to our query based on our needs - here, the identifier for all GO terms is added to the query. 
When there are multiple fields in a query, the default ordering of the output is based on the first field, but this can be changed to any other field.

In [7]:
query=service.new_query()
query.select("GOTerm.description")

query.add_view("GOTerm.identifier") #Add a column to the query
 
query.add_sort_order("GOTerm.identifier") #Changing the sorting order of the query by a specific column

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'GOTerm.description':row[0], 'GOTerm.identifier':row[1]}
    rows.append(info)
GO_data = pd.DataFrame(rows, columns = ["GOTerm.description", "GOTerm.identifier"])
GO_data

Unnamed: 0,GOTerm.description,GOTerm.identifier
0,"The distribution of mitochondria, including th...",GO:0000001
1,The maintenance of the structure and integrity...,GO:0000002
2,The production of new individuals that contain...,GO:0000003
3,Enables the transfer of zinc ions (Zn2+) from ...,GO:0000006
4,Enables the transfer of a solute or solutes fr...,GO:0000007
5,Catalysis of the transfer of a mannose residue...,GO:0000009
6,Catalysis of the reaction: all-trans-hexapreny...,GO:0000010
7,The distribution of vacuoles into daughter cel...,GO:0000011
8,The repair of single strand breaks in DNA. Rep...,GO:0000012
9,Catalysis of the hydrolysis of ester linkages ...,GO:0000014


### Performing queries on the WormMine database using constraints

Let's query the WormMine database to extract all details of all organisms for which data is available. Then we can add a constraint to the query to restrict outputs to only those with a certain value in one of the fields, here Caenorhabditis in genus.

In [8]:
query=service.new_query("Organism")
query.select("commonName", "genus", "name", "shortName", "species","taxonId")

query.add_constraint("genus","=","Caenorhabditis") #Add a constraint to the query based on a column

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'commonName':row[0], 'genus':row[1], 'name':row[2], 'shortName':row[3], 'species':row[4], 
            'taxonId':row[5]}
    rows.append(info)
organisms_data = pd.DataFrame(rows, columns = ["commonName", "genus", "name", "shortName", "species","taxonId"])
organisms_data

Unnamed: 0,commonName,genus,name,shortName,species,taxonId
0,roundworm,Caenorhabditis,Caenorhabditis elegans,C. elegans,elegans,6239
1,,Caenorhabditis,Caenorhabditis angaria,C. angaria,angaria,860376
2,,Caenorhabditis,Caenorhabditis brenneri,C. brenneri,brenneri,135651
3,,Caenorhabditis,Caenorhabditis briggsae,C. briggsae,briggsae,6238
4,,Caenorhabditis,Caenorhabditis inopinata,C. inopinata,inopinata,1978547
5,,Caenorhabditis,Caenorhabditis japonica,C. japonica,japonica,281687
6,,Caenorhabditis,Caenorhabditis latens,C. latens,latens,1503980
7,,Caenorhabditis,Caenorhabditis nigoni,C. nigoni,nigoni,1611254
8,,Caenorhabditis,Caenorhabditis remanei,C. remanei,remanei,31234
9,,Caenorhabditis,Caenorhabditis sinica,C. sinica,sinica,1550068


Let's query the WormMine database to extract some details of all genes for which data is available. Then we can add multiple constraints. It is not necessary that the constraints are related to the fields explicitly mentioned in the query. 

Here, we add a constraint based on the value of genus being Caenorhabditis and another based on the value of ontologyTerm being kinase activity. 

(Even though the genus field is used as a constraint, the query does not return the column of genus as it has not been called in the query.)

In [9]:
query=service.new_query("Gene")
query.select("primaryIdentifier", "ontologyAnnotations.id", "ontologyAnnotations.qualifier")

query.add_constraint("organism.genus","=","Caenorhabditis") #Add a constraint to the query based on a column

query.add_constraint("ontologyAnnotations.ontologyTerm.name","=","kinase activity") #Add a second constraint

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'primaryIdentifier':row[0], 'ontologyAnnotations.id':row[1], 'ontologyAnnotations.qualifier':row[2]}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["primaryIdentifier", "ontologyAnnotations.id", 
                                           "ontologyAnnotations.qualifier"])
genes_data

Unnamed: 0,primaryIdentifier,ontologyAnnotations.id,ontologyAnnotations.qualifier
0,WBGene00000001,5000031,enables
1,WBGene00000018,5000365,enables
2,WBGene00000090,5003503,enables
3,WBGene00000090,5003504,enables
4,WBGene00000098,5003908,enables
5,WBGene00000099,5004016,enables
6,WBGene00000101,5004235,enables
7,WBGene00000102,5004261,enables
8,WBGene00000103,5004402,enables
9,WBGene00000186,5006755,enables


Let's query the WormMine database to extract all available homologue data, along with the primaryIdentifier and symbol of the corresponding genes. As mentioned previously, we can add multiple constraints, and we can also control how those constraints are combined.

Each constraint gets a code (A, B, C, ... in the order it is added), and logic operators combine these codes: `&` means AND and `|` means OR.

Here, we have 3 constraints: the genus is Caenorhabditis (A), the species is elegans (B), and the type of homologue is orthologue (C).

We then set the logic to `A & B & C`, so every row in the output must have Caenorhabditis as genus AND elegans as species AND orthologue as homologue type. A logic string such as `(A & B) | C` would instead accept rows that satisfy either A and B together, or C.

In [10]:
query=service.new_query("Homologue")
query.select('Homologue.id', 'Homologue.type', 'gene.primaryIdentifier', 'gene.symbol')

query.add_constraint("gene.organism.genus","=","Caenorhabditis") #Set up constraint A
query.add_constraint("gene.organism.species","=","elegans") #Set up constraint B
query.add_constraint("type","=","orthologue") #Set up constraint C

query.set_logic("A & B & C") #Logic operators can be used to set the different constraints on the query

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'Homologue.id':row[0], 'Homologue.type':row[1], 'gene.primaryIdentifier':row[2], 'gene.symbol':row[3]}
    rows.append(info)
homologue_data = pd.DataFrame(rows, columns = ["Homologue.id", "Homologue.type", "gene.primaryIdentifier", 
                                               "gene.symbol"])
homologue_data

Unnamed: 0,Homologue.id,Homologue.type,gene.primaryIdentifier,gene.symbol
0,39000008,orthologue,WBGene00004790,sgn-1
1,39000012,orthologue,WBGene00004790,sgn-1
2,39000016,orthologue,WBGene00004790,sgn-1
3,39000021,orthologue,WBGene00004790,sgn-1
4,39000025,orthologue,WBGene00004790,sgn-1
5,39000029,orthologue,WBGene00004790,sgn-1
6,39000034,orthologue,WBGene00004790,sgn-1
7,39000060,orthologue,WBGene00008583,ugt-65
8,39000064,orthologue,WBGene00009450,ugt-58
9,39000068,orthologue,WBGene00011238,ugt-59
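
To see how the logic string changes which rows survive, the three constraints can be modelled as plain boolean tests on a few local records. This is only an illustration with made-up records and hypothetical helper functions `A`, `B`, and `C`; WormMine evaluates the real logic on the server.

```python
# Made-up homologue records, used only to illustrate the logic.
records = [
    {"genus": "Caenorhabditis", "species": "elegans", "type": "orthologue"},
    {"genus": "Caenorhabditis", "species": "elegans", "type": "paralogue"},
    {"genus": "Caenorhabditis", "species": "briggsae", "type": "orthologue"},
    {"genus": "Homo", "species": "sapiens", "type": "paralogue"},
]


def A(r):
    return r["genus"] == "Caenorhabditis"


def B(r):
    return r["species"] == "elegans"


def C(r):
    return r["type"] == "orthologue"


# "A & B & C": every constraint must hold.
all_three = [r for r in records if A(r) and B(r) and C(r)]

# "(A & B) | C": C. elegans rows of any type, plus orthologues from any organism.
either = [r for r in records if (A(r) and B(r)) or C(r)]

print(len(all_three), len(either))
```

With these records, the AND-only logic keeps a single row, while the mixed logic also keeps the C. elegans paralogue and the C. briggsae orthologue.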


#### Different types of constraints

There are several kinds of constraints: Unary, Binary, Ternary, Multi-value, List, Sub-Class, Loop, and Range. We explore an example of each constraint type.
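
As a rough mental model, each constraint operator behaves like a filter over rows. The sketch below mimics three operators (`IS NOT NULL`, `>=`, and `ONE OF`) on local, made-up gene records. The helper names are hypothetical and nothing here calls WormMine.

```python
# Made-up gene records (symbols, identifiers, and lengths are illustrative only).
genes = [
    {"symbol": "gene-a", "secondaryIdentifier": "X1.1", "length": 15000},
    {"symbol": "gene-b", "secondaryIdentifier": None, "length": 900},
    {"symbol": "gene-c", "secondaryIdentifier": "X2.4", "length": 12000},
]


def is_not_null(field):
    # Unary: only checks whether the field has a value.
    return lambda row: row[field] is not None


def at_least(field, value):
    # Binary: compares the field against one value (here ">=").
    return lambda row: row[field] is not None and row[field] >= value


def one_of(field, values):
    # Multi-value: the field must match any value in the collection.
    allowed = set(values)
    return lambda row: row[field] in allowed


def select_symbols(rows, predicate):
    # Return the symbols of the rows that pass the predicate.
    return [row["symbol"] for row in rows if predicate(row)]


print(select_symbols(genes, is_not_null("secondaryIdentifier")))
print(select_symbols(genes, at_least("length", 12000)))
print(select_symbols(genes, one_of("symbol", ["gene-b", "gene-c"])))
```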

##### Unary Constraints - 
Constraints that do not take a value; they check whether a particular attribute is absent or present.

Types of Unary Constraints - IS NULL and IS NOT NULL

Let's query the WormMine database to extract the automatedDescription, biotype, briefDescription, length, operon, primaryIdentifier, secondaryIdentifier, and symbol of all genes whose data is available, and then retain only those results where there is a valid secondaryIdentifier!

In [11]:
query=service.new_query("Gene")
query.select("automatedDescription", "biotype", "briefDescription", "length", "operon", "primaryIdentifier", 
             "secondaryIdentifier", "symbol")

query.add_constraint("secondaryIdentifier","IS NOT NULL") #Unary Constraint

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'automatedDescription':row[0], 'biotype':row[1], 'briefDescription':row[2], 'length':row[3], 
            'operon':row[4], 'primaryIdentifier':row[5], 'secondaryIdentifier':row[6], 'symbol':row[7]}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["automatedDescription", "biotype", "briefDescription", "length", "operon", 
                                           "primaryIdentifier", "secondaryIdentifier", "symbol"])
genes_data

Unnamed: 0,automatedDescription,biotype,briefDescription,length,operon,primaryIdentifier,secondaryIdentifier,symbol
0,ADP-ribosylation factor 4D is a member of the ...,SO:0001217,,,,HGNC:656,ARL4D,Arl4d
1,ADP-ribosylation factor-like 4A is a member of...,SO:0001217,,,,HGNC:695,ARL4A,Arl4a
2,ADP-ribosylation factor-like 4C is a member of...,SO:0001217,,,,HGNC:698,ARL4C,Arl4c
3,ADP-ribosyltransferase catalyzes the ADP-ribos...,SO:0001217,,,,HGNC:723,ART1,Art1
4,AIM2 is a member of the IFI20X /IFI16 family. ...,SO:0001217,,,,HGNC:357,AIM2,Aim2
5,ASSOCIATED WITH Atrioventricular Septal Defect 4,SO:0001267,,,,HGNC:50403,SNORA99,SNORA99
6,"ASSOCIATED WITH Bardet-Biedl Syndrome, 21; con...",SO:0001263,,,,HGNC:50444,C8orf37-AS1,C8orf37-AS1
7,ASSOCIATED WITH Chromosome 13q Deletion Syndro...,SO:0001263,,,,HGNC:50277,LMO7-AS1,LMO7-AS1
8,ASSOCIATED WITH Chromosome 13q Deletion Syndro...,SO:0001263,,,,HGNC:50496,DLEU1-AS1,DLEU1-AS1
9,ASSOCIATED WITH Experimental Liver Cirrhosis; ...,SO:0001217,,,,HGNC:1310,TMEM184B,Tmem184b


##### Binary Constraints - 
Constraints that compare a particular attribute to a specified value.

Types of Binary Constraints - `=`, `<=`, `>=`, `<`, `>`, `!=`.

Let's query the WormMine database to extract the automatedDescription, biotype, briefDescription, length, operon, primaryIdentifier, secondaryIdentifier, and symbol of all genes whose data is available, and then retain only those results where the value of the length is greater than or equal to 12000!

In [12]:
query=service.new_query("Gene")
query.select("automatedDescription", "biotype", "briefDescription", "length", "operon", "primaryIdentifier", 
             "secondaryIdentifier", "symbol")

query.add_constraint("length",">=","12000") #Binary Constraint

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'automatedDescription':row[0], 'biotype':row[1], 'briefDescription':row[2], 'length':row[3], 
            'operon':row[4], 'primaryIdentifier':row[5], 'secondaryIdentifier':row[6], 'symbol':row[7]}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["automatedDescription", "biotype", "briefDescription", "length", "operon", 
                                           "primaryIdentifier", "secondaryIdentifier", "symbol"])
genes_data

Unnamed: 0,automatedDescription,biotype,briefDescription,length,operon,primaryIdentifier,secondaryIdentifier,symbol
0,Enables ATP-dependent microtubule motor activi...,SO:0001217,unc-104 encodes a kinesin-like motor protein h...,15639,,WBGene00006831,C52E12.2,unc-104
1,Enables ATPase-coupled intramembrane lipid tra...,SO:0001217,tat-1 encodes one of six C. elegans subfamily ...,22818,,WBGene00013034,Y49E10.11,tat-1
2,Enables DEAD/H-box RNA helicase binding activi...,SO:0001217,zyx-1 encodes a LIM domain protein and is most...,14604,,WBGene00006999,F42G4.3,zyx-1
3,Enables DNA-binding transcription factor activ...,SO:0001217,sbp-1 encodes a basic helix-loop-helix (bHLH) ...,14200,,WBGene00004735,Y47D3B.7,sbp-1
4,Enables DNA-binding transcription factor activ...,SO:0001217,vab-3 encodes a homeodomain protein that is th...,19691,,WBGene00006870,F14F3.1,vab-3
5,Enables G protein-coupled GABA receptor activi...,SO:0001217,gbb-2 encodes a subunit of the GABAB receptor ...,17820,,WBGene00022675,ZK180.1,gbb-2
6,Enables G protein-coupled acetylcholine recept...,SO:0001217,gar-3 encodes a G protein-linked muscarinic ac...,14901,,WBGene00001519,Y40H4A.1,gar-3
7,Enables G-protein alpha-subunit binding activi...,SO:0001217,ags-3 encodes a protein containing N-terminal ...,13877,,WBGene00000092,F32A6.4,ags-3
8,Enables GABA receptor binding activity. Is inv...,SO:0001217,,12622,,WBGene00001490,H05G16.1,frm-3
9,Enables GTPase activating protein binding acti...,SO:0001217,mpz-1 encodes a multi-PDZ domain scaffold prot...,39492,,WBGene00003404,C52A11.4,mpz-1


##### Ternary Constraints - 
Constraints that have one required value and one optional value.

Types of Ternary Constraints - LOOKUP (this operator searches through the fields in a particular class for the value specified by the user)

Let's query the WormMine database to extract the automatedDescription, biotype, briefDescription, length, operon, primaryIdentifier, secondaryIdentifier, and symbol of all genes whose data is available, and then retain only those results that match 'hlh-2'!

The extra_value parameter can be used to limit the search by a related object (for example, the organism of a gene), here C. elegans.

In [13]:
query=service.new_query("Gene")
query.select("automatedDescription", "biotype", "briefDescription", "length", "operon", "primaryIdentifier", 
             "secondaryIdentifier", "symbol")

query.add_constraint("Gene", "LOOKUP", "hlh-2", extra_value='C. elegans') #Ternary Constraint

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'automatedDescription':row[0], 'biotype':row[1], 'briefDescription':row[2], 'length':row[3], 
            'operon':row[4], 'primaryIdentifier':row[5], 'secondaryIdentifier':row[6], 'symbol':row[7]}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["automatedDescription", "biotype", "briefDescription", "length", "operon", 
                                           "primaryIdentifier", "secondaryIdentifier", "symbol"])
genes_data

Unnamed: 0,automatedDescription,biotype,briefDescription,length,operon,primaryIdentifier,secondaryIdentifier,symbol
0,"Enables several functions, including DNA-bindi...",SO:0001217,hlh-2 encodes a Class I basic helix-loop-helix...,3133,,WBGene00001949,M05B5.5,hlh-2


##### Multi-value Constraints - 
Constraints that can take multiple values.

Types of Multi-value Constraints - ONE OF and NONE OF

Let's query the WormMine database to extract the automatedDescription, biotype, briefDescription, length, operon, primaryIdentifier, secondaryIdentifier, and symbol of all genes whose data is available, and then retain only those results where the gene symbol is one of hlh-2, unc-26, gar-3, or gbb-2.

In [14]:
query=service.new_query("Gene")
query.select("automatedDescription", "biotype", "briefDescription", "length", "operon", "primaryIdentifier", 
             "secondaryIdentifier", "symbol")

query.add_constraint("symbol","ONE OF",['hlh-2','unc-26', 'gar-3', 'gbb-2']) #Multi-value Constraint

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'automatedDescription':row[0], 'biotype':row[1], 'briefDescription':row[2], 'length':row[3], 
            'operon':row[4], 'primaryIdentifier':row[5], 'secondaryIdentifier':row[6], 'symbol':row[7]}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["automatedDescription", "biotype", "briefDescription", "length", "operon", 
                                           "primaryIdentifier", "secondaryIdentifier", "symbol"])
genes_data

Unnamed: 0,automatedDescription,biotype,briefDescription,length,operon,primaryIdentifier,secondaryIdentifier,symbol
0,Enables G protein-coupled GABA receptor activi...,SO:0001217,gbb-2 encodes a subunit of the GABAB receptor ...,17820,,WBGene00022675,ZK180.1,gbb-2
1,Enables G protein-coupled acetylcholine recept...,SO:0001217,gar-3 encodes a G protein-linked muscarinic ac...,14901,,WBGene00001519,Y40H4A.1,gar-3
2,"Enables several functions, including DNA-bindi...",SO:0001217,hlh-2 encodes a Class I basic helix-loop-helix...,3133,,WBGene00001949,M05B5.5,hlh-2
3,"Is predicted to enable phosphatidylinositol-3,...",SO:0001217,"unc-26 encodes synaptojanin, a polyphosphoinos...",10962,CEOP4539,WBGene00006763,JC8.10,unc-26


##### List Constraints - 
Constraints that test whether an object belongs to a list saved on the mine.

Types of List Constraints - IN and NOT IN

Let's query the WormMine database to extract the automatedDescription, biotype, briefDescription, length, operon, primaryIdentifier, secondaryIdentifier, and symbol of all genes whose data is available, and then retain only those results where the Gene is in the publicly available list "C. elegans transcription factor genes" on WormMine.

In [15]:
query=service.new_query("Gene")
query.select("automatedDescription", "biotype", "briefDescription", "length", "operon", "primaryIdentifier", 
             "secondaryIdentifier", "symbol")

query.add_constraint("Gene","IN","C. elegans transcription factor genes") #List Constraint

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'automatedDescription':row[0], 'biotype':row[1], 'briefDescription':row[2], 'length':row[3], 
            'operon':row[4], 'primaryIdentifier':row[5], 'secondaryIdentifier':row[6], 'symbol':row[7]}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["automatedDescription", "biotype", "briefDescription", "length", "operon", 
                                           "primaryIdentifier", "secondaryIdentifier", "symbol"])
genes_data

Unnamed: 0,automatedDescription,biotype,briefDescription,length,operon,primaryIdentifier,secondaryIdentifier,symbol
0,Acts upstream of or within positive regulation...,SO:0001217,The tlp-1 gene encodes a C2H2-type zinc finger...,2875,,WBGene00006580,T23G4.1,tlp-1
1,Contributes to RNA polymerase II activity. Is ...,SO:0001217,ama-1 encodes the large subunit of RNA polymer...,10100,,WBGene00000123,F36A4.7,ama-1
2,Contributes to RNA polymerase II transcription...,SO:0001217,mab-5 encodes a homeodomain transcription fact...,6620,,WBGene00003102,C08C3.3,mab-5
3,Contributes to sequence-specific DNA binding a...,SO:0001217,cky-1 encodes a member of the basic helix-loop...,4503,,WBGene00000521,C15C8.2,cky-1
4,Enables DEAD/H-box RNA helicase binding activi...,SO:0001217,zyx-1 encodes a LIM domain protein and is most...,14604,,WBGene00006999,F42G4.3,zyx-1
5,Enables DEAD/H-box RNA helicase binding activi...,SO:0001217,nhl-2 encodes one of five C. elegans proteins ...,5428,CEOP3232,WBGene00003598,F26F4.7,nhl-2
6,Enables DNA binding activity and RNA polymeras...,SO:0001217,ahr-1 encodes an aryl hydrocarbon receptor (li...,8583,,WBGene00000096,C41G7.5,ahr-1
7,"Enables DNA binding activity, bending and doub...",SO:0001217,ceh-37 encodes one of three C. elegans protein...,10629,,WBGene00000458,C37E2.5,ceh-37
8,Enables DNA binding activity; DNA-binding tran...,SO:0001217,aha-1 encodes an ortholog of human aryl-hydroc...,3073,,WBGene00000095,C25A1.11,aha-1
9,Enables DNA-binding transcription activator ac...,SO:0001217,ceh-17 encodes a phox-2-like homeodomain prote...,1263,,WBGene00000440,D1007.1,ceh-17


##### Sub-Class Constraints - 
Constraints that restrict a path to a specified sub-class of its class.

Let's query the WormMine database to extract the automatedDescription, biotype, briefDescription, length, operon, primaryIdentifier, secondaryIdentifier, and symbol of all genes whose data is available, and then constrain our results to only those items of the sub class GOAnnotation of ontologyAnnotations class.

In [16]:
query=service.new_query("Gene")
query.select("automatedDescription", "biotype", "briefDescription", "length", "operon", "primaryIdentifier", 
             "secondaryIdentifier", "symbol")

query.add_constraint("ontologyAnnotations","GOAnnotation") #Sub-Class constraint

#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'automatedDescription':row[0], 'biotype':row[1], 'briefDescription':row[2], 'length':row[3], 
            'operon':row[4], 'primaryIdentifier':row[5], 'secondaryIdentifier':row[6], 'symbol':row[7]}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["automatedDescription", "biotype", "briefDescription", "length", "operon", 
                                           "primaryIdentifier", "secondaryIdentifier", "symbol"])
genes_data

Unnamed: 0,automatedDescription,biotype,briefDescription,length,operon,primaryIdentifier,secondaryIdentifier,symbol
0,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,nlp-28 encodes a neuropeptide-like protein.,417,,WBGene00003766,B0213.3,nlp-28
1,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,,468,,WBGene00007992,C37A5.8,fipr-24
2,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,,2395,,WBGene00009239,F28H7.6,irld-6
3,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,,961,,WBGene00018181,F38E1.9,mpdu-1
4,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,,3594,,WBGene00016469,C36B7.6,C36B7.6
5,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,,1699,,WBGene00009112,F25D7.2,tag-353
6,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,,979,,WBGene00017705,F22E5.6,F22E5.6
7,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,,1097,,WBGene00022570,ZC239.12,sdz-35
8,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,,1326,CEOPX072,WBGene00020486,T13C5.6,T13C5.6
9,Acts upstream of or within IRE1-mediated unfol...,SO:0001217,,2275,,WBGene00010496,K02B12.3,sec-12


##### Loop Constraints - 
Constraints that assert that two paths refer (or do not refer) to the same object.

Types of Loop Constraints - IS and IS NOT

Let's query the WormMine database to extract the automatedDescription, biotype, briefDescription, length, operon, primaryIdentifier, secondaryIdentifier, and symbol of all genes whose data is available, add the gene and homologue identifiers as extra fields, and then constrain our results using a list constraint followed by a loop constraint.

In [17]:
query=service.new_query("Gene")
query.select("automatedDescription", "biotype", "briefDescription", "length", "operon", "primaryIdentifier", 
             "secondaryIdentifier", "symbol")

query.add_view("homologues.gene.primaryIdentifier","homologues.homologue.primaryIdentifier") #Add more fields

query.add_constraint("Gene", "IN", "C. elegans transcription factor genes", code = "A") #List constraint

query.add_constraint("homologues.homologue", "IS NOT", "Gene", code = "B") #Loop Constraint


#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'homologues.gene.primaryIdentifier':row['homologues.gene.primaryIdentifier'], 
            'homologues.homologue.primaryIdentifier':row['homologues.homologue.primaryIdentifier']}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["homologues.gene.primaryIdentifier", "homologues.homologue.primaryIdentifier"])
genes_data

Unnamed: 0,homologues.gene.primaryIdentifier,homologues.homologue.primaryIdentifier
0,WBGene00006580,FBgn0004858
1,WBGene00006580,FBgn0005771
2,WBGene00006580,HGNC:23589
3,WBGene00006580,HGNC:25883
4,WBGene00006580,MGI:1353644
5,WBGene00006580,MGI:2662729
6,WBGene00006580,RGD:1312018
7,WBGene00006580,RGD:1588249
8,WBGene00006580,ZDB-GENE-010717-1
9,WBGene00006580,ZDB-GENE-031113-5
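
The effect of `IS` and `IS NOT` can be pictured with a few local (gene, homologue) pairs. The sketch below uses made-up identifiers and compares them as a stand-in for the object identity that the mine checks, so it is an approximation rather than the server's behaviour.

```python
# Made-up (gene, homologue) pairs; identifiers are illustrative only.
pairs = [
    ("gene-1", "gene-1"),  # the homologue path loops back to the same gene
    ("gene-1", "gene-7"),
    ("gene-2", "gene-9"),
]

# "IS": both paths refer to the same object.
same_object = [p for p in pairs if p[0] == p[1]]

# "IS NOT": the two paths refer to different objects.
different_object = [p for p in pairs if p[0] != p[1]]

print(same_object)
print(different_object)
```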


##### Range Constraints - 
Constraints that test where a location lies relative to a set of ranges.

Types of Range Constraints - OVERLAPS, DOES NOT OVERLAP, WITHIN, OUTSIDE, CONTAINS and DOES NOT CONTAIN

Let's query the WormMine database to extract the organism name and chromosome location of all sequence features whose data is available, and then constrain our results to those whose chromosome location overlaps our specified range of I:1..4000.

In [18]:
query=service.new_query()

query.add_view("SequenceFeature.organism.shortName", 
               "SequenceFeature.chromosomeLocation.locatedOn.primaryIdentifier", 
               "SequenceFeature.chromosomeLocation.start", "SequenceFeature.chromosomeLocation.end" ) #Add fields

query.add_constraint("chromosomeLocation", "OVERLAPS", ["I:1..4000"]) #Range constraint



#Insert first 10 rows of the query results into a dataframe and display the output!
rows = []
for row in query.rows(start=0,size=10):
    info = {'Organism':row[0], 'Chromosome identifier':row[1], 'Chromosome Start Location':row[2], 
            'Chromosome End Location':row[3]}
    rows.append(info)
genes_data = pd.DataFrame(rows, columns = ["Organism", "Chromosome identifier", "Chromosome Start Location", 
                                           "Chromosome End Location"])
genes_data

Unnamed: 0,Organism,Chromosome identifier,Chromosome Start Location,Chromosome End Location
0,C. elegans,I,1,3746
1,C. elegans,I,3747,3909
2,C. elegans,I,3747,3909
3,C. elegans,I,3747,3909
4,C. elegans,I,3910,4115
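
The last row of this output (3910 to 4115) extends past position 4000 but is still returned, because OVERLAPS only requires the two intervals to share at least one position. The sketch below reproduces that interval arithmetic locally for OVERLAPS and WITHIN. The functions `parse_range`, `overlaps`, and `within` are hypothetical helpers, and treating both ends of a range as inclusive is an assumption about how the server interprets ranges.

```python
import re


def parse_range(text):
    """Split a range such as 'I:1..4000' into (chromosome, start, end)."""
    match = re.fullmatch(r"([^:]+):(\d+)\.\.(\d+)", text)
    if match is None:
        raise ValueError(f"Unrecognised range: {text!r}")
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)


def overlaps(feature, target):
    # True when the two intervals share at least one position.
    return (feature[0] == target[0]
            and feature[1] <= target[2] and target[1] <= feature[2])


def within(feature, target):
    # True when the feature lies entirely inside the target range.
    return (feature[0] == target[0]
            and target[1] <= feature[1] and feature[2] <= target[2])


target = parse_range("I:1..4000")

# Distinct (chromosome, start, end) locations taken from the output above.
features = [("I", 1, 3746), ("I", 3747, 3909), ("I", 3910, 4115)]

print([overlaps(f, target) for f in features])
print([within(f, target) for f in features])
```

Under these assumptions all three locations overlap the range, but only the first two lie within it.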


### Some query examples and exploring some other functionalities

Let's query the WormMine database to extract the primaryIdentifier and symbol of all genes whose data is available, and connect them to the names and identifiers of their ontology terms. We then add a constraint that the homologue type of the results should be orthologue:

In [19]:
query=service.new_query("Gene")
query.select("primaryIdentifier","symbol", "ontologyAnnotations.ontologyTerm.name", 
             "ontologyAnnotations.ontologyTerm.identifier")
query.add_constraint("homologues.type","=","orthologue")

#Insert first 10 rows of the query results into a dataframe and display the output!
#Rows are collected in a list first because DataFrame.append was removed in pandas 2.0
records = []
for row in query.rows(start=0,size=10):
    records.append({'primaryIdentifier':row[0], 'symbol':row[1], 'Ontology Name':row[2], 'Ontology Identifier':row[3]})
genes_data = pd.DataFrame(records, columns = ["primaryIdentifier", "symbol", "Ontology Name", "Ontology Identifier"])
genes_data

Unnamed: 0,primaryIdentifier,symbol,Ontology Name,Ontology Identifier
0,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935
1,WBGene00000001,aap-1,dauer larval development,GO:0040024
2,WBGene00000001,aap-1,determination of adult lifespan,GO:0008340
3,WBGene00000001,aap-1,insulin receptor signaling pathway,GO:0008286
4,WBGene00000001,aap-1,kinase activity,GO:0016301
5,WBGene00000001,aap-1,phosphatidylinositol 3-kinase complex,GO:0005942
6,WBGene00000001,aap-1,phosphatidylinositol phosphorylation,GO:0046854
7,WBGene00000001,aap-1,phosphorylation,GO:0016310
8,WBGene00000001,aap-1,protein kinase binding,GO:0019901
9,WBGene00000001,aap-1,regulation of phosphatidylinositol 3-kinase ac...,GO:0043551


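It is possible to perform INNER and OUTER joins on the queries to get columns from related sets of data easily! Here we add the homologue type as a column of the previous query and join on homologues with an INNER join: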

In [20]:
query.add_view('homologues.type')
query.add_join("homologues", "INNER")

#Insert first 10 rows of the query results into a dataframe and display the output!
#Rows are collected in a list first because DataFrame.append was removed in pandas 2.0
records = []
for row in query.rows(start=0,size=10):
    records.append({'primaryIdentifier':row[0], 'symbol':row[1], 'Ontology Name':row[2], 'Ontology Identifier':row[3], 
                    'Homologue Type':row[4]})
genes_data = pd.DataFrame(records, columns = ["primaryIdentifier", "symbol", "Ontology Name", "Ontology Identifier",
                                               "Homologue Type"])
genes_data

Unnamed: 0,primaryIdentifier,symbol,Ontology Name,Ontology Identifier,Homologue Type
0,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
1,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
2,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
3,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
4,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
5,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
6,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
7,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
8,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
9,WBGene00000001,aap-1,1-phosphatidylinositol-3-kinase regulator acti...,GO:0046935,orthologue
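
The repeated rows above are what we would expect if every ontology annotation of a gene is paired with every one of its homologues. The sketch below illustrates that idea, and the general difference between INNER and OUTER joins, in plain Python. It is a conceptual illustration rather than how InterMine implements joins: the gene names, terms and the `join_rows` helper are hypothetical toy values, not WormMine results.

```python
# Toy data, hypothetical and for illustration only (not WormMine results)
annotations = {"gene-A": ["term-1", "term-2"], "gene-B": ["term-3"]}
homologues = {"gene-A": ["orthologue", "orthologue", "orthologue"]}  # gene-B has none

def join_rows(annotations, homologues, style):
    """Pair every annotation of a gene with every homologue of that gene."""
    rows = []
    for gene, terms in annotations.items():
        matches = homologues.get(gene, [])
        if not matches and style == "OUTER":
            # OUTER keeps the gene and leaves the joined column empty
            matches = [None]
        for term in terms:
            for homologue_type in matches:
                rows.append((gene, term, homologue_type))
    return rows

inner = join_rows(annotations, homologues, "INNER")
outer = join_rows(annotations, homologues, "OUTER")

print(len(inner), "rows with INNER")
print(len(outer), "rows with OUTER")
for row in outer:
    print(row)
```

In this toy case the INNER join drops gene-B because it has no homologue, while the OUTER join keeps it with an empty homologue column. Each term of gene-A is repeated once per homologue in both cases.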


##### Combinations of constraints and set logic

Let's query the organism name and gene symbol for all genes in the WormMine database. We then combine two constraints with a logical OR, so that a gene is returned if it satisfies either of them, as in the cell below:

In [21]:
query = service.new_query()
query.add_view("Gene.organism.name","Gene.symbol")

gene_is_ugt = query.add_constraint("Gene.symbol", "=", "ugt-59") #Add first binary constraint
gene_is_sgn = query.add_constraint("Gene.symbol", "=", "sgn-1") #Add second binary constraint
query.set_logic(gene_is_ugt | gene_is_sgn) #Logic OR on the 2 constraints

#Insert first 10 rows of the query results into a dataframe and display the output!
#Rows are collected in a list first because DataFrame.append was removed in pandas 2.0
records = []
for row in query.rows(start=0,size=10):
    records.append({'Gene.organism.name':row[0], 'Gene.symbol':row[1]})
genes_data = pd.DataFrame(records, columns = ["Gene.organism.name", "Gene.symbol"])
genes_data

Unnamed: 0,Gene.organism.name,Gene.symbol
0,Caenorhabditis elegans,sgn-1
1,Caenorhabditis elegans,ugt-59
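
To see why the OR matters here, it may help to treat each constraint as a yes/no test on a gene symbol. The sketch below does this in plain Python on three symbols that appear in this tutorial. It assumes that constraints are combined with AND when no logic is set, which is our understanding of the InterMine default; the function names are hypothetical.

```python
# Conceptual sketch: constraints as yes/no tests combined with AND / OR
symbols = ["aap-1", "sgn-1", "ugt-59"]  # gene symbols that appear in this tutorial

def gene_is_ugt(symbol):
    """Plays the role of constraint A: Gene.symbol = ugt-59"""
    return symbol == "ugt-59"

def gene_is_sgn(symbol):
    """Plays the role of constraint B: Gene.symbol = sgn-1"""
    return symbol == "sgn-1"

# A or B: a symbol passes if either test is true
matches_or = [s for s in symbols if gene_is_ugt(s) or gene_is_sgn(s)]

# A and B: a symbol would have to equal both values at once
matches_and = [s for s in symbols if gene_is_ugt(s) and gene_is_sgn(s)]

print("A or B :", matches_or)
print("A and B:", matches_and)
```

With AND, no symbol can match both values, so the result is empty. With OR, both sgn-1 and ugt-59 are kept, which mirrors the two rows returned above.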


### Writing query results to a file for later use

Since we use dataframes for storing the data, we can easily choose any rows or columns we want to retain based on simple conditions using the pandas library, and then write the dataframe to a CSV file for later use!

In [22]:
genes_data.to_csv('results.csv')
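
As a minimal sketch of the filtering and saving described above, the example below rebuilds the two-row result from the previous cell by hand, selects a subset, and shows what happens to the row index when the data is written and read back. An in-memory buffer stands in for a file such as 'results.csv'. The variable names `subset`, `with_index` and `without_index` are ours.

```python
import io
import pandas as pd

# The two rows from the output of the previous query, typed in by hand
genes_data = pd.DataFrame({"Gene.organism.name": ["Caenorhabditis elegans", "Caenorhabditis elegans"],
                           "Gene.symbol": ["sgn-1", "ugt-59"]})

# Keep only the rows and columns of interest before saving
subset = genes_data.loc[genes_data["Gene.symbol"] == "ugt-59", ["Gene.symbol"]]

# Default behaviour: the row index is written as an unnamed first column
buffer = io.StringIO()
genes_data.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False leaves the row index out of the file
buffer = io.StringIO()
genes_data.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```

When the file is read back, the saved index shows up as a column called `Unnamed: 0`. Passing `index=False` to `to_csv`, or `index_col=0` to `read_csv`, avoids that extra column.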

### Getting a readable XML serialisation of a query

In [23]:
query.to_xml()

'<query name="" model="genomic" view="Gene.organism.name Gene.symbol" sortOrder="Gene.organism.name asc" longDescription="" constraintLogic="A or B"><constraint path="Gene.symbol" op="=" code="A" value="ugt-59"/><constraint path="Gene.symbol" op="=" code="B" value="sgn-1"/></query>'
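
The XML comes back as a single long string. As a small sketch, Python's standard library can indent it and pull out the individual parts of the query. The string below is copied from the output above; the variable names are ours.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# XML string copied from the output of query.to_xml() above
xml_text = ('<query name="" model="genomic" view="Gene.organism.name Gene.symbol" '
            'sortOrder="Gene.organism.name asc" longDescription="" constraintLogic="A or B">'
            '<constraint path="Gene.symbol" op="=" code="A" value="ugt-59"/>'
            '<constraint path="Gene.symbol" op="=" code="B" value="sgn-1"/>'
            '</query>')

# Print an indented version that is easier to read
print(xml.dom.minidom.parseString(xml_text).toprettyxml(indent="  "))

# Pull out the pieces of the query
root = ET.fromstring(xml_text)
views = root.get("view").split()        # the output columns
logic = root.get("constraintLogic")     # how the constraints are combined
constraints = [(c.get("code"), c.get("path"), c.get("op"), c.get("value"))
               for c in root.findall("constraint")]

print(views)
print(logic)
for constraint in constraints:
    print(constraint)
```

The codes A and B in `constraintLogic` refer to the two constraints added earlier, in the order they were added.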

### Clearing the output of a query

In [24]:
query.clear_view()

This is the end of the tutorial on querying and extracting WormBase data from WormMine through intermine. This tutorial is influenced by the intermine tutorial notebooks at https://github.com/intermine/intermine-ws-python-docs

In the next tutorial, we will access the WormBase ParaSite data through their RESTful API.