# Assignment 5 SPARQL queries

## Uniprot

Q1: 1 POINT  How many protein records are in UniProt? 


In [11]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>

SELECT
    (COUNT(DISTINCT(?protein)) AS ?number_prots)
WHERE
{   
    ?protein a up:Protein .
}

number_prots
360157660


Q2: 1 POINT How many Arabidopsis thaliana protein records are in UniProt? 

In [1]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT(DISTINCT(?protein)) AS ?Arabidopsis_proteins)
WHERE
{
    ?protein  a up:Protein .
  	?protein up:organism taxon:3702 .
}

Arabidopsis_proteins
136782


Q3: 1 POINT retrieve pictures of Arabidopsis thaliana from UniProt? 

In [2]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX foaf:<http://xmlns.com/foaf/0.1/>

SELECT ?image
WHERE{ 
    ?sp up:rank up:Species ;
		up:scientificName ?name .
  FILTER(?name = "Arabidopsis thaliana")
  ?sp  foaf:depiction  ?image .
}


image
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


Q4: 1 POINT:  What is the description of the enzyme activity of UniProt Protein Q9SZZ8

In [5]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>

SELECT ?description
WHERE
{
    uniprotkb:Q9SZZ8 up:enzyme ?enzyme .
    ?enzyme up:activity ?activity .
    ?activity rdfs:label ?description .
}


description
Beta-carotene + 4 reduced ferredoxin [iron-sulfur] cluster + 2 H(+) + 2 O(2) = zeaxanthin + 4 oxidized ferredoxin [iron-sulfur] cluster + 2 H(2)O.


Q5: 1 POINT:  Retrieve the proteins ids, and date of submission, for proteins that have been added to UniProt this year   (HINT Google for “SPARQL FILTER by date”)

In [2]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON


PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?date
WHERE
{
		?protein a up:Protein ;
             up:created ?date .
                   
		
 		FILTER (?date > "2021-01-18"^^xsd:dateTime)
  		BIND (SUBSTR(STR(?protein),33) AS ?id)
}



Using the SPARQL kernel it doesn't return anything but in the uniprot endpoint it gives a result of 37880749 proteins.

Q6: 1 POINT How  many species are in the UniProt taxonomy?

In [1]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT(DISTINCT(?species)) AS ?number_species)
WHERE
{
    ?species up:rank up:Species
}


number_species
2029846


Q7: 2 POINT  How many species have at least one protein record? (this might take a long time to execute, so do this one last!)


In [10]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON


PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT(DISTINCT(?species)) AS ?number_of_species)
WHERE
{
    ?protein a up:Protein;
         up:organism ?species .
}


number_of_species
1266002


Q8: 3 points:  find the AGI codes and gene names for all Arabidopsis thaliana  proteins that have a protein function annotation description that mentions “pattern formation”

In [4]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON


PREFIX up: <http://purl.uniprot.org/core/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT ?gene_name ?agi
WHERE
{
    ?protein a up:Protein ;
             up:organism taxon:3702 ;
            up:annotation ?annotation .
    ?annotation rdf:type up:Function_Annotation ;
            rdfs:comment ?function .
FILTER(CONTAINS(?function, "pattern formation")) .
    ?protein up:encodedBy ?gene .
    ?gene skos:prefLabel ?gene_name ;
            up:locusName ?agi .
}


gene_name,agi
SCR,At3g54220
GN,At1g13980
ATML1,At4g21750
SWEET8,At5g40260
CUL3B,At1g69670
YDA,At1g63700
ROPGAP3,At2g46710
CUL3A,At1g26830
DEX1,At3g09090
SHR,At4g37650


## MetaNetX

Q9: 4 POINTS:  what is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79

In [6]:
%endpoint https://rdf.metanetx.org/sparql
%format JSON

PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/uniprot/>

SELECT DISTINCT ?reac_label 
WHERE{
    ?pept mnx:peptXref up:Q18A79 .
    ?cata mnx:pept ?pept ;
          rdfs:label ?cata_label .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac rdfs:label ?reac_label .
}


reac_label
mnxr165934
mnxr145046c3


## FEDERATED QUERY - UniProt and MetaNetX

Q10: 5 POINTS:  What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563).

In [9]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?gene_id ?reac_label 
WHERE{
  ?protein a up:Protein ;
           up:organism taxon:3702 ;
         up:annotation ?annotation .
  ?annotation up:catalyticActivity ?ca .
  ?ca up:enzymeClass ?enzyme . 
  ?enzyme skos:prefLabel ?description .
  FILTER(CONTAINS(?description, "Starch synthase")).
    
  ?protein up:mnemonic ?gene_id .

  SERVICE <https://rdf.metanetx.org/sparql>{
    ?pept mnx:peptXref ?protein .
    ?cata mnx:pept ?pept ;
          rdfs:label ?cata_label .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac rdfs:label ?reac_label .
}
}


gene_id,reac_label
SSY4_ARATH,mnxr165934
SSY3_ARATH,mnxr165934
SSY1_ARATH,mnxr165934
SSY2_ARATH,mnxr165934
