Assignment 5 SPARQL queries
-----------------------------------------

I would like you to create the SPARQL query that will answer each of these questions.  Please submit the queries as a Jupyter notebook with the SPARQL kernel activated.  NO programming is required! Submit to GitHub as usual, WITH THE ANSWERS STILL VISIBLE IN THE NOTEBOOK.   Thanks!


UniProt SPARQL Endpoint:  https://sparql.uniprot.org/sparql  (note that you need to configure the endpoint to GET if you’re using YASGUI)


In [1]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

Q1: 1 POINT  How many protein records are in UniProt? 



In [8]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (count(?protein) as ?proteincount)

WHERE{ 
       ?protein a up:Protein 
 }

proteincount
378979161


Q2: 1 POINT How many Arabidopsis thaliana protein records are in UniProt? 

In [8]:
# Arabidopsis thaliana's taxon ID is 3702
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (count(?protein) as ?proteincount)

WHERE{ 
    ?protein a up:Protein;
            up:organism taxon:3702    
 } 

proteincount
136447


Q3: 1 POINT Retrieve pictures of Arabidopsis thaliana from UniProt.

In [9]:
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?image

WHERE{ 
    taxon:3702 foaf:depiction ?image 
 } 

image
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


Q4: 1 POINT:  What is the description of the enzyme activity of UniProt Protein Q9SZZ8?

In [2]:
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 

SELECT ?description 

WHERE{
    uniprotkb:Q9SZZ8 a up:Protein ;
                up:enzyme ?enzyme .
    ?enzyme up:activity ?activity .
    ?activity a up:Catalytic_Activity ;
                rdfs:label ?description 
 }

description
all-trans-beta-carotene + 4 H(+) + 2 O2 + 4 reduced [2Fe-2S]-[ferredoxin] = all-trans-zeaxanthin + 2 H2O + 4 oxidized [2Fe-2S]-[ferredoxin].


Q5: 1 POINT:  Retrieve the protein IDs and dates of submission for 5 proteins that have been added to UniProt this year   (HINT: Google for “SPARQL FILTER by date”)


In [6]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?date

WHERE{
    ?protein a up:Protein ;
               up:created ?date .
    FILTER (?date >= "2021-01-01"^^xsd:date && ?date < "2022-01-01"^^xsd:date) .
    BIND (REPLACE(STR(?protein), "http://purl.uniprot.org/uniprot/", "") as ?id) 
    
 }LIMIT 5

id,date
A0A7G7KUQ7,2021-02-10
A0A7G7L0K6,2021-02-10
A0A7U4S0U4,2021-06-02
A0A7U8RPA0,2021-06-02
A0A7U9G810,2021-06-02
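
The REPLACE call in the cell above treats its second argument as a regular expression, so the dots in the URI prefix match any character. A minimal alternative sketch, assuming the endpoint supports the SPARQL 1.1 STRAFTER function, extracts the accession with a plain string match instead. It keeps the same 2021 date range and has not been re-run here, so no output is shown.

```sparql
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?date

WHERE{
    ?protein a up:Protein ;
               up:created ?date .
    # Same date window as the cell above: entries created during 2021
    FILTER (?date >= "2021-01-01"^^xsd:date && ?date < "2022-01-01"^^xsd:date)
    # STRAFTER returns the part of the IRI string that follows the prefix, i.e. the accession
    BIND (STRAFTER(STR(?protein), "http://purl.uniprot.org/uniprot/") AS ?id)
 }LIMIT 5
```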


Q6: 1 POINT How many species are in the UniProt taxonomy?


In [7]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (count (DISTINCT ?taxon) as ?taxoncount)

WHERE{
    ?taxon a up:Taxon;
          up:rank up:Species .
 }

taxoncount
1995728


Q7: 2 POINTS  How many species have at least one protein record? (this might take a long time to execute, so do this one last!)


In [2]:
PREFIX up: <http://purl.uniprot.org/core/> 

SELECT (count(DISTINCT ?species) as ?speciescount)

WHERE{
    ?protein a up:Protein .
    ?protein up:organism ?species .
    ?species a up:Taxon .
    ?species up:rank up:Species 
 }

speciescount
1078469


Q8: 3 POINTS:  Find the AGI codes and gene names for all Arabidopsis thaliana proteins that have a protein function annotation description that mentions “pattern formation”

In [18]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?code ?gene_name

WHERE{
    
    ?protein a up:Protein ;
        up:organism taxon:3702 ;
        up:encodedBy ?gene ;
        up:annotation ?function_annotation .
    
    ?gene a up:Gene ;
        up:locusName ?code ;
        skos:prefLabel ?gene_name .
    
    ?function_annotation a up:Function_Annotation ;
                      rdfs:comment ?function_description .
    FILTER REGEX (?function_description, "pattern formation", "i") .

 }

code,gene_name
At1g13980,GN
At3g02130,RPK2
At1g69270,RPK1
At5g37800,RSL1
At1g26830,CUL3A
At1g66470,RHD6
At3g09090,DEX1
At5g55250,IAMT1
At1g63700,YDA
At4g21750,ATML1


From the MetaNetX metabolic networks for metagenomics database SPARQL Endpoint: https://rdf.metanetx.org/sparql
(this slide deck will make it much easier for you!  https://www.metanetx.org/cgi-bin/mnxget/mnxref/MetaNetX_RDF_schema.pdf)


Q9: 4 POINTS:  What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79?


In [24]:
%endpoint https://rdf.metanetx.org/sparql

PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?id

WHERE{
    ?pept a mnx:PEPT ;
          mnx:peptXref uniprotkb:Q18A79 .
    ?cata a mnx:CATA ;
          mnx:pept ?pept .
    ?gpr a mnx:GPR ;
         mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac a mnx:REAC ;
          mnx:mnxr ?mnxr .
    ?mnxr rdfs:label ?id .
}

id
MNXR165934
MNXR145046


FEDERATED QUERY - UniProt and MetaNetX

Q10: 5 POINTS:  What is the official locus name, and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “starch synthase” catalytic activity in Clostridium difficile (taxon 272563)?   (This must be executed on the https://rdf.metanetx.org/sparql endpoint.)

In [26]:
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT DISTINCT ?locus ?MetaNetXID

WHERE{
  service <http://sparql.uniprot.org/sparql> {
    ?protein a up:Protein .
    ?protein up:organism taxon:272563 .
    ?protein up:mnemonic ?locus .
    ?protein up:classifiedWith ?goTerm .
    ?goTerm rdfs:label ?activity .
    filter contains(?activity, "starch synthase")
    bind (substr(str(?protein),33) as ?ac)
    bind (IRI(CONCAT("http://purl.uniprot.org/uniprot/",?ac)) as ?proteinRef)
  }
  service <https://rdf.metanetx.org/sparql> {
    ?pept mnx:peptXref ?proteinRef .
    ?cata mnx:pept ?pept .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac rdfs:label ?MetaNetXID .
  }
} 

locus,MetaNetXID
GLGA_CLOD6,mnxr165934
GLGA_CLOD6,mnxr145046c3
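
Inside the UniProt SERVICE block above, the two BIND lines strip the accession from ?protein and then rebuild the same IRI, so ?proteinRef should equal ?protein. A simplified sketch under that assumption joins on ?protein directly. It also assumes the https form of the UniProt endpoint (as in the first cell) and drops the inner SERVICE wrapper for MetaNetX, since the query is already executed on that endpoint. It has not been run against the live endpoints here, so treat it as a starting point rather than a verified answer.

```sparql
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT DISTINCT ?locus ?MetaNetXID

WHERE{
  # Remote part: find the protein and its mnemonic in UniProt
  SERVICE <https://sparql.uniprot.org/sparql> {
    ?protein a up:Protein ;
             up:organism taxon:272563 ;
             up:mnemonic ?locus ;
             up:classifiedWith ?goTerm .
    ?goTerm rdfs:label ?activity .
    FILTER CONTAINS(?activity, "starch synthase")
  }
  # Local part: follow the MetaNetX cross-reference from the same protein IRI
  ?pept mnx:peptXref ?protein .
  ?cata mnx:pept ?pept .
  ?gpr mnx:cata ?cata ;
       mnx:reac ?reac .
  ?reac rdfs:label ?MetaNetXID .
}
```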
