## Assignment 5: SPARQL queries

In [1]:
# Setting the UniProt SPARQL endpoint
%endpoint https://sparql.uniprot.org/sparql 
# Setting the JSON format
%format JSON                                

Exercise 1: How many protein records are in UniProt?

In [2]:
# Setting the UniProt core prefix
PREFIX up:<http://purl.uniprot.org/core/>
# Counting the number of protein records
SELECT (COUNT(?protein) as ?number_of_proteins)
WHERE
{
    # Selecting all protein entries
    ?protein a up:Protein .
}

number_of_proteins
360157660


Exercise 2: How many Arabidopsis thaliana protein records are in UniProt? 

In [17]:
PREFIX up:<http://purl.uniprot.org/core/>
# Adding the taxon prefix
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
# Counting the number of protein records
SELECT (COUNT(?protein) as ?number_of_proteins)
WHERE
{
    # Selecting the protein records of Arabidopsis thaliana by its taxon identifier
    ?protein a up:Protein ;
               up:organism taxon:3702 .
}

number_of_proteins
136782


Exercise 3: Retrieve pictures of Arabidopsis thaliana from UniProt

In [18]:
PREFIX up: <http://purl.uniprot.org/core/>
# Adding the FOAF prefix
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?link_to_picture
WHERE
{
    # Selecting the picture
  ?taxon foaf:depiction ?link_to_picture .
    # Filtering by 'Arabidopsis thaliana'
  ?taxon up:scientificName "Arabidopsis thaliana" .
}

link_to_picture
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


Exercise 4: What is the description of the enzyme activity of UniProt protein Q9SZZ8?

In [19]:
PREFIX up:<http://purl.uniprot.org/core/> 
# Adding the prefix for UniProt protein entries
PREFIX uniprot:<http://purl.uniprot.org/uniprot/> 
# Adding the RDF Schema prefix (World Wide Web Consortium)
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
SELECT ?catalytic_activity
WHERE
{
    # Selecting the Q9SZZ8 protein
    uniprot:Q9SZZ8 a up:Protein ;
        up:enzyme ?enzyme.
    # Selecting the molecular reaction involved in the catalytic activity of the protein
    ?enzyme up:activity ?activity .
        ?activity rdfs:label ?catalytic_activity .
}

catalytic_activity
Beta-carotene + 4 reduced ferredoxin [iron-sulfur] cluster + 2 H(+) + 2 O(2) = zeaxanthin + 4 oxidized ferredoxin [iron-sulfur] cluster + 2 H(2)O.


Exercise 5: Retrieve the protein IDs and dates of submission for proteins that have been added to UniProt this year (HINT: Google for “SPARQL FILTER by date”)

In [2]:
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
SELECT ?id ?date_of_submission
WHERE 
{ 
    ?protein a up:Protein ;
        up:created ?date_of_submission .
    # Keeping only entries created on or after 2021-01-01
    FILTER ( ?date_of_submission >= "2021-01-01"^^xsd:date) .
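    # Converting the IRI of the protein entry into the protein ID
    # (the namespace http://purl.uniprot.org/uniprot/ is 32 characters long, so the ID starts at position 33)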
    BIND (SUBSTR(STR(?protein),33) AS ?id) .
} 

id,date_of_submission
A0A1H7ADE3,2021-06-02
A0A1V1AIL4,2021-06-02
A0A2Z0L603,2021-06-02
A0A4J5GG53,2021-04-07
A0A6G8SU52,2021-02-10
A0A6G8SU69,2021-02-10
A0A7C9JLR7,2021-02-10
A0A7C9JMZ7,2021-02-10
A0A7C9KUQ4,2021-02-10
A0A7D4HP61,2021-02-10
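
The offset 33 in the SUBSTR call above follows from the length of the protein namespace. The Python sketch below, run outside the SPARQL kernel, only illustrates that arithmetic; `UNIPROT_NS`, `accession_from_iri` and `iri_from_accession` are hypothetical names introduced here and are not part of UniProt or the notebook.

```python
# Namespace of UniProt protein entry IRIs, as used in the PREFIX lines of this notebook.
UNIPROT_NS = "http://purl.uniprot.org/uniprot/"

# SPARQL's SUBSTR counts from 1, so the first character after the
# namespace sits at position len(namespace) + 1.
sparql_start = len(UNIPROT_NS) + 1


def accession_from_iri(iri: str) -> str:
    """Hypothetical helper: same result as SUBSTR(STR(?protein), 33)."""
    if not iri.startswith(UNIPROT_NS):
        raise ValueError(f"not a UniProt protein IRI: {iri}")
    # Python slices count from 0, hence len(UNIPROT_NS) rather than 33.
    return iri[len(UNIPROT_NS):]


def iri_from_accession(accession: str) -> str:
    """Hypothetical helper: the reverse step, rebuilding the IRI from the ID."""
    return UNIPROT_NS + accession


if __name__ == "__main__":
    print(sparql_start)
    print(accession_from_iri("http://purl.uniprot.org/uniprot/A0A1H7ADE3"))
    print(iri_from_accession("Q18A79"))
```

Deriving the offset from the namespace string avoids a hard-coded number that would silently break if the namespace ever changed.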


Exercise 6: How many species are in the UniProt taxonomy?

In [13]:
PREFIX up:<http://purl.uniprot.org/core/>
SELECT (COUNT(?taxon) AS ?uniprot_species)
WHERE
{
    # Selecting all taxa whose rank is species
    ?taxon a up:Taxon ;
             up:rank up:Species .
}

uniprot_species
2029846


Exercise 7: How many species have at least one protein record? (this might take a long time to execute, so do this one last!)

In [14]:
PREFIX up: <http://purl.uniprot.org/core/>
# Selecting only unique results
SELECT (COUNT (DISTINCT ?taxon) AS ?species_with_record)
WHERE
{
    # Selecting organisms with protein records
  ?protein a up:Protein;
           up:organism ?taxon .
  ?taxon a up:Taxon;
          up:rank up:Species .
}

species_with_record
1057158


Exercise 8: Find the AGI codes and gene names for all Arabidopsis thaliana proteins that have a protein function annotation description that mentions “pattern formation”

In [2]:
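PREFIX up: <http://purl.uniprot.org/core/>
# Declaring the SKOS and RDF Schema prefixes used below
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>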
SELECT DISTINCT ?agi ?gene_name
WHERE
{
    # Selecting annotations for Arabidopsis thaliana protein records
    ?protein a up:Protein .
    ?protein up:organism ?organism .
    ?organism up:scientificName "Arabidopsis thaliana" .
    ?protein up:encodedBy ?gene .
    ?gene skos:prefLabel ?gene_name .
    ?gene up:locusName ?agi .
    ?protein up:annotation ?annotation .
        ?annotation rdfs:comment ?functions .
    # Searching for functions in which appears 'pattern formation'
        FILTER (CONTAINS(?functions, "pattern formation"))
}

agi,gene_name
At3g54220,SCR
At4g21750,ATML1
At1g13980,GN
At5g40260,SWEET8
At1g69670,CUL3B
At1g63700,YDA
At2g46710,ROPGAP3
At1g26830,CUL3A
At1g55325,MED13
At3g09090,DEX1


Exercise 9: What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79?

In [5]:
# Setting the MetaNetX endpoint
%endpoint https://rdf.metanetx.org/sparql
# Setting the JSON format
%format JSON

In [6]:
# Adding the MetaNetX schema prefix
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX up: <http://purl.uniprot.org/uniprot/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?metanex_id
WHERE{
    # Selecting Q18A79
    ?pept mnx:peptXref up:Q18A79 .
    # Getting the reaction identifiers that correspond to Q18A79
    ?cata mnx:pept ?pept .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac rdfs:label ?metanex_id .
}

metanex_id
mnxr165934
mnxr145046c3


Exercise 10: What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563)?

In [7]:
# Setting the UniProt endpoint
%endpoint https://sparql.uniprot.org/sparql
# Setting the JSON format
%format JSON

In [10]:
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
SELECT DISTINCT ?official_id ?metanex_id
WHERE
{
    # Searching UniProt for the C. difficile protein classified with a GO term whose label contains "starch synthase"
  service <http://sparql.uniprot.org/sparql> {
    ?protein a up:Protein .
    ?protein up:organism taxon:272563 .
    ?protein up:mnemonic ?official_id .
    ?protein up:classifiedWith ?go_term .
    ?go_term rdfs:label ?activity .
    FILTER contains(?activity, "starch synthase")
    BIND (SUBSTR(STR(?protein),33) as ?ac)
    BIND (IRI(CONCAT("http://purl.uniprot.org/uniprot/",?ac)) as ?reference)
  }
  service <https://rdf.metanetx.org/sparql> {
      # Getting the reaction identifiers that correspond to the protein with starch synthase catalytic activity in C. difficile
    ?pept mnx:peptXref ?reference .
    ?cata mnx:pept ?pept .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac rdfs:label ?metanex_id .
  }
} 

official_id,metanex_id
GLGA_CLOD6,mnxr165934
GLGA_CLOD6,mnxr145046c3
