## Assignment 5 SPARQL queries

In [38]:
## Set endpoint
%endpoint https://sparql.uniprot.org/sparql
%format JSON
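
For readers working outside the notebook, the two magics above correspond roughly to choosing the endpoint URL and asking for JSON results. The sketch below builds (but does not send) such a request with the Python standard library. It assumes the endpoint accepts standard SPARQL Protocol GET requests; `build_sparql_request` is a hypothetical helper name, and the sample query is only a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint URL taken from the %endpoint line above.
ENDPOINT = "https://sparql.uniprot.org/sparql"

def build_sparql_request(query, endpoint=ENDPOINT):
    """Build, without sending, a SPARQL Protocol GET request for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    # The Accept header plays the role of the %format JSON magic.
    return Request(url, headers={"Accept": "application/sparql-results+json"})

# Placeholder query, used only to show the resulting URL.
request = build_sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.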

#### Q1:

Q: How many protein records are in UniProt?

A: This SPARQL query counts the resources typed as core:Protein in the UniProt core namespace. It uses the COUNT function and binds the result to the variable ?Protein_Records.

In [3]:
PREFIX core:<http://purl.uniprot.org/core/> 

SELECT (COUNT(?protein) AS ?Protein_Records) 
WHERE{ 
        ?protein a core:Protein .
}

Protein_Records
378979161
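
Because the notebook requests JSON, a count like this arrives in the W3C SPARQL JSON results layout. As a minimal sketch of reading it outside the notebook, the Python below parses such a response. The `sample_response` document is hand-written for illustration (only the value 378979161 comes from the output above), and `first_binding_value` is a hypothetical helper name.

```python
import json

# Hand-written sample in the SPARQL 1.1 JSON results layout (an assumption
# for illustration); only the count itself is taken from the output above.
sample_response = """
{
  "head": {"vars": ["Protein_Records"]},
  "results": {"bindings": [
    {"Protein_Records": {
      "type": "literal",
      "datatype": "http://www.w3.org/2001/XMLSchema#integer",
      "value": "378979161"}}
  ]}
}
"""

def first_binding_value(response_text, variable):
    """Return the string value bound to `variable` in the first result row."""
    document = json.loads(response_text)
    bindings = document["results"]["bindings"]
    if not bindings:
        raise ValueError("query returned no rows")
    return bindings[0][variable]["value"]

# Values are always delivered as strings, so convert explicitly.
protein_records = int(first_binding_value(sample_response, "Protein_Records"))
print(protein_records)
```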


#### Q2:
Q: How many Arabidopsis thaliana protein records are in UniProt? 

A: This SPARQL query counts the distinct proteins from the organism Arabidopsis thaliana (taxon ID 3702). It uses the core and taxonomy prefixes from UniProt to reach the protein class and the taxon.

In [4]:
PREFIX core:<http://purl.uniprot.org/core/> 
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>

SELECT (COUNT(DISTINCT ?protein) AS ?A_Thaliana_Records)
WHERE{ 
        ?protein a core:Protein .         
        ?protein core:organism taxon:3702 .
}


A_Thaliana_Records
136447


#### Q3:
Q: Retrieve pictures of Arabidopsis thaliana from UniProt.

A: This SPARQL query retrieves the image URIs attached to any taxon whose scientific name contains 'Arabidopsis thaliana'. It uses the UniProt core ontology (up:) for the scientific name and the Friend of a Friend vocabulary (foaf:) for the depiction.

In [5]:

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX foaf:<http://xmlns.com/foaf/0.1/>

SELECT ?uri
WHERE {
       ?taxon    foaf:depiction  ?uri .
       ?taxon    up:scientificName   ?Species .
       FILTER(CONTAINS(?Species, "Arabidopsis thaliana"))
}

uri
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


#### Q4:
Q: What is the description of the enzyme activity of UniProt Protein Q9SZZ8?

A: This SPARQL query selects the enzyme and the activity description for the protein with UniProt ID Q9SZZ8. It first finds the enzyme associated with the UniProt ID, then looks for its activity, and finally obtains the label of that activity as its description.

In [7]:
PREFIX core:<http://purl.uniprot.org/core/> 
PREFIX uniprot:<http://purl.uniprot.org/uniprot/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?description ?enzyme
WHERE {
  uniprot:Q9SZZ8 a core:Protein ;          
                   core:enzyme ?enzyme .
  ?enzyme core:activity ?activity .        
  ?activity rdfs:label ?description
}

description,enzyme
all-trans-beta-carotene + 4 H(+) + 2 O2 + 4 reduced [2Fe-2S]-[ferredoxin] = all-trans-zeaxanthin + 2 H2O + 4 oxidized [2Fe-2S]-[ferredoxin].,http://purl.uniprot.org/enzyme/1.14.15.24


#### Q5:
Q: Retrieve the protein IDs, and date of submission, for 5 proteins that have been added to UniProt this year (HINT Google for “SPARQL FILTER by date”)

A: This SPARQL query looks for proteins created on or after 1 January of the target year and returns the mnemonic ID and creation date of each. The results are limited to the first 5. Two cutoffs are tried below: 2023-01-01 first, then 2022-01-01.

This explanation helped: https://stackoverflow.com/questions/24051435/filter-by-date-range-in-sparql

In [9]:
PREFIX core:<http://purl.uniprot.org/core/> 
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?submitted_date
WHERE{
  ?protein a core:Protein .
  ?protein core:mnemonic ?id .
  ?protein core:created ?submitted_date .
  # A date-only literal is typed xsd:date; xsd:dateTime requires a time part
  FILTER (?submitted_date >= "2023-01-01"^^xsd:date)
} LIMIT 5


id,submitted_date


Zero results for 2023. I think this assignment was originally due in 2022, so here is the same query with a 2022-01-01 cutoff:

In [10]:
PREFIX core:<http://purl.uniprot.org/core/> 
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?submitted_date
WHERE{
  ?protein a core:Protein .
  ?protein core:mnemonic ?id .
  ?protein core:created ?submitted_date .
  # A date-only literal is typed xsd:date; xsd:dateTime requires a time part
  FILTER (?submitted_date >= "2022-01-01"^^xsd:date)
} LIMIT 5

id,submitted_date
A0A8E0N8L5_ECOLX,2022-01-19
A0A8F9CQZ7_ECOLX,2022-01-19
A0A8F9ICG9_ECOLX,2022-01-19
A0A8F8WH98_PSEAI,2022-01-19
A0A8F9NZK3_PSEAI,2022-01-19
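
The cutoff year was edited by hand between the two attempts above. A minimal sketch of generating the same query text for any year is given below in Python. `recent_proteins_query` is a hypothetical helper, and the sketch assumes `core:created` holds `xsd:date` values, as the date-only results suggest.

```python
from datetime import date

# Query text mirrors the Q5 cell; doubled braces are literal braces in
# str.format. The xsd:date typing is an assumption based on the results.
QUERY_TEMPLATE = """PREFIX core: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?submitted_date
WHERE {{
  ?protein a core:Protein .
  ?protein core:mnemonic ?id .
  ?protein core:created ?submitted_date .
  FILTER (?submitted_date >= "{cutoff}"^^xsd:date)
}} LIMIT {limit}
"""

def recent_proteins_query(year, limit=5):
    """Return the Q5 query text with 1 January of `year` as the cutoff."""
    cutoff = date(year, 1, 1).isoformat()
    return QUERY_TEMPLATE.format(cutoff=cutoff, limit=int(limit))

query_2022 = recent_proteins_query(2022)
print(query_2022)
```

Calling `recent_proteins_query(date.today().year)` would follow the "this year" wording of the question without hard-coding the year.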


#### Q6:
Q: How many species are in the UniProt taxonomy?

A: This SPARQL query counts the distinct resources of class core:Taxon that have a core:rank of core:Species. It gives an overview of how many species the UniProt taxonomy contains.

In [12]:
PREFIX core:<http://purl.uniprot.org/core/> 

SELECT (COUNT(DISTINCT ?species) AS ?Count)
WHERE{
  ?species a core:Taxon .
  ?species core:rank core:Species
}

Count
1995728


#### Q7:
Q: How many species have at least one protein record? (this might take a long time to execute, so do this one last!)

A: This query counts the distinct species that have at least one protein record. It matches every core:Protein, follows core:organism to the associated taxon, and keeps only taxa of class core:Taxon with a core:rank of core:Species. The resulting count is bound to the variable ?Count.

In [14]:
PREFIX core: <http://purl.uniprot.org/core/>

SELECT (COUNT(DISTINCT ?species) AS ?Count)
WHERE 
{
    ?protein a core:Protein .
    ?protein core:organism ?species .
    ?species a core:Taxon . 
    ?species core:rank core:Species . 
}

Count
1078469


#### Q8:

Q: Find the AGI codes and gene names for all Arabidopsis thaliana proteins that have a protein function annotation description that mentions “pattern formation”

A: This SPARQL query retrieves the AGI code and gene name for Arabidopsis thaliana proteins whose function annotation mentions pattern formation. It restricts the match to proteins from Arabidopsis thaliana that have a function annotation, checks whether the annotation text contains the phrase "pattern formation", and then retrieves the AGI code and gene name from the gene that encodes each protein.

In [37]:
PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 
PREFIX core:<http://purl.uniprot.org/core/> 
PREFIX taxon:<http://purl.uniprot.org/taxonomy/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?AGI_ID ?gene_name
WHERE{ 
    ?protein core:organism taxon:3702 .      # From A. thaliana
    ?protein a core:Protein .                # Is a protein
    ?protein core:annotation ?annotation .   # Has an annotation
    ?annotation a core:Function_Annotation . # That is a function annotation
    ?annotation rdfs:comment ?description .  # With a description
    ?protein core:encodedBy ?gene .          # Find the gene encoding it
    ?gene core:locusName ?AGI_ID .           # Get its AGI code
    ?gene skos:prefLabel ?gene_name .        # Get its name

    FILTER CONTAINS(?description, "pattern formation")
}

AGI_ID,gene_name
At1g13980,GN
At3g02130,RPK2
At1g69270,RPK1
At5g37800,RSL1
At1g26830,CUL3A
At1g66470,RHD6
At3g09090,DEX1
At5g55250,IAMT1
At1g63700,YDA
At4g21750,ATML1


#### Q9:
Q: What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79?

A: This query finds the MetaNetX reactions catalyzed by the protein with UniProt ID Q18A79. The meta and uniprot prefixes define the namespaces used in the query. The meta:peptXref predicate locates the MetaNetX peptide that cross-references the UniProt entry, and the meta:pept, meta:cata and meta:reac predicates lead from that peptide to the reactions it catalyzes. Finally, rdfs:label gives each reaction's identifier, and the filter keeps only those containing "mnxr".

In [28]:
%endpoint https://rdf.metanetx.org/sparql 

In [29]:
PREFIX meta: <https://rdf.metanetx.org/schema/>
PREFIX uniprot: <http://purl.uniprot.org/uniprot/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?pept ?react_ID
WHERE{
    ?pept meta:peptXref uniprot:Q18A79 . 
    ?catalyzes meta:pept ?pept .
    ?gpr meta:cata ?catalyzes ; meta:reac ?reaction .
    ?reaction rdfs:label ?react_ID . 
    FILTER CONTAINS(?react_ID, 'mnxr')
}

pept,react_ID
https://rdf.metanetx.org/pept/GLGA_CLOD6,mnxr165934
https://rdf.metanetx.org/pept/GLGA_CLOD6,mnxr145046c3


### FEDERATED QUERY - UniProt and MetaNetX

#### Q10:
Q: What is the official locus name, and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “glycine reductase” catalytic activity in Clostridium difficile (taxon 272563)? (This must be executed on the https://rdf.metanetx.org/sparql endpoint.)

A: This is a federated SPARQL query that combines data from two sources (UniProt and MetaNetX). The UniProt sub-query finds the proteins of taxon 272563 that are classified with a term whose label contains "glycine reductase", together with their locus names. The MetaNetX sub-query maps proteins to the labels of the reactions they catalyze. The two result sets are joined on ?protein.

In [30]:
%endpoint https://rdf.metanetx.org/sparql  

In [34]:
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX up: <http://purl.uniprot.org/core/>


SELECT ?protein ?loc_ID ?react_ID
WHERE {
    ## UniProt part
    SERVICE <http://sparql.uniprot.org/sparql> {
        SELECT DISTINCT ?loc_ID ?protein
        WHERE {
            ?protein rdf:type up:Protein ;
                     up:organism taxon:272563 ;
                     up:encodedBy ?gene ;
                     up:classifiedWith ?term .
            ?gene up:locusName ?loc_ID .
            ?term rdfs:label ?info .
            FILTER (CONTAINS(?info, "glycine reductase"))
        }
    }

    ## MetaNetX part
    SERVICE <https://rdf.metanetx.org/sparql> {
        SELECT DISTINCT ?protein ?react_ID
        WHERE {
            ?prot mnx:peptXref ?protein .
            ?catalyst rdf:type mnx:CATA ;
                      mnx:pept ?prot .
            ?gpr mnx:cata ?catalyst ;
                 mnx:reac ?reaction .
            ?reaction rdfs:label ?react_ID .
        }
    }
}

protein,loc_ID,react_ID
http://purl.uniprot.org/uniprot/Q185M4,CD630_23490,mnxr157884c3
http://purl.uniprot.org/uniprot/Q185M4,CD630_23490,mnxr162774c3
http://purl.uniprot.org/uniprot/Q185M6,CD630_23520,mnxr157884c3
http://purl.uniprot.org/uniprot/Q185M6,CD630_23520,mnxr162774c3
http://purl.uniprot.org/uniprot/Q185M3,CD630_23510,mnxr157884c3
http://purl.uniprot.org/uniprot/Q185M3,CD630_23510,mnxr162774c3
http://purl.uniprot.org/uniprot/Q185M5,CD630_23540,mnxr157884c3
http://purl.uniprot.org/uniprot/Q185M1,CD630_23480,mnxr157884c3
http://purl.uniprot.org/uniprot/Q185M1,CD630_23480,mnxr162774c3
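
The result has one row per protein and reaction pair, so most loci appear twice. As a small sketch for reading it, the Python below groups the rows by locus name. The rows are copied from the output above (accessions shortened to their final path segment), and `reactions_by_locus` is a hypothetical helper.

```python
from collections import defaultdict

# (UniProt accession, locus name, MetaNetX reaction label), from the output above.
rows = [
    ("Q185M4", "CD630_23490", "mnxr157884c3"),
    ("Q185M4", "CD630_23490", "mnxr162774c3"),
    ("Q185M6", "CD630_23520", "mnxr157884c3"),
    ("Q185M6", "CD630_23520", "mnxr162774c3"),
    ("Q185M3", "CD630_23510", "mnxr157884c3"),
    ("Q185M3", "CD630_23510", "mnxr162774c3"),
    ("Q185M5", "CD630_23540", "mnxr157884c3"),
    ("Q185M1", "CD630_23480", "mnxr157884c3"),
    ("Q185M1", "CD630_23480", "mnxr162774c3"),
]

def reactions_by_locus(result_rows):
    """Map each locus name to its reaction labels, in order of first appearance."""
    grouped = defaultdict(list)
    for _accession, locus, reaction in result_rows:
        if reaction not in grouped[locus]:
            grouped[locus].append(reaction)
    return dict(grouped)

for locus, reactions in sorted(reactions_by_locus(rows).items()):
    print(locus, ", ".join(reactions))
```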


### Sources:
1. https://www.metanetx.org/cgi-bin/mnxget/mnxref/MetaNetX_RDF_schema.pdf
2. https://stackoverflow.com/questions/24051435/filter-by-date-range-in-sparql