# Required Python libraries

In [2]:
from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, JSON

# Basic information

This notebook aims to show you basic information about UniProtKB entries:  
- identifier
- date
- names

## Entry identifier

Each UniProt entry is identified by a [primary accession](https://www.uniprot.org/help/accession_numbers).
This is the best way to access an entry.  
In the RDF format the primary accession is part of the IRI identifying an entry.



In [3]:
entry=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
<O22340> rdf:type up:Protein ;
         up:reviewed true ;
         up:created "2001-10-24"^^xsd:date ;
         up:modified "2015-04-01"^^xsd:date ;
         up:version 86 ;
         up:mnemonic "TPSDA_ABIGR" ;
         up:oldMnemonic "TPSD3_ABIGR" ,
                        "TSD3_ABIGR" ;
         up:replaces <Q94FV9> ;
         up:sequence isoform:O22340-1 .
isoform:O22340-1 rdf:type up:Simple_Sequence ;
                 up:modified "1998-01-01"^^xsd:date ;
                 up:version 1 .""")

In [4]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/> 
SELECT ?protein
WHERE {
  ?protein a up:Protein .
}""")

for row in qres:
    print("UniProt entry URI = %s" % row)

UniProt entry URI = http://purl.uniprot.org/uniprot/O22340


### Extracting a primary accession from an IRI

This is easy enough with some string manipulation.  
While UniProt primary accessions are unique within UniProtKB, they may be reused, by accident or intentionally, by other data sources. If we provided them as plain strings (not IRIs) and you used them that way in a query, you might accidentally retrieve completely wrong records.  


In [5]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/> 
SELECT ?primaryAccession
       ?protein
WHERE {
  ?protein a up:Protein .
  BIND(substr(str(?protein), strlen(str(uniprotkb:))+1) AS ?primaryAccession)
}""")

for row in qres:
    print("'%s' is the PrimaryAccession of %s" % row)

'O22340' is the PrimaryAccession of http://purl.uniprot.org/uniprot/O22340
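
The same extraction can also be done on the Python side once the IRI is in hand. The sketch below is only an illustration: the helper name `primary_accession` and the constant `UNIPROTKB_NS` are made up for this example and are not part of rdflib or of any UniProt library. It rejects IRIs from other namespaces, which is the kind of mix-up described above.

```python
# Hypothetical helper for illustration (not part of rdflib or UniProt tooling).
UNIPROTKB_NS = "http://purl.uniprot.org/uniprot/"


def primary_accession(iri):
    """Return the primary accession encoded in a UniProtKB entry IRI."""
    iri = str(iri)  # also accepts an rdflib URIRef, which is a str subclass
    if not iri.startswith(UNIPROTKB_NS):
        raise ValueError("not a UniProtKB entry IRI: %s" % iri)
    # Everything after the namespace is the accession
    return iri[len(UNIPROTKB_NS):]


print(primary_accession("http://purl.uniprot.org/uniprot/O22340"))
```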


## UniProt entry name (mnemonic)

The UniProtKB/Swiss-Prot **entry name** consists of up to 11 uppercase alphanumeric characters with a naming convention that can be symbolized as **X_Y**, where:  

- **X** is a mnemonic protein identification code of at most 5 alphanumeric characters  
- The **'_'** sign serves as a separator
- **Y** is a mnemonic species identification code of at most 5 alphanumeric characters

The mnemonic code **X** is an abbreviation of the protein/gene name, which does not necessarily correspond to the recommended protein name or to the gene name.  

See the UniProt documentation on [entry names](https://www.uniprot.org/help/entry_name) for more details.

The RDF format stores the **entry name** in the `mnemonic` property and, for convenience, also lists obsolete entry names with `oldMnemonic` properties.
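
As a rough illustration of the **X_Y** convention, an entry name can be split into its two codes with a regular expression. The pattern and the function name `split_entry_name` are assumptions derived from the description above, not an official UniProt validator.

```python
import re

# Assumed pattern for the X_Y convention described above: each part is
# 1 to 5 uppercase alphanumeric characters, separated by an underscore.
ENTRY_NAME = re.compile(r"([A-Z0-9]{1,5})_([A-Z0-9]{1,5})")


def split_entry_name(entry_name):
    """Return (protein_code, species_code), or None if the name does not fit."""
    match = ENTRY_NAME.fullmatch(entry_name)
    return match.groups() if match else None


print(split_entry_name("TPSDA_ABIGR"))
print(split_entry_name("A0A024R563_HUMAN"))
```

The second call returns `None` because the part before the underscore is longer than five characters. The sketch covers only the Swiss-Prot convention, so a name of that shape needs separate handling.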

In [6]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/> 
SELECT
  ?protein ?mnemonic
WHERE {
  ?protein a up:Protein ;
      up:mnemonic ?mnemonic.
}""")

for row in qres:
    print("The entry name of %s is %s" % row)

The entry name of http://purl.uniprot.org/uniprot/O22340 is TPSDA_ABIGR


### Old mnemonics

In [5]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/> 
SELECT
  ?protein (GROUP_CONCAT(?oldMnemonic; separator=" and ") AS ?oldMnemonics)
WHERE {
  ?protein a up:Protein ;
      up:oldMnemonic ?oldMnemonic.
} GROUP BY ?protein
""")

for row in qres:
    print("%s used to be known as %s" % row)

http://purl.uniprot.org/uniprot/O22340 used to be known as TPSD3_ABIGR and TSD3_ABIGR


## Entry status

UniProtKB has two sections: 

- UniProtKB/Swiss-Prot: entries that have been manually annotated and reviewed by UniProtKB biocurators 
- UniProtKB/TrEMBL: entries that have been annotated using annotation pipelines 

The RDF format stores the **entry status** in the property `reviewed`.



In [6]:
sp_entry=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
<O22340> rdf:type up:Protein ;
         up:reviewed true ;
         up:created "2001-10-24"^^xsd:date ;
         up:modified "2015-04-01"^^xsd:date ;
         up:version 86 ;
         up:mnemonic "TPSDA_ABIGR" .
""")

qres=sp_entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
       ?entryName
       ?reviewed
WHERE {
  ?protein a up:Protein .
  ?protein up:mnemonic ?entryName .
  ?protein up:reviewed ?reviewed .
}""" )

for row in qres:
    print("%s (%s) is a reviewed (Swiss-Prot) entry? %s" % row)

http://purl.uniprot.org/uniprot/O22340 (TPSDA_ABIGR) is a reviewed (Swiss-Prot) entry? true


In [7]:
tr_entry=Graph().parse(format='ttl',
                         data="""base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/> 
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>  
prefix xsd: <http://www.w3.org/2001/XMLSchema#> 

<A0A024R563> rdf:type up:Protein ;
             up:reviewed false ;
             up:created "2014-07-09"^^xsd:date ;
             up:modified "2020-10-07"^^xsd:date ;
             up:version 30 ;
             up:mnemonic "A0A024R563_HUMAN" ."""
)

qres=tr_entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
       ?entryName
       ?reviewed
WHERE {
  ?protein a up:Protein .
  ?protein up:mnemonic ?entryName .
  ?protein up:reviewed ?reviewed .
}""" )

for row in qres:
    print("%s (%s) is a reviewed (Swiss-Prot) entry? %s" % row)

http://purl.uniprot.org/uniprot/A0A024R563 (A0A024R563_HUMAN) is a reviewed (Swiss-Prot) entry? false


## Dates and versions

The date when an entry was integrated into UniProtKB is stored in the `created` property. The last modification date of the entry and its current version are stored in the `modified` and `version` properties of the entry. The last modification date of the sequence and its current version are stored in the `modified` and `version` properties of the `sequence` resource. Dates use the international standard date [notation](http://www.w3.org/QA/Tips/iso-date) (ISO 8601, `YYYY-MM-DD`).



In [4]:
entry=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
<O22340> rdf:type up:Protein ;
         up:reviewed true ;
         up:created "2001-10-24"^^xsd:date ;
         up:modified "2015-04-01"^^xsd:date ;
         up:version 86 ;
         up:mnemonic "TPSDA_ABIGR" ;
         up:oldMnemonic "TPSD3_ABIGR" ,
                        "TSD3_ABIGR" ;
         up:replaces <Q94FV9> ;
         up:sequence isoform:O22340-1 .
isoform:O22340-1 rdf:type up:Simple_Sequence ;
                 up:modified "1998-01-01"^^xsd:date ;
                 up:version 1 .""")

qres=entry.query("""prefix up: <http://purl.uniprot.org/core/>
SELECT
    ?protein
    ?created
    ?modified
    ?version
WHERE {
  ?protein a up:Protein ;
           up:created ?created ;
           up:modified ?modified ;
           up:version ?version .
}""")

for row in qres:
    print("%s was created on %s and modified on %s. It is at version %s" % row)

http://purl.uniprot.org/uniprot/O22340 was created on 2001-10-24 and modified on 2015-04-01. It is at version 86
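
Because the dates use the `YYYY-MM-DD` notation, their printed form can be handed straight to Python's standard library for date arithmetic. This is a minimal sketch that assumes the two strings were copied from the output above. In a notebook they would typically come from `str(row[1])` and `str(row[2])`.

```python
from datetime import date

# Values copied from the query output above (assumed to be plain strings)
created = date.fromisoformat("2001-10-24")
modified = date.fromisoformat("2015-04-01")

# Subtracting two dates gives a timedelta; .days is the whole-day difference
days_between = (modified - created).days
print("Last modified %d days after creation" % days_between)
```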


### <span style='color:#f44336'>Exercise: select entries with a version between 20 and 100</span>



In [14]:
qres=entry.query("""prefix up: <http://purl.uniprot.org/core/>
SELECT
    ?protein
    ?created
    ?modified
    ?version
WHERE {
  ?protein a up:Protein ;
           up:created ?created ;
           up:modified ?modified ;
           up:version ?version .
 FILTER(20 < ?version)
 FILTER(**** ?version)
}""")

for row in qres:
    print("%s was created on %s and modified on %s. It is at version %s" % row)

http://purl.uniprot.org/uniprot/O22340 was created on 2001-10-24 and modified on 2015-04-01. It is at version 86


# UniProt protein names

This section shows basic information on **protein names**.

Protein names are modeled as `name` resources in the RDF format. 

There are 3 main types of protein names:  

 1. The name recommended by the UniProt consortium: linked to a `recommendedName` element/property in the RDF format.  
 2. Names provided by the submitter of the underlying nucleotide sequence (in UniProtKB/TrEMBL only):
    Shown in `submittedName` elements/properties in the RDF format.  
 3. Alternative names:  
    Shown in `alternativeName` elements/properties in the RDF format.  

These types are further categorized into:  

 1. Full name:  
    Shown in a `fullName` element/property in the RDF format.  
 2. Abbreviations or acronyms of the full name:  
    Shown in `shortName` elements/properties in the RDF format.  

There are furthermore a few categories with more specific meanings:  

  1. Name of an allergen:  
     Shown in an `allergenName` element/property in the RDF format.  
  2. Names of CD antigens:  
     Shown in `cdAntigenName` elements/properties in the RDF format.  
  3. Name used in a biotechnological context:  
     Shown in a `biotechName` element/property in the RDF format.  
  4. International nonproprietary names:  
     Shown in `innName` elements/properties in the RDF format.  
  5. Enzyme Commission (EC) numbers:  
     Shown in `ecName` elements/properties, which link a name with the EC number that classifies the enzymatic activity of this entry.  
     See the notebook `XX_metabolism.ipynb` for more details.  



## Protein Names



In [4]:
entry=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix enzyme: <http://purl.uniprot.org/enzyme/>

<P12820>
  a up:Protein ;
  up:recommendedName
    <P12820#SIP30A> ;
  up:alternativeName
    <P12820#SIP62B> ,
    <P12820#SIPE4F> ,
    <P12820#SIPFE1> ;
  up:enzyme
    enzyme:3.2.1.- ,
    enzyme:3.4.15.1 ;
  up:sequence
    isoform:P12820-1 .

<P12820#SIP30A>
  rdf:type up:Structured_Name ;
  up:fullName "Angiotensin-converting enzyme" ;
  up:shortName "ACE" ;
  up:ecName "3.2.1.-" ,
    "3.4.15.1" .

<P12820#SIP62B>
  rdf:type up:Structured_Name ;
  up:fullName "Dipeptidyl carboxypeptidase I" .

<P12820#SIPE4F>
  rdf:type up:Structured_Name ;
  up:fullName "Kininase II" .

<P12820#SIPFE1>
  rdf:type up:Structured_Name ;
  up:cdAntigenName "CD143" .
""")


### Selecting a recommended full name

In [11]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/> 
SELECT ?protein 
       ?fullName
WHERE {
  ?protein a up:Protein ;
           up:recommendedName ?recommendedName .
  ?recommendedName up:fullName ?fullName .
}""")
for row in qres:
    print("UniProt recommends that %s is called '%s' (full name)" % row)

UniProt recommends that http://purl.uniprot.org/uniprot/P12820 is called 'Angiotensin-converting enzyme' (full name)


### Selecting a recommended short name

In [12]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/> 
SELECT ?protein 
       ?shortName
WHERE {
  ?protein a up:Protein ;
           up:recommendedName ?recommendedName .
  ?recommendedName up:shortName ?shortName .
}""")
for row in qres:
    print("UniProt recommends that %s is called '%s' (short name)" % row)

UniProt recommends that http://purl.uniprot.org/uniprot/P12820 is called 'ACE' (short name)


### Selecting recommended EC numbers

In [14]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
SELECT ?protein
       ?ecName
WHERE {
  ?protein a up:Protein ;
           up:recommendedName ?recommendedName .
  ?recommendedName up:ecName ?ecName .
}""")
for row in qres:
    print("UniProt recommends that %s is named EC %s" % row)

UniProt recommends that http://purl.uniprot.org/uniprot/P12820 is named EC 3.4.15.1
UniProt recommends that http://purl.uniprot.org/uniprot/P12820 is named EC 3.2.1.-


### Selecting alternative names

In [6]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
SELECT ?protein
       ?fullName
WHERE {
  ?protein a up:Protein ;
           up:alternativeName ?alternativeName .
  ?alternativeName up:fullName ?fullName .
}""")

for row in qres:
    print("%s is also known as %s" % row)

http://purl.uniprot.org/uniprot/P12820 is also known as Dipeptidyl carboxypeptidase I
http://purl.uniprot.org/uniprot/P12820 is also known as Kininase II
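
The four queries above differ only in two property names: the type of name and the category of value. One way to avoid repeating them is a small template, sketched below. The function `build_name_query` and the two sets are hypothetical helpers written for this notebook. The sets list only properties that appear in the P12820 sample data and are not a complete inventory of the UniProt core vocabulary.

```python
# Hypothetical helper; the allowed values are limited to properties that
# occur in the sample data above.
NAME_TYPES = {"recommendedName", "alternativeName"}
NAME_CATEGORIES = {"fullName", "shortName", "ecName", "cdAntigenName"}

QUERY_TEMPLATE = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?name
WHERE {
  ?protein a up:Protein ;
           up:%(name_type)s ?structuredName .
  ?structuredName up:%(category)s ?name .
}"""


def build_name_query(name_type, category):
    """Return a SPARQL query string for one name type and one category."""
    if name_type not in NAME_TYPES:
        raise ValueError("unknown name type: %s" % name_type)
    if category not in NAME_CATEGORIES:
        raise ValueError("unknown name category: %s" % category)
    return QUERY_TEMPLATE % {"name_type": name_type, "category": category}


print(build_name_query("alternativeName", "cdAntigenName"))
```

The returned string can be passed to `entry.query(...)` like the hand-written queries. Checking the arguments against a fixed set keeps arbitrary text out of the query.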


# Replicon and genes

This section shows basic information on the **genes** that encode a protein and on their replicon (chromosome, plasmid, etc.).   

## Gene Names

The gene(s) that encode a protein are linked to the entry by separate `encodedBy` properties.

There are four categories of gene names:  
- The primary gene name is represented with a `skos:prefLabel` property.
- Synonyms are represented with `skos:altLabel` properties.
- Ordered locus names (OLN) are represented with `locusName` properties.
- ORF names are represented with `orfName` properties.

The resources representing a gene are members of the `up:Gene` class.

In [2]:
entry=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>

<Q0JNS6>
  a up:Protein ;
  up:encodedBy
    <Q0JNS6#51304A4E53360019> ,
    <Q0JNS6#51304A4E5336001A> ,
    <Q0JNS6#51304A4E5336001B> .

<Q0JNS6#51304A4E53360019>
  rdf:type up:Gene ;
  skos:prefLabel "CAM1-1" ;
  skos:altLabel "CAM1" ;
  up:locusName "Os03g0319300" ,
    "LOC_Os03g20370" ;
  up:orfName "OsJ_010214" .

<Q0JNS6#51304A4E5336001A>
  rdf:type up:Gene ;
  skos:prefLabel "CAM1-2" ;
  skos:altLabel "CAM" ;
  up:locusName "Os07g0687200" ,
    "LOC_Os07g48780" ;
  up:orfName "OJ1150_E04.120-1" ,
    "OJ1200_C08.124-1" ,
    "OsJ_024630" .

<Q0JNS6#51304A4E5336001B>
  rdf:type up:Gene ;
  skos:prefLabel "CAM1-3" ;
  up:locusName "Os01g0267900" ,
    "LOC_Os01g16240" ;
  up:orfName "OsJ_001186" ,
    "P0011D01.22" .""")


### Selecting encoding genes

In [3]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?protein
       ?gene 
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
}""")

for row in qres:
    print("%s is encoded by %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6 is encoded by http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B
http://purl.uniprot.org/uniprot/Q0JNS6 is encoded by http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019
http://purl.uniprot.org/uniprot/Q0JNS6 is encoded by http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A


### Selecting the recommended gene names

In [4]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?gene
       ?recommendedGeneName
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
  ?gene skos:prefLabel ?recommendedGeneName .
}""")

for row in qres:
    print("%s is called %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B is called CAM1-3
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 is called CAM1-1
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A is called CAM1-2


### Selecting alternative gene names

In [5]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?gene
       ?altGeneName
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
  ?gene skos:altLabel ?altGeneName .
}""")

for row in qres:
    print("%s is also known as %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 is also known as CAM1
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A is also known as CAM


### Selecting ordered locus names

In [6]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?gene
       ?oln
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
  ?gene up:locusName ?oln .
}""")

for row in qres:
    print("%s has an ordered locus name %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B has an ordered locus name Os01g0267900
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B has an ordered locus name LOC_Os01g16240
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 has an ordered locus name Os03g0319300
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 has an ordered locus name LOC_Os03g20370
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an ordered locus name LOC_Os07g48780
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an ordered locus name Os07g0687200


### Selecting ORF names

In [7]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?gene
       ?orfName
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
  ?gene up:orfName ?orfName .
}""")

for row in qres:
    print("%s has an open reading frame name %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B has an open reading frame name OsJ_001186
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B has an open reading frame name P0011D01.22
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 has an open reading frame name OsJ_010214
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an open reading frame name OJ1200_C08.124-1
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an open reading frame name OsJ_024630
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an open reading frame name OJ1150_E04.120-1
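
Each of these queries returns one row per (gene, name) pair, so a gene with several names appears several times. If one record per gene is more convenient, the rows can be grouped in Python. The sketch below uses pairs copied from the ORF name output above, shortened to the IRI fragment. `group_by_gene` is a hypothetical helper, not an rdflib function.

```python
from collections import defaultdict

# (gene IRI fragment, ORF name) pairs copied from the output above
rows = [
    ("51304A4E5336001B", "OsJ_001186"),
    ("51304A4E5336001B", "P0011D01.22"),
    ("51304A4E53360019", "OsJ_010214"),
    ("51304A4E5336001A", "OJ1200_C08.124-1"),
    ("51304A4E5336001A", "OsJ_024630"),
    ("51304A4E5336001A", "OJ1150_E04.120-1"),
]


def group_by_gene(pairs):
    """Collect all names of each gene into one sorted list."""
    grouped = defaultdict(list)
    for gene, name in pairs:
        grouped[gene].append(name)
    return {gene: sorted(names) for gene, names in grouped.items()}


for gene, names in group_by_gene(rows).items():
    print(gene, names)
```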


## Replicons

In [3]:
entry=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix up: <http://purl.uniprot.org/core/> 
prefix xsd: <http://www.w3.org/2001/XMLSchema#> 

<Q71RH2> rdf:type up:Protein ;
  up:reviewed true ;
  up:created "2005-04-12"^^xsd:date ;
  up:modified "2021-04-07"^^xsd:date ;
  up:version 130 ;
  up:mnemonic "TLC3B_HUMAN" .


<Q71RH2> up:proteome <http://purl.uniprot.org/proteomes/UP000005640#Chromosome%2016> .""")

In [9]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?protein
       ?replicon
WHERE {
  ?protein a up:Protein ;
           up:proteome ?proteomeData .
  BIND( strafter( str(?proteomeData), "#" ) as ?replicon )
}""")

for row in qres:
    print("The gene coding for %s is located on %s" % row)

The gene coding for http://purl.uniprot.org/uniprot/Q71RH2 is located on Chromosome%2016
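
The replicon name comes back percent-encoded (`%20` stands for a space) because it is taken from the fragment of an IRI. A minimal sketch of decoding it with the standard library, assuming the IRI is available as a plain string:

```python
from urllib.parse import unquote, urldefrag

# IRI copied from the sample data above
proteome_data = "http://purl.uniprot.org/proteomes/UP000005640#Chromosome%2016"

# urldefrag splits the IRI at '#': (IRI without fragment, fragment)
proteome_iri, fragment = urldefrag(proteome_data)

# unquote turns percent-escapes such as %20 back into characters
replicon = unquote(fragment)

print(proteome_iri)
print(replicon)
```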


### Using the live UniProt SPARQL endpoint

The same information can be selected from the live endpoint. The `proteome` property of an entry points to a proteome component, i.e. a chromosome.

In [10]:
query = """
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 
PREFIX proteome:<http://purl.uniprot.org/proteomes/>

SELECT
  DISTINCT
    ?proteomeData
WHERE {
  # reviewed entries (UniProtKB/Swiss-Prot)
  ?protein up:reviewed true .
  # restricted to the human taxon (taxid 9606)
  ?protein up:organism taxon:9606 .
  ?protein up:proteome ?proteomeData .
  BIND( strbefore( str(?proteomeData), "#" ) as ?proteome )
  BIND( strafter( str(?proteomeData), "#" ) as ?replicon )
}
LIMIT 3
"""

# Set the SPARQL endpoint (UniProt)
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")

# Define the query
sparql.setQuery(query)

# Set the output format as JSON
sparql.setReturnFormat(JSON)

# Run the SPARQL query and convert to the defined format
results = sparql.query().convert()

# Print the query result
for result in results["results"]["bindings"]:
    print(result["proteomeData"]["value"])

http://purl.uniprot.org/proteomes/UP000005640#Chromosome%2011
http://purl.uniprot.org/proteomes/UP000005640#Chromosome%202
http://purl.uniprot.org/proteomes/UP000005640#Chromosome%203
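
The converted result is a nested dictionary in the SPARQL JSON results layout: the variable names are under `head.vars` and one dictionary per row is under `results.bindings`. The sketch below flattens that structure into plain rows. `sample_results` is hand-written to mimic the response above, so it runs without network access, and `bindings_to_rows` is a hypothetical helper, not part of SPARQLWrapper.

```python
# Hand-written sample mimicking the endpoint response shown above
sample_results = {
    "head": {"vars": ["proteomeData"]},
    "results": {
        "bindings": [
            {"proteomeData": {
                "type": "uri",
                "value": "http://purl.uniprot.org/proteomes/UP000005640#Chromosome%2011",
            }},
            {"proteomeData": {
                "type": "uri",
                "value": "http://purl.uniprot.org/proteomes/UP000005640#Chromosome%202",
            }},
        ]
    },
}


def bindings_to_rows(results):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # A variable that is unbound in a row is absent from its binding
        rows.append({
            var: binding[var]["value"] if var in binding else None
            for var in variables
        })
    return rows


for row in bindings_to_rows(sample_results):
    print(row["proteomeData"])
```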


### Organelles and Plasmids

If a gene is located in an organelle other than the nucleus, and/or on a plasmid rather than a chromosome, the gene location is stored in `encodedIn` properties. Note that if a plasmid has several names, they are listed as multiple `rdfs:label` properties.

In [3]:
entry2=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

<Q01529>
  a up:Protein ;
  up:encodedIn up:Mitochondrion ,
               <Q01529#SIP29DF58> . 

<Q01529#SIP29DF58>
  rdf:type up:Plasmid ;
  rdfs:label "pAL2-1" .
  """)


qres=entry2.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT
    ?protein
    ?plasmidOrOrganelle
    ?label
WHERE {
    ?protein a up:Protein ;
      up:encodedIn ?plasmidOrOrganelle .
    OPTIONAL {
        ?plasmidOrOrganelle rdfs:label ?label .
    }
}""")

for row in qres:
    print("protein=%s plasmid or organelle=%s label=%s" % row)

protein=http://purl.uniprot.org/uniprot/Q01529 plasmid or organelle=http://purl.uniprot.org/core/Mitochondrion label=None
protein=http://purl.uniprot.org/uniprot/Q01529 plasmid or organelle=http://purl.uniprot.org/uniprot/Q01529#SIP29DF58 label=pAL2-1


Sometimes it is known that a gene is located on a plasmid, but the name of the plasmid is unknown. The example below shows how this is represented.

In [13]:
entry3=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
<Q7BS32>
  a up:Protein ;
  up:encodedIn
    <Q7BS32#51374253333200E> .

<Q7BS32#51374253333200E>
  rdf:type up:Plasmid .""")

qres=entry3.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT
    ?protein
    ?type
WHERE {
    ?protein a up:Protein ;
      up:encodedIn ?plasmidOrOrganelle .
    OPTIONAL {
        ?plasmidOrOrganelle a ?type .
    }
}""")

for row in qres:
    print("%s is encodedIn a '%s'" % row)

http://purl.uniprot.org/uniprot/Q7BS32 is encodedIn a 'http://purl.uniprot.org/core/Plasmid'


# Taxonomy

This notebook aims to show you how taxonomy data are represented in UniProt.  

UniProtKB taxonomy data is manually curated (see details [here](https://www.uniprot.org/taxonomy/)).


The organism which is the source of a protein sequence is identified by a unique identifier (often called _taxon_ or _taxid_) from the [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) database.   
This is the only taxonomy information that is stored in the RDF format of a UniProtKB entry. However, the full NCBI taxonomy is modelled and available as well.   

## Organism identifier

The organism identifier (taxon) is stored in the `organism` property of a UniProt entry.  

In [13]:
P05067ttl = """base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix taxon: <http://purl.uniprot.org/taxonomy/>

<P05067> a up:Protein ;
         up:organism taxon:9606 .
"""

P05067=Graph().parse(format='ttl', data=P05067ttl)

for subj, pred, obj in P05067:
   print(subj, pred, obj)


http://purl.uniprot.org/uniprot/P05067 http://purl.uniprot.org/core/organism http://purl.uniprot.org/taxonomy/9606
http://purl.uniprot.org/uniprot/P05067 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.uniprot.org/core/Protein


## Retrieve the taxon (organism id) of a protein

In [14]:
qres=P05067.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?protein ?taxon
WHERE {
  ?protein a up:Protein ;
           up:organism ?taxon .
}""")

for row in qres:
    print("The taxon (organism id) of %s is %s" % row)

The taxon (organism id) of http://purl.uniprot.org/uniprot/P05067 is http://purl.uniprot.org/taxonomy/9606


## Taxonomy data

**Properties**:
- `rank`
- `mnemonic`
- `scientificName`
- `commonName`
- `otherName`
- `seeAlso` (xref)
- `subClassOf` (hierarchy)
- `narrowerTransitive` (opposite direction of hierarchy)
- `partOfLineage` (whether the taxon should be shown on the UniProt record pages of the website)

In [4]:
# Description of the taxon:9606 (Homo sapiens)
taxon=Graph().parse(format='ttl',
                 data="""
base <http://purl.uniprot.org/taxonomy/> 
prefix up: <http://purl.uniprot.org/core/> 
prefix foaf: <http://xmlns.com/foaf/0.1/> 
prefix owl: <http://www.w3.org/2002/07/owl#> 
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
prefix skos: <http://www.w3.org/2004/02/skos/core#> 
prefix xsd: <http://www.w3.org/2001/XMLSchema#> 

<9606> a up:Taxon ;
       up:rank up:Species ;
       up:mnemonic "HUMAN" ;
       up:scientificName "Homo sapiens" ;
       up:commonName "Human" ;
       up:otherName "Home sapiens" ,
                    "Homo sapiens Linnaeus, 1758" ,
                    "man" ;
       rdfs:seeAlso <http://animaldiversity.org/site/accounts/information/Homo_sapiens.html> ,
                    <http://archaeologyinfo.com/homo-sapiens/> ,
                    <http://www.ensembl.org/Homo_sapiens/Info/Index> ,
                    <https://www.sciencedaily.com/releases/2005/02/050223122209.htm> ;
       rdfs:subClassOf <9605> ;
       skos:narrowerTransitive <63221> ,
                               <741158> ;
       up:partOfLineage false .

<9605> a up:Taxon ;
       up:rank up:Genus ;
       up:scientificName "Homo" ;
       up:otherName "Homo Linnaeus, 1758" ,
                    "humans" ;
       rdfs:subClassOf <207598> ;
       skos:narrowerTransitive <9606> ,
                               <1425170> ,
                               <2665952> ;
       up:partOfLineage true .
""")

### Retrieve the rank and the scientific name of the organism

The rank and scientificName are by far the most queried properties of a taxon.

In [16]:
qres=taxon.query("""PREFIX up: <http://purl.uniprot.org/core/> 
SELECT ?taxon 
       ?rank
       ?scientificName
WHERE {
  ?taxon a up:Taxon ;
         up:rank ?rank ;
         up:scientificName ?scientificName .
}""")

for row in qres:
    print('Taxon "%s", rank = "%s", scientificName = "%s"' % row)

Taxon "http://purl.uniprot.org/taxonomy/9606", rank = "http://purl.uniprot.org/core/Species", scientificName = "Homo sapiens"
Taxon "http://purl.uniprot.org/taxonomy/9605", rank = "http://purl.uniprot.org/core/Genus", scientificName = "Homo"


## Taxonomy hierarchy

Querying the taxonomic hierarchy is straightforward using the `rdfs:subClassOf` property.  
In our _taxon_ example shown previously:  
`<9606> rdfs:subClassOf <9605>`  
`<9605> rdfs:subClassOf <207598>`  

To facilitate searching, the UniProt SPARQL endpoint materializes all of these relationships, including the indirect ones. In other words, you don't need a SPARQL property path to query the taxonomy classification there.  
Note that if you use other endpoints you might need to use `rdfs:subClassOf+` to query by higher levels of taxonomy.
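
To make the difference concrete, here is a plain-Python sketch of what a `rdfs:subClassOf+` path computes: it follows the direct parent links repeatedly and collects every ancestor. The dictionary holds the two direct links from the taxon example. The function `ancestors` is a hypothetical helper and assumes each taxon has at most one direct parent.

```python
# Direct child -> parent links taken from the taxon example
parent_of = {"9606": "9605", "9605": "207598"}


def ancestors(taxon, parent_of):
    """Follow parent links upwards and return all ancestors, nearest first."""
    result = []
    seen = {taxon}
    while taxon in parent_of:
        taxon = parent_of[taxon]
        if taxon in seen:  # guard against accidental cycles in the data
            break
        seen.add(taxon)
        result.append(taxon)
    return result


print(ancestors("9606", parent_of))
```

On an endpoint that materializes the hierarchy, the indirect link from 9606 to 207598 is already stored as a triple, so a plain `rdfs:subClassOf` pattern is enough.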


In [5]:
qres=taxon.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT
  ?species
  ?genus
WHERE {
  ?species a up:Taxon ;
           up:rank up:Species ;
           rdfs:subClassOf ?genus .
  ?genus a up:Taxon ;
         up:rank up:Genus .
}""")

for row in qres:
    print("%s is part of the genus %s" % row)

http://purl.uniprot.org/taxonomy/9606 is part of the genus http://purl.uniprot.org/taxonomy/9605


### <span style='color:#f44336'>Exercise: use skos:narrowerTransitive instead</span>



In [None]:
qres=taxon.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT
  ?species
  ?genus
WHERE {
  ?species a up:Taxon ;
           up:rank up:Species ;
           **** ?genus .
  ?genus a up:Taxon ;
         up:rank up:Genus .
}""")

for row in qres:
    print("%s is part of the genus %s" % row)

## Host organisms

Sometimes an organism is known to be hosted inside another one (_e.g._ parasite, symbiont, infection).   
We defined the `host` property to link an organism to its host.  

In [4]:
host=Graph().parse(format='ttl',
                 data="""
base <http://purl.uniprot.org/taxonomy/> 
prefix up: <http://purl.uniprot.org/core/> 
<1241371> a up:Taxon ;
          up:mnemonic "ABHV" ;
          up:host <6451> .
""")

In [5]:
qres=host.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT
    ?virus
    ?host
WHERE {
    ?virus up:host ?host .
}""")

for row in qres:
    print("%s hosted by %s" % row)

http://purl.uniprot.org/taxonomy/1241371 hosted by http://purl.uniprot.org/taxonomy/6451


# Annotations

UniProt is well known for its detailed functional annotation of proteins.
There are many types of annotations; we will talk about three kinds today:

- Function_Annotation
- Catalytic_Activity_Annotation
- Active_Site_Annotation



In [7]:
annotated_entry=Graph().parse(format='ttl',
                 data="""
base <http://purl.uniprot.org/uniprot/>
prefix up: <http://purl.uniprot.org/core/>
prefix range: <http://purl.uniprot.org/range/>
prefix faldo: <http://biohackathon.org/resource/faldo#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix enzyme: <http://purl.uniprot.org/enzyme/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix position: <http://purl.uniprot.org/position/>
prefix isoform: <http://purl.uniprot.org/isoforms/>

<A1L3X0> rdf:type up:Protein ;
  up:reviewed true ;
  up:created "2007-12-04"^^xsd:date ;
  up:modified "2022-12-14"^^xsd:date ;
  up:version 126 ;
  up:mnemonic "ELOV7_HUMAN" ;
  up:annotation <A1L3X0#SIP89646E7BFEA7319C> , <A1L3X0#SIPAA7F9D2700865549> , <A1L3X0#SIP9679820FC93F989D> .

<A1L3X0#SIP89646E7BFEA7319C> rdf:type up:Function_Annotation ;
  rdfs:comment "Catalyzes the first and rate-limiting reaction of the four reactions that constitute the long-chain fatty acids elongation cycle. This endoplasmic reticulum-bound enzymatic process allows the addition of 2 carbons to the chain of long- and very long-chain fatty acids (VLCFAs) per cycle. Condensing enzyme with higher activity toward C18 acyl-CoAs, especially C18:3(n-3) acyl-CoAs and C18:3(n-6)-CoAs. Also active toward C20:4-, C18:0-, C18:1-, C18:2- and C16:0-CoAs, and weakly toward C20:0-CoA. Little or no activity toward C22:0-, C24:0-, or C26:0-CoAs. May participate in the production of saturated and polyunsaturated VLCFAs of different chain lengths that are involved in multiple biological processes as precursors of membrane lipids and lipid mediators." .

<A1L3X0#SIPAA7F9D2700865549> rdf:type up:Active_Site_Annotation ;
  rdfs:comment "Nucleophile" ;
  up:range range:18350076834885678tt150tt150 .
range:18350076834885678tt150tt150 rdf:type faldo:Region ;
  faldo:begin position:18350076834885678tt150 ;
  faldo:end position:18350076834885678tt150 .

<A1L3X0#SIP9679820FC93F989D> rdf:type up:Catalytic_Activity_Annotation ;
  up:catalyticActivity <A1L3X0#SIP795F50B39688C7AA> ;
  up:catalyzedPhysiologicalReaction <http://rdf.rhea-db.org/32728> .

<A1L3X0#SIP795F50B39688C7AA> rdf:type up:Catalytic_Activity ;
  up:catalyzedReaction <http://rdf.rhea-db.org/32727> ;
  up:enzymeClass enzyme:2.3.1.199 .

position:18350076834885678tt150 rdf:type faldo:Position ,
    faldo:ExactPosition ;
  faldo:position 150 ;
  faldo:reference isoform:A1L3X0-1 .

isoform:A1L3X0-1 rdf:type up:Simple_Sequence ;
  up:modified "2007-02-06"^^xsd:date ;
  up:version 1 ;
  up:mass 33356 ;
  up:md5Checksum "8be30446ba90dce3da3a4ed115206e19" ;
  rdf:value "MAFSDLTSRTVHLYDNWIKDADPRVEDWLLMSSPLPQTILLGFYVYFVTSLGPKLMENRKPFELKKAMITYNFFIVLFSVYMCYEFVMSGWGIGYSFRCDIVDYSRSPTALRMARTCWLYYFSKFIELLDTIFFVLRKKNSQVTFLHVFHHTIMPWTWWFGVKFAAGGLGTFHALLNTAVHVVMYSYYGLSALGPAYQKYLWWKKYLTSLQLVQFVIVAIHISQFFFMEDCKYQFPVFACIIMSYSFMFLLLFLHFWYRAYTKGQRLPKTVKNGTCKNKDN" .
""")

## Function Annotation

A free-text annotation meant for humans to read and understand, i.e. a "General description of the functions of a protein".

In [8]:
res=annotated_entry.query("""
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix up: <http://purl.uniprot.org/core/>

SELECT
  ?protein
  ?comment
WHERE {
  ?protein a up:Protein ;
    up:annotation ?annotation .
  ?annotation a up:Function_Annotation ;
    rdfs:comment ?comment .
  FILTER(contains(?comment, 'Condensing enzyme'))
}
""")

for row in res:
    print("Found %s annotated with the function=%s" % row)

Found http://purl.uniprot.org/uniprot/A1L3X0 annotated with the function=Catalyzes the first and rate-limiting reaction of the four reactions that constitute the long-chain fatty acids elongation cycle. This endoplasmic reticulum-bound enzymatic process allows the addition of 2 carbons to the chain of long- and very long-chain fatty acids (VLCFAs) per cycle. Condensing enzyme with higher activity toward C18 acyl-CoAs, especially C18:3(n-3) acyl-CoAs and C18:3(n-6)-CoAs. Also active toward C20:4-, C18:0-, C18:1-, C18:2- and C16:0-CoAs, and weakly toward C20:0-CoA. Little or no activity toward C22:0-, C24:0-, or C26:0-CoAs. May participate in the production of saturated and polyunsaturated VLCFAs of different chain lengths that are involved in multiple biological processes as precursors of membrane lipids and lipid mediators.


## Active site annotation

Active sites are the "Amino acid(s) involved in the activity of an enzyme."

In [14]:
res=annotated_entry.query("""
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix faldo: <http://biohackathon.org/resource/faldo#>
prefix up: <http://purl.uniprot.org/core/>

SELECT
  ?protein
  ?comment
  ?beginPosition
WHERE {
  ?protein a up:Protein ;
    up:annotation ?annotation .
  ?annotation a up:Active_Site_Annotation ;
    rdfs:comment ?comment ;
    up:range ?range .
  ?range faldo:begin/faldo:position ?beginPosition .
}
""")

for row in res:
    print("Found %s annotated with an active site (%s) at %s" % row)

Found http://purl.uniprot.org/uniprot/A1L3X0 annotated with an active site (Nucleophile) at 150


In [8]:
res=annotated_entry.query("""
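prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix faldo: <http://biohackathon.org/resource/faldo#>
prefix up: <http://purl.uniprot.org/core/>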

SELECT
  ?protein
  ?beginPosition
  ?sequence
WHERE {
  ?protein a up:Protein ;
    up:annotation ?annotation .
  ?annotation a up:Active_Site_Annotation ;
    up:range ?range .
  ?range faldo:begin [ faldo:position ?beginPosition ;
                       faldo:reference ?isoform ] .
  ?isoform rdf:value ?sequence .
}
""")

for row in res:
    print("Found %s annotated active site at %s on the sequence %s" % row)

Found http://purl.uniprot.org/uniprot/A1L3X0 annotated active site at 150 on the sequence MAFSDLTSRTVHLYDNWIKDADPRVEDWLLMSSPLPQTILLGFYVYFVTSLGPKLMENRKPFELKKAMITYNFFIVLFSVYMCYEFVMSGWGIGYSFRCDIVDYSRSPTALRMARTCWLYYFSKFIELLDTIFFVLRKKNSQVTFLHVFHHTIMPWTWWFGVKFAAGGLGTFHALLNTAVHVVMYSYYGLSALGPAYQKYLWWKKYLTSLQLVQFVIVAIHISQFFFMEDCKYQFPVFACIIMSYSFMFLLLFLHFWYRAYTKGQRLPKTVKNGTCKNKDN


### Extracting the amino acid annotated



In [9]:
res=annotated_entry.query("""
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix faldo: <http://biohackathon.org/resource/faldo#>
prefix up: <http://purl.uniprot.org/core/>

SELECT
  ?protein
  ?beginPosition
  ?aminoacid
WHERE {
  ?protein a up:Protein ;
    up:annotation ?annotation .
  ?annotation a up:Active_Site_Annotation ;
    up:range ?range .
  ?range faldo:begin [ faldo:position ?beginPosition ;
                       faldo:reference ?isoform ] .
  ?isoform rdf:value ?sequence .
  # SUBSTR is 1-based, like UniProt positions
  BIND(SUBSTR(?sequence, ?beginPosition, 1) AS ?aminoacid)
}
""")

for row in res:
    print("Found %s annotated active site at %s is a  %s" % row)

Found http://purl.uniprot.org/uniprot/A1L3X0 annotated active site at 150 is a  H
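
The same lookup can be cross-checked outside SPARQL. The sketch below is a plain-Python illustration of how a 1-based position, as used by UniProt and by SPARQL's `SUBSTR`, maps onto Python's 0-based string indexing. The helper name `residue_at` and the shortened toy sequence are assumptions made for this example; they are not part of UniProt or rdflib.

```python
def residue_at(sequence: str, position: int) -> str:
    """Return the residue at a 1-based position, like SUBSTR(?sequence, ?pos, 1)."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position %d is outside the sequence (length %d)"
                         % (position, len(sequence)))
    # Python strings are 0-based, so shift the 1-based position by one.
    return sequence[position - 1]


# Toy sequence for illustration only (the first ten residues of the entry above).
toy_sequence = "MAFSDLTSRT"
print(residue_at(toy_sequence, 1))  # first residue
print(residue_at(toy_sequence, 4))  # fourth residue
```

Forgetting the shift by one is an easy mistake when moving between SPARQL results and Python slicing, so it may be worth wrapping the conversion in a single helper like this.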


### <span style='color:#f44336'>Exercise: select only those entries where the active site is an 'H'</span>



In [None]:
res=annotated_entry.query("""
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix faldo: <http://biohackathon.org/resource/faldo#>
prefix up: <http://purl.uniprot.org/core/>

SELECT
  ?protein
  ?beginPosition
  ?aminoacid
WHERE {
  ?protein a up:Protein ;
    up:annotation ?annotation .
  ?annotation a up:Active_Site_Annotation ;
    up:range ?range .
  ?range faldo:begin [ faldo:position ?beginPosition ;
                       faldo:reference ?isoform ] .
  ?isoform rdf:value ?sequence .
  BIND(SUBSTR(?sequence, ?beginPosition, 1) AS ?aminoacid)
  FILTER(**** = ?aminoacid)
}
""")

for row in res:
    print("Found %s annotated active site at %s is a %s and must be a 'H'" % row)

### FALDO

We use FALDO to describe positions. FALDO can express kinds of positions that UniProt does not currently use but that do appear in other databases.
Because FALDO is a shared, reusable schema, what you learn here carries over to other databases that use it, e.g. DDBJ.

Why the layers of indirection? Simple biology doesn't need them; unfortunately, a lot of biology is not simple.

Things you might expect about genes, but that are not always true:

- they start before they end
- they have a definable length
- they are shorter than the chromosome they are on
- they start on the same chromosome they end on
- their position is actually known
- they are not wholly within other genes
- if they are, they don't share exons
- if they do, mutations in one don't make the other non-functional

FALDO handles these rare edge cases, and we pay for that with extra triples in the simple cases. We accept that cost because the rare cases are often the biologically interesting ones.

![Example](./Image/faldo-example.png)
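
To make the indirection concrete, here is a minimal Python sketch of the shape FALDO gives to a location: a region points to a begin and an end position, and each position carries its own coordinate and its own reference sequence. The class and field names are illustrative assumptions that loosely mirror `faldo:Region`, `faldo:begin`, `faldo:end`, `faldo:position` and `faldo:reference`; they are not an official API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Position:
    # None models a position that is not exactly known
    coordinate: Optional[int]
    # IRI of the sequence the coordinate is counted on
    reference: str


@dataclass(frozen=True)
class Region:
    begin: Position
    end: Position

    def length(self) -> Optional[int]:
        """Length in residues, or None when it cannot be defined."""
        if self.begin.coordinate is None or self.end.coordinate is None:
            return None  # a position is not actually known
        if self.begin.reference != self.end.reference:
            return None  # begin and end lie on different sequences
        # abs() tolerates regions whose end comes before their begin
        return abs(self.end.coordinate - self.begin.coordinate) + 1


iso = "http://purl.uniprot.org/isoforms/A1L3X0-1"
active_site = Region(Position(150, iso), Position(150, iso))
print(active_site.length())  # 1
```

Because begin and end are separate objects, each with its own reference, the unusual cases from the list above (unknown positions, begin and end on different sequences) fit the same structure as the simple single-residue case.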



## Catalytic activity

![Protein to Reaction](Image/uniprot_ca_rhea.png)



In [5]:
res=annotated_entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>

SELECT
  DISTINCT
    ?uniprot
    ?rhea
WHERE {
    ?uniprot a up:Protein .

    # Catalytic activity annotation
    ?uniprot up:annotation ?cann .
    ?cann a up:Catalytic_Activity_Annotation ;
      up:catalyticActivity ?ca .

    # catalyzed reaction (Rhea)
    ?ca up:catalyzedReaction ?rhea .
}
""")

for row in res:
    print("Found %s that catalyzes %s" % row)

Found http://purl.uniprot.org/uniprot/A1L3X0 that catalyzes http://rdf.rhea-db.org/32727


## String functions on sequences

UniProt uses 1-based indexing, as do the [SPARQL string](https://www.w3.org/TR/sparql11-query/) functions.

In the RDF, the sequences of all isoforms are available, not only the "canonical" one as in the flat file.


In [None]:
res=annotated_entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
SELECT
    ?protein
    ?sequenceLength
WHERE {
    ?protein up:sequence ?sequence .
    ?sequence rdf:value ?sequenceIUPACstring .
    BIND(STRLEN(?sequenceIUPACstring) AS ?sequenceLength)
}
""")

# Two variables are selected, matching the two placeholders below
for row in res:
    print("Found %s with a sequence of length %s" % row)

## Disease Annotation

In [9]:
diseased_entry=Graph().parse(format='ttl',
                 data="""
base <http://purl.uniprot.org/uniprot/>
prefix up: <http://purl.uniprot.org/core/>
prefix range: <http://purl.uniprot.org/range/>
prefix faldo: <http://biohackathon.org/resource/faldo#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix enzyme: <http://purl.uniprot.org/enzyme/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix position: <http://purl.uniprot.org/position/>
prefix isoform: <http://purl.uniprot.org/isoforms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix annotation: <http://purl.uniprot.org/annotation/>
prefix disease: <http://purl.uniprot.org/diseases/>

<A0A1B0GTW7> rdf:type up:Protein ;
  up:reviewed true ;
  up:annotation <A0A1B0GTW7#SIP7FF941061D88C2CD> , annotation:VAR_086199 .

<A0A1B0GTW7#SIP7FF941061D88C2CD> rdf:type up:Disease_Annotation ;
  rdfs:comment "The disease is caused by variants affecting the gene represented in this entry." ;
  up:disease disease:6243 .

annotation:VAR_086199 rdf:type up:Natural_Variant_Annotation ;
  rdfs:comment "In HTX12; unknown pathological significance." ;
  up:substitution "F" ;
  skos:related disease:6243 ;
  up:range <http://purl.uniprot.org/range/-9218581325671703506tt31tt31> . """)

In the above example there is a `skos:related` link between a natural variant annotation and a disease (`disease:6243`). This link is only made explicit in the RDF version of UniProt.

In [11]:
res=diseased_entry.query("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT
    ?protein
    ?annotation
    ?disease
WHERE {
    ?protein up:annotation ?annotation .
    ?annotation (skos:related|up:disease) ?disease .
}
""")

for row in res:
    print("Found %s with an annotation %s that is related to %s" % row)

Found http://purl.uniprot.org/uniprot/A0A1B0GTW7 with an annotation http://purl.uniprot.org/annotation/VAR_086199 that is related to http://purl.uniprot.org/diseases/6243
Found http://purl.uniprot.org/uniprot/A0A1B0GTW7 with an annotation http://purl.uniprot.org/uniprot/A0A1B0GTW7#SIP7FF941061D88C2CD that is related to http://purl.uniprot.org/diseases/6243
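
The alternative path `(skos:related|up:disease)` matches a triple whose predicate is either of the two properties. As a rough illustration only, the plain-Python sketch below applies the same idea to a hand-copied subset of the triples above. The tuple representation, the abbreviated subject and object strings, and the helper `follow_any` are assumptions made for this example; this is not how rdflib evaluates property paths.

```python
UP = "http://purl.uniprot.org/core/"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hand-copied subset of the example entry, with abbreviated subject/object names
triples = [
    ("uniprot:A0A1B0GTW7#SIP7FF941061D88C2CD", UP + "disease", "disease:6243"),
    ("annotation:VAR_086199", SKOS + "related", "disease:6243"),
    ("annotation:VAR_086199", UP + "substitution", "F"),
]


def follow_any(triples, predicates):
    """Yield (subject, object) for triples whose predicate is one of predicates."""
    wanted = set(predicates)
    for s, p, o in triples:
        if p in wanted:
            yield s, o


links = sorted(follow_any(triples, [SKOS + "related", UP + "disease"]))
for annotation, disease in links:
    print("%s is related to %s" % (annotation, disease))
```

Both the disease annotation and the natural variant annotation are returned, while the `up:substitution` triple is ignored because its predicate is not one of the two alternatives.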
