This notebook extracts entities and their associated triples and descriptions from UniProt data loaded into a dedicated GraphDB instance. The data comes from the [FTP release](https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/) file uniprotkb_reviewed_eukaryota_opisthokonta_metazoa_33208_0.rdf.xz, together with the [citation info](https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/citations.rdf.xz) and the [GO OWL](https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/go.owl.xz) file. The RDF data can be queried at https://biosoda.unil.ch/graphdb/sparql (repository *uniprot_swiss_ai*).

The overall idea is explained in the presentation available in the Swiss AI RAG Gdrive folder [here](https://docs.google.com/presentation/d/1c89Sa7WM-vcCVZ-F2LfP0jZe0l3bVImT/edit?usp=drive_link&ouid=106167242104510220555&rtpof=true&sd=true).

We start by identifying proteins whose GO annotations can be traced back to a specific journal citation. This lets us link a protein to a DOI, title and abstract, which then serve as its "entity description", similar to the Wikidata dataset.
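
As a rough sketch of the record this pipeline aims to produce, the dataclass below bundles a protein, one of its GO annotations and the citation fields used as the description. The class name `EntityDescription`, its field names and the placeholder title and abstract are illustrative assumptions; they are not part of the UniProt schema or of the code in this notebook.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EntityDescription:
    """One (protein, GO annotation, citation) record; field names are hypothetical."""
    protein: str        # protein IRI
    go_annotation: str  # GO term IRI
    doi: str
    pubmed_id: str
    title: str
    abstract: str

    def description(self) -> str:
        """Join title and abstract into one text, one possible 'entity description'."""
        return f"{self.title}\n\n{self.abstract}"


example = EntityDescription(
    protein="http://purl.uniprot.org/uniprot/Q60888",
    go_annotation="http://purl.obolibrary.org/obo/GO_0016020",
    doi="doi:10.1186/gb-2003-4-11-r71",
    pubmed_id="http://purl.uniprot.org/pubmed/14611657",
    title="Example title",          # placeholder text
    abstract="Example abstract.",   # placeholder text
)
print(example.description())
```

Keeping the GO annotation in the record preserves the link between the description and the annotation it supports.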



In [None]:
!pip install sparqlwrapper

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON

# the endpoint where the data can be queried
sparql = SPARQLWrapper(
    "https://biosoda.unil.ch/graphdb/repositories/uniprot_swiss_ai"
)
sparql.setReturnFormat(JSON)

In [None]:
# simple example of a query: count all proteins in the endpoint
sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>

    SELECT (count (distinct ?protein) as ?count_num_proteins)
    WHERE {
        ?protein a up:Protein .
    }
    """
)

try:
    ret = sparql.queryAndConvert()

    for r in ret["results"]["bindings"]:
        print(r["count_num_proteins"]["value"])
except Exception as e:
    print(e)

109788
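
The cells in this notebook all unpack `ret["results"]["bindings"]` by hand. A small helper along the following lines could flatten that structure once. The function name `bindings_to_rows` is an assumption, and the `sample` dictionary is a hand-written stand-in for the response to the count query rather than a captured response.

```python
def bindings_to_rows(result):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


# Hand-written stand-in shaped like the SPARQL JSON results format.
sample = {
    "head": {"vars": ["count_num_proteins"]},
    "results": {"bindings": [
        {"count_num_proteins": {"type": "literal", "value": "109788"}}
    ]},
}
rows = bindings_to_rows(sample)
print(rows)
```

Values stay strings, as in the JSON response, so counts need an explicit `int(...)` before arithmetic.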


In [None]:
# count all proteins with some journal citation associated to them in the endpoint
sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>

    SELECT (count (distinct ?protein) as ?count_num_proteins_with_citations)
    WHERE {
        ?protein a up:Protein .
        ?protein up:citation ?citation.
        ?citation a up:Journal_Citation.
    }
    """
)

try:
    ret = sparql.queryAndConvert()

    for r in ret["results"]["bindings"]:
        print(r["count_num_proteins_with_citations"]["value"])
except Exception as e:
    print(e)

94165
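
Putting the two counts side by side gives a sense of how much of the dataset can be used. The short calculation below only uses the two numbers printed above; the variable names are illustrative.

```python
total_proteins = 109788        # output of the first count query
with_journal_citation = 94165  # output of the second count query

coverage = with_journal_citation / total_proteins
without_citation = total_proteins - with_journal_citation
print(f"{coverage:.1%} of proteins have a journal citation; {without_citation} do not")
```

This counts any journal citation on a protein. The subset whose citations are tied to a GO annotation may be smaller.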


In [None]:
# example query to get the DOI, title and abstract of a Journal Citation corresponding to a protein
# need to make sure this is assigned to a GO annotation link
sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>

    SELECT *
    WHERE {
        ?protein a up:Protein .

        ?protein up:attribution /up:source ?citation.

        ?citation a <http://purl.uniprot.org/core/Journal_Citation>.

        ?citation <http://purl.org/dc/terms/identifier> ?doi.

        ?citation <http://www.w3.org/2004/02/skos/core#exactMatch> ?pubMedId.

        ?citation <http://purl.uniprot.org/core/title> ?title.

        ?citation <http://www.w3.org/2000/01/rdf-schema#comment> ?abstract.
    } limit 10
    """
)

try:
    ret = sparql.queryAndConvert()

    for r in ret["results"]["bindings"]:
      for var in r:
        print(var + " " + r[var]["value"])
      print()
except Exception as e:
    print(e)

protein http://purl.uniprot.org/uniprot/Q60888
citation http://purl.uniprot.org/citations/14611657
doi doi:10.1186/gb-2003-4-11-r71
pubMedId http://purl.uniprot.org/pubmed/14611657
title Odorant receptor expressed sequence tags demonstrate olfactory expression of over 400 genes, extensive alternate splicing and unequal expression levels.
abstract <h4>Background</h4>The olfactory receptor gene family is one of the largest in the mammalian genome. Previous computational analyses have identified approximately 1,500 mouse olfactory receptors, but experimental evidence confirming olfactory function is available for very few olfactory receptors. We therefore screened a mouse olfactory epithelium cDNA library to obtain olfactory receptor expressed sequence tags, providing evidence of olfactory function for many additional olfactory receptors, as well as identifying gene structure and putative promoter regions.<h4>Results</h4>We identified more than 1,200 odorant receptor cDNAs representing mo
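
The abstract above contains HTML markup (`<h4>` section headings) and the DOI carries a `doi:` prefix. Before using these fields as an entity description, some light normalization may help. The sketch below uses only the standard library; the function names are assumptions, and the tag pattern is deliberately simple, so it is not a full HTML parser.

```python
import html
import re

# Matches opening and closing tags such as <h4> or </h4>; a bare "<" followed
# by a space (as in "p < 0.05") is left alone.
_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def clean_abstract(text):
    """Replace HTML tags with spaces, decode entities and collapse whitespace."""
    no_tags = _TAG.sub(" ", text)
    return " ".join(html.unescape(no_tags).split())


def strip_doi_prefix(doi):
    """Drop a leading 'doi:' so only the bare DOI remains."""
    return doi[4:] if doi.lower().startswith("doi:") else doi


print(clean_abstract("<h4>Background</h4>The olfactory receptor gene family"))
print(strip_doi_prefix("doi:10.1186/gb-2003-4-11-r71"))
```

Replacing tags with a space rather than deleting them keeps a heading such as "Background" from fusing with the first word of the following sentence.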

In [None]:
import csv

fields = ['protein_id', 'go_annotation_id']

# create the CSV file and write the header row
with open(r'data.csv', 'w', newline='') as f:
  writer = csv.writer(f)
  writer.writerow(fields)

# select all pairs of protein - GO annotations
sparql.setQuery("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT distinct ?protein ?goAnnotation
    WHERE {
        ?protein a up:Protein .
        ?protein up:attribution ?attribution .
        ?attribution up:source ?citation .
        ?protein_classified_with_go_property up:attribution ?attribution .
        ?protein_classified_with_go_property rdf:subject ?protein .
        ?protein_classified_with_go_property rdf:object ?goAnnotation .
        FILTER(strstarts(str(?goAnnotation), "http://purl.obolibrary.org/obo/GO"))
    }
    """
)

try:
    ret = sparql.queryAndConvert()

    # append the result rows below the header, opening the file only once
    with open(r'data.csv', 'a', newline='') as f:
        writer = csv.writer(f)
        for r in ret["results"]["bindings"]:
            # fixed column order matching `fields`, independent of the binding's key order
            writer.writerow([r["protein"]["value"], r["goAnnotation"]["value"]])
except Exception as e:
    print(e)
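
Once `data.csv` exists, it can be handy to see all GO annotations of a protein together. The following sketch reads the pairs back and groups them by protein. `load_pairs` and `group_go_by_protein` are hypothetical helper names, not functions used elsewhere in this notebook.

```python
import csv
from collections import defaultdict


def load_pairs(path):
    """Read (protein, GO annotation) pairs from the CSV, skipping the header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # header
        return [tuple(row) for row in reader if row]  # ignore empty rows


def group_go_by_protein(pairs):
    """Map each protein IRI to the sorted list of its distinct GO term IRIs."""
    grouped = defaultdict(set)
    for protein, go_annotation in pairs:
        grouped[protein].add(go_annotation)
    return {protein: sorted(terms) for protein, terms in grouped.items()}
```

With the file written by the cell above, `group_go_by_protein(load_pairs("data.csv"))` would return one entry per protein.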

In [31]:
def query_by_protein_and_go(protein_id, go_id):
  """Print DOI, PubMed ID, title and abstract of the journal citations of a protein
  that is classified with the given GO term. Both arguments are full IRIs."""
  # NOTE: the pattern below matches any attribution on the protein; we still
  # need to make sure the citation is assigned to this specific GO annotation link
  sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>

    SELECT ?doi ?pubMedId ?title ?abstract
    WHERE {
        <%s> up:classifiedWith <%s> ;
            up:attribution / up:source ?citation .

        ?citation a <http://purl.uniprot.org/core/Journal_Citation>.

        ?citation <http://purl.org/dc/terms/identifier> ?doi.

        ?citation <http://www.w3.org/2004/02/skos/core#exactMatch> ?pubMedId.

        ?citation <http://purl.uniprot.org/core/title> ?title.

        ?citation <http://www.w3.org/2000/01/rdf-schema#comment> ?abstract.
    }
    """ % (protein_id, go_id)
  )

  try:
      ret = sparql.queryAndConvert()

      # print every returned variable of every result row
      for r in ret["results"]["bindings"]:
        for var in r:
          print(var + " " + r[var]["value"])
        print()
  except Exception as e:
      print(e)
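
The comment in the function above notes that the citation should be tied to the GO annotation itself. One possible way to express this, sketched below, reuses the reified-statement pattern (`rdf:subject`, `rdf:object`, `up:attribution`) from the earlier cell that selects the protein and GO pairs. The function name `build_citation_query` and the IRI check are assumptions. The query is only built here, not executed, so whether it returns the intended rows on this endpoint remains to be verified.

```python
import re

# Conservative check: an absolute http(s) IRI without whitespace or characters
# that could close the surrounding <...> and inject extra SPARQL.
_IRI = re.compile(r'^https?://[^\s<>"{}|\\^`]+$')


def build_citation_query(protein_iri, go_iri):
    """Return a SPARQL query for citations attributed to one protein-GO statement."""
    for iri in (protein_iri, go_iri):
        if not _IRI.match(iri):
            raise ValueError(f"not a valid IRI: {iri!r}")
    return f"""
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT DISTINCT ?doi ?pubMedId ?title ?abstract
    WHERE {{
        ?statement rdf:subject <{protein_iri}> ;
                   rdf:object <{go_iri}> ;
                   up:attribution / up:source ?citation .
        ?citation a up:Journal_Citation ;
                  dcterms:identifier ?doi ;
                  skos:exactMatch ?pubMedId ;
                  up:title ?title ;
                  rdfs:comment ?abstract .
    }}
    """
```

The returned string could be passed to `sparql.setQuery(...)` in place of the concatenated query. Validating the IRIs first guards against malformed rows in `data.csv` ending up inside the query text.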

In [33]:
# once we have this data we can start working with the individual proteins
# to get back the citations corresponding to their go annotations

with open("data.csv", "r") as f:
    reader = csv.reader(f)
    for i, line in enumerate(reader):
        if i == 0:
          continue
        [protein, go_annotation] = line
        print(protein + " " + go_annotation)
        query_by_protein_and_go(protein, go_annotation)
        break

http://purl.uniprot.org/uniprot/Q60888 http://purl.obolibrary.org/obo/GO_0016020
doi doi:10.1186/gb-2003-4-11-r71
pubMedId http://purl.uniprot.org/pubmed/14611657
title Odorant receptor expressed sequence tags demonstrate olfactory expression of over 400 genes, extensive alternate splicing and unequal expression levels.
abstract <h4>Background</h4>The olfactory receptor gene family is one of the largest in the mammalian genome. Previous computational analyses have identified approximately 1,500 mouse olfactory receptors, but experimental evidence confirming olfactory function is available for very few olfactory receptors. We therefore screened a mouse olfactory epithelium cDNA library to obtain olfactory receptor expressed sequence tags, providing evidence of olfactory function for many additional olfactory receptors, as well as identifying gene structure and putative promoter regions.<h4>Results</h4>We identified more than 1,200 odorant receptor cDNAs representing more than 400 genes.
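
The loop above stops after the first pair and only prints the result. To build the actual dataset, the same loop could write one JSON line per protein, GO annotation and citation. In the sketch below, `export_descriptions` is a hypothetical helper and `fetch` stands for any callable that returns a list of citation dictionaries for a pair, for example a variant of `query_by_protein_and_go` that returns rows instead of printing them. `fake_fetch` and its placeholder values exist only so that the example runs without the endpoint.

```python
import io
import json


def export_descriptions(pairs, fetch, out):
    """Write one JSON line per (protein, GO annotation, citation); return the count."""
    written = 0
    for protein, go_annotation in pairs:
        for citation in fetch(protein, go_annotation):
            record = {"protein": protein, "go_annotation": go_annotation, **citation}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


def fake_fetch(protein, go_annotation):
    """Stand-in for the endpoint call; returns one placeholder citation."""
    return [{"doi": "doi:10.0000/placeholder", "title": "Placeholder title"}]


buffer = io.StringIO()
count = export_descriptions(
    [("http://purl.uniprot.org/uniprot/Q60888",
      "http://purl.obolibrary.org/obo/GO_0016020")],
    fake_fetch,
    buffer,
)
print(count, buffer.getvalue())
```

Passing the fetch function as an argument keeps the export logic testable offline. For a full run, `out` would be a file opened for writing instead of the in-memory buffer.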