This notebook extracts entities and their associated triples and descriptions from UniProt data loaded into a dedicated GraphDB instance. The data comes from the [FTP release](https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/) file uniprotkb_reviewed_eukaryota_opisthokonta_metazoa_33208_0.rdf.xz, together with the [citation info](https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/citations.rdf.xz) and the [GO OWL](https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/go.owl.xz) file. The RDF data can be queried at https://biosoda.unil.ch/graphdb/sparql (repository *uniprot_swiss_ai*).

Note: all the queries can also be executed against the full dataset at https://sparql.uniprot.org/

The overall idea is explained in the presentation available in the Swiss AI RAG Gdrive folder [here](https://docs.google.com/presentation/d/1c89Sa7WM-vcCVZ-F2LfP0jZe0l3bVImT/edit?usp=drive_link&ouid=106167242104510220555&rtpof=true&sd=true).

We start by identifying proteins that have GO annotations which can be attributed to a specific journal citation. This lets us link a protein to a specific DOI, abstract and title, which then act as our "entity description", similar to the Wikidata dataset.

**Note**: one protein can be associated with multiple GO annotations. Moreover, the link between one protein and one specific GO annotation can be supported by multiple journal citations (i.e. multiple papers assert that a given protein is associated with a given GO term). Therefore, if we treat the titles and abstracts of the corresponding papers as the "entity description" of a protein, these will need to be merged into a single description.
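The merging step mentioned in the note is not specified further in this notebook. A minimal sketch of one possible approach follows, assuming the citations for a protein are available as `(title, abstract)` pairs; the function name `merge_descriptions` and the plain concatenation strategy are illustrative assumptions, not the method used elsewhere in the project.

```python
def merge_descriptions(citations):
    """Merge (title, abstract) pairs into one description string.

    Hypothetical helper: identical papers are kept once, in first-seen
    order, and the remaining texts are joined with blank lines.
    """
    seen = set()
    parts = []
    for title, abstract in citations:
        key = (title.strip(), abstract.strip())
        if key in seen:
            continue  # same paper cited for several GO terms
        seen.add(key)
        parts.append(f"{key[0]} {key[1]}".strip())
    return "\n\n".join(parts)
```

Simple concatenation keeps every paper's text but can grow long for well-studied proteins, so truncation or summarisation may be needed depending on the downstream model.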

**Goal**: predict a ranked list of K GO annotations connected to a given protein.
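The notebook does not say how a ranked list of K predictions would be scored. As an illustration only, recall at K is one common choice; the function below is a hypothetical sketch and the metric itself is an assumption, not something fixed by this project.

```python
def recall_at_k(ranked_go_ids, true_go_ids, k):
    """Fraction of the true GO annotations that appear in the top-k predictions."""
    truth = set(true_go_ids)
    if not truth:
        return 0.0  # no annotations to recover for this protein
    top_k = set(ranked_go_ids[:k])
    return len(top_k & truth) / len(truth)
```

For example, with predictions `["GO_1", "GO_2", "GO_3"]`, true annotations `{"GO_2", "GO_9"}` and `k=2`, one of the two true terms is recovered.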



In [1]:
!pip install sparqlwrapper



In [2]:
from SPARQLWrapper import SPARQLWrapper, JSON

# the endpoint where the data can be queried
sparql = SPARQLWrapper(
    "https://biosoda.unil.ch/graphdb/repositories/uniprot_swiss_ai"
)
sparql.setReturnFormat(JSON)

In [None]:
# simple example of a query: count all proteins in the endpoint
sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>

    SELECT (count (distinct ?protein) as ?count_num_proteins)
    WHERE {
        ?protein a up:Protein .
    }
    """
)

try:
    ret = sparql.queryAndConvert()

    for r in ret["results"]["bindings"]:
        print(r["count_num_proteins"]["value"])
except Exception as e:
    print(e)

109788


In [None]:
# count all proteins with some journal citation associated to them in the endpoint
sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>

    SELECT (count (distinct ?protein) as ?count_num_proteins_with_citations)
    WHERE {
        ?protein a up:Protein .
        ?protein up:citation ?citation.
        ?citation a up:Journal_Citation.
    }
    """
)

try:
    ret = sparql.queryAndConvert()

    for r in ret["results"]["bindings"]:
        print(r["count_num_proteins_with_citations"]["value"])
except Exception as e:
    print(e)

94165


In [None]:
# example query to get the DOI, title and abstract of a Journal Citation corresponding to a protein
# need to make sure this is assigned to a GO annotation link
sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>

    SELECT *
    WHERE {
        ?protein a up:Protein .

        ?protein up:attribution / up:source ?citation.

        ?citation a <http://purl.uniprot.org/core/Journal_Citation>.

        ?citation <http://purl.org/dc/terms/identifier> ?doi.

        ?citation <http://www.w3.org/2004/02/skos/core#exactMatch> ?pubMedId.

        ?citation <http://purl.uniprot.org/core/title> ?title.

        ?citation <http://www.w3.org/2000/01/rdf-schema#comment> ?abstract.
    } limit 10
    """
)

try:
    ret = sparql.queryAndConvert()

    for r in ret["results"]["bindings"]:
      for var in r:
        print(var + " " + r[var]["value"])
      print()
except Exception as e:
    print(e)

protein http://purl.uniprot.org/uniprot/Q60888
citation http://purl.uniprot.org/citations/14611657
doi doi:10.1186/gb-2003-4-11-r71
pubMedId http://purl.uniprot.org/pubmed/14611657
title Odorant receptor expressed sequence tags demonstrate olfactory expression of over 400 genes, extensive alternate splicing and unequal expression levels.
abstract <h4>Background</h4>The olfactory receptor gene family is one of the largest in the mammalian genome. Previous computational analyses have identified approximately 1,500 mouse olfactory receptors, but experimental evidence confirming olfactory function is available for very few olfactory receptors. We therefore screened a mouse olfactory epithelium cDNA library to obtain olfactory receptor expressed sequence tags, providing evidence of olfactory function for many additional olfactory receptors, as well as identifying gene structure and putative promoter regions.<h4>Results</h4>We identified more than 1,200 odorant receptor cDNAs representing mo
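The abstract printed above contains HTML heading tags such as `<h4>`. If the abstracts are to be used as plain-text descriptions, they could be cleaned first. The sketch below is one way to do it with the standard library; `clean_abstract` is a hypothetical helper, and the regular expression only targets simple tags of the kind seen in this output.

```python
import html
import re

# matches opening or closing tags that start with a letter, e.g. <h4> or </h4>;
# a bare "<" followed by a space (as in "p < 0.05") is left alone
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")

def clean_abstract(text):
    """Strip simple HTML tags, decode entities and collapse whitespace."""
    # replace tags with a space so a heading does not fuse with the next sentence
    no_tags = TAG_RE.sub(" ", text)
    return " ".join(html.unescape(no_tags).split())
```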

In [None]:
import csv

fields = ['protein_id', 'go_annotation_id']

# select all distinct (protein, GO annotation) pairs whose attribution cites a journal article
sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT DISTINCT ?protein ?goAnnotation
    WHERE {
        ?protein a up:Protein .
        ?protein up:attribution ?attribution .
        ?attribution up:source ?citation .
        ?citation a up:Journal_Citation .
        ?protein_classified_with_go_property up:attribution ?attribution .
        ?protein_classified_with_go_property rdf:subject ?protein .
        ?protein_classified_with_go_property rdf:object ?goAnnotation .
        FILTER(strstarts(str(?goAnnotation), "http://purl.obolibrary.org/obo/GO"))
    }
    """
)

try:
    ret = sparql.queryAndConvert()

    # open the file once and write the columns in a fixed order,
    # rather than relying on the key order of each JSON binding
    with open('data.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(fields)
        for r in ret["results"]["bindings"]:
            writer.writerow([r["protein"]["value"], r["goAnnotation"]["value"]])
except Exception as e:
    print(e)

In [None]:
def query_by_protein_and_go(protein_id, go_id):
  # example query to get the DOI, title and abstract of a Journal Citation corresponding to a protein
  # need to make sure this is assigned to a GO annotation link
  results = {}

  sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT ?doi ?pubMedId ?title ?abstract
    WHERE {
        """
        + "<"+ protein_id + ">" +
        """ a up:Protein ; up:attribution ?attribution.
		    ?attribution up:source ?citation.
		    ?protein_classified_with_go_property up:attribution ?attribution.
        ?protein_classified_with_go_property rdf:subject """ + "<"+ protein_id + ">. " +
		    " ?protein_classified_with_go_property rdf:object " + "<" + go_id + "> ." + """

        ?citation a <http://purl.uniprot.org/core/Journal_Citation>.

        ?citation <http://purl.org/dc/terms/identifier> ?doi.

        ?citation <http://www.w3.org/2004/02/skos/core#exactMatch> ?pubMedId.

        ?citation <http://purl.uniprot.org/core/title> ?title.

        ?citation <http://www.w3.org/2000/01/rdf-schema#comment> ?abstract.
    }
    """
  )

  try:
      ret = sparql.queryAndConvert()

      # note: if several citations match this protein - GO pair, each binding
      # overwrites the previous one, so only the last citation is returned
      for r in ret["results"]["bindings"]:
        for var in r:
          results[var] = r[var]["value"]
  except Exception as e:
      print(e)

  return results
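The function above sends one request per protein - GO pair, which adds up over several hundred thousand pairs. A possible alternative, sketched here under assumptions, is to fetch several pairs per request with a SPARQL `VALUES` clause. The helper `build_citation_query` and its IRI check are hypothetical; the graph patterns and predicate IRIs are copied from the query above, and the resulting string would still need to be passed to `sparql.setQuery`.

```python
import re

# conservative check: plain http(s) IRIs without spaces, angle brackets or quotes
IRI_RE = re.compile(r"^https?://[^\s<>\"]+$")

def build_citation_query(pairs):
    """Build one SPARQL query covering several (protein IRI, GO IRI) pairs."""
    rows = []
    for protein_id, go_id in pairs:
        for iri in (protein_id, go_id):
            if not IRI_RE.match(iri):
                raise ValueError(f"not a plain http(s) IRI: {iri!r}")
        rows.append(f"(<{protein_id}> <{go_id}>)")
    values = "\n            ".join(rows)
    return f"""
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT ?protein ?goAnnotation ?doi ?pubMedId ?title ?abstract
    WHERE {{
        VALUES (?protein ?goAnnotation) {{
            {values}
        }}
        ?protein a up:Protein ; up:attribution ?attribution .
        ?attribution up:source ?citation .
        ?stmt up:attribution ?attribution ;
              rdf:subject ?protein ;
              rdf:object ?goAnnotation .
        ?citation a up:Journal_Citation ;
                  <http://purl.org/dc/terms/identifier> ?doi ;
                  <http://www.w3.org/2004/02/skos/core#exactMatch> ?pubMedId ;
                  up:title ?title ;
                  <http://www.w3.org/2000/01/rdf-schema#comment> ?abstract .
    }}
    """
```

Because `?protein` and `?goAnnotation` are returned with every row, results for different pairs can be told apart, and multiple citations for the same pair appear as separate rows instead of overwriting each other.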

In [67]:
def query_for_go_info(go_id):
  # get all text triples related to a GO annotation, e.g. http://purl.obolibrary.org/obo/GO_0016020
  results = {}

  query = """
    PREFIX up: <http://purl.uniprot.org/core/>

    SELECT distinct ?label ?exact_synonym ?related_synonym ?definition ?parent
    WHERE {
        """\
        +  "<" + go_id + "> <http://www.w3.org/2000/01/rdf-schema#label> ?label. \n"""\
        +  "OPTIONAL { <" + go_id + "> <http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym> ?related_synonym.} \n"""\
        +  "OPTIONAL { <" + go_id + "> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> ?exact_synonym.} \n"""\
        +  "OPTIONAL { <" + go_id + "> <http://purl.obolibrary.org/obo/IAO_0000115> ?definition.} \n"""\
        +  "OPTIONAL { <" + go_id + "> <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?parent.} \n"""\
    + "}"

  sparql.setQuery(query)

  try:
      ret = sparql.queryAndConvert()
      for r in ret["results"]["bindings"]:
        for var in r:
          if(results.get(var) is None):
            results[var] = set()
        for var in r:
          #print(var + " " + r[var]["value"])
          results[var].add(r[var]["value"])
        #print()
  except Exception as e:
      print(e)

  return results

In [71]:
def query_for_protein_info(protein_id):
  # get all text triples related to a protein, e.g. http://purl.uniprot.org/uniprot/Q60888
  results = {}

  query = """
    PREFIX up: <http://purl.uniprot.org/core/>

    SELECT distinct ?mnemonic ?oldMnemonic ?recommendedName ?alternativeName ?submittedName
    WHERE {
        """\
        +  "<" + protein_id + "> a <http://purl.uniprot.org/core/Protein>. \n"""\
        +  "OPTIONAL { <" + protein_id + "> <http://purl.uniprot.org/core/mnemonic> ?mnemonic.} \n"""\
        +  "OPTIONAL { <" + protein_id + "> <http://purl.uniprot.org/core/oldMnemonic> ?oldMnemonic.} \n"""\
        +  "OPTIONAL { <" + protein_id + "> <http://purl.uniprot.org/core/recommendedName>/<http://purl.uniprot.org/core/fullName> ?recommendedName.} \n"""\
        +  "OPTIONAL { <" + protein_id + "> <http://purl.uniprot.org/core/alternativeName>/<http://purl.uniprot.org/core/fullName> ?alternativeName.} \n"""\
        +  "OPTIONAL { <" + protein_id + "> <http://purl.uniprot.org/core/submittedName>/<http://purl.uniprot.org/core/fullName> ?submittedName.} \n"""\
    + "}"

  sparql.setQuery(query)

  try:
      ret = sparql.queryAndConvert()
      for r in ret["results"]["bindings"]:
        for var in r:
          if(results.get(var) is None):
            results[var] = set()
        for var in r:
          #print(var + " " + r[var]["value"])
          results[var].add(r[var]["value"])
        #print()
  except Exception as e:
      print(e)

  return results

In [None]:
# create entities file with corresponding columns
import csv
columns = ["protein_id", "doi", "pubMedId", "title", "abstract", "go_id"]
with open(r'proteins_go_journal_descs.csv', 'w') as f:
  writer = csv.writer(f, delimiter='\t')
  writer.writerow(columns)

with open(r'errors.log', 'w') as f:
  f.write("errors logs")

In [35]:
# once we have this data we can start working with the individual proteins
# to get back the citations corresponding to their go annotations

# note: a GO annotation can be very generic, e.g. http://purl.obolibrary.org/obo/GO_0016020
# ... or very specific, e.g. http://purl.obolibrary.org/obo/GO_0004984
# we aim to focus on the very specific ones, because those are the more "interesting"
# we could use the GO hierarchy for this (or just attempt to predict everything)

with open("data.csv", "r") as f:
    reader = csv.reader(f)
    for i, line in enumerate(reader):
        if i == 0:
          continue
        [protein, go_annotation] = line
        #print(protein + " " + go_annotation)
        results = query_by_protein_and_go(protein, go_annotation)
        if i % 1000 == 0:
          percent = i / 468762 * 100
          print("Now at line " + str(i) + " out of 468762 (" + str(percent) + "%)")
        try:
          row_to_insert = [protein, results["doi"], results["pubMedId"], results["title"], results["abstract"], go_annotation]
        except Exception as e:
          with open(r'errors.log', 'a') as log:
            log.write("Protein " + protein + " go " + go_annotation + " errors " + str(e) + "\n")
          # skip this pair; otherwise the row from the previous iteration
          # (or an undefined variable) would be written below
          continue

        with open(r'proteins_go_journal_descs.csv', 'a') as out:
          writer = csv.writer(out, delimiter='\t')
          writer.writerow(row_to_insert)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from google.colab import files
files.download("/content/proteins_go_journal_descs.csv")

In [None]:
# get all the labels of proteins



In [70]:
# get all the labels of GO annotations
import csv
with open(r'errors.log', 'w') as f:
  f.write("errors logs")

with open(r'go_triples.csv', 'w') as f:
  f.write("subject\tpredicate\tobject\n")

with open("data.csv", "r") as f:
    reader = csv.reader(f)
    for i, line in enumerate(reader):
        if i == 0:
          continue
        [protein, go_annotation] = line
        results = query_for_go_info(go_annotation)

        if i % 1000 == 0:
          percent = i / 468762 * 100
          print("Now at line " + str(i) + " out of 468762 (" + str(percent) + "%)")

        for var in results.keys():
            if(var == "label"):
              for label in results[var]:
                row_to_insert = [go_annotation, "http://www.w3.org/2000/01/rdf-schema#label", label]
                with open(r'go_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)
            elif(var == "exact_synonym"):
              for exact_synonym in results[var]:
                row_to_insert = [go_annotation, "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", exact_synonym]
                with open(r'go_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)
            elif(var == "related_synonym"):
              for related_synonym in results[var]:
                row_to_insert = [go_annotation, "http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym", related_synonym]
                with open(r'go_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)
            elif(var == "parent"):
              for parent in results[var]:
                row_to_insert = [go_annotation, "http://www.w3.org/2000/01/rdf-schema#subClassOf", parent]
                with open(r'go_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)
            elif(var == "definition"):
              for definition in results[var]:
                row_to_insert = [go_annotation, "http://purl.obolibrary.org/obo/IAO_0000115", definition]
                with open(r'go_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)

KeyboardInterrupt: 
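The `subClassOf` rows written to `go_triples.csv` could serve the earlier idea of using the GO hierarchy to separate generic from specific terms. The sketch below is an assumption-laden illustration: `load_parents` and `depth` are hypothetical helpers, and the file only holds the direct parents of annotated terms, so the computed depth is a lower bound unless the parents' own triples are also present.

```python
import csv
from collections import defaultdict

SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

def load_parents(path):
    """Read subject/predicate/object rows and keep only the subClassOf edges."""
    parents = defaultdict(set)
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) == 3 and row[1] == SUBCLASS_OF:
                parents[row[0]].add(row[2])
    return parents

def depth(term, parents, stack=frozenset()):
    """Length of the longest known subClassOf chain above a term."""
    if term in stack:
        return 0  # guard against cycles in the input
    direct = parents.get(term, ())
    if not direct:
        return 0
    return 1 + max(depth(p, parents, stack | {term}) for p in direct)
```

Terms with a larger depth sit further from the root and are more likely to be the specific annotations of interest.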

In [None]:
# get all the labels of proteins
import csv
with open(r'errors.log', 'w') as f:
  f.write("errors logs")

with open(r'protein_triples.csv', 'w') as f:
  f.write("subject\tpredicate\tobject\n")

already_processed=set()
with open("data.csv", "r") as f:
    reader = csv.reader(f)
    for i, line in enumerate(reader):
        if i == 0:
          continue
        [protein, go_annotation] = line

        if(protein in already_processed):
          continue

        already_processed.add(protein)

        results = query_for_protein_info(protein)

        if i % 1000 == 0:
          percent = i / 468762 * 100
          print("Now at line " + str(i) + " out of 468762 (" + str(percent) + "%)")

        for var in results.keys():
            if(var == "mnemonic"):
              for mnemonic in results[var]:
                row_to_insert = [protein, "http://purl.uniprot.org/core/mnemonic", mnemonic]
                with open(r'protein_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)
            elif(var == "oldMnemonic"):
              for oldMnemonic in results[var]:
                row_to_insert = [protein, "http://purl.uniprot.org/core/oldMnemonic", oldMnemonic]
                with open(r'protein_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)
            elif(var == "recommendedName"):
              for recommendedName in results[var]:
                row_to_insert = [protein, "http://purl.uniprot.org/core/recommendedName", recommendedName]
                with open(r'protein_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)
            elif(var == "alternativeName"):
              for alternativeName in results[var]:
                row_to_insert = [protein, "http://purl.uniprot.org/core/alternativeName", alternativeName]
                with open(r'protein_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)
            elif(var == "submittedName"):
              for submittedName in results[var]:
                row_to_insert = [protein, "http://purl.uniprot.org/core/submittedName", submittedName]
                with open(r'protein_triples.csv', 'a') as f:
                  writer = csv.writer(f, delimiter='\t')
                  writer.writerow(row_to_insert)