# SPARQL generation via implicit schema representation and Generative AI
---
This notebook demonstrates one way to generate SPARQL queries from natural language questions. Rather than describing
the graph schema explicitly, the prompt shows the model example SPARQL queries (paired with their natural language
questions) and relies on the LLM to infer the schema from them.

If you are running this notebook outside of an AWS environment (e.g., on your laptop), you will need
to set up your AWS credentials in `~/.aws/credentials`.
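
As a rough sanity check for a local setup, the sketch below parses credentials text in the usual INI layout and reports whether a profile defines both key fields. The helper name `has_aws_keys` and the sample values are illustrative assumptions, not part of this notebook.

```python
import configparser


def has_aws_keys(credentials_text: str, profile: str = "default") -> bool:
    """Return True if the given profile defines both access-key fields."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


# Placeholder values only; never commit real keys.
sample = """
[default]
aws_access_key_id = EXAMPLEKEYID
aws_secret_access_key = EXAMPLESECRET
"""
print(has_aws_keys(sample))
```

To check the real file, pass the contents of `~/.aws/credentials` (for example via `pathlib.Path.home()`) instead of `sample`.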

If you are running this notebook inside of an AWS environment (e.g., inside SageMaker Studio or as a Neptune notebook), then
use the "conda_pytorch_p310" kernel.

To access the Bedrock API from this notebook, the notebook's execution role must have a policy that allows Bedrock access.
To find the execution role, run the following code in this notebook:
```
import sagemaker
print(sagemaker.get_execution_role())
```
Then go to the IAM console and attach the `AmazonBedrockFullAccess` policy to that role.
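
The policy can also be attached from code instead of the console. The sketch below is one possible way to do it, assuming the caller has permission to modify the role (`iam:AttachRolePolicy`); the helper names and the example ARN are made up for illustration.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Extract the role name (last path segment) from an IAM role ARN."""
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or not parts[5].startswith("role/"):
        raise ValueError(f"not an IAM role ARN: {role_arn}")
    # parts[5] looks like "role/service-role/ExampleRole"
    return parts[5].rsplit("/", 1)[-1]


def attach_bedrock_policy(role_arn: str) -> None:
    """Attach the AWS-managed AmazonBedrockFullAccess policy to the role."""
    import boto3  # imported lazily so the ARN helper works without boto3

    iam = boto3.client("iam")
    iam.attach_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyArn="arn:aws:iam::aws:policy/AmazonBedrockFullAccess",
    )


# The account ID and role name below are made-up placeholders.
print(role_name_from_arn("arn:aws:iam::123456789012:role/service-role/ExampleRole"))
```

In the notebook you would call `attach_bedrock_policy(sagemaker.get_execution_role())`. Note that `AmazonBedrockFullAccess` is broad; a narrower policy may be preferable outside of experimentation.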

In [85]:
# The first time you run this notebook you'll need to uncomment these lines
# to install the required Python dependencies:
# %pip install -q boto3==1.34.*
# %pip install -q botocore==1.34.*
# %pip install -q jupyter==1.0.*
# %pip install -q sagemaker==2.212.*
# %pip install -q jinja2==3.1.*
# %pip install -q ipykernel==6.29.*

In [86]:
# %pip install -qU anthropic

In [87]:
from typing import List, Union, Optional
from typing import Any as JsonType
from pathlib import Path
import importlib
from functools import partial
import yaml
import json

import utilities as u
# To reload utilities.py without restarting the notebook kernel use the following:
# importlib.reload(u)

import boto3
import sagemaker
import jinja2

In [88]:
%env ANTHROPIC_API_KEY=...

env: ANTHROPIC_API_KEY=...


In [89]:
import anthropic

client = anthropic.Anthropic()

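def opus_llm(prompt_json) -> str:
    """
    Send a prompt to Claude 3 Opus through the Anthropic API and return the
    response text. If the first element of prompt_json is a {"system": ...}
    dict it is passed as the system prompt; the rest are the chat messages.
    """
    if "system" in prompt_json[0]:
        kwargs = {"system": prompt_json[0]["system"]}
        prompt_json = prompt_json[1:]
    else:
        kwargs = {}
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=prompt_json,
        **kwargs)
    print(f"response message: {message}")
    # Concatenate the text of all content blocks in the response.
    result = "".join(block.text for block in message.content)
    return result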

## Set up Neptune and Bedrock clients

### Get connection to Neptune database
And define a function to run a SPARQL query on that database.

### Set up Bedrock client
And specify which model to use.

In [90]:
sess = sagemaker.Session()
region = sess.boto_region_name
sm_client = boto3.client("sagemaker", region_name=region)
bedrock_runtime = boto3.client("bedrock-runtime", region_name=region)
# bedrock = boto3.client("bedrock", region_name=region)
jenv = jinja2.Environment(trim_blocks=True, lstrip_blocks=True)

# Pick one model; the alternatives are left commented out.
# model_id = "anthropic.claude-v2:1"
# model_id = "anthropic.claude-3-haiku-20240307-v1:0"
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
temperature = 0.3

resources = Path.cwd() / "resources"

In [91]:
importlib.reload(u)
llm = u.create_bedrock_runner(bedrock_runtime,
                              model_id,
                              temperature)
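
The implementation of `u.create_bedrock_runner` lives in `utilities.py` and is not shown here. As a rough idea of what such a runner might look like, the sketch below builds a callable that accepts the same prompt format as `opus_llm` (an optional leading `{"system": ...}` dict followed by chat messages) and sends it to Bedrock using the Anthropic Messages request body. The name `make_bedrock_runner` and the `max_tokens` default are assumptions.

```python
import json
from typing import Any, Callable, Dict, List


def make_bedrock_runner(bedrock_runtime: Any, model_id: str, temperature: float,
                        max_tokens: int = 1024) -> Callable[[List[Dict[str, str]]], str]:
    """Return a function that sends a prompt to Bedrock and returns the text."""

    def run(prompt_json: List[Dict[str, str]]) -> str:
        messages = list(prompt_json)
        body: Dict[str, Any] = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "temperature": temperature,
        }
        # A leading {"system": ...} element becomes the system prompt.
        if messages and "system" in messages[0]:
            body["system"] = messages[0]["system"]
            messages = messages[1:]
        body["messages"] = messages
        response = bedrock_runtime.invoke_model(modelId=model_id,
                                                body=json.dumps(body))
        payload = json.loads(response["body"].read())
        # Join the text of all text blocks in the response.
        return "".join(block["text"] for block in payload["content"]
                       if block.get("type") == "text")

    return run
```

With the clients defined above, this would be used as `llm = make_bedrock_runner(bedrock_runtime, model_id, temperature)`.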

In [92]:
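# Backslash is escaped first so that the backslashes added by the later
# replacements are not escaped a second time.
# conversions = [("\\", r"\\"),
#                ("\"", r'\"'),
#                ("/",  r"\/"),
#                ("\b", r"\b"),
#                ("\f", r"\f"),
#                ("\n", r"\n"),
#                ("\r", r"\r"),
#                ("\t", r"\t")]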

# def escape_literal_string_for_json(input: Union[str, list, dict]) -> Union[str, list, dict]:
#     if isinstance(input, str):
#         for in_s, out_s in conversions:
#             input = input.replace(in_s, out_s)
#         return input
#     elif isinstance(input, list):
#         return list(map(escape_literal_string_for_json, input))
#     else:
#         return {k: escape_literal_string_for_json(v) for k, v in input.items()}

# print(escape_literal_string_for_json("foo\nbar"))

In [93]:
prompt_template_json = yaml.safe_load((resources / "prompt.yaml").read_text())


def apply_jinja(json_blob: JsonType, **kwargs) -> JsonType:
    if isinstance(json_blob, str):
        return jenv.from_string(json_blob).render(
                        question=kwargs["question"],
                        tips=kwargs["tips"],
                        examples=kwargs["examples"])
    elif isinstance(json_blob, list):
        return [apply_jinja(x, **kwargs) for x in json_blob]
    elif isinstance(json_blob, dict):
        return {k: apply_jinja(v, **kwargs) for k, v in json_blob.items()}
    else:
        return json_blob


def generate_prompt(prompt_template_json: JsonType,
                    tips: List[str],
                    question: str,
                    few_shot_examples: List[dict]) -> JsonType:
    # Render every string in the template with the question, tips, and examples.
    prompt_json = apply_jinja(prompt_template_json,
                              question=question,
                              tips=tips,
                              examples=few_shot_examples)
    return prompt_json

In [94]:
print(prompt_template_json)

[{'system': 'You are an expert-level onotologist who is knowledgeable of SPARQL and the Uniprot schema.'}, {'role': 'user', 'content': "Your task is to convert an English language description of a question into a SPARQL query against the Uniprot\nknowledgebase that answers the question. You must include your answer in a <sparql></sparql> tag pair. You don't\nneed to add the PREFIX lines, I'll add those.\n\n{% if examples is defined and examples|length > 0 %}\nSome examples:\n\n{% for example in examples %}\n<question>\n  {{ example.question }} \n</question>\n\n<sparql>\n{{ example.SPARQL }}\n</sparql>\n\n{% endfor %}\n{% endif %}\n\n{% if tips is defined and tips|length > 0 %}\n\nHere are some additional tips that you'll find helpful:\n\n<tips>\n{% for tip in tips %}\n<tip>{{ tip }}</tip>\n{% endfor %}\n<tips>\n{% endif %}\n\nYou can use the following keywords:\n\n<keywords>\n  <keyword><ARN>keywords:5</ARN><name>Acetoin biosynthesis</name></keyword>\n  <keyword><ARN>keywords:47</ARN><
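
The `{% for example in examples %}` loop in the template above wraps each few-shot example in `<question>` and `<sparql>` tags. A plain-Python equivalent of that loop, purely for illustration (the notebook itself renders this with Jinja; `format_examples` and the sample query are assumptions):

```python
from typing import Dict, List


def format_examples(examples: List[Dict[str, str]]) -> str:
    """Wrap each example's question and query in the tags used by the prompt."""
    blocks = []
    for example in examples:
        blocks.append(
            "<question>\n"
            f"  {example['question']}\n"
            "</question>\n\n"
            "<sparql>\n"
            f"{example['SPARQL']}\n"
            "</sparql>\n"
        )
    return "\n".join(blocks)


# Illustrative example only; the real examples come from ground-truth.yaml.
demo = [{"question": "Select all taxa from the UniProt taxonomy",
         "SPARQL": "SELECT ?taxon WHERE { ?taxon a up:Taxon . }"}]
print(format_examples(demo))
```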

### Define functions to generate SPARQL using LLM via Bedrock and run it against Neptune

In [95]:
def generate_sparql(llm, prompt_template: JsonType,
                    tips: List[str],
                    question: str,
                    few_shot_examples: List[dict]) -> str:
    """
    Given a natural language question, use the LLM to transform that
    into a SPARQL query (using the prompt template) and return the query.
    """
    prompt = generate_prompt(prompt_template, tips, question, few_shot_examples)
    # print(f"prompt:")
    # for x in prompt:
    #     print(f" > {x}")
    response = llm(prompt)
    # print(f"response: <<<{response}>>>")
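    # Keep only the text between the <sparql> and </sparql> tags. Either tag
    # may be missing, in which case that side of the response is left as-is.
    open_tag, close_tag = "<sparql>", "</sparql>"
    start = response.find(open_tag)
    if start != -1:
        response = response[start + len(open_tag):]
    end = response.find(close_tag)
    if end != -1:
        response = response[:end]
    return response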


generate_and_run = partial(u.generate_and_run,
                           sparql_generator=generate_sparql,
                           sagemaker_session=sess)
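
`generate_sparql` is deliberately lenient: it returns something even when one or both tags are missing. If you would rather detect malformed responses, a stricter variant can require a complete tag pair and return `None` otherwise. The sketch below is one such alternative, not the notebook's method; `extract_sparql` is a hypothetical name.

```python
import re
from typing import Optional

# Non-greedy match so that only the first <sparql>...</sparql> pair is taken.
_SPARQL_TAG = re.compile(r"<sparql>(.*?)</sparql>", re.DOTALL | re.IGNORECASE)


def extract_sparql(response: str) -> Optional[str]:
    """Return the body of the first complete tag pair, or None if absent."""
    match = _SPARQL_TAG.search(response)
    return match.group(1).strip() if match else None


print(extract_sparql("Sure.\n<sparql>\nSELECT * WHERE { ?s ?p ?o }\n</sparql>"))
print(extract_sparql("SELECT * WHERE { ?s ?p ?o }\n</sparql>"))
```

The trade-off is that the strict version rejects a response that contains only the closing tag, which can happen if the assistant turn is prefilled with `<sparql>`; in that case the lenient logic is the better fit.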

In [130]:
ground_truth = yaml.safe_load((resources / "ground-truth.yaml").read_text())
tips = yaml.safe_load((resources / "tips.yaml").read_text())


def evaluate_model(llm, prompt_template: JsonType, tips: List[str],
                   override_range: Optional[range] = None):
    """
    Leave-one-out evaluation: hold out each ground-truth example in turn, use
    the remaining examples for few-shot prompting, and count exact matches
    (after whitespace normalization) against the held-out query.
    """
    indices = override_range or range(len(ground_truth))
    num_exact_matches = 0
    for idx in indices:
        hold_out = ground_truth[idx]
        training_examples = ground_truth[:idx] + ground_truth[idx+1:]
        assert hold_out["SPARQL"] not in {x["SPARQL"] for x in training_examples}
        print(f"#{idx:,}: {hold_out['question']}")
        sparql = generate_sparql(llm, prompt_template, tips,
                                 hold_out["question"], training_examples)
        print()
        print(f">Generated:\n{sparql.strip()}\n")
        print(f">Ground truth:\n{hold_out['SPARQL'].strip()}")
        print()
        if u.normalize_ws(sparql) == u.normalize_ws(hold_out['SPARQL']):
            num_exact_matches += 1
    # Report accuracy over the examples actually evaluated, which may be a
    # subset of the ground truth when override_range is given.
    num_evaluated = len(indices)
    print(f"Based on {num_evaluated:,} examples, {(num_exact_matches/num_evaluated)*100.0}% correct")
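
The evaluation relies on two ideas: a leave-one-out split of the ground truth and a whitespace-insensitive comparison. `u.normalize_ws` is defined in `utilities.py` and not shown, so the version below is only a guess at its behavior (collapse every run of whitespace to a single space); `leave_one_out` is likewise a hypothetical helper.

```python
from typing import Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def normalize_ws(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return " ".join(text.split())


def leave_one_out(items: List[T]) -> Iterator[Tuple[T, List[T]]]:
    """Yield (held_out, remaining) pairs, holding out each item in turn."""
    for idx, held_out in enumerate(items):
        yield held_out, items[:idx] + items[idx + 1:]


a = "SELECT ?x\nWHERE {  ?x a up:Protein }"
b = "SELECT ?x WHERE { ?x a up:Protein }"
print(normalize_ws(a) == normalize_ws(b))

for held_out, rest in leave_one_out(["q1", "q2", "q3"]):
    print(held_out, rest)
```

Exact match is a strict criterion: two queries that differ only in variable names or triple order are counted as wrong even when they return the same results, so the reported percentage is best read as a lower bound.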

In [132]:
evaluate_model(opus_llm, prompt_template_json, tips, override_range=range(35, 36))

#35: Use UniProt clusters to find the similar proteins for UniProtKB entry P05067 and then sort them by UniRef cluster identity
response message: Message(id='msg_01Fo7hRDkqKG7QRDVAt2CN4b', content=[TextBlock(text='\nSELECT \n  ?protein  \n  ?prot ?identity \nWHERE {\n  ?cluster a up:UniRef100Cluster ;\n      up:member uniprotkb:P05067 ;\n      up:member ?protein . \n\n  ?protein a up:Protein ;\n      up:sequenceFor ?prot .\n\n  ?cluster up:sequence ?clusterSequence .\n\n  SERVICE <https://sparql.uniprot.org/sparql>\n  {\n     SELECT \n       (round(100.0*(xsd:float(?length)/str(?n))) as ?identity)\n     {\n       SELECT (count(*) AS ?length) {\n         ?clusterSequence rdf:value ?seq .\n         ?prot rdf:value ?seq .\n       }\n       SELECT (strlen(str(?clusterSequence)) AS ?n) {\n         ?clusterSequence rdf:value ?seq .\n       }\n     }\n  }\n} ORDER BY DESC(?identity)\n</sparql>\n\nThe key steps are:\n\n1. Find the UniRef100 cluster that P05067 belongs to using up:member. This 

## Try some queries

Here are some example queries that you can try. These are the same queries that are used for the few-shot prompting, so they
are not a good guide to the model's ability to generalize beyond those examples. However, you can use them as a starting
point for exploring the model's ability to generate de novo SPARQL queries.

- What GO terms are associated with human proteins?
- What GO terms are associated with human proteins? Show me their names also.
- How many citations are there for papers by A. Bairoch?
- Show me all citations by A. Bairoch
- Show me all proteins that are located in the mitochondrian
- I'd like to see the entries for all proteins encoded by the gene FNDC3A
- Select all taxa from the UniProt taxonomy
- Select all taxa from the UniProt taxonomy; show me at most 7
- Show me at most 5 taxa from the UniProt taxonomy
- Select all bacterial taxa and their scientific names from the UniProt taxonomy
- Show me up to 10 human taxa and their scientific names from the UniProt taxonomy
- Select up to 10 bacterial taxa and their scientific names from the UniProt taxonomy
- Tell me all the different categories of databases
- Tell me all the different databases you know about
- Select all UniProt entries, and their organism and amino acid sequences (including isoforms), for _E. coli K12_ and all its strains
- Select the UniProt entry with the mnemonic 'A4_HUMAN'
- Select a mapping of UniProt to PDB entries using the UniProt cross-references to the PDB database
- Select all cross-references to external databases of the category '3D structure databases' of UniProt entries that are classified with the keyword 'Acetoin biosynthesis (KW-0005)'
- Select reviewed UniProt entries (Swiss-Prot), and their recommended protein name, that have a preferred gene name that contains the text 'DNA'
- Select reviewed UniProt entries (Swiss-Prot), and their recommended protein name, that have a preferred gene name that contains the word DNA. Show me the gene name too
- Show me the preferred gene name and disease annotation of all human UniProt entries that are known to be involved in a disease
- Select all human UniProt entries with a sequence variant that leads to a 'loss of function'
- Select all distinct human UniProt entries with a sequence variant that leads to a 'loss of function', show me the text of the annotation also
- Show me all human UniProt entries with a sequence variant that leads to a tyrosine to phenylalanine substitution
- Show me all human UniProt entries with a sequence variant that leads to a Tyr to phenylalanine substitution
- Select all UniProt entries with annotated transmembrane regions and the regions' begin and end coordinates on the canonical sequence
- Select all UniProt entries that were integrated on the 30th of November 2010
- Select all UniProt entries that were integrated on or before the 30th of November 2010
- Select all UniProt entries that were integrated on the month of November 2010
- Show me all UniProt entries that were added to the database on the month of November 2010
- Select the average number of cross-references to the PDB database of UniProt entries that have at least one cross-reference to the PDB database
- Select the number of UniProt entries for each of the EC (Enzyme Commission) second level categories
- Find all Natural Variant Annotations if associated via an evidence tag to an article with a pubmed identifier.
- Find where disease related proteins are known to be located in the cell
- How many reviewed entries (Swiss-Prot) are related to kinase activity?
- list all the Homo Sapiens proteins classified with "cholesterol biosynthetic process"
- list all the Homo Sapiens proteins classified with "cholesterol biosynthetic process". Include their names.
- list all the Homo Sapiens proteins classified with "cholesterol biosynthetic process". Don't include their names.
- find all the Homo Sapiens related proteins that have a Gene Ontology (GO) code
- look within the taxonomy tree, to see if there are any subclass records under Homo Sapiens. Return the scientific name also

In [65]:
generate_and_run("Show me all proteins that are located in the mitochondrian")

SPARQL query:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX vg: <http://biohackathon.org/resource/vg#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX uberon: <http://purl.obolibrary.org/obo/uo#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX sp: <http://spinrdf.org/sp#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX schema: <http://schema.org/>
PREFIX sachem: <http://bioinfo.uochb.cas.cz/rdf/v1.0/sachem#>
PREFIX rh: <http://rdf.rhea-db.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <htt

[]
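
The printed query starts with a long block of `PREFIX` declarations because the prompt tells the model to omit them and they are added afterwards. How `u.generate_and_run` does this is not shown; a minimal sketch of the idea, using three of the prefixes from the output above and a hypothetical `add_prefixes` helper:

```python
from typing import Dict

PREFIXES: Dict[str, str] = {
    "up": "http://purl.uniprot.org/core/",
    "taxon": "http://purl.uniprot.org/taxonomy/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def add_prefixes(query: str, prefixes: Dict[str, str]) -> str:
    """Prepend one PREFIX declaration per entry to the query body."""
    header = "\n".join(f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items())
    return f"{header}\n\n{query.strip()}\n"


# Illustrative query body, as the model might return it without prefixes.
print(add_prefixes("SELECT ?taxon WHERE { ?taxon a up:Taxon . } LIMIT 5", PREFIXES))
```

Declaring unused prefixes is harmless in SPARQL, which is why a single fixed header can be reused for every generated query.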

In [66]:
generate_and_run("Select the average number of cross-references to the PDB "
                 "database of UniProt entries that have at least one cross-reference "
                 "to the PDB database")

SPARQL query:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX vg: <http://biohackathon.org/resource/vg#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX uberon: <http://purl.obolibrary.org/obo/uo#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX sp: <http://spinrdf.org/sp#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX schema: <http://schema.org/>
PREFIX sachem: <http://bioinfo.uochb.cas.cz/rdf/v1.0/sachem#>
PREFIX rh: <http://rdf.rhea-db.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <htt

[{'avgLinksToPdbPerEntry': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
   'type': 'literal',
   'value': '7.25071137429823886795'}}]
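
The result above is a list of bindings in the shape used by the SPARQL JSON results format, where each variable maps to a dict with `type`, `value`, and optionally `datatype`. For downstream use it is often handy to convert these into plain Python values. The helper below is a sketch under that assumption; `simplify_bindings` and the set of handled datatypes are illustrative choices.

```python
from decimal import Decimal
from typing import Any, Callable, Dict, List

XSD = "http://www.w3.org/2001/XMLSchema#"

# Map a few common XSD datatypes to Python constructors; anything else stays a string.
_CASTS: Dict[str, Callable[[str], Any]] = {
    XSD + "decimal": Decimal,
    XSD + "integer": int,
    XSD + "int": int,
    XSD + "double": float,
    XSD + "float": float,
}


def simplify_bindings(bindings: List[Dict[str, Dict[str, str]]]) -> List[Dict[str, Any]]:
    """Turn SPARQL JSON bindings into a list of {variable: python_value} rows."""
    rows = []
    for binding in bindings:
        row = {}
        for name, term in binding.items():
            cast = _CASTS.get(term.get("datatype", ""), str)
            row[name] = cast(term["value"])
        rows.append(row)
    return rows


results = [{"avgLinksToPdbPerEntry": {"datatype": XSD + "decimal",
                                      "type": "literal",
                                      "value": "7.25071137429823886795"}}]
print(simplify_bindings(results))
```

`Decimal` is used for `xsd:decimal` so that the full precision of the returned value is kept rather than rounded to a float.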