# Local Wikidata Agent 2

A LangChain Q&A workflow using a locally hosted LLM that can query Wikidata for up-to-date facts and incorporate them into a generated response.

- Accesses data via the Wikidata SPARQL endpoint to support inference 
- Supports returning implicit facts that can be derived from the dataset but are not stated explicitly on individual records
- Defines a narrow scope of functionality in order to deal with the wide-ranging schema of Wikidata content
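
Age at the time of the games is one such implicit fact: Wikidata stores a date of birth, and the query used in this notebook derives the age from it. A minimal Python sketch of the same calculation, assuming the 2024-08-11 closing date used in the query (the function name `age_at` and the sample dates are illustrative only):

```python
from datetime import date

CLOSING = date(2024, 8, 11)  # closing ceremony date used in the SPARQL query

def age_at(dob: date, on: date = CLOSING) -> int:
    """Whole years between dob and on, less one if the birthday has not yet occurred."""
    had_birthday = (on.month, on.day) >= (dob.month, dob.day)
    return on.year - dob.year - (0 if had_birthday else 1)

print(age_at(date(2007, 9, 20)))  # birthday falls after the closing date -> 16
print(age_at(date(2007, 3, 28)))  # birthday falls before the closing date -> 17
```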

This exercise builds on the `local-wikidata-agent-1.ipynb` exercise, significantly simplifying and streamlining the agent architecture in order to constrain the LLM to relevant outputs that lead to a response to the user query.  

- Abandons LangGraph for a simpler, more straightforward LangChain implementation
- Forces individual evaluation of query parameters (evaluations are run in parallel)
- Refines prompts for each query parameter and provides few-shot examples to prime expected responses 
- Removes the LLM's direct access to the query tool, which runs as an undecorated function
- Provides a final "response" agent with a set of facts returned from the knowledge graph and interleaved in a system prompt along with the initial user query
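
The fan-out described above can be pictured with standard-library tools alone. The sketch below is not the LangChain implementation used in this notebook: the two keyword-based extractors are hypothetical stand-ins for the per-parameter LLM chains, and `evaluate_facets` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in extractors: in the notebook each facet is an LLM chain, not a keyword check.
def extract_country(question: str) -> str:
    return "Japan" if "Japan" in question else "None"

def extract_sort(question: str) -> str:
    if "youngest" in question:
        return "ASC"
    if "oldest" in question:
        return "DESC"
    return "None"

EXTRACTORS = {"country": extract_country, "sort": extract_sort}

def evaluate_facets(question: str) -> dict[str, str]:
    """Run every extractor on the same question concurrently and collect results by name."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, question) for name, fn in EXTRACTORS.items()}
        return {name: future.result() for name, future in futures.items()}

params = evaluate_facets("Who was the youngest medalist from Japan?")
print(params)  # {'country': 'Japan', 'sort': 'ASC'}
```

Each extractor sees only the original question and answers for one parameter, so a weak answer on one facet does not contaminate the others.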

Sample questions (and correct responses) are provided below. Additional questions can be verified with [this SPARQL query](https://query.wikidata.org/#%23title%3A%20Medalists%20of%20the%202024%20Summer%20Olympic%20Games%0A%23description%3A%20Return%20their%20country%2C%20sport%2C%20event%2C%20age%20at%20the%20time%20of%20the%20games%2C%20and%20medal%0APREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0APREFIX%20wikibase%3A%20%3Chttp%3A%2F%2Fwikiba.se%2Fontology%23%3E%0APREFIX%20p%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0APREFIX%20ps%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fstatement%2F%3E%0APREFIX%20pq%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fqualifier%2F%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0A%0ASELECT%20DISTINCT%20%3FsportLabel%20%3FeventLabel%20%3FparticipantLabel%20%3Fage%20%3FcountryLabel%20%3FawardLabel%0A%0AWHERE%20%7B%0A%20%20%3Fsport%20wdt%3AP31%20wd%3AQ26132862%3B%20%20%23%20item%20that%20is%20an%20instance%20of%20olympic%20sports%20discipline%0A%20%20%20%20wdt%3AP361%2B%20wd%3AQ995653%3B%20%20%20%20%20%20%20%23%20and%20part%20of%202024%20games%2C%20or%20part%20of%20an%20event%20that%20was%0A%20%20%20%20wdt%3AP527%20%3Fevent.%20%20%20%20%20%20%20%20%20%20%20%20%23%20and%20hasParts%20that%20are%20individual%20events%0A%20%20%20%20%23%20not%20all%20disciplines%20have%20events%20listed%3B%20this%20list%20will%20be%20missing%20some%20categories%0A%20%20%0A%20%20%23%20Filter%20to%20events%20by%20English%20keyword%0A%20%20%3Fevent%20rdfs%3Alabel%20%3FhumanEventLabel.%0A%20%20FILTER%28LANG%28%3FhumanEventLabel%29%20%3D%20%22en%22%29.%0A%20%20FILTER%28CONTAINS%28%3FhumanEventLabel%2C%20%22%22%29%29.%20%23%20add%20event%20keyword%20here%20to%20test%0A%20%20%0A%20%20%23%20get%20full%20statements%20about%20the%20participatns%2C%20participating%20teams%2C%20and%20winners%20of%20those%20events%0A%20%20%3Fevent%20p
%3AP710%20%7C%20p%3AP1923%20%7C%20p%3AP1346%20%3FparticipantStatement.%0A%0A%20%20%3FparticipantStatement%20ps%3AP1346%20%7C%20ps%3AP710%20%7C%20ps%3AP1923%20%3Fparticipant%3B%20%20%23%20get%20the%20participant%0A%20%20%20%20%20%20pq%3AP166%20%7C%20pq%3AP2868%20%3Faward.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20and%20that%20participant%27s%20award%0A%0A%20%20%23%20Filter%20to%20award%20by%20English%20keyword%0A%20%20%3Faward%20rdfs%3Alabel%20%3FhumanAwardLabel.%0A%20%20FILTER%28LANG%28%3FhumanAwardLabel%29%20%3D%20%22en%22%29.%0A%20%20FILTER%28CONTAINS%28%3FhumanAwardLabel%2C%20%22%22%29%29.%20%23%20add%20award%20keyword%20here%20to%20test%0A%20%20%0A%20%20%0A%20%20%3Fparticipant%20wdt%3AP1532%20%3Fcountry%3B%20%20%20%23%20Get%20the%20country%20the%20participant%20competed%20for%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP569%20%3Fdob.%20%20%20%20%20%20%20%20%23%20and%20their%20date%20of%20birth%0A%20%20%0A%20%20%23%20Filter%20to%20country%20by%20English%20keyword%0A%20%20%3Fcountry%20rdfs%3Alabel%20%3FhumanCountryLabel.%0A%20%20FILTER%28LANG%28%3FhumanCountryLabel%29%20%3D%20%22en%22%29.%0A%20%20FILTER%28CONTAINS%28%3FhumanCountryLabel%2C%20%22%22%29%29.%20%23%20add%20country%20keyword%20here%20to%20test%0A%20%20%0A%0A%20%20%23%20Bind%20closing%20ceremony%20date%20as%20%3Fclosing%0A%20%20BIND%28%222024-08-11%22%5E%5Exsd%3Adate%20AS%20%3Fclosing%29%0A%0A%20%20%23%20Calculate%20approximate%20age%20by%20subtracting%20the%20closing%20date%20from%20the%20participant%27s%20date%20of%20birth%0A%20%20BIND%28%0A%20%20%20%20FLOOR%28%0A%20%20%20%20%20%20%28YEAR%28%3Fclosing%29%20-%20YEAR%28%3Fdob%29%29%20-%0A%20%20%20%20%20%20%20%20IF%20%28MONTH%28%3Fclosing%29%20%3C%20MONTH%28%3Fdob%29%20%7C%7C%20%28MONTH%28%3Fclosing%29%20%3D%20MONTH%28%3Fdob%29%20%26%26%20DAY%28%3Fclosing%29%20%3C%20DAY%28%3Fdob%29%29%2C%201%2C%200%29%0A%20%20%20%20%29%20AS%20%3Fage%29%0A%20%20%20%20%20%20%20%20%20%20%20%2
0%0A%20%20%23%20Use%20SERVICE%20wikibase%3Alabel%20to%20fetch%20labels%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%0A%7D%0AORDER%20BY%20%3Fage%0ALIMIT%2025%0A%0A%0A). 

## Observations
- Adding few-shot examples dramatically increased the accuracy of the extracted query parameters
- 5 of 8 responses matched the accuracy of the run with the GPT-4o Mini model
- 1 response correctly returned "not available" when the question required knowledge not present in the returned facts (the continent on which a given country is located)
- 2 responses were _wrong_, but _consistent_ with the facts accurately returned from the knowledge graph

## Takeaways
- "Deskilling" is a continuum: some tasks could be deskilled even further to NLP or RegEx (for example the "award" parameter), eliminating the need for the LLM for those altogether
- Error checking and validation through the chain (or as a graph) could further improve the quality and reliability of these responses
- The facts returned could be converted to an in-memory vector store to aid in the final response (especially for more complex queries)
- Like users, different models have different strengths. A range of "usability" techniques may be the most viable approach to ensuring the fungibility of models across contexts. 
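
As a rough illustration of the first takeaway, the "award" parameter could plausibly be handled by a regular expression alone. This is a sketch under that assumption, not something the notebook implements; `extract_award` is a hypothetical name, and its return values mirror the options allowed by the award prompt.

```python
import re

# Match the three medal types as whole words, ignoring case.
MEDAL_PATTERN = re.compile(r"\b(gold|silver|bronze)\b", re.IGNORECASE)

def extract_award(question: str) -> str:
    """Return 'gold', 'silver', 'bronze', or 'None' if no medal type is named."""
    match = MEDAL_PATTERN.search(question)
    return match.group(1).lower() if match else "None"

print(extract_award("Who won the gold medal in 100m freestyle swimming?"))  # gold
print(extract_award("Who was the youngest medalist in the 2024 games?"))    # None
```

Unlike the LLM, a pattern like this will not tolerate misspellings, so it trades robustness for determinism and speed.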


In [1]:
# Instructions for downloading Ollama are at https://github.com/ollama/ollama

# %pip install langchain
# %pip install langchain_core
# %pip install sparqlwrapper
# %pip install -qU langchain_ollama

In [63]:
import os

# Enable tracing with LangSmith
# LANGCHAIN_API_KEY environment variable is set in .env
os.environ['LANGCHAIN_TRACING_V2'] = "true"
os.environ['LANGCHAIN_PROJECT'] = "agent-2"

# Set the USER_AGENT environment variable
os.environ['USER_AGENT'] = 'langgraph-agent'

## Wikidata Lookup Function

In [64]:
from SPARQLWrapper import SPARQLWrapper, JSON
from typing import Literal

# The query is passed into the chain as a regular function, not as a tool.
def wikidataOlympicMedalistQuery(
    direction: Literal["ASC", "DESC"] = "ASC",
    limit: int = 10,
    event: str = "",
    award: str = "",
    country: str = "",
) -> list[str]:
    """This tool returns data on medal winners of the 2024 Summer Olympic Games from the Wikidata knowledge graph.
        It returns their country, sport, event, age at the time of the games, and what medal they won.
        By default it returns the 10 youngest medalists in ascending order of age.
        You can return a longer list of athletes by increasing the limit.
        You can search for the oldest medalists first by changing the direction.
    Args:
        direction: The direction to sort the results.
        limit: The number of results to return.
        event: Limit results to events containing this keyword.
        award: Limit results to awards containing this keyword.
        country: Limit results to countries containing this keyword."""

    endpoint_url = "https://query.wikidata.org/sparql"

    query = f"""#title: 10 youngest medalists of the 2024 Summer Olympic Games
    #description: Return their country, sport, event, age at the time of the games, and medal
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX p: <http://www.wikidata.org/prop/>
    PREFIX ps: <http://www.wikidata.org/prop/statement/>
    PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>


    SELECT DISTINCT ?sportLabel ?eventLabel ?participantLabel ?age ?countryLabel ?awardLabel

    WHERE {{
    ?sport wdt:P31 wd:Q26132862;    # item that is an instance of olympic sports discipline
        wdt:P361+ wd:Q995653;       # and part of 2024 games, or part of an event that was
        wdt:P527 ?event.            # and hasParts that are individual events
                                    # not all disciplines have events listed: this list will be missing some categories

    # filter to events by English keyword
    ?event rdfs:label ?humanEventLabel.
    FILTER(LANG(?humanEventLabel) = "en").
    FILTER(CONTAINS(?humanEventLabel, "{event}")).


    # get full statements about the participants, participating teams, and winners of those events
    ?event p:P710 | p:P1923 | p:P1346 ?participantStatement.

    ?participantStatement ps:P1346 | ps:P710 | ps:P1923 ?participant;  # get the participant
        pq:P166 | pq:P2868 ?award.                                     # and that participant's award

    # Filter to award by English keyword
    ?award rdfs:label ?humanAwardLabel.
    FILTER(LANG(?humanAwardLabel) = "en").
    FILTER(CONTAINS(?humanAwardLabel, "{award}")).


    ?participant wdt:P1532 ?country;   # Get the country the participant competed for
                wdt:P569 ?dob.         # and their date of birth

    # Filter to country by English keyword
    ?country rdfs:label ?humanCountryLabel.
    FILTER(LANG(?humanCountryLabel) = "en").
    FILTER(CONTAINS(?humanCountryLabel, "{country}")).                

    # Bind closing ceremony date as ?closing
    BIND("2024-08-11"^^xsd:date AS ?closing)

    # Calculate approximate age by subtracting the participant's birth year from the closing year, less one if the birthday falls after the closing date
    BIND(
        FLOOR(
        (YEAR(?closing) - YEAR(?dob)) -
            IF (MONTH(?closing) < MONTH(?dob) || (MONTH(?closing) = MONTH(?dob) && DAY(?closing) < DAY(?dob)), 1, 0)
        ) AS ?age)
                
    # Use SERVICE wikibase:label to fetch labels
    SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
    }}
    ORDER BY {direction}(?age)
    LIMIT {limit}
    """

    def get_results(endpoint_url, query):
        user_agent = "LlmIntegration/0.0 (https://github.com/andybywire/)"
        sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        return sparql.query().convert()

    results = get_results(endpoint_url, query)

    def create_statements(results):
        statements = []
        if results:
            for result in results["results"]["bindings"]:
                statement = f"{result['participantLabel']['value']}, age {result['age']['value']} from {result['countryLabel']['value']}, won a {result['awardLabel']['value'].replace('Olympic','')} in {result['eventLabel']['value'].replace(' – ',', ')}."
                statements.append(statement)
        return statements

    return create_statements(results)
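
To make the shape of the returned data concrete, here is a small offline sketch of how one row of SPARQL JSON results becomes a fact sentence. The sample binding is a hypothetical placeholder, not real Wikidata output, and `to_statement` is an illustrative helper that follows the logic of `create_statements` (with an added `.strip()` to remove the space left behind when "Olympic" is dropped).

```python
# Hypothetical sample in the SPARQL JSON results shape; the values are placeholders.
sample_results = {
    "results": {
        "bindings": [
            {
                "participantLabel": {"type": "literal", "value": "Example Athlete"},
                "age": {"type": "literal", "value": "17"},
                "countryLabel": {"type": "literal", "value": "Exampleland"},
                "awardLabel": {"type": "literal", "value": "Olympic gold medal"},
                "eventLabel": {
                    "type": "literal",
                    "value": "diving at the 2024 Summer Olympics – women's 10 metre platform",
                },
            }
        ]
    }
}

def to_statement(binding: dict) -> str:
    """Flatten one result row into a sentence, following create_statements."""
    award = binding["awardLabel"]["value"].replace("Olympic", "").strip()
    event = binding["eventLabel"]["value"].replace(" – ", ", ")
    return (
        f"{binding['participantLabel']['value']}, age {binding['age']['value']} "
        f"from {binding['countryLabel']['value']}, won a {award} in {event}."
    )

statements = [to_statement(b) for b in sample_results["results"]["bindings"]]
print(statements[0])
```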

## Prompts

In [65]:
country_prompt = """You are an assistant designed for identifying references to specific countries in questions about the 2024 Summer Olympics. Your job is to determine if the following question asks a question about a specific country and, if so, to identify the country and return the country's name. If the question does not ask about a specific country, return the value "None".

DO NOT return any other information aside from the value of the country parameter you are asked to identify or "None" if no country is mentioned. Do not add any additional keys or additional commentary, simply return the value. DO NOT make up a country name if no name is mentioned in the query.
"""

country_few_shot = [
  {"input": "Who was the youngest medalist from Japan in the 2024 Summer Olympic games?", "output": "Japan"},
  {"input": "Which countries were the youngest weightlifting medalists from?", "output": "None"},
  {"input": "Who won the gold medal in 100m freestyle swimming at the 2024 Summer Olympics?", "output": "None"},
  {"input": "Who was the youngest medalist in the 2024 Summer Olympic games?", "output": "None"},
]


event_prompt = """You are an assistant designed for identifying sporting events in questions about the 2024 Summer Olympics. Your job is to determine if the following question asks a question about a specific Olympic sporting event and, if so, to identify the sporting event and return the sporting event name as a parameter. If the question does not ask about a specific event, return the value "None".

DO NOT return any other information aside from the name of the sporting event you are asked to identify or "None" if no sporting event is mentioned. Do not add any additional keys or additional commentary, simply return the value. Do not return a sporting event name if the sporting event name is not present in the provided query. DO NOT make up a sporting event name if the sporting event name is not present in the query. DO NOT return a partial sporting event name if the sporting event name is not present in the query. DO NOT return a generic sporting event name like "sport" or "athletics".
"""

event_few_shot = [
    {"input": "which countries were the youngest weightlifting medalists from?", "output": "weightlifting"},
    {"input": "Who was the youngest medalist from Japan in the 2024 Summer Olympic games?", "output": "None"},
    {"input": "Who won the gold medal in 100m freestyle swimming at the 2024 Summer Olympics?", "output": "swimming"},
    {"input": "Who was the youngest medalist in the 2024 Summer Olympic games?", "output": "None"},
    {"input": "Who was the youngest fencing medalist in the 2024 Summer Olympic games?", "output": "fencing"},
]

award_prompt = """You are an assistant designed for identifying medal types in questions about the 2024 Summer Olympics. Your job is to determine if the following question asks a question about a specific Olympic award medal and, if so, to identify the medal type and return the medal type as a parameter. The response you return may ONLY be one of these options: "gold", "silver", "bronze", or "None". No other value is possible. If the question does not ask about a specific medal, return the value "None".

For example, 
- if the word "gold" is present in the question, return "gold"
- if the word "silver" is present in the question, return "silver"
- if the word "bronze" is present in the question, return "bronze"
- if none of the words "gold", "silver", or "bronze" are present in the question, return "None"

DO NOT return any other information aside from the name of the award you are asked to identify or "None" if no award is mentioned. Do not add any additional keys or additional commentary, simply return the value. Do not return an award name if the award name is not present in the provided query. If the query mentions only "award" or "medal," but does not ask about the type of medal, for example gold, silver, or bronze, DO NOT infer a medal type. The type of medal MUST be present in the question.

"""

award_few_shot = [
    {"input": "Who was the youngest silver medalist from Japan in the 2024 Summer Olympic games?", "output": "silver"},
    {"input": "Who won the gold medal in 100m freestyle swimming at the 2024 Summer Olympics?", "output": "gold"},
    {"input": "Who was the youngest medalist in the 2024 Summer Olympic games?", "output": "None"},
    {"input": "Who was the oldest bronze medal winner from Germany in the 2024 Summer Olympic games?", "output": "bronze"},
]

sort_prompt = """You are an assistant designed for identifying the sort order of results returned from an API that answers questions about participants in the 2024 Summer Olympics. The API can return ascending results for participants from youngest to oldest, or descending results for participants from oldest to youngest. 

Your job is to determine if the question provided asks about the youngest participants or the oldest participants in the 2024 Summer Olympics. If the question asks about the youngest participants, return the value ASC. If the question asks about the oldest participants, return the value DESC. If the question does not ask about the age of the participants, return the value "None".

DO NOT return any other information besides the sort order. The only responses you may return are ASC, DESC, or "None".
"""

sort_few_shot = [
    {"input": "Who was the youngest silver medalist from Japan in the 2024 Summer Olympic games?", "output": "ASC"},
    {"input": "Who won the gold medal in 100m freestyle swimming at the 2024 Summer Olympics?", "output": "None"},
    {"input": "Who was the oldest medalist in the 2024 Summer Olympic games?", "output": "DESC"},
]


response_template = """You are an assistant designed for generating a response to a question about the 2024 Summer Olympics based on a set of facts that are drawn from an authoritative database of Olympic medalists. Your response should be based on the facts provided and should not contain any information not already present in the facts.

Here are the facts: 

{facts}

If the facts provided do not contain any information about the question asked, return the response "I do not have that information." Do not make speculative guesses or provide information that is not present in the facts. Do not add any additional commentary or information to the response. Only return the information that is present in the facts.
"""

## Chains
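
Each facet chain in this section combines a system prompt, its few-shot examples, and the user's question into one message sequence. The plain-Python sketch below only illustrates that layout and does not use the LangChain API; `build_messages` is a hypothetical helper, the shortened system prompt is a placeholder, and it assumes the trailing `{query}` is treated as a human turn.

```python
def build_messages(system_prompt: str, few_shot: list[dict], query: str) -> list[tuple[str, str]]:
    """Lay out the system prompt, each example as a human/ai pair, then the live query."""
    messages = [("system", system_prompt)]
    for example in few_shot:
        messages.append(("human", example["input"]))
        messages.append(("ai", example["output"]))
    messages.append(("human", query))
    return messages

examples = [
    {"input": "Who was the youngest medalist from Japan in the 2024 Summer Olympic games?", "output": "Japan"},
    {"input": "Who was the youngest medalist in the 2024 Summer Olympic games?", "output": "None"},
]
messages = build_messages(
    "Identify the country or return None.",  # placeholder for the full country prompt
    examples,
    "Who was the oldest medalist from Spain?",
)
for role, text in messages:
    print(f"{role}: {text}")
```

Seen this way, the examples act as a short fabricated conversation that shows the model the exact output format before it sees the real question.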

In [66]:
from langchain_core.prompts import SystemMessagePromptTemplate, ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_core.messages import SystemMessage
from langchain_core.runnables import RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama import ChatOllama
from langchain_core.runnables import RunnablePassthrough

model = ChatOllama(model="llama3.1:8b", temperature=0)

few_shot_template = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
        ("ai", "{output}"),
    ]
)

class FacetChain:
    def __init__(self, prompt, few_shot):
        self.prompt = prompt
        self.few_shot = few_shot

    def create_chain(self):
        return (
            SystemMessage(content=self.prompt)
            + FewShotChatMessagePromptTemplate(
                examples=self.few_shot,
                example_prompt=few_shot_template,
            )
            + "{query}"
            | model
            | StrOutputParser()
        )

country_chain = FacetChain(
    prompt=country_prompt,
    few_shot=country_few_shot,
).create_chain()

event_chain = FacetChain(
    prompt=event_prompt,
    few_shot=event_few_shot,
).create_chain()

award_chain = FacetChain(
    prompt=award_prompt,
    few_shot=award_few_shot,
).create_chain()

sort_chain = FacetChain(
    prompt=sort_prompt,
    few_shot=sort_few_shot,
).create_chain()


response_chain = SystemMessagePromptTemplate.from_template(response_template) + "{query}" | model | StrOutputParser()        


# Start with the facet chains in parallel to process the query and get parameters
# Then pass the query and parameters through to a lambda function to process the query
# Then pass the query and response through to a response agent
composed_chain = (
    RunnableParallel(
        query=RunnablePassthrough(),
        country=country_chain,
        event=event_chain,
        award=award_chain,
        sort=sort_chain,
    )
    | (
        lambda x: {
            "query": x["query"]["query"],
            "facts": "\n".join(
                wikidataOlympicMedalistQuery(
                    **{
                        k: v
                        for k, v in {
                            "event": x["event"],
                            "country": x["country"],
                            "award": x["award"],
                            "direction": x["sort"],
                        }.items()
                        if v != "None"
                    }
                )
            ),
        }
    )
    | response_chain
)
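
The lambda above passes the raw model outputs to the query and filters only on the exact string "None". One way the error checking mentioned in the takeaways might look is a normalization step in front of the query. This is a sketch under assumptions: `clean_params` and `ALLOWED` are hypothetical names, and the closed vocabularies are taken from the prompts in this notebook.

```python
# Closed vocabularies taken from the sort and award prompts.
ALLOWED = {
    "direction": {"ASC", "DESC"},
    "award": {"gold", "silver", "bronze"},
}

def clean_params(raw: dict[str, str]) -> dict[str, str]:
    """Normalize raw facet outputs and keep only values that are safe to pass to the query."""
    cleaned = {}
    for key, value in raw.items():
        value = value.strip().strip("\"'")  # drop stray whitespace and wrapping quotes
        if not value or value.lower() == "none":
            continue  # facet not mentioned in the question
        if key == "direction":
            value = value.upper()
        elif key == "award":
            value = value.lower()
        if key in ALLOWED and value not in ALLOWED[key]:
            continue  # discard anything outside the closed vocabulary
        cleaned[key] = value
    return cleaned

raw = {"event": " fencing\n", "country": '"None"', "award": "Gold", "direction": "youngest"}
print(clean_params(raw))  # {'event': 'fencing', 'award': 'gold'}
```

Dropping an invalid `direction` rather than raising lets the query fall back to its default sort order; a stricter pipeline might prefer to surface the error instead.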


## Tests

In [67]:
test_queries = [
    "Who were the youngest medalists at the 2024 Summer Olympics and what did they win?",
    "What countries were the youngest medalists at the 2024 Summer Olympics from?",
    "Who was the youngest medalist from Japan in the 2024 Summer Olympic games?",
    "What event in the 2024 Summer Olympics did the youngest medalist from Spain win?",
    "What medal in the 2024 Summer Olympics did the youngest medalist from Spain win?",
    "Who was the oldest medalist at the 2024 Olympics from a country in Africa?",
    "Who was the youngest fencing medalist at the 2024 Olympic Games?",
    "Who was the oldest medalist in weightlifting at the 2024 Summer Olympics?",
]

correct_responses = [
    "- Ban Hyo-jin (South Korea) - age 16, gold medal in shooting (women's 10 metre air rifle).\n- Dominika Banevič (Lithuania) - age 17, silver medal in breaking (B-girls).\n- Darja Varfolomeev (Germany) - age 17, gold medal in gymnastics (women's rhythmic individual all-around).\n- Quan Hongchan (People's Republic of China) - age 17, gold medal in diving (women's 10 metre platform).\n- Rikuto Tamai (Japan) - age 17, silver medal in diving (men's 10 metre platform).\n- Huang Yuting (People's Republic of China) - age 17, silver medal in shooting (women's 10 air rifle).\n- Mirra Andreeva (Russia) - age 17, silver medal in tennis (women's doubles).",
    "South Korea, Lithuania, Germany, People's Republic of China, Russia, Japan",
    "Rikuto Tamai",
    "Men's tennis singles, Carlos Alcaraz, 21",
    "silver",
    "Hellen Obiri from Kenya, age 34, who won a bronze medal in athletics (women's marathon)",
    "As of 9.13.24, this should return no results: fencing data is not yet available in Wikidata",
    "Mari Sánchez (Colombia) - age 32, silver medal in weightlifting (women's 71 kg)",
]


# Test a single response by its 1-based number, or "all" for all responses
q = "all"
# q = 4

if q == "all":
    for i, (query, correct) in enumerate(zip(test_queries, correct_responses), start=1):
        print(f"{i}. {query}\n")
        print(f"AI Response: \n{composed_chain.invoke({'query': query})}\n")
        print(f"Correct Response: \n{correct}\n")
else:
    # q is 1-based so that it matches the numbering printed by the "all" run
    print(f"{q}. {test_queries[q - 1]}\n")
    print(f"AI Response: \n{composed_chain.invoke({'query': test_queries[q - 1]})}\n")
    print(f"Correct Response: \n{correct_responses[q - 1]}\n")

1. Who were the youngest medalists at the 2024 Summer Olympics and what did they win?

AI Response: 
Ban Hyo-jin, age 16 from South Korea, won a gold medal in shooting at the 2024 Summer Olympics, women's 10 metre air rifle.
Darja Varfolomeev, age 17 from Germany, won a gold medal in gymnastics at the 2024 Summer Olympics, women's rhythmic individual all-around.

The youngest medalists were Ban Hyo-jin and Darja Varfolomeev.

Correct Response: 
- Ban Hyo-jin (South Korea) - age 16, gold medal in shooting (women's 10 metre air rifle).
- Dominika Banevič (Lithuania) - age 17, silver medal in breaking (B-girls).
- Darja Varfolomeev (Germany) - age 17, gold medal in gymnastics (women's rhythmic individual all-around).
- Quan Hongchan (People's Republic of China) - age 17, gold medal in diving (women's 10 metre platform).
- Rikuto Tamai (Japan) - age 17, silver medal in diving (men's 10 metre platform).
- Huang Yuting (People's Republic of China) - age 17, silver medal in shooting (women's 