# Graph-RAG 

This notebook contains the code for a Graph-RAG system built around *The Legend of Zelda*.

The Graph-RAG system stores knowledge about the *The Legend of Zelda* universe in a graph database and makes it available for querying through a **R**etrieval-**A**ugmented **G**eneration (RAG) pipeline.

### Prerequisites

📝 To run this code you will need:
 - Docker installed on your machine,
 - A Python 3.10+ interpreter,
 - An OpenAI API key,
 - An Anthropic API key.

You must have a `.env` file with the following environment variables:

```bash
OPENAI_API_KEY=[PUT YOUR OPENAI API KEY HERE]
OPENAI_MODEL=gpt-4-turbo

ANTHROPIC_API_KEY=[PUT YOUR ANTHROPIC API KEY HERE]
ANTHROPIC_MODEL=claude-3-5-sonnet-20240620

NEO4J_HOST=localhost
NEO4J_PORT=7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=EchoesOfWisdom # If you are using the same password as in the docker-compose file, you can leave it as is.
```
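
Since the later cells read these values from `os.environ`, it can help to fail fast when one of them is missing. The following is a minimal sketch, assuming the variable names from the `.env` file above; the helper `missing_env_vars` is hypothetical and not part of the notebook.

```python
import os

# Variable names taken from the .env file above.
REQUIRED_ENV_VARS = [
    "OPENAI_API_KEY",
    "OPENAI_MODEL",
    "ANTHROPIC_API_KEY",
    "ANTHROPIC_MODEL",
    "NEO4J_HOST",
    "NEO4J_PORT",
    "NEO4J_USER",
    "NEO4J_PASSWORD",
]

def missing_env_vars(env=None, required=REQUIRED_ENV_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment.
example_env = {"NEO4J_HOST": "localhost", "NEO4J_PORT": "7687"}
print(missing_env_vars(example_env, ["NEO4J_HOST", "NEO4J_PORT", "NEO4J_USER"]))
```

Calling `missing_env_vars()` with no arguments after `load_dotenv()` would check the real environment.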



## Graph database

The graph database is a Neo4j instance running inside a Docker container. It exposes two ports to the host machine: one for the HTTP API, which we can use to visualize the graph in the Neo4j Browser, and one for the Bolt API, which our Python driver uses to interact with the database.

To run the graph database, you can use the following command:

```bash
docker compose up --build
```

The command above will launch three containers:

 - One for the Neo4j database and its frontend,
 - One for an app that populates the database, and
 - One for a full end-to-end demo of the graph-RAG system.

You can access the Neo4j Browser at [`http://localhost:7474`](http://localhost:7474). Use the username and password set in the `.env` file to log in.
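
Because the containers can take a while to start, it may be useful to check that both ports accept connections before moving on. This is a rough sketch using only the standard library; `is_port_open` is a hypothetical helper, and the port numbers are assumed from the Browser URL above (7474) and the `NEO4J_PORT` value in the `.env` file (7687).

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

# 7474 serves the Neo4j Browser (HTTP) and 7687 serves Bolt, per the assumptions above.
for name, port in [("HTTP", 7474), ("Bolt", 7687)]:
    status = "open" if is_port_open("localhost", port) else "closed"
    print(f"{name} port {port}: {status}")
```

An open port only shows that something is listening; it does not confirm that the database has finished loading.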

### Connecting with the Python driver

In [3]:
from dotenv import load_dotenv
import os
load_dotenv()


True

In [6]:
from neo4j import GraphDatabase

host = os.environ["NEO4J_HOST"]
user = os.environ["NEO4J_USER"]
password = os.environ["NEO4J_PASSWORD"]
port = os.environ["NEO4J_PORT"]

driver = GraphDatabase.driver(f"bolt://{host}:{port}", auth=(user, password))

with driver.session() as session:
    result = session.run("MATCH (n) RETURN count(n) as count")
    print(result.single()[0])

16068


## Cypher query generation with an LLM

To turn a user's question into a Cypher query, we use a Large Language Model (LLM). A model fine-tuned or trained specifically to generate Cypher would be preferable, but here we use a general-purpose LLM and rely on prompting to guide it toward correct Cypher.

An important part of a good prompt is a clear and concise definition of the graph database schema, which helps the LLM generate more accurate and relevant Cypher queries.

The following code queries the database to generate its schema.

In [7]:
def query_database(neo4j_query):
    with driver.session() as session:
        result = session.run(neo4j_query)
        output = [r.values() for r in result]
        output.insert(0, result.keys())
        return output
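
`query_database` returns a list whose first element holds the column names and whose remaining elements are the rows. When name-based access is more convenient, that output can be reshaped. A small sketch follows; `rows_to_dicts` is a hypothetical helper that is not part of the notebook, and the sample values are illustrative only.

```python
def rows_to_dicts(output):
    """Convert [headers, row1, row2, ...] into a list of {header: value} dicts."""
    if not output:
        return []
    headers, *rows = output
    return [dict(zip(headers, row)) for row in rows]

# Shaped like the return value of query_database (illustrative values only).
sample_output = [["name", "uri"], ["Hookshot", "example-uri"]]
print(rows_to_dicts(sample_output))
```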

### Node schema generation

In [10]:
node_properties_query = """CALL apoc.meta.data()
YIELD label, other, elementType, type, property
WHERE NOT type = "RELATIONSHIP" AND elementType = "node"
WITH label AS nodeLabels, collect(property) AS properties
RETURN {labels: nodeLabels, properties: properties} AS output
"""

def get_nodes_schema():
    properties_descriptions = {
        'name': 'Name of the entity',
        'uri': 'URI of the entity, it is a unique identifier',
        'gender': 'Gender of the entity, if applicable',
    }

    node_props = query_database(node_properties_query)
    nodes = [node[0] for node in node_props[1:]]

    node_descriptions = []
    for node in nodes:
        listable_properties = sorted([prop for prop in node['properties'] if prop in properties_descriptions])
        node_description = f" - {node['labels']}, with properties: {', '.join(listable_properties)}"
        node_descriptions.append(node_description)

    prop_descriptions = [
        f" - {prop}: {properties_descriptions[prop]}" for prop in sorted(properties_descriptions)
    ]

    property_description_instructions = [
        "### Nodes",
        "",
        "The following are the nodes in the graph database, along with their properties.",
        "The property descriptions are listed at the end.",
        "",
        *node_descriptions,
        "",
        "Property descriptions:",
        *prop_descriptions,
    ]

    return "\n".join(property_description_instructions)

nodes_schema = get_nodes_schema()

print(nodes_schema)

### Nodes

The following are the nodes in the graph database, along with their properties.
The property descriptions are listed at the end.

 - ENTITY, with properties: gender, name, uri
 - LOCATION, with properties: gender, name, uri
 - ITEM, with properties: name, uri
 - QUEST, with properties: name, uri
 - CHARACTER, with properties: gender, name, uri
 - ENEMY, with properties: gender, name, uri
 - BOSS, with properties: gender, name, uri
 - OBJECT, with properties: gender, name, uri
 - SHRINE, with properties: name, uri
 - SHOP, with properties: name, uri
 - GROUP, with properties: gender, name, uri
 - WEAPON, with properties: name, uri
 - DUNGEON, with properties: name, uri
 - SONG, with properties: name, uri
 - SHIELD, with properties: name, uri
 - SWORD, with properties: name, uri
 - MASK, with properties: name, uri
 - STAGE, with properties: name, uri
 - SEQUEL, with properties: name, uri
 - RACE, with properties: gender, name, uri

Property descriptions:
 - gender: Gender of t

### Relationship schema generation

In [12]:
rel_properties_query = """
CALL apoc.meta.data()
YIELD label, other, elementType, type, property
WHERE NOT type = "RELATIONSHIP" AND elementType = "relationship"
WITH label AS nodeLabels, collect(property) AS properties
RETURN {type: nodeLabels, properties: properties} AS output
"""

rel_query = """
CALL apoc.meta.data()
YIELD label, other, elementType, type, property
WHERE type = "RELATIONSHIP" AND elementType = "node"
RETURN {source: label, relationship: property, target: other} AS output
"""

In [13]:

def get_rel_schema():
    rels = query_database(rel_query)
    rels = [r[0] for r in rels[1:]]

    rel_descriptions = []
    for rel in rels:
        targets = ", ".join([f"`{t}`" for t in rel['target']])
        rel_description = f" - `{rel['relationship']}`, that relates `{rel['source']}` with {targets}"
        rel_descriptions.append(rel_description)

    rel_props_descriptions = {
        'relation':'When used as a property of `related_to`, it specifies the type of relationship between two entities (father, mother, etc)'
    }

    rel_props_descriptions = [
        f" - {prop}: {rel_props_descriptions[prop]}" for prop in sorted(rel_props_descriptions)
    ]

    rels_description_instructions = [
        "### Relationships",
        "",
        "The following are the relationships in the graph database",
        "",
        *rel_descriptions,
        "",
        "Property descriptions:",
        *rel_props_descriptions,
    ]

    return "\n".join(rels_description_instructions)

rel_schema = get_rel_schema()
print(rel_schema)


### Relationships

The following are the relationships in the graph database

 - `appears_in`, that relates `ENTITY` with `ENTITY`, `LOCATION`, `DUNGEON`, `RACE`, `CHARACTER`, `ENEMY`
 - `location`, that relates `ENTITY` with `ENTITY`, `DUNGEON`, `LOCATION`, `SHOP`, `ENEMY`, `OBJECT`, `RACE`, `ITEM`
 - `country`, that relates `ENTITY` with `ENTITY`, `LOCATION`, `RACE`
 - `region`, that relates `ENTITY` with `ENTITY`, `LOCATION`
 - `hometown`, that relates `ENTITY` with `ENTITY`, `LOCATION`, `SHRINE`, `QUEST`, `DUNGEON`
 - `is`, that relates `ENTITY` with `ENTITY`, `RACE`, `ENEMY`, `CHARACTER`
 - `homeland`, that relates `ENTITY` with `ENTITY`, `LOCATION`, `RACE`
 - `related_to`, that relates `ENTITY` with `ENTITY`, `CHARACTER`, `BOSS`, `RACE`
 - `location`, that relates `LOCATION` with `ENTITY`, `LOCATION`, `RACE`, `DUNGEON`, `SHOP`, `STAGE`
 - `appears_in`, that relates `LOCATION` with `ENTITY`, `LOCATION`, `ITEM`, `DUNGEON`, `RACE`, `STAGE`, `BOSS`
 - `region`, that relates `LOCATION

### Build full prompt

In [14]:
def get_system_message():

    node_schema = get_nodes_schema()
    rel_schema = get_rel_schema()
    return f"""
## Task:

Generate Cypher queries to query a Neo4j graph database based on the provided schema definition.

## Instructions:

You are an expert at generating Cypher queries to query a Neo4j graph database based on the provided schema definition.
Use only the provided relationship types and properties.
Do not use any other relationship types or properties that are not provided.
If you cannot generate a Cypher statement based on the provided schema, explain the reason to the user.
Try to eliminate duplicate results.

## Schema:

{node_schema}
{rel_schema}

Note: Do not include any explanations or apologies in your responses.""".strip()

print(get_system_message())

## Task:

Generate Cypher queries to query a Neo4j graph database based on the provided schema definition.

## Instructions:

You are an expert at generating Cypher queries to query a Neo4j graph database based on the provided schema definition.
Use only the provided relationship types and properties.
Do not use any other relationship types or properties that are not provided.
If you cannot generate a Cypher statement based on the provided schema, explain the reason to the user.
Try to eliminate duplicate results.

## Schema:

### Nodes

The following are the nodes in the graph database, along with their properties.
The property descriptions are listed at the end.

 - ENTITY, with properties: gender, name, uri
 - LOCATION, with properties: gender, name, uri
 - ITEM, with properties: name, uri
 - QUEST, with properties: name, uri
 - CHARACTER, with properties: gender, name, uri
 - ENEMY, with properties: gender, name, uri
 - BOSS, with properties: gender, name, uri
 - OBJECT, with prope

### Using Claude 3.5 from Anthropic to generate Cypher queries

In [20]:
import anthropic

anthropic_client = anthropic.Anthropic()

def get_candidate_cypher_query(question):
    message = anthropic_client.messages.create(
        model=os.environ["ANTHROPIC_MODEL"],
        max_tokens=1000,
        temperature=0,
        system=get_system_message(),
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ]
    )

    return message.content[0].text

In [37]:
query_1 = "In which games does the Hookshot and Bow appear at the same time?"
cypher_query_1 = get_candidate_cypher_query(query_1)
print(cypher_query_1)

To find games where both the Hookshot and Bow appear, we can use the following Cypher query:

MATCH (hookshot:ITEM {name: "Hookshot"})-[:appears_in]->(game:ENTITY)
MATCH (bow:ITEM {name: "Bow"})-[:appears_in]->(game)
RETURN DISTINCT game.name AS Game

This query will return the names of games where both the Hookshot and Bow appear.


In [21]:

candidate_cypher = get_candidate_cypher_query("Which character has the most relatives?")
print(candidate_cypher)

To find the character with the most relatives, we can use the following Cypher query:

MATCH (c:CHARACTER)-[r:related_to]->(relative)
WITH c, COUNT(DISTINCT relative) AS relativeCount
RETURN c.name AS Character, relativeCount AS RelativeCount
ORDER BY relativeCount DESC
LIMIT 1
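
Both replies above wrap the query in explanatory prose, which Neo4j would reject if the text were executed as-is. One way to handle this is to extract only the Cypher statement before running it. The sketch below is a best-effort heuristic under stated assumptions: `extract_cypher` is a hypothetical helper, it expects either a fenced block or an unfenced statement that starts with an upper-case keyword and ends at the first blank line, and it has not been validated against other reply styles.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker

# Matches a fenced block, with or without a "cypher" language tag.
FENCED_BLOCK = re.compile(
    FENCE + r"(?:cypher)?[ \t]*\n(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)

# Upper-case keywords that typically open a read query (not an exhaustive list).
QUERY_START = re.compile(r"^\s*(MATCH|OPTIONAL MATCH|CALL|UNWIND|WITH|RETURN)\b")

def extract_cypher(llm_response):
    """Best-effort extraction of a Cypher statement from an LLM reply."""
    fenced = FENCED_BLOCK.search(llm_response)
    if fenced:
        return fenced.group(1).strip()

    query_lines = []
    for line in llm_response.splitlines():
        if not query_lines:
            # Skip leading prose until a line that looks like Cypher.
            if QUERY_START.match(line):
                query_lines.append(line)
        elif line.strip():
            query_lines.append(line)
        else:
            break  # A blank line ends the statement.
    return "\n".join(query_lines).strip()

# The first reply shown above, reproduced as a string.
reply = "\n".join([
    "To find games where both the Hookshot and Bow appear, we can use the following Cypher query:",
    "",
    'MATCH (hookshot:ITEM {name: "Hookshot"})-[:appears_in]->(game:ENTITY)',
    'MATCH (bow:ITEM {name: "Bow"})-[:appears_in]->(game)',
    "RETURN DISTINCT game.name AS Game",
    "",
    "This query will return the names of games where both the Hookshot and Bow appear.",
])
print(extract_cypher(reply))
```

With a helper like this, the value passed to `query_database` could be `extract_cypher(cypher_query_1)` rather than the raw reply.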


## Execute the Cypher query

In [26]:
results_query_1 = query_database(cypher_query_1)
print(results_query_1)

[['game.name'], ['The Legend of Zelda: Ocarina of Time'], ['The Legend of Zelda: A Link to the Past'], ['The Legend of Zelda: A Link Between Worlds'], ['The Legend of Zelda'], ['The Legend of Zelda (Game)'], ['Super Smash Bros.'], ['Super Smash Bros. Brawl'], ['Super Smash Bros. Melee'], ['Hyrule Warriors'], ['BS The Legend of Zelda: Ancient Stone Tablets'], ['The Legend of Zelda: The Wind Waker']]
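
Since the query text comes from an LLM, it may be worth checking that it only reads from the graph before executing it. The following is a coarse sketch and not a security boundary: `looks_read_only` is a hypothetical helper, the keyword list is an assumption that may be incomplete, and procedures invoked through `CALL` can still write.

```python
import re

# Clauses that modify the graph or its schema (assumed list, may be incomplete).
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|FOREACH)\b",
    re.IGNORECASE,
)

# Simple quoted strings; escaped quotes inside literals are not handled.
STRING_LITERAL = re.compile(r"\"[^\"]*\"|'[^']*'")

def looks_read_only(cypher):
    """Return True if no write clause appears outside of string literals."""
    without_strings = STRING_LITERAL.sub('""', cypher)
    return WRITE_CLAUSES.search(without_strings) is None

print(looks_read_only('MATCH (i:ITEM {name: "Bow"})-[:appears_in]->(g) RETURN DISTINCT g.name'))
print(looks_read_only("MATCH (n) DETACH DELETE n"))
```

The check errs on the side of rejecting queries: a property that happens to be named `set`, for example, would be flagged.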


## Use another LLM to generate the response to the user

Once we have the results from querying our graph database, we can use another LLM to generate a response to the user.

This generation is done with OpenAI's GPT-4 model. Although its prompt differs from the one used to generate the Cypher query, it must still contain the user's original question and the results of the graph database query.

### Formatting the results from the graph database query

There are multiple ways to format the results from the graph database query to be used in the prompt to generate the response to the user – you can provide the results in a table, in a list, or in a more free-form text format.

In our case, we will use a table in markdown format.



In [35]:
def format_results_as_markdown_table(results):
    if not results:
        return ""

    headers = results[0]

    columns = len(headers)

    column_widths = [len(header) for header in headers]
    
    for result in results[1:]:
        column_widths = [
            # str() lets non-string values such as counts or nulls be measured
            max(column_width, len(str(value))) for column_width, value in zip(column_widths, result)
        ]

    rows = []
    def format_row(row, space_char=" "):
        return "|" + ("|".join([f"{space_char}{str(value):<{column_widths[i]}}{space_char}" for i, value in enumerate(row)])) + "|"

    rows.append(format_row(headers))
    rows.append(format_row(["-" * column_widths[i] for i in range(columns)], "-"))
    for result in results[1:]:
        rows.append(format_row(result))
    rows.append("")
    return "\n".join(rows)
    
formatted_results_1 = format_results_as_markdown_table(results_query_1)
print(formatted_results_1)


| game.name                                     |
|-----------------------------------------------|
| The Legend of Zelda: Ocarina of Time          |
| The Legend of Zelda: A Link to the Past       |
| The Legend of Zelda: A Link Between Worlds    |
| The Legend of Zelda                           |
| The Legend of Zelda (Game)                    |
| Super Smash Bros.                             |
| Super Smash Bros. Brawl                       |
| Super Smash Bros. Melee                       |
| Hyrule Warriors                               |
| BS The Legend of Zelda: Ancient Stone Tablets |
| The Legend of Zelda: The Wind Waker           |
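
For comparison, the list format mentioned earlier could look like the sketch below. `format_results_as_list` is a hypothetical alternative that is not used elsewhere in the notebook; it assumes the same header-first structure returned by `query_database`.

```python
def format_results_as_list(results):
    """Render [headers, row1, ...] as one Markdown bullet per row."""
    if not results or len(results) < 2:
        return ""
    headers = results[0]
    lines = []
    for row in results[1:]:
        # Pair each value with its column name so the LLM keeps the context.
        pairs = [f"{header}: {value}" for header, value in zip(headers, row)]
        lines.append(" - " + ", ".join(pairs))
    return "\n".join(lines)

sample = [["game.name"], ["Hyrule Warriors"], ["Super Smash Bros."]]
print(format_results_as_list(sample))
```

A list tends to be more compact for single-column results, while a table keeps multi-column results easier to scan.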



### Generate a prompt and system instructions for result generation

The prompt combines the user's question with the results from the graph database query, while the system instructions describe how the response should be written.

In [44]:
generation_system_instruction = """
You are a Zelda expert, and your goal is to answer the user's question.
You will be provided with a table of results from a graph database query.
The results are related to the user's question and may represent appearances, counts, or relationships between entities in the Zelda universe.
Do not mention that the results are coming from a neo4j database.
Your task is to generate a response to the user's question based on the provided results.
Do not use any information that is not provided in the results.  
If the results are empty, explain it to the user. 
Use all the information provided to you to answer the question.
Do not return any links, urls or references.
Be as comprehensive as possible. 
""".strip()

generation_prompt = """These are the results from a graph database query related to the user's question:

{results_table}

And the user's question you must answer using the results:

{question}
"""


In [45]:

import openai

openai_client = openai.OpenAI()

def generate_response(question,  results_table):

    prompt = generation_prompt.format(results_table=results_table, question=question)
    messages = [
        {"role": "system", "content": generation_system_instruction},
        {"role": "user", "content": prompt},
    ]
    completions = openai_client.chat.completions.create(
        model=os.environ["OPENAI_MODEL"],
        temperature=0.0,
        max_tokens=1000,
        messages=messages
    )

    return completions.choices[0].message.content

response_1 = generate_response(query_1, formatted_results_1)
print(response_1)


Based on the provided results, the Hookshot and Bow both appear in the following games:

1. **The Legend of Zelda: Ocarina of Time** - Both the Hookshot and Bow are essential items for navigating dungeons and solving puzzles.
2. **The Legend of Zelda: A Link to the Past** - These items are crucial for progression and are found in various dungeons throughout the game.
3. **The Legend of Zelda: A Link Between Worlds** - Similar to "A Link to the Past," both items are used for puzzle-solving and combat.

These games feature both the Hookshot and Bow, making them integral tools for the player's adventure in each respective game.


## Putting it all together

See a full end-to-end demo of the graph-RAG system [here](http://localhost:8501).
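
The demo's own code is not shown in this notebook, but the steps covered here can be chained roughly as follows. This is only a sketch: `answer_question` is a hypothetical function, and the stages are passed in as arguments so the example runs without a database or API keys.

```python
def answer_question(question, generate_cypher, run_query, format_results, generate_answer):
    """Run the four stages in order and return the final answer text."""
    cypher = generate_cypher(question)       # e.g. get_candidate_cypher_query
    results = run_query(cypher)              # e.g. query_database
    results_table = format_results(results)  # e.g. format_results_as_markdown_table
    return generate_answer(question, results_table)  # e.g. generate_response

# Stand-in stages with made-up data, used only to show the flow.
answer = answer_question(
    "Which items appear in the demo?",
    generate_cypher=lambda q: "MATCH (i:ITEM) RETURN i.name",
    run_query=lambda cypher: [["i.name"], ["Hookshot"], ["Bow"]],
    format_results=lambda results: ", ".join(row[0] for row in results[1:]),
    generate_answer=lambda q, table: f"{q} -> {table}",
)
print(answer)
```

In the notebook, the same call would receive `get_candidate_cypher_query`, `query_database`, `format_results_as_markdown_table`, and `generate_response` as its four stages.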