# LLM RAG vs GraphRAG 

<img src="resources/images/rvg1.png" style="width:1200px; height:auto;">


**Leveraging knowledge graphs with large language models is essential for reducing hallucinations and providing data-centric results.**


This notebook explains the differences between RAG and GraphRAG by example. By the end, you will understand how GraphRAG can improve your question-and-answer results, save substantial costs, and provide explainability, all without needing to make changes to an existing knowledge graph.


In [225]:

%%capture
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install pyvis
%pip install llama-index-readers-arango-db==0.1.3

import os
import logging
import sys
import datetime
from IPython.display import Markdown, display

from llama_index.core import (Settings,
                            StorageContext, 
                            SimpleDirectoryReader, 
                            KnowledgeGraphIndex)
from llama_index.core.graph_stores import SimpleGraphStore
from llama_index.llms.openai import OpenAI
from langchain_community.callbacks import get_openai_callback


logging.basicConfig(stream=sys.stdout, level=logging.INFO)

In [226]:

os.environ["OPENAI_API_KEY"] = "sk-"

## RAG
RAG builds upon powerful language models by providing fact-based answers derived from user files. Early RAG approaches take user data, often in the form of flat files, and parse that data to generate embeddings, which are made available via an in-memory, vector-based index.

This approach is superior to off-the-shelf models but has many limitations regarding reliability and accuracy.

Limitations:
* Costly embedding generation process
* Time-consuming
* Inconsistent results depending on the KG creation approach
* The resulting index is an additional data source that needs to remain consistent with source data changes
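
To make the flow described above concrete, here is a minimal sketch of an in-memory vector index. The bag-of-words `embed` function is only a stand-in for a real embedding model, and the three sample chunks are hypothetical; this is not how LlamaIndex or OpenAI embeddings work internally.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical flat-file chunks; a real pipeline would parse user files here.
chunks = [
    "protests over trade sanctions",
    "battle near the northern border",
    "peace talks resume in the capital",
]

# The toy index is just a list of (vector, chunk) pairs held in memory.
toy_index = [(embed(c), c) for c in chunks]

def retrieve(question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(toy_index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

top_chunks = retrieve("were there protests over sanctions")
```

Whenever the source data changes, the affected chunks have to be embedded again and the index rebuilt, which is where the cost and consistency limitations listed above come from.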

In [227]:
# ##
# # Don't run if loading from persisted directory
# ##
# from llama_index.readers.arango_db import SimpleArangoDBReader
# import tiktoken
# from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# # Read single documentation collection from ArangoDB
# reader = SimpleArangoDBReader("https://fdf2638171c0.arangodb.cloud:8529")
# documents = reader.load_data(
#     "root",
#     _password,
#     "open_intelligence",
#     collection_name="Event",
#     field_names=["description", "date", "label", "fatalities"],
#     separator=", ",
#     metadata_names=["_id"]
# )

# # Begin parsing the collection into triplets, so that it can be used by OpenAI/ChatGPT
# # Basic timestamp to monitor processing time
# now = datetime.datetime.now()
# print(now)

# # Keep track of tokens for input/output
# token_counter = TokenCountingHandler(
#     tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
# )

# token_counter.reset_counts()

# # Assigns token counter to llama_index callback manager
# Settings.callback_manager = CallbackManager([token_counter])

# # Use a simple in-memory graph store to hold the resulting triplets
# graph_store = SimpleGraphStore()
# storage_context = StorageContext.from_defaults(graph_store=graph_store)

# # NOTE: It WILL take a while!
# index = KnowledgeGraphIndex.from_documents(
#     documents,
#     max_triplets_per_chunk=5, # adjust max_triplets_per_chunk based on your own knowledge
#     storage_context=storage_context,
# )

# print(token_counter.total_llm_token_count)

# finished = datetime.datetime.now()
# print(finished - now)

# # Persist the resulting graph store to disk
# storage_context.persist(persist_dir="resources/gdelt_persist_dir")

## Note: Processing and Data Consistency

We have already done the costly, time-consuming part of generating triplets from our documents and storing them in the graph store.
We don't store the embeddings the LLM uses to generate these, as that would only add more cost and complexity to this process.

The following code block loads the persisted version of this graph store into memory, so we don't need to re-process everything.

In [228]:
# This code block does the following:
#  * Loads pre-generated index from disk
#  * Configures LLM to use for prompting/responses
#  * Defines token and cost tracking


# Load graph store from persisted directory
import tiktoken
from decimal import Decimal
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core import StorageContext, load_index_from_storage

graph_store = SimpleGraphStore.from_persist_dir("./resources/gdelt_persist_dir")

# Rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="./resources/gdelt_persist_dir", graph_store=graph_store)

# load index
index = load_index_from_storage(storage_context)

# define LLM
# NOTE: at the time of demo, text-davinci-002 did not have rate-limit errors

llm = OpenAI(model='gpt-4', temperature=0)
Settings.llm = llm
Settings.chunk_size = 512

model_name = "gpt-4"

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model(model_name).encode
)

# Flat USD rate per token: gpt-3.5-turbo if selected, otherwise gpt-4 is assumed. Update the rate if you use another model.
cost_per_token = (0.50 / 1000000) if model_name == "gpt-3.5-turbo" else (30.00 / 1000000)

callback_manager = CallbackManager([token_counter])
Settings.callback_manager = callback_manager

print(
    "Note: Tokens here are 0 since we load from static pre-processed data. \n",
    "Pre-processing Token Cost: ~$50 for 60k documents \n"
)

# reset counts
token_counter.reset_counts()

query_engine = index.as_query_engine()


all_token_counter = 0

def print_response_with_costs(response):
    
    display(Markdown(f"<b>{response}</b>"))
    global all_token_counter
    all_token_counter = all_token_counter + token_counter.total_llm_token_count
    
    print(
        "Query Embedding Tokens: ",
        token_counter.total_embedding_token_count,
        "\n",
        "Query LLM Prompt Tokens: ",
        token_counter.prompt_llm_token_count,
        "\n",
        "Query LLM Completion Tokens: ",
        token_counter.completion_llm_token_count,
        "\n",
        "Query Total LLM Token Count: ",
        token_counter.total_llm_token_count,
        "\n",
        "Query Token Cost: $",
        str(Decimal(token_counter.total_llm_token_count * cost_per_token))[:9],
        "\n",
        "Total Query Activity Token Cost: $",
        str(Decimal(all_token_counter * cost_per_token))[:9],
        "\n",
        "Total Activity Token Cost: $",
        str(Decimal(50 + all_token_counter * cost_per_token))[:9],
    )
    token_counter.reset_counts()


Note: Tokens here are 0 since we load from static pre-processed data. 
 Pre-processing Token Cost: ~$50 for 60k documents 



## Knowledge Graph Creation

LLMs can be powerful for creating knowledge graphs from pre-existing data.

However, relying on a pipeline that is purely in-memory and LLM-dependent for long-term graph creation raises a number of issues:
* Variable cost associated with KG creation
* Complex processing needed for a customized persisted schema, reliant on the LLM's understanding
* If data comes from a database, you create a second data source that must be maintained


To reduce cost and processing time we restrict the graph to the following:

**Collection Name**: Event

**Fields Included**: ["description", "date", "label", "fatalities"]
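
As a rough sketch of what this restriction means, each Event document is reduced to a single line of text built from the listed fields before triplet extraction. The `flatten_event` helper is hypothetical (it only mirrors the `field_names` and `separator` arguments passed to the reader earlier), and the sample document is abbreviated from an Event that appears in the query output later in this notebook.

```python
def flatten_event(doc, field_names, separator=", "):
    """Join the selected fields of one document into a single text string."""
    return separator.join(str(doc[name]) for name in field_names if name in doc)

# Abbreviated Event document; fields not listed in FIELD_NAMES are dropped.
sample_event = {
    "_id": "Event/ALG1",
    "description": "5 January: Beheading of 5 citizens in Douaouda (Tipaza).",
    "date": "1997-01-01T00:00:00.000Z",
    "label": "Violence_against_civilians",
    "fatalities": 5,
}

FIELD_NAMES = ["description", "date", "label", "fatalities"]

event_text = flatten_event(sample_event, FIELD_NAMES)
print(event_text)
```

Only the text of these four fields reaches the LLM. Edges to other collections are not part of this text, so relations between events and other vertices are not available to the in-memory graph unless the LLM happens to infer them from the description.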

### ArangoDB Graph Representation

<img src="resources/images/open_intelligence_sample.png" style="width:800px; height:auto;">

**Data Consistency**

It is worth pointing out that there are now essentially two data sources for the application data: an in-database version and a flat-file/in-memory version. The complexity of maintaining and adding new data continues to grow from here.

In [252]:
## create graph
from pyvis.network import Network

g = index.get_networkx_graph(limit=200)
net = Network(notebook=True, cdn_resources="in_line", directed=True)
display(Markdown(f"### In-memory Knowledge Graph"))
net.from_nx(g)
net.show("example.html")


### In-memory Knowledge Graph

example.html


The knowledge graph created can be very informative and, for some, may be good enough out of the box.

However, the cost, complexity, and accuracy considerations still exist when going from LLM to an in-memory representation.

## Cost
A big factor in considering GraphRAG with ArangoDB is that it can use data directly from the database. This bypasses the index creation process and gives you finer control over the exact cost of obtaining valuable answers from your data.

Even this small dataset, with only a few fields from a single collection, costs around $10 every time the embeddings are generated. 

### **Simple In-Memory KG Generation**

Total Tokens: 16,067,541 (total LLM token count)
 - Completion token count: 3,882,547
 - Prompt token count: 12,184,994

**Total Cost (USD): ~$50 + query costs** 

(The query cost is associated with non-cached query input processing into LLM-readable format)

-------------------------------------------

### **GraphRAG Direct-to-Database Query**

Tokens Used: 7335

Prompt Tokens: 6969

Completion Tokens: 366

**Total Cost (USD): $0.009922 (GPT-3.5-turbo) or $0.23103 (GPT-4)** 

The query cost for GraphRAG is the cost of translating the query into an LLM-readable format.
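
The GPT-4 figure above can be reproduced from the token counts. The sketch below assumes GPT-4 prices of $30 per million prompt tokens and $60 per million completion tokens; these rates are an assumption (they are not stated in this notebook), although they do reproduce the reported total.

```python
# Assumed GPT-4 prices in USD per one million tokens (not stated in this notebook).
PROMPT_USD_PER_MILLION = 30
COMPLETION_USD_PER_MILLION = 60

def query_cost_usd(prompt_tokens, completion_tokens):
    """Cost of one call, pricing prompt and completion tokens separately."""
    micro_usd = (prompt_tokens * PROMPT_USD_PER_MILLION
                 + completion_tokens * COMPLETION_USD_PER_MILLION)
    return micro_usd / 1_000_000

graphrag_cost = query_cost_usd(prompt_tokens=6969, completion_tokens=366)
print(f"${graphrag_cost:.5f}")  # prints $0.23103
```

Note that the `print_response_with_costs` helper used for the in-memory queries applies a single flat rate to all LLM tokens, so under these assumed prices its figures are an approximation.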



# Query Comparison

### How does the performance actually stack up?

Callouts:
* A simple graph store was created from a single event collection.
* However, this collection still contains some of the needed data for the questions.
* The in-memory graph is stored as triplets; embeddings are not included because they add complexity and cost with minimal improvement in performance.
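
To illustrate what a triplet-only representation looks like, here is a simplified sketch of a keyword lookup over (subject, relation, object) triplets. It is not LlamaIndex's actual retrieval code, and the triplets are hypothetical examples loosely modeled on the Event data.

```python
from collections import defaultdict

# Hypothetical triplets of the kind an LLM might extract from event descriptions.
triplets = [
    ("Tizi Ouzou", "is located in", "Algeria"),
    ("Tizi Ouzou", "site of", "riots"),
    ("Citizens", "were beheaded in", "Hassasna"),
]

# Index triplets by subject so a query keyword can be matched to stored facts.
by_subject = defaultdict(list)
for subject, relation, obj in triplets:
    by_subject[subject.lower()].append((relation, obj))

def lookup(keyword):
    """Return the stored facts whose subject matches the keyword exactly."""
    return by_subject.get(keyword.lower(), [])

facts_found = lookup("Tizi Ouzou")  # two facts
facts_missing = lookup("Unita")     # empty: no triplet has this subject
```

If extraction never produced a triplet for an entity, a lookup of this kind returns nothing even when the source descriptions mention that entity, which is one plausible reason an in-memory graph can miss data that exists in the database.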

In [230]:
# Sanity check: make sure it is not finding information that shouldn't be there
# There should be no information available about Interleaf

response = query_engine.query(
    "Tell me more about Interleaf"
)

print_response_with_costs(response)

<b>I'm sorry, but the provided context does not contain any information about Interleaf.</b>

Query Embedding Tokens:  0 
 Query LLM Prompt Tokens:  121 
 Query LLM Completion Tokens:  17 
 Query Total LLM Token Count:  138 
 Query Token Cost: $ 0.0041400 
 Total Query Activity Token Cost: $ 0.0041400 
 Total Activity Token Cost: $ 50.004139


In [231]:
response = query_engine.query(
    "Tell me about Unita."
)
print_response_with_costs(response)

<b>I'm sorry, but there's no information available about Unita in the provided context.</b>

Query Embedding Tokens:  0 
 Query LLM Prompt Tokens:  120 
 Query LLM Completion Tokens:  18 
 Query Total LLM Token Count:  138 
 Query Token Cost: $ 0.0041400 
 Total Query Activity Token Cost: $ 0.0082800 
 Total Activity Token Cost: $ 50.008279


In [232]:
query = """
Tell me about the Events happening in a random location
 """
response = query_engine.query(query)
print_response_with_costs(response)

<b>In Tizi Ouzou, there have been several events of note. This location has been the site of running battles, recent violence, and protests against the government. There have been demands for greater rights and clashes have occurred in this area. Riots have also taken place in Tizi Ouzou, with some instances of killing. The location has also been the site of commemorations, such as the 3rd anniversary of an unspecified event. Tizi Ouzou is located in Algeria, east of Algiers, and is the largest city in the Kabylie province. It's also worth noting that it's located near the capital.</b>

Query Embedding Tokens:  0 
 Query LLM Prompt Tokens:  977 
 Query LLM Completion Tokens:  130 
 Query Total LLM Token Count:  1107 
 Query Token Cost: $ 0.0332100 
 Total Query Activity Token Cost: $ 0.0414899 
 Total Activity Token Cost: $ 50.041490


In [233]:

query = """
Have there been any events that mention diamonds in the description?
"""
response = query_engine.query(query)

print_response_with_costs(response)

<b>The context does not provide any information about events that mention diamonds in their description.</b>

Query Embedding Tokens:  0 
 Query LLM Prompt Tokens:  127 
 Query LLM Completion Tokens:  16 
 Query Total LLM Token Count:  143 
 Query Token Cost: $ 0.0042900 
 Total Query Activity Token Cost: $ 0.0457800 
 Total Activity Token Cost: $ 50.045780


In [234]:
# Let's give it an easy one
# We will compare this to the GraphRAG answer later on
query = """
Has there been Violence Against Citizens?
"""
response = query_engine.query(query)

print_response_with_costs(response)

<b>Yes, there has been violence against citizens. For instance, citizens were beheaded in Hassasna, which experienced violence against civilians. Additionally, civilians were killed in a raid.</b>

Query Embedding Tokens:  0 
 Query LLM Prompt Tokens:  1620 
 Query LLM Completion Tokens:  36 
 Query Total LLM Token Count:  1656 
 Query Token Cost: $ 0.0496800 
 Total Query Activity Token Cost: $ 0.0954600 
 Total Activity Token Cost: $ 50.095460


In [235]:
# A tougher one that requires data aggregation and general contextual awareness of relations 

query = """
Look at the number of fatalities per Event and determine based on associated Actor, who is the deadliest actor based on the number of fatalities they were involved with?
"""

# Note: the document fields and metadata were parsed, but the relations are not present, so they are not treated as contextually relevant (though they should be)
response = query_engine.query(query)

print_response_with_costs(response)

<b>The deadliest actor based on the number of fatalities they were involved with is the FDD rebels, who killed 113 Burundi soldiers in the event BUR1021.</b>

Query Embedding Tokens:  0 
 Query LLM Prompt Tokens:  1417 
 Query LLM Completion Tokens:  34 
 Query Total LLM Token Count:  1451 
 Query Token Cost: $ 0.0435299 
 Total Query Activity Token Cost: $ 0.1389900 
 Total Activity Token Cost: $ 50.138989


# GraphRAG

Since you do not need to precompute embeddings and maintain a vector index, GraphRAG avoids much of the complexity and cost associated with standard non-database RAG approaches.

You get contextually aware answers from a model that is able to take full advantage of the relations already in the graph. 

In [236]:
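# 1: Instantiate the ArangoDB-LangChain Graph wrapper

from pprint import pprint

from langchain.chains import ArangoGraphQAChain
from langchain.chat_models import ChatOpenAI
from langchain.graphs import ArangoGraph

# `%useDatabase` and `_db` are assumed to be provided by the hosting notebook environment
%useDatabase open_intelligence

graph = ArangoGraph(_db)

chain = ArangoGraphQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=graph,
    verbose=True
)

pprint(graph.schema)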

Using Database 'open_intelligence' now.
{
    "Graph Schema": [
        {
            "graph_name": "OPEN_INTELLIGENCE",
            "edge_definitions": [
                {
                    "edge_collection": "eventActor",
                    "from_vertex_collections": [
                        "Event"
                    ],
                    "to_vertex_collections": [
                        "Actor"
                    ]
                },
                {
                    "edge_collection": "hasLocation",
                    "from_vertex_collections": [
                        "Event"
                    ],
                    "to_vertex_collections": [
                        "Location"
                    ]
                },
                {
                    "edge_collection": "hasSource",
                    "from_vertex_collections": [
                        "Event"
                    ],
                    "to_vertex_collections": [
                        "Sourc

In [237]:
# 2: Instantiate the OpenAI chat model

from langchain.chat_models import ChatOpenAI

model = ChatOpenAI(temperature=0, model_name='gpt-4')

# 3: Instantiate the LangChain Question-Answering Chain with
# our **model** and **graph**

from langchain.chains import ArangoGraphQAChain

chain = ArangoGraphQAChain.from_llm(model, graph=graph, verbose=True)

chain.top_k = 5

## Fine Tuning

A benefit of using ArangoDB GraphRAG is that you can tailor the prompts to your specific data schema or have the model respond in a way that aligns with company guidelines. 

The following prompts focus on ensuring the model understands AQL and responds with only the most factual results. However, you can modify the prompts so that the LLM communicates in whatever way suits your needs best.
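
Each template in the next cell is a Python format string: single braces such as `{user_input}` are placeholders, while doubled braces are emitted as literal braces. The sketch below uses plain `str.format` on a shortened, hypothetical template to show the substitution; it is not the full prompt used in this notebook, and the schema value is a placeholder.

```python
# Shortened, hypothetical template in the same style as the prompts in this notebook.
MINI_TEMPLATE = """Edge definition example: {{'edge_collection': 'eventActor'}}

ArangoDB Schema:
{adb_schema}

User Input:
{user_input}
"""

filled = MINI_TEMPLATE.format(
    adb_schema='{"Graph Schema": []}',  # placeholder schema, for illustration only
    user_input="Tell me about events that mention Unita in the description",
)
print(filled)
```

Forgetting to double a literal brace in a template raises an error at format time, so any JSON-like example added to a prompt needs its braces escaped this way.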

In [238]:

from langchain.prompts import PromptTemplate

AQL_GENERATION_TEMPLATE = """Task: Generate an ArangoDB Query Language (AQL) query from a User Input.

You are an ArangoDB Query Language (AQL) expert responsible for translating a `User Input` into an ArangoDB Query Language (AQL) query.

You are given an `ArangoDB Schema`. It is a JSON Object containing:
1. `Graph Schema`: Lists all Graphs within the ArangoDB Database Instance, along with their Edge Relationships.
2. `Collection Schema`: Lists all Collections within the ArangoDB Database Instance, along with their document/edge properties and a document/edge example.

You may also be given a set of `AQL Query Examples` to help you create the `AQL Query`. If provided, the `AQL Query Examples` should be used as a reference, similar to how `ArangoDB Schema` should be used.

Things you should do:
- Think step by step.
- Rely on `ArangoDB Schema` and `AQL Query Examples` (if provided) to generate the query.
- Begin the `AQL Query` by the `WITH` AQL keyword to specify all of the ArangoDB Collections required.
- Return the `AQL Query` wrapped in 3 backticks (```).
- Use only the provided relationship types and properties in the `ArangoDB Schema` and any `AQL Query Examples` queries.
- Only answer to requests related to generating an AQL Query.
- If a request is unrelated to generating AQL Query, say that you cannot help the user.
- Keep in mind that when doing a COLLECT you have to use KEEP in order to not lose variables that were declared before it. For example, if you try and do this 'COLLECT actorName = actor.name WITH COUNT INTO eventCount' it should instead be something more like: 'COLLECT actorName = actor.name INTO eventCount KEEP event'
- In the graph schema the 'edge_definitions' describe the relations from one node type to another. This can also be used to understand the direction of a graph traversal.
The 'from_vertex_collection' means that if you are starting a traversal from a node in that collection, you likely should use OUTBOUND, or if you are starting from the 'to_vertex_collection' you should use INBOUND, if you are starting from a combination then use ANY.
The following is an example of an edge definition in the graph schema:
   ['edge_definitions': [{{'edge_collection': 'eventActor',
     'from_vertex_collections': ['Event'],
     'to_vertex_collections': ['Actor']}},]
Here is an example graph traversal starting from the Event collection that wants to traverse the graph using at least the eventActor edge:
FOR v,e,p IN 1..2 OUTBOUND event._id eventActor
FILTER v.name == "Two citizens were beheaded in "
RETURN p

However, if you wanted to start from the actor node, here is what the traversal would look like:
FOR v,e,p IN 1..2 INBOUND actor._id eventActor
FILTER v.name == "GIA: Armed Islamic Group"
RETURN p


Things you should not do:
- Do not use any properties/relationships that can't be inferred from the `ArangoDB Schema` or the `AQL Query Examples`.
- Do not include any text except the generated AQL Query.
- Do not provide explanations or apologies in your responses.
- Do not generate an AQL Query that removes or deletes any data.

Under no circumstance should you generate an AQL Query that deletes any data whatsoever.

ArangoDB Schema:
{adb_schema}

AQL Query Examples (Optional):
{aql_examples}

User Input:
{user_input}

AQL Query:
"""

AQL_GENERATION_PROMPT = PromptTemplate(
    input_variables=["adb_schema", "aql_examples", "user_input"],
    template=AQL_GENERATION_TEMPLATE,
)

AQL_FIX_TEMPLATE = """Task: Address the ArangoDB Query Language (AQL) error message of an ArangoDB Query Language query.

You are an ArangoDB Query Language (AQL) expert responsible for correcting the provided `AQL Query` based on the provided `AQL Error`.

The `AQL Error` explains why the `AQL Query` could not be executed in the database.
The `AQL Error` may also contain the position of the error relative to the total number of lines of the `AQL Query`.
For example, 'error X at position 2:5' denotes that the error X occurs on line 2, column 5 of the `AQL Query`.

You are also given the `ArangoDB Schema`. It is a JSON Object containing:
1. `Graph Schema`: Lists all Graphs within the ArangoDB Database Instance, along with their Edge Relationships.
2. `Collection Schema`: Lists all Collections within the ArangoDB Database Instance, along with their document/edge properties and a document/edge example.

You will output the `Corrected AQL Query` wrapped in 3 backticks (```). Do not include any text except the Corrected AQL Query.

Remember to think step by step.

ArangoDB Schema:
{adb_schema}

AQL Query:
{aql_query}

AQL Error:
{aql_error}

Corrected AQL Query:
"""

AQL_FIX_PROMPT = PromptTemplate(
    input_variables=["adb_schema", "aql_query", "aql_error"],
    template=AQL_FIX_TEMPLATE,
)

AQL_QA_TEMPLATE = """Task: Generate a natural language `Summary` from the results of an ArangoDB Query Language query.

You are an ArangoDB Query Language (AQL) expert responsible for creating a well-written `Summary` from the `User Input` and associated `AQL Result`.

A user has executed an ArangoDB Query Language query, which has returned the AQL Result in JSON format.
You are responsible for creating a `Summary` based on the AQL Result.

You are given the following information:
- `ArangoDB Schema`: contains a schema representation of the user's ArangoDB Database.
- `User Input`: the original question/request of the user, which has been translated into an AQL Query.
- `AQL Query`: the AQL equivalent of the `User Input`, translated by another AI Model. Should you deem it to be incorrect, suggest a different AQL Query.
- `AQL Result`: the JSON output returned by executing the `AQL Query` within the ArangoDB Database.

Remember to think step by step.

Your `Summary` should sound like it is a response to the `User Input`.
Your `Summary` should not include any mention of the `AQL Query` or the `AQL Result`.

ArangoDB Schema:
{adb_schema}

User Input:
{user_input}

AQL Query:
{aql_query}

AQL Result:
{aql_result}

"""

AQL_QA_PROMPT = PromptTemplate(
    input_variables=["adb_schema", "user_input", "aql_query", "aql_result"],
    template=AQL_QA_TEMPLATE,
)

chain = ArangoGraphQAChain.from_llm(
    model,
    aql_generation_prompt=AQL_GENERATION_PROMPT,
    aql_fix_prompt=AQL_FIX_PROMPT,
    qa_prompt=AQL_QA_PROMPT,
    graph=graph,
    verbose=True
)

from langchain_community.callbacks import get_openai_callback

all_cost_counter = 0

def print_graph_rag_with_response(query, verbose):
    global all_cost_counter
    chain.verbose = verbose
    with get_openai_callback() as cb:
        result = chain.invoke(query)
        display(Markdown(f"{result['result']}"))
        print(cb)
        all_cost_counter = all_cost_counter + cb.total_cost
        print("Total Activity Cost: $",
        all_cost_counter)
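
The traversal-direction guidance in `AQL_GENERATION_TEMPLATE` can be expressed as a small rule. The `pick_direction` helper below is hypothetical and only restates that guidance in code; the chain itself leaves this decision to the LLM.

```python
def pick_direction(edge_definition, start_collection):
    """Choose an AQL traversal direction from an edge definition and a start collection."""
    in_from = start_collection in edge_definition["from_vertex_collections"]
    in_to = start_collection in edge_definition["to_vertex_collections"]
    if in_from and not in_to:
        return "OUTBOUND"
    if in_to and not in_from:
        return "INBOUND"
    return "ANY"  # start collection appears on both sides, or on neither

# Edge definition taken from the graph schema printed earlier.
event_actor = {
    "edge_collection": "eventActor",
    "from_vertex_collections": ["Event"],
    "to_vertex_collections": ["Actor"],
}

from_event = pick_direction(event_actor, "Event")  # "OUTBOUND"
from_actor = pick_direction(event_actor, "Actor")  # "INBOUND"
```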

## Explainability 

Historically, AI has faced challenges with observability and explainability: the ability to explain why a certain result or decision was made.

With ArangoDB GraphRAG you can return the AQL query that was used to provide the context information to the language model and pinpoint exactly the derived relations used for decision making.

The following is an example of observing a response with the exact query, sample documents returned, and total cost associated with an LLM call.

In [239]:
# chain.verbose = True
# result = []
# with get_openai_callback() as cb:
#     result = chain.invoke("Tell me about events that mention Unita in the description")
#     display(Markdown(f"{result['result']}"))
#     print(cb)

# The above is turned into the following convenience function that also counts total cost

query = """
Tell me about events that mention Unita in the description
"""

display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=True)

display(Markdown(f"""
#### Compare to the in-memory response: 

<b>I'm sorry, but there's no information provided about Unita in the context.</b>"""))
        

 
#### GraphRAG Direct-to-ArangoDB Response 





> Entering new ArangoGraphQAChain chain...
AQL Query (1):
WITH Event
FOR event IN Event
FILTER CONTAINS(event.description, 'Unita')
RETURN event

AQL Result:
[{'_key': 'ANG6', '_id': 'Event/ANG6', '_rev': '_hqg_m0m-_S', 'date': '1997-02-05T00:00:00.000Z', 'dateStamp': 855100800000, 'description': 'Violence in Kuito continues. Unita accuses FAA of occupying these provinces - Violence in Cleansing here is said to have cost 120,000 lives, a number perhaps exaggerated, but possible given craters made by bombs. An estimated 120,000 fatalities over the entire operation.', 'fatalities': 1000, 'geo': {'type': 'Point', 'coordinates': [-12.383, 16.933]}, 'name': 'Violence in Kuito continues. U', 'label': 'Battles'}, {'_key': 'ANG7', '_id': 'Event/ANG7', '_rev': '_hqg_m0m-_T', 'date': '1997-02-05T00:00:00.000Z', 'dateStamp': 855100800000, 'description': 'Unita accuses FAA of occupying these provinces.', 'fatalities': 0, 'geo': {'type': 'Point', 'coordinates

The database contains several events that mention 'Unita' in their descriptions. Here are some of them:

1. An event with the key 'ANG6' occurred on February 5, 1997, in Kuito. The description mentions violence in Kuito and Unita accusing FAA of occupying these provinces. The event resulted in 1000 fatalities.

2. Another event with the key 'ANG7' happened on the same day. The description states that Unita accused FAA of occupying these provinces. There were no fatalities in this event.

3. On February 22, 1997, an event with the key 'ANG17' took place. The description mentions that Unita accused FAA of going on a military offensive. This event also had no fatalities.

4. An event with the key 'ANG24' occurred on March 24, 1997. The description states that Unita stormed a mission, resulting in 3 fatalities.

5. On April 30, 1997, an event with the key 'ANG26' took place. The description mentions reports from Kenge stating that Unita adopted a scorched earth policy and killed at least 200 people indiscriminately, including 10 local Red Cross workers.

6. An event with the key 'ANG43' occurred on May 7, 1997. The description mentions combat between police and Unita. There were no fatalities in this event.

7. On May 30, 1997, an event with the key 'ANG54' took place. The description states that FAA now controls diamond areas once held by Unita.

8. On June 10, 1997, an event with the key 'ANG72' occurred. The description mentions that FAA attacked the International Migration Organization Column, transporting 53 ex-Unita and their families. This event also had no fatalities.

Please note that these are just a few of the events that mention 'Unita' in their descriptions.

Tokens Used: 7714
	Prompt Tokens: 7289
	Completion Tokens: 425
Successful Requests: 2
Total Cost (USD): $0.24417
Total Activity Cost: $ 0.24417



#### Compare to the in-memory response: 

<b>I'm sorry, but there's no information provided about Unita in the context.</b>

## Confidence and Accuracy

As you can see with the above example, explaining why an answer is correct becomes very difficult without the ability to show the exact documents used to create the answer.
You can also see the logic by evaluating the AQL query returned.
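
One way to make that evidence trail explicit is to keep the generated query together with the `_id` of every document it returned. The `provenance` helper below is a hypothetical sketch; the query and the two abbreviated result documents are copied from the verbose output above.

```python
def provenance(aql_query, aql_result):
    """Pair the executed AQL query with the ids of the documents it returned."""
    return {
        "aql_query": aql_query.strip(),
        "document_ids": [doc["_id"] for doc in aql_result if "_id" in doc],
    }

aql_query = """
WITH Event
FOR event IN Event
FILTER CONTAINS(event.description, 'Unita')
RETURN event
"""

# Abbreviated copies of the first two documents in the AQL result.
aql_result = [
    {"_id": "Event/ANG6", "fatalities": 1000, "label": "Battles"},
    {"_id": "Event/ANG7", "fatalities": 0},
]

record = provenance(aql_query, aql_result)
```

A record like this can be stored next to each answer so that a reviewer can rerun the query and inspect the exact documents behind a response.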


#### GraphRAG Improves Access to Existing Knowledge (sometimes unseen by RAG)

The following is another query that highlights the advanced capabilities of combining the power of RAG and LLMs with existing Knowledge Graphs. 

In [253]:
query = """
Tell me about the Events happening in a random location
"""

display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=True)
display(Markdown(f"""

#### Compare to the in-memory response. It looks impressive, is it true? How can you prove it?: 

<b>In Tizi Ouzou, there have been a series of events including running battles, recent violence, and protests against the government. 
The residents of this location have been demanding greater rights. There have been instances of clashes, riots, and even killings. 
Some of these events were held to commemorate the 3rd anniversary of certain incidents. The location has also been a site of rioting. 
Tizi Ouzou is located in Algeria, 110 km east of Algiers, and is the largest city in the Kabylie province. 
It has also been associated with incidents involving Islamic extremists.</b>

"""))

 
#### GraphRAG Direct-to-ArangoDB Response 





> Entering new ArangoGraphQAChain chain...
AQL Query (1):
WITH Event, hasLocation, Location
FOR event IN Event
    LIMIT 1
    FOR location IN 1 OUTBOUND event._id hasLocation
    RETURN { "Event": event, "Location": location }

AQL Result:
[{'Event': {'_key': 'ALG1', '_id': 'Event/ALG1', '_rev': '_hqg_mgm---', 'date': '1997-01-01T00:00:00.000Z', 'dateStamp': 852076800000, 'description': '5 January: Beheading of 5 citizens in Douaouda (Tipaza).', 'fatalities': 5, 'geo': {'type': 'Point', 'coordinates': [36.672, 2.789]}, 'name': 'Beheading of 5 citizens in Dou', 'label': 'Violence_against_civilians'}, 'Location': {'_key': '61112d48fd89b4045a476e500a51bc5c', '_id': 'Location/61112d48fd89b4045a476e500a51bc5c', '_rev': '_hqg_pUS---', 'name': 'Douaouda'}}]

> Finished chain.


An event took place in Douaouda on January 5, 1997. The event was a violent act against civilians, specifically the beheading of 5 citizens.

Tokens Used: 6275
	Prompt Tokens: 6191
	Completion Tokens: 84
Successful Requests: 2
Total Cost (USD): $0.19077
Total Activity Cost: $ 3.1553999999999998




#### Compare to the in-memory response. It looks impressive, is it true? How can you prove it?: 

<b>In Tizi Ouzou, there have been a series of events including running battles, recent violence, and protests against the government. 
The residents of this location have been demanding greater rights. There have been instances of clashes, riots, and even killings. 
Some of these events were held to commemorate the 3rd anniversary of certain incidents. The location has also been a site of rioting. 
Tizi Ouzou is located in Algeria, 110 km east of Algiers, and is the largest city in the Kabylie province. 
It has also been associated with incidents involving Islamic extremists.</b>



In [241]:
query = """
Are there any events that mention diamonds in the description?
"""
display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=False)

display(Markdown(f"""
#### Compare to the in-memory response: 

<b>The context does not provide any information about events that mention diamonds in their description.</b>"""))
      

 
#### GraphRAG Direct-to-ArangoDB Response 



Yes, there are several events that mention diamonds in their descriptions. Here are some of them:

1. An event with the key 'LBR283' took place on May 5, 2001, where there were protests over UN trade sanctions on Liberian diamonds. There were no fatalities reported.

2. Another event with the key 'LBR284' occurred the next day, on May 6, 2001, with similar protests over UN trade sanctions on Liberian diamonds. This event also had no fatalities.

3. On May 7, 2001, an event with the key 'LBR285' took place, again involving protests over UN trade sanctions on Liberian diamonds. No fatalities were reported.

4. An event with the key 'DRC4523' occurred on August 21, 2009. The description mentions that ex-Burundian rebels began to join the ranks of FDLR rebels, being lured with diamonds, gold, and a job fighting for the last bastion of militant Hutuism in Congo. There were no fatalities reported.

5. On June 6, 2017, an event with the key 'DRC11196' took place where LRA forces armed with automatic weapons raided 2 communities near Gangala, abducting 10 civilians. They then looted food, gold, and diamonds from the mine near Gangala. No fatalities were reported.

6. Lastly, an event with the key 'DRC11240' occurred on June 18, 2017, where 10 LRA fighters attacked a mining camp near Gangala. They looted gold, diamonds, and money, and abducted 8 artisanal miners. No fatalities were reported.

Tokens Used: 7167
	Prompt Tokens: 6790
	Completion Tokens: 377
Successful Requests: 2
Total Cost (USD): $0.22632
Total Activity Cost: $ 0.66126



#### Compare to the in-memory response: 

<b>The context does not provide any information about events that mention diamonds in their description.</b>

In [243]:
query = """
Has there been Violence Against Citizens?
"""

display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=False)

display(Markdown(f"""
#### Compare to the in-memory response: 

<b>Yes, there has been violence against citizens. For instance, citizens were beheaded in Hassasna, and there were instances of civilians being killed in a raid.</b>"""))

 
#### GraphRAG Direct-to-ArangoDB Response 



Yes, there have been several instances of violence against citizens. Here are some examples:

1. On January 1, 1997, 5 citizens were beheaded in Douaouda, Tipaza.
2. On January 2, 1997, two citizens were beheaded in Hassasna.
3. On January 3, 1997, two citizens were killed in a raid on the village of Hassi El Abd.
4. On January 4, 1997, 16 citizens were murdered in the village of Benachour, Blida.
5. On January 5, 1997, 18 citizens, including 3 children and 6 women, were killed in the Oliviers district of Douaouda, Tipaza.
6. On January 6, 1997, 23 citizens were horribly mutilated and killed in Hadjout, Tipaza.
7. On January 7, 1997, the president of the chamber of bailiffs of the Jijel court was killed.
8. On January 10, 1997, a chauffeur from the governmental newspaper El Moudjahid was killed in Bachdjarah, Algiers.
9. On January 11, 1997, 5 citizens were killed in Ouled Chebel.
10. On January 12, 1997, 14 citizens were murdered in Tabannant, Bouinan, Blida.

These are just a few examples, and there may be more instances of violence against citizens.

Tokens Used: 7768
	Prompt Tokens: 7411
	Completion Tokens: 357
Successful Requests: 2
Total Cost (USD): $0.24374999999999997
Total Activity Cost: $ 1.1113799999999998



#### Compare to the in-memory response: 

<b>Yes, there has been violence against citizens. For instance, citizens were beheaded in Hassasna, and there were instances of civilians being killed in a raid.</b>

## Ask for More

With access to a robust knowledge graph backed by advanced query capabilities, you can ask for more than you ever could before!

In [244]:
query = """
Start a graph traversal from an Event with a description containing diamond 
and return the connected location names and countries along with the Event description. 
""" 
display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=True)

 
#### GraphRAG Direct-to-ArangoDB Response 





> Entering new ArangoGraphQAChain chain...
AQL Query (1):
WITH Event, hasLocation, Location, inCountry, Country
FOR event IN Event
    FILTER CONTAINS(event.description, 'diamond')
    FOR v, e, p IN 1..2 OUTBOUND event._id hasLocation, inCountry
        FILTER IS_SAME_COLLECTION('Location', v) OR IS_SAME_COLLECTION('Country', v)
        RETURN { 'Event Description': event.description, 'Location Name': v.name, 'Country': p.vertices[2].name }
AQL Result:
[{'Event Description': 'FAA now controls diamond areas once held by Unita. Not sure if Canfunfo is an example of this, but it is coded anyway.', 'Location Name': 'Quicuhine', 'Country': None}, {'Event Description': 'FAA now controls diamond areas once held by Unita. Not sure if Canfunfo is an example of this, but it is coded anyway.', 'Location Name': 'Angola', 'Country': 'Angola'}, {'Event Description': 'DIAMONDS. Location selected because it is nearest to Luzamba diamond complex where event too

The events related to diamonds are as follows:

1. An event described as "FAA now controls diamond areas once held by Unita. Not sure if Canfunfo is an example of this, but it is coded anyway." took place in the location 'Quicuhine' in 'Angola'.

2. An event described as "DIAMONDS. Location selected because it is nearest to Luzamba diamond complex where event took place (-8.83, 17.9). FAA assault." occurred in 'Calanga', 'Angola'.

3. An event described as "FAA launches offensive into rebel held areas in the diamond producing Lunda Norte" happened in 'Dundo', 'Angola'.

4. An event described as "Luanda denounces Unita killing at Canfunfo diamond area" was reported in 'Quicuhine', 'Angola'.

5. An event described as "This killing took place in the diamond mine area called Bula, 18km S of Lurema. Unita blamed for killing. Unita claims the event was a result of antagonisms between garimpeiros, nationals, and foreign groups. DN claims that the attacks were indiscrimin" took place in 'Muriquixe', 'Angola'.

Tokens Used: 6930
	Prompt Tokens: 6570
	Completion Tokens: 360
Successful Requests: 2
Total Cost (USD): $0.2187
Total Activity Cost: $ 1.3300799999999997


## Putting the Graph in GraphRAG

Clearly, direct-to-database is already outperforming in-memory RAG, but can we actually leverage our graphs?

Spoiler: Yes!



## Advanced Analytics and Aggregations

With a suite of cutting-edge graph database functionality at the LLM's disposal, it can perform complex operations without requiring a graph expert.
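To make the request in the next cell concrete, here is a minimal pure-Python sketch of the group-and-sum step that the LLM is asked to express in AQL. The `sum_fatalities_by_actor` helper and the sample pairs are hypothetical stand-ins for illustration, not the notebook's data or API.

```python
from collections import defaultdict


def sum_fatalities_by_actor(pairs):
    """Group (actor_name, fatalities) pairs by actor and sum the fatalities."""
    totals = defaultdict(int)
    for actor_name, fatalities in pairs:
        totals[actor_name] += fatalities
    return dict(totals)


# Hypothetical pairs, one per actor-to-event edge in the graph.
sample_pairs = [
    ("Actor A", 2),
    ("Actor B", 0),
    ("Actor A", 5),
]

totals = sum_fatalities_by_actor(sample_pairs)
print(totals)
```

In the database the same grouping runs server-side, so only the aggregated rows are returned to the LLM as context.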

In [245]:

query = """
Starting from an Actor, Aggregate all fatalities across all events. Group the number of fatalities with Actor connected to the event. 
 """

display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=True)

 
#### GraphRAG Direct-to-ArangoDB Response 





> Entering new ArangoGraphQAChain chain...
AQL Query (1):
WITH Actor, Event, eventActor
FOR actor IN Actor
  FOR v,e,p IN 1..1 INBOUND actor._id eventActor
    COLLECT actorName = actor.name INTO group KEEP actor, v
    LET totalFatalities = SUM(group[*].v.fatalities)
    RETURN { "actorName": actorName, "totalFatalities": totalFatalities }
AQL Result:
[{'actorName': '"Military" Militia', 'totalFatalities': 2}, {'actorName': '"Taliban" Militia', 'totalFatalities': 1}, {'actorName': '3R Militia (Sidiki)', 'totalFatalities': 12}, {'actorName': '3R: Return, Reclamation, Rehabilitation', 'totalFatalities': 57}, {'actorName': 'AAAJ: Anti-Jihadist African Alliance', 'totalFatalities': 0}, {'actorName': 'Abairege Ethnic Militia (Kenya)', 'totalFatalities': 5}, {'actorName': 'Abanyabasi Ethnic Militia (Kenya)', 'totalFatalities': 4}, {'actorName': 'Abbey Ethnic Militia (Ivory Coast)', 'totalFatalities': 12}, {'actorName': 'ABC: All Basotho Convention', 

The aggregation of fatalities across all events, grouped by the Actor connected to the event, has been completed. Here are some of the results:

- The "Military" Militia was involved in events that resulted in a total of 2 fatalities.
- The "Taliban" Militia was involved in events that resulted in a total of 1 fatality.
- The 3R Militia (Sidiki) was involved in events that resulted in a total of 12 fatalities.
- The 3R: Return, Reclamation, Rehabilitation was involved in events that resulted in a total of 57 fatalities.
- The AAAJ: Anti-Jihadist African Alliance was involved in events that resulted in no fatalities.
- The Abairege Ethnic Militia (Kenya) was involved in events that resulted in a total of 5 fatalities.
- The Abanyabasi Ethnic Militia (Kenya) was involved in events that resulted in a total of 4 fatalities.
- The Abbey Ethnic Militia (Ivory Coast) was involved in events that resulted in a total of 12 fatalities.
- The ABC: All Basotho Convention was involved in events that resulted in a total of 2 fatalities.
- The Abduwak Ethnic Militia (Kenya) was involved in events that resulted in a total of 13 fatalities.

Please note that these are just a few examples and the actual data contains more actors and their associated fatalities.

Tokens Used: 6646
	Prompt Tokens: 6268
	Completion Tokens: 378
Successful Requests: 2
Total Cost (USD): $0.21071999999999996
Total Activity Cost: $ 1.5407999999999997


In [247]:

query = """
Look at the number of fatalities per Event and determine based on associated Actor, who is the deadliest actor based on the number of fatalities they were involved with?
"""
print_graph_rag_with_response(query=query, verbose=False)

The deadliest actor based on the number of fatalities they were involved with is 'UNITA: National Union for the Total Independence of Angola' with a total of 138,703 fatalities. They are followed by 'Military Forces of Angola (1975-)' with 122,411 fatalities, and 'Military Forces of Ethiopia (1995-2018)' with 89,571 fatalities. Other notable actors include 'Military Forces of Eritrea (1993-)' with 66,047 fatalities, and 'Civilians (Democratic Republic of Congo)' with 25,707 fatalities.

Tokens Used: 6507
	Prompt Tokens: 6299
	Completion Tokens: 208
Successful Requests: 2
Total Cost (USD): $0.20144999999999996
Total Activity Cost: $ 2.18679


## Leverage Existing Technology Expertise

You can use ArangoDB Query Language (AQL) terms directly, if you know them, to push a query in a specific direction.

* Utilize Existing Expertise
* Leverage Powerful AQL Features
* Access new ArangoDB features without needing to retrain
* Quick and Easy Experimentation with new features
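As a rough illustration of the kind of AQL these terms steer the LLM toward, the sketch below builds a traversal by hand. The `build_event_traversal` helper is hypothetical, and the edge collection names are taken from the generated queries shown earlier in this notebook. The query the LLM actually generates for the next cell is not shown and may differ.

```python
def build_event_traversal(start_id, depth=3, limit=10):
    """Return an AQL traversal string and its bind variables.

    Hypothetical helper for illustration; not part of the notebook.
    """
    if not isinstance(depth, int) or depth < 1:
        raise ValueError("depth must be a positive integer")
    # Walk up to `depth` hops in ANY direction, keeping only paths
    # whose final vertex belongs to the Event collection.
    query = f"""
FOR v, e, p IN 1..{depth} ANY @start hasLocation, inCountry, eventActor
    FILTER IS_SAME_COLLECTION('Event', v)
    LIMIT @limit
    RETURN p
""".strip()
    bind_vars = {"start": start_id, "limit": limit}
    return query, bind_vars


query_text, bind_vars = build_event_traversal("Country/Angola")
print(query_text)
print(bind_vars)
```

With a python-arango database handle, a query like this could be run with `db.aql.execute(query_text, bind_vars=bind_vars)`. The prompt-based approach lets the LLM write it for you instead.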

In [248]:
query = """
Start a graph traversal from the node with _id of 'Country/Angola' with a depth of 3 and direction of ANY. 
Finally, return any paths whose final node has an '_id' value starts with 'Event'
Summarize the events as a new reporter would.
"""

display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=False)

 
#### GraphRAG Direct-to-ArangoDB Response 



Based on the data retrieved from the database, several significant events have occurred in Angola:

1. On December 27, 2019, two people were injured in Kilamba Kiaxi, Luanda, after mistakenly detonating an explosive device which they had collected as scrap metal. No fatalities were reported.

2. On April 20, 2019, a 27-year-old citizen was killed by lynching in Luanda for shooting and attempting to assault a street vendor.

3. On March 8, 2019, in the municipality of Belas, Angolan women marched demanding an end to violence against women.

4. On February 27, 2019, people from Capenda Camulemba, including a number of "sobas" (traditional leaders), clashed with police and soldiers at the Lulo Diamond Project site over failure to uphold an agreement. One person was killed and one was wounded.

5. On January 8, 2019, Military Forces of Angola clashed with Armed Forces of Cabinda (FLEC / FLAC) near Tchiminzi ville in Massabi. 12 people were killed, including 4 civilians.

6. On October 8, 2018, a fuze mortar detonated when kids were playing with it, leaving one killed and two seriously injured in the Chatala sector, Cela district.

7. Around May 25, 2018, the INAD - National Demining Institute - destroyed 14,796 unexploded devices in Icau, Dande.

8. On October 20, 2017, the National Demining Institute de-mined 400 explosive devices in the municipality of Dande.

9. Around May 16, 2017, Angolan security forces fought Congolese Kamwina Nsapu militia who had crossed into Lunda Norte border posts from their positions in Congo's Kasai on nine occasions between March and mid-June. In one attack, Kamwina Nsapu militia attacked the Itanda border post, beheading an Angolan border official.

10. On February 14, 2017, in Seva Tando-Macuco, FLEC-FAC militants attacked a military patrol.

Tokens Used: 11507
	Prompt Tokens: 10994
	Completion Tokens: 513
Successful Requests: 2
Total Cost (USD): $0.3606
Total Activity Cost: $ 2.5473899999999996


In [246]:

query = """
Review all of the Events and determine, based on the connections to Actors, Locations, and number of fatalities
Who are the most dangerous Actors and provide a short summary as to why they are considered dangerous. 
Return the Actor name, number of fatalities, number of Event, and a list of locations the events occurred in.

IMPORTANT: Find locations via graph traversal going OUTBOUND from Event to hasLocation. Keep in mind that the edge eventActor is INBOUND to Actor. LIMIT the Events to 100 per Actor. Keep in mind that when doing a COLLECT you have to use KEEP in order to not lose variables that were declared before it.
For example, if you try and do this 'COLLECT actorName = actor.name WITH COUNT INTO eventCount' it should instead be something more like: 'COLLECT actorName = actor.name INTO eventCount KEEP event'
"""

display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=False)

 
#### GraphRAG Direct-to-ArangoDB Response 



The most dangerous actors based on the number of fatalities and events are as follows:

1. The 'Zaghawa Ethnic Militia (Chad)' is the most dangerous with a total of 50 fatalities across 17 events. The events occurred in various locations including Iriba, Tine-Djagaraba, Bardai, Miski, Biere Saraf, Dar Tama, Abeche, and Am Timan.

2. The 'Wardei Ethnic Militia (Kenya)' has caused 48 fatalities across 4 events. The events took place in Wachu-Oda, Hola, Kipini, and Kinakomb.

3. The 'Young Peace Guard' has been involved in 17 events resulting in 42 fatalities. These events occurred in locations such as Buterere, Munege, Rumonge, Cabara, Rutwenzi, Kizuka, Mutakura, Gihanga, Gatare, Muhanda, Kibira National Park, Mudende, Buruhukiro, and Gatura.

4. The 'Zaraguinas' have caused 40 fatalities across 13 events. The events took place in Isiro, Paoua, Bohong, Bossangoa, Boudingui, Moyenne Sido, Gbada 1, Yaloke, and Bambari.

5. The 'Zeyle Ethnic Militia (Ethiopia)' has caused 40 fatalities in a single event that occurred in Argoba.

Other dangerous actors include the 'Witch Hunters Militia (Gambia)', 'We Ethnic Militia (Ivory Coast)', 'Yacouba Ethnic Militia (Ivory Coast)', 'Zangba Communal Militia (Central African Republic)', and 'Yimbu Communal Militia (Democratic Republic of Congo)'. These actors have been involved in fewer events but have still caused significant fatalities.

Tokens Used: 13869
	Prompt Tokens: 12920
	Completion Tokens: 949
Successful Requests: 4
Total Cost (USD): $0.44453999999999994
Total Activity Cost: $ 1.9853399999999997


# Beyond GraphRAG with ArangoSearch

The ArangoDB differentiator is the ability to combine LLM, graph, and built-in search capabilities.

In [249]:
db = %useDatabase open_intelligence
db.delete_view("ActorView", ignore_missing=True)
db.create_view(
    "ActorView",
    "arangosearch",
    {
        "links": {
            "Actor": {
                "analyzers": ["text_en"],
                "fields": {
                    "name": { # <------ Enable Actor search by `name`
                        "analyzers": ["text_en"],
                    },
                },
                "includeAllFields": True,
                "storeValues": "none",
                "trackListPositions": False,
            },
        }
    },
)

True

Using Database 'open_intelligence' now.


True
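With the view in place, a hand-written ArangoSearch query over `ActorView` might look like the sketch below. The `build_actor_name_search` helper is hypothetical. It assumes the `text_en` analyzer configured in the view above, and the LLM may generate a different query for the prompts that follow.

```python
def build_actor_name_search(phrase):
    """Return an ArangoSearch AQL string and bind variables for a name phrase.

    Hypothetical helper for illustration; not part of the notebook.
    """
    # PHRASE matches the tokens of `phrase` in order; ANALYZER applies the
    # same text_en analyzer that the view uses to index Actor.name.
    query = """
FOR actor IN ActorView
    SEARCH ANALYZER(PHRASE(actor.name, @phrase), "text_en")
    RETURN actor.name
""".strip()
    return query, {"phrase": phrase}


search_query, search_bind_vars = build_actor_name_search("United Nations")
print(search_query)
print(search_bind_vars)
```

Because the search runs against the view's index, it avoids the full collection scan that a `CONTAINS` filter performs.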

In [250]:
query = """
ArangoSearch: Fetch me the actors whose name has 'United Nations' in it. Use the ActorView.
"""

display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=False)

 
#### GraphRAG Direct-to-ArangoDB Response 



The actors whose names contain 'United Nations' are as follows:

1. UNOCI: United Nations Operation in Ivory Coast (2004-)
2. UNMIL: United Nations Mission in Liberia (2003-)
3. UNMEE: United Nations Mission in Ethiopia and Eritrea (2000-2008)
4. UNMAS: United Nations Mine Action Service
5. UNAMSIL: United Nations Mission in Sierra Leone (1999-2005)
6. UN: United Nations
7. MONUSCO: United Nations Organization Stabilization Mission in Democratic Republic of Congo (2010-)
8. MONUC: United Nations Organization Mission in Democratic Republic of Congo (1999-2010)
9. MONUA: United Nations Observer Mission in Angola
10. MINUSCA: United Nations Multidimensional Integrated Stabilization Mission in the Central African Republic (2014-)

Tokens Used: 7031
	Prompt Tokens: 6814
	Completion Tokens: 217
Successful Requests: 2
Total Cost (USD): $0.21744
Total Activity Cost: $ 2.7648299999999995


In [251]:
query = """
  ArangoSearch: Use the ActorView to fetch me Actors whose name has the word 'militia' in it.
  From those actors, fetch me the Actor with the highest number of associated events.
  With those events, return their sum of fatalities.
  Remember to use INBOUND for eventActor!
"""

display(Markdown(f""" 
#### GraphRAG Direct-to-ArangoDB Response 

"""))
print_graph_rag_with_response(query=query, verbose=False)

 
#### GraphRAG Direct-to-ArangoDB Response 



The actor named 'Mayi Mayi Militia' has the highest number of associated events among all actors whose name contains the word 'militia'. This actor is associated with 978 events in total. The sum of fatalities from these events is 2711.

Tokens Used: 6444
	Prompt Tokens: 6228
	Completion Tokens: 216
Successful Requests: 2
Total Cost (USD): $0.19979999999999998
Total Activity Cost: $ 2.9646299999999997


# Wrap-up

### Total Activity Cost
**RAG In-Memory**: `$50.138989`

In-memory costs will continue to grow with new insertions, because new data must be parsed and inserted into the local store. 
This could mean reprocessing all or part of the KG, and that is before running any queries.

**GraphRAG Direct-to-Database**: `$2.964629`

With the GraphRAG solution, query costs are predictable, but they scale with the amount of context supplied. 

This context is the set of documents returned from a query, so it can be tuned through the requested output, the number of results, and other optimizations such as pre-computing and storing embeddings.
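The arithmetic behind these figures can be sketched as follows. The per-token rates are an assumption, inferred because they reproduce the per-query costs reported above (for example, 6790 prompt tokens and 377 completion tokens for the diamonds query). They are not stated in the notebook and will differ by model.

```python
# Assumed rates in USD per 1,000 tokens, inferred from the reported costs.
PROMPT_RATE_PER_1K = 0.03
COMPLETION_RATE_PER_1K = 0.06


def query_cost(prompt_tokens, completion_tokens):
    """Estimate the cost of one query from its token counts."""
    return (prompt_tokens * PROMPT_RATE_PER_1K
            + completion_tokens * COMPLETION_RATE_PER_1K) / 1000


# Token counts reported for the diamonds query earlier in this notebook.
diamond_cost = query_cost(6790, 377)

# Totals reported in this wrap-up section.
rag_total = 50.138989
graphrag_total = 2.964629
ratio = rag_total / graphrag_total

print(f"Diamonds query cost: ${diamond_cost:.5f}")
print(f"In-memory RAG cost / GraphRAG cost: {ratio:.1f}x")
```

Prompt tokens dominate each query's cost, which is why trimming the context returned from the database is the main lever for reducing spend.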

--------------

LLMs are powerful tools that can be used throughout your entire application pipeline.
However, to achieve high performance at scale with highly accurate results (or any results at all), you MUST leverage existing knowledge stored in a graph database.

With the direct-to-knowledge-graph approach, you gain: 
* Efficiency
* Improved Accuracy
* Reduced Complexity
* Explainability (AQL)

<img src="resources/images/rvg2.png" style="width:800px; height:auto;">