# Knowledge Graph RAG: Roman Empire Wikipedia

## Introduction

This project combines a knowledge graph with Retrieval-Augmented Generation (RAG) to build a question-answering system about the Roman Empire that draws on both entity relationships and the source text.

### What is Knowledge Graph RAG?

Knowledge Graph RAG is an advanced approach that enhances traditional RAG systems by incorporating structured knowledge in the form of a graph. While standard RAG systems retrieve relevant documents based on vector similarity, Knowledge Graph RAG adds the ability to:

1. **Capture relationships between entities** - Understanding connections between emperors, battles, territories, and historical events
2. **Perform multi-hop reasoning** - Following paths through the knowledge graph to answer complex questions
3. **Provide more contextual information** - Leveraging both unstructured text and structured relationships
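
To make the multi-hop idea concrete, here is a minimal sketch that follows relationship paths in a tiny in-memory graph. The triples are hand-written for illustration and `find_path` is a hypothetical helper, not part of this notebook's pipeline; a real system would run the traversal in the graph database.

```python
from collections import deque

# Hypothetical toy triples, hand-written for illustration only
TRIPLES = [
    ("Octavian", "ASSUMED_TITLE", "Augustus"),
    ("Augustus", "FIRST_EMPEROR", "Roman Empire"),
    ("Odoacer", "DEPOSED", "Romulus Augustus"),
]

def find_path(triples, start, goal):
    """Breadth-first search over directed triples; returns the hops taken or None."""
    adjacency = {}
    for source, relation, target in triples:
        adjacency.setdefault(source, []).append((relation, target))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None

# Two hops are needed to connect Octavian to the Roman Empire
path = find_path(TRIPLES, "Octavian", "Roman Empire")
for source, relation, target in path:
    print(f"{source} - {relation} -> {target}")
```

A pure vector search would have to find a single chunk stating both facts; the graph lets the two hops come from different sources.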

### Project Components

This notebook demonstrates:
- Loading and processing Roman Empire historical data
- Building a Neo4j knowledge graph with key entities and relationships
- Creating vector embeddings for efficient retrieval
- Implementing a **hybrid retrieval system** that combines:
  - Vector similarity search
  - Graph traversal queries
- Using LLMs to generate accurate, contextual responses based on retrieved information

### Benefits Over Traditional RAG

- **Reduced hallucinations** - The structured knowledge provides factual grounding
- **Better handling of complex queries** - Can answer questions requiring understanding of relationships
- **Improved explainability** - The graph structure makes reasoning paths more transparent
- **Enhanced context awareness** - Combines the strengths of both knowledge graphs and vector retrieval

Let's explore how this approach can provide richer insights into the history of the Roman Empire!

<img src='./kg-rag-process.png'>

In [None]:
import os
from dotenv import load_dotenv
from typing import List, Tuple
from pydantic import BaseModel, Field

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from langchain_neo4j import Neo4jGraph, Neo4jVector
from langchain_community.document_loaders import WikipediaLoader

from langchain_experimental.graph_transformers import LLMGraphTransformer

from langchain.text_splitter import TokenTextSplitter

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.output_parsers import StrOutputParser

from langchain_core.runnables import (
    RunnableLambda,
    RunnableBranch,
    RunnableParallel,
    RunnablePassthrough
)

from langchain_neo4j.vectorstores.neo4j_vector import remove_lucene_chars

In [3]:
# Load env variables
load_dotenv()

NEO4J_URI=os.getenv('NEO4J_URI')
NEO4J_USERNAME=os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD=os.getenv('NEO4J_PASSWORD')
AURA_INSTANCEID=os.getenv('AURA_INSTANCEID')
AURA_INSTANCENAME=os.getenv('AURA_INSTANCENAME')

OPENAI_API_KEY=os.getenv('OPENAI_API_KEY')
OPENAI_ENDPOINT=os.getenv('OPENAI_ENDPOINT')

In [4]:
llm = ChatOpenAI(
    api_key=OPENAI_API_KEY,
    model='gpt-4o-mini',
    temperature=0
)

In [5]:
kg = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
)


In [None]:
# Read Wikipedia page for Roman Empire
raw_documents = WikipediaLoader(query='The Roman Empire').load()

print(raw_documents)




In [None]:
# Define text splitter with chunking strategy
text_splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=24,
)
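
As a rough illustration of what `chunk_size` and `chunk_overlap` mean, the sketch below slides a window over a token list. It assumes the tokens are already available as a plain Python list; `chunk_tokens` is a hypothetical helper, and the real splitter works on model tokens rather than these placeholder strings.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks

# Placeholder tokens t0..t9, small sizes so the overlap is easy to see
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence cut at a chunk boundary keeps a little shared context on both sides.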

In [None]:
# Split raw document into chunks
documents = text_splitter.split_documents(raw_documents)

print(documents)



In [None]:
# LLM-based graph transformer: extracts nodes and relationships from text
llm_transformer = LLMGraphTransformer(llm=llm)

In [12]:
# Convert documents to graph documents
graph_documents = llm_transformer.convert_to_graph_documents(documents)

print(graph_documents)



In [None]:
# Store to Neo4j
result = kg.add_graph_documents(
    graph_documents=graph_documents,
    # Link each node to its source document so answers can be traced back
    include_source=True,
    # Add a shared __Entity__ label to every node; the full-text index
    # and entity lookups below rely on it
    baseEntityLabel=True
)

None


In [None]:
# Hybrid retrieval for RAG using Neo4jVector,
# which can combine keyword (full-text) and vector indexes in a single search

# Create vector index
vector_index = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    node_label='Document',
    embedding_node_property='embedding',
    search_type='hybrid', # 'vector' or 'hybrid'
    text_node_properties=['text']   
)
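
One plausible way to combine keyword and vector results is sketched below: normalise each result list by its top score, then keep the best score per document. This is an illustrative assumption, not necessarily how `Neo4jVector` merges scores internally; `merge_hybrid`, the chunk names, and the scores are all made up for the example.

```python
def merge_hybrid(vector_hits, keyword_hits, k=4):
    """Normalise each result list by its top score, then keep the best score per document."""
    def normalise(hits):
        top = max(hits.values(), default=0.0)
        return {doc: (score / top if top else 0.0) for doc, score in hits.items()}

    merged = {}
    for hits in (normalise(vector_hits), normalise(keyword_hits)):
        for doc, score in hits.items():
            merged[doc] = max(score, merged.get(doc, 0.0))
    # Highest combined score first, truncated to the top k documents
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical scores: cosine similarity for vectors, full-text relevance for keywords
vector_hits = {"chunk_a": 0.92, "chunk_b": 0.80}
keyword_hits = {"chunk_b": 3.0, "chunk_c": 1.5}
ranked = merge_hybrid(vector_hits, keyword_hits)
for doc, score in ranked:
    print(doc, round(score, 2))
```

Normalising first matters because the two indexes score on different scales, so raw values cannot be compared directly.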

In [40]:
# Create a full-text index on entity ids, used by the graph retriever
kg.query('CREATE FULLTEXT INDEX entity IF NOT EXISTS FOR (e:__Entity__) ON EACH [e.id]')

[]

In [41]:
# Extract entities from text:
# the LLM reads the incoming text and returns the entities it mentions

class Entities(BaseModel):
    '''Identifying information about entities'''
    names: List[str] = Field(
        ...,
        description='All the person, organization or business entities that appear in the text'
    )

In [42]:
# Prompt that instructs the LLM to extract entities from the input text
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an assistant that can extract entities from text."),
        ("human", "Use the given format to extract information from the following text: {question}")
    ]
)

In [43]:
# Define chain to extract entities from text
entity_chain = prompt | llm.with_structured_output(Entities)

In [44]:
# Test
res = entity_chain.invoke({'question': 'My name is Anh and my sister is Van'})
print(res)

names=['Anh', 'Van']


In [59]:
def generate_full_text_query(text: str) -> str:
    '''
    Generate a full-text search query for a given input string.

    Strips Lucene special characters, splits the input into words, appends a
    fuzzy-match suffix (~2, i.e. up to two changed characters) to each word,
    and joins the words with the AND operator.

    Useful for mapping entities from user questions to database values while
    tolerating some misspellings. Returns an empty string for empty input.
    '''
    words = [el for el in remove_lucene_chars(text).split() if el]
    # join() also handles an empty word list, where words[-1] would raise IndexError
    return ' AND '.join(f'{word}~2' for word in words)
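
To show the shape of the query string without a database connection, here is a self-contained sketch. `strip_special_chars` is a hypothetical stand-in for `remove_lucene_chars` (it simply replaces punctuation with spaces and may not match the library's exact behaviour), and `build_fuzzy_query` is an illustrative name.

```python
import re

def strip_special_chars(text: str) -> str:
    """Hypothetical stand-in for remove_lucene_chars: replace punctuation with spaces."""
    return re.sub(r"[^\w\s]", " ", text)

def build_fuzzy_query(text: str) -> str:
    """Append ~2 (fuzzy match, up to two edits) to each word and join with AND."""
    words = strip_special_chars(text).split()
    return " AND ".join(f"{word}~2" for word in words)

query = build_fuzzy_query("Marcus Aurelius?")
print(query)  # Marcus~2 AND Aurelius~2
```

Because every word carries `~2`, a slightly misspelled name can still match the stored entity id.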

In [60]:
# Full-text index query
def structured_retriever(question: str) -> str:
    '''Collect the neighborhood of entities mentioned in the question.'''
    lines = []
    entities = entity_chain.invoke({'question': question}).names

    for entity in entities:
        print(f'Getting Entity: {entity}')
        # Collect neighborhood of entities mentioned in the question
        response = kg.query(
            """CALL db.index.fulltext.queryNodes('entity', $query, {limit:2})
            YIELD node,score
            CALL (node) {
              WITH node
              MATCH (node)-[r:!MENTIONS]->(neighbor)
              RETURN node.id + ' - ' + type(r) + ' -> ' + neighbor.id AS output
              UNION ALL
              WITH node
              MATCH (node)<-[r:!MENTIONS]-(neighbor)
              RETURN neighbor.id + ' - ' + type(r) + ' -> ' +  node.id AS output
            }
            RETURN output LIMIT 50
            """,
            {"query": generate_full_text_query(entity)},
        )
        lines.extend(el["output"] for el in response)
    # Join once at the end so results from different entities stay on separate lines
    return "\n".join(lines)


In [61]:
# Test 
print(structured_retriever('Who is Commodus?'))

Getting Entity: Commodus
Commodus - MARKED_DECLINE -> Roman Empire
Commodus - BEGINNING_OF -> Crisis Of The Third Century
Commodus - RULED_OVER -> Ancient City-State Of Rome
Roman Empire - RULED_UNDER -> Commodus
Cassius Dio - COMMENTED_ON -> Commodus
Edward Gibbon - HISTORICAL_VIEW -> Commodus
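
The neighborhood lookup can be mimicked in plain Python to see what the query is doing: collect outgoing and then incoming relationships of an entity, skip `MENTIONS` links to source documents, and render each as a text line. The edge list and the `neighborhood` helper below are hypothetical and only approximate the database query.

```python
# Hypothetical relationship rows: (source, relationship type, target)
EDGES = [
    ("Commodus", "RULED_OVER", "Ancient City-State Of Rome"),
    ("Cassius Dio", "COMMENTED_ON", "Commodus"),
    ("Document-1", "MENTIONS", "Commodus"),
]

def neighborhood(edges, entity, limit=50):
    """Return outgoing, then incoming, non-MENTIONS relationships of entity as text lines."""
    outgoing = [e for e in edges if e[0] == entity and e[1] != "MENTIONS"]
    incoming = [e for e in edges if e[2] == entity and e[1] != "MENTIONS"]
    return [f"{s} - {r} -> {t}" for s, r, t in (outgoing + incoming)[:limit]]

lines = neighborhood(EDGES, "Commodus")
print("\n".join(lines))
```

Excluding `MENTIONS` keeps the context focused on entity-to-entity facts rather than on which chunk an entity came from.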


In [None]:
# Combine the graph (structured) and vector (unstructured) retrievers to create the final context

# Final retrieval step
def retriever(question: str) -> str:
    print(f'Search query: {question}')
    structured_data = structured_retriever(question)
    unstructured_data = [
        el.page_content for el in vector_index.similarity_search(question)
    ]
    # Build the joined text outside the f-string: a backslash inside an
    # f-string expression is a SyntaxError before Python 3.12
    unstructured_text = '#Document '.join(f'{el}\n' for el in unstructured_data)
    final_data = f'''Structured data:
{structured_data}
Unstructured data:
{unstructured_text}
    '''
    print(f'\nFinal Data: {final_data}')
    return final_data

In [70]:
# Define the RAG chain

# Condense a chat history and follow-up question into a standalone question
_template = '''Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question,
in its original language.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:'''

CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

In [71]:
def _format_chat_history(chat_history: List[Tuple[str, str]]) -> List:
    buffer = []
    for human, ai in chat_history:
        buffer.append(HumanMessage(content=human))
        buffer.append(AIMessage(content=ai))
    return buffer


In [72]:
_search_query = RunnableBranch(
    # If input includes chat_history, we condense it with the follow-up question
    (
        RunnableLambda(lambda x: bool(x.get("chat_history"))).with_config(
            run_name="HasChatHistoryCheck"
        ),  # Condense follow-up question and chat into a standalone_question
        RunnablePassthrough.assign(
            chat_history=lambda x: _format_chat_history(x["chat_history"])
        )
        | CONDENSE_QUESTION_PROMPT
        | llm  # reuse the model configured above instead of a default ChatOpenAI
        | StrOutputParser(),
    ),
    # Else, we have no chat history, so just pass through the question
    RunnableLambda(lambda x: x["question"]),
)
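
Stripped of the runnable wrappers, the branch amounts to a simple conditional. The sketch below assumes a hypothetical `condense` callable in place of the prompt-plus-LLM step; `fake_condense` just substitutes a fixed name so the example runs without a model.

```python
def to_search_query(inputs, condense):
    """Return a standalone search query, condensing only when chat history is present."""
    if inputs.get("chat_history"):
        return condense(inputs["chat_history"], inputs["question"])
    return inputs["question"]

def fake_condense(history, question):
    # Stand-in for the LLM step: resolves the pronoun with a fixed name for the demo
    return question.replace(" he ", " Augustus ")

first = to_search_query({"question": "Who was the first emperor?"}, fake_condense)
follow_up = to_search_query(
    {
        "question": "How did he become the first emperor?",
        "chat_history": [("Who was the first emperor?", "The first emperor was Augustus.")],
    },
    fake_condense,
)
print(first)
print(follow_up)
```

Condensing before retrieval matters because a follow-up such as "How did he..." contains no entity name for the graph lookup to match.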

In [73]:
template = '''Answer the question based only on the following context:
{context}

Question: {question}
Use natural language and be concise.
Answer:'''

final_prompt = PromptTemplate.from_template(template)

In [74]:
# Final chain
chain = (
    RunnableParallel(
        {
            "context": _search_query | retriever,
            # Pass only the question text, not the whole input dict, to the prompt
            "question": RunnableLambda(lambda x: x["question"]),
        }
    )
    | final_prompt | llm | StrOutputParser()
)

In [83]:
# Test it all out
answer = chain.invoke({'question': 'Who was the first emperor?'})

Search query: Who was the first emperor?
Getting Entity: emperor





Final Data: Structured data:
Roman Emperor - RULER_OF -> Roman Empire
Roman Emperor - RECOGNITION_BY -> Senate
Roman Emperor - CONTROLLED_BY -> Roman Army
Roman Emperor - LEADER_OF -> Christian Church
Octavian - GRANTED_TITLE -> Roman Emperor
Diocletian - REFORMED -> Roman Emperor
Constantine The Great - FIRST_CHRISTIAN_EMPEROR -> Roman Emperor
Papacy - REGARDED_AS -> Medieval German Emperors
Unstructured data:

text: The Roman emperor was the ruler and monarchical head of state of the Roman Empire, starting with the granting of the title augustus to Octavian in 27 BC. The term emperor is a modern convention, and did not exist as such during the Empire. When a given Roman is described as becoming emperor in English, it generally reflects his accession as augustus, and later as basileus. Another title used was imperator, originally a military honorific, and caesar, originally a cognomen. Early emperors also used the title princeps ("first one") alongside other Republican titles, notabl

In [84]:
print(answer)

The first emperor was Augustus, also known as Octavian.


In [87]:
# History
answer_with_history = chain.invoke(
    {
        'question': 'How did he become the first emperor?',
        'chat_history': [
            ('Who was the first emperor?', 'The first emperor was Augustus, also known as Octavian.')
        ]
    }
)


Search query: How did Augustus become the first emperor?
Getting Entity: Augustus





Final Data: Structured data:
Augustus - TITLE_OF -> Eastern Roman Empire
Augustus - FIRST_EMPEROR -> Roman Empire
Octavian - ASSUMED_TITLE -> Augustus
Egypt - INCORPORATED_BY -> Augustus
Roman Empire - DEPOSED -> Romulus Augustus
Odoacer - DEPOSED -> Romulus Augustus
Odoacer - FORCED_ABDICATION_OF -> Romulus Augustus
Unstructured data:

text: The Roman emperor was the ruler and monarchical head of state of the Roman Empire, starting with the granting of the title augustus to Octavian in 27 BC. The term emperor is a modern convention, and did not exist as such during the Empire. When a given Roman is described as becoming emperor in English, it generally reflects his accession as augustus, and later as basileus. Another title used was imperator, originally a military honorific, and caesar, originally a cognomen. Early emperors also used the title princeps ("first one") alongside other Republican titles, notably consul and pontifex maximus.
The legitimacy of an emperor's rule depended o

In [88]:
print(answer_with_history)

Augustus became the first emperor by being granted the title "Augustus" by the Roman Senate in 27 BC, following his victory over Mark Antony and Cleopatra at the Battle of Actium, which established his effective sole rule over the Roman Empire.
