# Test out GraphRag

[Building a Graph RAG System: A Step-by-Step Approach](https://machinelearningmastery.com/building-graph-rag-system-step-by-step-approach/)

## Why GraphRag

Regular RAG systems treat retrieved documents as independent units, so the dependencies between them can be lost and the answers generated by the LLM can come out fragmented. The article [Building a Graph RAG System: A Step-by-Step Approach](https://machinelearningmastery.com/building-graph-rag-system-step-by-step-approach/) gives a good example of such a fragmented answer from regular RAG:

    In a traditional RAG setup, the system might retrieve the following pieces of information:

    Document 1: “James Watson and Francis Crick proposed the double-helix structure in 1953.”
    Document 2: “Rosalind Franklin’s X-ray diffraction images were critical in identifying DNA’s helical structure.”
    Document 3: “Maurice Wilkins shared Franklin’s images with Watson and Crick, which contributed to their discovery.”
    The problem? Traditional RAG systems treat these documents as independent units. They don’t connect the dots effectively, leading to fragmented responses like: 

    “Watson and Crick proposed the structure, and Franklin’s work was important.”

    This response lacks depth and misses key relationships between contributors. Enter Graph RAG! By organizing the retrieved data as a graph, Graph RAG represents each document or fact as a node, and the relationships between them as edges.

    Here’s how Graph RAG would handle the same query:

    Nodes: Represent facts (e.g., “Watson and Crick proposed the structure,” “Franklin contributed critical X-ray images”).
    Edges: Represent relationships (e.g., “Franklin’s images → shared by Wilkins → influenced Watson and Crick”).
    By reasoning across these interconnected nodes, Graph RAG can produce a complete and insightful response like:

    “The discovery of DNA’s double-helix structure in 1953 was primarily led by James Watson and Francis Crick. However, this breakthrough heavily relied on Rosalind Franklin’s X-ray diffraction images, which were shared with them by Maurice Wilkins.”

    This ability to combine information from multiple sources and answer broader, more complex questions is what makes Graph RAG so popular.
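
The node-and-edge idea in the quoted passage can be sketched with plain Python. This is only an illustration under stated assumptions: the edge list is hand-written from the quoted example, and `build_adjacency` and `find_path` are hypothetical helper names, not part of any Graph RAG library.

```python
from collections import deque

# Directed edges written by hand from the quoted example: (source, relation, target).
EDGES = [
    ("Franklin's X-ray images", "shared by", "Wilkins"),
    ("Wilkins", "influenced", "Watson and Crick"),
    ("Watson and Crick", "proposed", "double-helix structure"),
]


def build_adjacency(edges):
    """Map each node to its outgoing (relation, target) pairs."""
    adjacency = {}
    for source, relation, target in edges:
        adjacency.setdefault(source, []).append((relation, target))
    return adjacency


def find_path(edges, start, goal):
    """Breadth-first search; return the chain of hops linking start to goal, or None."""
    adjacency = build_adjacency(edges)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, target in adjacency.get(node, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, path + [(node, relation, target)]))
    return None


path = find_path(EDGES, "Franklin's X-ray images", "double-helix structure")
for source, relation, target in path:
    print(f"{source} -[{relation}]-> {target}")
```

Walking the edges recovers the full chain from Franklin's images to the double-helix proposal, which is the connection the fragmented answer misses.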



## Initiate RAG Building Blocks

Deploy Azure OpenAI services, including an LLM and an embedding model. The models are deployed as Azure endpoints in an [AI Foundry workspace](https://oai.azure.com/resource/overview?wsid=/subscriptions/d91792a2-c9bd-44bc-bcd8-fdddc7ceb1c5/resourceGroups/agentic_applications/providers/Microsoft.CognitiveServices/accounts/multi-agentic-applications&tid=565f1c8e-754e-473e-8352-ac5b86a38c93). Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT` and `OPENAI_API_VERSION` in the `.env` file.
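
Before connecting, it can help to confirm that the three settings are actually visible to the process. The check below is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper name, and it assumes the `.env` file has already been loaded into the environment.

```python
import os

# The three settings named above.
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_API_VERSION")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```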

In [1]:
import dotenv
import sys
from pathlib import Path

# Set up environment
sys.path.append(str(Path.cwd().parent))  # Append project home to system path (sys.path entries must be strings)
dotenv.load_dotenv(override=True)  # Load .env

True

In [2]:
# Test Azure connection
import openai

client = openai.AzureOpenAI(
    api_version="2025-01-01-preview",
)

# gpt-4o-mini only supports chat completions, so use client.chat.completions.create
# instead of client.completions.create.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Test prompt"}],
)

response

ChatCompletion(id='chatcmpl-BAzeI3ugv7hpA9c6q5mJol2FXxsZZ', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='It looks like you are testing the prompt functionality. How can I assist you today? If you have any questions or specific tasks in mind, feel free to let me know!', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}})], created=1741959958, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_b705f0c291', usage=CompletionUsage(completion_tokens=35, prompt_tokens=9, total_tokens=44, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_pred

### Setup RAG Building Blocks

* LLM
* Embedding
* Vector store

In [3]:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

# Connect to the chat model.
# Here we use AzureChatOpenAI instead of AzureOpenAI to connect to gpt-4o-mini
llm = AzureChatOpenAI(azure_deployment="gpt-4o-mini")

# Connect to embedding
embeddings = AzureOpenAIEmbeddings(model="text-embedding-3-large")

# Instantiate vector store
vector_store = InMemoryVectorStore(embeddings)
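
To make the role of the vector store concrete, here is a toy sketch of what an in-memory store does conceptually: embed each text, then rank stored texts by cosine similarity to the query. This is not how `InMemoryVectorStore` is implemented. `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, and `ToyVectorStore` is an illustrative name.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them by similarity."""

    def __init__(self):
        self._items = []

    def add_texts(self, texts):
        self._items.extend((text, toy_embed(text)) for text in texts)

    def similarity_search(self, query, k=2):
        query_vector = toy_embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "Albert Einstein born in 1879",
    "Isaac Newton formulated Laws of Motion",
    "Marie Curie pioneered Radioactivity",
])
print(store.similarity_search("Where was Einstein born", k=1))
```

A real embedding model replaces word counts with dense vectors, so semantically related texts match even when they share no words.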

In [4]:
# Test azure connection
llm.invoke("Tell me a joke")

AIMessage(content="Why don't skeletons fight each other? \n\nThey don't have the guts!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 15, 'prompt_tokens': 11, 'total_tokens': 26, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_b705f0c291', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}], 'finish_reason': 'stop', 'logprobs': None, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 

## Chunk Documents

In [5]:
# Download data
import pandas as pd
news = pd.read_csv("https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv")[:50]
news[:1]

Unnamed: 0,title,date,text
0,Chevron: Best Of Breed,2031-04-06T01:36:32.000000000+00:00,JHVEPhoto Like many companies in the O&G secto...


In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Convert each article's text to a LangChain Document
news_documents = [Document(page_content=text) for text in news["text"]]
news_documents[:1]

[Document(metadata={}, page_content='JHVEPhoto Like many companies in the O&G sector, the stock of Chevron (NYSE:CVX) has declined about 10% over the past 90-days despite the fact that Q2 consensus earnings estimates have risen sharply (~25%) during that same time frame. Over the years, Chevron has kept a very strong balance sheet. That allowed the...')]

In [7]:
# Splits text into chunks of 500 characters with a 100-character overlap to maintain context between chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
all_splits = text_splitter.split_documents(news_documents)
all_splits[:1]

[Document(metadata={}, page_content='JHVEPhoto Like many companies in the O&G sector, the stock of Chevron (NYSE:CVX) has declined about 10% over the past 90-days despite the fact that Q2 consensus earnings estimates have risen sharply (~25%) during that same time frame. Over the years, Chevron has kept a very strong balance sheet. That allowed the...')]
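
The effect of `chunk_size=500` and `chunk_overlap=100` can be illustrated with a simplified sliding window. This is an assumption-laden sketch, not the splitter's algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, whereas the hypothetical `chunk_text` below cuts at fixed character offsets.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows whose neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this many characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text)
print([len(chunk) for chunk in chunks])  # [500, 500, 400]
print(chunks[0][-100:] == chunks[1][:100])  # True: neighbouring chunks share 100 characters
```

The shared 100 characters mean a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.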

## Extract Knowledge Graph

In [8]:
from typing import Callable

from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.prompts import PromptTemplate

from langchain_community.graphs.networkx_graph import NetworkxEntityGraph

class GraphRAGExtractor:
   '''Extract triples from text.

   Uses an LLM, a simple prompt, and output parsing to extract paths (i.e. triples) together with entity and relation descriptions from text.

   Args:
      llm (LLM):
         The language model to use.
      extract_prompt (Union[str, PromptTemplate]):
         The prompt to use for extracting triples.
      parse_fn (callable):
         A function to parse the output of the language model.
      num_workers (int):
         The number of workers to use for parallel processing.
      max_paths_per_chunk (int):
         The maximum number of paths to extract per chunk.

   '''
   llm: BaseChatModel
   extract_prompt: PromptTemplate
   parse_fn: Callable
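
The `parse_fn` slot above expects a function that turns raw LLM output into triples. A minimal sketch follows. The output format is an assumption made for illustration (one `(subject, predicate, object)` per line); the notebook does not define the prompt or format, and `parse_triples` is a hypothetical name.

```python
import re

# Assumed line format: (subject, predicate, object). Parts may not contain commas or parentheses.
TRIPLE_PATTERN = re.compile(
    r"^\s*\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\s*$"
)


def parse_triples(llm_output):
    """Return (subject, predicate, object) tuples for every line that matches the assumed format."""
    triples = []
    for line in llm_output.splitlines():
        match = TRIPLE_PATTERN.match(line)
        if match:
            triples.append(tuple(part.strip() for part in match.groups()))
    return triples


sample_output = """Here are the triples:
(Albert Einstein, discovered, Theory of Relativity)
(Albert Einstein, born in, 1879)
not a triple
"""
print(parse_triples(sample_output))
```

Lines that do not match are skipped silently, which tolerates chatty model output but also hides malformed triples; a stricter parser could collect the rejected lines for inspection.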

The following simple example shows that the generated answer is grounded in the retrieved triplets. Compared with the answer generated by the LLM alone, the RAG answer stays closer to the knowledge about Einstein stored in the vector store. However, the generation does not seem to pick up the cluster of connected facts around an entity (the knowledge clique), which is supposed to be a strength of Graph RAG. For example, the facts that the Theory of Relativity is a theory in physics and that it was developed in the early 20th century are not reflected in the answer.


```
{
    'query': 'Tell me about Albert Einstein', 
    'result': "Albert Einstein was a theoretical physicist best known for developing the Theory of Relativity, which revolutionized our understanding of space, time, and gravity. He was born on March 14, 1879, in Ulm, Germany, and later became a Swiss citizen. Einstein's work laid the foundation for modern physics, particularly his famous equation E=mc², which describes the equivalence of mass and energy. Throughout his career, he received numerous awards and honors, including the Nobel Prize in Physics in 1921 for his explanation of the photoelectric effect. Einstein passed away on April 18, 1955, but his contributions to science continue to influence the field today."
}
```

Consistency is another issue. In another run of the same query, the generation does take the second-level relations about Einstein into account (for example, "early 20th century").

```
{
    'query': 'Tell me about Albert Einstein', 
    'result': "Albert Einstein was a renowned physicist who was born in 1879 and passed away in 1955. He is best known for developing the Theory of Relativity, which he worked on in the early 20th century. Einstein's contributions to science have had a profound impact on our understanding of physics and the universe. His work has influenced various fields and continues to be a subject of study and admiration today."}
```

Answer generated by the LLM alone for the query 'Tell me about Albert Einstein', obtained by calling `llm.invoke('Tell me about Albert Einstein')`.

```
AIMessage(content='Albert Einstein (1879-1955) was a theoretical physicist renowned for developing the theory of relativity, one of the two pillars of modern physics alongside quantum mechanics. His work revolutionized our understanding of space, time, and energy.\n\n### Early Life\nBorn on March 14, 1879, in Ulm, Germany, Einstein showed an early interest in science and mathematics. He faced challenges in his schooling due to a nonconformist attitude and struggled with rigid educational systems. He later studied at the Polytechnic Institute in Zurich, where he graduated in 1900.\n\n### Career Highlights\nEinstein initially worked as a patent examiner in Bern, Switzerland, where he developed many of his groundbreaking ideas during his free time. In 1905, often referred to as his "miracle year," he published four pivotal papers:\n1. **Special Theory of Relativity** – Introduced the famous equation E=mc², establishing the relationship between mass and energy.\n2. **Photoelectric Effect** – Provided evidence for the quantization of light, which later contributed to the development of quantum theory; this work earned him the Nobel Prize in Physics in 1921.\n3. **Brownian Motion** – Offered explanations for the random motion of particles suspended in fluids, providing empirical support for atomic theory.\n4. **Mass-Energy Equivalence** – Established the foundational principles that would shape nuclear physics.\n\nIn 1915, Einstein completed his General Theory of Relativity, which expanded on his earlier work to include gravity as a curvature of spacetime rather than a force acting at a distance. This theory predicted phenomena like the bending of light around massive objects and was confirmed by observations during a solar eclipse in 1919.\n\n### Later Life and Legacy\nEinstein immigrated to the United States in 1933, fleeing the rise of Nazism in Germany. 
He accepted a position at the Institute for Advanced Study in Princeton, New Jersey, where he continued his work until his death on April 18, 1955. Throughout his life, Einstein was involved in various social and political causes, advocating for pacifism, civil rights, and nuclear disarmament.\n\nEinstein\'s contributions laid the groundwork for much of modern physics, and his work continues to influence numerous fields, including cosmology, quantum mechanics, and theoretical physics. His iconic status and the phrase "Einstein" have become synonymous with genius, and his legacy endures through both his scientific achievements and his humanitarian efforts.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 503, 'prompt_tokens': 12, 'total_tokens': 515, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_b705f0c291', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}], 'finish_reason': 'stop', 'logprobs': None, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'protected_material_code': {'filtered': False, 'detected': False}, 'protected_material_text': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}, id='run-bf0b8478-646f-4f87-8ce9-cc848312724d-0', usage_metadata={'input_tokens': 12, 'output_tokens': 503, 'total_tokens': 515, 'input_token_details': 
{'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
```


In [9]:
from langchain_community.graphs.networkx_graph import NetworkxEntityGraph, KnowledgeTriple
from langchain.chains import RetrievalQA # According to ChatGPT, the GraphCypherQAChain requires a Neo4jGraph backend.

# Initialize the graph
graph = NetworkxEntityGraph()

# Sample knowledge to extract relationships
triplets = [
    KnowledgeTriple("Albert Einstein", "discovered", "Theory of Relativity"),
    KnowledgeTriple("Albert Einstein", "born in", "1879"),
    KnowledgeTriple("Albert Einstein", "die at", "1955"),
    KnowledgeTriple("Isaac Newton", "formulated", "Laws of Motion"),
    KnowledgeTriple("Marie Curie", "pioneered", "Radioactivity"),
    KnowledgeTriple("Theory of Relativity", "is", "Theory in Physics"),
    KnowledgeTriple("Theory of Relativity", "develope time", "early 20th century")
]

# Add entities and relationships to the graph
for triplet in triplets:
    graph.add_triple(triplet)

# Convert each triple to a document
documents = [Document(page_content=f"{subj} {pred} {obj}") for subj, pred, obj in triplets]

# Add documents to vector store
vector_store.add_documents(documents)

# Create a retrieval-based QA chain
retrieval_chain = RetrievalQA.from_chain_type(llm, retriever=vector_store.as_retriever())

# Ask a question
query = "Tell me about Albert Einstein"
response = retrieval_chain.invoke(query)

print(response)


{'query': 'Tell me about Albert Einstein', 'result': "Albert Einstein was a renowned physicist who was born in 1879 and passed away in 1955. He is best known for developing the Theory of Relativity, which he worked on in the early 20th century. Einstein's contributions to science have had a profound impact on our understanding of physics and the universe. His work has influenced various fields and continues to be a subject of study and admiration today."}


In [10]:
llm.invoke("Tell me about Albert Einstein")

AIMessage(content='Albert Einstein was a theoretical physicist born on March 14, 1879, in Ulm, Germany, and he passed away on April 18, 1955, in Princeton, New Jersey, USA. He is best known for developing the theory of relativity, particularly the mass-energy equivalence formula \\(E=mc^2\\), which has become one of the most famous equations in physics.\n\nEinstein\'s early education was in Germany, where he struggled in some subjects but excelled in mathematics and physics. He earned a diploma from the Polytechnic Institute in Zurich, Switzerland, in 1900. After a brief period of working at the Swiss Patent Office, he published several groundbreaking papers in 1905, a year often referred to as his "annus mirabilis" or miracle year. These papers included his work on the photoelectric effect (for which he later received the Nobel Prize in Physics in 1921), Brownian motion, and special relativity.\n\nIn 1915, Einstein completed his general theory of relativity, which expanded the ideas o

In [11]:
graph.get_entity_knowledge(entity="Albert Einstein")

['Albert Einstein discovered Theory of Relativity',
 'Albert Einstein born in 1879',
 'Albert Einstein die at 1955']
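
The output above contains only first-level facts. The second-level relations discussed earlier (what is known about the Theory of Relativity itself) need a second hop. The sketch below shows that idea on plain tuples; `entity_knowledge` is a hypothetical helper, not the LangChain method, and the triples are copied from the cell above.

```python
# Same triples as in the notebook cell, as plain (subject, predicate, object) tuples.
TRIPLES = [
    ("Albert Einstein", "discovered", "Theory of Relativity"),
    ("Albert Einstein", "born in", "1879"),
    ("Albert Einstein", "die at", "1955"),
    ("Isaac Newton", "formulated", "Laws of Motion"),
    ("Marie Curie", "pioneered", "Radioactivity"),
    ("Theory of Relativity", "is", "Theory in Physics"),
    ("Theory of Relativity", "develope time", "early 20th century"),
]


def entity_knowledge(triples, entity, depth=1):
    """Collect facts reachable from `entity` within `depth` hops along subject -> object."""
    facts = []
    frontier = {entity}
    seen = {entity}
    for _ in range(depth):
        next_frontier = set()
        for subj, pred, obj in triples:
            if subj in frontier:
                facts.append(f"{subj} {pred} {obj}")
                if obj not in seen:
                    seen.add(obj)
                    next_frontier.add(obj)
        frontier = next_frontier
    return facts


for fact in entity_knowledge(TRIPLES, "Albert Einstein", depth=2):
    print(fact)
```

With `depth=2` the two facts about the Theory of Relativity are returned alongside the three direct facts, so they could be passed to the LLM as extra context.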

Enhance retrieval with Leiden community detection.

In [55]:
import networkx as nx
from graspologic.partition import hierarchical_leiden

ModuleNotFoundError: No module named 'graspologic'
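
Community detection groups related entities so that a whole cluster of facts can be retrieved together. As a dependency-free stand-in, the sketch below groups entities by connected components. This is explicitly not the Leiden algorithm: Leiden optimizes modularity and can split one connected component into several communities, while this hypothetical `connected_components` helper cannot. The triples are copied from the earlier cell.

```python
TRIPLES = [
    ("Albert Einstein", "discovered", "Theory of Relativity"),
    ("Albert Einstein", "born in", "1879"),
    ("Albert Einstein", "die at", "1955"),
    ("Isaac Newton", "formulated", "Laws of Motion"),
    ("Marie Curie", "pioneered", "Radioactivity"),
    ("Theory of Relativity", "is", "Theory in Physics"),
    ("Theory of Relativity", "develope time", "early 20th century"),
]


def connected_components(triples):
    """Group entities that are linked by any chain of triples, ignoring edge direction."""
    neighbors = {}
    for subj, _pred, obj in triples:
        neighbors.setdefault(subj, set()).add(obj)
        neighbors.setdefault(obj, set()).add(subj)

    components = []
    seen = set()
    for node in neighbors:
        if node in seen:
            continue
        # Depth-first traversal collects everything reachable from this node.
        stack = [node]
        component = set()
        seen.add(node)
        while stack:
            current = stack.pop()
            component.add(current)
            for neighbor in neighbors[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


for group in connected_components(TRIPLES):
    print(sorted(group))
```

On this small graph the Einstein facts form one group of six entities, while Newton and Curie each form their own group of two.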

In [35]:
# Require to install graphviz
# import matplotlib.pyplot as plt
# import networkx as nx

# # Convert to NetworkX graph
# graph.draw_graphviz()

In [56]:
%pip install graspologic


In [57]:
%pip install "keras>=3.5.0"

Note: you may need to restart the kernel to use updated packages.
