<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/cookbooks/GraphRAG_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GraphRAG Implementation with LlamaIndex

[GraphRAG (Graphs + Retrieval Augmented Generation)](https://www.microsoft.com/en-us/research/project/graphrag/) combines the strengths of Retrieval Augmented Generation (RAG) and Query-Focused Summarization (QFS) to effectively handle complex queries over large text datasets. While RAG excels in fetching precise information, it struggles with broader queries that require thematic understanding, a challenge that QFS addresses but cannot scale well. GraphRAG integrates these approaches to offer responsive and thorough querying capabilities across extensive, diverse text corpora.


This notebook provides guidance on constructing the GraphRAG pipeline using the LlamaIndex PropertyGraph abstractions.


**NOTE:** This is an approximate implementation of GraphRAG. We are currently developing a series of cookbooks that will detail the exact implementation of GraphRAG.

## GraphRAG Approach

GraphRAG involves two steps:

1. Graph Generation - Creates the graph, builds communities, and generates community summaries over the given documents.
2. Answer to the Query - Uses the community summaries created in step 1 to answer the query.

**Graph Generation:**

1. **Source Documents to Text Chunks:** Source documents are divided into smaller text chunks for easier processing.

2. **Text Chunks to Element Instances:** Each text chunk is analyzed to identify and extract entities and relationships, resulting in a list of tuples that represent these elements.

3. **Element Instances to Element Summaries:** The extracted entities and relationships are summarized into descriptive text blocks for each element using the LLM.

4. **Element Summaries to Graph Communities:** These entities, relationships and summaries form a graph, which is subsequently partitioned into communities using the Hierarchical Leiden algorithm to establish a hierarchical structure.

5. **Graph Communities to Community Summaries:** The LLM generates summaries for each community, providing insights into the dataset’s overall topical structure and semantics.

**Answering the Query:**

**Community Summaries to Global Answers:** The summaries of the communities are utilized to respond to user queries. This involves generating intermediate answers, which are then consolidated into a comprehensive global answer.


## GraphRAG Pipeline Components

Here are the different components we implemented to build all of the processes mentioned above.

1. **Source Documents to Text Chunks:** Implemented using `SentenceSplitter` with a chunk size of 1024 and chunk overlap of 20 tokens.

2. **Text Chunks to Element Instances AND Element Instances to Element Summaries:** Implemented using `GraphRAGExtractor`.

3. **Element Summaries to Graph Communities AND Graph Communities to Community Summaries:** Implemented using `GraphRAGStore`.

4. **Community Summaries to Global Answers:** Implemented using `GraphRAGQueryEngine`.


Let's look at each of these components and build the GraphRAG pipeline.


## Installation

`graspologic` provides the `hierarchical_leiden` function used to build communities.

In [45]:
!uv pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0 llama-index-llms-groq llama-index-embeddings-fastembed fastembed

## Load Data

We will use a sample news article dataset retrieved from Diffbot, which Tomaz has conveniently made available on GitHub for easy access.

The dataset contains 2,500 samples; for ease of experimentation, we will use 50 of these samples, which include the `title` and `text` of news articles.

In [2]:
import pandas as pd
from llama_index.core import Document

news = pd.read_csv(
    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)[:50]

news.head()

| Unnamed: 0 | title | date | text |
|---|---|---|---|
| 0 | Chevron: Best Of Breed | 2031-04-06T01:36:32.000000000+00:00 | JHVEPhoto Like many companies in the O&G secto... |
| 1 | FirstEnergy (NYSE:FE) Posts Earnings Results | 2030-04-29T06:55:28.000000000+00:00 | FirstEnergy (NYSE:FE – Get Rating) posted its ... |
| 2 | Dáil almost suspended after Sinn Féin TD put p... | 2023-06-15T14:32:11.000000000+00:00 | The Dáil was almost suspended on Thursday afte... |
| 3 | Epic’s latest tool can animate hyperrealistic ... | 2023-06-15T14:00:00.000000000+00:00 | Today, Epic is releasing a new tool designed t... |
| 4 | EU to Ban Huawei, ZTE from Internal Commission... | 2023-06-15T13:50:00.000000000+00:00 | The European Commission is planning to ban equ... |


Prepare documents as required by LlamaIndex. Only every fourth article is kept (13 documents in total) to reduce the number of LLM calls.

In [3]:
all_documents = [
    Document(text=f"{row['title']}: {row['text']}")
    for i, row in news.iterrows()
]
# Step is set to 4 to take every fourth document
documents = all_documents[::4]
len(documents)

13

The optimisation below (shortening each document to its first 30-token chunk) did not help, because the limiting factor is the number of LLM calls, not their length. It is left commented out for reference.

In [4]:
# from colorama import Fore, Style
# from llama_index.core.node_parser import SentenceSplitter

# GRAY_LIGHT = "\033[38;5;250m"
# GRAY_DARK = "\033[38;5;240m"
# RESET = "\033[0m"

# splitter = SentenceSplitter(chunk_size=30, chunk_overlap=3)

# documents = [
#     Document(text=f"{row['title']}: {row['text']}")
#     for i, row in news.iterrows()
# ]

# shortened_documents = []

# for document in documents:
#     print(Fore.GREEN + "Original Document:" + Style.RESET_ALL)
#     print(document.text)
#     print(Fore.BLUE + "Split Nodes:" + Style.RESET_ALL)
#     nodes = splitter.get_nodes_from_documents([document])
#     shortened_documents.append(nodes[0])
#     for i, node in enumerate(nodes):
#         shade = GRAY_LIGHT if i % 2 == 0 else GRAY_DARK
#         print(f"{shade}{node.text}{RESET}")
#         print()
        
# documents = shortened_documents

## Setup API Key and LLM

In [5]:
import os

api_key = os.environ["GROQ_API_KEY"]

from llama_index.llms.groq import Groq

llm = Groq(
    model="qwen/qwen3-32b",
    api_key=api_key,
)



## GraphRAGExtractor

The GraphRAGExtractor class is designed to extract triples (subject-relation-object) from text and enrich them by adding descriptions for entities and relationships to their properties using an LLM.

This functionality is similar to that of the `SimpleLLMPathExtractor`, but includes additional enhancements to handle entity and relationship descriptions. For guidance on implementation, you may look at similar existing [extractors](https://docs.llamaindex.ai/en/latest/examples/property_graph/dynamic_kg_extraction/?h=comparing).

Here's a breakdown of its functionality:

**Key Components:**

1. `llm:` The language model used for extraction.
2. `extract_prompt:` A prompt template used to guide the LLM in extracting information.
3. `parse_fn:` A function to parse the LLM's output into structured data.
4. `max_paths_per_chunk:` Limits the number of triples extracted per text chunk.
5. `num_workers:` For parallel processing of multiple text nodes.


**Main Methods:**

1. `__call__:` The entry point for processing a list of text nodes.
2. `acall:` An asynchronous version of `__call__` for improved performance.
3. `_aextract:` The core method that processes each individual node.


**Extraction Process:**

For each input node (chunk of text):
1. It sends the text to the LLM along with the extraction prompt.
2. The LLM's response is parsed to extract entities, relationships, and descriptions for both.
3. Entities are converted into `EntityNode` objects. The entity description is stored in metadata.
4. Relationships are converted into `Relation` objects. The relationship description is stored in metadata.
5. These are added to the node's metadata under `KG_NODES_KEY` and `KG_RELATIONS_KEY`.

**NOTE:** In the current implementation, we are using only relationship descriptions. In the next implementation, we will utilize entity descriptions during the retrieval stage.

In [6]:
import asyncio
import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field


class GraphRAGExtractor(TransformComponent):
    """Extract triples from a graph.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            await asyncio.sleep(1) # Spread the requests more to stay under the limit
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            metadata[
                "entity_description"
            ] = description  # Not used in the current implementation. But will be useful in future work.
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=metadata
            )
            existing_nodes.append(entity_node)

        for triple in entities_relationship:
            subj, obj, rel, description = triple
            # Use fresh metadata copies per triple so the endpoint nodes do not
            # pick up the relationship_description of a previous triple.
            subj_node = EntityNode(name=subj, properties=node.metadata.copy())
            obj_node = EntityNode(name=obj, properties=node.metadata.copy())
            rel_metadata = node.metadata.copy()
            rel_metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj_node.id,
                target_id=obj_node.id,
                properties=rel_metadata,
            )

            existing_nodes.extend([subj_node, obj_node])
            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )

### TransformComponent

It defines a contract for any component that transforms a sequence of nodes into another sequence of nodes. The interface is:

- `__call__(nodes)`: synchronous transform (abstract, must be implemented)
- `acall(nodes)`: async transform (default falls back to `__call__`)

### Sequence type

`Sequence` is an abstract type representing any ordered, iterable collection that supports `len()` and indexing (`[]`), such as `list`, `tuple`, or `range`.

Using `Sequence` instead of `List` in the base class is a common Python pattern: it makes the interface more flexible by not forcing callers to use a specific concrete type.
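As a rough illustration of this contract, the sketch below uses a hypothetical `UppercaseTransform` class with plain strings standing in for nodes. It is not a LlamaIndex class; it only mirrors the shape described above, where the input is typed as `Sequence` and `acall` falls back to `__call__`.

```python
import asyncio
from collections.abc import Sequence
from typing import List


class UppercaseTransform:
    """Hypothetical transform: plain strings stand in for nodes."""

    def __call__(self, nodes: Sequence[str]) -> List[str]:
        # Only iteration is needed, so any Sequence works as input.
        return [node.upper() for node in nodes]

    async def acall(self, nodes: Sequence[str]) -> List[str]:
        # Default async behaviour: fall back to the synchronous call.
        return self(nodes)


transform = UppercaseTransform()
from_list = transform(["a", "b"])  # a list is accepted
from_tuple = transform(("a", "b"))  # so is a tuple
from_async = asyncio.run(transform.acall(["a", "b"]))
```

Because the parameter is annotated as `Sequence`, callers can pass a list or a tuple without converting first.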

## GraphRAGStore

The `GraphRAGStore` class is an extension of the `SimplePropertyGraphStore` class, designed to implement the GraphRAG pipeline. Here's a breakdown of its key components and functions:


The class uses community detection algorithms to group related nodes in the graph and then it generates summaries for each community using an LLM.


**Key Methods:**

`build_communities():`

1. Converts the internal graph representation to a [NetworkX](https://networkx.org/en/) graph.

2. Applies the hierarchical Leiden algorithm for community detection.

3. Collects detailed information about each community.

4. Generates summaries for each community.

`generate_community_summary(text):`

1. Uses LLM to generate a summary of the relationships in a community.
2. The summary includes entity names and a synthesis of relationship descriptions.

`_create_nx_graph():`

1. Converts the internal graph representation to a NetworkX graph for community detection.

`_collect_community_info(nx_graph, clusters):`

1. Collects detailed information about each node based on its community.
2. Creates a string representation of each relationship within a community.

`_summarize_communities(community_info):`

1. Generates and stores summaries for each community using LLM.

`get_community_summaries():`

1. Returns the community summaries by building them if not already done.
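The relationship strings built by `_collect_community_info` can be pictured with a small, dependency-free sketch. The edge list and cluster assignment below are made-up illustrations (in the notebook they come from the NetworkX graph and `hierarchical_leiden`), and the sketch walks an edge list directly rather than a graph's neighbours.

```python
from collections import defaultdict

# Hypothetical edges: (source, target, relation label, description).
edges = [
    ("Tesla", "Model 3", "manufactures", "Tesla builds the Model 3"),
    ("Elon Musk", "Tesla", "is CEO of", "Elon Musk leads Tesla"),
    ("Tesla", "Acme", "mentioned with", "Both appear in one article"),
]

# Hypothetical cluster assignment: entity name -> community id.
cluster_of = {"Tesla": 0, "Model 3": 0, "Elon Musk": 0, "Acme": 1}

community_info = defaultdict(list)
for source, target, label, description in edges:
    # Keep only edges whose endpoints share a community.
    if cluster_of[source] == cluster_of[target]:
        community_info[cluster_of[source]].append(
            f"{source} -> {target} -> {label} -> {description}"
        )

# Text that would be handed to the LLM for community 0.
details_text = "\n".join(community_info[0]) + "."
```

The edge between `Tesla` and `Acme` crosses two communities, so it does not contribute to either community's text.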

### NetworkX

[NetworkX](https://networkx.org/en/) is used here as an intermediate graph format for community detection. The graph contains:

- Nodes: each node from the internal `self.graph.nodes` is added as a string.
- Edges: each relation from `self.graph.relations` becomes an edge with two attributes:
  - `relationship`: the relation label (e.g., "works_at", "is_part_of")
  - `description`: a textual description of the relationship, stored in `relation.properties["relationship_description"]`

The sole purpose of this NetworkX graph is to feed the hierarchical Leiden algorithm (from the `graspologic` library), which detects communities: clusters of densely connected entities. Those communities are then summarized by an LLM, and the summaries power the GraphRAG query engine.

In short, it is an intermediate graph format that bridges LlamaIndex's internal property graph store with the graph algorithms used for community detection.

### Communities

A community in this GraphRAG context is a cluster of closely related entities in the knowledge graph — a group of nodes that are more densely connected to each other than to the rest of the graph.

Here's how they're constructed from triplets, step by step:

1. Triplet extraction
The `GraphRAGExtractor` processes each text chunk and extracts triplets of the form:

(entity1) → [relationship] → (entity2)

For example, from a news article you might get:

- `("Tesla", "manufactures", "Model 3")`
- `("Elon Musk", "is CEO of", "Tesla")`
- `("Tesla", "headquartered in", "Austin")`

These triplets are stored in the internal property graph as nodes and relations.

2. Conversion to a NetworkX graph
The `_create_nx_graph()` method converts all those triplets into an undirected NetworkX graph:

- Each entity becomes a node.
- Each relationship between two entities becomes an edge (with the relationship label and description as attributes).

So the triplets above would produce nodes {Tesla, Model 3, Elon Musk, Austin} connected by edges.

3. Community detection via hierarchical Leiden
The Leiden algorithm (from the `graspologic` library) is run on this NetworkX graph. It optimizes modularity, a measure of how well nodes split into groups where:

- Nodes within a community have many connections to each other.
- Nodes across communities have few connections.

The hierarchical variant produces nested communities at multiple levels of granularity, controlled by `max_cluster_size`.

In the example above, Tesla, Model 3, Elon Musk, and Austin would likely land in the same community because they're all directly connected through Tesla.

4. Community info collection
`_collect_community_info()` iterates over each community and collects all the intra-community edges, i.e., relationships where both endpoints belong to the same cluster. It formats them as:

`entity1 -> entity2 -> relationship_label -> relationship_description`

5. Community summarization
Finally, `_summarize_communities()` sends each community's collected relationship strings to an LLM, which produces a natural-language summary capturing the key entities and their relationships within that cluster.
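To make the notion of modularity from step 3 concrete, here is a minimal pure-Python sketch that scores a partition of a small undirected, unweighted graph. It is not the Leiden algorithm itself, only the quantity it tries to maximize; the `Acme` entities and the cluster assignments are hypothetical additions to the Tesla example.

```python
from collections import Counter


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = Counter()
    intra_edges = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            intra_edges[community_of[u]] += 1

    degree_sum = Counter()
    for node, d in degree.items():
        degree_sum[community_of[node]] += d

    # Q = sum over communities c of [ L_c / m - (d_c / 2m)^2 ]
    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


edges = [
    ("Tesla", "Model 3"),
    ("Tesla", "Elon Musk"),
    ("Tesla", "Austin"),
    ("Acme", "Acme Plant"),  # hypothetical second cluster
    ("Acme", "Acme CEO"),
    ("Tesla", "Acme"),  # single bridge between the clusters
]

split = {
    "Tesla": 0, "Model 3": 0, "Elon Musk": 0, "Austin": 0,
    "Acme": 1, "Acme Plant": 1, "Acme CEO": 1,
}
single = {node: 0 for node in split}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
```

Splitting the graph at the bridge scores about 0.32, while lumping every node into one community scores 0, which is why a modularity-driven algorithm would prefer the two-community partition here.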

In [7]:
import re
from typing import Optional
from llama_index.core.graph_stores import SimplePropertyGraphStore
from llama_index.core.llms.llm import LLM
import networkx as nx
from graspologic.partition import hierarchical_leiden

from llama_index.core.llms import ChatMessage


class GraphRAGStore(SimplePropertyGraphStore):
    community_summary: dict = {}
    max_cluster_size: int = 5
    llm: Optional[LLM] = None

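    def __init__(self, llm: Optional[LLM] = None, **kwargs):
        super().__init__(**kwargs)
        self.llm = llm
        # Per-instance dict so separate stores do not share the class-level default.
        self.community_summary = {}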

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = self.llm.chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        for node in self.graph.nodes.values():
            nx_graph.add_node(str(node))
        for relation in self.graph.relations.values():
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """Collect detailed information for each node based on their community."""
        community_mapping = {item.node: item.cluster for item in clusters}
        community_info = {}
        for item in clusters:
            cluster_id = item.cluster
            node = item.node
            if cluster_id not in community_info:
                community_info[cluster_id] = []

            for neighbor in nx_graph.neighbors(node):
                if community_mapping[neighbor] == cluster_id:
                    edge_data = nx_graph.get_edge_data(node, neighbor)
                    if edge_data:
                        detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                        community_info[cluster_id].append(detail)
        return community_info

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary



### SimplePropertyGraphStore

It's an in-memory property graph store that inherits from the abstract `PropertyGraphStore`.

Key characteristics:

- Storage: holds a `LabelledPropertyGraph` object in memory (`self.graph`) containing nodes, relations, and triplets.
- No external database: everything lives in Python dictionaries. It can be persisted to and loaded from JSON files.
- Does NOT support structured queries (e.g., Cypher) or vector queries; those raise `NotImplementedError`. This distinguishes it from database-backed stores like Neo4j.

## GraphRAGQueryEngine

The GraphRAGQueryEngine class is a custom query engine designed to process queries using the GraphRAG approach. It leverages the community summaries generated by the GraphRAGStore to answer user queries. Here's a breakdown of its functionality:

**Main Components:**

- `graph_store:` An instance of `GraphRAGStore`, which contains the community summaries.
- `llm:` A Language Model (LLM) used for generating and aggregating answers.


**Key Methods:**

`custom_query(query_str: str)`

1. This is the main entry point for processing a query. It retrieves community summaries, generates answers from each summary, and then aggregates these answers into a final response.

`generate_answer_from_summary(community_summary, query):`

1. Generates an answer for the query based on a single community summary.
2. Uses the LLM to interpret the community summary in the context of the query.

`aggregate_answers(community_answers):`

1. Combines individual answers from different communities into a coherent final response.
2. Uses the LLM to synthesize multiple perspectives into a single, concise answer.


**Query Processing Flow:**

1. Retrieve community summaries from the graph store.
2. For each community summary, generate a specific answer to the query.
3. Aggregate all community-specific answers into a final, coherent response.
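The three steps above follow a map-reduce shape, which can be exercised without an LLM by swapping in plain functions. In the sketch below, `answer_query`, `fake_answer`, and `fake_aggregate` are hypothetical names introduced for illustration; they stand in for the two LLM calls and are not part of LlamaIndex.

```python
from typing import Callable, Dict, List


def answer_query(
    query: str,
    community_summaries: Dict[int, str],
    answer_fn: Callable[[str, str], str],
    aggregate_fn: Callable[[List[str]], str],
) -> str:
    # Map: one intermediate answer per community summary.
    community_answers = [
        answer_fn(summary, query) for summary in community_summaries.values()
    ]
    # Reduce: merge the intermediate answers into one response.
    return aggregate_fn(community_answers)


# Hypothetical stand-ins for the two LLM calls.
def fake_answer(summary: str, query: str) -> str:
    return f"[{summary}]"


def fake_aggregate(answers: List[str]) -> str:
    return " + ".join(answers)


summaries = {0: "Tesla community", 1: "Acme community"}
final_answer = answer_query(
    "What are the main themes?", summaries, fake_answer, fake_aggregate
)
```

Note that the map step issues one call per community, so the number of LLM calls grows with the number of communities.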


**Example usage:**

```
query_engine = GraphRAGQueryEngine(graph_store=graph_store, llm=llm)

response = query_engine.query("query")
```

In [8]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM


class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    llm: LLM

    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for _, community_summary in community_summaries.items()
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response

## Build End-to-End GraphRAG Pipeline

Now that we have defined all the necessary components, let’s construct the GraphRAG pipeline:

1. Create nodes/chunks from the text.
2. Build a PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`.
3. Construct communities and generate a summary for each community using the graph built above.
4. Create a `GraphRAGQueryEngine` and begin querying.

### Create nodes/chunks from the text

In [9]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

In [10]:
len(nodes)

13

### Build PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`

In [11]:
KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

3. Output Formatting:
- Return the result in valid JSON format with two keys: 'entities' (list of entity objects) and 'relationships' (list of relationship objects).
- Exclude any text outside the JSON structure (e.g., no explanations or comments).
- If no entities or relationships are identified, return empty lists: { "entities": [], "relationships": [] }.

-An Output Example-
{
  "entities": [
    {
      "entity_name": "Albert Einstein",
      "entity_type": "Person",
      "entity_description": "Albert Einstein was a theoretical physicist who developed the theory of relativity and made significant contributions to physics."
    },
    {
      "entity_name": "Theory of Relativity",
      "entity_type": "Scientific Theory",
      "entity_description": "A scientific theory developed by Albert Einstein, describing the laws of physics in relation to observers in different frames of reference."
    },
    {
      "entity_name": "Nobel Prize in Physics",
      "entity_type": "Award",
      "entity_description": "A prestigious international award in the field of physics, awarded annually by the Royal Swedish Academy of Sciences."
    }
  ],
  "relationships": [
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Theory of Relativity",
      "relation": "developed",
      "relationship_description": "Albert Einstein is the developer of the theory of relativity."
    },
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Nobel Prize in Physics",
      "relation": "won",
      "relationship_description": "Albert Einstein won the Nobel Prize in Physics in 1921."
    }
  ]
}

-Real Data-
######################
text: {text}
######################
output:"""

In [12]:
import json


def parse_fn(response_str: str) -> Any:
    json_pattern = r"\{.*\}"
    match = re.search(json_pattern, response_str, re.DOTALL)
    entities = []
    relationships = []
    if not match:
        return entities, relationships
    json_str = match.group(0)
    try:
        data = json.loads(json_str)
        entities = [
            (
                entity["entity_name"],
                entity["entity_type"],
                entity["entity_description"],
            )
            for entity in data.get("entities", [])
        ]
        relationships = [
            (
                relation["source_entity"],
                relation["target_entity"],
                relation["relation"],
                relation["relationship_description"],
            )
            for relation in data.get("relationships", [])
        ]
        return entities, relationships
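    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # Malformed JSON or a missing field: skip this chunk instead of crashing.
        print("Error parsing extraction response:", e)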
        return entities, relationships


kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
    num_workers=1,  # more workers risk exceeding the requests-per-minute rate limit
)
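To illustrate what `parse_fn` has to cope with, here is a minimal, self-contained sketch of the same idea: pull the outermost `{...}` span out of a chatty LLM reply and turn it into entity and relationship tuples. The sample reply and the helper name `parse_extraction` are made up for illustration; the field names follow the JSON schema in the prompt above.

```python
import json
import re


def parse_extraction(response_str):
    """Hypothetical stand-in for parse_fn: returns (entities, relationships)."""
    # Greedy match with DOTALL: spans from the first "{" to the last "}".
    match = re.search(r"\{.*\}", response_str, re.DOTALL)
    if not match:
        return [], []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return [], []
    entities = [
        (e["entity_name"], e["entity_type"], e["entity_description"])
        for e in data.get("entities", [])
    ]
    relationships = [
        (
            r["source_entity"],
            r["target_entity"],
            r["relation"],
            r["relationship_description"],
        )
        for r in data.get("relationships", [])
    ]
    return entities, relationships


# Made-up reply: prose before and after the JSON payload.
sample_reply = """Here is the extracted graph:
{
  "entities": [
    {"entity_name": "Albert Einstein", "entity_type": "Person",
     "entity_description": "Theoretical physicist."}
  ],
  "relationships": [
    {"source_entity": "Albert Einstein",
     "target_entity": "Theory of Relativity",
     "relation": "developed",
     "relationship_description": "Einstein developed the theory."}
  ]
}
Let me know if you need anything else."""

entities, relationships = parse_extraction(sample_reply)
print(entities)
print(relationships)
print(parse_extraction("no JSON here"))
```

Because the match is greedy, any stray `{` or `}` in the surrounding prose (for example inside a reasoning trace) is swept into the candidate string and makes `json.loads` fail, in which case the chunk yields no entities or relationships.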

In [13]:
import json
import os
from llama_index.core import PropertyGraphIndex

GRAPH_STORE_PATH = "graph_store.json"
COMMUNITY_SUMMARIES_PATH = "community_summaries.json"

if os.path.exists(GRAPH_STORE_PATH):
    base_store = SimplePropertyGraphStore.from_persist_path(GRAPH_STORE_PATH)
    graph_store = GraphRAGStore(llm=llm)
    graph_store.graph = base_store.graph
    index = PropertyGraphIndex.from_existing(
        property_graph_store=graph_store,
        embed_kg_nodes=False,
        llm=llm,
    )
    print("Loaded graph store from cache.")
else:
    index = PropertyGraphIndex(
        nodes=nodes,
        llm=llm,
        property_graph_store=GraphRAGStore(llm=llm),
        kg_extractors=[kg_extractor],
        embed_kg_nodes=False,
        show_progress=True,
    )
    index.property_graph_store.persist(GRAPH_STORE_PATH)
    print(f"Graph store built and persisted to {GRAPH_STORE_PATH}")

Loaded graph store from cache.


In [14]:
node = list(index.property_graph_store.graph.nodes.values())[-1]
print(f"Name:  {node.name}")
print(f"Label: {node.label}")
print("Properties:")
for k, v in node.properties.items():
    print(f"  {k}: {v}")

Name:  Israel
Label: entity
Properties:
  relationship_description: Uber announced the closure of its food delivery operations in Italy due to insufficient market share and competition from local providers.
  triplet_source_id: c70953e4-b065-48d6-99a6-b72a58baa5ef


In [15]:
list(index.property_graph_store.graph.relations.values())[0]

Relation(label='operates in', source_id='Chevron', target_id='O&G Sector', properties={'relationship_description': 'Chevron is a company that operates within the oil and gas (O&G) industry sector.', 'triplet_source_id': '58897347-bc34-4697-abb2-f65d1e4dc272'})

In [16]:
list(index.property_graph_store.graph.relations.values())[0].properties[
    "relationship_description"
]

'Chevron is a company that operates within the oil and gas (O&G) industry sector.'

### Build communities

Partition the graph into communities and generate a summary for each one. If cached summaries exist at `COMMUNITY_SUMMARIES_PATH`, they are loaded instead of being rebuilt.

In [18]:
if os.path.exists(COMMUNITY_SUMMARIES_PATH):
    with open(COMMUNITY_SUMMARIES_PATH) as f:
        index.property_graph_store.community_summary = {int(k): v for k, v in json.load(f).items()}
    print("Loaded community summaries from cache.")
else:
    index.property_graph_store.build_communities()
    with open(COMMUNITY_SUMMARIES_PATH, "w") as f:
        json.dump(index.property_graph_store.community_summary, f)
    print(f"Community summaries built and persisted to {COMMUNITY_SUMMARIES_PATH}")

Loaded community summaries from cache.
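
The `int(k)` conversion in the cache-loading branch is needed because JSON object keys are always strings. Assuming community ids are integers (as that conversion suggests), they come back as `"0"`, `"1"`, and so on after a round trip. A small sketch with made-up summaries:

```python
import json

# Made-up summaries keyed by integer community id.
community_summary = {0: "Energy companies and listings.", 1: "Smartphone launches."}

serialized = json.dumps(community_summary)
loaded = json.loads(serialized)
print(list(loaded.keys()))  # keys are now strings

# Restore the integer keys, mirroring the cache-loading cell.
restored = {int(k): v for k, v in loaded.items()}
print(restored == community_summary)
```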


### Retries on Rate-Limited Responses

Queries made through `GraphRAGQueryEngine` are retried automatically on HTTP 429 responses, but the retry logic lives in the Groq LLM integration; the query engine itself has none.

The retries are implemented with the [tenacity](https://tenacity.readthedocs.io/en/latest/) library.
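
For intuition, retrying on a 429 response can be sketched with the standard library alone. This is not the Groq integration's code and it does not use tenacity; `RateLimitError`, `call_with_retries`, and the backoff parameters are hypothetical names chosen for illustration.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by an LLM client."""


def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated client that is rate limited twice before succeeding.
calls = {"count": 0}


def flaky_completion():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


# Record the delays instead of actually sleeping.
delays = []
result = call_with_retries(flaky_completion, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; with the default `time.sleep` the same call would pause between attempts.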

### Create QueryEngine

In [19]:
query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store, llm=llm
)

### Querying

In [21]:
response = query_engine.query(
    "What are the main news discussed in the document?"
)
display(Markdown(f"{response.response}"))

<think>
Okay, let's tackle this. The user wants the main news from the document, but looking at the provided info, there's no actual document. The intermediate answers mention various topics like Chevron's industry, European Commission's ban on Huawei and ZTE, Vivo's X90s, Apple's ecosystem, Square Enix remastering games, CarMax entities, JetBlue's livery, Coinbase's note repurchase, Allegiant Travel's stock upgrade, Uber's market exits, Gordon McQueen's dementia, and Manchester United's transfer interest.

Wait, the user might have intended to provide a document but forgot. The initial answers correctly point out that there's no document, just knowledge graph data. But some answers assume there's a document and list news points. The user's query is a bit conflicting. I need to reconcile these.

The correct approach is to clarify that there's no actual document, but if we consider the knowledge graph as the source, the main points are the relationships mentioned. However, some answers treated the knowledge graph as a document and extracted news-like points. The user might be confused between the two.

I should start by stating that there's no document, but if we proceed with the knowledge graph, the main topics are the entities and their relationships. However, some answers listed specific news items based on the data. To resolve this, the final answer should first clarify the absence of a document, then summarize the key points from the knowledge graph as if they were the main topics discussed. Need to mention that these are structured relationships, not news articles. Also, ensure the answer is concise and addresses the user's possible confusion.
</think>

The provided information does not include a document or news article but consists of structured relationships from a knowledge graph. However, based on the data, the **main topics covered** are:  

1. **Corporate/Financial Relationships**:  
   - Entities like Chevron (O&G sector, NYSE listing), CarMax Auto Owner Trust (sponsorship, servicing), and Coinbase (convertible notes repurchase) highlight operational and financial frameworks.  

2. **Technology & Market Moves**:  
   - **Huawei/ZTE Ban**: European Commission’s plan to exclude high-risk vendors from internal networks.  
   - **Vivo X90s**: Teaser by Jia Jingdong and confirmation of MediaTek Dimensity 9200+ SoC.  
   - **Square Enix**: Remastering *Star Ocean* games (*First Departure R*, *The Second Story R*).  

3. **Corporate Strategy & Leadership**:  
   - **Apple**: Ecosystem integration (iCloud, Apple Pay), partnerships (Intel, TSMC), and competition with Microsoft.  
   - **JetBlue**: Fleet upgrades (Mint Suites, New Standard Livery) for premium travel.  

4. **Market Exits & Acquisitions**:  
   - **Uber**: Exited food delivery in Italy and taxi services in Israel due to competition.  
   - **Allegiant Travel**: Deutsche Bank upgraded its stock amid improving ROIC.  

5. **Sports & Entertainment**:  
   - **Manchester City**: Potential Champions League matchups with Real Madrid and Inter Milan.  
   - **Gordon McQueen**: Diagnosed with vascular dementia in 2021.  

These points reflect structured data rather than news articles. If you intended to reference a specific document, please provide it for a precise summary.

In [28]:
response = query_engine.query("What are news related to financial sector?")
display(Markdown(f"{response.response}"))

<think>
Okay, let's tackle this. The user wants a concise final response combining all the intermediate answers. First, I need to look at each intermediate answer to see what they have in common.

Most of the intermediate answers start by stating that the provided information doesn't include specific financial sector news. They then mention the actual topic covered in the data, like Chevron's stock, EU telecom policies, Vivo phones, sports contracts, etc. Each one explains that the context is unrelated to finance and suggests checking other sources for real financial news.

I need to make sure the final answer is concise but covers all the key points. The main idea is that the user asked for financial news, but the available data is about other topics. So the response should clarify that there's no financial news here and list the actual topics covered in the data. Also, it should offer to help with other queries if needed.

I should avoid repeating the same phrases from each intermediate answer. Instead, summarize the common elements: no financial news, mention the various unrelated topics, and suggest consulting other sources. Make sure it's clear and to the point without being too verbose.
</think>

**Final Answer:**  
The provided information does not include specific news related to the financial sector. The content spans unrelated topics such as Chevron’s NYSE listing, EU telecom policies, smartphone technology, sports contracts, and corporate financial structures (e.g., CarMax securitization). For financial sector updates, consult dedicated financial news sources or provide additional context for targeted assistance.
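
Both responses above include the model's `<think>...</think>` reasoning trace ahead of the answer. If only the final answer should be displayed, the trace can be stripped first. A minimal sketch, assuming the trace is wrapped in `<think>` tags as in the outputs above; `strip_reasoning` is a hypothetical helper, not part of LlamaIndex:

```python
import re

# Matches a <think>...</think> block (non-greedy) plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_reasoning(text):
    """Remove <think>...</think> blocks and return the remaining answer."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nThe user wants a summary.\n</think>\n\n**Final Answer:** No financial news."
print(strip_reasoning(raw))
```

In the notebook, this could be applied as `display(Markdown(strip_reasoning(response.response)))`.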

## Future Work

This cookbook is an approximate implementation of GraphRAG. In future cookbooks, we plan to extend it as follows:

1. Implement retrieval using entity description embeddings.
2. Integrate with `Neo4jPropertyGraphStore`.
3. Calculate a helpfulness score for each answer generated from the community summaries and filter out answers where the helpfulness score is zero.
4. Perform entity disambiguation to remove duplicate entities.
5. Implement claims or covariate information extraction, Local Search and Global Search techniques.

## Comparison: GraphRAG vs. Regular RAG

Let's build a standard vector-based RAG pipeline using the same documents and LLM, then compare the answers side by side.

In [46]:
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings
from llama_index.embeddings.fastembed import FastEmbedEmbedding

Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

vector_index = VectorStoreIndex(nodes=nodes)
vector_query_engine = vector_index.as_query_engine(llm=llm)

Fetching 5 files: 100%|██████████| 5/5 [00:01<00:00,  2.76it/s]


In [None]:
queries = [
    "What are the main news discussed in the document?",
    "What are news related to financial sector?",
]

for q in queries:
    graph_response = query_engine.query(q)
    vector_response = vector_query_engine.query(q)

    display(Markdown(f"### Query: {q}"))
    display(Markdown(f"**GraphRAG:**\n\n{graph_response.response}"))
    display(Markdown(f"**Regular RAG:**\n\n{vector_response.response}"))
    display(Markdown("---"))

### Query: What are the main news discussed in the document?

**GraphRAG:**

<think>
Okay, let's see. The user is asking for the main news discussed in the document based on the provided information. The intermediate answers given are all different responses to similar queries but with varying contexts. Each one addresses a different topic, like Chevron's industry, European Commission's ban on Huawei and ZTE, Vivo's new phone, Apple's strategies, Square Enix's remastered games, etc.

Wait, the user's actual query is about combining these intermediate answers into a final, concise response. But looking at the history, each intermediate answer is a response to a different document or context. However, the user now wants to combine all these into one final answer. But that doesn't make sense because each intermediate answer is for a different topic. Maybe there's a misunderstanding here. 

Wait, perhaps the user is referring to a specific document that was mentioned in the initial query, but the assistant provided multiple intermediate answers for different documents. Now, the user wants to combine all those answers into one. But the problem is that each intermediate answer is for a different topic. So combining them would result in a list of unrelated news items. 

Alternatively, maybe the user is asking to create a single answer that addresses all the different topics covered in the intermediate answers. But that's not typical. Usually, each query is separate. However, given the way the question is phrased, the user might have intended to ask for a summary of all the different news items mentioned across various documents. 

Looking at the intermediate answers, each one is a separate news summary. For example, one is about Chevron, another about European Commission's ban, another about Vivo's phone, etc. So the final answer should list all these as separate news items. But the user wants a concise response. So perhaps the correct approach is to list each main news item from each intermediate answer, but that would be a long list. However, the user specified "concise," so maybe just a brief mention of each topic without going into details. 

But the user's instruction says "Combine the following intermediate answers into a final, concise response." So the assistant needs to take all the intermediate answers (each being a separate news summary) and combine them into one response. However, since they are all different topics, the final answer would be a list of each main news item from each intermediate answer. 

But the user might have intended that each intermediate answer is part of the same document, but that's not indicated. Given the ambiguity, the safest approach is to list each main news item from each intermediate answer as separate points in the final response. However, since the user wants it concise, maybe grouping similar topics or summarizing each briefly. 

Wait, looking at the intermediate answers, each one is a separate answer to a different query. For example, one is about Chevron, another about European Commission, another about Vivo, etc. So combining them would mean listing all these as separate news items. But the user might not want that. Alternatively, maybe the user made a mistake in providing multiple intermediate answers for different queries and now wants a single answer that addresses all. However, without more context, it's challenging. 

Given the instructions, the assistant should proceed to combine all the intermediate answers into a single, concise response. The best way is to list each main news item from each intermediate answer as bullet points or numbered items. However, since the user wants it concise, maybe just a brief summary of each. 

So the final answer would be a list of each main news item from each intermediate answer, presented concisely. For example:

1. Chevron's operations in the O&G sector and NYSE listing.
2. European Commission's ban on Huawei and ZTE.
3. Jia Jingdong's teaser for Vivo X90s with MediaTek SoC.
4. Apple's product lines, leadership, and partnerships.
5. Square Enix's remastered Star Ocean games.
6. CarMax's trust relationships.
7. JetBlue's New Standard Livery.
8. Coinbase's convertible notes repurchase.
9. Allegiant Travel's stock upgrade and ROIC.
10. Manchester City's potential Champions League matchups.
11. Uber's exits from Italy and Israel.
12. Gordon McQueen's dementia diagnosis.
13. Manchester United and Chelsea's Onana transfer race.

But the user wants it concise, so maybe grouping similar topics or summarizing each briefly. However, given the diversity of topics, it's best to list each as a separate point. The assistant should present each main news item from the intermediate answers in a clear, concise manner, ensuring that each is distinct and summarized effectively.
</think>

The main news discussed across the provided documents includes:  

1. **Chevron**: Operates in the oil and gas sector, listed on NYSE under ticker CVX.  
2. **European Commission**: Plans to ban Huawei and ZTE from internal networks due to security concerns, aligning with broader high-risk vendor exclusion strategies.  
3. **Vivo X90s**: Teased by Jia Jingdong, expected to feature MediaTek Dimensity 9200+ SoC, highlighting design and performance upgrades.  
4. **Apple**: Focus on product innovation (iPhone, MacBooks), leadership under Tim Cook, partnerships (Intel, TSMC), and ecosystem expansion (Apple Music, iCloud).  
5. **Square Enix**: Remastering *Star Ocean: First Departure R* (PS4/Switch) and potential *Star Ocean: The Second Story R* (leaked evidence).  
6. **CarMax Trust**: Structural roles of CarMax Business Services, LLC (sponsor/servicer) and CarMax Auto Funding LLC (depositor).  
7. **JetBlue**: Launch of New Standard Livery for aircraft, reflecting brand evolution and strategic innovation.  
8. **Coinbase**: Repurchased $64.5M convertible notes at 29% discount in 2023, managed under CFO Alesia Haas’s financial strategy.  
9. **Allegiant Travel**: Deutsche Bank upgraded its stock to "Buy" amid improving Return on Invested Capital (ROIC).  
10. **UEFA Champions League**: Potential 2023/24 knockout stage matchups between Manchester City, Real Madrid, and Inter Milan.  
11. **Uber**: Exited food delivery in Italy and taxi/private hire markets in Israel due to competition and operational challenges.  
12. **Gordon McQueen**: Diagnosed with vascular dementia in 2021, which he battled until his death.  
13. **Transfer News**: Manchester United and Chelsea competing for Inter Milan’s goalkeeper André Onana, with United seeking a de Gea replacement.  

Each point reflects key updates or strategic developments from the respective domains.

**Regular RAG:**

<think>
Okay, let's tackle this query. The user is asking for the main news discussed in the document provided. First, I need to parse through the context information given.

Looking at the first part, there's a press release from Coinbase about repurchasing some of their convertible senior notes. They mention repurchasing $64.5 million of the Notes at a 29% discount. The CFO, Alesia Haas, comments on capital deployment and shareholder value, linking it to their financial discipline and performance in Q1 2023. There's also a mention of forward-looking statements and the usual legal disclaimers about risks and uncertainties.

Then there's a second part about CarMax Auto Owner Trust 2019-4 filing some forms with the SEC. It includes details like Commission File Numbers, Central Index Keys, and a signature from Andrew J. McMonigle. This seems like a routine SEC filing related to asset-backed securities, possibly an ABS-EE form. The main points here are the filing of an Asset Data File and an Asset Related Document as required by the SEC.

The user wants the main news from the document. The Coinbase section is clearly a significant announcement about their debt repurchase, which is a financial move to create shareholder value. The CarMax part is more of a regulatory filing, which might not be considered "news" in the same sense but is part of the document's content. However, since the query is about the main news, the focus should be on Coinbase's announcement. The CarMax filing is probably just additional information included in the context but isn't the primary news. I need to make sure not to mention the CarMax part unless it's relevant. Wait, the user provided both contexts, but the query is about the main news discussed in the document. The first part is a press release, which is news, while the second is a regulatory filing. So the answer should focus on Coinbase's repurchase of notes and the associated details. The CarMax part might be a separate document included in the context but not the main news. Therefore, the main news is Coinbase's repurchase of their convertible senior notes.
</think>

The main news discussed in the document is Coinbase's announcement of repurchasing $64.5 million of its 0.50% Convertible Senior Notes due 2026 at a 29% discount to par value. The company emphasized this as a strategic capital deployment to enhance shareholder value, reflecting improved financial efficiency and confidence in its business following strong first-quarter performance. The repurchase, expected to close on June 20, 2023, reduces the outstanding notes to $1.373 billion. Additionally, a separate regulatory filing by CarMax Auto Owner Trust 2019-4 regarding asset-backed securities is included, though it is unrelated to the primary Coinbase announcement.

---