# GraphRAG for Query-Focused Summarization

GraphRAG is a structured, hierarchical approach to RAG, as opposed to the classic framework, which relies on naive semantic search.

The GraphRAG process involves extracting a knowledge graph from raw text, building a community hierarchy, generating summaries, and then using these communities for RAG-based tasks.

Community detection uses the [Leiden algorithm](https://arxiv.org/pdf/1810.08473).
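
Leiden works by optimizing a quality function of a graph partition, such as modularity. The sketch below is not Leiden itself. It only computes modularity for a given partition, to show what "a good community split" means numerically. The toy graph (two triangles joined by one edge) and the function name `modularity` are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1

    # Sum of node degrees per community.
    total_degree = defaultdict(int)
    for node, d in degree.items():
        total_degree[community_of[node]] += d

    return sum(
        internal[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Toy graph (hypothetical): two triangles connected by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_communities = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_community = {node: 0 for node in "abcdef"}

q_split = modularity(edges, two_communities)
q_single = modularity(edges, one_community)
```

Splitting the graph along the bridge scores higher than lumping every node into one community, which is the kind of signal a modularity-based method follows.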

### Why Is Baseline RAG Not Enough?
Baseline RAG cannot connect the dots when the question involves information about certain attributes (people, locations, chemicals, games, etc.). Answering such questions requires traversing pieces of information via the shared attributes to provide new synthesized insights.
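
A minimal sketch of what "traversing via shared attributes" can look like, assuming each chunk has already been tagged with the entities it mentions. The chunk contents and the helper names `build_entity_index` and `connect` are hypothetical. The idea is that two entities that never co-occur in one chunk can still be linked through intermediate chunks.

```python
from collections import defaultdict, deque

# Hypothetical chunks, each tagged with the entities it mentions.
chunks = {
    "c1": {"Scrooge", "Marley"},
    "c2": {"Scrooge", "Bob Cratchit"},
    "c3": {"Bob Cratchit", "Tiny Tim"},
}


def build_entity_index(chunks):
    """Map each entity to the set of chunk ids that mention it."""
    index = defaultdict(set)
    for chunk_id, entities in chunks.items():
        for entity in entities:
            index[entity].add(chunk_id)
    return index


def connect(chunks, start_entity, goal_entity):
    """Breadth-first search over entities; two entities are neighbours
    when they appear in the same chunk. Returns the entity path or None."""
    index = build_entity_index(chunks)
    if start_entity not in index or goal_entity not in index:
        return None
    queue = deque([[start_entity]])
    seen = {start_entity}
    while queue:
        path = queue.popleft()
        if path[-1] == goal_entity:
            return path
        for chunk_id in sorted(index[path[-1]]):
            for neighbour in sorted(chunks[chunk_id]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(path + [neighbour])
    return None


path = connect(chunks, "Marley", "Tiny Tim")
```

Here no single chunk mentions both "Marley" and "Tiny Tim", so similarity search over individual chunks has nothing to retrieve that links them, while the traversal finds a path through "Scrooge" and "Bob Cratchit".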

### When to Use GraphRAG?
1. When the data contains a lot of misinformation or noise, or when the questions are abstract.
2. When RAG must be developed for a highly domain-specific niche.
3. When prompt injection attacks must be avoided at all costs.
4. When the user's questions are more vague than the underlying data.

### Example Use Cases
1. Transcripts & written conversations
2. News & social media
3. Scientific papers (use with care, as this may result in hallucinations)


### Baseline RAG vs GraphRAG
| Feature                         | Baseline RAG                             | GraphRAG                                   |
|---------------------------------|------------------------------------------|--------------------------------------------|
| **Contextual Understanding**    | Limited to surface-level context, may miss deep connections | Complexity in creating and managing graphs for deep contextual understanding |
| **Data Integration**            | Struggles with integrating structured and unstructured data seamlessly | Requires complex preprocessing to convert data into graph structures |
| **Inference Time**              | Faster but might miss nuanced connections due to simpler architecture | Slower due to the overhead of graph computations and relations |
| **Scalability**                 | Scalable with simple queries but may face issues with complex, interconnected data | More scalable but graph size and complexity can impact performance |
| **Complex Query Handling**      | Less efficient with complex queries involving multiple relations | Efficient with complex queries but at the cost of increased processing time |
| **Knowledge Integration**       | Limited to what is retrieved, may not infer implicit knowledge effectively | Integrating implicit and explicit knowledge requires sophisticated graph design |
| **Resource Requirements**       | Lower computational and memory requirements | Higher computational and memory requirements due to graph processing |
| **Flexibility**                 | Less flexible in handling diverse data types | High flexibility but at the expense of increased system complexity |
| **Accuracy with Ambiguous Data**| Struggles with highly ambiguous or multi-faceted data | Better accuracy but depends on the quality and completeness of the graph |

### How Does GraphRAG Work?
GraphRAG consists of two phases:

1. **Indexing Phase**
    1. For each chunk:
        - Entity extraction = Identifying people, places, names, plants, products, etc.
        - Relationship extraction = Identifying the relationships between entities, including links between the same entities appearing in different chunks.
    2. Knowledge Graph Generation = Building a graph whose nodes are the extracted entities and whose edges are the relationships between them.
    3. Community Detection = Based on the graph, we detect which entities are closer to each other and which ones lie further apart.
    4. Hierarchical Community Structure = The identified communities are then organized into a hierarchy, where larger, broader communities encompass smaller, more specific ones.
    5. Summarization = Summaries are generated at each level of the hierarchy, providing contextually rich insights for each community.



2. **Query Phase**
    1. Select Community Level = Based on the user's query, identify the level of detail needed.
    2. Retrieve Relevant Community Summaries = Based on that level of granularity, all relevant communities and their summaries are retrieved. Retrieved items: data points, entities, relationships, and summaries.
    3. Generate Partial Responses = The retrieved information is analyzed to understand the relationships between entities (contextual analysis). Each partial response acts like a small brick for answering the user query.
    4. Combine Partial Responses
    5. Final Answer
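
The graph-building part of the indexing phase can be sketched as follows. This is a toy illustration, not the GraphRAG implementation: the triples stand in for what an LLM extraction step might return, and `build_graph` is a hypothetical helper. It shows how relationships extracted from separate chunks merge into one graph, with repeated mentions increasing the edge weight.

```python
from collections import Counter, defaultdict

# Hypothetical extraction output: (source entity, relation, target entity) per chunk.
extracted = {
    "chunk-1": [("Scrooge", "employs", "Bob Cratchit")],
    "chunk-2": [
        ("Bob Cratchit", "father of", "Tiny Tim"),
        ("Scrooge", "employs", "Bob Cratchit"),
    ],
}


def build_graph(extracted):
    """Merge per-chunk triples into an undirected graph.

    Returns (adjacency, edge_weight), where edge_weight counts how many
    times a pair of entities was linked across all chunks."""
    adjacency = defaultdict(set)
    edge_weight = Counter()
    for triples in extracted.values():
        for source, _relation, target in triples:
            edge = tuple(sorted((source, target)))
            edge_weight[edge] += 1
            adjacency[source].add(target)
            adjacency[target].add(source)
    return adjacency, edge_weight


adjacency, edge_weight = build_graph(extracted)
```

Community detection and summarization would then operate on this graph. They are left out here because they depend on the chosen algorithm and on LLM calls.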

![image.png](../images/graphrag.png)
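
The query phase follows a map-reduce pattern: each community summary yields a partial response, and the best partial responses are combined into the final answer. The sketch below assumes a scoring scheme and replaces both LLM calls with trivial stand-ins (keyword overlap for the map step, concatenation for the reduce step). All function names are hypothetical.

```python
def map_step(query, summary):
    """Stand-in for an LLM call that answers the query from one community
    summary. Returns (partial_answer, score); the score here is simply the
    number of summary words that also occur in the query."""
    query_words = set(query.lower().split())
    score = sum(1 for word in summary.lower().split() if word in query_words)
    return summary, score


def reduce_step(partials, top_k=2):
    """Stand-in for the final LLM call: keep the highest-scoring partial
    answers with a non-zero score and join them."""
    useful = [p for p in partials if p[1] > 0]
    ranked = sorted(useful, key=lambda p: p[1], reverse=True)
    return " ".join(text for text, _score in ranked[:top_k])


def global_answer(query, community_summaries, top_k=2):
    partials = [map_step(query, s) for s in community_summaries]
    return reduce_step(partials, top_k)


# Hypothetical community summaries.
summaries = [
    "Tiny Tim is a hopeful child.",
    "Scrooge meets three spirits.",
    "Bob Cratchit cares for Tiny Tim.",
]
answer = global_answer("tiny tim", summaries)
```

Summaries that score zero are dropped before the reduce step, which mirrors the idea that irrelevant communities should not contribute to the final answer.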


In [None]:
import os
import pandas as pd
import tiktoken

from graphrag.query.indexer_adapters import read_indexer_entities, read_indexer_reports
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.structured_search.global_search.community_context import (
    GlobalCommunityContext,
)
from graphrag.query.structured_search.global_search.search import GlobalSearch

In [2]:
# env vars

API_KEY = os.environ["GRAPHRAG_API_KEY"]
LLM_MODEL = os.environ["GRAPHRAG_LLM_MODEL"]
API_BASE = os.environ["GRAPHRAG_API_BASE"]
API_VERSION = os.environ["GRAPHRAG_API_VERSION"]
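
Indexing `os.environ` directly raises a bare `KeyError` when a variable is unset. A small optional check, sketched here with a hypothetical helper `missing_env_vars`, can report all missing variables at once before any client is created.

```python
import os

REQUIRED_VARS = (
    "GRAPHRAG_API_KEY",
    "GRAPHRAG_LLM_MODEL",
    "GRAPHRAG_API_BASE",
    "GRAPHRAG_API_VERSION",
)


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
example_missing = missing_env_vars(("A", "B"), {"A": "x"})
```

Calling `missing_env_vars(REQUIRED_VARS)` at the top of the notebook and raising an error if the list is non-empty gives a clearer message than the default `KeyError`.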

In [3]:
# config

INPUT_DIR = "output/20240710-131202/artifacts"
COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"

## Global Search

In [4]:
token_encoder = tiktoken.get_encoding("cl100k_base")

llm = ChatOpenAI(
    api_key=API_KEY,
    model=LLM_MODEL,
    api_type=OpenaiApiType.AzureOpenAI,  # alternatively OpenaiApiType.OpenAI
    max_retries=20,
    api_base=API_BASE,
    api_version=API_VERSION,
)

llm

<graphrag.query.llm.oai.chat_openai.ChatOpenAI at 0x2abf687a150>

In [5]:
# community level in the Leiden community hierarchy from which we will load the community reports
# a higher value means we use reports from more fine-grained communities (at a higher computation cost)

COMMUNITY_LEVEL = 2

In [6]:
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")

In [7]:
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)
entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)
print(f"Report records: {len(report_df)}")
report_df.head()

Report records: 21


| | community | full_content | level | rank | title | rank_explanation | summary | findings | full_content_json | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14 | `# Scrooge and the Ghostly Encounters\n\nThe co...` | 1 | 7.5 | Scrooge and the Ghostly Encounters | The impact severity rating is high due to the ... | The community centers around Ebenezer Scrooge ... | `[{'explanation': 'The Ghost plays a crucial ro...` | `{\n "title": "Scrooge and the Ghostly Encou...` | e1c2427c-fe5b-4867-9a17-e88b910b48ef |
| 1 | 15 | `# Scrooge, Christmas, and the Cratchit Family\...` | 1 | 7.5 | Scrooge, Christmas, and the Cratchit Family | The impact severity rating is high due to the ... | The community revolves around the characters o... | `[{'explanation': 'Scrooge's character undergoe...` | `{\n "title": "Scrooge, Christmas, and the C...` | 3f8a16fc-76b2-4e5f-af44-42aedd88eade |
| 2 | 16 | `# Scrooge's Nephew and Christmas-time Communit...` | 1 | 6.5 | Scrooge's Nephew and Christmas-time Community | The impact severity rating is moderate due to ... | The community centers around Scrooge's Nephew,... | `[{'explanation': 'Scrooge's Nephew is a centra...` | `{\n "title": "Scrooge's Nephew and Christma...` | e718de5f-8d42-4154-b5c9-f0c4c997e930 |
| 3 | 17 | `# Scrooge and the Spirits\n\nThe community cen...` | 1 | 8.5 | Scrooge and the Spirits | The impact severity rating is high due to the ... | The community centers around Scrooge, a miserl... | `[{'explanation': 'Scrooge is initially portray...` | `{\n "title": "Scrooge and the Spirits",\n ...` | 6ea81a6e-26f7-4a2a-ad60-b3525ed9fa68 |
| 4 | 18 | `# Scrooge and the Cratchit Family\n\nThe commu...` | 1 | 7.5 | Scrooge and the Cratchit Family | The impact severity rating is high due to Mr. ... | The community centers around Mr. Scrooge and h... | `[{'explanation': 'Mr. Scrooge is the central f...` | `{\n "title": "Scrooge and the Cratchit Fami...` | 4c84e4a4-d9c9-4b86-86e7-045a7039484e |


## Global Search parameters config

To learn more, see the official Microsoft source code for `GlobalSearch` [here](https://github.com/microsoft/graphrag/blob/607344022cebc8b8bfae91f5b41088753d5ad30c/graphrag/query/structured_search/global_search/search.py).

In [8]:
context_builder = GlobalCommunityContext(
    community_reports=reports,
    entities=entities,  # default to None if you don't want to use community weights for ranking
    token_encoder=token_encoder,
)


context_builder_params = {
    "use_community_summary": False,  # False = full community reports. True = community short summaries.
    "shuffle_data": True,
    "include_community_rank": True,
    "min_community_rank": 0,
    "community_rank_name": "rank",
    "include_community_weight": True,
    "community_weight_name": "occurrence weight",
    "normalize_community_weight": True,
    "max_tokens": 4500,
    "context_name": "Reports",
}

map_llm_params = {
    "max_tokens": 1000,
    "temperature": 0.0,
    "response_format": {"type": "json_object"},
}

reduce_llm_params = {
    "max_tokens": 2000,
    "temperature": 0.0,
}
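
The `max_tokens` entry in `context_builder_params` caps how much report text goes into a single context window. The sketch below illustrates the general idea with a greedy packer. It is an assumption about the mechanism, not the library's code: `pack_reports` is hypothetical, and whitespace splitting stands in for the `tiktoken` encoder.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def pack_reports(reports, max_tokens):
    """Greedily group reports into batches whose total size stays within max_tokens."""
    batches, current, used = [], [], 0
    for report in reports:
        size = count_tokens(report)
        if size > max_tokens:
            continue  # assumption: a report that can never fit is skipped
        if current and used + size > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(report)
        used += size
    if current:
        batches.append(current)
    return batches


# Hypothetical "reports" of 3, 2, 4 and 1 tokens with a budget of 5 tokens per batch.
batches = pack_reports(["a b c", "d e", "f g h i", "j"], max_tokens=5)
```

Each resulting batch would correspond to one map-step LLM call, so a smaller `max_tokens` means more, smaller calls.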

In [9]:
search_engine = GlobalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    max_data_tokens=25000,
    map_llm_params=map_llm_params,
    reduce_llm_params=reduce_llm_params,
    allow_general_knowledge=False,  # setting this to True adds an instruction encouraging the LLM to incorporate general knowledge in the response, which may increase hallucinations but could be useful in some use cases.
    json_mode=True,  # set this to False if your LLM does not support JSON mode.
    context_builder_params=context_builder_params,
    concurrent_coroutines=10,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)
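
The `concurrent_coroutines` argument limits how many LLM calls are in flight at once. A common way to get this behaviour with `asyncio` is a semaphore. The sketch below shows that pattern in isolation: `run_bounded`, `demo`, and `fake_map_call` are hypothetical names, and the sleep stands in for a network call.

```python
import asyncio


async def run_bounded(items, worker, limit):
    """Run worker(item) for every item, with at most `limit` running concurrently."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def demo(limit=3, n=10):
    """Track the peak number of simultaneously running workers."""
    running = 0
    peak = 0

    async def fake_map_call(batch_id):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stand-in for an LLM request
        running -= 1
        return batch_id * 2

    results = await run_bounded(range(n), fake_map_call, limit)
    return results, peak
```

Running `asyncio.run(demo())` returns the results in input order together with a peak concurrency that never exceeds the limit. Inside a notebook, `nest_asyncio.apply()` is needed first.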

## Run Global Search

In [10]:
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def main(query: str):
    result = await search_engine.asearch(query)
    print(result.response)
    return result

In [11]:
query = "Can Tiny Tim's behaviour be described as deviant?"

result = asyncio.run(main(query))

# inspect the data used to build the context for the LLM responses
print(result.context_data["reports"])

# inspect number of LLM calls and tokens
print(f"\nLLM calls: {result.llm_calls}")
print(f"LLM prompt tokens: {result.prompt_tokens}")

### Analysis of Tiny Tim's Behavior

Based on the provided data, Tiny Tim's behavior cannot be described as deviant. The dataset consistently portrays Tiny Tim as a loving, hopeful, and innocent child who is deeply cared for by his family. There is no indication of any deviant behavior in the available information.

### Familial Relationships and Environment

Tiny Tim's father, Bob Cratchit, is shown to be very affectionate towards him and is emotionally moved by Tim's thoughts and hopes [Data: Reports (20)]. The interactions between Tiny Tim and his siblings are described as playful and affectionate, further indicating a positive and loving family environment [Data: Reports (20)]. This strong familial bond underscores the innocence and positive nature of Tiny Tim's character [Data: Reports (18, 20)].

### Health Condition and Family Concern

The data also highlights Tiny Tim's frail health condition, which is a significant source of concern for his parents, especially his mother [Data