# Knowledge Graph RAG

<img src="./media/graph_start.png" width=600>

*[Improving Knowledge Graph Completion with Generative LM and neighbors](https://deeppavlov.ai/research/tpost/bn15u1y4v1-improving-knowledge-graph-completion-wit)*

In the evolving landscape of AI and information retrieval, knowledge graphs have emerged as a powerful way to represent complex, interconnected information. A knowledge graph is a knowledge base that uses a graph-structured data model or topology to represent and operate on data. Knowledge graphs are often used to store interlinked descriptions of entities – objects, events, situations or abstract concepts – while also encoding the free-form semantics or relationships underlying these entities. [Source: Wikipedia](https://en.wikipedia.org/wiki/Knowledge_graph)

What makes knowledge graphs particularly powerful is their ability to mirror human cognition in data. They explicitly map how objects, concepts, and ideas relate to one another through both semantic and relational connections. This closely parallels how our brains naturally understand and internalize information: not as isolated facts, but as a web of interconnected concepts and relationships.

<img src="./media/coffee_graph_ex.png" width=400>

Looking at a concept like "coffee," we don't just know it's a beverage; we automatically connect it to related concepts like beans, brewing methods, caffeine, morning routines, and social interactions. Knowledge graphs capture these natural associations in a structured way.

Traditional RAG systems, while effective at semantic similarity-based retrieval, often struggle to capture broader conceptual relationships across text chunks. Knowledge Graph RAG addresses this limitation by introducing a structured, hierarchical approach to information organization and retrieval. By representing data in a graph format, these systems can traverse relationships between concepts, enabling more sophisticated query understanding and response generation. This approach allows for targeted querying along specific relationship paths, handles complex multi-hop questions, and provides clearer reasoning through explicit connection paths. The result is a more nuanced and interpretable system that combines the structured reasoning of knowledge graphs with the natural language capabilities of large language models.
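To make the multi-hop idea concrete, here is a minimal sketch in plain Python. The triples, the relation names, and the `build_adjacency` and `find_path` helpers are illustrative assumptions built on the coffee example; none of them are part of GraphRAG.

```python
from collections import deque

# Toy (subject, relation, object) triples; the contents are illustrative only.
TRIPLES = [
    ("coffee", "made_from", "beans"),
    ("coffee", "contains", "caffeine"),
    ("coffee", "part_of", "morning routine"),
    ("caffeine", "affects", "alertness"),
    ("beans", "prepared_by", "brewing"),
]


def build_adjacency(triples):
    """Index outgoing edges by subject so traversal is a dictionary lookup."""
    adjacency = {}
    for subject, relation, obj in triples:
        adjacency.setdefault(subject, []).append((relation, obj))
    return adjacency


def find_path(adjacency, start, goal):
    """Breadth-first search; returns the shortest chain of triples, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


adjacency = build_adjacency(TRIPLES)
path = find_path(adjacency, "coffee", "alertness")
for subject, relation, obj in path:
    print(f"{subject} -[{relation}]-> {obj}")
```

The returned path is the explicit chain of connections: a two-hop answer that a retriever relying only on chunk similarity would have to find in a single chunk.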

While [knowledge graphs are not a new concept](https://blog.google/products/search/introducing-knowledge-graph-things-not/), their creation has traditionally been a resource-intensive process. Early knowledge graphs were built either through manual curation by domain experts or by converting existing structured data from relational databases. This limited both their scale and adaptability to new domains.

<img src="./media/table_comp.png" width=600>

*[What is a Knowledge Graph (KG)?](https://zilliz.com/learn/what-is-knowledge-graph)*

The introduction of LLMs has transformed this landscape. LLMs' capabilities in NLP, reasoning, and relationship extraction now enable automated construction of knowledge graphs from unstructured text. These models can identify entities, infer relationships, and structure information in ways that previously required extensive manual labor. As a plus, this allows knowledge graphs to be dynamically updated and expanded as new information becomes available, making them more practical and scalable for real-world applications.

To see this in action, and to compare it with traditional vector similarity techniques, we'll take a look at Microsoft's open-source [GraphRAG](https://microsoft.github.io/graphrag/) and how it works behind the scenes.

---
## 3 Main Components of Knowledge Graphs

**Entity**

<img src="./media/entities.png" width=500>

An Entity is a distinct object, person, place, event, or concept that has been extracted from a chunk of text through LLM analysis. Entities form the nodes of the knowledge graph. During the creation of the knowledge graph, when duplicate entities are found they are merged while preserving their various descriptions, creating a comprehensive representation of each unique entity.
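The merge step can be sketched roughly as follows. This is a simplified illustration, not GraphRAG's implementation: the sample records, the upper-case title normalization, and the `merge_entities` helper are all assumptions.

```python
from collections import defaultdict

# Hypothetical per-chunk extraction results: (title, type, description).
extracted = [
    ("BERT", "MODEL", "BERT is a bidirectional encoder model."),
    ("bert", "MODEL", "BERT is commonly fine-tuned for classification."),
    ("GPT-2", "MODEL", "GPT-2 is a pre-trained language model."),
    ("BERT", "MODEL", "BERT is a bidirectional encoder model."),
]


def merge_entities(records):
    """Group mentions by normalized title and keep every distinct description."""
    merged = defaultdict(lambda: {"type": None, "descriptions": [], "frequency": 0})
    for title, entity_type, description in records:
        key = title.strip().upper()  # assumption: titles are matched case-insensitively
        entry = merged[key]
        entry["type"] = entry["type"] or entity_type
        entry["frequency"] += 1  # number of mentions across chunks
        if description not in entry["descriptions"]:
            entry["descriptions"].append(description)
    return dict(merged)


entities_by_title = merge_entities(extracted)
for title, entry in entities_by_title.items():
    print(title, entry["frequency"], entry["descriptions"])
```

In a full pipeline the collected descriptions would typically be condensed into one summary per entity; here they are simply kept as a list.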

**Relationship**

<img src="./media/relationship.png" width=400>

A Relationship defines a connection between two entities in the knowledge graph. These connections are extracted directly from text units through LLM analysis, alongside entities. Each relationship includes a source entity, target entity, and descriptive information about their connection. When duplicate relationships are found between the same entities, they are merged by combining their descriptions to create a more complete understanding of the connection.
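A comparable sketch for relationships is shown below. The sample edges, the `merge_relationships` helper, and the choice to sum weights when duplicates are merged are assumptions made for illustration.

```python
# Hypothetical extracted edges: (source, target, description, weight).
extracted_edges = [
    ("COFFEE", "CAFFEINE", "Coffee contains caffeine.", 1.0),
    ("COFFEE", "CAFFEINE", "Caffeine is the main stimulant in coffee.", 2.0),
    ("COFFEE", "BEANS", "Coffee is brewed from roasted beans.", 1.0),
]


def merge_relationships(records):
    """Merge edges that share the same (source, target) pair."""
    merged = {}
    for source, target, description, weight in records:
        entry = merged.setdefault((source, target), {"descriptions": [], "weight": 0.0})
        if description not in entry["descriptions"]:
            entry["descriptions"].append(description)
        entry["weight"] += weight  # assumption: duplicate edges add up their weights
    return merged


merged_edges = merge_relationships(extracted_edges)
for (source, target), entry in merged_edges.items():
    print(source, "->", target, entry["weight"], entry["descriptions"])
```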

**Community**

<img src="./media/communities.png" width=400>

A Community is a cluster of related entities and relationships identified through hierarchical community detection, generally using the [Leiden Algorithm](https://en.wikipedia.org/wiki/Leiden_algorithm). Communities create a structured way to understand different levels of granularity within the knowledge graph, from broad overviews at the top level to detailed local clusters at lower levels. This hierarchical structure helps in organizing and navigating complex knowledge graphs.
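Leiden searches for a partition that scores well on a quality function such as modularity. The sketch below does not implement Leiden itself; it only computes modularity for a given partition of a toy graph (two triangles joined by one bridge edge), to show what a "good" community split looks like numerically. The graph and the `modularity` helper are illustrative assumptions.

```python
from collections import Counter

# Undirected toy graph: two triangles joined by the bridge edge (c, d).
EDGES = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]


def modularity(edges, community_of):
    """Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    degree = Counter()
    internal = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    degree_sum = Counter()
    for node, d in degree.items():
        degree_sum[community_of[node]] += d
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


two_triangles = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_blob = {node: 0 for node in "abcdef"}

q_split = modularity(EDGES, two_triangles)
q_single = modularity(EDGES, one_blob)
print(f"two communities: {q_split:.3f}, one community: {q_single:.3f}")
```

Splitting along the bridge scores higher than lumping everything together, which is the kind of signal a community detection algorithm exploits, level by level, to build the hierarchy.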

---
## GraphRAG Creation Data Flow

<img src=./media/graph_building.png width=1000>

Indexing in GraphRAG is an extensive process: we load the document, split it into chunks, create subgraphs at the chunk level, combine these subgraphs into the final graph, algorithmically identify communities, and then summarize each community's main features.
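The chunking step can be illustrated with a small overlapping-window splitter. This is a sketch under stated assumptions: `chunk_tokens` is a hypothetical helper, and it splits on whitespace-separated words, whereas GraphRAG counts model tokens (the settings used later in this notebook ask for 1200 tokens with an overlap of 120).

```python
def chunk_tokens(tokens, size, overlap):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Words stand in for tokens here; a real pipeline would use a tokenizer.
words = "load split extract merge cluster summarize embed query answer cite".split()
for chunk in chunk_tokens(words, size=4, overlap=1):
    print(chunk)
```

The overlap means an entity mentioned at a chunk boundary still appears with some surrounding context in at least one chunk, at the cost of processing a few tokens twice.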

---

## 1. Set paths & API key

In [52]:
import os

from pathlib import Path
from dotenv import load_dotenv


load_dotenv()

PROJECT_DIR = Path("./graphrag_demo").resolve()
INPUT_DIR   = PROJECT_DIR / "input"
OUTPUT_DIR  = PROJECT_DIR / "output"
for p in (INPUT_DIR, OUTPUT_DIR):
    p.mkdir(parents=True, exist_ok=True)

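# OpenAI (or Azure OpenAI): GraphRAG reads GRAPHRAG_API_KEY via settings.yml, see below
# Copy OPENAI_API_KEY into GRAPHRAG_API_KEY, failing early if it is missing
# (assigning None to os.environ would raise a TypeError)
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file")
os.environ["GRAPHRAG_API_KEY"] = openai_api_key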

## 2. Download the PDF and Extract Text

In [53]:
import requests
from pypdf import PdfReader

pdf_url = "https://arxiv.org/pdf/2408.13296.pdf"
pdf_path = INPUT_DIR / "ft_guide.pdf"
txt_path = INPUT_DIR / "ft_guide.txt"

if not pdf_path.exists():
    r = requests.get(pdf_url, timeout=60)
    r.raise_for_status()
    pdf_path.write_bytes(r.content)

reader = PdfReader(str(pdf_path))
pages = []
for page in reader.pages:
    try:
        pages.append(page.extract_text() or "")
    except Exception:
        pages.append("")
raw_text = "\n\n".join(pages)

# (light cleanup – optional)
raw_text = "\n".join(line.strip() for line in raw_text.splitlines())

txt_path.write_text(raw_text, encoding="utf-8")
len(raw_text), str(txt_path)

Ignoring wrong pointing object 414 0 (offset 0)


(263878,
 '/Users/darwinnacionales/Development/Practice/langgraph/l008-GraphRAG/graphrag_demo/input/ft_guide.txt')

## 3. Write a minimal settings.yml programmatically

In [54]:
import yaml

settings = {
    "models": {
        "default_chat_model": {
            "type": "openai_chat",
            "api_key": "${GRAPHRAG_API_KEY}",
            "model": "gpt-4o-mini",
            "model_supports_json": True
        },
        "default_embedding_model": {
            "type": "openai_embedding",
            "api_key": "${GRAPHRAG_API_KEY}",
            "model": "text-embedding-3-small"
        },
    },

    # Input: read *.txt from ./input
    "input": {
        "type": "file",
        "base_dir": "input",
        "file_type": "text",
        # optional: restrict to a pattern if you add more files later
        "file_pattern": ".*\\.txt$$",
    },

    # Chunking: tradeoff between recall and cost
    "chunks": {
        "size": 1200,     # tokens
        "overlap": 120,   # tokens
        "strategy": "tokens",
    },

    # Output location for parquet artifacts
    "output": {
        "type": "file",
        "base_dir": "output",
    },

    # Vector store defaults to LanceDB behind the scenes (fine for local demo)
    # See docs for other backends or Azure AI Search. 
}

with open(PROJECT_DIR / "settings.yml", "w") as f:
    yaml.safe_dump(settings, f, sort_keys=False)

(PROJECT_DIR / "settings.yml").read_text()

'models:\n  default_chat_model:\n    type: openai_chat\n    api_key: ${GRAPHRAG_API_KEY}\n    model: gpt-4o-mini\n    model_supports_json: true\n  default_embedding_model:\n    type: openai_embedding\n    api_key: ${GRAPHRAG_API_KEY}\n    model: text-embedding-3-small\ninput:\n  type: file\n  base_dir: input\n  file_type: text\n  file_pattern: .*\\.txt$$\nchunks:\n  size: 1200\n  overlap: 120\n  strategy: tokens\noutput:\n  type: file\n  base_dir: output\n'

## 4. Build the Index

In [55]:
import pandas as pd
import graphrag.api as api
from graphrag.config.load_config import load_config

cfg = load_config(PROJECT_DIR)  # auto-loads settings.yml and env var
# cfg

In [56]:
# The API is async under the hood – run in a notebook cell:
index_result = await api.build_index(config=cfg)

# Pretty print per-workflow status
for w in index_result:
    status = "OK" if not w.errors else f"ERROR: {w.errors}"
    print(f"Workflow: {w.workflow} -> {status}")

2025-08-10 19:22:02.0039 - INFO - graphrag.api.index - Initializing indexing pipeline...
2025-08-10 19:22:02.0040 - INFO - graphrag.index.workflows.factory - Creating pipeline with workflows: ['load_input_documents', 'create_base_text_units', 'create_final_documents', 'extract_graph', 'finalize_graph', 'extract_covariates', 'create_communities', 'create_final_text_units', 'create_community_reports', 'generate_text_embeddings']
2025-08-10 19:22:02.0040 - INFO - graphrag.storage.file_pipeline_storage - Creating file storage at /Users/darwinnacionales/Development/Practice/langgraph/l008-GraphRAG/graphrag_demo/input
2025-08-10 19:22:02.0040 - INFO - graphrag.storage.file_pipeline_storage - Creating file storage at /Users/darwinnacionales/Development/Practice/langgraph/l008-GraphRAG/graphrag_demo/output
2025-08-10 19:22:02.0042 - INFO - graphrag.index.run.run_pipeline - Running standard indexing.
2025-08-10 19:22:02.0044 - INFO - graphrag.index.run.run_pipeline - Executing pipeline...
2025-

## 5. Load Index Artifacts (Parquet Tables) for Querying

In [57]:
import pandas as pd

entities           = pd.read_parquet(OUTPUT_DIR / "entities.parquet")
communities        = pd.read_parquet(OUTPUT_DIR / "communities.parquet")
community_reports  = pd.read_parquet(OUTPUT_DIR / "community_reports.parquet")
relationships      = pd.read_parquet(OUTPUT_DIR / "relationships.parquet")

len(entities), len(communities), len(community_reports), len(relationships)

(321, 61, 61, 345)

### Looking at Final Entities and Relationships

In [58]:
entities.head(10)

Unnamed: 0,id,human_readable_id,title,type,description,text_unit_ids,frequency,degree,x,y
0,9e7f23f1-60d7-4e15-aa04-8c0fac51cd2b,0,VENKATESH BALAVADHANI PARTHASARATHY,PERSON,Venkatesh Balavadhani Parthasarathy is one of ...,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...,1,1,0.0,0.0
1,3ecc7b35-c0c8-4684-99eb-33278762e5e5,1,AHTSHAM ZAFAR,PERSON,Ahtsham Zafar is one of the authors of the doc...,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...,1,1,0.0,0.0
2,15869250-b1c6-495f-ace7-ae4a2697665c,2,AAFAQ KHAN,PERSON,Aafaq Khan is one of the authors of the docume...,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...,1,1,0.0,0.0
3,18ab75d3-4a91-43ac-85e5-ee51fc622f96,3,ARSALAN SHAHID,PERSON,Arsalan Shahid is one of the authors of the do...,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...,1,1,0.0,0.0
4,d331dc9a-eedb-4d48-8222-1589e6436a68,4,CEADAR,ORGANIZATION,"CeADAR is Ireland’s Centre for AI, affiliated ...",[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...,1,6,0.0,0.0
5,12868df3-b2c0-4d83-999c-9d7de5d4931d,5,UNIVERSITY COLLEGE DUBLIN,ORGANIZATION,University College Dublin is a higher educatio...,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...,1,2,0.0,0.0
6,f08dc2c8-b94f-4396-ac2e-41ef30db4b20,6,DUBLIN,GEO,"Dublin is the capital city of Ireland, where U...",[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...,1,1,0.0,0.0
7,89f4c1c1-23c7-48ea-9989-46984e02f774,7,LARGE LANGUAGE MODELS,EVENT,Large Language Models (LLMs) represent signifi...,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...,1,1,0.0,0.0
8,7029e8ed-9e3f-4fe3-9e43-50a271d8c776,8,GPT-2,ORGANIZATION,GPT-2 is a pre-trained language model develope...,[68f3efe5403089c45e0fc65f07d1a4435bc5e5e4d8f80...,1,1,0.0,0.0
9,5eddaa9b-6ade-4cf2-8275-abb6c075f7bf,9,BERT,ORGANIZATION,"BERT, which stands for Bidirectional Encoder R...",[68f3efe5403089c45e0fc65f07d1a4435bc5e5e4d8f80...,2,3,0.0,0.0


In [51]:
relationships.head(10)

Unnamed: 0,id,human_readable_id,source,target,description,weight,combined_degree,text_unit_ids
0,d7f45c45-13ac-43dd-a266-383d432bd1da,0,VENKATESH BALAVADHANI PARTHASARATHY,CEADAR,Venkatesh Balavadhani Parthasarathy is affilia...,7.0,7,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
1,51d32096-1a51-45f4-b860-0d5c6d96c9c2,1,AHTSHAM ZAFAR,CEADAR,Ahtsham Zafar is affiliated with CeADAR as an ...,7.0,7,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
2,d9217c22-69a5-4b71-a4fb-30a81d3dd77f,2,AAFAQ KHAN,CEADAR,Aafaq Khan is affiliated with CeADAR as an aut...,7.0,7,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
3,4ea35a07-6af2-472e-a8c1-71be3390604b,3,ARSALAN SHAHID,CEADAR,Arsalan Shahid is affiliated with CeADAR as an...,7.0,7,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
4,b33db089-3de1-4858-a4ee-46c2c06d3823,4,CEADAR,UNIVERSITY COLLEGE DUBLIN,CeADAR is a research group within University C...,8.0,8,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
5,02f46461-dbe7-4bd4-bb42-c4c7758105bb,5,CEADAR,LARGE LANGUAGE MODELS,CeADAR conducts research on Large Language Mod...,1.0,17,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
6,8b19b3c2-cd19-4bcc-b265-c80d776921a1,6,UNIVERSITY COLLEGE DUBLIN,DUBLIN,University College Dublin is located in Dublin...,9.0,3,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
7,f351d41d-b36b-45fb-b013-ee5e714bd4d2,7,LARGE LANGUAGE MODELS,GPT-3,GPT-3 is a significant example of a large lang...,8.0,22,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
8,95cfd209-021c-4bab-bab0-8732a822a432,8,LARGE LANGUAGE MODELS,GPT-4,GPT-4 is an advanced version of large language...,8.0,17,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
9,00610e86-cd69-4668-8225-f20b26e91ce0,9,LARGE LANGUAGE MODELS,REINFORCEMENT LEARNING FROM HUMAN FEEDBACK,Reinforcement Learning from Human Feedback is ...,7.0,12,[c11226660d4d5837d8b81fadb7074701e2cf8882560b8...
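One way to read the `degree` and `combined_degree` columns is sketched below, using only the (source, target) pairs visible in the ten rows above. The interpretation (degree as the number of edges touching a node, combined degree as the sum of both endpoint degrees) is an assumption inferred from these sample rows, and the helper names are hypothetical. Nodes that have edges beyond the ten rows shown, such as LARGE LANGUAGE MODELS, are undercounted here.

```python
from collections import Counter

# (source, target) pairs copied from the ten relationship rows shown above.
EDGE_ROWS = [
    ("VENKATESH BALAVADHANI PARTHASARATHY", "CEADAR"),
    ("AHTSHAM ZAFAR", "CEADAR"),
    ("AAFAQ KHAN", "CEADAR"),
    ("ARSALAN SHAHID", "CEADAR"),
    ("CEADAR", "UNIVERSITY COLLEGE DUBLIN"),
    ("CEADAR", "LARGE LANGUAGE MODELS"),
    ("UNIVERSITY COLLEGE DUBLIN", "DUBLIN"),
    ("LARGE LANGUAGE MODELS", "GPT-3"),
    ("LARGE LANGUAGE MODELS", "GPT-4"),
    ("LARGE LANGUAGE MODELS", "REINFORCEMENT LEARNING FROM HUMAN FEEDBACK"),
]


def node_degrees(pairs):
    """Count how many edges touch each node (undirected view)."""
    degree = Counter()
    for source, target in pairs:
        degree[source] += 1
        degree[target] += 1
    return degree


def combined_degree(degree, source, target):
    """Assumed definition: degree of the source plus degree of the target."""
    return degree[source] + degree[target]


sample_degree = node_degrees(EDGE_ROWS)
print(sample_degree["CEADAR"])
print(combined_degree(sample_degree, "VENKATESH BALAVADHANI PARTHASARATHY", "CEADAR"))
print(combined_degree(sample_degree, "UNIVERSITY COLLEGE DUBLIN", "DUBLIN"))
```

For CEADAR and the author rows, these counts agree with the values in the two tables, which is what suggests this reading of the columns.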


### **Community Detection & Node Embedding**

<img src="./media/leidan.png" width=600>

After we have our basic graph with entities and relationships, we analyze its structure in two ways. Community Detection uses the [Leiden algorithm](https://en.wikipedia.org/wiki/Leiden_algorithm) to find explicit groupings in the graph, creating a hierarchy of related entities. The lower in the hierarchy, the more granular the community. Node Embedding uses [Node2Vec](https://arxiv.org/abs/1607.00653) to create vector representations of each entity, capturing implicit relationships in the graph structure. These complementary approaches let us understand both obvious connections through communities and subtle patterns through embeddings.
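Node2Vec learns embeddings by sampling random walks over the graph and training a skip-gram model on them, treating each walk like a sentence. The sketch below covers only the walk-sampling half, and it uses uniform neighbor choice, which corresponds to Node2Vec with its return and in-out parameters both set to 1. The toy graph and helper names are assumptions for illustration.

```python
import random

# Toy undirected graph as adjacency lists; node names are illustrative.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}


def random_walk(graph, start, length, rng):
    """Walk `length` nodes from `start`, picking a uniform random neighbor each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Start several walks from every node; these act as 'sentences' for skip-gram."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for node in graph:
            walks.append(random_walk(graph, node, length, rng))
    return walks


walks = generate_walks(GRAPH, walks_per_node=2, length=5)
for walk in walks:
    print(" -> ".join(walk))
```

Nodes that frequently co-occur in walks end up with similar vectors, which is how structurally close entities become close in embedding space even without a direct edge.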

Combining all of this with our relationships gives us our final nodes.

---

## GraphRAG Retrieval

<img src="./media/kg_retrieval.png" width=600>

*[Unifying Large Language Models and Knowledge Graphs: A Roadmap](https://arxiv.org/pdf/2306.08302)*

With our knowledge graph constructed and hierarchical communities delineated, we can now perform several types of search that take advantage of both the graph structure and the multiple levels of specificity across our communities. Specifically:

1. **Global Search**: Uses the LLM-generated community reports from a specified level of the graph's community hierarchy as context data to generate a response.
2. **Local Search**: Combines structured data from the knowledge graph with unstructured data from the input document(s) to augment the LLM context with relevant entity information.
3. **DRIFT Search**: Dynamic Reasoning and Inference with Flexible Traversal, an approach to local search queries that includes community information in the search process, thus combining global and local search.

## 6. Run Global search (good for “whole-paper” questions)

<img src="./media/global_search.png" width=1000>

Through the semantic clustering of communities during the indexing process outlined above, we created community reports that summarize the high-level themes across these groupings. Having this community summary data at various levels allows us to do something that traditional RAG performs poorly at: answering queries about broad themes and ideas across our unstructured data.

To capture as much broad information as possible in an efficient manner, GraphRAG implements a [map reduce](https://en.wikipedia.org/wiki/MapReduce) approach. Given a query, relevant community node reports at a specific hierarchical level are retrieved. These are shuffled and chunked, where each chunk is used to generate a list of points that each have their own "importance score". These intermediate points are ranked and filtered, attempting to maintain the most important points. These become the aggregate intermediary response, which is passed to the LLM as the context for the final response.
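The shape of this map-reduce flow can be sketched without any LLM calls. Everything below is a stand-in: the sample reports are invented, `map_step` scores points by keyword overlap instead of asking a model for an importance score, and the "reduce" step only assembles the context string that would be handed to the final LLM call.

```python
import random

# Invented community report summaries, for illustration only.
REPORTS = [
    "data preparation and cleaning for fine-tuning",
    "reinforcement learning from human feedback",
    "fine-tuning with parameter efficient adapters",
    "deployment and monitoring of models",
    "evaluation data for benchmarks",
]


def map_step(report_batch, query):
    """Stand-in for the LLM map call: one scored point per report."""
    query_terms = set(query.lower().split())
    points = []
    for report in report_batch:
        overlap = len(query_terms & set(report.lower().split()))
        points.append({"text": report, "score": overlap})
    return points


def global_search_sketch(reports, query, batch_size=2, max_points=3, seed=0):
    """Shuffle, batch, map to scored points, then rank and filter."""
    shuffled = reports[:]
    random.Random(seed).shuffle(shuffled)
    batches = [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]
    points = [point for batch in batches for point in map_step(batch, query)]
    useful = [point for point in points if point["score"] > 0]
    kept = sorted(useful, key=lambda point: point["score"], reverse=True)[:max_points]
    context = "\n".join(point["text"] for point in kept)  # input to the final LLM call
    return kept, context


kept_points, reduce_context = global_search_sketch(REPORTS, "fine-tuning data preparation")
print(reduce_context)
```

Because each batch is mapped independently, the map calls can run in parallel, and only the highest-scoring points compete for space in the final context.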

In [59]:
query = "What are the key best practices and pitfalls in fine-tuning LLMs this paper emphasizes?"

response, context = await api.global_search(
    config=cfg,
    entities=entities,
    communities=communities,
    community_reports=community_reports,
    community_level=2,                 # start at level 2; tweak if needed
    dynamic_community_selection=True,  # often improves relevance
    response_type="Multiple Paragraphs",
    query=query,
)

print(response)

# Best Practices and Pitfalls in Fine-Tuning Large Language Models (LLMs)

Fine-tuning large language models (LLMs) is a critical process that can significantly enhance their performance and adaptability to specific tasks. The paper outlines several best practices and potential pitfalls that practitioners should be aware of when engaging in this process.

## Key Best Practices

### 1. Advanced Fine-Tuning Techniques
The paper emphasizes the importance of utilizing advanced techniques such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO). These methods are designed to align model outputs with desired behaviors and human preferences, thereby enhancing the overall effectiveness of the fine-tuning process [Data: Reports (9, 1, 3)].

### 2. Retaining Foundational Knowledge
Another critical best practice is the retention of foundational knowledge during fine-tuning. Techniques like Half Fine-Tuning (HFT) allow models to learn new tasks while preserving essential

## 7. Run Local search (good for entity-specific or scoped Qs)

<img src="./media/local_search.png" width=900>

The GraphRAG approach to local search is the most similar to regular semantic RAG search. It combines structured data from the knowledge graph with unstructured data from the input documents to augment the LLM context with relevant entity information. In essence, we first use semantic search to find the entities most relevant to the query. These become the entry points into the graph. Starting at these points, we look at connected chunks of text, community reports, other entities, and the relationships between them. All of the retrieved data is filtered and ranked to fit into a pre-defined context window.
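A stripped-down version of that flow might look like the following. The three-dimensional vectors, entity names, relationship descriptions, and the word-count budget are all invented for illustration; a real system would use embedding-model vectors and a token budget.

```python
import math

# Invented entity embeddings and relationships, for illustration only.
ENTITY_VECTORS = {
    "LORA": [0.9, 0.1, 0.0],
    "DATA CLEANING": [0.1, 0.9, 0.1],
    "DEPLOYMENT": [0.0, 0.1, 0.9],
}
RELATIONSHIPS = [
    ("DATA CLEANING", "DATA ANNOTATION", "Cleaning usually precedes annotation."),
    ("LORA", "ADAPTERS", "LoRA trains small adapter matrices."),
    ("DEPLOYMENT", "MONITORING", "Deployed models need monitoring."),
]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_entities(query_vector, vectors, k):
    """Entry points: the k entities whose embeddings are closest to the query."""
    ranked = sorted(vectors, key=lambda name: cosine(query_vector, vectors[name]), reverse=True)
    return ranked[:k]


def build_context(query_vector, k=2, max_words=30):
    """Collect relationships touching the entry points until the budget is spent."""
    selected = top_k_entities(query_vector, ENTITY_VECTORS, k)
    lines, used, seen = [], 0, set()
    for entity in selected:
        for index, (source, target, description) in enumerate(RELATIONSHIPS):
            if entity not in (source, target) or index in seen:
                continue
            cost = len(description.split())
            if used + cost > max_words:
                return selected, lines  # budget exhausted
            seen.add(index)
            lines.append(f"{source} -> {target}: {description}")
            used += cost
    return selected, lines


entry_points, context_lines = build_context([0.2, 0.8, 0.0])
print(entry_points)
print("\n".join(context_lines))
```

Iterating over entry points in similarity order means the most relevant entity's neighborhood is added first, so it is the last to be cut when the budget runs out.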

In [60]:
# 1) Local search operates over the graph + text units, so load the parquet tables:
entity_df       = pd.read_parquet(OUTPUT_DIR / "entities.parquet")
community_df    = pd.read_parquet(OUTPUT_DIR / "communities.parquet")
rel_df          = pd.read_parquet(OUTPUT_DIR / "relationships.parquet")
reports_df      = pd.read_parquet(OUTPUT_DIR / "community_reports.parquet")
text_units_df   = pd.read_parquet(OUTPUT_DIR / "text_units.parquet")

# 2) Call the high-level API (requires these args)
response_local, ctx_local = await api.local_search(
    config=cfg,
    entities=entity_df,
    communities=community_df,
    relationships=rel_df,
    text_units=text_units_df,
    community_reports=reports_df,
    covariates=None,
    community_level=2,  # try 2 or 3; higher = narrower/more specific
    query="Summarize recommended data preparation steps before fine-tuning and why they matter.",
    response_type="Bulleted List",
)

print(response_local)

## Recommended Data Preparation Steps Before Fine-Tuning

- **Data Collection**: Gather data from diverse sources (e.g., CSV, web pages, databases) to ensure a comprehensive dataset. This step is crucial for providing a rich foundation for model training [Data: Sources (3)].

- **Data Preprocessing and Formatting**: Clean the data by removing noise, handling missing values, and formatting it to meet task-specific requirements. Proper preprocessing is essential for enhancing model performance, as poorly processed data can lead to significant degradation in results [Data: Sources (4)].

- **Handling Data Imbalance**: Implement techniques to address imbalanced datasets, ensuring that all classes are adequately represented. This is vital for the model to generalize well across different scenarios and avoid bias towards more frequent classes [Data: Sources (4)].

- **Data Annotation**: Ensure precise and consistent labeling of data, which is critical for training algorithms to make accurate

## 8. DRIFT Search

<img src="./media/drift_search.png" width=1000>

[Dynamic Reasoning and Inference with Flexible Traversal](https://www.microsoft.com/en-us/research/blog/introducing-drift-search-combining-global-and-local-search-methods-to-improve-quality-and-efficiency/), or DRIFT, is a GraphRAG technique introduced by Microsoft as an approach to local search queries that includes community information in the search process.

The user's query is initially processed through [Hypothetical Document Embeddings (HyDE)](https://arxiv.org/pdf/2212.10496), which creates a hypothetical document similar to those already in the graph, but based on the user's query. This document is embedded and used for semantic retrieval of the top-k relevant community reports. From these matches, we generate an initial answer along with several follow-up questions, as a lightweight version of global search. Microsoft refers to this step as the primer.

Once the primer phase is complete, we execute a local search for each generated follow-up question. Each local search produces both an intermediate answer and new follow-up questions, creating a refinement loop. This loop runs for two iterations (the authors note that future research is planned on reward functions for smarter termination). What makes these local searches unique is that they are informed by both community-level knowledge and detailed entity/relationship data. This allows the DRIFT process to find relevant information even when the initial query diverges from the indexing persona, and to adapt its approach based on information that emerges during the search.

The final output is structured as a hierarchy of questions and answers, ranked by their relevance to the original query. Map reduce is used again with an equal weighting on all intermediate answers, then passed to the language model for a final response. DRIFT cleverly combines global and local search with guided exploration to provide both broad context and specific details in responses.
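The control flow described above can be sketched with stubbed-out model calls. This is not the library's implementation: `primer` and `local_search` are hypothetical stand-ins that return canned strings, and the only thing the sketch demonstrates is the primer step followed by a fixed-depth refinement loop.

```python
def primer(query):
    """Stand-in for the HyDE + community-report step: initial answer and follow-ups."""
    answer = f"primer answer to '{query}'"
    followups = [f"{query} / follow-up {i}" for i in (1, 2)]
    return answer, followups


def local_search(question):
    """Stand-in for one local search: intermediate answer and a new follow-up."""
    return f"local answer to '{question}'", [f"{question} / deeper"]


def drift_sketch(query, depth=2, k_followups=2):
    """Run the primer, then refine through `depth` rounds of local searches."""
    answer, followups = primer(query)
    results = [{"depth": 0, "question": query, "answer": answer}]
    frontier = followups[:k_followups]
    for level in range(1, depth + 1):
        next_frontier = []
        for question in frontier:
            answer, more = local_search(question)
            results.append({"depth": level, "question": question, "answer": answer})
            next_frontier.extend(more)
        frontier = next_frontier[:k_followups]  # cap how many questions survive
    return results


drift_results = drift_sketch("data prep for fine-tuning")
for item in drift_results:
    print("  " * item["depth"] + item["question"])
```

The collected question and answer pairs form the hierarchy that a final reduce step would rank and summarize. Capping the frontier at each level keeps the number of model calls bounded.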

In [67]:
# =========================
# DRIFT: end-to-end example
# =========================
# Prereqs: you've already run indexing and have OUTPUT_DIR populated with *.parquet and lancedb/.
# Env: GRAPHRAG_API_KEY set; your settings.yml has valid chat + embedding models.

from pathlib import Path
import os
import pandas as pd
import tiktoken

# ---- paths ----
OUTPUT_DIR = Path("./graphrag_demo/output")
COMMUNITY_LEVEL = 2

# ---- load index artifacts ----
entity_df       = pd.read_parquet(OUTPUT_DIR / "entities.parquet")
community_df    = pd.read_parquet(OUTPUT_DIR / "communities.parquet")
rel_df          = pd.read_parquet(OUTPUT_DIR / "relationships.parquet")
reports_df      = pd.read_parquet(OUTPUT_DIR / "community_reports.parquet")
text_units_df   = pd.read_parquet(OUTPUT_DIR / "text_units.parquet")

# ---- wrap into GraphRAG-native types ----
from graphrag.query.indexer_adapters import (
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_report_embeddings,
    read_indexer_text_units,
)

entities      = read_indexer_entities(entity_df, community_df, COMMUNITY_LEVEL)
relationships = read_indexer_relationships(rel_df)
reports       = read_indexer_reports(
    reports_df, community_df, COMMUNITY_LEVEL, content_embedding_col="full_content_embeddings"
)
text_units    = read_indexer_text_units(text_units_df)

# ---- connect LanceDB vector stores produced during indexing ----
from graphrag.vector_stores.lancedb import LanceDBVectorStore

lancedb_uri = str(OUTPUT_DIR / "lancedb")
# entity description embeddings (ids)
desc_store = LanceDBVectorStore(collection_name="default-entity-description")
desc_store.connect(db_uri=lancedb_uri)

# community full content embeddings (needed by DRIFT)
full_store = LanceDBVectorStore(collection_name="default-community-full_content")
full_store.connect(db_uri=lancedb_uri)

# attach report embeddings from LanceDB to the report objects
read_indexer_report_embeddings(reports, full_store)

# ---- build chat & embedding models via ModelManager ----
from graphrag.config.enums import ModelType
from graphrag.config.models.language_model_config import LanguageModelConfig
from graphrag.language_model.manager import ModelManager

api_key        = os.environ["GRAPHRAG_API_KEY"]
llm_model_name = os.environ.get("GRAPHRAG_LLM_MODEL", "gpt-4o-mini")
emb_model_name = os.environ.get("GRAPHRAG_EMBEDDING_MODEL", "text-embedding-3-small")

chat_cfg = LanguageModelConfig(api_key=api_key, type=ModelType.OpenAIChat, model=llm_model_name, max_retries=20)
emb_cfg  = LanguageModelConfig(api_key=api_key, type=ModelType.OpenAIEmbedding, model=emb_model_name, max_retries=20)

mm = ModelManager()
chat_model   = mm.get_or_create_chat_model(name="drift_chat",  model_type=ModelType.OpenAIChat,      config=chat_cfg)
text_embedder= mm.get_or_create_embedding_model(name="drift_embed", model_type=ModelType.OpenAIEmbedding, config=emb_cfg)

# tokenizer (DRIFT needs this)
token_encoder = tiktoken.encoding_for_model(llm_model_name)

# ---- DRIFT config + engine ----
from graphrag.config.models.drift_search_config import DRIFTSearchConfig
from graphrag.query.structured_search.drift_search.drift_context import DRIFTSearchContextBuilder
from graphrag.query.structured_search.drift_search.search import DRIFTSearch

drift_cfg = DRIFTSearchConfig(
    temperature=0,
    max_tokens=5000,
    primer_folds=1,
    drift_k_followups=1,    # fewer follow-up queries
    n_depth=1,              # shallower refinement
    n=1,
    # Local search budget tuning:
    local_search_text_unit_prop=0.3,       # down from default ~0.5
    local_search_community_prop=0.05,      # down from default ~0.1
    local_search_top_k_mapped_entities=5,  # fewer mapped entities
    local_search_top_k_relationships=5,    # fewer relationships
    local_search_max_data_tokens=3000,     # tightened data token window
)


context_builder = DRIFTSearchContextBuilder(
    model=chat_model,
    text_embedder=text_embedder,
    entities=entities,
    relationships=relationships,
    reports=reports,
    entity_text_embeddings=desc_store,
    text_units=text_units,
    token_encoder=token_encoder,
    config=drift_cfg,
)

drift_engine = DRIFTSearch(model=chat_model, context_builder=context_builder, token_encoder=token_encoder)

# ---- run a DRIFT query ----
drift_q = "From this corpus, what data-prep practices most reduce overfitting when fine-tuning, and where do they commonly fail?"
drift_result = await drift_engine.search(drift_q)

print(drift_result.response)

# ---- optional: map sources back to filenames (for your superscripts) ----
docs = pd.read_parquet(OUTPUT_DIR/"documents.parquet")[["id","title"]].rename(columns={"id":"document_id"})
tus  = pd.read_parquet(OUTPUT_DIR/"text_units.parquet")[["id","document_ids"]].rename(columns={"id":"text_unit_id"})
coms = pd.read_parquet(OUTPUT_DIR/"communities.parquet")[["community","level","text_unit_ids"]]

def _as_list(x):
    if x is None: return []
    if isinstance(x, list): return x
    s = str(x)
    return [p.strip() for p in s.split(",")] if "," in s else [s]

def drift_sources_to_docs(drift_result):
    used_tu_ids = set()
    used_comm_ids = set()

    # 1) If a 'sources' DataFrame exists (some builds): use it
    try:
        src = drift_result.context_data.get("sources", None)
        if src is not None and len(src) > 0:
            col = "text_unit_id" if "text_unit_id" in src.columns else ("id" if "id" in src.columns else None)
            if col:
                for v in src[col].tolist():
                    used_tu_ids.update(_as_list(v))
    except Exception:
        pass

    # 2) Walk DRIFT query tree: collect either text_unit_ids OR community ids
    try:
        qgraph = drift_result.query_state.graph
        for node in qgraph.nodes.values():
            meta = getattr(node, "metadata", {}) or {}

            # common keys for text units in DRIFT/Local
            for key in ("text_unit_ids", "tu_ids", "evidence_text_unit_ids"):
                if key in meta and meta[key]:
                    for v in _as_list(meta[key]): used_tu_ids.add(v)

            # common keys for communities
            for key in ("community", "community_id", "communities", "community_ids"):
                if key in meta and meta[key]:
                    for v in _as_list(meta[key]): used_comm_ids.add(str(v))
    except Exception:
        pass

    # 3) If we only got communities, expand them to text units
    if used_comm_ids and not used_tu_ids:
        matched = coms[coms["community"].astype(str).isin(used_comm_ids)]
        for lst in matched["text_unit_ids"].dropna().tolist():
            for v in _as_list(lst): used_tu_ids.add(v)

    # 4) Nothing found? Single-doc fallback so you still get a superscript
    if not used_tu_ids:
        return pd.DataFrame(), ({1: docs["title"].iloc[0]} if len(docs) == 1 else {})

    # 5) Map text units -> documents -> filenames
    joined = pd.DataFrame({"text_unit_id": list(used_tu_ids)}).merge(tus, on="text_unit_id", how="left")
    joined = joined.explode("document_ids").rename(columns={"document_ids":"document_id"})
    joined = joined.merge(docs, on="document_id", how="left").dropna(subset=["title"]).drop_duplicates("title")

    index_to_doc = {i+1: t for i, t in enumerate(joined["title"].tolist())}
    return joined, index_to_doc

sources_df, index_to_doc = drift_sources_to_docs(drift_result)
print(index_to_doc)  # e.g. {1: 'ft_guide.txt'} for this single-document demo



## Data Preparation Practices to Reduce Overfitting

When fine-tuning Large Language Models (LLMs), several data preparation practices are effective in minimizing overfitting:

1. **Data Augmentation**: This technique increases the training dataset size by creating modified versions of existing data. Variations in text structure or synonyms expose the model to a broader range of inputs, enhancing generalization and reducing overfitting.

2. **Cross-Validation**: Implementing cross-validation allows for a robust evaluation of model performance. By training on different combinations of dataset subsets, practitioners can better identify overfitting and adjust training strategies accordingly.

3. **Regularization Techniques**: Methods like dropout or L2 regularization penalize overly complex models, encouraging simpler models that generalize better to new data.

4. **Balanced Datasets**: Ensuring the training dataset is balanced and representative of the target distribution is crucial. Imb

In [63]:
from pathlib import Path
import pandas as pd

OUTPUT_DIR = Path("./graphrag_demo/output")

docs = pd.read_parquet(OUTPUT_DIR/"documents.parquet")
tus  = pd.read_parquet(OUTPUT_DIR/"text_units.parquet")

print("docs:", len(docs), docs.columns.tolist())
print("tus :", len(tus),  tus.columns.tolist())

# do we actually have doc titles (filenames)?
print(docs[["id","title"]].head(3))             # title should be the filename by default
# do text units point to documents?
print(tus[["id","document_ids"]].head(3))       # document_ids should be a non-empty list

docs: 1 ['id', 'human_readable_id', 'title', 'text', 'text_unit_ids', 'creation_date', 'metadata']
tus : 54 ['id', 'human_readable_id', 'text', 'n_tokens', 'document_ids', 'entity_ids', 'relationship_ids', 'covariate_ids']
                                                  id         title
0  e23f983945048f59ccafe606533e3b33aa8ed0deeaa596...  ft_guide.txt
                                                  id  \
0  c11226660d4d5837d8b81fadb7074701e2cf8882560b83...   
1  68f3efe5403089c45e0fc65f07d1a4435bc5e5e4d8f804...   
2  a5f47714d5cc0e744ef8ff3aa220725fe98a315270efc7...   

                                        document_ids  
0  [e23f983945048f59ccafe606533e3b33aa8ed0deeaa59...  
1  [e23f983945048f59ccafe606533e3b33aa8ed0deeaa59...  
2  [e23f983945048f59ccafe606533e3b33aa8ed0deeaa59...  
