# GraphRAG Toolkit: Simple GraphRAG Workflow

This notebook demonstrates a minimal end-to-end GraphRAG workflow using the [AWS GraphRAG Toolkit](https://github.com/awslabs/graphrag-toolkit). It covers:

1. Installing the GraphRAG Toolkit and dependencies
2. Configuring connections to a graph store and vector store
3. Indexing a small corpus into a **hierarchical lexical graph**
4. Querying the graph using the `LexicalGraphQueryEngine`

The example uses Amazon Neptune as the graph store and Amazon OpenSearch Serverless as the vector store, the setup used in the official AWS blog and examples.

## 1. Prerequisites and Architecture

To run this notebook end-to-end you will need:

- An AWS account
- An **Amazon Neptune Database or Neptune Analytics** graph instance
- An **Amazon OpenSearch Serverless** collection for vector storage
- Access to **Amazon Bedrock** foundation models (for extraction and embeddings)
- Appropriate IAM permissions for Neptune, OpenSearch Serverless, and Bedrock

Set the following environment variables in your Jupyter environment (for example via a `.env` file or the notebook UI):

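- `GRAPH_STORE` – Neptune connection string, e.g. `neptune-db://my-graph.cluster-abcdefghijkl.us-east-1.neptune.amazonaws.com` for Neptune Database (Neptune Analytics graphs use a `neptune-graph://<graph-id>` connection string instead)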
- `VECTOR_STORE` – OpenSearch Serverless endpoint, e.g. `aoss://https://abcdefghijkl.us-east-1.aoss.amazonaws.com`
- `GRAPHRAG_EXTRACTION_LLM` (optional) – Bedrock model ID used for extraction, e.g. `anthropic.claude-3-haiku-20240307-v1:0`

If these variables are not set, the notebook will fall back to placeholder values that you must replace before running the indexing and querying cells.

## 2. Install GraphRAG Toolkit and Dependencies

In [None]:
# Install the latest GraphRAG Toolkit build from GitHub
# (This pulls a published ZIP artifact with all toolkit components.)
%pip install "https://github.com/awslabs/graphrag-toolkit/releases/latest/download/graphrag-toolkit.zip"

# Helpful utilities, plus the web reader package that provides SimpleWebPageReader
%pip install "nest_asyncio" "python-dotenv" "llama-index" "llama-index-readers-web"

## 3. Imports and Runtime Configuration

In [None]:
import os

import nest_asyncio
from dotenv import load_dotenv

from graphrag_toolkit import LexicalGraphIndex, LexicalGraphQueryEngine, GraphRAGConfig
from graphrag_toolkit.storage import GraphStoreFactory, VectorStoreFactory
from llama_index.readers.web import SimpleWebPageReader

# Allow nested event loops (needed in some notebook environments)
nest_asyncio.apply()

# Load environment variables from a local .env file if present
load_dotenv()

# Read configuration from environment, falling back to obvious placeholders
GRAPH_STORE = os.environ.get(
    "GRAPH_STORE",
    "neptune-db://<your-neptune-endpoint>",
)
VECTOR_STORE = os.environ.get(
    "VECTOR_STORE",
    "aoss://https://<your-opensearch-endpoint>",
)

# Configure the LLM used during the extraction phase
# (Override via GRAPHRAG_EXTRACTION_LLM if you want a different Bedrock model.)
GraphRAGConfig.extraction_llm = os.environ.get(
    "GRAPHRAG_EXTRACTION_LLM",
    "anthropic.claude-3-haiku-20240307-v1:0",
)

print("GRAPH_STORE:", GRAPH_STORE)
print("VECTOR_STORE:", VECTOR_STORE)
print("Extraction LLM:", GraphRAGConfig.extraction_llm)
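
Because the configuration cell silently falls back to placeholder values, it can help to check for them before starting the indexing run. The following is a minimal sketch of such a check. The helper `find_placeholder_settings` and the example values are hypothetical and not part of the GraphRAG Toolkit. It assumes placeholders are written in angle brackets, as in the defaults above.

```python
import re

# Hypothetical helper (not part of the GraphRAG Toolkit): flags connection
# strings that still contain an angle-bracket placeholder such as
# "<your-neptune-endpoint>".
PLACEHOLDER_PATTERN = re.compile(r"<[^<>]+>")


def find_placeholder_settings(settings):
    """Return the names of settings whose values still contain a placeholder."""
    return [
        name
        for name, value in settings.items()
        if PLACEHOLDER_PATTERN.search(value)
    ]


# Example values for illustration; in the notebook you would pass the real
# GRAPH_STORE and VECTOR_STORE variables instead.
example_settings = {
    "GRAPH_STORE": "neptune-db://<your-neptune-endpoint>",
    "VECTOR_STORE": "aoss://https://abcdefghijkl.us-east-1.aoss.amazonaws.com",
}

unresolved = find_placeholder_settings(example_settings)
if unresolved:
    print("Replace the placeholder values for:", ", ".join(unresolved))
```

In the notebook, calling `find_placeholder_settings({"GRAPH_STORE": GRAPH_STORE, "VECTOR_STORE": VECTOR_STORE})` and stopping when the result is non-empty avoids a failed connection attempt later on.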

## 4. Indexing: Build a Hierarchical Lexical Graph

In this step we:

1. Connect to the configured graph and vector stores
2. Load a small corpus (several Amazon Neptune documentation pages)
3. Use `LexicalGraphIndex.extract_and_build(...)` to:
   - Chunk the documents
   - Run LLM-based extraction to create statements, entities, topics, and facts
   - Persist the resulting **hierarchical lexical graph** and embeddings
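
To make the word "hierarchical" concrete, the sketch below models the nesting that the extraction step produces as plain Python dataclasses: a chunk contains topics, a topic contains statements, and a statement contains facts that mention entities. This is an illustrative assumption only. The class names, fields, and sample data are hypothetical, and the toolkit's actual graph model (node labels, relationships, properties) may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fact:
    # A subject-predicate-object triple; subject and obj stand in for entities.
    subject: str
    predicate: str
    obj: str


@dataclass
class Statement:
    text: str
    facts: List[Fact] = field(default_factory=list)


@dataclass
class Topic:
    name: str
    statements: List[Statement] = field(default_factory=list)


@dataclass
class Chunk:
    text: str
    topics: List[Topic] = field(default_factory=list)


def collect_entities(chunk):
    """Walk chunk -> topics -> statements -> facts and gather entity names."""
    entities = set()
    for topic in chunk.topics:
        for statement in topic.statements:
            for fact in statement.facts:
                entities.update((fact.subject, fact.obj))
    return sorted(entities)


# Illustrative data only; this is not output produced by the toolkit.
chunk = Chunk(
    text="Amazon Neptune is a managed graph database service.",
    topics=[
        Topic(
            name="Amazon Neptune",
            statements=[
                Statement(
                    text="Amazon Neptune is a managed graph database service.",
                    facts=[Fact("Amazon Neptune", "IS", "graph database service")],
                )
            ],
        )
    ],
)

print(collect_entities(chunk))
```

The point of the nesting is that a query can start from an entity or a statement and walk up to the topic and chunk it came from, which is what graph-based retrieval relies on.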


In [None]:
# Connect to the graph and vector stores
graph_store = GraphStoreFactory.for_graph_store(GRAPH_STORE)
vector_store = VectorStoreFactory.for_vector_store(VECTOR_STORE)

# Create the lexical graph index
graph_index = LexicalGraphIndex(
    graph_store,
    vector_store,
)

# Example corpus: a few public Neptune docs pages
doc_urls = [
    "https://docs.aws.amazon.com/neptune/latest/userguide/intro.html",
    "https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html",
    "https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-features.html",
    "https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-vs-neptune-database.html",
]

docs = SimpleWebPageReader(
    html_to_text=True,
    metadata_fn=lambda url: {"url": url},
).load_data(doc_urls)

print(f"Loaded {len(docs)} documents. Starting extract_and_build() ...")

# This single call runs both the Extract and Build phases and may take several minutes
# depending on corpus size and LLM configuration.
graph_index.extract_and_build(docs, show_progress=True)

print("\nIndexing complete.")

## 5. Querying: Graph-Enhanced Retrieval with `LexicalGraphQueryEngine`

Now that the lexical graph is populated, we can query it using the `LexicalGraphQueryEngine`.

The example below:

1. Reconnects to the same graph and vector stores
2. Creates a query engine configured for **traversal-based search**
3. Asks a multi-document question about Neptune Database vs Neptune Analytics
4. Prints the generated answer (and, if available, some supporting source URLs)


In [None]:
# Reconnect (in real deployments, indexing and querying might run in separate processes)
graph_store = GraphStoreFactory.for_graph_store(GRAPH_STORE)
vector_store = VectorStoreFactory.for_vector_store(VECTOR_STORE)

# Create a traversal-based query engine
query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
)

question = "What are the differences between Neptune Database and Neptune Analytics?"
response = query_engine.query(question)

print("Question:\n", question)
print("\nAnswer:\n", response.response)

# If the response object exposes supporting nodes, print a few for inspection
source_nodes = getattr(response, "source_nodes", None)
if source_nodes:
    print("\nSupporting sources (first few):")
    for node in source_nodes[:5]:
        url = None
        # Different response implementations attach metadata differently; be defensive.
        if hasattr(node, "metadata") and isinstance(node.metadata, dict):
            url = node.metadata.get("url")
        elif hasattr(node, "node") and hasattr(node.node, "metadata"):
            url = node.node.metadata.get("url")
        if url:
            print("-", url)
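
The defensive metadata lookup above can also be packaged as a small function that returns distinct URLs, which avoids printing the same page several times when multiple nodes come from one document. This is a sketch under assumptions. `collect_source_urls` is a hypothetical helper, not a toolkit API, and the stand-in objects below only mimic the two attribute layouts the cell above checks for. Unlike that cell, it returns up to `limit` distinct URLs rather than inspecting only the first few nodes.

```python
from types import SimpleNamespace


def collect_source_urls(source_nodes, limit=5):
    """Return up to `limit` distinct 'url' metadata values, in first-seen order."""
    urls = []
    for node in source_nodes or []:
        metadata = getattr(node, "metadata", None)
        if not isinstance(metadata, dict):
            # Fall back to a wrapped node, if one is present.
            metadata = getattr(getattr(node, "node", None), "metadata", None)
        if not isinstance(metadata, dict):
            continue
        url = metadata.get("url")
        if url and url not in urls:
            urls.append(url)
        if len(urls) >= limit:
            break
    return urls


# Stand-in objects for illustration; real responses carry richer node types.
fake_nodes = [
    SimpleNamespace(metadata={"url": "https://example.com/a"}),
    SimpleNamespace(node=SimpleNamespace(metadata={"url": "https://example.com/b"})),
    SimpleNamespace(metadata={"url": "https://example.com/a"}),
    SimpleNamespace(metadata={}),
]

print(collect_source_urls(fake_nodes))
```

With a real response, the call would be `collect_source_urls(getattr(response, "source_nodes", None))`.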

## 6. Next Steps

This notebook showed a minimal GraphRAG workflow:

1. **Indexing** content into a lexical graph with `LexicalGraphIndex`
2. **Querying** via `LexicalGraphQueryEngine` using traversal-based search

From here you can experiment with more advanced capabilities of the GraphRAG Toolkit:

- Separating **Extract** and **Build** phases (checkpoints, rebuilding from extracted data)
- Customizing extraction pipelines (e.g., simplifying sentences, entity/relationship extraction controls)
- Trying different retrievers such as the `SemanticGuidedRetriever` for hybrid semantic + graph search
- Swapping in your own corpus (S3 documents, JSON data, or internal documentation)

Refer to the GraphRAG Toolkit documentation and example notebooks in the repository for more advanced patterns.