# Tutorial on GraphRAG with Couchbase
This notebook walks through the process of setting up a search engine that combines Couchbase for storing embeddings, OpenAI's models for generating embeddings, and a local search engine for querying structured data. This is useful when you need to search through structured data using natural language queries, leveraging both machine learning and a database.

# Importing Necessary Libraries
In this section, we import all the essential Python libraries required to perform various tasks, such as loading data, interacting with Couchbase, and using OpenAI models for generating text and embeddings.

The libraries used include:

asyncio: For running asynchronous tasks.
logging: For managing logs that help in debugging and monitoring the workflow.
pandas: For data manipulation and reading from data files.
tiktoken: For tokenizing text, so that token counts can be checked before text is passed to a language model.
graphrag.query and graphrag.vector_stores: Modules from the graphrag package that handle entity extraction, local search, and vector storage.

In [1]:
import asyncio
import logging
import os
import traceback
from typing import Any, Callable, Dict, List, Union

import pandas as pd
import tiktoken

from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.couchbasedb import CouchbaseVectorStore

# Configuring Environment Variables
Here, we configure various environment variables that define paths, API keys, and connection strings. These values are essential for connecting to Couchbase and OpenAI, loading data, and defining other constants.

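INPUT_DIR: The directory that contains the Parquet data files. It has no default, so it must be set.
COUCHBASE_CONNECTION_STRING: The Couchbase connection string (defaults to couchbase://localhost). The username, password, bucket, scope, and vector index name are read from the other COUCHBASE_* variables.
OPENAI_API_KEY: Your OpenAI API key, required for interacting with their models. It has no default.
LLM_MODEL: Specifies which OpenAI chat model to use (defaults to gpt-4o).
EMBEDDING_MODEL: Defines the model used to generate embeddings (defaults to text-embedding-ada-002).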

In [2]:
INPUT_DIR = os.getenv("INPUT_DIR")
COUCHBASE_CONNECTION_STRING = os.getenv("COUCHBASE_CONNECTION_STRING", "couchbase://localhost")
COUCHBASE_USERNAME = os.getenv("COUCHBASE_USERNAME", "Administrator")
COUCHBASE_PASSWORD = os.getenv("COUCHBASE_PASSWORD", "password")
COUCHBASE_BUCKET_NAME = os.getenv("COUCHBASE_BUCKET_NAME", "graphrag-demo")
COUCHBASE_SCOPE_NAME = os.getenv("COUCHBASE_SCOPE_NAME", "shared")
COUCHBASE_VECTOR_INDEX_NAME = os.getenv("COUCHBASE_VECTOR_INDEX_NAME", "grapghrag_index")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4o")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002")
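Because INPUT_DIR and OPENAI_API_KEY have no defaults, a missing value only surfaces later as a confusing path such as `None/create_final_nodes.parquet` or an authentication error. A minimal sketch of an early check is shown here; the helper names `missing_settings` and `require_settings` are hypothetical and not part of graphrag or the original notebook.

```python
import os
from typing import List, Mapping, Sequence


def missing_settings(names: Sequence[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


def require_settings(names: Sequence[str], env: Mapping[str, str] = os.environ) -> None:
    """Raise a single, readable error that lists every missing variable."""
    missing = missing_settings(names, env)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

In the notebook, calling `require_settings(["INPUT_DIR", "OPENAI_API_KEY"])` right after the configuration cell would stop execution before any data is loaded.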

# Loading Data from Parquet Files
In this part, we load data from Parquet files into a dictionary. Each file corresponds to a particular table in the dataset, and we define functions that will handle the loading and processing of each table.

read_indexer_entities, read_indexer_relationships, and the other read_indexer_* functions come from graphrag.query.indexer_adapters and convert each raw table into the form the query engine works with, such as entities and relationships.
We use pandas to load the data from the files. If a file is not found, we log a warning, set that table to None, and continue.

In [3]:
# Set up logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
logger.info("Loading data from parquet files")
data = {}

# Constants
COMMUNITY_LEVEL = 2

# Table names
TABLE_NAMES = {
    "COMMUNITY_REPORT_TABLE": "create_final_community_reports",
    "ENTITY_TABLE": "create_final_nodes",
    "ENTITY_EMBEDDING_TABLE": "create_final_entities",
    "RELATIONSHIP_TABLE": "create_final_relationships",
    "COVARIATE_TABLE": "create_final_covariates",
    "TEXT_UNIT_TABLE": "create_final_text_units",
}

try:
    data["entities"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['ENTITY_TABLE']}.parquet")
    entity_embeddings = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['ENTITY_EMBEDDING_TABLE']}.parquet")
    data["entities"] = read_indexer_entities(data["entities"], entity_embeddings, COMMUNITY_LEVEL)
except FileNotFoundError:
    logger.warning("ENTITY_TABLE file not found. Setting entities to None.")
    data["entities"] = None

try:
    data["relationships"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['RELATIONSHIP_TABLE']}.parquet")
    data["relationships"] = read_indexer_relationships(data["relationships"])
except FileNotFoundError:
    logger.warning("RELATIONSHIP_TABLE file not found. Setting relationships to None.")
    data["relationships"] = None

try:
    data["covariates"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['COVARIATE_TABLE']}.parquet")
    data["covariates"] = read_indexer_covariates(data["covariates"])
except FileNotFoundError:
    logger.warning("COVARIATE_TABLE file not found. Setting covariates to None.")
    data["covariates"] = None

try:
    data["reports"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['COMMUNITY_REPORT_TABLE']}.parquet")
    entity_data = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['ENTITY_TABLE']}.parquet")
    data["reports"] = read_indexer_reports(data["reports"], entity_data, COMMUNITY_LEVEL)
except FileNotFoundError:
    logger.warning("COMMUNITY_REPORT_TABLE file not found. Setting reports to None.")
    data["reports"] = None

try:
    data["text_units"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['TEXT_UNIT_TABLE']}.parquet")
    data["text_units"] = read_indexer_text_units(data["text_units"])
except FileNotFoundError:
    logger.warning("TEXT_UNIT_TABLE file not found. Setting text_units to None.")
    data["text_units"] = None

logger.info("Data loading completed")

2024-09-06 15:34:48,630 - __main__ - INFO - Loading data from parquet files


2024-09-06 15:34:48,869 - __main__ - INFO - Data loading completed
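The five try/except blocks above all follow the same pattern: build a path, read it, and fall back to None when the file is missing. As a sketch of how that pattern could be factored out, the hypothetical helper below takes the reader as a parameter so it runs without pandas; in the notebook the reader would be `pd.read_parquet`.

```python
import logging
from pathlib import Path
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def load_optional_table(
    input_dir: str, table_name: str, reader: Callable[[Path], Any]
) -> Optional[Any]:
    """Read `<input_dir>/<table_name>.parquet`, or return None if it is missing."""
    path = Path(input_dir) / f"{table_name}.parquet"
    try:
        return reader(path)
    except FileNotFoundError:
        # Same behavior as the cell above: warn and continue with None.
        logger.warning("%s not found. Returning None.", path)
        return None
```

A caller still has to check for None before passing the result to a `read_indexer_*` function, since those expect a DataFrame.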


# Setting Up the Couchbase Vector Store
Couchbase is used here to store the semantic embeddings generated from entities. In this step, we define a method to connect to the Couchbase database using the provided credentials.

The CouchbaseVectorStore allows you to store, retrieve, and manage vector embeddings in Couchbase.
The connect() method initializes the connection to Couchbase using the provided connection string, username, and password.

In [4]:
logger.info("Setting up CouchbaseVectorStore")

try:
    description_embedding_store = CouchbaseVectorStore(
        collection_name="entity_description_embeddings",
        bucket_name=COUCHBASE_BUCKET_NAME,
        scope_name=COUCHBASE_SCOPE_NAME,
        index_name=COUCHBASE_VECTOR_INDEX_NAME,
    )
    description_embedding_store.connect(
        connection_string=COUCHBASE_CONNECTION_STRING,
        username=COUCHBASE_USERNAME,
        password=COUCHBASE_PASSWORD,
    )
    logger.info("CouchbaseVectorStore setup completed")
except Exception as e:
    logger.error(f"Error setting up CouchbaseVectorStore: {str(e)}")
    raise

2024-09-06 15:34:48,892 - __main__ - INFO - Setting up CouchbaseVectorStore
2024-09-06 15:34:48,898 - graphrag.vector_stores.couchbasedb - INFO - Connecting to Couchbase at couchbase://localhost
2024-09-06 15:34:48,966 - graphrag.vector_stores.couchbasedb - INFO - Successfully connected to Couchbase
2024-09-06 15:34:48,970 - __main__ - INFO - CouchbaseVectorStore setup completed
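To make the role of a vector store concrete, the sketch below shows the core idea of similarity search over embeddings: score every stored vector against a query vector and keep the best k. This is only a conceptual, in-memory illustration using cosine similarity. It is not how CouchbaseVectorStore is implemented; Couchbase performs the search on the server through its vector index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], store: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is linear in the number of stored vectors, which is why a real deployment delegates the search to an index.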


# Setting Up Language Models
In this section, we configure the language models using OpenAI’s API. We initialize:

ChatOpenAI: This is the language model used to generate responses to natural language queries.
OpenAIEmbedding: This is the model used to generate vector embeddings for text data.
tiktoken: This tokenizer (the cl100k_base encoding here) splits text into tokens, which lets the search engine measure how much text fits within the model's token limits.

In [5]:
logger.info("Setting up LLM and embedding models")

try:
    llm = ChatOpenAI(
        api_key=OPENAI_API_KEY,
        model=LLM_MODEL,
        api_type=OpenaiApiType.OpenAI,
        max_retries=20,
    )

    token_encoder = tiktoken.get_encoding("cl100k_base")

    text_embedder = OpenAIEmbedding(
        api_key=OPENAI_API_KEY,
        api_base=None,
        api_type=OpenaiApiType.OpenAI,
        model=EMBEDDING_MODEL,
        deployment_name=EMBEDDING_MODEL,
        max_retries=20,
    )

    logger.info("LLM and embedding models setup completed")
except Exception as e:
    logger.error(f"Error setting up models: {str(e)}")
    raise

2024-09-06 15:34:48,984 - __main__ - INFO - Setting up LLM and embedding models
2024-09-06 15:34:49,594 - __main__ - INFO - LLM and embedding models setup completed
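The token encoder matters because context has to fit within a token budget. The sketch below illustrates the idea with hypothetical helpers: it counts tokens with whatever `encode` function it is given and keeps chunks until the budget is exhausted. In the notebook, `encode` would be `token_encoder.encode`; here it is a parameter so the example runs without tiktoken.

```python
from typing import Callable, List, Sequence


def count_tokens(text: str, encode: Callable[[str], Sequence]) -> int:
    """Number of tokens the given encoder produces for the text."""
    return len(encode(text))


def fit_to_budget(
    chunks: List[str], max_tokens: int, encode: Callable[[str], Sequence]
) -> List[str]:
    """Keep chunks in order until adding the next one would exceed max_tokens."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk, encode)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

For a quick experiment, `str.split` can stand in as a crude whitespace "encoder", although real token counts from cl100k_base will differ.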


# Storing Embeddings in Couchbase
After generating embeddings for the entities, we store them in Couchbase. We use the store_entity_semantic_embeddings function to store the embeddings.

The cell first converts data["entities"] to a list, whether it was loaded as a dictionary or as a list.
It then uses the Couchbase vector store to save the embeddings. Each entity needs an 'id' attribute for storage, so an AttributeError is logged with a hint to check for it.


In [6]:
logger.info("Storing entity embeddings")

try:
    entities_list = list(data["entities"].values()) if isinstance(data["entities"], dict) else data["entities"]

    store_entity_semantic_embeddings(
        entities=entities_list, vectorstore=description_embedding_store
    )
    logger.info("Entity semantic embeddings stored successfully")
except AttributeError as e:
    logger.error(f"Error storing entity semantic embeddings: {str(e)}")
    logger.error("Ensure all entities have an 'id' attribute")
    raise
except Exception as e:
    logger.error(f"Error storing entity semantic embeddings: {str(e)}")
    raise

2024-09-06 15:34:49,609 - __main__ - INFO - Storing entity embeddings
2024-09-06 15:34:49,612 - graphrag.vector_stores.couchbasedb - INFO - Loading 96 documents into vector storage
2024-09-06 15:34:49,998 - graphrag.vector_stores.couchbasedb - INFO - Successfully loaded 96 out of 96 documents
2024-09-06 15:34:50,007 - __main__ - INFO - Entity semantic embeddings stored successfully
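The loading step can leave data["entities"] as None when a file is missing, and in that case the storing cell would fail with a less helpful error. The sketch below shows one way to normalize and validate the input before storing it. `as_entity_list` and `DemoEntity` are hypothetical names used only for illustration; the real entity objects come from graphrag.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DemoEntity:
    """Stand-in for a graphrag entity; only the fields needed for the demo."""
    id: str
    title: str


def as_entity_list(entities: Union[Dict[str, Any], List[Any], None]) -> List[Any]:
    """Return entities as a list, failing early if they are missing or lack an id."""
    if entities is None:
        raise ValueError("No entities were loaded; check that the entity tables exist")
    items = list(entities.values()) if isinstance(entities, dict) else list(entities)
    without_id = [item for item in items if getattr(item, "id", None) is None]
    if without_id:
        raise AttributeError(f"{len(without_id)} entities have no 'id' attribute")
    return items
```

The returned list could then be passed as the `entities` argument of `store_entity_semantic_embeddings`.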


# Building the Search Engine
Now that we have stored the embeddings and set up the models, we create the search engine. This step configures how queries will be processed, how much weight each entity or relationship will have, and how many entities or relationships will be retrieved in a query.

## LocalSearch
A local search engine that integrates with the language model and vector store to retrieve data.
## LocalSearchMixedContext
A context builder that combines different types of context (reports, entities, relationships) and prepares them for querying.


In [7]:
logger.info("Creating search engine")

context_builder = LocalSearchMixedContext(
    community_reports=data["reports"],
    text_units=data["text_units"],
    entities=data["entities"],
    relationships=data["relationships"],
    covariates=data["covariates"],
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,
    "max_tokens": 12_000,
}

llm_params = {
    "max_tokens": 2_000,
    "temperature": 0.0,
}

search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",
)

logger.info("Search engine created")

2024-09-06 15:34:50,037 - __main__ - INFO - Creating search engine
2024-09-06 15:34:50,042 - __main__ - INFO - Search engine created
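The proportions in local_context_params control how the 12,000-token context window is shared. The sketch below works through that arithmetic under an assumption: that the context builder gives `community_prop` of the budget to community reports, `text_unit_prop` to text units, and the remainder to entities and relationships. The function name is hypothetical, and the exact behavior should be checked against the installed graphrag version.

```python
from typing import Dict


def split_token_budget(
    max_tokens: int, text_unit_prop: float, community_prop: float
) -> Dict[str, int]:
    """Split a token budget by proportion (assumed scheme, see lead-in)."""
    if text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop + community_prop must not exceed 1")
    # Whatever is not reserved for reports or text units goes to local context.
    local_prop = 1 - community_prop - text_unit_prop
    return {
        "community_reports": max(int(max_tokens * community_prop), 0),
        "entities_and_relationships": max(int(max_tokens * local_prop), 0),
        "text_units": max(int(max_tokens * text_unit_prop), 0),
    }
```

With the values used in this notebook (`max_tokens=12_000`, `text_unit_prop=0.5`, `community_prop=0.1`), this gives 1,200 tokens for community reports, 4,800 for entities and relationships, and 6,000 for text units.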


# Running a Query
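Finally, we run a query on the search engine. In this case, the query is "Give me a summary about the story". The search engine retrieves the most relevant entities through the embeddings stored in Couchbase, builds a context from the related reports, relationships, and text units, and asks the language model to answer from that context.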

asearch: This is an asynchronous search function that takes a query and returns a response generated by the language model.

In [8]:
question = "Give me a summary about the story"
logger.info(f"Running query: '{question}'")

try:
    result = await search_engine.asearch(question)
    print(f"Question: '{question}'")
    print(f"Answer: {result.response}")
    logger.info("Query completed successfully")
except Exception as e:
    logger.error(f"An error occurred while processing the query: {str(e)}")
    print(f"An error occurred while processing the query: {str(e)}")

2024-09-06 15:34:50,067 - __main__ - INFO - Running query: 'Give me a summary about the story'
2024-09-06 15:34:50,075 - graphrag.vector_stores.couchbasedb - INFO - Performing similarity search by text with k=20
2024-09-06 15:34:50,606 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-06 15:34:50,617 - graphrag.vector_stores.couchbasedb - INFO - Performing similarity search by vector with k=20
2024-09-06 15:34:50,628 - graphrag.vector_stores.couchbasedb - INFO - Found 20 results in similarity search by vector
2024-09-06 15:34:50,757 - graphrag.query.structured_search.local_search.search - INFO - GENERATE ANSWER: 1725617090.0754764. QUERY: Give me a summary about the story
2024-09-06 15:34:51,781 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-06 15:35:02,055 - __main__ - INFO - Query completed successfully


Question: 'Give me a summary about the story'
Answer: ## Summary of the Story

### Introduction to the Mission

The narrative centers around a team from the Paranormal Military Squad, tasked with a mission at the Dulce military base. This mission, known as Operation: Dulce, involves investigating and responding to an alien signal that has been detected. The team is composed of key figures such as Alex, Dr. Jordan Hayes, Taylor Cruz, and Sam Rivera, each bringing their unique skills and perspectives to the mission [Data: Entities (21, 47, 50, 68, 27, 4); Relationships (117, 56, 31, 88, 68, 50, 27, 4)].

### The Setting and Initial Discoveries

The story unfolds in the eerie and technologically advanced environment of the Dulce base. The team navigates through various locations within the base, including the server room, where they uncover critical information, and the crash site, where Dr. Jordan Hayes studies alien technology [Data: Entities (50, 44); Relationships (58, 54, 32, 8, 73, 
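The query cell uses a top-level `await`, which works in a notebook but not in a plain Python script. The sketch below shows how the same call could be driven with `asyncio.run` outside a notebook. `StubSearchEngine` and `StubResult` are hypothetical stand-ins so the example runs without Couchbase or OpenAI; in practice `search_engine` from the notebook would be passed instead.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Mimics the one attribute the notebook reads from the search result."""
    response: str


class StubSearchEngine:
    """Stand-in with the same async method name as the notebook's search engine."""

    async def asearch(self, query: str) -> StubResult:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return StubResult(response=f"echo: {query}")


async def run_query(engine, question: str) -> str:
    """Await the engine's asearch coroutine and return the answer text."""
    result = await engine.asearch(question)
    return result.response


if __name__ == "__main__":
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    print(asyncio.run(run_query(StubSearchEngine(), "Give me a summary about the story")))
```

Inside a notebook, `asyncio.run` raises an error because an event loop is already running, so the `await` form used in the cell above is the right choice there.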

With these steps, the entire process of loading data, setting up models, storing embeddings, and running a search engine query is written out in sequence without using functions.