# Game of Thrones Retrieval-Augmented Generation (RAG)

This notebook demonstrates:
1. Downloading and extracting a dataset.
2. Loading documents from the dataset.
3. Splitting those documents into chunks.
4. Embedding the chunks and storing them in Couchbase as a vector store.
5. Performing retrieval-augmented generation (RAG) using the `dspy` library to answer a test query.


## Install Required Packages
Make sure you have the following libraries installed. If anything is missing, install via `pip install <package>`.

**Required Packages**:
- langchain
- langchain_community (for DirectoryLoader)
- langchain_openai (for OpenAIEmbeddings)
- langchain_couchbase (for CouchbaseVectorStore)
- couchbase (for Couchbase connections)
- requests
- dspy
- openai

*(Versions may matter depending on your environment. This code assumes you have compatible versions of these packages.)*

In [1]:
!pip install langchain langchain_community langchain_openai langchain_couchbase requests dspy openai


## 1. Imports and Logging Configuration


In [None]:
import logging
import re
import zipfile
from io import BytesIO
from pathlib import Path

import requests

# Couchbase SDK imports
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions, KnownConfigProfiles

# LangChain imports
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_core.documents import Document  # Needed for creating LangChain documents
from langchain_couchbase import CouchbaseVectorStore
from langchain_openai import OpenAIEmbeddings

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

## 2. Utility Functions
- **fetch_archive_from_http()**: Fetches and extracts a ZIP archive.
- **clean_text()**: Cleans up unwanted whitespace, non-ASCII, and special characters.
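
The extraction step in `fetch_archive_from_http()` opens the downloaded bytes as an in-memory ZIP archive. The following is a minimal offline sketch of that pattern, assuming a tiny archive built locally in place of the HTTP response; the file name `docs/example.txt` and its contents are hypothetical.

```python
import tempfile
import zipfile
from io import BytesIO
from pathlib import Path

# Build a small ZIP archive in memory; this stands in for `response.content`.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docs/example.txt", "Winter is coming.")  # hypothetical file
payload = buffer.getvalue()

# Wrap the raw bytes in BytesIO and extract them to a directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    with zipfile.ZipFile(BytesIO(payload)) as zip_ref:
        zip_ref.extractall(tmp_dir)
    extracted = sorted(
        p.relative_to(tmp_dir).as_posix() for p in Path(tmp_dir).rglob("*.txt")
    )
    extracted_text = (Path(tmp_dir) / "docs" / "example.txt").read_text()

print(extracted, extracted_text)
```

Because the whole archive is held in memory before extraction, this approach suits small datasets; a very large archive would probably be better written to disk first.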

In [None]:
def fetch_archive_from_http(url: str, output_dir: str):
    """
    Utility function to fetch a zip archive from a URL and extract it to a local directory.
    """
    output_path = Path(output_dir)
    if output_path.is_dir():
        logger.warning(f"'{output_dir}' directory already exists. Skipping data download.")
        return

    # response.content reads the whole body into memory, so streaming is not needed here
    with requests.get(url, timeout=10) as response:
        response.raise_for_status()
        with zipfile.ZipFile(BytesIO(response.content)) as zip_ref:
            zip_ref.extractall(output_dir)
    
    logger.info(f"Data extracted to: {output_dir}")

def clean_text(text: str) -> str:
    """
    Cleans a document's text:
    - Replaces non-ASCII characters with a space
    - Removes special characters (except basic punctuation)
    - Collapses all whitespace runs (including newlines) into a single space
    - Strips leading/trailing whitespace
    """
    text = re.sub(r"[^\x00-\x7F]+", " ", text)  # Replace non-ASCII runs with a space
    text = re.sub(r"[^a-zA-Z0-9,.!?;:\-\s]", "", text)  # Remove special characters except punctuation
    text = re.sub(r"\s+", " ", text)  # Collapse whitespace last so gaps left above are normalized
    return text.strip()
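
To make the effect of the cleaning rules concrete, the sketch below applies the same three substitutions one at a time to a made-up sample string. It collapses whitespace last so that gaps left by removed characters are normalized as well; the sample text is an assumption chosen purely for illustration.

```python
import re

# Made-up sample: \u2019 is a curly apostrophe, \u2014 is a long dash (both non-ASCII).
sample = "  Jon Snow\u2019s sword\n\n(Longclaw) \u2014 Valyrian steel!  "

# Step 1: replace each run of non-ASCII characters with a space.
step1 = re.sub(r"[^\x00-\x7F]+", " ", sample)

# Step 2: drop everything except letters, digits, basic punctuation, and whitespace.
step2 = re.sub(r"[^a-zA-Z0-9,.!?;:\-\s]", "", step1)

# Step 3: collapse whitespace runs (including newlines) and trim the ends.
step3 = re.sub(r"\s+", " ", step2).strip()

print(step3)  # Jon Snow s sword Longclaw Valyrian steel!
```

Note that the apostrophe is lost (`Snow s`), which is a side effect of stripping non-ASCII characters; whether that matters depends on how sensitive the embedding model is to such changes.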


## 3. Main Workflow
Below is the core workflow:
1. Download and extract the dataset.
2. Load the documents from the `data/docs` directory.
3. Create an OpenAI Embeddings model.
4. Set up Couchbase connection and vector store.
5. Chunk the documents.
6. Store the chunked documents in the Couchbase Vector Store.
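
The chunking step (step 5) uses a chunk size and an overlap so that neighbouring chunks share some context. The sketch below illustrates only that size/overlap arithmetic with a plain sliding window. It is a simplification, not the algorithm of `RecursiveCharacterTextSplitter`, which additionally prefers to break at separators such as paragraphs and sentences. The helper name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size sliding window: each chunk starts chunk_size - chunk_overlap after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the previous chunk has already reached the end of the text.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one. With the notebook's settings (`chunk_size=600`, `chunk_overlap=90`), the same idea means roughly 90 characters are repeated between consecutive chunks.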

In [None]:
# --------------------------------------------------
# 1. Download and Extract the Dataset
# --------------------------------------------------
docs_dir = "data/docs"
fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt6.zip",
    output_dir=docs_dir,
)

# --------------------------------------------------
# 2. Load Documents
# --------------------------------------------------
loader = DirectoryLoader(docs_dir, recursive=True)
documents = loader.load()
logger.info(f"Loaded {len(documents)} documents from {docs_dir}")

# --------------------------------------------------
# 3. Initialize OpenAI Embeddings Model
# --------------------------------------------------

from getpass import getpass

api_key = getpass("Enter OpenAI API Key: ")  # getpass avoids echoing the key into the notebook output
embeddings_model = OpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key=api_key,
)

# --------------------------------------------------
# 4. Connect to Couchbase
# --------------------------------------------------
cluster_options = ClusterOptions(
    authenticator=PasswordAuthenticator('admin', 'Password'),
)
cluster_options.apply_profile(KnownConfigProfiles.WanDevelopment)

store = CouchbaseVectorStore(
    cluster=Cluster(
        'couchbase://localhost',
        cluster_options
    ),
    bucket_name="dspy_test",
    scope_name="got",
    collection_name="got_collection",
    embedding=embeddings_model,
    index_name="vector_search"
)

# --------------------------------------------------
# 5. Split Documents into Chunks
# --------------------------------------------------
splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,  # Max characters per chunk
    chunk_overlap=90  # Overlap for context retention
)

chunked_documents = []
for doc in documents:
    chunks = splitter.split_text(clean_text(doc.page_content))  # Split text into chunks
    for chunk in chunks:
        chunked_documents.append(Document(page_content=chunk, metadata=doc.metadata))  # Create LangChain Documents

logger.info(f"Generated {len(chunked_documents)} chunks for storage.")

# --------------------------------------------------
# 6. Store Documents in Couchbase Vector Store
# --------------------------------------------------
# batch_size = 50  # Adjust based on your model's limits
# for i in range(0, len(chunked_documents), batch_size):
#     batch = chunked_documents[i:i + batch_size]
#     store.add_documents(batch)
#     logger.info(f"Stored batch {i // batch_size + 1}")
# logger.info("Documents successfully added to Couchbase Vector Store.")

store.add_documents(chunked_documents)
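
The commented-out loop in the cell above hints at inserting documents in batches, which can help when a single request would be too large for the embedding API. A minimal sketch of that slicing logic follows, using plain strings in place of LangChain documents; the helper name `iter_batches` is hypothetical.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in data: 120 placeholder chunks instead of real documents.
placeholder_chunks = [f"chunk-{i}" for i in range(120)]
batch_sizes = [len(batch) for batch in iter_batches(placeholder_chunks, 50)]
print(batch_sizes)  # [50, 50, 20]
```

In the notebook, each yielded batch would be passed to `store.add_documents(batch)` in place of the single call with the full list.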

## 4. Retrieval and RAG with `dspy`

Below, we retrieve documents from Couchbase using `dspy`'s `CouchbaseRM` retriever, then run a simple [RAG](https://arxiv.org/abs/2005.11401) pipeline:

1. We query the index to retrieve relevant chunks (passages).
2. We feed these passages, plus our question, into a GPT model (or your LLM of choice) to generate a final answer.
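
The two steps above can be sketched without any external service. The toy example below ranks passages by cosine similarity to a query vector and assembles a prompt from the top results. The three-dimensional vectors and placeholder passages are made up for illustration; in the notebook, Couchbase performs the search against the `vector_search` index, and the similarity metric depends on how that index is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vector, indexed_passages, k):
    """Return the texts of the k passages most similar to the query vector."""
    ranked = sorted(
        indexed_passages,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up embeddings; real ones would come from the embedding model.
indexed_passages = [
    ("passage about Jorah Mormont", [0.9, 0.1, 0.0]),
    ("passage about Winterfell", [0.0, 0.2, 0.9]),
    ("passage about Daenerys Targaryen", [0.7, 0.6, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]

# Step 1: retrieve the most relevant passages.
context = retrieve_top_k(query_vector, indexed_passages, k=2)

# Step 2: combine the passages with the question into a prompt for the LLM.
prompt = (
    "Context:\n"
    + "\n".join(f"- {passage}" for passage in context)
    + "\n\nQuestion: <your question>\nAnswer:"
)
print(prompt)
```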

In [2]:
import dspy
from dspy.retrieve.couchbase_rm import CouchbaseRM
from couchbase.auth import PasswordAuthenticator
from couchbase.options import KnownConfigProfiles, ClusterOptions


# Setup Couchbase retriever
cluster_options = ClusterOptions(
    authenticator=PasswordAuthenticator('admin', 'Password'),
)
cluster_options.apply_profile(KnownConfigProfiles.WanDevelopment)

# The bucket, scope, and collection must match the ones populated in Section 3
got_retriever = CouchbaseRM(
    index_name="vector_search",
    cluster_connection_string="couchbase://localhost",
    cluster_options=cluster_options,
    bucket="dspy_test",
    scope="got",
    collection="got_collection",
    use_kv_get_text=True,
    embedding_model="text-embedding-3-large"
)

dspy.settings.configure(rm=got_retriever)
turbo = dspy.LM(model='openai/gpt-4o', api_key=api_key)
dspy.settings.configure(lm=turbo, trace=[], temperature=0.7)

search = dspy.retrieve.Retrieve()


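**Note:** In the recorded run, this cell failed with `ModuleNotFoundError: No module named 'dspy.retrieve.couchbase_rm'`. The cells below assume a `dspy` installation that provides `CouchbaseRM`.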

### 4.1 Define a Signature for the Generated Answer
We'll define a simple signature for the question, context, and the answer fields.

In [None]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="answer")


### 4.2 Define a Retrieval-Augmented Generation (RAG) Module
This pipeline fetches the top-*k* passages (`num_passages`, 5 by default) and then calls an LLM to produce the final answer.

In [None]:
class RAG(dspy.Module):
    """Retrieval-augmented generation (RAG) pipeline."""
    def __init__(self, num_passages=5):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)


### 4.3 Run a Query and Inspect the Result
We create an instance of our RAG pipeline, run a custom question, and then see the LLM’s final answer plus retrieved context passages.

In [None]:
rag = RAG()

# Example query
my_question = "Why did Jorah Mormont flee to Essos, and how did his actions shape his relationship with Daenerys Targaryen?"

pred = rag(question=my_question)

# Print results
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

turbo.inspect_history(n=1)
