# Movie Dataset RAG Pipeline with Couchbase

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:
- The TMDB movie dataset
- Couchbase as the vector store
- Haystack framework for the RAG pipeline
- Capella AI for embeddings and text generation

The system allows users to ask questions about movies and get AI-generated answers based on the movie descriptions.

# Setup and Requirements

First, let's install the required packages:

In [None]:
!pip install -r requirements.txt

# Imports

Import all necessary libraries:

In [1]:
import logging
import base64
import pandas as pd
from datasets import load_dataset
from haystack import Pipeline, GeneratedAnswer
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
from haystack.dataclasses import Document

from couchbase_haystack import (
    CouchbaseSearchDocumentStore,
    CouchbasePasswordAuthenticator,
    CouchbaseClusterOptions,
    CouchbaseSearchEmbeddingRetriever,
)
from couchbase.options import KnownConfigProfiles

# Configure logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)



# Prerequisites

## Create and Deploy Your Operational Cluster on Capella

To get started with Couchbase Capella, create an account and use it to deploy an operational cluster.

For step-by-step guidance, follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).


### Couchbase Capella Configuration

When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:

* Have a multi-node Capella cluster running the Data, Query, Index, and Search services.
* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) with Read and Write access to the bucket used in the application.
* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.

### Deploy Models

To create the RAG application, use an embedding model for Vector Search and an LLM for generating responses.

Capella Model Service lets you create both models in the same VPC as your database. It offers the Llama 3.1 Instruct model (8 billion parameters) as the LLM and the E5 Mistral model for embeddings.

Use the Capella AI Services interface to create these models. You can cache responses and set guardrails for LLM outputs.

For more details, see the [documentation](https://preview2.docs-test.couchbase.com/ai/get-started/about-ai-services.html#model). These models work with [Haystack OpenAI integration](https://haystack.deepset.ai/integrations/openai).

# Configure Couchbase Credentials

Enter your Couchbase and Capella AI credentials:

In [2]:
import getpass

# Get Couchbase credentials
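# The Couchbase SDK expects a connection string with a scheme (couchbase:// or couchbases:// for Capella)
couchbase_cluster_url = input("Couchbase Cluster URL (default: couchbase://localhost): ") or "couchbase://localhost"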
couchbase_username = input("Couchbase Username (default: admin): ") or "admin"
couchbase_password = getpass.getpass("Couchbase password (default: Password@12345): ") or "Password@12345"
couchbase_bucket = input("Couchbase Bucket: ") 
couchbase_scope = input("Couchbase Scope: ")
couchbase_collection = input("Couchbase Collection: ")
vector_search_index = input("Vector Search Index: ")

# Get Capella AI endpoint
capella_ai_endpoint = input("Capella AI Services Endpoint: ")
capella_ai_endpoint_password = base64.b64encode(f"{couchbase_username}:{couchbase_password}".encode("utf-8")).decode("utf-8")
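
The last line of the cell above reuses the database credentials as the API key for the model endpoint by base64-encoding `username:password`. The sketch below isolates that step so it can be checked on its own. The helper names and the placeholder credentials are hypothetical. Note that base64 is an encoding, not encryption, so the resulting token should be treated as a secret.

```python
import base64


def basic_auth_token(username: str, password: str) -> str:
    """Return base64("username:password"), the same value the cell above builds."""
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("utf-8")


def decode_basic_auth_token(token: str) -> tuple[str, str]:
    """Inverse of basic_auth_token; splits on the first ':' only."""
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


# Placeholder credentials for illustration only
token = basic_auth_token("demo_user", "demo_pass")
print(token)
print(decode_basic_auth_token(token))
```

Decoding the token before sending it is a quick way to confirm that the expected username ended up in the key.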

In [19]:
from couchbase.cluster import Cluster 
from couchbase.options import ClusterOptions
from couchbase.auth import PasswordAuthenticator
from couchbase.management.buckets import CreateBucketSettings
from couchbase.management.collections import CollectionSpec
from couchbase.management.search import SearchIndex
import json

# Connect to Couchbase cluster
cluster = Cluster(couchbase_cluster_url, ClusterOptions(
    PasswordAuthenticator(couchbase_username, couchbase_password)))

# Create bucket if it does not exist
bucket_manager = cluster.buckets()
try:
    bucket_manager.get_bucket(couchbase_bucket)
    print(f"Bucket '{couchbase_bucket}' already exists.")
except Exception as e:
    print(f"Bucket '{couchbase_bucket}' does not exist. Creating bucket...")
    bucket_settings = CreateBucketSettings(name=couchbase_bucket, ram_quota_mb=500)
    bucket_manager.create_bucket(bucket_settings)
    print(f"Bucket '{couchbase_bucket}' created successfully.")

# Create scope and collection if they do not exist
collection_manager = cluster.bucket(couchbase_bucket).collections()
scopes = collection_manager.get_all_scopes()
scope_exists = any(scope.name == couchbase_scope for scope in scopes)

if scope_exists:
    print(f"Scope '{couchbase_scope}' already exists.")
else:
    print(f"Scope '{couchbase_scope}' does not exist. Creating scope...")
    collection_manager.create_scope(couchbase_scope)
    print(f"Scope '{couchbase_scope}' created successfully.")

collections = [collection.name for scope in scopes if scope.name == couchbase_scope for collection in scope.collections]
collection_exists = couchbase_collection in collections

if collection_exists:
    print(f"Collection '{couchbase_collection}' already exists in scope '{couchbase_scope}'.")
else:
    print(f"Collection '{couchbase_collection}' does not exist in scope '{couchbase_scope}'. Creating collection...")
    collection_manager.create_collection(collection_name=couchbase_collection, scope_name=couchbase_scope)
    print(f"Collection '{couchbase_collection}' created successfully.")

# Create the search index from fts_index.json at scope level.
# The index name in the file should match the "Vector Search Index" value entered earlier.
with open('fts_index.json', 'r') as search_file:
    search_index_definition = SearchIndex.from_json(json.load(search_file))
search_index_name = search_index_definition.name

# Get scope-level search manager
scope_search_manager = cluster.bucket(couchbase_bucket).scope(couchbase_scope).search_indexes()

try:
    # Check if index exists at scope level
    scope_search_manager.get_index(search_index_name)
    print(f"Search index '{search_index_name}' already exists at scope level.")
except Exception:
    # Any lookup failure is treated as "index missing", mirroring the bucket check above
    print(f"Search index '{search_index_name}' does not exist at scope level. Creating search index from fts_index.json...")
    scope_search_manager.upsert_index(search_index_definition)
    print(f"Search index '{search_index_name}' created successfully at scope level.")

Bucket 'test_bucket' already exists.
Scope 'test_scope' already exists.
Collection 'test_collection' already exists in scope 'test_scope'.
Search index 'vector_search' does not exist at scope level. Creating search index from fts_index.json...
Search index 'vector_search' created successfully at scope level.
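
A vector index only works if its configured dimension matches the embedding model's output size, so it can help to inspect the index definition before creating it. The following is a minimal sketch of such a check. The nested layout of `sample_definition`, the field name `embedding`, and the 4096 dimension are assumptions made for illustration and are not taken from this notebook's `fts_index.json`.

```python
def find_vector_fields(node, found=None):
    """Recursively collect {field_name: dims} for every mapping of type 'vector'."""
    if found is None:
        found = {}
    if isinstance(node, dict):
        if node.get("type") == "vector" and "dims" in node:
            found[node.get("name", "<unnamed>")] = node["dims"]
        for value in node.values():
            find_vector_fields(value, found)
    elif isinstance(node, list):
        for item in node:
            find_vector_fields(item, found)
    return found


# Hypothetical, heavily trimmed index definition (assumed layout, illustration only)
sample_definition = {
    "name": "vector_search",
    "type": "fulltext-index",
    "params": {
        "mapping": {
            "types": {
                "test_scope.test_collection": {
                    "properties": {
                        "embedding": {
                            "fields": [
                                {"name": "embedding", "type": "vector", "dims": 4096}
                            ]
                        }
                    }
                }
            }
        }
    },
}

expected_dims = 4096  # assumption: output size of the embedding model in use

for field_name, dims in find_vector_fields(sample_definition).items():
    status = "ok" if dims == expected_dims else "MISMATCH"
    print(f"{field_name}: dims={dims} ({status})")
```

To run the same check against the real file, pass the result of `json.load` on `fts_index.json` to `find_vector_fields` instead of the sample dictionary.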


# Load and Process Movie Dataset

Load the TMDB movie dataset and prepare documents for indexing:

In [4]:
import json

# Load TMDB dataset
print("Loading TMDB dataset...")
dataset = load_dataset("AiresPucrs/tmdb-5000-movies")
movies_df = pd.DataFrame(dataset['train'])
print(f"Total movies found: {len(movies_df)}")

# Create documents from movie data
docs_data = []
for _, row in movies_df.iterrows():
    # Skip movies without an overview; there is nothing to embed for them
    if pd.isna(row['overview']):
        continue

    try:
        # 'genres' is a JSON-encoded string; parse it with json.loads rather than eval
        genre_names = [genre['name'] for genre in json.loads(row['genres'])]
        docs_data.append({
            'id': str(row["id"]),
            'content': f"Title: {row['title']}\nGenres: {', '.join(genre_names)}\nOverview: {row['overview']}",
            'metadata': {
                'title': row['title'],
                'genres': row['genres'],
                'original_language': row['original_language'],
                'popularity': float(row['popularity']),
                'release_date': row['release_date'],
                'vote_average': float(row['vote_average']),
                'vote_count': int(row['vote_count']),
                'budget': int(row['budget']),
                'revenue': int(row['revenue'])
            }
        })
    except Exception as e:
        logger.error(f"Error processing movie {row['title']}: {e}")

print(f"Created {len(docs_data)} documents with valid overviews")
documents = [Document(id=doc['id'], content=doc['content'], meta=doc['metadata']) 
            for doc in docs_data]

Loading TMDB dataset...
Total movies found: 4803
Created 4800 documents with valid overviews
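
Each row's `genres` value is a JSON-encoded string rather than a list, so it has to be parsed before the genre names can be joined into the document text. Below is a minimal standalone sketch of that step using `json.loads`, which parses the data without evaluating it as Python code. `build_content` is a hypothetical helper name, and the overview text is a placeholder.

```python
import json


def build_content(title: str, genres_json: str, overview: str) -> str:
    """Assemble the text that gets embedded for one movie."""
    genre_names = [genre["name"] for genre in json.loads(genres_json)]
    return f"Title: {title}\nGenres: {', '.join(genre_names)}\nOverview: {overview}"


sample = build_content(
    "Four Rooms",
    '[{"id": 80, "name": "Crime"}, {"id": 35, "name": "Comedy"}]',
    "(overview text)",  # placeholder, not the real overview
)
print(sample)
```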


# Initialize Document Store

Set up the Couchbase document store for storing movie data and embeddings:

In [5]:
# Initialize document store
document_store = CouchbaseSearchDocumentStore(
    cluster_connection_string=Secret.from_token(couchbase_cluster_url),
    authenticator=CouchbasePasswordAuthenticator(
        username=Secret.from_token(couchbase_username),
        password=Secret.from_token(couchbase_password)
    ),
    cluster_options=CouchbaseClusterOptions(
        profile=KnownConfigProfiles.WanDevelopment,
    ),
    bucket=couchbase_bucket,
    scope=couchbase_scope,
    collection=couchbase_collection,
    vector_search_index=vector_search_index,
)

print("Couchbase document store initialized successfully.")

Couchbase document store initialized successfully.


# Initialize Embedder for Document Embedding

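Configure the document embedder using Capella AI's endpoint and the E5 Mistral model. This component generates an embedding for each movie document so that it can be found through semantic search. A second embedder, `rag_embedder`, uses the same model to embed user questions at query time.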



In [6]:
embedder = OpenAIDocumentEmbedder(
    api_base_url=capella_ai_endpoint,
    api_key=Secret.from_token(capella_ai_endpoint_password),
    model="intfloat/e5-mistral-7b-instruct",
)

rag_embedder = OpenAITextEmbedder(
    api_base_url=capella_ai_endpoint,
    api_key=Secret.from_token(capella_ai_endpoint_password),
    model="intfloat/e5-mistral-7b-instruct",
)
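
Semantic search works by comparing the query embedding with the stored document embeddings and returning the closest matches. The toy sketch below illustrates the idea with hand-written three-dimensional vectors and cosine similarity. Real embeddings have thousands of dimensions, and the similarity metric actually used depends on how the search index is configured, so cosine similarity here is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest similarity to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for document embeddings
toy_docs = {
    "animal adventure": [0.9, 0.1, 0.0],
    "courtroom drama": [0.0, 0.2, 0.9],
    "jungle rescue": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

for doc_id, score in top_k(toy_query, toy_docs, k=2):
    print(f"{doc_id}: {score:.3f}")
```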


# Initialize LLM Generator
Configure the LLM generator using Capella AI's endpoint and Llama 3.1 model. This component will generate natural language responses based on the retrieved documents.


In [7]:
llm = OpenAIGenerator(
    api_base_url=capella_ai_endpoint,
    api_key=Secret.from_token(capella_ai_endpoint_password),
    model="meta-llama/Llama-3.1-8B-Instruct",
)

# Create Indexing Pipeline
Build the pipeline for processing and indexing movie documents:

In [9]:
# Create indexing pipeline
index_pipeline = Pipeline()
index_pipeline.add_component("cleaner", DocumentCleaner())
index_pipeline.add_component("embedder", embedder)
index_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Connect indexing components
index_pipeline.connect("cleaner.documents", "embedder.documents")
index_pipeline.connect("embedder.documents", "writer.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x323977380>
🚅 Components
  - cleaner: DocumentCleaner
  - embedder: OpenAIDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> embedder.documents (List[Document])
  - embedder.documents -> writer.documents (List[Document])

# Run Indexing Pipeline

Execute the pipeline for processing and indexing movie documents:

In [10]:
# Run indexing pipeline

if documents:
    result = index_pipeline.run({"cleaner": {"documents": documents}})
    print(f"Successfully processed {len(documents)} movie overviews")
    print(f"Sample document metadata: {documents[0].meta}")
else:
    print("No documents created. Skipping indexing.")

Calculating embeddings: 150it [02:27,  1.02it/s]


Successfully processed 4800 movie overviews
Sample document metadata: {'title': 'Four Rooms', 'genres': '[{"id": 80, "name": "Crime"}, {"id": 35, "name": "Comedy"}]', 'original_language': 'en', 'popularity': 22.87623, 'release_date': '1995-12-09', 'vote_average': 6.5, 'vote_count': 530, 'budget': 4000000, 'revenue': 4300000}
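
The progress bar above reports 150 iterations for 4,800 documents, which is consistent with the embedder sending documents to the endpoint in batches of 32. That batch size is inferred from the numbers in the output and is not stated anywhere in the notebook. The sketch below reproduces the arithmetic with a plain batching helper.

```python
def batch(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


num_documents = 4800        # from the indexing output above
assumed_batch_size = 32     # assumption: inferred from the 150 progress-bar iterations

batches = list(batch(list(range(num_documents)), assumed_batch_size))
elapsed_seconds = 2 * 60 + 27   # "02:27" in the progress bar

print(f"Batches: {len(batches)}")
print(f"Seconds per batch: {elapsed_seconds / len(batches):.2f}")
```

If indexing time matters, the number of round trips to the embedding endpoint is the quantity to watch, since each batch is one request.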


# Create RAG Pipeline

Set up the Retrieval-Augmented Generation pipeline for answering questions about movies:

In [11]:
# Define RAG prompt template
prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

\nQuestion: {{question}}
\nAnswer:
"""

# Create RAG pipeline
rag_pipeline = Pipeline()

# Add components
rag_pipeline.add_component(
    "query_embedder",
    rag_embedder,
)
rag_pipeline.add_component("retriever", CouchbaseSearchEmbeddingRetriever(document_store=document_store))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipeline.add_component("llm", llm)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

# Connect RAG components
rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

print("RAG pipeline created successfully.")

RAG pipeline created successfully.
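
To see what the LLM actually receives, it helps to picture the prompt after the template has been filled with the retrieved documents and the question. The sketch below approximates that result in plain Python. It is not the Jinja rendering that `PromptBuilder` performs, so exact whitespace will differ, and `render_prompt` is a hypothetical helper.

```python
def render_prompt(doc_contents, question):
    """Approximate the filled-in RAG prompt: instruction, documents, question, answer cue."""
    lines = ["Given these documents, answer the question.", "Documents:"]
    lines.extend(doc_contents)
    lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Placeholder document text for illustration
example_prompt = render_prompt(
    ["Title: Movie A\nOverview: (overview text)", "Title: Movie B\nOverview: (overview text)"],
    "Which movie is about a rescue?",
)
print(example_prompt)
```

Because every retrieved document is pasted into the prompt, the retriever's `top_k` directly controls prompt length.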


# Ask Questions About Movies

Use the RAG pipeline to ask questions about movies and get AI-generated answers:

In [24]:
# Example question
question = "Who does Savva want to save from the vicious hyenas?"

# Run the RAG pipeline
result = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "retriever": {"top_k": 5},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
    include_outputs_from={"retriever", "query_embedder"}
)

# Get the generated answer
answer: GeneratedAnswer = result["answer_builder"]["answers"][0]

# Print retrieved documents
print("=== Retrieved Documents ===")
retrieved_docs = result["retriever"]["documents"]
for idx, doc in enumerate(retrieved_docs, start=1):
    print(f"Id: {doc.id} Title: {doc.meta['title']}")

# Print final results
print("\n=== Final Answer ===")
print(f"Question: {answer.query}")
print(f"Answer: {answer.data}")
print("\nSources:")
for doc in answer.documents:
    print(f"-> {doc.meta['title']}")

=== Retrieved Documents ===
Id: c9f6603aa1e67fbbf9916f6bab975a3c8c0a538db59b4c1342ee1d8442e55613 Title: Savva. Heart of the Warrior
Id: a8ba3287ee0e09161292b5921942d3dc27f8d572cf4fdbe610bbffb70f05a576 Title: Snow White and the Seven Dwarfs
Id: 712dd4a739161ff1376d8e87c63e71484cf9bb37c6ae0861259932b92fc467b0 Title: The Magic Flute
Id: 033d0c3193eae06b92ae3ed7cc71d37878fc44f16408a3d8a859da6c4ab26271 Title: Quest for Camelot
Id: e98ebb6dd3bc162ff59d10cec57cb6758338028e9874202d592447943cd07565 Title: Fly Me to the Moon

=== Final Answer ===
Question: Who does Savva want to save from the vicious hyenas?
Answer: Savva wants to save his Mom and fellow village people from the vicious hyenas.

Sources:
-> Savva. Heart of the Warrior
-> Snow White and the Seven Dwarfs
-> The Magic Flute
-> Quest for Camelot
-> Fly Me to the Moon


## Caching in Capella AI Services

To optimize performance and reduce costs, Capella AI Services employs two caching mechanisms:

1. Semantic Cache

Capella AI’s semantic caching system stores both query embeddings and their corresponding LLM responses. When new queries arrive, it uses vector similarity matching (with configurable thresholds) to identify semantically equivalent requests. This prevents redundant processing by:
- Avoiding duplicate embedding generation API calls for similar queries
- Skipping repeated LLM processing for equivalent queries
- Maintaining cached results with automatic freshness checks

2. Standard Cache

Stores the exact text of previous queries to provide precise and consistent responses for repetitive, identical prompts.

### Performance Optimization with Caching

These caching mechanisms help in:
- Minimizing redundant API calls to embedding and LLM services
- Leveraging Couchbase’s built-in caching capabilities
- Providing fast response times for frequently asked questions
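
The difference between the two cache types can be illustrated with a small conceptual sketch: a standard cache is a lookup on the exact prompt text, while a semantic cache returns a stored response when a new query's embedding is close enough to a cached one. This is only an illustration of the idea and not how Capella AI Services implements caching. The class names, the 0.95 threshold, and the toy vectors are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StandardCache:
    """Exact-match cache keyed on the prompt text."""

    def __init__(self):
        self._store = {}

    def get(self, prompt):
        return self._store.get(prompt)

    def put(self, prompt, response):
        self._store[prompt] = response


class SemanticCache:
    """Returns a cached response when a stored embedding is similar enough."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        best_response, best_score = None, 0.0
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_response, best_score = response, score
        return best_response if best_score >= self.threshold else None

    def put(self, embedding, response):
        self._entries.append((list(embedding), response))


exact = StandardCache()
exact.put("Who does Savva want to save?", "cached answer")
print(exact.get("Who does Savva want to save?"))   # hit: identical text
print(exact.get("Who is Savva trying to save?"))   # miss: text differs

semantic = SemanticCache(threshold=0.95)
semantic.put([1.0, 0.0], "cached answer")          # toy embedding of the first query
print(semantic.get([0.99, 0.05]))                  # hit: nearly the same direction
print(semantic.get([0.0, 1.0]))                    # miss: unrelated direction
```

The threshold trades hit rate against the risk of returning an answer to a question that only looks similar.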


In [143]:
import time
queries = [
    "What is the main premise of Life of Pi?",
    "Where does the story take place in Legends of the Fall?",
    #"What are the key themes in The Dark Knight?",
    "Who does Savva want to save from the vicious hyenas?",
]

for i, query in enumerate(queries, 1):
    try:
        print(f"\nQuery {i}: {query}")
        start_time = time.time()
        response = rag_pipeline.run({
            "query_embedder": {"text": query},
            "retriever": {"top_k": 4},
            "prompt_builder": {"question": query},
            "answer_builder": {"query": query},
        })
        elapsed_time = time.time() - start_time
        answer: GeneratedAnswer = response["answer_builder"]["answers"][0]
        print(f"Response: {answer.data}")
        print(f"Time taken: {elapsed_time:.2f} seconds")
    except Exception as e:
        print(f"Error generating RAG response: {str(e)}")
        continue


Query 1: What is the main premise of Life of Pi?
Response: The main premise of Life of Pi is that an Indian boy named Pi finds himself in the company of a hyena, zebra, orangutan, and a Bengal tiger after a shipwreck sets them adrift in the Pacific Ocean.
Time taken: 3.36 seconds

Query 2: Where does the story take place in Legends of the Fall?
Response: The story in "Legends of the Fall" takes place in the remote wilderness of 1900s USA.
Time taken: 0.86 seconds

Query 3: Who does Savva want to save from the vicious hyenas?
Response: Savva wants to save his Mom and the fellow village people from the vicious hyenas.
Time taken: 0.90 seconds


## LLM Guardrails in Capella AI Services

Capella AI Services also provides input and response moderation through configurable LLM guardrails, which can integrate with Meta's LlamaGuard3-8B model.
- Categories to be blocked can be configured during the model creation process.
- Guardrails help prevent unsafe or undesirable interactions with the LLM.

Together, caching and moderation make Capella AI Services an efficient, cost-effective, and responsible foundation for AI-powered applications such as this movie question-answering system.
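
Input and response moderation follows a simple pattern: classify the prompt, call the model only if the prompt is allowed, then classify the reply before returning it. The sketch below shows that flow with stand-in functions. It is a conceptual illustration and not the Capella AI Services implementation. The keyword classifier, the `weapons` category name, and the refusal message are all hypothetical.

```python
from typing import Callable, Set

REFUSAL = "This request was blocked by the configured guardrails."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    classify: Callable[[str], Set[str]],
    blocked_categories: Set[str],
) -> str:
    """Moderate the prompt, generate a reply, then moderate the reply."""
    if classify(prompt) & blocked_categories:
        return REFUSAL
    reply = generate(prompt)
    if classify(reply) & blocked_categories:
        return REFUSAL
    return reply


def toy_classify(text: str) -> Set[str]:
    """Stand-in for a moderation model: flags text with a single keyword."""
    return {"weapons"} if "bomb" in text.lower() else set()


def toy_generate(prompt: str) -> str:
    """Stand-in for the LLM call."""
    return f"(model reply to: {prompt})"


blocked = {"weapons"}
print(guarded_generate("How can I create a bomb?", toy_generate, toy_classify, blocked))
print(guarded_generate("Who directed Four Rooms?", toy_generate, toy_classify, blocked))
```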

In [145]:
query = "How can I create a bomb?"
try:
    start_time = time.time()
    response = rag_pipeline.run({
            "query_embedder": {"text": query},
            "retriever": {"top_k": 4},
            "prompt_builder": {"question": query},
            "answer_builder": {"query": query},
        })
    rag_elapsed_time = time.time() - start_time
    answer: GeneratedAnswer = response["answer_builder"]["answers"][0]
    print(f"RAG Response: {answer.data}")
    print(f"RAG response generated in {rag_elapsed_time:.2f} seconds")
except Exception as e:
    print("Guardrails violation", e)

RAG Response: I can't provide information on how to create a bomb. Is there anything else I can help you with?
RAG response generated in 0.89 seconds


# Conclusion

This notebook demonstrates building a Retrieval-Augmented Generation (RAG) pipeline for answering questions about movies using Haystack, Couchbase, and Capella AI Services. The key components include:
- Document Indexing with Embeddings
- Semantic Search using Couchbase Vector Search
- LLM-based Answer Generation