# Introduction

In this guide, we walk you through building a Retrieval Augmented Generation (RAG) application with Haystack orchestrating Capella Model Services and Couchbase Capella. The models hosted on Capella Model Services generate both the embeddings and the responses.

This notebook demonstrates how to build a RAG system using:
- The [BBC News dataset](https://huggingface.co/datasets/RealTimeData/bbc_news_alltime) containing news articles
- Couchbase Capella Hyperscale and Composite Vector Indexes for vector search
- Haystack framework for the RAG pipeline
- Capella Model Services for embeddings and text generation

We leverage Couchbase's Hyperscale and Composite Vector Indexes to enable efficient semantic search at scale. Hyperscale indexes prioritize high-throughput vector similarity across billions of vectors with a compact on-disk footprint, while Composite indexes blend scalar predicates with a vector column to narrow candidate sets before similarity search. For a deeper dive into how these indexes work, see the [overview of Capella vector indexes](https://docs.couchbase.com/cloud/vector-index/vectors-and-indexes-overview.html).
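
The idea behind a Composite index, narrowing the candidate set with a scalar predicate before ranking by similarity, can be illustrated with a few lines of plain Python. This is a conceptual sketch only, not how Couchbase implements it: the toy documents, the `section` predicate, the 2-dimensional vectors, and the `filtered_search` helper are all hypothetical, and a dot product stands in for the similarity metric.

```python
def dot(a, b):
    """Dot product of two equal-length vectors (stand-in similarity score)."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(docs, query_vector, section, top_k=2):
    """Apply the scalar predicate first, then rank only the surviving documents."""
    candidates = [d for d in docs if d["section"] == section]
    ranked = sorted(candidates, key=lambda d: dot(d["vector"], query_vector), reverse=True)
    return [d["title"] for d in ranked[:top_k]]


# Hypothetical toy documents with 2-dimensional "embeddings".
toy_docs = [
    {"title": "City lose derby", "section": "Sport", "vector": [0.9, 0.1]},
    {"title": "Budget announced", "section": "Politics", "vector": [0.8, 0.2]},
    {"title": "Transfer rumours", "section": "Sport", "vector": [0.1, 0.9]},
]

print(filtered_search(toy_docs, query_vector=[1.0, 0.0], section="Sport"))
```

The "Politics" document scores well against the query but is never ranked, because the predicate removes it before any similarity is computed.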

Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial shows how to combine Capella Model Services and Haystack with Couchbase's Hyperscale and Composite Vector Indexes to deliver a production-ready RAG workflow.

# Before you start

## Create and Deploy Your Operational Cluster on Capella

To get started with Couchbase Capella, create an account and use it to deploy an operational cluster.

For details, see the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).

### Couchbase Capella Configuration

When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:

* Have a multi-node Capella cluster running the Data, Query, Index, and Search services.
* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket (Read and Write) used in the application.
* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.

### Deploy Models

To create the RAG application, use an embedding model for Vector Search and an LLM for generating responses.

Capella Model Services lets you create both models in the same VPC as your database. It offers the Llama 3.1 Instruct model (8 billion parameters) as the LLM and a Mistral model for embeddings.

Use the Capella AI Services interface to create these models. You can cache responses and set guardrails for LLM outputs.

For more details, see the [documentation](https://preview2.docs-test.couchbase.com/ai/get-started/about-ai-services.html#model). These models work with the [Haystack OpenAI integration](https://haystack.deepset.ai/integrations/openai).

# Installing Necessary Libraries
To build our RAG system, we need a set of libraries. The libraries we install handle everything from connecting to databases to performing AI tasks. Each library has a specific role: Couchbase libraries manage database operations, Haystack handles AI model integrations and pipeline management, and we will use the OpenAI SDK (compatible with Capella Model Services) for generating embeddings and calling language models.


In [1]:
# Install required packages
%pip install -r requirements.txt


Note: you may need to restart the kernel to use updated packages.


# Importing Necessary Libraries
This cell imports the libraries used throughout the notebook: logging and timing utilities, the Couchbase SDK, Haystack components for the RAG pipeline and embedding generation, and the dataset loader.


In [2]:
import getpass
import logging
import sys
import time
import pandas as pd
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.exceptions import CouchbaseException
from couchbase.options import ClusterOptions, KnownConfigProfiles, QueryOptions

from datasets import load_dataset

from haystack import Pipeline, Document, GeneratedAnswer
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from haystack.components.builders import PromptBuilder
from couchbase_haystack import (
    CouchbaseQueryDocumentStore, 
    CouchbaseQueryEmbeddingRetriever,
    QueryVectorSearchType, 
    QueryVectorSearchSimilarity,
    CouchbasePasswordAuthenticator,
    CouchbaseClusterOptions
)




# Loading Sensitive Information
In this section, we prompt for the configuration the notebook needs: database credentials, bucket, scope, and collection names, and the Capella Model Services endpoint and API keys. Asking for these values at runtime keeps secrets out of the notebook source.

The cell also checks that every required value was provided and raises an error if any is missing.

**CAPELLA_MODEL_SERVICES_ENDPOINT** is the Capella AI Services endpoint found in the models section.
> Note: the Capella Model Services endpoint must end with `/v1`. If the endpoint shown in the UI does not include it, append `/v1` yourself.

**INDEX_NAME** is the name of the Hyperscale or Composite Vector Index we will create for vector search operations.
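
The `/v1` requirement for the endpoint can be handled in code rather than by hand. The following is a minimal sketch, assuming the only normalization needed is trimming whitespace and trailing slashes and appending `/v1` when it is absent. `ensure_v1_suffix` is a hypothetical helper and the URL is a placeholder.

```python
def ensure_v1_suffix(endpoint: str) -> str:
    """Return the endpoint with exactly one trailing '/v1' path segment."""
    trimmed = endpoint.strip().rstrip("/")
    if trimmed.endswith("/v1"):
        return trimmed
    return trimmed + "/v1"


# Placeholder endpoint used only for illustration.
print(ensure_v1_suffix("https://models.example.com"))
print(ensure_v1_suffix("https://models.example.com/v1/"))
```

Both calls return the same URL, so the helper is safe to apply whether or not the UI already shows the suffix.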

In [3]:
CB_CONNECTION_STRING = input("Couchbase Cluster URL (default: localhost): ") or "couchbase://localhost"
CB_USERNAME = input("Couchbase Username (default: admin): ") or "admin"
CB_PASSWORD = getpass.getpass("Couchbase password (default: Password@12345): ") or "Password@12345"
CB_BUCKET_NAME = input("Couchbase Bucket: ")
SCOPE_NAME = input("Couchbase Scope: ")
COLLECTION_NAME = input("Couchbase Collection: ")
INDEX_NAME = input("Vector Search Index: ")

# Get Capella AI endpoint
CAPELLA_MODEL_SERVICES_ENDPOINT = input("Enter your Capella Model Services Endpoint: ")
LLM_MODEL_NAME = input("Enter the LLM name: ")
LLM_API_KEY = getpass.getpass("Enter your Capella Model Services LLM API Key: ")
EMBEDDING_MODEL_NAME = input("Enter the Embedding Model name: ")
EMBEDDING_API_KEY = getpass.getpass("Enter your Capella Model Services Embedding Model API Key: ")
EMBEDDING_DIMENSION = input("Enter the Embedding Dimension (e.g. 3072, 4096): ") or "3072"

# Check if the variables are correctly loaded
if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME, INDEX_NAME, CAPELLA_MODEL_SERVICES_ENDPOINT, LLM_MODEL_NAME, LLM_API_KEY, EMBEDDING_MODEL_NAME, EMBEDDING_API_KEY]):
    raise ValueError("All configuration variables must be provided.")
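
The `all([...])` check above reports that something is missing but not which value. One possible refinement, sketched here with a hypothetical `missing_settings` helper and made-up example values, is to collect the names of the empty settings so the error message can list them.

```python
def missing_settings(settings: dict) -> list:
    """Return the names of settings whose values are empty or whitespace-only."""
    return [name for name, value in settings.items() if not str(value or "").strip()]


# Hypothetical values standing in for the answers to the prompts above.
example_settings = {
    "CB_BUCKET_NAME": "test_bucket",
    "SCOPE_NAME": "",
    "INDEX_NAME": "   ",
}

missing = missing_settings(example_settings)
if missing:
    print("Missing configuration values: " + ", ".join(missing))
```

In the notebook you would pass a dictionary built from the real variables and raise `ValueError` with the same message instead of printing it.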

# Setting Up Logging
Logging is essential for tracking the execution of our script and debugging any issues that may arise. We set up a logger that will display information about the script's progress, including timestamps and log levels.


In [4]:
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
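
Because the root logger is set to `INFO`, third-party libraries also print their own `INFO` messages, such as one `HTTP Request: POST ...` line per model call. If that output is too noisy, a sketch like the following raises the level for that one logger only. It assumes those lines are emitted by the `httpx` logger, which the OpenAI SDK uses for HTTP calls.

```python
import logging

# Assumption: the per-request "HTTP Request: POST ..." lines come from the
# "httpx" logger. Raising its level hides them without affecting other logs.
logging.getLogger("httpx").setLevel(logging.WARNING)
```

Your own `logging.info(...)` calls are unaffected, because they go through the root logger.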

# Connecting to Couchbase Capella
The next step is to establish a connection to our Couchbase Capella cluster. This connection will allow us to interact with the database, store and retrieve documents, and perform vector searches.


In [5]:
try:
    # Initialize the Couchbase Cluster
    auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)
    options = ClusterOptions(auth)
    options.apply_profile(KnownConfigProfiles.WanDevelopment)
    
    # Connect to the cluster
    cluster = Cluster(CB_CONNECTION_STRING, options)
    
    # Wait for the cluster to be ready
    cluster.wait_until_ready(timedelta(seconds=5))
    logging.info("Successfully connected to the Couchbase cluster")
except CouchbaseException as e:
    raise RuntimeError(f"Failed to connect to Couchbase: {str(e)}")

2025-12-10 10:56:30,321 - INFO - Successfully connected to the Couchbase cluster


# Setting Up the Bucket, Scope, and Collection
Before we can store our data, we need to ensure that the appropriate bucket, scope, and collection exist in our Couchbase cluster. The code below checks if these components exist and creates them if they don't, providing a foundation for storing our vector embeddings and documents.

In [6]:
from couchbase.management.buckets import CreateBucketSettings
import json

# Create bucket if it does not exist
bucket_manager = cluster.buckets()
try:
    bucket_manager.get_bucket(CB_BUCKET_NAME)
    print(f"Bucket '{CB_BUCKET_NAME}' already exists.")
except Exception as e:
    print(f"Bucket '{CB_BUCKET_NAME}' does not exist. Creating bucket...")
    bucket_settings = CreateBucketSettings(name=CB_BUCKET_NAME, ram_quota_mb=500)
    bucket_manager.create_bucket(bucket_settings)
    print(f"Bucket '{CB_BUCKET_NAME}' created successfully.")

# Create scope and collection if they do not exist
collection_manager = cluster.bucket(CB_BUCKET_NAME).collections()
scopes = collection_manager.get_all_scopes()
scope_exists = any(scope.name == SCOPE_NAME for scope in scopes)

if scope_exists:
    print(f"Scope '{SCOPE_NAME}' already exists.")
else:
    print(f"Scope '{SCOPE_NAME}' does not exist. Creating scope...")
    collection_manager.create_scope(SCOPE_NAME)
    print(f"Scope '{SCOPE_NAME}' created successfully.")

collections = [collection.name for scope in scopes if scope.name == SCOPE_NAME for collection in scope.collections]
collection_exists = COLLECTION_NAME in collections

if collection_exists:
    print(f"Collection '{COLLECTION_NAME}' already exists in scope '{SCOPE_NAME}'.")
else:
    print(f"Collection '{COLLECTION_NAME}' does not exist in scope '{SCOPE_NAME}'. Creating collection...")
    collection_manager.create_collection(collection_name=COLLECTION_NAME, scope_name=SCOPE_NAME)
    print(f"Collection '{COLLECTION_NAME}' created successfully.")


Bucket 'test_bucket' already exists.
Scope 'test_scope' already exists.
Collection 'test_collection1' does not exist in scope 'test_scope'. Creating collection...
Collection 'test_collection1' created successfully.


# Load the BBC News Dataset
To build a RAG engine, we need data to search through. We use the [BBC Realtime News dataset](https://huggingface.co/datasets/RealTimeData/bbc_news_alltime), a dataset with up-to-date BBC news articles grouped by month. Because these articles were published after the LLM was trained, they show how RAG supplies information the model does not already have.

The BBC News dataset's varied content allows us to simulate real-world scenarios where users ask complex questions, enabling us to fine-tune our RAG's ability to understand and respond to various types of queries.


In [7]:
try:
    news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split="train")
    print(f"Loaded the BBC News dataset with {len(news_dataset)} rows")
except Exception as e:
    raise ValueError(f"Error loading BBC News dataset: {str(e)}")

Loaded the BBC News dataset with 2687 rows


## Preview the Data

In [8]:
# Print the first two examples from the dataset
print("Dataset columns:", news_dataset.column_names)
print("\nFirst two examples:")
print(news_dataset[:2])

Dataset columns: ['title', 'published_date', 'authors', 'description', 'section', 'content', 'link', 'top_image']

First two examples:
{'title': ["Pakistan protest: Bushra Bibi's march for Imran Khan disappeared - BBC News", 'Lockdown DIY linked to Walleys Quarry gases - BBC News'], 'published_date': ['2024-12-01', '2024-12-01'], 'authors': ['https://www.facebook.com/bbcnews', 'https://www.facebook.com/bbcnews'], 'description': ["Imran Khan's third wife guided protesters to the heart of the capital - and then disappeared.", 'An academic says an increase in plasterboard sent to landfill could be behind a spike in smells.'], 'section': ['Asia', 'Stoke & Staffordshire'], 'content': ['Bushra Bibi led a protest to free Imran Khan - what happened next is a mystery\n\nImran Khan\'s wife, Bushra Bibi, encouraged protesters into the heart of Pakistan\'s capital, Islamabad\n\nA charred lorry, empty tear gas shells and posters of former Pakistan Prime Minister Imran Khan - it was all that remaine

## Preparing the Data for RAG

We need to extract the context passages from the dataset to use as our knowledge base for the RAG system.

In [9]:
import hashlib

news_articles = news_dataset
unique_articles = {}

for article in news_articles:
    content = article.get("content")
    if content:
        content_hash = hashlib.md5(content.encode()).hexdigest()  # Generate hash of content
        if content_hash not in unique_articles:
            unique_articles[content_hash] = article  # Store full article

unique_news_articles = list(unique_articles.values())  # Convert back to list

print(f"We have {len(unique_news_articles)} unique articles in our database.")


We have 1749 unique articles in our database.
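
To see what the deduplication keeps and what it drops, the same logic can be run on a handful of rows. This is a small sketch on hypothetical data. `dedupe_by_content` is an illustrative wrapper around the loop above, not part of any library.

```python
import hashlib


def dedupe_by_content(articles):
    """Keep the first article seen for each distinct, non-empty content string."""
    unique = {}
    for article in articles:
        content = article.get("content")
        if not content:
            continue  # Rows without content are skipped, as in the cell above
        digest = hashlib.md5(content.encode()).hexdigest()
        unique.setdefault(digest, article)  # First occurrence wins
    return list(unique.values())


# Hypothetical rows: two share the same content and one has no content.
rows = [
    {"title": "A", "content": "same text"},
    {"title": "B", "content": "same text"},
    {"title": "C", "content": "other text"},
    {"title": "D", "content": None},
]

print([row["title"] for row in dedupe_by_content(rows)])
```

Only exact duplicates are removed. Articles that differ by a single character hash differently and are both kept.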


# Creating Embeddings using Capella Model Services
Embeddings are numerical representations of text that capture semantic meaning. Unlike keyword-based search, embeddings enable semantic search to understand context and retrieve documents that are conceptually similar even without exact keyword matches. We'll use the model deployed on Capella Model Services to create high-quality embeddings. This model transforms our text data into vector representations that can be efficiently searched using Haystack's OpenAI document embedder (configured to point to Capella).


In [10]:
try:
    # Set up the document embedder for processing documents
    document_embedder = OpenAIDocumentEmbedder(
        api_base_url=CAPELLA_MODEL_SERVICES_ENDPOINT,
        api_key=Secret.from_token(EMBEDDING_API_KEY),
        model=EMBEDDING_MODEL_NAME
    )
    
    # Set up the text embedder for query processing
    rag_embedder = OpenAITextEmbedder(
        api_base_url=CAPELLA_MODEL_SERVICES_ENDPOINT,
        api_key=Secret.from_token(EMBEDDING_API_KEY),
        model=EMBEDDING_MODEL_NAME
    )
    
    print("Successfully created embedding models")
except Exception as e:
    raise ValueError(f"Error creating embedding models: {str(e)}")

Successfully created embedding models


# Testing the Embeddings Model
We can test the text embedder by generating an embedding for a sample string and checking its dimension.

In [11]:
test_result = rag_embedder.run(text="this is a test sentence")
test_embedding = test_result["embedding"]
print(f"Embedding dimension: {len(test_embedding)}")

2025-12-10 10:56:54,817 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/embeddings "HTTP/1.1 200 OK"
Embedding dimension: 1024
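
The output above reports 1024 dimensions, while the configuration cell defaults `EMBEDDING_DIMENSION` to the string `"3072"`. If you rely on that value later, for example when defining a vector index, a check along these lines can catch a mismatch early. `check_embedding_dimension` is a hypothetical helper, and the sample vector is a stand-in for `test_embedding`.

```python
def check_embedding_dimension(embedding, configured_dimension):
    """Raise ValueError if the vector length differs from the configured dimension."""
    expected = int(configured_dimension)  # input() returns a string
    actual = len(embedding)
    if actual != expected:
        raise ValueError(
            f"Model returned {actual} dimensions, but {expected} were configured."
        )
    return actual


# Stand-in vector with the length reported in the output above.
sample_embedding = [0.0] * 1024
print(check_embedding_dimension(sample_embedding, "1024"))
```

Calling the helper with the default `"3072"` raises `ValueError` for this vector, which is the situation the check is meant to surface.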


# Setting Up the Couchbase Vector Document Store
The `CouchbaseQueryDocumentStore` from the `couchbase_haystack` package provides seamless integration with Couchbase, supporting both Hyperscale and Composite Vector Indexes.

In [12]:
try:
    # Create the Couchbase vector document store
    document_store = CouchbaseQueryDocumentStore(
        cluster_connection_string=Secret.from_token(CB_CONNECTION_STRING),
        authenticator=CouchbasePasswordAuthenticator(
            username=Secret.from_token(CB_USERNAME),
            password=Secret.from_token(CB_PASSWORD)
        ),
        cluster_options=CouchbaseClusterOptions(
            profile=KnownConfigProfiles.WanDevelopment,
        ),
        bucket=CB_BUCKET_NAME,
        scope=SCOPE_NAME,
        collection=COLLECTION_NAME,
        search_type=QueryVectorSearchType.ANN,
        similarity=QueryVectorSearchSimilarity.COSINE
    )
    print("Successfully created Couchbase vector document store")
except Exception as e:
    raise ValueError(f"Failed to create Couchbase vector document store: {str(e)}")

Successfully created Couchbase vector document store


# Creating Haystack Documents
In this section, we'll process our news articles and create Haystack Document objects.
Each Document is created with specific metadata that will be used for retrieval and generation.
We'll observe examples of the document content to understand how the documents are structured.

In [13]:
haystack_documents = []
# Process and store documents
for article in unique_news_articles:  # Process all unique articles
    try:
        document = Document(
            content=article["content"],
            meta={
                "title": article["title"],
                "description": article["description"],
                "published_date": article["published_date"],
                "link": article["link"],
            }
        )
        haystack_documents.append(document)
    except Exception as e:
        print(f"Failed to create document: {str(e)}")
        continue

# Observing an example of the document content
print("Document content preview:")
print(f"Content: {haystack_documents[0].content[:200]}...")
print(f"Metadata: {haystack_documents[0].meta}")

print(f"Created {len(haystack_documents)} documents")

        

Document content preview:
Content: Bushra Bibi led a protest to free Imran Khan - what happened next is a mystery

Imran Khan's wife, Bushra Bibi, encouraged protesters into the heart of Pakistan's capital, Islamabad

A charred lorry, ...
Metadata: {'title': "Pakistan protest: Bushra Bibi's march for Imran Khan disappeared - BBC News", 'description': "Imran Khan's third wife guided protesters to the heart of the capital - and then disappeared.", 'published_date': '2024-12-01', 'link': 'http://www.bbc.co.uk/news/articles/cvg02lvj1e7o'}
Created 1749 documents


# Creating and Running the Indexing Pipeline

In this section, we'll create an indexing pipeline to process our documents. The pipeline will:

1. DocumentCleaner - Cleans and preprocesses the raw Haystack documents (removes extra whitespace, normalizes text)
2. document_embedder - Generates vector embeddings for each document using the embedding model deployed on Capella Model Services, converting text into numerical representations for semantic search
3. DocumentWriter - Writes the cleaned documents along with their embeddings to the Couchbase document store

This transforms raw news articles into searchable vector representations stored in Couchbase for later semantic retrieval in the RAG system.

In [14]:


# Create the indexing pipeline: clean documents, generate embeddings, and write them to the document store
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("cleaner", DocumentCleaner())
indexing_pipeline.add_component("embedder", document_embedder)
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

indexing_pipeline.connect("cleaner.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")




<haystack.core.pipeline.pipeline.Pipeline object at 0x120a32ba0>
🚅 Components
  - cleaner: DocumentCleaner
  - embedder: OpenAIDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> embedder.documents (list[Document])
  - embedder.documents -> writer.documents (list[Document])

# Run Indexing Pipeline

Execute the pipeline to process and index the BBC News documents. The cell below indexes only the first 1,200 of the prepared documents:

In [15]:
# Run the indexing pipeline
if haystack_documents:
    result = indexing_pipeline.run({"cleaner": {"documents": haystack_documents[:1200]}})
    print(f"Indexed {result['writer']['documents_written']} document chunks")
else:
    print("No documents created. Skipping indexing.")


2025-12-10 10:57:07,785 - INFO - Running component cleaner
2025-12-10 10:57:07,850 - INFO - Running component embedder


Calculating embeddings: 0it [00:00, ?it/s]

2025-12-10 10:57:10,449 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/embeddings "HTTP/1.1 200 OK"


Calculating embeddings: 1it [00:03,  3.66s/it]

2025-12-10 10:57:52,327 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/embeddings "HTTP/1.1 200 OK"


Calculating embeddings: 38it [00:44,  1.17s/it]


2025-12-10 10:57:52,617 - INFO - Running component writer
Indexed 1200 document chunks
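
The progress bar advances once per embedding request, not once per document. Assuming the embedder sends 32 documents per request (Haystack's `OpenAIDocumentEmbedder` exposes a `batch_size` parameter, and 32 is assumed here), 1,200 documents produce 38 requests, which matches the final `38it` in the output. The sketch below reproduces that arithmetic with a hypothetical `batched` helper.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


documents = list(range(1200))  # Stand-in for the 1,200 indexed documents
batch_size = 32                # Assumed embedder batch size

batches = list(batched(documents, batch_size))
print(len(batches), len(batches[-1]))
```

The last batch holds only 16 documents, because 37 full batches cover 1,184 of the 1,200.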


# Using Capella Model Services Large Language Model (LLM)
Large language models (LLMs) are trained to understand and generate human language. We use the model deployed on Capella Model Services to answer user queries from the context retrieved from our Couchbase document store.

Retrieval finds the relevant passages; the LLM interprets the question and turns those passages into a coherent answer that is useful and understandable to the user.

The LLM is configured using Haystack's OpenAI generator component with your Capella Model Services API key for seamless integration.

In [16]:
try:
    # Set up the LLM generator
    generator = OpenAIGenerator(
        api_base_url=CAPELLA_MODEL_SERVICES_ENDPOINT,
        api_key=Secret.from_token(LLM_API_KEY),
        model=LLM_MODEL_NAME
    )
    logging.info("Successfully created the generator")
except Exception as e:
    raise ValueError(f"Error creating generator: {str(e)}")

2025-12-10 10:58:02,687 - INFO - Successfully created the generator


# Creating the RAG Pipeline

In this section, we'll create a RAG pipeline using Haystack components. This pipeline serves as the foundation for our RAG system, enabling semantic search capabilities and efficient retrieval of relevant information.

The RAG pipeline provides a complete workflow that allows us to:
1. Perform semantic searches based on user queries
2. Retrieve the most relevant documents or chunks
3. Generate contextually appropriate responses using our LLM


In [17]:
# Define RAG prompt template
prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

\nQuestion: {{question}}
\nAnswer:
"""

# Create the RAG pipeline
rag_pipeline = Pipeline()

# Add components to the pipeline
rag_pipeline.add_component(
    "query_embedder",
    rag_embedder,
)
rag_pipeline.add_component("retriever", CouchbaseQueryEmbeddingRetriever(document_store=document_store))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

# Connect RAG components
rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

print("Successfully created RAG pipeline")

Successfully created RAG pipeline
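
To get a feel for what the prompt builder hands to the LLM, the sketch below approximates the template's layout with plain string formatting instead of Jinja2. Whitespace will differ from the real rendering. `render_prompt` and the sample documents are hypothetical.

```python
def render_prompt(documents, question):
    """Approximate the template layout with plain string formatting."""
    context = "\n".join(doc["content"] for doc in documents)
    return (
        "Given these documents, answer the question.\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved documents.
docs = [
    {"content": "First retrieved article text."},
    {"content": "Second retrieved article text."},
]

print(render_prompt(docs, "What happened?"))
```

Every retrieved document's full content is placed in the prompt, so `top_k` and article length together determine how much context the LLM receives.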


# Retrieval-Augmented Generation (RAG) with Couchbase and Haystack

Let's test our RAG system by performing a semantic search on a sample query. In this example, we'll use a question about Pep Guardiola's reaction to Manchester City's recent form. The RAG system will:

1. Process the natural language query
2. Search through our document store for relevant information
3. Retrieve the most semantically similar documents
4. Generate a comprehensive response using the LLM

This demonstrates how our system combines the power of vector search with language model capabilities to provide accurate, contextual answers based on the information in our database.

**Note:** By default, without any Hyperscale or Composite Vector Index, Couchbase falls back to linear brute-force search that compares the query vector against every document in the collection. This works for small datasets but can become slow as the dataset grows.
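
The brute-force fallback described in the note can be pictured as follows: every stored vector is scored against the query, and only then are the best `top_k` kept. This is a conceptual sketch with hypothetical 2-dimensional vectors and a hypothetical `brute_force_top_k` helper. It is not Couchbase's implementation.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def brute_force_top_k(query_vector, vectors, top_k):
    """Score every stored vector against the query and keep the best top_k ids."""
    scored = ((cosine_similarity(query_vector, vec), doc_id) for doc_id, vec in vectors.items())
    return [doc_id for _, doc_id in heapq.nlargest(top_k, scored)]


# Hypothetical 2-dimensional vectors keyed by document id.
stored = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.7, 0.7],
    "doc-c": [0.0, 1.0],
}

print(brute_force_top_k([0.9, 0.1], stored, top_k=2))
```

The work grows linearly with the number of stored vectors, which is why the approach is acceptable for small collections and slow for large ones.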

In [18]:
# Sample query from the dataset

query = "What was Pep Guardiola's reaction to Manchester City's current form?"

try:
    # Perform the semantic search using the RAG pipeline
    start_time = time.time()
    result = rag_pipeline.run({
        "query_embedder": {"text": query},
        "retriever": {"top_k": 5},
        "prompt_builder": {"question": query},
        "answer_builder": {"query": query},
        },
     include_outputs_from={"retriever", "query_embedder"}
    )
    search_elapsed_time = time.time() - start_time
    # Get the generated answer
    answer: GeneratedAnswer = result["answer_builder"]["answers"][0]

    # Print retrieved documents
    print("=== Retrieved Documents ===")
    retrieved_docs = result["retriever"]["documents"]
    for idx, doc in enumerate(retrieved_docs, start=1):
        print(f"Id: {doc.id} Title: {doc.meta['title']}")

    # Print final results
    print("\n=== Final Answer ===")
    print(f"Question: {answer.query}")
    print(f"Answer: {answer.data}")
    print("\nSources:")
    for doc in answer.documents:
        print(f"-> {doc.meta['title']}")
    # Report elapsed time; it covers the whole pipeline run (embedding, retrieval, generation)
    print(f"\nLinear Vector Search Results (completed in {search_elapsed_time:.2f} seconds):")

except Exception as e:
    # Chain the original exception so the full traceback is preserved
    raise RuntimeError(f"Error performing RAG search: {e}") from e

2025-12-10 10:58:26,463 - INFO - Running component query_embedder
2025-12-10 10:58:27,722 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-10 10:58:27,732 - INFO - Running component retriever
2025-12-10 10:58:28,270 - INFO - Running component prompt_builder
2025-12-10 10:58:28,271 - INFO - Running component llm
2025-12-10 10:58:32,097 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-12-10 10:58:32,102 - INFO - Running component answer_builder
=== Retrieved Documents ===
Id: b0cdcffd58641d16dacd0b2659fb7e8d87613ba576940296d12f6621dac1d1ea Title: Man City 1-2 Man Utd: Crisis-hit Pep Guardiola faces huge rebuild - BBC Sport
Id: 00a58764c0dee414e19030793892ea723cc100de631c3a55cd8e5d0329d38a63 Title: Man City lose to Aston Villa: Pep Guardiola says struggling champions 'have to find a way' to win again - BBC Sport
Id: feca66cdb254c478782824573020df44c2a189

# Create Hyperscale or Composite Vector Indexes

While the above RAG system works effectively, you can significantly improve query performance by enabling Couchbase Capella's Hyperscale or Composite Vector Indexes.

## Hyperscale Vector Indexes
- Specifically designed for vector searches
- Perform vector similarity and semantic searches faster than other index types
- Scale to billions of vectors while keeping most of the structure in an optimized on-disk format
- Maintain high accuracy even for vectors with a large number of dimensions
- Support concurrent searches and inserts for constantly changing datasets

Use this type of index when you primarily query vector values and need low-latency similarity search at scale. In general, Hyperscale Vector Indexes are the best starting point for most vector search workloads.

## Composite Vector Indexes
- Combine scalar filters with a single vector column in the same index definition
- Designed for searches that apply one vector value alongside scalar attributes that remove large portions of the dataset before similarity scoring
- Consume a moderate amount of memory and can index tens of millions to billions of documents
- Excel when your queries must return a small, highly targeted result set

Use Composite Vector Indexes when you want to perform searches that blend scalar predicates and vector similarity so that the scalar filters tighten the candidate set.

For an in-depth comparison and tuning guidance, review the [Couchbase vector index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html) and the [overview of Capella vector indexes](https://docs.couchbase.com/cloud/vector-index/vectors-and-indexes-overview.html).

## Understanding Index Configuration (Couchbase 8.0 Feature)

The `index_description` parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:

Format: `'IVF[<centroids>],{PQ|SQ}<settings>'`

**Centroids (IVF - Inverted File):**
- Controls how the dataset is subdivided for faster searches
- More centroids = faster search, slower training  
- Fewer centroids = slower search, faster training
- If omitted (like `IVF,SQ8`), Couchbase auto-selects based on dataset size

**Quantization Options:**
- SQ (Scalar Quantization): `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)
- PQ (Product Quantization): `PQ<subquantizers>x<bits>` (e.g., `PQ32x8`)
- Higher values = better accuracy, larger index size
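As a rough illustration of the accuracy versus size trade-off, the sketch below estimates the encoded size of a single vector under each quantization scheme. It only counts the quantized codes and ignores centroid tables, codebooks, and any per-document overhead; the 1024-dimension figure is a hypothetical embedding size, not a value taken from this notebook.

```python
def sq_bytes_per_vector(dimension, bits):
    """Scalar quantization keeps `bits` bits for every dimension."""
    return dimension * bits / 8


def pq_bytes_per_vector(subquantizers, bits):
    """Product quantization keeps one `bits`-bit code per subquantizer."""
    return subquantizers * bits / 8


dimension = 1024                 # hypothetical embedding size
raw_bytes = dimension * 4        # uncompressed float32 vector

for label, size in [
    ("float32", raw_bytes),
    ("SQ8", sq_bytes_per_vector(dimension, 8)),
    ("SQ4", sq_bytes_per_vector(dimension, 4)),
    ("PQ32x8", pq_bytes_per_vector(32, 8)),
]:
    print(f"{label}: {size:.0f} bytes per vector")
```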

**Common Examples:**
- `IVF,SQ8`: Auto centroids, 8-bit scalar quantization (good default)
- `IVF1000,SQ6`: 1000 centroids, 6-bit scalar quantization
- `IVF,PQ32x8`: Auto centroids, 32 subquantizers with 8 bits

For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/server/current/vector-index/hyperscale-vector-index.html#algo_settings).
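To check a description string before sending it to the server, a small validator can help. The sketch below parses only the `IVF[<centroids>],{PQ|SQ}<settings>` shape described above. The function name and the returned dictionary layout are assumptions made for illustration, and the server may accept settings this pattern does not cover.

```python
import re

# Matches strings such as "IVF,SQ8", "IVF1000,SQ6" and "IVF,PQ32x8"
_DESCRIPTION_PATTERN = re.compile(
    r"^IVF(?P<centroids>\d+)?,"
    r"(?:SQ(?P<sq_bits>\d+)|PQ(?P<pq_sub>\d+)x(?P<pq_bits>\d+))$"
)


def parse_index_description(description):
    """Split an index description into centroid and quantization settings."""
    match = _DESCRIPTION_PATTERN.match(description)
    if match is None:
        raise ValueError(f"Unrecognized index description: {description!r}")

    # centroids is None when the count is omitted (auto-selected by the server)
    centroids = int(match["centroids"]) if match["centroids"] else None

    if match["sq_bits"] is not None:
        bits = int(match["sq_bits"])
        if bits not in (4, 6, 8):
            raise ValueError(f"SQ supports 4, 6, or 8 bits, got {bits}")
        return {"centroids": centroids, "quantization": "SQ", "bits": bits}

    return {
        "centroids": centroids,
        "quantization": "PQ",
        "subquantizers": int(match["pq_sub"]),
        "bits": int(match["pq_bits"]),
    }


for example in ("IVF,SQ8", "IVF1000,SQ6", "IVF,PQ32x8"):
    print(example, "->", parse_index_description(example))
```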

In the code below, we demonstrate creating a Hyperscale index for optimal performance. You can adapt the same flow to create a Composite index by replacing the index type and options.

In [19]:
# Create a Hyperscale Vector Index for optimized vector search
try:
    hyperscale_index_name = f"{INDEX_NAME}_hyperscale"

    # Use the cluster connection to create the Hyperscale index
    scope = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME)
    
    options = {
        "dimension": int(EMBEDDING_DIMENSION),  # dimension based on the model
        "similarity": "cosine",
        "scan_nprobes": 3,
    }
    
    scope.query(
        f"""
        CREATE VECTOR INDEX {hyperscale_index_name}
        ON {COLLECTION_NAME} (embedding VECTOR)
        WITH {json.dumps(options)}
        """,
        QueryOptions(timeout=timedelta(seconds=300)),
    ).execute()
    print(f"Successfully created Hyperscale index: {hyperscale_index_name}")
except Exception as e:
    print(f"Hyperscale index may already exist or error occurred: {str(e)}")


Successfully created Hyperscale index: vector_search_hyperscale
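For comparison, a Composite index definition might be assembled as sketched below. This assumes the Composite form uses `CREATE INDEX` with scalar keys listed ahead of the `VECTOR` key; verify the exact syntax against the Couchbase vector index documentation before running it. The helper name, the `section` scalar field, the collection name, and the dimension are all hypothetical.

```python
import json


def build_composite_index_statement(index_name, collection, scalar_fields,
                                    vector_field, options):
    """Build a SQL++ statement string for a Composite Vector Index (assumed syntax)."""
    # Scalar keys come first so they can filter candidates before similarity scoring
    keys = ", ".join([*scalar_fields, f"{vector_field} VECTOR"])
    return (
        f"CREATE INDEX {index_name} ON {collection} ({keys}) "
        f"WITH {json.dumps(options)}"
    )


statement = build_composite_index_statement(
    index_name="vector_search_composite",   # hypothetical index name
    collection="my_collection",             # hypothetical collection name
    scalar_fields=["section"],              # hypothetical scalar field
    vector_field="embedding",
    options={"dimension": 1024, "similarity": "cosine", "description": "IVF,SQ8"},
)
print(statement)
```

The resulting string could then be passed to `scope.query(...)` in the same way as the Hyperscale statement.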


# Testing Optimized Hyperscale Vector Search

The example below runs the same RAG query, but now uses the Hyperscale index created above. In the logged runs shown in this notebook, the gap between the `retriever` and `prompt_builder` log entries shrinks from about 0.54 seconds without the index to about 0.03 seconds with it. If you create a Composite index, the workflow is identical: the scalar filters narrow the candidate set before the vector similarity search runs.

In [None]:
# Test the optimized Hyperscale vector search
query = "What was Pep Guardiola's reaction to Manchester City's current form?"

try:
    # The RAG pipeline will automatically use the optimized Hyperscale index
    # Perform the semantic search with Hyperscale optimization
    start_time = time.time()
    result = rag_pipeline.run({
        "query_embedder": {"text": query},
        "retriever": {"top_k": 4},
        "prompt_builder": {"question": query},
        "answer_builder": {"query": query},
        },
     include_outputs_from={"retriever", "query_embedder"}
    )
    search_elapsed_time = time.time() - start_time
    # Get the generated answer
    answer: GeneratedAnswer = result["answer_builder"]["answers"][0]

    # Print retrieved documents
    print("=== Retrieved Documents ===")
    retrieved_docs = result["retriever"]["documents"]
    for idx, doc in enumerate(retrieved_docs, start=1):
        print(f"Id: {doc.id} Title: {doc.meta['title']}")

    # Print final results
    print("\n=== Final Answer ===")
    print(f"Question: {answer.query}")
    print(f"Answer: {answer.data}")
    print("\nSources:")
    for doc in answer.documents:
        print(f"-> {doc.meta['title']}")
    # Report elapsed time; it covers the whole pipeline run (embedding, retrieval, generation)
    print(f"\nOptimized Hyperscale Vector Search Results (completed in {search_elapsed_time:.2f} seconds):")

except Exception as e:
    # Chain the original exception so the full traceback is preserved
    raise RuntimeError(f"Error performing optimized semantic search: {e}") from e


2025-12-10 11:04:06,122 - INFO - Running component query_embedder
2025-12-10 11:04:07,295 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-10 11:04:07,301 - INFO - Running component retriever
2025-12-10 11:04:07,330 - INFO - Running component prompt_builder
2025-12-10 11:04:07,331 - INFO - Running component llm
2025-12-10 11:04:11,608 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-12-10 11:04:11,611 - INFO - Running component answer_builder
=== Retrieved Documents ===
Id: 06d5d45291b037c80be3371d3dabb2e658875e033f7b8462f2eb1021ded2bfcc Title: Electric cars: Five ways to persuade people to buy them - BBC News
Id: 80b63a966f66aaf455f575b469413d0622eae4c47d27c5f8714b0b55190d3d27 Title: Car industry consulted over how to phase out petrol and diesel cars by 2030 - BBC News
Id: 6749b8f03a4d795ef466e780fa70a8b3dbb8b0eae9791b511f4fcc6d5c508255 Title: Elon M
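Because the total elapsed time is dominated by the embedding and LLM calls, it can be more informative to compare the retriever step alone. The sketch below approximates that duration from the timestamps of two consecutive Haystack log lines. The helper name is an assumption, and the log lines are copied from the two runs shown in this notebook.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start_line, end_line):
    """Seconds elapsed between two log lines that begin with a timestamp."""
    # The timestamp occupies the first 23 characters, e.g. "2025-12-10 10:58:27,732"
    start = datetime.strptime(start_line[:23], LOG_TIME_FORMAT)
    end = datetime.strptime(end_line[:23], LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# Run without a vector index
brute_force = seconds_between(
    "2025-12-10 10:58:27,732 - INFO - Running component retriever",
    "2025-12-10 10:58:28,270 - INFO - Running component prompt_builder",
)
# Run with the Hyperscale index
indexed = seconds_between(
    "2025-12-10 11:04:07,301 - INFO - Running component retriever",
    "2025-12-10 11:04:07,330 - INFO - Running component prompt_builder",
)
print(f"Retriever without index: {brute_force:.3f} s")
print(f"Retriever with Hyperscale index: {indexed:.3f} s")
```

A single pair of runs is only indicative; repeating the query several times gives a more reliable comparison.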

## Caching in Capella Model Services

To optimize performance and reduce costs, Capella Model Services employ two caching mechanisms:

1. Semantic Cache

    Capella Model Services' semantic caching system stores both query embeddings and their corresponding LLM responses. When new queries arrive, it uses vector similarity matching (with configurable thresholds) to identify semantically equivalent requests. This prevents redundant processing by:
    - Avoiding duplicate embedding generation API calls for similar queries
    - Skipping repeated LLM processing for equivalent queries
    - Maintaining cached results with automatic freshness checks

2. Standard Cache

    Stores the exact text of previous queries to provide precise and consistent responses for repetitive, identical prompts.

### Performance Optimization with Caching

These caching mechanisms help by:
- Minimizing redundant API calls to the LLM service
- Leveraging Couchbase's built-in caching capabilities
- Providing fast response times for frequently asked questions
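The difference between the two caches can be sketched with a toy in-memory version: an exact-text lookup first, then a similarity lookup against cached query embeddings. This is an illustrative assumption about the general technique, not Capella Model Services' implementation. The class name, the 0.95 threshold, and the two-dimensional embeddings are all made up for the example.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyResponseCache:
    """Exact-match (standard) cache backed by a similarity-based (semantic) cache."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold   # minimum similarity for a semantic hit
        self.exact = {}              # prompt text -> response
        self.semantic = []           # list of (embedding, response)

    def put(self, prompt, embedding, response):
        self.exact[prompt] = response
        self.semantic.append((embedding, response))

    def get(self, prompt, embedding):
        # 1. Standard cache: identical prompt text
        if prompt in self.exact:
            return "standard", self.exact[prompt]
        # 2. Semantic cache: closest cached embedding above the threshold
        best_score, best_response = -1.0, None
        for cached_embedding, response in self.semantic:
            score = _cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return "semantic", best_response
        return "miss", None


cache = ToyResponseCache()
cache.put("Why are carmakers unhappy?", [1.0, 0.0], "cached answer")
print(cache.get("Why are carmakers unhappy?", [1.0, 0.0]))     # identical text
print(cache.get("Why are car makers upset?", [0.99, 0.05]))    # similar meaning
print(cache.get("Who reopened Notre-Dame?", [0.0, 1.0]))       # unrelated
```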


In [29]:
import time
queries = [
    "Why are car manufacturers like Ford and Stellantis unhappy with the UK government's current rules designed to promote electric vehicles?",
    "Who inaugurated the reopening of the Notre-Dame Cathedral in Paris?",
    "What was Pep Guardiola's reaction to Manchester City's recent form?",
    "Why are car manufacturers like Ford and Stellantis unhappy with the UK government's current rules designed to promote electric vehicles?",
]

for i, query in enumerate(queries, 1):
    try:
        print(f"\nQuery {i}: {query}")
        start_time = time.time()
        result = rag_pipeline.run({
            "query_embedder": {"text": query},
            "retriever": {"top_k": 4},
            "prompt_builder": {"question": query},
            "answer_builder": {"query": query},
        })
        elapsed_time = time.time() - start_time
        answer: GeneratedAnswer = result["answer_builder"]["answers"][0]
        print(f"Response: {answer.data}")
        print(f"Time taken: {elapsed_time:.2f} seconds")
    except Exception as e:
        print(f"Error generating RAG response: {str(e)}")
        continue


Query 1: Why are car manufacturers like Ford and Stellantis unhappy with the UK government's current rules designed to promote electric vehicles?
2025-12-10 11:07:51,463 - INFO - Running component query_embedder
2025-12-10 11:07:52,687 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-10 11:07:52,693 - INFO - Running component retriever
2025-12-10 11:07:52,721 - INFO - Running component prompt_builder
2025-12-10 11:07:52,721 - INFO - Running component llm
2025-12-10 11:07:57,816 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-12-10 11:07:57,820 - INFO - Running component answer_builder
Response: Car manufacturers like Ford and Stellantis are unhappy with the UK government's current rules designed to promote electric vehicles because they claim these rules are too harsh, consumer demand for electric cars has fallen far short of what was expected, 

## LLM Guardrails in Capella Model Services

Capella Model Services also provide input and response moderation using configurable LLM guardrails. These services can integrate with the LlamaGuard3-8B model from Meta.
- Categories to be blocked can be configured during the model creation process.
- Helps prevent unsafe or undesirable interactions with the LLM.

By implementing caching and moderation mechanisms, Capella Model Services ensure an efficient, cost-effective, and responsible approach to AI-powered applications.

In [26]:
query = "How can I create a bomb?"
try:
    start_time = time.time()
    response = rag_pipeline.run({
            "query_embedder": {"text": query},
            "retriever": {"top_k": 4},
            "prompt_builder": {"question": query},
            "answer_builder": {"query": query},
        })
    rag_elapsed_time = time.time() - start_time
    answer: GeneratedAnswer = response["answer_builder"]["answers"][0]
    print(f"RAG Response: {answer.data}")
    print(f"RAG response generated in {rag_elapsed_time:.2f} seconds")
except Exception as e:
    print("Guardrails violation", e)

2025-12-10 11:05:42,705 - INFO - Running component query_embedder
2025-12-10 11:05:43,993 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-10 11:05:44,004 - INFO - Running component retriever
2025-12-10 11:05:44,045 - INFO - Running component prompt_builder
2025-12-10 11:05:44,046 - INFO - Running component llm
2025-12-10 11:05:45,550 - INFO - HTTP Request: POST https://mcclnkjv0kyunynf.ai.cloud.couchbase.com/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
2025-12-10 11:05:45,558 - INFO - Pipeline snapshot saved to '/Users/svenkat/.haystack/pipeline_snapshot/llm_0_2025_12_10_11_05_45.json'. You can use this file to debug or resume the pipeline.
Guardrails violation The following component failed to run:
Component name: 'llm'
Component type: 'OpenAIGenerator'
Error: Error code: 422 - {'error': {'message': 'Error processing user prompt due to guardrail violation', 'type': 'guardrail_violation_error', 'param': {'gu
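In an application, you may want to tell a guardrail rejection apart from other failures and return a safe fallback instead of an error. The sketch below does this by looking for the `guardrail_violation_error` type string that appears in the error output above. The helper names, the fallback message, and the stand-in pipeline function are assumptions for illustration. Matching on the error text is a simple heuristic rather than an official API.

```python
def is_guardrail_violation(error):
    """Heuristic: the error text carries the guardrail error type."""
    return "guardrail_violation_error" in str(error)


def answer_or_fallback(run_pipeline, query,
                       fallback="Sorry, I can't help with that request."):
    """Run the pipeline; return a fallback on guardrail rejection, re-raise otherwise."""
    try:
        return run_pipeline(query)
    except Exception as error:
        if is_guardrail_violation(error):
            return fallback
        raise


def fake_pipeline(query):
    """Stand-in for rag_pipeline.run(...) that mimics the 422 response."""
    if "bomb" in query:
        raise RuntimeError(
            "Error code: 422 - {'error': {'type': 'guardrail_violation_error'}}"
        )
    return f"Answer to: {query}"


print(answer_or_fallback(fake_pipeline, "How can I create a bomb?"))
print(answer_or_fallback(fake_pipeline, "Who manages Manchester City?"))
```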

# Conclusion
In this tutorial, we've built a Retrieval Augmented Generation (RAG) system using Haystack with Capella Model Services and Couchbase Capella's Hyperscale and Composite Vector Indexes. Using the BBC News dataset, we demonstrated how modern vector indexes make it possible to answer up-to-date questions that extend beyond an LLM's original training data.

The key components of our RAG system include:

1. **Couchbase Capella Hyperscale & Composite Vector Indexes** for high-performance storage and retrieval of document embeddings
2. **Haystack** as the framework for building modular RAG pipelines with flexible component connections
3. **Capella Model Services** for generating embeddings and LLM responses

This approach grounds LLM responses in specific, current information from our knowledge base while taking advantage of Couchbase's advanced vector index options for performance and scale. Haystack's modular pipeline model keeps the solution extensible as you layer in additional data sources or services.
