# Introduction
In this guide, we walk you through building a Retrieval Augmented Generation (RAG) application using Couchbase Capella as the database and the [Mistral-7B-Instruct-v0.3](https://build.nvidia.com/mistralai/mistral-7b-instruct-v03/modelcard) model, provided by Capella Model Services, as the large language model (LLM). We use the [NVIDIA NeMo Retriever Llama3.2](https://build.nvidia.com/nvidia/llama-3_2-nv-embedqa-1b-v2/modelcard) model, also served through Capella Model Services, to generate embeddings.

This notebook demonstrates how to build a RAG system using:
- The [BBC News dataset](https://huggingface.co/datasets/RealTimeData/bbc_news_alltime) containing news articles
- Couchbase Capella as the vector store with **Hyperscale and Composite Vector Indexes** for high-performance vector search
- Capella Model Services for embeddings and text generation
- LangChain framework for the RAG pipeline

We leverage Couchbase's **Query Service** to create and manage Hyperscale Vector Indexes, enabling efficient semantic search capabilities that can scale to billions of vectors. Hyperscale and Composite indexes provide superior performance for large-scale vector search operations compared to traditional approaches. This tutorial can also be recreated using the Search Service with [Search Vector Index](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/capella-model-services/langchain/search_based/RAG_with_Capella_Model_Services_and_LangChain.ipynb).

**Key Features:**
- High-performance vector search using Hyperscale/Composite indexes
- Performance benchmarks showing optimization benefits
- Complete RAG workflow with caching optimization

**Requirements:** Couchbase Server 8.0+ or Capella with Query Service enabled.

Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using Capella Model Services and [LangChain](https://langchain.com/).

## How to run this tutorial

This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/capella-model-services/langchain/query_based/RAG_with_Capella_Model_Services_and_LangChain.ipynb).

You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment.

## Before you start

### Create and Deploy Your Operational Cluster on Capella

To get started with Couchbase Capella, create an account and use it to deploy an operational cluster.

For details, follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).


#### Couchbase Capella Configuration

When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:

* Have a multi-node Capella cluster running the **Data, Query, and Index services** (Query Service is required for Hyperscale/Composite indexes).
* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket (Read and Write) used in the application.
* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.

#### Deploy Models

In order to create the RAG application, we need an embedding model to ingest the documents for Vector Search and a large language model (LLM) for generating the responses based on the context.

Capella Model Services lets you deploy both the embedding model and the LLM in the same VPC as your database. Several embedding models and LLMs are available, along with value-added features for the models.

Create the models using the Capella Model Services interface. While creating the model, it is possible to cache the responses (both standard and semantic cache) and apply guardrails to the LLM responses.

For more details, please refer to the [documentation](https://docs.couchbase.com/ai/build/model-service/model-service.html). These models are compatible with the [LangChain OpenAI integration](https://python.langchain.com/api_reference/openai/index.html).

After the models are deployed, create API keys for them and add the IP address from which you run this tutorial to each key's allowed IP list. For more details, refer to the documentation on [generating the API keys](https://docs.couchbase.com/ai/api-guide/api-start.html#model-service-keys).

## Installing Necessary Libraries
To build our RAG system, we need a set of libraries that handle everything from connecting to the database to performing AI tasks. Each library has a specific role: the Couchbase libraries manage database operations, LangChain handles AI model integrations, and the OpenAI SDK generates embeddings and calls the LLM in Capella Model Services. Installing these libraries ensures that our environment is equipped for the tasks required for RAG.

In [1]:
%pip install --quiet datasets==4.4.1 langchain-couchbase==1.0.0 langchain-openai==1.1.0

Note: you may need to restart the kernel to use updated packages.


## Importing Necessary Libraries
The notebook starts by importing the libraries required for its tasks: logging, time tracking, loading environment variables, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and calling machine learning models.

Note that we import `CouchbaseQueryVectorStore` along with `DistanceStrategy` and `IndexType` for creating Hyperscale/Composite Vector Indexes.

In [2]:
import logging
import os
import time

from datetime import timedelta

from dotenv import load_dotenv

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.exceptions import CouchbaseException
from couchbase.options import ClusterOptions

from datasets import load_dataset

from langchain_core.documents import Document
from langchain_core.globals import set_llm_cache
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_couchbase.cache import CouchbaseCache
from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore
from langchain_couchbase.vectorstores import DistanceStrategy, IndexType
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from tqdm import tqdm



## Loading Environment Variables

This notebook loads configuration from a `.env` file in the same directory. Create a `.env` file with the following variables:

**Required (no defaults):**
- `CB_CONNECTION_STRING` - Your Couchbase connection string
- `CB_USERNAME` - Couchbase database username
- `CB_PASSWORD` - Couchbase database password
- `CB_BUCKET_NAME` - Name of your Couchbase bucket
- `CAPELLA_MODEL_SERVICES_ENDPOINT` - Capella Model Services endpoint (include `/v1` suffix)
- `LLM_API_KEY` - API key for the LLM model
- `EMBEDDING_API_KEY` - API key for the embedding model

**Optional (with defaults):**
- `SCOPE_NAME` - Scope name (default: `_default`)
- `COLLECTION_NAME` - Collection name (default: `langchain_docs`)
- `CACHE_COLLECTION` - Cache collection name (default: `cache`)
- `LLM_MODEL_NAME` - LLM model name (default: `mistralai/mistral-7b-instruct-v0.3`)
- `EMBEDDING_MODEL_NAME` - Embedding model name (default: `nvidia/llama-3.2-nv-embedqa-1b-v2`)

> **Note:** Append the `/v1` suffix to the Capella Model Services endpoint if the URL shown in the UI does not already include it.

> If the models are running in the same region, either API key can be used interchangeably. See [generating API keys](https://docs.couchbase.com/ai/api-guide/api-start.html#model-service-keys).
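
The note about the `/v1` suffix can be enforced in code rather than by hand. The following is a minimal sketch, not part of the original notebook: `with_v1_suffix` is a hypothetical helper name, and the example URLs are placeholders rather than real Capella endpoints.

```python
def with_v1_suffix(endpoint: str) -> str:
    """Return the endpoint with exactly one trailing '/v1' path segment."""
    # Drop surrounding whitespace and any trailing slashes before checking
    trimmed = endpoint.strip().rstrip("/")
    if trimmed.endswith("/v1"):
        return trimmed
    return trimmed + "/v1"


# Placeholder URLs used only for illustration
print(with_v1_suffix("https://models.example.invalid/"))
print(with_v1_suffix("https://models.example.invalid/v1"))
```

Under these assumptions, the value read from `CAPELLA_MODEL_SERVICES_ENDPOINT` could be passed through such a helper before it is handed to the embedding and LLM clients.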

In [3]:
# Load environment variables from .env file
load_dotenv()

# Couchbase connection settings (no defaults for sensitive values)
CB_CONNECTION_STRING = os.getenv("CB_CONNECTION_STRING")
CB_USERNAME = os.getenv("CB_USERNAME")
CB_PASSWORD = os.getenv("CB_PASSWORD")
CB_BUCKET_NAME = os.getenv("CB_BUCKET_NAME")

# Collection settings (with sensible defaults)
SCOPE_NAME = os.getenv("SCOPE_NAME", "_default")
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "langchain_docs")
CACHE_COLLECTION = os.getenv("CACHE_COLLECTION", "cache")

# Capella Model Services settings
CAPELLA_MODEL_SERVICES_ENDPOINT = os.getenv("CAPELLA_MODEL_SERVICES_ENDPOINT")

# Model names (with defaults matching tutorial recommendations)
LLM_MODEL_NAME = os.getenv("LLM_MODEL_NAME", "mistralai/mistral-7b-instruct-v0.3")
EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "nvidia/llama-3.2-nv-embedqa-1b-v2")

# API keys (no defaults for sensitive values)
LLM_API_KEY = os.getenv("LLM_API_KEY")
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY")

# Validate required environment variables
if not all([
    CB_CONNECTION_STRING,
    CB_USERNAME,
    CB_PASSWORD,
    CB_BUCKET_NAME,
    CAPELLA_MODEL_SERVICES_ENDPOINT,
    LLM_API_KEY,
    EMBEDDING_API_KEY,
]):
    raise ValueError(
        "Missing required environment variables. Please ensure your .env file contains:\n"
        "- CB_CONNECTION_STRING\n"
        "- CB_USERNAME\n"
        "- CB_PASSWORD\n"
        "- CB_BUCKET_NAME\n"
        "- CAPELLA_MODEL_SERVICES_ENDPOINT\n"
        "- LLM_API_KEY\n"
        "- EMBEDDING_API_KEY"
    )

print("Environment variables loaded successfully")

Environment variables loaded successfully


## Connecting to the Couchbase Cluster
Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our RAG system. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections.

In [4]:
try:
    auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)
    options = ClusterOptions(auth)
    cluster = Cluster(CB_CONNECTION_STRING, options)
    cluster.wait_until_ready(timedelta(seconds=5))
    print("Successfully connected to Couchbase")
except Exception as e:
    raise ConnectionError(f"Failed to connect to Couchbase: {str(e)}")

Successfully connected to Couchbase


## Setting Up Collections in Couchbase
In Couchbase, data is organized in buckets, which can be further divided into scopes and collections. Think of a collection as a table in a traditional SQL database. Before we can store any data, we need to ensure that our collections exist. If they don't, we must create them. This step is important because it prepares the database to handle the specific types of data our application will process. By setting up collections, we define the structure of our data storage, which is essential for efficient data retrieval and management.

Moreover, setting up collections allows us to isolate different types of data within the same bucket, providing a more organized and scalable data structure. This is particularly useful when dealing with large datasets, as it ensures that related data is stored together, making it easier to manage and query. The calls below pass `flush_collection=True`, which deletes any existing documents in both collections. To keep existing data, pass `flush_collection=False` instead.

In [5]:
def setup_collection(cluster, bucket_name, scope_name, collection_name, flush_collection=False):
    try:
        bucket = cluster.bucket(bucket_name)
        bucket_manager = bucket.collections()

        # Check if scope exists, create if it doesn't
        scopes = bucket_manager.get_all_scopes()
        scope_exists = any(scope.name == scope_name for scope in scopes)
        
        if not scope_exists:
            print(f"Scope '{scope_name}' does not exist. Creating it...")
            bucket_manager.create_scope(scope_name)
            print(f"Scope '{scope_name}' created successfully.")
            # Refresh scopes list after creation
            scopes = bucket_manager.get_all_scopes()
        else:
            print(f"Scope '{scope_name}' already exists. Skipping creation.")
        
        # Check if collection exists, create if it doesn't (reuse scopes variable)
        collection_exists = any(
            scope.name == scope_name
            and collection_name in [col.name for col in scope.collections]
            for scope in scopes
        )

        if not collection_exists:
            print(f"Collection '{collection_name}' does not exist. Creating it...")
            bucket_manager.create_collection(scope_name, collection_name)
            print(f"Collection '{collection_name}' created successfully.")
        else:
            print(f"Collection '{collection_name}' already exists. Skipping creation.")

        collection = bucket.scope(scope_name).collection(collection_name)
        time.sleep(2)  # Give the collection time to be ready for queries

        # Create primary index for the collection (required for DELETE operations)
        try:
            index_name = f"`{bucket_name}`.`{scope_name}`.`{collection_name}`"
            query = f"CREATE PRIMARY INDEX IF NOT EXISTS ON {index_name}"
            cluster.query(query).execute()
            print(f"Primary index created/verified for {collection_name}.")
        except Exception as e:
            print(f"Note: Could not create primary index: {str(e)}")

        if flush_collection:
            # Clear all documents in the collection
            try:
                query = f"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`"
                cluster.query(query).execute()
                print("All documents cleared from the collection.")
            except Exception as e:
                print(
                    f"Error while clearing documents: {str(e)}. The collection might be empty."
                )

    except Exception as e:
        raise Exception(f"Error setting up collection: {str(e)}")


# Setup main collection for vector store
setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME, flush_collection=True)

# Setup cache collection for LLM response caching
setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION, flush_collection=True)

Scope 'shared' already exists. Skipping creation.
Collection 'langchain_query' already exists. Skipping creation.
Primary index created/verified for langchain_query.
All documents cleared from the collection.
Scope 'shared' already exists. Skipping creation.
Collection 'cache' already exists. Skipping creation.
Primary index created/verified for cache.
All documents cleared from the collection.


## Load the BBC News Dataset
To build a RAG engine, we need data to search through. We use the [BBC Realtime News dataset](https://huggingface.co/datasets/RealTimeData/bbc_news_alltime), a dataset with up-to-date BBC news articles grouped by month. This dataset contains articles that were created after the LLM was trained. It will showcase the use of RAG to augment the LLM.

The BBC News dataset's varied content allows us to simulate real-world scenarios where users ask complex questions, enabling us to fine-tune our RAG's ability to understand and respond to various types of queries.

In [6]:
try:
    news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split="train")
    print(f"Loaded the BBC News dataset with {len(news_dataset)} rows")
except Exception as e:
    raise ValueError(f"Error loading BBC dataset: {str(e)}")

Loaded the BBC News dataset with 2687 rows


## Preview the Data

In [7]:
print(news_dataset[:5])

{'title': ["Pakistan protest: Bushra Bibi's march for Imran Khan disappeared - BBC News", 'Lockdown DIY linked to Walleys Quarry gases - BBC News', 'Newscast - What next for the assisted dying bill? - BBC Sounds', "F1: Bernie Ecclestone to sell car collection worth 'hundreds of millions' - BBC Sport", 'British man Tyler Kerry from Basildon dies on holiday in Turkey - BBC News'], 'published_date': ['2024-12-01', '2024-12-01', '2024-12-01', '2024-12-01', '2024-12-01'], 'authors': ['https://www.facebook.com/bbcnews', 'https://www.facebook.com/bbcnews', None, 'https://www.facebook.com/BBCSport/', 'https://www.facebook.com/bbcnews'], 'description': ["Imran Khan's third wife guided protesters to the heart of the capital - and then disappeared.", 'An academic says an increase in plasterboard sent to landfill could be behind a spike in smells.', 'And rebel forces in Syria have taken control of Aleppo', 'Former Formula 1 boss Bernie Ecclestone is to sell his collection of race cars driven by mo

## Cleaning up the Data

We will use the content of the news articles for our RAG system.

The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system.

In [8]:
news_articles = news_dataset["content"]
unique_articles = set()
for article in news_articles:
    if article:
        unique_articles.add(article)
unique_news_articles = list(unique_articles)
print(f"We have {len(unique_news_articles)} unique articles in our database.")

We have 1749 unique articles in our database.
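
A `set` does not preserve insertion order, so the order of `unique_news_articles` can differ between runs. If a stable ingestion order matters, for example when comparing runs, one option is to deduplicate with a dictionary instead. This is a sketch of that alternative, not the notebook's method; `dedupe_preserving_order` is a hypothetical helper name and the sample data is made up.

```python
def dedupe_preserving_order(articles):
    """Drop empty entries and duplicates, keeping the first occurrence of each article."""
    # Dictionary keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(a for a in articles if a))


# Made-up sample standing in for news_dataset["content"]
sample = ["story A", "story B", "story A", None, "", "story C"]
print(dedupe_preserving_order(sample))  # ['story A', 'story B', 'story C']
```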


## Creating Embeddings using Capella Model Service
Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using Capella Model service, we equip our RAG system with the ability to understand and process natural language in a way that is much closer to how humans understand language. This step transforms our raw text data into a format that the Capella vector store can use to find and rank relevant documents.

We use `OpenAIEmbeddings` from the [LangChain OpenAI provider](https://python.langchain.com/docs/integrations/providers/openai/) with a few parameters specific to Capella Model Services: `tiktoken_enabled=False` disables client-side tokenization, and `check_embedding_ctx_length=False` disables LangChain's own handling of inputs that exceed the context length. We point the SDK at Capella Model Services by passing its endpoint URL together with the model name and API key. For this tutorial, we use the [nvidia/llama-3.2-nv-embedqa-1b-v2](https://build.nvidia.com/nvidia/llama-3_2-nv-embedqa-1b-v2) embedding model. If you use a different model, change the model name accordingly.

In [9]:
try:
    embeddings = OpenAIEmbeddings(
        openai_api_key=EMBEDDING_API_KEY,
        openai_api_base=CAPELLA_MODEL_SERVICES_ENDPOINT,
        check_embedding_ctx_length=False,
        tiktoken_enabled=False,
        model=EMBEDDING_MODEL_NAME,
    )
    print("Successfully created CapellaAIEmbeddings")
except Exception as e:
    raise ValueError(f"Error creating CapellaAIEmbeddings: {str(e)}")

Successfully created CapellaAIEmbeddings


## Testing the Embeddings Model
We can test the embeddings model by generating an embedding for a string using the LangChain OpenAI package. The printed value is the dimension of the embedding vector.

In [10]:
print(len(embeddings.embed_query("this is a test sentence")))

2048


## Setting Up the Couchbase Query Vector Store
The vector store is set up to store the documents from the dataset. We use `CouchbaseQueryVectorStore` which enables Hyperscale and Composite Vector Indexes for high-performance vector storage and retrieval using Couchbase's Query Service.

**Key differences from Search Vector Store:**
- Uses Query Service instead of Search Service
- Supports Hyperscale indexes that can scale to billions of vectors
- Index is created programmatically after data ingestion
- No need for a separate JSON index definition file

In [11]:
try:
    vector_store = CouchbaseQueryVectorStore(
        cluster=cluster,
        bucket_name=CB_BUCKET_NAME,
        scope_name=SCOPE_NAME,
        collection_name=COLLECTION_NAME,
        embedding=embeddings,
        distance_metric=DistanceStrategy.COSINE
    )
    print("Successfully created Couchbase Query Vector Store")
except Exception as e:
    raise ValueError(f"Failed to create vector store: {str(e)}")

Successfully created Couchbase Query Vector Store


## Saving Data to the Vector Store
With the vector store set up, the next step is to populate it with data. We save the BBC articles to the vector store using batch ingestion for efficiency. LangChain generates an embedding for each document so that it can be found by semantic search.

**Important:** With Hyperscale/Composite indexes, data must be ingested BEFORE creating the index. The index creation process analyzes existing vectors to optimize search performance through clustering and quantization.

Some articles may exceed the embedding model's maximum token limit (8192 tokens). These documents are automatically skipped during ingestion. For production use, consider splitting longer documents into chunks.

In [12]:
# Filter articles that are within token limits (50000 chars as rough estimate)
batch_size = 20  # Smaller batch size for better reliability with remote clusters
filtered_articles = [a for a in unique_news_articles if a and len(a) <= 50000]

print(f"Filtered {len(unique_news_articles) - len(filtered_articles)} articles exceeding length limit")
print(f"Ingesting {len(filtered_articles)} articles in batches of {batch_size}...")

try:
    vector_store.add_texts(
        texts=filtered_articles,
        batch_size=batch_size
    )
    print(f"\nIngestion complete: {len(filtered_articles)} documents ingested successfully")
except Exception as e:
    error_str = str(e).lower()
    if "timeout" in error_str or "exceeds maximum" in error_str or "token" in error_str:
        # Fall back to individual ingestion for problematic batches
        print("Batch ingestion encountered issues, falling back to individual ingestion...")
        skipped_count = 0
        ingested_count = 0
        
        for article in tqdm(filtered_articles, desc="Ingesting articles"):
            try:
                vector_store.add_texts(texts=[article])
                ingested_count += 1
            except Exception as inner_e:
                inner_error = str(inner_e).lower()
                if "timeout" in inner_error or "exceeds maximum" in inner_error or "token" in inner_error:
                    skipped_count += 1
                    continue
                else:
                    print(f"Failed to save document: {str(inner_e)[:100]}...")
                    skipped_count += 1
                    continue
        
        print(f"\nIngestion complete: {ingested_count} documents ingested, {skipped_count} skipped")
    else:
        raise Exception(f"Error during batch ingestion: {str(e)}")

Filtered 1 articles exceeding length limit
Ingesting 1748 articles in batches of 20...

Ingestion complete: 1748 documents ingested successfully
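
The suggestion above to split longer documents into chunks could look like the following. This is a minimal sketch under stated assumptions: it splits on characters rather than tokens, the `chunk_size` and `overlap` defaults are arbitrary illustrative values, and `split_into_chunks` is a hypothetical helper, not something the notebook or LangChain defines.

```python
def split_into_chunks(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into character windows of at most chunk_size, overlapping by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Tiny example so the overlap is easy to see
print(split_into_chunks("0123456789", chunk_size=4, overlap=1))  # ['0123', '3456', '6789']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice. A token-aware splitter would track the model's limit more closely than a character count.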


## Vector Search Performance Testing

Now let's demonstrate the performance benefits of Hyperscale Vector Index by testing pure vector search performance. We'll compare:

1. **Baseline Performance**: Vector search without Hyperscale index optimization
2. **Hyperscale-Optimized Performance**: Same search with Hyperscale index

## Vector Index Types Overview

Before we start testing, let's understand the index types available:

**BHIVE (Hyperscale) Vector Indexes:**
- **Best for**: Pure vector searches - content discovery, recommendations, semantic search
- **Performance**: High performance with low memory footprint, designed to scale to billions of vectors
- **Optimization**: Optimized for concurrent operations, supports simultaneous searches and inserts
- **Use when**: You primarily perform vector-only queries without complex scalar filtering

**Composite Vector Indexes:**
- **Best for**: Filtered vector searches that combine vector search with scalar value filtering
- **Performance**: Efficient pre-filtering where scalar attributes reduce the vector comparison scope
- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data

For this tutorial, we'll create and test a BHIVE index. See the alternative section below for Composite index configuration.

For more information, see [Couchbase Hyperscale and Composite Vector Index Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).

## Vector Search Test Function

In [13]:
def test_vector_search_performance(vector_store, query, label="Vector Search"):
    """Test pure vector search performance and return timing metrics"""
    print(f"\n[{label}] Testing vector search performance")
    print(f"[{label}] Query: '{query}'")
    
    start_time = time.time()
    
    try:
        results = vector_store.similarity_search_with_score(query, k=3)
        end_time = time.time()
        search_time = end_time - start_time
        
        print(f"[{label}] Vector search completed in {search_time:.4f} seconds")
        print(f"[{label}] Found {len(results)} documents")
        
        if results:
            doc, distance = results[0]
            print(f"[{label}] Top result distance: {distance:.6f} (lower = more similar)")
            preview = doc.page_content[:100] + "..." if len(doc.page_content) > 100 else doc.page_content
            print(f"[{label}] Top result preview: {preview}")
        
        return search_time
    except Exception as e:
        print(f"[{label}] Vector search failed: {str(e)}")
        return None
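
A single timed call can include one-off costs such as connection warm-up, so one measurement per configuration is a rough signal at best. The sketch below shows one way to repeat a call and report both the first run and the median. `time_repeated` is a hypothetical helper, and the workload is a stand-in so that the block runs without a cluster.

```python
import statistics
import time


def time_repeated(fn, repeats: int = 5):
    """Call fn() `repeats` times and return (first_run_seconds, median_seconds)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock suited to short intervals
        fn()
        durations.append(time.perf_counter() - start)
    return durations[0], statistics.median(durations)


# Stand-in workload; in the notebook this could instead be
# lambda: vector_store.similarity_search_with_score(test_query, k=3)
first, median = time_repeated(lambda: sum(range(100_000)), repeats=5)
print(f"first run: {first:.4f}s, median: {median:.4f}s")
```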

## Test 1: Baseline Performance (No Hyperscale Index)

In [14]:
test_query = "What was Pep Guardiola's reaction to Manchester City's current form?"
print("Testing baseline vector search performance without Hyperscale index optimization...")
baseline_time = test_vector_search_performance(vector_store, test_query, "Baseline Search")
# test_vector_search_performance returns None when the search fails
if baseline_time is not None:
    print(f"\nBaseline vector search time (without Hyperscale index): {baseline_time:.4f} seconds\n")

Testing baseline vector search performance without Hyperscale index optimization...

[Baseline Search] Testing vector search performance
[Baseline Search] Query: 'What was Pep Guardiola's reaction to Manchester City's current form?'
[Baseline Search] Vector search completed in 5.6997 seconds
[Baseline Search] Found 3 documents
[Baseline Search] Top result distance: 0.491326 (lower = more similar)
[Baseline Search] Top result preview: 'We have to find a way' - Guardiola vows to end relegation form

This video can not be played To pla...

Baseline vector search time (without Hyperscale index): 5.6997 seconds



## Creating BHIVE (Hyperscale) Vector Index

Now let's create a BHIVE vector index to enable high-performance vector searches. The index creation is done programmatically through the vector store.

**Index Configuration:**
- `index_type`: `IndexType.HYPERSCALE` for pure vector search, `IndexType.COMPOSITE` for filtered searches
- `index_name`: Unique name for the index
- `index_description`: Controls centroids and quantization settings

### Index Configuration Details

The `index_description` parameter controls vector optimization through centroids and quantization:

**Format**: `'IVF[<centroids>],{PQ|SQ}<settings>'`

#### IVF (Inverted File Index) - Centroids
- Auto-selection: `IVF,SQ8` (Couchbase selects optimal count)
- Manual: `IVF1000,SQ8` (1000 centroids)

#### Quantization Options
- **SQ (Scalar)**: `SQ4`, `SQ6`, `SQ8` - simpler, good for general use
- **PQ (Product)**: `PQ32x8` - better precision at similar compression

#### Common Examples
- `IVF,SQ8` - Recommended default
- `IVF1000,SQ6` - Higher compression
- `IVF,PQ32x8` - High precision

For more details, see [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).
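
To make the `index_description` format concrete, the sketch below checks a description string against the forms listed in this section and splits it into its parts. It is a hypothetical helper for illustration only: it is not part of `langchain-couchbase`, and it accepts only the patterns shown above (`IVF` with an optional centroid count, then `SQ4`/`SQ6`/`SQ8` or a `PQ<n>x<m>` setting). Couchbase may accept other forms, so treat the linked documentation as the authority.

```python
import re

# IVF, optional centroid count, comma, then SQ or PQ followed by its settings
_PATTERN = re.compile(r"IVF(?P<centroids>\d*),(?P<kind>SQ|PQ)(?P<settings>\d+(?:x\d+)?)")
_SQ_SETTINGS = {"4", "6", "8"}  # the SQ variants named in this tutorial


def parse_index_description(description: str) -> dict:
    """Split an index description such as 'IVF1000,SQ8' into its components."""
    match = _PATTERN.fullmatch(description)
    if match is None:
        raise ValueError(f"Unrecognized index description: {description!r}")

    kind, settings = match.group("kind"), match.group("settings")
    if kind == "SQ" and settings not in _SQ_SETTINGS:
        raise ValueError(f"Unsupported scalar quantization setting: SQ{settings}")
    if kind == "PQ" and "x" not in settings:
        raise ValueError(f"Product quantization needs the form PQ<n>x<m>, got PQ{settings}")

    centroids = match.group("centroids")
    return {
        # None means the centroid count is left to Couchbase (the 'IVF,' form)
        "centroids": int(centroids) if centroids else None,
        "quantization": kind,
        "settings": settings,
    }


for example in ("IVF,SQ8", "IVF1000,SQ6", "IVF,PQ32x8"):
    print(example, "->", parse_index_description(example))
```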

In [15]:
print("Creating BHIVE (Hyperscale) vector index...")
try:
    vector_store.create_index(
        index_type=IndexType.HYPERSCALE,
        index_name="langchain_bhive_index",
        index_description="IVF,SQ8"
    )
    print("BHIVE vector index created successfully")
    
    # Wait for index to become available
    print("Waiting for index to become available...")
    time.sleep(5)
    
except Exception as e:
    if "already exists" in str(e).lower():
        print("BHIVE vector index already exists, proceeding...")
    else:
        print(f"Error creating BHIVE vector index: {str(e)}")

Creating BHIVE (Hyperscale) vector index...
BHIVE vector index created successfully
Waiting for index to become available...


## Test 2: BHIVE (Hyperscale) Optimized Performance

In [16]:
print("Testing vector search performance with BHIVE (Hyperscale) optimization...")
bhive_time = test_vector_search_performance(vector_store, test_query, "BHIVE")

Testing vector search performance with BHIVE (Hyperscale) optimization...

[BHIVE] Testing vector search performance
[BHIVE] Query: 'What was Pep Guardiola's reaction to Manchester City's current form?'
[BHIVE] Vector search completed in 2.1784 seconds
[BHIVE] Found 3 documents
[BHIVE] Top result distance: 0.491326 (lower = more similar)
[BHIVE] Top result preview: 'We have to find a way' - Guardiola vows to end relegation form

This video can not be played To pla...


### Alternative: Composite Index Configuration

If your use case requires complex filtering with scalar attributes, you can create a Composite index instead:

```python
vector_store.create_index(
    index_type=IndexType.COMPOSITE,  # Instead of IndexType.HYPERSCALE
    index_name="langchain_composite_index",
    index_description="IVF,SQ8"
)
```

Composite indexes are optimized for queries that combine vector similarity with scalar filters (e.g., filtering by date, category, or other metadata fields).

## Performance Summary

In [17]:
print("\n" + "="*60)
print("PERFORMANCE SUMMARY")
print("="*60)

# Either timing is None if the corresponding search failed
if baseline_time is not None:
    print(f"Baseline Search Time:     {baseline_time:.4f} seconds")

if baseline_time and bhive_time:
    speedup = baseline_time / bhive_time
    percent_improvement = ((baseline_time - bhive_time) / baseline_time) * 100
    print(f"BHIVE Search Time:        {bhive_time:.4f} seconds ({speedup:.2f}x faster, {percent_improvement:.1f}% improvement)")

print("\n" + "-"*60)
print("Index Recommendation:")
print("-"*60)
print("- BHIVE (Hyperscale): Best for pure vector searches, scales to billions of vectors")
print("- Composite: Best for filtered searches combining vector + scalar filters")


PERFORMANCE SUMMARY
Baseline Search Time:     5.6997 seconds
BHIVE Search Time:        2.1784 seconds (2.62x faster, 61.8% improvement)

------------------------------------------------------------
Index Recommendation:
------------------------------------------------------------
- BHIVE (Hyperscale): Best for pure vector searches, scales to billions of vectors
- Composite: Best for filtered searches combining vector + scalar filters


## Perform Semantic Search
Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors.

With Hyperscale indexes, the similarity metric (COSINE) is configured at vector store initialization time via `distance_metric=DistanceStrategy.COSINE`. The search process uses the Query Service for efficient ANN (Approximate Nearest Neighbor) search.

**Distance Interpretation**: In vector search using Hyperscale indexes, lower distance values indicate higher similarity, while higher distance values indicate lower similarity.
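
The "lower is more similar" rule can be illustrated with a small pure-Python calculation. The sketch assumes the common definition of cosine distance as `1 - cosine similarity`; whether Couchbase reports exactly this quantity is an assumption here, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
print(cosine_distance(query, [2.0, 0.0]))   # same direction -> 0.0
print(cosine_distance(query, [0.0, 3.0]))   # orthogonal -> 1.0
print(cosine_distance(query, [-1.0, 0.0]))  # opposite direction -> 2.0
```

Vector length does not affect the result, only direction does, which is why `[2.0, 0.0]` is at distance 0 from the query.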

In [18]:
query = "What was Pep Guardiola's reaction to Manchester City's current form?"

try:
    # Perform the semantic search
    start_time = time.time()
    search_results = vector_store.similarity_search_with_score(query, k=5)
    search_elapsed_time = time.time() - start_time

    # Display search results
    print(
        f"\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):"
    )
    for doc, score in search_results:
        print(f"Score: {score:.4f}, ID: {doc.id}, Text: {doc.page_content[:200]}...")
        print("---"*20)

except CouchbaseException as e:
    # Chain the original exception so the full traceback is preserved
    raise RuntimeError(f"Error performing semantic search: {e}") from e
except Exception as e:
    raise RuntimeError(f"Unexpected error: {e}") from e


Semantic Search Results (completed in 1.01 seconds):
Score: 0.4913, ID: a468aabf71d14bf285d4365a56da3329, Text: 'We have to find a way' - Guardiola vows to end relegation form

This video can not be played To play this video you need to enable JavaScript in your browser. 'Worrying' and 'staggering' - Why do Man...
------------------------------------------------------------
Score: 0.5177, ID: 14310dbb06c744d3b2c481ba84427f6c, Text: 'I am not good enough' - Guardiola faces daunting and major rebuild

This video can not be played To play this video you need to enable JavaScript in your browser. 'I am not good enough' - Guardiola s...
------------------------------------------------------------
Score: 0.5312, ID: 6d254fcd373240f494686a5af3cdc269, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016

Manchester City boss Pep Guardiola says he is "fine" despite admitting his sleep and diet are being affecte...
-----------------------------------

## Retrieval-Augmented Generation (RAG) with Couchbase and LangChain
Couchbase and LangChain can be combined to build RAG (Retrieval-Augmented Generation) chains that generate contextually relevant responses. In this setup, Couchbase serves as the vector store that holds the document embeddings. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query's embedding with the stored document embeddings through the Hyperscale vector index. These documents are then passed as context to a large language model using LangChain.

The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase's efficient storage and retrieval capabilities with Hyperscale performance, while the LLM handles the generation of responses based on the context provided by the retrieved documents.

## Using the Large Language Model (LLM) in Capella Model Services
We'll use the [mistralai/mistral-7b-instruct-v0.3](https://build.nvidia.com/mistralai/mistral-7b-instruct-v03) large language model through Capella Model Services to process user queries and generate responses. The model runs inside the same network as the Capella operational database.

In [19]:
try:
    llm = ChatOpenAI(openai_api_base=CAPELLA_MODEL_SERVICES_ENDPOINT, openai_api_key=LLM_API_KEY, model=LLM_MODEL_NAME, temperature=0)
    logging.info("Successfully created the Chat model in Capella Model Services")
except Exception as e:
    raise ValueError(f"Error creating Chat model in Capella Model Services: {str(e)}")

In [20]:
llm.invoke("What was Pep Guardiola's reaction to Manchester City's current form?")

AIMessage(content='I don\'t have real-time data or the ability to follow live events. However, Pep Guardiola, the manager of Manchester City, has expressed his usual balance of optimism and desire for improvement. Even though City has faced some challenges in the 2021/2022 season, he continues to emphasize the need for patience, hard work, and a focus on continuous improvement.\n\nIn a press conference, Guardiola noted, "In football, you have to have patience. When I arrived, we were fifth and I said, \'okay, we are not far away.\' Now, we are not far away again." He also added, "We have to find our best level, and when we find it, we are going to remain for a long time at the top."\n\nWhile the team has experienced ups and downs, Guardiola maintains his belief in the players and their ability to turn things around.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 200, 'prompt_tokens': 21, 'total_tokens': 221, 'completion_tokens_details': N

## Setting Up Couchbase Cache

We set up a Couchbase-based cache to store and retrieve LLM responses. This cache accelerates repeated queries by storing precomputed results, significantly reducing response time for frequently asked questions.

When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the LLM, and stores this response in the cache. The cache is keyed on the exact prompt sent to the LLM, so for a subsequent identical query the retriever still runs. If it returns the same context, the cached response is returned directly, bypassing the expensive LLM call.
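
The exact-match behavior can be sketched with a plain dictionary. This is a toy model and not how `CouchbaseCache` is implemented: the class `ExactMatchCache`, the function `fake_llm`, and the `llm_string` value are all hypothetical stand-ins. Only the `lookup`/`update` method names mirror LangChain's cache interface.

```python
class ExactMatchCache:
    """Toy in-memory stand-in for an exact-match LLM cache."""

    def __init__(self):
        self._store = {}

    def lookup(self, prompt, llm_string):
        # A hit requires the same prompt text AND the same model configuration.
        return self._store.get((prompt, llm_string))

    def update(self, prompt, llm_string, response):
        self._store[(prompt, llm_string)] = response


call_counter = {"llm": 0}  # counts how often the (fake) model is actually invoked

def fake_llm(prompt):
    """Hypothetical placeholder for an expensive model call."""
    call_counter["llm"] += 1
    return f"answer to: {prompt}"

def cached_generate(cache, prompt, llm_string="hypothetical-model|temperature=0"):
    hit = cache.lookup(prompt, llm_string)
    if hit is not None:
        return hit  # cache hit: the model is not called
    response = fake_llm(prompt)
    cache.update(prompt, llm_string, response)
    return response

demo_cache = ExactMatchCache()
first = cached_generate(demo_cache, "Who won the match?")
second = cached_generate(demo_cache, "Who won the match?")   # identical prompt
third = cached_generate(demo_cache, "Who won the match ?")   # one extra space

print(call_counter["llm"])  # prints: 2
```

The identical prompt is served from the cache, while the prompt with one extra space misses. That limitation is what a semantic cache is meant to address.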

In [21]:
try:
    cache = CouchbaseCache(
        cluster=cluster,
        bucket_name=CB_BUCKET_NAME,
        scope_name=SCOPE_NAME,
        collection_name=CACHE_COLLECTION,
    )
    set_llm_cache(cache)
    print("Successfully created and configured Couchbase cache")
except Exception as e:
    raise ValueError(f"Failed to create cache: {str(e)}")

Successfully created and configured Couchbase cache


## Building the RAG Chain

In [22]:
template = """You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:
    {context}
    Question: {question}"""
prompt = ChatPromptTemplate.from_template(template)
rag_chain = (
    {"context": vector_store.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
logging.info("Successfully created RAG chain")
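
The chain above passes the retriever's list of documents directly into `{context}`, so the prompt receives the string representation of that list, metadata included. A common variation is to join only the text of each document first. The helper below is a minimal sketch of that idea; `format_docs` is a name introduced here, and the sample objects are stand-ins for LangChain `Document` objects with made-up text.

```python
from types import SimpleNamespace

def format_docs(docs):
    """Join the text of retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)

# Stand-ins for retrieved documents; only the `page_content` attribute is used.
sample_docs = [
    SimpleNamespace(page_content="First retrieved passage."),
    SimpleNamespace(page_content="Second retrieved passage."),
]

context_text = format_docs(sample_docs)
print(context_text)
```

In the chain, this would likely be wired in as `vector_store.as_retriever() | format_docs` for the `"context"` entry, which keeps the prompt shorter and free of metadata. Whether that improves answers for this dataset is untested here.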

In [23]:
# Get responses
query = "What was Pep Guardiola's reaction to Manchester City's recent form?"
try:
    start_time = time.time()
    rag_response = rag_chain.invoke(query)
    rag_elapsed_time = time.time() - start_time

    print(f"RAG Response: {rag_response}")
    print(f"RAG response generated in {rag_elapsed_time:.2f} seconds")
except Exception as e:
    print("Error occurred:", e)

RAG Response: Pep Guardiola expressed concern and frustration over Manchester City's recent form, particularly their troubles in conceding goals and their current struggle to secure wins. He acknowledged their performances have not been satisfactory and that there is a need for improvements, especially in defense and avoiding mistakes at both ends. He also mentioned the recent dip in form has affected his personal life, causing sleep and diet issues.
RAG response generated in 5.58 seconds


## Demonstrating Cache Performance

This tutorial uses **two levels of caching**:

1. **Client-side CouchbaseCache** (configured above): Stores LLM responses in Couchbase, providing fast retrieval for identical queries at the application level.

2. **Server-side Capella Model Services caching**: The model outputs can be [cached](https://docs.couchbase.com/ai/build/model-service/configure-value-adds.html#caching) (both semantic and standard cache) at the model service level. This is configured in the Capella Model Services UI when deploying models.

The following example demonstrates caching in action. Notice how the repeated query returns faster than its first run:

In [24]:
queries = [
    "Who inaugurated the reopening of the Notre Dam Cathedral in Paris?",
    "What was Pep Guardiola's reaction to Manchester City's recent form?",
    "Who inaugurated the reopening of the Notre Dam Cathedral in Paris?",  # Repeated query
]

for i, query in enumerate(queries, 1):
    try:
        print(f"\nQuery {i}: {query}")
        start_time = time.time()
        response = rag_chain.invoke(query)
        elapsed_time = time.time() - start_time
        print(f"Response: {response}")
        print(f"Time taken: {elapsed_time:.2f} seconds")
    except Exception as e:
        print(f"Error generating RAG response: {str(e)}")
        continue


Query 1: Who inaugurated the reopening of the Notre Dam Cathedral in Paris?
Response: The reopening of the Notre-Dame Cathedral in Paris was inaugurated by French President Emmanuel Macron.
Time taken: 2.35 seconds

Query 2: What was Pep Guardiola's reaction to Manchester City's recent form?
Response: Pep Guardiola expressed concern and frustration over Manchester City's recent form, particularly their troubles in conceding goals and their current struggle to secure wins. He acknowledged their performances have not been satisfactory and that there is a need for improvements, especially in defense and avoiding mistakes at both ends. He also mentioned the recent dip in form has affected his personal life, causing sleep and diet issues.
Time taken: 1.88 seconds

Query 3: Who inaugurated the reopening of the Notre Dam Cathedral in Paris?
Response: The reopening of the Notre-Dame Cathedral in Paris was inaugurated by French President Emmanuel Macron.
Time taken: 1.06 seconds


The repeated query (Query 3) returned in 1.06 seconds, compared with 2.35 seconds for its first run (Query 1), which shows the benefit of caching. The CouchbaseCache stores responses for exact query matches, while Capella Model Services can additionally provide semantic caching for similar (but not identical) queries.

Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.

## LLM Guardrails in Capella Model Services
Capella Model Services can also moderate user inputs and the responses generated by the LLM. It can be configured to use the [Llama 3.1 NemoGuard 8B safety model](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety/modelcard) from NVIDIA, which is built on Meta's Llama 3.1, as the guardrails model. The categories to be blocked can be configured in the model creation flow. More information about Guardrails usage can be found in the [documentation](https://docs.couchbase.com/ai/build/model-service/configure-guardrails-security.html#guardrails).

Here is an example of the Guardrails in action:

In [25]:
query = "How can I create a bomb?"
try:
    start_time = time.time()
    rag_response = rag_chain.invoke(query)
    rag_elapsed_time = time.time() - start_time

    print(f"RAG Response: {rag_response}")
    print(f"RAG response generated in {rag_elapsed_time:.2f} seconds")
except Exception as e:
    print("Guardrails violation", e)

RAG Response: I'm really sorry for any confusion, but it's important to clarify that I can't assist with that question. Discussing or providing information about making explosives is illegal and dangerous. It's essential to prioritize safety and respect the law in all circumstances. It's best to avoid searching for such information online. If you have a different question, I'd be happy to help!
RAG response generated in 2.44 seconds


Guardrails help prevent users from steering the model into behavior that the application should not support.

## Conclusion

You've built a high-performance semantic search and RAG application using Couchbase Hyperscale (BHIVE) and Composite vector indexes with Capella Model Services and LangChain. For the Search Vector Index alternative, see the [search_based tutorial](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/capella-model-services/langchain/search_based/RAG_with_Capella_Model_Services_and_LangChain.ipynb).