# Introduction
In this guide, we will walk you through building a Retrieval Augmented Generation (RAG) application using Couchbase Capella as the database and the [Llama 3.1 8B Instruct](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/) model as the large language model, served by Couchbase Capella AI Services. We will use the [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) model to generate embeddings, also via Capella AI Services. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using Capella AI Services and [LangChain](https://langchain.com/).

# How to run this tutorial

This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/capella-ai/RAG_with_Couchbase_Capella.ipynb).

You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment.

# Before you start

## Create and Deploy Your Operational cluster on Capella

To get started with Couchbase Capella, create an account and use it to deploy an operational cluster.

To learn more, follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).


### Couchbase Capella Configuration

When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:

* Have a multi-node Capella cluster running the Data, Query, Index, and Search services.
* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the travel-sample bucket (Read and Write) used in the application.
* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.

### Deploy Models

In order to create the RAG application, we need an embedding model to ingest the documents for Vector Search and a large language model (LLM) for generating the responses based on the context. 

Capella Model Service allows you to create both the embedding model and the LLM in the same VPC as your database. Currently, the service offers the Llama 3.1 Instruct model with 8 billion parameters as the LLM and the e5-mistral-7b-instruct model for embeddings.

Create the models using the Capella AI Services interface. While creating the model, it is possible to cache the responses (both standard and semantic cache) and apply guardrails to the LLM responses.

For more details, please refer to the [documentation](https://preview2.docs-test.couchbase.com/ai/get-started/about-ai-services.html#model). These models are compatible with the [LangChain OpenAI integration](https://python.langchain.com/api_reference/openai/index.html).


# Installing Necessary Libraries
To build our RAG system, we need a set of libraries. The libraries we install handle everything from connecting to databases to performing AI tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and we will use the OpenAI SDK for generating embeddings and calling the LLM in Capella AI Services. By setting up these libraries, we ensure our environment is equipped to handle the tasks required for RAG.

In [None]:
!pip install --quiet datasets==3.6.0 langchain-couchbase==0.3.0 langchain-openai==0.3.17

# Importing Necessary Libraries
The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models.

In [None]:
import getpass
import json
import logging
import sys
import time

from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.exceptions import CouchbaseException
from couchbase.management.search import SearchIndex
from couchbase.options import ClusterOptions

from datasets import load_dataset

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from tqdm import tqdm
import base64

# Loading Sensitive Information
In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like database credentials and collection names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.

The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code.

CAPELLA_AI_ENDPOINT is the Capella AI Services endpoint found in the models section.

> Note that the Capella AI endpoint must end with `/v1`. If the endpoint shown on the UI does not include it, append `/v1` yourself.

INDEX_NAME is the name of the search index we will use for the vector search.
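
The `/v1` requirement noted above is easy to get wrong when copying the endpoint from the UI. The following is a minimal sketch of one way to normalize the value before using it. The helper name `normalize_endpoint` and the `example.invalid` URLs are illustrative assumptions, not part of the original notebook or of any Capella SDK.

```python
def normalize_endpoint(endpoint: str) -> str:
    """Return the endpoint with exactly one trailing '/v1' path segment."""
    # Drop surrounding whitespace and any trailing slashes first.
    cleaned = endpoint.strip().rstrip("/")
    # Append '/v1' only when it is not already the last path segment.
    if not cleaned.endswith("/v1"):
        cleaned += "/v1"
    return cleaned


# Placeholder URLs, used only to exercise the helper.
print(normalize_endpoint("https://example.invalid/"))
print(normalize_endpoint("https://example.invalid/v1/"))
```

Both calls print the same normalized URL, so the helper is safe to apply whether or not the copied value already carries the suffix.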

In [4]:
CB_CONNECTION_STRING = getpass.getpass("Enter your Couchbase Connection String: ")
CB_USERNAME = input("Enter your Couchbase Database username: ")
CB_PASSWORD = getpass.getpass("Enter your Couchbase Database password: ")
CB_BUCKET_NAME = input("Enter your Couchbase bucket name: ")
SCOPE_NAME = input("Enter your scope name: ")
COLLECTION_NAME = input("Enter your collection name: ")
INDEX_NAME = input("Enter your Search index name: ")
CAPELLA_AI_ENDPOINT = getpass.getpass("Enter your Capella AI Services Endpoint: ")

# Check if the variables are correctly loaded
if not all(
    [
        CB_CONNECTION_STRING,
        CB_USERNAME,
        CB_PASSWORD,
        CB_BUCKET_NAME,
        CAPELLA_AI_ENDPOINT,
        SCOPE_NAME,
        COLLECTION_NAME,
        INDEX_NAME,
    ]
):
    raise ValueError("Missing required configuration values")
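
The check above reports that something is missing but not which value. As a possible refinement, the sketch below collects the names of the empty settings so that the error message can list them. The helper `find_missing` and the sample values are hypothetical and only illustrate the idea.

```python
def find_missing(settings: dict) -> list:
    """Return the names of settings that are empty or whitespace-only."""
    return [
        name
        for name, value in settings.items()
        if not str(value or "").strip()
    ]


# Sample values for illustration; real values come from the prompts above.
example_settings = {
    "CB_CONNECTION_STRING": "couchbases://cb.example.invalid",
    "CB_USERNAME": "",
    "INDEX_NAME": "   ",
}

missing = find_missing(example_settings)
if missing:
    print(f"Missing required settings: {', '.join(missing)}")
```

Treating whitespace-only input as missing catches the case where a prompt is answered with an accidental space.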

# Generating Credentials for Capella Model Service
In Capella AI Services, the models are accessed using basic authentication with the linked Capella cluster credentials. We generate the credentials using the following code snippet:

In [5]:
key_string = f"{CB_USERNAME}:{CB_PASSWORD}"
# b64encode returns bytes; decode so the key is passed to the SDK as a string
CAPELLA_AI_KEY = base64.b64encode(key_string.encode("utf-8")).decode("utf-8")
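
To make the encoding step easier to inspect, here is a small standalone sketch of building a basic-authentication key as text. `base64.b64encode` returns `bytes`, while HTTP client libraries generally expect the key as a string, so the sketch decodes the result. The function name `make_basic_auth_key` and the `demo_user`/`demo_pass` credentials are placeholders, not values from the tutorial.

```python
import base64


def make_basic_auth_key(username: str, password: str) -> str:
    """Encode 'username:password' as a base64 text string."""
    raw = f"{username}:{password}".encode("utf-8")
    # Decode the base64 bytes to get a plain str.
    return base64.b64encode(raw).decode("ascii")


key = make_basic_auth_key("demo_user", "demo_pass")
print(type(key).__name__)
# Decoding the key again recovers the original 'username:password' pair.
print(base64.b64decode(key).decode("utf-8"))
```

The round trip in the last line is a quick way to confirm that the key was built from the intended credentials.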

# Connecting to the Couchbase Cluster
Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our RAG system. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections.

In [None]:
try:
    auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)
    options = ClusterOptions(auth)
    cluster = Cluster(CB_CONNECTION_STRING, options)
    cluster.wait_until_ready(timedelta(seconds=5))
    print("Successfully connected to Couchbase")
except Exception as e:
    raise ConnectionError(f"Failed to connect to Couchbase: {str(e)}")

# Setting Up Collections in Couchbase
In Couchbase, data is organized in buckets, which can be further divided into scopes and collections. Think of a collection as a table in a traditional SQL database. Before we can store any data, we need to ensure that our collections exist. If they don't, we must create them. This step is important because it prepares the database to handle the specific types of data our application will process. By setting up collections, we define the structure of our data storage, which is essential for efficient data retrieval and management.

Moreover, setting up collections allows us to isolate different types of data within the same bucket, providing a more organized and scalable data structure. This is particularly useful when dealing with large datasets, as it ensures that related data is stored together, making it easier to manage and query. Here, we also set up the primary index for query operations on the collection and clear the existing documents in the collection if any. If you do not want to do that, please skip this step.

In [None]:
def setup_collection(cluster, bucket_name, scope_name, collection_name, flush_collection=False):
    try:
        bucket = cluster.bucket(bucket_name)
        bucket_manager = bucket.collections()

        # Check if scope exists, create if it doesn't
        scopes = bucket_manager.get_all_scopes()
        scope_exists = any(scope.name == scope_name for scope in scopes)
        
        if not scope_exists:
            print(f"Scope '{scope_name}' does not exist. Creating it...")
            bucket_manager.create_scope(scope_name)
            print(f"Scope '{scope_name}' created successfully.")
        else:
            print(f"Scope '{scope_name}' already exists. Skipping creation.")
        
        # Check if collection exists, create if it doesn't
        collections = bucket_manager.get_all_scopes()
        collection_exists = any(
            scope.name == scope_name
            and collection_name in [col.name for col in scope.collections]
            for scope in collections
        )

        if not collection_exists:
            print(f"Collection '{collection_name}' does not exist. Creating it...")
            bucket_manager.create_collection(scope_name, collection_name)
            print(f"Collection '{collection_name}' created successfully.")
        else:
            print(f"Collection '{collection_name}' already exists. Skipping creation.")

        collection = bucket.scope(scope_name).collection(collection_name)
        time.sleep(2)  # Give the collection time to be ready for queries

        # Ensure primary index exists
        try:
            cluster.query(
                f"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`"
            ).execute()
            print("Primary index present or created successfully.")
        except Exception as e:
            logging.warning(f"Error creating primary index: {str(e)}")

        if flush_collection:
            # Clear all documents in the collection
            try:
                query = f"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`"
                cluster.query(query).execute()
                print("All documents cleared from the collection.")
            except Exception as e:
                print(
                    f"Error while clearing documents: {str(e)}. The collection might be empty."
                )

    except Exception as e:
        raise Exception(f"Error setting up collection: {str(e)}")


setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME, flush_collection=True)

# Loading Couchbase Vector Search Index

Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase Vector Search comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.

Note that you might have to update the index parameters depending on the names of your bucket, scope and collection. The provided index assumes the bucket to be model_tutorial, scope to be rag and the collection to be data.
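
To illustrate which parts of the definition depend on your bucket, scope, and collection, the sketch below retargets a heavily trimmed, hypothetical index definition. The field name `embedding`, the `dims` value of 4096, and the similarity setting are assumptions made for illustration; a real definition file contains more keys, and the values must match your own data and embedding model.

```python
import copy

# Hypothetical, trimmed-down definition; not the tutorial's actual file.
sample_definition = {
    "name": "vector_search_index",
    "type": "fulltext-index",
    "sourceType": "gocbcore",
    "sourceName": "model_tutorial",
    "params": {
        "mapping": {
            "types": {
                "rag.data": {
                    "enabled": True,
                    "properties": {
                        "embedding": {
                            "fields": [
                                {
                                    "name": "embedding",
                                    "type": "vector",
                                    "dims": 4096,
                                    "similarity": "dot_product",
                                }
                            ]
                        }
                    },
                }
            }
        }
    },
}


def retarget(definition, index_name, bucket, scope, collection):
    """Return a copy of the definition that points at another keyspace."""
    updated = copy.deepcopy(definition)  # leave the input untouched
    updated["name"] = index_name
    updated["sourceName"] = bucket
    types = updated["params"]["mapping"]["types"]
    # The type mapping is keyed by "<scope>.<collection>"; rename the first key.
    old_key = next(iter(types))
    types[f"{scope}.{collection}"] = types.pop(old_key)
    return updated


retargeted = retarget(sample_definition, "my_index", "my_bucket", "my_scope", "my_coll")
print(list(retargeted["params"]["mapping"]["types"]))
```

Working on a deep copy keeps the loaded definition unchanged, which helps when the same file is reused for several collections.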

For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).

To import the index into Capella via the UI, please follow the [instructions](https://docs.couchbase.com/cloud/search/import-search-index.html) on the documentation.

If you prefer to create the index programmatically, the SDK code for doing so is included later in this tutorial.

In [17]:
# If you are running this script in Google Colab, comment out the following line
# and provide the path to your index definition file.

index_definition_path = "capella_index.json"  # Local setup: specify your file path here

# If you are running in Google Colab, use the following code to upload the index definition file
# from google.colab import files
# print("Upload your index definition file")
# uploaded = files.upload()
# index_definition_path = list(uploaded.keys())[0]

try:
    with open(index_definition_path, "r") as file:
        index_definition = json.load(file)

        # Update search index definition with user inputs
        index_definition['name'] = INDEX_NAME
        index_definition['sourceName'] = CB_BUCKET_NAME
        # Update types mapping
        old_type_key = next(iter(index_definition['params']['mapping']['types'].keys()))
        type_obj = index_definition['params']['mapping']['types'].pop(old_type_key)
        index_definition['params']['mapping']['types'][f"{SCOPE_NAME}.{COLLECTION_NAME}"] = type_obj
        
except Exception as e:
    raise ValueError(
        f"Error loading index definition from {index_definition_path}: {str(e)}"
    )

# Creating or Updating Search Indexes

With the index definition loaded, the next step is to create or update the Vector Search Index in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our RAG to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust RAG system.

In [None]:
# Create the Vector Index via SDK
try:
    scope_index_manager = (
        cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()
    )

    # Check if index already exists
    existing_indexes = scope_index_manager.get_all_indexes()
    index_name = index_definition["name"]

    if index_name in [index.name for index in existing_indexes]:
        print(f"Index '{index_name}' found")
    else:
        print(f"Creating new index '{index_name}'...")

    # Create SearchIndex object from JSON definition
    search_index = SearchIndex.from_json(index_definition)

    # Upsert the index (create if not exists, update if exists)
    scope_index_manager.upsert_index(search_index)
    print(f"Index '{index_name}' successfully created/updated.")

except Exception as e:
    logging.error(f"Error creating or updating index: {e}")

# Load the BBC News Dataset
To build a RAG engine, we need data to search through. We use the [BBC Realtime News dataset](https://huggingface.co/datasets/RealTimeData/bbc_news_alltime), a dataset with up-to-date BBC news articles grouped by month. This dataset contains articles that were created after the LLM was trained. It will showcase the use of RAG to augment the LLM. 

The BBC News dataset's varied content allows us to simulate real-world scenarios where users ask complex questions, enabling us to fine-tune our RAG's ability to understand and respond to various types of queries.

In [None]:
try:
    news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split="train")
    print(f"Loaded the BBC News dataset with {len(news_dataset)} rows")
except Exception as e:
    raise ValueError(f"Error loading BBC News dataset: {str(e)}")

## Preview the Data

In [None]:
print(news_dataset[:5])

## Cleaning up the Data

We will use the content of the news articles for our RAG system. 

The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system.

In [None]:
news_articles = news_dataset["content"]
unique_articles = set()
for article in news_articles:
    if article:
        unique_articles.add(article)
unique_news_articles = list(unique_articles)
print(f"We have {len(unique_news_articles)} unique articles in our database.")
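
A `set` removes duplicates but does not keep the articles in their original order, so the resulting list can differ between runs. If a stable order matters, for example when comparing ingestion runs, one alternative is sketched below. The helper name `dedupe_preserving_order` and the sample list are illustrative only.

```python
def dedupe_preserving_order(texts):
    """Drop empty entries and duplicates, keeping the first occurrence."""
    # dict keys keep insertion order (Python 3.7+), so first occurrences win.
    return list(dict.fromkeys(text for text in texts if text))


sample = ["b", "a", "", "b", None, "c", "a"]
print(dedupe_preserving_order(sample))  # ['b', 'a', 'c']
```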

# Creating Embeddings using Capella AI Service
Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using Capella AI service, we equip our RAG system with the ability to understand and process natural language in a way that is much closer to how humans understand language. This step transforms our raw text data into a format that the Capella vector store can use to find and rank relevant documents.

We use the `OpenAIEmbeddings` class from the [LangChain OpenAI provider](https://python.langchain.com/docs/integrations/providers/openai/) with a few extra parameters specific to Capella AI Services: `tiktoken_enabled=False` disables client-side tokenization, and `check_embedding_ctx_length=False` disables LangChain's handling of inputs longer than the context length. We also point the SDK at Capella AI Services by supplying the model name, the API key, and the endpoint URL.

In [None]:
try:
    embeddings = OpenAIEmbeddings(
        openai_api_key=CAPELLA_AI_KEY,
        openai_api_base=CAPELLA_AI_ENDPOINT,
        check_embedding_ctx_length=False,
        tiktoken_enabled=False,
        model="intfloat/e5-mistral-7b-instruct",
    )
    print("Successfully created CapellaAIEmbeddings")
except Exception as e:
    raise ValueError(f"Error creating CapellaAIEmbeddings: {str(e)}")

# Testing the Embeddings Model
We can test the embeddings model by generating an embedding for a string using the LangChain OpenAI package. The cell prints the length of the returned vector, which is the embedding dimension.

In [None]:
print(len(embeddings.embed_query("this is a test sentence")))
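
The length printed above should match the vector dimension configured in the search index. As a rough sketch of how that could be checked, the helper below walks a definition and collects the `dims` value of every vector field. The trimmed definition, the field name `embedding`, and the 4096 value are assumptions for illustration; substitute the definition you loaded and a real embedding.

```python
def find_vector_dims(node):
    """Recursively collect `dims` values from vector fields in a definition."""
    found = []
    if isinstance(node, dict):
        if node.get("type") == "vector" and "dims" in node:
            found.append(node["dims"])
        for value in node.values():
            found.extend(find_vector_dims(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_vector_dims(item))
    return found


# Hypothetical, trimmed definition and a placeholder embedding vector.
trimmed_definition = {
    "params": {"mapping": {"types": {"rag.data": {"properties": {"embedding": {
        "fields": [{"name": "embedding", "type": "vector", "dims": 4096}]
    }}}}}}
}
sample_vector = [0.0] * 4096

dims = find_vector_dims(trimmed_definition)
print(dims, all(d == len(sample_vector) for d in dims))
```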


# Setting Up the Couchbase Vector Store
The vector store holds the documents from the dataset. A vector store is essentially a database optimized for storing and retrieving high-dimensional vectors. Here, it is backed by Couchbase through the [LangChain integration](https://python.langchain.com/docs/integrations/providers/couchbase/).

In [None]:
try:
    vector_store = CouchbaseSearchVectorStore(
        cluster=cluster,
        bucket_name=CB_BUCKET_NAME,
        scope_name=SCOPE_NAME,
        collection_name=COLLECTION_NAME,
        embedding=embeddings,
        index_name=INDEX_NAME,
    )
    print("Successfully created vector store")
except Exception as e:
    raise ValueError(f"Failed to create vector store: {str(e)}")

# Saving Data to the Vector Store
With the vector store set up, the next step is to populate it with data. We save the BBC articles to the vector store. For each document, LangChain generates the embedding of the article for use in semantic search.

One of the articles exceeds the maximum number of tokens that the embedding model accepts. To ingest it, we could split the document and ingest it in parts. However, since it is only a single document, we skip it during ingestion for simplicity.

In [None]:
from langchain_core.documents import Document
from uuid import uuid4

for article in tqdm(unique_news_articles, desc="Ingesting articles"):
    try:
        documents = [Document(page_content=article)]
        uuids = [str(uuid4()) for _ in range(len(documents))]
        # Pass the generated IDs so they are used as the document keys
        vector_store.add_documents(documents=documents, ids=uuids)
    except Exception as e:
        print(f"Failed to save documents to vector store: {str(e)}")
        continue
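
For the oversized article mentioned above, splitting is the alternative to skipping. The sketch below shows a simple character-based splitter with overlap. It is an illustration only: character counts are a rough proxy for the model's token limit, and the function name `split_text` and its default sizes are assumptions rather than values from the tutorial.

```python
def split_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that sentences cut at
    a boundary still appear intact in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk could then be wrapped in its own `Document` and ingested like any other article.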

# Using the Large Language Model (LLM) in Capella AI
Large language models are AI systems that are trained to understand and generate human language. We'll be using the `Llama3.1-8B-Instruct` large language model via Capella AI Services, inside the same network as the Capella operational database, to process user queries and generate meaningful responses. This model is a key component of our RAG system, allowing it to go beyond simple keyword matching and understand the intent behind a query. By creating this language model, we equip our RAG system with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.

The language model's ability to understand context and generate coherent responses is what makes our RAG system truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.

The LLM is also created through the LangChain OpenAI provider, using the model name, endpoint URL, and API key from Capella AI Services.

In [14]:
try:
    llm = ChatOpenAI(openai_api_base=CAPELLA_AI_ENDPOINT, openai_api_key=CAPELLA_AI_KEY, model="meta-llama/Llama-3.1-8B-Instruct", temperature=0)
    logging.info("Successfully created the Chat model in Capella AI Services")
except Exception as e:
    raise ValueError(f"Error creating Chat model in Capella AI Services: {str(e)}")

# Perform Semantic Search
Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.
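
To make the difference between the metrics named above concrete, here is a small worked sketch in plain Python with two-dimensional toy vectors. The vectors and function names are illustrative; real embeddings have thousands of dimensions, and the comparison itself is performed inside Couchbase rather than in client code.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
doc_same_direction = [2.0, 0.0]  # same direction, larger magnitude
doc_orthogonal = [0.0, 1.0]      # unrelated direction

for name, doc in [("same direction", doc_same_direction), ("orthogonal", doc_orthogonal)]:
    print(
        name,
        f"cosine={cosine_similarity(query, doc):.3f}",
        f"dot={dot(query, doc):.3f}",
        f"euclidean={euclidean_distance(query, doc):.3f}",
    )
```

Cosine similarity ignores vector length, so the first document scores 1.0 even though its Euclidean distance from the query is not zero. This is why the metric configured in the index should match how the embedding model is meant to be compared.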

In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseSearchVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a similarity score that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison.

In [None]:
query = "What was Pep Guardiola's reaction to Manchester City's current form?"

try:
    # Perform the semantic search
    start_time = time.time()
    search_results = vector_store.similarity_search_with_score(query, k=5)
    search_elapsed_time = time.time() - start_time

    # Display search results
    print(
        f"\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):"
    )
    for doc, score in search_results:
        print(f"Score: {score:.4f}, ID: {doc.id}, Text: {doc.page_content}")
        print("---"*20)

except CouchbaseException as e:
    raise RuntimeError(f"Error performing semantic search: {str(e)}")
except Exception as e:
    raise RuntimeError(f"Unexpected error: {str(e)}")

# Retrieval-Augmented Generation (RAG) with Couchbase and LangChain
Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a large language model using LangChain.

The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while the LLM handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation.

In [16]:
template = """You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:
    {context}
    Question: {question}"""
prompt = ChatPromptTemplate.from_template(template)
rag_chain = (
    {"context": vector_store.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
logging.info("Successfully created RAG chain")
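
In the chain above, the retriever returns a list of documents that is placed into the `{context}` slot of the prompt. A common refinement is to join only the text of the documents before filling the template. The sketch below shows that formatting step in isolation. `FakeDocument` is a stand-in for LangChain's `Document` class and the shortened template is hypothetical, so the example runs without any services.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document; only page_content is modelled."""
    page_content: str


def format_docs(docs):
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Shortened, hypothetical template with the same two placeholders.
TEMPLATE = "Answer using the context below:\n{context}\nQuestion: {question}"

docs = [FakeDocument("First article."), FakeDocument("Second article.")]
prompt_text = TEMPLATE.format(context=format_docs(docs), question="What happened?")
print(prompt_text)
```

In a LangChain chain, a function like `format_docs` can be piped after the retriever, for example `vector_store.as_retriever() | format_docs`, so that the model sees only the article text rather than the string form of the document objects.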

In [None]:
# Get responses
query = "What was Pep Guardiola's reaction to Manchester City's recent form?"
try:
    start_time = time.time()
    rag_response = rag_chain.invoke(query)
    rag_elapsed_time = time.time() - start_time

    print(f"RAG Response: {rag_response}")
    print(f"RAG response generated in {rag_elapsed_time:.2f} seconds")
except Exception as e:
    print(f"Error generating RAG response: {e}")

# Using Caching mechanism in Capella AI Services
In Capella AI services, the model outputs can be [cached](https://preview.docs-test.couchbase.com/ai-services-concepts/ai/get-started/about-ai-services.html#llm-caching) (both semantic and standard cache). The caching mechanism enhances the RAG's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the LLM generates a response and then stores this response in Couchbase. When similar queries come in later, the cached responses are returned. The caching duration can be configured in the Capella AI services.

In this example, we are using the standard cache which works for exact matches of the queries.

In [None]:
queries = [
    "Who inaugurated the reopening of the Notre Dame Cathedral in Paris?",
    "What was Pep Guardiola's reaction to Manchester City's recent form?",
    "Who inaugurated the reopening of the Notre Dame Cathedral in Paris?",  # Repeated query
]

for i, query in enumerate(queries, 1):
    try:
        print(f"\nQuery {i}: {query}")
        start_time = time.time()
        response = rag_chain.invoke(query)
        elapsed_time = time.time() - start_time
        print(f"Response: {response}")
        print(f"Time taken: {elapsed_time:.2f} seconds")
    except Exception as e:
        print(f"Error generating RAG response: {str(e)}")
        continue

Here you can see that the repeated query was answered significantly faster than the first time it was asked. In Capella AI Services, semantic similarity can also be used to find responses in the cache.

Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.
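
The idea behind a standard (exact-match) cache with a configurable duration can be sketched in a few lines. This is a conceptual illustration only, not how Capella AI Services implements its cache. The class `ExactMatchCache`, the helper `answer`, and the `fake_generate` function standing in for the LLM call are all hypothetical.

```python
import time


class ExactMatchCache:
    """Cache responses by exact query string, expiring entries after a TTL."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # query -> (stored_at, response)

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[query]  # entry is too old; drop it
            return None
        return response

    def put(self, query, response):
        self._entries[query] = (self._clock(), response)


def answer(query, cache, generate):
    """Return (response, served_from_cache)."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    response = generate(query)
    cache.put(query, response)
    return response, False


calls = []


def fake_generate(query):
    """Stand-in for the expensive LLM call; records each invocation."""
    calls.append(query)
    return query.upper()


cache = ExactMatchCache(ttl_seconds=60.0)
print(answer("hello", cache, fake_generate))  # first call: generated
print(answer("hello", cache, fake_generate))  # identical query: cache hit
print(answer("Hello", cache, fake_generate))  # different string: generated
```

The third call misses because an exact-match cache compares strings literally. A semantic cache would instead compare embeddings, so a differently worded but similar query could still be served from the cache.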

# LLM Guardrails in Capella AI Services
Capella AI Services can also moderate the user inputs and the responses generated by the LLM. Capella AI Services can be configured to use the [LlamaGuard3-8B](https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard3/8B/MODEL_CARD.md) guardrails model from Meta. The categories to be blocked can be configured in the model creation flow.

Here is an example of the guardrails in action:

In [None]:
query = "How can I create a bomb?"
try:
    start_time = time.time()
    rag_response = rag_chain.invoke(query)
    rag_elapsed_time = time.time() - start_time

    print(f"RAG Response: {rag_response}")
    print(f"RAG response generated in {rag_elapsed_time:.2f} seconds")
except Exception as e:
    print("Guardrails violation", e)

Guardrails can be quite useful in preventing users from hijacking the model into doing things that you might not want the application to do.

By following this tutorial, you will have a fully functional semantic search engine that leverages the strengths of Capella AI Services without the data being sent to third-party embedding or large language models. This guide explains the principles behind semantic search and how to implement it effectively using Capella AI Services. 