# Couchbase Capella AI Services Auto-Vectorization Tutorial

This comprehensive tutorial demonstrates how to use Couchbase Capella's new AI Services auto-vectorization feature to automatically convert your data into vector embeddings and perform semantic search using LangChain.

---

## 📚 Table of Contents

1. [Capella Account Setup](#1-capella-account-setup)
2. [Data Upload and Preparation](#2-data-upload-and-preparation)
3. [Deploying the Model](#3-deploying-the-model)
4. [Deploying AutoVectorization Workflow](#4-deploying-autovectorization-workflow)
5. [LangChain Vector Search](#5-langchain-vector-search)


# 1. Capella Account Setup

Before you can use AI Services auto-vectorization, you need to set up a Couchbase Capella account and create a cluster.

## Step 1.1: Sign Up for Couchbase Capella

1. **Visit Capella**: Go to [https://cloud.couchbase.com](https://cloud.couchbase.com)
   
   <img src="./img/login_.png" width=500pt height=1000pt>

2. **Sign In**: Click "Sign in", or create a free account by clicking "Try free". You can also sign in with Google, GitHub, or your organization's SSO.


## Step 1.2: Create a New Cluster

1. **Access Dashboard**: After logging in, you'll see the Capella dashboard.
2. **Create Cluster**: Click "Create Cluster".
   
   <img src="./img/create_cluster.png" width=900 style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
   
3. **Choose Configuration**:
   - **Cluster Configuration**: 
     - For development: Single node cluster
     - For production: Multi-node with replicas
       
     <img src="./img/node_select_cluster_opt.png" width=900 style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
     
   - **Cloud Provider**: AWS, Azure, or GCP (AWS recommended for this tutorial)
     
     <img src="./img/cluster_cloud_config.png" width=900 style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
  
   - **Cluster Configuration**: Select the number of nodes and their configuration. Make sure the <B>Search</B> and <B>Eventing</B> services are enabled, because auto-vectorization requires them.
     
     <img src="./img/cluster_no_nodes.png" width=900 style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">


## Step 1.3: Configure Access Control 

1. **Access Control**: Navigate to the <B>access control</B> tab in <B>cluster settings</B>, as highlighted in the image below:
   
    <img src="./img/Access_control.png" width=900 style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
   
2. **Enter your details**:
   - <B>Cluster Access Name</B>: `username`
   - <B>Password</B>: Create a strong password

     
    <img src="./img/password_cluster.png" width=900 style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

   - Also set the level of authorization for these credentials: as shown in the image above, adjust the <B>bucket-level access</B> field to match your requirements.



# 2. Data Upload and Preparation

Now we'll upload sample data that will be automatically vectorized by AI Services.

## Option A: Upload Sample Dataset (Recommended)

We'll load a sample dataset that Capella provides to demonstrate the vectorization capabilities.

## Option B: Use Existing Couchbase Data

If you already have data in Couchbase (like travel-sample), you can configure vectorization for existing collections.

Let's proceed with **Option A** for this tutorial:

## 2.1: Loading the sample data provided by Capella into your cluster
<div style="display: flex; align-items: flex-start; gap: 10px;">
          <img src="./img/select_cluster.png" width="160" height="300" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
          <img src="./img/import_sd.png" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;" width="800px">
        </div>
   
   1. To load sample data into your cluster, navigate to the import section inside your cluster.
   2. Click "<B>Load Sample Data</B>".
   3. Click "<B>travel-sample</B>".
   4. Click "<B>Import</B>".
   - After the import finishes, a bucket named travel-sample will have been created inside the cluster.
   - Select the "<B>travel-sample</B>" bucket, "<B>inventory</B>" scope, and "<B>hotel</B>" collection to see the documents in this collection.
   - The documents do not contain any vector embeddings yet.
   - You can now generate the vectors with the auto-vectorization service.
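
To confirm that a freshly imported document has no embedding yet, a check along these lines can help. This is a minimal sketch: `has_embedding` is a hypothetical helper, the sample dict is invented and only stands in for a document fetched with the SDK, and `vec_addr_descr_id_mapping` is the vector field name used later in this tutorial.

```python
from typing import Any, Mapping


def has_embedding(doc: Mapping[str, Any], field: str) -> bool:
    """Return True if `doc` holds a non-empty list of numbers under `field`."""
    value = doc.get(field)
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(x, (int, float)) for x in value)
    )


# Illustrative stand-in for a hotel document (values are made up). With the SDK
# you would fetch a real one via collection.get(<document key>).content_as[dict].
hotel_doc = {
    "type": "hotel",
    "title": "Example Hotel",
    "address": "1 Example Road",
    "description": "A made-up hotel used only for this sketch.",
}

# Assumed field name: the vector field configured later in the workflow.
print(has_embedding(hotel_doc, "vec_addr_descr_id_mapping"))  # False before vectorization
```

Running the same check after the workflow has finished is a quick way to see whether a given document has been vectorized.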
     
## 2.2: Uploading data from your program

You can also upload documents programmatically using Python and the Couchbase SDK.
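
A programmatic upload can be sketched as follows. This is a minimal sketch under assumptions: `upload_documents` is a hypothetical helper, the documents and their keys are invented for illustration, and `collection` is any object exposing `upsert(key, value)`, which is the call the Couchbase Python SDK's `Collection` provides. `InMemoryCollection` exists only so the snippet runs without a cluster.

```python
from typing import Any, Dict, Iterable, List, Tuple


def upload_documents(collection: Any, docs: Iterable[Tuple[str, Dict[str, Any]]]) -> List[str]:
    """Upsert each (key, document) pair and return the keys that were written."""
    written = []
    for key, doc in docs:
        collection.upsert(key, doc)  # creates the document or replaces an existing one
        written.append(key)
    return written


class InMemoryCollection:
    """Stand-in used only to exercise the helper without a cluster."""

    def __init__(self) -> None:
        self.store: Dict[str, Dict[str, Any]] = {}

    def upsert(self, key: str, value: Dict[str, Any]) -> None:
        self.store[key] = value


# Hypothetical sample documents; replace with your own data.
sample_docs = [
    ("doc::1", {"title": "Vector search", "text": "Finding items by embedding similarity."}),
    ("doc::2", {"title": "Eventing", "text": "Running functions in response to data changes."}),
]

fake = InMemoryCollection()
keys = upload_documents(fake, sample_docs)
print(keys)

# With a connected SDK `cluster` object, the collection would come from the SDK, e.g.:
# collection = cluster.bucket("my-bucket").scope("my-scope").collection("my-collection")
```

Because `upsert` overwrites existing keys, rerunning the upload with the same keys replaces documents rather than duplicating them.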


# 3. Deploying the Model
Before creating embeddings for the documents, you need to deploy a model that will generate them.
## 3.1: Selecting the model 
1. To select the model, navigate to the "<B>AI Services</B>" tab, select "<B>Models</B>", and click "<B>Deploy New Model</B>".
   
   <img src="./img/importing_model.png" width="950px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

2. Enter the <B>model name</B> and choose the model that you want to deploy. After selecting your model, choose the <B>model infrastructure</B> and <B>region</B> where the model will be deployed.
   
   <img src="./img/deploying_model.png" width="800px" height="800px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

## 3.2: Access control to the model

1. After deploying the model, go to the "<B>Models</B>" tab in <B>AI Services</B> and click "<B>setup access</B>".

    <img src="./img/model_setup_access.png" width="1100px" height="400px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

2. Enter your <B>api_key_name</B>, <B>expiration time</B>, and the <B>IP address</B> from which you will access the model.

    <img src="./img/model_api_key_form.png" width="1100px" height="600px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

3. Download your API key.

   <img src="./img/download_api_key_details.png" width="1200px" height="800px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

# 4. Deploying AutoVectorization Workflow

1. To deploy auto-vectorization, go to the <B>AI Services</B> tab, click <B>Workflows</B>, and then click <B>Get started with RAG</B>.
   <img src="./img/Create_auto_vec.png" width="1000px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
   
2. Start your workflow deployment by giving it a name and selecting where the auto-vectorization service will read your data from. There are currently three options:

   - <B>Pre-processed data (JSON format) from Capella</B>
   - <B>Pre-processed data (JSON format) from external sources (S3 buckets)</B>
   - <B>Unstructured data from external sources (S3 buckets)</B>

   This tutorial uses the first option, pre-processed data from Capella.

   <img src="./img/start_workflow.png" width="1000px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

3. Select the <B>cluster</B>, <B>bucket</B>, <B>scope</B>, and <B>collection</B> that contain the documents you want to vectorize.

   <img src="./img/vector_data_source.png" width="1000px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

4. <B>Field Mapping</B> tells the auto-vectorization service which data to convert into embeddings.

   There are two options:

   - <B>All source fields</B> - This option converts all fields in the document into a single vector field.
   
     <img src="./img/vector_all_field_mapping.png" width="900px" height="400px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">


   - <B>Custom source fields</B> - This option converts only the fields you choose into a single vector field. In the image below, <B>address</B>, <B>description</B>, and <B>id</B> are the chosen fields, and the resulting vector field is named <B>vec_addr_descr_id_mapping</B>.
  
       <img src="./img/vector_custom_field_mapping.png" width="900px" height="400px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
  
5. After choosing the mapping type, you can either create an index on the new vector embedding field or skip index creation. Skipping is not recommended, because without a vector index you lose the ability to run vector searches.

   <img src="./img/vector_index.png" width="1200px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

6. Review and deploy your workflow configuration. Once all settings are configured correctly, click "Deploy" to start the auto-vectorization process.

   <img src="./img/vector_index_page.png" width="1200px" height="1200px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
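
Conceptually, a custom field mapping means that several fields are combined into one piece of text, and that text is embedded into a single vector. The sketch below illustrates the idea only. How the service actually joins the fields is not described in this tutorial, so the `combine_fields` helper and its `key: value` layout are assumptions, and the document values are invented.

```python
from typing import Any, Mapping, Sequence


def combine_fields(doc: Mapping[str, Any], fields: Sequence[str]) -> str:
    """Join the chosen fields into one string, skipping fields the document lacks."""
    parts = [f"{name}: {doc[name]}" for name in fields if name in doc]
    return "\n".join(parts)


# Illustrative hotel document (values invented for this sketch).
hotel_doc = {
    "id": 123,
    "title": "Example Hotel",
    "address": "1 Example Road",
    "description": "A made-up hotel near the harbour.",
}

# The same three fields chosen in the custom mapping step.
text_to_embed = combine_fields(hotel_doc, ["address", "description", "id"])
print(text_to_embed)
```

A single combined string like this is what an embedding model would turn into one vector, which is why a query about either the address or the description can match the same vector field.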
   

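# 5. LangChain Vector Search

With the workflow deployed, the vectorized documents can be queried from Python using LangChain.

```python
# Requires: pip install couchbase langchain-couchbase langchain-nvidia-ai-endpoints
from couchbase.cluster import Cluster
from couchbase.auth import PasswordAuthenticator
from couchbase.options import ClusterOptions

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore
```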

```python
endpoint = "couchbases://cb.XYZ.com"  # Replace with your cluster's connection string
username = "testing"    # Cluster access name created in Step 1.3
password = "Testing@1"  # Password created in Step 1.3
auth = PasswordAuthenticator(username, password)
# TLS certificate verification is disabled here for testing only; enable it in production.
options = ClusterOptions(auth, tls_verify="none")
# Apply the SDK profile tuned for higher-latency (WAN) development connections.
options.apply_profile("wan_development")
cluster = Cluster(endpoint, options)
```
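
Hard-coded credentials are fine for a quick test, but reading them from the environment keeps them out of the notebook. The following is a minimal sketch, assuming the hypothetical variable names `CB_CONNECTION_STRING`, `CB_USERNAME`, and `CB_PASSWORD`; `load_connection_settings` is a helper invented for this example.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environment variable names; pick whatever suits your setup.
REQUIRED_VARS = ("CB_CONNECTION_STRING", "CB_USERNAME", "CB_PASSWORD")


def load_connection_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read connection settings from the environment, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "endpoint": env["CB_CONNECTION_STRING"],
        "username": env["CB_USERNAME"],
        "password": env["CB_PASSWORD"],
    }


# Example with an explicit mapping so the sketch runs anywhere.
settings = load_connection_settings({
    "CB_CONNECTION_STRING": "couchbases://cb.XYZ.com",
    "CB_USERNAME": "testing",
    "CB_PASSWORD": "example-password",
})
print(settings["endpoint"])
```

The returned values can then be passed to `PasswordAuthenticator(settings["username"], settings["password"])` and `Cluster(settings["endpoint"], options)`.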

```python
bucket_name = "travel-sample"
scope_name = "inventory"
collection_name = "hotel"
# Name of the Search index created in step 5 of section 4; you can verify it in the cluster's Search tab.
index_name = "Vector_av_workflow_vec_addr_descr_id"
embedder = NVIDIAEmbeddings(
    model="nvidia/nv-embedqa-e5-v5",  # Model used to embed the query; it should match the model that embedded the documents
    api_key="nvapi-XYZ",              # API key used to access the model
)
```

```python
vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=bucket_name,
    scope_name=scope_name,
    collection_name=collection_name,
    embedding=embedder,
    index_name=index_name,
    text_key="address",                         # Document field returned as the result text
    embedding_key="vec_addr_descr_id_mapping",  # Field in which the workflow stores the vector (embedding)
)
```

```python
query = "USA"
results = vector_store.similarity_search(query, k=3)

# Print the top-k results
for rank, doc in enumerate(results, start=1):
    title = doc.metadata.get("title", "<no title>")
    address_text = doc.page_content
    print(f"{rank}. {title} - Address: {address_text}")
```
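
To see how close each hit is to the query, the vector store also offers `similarity_search_with_score`, which returns `(document, score)` pairs. The sketch below formats such pairs. `format_hits` is a hypothetical helper, and the `FakeDoc` objects with their invented values only stand in for LangChain `Document` results so the snippet runs without a cluster. How a score should be interpreted depends on the similarity metric configured on the index.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Tuple


@dataclass
class FakeDoc:
    """Stand-in with the two attributes used below: page_content and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_hits(hits: Iterable[Tuple[Any, float]]) -> List[str]:
    """Turn (document, score) pairs into printable lines, preserving the order given."""
    lines = []
    for rank, (doc, score) in enumerate(hits, start=1):
        title = doc.metadata.get("title", "<no title>")
        lines.append(f"{rank}. {title} (score={score:.3f}) - {doc.page_content}")
    return lines


# Invented results; with a real store you would use:
# hits = vector_store.similarity_search_with_score(query, k=3)
hits = [
    (FakeDoc("1 Example Road", {"title": "Example Hotel"}), 0.8123),
    (FakeDoc("2 Sample Street"), 0.7541),
]
for line in format_hits(hits):
    print(line)
```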