# Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services 
This tutorial demonstrates how to use the auto-vectorization feature of Couchbase Capella AI Services to import unstructured data stored in S3 buckets into Capella, convert it into vector embeddings, and perform semantic search on it using LangChain.

# 1. Create and Deploy Your Operational Cluster on Capella
To get started with Couchbase Capella, create an account and use it to deploy a cluster. To learn more, follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).
    
### Couchbase Capella Configuration
When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.
- Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket you will be using for this tutorial (e.g., `Unstructured_data_bucket`) with Read and Write permissions.
- [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.

# 2. Deploying the Model
Before creating embeddings for the documents, we need to deploy a model that will generate them.
## 2.1: Selecting the Model 
1. To select the model, you first need to navigate to the "<B>AI Services</B>" tab, then select "<B>Models</B>" and click on "<B>Deploy New Model</B>".
   
   <img src="./img/importing_model.png" width="950px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

2. Enter the <B>model name</B>, and choose the model that you want to deploy. After selecting your model, choose the <B>model infrastructure</B> and <B>region</B> where the model will be deployed.
   
   <img src="./img/deploying_model.png" width="800px" height="800px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

## 2.2: Access Control to the Model

1. After deploying the model, go to the "<B>Models</B>" tab in the <B>AI Services</B> and click on "<B>Setup Access</B>".

    <img src="./img/model_setup_access.png" width="1100px" height="400px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

2. Enter your <B>API key name</B>, <B>expiration time</B> and the <B>IP address</B> from which you will be accessing the model.

    <img src="./img/model_api_key_form.png" width="1100px" height="600px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

3. Download your API key

   <img src="./img/download_api_key_details.png" width="1200px" height="800px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

# 3. Data upload from S3 bucket to Couchbase (with chunking and vectorization)

In order to import unstructured data from the S3 bucket, you need to create a workflow that connects to your S3 bucket and chunks your unstructured data before importing it into the collections. To do so, please follow the steps mentioned below:
1) Let's start by creating a new workflow. This can be done by clicking on the <B>`AI Services`</B> tab, then click on <B>`Workflows`</B>, and then click on <B>`Create New Workflow`</B>.
   
   <img src="./img/workflow.png" width="1000px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
   
2) Start your workflow deployment by selecting where your data will be provided to the auto-vectorization service. There are currently three options: <B>`pre-processed data (JSON format) from Capella`</B>, <B>`pre-processed data (JSON format) from external sources (S3 buckets)`</B> and <B>`unstructured data from external sources (S3 buckets)`</B>. For this tutorial, we will choose the third option, unstructured data from external sources (S3 buckets). After selecting this option, enter the workflow name and click on <B>`Start Workflow`</B>.
   
   <img src="./img/start_workflow.png" width="1000px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

3) To proceed, Capella needs to connect to the S3 bucket that will be the source of the data. To set up this connection, click on <B>`+ Add New S3 Bucket`</B>.

   <img src="./img/addS3bucket.png" width="1000px" height="300px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

4) Upon clicking <B>`+ Add New S3 Bucket`</B> a new sidebar will appear that asks for the credentials of your S3 bucket.

   <img src="./img/S3credentials.png" width="1000px" height="800px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
   
   - Enter <B>`Integration Name`</B>, which will be later used to select your S3 Bucket.
   - Select the AWS Region where the bucket is deployed.
   - Enter the name of the S3 bucket deployed in AWS.
   - Enter the path where your unstructured data is located.
   - Enter your S3 bucket credentials.
   - Click on `Add Credentials`.
5) If the steps mentioned above are followed correctly then you should see a success pop-up as shown below and then the S3 bucket can be selected from the drop-down menu.

   <img src="./img/S3bucketsuccess.png" width="800px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

6) On selection of the S3 bucket, various options will be displayed as described below.

   <img src="./img/configure_data_source.png" width="900px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
- `Index Configuration` allows you to create a search index on the generated embeddings of the imported data. If you skip it, vector search will not be enabled and you will need to create the index later.
- `Destination Cluster` lets you choose the cluster, bucket, scope and collection into which the data will be imported.
- The `Estimated Cost` box (in blue, on the right) shows the cost of the operation per document.
- Click on `Next`.
  
7) <B>`Configure Data Preprocessing`</B> allows you to perform various operations on the data being imported from the S3 bucket. The options are described below.
   
   <img src="./img/data_processing.png" width="600px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
- <B>`Page Range selection`</B> allows you to select a custom page range when working with PDFs. (Optional)
- <B>`Layout Exclusions`</B> allows you to skip various unnecessary objects in your unstructured data. (Optional)
- <B>`Optical Character Recognition (OCR)`</B> allows you to detect text in images and PDFs. (Optional)
- <B>`Chunking Strategy`</B> is an important step for importing data and creating embeddings (vectors) in Capella. Its settings are described below.
    - The `Strategy` dropdown menu selects the strategy used to chunk the data in the S3 bucket. The best choice depends on the kind of data stored there.
    - `Max Token in Chunk` sets the maximum number of tokens in a chunk.
    - `Chunk Overlap` sets the number of tokens shared between consecutive chunks, which helps preserve context across chunk boundaries.
- Click `Next` once the options above are set according to your requirements.
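
To build intuition for how `Max Token in Chunk` and `Chunk Overlap` interact, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not Capella's actual chunker: the function name `chunk_tokens` is hypothetical, and whitespace-separated words stand in for model tokens, which is an assumption.

```python
def chunk_tokens(text, max_tokens, overlap):
    """Split text into chunks of at most `max_tokens` tokens.

    Consecutive chunks share `overlap` tokens. Tokens here are
    whitespace-separated words (an assumption for illustration;
    real tokenizers split text differently).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    tokens = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_tokens(sample, max_tokens=4, overlap=1):
    print(chunk)
```

With `max_tokens=4` and `overlap=1`, each chunk starts with the last word of the previous one, so a sentence that straddles a chunk boundary keeps some of its surrounding context in both chunks.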

8) Select the model which will be used to create the embeddings. There are two options to create the embeddings, `Capella-based` and `external model`.

   <img src="./img/Select_embedding_model.png" width="600px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">
   
   - For this tutorial, the Capella-based embedding model is used, as can be seen in the image above. API credentials can be uploaded using the file downloaded in `step 2.2`, or they can be entered manually.
   - You can choose between private and insecure networking.
   - Clicking `Next` takes you to the final page of the workflow.
          
9) <B>`Workflow Summary`</B> displays all the necessary details of the workflow, including `Data Source`, `Model Service`, `Unstructured Data Service` and `Billing Overview`, as shown in the image below.

   <img src="./img/workflow_summary.png" width="800px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">

10) <B>`Hurray! Workflow Deployed`</B> Now, in the `Workflows` tab, you can see the deployed workflow and check the status of its run.

       <img src="./img/workflow_deployed.png" width="950px" height="350px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;">


    After this step, your vector embeddings for the selected fields should be ready, and you can check them out in the Capella UI. In the next step, we will demonstrate how we can use the generated vectors to perform vector search.

# 4. Vector Search

The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow. 

Before you proceed, make sure the following packages are installed by running: 

In [None]:
!pip install couchbase langchain-couchbase langchain-openai

`couchbase - Version: 4.4.0` \
`langchain-couchbase - Version: 0.4.0` \
`langchain-openai - Version: 0.3.34` 
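
If you want to confirm which versions are installed in your environment, a small check like the following can help. This is a sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as passed to pip in the install cell.
for name in ("couchbase", "langchain-couchbase", "langchain-openai"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```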

Now, please proceed to execute the cells in order to run the vector similarity search.

# Importing Required Packages

In [2]:
from datetime import timedelta

from couchbase.cluster import Cluster
from couchbase.auth import PasswordAuthenticator
from couchbase.options import ClusterOptions

from langchain_openai import OpenAIEmbeddings
from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore

# Cluster Connection Setup
   - Defines the secure connection string, user credentials, and creates a `Cluster` object.

In [3]:
endpoint = "CLUSTER_CONNECTION_STRING"                                              # Replace this with Connection String
username = "YOUR_USERNAME"                                                          # Replace this with your username
password = "YOUR_PASSWORD"                                                          # Replace this with your password
auth = PasswordAuthenticator(username, password)

options = ClusterOptions(auth)
cluster = Cluster(endpoint, options)

cluster.wait_until_ready(timedelta(seconds=5))
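
Rather than pasting credentials into the notebook, you may prefer to read them from environment variables. The sketch below shows one way to do that. The variable names `CB_CONNECTION_STRING`, `CB_USERNAME` and `CB_PASSWORD` and the helper `load_connection_settings` are hypothetical choices for illustration, not names required by Capella or the SDK.

```python
import os


def load_connection_settings(env=None):
    """Read connection settings from environment variables.

    The variable names are illustrative assumptions. Pass a dict as
    `env` to test the function without touching the real environment.
    """
    env = os.environ if env is None else env
    names = ("CB_CONNECTION_STRING", "CB_USERNAME", "CB_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "endpoint": env["CB_CONNECTION_STRING"],
        "username": env["CB_USERNAME"],
        "password": env["CB_PASSWORD"],
    }


# Demonstration with placeholder values instead of real credentials.
demo_env = {
    "CB_CONNECTION_STRING": "CLUSTER_CONNECTION_STRING",
    "CB_USERNAME": "YOUR_USERNAME",
    "CB_PASSWORD": "YOUR_PASSWORD",
}
settings = load_connection_settings(demo_env)
print(settings["endpoint"])
```

The returned values can then be used in place of the `endpoint`, `username` and `password` placeholders in the cell above.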

# Selection of Buckets / Scope / Collection / Index / Embedder
   - Sets the bucket, scope, and collection where the documents (with vector fields) live.
   - `index_name` specifies the Capella Search index name.
   - `embedder` instantiates LangChain's `OpenAIEmbeddings` client, pointed at the NVIDIA embedding model deployed on Capella, which transforms the user's natural language query into a vector at search time.
       - `openai_api_key` is the API key created in `step 2.2`.
       - `openai_api_base` is the Capella Model Services endpoint found in the Models section.

`Note that the Capella AI endpoint also requires /v1 appended to it if the UI does not already show it.`
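
To avoid accidentally doubling or omitting the `/v1` suffix, you could normalize the endpoint before passing it to the embedder. This is a minimal sketch: the helper name `with_v1_suffix` is hypothetical and the URL below is a placeholder, not a real Capella endpoint.

```python
def with_v1_suffix(endpoint):
    """Return `endpoint` ending in exactly one '/v1' path segment."""
    base = endpoint.rstrip("/")  # drop any trailing slashes first
    if base.endswith("/v1"):
        return base
    return base + "/v1"


# Placeholder host for illustration only.
print(with_v1_suffix("https://example.invalid/"))
print(with_v1_suffix("https://example.invalid/v1"))
```

Both calls print the same normalized URL, so the helper is safe to apply whether or not the endpoint shown in the UI already includes `/v1`.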

In [7]:
bucket_name = "Unstructured_data_bucket"
scope_name = "_default"
collection_name = "_default"
index_name = "search_autovec_workflow_text-embedding"       # This is the name of the search index that was created in step 3.6 and can also be seen in the search tab of the cluster.
                                                                                        
# Capella Model Services is compatible with the OpenAI SDK, so LangChain's OpenAIEmbeddings class can be used to create the embeddings.
embedder = OpenAIEmbeddings(
    model="nvidia/nv-embedqa-e5-v5",                        # This is the model that will be used to create the embedding of the query.
    openai_api_key="CAPELLA_MODEL_KEY",
    openai_api_base="CAPELLA_MODEL_ENDPOINT/v1",
    check_embedding_ctx_length=False,
    tiktoken_enabled=False,                                                            
)

# VectorStore Construction
   - Creates a `CouchbaseSearchVectorStore` instance that:
     * Knows where to read documents (`bucket/scope/collection`).
     * Knows the embedding field (the vector produced by the AutoVectorization workflow).
     * Uses the provided embedder to embed queries on-demand.
   - If your AutoVectorization workflow produced a different vector field name, update `embedding_key` accordingly.
   - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields.

In [8]:
vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=bucket_name,
    scope_name=scope_name,
    collection_name=collection_name,
    embedding=embedder,
    index_name=index_name,
    text_key="text-to-embed",                   # Your document's text field
    embedding_key="text-embedding"              # This is the field in which your vector (embedding) is stored in the cluster.
)

# Performing a Similarity Search
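   - Defines a natural language query (here, "How to setup java SDK?").
   - Calls `similarity_search_with_score(query, k=3)` to retrieve the top 3 most semantically similar documents along with their similarity scores.
   - Prints the ranked results with their scores, reading each document's `page_content`, which is populated from the chosen `text_key` (here `text-to-embed`).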
   - Change `query` to any descriptive phrase (e.g., "beach resort", "airport hotel near NYC").
   - Adjust `k` for more or fewer results.

In [11]:
query = "How to setup java SDK?"
results = vector_store.similarity_search_with_score(query, k=3)

for rank, (doc, score) in enumerate(results, start=1):
    text = getattr(doc, "page_content", None)
    print(f"{rank}. — Score: {score:.4f} — Content: {text}")


1. — Score: 0.8052 — Content: Section Title: Set Up the Java SDK
Content: Run the command mvn install to pull in all the dependencies and finish your SDK setup.
2. — Score: 0.7971 — Content: Section Title: Set Up the Java SDK
Content: To set up the Java SDK: Create the following directory structure on your computer: In the student directory, create a new file called pom. xml. Paste the following code block into your pom. xm1 file: Open a terminal window and navigate to your student directory.
3. — Score: 0.7745 — Content: Section Title: Prerequisites
Content: e You have installed the Java Software Development Kit (version 8, 11, 17, or 21). o The recommended version is the latest Java LTS release. Make sure to install the highest available patch for the LTS version.


# Results and Interpretation

As the output above shows, the top `k` ranked results (here, 3) are printed.

### What Each Part Means
- Leading number (1, 2, 3): The result rank (1 = most similar to your query).
- Score: The similarity score returned by `similarity_search_with_score` for that document.
- Content text: This is the value of the field you configured as `text_key` (in this tutorial: `text-to-embed`). It represents the human-readable content we chose to display.

### How the Ranking Works
1. Your natural language query (e.g., `query = "How to setup java SDK?"`) is embedded using the NVIDIA model (`nvidia/nv-embedqa-e5-v5`).
2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = "text-embedding"`).
3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.
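
The ranking steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: it uses cosine similarity (the metric actually applied depends on how the search index is configured), three-dimensional made-up vectors instead of real embeddings, and hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, docs, k=3):
    """Return the k (name, score) pairs most similar to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]


# Made-up vectors standing in for stored document embeddings.
toy_docs = {
    "setup": [0.9, 0.1, 0.0],
    "prerequisites": [0.6, 0.4, 0.1],
    "billing": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded query

for rank, (name, score) in enumerate(rank_by_similarity(query_vec, toy_docs), start=1):
    print(f"{rank}. {name}: {score:.4f}")
```

In the real pipeline the search index performs this comparison over the stored `text-embedding` vectors, so you never compute similarities by hand. The sketch only shows why documents whose vectors point in a similar direction to the query rank first.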


> Your vector search pipeline is working if the returned documents feel meaningfully related to your natural language query—even when exact keywords do not match. Feel free to experiment with increasingly descriptive queries to observe the semantic power of the embeddings.