# Introduction

In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Hugging Face](https://huggingface.co/) as the AI-powered embedding model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval.

This tutorial demonstrates how to leverage Couchbase's **Global Secondary Index (GSI) vector search capabilities** with Hugging Face embeddings to create a high-performance semantic search system. GSI vector search in Couchbase offers significant advantages over traditional FTS (Full-Text Search) approaches, particularly for vector-first workloads and scenarios requiring complex filtering with high query-per-second (QPS) performance.

This guide is designed to be comprehensive yet accessible, with clear step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system. Whether you're building a recommendation engine, content discovery platform, or any application requiring intelligent document retrieval, this tutorial provides the foundation you need.

**Note**: If you want to perform semantic search using the FTS (Full-Text Search) index instead, please take a look at [this alternative approach](https://developer.couchbase.com//tutorial-huggingface-couchbase-vector-search-with-fts).

# How to run this tutorial

This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/huggingface/gsi/hugging_face.ipynb).

You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment.

# Install necessary libraries

In [1]:
!pip install --quiet langchain-couchbase==0.5.0rc1 transformers==4.56.1 sentence_transformers==5.1.0 langchain_huggingface==0.3.1 python-dotenv==1.1.1 ipywidgets

# Imports

In [2]:
from pathlib import Path
from datetime import timedelta
from transformers import pipeline, AutoModel, AutoTokenizer
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from langchain_core.globals import set_llm_cache
from langchain_couchbase.cache import CouchbaseCache
from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore
from langchain_couchbase.vectorstores import DistanceStrategy
from langchain_couchbase.vectorstores import IndexType
import getpass
import os
from dotenv import load_dotenv


# Prerequisites

To run this tutorial successfully, you will need the following:

### Couchbase Requirements

**Version Requirements:**
- **Couchbase Server 8.0+** or **Couchbase Capella** with Query Service enabled
- Note: GSI vector search is a newer feature that requires Couchbase Server 8.0 or above, unlike FTS-based vector search which works with 7.6.4+

**Access Requirements:**
- A configured Bucket, Scope, and Collection
- User credentials with **Read and Write** access to your target collection
- Network connectivity to your Couchbase cluster

### Create and Deploy Your Free Tier Operational Cluster on Capella

To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.

To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).

### Couchbase Capella Configuration

When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:

* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.
* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.

### Python Environment Requirements

- **Python 3.8+** 
- Required Python packages (installed via pip in the "Install necessary libraries" section above):
  - `langchain-couchbase==0.5.0rc1`
  - `transformers==4.56.1` 
  - `sentence_transformers==5.1.0`
  - `langchain_huggingface==0.3.1`

In [3]:
# Load environment variables
load_dotenv("./.env")

# Configuration
couchbase_cluster_url = os.getenv('CB_CLUSTER_URL') or input("Couchbase Cluster URL:")
couchbase_username = os.getenv('CB_USERNAME') or input("Couchbase Username:")
couchbase_password = os.getenv('CB_PASSWORD') or getpass.getpass("Couchbase password:")
couchbase_bucket = os.getenv('CB_BUCKET') or input("Couchbase Bucket:")
couchbase_scope = os.getenv('CB_SCOPE') or input("Couchbase Scope:")
couchbase_collection = os.getenv('CB_COLLECTION') or input("Couchbase Collection:")
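
The configuration cell above falls back to an interactive prompt whenever a variable is missing. For unattended runs it can help to report missing settings up front. The following is a minimal sketch: `missing_settings` is a hypothetical helper (not part of any library), the key names are the ones read by the cell above, and the example values are placeholders.

```python
import os

# Variable names read by the configuration cell above.
REQUIRED_KEYS = (
    "CB_CLUSTER_URL",
    "CB_USERNAME",
    "CB_PASSWORD",
    "CB_BUCKET",
    "CB_SCOPE",
    "CB_COLLECTION",
)


def missing_settings(env=None):
    """Return the required keys that are unset or empty in `env`.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Example with an explicit mapping (placeholder values) instead of os.environ.
example_env = {"CB_CLUSTER_URL": "couchbase://localhost", "CB_USERNAME": "admin"}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument checks the real environment after `load_dotenv` has run.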

# Couchbase Connection
In this section, we first create a `PasswordAuthenticator` object that holds our Couchbase credentials:

In [4]:
auth = PasswordAuthenticator(
    couchbase_username,
    couchbase_password
)

Then, we use this object to connect to the Couchbase cluster and select the bucket, scope, and collection specified above:

In [5]:
print("Connecting to cluster at URL: " + couchbase_cluster_url)
cluster = Cluster(couchbase_cluster_url, ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=5))

bucket = cluster.bucket(couchbase_bucket)
scope = bucket.scope(couchbase_scope)
collection = scope.collection(couchbase_collection)
print("Connected to the cluster")

Connecting to cluster at URL: couchbase://localhost
Connected to the cluster


# Optimizing Vector Search with Global Secondary Index (GSI)

With Couchbase 8.0+, you can leverage the power of GSI-based vector search, which offers significant performance improvements over traditional Full-Text Search (FTS) approaches for vector-first workloads. GSI vector search provides high-performance vector similarity search with advanced filtering capabilities and is designed to scale to billions of vectors.

## GSI vs FTS: Choosing the Right Approach

| Feature               | GSI Vector Search                                               | FTS Vector Search                         |
| --------------------- | --------------------------------------------------------------- | ----------------------------------------- |
| **Best For**          | Vector-first workloads, complex filtering, high QPS performance| Hybrid search and high recall rates      |
| **Couchbase Version** | 8.0.0+                                                         | 7.6.4+                                    |
| **Filtering**         | Pre-filtering with `WHERE` clauses (Composite) or post-filtering (BHIVE) | Pre-filtering with flexible ordering |
| **Scalability**       | Up to billions of vectors (BHIVE)                              | Up to 10 million vectors                  |
| **Performance**       | Optimized for concurrent operations with low memory footprint  | Good for mixed text and vector queries   |

## GSI Vector Index Types

Couchbase offers two distinct GSI vector index types, each optimized for different use cases:

### Hyperscale Vector Indexes (BHIVE)

- **Best for**: Pure vector searches like content discovery, recommendations, and semantic search
- **Use when**: You primarily perform vector-only queries without complex scalar filtering
- **Features**: 
  - High performance with low memory footprint
  - Optimized for concurrent operations
  - Designed to scale to billions of vectors
  - Supports post-scan filtering for basic metadata filtering

### Composite Vector Indexes

- **Best for**: Filtered vector searches that combine vector similarity with scalar value filtering
- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data
- **Features**: 
  - Efficient pre-filtering where scalar attributes reduce the vector comparison scope
  - Best for well-defined workloads requiring complex filtering using GSI features
  - Supports range lookups combined with vector search

## Why Choose GSI for This Tutorial?

In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries using GSI. BHIVE is ideal for semantic search scenarios where you want:

1. **High-performance vector search** across large datasets
2. **Low latency** for real-time applications
3. **Scalability** to handle growing vector collections
4. **Concurrent operations** for multi-user environments

The BHIVE index will provide optimal performance for our Hugging Face embedding-based semantic search implementation.

In [6]:
# Create the vector store backed by the Query service (the BHIVE index itself is created in a later cell)
vector_store = CouchbaseQueryVectorStore(
    cluster=cluster,
    bucket_name=couchbase_bucket,
    scope_name=couchbase_scope,
    collection_name=couchbase_collection,
    embedding=HuggingFaceEmbeddings(), # Hugging Face Initialization
    distance_metric=DistanceStrategy.COSINE
)

# Embedding Documents

Now that we have set up our vector store with Hugging Face embeddings, we can add documents to our collection. The `CouchbaseQueryVectorStore` automatically generates embeddings with the `HuggingFaceEmbeddings` model passed to it.

## Understanding the Embedding Process

When we add text documents to our vector store, several important processes happen automatically:

1. **Text Preprocessing**: The input text is preprocessed and tokenized according to the Hugging Face model's requirements
2. **Vector Generation**: Each document is converted into a high-dimensional vector (embedding) that captures its semantic meaning
3. **Storage**: The embeddings are stored in Couchbase along with the original text and any metadata
4. **Indexing**: The vectors are indexed by the BHIVE GSI index for efficient similarity search once that index has been created (later in this tutorial)
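
The first three steps above can be sketched without any external service. This is only an illustration: `toy_embed` is a hypothetical hashing stand-in, not the Hugging Face model, and the field names `embedding` and `metadata` are assumptions about the stored document layout (only `text` is confirmed by the search cell later in this tutorial).

```python
import hashlib
import math
import uuid


def toy_embed(text, dims=8):
    """Toy stand-in for an embedding model: hash each token into a bucket.

    This is NOT how the Hugging Face model works; it only yields a
    fixed-length, L2-normalized vector so the later steps can be shown.
    """
    vector = [0.0] * dims
    for token in text.lower().split():           # step 1: crude tokenization
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[digest[0] % dims] += 1.0          # step 2: accumulate into a vector
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def build_record(text, metadata=None):
    """Step 3: pair the text with its vector under a generated document ID."""
    doc_id = uuid.uuid4().hex
    record = {
        "text": text,
        "embedding": toy_embed(text),   # assumed field name
        "metadata": metadata or {},     # assumed field name
    }
    return doc_id, record


doc_id, record = build_record("Couchbase Server is a distributed database")
print(doc_id, len(record["embedding"]))
```

A real embedding model places texts with similar meaning close together; the hashing trick here does not, which is why it is only useful for showing the data flow.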

## Adding Sample Documents

In this example, we're adding sample documents that demonstrate Couchbase's capabilities. The system will:
- Generate embeddings for each text document using the Hugging Face model
- Store them in our Couchbase collection
- Make them immediately available for semantic search once the GSI index is ready

**Note**: The `batch_size` parameter controls how many documents are processed together, which can help optimize performance for large document sets.
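
To make the idea of batching concrete, here is a minimal sketch of splitting a document list into groups of at most `batch_size` items. How the vector store batches internally is not shown in this tutorial, so treat `batched` as a hypothetical helper that illustrates the general technique only.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


docs = [f"document {i}" for i in range(70)]
sizes = [len(batch) for batch in batched(docs, 32)]
print(sizes)  # [32, 32, 6]
```

With three sample texts, as in the next cell, a `batch_size` of 32 means everything is processed in a single batch.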

In [7]:
texts = [
    "Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.",
    "It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.",
    input("Enter custom embedding text:")
]
vector_store.add_texts(texts=texts, batch_size=32)

['c31ced04bcd74289acfcec58da1b5d02',
 '5c9eeae63a0f4ef39d2545fa8fc3f8e3',
 'cde9928042294055b366a93a74754ed1']

In [8]:
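# Optional: register a Couchbase-backed cache for LangChain LLM calls.
# The search steps in this tutorial do not call an LLM, so this only takes effect
# if you extend the notebook (for example, with a RAG chain).
# Note: the cache uses the same collection as the vector store here.
cache = CouchbaseCache(
    cluster=cluster,
    bucket_name=couchbase_bucket,
    scope_name=couchbase_scope,
    collection_name=couchbase_collection,
)
set_llm_cache(cache)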

# Understanding GSI Index Configuration (Couchbase 8.0 Feature)

Before creating our BHIVE index, it's important to understand the configuration parameters that optimize vector storage and search performance. The `index_description` parameter controls how Couchbase optimizes vector storage through centroids and quantization.

## Index Description Format: `'IVF[<centroids>],{PQ|SQ}<settings>'`

### Centroids (IVF - Inverted File)
- Controls how the dataset is subdivided for faster searches
- **More centroids** = faster search, slower training time
- **Fewer centroids** = slower search, faster training time
- If omitted (like `IVF,SQ8`), Couchbase auto-selects based on dataset size
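
The trade-off described above can be seen in a toy inverted-file search. This sketch assumes hand-picked centroids and Euclidean distance, and the `nprobe` parameter name is borrowed from common IVF terminology rather than from a Couchbase setting. A real index trains its centroids from the data.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector index to the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: euclidean(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: euclidean(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    best = min(candidates, key=lambda idx: euclidean(query, vectors[idx]))
    return best, len(candidates)


vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.9, 0.1], [10.0, 0.0]]
centroids = [[0.1, 0.05], [5.1, 5.0], [9.95, 0.05]]  # fixed by hand for illustration
lists = build_ivf(vectors, centroids)
best, scanned = ivf_search([5.0, 5.0], vectors, centroids, lists, nprobe=1)
print(best, scanned)  # 2 2
```

Only two of the six vectors are compared against the query. More centroids make each list smaller (faster scans) but cost more to train, which matches the bullets above.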

### Quantization Options
**Scalar Quantization (SQ):**
- `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)
- Lower memory usage, faster search, slightly reduced accuracy
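
To illustrate why fewer bits cost some accuracy, here is a minimal scalar quantization sketch. It assumes a simple per-vector min/max range, and `sq_encode` and `sq_decode` are hypothetical helpers. The quantizer inside the index is trained on the dataset and may work differently.

```python
def sq_encode(vector, bits=8):
    """Map each float to an integer code in [0, 2**bits - 1]."""
    lo, hi = min(vector), max(vector)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vector]
    return codes, lo, scale


def sq_decode(codes, lo, scale):
    """Approximate the original floats from their integer codes."""
    return [lo + c * scale for c in codes]


vector = [-0.42, 0.03, 0.37, 0.91, -0.15]
for bits in (8, 4):
    codes, lo, scale = sq_encode(vector, bits)
    restored = sq_decode(codes, lo, scale)
    max_error = max(abs(a - b) for a, b in zip(vector, restored))
    print(bits, codes, round(max_error, 4))
```

The reconstruction error is bounded by half the step size (`scale / 2`), so halving the number of bits makes the worst-case error roughly 17 times larger for this range (255 levels versus 15).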

**Product Quantization (PQ):**
- Format: `PQ<subquantizers>x<bits>` (e.g., `PQ32x8`)
- Better compression for very large datasets
- More complex but can maintain accuracy with smaller index size
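
The compression difference is easy to estimate with back-of-the-envelope arithmetic. The sketch below counts only the per-vector code size and ignores codebooks, centroid assignments, and other index overhead. The 768-dimension width is an assumption chosen for illustration.

```python
def float32_bytes(dims):
    """Uncompressed size of one vector stored as 32-bit floats."""
    return dims * 4


def sq_bytes(dims, bits):
    """Scalar quantization keeps one `bits`-wide code per dimension."""
    return (dims * bits + 7) // 8


def pq_bytes(subquantizers, bits):
    """Product quantization keeps one `bits`-wide code per subquantizer."""
    return (subquantizers * bits + 7) // 8


dims = 768  # assumed embedding width, for illustration only
print(float32_bytes(dims))  # 3072
print(sq_bytes(dims, 8))    # 768
print(sq_bytes(dims, 4))    # 384
print(pq_bytes(32, 8))      # 32
```

Under these assumptions, `PQ32x8` stores each vector in 32 bytes regardless of its dimensionality, while `SQ8` still needs one byte per dimension.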

### Common Configuration Examples
- **`IVF,SQ8`** - Auto centroids, 8-bit scalar quantization (good default)
- **`IVF1000,SQ6`** - 1000 centroids, 6-bit scalar quantization
- **`IVF,PQ32x8`** - Auto centroids, 32 subquantizers with 8 bits
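
As a reading aid for the format, the following sketch parses the three example strings above into their parts. `parse_index_description` is a hypothetical helper that accepts only the forms shown in this guide. The grammar Couchbase accepts may be broader, so consult the documentation before relying on it.

```python
import re

# Matches IVF[<centroids>],SQ<bits> or IVF[<centroids>],PQ<subquantizers>x<bits>.
_PATTERN = re.compile(
    r"IVF(?P<centroids>\d+)?,"
    r"(?:SQ(?P<sq_bits>4|6|8)|PQ(?P<pq_sub>\d+)x(?P<pq_bits>\d+))"
)


def parse_index_description(description):
    """Split an index description into centroid count and quantizer settings."""
    match = _PATTERN.fullmatch(description)
    if match is None:
        raise ValueError(f"unrecognized index description: {description!r}")
    parts = match.groupdict()
    # None means the centroid count is left for Couchbase to choose.
    centroids = int(parts["centroids"]) if parts["centroids"] else None
    if parts["sq_bits"]:
        quantizer = {"type": "SQ", "bits": int(parts["sq_bits"])}
    else:
        quantizer = {
            "type": "PQ",
            "subquantizers": int(parts["pq_sub"]),
            "bits": int(parts["pq_bits"]),
        }
    return {"centroids": centroids, "quantizer": quantizer}


for text in ("IVF,SQ8", "IVF1000,SQ6", "IVF,PQ32x8"):
    print(text, parse_index_description(text))
```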

For detailed configuration options, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/server/current/vector-index/hyperscale-vector-index.html#algo_settings).

## Our Configuration Choice

In this tutorial, we use `IVF,SQ8` together with the COSINE distance metric. This combination provides:
- **Auto-selected centroids** optimized for our dataset size
- **8-bit scalar quantization** for a good balance of speed, memory usage, and accuracy
- **COSINE distance** (set separately through the `distance_metric` parameter), which suits semantic similarity search
- **A reasonable default** for most semantic search use cases


In [9]:
# Create BHIVE index
vector_store.create_index(
    index_type=IndexType.BHIVE,
    index_description="IVF,SQ8",
    distance_metric=DistanceStrategy.COSINE,
    index_name="huggingface_bhive_index",
)

# Performing Semantic Search with GSI Vector Index

Now that we have created our BHIVE GSI vector index and added documents to our collection, we can perform powerful semantic search queries. The `similarity_search_with_score` method allows us to find documents that are semantically similar to our query text.

## How GSI Vector Search Works

When you perform a search query with GSI vector search:

1. **Query Embedding**: Your search text is converted into a vector embedding using the same Hugging Face model
2. **Vector Similarity Calculation**: The GSI index locates the stored document vectors closest to your query vector without having to compare against every document
3. **Distance Computation**: Using the COSINE distance metric, the system calculates similarity scores
4. **Result Ranking**: Documents are ranked by their similarity scores and returned with their relevance scores
5. **Post-processing**: Results include both the document content and metadata for further processing
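
Steps 2 to 4 can be sketched as an exhaustive scan in plain Python. This assumes cosine distance is computed as one minus cosine similarity and uses made-up three-dimensional vectors. The GSI index returns approximate results without scanning every vector, so this shows the ranking logic only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, stored, k=1):
    """Rank stored vectors by ascending distance and keep the first k."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up vectors keyed by hypothetical document IDs.
stored = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.6, 0.8, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}
print(top_k([0.9, 0.1, 0.0], stored, k=2))
```

The query points almost in the same direction as `doc-a`, so `doc-a` comes first with a distance close to zero, followed by `doc-b`.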

## Understanding Search Results

The search results include:
- **Document Content**: The original text that was embedded
- **Similarity Score**: Lower scores indicate higher similarity (distance-based metric)
- **Document ID**: Unique identifier for the document in Couchbase
- **Metadata**: Any additional information stored with the document

**Note**: In GSI vector search, the score represents the vector distance between the query and document embeddings. Lower distance values indicate higher similarity, which is the opposite of some other similarity systems.
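
Because lower scores mean closer matches, any relevance cutoff has to be an upper bound on the score. The sketch below applies such a cutoff to a list shaped like the `(document, score)` pairs described above. The strings and scores are made-up stand-ins, and `within_distance` is a hypothetical helper.

```python
def within_distance(results, max_distance):
    """Keep (item, distance) pairs whose distance does not exceed the threshold."""
    return [(item, dist) for item, dist in results if dist <= max_distance]


# Stand-in results shaped like (document, score) pairs; the values are made up.
results = [("doc-a", 0.21), ("doc-b", 0.54), ("doc-c", 0.88)]
print(within_distance(results, 0.6))

# The best match is the pair with the smallest score, not the largest.
best_item, best_score = min(results, key=lambda pair: pair[1])
print(best_item, best_score)
```

A suitable threshold depends on the embedding model and the data, so it is best chosen by inspecting scores for known good and bad matches.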

In [10]:
def search_similar(text):
    print("Vector similarity search for phrase: \"" + text + "\"")
    results = vector_store.similarity_search_with_score(text, k=1)
    print(results)
    for doc, score in results:
        print("Found answer: " + doc.id + "; score: " + str(score))
        # Fetch the stored document by ID to read its original text
        stored_doc = collection.get(doc.id)
        print("Answer text: " + stored_doc.value["text"])
        
search_similar("name a multipurpose database with distributed capability")
print("------")
search_similar(input("Enter custom search phrase:"))

Vector similarity search for phrase: "name a multipurpose database with distributed capability"
[(Document(id='c31ced04bcd74289acfcec58da1b5d02', metadata={}, page_content='Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.'), 0.5401588405489548)]
Found answer: c31ced04bcd74289acfcec58da1b5d02; score: 0.5401588405489548
Answer text: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.
------
Vector similarity search for phrase: "What was the data inside the sample text?"
[(Document(id='cde9928042294055b366a93a74754ed1', metadata={}, page_content='This is a sample text with the data "Qwerty"'), 0.5143860972617782)]
Found answer: cde9928042294055b366a93a7475

# Conclusion

You have successfully built a powerful semantic search engine using Couchbase's GSI vector search capabilities and Hugging Face embeddings. This guide has walked you through the complete process of creating a high-performance vector search system that can scale to handle billions of documents.

## Next Steps and Extensions

This foundation can be extended for more advanced use cases:

1. **Add Metadata Filtering**: Implement complex filtering using Composite GSI indexes
2. **Scale to Production**: Deploy with proper resource allocation and monitoring
3. **Implement RAG Systems**: Build Retrieval-Augmented Generation applications
4. **Add Real-time Updates**: Implement streaming document updates
5. **Optimize for Specific Domains**: Fine-tune embeddings for your specific use case

## Further Resources

- [Couchbase Vector Search Documentation](https://docs.couchbase.com/cloud/vector-search/vector-search.html)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [LangChain Couchbase Integration](https://python.langchain.com/docs/integrations/vectorstores/couchbase)
- [GSI Vector Index Configuration Guide](https://docs.couchbase.com/server/current/vector-index/hyperscale-vector-index.html)