# Semantic Search with Couchbase GSI Vector Search and Hugging Face

## Overview

In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Hugging Face](https://huggingface.co/) as the AI-powered embedding model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval.

This tutorial demonstrates how to leverage Couchbase's **Global Secondary Index (GSI) vector search capabilities** with Hugging Face embeddings to create a high-performance semantic search system. GSI vector search in Couchbase offers significant advantages over traditional FTS (Full-Text Search) approaches, particularly for vector-first workloads and scenarios requiring complex filtering with high query-per-second (QPS) performance.

This guide is designed to be comprehensive yet accessible, with clear step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system. Whether you're building a recommendation engine, content discovery platform, or any application requiring intelligent document retrieval, this tutorial provides the foundation you need.

**Note**: If you want to perform semantic search using the FTS (Full-Text Search) index instead, please take a look at [this alternative approach](https://developer.couchbase.com//tutorial-huggingface-couchbase-vector-search-with-fts).

## How to Run This Tutorial

This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/huggingface/gsi/hugging_face.ipynb).

You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment.

## Setup and Installation

### Install Necessary Libraries

```python
!pip install --quiet langchain-couchbase==0.5.0rc1 transformers==4.56.1 sentence_transformers==5.1.0 langchain_huggingface==0.3.1 python-dotenv==1.1.1 ipywidgets
```

### Import Required Modules

```python
# Standard library
import getpass
import os
from datetime import timedelta
from pathlib import Path

# Third-party packages
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from dotenv import load_dotenv
from langchain_core.globals import set_llm_cache
from langchain_couchbase.cache import CouchbaseCache
from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore
from langchain_couchbase.vectorstores import DistanceStrategy
from langchain_couchbase.vectorstores import IndexType
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
from transformers import pipeline, AutoModel, AutoTokenizer
```

### Prerequisites

To run this tutorial successfully, you will need the following:

#### Couchbase Requirements

**Version Requirements:**
- **Couchbase Server 8.0+** or **Couchbase Capella** with Query Service enabled
- Note: GSI vector search is a newer feature that requires Couchbase Server 8.0 or above, unlike FTS-based vector search which works with 7.6+

**Access Requirements:**
- A configured Bucket, Scope, and Collection
- User credentials with **Read and Write** access to your target collection
- Network connectivity to your Couchbase cluster

#### Create and Deploy Your Free Tier Operational Cluster on Capella

To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.

To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).

#### Couchbase Capella Configuration

When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:

* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.
* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.

#### Python Environment Requirements

- **Python 3.8+** 
- Required Python packages (installed via pip in the **Install Necessary Libraries** step above):
  - `langchain-couchbase==0.5.0rc1`
  - `transformers==4.56.1` 
  - `sentence_transformers==5.1.0`
  - `langchain_huggingface==0.3.1`

```python
# Load environment variables
load_dotenv("./.env")

# Configuration: fall back to interactive prompts when a variable is not set
couchbase_cluster_url = os.getenv('CB_CLUSTER_URL') or input("Couchbase Cluster URL:")
couchbase_username = os.getenv('CB_USERNAME') or input("Couchbase Username:")
couchbase_password = os.getenv('CB_PASSWORD') or getpass.getpass("Couchbase password:")
couchbase_bucket = os.getenv('CB_BUCKET') or input("Couchbase Bucket:")
couchbase_scope = os.getenv('CB_SCOPE') or input("Couchbase Scope:")
couchbase_collection = os.getenv('CB_COLLECTION') or input("Couchbase Collection:")
```
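
Before connecting, it can help to confirm that every setting resolved to a non-empty value. The helper below is a hypothetical sketch (`missing_settings` is not part of the notebook, and the example values are made up); it only inspects a plain dictionary, so it assumes nothing about the Couchbase SDK.

```python
def missing_settings(settings):
    """Return the names of settings whose value is empty or whitespace-only."""
    return [name for name, value in settings.items() if not (value or "").strip()]


# Illustrative values only; in the notebook these would come from the
# configuration variables defined above.
example = {
    "CB_CLUSTER_URL": "couchbase://localhost",
    "CB_USERNAME": "Administrator",
    "CB_PASSWORD": "",
    "CB_BUCKET": "  ",
}

print(missing_settings(example))
```

Failing early on a missing value tends to give a clearer message than a connection error later on.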

## Couchbase Connection Setup

### Create Authentication Object

In this section, we first create a `PasswordAuthenticator` object that holds our Couchbase credentials:

```python
auth = PasswordAuthenticator(
    couchbase_username,
    couchbase_password
)
```

### Connect to Cluster

Then, we use this object to connect to the Couchbase cluster and select the bucket, scope, and collection specified above:

```python
print("Connecting to cluster at URL: " + couchbase_cluster_url)
cluster = Cluster(couchbase_cluster_url, ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=5))

bucket = cluster.bucket(couchbase_bucket)
scope = bucket.scope(couchbase_scope)
collection = scope.collection(couchbase_collection)
print("Connected to the cluster")
```

```text
Connecting to cluster at URL: couchbase://localhost
Connected to the cluster
```


## Understanding GSI Vector Search

### Optimizing Vector Search with Global Secondary Index (GSI)

With Couchbase 8.0+, you can leverage the power of GSI-based vector search, which offers significant performance improvements over traditional Full-Text Search (FTS) approaches for vector-first workloads. GSI vector search provides high-performance vector similarity search with advanced filtering capabilities and is designed to scale to billions of vectors.

#### GSI vs FTS: Choosing the Right Approach

| Feature               | GSI Vector Search                                               | FTS Vector Search                         |
| --------------------- | --------------------------------------------------------------- | ----------------------------------------- |
| **Best For**          | Vector-first workloads, complex filtering, high QPS performance| Hybrid search and high recall rates      |
| **Couchbase Version** | 8.0.0+                                                         | 7.6+                                      |
| **Filtering**         | Pre-filtering with `WHERE` clauses (Composite) or post-filtering (BHIVE) | Pre-filtering with flexible ordering |
| **Scalability**       | Up to billions of vectors (BHIVE)                              | Up to 10 million vectors                  |
| **Performance**       | Optimized for concurrent operations with low memory footprint  | Good for mixed text and vector queries   |

#### GSI Vector Index Types

Couchbase offers two distinct GSI vector index types, each optimized for different use cases:

##### Hyperscale Vector Indexes (BHIVE)

- **Best for**: Pure vector searches like content discovery, recommendations, and semantic search
- **Use when**: You primarily perform vector-only queries without complex scalar filtering
- **Features**: 
  - High performance with low memory footprint
  - Optimized for concurrent operations
  - Designed to scale to billions of vectors
  - Supports post-scan filtering for basic metadata filtering

##### Composite Vector Indexes

- **Best for**: Filtered vector searches that combine vector similarity with scalar value filtering
- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data
- **Features**: 
  - Efficient pre-filtering where scalar attributes reduce the vector comparison scope
  - Best for well-defined workloads requiring complex filtering using GSI features
  - Supports range lookups combined with vector search

#### Index Type Selection for This Tutorial

In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries using GSI. BHIVE is ideal for semantic search scenarios where you want:

1. **High-performance vector search** across large datasets
2. **Low latency** for real-time applications
3. **Scalability** to handle growing vector collections
4. **Concurrent operations** for multi-user environments

The BHIVE index will provide optimal performance for our Hugging Face embedding-based semantic search implementation.

#### Alternative: Composite Vector Index

If your use case requires complex filtering with scalar attributes, you may want to consider using a **Composite Vector Index** instead:

```python
# Alternative: Create a Composite index for filtered searches
vector_store.create_index(
    index_type=IndexType.COMPOSITE,
    index_description="IVF,SQ8",
    distance_metric=DistanceStrategy.COSINE,
    index_name="huggingface_composite_index",
)
```

**Use Composite indexes when:**
- You need to filter by document metadata or attributes before vector similarity
- Your queries combine vector search with WHERE clauses
- You have well-defined filtering requirements that can reduce the search space

**Note**: Composite indexes enable pre-filtering with scalar attributes, making them ideal for applications where you need to search within specific categories, date ranges, or user-specific data segments.
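
The difference between pre-filtering and post-filtering can be illustrated with a toy, brute-force sketch in plain Python. This is not how Couchbase implements either index type; the documents, the `category` field, and the function names are hypothetical and exist only to show why the order of filtering and ranking matters.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; lower means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Hypothetical documents with a scalar attribute and a 2-dimensional vector.
docs = [
    {"id": "a", "category": "news", "vec": [1.0, 0.0]},
    {"id": "b", "category": "blog", "vec": [0.9, 0.1]},
    {"id": "c", "category": "news", "vec": [0.0, 1.0]},
    {"id": "d", "category": "blog", "vec": [0.8, 0.2]},
]


def top_k(candidates, query, k):
    """Rank candidates by distance to the query and keep the k nearest."""
    return sorted(candidates, key=lambda d: cosine_distance(d["vec"], query))[:k]


def pre_filter_search(query, category, k):
    # Apply the scalar filter first, then rank only the surviving documents.
    subset = [d for d in docs if d["category"] == category]
    return [d["id"] for d in top_k(subset, query, k)]


def post_filter_search(query, category, k):
    # Rank everything first, then drop results that fail the scalar filter.
    nearest = top_k(docs, query, k)
    return [d["id"] for d in nearest if d["category"] == category]


query = [1.0, 0.0]
print(pre_filter_search(query, "blog", 2))
print(post_filter_search(query, "blog", 2))
```

In this toy data the post-filtered search returns fewer than `k` results, because one of the two nearest documents fails the filter. That is the kind of workload where filtering before the vector comparison is attractive.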

#### Understanding GSI Index Configuration (Couchbase 8.0 Feature)

Before creating our BHIVE index, it's important to understand the configuration parameters that optimize vector storage and search performance. The `index_description` parameter controls how Couchbase optimizes vector storage through centroids and quantization.

##### Index Description Format: `'IVF[<centroids>],{PQ|SQ}<settings>'`

###### Centroids (IVF - Inverted File)

- Controls how the dataset is subdivided for faster searches
- **More centroids** = faster search, slower training time
- **Fewer centroids** = slower search, faster training time
- If omitted (like `IVF,SQ8`), Couchbase auto-selects based on dataset size
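
To make the centroid trade-off concrete, here is a toy, pure-Python sketch of the inverted-file idea: vectors are grouped by their nearest centroid, and a query scans only the group belonging to its own nearest centroid. The centroids are fixed by hand and the helper names (`build_ivf`, `ivf_search`) are hypothetical; Couchbase's actual index training and search are not shown here.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def build_ivf(vectors, centroids):
    """Group vector ids into one bucket per centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        buckets[nearest_index(vec, centroids)].append(vec_id)
    return buckets


def ivf_search(query, vectors, centroids, buckets, k):
    """Scan only the bucket of the query's nearest centroid."""
    candidates = buckets[nearest_index(query, centroids)]
    ranked = sorted(candidates, key=lambda vid: squared_distance(query, vectors[vid]))
    return ranked[:k], len(candidates)


vectors = {
    "v1": [0.1, 0.2], "v2": [0.2, 0.1], "v3": [0.0, 0.3],
    "v4": [5.0, 5.1], "v5": [5.2, 4.9], "v6": [4.8, 5.0],
}
# Hand-picked for illustration; a real index learns centroids during training.
centroids = [[0.1, 0.2], [5.0, 5.0]]

buckets = build_ivf(vectors, centroids)
ids, scanned = ivf_search([5.0, 5.05], vectors, centroids, buckets, k=2)
print(ids, "scanned", scanned, "of", len(vectors))
```

With more centroids each bucket holds fewer vectors, so a query compares against fewer candidates, while the up-front work of choosing the centroids grows.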

###### Quantization Options

**Scalar Quantization (SQ):**
- `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)
- Lower memory usage, faster search, slightly reduced accuracy

**Product Quantization (PQ):**
- Format: `PQ<subquantizers>x<bits>` (e.g., `PQ32x8`)
- Better compression for very large datasets
- More complex but can maintain accuracy with smaller index size
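
As a rough worked example of why quantization saves memory, the sketch below estimates the encoded size of a single vector under each setting. It assumes a 768-dimensional embedding purely for illustration (the real dimension depends on the embedding model), and it ignores centroid tables, codebooks, and any other index overhead. `bytes_per_vector` is a hypothetical helper, not a Couchbase API.

```python
def bytes_per_vector(dimensions, quantization):
    """Approximate encoded size of one vector, excluding all index overhead."""
    if quantization == "float32":
        bits = dimensions * 32
    elif quantization.startswith("SQ"):
        # SQ<n>: n bits for every dimension
        bits = dimensions * int(quantization[2:])
    elif quantization.startswith("PQ"):
        # PQ<subquantizers>x<bits>: one code of <bits> bits per subquantizer
        subquantizers, bits_each = quantization[2:].split("x")
        bits = int(subquantizers) * int(bits_each)
    else:
        raise ValueError(f"unsupported quantization: {quantization!r}")
    return bits / 8


DIMENSIONS = 768  # assumption for illustration only
for setting in ["float32", "SQ8", "SQ6", "SQ4", "PQ32x8"]:
    print(f"{setting:8s} {bytes_per_vector(DIMENSIONS, setting):8.1f} bytes")
```

Under these assumptions, `SQ8` is a quarter of the size of an uncompressed float32 vector, while `PQ32x8` is far smaller still, which is where the accuracy trade-off comes from.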

###### Common Configuration Examples

- **`IVF,SQ8`** - Auto centroids, 8-bit scalar quantization (good default)
- **`IVF1000,SQ6`** - 1000 centroids, 6-bit scalar quantization
- **`IVF,PQ32x8`** - Auto centroids, 32 subquantizers with 8 bits

For detailed configuration options, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/server/current/vector-index/hyperscale-vector-index.html#algo_settings).
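
The description format can also be read programmatically. The parser below is a minimal sketch that covers only the forms listed above (`IVF` with an optional centroid count, followed by `SQ<bits>` or `PQ<subquantizers>x<bits>`); `parse_index_description` is a hypothetical helper and does not check which values Couchbase actually accepts.

```python
import re

# IVF, optional centroid count, a comma, then SQ<bits> or PQ<subquantizers>x<bits>
_DESCRIPTION = re.compile(r"^IVF(?P<centroids>\d*),(?P<quantization>SQ\d+|PQ\d+x\d+)$")


def parse_index_description(description):
    """Split an index description into its centroid count and quantization setting."""
    match = _DESCRIPTION.match(description)
    if match is None:
        raise ValueError(f"unrecognized index description: {description!r}")
    centroids = match.group("centroids")
    return {
        # None means the centroid count was omitted and is auto-selected
        "centroids": int(centroids) if centroids else None,
        "quantization": match.group("quantization"),
    }


for text in ["IVF,SQ8", "IVF1000,SQ6", "IVF,PQ32x8"]:
    print(text, "->", parse_index_description(text))
```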

##### Our Configuration Choice

In this tutorial, we use `IVF,SQ8` together with the COSINE distance metric, which provides:
- **Auto-selected centroids** chosen by Couchbase based on our dataset size
- **8-bit scalar quantization** for a good balance of speed, memory usage, and accuracy
- **COSINE distance metric**, a common choice for semantic similarity search
- **A sensible default** for most semantic search use cases

```python
# Create the vector store; the BHIVE GSI vector index itself is created later, in Phase 2
vector_store = CouchbaseQueryVectorStore(
    cluster=cluster,
    bucket_name=couchbase_bucket,
    scope_name=couchbase_scope,
    collection_name=couchbase_collection,
    embedding=HuggingFaceEmbeddings(),  # Hugging Face embeddings with the library's default model
    distance_metric=DistanceStrategy.COSINE
)
```

## Document Processing and Embedding

### Embedding Documents

Now that we have set up our vector store with Hugging Face embeddings, we can add documents to our collection. The `CouchbaseQueryVectorStore` automatically handles the embedding generation process using the Hugging Face transformers library.

#### Understanding the Embedding Process

When we add text documents to our vector store, several important processes happen automatically:

1. **Text Preprocessing**: The input text is preprocessed and tokenized according to the Hugging Face model's requirements
2. **Vector Generation**: Each document is converted into a high-dimensional vector (embedding) that captures its semantic meaning
3. **Storage**: The embeddings are stored in Couchbase along with the original text and any metadata
4. **Indexing**: Once the BHIVE GSI index has been created (Phase 2 below), the stored vectors are indexed for efficient similarity search

#### Adding Sample Documents

In this example, we're adding sample documents that demonstrate Couchbase's capabilities. The system will:
- Generate embeddings for each text document using the Hugging Face model
- Store them in our Couchbase collection
- Make them immediately available for semantic search once the GSI index is ready

**Note**: The `batch_size` parameter controls how many documents are processed together, which can help optimize performance for large document sets.
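
The effect of `batch_size` can be pictured as slicing the input list into fixed-size groups. The generator below is a sketch of that idea only; it is not the library's implementation, and `batched` is a hypothetical name.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


documents = [f"doc-{i}" for i in range(7)]
sizes = [len(batch) for batch in batched(documents, 3)]
print(sizes)
```

Seven documents with a batch size of 3 are processed in three groups, the last of which is smaller than the others.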

```python
texts = [
    "Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.",
    "It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.",
    input("Enter custom embedding text:")
]
vector_store.add_texts(texts=texts, batch_size=32)
```

```text
['7c601881e4bf4c53b5b4c2a25628d904',
 '0442f351aec2415481138315d492ee80',
 'e20a8dcd8b464e8e819b87c9a0ff05c3']
```

## Vector Search Performance Optimization

Now let's demonstrate the performance benefits of different optimization approaches available in Couchbase. We'll compare three optimization levels to show how each contributes to building a production-ready semantic search system:

1. **Baseline (Raw Search)**: Basic vector similarity search without GSI optimization
2. **GSI-Optimized Search**: High-performance search using BHIVE GSI index
3. **Cache Benefits**: Show how caching can be applied on top of any search approach

**Important**: Caching is orthogonal to index types - you can apply caching benefits to both raw searches and GSI-optimized searches to improve repeated query performance.

### Understanding Vector Search Results

Before we start our performance comparisons, let's understand what the search results mean:

When you perform a search query with vector search:

1. **Query Embedding**: Your search text is converted into a vector embedding using the Hugging Face model
2. **Vector Similarity Calculation**: The system compares your query vector against all stored document vectors
3. **Distance Computation**: Using the COSINE distance metric, the system calculates similarity distances
4. **Result Ranking**: Documents are ranked by their distance values (lower = more similar)
5. **Post-processing**: Results include both the document content and metadata

**Note**: The returned value represents the vector distance between query and document embeddings. Lower distance values indicate higher similarity.
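
The ranking steps above can be sketched in a few lines of plain Python. The sketch assumes that cosine distance is defined as one minus cosine similarity, which is the common convention; whether Couchbase reports exactly this value is an assumption here, and the two-dimensional vectors are made up for readability.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
documents = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Rank documents by ascending distance: lower = more similar.
ranked = sorted(documents, key=lambda name: cosine_distance(query, documents[name]))
for name in ranked:
    print(f"{name}: {cosine_distance(query, documents[name]):.1f}")
```

Note that vector length does not matter: `[2.0, 0.0]` points in the same direction as the query and therefore has distance 0.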

### Search Function

Let's create a search function that reports timing and results for our performance comparison:

```python
import time

def search_with_performance_metrics(query_text, stage_name, k=3):
    """Perform semantic search and print detailed performance metrics"""
    print(f"\n=== {stage_name.upper()} ===")
    print(f"Query: \"{query_text}\"")

    # perf_counter is a monotonic, high-resolution clock suited to timing short calls
    start_time = time.perf_counter()
    results = vector_store.similarity_search_with_score(query_text, k=k)
    end_time = time.perf_counter()

    search_time = end_time - start_time
    print(f"Search Time: {search_time:.4f} seconds")
    print(f"Results Found: {len(results)} documents")

    for i, (doc, distance) in enumerate(results, 1):
        print(f"\n[Result {i}]")
        print(f"Vector Distance: {distance:.6f} (lower = more similar)")
        # Use the document content directly from search results (no additional KV call needed)
        print(f"Document Content: {doc.page_content}")
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"Metadata: {doc.metadata}")

    return search_time, results
```

### Phase 1: Baseline Performance (Raw Vector Search)

First, let's establish baseline performance with raw vector search - no GSI optimization yet:

```python
test_query = "What are the key features of a scalable NoSQL database?"
print("Testing baseline performance without GSI optimization...")
baseline_time, baseline_results = search_with_performance_metrics(
    test_query, "Phase 1: Baseline Vector Search"
)
```

```text
Testing baseline performance without GSI optimization...

=== PHASE 1: BASELINE VECTOR SEARCH ===
Query: "What are the key features of a scalable NoSQL database?"
Search Time: 0.1484 seconds
Results Found: 3 documents

[Result 1]
Vector Distance: 0.586197 (lower = more similar)
Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.

[Result 2]
Vector Distance: 0.645435 (lower = more similar)
Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.

[Result 3]
Vector Distance: 0.976888 (lower = more similar)
Document Content: this is a sample text with the data "hello"
```


### Phase 2: Create BHIVE GSI Index and Test Performance

Now let's create the BHIVE GSI index and measure the performance improvement:

```python
# Create BHIVE index for optimized vector search
print("Creating BHIVE GSI vector index...")
try:
    vector_store.create_index(
        index_type=IndexType.BHIVE,
        index_description="IVF,SQ8",
        distance_metric=DistanceStrategy.COSINE,
        index_name="huggingface_bhive_index",
    )
    print("✓ BHIVE GSI vector index created successfully!")

    # Wait for index to become available
    print("Waiting for index to become available...")
    time.sleep(3)

except Exception as e:
    if "already exists" in str(e).lower():
        print("✓ BHIVE GSI vector index already exists, proceeding...")
    else:
        print(f"Error creating GSI index: {str(e)}")

# Test the same query with GSI optimization
print("\nTesting performance with BHIVE GSI optimization...")
gsi_time, gsi_results = search_with_performance_metrics(
    test_query, "Phase 2: GSI-Optimized Search"
)
```

```text
Creating BHIVE GSI vector index...
✓ BHIVE GSI vector index created successfully!
Waiting for index to become available...

Testing performance with BHIVE GSI optimization...

=== PHASE 2: GSI-OPTIMIZED SEARCH ===
Query: "What are the key features of a scalable NoSQL database?"
Search Time: 0.0848 seconds
Results Found: 3 documents

[Result 1]
Vector Distance: 0.586197 (lower = more similar)
Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.

[Result 2]
Vector Distance: 0.645435 (lower = more similar)
Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.

[Result 3]
Vector Distance: 0.976888 (lower = more similar)
Document Content: this is a sample text with the data "hello"
```


### Phase 3: Demonstrate Cache Benefits

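Now let's show how caching can improve performance for repeated queries. **Note**: Caching benefits apply to both raw searches and GSI-optimized searches. Keep in mind that `set_llm_cache` registers a cache for LangChain LLM calls, so the timing difference on a repeated vector query may also reflect warm connections and an already-loaded embedding model.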
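
As a conceptual illustration of a cache miss followed by a cache hit, the sketch below memoizes results in a dictionary keyed by the query text and `k`. It is not `CouchbaseCache` and does not reflect how that class stores entries; `QueryCache` and `fake_search` are hypothetical stand-ins.

```python
class QueryCache:
    """Memoize search results by (query, k) and count hits and misses."""

    def __init__(self, search_fn):
        self._search_fn = search_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def search(self, query, k):
        key = (query, k)
        if key in self._store:
            self.hits += 1
        else:
            # First time this (query, k) pair is seen: run the real search
            self.misses += 1
            self._store[key] = self._search_fn(query, k)
        return self._store[key]


def fake_search(query, k):
    """Stand-in for an expensive search call."""
    return [f"{query}-result-{i}" for i in range(k)]


cache = QueryCache(fake_search)
first = cache.search("distributed database", 2)
second = cache.search("distributed database", 2)
print(f"misses={cache.misses} hits={cache.hits}")
```

The second call returns the stored result without invoking the search function again, which is where a repeated query saves time.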

```python
# Set up Couchbase cache (can be applied to any search approach)
print("Setting up Couchbase cache for improved performance on repeated queries...")
cache = CouchbaseCache(
    cluster=cluster,
    bucket_name=couchbase_bucket,
    scope_name=couchbase_scope,
    collection_name=couchbase_collection,
)
set_llm_cache(cache)
print("✓ Couchbase cache enabled!")

# Run a new query twice; the second run should show an improvement
cache_query = "How does a distributed database handle high-speed operations?"

print("\nTesting cache benefits with a different query...")
print("First execution (cache miss):")
cache_time_1, _ = search_with_performance_metrics(
    cache_query, "Phase 3a: First Query (Cache Miss)", k=2
)

print("\nSecond execution (cache hit):")
cache_time_2, _ = search_with_performance_metrics(
    cache_query, "Phase 3b: Repeated Query (Cache Hit)", k=2
)
```

```text
Setting up Couchbase cache for improved performance on repeated queries...
✓ Couchbase cache enabled!

Testing cache benefits with a different query...
First execution (cache miss):

=== PHASE 3A: FIRST QUERY (CACHE MISS) ===
Query: "How does a distributed database handle high-speed operations?"
Search Time: 0.1024 seconds
Results Found: 2 documents

[Result 1]
Vector Distance: 0.632770 (lower = more similar)
Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.

[Result 2]
Vector Distance: 0.677951 (lower = more similar)
Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.

Second execution (cache hit):

=== PHASE 3B: REPEATED QUERY (CACHE HIT) ===
Query: "How does a distributed database handle high-speed operations?"
```

### Complete Performance Analysis

Let's analyze the complete performance improvements across all optimization levels:

```python
print("\n" + "="*80)
print("VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY")
print("="*80)

print(f"Phase 1 - Baseline (Raw Search):     {baseline_time:.4f} seconds")
print(f"Phase 2 - GSI-Optimized Search:      {gsi_time:.4f} seconds")
print("Phase 3 - Cache Benefits:")
print(f"  First execution (cache miss):      {cache_time_1:.4f} seconds")
print(f"  Second execution (cache hit):      {cache_time_2:.4f} seconds")

print("\n" + "-"*80)
print("OPTIMIZATION IMPACT ANALYSIS:")
print("-"*80)

# GSI improvement analysis
if gsi_time and baseline_time and gsi_time < baseline_time:
    gsi_speedup = baseline_time / gsi_time
    gsi_improvement = ((baseline_time - gsi_time) / baseline_time) * 100
    print(f"GSI Index Benefit:      {gsi_speedup:.2f}x faster ({gsi_improvement:.1f}% improvement)")
else:
    print("GSI Index Benefit:      Performance similar to baseline (may vary with dataset size)")

# Cache improvement analysis
if cache_time_2 and cache_time_1 and cache_time_2 < cache_time_1:
    cache_speedup = cache_time_1 / cache_time_2
    cache_improvement = ((cache_time_1 - cache_time_2) / cache_time_1) * 100
    print(f"Cache Benefit:          {cache_speedup:.2f}x faster ({cache_improvement:.1f}% improvement)")
else:
    print("Cache Benefit:          No significant improvement (results may be cached already)")

print("\nKey Insights:")
print("• GSI optimization provides consistent performance benefits, especially with larger datasets")
print("• Caching benefits apply to both raw and GSI-optimized searches")
print("• Combined GSI + Cache provides the best performance for production applications")
print("• BHIVE indexes scale to billions of vectors with optimized concurrent operations")
```


```text
VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY
Phase 1 - Baseline (Raw Search):     0.1484 seconds
Phase 2 - GSI-Optimized Search:      0.0848 seconds
Phase 3 - Cache Benefits:
  First execution (cache miss):      0.1024 seconds
  Second execution (cache hit):      0.0289 seconds

--------------------------------------------------------------------------------
OPTIMIZATION IMPACT ANALYSIS:
--------------------------------------------------------------------------------
GSI Index Benefit:      1.75x faster (42.8% improvement)
Cache Benefit:          3.55x faster (71.8% improvement)

Key Insights:
• GSI optimization provides consistent performance benefits, especially with larger datasets
• Caching benefits apply to both raw and GSI-optimized searches
• Combined GSI + Cache provides the best performance for production applications
• BHIVE indexes scale to billions of vectors with optimized concurrent operations
```
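
Each figure above comes from a single timed call against three documents, so individual numbers can vary noticeably between runs. One way to get steadier measurements is to repeat the call and take the median, as in the sketch below. `median_runtime` and `speedup` are hypothetical helpers; in the notebook, the function passed in would wrap a call to the vector store.

```python
import statistics
import time


def median_runtime(fn, repeats=5):
    """Call fn several times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


# Using the Phase 1 and Phase 2 timings reported above
print(f"{speedup(0.1484, 0.0848):.2f}x faster")

# Timing an arbitrary callable; a placeholder workload is used here
print(f"{median_runtime(lambda: sum(range(10_000))):.6f} seconds")
```

The median is less sensitive than a single sample to one-off effects such as a cold connection or a first-time model load.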


### Interactive Testing

Try your own queries with the optimized search system:

```python
custom_query = input("Enter your search query: ")
search_with_performance_metrics(custom_query, "Interactive GSI-Optimized Search")
```

```text
=== INTERACTIVE GSI-OPTIMIZED SEARCH ===
Query: "What is the sample data?"
Search Time: 0.0812 seconds
Results Found: 3 documents

[Result 1]
Vector Distance: 0.623644 (lower = more similar)
Document Content: this is a sample text with the data "hello"

[Result 2]
Vector Distance: 0.860599 (lower = more similar)
Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.

[Result 3]
Vector Distance: 0.909207 (lower = more similar)
Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.


(0.08118820190429688,
 [(Document(id='e20a8dcd8b464e8e819b87c9a0ff05c3', metadata={}, page_content='this is a sample text with the data "hello"'),
   0.6236441411684932),
  (Document(id='0442f351aec2415481138315d492ee80', metadata={}, page_content='It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.'),
   0.8605992009935179),
  (Document(id='7c601881e4bf4c53b5b4c2a25628d904', metadata={}, page_content='Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.'),
   0.9092065785676496)])
```

## Conclusion

You have successfully built a semantic search engine using Couchbase's GSI vector search capabilities and Hugging Face embeddings. This guide has walked you through the complete process: connecting to the cluster, embedding and storing documents, creating a BHIVE index, and comparing query performance. BHIVE indexes are designed to scale to billions of vectors, so the same approach can grow with your dataset.