# Create Vector Search Index for CLIP Embeddings

This notebook demonstrates how to:
- Create or use an existing vector search endpoint
- Enable change data feed on the embeddings table if not already enabled
- Create a vector search index using pre-calculated CLIP embeddings
- Wait for the index to come online and verify it's ready for similarity search

The resulting vector search index will enable fast semantic similarity searches across the crop image dataset using the generated CLIP embeddings.

## Import Required Libraries

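Install the `databricks-vectorsearch` package, restart Python so the freshly installed version is loaded, and then import the vector search client along with `time` for the polling loops used later.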

In [0]:
%pip install --upgrade --force-reinstall databricks-vectorsearch
dbutils.library.restartPython()

In [0]:
from databricks.vector_search.client import VectorSearchClient
import time

## Dataset Configuration

Configure the vector search parameters. Update these values to match your setup and desired endpoint/index names.

In [0]:
# Dataset Configuration - Update these values for your setup
CATALOG_NAME = "autobricks"  # Your Unity Catalog name
SCHEMA_NAME = "agriculture"   # Your schema name
TABLE_NAME = "crop_images_directory_embeddings"  # Table with embeddings from notebook 02
INDEX_NAME = "crop_images_directory_embeddings_index_2"  # Desired index name
ENDPOINT_NAME = "one-env-shared-endpoint-0"  # Vector search endpoint name

# Construct full names
SOURCE_TABLE = f"{CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}"
FULL_INDEX_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.{INDEX_NAME}"

# Vector search configuration
PRIMARY_KEY = "file_path"
EMBEDDING_VECTOR_COLUMN = "embeddings"
EMBEDDING_DIMENSION = 768  # Must equal the vector length stored in the embeddings column (e.g. 768 for CLIP ViT-L/14)
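
Because the index is declared with a fixed embedding dimension, it can be worth confirming that the stored vectors actually have that length before creating the index. The following is a minimal sketch that assumes the vectors have already been collected into plain Python lists; `check_embedding_dimension` is a hypothetical helper and not part of the Databricks client, and the sample data is made up for illustration.

```python
from typing import Iterable, List, Sequence


def check_embedding_dimension(vectors: Iterable[Sequence[float]], expected: int) -> List[int]:
    """Return the positions of vectors whose length differs from `expected`."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected]


# Illustrative data only: two well-formed vectors and one with the wrong length.
sample_vectors = [[0.0] * 768, [0.1] * 768, [0.2] * 512]
bad_positions = check_embedding_dimension(sample_vectors, expected=768)
print(f"Vectors with unexpected length at positions: {bad_positions}")
```

In this notebook, the input could come from a small sample of the source table, for example the `embeddings` values of `spark.table(SOURCE_TABLE).limit(100).collect()`.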

## Initialize Vector Search Client

Create a vector search client to manage endpoints and indexes.

In [0]:
vsc = VectorSearchClient()

## Create Vector Search Endpoint

Check if the specified vector search endpoint exists. If not, create a new one.

In [0]:
# Check if endpoint exists, create if it doesn't
try:
    endpoint = vsc.get_endpoint(name=ENDPOINT_NAME)
    print(f"Using existing endpoint: {ENDPOINT_NAME}")
except Exception:
    # The lookup failed, so assume the endpoint does not exist yet.
    # Catching Exception (not a bare except) lets KeyboardInterrupt stop the cell.
    print(f"Creating endpoint: {ENDPOINT_NAME}")
    vsc.create_endpoint(
        name=ENDPOINT_NAME,
        endpoint_type="STANDARD"
    )

    # Poll every 30 seconds until the endpoint reports ONLINE
    while True:
        endpoint = vsc.get_endpoint(name=ENDPOINT_NAME)
        status = endpoint.get('endpoint_status', {}).get('state', 'Unknown')
        if status == 'ONLINE':
            break
        time.sleep(30)

    print(f"Endpoint {ENDPOINT_NAME} is ready")
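
The polling loop above runs until the endpoint reports `ONLINE`, so it never returns if provisioning stalls. One way to bound the wait is a small helper with a deadline. This is a sketch under stated assumptions: `wait_for_state` and its parameters are hypothetical names, not part of `databricks-vectorsearch`, and the endpoint is simulated here so the example runs anywhere.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    target: str = "ONLINE",
    timeout_s: float = 1800.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll `get_state` until it returns `target`; raise TimeoutError after `timeout_s`."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == target:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"Still in state {state!r} after {timeout_s} seconds")
        sleep(poll_s)


# Simulated endpoint that becomes ONLINE on the third poll (no real sleeping).
states = iter(["PROVISIONING", "PROVISIONING", "ONLINE"])
final_state = wait_for_state(lambda: next(states), sleep=lambda _: None)
print(f"Final state: {final_state}")
```

In the notebook, `get_state` could wrap the same lookup the loop already performs, for example `lambda: vsc.get_endpoint(name=ENDPOINT_NAME).get('endpoint_status', {}).get('state', 'Unknown')`.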

## Enable Change Data Feed

A Delta Sync index keeps itself up to date by reading the source table's change data feed, so the feed must be enabled on the embeddings table before the index is created.

In [0]:
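# Enable change data feed on the source table (safe to re-run if it is already enabled)
try:
    spark.sql(f"ALTER TABLE {SOURCE_TABLE} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")
    print(f"Change data feed enabled on {SOURCE_TABLE}")
except Exception as e:
    # Setting an already-set property does not fail, so an error here
    # usually points to a permissions or table-name problem.
    print(f"Could not enable change data feed on {SOURCE_TABLE}: {e}")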
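
To confirm the property actually took effect, the table properties can be inspected. The sketch below assumes the rows come from `SHOW TBLPROPERTIES`, which returns `key` and `value` columns; `is_cdf_enabled` is a hypothetical helper, and plain dictionaries stand in for Spark rows so the example runs without a cluster.

```python
from typing import Any, Iterable


def is_cdf_enabled(property_rows: Iterable[Any]) -> bool:
    """Return True if delta.enableChangeDataFeed is present and set to true."""
    for row in property_rows:
        if row["key"] == "delta.enableChangeDataFeed":
            return str(row["value"]).lower() == "true"
    return False


# Stand-in for rows collected from SHOW TBLPROPERTIES (illustrative values).
example_rows = [
    {"key": "delta.minReaderVersion", "value": "1"},
    {"key": "delta.enableChangeDataFeed", "value": "true"},
]
print(f"Change data feed enabled: {is_cdf_enabled(example_rows)}")
```

In the notebook, the rows could be obtained with `spark.sql(f"SHOW TBLPROPERTIES {SOURCE_TABLE}").collect()`.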

## Create Vector Search Index

Create the vector search index using the pre-calculated CLIP embeddings from the embeddings table.

In [0]:
# Create vector search index with pre-calculated embeddings
try:
    index = vsc.get_index(endpoint_name=ENDPOINT_NAME, index_name=FULL_INDEX_NAME)
    print(f"Using existing index: {FULL_INDEX_NAME}")
except Exception:
    # The lookup failed, so assume the index does not exist yet
    print(f"Creating index: {FULL_INDEX_NAME}")
    index = vsc.create_delta_sync_index(
        endpoint_name=ENDPOINT_NAME,
        source_table_name=SOURCE_TABLE,
        index_name=FULL_INDEX_NAME,
        pipeline_type='TRIGGERED',  # Syncs only when triggered, not continuously
        primary_key=PRIMARY_KEY,
        embedding_vector_column=EMBEDDING_VECTOR_COLUMN,
        embedding_dimension=EMBEDDING_DIMENSION
    )

## Wait for Index to Come Online

Monitor the index status and wait for it to be fully online and ready for similarity searches.

In [0]:
# Wait for index to come online
print("Waiting for index to be ONLINE...")

while True:
    try:
        status_info = index.describe()
    except Exception as e:
        # Treat errors from the status call as transient and retry on the next poll
        print(f"Error checking index status: {e}")
        time.sleep(30)
        continue

    status = status_info.get('status', {}).get('detailed_state', 'Unknown')

    if status.startswith('ONLINE'):
        print("Index is ONLINE and ready")
        break
    if 'FAILED' in status or 'ERROR' in status:
        # Stop the notebook here instead of continuing with a broken index
        raise RuntimeError(f"Index creation failed with status: {status}")

    time.sleep(30)

## Verify Index Readiness

Run a similarity search using one embedding taken from the source table to confirm that the index returns results.

In [0]:
# Test the index with a simple query
try:
    # Fetch one stored embedding to use as the query vector
    sample_row = spark.sql(
        f"SELECT {EMBEDDING_VECTOR_COLUMN} FROM {SOURCE_TABLE} "
        f"WHERE {EMBEDDING_VECTOR_COLUMN} IS NOT NULL LIMIT 1"
    ).collect()

    if sample_row:
        test_vector = sample_row[0][EMBEDDING_VECTOR_COLUMN]

        test_results = index.similarity_search(
            query_vector=test_vector,
            columns=["file_name", "folder", PRIMARY_KEY],
            num_results=3
        )

        num_found = len(test_results.get('result', {}).get('data_array', []))
        print(f"Index test successful - found {num_found} results")
    else:
        print("No sample embeddings found for testing")

except Exception as e:
    print(f"Index test failed: {e}")

print(f"Vector search setup complete for {FULL_INDEX_NAME}")
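
The verification cell above only counts the returned rows. To read them by column name, the response can be reshaped into dictionaries. This sketch assumes the response carries column names under `manifest.columns` and rows under `result.data_array`, with the similarity score as the last column; `rows_from_response` is a hypothetical helper, and the response below is fabricated for illustration rather than taken from a real query.

```python
from typing import Any, Dict, List


def rows_from_response(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pair each row in data_array with the column names from the manifest."""
    columns = [col["name"] for col in response.get("manifest", {}).get("columns", [])]
    data = response.get("result", {}).get("data_array") or []
    return [dict(zip(columns, row)) for row in data]


# Fabricated response in the assumed shape (not real query output).
fake_response = {
    "manifest": {
        "columns": [
            {"name": "file_name"},
            {"name": "folder"},
            {"name": "file_path"},
            {"name": "score"},
        ]
    },
    "result": {
        "data_array": [
            ["img_001.jpg", "wheat", "/data/wheat/img_001.jpg", 0.99],
            ["img_014.jpg", "wheat", "/data/wheat/img_014.jpg", 0.87],
        ]
    },
}

for row in rows_from_response(fake_response):
    print(row["file_name"], row["score"])
```

In the notebook, the helper would be called on `test_results` from the verification cell.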