# Image Vector Storage and Search Pipeline with OpenSearch

### Introduction

Welcome to the second notebook in our image processing series. It builds on the images processed in the previous notebook and sets up a vector search system with Amazon OpenSearch Serverless. We convert the processed images into vector embeddings so that they can be retrieved by semantic similarity rather than by keyword matching alone.

Vector search represents a significant advancement in image retrieval systems. Unlike conventional methods that rely on tags or metadata, vector search converts images into high-dimensional vector representations (embeddings) that capture both visual and semantic information. This approach enables us to find similar images based on their actual content and meaning, rather than just matching keywords.
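
To make the idea of "similar vectors" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made-up values for illustration only; real Titan embeddings have hundreds of dimensions, and the similarity function that OpenSearch actually applies depends on how the index is configured.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, not produced by any model).
query = [1.0, 0.0, 0.0]
candidates = {
    "close_match": [0.9, 0.1, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}

# Rank candidates from most to least similar to the query.
ranked = sorted(
    candidates.items(),
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # ['close_match', 'unrelated']
```

A vector search engine performs the same kind of ranking, but uses an approximate nearest-neighbor index so that it does not have to compare the query against every stored vector.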

The implementation uses the Amazon Titan Multimodal Embeddings model (`amazon.titan-embed-image-v1`) through Amazon Bedrock. The model maps both images and text into the same vector space, so a natural language query can be compared directly with stored image embeddings. This is useful for applications such as image retrieval, content recommendation, and visual similarity matching.


This notebook utilizes several AWS services, including:

* **Amazon OpenSearch Serverless** for vector storage and search
* **Amazon Bedrock** with Titan Multimodal for embedding generation
* **Amazon S3** for data storage
* **Amazon SageMaker** for notebook hosting

### Key Features:

* **Vector Database Management**:
  * Index creation and configuration
  * Document ingestion
  * Vector storage in a k-NN index

* **Embedding Generation**:
  * Multimodal embedding creation
  * Configurable embedding dimension
  * Embedding generation for every processed image

* **Semantic Search Capabilities**:
  * k-NN search implementation
  * Query vector generation
  * Result visualization


## Table of Contents
1. [Setup and Dependencies](#Setup-and-Dependencies)
2. [Configuration and Environment Setup](#Configuration-and-Environment-Setup)
3. [Embedding Generation](#Embedding-Generation)
4. [OpenSearch Client Creation](#OpenSearch-Client-Creation)
5. [Index Creation and Management](#Index-Creation-and-Management)
6. [Data Ingestion](#Data-Ingestion)
7. [Search Implementation](#Search-Implementation)


***

## Setup and Dependencies


In [1]:
import base64
import json
import os
import random
import time
from typing import Any, Dict, List, Optional

import boto3
import nbimporter  # enables importing functions from the previous notebook
from IPython.display import display, Image as IPImage, Markdown
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth, helpers
from requests_aws4auth import AWS4Auth
from sagemaker import get_execution_role
from tqdm import tqdm

from _00_image_processing import resize_and_encode

# Restore the region saved by the previous notebook, then create the AWS clients
%store -r REGION
bedrock_runtime = boto3.client("bedrock-runtime", region_name=REGION)
aoss_client = boto3.client("opensearchserverless", region_name=REGION)



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


***

## Configuration and Environment Setup
Load the variables stored by the previous notebook.

In [2]:
%store -r VECTOR_STORE_NAME_PREFIX REGION image_paths BASE64_IMAGES_DIR IMAGES_DIR image_descriptions BUCKET INDEX_NAME COLLECTION_ID  COLLECTION_ENDPOINT

<div class="alert alert-block alert-info">
<b>Note:</b> Variables are imported from the previous notebook using IPython's %store magic command. Make sure you've run the first notebook successfully before proceeding.
</div>


***

## Embedding Generation

The embedding generation process is a crucial step that transforms our images into mathematical representations suitable for vector search. We utilize Amazon's Titan Multimodal model, which excels at understanding both visual and textual content.

The `get_titan_multimodal_embedding` function serves as our primary tool for generating these embeddings. It can process both images and text descriptions, making it versatile for our needs. The function:

1. Accepts an image path, a text description, or both as input
2. Handles the base64 encoding of images through `resize_and_encode`
3. Sets the output embedding dimension (default 1024)
4. Sends the request to Amazon Bedrock with `invoke_model`
5. Returns the parsed response, a dictionary whose `embedding` key holds the vector
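
The request body that these steps produce can be sketched on its own, without calling Bedrock. The helper name `build_titan_request_body` and the `SUPPORTED_DIMENSIONS` constant are hypothetical and not part of this notebook; the listed output lengths are the ones documented for `amazon.titan-embed-image-v1` at the time of writing, so check the current Bedrock documentation before relying on them.

```python
import json
from typing import Any, Dict, Optional

# Assumption: output lengths documented for amazon.titan-embed-image-v1.
SUPPORTED_DIMENSIONS = (256, 384, 1024)


def build_titan_request_body(
    image_base64: Optional[str] = None,
    description: Optional[str] = None,
    dimension: int = 1024,
) -> str:
    """Build the JSON request body for a Titan multimodal embedding call."""
    if not image_base64 and not description:
        raise ValueError("provide image_base64, description, or both")
    if dimension not in SUPPORTED_DIMENSIONS:
        raise ValueError(f"dimension must be one of {SUPPORTED_DIMENSIONS}")

    body: Dict[str, Any] = {
        "embeddingConfig": {"outputEmbeddingLength": dimension}
    }
    if image_base64:
        body["inputImage"] = image_base64
    if description:
        body["inputText"] = description
    return json.dumps(body)


# Text-only request with a smaller embedding.
print(build_titan_request_body(description="a red brick building", dimension=384))
```

Validating the inputs before the call turns a malformed request into an immediate, readable error instead of a service-side validation failure.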


In [3]:
def get_titan_multimodal_embedding(
    bedrock_runtime: Any,
    image_path: Optional[str] = None,
    description: Optional[str] = None,
    dimension: int = 1024
) -> Dict[str, List[float]]:
    """
    Generates a multimodal embedding using Amazon Titan for either an image or a text description.

    Args:
        bedrock_runtime: Bedrock client instance
        image_path: Path to image file
        description: Text description
        dimension: Desired embedding dimension

    Returns:
        Dict containing embedding vector
    """
    payload_body: Dict[str, Any] = {}

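    if not image_path and not description:
        raise ValueError("Provide image_path, description, or both.")

    if image_path:
        # resize_and_encode reads the file itself; it is expected to return
        # the base64-encoded image string
        payload_body["inputImage"] = resize_and_encode(image_path)
    if description:
        payload_body["inputText"] = description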

    embedding_config: Dict[str, Dict[str, int]] = {
        "embeddingConfig": {"outputEmbeddingLength": dimension}
    }

    response = bedrock_runtime.invoke_model(
        body=json.dumps({**payload_body, **embedding_config}),
        modelId="amazon.titan-embed-image-v1",
        accept="application/json",
        contentType="application/json"
    )
    return json.loads(response.get("body").read())


Generate embeddings for all images

In [4]:
embeddings = [
    get_titan_multimodal_embedding(
        bedrock_runtime=bedrock_runtime,
        image_path=path,
        dimension=1024
    )
    for path in image_paths
]

<div class="alert alert-block alert-warning">
<b>Processing:</b> Generating embeddings for all images. This process may take several minutes depending on the number and size of images, as each image needs to be processed by the Titan model.
</div>
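
Because this step makes one model call per image, a single throttled or failed request would otherwise abort the whole run. One possible mitigation is a small retry helper with exponential backoff. This is a sketch, not part of the original notebook: `call_with_retries` is a hypothetical name, and it retries on any exception for simplicity, whereas real code would likely retry only on throttling errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying with exponential backoff when it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Possible use with the function defined above (left commented out here
# because it needs AWS credentials):
# embedding = call_with_retries(
#     lambda: get_titan_multimodal_embedding(bedrock_runtime, image_path=path)
# )
```

The `sleep` parameter exists only so the waiting behavior can be replaced in tests.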


Inspect the first 10 values of the first embedding

In [5]:
embeddings[0]['embedding'][:10]

[0.05510917,
 0.038320947,
 -0.005885002,
 -0.0017335666,
 0.04178808,
 -0.016058302,
 -0.014324735,
 -0.014872177,
 -0.009990818,
 0.0416056]

***

## OpenSearch Client Creation

A connection to the OpenSearch Serverless collection is required before we can create an index or run queries. This section sets up the authenticated client.

Our `get_oss_client` function configures:
1. **AWS IAM Authentication**: Signs requests with `AWSV4SignerAuth` for the `aoss` service, using the credentials of the current boto3 session
2. **SSL/TLS Configuration**: Connects over HTTPS on port 443 with certificate verification enabled
3. **Connection Class**: Uses `RequestsHttpConnection` as the HTTP transport

The function does not catch errors itself; connection or authorization failures surface as exceptions on the first request made with the client.
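
The `hosts` entry passed to the client expects a bare host name. If the stored `COLLECTION_ENDPOINT` includes a scheme such as `https://` (an assumption; this depends on how the previous notebook saved it), it needs to be stripped first. The helper below is a hypothetical sketch of that normalization, and the endpoint in the example is a made-up value.

```python
from urllib.parse import urlparse


def endpoint_to_host(collection_endpoint: str) -> str:
    """Return the bare host name from an endpoint with or without a scheme."""
    endpoint = collection_endpoint.strip()
    if "://" not in endpoint:
        # urlparse only recognizes the host when a scheme is present.
        endpoint = "https://" + endpoint
    host = urlparse(endpoint).hostname
    if not host:
        raise ValueError(f"could not extract a host from {collection_endpoint!r}")
    return host


# Hypothetical endpoint, for illustration only.
print(endpoint_to_host("https://abc123.us-east-1.aoss.amazonaws.com"))
# abc123.us-east-1.aoss.amazonaws.com
```

The result can then be passed as `collection_endpoint` to `get_oss_client`.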


In [6]:
def get_oss_client(
    collection_endpoint: str,
    region: str
) -> OpenSearch:
    """
    Creates an OpenSearch client with AWS authentication.

    Args:
        collection_endpoint: OpenSearch endpoint
        region: AWS region

    Returns:
        Configured OpenSearch client
    """
    return OpenSearch(
        hosts=[{'host': collection_endpoint, 'port': 443}],
        http_auth=AWSV4SignerAuth(boto3.Session().get_credentials(), region, 'aoss'),
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
        use_ssl_context=True,
        ssl_assert_hostname=False,
        ssl_show_warn=False
    )

In [7]:
oss_client = get_oss_client(COLLECTION_ENDPOINT, region=REGION)

***

## Index Creation and Management

The index creation phase establishes the foundation for our vector search capabilities. We configure a specialized OpenSearch index optimized for k-NN (k-Nearest Neighbors) vector search operations. The index configuration involves several crucial components:

### Index Structure
The index mapping defines four key fields:
- **image_vector**: A high-dimensional vector field (1024D) storing our image embeddings
- **description**: Text field containing image descriptions
- **image_base64_s3_uri**: Reference to the encoded image in S3
- **image_s3_uri**: Original image location in S3
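
The mapping used in this notebook relies on the service defaults for the k-NN algorithm. As a sketch of what a more explicit configuration could look like, the hypothetical helper below builds an index body with an HNSW `method` block. The engine, space type, and parameter values are illustrative assumptions, not tuned recommendations, and you should confirm that your collection supports the chosen combination.

```python
from typing import Any, Dict


def build_knn_index_body(
    dimension: int = 1024,
    engine: str = "faiss",
    space_type: str = "l2",
    ef_construction: int = 512,
    m: int = 16,
) -> Dict[str, Any]:
    """Build an index body with an explicit HNSW method for image_vector."""
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "image_vector": {
                    "type": "knn_vector",
                    # Must equal the length of the embeddings being indexed.
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": engine,
                        "space_type": space_type,
                        "parameters": {
                            "ef_construction": ef_construction,
                            "m": m,
                        },
                    },
                },
                "description": {"type": "text"},
                "image_base64_s3_uri": {"type": "text"},
                "image_s3_uri": {"type": "text"},
            }
        },
    }


body = build_knn_index_body(dimension=1024)
print(body["mappings"]["properties"]["image_vector"]["method"]["name"])  # hnsw
```

Whatever configuration is used, the `dimension` in the mapping has to match the `dimension` passed to `get_titan_multimodal_embedding`; documents with vectors of a different length are rejected at indexing time.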

In [8]:
index_body: Dict[str, Any] = {
    "settings": {"index.knn": "true"},
    "mappings": {
        "properties": {
            "image_vector": {"type": "knn_vector", "dimension": 1024},
            "description": {"type": "text"},
            "image_base64_s3_uri": {"type": "text"},
            "image_s3_uri": {"type": "text"}
        }
    }
}


oss_client.indices.create(index=INDEX_NAME, body=index_body)


{'acknowledged': True, 'shards_acknowledged': True, 'index': 'vrag-index'}

***

## Data Ingestion

The data ingestion step populates the OpenSearch index with the vector embeddings and their associated metadata. It follows a simple ETL (Extract, Transform, Load) flow:

### Data Processing Steps:
1. **Extraction**:
   - Lists the base64 text files and the image files in their directories
   - Sorts file names to maintain consistent ordering
   - Filters files by extension (`.txt`, `.jpg`, `.png`)

2. **Transformation**:
   - Pairs each embedding with its description and file names
   - Structures each document according to our index mapping
   - Builds S3 URIs for both the original and the encoded image

3. **Loading**:
   - Indexes one document per request
   - Waits 5 seconds between requests to avoid service throttling

The pairing relies on `embeddings`, `image_descriptions`, and the two sorted file lists all having the same order and length. `zip` stops silently at the shortest input, so a mismatch would drop or mislabel documents without raising an error.


Extract the paths of the text files and images

In [9]:
_base64_images = sorted([f for f in os.listdir(BASE64_IMAGES_DIR) if f.endswith(".txt")])
_images = sorted([f for f in os.listdir(IMAGES_DIR) if f.endswith((".jpg", ".png"))])

Iterate through embeddings, descriptions, and image metadata to index in OpenSearch

In [None]:
print("Ingesting data into OpenSearch...")
for embedding, description, base64_img, img_path in zip(
    embeddings, image_descriptions, _base64_images, _images
):
    document: Dict[str, Any] = {
        "image_vector": embedding['embedding'],
        "description": description,
        "image_base64_s3_uri": f"s3://{BUCKET}/{BASE64_IMAGES_DIR}/{base64_img}",
        "image_s3_uri": f"s3://{BUCKET}/{IMAGES_DIR}/{img_path}"
    }
    oss_client.index(index=INDEX_NAME, body=document)
    time.sleep(5)

print("Data ingestion complete.")


Ingesting data into OpenSearch...


<div class="alert alert-block alert-warning">
<b>Processing:</b> The ingestion process includes a 5-second delay between documents to prevent rate limiting. This process may take a couple of minutes depending on the number of images.
</div>
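
As an alternative to one request per document, the `helpers` module imported at the top of this notebook provides a `bulk` function that sends many documents per request. The sketch below only builds the list of bulk actions, so it runs without a live collection; `build_bulk_actions` is a hypothetical helper, and the actual `helpers.bulk` call is shown as a comment. No `_id` is set, so OpenSearch assigns document IDs.

```python
from typing import Any, Dict, List, Sequence


def build_bulk_actions(
    index_name: str,
    vectors: Sequence[Sequence[float]],
    descriptions: Sequence[str],
    base64_uris: Sequence[str],
    image_uris: Sequence[str],
) -> List[Dict[str, Any]]:
    """Build one bulk 'index' action per document, checking alignment first."""
    lengths = {len(vectors), len(descriptions), len(base64_uris), len(image_uris)}
    if len(lengths) != 1:
        # Fail loudly instead of letting zip() truncate silently.
        raise ValueError("all inputs must have the same number of items")

    return [
        {
            "_index": index_name,
            "_source": {
                "image_vector": list(vector),
                "description": description,
                "image_base64_s3_uri": base64_uri,
                "image_s3_uri": image_uri,
            },
        }
        for vector, description, base64_uri, image_uri in zip(
            vectors, descriptions, base64_uris, image_uris
        )
    ]


# Tiny illustrative input (hypothetical bucket and 2-dimensional vectors).
actions = build_bulk_actions(
    "vrag-index",
    [[0.1, 0.2]],
    ["a tall building"],
    ["s3://example-bucket/base64/a.txt"],
    ["s3://example-bucket/images/a.jpg"],
)
print(len(actions))  # 1

# With a live client, the actions could be sent in one call:
# success_count, errors = helpers.bulk(oss_client, actions)
```

The explicit length check addresses the silent truncation that `zip` would otherwise allow when the input lists differ in size.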


***

## Search Implementation

The search implementation represents the culmination of our vector search pipeline, enabling sophisticated similarity-based image retrieval. Our system implements a k-NN (k-Nearest Neighbors) search strategy that leverages the vector space to find visually and semantically similar images.

### Search Architecture
The `query_open_search` function works in three stages:

1. **Query Processing**:
   - Converts the natural language prompt into a vector embedding, using the same Titan model and dimension as the indexed images
   - Sets the number of results to return (`top_k`)

2. **Vector Similarity Search**:
   - Runs a k-NN query against the `image_vector` field
   - OpenSearch scores each candidate by its similarity to the query vector
   - Results are returned ranked by that score

3. **Result Management**:
   - Excludes the `image_vector` field from the returned `_source` to keep responses small
   - Returns the raw hits, which include the description and S3 URIs needed for display

This lets users find relevant images with plain language descriptions.


In [None]:
def query_open_search(
    bedrock_runtime: Any,
    oss_client: OpenSearch,
    index_name: str,
    prompt: str,
    top_k: int = 3
) -> List[Dict[str, Any]]:
    """
    Queries the OpenSearch index using a k-NN search with the given text prompt.

    Args:
        bedrock_runtime: Bedrock client instance
        oss_client: OpenSearch client instance
        index_name: Name of the OpenSearch index
        prompt: Text query for searching
        top_k: Number of top results to retrieve

    Returns:
        List of search results from OpenSearch
    """
    query_emb: List[float] = get_titan_multimodal_embedding(
        bedrock_runtime=bedrock_runtime,
        description=prompt,
        dimension=1024
    )["embedding"]

    query_body: Dict[str, Any] = {
        "size": top_k,
        "_source": {
            "excludes": ["image_vector"],
        },
        "query": {
            "knn": {
                "image_vector": {
                    "vector": query_emb,
                    "k": top_k
                }
            }
        },
    }

    response = oss_client.search(index=index_name, body=query_body)
    return response["hits"]["hits"]
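
The hits returned by `query_open_search` are raw OpenSearch result dictionaries. A small helper can reduce them to the fields needed for display. This is a sketch: `summarize_hits` is a hypothetical name, and the sample hit below is invented to show the expected shape (`_score` plus the `_source` fields defined in the index mapping).

```python
from typing import Any, Dict, List, Tuple


def summarize_hits(hits: List[Dict[str, Any]]) -> List[Tuple[float, str, str]]:
    """Return (score, image_s3_uri, description) tuples, best match first."""
    rows = []
    for hit in hits:
        source = hit.get("_source", {})
        rows.append(
            (
                float(hit.get("_score", 0.0)),
                source.get("image_s3_uri", ""),
                source.get("description", ""),
            )
        )
    # Sort explicitly so the order does not depend on the input order.
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Invented sample data in the shape of response["hits"]["hits"].
sample_hits = [
    {"_score": 0.41, "_source": {"image_s3_uri": "s3://example-bucket/images/b.jpg", "description": "a park"}},
    {"_score": 0.73, "_source": {"image_s3_uri": "s3://example-bucket/images/a.jpg", "description": "a tall building"}},
]
for score, uri, description in summarize_hits(sample_hits):
    print(f"{score:.2f}  {uri}  {description}")
```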

***

### Example Usage and Visualization

This section demonstrates the practical application of our vector search system through a concrete example. We implement a visual search interface that:

1. **Query Processing**:
   - Accepts a natural language search prompt ("building" in this example)
   - Converts the query into a vector embedding
   - Executes the search against our index

2. **Result Handling**:
   - Downloads matched images from S3
   - Formats descriptions for display
   - Generates a clean, visual presentation

3. **Display Formatting**:
   - Implements consistent image sizing
   - Creates a structured markdown layout
   - Pairs images with their descriptions

The visualization provides a user-friendly way to validate search results and demonstrate the system's effectiveness in finding relevant images based on textual queries.
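
The cell below downloads images with the AWS CLI through a shell escape, which works only inside IPython. For code that also needs to run as a plain Python script, one option is to split the S3 URI and download with boto3. Both helper names are hypothetical, and the bucket in the example is a made-up value.

```python
import os
from typing import Tuple


def parse_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must include a bucket and a key: {uri!r}")
    return bucket, key


def download_s3_object(uri: str, local_path: str) -> None:
    """Download the object at `uri` to `local_path` (requires AWS credentials)."""
    import boto3  # imported here so parse_s3_uri works without boto3 installed

    bucket, key = parse_s3_uri(uri)
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    boto3.client("s3").download_file(bucket, key, local_path)


print(parse_s3_uri("s3://example-bucket/images/a.jpg"))
# ('example-bucket', 'images/a.jpg')
```

Keeping the original file extension from the key, instead of always saving as `.jpg`, would also avoid mislabeling `.png` results.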


In [None]:
search_prompt = "building"

Query index and display results

In [None]:
markdown_content = ""
IMAGE_WIDTH = 500

try:
    results = query_open_search(
        bedrock_runtime=bedrock_runtime,
        oss_client=oss_client,
        index_name=INDEX_NAME,
        prompt=search_prompt,
        top_k=1
    )

    for idx, result in enumerate(results):
        description = result["_source"]["description"]
        description = description.replace('\n', '')
        image_uri = result["_source"]["image_s3_uri"]
        _desc = f"**Result {idx + 1}**: \n{description}"

        local_image_path = f"./image_download/result_{idx + 1}.jpg"
        !aws s3 cp "$image_uri" "$local_image_path"

        markdown_content += f"""
| <img src="{local_image_path}" width="{IMAGE_WIDTH}"/> |
|:----------------:|
| {_desc} |

---
        """
    display(Markdown(markdown_content))

except Exception as e:
    print(f"Query failed: {e}")

***

<div class="alert alert-success">
<b>🎉 Congratulations!</b> You have successfully completed the vector storage and search notebook!

Key accomplishments:
- ✅ Generated image embeddings using Titan Multimodal
- ✅ Created and configured OpenSearch index
- ✅ Ingested vector data
- ✅ Implemented semantic search functionality
- ✅ Demonstrated search capabilities

You can now proceed to the next notebook in the series.
</div>

    

In [None]:
# Optional cleanup: uncomment to delete the index and all ingested documents.
# response = oss_client.indices.delete(index=INDEX_NAME)