## Data Preparation For Image Search

Multi-modal data, such as images and diagrams, represent a significant untapped resource in most current Retrieval-Augmented Generation (RAG) systems. While these systems excel at processing and generating text, they often overlook the wealth of information contained in visual formats. Images and diagrams can convey complex concepts, relationships, and trends in ways that text alone cannot match. By incorporating multi-modal data, RAG systems could unlock a treasure trove of insights, enabling more comprehensive understanding and analysis.

In this lab, you will explore two techniques for handling image data in RAG: multi-modal embedding and grounding images with text.

- **Multi-modal embedding** directly encodes images into vector representations. This approach preserves the visual information and can capture nuanced features that might be lost in text descriptions. However, it requires specialized models and may struggle with complex or abstract visual concepts.

- **Grounding images with text** involves generating textual descriptions or captions for images, then using these descriptions for text-based embedding. This method leverages the power of existing text embedding models and can provide more interpretable representations. It's particularly effective for images with clear, describable content. However, it may lose some fine-grained visual details and depends on the quality of the image-to-text conversion.

- Note: You can also combine both techniques. A combined strategy can provide a more comprehensive representation, capturing both visual and textual aspects of the data. It allows for flexible querying and can potentially improve retrieval accuracy. The downside is increased computational complexity and storage requirements.
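To make the difference between the two techniques concrete, here is a minimal sketch of the request each one might send to Amazon Bedrock. It is not the lab's `helper` module, whose implementation is not shown here. The function names and the default region are hypothetical, and the request fields are assumptions based on the Titan Multimodal Embeddings and Titan Text Embeddings V2 request formats.

```python
import base64
import json


def build_multimodal_request(image_bytes: bytes, dimension: int = 1024) -> str:
    """Multi-modal embedding: the image itself is the model input (assumed Titan schema)."""
    return json.dumps({
        "inputImage": base64.b64encode(image_bytes).decode("utf-8"),
        "embeddingConfig": {"outputEmbeddingLength": dimension},
    })


def build_caption_request(caption: str, dimension: int = 1024) -> str:
    """Grounding with text: only the caption is the model input (assumed Titan schema)."""
    return json.dumps({"inputText": caption, "dimensions": dimension, "normalize": True})


def embed(body: str, model_id: str, region: str = "us-east-1") -> list:
    """Send a request body to Bedrock and return the embedding vector.

    The region default is an assumption; use the region of your workshop account.
    """
    import boto3  # imported lazily so the request builders work without AWS access

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=model_id,
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]
```

With AWS credentials configured, `embed(build_caption_request("a circus tent"), "amazon.titan-embed-text-v2:0")` would return a caption embedding, while the multi-modal variant would send the raw image bytes to `amazon.titan-embed-image-v1`.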

You will first analyze the top-K precision of the two techniques on the provided image data: 1) simple images synthetically generated by the Amazon Titan Image Generator model, and 2) complex images (architecture diagrams) from the [AWS Solutions Library](https://aws.amazon.com/solutions/). After comparing the pros and cons of both approaches, you will build a naive RAG system using Amazon Bedrock to retrieve these images with natural language queries.

| |  |
|----------|----------|
| ![Image 1](static/multi-modal.png)| ![Image 2](static/ground-to-text.png)|

## Pre-req
You must run the [workshop_setup.ipynb](../lab00-setup/workshop_setup.ipynb) notebook in `lab00-setup` before starting this lab.

In [None]:
import warnings
warnings.warn("Warning: if you did not run lab00-setup, please go back and run the lab00 notebook") 

## Load the parameters

In [None]:
print("load the data parameters....\n")
# bucket and parameters stored by the initial setup in lab00
%store -r root_dir
%store -r jsonl_files

## check that both values are printed and do not fail
print(root_dir)
print(jsonl_files)

print("\nload the vector db parameters....\n")

# vector parameters stored from Initial setup
%store -r vector_host
%store -r vector_collection_arn
%store -r vector_collection_id
%store -r bedrock_kb_execution_role_arn

## check all 4 values are printed and do not fail
print(vector_host)
print(vector_collection_arn)
print(vector_collection_id)
print(bedrock_kb_execution_role_arn)


### > Initialize parameters and import helper functions

In [None]:
import json
import boto3
import sys
import os
import io
from PIL import Image
import time
import shutil
import pandas as pd
from sagemaker.utils import name_from_base
from opensearch_util import OpenSearchManager

from helper import (
    _encode,
    download_file_from_s3,
    get_mm_embedding,
    get_text_embedding,
    evaluate_top_hit
)

os_manager = OpenSearchManager()

### > Load image manifest file

You will first load the simple and complex image manifest files. Each manifest file contains image information under `corpus`, the questions used to search for images under `queries`, and the ground-truth mapping between `queries` and `corpus` under `relevant_docs`.

Here is the structure of a manifest file. Each image entry under `corpus` contains the S3 and local locations of the image and the image caption.
```
{
    "corpus": {
        <image_id>: {
            "image-ref": <S3_location>,
            "image-path": <local_path>,
            "caption": <image_caption>
        },
        ...
    },
    "queries": {
        <query_id>: <sample_query>,
        ...
    },
    "relevant_docs": {
        <query_id>: [<image_id>, ...],
        ...
    }
}
```
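As an illustration of how the three sections relate, the sketch below walks a tiny in-memory manifest and pairs each query with the captions of its ground-truth images. The ids, S3 locations, and captions are made-up placeholders, and `ground_truth_pairs` is a hypothetical helper that is not part of the lab code.

```python
# A tiny manifest with the same shape as the structure described above.
# All ids, paths, and captions are illustrative placeholders.
manifest = {
    "corpus": {
        "img-1": {
            "image-ref": "s3://example-bucket/images/img-1.png",
            "image-path": "images/img-1.png",
            "caption": "A kid drawing pictures on the wall",
        },
        "img-2": {
            "image-ref": "s3://example-bucket/images/img-2.png",
            "image-path": "images/img-2.png",
            "caption": "A circus tent at night",
        },
    },
    "queries": {"q-1": "I want a picture of a kid drawing pictures on the wall"},
    "relevant_docs": {"q-1": ["img-1"]},
}


def ground_truth_pairs(manifest):
    """Yield (query text, captions of the relevant images) for every query."""
    for query_id, query_text in manifest["queries"].items():
        image_ids = manifest["relevant_docs"].get(query_id, [])
        yield query_text, [manifest["corpus"][image_id]["caption"] for image_id in image_ids]


pairs = list(ground_truth_pairs(manifest))
```

Each query id is the join key: it selects the query text from `queries` and the list of expected image ids from `relevant_docs`, and those image ids in turn index into `corpus`.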

After the manifest files are loaded, you will create four lists of index objects with embeddings: 

1. simple image - multimodal embedding
2. complex image - multimodal embedding
3. simple image - text embedding of the image caption
4. complex image - text embedding of the image caption

In [None]:
%%time
indexes=[]
for jsonl in jsonl_files:

    print(f"Prepare image manifest file: {jsonl}")
    
    jsonl_path = os.path.join(root_dir, jsonl)
    
    with open(jsonl_path, 'r') as f:
        dataset = json.load(f)
        
    image_data = dataset['corpus']

    for model_id in ["amazon.titan-embed-image-v1", "amazon.titan-embed-text-v2:0"]:
        
        index_obj = dict()
        index_obj["file"] = jsonl.split("/")[-1]
        index_obj["model_id"] = model_id
        index_obj["image_data"] = []
        index_obj["dataset"] = dataset
        
        for key, item in image_data.items():

            metadata = {
                'id': key,
                'image-ref': item['image-ref'],
                'caption': item['caption'],
            }

            if model_id == "amazon.titan-embed-image-v1":
                # Multi-modal embedding: embed the image bytes directly
                image = download_file_from_s3(item['image-ref'])
                image_base64 = _encode(image)
                metadata['vector_field'] = get_mm_embedding(image_base64=image_base64)
            else:
                # Grounding with text: embed the image caption instead
                metadata['vector_field'] = get_text_embedding(item['caption'], model_id=model_id)

            index_obj["image_data"].append(metadata)

        indexes.append(index_obj)

### > Create a vector index

You then iterate over the four index objects, creating a vector index in OpenSearch for each one and bulk ingesting its image data.

In [None]:
os_manager.initialize_client(host=vector_host)

In [None]:
index_body = {
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "image-ref": {
        "type": "text"
      },
      "caption": {
        "type": "text"
      },
      "vector_field": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "engine": "nmslib",
          "space_type": "cosinesimil", 
          "name": "hnsw",
          "parameters": {
            "ef_construction": 512,
            "m": 16
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 2,
      "knn.algo_param": {
        "ef_search": 512
      },
      "knn": True
    }
  }
}
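The `dimension` in the mapping has to match the length of every vector written to `vector_field`, so a quick check before ingestion can catch a mismatched embedding model or configuration early. The sketch below is one way to do that; `find_dimension_mismatches` is a hypothetical helper, and the sample records are made up.

```python
def find_dimension_mismatches(records, expected_dimension=1024, field="vector_field"):
    """Return the ids of records whose embedding length differs from the index mapping."""
    return [
        record["id"]
        for record in records
        if len(record.get(field, [])) != expected_dimension
    ]


# Illustrative records: one valid 1024-dimensional vector and one that is too short.
sample_records = [
    {"id": "img-1", "vector_field": [0.0] * 1024},
    {"id": "img-2", "vector_field": [0.0] * 384},
]
mismatched = find_dimension_mismatches(sample_records)
```

In the notebook, this could be applied to each `index["image_data"]` list before the index is created and populated.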

### > Bulk ingestion

In [None]:
for index in indexes:
    index_name = name_from_base(f"{index['file'].split('_')[0]}-{index['model_id'].split('.')[-1]}".replace(':0', ''))

    index["index_name"] = index_name

    resp = os_manager.create_index(index_name=index_name, index_body=index_body)
    time.sleep(40)
    
    success, failed = os_manager.bulk_index_ingestion(index_name=index_name,
                                                      data=index["image_data"])
    
    print(f"number of records successfully ingested: {success}, failed: {failed}")
    time.sleep(20)
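The internals of `os_manager.bulk_index_ingestion` are not shown in this lab. As a rough sketch of what a bulk ingestion step typically involves, the function below converts the metadata records into the action dictionaries accepted by `opensearchpy.helpers.bulk`. `build_bulk_actions` and the sample record are hypothetical, and this is an assumption about the approach rather than the helper's actual code.

```python
def build_bulk_actions(index_name, records):
    """Wrap each record in a bulk action that targets the given index."""
    return [{"_index": index_name, "_source": record} for record in records]


# Illustrative record with a shortened vector for readability.
actions = build_bulk_actions(
    "simple-titan-embed-text-v2",
    [{"id": "img-1", "image-ref": "s3://example-bucket/img-1.png",
      "caption": "A circus tent at night", "vector_field": [0.1, 0.2, 0.3]}],
)
```

With an initialized `opensearchpy.OpenSearch` client, passing these actions to `opensearchpy.helpers.bulk(client, actions)` would send them in batches and return the number of successfully indexed documents together with any errors.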

### > OpenSearch query template

In [None]:
# build opensearch query
os_query = {
    "size": 5,
    "query":{
        "knn": {
        "vector_field": {
            "vector": [],
            "k": 5
        }
        }
    },
    "_source": ["id", 
                "image-ref", 
                "caption"]}

### > Run Top K Precision Benchmark

Once the indexes are ready, you can evaluate image retrieval performance across all indexes and models. The code below iterates through the list of indexes and, for each one, calls the `evaluate_top_hit` function to calculate the percentage of top hits (correct matches) for different values of k (1, 5, and 10).

The results are compared with the ground truth in `relevant_docs` to get the percentage of top hits for each value of k.
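The metric itself is simple to state in code. The sketch below assumes a query counts as a hit when at least one of its ground-truth images appears among the first k retrieved ids. That is an assumption about what `evaluate_top_hit` measures, since its implementation is not shown here, and the function names and sample data are hypothetical.

```python
def is_hit(retrieved_ids, relevant_ids, k):
    """True if any ground-truth image id appears in the first k retrieved ids."""
    relevant = set(relevant_ids)
    return any(doc_id in relevant for doc_id in retrieved_ids[:k])


def hit_rate_at_k(retrieved, relevant_docs, k):
    """Fraction of queries with at least one relevant image in the top k results."""
    hits = [is_hit(retrieved[query_id], relevant_docs[query_id], k) for query_id in relevant_docs]
    return sum(hits) / len(hits)


# Illustrative ranked results for two queries, best match first.
retrieved = {"q-1": ["img-1", "img-7", "img-3"], "q-2": ["img-9", "img-4", "img-2"]}
relevant_docs = {"q-1": ["img-1"], "q-2": ["img-2"]}

rate_at_1 = hit_rate_at_k(retrieved, relevant_docs, k=1)
rate_at_3 = hit_rate_at_k(retrieved, relevant_docs, k=3)
```

In this example only `q-1` has its relevant image ranked first, so the hit rate is 0.5 at k=1 and rises to 1.0 at k=3. The rate can only stay the same or increase as k grows.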

In [None]:
benchmark = []
for index in indexes:
    
    test_output = dict()
    model_id = index["model_id"]
    index_name = index["index_name"]
    test_output["index_name"] = index_name
    
    for k in [1, 5, 10]:

        eval_results = evaluate_top_hit(os_manager, os_query, index["dataset"], index_name, top_k=k, model_id=model_id)
        df_base = pd.DataFrame(eval_results)
        top_hits = df_base['is_hit'].mean()

        
        test_output[f"top_{k}"] = top_hits
    
        print(f"{index_name} at top {k}, the percent of top hits: {top_hits*100:.2f} %")
    benchmark.append(test_output)

Here are the results. For simple images, both multi-modal embeddings and text embeddings of the image captions performed perfectly at top 1, 5, and 10. On complex diagrams, however, multi-modal accuracy drops more, as the embedding is no longer able to capture all the details in the image.

In [None]:
pd.options.display.float_format = '{:.0%}'.format
df = pd.DataFrame(benchmark)
df

### > Utility function to properly display the retrieved images

In [None]:
from IPython.display import HTML, display  # Image is not imported here to avoid shadowing PIL's Image

def display_s3_images_grid(results, images_per_row=5, img_width=600):
    """
    Display images from a list of S3 URLs in a side-by-side grid in a Jupyter notebook output cell.
    
    Args:
    results (list): OpenSearch hits, each with an S3 URL ('s3://bucket-name/key') under _source["image-ref"]
    images_per_row (int): Number of images to display per row
    img_width (int): Width of each image in pixels
    """
    
    html_table = "<table><tr>"
    for i, image_output in enumerate(results):
        
        image = download_file_from_s3(image_output["_source"]["image-ref"])
            
        image_base64 = _encode(image)
        
        # Add image to HTML
        html_table += f"<td style='padding:5px;'><img src='data:image/jpeg;base64,{image_base64}' width='{img_width}px'/></td>"
        
        # Start a new row after every 'images_per_row' images
        if (i + 1) % images_per_row == 0:
            html_table += "</tr><tr>"
    
    html_table += "</tr></table>"
    display(HTML(html_table))

### > Create a Naive RAG system for Simple Image data

Here are some sample questions:
- I want a picture of a kid drawing pictures on the wall
- Give me an image of a circus
- Picture of animals happily playing instruments
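The query cells in this notebook update the shared `os_query` template in place before each search. A possible alternative, sketched below, is to build a fresh query per search so the template is never modified. `build_knn_query` is a hypothetical helper, the template mirrors the one defined earlier in this notebook, and the short vector is a placeholder for a real embedding.

```python
import copy

# Same shape as the os_query template defined earlier in this notebook.
OS_QUERY_TEMPLATE = {
    "size": 5,
    "query": {"knn": {"vector_field": {"vector": [], "k": 5}}},
    "_source": ["id", "image-ref", "caption"],
}


def build_knn_query(vector, top_k, template=OS_QUERY_TEMPLATE):
    """Return a new k-NN query for the given embedding without mutating the template."""
    query = copy.deepcopy(template)
    query["size"] = top_k
    query["query"]["knn"]["vector_field"]["vector"] = list(vector)
    query["query"]["knn"]["vector_field"]["k"] = top_k
    return query


# Placeholder vector; in the lab this would come from get_text_embedding(query, ...).
example_query = build_knn_query([0.1, 0.2, 0.3], top_k=3)
```

Because the template is deep-copied, a later search with a different `top_k` cannot be affected by values left over from an earlier one.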

In [None]:
query = "I want a picture of a kid drawing pictures on the wall"
top_k = 3

os_query["query"]["knn"]["vector_field"]["vector"] = get_text_embedding(query, model_id="amazon.titan-embed-text-v2:0")
os_query["size"] = top_k
os_query["query"]["knn"]["vector_field"]["k"] = top_k

for index in indexes:
    if "simple-titan-embed-text" in index["index_name"]:
        results = os_manager.opensearch_query(os_query,
                                              index_name=index["index_name"])
display_s3_images_grid(results, images_per_row=3)

### > Create a Naive RAG system for complex image (architecture diagrams)

Here are some sample questions:
- a solution that can generate slow-motion from existing video
- a solution to increase resolution for existing videos
- a solution to create personalized avatar images

In [None]:
query = "a solution that can generate slow-motion from existing video"
top_k = 3

os_query["query"]["knn"]["vector_field"]["vector"] = get_text_embedding(query, model_id="amazon.titan-embed-text-v2:0")
os_query["size"] = top_k
os_query["query"]["knn"]["vector_field"]["k"] = top_k

for index in indexes:
    if "complex-titan-embed-text" in index["index_name"]:
        results = os_manager.opensearch_query(os_query,
                                              index_name=index["index_name"])
        
display_s3_images_grid(results, images_per_row=3)