# Multi-Modal Models with Amazon Bedrock

This notebook demonstrates how to work with **multi-modal AI models** using Amazon Bedrock. Multi-modal models can understand and generate content across different modalities (text, images, etc.).

## What You'll Learn

1. **Text-to-Image Generation**: Using Amazon Titan Image Generator to create product images from text descriptions
2. **Multi-Modal Embeddings**: Converting both text and images into vector representations using Amazon Titan Multimodal Embeddings
3. **Semantic Search**: Finding similar images using text queries through embedding similarity

## Use Case: E-Commerce Product Catalog

We'll simulate building an e-commerce search system where:
- Product images are generated from descriptions
- Images are converted to embeddings for efficient similarity search
- Users can search for products using natural language queries

## Prerequisites

- AWS account with Bedrock access enabled
- Model access enabled for:
  - Amazon Titan Image Generator V2
  - Amazon Titan Multimodal Embeddings

## Restart Kernel (Optional)

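The following cell restarts the Jupyter kernel to ensure a clean environment. This is useful when re-running the notebook to clear any cached variables or imports. It relies on the classic Notebook JavaScript API (`Jupyter.notebook`), so it may have no effect in JupyterLab or other front ends; use the kernel menu there instead.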

In [None]:
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

## Step 1: Import Required Libraries

We need several libraries for this notebook:

| Library | Purpose |
|---------|----------|
| `boto3` | AWS SDK for Python - connects to Amazon Bedrock |
| `numpy` | Numerical operations for embedding calculations |
| `PIL` | Python Imaging Library for handling generated images |
| `scipy` | Scientific computing - used for distance calculations in similarity search |
| `seaborn/matplotlib` | Visualization libraries for plotting similarity heatmaps |

The code includes fallback implementations if optional libraries (scipy, seaborn) are not available.

In [None]:
# Standard library imports
import os
import re
import sys
import json
import base64
from io import BytesIO

# Other library imports
import boto3
import numpy as np
from PIL import Image

# Print SDK versions
print(f"Python version: {sys.version.split()[0]}")
print(f"Boto3 SDK version: {boto3.__version__}")
print(f"NumPy version: {np.__version__}")

# Try to import scipy with error handling
try:
    from scipy.spatial.distance import cdist
    print("SciPy imported successfully")
except Exception as e:
    print(f"SciPy import failed: {e}")
    print("Using NumPy fallback for distance calculations")
    
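    # NumPy-based fallback covering the metrics this notebook uses
    def cdist_numpy(XA, XB, metric='euclidean'):
        """Pairwise distances between the rows of XA and XB (fallback for scipy's cdist)"""
        XA = np.asarray(XA, dtype=float)
        XB = np.asarray(XB, dtype=float)
        if metric == 'euclidean':
            return np.sqrt(((XA[:, np.newaxis, :] - XB[np.newaxis, :, :]) ** 2).sum(axis=2))
        elif metric == 'cosine':
            # Cosine distance = 1 - cosine similarity (this is the metric search() uses in Step 10)
            norms = np.linalg.norm(XA, axis=1)[:, np.newaxis] * np.linalg.norm(XB, axis=1)[np.newaxis, :]
            return 1.0 - (XA @ XB.T) / norms
        else:
            raise ValueError(f"Metric '{metric}' not implemented in fallback")
    
    cdist = cdist_numpy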

# Try to import seaborn with error handling
try:
    import seaborn as sns
    print("Seaborn imported successfully")
except Exception as e:
    print(f"Seaborn import failed: {e}")
    print("Continuing without seaborn - matplotlib will be used instead")
    sns = None
    

## Step 2: Initialize Amazon Bedrock Client

Before we can use any Bedrock models, we need to create a **Bedrock Runtime client**. This client handles all API communication with Amazon Bedrock.

**Key Concepts:**
- `boto3.session.Session()` - Creates an AWS session using your configured credentials
- `bedrock-runtime` - The service endpoint for invoking Bedrock models (different from the management API)
- The region is read from your AWS configuration; `region_name` is `None` if no default region is set

**Note:** Ensure your AWS credentials are configured via `aws configure` or environment variables.
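
Because the session's `region_name` is `None` when no default region is configured, it can help to fail early with a clear message rather than at the first model call. The following is a minimal sketch; `resolve_region` and the `"us-east-1"` fallback are hypothetical names and values chosen for illustration, not part of boto3.

```python
def resolve_region(session_region, fallback=None):
    """Return the region to use, preferring the session's configured region.

    `session_region` is what `boto3_session.region_name` would return
    (possibly None); `fallback` is an optional, caller-chosen default.
    """
    if session_region:
        return session_region
    if fallback:
        return fallback
    raise ValueError(
        "No AWS region configured; run `aws configure` or pass a region explicitly"
    )


# Example values only: a configured region wins over the fallback
configured = resolve_region("eu-west-1", fallback="us-east-1")
defaulted = resolve_region(None, fallback="us-east-1")
print(configured, defaulted)
```

The resolved value can then be passed as `region_name` when creating the client.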

In [None]:
# Init boto session
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name

# Init Bedrock Runtime client
bedrock_client = boto3_session.client("bedrock-runtime", region_name=region_name)

print("AWS Region:", region_name)

## Step 3: Define Product Catalog Data

For this demo, we'll use a pre-defined list of **21 product descriptions** (7 product categories × 3 variants each). In a real application, this data might come from:
- A database
- An LLM generating product descriptions
- A product information management (PIM) system

**Product Categories:**
1. T-shirts (3 variants)
2. Jeans (3 variants)
3. Sneakers (3 variants)
4. Backpacks (3 variants)
5. Smartwatches (3 variants)
6. Coffee makers (3 variants)
7. Yoga mats (3 variants)

Each description is detailed enough to generate a distinctive product image.

In [None]:
response = 'Here is a list of 7 items with 3 variants each for an online e-commerce shop, with separate full sentence descriptions:\n\n1. T-shirt\n- A red cotton t-shirt with a crew neck and short sleeves. \n- A blue cotton t-shirt with a v-neck and short sleeves.\n- A black polyester t-shirt with a scoop neck and cap sleeves.\n\n2. Jeans\n- Classic blue relaxed fit denim jeans with a mid-rise waist. \n- Black skinny fit denim jeans with a high-rise waist and ripped details at the knees.  \n- Stonewash straight leg denim jeans with a standard waist and front pockets.\n\n3. Sneakers  \n- White leather low-top sneakers with an almond toe cap and thick rubber outsole.\n- Gray mesh high-top sneakers with neon green laces and a padded ankle collar. \n- Tan suede mid-top sneakers with a round toe and ivory rubber cupsole.  \n\n4. Backpack\n- A purple nylon backpack with padded shoulder straps, front zipper pocket and laptop sleeve.\n- A gray canvas backpack with brown leather trims, side water bottle pockets and drawstring top closure.  \n- A black leather backpack with multiple interior pockets, top carry handle and adjustable padded straps.\n\n5. Smartwatch\n- A silver stainless steel smartwatch with heart rate monitor, GPS tracker and sleep analysis.  \n- A space gray aluminum smartwatch with step counter, phone notifications and calendar syncing. \n- A rose gold smartwatch with activity tracking, music controls and customizable watch faces.  \n\n6. Coffee maker\n- A 12-cup programmable coffee maker in brushed steel with removable water tank and keep warm plate.  \n- A compact 5-cup single serve coffee maker in matt black with travel mug auto-dispensing feature.\n- A retro style stovetop percolator coffee pot in speckled enamel with stay-cool handle and glass knob lid.  \n\n7. Yoga mat \n- A teal 4mm thick yoga mat made of natural tree rubber with moisture-wicking microfiber top.\n- A purple 6mm thick yoga mat made of eco-friendly TPE material with integrated carrying strap. \n- A patterned 5mm thick yoga mat made of PVC-free material with towel cover included.'
print(response)

## Step 4: Extract Individual Product Descriptions

The product data above is in a formatted string. We need to extract each individual product description to use as prompts for image generation.

**How it works:**
- Uses a **regular expression** to find all lines starting with `- `
- Extracts the text after the dash as individual descriptions
- Returns a list of 21 product descriptions

This parsing step is common when working with LLM-generated structured content.
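
To see the idea on a small input, here is a sketch that pulls bullet lines out of a made-up three-item string. The helper name `extract_bullets` and the sample text are illustrative assumptions; the pattern is anchored to line starts and trims trailing spaces, so category headings such as `1. T-shirt` are skipped.

```python
import re


def extract_bullets(text):
    """Return the text of every line that starts with '- ', without trailing spaces."""
    return re.findall(r"^- (.+?)[ \t]*$", text, flags=re.MULTILINE)


# Made-up sample in the same layout as the catalog string
sample = (
    "1. T-shirt\n"
    "- A red cotton t-shirt. \n"
    "- A blue cotton t-shirt.\n"
    "\n"
    "2. Jeans\n"
    "- Classic blue jeans."
)

bullets = extract_bullets(sample)
print(bullets)
```

Hyphens inside words (as in `t-shirt`) do not trigger a match because the pattern requires the dash at the start of a line followed by a space.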

In [None]:
def extract_text(input_string):
    # Match each bullet line ("- ...") and capture the text after the dash.
    # Trailing spaces are dropped so they don't end up in prompts.
    pattern = r"^- (.+?)[ \t]*$"
    return re.findall(pattern, input_string, flags=re.MULTILINE)


### Run the Extraction

Let's extract all product descriptions and verify we have 21 items (7 categories × 3 variants).

In [None]:
product_descriptions = extract_text(response)
print(f"Extracted {len(product_descriptions)} product descriptions")
product_descriptions

## Step 5: Define the Image Generation Function

This function wraps the **Amazon Titan Image Generator V2** model to create images from text descriptions.

### API Parameters Explained

| Parameter | Description | Range/Options |
|-----------|-------------|---------------|
| `numberOfImages` | How many images to generate per request | 1-5 |
| `quality` | Image quality level | `standard` or `premium` |
| `height` / `width` | Output image dimensions in pixels | A fixed set of supported sizes (1024 × 1024 used here) |
| `cfgScale` | Classifier-free guidance scale - higher values follow the prompt more closely | 1.1-10.0 |
| `seed` | Random seed for reproducibility | 0-2147483646 |

### Task Type

We use `TEXT_IMAGE` task type which generates images from text descriptions. Other task types include:
- `INPAINTING` - Edit parts of existing images
- `OUTPAINTING` - Extend images beyond their borders
- `IMAGE_VARIATION` - Create variations of existing images
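
As an illustration of how the parameters above fit together, the sketch below assembles a `TEXT_IMAGE` request body and rejects values outside the ranges listed in the table. The helper `build_text_image_request` and its default values are assumptions made for this example; it only builds JSON and does not call Bedrock.

```python
import json


def build_text_image_request(prompt, num_images=1, cfg_scale=8.0, seed=0,
                             quality="standard", size=1024):
    """Build a JSON request body for a TEXT_IMAGE task (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 1 <= num_images <= 5:
        raise ValueError("numberOfImages must be between 1 and 5")
    if quality not in ("standard", "premium"):
        raise ValueError("quality must be 'standard' or 'premium'")
    if not 1.0 < cfg_scale <= 10.0:
        raise ValueError("cfgScale must be above 1.0 and at most 10.0")

    return json.dumps({
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {"text": prompt},
        "imageGenerationConfig": {
            "numberOfImages": num_images,
            "quality": quality,
            "height": size,
            "width": size,
            "cfgScale": cfg_scale,
            "seed": seed,
        },
    })


body = build_text_image_request("A red cotton t-shirt with a crew neck")
print(body)
```

Validating locally gives a readable error before a request is sent, which is convenient when looping over many prompts.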

In [None]:
def titan_generate_image(payload, num_image=2, cfg=10.0, seed=2024):

    body = json.dumps(
        {
            **payload,
            "imageGenerationConfig": {
                "numberOfImages": num_image,   # Number of images to be generated. Range: 1 to 5 
                "quality": "premium",          # Quality of generated images. Can be standard or premium.
                "height": 1024,                # Height of output image(s)
                "width": 1024,                 # Width of output image(s)
                "cfgScale": cfg,               # Scale for classifier-free guidance. Range: 1.1 to 10.0
                "seed": seed                   # The seed to use for reproducibility. Range: 0 to 2147483646
            }
        }
    )

    response = bedrock_client.invoke_model(
        body=body, 
        modelId="amazon.titan-image-generator-v2:0",
        accept="application/json", 
        contentType="application/json"
    )

    response_body = json.loads(response.get("body").read())
    images = [
        Image.open(
            BytesIO(base64.b64decode(base64_image))
        ) for base64_image in response_body.get("images")
    ]

    return images

## Step 6: Generate Product Images

Now we'll generate an image for each of our 21 product descriptions. This step:

1. Creates a directory to store the generated images (`data/titan-embed/`)
2. Loops through each product description
3. Calls the Titan Image Generator with the description as the prompt
4. Saves each image with a filename based on the first 4 words of the description

**Note:** This step may take several minutes as it makes 21 API calls to generate images.

**Cost Consideration:** Image generation incurs costs per image. In production, you might want to:
- Cache generated images
- Use lower quality settings for testing
- Batch requests where possible
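
One simple way to cache generated images is to skip the API call when the target file already exists. The sketch below shows that pattern with a stand-in generator instead of a real Bedrock call; `slug_path`, `generate_if_missing`, and `fake_generator` are hypothetical names introduced for illustration.

```python
import os
import tempfile


def slug_path(directory, prompt, n_words=4):
    """Build a file path from the first n_words of the prompt."""
    name = "_".join(prompt.split()[:n_words]).lower()
    return os.path.join(directory, f"{name}.png")


def generate_if_missing(path, generate_fn):
    """Call generate_fn() and write its bytes only if path does not exist yet.

    Returns True when a new file was generated, False on a cache hit.
    """
    if os.path.exists(path):
        return False
    data = generate_fn()
    with open(path, "wb") as f:
        f.write(data)
    return True


# Stand-in for an image generation call; counts how often it is invoked
calls = []

def fake_generator():
    calls.append(1)
    return b"placeholder image bytes"


with tempfile.TemporaryDirectory() as tmp:
    path = slug_path(tmp, "A red cotton t-shirt with a crew neck")
    first = generate_if_missing(path, fake_generator)   # generates the file
    second = generate_if_missing(path, fake_generator)  # cache hit, no call

print(os.path.basename(path), first, second, len(calls))
```

With this pattern, re-running the generation loop after an interruption only pays for the images that are still missing.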

In [None]:
embed_dir = "data/titan-embed"
os.makedirs(embed_dir, exist_ok=True)

titles = []
for i, prompt in enumerate(product_descriptions, 1):
    images = titan_generate_image(
        {
            "taskType": "TEXT_IMAGE",
            "textToImageParams": {
                "text": prompt, # Required
            }
        },
        num_image=1
    )
    title = "_".join(prompt.split()[:4]).lower()
    title = f"{embed_dir}/{title}.png"
    titles.append(title)
    images[0].save(title, format="png")
    print(f"[{i}/{len(product_descriptions)}] Generated: '{title}'..")

## Step 7: Define the Multi-Modal Embedding Function

**Embeddings** are numerical vector representations of content (text or images) that capture semantic meaning. Similar items have similar embeddings, enabling:
- Semantic search (find images matching text queries)
- Similarity comparison between products
- Clustering and recommendation systems

### Amazon Titan Multimodal Embeddings

This model creates embeddings in a **shared vector space** for both text and images. This means:
- A text description and its corresponding image will have similar embeddings
- You can search images using text queries (and vice versa)

### Function Parameters

| Parameter | Description | Constraints |
|-----------|-------------|-------------|
| `image_path` | Path to image file | Max 2048×2048 pixels |
| `description` | Text to embed | English only, max 128 tokens |
| `dimension` | Output vector size | 256, 384, or 1024 (default) |

**Flexibility:** You can provide text only, image only, or both together.
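
The three input combinations can be sketched as a small request builder. This is an illustration only: `build_embedding_request` is a hypothetical helper that mirrors the field names used in this notebook (`inputImage`, `inputText`, `embeddingConfig`) and does not call Bedrock.

```python
import base64
import json

ALLOWED_DIMENSIONS = (256, 384, 1024)


def build_embedding_request(image_bytes=None, description=None, dimension=1024):
    """Build a JSON body with text, image, or both (hypothetical helper)."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimension must be one of {ALLOWED_DIMENSIONS}")

    body = {}
    if image_bytes:
        # Images are sent as base64-encoded strings
        body["inputImage"] = base64.b64encode(image_bytes).decode("utf8")
    if description:
        body["inputText"] = description
    if not body:
        raise ValueError("provide an image, a text description, or both")

    body["embeddingConfig"] = {"outputEmbeddingLength": dimension}
    return json.dumps(body)


text_only = json.loads(build_embedding_request(description="suede sneaker"))
image_only = json.loads(build_embedding_request(image_bytes=b"raw image bytes", dimension=384))
both = json.loads(build_embedding_request(image_bytes=b"raw image bytes", description="suede sneaker"))
print(sorted(text_only), sorted(image_only), sorted(both))
```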

In [None]:
def titan_multimodal_embedding(
    image_path=None,  # maximum 2048 x 2048 pixels
    description=None, # English only and max input tokens 128
    dimension=1024,   # 1024 (default), 384, 256
    model_id="amazon.titan-embed-image-v1"
):
    payload_body = {}
    embedding_config = {
        "embeddingConfig": { 
             "outputEmbeddingLength": dimension
         }
    }

    # You can specify either text or image or both
    if image_path:
        with open(image_path, "rb") as image_file:
            input_image = base64.b64encode(image_file.read()).decode('utf8')
        payload_body["inputImage"] = input_image
    if description:
        payload_body["inputText"] = description

    assert payload_body, "please provide either an image and/or a text description"
    print("\n".join(payload_body.keys()))

    response = bedrock_client.invoke_model(
        body=json.dumps({**payload_body, **embedding_config}), 
        modelId=model_id,
        accept="application/json", 
        contentType="application/json"
    )

    return json.loads(response.get("body").read())

## Step 8: Generate Embeddings for All Product Images

Now we'll create embeddings for each of our generated product images. These embeddings will serve as our **search index** - when a user searches with text, we'll compare the text embedding against these image embeddings.

**Process:**
1. Loop through all saved image files
2. Generate a 1024-dimensional embedding for each image
3. Store embeddings in a list for later similarity search

**Why 1024 dimensions?** Higher dimensions capture more nuanced features but require more storage and computation. 1024 is a good balance for detailed image search.
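
To make the storage side of that trade-off concrete, the following back-of-the-envelope sketch estimates index size for each supported dimension. It assumes vectors are stored as 32-bit floats (4 bytes per value), which is an assumption about the storage layer rather than something the model dictates; `index_size_bytes` is a hypothetical helper.

```python
def index_size_bytes(num_vectors, dimension, bytes_per_value=4):
    """Raw size of an embedding index, ignoring any metadata or overhead."""
    return num_vectors * dimension * bytes_per_value


sizes_small = {dim: index_size_bytes(21, dim) for dim in (256, 384, 1024)}
sizes_large = {dim: index_size_bytes(1_000_000, dim) for dim in (256, 384, 1024)}

for dim in (256, 384, 1024):
    print(f"dim={dim}: 21 items = {sizes_small[dim]} bytes, "
          f"1M items = {sizes_large[dim] / 1e9:.3f} GB")
```

For the 21 images in this notebook the difference is negligible; it starts to matter for catalogs with millions of items.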

In [None]:
multimodal_embeddings = []
for title in titles:
    embedding = titan_multimodal_embedding(image_path=title, dimension=1024)["embedding"]
    multimodal_embeddings.append(embedding)
    print(f"generated embedding for {title}")

### Inspect the Generated Embeddings

Let's verify our embeddings were generated correctly by checking:
- Total number of embeddings (should be 21)
- Dimension of each embedding (should be 1024)
- Sample values from one embedding vector

In [None]:
print("Number of generated embeddings for images:", len(multimodal_embeddings))
print("Dimension of each image embedding:", len(multimodal_embeddings[-1]))
print("Example of generated embedding:\n", np.array(multimodal_embeddings[-1]))

## Step 9: Visualize Embedding Similarity

A **similarity heatmap** helps us understand how similar our product images are to each other. This visualization:

- Uses **inner product** (dot product) to measure similarity between embeddings
- Higher values (darker colors) indicate more similar items
- The diagonal should be darkest (each item is most similar to itself)
- Products in the same category should show higher similarity

**Interpreting the Heatmap:**
- Values close to 1.0 = very similar (this scale assumes unit-length embeddings, for which the inner product equals cosine similarity)
- Values close to 0.0 = dissimilar
- Block patterns may emerge showing category clusters
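
The 0-to-1 reading of the heatmap depends on vector length: the inner product of two vectors only equals their cosine similarity when both have unit length. The toy sketch below, using made-up 2-D vectors and hypothetical helpers `dot` and `normalize`, shows the difference.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice as long

raw = dot(a, b)                           # grows with vector length
unit = dot(normalize(a), normalize(b))    # approximately 1.0 for parallel vectors
print(raw, unit)
```

If the values on the diagonal of the heatmap are not close to 1, normalizing the embeddings before taking the inner product restores the interpretation given above.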

In [None]:
def plot_similarity_heatmap(embeddings_a, embeddings_b):
    inner_product = np.inner(embeddings_a, embeddings_b)
    sns.set(font_scale=1.1)
    graph = sns.heatmap(
        inner_product,
        vmin=np.min(inner_product),
        vmax=1,
        cmap="OrRd",
    )

### Generate and Display the Heatmap

This cell sets up matplotlib/seaborn and creates the similarity heatmap. If seaborn is not available, it falls back to matplotlib's `imshow` function.

In [None]:
# Plotting imports (no backend is forced here; Jupyter's default inline backend is used)
import matplotlib.pyplot as plt
import numpy as np

# Try to import seaborn with error handling
try:
    import seaborn as sns
    print("Seaborn imported successfully")
except Exception as e:
    print(f"Seaborn import failed: {e}")
    print("Using matplotlib only for plotting")
    sns = None

# Use the real image embeddings from Step 8. They must not be overwritten with
# sample data, because the semantic search in later steps compares against them.
print(f"Plotting {len(multimodal_embeddings)} embeddings of dimension {len(multimodal_embeddings[0])}")

# Define your function with proper imports
def plot_similarity_heatmap(embeddings_a, embeddings_b):
    """
    Plot similarity heatmap between two sets of embeddings
    """
    # Calculate inner product
    inner_product = np.inner(embeddings_a, embeddings_b)
    
    # Set up the plot
    plt.figure(figsize=(10, 8))
    
    if sns is not None:
        # Use seaborn if available
        sns.set(font_scale=1.1)
        graph = sns.heatmap(
            inner_product,
            vmin=np.min(inner_product),
            vmax=1,
            cmap="OrRd",
        )
    else:
        # Use matplotlib only if seaborn is not available
        graph = plt.imshow(
            inner_product,
            vmin=np.min(inner_product),
            vmax=1,
            cmap="OrRd",
            aspect='auto'
        )
        plt.colorbar(graph)
    
    plt.title("Similarity Heatmap")
    
    # Render the figure (displays inline in Jupyter)
    plt.show()
    
    return graph

# Run your function
plot_similarity_heatmap(multimodal_embeddings, multimodal_embeddings)

## Step 10: Implement Semantic Search

Now we'll implement **semantic search** - the ability to find images using natural language queries.

### How Semantic Search Works

1. **Query Embedding**: Convert the user's text query into an embedding using the same model
2. **Distance Calculation**: Compute the distance between the query embedding and all image embeddings
3. **Ranking**: Return the images with the smallest distance (most similar)

### Distance Metric: Cosine Distance

We use **cosine distance** which measures the angle between two vectors:
- 0 = identical direction (most similar)
- 1 = orthogonal (unrelated)
- 2 = opposite direction (most dissimilar)

Cosine distance is preferred for embeddings because it's independent of vector magnitude.
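
The three reference values and the ranking step can be checked on toy data. The sketch below uses made-up 2-D vectors and plain Python; `cosine_distance` and `top_k` are hypothetical helpers written for this illustration, separate from the `search` function defined in this notebook.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query, index, k=1):
    """Return (position, distance) pairs for the k closest vectors in index."""
    dists = [cosine_distance(query, v) for v in index]
    order = sorted(range(len(index)), key=lambda i: dists[i])[:k]
    return [(i, dists[i]) for i in order]


query = [1.0, 0.0]
index = [
    [0.0, 1.0],   # orthogonal to the query  -> distance 1
    [2.0, 0.0],   # same direction, longer   -> distance 0
    [-1.0, 0.0],  # opposite direction       -> distance 2
    [1.0, 1.0],   # 45 degrees away          -> distance about 0.29
]

all_distances = [cosine_distance(query, v) for v in index]
results = top_k(query, index, k=2)
print(all_distances)
print(results)
```

Note that the second vector ranks first even though it is twice as long as the query, which is the magnitude independence mentioned above.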

In [None]:
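def search(query_emb: np.ndarray, indexes: np.ndarray, top_k: int = 1):
    # cdist expects 2-D inputs, so promote a single query vector to shape (1, dim)
    query_emb = np.atleast_2d(query_emb)
    indexes = np.atleast_2d(indexes)
    dist = cdist(query_emb, indexes, metric="cosine")
    # Indices of the top_k closest items (first query) and the top_k smallest distances
    return dist.argsort(axis=-1)[0, :top_k], np.sort(dist, axis=-1)[:, :top_k]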

## Step 11: Test Semantic Search

Let's test our search system with a natural language query: **"suede sneaker"**

**What happens:**
1. The query text is converted to a 1024-dimensional embedding
2. This embedding is compared against all 21 product image embeddings
3. The most similar images are returned

**Expected result:** The system should return the "Tan suede mid-top sneakers" image since it matches the query semantically.

Try experimenting with different queries like:
- "leather bag"
- "fitness watch"
- "exercise mat"

In [None]:
query_prompt = "suede sneaker"
query_emb = titan_multimodal_embedding(description=query_prompt, dimension=1024)["embedding"]
len(query_emb)