# 2: Analyzing and Visualizing Text Embeddings for Semantic Similarity using Amazon Bedrock Embeddings

- SageMaker Notebook Kernel: `conda_python3`
- SageMaker Notebook Instance Type: ml.m5d.large | ml.t3.large

In this notebook, you'll explore semantic text similarity by generating, analyzing, and visualizing embeddings for a collection of sentences. It leverages [Amazon Bedrock](https://aws.amazon.com/bedrock/) embeddings for generating high-dimensional vector representations of textual data. For visualization, the notebook employs t-Distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique, to plot embeddings in a 2D space. It further investigates similarity metrics by computing and visualizing cosine similarity between texts. 

## Runtime 

This notebook takes approximately 10 minutes to run.

## Contents

1. [Prerequisites](#prerequisites)
1. [Setup](#setup)
1. [Embeddings](#embeddings)
1. [Visualize similarity with t-SNE](#visualize-similarity-with-t-sne)
1. [Compute the cosine similarity between texts](#compute-the-cosine-similarity-between-texts)
1. [Test different texts](#test-different-texts)

## Prerequisites

`amazon.titan-embed-text-v1` enabled in the Amazon Bedrock console in `us-west-2`


## Setup

Let's start by installing and importing the required packages for this notebook. 

<div class="alert alert-block alert-warning"><b>Note:</b> Verify that the notebook kernel is `conda_python3`. Also, if you run into an issue where a module can't be imported after installation, restart the notebook kernel, then rerun the import notebook cell.</div>

In [None]:
%pip install --upgrade sagemaker --quiet
%pip install pandas --quiet
%pip install numexpr --quiet
%pip install scikit-learn --quiet
%pip install matplotlib --quiet
%pip install seaborn --quiet

In [None]:
import json
import boto3
import warnings
import pandas as pd
import numpy as np
import seaborn as sns

from matplotlib import pyplot as plt 
from IPython.display import display
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.manifold import TSNE
    

***

Next, we will initialize the Amazon Bedrock boto3 client. The embeddings model we will use is `amazon.titan-embed-text-v1`.

***

In [None]:
bedrock_client = boto3.client("bedrock-runtime", region_name="us-west-2")

embeddings_model_id = "amazon.titan-embed-text-v1"

## Embeddings

Create a helper function called `get_embedding`, which formats our request to the embeddings model and extracts the embedding vector from the response.

In [None]:
def get_embedding(text):
    input_body = {"inputText": text}

    response = bedrock_client.invoke_model(
        body=json.dumps(input_body),
        modelId=embeddings_model_id,
        accept="application/json",
        contentType="application/json",
    )
    response_body = json.loads(response.get("body").read())
    return response_body.get("embedding")

***
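
The request and response handling inside `get_embedding` can be tried offline. The sketch below is only an illustration: `parse_embedding` and `fake_response` are hypothetical names, the numbers are made up, and the fake response assumes nothing beyond what `get_embedding` itself reads (a `body` stream containing JSON with an `embedding` key).

```python
import io
import json


def parse_embedding(response):
    """Extract the embedding list from an invoke_model-style response dict."""
    response_body = json.loads(response["body"].read())
    return response_body.get("embedding")


# The request body is plain JSON with a single "inputText" field
request_body = json.dumps({"inputText": "What is Amazon Bedrock?"})
print(request_body)

# Hypothetical stand-in for a real response: "body" is a readable byte stream
# holding JSON, mirroring how get_embedding reads it. The values are made up.
fake_response = {
    "body": io.BytesIO(json.dumps({"embedding": [0.1, -0.2, 0.3]}).encode("utf-8"))
}

vector = parse_embedding(fake_response)
print(vector)
```

Because the body is a stream, it can be read only once; that is why the helper parses it immediately and returns a plain Python list.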

Get the embedding for the question `What is Amazon Bedrock?` and take a look at the output.

***

In [None]:
response = get_embedding("What is Amazon Bedrock?")

embeddings = np.array(response)
embeddings_dimensions = len(embeddings)

display(embeddings)
print(f"Vector Dimensions: {embeddings_dimensions}")

***

Text embeddings are numerical representations of text that capture semantic and contextual information about words, phrases, or entire documents. Here, the sentence `What is Amazon Bedrock?` was transformed into a vector, which is an ordered sequence of numbers, similar to a list or an array. The number of values in an embedding is known as its dimensions. The transformer-based `amazon.titan-embed-text-v1` model returns a vector with a fixed size of 1536 dimensions, regardless of the length of the input text. This dense vector numerically represents the semantic and contextual relationships of the input text.

Now, let's take a look at how we can use embeddings to determine similarity between different texts.

First, we will create a helper function that generates embeddings for a list of texts and stores them in a [pandas](https://pandas.pydata.org/) [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). pandas is a library that provides data structures and analysis tools for working with large or complex data sets. A DataFrame is a tabular data structure that makes it easy to handle and display data in our notebook.

***

In [None]:
def get_embeddings_df(texts):
    df = pd.DataFrame(texts, columns=["text"])
    df["embedding"] = df["text"].apply(get_embedding)
    return df

***

Next, generate embeddings for the texts below. Take a look at the lines of text and think about which ones you think are similar and which ones aren't.

***

In [None]:
texts = [
    "How do I deploy a SageMaker endpoint?",
    "I need instructions to deploy a ML model endpoint",
    "RDS, DynamoDB, and Neptune",
    "Relational database, key-value database, and graph database",
    "Large language models on Amazon Bedrock",
]
df = get_embeddings_df(texts)
display(df)


## Visualize similarity with t-SNE

Understanding textual relationships in their raw form is easy for us humans, but comprehending these relationships when they are embedded in a 1536-dimensional vector space is not as easy. This is where [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) is helpful. t-SNE is a dimensionality reduction technique commonly used in machine learning and data visualization. It is particularly useful for visualizing high-dimensional data in a lower-dimensional space while trying to keep points that are close together in the original space close together in the projection.

Let's visualize the sentence embeddings using t-SNE from scikit-learn. Because this is a projection to 2D space, some information is lost and the plotted distances are only approximate. Even so, notice how related items tend to be closer together and unrelated items farther apart.

In [None]:
def show_tsne_plot(df):

    # Convert the list of embeddings to a NumPy array
    embeddings = np.array(df["embedding"].tolist())

    # Apply t-SNE
    tsne = TSNE(n_components=2, learning_rate="auto", init="random", random_state=4, perplexity=3)
    embeddings_2d = tsne.fit_transform(embeddings)

    # Scatter plot
    plt.figure(figsize=(12, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c="blue", alpha=0.6, label="Embeddings")

    # Annotate each point with the corresponding text
    for i, txt in enumerate(df["text"]):
        plt.annotate(
            txt, (embeddings_2d[i, 0], embeddings_2d[i, 1]), textcoords="offset points", xytext=(0, 5), ha="center"
        )

    plt.xlabel("t-SNE Component 1")
    plt.ylabel("t-SNE Component 2")
    plt.title("2D Visualization of Text Embeddings using t-SNE")
    plt.legend()
    plt.show()


show_tsne_plot(df)

## Compute the cosine similarity between texts

Luckily, computers are much more capable than we are in higher dimensions. There are a few methods of determining similarity between vectors. One is to measure the Euclidean distance, which is the straight-line distance between two points in a vector space. It depends on both the direction and the magnitude of the vectors. Another is to measure the cosine of the angle between two vectors, which tells us whether they point in roughly the same direction, irrespective of their magnitude. Euclidean distance is more suitable when magnitude is a significant factor, while cosine similarity is often used in text analysis and other domains where the direction of the vectors matters more than their magnitude, so we will use cosine similarity. The cosine similarity between two vectors x and y is their dot product divided by the product of their Euclidean norms.

$$\text{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \times \|\mathbf{y}\|_2}$$
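
To make the difference between the two measures concrete, here is a small sketch using NumPy on hand-picked toy vectors (not model output). The helper names `cosine_sim` and `euclidean_dist` are hypothetical and are written out only to mirror the formula above; the last step shows one way to get all pairwise cosine values at once by normalizing each row to unit length.

```python
import numpy as np


def cosine_sim(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def euclidean_dist(x, y):
    """Straight-line distance between two points."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))


x = np.array([1.0, 2.0, 3.0])
same_direction = 2 * x                   # longer vector, same direction
opposite = -x                            # same length, opposite direction
orthogonal = np.array([3.0, 0.0, -1.0])  # dot product with x is 3 + 0 - 3 = 0

for name, v in [("same_direction", same_direction),
                ("opposite", opposite),
                ("orthogonal", orthogonal)]:
    print(f"{name:15s} cosine={cosine_sim(x, v):+.2f}  euclidean={euclidean_dist(x, v):.2f}")

# Pairwise cosine similarity for a whole matrix: scale each row to unit
# length, then multiply the matrix by its own transpose.
X = np.vstack([x, same_direction, opposite, orthogonal])
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
pairwise = X_unit @ X_unit.T
print(pairwise.round(2))
```

Doubling the length of `x` leaves the cosine similarity unchanged but moves the point farther away in Euclidean terms, which is the magnitude sensitivity described above.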


Let's compute the cosine similarity between the vectors and plot the values on a heatmap. Cosine values range from -1 to 1. In vector terms, -1 represents vectors pointing in opposite directions, 0 represents vectors that are orthogonal to each other, and 1 represents vectors pointing in the same direction.

In [None]:
def show_similarity_matrix(df, similarity_fn=cosine_similarity):
    # Convert the list of embeddings into a NumPy array
    embeddings_matrix = np.array(df["embedding"].tolist())

    # Calculate the pairwise similarity matrix (cosine similarity by default)
    similarity_matrix = similarity_fn(embeddings_matrix)

    # Create a DataFrame for the similarity matrix with row and column headers
    similarity_df = pd.DataFrame(similarity_matrix, index=df.index, columns=df.index)

    similarity_df_rounded = similarity_df.round(2)

    # Display the index and text columns so heatmap labels can be mapped back to texts
    display(df[["text"]])

    warnings.filterwarnings("ignore")
    cmap = sns.diverging_palette(10, 240, n=9, as_cmap=True)

    # Generate the heatmap with the new color map
    plt.figure(figsize=(12, 8))
    g = sns.heatmap(similarity_df_rounded, annot=True, cmap=cmap, cbar=True, linewidths=0.5, center=0)
    g.xaxis.tick_top()

    plt.title(f"{similarity_fn.__name__} Matrix")
    plt.show()


show_similarity_matrix(df)
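
If the heatmap colors are hard to tell apart, the same information can be listed as ranked pairs. The following is a minimal sketch, assuming you already have a square similarity matrix and a matching list of labels; `rank_pairs` is a hypothetical helper and the toy values are made up, not real model output.

```python
import numpy as np


def rank_pairs(similarity_matrix, labels):
    """Return (score, label_a, label_b) for every unordered pair, highest score first."""
    sim = np.asarray(similarity_matrix, dtype=float)
    # Indices strictly above the diagonal: each pair once, no self-comparisons
    rows, cols = np.triu_indices(len(labels), k=1)
    pairs = [(float(sim[i, j]), labels[i], labels[j]) for i, j in zip(rows, cols)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Made-up similarity values for three texts
toy_matrix = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
toy_labels = ["a", "b", "c"]

ranked = rank_pairs(toy_matrix, toy_labels)
for score, first, second in ranked:
    print(f"{score:.2f}  {first} <-> {second}")
```

With the notebook's data, the inputs would be the output of `cosine_similarity` on the embeddings array and `df["text"].tolist()`.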

## Test different texts

If you had trouble seeing the color differences in the previous example, try the next example, which has greater separation between the clusters.

In [None]:
texts = [
    "Can you please tell me how to get to the bakery?",
    "I need directions to the bread shop",
    "Cats, dogs, and mice",
    "Felines, canines, and rodents",
    "Four score and seven years ago",
]

df2 = get_embeddings_df(texts)
show_tsne_plot(df2)
show_similarity_matrix(df2)

***

Now, replace the text in the array below with your own sentences and see how they compare. Some things to try are texts of different lengths and even text in other languages.
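
One thing to watch when changing the number of texts: `show_tsne_plot` uses `perplexity=3`, and recent scikit-learn releases require the perplexity to be smaller than the number of samples, so a list of three or fewer texts would raise an error. A minimal sketch of one way to handle this follows; `pick_perplexity` is a hypothetical helper, not part of scikit-learn, and its result would need to be passed as the `perplexity` argument when constructing `TSNE`.

```python
def pick_perplexity(n_samples, preferred=3):
    """Return a perplexity strictly below n_samples, capped at `preferred`."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two texts to compare")
    return min(preferred, n_samples - 1)


for count in (5, 4, 3, 2):
    print(f"{count} texts -> perplexity {pick_perplexity(count)}")
```

This only satisfies the size constraint. With very few points, a t-SNE plot carries little information, so the heatmap is the more reliable view.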

***

In [None]:
texts = [
    "text",
    "text",
    "text",
    "text",
]

df3 = get_embeddings_df(texts)
show_tsne_plot(df3)
show_similarity_matrix(df3)

## Notebook complete

Embeddings serve as the foundation of semantic search engines and advanced question answering (QA) systems. In the next notebook, we will look at how to extract and load data into a vector store to use for similarity search.
