## Module 1

### Part 1: Using t-SNE to visualize vectors


In this section of the notebook, we will generate embeddings for sample text and plot them in a two-dimensional vector space. The purpose is to demonstrate how relationships between data points are reflected in their distances from one another: texts with related meanings should land closer together than unrelated ones. We will use t-SNE (t-Distributed Stochastic Neighbor Embedding), a dimensionality-reduction algorithm, to project these vectors into two dimensions. The final output will be displayed as a scatter plot.
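
To make the idea of distance between embeddings concrete before calling any model, here is a minimal sketch that computes cosine similarity between a few toy vectors. The three-dimensional vectors and the `cosine_sim` helper are invented for illustration; they are not output from the embeddings model, which returns much longer vectors.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-D vectors, made up for illustration (not real model output)
toy = {
    "red": [0.9, 0.1, 0.0],
    "orange": [0.8, 0.3, 0.0],
    "horse": [0.0, 0.2, 0.9],
}

sim_red_orange = cosine_sim(toy["red"], toy["orange"])
sim_red_horse = cosine_sim(toy["red"], toy["horse"])

print(f"red vs orange: {sim_red_orange:.3f}")
print(f"red vs horse:  {sim_red_horse:.3f}")
```

In this toy setup the two color vectors point in nearly the same direction, so their similarity is much higher than the similarity between a color and an animal. The plot later in this section draws on the same measure, applied to real embeddings.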

In the next cell, we will define a function to generate embeddings using the amazon.titan-embed-text-v2:0 model.

In [None]:
import json
import logging
import boto3
from botocore.exceptions import ClientError

bedrock_client = boto3.client("bedrock-runtime")
embeddings_model_id = 'amazon.titan-embed-text-v2:0'


def generate_embeddings(model_id, body):
    """
    Generate a vector of embeddings for a text input using Amazon Titan Text Embeddings v2 on demand.
    Args:
        model_id (str): The model ID to use.
        body (str) : The request body to use.
    Returns:
        response (dict): The parsed response body, containing the embedding created by the model and the number of input tokens.
    """

    accept = "application/json"
    content_type = "application/json"

    # Reuse the module-level client instead of creating a new one on every call
    response = bedrock_client.invoke_model(
        body=body, modelId=model_id, accept=accept, contentType=content_type
    )

    response_body = json.loads(response.get('body').read())

    return response_body

In the next cell, we will define a function that creates a pandas DataFrame from a list of text strings and adds an embedding for each one. Putting text into a DataFrame is a common step in natural language processing tasks, because it keeps each text and its derived values (here, the embedding vector) together in one row that is easy to inspect and manipulate.

In [None]:
import pandas as pd

def get_embeddings_df(texts):
    """Build a DataFrame with one row per text and its embedding vector."""
    df = pd.DataFrame(texts, columns=["text"])

    def embed(text):
        body = json.dumps({"inputText": text, "dimensions": 1024, "normalize": True})
        return generate_embeddings(embeddings_model_id, body).get("embedding")

    df["embedding"] = df["text"].apply(embed)
    return df
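
The request body above sets "normalize" to True. Assuming this means the model returns vectors scaled to unit length (an assumption worth confirming against the Bedrock documentation), the following sketch shows why that is convenient: for unit-length vectors, a plain dot product equals cosine similarity. The vectors here are random stand-ins generated with NumPy, not model output.

```python
import numpy as np

# Random stand-in vectors (4 vectors of dimension 8), not real embeddings
rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 8))

# L2-normalize each row so every vector has length 1
raw_norms = np.linalg.norm(raw, axis=1, keepdims=True)
unit = raw / raw_norms

# After normalization, every row norm is 1
norms = np.linalg.norm(unit, axis=1)

# Dot products of unit vectors match cosine similarity of the raw vectors
dot_products = unit @ unit.T
cosine = (raw @ raw.T) / (raw_norms @ raw_norms.T)

print(np.allclose(norms, 1.0))
print(np.allclose(dot_products, cosine))
```

If the real embeddings are normalized in the same way, comparing them by dot product and by cosine similarity gives the same ranking.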

In the next cell, we will pass a list of texts to generate embeddings, then display each text alongside its corresponding vector.

In [None]:
# Sample data for learning the concepts
texts = [
    "Red",
    "White",
    "Blue",
    "Fish",
    "Horse",
    "Cat",
    "Orange",
    "USA",
    "Canada",
    "Japan"
]

# Call utility function to generate the embeddings
df = get_embeddings_df(texts)

# Show the embeddings
display(df)

In the next cell, we will use t-SNE (t-Distributed Stochastic Neighbor Embedding), a popular technique for dimensionality reduction and visualization of high-dimensional data. It is particularly useful for projecting embeddings into a lower-dimensional space, typically 2D or 3D, where they can be plotted and interpreted. The plot also draws a line between each pair of points, with bolder lines for pairs whose original embeddings have higher cosine similarity. Let's see how our data looks in a two-dimensional vector space.

In [None]:
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import matplotlib.pyplot as plt

def show_tsne_plot(df):
    embeddings = np.array(df["embedding"].tolist())

    tsne = TSNE(n_components=2, learning_rate="auto", init="random", random_state=4, perplexity=3)
    embeddings_2d = tsne.fit_transform(embeddings)

    # plot
    plt.figure(figsize=(12, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c="blue", alpha=0.6, label="Embeddings")

    for i, txt in enumerate(df["text"]):
        plt.annotate(
            txt, (embeddings_2d[i, 0], embeddings_2d[i, 1]), textcoords="offset points", xytext=(0, 5), ha="center"
        )

    similarity_matrix = cosine_similarity(embeddings)

    # lines
    for i in range(len(embeddings_2d)):
        for j in range(i+1, len(embeddings_2d)):  # avoid repeating the same pair
            sim = similarity_matrix[i, j]
            if sim > 0:  # plot lines for positive similarity values
                # higher similarity = bolder lines; clamp to 1.0 because
                # floating-point rounding can push sim slightly above 1
                strength = min(float(sim), 1.0)
                plt.plot(
                    [embeddings_2d[i, 0], embeddings_2d[j, 0]],
                    [embeddings_2d[i, 1], embeddings_2d[j, 1]],
                    color='gray', linestyle='-', alpha=strength, linewidth=2 * strength
                )

    plt.title("2D Visualization of Text Embeddings using t-SNE with Cosine Similarity")
    plt.legend()
    plt.show()

In [None]:
show_tsne_plot(df)