## Embeddings

Embeddings are integral to various natural language processing applications, with their quality crucial for optimal performance. They are commonly used in knowledge bases to represent textual data as dense vectors enabling efficient similarity search and retrieval. In Retrieval Augmented Generation (RAG), embeddings are used to retrieve relevant passages from a corpus to provide context for language models to generate informed, knowledge-grounded responses. Embeddings also play a key role in personalization and recommendation systems by representing user preferences, item characteristics, and historical interactions as vectors, allowing calculation of similarities for personalized recommendations based on user behavior and item embeddings. As new embedding models are released with incremental quality improvements, organizations must weigh the potential benefits against the associated costs of upgrading, considering factors like computational resources, data preprocessing, integration efforts, and projected performance gains impacting business metrics.

#### How is a piece of text converted into a vector?
A common approach is to use models that produce contextualized embeddings for entire sentences. These models are based on deep learning architectures such as Transformers, which capture contextual information and the relationships between words in a sentence more effectively than static, per-word vectors.

![Embedding Model](./images/vector_embedding.png)
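
As a rough illustration of how per-token vectors can be combined into one sentence vector, the sketch below applies mean pooling with NumPy. The token vectors and the `mean_pool` helper are made up for this example; a real Transformer-based model produces the token vectors with a trained network, and the pooling strategy differs from model to model.

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, ignoring padded positions (mask == 0)."""
    token_vectors = np.asarray(token_vectors, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # shape (tokens, 1)
    summed = (token_vectors * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)  # avoid dividing by zero for empty input
    return summed / count

# Hypothetical 4-dimensional vectors for three real tokens plus one padding token
token_vectors = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 1.0, 1.0, 3.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, excluded by the mask
]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```

The padding row is ignored, so the result is the average of the first three rows only.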

In addition to semantic search, you can use embeddings to augment your prompts for more accurate results through Retrieval Augmented Generation (RAG)—but in order to use them, you’ll need to store them in a database with vector capabilities.

![Embedding Model](./images/vector_db.jpg)
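
To make the idea of a database with vector capabilities more concrete, here is a minimal in-memory sketch. The `InMemoryVectorStore` class and the three-dimensional vectors are hypothetical and exist only for illustration; a production vector database adds persistence, indexing, and approximate nearest-neighbor search on top of the same basic operation.

```python
import numpy as np

class InMemoryVectorStore:
    """Toy vector store: keeps unit-length vectors in a list and does brute-force search."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        vector = np.asarray(vector, dtype=float)
        self.texts.append(text)
        self.vectors.append(vector / np.linalg.norm(vector))  # store unit-length vectors

    def search(self, query_vector, k=2):
        query = np.asarray(query_vector, dtype=float)
        query = query / np.linalg.norm(query)
        scores = np.stack(self.vectors) @ query  # cosine similarity for unit vectors
        top = np.argsort(scores)[::-1][:k]       # indices of the k highest scores
        return [(self.texts[i], float(scores[i])) for i in top]

# Made-up vectors standing in for real embeddings
store = InMemoryVectorStore()
store.add("cats are small pets", [0.9, 0.1, 0.0])
store.add("dogs are loyal pets", [0.8, 0.3, 0.1])
store.add("stocks fell on Monday", [0.0, 0.2, 0.9])

results = store.search([1.0, 0.0, 0.0], k=2)
for text, score in results:
    print(f"{score:.4f}  {text}")
```

In a RAG pipeline, the texts returned by a search like this would be added to the prompt as context.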

In [None]:
#%pip install langchain_cohere -q
%pip install spacy
#%pip install python-dotenv -q
#ignore error

In [None]:
# now you need to run this in a terminal window
# python -m spacy download en_core_web_md
# now restart your kernel

Standard imports for the libraries we will be using in this notebook. Try to keep your imports in the first cell so that this code can more easily be converted into a Python program later.

In [None]:
import sys
import boto3
from botocore.config import Config
import pandas as pd
import json
import time
import os
import numpy as np
import pyarrow
import traceback
from langchain.embeddings import BedrockEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import BedrockChat
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SpacyEmbeddings
from dotenv import load_dotenv
from typing import Optional


## spaCy
Let's define functions that use various embedding models so we can generate vector embeddings and examine the relationships between vectors.


In [None]:
def generate_spacy_vector_embedding(text):
    embedder = SpacyEmbeddings(model_name="en_core_web_md")
    query_embedding = embedder.embed_query(text)

    return(np.array(query_embedding))

In [None]:
# Mathematical formula for cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

In [None]:
#Spacy
king_vector = generate_spacy_vector_embedding("King")
queen_vector = generate_spacy_vector_embedding("Queen")
man_vector = generate_spacy_vector_embedding("man")
woman_vector = generate_spacy_vector_embedding("woman")
print(f"This embedding has {len(king_vector)} dimensions")
print(king_vector[:5])

In [None]:
calculated_queen_vector = king_vector - man_vector + woman_vector

similarity = cosine_similarity(man_vector, woman_vector)
print(f"Cosine similarity, man to woman: {similarity:.4f}")

similarity = cosine_similarity(king_vector, queen_vector)
print(f"Cosine similarity, spaCy King to Queen: {similarity:.4f}")

similarity = cosine_similarity(calculated_queen_vector, queen_vector)
print(f"Cosine similarity between spaCy Queen vector and our King - man + woman: {similarity:.4f}")

So as we can see here, the relationships that are mathematically represented by these vectors can be used to relate these items in semantic space.
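
The arithmetic is easier to follow on vectors small enough to check by hand. The sketch below uses hand-picked three-dimensional toy vectors, not output from any model, and repeats the cosine similarity definition so that it runs on its own.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Hand-picked toy vectors with axes [royalty, masculinity, femininity]
king = np.array([1.0, 1.0, 0.0])
queen = np.array([1.0, 0.0, 1.0])
man = np.array([0.0, 1.0, 0.0])
woman = np.array([0.0, 0.0, 1.0])

# Remove the "man" direction from king and add the "woman" direction
calculated_queen = king - man + woman

print(calculated_queen)
print(f"calculated queen vs queen: {cosine_similarity(calculated_queen, queen):.4f}")
print(f"king vs queen: {cosine_similarity(king, queen):.4f}")
```

In this idealized setup the calculated vector matches `queen` exactly. Real embeddings are noisier, so the match is only approximate.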

### Now let's try some more models that have more features and see how that translates to different relationships
Here is a helper class that lets us use the Amazon Titan text embedding model.

In [None]:
class TitanEmbeddings(object):
    accept = "application/json"
    content_type = "application/json"
    
    def __init__(self, model_id="amazon.titan-embed-text-v2:0", boto3_client=None, region_name='us-west-2'):
        
        if boto3_client:
            self.bedrock_boto3 = boto3_client
        else:
            # self.bedrock_boto3 = boto3.client(service_name='bedrock-runtime')
            self.bedrock_boto3 = boto3.client(
                service_name='bedrock-runtime', 
                region_name=region_name, 
            )
        self.model_id = model_id

    def __call__(self, text, dimensions, normalize=True):
        """
        Returns Titan Embeddings

        Args:
            text (str): text to embed
            dimensions (int): Number of output dimensions.
            normalize (bool): Whether to return the normalized embedding or not.

        Return:
            List[float]: Embedding
            
        """

        body = json.dumps({
            "inputText": text,
            "dimensions": dimensions,
            "normalize": normalize
        })

        response = self.bedrock_boto3.invoke_model(
            body=body, modelId=self.model_id, accept=self.accept, contentType=self.content_type
        )

        response_body = json.loads(response.get('body').read())

        return response_body['embedding']


In [None]:
def get_bedrock_client(assumed_role: Optional[str] = None, region: Optional[str] = 'us-west-2',runtime: Optional[bool] = True,external_id=None, ep_url=None):
    """Create a boto3 client for Amazon Bedrock, with optional configuration overrides 
    """
    target_region = region

    #print(f"Create new client\n  Using region: {target_region}:external_id={external_id}: ")
    session_kwargs = {"region_name": target_region}
    client_kwargs = {**session_kwargs}

    profile_name = os.environ.get("AWS_PROFILE")
    if profile_name:
        print(f"  Using profile: {profile_name}")
        session_kwargs["profile_name"] = profile_name

    retry_config = Config(
        region_name=target_region,
        retries={
            "max_attempts": 10,
            "mode": "standard",
        },
    )
    session = boto3.Session(**session_kwargs)

    if assumed_role:
        print(f"  Using role: {assumed_role}", end='')
        sts = session.client("sts")
        if external_id:
            response = sts.assume_role(
                RoleArn=str(assumed_role),
                RoleSessionName="langchain-llm-1",
                ExternalId=external_id
            )
        else:
            response = sts.assume_role(
                RoleArn=str(assumed_role),
                RoleSessionName="langchain-llm-1",
            )
        print(f"Using role: {assumed_role} ... sts::successful!")
        client_kwargs["aws_access_key_id"] = response["Credentials"]["AccessKeyId"]
        client_kwargs["aws_secret_access_key"] = response["Credentials"]["SecretAccessKey"]
        client_kwargs["aws_session_token"] = response["Credentials"]["SessionToken"]

    if runtime:
        service_name='bedrock-runtime'
    else:
        service_name='bedrock'

    if ep_url:
        bedrock_client = session.client(service_name=service_name,config=retry_config,endpoint_url = ep_url, **client_kwargs )
    else:
        bedrock_client = session.client(service_name=service_name,config=retry_config, **client_kwargs )

    #print("boto3 Bedrock client successfully created!")
    #print(bedrock_client._endpoint)
    return bedrock_client

In [None]:
def generate_titan_vector_embedding(text, embedding_size):
    # Build a Bedrock runtime client and reuse the TitanEmbeddings helper defined above
    aws_client = get_bedrock_client()
    bedrock_embeddings = TitanEmbeddings(model_id="amazon.titan-embed-text-v2:0", boto3_client=aws_client)

    # Request a normalized embedding with the requested number of dimensions
    embedding = bedrock_embeddings(text, dimensions=embedding_size, normalize=True)

    return np.array(embedding)
   

Now let's look at the same idea with Titan.

In [None]:
#Titan
king_vector = generate_titan_vector_embedding("King", 1024)
queen_vector = generate_titan_vector_embedding("Queen", 1024)
man_vector = generate_titan_vector_embedding("man", 1024)
woman_vector = generate_titan_vector_embedding("woman", 1024)
print(f"This embedding has {len(king_vector)} dimensions")


In [None]:
calculated_queen_vector = king_vector - man_vector + woman_vector

similarity = cosine_similarity(man_vector, woman_vector)
print(f"Cosine similarity, man to woman: {similarity:.4f}")

similarity = cosine_similarity(king_vector, queen_vector)
print(f"Cosine similarity, Titan King to Queen: {similarity:.4f}")

similarity = cosine_similarity(calculated_queen_vector, queen_vector)
print(f"Cosine similarity between Titan Queen vector and our King - man + woman: {similarity:.4f}")

### Cohere
Let's look at one more model to see how it compares 

In [None]:
# send in an array size of one and only return the 0th element
def generate_cohere_vector_embedding(text_data):
    aws_client = get_bedrock_client()
    input_type = "clustering"
    truncate = "NONE" # optional
    model_id = "cohere.embed-english-v3" # or "cohere.embed-multilingual-v3"
    
    # Create the JSON payload for the request
    json_params = {
            'texts': [text_data],
            'truncate': truncate, 
            "input_type": input_type
        }
    json_body = json.dumps(json_params)
    params = {'body': json_body, 'modelId': model_id,}
    
    # Invoke the model and parse the JSON response
    result = aws_client.invoke_model(**params)
    response = json.loads(result['body'].read().decode())
    return(np.array(response['embeddings'][0]))


In [None]:
# Input cohere for embedding 
king_vector = generate_cohere_vector_embedding('King')
queen_vector = generate_cohere_vector_embedding("Queen")
man_vector = generate_cohere_vector_embedding("man")
woman_vector = generate_cohere_vector_embedding("woman")
print(f"This embedding has {len(king_vector)} dimensions")
print(king_vector[:5])

In [None]:
calculated_queen_vector = king_vector - man_vector + woman_vector

similarity = cosine_similarity(man_vector, woman_vector)
print(f"Cosine similarity, man to woman: {similarity:.4f}")

similarity = cosine_similarity(king_vector, queen_vector)
print(f"Cosine similarity, Cohere King to Queen: {similarity:.4f}")

similarity = cosine_similarity(calculated_queen_vector, queen_vector)
print(f"Cosine similarity between Cohere Queen vector and our King - man + woman: {similarity:.4f}")

Let's examine other phrases

In [None]:
similarity = cosine_similarity(generate_titan_vector_embedding("cat", 1024), generate_titan_vector_embedding("book", 1024))
print(f"Cosine Similarity of cat to book using Titan: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_cohere_vector_embedding("cat"), generate_cohere_vector_embedding("book"))
print(f"Cosine Similarity of cat to book using Cohere: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_spacy_vector_embedding("cat"), generate_spacy_vector_embedding("book"))
print(f"Cosine Similarity of cat to book using spaCy: {similarity:.4f}")

Now let's look at longer sentences and see how larger, more complex models handle the same task. Here are two sentences that are semantically similar but use different words and phrasing.

The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence.

The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the city's vibrant identity.

In [None]:
sentence1 = "The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence."
sentence2 = "The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the city's vibrant identity."
similarity = cosine_similarity(generate_spacy_vector_embedding(sentence1), generate_spacy_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using spaCy: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_titan_vector_embedding(sentence1, 1024), generate_titan_vector_embedding(sentence2, 1024))
print(f"Cosine Similarity of S1 to S2 using Titan: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_cohere_vector_embedding(sentence1), generate_cohere_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using Cohere: {similarity:.4f}")

#### Snowboarding
Snowboarding on fresh powder is pure freedom—like floating, flying, and dancing all at once. The moment your board hits untouched snow, everything changes. It’s soft, silent, and surreal. Instead of the usual hardpack chatter under your feet, there’s this quiet swoosh as you carve through the fluff. Each turn feels like slicing through silk. Your board sinks just a little, giving you that surfy, weightless feeling, like you're gliding above the ground. You lean back slightly to stay afloat, and with each movement, you’re not just riding the mountain—you’re flowing with it. The powder cushions every bump and fall, making the ride forgiving and playful. The world around you goes quiet—muffled by the snow—so all you hear is the wind in your ears and the sound of your own breath. Trees blur past, sunlight catches on snowflakes in the air, and your legs start to burn as you float turn after turn, not wanting it to end. It’s the kind of ride that leaves your heart pounding and your face aching from grinning so hard. The first tracks you lay through fresh powder? They’re yours alone—like signing your name on nature.

#### Surfing 
Surfing is an experience that blends adrenaline, tranquility, and connection with nature in a way that's hard to put into words but unforgettable once you feel it. Imagine paddling out through the rhythm of the ocean, the sun warming your back, saltwater clinging to your skin. You wait just beyond the breakers, scanning the horizon for the right wave—a moment of stillness, of anticipation. When it comes, you turn your board toward shore, paddle hard, and feel the lift as the wave catches you. Then you pop up—feet planted, knees bent, arms out—and suddenly, you’re riding a force of nature. The board glides effortlessly as the wave curls behind you. There's a split second where everything aligns: your balance, the speed, the sound of rushing water, and the pure exhilaration of being carried by the ocean. Every ride is different. Some are smooth and easy, others fast and wild. Sometimes you wipe out—tossed and tumbled underwater, lungs burning, trying to find the surface. But even that feels part of the magic. It’s humbling. It teaches respect. Surfing isn't just a sport. It’s a state of mind. It’s patience and persistence. It’s the joy of catching your first wave or the meditative calm of floating on your board, just watching the sun dip below the horizon.


In [None]:
surfing = "Surfing is an experience that blends adrenaline, tranquility, and connection with nature in a way that's hard to put into words but unforgettable once you feel it. Imagine paddling out through the rhythm of the ocean, the sun warming your back, saltwater clinging to your skin. You wait just beyond the breakers, scanning the horizon for the right wave—a moment of stillness, of anticipation. When it comes, you turn your board toward shore, paddle hard, and feel the lift as the wave catches you. Then you pop up—feet planted, knees bent, arms out—and suddenly, you’re riding a force of nature. The board glides effortlessly as the wave curls behind you. There's a split second where everything aligns: your balance, the speed, the sound of rushing water, and the pure exhilaration of being carried by the ocean. Every ride is different. Some are smooth and easy, others fast and wild. Sometimes you wipe out—tossed and tumbled underwater, lungs burning, trying to find the surface. But even that feels part of the magic. It’s humbling. It teaches respect. Surfing isn't just a sport. It’s a state of mind. It’s patience and persistence. It’s the joy of catching your first wave or the meditative calm of floating on your board, just watching the sun dip below the horizon."
snowboading = "Snowboarding on fresh powder is pure freedom—like floating, flying, and dancing all at once. The moment your board hits untouched snow, everything changes. It’s soft, silent, and surreal. Instead of the usual hardpack chatter under your feet, there’s this quiet swoosh as you carve through the fluff. Each turn feels like slicing through silk. Your board sinks just a little, giving you that surfy, weightless feeling, like you're gliding above the ground. You lean back slightly to stay afloat, and with each movement, you’re not just riding the mountain—you’re flowing with it. The powder cushions every bump and fall, making the ride forgiving and playful. The world around you goes quiet—muffled by the snow—so all you hear is the wind in your ears and the sound of your own breath. Trees blur past, sunlight catches on snowflakes in the air, and your legs start to burn as you float turn after turn, not wanting it to end. It’s the kind of ride that leaves your heart pounding and your face aching from grinning so hard. The first tracks you lay through fresh powder? They’re yours alone—like signing your name on nature."

In [None]:
similarity = cosine_similarity(generate_titan_vector_embedding(surfing, 1024), generate_titan_vector_embedding(snowboading, 1024))
print(f"Cosine Similarity of surfing to snowboarding using Titan: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_spacy_vector_embedding(surfing), generate_spacy_vector_embedding(snowboading))
print(f"Cosine Similarity of surfing to snowboarding using spaCy: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_cohere_vector_embedding(surfing), generate_cohere_vector_embedding(snowboading))
print(f"Cosine Similarity of surfing to snowboarding using Cohere: {similarity:.4f}")

#### Random
Okay, let's try two unrelated sentences and see what the similarity looks like.

In [None]:
sentence1 = "The giraffe wore a monocle while arguing with a squirrel about the ethics of pancake toppings."
sentence2 = "A forgotten shoelace fluttered quietly on a moonlit battlefield, untouched by time or tennis."

In [None]:
similarity = cosine_similarity(generate_spacy_vector_embedding(sentence1), generate_spacy_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using spaCy: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_titan_vector_embedding(sentence1, 1024), generate_titan_vector_embedding(sentence2, 1024))
print(f"Cosine Similarity of S1 to S2 using Titan: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_cohere_vector_embedding(sentence1), generate_cohere_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using Cohere: {similarity:.4f}")

#### And the winner is?
It depends on what you are trying to accomplish. You need to sample different embedding models and evaluate them against your own use case.
Here is the [Hugging Face MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), which ranks embedding models across benchmark tasks.
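
One way to sample models against your own data is to score each candidate on pairs you have labeled by hand. The sketch below is one possible harness, not a standard metric: `separation_score` is a name invented here, and the two hashing functions are deliberately crude stand-ins for real embedding models such as the spaCy, Titan, and Cohere functions in this notebook.

```python
import zlib
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def word_hash_embedding(text, size=64):
    """Stand-in embedder: hashed bag of words (not a real model)."""
    vec = np.zeros(size)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % size] += 1.0
    return vec

def char_trigram_embedding(text, size=64):
    """Stand-in embedder: hashed bag of character trigrams (not a real model)."""
    vec = np.zeros(size)
    cleaned = text.lower()
    for i in range(len(cleaned) - 2):
        vec[zlib.crc32(cleaned[i:i + 3].encode()) % size] += 1.0
    return vec

def separation_score(embed, similar_pairs, dissimilar_pairs):
    """Mean similarity of related pairs minus mean similarity of unrelated pairs."""
    similar = np.mean([cosine_similarity(embed(a), embed(b)) for a, b in similar_pairs])
    dissimilar = np.mean([cosine_similarity(embed(a), embed(b)) for a, b in dissimilar_pairs])
    return similar - dissimilar

# Hand-labeled example pairs; replace these with pairs from your own data
similar_pairs = [("the cat sat on the mat", "a cat is sitting on the mat")]
dissimilar_pairs = [("the cat sat on the mat", "quarterly revenue grew sharply")]

candidates = {"word_hash": word_hash_embedding, "char_trigram": char_trigram_embedding}
scores = {name: separation_score(fn, similar_pairs, dissimilar_pairs)
          for name, fn in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

A larger gap suggests the model separates related from unrelated text more cleanly on that sample. A handful of pairs is only a starting point.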

#### Embeddings work on images as well
Let's compare images using an image embedding model. There are many well-known image embedding models; here are a few:

- ResNet (e.g., ResNet-50, ResNet-101)
  - Deep residual networks
  - Common for feature extraction
  - Embeddings: vectors from the penultimate (before-softmax) layer
- VGG (VGG16, VGG19)
  - Simpler architecture
  - Good for smaller-scale tasks or transfer learning
- Inception (GoogLeNet, Inception-v3)
  - Multi-scale filters
  - Good trade-off between speed and accuracy
- EfficientNet
  - Very efficient, scalable architecture
  - Great for resource-constrained environments

Using EfficientNet, let's try comparing the following images:

![House Sparrow](./images/sparrow-sm.jpg) ![Black Phoebe](./images/phoebe-sm.jpg) ![White Crown Sparrow](./images/whitecrown-sm.jpg)![Fork](./images/fork-sm.jpg)

In [None]:
import torch
import torchvision.transforms as transforms
from torchvision import models
from PIL import Image
import torch.nn.functional as F

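# Load EfficientNet (without the classifier head)
# `weights=` replaces the deprecated `pretrained=True` argument in recent torchvision releases
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
model.classifier = torch.nn.Identity()  # Remove the final classification layer
model.eval()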

# Preprocessing pipeline to match EfficientNet's input requirements
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],  # ImageNet mean
        std=[0.229, 0.224, 0.225]    # ImageNet std
    )
])

def get_image_embedding(image_path):
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0)  # Add batch dimension
    with torch.no_grad():
        embedding = model(input_tensor)
    return embedding.squeeze()  # Remove batch dimension

def image_cosine_similarity(tensor1, tensor2):
    # Named differently from the NumPy-based cosine_similarity above so the text cells still work if re-run
    return F.cosine_similarity(tensor1.unsqueeze(0), tensor2.unsqueeze(0)).item()


In [None]:
# Call EfficientNet on all sample images
fork_image_emb = get_image_embedding('./images/fork.jpg')
sparrow_image_emb = get_image_embedding('./images/sparrow.jpg')
phoebe_image_emb = get_image_embedding('./images/phoebe.jpg')
white_crown_image_emb = get_image_embedding('./images/whitecrown.jpg')

similarity = image_cosine_similarity(sparrow_image_emb, phoebe_image_emb)
print(f"Cosine similarity between two different birds: {similarity:.4f}")

similarity = image_cosine_similarity(sparrow_image_emb, fork_image_emb)
print(f"Cosine similarity between the fork and the sparrow: {similarity:.4f}")

similarity = image_cosine_similarity(sparrow_image_emb, white_crown_image_emb)
print(f"Cosine similarity between the white crown sparrow and house sparrow: {similarity:.4f}")
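
With more than a few images, comparing them one pair at a time gets tedious. The sketch below computes every pairwise similarity in one step with NumPy. The three-dimensional vectors are made up and only stand in for real image embeddings; tensors returned by `get_image_embedding` would need converting first, for example with `.numpy()`.

```python
import numpy as np

def pairwise_cosine_similarity(embeddings):
    """Return an (n, n) matrix whose entry [i, j] is the cosine similarity of rows i and j."""
    matrix = np.asarray(embeddings, dtype=float)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # scale each row to length 1
    return unit @ unit.T

# Made-up vectors standing in for real image embeddings
labels = ["sparrow", "phoebe", "fork"]
embeddings = [
    [0.9, 0.4, 0.1],
    [0.8, 0.5, 0.2],
    [0.1, 0.2, 0.9],
]

similarity_matrix = pairwise_cosine_similarity(embeddings)
for label, row in zip(labels, similarity_matrix):
    print(label.ljust(8), " ".join(f"{value:.4f}" for value in row))
```

The diagonal is always 1 because each vector is compared with itself, and the matrix is symmetric.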

#### Assignment
Go find some data that you think will be useful for your project and try a few embedding models that you think might work well for your use case. After you manually separate the data into a few samples, show how well the embedding model you choose does at comparing relevancy.
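
One possible way to present the result is to rank your hand-labeled samples against a query and check how many of the top results you had marked as relevant. The sketch below assumes that approach. The helper names `rank_by_relevance` and `precision_at_k` and all of the vectors are invented for illustration; in practice the vectors would come from the embedding model you are evaluating.

```python
import numpy as np

def rank_by_relevance(query_vector, sample_vectors):
    """Return sample indices sorted from most to least similar to the query, plus the scores."""
    query = np.asarray(query_vector, dtype=float)
    samples = np.asarray(sample_vectors, dtype=float)
    scores = samples @ query / (np.linalg.norm(samples, axis=1) * np.linalg.norm(query))
    order = np.argsort(scores)[::-1]
    return order.tolist(), scores

def precision_at_k(ranked_indices, relevant_indices, k):
    """Fraction of the top-k results that were labeled relevant by hand."""
    top_k = ranked_indices[:k]
    return sum(1 for i in top_k if i in relevant_indices) / k

# Made-up vectors standing in for embeddings of four hand-labeled samples
sample_vectors = [
    [0.9, 0.1, 0.0],  # 0: labeled relevant
    [0.7, 0.3, 0.1],  # 1: labeled relevant
    [0.1, 0.9, 0.2],  # 2: labeled not relevant
    [0.0, 0.2, 0.9],  # 3: labeled not relevant
]
relevant = {0, 1}
query_vector = [1.0, 0.1, 0.0]

ranking, scores = rank_by_relevance(query_vector, sample_vectors)
print("Ranking:", ranking)
print("Precision@2:", precision_at_k(ranking, relevant, k=2))
```

Running the same check with each candidate model on the same labeled samples gives a simple basis for comparison.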