# Challenge 03-C-Embedding 

## 1. Overview 

In the last challenge (03-B-Chunking), we worked towards understanding LLM token limits and using chunking to stay within them. With gigabytes of data, however, chunking produces a very large number of chunks. Is there a way to select only the most relevant chunks of text? Yes: a process called embedding creates a numerical representation of each chunk, and we can then search the list of embeddings for the chunks most similar to a query. One popular way to measure that similarity is cosine similarity.

### **Embeddings Overview**
An embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar.

Different Azure OpenAI embedding models are specifically created to be good at particular tasks:
- Similarity embeddings are good at capturing semantic similarity between two or more pieces of text.
- Text search embeddings help find which long document is relevant to a short query.
- Code search embeddings are useful for embedding code snippets and embedding natural language search queries.

Embeddings make it easier to do machine learning on large inputs representing words by capturing the semantic similarities in a vector space. Therefore, we can use embeddings to determine whether two text chunks are semantically related or similar, and to obtain a score that quantifies that similarity.

### **Cosine Similarity**
An earlier approach to matching similar documents counted the number of words the documents had in common. This is flawed because, as documents grow longer, the number of common words tends to increase even when the topics differ. Cosine similarity is a better approach.
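To make the weakness of word counting concrete, here is a minimal sketch. The helper `common_word_count` and the four sample sentences are made up for illustration and are not part of the challenge data. Two documents on unrelated topics share more words once they get longer, simply because filler words pile up.

```python
def common_word_count(doc_a: str, doc_b: str) -> int:
    """Count the distinct words that appear in both documents."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    return len(words_a & words_b)


# Hypothetical sample texts: one about cooking, one about finance
short_cooking = "the chef cooked the pasta"
short_finance = "the bank raised the rate"

# Longer versions of the same two texts; the topics still differ
long_cooking = short_cooking + " and it was served with a sauce of fresh tomato in a bowl"
long_finance = short_finance + " and it was followed with a report of weak growth in a year"

short_overlap = common_word_count(short_cooking, short_finance)
long_overlap = common_word_count(long_cooking, long_finance)

print(f"Shared words (short documents): {short_overlap}")
print(f"Shared words (long documents):  {long_overlap}")
```

The overlap score rises for the longer pair even though neither text says anything about the other's topic.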

Mathematically, cosine similarity measures the cosine of the angle between two vectors projected in a multi-dimensional space. This is beneficial because if two documents are far apart by Euclidean distance because of size, they could still have a smaller angle between them and therefore higher cosine similarity.

The Azure OpenAI embeddings rely on cosine similarity to compute similarity between documents and a query.
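The difference between angle and distance can be sketched with three small vectors. The vectors below are hypothetical stand-ins (think of them as simplified document vectors), and `cosine_sim` and `euclidean_distance` are illustrative helper names written with the standard library only.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]  # same direction as short_doc, ten times the magnitude
other_doc = [2.0, 0.0, 1.0]   # similar magnitude to short_doc, different direction

dist_scaled = euclidean_distance(short_doc, long_doc)
dist_other = euclidean_distance(short_doc, other_doc)
cos_scaled = cosine_sim(short_doc, long_doc)
cos_other = cosine_sim(short_doc, other_doc)

print(f"Euclidean: short vs long = {dist_scaled:.2f}, short vs other = {dist_other:.2f}")
print(f"Cosine:    short vs long = {cos_scaled:.2f}, short vs other = {cos_other:.2f}")
```

By Euclidean distance, `short_doc` looks closer to `other_doc`. By cosine similarity, it is closest to `long_doc`, because the two point in the same direction and differ only in size.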

### **Applications**

Embeddings can be created for many different data types, including images, audio, video, and text. In this notebook, we will look at generating embeddings for text and CSV files.

There are many applications in which embeddings can be useful. For example, let's say you want to classify a piece of text. Once embeddings are generated, they can be inserted into a machine learning model to predict the right label. In addition, you can utilize embeddings for similarity in time series data, graph data, or for user profile or products. A very popular use case is one that involves semantic search. If you want to retrieve documents that are very relevant to your query, embeddings can be generated for both the query as well as the documents in order to get an accurate response. We will see an example of this in Challenge 4.
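As a rough illustration of the classification use case, the sketch below assigns a label by comparing an embedding with the average embedding of each class. Nearest-centroid matching is a deliberately simple stand-in for "a machine learning model", not the method this challenge prescribes. The three-dimensional vectors and the labels `food` and `travel` are hypothetical.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def classify(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine_sim(embedding, centroids[label]))


# Hypothetical labeled embeddings (real embeddings have many more dimensions)
labeled_examples = {
    "food": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "travel": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
centroids = {label: centroid(vectors) for label, vectors in labeled_examples.items()}

new_embedding = [0.85, 0.15, 0.05]  # stand-in for the embedding of a new text
predicted_label = classify(new_embedding, centroids)
print(f"Predicted label: {predicted_label}")
```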

## 2. Let's Start Implementation

Start by importing the required modules. The following cells repeat key setup steps you completed in the previous challenges.

In [1]:
! pip install num2words
! pip install plotly
! pip install nptyping


In [2]:
import os
import re 
import requests
import sys
from num2words import num2words 
import pandas as pd 
import numpy as np
import tiktoken
from dotenv import load_dotenv
from tenacity import retry, wait_random_exponential, stop_after_attempt
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine_similarity
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv()

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

# Initialize the Azure OpenAI client
client = AzureOpenAI(
    azure_endpoint=os.getenv("OPENAI_API_BASE"),
    azure_ad_token_provider=token_provider,
    api_version=os.getenv("OPENAI_API_VERSION")
)

# Define helper functions using the OpenAI 1.x API
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text: str, engine: str) -> list:
	text = text.replace("\n", " ")
	response = client.embeddings.create(input=[text], model=engine)
	return response.data[0].embedding

def cosine_similarity(a, b):
	return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

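Set up your environment to access your Azure OpenAI resource. Refer to the resource in the Azure Portal to retrieve your Azure OpenAI endpoint and keys. The setup cell above reads the endpoint and API version from the `OPENAI_API_BASE` and `OPENAI_API_VERSION` environment variables and authenticates through `DefaultAzureCredential`.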

For security purposes, store your sensitive information in an .env file.

In [4]:
# Get the embedding model name from environment
embedding_model = os.getenv("EMBEDDING_MODEL_NAME")
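Because `os.getenv` returns `None` when a variable is missing, a typo in the `.env` file only surfaces later as a confusing API error. The optional check below is a small sketch that uses the three variable names read in the cells above. The helper name `missing_env_vars` is made up for illustration.

```python
import os

# Variable names taken from the setup cells in this notebook
REQUIRED_VARS = ["OPENAI_API_BASE", "OPENAI_API_VERSION", "EMBEDDING_MODEL_NAME"]


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing settings in your environment or .env file: {', '.join(missing)}")
else:
    print("All required settings are present.")
```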

## 3. Generate Embeddings on text

#### Student Task #1:
Use the Azure OpenAI Embeddings class to create an embedding for the input text below. 

In [5]:

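# Named input_text to avoid shadowing Python's built-in input()
input_text = "I would like to order a pizza"

# Add code here: Create the embedding by calling the client directly
response = client.embeddings.create(
    input=[input_text],  # the API accepts a list of strings
    model=embedding_model
)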

# Extract the actual embedding vector (the list of numbers)
embedding = response.data[0].embedding

print(f"Embedding length: {len(embedding)}")
print(f"First few values: {embedding[:5]}")

Embedding length: 1536
First few values: [0.006326164584606886, -0.013710793107748032, -0.013661562465131283, -0.01329233031719923, -0.02008618786931038]


The `client.embeddings.create()` method takes a list of texts (here, a single sentence) and returns a response whose `data` list contains one embedding per input. You can use these embeddings for search, recommendations, classification, and more.
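Since the method accepts a list, several texts can be embedded in one request. The sketch below assumes the same `client` and `embedding_model` objects created earlier in this notebook. The helper name `embed_batch` is hypothetical. Each returned item carries an `index` field identifying its input, and the sketch sorts on it defensively.

```python
def embed_batch(client, texts, model):
    """Embed several texts with a single request; returns vectors in input order."""
    # Mirror the notebook's helper: replace newlines with spaces before embedding
    cleaned = [text.replace("\n", " ") for text in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Sort by the index of the originating input so vectors line up with texts
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Example usage (requires the configured Azure OpenAI client from the setup cell):
# vectors = embed_batch(
#     client,
#     ["I would like to order a pizza", "Where is the nearest train station?"],
#     embedding_model,
# )
# print(len(vectors), len(vectors[0]))
```

The service limits how many inputs and tokens one request may carry, so very large lists would still need to be split into smaller batches.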

### 3.1 Generate Embeddings for a CSV file

#### Student Task #2:
Enter in the path of the `Automobile.csv` file which you can find in the `/data` folder. Run the cells below.

In [6]:
df=pd.read_csv(os.path.join(os.getcwd(),r'../data/Automobile.csv'))

df

|     | name | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | ford mustang gl | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | usa |
| 394 | vw pickup | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | europe |
| 395 | dodge rampage | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | usa |
| 396 | ford ranger | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | usa |


In [7]:
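# Copy the selected columns so that adding columns later does not trigger SettingWithCopyWarning
shortened_df = df[['name', 'mpg', 'origin']].copy()
shortened_df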

|     | name | mpg | origin |
| --- | --- | --- | --- |
| 0 | chevrolet chevelle malibu | 18.0 | usa |
| 1 | buick skylark 320 | 15.0 | usa |
| 2 | plymouth satellite | 18.0 | usa |
| 3 | amc rebel sst | 16.0 | usa |
| 4 | ford torino | 17.0 | usa |
| ... | ... | ... | ... |
| 393 | ford mustang gl | 27.0 | usa |
| 394 | vw pickup | 44.0 | europe |
| 395 | dodge rampage | 32.0 | usa |
| 396 | ford ranger | 28.0 | usa |


In [8]:
tokenizer = tiktoken.get_encoding("cl100k_base")
shortened_df['n_tokens'] = shortened_df["name"].apply(lambda x: len(tokenizer.encode(x)))
shortened_df = shortened_df[shortened_df.n_tokens<8192]
len(shortened_df)



398

In [9]:
shortened_df

|     | name | mpg | origin | n_tokens |
| --- | --- | --- | --- | --- |
| 0 | chevrolet chevelle malibu | 18.0 | usa | 6 |
| 1 | buick skylark 320 | 15.0 | usa | 7 |
| 2 | plymouth satellite | 18.0 | usa | 3 |
| 3 | amc rebel sst | 16.0 | usa | 5 |
| 4 | ford torino | 17.0 | usa | 2 |
| ... | ... | ... | ... | ... |
| 393 | ford mustang gl | 27.0 | usa | 4 |
| 394 | vw pickup | 44.0 | europe | 2 |
| 395 | dodge rampage | 32.0 | usa | 3 |
| 396 | ford ranger | 28.0 | usa | 2 |


In [10]:
sample_encode = tokenizer.encode(shortened_df["name"].iloc[0])  # tokenize the first car name
decode = tokenizer.decode_tokens_bytes(sample_encode)
decode

[b'che', b'vrolet', b' che', b'velle', b' mal', b'ibu']

In [11]:
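len(decode)  # token count of the sample name (not displayed: it is not the cell's last expression)
# Embed every car name (one API call per row) and store each vector in a new column
shortened_df['ada-v2'] = shortened_df['name'].apply(lambda x : get_embedding(x, engine = embedding_model)) 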

In [12]:
shortened_df

|     | name | mpg | origin | n_tokens | ada-v2 |
| --- | --- | --- | --- | --- | --- |
| 0 | chevrolet chevelle malibu | 18.0 | usa | 6 | [-0.03426655754446983, 0.0006240851362235844, ... |
| 1 | buick skylark 320 | 15.0 | usa | 7 | [-0.00986679457128048, -0.003105591982603073, ... |
| 2 | plymouth satellite | 18.0 | usa | 3 | [-0.020054809749126434, 0.016926994547247887, ... |
| 3 | amc rebel sst | 16.0 | usa | 5 | [-0.023998675867915154, -0.008042722009122372,... |
| 4 | ford torino | 17.0 | usa | 2 | [-0.038462698459625244, -0.027699586004018784,... |
| ... | ... | ... | ... | ... | ... |
| 393 | ford mustang gl | 27.0 | usa | 4 | [-0.042263686656951904, -0.0038424620870500803... |
| 394 | vw pickup | 44.0 | europe | 2 | [-0.023973260074853897, -0.015643903985619545,... |
| 395 | dodge rampage | 32.0 | usa | 3 | [-0.02398286946117878, -0.005085835233330727, ... |
| 396 | ford ranger | 28.0 | usa | 2 | [-0.014130380004644394, -0.021175479516386986,... |


The embeddings generated from the CSV file can be used to perform search. You can calculate the cosine similarity between a query embedding and each embedding from the CSV file, then rank the results by how relevant they are to the query. We will see an application of embeddings in Challenge 4.
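The ranking step can be sketched as follows. This is a minimal illustration and not the Challenge 4 solution. The helper `rank_by_similarity` is a hypothetical name, and the three-dimensional vectors are toy stand-ins for the 1536-dimensional embeddings stored in the `ada-v2` column.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, rows, top_n=3):
    """Score (name, embedding) pairs against the query; return the top_n by similarity."""
    scored = [(name, cosine_sim(query_embedding, embedding)) for name, embedding in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy stand-ins for rows of the DataFrame (name, embedding)
toy_rows = [
    ("ford torino", [0.9, 0.1, 0.0]),
    ("vw pickup", [0.1, 0.9, 0.1]),
    ("ford ranger", [0.8, 0.2, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]  # stand-in for the embedding of a search query

results = rank_by_similarity(toy_query, toy_rows, top_n=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With the real data, the rows could be built with `zip(shortened_df["name"], shortened_df["ada-v2"])` and the query vector obtained from `get_embedding(query, engine=embedding_model)`.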

## Success Criteria 

To complete this challenge successfully:

* Show an understanding of embeddings by working with different inputs.