<img src="../Assets/Images/Day 3 Header (Embeddings).png">

# Welcome to Day 3

Today we will:

- Learn about Embeddings
- Look at available embedding models of OpenAI
- Study the embeddings API
- Decode the embeddings response object
- Apply embeddings to 
    - Perform Text Search
    - Do Text Clustering

## Brief Introduction to Embeddings

<span style="font-size: 20px; color: orange"><b>Embeddings are vector representations of data that capture meaningful relationships between entities</b></span>

<span style="font-size: 16px; color: blue"><b>For text, these entities are typically words, punctuation marks, or other meaningful substrings that make up the text</b></span>

- All Machine Learning/AI models work with numerical data. Before any operation can be performed, all text, image, audio, and video data has to be transformed into a numerical representation.

- As a general definition, embeddings are data that has been transformed into n-dimensional vectors for use in deep learning computations.
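
To make the idea concrete, here is a toy sketch with hand-written 3-dimensional vectors. The numbers are invented purely for illustration (real embedding models produce hundreds or thousands of dimensions), and `toy_embeddings` and `dot` are hypothetical names, not part of any library.

```python
# Toy, hand-written vectors (NOT produced by any model) to illustrate the idea
toy_embeddings = {
    "pizza":  [0.9, 0.1, 0.0],
    "burger": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

def dot(a, b):
    """Dot product: larger values mean the vectors point in more similar directions."""
    return sum(x * y for x, y in zip(a, b))

food_pair = dot(toy_embeddings["pizza"], toy_embeddings["burger"])
mixed_pair = dot(toy_embeddings["pizza"], toy_embeddings["laptop"])

# Related entities end up with a higher score than unrelated ones
print(food_pair > mixed_pair)
```

The two food items score higher with each other than a food item does with "laptop", which is the kind of relationship an embedding is meant to capture.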

<img src="../Assets/Images/Embeddings.png" width=800>

## Available OpenAI Embeddings

| Model | Price |
|---|---|
| __text-embedding-3-small__ | $0.02 / 1M tokens |
| __text-embedding-3-large__ | $0.13 / 1M tokens |
| __text-embedding-ada-002__ (ada v2) | $0.10 / 1M tokens |

<span style="font-size: 14px; color: orange">__IMP__ : __"model"__ is passed as a parameter in the embeddings API</span>
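
As a rough illustration of what these prices mean in practice, the sketch below estimates the cost of an embedding job from a token count. The prices are copied from the figures above and may have changed since; `PRICE_PER_MILLION_TOKENS` and `estimate_cost` are hypothetical names introduced for this example.

```python
# Prices in USD per 1M tokens, copied from the prices above
# (assumption: still current; check the pricing page before relying on them)
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,  # listed above as "ada v2"
}

def estimate_cost(num_tokens, model):
    """Return the approximate cost in USD of embedding `num_tokens` tokens."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Example: 5 million tokens with the small model
print(round(estimate_cost(5_000_000, "text-embedding-3-small"), 4))
```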


## Embeddings API

In [2]:
%pip install openai --quiet #You can remove '--quiet' to see the installation steps
%pip install python-dotenv --quiet

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


### Loading API key

In [3]:
#### Import Libraries ####
import openai #OpenAI python library
from openai import OpenAI #OpenAI Client

from dotenv import load_dotenv
import os

load_dotenv()

openai_api_key=os.getenv("OPENAI_API_KEY")

client = OpenAI(api_key=openai_api_key)

### Embeddings API Parameters

- __model__ : The ID of the embedding model to use (required)

- __input__ : The text to embed, given as a string or an array of strings. Pre-tokenized input can also be passed as an array of token integers (required)

- __encoding_format__ : The format of the returned embeddings, either `float` (the default) or `base64` (optional)

- __dimensions__ : The length of the embedding vector (optional). By default, the text-embedding-3-large and text-embedding-3-small models create embedding vectors of length 3,072 and 1,536 respectively; this parameter lets you request a shorter vector. It is supported only by the text-embedding-3 models
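
One way to picture what a lower `dimensions` value means is to shorten a full-length vector by hand: keep the first few values and rescale the result to unit length. The sketch below does this in plain Python on a made-up vector. `shorten_embedding` is a hypothetical helper, and the assumption here is only that truncating and re-normalizing is a reasonable manual approximation; what the API does internally when you pass `dimensions` may differ.

```python
import math

def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale the result to unit length."""
    truncated = vector[:dimensions]
    norm = math.sqrt(sum(v * v for v in truncated))
    if norm == 0:
        return truncated  # all zeros: nothing to rescale
    return [v / norm for v in truncated]

# Made-up 6-dimensional vector standing in for a real 1,536-dimensional embedding
full_vector = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2]
short_vector = shorten_embedding(full_vector, 4)

print(len(short_vector))
```

In most cases passing `dimensions` in the request is simpler; the manual route is mainly useful when embeddings have already been generated at full length.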

Let's try creating an embedding vector for a simple statement - __"The food was delicious"__

In [4]:
embeddings=client.embeddings.create(
  model="text-embedding-3-small",
  input="The food was delicious",
  encoding_format="float",
)

Since we chose the text-embedding-3-small model and did not set __dimensions__, the embedding vector will be of size 1,536.

In [5]:
print(embeddings.data[0].embedding)

[-0.019819789, -0.021811483, -0.06169395, -0.038838044, 0.011288293, -0.032474335, -0.007814972, 0.070437975, -0.008889758, -0.04471597, 0.020682048, -0.030701242, 0.005167476, -0.027980879, -0.009915967, -0.009472693, 0.018993964, -0.021738617, -0.017390894, 0.023438843, 0.053872906, -0.0061329617, -0.023001643, -4.990622e-05, -0.0068859193, 0.03733213, -0.00096320896, -0.0014429159, -0.009928111, 0.0076753106, 0.017220872, -0.011616194, 0.023693878, -0.03424743, -0.026985032, -0.022187963, 0.05358144, 0.0347575, 0.034417454, 0.02654783, 0.021520017, 0.060285192, 0.02459257, 0.026474964, 0.049743786, -0.006090456, -0.066988945, 0.03235289, -0.01697798, -0.0130067365, -0.012375223, 0.015532788, -0.0009791485, 0.00421717, 0.0072138202, -0.0030148667, -0.009970617, 0.03633628, 0.034660343, -0.011713349, 0.037696462, -0.005301065, 0.013492516, 0.016237168, 0.008561857, -0.0541158, -0.04933087, 0.008306824, 0.02654783, 0.00553181, 0.0152777545, 0.032765802, -0.060430925, -0.043501522, 0.04

### Embeddings Response Object

Now let's create embeddings for 3 strings - __"The food was delicious"__, __"The ambience was nice"__, __"The service was ordinary"__

We'll set the size of each embedding vector to 10 using the __dimensions__ parameter.

In [6]:
embeddings=client.embeddings.create(
  model="text-embedding-3-small",
  input=["The food was delicious","The ambience was nice","The service was ordinary"],
  encoding_format="float",
  dimensions=10
)

In [None]:
print(embeddings.model_dump_json(indent=4))

<img src="../Assets/Images/Embedding Response Object.png" width=500>
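
To show how the fields of the response fit together, here is a sketch that walks through a hand-written dictionary shaped like the JSON printed above. The vectors and token counts are placeholders rather than real API output, and `sample_response` is a hypothetical name.

```python
# Hand-written stand-in with the same shape as the JSON response
# (placeholder numbers, NOT real API output)
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
        {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
        {"object": "embedding", "index": 2, "embedding": [0.7, 0.8, 0.9]},
    ],
    "model": "text-embedding-3-small",
    "usage": {"prompt_tokens": 12, "total_tokens": 12},
}

# Each entry in "data" corresponds to one input string; "index" gives its position
ordered = sorted(sample_response["data"], key=lambda item: item["index"])
vectors = [item["embedding"] for item in ordered]

print(len(vectors), "vectors of length", len(vectors[0]))
print("tokens billed:", sample_response["usage"]["total_tokens"])
```

With the object returned by the client, the same fields are available as attributes, for example `embeddings.data[0].embedding` and `embeddings.usage.total_tokens`.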

## Application of Embeddings

- Text Search
- Clustering
- For downstream ML models
- Recommendation Algorithms

### Text Search

Let's continue with our example of the three sentences - __"The food was delicious"__, __"The ambience was nice"__, __"The service was ordinary"__

Now, just to demonstrate the idea of how text search is performed, let's search for __"food"__ amongst the three statements

To achieve this, we will first create an embedding for the search string, i.e. __"food"__, using the same model and the same number of dimensions so that the vectors are comparable

In [7]:
embeddings_q=client.embeddings.create(
  model="text-embedding-3-small",
  input="food",
  encoding_format="float",
  dimensions=10
)

In [8]:
query=embeddings_q.data[0].embedding

In [9]:
query

[-0.012315562,
 -0.16567738,
 -0.069231726,
 -0.10728083,
 0.4480219,
 -0.61836094,
 0.29532152,
 0.4434862,
 -0.24505134,
 -0.17046502]

Let's recall the embeddings of the three statements

In [10]:
d1=embeddings.data[0].embedding #"The food was delicious"
d2=embeddings.data[1].embedding #"The ambience was nice"
d3=embeddings.data[2].embedding #"The service was ordinary"

#### Measuring Similarity

__Cosine Similarity__: This method measures the cosine of the angle between the two embeddings. It ranges from -1 to 1, where 1 means the embeddings point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.



<img src="../Assets/Images/cosine similarity.png" width=200>

<img src="https://storage.googleapis.com/lds-media/images/cosine-similarity-vectors.original.jpg">
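
The definition above can be coded directly in a few lines of plain Python. This is a minimal sketch for intuition; `cosine_similarity_manual` is a hypothetical name chosen so that it does not clash with any library function.

```python
import math

def cosine_similarity_manual(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity_manual([1, 0], [2, 0]))   # same direction -> 1.0
print(cosine_similarity_manual([1, 0], [0, 3]))   # orthogonal -> 0.0
print(cosine_similarity_manual([1, 0], [-4, 0]))  # opposite direction -> -1.0
```

Note that the length of the vectors does not affect the score, only their direction.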

While we could code this calculation ourselves, there's a simpler way.

Let's install the __scikit-learn__ library and import the __cosine_similarity__ function

_scikit-learn is a popular open-source machine learning library for the Python programming language. It provides simple and efficient tools for data mining and data analysis, built on top of other Python libraries such as NumPy, SciPy, and matplotlib_

In [11]:
%pip install scikit-learn --quiet

Note: you may need to restart the kernel to use updated packages.


In [12]:
from sklearn.metrics.pairwise import cosine_similarity #for calculating similarities between embeddings

Now we can calculate the similarity scores between each of the three input vectors and our query vector

In [19]:
print(f"The similarity between the string 1 and the query is {cosine_similarity([query],[d1])[0][0]}")

The similarity between the string 1 and the query is 0.6332541953557046


In [20]:
print(f"The similarity between the string 2 and the query is {cosine_similarity([query],[d2])[0][0]}")

The similarity between the string 2 and the query is 0.5118057548747722


In [21]:
print(f"The similarity between the string 3 and the query is {cosine_similarity([query],[d3])[0][0]}")

The similarity between the string 3 and the query is 0.2872144907165794


We see that the similarity score between the statement __"The food was delicious"__ and the query __"food"__ is the highest.

Great, this cosine similarity seems to work. Now let's see how to implement this at a slightly bigger scale.
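
Before scaling up, the score-then-rank pattern used above can be wrapped in a small helper. The sketch below uses made-up 2-dimensional vectors in place of real embeddings so that it runs without an API key; `cosine` and `top_k` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, documents, k=2):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine(query_vector, vector), text) for text, vector in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Made-up 2-dimensional vectors standing in for real embeddings
documents = [
    ("The food was delicious", [0.9, 0.1]),
    ("The ambience was nice", [0.5, 0.5]),
    ("The service was ordinary", [0.1, 0.9]),
]
query_vector = [1.0, 0.0]

for score, text in top_k(query_vector, documents, k=2):
    print(f"{score:.3f}  {text}")
```

The same three steps (embed the query, score every document, sort by score) are what the larger example carries out with pandas.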

### Text Search at Scale

__Medium__ is a popular online publishing platform where users can read, write, and interact with a wide range of articles and blog posts covering various topics such as technology, entrepreneurship, politics, culture, and more.

We will use this dataset which has __1300+ Towards DataScience Medium Articles__ from Kaggle and search the article headlines for closest matches.

Data Source - https://www.kaggle.com/datasets/meruvulikith/1300-towards-datascience-medium-articles-dataset

In [22]:
import pandas as pd

data=pd.read_csv("../Assets/Data/medium.csv")

In [23]:
data.head()

| | Title | Text |
|---|---|---|
| 0 | A Beginner’s Guide to Word Embedding with Gens... | 1. Introduction of Word2vec\n\nWord2vec is one... |
| 1 | Hands-on Graph Neural Networks with PyTorch & ... | In my last article, I introduced the concept o... |
| 2 | How to Use ggplot2 in Python | Introduction\n\nThanks to its strict implement... |
| 3 | Databricks: How to Save Data Frames as CSV Fil... | Photo credit to Mika Baumeister from Unsplash\... |
| 4 | A Step-by-Step Implementation of Gradient Desc... | A Step-by-Step Implementation of Gradient Desc... |


The dataset has just two columns:

- The title of the article
- The text of the article

We'll use the title of the article for this demonstration

In [24]:
data.shape

(1391, 2)

Creation of embeddings takes time. The data has 1391 articles, but for our demonstration we shall use 100 records.

In [25]:
trunc_data=data.iloc[0:100,:]

In [26]:
trunc_data.shape

(100, 2)

We'll now define a function that replaces newline characters with spaces and creates an embedding for an input string

In [27]:
def get_embedding(text, model="text-embedding-3-small"):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model).data[0].embedding

In [29]:
trunc_data = trunc_data.copy() #work on an independent copy so that adding a column does not raise SettingWithCopyWarning
trunc_data['embedding'] = trunc_data['Title'].apply(lambda x: get_embedding(x, model='text-embedding-3-small'))


In [27]:
trunc_data.head()

| | Title | Text | embedding |
|---|---|---|---|
| 0 | A Beginner’s Guide to Word Embedding with Gens... | 1. Introduction of Word2vec\n\nWord2vec is one... | [-0.024416513741016388, 0.019436374306678772, ... |
| 1 | Hands-on Graph Neural Networks with PyTorch & ... | In my last article, I introduced the concept o... | [-0.011600558646023273, -0.02944890409708023, ... |
| 2 | How to Use ggplot2 in Python | Introduction\n\nThanks to its strict implement... | [-0.0054690418764948845, -0.024078190326690674... |
| 3 | Databricks: How to Save Data Frames as CSV Fil... | Photo credit to Mika Baumeister from Unsplash\... | [-0.003941336181014776, -0.022551342844963074,... |
| 4 | A Step-by-Step Implementation of Gradient Desc... | A Step-by-Step Implementation of Gradient Desc... | [-0.0020211779046803713, -0.000974284252151846... |


Now we have another column in our data which stores the embeddings
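
Calling the API once per title means 100 separate requests. Since `input` also accepts a list of strings (as in the three-sentence example earlier), one option is to send the titles in batches. The sketch below shows one way to do that. `embed_in_batches`, `embed_fn`, and the batch size of 50 are hypothetical choices, and `fake_embed` is a stand-in so that the example runs without an API key.

```python
def embed_in_batches(texts, embed_fn, batch_size=50):
    """Embed `texts` by calling `embed_fn` on consecutive batches, preserving order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        # Same clean-up as get_embedding: replace newlines with spaces
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    """Stand-in for the API call: returns one tiny made-up vector per input."""
    return [[float(len(text))] for text in batch]

# With the real client, embed_fn could be written along these lines (assumption, not run here):
# def openai_embed(batch, model="text-embedding-3-small"):
#     response = client.embeddings.create(input=batch, model=model)
#     return [item.embedding for item in sorted(response.data, key=lambda item: item.index)]

titles = [f"Title {i}" for i in range(120)]
vectors = embed_in_batches(titles, fake_embed, batch_size=50)
print(len(vectors))
```

With a real `embed_fn`, the result could be assigned to the DataFrame in one step, for example `trunc_data['embedding'] = embed_in_batches(trunc_data['Title'].tolist(), openai_embed)`.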

Let's try searching articles related to __"Deep Learning"__ from the set. 

To do this, we'll first create an embedding of the search string and then compute its cosine similarity with the embedding of every title in the set.

In [30]:
search_string="Deep Learning"

In [31]:
search_embedding=get_embedding(search_string)

Now, we'll find the cosine similarity and store it in the dataset

In [32]:
trunc_data['relevance'] = trunc_data.embedding.apply(lambda x: float(cosine_similarity([search_embedding],[x])[0][0])) #[0][0] extracts the single score from the 1x1 result

In [41]:
trunc_data

| | Title | Text | embedding | relevance |
|---|---|---|---|---|
| 0 | A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model | 1. Introduction of Word2vec\n\nWord2vec is one of the most popular technique to learn word embed... | [-0.024416513741016388, 0.019436374306678772, 0.023113839328289032, -0.01980527490377426, -0.044... | 0.300155 |
| 1 | Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric | In my last article, I introduced the concept of Graph Neural Network (GNN) and some recent advan... | [-0.011600558646023273, -0.02944890409708023, 0.004553031641989946, -0.03210508078336716, 0.0400... | 0.344641 |
| 2 | How to Use ggplot2 in Python | Introduction\n\nThanks to its strict implementation of the grammar of graphics, ggplot2 provides... | [-0.0054690418764948845, -0.024078190326690674, 0.01770878955721855, -0.012851991690695286, -0.0... | 0.109014 |
| 3 | Databricks: How to Save Data Frames as CSV Files on Your Local Computer | Photo credit to Mika Baumeister from Unsplash\n\nWhen I work on Python projects dealing with lar... | [-0.003941336181014776, -0.022551342844963074, 0.06862198561429977, -0.04217821732163429, 0.0360... | 0.169241 |
| 4 | A Step-by-Step Implementation of Gradient Descent and Backpropagation | A Step-by-Step Implementation of Gradient Descent and Backpropagation\n\nThe original intention ... | [-0.0020211779046803713, -0.0009742842521518469, 0.01972859725356102, -0.01853516511619091, 0.01... | 0.388754 |
| ... | ... | ... | ... | ... |
| 95 | Data Scientist’s toolkit — How to gather data from different sources | Data Scientist’s toolkit — How to gather data from different sources\n\nPhoto by Jakob Owens on ... | [-0.026935778558254242, -0.02546488121151924, 0.022867832332849503, -0.014329741708934307, 0.045... | 0.313273 |
| 96 | Deep Learning on a Budget | Introduction\n\nWhy?\n\nThere are many articles and courses dedicated to the latest ML/AI resear... | [-0.016848241910338402, -0.045139651745557785, -0.008332818746566772, -0.010773622430860996, 0.0... | 0.719206 |
| 97 | Generating Startup names with Markov Chains | Generating Startup names with Markov Chains\n\nThe most interesting applications of Machine Lear... | [0.013092967681586742, -0.0023525876458734274, 0.03218775615096092, -0.037528958171606064, -0.03... | 0.165855 |
| 98 | A Recipe for using Open Source Machine Learning models | A Recipe for using Open Source Machine Learning models\n\nPhoto by Luca Bravo on Unsplash\n\nMac... | [-0.017321426421403885, -0.021059678867459297, 0.026052551344037056, -0.0624237060546875, 0.0059... | 0.387291 |


Finally we can sort the dataset by this score and see the top 10 results

In [43]:
pd.set_option('display.max_colwidth', 150)
print(trunc_data.sort_values(by="relevance",ascending=False).iloc[0:10,:][["Title","relevance"]])

                                                                        Title  \
96                                                  Deep Learning on a Budget   
79                            Applied AI: Going From Concept to ML Components   
73                        Transfer Learning Intuition for Text Classification   
54                                        Reinforcement Learning Introduction   
80                                     Wild Wide AI: responsible data science   
26                          Why Machine Learning Models Degrade In Production   
29                 An Introduction to Recurrent Neural Networks for Beginners   
9                                   What if AI model understanding were easy?   
68  Getting Started with Google BigQuery’s Machine Learning — Titanic Dataset   
69                  Review: DeepPose — Cascade of CNN (Human Pose Estimation)   

    relevance  
96   0.719206  
79   0.482872  
73   0.478084  
54   0.470448  
80   0.445177  
26   0.43760

Congratulations! We're at the end of Day 3!

Hopefully, we are now fairly comfortable using OpenAI embeddings.



<img src="../Assets/Images/That’s all for the day!.png">

# About



<img src="../Assets/Images/profile.png" width=100> 

#### Hi! I'm Abhinav! A data science and AI professional with over 15 years in the industry. Passionate about AI advancements, I constantly explore emerging technologies to push the boundaries and create positive impacts in the world. Let’s build the future, together!

<span style="font-size: 20px; color: orange"><b>Connect with me!</b></span>


[![GitHub followers](https://img.shields.io/badge/Github-000000?style=for-the-badge&logo=github&logoColor=black&color=orange)](https://github.com/abhinav-kimothi)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-000000?style=for-the-badge&logo=linkedin&logoColor=orange&color=black)](https://www.linkedin.com/comm/mynetwork/discovery-see-all?usecase=PEOPLE_FOLLOWS&followMember=abhinav-kimothi)
[![Medium](https://img.shields.io/badge/Medium-000000?style=for-the-badge&logo=medium&logoColor=black&color=orange)](https://medium.com/@abhinavkimothi)
[![Insta](https://img.shields.io/badge/Instagram-000000?style=for-the-badge&logo=instagram&logoColor=orange&color=black)](https://www.instagram.com/akaiworks/)
[![Mail](https://img.shields.io/badge/email-000000?style=for-the-badge&logo=gmail&logoColor=black&color=orange)](mailto:abhinav.kimothi.ds@gmail.com)
[![X](https://img.shields.io/badge/Follow-000000?style=for-the-badge&logo=X&logoColor=orange&color=black)](https://twitter.com/abhinav_kimothi)
[![Linktree](https://img.shields.io/badge/Linktree-000000?style=for-the-badge&logo=linktree&logoColor=black&color=orange)](https://linktr.ee/abhinavkimothi)
[![Gumroad](https://img.shields.io/badge/Gumroad-000000?style=for-the-badge&logo=gumroad&logoColor=orange&color=black)](https://abhinavkimothi.gumroad.com/)


<span style="font-size: 20px; color: orange"><b>You can also book a time-slot with me</b></span>

[![Static Badge](https://img.shields.io/badge/Free%20Virtual%20Coffee%20(15%20min)-000000?style=for-the-badge&logo=googlecalendar&logoColor=black&color=blue)](https://topmate.io/abhinav_kimothi/544386)
[![Static Badge](https://img.shields.io/badge/Resume%20Review%20(DS/AI/ML)%20(30%20min)-000000?style=for-the-badge&logo=googlecalendar&logoColor=blue&color=black)](https://topmate.io/abhinav_kimothi/544382)
[![Static Badge](https://img.shields.io/badge/AIML%20Learning%20Path%20(30%20min)-000000?style=for-the-badge&logo=googlecalendar&logoColor=black&color=blue)](https://topmate.io/abhinav_kimothi/544380)
[![Static Badge](https://img.shields.io/badge/Generative%20AI%20Consulting%20(60%20min)-000000?style=for-the-badge&logo=googlecalendar&logoColor=blue&color=black)](https://topmate.io/abhinav_kimothi/544379)


<span style="font-size: 20px; color: orange"><b>Also, read my ebooks for more on Generative AI!</b></span>



<a href="https://abhinavkimothi.gumroad.com/l/GenAILLM">
    <img src="https://public-files.gumroad.com/jsdnnne2gnhu61f6hrdprwx2255i" width=150>
</a><a href="https://abhinavkimothi.gumroad.com/l/RAG">
    <img src="https://public-files.gumroad.com/v17k9tp2fnbbtg8iwoxt4m3xgivq" width=150>
</a><a href="https://abhinavkimothi.gumroad.com/l/GenAITaxonomy">
    <img src="https://public-files.gumroad.com/a730ysxb7a928bb5xkz6fuqabaqp" width=150>
</a>



