<img src="../Assets/Images/Day 3 Header (Embeddings).png">

# Welcome to Day 3

Today we will:

- Learn about Embeddings
- Look at OpenAI's available embedding models
- Study the embeddings API
- Decode the embeddings response object
- Apply embeddings to:
    - Perform Text Search
    - Do Text Clustering

## Brief Introduction to Embeddings

<span style="font-size: 20px; color: orange"><b>Embeddings are vector representations of data that capture meaningful relationships between entities</b></span>

<span style="font-size: 16px; color: blue"><b>For text, these entities are typically words, punctuation marks, or other meaningful substrings that make up the text</b></span>

- All Machine Learning/AI models work with numerical data. Before a model can perform any operation, text, image, audio, or video data has to be transformed into a numerical representation.

- As a general definition, embeddings are data that has been transformed into n-dimensional vectors (arrays of n numbers) for use in deep learning computations.

<img src="../Assets/Images/Embeddings.png" width=800>
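To make the idea concrete, here is a toy sketch with hand-made 3-dimensional vectors. The numbers and the `toy_embeddings` name are invented for illustration; real embedding models produce much longer vectors, and the values are learned rather than hand-picked.

```python
import math

# Hypothetical, hand-picked vectors (NOT produced by any model).
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_king_queen = euclidean_distance(toy_embeddings["king"], toy_embeddings["queen"])
d_king_apple = euclidean_distance(toy_embeddings["king"], toy_embeddings["apple"])

# Related words sit closer together than unrelated ones.
print(f"king-queen distance: {d_king_queen:.3f}")
print(f"king-apple distance: {d_king_apple:.3f}")
```

The point is only that "meaningful relationships" become measurable once the data is a vector: closeness in the vector space stands in for closeness in meaning.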

## Available OpenAI Embeddings

| Model | Price |
|---|---|
| __text-embedding-3-small__ | \$0.02 / 1M tokens |
| __text-embedding-3-large__ | \$0.13 / 1M tokens |
| __ada v2__ (`text-embedding-ada-002`) | \$0.10 / 1M tokens |
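As a rough sketch of what these prices mean in practice, the snippet below estimates the cost of embedding a given number of tokens. The prices are copied from the pricing above and may change over time; the 5-million-token workload and the `estimate_cost` helper are made up for illustration.

```python
# Prices in USD per 1M tokens, copied from the pricing above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "ada v2": 0.10,
}

def estimate_cost(total_tokens, model):
    """Estimated cost in USD for embedding `total_tokens` tokens with `model`."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Hypothetical workload: 5 million tokens.
for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${estimate_cost(5_000_000, model):.2f}")
```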

## Embeddings API

In [2]:
%pip install openai --quiet #You can remove '--quiet' to see the installation steps
%pip install python-dotenv --quiet

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [3]:
#### Import Libraries ####
import openai #OpenAI python library
from openai import OpenAI #OpenAI Client

from dotenv import load_dotenv
import os

load_dotenv()

openai_api_key=os.getenv("OPENAI_API_KEY")

client = OpenAI(api_key=openai_api_key)

In [4]:
embeddings=client.embeddings.create(
  model="text-embedding-3-small",
  input="The food was delicious",
  encoding_format="float",
)

In [5]:
print(embeddings.model_dump_json(indent=4))

{
    "data": [
        {
            "embedding": [
                -0.019819789,
                -0.021811483,
                -0.06169395,
                -0.038838044,
                0.011288293,
                -0.032474335,
                -0.007814972,
                0.070437975,
                -0.008889758,
                -0.04471597,
                0.020682048,
                -0.030701242,
                0.005167476,
                -0.027980879,
                -0.009915967,
                -0.009472693,
                0.018993964,
                -0.021738617,
                -0.017390894,
                0.023438843,
                0.053872906,
                -0.0061329617,
                -0.023001643,
                -0.00004990622,
                -0.0068859193,
                0.03733213,
                -0.00096320896,
                -0.0014429159,
                -0.009928111,
                0.0076753106,
                0.017220872,
                -0.011
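The printed output is cut off, so here is a small sketch of how the response can be decoded. It parses an abbreviated sample with the same shape as the JSON dump. The first three embedding values come from the output above; the `model`, `object`, and `usage` fields follow the response layout as I understand it, and the token counts are placeholders rather than real measurements.

```python
import json

# Abbreviated sample shaped like the printed response.
# The "usage" numbers are placeholders, not real measurements.
sample_response = """
{
    "data": [
        {
            "embedding": [-0.019819789, -0.021811483, -0.06169395],
            "index": 0,
            "object": "embedding"
        }
    ],
    "model": "text-embedding-3-small",
    "object": "list",
    "usage": {"prompt_tokens": 4, "total_tokens": 4}
}
"""

response = json.loads(sample_response)

vector = response["data"][0]["embedding"]          # the embedding itself
position = response["data"][0]["index"]            # position of the input in the request
model_used = response["model"]                     # model that produced the vector
tokens_billed = response["usage"]["total_tokens"]  # tokens counted for the request

print(len(vector), position, model_used, tokens_billed)
```

On the live response object the same fields should be reachable as attributes, for example `embeddings.data[0].embedding`, `embeddings.model`, and `embeddings.usage.total_tokens`.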

In [6]:
embeddings=client.embeddings.create(
  model="text-embedding-3-small",
  input=["The food was delicious","The ambience was nice","The service was ordinary"],
  encoding_format="float",
  dimensions=10
)

In [7]:
print(embeddings.model_dump_json(indent=4))

{
    "data": [
        {
            "embedding": [
                -0.16478635,
                -0.18134578,
                -0.5129379,
                -0.32290855,
                0.0938535,
                -0.2699992,
                -0.0649755,
                0.5856378,
                -0.073911525,
                -0.37177902
            ],
            "index": 0,
            "object": "embedding"
        },
        {
            "embedding": [
                -0.029102806,
                -0.34336767,
                -0.66657615,
                -0.37806797,
                0.08133914,
                -0.4815079,
                -0.12963039,
                0.060725518,
                -0.060477655,
                -0.17713675
            ],
            "index": 1,
            "object": "embedding"
        },
        {
            "embedding": [
                -0.27565208,
                0.1992017,
                -0.5211547,
                -0.37973946,
                -0.0
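The shortened vectors do not look like a separate embedding: in the outputs above, the 10-dimensional vector for "The food was delicious" appears to be the first 10 values of the full vector, rescaled to unit length. The sketch below checks this on the printed numbers only. It is an observation about this notebook's output, not a guarantee for every model.

```python
import math

# First 10 values of the full-length embedding printed earlier
# for "The food was delicious".
full_prefix = [
    -0.019819789, -0.021811483, -0.06169395, -0.038838044, 0.011288293,
    -0.032474335, -0.007814972, 0.070437975, -0.008889758, -0.04471597,
]

# Values returned for the same sentence with dimensions=10.
api_short = [
    -0.16478635, -0.18134578, -0.5129379, -0.32290855, 0.0938535,
    -0.2699992, -0.0649755, 0.5856378, -0.073911525, -0.37177902,
]

def l2_normalize(vec):
    """Scale a vector so that its Euclidean length becomes 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

manual_short = l2_normalize(full_prefix)
max_diff = max(abs(a - b) for a, b in zip(manual_short, api_short))
print(f"largest difference: {max_diff:.6f}")
```

A small difference here means that truncating and re-normalizing by hand reproduces what `dimensions=10` returned.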

## Text Search

In [8]:
embeddings_q=client.embeddings.create(
  model="text-embedding-3-small",
  input="food",
  encoding_format="float",
  dimensions=10
)

In [9]:
query=embeddings_q.data[0].embedding

In [10]:
query

[-0.012315562,
 -0.16567738,
 -0.069231726,
 -0.10728083,
 0.4480219,
 -0.61836094,
 0.29532152,
 0.4434862,
 -0.24505134,
 -0.17046502]

In [11]:
d1=embeddings.data[0].embedding #"The food was delicious"
d2=embeddings.data[1].embedding #"The ambience was nice"
d3=embeddings.data[2].embedding #"The service was ordinary"

<img src="https://storage.googleapis.com/lds-media/images/cosine-similarity-vectors.original.jpg">
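Cosine similarity is the dot product of two vectors divided by the product of their lengths: cos(θ) = (a · b) / (‖a‖ ‖b‖). Here is a minimal plain-Python sketch for intuition, using the `query` vector for "food" and the 10-dimensional vector for "The food was delicious" printed earlier. The names `cosine`, `query_vec`, and `d1_vec` are illustrative. The result should land very close to the value scikit-learn reports for the same pair.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Embedding of the query "food" (dimensions=10), as printed above.
query_vec = [
    -0.012315562, -0.16567738, -0.069231726, -0.10728083, 0.4480219,
    -0.61836094, 0.29532152, 0.4434862, -0.24505134, -0.17046502,
]

# Embedding of "The food was delicious" (dimensions=10), as printed above.
d1_vec = [
    -0.16478635, -0.18134578, -0.5129379, -0.32290855, 0.0938535,
    -0.2699992, -0.0649755, 0.5856378, -0.073911525, -0.37177902,
]

similarity = cosine(query_vec, d1_vec)
print(round(similarity, 4))
```

Values near 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.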

In [14]:
%pip install scikit-learn --quiet

Note: you may need to restart the kernel to use updated packages.


In [15]:
from sklearn.metrics.pairwise import cosine_similarity #for calculating similarities between embeddings

In [16]:
cosine_similarity([query],[d1])

array([[0.6332542]])

In [17]:
cosine_similarity([query],[d2])

array([[0.51180575]])

In [18]:
cosine_similarity([query],[d3])

array([[0.28721449]])
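Text search then comes down to ranking the documents by their similarity to the query. A minimal sketch, using the three scores reported above (the `scores` and `ranked` names are illustrative):

```python
# Similarity scores reported above for the query "food".
scores = {
    "The food was delicious": 0.6332542,
    "The ambience was nice": 0.51180575,
    "The service was ordinary": 0.28721449,
}

# Sort sentences from most to least similar to the query.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for sentence, score in ranked:
    print(f"{score:.3f}  {sentence}")

# The top-ranked sentence is the search result.
best_match = ranked[0][0]
```

The sentence that actually mentions food ranks first, which is the behaviour we want from a search over these three sentences.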

__1300+ Towards DataScience Medium Articles Dataset__

Data Source - https://www.kaggle.com/datasets/meruvulikith/1300-towards-datascience-medium-articles-dataset

In [20]:
import pandas as pd
data=pd.read_csv("../Assets/Data/medium.csv")

In [21]:
data.head()

Unnamed: 0,Title,Text
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o..."
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...


In [22]:
data.shape

(1391, 2)

In [23]:
trunc_data=data.iloc[0:100,:].copy() #.copy() lets us add columns later without a SettingWithCopyWarning

In [24]:
trunc_data.shape

(100, 2)

In [25]:
def get_embedding(text, model="text-embedding-3-small"):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model).data[0].embedding
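`get_embedding` sends one request per text. Since the API accepts a list of inputs, as the earlier three-sentence request shows, a batched variant can cut down the number of calls. The sketch below is one possible way to do it, not the notebook's method: `chunked`, `get_embeddings_batched`, and the `batch_size=50` default are assumptions made for illustration.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def get_embeddings_batched(client, texts, model="text-embedding-3-small", batch_size=50):
    """Embed many texts with one request per batch instead of one per text.

    `client` is expected to be an OpenAI client like the one created earlier.
    `batch_size=50` is an arbitrary illustrative value, not an API requirement.
    """
    vectors = []
    for batch in chunked(texts, batch_size):
        # Same newline clean-up as get_embedding.
        cleaned = [t.replace("\n", " ") for t in batch]
        response = client.embeddings.create(input=cleaned, model=model)
        # Sort by `index` so results line up with the inputs.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

A hypothetical usage would be `trunc_data['embedding'] = get_embeddings_batched(client, trunc_data.Title.tolist())`, which should fill the same column with two requests instead of one hundred.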

In [26]:
trunc_data['embedding'] = trunc_data.Title.apply(lambda x: get_embedding(x, model='text-embedding-3-small'))



In [27]:
trunc_data.head()

Unnamed: 0,Title,Text,embedding
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...,"[-0.024416513741016388, 0.019436374306678772, ..."
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o...","[-0.011600558646023273, -0.02944890409708023, ..."
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...,"[-0.0054690418764948845, -0.024078190326690674..."
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...,"[-0.003941336181014776, -0.022551342844963074,..."
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...,"[-0.0020211779046803713, -0.000974284252151846..."


In [28]:
search_string="Deep Learning"

In [29]:
search_embedding=get_embedding(search_string)

In [30]:
#cosine_similarity returns a 1x1 array; [0][0] extracts the single score
trunc_data['relevance'] = trunc_data.embedding.apply(lambda x: float(cosine_similarity([search_embedding],[x])[0][0]))


In [31]:
trunc_data

Unnamed: 0,Title,Text,embedding,relevance
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...,"[-0.024416513741016388, 0.019436374306678772, ...",0.300155
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o...","[-0.011600558646023273, -0.02944890409708023, ...",0.344641
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...,"[-0.0054690418764948845, -0.024078190326690674...",0.109014
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...,"[-0.003941336181014776, -0.022551342844963074,...",0.169241
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...,"[-0.0020211779046803713, -0.000974284252151846...",0.388754
...,...,...,...,...
95,Data Scientist’s toolkit — How to gather data ...,Data Scientist’s toolkit — How to gather data ...,"[-0.026935778558254242, -0.02546488121151924, ...",0.313273
96,Deep Learning on a Budget,Introduction\n\nWhy?\n\nThere are many article...,"[-0.016848241910338402, -0.045139651745557785,...",0.719206
97,Generating Startup names with Markov Chains,Generating Startup names with Markov Chains\n\...,"[0.013092967681586742, -0.0023525876458734274,...",0.165855
98,A Recipe for using Open Source Machine Learnin...,A Recipe for using Open Source Machine Learnin...,"[-0.017321426421403885, -0.021059678867459297,...",0.387291


In [32]:
trunc_data.sort_values(by="relevance",ascending=False).iloc[0:10,:]

Unnamed: 0,Title,Text,embedding,relevance
96,Deep Learning on a Budget,Introduction\n\nWhy?\n\nThere are many article...,"[-0.016848241910338402, -0.045139651745557785,...",0.719206
79,Applied AI: Going From Concept to ML Components,Opening your mind to different ways of applyin...,"[-0.016721589490771294, -0.026498831808567047,...",0.482872
73,Transfer Learning Intuition for Text Classific...,Transfer Learning Intuition for Text Classific...,"[-0.022648293524980545, 0.0017097401432693005,...",0.478084
54,Reinforcement Learning Introduction,Reinforcement Learning Introduction\n\nAn intr...,"[0.009179973974823952, -0.05973218381404877, 0...",0.470448
80,Wild Wide AI: responsible data science,Wild Wide AI: responsible data science\n\nData...,"[0.041022028774023056, -0.00013012583076488227...",0.445177
26,Why Machine Learning Models Degrade In Production,After several failed ML projects due to unexpe...,"[0.012906364165246487, 0.030608268454670906, 0...",0.437601
29,An Introduction to Recurrent Neural Networks f...,An Introduction to Recurrent Neural Networks f...,"[-0.01791239343583584, -0.02631079964339733, 0...",0.424425
9,What if AI model understanding were easy?,Irreverent Demystifiers\n\nWhat if AI model un...,"[-0.011677316389977932, -0.0018296980997547507...",0.42285
68,Getting Started with Google BigQuery’s Machine...,"While still in Beta, BigQuery ML has been avai...","[-0.03437798097729683, 0.012720847502350807, 0...",0.41643
69,Review: DeepPose — Cascade of CNN (Human Pose ...,Review: DeepPose — Cascade of CNN (Human Pose ...,"[0.01773509941995144, -0.03974172845482826, 0....",0.412199


Congratulations! We're at the end of Day 3!

Hopefully, we are now fairly confident about using OpenAI embeddings.






<img src="../Assets/Images/profile.png" width=50> 

Hi! I'm Abhinav! A data science and AI professional with over 15 years in the industry. Passionate about AI advancements, I constantly explore emerging technologies to push the boundaries and create positive impacts in the world. Let’s build the future, together!

<span style="font-size: 20px; color: orange"><b>Connect with me!</b></span>


 
[![GitHub followers](https://img.shields.io/github/followers/abhinav-kimothi?label=Follow&style=social)](https://github.com/abhinav-kimothi)
[![Me](https://img.shields.io/badge/Medium-8A2BE2)](https://medium.com/@abhinavkimothi)
[![LIn](https://img.shields.io/badge/LinkedIn-blue)](https://www.linkedin.com/in/abhinav-kimothi/)
[![Mail](https://img.shields.io/badge/eMail-green)](mailto:abhinav.kimothi.ds@gmail.com)
[![Twitter Follow](https://img.shields.io/twitter/follow/abhinav_kimothi?style=social)](https://twitter.com/abhinav_kimothi)

<span style="font-size: 20px; color: orange"><b>Also, read my ebooks for more on Generative AI!</b></span>



<a href="https://abhinavkimothi.gumroad.com/l/GenAILLM">
    <img src="https://public-files.gumroad.com/jsdnnne2gnhu61f6hrdprwx2255i" width=150>
</a><a href="https://abhinavkimothi.gumroad.com/l/RAG">
    <img src="https://public-files.gumroad.com/v17k9tp2fnbbtg8iwoxt4m3xgivq" width=150>
</a><a href="https://abhinavkimothi.gumroad.com/l/GenAITaxonomy">
    <img src="https://public-files.gumroad.com/a730ysxb7a928bb5xkz6fuqabaqp" width=150>
</a>


<img src="../Assets/Images/That’s all for the day!.png">