# 4.3 Embeddings

OpenAI’s text embeddings turn strings into vectors of numbers, which makes it possible to measure how related two text strings are. Embeddings are commonly used for:

* Search (where results are ranked by relevance to a query string)
* Clustering (where text strings are grouped by similarity)
* Recommendations (where items with related text strings are recommended)
* Anomaly detection (where outliers with little relatedness are identified)
* Diversity measurement (where similarity distributions are analyzed)
* Classification (where text strings are classified by their most similar label)

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
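To make the idea of "distance between vectors" concrete, here is a minimal sketch using cosine distance (one minus cosine similarity), the measure used later in this section. The three-dimensional vectors are made up for illustration; real embeddings have many more dimensions, and the helper names `cosine_sim` and `cosine_dist` are assumptions of this sketch, not part of any library.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_dist(a, b):
    """Cosine distance: close to 0 for related vectors, larger for unrelated ones."""
    return 1.0 - cosine_sim(a, b)

# Hypothetical toy "embeddings" (not produced by any model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_dist(cat, kitten))  # small distance: the vectors point the same way
print(cosine_dist(cat, car))     # large distance: the vectors are nearly orthogonal
```

Cosine distance ignores vector length and only compares direction, which is why it is a common choice for comparing embeddings.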


As mentioned above, embeddings have many uses, but this tutorial focuses only on how to make the request and on a simple comparison between two strings.

### Models

There are currently three embedding models available. The ones with "-3" in the name are third-generation models.

* text-embedding-3-small
* text-embedding-3-large
* text-embedding-ada-002

In [1]:
from openai import OpenAI
client = OpenAI()

**Endpoint to retrieve the embeddings for a string**

*model* - the model used to generate the embeddings  
*input* - the string you want to retrieve the embeddings for

In [2]:
# Retrieve the embeddings from a string
response1 = client.embeddings.create(
    input="We are testing to see if this string has any similarities to another one.",
    model="text-embedding-3-small"
)

embeddings1 = response1.data[0].embedding

print(f"\n Full Response - {response1} \n Embeddings - {embeddings1}")


 Full Response - CreateEmbeddingResponse(data=[Embedding(embedding=[-0.00921399425715208, -0.03483395278453827, 0.01737102121114731, -0.015417930670082569, -0.038479723036289215, 0.0043580736964941025, 0.03563050925731659, 0.007149845361709595, 0.00019602716201916337, 0.006797522772103548, -0.005472484510391951, -0.05983351916074753, -0.05542183294892311, 0.007479189895093441, -0.014085233211517334, 0.03268938511610031, -0.0025830587837845087, -0.009275267831981182, -0.054778460413217545, -0.010263302363455296, 0.038510359823703766, 0.0046529523096978664, 0.029610393568873405, 0.024049827829003334, 0.06020116060972214, -0.03029971942305565, -0.019239861518144608, 0.016405966132879257, 0.024754472076892853, -0.023084770888090134, 0.017309749498963356, -0.03988901525735855, -0.050948869436979294, 0.008945923298597336, 0.0015634302981197834, 0.02469319850206375, 0.003902352647855878, 0.01706465519964695, -0.024402150884270668, -0.012024913914501667, -0.04402497038245201, 0.02562761865556
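The `input` parameter also accepts a list of strings, so several texts can be embedded in a single request. The sketch below wraps that in a small helper. The name `embed_texts` is hypothetical, and the function takes the client as an argument so it can be exercised without network access; it assumes each returned item exposes `index` and `embedding`, as in the response shown above.

```python
def embed_texts(client, texts, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per input string, in input order.

    `client` is assumed to be an `openai.OpenAI` instance (or any object
    exposing the same `embeddings.create` method).
    """
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item records the position of its input string in `index`;
    # sorting on it keeps the results aligned with `texts`.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Usage with the client created earlier (requires an API key):
# vectors = embed_texts(client, ["first string", "second string"])
```

Sending one batched request is usually preferable to looping over single-string calls, since it reduces the number of round trips.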

**Comparing the embeddings of two strings using cosine similarity**

In [3]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Retrieve the embeddings for another string
response2 = client.embeddings.create(
    input="We're experimenting to determine if this string bears resemblance to another.",
    model="text-embedding-3-small"
)

embeddings2 = response2.data[0].embedding

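# Convert the embeddings to 2D numpy arrays of shape (1, n_dimensions),
# the layout that scikit-learn's cosine_similarity expects
embeddings1 = np.array(embeddings1).reshape(1, -1)
embeddings2 = np.array(embeddings2).reshape(1, -1)

# Compute the cosine similarity between the embeddings of the two strings.
# The result is a 1x1 matrix, so the score is at position [0][0].
similarity_score = cosine_similarity(embeddings1, embeddings2)

# Round the score to two decimal places for display
final_score = round(float(similarity_score[0][0]), 2)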

print(f"Similarity: {final_score}")

# Determining the level of similarity
if final_score < 0.25:
    print("Totally Different")
elif final_score <= 0.5:
    print("Different")
elif final_score <= 0.75:
    print("Partially Equal")
elif final_score <= 0.90:
    print("Almost Equal")
elif final_score <= 1:
    print("Equal")
else:
    print("Unknown value")

Similarity: 0.88
Almost Equal
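If the same bands need to be applied to many scores, the ladder of conditions above can be packaged as a function. This is only a sketch: `similarity_label` is a hypothetical name, and the cut-off values are the illustrative ones used in this tutorial, not a standard scale. Sensible thresholds depend on the model and on the kind of text being compared.

```python
def similarity_label(score):
    """Map a cosine similarity score to the descriptive bands used in this tutorial."""
    if score < 0.25:
        return "Totally Different"
    # (upper bound, label) pairs, checked in ascending order
    bands = (
        (0.50, "Different"),
        (0.75, "Partially Equal"),
        (0.90, "Almost Equal"),
        (1.00, "Equal"),
    )
    for upper, label in bands:
        if score <= upper:
            return label
    return "Unknown value"

print(similarity_label(0.88))  # Almost Equal
```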
