## GPT-3 Embeddings
**An updated approach to semantic search.**

An embedding is an **information-dense representation of the semantic meaning of a piece of text**.
Each embedding is a vector of floating-point numbers, such that the **distance between two embeddings in the vector space is correlated with the semantic similarity between the two inputs** in their original format.
For example, if two texts are similar, then their vector representations should also be similar.
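
As a rough illustration of that idea, the sketch below scores three hand-made vectors with cosine similarity. The vectors and the `cosine` helper are invented for this example and are not real model output; real embeddings are far longer.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "embeddings", chosen by hand for illustration only
hungry = [0.9, 0.1, 0.0]
eating = [0.8, 0.2, 0.1]
travel = [0.1, 0.2, 0.9]

sim_related = cosine(hungry, eating)    # vectors pointing the same way
sim_unrelated = cosine(hungry, travel)  # vectors pointing in different directions
print(sim_related > sim_unrelated)      # True
```

Vectors that point in nearly the same direction score close to 1, which is the behavior expected from embeddings of related texts.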

Use cases:

- Text Similarity
- Semantic Search
- Classification
- Clustering

The examples below use two families of embedding models:

1. **Similarity embeddings**: These models are good at capturing semantic similarity between two or more pieces of text.
2. **Text search embeddings**: These models help measure whether long documents are relevant to a short search query. There are two types: one for **embedding the documents** to be retrieved, and one for **embedding the search query**.
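
To make the document/query split concrete, here is a minimal sketch of ranking documents against a query by cosine similarity. Everything in it is an assumption made for illustration: the texts, the 3-dimensional vectors, and the `rank_documents` helper are hypothetical, and no API call is made. With real models, the document vectors would come from the document model and the query vector from the matching query model.

```python
import numpy as np

# Hypothetical, hand-made vectors standing in for real embeddings
doc_texts = ["pasta recipe", "train timetable", "hiking trails"]
doc_vectors = np.array([
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.2],
    [0.1, 0.3, 0.9],
])
query_vector = np.array([0.8, 0.2, 0.0])  # stands in for a food-related query

def rank_documents(query, docs):
    """Return (indices sorted from most to least similar, cosine scores)."""
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = docs_unit @ query_unit
    return np.argsort(-scores), scores

order, scores = rank_documents(query_vector, doc_vectors)
for i in order:
    print(f"{scores[i]:.3f}  {doc_texts[i]}")
```

The document vectors only need to be computed once and can be stored; each new query then costs one embedding plus a matrix-vector product.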

In [1]:
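# Import required libraries
# Written against the legacy openai Python client (< 1.0), which provides
# openai.Embedding and openai.embeddings_utils.

import os

import numpy as np
import openai
import pandas as pd
from openai.embeddings_utils import get_embedding, cosine_similarity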

In [None]:
# API key stored as an environment variable

openai.api_key = os.environ["OPENAI_API_KEY"]
# print(openai.api_key)  # Avoid printing the key: it is a secret and would be saved in the notebook output.

In [3]:
# Text Similarity: captures semantic similarity between pieces of text.

document = ["eating food", "I am hungry", "I am traveling", "exploring new places"]
response = openai.Embedding.create(
    input=document,
    engine="text-similarity-curie-001"
)

In [6]:
# Interrogate response

print(*[f"{type(r)} | {len(r)}" for r in [response["data"], response["data"][0]]], sep='\n')
print(response["data"][0].keys())
print(response["data"][0]["embedding"][:20])

<class 'list'> | 4
<class 'openai.openai_object.OpenAIObject'> | 3
dict_keys(['object', 'index', 'embedding'])
[-0.0019766087643802166, 0.0014332111459225416, -0.015559284016489983, 0.011673991568386555, -0.01459022518247366, 0.009178890846669674, -0.0053796363063156605, -0.011148707009851933, 0.010596252977848053, 0.02966950833797455, 0.0068060546182096004, -0.015504944138228893, -0.007222659420222044, 0.011357009410858154, 0.016854381188750267, 0.014372865669429302, -0.019580425694584846, -0.026680821552872658, 0.01044229045510292, 0.001698117470368743]


In [7]:
# Embeddings (vector representation) for the contents of `document`

embedding_a = response['data'][0]['embedding']  # eating food
embedding_b = response['data'][1]['embedding']  # I am hungry
embedding_c = response['data'][2]['embedding']  # I am traveling
embedding_d = response['data'][3]['embedding']  # exploring new places
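
With more than a handful of inputs, one variable per embedding gets unwieldy. The sketch below stacks the embeddings into a single NumPy array instead. It runs on `mock_response`, a hypothetical stand-in that only imitates the `data` / `index` / `embedding` layout printed above; the `embeddings_matrix` helper is likewise an illustration, not part of the `openai` library.

```python
import numpy as np

# Hypothetical stand-in with the same layout as the response inspected above:
# a "data" list whose items carry "index" and "embedding" keys.
mock_response = {
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.0, 1.0]},
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
    ]
}

def embeddings_matrix(resp):
    """Stack embeddings into an (n_inputs, n_dims) array, ordered by "index"."""
    items = sorted(resp["data"], key=lambda item: item["index"])
    return np.array([item["embedding"] for item in items])

matrix = embeddings_matrix(mock_response)
print(matrix.shape)  # (2, 2)
```

Assuming the real `response` has the layout shown earlier, `embeddings_matrix(response)` should give one row per entry in `document`, in input order.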

In [8]:
# Compare embeddings using dot products (a higher value means more similar)

print(np.dot(embedding_a, embedding_b))   # eating food vs I am hungry
print(np.dot(embedding_a, embedding_c))   # eating food vs I am traveling
print(np.dot(embedding_c, embedding_d))   # I am traveling vs exploring new places

0.8482169927861086
0.7816395422937359
0.8348286175991912
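
A dot product equals cosine similarity only when both vectors have unit length. OpenAI's documentation describes its embeddings as normalized to length 1, but that is worth checking for whichever model is in use. The sketch below shows the check and a pairwise similarity matrix on hypothetical 2-dimensional vectors, which are normalized by hand to mimic that assumption.

```python
import numpy as np

# Hypothetical vectors, normalized to unit length to mimic the assumption
vectors = np.array([
    [3.0, 4.0],
    [4.0, 3.0],
    [-4.0, 3.0],
])
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Check the assumption: every row should have norm 1
print(np.allclose(np.linalg.norm(unit, axis=1), 1.0))  # True

# For unit vectors, the matrix of dot products is the cosine-similarity matrix
similarity = unit @ unit.T
print(np.round(similarity, 2))
```

Entry `[i, j]` of `similarity` compares input `i` with input `j`, so all pairs are scored in one operation rather than one `np.dot` call per pair.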


In [9]:
# Amazon Fine Food Reviews dataset: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
# Load a 1k-review sample with precomputed embeddings

datafile_path = "https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv"
df = pd.read_csv(datafile_path)
df.head()

| Unnamed: 0 | ProductId | UserId | Score | Summary | Text | combined | n_tokens | babbage_similarity | babbage_search |
|---|---|---|---|---|---|---|---|---|---|
| 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... | 51 | [-0.01274053193628788, 0.010849879123270512, -... | [-0.01880764216184616, 0.019457539543509483, -... |
| 1 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | Title: Arrived in pieces; Content: Not pleased... | 35 | [-0.024154752492904663, 0.0024838377721607685,... | [-0.03571609780192375, 0.010356518439948559, -... |
| 2 | B000JMBE7M | AQX1N6A51QOKG | 4 | It isn't blanc mange, but isn't bad . . . | I'm not sure that custard is really custard wi... | Title: It isn't blanc mange, but isn't bad . .... | 277 | [0.0032693513203412294, 0.017815979197621346, ... | [-0.010433986783027649, 0.024620095267891884, ... |
| 3 | B004AHGBX4 | A2UY46X0OSNVUQ | 3 | These also have SALT and it's not sea salt. | I like the fact that you can see what you're g... | Title: These also have SALT and it's not sea s... | 246 | [-0.03584608808159828, 0.03424076735973358, -0... | [-0.040209852159023285, 0.03804996609687805, -... |
| 4 | B001BORBHO | A1AFOYZ9HSM2CZ | 5 | Happy with the product | My dog was suffering with itchy skin. He had ... | Title: Happy with the product; Content: My dog... | 87 | [0.005218076519668102, 0.018165964633226395, -... | [0.010450801812112331, 0.022801749408245087, -... |
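
Because the file is a CSV, `pd.read_csv` loads the `babbage_similarity` and `babbage_search` columns as strings that merely look like lists. Before any numeric work they need to be parsed. The sketch below shows one way to do that on a tiny hypothetical frame; `toy` and `parse_embedding` are names invented here, and `ast.literal_eval` is used as a safer alternative to `eval`.

```python
import ast

import numpy as np
import pandas as pd

# Tiny hypothetical frame imitating how an embedding column looks after read_csv
toy = pd.DataFrame({
    "Score": [5, 1],
    "babbage_similarity": ["[0.1, -0.2, 0.3]", "[0.0, 0.5, -0.5]"],
})
print(type(toy["babbage_similarity"][0]))  # <class 'str'>

def parse_embedding(text):
    """Turn a stringified list of floats into a NumPy array."""
    return np.array(ast.literal_eval(text), dtype=float)

toy["babbage_similarity"] = toy["babbage_similarity"].apply(parse_embedding)
print(toy["babbage_similarity"][0].shape)  # (3,)
```

The same `.apply(parse_embedding)` call should work on the real `df` columns, assuming each cell holds a complete bracketed list.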
