# Get embeddings from a dataset

This notebook shows how to get embeddings from a large dataset.

# 1. Load the dataset

The dataset used in this example is the [Amazon Fine Food Reviews Dataset](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews). It contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).

We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding.

To run this notebook, you will need the NuGet packages referenced in the first cell (Microsoft.ML, Microsoft.Data.Analysis, SharpToken, and System.Numerics.Tensors) and the local `OpenAi.fs` helper script.

In [6]:
// Imports
#r "nuget: Microsoft.ML, 3.0.0-preview.23511.1"
#r "nuget: Microsoft.Data.Analysis, 0.21.0-preview.23511.1"
#r "nuget: SharpToken, 1.2.12"
#r "nuget: System.Numerics.Tensors, 8.0.0-rc.2.23479.6"
#load "OpenAi.fs"

In [18]:
let embeddingModel = "text-embedding-ada-002"
let embeddingEncoding = "cl100k_base"  // this is the encoding for text-embedding-ada-002
let maxTokens = 8000  // the maximum for text-embedding-ada-002 is 8191

In [19]:
open Microsoft.Data.Analysis
open System.IO

let inputDataPath = Path.GetFullPath(@"data/fine_food_reviews_1k.csv")
// DropNulls returns a new DataFrame, so bind its result instead of discarding it
let df = DataFrame.LoadCsv(inputDataPath).DropNulls()

df.Columns.Remove("Id")

let combined = 
    df.Rows
    |> Seq.map (fun row -> 
        let title = string row["Summary"]
        let content = string row["Text"]

        $"Title: {title.Trim()}; Content: {content.Trim()}") 

df.Columns.Add(new StringDataFrameColumn("combined", combined))
df.Head(2)
    

index,Time,ProductId,UserId,Score,Summary,Text,combined
0,1351123200.0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a treat like this,Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone,Title: where does one start...and stop... with a treat like this; Content: Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone
1,1351123200.0,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, most of the rings were broken in pieces. A total waste of money.","Title: Arrived in pieces; Content: Not pleased at all. When I opened the box, most of the rings were broken in pieces. A total waste of money."


In [20]:
open SharpToken

// Subsample to the 1k most recent reviews and remove samples that are too long
let topN = 1000
let encoding = GptEncoding.GetEncoding(embeddingEncoding)

// OrderBy, Tail, and Filter each return a new DataFrame, so the result is
// rebound to `df` here instead of being discarded
let df =
    let recent = df.OrderBy("Time").Tail(topN)
    recent.Columns.Remove("Time")

    // Count the tokens in each combined text
    let nTokens =
        recent["combined"]
        |> Seq.cast<string>
        |> Seq.map (encoding.Encode >> Seq.length)

    recent.Columns.Add(new Int32DataFrameColumn("n_tokens", nTokens))

    // Omit reviews that are too long to embed
    recent.Filter(recent["n_tokens"].ElementwiseLessThanOrEqual(maxTokens))

df.Rows.Count

# 2. Get embeddings and save them for future use

In [21]:
open OpenAi.EmbeddingsUtils
open System.Text.Json

// Ensure you have the OPENAI_API_KEY environment variable set 

// This may take a few minutes
let embedding = 
    df["combined"]
    |> Seq.cast<string>
    |> Seq.map (fun x -> getEmbedding x embeddingModel)
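    |> Seq.map (function
        // Store the first returned embedding as a JSON array string
        | Ok embedding -> JsonSerializer.Serialize(embedding[0].Embedding)
        // A failed request is stored as an empty string
        | _ -> "")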

df.Columns.Add(new StringDataFrameColumn("embedding", embedding))
DataFrame.SaveCsv(df, "data/fine_food_reviews_with_embeddings_1k.csv")
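
Because each embedding is saved as a JSON string, it has to be parsed back into numbers before it can be used again. The following is a minimal sketch of that round trip. It assumes the embedding is an array of 32-bit floats (the actual type comes from `OpenAi.fs`, which is not shown here). The helper `tryParseEmbedding` and the three-element sample vector are hypothetical names and values introduced only for illustration.

```fsharp
open System
open System.Text.Json

// Hypothetical sample: a short vector standing in for a real embedding
let original = [| 0.0123f; -0.0456f; 0.0789f |]

// Same serialization call the notebook uses to fill the "embedding" column
let serialized = JsonSerializer.Serialize(original)

/// Parses one cell of the "embedding" column. Returns None for the empty
/// string that the notebook stores when a request fails.
/// Assumption: embeddings are arrays of 32-bit floats.
let tryParseEmbedding (cell: string) : float32[] option =
    if String.IsNullOrWhiteSpace cell then
        None
    else
        Some (JsonSerializer.Deserialize<float32[]>(cell))

let restored = tryParseEmbedding serialized
let missing = tryParseEmbedding ""
```

Returning an `option` keeps rows with failed requests distinguishable from valid vectors, so later analysis can skip them explicitly rather than fail on an empty string.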
