# Get embeddings from a dataset

This notebook shows how to get embeddings from a large dataset.

# 1. Dependencies and Types

In [10]:
#r "nuget: CsvHelper, 30.0.1"
#r "nuget: SharpToken, 1.2.12"
#r "nuget: System.Numerics.Tensors, 8.0.0-rc.2.23479.6"
#load "Csv.fs"
#load "OpenAi.fs"
#load "Review.fs"
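
The `#load` directives pull in types from script files that are not shown in this notebook. As a rough sketch only, the review types used in the later cells might look something like the following. The field types, the record layout, and the `float[]` embedding type are assumptions inferred from how the cells use them; the real definitions in `Review.fs` may differ.

```fsharp
// Hypothetical shapes for illustration only; the real definitions live in Review.fs.
type Review =
    { Id: int
      Time: int64
      ProductId: string
      UserId: string
      Score: int
      Summary: string
      Text: string }

// Wraps a Review and adds the mutable fields that later cells fill in.
type AugmentedReview(review: Review) =
    member _.Id = review.Id
    member _.Time = review.Time
    member _.ProductId = review.ProductId
    member _.UserId = review.UserId
    member _.Score = review.Score
    member _.Summary = review.Summary
    member _.Text = review.Text
    member val Combined = "" with get, set
    member val NTokens = 0 with get, set
    member val Embedding: float[] = [||] with get, set

// Build one sample row and fill in the combined text the same way the notebook does.
let sample =
    AugmentedReview(
        { Id = 1
          Time = 1351123200L
          ProductId = "B003JK537S"
          UserId = "A3JBPC3WFUT5ZP"
          Score = 1
          Summary = "Arrived in pieces"
          Text = "Not pleased at all. When I opened the box, most of the rings were broken in pieces. A total waste of money." }
    )

sample.Combined <- $"Title: {sample.Summary.Trim()}; Content: {sample.Text.Trim()}"
```

A class with settable properties (rather than an immutable record) matches the way the cells assign `Combined`, `NTokens`, and `Embedding` in place.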

# 2. Load the dataset

The dataset used in this example is the [Amazon Fine Food Reviews Dataset](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews). It contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).

We will combine the review summary and review text into a single combined text. The model encodes this combined text and outputs a single vector embedding.


In [11]:
open Csv
open Review

let inputPath = "data/fine_food_reviews_1k.csv"
let df = 
    readCsv<Review> inputPath
    |> Seq.map AugmentedReview
    |> Seq.map (fun r -> 
        r.Combined <- $"Title: {r.Summary.Trim()}; Content: {r.Text.Trim()}"
        r)


for data in df |> Seq.take 2 do
    printfn $"{data.Id} {data.Time} {data.ProductId} {data.UserId} {data.Score} {data.Summary} {data.Text} {data.Combined}"
    printfn ""


0 1351123200 B003XPF9BO A3R7JR3FMEBXQB 5 where does one  start...and stop... with a treat like this Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone Title: where does one  start...and stop... with a treat like this; Content: Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone

1 1351123200 B003JK537S A3JBPC3WFUT5ZP 1 Arrived in pieces Not pleased at all. When I opened the box, most of the rings were broken in pieces. A total waste of money. Title: Arrived in pieces; Content: Not pleased at all. When I opened the box, most of the rings were broken in pieces. A total waste of money.



In [12]:
open SharpToken

let embeddingEncoding = "cl100k_base"  // this is the encoding for text-embedding-ada-002
let maxTokens = 8000  // the maximum for text-embedding-ada-002 is 8191

// Subsample to 1k most recent reviews and remove samples that are too long
let topN = 1000
let dfSortedByTime =
    df 
    |> Seq.sortByDescending (fun r -> r.Time)
    |> Seq.take topN
    |> Seq.sortBy (fun r -> r.Time)

let encoding = GptEncoding.GetEncoding(embeddingEncoding)
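let dfLessThanNTokens = 
    dfSortedByTime
    |> Seq.map (fun r -> 
        // Count tokens with the same encoding the embedding model uses
        r.NTokens <- encoding.Encode(r.Combined).Count
        r)
    |> Seq.filter (fun r -> r.NTokens <= maxTokens)
    // Seq.truncate (unlike Seq.take) does not throw if the filter removed some reviews
    |> Seq.truncate topN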

dfLessThanNTokens |> Seq.length
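
A note on capping a filtered sequence: `Seq.take n` raises an `InvalidOperationException` when the sequence has fewer than `n` elements, while `Seq.truncate n` returns whatever is available. After a filter step that may drop rows, `Seq.truncate` is therefore the safer choice. A minimal sketch with toy data (the names below are illustrative and not part of the notebook):

```fsharp
// Toy sequence standing in for the filtered reviews.
let items = seq { 1 .. 3 }

// Seq.truncate returns at most n elements and never throws on a short sequence.
let truncated = items |> Seq.truncate 5 |> Seq.toList

// Seq.take is lazy, so the exception only surfaces when the sequence is enumerated.
let takeFailed =
    try
        items |> Seq.take 5 |> Seq.toList |> ignore
        false
    with :? System.InvalidOperationException ->
        true
```

Because sequences are lazy, the failure from `Seq.take` would only appear at the first enumeration, for example at `Seq.length`.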


# 3. Get embeddings and save them for future use

In [None]:
open OpenAi.EmbeddingsUtils

let embeddingModel = "text-embedding-ada-002"

// Ensure you have the OPENAI_API_KEY environment variable set 
// This may take a few minutes
let reviewsWithEmbeddings = 
    dfLessThanNTokens 
    |> Seq.map (fun x -> 
        let res = getEmbeddings x.Combined embeddingModel
        let embedding = 
            match res with
            | Ok embedding -> embedding[0].Embedding
            // On failure, fall back to an empty embedding so the row is still written
            | _ -> [||]

        x.Embedding <- embedding
        x)

saveCsvWithMap<AugmentedReview, AugmentedReviewMap> "data/fine_food_reviews_with_embeddings_1k.csv" reviewsWithEmbeddings
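
The cell above maps any failed request to an empty embedding, so a transient error (for example, a rate limit) leaves a row without a usable vector. One possible mitigation is to retry the call a few times before giving up. The helper below is a hypothetical sketch, not part of `OpenAi.fs`: `retryWithBackoff` is an invented name, and it assumes the call returns a `Result` whose error case is a `string`, which may not match the actual return type of `getEmbeddings`.

```fsharp
open System.Threading

/// Hypothetical helper: calls `action` up to `maxAttempts` times,
/// sleeping `delayMs` after a failure and doubling the delay each time.
let rec retryWithBackoff
    (maxAttempts: int)
    (delayMs: int)
    (action: unit -> Result<'a, string>)
    : Result<'a, string> =
    match action () with
    | Ok value -> Ok value
    | Error _ when maxAttempts > 1 ->
        Thread.Sleep delayMs
        retryWithBackoff (maxAttempts - 1) (delayMs * 2) action
    | Error err -> Error err

// Stand-in for an API call that fails twice before succeeding (assumption for the demo).
let calls = ref 0

let flakyEmbed () : Result<float[], string> =
    calls.Value <- calls.Value + 1
    if calls.Value < 3 then Error "rate limited" else Ok [| 0.1; 0.2 |]

let retried = retryWithBackoff 5 10 flakyEmbed
```

In the notebook, the wrapped action would be something like `fun () -> getEmbeddings x.Combined embeddingModel`, adapted to whatever error type that function actually returns.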