## 1. Load the dataset

The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. It contains a total of 568,454 food reviews left by Amazon users up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be either positive or negative. Each review has a ProductId, UserId, Score, review title (Summary), and review body (Text).

We will combine the review summary and review text into a single string. The model will encode this combined text and output a single vector embedding.

In [1]:
import pandas as pd

input_datapath = 'data/fine_food_reviews_1k.csv'  # to save space, we provide a pre-filtered dataset
df = pd.read_csv(input_datapath, index_col=0)
df = df[['Time', 'ProductId', 'UserId', 'Score', 'Summary', 'Text']]
df = df.dropna()
df['combined'] = "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
df.head(2)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...


Notes on the tokenizer used in the next cell: see the source of [GPT2TokenizerFast.from_pretrained](https://github.com/huggingface/transformers/blob/v4.25.1/src/transformers/tokenization_utils_base.py#L1593) and the [PreTrainedTokenizerBase documentation](https://huggingface.co/docs/transformers/v4.25.1/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase).

In [2]:
# subsample to the 1k most recent reviews and remove samples that are too long
from transformers import GPT2TokenizerFast

# keep 1,100 rows at first to leave headroom for dropping long reviews
df = df.sort_values('Time').tail(1_100)
df.drop('Time', axis=1, inplace=True)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# count tokens per review; the encodings themselves are not stored,
# since get_embedding below works on the text in the 'combined' column
df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))
# remove reviews that are too long, then keep the 1,000 most recent
df = df[df.n_tokens < 2000].tail(1_000)
len(df)

1000
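
The filter-then-take-the-tail logic of the cell above can be sketched without downloading a tokenizer. The sketch below assumes a stand-in `count_tokens` that splits on whitespace (this is not the GPT-2 tokenizer and will give different counts), and the names `filter_by_token_count`, `max_tokens`, and `keep_last` are hypothetical helpers introduced only for illustration.

```python
def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace split, NOT GPT-2 BPE.
    return len(text.split())


def filter_by_token_count(texts, max_tokens, keep_last):
    """Drop texts with max_tokens or more tokens, then keep the last `keep_last`."""
    short_enough = [t for t in texts if count_tokens(t) < max_tokens]
    # A slice of [-0:] would return everything, so handle zero explicitly.
    return short_enough[-keep_last:] if keep_last > 0 else []


# Toy reviews, ordered oldest to newest (made-up data).
reviews = [
    "Title: ok; Content: fine",
    "Title: long; Content: " + "word " * 50,
    "Title: great; Content: loved it",
    "Title: bad; Content: arrived broken",
]
kept = filter_by_token_count(reviews, max_tokens=10, keep_last=2)
print(kept)
```

Taking slightly more rows than needed before filtering, as the notebook does with 1,100, leaves room for the long reviews that get dropped.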

## 2. Get embeddings and save them for future reuse

In [None]:
from openai.embeddings_utils import get_embedding

# This will take just under 10 minutes
df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-similarity-babbage-001'))
df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-search-babbage-doc-001'))
df.to_csv('data/fine_food_reviews_with_embeddings_1k.csv')
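
Because each embedding is a Python list, writing it to CSV stores its string form, so the vectors have to be parsed back into lists of floats when the file is reloaded. The following is a minimal sketch of that round trip using only the standard library; the column name `embedding`, the toy vectors, and the helpers `write_rows` and `read_rows` are illustrative assumptions, not part of the notebook.

```python
import ast
import csv
import io


def write_rows(rows, fieldnames):
    """Serialize rows to CSV text; list values are written via str()."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


def read_rows(csv_text, vector_column):
    """Read CSV text and parse the vector column back into lists of floats."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # literal_eval only accepts Python literals, so it is safer than eval.
        row[vector_column] = ast.literal_eval(row[vector_column])
        rows.append(row)
    return rows


# Made-up rows standing in for reviews with their embeddings.
original = [
    {"combined": "Title: a; Content: b", "embedding": [0.1, -0.2, 0.3]},
    {"combined": "Title: c; Content: d", "embedding": [0.0, 0.5, -0.5]},
]
csv_text = write_rows(original, fieldnames=["combined", "embedding"])
restored = read_rows(csv_text, vector_column="embedding")
print(restored[0]["embedding"])
```

With pandas, the same parsing could presumably be applied to a column such as `babbage_similarity` after reading the file back with `pd.read_csv`.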

In [3]:
df

Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,n_tokens
0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,51
297,B003VXHGPK,A21VWSCGW7UUAR,4,"Good, but not Wolfgang Puck good","Honestly, I have to admit that I expected a li...","Title: Good, but not Wolfgang Puck good; Conte...",178
296,B008JKTTUA,A34XBAIFT02B60,1,Should advertise coconut as an ingredient more...,"First, these should be called Mac - Coconut ba...",Title: Should advertise coconut as an ingredie...,79
295,B000LKTTTW,A14MQ40CCU8B13,5,Best tomato soup,I have a hard time finding packaged food of an...,Title: Best tomato soup; Content: I have a har...,111
294,B001D09KAM,A34XBAIFT02B60,1,Should advertise coconut as an ingredient more...,"First, these should be called Mac - Coconut ba...",Title: Should advertise coconut as an ingredie...,79
...,...,...,...,...,...,...,...
623,B0000CFXYA,A3GS4GWPIBV0NT,1,Strange inflammation response,Truthfully wasn't crazy about the taste of the...,Title: Strange inflammation response; Content:...,110
624,B0001BH5YM,A1BZ3HMAKK0NC,5,My favorite and only MUSTARD,You've just got to experience this mustard... ...,Title: My favorite and only MUSTARD; Content:...,80
625,B0009ET7TC,A2FSDQY5AI6TNX,5,My furbabies LOVE these!,Shake the container and they come running. Eve...,Title: My furbabies LOVE these!; Content: Shak...,46
619,B007PA32L2,A15FF2P7RPKH6G,5,got this for the daughter,all i have heard since she got a kuerig is why...,Title: got this for the daughter; Content: all...,50


In [39]:
import os

import numpy as np
import openai
from openai.embeddings_utils import get_embedding

# read the API key from the environment instead of hardcoding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

In [36]:
# look up two reviews by index label (not by position)
t0 = df['combined'].loc[0]
t1 = df['combined'].loc[1]


In [37]:
# get_embedding calls the OpenAI API, so openai.api_key must be set first
e0 = get_embedding(t0, engine='text-similarity-babbage-001')
e1 = get_embedding(t1, engine='text-similarity-babbage-001')


In [40]:
# use the dot product of the two embeddings as a similarity score; see
# https://openai.com/blog/introducing-text-and-code-embeddings/
similarity_score = np.dot(e0, e1)
similarity_score

0.7108444448351603
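
The dot product above works as a similarity score when both vectors have unit length, because it then equals the cosine similarity. The sketch below illustrates this with small made-up vectors in plain Python; `dot`, `norm`, `normalize`, and `cosine_similarity` are illustrative helpers, and whether a given embedding model returns unit-length vectors is something to check by computing the norm rather than assume.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a non-zero vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


# Made-up vectors; real embeddings here have 2048 dimensions.
a = [3.0, 4.0]
b = [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

print(cosine_similarity(a, b))  # cosine similarity of the raw vectors
print(dot(ua, ub))              # dot product of the unit-length versions
print(norm(ua), norm(ub))       # both should be close to 1.0
```

If the norms of `e0` and `e1` turn out not to be close to 1, dividing the dot product by the product of the norms recovers the cosine similarity.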

In [24]:
xe = get_embedding('hello', engine='text-similarity-babbage-001')
len(xe)

2048

In [31]:
xe[-10:]

[0.016876760870218277,
 -0.013760174624621868,
 -0.015029593370854855,
 0.02958722412586212,
 0.039286885410547256,
 0.008527890779078007,
 0.005940229166299105,
 -0.011717712506651878,
 -0.008999854326248169,
 -0.015648027881979942]

In [33]:
sorted(xe)[-10:]

[0.05878385528922081,
 0.060053274035453796,
 0.0623968169093132,
 0.06262466311454773,
 0.06304780393838882,
 0.07421217858791351,
 0.07538394629955292,
 0.08234947919845581,
 0.10220448672771454,
 0.14647139608860016]

In [23]:
xe

[-0.2001124918460846,
 -0.12375205755233765,
 -0.07727180421352386,
 -0.07466786354780197,
 -0.06698625534772873,
 -0.06434977054595947,
 -0.05985797941684723,
 -0.057579535990953445,
 -0.057514436542987823,
 -0.05692855268716812,
 -0.05689600110054016,
 -0.05478030443191528,
 -0.05419442057609558,
 -0.05380382761359215,
 -0.0530877448618412,
 -0.05276225507259369,
 -0.05243676155805588,
 -0.05201362445950508,
 -0.05172067880630493,
 -0.05139518901705742,
 -0.0509069487452507,
 -0.05041871219873428,
 -0.04924694076180458,
 -0.0491492934525013,
 -0.049116745591163635,
 -0.04843321070075035,
 -0.04787987470626831,
 -0.047489285469055176,
 -0.04680575057864189,
 -0.046024568378925323,
 -0.04569907858967781,
 -0.04543868452310562,
 -0.04534103721380234,
 -0.045275937765836716,
 -0.04455985501408577,
 -0.04452730715274811,
 -0.04449475556612015,
 -0.04433201253414154,
 -0.04429946094751358,
 -0.04400651901960373,
 -0.043583378195762634,
 -0.0431927889585495,
 -0.0421837642788887,
 -0.042151