## 1. Load the dataset

The dataset used in this example is survey data on food waste during Ramadan
(provided by Jordan W-M).


To run this notebook, you will need to install: pandas, openai, tiktoken,
transformers, plotly, matplotlib, scikit-learn, torch (a transformers
dependency), torchvision, and scipy.
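
Before running the cells, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library. The import names in `REQUIRED` are assumptions derived from the install list (scikit-learn is imported as `sklearn`), and `tiktoken` is included because the first code cell imports it. `missing_packages` is a hypothetical helper, not part of the notebook.

```python
from importlib.util import find_spec

# Import names assumed from the install list; scikit-learn is imported as "sklearn".
# tiktoken is included because the first code cell imports it.
REQUIRED = [
    "pandas", "openai", "tiktoken", "transformers", "plotly",
    "matplotlib", "sklearn", "torch", "torchvision", "scipy",
]


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name reported as missing would need to be installed before the import cell below can run.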


In [10]:
import pandas as pd
import tiktoken
import openai

# openai.embeddings_utils is only available in openai<1.0; it was removed in 1.0
from openai.embeddings_utils import get_embedding


In [11]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191


In [12]:
# load & inspect dataset
input_datapath = "data/english_free_text.csv"
df = pd.read_csv(input_datapath, index_col=0)
questions = ["whyRamLess", "whyCompMore", "whyProblemSoc", "encourage"]
df = df[questions]

df = df.dropna()
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

df.head(2)


Unnamed: 0,whyRamLess,whyCompMore,whyProblemSoc,encourage
1,Dgbbb,Fvbbnn,Ffgvb,Cbnn
2,Because i am from Dawoodi bohra community and ...,Because i think i shouĺd improve more...people...,Because people don't care it mych in GCC count...,We should think about those who have no access...


In [13]:
# # subsample to the 1k most recent responses and remove samples that are too long
# top_n = 1000
# # first cut to the last 2k entries, assuming less than half will be filtered out
# df = df.tail(top_n * 2)

# encoding = tiktoken.get_encoding(embedding_encoding)

# questionId = 0  # the first question

# # omit responses that are too long to embed
# df["n_tokens"] = df.iloc[:, questionId].apply(lambda x: len(encoding.encode(x)))

# df = df[df.n_tokens <= max_tokens].tail(top_n)
# len(df)
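
The commented-out cell above describes dropping answers whose token count exceeds `max_tokens`. Below is a minimal sketch of that filter on a plain list, so it can be read independently of pandas. `filter_by_token_count` is a hypothetical helper, the sample answers are made up for illustration, and the whitespace splitter only stands in for the real tokenizer (in the notebook that would be `tiktoken.get_encoding(embedding_encoding).encode`).

```python
def filter_by_token_count(texts, encode, max_tokens=8000):
    """Keep only the texts whose token count is at most max_tokens."""
    return [text for text in texts if len(encode(text)) <= max_tokens]


# Stand-in tokenizer for illustration only: splits on whitespace.
# In the notebook this would be tiktoken.get_encoding(embedding_encoding).encode.
answers = ["Because food is shared", "We cook less " * 5]  # made-up sample answers
short_enough = filter_by_token_count(answers, encode=str.split, max_tokens=10)
print(short_enough)
```

Passing the tokenizer in as a callable keeps the filter independent of any particular encoding, so the same function works if the embedding model changes.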


## 2. Get embeddings and save them for future reuse


In [14]:
# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage

# This may take a few minutes
for col in questions:
    print(f'Getting embeddings for column "{col}"')
    df[col + "-embedding"] = df.loc[:, col].apply(lambda x: get_embedding(x, engine=embedding_model))

df.to_csv("data/english_free_text_with_embeddings_1k.csv")


Getting embeddings for column "whyRamLess"
Getting embeddings for column "whyCompMore"
Getting embeddings for column "whyProblemSoc"
Getting embeddings for column "encourage"
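
When the saved CSV is read back later, each embedding cell is a string such as `"[0.1, -0.2]"` rather than a list of floats, because `to_csv` writes the string form of Python lists. Here is a minimal sketch of the parsing step, assuming that string form. `parse_embedding` is a hypothetical helper, not part of the notebook.

```python
import ast


def parse_embedding(cell):
    """Convert the string form of a list, e.g. "[0.1, -0.2]", back to floats."""
    values = ast.literal_eval(cell)
    return [float(v) for v in values]


vector = parse_embedding("[0.1, -0.2, 3]")
print(vector)
```

With pandas, this could be applied per column after `pd.read_csv`, for example `df[col + "-embedding"].apply(parse_embedding)`. `ast.literal_eval` is used rather than `eval` because it only accepts Python literals.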
