## Step 2: Create Embeddings 

In this notebook, we create embeddings for our development and evaluation datasets.  
- The development dataset, *apparel_15to25.tsv.gz*, contains products with 15 to 25 reviews.  
- The evaluation dataset, *apparel_10to14.tsv.gz*, contains products with 10 to 14 reviews.  

There are three outputs from this notebook:
- *apparel_15to25_embedding.pkl* - contains the embeddings for **product_title + review_body** for the development dataset
- *apparel_10to14_embedding.pkl* - contains the embeddings for **product_title + review_body** for the evaluation dataset
- *apparel_15to25_product_title_only.pkl* - contains the embeddings for **product_title only** for the development dataset. This is used in the unsupervised analysis to understand what types of products are in the data.

In [None]:
import os
import openai
import pandas as pd
import time
import math
import configparser
from tqdm.auto import tqdm
import numpy as np
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

In [None]:
# Load your API key from a local config file (nes.ini) instead of hard-coding it;
# an environment variable or a secret management service works as well
config = configparser.ConfigParser()
config.read('nes.ini')
openai.api_key = config['OpenAI']['api_key']
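
The cell above expects a file named *nes.ini* with an `[OpenAI]` section that contains an `api_key` entry. The file itself is not shown in this notebook, so the layout below is inferred from the lookup `config['OpenAI']['api_key']`, and the key value is a placeholder. A minimal sketch that parses the same layout from a string (`read_api_key` is a hypothetical helper):

```python
import configparser

# Assumed contents of nes.ini, inferred from config['OpenAI']['api_key'];
# the value is a placeholder, not a real key
EXAMPLE_INI = """
[OpenAI]
api_key = sk-your-key-here
"""

def read_api_key(ini_text):
    """Return the api_key entry of the [OpenAI] section, or None if it is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_option('OpenAI', 'api_key'):
        return None
    return parser.get('OpenAI', 'api_key')

print(read_api_key(EXAMPLE_INI))
```

Note that `config.read` silently skips files that do not exist, so a missing *nes.ini* only surfaces later as a `KeyError` on `config['OpenAI']`.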

## Create the required datasets

In [None]:
# we only need these columns 
cols = ['product_id','product_title', 'product_category', 'star_rating', 'review_id', 'review_headline', 'review_body', 'review_length', 'review_count']

df_development = pd.read_csv('../resources/data/apparel_15to25.tsv.gz', sep='\t', compression='gzip')
df_development = df_development[cols]

df_evaluation = pd.read_csv('../resources/data/apparel_10to14.tsv.gz', sep='\t', compression='gzip')
df_evaluation = df_evaluation[cols]

# create the dataset for unsupervised analysis by copying the development dataset
df_unsupervised_analysis = df_development.copy()

In [None]:
# Use product_title and review_body to create the text for search 
# review_headline tends to be too short or does not provide much context
df_development['text'] = df_development['product_title'] + '. ' + df_development['review_body']
df_evaluation['text'] = df_evaluation['product_title'] + '. ' + df_evaluation['review_body']
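
One thing to watch when concatenating string columns this way: if either value is missing, the combined text is missing too, and a missing value has no `replace` method when it later reaches the embedding helper. Whether these datasets contain missing titles or review bodies is not shown here, so the rows below are made up for illustration:

```python
import pandas as pd

# Made-up rows for illustration; the second review body is missing
df_example = pd.DataFrame({
    'product_title': ['Cotton T-Shirt', 'Wool Socks'],
    'review_body': ['Fits well.', None],
})

# Same concatenation pattern as above: a missing value on either side
# makes the combined text missing
naive = df_example['product_title'] + '. ' + df_example['review_body']

# Replacing missing values with an empty string first keeps every row a string
safe = df_example['product_title'].fillna('') + '. ' + df_example['review_body'].fillna('')

print(naive.isna().tolist())
print(safe.tolist())
```

Checking `df_development['text'].isna().sum()` before creating embeddings is a cheap way to find out whether this matters for the data at hand.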

## Create embeddings with the OpenAI service

### Create helper functions

In [None]:
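# Helper that returns the embedding vector for a single text.
# Note: openai.Embedding.create is the interface of the openai package before version 1.0
def get_embedding(text, model="text-embedding-ada-002"):
    # newlines are replaced with spaces before the text is sent to the API
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']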
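
The helper in this cell calls `openai.Embedding.create`, which belongs to the `openai` package before version 1.0. With version 1.0 or later, embeddings are requested through a client object instead. A hedged sketch of the equivalent helper follows. `get_embedding_v1` and the offline `fake_client` are names introduced here for illustration, and the stand-in only mimics the response attributes that the helper reads:

```python
from types import SimpleNamespace

def get_embedding_v1(client, text, model="text-embedding-ada-002"):
    """Same helper, written against the client object used by openai>=1.0."""
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

# With the real SDK (assumption: openai>=1.0 is installed and a key is configured):
#   from openai import OpenAI
#   client = OpenAI(api_key=config['OpenAI']['api_key'])

# Offline stand-in that returns the text length as a one-element "embedding",
# so the helper can be exercised without network access
def _fake_create(input, model):
    return SimpleNamespace(data=[SimpleNamespace(embedding=[float(len(t)) for t in input])])

fake_client = SimpleNamespace(embeddings=SimpleNamespace(create=_fake_create))
print(get_embedding_v1(fake_client, "red\ndress"))  # [9.0]
```

Passing the client in as an argument keeps the helper easy to test and avoids relying on module-level state such as `openai.api_key`.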

In [None]:
# Use tenacity retry to tackle the OpenAI "rate limits" problem
# reference: https://platform.openai.com/docs/guides/rate-limits/request-increase
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def get_embedding_with_backoff(**kwargs):
    return get_embedding(**kwargs)
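
To make the retry behaviour concrete, here is a small standard-library sketch of the same idea: retry a failing call a limited number of times and wait a random, exponentially growing interval between attempts. It is a simplified illustration, not tenacity's implementation. `retry_with_backoff`, `flaky_call`, and the injectable `sleep` parameter are hypothetical names introduced here.

```python
import random
import time

def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    After failed attempt n (1-based), wait a random time between min_wait
    and min(max_wait, 2 ** n) seconds before trying again.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            upper = max(min_wait, min(max_wait, 2 ** attempt))
            sleep(random.uniform(min_wait, upper))

# Hypothetical call that fails twice before succeeding
calls = {'count': 0}

def flaky_call():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('rate limit hit')
    return 'ok'

waits = []
# record the waits in a list instead of actually sleeping
result = retry_with_backoff(flaky_call, sleep=waits.append)
print(result, calls['count'], len(waits))  # ok 3 2
```

The randomness spreads out retries from parallel callers, and the cap (`max_wait`) keeps a long outage from producing ever longer waits.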

In [None]:
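def transform_column_to_embedding(df, source_column, target_column_name, rate_limit_per_minute=3000, delay=60.0):
    # In addition to the retry mechanism above, process the rows in batches of at most
    # rate_limit_per_minute rows and pause between batches to stay under the OpenAI rate limit
    num_of_batch = max(1, math.ceil(len(df) / rate_limit_per_minute))  # at least one batch, even for an empty df
    chunks = []
    tqdm.pandas(desc='Processing rows')
    for i, chunk in enumerate(np.array_split(df, num_of_batch)):
        chunk = chunk.copy()  # work on a copy so the assignment does not target a slice of df
        chunk[target_column_name] = chunk[source_column].progress_apply(lambda x: get_embedding_with_backoff(text=x))
        chunks.append(chunk)
        if i < num_of_batch - 1:
            time.sleep(delay)  # no need to wait after the last batch

    return pd.concat(chunks, ignore_index=True)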
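
The batching arithmetic can be checked without calling the API. `np.array_split` divides the rows into `num_of_batch` nearly equal parts, so no part is larger than `rate_limit_per_minute`. The sketch below reproduces that split rule in plain Python. `batch_sizes` is a hypothetical helper and the row counts are made up.

```python
import math

def batch_sizes(num_rows, rate_limit_per_minute=3000):
    """Return the sizes of the chunks obtained by splitting num_rows rows
    into ceil(num_rows / rate_limit_per_minute) nearly equal parts,
    which is the split rule np.array_split follows."""
    num_of_batch = math.ceil(num_rows / rate_limit_per_minute)
    if num_of_batch == 0:
        return []
    base, extra = divmod(num_rows, num_of_batch)
    # the first `extra` chunks get one additional row
    return [base + 1] * extra + [base] * (num_of_batch - extra)

print(batch_sizes(7000))  # [2334, 2333, 2333]
print(batch_sizes(3000))  # [3000]
print(batch_sizes(3001))  # [1501, 1500]
```

With the default settings, 7,000 rows therefore mean three batches with a 60-second pause between them.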

### Create embeddings

In [None]:
# create embedding for development dataset
df_development = transform_column_to_embedding(df=df_development, source_column='text', target_column_name='embedding')

In [None]:
# create embedding for evaluation dataset
df_evaluation = transform_column_to_embedding(df=df_evaluation, source_column='text', target_column_name='embedding')

In [None]:
# create embedding for product_title column of the development dataset 
df_unsupervised_analysis = transform_column_to_embedding(df=df_unsupervised_analysis, source_column='product_title', target_column_name='product_title_embedding')

## Save result in pickle format 

In [None]:
# Serialize the DataFrames to pickle files (written with pandas 1.3.5)

# Uncomment the lines below to save the datasets
#df_development.to_pickle('../resources/data/apparel_15to25_embedding.pkl')
#df_evaluation.to_pickle('../resources/data/apparel_10to14_embedding.pkl')
#df_unsupervised_analysis.to_pickle('../resources/data/apparel_15to25_product_title_only.pkl')