### Ecommerce dataset preprocessing

Notebook aimed at preprocessing an ecommerce dataset. 

A new subset will be created containing a 100 rows and 4 columns, ['product_name', 'review_title', 'review_text', 'review_rating']. 

An API call is made to translate all reviews into English. Two new columns will be added for the translation of the review text and the translation of the review title, making the shape of the final df (100,6). 

This preprocessed dataframe will be stored a csv locally and will be used from now on in the next working notebooks.

###### (Finished on 11.03.2024)

In [1]:
# imports required
import pandas as pd
import os
import numpy as np
import openai
from dotenv import load_dotenv
import time # Used to pause the API call function to avoid exceeding rate limit
from langdetect import detect, detect_langs

In [2]:
# Load dataset 
filename = './data/raw/amazon_uk_dataset.csv'
df = pd.read_csv(filename, delimiter=',', index_col=None, header=0)

In [3]:
# OpenAI API SDK
load_dotenv('APIopenAI.env')
api_key = os.getenv('API_KEY')

#### Dataframe's size reduction

In [5]:
# Reduce number of rows and columns
col_to_keep = ['product_name','review_title','review_text','review_rating']
df = df[:99, col_to_keep]
df.shape

(100, 4)

In [6]:
# Check dataframe's columns
print(df[:0])

Empty DataFrame
Columns: [product_name, review_title, review_text, review_rating]
Index: []


#### Language review classification

In [None]:
# Create df column using a langdetect library to detect language of input text
df['language'] = df['review_text'].apply(lambda text : detect(text))

In [21]:
# Add language detection and translated review text columns
df['translated_text'] = df['review_text']
df['translated_title'] = df['review_title']
print(df.shape)
# Check dataframe's columns
print(df[:0])

(100, 7)
Empty DataFrame
Columns: [product_name, review_title, review_text, review_rating, language, translated_text, translated_title]
Index: []


In [None]:
def batch_translation(batch_texts, batch_size, batch_pos):
    '''
    Params: 
        batch_texts is an array of texts (str) that need to be translated. Works as prompt.
        batch_size the number of texts contained in the array.
        batch_pos the location of the batch within the dataframe.
    Function:
        Make an openai API call with instructions to translate to English all text within the array.
        
    '''
    # General system instructions
    system_instructions = f"You will be provided with an array of texts. You have to translate to English.\
        Reply with an array of all completions."
    
    for row in range(batch_pos, batch_pos+batch_size):  
        # Restart in every new batch  
        # Index list for modifying correct rows of df
        ids = []
        # Text batch that actually need translation
        to_translate = []
        # Filter language's list row index for non-English reviews  
        if df['language'][row] != 'en':
            ids.append(row)
            if batch_pos > 0:
                to_translate.append(batch_texts[batch_pos])
            else:
                to_translate.append(batch_texts[row])
            
            # Call API
            response = openai.OpenAI(api_key=api_key).chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": system_instructions},
                    {"role": "user", "content": to_translate}
                ],
                max_tokens=128,  # Increase max_tokens to retrieve more than one token
                n=1,
                stop=None
            )
            # translation should be an array of translated texts
            translated_texts = response.choices[0].message.content
    return translated_texts, ids

In [37]:
def batch_translation(title_batch,batch_size,df_pos):
    # Loop over each row and perform review translation
    # Replace review column with column name that requires translation
    #reviews = df['review_text']
    #reviews = df['review_title']

    # General system instructions
    system_instructions = f"You will be provided with an array of texts. You have to translate to English.\
        Reply with an array of all completions.\
            "

    # Default values in response that should not be in df
    default = ["<", "Translated text: ", ">"]

    # Iterate through batches of 20 chunks of reviews to control API call request limit
# Loop through only the titles that are not written in English
#for row in range(i,i+batch_size):    
        #  Filter language's list row index for non-English reviews  
    if df['language'][row] != 'en':

        # Prompt with one review row every iteration
        #prompt = f"Translate to English. Text:<<<{reviews_batch[row]}>>>"
        # Call API
        response = openai.OpenAI(api_key=api_key).chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_instructions},
                {"role": "user", "content": title_batch}
            ],
            #temperature=0,
            max_tokens=128,  # Increase max_tokens to retrieve more than one token
            n=1,
            stop=None
        )
        translation = response.choices[0].message.content
        #print(f"review response is", translation,"\n")
        '''
        for i in default:
            translation = translation.replace(i,"")

        # Modify the translated_title list by replacing the titles in English
        df.loc[row,'translated_text'] = translation
        print(f"review translation: ",translation,"\n")
'''
    # Pause for 0.5 seconds to avoid hitting API rate limits
    time.sleep(0.5)


review translation:  I ran about 18 kilometers three times, but the cushioning was good and my feet didn't feel too fatigued. Compared to other lightweight shoes, I feel that there is less damage to my feet. It has a subtle weight (around 220 grams for size 26cm), so if you want a lighter shoe, the Hoka One One RINCON 3 might be better. The area near the heel of the sole had some roughness after the third run. It might be better not to expect too much durability. The shoes make it easy to run as they gently instruct your big toe to 

review translation:  Since the price became in the 5000 yen range, I purchased it. It feels much lighter than Glide Ride, but I feel like there is slightly less sense of moving forward. It seems like it can be used for speed training, so I have started using it for training for a sub-4 time. 

review translation:  Very comfortable for jogging or walking, I recommend them 



In [62]:
batch_size = 20
for batch_pos in range(0, df.shape[0], batch_size):
    # Create batch list containing text chunks
    batch_texts = list(df['review_title'][batch_pos:batch_pos+batch_size])

    # Call function with API call, returns an array of translated text
    trans_text, pos = batch_translation(batch_texts,batch_size, batch_pos)
    
    # Modify the translated_title list by replacing the titles in English
    for i in range(len(pos)):
        df.loc[pos[i],'translated_text'] = trans_text[i]
        print(f"review translation: ",trans_text[i],"\n")

at i 0 len is 20
at i 20 len is 20
at i 40 len is 20
at i 60 len is 20
at i 80 len is 20


In [41]:
# Command to check that the batch's reviews are indeed translated to English
for i in range(len(df['translated_text'][90:])):
    print(df['translated_text'][i])

Love these. Was looking for converses and these were half the price and so unique— I’ve never seen clear shoes like these; they fit great. The plastic takes a little getting used to but the style is so worth it.
The shoes are very cute, but after the 2nd day of wearing them the tongue started ripping. After the 3rd day of wearing them the plastic on the side ripped. They could have ripped bc I was wearing them to work and I do a lot of walking at work. If you’re going to buy these I don’t recommend wearing them on days where you will do a lot of walking or they might rip
Good quality
Great
I chose the white model with black trim at the back and I can say that up close the shoes are even more beautiful, my size is 38, 38.5 and I ordered size 38 and it fits me well. Fast shipping, the package arrived even earlier than expected, excellent price considering that elsewhere they cost at least 10, 15 euros more.
I usually buy Guess shoes and I have never had any sizing issues, but in this cas

In [67]:
# Save dataset on current path
filename = './data/preprocessed/dataset_pp.csv'
df.to_csv(filename, index=False, header=0)