### Ecommerce dataset preprocessing

This notebook preprocesses an ecommerce dataset.

A new subset will be created containing 100 rows and 4 columns: ['product_name', 'review_title', 'review_text', 'review_rating'].

An API call is made to translate all non-English reviews into English. Two new columns will be added for the translation of the review text and the translation of the review title, giving the final df a shape of (100, 6).

This preprocessed dataframe will be stored locally as a CSV file and will be used in the subsequent working notebooks.

###### (Finished on 11.03.2024)

In [1]:
# imports required
import pandas as pd
import os
import numpy as np
import openai
from dotenv import load_dotenv
import time # Used to pause the API call function to avoid exceeding rate limit
from langdetect import detect, detect_langs

In [3]:
# Load dataset 
filename = './data/raw/amazon_uk_dataset.csv'
df = pd.read_csv(filename, delimiter=',', index_col=None, header=0)

In [4]:
# OpenAI API SDK
load_dotenv('APIopenAI.env')
api_key = os.getenv('API_KEY')
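
`os.getenv` returns `None` when the variable is missing, so a wrong `.env` path would only surface later as an authentication error. A minimal sketch of a fail-fast check follows; the helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check the .env file.")
    return value

# Variable set in-process purely so the sketch runs on its own
os.environ["DEMO_API_KEY"] = "demo-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook this could replace the bare `os.getenv('API_KEY')` call, assuming the key is stored under that name.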

#### Dataframe's size reduction

In [5]:
# Reduce number of rows and columns
df = df[:100]
col_to_keep = ['product_name','review_title','review_text','review_rating']
df = df[col_to_keep]
df.shape

(100, 4)

In [6]:
# Check dataframe's columns
print(df[:0])

Empty DataFrame
Columns: [product_name, review_title, review_text, review_rating]
Index: []


#### Language review classification

In [7]:
# Function using the langdetect library to detect the language of the input text

def language_detection(text, method = "single"):

  """
  @desc: 
    - detects the language of a text
  @params:
    - text: the text whose language needs to be detected
    - method: detection method: 
      single: the detection returns only the most likely language (detect)
      any other value: the detection returns all candidate languages with their probabilities (detect_langs)
  @return:
    - the language/list of languages
  """

  if method.lower() != "single":
    result = detect_langs(text)

  else:
    result = detect(text)

  return result
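
Language detectors tend to fail on empty or missing review text (langdetect raises a `LangDetectException` when the text has no usable features). The sketch below wraps any detector function so that such rows get a fallback label instead of stopping the loop. The name `safe_language`, the `fallback` value and the stand-in `fake_detector` are assumptions made for illustration; the stand-in only exists so the example runs without langdetect installed.

```python
def safe_language(text, detector, fallback="unknown"):
    """Return detector(text), or `fallback` for empty/non-string input or a failed detection."""
    if not isinstance(text, str) or not text.strip():
        return fallback
    try:
        return detector(text)
    except Exception:
        # Broad catch for the sketch; with langdetect this could be narrowed
        # to its LangDetectException
        return fallback

# Stand-in detector (hypothetical): flags a text as Italian if it mentions "scarpe"
def fake_detector(text):
    return "it" if "scarpe" in text.lower() else "en"

print(safe_language("Le scarpe sono bellissime", fake_detector))
print(safe_language("   ", fake_detector))
print(safe_language(None, fake_detector))
```

With the notebook's imports, the call would presumably be `safe_language(review, detect)`. langdetect is also non-deterministic by default; its README suggests setting `DetectorFactory.seed = 0` before detecting if reproducible labels are needed.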

In [17]:
# Creation of list to keep track of each row's language
language = []
reviews = df['review_text']
for i in range(len(reviews)):
    language.append(language_detection(reviews[i]))

In [15]:
# Add language detection and translated review text columns
df['language'] = language
df['translated_text'] = df['review_text']
print(df.shape)
# Check dataframe's columns
print(df[:0])

(100, 6)
Empty DataFrame
Columns: [product_name, review_title, review_text, review_rating, language, translated_text]
Index: []


In [33]:
# Take a batch of the first 10 reviews to control the API call request limit
reviews_batch=reviews[:10]
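
The cell above takes a single batch by hand. To cover all 100 rows without editing the slice each time, the batches could be generated in a loop. This is a minimal sketch on a plain list; the helper name `iter_batches` and the demo data are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Invented placeholder reviews, only to show the batch sizes
demo_reviews = [f"review {i}" for i in range(25)]
for batch in iter_batches(demo_reviews, 10):
    print(len(batch))
```

For a pandas Series, `.iloc[start:start + batch_size]` would be the explicit positional equivalent of the slice used here.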

In [35]:
# Loop over each row of the batch and perform review TEXT translation
# To translate another column (e.g. the title), replace the review column with that column's name

# General system instructions
system_instructions = (
    "You will be provided with a text. You have to translate to English. "
    "The output format of your answers is ONLY:<Translated text: output translated text>"
)

# Wrapper pieces of the response format that should not be in df
default = ["<", "Translated text: ", ">"]

# Create the API client once and reuse it for every request
client = openai.OpenAI(api_key=api_key)

# Loop through only the reviews that are not written in English
for row in range(len(reviews_batch)):
    # Skip rows whose detected language is English
    if df['language'][row] != 'en':
        # Prompt with one review row every iteration
        prompt = f"Translate to English. Text:<<<{reviews_batch[row]}>>>"
        # Call API
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_instructions},
                {"role": "user", "content": prompt}
            ],
            #temperature=0,
            max_tokens=128,  # Upper bound on the length of the returned translation
            n=1,
            stop=None
        )
        translation = response.choices[0].message.content
        #print(f"review response is", translation,"\n")
        for i in default:
            translation = translation.replace(i,"")

        # Store the translation in the translated_text column
        df['translated_text'][row] = translation
        print("review translation: ",translation,"\n")

        # Pause for 0.5 seconds after each request to avoid hitting API rate limits
        time.sleep(0.5)


row is 0
row is 1
row is 2
row is 3
row is 4


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['translated_text'][row] = translation


review translation:  I chose the white model with black back finish and I can say that up close the shoes are even more beautiful, my size is 38, 38.5 and I ordered size 38 and it fits me well. Fast shipping, the package arrived even earlier than expected, great price considering that they cost at least 10, 15 euros more around. 

row is 5


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['translated_text'][row] = translation


review translation:  I usually buy Guess shoes and I have never had any problems with the size, in this case, however, I received shoes that, despite being the usual size I take, are very small, it almost seems two sizes smaller, so unwearable 

row is 6


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['translated_text'][row] = translation


review translation:  The shoes are very beautiful, they fit perfectly 

row is 7


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['translated_text'][row] = translation


review translation:  Simply perfect. I use custom insoles for flat feet, and they fit perfectly in these shoes. The shoe is comfortable and provides good stability to the foot. I recommend them. 

row is 8


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['translated_text'][row] = translation


review translation:  The shoes are beautiful, arrived in perfect condition with impeccable shipping, unfortunately the size 41 was tight for me and since they don't make size 42, I had to return them, the seller was available and kind 

row is 9


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['translated_text'][row] = translation


review translation:  The shoes are well made and the seller is very precise because I had to make a return and he refunded me perfectly. Well done. It stays in favorites. 
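
The warning repeated in the output above comes from chained indexing: `df['translated_text'][row] = ...` first selects a column and then assigns into that intermediate result, so pandas cannot guarantee that the write reaches `df`. A minimal sketch of the single-step alternative with `.loc` follows, on a small made-up frame (the rows below are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy frame standing in for the notebook's df; values are invented for illustration
demo = pd.DataFrame({
    "review_text": ["Good quality", "Le scarpe sono molto belle"],
    "language": ["en", "it"],
})
demo["translated_text"] = demo["review_text"]

# One .loc call: row label and column name are resolved in a single indexing operation
demo.loc[1, "translated_text"] = "The shoes are very beautiful"

print(demo.loc[1, "translated_text"])
```

Calling `.copy()` when the subset is first created (for example `df[col_to_keep].copy()`) is the usual way to make the reduced frame independent of the original.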



In [36]:
# Command to check that the batch's reviews are indeed translated to English
for i in range(len(df['translated_text'][:10])):
    print(df['translated_text'][i])

Love these. Was looking for converses and these were half the price and so unique— I’ve never seen clear shoes like these; they fit great. The plastic takes a little getting used to but the style is so worth it.
The shoes are very cute, but after the 2nd day of wearing them the tongue started ripping. After the 3rd day of wearing them the plastic on the side ripped. They could have ripped bc I was wearing them to work and I do a lot of walking at work. If you’re going to buy these I don’t recommend wearing them on days where you will do a lot of walking or they might rip
Good quality
Great
I chose the white model with black back finish and I can say that up close the shoes are even more beautiful, my size is 38, 38.5 and I ordered size 38 and it fits me well. Fast shipping, the package arrived even earlier than expected, great price considering that they cost at least 10, 15 euros more around.
I usually buy Guess shoes and I have never had any problems with the size, in this case, howe
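
The clean-up step in the translation loop removes every `<`, `>` and `Translated text: ` substring, so a review that legitimately contains one of those characters would lose it. A possible alternative, sketched here under the assumption that the model follows the `<Translated text: ...>` format from the system instructions, strips only the outer wrapper. The names `WRAPPER` and `strip_wrapper` are hypothetical.

```python
import re

# Optional "<Translated text: ...>" wrapper around the whole reply
WRAPPER = re.compile(r"^\s*<?\s*Translated text:\s*(.*?)\s*>?\s*$", re.DOTALL)

def strip_wrapper(reply):
    """Return the reply without its outer wrapper; replies without the wrapper are only trimmed."""
    match = WRAPPER.match(reply)
    return match.group(1) if match else reply.strip()

# Invented reply, used only to show that the inner "<" is preserved
print(strip_wrapper("<Translated text: The shoes fit, size < 39 is too small>"))
```

Replies that ignore the requested format fall through unchanged apart from surrounding whitespace, which makes unexpected responses easier to spot afterwards.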

In [67]:
# Save the preprocessed dataset, keeping the header row so the column names are preserved
filename = './data/preprocessed/amazon_uk_subset_2.csv'
df.to_csv(filename, index=False, header=True)
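
Whether the saved file carries its column names depends on the `header` argument of `to_csv`: with `header=False` the header row is omitted (a falsy value such as `0` appears to be treated the same way), and the next notebook would then have to supply the names itself when reading. A small round-trip sketch on invented data, written to an in-memory string rather than to disk:

```python
import io
import pandas as pd

# Toy frame; values are invented for illustration
demo = pd.DataFrame({"review_title": ["Great"], "review_rating": [5]})

# With no path argument, to_csv returns the CSV content as a string
no_header = demo.to_csv(index=False, header=False)
with_header = demo.to_csv(index=False, header=True)

print(repr(no_header))
print(repr(with_header))

# Round-trip: the version with a header restores the column names
restored = pd.read_csv(io.StringIO(with_header))
print(list(restored.columns))
```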