### Ecommerce dataset preprocessing

This notebook preprocesses an ecommerce dataset.

A new subset is created containing 100 rows and 4 columns: ['product_name', 'review_title', 'review_text', 'review_rating'].

Reviews are translated into English through OpenAI API calls. Two new columns are added, one for the translated review text and one for the translated review title, so the final df has shape (100, 6).

This preprocessed dataframe is stored locally as a csv and is used in the following working notebooks.

###### (Finished on 11.03.2024)

In [29]:
# imports required
import pandas as pd
import os
import numpy as np
import openai
from dotenv import load_dotenv
import time # Used to pause the API call function to avoid exceeding rate limit


In [2]:
# Load dataset 
filename = './amazon_uk_dataset.csv'
df = pd.read_csv(filename, delimiter=',', index_col=None, header=0)

In [3]:
# Reduce number of rows and columns
df = df[:100]
col_to_keep = ['product_name','review_title','review_text','review_rating']
df = df[col_to_keep].copy()  # copy so that columns added later do not act on a slice of the original
df.shape

(100, 4)

In [4]:
# Check dataframe's columns
print(df[:0])

Empty DataFrame
Columns: [product_name, review_title, review_text, review_rating]
Index: []


In [5]:
# Load the OpenAI API key from a local .env file
load_dotenv('APIopenAI.env')
api_key = os.getenv('API_KEY')
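
If the `.env` file or the variable is missing, `os.getenv` returns `None` and the error only surfaces later, at the first API call. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Raised both when the variable is unset and when it is empty
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

In this notebook it could be called as `api_key = require_env('API_KEY')` right after `load_dotenv`.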

In [79]:
# test
review = df['review_text'][9]  # the index was left blank in the original cell; 9 is an assumed example row
system_instructions = f"You will be provided with a text. You have to do two tasks:\
        Task 1: Classify in which language is the text written.\
        Task 2: If the text is not already written in English, translate the text into English.\
        If the text is already in English, DO NOT modify the text at all.\
        The output format of your answers is <Original language: output text language # Translated text: output translated text>\
        "
prompt = f" Classify the language of the text and, if not written in English, \
        translate to English. \
        Text:<<<{review}>>>"
# Call the chat completions API with the prompt
response = openai.OpenAI(api_key=api_key).chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
                {"role": "system", "content": system_instructions},
                {"role": "user", "content": prompt}
        ],
        #temperature=0,
        max_tokens=128,  # Increase max_tokens to retrieve more than one token
        n=1,
        stop=None
        )

In [80]:
# test
output = response.choices[0].message.content
print(output)
default = ["<Original language: ", "Translated text: ", ">"]
for i in default:
    output = output.replace(i,"")
    print(output)


<Original language: Italian # Translated text: The shoes are well made and the seller very precise because I had to make a return and he refunded me perfectly. Well done. Stays in favorites.>
Italian # Translated text: The shoes are well made and the seller very precise because I had to make a return and he refunded me perfectly. Well done. Stays in favorites.>
Italian # The shoes are well made and the seller very precise because I had to make a return and he refunded me perfectly. Well done. Stays in favorites.>
Italian # The shoes are well made and the seller very precise because I had to make a return and he refunded me perfectly. Well done. Stays in favorites.
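
The chained `replace` calls above strip the wrapper strings but leave the language and the translation joined by `#`. A minimal sketch of a regex-based alternative for the `<Original language: ... # Translated text: ...>` format follows; `parse_reply` and `REPLY_PATTERN` are hypothetical names, and the fallback assumes a reply that ignores the format should be kept as-is.

```python
import re

# Pattern for the reply format requested in the system instructions:
# <Original language: LANG # Translated text: TEXT>
REPLY_PATTERN = re.compile(
    r"^<?\s*Original language:\s*(?P<language>.*?)\s*#\s*"
    r"Translated text:\s*(?P<text>.*?)\s*>?$",
    re.DOTALL,
)

def parse_reply(reply):
    """Return (language, translated_text), or (None, reply) if the format differs."""
    match = REPLY_PATTERN.match(reply.strip())
    if match is None:
        # The model did not follow the format; keep the raw reply
        return None, reply.strip()
    return match.group("language"), match.group("text")
```

Usage would be `language, text = parse_reply(response.choices[0].message.content)`.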


In [7]:
# Loop over each row and perform review translation
# To translate a different column, replace 'review_text' below with its name

# General system instructions
system_instructions = f"You will be provided with a text. You have to do two tasks:\
        Task 1: Identify in which language is the text written.\
        Task 2: If the text is not already written in English, translate the text into English.\
        If the text is already in English, DO NOT modify the text at all.\
        Do NOT output the language of the text, task 1 it is just a previous step to task 2.\
        The output format of your answers is ONLY:<Translated text: output translated text>\
        "
# Collect the translations in a list, to be added to df as a column afterwards
translated = []
# Review texts to translate
reviews = df['review_text']
# Wrapper strings from the response format that should not end up in df
default = ["<", "Translated text: ", ">"]
for review in reviews:
    # Prompt with one review row every iteration
    prompt = f" Classify the language of the text and, if not written in English, \
        translate to English. \
        Text:<<<{review}>>>"
    # Call API
    response = openai.OpenAI(api_key=api_key).chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_instructions},
            {"role": "user", "content": prompt}
        ],
        #temperature=0,
        max_tokens=128,  # Increase max_tokens to retrieve more than one token
        n=1,
        stop=None
    )
    translation = response.choices[0].message.content
    #print(f"review response is", translation,"\n")
    for i in default:
        translation = translation.replace(i,"")
    
    # Split response to store translated review into df
    #translation = translation.split(sep='#')
    print(f"review translation: ",translation,"\n")
    translated.append(translation)
    # Pause for 0.5 seconds to avoid hitting API rate limits
    #time.sleep(0.5)

review translation:  Love these. Was looking for converse shoes and these were half the price and so unique— I’ve never seen clear shoes like these; they fit great. The plastic takes a little getting used to but the style is so worth it. 

review translation:  The shoes are very cute, but after the 2nd day of wearing them the tongue started ripping. After the 3rd day of wearing them the plastic on the side ripped. They could have ripped because I was wearing them to work and I do a lot of walking at work. If you’re going to buy these I don’t recommend wearing them on days where you will do a lot of walking or they might rip 

review translation:  Good quality 

review translation:  Great 

review translation:  I chose the white model with black detailing at the back and I can say that the shoes look even more beautiful up close, my size is 38, 38.5 and I ordered size 38 and it fits me well.. Fast shipping, the package arrived even earlier than expected, excellent price considering that
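
The fifth translation above stops mid-sentence, which is consistent with the `max_tokens=128` cap. The loop also builds a new client on every iteration and has no handling for failed calls. A minimal sketch of a helper addressing these points follows; `translate`, `retries` and `pause` are hypothetical names, and the client is assumed to be created once, outside the loop.

```python
import time

def translate(client, system_instructions, text, model="gpt-3.5-turbo",
              max_tokens=128, retries=3, pause=0.5):
    """Return (reply, truncated) for one text, retrying failed calls."""
    prompt = (
        "Classify the language of the text and, if not written in English, "
        f"translate to English. Text:<<<{text}>>>"
    )
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_instructions},
                    {"role": "user", "content": prompt},
                ],
                max_tokens=max_tokens,
            )
        except Exception:
            # Give up after the last attempt, otherwise wait and retry
            if attempt == retries - 1:
                raise
            time.sleep(pause * 2 ** attempt)
            continue
        choice = response.choices[0]
        # finish_reason == "length" means the reply was cut off by max_tokens
        return choice.message.content, choice.finish_reason == "length"
```

With `client = openai.OpenAI(api_key=api_key)` defined once, each iteration could call `reply, truncated = translate(client, system_instructions, review)` and retry with a larger `max_tokens` when `truncated` is true.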

In [28]:
# Add translated review text column
# Note: first created empty list named translated, append translations in the loop, add full list to the df as column
df['translated_text'] = translated
print(df.shape)
# Check dataframe's columns
print(df[:0])

(100, 5)
Empty DataFrame
Columns: [product_name, review_title, review_text, review_rating, translated_text]
Index: []


In [31]:
# Preparation for translating titles
# Select the indices where the review_titles must be translated
to_translate = [6,7,8,9,10,11,12,13,14,15,20,22,36,37, 50,51, 52, 53, 54, 55, 56, 57, 58, 59,\
                66,68,69,70,71,72,73,74,75,76,77,79,80,81,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97]
# Copy original review_title to the translated review_title
translated_title = list(df['review_title'])
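
The indices above appear to have been picked by hand. A possible shortcut, assuming the model returns English text unchanged as the system instructions request, is to flag the rows whose translation differs from the original. `changed_rows` is a hypothetical helper, and its result would still need a manual check, since a title and its review can be written in different languages.

```python
def changed_rows(originals, translations):
    """Return the positions where the translation differs from the original."""
    def normalize(text):
        # Collapse whitespace and case so cosmetic differences are ignored
        return " ".join(str(text).split()).casefold()

    return [
        position
        for position, (original, translation) in enumerate(zip(originals, translations))
        if normalize(original) != normalize(translation)
    ]
```

For example, `changed_rows(df['review_text'], df['translated_text'])` would list the rows whose review text was altered by the translation step.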

In [60]:
# Loop over each row and perform review TITLE translation
# Only the titles at the selected indices are sent to the API

# General system instructions
system_instructions = f"You will be provided with a text. You have to do two tasks:\
        Task 1: Identify in which language is the text written.\
        Task 2: If the text is not already written in English, translate the text into English.\
        If the text is already in English, DO NOT modify the text at all.\
        Do NOT output the language of the text, task 1 it is just a previous step to task 2.\
        The output format of your answers is ONLY:<Translated text: output translated text>\
        "

# Default values in response that should not be in df
default = ["<", "Translated text: ", ">"]

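# Loop through only the titles that are not written in English
# (the saved output below appears to come from a rerun on a shorter list, to_translate_revised2, which is not defined in this notebook)
for title in to_translate:
    # text is the review_title at the given index from to_translate
    text = df['review_title'][title]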
    # Prompt with one review row every iteration
    prompt = f" Classify the language of the text and, if not written in English, \
        translate to English. \
        Text:<<<{text}>>>"
    # Call API
    response = openai.OpenAI(api_key=api_key).chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "system", "content": system_instructions},
            {"role": "user", "content": prompt}
        ],
        #temperature=0,
        max_tokens=128,  # Increase max_tokens to retrieve more than one token
        n=1,
        stop=None
    )
    translation = response.choices[0].message.content
    #print(f"review response is", translation,"\n")
    for i in default:
        translation = translation.replace(i,"")

    # Overwrite the original title at this index with its English translation
    translated_title[title] = translation
    print(f"review translation: ",translation,"\n")

    # Pause for 0.5 seconds after each call to avoid hitting API rate limits
    time.sleep(0.5)


review translation:  Very sensitive 

review translation:  Tight on the top 

review translation:  Very good winter boots! 



In [68]:
# Print every title to inspect the translations
for title in translated_title:
    print(title)

Love em
The plastic ripped
Good quality
Good
PERFETTE!!
delusione
Very beautiful
Very cute and comfortable
Beautiful....too bad about the size
Well-made shoes, excellent seller
Very comfortable shoe
Perfect
Big
Wonderful
Perfect right out of the box
Disappointed
Wow!
Comfortable and attractive
Cute, but various disappointments
My right foot fits size 11 fine but not the same for the left foot
Lightweight and very comfortable to wear!
Great quality and comfort shoes
NO SUPPORT! NOT FOR RUNNING!
Not as supportive as I had hoped for.
Great pair of shoes!
Can’t wear due to Poorly designed insoles
Tongue rubs ankles raw
What I was looking for
Not for slippery surfaces/no tread
but it's frequent enough with certain socks that it becomes annoying to have to stop and pull my socks up ...
Super cuter sneakers.
Cute, but small
Nice lightweight comfort shoe
Total Comfort
Pain in the child...
From the advertisement, it is not clear what size it is.
May work for a "B" width foot
Fell apart after on

In [63]:
# Inspecting the outputs showed that some titles were translated incorrectly:
# gpt-3.5-turbo did not detect certain Italian, Japanese or German words correctly
# A new list of review titles to translate is created and passed to the model again

to_translate_revised = [12, 13, 18, 34, 36, 48, 49, 64, 67, 78,82]

# After a second pass some titles were still poorly translated, so they are set by hand
translated_title[12] = 'Big'
translated_title[13] = 'Wonderful'
translated_title[34] = 'Pain in the child...'
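
The manual corrections above can also be kept in one dictionary, which makes them easier to review and to reapply after a rerun. This is a minimal sketch; `apply_overrides` is a hypothetical helper.

```python
def apply_overrides(titles, overrides):
    """Return a copy of titles with the {position: text} corrections applied."""
    corrected = list(titles)
    for position, text in overrides.items():
        corrected[position] = text
    return corrected
```

With the values from the cell above, this would read `translated_title = apply_overrides(translated_title, {12: 'Big', 13: 'Wonderful', 34: 'Pain in the child...'})`.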

In [71]:
# Spot-check one translated title (the df column is only added in the next cell)
translated_title[6]

'Very beautiful'

In [64]:
# Now that all review titles are correctly translated, add column to df
df['translated_title'] = translated_title
print(df.shape)
# Check dataframe's columns
print(df[:0])

(100, 6)
Empty DataFrame
Columns: [product_name, review_title, review_text, review_rating, translated_text, translated_title]
Index: []


In [72]:
# Reorder df columns for easier reading (the result must be assigned back to df)
df = df.iloc[:, [0, 1, 2, 5, 4, 3]]
# Check dataframe's columns
print(df[:0])

Empty DataFrame
Columns: [product_name, review_title, review_text, translated_title, translated_text, review_rating]
Index: []


In [67]:
# Save dataset on current path
filename = './amazon_uk_subset.csv'
df.to_csv(filename, index=False, header=True)  # header=0 would be treated as False and drop the column names
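
Since later notebooks load this file, it may be worth checking that it round-trips with its column names. The sketch below uses a toy frame (the row is illustrative, not taken from the dataset) and an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "review_title": ["Molto bello"],
    "translated_title": ["Very beautiful"],
})

# header=True (the default) writes the column names as the first row
buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=True)
buffer.seek(0)

# Reading the buffer back recovers the same column names
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

With `header=False` the first data row would be read back as the column names instead.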