### Ecommerce dataset preprocessing

This notebook preprocesses an ecommerce reviews dataset (`amazon_uk_dataset.csv`).

A new subset is created containing 100 rows and 4 columns: ['product_name', 'review_title', 'review_text', 'review_rating'].

The language of each review text is detected with the `langdetect` library, and the result is stored in a new `language` column of the dataframe.

Non-English reviews are translated into English through batched OpenAI API calls; reviews already in English keep their original text. Two new columns are added for the translation of the review text and the translation of the review title.

The final dataframe has the columns: ['product_name', 'review_title', 'review_text', 'review_rating', 'language', 'translated_text', 'translated_title']. 

This preprocessed dataframe is stored locally as a CSV file and is used from now on in the following notebooks.

###### (Finished on 14.03.2024)

In [1]:
# imports required
import pandas as pd
import os
import numpy as np
import openai
from dotenv import load_dotenv
import time # Used to pause the API call function to avoid exceeding rate limit
from langdetect import detect
import json
import math

In [2]:
# Load dataset 
filename = './data/raw/amazon_uk_dataset.csv'
df = pd.read_csv(filename, delimiter=',', index_col=None, header=0)

In [3]:
# OpenAI API SDK
load_dotenv('../APIopenAI.env')
api_key = os.getenv('API_KEY')

#### Dataframe's size reduction

In [4]:
# Reduce number of rows and columns
col_to_keep = ['product_name','review_title','review_text','review_rating']
# .loc slices by label and includes the end label, so :99 keeps rows 0..99 (100 rows)
df = df.loc[:99, col_to_keep].copy()
df.shape

(100, 4)

In [5]:
# Check dataframe's columns
print(df[:0])

Empty DataFrame
Columns: [product_name, review_title, review_text, review_rating]
Index: []


#### Language review classification

In [6]:
# Create df column with the language of each review text, detected by the langdetect library
df['language'] = df['review_text'].apply(detect)
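
`detect` can raise an exception when a text has nothing to classify (for example an empty string), and a missing review arrives as a non-string `NaN`. A minimal sketch of a defensive wrapper, assuming a fallback label `'unknown'` is acceptable. `safe_detect`, its `detector` parameter and `fake_detector` are hypothetical names; in this notebook `detector` would be `langdetect.detect`.

```python
def safe_detect(text, detector, default="unknown"):
    """Return detector(text), or `default` when the text cannot be classified."""
    # Missing values (NaN, None) and blank strings are not worth a detector call
    if not isinstance(text, str) or not text.strip():
        return default
    try:
        return detector(text)
    except Exception:  # the detector signals failures by raising
        return default


# Stand-in detector for illustration only (assumption, not langdetect itself)
def fake_detector(text):
    if text.isdigit():
        raise ValueError("no features in text")
    return "en"


labels = [safe_detect(t, fake_detector) for t in ["Great product", "", None, "12345"]]
print(labels)  # ['en', 'unknown', 'unknown', 'unknown']
```

With the real library the column could be built as `df['review_text'].apply(lambda t: safe_detect(t, detect))`. Rows labelled `'unknown'` still differ from `'en'`, so they would be sent to translation unless they are filtered out separately.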

In [27]:
def batch_translation(batch_texts, batch_size):
    '''
    Params: 
        batch_texts: string representation of a list of texts to translate. Sent as the user prompt.
        batch_size: number of texts contained in the list (currently not used inside the function).
    Function:
        Make an OpenAI API call with instructions to translate every text in the list to English.
    Returns: the response content as a JSON string.
    '''

    # General system instructions
    system_instructions = (
        "You will be provided with an array of texts. Translate the full text of each one to English. "
        "Reply with all full completions in JSON format. The output must follow these conditions: "
        "the JSON dictionary has the key 'translations', whose value is another dictionary; "
        "this second dictionary has the <original text given by user> as keys "
        "and the <translated text you generated> as values. "
        "Output format example: <'translations': <original text 1: translated text 1, "
        "original text 2: translated text 2, ...>>"
    )
        
    # Call API only for selected texts
    response = openai.OpenAI(api_key=api_key).chat.completions.create(
        model="gpt-3.5-turbo",
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": system_instructions},
            {"role": "user", "content": batch_texts}
        ],
        #max_tokens=128,  # Increase max_tokens to retrieve more than one token
        n=1,
        stop=None
    )
    # Response is in JSON format
    return response.choices[0].message.content
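
The `time` import is meant for pausing between API calls, but the function above never pauses. A minimal sketch of spacing the calls, assuming a limit of 3 requests per minute (so 20 seconds between calls). `call_with_pause` and its parameters are hypothetical helpers, not part of the OpenAI SDK; `sleep` and `clock` are injectable only to make the helper easy to test.

```python
import time


def call_with_pause(func, batches, min_interval=20.0, sleep=time.sleep, clock=time.monotonic):
    """Call func(batch) for each batch, keeping at least min_interval seconds between call starts."""
    results = []
    last_call = None
    for batch in batches:
        if last_call is not None:
            # Wait only for the part of the interval that has not elapsed yet
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(func(batch))
    return results


# Demo with a stand-in for the API call and no waiting
print(call_with_pause(str.upper, ["hola", "bonjour"], min_interval=0))  # ['HOLA', 'BONJOUR']
```

In the notebook, `func` could be a small wrapper around `batch_translation`, with `min_interval` set according to the rate limit of the account in use.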

In [28]:
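def merge_translated(new_col: str, batch_set: list, df_non_en, source_col: str = None):
    '''
    Add new_col to df: translated text for non-English rows, original text otherwise.
    source_col is the untranslated column used as fallback. If omitted, it is derived
    from new_col ('translated_text' -> 'review_text'), following this notebook's naming.
    '''
    if source_col is None:
        source_col = new_col.replace('translated_', 'review_')
    # Translations are aligned with the non-English rows through their index
    translated_df = pd.DataFrame({new_col: batch_set}, index=df_non_en.index)
    merged_df = pd.merge(df, translated_df, how='left', left_index=True, right_index=True)
    # English rows have no translation (NaN): keep their original text from source_col
    merged_df[new_col] = merged_df[new_col].fillna(merged_df[source_col])
    # Add column to original df
    df[new_col] = merged_df[new_col]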

In [29]:
def review_translation(input_col: str):
    '''
    Main function.
    Translates the non-English rows of input_col in batches.
    Returns the list of translated texts and the df with the non-English rows.
    '''
    # Separate English and non-English texts
    df_non_en = df[df['language'] != 'en']
    batch_set = []
    if df_non_en.empty:
        # Nothing to translate; also avoids a zero step in range() below
        return batch_set, df_non_en
    # Split the rows into at most 3 batches, one API call per batch (3 is the RPM limit)
    batch_size = math.ceil(len(df_non_en)/3)
    for i in range(0, len(df_non_en), batch_size):
        # Create batch list containing text chunks, sent to the API as a single string
        batch_texts = str(list(df_non_en[input_col])[i:i+batch_size])

        # Call function with API call, returns a JSON string with the translated texts
        trans_json = batch_translation(batch_texts, batch_size)
        trans_json = json.loads(trans_json)

        # Transform JSON dict to list of texts
        trans_text = list(trans_json['translations'].values())

        # Append translated batch to set of translated batches
        batch_set.extend(trans_text)

    return batch_set, df_non_en
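
The batch size is derived from the number of non-English rows so that the loop makes at most three API calls. A minimal sketch of that split on a plain list; `make_batches` and `n_batches` are hypothetical names used only for illustration.

```python
import math


def make_batches(texts, n_batches=3):
    """Split texts into at most n_batches consecutive chunks; the last chunk may be shorter."""
    if not texts:
        return []  # avoids a zero step in range()
    batch_size = math.ceil(len(texts) / n_batches)
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


reviews = [f"review {k}" for k in range(10)]
batches = make_batches(reviews)
print([len(b) for b in batches])  # [4, 4, 2]
```

Rounding up can give fewer than three batches: four texts yield a batch size of 2 and therefore two calls.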


In [None]:
# Main function call for the review text column (review_translation takes only the input column)
batch_set, df_non_en = review_translation('review_text')
# Add translated texts to a new df column
merge_translated('translated_text', batch_set, df_non_en)

In [None]:
# Call for title column
batch_set, df_non_en = review_translation('review_title')
merge_translated('translated_title', batch_set, df_non_en)
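
Building the new column requires exactly one translation per non-English row, in row order. Taking `.values()` of the returned dictionary, as `review_translation` does, does not guarantee that: duplicate reviews collapse into one key, and the model may alter or drop a key. A minimal sketch of a stricter alignment, assuming the response keeps the `translations` dictionary keyed by original text. `align_translations` is a hypothetical helper that falls back to the original text when no translation is found.

```python
import json


def align_translations(originals, response_json):
    """Return one translation per original text, in the same order as originals."""
    translations = json.loads(response_json).get("translations", {})
    # Look each original up by key; keep the original text if the model omitted it
    return [translations.get(text, text) for text in originals]


originals = ["muy bueno", "sehr schlecht", "muy bueno"]
response = json.dumps(
    {"translations": {"muy bueno": "very good", "sehr schlecht": "very bad"}}
)
print(align_translations(originals, response))  # ['very good', 'very bad', 'very good']
```

The result always has the same length as `originals`, so it can be assigned to the non-English rows without a length mismatch.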

In [39]:
df.shape

(100, 7)

In [50]:
# Save dataset on current path
filename = './data/preprocessed/dataset_pp.csv'
# header=True writes the column names; header=0 is falsy and would drop them from the file
df.to_csv(filename, index=False, header=True)