## **The International Society of Data Scientists - NLP Track**
#### **Fashion and Beauty Review Rating - The 3rd Annual International Data Science & AI Competition 2022**


The goal is to predict each customer's rating from the text of their product review, giving insight into how fashion and beauty products are received.

#### Downloading Datasets

In [None]:
%pip install opendatasets

In [1]:
import opendatasets as od
od.download("https://www.kaggle.com/competitions/fashion-and-beauty-reviews")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:Your Kaggle Key:



Downloading fashion-and-beauty-reviews.zip to .\fashion-and-beauty-reviews


100%|██████████| 112M/112M [00:28<00:00, 4.15MB/s] 


Extracting archive .\fashion-and-beauty-reviews/fashion-and-beauty-reviews.zip to .\fashion-and-beauty-reviews





#### Working with the Dataset

In [None]:
%pip install pandas
%pip install numpy
%pip install torch

In [1]:
import pandas as pd
import numpy as np

In [2]:
dataframe = pd.read_csv(r"fashion-and-beauty-reviews\review_train.tsv", sep='\t', header=0)

In [3]:
dataframe.head()

Unnamed: 0,rating,review,summary,product,reviewer
0,4.0,I received this cream about 8 days ago and hav...,Thought this cream was making me sick...,B00DKEQYJY,A3IQA3VVDHGAK1
1,5.0,Beautiful pieces,Five Stars,B00L4JJKH0,A1QWCVZMSYG1N2
2,5.0,Really impressive tree. It goes together quick...,Highly recommended,1620213982,A2QRLRHMFDJ25E
3,4.0,good,Four Stars,B01D2J35BG,A2MMO1P2ZNEUNF
4,5.0,Gift for granddaughter. She loved it.,Cool looking umbrella.,B01422IQD4,A1WXKDA47VPOFT


##### Simple Analysis

In [4]:
dataframe.shape

(1003984, 5)

In [5]:
print(f"Possible ratings : {np.sort(dataframe['rating'].unique())}")

Possible ratings : [1. 2. 3. 4. 5.]


In [6]:
for col in ['review', 'summary']:
    lengths = dataframe[col].str.len()    # Missing values give NaN, which argmax/argmin skip
    longest, shortest = lengths.argmax(), lengths.argmin()

    print(f"Longest string in '{col}' is record# {longest} "
          f"of length {len(dataframe.iloc[longest][col])}")

    print(f"Shortest string in '{col}' is record# {shortest} "
          f"of length {len(dataframe.iloc[shortest][col])}")

Longest string in 'review' is record# 788873 of length 13741
Shortest string in 'review' is record# 9398 of length 1
Longest string in 'summary' is record# 800077 of length 267
Shortest string in 'summary' is record# 2321 of length 1


In [7]:
dataframe.isna().sum()

rating         0
review      1304
summary      597
product        0
reviewer       0
dtype: int64

#### Set up

In [8]:
import numpy as np
import pandas as pd
import random
import torch
import torch.nn as nn

In [9]:
def set_seeds(seed=1234):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) # multi-GPU

In [10]:
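import time

# Note: a time-based seed changes on every run, so results are only
# repeatable if this value is recorded; pass a fixed integer for reproducibility.
set_seeds(seed=int(time.time()))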
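
A fixed seed is what makes a run repeatable, whereas seeding from `time.time()` gives a different value on every execution. A minimal sketch of the difference, using only the standard library's `random` module (the helper name `draw` and the seed values are illustrative and not part of the notebook):

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random integers generated after seeding with `seed`."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

# Seeding twice with the same value reproduces the same sequence.
first = draw(1234)
second = draw(1234)
print(first == second)  # True

# A different seed starts a different sequence.
third = draw(4321)
print(third)
```

The same idea applies to `np.random.seed` and `torch.manual_seed` inside `set_seeds`: the seed has to be known to replay a run.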

In [11]:
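# Set device
cuda = True
device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")
# Note: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1;
# torch.set_default_dtype() and torch.set_default_device() are the replacements.
torch.set_default_tensor_type("torch.FloatTensor")
if device.type == "cuda":
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
print(device)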

cpu


#### Data Pre-processing

In [28]:
dataframe = pd.read_csv(r"fashion-and-beauty-reviews\review_train.tsv", sep='\t', header=0)

In [12]:
# Randomize/Shuffle order of all records in the dataset
dataframe = dataframe.sample(frac=1).reset_index(drop=True)

In [13]:
dataframe.head()

Unnamed: 0,rating,review,summary,product,reviewer
0,5.0,fabric is easy care. am 5'3. hits me about 2...,great flip skirt for ballroom dancing,B00GN72LYC,A26ZA7I89DN029
1,5.0,,Five Stars,B00UAWE978,A3P3UITN9EQIW4
2,5.0,Got this for my mom. She loved it. It is reall...,She loved it. It is really cute,B00OB8WRBW,A226AVPBKYZUU0
3,5.0,Great product. Great seller,Five Stars,B00FGCU5M0,A2JL10RY5CI9AM
4,5.0,I love this necklace. It's so pretty and looks...,I love this necklace,B01ADTLECK,A2TPTTT3JVFJL6


In [14]:
# Remove product and reviewer information and combine summary and review into one column
dataframe.drop(['product', 'reviewer'], axis=1, inplace=True)
# Concatenation yields NaN whenever the summary or the review is missing
dataframe['text'] = dataframe['summary'] + ' ' + dataframe['review']
dataframe.drop(['review', 'summary'], axis=1, inplace=True)

# Drop the rows whose combined text is NaN
dataframe.dropna(inplace=True)
dataframe.isna().sum()

rating    0
text      0
dtype: int64

In [15]:
dataframe.head()

Unnamed: 0,rating,text
0,5.0,great flip skirt for ballroom dancing fabric i...
2,5.0,She loved it. It is really cute Got this for m...
3,5.0,Five Stars Great product. Great seller
4,5.0,I love this necklace I love this necklace. It'...
5,5.0,"Worthy the money Nice jeans, true to size. I w..."


In [None]:
%pip install contractions
%pip install nltk
%pip install inflect

In [16]:
import re
from string import punctuation
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import contractions
import inflect

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
stop_words = set(stopwords.words('english')) - {'against', 'few', 'more', 'no', 'nor', 'not', 'very'}
lemma = WordNetLemmatizer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cpani\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cpani\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cpani\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\cpani\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [19]:
number_engine = inflect.engine()    # Build once instead of once per token

def process(text):
    """Clean and normalize a single review string."""
    text = re.sub(r'http\S+', '', text)                       # Remove links
    text = re.sub(r"\([^)]*\)", "", text)                     # Remove text within parentheses
    # Escape the punctuation so characters such as ']' and '\' are treated literally
    text = re.sub(f'[{re.escape(punctuation)}]', '', text)    # Remove punctuation

    # If a number and a word are joined, separate them with a space
    text = re.sub(r'(\d*)(\D+)(\d*)', r'\1 \2 \3', text)

    text = re.sub(" +", " ", text)                            # Collapse repeated spaces
    text = text.strip()

    text = text.lower()                                       # Convert to lowercase
    text = contractions.fix(text)                             # Contraction replacement

    tokens = nltk.word_tokenize(text)                         # Tokenize

    # Convert all numbers to words
    for index, token in enumerate(tokens):
        if token.isdigit():
            try:
                tokens[index] = number_engine.number_to_words(token)
            except Exception:
                print(token)                                  # Report tokens that could not be converted

    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization (every token is treated as a verb)
    tokens = [lemma.lemmatize(token, pos='v') for token in tokens]

    text = ' '.join(tokens)

    return text
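
The digit/word separation step can be checked in isolation. The sketch below reuses the same two regular expressions as `process`, with only the standard library; the helper name `split_digits` and the sample strings are illustrative assumptions rather than part of the notebook.

```python
import re

def split_digits(text):
    """Put spaces between digit runs and non-digit runs, then tidy whitespace."""
    # Each match is: optional digits, a run of non-digits, optional digits.
    text = re.sub(r'(\d*)(\D+)(\d*)', r'\1 \2 \3', text)
    # The substitution leaves extra spaces behind, so collapse and trim them.
    text = re.sub(" +", " ", text)
    return text.strip()

print(split_digits("size8 fits"))    # size 8 fits
print(split_digits("2in1 shampoo"))  # 2 in 1 shampoo
print(split_digits("123"))           # 123
```

A string made only of digits is left unchanged because the pattern requires at least one non-digit character to match.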

In [20]:
dataframe['text'] = dataframe['text'].apply(process)

7791793621055041073741832752334614789979
4666286801020021073741825353887201376151820146361416897


In [21]:
dataframe.to_csv(r"fashion-and-beauty-reviews\review_train_processed.tsv", sep='\t')

#### Data Split

In [None]:
%pip install scikit-learn

In [23]:
import collections
from sklearn.model_selection import train_test_split

In [24]:
# Use 80% of the data for training and the remaining 20% for validation.

TRAIN_SIZE = 0.8
VAL_SIZE = 0.2

In [25]:
# Data
X = dataframe['text'].values
y = dataframe['rating'].values

In [26]:
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=TRAIN_SIZE, stratify=y)

In [28]:
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}, y_val: {y_val.shape}")

X_train: (801696,), y_train: (801696,)
X_val: (200425,), y_val: (200425,)
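
Because the split is stratified on `y`, each rating should make up roughly the same share of both subsets. One way to check this is with `collections.Counter`. The sketch below uses small hypothetical label lists in place of `y_train` and `y_val`, and the helper name `class_shares` is an assumption.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the total, keyed by label in sorted order."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Hypothetical toy labels standing in for y_train and y_val.
toy_train = [5.0] * 8 + [4.0] * 4 + [1.0] * 4
toy_val = [5.0] * 2 + [4.0] * 1 + [1.0] * 1

print(class_shares(toy_train))  # {1.0: 0.25, 4.0: 0.25, 5.0: 0.5}
print(class_shares(toy_val))    # {1.0: 0.25, 4.0: 0.25, 5.0: 0.5}
```

On the real arrays, `class_shares(y_train.tolist())` and `class_shares(y_val.tolist())` would be expected to agree closely rather than exactly.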


#### Label Encoding

In [31]:
dataframe.dtypes

rating    float64
text       object
dtype: object

Label encoding maps each output class label to an integer. The ratings in this dataset are already numeric, so it is enough to cast them from float to int. Note that `y`, `y_train`, and `y_val` were extracted before this cast and therefore still hold floats.

In [32]:
dataframe['rating'] = dataframe['rating'].astype(np.int64)

In [33]:
dataframe.dtypes

rating     int64
text      object
dtype: object
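
If the ratings are later used as class targets for a loss such as `torch.nn.CrossEntropyLoss`, the class indices need to start at 0, so the ratings 1 to 5 would map to 0 to 4. A minimal sketch of that mapping in plain Python, assuming a hypothetical list of ratings (the names `build_label_maps`, `class_to_index`, and `index_to_class` are not from the notebook):

```python
def build_label_maps(labels):
    """Map each distinct label to a 0-based class index, and back again."""
    classes = sorted(set(labels))
    class_to_index = {label: index for index, label in enumerate(classes)}
    index_to_class = {index: label for label, index in class_to_index.items()}
    return class_to_index, index_to_class

# Hypothetical ratings standing in for the 'rating' column.
ratings = [5.0, 4.0, 1.0, 5.0, 3.0, 2.0]
class_to_index, index_to_class = build_label_maps(ratings)

encoded = [class_to_index[r] for r in ratings]
decoded = [index_to_class[i] for i in encoded]

print(encoded)             # [4, 3, 0, 4, 2, 1]
print(decoded == ratings)  # True
```

Keeping the reverse mapping makes it straightforward to turn predicted class indices back into star ratings.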