## **The International Society of Data Scientists - NLP Track**
#### **Fashion and Beauty Review Rating - The 3rd Annual International Data Science & AI Competition 2022**


Predict customer satisfaction, expressed as a 1 to 5 star rating, from the text of product reviews, to surface insights about fashion and beauty products.

#### Downloading Datasets

In [None]:
%pip install opendatasets

In [1]:
import opendatasets as od
od.download("https://www.kaggle.com/competitions/fashion-and-beauty-reviews")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:Your Kaggle Key:



Downloading fashion-and-beauty-reviews.zip to .\fashion-and-beauty-reviews


100%|██████████| 112M/112M [00:28<00:00, 4.15MB/s] 


Extracting archive .\fashion-and-beauty-reviews/fashion-and-beauty-reviews.zip to .\fashion-and-beauty-reviews





#### Working with the Dataset

In [None]:
%pip install pandas
%pip install numpy
%pip install torch

In [24]:
import pandas as pd
import numpy as np

In [2]:
dataframe = pd.read_csv(r"fashion-and-beauty-reviews\review_train.tsv", sep='\t', header=0)

In [4]:
dataframe.head()

| | rating | review | summary | product | reviewer |
|---|---|---|---|---|---|
| 0 | 4.0 | I received this cream about 8 days ago and hav... | Thought this cream was making me sick... | B00DKEQYJY | A3IQA3VVDHGAK1 |
| 1 | 5.0 | Beautiful pieces | Five Stars | B00L4JJKH0 | A1QWCVZMSYG1N2 |
| 2 | 5.0 | Really impressive tree. It goes together quick... | Highly recommended | 1620213982 | A2QRLRHMFDJ25E |
| 3 | 4.0 | good | Four Stars | B01D2J35BG | A2MMO1P2ZNEUNF |
| 4 | 5.0 | Gift for granddaughter. She loved it. | Cool looking umbrella. | B01422IQD4 | A1WXKDA47VPOFT |


##### Simple Analysis

In [3]:
dataframe.shape

(1003984, 5)

In [28]:
print(f"Possible ratings : {np.sort(dataframe['rating'].unique())}")

Possible ratings : [1. 2. 3. 4. 5.]


In [30]:
for col in ['review', 'summary']:
    lengths = dataframe[col].str.len()    # character count of every record, computed once
    longest, shortest = lengths.argmax(), lengths.argmin()

    print(f"Longest string in '{col}' is record# {longest} "
          f"of length {lengths.iloc[longest]}")
    print(f"Shortest string in '{col}' is record# {shortest} "
          f"of length {lengths.iloc[shortest]}")

Longest string in 'review' is record# 788873 of length 13741
Shortest string in 'review' is record# 9398 of length 1
Longest string in 'summary' is record# 800077 of length 267
Shortest string in 'summary' is record# 2321 of length 1


In [6]:
dataframe.isna().sum()

rating      0
review      0
summary     0
product     0
reviewer    0
dtype: int64

#### Set up

In [1]:
import numpy as np
import pandas as pd
import random
import torch
import torch.nn as nn

In [3]:
def set_seeds(seed=1234):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) # multi-GPU

In [4]:
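import time

# Note: a time-based seed differs on every run, so results will not repeat between runs.
# Pass a fixed integer (for example the default, 1234) when reproducibility is needed.
set_seeds(seed=int(time.time()))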
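As a minimal sketch of what a fixed seed provides, the snippet below uses only the standard-library `random` module for brevity. The helper `draw` is a hypothetical name introduced for illustration and is not part of the notebook; the same idea applies to the NumPy and PyTorch generators seeded in `set_seeds`.

```python
import random


def draw(seed, n=3):
    """Seed the generator, then return n pseudo-random integers (hypothetical helper)."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


# The same seed yields the same sequence every time it is used.
first = draw(1234)
second = draw(1234)
print(first == second)
```

Seeding from `int(time.time())`, by contrast, behaves like choosing a new seed for each run.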

In [5]:
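# Set device: use the GPU when one is available and `cuda` is enabled
cuda = True
device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")
# Note: torch.set_default_tensor_type is deprecated in recent PyTorch releases (2.1+),
# which recommend torch.set_default_dtype and torch.set_default_device instead.
torch.set_default_tensor_type("torch.FloatTensor")
if device.type == "cuda":
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
print(device)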

cpu


#### Data Pre-processing

In [21]:
dataframe = pd.read_csv(r"fashion-and-beauty-reviews\review_train.tsv", sep='\t', header=0)

In [22]:
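# Draw a random sample of 5 records for a quick look at the pre-processing steps
# (use dataframe.sample(frac=1) instead to shuffle all records in the dataset)
dataframe = dataframe.sample(n=5).reset_index(drop=True)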
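The difference between sampling a few rows and shuffling every row can be sketched on a toy frame. The values below are made up for illustration, and `random_state=0` is an arbitrary choice that only makes the toy example repeatable.

```python
import pandas as pd

# Toy data standing in for the review table (values are illustrative only).
toy = pd.DataFrame({
    "rating": [1.0, 2.0, 3.0, 4.0, 5.0, 5.0],
    "text": ["a", "b", "c", "d", "e", "f"],
})

# sample(n=2) keeps just two randomly chosen rows.
subset = toy.sample(n=2, random_state=0).reset_index(drop=True)

# sample(frac=1) returns every row, in random order.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(len(subset), len(shuffled))
```

In both cases `reset_index(drop=True)` discards the original row labels so the result is numbered from 0 again.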


In [23]:
dataframe.head()

| | rating | review | summary | product | reviewer |
|---|---|---|---|---|---|
| 0 | 5.0 | These were the perfect size for me. | Five Stars | B01B8QQ0CQ | A3ALU04HA80MW9 |
| 1 | 4.0 | It's one of those items that I never knew I ne... | Well made, sturdy. | B002GP80EU | A2N75ADJSRW0AH |
| 2 | 5.0 | I love them so much... they look too cute on m... | Five Stars | B00LKWYX2I | A2N0D6FR7I3T8V |
| 3 | 5.0 | I liked these so much I bought more! The littl... | I liked these so much I bought more | B01C79TFDY | A2UTB1578ZSX6J |
| 4 | 4.0 | ITEM IS AS DESCRIBED. | Four Stars | B00PIPNPKY | A3NQ03DMPFQ88B |


In [24]:
# Drop the product and reviewer IDs, then merge summary and review into a single 'text' column
dataframe.drop(['product', 'reviewer'], axis=1, inplace=True)
dataframe['text'] = dataframe['summary'] + ' ' + dataframe['review']
dataframe.drop(['review', 'summary'], axis=1, inplace=True)
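The earlier `isna()` check reported no missing values, so plain `+` concatenation is safe for this dataset. As a hedged sketch of what would happen otherwise, the toy frame below (made-up values) shows that a missing `summary` makes the combined text missing too, and one possible guard using `fillna`.

```python
import pandas as pd

# Toy rows; the second one has no summary (illustrative data only).
toy = pd.DataFrame({
    "summary": ["Five Stars", None],
    "review": ["Great fit.", "Too small."],
})

# Plain concatenation propagates the missing value to the result.
plain = toy["summary"] + " " + toy["review"]

# Replacing missing values with empty strings keeps whatever text is present.
safe = (toy["summary"].fillna("") + " " + toy["review"].fillna("")).str.strip()

print(plain.isna().tolist())
print(safe.tolist())
```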

In [25]:
dataframe.head()

| | rating | text |
|---|---|---|
| 0 | 5.0 | Five Stars These were the perfect size for me. |
| 1 | 4.0 | Well made, sturdy. It's one of those items tha... |
| 2 | 5.0 | Five Stars I love them so much... they look to... |
| 3 | 5.0 | I liked these so much I bought more I liked th... |
| 4 | 4.0 | Four Stars ITEM IS AS DESCRIBED. |


In [None]:
%pip install contractions
%pip install nltk
%pip install inflect

In [62]:
import re
from string import punctuation
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import contractions
import inflect

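nltk.download('punkt')      # tokenizer models (newer NLTK releases may need 'punkt_tab' instead)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
# Keep negations and intensifiers out of the stop-word list, since they carry sentiment
stop_words = set(stopwords.words('english')) - {'against', 'few', 'more', 'no', 'nor', 'not', 'very'}
lemma = WordNetLemmatizer()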

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cpani\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cpani\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cpani\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\cpani\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [63]:
def process(text):
    """Clean one review: strip links and punctuation, normalize, tokenize, lemmatize."""
    text = re.sub(r'http\S+', '', text)             # Remove links
    text = re.sub(r'\([^)]*\)', '', text)           # Remove text within parentheses
    # Remove punctuation; re.escape keeps characters such as ']' and '\' literal
    text = re.sub(f'[{re.escape(punctuation)}]', '', text)

    # If a number and a word are joined (e.g. "learn2"), separate them with a space
    text = re.sub(r'(\d*)(\D+)(\d*)', r'\1 \2 \3', text)

    text = re.sub(' +', ' ', text)                  # Collapse repeated spaces
    text = text.strip()

    text = text.lower()                             # Convert to lowercase
    text = contractions.fix(text)                   # Expand contractions

    tokens = nltk.word_tokenize(text)               # Tokenize

    # Convert all numbers to words (create the inflect engine once, not per token)
    number_engine = inflect.engine()
    tokens = [number_engine.number_to_words(token) if token.isdigit() else token
              for token in tokens]

    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization (every token is treated as a verb)
    tokens = [lemma.lemmatize(token, pos='v') for token in tokens]

    return ' '.join(tokens)

In [64]:
text = "https://analyticsindiamag.com/complete-tutorial-on-text-preprocessing-in-nlp/ (the website)    FNEIS wasn't a bad site to learn2 things. I have learnt a lot"
process(text)

'fneis not bad site learn two things learn lot'
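To see what the regular-expression steps of `process` do on their own, here is a minimal sketch that needs only the standard library. The helper name `clean` and the sample sentence are assumptions made for illustration; contraction expansion, tokenization, number-to-word conversion, stop-word removal and lemmatization are deliberately left out.

```python
import re
from string import punctuation


def clean(text):
    """Regex-only subset of the pipeline (hypothetical helper, not the full `process`)."""
    text = re.sub(r"http\S+", "", text)                      # drop links
    text = re.sub(r"\([^)]*\)", "", text)                    # drop parenthesized text
    text = re.sub(f"[{re.escape(punctuation)}]", "", text)   # drop punctuation
    text = re.sub(r"(\d*)(\D+)(\d*)", r"\1 \2 \3", text)     # split digits from letters
    text = re.sub(r" +", " ", text).strip()                  # collapse spaces
    return text.lower()


sample = "Visit http://example.com (ad) Great2 buy, 5stars!"
cleaned = clean(sample)
print(cleaned)  # visit great 2 buy 5 stars
```

Because punctuation is removed before any later step runs, a token such as `10/10` would collapse to `1010`; whether that matters depends on how much rating-like text the reviews contain.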