## Clothing Review Rating Prediction

Given *reviews of women's clothing*, let's try to predict whether the rating attached to each review is **5-star** or not.

We will use a TensorFlow neural network (an embedding layer followed by a dense sigmoid classifier) to make our predictions.

Data source: https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import tensorflow as tf

In [2]:
data = pd.read_csv('archive/Womens Clothing E-Commerce Reviews.csv')
data

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23481 | 23481 | 1104 | 34 | Great dress for many occasions | I was very happy to snag this dress at such a ... | 5 | 1 | 0 | General Petite | Dresses | Dresses |
| 23482 | 23482 | 862 | 48 | Wish it was made of cotton | It reminds me of maternity clothes. soft, stre... | 3 | 1 | 0 | General Petite | Tops | Knits |
| 23483 | 23483 | 1104 | 31 | Cute, but see through | This fit well, but the top was very see throug... | 3 | 0 | 1 | General Petite | Dresses | Dresses |
| 23484 | 23484 | 1084 | 28 | Very cute dress, perfect for summer parties an... | I bought this dress for a wedding i have this ... | 3 | 1 | 2 | General | Dresses | Dresses |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


### Preprocessing

In [4]:
df = data.copy()

In [8]:
# Drop rows with missing reviews
df = df.dropna(subset=['Review Text']).reset_index(drop=True)

In [9]:
# Use only the reviews and rating columns
y = df['Rating'].copy()
X = df['Review Text'].copy()

In [11]:
y.isna().sum()

np.int64(0)

In [12]:
X.isna().sum()

np.int64(0)

In [13]:
y

0        4
1        5
2        3
3        5
4        5
        ..
22636    5
22637    3
22638    3
22639    3
22640    5
Name: Rating, Length: 22641, dtype: int64

In [14]:
# Make y a binary target: 1 for 5-star reviews, 0 otherwise
y = (y == 5).astype(np.int64)

In [15]:
y

0        0
1        1
2        0
3        1
4        1
        ..
22636    1
22637    0
22638    0
22639    0
22640    1
Name: Rating, Length: 22641, dtype: int64

In [16]:
y.unique()

array([0, 1])
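
With a binary target, it helps to know how often the majority class occurs, because a model that always predicts that class already reaches that accuracy. The sketch below is a minimal illustration on toy labels; `majority_baseline_accuracy` is a hypothetical helper name, not part of the original notebook.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common class."""
    labels = list(labels)
    positive_rate = sum(labels) / len(labels)  # share of 1s (5-star reviews)
    return max(positive_rate, 1.0 - positive_rate)


# Toy labels for illustration only (not taken from the dataset)
toy_labels = [0, 1, 0, 1, 1]
print(majority_baseline_accuracy(toy_labels))  # 0.6
```

In the notebook, the same helper could be called as `majority_baseline_accuracy(y)` to get a reference point for the accuracy reported later.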

In [17]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

In [18]:
X_train

9232     This dress is adorable. it simply did not fi o...
13974    The fit was not flattering at all, but the wor...
17979    I bought this top in both colors. the fit is p...
9586     This is even cuter in person than it is in the...
9358     I couldn't believe there weren't more reviews ...
                               ...                        
10955    Pretty color, but i was hoping for a thicker, ...
17289    This sweater coat is a keeper. yes it runs lar...
5192     Should have ordered the xl instead of the larg...
12172    I really like this denim jacket with the plaid...
235      Love pilcro, love the stripes and the length -...
Name: Review Text, Length: 15848, dtype: object

In [19]:
X_train.shape, X_test.shape

((15848,), (6793,))

In [24]:
# Learn the vocabulary from the training reviews only
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

# Inspect the word-to-index mapping (more frequent words get lower indices)
tokenizer.word_index

{'the': 1,
 'i': 2,
 'and': 3,
 'a': 4,
 'it': 5,
 'is': 6,
 'this': 7,
 'to': 8,
 'in': 9,
 'but': 10,
 'on': 11,
 'for': 12,
 'of': 13,
 'with': 14,
 'was': 15,
 'so': 16,
 'my': 17,
 'dress': 18,
 'not': 19,
 'that': 20,
 'size': 21,
 'love': 22,
 'have': 23,
 'very': 24,
 'are': 25,
 'fit': 26,
 'top': 27,
 'like': 28,
 'be': 29,
 'me': 30,
 'as': 31,
 'wear': 32,
 "it's": 33,
 'too': 34,
 'great': 35,
 "i'm": 36,
 'am': 37,
 'or': 38,
 'just': 39,
 'you': 40,
 'they': 41,
 'would': 42,
 'at': 43,
 'up': 44,
 'fabric': 45,
 'small': 46,
 'color': 47,
 'look': 48,
 'if': 49,
 'really': 50,
 'more': 51,
 'ordered': 52,
 'perfect': 53,
 'little': 54,
 'these': 55,
 'will': 56,
 'one': 57,
 'flattering': 58,
 'well': 59,
 'an': 60,
 'soft': 61,
 'out': 62,
 'back': 63,
 'because': 64,
 'can': 65,
 'had': 66,
 '\r': 67,
 'cute': 68,
 'comfortable': 69,
 'nice': 70,
 'than': 71,
 'beautiful': 72,
 'when': 73,
 'bought': 74,
 'all': 75,
 'bit': 76,
 'which': 77,
 'looks': 78,
 'shirt': 79

In [26]:
# Find the size of the vocabulary
vocab_length = len(tokenizer.word_index) + 1
print("Vocab Length:", vocab_length)

Vocab Length: 12751
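
To make the indexing and the `+ 1` above concrete, here is a simplified pure-Python imitation of what the tokenizer does. It is a sketch, not the Keras implementation: the real `Tokenizer` also strips punctuation, and the helper names `build_word_index` and `texts_to_sequences` are chosen here for illustration. Words are ranked by frequency starting at index 1, index 0 is left free for padding, and words never seen during fitting are dropped.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace known words with their index; unknown words are skipped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


train_texts = ["love this dress", "this dress runs small"]
word_index = build_word_index(train_texts)
vocab_length = len(word_index) + 1  # +1 because index 0 is reserved for padding

print(word_index)    # {'this': 1, 'dress': 2, 'love': 3, 'runs': 4, 'small': 5}
print(vocab_length)  # 6
print(texts_to_sequences(["love this skirt"], word_index))  # [[3, 1]]
```

The last line shows why fitting on `X_train` only matters: "skirt" was not in the training texts, so it simply disappears from the encoded test review.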


In [27]:
# Convert review texts into sequences of integers
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

In [32]:
# Find the maximum sequence length
max_seq_length = max(len(seq) for seq in X_train)
print("Maximum Sequence Length: ", max_seq_length)

Maximum Sequence Length:  116


In [34]:
# Pad the sequences to be of uniform length
X_train = pad_sequences(X_train, maxlen=max_seq_length, padding='post')
X_test = pad_sequences(X_test, maxlen=max_seq_length, padding='post')
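
The effect of these two calls can be sketched in plain Python. This is an illustrative imitation under the settings used above (`padding='post'` and the default `truncating='pre'`), with `pad_post` as a hypothetical helper name: short sequences get zeros appended, and sequences longer than `maxlen` lose tokens from the front.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate from the front if too long."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


print(pad_post([[7, 18, 6], [1, 26, 15, 19, 58]], maxlen=4))
# [[7, 18, 6, 0], [26, 15, 19, 58]]
```

Since `max_seq_length` is computed from the training set, only test reviews longer than 116 tokens would be truncated this way.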

In [35]:
X_train

array([[  7,  18,   6, ...,   0,   0,   0],
       [  1,  26,  15, ...,   0,   0,   0],
       [  2,  74,   7, ...,   0,   0,   0],
       ...,
       [349,  23,  52, ...,   0,   0,   0],
       [  2,  50,  28, ...,   0,   0,   0],
       [ 22, 515,  22, ...,   0,   0,   0]],
      shape=(15848, 116), dtype=int32)

In [36]:
X_test

array([[  36, 1080,    7, ...,    0,    0,    0],
       [   2,  141,    7, ...,    0,    0,    0],
       [ 261,    9,   22, ...,    0,    0,    0],
       ...,
       [   2,   92,    1, ...,    0,    0,    0],
       [  52,  185,    1, ...,    0,    0,    0],
       [  39,  141,    7, ...,    0,    0,    0]],
      shape=(6793, 116), dtype=int32)

### Training

In [68]:
inputs = tf.keras.Input(shape=(X_train.shape[1],))

# The sequence length comes from the Input shape; the `input_length`
# argument is deprecated in Keras 3, so it is omitted here.
x = tf.keras.layers.Embedding(
    input_dim=vocab_length,
    output_dim=128
)(inputs)

In [69]:
inputs

<KerasTensor shape=(None, 116), dtype=float32, sparse=False, ragged=False, name=keras_tensor_17>

In [70]:
x

<KerasTensor shape=(None, 116, 128), dtype=float32, sparse=False, ragged=False, name=keras_tensor_18>

In [71]:
x = tf.keras.layers.Flatten()(x)

In [72]:
x

<KerasTensor shape=(None, 14848), dtype=float32, sparse=False, ragged=False, name=keras_tensor_19>

In [73]:
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

In [74]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

In [75]:
model.summary()
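
The summary output is not shown here, but the parameter counts for this architecture can be worked out by hand from the values printed earlier (vocabulary size 12751, embedding size 128, sequence length 116). The arithmetic below is a sketch based on the standard formulas: an embedding layer stores one vector per vocabulary index, and a dense layer stores one weight per input plus one bias per unit.

```python
vocab_length = 12751    # from the "Vocab Length" output above
embedding_dim = 128     # output_dim of the Embedding layer
max_seq_length = 116    # from the "Maximum Sequence Length" output above

embedding_params = vocab_length * embedding_dim   # one 128-d vector per index
flattened_size = max_seq_length * embedding_dim   # input size of the Dense layer
dense_params = flattened_size * 1 + 1             # weights plus a single bias
total_params = embedding_params + dense_params

print(embedding_params)  # 1632128
print(flattened_size)    # 14848
print(dense_params)      # 14849
print(total_params)      # 1646977
```

Almost all of the parameters sit in the embedding table, which is worth keeping in mind when reading the training log.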

In [76]:
history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)

Epoch 1/100
397/397 ━━━━━━━━━━━━━━━━━━━━ 11s 26ms/step - accuracy: 0.7133 - loss: 0.5601 - val_accuracy: 0.7808 - val_loss: 0.4696
Epoch 2/100
397/397 ━━━━━━━━━━━━━━━━━━━━ 10s 25ms/step - accuracy: 0.8561 - loss: 0.3508 - val_accuracy: 0.7804 - val_loss: 0.4737
Epoch 3/100
397/397 ━━━━━━━━━━━━━━━━━━━━ 10s 24ms/step - accuracy: 0.9372 - loss: 0.1979 - val_accuracy: 0.7842 - val_loss: 0.4932
Epoch 4/100
397/397 ━━━━━━━━━━━━━━━━━━━━ 10s 26ms/step - accuracy: 0.9769 - loss: 0.1003 - val_accuracy: 0.7713 - val_loss: 0.5450

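Training stopped after four epochs because `val_loss` did not improve for three epochs in a row (`patience=3`), and `restore_best_weights=True` rolls the model back to the epoch with the lowest validation loss. The sketch below finds that epoch from the logged values; `best_epoch` is a hypothetical helper, and the list is copied from the log above.

```python
def best_epoch(val_losses):
    """Return (1-based epoch number, loss) for the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1, val_losses[best_index]


# val_loss values copied from the training log
val_loss = [0.4696, 0.4737, 0.4932, 0.5450]
print(best_epoch(val_loss))  # (1, 0.4696)
```

In the notebook, the same call could be made as `best_epoch(history.history['val_loss'])`. Training accuracy keeps climbing while validation loss rises after the first epoch, which is the usual sign of overfitting.
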

### Results

In [77]:
model.evaluate(X_test, y_test)

213/213 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7842 - loss: 0.4709


[0.47091054916381836, 0.7841895818710327]
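
The two numbers are the test loss and the test accuracy. Accuracy alone does not show which kind of mistake the model makes, so a possible next step is to threshold the predicted probabilities at 0.5 and count each outcome. The helper below is a minimal sketch on toy values; `confusion_counts` is a hypothetical name and the toy labels and probabilities are made up for illustration.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for label, prob in zip(y_true, y_prob):
        predicted_positive = prob > threshold
        if predicted_positive and label == 1:
            counts["tp"] += 1
        elif predicted_positive and label == 0:
            counts["fp"] += 1
        elif not predicted_positive and label == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Toy labels and predicted probabilities (not model output)
toy_counts = confusion_counts([1, 0, 1, 0], [0.9, 0.6, 0.3, 0.1])
print(toy_counts)  # {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 1}
```

On the real data this could be called as `confusion_counts(y_test, model.predict(X_test).ravel())`, where `.ravel()` flattens the `(n, 1)` prediction array into a flat sequence of probabilities.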