## Disneyland Review Rating Prediction

Given the text of a *Disneyland review*, let's try to predict the **rating** that accompanies it.

We will use a TensorFlow/Keras model with word embeddings and treat the rating as a regression target.

Data source: https://www.kaggle.com/datasets/arushchillar/disneyland-reviews

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import tensorflow as tf

In [4]:
data = pd.read_csv('archive/DisneylandReviews.csv', encoding='latin-1')
data

Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch
0,670772142,4,2019-4,Australia,If you've ever been to Disneyland anywhere you...,Disneyland_HongKong
1,670682799,4,2019-5,Philippines,Its been a while since d last time we visit HK...,Disneyland_HongKong
2,670623270,4,2019-4,United Arab Emirates,Thanks God it wasn t too hot or too humid wh...,Disneyland_HongKong
3,670607911,4,2019-4,Australia,HK Disneyland is a great compact park. Unfortu...,Disneyland_HongKong
4,670607296,4,2019-4,United Kingdom,"the location is not in the city, took around 1...",Disneyland_HongKong
...,...,...,...,...,...,...
42651,1765031,5,missing,United Kingdom,i went to disneyland paris in july 03 and thou...,Disneyland_Paris
42652,1659553,5,missing,Canada,2 adults and 1 child of 11 visited Disneyland ...,Disneyland_Paris
42653,1645894,5,missing,South Africa,My eleven year old daughter and myself went to...,Disneyland_Paris
42654,1618637,4,missing,United States,"This hotel, part of the Disneyland Paris compl...",Disneyland_Paris


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42656 entries, 0 to 42655
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Review_ID          42656 non-null  int64 
 1   Rating             42656 non-null  int64 
 2   Year_Month         42656 non-null  object
 3   Reviewer_Location  42656 non-null  object
 4   Review_Text        42656 non-null  object
 5   Branch             42656 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.0+ MB


### Preprocessing

In [7]:
df = data.copy()

In [8]:
df

Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch
0,670772142,4,2019-4,Australia,If you've ever been to Disneyland anywhere you...,Disneyland_HongKong
1,670682799,4,2019-5,Philippines,Its been a while since d last time we visit HK...,Disneyland_HongKong
2,670623270,4,2019-4,United Arab Emirates,Thanks God it wasn t too hot or too humid wh...,Disneyland_HongKong
3,670607911,4,2019-4,Australia,HK Disneyland is a great compact park. Unfortu...,Disneyland_HongKong
4,670607296,4,2019-4,United Kingdom,"the location is not in the city, took around 1...",Disneyland_HongKong
...,...,...,...,...,...,...
42651,1765031,5,missing,United Kingdom,i went to disneyland paris in july 03 and thou...,Disneyland_Paris
42652,1659553,5,missing,Canada,2 adults and 1 child of 11 visited Disneyland ...,Disneyland_Paris
42653,1645894,5,missing,South Africa,My eleven year old daughter and myself went to...,Disneyland_Paris
42654,1618637,4,missing,United States,"This hotel, part of the Disneyland Paris compl...",Disneyland_Paris


In [9]:
# Limit data to only the review and rating columns
df = df.loc[:, ['Review_Text', 'Rating']]

In [10]:
df

Unnamed: 0,Review_Text,Rating
0,If you've ever been to Disneyland anywhere you...,4
1,Its been a while since d last time we visit HK...,4
2,Thanks God it wasn t too hot or too humid wh...,4
3,HK Disneyland is a great compact park. Unfortu...,4
4,"the location is not in the city, took around 1...",4
...,...,...
42651,i went to disneyland paris in july 03 and thou...,5
42652,2 adults and 1 child of 11 visited Disneyland ...,5
42653,My eleven year old daughter and myself went to...,5
42654,"This hotel, part of the Disneyland Paris compl...",4


In [11]:
y = df['Rating']
X = df['Review_Text']

In [12]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

In [13]:
X_train

20780    I love any and all things Disney! I went this ...
791      Easy to get to using the MTR, great for little...
19394    We visited both the California Adventure park ...
32755    This awesome place is not only fun filled for ...
38577    Just got back from disneyland paris and wasnt ...
                               ...                        
7813     Ocean Park far more value for money.  Disneyla...
32511    I went with a friend to stay for 4 nights so w...
5192     Disneyland in Hong Kong is a beautiful park wi...
12172    I have various season passes for California th...
33003    It's Disney, it's magical. we spent 3 nights s...
Name: Review_Text, Length: 29859, dtype: object

In [14]:
y_train

20780    5
791      3
19394    5
32755    5
38577    3
        ..
7813     3
32511    5
5192     5
12172    5
33003    5
Name: Rating, Length: 29859, dtype: int64

In [15]:
# Fit tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

In [19]:
print("Vocab Length:", len(tokenizer.word_index) + 1)

Vocab Length: 37846
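
The `+ 1` in the vocabulary length accounts for index 0, which the tokenizer never assigns to a word and which is later used for padding. As a rough, dependency-free sketch of the idea (not the Keras implementation; `build_word_index` and `to_sequence` are hypothetical helpers, and splitting on whitespace is a simplification of Keras's punctuation filtering), words are ranked by frequency, and words unseen during fitting are dropped when no out-of-vocabulary token is configured:

```python
from collections import Counter

def build_word_index(texts):
    """Assign indices by descending frequency, starting at 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable: ties keep first-seen order
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequence(text, word_index):
    """Map words to indices, silently skipping words absent from the index."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

train_texts = ["great park great rides", "long queues but great fun"]
word_index = build_word_index(train_texts)

print(len(word_index) + 1)                                      # 8
print(to_sequence("great rides terrible food", word_index))     # [1, 3]
```

Because the tokenizer is fitted on `X_train` only, test-set words that never appear in training are dropped in the same way.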


In [20]:
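def get_sequences(texts, tokenizer, train=True, max_seq_length=None):
    # Map each text to a list of word indices from the fitted tokenizer
    sequences = tokenizer.texts_to_sequences(texts)

    # On the training set, pad to the longest sequence; otherwise reuse the given length
    if train:
        max_seq_length = max(len(seq) for seq in sequences)

    # Pad with zeros at the end so every sequence has the same length
    sequences = pad_sequences(sequences, maxlen=max_seq_length, padding='post')

    return sequences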

In [21]:
# Convert texts to sequences
X_train = get_sequences(X_train, tokenizer, train=True)
X_test = get_sequences(X_test, tokenizer, train=False, max_seq_length=X_train.shape[1])

In [22]:
X_train

array([[ 12, 154, 159, ...,   0,   0,   0],
       [330,   3,  38, ...,   0,   0,   0],
       [  6, 168, 193, ...,   0,   0,   0],
       ...,
       [ 26,   7, 251, ...,   0,   0,   0],
       [ 12,  28, 989, ...,   0,   0,   0],
       [ 68,  23,  68, ...,   0,   0,   0]],
      shape=(29859, 3958), dtype=int32)
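
The trailing zeros in the array come from `padding='post'`. The following is a minimal pure-Python sketch of that behaviour, assuming the `pad_sequences` defaults for everything else (in particular, sequences longer than `maxlen` keep their last `maxlen` tokens); `pad_post` is a hypothetical name used only for illustration:

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; trim longer ones from the front.

    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep the last `maxlen` tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

print(pad_post([[12, 154], [330, 3, 38, 7]], maxlen=3))
# [[12, 154, 0], [3, 38, 7]]
```

This matters for `X_test`: any test review longer than the longest training review is cut to its final 3958 tokens.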

### Training

In [23]:
X_train.shape

(29859, 3958)
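
Padding every review to 3958 tokens, the length of the longest training review, means a single outlier sets the input width; with 64-dimensional embeddings followed by a `Flatten` layer, each review becomes 3958 × 64 = 253,312 values. One common alternative, which this notebook does not use, is to cap the length at a high percentile of the training lengths. A small sketch under that assumption (`percentile_length` and the percentages are illustrative choices, not tuned values):

```python
def percentile_length(lengths, pct=95):
    """Return the nearest-rank `pct` percentile of a list of sequence lengths."""
    ordered = sorted(lengths)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct / 100 * n) in integer arithmetic
    return ordered[max(rank, 1) - 1]

lengths = [5, 8, 12, 20, 400]  # toy token counts with one very long review

print(max(lengths))                        # 400
print(percentile_length(lengths, pct=80))  # 20
```

The result could be passed as `max_seq_length` to `get_sequences` with `train=False`, at the cost of truncating the longest reviews.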

In [29]:
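# Derive the input sizes from the data instead of hard-coding them
max_seq_length = X_train.shape[1]            # 3958
vocab_size = len(tokenizer.word_index) + 1   # 37846

inputs = tf.keras.Input(shape=(max_seq_length,))
x = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=64
)(inputs)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
# Single linear unit: the rating is predicted as a continuous value
outputs = tf.keras.layers.Dense(1, activation='linear')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer='adam',
    loss='mse'
)

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[
        # Stop once val_loss has not improved for 3 epochs and keep the best weights
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)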

Epoch 1/100
747/747 ━━━━━━━━━━━━━━━━━━━━ 257s 342ms/step - loss: 162.7188 - val_loss: 2.7561
Epoch 2/100
747/747 ━━━━━━━━━━━━━━━━━━━━ 266s 357ms/step - loss: 1.3508 - val_loss: 0.9735
Epoch 3/100
747/747 ━━━━━━━━━━━━━━━━━━━━ 311s 343ms/step - loss: 0.9392 - val_loss: 1.6926
Epoch 4/100
747/747 ━━━━━━━━━━━━━━━━━━━━ 266s 349ms/step - loss: 0.6316 - val_loss: 0.7272
Epoch 5/100
747/747 ━━━━━━━━━━━━━━━━━━━━ 263s 350ms/step - loss: 0.3584 - val_loss: 0.7070
Epoch 6/100
747/747 ━━━━━━━━━━━━━━━━━━━━ 271s 362ms/step - loss: 0.2222 - val_loss: 0.7168
Epoch 7/100
747/747 ━━━━━━━━━━━━━━━━━━━━ 272s 364ms/step - loss: 0.1320 - val_loss: 0.7245
Epoch 8/100
747/747 ━━━━━━━━━━━━━━━━━━━━ 271s 363ms/step - loss: 0.0866 - val_loss: 0.763

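The log stops after epoch 8 because of the `EarlyStopping` callback: `val_loss` last improved at epoch 5, and `patience=3` allows three further epochs without improvement. The bookkeeping can be sketched in plain Python as follows; this is a simplified illustration rather than the Keras source, `early_stopping_trace` is a hypothetical helper, and the losses are copied as printed in the log above:

```python
def early_stopping_trace(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of per-epoch losses."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

val_losses = [2.7561, 0.9735, 1.6926, 0.7272, 0.7070, 0.7168, 0.7245, 0.763]
print(early_stopping_trace(val_losses))  # (8, 5)
```

With `restore_best_weights=True`, the model used for the predictions is therefore the one from epoch 5 (validation MSE 0.7070), not the one from the final epoch.
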
### Results

In [30]:
y_pred = np.squeeze(model.predict(X_test))
y_pred

400/400 ━━━━━━━━━━━━━━━━━━━━ 13s 31ms/step


array([4.5985928, 3.5433617, 3.920749 , ..., 4.284516 , 3.6621928,
       4.5397525], shape=(12797,), dtype=float32)

In [32]:
rmse = np.sqrt(np.mean((y_test - y_pred)**2))
rmse

np.float64(0.822147955930913)

In [33]:
print("RMSE: {:.2f}".format(rmse))

RMSE: 0.82
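
Because the output layer is linear, nothing keeps predictions inside the valid rating range. Assuming ratings run from 1 to 5, clipping predictions to that range can only reduce (or leave unchanged) the squared error for each review. A toy sketch with made-up numbers, not the notebook's data (`rmse` and `clip` are hypothetical helpers):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def clip(values, low=1.0, high=5.0):
    """Clamp every value into [low, high]."""
    return [min(max(v, low), high) for v in values]

y_true = [5, 1, 4]
y_pred = [5.6, 0.4, 3.5]  # illustrative values only

print(round(rmse(y_true, y_pred), 4))        # 0.5686
print(round(rmse(y_true, clip(y_pred)), 4))  # 0.2887
```

In the notebook this would correspond to applying `np.clip(y_pred, 1, 5)` before computing the RMSE; whether it changes the reported 0.82 depends on how many predictions fall outside the range.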


In [34]:
y_test

12008    5
42394    1
24748    5
42609    1
10719    5
        ..
37120    4
33226    4
3764     5
22423    5
293      4
Name: Rating, Length: 12797, dtype: int64

In [39]:
r2 = 1 - (np.sum((y_test - y_pred)**2) / np.sum((y_test - y_test.mean())**2))
r2

np.float64(0.3957243822748425)

In [40]:
print("R2 Score: {:.4f}".format(r2))

R2 Score: 0.3957
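
The two metrics are linked: since R2 = 1 - MSE / Var(y_test), the reported RMSE and R2 together imply the RMSE of a baseline that always predicts the mean test rating. A short worked calculation using only the values printed above:

```python
import math

rmse = 0.822147955930913   # model RMSE reported above
r2 = 0.3957243822748425    # model R2 reported above

mse = rmse ** 2
variance = mse / (1 - r2)            # rearranged from r2 = 1 - mse / variance
baseline_rmse = math.sqrt(variance)  # RMSE of always predicting the mean rating

print(round(baseline_rmse, 2))  # 1.06
```

So, relative to that constant baseline, the model lowers the error from roughly 1.06 to 0.82 stars.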
