# 1- INTRODUCTION

This project is based on a binary classification of hotel reviews into Good or Bad. For my project I decided to implement an RNN-based model capable of extracting relevant features from text in order to carry out the required task.

**DISCLAIMER**:

Unfortunately, I noticed that choosing a Bidirectional LSTM instead of a simple LSTM (or even a GRU) during model selection considerably increased the time needed for model evaluation, while still giving good performance. I ran it on my personal laptop and it converged in 2 min; in Colab, however, it took me 30 min to fully run the notebook and obtain the same results.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import keras
import matplotlib.pyplot as plt
import seaborn as sns
import random
import string
import itertools

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder
from keras.utils import pad_sequences

from keras.models import Model
from keras.layers import (Input, Dense, LSTM, Embedding, Bidirectional,
                          Dropout, BatchNormalization, Concatenate)

from keras.optimizers import Adam
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

tf.compat.v1.enable_eager_execution()

print( keras.__version__ )
print( tf.__version__)

# Set seeds for reproducibility
def set_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

set_seeds(42)

3.8.0
2.18.0


In [2]:
df = pd.read_csv('input_data.csv')
# df = df.drop(df.index[5000:])

In [3]:
df.head()

Unnamed: 0,Hotel_Address,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Hotel_number_reviews,Reviewer_number_reviews,Review_Score,Review,Review_Type
0,Scarsdale Scarsdale Place Kensington Kensingto...,5/2/2017,8.1,Copthorne Tara Hotel London Kensington,United Kingdom,7105,2,6.7,Expensive room rate that didn t include parki...,Bad_review
1,53 53 59 Kilburn High Road Maida Vale London C...,8/4/2016,7.1,BEST WESTERN Maitrise Hotel Maida Vale,United Kingdom,1877,8,5.8,Bedroom in the basement No windows Very small...,Bad_review
2,Pelai Pelai 28 Ciutat Vella 08002 Barcelona Spain,11/17/2016,8.6,Catalonia Ramblas 4 Sup,United Kingdom,4276,2,6.3,Room ready for a makeover Location,Bad_review
3,3 3 Place du G n ral Koenig 17th arr 75017 Par...,2/4/2016,7.1,Hyatt Regency Paris Etoile,United Kingdom,3973,3,5.8,Firstly the lady at the check in desk was qui...,Bad_review
4,Epping Epping Forest 30 Oak Hill London IG8 9N...,7/27/2016,7.5,Best Western PLUS Epping Forest,United Kingdom,587,7,3.3,Not being able to park my vehicle due to the ...,Bad_review


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13772 entries, 0 to 13771
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Hotel_Address            13772 non-null  object 
 1   Review_Date              13772 non-null  object 
 2   Average_Score            13772 non-null  float64
 3   Hotel_Name               13772 non-null  object 
 4   Reviewer_Nationality     13772 non-null  object 
 5   Hotel_number_reviews     13772 non-null  int64  
 6   Reviewer_number_reviews  13772 non-null  int64  
 7   Review_Score             13772 non-null  float64
 8   Review                   13772 non-null  object 
 9   Review_Type              13772 non-null  object 
dtypes: float64(2), int64(2), object(6)
memory usage: 1.1+ MB


# 2- PREPROCESSING

This section of the exam was divided into two subparts:
* Whether and how to preprocess the input data, and which data to retain/use;
* What the input of the model is, and how it is represented.

As I said in my exam, I am going to preprocess the data in the following way: drop all the columns except for 'Review', 'Review_Score' and 'Review_Type'.

In [5]:
# .copy() makes df an independent DataFrame, so that assigning columns later
# does not trigger pandas' SettingWithCopyWarning
df = df[['Review', 'Review_Score', 'Review_Type']].copy()

'Review_Type': I encode it into binary integer values, 0 for Bad_review and 1 for Good_review, which is what the model predicts.

In [6]:
label_enc = LabelEncoder()
df['Review_Type'] = label_enc.fit_transform(df['Review_Type'])


'Review_Score': I said in the exam that I would keep it as it is.

<font color="blue">**CHANGE**</font>

'Review': I'm going to clean the whole text, removing punctuation and lowercasing words in order to preprocess the input data correctly and remove irrelevant information. This also helped me reduce the size of each review, making the model's computations faster.

In [7]:
def clean_text(text):
    text = text.replace('--', ' ')
    # split into tokens by white space
    words = text.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in words]
    # remove remaining tokens that are not alphabetic
    words = [word for word in words if word.isalpha()]
    # make lower case
    words = [word.lower() for word in words]
    # build up tokens again
    return " ".join(words)

df['Clean_Review'] = df['Review'].apply(clean_text)

df.head()

Unnamed: 0,Review,Review_Score,Review_Type,Clean_Review
0,Expensive room rate that didn t include parki...,6.7,0,expensive room rate that didn t include parkin...
1,Bedroom in the basement No windows Very small...,5.8,0,bedroom in the basement no windows very small ...
2,Room ready for a makeover Location,6.3,0,room ready for a makeover location
3,Firstly the lady at the check in desk was qui...,5.8,0,firstly the lady at the check in desk was quit...
4,Not being able to park my vehicle due to the ...,3.3,0,not being able to park my vehicle due to the s...


<font color="blue">**CHANGE**</font>

I did not mention in my exam (and I think it's important) how to correctly **preprocess text**. First, apply **tokenization**: this consists of splitting each review into its individual words in order to build a vocabulary of them. In this way, we can assign a unique integer (an index) to each word in the vocabulary.

Once the text is tokenized and encoded into sequences of integers, the next step is to **pad** these sequences so that they all have the same length. This is necessary because the model expects fixed-size inputs, as I mentioned in the exam.

After that, we define the **embedding layer**: it learns a dense vector representation (i.e., a low-dimensional continuous representation) for each word during training, based on the context in which the word appears. The embedding layer transforms each word index into a trainable dense vector, which comes to capture semantic meaning as training progresses. I'm going to train the embedding directly inside my function create_model, since we are not allowed to use pretrained embedding models.

Below I provide a personal implementation of tokenization through a class Token(); I then apply padding before feeding the data into the model.

In [8]:
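class Token():
    """Minimal word-level tokenizer: maps each unique word to an integer index."""

    def __init__(self, statistic=True):
        self.unique_words = None
        self.word_index = None
        self.statistic = statistic  # print corpus statistics when fitting

    def fit_on_texts(self, texts):
        # Flatten all reviews into a single list of words
        flat_words = [word for text in texts for word in text.split()]

        # np.unique returns the unique words in sorted order
        self.unique_words = np.unique(flat_words)
        # NOTE: indices start at 0, the same value that pad_sequences uses for
        # padding and that Embedding(mask_zero=True) masks;
        # enumerate(..., start=1) would keep 0 free for padding.
        self.word_index = dict((word, idx) for idx, word in enumerate(self.unique_words))
        if self.statistic:
            print(f"Total words: {len(flat_words)}")
            print(f"Unique words: {len(self.unique_words)}")

    def encode_words(self, word_list):
        # Words that are not in the vocabulary are silently skipped
        return [self.word_index[word] for word in word_list if word in self.word_index]

    def texts_to_sequences(self, texts):
        return [self.encode_words(text.split()) for text in texts]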
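
The word-to-index mapping can be illustrated on a toy corpus. This is only a sketch, not the class above: the corpus and the names `texts`, `vocab` and `encode` are made up for illustration, and it assumes that indices start at 1 so that 0 stays reserved for padding.

```python
# Toy corpus (hypothetical, not taken from the dataset)
texts = ["room was clean", "room was noisy and small"]

# Sorted unique words, mirroring the sorted order returned by np.unique
vocab = sorted({word for text in texts for word in text.split()})

# Assumption: indices start at 1 so that 0 is never assigned to a real word
word_index = {word: idx for idx, word in enumerate(vocab, start=1)}

def encode(text):
    """Map each known word to its index; unknown words are skipped."""
    return [word_index[w] for w in text.split() if w in word_index]

sequences = [encode(t) for t in texts]
print(word_index)
print(sequences)
```

With this variant a padded 0 can never be confused with a real word, which matters when the embedding layer is created with `mask_zero=True`.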

In [9]:
# ---------- Tokenization & Padding ----------
tokenizer = Token()
tokenizer.fit_on_texts(df['Clean_Review'])
sequences = tokenizer.texts_to_sequences(df['Clean_Review'])

print(sequences[:10])

Total words: 350687
Unique words: 11458
[[3688, 8519, 7937, 10141, 2849, 9962, 5101, 7144, 6946, 1279, 3742, 1705, 4096, 4554, 7144, 536, 3822, 10341, 9624, 643, 0, 6650], [936, 5080, 10144, 848, 6669, 11282, 10969, 9248, 8519, 6634, 10289, 6689, 5726, 10555, 2847, 6720, 11343, 1484, 10289, 3983, 5364, 2632, 11141, 1730, 6989, 936, 2900, 1004, 2882, 8519, 8762, 1668, 6896, 11332, 4029, 6689, 3766, 5843, 590, 8544, 10144, 2353, 4221, 11225, 6518, 2628, 5807], [8519, 7976, 4096, 0, 6017, 5843], [3967, 10144, 5568, 643, 10144, 1729, 5080, 2803, 11104, 7879, 10710, 8982, 2847, 6720, 4505, 10865, 8982, 2849, 9962, 7732, 10865, 10289, 7763, 435, 1729, 5080, 5182, 6897, 4964, 8627, 11367, 11426, 5742, 10144, 1192, 8086, 8982, 10161, 9091, 391, 8627, 6876, 8982, 600, 4096, 6518, 1533, 391, 7188, 391, 10161, 4314, 6163, 10144, 8519, 5479, 2237, 4964, 4671, 6625, 945, 10289, 10197, 4887, 954, 8982, 2847, 6720, 182, 11213, 4029, 4964, 11104, 6896, 6946, 11225, 10144, 5729, 11190, 6625, 6284, 1130

I am going to fix a predetermined maximum length (max_len) for each review based on statistical analysis of the dataset. Computing percentiles of the review lengths (measured in number of tokens/words) gave me a good max_len tradeoff that covers 95% of all reviews, ensuring that most reviews are fully preserved while avoiding excessively long inputs.

In [10]:
# Compute lengths
lengths = [len(seq) for seq in sequences]

# Inspect key statistics
print(f"Max length:       {np.max(lengths)}")
print(f"Mean length:      {np.mean(lengths):.1f}")
print(f"Std. dev:         {np.std(lengths):.1f}")
print(f"95th percentile:  {np.percentile(lengths, 95)}")
print(f"99th percentile:  {np.percentile(lengths, 99)}")

# Set maxlen based on the 95th percentile
max_len = int(np.percentile(lengths, 95))
vocab_size = len(tokenizer.unique_words) + 1  # +1 for padding index

print(f"Using as max_len: {max_len}, and as vocab_size: {vocab_size}")

Max length:       606
Mean length:      25.5
Std. dev:         41.9
95th percentile:  99.0
99th percentile:  206.28999999999905
Using as max_len: 99, and as vocab_size: 11459
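
How a percentile-based `max_len` is obtained can be sketched in plain Python. The helper `percentile` below is a hypothetical re-implementation of linear interpolation between closest ranks (the default method of `np.percentile`), and `toy_lengths` is made-up data rather than the real review lengths.

```python
import math

def percentile(values, q):
    """Percentile q (0-100) with linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical review lengths in tokens (not taken from the dataset)
toy_lengths = [3, 5, 8, 12, 20, 25, 40, 60, 99, 300]

toy_max_len = int(percentile(toy_lengths, 95))

# Fraction of reviews that fit entirely within toy_max_len tokens
covered = sum(length <= toy_max_len for length in toy_lengths) / len(toy_lengths)
print(toy_max_len, covered)
```

Reviews longer than the chosen length are truncated, so the single very long outlier no longer dictates the input size for every other review.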


Now it's time to split my dataset into training, validation and test sets (80% / 10% / 10%) before encoding and padding each split.

In [11]:
train_df, temp_df = train_test_split(
    df, test_size=0.20, random_state=42, shuffle=True
)

val_df, test_df = train_test_split(
    temp_df, test_size=0.50, random_state=42, shuffle=True
)

print(len(train_df), len(val_df), len(test_df))

11017 1377 1378


In [12]:
# --- used in training my model --- #
train_texts = train_df['Clean_Review'].values
train_scores = train_df['Review_Score'].values
train_targets = train_df['Review_Type'].values

# --- used in hyperparameter tuning --- #
val_texts = val_df['Clean_Review'].values
val_scores = val_df['Review_Score'].values
val_targets = val_df['Review_Type'].values

# --- used in final evaluation --- #
test_texts = test_df['Clean_Review'].values
test_scores = test_df['Review_Score'].values
test_targets = test_df['Review_Type'].values

<font color="blue">**CHANGE**</font>

In the exam, I gave the preprocessed embedding dimensions as the input, since I interpreted "input" as "what I actually feed into my selected RNN (specifically, the Bidirectional LSTM)." However, it would probably have been more correct to give the dimensions of the tokenized reviews, since in Keras it is preferable to include the embedding layer inside the model so that it can be trained directly:

**(batch_size, tokenized_reviews) --- Embedding ---> (batch_size, timestep, input_dim) --- RNN MODEL ---> (batch_size, output_BiLSTM)**
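
The first arrow of this flow can be sketched as a plain table lookup. Everything here is illustrative: `toy_vocab_size`, `toy_embedding_dim`, `table` and `embed` are hypothetical names, and the vectors are random instead of trained.

```python
import random

random.seed(0)

toy_vocab_size = 6     # hypothetical: 5 words plus the padding index 0
toy_embedding_dim = 4  # hypothetical; create_model uses 50 by default

# One vector per index; in a real Embedding layer these values are trained
table = [[random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
         for _ in range(toy_vocab_size)]

def embed(batch):
    """Replace every integer index with its vector from the table."""
    return [[table[idx] for idx in sequence] for sequence in batch]

# Two padded sequences of length 5
batch = [[4, 1, 3, 0, 0],
         [2, 5, 0, 0, 0]]

embedded = embed(batch)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (batch_size, timesteps, embedding_dim)
```

The integer batch of shape (2, 5) becomes a batch of shape (2, 5, 4), which is the 3D tensor the recurrent layer consumes.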

In [13]:
# --- training --- #
tokenized_train = tokenizer.texts_to_sequences(train_texts)
train_texts = pad_sequences(
    tokenized_train,
    maxlen=max_len,
    padding='post',
    truncating='post'
)

# --- validation --- #
tokenized_val = tokenizer.texts_to_sequences(val_texts)
val_texts = pad_sequences(
    tokenized_val,
    maxlen=max_len,
    padding='post',
    truncating='post'
)

# --- test --- #
tokenized_test = tokenizer.texts_to_sequences(test_texts)
test_texts = pad_sequences(
    tokenized_test,
    maxlen=max_len,
    padding='post',
    truncating='post'
)


print("First apply tokenization:")
print(tokenized_train[:1])
print("")
print("Then apply padding:")
print(train_texts[:1])

First apply tokenization:
[[4493, 1279, 10969, 8909, 9578]]

Then apply padding:
[[ 4493  1279 10969  8909  9578     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0]]
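
What `padding='post'` and `truncating='post'` do can be sketched without Keras. The helper `pad_post` is a hypothetical stand-in written for illustration, not the library function, and the toy sequences are made up.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then right-pad it with value."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                # truncating='post' drops the tail
        kept += [value] * (maxlen - len(kept))   # padding='post' appends the fill value
        padded.append(kept)
    return padded

# One short and one long toy sequence
toy = [[7, 3], [5, 9, 2, 8, 1, 4]]
print(pad_post(toy, maxlen=4))
```

The short sequence is filled with zeros on the right, while the long one loses its last tokens, so every row ends up with exactly `maxlen` entries.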


# 3-4 INPUT - OUTPUT

In [14]:
def input_reviews():
    return Input(shape=(max_len,), name='text_input')

def input_scores():
    return Input(shape=(1,), name='score_input')


print(f"input model shape: {train_texts.shape}, {train_scores.shape}")

input model shape: (11017, 99), (11017,)


In [15]:
def output_classification():
    return Dense(1, activation='sigmoid', name='output')

print(f"output model shape: {train_targets.shape}")

output model shape: (11017,)


# 5- MODEL CONFIGURATION

This section represents the following parts:

6a- Model composition (composition of layers, regardless of their number or their dimension)

In [16]:
def create_model(LOSS = 'binary_crossentropy',
                 embedding_dim = 50,
                 learning_rate=0.01,
                 hidden_act='relu',
                 dropout_rate=0.2):

    text_input = input_reviews()

    score_input = input_scores()

    embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        mask_zero=True,
        name='embedding'
    )(text_input)

    bilstm_out = Bidirectional(
        LSTM(units=64,
             activation='tanh',
             recurrent_activation='sigmoid',
             kernel_initializer='glorot_uniform',
             return_sequences=False,
             name='bilstm')
        )(embedded)

    # Concatenate branches
    x = Concatenate(axis=-1)([bilstm_out, score_input])

    # Dense layers with dropout and batch normalization
    x = Dense(64, activation=hidden_act, kernel_initializer='he_uniform', kernel_regularizer='l2')(x)
    x = BatchNormalization()(x)
    x = Dropout(dropout_rate)(x)

    x = Dense(32, activation=hidden_act, kernel_initializer='he_uniform', kernel_regularizer='l2')(x)
    x = BatchNormalization()(x)
    x = Dropout(dropout_rate)(x)

    # Output layer for binary classification
    output = output_classification()(x)

    model = Model(inputs=[text_input, score_input], outputs=output)
    model.compile(
        optimizer=Adam(learning_rate=learning_rate),
        loss=LOSS,
        metrics=['accuracy']
    )

    return model

create_model().summary()
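
As a rough cross-check of the architecture, the layer sizes can be worked out by hand. This sketch assumes the defaults of `create_model`, the `vocab_size` printed earlier, the default `concat` merge mode of `Bidirectional`, and the standard LSTM layout of four gates with one bias vector each; the BatchNormalization layers are left out for brevity.

```python
vocab_size = 11459   # printed earlier in the notebook
embedding_dim = 50   # default of create_model
units = 64           # LSTM units per direction

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (units * (embedding_dim + units) + units)
bilstm_params = 2 * lstm_params   # forward and backward copies
bilstm_out = 2 * units            # the two final states are concatenated

# The scalar Review_Score is appended to the BiLSTM output
concat_dim = bilstm_out + 1

dense1_params = concat_dim * 64 + 64
dense2_params = 64 * 32 + 32
output_params = 32 * 1 + 1

print(embedding_params, bilstm_params, concat_dim,
      dense1_params, dense2_params, output_params)
```

Under these assumptions, the embedding table holds by far the largest share of the trainable weights.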

6b- Hyperparameter tuning: here I list the most important parameters of my model to be tuned. Selecting the best combination of these parameters will improve my model's capabilities.

In [17]:
param_grid = {
    'learning_rate': [1e-5, 1e-4, 1e-3, 1e-2],
    'epochs':       [5, 10, 15],
    'batch_size':   [64, 128, 256],
    'dropout_rate': [0.1, 0.2, 0.3],
    'hidden_act':   ['relu', 'leaky_relu']
}
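
The size of this search space can be checked with a short sketch. The dictionary `grid` is a copy of `param_grid`; the seed and the sample size of 5 are assumptions for illustration, so the sampled configurations need not match the ones drawn elsewhere in the notebook.

```python
import itertools
import math
import random

# Copy of param_grid, repeated so that the sketch is self-contained
grid = {
    'learning_rate': [1e-5, 1e-4, 1e-3, 1e-2],
    'epochs':       [5, 10, 15],
    'batch_size':   [64, 128, 256],
    'dropout_rate': [0.1, 0.2, 0.3],
    'hidden_act':   ['relu', 'leaky_relu']
}

# The number of combinations is the product of the list lengths
total = math.prod(len(values) for values in grid.values())
all_configs = list(itertools.product(*grid.values()))

random.seed(42)  # assumed seed, for a repeatable draw in this sketch only
sampled = [dict(zip(grid.keys(), combo)) for combo in random.sample(all_configs, 5)]

coverage = len(sampled) / total
print(total, f"{coverage:.1%}")
```

Sampling 5 configurations covers only about 2.3% of the grid, which keeps the search cheap but explores a small part of it.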

# 6- MODEL EVALUATION

This section focuses on how to assess (and in which setting) the generalization capabilities of the model on unseen data.

<font color="blue">**CHANGE**</font>

To evaluate the model, I implemented a custom random search procedure that selects the best configuration based solely on the (weighted average) F1-score, while also recording accuracy for reference. The evaluation was conducted on a single validation set, without any stratified sampling, and on only 5 sampled configurations, all in order to reduce complexity.

In [18]:
def random_search(grid, samples= 5):
    # Generate all parameters grid space and randomly sample
    parameters_space = list(itertools.product(*grid.values()))
    print(f"Total configurations: {len(parameters_space)}, sampling {samples} random sets \n")
    sampled = random.sample(parameters_space, samples)
    param_dicts = [dict(zip(grid.keys(), combo)) for combo in sampled]

    # best model variables
    best_f1 = 0
    best_acc = 0
    best_params = None

    for counter, params in enumerate(param_dicts, start=1):

        lr = params['learning_rate']
        dp = params['dropout_rate']
        act = params['hidden_act']
        epochs = params['epochs']
        bs = params['batch_size']

        print(f"========== TRAINING {counter}/{samples} ==========")
        print(f"PARAMETERS: lr={lr}, dropout={dp}, hidden_act={act}, epochs={epochs}, batch_size={bs}")
        model = create_model(
            learning_rate=lr,
            dropout_rate=dp,
            hidden_act=act)

        # Fit on train, validate on val
        model.fit(
            x=[train_texts, train_scores],
            y=train_targets,
            validation_data=([val_texts, val_scores], val_targets),
            epochs=epochs,
            batch_size=bs,
            verbose=1
        )

        # Evaluate
        preds = (model.predict([val_texts, val_scores]) > 0.5).astype(int)
        acc = accuracy_score(val_targets, preds)
        f1 = classification_report(val_targets, preds, output_dict=True)['weighted avg']['f1-score']
        print(f" -> Validation F1-score: {f1:.4f}")
        print(f" -> Validation Accuracy: {acc:.4f}\n")

        if f1 > best_f1:
            best_f1 = f1
            best_acc = acc
            best_params = params

    print(f"Best Validation F1-score: {best_f1:.4f}")
    print(f"Best Set Hyperparameters: {best_params}")
    return best_params, best_f1, best_acc


best_params, best_f1, best_acc = random_search(param_grid)

Total configurations: 216, sampling 5 random sets 

PARAMETERS: lr=0.001, dropout=0.1, hidden_act=leaky_relu, epochs=15, batch_size=128
Epoch 1/15
87/87 ━━━━━━━━━━━━━━━━━━━━ 43s 344ms/step - accuracy: 0.9806 - loss: 1.6151 - val_accuracy: 0.6485 - val_loss: 1.1350
Epoch 2/15
87/87 ━━━━━━━━━━━━━━━━━━━━ 30s 344ms/step - accuracy: 0.9899 - loss: 0.5354 - val_accuracy: 0.9659 - val_loss: 0.5572
Epoch 3/15
87/87 ━━━━━━━━━━━━━━━━━━━━ 43s 367ms/step - accuracy: 0.9930 - loss: 0.1856 - val_accuracy: 0.9317 - val_loss: 0.3444
Epoch 4/15
87/87 ━━━━━━━━━━━━━━━━━━━━ 40s 355ms/step - accuracy: 0.9953 - loss: 0.0714 - val_accuracy: 0.8874 - val_loss: 0.2631
Epoch 5/15
87/87 ━━━━━━━━━━━━━━━━━━━━ 40s 339ms/step - accuracy: 0.9986 - loss: 0.0319 - val_accuracy: 0.8998 - val_loss: 0.2023
Epoch 6/15
87/87 ━━━━━━━━━━━━━━━━━━━

Here is the best configuration, with which I'll evaluate my model's performance on the test set:
- `learning_rate`: 1e-05
- `epochs`: 10
- `batch_size`: 64
- `dropout_rate`: 0.3
- `hidden_act`: `leaky_relu`

In [None]:
X1= np.concatenate([train_texts, val_texts])
X2= np.concatenate([train_scores, val_scores])
Y= np.concatenate([train_targets, val_targets])

final_model = create_model(learning_rate=best_params['learning_rate'],
                           dropout_rate=best_params['dropout_rate'],
                           hidden_act=best_params['hidden_act']
                           )

final_model.fit(x= [X1, X2], y= Y,
                epochs=best_params['epochs'],
                batch_size=best_params['batch_size'],
                verbose=1)

Epoch 1/10
194/194 ━━━━━━━━━━━━━━━━━━━━ 40s 165ms/step - accuracy: 0.4379 - loss: 2.8625
Epoch 2/10
194/194 ━━━━━━━━━━━━━━━━━━━━ 31s 160ms/step - accuracy: 0.5537 - loss: 2.6728
Epoch 3/10
194/194 ━━━━━━━━━━━━━━━━━━━━ 42s 167ms/step - accuracy: 0.6683 - loss: 2.5015
Epoch 4/10
194/194 ━━━━━━━━━━━━━━━━━━━━ 40s 164ms/step - accuracy: 0.7476 - loss: 2.3624
Epoch 5/10
194/194 ━━━━━━━━━━━━━━━━━━━━ 40s 161ms/step - accuracy: 0.8368 - loss: 2.2243
Epoch 6/10
194/194 ━━━━━━━━━━━━━━━━━━━━ 42s 164ms/step - accuracy: 0.8834 - loss: 2.1180
Epoch 7/10
194/194 ━━━━━━━━━━━━━━━━━━━━ 45s 184ms/step - accuracy: 0.9150 - loss: 2.0314
Epoch 8/10
  8/194 ━━━━━━━━━━━━━━━━━━━━ 48s 262ms/step - accuracy: 0.9245 - loss: 1.9895

# PLOTTING RESULTS

- Final Test F1-score: 0.9906
- Final Test Accuracy: 0.9906

Confusion matrix:
```
[[687  14]
 [  5 672]]
```
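
How accuracy and the weighted F1-score follow from a 2x2 confusion matrix can be sketched in plain Python. The matrix `toy_cm` is hypothetical and is not the one reported above; rows are true labels and columns are predicted labels, as in scikit-learn, and `f1_for_class` is a helper name introduced for illustration.

```python
# Hypothetical confusion matrix: rows = true labels, columns = predicted labels
toy_cm = [[90, 10],
          [5, 95]]

total = sum(sum(row) for row in toy_cm)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = sum(toy_cm[k][k] for k in range(len(toy_cm))) / total

def f1_for_class(cm, k):
    """F1 of class k from its true positives, false positives and false negatives."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp
    fn = sum(cm[k]) - tp
    return 2 * tp / (2 * tp + fp + fn)

# Weighted average: each class F1 is weighted by its number of true samples
supports = [sum(row) for row in toy_cm]
weighted_f1 = sum(f1_for_class(toy_cm, k) * supports[k]
                  for k in range(len(toy_cm))) / total

print(f"accuracy={accuracy:.4f}, weighted F1={weighted_f1:.4f}")
```

When the two classes are roughly balanced and the errors are few, the weighted F1-score and the accuracy end up very close to each other.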

In [None]:
test_preds = (final_model.predict([test_texts, test_scores]) > 0.5).astype(int)
test_acc = accuracy_score(test_targets, test_preds)
test_f1 = classification_report(test_targets, test_preds, output_dict=True)['weighted avg']['f1-score']

print(f"Final Test F1-score: {test_f1:.4f}")
print(f"Final Test Accuracy: {test_acc:.4f}")

In [None]:
cm = confusion_matrix(test_targets, test_preds)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Bad Review', 'Good Review'],
            yticklabels=['Bad Review', 'Good Review'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for Review Type Prediction')
plt.show()