<a href="https://colab.research.google.com/drive/1JC3UCnJe6MmMH7Hib4JfCBC6WnFQhMNV?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1- INTRODUCTION

This project is based on the binary classification of hotel reviews as Good or Bad. For my project I decided to implement an RNN-based model capable of extracting relevant features from text in order to perform this task.

<font color="blue">**DISCLAIMER**</font>

Note to the professor:

- If you need help understanding what I wrote during the exam, email me at *lorenzo.brusati01@universitadipavia.it*.

- I marked the differences from the exam in the following code like this: <font color="blue">**CHANGE**</font>.

- This notebook takes around 10 minutes to run (if you also run the hyperparameter optimization).

Unfortunately, I observed that selecting a Bidirectional LSTM for the model design, as opposed to a standard LSTM or even a GRU, significantly increased the model's evaluation time, despite maintaining good performance metrics. While the model converged in approximately 2 minutes on my personal laptop, executing the same notebook in Google Colab required around 30 minutes. So I decided to drop half of the dataset to speed up all the computations.

I apologize for any inconvenience this may have caused.

In [23]:
import numpy as np
import pandas as pd
import tensorflow as tf
import keras
import matplotlib.pyplot as plt
import seaborn as sns
import random
import string
import itertools

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder
from keras.utils import pad_sequences

# Import from the public Keras namespace rather than keras.api / keras.src
from keras.models import Model
from keras.layers import (Input, Dense, LSTM, Embedding, Bidirectional,
                          Dropout, BatchNormalization, Concatenate)

from keras.optimizers import Adam
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

tf.compat.v1.enable_eager_execution()

print( keras.__version__ )
print( tf.__version__)

# Set seeds for reproducibility
def set_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

set_seeds(42)

3.8.0
2.18.0


In [24]:
url = "https://raw.githubusercontent.com/brusati04/Hotel_Reviews/refs/heads/main/input_data.csv"
df = pd.read_csv(url)

In [25]:
total_samples = len(df)
half_size = total_samples // 2
samples_per_class = half_size // 2

df = df.groupby('Review_Type', group_keys=False).apply(
    lambda x: x.sample(n=samples_per_class, random_state=42)
).reset_index(drop=True)


print(df['Review_Type'].value_counts())

Review_Type
Bad_review     3443
Good_review    3443
Name: count, dtype: int64



In [26]:
df.head()

| | Hotel_Address | Review_Date | Average_Score | Hotel_Name | Reviewer_Nationality | Hotel_number_reviews | Reviewer_number_reviews | Review_Score | Review | Review_Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aletta Aletta Jacobslaan 7 Slotervaart 1066 BP... | 8/28/2016 | 8.4 | Corendon Vitality Hotel Amsterdam | Iran | 4410 | 5 | 5.8 | Breakfast was same all days House keeping did... | Bad_review |
| 1 | 53 53 Upper Street Islington London N1 0UY UK | 12/8/2015 | 8.6 | Hilton London Angel Islington | United Kingdom | 1462 | 1 | 6.5 | Our room was very close to the lifts which we... | Bad_review |
| 2 | 199 199 206 High Holborn Camden London WC1V 7B... | 7/3/2016 | 9.2 | The Hoxton Holborn | Netherlands | 1740 | 1 | 5.8 | Room was a shoebox Had to crawl over bed to m... | Bad_review |
| 3 | Via Via Senato 5 Milan City Center 20121 Milan... | 1/16/2017 | 8.8 | Baglioni Hotel Carlton The Leading Hotels of t... | Monaco | 775 | 2 | 5.4 | restaurant in hotel very expensive Staff in r... | Bad_review |
| 4 | 7 7 Western Gateway Royal Victoria Dock Newham... | 4/21/2016 | 8.5 | Novotel London Excel | United Kingdom | 1158 | 1 | 6.3 | in room 501 next to a service door which I no... | Bad_review |


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6886 entries, 0 to 6885
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Hotel_Address            6886 non-null   object 
 1   Review_Date              6886 non-null   object 
 2   Average_Score            6886 non-null   float64
 3   Hotel_Name               6886 non-null   object 
 4   Reviewer_Nationality     6886 non-null   object 
 5   Hotel_number_reviews     6886 non-null   int64  
 6   Reviewer_number_reviews  6886 non-null   int64  
 7   Review_Score             6886 non-null   float64
 8   Review                   6886 non-null   object 
 9   Review_Type              6886 non-null   object 
dtypes: float64(2), int64(2), object(6)
memory usage: 538.1+ KB


# 2- PREPROCESSING

In the exam, this section was divided into two subparts:
* How to (if) preprocess input data and which data would you retain/use;
* Which is the input of the model, and how is it represented;

As I said in my exam, I preprocess the data in the following way: I drop all columns except 'Review', 'Review_Score' and 'Review_Type'.

In [28]:
# .copy() avoids pandas' SettingWithCopyWarning when new columns are assigned later
df = df[['Review', 'Review_Score', 'Review_Type']].copy()

'Review_Type': I encode it into binary integer values, 0 for Bad_review and 1 for Good_review, so the model predicts either 0 or 1.

In [29]:
label_enc = LabelEncoder()
df['Review_Type'] = label_enc.fit_transform(df['Review_Type'])

'Review_Score': as I said in the exam, I keep it as it is.

<font color="blue">**CHANGE**</font>

'Review': in my exam I did not mention a common preprocessing procedure for textual data: I clean the whole text, removing punctuation and lowercasing words, in order to discard irrelevant information. This also helped me reduce the size of each review, making the model's computations faster.

In [30]:
def clean_text(text):
  text = text.replace('--', ' ')
  # split into tokens by white space
  words = text.split()
  # remove punctuation from each token
  table = str.maketrans('', '', string.punctuation)
  words = [w.translate(table) for w in words]
  # remove remaining tokens that are not alphabetic
  words = [word for word in words if word.isalpha()]
  # make lower case
  words = [word.lower() for word in words]
  # build up tokens again
  return " ".join(words)

df['Clean_Review'] = df['Review'].apply(clean_text)

df.head()

| | Review | Review_Score | Review_Type | Clean_Review |
|---|---|---|---|---|
| 0 | Breakfast was same all days House keeping did... | 5.8 | 0 | breakfast was same all days house keeping did ... |
| 1 | Our room was very close to the lifts which we... | 6.5 | 0 | our room was very close to the lifts which wer... |
| 2 | Room was a shoebox Had to crawl over bed to m... | 5.8 | 0 | room was a shoebox had to crawl over bed to mo... |
| 3 | restaurant in hotel very expensive Staff in r... | 5.4 | 0 | restaurant in hotel very expensive staff in re... |
| 4 | in room 501 next to a service door which I no... | 6.3 | 0 | in room next to a service door which i noticed... |
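
To make the effect of the cleaning step concrete, here is a small standalone sketch that applies the same rules as `clean_text` to a made-up sentence (the sample string is my own, not taken from the dataset). Purely numeric tokens are dropped, which is why `501` disappears in row 4 above.

```python
import string

def clean_text(text):
    # Same rules as the notebook's clean_text, written compactly
    text = text.replace('--', ' ')
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in text.split()]   # strip punctuation
    words = [w.lower() for w in words if w.isalpha()]     # drop non-alphabetic tokens, lowercase
    return " ".join(words)

sample = "Room 501 -- next to a service-door, VERY noisy!"  # hypothetical review
cleaned = clean_text(sample)
print(cleaned)  # room next to a servicedoor very noisy
```

Note that hyphenated words are merged into one token (`service-door` becomes `servicedoor`), since the hyphen is removed without inserting a space.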


<font color="blue">**CHANGE**</font>

I did not mention in my exam (and I think it's important) how to correctly **preprocess text**. First, apply **tokenization**: this consists in separating each review into its fundamental words, in order to create a dictionary of them. In this way, we can assign a unique integer (an index) to each word in the vocabulary.

Once the text is tokenized and encoded into sequences of integers, the next step is to **pad** these sequences so that they all have the same length. This is necessary because the sequences are stacked into a single fixed-shape tensor before being fed to the RNN, as I mentioned in the exam.

After that, we define the **embedding layer**: it learns a dense vector representation (i.e., a low-dimensional continuous representation) for each word during training, based on the context in which the word appears. The embedding layer transforms each word index into a trainable dense vector, which comes to capture semantic meaning as training progresses. The embedding is defined inside my function `create_model` and trained together with the rest of the model, since we are not allowed to use pretrained embedding models (e.g. GloVe or BERT).

Below I provide my own implementation of tokenization through a class `Token()`; I then apply padding before feeding the data into the model.

In [31]:
class Token():
    def __init__(self, statistic=True):
        self.unique_words = None
        self.word_index = None
        self.statistic = statistic

    def fit_on_texts(self, texts):
        # Flatten all reviews into a single list of words
        flat_words = [word for text in texts for word in text.split()]

        # Sorted unique words define the vocabulary; each word gets an integer id
        self.unique_words = np.unique(flat_words)
        self.word_index = {word: idx for idx, word in enumerate(self.unique_words)}
        if self.statistic:
            print(f"Total words: {len(flat_words)}")
            print(f"Unique words: {len(self.unique_words)}")

    def encode_words(self, word_list):
        return [self.word_index[word] for word in word_list if word in self.word_index]

    def texts_to_sequences(self, texts):
        return [self.encode_words(text.split()) for text in texts]

In [32]:
tokenizer = Token()
tokenizer.fit_on_texts(df['Clean_Review'])
sequences = tokenizer.texts_to_sequences(df['Clean_Review'])

print(sequences[:10])

Total words: 179124
Unique words: 8163
[[920, 7902, 6164, 211, 1878, 3517, 3907, 2018, 4807, 1337, 7227, 6084, 281, 7340, 7808, 7959, 281, 2528, 2018, 4807, 4054, 881, 4890, 7917, 3635, 6084, 4171, 281, 6778], [4986, 6084, 7902, 7808, 1375, 7329, 7227, 4099, 7992, 7964, 7808, 4785, 2528, 8060, 7728, 4990, 4890, 4986, 6544, 7917, 3635, 6084, 5309, 3162, 77, 7227, 6084, 7329, 7227, 1681, 6434, 3172, 4171, 7227, 6084, 7902, 1337, 281, 4890, 0, 2711, 6523, 6778, 3390, 281, 3010], [6084, 7902, 0, 6430, 3260, 7329, 1755, 5007, 682, 7329, 4614, 3021, 4923, 6478, 7329, 7227, 4981, 4807, 3172, 3343, 6835, 7243, 5007, 7313, 281, 3343, 958, 3235, 7329, 7227, 3504, 3172, 485], [5966, 3635, 3504, 7808, 2621, 6778, 3635, 5966, 7808, 7531, 6351, 4285, 5966, 4171, 3829, 5190, 2937, 6435], [3635, 6084, 4745, 7329, 0, 6345, 2179, 7992, 3559, 4821, 4431, 4890, 6778, 3168, 7287, 7986, 3559, 427, 3559, 1710, 4807, 3118, 7329, 6544, 2276, 7329, 7227, 4780, 1455, 3021, 7227, 6345, 2179, 397, 3559, 243, 4807,
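
One detail worth double-checking: `Token` numbers words starting from 0, while `pad_sequences` also pads with 0 and the `Embedding` layer in `create_model` uses `mask_zero=True`. The printed sequences do contain the index 0, so the alphabetically first word shares its id with padding and would be masked. A minimal sketch of one possible fix, assuming we simply start numbering at 1 (the `build_word_index` helper is hypothetical, not part of the notebook):

```python
def build_word_index(texts, start=1):
    # Hypothetical helper: map each unique word to an integer id,
    # starting at `start` so that 0 can stay reserved for padding.
    unique_words = sorted({word for text in texts for word in text.split()})
    return {word: idx for idx, word in enumerate(unique_words, start=start)}

texts = ["a room with a view", "room was small"]  # toy corpus

zero_based = build_word_index(texts, start=0)  # same numbering as Token
one_based = build_word_index(texts, start=1)   # 0 is never assigned to a word

print(zero_based["a"], one_based["a"])  # 0 1
print(0 in one_based.values())          # False
```

With 1-based ids, `vocab_size = len(unique_words) + 1` is exactly the `input_dim` the embedding needs.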

I am going to fix a predetermined maximum length (max_len) for each review based on statistical analysis of the dataset. Computing percentiles of the review lengths (measured in number of tokens/words) gave me a good max_len tradeoff that covers 95% of all reviews, ensuring that most reviews are fully preserved while avoiding excessively long inputs.

In [33]:
# Compute lengths
lengths = [len(seq) for seq in sequences]

# Inspect key statistics
print(f"Max length:       {np.max(lengths)}")
print(f"Mean length:      {np.mean(lengths):.1f}")
print(f"Std. dev:         {np.std(lengths):.1f}")
print(f"95th percentile:  {np.percentile(lengths, 95)}")
print(f"99th percentile:  {np.percentile(lengths, 99)}")

# Set maxlen based on the 95th percentile
max_len = int(np.percentile(lengths, 95))
vocab_size = len(tokenizer.unique_words) + 1  # +1 for padding index

print(f"Using as max_len: {max_len}, and as vocab_size: {vocab_size}")

Max length:       606
Mean length:      26.0
Std. dev:         43.1
95th percentile:  101.0
99th percentile:  213.14999999999964
Using as max_len: 101, and as vocab_size: 8164
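
For reference, the percentile rule can be reproduced without NumPy. The sketch below uses linear interpolation between neighbouring ranks (NumPy's default method) on a toy list of lengths, and also reports the fraction of sequences that would fit without truncation. The helper names and the toy data are my own.

```python
import math

def percentile(values, q):
    # Linear-interpolation percentile (same idea as NumPy's default method)
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def coverage(values, max_len):
    # Fraction of sequences that fit entirely within max_len tokens
    return sum(1 for v in values if v <= max_len) / len(values)

toy_lengths = [3, 5, 8, 10, 12, 15, 20, 25, 40, 90]  # made-up review lengths
toy_max_len = int(percentile(toy_lengths, 95))

print(toy_max_len, coverage(toy_lengths, toy_max_len))  # 67 0.9
```

On such a tiny list the coverage is coarser than 95%; on the real dataset the same rule gives `max_len = 101`.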


Now it's time to split my dataset into training, validation and test sets, before converting each split into padded sequences (the vocabulary itself was already fitted on the full dataset above).

In [34]:
train_df, temp_df = train_test_split(
    df, test_size=0.20, random_state=42, shuffle=True
)

val_df, test_df = train_test_split(
    temp_df, test_size=0.50, random_state=42, shuffle=True
)

print(len(train_df), len(val_df), len(test_df))

5508 689 689
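
The split above is purely random, so the class proportions in each subset can drift slightly from 50/50. One option, not used in this notebook, is a stratified split, which `train_test_split` supports through its `stratify` argument. The idea is sketched below in plain Python on toy labels; `stratified_split` is a hypothetical helper written only for illustration.

```python
import random

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    # Hypothetical helper: returns index lists (train, val, test) such that
    # each class is divided with (approximately) the same fractions.
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

labels = [0] * 50 + [1] * 50  # toy balanced labels
train_idx, val_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(val_idx), len(test_idx))  # 80 10 10
print(sum(labels[i] for i in val_idx))              # 5 positives out of 10
```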


In [35]:
# --- used in training my model --- #
train_texts = train_df['Clean_Review'].values
train_scores = train_df['Review_Score'].values
train_targets = train_df['Review_Type'].values

# --- used in hyperparameter tuning --- #
val_texts = val_df['Clean_Review'].values
val_scores = val_df['Review_Score'].values
val_targets = val_df['Review_Type'].values

# --- used in final evaluation --- #
test_texts = test_df['Clean_Review'].values
test_scores = test_df['Review_Score'].values
test_targets = test_df['Review_Type'].values

<font color="blue">**CHANGE**</font>

In the exam, I mentioned the preprocessed embedding dimensions as the input, since I understood "input" as "what I actually feed into my selected RNN (specifically, the Bidirectional LSTM)."

However, it would probably have been more correct to indicate as input the *dimensions of the tokenized reviews*, which are processed directly by the embedding layer inside the model. It is also preferable to include the embedding layer inside the model, so that its weights are trained together with the rest of the network:

**(batch_size, tokenized_reviews) --- Embedding ---> (batch_size, timestep, input_dim) --- RNN MODEL ---> (batch_size, output_BiLSTM)**
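
The first arrow of this flow can be illustrated with a plain-Python embedding lookup. This is only a sketch with toy sizes (not the notebook's `max_len`, `vocab_size` or `embedding_dim`), and the random matrix stands in for weights that would normally be learned.

```python
import random

max_len, embedding_dim, vocab_size = 5, 3, 10  # toy sizes, not the notebook's

# Embedding matrix: one vector per token id (row 0 would correspond to padding)
rng = random.Random(0)
embedding = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
             for _ in range(vocab_size)]

# A batch of 2 already-padded reviews: shape (batch_size, max_len)
batch = [[4, 7, 2, 0, 0],
         [9, 1, 0, 0, 0]]

# Lookup: every id is replaced by its vector -> (batch_size, max_len, embedding_dim)
embedded = [[embedding[token] for token in review] for review in batch]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 5, 3)
```

In the notebook's model the BiLSTM has 64 units per direction and `return_sequences=False`, so the last step of the flow yields a tensor of shape `(batch_size, 128)`.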

In [36]:
# --- training --- #
tokenized_train = tokenizer.texts_to_sequences(train_texts)
train_texts = pad_sequences(
    tokenized_train,
    maxlen=max_len,
    padding='post',
    truncating='post'
)

# --- validation --- #
tokenized_val = tokenizer.texts_to_sequences(val_texts)
val_texts = pad_sequences(
    tokenized_val,
    maxlen=max_len,
    padding='post',
    truncating='post'
)

# --- test --- #
tokenized_test = tokenizer.texts_to_sequences(test_texts)
test_texts = pad_sequences(
    tokenized_test,
    maxlen=max_len,
    padding='post',
    truncating='post'
)


print("First apply tokenization:")
print(tokenized_train[:1])
print("")
print("Then apply paddding:")
print(train_texts[:1])

First apply tokenization:
[[7513, 4807, 8085, 177, 1520, 4807, 8085, 4771, 2665, 3635, 642, 4171, 920]]

Then apply paddding:
[[7513 4807 8085  177 1520 4807 8085 4771 2665 3635  642 4171  920    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0]]
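
For clarity, this is roughly what `padding='post'` and `truncating='post'` do to a single sequence. The `pad_post` function is a hypothetical plain-Python stand-in, not the Keras implementation.

```python
def pad_post(seq, max_len, value=0):
    # Keep the first max_len tokens, then fill the tail with `value`
    trimmed = seq[:max_len]
    return trimmed + [value] * (max_len - len(trimmed))

short = pad_post([7, 3, 9], 5)            # shorter than max_len: zeros appended
long = pad_post([1, 2, 3, 4, 5, 6], 4)    # longer than max_len: tail cut off

print(short)  # [7, 3, 9, 0, 0]
print(long)   # [1, 2, 3, 4]
```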


# 3-4 INPUT - OUTPUT

I decided to highlight the input and output of the model to clearly illustrate the features and their respective dimensions.

In [37]:
def input_reviews():
    return Input(shape=(max_len,), name='text_input')

def input_scores():
    return Input(shape=(1,), name='score_input')


print(f"input model shape: {train_texts.shape}, {train_scores.shape}")

input model shape: (5508, 101), (5508,)


As I already said in the exam, a single output unit with a sigmoid activation is a common setup for two-class problems: its output can be read as the predicted probability of class 1 (Good review).
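
As a small worked example (toy numbers, not taken from the model), the sketch below shows how a raw score is squashed into a probability by the sigmoid, how the 0.5 threshold turns it into a class, and what binary cross-entropy loss that prediction would receive.

```python
import math

def sigmoid(z):
    # Maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    # Loss for one sample; p is clipped to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

logit = 2.0                          # hypothetical pre-activation value
p_good = sigmoid(logit)              # probability of class 1
predicted_class = int(p_good > 0.5)  # same threshold used later in the notebook
loss = binary_crossentropy(1, p_good)

print(round(p_good, 4), predicted_class, round(loss, 4))  # 0.8808 1 0.1269
```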



In [38]:
def output_classification():
    return Dense(1, activation='sigmoid', name='output')

print(f"output model shape: {train_targets.shape}")

output model shape: (5508,)


# 5- MODEL CONFIGURATION

This section covers the following parts:

6a- Model composition (composition of layers, regardless of their number or their dimension). This model is designed for binary classification, using both the textual review and the associated numerical score as input features. The architecture is modular and parameterized to allow flexibility in tuning key hyperparameters. As mentioned in my exam, I decided to add Dropout and BatchNormalization layers.

In [39]:
def create_model(LOSS = 'binary_crossentropy',
                 embedding_dim = 32,
                 learning_rate=0.01,
                 hidden_act='relu',
                 dropout_rate=0.2):
    # 2 inputs: text and score
    text_input = input_reviews()

    score_input = input_scores()

    embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        mask_zero=True,
        name='embedding'
    )(text_input)

    bilstm_out = Bidirectional(
        LSTM(units=64,
             activation='tanh',
             recurrent_activation='sigmoid',
             kernel_initializer='glorot_uniform',
             return_sequences=False,
             name='bilstm')
        )(embedded)

    # Concatenate branches
    x = Concatenate(axis=-1)([bilstm_out, score_input])

    # Dense layers with dropout and batch normalization
    x = Dense(64, activation=hidden_act, kernel_initializer='he_uniform', kernel_regularizer='l2')(x)
    x = BatchNormalization()(x)
    x = Dropout(dropout_rate)(x)

    x = Dense(32, activation=hidden_act, kernel_initializer='he_uniform', kernel_regularizer='l2')(x)
    x = BatchNormalization()(x)
    x = Dropout(dropout_rate)(x)

    # Output layer for binary classification
    output = output_classification()(x)

    model = Model(inputs=[text_input, score_input], outputs=output)
    model.compile(
        optimizer=Adam(learning_rate=learning_rate),
        loss=LOSS,
        metrics=['accuracy']
    )

    return model

create_model().summary()

6b- Hyperparameter tuning: here I provide a dictionary containing the most important parameters of my model to be tuned. Selecting the best combination of these parameters should improve the model's performance.

In [40]:
param_grid = {
    'learning_rate': [1e-5, 1e-4, 1e-3],
    'epochs':       [5, 10],
    'batch_size':   [64, 128, 256],
    'dropout_rate': [0.1, 0.2, 0.3],
    'hidden_act':   ['relu', 'leaky_relu']
}
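
To give an idea of the size of this search space, the grid can be enumerated with `itertools.product`. The sketch below copies the grid into a standalone dictionary and draws two configurations at random; the seed and variable names are only illustrative.

```python
import itertools
import random

grid = {
    'learning_rate': [1e-5, 1e-4, 1e-3],
    'epochs':       [5, 10],
    'batch_size':   [64, 128, 256],
    'dropout_rate': [0.1, 0.2, 0.3],
    'hidden_act':   ['relu', 'leaky_relu'],
}

# Cartesian product of all value lists: 3 * 2 * 3 * 3 * 2 combinations
all_combos = list(itertools.product(*grid.values()))
print(len(all_combos))  # 108

# Random search only trains a small random subset of them
rng = random.Random(42)
sampled = [dict(zip(grid.keys(), combo)) for combo in rng.sample(all_combos, 2)]
print(sampled)
```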

# 6- MODEL EVALUATION

This section focuses on how to assess (and in which setting) the generalization capabilities of the model on unseen data.

<font color="blue">**CHANGE**</font>

To evaluate the model, I implemented a custom random search procedure, selecting the best configuration based solely on the F1-score, while also recording accuracy for reference. The evaluation was conducted on a single validation set, without stratified sampling and with only 2 sampled configurations, all in order to reduce the computational cost.

In [None]:
def random_search(grid, samples= 2):
    # Generate all parameters grid space and randomly sample
    parameters_space = list(itertools.product(*grid.values()))
    print(f"Total configurations: {len(parameters_space)}, sampling {samples} random sets \n")
    sampled = random.sample(parameters_space, samples)
    param_dicts = [dict(zip(grid.keys(), combo)) for combo in sampled]

    # best model variables
    best_f1 = 0
    best_acc = 0
    best_params = None

    counter = 0
    for params in param_dicts:
        counter += 1

        lr = params['learning_rate']
        dp = params['dropout_rate']
        act = params['hidden_act']
        epochs = params['epochs']
        bs = params['batch_size']

        print(f"========== TRAINING {counter}/{samples} ==========")
        print(f"PARAMETERS: lr={lr}, dropout={dp}, hidden_act={act}, epochs={epochs}, batch_size={bs}")
        model = create_model(
            learning_rate=lr,
            dropout_rate=dp,
            hidden_act=act)

        # Fit on train, validate on val
        model.fit(
            x=[train_texts, train_scores],
            y=train_targets,
            validation_data=([val_texts, val_scores], val_targets),
            epochs=epochs,
            batch_size=bs,
            verbose=1
        )

        # Evaluate
        preds = (model.predict([val_texts, val_scores]) > 0.5).astype(int)
        acc = accuracy_score(val_targets, preds)
        f1 = classification_report(val_targets, preds, output_dict=True)['weighted avg']['f1-score']
        print(f" -> Validation F1-score: {f1:.4f}")
        print(f" -> Validation Accuracy: {acc:.4f}\n")

        if f1 > best_f1:
            best_f1 = f1
            best_acc = acc
            best_params = params

    print(f"Best Validation F1-score: {best_f1:.4f}")
    print(f"Best Set Hyperparameters: {best_params}")
    return best_params, best_f1, best_acc


best_params, best_f1, best_acc = random_search(param_grid)

Total configurations: 108, sampling 2 random sets 

PARAMETERS: lr=0.001, dropout=0.2, hidden_act=leaky_relu, epochs=5, batch_size=64
Epoch 1/5
67/87 ━━━━━━━━━━━━━━━━━━━━ 3s 177ms/step - accuracy: 0.9045 - loss: 1.8906

Here is the best configuration found, which I use to train the final model and evaluate its performance on the test set:
- `learning_rate`: 0.0001
- `epochs`: 10
- `batch_size`: 64
- `dropout`: 0.1
- `hidden_act`: `relu`

In [None]:
# Merge training and validation data to train the final model
X1 = np.concatenate([train_texts, val_texts])
X2 = np.concatenate([train_scores, val_scores])
Y = np.concatenate([train_targets, val_targets])

final_model = create_model(learning_rate=best_params['learning_rate'],
                           dropout_rate=best_params['dropout_rate'],
                           hidden_act=best_params['hidden_act']
                           )

final_model.fit(x= [X1, X2], y= Y,
                epochs=best_params['epochs'],
                batch_size=best_params['batch_size'],
                verbose=1)

# PLOTTING RESULTS

- Test F1-score: 0.9898
- Final Test Accuracy: 0.9898

Confusion matrix:
```
[[355 0]
 [7 327]]
```
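
As a sanity check, both reported metrics can be recomputed by hand from the confusion matrix above. The sketch assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and computes the support-weighted F1, which is what the `'weighted avg'` entry of `classification_report` reports.

```python
cm = [[355, 0],
      [7, 327]]  # rows = true class, columns = predicted class

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def f1_for_class(cm, k):
    # F1 = 2*TP / (2*TP + FP + FN) for class k
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm)) if i != k)
    fn = sum(cm[k][j] for j in range(len(cm)) if j != k)
    return 2 * tp / (2 * tp + fp + fn)

# Weight each class's F1 by its number of true samples (support)
supports = [sum(row) for row in cm]
weighted_f1 = sum(f1_for_class(cm, k) * supports[k] for k in range(len(cm))) / total

print(f"{accuracy:.4f} {weighted_f1:.4f}")  # 0.9898 0.9898
```

All 7 errors are good reviews predicted as bad; no bad review was classified as good.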

In [None]:
test_preds = (final_model.predict([test_texts, test_scores]) > 0.5).astype(int)
test_acc = accuracy_score(test_targets, test_preds)
test_f1 = classification_report(test_targets, test_preds, output_dict=True)['weighted avg']['f1-score']

print(f"Final Test F1-score: {test_f1:.4f}")
print(f"Final Test Accuracy: {test_acc:.4f}")

In [None]:
cm = confusion_matrix(test_targets, test_preds)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Bad Review', 'Good Review'],
            yticklabels=['Bad Review', 'Good Review'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for Review Type Prediction')
plt.show()