<a href="https://colab.research.google.com/github/dyarparvar/NLP/blob/main/Sentiment_Analysis_and_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis and text classification

In this activity, you will build a sentiment analysis model using Python and a data set of customer reviews. You will preprocess the data and fine-tune, evaluate, and test the model.


## Objective
Your objective is to analyse how different parameter choices affect the performance of a sentiment classifier.



## Activity guidance
1. Install the necessary packages that will be useful in this activity
2. Load the SST-5 data set from Hugging Face (https://huggingface.co/datasets/SetFit/sst5)

3. Create dataframes of the train and test splits
4. Split the train dataframe into train and validation in the ratio of 8:2
5. Preprocess the dataset, set the maximum size to 200, vocabulary size to 30000
6. During tokenisation, mark out of vocabulary words as "[OOV]"
7. Pad your sequences with special tokens
8. Train a sentiment classifier on the dataset and compare different models for text classification (7 epochs)
9. Comment on the performance of all the models


## ✅ 0-2. Setup & Data

In [None]:
!pip install datasets



In [None]:
import pandas as pd
import numpy as np

import random
import os

from collections import Counter


import string
import re

from datasets import load_dataset

from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
from keras.utils import to_categorical

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN, LSTM, GRU, Bidirectional, SpatialDropout1D, Dropout
# from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall, F1Score, AUC

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

### Reproducibility

In [None]:
seed = 42

In [None]:
def set_seed(seed=42):
    # Python randomness
    random.seed(seed)
    # Python hash randomness
    os.environ["PYTHONHASHSEED"] = str(seed)

    # NumPy randomness
    np.random.seed(seed)

    # TensorFlow randomness
    tf.random.set_seed(seed)
    tf.config.experimental.enable_op_determinism()

In [None]:
set_seed(seed)

### Data

In [None]:
dataset = load_dataset("SetFit/sst5")


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 8544
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1101
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2210
    })
})

## ✅ 3-4. Train, Validation, Test

In [None]:
data = dataset["train"]
data.to_pandas()

| | text | label | label_text |
|---|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 4 | very positive |
| 1 | apparently reassembled from the cutting-room f... | 1 | negative |
| 2 | they presume their audience wo n't sit still f... | 1 | negative |
| 3 | the entire movie is filled with deja vu moments . | 2 | neutral |
| 4 | this is a visually stunning rumination on love... | 3 | positive |
| ... | ... | ... | ... |
| 8539 | take care is nicely performed by a quintet of ... | 1 | negative |
| 8540 | the script covers huge , heavy topics in a bla... | 1 | negative |
| 8541 | a seriously bad film with seriously warped log... | 1 | negative |
| 8542 | it 's not too racy and it 's not too offensive . | 2 | neutral |


In [None]:
test_data = dataset["test"]
test_data.to_pandas()

| | text | label | label_text |
|---|---|---|---|
| 0 | no movement , no yuks , not much of anything . | 1 | negative |
| 1 | a gob of drivel so sickly sweet , even the eag... | 0 | very negative |
| 2 | ` how many more voyages can this limping but d... | 2 | neutral |
| 3 | so relentlessly wholesome it made me want to s... | 2 | neutral |
| 4 | gangs of new york is an unapologetic mess , wh... | 0 | very negative |
| ... | ... | ... | ... |
| 2205 | the problem with concept films is that if the ... | 1 | negative |
| 2206 | safe conduct , however ambitious and well-inte... | 1 | negative |
| 2207 | a film made with as little wit , interest , an... | 1 | negative |
| 2208 | to enjoy this movie 's sharp dialogue and deli... | 2 | neutral |


In [None]:
test_txt = test_data["text"]
test_label = test_data["label"]

Split the train dataframe into train and validation in the ratio of 8:2



In [None]:
# Convert to dataframe and split
data = data.to_pandas()
train_data, val_data = train_test_split(data, test_size=0.2, stratify=data["label"], random_state=seed)

In [None]:
train_txt = train_data["text"]
train_label = train_data["label"]

In [None]:
train_txt.shape

(6835,)

In [None]:
train_label.shape

(6835,)

In [None]:
val_txt = val_data["text"]
val_label = val_data["label"]

In [None]:
val_txt.shape

(1709,)

In [None]:
val_label.shape

(1709,)
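
The shapes above (6835 and 1709) follow from the 8:2 ratio applied to the 8544 rows of the original train split. As a quick arithmetic check, and assuming that `train_test_split` rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training set, the sizes can be reproduced like this (the variable names are illustrative):

```python
import math

n_samples = 8544   # rows in the original train split, as reported above
test_size = 0.2    # validation share used in the split

# Assumed rounding rule: the held-out part is rounded up
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```

The result should match the shapes printed by the cells above.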

In [None]:
# Check labels
print(f"Unique train labels: {np.unique(train_label)}")
print(f"Unique validation labels: {np.unique(val_label)}")
print(f"Unique test labels: {np.unique(test_label)}")
print(f"Unique label texts: {np.unique(data['label_text'])}")

Unique train labels: [0 1 2 3 4]
Unique validation labels: [0 1 2 3 4]
Unique test labels: [0 1 2 3 4]
Unique label texts: ['negative' 'neutral' 'positive' 'very negative' 'very positive']


In [None]:
n_classes = data["label_text"].nunique()
n_classes

5

## ✅ 5-6. Preprocessing

5. Preprocess the dataset,  set the maximum size to 200, vocabulary size to 30000
6. During tokenization, mark out of vocabulary words as "<OOV>"
7. Pad your sequences with special tokens
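
To make steps 5 to 7 concrete, here is a minimal pure-Python sketch of what tokenisation with an OOV token and post-padding does. It is an illustration only: `build_vocab`, `encode` and `pad_post` are hypothetical helpers, not the Keras API, and the index conventions (0 for padding, 1 for `<OOV>`) are assumed to mirror the Keras `Tokenizer` and `pad_sequences` used in the cells below.

```python
from collections import Counter

PAD, OOV = 0, 1  # assumed conventions: 0 = padding, 1 = out-of-vocabulary


def build_vocab(texts, vocab_size):
    """Map the most frequent words to indices 2, 3, ... (hypothetical helper)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(vocab_size - 2)  # reserve PAD and OOV
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}


def encode(text, vocab):
    """Replace each word with its index, or OOV if it was never seen."""
    return [vocab.get(word, OOV) for word in text.lower().split()]


def pad_post(seq, max_length):
    """Truncate at the end, then append PAD values up to max_length."""
    seq = seq[:max_length]
    return seq + [PAD] * (max_length - len(seq))


# Fit the vocabulary on "training" text only, then encode an unseen sentence
vocab = build_vocab(["a funny film", "a dull film"], vocab_size=10)
encoded = encode("a funny sequel", vocab)   # "sequel" was never seen
padded = pad_post(encoded, max_length=5)
print(padded)
```

The unseen word is mapped to the OOV index and the tail of the sequence is filled with zeros, which is the same shape of output the real preprocessing cells produce at `max_length = 200`.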


In [None]:
# Parameters
vocab_size = 30000 # How many unique words in vocabulary
max_length = 200 # How many words per sentence
padding_type = "post"
trunc_type = "post"

In [None]:
# Initialise the tokeniser
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(train_txt)

In [None]:
# Tokenise the sentences and pad the sequences in the training set
train_sequences = tokenizer.texts_to_sequences(train_txt)
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [None]:
# Tokenise the sentences and pad the sequences in the validation set
val_sequences = tokenizer.texts_to_sequences(val_txt)
val_padded = pad_sequences(val_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [None]:
# Tokenise the sentences and pad the sequences in the test set
test_sequences = tokenizer.texts_to_sequences(test_txt)
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [None]:
# Convert to numpy array
test_padded = np.array(test_padded, dtype="int")
test_label = np.array(test_label, dtype="int")

## ✅ 8. Modelling


8. Train a sentiment classifier on the dataset and compare different models for text classification.
- vanilla RNN
- LSTM
- GRU
- Stacked LSTM
- Bidirectional LSTM
- Stacked Bidirectional LSTM
- Stacked Bidirectional LSTM + dropout

In [None]:
embedding_vecor_length = 100  # How many dimensions per embedding
epochs = 7
batch_size = 32
# class_labels = ['negative', 'neutral', 'positive', 'very negative', 'very positive']

### SimpleRNN

In [None]:
tf.random.set_seed(seed)

model = Sequential()

# Word embedding
model.add(Embedding(
    input_dim=vocab_size,
    output_dim=embedding_vecor_length,
    input_length=max_length
))

# Vanilla RNN
model.add(SimpleRNN(100))

# Output layer
model.add(Dense(n_classes, activation="softmax"))

# Compile the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])

In [None]:
# Train the model
model.fit(train_padded,
          train_label,
          validation_data=(val_padded, val_label),
          epochs=epochs,
          batch_size=batch_size)

Epoch 1/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 25s 107ms/step - loss: 1.5776 - sparse_categorical_accuracy: 0.2686 - val_loss: 1.5752 - val_sparse_categorical_accuracy: 0.2715
Epoch 2/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 22s 103ms/step - loss: 1.5836 - sparse_categorical_accuracy: 0.2595 - val_loss: 1.5719 - val_sparse_categorical_accuracy: 0.2744
Epoch 3/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 45s 123ms/step - loss: 1.5878 - sparse_categorical_accuracy: 0.2602 - val_loss: 1.5708 - val_sparse_categorical_accuracy: 0.2580
Epoch 4/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 22s 103ms/step - loss: 1.5803 - sparse_categorical_accuracy: 0.2690 - val_loss: 1.5719 - val_sparse_categorical_accuracy: 0.2592
Epoch 5/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 24s 112ms/step - loss: 1.5784 - sparse_categorical_accuracy: 0.2680 - val_loss: 1.5704 - val_sparse_categorical_accurac

<keras.src.callbacks.history.History at 0x7c59b6401f10>

In [None]:
# Model architecture
model.summary()

In [None]:
def evaluate_nn(X_test, y_test, batch_size, model):

    # Loss and accuracy as computed by Keras (for multi-class output, sparse categorical
    # accuracy compares the argmax of the softmax probabilities with the label; no 0.5 threshold is involved)
    loss, accuracy = model.evaluate(X_test, y_test, batch_size=batch_size, verbose=0)

    print(
        f"Keras metrics  \n"
        f"Loss : {loss:.2f} \n"
        f"Categorical Accuracy: {accuracy:.2%} \n"
    )

    y_pred = model.predict(X_test)
    # Get the index of the highest probability (the predicted class)
    y_pred_classes = np.argmax(y_pred, axis=1)

    print(classification_report(y_test, y_pred_classes))


In [None]:
print("SimpleRNN")
evaluate_nn(test_padded, test_label, batch_size, model)

SimpleRNN
Keras metrics  
Loss : 1.583113431930542.2f%% 
Categorical Accuracy: 0.23076923191547394.2f%% 

70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       279
           1       0.00      0.00      0.00       633
           2       0.00      0.00      0.00       389
           3       0.23      1.00      0.38       510
           4       0.00      0.00      0.00       399

    accuracy                           0.23      2210
   macro avg       0.05      0.20      0.07      2210
weighted avg       0.05      0.23      0.09      2210





### LSTM

In [None]:
tf.random.set_seed(seed)

model = Sequential()

# Word embedding
model.add(Embedding(
    input_dim=vocab_size,  # How many unique words in vocabulary
    output_dim=embedding_vecor_length, # How many dimensions per embedding
    input_length=max_length # How many words per sentence
))

# LSTM
model.add(LSTM(100))

# Output layer
model.add(Dense(n_classes, activation="softmax"))

# Compile the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])



In [None]:
# Train the model
model.fit(train_padded,
          train_label,
          validation_data=(val_padded, val_label),
          epochs=epochs,
          batch_size=batch_size)

Epoch 1/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 45s 198ms/step - loss: 1.5746 - sparse_categorical_accuracy: 0.2691 - val_loss: 1.5727 - val_sparse_categorical_accuracy: 0.2715
Epoch 2/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 42s 196ms/step - loss: 1.5705 - sparse_categorical_accuracy: 0.2595 - val_loss: 1.5702 - val_sparse_categorical_accuracy: 0.2715
Epoch 3/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 43s 202ms/step - loss: 1.5701 - sparse_categorical_accuracy: 0.2605 - val_loss: 1.5693 - val_sparse_categorical_accuracy: 0.2715
Epoch 4/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 42s 198ms/step - loss: 1.5699 - sparse_categorical_accuracy: 0.2622 - val_loss: 1.5690 - val_sparse_categorical_accuracy: 0.2715
Epoch 5/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 42s 198ms/step - loss: 1.5698 - sparse_categorical_accuracy: 0.2611 - val_loss: 1.5689 - val_sparse_categorical_accurac

<keras.src.callbacks.history.History at 0x7c59b5f26ea0>

In [None]:
# Model architecture
model.summary()

In [None]:
print("LSTM")
evaluate_nn(test_padded, test_label, batch_size, model)

LSTM
Keras metrics  
Loss : 1.5825059413909912.2f%% 
Categorical Accuracy: 0.23076923191547394.2f%% 

70/70 ━━━━━━━━━━━━━━━━━━━━ 3s 47ms/step
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       279
           1       0.00      0.00      0.00       633
           2       0.00      0.00      0.00       389
           3       0.23      1.00      0.38       510
           4       0.00      0.00      0.00       399

    accuracy                           0.23      2210
   macro avg       0.05      0.20      0.07      2210
weighted avg       0.05      0.23      0.09      2210





### GRU

In [None]:
tf.random.set_seed(seed)

model = Sequential()

# Word embedding
model.add(Embedding(
    input_dim=vocab_size,  # How many unique words in vocabulary
    output_dim=embedding_vecor_length, # How many dimensions per embedding
    input_length=max_length # How many words per sentence
))

# GRU
model.add(GRU(100))

# Output layer
model.add(Dense(n_classes, activation="softmax"))

# Compile the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])



In [None]:
# Train the model
model.fit(train_padded,
          train_label,
          validation_data=(val_padded, val_label),
          epochs=epochs,
          batch_size=batch_size)

Epoch 1/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 52s 225ms/step - loss: 1.5753 - sparse_categorical_accuracy: 0.2726 - val_loss: 1.5715 - val_sparse_categorical_accuracy: 0.2715
Epoch 2/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 49s 227ms/step - loss: 1.5707 - sparse_categorical_accuracy: 0.2598 - val_loss: 1.5702 - val_sparse_categorical_accuracy: 0.2715
Epoch 3/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 49s 230ms/step - loss: 1.5703 - sparse_categorical_accuracy: 0.2606 - val_loss: 1.5696 - val_sparse_categorical_accuracy: 0.2715
Epoch 4/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 51s 240ms/step - loss: 1.5701 - sparse_categorical_accuracy: 0.2602 - val_loss: 1.5693 - val_sparse_categorical_accuracy: 0.2715
Epoch 5/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 79s 223ms/step - loss: 1.5699 - sparse_categorical_accuracy: 0.2599 - val_loss: 1.5691 - val_sparse_categorical_accurac

<keras.src.callbacks.history.History at 0x7c59b0625c70>

In [None]:
# Model architecture
model.summary()

In [None]:
print("GRU")
evaluate_nn(test_padded, test_label, batch_size, model)

GRU
Keras metrics  
Loss : 1.5831310749053955.2f%% 
Categorical Accuracy: 0.23076923191547394.2f%% 

70/70 ━━━━━━━━━━━━━━━━━━━━ 3s 43ms/step
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       279
           1       0.00      0.00      0.00       633
           2       0.00      0.00      0.00       389
           3       0.23      1.00      0.38       510
           4       0.00      0.00      0.00       399

    accuracy                           0.23      2210
   macro avg       0.05      0.20      0.07      2210
weighted avg       0.05      0.23      0.09      2210





### Stacked LSTM

In [None]:
tf.random.set_seed(seed)

model = Sequential()

# Word embedding
model.add(Embedding(
    input_dim=vocab_size,  # How many unique words in vocabulary
    output_dim=embedding_vecor_length, # How many dimensions per embedding
    input_length=max_length # How many words per sentence
))

# Stacked LSTM
model.add(LSTM(100, return_sequences = True))
model.add(LSTM(64))

# Output layer
model.add(Dense(n_classes, activation="softmax"))

# Compile the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])



In [None]:
# Train the model
model.fit(train_padded,
          train_label,
          validation_data=(val_padded, val_label),
          epochs=epochs,
          batch_size=batch_size)

Epoch 1/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 73s 318ms/step - loss: 1.5745 - sparse_categorical_accuracy: 0.2698 - val_loss: 1.5742 - val_sparse_categorical_accuracy: 0.2715
Epoch 2/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 68s 318ms/step - loss: 1.5707 - sparse_categorical_accuracy: 0.2607 - val_loss: 1.5702 - val_sparse_categorical_accuracy: 0.2715
Epoch 3/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 69s 324ms/step - loss: 1.5701 - sparse_categorical_accuracy: 0.2606 - val_loss: 1.5691 - val_sparse_categorical_accuracy: 0.2715
Epoch 4/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 68s 317ms/step - loss: 1.5698 - sparse_categorical_accuracy: 0.2603 - val_loss: 1.5688 - val_sparse_categorical_accuracy: 0.2715
Epoch 5/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 70s 325ms/step - loss: 1.5697 - sparse_categorical_accuracy: 0.2621 - val_loss: 1.5686 - val_sparse_categorical_accurac

<keras.src.callbacks.history.History at 0x7c59b261e2a0>

In [None]:
# Model architecture
model.summary()

In [None]:
print("Stacked LSTM")
evaluate_nn(test_padded, test_label, batch_size, model)

Stacked LSTM
Keras metrics  
Loss : 1.5821160078048706.2f%% 
Categorical Accuracy: 0.23076923191547394.2f%% 

70/70 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       279
           1       0.00      0.00      0.00       633
           2       0.00      0.00      0.00       389
           3       0.23      1.00      0.38       510
           4       0.00      0.00      0.00       399

    accuracy                           0.23      2210
   macro avg       0.05      0.20      0.07      2210
weighted avg       0.05      0.23      0.09      2210





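The simple RNN, LSTM, GRU and even the stacked LSTM all score poorly on the training, validation and test sets. Accuracy stays at roughly 26-27% throughout training, and on the test set each of these models predicts class 3 for every review (recall 1.00 for class 3 and 0.00 for all other classes), which gives an accuracy of 23%. Bidirectional models may provide higher accuracy scores.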
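
The classification reports above show non-zero recall for class 3 only, which suggests that these four models behave like a constant predictor. As a sanity check, the sketch below recomputes the metrics such a predictor would get, using only the support column of the reports; the variable names are illustrative.

```python
# Test-set support per class, copied from the classification reports above
support = {0: 279, 1: 633, 2: 389, 3: 510, 4: 399}
total = sum(support.values())

predicted_class = 3  # the only class with non-zero recall in the reports

# If every review is labelled as class 3, accuracy is simply that class's share
accuracy = support[predicted_class] / total
precision = accuracy   # every prediction is class 3, so precision equals its share
recall = 1.0           # all true class-3 reviews are recovered
f1 = 2 * precision * recall / (precision + recall)

macro_f1 = f1 / len(support)                       # other classes contribute F1 = 0
weighted_f1 = f1 * support[predicted_class] / total

print(f"accuracy={accuracy:.4f} f1={f1:.4f} "
      f"macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

These values line up with the reported 0.23 accuracy and the class-3 row of the reports, so the four models have learned nothing beyond a single constant prediction.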

### Bidirectional LSTM

In [None]:
tf.random.set_seed(seed)

model = Sequential()

# Word embedding
model.add(Embedding(
    input_dim=vocab_size,  # How many unique words in vocabulary
    output_dim=embedding_vecor_length, # How many dimensions per embedding
    input_length=max_length # How many words per sentence
))

# Bidirectional LSTM
model.add(Bidirectional(LSTM(100)))

# Output layer
model.add(Dense(n_classes, activation="softmax"))

# Compile the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])



In [None]:
# Train the model
model.fit(train_padded,
          train_label,
          validation_data=(val_padded, val_label),
          epochs=epochs,
          batch_size=batch_size)

Epoch 1/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 79s 353ms/step - loss: 1.5662 - sparse_categorical_accuracy: 0.2859 - val_loss: 1.4192 - val_sparse_categorical_accuracy: 0.3897
Epoch 2/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 74s 346ms/step - loss: 1.2701 - sparse_categorical_accuracy: 0.4489 - val_loss: 1.4271 - val_sparse_categorical_accuracy: 0.4160
Epoch 3/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 76s 352ms/step - loss: 0.8832 - sparse_categorical_accuracy: 0.6344 - val_loss: 1.7219 - val_sparse_categorical_accuracy: 0.3809
Epoch 4/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 75s 349ms/step - loss: 0.5614 - sparse_categorical_accuracy: 0.7942 - val_loss: 1.9470 - val_sparse_categorical_accuracy: 0.3844
Epoch 5/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 74s 344ms/step - loss: 0.3853 - sparse_categorical_accuracy: 0.8650 - val_loss: 2.6872 - val_sparse_categorical_accurac

<keras.src.callbacks.history.History at 0x7c59b078b650>

In [None]:
# Model architecture
model.summary()

In [None]:
print("BiLSTM")
evaluate_nn(test_padded, test_label, batch_size, model)

BiLSTM
Keras metrics  
Loss : 2.6484832763671875.2f%% 
Categorical Accuracy: 0.3855203688144684.2f%% 

70/70 ━━━━━━━━━━━━━━━━━━━━ 6s 78ms/step
              precision    recall  f1-score   support

           0       0.43      0.20      0.28       279
           1       0.46      0.45      0.45       633
           2       0.24      0.29      0.26       389
           3       0.35      0.46      0.40       510
           4       0.49      0.41      0.45       399

    accuracy                           0.39      2210
   macro avg       0.40      0.36      0.37      2210
weighted avg       0.40      0.39      0.38      2210



With the bidirectional LSTM, training accuracy rises to 98% while validation accuracy improves only slightly, to 35%. At the same time, the validation loss climbs from epoch 2 onwards. This large gap shows that the model is overfitting.
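
The overfitting described above can be read directly off the training log. The short sketch below copies the loss values shown for epochs 1 to 5 (later epochs are not visible in the log) and computes the gap between validation and training loss per epoch. The variable names are illustrative, and the training loss Keras reports is a running average over the epoch, so the gap is only an approximate indicator.

```python
# Values copied from the BiLSTM training log above (epochs 1-5 as shown)
train_loss = [1.5662, 1.2701, 0.8832, 0.5614, 0.3853]
val_loss = [1.4192, 1.4271, 1.7219, 1.9470, 2.6872]

# Gap between validation and training loss at each epoch
gaps = [round(v - t, 4) for t, v in zip(train_loss, val_loss)]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: val_loss - train_loss = {gap:+.4f}")
print(f"lowest validation loss at epoch {best_epoch}")
```

A gap that keeps widening while the validation loss moves away from its minimum is the usual signature of a model memorising the training set.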

### Stacked Bidirectional LSTM

In [None]:
tf.random.set_seed(seed)

model = Sequential()

# Word embedding
model.add(Embedding(
    input_dim=vocab_size,  # How many unique words in vocabulary
    output_dim=embedding_vecor_length, # How many dimensions per embedding
    input_length=max_length # How many words per sentence
))

# Stacked Bidirectional LSTM
model.add(Bidirectional(LSTM(100, return_sequences=True)))
model.add(Bidirectional(LSTM(64)))

# Output layer
model.add(Dense(n_classes, activation="softmax"))

# Compile the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])



In [None]:
# Train the model
model.fit(train_padded,
          train_label,
          validation_data=(val_padded, val_label),
          epochs=epochs,
          batch_size=batch_size)

Epoch 1/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 173s 771ms/step - loss: 1.5624 - sparse_categorical_accuracy: 0.2877 - val_loss: 1.5658 - val_sparse_categorical_accuracy: 0.2715
Epoch 2/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 199s 756ms/step - loss: 1.4169 - sparse_categorical_accuracy: 0.3642 - val_loss: 1.3847 - val_sparse_categorical_accuracy: 0.3961
Epoch 3/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 163s 763ms/step - loss: 1.0789 - sparse_categorical_accuracy: 0.5397 - val_loss: 1.5094 - val_sparse_categorical_accuracy: 0.3991
Epoch 4/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 162s 756ms/step - loss: 0.7212 - sparse_categorical_accuracy: 0.7231 - val_loss: 1.8938 - val_sparse_categorical_accuracy: 0.3610
Epoch 5/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 167s 781ms/step - loss: 0.4534 - sparse_categorical_accuracy: 0.8395 - val_loss: 2.0795 - val_sparse_categorical_ac

<keras.src.callbacks.history.History at 0x7c59b07acbf0>

In [None]:
# Model architecture
model.summary()

In [None]:
print("Stacked BiLSTM")
evaluate_nn(test_padded, test_label, batch_size, model)

Stacked BiLSTM
Keras metrics  
Loss : 2.6451094150543213.2f%% 
Categorical Accuracy: 0.3877828121185303.2f%% 

70/70 ━━━━━━━━━━━━━━━━━━━━ 14s 190ms/step
              precision    recall  f1-score   support

           0       0.37      0.27      0.31       279
           1       0.48      0.37      0.42       633
           2       0.26      0.29      0.28       389
           3       0.35      0.50      0.41       510
           4       0.51      0.45      0.48       399

    accuracy                           0.39      2210
   macro avg       0.39      0.38      0.38      2210
weighted avg       0.40      0.39      0.39      2210



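The stacked bidirectional LSTM still overfits: by epoch 5 the training accuracy has climbed to about 84%, while the validation loss has risen from 1.38 (epoch 2) to 2.08 and the test accuracy stays at 39%.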

### Stacked Bidirectional LSTM + dropout

In [None]:
tf.random.set_seed(seed)

model = Sequential()

# Word embedding
model.add(Embedding(
    input_dim=vocab_size,  # How many unique words in vocabulary
    output_dim=embedding_vecor_length, # How many dimensions per embedding
    input_length=max_length # How many words per sentence
))

# Stacked Bidirectional LSTM + dropout
model.add(Bidirectional(LSTM(100, return_sequences=True, dropout=0.3, recurrent_dropout=0.3)))
model.add(Bidirectional(LSTM(64, dropout=0.3, recurrent_dropout=0.3)))

# Dropout before final output layer.
model.add(Dropout(0.3))

# Output layer
model.add(Dense(n_classes, activation="softmax"))

# Compile the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])



In [None]:
# Train the model
model.fit(train_padded,
          train_label,
          validation_data=(val_padded, val_label),
          epochs=epochs,
          batch_size=batch_size)

Epoch 1/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 312s 1s/step - loss: 1.5716 - sparse_categorical_accuracy: 0.2793 - val_loss: 1.4522 - val_sparse_categorical_accuracy: 0.3657
Epoch 2/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 301s 1s/step - loss: 1.3798 - sparse_categorical_accuracy: 0.3960 - val_loss: 1.4024 - val_sparse_categorical_accuracy: 0.4084
Epoch 3/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 292s 1s/step - loss: 1.0591 - sparse_categorical_accuracy: 0.5445 - val_loss: 1.5717 - val_sparse_categorical_accuracy: 0.3967
Epoch 4/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 292s 1s/step - loss: 0.7604 - sparse_categorical_accuracy: 0.6926 - val_loss: 1.8889 - val_sparse_categorical_accuracy: 0.3915
Epoch 5/7
214/214 ━━━━━━━━━━━━━━━━━━━━ 300s 1s/step - loss: 0.4890 - sparse_categorical_accuracy: 0.8309 - val_loss: 2.2571 - val_sparse_categorical_accuracy: 0.3862


<keras.src.callbacks.history.History at 0x7c59ad44eff0>

In [None]:
# Model architecture
model.summary()

In [None]:
print("Stacked BiLSTM + dropout")
evaluate_nn(test_padded, test_label, batch_size, model)

Stacked BiLSTM + dropout
Keras metrics  
Loss : 2.7073843479156494.2f%% 
Categorical Accuracy: 0.3628959357738495.2f%% 

70/70 ━━━━━━━━━━━━━━━━━━━━ 20s 264ms/step
              precision    recall  f1-score   support

           0       0.41      0.13      0.20       279
           1       0.46      0.34      0.39       633
           2       0.24      0.32      0.28       389
           3       0.34      0.59      0.43       510
           4       0.50      0.30      0.38       399

    accuracy                           0.36      2210
   macro avg       0.39      0.34      0.34      2210
weighted avg       0.39      0.36      0.35      2210



Adding dropout as a regularisation method does little to prevent overfitting: the gap between training and validation accuracy is still large (about 83% versus 39% at epoch 5), and the test accuracy drops slightly to 36%.

## ❓ Data Distribution

To investigate the reason behind the overfitting problem, we first check the distribution of the data and classes.

In [None]:
train_label.value_counts(normalize=True)

| label | proportion |
|---|---|
| 3 | 0.271836 |
| 1 | 0.259546 |
| 2 | 0.190051 |
| 4 | 0.150695 |
| 0 | 0.127871 |


In [None]:
val_label.value_counts(normalize=True)

| label | proportion |
|---|---|
| 3 | 0.271504 |
| 1 | 0.259801 |
| 2 | 0.19017 |
| 4 | 0.150965 |
| 0 | 0.12756 |


In [None]:
# Confirm stratified class balance.
print(
  f"Class distribution:"
  f"\nTrain: {Counter(train_label)}"
  f"\nValid: {Counter(val_label)}"
)

Class distribution:
Train: Counter({3: 1858, 1: 1774, 2: 1299, 4: 1030, 0: 874})
Valid: Counter({3: 464, 1: 444, 2: 325, 4: 258, 0: 218})


The class proportions in the training and validation sets are nearly identical, so the large gap between training and validation accuracy cannot be explained by a difference in class distribution between them.
The classes themselves are only mildly imbalanced, ranging from about 13% (class 0) to 27% (class 3) of the reviews.
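
The comparison can also be made numerically. The sketch below takes the class counts printed above and reports the largest difference in class share between the two sets, plus the ratio between the largest and smallest class; `proportions` is a hypothetical helper written for this illustration.

```python
from collections import Counter

# Counts copied from the output above
train_counts = Counter({3: 1858, 1: 1774, 2: 1299, 4: 1030, 0: 874})
val_counts = Counter({3: 464, 1: 444, 2: 325, 4: 258, 0: 218})


def proportions(counts):
    """Convert raw class counts into class shares (hypothetical helper)."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_prop = proportions(train_counts)
val_prop = proportions(val_counts)

# Largest absolute difference in class share between train and validation
max_diff = max(abs(train_prop[c] - val_prop[c]) for c in train_prop)

# How much larger the biggest class is than the smallest one
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())

print(f"largest train/validation difference: {max_diff:.4f}")
print(f"largest/smallest class ratio in train: {imbalance_ratio:.2f}")
```

A difference this small is what a stratified split is expected to produce, while the ratio between the classes gives a rough sense of how uneven the labels are overall.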

In [None]:
train_oov_count = sum([1 for seq in train_padded for tok in seq if tok == tokenizer.word_index["<OOV>"]])
valid_oov_count = sum([1 for seq in val_padded for tok in seq if tok == tokenizer.word_index["<OOV>"]])

print(f"Train OOV rate: {train_oov_count / train_padded.size}")
print(f"Valid OOV rate: {valid_oov_count / val_padded.size}")

Train OOV rate: 0.0
Valid OOV rate: 0.0048478642480983035


The out-of-vocabulary (OOV) rate is 0% in the training set, as expected because the tokeniser was fitted on it, and 0.48% in the validation set. Note that the rate is computed over all 200 padded positions per review, padding included, so the share of actual words that are OOV is higher than 0.48%. On this measure, though, most words in the validation set were already encountered during training, so unfamiliar words do not appear to explain the model's poor generalisation.
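
Because the rate above divides by the full padded array, it may be worth comparing it with a rate computed over real tokens only. The sketch below shows the difference on a toy example; `oov_rates` is a hypothetical helper, and it assumes that 0 is the padding value and 1 is the `<OOV>` index, as with the Keras tokeniser settings used in this notebook.

```python
PAD, OOV = 0, 1  # assumed: 0 is the padding value, 1 is the "<OOV>" index


def oov_rates(padded, oov_index=OOV, pad_index=PAD):
    """Return (rate over all positions, rate over real tokens only)."""
    n_positions = sum(len(seq) for seq in padded)
    n_tokens = sum(1 for seq in padded for tok in seq if tok != pad_index)
    n_oov = sum(1 for seq in padded for tok in seq if tok == oov_index)
    per_position = n_oov / n_positions
    per_token = n_oov / n_tokens if n_tokens else 0.0
    return per_position, per_token


# Toy example: two sequences padded to length 10, one OOV token in each
toy = [
    [5, 1, 7, 0, 0, 0, 0, 0, 0, 0],
    [1, 9, 0, 0, 0, 0, 0, 0, 0, 0],
]
per_position, per_token = oov_rates(toy)
print(per_position, per_token)
```

With the notebook's arrays, a call along the lines of `oov_rates(val_padded.tolist(), tokenizer.word_index["<OOV>"])` would give both figures for the validation set.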

## ❓ Data Structure/Style

The class distribution is very similar for the training and validation sets, with no noticeable difference between them. This means class distribution is not the reason for overfitting.

The measured OOV rate is also very low, so most words in the validation set were already encountered during training. This factor does not appear to explain the overfitting either.


We next check whether differences in sentence structure or style between the training, validation and test sets may be the reason. We combine all three data sets, shuffle them, and re-split them into new train, validation and test sets. Any systematic stylistic or structural difference between the original splits should then be spread evenly across the new ones.

In [None]:
# Concatenate train & test data
data_all = pd.concat([
    pd.DataFrame(data),
    pd.DataFrame(test_data)
], ignore_index=True)

# Shuffle combined data set
data_all = data_all.sample(frac=1, random_state=seed).reset_index(drop=True)

# Separate the test set (10%)
train_val_data, test_data = train_test_split(data_all, test_size=0.1, stratify=data_all["label"], random_state=seed)

# Split into train and validation (20%).
train_data, val_data = train_test_split(train_val_data, test_size=0.2, stratify=train_val_data["label"], random_state=seed)
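
Because the 20% validation split is taken from the 90% that remains after the test split, the effective proportions are roughly 72/18/10. The arithmetic can be sketched as below. It assumes the held-out size is rounded up, as scikit-learn's `train_test_split` does for a float `test_size`. The function name `split_sizes` is illustrative, and the total of 10,754 rows is the sum of the class counts printed further down.

```python
import math

def split_sizes(n_total, test_frac=0.1, val_frac=0.2):
    """Sizes produced by two successive fractional splits.

    Assumes each held-out set is rounded up and the remainder
    goes to the other side of the split.
    """
    n_test = math.ceil(test_frac * n_total)
    n_train_val = n_total - n_test
    n_val = math.ceil(val_frac * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

n_total = 10754  # total rows implied by the printed class counts
n_train, n_val, n_test = split_sizes(n_total)
for name, n in [("train", n_train), ("valid", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```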


In [None]:
train_txt = train_data["text"]
train_label = train_data["label"]

In [None]:
val_txt = val_data["text"]
val_label = val_data["label"]

In [None]:
test_txt = test_data["text"]
test_label = test_data["label"]

In [None]:
# Confirm stratified class balance.
print(
  f"Class distribution:"
  f"\nTrain: {Counter(train_label)}"
  f"\nValid: {Counter(val_label)}"
  f"\nTest : {Counter(test_label)}"
)

Class distribution:
Train: Counter({1: 2053, 3: 2039, 2: 1449, 4: 1214, 0: 987})
Valid: Counter({1: 513, 3: 510, 2: 362, 4: 304, 0: 247})
Test : Counter({1: 285, 3: 283, 2: 202, 4: 169, 0: 137})


### Preprocessing

In [None]:
# Initialise the tokeniser
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(train_txt)

In [None]:
# Tokenise the sentences and pad the sequences in the training set
train_sequences = tokenizer.texts_to_sequences(train_txt)
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [None]:
# Tokenise the sentences and pad the sequences in the validation set
val_sequences = tokenizer.texts_to_sequences(val_txt)
val_padded = pad_sequences(val_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [None]:
# Tokenise the sentences and pad the sequences in the test set
test_sequences = tokenizer.texts_to_sequences(test_txt)
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [None]:
# Convert to numpy array
test_padded = np.array(test_padded, dtype="int")
test_label = np.array(test_label, dtype="int")

### Stacked Bidirectional LSTM + dropout

In [None]:
tf.random.set_seed(seed)

model = Sequential()

# Word embedding
model.add(Embedding(
    input_dim=vocab_size,  # How many unique words in vocabulary
    output_dim=embedding_vecor_length, # How many dimensions per embedding
    input_length=max_length # How many words per sentence
))

# Stacked Bidirectional LSTM + dropout
model.add(Bidirectional(LSTM(100, return_sequences=True, dropout=0.3, recurrent_dropout=0.3)))
model.add(Bidirectional(LSTM(64, dropout=0.3, recurrent_dropout=0.3)))

# Dropout before final output layer.
model.add(Dropout(0.3))

# Output layer
model.add(Dense(n_classes, activation="softmax"))

# Compile the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])



In [None]:
# Train the model
model.fit(train_padded,
          train_label,
          validation_data=(val_padded, val_label),
          epochs=epochs,
          batch_size=batch_size)

Epoch 1/7
242/242 ━━━━━━━━━━━━━━━━━━━━ 432s 2s/step - loss: 1.5741 - sparse_categorical_accuracy: 0.2702 - val_loss: 1.4586 - val_sparse_categorical_accuracy: 0.3502
Epoch 2/7
242/242 ━━━━━━━━━━━━━━━━━━━━ 357s 1s/step - loss: 1.3461 - sparse_categorical_accuracy: 0.4073 - val_loss: 1.4174 - val_sparse_categorical_accuracy: 0.4189
Epoch 3/7
242/242 ━━━━━━━━━━━━━━━━━━━━ 408s 2s/step - loss: 1.0176 - sparse_categorical_accuracy: 0.5572 - val_loss: 1.6027 - val_sparse_categorical_accuracy: 0.3827
Epoch 4/7
242/242 ━━━━━━━━━━━━━━━━━━━━ 378s 1s/step - loss: 0.7579 - sparse_categorical_accuracy: 0.7036 - val_loss: 1.8282 - val_sparse_categorical_accuracy: 0.3802
Epoch 5/7
242/242 ━━━━━━━━━━━━━━━━━━━━ 360s 1s/step - loss: 0.5424 - sparse_categorical_accuracy: 0.7996 - val_loss: 2.1037 - val_sparse_categorical_accuracy: 0.3822


<keras.src.callbacks.history.History at 0x7c59cf69d4f0>

In [None]:
# Model architecture
model.summary()

In [None]:
print("Stacked BiLSTM + dropout (shuffled data)")
evaluate_nn(test_padded, test_label, batch_size, model)

Stacked BiLSTM + dropout
Keras metrics  
Loss : 2.644376039505005
Categorical Accuracy: 0.36245352029800415

34/34 ━━━━━━━━━━━━━━━━━━━━ 12s 303ms/step
              precision    recall  f1-score   support

           0       0.35      0.31      0.33       137
           1       0.39      0.33      0.36       285
           2       0.27      0.21      0.24       202
           3       0.39      0.43      0.41       283
           4       0.37      0.52      0.43       169

    accuracy                           0.36      1076
   macro avg       0.35      0.36      0.35      1076
weighted avg       0.36      0.36      0.36      1076



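Even after pooling and shuffling all the data before creating new training, validation, and test splits, the same overfitting pattern persists: training accuracy climbs from 0.27 to 0.80 over five epochs, while validation accuracy stays near 0.38 and validation loss rises from 1.46 to 2.10. Test accuracy is 0.36. This suggests that the issue is not caused by structural or stylistic differences between the original splits. The problem appears to be model-related rather than data-related: the model memorises the training data and struggles to generalise.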
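
The size of the generalisation gap can be read off the training log directly. The sketch below uses the per-epoch values copied from the log above; the variable names are illustrative.

```python
# Per-epoch metrics copied from the training log above
train_acc = [0.2702, 0.4073, 0.5572, 0.7036, 0.7996]
val_acc = [0.3502, 0.4189, 0.3827, 0.3802, 0.3822]
val_loss = [1.4586, 1.4174, 1.6027, 1.8282, 2.1037]

# Generalisation gap: training accuracy minus validation accuracy
gaps = [round(t - v, 4) for t, v in zip(train_acc, val_acc)]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

for epoch, gap in enumerate(gaps, start=1):
    print(f"Epoch {epoch}: gap = {gap:+.4f}")
print(f"Lowest validation loss at epoch {best_epoch}")
```

The gap widens every epoch after the second, which is also where validation loss is lowest; everything the model learns afterwards improves the training fit only.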

## 🟣 Recommendations
The persistent overfitting cannot be explained by data issues such as class imbalance, OOV handling, or differences in sentence structure. The model continues to memorise the training data even after reshuffling and resplitting.

These models train their embeddings from scratch on a relatively small dataset (7,742 training sentences after the re-split), which limits their ability to learn meaningful semantic representations.

We therefore recommend replacing the current embedding layer with pre-trained embeddings such as Word2Vec. These embeddings are trained on large corpora and encode richer linguistic structure, which may help the model generalise better.
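
One way to wire pre-trained vectors into the model is to build an embedding matrix whose rows line up with the tokeniser's word IDs. The following is a hedged sketch rather than a tested part of this notebook: `build_embedding_matrix`, `toy_word_index`, and `toy_pretrained` are hypothetical names, and the toy vectors stand in for vectors that a real run would load from a pre-trained Word2Vec model.

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, vocab_size, dim, seed=0):
    """Row i holds the pre-trained vector for the word with ID i.

    Words without a pre-trained vector keep a small random vector;
    row 0 (padding) stays all zeros.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(vocab_size, dim))
    matrix[0] = 0.0
    hits = 0
    for word, idx in word_index.items():
        if idx < vocab_size and word in pretrained:
            matrix[idx] = pretrained[word]
            hits += 1
    return matrix, hits

# Hypothetical stand-ins: a real run would use tokenizer.word_index and
# vectors loaded from a pre-trained Word2Vec model.
toy_word_index = {"<OOV>": 1, "good": 2, "bad": 3, "film": 4}
toy_pretrained = {
    "good": np.array([0.9, 0.1, 0.0]),
    "bad": np.array([-0.9, 0.1, 0.0]),
}

matrix, hits = build_embedding_matrix(
    toy_word_index, toy_pretrained, vocab_size=5, dim=3
)
print(matrix.shape, hits)
```

The resulting matrix could then initialise the `Embedding` layer, for example through `embeddings_initializer=keras.initializers.Constant(matrix)` with `output_dim` set to the vector dimension. Whether to freeze the vectors (`trainable=False`) or fine-tune them is a design choice worth comparing on the validation set.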