# Task 2: Text classification


Due to the limitation of computational power, hyperparameter fine-tuning was very limited, and I relied on academic papers to get good values for the hyperparameter. Please note that I experimented with a number of values but only reported the best ones in this experiment due to, again, computational limitations.

Papers:

1- How to Fine-Tune BERT for Text Classification?
Chi Sun, Xipeng Qiu, Yige Xu, Xuanjing Huang
https://arxiv.org/abs/1905.05583

2- BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
https://arxiv.org/abs/1904.09675

3- Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks
Nils Reimers, Iryna Gurevych
https://arxiv.org/abs/1707.06799

Please note that as part of my final year project, I had access to colab pro, therefore the time reported here might differ greatly.

## Bi-directional LSTM



### Setup and Imports
*   Import necessary libraries and packages.
*   Download NLTK resources required for text processing.

In [1]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import f1_score

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

###  Data Loading
* Load the training and validation datasets.
* Define the base path for data files.

In [4]:
# change to ./data/
base = './data/'
result_base = './data/'

In [5]:
df = pd.read_csv(base+'Training-dataset.csv')

validation_df = pd.read_csv(base+'Task-2-validation-dataset.csv')

### Data Preprocessing
* Initialize stop words and lemmatizer.
* Clean data.

In [6]:
# Initialize stop words and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [7]:
def clean_text(text):
    """
    Function to clean and preprocess the text.

    Steps:
    1. Remove non-alphabetic characters.
    2. Tokenize the text (split into words).
    3. Remove stop words and apply lemmatization.
    4. Reassemble the text from processed tokens.

    Args:
    text (str): The input text string to be cleaned.

    Returns:
    str: The cleaned and processed text.
    """

    # Remove non-alphabetic characters (keep alphabets and spaces)
    text = ''.join([c for c in text if c.isalpha() or c.isspace()])

    # Tokenize the text by splitting it into words
    tokens = text.split()

    # Remove stop words (common words that are usually filtered out) and lemmatize each word
    # Lemmatization is the process of converting a word to its base form
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    # Reassemble the cleaned tokens back into a single string
    text = ' '.join(tokens)

    # Convert the text to lowercase to maintain uniformity
    return text.lower()


In [8]:
# Apply the cleaning function to plot_synopsis in both training and vaildation data
df['plot_synopsis'] = df['plot_synopsis'].apply(clean_text)
validation_df['plot_synopsis'] = validation_df['plot_synopsis'].apply(clean_text)

### Data Tokenization
* Tokenize text data and convert them into sequences.
* Calculate the optimal sequence length (95th percentile) and perform sequence padding.

In [9]:
# Initialize the tokenizer with a specified maximum number of words (vocabulary size) and an OOV (Out Of Vocabulary) token.
# The OOV token is used for words that are not in the top 5000 most frequent words (based on num_words=5000).
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')

# Fit the tokenizer on the text data. This step involves analyzing the text, creating a word index,
# and effectively learning the vocabulary of the dataset.
tokenizer.fit_on_texts(df['plot_synopsis'])

# Transform the text in 'plot_synopsis' into sequences of integers, where each integer represents a specific word.
# Words not in the top 5000 are represented by the OOV token index.
sequences = tokenizer.texts_to_sequences(df['plot_synopsis'])

# Calculate the length of each sequence (number of words per plot synopsis).
sequence_lengths = [len(seq) for seq in sequences]

# Determine the sequence length that covers 95% of the data.
# This means 95% of the plot synopses will have their length equal to or less than this value.
sen_length = np.percentile(sequence_lengths, 95)
sen_length = int(np.ceil(sen_length))  # Round up to the nearest whole number to avoid fractions.

# Pad (or truncate) the sequences to have a uniform length.
# This ensures that all input sequences fed into a neural network have the same length.
# 'padding=post' adds padding at the end (if needed), and 'truncating=post' truncates sequences at the end (if they are longer than sen_length).
padded_sequences = pad_sequences(sequences, maxlen=sen_length, padding='post', truncating='post')


### Model Building

In [10]:
# Extract labels
labels = df.iloc[:, 3:].to_numpy()

In [11]:
# Calculate class weights for handling imbalanced datasets.
weights = labels.sum(axis=0)
class_weights = {i: (1 / weight) * (len(labels) / 2.0) for i, weight in enumerate(weights)}
class_weights

{0: 3.271394611727417,
 1: 2.2923375902276515,
 2: 2.0704613841524573,
 3: 22.196236559139788,
 4: 1.0272455834784773,
 5: 2.4915509957754978,
 6: 2.058075772681954,
 7: 20.237745098039216,
 8: 1.3474216710182767}

In [12]:
# Define the vocabulary size. It's the total number of unique words in your dataset plus one for the padding token.
VOCAB_SIZE = len(tokenizer.word_index) + 1

# Set the dimension for the word embeddings.
# This is the size of the vector space in which words will be embedded.
EMBEDDING_DIM = 150

# Initialize the Sequential model.
model = Sequential()

# Add an Embedding layer.
# This layer turns positive integers (indexes) into dense vectors of fixed size (EMBEDDING_DIM).
# It's the first layer of the model and takes the vocabulary size and the embedding dimension as inputs.
model.add(Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=sen_length))

# Add a Bidirectional LSTM layer with 64 units.
# 'return_sequences=True' is necessary here as we will add another LSTM layer after this one.
model.add(Bidirectional(LSTM(64, return_sequences=True)))

# Add another Bidirectional LSTM layer, this time with 32 units.
# It will process the sequences returned by the previous LSTM layer.
model.add(Bidirectional(LSTM(32)))

# Add a Dense (fully connected) layer with 64 neurons and ReLU (Rectified Linear Unit) activation function.
# This layer will help to learn non-linear relationships in the data.
model.add(Dense(64, activation='relu'))

# Add a Dropout layer with a dropout rate of 0.5 to prevent overfitting.
# This layer randomly sets input units to 0 with a frequency of 0.5 at each step during training.
model.add(Dropout(0.5))

# Add the output Dense layer with a number of neurons equal to the number of labels (classes) in your dataset.
# Use the sigmoid activation function for binary classification.
model.add(Dense(labels.shape[1], activation='sigmoid'))

# Compile the model specifying the loss function, optimizer, and metrics to track during training.
# 'binary_crossentropy' is suitable for binary classification tasks.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Generate a summary of the model's architecture.
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1246, 150)         19495200  
                                                                 
 bidirectional (Bidirection  (None, 1246, 128)         110080    
 al)                                                             
                                                                 
 bidirectional_1 (Bidirecti  (None, 64)                41216     
 onal)                                                           
                                                                 
 dense (Dense)               (None, 64)                4160      
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 9)                 5

In [13]:
# Transform the 'plot_synopsis' column of the validation data into sequences of integers.
# The tokenizer converts each word in the text to its corresponding integer index based on the word_index learned during fitting.
validation_sequences = tokenizer.texts_to_sequences(validation_df['plot_synopsis'])

# Pad or truncate the sequences in the validation set to ensure they have a uniform length.
# This matches the length used for training data (sen_length).
# Padding is done at the end ('post') and sequences longer than 'sen_length' are truncated from the end ('post').
validation_padded = pad_sequences(validation_sequences, maxlen=sen_length, padding='post', truncating='post')

# Extract the labels for the validation data.
validation_labels = validation_df.iloc[:, 3:].to_numpy()

### Model Training

In [14]:
%%time

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Set up EarlyStopping to halt training early if validation loss doesn't improve for 3 consecutive epochs.
early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1, restore_best_weights=True)

# ModelCheckpoint to save the model with the lowest validation loss.
model_checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True, verbose=1)

# Training the model. Includes EarlyStopping and ModelCheckpoint in callbacks for improved training control.
history = model.fit(
    padded_sequences, labels,
    epochs=15,
    validation_data=(validation_padded, validation_labels),
    class_weight=class_weights,  # Address class imbalance
    callbacks=[early_stopping, model_checkpoint],
    verbose=2  # Moderate level of verbosity during training
)


Epoch 1/15

Epoch 1: val_loss improved from inf to 0.47574, saving model to best_model.h5


  saving_api.save_model(


259/259 - 77s - loss: 1.2923 - accuracy: 0.1865 - val_loss: 0.4757 - val_accuracy: 0.2643 - 77s/epoch - 296ms/step
Epoch 2/15

Epoch 2: val_loss improved from 0.47574 to 0.46657, saving model to best_model.h5
259/259 - 51s - loss: 1.2043 - accuracy: 0.2311 - val_loss: 0.4666 - val_accuracy: 0.2323 - 51s/epoch - 197ms/step
Epoch 3/15

Epoch 3: val_loss improved from 0.46657 to 0.45710, saving model to best_model.h5
259/259 - 46s - loss: 1.1063 - accuracy: 0.2610 - val_loss: 0.4571 - val_accuracy: 0.2567 - 46s/epoch - 179ms/step
Epoch 4/15

Epoch 4: val_loss did not improve from 0.45710
259/259 - 43s - loss: 0.9851 - accuracy: 0.2947 - val_loss: 0.4652 - val_accuracy: 0.2424 - 43s/epoch - 164ms/step
Epoch 5/15

Epoch 5: val_loss did not improve from 0.45710
259/259 - 40s - loss: 0.8997 - accuracy: 0.3270 - val_loss: 0.4820 - val_accuracy: 0.2155 - 40s/epoch - 153ms/step
Epoch 6/15
Restoring model weights from the end of the best epoch: 3.

Epoch 6: val_loss did not improve from 0.45710
2

### Threshold Selection and Model Evaluation

In [15]:
# Use the trained model to make predictions on the validation data.
predictions = model.predict(validation_padded)




This code optimizes class-specific thresholds for a multi-label classification model using ROC analysis. It calculates optimal thresholds for each class to balance true and false positives, applies these thresholds to generate binary predictions, and then organizes these predictions into a DataFrame for evaluation or submission as a CSV file.

In [16]:
from sklearn.metrics import roc_curve

# Class-specific threshold tuning based on ROC analysis
class_thresholds = np.zeros(predictions.shape[1])

for class_idx in range(predictions.shape[1]):
    fpr, tpr, thresholds = roc_curve(validation_labels[:, class_idx], predictions[:, class_idx])

    # Youden's J statistic to identify the best threshold
    j_scores = tpr - fpr
    j_ordered = sorted(zip(j_scores, thresholds))
    best_threshold = j_ordered[-1][1]

    class_thresholds[class_idx] = best_threshold
    print(f"Class {class_idx + 1}: Best Threshold = {best_threshold:.2f}")

# Apply class-specific thresholds to the predictions
binary_predictions = np.array([np.where(predictions[:, i] > class_thresholds[i], 1, 0) for i in range(predictions.shape[1])]).T

# Create a new DataFrame for submission
submission_df = pd.DataFrame(binary_predictions, columns=df.columns[3:])
submission_df.insert(0, 'doc_id', validation_df['ID'])

# Save the DataFrame to a CSV file
submission_df.to_csv(result_base+'10931277-Task2-method-b-validation.csv', index=False, header=False)


Class 1: Best Threshold = 0.20
Class 2: Best Threshold = 0.26
Class 3: Best Threshold = 0.22
Class 4: Best Threshold = 0.13
Class 5: Best Threshold = 0.54
Class 6: Best Threshold = 0.15
Class 7: Best Threshold = 0.25
Class 8: Best Threshold = 0.07
Class 9: Best Threshold = 0.28


### Test dataset Prediction

In [17]:
%%time
# Load the test dataset
test_df = pd.read_csv(base + 'Task-2-test-dataset1.csv')

# Clean the text data in the test dataset
test_df['plot_synopsis'] = test_df['plot_synopsis'].apply(clean_text)

# Convert the text in 'plot_synopsis' of the test dataset into sequences of integers
test_sequences = tokenizer.texts_to_sequences(test_df['plot_synopsis'])

# Pad the sequences in the test set to have a uniform length
test_padded = pad_sequences(test_sequences, maxlen=sen_length, padding='post', truncating='post')

# Use the trained model to make predictions on the test data
test_predictions = model.predict(test_padded)

# Apply the previously determined class-specific thresholds to these predictions
test_binary_predictions = np.array([np.where(test_predictions[:, i] > class_thresholds[i], 1, 0) for i in range(test_predictions.shape[1])]).T

# Optionally, create a DataFrame for the binary predictions, similar to the submission DataFrame
test_submission_df = pd.DataFrame(test_binary_predictions, columns=df.columns[3:])
test_submission_df.insert(0, 'doc_id', test_df['ID'])

# Save the test predictions to a CSV file if needed
test_submission_df.to_csv(result_base+'10931277-Task2-method-b.csv', index=False, header=False)

CPU times: user 4.85 s, sys: 216 ms, total: 5.07 s
Wall time: 5.42 s


## RoBERTa

### Setup and Imports

In [18]:
!pip install transformers



In [19]:
import pandas as pd
import numpy as np
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score

### Data Loading

In [21]:
base = './data/'
result_base = './data/'

In [22]:
df = pd.read_csv(base+'Training-dataset.csv')

validation_df = pd.read_csv(base+'Task-2-validation-dataset.csv')

### Data Tokenization

In [23]:
# Initialize RoBERTa tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Roberta language model is limited to only 512 tokens, therefore, I experimented with a number of methods including taking the first 510 tokens, taking the first 170 tokens and the last 340 but the former reported a better result and hence used. Furthermore, I did not experiment with aggregation i.e. using all of the text even above 512 tokens due to computational limitation.

In [24]:
# Tokenization and encoding of the text
def encode_text(df, tokenizer, max_len=512):
    return tokenizer(df['plot_synopsis'].tolist(), max_length=max_len, truncation=True, padding='max_length', return_tensors='tf')

In [25]:
# Encode the training and validation data
train_encodings = encode_text(df, tokenizer)
validation_encodings = encode_text(validation_df, tokenizer)

# Extract labels
train_labels = df.iloc[:, 3:].to_numpy()
validation_labels = validation_df.iloc[:, 3:].to_numpy()

### Model Building


In [26]:
# Define the RoBERTa model
model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=9)

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predicti

I experimented with a number of learning rate, this one gave the best result

In [27]:
# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.summary()

Model: "tf_roberta_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 roberta (TFRobertaMainLaye  multiple                  124055040 
 r)                                                              
                                                                 
 classifier (TFRobertaClass  multiple                  597513    
 ificationHead)                                                  
                                                                 
Total params: 124652553 (475.51 MB)
Trainable params: 124652553 (475.51 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [28]:
# Prepare data as a tf.data.Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
validation_dataset = tf.data.Dataset.from_tensor_slices((dict(validation_encodings), validation_labels))

# Define batch size and shuffle buffer size
batch_size = 8
shuffle_buffer_size = 1000

# Shuffle and batch the datasets
train_dataset = train_dataset.shuffle(shuffle_buffer_size).batch(batch_size).prefetch(tf.data.AUTOTUNE)
validation_dataset = validation_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

### Model Training

In [29]:
# Train the model
history = model.fit(
    train_dataset,
    epochs=5,
    validation_data=validation_dataset,
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [30]:
model.save_pretrained(result_base+'roberta_model/')

### Threshold Selection and Model Evaluation

In [31]:
# Generate predictions for the validation set
predictions = model.predict(validation_dataset)[0]

# Apply a sigmoid function to get probabilities from logits
predictions = tf.sigmoid(predictions).numpy()



In [32]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Class-specific threshold tuning with F1 score constraint
class_thresholds = np.zeros(predictions.shape[1])
class_f1_scores = np.zeros(predictions.shape[1])

for class_idx in range(predictions.shape[1]):
    best_threshold = 0.5
    smallest_diff = float('inf')
    best_f1_score_above_threshold = 0

    for threshold in np.arange(0.1, 1, 0.1):
        thresholded_preds = np.where(predictions[:, class_idx] > threshold, 1, 0)
        precision = precision_score(validation_labels[:, class_idx], thresholded_preds, average='binary', zero_division=0)
        recall = recall_score(validation_labels[:, class_idx], thresholded_preds, average='binary', zero_division=0)
        f1 = f1_score(validation_labels[:, class_idx], thresholded_preds, average='binary', zero_division=0)
        precision_recall_diff = abs(precision - recall)

        # Check if F1 score is above 0.2 and precision-recall difference is minimal
        if f1 > 0 and precision_recall_diff < smallest_diff:
            smallest_diff = precision_recall_diff
            best_threshold = threshold
            best_f1_score_above_threshold = f1

    class_thresholds[class_idx] = best_threshold
    class_f1_scores[class_idx] = best_f1_score_above_threshold

    # Print the best threshold, the smallest precision-recall difference, and the corresponding F1 score for each class
    print(f"Class {class_idx + 1}: Best Threshold = {best_threshold:.2f}, F1 Score = {best_f1_score_above_threshold:.4f}")

# Apply class-specific thresholds to the predictions
binary_predictions = np.array([np.where(predictions[:, i] > class_thresholds[i], 1, 0) for i in range(predictions.shape[1])]).T

# Create a new DataFrame for submission
submission_df = pd.DataFrame(binary_predictions, columns=df.columns[3:])
submission_df.insert(0, 'doc_id', validation_df['ID'])

# Save the DataFrame to a CSV file without a header
submission_df.to_csv(result_base+'10931277-Task2-method-c-validation.csv', index=False, header=False)

Class 1: Best Threshold = 0.50, F1 Score = 0.3951
Class 2: Best Threshold = 0.50, F1 Score = 0.4561
Class 3: Best Threshold = 0.40, F1 Score = 0.5097
Class 4: Best Threshold = 0.30, F1 Score = 0.3333
Class 5: Best Threshold = 0.70, F1 Score = 0.7370
Class 6: Best Threshold = 0.30, F1 Score = 0.4555
Class 7: Best Threshold = 0.40, F1 Score = 0.6073
Class 8: Best Threshold = 0.50, F1 Score = 0.4194
Class 9: Best Threshold = 0.50, F1 Score = 0.6424


### Test Dataset Prediction

In [33]:
test_df = pd.read_csv(base + 'Task-2-test-dataset1.csv')  # Load the test data

# Encode the test data
test_encodings = encode_text(test_df, tokenizer)

# Prepare the test data as a tf.data.Dataset
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings)))
test_dataset = test_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Generate predictions for the test set
test_predictions = model.predict(test_dataset)[0]

# Apply a sigmoid function to get probabilities from logits
test_predictions = tf.sigmoid(test_predictions).numpy()

# Apply class-specific thresholds to the test predictions
test_binary_predictions = np.array([np.where(test_predictions[:, i] > class_thresholds[i], 1, 0) for i in range(test_predictions.shape[1])]).T

# Create a DataFrame for the test predictions
test_submission_df = pd.DataFrame(test_binary_predictions, columns=df.columns[3:])
test_submission_df.insert(0, 'doc_id', test_df['ID'])

# Save the test predictions DataFrame to a CSV file
test_submission_df.to_csv(result_base+'10931277-Task2-method-c.csv', index=False, header=False)


