# Summary

In this notebook, we use Bayesian optimization to tune two hyperparameters of our DistilBERT model: the learning rate exponent and the dropout factor. We then compare several batch sizes and, finally, vary the number of epochs to find the configuration with the highest validation accuracy.

Because every evaluation requires a full training run and our compute was limited, we ran Bayesian optimization with 10 initial random points followed by 5 iterations guided by the surrogate function. Every evaluated hyperparameter setting and its accuracy was recorded in a JSON log.

Results: https://drive.google.com/file/d/1aSD__PhD6ToBa4oJQ5JHMEqaK3aeOyRc/view?usp=sharing


In [3]:
!pip install bayesian-optimization

Collecting bayesian-optimization
  Downloading bayesian_optimization-2.0.3-py3-none-any.whl.metadata (9.0 kB)
Collecting colorama<0.5.0,>=0.4.6 (from bayesian-optimization)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading bayesian_optimization-2.0.3-py3-none-any.whl (31 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: colorama, bayesian-optimization
Successfully installed bayesian-optimization-2.0.3 colorama-0.4.6


In [44]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments
from bayes_opt import BayesianOptimization
from google.colab import drive
from bayes_opt.logger import JSONLogger
from bayes_opt.event import Events
from bayes_opt.util import load_logs

In [11]:
from torch import cuda

device = 'cuda' if cuda.is_available() else 'cpu'
print(f"Using device: {device}")

Using device: cuda


# Data preparation

In [45]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
pos_df = pd.read_csv('/content/drive/MyDrive/it1244/cleanedposfull.csv')
neg_df = pd.read_csv('/content/drive/MyDrive/it1244/cleanednegfull.csv')

pos_df['label'] = 1
neg_df['label'] = 0

pos_df = pos_df[['FileName', 'Cleaned_Content', 'label']]
neg_df = neg_df[['FileName', 'Cleaned_Content', 'label']]

df = pd.concat([pos_df, neg_df], axis=0, ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print("Combined dataset shape:", df.shape)
print(df.head())

Combined dataset shape: (50000, 3)
    FileName                                    Cleaned_Content  label
0  17568.txt  and how they bore you right out of your mind t...      0
1  14894.txt  its not citizen kane but it does deliver cleav...      1
2  23805.txt  if you like othello youll love this flick sinc...      1
3  13159.txt  i watched the this the other night on a local ...      1
4  10128.txt  well i am so glad i watched this on hbo instea...      0


In [10]:
# Split the dataset into training and validation sets.
# We use an 80-20 split, where 20% of the data is reserved for validation.
# Stratification is applied on the 'label' column to ensure that both sets have a similar class distribution.
# The random_state parameter is set to 42 to ensure reproducibility of the split.

train_df, val_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df['label'],
    random_state=42
)

print("Training set size:", train_df.shape)
print("Validation set size:", val_df.shape)

print("\nClass distribution in training set:")
print(train_df['label'].value_counts(normalize=True))
print("\nClass distribution in validation set:")
print(val_df['label'].value_counts(normalize=True))

Training set size: (40000, 3)
Validation set size: (10000, 3)

Class distribution in training set:
label
1    0.5
0    0.5
Name: proportion, dtype: float64

Class distribution in validation set:
label
0    0.5
1    0.5
Name: proportion, dtype: float64
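
To make the effect of `stratify` concrete, the following is a minimal pure-Python sketch of a stratified split. It is an illustration only and not scikit-learn's implementation; the helper name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Toy data (hypothetical, not the review dataset): 50 positive and 50 negative labels.
toy_labels = [1] * 50 + [0] * 50
train_idx, val_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
val_counts = Counter(toy_labels[i] for i in val_idx)
print("train:", dict(train_counts), "val:", dict(val_counts))
```

Because each class is split separately, both subsets keep the 50/50 class balance of the full data, which is what the proportions printed above show for the real split.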


In [None]:
# Load the pre-trained DistilBERT tokenizer (uncased version) for tokenizing input text.
# This tokenizer converts raw text into tokens and corresponding IDs, which are used as inputs to the model.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Tokenize the training and validation data
train_encodings = tokenizer(
    train_df['Cleaned_Content'].tolist(),
    truncation=True,
    padding=True,
    max_length=512
)
val_encodings = tokenizer(
    val_df['Cleaned_Content'].tolist(),
    truncation=True,
    padding=True,
    max_length=512
)

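
As a rough illustration of what `truncation=True`, `padding=True`, and `max_length` do, the sketch below truncates and pads toy sequences of token IDs. It is not the real tokenizer: subword splitting and special-token handling are ignored, and the function name `pad_and_truncate`, the padding ID, and the token IDs are all assumptions made for the example.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all of them to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)

    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for three "reviews" of different lengths.
toy_batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)

for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

In the notebook the same idea applies with `max_length=512`: longer reviews lose their tail, and shorter ones are padded so that every row has the same length.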

In [None]:
# Define a custom dataset class for movie reviews that inherits from PyTorch's Dataset
class MovieReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

# Create training and validation datasets using the custom MovieReviewDataset class
train_dataset = MovieReviewDataset(train_encodings, train_df['label'].tolist())
val_dataset   = MovieReviewDataset(val_encodings, val_df['label'].tolist())

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    acc = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }
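
For reference, the four metrics returned by `compute_metrics` can be worked out by hand from the confusion counts. The sketch below does this in plain Python on a small made-up example; the helper `binary_metrics` and the toy labels are hypothetical and only mirror what the scikit-learn calls compute for the positive class.

```python
def binary_metrics(labels, predictions):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
toy_labels = [1, 0, 1, 1, 0, 1, 0, 0]
toy_predictions = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(toy_labels, toy_predictions)
print(metrics)
```

Only accuracy is passed back to the optimizer later on; precision, recall, and F1 are reported for inspection.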

In [14]:
# This function trains a DistilBERT-based sequence classification model using specified hyperparameters.
# It takes two hyperparameters as input: lr_exponent (which determines the learning rate) and dropout_factor (which sets the dropout rate).
# The function performs the following steps:
# 1. Adjusts the learning rate based on the provided exponent.
# 2. Loads the pre-trained DistilBERT model for sequence classification and configures its dropout settings.
# 3. Sets up training arguments including output directories, batch sizes, evaluation and saving strategies, and logging.
# 4. Initializes the Trainer with the model, training arguments, and datasets.
# 5. Trains the model and evaluates its performance on the validation dataset.
# 6. Prints the evaluation results and returns the evaluation accuracy.

def train_model(lr_exponent, dropout_factor):
    lr_exponent = int(round(lr_exponent))
    learning_rate = 10 ** (-lr_exponent)
    dropout = dropout_factor

    print(f"Training with learning_rate={learning_rate}, dropout={dropout}")

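    # Dropout has to be set when the model is built: changing config attributes after
    # from_pretrained() does not update the nn.Dropout layers that already exist.
    # DistilBertConfig names these fields `dropout` and `attention_dropout`.
    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=2,
        dropout=dropout,
        attention_dropout=dropout,
    )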

    training_args = TrainingArguments(
        output_dir=f"/content/drive/MyDrive/it1244/results_lr{lr_exponent}_drop{dropout:.3f}",
        num_train_epochs=2,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        evaluation_strategy="steps",
        save_strategy="steps",
        save_steps=2500,
        eval_steps=2500,
        learning_rate=learning_rate,
        logging_dir=f"/content/drive/MyDrive/it1244/results_lr{lr_exponent}_drop{dropout:.3f}/logs",
        logging_steps=100,
        disable_tqdm=False,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )

    trainer.train()

    eval_result = trainer.evaluate()
    print("Evaluation result:", eval_result)
    return eval_result["eval_accuracy"]

In [12]:
# Define the hyperparameter search space (bounds) for tuning:
pbounds = {
    "lr_exponent": (3, 7),
    "dropout_factor": (0.1, 0.6)
}
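
The optimizer proposes continuous values inside these bounds, while `train_model` rounds `lr_exponent` to the nearest integer before use. The short sketch below mirrors that rounding to show which learning rate a proposed value ends up as; the helper name `learning_rate_from_exponent` and the sample values are chosen only for illustration.

```python
def learning_rate_from_exponent(lr_exponent):
    """Mirror the rounding in train_model: continuous exponent -> (k, 10 ** -k)."""
    k = int(round(lr_exponent))
    return k, 10 ** (-k)


# Hypothetical values that an optimizer might propose within the bounds (3, 7).
for proposed in (3.2, 4.6, 5.4, 6.9):
    k, lr = learning_rate_from_exponent(proposed)
    print(f"proposed={proposed} -> exponent={k}, learning_rate={lr}")
```

The search therefore covers only five distinct learning rates, from 0.001 down to 0.0000001. Note also that the end values 3 and 7 each receive only half as wide a rounding interval as 4, 5, and 6, so uniformly sampled points land on them less often.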

In [15]:
# Initialize the Bayesian optimizer to tune hyperparameters for the train_model function.
optimizer = BayesianOptimization(
    f=train_model,
    pbounds=pbounds,
    verbose=2,
    random_state=42
)

In [16]:
# JSON logger to log the optimization steps
logger = JSONLogger(path="/content/drive/MyDrive/it1244/bayes_opt_logs.json")
optimizer.subscribe(Events.OPTIMIZATION_STEP, logger)

# Bayesian optimization - using 10 initial random points

In [None]:
# Explore 10 random points first and save the results in bayes_opt_logs.
# These runs were executed separately under different Google accounts due to time constraints, and the logs were then combined.
optimizer.maximize(init_points=10, n_iter=0)

# Bayesian optimization - 5 iterations guided by the surrogate function.

In [None]:
# Using the surrogate function, we explore 5 further points.
optimizer.maximize(init_points=0, n_iter=5)

# Results

In [31]:
# Load the previous logs to continue from where we left off
log_path = "/content/drive/MyDrive/it1244/bayes_opt_logs.json"
load_logs(optimizer, logs=[log_path])

print("Number of points loaded:", len(optimizer.space))
print("Loaded parameters:", optimizer.space.params)
print("Loaded targets:", optimizer.space.target)

Number of points loaded: 15
Loaded parameters: [[0.287 3.   ]
 [0.178 4.   ]
 [0.287 4.   ]
 [0.287 5.   ]
 [0.1   5.   ]
 [0.569 5.   ]
 [0.129 6.   ]
 [0.401 6.   ]
 [0.11  6.   ]
 [0.287 7.   ]
 [0.294 5.   ]
 [0.593 5.   ]
 [0.356 5.   ]
 [0.466 5.   ]
 [0.128 6.   ]]
Loaded targets: [0.5    0.9008 0.9093 0.9303 0.9326 0.9303 0.9044 0.9055 0.8405 0.8587
 0.8997 0.9272 0.9303 0.9319 0.9044]
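
To read the best setting off this log, the values can be paired up and searched for the maximum target. The sketch below copies the numbers printed above into plain Python lists; it assumes, from the value ranges, that the first column is `dropout_factor` and the second is `lr_exponent`. In the notebook itself, `optimizer.max` should give the same information.

```python
# Values copied from the log output above.
# Assumed column order: (dropout_factor, lr_exponent).
params = [
    (0.287, 3), (0.178, 4), (0.287, 4), (0.287, 5), (0.1, 5),
    (0.569, 5), (0.129, 6), (0.401, 6), (0.11, 6), (0.287, 7),
    (0.294, 5), (0.593, 5), (0.356, 5), (0.466, 5), (0.128, 6),
]
targets = [
    0.5, 0.9008, 0.9093, 0.9303, 0.9326, 0.9303, 0.9044, 0.9055, 0.8405, 0.8587,
    0.8997, 0.9272, 0.9303, 0.9319, 0.9044,
]

# Index of the evaluation with the highest validation accuracy.
best_index = max(range(len(targets)), key=targets.__getitem__)
best_dropout_factor, best_exponent = params[best_index]
print(best_dropout_factor, best_exponent, targets[best_index])
```

This prints `0.1 5 0.9326`, that is, a dropout factor of 0.1 and a learning rate of 10^-5.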


# Tuning the batch size

Best lr_exponent: 5 (learning rate 0.00001); best dropout_factor: 0.1.

With these values fixed, we next tune the batch size.

We test batch sizes of 4, 16, and 32.

In [None]:
# Best hyperparameters from Bayesian optimization
from transformers import DistilBertConfig, DistilBertForSequenceClassification

best_dropout = 0.1
best_lr_exponent = 5.0
best_learning_rate = 10 ** (-best_lr_exponent)

config = DistilBertConfig.from_pretrained("distilbert-base-uncased")
config.dropout = best_dropout
config.num_labels = 2

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", config=config
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
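def train_and_evaluate(batch_size):
    # Start every run from a fresh pre-trained model so that all batch sizes are
    # compared from the same initial weights, not from the previous run's weights.
    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", config=config
    )

    # Set training arguments with the current batch size and best hyperparameters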
    training_args = TrainingArguments(
        output_dir=f"/content/drive/MyDrive/it1244/results_lr{best_lr_exponent}_drop{best_dropout:.3f}_bs{batch_size}",
        num_train_epochs=2,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        evaluation_strategy="steps",
        save_strategy="steps",
        save_steps=2500,
        eval_steps=2500,
        learning_rate=best_learning_rate,
        logging_dir=f"/content/drive/MyDrive/it1244/results_lr{best_lr_exponent}_drop{best_dropout:.3f}_bs{batch_size}/logs",
        logging_steps=100,
        disable_tqdm=False,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )

    trainer.train()
    eval_result = trainer.evaluate()
    print(f"Evaluation result for batch size {batch_size}: {eval_result['eval_accuracy']}")
    return eval_result["eval_accuracy"]


Run the experiments with the different batch sizes

In [None]:
# Batch size 4
acc_4 = train_and_evaluate(4)

Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
2500,0.3242,0.375506,0.9058,0.935221,0.872,0.902505
5000,0.3996,0.313761,0.9202,0.918026,0.9228,0.920407
7500,0.3598,0.263918,0.9258,0.940787,0.9088,0.924517
10000,0.2603,0.281242,0.9283,0.92583,0.9312,0.928507
12500,0.1874,0.325242,0.9303,0.917524,0.9456,0.93135
15000,0.1825,0.328688,0.9337,0.936933,0.93,0.933454
17500,0.1474,0.319157,0.9322,0.925897,0.9396,0.932698
20000,0.1665,0.315209,0.9337,0.930856,0.937,0.933918


Evaluation result for batch size 4: 0.9337


In [None]:
# Batch size 16
acc_16 = train_and_evaluate(16)



Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
2500,0.2402,0.204653,0.9229,0.948653,0.8942,0.920622
5000,0.1844,0.211543,0.9318,0.932319,0.9312,0.931759


Evaluation result for batch size 16: 0.9318


In [None]:
# Batch size 32
acc_32 = train_and_evaluate(32)

Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
2500,0.1806,0.191554,0.9272,0.925159,0.9296,0.927374


Evaluation result for batch size 32: 0.9272


Conclusion: a batch size of 4 gave the highest validation accuracy (0.9337, compared with 0.9318 for 16 and 0.9272 for 32).
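
The step counts in the three result tables follow directly from the training set size. The sketch below reproduces them, assuming a single device and no gradient accumulation; the helper name `total_training_steps` is hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps for a full run (single device, no accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 40,000 training examples and 2 epochs, as in the runs above.
steps = {bs: total_training_steps(40_000, bs, 2) for bs in (4, 16, 32)}
for bs, n in steps.items():
    print(f"batch size {bs}: {n} steps")
```

With evaluation every 2,500 steps, this gives 8, 2, and 1 evaluation rows respectively, matching the tables. It also means that the larger batch sizes received far fewer weight updates at the same learning rate, so the comparison may reflect the number of updates as much as the batch size itself.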

# Tuning the number of epochs

Lastly, we extend training from 2 to 4 epochs, keeping the best hyperparameters found so far.

In [None]:
# Resume from the latest batch-size-4 checkpoint and continue training to 4 epochs
model_path = "/content/drive/MyDrive/it1244/results_lr5.0_drop0.100_bs4"

training_args = TrainingArguments(
    output_dir=model_path,
    num_train_epochs=4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=best_learning_rate,  # keep the tuned learning rate instead of the Trainer default
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

trainer.train(resume_from_checkpoint=True)

	logging_steps: 10 (from args) != 100 (from trainer_state.json)
	save_steps: 500 (from args) != 2500 (from trainer_state.json)

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
3,0.2644,0.309588,0.9318,0.92668,0.9378,0.932207
4,0.2046,0.327986,0.9334,0.932708,0.9342,0.933453


TrainOutput(global_step=40000, training_loss=0.12296082060337067, metrics={'train_runtime': 4439.5214, 'train_samples_per_second': 36.04, 'train_steps_per_second': 9.01, 'total_flos': 2.119478378496e+16, 'train_loss': 0.12296082060337067, 'epoch': 4.0})

# Conclusion
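The model achieved its best validation accuracy (0.9337) with 2 epochs, a dropout rate of 0.1, a learning rate of 0.00001 (lr_exponent 5), and a batch size of 4. Extending training to 4 epochs gave 0.9334, so the additional epochs did not improve accuracy.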

The saved model can be found here:
https://drive.google.com/drive/folders/11vozMOVfGNOY5Fxg1Jj-uOp7zobHzLq9?usp=sharing