
<center><br><font size=6>Final Project</font><br>
<font size=5>Advanced Topics in Deep Learning</font><br>
<b><font size=4>Part B</font></b>
<br><font size=4>Training Models like Exercise 5</font><br><br>
Authors: Ido Rappaport & Eran Tascesme
</font></center>

**Submission Details:**
<font size=2>
<br>Ido Rappaport, ID: 322891623
<br>Eran Tascesme, ID: 205708720 </font>


**Import libraries**

❗Note the package versions; they are listed in requirements.txt❗

In [33]:
# Standard libraries
import os
import re
import string
import random
import warnings
from collections import Counter
import gc

# Data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
from gensim import corpora, models
from urllib.parse import urlparse

# Machine learning and deep learning
import torch
from torch.utils.data import DataLoader  # `Dataset` comes from `datasets` below; importing both would shadow this one
from torch import nn, optim
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    accuracy_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay
)

# Hugging Face Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
    set_seed,
    TrainerCallback,
    TrainerState,
    TrainerControl,
    AutoConfig,
    DataCollatorWithPadding,
    RobertaForSequenceClassification,
    MarianMTModel,
    MarianTokenizer
)
from datasets import Dataset, DatasetDict, load_dataset
from transformers.modeling_outputs import SequenceClassifierOutput
import evaluate
from dataclasses import dataclass

# Other libraries
import optuna
import wandb
from tqdm import tqdm

# Filter warnings
warnings.filterwarnings('ignore')

# Download NLTK resources
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

In [34]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [36]:
from huggingface_hub import login
login()

**Load CSV Files**

Based on the results of the training modeled on Exercise 4, we concluded that we can train solely on the clean, truncated dataset after augmentation. This approach also saves time and compute resources.

In [37]:
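# Load CSV files

drive_path = "data/"

train_df = pd.read_csv(drive_path + "train_balanced.csv", encoding="ISO-8859-1")
eval_df = pd.read_csv(drive_path + "val_clean.csv", encoding="ISO-8859-1")

# The tokenizer expects strings, so replace missing texts with empty strings
for df in [train_df, eval_df]:
    df["text"] = df["text"].fillna("").astype(str)

# train_with_optuna_wandb calls Dataset.map, so convert the DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)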

**Training Classes and Methods**

The function `train_with_optuna_wandb` is designed for training a Hugging Face `transformers` model using hyperparameter optimization with Optuna and experiment tracking with Weights & Biases (W&B). It performs the following steps:

*   Sets up W&B for tracking the Optuna trials and the final best model run.
*   Initializes the tokenizer and prepares the datasets.
*   Defines the model initialization, metric computation, and objective function for Optuna.
*   Configures base training arguments for the hyperparameter search.
*   Implements a custom callback to log metrics per epoch during Optuna trials to W&B.
*   Defines the hyperparameter search space for Optuna.
*   Runs the Optuna hyperparameter search to find the best combination of hyperparameters.
*   Prints the details of the best trial found by Optuna.
*   Logs a summary table of all Optuna trials to W&B.
*   Performs a final training run with the best hyperparameters found by Optuna, with W&B logging enabled.
*   Saves the trained model with the best hyperparameters.

This function provides a **general framework** for hyperparameter tuning and experiment tracking for sequence classification tasks using Hugging Face models, Optuna, and W&B.
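
To make the search space in the function below more concrete: `trial.suggest_float(..., log=True)` draws the learning rate uniformly in log space, so each order of magnitude is equally likely. The following is a minimal standard-library sketch of that idea, not Optuna's implementation; `sample_log_uniform` is a hypothetical helper that we introduce only for illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)

# Same bounds as the learning-rate range used in the first search (1e-6 to 1e-4)
samples = [sample_log_uniform(1e-6, 1e-4, rng) for _ in range(10_000)]

# 1e-5 is the geometric midpoint of the range, so about half the draws fall below it
below_midpoint = sum(s < 1e-5 for s in samples) / len(samples)
print(f"fraction below 1e-5: {below_midpoint:.2f}")
```

With a linear scale, almost all draws would land above 1e-5, and the small learning rates would barely be explored.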

In [None]:
def train_with_optuna_wandb(
    project_name, model_name, train_dataset, eval_dataset,
    num_labels=5, n_trials=5, num_train_epochs=5
):
    # Set seed for reproducibility
    set_seed(42)

    # Set W&B environment
    os.environ["WANDB_PROJECT"] = project_name
    # Trainer auto-logging during the search is turned off via report_to=[] below.
    # W&B itself must stay enabled, otherwise the manual wandb.init / wandb.log calls are no-ops.

    # Start single W&B run to track all trials
    wandb_run = wandb.init(project=project_name, name="optuna_search_all_trials", reinit=True)

    # Define custom metrics for step tracking
    wandb.define_metric("epoch")
    wandb.define_metric("eval_accuracy", step_metric="epoch")
    wandb.define_metric("train_accuracy", step_metric="epoch")

    # W&B table for final summary
    trials_table = wandb.Table(columns=[
        "trial", "learning_rate", "batch_size", "weight_decay", "eval_accuracy", "train_accuracy"
    ])

    # Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(example):
        return tokenizer(example["text"], padding="max_length", truncation=True)

    tokenized_train = train_dataset.map(tokenize_function, batched=True, batch_size=64)
    tokenized_eval = eval_dataset.map(tokenize_function, batched=True, batch_size=64)

    # Accuracy metric
    metric = evaluate.load("accuracy")

    def model_init():
        return AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels, ignore_mismatched_sizes=True
        )

    def compute_metrics(eval_pred):
        predictions = eval_pred.predictions.argmax(axis=-1)
        labels = eval_pred.label_ids
        return metric.compute(predictions=predictions, references=labels)

    def compute_objective(metrics):
        return metrics["eval_accuracy"]

    # Base training args (for Optuna search)
    base_training_args = TrainingArguments(
        output_dir=f"{project_name}/temp_run",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        logging_strategy="epoch",
        num_train_epochs=num_train_epochs,
        report_to=[],  # Disable W&B logging during search
        logging_dir=f"{project_name}/logs",
    )

    # Callback for logging per epoch
    class WandbOptunaCallback(TrainerCallback):
        def on_epoch_end(self, args, state, control, **kwargs):
            train_metrics = trainer.evaluate(eval_dataset=tokenized_train, metric_key_prefix="train")
            eval_metrics = trainer.evaluate(eval_dataset=tokenized_eval, metric_key_prefix="eval")

            train_acc = train_metrics.get("train_accuracy", None)
            eval_acc = eval_metrics.get("eval_accuracy", None)

            # Log per epoch with trial info
            wandb.log({
                "eval_accuracy": eval_acc,
                "train_accuracy": train_acc,
                "epoch": state.epoch,
                "trial": state.trial_name,
            })

            # Add final metrics to summary table
            # (in on_epoch_end, state.epoch already counts the epoch that just finished)
            if round(state.epoch) == num_train_epochs:
                trials_table.add_data(
                    state.trial_name,
                    state.trial_params.get("learning_rate"),
                    state.trial_params.get("per_device_train_batch_size"),
                    state.trial_params.get("weight_decay"),
                    eval_acc,
                    train_acc
                )

    # Trainer for Optuna trials
    trainer = Trainer(
        model_init=model_init,
        args=base_training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks=[WandbOptunaCallback()]
    )

    def optuna_hp_space(trial):
        return {
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
            "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [64, 128]),
            "weight_decay": trial.suggest_float("weight_decay", 1e-4, 0.3),
        }

    # Run hyperparameter search
    best_run = trainer.hyperparameter_search(
        direction="maximize",
        backend="optuna",
        hp_space=optuna_hp_space,
        n_trials=n_trials,
        compute_objective=compute_objective,
        study_name="transformers_optuna_study",
        storage=f"sqlite:///{project_name}/optuna_trials.db",
        load_if_exists=True
    )

    print("Best trial:", best_run)

    # Log summary table
    wandb.log({"optuna_trials": trials_table})

    # Finish main W&B run
    wandb.finish()

    # Re-enable W&B for final training run
    os.environ["WANDB_MODE"] = "online"

    # Final training args (W&B enabled)
    final_training_args = TrainingArguments(
        output_dir=f"{project_name}/best_model_run",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        logging_strategy="epoch",
        num_train_epochs=num_train_epochs,
        learning_rate=best_run.hyperparameters["learning_rate"],
        per_device_train_batch_size=best_run.hyperparameters["per_device_train_batch_size"],
        weight_decay=best_run.hyperparameters["weight_decay"],
        report_to=["wandb"],
        logging_dir=f"{project_name}/logs",
        run_name="final_best_model"
    )

    # Final model trainer
    trainer = Trainer(
        model_init=model_init,
        args=final_training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    trainer.train()

    # Save best model
    best_model_path = f"{project_name}/best_model"
    trainer.save_model(best_model_path)
    print(f"Best model saved to {best_model_path}")

    wandb.finish()
    return best_model_path, best_run
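
The `compute_metrics` helper above turns the logits into class predictions with an argmax and then scores accuracy; the `Trainer` reports the result under the `eval_` prefix, which is why the objective reads `metrics["eval_accuracy"]`. A minimal pure-Python sketch of the same computation follows. The function name `accuracy_from_logits` and the toy logits are our own illustrative assumptions, not part of the training code.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the reference label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy logits for four examples and five classes (illustrative values only)
logits = [
    [0.1, 2.0, -1.0, 0.3, 0.0],  # argmax -> class 1
    [1.5, 0.2, 0.1, 0.0, -0.5],  # argmax -> class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],   # argmax -> class 4
    [0.9, 0.8, 0.7, 0.6, 0.5],   # argmax -> class 0
]
labels = [1, 0, 4, 2]

result = accuracy_from_logits(logits, labels)
print(result)  # three of four predictions match -> {'accuracy': 0.75}
```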


**First Model**

twitter-roberta-base-sentiment

The function above saves the best model automatically.

In [None]:
best_model_path, best_roberta_run = train_with_optuna_wandb(
    project_name="roberta_sentiment_cutted_data_exc5",
    model_name="cardiffnlp/twitter-roberta-base-sentiment-latest",
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    num_labels=5,
    n_trials=5,
    num_train_epochs=6
)

**Second Model**

distilbert-base-uncased-finetuned-sst-2-english

The function above saves the best model automatically.

In [None]:
best_model_distil_path, best_distil_run = train_with_optuna_wandb(
    project_name="distilbert_sentiment_5_cutted_data_exc5",
    model_name="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    num_labels=5,
    n_trials=5,
    num_train_epochs=6
)

**Improving the selected models**

To improve model training, we expand the hyperparameter search space (adding gradient accumulation, the optimizer, and the learning-rate scheduler) and increase the number of Optuna trials from 5 to 12.
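
One effect of searching over `gradient_accumulation_steps` as well as the batch size is that the effective batch size per optimizer step (on a single device) becomes their product. The short sketch below enumerates the combinations for the candidate values used in this notebook; the variable names are ours.

```python
from itertools import product

# Candidate values from the search space in this notebook
batch_sizes = [64, 128]
accumulation_steps = [1, 2]

# Effective batch size per optimizer step on one device
effective = sorted({b * a for b, a in product(batch_sizes, accumulation_steps)})
print(effective)  # [64, 128, 256]
```

Note that 128 can be reached in two ways (128 x 1 and 64 x 2), so the two settings overlap partly in what they explore.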

In [38]:
# --- Load CSV files from your Drive ---
drive_path = "data/"

train_df = pd.read_csv(drive_path + "train_balanced.csv", encoding="ISO-8859-1")
eval_df = pd.read_csv(drive_path + "val_clean.csv", encoding="ISO-8859-1")
test_df = pd.read_csv(drive_path + "test_clean.csv", encoding="ISO-8859-1")

for df in [train_df, eval_df, test_df]:
    df['text'] = df['text'].fillna('').astype(str)

# For consistency, rename the label column to 'labels'
train_df = train_df.rename(columns={'label': 'labels'})
eval_df = eval_df.rename(columns={'label': 'labels'})
test_df = test_df.rename(columns={'label': 'labels'})


# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)
test_dataset = Dataset.from_pandas(test_df)


In [41]:
def run_hyperparameter_search_and_train(
    project_name, model_name, train_dataset, eval_dataset, test_dataset,
    num_labels=5, n_trials=12, num_train_epochs=5
):
    # 1. Set W&B Project Environment Variable
    os.environ["WANDB_PROJECT"] = project_name

    # 2. Tokenizer and Data Preparation
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

    tokenized_train = train_dataset.map(tokenize_function, batched=True)
    tokenized_eval = eval_dataset.map(tokenize_function, batched=True)
    tokenized_test = test_dataset.map(tokenize_function, batched=True)

    # 3. Model Initializer (for fresh model in each trial)
    def model_init():
        return AutoModelForSequenceClassification.from_pretrained(
            model_name,
            num_labels=num_labels,
            ignore_mismatched_sizes=True   # Useful for re-initializing head
        )

    # 4. Metrics Computation
    accuracy_metric = evaluate.load("accuracy")
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return accuracy_metric.compute(predictions=predictions, references=labels)

    # 5. Define the Optuna Objective Function
    def objective(trial):
        # A. Suggest hyperparameters
        hp = {
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 5e-5, log=True),
            "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [64, 128]),
            "gradient_accumulation_steps": trial.suggest_categorical("gradient_accumulation_steps", [1, 2]),
            "weight_decay": trial.suggest_float("weight_decay", 0.01, 0.3),
            "optim": trial.suggest_categorical("optim", ["adamw_torch", "adafactor"]),
            "lr_scheduler_type": trial.suggest_categorical("lr_scheduler_type", ["linear", "cosine"]),
        }

        # B. Define Training Arguments for this specific trial
        # Each trial will be a new run in W&B
        trial_run_name = f"trial-{trial.number}"
        output_dir = f"./results/{trial_run_name}"

        training_args = TrainingArguments(
            output_dir=output_dir,
            run_name=trial_run_name,
            # Core training parameters
            num_train_epochs=num_train_epochs,
            per_device_train_batch_size=hp["per_device_train_batch_size"],
            per_device_eval_batch_size=64,
            gradient_accumulation_steps=hp["gradient_accumulation_steps"],
            learning_rate=hp["learning_rate"],
            weight_decay=hp["weight_decay"],
            optim=hp["optim"],
            lr_scheduler_type=hp["lr_scheduler_type"],
            fp16=(device.type == "cuda"),  # Enable mixed precision on GPU; a torch.device never equals a plain string
            # Evaluation and logging
            eval_strategy="epoch",
            save_strategy="epoch",
            logging_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="accuracy",
            report_to="wandb",
            # Efficiency
            save_total_limit=1, # Only keep the best checkpoint
            push_to_hub=False,
        )

        # C. Initialize Trainer
        trainer = Trainer(
            model_init=model_init,
            args=training_args,
            train_dataset=tokenized_train,
            eval_dataset=tokenized_eval,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
        )

        # D. Train and return metric for Optuna
        trainer.train()
        eval_metrics = trainer.evaluate()

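        # E. Close this trial's W&B run and clean up to free memory
        wandb.finish()  # otherwise the next trial keeps logging into the same run
        del trainer
        gc.collect()
        torch.cuda.empty_cache()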

        return eval_metrics["eval_accuracy"]

    # 6. Run Hyperparameter Search
    study = optuna.create_study(direction="maximize", study_name="sentiment-analysis-optimization")
    study.optimize(objective, n_trials=n_trials)

    best_hyperparameters = study.best_trial.params
    print("🏆 Best Hyperparameters Found 🏆")
    print(best_hyperparameters)

    # 7. Train the Final Model with Best Hyperparameters
    print("🚀 Training final model with best hyperparameters...")
    final_training_args = TrainingArguments(
        output_dir="./results/best-model",
        run_name="final-best-model-run",
        # Use best hyperparameters
        **best_hyperparameters,
        # Other fixed settings
        num_train_epochs=num_train_epochs,
        per_device_eval_batch_size=64,
        fp16=(device.type == "cuda"),
        # Evaluation and logging
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        report_to="wandb",
        save_total_limit=1,
    )

    final_trainer = Trainer(
        model=model_init(), # Re-initialize the model
        args=final_training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )

    final_trainer.train()

    # 8. Evaluate the Best Model on the Test Set
    print("\n🧪 Evaluating the final best model on the test dataset...")
    test_results = final_trainer.evaluate(eval_dataset=tokenized_test)

    print("✅ Final Test Results ✅")
    print(f"Test Accuracy: {test_results['eval_accuracy']:.4f}")
    print(f"Test Loss: {test_results['eval_loss']:.4f}")

    # Log test results to the final W&B run
    wandb.log({"test_accuracy": test_results["eval_accuracy"], "test_loss": test_results["eval_loss"]})

    # End the final W&B run
    wandb.finish()


    # 9. Save the Final Model
    best_model_path = f"{project_name}/best_model"
    final_trainer.save_model(best_model_path)

    # Define the path and filename for the weights
    weights_path = f"final_models/{project_name}.pt"
    weights_dir = os.path.dirname(weights_path)

    # Create the directory if it doesn't exist
    os.makedirs(weights_dir, exist_ok=True)

    # Get the state dictionary from the trained model
    model_weights = final_trainer.model.state_dict()

    # Save the state dictionary to the specified .pt file
    torch.save(model_weights, weights_path)

    print(f"Best model saved to {best_model_path}")

    return best_model_path, best_hyperparameters
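
Each trial above uses `EarlyStoppingCallback(early_stopping_patience=2)`. As a rough mental model, training stops once the monitored metric has failed to improve for two consecutive evaluations. The sketch below is a simplified pure-Python approximation of that rule, not the library's code; `epochs_until_early_stop` and the accuracy sequences are hypothetical.

```python
def epochs_until_early_stop(accuracies, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = None
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(accuracies, start=1):
        if best is None or accuracy > best:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return None


# Hypothetical per-epoch validation accuracies, for illustration only
stopped_at = epochs_until_early_stop([0.60, 0.70, 0.69, 0.68, 0.72], patience=2)
print(stopped_at)  # 4: epochs 3 and 4 both fail to beat the best value from epoch 2

never_stopped = epochs_until_early_stop([0.60, 0.65, 0.64, 0.70, 0.71], patience=2)
print(never_stopped)  # None: the metric never stalls for two evaluations in a row
```

With only 5 epochs per trial, a patience of 2 can cut a trial short by at most a few epochs, so its main benefit here is avoiding wasted compute on clearly stalled runs.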

**First Model**

twitter-roberta-base-sentiment

The function above saves the best model automatically.

In [42]:
PROJECT_NAME = "roberta_sentiment_exc5_improved"
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"
N_TRIALS = 12  # Number of Optuna trials to run
N_EPOCHS = 5  # Number of epochs for each training run

# --- Run the experiment ---
best_model_path, best_params = run_hyperparameter_search_and_train(
    project_name=PROJECT_NAME,
    model_name=MODEL_NAME,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    test_dataset=test_dataset,
    num_labels=5,
    n_trials=N_TRIALS,
    num_train_epochs=N_EPOCHS
)


print(f"Best hyperparameters: {best_params}")

[Tokenization progress bars: 48910 training, 4116 validation, and 3798 test examples]

[I 2025-08-20 13:38:13,863] A new study created in memory with name: sentiment-analysis-optimization
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized because th

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0052,0.888696,0.64723
2,0.7563,0.761547,0.715743
3,0.6683,0.732588,0.726676
4,0.6228,0.663192,0.761905
5,0.6046,0.668998,0.76069


[I 2025-08-20 13:45:58,446] Trial 0 finished with value: 0.7619047619047619 and parameters: {'learning_rate': 7.895915816006548e-06, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.07428592513676417, 'optim': 'adamw_torch', 'lr_scheduler_type': 'cosine'}. Best is trial 0 with value: 0.7619047619047619.

Epoch,Training Loss,Validation Loss,Accuracy
1,1.3717,1.245302,0.449708
2,1.1342,1.136316,0.504373
3,1.054,1.093568,0.529397
4,1.0244,1.082297,0.54033
5,1.0133,1.07801,0.541545


[I 2025-08-20 13:54:12,409] Trial 1 finished with value: 0.5415451895043731 and parameters: {'learning_rate': 1.2413671879093193e-06, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 2, 'weight_decay': 0.25146663479302245, 'optim': 'adafactor', 'lr_scheduler_type': 'cosine'}. Best is trial 0 with value: 0.7619047619047619.

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0172,0.912533,0.637026
2,0.7771,0.734702,0.726919
3,0.6896,0.741383,0.723761
4,0.6464,0.704189,0.744412
5,0.629,0.699306,0.74757


[I 2025-08-20 14:05:33,960] Trial 2 finished with value: 0.7475704567541303 and parameters: {'learning_rate': 4.94005092027266e-06, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.24040868508150848, 'optim': 'adafactor', 'lr_scheduler_type': 'cosine'}. Best is trial 0 with value: 0.7619047619047619.

Epoch,Training Loss,Validation Loss,Accuracy
1,1.111,1.005032,0.577745
2,0.8896,0.897978,0.64966
3,0.8098,0.86693,0.663508
4,0.7699,0.819968,0.688533
5,0.7485,0.80351,0.694121


[I 2025-08-20 14:14:03,594] Trial 3 finished with value: 0.6941205053449951 and parameters: {'learning_rate': 2.5761455419722516e-06, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.20450553298785865, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 0 with value: 0.7619047619047619.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.9394,0.908441,0.652575
2,0.6571,0.688859,0.746842
3,0.5413,0.533827,0.8207
4,0.4724,0.50436,0.83309
5,0.4438,0.509243,0.828231


[I 2025-08-20 14:23:39,441] Trial 4 finished with value: 0.8330903790087464 and parameters: {'learning_rate': 1.7377150919570914e-05, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 2, 'weight_decay': 0.06979420782642135, 'optim': 'adafactor', 'lr_scheduler_type': 'cosine'}. Best is trial 4 with value: 0.8330903790087464.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8369,0.735517,0.721817
2,0.5387,0.467734,0.848639
3,0.3967,0.375496,0.88241
4,0.3091,0.340776,0.899903
5,0.2668,0.344051,0.896501


[I 2025-08-20 14:35:01,611] Trial 5 finished with value: 0.8999028182701652 and parameters: {'learning_rate': 2.622034808533247e-05, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.15381645222302778, 'optim': 'adafactor', 'lr_scheduler_type': 'cosine'}. Best is trial 5 with value: 0.8999028182701652.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.9195,0.79538,0.696793
2,0.6708,0.648938,0.769193
3,0.5667,0.611955,0.785714
4,0.5061,0.524849,0.819971
5,0.4681,0.51465,0.826531


[I 2025-08-20 14:43:14,470] Trial 6 finished with value: 0.826530612244898 and parameters: {'learning_rate': 1.5173186120280756e-05, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 2, 'weight_decay': 0.019923542565660583, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 5 with value: 0.8999028182701652.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.9687,0.880853,0.659135
2,0.7279,0.68687,0.750729
3,0.6376,0.668041,0.76069
4,0.583,0.639026,0.771866
5,0.5516,0.603026,0.793732


[I 2025-08-20 14:54:38,292] Trial 7 finished with value: 0.793731778425656 and parameters: {'learning_rate': 7.362337718290002e-06, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.1555516739493777, 'optim': 'adafactor', 'lr_scheduler_type': 'linear'}. Best is trial 5 with value: 0.8999028182701652.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.7996,0.777285,0.718416
2,0.5322,0.503125,0.828474
3,0.4026,0.453306,0.848397
4,0.3176,0.334331,0.902575
5,0.2606,0.321127,0.907677


[I 2025-08-20 15:03:08,331] Trial 8 finished with value: 0.9076773566569485 and parameters: {'learning_rate': 2.6747827764969765e-05, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.10207203856410653, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 8 with value: 0.9076773566569485.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.9494,0.926632,0.646259
2,0.6779,0.706452,0.737366
3,0.572,0.5591,0.808066
4,0.5045,0.542367,0.813411
5,0.466,0.517483,0.822886


[I 2025-08-20 15:12:44,235] Trial 9 finished with value: 0.8228862973760933 and parameters: {'learning_rate': 1.604027763146551e-05, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 2, 'weight_decay': 0.1047229026530841, 'optim': 'adafactor', 'lr_scheduler_type': 'linear'}. Best is trial 8 with value: 0.9076773566569485.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8125,0.67523,0.756074
2,0.5348,0.475522,0.840622
3,0.4054,0.409546,0.871477
4,0.3161,0.326987,0.903547
5,0.2587,0.308292,0.911808


[I 2025-08-20 15:20:28,145] Trial 10 finished with value: 0.9118075801749271 and parameters: {'learning_rate': 3.992042446217103e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.14288238299561276, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 10 with value: 0.9118075801749271.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8106,0.68308,0.750972
2,0.5325,0.468756,0.846939
3,0.4035,0.40641,0.87172
4,0.3135,0.324513,0.902575
5,0.2555,0.304292,0.915209


[I 2025-08-20 15:28:12,632] Trial 11 finished with value: 0.9152089407191448 and parameters: {'learning_rate': 4.06184262483033e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.1434575452929841, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 11 with value: 0.9152089407191448.



🏆 Best Hyperparameters Found 🏆
{'learning_rate': 4.06184262483033e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.1434575452929841, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}

🚀 Training final model with best hyperparameters...


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([3, 768]) in the checkpo

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8115,0.695145,0.748542
2,0.5308,0.470752,0.846453
3,0.4015,0.418822,0.861759
4,0.3112,0.309841,0.909621
5,0.2523,0.299909,0.915452



🧪 Evaluating the final best model on the test dataset...



✅ Final Test Results ✅
Test Accuracy: 0.7451
Test Loss: 0.8611



[Run history sparklines omitted; metrics tracked: eval/accuracy, eval/loss, eval/runtime, eval/samples_per_second, eval/steps_per_second, test_accuracy, test_loss, train/epoch, train/global_step, train/grad_norm]

Metric,Value
eval/accuracy,0.74513
eval/loss,0.86107
eval/runtime,2.9563
eval/samples_per_second,1284.726
eval/steps_per_second,20.296
test_accuracy,0.74513
test_loss,0.86107
total_flos,3.21727708697856e+16
train/epoch,5.0
train/global_step,1915.0


Best model saved to /content/drive/My Drive/Colab Notebooks//roberta_sentiment_exc5_improved/best_model
Best hyperparameters: {'learning_rate': 4.06184262483033e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.1434575452929841, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}
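The Optuna trial lines in the log above follow a fixed pattern, so they can be ranked programmatically instead of read by eye. The following is a minimal sketch, assuming the log text has been captured into a string; `parse_optuna_trials`, `TRIAL_PATTERN`, and `SAMPLE_LOG` are illustrative names that are not part of the notebook. The three sample lines are copied from the output above.

```python
import ast
import re

# Matches lines such as:
# "Trial 11 finished with value: 0.915... and parameters: {...}. Best is trial 11 ..."
TRIAL_PATTERN = re.compile(
    r"Trial (\d+) finished with value: ([0-9.eE+-]+) and parameters: (\{.*?\})\."
)


def parse_optuna_trials(log_text):
    """Return one dict per finished trial found in the captured log text."""
    trials = []
    for number, value, params in TRIAL_PATTERN.findall(log_text):
        trials.append({
            "number": int(number),
            "value": float(value),
            # The parameter dict is printed as a Python literal, so literal_eval can read it.
            "params": ast.literal_eval(params),
        })
    return trials


# Hypothetical captured log; the lines are copied from the RoBERTa run above.
SAMPLE_LOG = """
[I 2025-08-20 15:12:44,235] Trial 9 finished with value: 0.8228862973760933 and parameters: {'learning_rate': 1.604027763146551e-05, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 2, 'weight_decay': 0.1047229026530841, 'optim': 'adafactor', 'lr_scheduler_type': 'linear'}. Best is trial 8 with value: 0.9076773566569485.
[I 2025-08-20 15:20:28,145] Trial 10 finished with value: 0.9118075801749271 and parameters: {'learning_rate': 3.992042446217103e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.14288238299561276, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 10 with value: 0.9118075801749271.
[I 2025-08-20 15:28:12,632] Trial 11 finished with value: 0.9152089407191448 and parameters: {'learning_rate': 4.06184262483033e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.1434575452929841, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 11 with value: 0.9152089407191448.
"""

# Print the trials from best to worst validation accuracy.
for trial in sorted(parse_optuna_trials(SAMPLE_LOG), key=lambda t: t["value"], reverse=True):
    params = trial["params"]
    print(trial["number"], round(trial["value"], 4), params["optim"], params["learning_rate"])
```

If the Optuna `study` object itself is still in memory, `study.trials_dataframe()` should give the same information without any log parsing.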


**Second Model**

Model: `distilbert-base-uncased-finetuned-sst-2-english`

The `run_hyperparameter_search_and_train` function defined above saves the best model automatically.
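Once a run has finished, the saved model can be reloaded for inference. The following is a minimal sketch, not the notebook's own code: `load_best_model` and `predict_class_ids` are hypothetical helpers, and the tokenizer is loaded from the original checkpoint name on the assumption that tokenizer files may not have been written to the `best_model` directory.

```python
def load_best_model(model_dir, tokenizer_name):
    """Load the fine-tuned classifier from model_dir and a tokenizer by name."""
    # Imported inside the function so that defining it does not load the libraries yet.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumption: the tokenizer comes from the original checkpoint, not from model_dir.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def predict_class_ids(texts, tokenizer, model, max_length=128):
    """Return the predicted class id for each text in a list of strings."""
    import torch

    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,  # illustrative value, not taken from the notebook
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**batch).logits
    return logits.argmax(dim=-1).tolist()
```

A call would look like `tokenizer, model = load_best_model(best_model_path, MODEL_NAME)` followed by `predict_class_ids(["some tweet"], tokenizer, model)`, where `best_model_path` and `MODEL_NAME` are the values used in the training cell.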

In [43]:
# --- Configuration ---
PROJECT_NAME = "distilbert_exc5_improved"
MODEL_NAME = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
N_TRIALS = 12 # Number of Optuna trials to run
N_EPOCHS = 5 # Number of epochs for each training run

# --- Run the experiment ---
best_model_path, best_params = run_hyperparameter_search_and_train(
    project_name=PROJECT_NAME,
    model_name=MODEL_NAME,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    test_dataset=test_dataset,
    num_labels=5,
    n_trials=N_TRIALS,
    num_train_epochs=N_EPOCHS
)

print(f"Best hyperparameters: {best_params}")

Map:   0%|          | 0/48910 [00:00<?, ? examples/s]

Map:   0%|          | 0/4116 [00:00<?, ? examples/s]

Map:   0%|          | 0/3798 [00:00<?, ? examples/s]

[I 2025-08-20 15:36:04,722] A new study created in memory with name: sentiment-analysis-optimization
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased-finetuned-sst-2-english and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([5]) in the model instantiated
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy
1,1.4455,1.35251,0.372449
2,1.2068,1.191708,0.480321
3,1.0876,1.117975,0.51725
4,1.0283,1.081685,0.533285
5,1.0029,1.069338,0.539602


[I 2025-08-20 15:42:37,862] Trial 0 finished with value: 0.5396015549076774 and parameters: {'learning_rate': 1.2261744386829436e-06, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.19966675129519196, 'optim': 'adafactor', 'lr_scheduler_type': 'linear'}. Best is trial 0 with value: 0.5396015549076774.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8094,0.679216,0.74757
2,0.4842,0.407293,0.87172
3,0.3406,0.343553,0.894801
4,0.2478,0.269268,0.926628
5,0.1923,0.2581,0.930029


[I 2025-08-20 15:47:12,858] Trial 1 finished with value: 0.9300291545189504 and parameters: {'learning_rate': 4.9972436158457924e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.27020007255509254, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8551,0.762193,0.706268
2,0.5438,0.521815,0.820214
3,0.4138,0.422078,0.864189
4,0.3343,0.353348,0.896259
5,0.287,0.340412,0.901603


[I 2025-08-20 15:52:13,509] Trial 2 finished with value: 0.9016034985422741 and parameters: {'learning_rate': 2.2278244886317156e-05, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.2898119442945294, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,1.1129,0.936486,0.610544
2,0.7878,0.772377,0.710398
3,0.6848,0.716738,0.733965
4,0.637,0.685316,0.751215
5,0.6189,0.689434,0.749514


[I 2025-08-20 15:58:44,040] Trial 3 finished with value: 0.7512147716229349 and parameters: {'learning_rate': 5.674251680300498e-06, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.23371520706968807, 'optim': 'adafactor', 'lr_scheduler_type': 'cosine'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0125,0.842534,0.666667
2,0.7046,0.691873,0.746599
3,0.5957,0.605231,0.787901
4,0.5329,0.550571,0.81001
5,0.4976,0.551666,0.809767


[I 2025-08-20 16:05:20,319] Trial 4 finished with value: 0.8100097181729835 and parameters: {'learning_rate': 9.407362800799648e-06, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.2901940366618316, 'optim': 'adafactor', 'lr_scheduler_type': 'linear'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8986,0.721363,0.736152
2,0.5765,0.581078,0.789359
3,0.4309,0.397409,0.876336
4,0.3473,0.365377,0.890185
5,0.3091,0.370397,0.885569


[I 2025-08-20 16:10:38,654] Trial 5 finished with value: 0.8901846452866861 and parameters: {'learning_rate': 2.8959158057796525e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.2535854586045517, 'optim': 'adafactor', 'lr_scheduler_type': 'cosine'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,1.2777,1.107935,0.514334
2,0.9432,0.937133,0.614189
3,0.8414,0.886985,0.637269
4,0.8016,0.864582,0.655734
5,0.7882,0.862616,0.657434


[I 2025-08-20 16:15:44,306] Trial 6 finished with value: 0.6574344023323615 and parameters: {'learning_rate': 2.6751054095430185e-06, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.12386876647797875, 'optim': 'adamw_torch', 'lr_scheduler_type': 'cosine'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,1.4536,1.347579,0.37828
2,1.1952,1.169762,0.494655
3,1.0679,1.101823,0.525024
4,1.0189,1.078348,0.536443
5,1.0024,1.073107,0.54033


[I 2025-08-20 16:20:23,331] Trial 7 finished with value: 0.5403304178814383 and parameters: {'learning_rate': 1.6432129038775826e-06, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.2127389298048047, 'optim': 'adamw_torch', 'lr_scheduler_type': 'cosine'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,1.4998,1.423102,0.315841
2,1.2966,1.272807,0.428571
3,1.1767,1.202662,0.475462
4,1.1252,1.178939,0.485666
5,1.1106,1.17474,0.486395


[I 2025-08-20 16:25:14,038] Trial 8 finished with value: 0.48639455782312924 and parameters: {'learning_rate': 1.1581872764302e-06, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 2, 'weight_decay': 0.18869869280580823, 'optim': 'adamw_torch', 'lr_scheduler_type': 'cosine'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8358,0.64442,0.763362
2,0.5124,0.466787,0.843052
3,0.3661,0.333274,0.901118
4,0.2791,0.303811,0.911079
5,0.2408,0.304305,0.912779


[I 2025-08-20 16:31:46,728] Trial 9 finished with value: 0.9127793974732751 and parameters: {'learning_rate': 2.778132261251681e-05, 'per_device_train_batch_size': 64, 'gradient_accumulation_steps': 1, 'weight_decay': 0.2570040468096332, 'optim': 'adafactor', 'lr_scheduler_type': 'cosine'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8734,0.643207,0.777211
2,0.5385,0.476406,0.840865
3,0.4047,0.381441,0.880952
4,0.3179,0.336411,0.902332
5,0.2662,0.323556,0.905248


[I 2025-08-20 16:36:16,830] Trial 10 finished with value: 0.9052478134110787 and parameters: {'learning_rate': 4.983894627979562e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 2, 'weight_decay': 0.025444795906912232, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}. Best is trial 1 with value: 0.9300291545189504.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.9323,0.874471,0.650389
2,0.5948,0.521488,0.825316
3,0.4432,0.396649,0.876822
4,0.3429,0.341278,0.89966
5,0.2843,0.336664,0.898445


[I 2025-08-20 16:41:07,924] Trial 11 finished with value: 0.8996598639455783 and parameters: {'learning_rate': 4.910662214106868e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 2, 'weight_decay': 0.1283730019374268, 'optim': 'adafactor', 'lr_scheduler_type': 'linear'}. Best is trial 1 with value: 0.9300291545189504.



🏆 Best Hyperparameters Found 🏆
{'learning_rate': 4.9972436158457924e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.27020007255509254, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}

🚀 Training final model with best hyperparameters...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased-finetuned-sst-2-english and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([5]) in the model instantiated
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.807,0.680587,0.744898
2,0.4869,0.416892,0.864674
3,0.3435,0.332655,0.899417
4,0.2505,0.274667,0.919096
5,0.192,0.257114,0.929786



🧪 Evaluating the final best model on the test dataset...



✅ Final Test Results ✅
Test Accuracy: 0.7470
Test Loss: 0.8909



[Run history sparklines omitted; metrics tracked: eval/accuracy, eval/loss, eval/runtime, eval/samples_per_second, eval/steps_per_second, test_accuracy, test_loss, train/epoch, train/global_step, train/grad_norm]

Metric,Value
eval/accuracy,0.74697
eval/loss,0.89089
eval/runtime,2.1244
eval/samples_per_second,1787.824
eval/steps_per_second,28.244
test_accuracy,0.74697
test_loss,0.89089
total_flos,1.6198317746304e+16
train/epoch,5.0
train/global_step,1915.0


Best model saved to /content/drive/My Drive/Colab Notebooks//distilbert_exc5_improved/best_model
Best hyperparameters: {'learning_rate': 4.9972436158457924e-05, 'per_device_train_batch_size': 128, 'gradient_accumulation_steps': 1, 'weight_decay': 0.27020007255509254, 'optim': 'adamw_torch', 'lr_scheduler_type': 'linear'}
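Both final models reach a validation accuracy above 0.91 but a test accuracy of about 0.75. The short sketch below makes that gap explicit for each model. The numbers are copied from the logs above (final-run epoch 5 validation accuracy and the reported test accuracy); the names `RESULTS`, `accuracy_gap`, and `GAPS` are illustrative only.

```python
# Final-run validation accuracy (epoch 5) and test accuracy, copied from the logs above.
RESULTS = {
    "twitter-roberta-base-sentiment-latest": {"val_acc": 0.915452, "test_acc": 0.74513},
    "distilbert-base-uncased-finetuned-sst-2-english": {"val_acc": 0.929786, "test_acc": 0.74697},
}


def accuracy_gap(val_acc, test_acc):
    """Validation accuracy minus test accuracy (positive means validation is higher)."""
    return val_acc - test_acc


GAPS = {name: accuracy_gap(**scores) for name, scores in RESULTS.items()}

for name, gap in GAPS.items():
    scores = RESULTS[name]
    print(f"{name}: val={scores['val_acc']:.4f} test={scores['test_acc']:.4f} gap={gap:.4f}")
```

A gap of this size suggests the validation score overstates performance on the test set; the logs alone do not show why, so the cause would need a separate check of how the splits were built.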


<center><h1>END</h1></center>
