
<center><br><font size=6>Final Project</font><br>
<font size=5>Advanced Topics in Deep Learning</font><br>
<b><font size=4>Part B</font></b>
<br><font size=4>Training Models like Exercise 5</font><br><br>
Authors: Ido Rappaport & Eran Tascesme
</center>

**Submission Details:**
<font size=2>
<br>Ido Rappaport, ID: 322891623
<br>Eran Tascesme, ID: 205708720 </font>


In [None]:
'''
!pip install optuna
!pip install wandb
!pip install nlpaug
!pip install gensim
!pip install evaluate
'''

**Import libraries**

❗Note the package versions; they are listed in requirements.txt❗
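
One way to compare the running environment against requirements.txt is to print the installed versions before importing anything heavy. The snippet below is a minimal sketch: `installed_versions` is a hypothetical helper, and the package list is taken from the install cell above rather than from requirements.txt itself.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages from the install cell; extend the list to cover requirements.txt
report = installed_versions(["optuna", "wandb", "nlpaug", "gensim", "evaluate"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Any line that prints "not installed", or a version that differs from requirements.txt, is worth resolving before training.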

In [None]:
# Standard libraries
import os
import re
import string
import random
import warnings
from collections import Counter

# Data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
from gensim import corpora, models
from urllib.parse import urlparse

# Machine learning and deep learning
import torch
# Aliased because datasets.Dataset is imported below under the same name
from torch.utils.data import DataLoader, Dataset as TorchDataset
from torch import nn, optim
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    accuracy_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay
)

# Hugging Face Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
    set_seed,
    TrainerCallback,
    TrainerState,
    TrainerControl,
    DataCollatorWithPadding,
    RobertaForSequenceClassification,
    MarianMTModel,
    MarianTokenizer
)
from datasets import Dataset, DatasetDict, load_dataset
from transformers.modeling_outputs import SequenceClassifierOutput
import evaluate

# Other libraries
import optuna
import wandb
from tqdm import tqdm

# Filter warnings
warnings.filterwarnings('ignore')

# Download NLTK resources
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from huggingface_hub import login
login()  # Prompts for the access token; never hard-code a secret token in the notebook

**Load CSV Files**

Based on the training results from Exercise 4, we concluded that we can train solely on the clean, truncated dataset after augmentation. This also saves time and compute resources.

In [None]:
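# Load CSV files and wrap them as Hugging Face Datasets,
# because train_with_optuna_wandb calls Dataset.map(..., batched=True) on them

drive_path = "/content/drive/My Drive/Colab Notebooks/"

train_dataset = Dataset.from_pandas(pd.read_csv(drive_path + "train_balanced.csv", encoding="ISO-8859-1"))
eval_dataset = Dataset.from_pandas(pd.read_csv(drive_path + "val_balanced.csv", encoding="ISO-8859-1"))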

**Training Classes and Methods**

The function `train_with_optuna_wandb` is designed for training a Hugging Face `transformers` model using hyperparameter optimization with Optuna and experiment tracking with Weights & Biases (W&B). It performs the following steps:

*   Sets up W&B for tracking the Optuna trials and the final best model run.
*   Initializes the tokenizer and prepares the datasets.
*   Defines the model initialization, metric computation, and objective function for Optuna.
*   Configures base training arguments for the hyperparameter search.
*   Implements a custom callback to log metrics per epoch during Optuna trials to W&B.
*   Defines the hyperparameter search space for Optuna.
*   Runs the Optuna hyperparameter search to find the best combination of hyperparameters.
*   Prints the details of the best trial found by Optuna.
*   Logs a summary table of all Optuna trials to W&B.
*   Performs a final training run with the best hyperparameters found by Optuna, with W&B logging enabled.
*   Saves the trained model with the best hyperparameters.

This function provides a **general framework** for hyperparameter tuning and experiment tracking for sequence classification tasks using Hugging Face models, Optuna, and W&B.
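
To make the search space concrete, the sketch below draws a few candidate configurations from the same ranges used in `optuna_hp_space`. It is an illustration only: Optuna chooses values adaptively, whereas this example uses plain random sampling, and `sample_log_uniform` and `sample_hyperparameters` are hypothetical helpers that do not appear in the notebook.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_hyperparameters(rng):
    """Draw one configuration from the ranges used in optuna_hp_space."""
    return {
        # Log scale: each order of magnitude between 1e-6 and 1e-4 is equally likely
        "learning_rate": sample_log_uniform(rng, 1e-6, 1e-4),
        # Categorical choice between the two batch sizes
        "per_device_train_batch_size": rng.choice([64, 128]),
        # Linear scale for weight decay
        "weight_decay": rng.uniform(1e-4, 0.3),
    }


rng = random.Random(42)
trials = [sample_hyperparameters(rng) for _ in range(5)]
for i, params in enumerate(trials):
    print(i, params)
```

Sampling the learning rate on a log scale matters because useful values span two orders of magnitude; a linear draw would almost never propose values near 1e-6.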

In [None]:
def train_with_optuna_wandb(
    project_name, model_name, train_dataset, eval_dataset,
    num_labels=5, n_trials=5, num_train_epochs=5
):
    # Set seed for reproducibility
    set_seed(42)

    # Set W&B environment
    os.environ["WANDB_PROJECT"] = project_name
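    # Keep W&B enabled here: WANDB_MODE="disabled" would also turn the manual
    # wandb.log calls below into no-ops. Trainer auto-logging is already off
    # for the trials via report_to=[].
    os.environ["WANDB_MODE"] = "online"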

    # Start single W&B run to track all trials
    wandb_run = wandb.init(project=project_name, name="optuna_search_all_trials", reinit=True)

    # Define custom metrics for step tracking
    wandb.define_metric("epoch")
    wandb.define_metric("eval_accuracy", step_metric="epoch")
    wandb.define_metric("train_accuracy", step_metric="epoch")

    # W&B table for final summary
    trials_table = wandb.Table(columns=[
        "trial", "learning_rate", "batch_size", "weight_decay", "eval_accuracy", "train_accuracy"
    ])

    # Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(example):
        return tokenizer(example["text"], padding="max_length", truncation=True)

    tokenized_train = train_dataset.map(tokenize_function, batched=True, batch_size=64)
    tokenized_eval = eval_dataset.map(tokenize_function, batched=True, batch_size=64)

    # Accuracy metric
    metric = evaluate.load("accuracy")

    def model_init():
        return AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels, ignore_mismatched_sizes=True
        )

    def compute_metrics(eval_pred):
        predictions = eval_pred.predictions.argmax(axis=-1)
        labels = eval_pred.label_ids
        return metric.compute(predictions=predictions, references=labels)

    def compute_objective(metrics):
        return metrics["eval_accuracy"]

    # Base training args (for Optuna search)
    base_training_args = TrainingArguments(
        output_dir=f"/content/drive/MyDrive/Colab Notebooks/{project_name}/temp_run",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        logging_strategy="epoch",
        num_train_epochs=num_train_epochs,
        report_to=[],  # Disable W&B logging during search
        logging_dir=f"/content/drive/MyDrive/Colab Notebooks/{project_name}/logs",
    )

    # Callback for logging per epoch
    class WandbOptunaCallback(TrainerCallback):
        def on_epoch_end(self, args, state, control, **kwargs):
            train_metrics = trainer.evaluate(eval_dataset=tokenized_train, metric_key_prefix="train")
            eval_metrics = trainer.evaluate(eval_dataset=tokenized_eval, metric_key_prefix="eval")

            train_acc = train_metrics.get("train_accuracy", None)
            eval_acc = eval_metrics.get("eval_accuracy", None)

            # Log per epoch with trial info
            wandb.log({
                "eval_accuracy": eval_acc,
                "train_accuracy": train_acc,
                "epoch": state.epoch,
                "trial": state.trial_name,
            })

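            # Add final metrics to summary table once the last epoch has finished
            # (state.epoch is a float that has already reached num_train_epochs here)
            if round(state.epoch) >= num_train_epochs: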
                trials_table.add_data(
                    state.trial_name,
                    state.trial_params.get("learning_rate"),
                    state.trial_params.get("per_device_train_batch_size"),
                    state.trial_params.get("weight_decay"),
                    eval_acc,
                    train_acc
                )

    # Trainer for Optuna trials
    trainer = Trainer(
        model_init=model_init,
        args=base_training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks=[WandbOptunaCallback()]
    )

    def optuna_hp_space(trial):
        return {
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
            "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [64, 128]),
            "weight_decay": trial.suggest_float("weight_decay", 1e-4, 0.3),
        }

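    # SQLite cannot create missing directories, so make sure the study folder exists
    os.makedirs(f"/content/drive/MyDrive/Colab Notebooks/{project_name}", exist_ok=True)

    # Run hyperparameter search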
    best_run = trainer.hyperparameter_search(
        direction="maximize",
        backend="optuna",
        hp_space=optuna_hp_space,
        n_trials=n_trials,
        compute_objective=compute_objective,
        study_name="transformers_optuna_study",
        storage=f"sqlite:////content/drive/MyDrive/Colab Notebooks/{project_name}/optuna_trials.db",
        load_if_exists=True
    )

    print("Best trial:", best_run)

    # Log summary table
    wandb.log({"optuna_trials": trials_table})

    # Finish main W&B run
    wandb.finish()

    # Re-enable W&B for final training run
    os.environ["WANDB_MODE"] = "online"

    # Final training args (W&B enabled)
    final_training_args = TrainingArguments(
        output_dir=f"/content/drive/MyDrive/Colab Notebooks/{project_name}/best_model_run",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        logging_strategy="epoch",
        num_train_epochs=num_train_epochs,
        learning_rate=best_run.hyperparameters["learning_rate"],
        per_device_train_batch_size=best_run.hyperparameters["per_device_train_batch_size"],
        weight_decay=best_run.hyperparameters["weight_decay"],
        report_to=["wandb"],
        logging_dir=f"/content/drive/MyDrive/Colab Notebooks/{project_name}/logs",
        run_name="final_best_model"
    )

    # Final model trainer
    trainer = Trainer(
        model_init=model_init,
        args=final_training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    trainer.train()

    # Save best model
    best_model_path = f"/content/drive/MyDrive/Colab Notebooks/{project_name}/best_model"
    trainer.save_model(best_model_path)
    print(f"Best model saved to {best_model_path}")

    wandb.finish()
    return best_model_path, best_run


**First Model**

twitter-roberta-base-sentiment-latest

The function above saves the best model automatically.
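
The locations the function writes to can be summarized with a small helper. This is a sketch: `artifact_paths` and `DRIVE_ROOT` are hypothetical names, and the folder layout is read off the path strings in `train_with_optuna_wandb`.

```python
from pathlib import PurePosixPath

# Root folder used by the path strings inside train_with_optuna_wandb
DRIVE_ROOT = PurePosixPath("/content/drive/MyDrive/Colab Notebooks")


def artifact_paths(project_name):
    """Map each artifact of a run to the path derived from project_name."""
    project_dir = DRIVE_ROOT / project_name
    return {
        "search_checkpoints": project_dir / "temp_run",
        "final_checkpoints": project_dir / "best_model_run",
        "logs": project_dir / "logs",
        "optuna_db": project_dir / "optuna_trials.db",
        "best_model": project_dir / "best_model",
    }


paths = artifact_paths("roberta_sentiment_cutted_data_exc5")
print(paths["best_model"])
print(paths["optuna_db"])
```

Every path depends on `project_name`, so a cleanup command that targets the Optuna database has to use the same folder name as the call to the training function.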

In [None]:
# Remove any stale Optuna study database so the search starts from scratch
# (the folder name must match project_name below)
!rm -f "/content/drive/MyDrive/Colab Notebooks/roberta_sentiment_cutted_data_exc5/optuna_trials.db"

best_model_path, best_roberta_run = train_with_optuna_wandb(
    project_name="roberta_sentiment_cutted_data_exc5",
    model_name="cardiffnlp/twitter-roberta-base-sentiment-latest",
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    num_labels=5,
    n_trials=5,
    num_train_epochs=6
)

**Second Model**

distilbert-base-uncased-finetuned-sst-2-english

The function above saves the best model automatically.

In [None]:
# Remove any stale Optuna study database (the folder name must match project_name below)
!rm -f "/content/drive/MyDrive/Colab Notebooks/distilbert_sentiment_5_cutted_data_exc5/optuna_trials.db"

best_model_distil_path, best_distil_run = train_with_optuna_wandb(
    project_name="distilbert_sentiment_5_cutted_data_exc5",
    model_name="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    num_labels=5,
    n_trials=5,
    num_train_epochs=6
)

<center><h1>END</h1></center>
