
<center><br><font size=6>Final Project</font><br>
<font size=5>Advanced Topics in Deep Learning</font><br>
<b><font size=4>Part B</font></b>
<br><font size=4>Training Models like Exercise 4</font><br><br>
Authors: Ido Rappaport & Eran Tascesme
</font></center>

**Submission Details:**
<font size=2>
<br>Ido Rappaport, ID: 322891623
<br>Eran Tascesme , ID: 205708720 </font>


**Import libraries**

In [2]:
# Standard libraries
import os
import re
import string
import random
import warnings
from collections import Counter
from functools import partial  # used below to bind arguments to the Optuna objective

# Data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
from gensim import corpora, models
from urllib.parse import urlparse

# Machine learning and deep learning
import torch
from torch.utils.data import DataLoader, Dataset
from torch import nn, optim
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    accuracy_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay
)

# Hugging Face Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
    set_seed,
    TrainerCallback,
    TrainerState,
    TrainerControl,
    DataCollatorWithPadding,
    RobertaForSequenceClassification,
    MarianMTModel,
    MarianTokenizer
)
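# Aliased so it does not shadow torch.utils.data.Dataset, which TweetDataset subclasses
from datasets import Dataset as HFDataset, DatasetDict, load_dataset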
from transformers.modeling_outputs import SequenceClassifierOutput
import evaluate

# Other libraries
import optuna
import wandb
from tqdm import tqdm

# Filter warnings
warnings.filterwarnings('ignore')

# Download NLTK resources
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

In [4]:
from huggingface_hub import login
login()

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

**Load 4 CSV Files**
1. dirty - the original set from Kaggle
2. clean - after removing symbols and stopwords
3. cutted - after removing tweets that are too short or too long
4. balanced - after augmentation

In [7]:
# Load CSV files

drive_path = "data/"

train_dirty = pd.read_csv(drive_path + "train_dirty.csv", encoding="ISO-8859-1")
train_clean = pd.read_csv(drive_path + "train_clean.csv", encoding="ISO-8859-1")
train_cutted = pd.read_csv(drive_path + "train_cutted.csv", encoding="ISO-8859-1")
train_balanced = pd.read_csv(drive_path + "train_balanced.csv", encoding="ISO-8859-1")

val_dirty = pd.read_csv(drive_path + "val_dirty.csv", encoding="ISO-8859-1")
val_clean = pd.read_csv(drive_path + "val_clean.csv", encoding="ISO-8859-1")
val_cutted = pd.read_csv(drive_path + "val_cutted.csv", encoding="ISO-8859-1")
val_balanced = val_cutted.copy()  # augmentation is applied only to the training set, so validation reuses the cutted set

**Tweet Dataset Class**

This class is a custom PyTorch Dataset to handle the tweet data. It tokenizes the text and prepares it for use with a transformer model. It includes filtering for empty texts.

In [8]:
class TweetDataset(Dataset):
    def __init__(self, dataframe, tokenizer):
        # Ensure 'text' column is string type and filter out any empty strings
        self.dataframe = dataframe.copy()
        self.dataframe['text'] = self.dataframe['text'].astype(str)
        self.dataframe = self.dataframe[self.dataframe['text'].str.strip().astype(bool)].reset_index(drop=True)

        self.texts = self.dataframe['text'].tolist()
        self.labels = self.dataframe['label'].tolist()
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # Defensive checks; __init__ already casts to str and drops empty texts
        if not isinstance(text, str):
            print(f"Warning: Non-string text found at index {idx}: {text} (type: {type(text)}).")
            text = str(text) if text is not None else ""  # Attempt conversion

        if not text.strip():
            print(f"Warning: Empty text found at index {idx}. This should have been filtered in __init__.")
            return None  # collate_fn drops None items

        encoding = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )

        # Ensure the tensors are not empty after encoding
        if encoding['input_ids'].nelement() == 0:
            print(f"Warning: Empty encoding for text at index {idx}: '{text}'.")
            return None  # Return None if encoding is empty

        # Squeeze to remove batch dim (1) so shape is [seq_len]
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(label, dtype=torch.long)
        }
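
The dataset above pads every tweet to `max_length=512`. As a rough illustration of what that costs compared with padding each batch only to its longest example, the sketch below counts the token positions the model would process under both strategies. The token counts are made-up values, not measurements from this dataset, and `padded_positions` is a hypothetical helper introduced only for this example.

```python
def padded_positions(lengths, batch_size, fixed_length=None):
    """Count the token positions a model processes after padding.

    lengths: per-example token counts (hypothetical values).
    fixed_length: pad every example to this length; if None, pad each
    batch to the length of its longest example (dynamic padding).
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = fixed_length if fixed_length is not None else max(batch)
        total += target * len(batch)
    return total


# Hypothetical token counts for eight short tweets.
token_counts = [18, 25, 31, 12, 40, 22, 28, 35]

fixed = padded_positions(token_counts, batch_size=4, fixed_length=512)
dynamic = padded_positions(token_counts, batch_size=4)

print(fixed)    # 8 examples * 512 positions = 4096
print(dynamic)  # (4 * 31) + (4 * 40) = 284
```

`DataCollatorWithPadding`, which is already imported at the top of the notebook, is one way to get per-batch padding in the Hugging Face ecosystem; whether it pays off here depends on the real length distribution of the tweets.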

**Training Classes and Methods**

We used validation accuracy both as the Optuna objective and to select the best checkpoint, and we also monitored loss, precision, recall, and F1-score.

We used Adam as our optimizer and CrossEntropyLoss as our criterion.

Layer freezing depends on the architecture, because each model class exposes its backbone under a different attribute. To keep the training code general, we added a `model_type` variable that selects the matching freezing logic.

We opted for a small number of trials (5 per study) and epochs (6 per trial) to save time and resources.
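
The training loop below computes precision, recall, and F1 with `average='micro'`. For single-label multiclass predictions, micro-averaging pools all classes before dividing, so the result coincides with accuracy, while macro-averaging weights every class equally. The sketch below shows this for precision in plain Python on toy labels; the labels and the helper `micro_and_macro_precision` are illustrative assumptions, not part of the notebook.

```python
def micro_and_macro_precision(y_true, y_pred):
    """Return (micro, macro) precision for single-label multiclass data."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}  # true positives per predicted class
    fp = {c: 0 for c in classes}  # false positives per predicted class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1
    # Micro: pool the counts over all classes, then divide once.
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    # Macro: compute precision per class, then take the unweighted mean.
    per_class = [
        tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        for c in classes
    ]
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Toy labels, chosen only to make the difference visible.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro, macro = micro_and_macro_precision(y_true, y_pred)

print(accuracy, micro)  # identical values
print(round(macro, 4))  # differs, because classes are weighted equally
```

In scikit-learn the corresponding switch is `average='macro'` (or `'weighted'`), which gives metrics that carry information beyond accuracy when classes are imbalanced.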

In [9]:
def early_stop_check(patience, best_val_accuracy, best_val_accuracy_epoch, current_val_accuracy, current_val_accuracy_epoch):
    early_stop_flag = False
    if current_val_accuracy > best_val_accuracy:
        best_val_accuracy = current_val_accuracy
        best_val_accuracy_epoch = current_val_accuracy_epoch
    else:
        if current_val_accuracy_epoch - best_val_accuracy_epoch > patience:
            early_stop_flag = True
    return best_val_accuracy, best_val_accuracy_epoch, early_stop_flag
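
To make the stopping rule concrete, here is a small trace of the same logic over a made-up sequence of validation accuracies. The function body mirrors `early_stop_check` above (with shortened argument names) so that the example runs on its own; the accuracy values are hypothetical.

```python
def early_stop_check(patience, best_acc, best_epoch, current_acc, current_epoch):
    # Same logic as the function above, reproduced so the example is self-contained.
    stop = False
    if current_acc > best_acc:
        best_acc, best_epoch = current_acc, current_epoch
    elif current_epoch - best_epoch > patience:
        stop = True
    return best_acc, best_epoch, stop


# Hypothetical validation accuracies for epochs 1..7.
val_accuracies = [0.50, 0.55, 0.54, 0.53, 0.55, 0.52, 0.51]
patience = 3

best_acc, best_epoch = 0.0, 0
stopped_at = None
for epoch, acc in enumerate(val_accuracies, start=1):
    best_acc, best_epoch, stop = early_stop_check(
        patience, best_acc, best_epoch, acc, epoch
    )
    if stop:
        stopped_at = epoch
        break

print(best_acc, best_epoch, stopped_at)
```

Two details follow from the strict comparisons: a tie with the best accuracy (epoch 5 here) does not reset the counter, and training stops only once more than `patience` epochs have passed since the best one. With the 6 epochs used in this notebook, a patience of 3 can therefore shorten a trial only when the best accuracy occurs in epoch 1.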

In [10]:
def train_model_with_hyperparams(model, project_name, train_loader, val_loader, optimizer, criterion, epochs, patience, trial):
    best_val_accuracy = 0.0
    best_val_accuracy_epoch = 0
    early_stop_flag = False
    best_model_state = None
    print(f"trial= {trial.number}")

    for epoch in range(1, epochs + 1):
        print(f"epoch= {epoch}")
        model.train()
        train_loss = 0.0
        total_train_samples = 0
        correct_train_predictions = 0

        for batch in train_loader:  # Iterate over batches of training data
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            optimizer.zero_grad()   # Reset gradients
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels) # Forward pass
            logits = outputs.logits   # save the logits (the raw output of the model)
            loss = criterion(logits, labels)   # Calculate loss

            loss.backward() # Backward pass
            optimizer.step() # Update weights using the optimizer

            # Accumulate training loss and predictions
            train_loss += loss.item() * input_ids.size(0)
            total_train_samples += input_ids.size(0)
            correct_train_predictions += (logits.argmax(dim=1) == labels).sum().item()

        train_loss /= total_train_samples
        train_accuracy = correct_train_predictions / total_train_samples

        ###  Validation loop  ###
        model.eval() # Enable evaluation mode
        val_loss = 0.0
        total_val_samples = 0
        correct_val_predictions = 0

        all_val_labels = []
        all_val_preds = []

        with torch.no_grad(): # Disable gradient computation
            for batch in val_loader: # iterate on the val_loader's batches
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = model(input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                loss = criterion(logits, labels)

                val_loss += loss.item() * input_ids.size(0)
                total_val_samples += input_ids.size(0)
                correct_val_predictions += (logits.argmax(dim=1) == labels).sum().item()

                all_val_labels.extend(labels.cpu().numpy())
                all_val_preds.extend(logits.argmax(dim=1).cpu().numpy())

        # calculate metrics
        val_loss /= total_val_samples
        val_accuracy = correct_val_predictions / total_val_samples
        # For single-label multiclass data, micro-averaged precision, recall, and F1 all equal accuracy
        val_precision = precision_score(all_val_labels, all_val_preds, average='micro')
        val_recall = recall_score(all_val_labels, all_val_preds, average='micro')
        val_f1 = f1_score(all_val_labels, all_val_preds, average='micro')

        # Check for early stopping
        best_val_accuracy, best_val_accuracy_epoch, early_stop_flag = early_stop_check(patience, best_val_accuracy, best_val_accuracy_epoch, val_accuracy, epoch)

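        # Snapshot the best weights. state_dict() returns references to the live tensors,
        # so copy them; otherwise later epochs would overwrite the snapshot in place.
        if val_accuracy == best_val_accuracy:
            best_model_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}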

        # Log metrics to Weights & Biases 
        wandb.log({
            "Epoch": epoch,
            "Train Loss": train_loss,
            "Train Accuracy": train_accuracy,
            "Validation Loss": val_loss,
            "Validation Accuracy": val_accuracy,
            "Validation Precision": val_precision,
            "Validation Recall": val_recall,
            "Validation F1": val_f1})

        if early_stop_flag:  # Stop once validation accuracy has gone more than `patience` epochs without improving
            break

    if best_model_state is not None: # Save the best model as a .pt file
        base_output_dir = "models/"
        output_dir = os.path.join(base_output_dir, project_name)
        os.makedirs(output_dir, exist_ok=True)
        # model_name is not a parameter of this function; it is read from the notebook's global scope
        sanitized_model_name = model_name.replace("/", "-")
        file_path = os.path.join(output_dir, f"best_{sanitized_model_name}_model_trial_{trial.number}.pt")

        torch.save(best_model_state, file_path)

    return best_val_accuracy

In [11]:
# Objective Function for Optuna
def objective(trial, train_df, eval_df, project_name, model_name, autotokenizer, automodelclassification, model_type):
    # Hyperparameter suggestions
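    # suggest_loguniform is deprecated in recent Optuna releases; suggest_float(..., log=True) is the equivalent
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-4, log=True)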
    patience = trial.suggest_int("patience", 3, 4)
    batch_size = trial.suggest_categorical("batch_size", [64, 128])
    num_layers = trial.suggest_int("num_layers", 1, 2)

    train_dataset = TweetDataset(train_df, autotokenizer)
    val_dataset = TweetDataset(eval_df, autotokenizer)

    def collate_fn(batch):
        # Filter out None values from the batch
        batch = [item for item in batch if item is not None]
        if not batch: # Return None if the batch is empty after filtering
            return None
        input_ids = torch.stack([item['input_ids'] for item in batch])
        attention_mask = torch.stack([item['attention_mask'] for item in batch])
        labels = torch.stack([item['labels'] for item in batch])
        return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn) # insert into a DataLoader
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn) # insert into a DataLoader

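    # The same model object is passed to every trial, so each trial starts from the
    # weights left by the previous one rather than from the pretrained checkpoint
    model = automodelclassification.to(device)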

    if model_type == "roberta":
        # Freeze all RoBERTa layers
        for param in model.roberta.parameters():
            param.requires_grad = False

        # Unfreeze last `num_layers` encoder layers
        for param in model.roberta.encoder.layer[-num_layers:].parameters():
            param.requires_grad = True

        # Unfreeze the classifier head
        for param in model.classifier.parameters():
            param.requires_grad = True

    elif model_type == "dilbert":
        # Freeze all DistilBERT layers
        backbone = model.distilbert
        for p in backbone.parameters():
            p.requires_grad = False
        # Unfreeze last `num_layers` encoder layers
        for p in backbone.transformer.layer[-num_layers:].parameters():
            p.requires_grad = True

        # Unfreeze the classifier head
        if hasattr(model, "pre_classifier"):
            for p in model.pre_classifier.parameters():
                p.requires_grad = True
        for p in model.classifier.parameters():
            p.requires_grad = True

    # Define optimizer and loss function
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

    # Initialize Weights & Biases - the values in the config are the properties of each trial.
    wandb.init(project=project_name,
               config={
        "learning_rate": learning_rate,
        "weight_decay": weight_decay,
        "patience": patience,
        "batch_size": batch_size,
        "num_layers": num_layers,
        "architecture": model_name,
        "dataset": "Corona_NLP"},
        name=f"trial_{trial.number}") # The name that will be saved in the W&B platform

    # Train the model and get the best validation accuracy
    best_val_accuracy = train_model_with_hyperparams(model, project_name, train_loader, val_loader, optimizer, criterion, epochs=6, patience=patience, trial=trial)

    wandb.finish() # Finish the Weights & Biases run

    return best_val_accuracy # Return best validation acc as the objective to maximize
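
The objective samples the learning rate and weight decay log-uniformly, which spreads trials evenly across orders of magnitude instead of concentrating them near the upper bound. The sketch below illustrates the idea with the standard library only. It is not Optuna's sampler (Optuna's default sampler adapts to earlier trials), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(rng, 1e-5, 1e-3) for _ in range(10_000)]

# Share of draws in the lower decade [1e-5, 1e-4). About one half is expected,
# whereas uniform sampling on the raw scale would put roughly 9% there.
lower_decade_share = sum(s < 1e-4 for s in samples) / len(samples)
print(f"{lower_decade_share:.2f}")
```

For a learning-rate range that spans two orders of magnitude, this means about half of the trials explore values below `1e-4`, which a plain uniform draw would almost never visit.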

**First Model**

twitter-roberta-base-sentiment-latest

We trained the model on 4 datasets: dirty, clean, cutted, and balanced.


In [None]:
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
model_type = "roberta"
autotokenizer = AutoTokenizer.from_pretrained(model_name)
automodelclassification = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5, ignore_mismatched_sizes=True)

In [None]:
project_name = "roberta_sentiment_dirty_data_exc4"

roberta_sentiment_study = optuna.create_study(direction="maximize")
roberta_sentiment_study.optimize(partial(objective, train_df=train_dirty, eval_df=val_dirty, project_name=project_name, model_name=model_name, autotokenizer=autotokenizer, automodelclassification=automodelclassification, model_type=model_type),
               n_trials=5)

In [None]:
project_name = "roberta_sentiment_clean_data_exc4"

roberta_sentiment_study = optuna.create_study(direction="maximize")
roberta_sentiment_study.optimize(partial(objective, train_df=train_clean, eval_df=val_clean, project_name=project_name, model_name=model_name, autotokenizer=autotokenizer, automodelclassification=automodelclassification, model_type=model_type),
               n_trials=5)

In [None]:
project_name = "roberta_sentiment_cutted_data_exc4"

roberta_sentiment_study = optuna.create_study(direction="maximize")
roberta_sentiment_study.optimize(partial(objective, train_df=train_cutted, eval_df=val_cutted, project_name=project_name, model_name=model_name, autotokenizer=autotokenizer, automodelclassification=automodelclassification, model_type=model_type),
               n_trials=5)

In [None]:
project_name = "roberta_sentiment_balanced_data_exc4"

roberta_sentiment_study = optuna.create_study(direction="maximize")
roberta_sentiment_study.optimize(partial(objective, train_df=train_balanced, eval_df=val_balanced, project_name=project_name, model_name=model_name, autotokenizer=autotokenizer, automodelclassification=automodelclassification, model_type=model_type),
               n_trials=5)

**Second Model**

distilbert-base-uncased-finetuned-sst-2-english

We trained the model on 4 datasets: dirty, clean, cutted, and balanced.


In [None]:
model_name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
model_type = "dilbert"
autotokenizer = AutoTokenizer.from_pretrained(model_name)
automodelclassification = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=5,
    ignore_mismatched_sizes=True
)

In [None]:
project_name = "distilbert_sentiment_dirty_data_exc4"

distilber_sentiment_study = optuna.create_study(direction="maximize")
distilber_sentiment_study.optimize(partial(objective, train_df=train_dirty, eval_df=val_dirty, project_name=project_name, model_name=model_name, autotokenizer=autotokenizer, automodelclassification=automodelclassification, model_type=model_type),
               n_trials=5)

In [None]:
project_name = "distilbert_sentiment_clean_data_exc4"

distilber_sentiment_study = optuna.create_study(direction="maximize")
distilber_sentiment_study.optimize(partial(objective, train_df=train_clean, eval_df=val_clean, project_name=project_name, model_name=model_name, autotokenizer=autotokenizer, automodelclassification=automodelclassification, model_type=model_type),
               n_trials=5)

In [None]:
project_name = "distilbert_sentiment_cutted_data_exc4"

distilber_sentiment_study = optuna.create_study(direction="maximize")
distilber_sentiment_study.optimize(partial(objective, train_df=train_cutted, eval_df=val_cutted, project_name=project_name, model_name=model_name, autotokenizer=autotokenizer, automodelclassification=automodelclassification, model_type=model_type),
               n_trials=5)

In [None]:
project_name = "distilbert_sentiment_balanced_data_exc4"

distilber_sentiment_study = optuna.create_study(direction="maximize")
distilber_sentiment_study.optimize(partial(objective, train_df=train_balanced, eval_df=val_balanced, project_name=project_name, model_name=model_name, autotokenizer=autotokenizer, automodelclassification=automodelclassification, model_type=model_type),
               n_trials=5)

**Choosing the best model for each model type**

Looking at the W&B dashboards, we found that training was most stable on the augmented (balanced) dataset. We therefore continue with the models trained on that set, since it also gave the highest results on the test set.

In [26]:
test_data = pd.read_csv("data/test_clean.csv", encoding="ISO-8859-1")
# Ensure the 'text' column is treated as string type
test_data['text'] = test_data['text'].astype(str)

The `find_and_save_best_model` function loads every saved checkpoint (`.pt`) in a given folder, evaluates its accuracy on the test set with `evaluate_model`, and saves the most accurate checkpoint to a designated location under the given file name.

In [27]:
def evaluate_model(model, dataloader, device):
    model.eval()
    preds, true_labels = [], []
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)

            preds.extend(predictions.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    return accuracy_score(true_labels, preds)

def find_and_save_best_model(model_folder_path, model_name, model_save_path, model_save_name, test_data, batch_size=16):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    test_dataset = TweetDataset(test_data, tokenizer)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)

    best_acc = -1.0
    best_state_dict = None

    # Iterate over all .pt files
    for file_name in os.listdir(model_folder_path):
        if file_name.endswith(".pt"):
            print(f"Evaluating {file_name}")
            file_path = os.path.join(model_folder_path, file_name)

            # Load model
            model = AutoModelForSequenceClassification.from_pretrained(
                model_name,
                num_labels=5, ignore_mismatched_sizes=True)

            state_dict = torch.load(file_path, map_location=device)

            # Handle both plain state_dict and dict with 'model_state_dict'
            if "model_state_dict" in state_dict:
                state_dict = state_dict["model_state_dict"]

            model.load_state_dict(state_dict)
            model.to(device)

            acc = evaluate_model(model, test_loader, device)
            print(f"Model {file_name} Accuracy: {acc:.4f}")

            if acc > best_acc:
                best_acc = acc
                best_state_dict = state_dict

    if best_state_dict is not None:
        os.makedirs(model_save_path, exist_ok=True)
        save_path = os.path.join(model_save_path, model_save_name)
        torch.save(best_state_dict, save_path)
        print(f"Best model saved at {save_path} with accuracy {best_acc:.4f}")
    else:
        print("No .pt files found.")
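
The selection logic itself is independent of the model: scan a folder, score each checkpoint, keep the best. The sketch below isolates that pattern with a pluggable scoring function. The file names, the scores, and the helper `pick_best_checkpoint` are hypothetical stand-ins; in the notebook the score is the test accuracy returned by `evaluate_model`.

```python
import os
import tempfile


def pick_best_checkpoint(folder, score_fn, suffix=".pt"):
    """Return (file_name, score) of the highest-scoring checkpoint, or None."""
    best = None
    # Sort the names so that ties are resolved in a repeatable order.
    for file_name in sorted(os.listdir(folder)):
        if not file_name.endswith(suffix):
            continue
        score = score_fn(os.path.join(folder, file_name))
        if best is None or score > best[1]:
            best = (file_name, score)
    return best


# Made-up scores standing in for the accuracy of each checkpoint.
fake_scores = {"trial_0.pt": 0.61, "trial_1.pt": 0.64, "trial_2.pt": 0.58}

with tempfile.TemporaryDirectory() as folder:
    # Create empty placeholder files, plus one file that must be ignored.
    for name in list(fake_scores) + ["notes.txt"]:
        open(os.path.join(folder, name), "w").close()
    best = pick_best_checkpoint(
        folder, lambda path: fake_scores[os.path.basename(path)]
    )

print(best)  # ('trial_1.pt', 0.64)
```

Sorting the directory listing is a small design choice: `os.listdir` returns entries in arbitrary order, so without it two checkpoints with equal scores could be picked differently from run to run.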


**First Model**

Find the best checkpoint of twitter-roberta-base-sentiment-latest.

In [28]:
model_folder_path = "models/roberta_sentiment_balanced_data"
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
model_save_path = "final_models/"
model_save_name = "roberta_sentiment_exc4_weights.pt"

find_and_save_best_model(model_folder_path, model_name, model_save_path, model_save_name, test_data)

Evaluating best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_0.pt


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([3, 768]) in the checkpo

Model best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_0.pt Accuracy: 0.5469
Evaluating best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_1.pt


Model best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_1.pt Accuracy: 0.5924
Evaluating best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_2.pt


Model best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_2.pt Accuracy: 0.5843
Evaluating best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_3.pt


Model best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_3.pt Accuracy: 0.5787
Evaluating best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_4.pt


Model best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_4.pt Accuracy: 0.5700
Evaluating best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_5.pt


Model best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_5.pt Accuracy: 0.5727
Evaluating best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_6.pt


Model best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_6.pt Accuracy: 0.5756
Evaluating best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_7.pt


Model best_cardiffnlp-twitter-roberta-base-sentiment-latest_model_trial_7.pt Accuracy: 0.5753
Best model saved at /content/drive/My Drive/Colab Notebooks/final_models/roberta_sentiment_exc4_weights.pt with accuracy 0.5924


**Second Model**

Find the best checkpoint of distilbert-base-uncased-finetuned-sst-2-english.

In [29]:
model_folder_path = "models/distilbert_sentiment_balanced_data"
model_name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
model_save_path = "final_models/"
model_save_name = "distilbert_exc4_weights.pt"

find_and_save_best_model(model_folder_path, model_name, model_save_path, model_save_name, test_data)

Evaluating best_distilbert-distilbert-base-uncased-finetuned-sst-2-english_model_trial_0.pt


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased-finetuned-sst-2-english and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([5]) in the model instantiated
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model best_distilbert-distilbert-base-uncased-finetuned-sst-2-english_model_trial_0.pt Accuracy: 0.6303
Evaluating best_distilbert-distilbert-base-uncased-finetuned-sst-2-english_model_trial_1.pt


Model best_distilbert-distilbert-base-uncased-finetuned-sst-2-english_model_trial_1.pt Accuracy: 0.6274
Best model saved at /content/drive/My Drive/Colab Notebooks/final_models/distilbert_exc4_weights.pt with accuracy 0.6303


<center><h1>END</h1></center>
