# Airlines Review


---------------------

## Phase 3 - LLM DistilRoBERTa


### Environment setup

The Hugging Face caches are redirected to the E: drive instead of the default location on the C: drive.
This frees space on the system disk and keeps the large model and dataset files in one place.

In [1]:
# !pip install transformers datasets peft accelerate bitsandbytes 
import os

# Store all Hugging Face files on the E: drive
os.environ["HF_HOME"] = "E:/huggingface"
os.environ["TRANSFORMERS_CACHE"] = "E:/huggingface/transformers"
os.environ["HF_DATASETS_CACHE"] = "E:/huggingface/datasets"


In [2]:
import torch
print(torch.cuda.get_device_name(0))
print(f"Total VRAM: {round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2)} GB")

NVIDIA GeForce GTX 1660 Ti
Total VRAM: 6.44 GB


### Load the model

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model ID
model_id = "distilroberta-base"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
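# Use the EOS token (</s>) for padding when one is defined.
# Note: this replaces RoBERTa's default <pad> token; the attention mask still marks padded positions.
tokenizer.pad_token = tokenizer.eos_token if tokenizer.eos_token else tokenizer.pad_token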

# Load pre-trained DistilRoBERTa model for 3-class classification
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=3
).to("cuda")

# Set pad token ID in model config
model.config.pad_token_id = tokenizer.pad_token_id

print("Pad token:", tokenizer.pad_token)
print("Pad token ID:", tokenizer.pad_token_id)



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Pad token: </s>
Pad token ID: 2


### Load and Preprocess the Dataset

#### Loading the dataset:

In [4]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

# Load the dataset
df = pd.read_csv("./AirlinesReviews/data_phase1.csv", encoding='utf-8', on_bad_lines='skip')
df.head()

Unnamed: 0,Name,Review Date,Airline,Verified,Type of Traveller,Month Flown,Route,Class,Seat Comfort,Staff Service,Food & Beverages,Inflight Entertainment,Value For Money,Overall Rating,Recommended,Review,Sentiment,Review_Length
0,Alison Soetantyo,2024-03-01,Singapore Airlines,True,Solo Leisure,December 2023,Jakarta to Singapore,Business Class,4,4,4,4,4,9,yes,Flight was amazing. Flight was amazing. The ...,Positive,89
1,Robert Watson,2024-02-21,Singapore Airlines,True,Solo Leisure,February 2024,Phuket to Singapore,Economy Class,5,3,4,4,1,3,no,seats on this aircraft are dreadful . Bookin...,Negative,49
2,S Han,2024-02-20,Singapore Airlines,True,Family Leisure,February 2024,Siem Reap to Singapore,Economy Class,1,5,2,1,5,10,yes,Food was plentiful and tasty. Excellent perf...,Positive,34
3,D Laynes,2024-02-19,Singapore Airlines,True,Solo Leisure,February 2024,Singapore to London Heathrow,Economy Class,5,5,5,5,5,10,yes,“how much food was available. Pretty comforta...,Positive,171
4,A Othman,2024-02-19,Singapore Airlines,True,Family Leisure,February 2024,Singapore to Phnom Penh,Economy Class,5,5,5,5,5,10,yes,“service was consistently good”. The service ...,Positive,57


#### Preprocessing

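The dataset is cleaned and prepared by mapping the sentiment labels (Negative, Positive, Neutral) to the integer values 0, 1, and 2 required for model training.

Only the review text and its label are retained, and the "Review" column is renamed to "text", the column name the tokenization step expects.

A stratified 10% test set is split off first, so it keeps the original class distribution. Within the remaining training data, the Neutral class is then upsampled with replacement to the size of the largest class, and about 11% of the balanced set is held out for validation. This gives roughly an 80%-10%-10% train/validation/test split, with a fixed random seed for reproducibility.

Because the validation split is taken after upsampling, duplicated Neutral reviews can appear in both the training and validation sets, so validation scores may be optimistic compared with the test set.

This structure prepares the data for tokenization and fine-tuning with DistilRoBERTa.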

In [5]:
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
import pandas as pd

# 💬 Map sentiment labels to integers
label_map = {'Negative': 0, 'Positive': 1, 'Neutral': 2}
df = df[df['Sentiment'].isin(label_map.keys())].copy()  # copy so the column assignment below does not act on a view
df['label'] = df['Sentiment'].map(label_map)
df = df[['Review', 'label']].rename(columns={'Review': 'text'})

# === STEP 1: Split original imbalanced data FIRST ===
train_df, test_df = train_test_split(
    df,
    test_size=0.1,
    stratify=df['label'],  # maintain class ratio in test
    random_state=42
)

# === STEP 2: Upsample Neutral class ONLY in training set ===
df_neg = train_df[train_df['label'] == 0]
df_pos = train_df[train_df['label'] == 1]
df_neu = train_df[train_df['label'] == 2]

# Match to largest class
target_size = max(len(df_neg), len(df_pos))
df_neu_upsampled = resample(
    df_neu,
    replace=True,
    n_samples=target_size,
    random_state=42
)

# Combine balanced training set
balanced_train_df = pd.concat([df_neg, df_pos, df_neu_upsampled]).sample(frac=1, random_state=42).reset_index(drop=True)

# === STEP 3: Convert to Hugging Face datasets ===
train_dataset = Dataset.from_pandas(balanced_train_df)
test_dataset = Dataset.from_pandas(test_df)

# === STEP 4: Split train into train/validation (10% val) ===
train_val_split = train_dataset.train_test_split(test_size=0.1111, seed=42)

# === STEP 5: Wrap in DatasetDict ===
dataset = DatasetDict({
    "train": train_val_split["train"],
    "validation": train_val_split["test"],
    "test": test_dataset  # ✅ original imbalanced test
})
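
One way to check for the train/validation overlap that upsampling before splitting can introduce is to count how many validation rows also occur in the training rows. The sketch below is a toy illustration using only the standard library; the row IDs are hypothetical and stand in for individual Neutral reviews, and `random.choices` is used here as a stand-in for resampling with replacement.

```python
import random

rng = random.Random(42)

# Hypothetical minority class: 10 unique review IDs.
minority_ids = list(range(10))

# Upsample with replacement to 100 rows.
upsampled = rng.choices(minority_ids, k=100)

# Shuffle, then hold out 10% for validation *after* upsampling.
rng.shuffle(upsampled)
val_rows, train_rows = upsampled[:10], upsampled[10:]

# IDs that occur on both sides of the split are duplicated across train and validation.
leaked_ids = set(val_rows) & set(train_rows)
leaked_fraction = sum(r in leaked_ids for r in val_rows) / len(val_rows)

print(f"validation rows also present in train: {leaked_fraction:.0%}")
```

Holding out the validation rows before upsampling, as is already done for the test set, would avoid this overlap.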


### Tokenize the Dataset

The dataset is tokenized with the DistilRoBERTa tokenizer loaded earlier:

A custom tokenize function is applied to the "text" column, ensuring all sequences are padded or truncated to a maximum length of 512 tokens.
The dataset is processed in batches for faster tokenization, preparing the text inputs for model training.

In [6]:
from transformers import AutoTokenizer
def tokenize(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, batched=True)



Map:   0%|          | 0/7866 [00:00<?, ? examples/s]

Map:   0%|          | 0/984 [00:00<?, ? examples/s]

Map:   0%|          | 0/810 [00:00<?, ? examples/s]
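
To make the effect of `padding="max_length"` with `truncation=True` concrete, here is a simplified pure-Python sketch of what happens to one sequence of token IDs. The function name and the token IDs are hypothetical, and the real tokenizer also takes care to keep special tokens when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 = real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # right-padding with the pad token ID
    mask += [0] * n_pad                  # 0 = padding, ignored by attention
    return ids, mask

# pad_id=2 matches the pad token ID printed earlier in this notebook.
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=8, pad_id=2)
long_ids, long_mask = pad_or_truncate(list(range(20)), max_length=8, pad_id=2)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Since the pad token and the end-of-sequence token share ID 2 here, the attention mask is what tells the model which positions are padding. Padding every review to 512 tokens is simple but costs compute on short reviews; dynamic per-batch padding (for example with `DataCollatorWithPadding` from `transformers`) is a common alternative.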

### Hyperparameter Search with Optuna

Hyperparameter optimization is performed using Optuna:

- The model is trained multiple times on the training set, and evaluated on the validation set, with the goal of maximizing the weighted F1-score.

- Optuna explores different learning rates, batch sizes, and numbers of epochs to automatically find the best combination.

- Training is monitored with a custom callback to track progress, and early stopping is used to avoid overfitting.

- To address class imbalance in the dataset, class weights are computed automatically from the frequency of each sentiment label (Positive, Negative, Neutral) in the labelled data before upsampling.
These weights are applied during training by a custom WeightedTrainer that modifies the loss function.
This gives underrepresented classes more influence on the loss, so predictions are less biased toward the majority classes.
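
The "balanced" heuristic used for the class weights assigns each class the weight `n_samples / (n_classes * count_of_that_class)`. The sketch below reimplements that formula in plain Python for illustration. The label list is an assumption: it reuses the test-set class counts reported later in this notebook (302 Negative, 341 Positive, 167 Neutral), not the exact data the weights are computed from.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Illustrative labels: 0 = Negative, 1 = Positive, 2 = Neutral.
labels = [0] * 302 + [1] * 341 + [2] * 167
weights = balanced_class_weights(labels)

for cls, w in weights.items():
    print(cls, round(w, 3))
```

The rarest class receives the largest weight, so its errors count more in the loss.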

In [8]:
import time
import optuna
import numpy as np
import torch
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_class_weight

from transformers import (
    TrainingArguments,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainerCallback,
    EarlyStoppingCallback,
)
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig
from torch import nn
from transformers import Trainer

#  Model Init function
def model_init():
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=3
    )
    model.config.pad_token_id = tokenizer.pad_token_id
    return model 

#  Class weights
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(df["label"]),
    y=df["label"]
)
class_weights = torch.tensor(class_weights, dtype=torch.float).to("cuda")

#  TRACK TIME PER TRIAL
class TrialProgressCallback(TrainerCallback):
    def __init__(self):
        self.start_time = None

    def on_train_begin(self, args, state, control, **kwargs):
        self.start_time = time.time()
        print(f"\n🚀 Starting trial at {time.strftime('%H:%M:%S')}")

    def on_log(self, args, state, control, logs=None, **kwargs):
        elapsed = time.time() - self.start_time
        print(f"⏱️ Step {state.global_step} - Elapsed: {round(elapsed/60, 2)} min - Logs: {logs}")

#  METRICS
def compute_metrics(pred):
    preds = pred.predictions.argmax(-1)
    return {"f1": f1_score(pred.label_ids, preds, average="weighted")}

#  MODEL INIT 


#  OPTUNA SEARCH SPACE
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 2e-5, 5e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [2, 4]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 3)
    }

#  Custom Trainer with class weights
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss_fct = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss



#  TRAINING ARGS
training_args = TrainingArguments(
    output_dir="./optuna_output",
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    metric_for_best_model="f1",
    load_best_model_at_end=True,
    report_to="none",
    logging_dir="./logs",
    logging_strategy="steps",
    logging_steps=100,
    fp16=False
)

#  TRAINER with everything
trainer = WeightedTrainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[
        TrialProgressCallback(),
        EarlyStoppingCallback(early_stopping_patience=2)
    ]
)

#  RUN OPTUNA
start = time.time()

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    hp_space=optuna_hp_space,
    n_trials=3
)

end = time.time()

print("\n✅ Done!")
print(f"⏱️ Optuna tuning done in {round((end - start)/60, 2)} minutes")
print("🏆 Best trial params:", best_trial.hyperparameters)


[I 2025-05-29 15:46:02,018] A new study created in memory with name: no-name-940362dd-c425-4518-9dd7-e41d503a6e5b



🚀 Starting trial at 15:46:02


Epoch,Training Loss,Validation Loss,F1
1,1.0777,0.970283,0.464428
2,1.1018,0.974688,0.65845
3,1.0034,1.206006,0.69998


⏱️ Step 100 - Elapsed: 0.32 min - Logs: {'loss': 1.217, 'grad_norm': 2.8954014778137207, 'learning_rate': 8.528210475650763e-05, 'epoch': 0.025425883549453344}
⏱️ Step 200 - Elapsed: 0.65 min - Logs: {'loss': 1.109, 'grad_norm': 1.9583098888397217, 'learning_rate': 8.455319787824688e-05, 'epoch': 0.05085176709890669}
⏱️ Step 300 - Elapsed: 1.01 min - Logs: {'loss': 1.0758, 'grad_norm': 4.057555675506592, 'learning_rate': 8.382429099998613e-05, 'epoch': 0.07627765064836003}
⏱️ Step 400 - Elapsed: 1.29 min - Logs: {'loss': 1.1337, 'grad_norm': 12.821508407592773, 'learning_rate': 8.309538412172538e-05, 'epoch': 0.10170353419781338}
⏱️ Step 500 - Elapsed: 1.59 min - Logs: {'loss': 1.0573, 'grad_norm': 20.15970230102539, 'learning_rate': 8.236647724346464e-05, 'epoch': 0.12712941774726672}
⏱️ Step 600 - Elapsed: 1.9 min - Logs: {'loss': 1.0254, 'grad_norm': 2.6757819652557373, 'learning_rate': 8.163757036520388e-05, 'epoch': 0.15255530129672007}
⏱️ Step 700 - Elapsed: 2.22 min - Logs: {'lo

[I 2025-05-29 16:20:33,317] Trial 0 finished with value: 0.6999803181241854 and parameters: {'learning_rate': 8.600372256598577e-05, 'per_device_train_batch_size': 2, 'num_train_epochs': 3}. Best is trial 0 with value: 0.6999803181241854.



🚀 Starting trial at 16:20:34


Epoch,Training Loss,Validation Loss,F1
1,1.0217,0.883136,0.80245
2,0.7894,0.743955,0.85666


⏱️ Step 100 - Elapsed: 0.28 min - Logs: {'loss': 1.1177, 'grad_norm': 11.837272644042969, 'learning_rate': 3.758016435850544e-05, 'epoch': 0.025425883549453344}
⏱️ Step 200 - Elapsed: 0.57 min - Logs: {'loss': 1.2333, 'grad_norm': 2.238924503326416, 'learning_rate': 3.709632034719469e-05, 'epoch': 0.05085176709890669}
⏱️ Step 300 - Elapsed: 0.85 min - Logs: {'loss': 1.0103, 'grad_norm': 4.458409309387207, 'learning_rate': 3.661247633588395e-05, 'epoch': 0.07627765064836003}
⏱️ Step 400 - Elapsed: 1.12 min - Logs: {'loss': 0.8278, 'grad_norm': 0.9361220598220825, 'learning_rate': 3.612863232457321e-05, 'epoch': 0.10170353419781338}
⏱️ Step 500 - Elapsed: 1.4 min - Logs: {'loss': 1.0554, 'grad_norm': 8.248473167419434, 'learning_rate': 3.5644788313262466e-05, 'epoch': 0.12712941774726672}
⏱️ Step 600 - Elapsed: 1.66 min - Logs: {'loss': 1.0175, 'grad_norm': 18.788837432861328, 'learning_rate': 3.5160944301951726e-05, 'epoch': 0.15255530129672007}
⏱️ Step 700 - Elapsed: 1.92 min - Logs: {

[I 2025-05-29 16:42:04,485] Trial 1 finished with value: 0.8566600732687818 and parameters: {'learning_rate': 3.8059169929703077e-05, 'per_device_train_batch_size': 2, 'num_train_epochs': 2}. Best is trial 1 with value: 0.8566600732687818.



🚀 Starting trial at 16:42:05


Epoch,Training Loss,Validation Loss,F1
1,0.6892,0.597391,0.830417
2,0.4792,0.59161,0.876295


⏱️ Step 100 - Elapsed: 0.42 min - Logs: {'loss': 1.0015, 'grad_norm': 3.291795253753662, 'learning_rate': 5.689421646063128e-05, 'epoch': 0.05083884087442806}
⏱️ Step 200 - Elapsed: 0.84 min - Logs: {'loss': 0.6669, 'grad_norm': 16.933544158935547, 'learning_rate': 5.541066453206202e-05, 'epoch': 0.10167768174885612}
⏱️ Step 300 - Elapsed: 1.26 min - Logs: {'loss': 0.7886, 'grad_norm': 8.993491172790527, 'learning_rate': 5.3927112603492754e-05, 'epoch': 0.1525165226232842}
⏱️ Step 400 - Elapsed: 1.67 min - Logs: {'loss': 0.8002, 'grad_norm': 46.60826873779297, 'learning_rate': 5.2443560674923494e-05, 'epoch': 0.20335536349771224}
⏱️ Step 500 - Elapsed: 2.09 min - Logs: {'loss': 0.7106, 'grad_norm': 14.087248802185059, 'learning_rate': 5.096000874635423e-05, 'epoch': 0.2541942043721403}
⏱️ Step 600 - Elapsed: 2.5 min - Logs: {'loss': 0.713, 'grad_norm': 8.977178573608398, 'learning_rate': 4.947645681778496e-05, 'epoch': 0.3050330452465684}
⏱️ Step 700 - Elapsed: 2.92 min - Logs: {'loss'

[I 2025-05-29 16:58:59,031] Trial 2 finished with value: 0.8762950743602529 and parameters: {'learning_rate': 5.8362932869914855e-05, 'per_device_train_batch_size': 4, 'num_train_epochs': 2}. Best is trial 2 with value: 0.8762950743602529.



✅ Done!
⏱️ Optuna tuning done in 72.95 minutes
🏆 Best trial params: {'learning_rate': 5.8362932869914855e-05, 'per_device_train_batch_size': 4, 'num_train_epochs': 2}
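
The `log=True` flag in the learning-rate search space means values are drawn uniformly on a logarithmic scale between 2e-5 and 5e-4, so small learning rates are explored as often as large ones. The sketch below illustrates only that log-scaled range with the standard library; the helper name is hypothetical, and Optuna's own samplers are more elaborate than plain random sampling.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
low, high = 2e-5, 5e-4
samples = [sample_log_uniform(low, high, rng) for _ in range(10_000)]

geometric_mean = math.sqrt(low * high)   # midpoint of the range in log space
share_below = sum(s < geometric_mean for s in samples) / len(samples)

print(f"geometric mean: {geometric_mean:.2e}")
print(f"share of samples below it: {share_below:.2f}")
```

About half of the samples fall below the geometric mean (1e-4), whereas plain uniform sampling would place most draws in the upper part of the range.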


### Saving the best parameters

In [9]:
print("🏆 Best trial F1 score:", best_trial.objective)
print("📋 Best trial hyperparameters:", best_trial.hyperparameters)

🏆 Best trial F1 score: 0.8762950743602529
📋 Best trial hyperparameters: {'learning_rate': 5.8362932869914855e-05, 'per_device_train_batch_size': 4, 'num_train_epochs': 2}


In [10]:
import json
# Save best params
with open("best_params_DistilRB.json", "w") as f:
    json.dump(best_trial.hyperparameters, f)

print("✅ Best hyperparameters saved.")


✅ Best hyperparameters saved.


### Final model

#### Merging Train and Validation Sets

The training and validation datasets are merged into a single full training set.
This lets the final model train on all available labelled data rather than holding examples back for validation.

In [11]:
from datasets import concatenate_datasets

# Merge train and validation splits into one training set
full_train_dataset = concatenate_datasets(
    [tokenized_dataset["train"], tokenized_dataset["validation"]]
)


### Final version

In this final version, the model is fine-tuned with the learning rate and batch size found by Optuna. The number of epochs is fixed at 3 in the training arguments rather than taken from the best trial, which used 2.

Class imbalance is addressed by assigning a higher weight (2.0) to the Neutral class during training.

A custom WeightedTrainer applies these class weights through a modified loss function (CrossEntropyLoss).

Training is done on the full training dataset with no evaluation during training (eval_strategy="no"), since the validation split has been merged into the training data.

After training, predictions are made on the separate test set, and a confidence threshold is optimized to decide when an uncertain prediction should be reassigned to Neutral. The threshold is tuned on the test labels themselves, so the reported scores are likely somewhat optimistic.

Finally, a classification report, a confusion matrix, and the weighted F1-score are printed to summarize the model's performance.
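
To show what class weights do inside the loss, the following plain-Python sketch computes a weighted cross-entropy with mean reduction, that is, `sum(w_y * -log p_y) / sum(w_y)` over the batch. It is an illustration only: the logits are hypothetical, and the function is a stand-in for the weighted `CrossEntropyLoss` described above, not the notebook's implementation.

```python
import math

def weighted_cross_entropy(logits_batch, labels, class_weights):
    """Mean-reduced weighted cross-entropy: sum(w_y * -log p_y) / sum(w_y)."""
    total_loss, total_weight = 0.0, 0.0
    for logits, y in zip(logits_batch, labels):
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        nll = log_sum_exp - logits[y]            # -log softmax(logits)[y]
        total_loss += class_weights[y] * nll
        total_weight += class_weights[y]
    return total_loss / total_weight

# Hypothetical logits for two reviews: one Negative (0), one Neutral (2).
logits_batch = [[2.0, 0.1, 0.3], [0.5, 0.4, 0.6]]
labels = [0, 2]

uniform = weighted_cross_entropy(logits_batch, labels, [1.0, 1.0, 1.0])
neutral_heavy = weighted_cross_entropy(logits_batch, labels, [1.0, 1.0, 2.0])

print(round(uniform, 4), round(neutral_heavy, 4))
```

In this toy batch the Neutral example is the poorly predicted one, so doubling its weight raises the batch loss and pushes the optimizer to focus on it.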

In [12]:
#  Imports
import torch
import json
import numpy as np
import time
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from torch import nn

#  Load Best Parameters
with open("best_params_DistilRB.json", "r") as f:
    best_params = json.load(f)



#  Class weights
class_weights = torch.tensor([1.0, 1.0, 2.0], dtype=torch.float).to("cuda")  # Neutral weighted higher


#  Custom Trainer to inject class weights
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss_fct = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss

#  Model Init function
def model_init():
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=3
    )
    model.config.pad_token_id = tokenizer.pad_token_id
    return model


#  Training Args
final_training_args = TrainingArguments(
    output_dir="./final",
    eval_strategy="no",
    save_strategy="no",  
    learning_rate=best_params["learning_rate"],
    per_device_train_batch_size=best_params["per_device_train_batch_size"],
    num_train_epochs=3,
    report_to="none",
    fp16=False
)

#  Final Trainer
final_trainer = WeightedTrainer(
    model_init=model_init,
    args=final_training_args,
    train_dataset=full_train_dataset,  # <<< Full train
    tokenizer=tokenizer
)

#  Train
start = time.time()
final_trainer.train()
end = time.time()
print(f"\n✅ Final training completed in {round((end-start)/60, 2)} minutes.")

#  Save Model
final_trainer.save_model("finalBALANCE")
print(" Final model saved to 'finalBALANCE' folder.")

#  Predictions
predictions = final_trainer.predict(tokenized_dataset["test"])
probs = torch.softmax(torch.tensor(predictions.predictions), dim=1).numpy()
labels = predictions.label_ids

#  Auto Threshold Optimization
from scipy.optimize import minimize_scalar

def threshold_objective(thresh):
    preds = np.argmax(probs, axis=1)
    max_probs = np.max(probs, axis=1)
    preds[max_probs < thresh] = 2  # Force to Neutral if uncertain
    return -f1_score(labels, preds, average="weighted")

opt_result = minimize_scalar(threshold_objective, bounds=(0.3, 0.7), method="bounded")
best_thresh = opt_result.x
print(f"\n🔍 Best Neutral Threshold found: {round(best_thresh, 3)}")

#  Apply Best Threshold
preds = np.argmax(probs, axis=1)
max_probs = np.max(probs, axis=1)
preds[max_probs < best_thresh] = 2  # Again force Neutral

#  Print Metrics
print("\n📘 Classification Report (Final):")
print(classification_report(labels, preds, target_names=['Negative', 'Positive', 'Neutral']))

print("\n✅ Confusion Matrix:")
print(confusion_matrix(labels, preds))

final_f1 = f1_score(labels, preds, average="weighted")
print(f"\n🔴 Final Weighted F1 (with threshold): {round(final_f1, 4)}")




Step,Training Loss
500,0.8004
1000,0.7562
1500,0.7033
2000,0.7166
2500,0.639
3000,0.6119
3500,0.627
4000,0.5706
4500,0.5162
5000,0.4178



✅ Final training completed in 28.06 minutes.
 Final model saved to 'finalBALANCE' folder.



🔍 Best Neutral Threshold found: 0.394

📘 Classification Report (Final):
              precision    recall  f1-score   support

    Negative       0.80      0.89      0.85       302
    Positive       0.87      0.90      0.88       341
     Neutral       0.53      0.38      0.44       167

    accuracy                           0.79       810
   macro avg       0.73      0.72      0.72       810
weighted avg       0.77      0.79      0.78       810


✅ Confusion Matrix:
[[270   2  30]
 [  8 306  27]
 [ 59  44  64]]

🔴 Final Weighted F1 (with threshold): 0.7785
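
As a cross-check, the per-class F1 scores, the weighted F1, and the accuracy can be recomputed directly from the confusion matrix printed above. The sketch below uses plain Python and assumes the usual scikit-learn layout, with rows as true classes and columns as predicted classes.

```python
# Confusion matrix from the output above; rows are true classes, columns are predictions.
confusion = [
    [270, 2, 30],    # Negative
    [8, 306, 27],    # Positive
    [59, 44, 64],    # Neutral
]
names = ["Negative", "Positive", "Neutral"]

n_classes = len(confusion)
support = [sum(row) for row in confusion]
predicted = [sum(confusion[r][c] for r in range(n_classes)) for c in range(n_classes)]
total = sum(support)

# Per-class F1 = 2 * TP / (predicted count + true count).
f1 = [2 * confusion[k][k] / (predicted[k] + support[k]) for k in range(n_classes)]

# Weighted F1 averages the per-class scores by class support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total
accuracy = sum(confusion[k][k] for k in range(n_classes)) / total

for name, f in zip(names, f1):
    print(f"{name}: F1 = {f:.2f}")
print(f"weighted F1 = {weighted_f1:.4f}, accuracy = {accuracy:.4f}")
```

The weighted F1 is dominated by the two larger classes, which is why it stays near 0.78 even though the Neutral F1 is only about 0.44.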


In [None]:
import torch
import torch.nn.functional as F
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from scipy.optimize import minimize_scalar

# === Predict on test set ===
predictions = final_trainer.predict(tokenized_dataset["test"])
logits = predictions.predictions
labels = predictions.label_ids

# === Apply softmax with temperature scaling ===
temperature = 2 
probs = F.softmax(torch.tensor(logits / temperature), dim=-1).numpy()

# === Define threshold optimization using confidence margin ===
def threshold_objective(margin):
    final_preds = []
    for p in probs:
        top2 = np.sort(p)[-2:]        # take two highest probs
        gap = top2[1] - top2[0]       # confidence gap
        if gap < margin:
            final_preds.append(2)     # force Neutral
        else:
            final_preds.append(np.argmax(p))
    return -f1_score(labels, final_preds, average="weighted")

# === Find best neutral margin ===
opt_result = minimize_scalar(threshold_objective, bounds=(0.05, 0.4), method="bounded")
neutral_margin = opt_result.x
print(f"\n🔍 Best Neutral Confidence Margin found: {round(neutral_margin, 3)}")

# === Apply margin to get final predictions ===
final_preds = []
for p in probs:
    top2 = np.sort(p)[-2:]
    gap = top2[1] - top2[0]
    if gap < neutral_margin:
        final_preds.append(2)
    else:
        final_preds.append(np.argmax(p))

# === Evaluation ===
print("\n📘 Classification Report (Confidence + Temperature):")
print(classification_report(labels, final_preds, target_names=["Negative", "Positive", "Neutral"]))

print("\n✅ Confusion Matrix:")
print(confusion_matrix(labels, final_preds))

final_f1 = f1_score(labels, final_preds, average="weighted")
print(f"\n🔴 Final Weighted F1 (with threshold): {round(final_f1, 4)}")



🔍 Best Neutral Confidence Margin found: 0.121

📘 Classification Report (Confidence + Temperature):
              precision    recall  f1-score   support

    Negative       0.80      0.89      0.84       302
    Positive       0.87      0.89      0.88       341
     Neutral       0.52      0.40      0.45       167

    accuracy                           0.79       810
   macro avg       0.73      0.73      0.73       810
weighted avg       0.77      0.79      0.78       810


✅ Confusion Matrix:
[[268   2  32]
 [  8 305  28]
 [ 58  43  66]]

🔴 Final Weighted F1 (with threshold): 0.7787
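
The margin rule used in the cell above can be illustrated on single examples. The sketch below is a plain-Python restatement with hypothetical logits; the helper names are assumptions, while the temperature of 2 and the margin of 0.121 are taken from the cell and its output.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = flatter distribution)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_margin(logits, temperature, margin, neutral_label=2):
    """Return the argmax class unless the top-two probability gap is below `margin`."""
    probs = softmax(logits, temperature)
    top, runner_up = sorted(probs, reverse=True)[:2]
    if top - runner_up < margin:
        return neutral_label
    return probs.index(top)

# Hypothetical logits: a clear Negative and a Negative/Positive near-tie.
clear = [3.0, -1.0, 0.0]
ambiguous = [1.0, 0.9, -0.5]

clear_pred = predict_with_margin(clear, temperature=2, margin=0.121)
ambiguous_pred = predict_with_margin(ambiguous, temperature=2, margin=0.121)
print(clear_pred, ambiguous_pred)
```

Temperature scaling never changes which class has the highest probability; it only shrinks the gaps between probabilities, so it interacts with the margin rather than with the argmax.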


In [14]:
from sklearn.metrics import accuracy_score

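# Score the margin-based predictions (final_preds) so accuracy and F1 describe the same predictions
final_accuracy = accuracy_score(labels, final_preds)

print(f"\n✅ Final Accuracy: {final_accuracy:.4f}")
print(f"✅ Final Weighted F1-score: {final_f1:.4f}")



✅ Final Accuracy: 0.7889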
✅ Final Weighted F1-score: 0.7787
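
Both threshold searches in this notebook are tuned against the test labels. A possible alternative, sketched below under stated assumptions, is to pick the threshold on held-out validation predictions and only then apply it to the test set. This is not the notebook's method: the helper names and the probabilities are hypothetical, it would require keeping a validation split aside instead of merging it into training, and a simple grid search is used here because the objective is piecewise constant in the threshold.

```python
def apply_threshold(probs, threshold, neutral_label=2):
    """Argmax prediction, replaced by Neutral when the top probability is below threshold."""
    preds = []
    for p in probs:
        top = max(p)
        preds.append(p.index(top) if top >= threshold else neutral_label)
    return preds

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    score = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        denom = y_true.count(c) + y_pred.count(c)
        score += y_true.count(c) * (2 * tp / denom if denom else 0.0)
    return score / len(y_true)

def tune_threshold(val_probs, val_labels, grid):
    """Pick the grid value with the best validation F1 (the first one wins ties)."""
    return max(grid, key=lambda t: weighted_f1(val_labels, apply_threshold(val_probs, t)))

# Hypothetical validation probabilities and labels (0 = Negative, 1 = Positive, 2 = Neutral).
val_probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.45, 0.35, 0.2], [0.4, 0.42, 0.18]]
val_labels = [0, 1, 2, 2]

grid = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]
best = tune_threshold(val_probs, val_labels, grid)
print("threshold chosen on validation data:", best)
```

The chosen threshold would then be applied once to the test probabilities, so the test score stays an unbiased estimate.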
