# Airlines Review


---------------------

## Phase 3 - LLM Mistral


### Environment setup

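Hugging Face is configured to store its caches on the E: drive instead of the default location on the C: drive.
This keeps large model and dataset files off the system drive. The environment variables are set before the Hugging Face libraries are imported so that they take effect.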

In [1]:
# !pip install transformers datasets peft accelerate bitsandbytes 
import os

# Store all Hugging Face files on the E: drive
os.environ["HF_HOME"] = "E:/huggingface"
os.environ["TRANSFORMERS_CACHE"] = "E:/huggingface/transformers"
os.environ["HF_DATASETS_CACHE"] = "E:/huggingface/datasets"


In [2]:
import torch
print(torch.cuda.get_device_name(0))
print(f"Total VRAM: {round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2)} GB")

NVIDIA GeForce GTX 1660 Ti
Total VRAM: 6.44 GB


### Load the model

The TinyMistral-248M model is loaded and prepared for efficient fine-tuning:

- First, 4-bit quantization is set up using BitsAndBytesConfig to save GPU memory.

- The tokenizer and base model are loaded from the Hugging Face Hub, with the model configured for a 3-class classification task.

- The model is then adapted for 4-bit training with prepare_model_for_kbit_training.

- Finally, LoRA (Low-Rank Adaptation) is applied by injecting lightweight trainable layers into the attention mechanisms (q_proj and v_proj), making the fine-tuning process much faster and lighter.

- The padding token is also corrected after these adjustments to ensure input sequences are properly handled.
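To make the LoRA bullet more concrete, the following is a minimal arithmetic sketch of why the adapters are "lightweight". The rank and alpha match the LoraConfig used in this notebook; the projection shape (`d_in`, `d_out`) is an assumed value chosen for illustration and is not read from the model.

```python
# Illustrative arithmetic only: how many weights LoRA trains for one projection.
# r and lora_alpha match the LoraConfig in this notebook; d_in and d_out are
# assumed sizes chosen for the example, not values read from the model.
r, lora_alpha = 8, 16
d_in, d_out = 1024, 1024  # hypothetical projection shape

full_params = d_in * d_out            # frozen weight matrix W
lora_params = r * d_in + d_out * r    # trainable factors A (r x d_in) and B (d_out x r)
scaling = lora_alpha / r              # the update added to W is scaling * (B @ A)

print(f"frozen: {full_params:,}  trainable: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.2%}  scaling: {scaling}")
```

Under these assumptions each adapted projection trains under 2% as many weights as the frozen matrix it sits beside, which is where the memory and speed savings come from.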

In [3]:
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForSequenceClassification
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType

# Setup 4-bit Quantization : configure the model to load in 4-bit precision to save memory (important with small VRAM GPUs like 6 GB).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

#  Load Tokenizer
model_id = "Locutusque/TinyMistral-248M"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  #  Add padding

# Base model: Load Pre-trained Mistral Model TinyMistral
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    num_labels=3, # ready for classification tasks with 3 output classes
    trust_remote_code=True
).to("cuda")

# ✅ Prepare for LoRA after quant: adapt the model for training in 4-bit precision, making it faster and lighter to fine-tune.
model = prepare_model_for_kbit_training(model)

# Apply LoRA Fine-Tuning : configure LoRA (Low-Rank Adaptation) to inject small, efficient trainable adapters into the model’s attention layers.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS
)
model = get_peft_model(model, lora_config)

# Now apply pad_token_id after PEFT
model.config.pad_token_id = tokenizer.pad_token_id

print("Pad token:", tokenizer.pad_token)
print("Pad token ID:", tokenizer.pad_token_id)


Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at Locutusque/TinyMistral-248M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Pad token: <|endoftext|>
Pad token ID: 32001


### Load and Preprocess the Dataset

#### Loading the dataset:

In [4]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

# Load the dataset
df = pd.read_csv("./AirlinesReviews/data_phase1.csv", encoding='utf-8', on_bad_lines='skip')
df.head()

Unnamed: 0,Name,Review Date,Airline,Verified,Type of Traveller,Month Flown,Route,Class,Seat Comfort,Staff Service,Food & Beverages,Inflight Entertainment,Value For Money,Overall Rating,Recommended,Review,Sentiment,Review_Length
0,Alison Soetantyo,2024-03-01,Singapore Airlines,True,Solo Leisure,December 2023,Jakarta to Singapore,Business Class,4,4,4,4,4,9,yes,Flight was amazing. Flight was amazing. The ...,Positive,89
1,Robert Watson,2024-02-21,Singapore Airlines,True,Solo Leisure,February 2024,Phuket to Singapore,Economy Class,5,3,4,4,1,3,no,seats on this aircraft are dreadful . Bookin...,Negative,49
2,S Han,2024-02-20,Singapore Airlines,True,Family Leisure,February 2024,Siem Reap to Singapore,Economy Class,1,5,2,1,5,10,yes,Food was plentiful and tasty. Excellent perf...,Positive,34
3,D Laynes,2024-02-19,Singapore Airlines,True,Solo Leisure,February 2024,Singapore to London Heathrow,Economy Class,5,5,5,5,5,10,yes,“how much food was available. Pretty comforta...,Positive,171
4,A Othman,2024-02-19,Singapore Airlines,True,Family Leisure,February 2024,Singapore to Phnom Penh,Economy Class,5,5,5,5,5,10,yes,“service was consistently good”. The service ...,Positive,57


#### Preprocessing

The dataset is prepared by mapping the sentiment labels (Negative, Positive, Neutral) to the integers 0, 1, and 2 required for model training.

Only the review text and its label are retained, and the "Review" column is renamed to "text", the column name the tokenization step expects.

The data is then split reproducibly (random_state=42). First, a stratified 10% test set is held out, which keeps the original class imbalance. In the remaining training data, the Neutral class is upsampled with replacement to the size of the largest class, and about 11% of this balanced set (roughly 10% of the original data) becomes the validation set.

Because upsampling happens before the validation split, duplicated Neutral reviews can appear in both the training and validation sets, so validation scores may be somewhat optimistic.
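One way to keep duplicated reviews out of the validation set is to carve out validation and test first and upsample only what remains. The sketch below shows that ordering in plain Python on toy data. The function name, the `(row_id, label)` representation, and the toy labels are all hypothetical, the split is not stratified, and unlike the notebook's resample call it keeps the original minority rows and only adds the missing copies.

```python
import random

def split_then_upsample(rows, minority_label, val_frac=0.1, test_frac=0.1, seed=42):
    """Split rows into train/val/test first, then upsample one class in train only.

    rows is a list of (row_id, label) pairs; all names here are illustrative.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]

    # Size of the largest class in train; the minority class is resampled up to it.
    counts = {}
    for _, label in train:
        counts[label] = counts.get(label, 0) + 1
    target = max(counts.values())

    minority = [row for row in train if row[1] == minority_label]
    extra = rng.choices(minority, k=target - len(minority))  # sampling with replacement
    return train + extra, val, test

# Toy data: label 2 plays the role of the rare Neutral class.
rows = [(i, 2 if i % 5 == 0 else i % 2) for i in range(1000)]
train, val, test = split_then_upsample(rows, minority_label=2)
print(len(train), len(val), len(test))
```

With this ordering every duplicated row stays inside the training split, so validation and test both contain only reviews the model has never seen.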

In [None]:
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
import pandas as pd

#  Map sentiment labels to integers
label_map = {'Negative': 0, 'Positive': 1, 'Neutral': 2}
df = df[df['Sentiment'].isin(label_map.keys())].copy()  # copy so the column assignment below does not hit a view
df['label'] = df['Sentiment'].map(label_map)
df = df[['Review', 'label']].rename(columns={'Review': 'text'})

# === STEP 1: Split original imbalanced data FIRST ===
train_df, test_df = train_test_split(
    df,
    test_size=0.1,
    stratify=df['label'],  # maintain class ratio in test
    random_state=42
)

# === STEP 2: Upsample Neutral class ONLY in training set ===
df_neg = train_df[train_df['label'] == 0]
df_pos = train_df[train_df['label'] == 1]
df_neu = train_df[train_df['label'] == 2]

# Match to largest class
target_size = max(len(df_neg), len(df_pos))
df_neu_upsampled = resample(
    df_neu,
    replace=True,
    n_samples=target_size,
    random_state=42
)

# Combine balanced training set
balanced_train_df = pd.concat([df_neg, df_pos, df_neu_upsampled]).sample(frac=1, random_state=42).reset_index(drop=True)

# === STEP 3: Convert to Hugging Face datasets ===
train_dataset = Dataset.from_pandas(balanced_train_df)
test_dataset = Dataset.from_pandas(test_df)

# === STEP 4: Split train into train/validation (10% val) ===
train_val_split = train_dataset.train_test_split(test_size=0.1111, seed=42)

# === STEP 5: Wrap in DatasetDict ===
dataset = DatasetDict({
    "train": train_val_split["train"],
    "validation": train_val_split["test"],
    "test": test_dataset  # ✅ original imbalanced test
})


### Tokenize the Dataset

The dataset is tokenized with the tokenizer that was loaded alongside the Mistral model.

A tokenize function is applied to the "text" column so that every sequence is padded or truncated to exactly 512 tokens.
Mapping with batched=True processes the examples in batches, which speeds up tokenization.
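As a rough illustration of what padding to a fixed length and truncation mean for a single sequence, here is a toy sketch on plain lists of token ids. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, it pads on the right, and it ignores special tokens, whereas the real tokenizer's behavior depends on its configuration.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                       # truncation
    mask = [1] * len(ids)                              # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # right padding, 0 marks padding

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=5, pad_id=0)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is also why the pad token id has to be set on the model configuration.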

In [6]:
from transformers import AutoTokenizer
def tokenize(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, batched=True)



Map:   0%|          | 0/7866 [00:00<?, ? examples/s]

Map:   0%|          | 0/984 [00:00<?, ? examples/s]

Map:   0%|          | 0/810 [00:00<?, ? examples/s]

### Hyperparameter Search with Optuna

Hyperparameter optimization is performed using Optuna:

- The model is trained multiple times on the training set, and evaluated on the validation set, with the goal of maximizing the weighted F1-score.

- Optuna explores different learning rates, batch sizes, and numbers of epochs to automatically find the best combination.

- Training progress is logged by a custom callback, and an early-stopping callback (patience of 2 evaluations) is attached, although with at most 3 epochs per trial it has little room to act.

- To address class imbalance, class weights are computed automatically from the frequency of each sentiment label (Negative, Positive, Neutral) in the full labeled DataFrame, before splitting and upsampling.
These weights are applied during training by a custom WeightedTrainer that replaces the loss function with a weighted cross-entropy.
This makes errors on underrepresented classes cost more, so the model is less biased toward the majority classes.
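The "balanced" class-weight heuristic can be written out by hand as n_samples / (n_classes * count). The sketch below reproduces that formula in plain Python. The counts are the per-class support of this notebook's test split, used here only as example numbers; the notebook itself computes the weights from the full DataFrame, and `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(label_counts):
    """Weights n_samples / (n_classes * count), the 'balanced' heuristic."""
    n_samples = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in label_counts.items()}

# Example counts only: the per-class support of the test split in this notebook.
counts = {"Negative": 302, "Positive": 341, "Neutral": 167}
weights = balanced_class_weights(counts)
for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

The rarest class receives the largest weight. Since the training split here has already been balanced by upsampling, weights derived from the original frequencies may compensate for the imbalance a second time, which is worth keeping in mind when reading the Neutral results.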

In [8]:
import time
import optuna
import numpy as np
import torch
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_class_weight

from transformers import (
    TrainingArguments,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainerCallback,
    EarlyStoppingCallback,
)
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig
from torch import nn
from transformers import Trainer


#  Class weights
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(df["label"]),
    y=df["label"]
)
class_weights = torch.tensor(class_weights, dtype=torch.float).to("cuda")

#  TRACK TIME PER TRIAL
class TrialProgressCallback(TrainerCallback):
    def __init__(self):
        self.start_time = None

    def on_train_begin(self, args, state, control, **kwargs):
        self.start_time = time.time()
        print(f"\n🚀 Starting trial at {time.strftime('%H:%M:%S')}")

    def on_log(self, args, state, control, logs=None, **kwargs):
        elapsed = time.time() - self.start_time
        print(f"⏱️ Step {state.global_step} - Elapsed: {round(elapsed/60, 2)} min - Logs: {logs}")

#  METRICS
def compute_metrics(pred):
    preds = pred.predictions.argmax(-1)
    return {"f1": f1_score(pred.label_ids, preds, average="weighted")}

#  MODEL INIT (with LoRA)
def model_init():
    base_model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        num_labels=3,
        trust_remote_code=True
    )
    base_model.config.pad_token_id = tokenizer.pad_token_id
    base_model = prepare_model_for_kbit_training(base_model)

    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.SEQ_CLS
    )
    lora_model = get_peft_model(base_model, lora_config)
    return lora_model

#  OPTUNA SEARCH SPACE
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 2e-5, 5e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [2, 4]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 3)
    }

#  Custom Trainer with class weights
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss_fct = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss



#  TRAINING ARGS
training_args = TrainingArguments(
    output_dir="./optuna_output",
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    metric_for_best_model="f1",
    load_best_model_at_end=True,
    report_to="none",
    logging_dir="./logs",
    logging_strategy="steps",
    logging_steps=100,
    fp16=False
)

#  TRAINER with everything
trainer = WeightedTrainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[
        TrialProgressCallback(),
        EarlyStoppingCallback(early_stopping_patience=2)
    ]
)

#  RUN OPTUNA
start = time.time()

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    hp_space=optuna_hp_space,
    n_trials=3
)

end = time.time()

print("\n✅ Done!")
print(f"⏱️ Optuna tuning done in {round((end - start)/60, 2)} minutes")
print("🏆 Best trial params:", best_trial.hyperparameters)


  trainer = WeightedTrainer(
Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at Locutusque/TinyMistral-248M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
[I 2025-05-27 17:37:55,856] A new study created in memory with name: no-name-0a1b9c76-415a-4624-be96-8b3d56d59d15
Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at Locutusque/TinyMistral-248M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
`use_cache=True` is i


🚀 Starting trial at 17:37:56


  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss,F1
1,1.2016,0.919895,0.760713
2,0.8495,0.905518,0.788788
3,0.8703,0.886684,0.794574


⏱️ Step 100 - Elapsed: 0.9 min - Logs: {'loss': 1.1162, 'grad_norm': 17.354602813720703, 'learning_rate': 7.317371887170678e-05, 'epoch': 0.025425883549453344}
⏱️ Step 200 - Elapsed: 1.76 min - Logs: {'loss': 1.086, 'grad_norm': 8.703713417053223, 'learning_rate': 7.254830247109389e-05, 'epoch': 0.05085176709890669}
⏱️ Step 300 - Elapsed: 2.61 min - Logs: {'loss': 0.9988, 'grad_norm': 20.841867446899414, 'learning_rate': 7.1922886070481e-05, 'epoch': 0.07627765064836003}
⏱️ Step 400 - Elapsed: 3.47 min - Logs: {'loss': 0.9255, 'grad_norm': 24.065855026245117, 'learning_rate': 7.129746966986813e-05, 'epoch': 0.10170353419781338}
⏱️ Step 500 - Elapsed: 4.33 min - Logs: {'loss': 0.8775, 'grad_norm': 22.398548126220703, 'learning_rate': 7.067205326925526e-05, 'epoch': 0.12712941774726672}
⏱️ Step 600 - Elapsed: 5.19 min - Logs: {'loss': 0.8477, 'grad_norm': 5.368893623352051, 'learning_rate': 7.004663686864238e-05, 'epoch': 0.15255530129672007}
⏱️ Step 700 - Elapsed: 6.05 min - Logs: {'los

  return fn(*args, **kwargs)


⏱️ Step 4000 - Elapsed: 34.79 min - Logs: {'loss': 0.7723, 'grad_norm': 0.18539218604564667, 'learning_rate': 4.878247924780452e-05, 'epoch': 1.0170353419781337}
⏱️ Step 4100 - Elapsed: 35.6 min - Logs: {'loss': 0.7933, 'grad_norm': 3.7911462783813477, 'learning_rate': 4.815706284719164e-05, 'epoch': 1.0424612255275871}
⏱️ Step 4200 - Elapsed: 36.45 min - Logs: {'loss': 0.8687, 'grad_norm': 5.5596160888671875, 'learning_rate': 4.753164644657876e-05, 'epoch': 1.0678871090770405}
⏱️ Step 4300 - Elapsed: 37.27 min - Logs: {'loss': 0.8094, 'grad_norm': 0.5400141477584839, 'learning_rate': 4.690623004596588e-05, 'epoch': 1.0933129926264937}
⏱️ Step 4400 - Elapsed: 38.1 min - Logs: {'loss': 0.663, 'grad_norm': 2.3294715881347656, 'learning_rate': 4.6280813645353e-05, 'epoch': 1.1187388761759471}
⏱️ Step 4500 - Elapsed: 38.93 min - Logs: {'loss': 0.8715, 'grad_norm': 45.824825286865234, 'learning_rate': 4.565539724474012e-05, 'epoch': 1.1441647597254005}
⏱️ Step 4600 - Elapsed: 39.74 min - Lo

  return fn(*args, **kwargs)


⏱️ Step 7900 - Elapsed: 68.78 min - Logs: {'loss': 0.6472, 'grad_norm': 0.6215434670448303, 'learning_rate': 2.439123962390226e-05, 'epoch': 2.008644800406814}
⏱️ Step 8000 - Elapsed: 69.69 min - Logs: {'loss': 0.7789, 'grad_norm': 8.242461204528809, 'learning_rate': 2.376582322328938e-05, 'epoch': 2.0340706839562674}
⏱️ Step 8100 - Elapsed: 70.61 min - Logs: {'loss': 0.7151, 'grad_norm': 0.0590372197329998, 'learning_rate': 2.31404068226765e-05, 'epoch': 2.059496567505721}
⏱️ Step 8200 - Elapsed: 71.52 min - Logs: {'loss': 0.8961, 'grad_norm': 0.09146124869585037, 'learning_rate': 2.2514990422063622e-05, 'epoch': 2.0849224510551743}
⏱️ Step 8300 - Elapsed: 72.4 min - Logs: {'loss': 0.8963, 'grad_norm': 0.017997564747929573, 'learning_rate': 2.1889574021450744e-05, 'epoch': 2.1103483346046277}
⏱️ Step 8400 - Elapsed: 73.29 min - Logs: {'loss': 0.6633, 'grad_norm': 20.651395797729492, 'learning_rate': 2.1264157620837865e-05, 'epoch': 2.135774218154081}
⏱️ Step 8500 - Elapsed: 74.19 min 

[I 2025-05-27 19:22:27,431] Trial 0 finished with value: 0.7945739907621111 and parameters: {'learning_rate': 7.379288110831352e-05, 'per_device_train_batch_size': 2, 'num_train_epochs': 3}. Best is trial 0 with value: 0.7945739907621111.
Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at Locutusque/TinyMistral-248M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🚀 Starting trial at 19:22:28


  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss,F1
1,0.7113,0.642841,0.776022
2,0.549,0.689544,0.793612
3,0.5776,0.655661,0.810064


⏱️ Step 100 - Elapsed: 2.12 min - Logs: {'loss': 1.0923, 'grad_norm': 9.305536270141602, 'learning_rate': 0.0001193612523482202, 'epoch': 0.05083884087442806}
⏱️ Step 200 - Elapsed: 4.23 min - Logs: {'loss': 0.9474, 'grad_norm': 15.959235191345215, 'learning_rate': 0.00011730400911574485, 'epoch': 0.10167768174885612}
⏱️ Step 300 - Elapsed: 6.33 min - Logs: {'loss': 0.827, 'grad_norm': 7.3491740226745605, 'learning_rate': 0.0001152467658832695, 'epoch': 0.1525165226232842}
⏱️ Step 400 - Elapsed: 8.43 min - Logs: {'loss': 0.7962, 'grad_norm': 17.798126220703125, 'learning_rate': 0.00011318952265079413, 'epoch': 0.20335536349771224}
⏱️ Step 500 - Elapsed: 10.52 min - Logs: {'loss': 0.647, 'grad_norm': 8.831413269042969, 'learning_rate': 0.00011113227941831877, 'epoch': 0.2541942043721403}
⏱️ Step 600 - Elapsed: 12.62 min - Logs: {'loss': 0.655, 'grad_norm': 8.403214454650879, 'learning_rate': 0.0001090750361858434, 'epoch': 0.3050330452465684}
⏱️ Step 700 - Elapsed: 14.55 min - Logs: {'l

  return fn(*args, **kwargs)


⏱️ Step 2000 - Elapsed: 40.63 min - Logs: {'loss': 0.7023, 'grad_norm': 12.974275588989258, 'learning_rate': 8.027363093118843e-05, 'epoch': 1.0167768174885612}
⏱️ Step 2100 - Elapsed: 42.61 min - Logs: {'loss': 0.5801, 'grad_norm': 7.5302815437316895, 'learning_rate': 7.821638769871307e-05, 'epoch': 1.0676156583629894}
⏱️ Step 2200 - Elapsed: 44.55 min - Logs: {'loss': 0.5468, 'grad_norm': 12.412310600280762, 'learning_rate': 7.61591444662377e-05, 'epoch': 1.1184544992374175}
⏱️ Step 2300 - Elapsed: 46.48 min - Logs: {'loss': 0.5928, 'grad_norm': 1.8654537200927734, 'learning_rate': 7.410190123376234e-05, 'epoch': 1.1692933401118455}
⏱️ Step 2400 - Elapsed: 48.42 min - Logs: {'loss': 0.7188, 'grad_norm': 25.297657012939453, 'learning_rate': 7.204465800128699e-05, 'epoch': 1.2201321809862735}
⏱️ Step 2500 - Elapsed: 50.38 min - Logs: {'loss': 0.5163, 'grad_norm': 7.9633564949035645, 'learning_rate': 6.998741476881164e-05, 'epoch': 1.2709710218607015}
⏱️ Step 2600 - Elapsed: 52.35 min -

  return fn(*args, **kwargs)


⏱️ Step 4000 - Elapsed: 82.11 min - Logs: {'loss': 0.5176, 'grad_norm': 12.795262336730957, 'learning_rate': 3.9128766281681285e-05, 'epoch': 2.0335536349771224}
⏱️ Step 4100 - Elapsed: 84.65 min - Logs: {'loss': 0.5786, 'grad_norm': 12.042179107666016, 'learning_rate': 3.7071523049205935e-05, 'epoch': 2.0843924758515504}
⏱️ Step 4200 - Elapsed: 87.04 min - Logs: {'loss': 0.5963, 'grad_norm': 0.4469873905181885, 'learning_rate': 3.501427981673057e-05, 'epoch': 2.135231316725979}
⏱️ Step 4300 - Elapsed: 89.44 min - Logs: {'loss': 0.5854, 'grad_norm': 1.744767427444458, 'learning_rate': 3.2957036584255214e-05, 'epoch': 2.186070157600407}
⏱️ Step 4400 - Elapsed: 91.86 min - Logs: {'loss': 0.4504, 'grad_norm': 4.811198711395264, 'learning_rate': 3.089979335177986e-05, 'epoch': 2.236908998474835}
⏱️ Step 4500 - Elapsed: 94.29 min - Logs: {'loss': 0.4352, 'grad_norm': 2.6062073707580566, 'learning_rate': 2.8842550119304503e-05, 'epoch': 2.287747839349263}
⏱️ Step 4600 - Elapsed: 96.71 min - 

[I 2025-05-27 21:29:07,828] Trial 1 finished with value: 0.8100640712161892 and parameters: {'learning_rate': 0.0001213979231483708, 'per_device_train_batch_size': 4, 'num_train_epochs': 3}. Best is trial 1 with value: 0.8100640712161892.
Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at Locutusque/TinyMistral-248M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🚀 Starting trial at 21:29:08


  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss,F1
1,1.0716,0.851844,0.780047
2,0.6819,0.884037,0.814688
3,0.6495,0.830791,0.828066


⏱️ Step 100 - Elapsed: 0.87 min - Logs: {'loss': 1.1012, 'grad_norm': 17.019407272338867, 'learning_rate': 0.0001750014406743405, 'epoch': 0.025425883549453344}
⏱️ Step 200 - Elapsed: 1.73 min - Logs: {'loss': 1.008, 'grad_norm': 5.946736812591553, 'learning_rate': 0.0001735057018651581, 'epoch': 0.05085176709890669}
⏱️ Step 300 - Elapsed: 2.6 min - Logs: {'loss': 0.7708, 'grad_norm': 7.7479119300842285, 'learning_rate': 0.0001720099630559757, 'epoch': 0.07627765064836003}
⏱️ Step 400 - Elapsed: 3.47 min - Logs: {'loss': 0.7755, 'grad_norm': 7.070446968078613, 'learning_rate': 0.0001705142242467933, 'epoch': 0.10170353419781338}
⏱️ Step 500 - Elapsed: 4.34 min - Logs: {'loss': 0.9009, 'grad_norm': 19.29374122619629, 'learning_rate': 0.0001690184854376109, 'epoch': 0.12712941774726672}
⏱️ Step 600 - Elapsed: 5.2 min - Logs: {'loss': 0.9166, 'grad_norm': 2.347876787185669, 'learning_rate': 0.00016752274662842852, 'epoch': 0.15255530129672007}
⏱️ Step 700 - Elapsed: 6.07 min - Logs: {'los

  return fn(*args, **kwargs)


⏱️ Step 4000 - Elapsed: 35.7 min - Logs: {'loss': 0.6478, 'grad_norm': 0.14166556298732758, 'learning_rate': 0.000116667627116227, 'epoch': 1.0170353419781337}
⏱️ Step 4100 - Elapsed: 36.56 min - Logs: {'loss': 0.6961, 'grad_norm': 1.086085557937622, 'learning_rate': 0.00011517188830704459, 'epoch': 1.0424612255275871}
⏱️ Step 4200 - Elapsed: 37.43 min - Logs: {'loss': 0.8022, 'grad_norm': 9.959230422973633, 'learning_rate': 0.0001136761494978622, 'epoch': 1.0678871090770405}
⏱️ Step 4300 - Elapsed: 38.3 min - Logs: {'loss': 0.749, 'grad_norm': 0.3586465120315552, 'learning_rate': 0.0001121804106886798, 'epoch': 1.0933129926264937}
⏱️ Step 4400 - Elapsed: 39.16 min - Logs: {'loss': 0.5676, 'grad_norm': 0.9739956855773926, 'learning_rate': 0.0001106846718794974, 'epoch': 1.1187388761759471}
⏱️ Step 4500 - Elapsed: 40.03 min - Logs: {'loss': 0.7901, 'grad_norm': 40.108070373535156, 'learning_rate': 0.00010918893307031501, 'epoch': 1.1441647597254005}
⏱️ Step 4600 - Elapsed: 40.9 min - Lo

  return fn(*args, **kwargs)


⏱️ Step 7900 - Elapsed: 70.53 min - Logs: {'loss': 0.602, 'grad_norm': 0.09784363210201263, 'learning_rate': 5.83338135581135e-05, 'epoch': 2.008644800406814}
⏱️ Step 8000 - Elapsed: 71.4 min - Logs: {'loss': 0.641, 'grad_norm': 0.08299805968999863, 'learning_rate': 5.68380747489311e-05, 'epoch': 2.0340706839562674}
⏱️ Step 8100 - Elapsed: 72.26 min - Logs: {'loss': 0.5462, 'grad_norm': 0.05765227600932121, 'learning_rate': 5.53423359397487e-05, 'epoch': 2.059496567505721}
⏱️ Step 8200 - Elapsed: 73.12 min - Logs: {'loss': 0.6828, 'grad_norm': 0.02796316333115101, 'learning_rate': 5.384659713056631e-05, 'epoch': 2.0849224510551743}
⏱️ Step 8300 - Elapsed: 73.98 min - Logs: {'loss': 0.7305, 'grad_norm': 0.033547017723321915, 'learning_rate': 5.23508583213839e-05, 'epoch': 2.1103483346046277}
⏱️ Step 8400 - Elapsed: 74.84 min - Logs: {'loss': 0.5378, 'grad_norm': 11.05135726928711, 'learning_rate': 5.085511951220151e-05, 'epoch': 2.135774218154081}
⏱️ Step 8500 - Elapsed: 75.7 min - Logs

[I 2025-05-27 23:14:13,496] Trial 2 finished with value: 0.828066301754496 and parameters: {'learning_rate': 0.00017648222209543107, 'per_device_train_batch_size': 2, 'num_train_epochs': 3}. Best is trial 2 with value: 0.828066301754496.



✅ Done!
⏱️ Optuna tuning done in 336.29 minutes
🏆 Best trial params: {'learning_rate': 0.00017648222209543107, 'per_device_train_batch_size': 2, 'num_train_epochs': 3}


### Saving the best parameters

In [9]:
print("🏆 Best trial F1 score:", best_trial.objective)
print("📋 Best trial hyperparameters:", best_trial.hyperparameters)

🏆 Best trial F1 score: 0.828066301754496
📋 Best trial hyperparameters: {'learning_rate': 0.00017648222209543107, 'per_device_train_batch_size': 2, 'num_train_epochs': 3}


In [10]:
import json
# Save best params
with open("best_params_MISTRAL-v4.json", "w") as f:
    json.dump(best_trial.hyperparameters, f)

print("✅ Best hyperparameters saved.")


✅ Best hyperparameters saved.


### Final model

#### Merging Train and Validation Sets

The training and validation datasets are merged into a single full training set.
With the hyperparameters fixed, the final model can train on all labeled data outside the test set rather than holding part of it back for validation.

In [7]:
from datasets import concatenate_datasets

# Merge train and validation splits (rows of train first, then validation)
full_train_dataset = concatenate_datasets(
    [tokenized_dataset["train"], tokenized_dataset["validation"]]
)


#### Final version

In this final version, the model is fine-tuned with the best hyperparameters found by Optuna.

Class imbalance is addressed by giving the Neutral class twice the loss weight of the other two classes (weights 1.0, 1.0, 2.0).

A custom WeightedTrainer applies these weights through a weighted CrossEntropyLoss.

The model is initialized with LoRA (Low-Rank Adaptation) on top of TinyMistral-248M, using 4-bit quantization to reduce memory usage.

Training runs on the merged train and validation data with evaluation disabled (eval_strategy="no"), since no held-out validation split remains.

After training, predictions are made on the separate test set, and a probability threshold is optimized to reassign low-confidence predictions to Neutral. Because this threshold is tuned on the test labels themselves, the reported test scores are likely optimistic.

Finally, a classification report, a confusion matrix, and the weighted F1-score are printed to summarize the model's performance.
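A possible alternative to tuning the Neutral threshold on the test set is to choose it on held-out validation predictions and then apply it to the test set unchanged. The sketch below does this with a simple grid search, which also suits the step-shaped F1 curve better than a continuous optimizer. It is written in plain Python on made-up probabilities; `weighted_f1`, `predict_with_threshold`, and `pick_threshold` are hypothetical helpers, and it assumes some validation data was kept aside rather than merged into training.

```python
def weighted_f1(y_true, y_pred, n_classes=3):
    """Support-weighted mean of per-class F1 scores."""
    total = 0.0
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * (tp + fn)  # tp + fn is the class support
    return total / len(y_true)

def predict_with_threshold(probs, thresh, neutral_id=2):
    """Argmax prediction, replaced by Neutral when the top probability is below thresh."""
    preds = []
    for p in probs:
        best = max(range(len(p)), key=p.__getitem__)
        preds.append(best if p[best] >= thresh else neutral_id)
    return preds

def pick_threshold(val_probs, val_labels, grid):
    """Return the grid value with the highest weighted F1 on validation data."""
    return max(grid, key=lambda t: weighted_f1(val_labels, predict_with_threshold(val_probs, t)))

# Toy validation data (made up for the example).
val_probs = [[0.80, 0.10, 0.10], [0.45, 0.15, 0.40], [0.10, 0.85, 0.05], [0.40, 0.25, 0.35]]
val_labels = [0, 2, 1, 2]
grid = [round(0.30 + 0.05 * i, 2) for i in range(9)]  # 0.30, 0.35, ..., 0.70
best = pick_threshold(val_probs, val_labels, grid)
print("chosen threshold:", best)
```

The chosen value would then be passed to `predict_with_threshold` on the test probabilities once, so the test labels play no part in selecting it.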

In [8]:
#  Imports
import torch
import json
import numpy as np
import time
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from torch import nn

#  Load Best Parameters
with open("best_params_MISTRAL-v4.json", "r") as f:  # same file name as written by the save cell
    best_params = json.load(f)



#  Class weights
class_weights = torch.tensor([1.0, 1.0, 2.0], dtype=torch.float).to("cuda")  # Neutral weighted higher


#  Custom Trainer to inject class weights
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss_fct = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss

#  Model Init function
def model_init():
    base_model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        num_labels=3,
        trust_remote_code=True
    )
    base_model.config.pad_token_id = tokenizer.pad_token_id
    base_model = prepare_model_for_kbit_training(base_model)
    
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.SEQ_CLS
    )
    model = get_peft_model(base_model, lora_config)
    return model

#  Training Args
final_training_args = TrainingArguments(
    output_dir="./final",
    eval_strategy="no",
    save_strategy="no",  
    learning_rate=best_params["learning_rate"],
    per_device_train_batch_size=best_params["per_device_train_batch_size"],
    num_train_epochs=3,
    report_to="none",
    fp16=False
)

#  Final Trainer
final_trainer = WeightedTrainer(
    model_init=model_init,
    args=final_training_args,
    train_dataset=full_train_dataset,  # <<< Full train
    tokenizer=tokenizer
)

#  Train
start = time.time()
final_trainer.train()
end = time.time()
print(f"\n✅ Final training completed in {round((end-start)/60, 2)} minutes.")

#  Save Model
final_trainer.save_model("finalBALANCE")
print(" Final model saved to 'finalBALANCE' folder.")

#  Predictions
predictions = final_trainer.predict(tokenized_dataset["test"])
probs = torch.softmax(torch.tensor(predictions.predictions), dim=1).numpy()
labels = predictions.label_ids

#  Auto Threshold Optimization
from scipy.optimize import minimize_scalar

def threshold_objective(thresh):
    preds = np.argmax(probs, axis=1)
    max_probs = np.max(probs, axis=1)
    preds[max_probs < thresh] = 2  # Force to Neutral if uncertain
    return -f1_score(labels, preds, average="weighted")

opt_result = minimize_scalar(threshold_objective, bounds=(0.3, 0.7), method="bounded")
best_thresh = opt_result.x
print(f"\n🔍 Best Neutral Threshold found: {round(best_thresh, 3)}")

#  Apply Best Threshold
preds = np.argmax(probs, axis=1)
max_probs = np.max(probs, axis=1)
preds[max_probs < best_thresh] = 2  # Again force Neutral

#  Print Metrics
print("\n📘 Classification Report (Final):")
print(classification_report(labels, preds, target_names=['Negative', 'Positive', 'Neutral']))

print("\n✅ Confusion Matrix:")
print(confusion_matrix(labels, preds))

final_f1 = f1_score(labels, preds, average="weighted")
print(f"\n🔴 Final Weighted F1 (with threshold): {round(final_f1, 4)}")





  final_trainer = WeightedTrainer(
Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at Locutusque/TinyMistral-248M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at Locutusque/TinyMistral-248M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


Step,Training Loss
500,0.8991
1000,0.863
1500,0.9062
2000,0.8553
2500,0.8361
3000,0.9218
3500,0.9005
4000,0.9236
4500,0.7822
5000,0.7271



✅ Final training completed in 116.38 minutes.
 Final model saved to 'finalBALANCE' folder.



🔍 Best Neutral Threshold found: 0.341

📘 Classification Report (Final):
              precision    recall  f1-score   support

    Negative       0.80      0.81      0.80       302
    Positive       0.85      0.88      0.87       341
     Neutral       0.44      0.40      0.42       167

    accuracy                           0.75       810
   macro avg       0.70      0.70      0.70       810
weighted avg       0.75      0.75      0.75       810


✅ Confusion Matrix:
[[245   3  54]
 [ 11 299  31]
 [ 52  48  67]]

🔴 Final Weighted F1 (with threshold): 0.7504


In [None]:
import torch
import torch.nn.functional as F
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from scipy.optimize import minimize_scalar

# === Predict on test set ===
predictions = final_trainer.predict(tokenized_dataset["test"])
logits = predictions.predictions
labels = predictions.label_ids

# === Apply softmax with temperature scaling ===
temperature = 2 
probs = F.softmax(torch.tensor(logits / temperature), dim=-1).numpy()

# === Define threshold optimization using confidence margin ===
def threshold_objective(margin):
    final_preds = []
    for p in probs:
        top2 = np.sort(p)[-2:]        # take two highest probs
        gap = top2[1] - top2[0]       # confidence gap
        if gap < margin:
            final_preds.append(2)     # force Neutral
        else:
            final_preds.append(np.argmax(p))
    return -f1_score(labels, final_preds, average="weighted")

# === Find best neutral margin ===
opt_result = minimize_scalar(threshold_objective, bounds=(0.05, 0.4), method="bounded")
neutral_margin = opt_result.x
print(f"\n🔍 Best Neutral Confidence Margin found: {round(neutral_margin, 3)}")

# === Apply margin to get final predictions ===
final_preds = []
for p in probs:
    top2 = np.sort(p)[-2:]
    gap = top2[1] - top2[0]
    if gap < neutral_margin:
        final_preds.append(2)
    else:
        final_preds.append(np.argmax(p))

# === Evaluation ===
print("\n📘 Classification Report (Confidence + Temperature):")
print(classification_report(labels, final_preds, target_names=["Negative", "Positive", "Neutral"]))

print("\n✅ Confusion Matrix:")
print(confusion_matrix(labels, final_preds))

final_f1 = f1_score(labels, final_preds, average="weighted")
print(f"\n🔴 Final Weighted F1 (with threshold): {round(final_f1, 4)}")



🔍 Best Neutral Confidence Margin found: 0.157

📘 Classification Report (Confidence + Temperature):
              precision    recall  f1-score   support

    Negative       0.81      0.79      0.80       302
    Positive       0.86      0.87      0.86       341
     Neutral       0.44      0.43      0.43       167

    accuracy                           0.75       810
   macro avg       0.70      0.70      0.70       810
weighted avg       0.75      0.75      0.75       810


✅ Confusion Matrix:
[[240   3  59]
 [ 10 297  34]
 [ 48  47  72]]

🔴 Final Weighted F1 (with threshold): 0.7512
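The cell above divides the logits by a temperature before the softmax and then looks at the gap between the two largest probabilities. The following plain-Python sketch, on made-up logits, illustrates the effect of that temperature: it leaves the top class unchanged but shrinks the gap, so more predictions fall under a given margin. The helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top2_margin(probs):
    """Gap between the largest and second-largest probability."""
    second, first = sorted(probs)[-2:]
    return first - second

logits = [2.0, 0.5, 1.5]  # made-up logits for one review
for t in (1.0, 2.0):
    p = softmax(logits, temperature=t)
    print(t, [round(x, 3) for x in p], round(top2_margin(p), 3))
```

Because the temperature and the margin interact, a margin tuned at one temperature does not carry over to another.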


In [15]:
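from sklearn.metrics import accuracy_score

# Caution: `preds` holds the max-probability-threshold predictions from the
# training cell, while `final_f1` was last assigned in the confidence-margin
# cell, so the two numbers below describe different prediction sets.
# Pass `final_preds` here to report both metrics for the confidence-margin method.
final_accuracy = accuracy_score(labels, preds)

print(f"\n✅ Final Accuracy: {final_accuracy:.4f}")
print(f"✅ Final Weighted F1-score: {final_f1:.4f}")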



✅ Final Accuracy: 0.7543
✅ Final Weighted F1-score: 0.7512
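The summary numbers can be cross-checked by hand from a confusion matrix. The sketch below recomputes accuracy, precision, and recall from the matrix printed by the confidence-margin cell (rows are true labels, columns are predictions). For that matrix the accuracy is 609/810, whereas the 0.7543 printed above equals 611/810, the diagonal of the earlier max-probability-threshold matrix.

```python
# Confusion matrix printed by the confidence-margin cell (rows = true, columns = predicted).
cm = [
    [240, 3, 59],   # Negative
    [10, 297, 34],  # Positive
    [48, 47, 72],   # Neutral
]
names = ["Negative", "Positive", "Neutral"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total
print(f"accuracy: {accuracy:.4f}")

for i, name in enumerate(names):
    recall = cm[i][i] / sum(cm[i])                    # correct / all true members of the class
    precision = cm[i][i] / sum(row[i] for row in cm)  # correct / all predictions of the class
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

The per-class values agree with the classification report for the same cell, and they show that most remaining errors involve Neutral reviews being confused with one of the other two classes.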
