# 03 - Train Sentiment Model (distilbert-multilingual)

**Goals:**
- Fine-tune `distilbert-base-multilingual-cased` on the EN train set
- Optimize for a GTX 1650: small batches (4-8), fp16, max_length=128, early stopping
- Save the model for use in evaluation + XAI

**Why distilbert-multilingual:**
- Lighter than XLM-R and BERT-multilingual
- Still multilingual (trained on EN only, yet it still works when tested on ES/FR)

Note: the config cell in this run sets `MODEL_NAME = 'bert-base-multilingual-cased'` (about 178M parameters), not the distilbert checkpoint.

In [1]:
# Imports
from pathlib import Path
import pandas as pd
import numpy as np
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
from sklearn.metrics import accuracy_score, f1_score, classification_report

print('torch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))


torch: 2.7.1+cu118
CUDA available: True
GPU: NVIDIA GeForce GTX 1650


## 1) Config

In [2]:
# Paths
DATA_DIR = Path('data_splits')
OUTPUT_DIR = Path('model_output')
OUTPUT_DIR.mkdir(exist_ok=True)

# Columns (MUST match 02_sample_split.ipynb)
TEXT_COL = 'cleaned_text'  # actual column name from CSV
LABEL_COL = 'sentiment'

# Model - BERT base multilingual (178M params, larger than distilbert 135M)
MODEL_NAME = 'bert-base-multilingual-cased'  # or 'xlm-roberta-base' (270M) if you have more VRAM

# Training hyperparameters (optimized for GTX 1650)
MAX_LENGTH = 128
BATCH_SIZE = 8  # reduced from 16 due to larger model (increase gradient_accumulation if needed)
LEARNING_RATE = 2e-5
EPOCHS = 5  # upper bound; early stopping can end training sooner if f1_macro stops improving
WARMUP_RATIO = 0.1
WEIGHT_DECAY = 0.01
FP16 = torch.cuda.is_available()  # use mixed precision if GPU

# Early stopping
EARLY_STOP_PATIENCE = 2

RANDOM_STATE = 42

## 2) Load data

In [3]:
df_train = pd.read_csv(DATA_DIR / 'train.csv')
df_val = pd.read_csv(DATA_DIR / 'val.csv')

print('Train shape:', df_train.shape)
print('Val shape:', df_val.shape)

# Check actual columns in the data
print('\n📋 Columns in train.csv:')
print(df_train.columns.tolist())

print('\nFirst few rows:')
display(df_train.head(2))

print('\nLabel distribution (train):')
display(df_train[LABEL_COL].value_counts())

Train shape: (31499, 2)
Val shape: (6750, 2)

📋 Columns in train.csv:
['cleaned_text', 'sentiment']

First few rows:


| | cleaned_text | sentiment |
|---|---|---|
| 0 | Brute Force took down our server. | negative |
| 1 | So into pressure enjoy single box check knowle... | positive |



Label distribution (train):


sentiment
negative    10500
neutral     10500
positive    10499
Name: count, dtype: int64

In [4]:
# For faster training, use smaller subset (stratified sampling)
TRAIN_SUBSET_SIZE = 10000  # reduce from 31k to 10k for speed
VAL_SUBSET_SIZE = 2000     # reduce from 6.7k to 2k

print(f'Using subset: {TRAIN_SUBSET_SIZE} train, {VAL_SUBSET_SIZE} val')

# Stratified sample
if len(df_train) > TRAIN_SUBSET_SIZE:
    df_train = df_train.groupby(LABEL_COL, group_keys=False).apply(
        lambda x: x.sample(min(TRAIN_SUBSET_SIZE // df_train[LABEL_COL].nunique(), len(x)), random_state=42)
    ).reset_index(drop=True)
    print(f'Sampled train: {len(df_train)}')

if len(df_val) > VAL_SUBSET_SIZE:
    df_val = df_val.groupby(LABEL_COL, group_keys=False).apply(
        lambda x: x.sample(min(VAL_SUBSET_SIZE // df_val[LABEL_COL].nunique(), len(x)), random_state=42)
    ).reset_index(drop=True)
    print(f'Sampled val: {len(df_val)}')

print('\nFinal shapes:')
print('Train:', df_train.shape)
print('Val:', df_val.shape)
print('\nLabel distribution (train):')
display(df_train[LABEL_COL].value_counts())

Using subset: 10000 train, 2000 val
Sampled train: 9999
Sampled val: 1998

Final shapes:
Train: (9999, 2)
Val: (1998, 2)

Label distribution (train):



sentiment
negative    3333
neutral     3333
positive    3333
Name: count, dtype: int64
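
The per-label sampling above could also be written with `DataFrameGroupBy.sample`, which avoids `groupby(...).apply(...)`. The following is only a sketch on synthetic data: the helper name `stratified_subset` and the toy frame are assumptions, and unlike the cell above it caps every class at the size of the smallest one so the subset stays balanced.

```python
import pandas as pd

def stratified_subset(df, label_col, total_size, seed=42):
    """Draw the same number of rows per label, capped by the smallest class."""
    n_classes = df[label_col].nunique()
    smallest = int(df[label_col].value_counts().min())
    per_class = min(total_size // n_classes, smallest)
    return (
        df.groupby(label_col, group_keys=False)
          .sample(n=per_class, random_state=seed)
          .reset_index(drop=True)
    )

# Toy data: 30 rows, 10 per label (hypothetical, not the notebook's dataset)
toy = pd.DataFrame({
    'cleaned_text': [f'text {i}' for i in range(30)],
    'sentiment': ['negative', 'neutral', 'positive'] * 10,
})

subset = stratified_subset(toy, 'sentiment', total_size=10)
counts = subset['sentiment'].value_counts().to_dict()
print(len(subset), counts)
```

Because the target size is split by integer division across three labels, a request for 10 rows yields 9, the same effect that turns 10000 into 9999 in the output above.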

## 3) Encode labels to integers (if needed)

In [5]:
# Encode string labels to integer ids; labels that are already numeric are left as-is
if not pd.api.types.is_numeric_dtype(df_train[LABEL_COL]):
    # Encode to int
    label_map = {lbl: i for i, lbl in enumerate(sorted(df_train[LABEL_COL].unique()))}
    print('Label mapping:', label_map)
    
    df_train['label_id'] = df_train[LABEL_COL].map(label_map)
    df_val['label_id'] = df_val[LABEL_COL].map(label_map)
    
    LABEL_ID_COL = 'label_id'
    NUM_LABELS = len(label_map)
    
    # Save label map for later
    import json
    with open(OUTPUT_DIR / 'label_map.json', 'w') as f:
        json.dump(label_map, f, indent=2)
else:
    # Already numeric
    LABEL_ID_COL = LABEL_COL
    NUM_LABELS = df_train[LABEL_COL].nunique()
    print('Labels already numeric, num_labels:', NUM_LABELS)

print(f'NUM_LABELS: {NUM_LABELS}')

Label mapping: {'negative': 0, 'neutral': 1, 'positive': 2}
NUM_LABELS: 3
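
To show how the saved `label_map.json` might be used later (for example in the evaluation or XAI notebooks), here is a small round-trip sketch. It writes the mapping printed above to a temporary directory rather than `model_output`, and the `pred_ids` list is a made-up stand-in for argmax outputs.

```python
import json
import tempfile
from pathlib import Path

label_map = {'negative': 0, 'neutral': 1, 'positive': 2}  # mapping printed above

# Save and reload, mirroring the json.dump call in the cell above
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'label_map.json'
    path.write_text(json.dumps(label_map, indent=2))
    loaded = json.loads(path.read_text())

# Invert label -> id into id -> label so predictions can be decoded
id2label = {idx: lbl for lbl, idx in loaded.items()}

pred_ids = [2, 0, 1, 2]  # hypothetical argmax outputs
decoded = [id2label[i] for i in pred_ids]
print(decoded)
```

The same two dictionaries can also be passed as `id2label` and `label2id` when calling `from_pretrained`, so that the saved model config carries the label names.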


## 4) Tokenization

In [6]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_batch(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=MAX_LENGTH
    )

# Convert to HuggingFace Dataset
train_ds = Dataset.from_pandas(df_train[[TEXT_COL, LABEL_ID_COL]].rename(columns={TEXT_COL: 'text', LABEL_ID_COL: 'label'}))
val_ds = Dataset.from_pandas(df_val[[TEXT_COL, LABEL_ID_COL]].rename(columns={TEXT_COL: 'text', LABEL_ID_COL: 'label'}))

train_ds = train_ds.map(tokenize_batch, batched=True)
val_ds = val_ds.map(tokenize_batch, batched=True)

# Set format
train_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
val_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print('Train dataset:', train_ds)
print('Val dataset:', val_ds)

Map: 100%|██████████| 9999/9999 [00:00<00:00, 11473.20 examples/s]
Map: 100%|██████████| 1998/1998 [00:00<00:00, 14529.34 examples/s]

Train dataset: Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 9999
})
Val dataset: Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1998
})
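
As a rough illustration of what `padding='max_length'` combined with `truncation=True` produces, the toy function below pads or cuts a list of token ids to a fixed length and builds the matching attention mask. It is not the tokenizer's implementation: the name `pad_or_truncate` and the `pad_id=0` default are assumptions, and a real tokenizer also keeps special tokens such as `[SEP]` in place when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])        # cut sequences that are too long
    mask = [1] * len(ids)                     # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# Hypothetical token ids, chosen only for illustration
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)

# Share of positions that are padding for the short example
pad_share = short_mask.count(0) / len(short_mask)
print(f'padding share: {pad_share:.0%}')
```

With short texts and `MAX_LENGTH = 128`, many positions in each row may be padding. Padding per batch instead (for instance with a padding data collator) is one possible way to reduce that cost.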





## 5) Model initialization

In [7]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)

print(f'Model params: {model.num_parameters() / 1e6:.1f}M')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model params: 177.9M


## 6) Define metrics

In [8]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    
    acc = accuracy_score(labels, preds)
    f1_macro = f1_score(labels, preds, average='macro')
    
    return {
        'accuracy': acc,
        'f1_macro': f1_macro,
    }
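
To make explicit what `average='macro'` computes, here is a NumPy-only sketch that derives per-class F1 from true positives, false positives, and false negatives, then takes the unweighted mean. The helper `macro_f1` and the toy logits are assumptions for illustration; the notebook itself relies on scikit-learn's `f1_score`.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_labels):
    """Unweighted mean of per-class F1 scores (0.0 for a class with no support)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(num_labels):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy logits for 6 examples and 3 classes (hypothetical values)
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.2, 3.0],
    [1.2, 0.9, 0.1],
    [0.3, 0.2, 0.9],
    [0.1, 2.2, 0.4],
])
labels = np.array([0, 1, 2, 1, 2, 0])

preds = logits.argmax(axis=-1)
accuracy = float(np.mean(preds == labels))
f1 = macro_f1(labels, preds, num_labels=3)
print(f'accuracy={accuracy:.4f}  f1_macro={f1:.4f}')
```

Because every class contributes equally to the mean, macro F1 penalizes a model that does well on two labels but poorly on the third, even when the classes are balanced as they are here.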

## 7) Training arguments (optimized for GTX 1650)

In [9]:
training_args = TrainingArguments(
    output_dir=str(OUTPUT_DIR),
    
    # Training params
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE * 2,  # eval can use larger batch
    num_train_epochs=EPOCHS,
    learning_rate=LEARNING_RATE,
    warmup_ratio=WARMUP_RATIO,
    weight_decay=WEIGHT_DECAY,
    
    # Optimization
    fp16=FP16,
    gradient_accumulation_steps=2,  # increased from 1 to simulate batch 16 with less VRAM usage
    
    # Evaluation & saving (less frequent to speed up)
    eval_strategy='steps',
    eval_steps=300,  # evaluate every 300 steps
    save_strategy='steps',
    save_steps=300,
    save_total_limit=2,  # keep at most 2 checkpoints on disk; the best one is always retained
    load_best_model_at_end=True,
    metric_for_best_model='f1_macro',
    greater_is_better=True,
    
    # Logging
    logging_steps=50,
    logging_dir=str(OUTPUT_DIR / 'logs'),
    
    # Misc
    seed=RANDOM_STATE,
    disable_tqdm=False,
)

print('Training args:', training_args)

Training args: TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=300,
eval_strategy=steps,
eval_use_gather_object=False,
fp
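
The arguments above imply a training schedule that can be worked out by hand. The sketch below does that arithmetic for the 9999-row training subset, assuming a single GPU. The helper `schedule_summary` is hypothetical, and the floor division for a trailing partial accumulation group is an assumption about how the trainer counts steps (it makes no difference here, since 1250 batches divide evenly by 2).

```python
import math

def schedule_summary(num_samples, batch_size, grad_accum, epochs,
                     warmup_ratio, eval_steps):
    """Estimate optimizer-step counts for a single-device run."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = updates_per_epoch * epochs
    return {
        'effective_batch_size': batch_size * grad_accum,
        'updates_per_epoch': updates_per_epoch,
        'total_updates': total_updates,
        'warmup_updates': math.ceil(total_updates * warmup_ratio),
        'evaluations': total_updates // eval_steps,
    }

summary = schedule_summary(
    num_samples=9999,   # sampled train size from section 2
    batch_size=8,       # BATCH_SIZE
    grad_accum=2,       # gradient_accumulation_steps
    epochs=5,           # EPOCHS
    warmup_ratio=0.1,   # WARMUP_RATIO
    eval_steps=300,
)
for key, value in summary.items():
    print(f'{key}: {value}')
```

Under these assumptions a full run has 3125 optimizer steps, so evaluating every 300 steps gives 10 evaluation points, the last one at step 3000.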

## 8) Trainer + early stopping

In [10]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tokenizer,  # `tokenizer=` is deprecated in recent transformers releases
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=EARLY_STOP_PATIENCE)]
)

print('Trainer ready. Starting training...')


Trainer ready. Starting training...


## 9) Train!

In [11]:
# Train
train_result = trainer.train()

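print('\n✅ Training complete!')
# train_result.metrics only holds train_* keys (a lookup of 'eval_f1_macro' there yields N/A,
# as in the output recorded below), so read the best eval score from trainer.state instead
print('Best metric (f1_macro):', trainer.state.best_metric)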

| Step | Training Loss | Validation Loss | Accuracy | F1 Macro |
|---|---|---|---|---|
| 300 | 0.6885 | 0.669696 | 0.721722 | 0.708132 |
| 600 | 0.374 | 0.382255 | 0.86987 | 0.868247 |
| 900 | 0.2279 | 0.269228 | 0.913413 | 0.912519 |
| 1200 | 0.1903 | 0.222109 | 0.921421 | 0.920468 |
| 1500 | 0.1677 | 0.290651 | 0.929429 | 0.928304 |
| 1800 | 0.1193 | 0.220644 | 0.938438 | 0.937974 |
| 2100 | 0.1278 | 0.244401 | 0.93994 | 0.939708 |
| 2400 | 0.0859 | 0.265875 | 0.946446 | 0.946186 |
| 2700 | 0.0437 | 0.274664 | 0.946446 | 0.946157 |
| 3000 | 0.0559 | 0.264754 | 0.950951 | 0.950585 |

✅ Training complete!
Best metric (f1_macro): N/A
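
The macro-F1 column in the log above dips only once (at step 2700), which is why training ran to the end despite `EARLY_STOP_PATIENCE = 2`. The sketch below replays that column through a simplified higher-is-better patience rule. It is an illustration under that assumption, not the library's `EarlyStoppingCallback` code, and `first_stop_index` is a hypothetical helper.

```python
def first_stop_index(scores, patience, min_delta=0.0):
    """Index of the evaluation that would trigger a stop, or None if none does."""
    best = None
    bad_evals = 0
    for i, score in enumerate(scores):
        if best is None or score - best > min_delta:
            best = score          # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1        # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

# F1 Macro values copied from the training log above (steps 300 to 3000)
f1_log = [0.708132, 0.868247, 0.912519, 0.920468, 0.928304,
          0.937974, 0.939708, 0.946186, 0.946157, 0.950585]

stop_p2 = first_stop_index(f1_log, patience=2)
stop_p1 = first_stop_index(f1_log, patience=1)
print('patience=2 ->', stop_p2)
print('patience=1 ->', stop_p1)
```

With a patience of 2 the rule never fires on this log, while a patience of 1 would have stopped at the ninth evaluation (step 2700) and missed the final improvement at step 3000.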


## 10) Evaluate on val set

In [12]:
val_results = trainer.evaluate()

print('\nValidation results:')
for k, v in val_results.items():
    print(f'  {k}: {v:.4f}')


Validation results:
  eval_loss: 0.2648
  eval_accuracy: 0.9510
  eval_f1_macro: 0.9506
  eval_runtime: 133.2799
  eval_samples_per_second: 14.9910
  eval_steps_per_second: 0.9380
  epoch: 5.0000


## 11) Save final model

In [13]:
# Save model + tokenizer
FINAL_MODEL_DIR = OUTPUT_DIR / 'final_model'
trainer.save_model(str(FINAL_MODEL_DIR))
tokenizer.save_pretrained(str(FINAL_MODEL_DIR))

print(f'\n✅ Model saved to {FINAL_MODEL_DIR.resolve()}')


✅ Model saved to C:\Paper\Twitter proejct\Twitter Sentiment Analysis Dataset\Twitter Sentiment Analysis Dataset\model_output\final_model


## 12) Test on val set with classification report

In [1]:
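# Get predictions. This cell needs `trainer` and `val_ds` from the cells above in the same
# kernel session; after a kernel restart, rerun the notebook from the top first.
val_preds = trainer.predict(val_ds)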
val_pred_labels = np.argmax(val_preds.predictions, axis=-1)
val_true_labels = val_preds.label_ids

# Classification report
print('\nClassification report (validation set):')
print(classification_report(val_true_labels, val_pred_labels, digits=4))

NameError: name 'trainer' is not defined

---
## Next steps
- Notebook 04: Translate val/test EN→ES/FR + evaluate robustness
- Notebook 05: XAI + explanation consistency metrics