# Bengali NID Intent Classification with SmolLM2-135M

Fine-tune SmolLM2-135M on a Bengali NID customer service dataset for intent classification.

**Environment:** Google Colab Pro with L4 GPU (24GB)

| Component | Value |
|-----------|-------|
| Model | SmolLM2-135M (135M params) |
| Dataset | Bengali NID (407 intents, ~78k train, ~11k eval) |
| Method | Generative SFT + LoRA |
| Language | Bengali questions → English intent tags |

## 1. Install Dependencies

In [1]:
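# Quote every version specifier: an unquoted ">=" is treated by the shell as output redirection
!pip install -q "transformers>=4.46.0" "datasets>=3.0.0,<4.0.0" "trl>=0.12.0" "peft>=0.13.0" "accelerate>=1.0.0" "bitsandbytes>=0.44.0" scikit-learn pandas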

## 2. Check GPU & Mount Google Drive

In [2]:
import torch

# Check GPU
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"VRAM: {gpu_memory:.1f} GB")
else:
    print("No GPU available! Go to Runtime > Change runtime type > GPU")
    raise RuntimeError("GPU required")

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create output directory for model
import os
OUTPUT_DIR = "/content/drive/MyDrive/models/smollm2-bengali-nid-intent"
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"Model output directory: {OUTPUT_DIR}")

# ============================================================
# DATASET PATH - Upload your CSVs to this folder in Google Drive
# ============================================================
DATASET_DIR = "/content/drive/MyDrive"
os.makedirs(DATASET_DIR, exist_ok=True)

print(f"\n>>> Upload your CSV files to: {DATASET_DIR}")
print("    - sts_train.csv")
print("    - sts_eval.csv")
print("    - tag_answer.csv")
print("\nOr change DATASET_DIR to where your files are located.")

GPU: NVIDIA L4
VRAM: 23.8 GB
Mounted at /content/drive
Model output directory: /content/drive/MyDrive/models/smollm2-bengali-nid-intent

>>> Upload your CSV files to: /content/drive/MyDrive
    - sts_train.csv
    - sts_eval.csv
    - tag_answer.csv

Or change DATASET_DIR to where your files are located.


## 3. Verify Dataset Files

Make sure your CSV files are in Google Drive at the path shown above.

In [3]:
import os

# Check if files exist in Google Drive
train_path = f"{DATASET_DIR}/sts_train.csv"
eval_path = f"{DATASET_DIR}/sts_eval.csv"
tag_path = f"{DATASET_DIR}/tag_answer.csv"

files_status = {
    "sts_train.csv": os.path.exists(train_path),
    "sts_eval.csv": os.path.exists(eval_path),
    "tag_answer.csv": os.path.exists(tag_path),
}

print("Dataset files status:")
for fname, exists in files_status.items():
    status = "✓ Found" if exists else "✗ Missing"
    print(f"  {status}: {fname}")

if not all(files_status.values()):
    missing = [f for f, exists in files_status.items() if not exists]
    print(f"\n❌ Missing files: {missing}")
    print(f"Please upload them to: {DATASET_DIR}")
    raise FileNotFoundError(f"Missing dataset files in {DATASET_DIR}")
else:
    print(f"\n✓ All files found in {DATASET_DIR}")

Dataset files status:
  ✓ Found: sts_train.csv
  ✓ Found: sts_eval.csv
  ✓ Found: tag_answer.csv

✓ All files found in /content/drive/MyDrive


## 4. Load and Analyze Dataset

In [4]:
import pandas as pd
from collections import Counter

# Load CSV files from Google Drive
print("Loading dataset files from Google Drive...")
train_df = pd.read_csv(train_path)
eval_df = pd.read_csv(eval_path)
tag_answer_df = pd.read_csv(tag_path)

print(f"Train samples: {len(train_df)}")
print(f"Eval samples: {len(eval_df)}")
print(f"Unique tags in train: {train_df['tag'].nunique()}")
print(f"Unique tags in eval: {eval_df['tag'].nunique()}")
print(f"Tags with answers: {len(tag_answer_df)}")

# Show sample
print(f"\nSample from training data:")
print(f"  Question: {train_df.iloc[0]['question']}")
print(f"  Tag: {train_df.iloc[0]['tag']}")

Loading dataset files from Google Drive...
Train samples: 78616
Eval samples: 11457
Unique tags in train: 407
Unique tags in eval: 403
Tags with answers: 407

Sample from training data:
  Question: "একাউন্ট লক করা হয়েছে" দেখাচ্ছে, সমাধান কী?
  Tag: account_locked
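
The counts above show 407 unique tags in training but only 403 in evaluation, so four intents are never evaluated. A minimal sketch for listing such tags follows. The helper name `tag_coverage` is hypothetical, and the two short lists are made-up stand-ins for `train_df['tag']` and `eval_df['tag']`, not real splits of the dataset.

```python
def tag_coverage(train_tags, eval_tags):
    """Return (tags only in train, tags only in eval) as sorted lists."""
    train_set, eval_set = set(train_tags), set(eval_tags)
    return sorted(train_set - eval_set), sorted(eval_set - train_set)

# Hypothetical stand-ins for train_df['tag'] and eval_df['tag'].
train_tags = ["account_locked", "goodbye", "fraction", "goodbye"]
eval_tags = ["account_locked", "goodbye"]

only_train, only_eval = tag_coverage(train_tags, eval_tags)
print("Never evaluated:", only_train)
print("Unseen in training:", only_eval)
```

In the notebook the same call would take `train_df['tag']` and `eval_df['tag']` directly. Any tag that appears only in the eval split can never be predicted correctly, so it is worth confirming that the second list is empty.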


In [5]:
# Build intent labels from training data
INTENT_TAGS = sorted(train_df['tag'].unique().tolist())
print(f"Total unique intents: {len(INTENT_TAGS)}")

# Create mappings
ID2INTENT = {i: intent for i, intent in enumerate(INTENT_TAGS)}
INTENT2ID = {intent: i for i, intent in enumerate(INTENT_TAGS)}

# Show top 15 tags by frequency
print(f"\nTop 15 tags by frequency:")
tag_counts = train_df['tag'].value_counts()
for tag, count in tag_counts.head(15).items():
    print(f"  {tag}: {count}")

Total unique intents: 407

Top 15 tags by frequency:
  fraction: 494
  permanent_address_change_fees: 381
  spouse_name_correction_new: 231
  parent_spouse_name_correct_or_add_document_new: 229
  parents_name_correction_new: 226
  goodbye: 218
  picture_done_but_lost_or_no_sms_slip: 215
  service_provided: 213
  disability_no_hands_registration_procedure: 206
  abroad_smart_card_collection_return: 206
  reissue_urgent_card_delivery_time: 206
  signature_to_fingerprint_reversal_not_allowed: 206
  reissue_smart_card_download_not_available: 206
  abroad_illegal_resident_nid: 206
  abroad_embassy_walk_in_registration: 206
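
For context when reading the accuracy figures later on, the numbers printed above are enough to work out two trivial baselines. This is plain arithmetic on the reported counts; the majority-class share is measured on the training split and is only an approximation of what always predicting `fraction` would score on the eval split.

```python
n_train = 78616      # train samples reported above
n_intents = 407      # unique intents reported above
largest_class = 494  # count of the most frequent tag ("fraction")

mean_per_intent = n_train / n_intents
chance_rate = 1 / n_intents              # uniform random guessing
majority_share = largest_class / n_train  # always predicting the top tag

print(f"Mean samples per intent: {mean_per_intent:.1f}")
print(f"Uniform-chance accuracy: {chance_rate:.2%}")
print(f"Majority-class share of train: {majority_share:.2%}")
```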


## 5. Configuration

In [6]:
# ============================================================
# CONFIGURATION
# ============================================================

# Model
MODEL_NAME = "HuggingFaceTB/SmolLM2-135M"

# Data
MAX_SEQ_LENGTH = 512  # Bengali text can be longer

# Training (optimized for L4 24GB with larger dataset)
NUM_EPOCHS = 2        # Reduced for faster iteration
BATCH_SIZE = 16       # L4 can handle this
EVAL_BATCH_SIZE = 32  # Larger batch for faster evaluation
GRAD_ACCUM_STEPS = 4  # Effective batch = 64
LEARNING_RATE = 1e-4  # Slightly lower for stability
WARMUP_RATIO = 0.05   # Less warmup with more data
EARLY_STOPPING_PATIENCE = 3  # Stop if no improvement for 3 evals

# LoRA (higher rank for 407 classes)
LORA_R = 64           # Higher rank for more classes
LORA_ALPHA = 128      # 2x rank
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# Seed
SEED = 42

# Instruction template (Bengali-aware)
INSTRUCTION_TEMPLATE = "Classify the intent of this Bengali customer query: {text}"

print(f"Model: {MODEL_NAME}")
print(f"Number of intents: {len(INTENT_TAGS)}")
print(f"Epochs: {NUM_EPOCHS} (with early stopping patience={EARLY_STOPPING_PATIENCE})")
print(f"Batch size: {BATCH_SIZE} x {GRAD_ACCUM_STEPS} = {BATCH_SIZE * GRAD_ACCUM_STEPS} effective")
print(f"Eval batch size: {EVAL_BATCH_SIZE}")
print(f"LoRA rank: {LORA_R}")

Model: HuggingFaceTB/SmolLM2-135M
Number of intents: 407
Epochs: 2 (with early stopping patience=3)
Batch size: 16 x 4 = 64 effective
Eval batch size: 32
LoRA rank: 64


## 6. Prepare Dataset for SFT

In [7]:
from datasets import Dataset

def format_for_sft(row):
    """
    Format example for generative SFT.

    Input: Bengali question + tag
    Output: Formatted instruction-response pair
    """
    instruction = INSTRUCTION_TEMPLATE.format(text=row['question'])
    formatted = f"User: {instruction}\nAssistant: {row['tag']}"
    return {'text': formatted, 'intent': row['tag']}

# Convert DataFrames to formatted datasets
print("Formatting training data...")
train_formatted = [format_for_sft(row) for _, row in train_df.iterrows()]
train_dataset = Dataset.from_list(train_formatted)

print("Formatting evaluation data...")
eval_formatted = [format_for_sft(row) for _, row in eval_df.iterrows()]
eval_dataset = Dataset.from_list(eval_formatted)

print(f"\nTrain dataset: {len(train_dataset)} samples")
print(f"Eval dataset: {len(eval_dataset)} samples")

# Show formatted sample
print(f"\nFormatted sample:")
print(train_dataset[0]['text'])

Formatting training data...
Formatting evaluation data...

Train dataset: 78616 samples
Eval dataset: 11457 samples

Formatted sample:
User: Classify the intent of this Bengali customer query: "একাউন্ট লক করা হয়েছে" দেখাচ্ছে, সমাধান কী?
Assistant: account_locked
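
Because the model is trained on plain text, the prompt used at inference has to be an exact prefix of the training text. The sketch below makes that relationship explicit; `build_prompt` and `build_training_text` are hypothetical helper names, not functions from the notebook, and the template string is copied from the configuration cell.

```python
INSTRUCTION_TEMPLATE = "Classify the intent of this Bengali customer query: {text}"

def build_prompt(question):
    """Inference-time prompt: everything up to and including 'Assistant:'."""
    return f"User: {INSTRUCTION_TEMPLATE.format(text=question)}\nAssistant:"

def build_training_text(question, tag):
    """Training text: the prompt followed by a space and the target tag."""
    return f"{build_prompt(question)} {tag}"

question = '"একাউন্ট লক করা হয়েছে" দেখাচ্ছে, সমাধান কী?'
prompt = build_prompt(question)
text = build_training_text(question, "account_locked")

# The part the model is expected to generate, including the leading space.
completion = text[len(prompt):]
print(repr(completion))
```

The leading space in the completion is why the extraction helpers strip the response before matching it against the tag list.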


## 7. Load Model and Apply LoRA

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Load tokenizer
print(f"Loading tokenizer: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# IMPORTANT: Use left-padding for decoder-only models during batched generation
tokenizer.padding_side = 'left'

# Load model
print(f"Loading model: {MODEL_NAME}")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

print(f"Model loaded. Parameters: {model.num_parameters():,}")

Loading tokenizer: HuggingFaceTB/SmolLM2-135M


Loading model: HuggingFaceTB/SmolLM2-135M


Model loaded. Parameters: 134,515,008


In [9]:
# Configure LoRA with higher rank for 407 classes
print("Configuring LoRA...")
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=LORA_TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Configuring LoRA...
trainable params: 19,537,920 || all params: 154,052,928 || trainable%: 12.6826
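
The trainable-parameter count can be cross-checked by hand. LoRA adds two matrices per targeted linear layer, `A` (r × in) and `B` (out × r), which is `r * (in + out)` weights. The layer dimensions below are assumptions taken from the published SmolLM2-135M config rather than from this notebook, so verify them against `model.config` before relying on them.

```python
# Assumed SmolLM2-135M dimensions (not stated in the notebook).
HIDDEN = 576          # hidden_size
INTERMEDIATE = 1536   # intermediate_size
KV_DIM = 3 * 64       # num_key_value_heads * head_dim
LAYERS = 30           # num_hidden_layers
R = 64                # LORA_R from the configuration cell

# (in_features, out_features) for each targeted linear layer in one block.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# r * (in + out) adapter weights per layer, summed over one block.
per_block = sum(R * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_block * LAYERS
base_params = 134_515_008  # printed when the model was loaded

print(f"LoRA params: {total:,}")
print(f"Trainable share: {total / (base_params + total):.4%}")
```

Under these assumed dimensions the total agrees with the `trainable params` figure printed above, which also shows that rank 64 on all seven projections is a fairly large adapter relative to a 135M-parameter base model.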


## 8. Train

In [10]:
from trl import SFTTrainer, SFTConfig
from transformers import set_seed, TrainerCallback, EarlyStoppingCallback
from sklearn.metrics import accuracy_score

# Set seed
set_seed(SEED)

# ============================================================
# CUSTOM CALLBACK: Intent Accuracy + Early Stopping
# ============================================================

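def extract_intent_fast(response, intent_tags):
    """Extract intent from model response (fast version)."""
    response = response.strip().lower()
    # Prefer the longest matching tag: many tags are prefixes of longer tags
    # (e.g. account_locked vs. account_locked_unlock_request), so returning
    # the first substring hit would misreport the longer tag as the shorter one.
    matches = [intent for intent in intent_tags if intent.lower() in response]
    return max(matches, key=len) if matches else None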

class IntentAccuracyCallback(TrainerCallback):
    """Callback to compute intent accuracy during training and enable dual-criteria early stopping."""

    def __init__(self, eval_df, tokenizer, intent_tags, instruction_template,
                 max_seq_length, sample_size=500, batch_size=32, patience=3):
        self.eval_sample = eval_df.sample(n=min(sample_size, len(eval_df)), random_state=42).reset_index(drop=True)
        self.tokenizer = tokenizer
        self.intent_tags = intent_tags
        self.instruction_template = instruction_template
        self.max_seq_length = max_seq_length
        self.batch_size = batch_size
        self.patience = patience
        self.best_accuracy = 0.0
        self.no_improve_count = 0
        self.history = []

    def _compute_accuracy_batched(self, model):
        """Compute intent accuracy on sample using batched inference."""
        model.eval()
        predictions = []
        true_labels = self.eval_sample['tag'].tolist()

        for i in range(0, len(self.eval_sample), self.batch_size):
            batch_df = self.eval_sample.iloc[i:i+self.batch_size]
            prompts = [f"User: {self.instruction_template.format(text=q)}\nAssistant:"
                       for q in batch_df['question']]

            inputs = self.tokenizer(prompts, return_tensors="pt", padding=True,
                                   truncation=True, max_length=self.max_seq_length)
            inputs = {k: v.to(model.device) for k, v in inputs.items()}

            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=32,
                    do_sample=False,
                    pad_token_id=self.tokenizer.pad_token_id,
                )

            input_len = inputs['input_ids'].shape[1]
            for idx, output in enumerate(outputs):
                response = self.tokenizer.decode(output[input_len:], skip_special_tokens=True)
                # Store raw response for first batch (debugging)
                if i == 0 and idx < 3:
                    if not hasattr(self, '_debug_responses'):
                        self._debug_responses = []
                    self._debug_responses.append(response[:100])  # First 100 chars
                predictions.append(extract_intent_fast(response, self.intent_tags))

        # Debug: Print sample of what model is generating
        print(f"\n    [DEBUG] Sample raw outputs (first 3):")
        if hasattr(self, '_debug_responses'):
            for resp in self._debug_responses[:3]:
                print(f"      Raw: '{resp}'")
            self._debug_responses = []  # Clear for next eval
        print(f"    [DEBUG] Extracted intents vs true (first 3):")
        for i, (pred, true) in enumerate(zip(predictions[:3], true_labels[:3])):
            print(f"      True: {true} | Predicted: {pred}")

        # Compute accuracy (only for valid predictions)
        valid_pairs = [(p, t) for p, t in zip(predictions, true_labels) if p is not None]
        num_none = sum(1 for p in predictions if p is None)
        print(f"    [DEBUG] Valid predictions: {len(valid_pairs)}/{len(predictions)} ({num_none} returned None)")

        if len(valid_pairs) == 0:
            return 0.0
        valid_preds, valid_true = zip(*valid_pairs)
        return accuracy_score(valid_true, valid_preds)

    def on_evaluate(self, args, state, control, model, **kwargs):
        # Compute intent accuracy
        accuracy = self._compute_accuracy_batched(model)
        self.history.append({'step': state.global_step, 'intent_accuracy': accuracy})

        # Print results
        print(f"\n>>> Intent Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")

        # Check for improvement
        if accuracy > self.best_accuracy + 0.001:  # Small threshold to avoid noise
            # Report the old best before overwriting it
            print(f"    [NEW BEST] Previous best: {self.best_accuracy:.4f}")
            self.best_accuracy = accuracy
            self.no_improve_count = 0
        else:
            self.no_improve_count += 1
            print(f"    No improvement for {self.no_improve_count}/{self.patience} evals")

        # Early stopping on intent accuracy
        if self.no_improve_count >= self.patience:
            print(f"\n*** EARLY STOPPING: Intent accuracy hasn't improved for {self.patience} evals ***")
            control.should_training_stop = True

        return control

# ============================================================
# TRAINING CONFIGURATION
# ============================================================

training_args = SFTConfig(
    output_dir=OUTPUT_DIR,

    # Training schedule
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,  # Use larger eval batch
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,

    # Optimizer
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    warmup_ratio=WARMUP_RATIO,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",

    # Mixed precision
    bf16=True,

    # Logging & saving (more frequent evals)
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=250,  # More frequent evaluation
    save_strategy="steps",
    save_steps=500,  # Save more frequently
    save_total_limit=3,

    # Data
    max_length=MAX_SEQ_LENGTH,
    packing=False,

    # Other
    seed=SEED,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Create callbacks
intent_callback = IntentAccuracyCallback(
    eval_df=eval_df,
    tokenizer=tokenizer,
    intent_tags=INTENT_TAGS,
    instruction_template=INSTRUCTION_TEMPLATE,
    max_seq_length=MAX_SEQ_LENGTH,
    sample_size=500,  # Evaluate on 500 samples for speed
    batch_size=EVAL_BATCH_SIZE,
    patience=EARLY_STOPPING_PATIENCE,
)

# Early stopping on validation loss
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=EARLY_STOPPING_PATIENCE,
    early_stopping_threshold=0.001,
)

# Initialize trainer with callbacks
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_args,
    callbacks=[early_stopping_callback, intent_callback],
)

print("Trainer initialized with early stopping!")
print(f"Total training steps: {len(train_dataset) // (BATCH_SIZE * GRAD_ACCUM_STEPS) * NUM_EPOCHS}")
print(f"Eval every {training_args.eval_steps} steps")
print(f"Early stopping patience: {EARLY_STOPPING_PATIENCE} (on both val_loss and intent_accuracy)")

Trainer initialized with early stopping!
Total training steps: 2456
Eval every 250 steps
Early stopping patience: 3 (on both val_loss and intent_accuracy)
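
The step count printed above comes from floor division, while the final checkpoint saved later in this notebook is named `checkpoint-2458`. One explanation, assuming the installed trainer counts the last partial accumulation window of each epoch as a full optimizer step (an assumption about library behaviour, not something the log states), is sketched here:

```python
import math

n_train = 78616   # training samples
batch_size = 16   # BATCH_SIZE
grad_accum = 4    # GRAD_ACCUM_STEPS
epochs = 2        # NUM_EPOCHS

# Formula used in the cell above: drops the partial window at the end of each epoch.
floor_steps = n_train // (batch_size * grad_accum) * epochs

# Alternative: round batches up, then round accumulation windows up.
batches_per_epoch = math.ceil(n_train / batch_size)
ceil_steps = math.ceil(batches_per_epoch / grad_accum) * epochs

print(f"Floor estimate: {floor_steps}")
print(f"Ceil estimate:  {ceil_steps}")
```

The two-step gap is harmless for training, but it matters when choosing `eval_steps` or `save_steps` values that are meant to land exactly on the last step.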


In [11]:
# Train!
print("Starting training...")
print(f"Dataset: ~78k Bengali NID queries, 407 intents")
print(f"Epochs: {NUM_EPOCHS} (with early stopping)")
print(f"Early stopping patience: {EARLY_STOPPING_PATIENCE} evals")
print("")
print("During training you will see:")
print("  - Training loss (token generation)")
print("  - Validation loss (token generation)")
print("  - Intent Accuracy (tag detection on 500 samples)")
print("-" * 50)

trainer.train()

print("-" * 50)
print("Training complete!")
print(f"Best intent accuracy during training: {intent_callback.best_accuracy:.4f}")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 0}.


Starting training...
Dataset: ~78k Bengali NID queries, 407 intents
Epochs: 2 (with early stopping)
Early stopping patience: 3 evals

During training you will see:
  - Training loss (token generation)
  - Validation loss (token generation)
  - Intent Accuracy (tag detection on 500 samples)
--------------------------------------------------


| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|---------------|-----------------|---------|------------|---------------------|
| 250 | 0.7425 | 0.715608 | 0.752282 | 2068389.0 | 0.792176 |
| 500 | 0.552 | 0.544215 | 0.607909 | 4134111.0 | 0.832579 |
| 750 | 0.5001 | 0.504426 | 0.567595 | 6200752.0 | 0.843166 |
| 1000 | 0.4754 | 0.48245 | 0.531595 | 8264813.0 | 0.848744 |
| 1250 | 0.4559 | 0.464831 | 0.512461 | 10325135.0 | 0.853709 |
| 1500 | 0.4406 | 0.452314 | 0.493152 | 12388406.0 | 0.857255 |
| 1750 | 0.434 | 0.443092 | 0.494231 | 14454658.0 | 0.859974 |
| 2000 | 0.4245 | 0.43706 | 0.482034 | 16523306.0 | 0.861849 |
| 2250 | 0.422 | 0.434151 | 0.481922 | 18590232.0 | 0.862965 |



    [DEBUG] Sample raw outputs (first 3):
      Raw: ' smart_card_chip_chip_chip_type'
      Raw: ' address_change_nid_change_time_fees'
      Raw: ' address_change_smart_card_availability'
    [DEBUG] Extracted intents vs true (first 3):
      True: smart_card_operating_system_functionality | Predicted: None
      True: voter_area_change_ongoing | Predicted: None
      True: nid_adjudication_pending | Predicted: address_change_smart_card_availability
    [DEBUG] Valid predictions: 83/500 (417 returned None)

>>> Intent Accuracy: 0.0000 (0.00%)
    No improvement for 1/3 evals

    [DEBUG] Sample raw outputs (first 3):
      Raw: ' smart_card_chip_damage_replacement'
      Raw: ' abroad_remittance_warrior_label_request'
      Raw: ' nid_information_misentry_correction_procedure'
    [DEBUG] Extracted intents vs true (first 3):
      True: smart_card_operating_system_functionality | Predicted: smart_card_chip_damage_replacement
      True: voter_area_change_ongoing | Predicted: abroad_
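
The callback scores accuracy only over outputs that parsed to a known tag, so a run where most outputs fail to parse (83 of 500 parsed in the first eval above) can look better or worse than the model really is. A stricter figure counts unparseable outputs as errors. The sketch below reports both; `accuracy_report` is a hypothetical helper, and the toy predictions are made up for illustration.

```python
def accuracy_report(predictions, true_labels):
    """Return (valid_only_accuracy, end_to_end_accuracy, parse_rate)."""
    total = len(true_labels)
    # Keep only pairs where extraction produced a known tag.
    valid = [(p, t) for p, t in zip(predictions, true_labels) if p is not None]
    correct = sum(1 for p, t in valid if p == t)
    valid_only = correct / len(valid) if valid else 0.0
    end_to_end = correct / total if total else 0.0  # None counts as wrong
    parse_rate = len(valid) / total if total else 0.0
    return valid_only, end_to_end, parse_rate

# Toy example: two of four outputs parse, and one of those is correct.
preds = ["account_locked", None, "goodbye", None]
truth = ["account_locked", "goodbye", "fraction", "fraction"]

valid_only, end_to_end, parse_rate = accuracy_report(preds, truth)
print(f"valid-only={valid_only:.2f} end-to-end={end_to_end:.2f} parse-rate={parse_rate:.2f}")
```

Tracking the end-to-end figure for early stopping would avoid rewarding a checkpoint that simply produces fewer parseable outputs.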

## 9. Save Model

In [12]:
# Save model to Google Drive
print(f"Saving model to {OUTPUT_DIR}...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

# Also save intent mappings
import json
with open(f"{OUTPUT_DIR}/intent_mappings.json", "w", encoding="utf-8") as f:
    json.dump({"id2intent": ID2INTENT, "intent2id": INTENT2ID}, f, ensure_ascii=False, indent=2)

print(f"Model saved successfully!")
print(f"Files in {OUTPUT_DIR}:")
!ls -la {OUTPUT_DIR}

Saving model to /content/drive/MyDrive/models/smollm2-bengali-nid-intent...
Model saved successfully!
Files in /content/drive/MyDrive/models/smollm2-bengali-nid-intent:
total 81124
-rw------- 1 root root     1058 Dec 22 11:05 adapter_config.json
-rw------- 1 root root 78207176 Dec 22 11:05 adapter_model.safetensors
-rw------- 1 root root      707 Dec 20 21:20 added_tokens.json
-rw------- 1 root root     4168 Dec 20 21:20 chat_template.jinja
drwx------ 2 root root     4096 Dec 22 10:33 checkpoint-1500
drwx------ 2 root root     4096 Dec 22 10:51 checkpoint-2000
drwx------ 2 root root     4096 Dec 22 11:05 checkpoint-2458
-rw------- 1 root root    38936 Dec 22 11:05 intent_mappings.json
-rw------- 1 root root   466391 Dec 22 11:05 merges.txt
-rw------- 1 root root     1559 Dec 22 11:05 README.md
-rw------- 1 root root      863 Dec 22 11:05 special_tokens_map.json
-rw------- 1 root root     3720 Dec 22 11:05 tokenizer_config.json
-rw------- 1 root root  3522916 Dec 22 11:05 tokenizer.json

## 10. Evaluate

In [13]:
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from collections import Counter

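def extract_intent(response, intent_tags):
    """Extract intent from model response."""
    response = response.strip().lower()

    # Try exact match first
    for intent in intent_tags:
        if intent.lower() == response:
            return intent

    # Fall back to the longest tag contained in the response, so a short tag
    # (account_locked) cannot shadow a longer one (account_locked_unlock_request)
    matches = [intent for intent in intent_tags if intent.lower() in response]
    return max(matches, key=len) if matches else None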

def evaluate_model_batched(model, tokenizer, eval_df, batch_size=EVAL_BATCH_SIZE, num_samples=None):
    """Fast batched evaluation with intent accuracy."""
    model.eval()

    if num_samples:
        eval_df = eval_df.sample(n=min(num_samples, len(eval_df)), random_state=42).reset_index(drop=True)

    predictions = []
    true_labels = []

    # Process in batches for speed
    num_batches = (len(eval_df) + batch_size - 1) // batch_size
    for i in tqdm(range(0, len(eval_df), batch_size), total=num_batches, desc="Evaluating (batched)"):
        batch_df = eval_df.iloc[i:i+batch_size]
        prompts = [f"User: {INSTRUCTION_TEMPLATE.format(text=q)}\nAssistant:"
                   for q in batch_df['question']]

        # Tokenize batch
        inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                          truncation=True, max_length=MAX_SEQ_LENGTH)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        # Generate for entire batch
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=32,  # Reduced - intent tags are short
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
            )

        # Decode and extract intents for each sample in batch
        input_len = inputs['input_ids'].shape[1]
        for j, output in enumerate(outputs):
            response = tokenizer.decode(output[input_len:], skip_special_tokens=True)
            predictions.append(extract_intent(response, INTENT_TAGS))

        true_labels.extend(batch_df['tag'].tolist())

    return predictions, true_labels

# Evaluate on subset first (batched is much faster!)
print("Evaluating on 2000 samples (batched for speed)...")
print(f"Using batch size: {EVAL_BATCH_SIZE}")
predictions, true_labels = evaluate_model_batched(model, tokenizer, eval_df, num_samples=2000)

Evaluating on 2000 samples (batched for speed)...
Using batch size: 32



In [14]:
# Compute metrics
valid_mask = [p is not None for p in predictions]
valid_preds = [INTENT2ID.get(p, -1) for p in predictions]
valid_true = [INTENT2ID.get(t, -1) for t in true_labels]

# Filter valid
filtered_preds = [p for p, m in zip(valid_preds, valid_mask) if m and p != -1]
filtered_true = [t for t, m, p in zip(valid_true, valid_mask, valid_preds) if m and p != -1]

# Calculate metrics
if len(filtered_preds) > 0:
    accuracy = accuracy_score(filtered_true, filtered_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        filtered_true, filtered_preds, average="weighted", zero_division=0
    )
else:
    accuracy = precision = recall = f1 = 0.0

print("=" * 50)
print("EVALUATION RESULTS")
print("=" * 50)
print(f"Total samples: {len(predictions)}")
print(f"Valid predictions: {sum(valid_mask)} ({100*sum(valid_mask)/len(predictions):.1f}%)")
print(f"")
print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print("=" * 50)

# Show top confusions
print("\nTop 10 Confusions:")
confusions = [(t, p) for p, t in zip(predictions, true_labels) if p != t and p is not None]
for (true, pred), count in Counter(confusions).most_common(10):
    print(f"  {true} -> {pred}: {count}")

EVALUATION RESULTS
Total samples: 2000
Valid predictions: 1992 (99.6%)

Accuracy:  0.3384 (33.84%)
Precision: 0.3917
Recall:    0.3384
F1 Score:  0.3327

Top 10 Confusions:
  account_locked_unlock_request -> account_locked: 9
  online_apply -> online_portal_registration: 7
  account_locked_retrials -> account_locked: 7
  nid_wallet_download_iphone -> nid_wallet_download: 6
  card_correction_fees_payment_reference_nid_number_query -> card_correction_fees: 5
  smart_card_collect_by_other -> smart_card_chip_damage_replacement: 5
  voter_area_change_online -> voter_area_change: 5
  eight_hours_time_up -> account_locked: 5
  nid_wallet_download_alternative -> nid_wallet_download: 5
  nid_adjudication_pending -> new_voter_application_not_submitted_to_upazila_or_thana_office: 4
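
Several of the confusions above pair a true tag with a prediction that is a substring of it, such as `account_locked_unlock_request -> account_locked`. If the extraction step returns the first tag it finds inside the response, a correct generation of the longer tag would be scored as the shorter one, so some of these rows may be extraction artifacts rather than model errors. A small diagnostic sketch for finding such tags follows; `shadowed_tags` is a hypothetical helper, and the tag list is a handful of names copied from the confusion list rather than the full set of 407.

```python
def shadowed_tags(tags):
    """Map each tag to the other tags that occur inside it as substrings."""
    shadows = {}
    for tag in tags:
        inner = sorted(other for other in tags if other != tag and other in tag)
        if inner:
            shadows[tag] = inner
    return shadows

# A few tags copied from the confusion list above.
tags = [
    "account_locked",
    "account_locked_unlock_request",
    "account_locked_retrials",
    "nid_wallet_download",
    "nid_wallet_download_iphone",
    "voter_area_change",
    "voter_area_change_online",
]

result = shadowed_tags(tags)
for tag, inner in result.items():
    print(f"{tag} contains {inner}")
```

Running this over the full `INTENT_TAGS` list would show how many intents are exposed to this ambiguity and whether renaming tags or matching on the longest candidate is worth the effort.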


## 11. Interactive Inference

In [15]:
def classify_intent(query, model, tokenizer):
    """Classify intent for a single Bengali query."""
    prompt = f"User: {INSTRUCTION_TEMPLATE.format(text=query)}\nAssistant:"

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LENGTH)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=32,  # Reduced - intent tags are short
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
        )

    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    intent = extract_intent(response, INTENT_TAGS)

    return intent, response

def classify_intent_batch(queries, model, tokenizer, batch_size=16):
    """Classify intents for multiple Bengali queries at once (faster)."""
    model.eval()
    results = []

    for i in range(0, len(queries), batch_size):
        batch = queries[i:i+batch_size]
        prompts = [f"User: {INSTRUCTION_TEMPLATE.format(text=q)}\nAssistant:" for q in batch]

        inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                          truncation=True, max_length=MAX_SEQ_LENGTH)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=32,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
            )

        input_len = inputs['input_ids'].shape[1]
        for output in outputs:
            response = tokenizer.decode(output[input_len:], skip_special_tokens=True)
            intent = extract_intent(response, INTENT_TAGS)
            results.append((intent, response))

    return results

# Get answer for intent from tag_answer_df
def get_answer(intent):
    """Get Bengali answer for an intent."""
    row = tag_answer_df[tag_answer_df['tag'] == intent]
    if len(row) > 0:
        return row.iloc[0]['answer']
    return "উত্তর পাওয়া যায়নি।"

# Test with sample Bengali queries
test_queries = [
    "আমার এনআইডি একাউন্ট লক হয়ে গেছে, কিভাবে আনলক করবো?",
    "কার্ড হারিয়ে গেলে কি করতে হবে?",
    "জাতীয় পরিচয়পত্রে নাম সংশোধন করতে চাই",
    "ভোটার আইডি কার্ডের ঠিকানা পরিবর্তন করতে কি কি লাগবে?",
    "স্মার্ট কার্ড কবে পাবো?",
]

# Use batched inference for test queries
print("Testing with Bengali queries (batched inference):")
print("=" * 70)
batch_results = classify_intent_batch(test_queries, model, tokenizer, batch_size=len(test_queries))
for query, (intent, raw) in zip(test_queries, batch_results):
    answer = get_answer(intent) if intent else "Intent not recognized"
    print(f"Query: {query}")
    print(f"Intent: {intent}")
    print(f"Answer: {answer[:100]}..." if len(answer) > 100 else f"Answer: {answer}")
    print("-" * 70)

Testing with Bengali queries (batched inference):
Query: আমার এনআইডি একাউন্ট লক হয়ে গেছে, কিভাবে আনলক করবো?
Intent: account_locked
Answer: ভুল তথ্য দিয়ে রেজিস্ট্রেশন বা লগইনের চেষ্টা করা হলে স্বয়ংক্রিয়ভাবে একাউন্টটি লক হয়ে যায়। অনুগ্রহপূর্...
----------------------------------------------------------------------
Query: কার্ড হারিয়ে গেলে কি করতে হবে?
Intent: card_lost_and_damaged_cost
Answer: হ্যাঁ, নতুন এনআইডি কার্ড পেতে নির্ধারিত ফি জমা দিয়ে অনলাইনে আবেদন করুন।
----------------------------------------------------------------------
Query: জাতীয় পরিচয়পত্রে নাম সংশোধন করতে চাই
Intent: correction_application_current_status
Answer: আপনার এই প্রশ্নের উত্তরটি জানতে ৯ বাটন চেপে কল সেন্টার প্রতিনিধির সঙ্গে সরাসরি কথা বলুন।
----------------------------------------------------------------------
Query: ভোটার আইডি কার্ডের ঠিকানা পরিবর্তন করতে কি কি লাগবে?
Intent: address_change_online_new
Answer: হ্যাঁ, করা যাবে। তবে ঠিকানা পরিবর্তনের কারণে যদি ভোটার এলাকা পরিবর্তন হয় সেক্ষেত্রে সংশ্লিষ্ট উপজে

## 12. Push to HuggingFace Hub

In [16]:
from huggingface_hub import login
login()


In [17]:
# Push model to Hub
HF_REPO_NAME = "ehzawad/smollm2-bengali-nid-intent"

print(f"Pushing model to HuggingFace Hub: {HF_REPO_NAME}")

model.push_to_hub(HF_REPO_NAME)
tokenizer.push_to_hub(HF_REPO_NAME)

print(f"\nModel uploaded successfully!")
print(f"View at: https://huggingface.co/{HF_REPO_NAME}")

Pushing model to HuggingFace Hub: ehzawad/smollm2-bengali-nid-intent


HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-6949268e-5278cad0499557f47c9cbcc4;b1bd68ef-985e-4c79-8a02-d72d19566399)

Invalid username or password.

## Done!

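Your Bengali NID intent classification model is saved to Google Drive at `/content/drive/MyDrive/models/smollm2-bengali-nid-intent`.

The push to the Hugging Face Hub in cell 17 failed with `401 Unauthorized`, so the model is not on the Hub yet. Log in with a token that has write access and re-run that cell.

To load the model later (after a successful push; until then, pass the Google Drive path above in place of the Hub repo name):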
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model = PeftModel.from_pretrained(base_model, "ehzawad/smollm2-bengali-nid-intent")
tokenizer = AutoTokenizer.from_pretrained("ehzawad/smollm2-bengali-nid-intent")
```