# BESSTIE Benchmark Fine-Tuning with Mistral-7B-Instruct

This notebook replicates the BESSTIE benchmark fine-tuning setup using Mistral-7B-Instruct-v0.2 with QLoRA (4-bit quantization plus LoRA adapters).

## Task Overview:
- Fine-tune Mistral-7B-Instruct for binary classification across three English varieties: en-AU, en-IN, en-UK
- Handle three task subsets: google-sentiment, reddit-sentiment, reddit-sarcasm
- Train on each variety and evaluate on all three varieties (cross-variety loop)
- Generate one 3x3 heatmap of macro-averaged F1 scores per task

## Section 1: Setup and Imports

In [None]:
# Install required packages (run once); includes the plotting and data libraries imported below
# !pip install transformers peft bitsandbytes datasets scikit-learn accelerate torch pandas matplotlib seaborn tqdm

In [None]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
from tqdm.auto import tqdm

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset, Dataset
from sklearn.metrics import precision_recall_fscore_support, f1_score
from sklearn.model_selection import train_test_split

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set random seeds
SEED = 50
torch.manual_seed(SEED)
np.random.seed(SEED)

## Section 2: Configuration and Constants

In [None]:
# Model configuration
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # Swap in another Mistral instruct checkpoint if needed
OUTPUT_DIR = "./mistral_besstie_outputs"
RESULTS_DIR = "./mistral_besstie_results"

# Create directories
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(RESULTS_DIR, exist_ok=True)

# Varieties and tasks
VARIETIES = ["en-AU", "en-IN", "en-UK"]
TASKS = ["google-sentiment", "reddit-sentiment", "reddit-sarcasm"]

# Training hyperparameters
MAX_EPOCHS = 30
LEARNING_RATE = 2e-4
BATCH_SIZE = 16
EARLY_STOPPING_PATIENCE = 3
EARLY_STOPPING_THRESHOLD = 0.1  # 10% improvement threshold

# LoRA hyperparameters
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.1

# Prompts
SENTIMENT_PROMPT = "Generate the sentiment of the given text. 1 for positive sentiment, and 0 for negative sentiment. Do not give an explanation."
SARCASM_PROMPT = "Predict if the given text is sarcastic. 1 if the text is sarcastic, and 0 if the text is not sarcastic. Do not give an explanation."

print("Configuration loaded successfully!")

## Section 3: Data Loading and Preparation

In [None]:
# Load BESSTIE dataset
dataset = load_dataset("unswnlporg/BESSTIE")

print("Dataset loaded successfully!")
print(f"Train samples: {len(dataset['train'])}")
print(f"Test samples: {len(dataset['test'])}")
print(f"\nDataset features: {dataset['train'].features}")
print(f"\nSample entry: {dataset['train'][0]}")

In [None]:
def create_stratified_splits(dataset, variety, task, val_size=0.1):
    """
    Create stratified train/validation splits for a specific variety and task.
    
    Args:
        dataset: HuggingFace dataset
        variety: One of ["en-AU", "en-IN", "en-UK"]
        task: One of ["google-sentiment", "reddit-sentiment", "reddit-sarcasm"]
        val_size: Validation split size (default: 0.1 for 10%)
    
    Returns:
        train_data, val_data, test_data
    """
    # Filter by variety and task
    train_df = dataset['train'].to_pandas()
    test_df = dataset['test'].to_pandas()
    
    # Map task names
    task_mapping = {
        "google-sentiment": "Sentiment",
        "reddit-sentiment": "Sentiment",
        "reddit-sarcasm": "Sarcasm"
    }
    
    # Determine source filter
    if "google" in task:
        source_filter = "Google"
    else:
        source_filter = "Reddit"
    
    task_type = task_mapping[task]
    
    # Filter training data
    train_filtered = train_df[
        (train_df['variety'] == variety) & 
        (train_df['task'] == task_type) &
        (train_df['source'] == source_filter)
    ].copy()
    
    # Filter test data
    test_filtered = test_df[
        (test_df['variety'] == variety) & 
        (test_df['task'] == task_type) &
        (test_df['source'] == source_filter)
    ].copy()
    
    # Create stratified train/val split
    if len(train_filtered) > 0:
        train_data, val_data = train_test_split(
            train_filtered,
            test_size=val_size,
            stratify=train_filtered['label'],  # preserve the class ratio in both splits
            random_state=SEED
        )
    else:
        train_data = train_filtered
        val_data = pd.DataFrame()
    
    return train_data, val_data, test_filtered

print("Data splitting function defined!")
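
To make the effect of stratification concrete, here is a minimal pure-Python sketch of the idea. It is not the scikit-learn implementation; the helper name `stratified_split` and the toy labels are assumptions made for illustration. It splits each class separately so that the validation set keeps the same class ratio as the full data.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_size=0.1, seed=50):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_size)  # per-class validation count
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy data (hypothetical): 80 negative (0) and 20 positive (1) examples
toy_labels = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split(toy_labels, val_size=0.1)

# Share of positive examples in the validation split
val_positive_share = sum(toy_labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_positive_share)
```

With a 20% positive rate in the toy data, the validation split also ends up 20% positive, which is the property `train_test_split(..., stratify=...)` is meant to provide.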

## Section 4: Tokenizer and Model Setup with QLoRA

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("Tokenizer loaded successfully!")

In [None]:
def get_quantized_model():
    """
    Load model with 4-bit quantization (QLoRA configuration).
    """
    # BitsAndBytes configuration for 4-bit quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,  # Double quantization
        bnb_4bit_quant_type="nf4",  # NF4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16  # bfloat16 compute
    )
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )
    
    # Prepare model for k-bit training
    model = prepare_model_for_kbit_training(model)
    
    return model

print("Model loading function defined!")

In [None]:
def add_lora_adapters(model):
    """
    Add LoRA adapters to all linear layers.
    Target layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    """
    lora_config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        lora_dropout=LORA_DROPOUT,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    
    return model

print("LoRA adapter function defined!")
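
As a rough sanity check on what `print_trainable_parameters()` reports, the number of LoRA parameters can be estimated by hand: each adapted linear layer gains two low-rank matrices with `r * (in_features + out_features)` weights in total. The sketch below assumes the layer shapes of the public Mistral-7B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not defined anywhere in this notebook and should be checked against the loaded model's config.

```python
# Assumed Mistral-7B shapes (taken from the public model config, not from this notebook)
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # MLP intermediate_size
KV_DIM = 1024          # 8 key/value heads x 128 head_dim
NUM_LAYERS = 32
RANK = 16              # same value as LORA_R

# (in_features, out_features) of each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_param_count(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters add roughly 42 million trainable weights, a small fraction of the base model's roughly 7 billion parameters.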

## Section 5: Instruction Formatting and Dataset Preparation

In [None]:
def format_instruction(text, label, task_type):
    """
    Format text into instruction-following format for causal LM.
    
    Args:
        text: Input text
        label: Ground truth label (0 or 1)
        task_type: "sentiment" or "sarcasm"
    
    Returns:
        Formatted instruction string
    """
    prompt = SENTIMENT_PROMPT if "sentiment" in task_type else SARCASM_PROMPT
    
    instruction = f"""<s>[INST] {prompt}

Text: {text}
[/INST] {label}</s>"""
    
    return instruction

def tokenize_function(examples, task_type):
    """
    Tokenize examples for training.
    """
    instructions = [
        format_instruction(text, label, task_type)
        for text, label in zip(examples['text'], examples['label'])
    ]
    
    tokenized = tokenizer(
        instructions,
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors="pt"
    )
    
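    # For causal LM, labels mirror input_ids. Padding positions are set to -100
    # so the loss ignores them (pad_token is eos_token here, so rely on the mask).
    tokenized["labels"] = tokenized["input_ids"].clone()
    tokenized["labels"][tokenized["attention_mask"] == 0] = -100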
    
    return tokenized

print("Instruction formatting functions defined!")
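
The tokenization above computes the loss over the prompt as well as the answer. A common variant, shown here only as a sketch and not as the method this notebook uses, masks the prompt tokens so that only the answer contributes to the loss. The helper name `build_labels` and the token ids below are made up for illustration; `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default

def build_labels(input_ids, attention_mask, prompt_length):
    """Copy input_ids into labels, masking prompt and padding positions."""
    labels = []
    for position, (token_id, is_real) in enumerate(zip(input_ids, attention_mask)):
        if position < prompt_length or not is_real:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels

# Toy sequence: 4 prompt tokens, 2 answer tokens, 2 padding tokens (ids are made up)
input_ids = [1, 733, 16289, 28793, 28740, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]

labels = build_labels(input_ids, attention_mask, prompt_length=4)
print(labels)  # [-100, -100, -100, -100, 28740, 2, -100, -100]
```

Whether prompt masking helps is an empirical question for this task; it mainly matters because the answer here is a single digit, so most of each sequence is prompt text.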

## Section 6: Training Loop with Early Stopping

In [None]:
class EarlyStoppingCallback:
    """
    Custom early stopping helper.
    Call it with each epoch's validation loss; it signals a stop once the loss has
    failed to improve by at least `threshold` (relative) for `patience` consecutive epochs.
    Note: this class is not registered with the Trainer in Section 8.
    """
    def __init__(self, patience=3, threshold=0.1):
        self.patience = patience
        self.threshold = threshold
        self.best_loss = float('inf')
        self.counter = 0
        self.should_stop = False
    
    def __call__(self, val_loss):
        # The first evaluation only establishes the baseline loss
        if self.best_loss == float('inf'):
            self.best_loss = val_loss
            return self.should_stop
        
        # Relative improvement over the best loss seen so far
        improvement = (self.best_loss - val_loss) / self.best_loss
        
        if improvement >= self.threshold:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        
        if self.counter >= self.patience:
            self.should_stop = True
        
        return self.should_stop

print("Early stopping callback defined!")
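
The relative-improvement rule can be traced on a toy loss curve. The following standalone sketch assumes that the first evaluation sets the baseline and that an epoch counts as an improvement only if the loss drops by at least `threshold` relative to the best loss so far. The function name `epochs_until_stop` and the loss values are hypothetical.

```python
def epochs_until_stop(val_losses, patience=3, threshold=0.1):
    """Return the 1-based epoch at which training would stop, or None."""
    best_loss = None
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss is None:
            best_loss = loss  # first evaluation sets the baseline
            continue
        relative_improvement = (best_loss - loss) / best_loss
        if relative_improvement >= threshold:
            best_loss = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:
            return epoch
    return None

# Made-up validation losses: two large drops, then slow progress
losses = [1.00, 0.80, 0.70, 0.68, 0.67, 0.66, 0.65]
print(epochs_until_stop(losses))  # 6
```

Note that a 10% relative threshold is strict: epochs 4 to 6 still lower the loss, but by less than 10% of the best value, so they count as stale and training stops at epoch 6.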

## Section 7: Evaluation Functions

In [None]:
import re

def extract_prediction(output_text):
    """
    Extract prediction (0 or 1) from model output.
    """
    # Use the last occurrence of 0 or 1 in the text
    matches = re.findall(r"[01]", output_text.strip())
    if matches:
        return int(matches[-1])
    # Default to 0 if the output contains neither digit
    return 0

def evaluate_model(model, test_data, task_type):
    """
    Evaluate model on test data.
    
    Returns:
        precision, recall, f1_score (macro-averaged)
    """
    model.eval()
    predictions = []
    true_labels = []
    
    prompt = SENTIMENT_PROMPT if "sentiment" in task_type else SARCASM_PROMPT
    
    with torch.no_grad():
        for _, row in tqdm(test_data.iterrows(), total=len(test_data), desc="Evaluating"):
            instruction = f"""<s>[INST] {prompt}

Text: {row['text']}
[/INST]"""
            
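            inputs = tokenizer(instruction, return_tensors="pt").to(device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=5,
                do_sample=False,  # greedy decoding; temperature has no effect without sampling
                pad_token_id=tokenizer.pad_token_id
            )
            
            # Decode only the newly generated tokens: the prompt itself contains
            # the digits "1" and "0", which would otherwise contaminate the prediction
            prompt_length = inputs["input_ids"].shape[1]
            output_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
            pred = extract_prediction(output_text)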
            
            predictions.append(pred)
            true_labels.append(row['label'])
    
    # Calculate metrics
    precision, recall, f1, _ = precision_recall_fscore_support(
        true_labels, predictions, average='macro', zero_division=0
    )
    
    return precision, recall, f1

print("Evaluation functions defined!")
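
Macro averaging gives each class equal weight regardless of how many examples it has. The sketch below computes macro F1 by hand on a toy example to show why this matters for imbalanced data; it is an illustration of the metric, not a replacement for `precision_recall_fscore_support`, and the helper name `macro_f1` is an assumption.

```python
def macro_f1(true_labels, predictions, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for cls in classes:
        pairs = list(zip(true_labels, predictions))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# A classifier that always predicts 1 on an imbalanced toy set
true_labels = [1, 1, 1, 1, 0]
always_one = [1, 1, 1, 1, 1]
print(round(macro_f1(true_labels, always_one), 4))  # 0.4444
```

Accuracy on this toy set is 80%, yet macro F1 is about 0.44 because the F1 for class 0 is zero. A model that collapses to one label is therefore penalized heavily by the metric used in this notebook.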

## Section 8: Main Training Loop (Cross-Variety Experiment)

In [None]:
# Storage for all results
all_results = []

# Iterate through all tasks
for task in TASKS:
    print(f"\n{'='*80}")
    print(f"Processing Task: {task}")
    print(f"{'='*80}\n")
    
    task_results = {}
    
    # Train on each variety
    for train_variety in VARIETIES:
        print(f"\n--- Training on {train_variety} ---")
        
        # Prepare data
        train_data, val_data, _ = create_stratified_splits(dataset, train_variety, task)
        
        if len(train_data) == 0:
            print(f"No training data for {train_variety} - {task}. Skipping...")
            continue
        
        # Convert to HF Dataset
        train_dataset = Dataset.from_pandas(train_data)
        val_dataset = Dataset.from_pandas(val_data) if len(val_data) > 0 else None
        
        # Tokenize
        train_dataset = train_dataset.map(
            lambda x: tokenize_function(x, task),
            batched=True,
            remove_columns=train_dataset.column_names
        )
        
        if val_dataset:
            val_dataset = val_dataset.map(
                lambda x: tokenize_function(x, task),
                batched=True,
                remove_columns=val_dataset.column_names
            )
        
        # Load model
        model = get_quantized_model()
        model = add_lora_adapters(model)
        
        # Training arguments
        training_args = TrainingArguments(
            output_dir=f"{OUTPUT_DIR}/{task}_{train_variety}",
            num_train_epochs=MAX_EPOCHS,
            per_device_train_batch_size=BATCH_SIZE,
            per_device_eval_batch_size=BATCH_SIZE,
            learning_rate=LEARNING_RATE,
            warmup_steps=100,
            logging_steps=10,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            optim="paged_adamw_8bit",
            bf16=True,  # match bnb_4bit_compute_dtype=torch.bfloat16 (requires bf16-capable hardware)
            gradient_accumulation_steps=4,
            report_to="none"
        )
        
        # Trainer
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset
        )
        
        # Train
        print("Starting training...")
        trainer.train()
        
        # Save best model adapters
        model.save_pretrained(f"{OUTPUT_DIR}/{task}_{train_variety}_best")
        
        # Evaluate on all varieties
        print(f"\nEvaluating {train_variety} model on all varieties...")
        
        for test_variety in VARIETIES:
            _, _, test_data = create_stratified_splits(dataset, test_variety, task)
            
            if len(test_data) == 0:
                continue
            
            precision, recall, f1 = evaluate_model(model, test_data, task)
            
            result = {
                "task": task,
                "trained_on": train_variety,
                "tested_on": test_variety,
                "precision": precision,
                "recall": recall,
                "f1_score": f1
            }
            
            all_results.append(result)
            
            print(f"{train_variety} → {test_variety}: Precision={precision:.4f}, Recall={recall:.4f}, F1={f1:.4f}")
        
        # Clean up
        del model
        del trainer
        torch.cuda.empty_cache()

print("\n" + "="*80)
print("All training and evaluation completed!")
print("="*80)
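
A quick way to check that the loop above ran to completion is to compare the number of collected results with the size of the experiment grid. The sketch below enumerates that grid with `itertools.product`; it assumes that every task and variety combination has both training and test data, which the loop itself does not guarantee since it skips empty subsets.

```python
from itertools import product

VARIETIES = ["en-AU", "en-IN", "en-UK"]
TASKS = ["google-sentiment", "reddit-sentiment", "reddit-sarcasm"]

# One fine-tuning run per (task, training variety)
training_runs = list(product(TASKS, VARIETIES))

# One evaluation per (task, training variety, test variety)
evaluations = list(product(TASKS, VARIETIES, VARIETIES))

print(f"Fine-tuning runs: {len(training_runs)}")   # 9
print(f"Evaluations:      {len(evaluations)}")     # 27
```

If `len(all_results)` is smaller than 27, at least one combination was skipped, and the corresponding heatmap cell will be empty.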

## Section 9: Save Results

In [None]:
# Convert results to DataFrame
results_df = pd.DataFrame(all_results)

# Save to CSV
results_df.to_csv(f"{RESULTS_DIR}/mistral_besstie_results.csv", index=False)

# Save to JSON
with open(f"{RESULTS_DIR}/mistral_besstie_results.json", 'w') as f:
    json.dump(all_results, f, indent=2)

print("Results saved successfully!")
print(f"\nResults DataFrame:")
print(results_df)

## Section 10: Generate Heatmaps

In [None]:
# Generate heatmaps for each task
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, task in enumerate(TASKS):
    # Filter results for this task
    task_df = results_df[results_df['task'] == task]
    
    # Create pivot table for heatmap
    heatmap_data = task_df.pivot(
        index='trained_on',
        columns='tested_on',
        values='f1_score'
    )
    
    # Ensure correct order
    heatmap_data = heatmap_data.reindex(index=VARIETIES, columns=VARIETIES)
    
    # Plot heatmap
    sns.heatmap(
        heatmap_data,
        annot=True,
        fmt='.3f',
        cmap='RdYlGn',
        vmin=0,
        vmax=1,
        ax=axes[idx],
        cbar_kws={'label': 'F1 Score'},
        square=True
    )
    
    axes[idx].set_title(f"{task.replace('-', ' ').title()}", fontsize=14, fontweight='bold')
    axes[idx].set_xlabel('Tested On', fontsize=12)
    axes[idx].set_ylabel('Trained On', fontsize=12)

plt.tight_layout()
plt.savefig(f"{RESULTS_DIR}/mistral_besstie_heatmaps.png", dpi=300, bbox_inches='tight')
plt.show()

print("Heatmaps generated and saved!")

## Section 11: Summary Statistics

In [None]:
# Calculate summary statistics
print("\nSummary Statistics:\n")
print("="*80)

for task in TASKS:
    task_df = results_df[results_df['task'] == task]
    
    print(f"\n{task.upper()}:")
    print("-" * 40)
    
    # Same-variety performance
    same_variety = task_df[task_df['trained_on'] == task_df['tested_on']]
    print(f"Same-variety F1 (mean): {same_variety['f1_score'].mean():.4f}")
    
    # Cross-variety performance
    cross_variety = task_df[task_df['trained_on'] != task_df['tested_on']]
    print(f"Cross-variety F1 (mean): {cross_variety['f1_score'].mean():.4f}")
    
    # Best and worst
    best_idx = task_df['f1_score'].idxmax()
    worst_idx = task_df['f1_score'].idxmin()
    
    best = task_df.loc[best_idx]
    worst = task_df.loc[worst_idx]
    
    print(f"Best: {best['trained_on']} → {best['tested_on']}: {best['f1_score']:.4f}")
    print(f"Worst: {worst['trained_on']} → {worst['tested_on']}: {worst['f1_score']:.4f}")

print("\n" + "="*80)
print("Analysis Complete!")
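
The same-variety versus cross-variety comparison reduces to splitting the result rows by whether `trained_on` equals `tested_on`. Here is a minimal sketch on plain dictionaries in the same shape as `all_results`; the helper name `same_vs_cross` is hypothetical and the scores are made up for illustration, not BESSTIE results.

```python
from statistics import mean

def same_vs_cross(results):
    """Return (mean same-variety F1, mean cross-variety F1)."""
    same = [r["f1_score"] for r in results if r["trained_on"] == r["tested_on"]]
    cross = [r["f1_score"] for r in results if r["trained_on"] != r["tested_on"]]
    return mean(same), mean(cross)

# Made-up scores for illustration only
toy_results = [
    {"trained_on": "en-AU", "tested_on": "en-AU", "f1_score": 0.90},
    {"trained_on": "en-AU", "tested_on": "en-IN", "f1_score": 0.70},
    {"trained_on": "en-IN", "tested_on": "en-AU", "f1_score": 0.80},
    {"trained_on": "en-IN", "tested_on": "en-IN", "f1_score": 0.80},
]

same_mean, cross_mean = same_vs_cross(toy_results)
print(f"same={same_mean:.2f} cross={cross_mean:.2f} gap={same_mean - cross_mean:.2f}")
```

The gap between the two means is a simple summary of how much performance is lost when a model is tested on a variety it was not trained on.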

## Conclusion

This notebook fine-tunes Mistral-7B-Instruct-v0.2 with QLoRA on the BESSTIE benchmark and evaluates each fine-tuned model on all three English varieties (en-AU, en-IN, en-UK) for each of the three tasks, reporting macro-averaged precision, recall, and F1.

### Key Outputs:
1. Trained LoRA adapters saved in `./mistral_besstie_outputs/`
2. Results CSV and JSON saved in `./mistral_besstie_results/`
3. Heatmap visualizations showing F1-scores for all variety combinations
4. Summary statistics comparing same-variety vs cross-variety performance