# LLM Classification Finetuning

Competition: https://www.kaggle.com/competitions/llm-classification-finetuning/overview

## Submission File

For each ID in the test set, you must predict the probability for each target class. The file should contain a header and have the following format:

```csv
id,winner_model_a,winner_model_b,winner_tie
136060,0.33,0.33,0.33
211333,0.33,0.33,0.33
1233961,0.33,0.33,0.33
etc
```

The submission file must be named `submission.csv` and saved in the `/kaggle/working/` directory.
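Because a single mistyped decimal (for example `0,33` instead of `0.33`) silently adds a column, it can help to sanity-check the file before submitting. The following is a minimal sketch using only the standard library; `check_submission` is a hypothetical helper written for this note, not part of the competition tooling, and it only checks the header, the field count, and that each probability lies in [0, 1].

```python
import csv
import io

EXPECTED_HEADER = ["id", "winner_model_a", "winner_model_b", "winner_tie"]


def check_submission(csv_text: str) -> list[str]:
    """Return a list of human-readable problems found in a submission CSV."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append(f"unexpected header: {rows[0] if rows else None}")
        return problems
    # Data rows start on line 2 of the file (line 1 is the header)
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected 4 fields, got {len(row)}")
            continue
        try:
            probs = [float(value) for value in row[1:]]
        except ValueError:
            problems.append(f"line {line_no}: non-numeric probability")
            continue
        if any(not (0.0 <= p <= 1.0) for p in probs):
            problems.append(f"line {line_no}: probability outside [0, 1]")
    return problems


good = "id,winner_model_a,winner_model_b,winner_tie\n136060,0.33,0.33,0.33\n"
bad = "id,winner_model_a,winner_model_b,winner_tie\n136060,0.33,0,33,0.33\n"
print(check_submission(good))
print(check_submission(bad))
```

The same function could be pointed at the real file with `check_submission(open("/kaggle/working/submission.csv").read())`.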

## Inputs

When running on Kaggle, the input files are in the
`/kaggle/input/llm-classification-finetuning/` directory.

```
/kaggle/input/llm-classification-finetuning/sample_submission.csv
/kaggle/input/llm-classification-finetuning/train.csv
/kaggle/input/llm-classification-finetuning/test.csv
```

In [13]:
import os
kaggle_run_type = os.environ.get('KAGGLE_KERNEL_RUN_TYPE')
print(f"KAGGLE_KERNEL_RUN_TYPE: {kaggle_run_type}")

ON_KAGGLE = kaggle_run_type is not None

KAGGLE_KERNEL_RUN_TYPE: Interactive


In [None]:
# Download required packages - for re-use in inference notebook
if ON_KAGGLE:
    %pip download torch torchvision pandas tabulate transformers evaluate peft wandb \
        --dest /kaggle/working/frozen_packages \
        --prefer-binary

    # Install required packages
    %pip install torch torchvision pandas tabulate transformers evaluate peft wandb \
        --find-links /kaggle/working/frozen_packages \
        --no-index

else:
    %pip install torch torchvision pandas tabulate transformers evaluate peft wandb

Collecting torch
  File was already downloaded /kaggle/working/frozen_packages/torch-2.7.1-cp311-cp311-manylinux_2_28_x86_64.whl
Collecting torchvision
  File was already downloaded /kaggle/working/frozen_packages/torchvision-0.22.1-cp311-cp311-manylinux_2_28_x86_64.whl
Collecting pandas
  File was already downloaded /kaggle/working/frozen_packages/pandas-2.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Collecting tabulate
  File was already downloaded /kaggle/working/frozen_packages/tabulate-0.9.0-py3-none-any.whl
Collecting transformers
  File was already downloaded /kaggle/working/frozen_packages/transformers-4.52.4-py3-none-any.whl
Collecting evaluate
  File was already downloaded /kaggle/working/frozen_packages/evaluate-0.4.4-py3-none-any.whl
Collecting peft
  File was already downloaded /kaggle/working/frozen_packages/peft-0.15.2-py3-none-any.whl
Collecting wandb
  File was already downloaded /kaggle/working/frozen_packages/wandb-0.20.1-py3-none-manylinux_2_17_x86

In [None]:
# Setup wandb
import wandb

os.environ["WANDB_PROJECT"] = "llm-classification-ft-peft"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

if ON_KAGGLE:
    # kaggle_secrets is only available inside Kaggle kernels, so import it here
    from kaggle_secrets import UserSecretsClient

    os.environ["WANDB_HOST"] = "kaggle"

    user_secrets = UserSecretsClient()
    wandb_key = user_secrets.get_secret("WANDB_API_KEY")

    wandb.login(key=wandb_key)
else:
    wandb.login()

In [None]:
BASE_PATH = '/kaggle/input/llm-classification-finetuning' if ON_KAGGLE else './data/'

print(f"Using base path: {BASE_PATH}")

print("Available files in base path:")
for root, dirs, files in os.walk(BASE_PATH):
    for file in files:
        print(f" - {os.path.join(root, file)}")


## Data Inputs

Let's first load the inputs and take a look at what we have.

In [None]:
import pandas as pd

train_df = pd.read_csv(os.path.join(BASE_PATH, 'train.csv'))
test_df = pd.read_csv(os.path.join(BASE_PATH, 'test.csv'))

sample_submission_df = pd.read_csv(os.path.join(BASE_PATH, 'sample_submission.csv'))

In [None]:
print(f"Train DataFrame shape: {train_df.shape}")
print(f"Test DataFrame shape: {test_df.shape}")
print(f"Sample Submission DataFrame shape: {sample_submission_df.shape}")

print ("-------------------------")
# Print types of each column
print("\nColumn types in Train DataFrame:")
print(train_df.dtypes)

print("\nColumn types in Test DataFrame:")
print(test_df.dtypes)

In [None]:
print("First rows of each DataFrame:")

print("\nTrain DataFrame:")
print(train_df.head(1).to_markdown())

print("\nTest DataFrame:")
print(test_df.head(1).to_markdown())

print("\nSample Submission DataFrame:")
print(sample_submission_df.head(1).to_markdown())

## Create Dataset

Ok, now we can use the Hugging Face `datasets` library to create the datasets
from the pandas DataFrames.

In [None]:
from datasets import Dataset 

# Convert pandas DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)

# Split the train dataset into train and validation sets, since the test.csv data only has 3 rows.
train_dataset = train_dataset.train_test_split(test_size=0.1, shuffle=True)

# Can see it's now a DatasetDict with 'train' and 'test' splits
train_dataset

## The model stuff now?

We need to pick:
- Model
- Fine tuning method

The code below uses:
- `answerdotai/ModernBERT-large` as the base model
- p-tuning with `peft`

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "answerdotai/ModernBERT-large"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=3,                # winner_model_a, winner_model_b, winner_tie
    attn_implementation="sdpa",  # PyTorch attention, no FlashAttention
).to(device)

# Make sure the model config and the tokenizer agree on the padding token
assert tokenizer.pad_token is not None, "Tokenizer must have a pad_token set."
if model.config.pad_token_id is None:
    model.config.pad_token_id = tokenizer.pad_token_id

# Sanity check: one forward pass through the (still untrained) classification head
text = "User: What is the capital of France?\nAssistant: Paris."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
print("Logits shape:", tuple(outputs.logits.shape))  # one row with 3 class logits

print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")

## Data format

Note that columns `prompt`, `response_a`, and `response_b` are strings
containing JSON arrays that could have more than 1 element.
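
As a small illustration of what that means in practice, the sketch below decodes two made-up cell values (they are hypothetical, not taken from `train.csv`). Each cell is a plain string until it is passed through `json.loads`, and a JSON `null` inside an array decodes to Python's `None`.

```python
import json

# Hypothetical cell values, shaped like the `prompt` and `response_a` columns
raw_prompt = '["Hi there", "And a follow-up?"]'
raw_response = '["Hello!", null]'

prompts = json.loads(raw_prompt)
responses = json.loads(raw_response)

print(type(raw_prompt).__name__, "->", type(prompts).__name__)
print("turns:", len(prompts))
print(responses)
```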

In [None]:
import json

# grab the column as a plain Python list of strings
col = train_dataset["train"]["response_a"]

# find the first row with multiple items
first_multi = next(
    (
        (i, arr)
        for i, raw in enumerate(col)
        for arr in [json.loads(raw)]
        if isinstance(arr, list) and len(arr) > 1
    ),
    None
)

if first_multi:
    i, arr = first_multi
    print(f"First row with >1 element: row {i}: {arr}")

    # now pretty-print the full row at index i
    row = train_dataset["train"][i].copy()

    # parse the JSON-encoded fields
    row["prompt"]     = json.loads(row["prompt"])
    row["response_a"] = arr
    row["response_b"] = json.loads(row["response_b"])

    print("\nRow detail:")
    print(json.dumps(row, indent=2))
else:
    print("No rows with >1 element found.")


## Preprocessing the Data

Now I want to format the training data as input to the model.

Note that conversations can be multi-turn.
Each example becomes a single text input with the following format:

```text
[CLS]
User: <prompt turn 1>
Assistant: <response A turn 1>
User: <prompt turn 2>
Assistant: <response A turn 2>
…
User: <prompt turn N>
Assistant: <response A turn N>
[SEP]
User: <prompt turn 1>
Assistant: <response B turn 1>
User: <prompt turn 2>
Assistant: <response B turn 2>
…
User: <prompt turn N>
Assistant: <response B turn N>
[SEP]
```
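
To make the format concrete, here is a minimal sketch that builds the two conversation strings for one made-up, two-turn example. `build_conversations` is a hypothetical helper for illustration only. It does not add `[CLS]` or `[SEP]` itself, on the assumption that the tokenizer inserts those when the two strings are passed to it as a pair.

```python
import json


def build_conversations(prompt_json: str, resp_a_json: str, resp_b_json: str) -> tuple[str, str]:
    """Build the 'User:/Assistant:' transcript for response A and for response B."""
    prompts = json.loads(prompt_json)
    responses_a = json.loads(resp_a_json)
    responses_b = json.loads(resp_b_json)

    # One "User: ... / Assistant: ..." block per turn, joined by newlines
    conv_a = "\n".join(
        f"User: {p}\nAssistant: {a}" for p, a in zip(prompts, responses_a)
    )
    conv_b = "\n".join(
        f"User: {p}\nAssistant: {b}" for p, b in zip(prompts, responses_b)
    )
    return conv_a, conv_b


conv_a, conv_b = build_conversations(
    '["What is 2+2?", "And times 3?"]',
    '["4", "12"]',
    '["Four.", "Twelve."]',
)
print(conv_a)
print("---")
print(conv_b)
```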


In [None]:
import json
from typing import Any, Dict, List

from transformers import PreTrainedTokenizer


def preprocess_function(
    examples: Dict[str, List[Any]],
    tokenizer: PreTrainedTokenizer,
    max_length: int = 1024,
) -> Dict[str, Any]:
    """
    Preprocess examples for pairwise classification: builds two sequences (history+resp_a, history+resp_b),
    tokenizes them as a pair with special tokens ([CLS], [SEP]), and returns the tokenizer outputs
    (input IDs and attention masks) plus one integer class label per example.
    """
    seq_as, seq_bs, labels = [], [], []
    for prompt_json, resp_a_json, resp_b_json, wa, wb, wt in zip(
        examples["prompt"],
        examples["response_a"],
        examples["response_b"],
        examples["winner_model_a"],
        examples["winner_model_b"],
        examples["winner_tie"],
    ):
        prompts = json.loads(prompt_json)
        responses_a = json.loads(resp_a_json)
        responses_b = json.loads(resp_b_json)

        # Build full multi-turn conversation for each variant
        conv_a = ""
        conv_b = ""
        for prompt_turn, ra, rb in zip(prompts, responses_a, responses_b):
            conv_a += f"User: {prompt_turn}\nAssistant: {ra}\n"
            conv_b += f"User: {prompt_turn}\nAssistant: {rb}\n"

        seq_as.append(conv_a.strip())
        seq_bs.append(conv_b.strip())

        # Map winners to labels: 0=A, 1=B, 2=tie
        if wa == 1:
            labels.append(0)
        elif wb == 1:
            labels.append(1)
        elif wt == 1:
            labels.append(2)
        else:
            raise ValueError(f"Invalid winner flags: {wa}, {wb}, {wt}")

    # Tokenize as sequence pairs; the tokenizer adds the special tokens.
    # Plain Python lists are returned so `Dataset.map` can store them directly.
    tokenized = tokenizer(
        seq_as,
        seq_bs,
        add_special_tokens=True,
        padding="max_length",
        truncation=True,
        max_length=max_length,
    )

    tokenized["labels"] = labels
    return tokenized


# Test preprocessing on the first row as an example
example = train_dataset["train"].select(range(1))
example_preprocessed = preprocess_function(
    example,
    tokenizer=tokenizer,
)

print("Preprocessed example:")
print("Input length:", len(example_preprocessed["input_ids"][0]))

# Each example has a single class label, not a token-level label sequence
print("Label (0=A wins, 1=B wins, 2=tie):", example_preprocessed["labels"][0])

print("\nDecoded input:")
print("---------------------")
print(tokenizer.decode(example_preprocessed["input_ids"][0], skip_special_tokens=True))

In [None]:
tokenized = train_dataset.map(
    lambda ex: preprocess_function(ex, tokenizer),
    batched=True,
    # optionally drop old columns
    remove_columns=train_dataset["train"].column_names,
)

In [None]:
# Let's check the first row of the tokenized dataset

print("Example row from the tokenized dataset:")
print("--- input_ids ---")
print(tokenizer.decode(tokenized["train"][0]["input_ids"], skip_special_tokens=True))
print("---- label -----")

# Labels are class indices: 0 = model A wins, 1 = model B wins, 2 = tie
print(tokenized["train"][0]["labels"])

## Now what

Now that we have a dataset in the right format, with both the inputs and the
labels, we can train.

Let's use the `peft` library to try p-tuning.

In [None]:
from transformers import default_data_collator

# Use the tokenized datasets directly with the Trainer
train_ds = tokenized["train"]
eval_ds = tokenized["test"]

print(f"Training dataset size: {len(train_ds)}")
print(f"Evaluation dataset size: {len(eval_ds)}")

# Check that all sequences are now the same length
print("\nChecking sequence length consistency:")
first_sample_length = len(train_ds[0]["input_ids"])
print(f"First sample length: {first_sample_length}")

# Check a few more samples to ensure consistency
for i in range(min(5, len(train_ds))):
    sample = train_ds[i]
    length = len(sample["input_ids"])
    attention_length = len(sample["attention_mask"])
    # `labels` is a single class index per sample, not a token sequence
    print(f"Sample {i}: input_ids={length}, attention_mask={attention_length}, label={sample['labels']}")

    if length != first_sample_length or length != attention_length:
        print(f"WARNING: Length mismatch in sample {i}")
        break
else:
    print("All checked samples have consistent lengths!")

## PEFT Config

We use p-tuning instead of prefix tuning or prompt tuning.

### Why?
Prefix tuning is more suitable for generation, whereas we're doing classification.

Prompt tuning is very parameter-efficient (e.g. it may only need to train 8-16 embeddings) but can underperform.
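
To put a rough number on "parameter-efficient": in plain prompt tuning the trainable prompt is one embedding vector per virtual token, so its size is `num_virtual_tokens * hidden_size`. The sketch below works that out, assuming a hidden size of 1024 for ModernBERT-large (an assumption stated here, not read from the model config). It deliberately ignores the classification head and the extra encoder network that p-tuning adds, so it is a lower bound rather than what `print_trainable_parameters()` will report.

```python
def prompt_tuning_trainable_params(num_virtual_tokens: int, hidden_size: int) -> int:
    """Parameters in the virtual-token embeddings alone (one vector per token)."""
    return num_virtual_tokens * hidden_size


HIDDEN_SIZE = 1024  # assumption: hidden size of ModernBERT-large

for n in (8, 16, 50):
    params = prompt_tuning_trainable_params(n, HIDDEN_SIZE)
    print(f"{n:>2} virtual tokens -> {params:,} embedding parameters")
```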

In [None]:
from peft import PromptEncoderConfig, get_peft_model

# Improved PEFT configuration with more capacity and regularization
peft_config = PromptEncoderConfig(
    task_type="SEQ_CLS",  # sequence classification, not causal LM
    num_virtual_tokens=50,
    encoder_hidden_size=256,
    encoder_dropout=0.1,
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

## Training Setup

Set up the optimizer and learning rate scheduler.
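
As a rough guide to what the numbers in the training arguments imply, the sketch below computes the effective batch size, the number of warmup steps, and a textbook cosine schedule with linear warmup. The helper names are hypothetical, and the schedule is only assumed to approximate what the `Trainer` does internally. The example values (batch size 2, accumulation 16, 10 steps, warmup ratio 0.1, peak learning rate 3e-4) are the ones used in this notebook's `TrainingArguments`, assuming a single GPU.

```python
import math


def effective_batch_size(per_device: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to each optimizer step."""
    return per_device * grad_accum_steps * num_devices


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Warmup length in optimizer steps, rounded up."""
    return math.ceil(total_steps * warmup_ratio)


def cosine_lr(step: int, total_steps: int, warmup: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 10
warmup = warmup_steps(total_steps, 0.1)
print("Effective batch size:", effective_batch_size(2, 16))
print("Warmup steps:", warmup)
for step in range(total_steps + 1):
    print(step, f"{cosine_lr(step, total_steps, warmup, 3e-4):.2e}")
```

With only 10 steps, the warmup lasts a single step and the learning rate starts decaying immediately afterwards, which is worth keeping in mind when reading the loss curve of such a short test run.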

In [None]:
from transformers import TrainingArguments, Trainer, default_data_collator
import torch
from typing import Dict, List, Any

# Inputs are already padded to max_length in preprocessing, so the default
# collator is enough: it stacks the features into tensors and leaves the
# integer class labels untouched. (A language-modeling collator would
# overwrite `labels` with a copy of `input_ids`.)
data_collator = default_data_collator

# Improved training arguments with better optimization and scheduling
training_args = TrainingArguments(
    output_dir="./llm-classification-ft-peft-p-tuning/output",
    learning_rate=3e-4,              # More conservative learning rate for fine-tuning

    # Smaller batch sizes for memory efficiency
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,    # Keep eval batch size higher

    # Increase gradient accumulation to maintain effective batch size
    gradient_accumulation_steps=16,  # Effective batch size = 2 * 16 = 32
    # num_train_epochs=0.01,              # More epochs for better convergence
    max_steps=10, # TEMP: For testing without training too long
    weight_decay=0.01,
    
    # Better optimization settings
    warmup_ratio=0.1,                # Warmup for training stability
    lr_scheduler_type="cosine",      # Cosine annealing instead of linear
    
    # Better evaluation and saving strategy
    eval_strategy="steps",           # Evaluate more frequently
    eval_steps=100,                  # Evaluate every 100 steps
    save_strategy="steps", 
    save_steps=100,                  # Save every 100 steps
    save_total_limit=2,              # Limit checkpoints to save disk space

    # Early stopping and best model selection
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,

    # Logging and reporting
    report_to="wandb" if not ON_KAGGLE else "none",  # the string "none" disables reporting
    run_name="llm-classification-ft-peft-p-tuning-improved",
    logging_steps=10,                # More frequent logging

    # Memory optimization settings
    dataloader_pin_memory=False,     # Disable pin memory to save GPU memory
    dataloader_num_workers=0,        # Avoid multiprocessing overhead
    gradient_checkpointing=True,     # Trade compute for memory
    fp16=True,                       # Use half precision to reduce memory usage

    # Additional optimizations
    remove_unused_columns=False,     # Important for custom preprocessing
    label_names=["labels"],          # Fix for PEFT model warning
    torch_empty_cache_steps=50,      # Clear cache periodically
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
    data_collator=data_collator,
)

## Now actually train!

In [None]:
# Print current GPU memory usage
import torch

def print_gpu_memory_usage():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3  # Convert to GB
        reserved = torch.cuda.memory_reserved() / 1024**3
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        
        print("GPU Memory Usage:")
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved:  {reserved:.2f} GB") 
        print(f"  Total:     {total:.2f} GB")
        print(f"  Free:      {total - reserved:.2f} GB")
        print(f"  Usage:     {allocated/total*100:.1f}%")
    else:
        print("CUDA is not available.")

print("BEFORE training:")
print_gpu_memory_usage()

# Clear any cached memory
torch.cuda.empty_cache()
print("\nAfter clearing cache:")
print_gpu_memory_usage()

In [None]:
# Memory optimization before training
import gc
import os

# Clear Python garbage collector
gc.collect()

# Clear CUDA cache
torch.cuda.empty_cache()

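# Set PYTORCH environment variable for memory management.
# Note: the allocator reads this when CUDA is first initialized, so it may have
# no effect this late; to be safe, set it before the model is moved to the GPU.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'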

print("Pre-training memory optimization complete")
print_gpu_memory_usage()

# Check model's memory footprint
print(f"\nModel memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")

# Print trainable parameters info
model.print_trainable_parameters()

In [None]:
trainer.train()

In [None]:
wandb.finish()

In [None]:
# Save the model and tokenizer
output_path = "/kaggle/working/model-output" if ON_KAGGLE else "./model-output"
trainer.save_model(output_path)

In [None]:
# Also save the full base model & tokenizer for offline use in the
# next inference notebook
base_output_path = "/kaggle/working/base-model" if ON_KAGGLE else "./base-model"

# get the underlying HF model out of your PEFT wrapper
base_model = model.base_model

# save both model weights/config and tokenizer
base_model.save_pretrained(base_output_path)
tokenizer.save_pretrained(base_output_path)

print(f"Base model and tokenizer saved to {base_output_path}")

## Inference Time

Now that we have a fine-tuned model, we can use it to make predictions on the
test set to see how well (or more likely how poorly) it does.

This is done in the next notebook, which uses this one as an input!
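
For orientation, here is a minimal sketch of the last step of inference: turning three class logits per row into probabilities and writing them in the submission format. It uses made-up logits rather than real model output, and it assumes the class order from preprocessing (0 = A wins, 1 = B wins, 2 = tie), which lines up with the `winner_model_a`, `winner_model_b`, `winner_tie` columns. `softmax` and `submission_csv` are hypothetical helpers, not code from the inference notebook.

```python
import csv
import io
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a short list of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def submission_csv(ids: list[int], logits_per_row: list[list[float]]) -> str:
    """Render ids and per-row logits as submission CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "winner_model_a", "winner_model_b", "winner_tie"])
    for row_id, logits in zip(ids, logits_per_row):
        writer.writerow([row_id] + [f"{p:.6f}" for p in softmax(logits)])
    return buffer.getvalue()


# Hypothetical logits for two test rows
csv_text = submission_csv([136060, 211333], [[0.0, 0.0, 0.0], [2.0, 0.5, -1.0]])
print(csv_text)
```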