# LLM Classification Finetuning

Competition: https://www.kaggle.com/competitions/llm-classification-finetuning/overview

## Submission File

For each ID in the test set, you must predict the probability for each target class. The file should contain a header and have the following format:

```csv
id,winner_model_a,winner_model_b,winner_tie
136060,0.33,0,33,0.33
211333,0.33,0,33,0.33
1233961,0.33,0,33,0.33
etc
```

Submission file must be named `submission.csv` in the `/kaggle/working/` directory.

## Inputs

Input files are in `/kaggle/input/llm-classification-finetuning/` directory if
running on Kaggle.

```
/kaggle/input/llm-classification-finetuning/sample_submission.csv
/kaggle/input/llm-classification-finetuning/train.csv
/kaggle/input/llm-classification-finetuning/test.csv
```

In [1]:
import os
kaggle_run_type = os.environ.get('KAGGLE_KERNEL_RUN_TYPE')
print(f"KAGGLE_KERNEL_RUN_TYPE: {kaggle_run_type}")

ON_KAGGLE = kaggle_run_type is not None

KAGGLE_KERNEL_RUN_TYPE: Batch


In [2]:
# Download required packages - for re-use in inference notebook
if ON_KAGGLE:
    %pip download torch torchvision pandas tabulate transformers evaluate peft wandb \
        --dest /kaggle/working/frozen_packages \
        --prefer-binary \
    
    # Install required packages
    %pip install torch torchvision pandas tabulate transformers evaluate peft wandb \
        --find-links /kaggle/working/frozen_packages \
        --no-index

else:
    %pip install torch torchvision pandas tabulate transformers evaluate peft wandb

Collecting torch
  Downloading torch-2.7.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (29 kB)
Collecting torchvision
  Downloading torchvision-0.22.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Collecting pandas
  Downloading pandas-2.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m91.2/91.2 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tabulate
  Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Collecting transformers
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting peft
  Downloading peft-0.15.2-py3-none-any.whl.metadata (13 kB)
Collecting wandb
  Downloading wandb-0.20.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10

In [3]:
# Setup wandb
import wandb
from kaggle_secrets import UserSecretsClient

os.environ["WANDB_PROJECT"] = "llm-classification-ft-peft"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

if ON_KAGGLE:
    os.environ["WANDB_HOST"] = "kaggle"
    
    user_secrets = UserSecretsClient()
    wandb_key = user_secrets.get_secret("WANDB_API_KEY")

    wandb.login(key=wandb_key)
else:
    wandb.login()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mdrklee3[0m ([33mdrklee3-kava-labs[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [4]:
BASE_PATH = '/kaggle/input/llm-classification-finetuning' if ON_KAGGLE else './data/'

print(f"Using base path: {BASE_PATH}")

print("Available files in base path:")
for root, dirs, files in os.walk(BASE_PATH):
    for file in files:
        print(f" - {os.path.join(root, file)}")


Using base path: /kaggle/input/llm-classification-finetuning
Available files in base path:
 - /kaggle/input/llm-classification-finetuning/sample_submission.csv
 - /kaggle/input/llm-classification-finetuning/train.csv
 - /kaggle/input/llm-classification-finetuning/test.csv


# Data Inputs

Let's load and look at what we got first for inputs.

In [5]:
import pandas as pd

train_df = pd.read_csv(os.path.join(BASE_PATH, 'train.csv'))
test_df = pd.read_csv(os.path.join(BASE_PATH, 'test.csv'))

sample_submission_df = pd.read_csv(os.path.join(BASE_PATH, 'sample_submission.csv'))

In [6]:
print(f"Train DataFrame shape: {train_df.shape}")
print(f"Test DataFrame shape: {test_df.shape}")
print(f"Sample Submission DataFrame shape: {sample_submission_df.shape}")

print ("-------------------------")
# Print types of each column
print("\nColumn types in Train DataFrame:")
print(train_df.dtypes)

print("\nColumn types in Test DataFrame:")
print(test_df.dtypes)

Train DataFrame shape: (57477, 9)
Test DataFrame shape: (3, 4)
Sample Submission DataFrame shape: (3, 4)
-------------------------

Column types in Train DataFrame:
id                 int64
model_a           object
model_b           object
prompt            object
response_a        object
response_b        object
winner_model_a     int64
winner_model_b     int64
winner_tie         int64
dtype: object

Column types in Test DataFrame:
id             int64
prompt        object
response_a    object
response_b    object
dtype: object


In [7]:
print("First rows of each DataFrame:")

print("\nTrain DataFrame:")
print(train_df.head(1).to_markdown())

print("\nTest DataFrame:")
print(test_df.head(1).to_markdown())

print("\nSample Submission DataFrame:")
print(sample_submission_df.head(1).to_markdown())

First rows of each DataFrame:

Train DataFrame:
|    |    id | model_a            | model_b    | prompt                                                                                                                                                                | response_a                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

## Create Dataset

Ok I think we now can load the huggingface stuff to create the datasets from the
pandas dataframes?

In [8]:
from datasets import Dataset 

# Convert pandas DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)

# Split the train dataset into train and validation sets, since the test.csv data only has 3 rows.
train_dataset = train_dataset.train_test_split(test_size=0.1, shuffle=True)

# Can see it's now a DatasetDict with 'train' and 'test' splits
train_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'model_a', 'model_b', 'prompt', 'response_a', 'response_b', 'winner_model_a', 'winner_model_b', 'winner_tie'],
        num_rows: 51729
    })
    test: Dataset({
        features: ['id', 'model_a', 'model_b', 'prompt', 'response_a', 'response_b', 'winner_model_a', 'winner_model_b', 'winner_tie'],
        num_rows: 5748
    })
})

## The model stuff now?

We need to pick:
- Model
- Fine tuning method

Let's start small:
- smol-lm
- prompt tuning with `peft`

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-135M"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# why is it none tho
assert model.config.pad_token_id is None
assert tokenizer.eos_token is not None, "Tokenizer must have an eos_token set."

# set the pad token to be the same as the eos token
tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)

print("Generated code:")
print(tokenizer.decode(outputs[0]))
print("---------------")

print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")

tokenizer_config.json:   0%|          | 0.00/3.66k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/831 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

2025-06-20 02:13:34.976414: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750385615.137594      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750385615.187045      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


model.safetensors:   0%|          | 0.00/269M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated code:
def print_hello_world():
    print("Hello World!")

def print_hello_world_with_print():
   
---------------
Memory footprint: 538.06 MB


## Data format

Note that columns `prompt`, `response_a`, and `response_b` are strings
containing JSON arrays that could have more than 1 element.

In [10]:
import json

# grab the column as a plain Python list of strings
col = train_dataset["train"]["response_a"]

# find the first row with multiple items
first_multi = next(
    (
        (i, arr)
        for i, raw in enumerate(col)
        for arr in [json.loads(raw)]
        if isinstance(arr, list) and len(arr) > 1
    ),
    None
)

if first_multi:
    i, arr = first_multi
    print(f"First row with >1 element: row {i}: {arr}")

    # now pretty-print the full row at index i
    row = train_dataset["train"][i].copy()

    # parse the JSON-encoded fields
    row["prompt"]     = json.loads(row["prompt"])
    row["response_a"] = arr
    row["response_b"] = json.loads(row["response_b"])

    print("\nRow detail:")
    print(json.dumps(row, indent=2))
else:
    print("No rows with >1 element found.")


First row with >1 element: row 1: ['1. "Avatar" (2009): James Cameron\'s science fiction epic pushed the boundaries of CGI and 3D technology.\n\n2. "Inception" (2010): Christopher Nolan\'s mind-bending thriller showcased stunning visual effects, particularly in the scenes of cities folding in on themselves.\n\n3. "Life of Pi" (2012): Ang Lee\'s adaptation of the popular novel was praised for its beautiful, almost dreamlike, CGI.\n\n4. "Gravity" (2013): This space thriller directed by Alfonso Cuar√≥n used groundbreaking technology to depict the vast emptiness and silent terror of space.\n\n5. "Mad Max: Fury Road" (2015): George Miller\'s post-apocalyptic epic featured practical effects and stunts supplemented by vibrant, high-energy CGI.\n\n6. "The Revenant" (2015): This historical drama used natural lighting and remote locations to create a raw, harshly beautiful visual experience.\n\n7. "Doctor Strange" (2016): The Marvel movie introduced viewers to a psychedelic, mind-bending visual 

## Preprocessing the Data

Now i want to format the input training data to be an input to the model.

Note that there can be multi-turn conversations.
This will be a text input with the following format:

```text
## Turn 1
### Prompt
<prompt[0]>

### Response A
<response_a[0]>

### Response B
<response_b[0]>

## Turn 2
### Prompt
<prompt[1]>

### Response A
<response_a[1]>

### Response B
<response_b[1]>

---

Which is better?
Answer:
```

Where `<label>` is one of `a`, `b`, or `tie`.

In [11]:
def preprocess_function(
    examples,
    tokenizer,
    max_length: int = 1024,  # Increased from 512 for multi-turn conversations
):
    """
    More efficient preprocessing with better label alignment and proper padding.
    """
    # 1) Build the text inputs in the desired format
    inputs = []
    for prompt_json, response_a_json, response_b_json in zip(
        examples["prompt"], examples["response_a"], examples["response_b"]
    ):
        # JSON decode the columns to handle multi-turn conversations
        prompts = json.loads(prompt_json)
        responses_a = json.loads(response_a_json)
        responses_b = json.loads(response_b_json)
        
        # Build conversation with turn-by-turn format
        conversation_parts = []
        for i, (prompt_turn, response_a_turn, response_b_turn) in enumerate(zip(prompts, responses_a, responses_b), 1):
            turn_text = f"## Turn {i}\n"
            turn_text += "### Prompt\n"
            turn_text += f"{prompt_turn}\n\n"

            turn_text += "### Response A\n"
            turn_text += f"{response_a_turn}\n\n"

            turn_text += "### Response B\n"
            turn_text += f"{response_b_turn}\n"

            conversation_parts.append(turn_text)
        
        # Join all turns with separator and add final question
        conversation = "\n---\n\n".join(conversation_parts)
        input_text = f"{conversation}\n\nWhich is better?\nAnswer: "  # Added space after colon
        inputs.append(input_text)

    # 2) Build the target responses
    targets = []
    for wa, wb, wt in zip(
        examples["winner_model_a"],
        examples["winner_model_b"],
        examples["winner_tie"],
    ):
        if wa == 1:
            targets.append("a")
        elif wb == 1:
            targets.append("b")
        elif wt == 1:
            targets.append("tie")
        else:
            raise ValueError("Invalid winner values: must be one of a, b, or tie.")

    # 3) More efficient tokenization - separate input and target tokenization
    # Tokenize inputs first with no padding to get raw lengths
    input_tokens = tokenizer(inputs, add_special_tokens=True, padding=False, truncation=False)
    target_tokens = tokenizer(targets, add_special_tokens=False, padding=False, truncation=False)
    
    # 4) Combine and create labels more reliably with proper padding
    model_inputs = {"input_ids": [], "attention_mask": [], "labels": []}
    
    for i, (inp_ids, tgt_ids) in enumerate(zip(input_tokens["input_ids"], target_tokens["input_ids"])):
        # Combine input + target
        combined_ids = inp_ids + tgt_ids
        
        # Truncate if needed - prioritize keeping the full target
        if len(combined_ids) > max_length:
            target_len = len(tgt_ids)
            if target_len < max_length:  # Only truncate if we can fit the target
                input_truncated = combined_ids[:max_length - target_len]
                combined_ids = input_truncated + tgt_ids
            else:
                # If target itself is too long, truncate everything
                combined_ids = combined_ids[:max_length]
        
        # Create labels: -100 for input part, actual tokens for target part
        input_len = len(inp_ids) if len(combined_ids) > len(inp_ids) else len(combined_ids) - len(tgt_ids)
        input_len = max(0, input_len)  # Ensure non-negative
        
        labels = [-100] * input_len + combined_ids[input_len:]
        
        # Ensure labels match combined_ids length
        if len(labels) != len(combined_ids):
            labels = labels[:len(combined_ids)]
        
        # Pad sequences to max_length for consistent batching
        # Pad input_ids and attention_mask
        attention_mask = [1] * len(combined_ids)
        
        if len(combined_ids) < max_length:
            padding_length = max_length - len(combined_ids)
            combined_ids += [tokenizer.pad_token_id] * padding_length
            attention_mask += [0] * padding_length
            labels += [-100] * padding_length
        
        model_inputs["input_ids"].append(combined_ids)
        model_inputs["labels"].append(labels)
        model_inputs["attention_mask"].append(attention_mask)
    
    return model_inputs

# Test preprocessing on the first row as an example
example = train_dataset["train"].select(range(1))
example_preprocessed = preprocess_function(
    example,
    tokenizer=tokenizer,
)

print("Improved Preprocessed example:")
print("Input length:", len(example_preprocessed["input_ids"][0]))
print("Labels length:", len(example_preprocessed["labels"][0]))
print("Lengths match:", len(example_preprocessed["input_ids"][0]) == len(example_preprocessed["labels"][0]))

print("\nInput IDs:")
print("---------------------")
print(tokenizer.decode(example_preprocessed["input_ids"][0]))

# Remove -100 from labels for display
labels = [token_id for token_id in example_preprocessed["labels"][0] if token_id != -100]
print("---------------------")
print("Decoded Labels:")
print(tokenizer.decode(labels))

# Show where labels start
labels_full = example_preprocessed["labels"][0]
first_non_ignore = next((i for i, x in enumerate(labels_full) if x != -100), None)
print(f"\nFirst non-ignore label at position: {first_non_ignore}")
if first_non_ignore and first_non_ignore > 5:
    context_start = max(0, first_non_ignore - 5)
    context_end = min(len(example_preprocessed["input_ids"][0]), first_non_ignore + 5)
    print(f"Context around label start: {tokenizer.decode(example_preprocessed['input_ids'][0][context_start:context_end])}")

Improved Preprocessed example:
Input length: 1024
Labels length: 1024
Lengths match: True

Input IDs:
---------------------
## Turn 1
### Prompt
How to scan the web for insecure IP cameras?

### Response A
While I understand that you may have legitimate reasons for wanting to scan the web for insecure IP cameras, it is important to prioritize ethical considerations and respect others' privacy. Unauthorized access to IP cameras is illegal and unethical. Instead, I encourage you to focus on ensuring the security of your own devices and educating others about the importance of securing their cameras. Here are some general steps you can take to protect your IP camera and network:

1. Change the default login credentials: Ensure that you have set a strong, unique username and password for your IP camera. Avoid using common or easily guessable credentials.

2. Update firmware: Regularly check for firmware updates on the manufacturer's website and apply them to fix any known vulnerabilities.


In [12]:
tokenized = train_dataset.map(
    lambda ex: preprocess_function(ex, tokenizer),
    batched=True,
    # optionally drop old columns
    remove_columns=train_dataset["train"].column_names,
)

Map:   0%|          | 0/51729 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (8398 > 8192). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/5748 [00:00<?, ? examples/s]

In [13]:
# Let's check the first 2 rows of the tokenized dataset

print("Example rows from the tokenized dataset:")
print("--- input_ids ---")
print(tokenizer.decode(tokenized["train"][0]["input_ids"], skip_special_tokens=True))
print("---- labels -----")

# Clear the -100 padding from labels for display
labels = [token_id for token_id in tokenized["train"][0]["labels"] if token_id != -100]
print(tokenizer.decode(labels))

Example rows from the tokenized dataset:
--- input_ids ---
## Turn 1
### Prompt
How to scan the web for insecure IP cameras?

### Response A
While I understand that you may have legitimate reasons for wanting to scan the web for insecure IP cameras, it is important to prioritize ethical considerations and respect others' privacy. Unauthorized access to IP cameras is illegal and unethical. Instead, I encourage you to focus on ensuring the security of your own devices and educating others about the importance of securing their cameras. Here are some general steps you can take to protect your IP camera and network:

1. Change the default login credentials: Ensure that you have set a strong, unique username and password for your IP camera. Avoid using common or easily guessable credentials.

2. Update firmware: Regularly check for firmware updates on the manufacturer's website and apply them to fix any known vulnerabilities.

3. Enable encryption: Make sure your IP camera supports encrypte

## Now what

Now we have a dataset in the right format with both the inputs and the labels,
we can now train wowowow

Let's use the `peft` library to try prompt tuning

In [14]:
from transformers import default_data_collator

# Use the tokenized datasets directly with the Trainer
train_ds = tokenized["train"]
eval_ds = tokenized["test"]

print(f"Training dataset size: {len(train_ds)}")
print(f"Evaluation dataset size: {len(eval_ds)}")

# Check that all sequences are now the same length
print(f"\nChecking sequence lengths consistency:")
first_sample_length = len(train_ds[0]["input_ids"])
print(f"First sample length: {first_sample_length}")

# Check a few more samples to ensure consistency
for i in range(min(5, len(train_ds))):
    length = len(train_ds[i]["input_ids"])
    labels_length = len(train_ds[i]["labels"])
    attention_length = len(train_ds[i]["attention_mask"])
    print(f"Sample {i}: input_ids={length}, labels={labels_length}, attention_mask={attention_length}")
    
    if length != labels_length or length != attention_length:
        print(f"WARNING: Length mismatch in sample {i}")
        break
else:
    print("All samples have consistent lengths!")

Training dataset size: 51729
Evaluation dataset size: 5748

Checking sequence lengths consistency:
First sample length: 1024
Sample 0: input_ids=1024, labels=1024, attention_mask=1024
Sample 1: input_ids=1024, labels=1024, attention_mask=1024
Sample 2: input_ids=1024, labels=1024, attention_mask=1024
Sample 3: input_ids=1024, labels=1024, attention_mask=1024
Sample 4: input_ids=1024, labels=1024, attention_mask=1024
All samples have consistent lengths!


## PEFT Config

We use p-tuning instead of prefix tuning or prompt tuning.

### Why?
Prefix Tuning is more suitable for generation, while we're doing classification

Prompt tuning is very parameter efficient (e.g. could only need to train 8-16 embeddings) but can underperform.

In [15]:
from peft import PromptEncoderConfig, get_peft_model

# Improved PEFT configuration with more capacity and regularization
peft_config = PromptEncoderConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=50,
    encoder_hidden_size=256,
    encoder_dropout=0.1,
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 390,336 || all params: 134,905,344 || trainable%: 0.2893


## Training

Setup optimizer and learning rate scheduler.

In [16]:
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer
import torch
from typing import Dict, List, Any

# Since we're already padding in preprocessing, use a simpler data collator
# that doesn't try to pad again
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # We're doing causal LM, not masked LM
    pad_to_multiple_of=None,  # Don't pad again since we already padded in preprocessing
    return_tensors="pt"
)

# Improved training arguments with better optimization and scheduling
training_args = TrainingArguments(
    output_dir="./llm-classification-ft-peft-p-tuning/output",
    learning_rate=3e-4,              # More conservative learning rate for fine-tuning

    # Smaller batch sizes for memory efficiency
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,    # Keep eval batch size higher

    # Increase gradient accumulation to maintain effective batch size
    gradient_accumulation_steps=16,  # Effective batch size = 2 * 16 = 32
    # num_train_epochs=0.01,              # More epochs for better convergence
    max_steps=10, # TEMP: For testing without training too long
    weight_decay=0.01,
    
    # Better optimization settings
    warmup_ratio=0.1,                # Warmup for training stability
    lr_scheduler_type="cosine",      # Cosine annealing instead of linear
    
    # Better evaluation and saving strategy
    eval_strategy="steps",           # Evaluate more frequently
    eval_steps=100,                  # Evaluate every 100 steps
    save_strategy="steps", 
    save_steps=100,                  # Save every 100 steps
    save_total_limit=2,              # Limit checkpoints to save disk space

    # Early stopping and best model selection
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,

    # Logging and reporting
    report_to="wandb" if not ON_KAGGLE else None,
    run_name="llm-classification-ft-peft-p-tuning-improved",
    logging_steps=10,                # More frequent logging

    # Memory optimization settings
    dataloader_pin_memory=False,     # Disable pin memory to save GPU memory
    dataloader_num_workers=0,        # Avoid multiprocessing overhead
    gradient_checkpointing=True,     # Trade compute for memory
    fp16=True,                       # Use half precision to reduce memory usage

    # Additional optimizations
    remove_unused_columns=False,     # Important for custom preprocessing
    label_names=["labels"],          # Fix for PEFT model warning
    torch_empty_cache_steps=50,      # Clear cache periodically
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

  trainer = Trainer(


## Now actually train!

In [17]:
# Print current GPU memory usage
import torch

def print_gpu_memory_usage():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3  # Convert to GB
        reserved = torch.cuda.memory_reserved() / 1024**3
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        
        print("GPU Memory Usage:")
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved:  {reserved:.2f} GB") 
        print(f"  Total:     {total:.2f} GB")
        print(f"  Free:      {total - reserved:.2f} GB")
        print(f"  Usage:     {allocated/total*100:.1f}%")
    else:
        print("CUDA is not available.")

print("BEFORE training:")
print_gpu_memory_usage()

# Clear any cached memory
torch.cuda.empty_cache()
print("\nAfter clearing cache:")
print_gpu_memory_usage()

BEFORE training:
GPU Memory Usage:
  Allocated: 0.52 GB
  Reserved:  0.53 GB
  Total:     14.74 GB
  Free:      14.21 GB
  Usage:     3.5%

After clearing cache:
GPU Memory Usage:
  Allocated: 0.52 GB
  Reserved:  0.53 GB
  Total:     14.74 GB
  Free:      14.21 GB
  Usage:     3.5%


In [18]:
# Memory optimization before training
import gc
import os

# Clear Python garbage collector
gc.collect()

# Clear CUDA cache
torch.cuda.empty_cache()

# Set PYTORCH environment variable for memory management
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

print("Pre-training memory optimization complete")
print_gpu_memory_usage()

# Check model's memory footprint
print(f"\nModel memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")

# Print trainable parameters info
model.print_trainable_parameters()

Pre-training memory optimization complete
GPU Memory Usage:
  Allocated: 0.52 GB
  Reserved:  0.53 GB
  Total:     14.74 GB
  Free:      14.21 GB
  Usage:     3.5%

Model memory footprint: 0.50 GB
trainable params: 390,336 || all params: 134,905,344 || trainable%: 0.2893


In [19]:
trainer.train()

[34m[1mwandb[0m: Tracking run with wandb version 0.19.9
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20250620_021518-5gr7e4e5[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mllm-classification-ft-peft-p-tuning-improved[0m
[34m[1mwandb[0m: ‚≠êÔ∏è View project at [34m[4mhttps://wandb.ai/drklee3-kava-labs/llm-classification-ft-peft[0m
[34m[1mwandb[0m: üöÄ View run at [34m[4mhttps://wandb.ai/drklee3-kava-labs/llm-classification-ft-peft/runs/5gr7e4e5[0m
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss


[34m[1mwandb[0m: Adding directory to artifact (./llm-classification-ft-peft-p-tuning/output/checkpoint-10)... Done. 0.0s


TrainOutput(global_step=10, training_loss=0.9479976654052734, metrics={'train_runtime': 160.2378, 'train_samples_per_second': 3.994, 'train_steps_per_second': 0.062, 'total_flos': 417608981544960.0, 'train_loss': 0.9479976654052734, 'epoch': 0.012371452872496714})

In [20]:
wandb.finish()

[34m[1mwandb[0m: uploading artifact model-llm-classification-ft-peft-p-tuning-improved
[34m[1mwandb[0m:                                                                                
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run history:
[34m[1mwandb[0m:         train/epoch ‚ñÅ‚ñÅ
[34m[1mwandb[0m:   train/global_step ‚ñÅ‚ñÅ
[34m[1mwandb[0m:     train/grad_norm ‚ñÅ
[34m[1mwandb[0m: train/learning_rate ‚ñÅ
[34m[1mwandb[0m:          train/loss ‚ñÅ
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run summary:
[34m[1mwandb[0m:               total_flos 417608981544960.0
[34m[1mwandb[0m:              train/epoch 0.01237
[34m[1mwandb[0m:        train/global_step 10
[34m[1mwandb[0m:          train/grad_norm 13830.0127
[34m[1mwandb[0m:      train/learning_rate 1e-05
[34m[1mwandb[0m:               train/loss 0.948
[34m[1mwandb[0m:               train_loss 0.948
[34m[1mwandb[0m:            train_runtime 160.2378
[34m[1mwandb[0m: train_samples_per_second 3.99

In [21]:
# Save the model and tokenizer
output_path = "/kaggle/working/model-output" if ON_KAGGLE else "./model-ouput"
trainer.save_model(output_path)

In [22]:
# Also save the full base model & tokenizer for offline use in the
# next inference notebook
base_output_path = "/kaggle/working/base-model" if ON_KAGGLE else "./base-model"

# get the underlying HF model out of your PEFT wrapper
base_model = model.base_model

# save both model weights/config and tokenizer
base_model.save_pretrained(base_output_path)
tokenizer.save_pretrained(base_output_path)

print(f"Base model and tokenizer saved to {base_output_path}")

Base model and tokenizer saved to /kaggle/working/base-model


## Inference Time

Now we have a fine tuned model, we can use it to make predictions on the test
set to see how well (or more likely how poorly) it does.

This is done in the next notebook that uses this one as an input!