# LLM Classification Finetuning

Competition: https://www.kaggle.com/competitions/llm-classification-finetuning/overview

## Submission File

For each ID in the test set, you must predict the probability for each target class. The file should contain a header and have the following format:

```csv
id,winner_model_a,winner_model_b,winner_tie
136060,0.33,0.33,0.33
211333,0.33,0.33,0.33
1233961,0.33,0.33,0.33
etc
```

The submission file must be named `submission.csv` and written to the `/kaggle/working/` directory.
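
A minimal sketch of writing such a file with the standard `csv` module, assuming the model yields three non-negative scores per test ID. The helper names `normalize` and `write_submission`, the example scores, and the uniform fallback are hypothetical choices for illustration. On Kaggle the output path would be `/kaggle/working/submission.csv`.

```python
import csv

HEADER = ["id", "winner_model_a", "winner_model_b", "winner_tie"]

def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores)
    if total <= 0:
        # Assumption: fall back to a uniform guess when there is no signal
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def write_submission(rows, path="submission.csv"):
    """rows: iterable of (id, score_a, score_b, score_tie) tuples."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for row_id, *scores in rows:
            probs = normalize(scores)
            writer.writerow([row_id, *(f"{p:.6f}" for p in probs)])

# Hypothetical scores for the three IDs shown above
example_rows = [
    (136060, 2.0, 1.0, 1.0),
    (211333, 1.0, 1.0, 1.0),
    (1233961, 0.0, 0.0, 0.0),
]
write_submission(example_rows)
```

Normalizing before writing keeps each row a valid probability distribution over the three classes.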

## Inputs

When running on Kaggle, the input files are in the
`/kaggle/input/llm-classification-finetuning/` directory:

```
/kaggle/input/llm-classification-finetuning/sample_submission.csv
/kaggle/input/llm-classification-finetuning/train.csv
/kaggle/input/llm-classification-finetuning/test.csv
```

In [2]:
import os

kaggle_run_type = os.environ.get('KAGGLE_KERNEL_RUN_TYPE')
print(f"KAGGLE_KERNEL_RUN_TYPE: {kaggle_run_type}")

ON_KAGGLE = kaggle_run_type is not None

KAGGLE_KERNEL_RUN_TYPE: None


In [3]:
BASE_PATH = '/kaggle/input/llm-classification-finetuning' if ON_KAGGLE else './data/'

print(f"Using base path: {BASE_PATH}")

print("Available files in base path:")
for root, dirs, files in os.walk(BASE_PATH):
    for file in files:
        print(f" - {os.path.join(root, file)}")


Using base path: ./data/
Available files in base path:
 - ./data/test.csv
 - ./data/sample_submission.csv
 - ./data/train.csv


## Data Inputs

Let's first load the inputs and see what we have.

In [4]:
%pip install pandas tabulate

Note: you may need to restart the kernel to use updated packages.


In [5]:
import pandas as pd

train_df = pd.read_csv(os.path.join(BASE_PATH, 'train.csv'))
test_df = pd.read_csv(os.path.join(BASE_PATH, 'test.csv'))

sample_submission_df = pd.read_csv(os.path.join(BASE_PATH, 'sample_submission.csv'))

In [6]:
print(f"Train DataFrame shape: {train_df.shape}")
print(f"Test DataFrame shape: {test_df.shape}")
print(f"Sample Submission DataFrame shape: {sample_submission_df.shape}")

print("-------------------------")
# Print types of each column
print("\nColumn types in Train DataFrame:")
print(train_df.dtypes)

print("\nColumn types in Test DataFrame:")
print(test_df.dtypes)

Train DataFrame shape: (57477, 9)
Test DataFrame shape: (3, 4)
Sample Submission DataFrame shape: (3, 4)
-------------------------

Column types in Train DataFrame:
id                 int64
model_a           object
model_b           object
prompt            object
response_a        object
response_b        object
winner_model_a     int64
winner_model_b     int64
winner_tie         int64
dtype: object

Column types in Test DataFrame:
id             int64
prompt        object
response_a    object
response_b    object
dtype: object


In [7]:
print("First rows of each DataFrame:")

print("\nTrain DataFrame:")
print(train_df.head(1).to_markdown())

print("\nTest DataFrame:")
print(test_df.head(1).to_markdown())

print("\nSample Submission DataFrame:")
print(sample_submission_df.head(1).to_markdown())

First rows of each DataFrame:

Train DataFrame:
|    |    id | model_a            | model_b    | prompt                                                                                                                                                                | response_a                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

## Create Dataset

Now we can use the Hugging Face `datasets` library to create datasets from the
pandas DataFrames.

In [8]:
from datasets import Dataset 

# Convert pandas DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)

# Split the train dataset into train and validation sets, since the test.csv data only has 3 rows.
train_dataset = train_dataset.train_test_split(test_size=0.1, shuffle=True)

# Can see it's now a DatasetDict with 'train' and 'test' splits
train_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'model_a', 'model_b', 'prompt', 'response_a', 'response_b', 'winner_model_a', 'winner_model_b', 'winner_tie'],
        num_rows: 51729
    })
    test: Dataset({
        features: ['id', 'model_a', 'model_b', 'prompt', 'response_a', 'response_b', 'winner_model_a', 'winner_model_b', 'winner_tie'],
        num_rows: 5748
    })
})

## Choosing a Model

We need to pick:
- A model
- A fine-tuning method

Let's start small:
- SmolLM2-135M
- Prompt tuning with `peft`

In [9]:
%pip install transformers evaluate peft

Note: you may need to restart the kernel to use updated packages.


In [10]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-135M"
# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# This checkpoint's config does not define a pad token, so `pad_token_id` is None
assert model.config.pad_token_id is None
assert tokenizer.eos_token is not None, "Tokenizer must have an eos_token set."

# set the pad token to be the same as the eos token
tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)

print("Generated code:")
print(tokenizer.decode(outputs[0]))
print("---------------")

print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated code:
def print_hello_world():
    print("Hello World!")

def print_hello_world_with_print():
   
---------------
Memory footprint: 538.06 MB


## Data format

Note that columns `prompt`, `response_a`, and `response_b` are strings
containing JSON arrays that could have more than 1 element.
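
As a small illustration of that encoding, the sketch below decodes one hypothetical row and checks that the three columns describe the same number of turns. The helper name `parse_turns` and the sample strings are made up for this example. Replacing `null` entries with empty strings is an assumption about how to handle missing turns, not something prescribed by the dataset.

```python
import json

def parse_turns(prompt_json, response_a_json, response_b_json):
    """Decode the three JSON-array columns into a list of per-turn dicts."""
    prompts = json.loads(prompt_json)
    responses_a = json.loads(response_a_json)
    responses_b = json.loads(response_b_json)
    if not (len(prompts) == len(responses_a) == len(responses_b)):
        raise ValueError("Columns disagree on the number of turns.")
    return [
        {
            # Assumption: treat a JSON null as an empty string
            "prompt": p if p is not None else "",
            "response_a": a if a is not None else "",
            "response_b": b if b is not None else "",
        }
        for p, a, b in zip(prompts, responses_a, responses_b)
    ]

# Hypothetical two-turn row
turns = parse_turns('["Hi", "And then?"]', '["Hello!", null]', '["Hey.", "Done."]')
print(len(turns))
print(turns[1])
```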

In [11]:
import json

# grab the column as a plain Python list of strings
col = train_dataset["train"]["response_a"]

# find the first row with multiple items
first_multi = next(
    (
        (i, arr)
        for i, raw in enumerate(col)
        for arr in [json.loads(raw)]
        if isinstance(arr, list) and len(arr) > 1
    ),
    None
)

if first_multi:
    i, arr = first_multi
    print(f"First row with >1 element: row {i}: {arr}")

    # now pretty-print the full row at index i
    row = train_dataset["train"][i].copy()

    # parse the JSON-encoded fields
    row["prompt"]     = json.loads(row["prompt"])
    row["response_a"] = arr
    row["response_b"] = json.loads(row["response_b"])

    print("\nRow detail:")
    print(json.dumps(row, indent=2))
else:
    print("No rows with >1 element found.")


First row with >1 element: row 1: ['It looks like there might be a typo in your code. The property for getting the number of elements in an array is `length`, not `lenght`. Also, when you\'re importing a module, you should use the `import` statement correctly with curly braces `{}` around named exports unless you are importing the default export. Here\'s the corrected line of code:\n\n```javascript\nimport { cards } from "./cards.js";\ndocument.querySelector("#result > span").textContent = cards.length - quizTable.length;\n```\n\nMake sure that `quizTable` is also defined and imported if necessary, as it\'s being used but not shown in the snippet you provided. If it\'s a variable that holds an array in the current scope, then it should work fine with the corrected `length` property.', "Pour enlever la dernière valeur d'un tableau en JavaScript, vous pouvez utiliser la méthode `pop()`. Voici un exemple :\n\n```javascript\nlet monTableau = [1, 2, 3, 4, 5];\nmonTableau.pop(); // Cela enlè

## Preprocessing the Data

Next, format each training row as a text input for the model.

Conversations can have multiple turns, so each turn gets its own section.
The text input has the following format:

```text
## Turn 1
### Prompt
<prompt[0]>

### Response A
<response_a[0]>

### Response B
<response_b[0]>

## Turn 2
### Prompt
<prompt[1]>

### Response A
<response_a[1]>

### Response B
<response_b[1]>

---

Which is better?
Answer:
```

The model should complete the text after `Answer:` with a label, which is one of `a`, `b`, or `tie`.

In [None]:
def preprocess_function(
    examples,
    tokenizer,
    label_max_length: int = 4,
):
    # 1) Build the text inputs in the desired format
    inputs = []
    for prompt_json, response_a_json, response_b_json in zip(
        examples["prompt"], examples["response_a"], examples["response_b"]
    ):
        # JSON decode the columns to handle multi-turn conversations
        prompts = json.loads(prompt_json)
        responses_a = json.loads(response_a_json)
        responses_b = json.loads(response_b_json)
        
        # Build conversation with turn-by-turn format
        conversation_parts = []
        for i, (prompt_turn, response_a_turn, response_b_turn) in enumerate(zip(prompts, responses_a, responses_b), 1):
            turn_text = f"## Turn {i}\n"
            turn_text += "### Prompt\n"
            turn_text += f"{prompt_turn}\n\n"

            turn_text += "### Response A\n"
            turn_text += f"{response_a_turn}\n\n"

            turn_text += "### Response B\n"
            turn_text += f"{response_b_turn}\n"

            conversation_parts.append(turn_text)
        
        # Join the turns, then add the separator and final question, as in the template above
        conversation = "\n".join(conversation_parts)
        input_text = f"{conversation}\n---\n\nWhich is better?\nAnswer:"
        inputs.append(input_text)

    # 2) Build the text labels ("a", "b", or "tie")
    targets = []
    for wa, wb, wt in zip(
        examples["winner_model_a"],
        examples["winner_model_b"],
        examples["winner_tie"],
    ):
        if wa == 1:
            targets.append("a")
        elif wb == 1:
            targets.append("b")
        elif wt == 1:
            targets.append("tie")
        else:
            raise ValueError("Invalid winner values: must be one of a, b, or tie.")

    # 3) Tokenize the inputs
    model_inputs = tokenizer(
        inputs,
        padding="max_length",
        truncation=True,
    )

    # 4) Tokenize the targets (labels)
    label_tokens = tokenizer(
        targets,
        padding="max_length",
        truncation=True,
        max_length=label_max_length,
    )

    # 5) Replace all pad token ids in the labels with -100
    labels = label_tokens["input_ids"]
    pad_id = tokenizer.pad_token_id
    labels = [
        [(-100 if token_id == pad_id else token_id) for token_id in seq]
        for seq in labels
    ]

    # 6) Attach labels and return
    model_inputs["labels"] = labels
    return model_inputs

# Preprocess the first row as an example

example = train_dataset["train"].select(range(1))
example_preprocessed = preprocess_function(
    example,
    tokenizer=tokenizer,
    label_max_length=4,
)

print("Preprocessed example:")
print("Input IDs:")
print("---------------------")
print(tokenizer.decode(example_preprocessed["input_ids"][0]))

# Remove -100 from labels for display
labels = [token_id for token_id in example_preprocessed["labels"][0] if token_id != -100]
print("---------------------")
print("Decoded Labels:")
print(tokenizer.decode(labels))

TypeError: preprocess_function() got an unexpected keyword argument 'max_length'
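
One thing to watch when training a causal LM: the `labels` sequence generally needs the same length as `input_ids`, with prompt and padding positions set to -100 so that only the answer tokens contribute to the loss. The function above tokenizes inputs and targets separately, so the two lengths differ. The sketch below shows one possible alignment on plain lists of token ids. The helper name `build_causal_lm_example`, the token ids, and the choice to truncate the prompt from the left are assumptions for illustration, not the notebook's method.

```python
IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def build_causal_lm_example(prompt_ids, answer_ids, max_length, pad_id):
    """Concatenate prompt and answer ids; mask everything but the answer."""
    if len(answer_ids) >= max_length:
        raise ValueError("max_length must leave room for at least one prompt token.")
    # Assumption: truncate the prompt from the left so the final question survives
    keep = max_length - len(answer_ids)
    prompt_ids = prompt_ids[-keep:]

    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    attention_mask = [1] * len(input_ids)

    # Right-pad all three sequences to max_length
    pad = max_length - len(input_ids)
    input_ids += [pad_id] * pad
    labels += [IGNORE_INDEX] * pad
    attention_mask += [0] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Hypothetical token ids: a 5-token prompt and a 1-token answer
demo = build_causal_lm_example([11, 12, 13, 14, 15], [99], max_length=8, pad_id=0)
print(demo)
```

Because the padding positions are masked by position rather than by token id, this also avoids masking a real end-of-sequence token when the pad token equals the EOS token.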

In [19]:
tokenized = train_dataset.map(
    lambda ex: preprocess_function(ex, tokenizer),
    batched=True,
    remove_columns=train_dataset["train"].column_names,  # optionally drop old columns
)

Map:   0%|          | 0/51729 [00:00<?, ? examples/s]

TypeError: preprocess_function() missing 1 required positional argument: 'max_length'