# Llama 3 8B Fine-tuning on the ARC Dataset
In this notebook, we will demonstrate how to fine-tune the instruct version of Llama 3 8B using Kaggle hardware. If you aim to apply a Large Language Model (LLM) to the ARC dataset and enhance its performance from its original state without relying on in-context learning, prompt engineering, or other techniques, this straightforward approach is for you.

We will use QLoRA, which combines 4-bit quantization of the base model with low-rank adaptation (LoRA), to stay within the hardware limits on Kaggle. Note that this fine-tuning process is not optimized and may not solve any tasks from the hidden test set. This notebook serves as a demonstration of how LLMs can be fine-tuned and which packages the process requires.

We welcome any feedback you may have and appreciate your insights.


## 1. Add datasets and Model
We will be using the following datasets:

1. ‘ARC Prize 2024’: This is the official dataset containing the ARC tasks to be solved.
2. (Optional) ‘Llama-3-ARC-deps’: This dataset contains the wheel files for additional packages not available in the Kaggle Kernel. Note that this dataset is required if you plan to submit this notebook to the competition, as no internet access is allowed during the competition. !missing! (not public)

Additionally, we need to add the original Llama 3 8B model:

3. ‘Llama 3 8B-chat-hf’

Please note that to access the Llama 3 model on Kaggle, you need to obtain access from Meta. Instructions on how to do this can be found [here](https://www.kaggle.com/models/metaresearch/llama-3).

## 2. Install and Import Packages, and Log in to Weights & Biases (wandb)

We will be using Hugging Face libraries, most of which are already available in Kaggle kernels. A few packages are not included by default. If you are not submitting to the competition, you can install them directly from PyPI.

For competition submissions, where internet access is restricted, we will use a Kaggle dataset containing the required wheel files. This allows us to install the packages without needing internet access during the submission process.

Additionally, we will log in to Weights & Biases (wandb) to track the progress of our fine-tuning process.

### 2.1 With internet access:

If we have internet access we can just directly install the packages:

In [None]:
!pip install -q -U -i https://pypi.org/simple/ bitsandbytes
!pip install -q -U trl
!pip install -q -U peft

### 2.2 Without internet access (use for submission):

If you don't have internet access, you can either:
1. Add the dataset we prepared [dataset name] !missing!
2. Create your own dataset. You can find the explanation here [link] !missing!

In [None]:
deps_path = '/kaggle/input/llama-3-arc-deps'
! pip install --no-index --find-links {deps_path} --requirement {deps_path}/requirements.txt

### 2.3 Import Packages

Now, let’s import the necessary packages:

In [None]:
# For dataset
import pandas as pd
import json
import os
import ast
import re
import numpy as np
from datasets import Dataset
import matplotlib.pyplot as plt

# For LLM
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    set_seed,
    pipeline
)
from trl import SFTTrainer, setup_chat_format, SFTConfig

import torch
from time import time

# For wandb
from kaggle_secrets import UserSecretsClient
import wandb
# Set seed
set_seed(42)

### 2.4 Log in to Weights & Biases [optional] (wandb)

To log in to Weights & Biases (wandb), follow these steps:

1. Add Your API Key as a Kaggle Secret:
    - Navigate to the “Add-ons” section in the right-hand panel of your Kaggle notebook interface.
    - Select “Secrets”.
    - Click on “Add a new secret”.
    - Enter a name for your secret (e.g., WANDB_API_KEY) and paste your wandb API key in the value field.
    - Save the secret.
2. Log in to wandb in Your Notebook:
    - Use the code below to log in to wandb using the secret you just added.
3. Initialize wandb:
    - This notebook does not call `wandb.init` explicitly. The run is expected to be started by the trainer's wandb integration in section 6.3, which reads the project name from the `WANDB_PROJECT` environment variable set there.

In [None]:
# from kaggle_secrets import UserSecretsClient
# user_secrets = UserSecretsClient()
# secret_value_0 = user_secrets.get_secret("WANDB_API_KEY")

In [None]:
user_secrets = UserSecretsClient()
wandb_key = user_secrets.get_secret("WANDB_API_KEY")
! wandb login $wandb_key

## 3. Load the data

Next, let’s load the ARC tasks:

In [None]:

# Prepare data for DataFrame

# Load JSON data from the files
with open('/kaggle/input/arc-prize-2024/arc-agi_evaluation_challenges.json') as f:
    challenges = json.load(f)

with open('/kaggle/input/arc-prize-2024/arc-agi_evaluation_solutions.json') as f:
    solutions = json.load(f)

data = []
for file_name, grids in challenges.items():
    train_grids = grids.get('train', [])
    test_inputs = grids.get('test', [])
    test_outputs = solutions.get(file_name, [])
    # Transform test grids to lists of dicts with 'output' key
    test_outputs_transformed = [{'output': grid} for grid in test_outputs]
    # Pair each test input with its solution grid
    combined_tests = []
    for test_input, test_output in zip(test_inputs, test_outputs_transformed):
        combined_tests.append({'input': test_input['input'], 'output': test_output['output']})
    data.append({
            'file_name': file_name,
            'train': train_grids,
            'test_input': test_inputs,
            'test_output': test_outputs_transformed,
            'test': combined_tests
    })

# Create DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
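
The pairing step in the loop above can be hard to picture without seeing the data. The following is a minimal sketch on a made-up miniature of the two JSON files. The task ID `task_a`, the grids, and the helper name `pair_tests` are hypothetical and only illustrate the structure that the loading code assumes (one solution grid per test input).

```python
# Hypothetical miniature of the challenges and solutions files.
# The task ID and all grids are made up for illustration.
challenges = {
    "task_a": {
        "train": [{"input": [[0, 1]], "output": [[1, 0]]}],
        "test": [{"input": [[1, 1]]}, {"input": [[0, 0]]}],
    }
}
solutions = {"task_a": [[[0, 0]], [[1, 1]]]}  # one solution grid per test input


def pair_tests(task, task_solutions):
    """Attach the i-th solution grid to the i-th test input."""
    return [
        {"input": test["input"], "output": grid}
        for test, grid in zip(task["test"], task_solutions)
    ]


combined = pair_tests(challenges["task_a"], solutions["task_a"])
print(combined)
```

Each entry of `combined` has the same `input`/`output` shape as a training pair, which is what lets the prompt code later treat test and training examples uniformly.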

## 4. Load the Llama 3 Model

Next, we will load the original Llama 3 8B instruct model, which we fine-tune in section 6. We load it with 4-bit quantization to reduce memory requirements. Make sure you have selected an appropriate accelerator (T4 x2) for the session, as sufficient GPU memory is crucial for the training process to work.

In [None]:
# Define a template for formatting chat messages with the Llama 3 model
# This is model specific. Change it if you e.g. use Google's Gemma instead of Llama
LLAMA_3_CHAT_TEMPLATE = """{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"""

# Use float16 for computations; bfloat16 is not supported on T4/P100 GPUs
compute_dtype = torch.float16

# Configure the BitsAndBytes settings for 4-bit quantization to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enable 4-bit quantization
    bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save additional memory
    bnb_4bit_quant_type="nf4",  # Specify the quantization type
    bnb_4bit_compute_dtype=compute_dtype,  # Set the computation data type
)

# Specify the model path; change this if you want to try e.g. Google's Gemma
model_id = "/kaggle/input/llama-3/transformers/8b-chat-hf/1"

# Record the start time to measure the loading duration
time_start = time()

# Load the pre-trained model with specified configurations
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # Apply the 4-bit quantization configuration
    torch_dtype=compute_dtype,  # Set the data type for the model
    use_cache=False,  # Disable the KV cache; it is not needed for training and conflicts with gradient checkpointing
    device_map='auto',  # Automatically map the model to available devices (e.g., GPUs)
)

# Enable gradient checkpointing to reduce memory usage during backpropagation
model.gradient_checkpointing_enable()

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Use the end-of-sequence token for padding; a dedicated pad token could be added instead, but is not needed here
tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = LLAMA_3_CHAT_TEMPLATE  # Apply the chat message template

# Record the end time and print the duration for preparing the model and tokenizer
time_end = time()
print(f"Prepare model, tokenizer: {round(time_end-time_start, 3)} sec.")
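
To see why 4-bit loading matters on a T4, a rough back-of-the-envelope estimate of the weight memory helps. The sketch below is not a measurement. It assumes roughly 8.03 billion parameters and the block sizes described for NF4 in the QLoRA paper (blocks of 64 weights, second-level blocks of 256 constants), and it pretends every weight is quantized. In practice some layers, such as the embeddings, stay in 16-bit, and activations, gradients, and optimizer state come on top.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 8.03e9  # assumed approximate parameter count of Llama 3 8B

# NF4 stores 4 bits per weight plus quantization constants.
# Without double quantization: one 32-bit constant per block of 64 weights.
overhead_plain = 32 / 64
# With double quantization: 8-bit constants, plus one 32-bit constant
# for every 256 of those blocks.
overhead_double = 8 / 64 + 32 / (64 * 256)

fp16 = weight_memory_gib(N_PARAMS, 16)
nf4 = weight_memory_gib(N_PARAMS, 4 + overhead_plain)
nf4_dq = weight_memory_gib(N_PARAMS, 4 + overhead_double)

print(f"float16 weights:            {fp16:.1f} GiB")
print(f"NF4 weights:                {nf4:.1f} GiB")
print(f"NF4 + double quantization:  {nf4_dq:.1f} GiB")
```

Under these assumptions the 16-bit weights alone would take up most of a single T4's memory, while the 4-bit version leaves room for the LoRA adapters and activations.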

## 5. Create Prompts and filter the dataset

Next, we will create the prompts that will be used to fine-tune and evaluate the model on the ARC dataset.

### 5.1 Create Prompts

In [None]:
# The system_prompt defines the initial instructions for the model, setting the context for solving ARC tasks.
system_prompt = '''You are a puzzle solving wizard. You are given a puzzle from the abstraction and reasoning corpus developed by Francois Chollet.'''

# User message template is a template for creating user prompts. It includes placeholders for training data and test input data, guiding the model to learn the rule and apply it to solve the given puzzle.
user_message_template = '''Here are the example input and output pairs from which you should learn the underlying rule to later predict the output for the given test input:
----------------------------------------
{training_data}
----------------------------------------
Now, solve the following puzzle based on its input grid by applying the rules you have learned from the training data:
----------------------------------------
[{{'input': {input_test_data}, 'output': [[]]}}]
----------------------------------------
What is the output grid? Only provide the output grid in the form as in the example input and output pairs. Do not provide any additional information:'''

def preprocess(task, train_mode=True):
    """
    Preprocess a single ARC task to create the prompt and solution for the model.

    This function formats the system and user messages using a predefined template and the task's training and test data.
    If in training mode, it also includes the assistant's message with the expected output.

    Parameters:
    task (dict): The ARC task data containing training and test examples.
    train_mode (bool): If True, includes the assistant's message with the expected output for training purposes.

    Returns:
    dict: A dictionary containing the formatted text prompt, the solution, and the file name.
    """
    # System message
    system_message = {"role": "system", "content": system_prompt}

    # Extract training data and input grid from the task
    training_data = task['train']
    input_test_data = task['test'][0]['input']
    output_test_data = task['test'][0]['output']

    # Format the user message with training data and input test data
    user_message_content = user_message_template.format(training_data=training_data, input_test_data=input_test_data)
    user_message = {
        "role": "user",
        "content": user_message_content
    }

    # Include the assistant message with the expected output if in training mode
    if train_mode:
        assistant_message = {
            "role": "assistant",
            "content": str(output_test_data)
        }

        # Combine system, user, and assistant messages
        messages = [system_message, user_message, assistant_message]
    else:
        messages = [system_message, user_message]
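    # Convert the messages with the chat template used by the instruction-tuned Llama model.
    # Without an assistant message, append the assistant header so the model starts its answer.
    messages = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=not train_mode)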
    return {"text": messages, "solution": output_test_data, "file_name": task['file_name']}

# Convert the loaded data to a Huggingface Dataset object
dataset = Dataset.from_pandas(df)
dataset = dataset.shuffle(seed=42)
# Split dataset into training and testing
dataset = dataset.train_test_split(test_size=0.2)

# Use the map method to apply the preprocess function
dataset = dataset.map(preprocess, batched=False, remove_columns=dataset["train"].column_names)

In [None]:
# Check sample
print(dataset['train'][0]['text'])
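
To make the printed sample easier to read, here is a plain-Python sketch of what the chat template from section 4 does to a list of messages. The function name `render_llama3_chat` and the example messages are hypothetical. The default `bos_token` is assumed to be `<|begin_of_text|>`, and the real rendering is done by `tokenizer.apply_chat_template`, not by this helper.

```python
def render_llama3_chat(messages, bos_token="<|begin_of_text|>", add_generation_prompt=False):
    """Mirror the Jinja chat template in plain Python (illustrative only)."""
    parts = []
    for index, message in enumerate(messages):
        # Each turn is a role header, the trimmed content, and an end-of-turn token
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
        if index == 0:
            # The beginning-of-sequence token is only added before the first turn
            content = bos_token + content
        parts.append(content)
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Made-up messages, much shorter than a real ARC prompt
example_messages = [
    {"role": "system", "content": "You are a puzzle solving wizard."},
    {"role": "user", "content": "What is the output grid?"},
    {"role": "assistant", "content": "[[1, 0]]"},
]
rendered = render_llama3_chat(example_messages)
print(rendered)
```

For training, the assistant turn with the expected grid is part of the text. For inference, the assistant message is left out and `add_generation_prompt=True` leaves an open assistant header for the model to complete.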

### 5.2 Inspect the Prompts

To understand how many tasks we can consider for fine-tuning, we will inspect the number of tokens in the prompts for each task. This will give us an idea of the token length distribution and help us determine the feasibility of including various tasks within the model’s context window.

In [None]:
# Tokenize the dataset and store tokenized samples
data = dataset.map(lambda samples: tokenizer(samples['text']), batched=False)

def plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset):
    """
    Plot the distribution of token lengths in the training and validation datasets.

    This function calculates the length of tokenized input texts in both the training and 
    validation datasets, combines the lengths, and plots a histogram to visualize their distribution.

    Parameters:
    tokenized_train_dataset (Dataset): The tokenized training dataset.
    tokenized_val_dataset (Dataset): The tokenized validation dataset.

    Returns:
    None
    """
    # Count the tokens (input_ids) of each sample in the training dataset;
    # len(x['text']) would count characters instead of tokens
    lengths = [len(x['input_ids']) for x in tokenized_train_dataset]
    # Add the token counts of the validation dataset
    lengths += [len(x['input_ids']) for x in tokenized_val_dataset]
    
    # Print the total number of lengths calculated
    print(len(lengths))

    # Plotting the histogram of token lengths
    plt.figure(figsize=(10, 6))
    plt.hist(lengths, bins=50, alpha=0.7, color='blue')
    plt.xlabel('Length of input_ids')
    plt.ylabel('Frequency')
    plt.title('Distribution of Lengths of input_ids')
    # Uncomment the line below to set x-axis limits, if needed
    # plt.xlim([0, 1500])
    plt.show()

# Plot the distribution of token lengths in the training and validation datasets
plot_data_lengths(data['train'], data['test'])

### 5.3 Filter the Dataset

To address memory limitations, we will restrict the dataset to tasks whose prompts are at most 2048 tokens long. This keeps the sequences short enough to fit into GPU memory during fine-tuning.

In [None]:
# Define the maximum number of tokens allowed
max_tokens = 2048

# Function to calculate the number of tokens in a text
def count_tokens(text):
    """
    Calculate the number of tokens in a given text using the tokenizer.

    Parameters:
    text (str): The input text to be tokenized.

    Returns:
    int: The number of tokens in the input text.
    """
    return len(tokenizer.encode(text))

# Filter the dataset to include only tasks with a number of tokens within the allowed limit
filtered_train_dataset = dataset['train'].filter(lambda x: count_tokens(x['text']) <= max_tokens)
filtered_eval_dataset = dataset['test'].filter(lambda x: count_tokens(x['text']) <= max_tokens)

# Calculate the number of tasks filtered out
filtered_out_train_tasks = len(dataset['train']) - len(filtered_train_dataset)
filtered_out_eval_tasks = len(dataset['test']) - len(filtered_eval_dataset)

# Print the number of tasks filtered out and the remaining tasks
print(f'{filtered_out_train_tasks} training tasks were filtered out because they exceed the {max_tokens} token limit.')
print(f'The filtered training dataset contains {len(filtered_train_dataset)} tasks for fine-tuning.')
print(f'{filtered_out_eval_tasks} evaluation tasks were filtered out because they exceed the {max_tokens} token limit.')
print(f'The filtered evaluation dataset contains {len(filtered_eval_dataset)} tasks for evaluation.')
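
The choice of 2048 tokens is a trade-off between memory and the number of tasks that survive the filter. A small sketch of how one might compare several candidate limits is shown below. The list `token_lengths` is made up for illustration. In the notebook it would come from applying `count_tokens` to every prompt. The helper name `share_kept` is hypothetical.

```python
def share_kept(token_lengths, limit):
    """Fraction of prompts whose token count is at most `limit`."""
    kept = sum(1 for n in token_lengths if n <= limit)
    return kept / len(token_lengths)


# Hypothetical token counts, not taken from the ARC dataset
token_lengths = [850, 1200, 1900, 2048, 3100, 5400, 7600, 9800]

for limit in (1024, 2048, 4096, 8192):
    print(f"limit {limit:>5}: {share_kept(token_lengths, limit):.0%} of tasks kept")
```

Running the same comparison on the real token counts shows how many tasks a larger limit would add, which can then be weighed against the extra memory that longer sequences need.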

### 5.4 Test the Original Model

In [None]:
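# Disable the memory-efficient and flash scaled-dot-product attention kernels
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

# Use a new name so the imported `pipeline` function is not overwritten
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Stop generating at either the end-of-sequence or the end-of-turn token
terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# The dataset texts were built with train_mode=True and therefore end with the expected answer.
# Cut the text right after the assistant header so the model has to generate the answer itself.
assistant_header = "<|start_header_id|>assistant<|end_header_id|>\n\n"
full_text = filtered_eval_dataset[0]['text']
prompt = full_text[:full_text.rindex(assistant_header) + len(assistant_header)]

outputs = pipe(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)
print(outputs[0]["generated_text"][len(prompt):])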
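
The generated text is a string, so it has to be turned back into a grid before it can be compared with the solution. The following is a minimal sketch of one way to do that with `re` and `ast`. The helper name `parse_grid` and the example strings are hypothetical, and the approach simply takes everything between the first `[[` and the last `]]`, which may not be robust enough for every model output.

```python
import ast
import re


def parse_grid(generated_text):
    """Return the grid between the first '[[' and the last ']]', or None if parsing fails."""
    # DOTALL lets the match span several lines
    match = re.search(r"\[\s*\[.*\]\s*\]", generated_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        # literal_eval only accepts Python literals, so no code is executed
        grid = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    is_grid = (
        isinstance(grid, list)
        and len(grid) > 0
        and all(
            isinstance(row, list) and all(isinstance(cell, int) for cell in row)
            for row in grid
        )
    )
    return grid if is_grid else None


# Made-up model output and solution
prediction = parse_grid("The output grid is [[1, 0], [0, 1]]")
solution = [[1, 0], [0, 1]]
print(prediction == solution)
```

Returning `None` for anything that is not a list of integer rows keeps malformed generations from being counted as correct by accident.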

## 6. Finetune the model

In this section, we will set up the necessary configurations and initiate the fine-tuning process for our model on the filtered ARC dataset. This involves preparing the training environment, defining the training parameters, and starting the fine-tuning process.

### 6.1 LoRA Configuration

In this subsection, we will configure the Low-Rank Adaptation (LoRA) settings for fine-tuning our model. LoRA is an efficient technique that allows us to adapt pre-trained language models to specific tasks by adding low-rank updates. This approach helps in reducing the number of trainable parameters and computational requirements, making it suitable for our setup.

In [None]:
# Configure LoRA (Low-Rank Adaptation) for fine-tuning the model
peft_config = LoraConfig(
        lora_alpha=64,  # Scaling factor for the low-rank matrices
        lora_dropout=0.05,  # Dropout rate to apply to the low-rank matrices
        r=4,  # Rank of the low-rank matrices
        bias="none",  # Which bias terms to train: "none", "all", or "lora_only"
        task_type="CAUSAL_LM",  # Specify the type of task (e.g., CAUSAL_LM for causal language modeling)
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",  # List of modules to apply LoRA to
                        "gate_proj", "up_proj", "down_proj"],
)

# Explanation of LoRA Configuration Parameters:
# lora_alpha: Controls the scaling of the low-rank adaptation, helping to balance between original weights and the adapted ones.
# lora_dropout: Introduces dropout to the low-rank adaptation matrices, aiding in regularization.
# r: Defines the rank of the low-rank matrices, controlling the number of parameters added.
# bias: Determines which bias terms are trained ("none", "all", or "lora_only").
# task_type: Specifies the type of task for fine-tuning (CAUSAL_LM for causal language modeling in this case).
# target_modules: Lists the specific modules of the model where LoRA will be applied, focusing the adaptation on critical components.
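
To get a feeling for how few parameters this configuration trains, the count can be estimated by hand. The sketch below assumes the published Llama 3 8B layer sizes (hidden size 4096, MLP size 14336, 32 layers, and key/value projections of width 1024 because of grouped-query attention). The helper name `lora_params` is hypothetical. The result is an estimate that can be checked against the output of `print_trainable_parameters()`.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds a matrix A (r x in) and a matrix B (out x r) per adapted linear layer."""
    return r * (in_features + out_features)


# Assumed Llama 3 8B dimensions
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32
R, ALPHA = 4, 64  # values from the LoraConfig above

# (in_features, out_features) of every targeted module in one decoder layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling factor (lora_alpha / r): {scaling}")
```

Under these assumptions the adapters add roughly 10.5 million trainable parameters, a small fraction of the 8 billion frozen ones. With `lora_alpha=64` and `r=4` the low-rank update is scaled by 16, which is a fairly strong scaling and one of the settings worth revisiting when tuning.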

### 6.2 Training Arguments

In this subsection, we will define the training arguments for fine-tuning our model. These settings include various parameters that control the training process, such as batch size, learning rate, evaluation strategy, and more. Properly configuring these arguments is crucial for efficient and effective model training.

In [None]:
# Define the output directory for the fine-tuned model
output_dir="llama3_8b_arc_v01"

sft_config = SFTConfig(
    output_dir=output_dir,  # Directory to save the fine-tuned model and checkpoints
    eval_strategy="steps",  # Evaluate the model at regular steps
    do_eval=True,  # Perform evaluation during training
    optim="paged_adamw_8bit",  # Optimizer to use for training (paged AdamW with 8-bit precision)
    per_device_train_batch_size=1,  # Training batch size per device
    gradient_accumulation_steps=8,  # Accumulate gradients over multiple steps to effectively increase batch size
    per_device_eval_batch_size=1,  # Evaluation batch size per device
    log_level="debug",  # Logging level (debug for detailed logs)
    save_steps=250,  # Save a model checkpoint every 250 steps
    logging_steps=10,  # Log training metrics every 10 steps
    learning_rate=8e-6,  # Learning rate for the optimizer
    eval_steps=250,  # Evaluate the model every 250 steps
    max_steps=750,  # Maximum number of training steps; overrides num_train_epochs
    num_train_epochs=3,  # Number of training epochs (ignored because max_steps is set)
    warmup_steps=10,  # Number of warmup steps for learning rate scheduler
    lr_scheduler_type="cosine",  # Type of learning rate scheduler (cosine annealing)
    fp16=True,  # Use 16-bit floating point precision for training
    bf16=False,  # Do not use bfloat16 precision
    max_grad_norm=0.3,  # Maximum gradient norm for gradient clipping
    gradient_checkpointing=True,  # Use gradient checkpointing to save memory
    gradient_checkpointing_kwargs={'use_reentrant':False},  # Arguments for gradient checkpointing
    ######
    dataset_text_field="text", # The field in the dataset containing the text data
    max_seq_length=max_tokens,  # The maximum sequence length for tokenization
    packing=False,
)
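
Two consequences of these settings are easy to miss: the effective batch size and the shape of the learning-rate curve. The sketch below works them out in plain Python. It assumes a single training process (with `device_map='auto'` the model is split across the GPUs instead of being replicated), and the function `cosine_lr` is a hypothetical re-implementation of a cosine schedule with linear warmup, not the scheduler object the trainer creates.

```python
import math


def cosine_lr(step, base_lr=8e-6, warmup_steps=10, total_steps=750):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# per_device_train_batch_size * gradient_accumulation_steps (one process assumed)
effective_batch_size = 1 * 8
# Number of training samples processed over max_steps optimizer steps
samples_seen = effective_batch_size * 750

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen in 750 steps: {samples_seen}")
for step in (0, 5, 10, 380, 750):
    print(f"step {step:>3}: lr = {cosine_lr(step):.2e}")
```

Since the filtered training set is small, 750 optimizer steps with an effective batch of 8 mean that every task is seen many times, which is worth keeping in mind when looking at the evaluation loss.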

### 6.3 Training the Model

In this subsection, we will set up the environment for training and use the SFTTrainer to fine-tune our model on the ARC dataset. This involves configuring the trainer with the model, datasets, PEFT configuration, tokenizer, and training arguments, and then initiating the training process.

In [None]:
# Enable Weights & Biases (wandb) for tracking the training process
os.environ["WANDB_DISABLED"] = "false"
os.environ["WANDB_PROJECT"] = "llama3_8b_ARC"

In [None]:
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
# Set up the SFTTrainer for fine-tuning the model
# Define the output directory for the fine-tuned model
output_dir="llama3_8b_arc_v01"

trainer = SFTTrainer(
        model=model,  # The pre-trained model to be fine-tuned
        train_dataset=filtered_train_dataset,  # The training dataset
        eval_dataset=filtered_eval_dataset,  # The evaluation dataset
        peft_config=peft_config,  # LoRA configuration for parameter-efficient fine-tuning
        tokenizer=tokenizer,  # The tokenizer for the model
        args=sft_config,  # Training arguments configuration
)

# Print the number of trainable parameters in the model
trainer.model.print_trainable_parameters()

# Start the training process
trainer.train()

# Save the fine-tuned LoRA adapter weights (the base model weights are not included)
trainer.model.save_pretrained(output_dir)

# Save the tokenizer
tokenizer.save_pretrained(output_dir)

print(f"Model and tokenizer saved to {output_dir}")

## 7. Create a Dataset to Be Used for Inference

To use your fine-tuned model for inference or submission, follow the steps outlined below:

Option 1: Use a Publicly Available Dataset/model

We have created a dataset that includes the required packages and fine-tuned model. You can access it directly using this [link].

Option 2: Create Your Own Dataset

If you prefer, you can create a custom dataset with the fine-tuned model and tokenizer files. Follow these steps to create your own dataset:

1. Run Your Notebook:
    - Execute your notebook to save the fine-tuned model and tokenizer into your working directory. Make sure to click on “Save Version” to capture the output.
2. Save the Output:
    - After running the notebook, navigate to the “Dataset” tab in the Kaggle interface.
3. Create a New Dataset:
    - Click on “New Dataset”.
    - Select “Notebook Output Files” as the data source.
    - Choose the notebook you ran earlier. This will include the directory where you saved the fine-tuned model and tokenizer.
    - Provide a name and description for your dataset.
    - Complete the creation process by following the on-screen instructions. You can even keep it automatically in sync with your notebook if you’d like to add further packages later on.

By following these steps, you will have a dataset containing the fine-tuned model and tokenizer, enabling you to use them for inference without requiring internet access during the competition.
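
Once the dataset is attached to an inference notebook, the saved files can be loaded on top of the base model. The following is a minimal sketch, assuming that the saved directory contains the LoRA adapter and tokenizer written in section 6.3 and that the original Llama 3 model is attached as well. The function name `load_finetuned_model` and both path arguments are hypothetical placeholders. The function is only defined here and not called, since it needs a GPU and the model files.

```python
def load_finetuned_model(base_model_path, adapter_path):
    """Load the 4-bit base model and attach the saved LoRA adapter (illustrative sketch)."""
    # Imported inside the function so that defining it does not require GPU libraries
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Same quantization settings as used for fine-tuning
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Wrap the base model with the adapter weights saved by trainer.model.save_pretrained
    model = PeftModel.from_pretrained(base_model, adapter_path)
    model.eval()
    # The tokenizer (including its chat template) was saved next to the adapter
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    return model, tokenizer
```

In an inference notebook this would be called with the path of the attached Llama 3 model and the path of the dataset folder that holds the adapter, and the returned model and tokenizer can then be passed to a text-generation pipeline as in section 5.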

# Closing Remarks

This notebook provides a basic demonstration of how to fine-tune the Llama 3 model on the ARC dataset using Kaggle hardware. It’s important to note that this solution has not been optimized or iterated upon and is meant primarily to showcase the steps involved in fine-tuning an LLM and preparing it for inference.

The methods and configurations used here are quite straightforward and do not incorporate advanced techniques that could significantly improve performance. For example, chain-of-thought prompting, more sophisticated data augmentation, and extensive hyperparameter tuning were not employed in this demonstration.

For those interested in state-of-the-art (SOTA) performance on the ARC dataset using a fine-tuned LLM, we highly recommend exploring the work of Jack Cole. He has achieved SOTA results by using a much larger dataset and more advanced techniques, demonstrating the potential of fine-tuned LLMs when more resources and sophisticated methods are applied.

While this notebook provides a starting point, achieving high performance on tasks like those in the ARC dataset typically requires a more thorough and nuanced approach. We encourage you to experiment further, iterate on these methods, and explore more advanced techniques to improve your model’s performance.

By following these steps and considerations, you can better understand the process and potential of fine-tuning large language models for specific tasks. Happy experimenting!