# Finetuning a model

Having [created a dataset](./01-generating-a-dataset.ipynb), the next step is to fine tune a model.

The goal of this notebook is to fine tune the text output of an open source multimodal model locally using the previously generated dataset.

## Prerequisites

Ensure you have a [Huggingface](https://huggingface.co) account.

Create a [token](https://huggingface.co/docs/hub/en/security-tokens) with at least the following permissions:
- Read access to contents of all repos under your personal namespace
- Read access to contents of all public gated repos you can access
- Write access to contents/settings of all repos under your personal namespace
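
The cells below read the token from an `HF_TOKEN` environment variable. As a minimal sketch, a small guard can fail early with a readable message when that variable is missing. The helper name `read_hf_token` is hypothetical and not part of `huggingface_hub`:

```python
import os


def read_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it in your shell before starting the notebook."
        )
    return token


# Example with an explicit mapping, so nothing depends on the real environment.
print(read_hf_token({"HF_TOKEN": "hf_example_token"}))
```

In the notebook itself, the result of `read_hf_token()` could be passed to `login(...)` in place of the direct `os.environ` lookup.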

## Load the dataset

If you pushed your dataset to Huggingface and wish to load it from there, follow the [instructions below](#if-load-from-huggingface).

If your `.jsonl` file is local, follow the [other path](#else-load-from-local).

### IF: load from Huggingface

In [1]:
import os

from huggingface_hub import login

login(os.environ["HF_TOKEN"])

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [2]:
from datasets import load_dataset

dataset_id = "charlesLoder/northwestern-metadata"
dataset = load_dataset(dataset_id, split="train")

In [3]:
import json

print(json.dumps(dataset[0], indent=2))

{
  "prompt": "Describe this image.",
  "completion": "Sandy Paton playing guitar at Creed's Books in Berkeley, California"
}


### ELSE: load from local

In [4]:
import json
from datasets import load_dataset  # repeated so this path runs without the cells above

dataset = load_dataset("json", data_files="./outputs.jsonl", split="train")
print(json.dumps(dataset[0], indent=2))

{
  "prompt": "Describe this image.",
  "completion": "Sandy Paton playing guitar at Creed's Books in Berkeley, California"
}
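
Before training on a local file, it can help to confirm that every record has the two fields shown above. The following is a minimal sketch using only the standard library; the function name `validate_records` and the sample lines are illustrative assumptions, not part of the `datasets` API:

```python
import json


def validate_records(lines):
    """Parse JSONL lines and check that each record has string `prompt` and `completion` fields."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for key in ("prompt", "completion"):
            if not isinstance(record.get(key), str):
                raise ValueError(f"line {number}: missing or non-string field {key!r}")
        records.append(record)
    return records


# Made-up sample lines standing in for the contents of a .jsonl file.
sample = [
    '{"prompt": "Describe this image.", "completion": "A person playing guitar."}',
    "",
]
print(len(validate_records(sample)))
```

With a real file, the open file handle can be passed directly, for example `validate_records(open("./outputs.jsonl"))`.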


## Fine tuning

Fine tuning a model can be tricky as there are a number of different factors that can be unique for each task, including:
- the shape of the dataset
- the requirements of the model
- the user's hardware (this notebook was created on a Mac M2)

### Loading the model

We are going to use the [SmolVLM](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model because it is powerful, small, and open source.

Throughout the course of this process, we will reference the model card linked above.

In [5]:
import torch
from transformers import AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    device_map="auto",  # Let the `accelerate` library handle device placement
    torch_dtype=torch.bfloat16, # see `Tensor type` in model card
)

### Loading the processor

The processor will handle the tokenization of our data.

In [6]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_id)

### Excursus: data, templates, and tokenization

Before going deeper into the fine tuning process, it is important to understand how our data, model, and processor all work together.

#### Data

Our data is in an "instruction" based format:

In [7]:
print(json.dumps(dataset[0], indent=2))

{
  "prompt": "Describe this image.",
  "completion": "Sandy Paton playing guitar at Creed's Books in Berkeley, California"
}


Though this format is convenient for later use with fine tuning, it is not usable by the model.

#### Templates

Chat models expect their input to follow a particular layout, which is defined by a "chat template."

The template can be found on the model card, but we can also access it on the processor.

In [8]:
print(processor.tokenizer.chat_template)

<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}


Structured, it looks like this:

```jinja
<|im_start|>
{% for message in messages %}
    {{message['role'] | capitalize}}
    {% if message['content'][0]['type'] == 'image' %}
        {{':'}}
    {% else %}
        {{': '}}
    {% endif %}
    {% for line in message['content'] %}
        {% if line['type'] == 'text' %}
            {{line['text']}}
        {% elif line['type'] == 'image' %}
            {{ '<image>' }}
        {% endif %}
    {% endfor %}
    <end_of_utterance>
{% endfor %}
{% if add_generation_prompt %}
    {{ 'Assistant:' }}
{% endif %}
```

What this means is that the tokenizer expects incoming data to be shaped like this:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is the capital of France?"
            }
        ]
    },
]
```

With `add_generation_prompt=True`, it will render that data into the template's text format (similar in spirit to ChatML):

```txt
<|im_start|>User: What is the capital of France?<end_of_utterance>
Assistant:
```
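
To make the template logic easier to follow, here is a plain-Python approximation of it. This is a sketch for reading purposes only: the function name `render_chat` is made up, and the real rendering is done by the Jinja template stored on the tokenizer.

```python
def render_chat(messages, add_generation_prompt=False):
    """Plain-Python approximation of the chat template shown above."""
    out = "<|im_start|>"
    for message in messages:
        out += message["role"].capitalize()
        # The template omits the space after the colon when the first part is an image.
        out += ":" if message["content"][0]["type"] == "image" else ": "
        for part in message["content"]:
            if part["type"] == "text":
                out += part["text"]
            elif part["type"] == "image":
                out += "<image>"
        out += "<end_of_utterance>\n"
    if add_generation_prompt:
        out += "Assistant:"
    return out


messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "What is the capital of France?"}],
    },
]
print(render_chat(messages, add_generation_prompt=True))
```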

Let's look at it in action.

In [9]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Who was the first president of the United States of America?"
            }
        ]
    },
]

print(processor.tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True, # creates "Assistant" role to continue the conversation
    tokenize=False,

))

<|im_start|>User: Who was the first president of the United States of America?<end_of_utterance>
Assistant:


Our dataset records do not have this shape, so applying the template to one of them fails.

In [10]:
print(processor.tokenizer.apply_chat_template(
    dataset[0],
    add_generation_prompt=True, # creates "Assistant" role to continue the conversation
    tokenize=False,
))

UndefinedError: 'str object' has no attribute 'content'

So we need to transform it:

In [11]:
example = dataset[0]

new_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": example["prompt"]
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": example["completion"]
            }
        ]
    }
]

print(processor.tokenizer.apply_chat_template(
    new_messages,
    tokenize=False,
))

<|im_start|>User: Describe this image.<end_of_utterance>
Assistant: Sandy Paton playing guitar at Creed's Books in Berkeley, California<end_of_utterance>
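
The same transformation can be wrapped in a small reusable function. This is a minimal sketch; the name `to_messages` is a hypothetical helper, not something provided by `transformers` or `trl`:

```python
def to_messages(example):
    """Convert one prompt/completion record into the role/content message list."""
    return [
        {"role": "user", "content": [{"type": "text", "text": example["prompt"]}]},
        {"role": "assistant", "content": [{"type": "text", "text": example["completion"]}]},
    ]


example = {
    "prompt": "Describe this image.",
    "completion": "Sandy Paton playing guitar at Creed's Books in Berkeley, California",
}
messages = to_messages(example)
print(messages[1]["content"][0]["text"])
```

Passing `to_messages(dataset[0])` to `apply_chat_template` should give the same rendered text as the cell above.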



#### Tokenization

Models, however, don't take text as input. The text must first be tokenized (converted to a numerical representation) before the model can use it.

In [12]:
print(processor.tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True, # creates "Assistant" role to continue the conversation
    tokenize=True,

))

[1, 11126, 42, 6244, 436, 260, 808, 5165, 282, 260, 1797, 1918, 282, 2493, 47, 49154, 198, 9519, 9531, 42]
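
To make the idea concrete, here is a toy sketch of the text-to-ids round trip with a tiny word-level vocabulary. It is an illustration only: the vocabulary and the `encode`/`decode` helpers are made up, and the real tokenizer uses a learned subword vocabulary, so its ids differ from these.

```python
# Toy word-level vocabulary; real tokenizers learn subword vocabularies from data.
vocab = {"<unk>": 0, "who": 1, "was": 2, "the": 3, "first": 4, "president": 5}
id_to_token = {index: token for token, index in vocab.items()}


def encode(text):
    """Map each lowercased, whitespace-separated word to its id, or to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids):
    """Map ids back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode("Who was the first president")
print(ids)          # [1, 2, 3, 4, 5]
print(decode(ids))  # who was the first president
```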


It is this tokenized text that can be used to chat with a model.

In [13]:
prompt = processor.tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False, # Tokenization is done in the processor
)

inputs = processor(
    text=prompt,
    images=None,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_length=256)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

User: Who was the first president of the United States of America?
Assistant: George Washington


During the fine tuning process, Huggingface's library will handle the data shape transformation, the template application, and the tokenization.

### PEFT Model

We will use Huggingface's [PEFT library](https://huggingface.co/docs/peft/en/index) for fine tuning a small portion of the model's parameters.

This makes fine tuning more efficient, enabling the process to run on local machines.

In [14]:
from peft import LoraConfig, get_peft_model

# Configure LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# Apply PEFT model adaptation
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters
peft_model.print_trainable_parameters()

trainable params: 10,536,960 || all params: 2,256,809,840 || trainable%: 0.4669


Note that only a small portion of the parameters (about 0.47%) is being trained.
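
The small fraction follows from how LoRA works: instead of updating a full `d_out x d_in` weight matrix, it trains two low-rank matrices of shapes `r x d_in` and `d_out x r`. Here is a rough sketch of that arithmetic. The 2048 x 2048 layer size is a hypothetical example, not a value taken from the model card, while the totals come from the output printed above:

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def trainable_percent(trainable, total):
    """Share of trainable parameters, as a percentage."""
    return 100 * trainable / total


# Hypothetical 2048 x 2048 projection with the rank used above (r=8).
full = 2048 * 2048
added = lora_param_count(2048, 2048, r=8)
print(full, added)  # 4194304 32768

# Totals reported by print_trainable_parameters() above.
print(round(trainable_percent(10_536_960, 2_256_809_840), 4))  # 0.4669
```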

### Training

Using the [Transformer Reinforcement Learning](https://huggingface.co/docs/trl/v0.16.1/en/index) library from Huggingface,
we first create the `SFTConfig` (Supervised Fine-Tuning).

The parameters below are set to commonly used values, except for:

- `output_dir`: where it will save the output locally
- `push_to_hub`: whether to push it to the Huggingface Hub (requires a Huggingface account and a token with write access)
- `report_to`: where to report training logs (e.g. `"wandb"` or `"tensorboard"`; `"none"` disables reporting)

In [15]:
from trl import SFTConfig

# Configure training arguments using SFTConfig
training_args = SFTConfig(
    output_dir="SmolVLM-Instruct-library-metadata",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=25,
    save_total_limit=1,
    optim="adamw_torch_fused",
    bf16=True,
    push_to_hub=True,
    report_to="none",
)

We now put it all together!

In [16]:
from trl import SFTTrainer


trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    # because the dataset is in the instruction based format, we can pass it in directly
    train_dataset=dataset,
    peft_config=None, # not needed because we are passing in a peft_model
    processing_class=processor.tokenizer,
)

Map: 100%|██████████| 4906/4906 [00:00<00:00, 62678.02 examples/s]
Converting train dataset to ChatML: 100%|██████████| 4906/4906 [00:00<00:00, 124199.54 examples/s]
Applying chat template to train dataset: 100%|██████████| 4906/4906 [00:00<00:00, 127384.95 examples/s]
Tokenizing train dataset: 100%|██████████| 4906/4906 [00:00<00:00, 10602.13 examples/s]
Truncating train dataset: 100%|██████████| 4906/4906 [00:00<00:00, 1158172.76 examples/s]
No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [17]:
trainer.train()

Step,Training Loss
25,4.4457
50,3.9415
75,3.0472
100,2.3622
125,2.2192
150,2.0709
175,2.0283
200,1.8294
225,1.8958
250,1.8203


TrainOutput(global_step=306, training_loss=2.4348123307321585, metrics={'train_runtime': 2214.5243, 'train_samples_per_second': 2.215, 'train_steps_per_second': 0.138, 'total_flos': 3472913434710912.0, 'train_loss': 2.4348123307321585})
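
The `global_step=306` in the output can be related to the training configuration. Below is a rough sketch of the arithmetic, assuming a single device and that the trainer rounds the number of optimizer steps per epoch down (the exact rounding may differ between library versions):

```python
import math

num_examples = 4906            # size of the train split, from the progress bars above
per_device_batch_size = 4      # per_device_train_batch_size
grad_accum_steps = 4           # gradient_accumulation_steps
num_epochs = 1                 # num_train_epochs

# Number of mini-batches the dataloader yields in one epoch.
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# One optimizer step happens every `grad_accum_steps` mini-batches.
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)

effective_batch_size = per_device_batch_size * grad_accum_steps
total_steps = steps_per_epoch * num_epochs

print(effective_batch_size)  # 16
print(total_steps)           # 306
```

Each optimizer step therefore sees 16 examples, and with `warmup_steps=50` roughly the first sixth of the run is spent ramping up the learning rate.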

In [19]:
trainer.save_model(training_args.output_dir)