# NOTE: WORK IN PROGRESS

This is a work in progress and is under active development! The latest version worked quite poorly. I think it's because I split the chat-formatted inputs on the `[INST]` tag *before* tokenizing. The consequence was that the tokenizer added the beginning/end of sequence tokens at the beginnings/endings of both the instruction and the output. I suspect this led to undesirable results.


# Introduction

The [TinyLlama](https://github.com/jzhang38/TinyLlama) project "aims to pretrain a 1.1B Llama model on 3 trillion tokens." 1.1B parameters represents a considerable step up from the small GPT model we [previously fine-tuned](../2_gpt2_single_gpu/2.%20GPT2%20on%20a%20single%20GPU.ipynb). That model had 124M parameters; TinyLlama, while still small by the standards of most widely-used LLMs, is almost ten times the size. We will need around 20GB VRAM at a bare minimum to fine-tune this model.
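
The 20GB figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes full fine-tuning in fp32 with AdamW (4 bytes per parameter each for weights and gradients, 8 bytes for the two optimizer moments). These byte counts are assumptions, and activations, the CUDA context, and memory fragmentation are ignored.

```python
# Rough VRAM estimate for full fine-tuning; all byte counts are assumptions.
N_PARAMS = 1.1e9  # TinyLlama parameter count

BYTES_PER_PARAM = {
    "weights (fp32)": 4,
    "gradients (fp32)": 4,
    "AdamW moments (2 x fp32)": 8,
}

def estimate_training_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Return the estimated memory in GB (1 GB = 1e9 bytes), excluding activations."""
    return n_params * sum(bytes_per_param.values()) / 1e9

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>26}: {N_PARAMS * nbytes / 1e9:5.1f} GB")
print(f"{'total':>26}: {estimate_training_gb(N_PARAMS):5.1f} GB")
```

Under these assumptions the total is about 17.6 GB before any activations are stored, which is why 20GB is a floor rather than a comfortable budget.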

## Instruction Tuning
We are going to focus on instruction tuning in this example. Instruction Tuning is a supervised learning technique in which we train the model on instruction/output pairs with the goal of training the model to follow human instructions. Before instruction tuning, the base model is trained on next-token completion. We saw this in the GPT2 example: we provided the start of a story and the model completed it. An instruction-tuned model, on the other hand, is trained to answer a question or instruction.

[This repository](https://github.com/xiaoya-li/Instruction-Tuning-Survey) contains a wealth of information on the current state of the field of instruction tuning.

The [TinyLlama repository](https://github.com/jzhang38/TinyLlama/tree/main/sft) includes scripts for fine-tuning. While these will be useful references, we will try to proceed with an approach similar to that used in the gpt2 and t5-small notebooks, purely so that this notebook is a reasonable next learning step after those.

# The Data
We will use the [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset. This is a curated subset of the much larger [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) dataset. Why this dataset? It's one of the most popular sources of instruction data on Hugging Face, and its size is more manageable than the full OpenOrca dataset. That's all!

# 1. Load the model and try some examples

We'll begin, as always, by loading the model and trying out some examples.

In [None]:
%pip install --upgrade -r ./tinyllama_requirements.txt

In [None]:
# Some Environment Setup
OUTPUT_DIR = "../results/TinyLlama/" # the path to the output directory; where model checkpoints will be saved
LOG_DIR = "../logs/TinyLlama/" # the path to the log directory; where logs will be saved
CACHE_DIR = "../cache/TinyLlama/" # the path to the cache directory; where cache files will be saved

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

model_ckpt = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

tokenizer = AutoTokenizer.from_pretrained(
    model_ckpt,
)

tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    device_map="auto",
)

# Inference
def generate(prompt, max_new_tokens=100):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    gen_tokens = model.generate(input_ids, max_new_tokens=max_new_tokens,
                                eos_token_id=tokenizer.eos_token_id,
                                repetition_penalty=1.1)
    return tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0]

print(generate("Here are step-by-step instructions to make a great cup of coffee with a Chemex coffee maker:\n1."))

In this example, we structured our prompt with completion in mind: we wrote the first part of the full text and asked the model to complete it. What happens if, instead, we ask a question or give an instruction?

In [None]:
# Question
print(generate("How do I make coffee with a Chemex coffee maker?"))

In [None]:
# Instruction
print(generate("Tell me how to make coffee with a Chemex coffee maker."))

These did not work because the model has not been instruction tuned. Our task is to change that!

# 2. Getting and Exploring the Data

In [None]:
from datasets import load_dataset
from pathlib import Path

slimorca = load_dataset('Open-Orca/SlimOrca',
                           cache_dir=str(Path(CACHE_DIR) / "data"))


In [None]:
import json
print(json.dumps(slimorca["train"][0], indent=4))

You'll see that there are three components to the sample entry:
1. A *system message*: this should be familiar if you've used e.g. ChatGPT via the API. It sets the model's role/identity and gives general instructions.
2. A *human message*: this is the specific instruction passed to the model by a human.
3. A *gpt message*: this is the AI model's response.

So we want to use this dataset to fine-tune the model such that it will respond more like the *gpt* message when given the *system* and *human* messages.
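
To make the structure concrete, the sketch below groups a conversation's messages by speaker. The `conversations`, `from`, and `value` field names match the dataset; the message text is a made-up placeholder rather than a real SlimOrca row, and `messages_by_speaker` is a hypothetical helper.

```python
# A hypothetical entry with the same shape as a SlimOrca row (text is invented).
example = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "Name one way to brew coffee."},
        {"from": "gpt", "value": "One option is a pour-over brewer such as a Chemex."},
    ]
}

def messages_by_speaker(entry):
    """Map each speaker tag ('system', 'human', 'gpt') to a list of its messages."""
    grouped = {}
    for message in entry["conversations"]:
        grouped.setdefault(message["from"], []).append(message["value"])
    return grouped

grouped = messages_by_speaker(example)
for speaker, texts in grouped.items():
    print(f"{speaker}: {texts}")
```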

# 3. Formatting the Data

First, we need to get these entries into a format we can actually use for fine-tuning. [Appendix A](#Appendix-A:-Looking-at-the-tinyllama-fine-tuning-code) digs into the fine tuning code from the TinyLlama repo to see how the authors handled data formatting. We're going to take a slightly different approach and use the [chat model templates](https://huggingface.co/docs/transformers/main/en/chat_templating#templates-for-chat-models) from the Transformers library. The Hugging Face docs recommend applying the chat templates as a preprocessing step. Let's take a look at how they work.

## Transformers Chat Templates
Chat Templates are attributes of tokenizers. If a chat template isn't set explicitly, the default template for that model class is used. Let's see if there is a chat template set here.


In [None]:
print(tokenizer.chat_template)
print(tokenizer.default_chat_template)

There is no chat template defined for this tokenizer, so we'll use the default template for the `LlamaTokenizerFast` class. To populate this template, we apply it to a conversation in the standard chat format.

In [None]:
chat = [
    {"role": "system", "content": "You are a helpful assistant and an expert at making coffee."},
    {"role": "user", "content": "How do I make coffee with a Chemex coffee maker?"},
    {"role": "assistant", "content": "To make coffee with a Chemex:\n1. Boil water to about 200°F (93°C).\n2. Place the Chemex filter in the top and rinse it with hot water to remove paper taste and warm the vessel. Discard the rinse water.\n3. Add coffee grounds to the filter. Use a medium-coarse grind, about 1 gram of coffee per 16 grams of water.\n4. Pour just enough hot water to saturate the grounds. Wait 30 seconds for the coffee to 'bloom'.\n5. Slowly pour the remaining water over the grounds in a circular motion. Aim for a total brew time of 3.5 to 4.5 minutes.\n6. Once brewing is complete, remove the filter and enjoy."}
]

In [None]:
print(tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False))
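
To see what the template is doing, the sketch below rebuilds a Llama-2-style prompt by hand with plain string formatting. It is an approximation written for illustration, not the tokenizer's actual template: `render_llama2_style` is a hypothetical helper, it leaves out the beginning/end of sequence tokens, and the exact whitespace of the real template may differ.

```python
def render_llama2_style(chat):
    """Render a chat as Llama-2-style text, omitting BOS/EOS tokens."""
    messages = list(chat)
    system_block = ""
    if messages and messages[0]["role"] == "system":
        # The system message is folded into the first user turn.
        system_block = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]
    rendered = []
    for i, message in enumerate(messages):
        content = message["content"].strip()
        if message["role"] == "user":
            prefix = system_block if i == 0 else ""
            rendered.append(f"[INST] {prefix}{content} [/INST]")
        else:  # assistant turn
            rendered.append(f" {content} ")
    return "".join(rendered)

demo_chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I make coffee?"},
    {"role": "assistant", "content": "Start by boiling water."},
]
text = render_llama2_style(demo_chat)
print(text)
```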

And because we're trying to train on input/output pairs (following the TinyLlama fine tuning code examples), we'll split this at the `[/INST]` and have the input as everything up to and including the `[/INST]` and the output as everything after (including the space!). Let's write a method to do that.

In [None]:
import torch

# configure the model and tokenizer with chat tokens
# Add the instruction tokens to the tokenizer
special_tokens = ["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>"]
# Adding special tokens to the tokenizer
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
# Update the model's embeddings accordingly
model.resize_token_embeddings(len(tokenizer))

def format_slimorca(ex, tokenizer, input_max_length=512, output_max_length=512):
    role_mapping = {"gpt": "assistant", "system": "system", "human": "user"}
    chat = [
        {"role": role_mapping[message["from"]], "content": message["value"]}
        for message in ex["conversations"]
    ]
    fmt_chat = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=False,
    )
    inst_token_id = tokenizer.encode("[/INST]")[1]
    split_index = fmt_chat.index(inst_token_id) + 1
    input_ids = fmt_chat[:split_index]
    output_ids = fmt_chat[split_index:]

    # Apply separate padding/truncation for input and output
    input_ids = torch.tensor(input_ids[:input_max_length] + [tokenizer.pad_token_id] * max(0, input_max_length - len(input_ids)))
    output_ids = torch.tensor(output_ids[:output_max_length] + [tokenizer.pad_token_id] * max(0, output_max_length - len(output_ids)))

    return input_ids, output_ids

# Map to the dataset, formatting each example once and reusing both halves
def tokenize_example(ex):
    input_ids, output_ids = format_slimorca(ex, tokenizer)
    return {"input_ids": input_ids, "labels": output_ids}

slimorca_tokenized = slimorca.map(
    tokenize_example, num_proc=32
).remove_columns("conversations")


In [None]:
slimorca_tokenized

In [None]:
for i in range(3):
    print("Length of input_ids:", len(slimorca_tokenized["train"][i]['input_ids']))
    print("Length of labels:", len(slimorca_tokenized["train"][i]['labels']))


In [None]:
# Inspect one example
# Print both halves; a notebook cell only displays its last expression
print(tokenizer.decode(slimorca_tokenized["train"][25]['input_ids']))
print(tokenizer.decode(slimorca_tokenized["train"][25]['labels']))

In [None]:
from datasets import DatasetDict

# Split the tokenized dataset into training and validation sets
slimorca_tokenized_split = slimorca_tokenized['train'].train_test_split(test_size=0.1)

# Format the split datasets into a DatasetDict for compatibility with Hugging Face's Trainer
slimorca_tokenized_split = DatasetDict(
    {
        "train": slimorca_tokenized_split["train"],
        "valid": slimorca_tokenized_split["test"],
    }
)

slimorca_tokenized_split

Now, as in the [gpt2 example](../2_gpt2_single_gpu/2.%20GPT2%20on%20a%20single%20GPU.ipynb), we will configure a *collator*. We can take some inspiration from the [collator defined in the TinyLlama fine-tuning script](https://github.com/jzhang38/TinyLlama/blob/11a02ce085c1670bd009e6d4385701ff06a7f6cf/sft/finetune.py#L252C19-L252C19).

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)
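
For comparison, a common supervised fine-tuning pattern is to concatenate the prompt and response into one sequence and mask the prompt positions in the labels, so that the loss is computed only on the response. The sketch below shows that idea with plain Python lists. It is not the collator used in this notebook: `build_sft_example` is a hypothetical helper, the token ids are invented, and `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy loss

def build_sft_example(prompt_ids, response_ids, max_length, pad_token_id):
    """Concatenate prompt and response, mask the prompt in the labels, then pad."""
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    input_ids = input_ids + [pad_token_id] * n_pad
    labels = labels + [IGNORE_INDEX] * n_pad  # padding never contributes to the loss
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Toy ids, invented for illustration: 1 = BOS, 2 = EOS, 0 = a dedicated pad id.
sample = build_sft_example([1, 11, 12, 13], [21, 22, 2], max_length=10, pad_token_id=0)
print(sample["input_ids"])
print(sample["labels"])
```

By contrast, `DataCollatorForLanguageModeling` with `mlm=False` derives the labels from `input_ids` and sets padding positions to `-100`, so with `pad_token = eos_token` the end-of-sequence positions are masked as well.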

# 4. Fine-tune the model

In [None]:
from transformers import TrainingArguments, Trainer
import mlflow

# Define the training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=32, 
    per_device_eval_batch_size=4,
    auto_find_batch_size=True, 
    warmup_steps=1,
    weight_decay=0.01,
    logging_dir=LOG_DIR,
    logging_steps=25,  # Log every 25 steps
    evaluation_strategy="steps",  # Evaluate every 'eval_steps'
    eval_steps=5000,
    bf16=True,
    #fp16=True,
    gradient_accumulation_steps=4,
    gradient_checkpointing=False,
    #optim="adamw_bnb_8bit",
    save_steps=10000
)

training_args.set_logging(report_to=["mlflow"],
                          steps=50,
                          level="info")


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=slimorca_tokenized_split["train"],
    eval_dataset=slimorca_tokenized_split["valid"],
    data_collator=data_collator,
)

# Start training and track with MLflow
with mlflow.start_run(log_system_metrics=True):
    trainer.train()
    mlflow.log_params(training_args.to_dict())

trainer.save_model()

# 5. Load the Fine-Tuned Model Checkpoint and Run some Examples

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os
import re


def load_latest_checkpoint(output_dir = OUTPUT_DIR,
                           default_tokenizer="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"):
    checkpoint_dir = max(
        [d for d in next(os.walk(output_dir))[1] if re.match(r"checkpoint-\d+", d)],
        key=lambda d: int(d.split("-")[-1]),
    )
    path = os.path.join(output_dir, checkpoint_dir)
    model = AutoModelForCausalLM.from_pretrained(path,
                                                  device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(
        path
        if os.path.exists(os.path.join(path, "tokenizer_config.json"))
        else default_tokenizer
    )

    return model, tokenizer

In [None]:
model_ft, tokenizer_ft = load_latest_checkpoint()


In [None]:
prompt = [
    {"role": "system", "content": "You are a helpful assistant and an expert at making coffee."},
    {"role": "user", "content": "How do I make coffee with a Chemex coffee maker?"},
]
prompt = tokenizer_ft.apply_chat_template(prompt,
                                          tokenize=False, add_generation_prompt=False)
prompt

# 6. Testing the Fine-Tuned Model

The first version of this performed very badly. For example:

```
[INST] <<SYS>>
You are a helpful assistant and an expert at making coffee.
<</SYS>>

How do I make coffee with a Chemex coffee maker? [/INST]

How can I make coffee with a Chemex coffee maker? [/INST]

Please tell me if those questions are the same.
Choose from:
(a). no;
(b). yes; [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST]
```

## Retraining Changes
1. add the instruction-related tokens as special tokens to the tokenizer
2. increase the context size to 512 in / 512 out (from 512/256)
3. increase gradient_accumulation_steps to 4
4. add auto_find_batch_size=True
5. change attention to flashattention2
6. use regular adamw.
7. use bf16 instead of fp16

## Questions
1. What enabled me to use regular adamw instead of adamw_bnb_8bit? Was it loading the model in bf16? Did flashattention2 make that big of a difference? Something else?
2. How can I determine the batch size auto_find_batch_size landed on?
3. I actually still have some vram to work with. How can I make the most use of it?
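
One indirect way to approach question 2 is to work backwards from the total number of optimizer steps in the run. The sketch below assumes a single GPU, one epoch, and roughly 518,000 rows in SlimOrca before the 90/10 split (an assumption; check `len(slimorca["train"])`). The 29,000-step figure is the run length mentioned in Appendix B, and `implied_per_device_batch_size` is a hypothetical helper.

```python
def implied_per_device_batch_size(n_train_examples, total_steps, grad_accum_steps, n_gpus=1):
    """Estimate the per-device batch size from the number of optimizer steps in one epoch."""
    examples_per_step = n_train_examples / total_steps
    return examples_per_step / (grad_accum_steps * n_gpus)

N_ROWS = 518_000             # assumed approximate SlimOrca size; verify locally
n_train = int(N_ROWS * 0.9)  # train_test_split(test_size=0.1)
estimate = implied_per_device_batch_size(n_train, total_steps=29_000, grad_accum_steps=4)
print(f"Implied per-device batch size: {estimate:.2f}")
```

If that run used the settings shown above, the estimate comes out close to 4. That would suggest `auto_find_batch_size` backed off from the requested 32, for an effective batch size of about 16 per optimizer step.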

Perhaps most importantly...we're going to try out the checkpoints along the way instead of training for more than a full day before testing the model.

## Testing the model after the second fine-tuning attempt

I tried out one of the model checkpoints during the fine-tuning process and it was performing better than the end result of the first fine-tuning attempt. It still didn't know *how* to make coffee with a Chemex coffee maker, but, well, it's a small model. What's important is that it took a question for a prompt and it responded with a more-or-less coherent answer.

However—and here's another lesson learned about the perils of running everything from notebooks—I initialized the training run by running a whole notebook. A notebook that happened to have a cleanup script intended to delete intermediary checkpoints. So, yeah, I finished a 14.5 hour training run and deleted the results.
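
A cleanup step is safer when it only prunes old checkpoints and can be previewed before it deletes anything. The sketch below is one hypothetical way to write that with the standard library: `prune_checkpoints` and its `keep_last` and `dry_run` parameters are illustrative names, not part of this repo. `TrainingArguments` also has a `save_total_limit` option that caps how many checkpoints the trainer keeps.

```python
import re
import shutil
import tempfile
from pathlib import Path

def prune_checkpoints(output_dir, keep_last=2, dry_run=True):
    """Delete all but the newest `keep_last` checkpoint-N directories.

    Returns the paths that were (or, with dry_run=True, would be) deleted.
    """
    checkpoints = sorted(
        (d for d in Path(output_dir).iterdir()
         if d.is_dir() and re.fullmatch(r"checkpoint-\d+", d.name)),
        key=lambda d: int(d.name.split("-")[-1]),  # sort by step number, not by string
    )
    to_delete = checkpoints[:-keep_last] if keep_last > 0 else checkpoints
    if not dry_run:
        for path in to_delete:
            shutil.rmtree(path)
    return to_delete

# Demo on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    for step in (5000, 10000, 20000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    preview = [p.name for p in prune_checkpoints(tmp, keep_last=1)]
    print("Would delete:", preview)
    prune_checkpoints(tmp, keep_last=1, dry_run=False)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print("Remaining:", remaining)
```

Defaulting `dry_run` to `True` means that running the whole notebook by accident only lists what would be removed.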

On to round 3.

# 7. Next Steps

This fine-tuning process pushed the limits of what we could accomplish on a single GPU. And it makes sense: our back-of-the-envelope calculations said that we would require *at least* 20GB of VRAM, before we even think about storing activations or scaling sequence lengths or batch sizes.

We got around this in part by using a smaller sequence length than that shown in the tinyllama fine-tuning script. The biggest change we made was to use the `adamw_bnb_8bit` optimizer from the bitsandbytes library. The point is that we are running up against the limits of what we can reasonably accomplish with a single GPU, at least without more sophisticated approaches. So what's next? There are several directions we can pursue (and we can and should pursue them all):   
1. Try to further optimize training this model on a single GPU. What can we do to make the training process run faster and more effectively? Can we find an approach that will still let us use the normal `adamw` optimizer? Can we benefit from using e.g. [DeepSpeed ZeRO](https://huggingface.co/docs/transformers/perf_train_gpu_one#deepspeed-zero)?
2. Try to fine-tune this model on a multi-GPU setup. What are the benefits in terms of speed and ability to train larger batches and larger sequence lengths? And, perhaps more importantly in this setting, how do we make the leap from a single GPU to a multi-GPU setup?
3. Train a bigger model! So far we have fine-tuned t5-small, gpt2, and tinyllama, with each subsequent model larger than the last. We ultimately want to work our way up to even larger models, so after this, it might be time to train a 3B parameter model, and then a 7B parameter model!

# Appendix A: Looking at the tinyllama fine-tuning code

You can find the tinyllama fine-tuning code [here](https://github.com/jzhang38/TinyLlama/tree/main/sft). It's worth the time, at this phase of learning about fine-tuning, to read through it and learn about some of the approaches they use.

Let's first take a look at the [train()](https://github.com/jzhang38/TinyLlama/blob/11a02ce085c1670bd009e6d4385701ff06a7f6cf/sft/finetune.py#L492) function. It begins by using the [`HfArgumentParser`](https://huggingface.co/docs/transformers/v4.36.1/en/internal/trainer_utils#transformers.HfArgumentParser) to configure the training arguments. Earlier, the training code defined a number of `dataclass`es for e.g. training arguments, data arguments, etc. `HfArgumentParser` provides an approach for parsing command line arguments directly into instances of these dataclass types. So instead of simply defining arguments in notebook cells, as we've been doing, this approach provides a structured way to parse command line arguments. And, indeed, the repo provides a [shell script](https://github.com/jzhang38/TinyLlama/blob/main/sft/script.sh) for running the fine-tuning script with a defined set of arguments.
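
As a minimal sketch of that pattern, the example below parses a list of command-line style strings into two dataclasses. The dataclass names and fields are hypothetical stand-ins rather than the ones defined in the TinyLlama script, and passing `args` explicitly stands in for a real command line.

```python
from dataclasses import dataclass, field

from transformers import HfArgumentParser

@dataclass
class ModelArguments:
    # Hypothetical field for illustration; the TinyLlama script defines its own.
    model_name_or_path: str = field(
        default="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
    )

@dataclass
class DataArguments:
    # Hypothetical fields for illustration.
    source_max_len: int = field(default=512, metadata={"help": "Maximum prompt length in tokens."})
    target_max_len: int = field(default=512, metadata={"help": "Maximum response length in tokens."})

parser = HfArgumentParser((ModelArguments, DataArguments))
# Each flag is routed to whichever dataclass declares a field with that name.
model_args, data_args = parser.parse_args_into_dataclasses(
    args=["--source_max_len", "256"], look_for_args_file=False
)
print(model_args.model_name_or_path, data_args.source_max_len, data_args.target_max_len)
```

Fields that are not passed keep their defaults, so a shell script only needs to list the arguments it wants to override.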

The [next major section](https://github.com/jzhang38/TinyLlama/blob/11a02ce085c1670bd009e6d4385701ff06a7f6cf/sft/finetune.py#L514C3-L514C3) of the training script is focused on preparing the data, using the [`make_data_module`](https://github.com/jzhang38/TinyLlama/blob/11a02ce085c1670bd009e6d4385701ff06a7f6cf/sft/finetune.py#L354) method defined earlier in the script. That method is set up to handle a few different potential fine-tuning data sources (slimorca is not included among them). It maps each of them to the expected format: an input string and an output string.

I found the handling of alpaca-formatted datasets instructive. The [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca?row=0) dataset includes instructions and optional inputs that follow a specified format. For examples with inputs, the format is:

```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
```

The following code snippet in the TinyLlama repo handles this formatting (in the alpaca dataset, the inputs/outputs are not pre-formatted):

```python
ALPACA_PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response: "
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response: "
    ),
}

def extract_alpaca_dataset(example):
    if example.get("input", "") != "":
        prompt_format = ALPACA_PROMPT_DICT["prompt_input"]
    else:
        prompt_format = ALPACA_PROMPT_DICT["prompt_no_input"]
    return {'input': prompt_format.format(**example)}
```
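
To see both branches in action, the sketch below applies the same template logic to two invented records that follow the alpaca schema. The templates are copied from the snippet above so that the example runs on its own; `format_alpaca` is a hypothetical name for the condensed function.

```python
# Condensed copy of the prompt templates above so this example runs on its own.
PROMPT_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response: "
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response: "
)

def format_alpaca(example):
    """Pick the template based on whether the example has a non-empty input."""
    template = PROMPT_INPUT if example.get("input", "") != "" else PROMPT_NO_INPUT
    return {"input": template.format(**example)}  # extra keys such as "output" are ignored

# Invented records that follow the alpaca schema (instruction / input / output).
with_input = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
without_input = {"instruction": "Name a coffee brewing method.", "input": "", "output": "Pour-over"}

prompt_a = format_alpaca(with_input)["input"]
prompt_b = format_alpaca(without_input)["input"]
print(prompt_a)
print("-" * 40)
print(prompt_b)
```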

We're seeing some repeating patterns across training scripts (the examples so far in this repo and the TinyLlama code). Each fine-tuning run so far requires the following:
- process the data
- set up training arguments
- set up logging

An additional step, as we get to multi-GPU and multi-node setups, will be configuring devices and processes. See the [script.sh](https://github.com/jzhang38/TinyLlama/blob/main/sft/script.sh) shell script from TinyLlama for an example; it uses [accelerate launch](https://huggingface.co/docs/accelerate/basic_tutorials/launch), a helper command that makes it easier to launch training scripts on different hardware.

# Appendix B: Resuming from a Checkpoint

I made a few mistakes in terms of handling checkpoints. In one of those cases, I saved checkpoints and assumed a final model would also be saved. This was not the case. So I had a checkpoint at step 20,000 out of 29,000. In this case, instead of starting over, it made more sense to load the checkpoint and finish training. To do so with the Hugging Face trainer, we can:

1. Load the desired model checkpoint with e.g.
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/checkpoint-20000",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
```

Also make sure the tokenizer is loaded.
2. After configuring the trainer/training arguments as before, call `trainer.train` with the `resume_from_checkpoint` argument set to the desired checkpoint.
```python
trainer.train(resume_from_checkpoint="/path/to/checkpoint-20000")
```

The training will then pick up at step 20,000. And then you can make sure to save the final model!