# Mistral-Finetune: An Introduction!

In this notebook, we'll be exploring [`mistral-finetune`](https://github.com/mistralai/mistral-finetune), a tool from Mistral AI that, according to its README.md, enables "memory-efficient and performant" fine-tuning of Mistral's models!

It leverages LoRA, an industry staple, in order to achieve this goal.
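
As a rough illustration of why LoRA saves memory: rather than updating a full weight matrix, it trains two small low-rank matrices whose product is added to the frozen weight. The sketch below counts trainable parameters for a single projection. The 4096 x 4096 layer shape is an assumption chosen for illustration, and `lora_param_counts` is a hypothetical helper, not part of `mistral-finetune`.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out              # every entry of the frozen weight W
    lora = rank * (d_in + d_out)     # A is (rank x d_in), B is (d_out x rank)
    return full, lora


# Assumed layer shape for illustration; rank 64 matches the LORA_RANK set later.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=64)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumed shapes, the adapter trains only a small fraction of the parameters in the layer it wraps.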

Let's dive in and see what Mistral's new tool can do for us!

## Gathering Dependencies

First things first, we'll start by gathering the repository, and installing some dependencies!

In [None]:
!git clone https://github.com/mistralai/mistral-finetune.git

In [None]:
%cd mistral-finetune/

In [None]:
!pip install -qUr requirements.txt

> NOTE: You can safely ignore the dependency conflicts above.

## Downloading the Model

Next up, we're going to download Mistral 7B v0.3 from Mistral's CDN.

> NOTE: You may experience difficulty downloading the model in the Colab environment. Please retry the download if you see your download speeds crash, or you experience a disconnect.

In [None]:
!wget https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-v0.3.tar

Now we want to save our model in a directory called `mistral_models`. You can use whatever directory name you like, but be sure to change the later references to `mistral_models` as well!

In [None]:
!MODEL=/content/mistral_models && mkdir -p $MODEL && tar -xf mistral-7B-v0.3.tar -C $MODEL
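
Before moving on, it can be worth confirming that the archive unpacked as expected. The sketch below uses a hypothetical helper (`missing_model_files` is not part of any Mistral tooling). Apart from `tokenizer.model.v3`, which is referenced later in this notebook, the file names are assumptions about the archive's contents.

```python
from pathlib import Path


def missing_model_files(folder: str, expected: list[str]) -> list[str]:
    """Return the names in `expected` that are not present in `folder`."""
    root = Path(folder)
    return [name for name in expected if not (root / name).is_file()]


# Assumed contents; only tokenizer.model.v3 is confirmed by this notebook.
EXPECTED_FILES = ["consolidated.safetensors", "params.json", "tokenizer.model.v3"]

missing = missing_model_files("/content/mistral_models", EXPECTED_FILES)
print("Missing files:", missing or "none")
```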

## Data Collection and Verification

Next, we'll want to gather our data and modify it into the appropriate instruct format - as noted in [the repository](https://github.com/mistralai/mistral-finetune?tab=readme-ov-file#instruct).

In essence, `mistral-finetune` expects the instruction fine-tuning data to be in the following format:

```python
{
  "messages" : [
    {
      "role" : "system",
      "content" : "SYSTEM_PROMPT_1"
    },
    {
      "role" : "user",
      "content" : "USER_PROMPT_1"
    },
    {
      "role" : "assistant",
      "content" : "RESPONSE_1"
    }
  ]
}
{
  "messages" : [
    {
      "role" : "system",
      "content" : "SYSTEM_PROMPT_2"
    },
    {
      "role" : "user",
      "content" : "USER_PROMPT_2"
    },
    {
      "role" : "assistant",
      "content" : "RESPONSE_2"
    }
  ]
}
...
```

Notice that the format is `JSONL`!
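
To make the JSONL point concrete: each record is serialized onto exactly one line, even though the example above is pretty-printed for readability. A minimal sketch, using made-up placeholder records:

```python
import io
import json

# Placeholder conversations, mirroring the structure shown above.
records = [
    {"messages": [{"role": "user", "content": "USER_PROMPT_1"},
                  {"role": "assistant", "content": "RESPONSE_1"}]},
    {"messages": [{"role": "user", "content": "USER_PROMPT_2"},
                  {"role": "assistant", "content": "RESPONSE_2"}]},
]

# Write: one compact JSON object per line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read: parse each non-empty line independently.
parsed = [json.loads(line) for line in buffer.getvalue().splitlines() if line]
print(len(parsed), "records round-tripped")
```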

We're going to be leveraging a dataset featured in [LIMIT: Less Is More for Instruction Tuning](https://www.databricks.com/blog/limit-less-more-instruction-tuning), specifically `Instruct-v1`, aka `dolly_hhrlhf`!

> NOTE: This dataset will require you to accept terms of use - please navigate to [this link](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) if you have not already done so.

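We'll start by creating a data directory at `/content/data`, the path our later cells write to. We'll stay in the `mistral-finetune` directory so that the repository's `utils` and `train` modules remain importable.

In [None]:
!mkdir -p /content/data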

We're going to grab a few dependencies here for our dataset!

In [None]:
!pip install -qU datasets huggingface-hub

Let's log in to Hugging Face with a READ token.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Now we can download our data!

In [None]:
from datasets import load_dataset

dataset = load_dataset("mosaicml/dolly_hhrlhf")

Let's take a peek at our dataset to see what kind of shape it's in!

In [None]:
dataset

In [None]:
dataset["train"][0]

As we can see, this is not the expected format - so we'll need to do some formatting to make sure our data is in the expected format.

We can do this with `dataset.map()`, which simply needs a formatting function to apply to each sample.

In [None]:
def mistral_finetune_format(sample):
  # The system prompt is everything before "### Instruction:".
  system_prompt = sample["prompt"].split("### Instruction:")[0].strip()
  # The user prompt sits between "### Instruction:" and "### Response:".
  user_prompt = sample["prompt"].split("### Instruction:")[-1].split("### Response:")[0].strip()

  # Nest the conversation under a single "data" column for `dataset.map()`.
  return {"data" : {"messages" : [{"role" : "system", "content" : system_prompt}, {"role" : "user", "content" : user_prompt}, {"role" : "assistant", "content" : sample["response"]}]}}

Let's verify our formatting function worked!

In [None]:
mistral_finetune_format(dataset["train"][0])
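
To see what the two `split` calls are doing, here is the same logic applied to a hand-written prompt. The sample text is an assumption that mimics the layout of the dataset's `prompt` field; it is not a real record.

```python
# Hypothetical prompt in the "### Instruction: / ### Response:" layout.
sample_prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat is the capital of France?\n\n"
    "### Response:\n"
)

# Text before the instruction marker becomes the system prompt.
system_prompt = sample_prompt.split("### Instruction:")[0].strip()
# Text between the two markers becomes the user prompt.
user_prompt = sample_prompt.split("### Instruction:")[-1].split("### Response:")[0].strip()

print(repr(system_prompt))
print(repr(user_prompt))
```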

Now that our data formatter is tested - let's map it across the entire dataset!

In [None]:
formatted_dataset = dataset.map(mistral_finetune_format)

In [None]:
formatted_dataset

Let's save our data as a `JSONL` file for compatibility!

We'll create a training set and an evaluation set.

In [None]:
import json

file_path = "/content/data/train_instruct.jsonl"

with open(file_path, "w") as file:
  for item in formatted_dataset["train"]["data"]:
    json_str = json.dumps(item)
    file.write(json_str + "\n")

In [None]:
file_path = "/content/data/test_instruct.jsonl"

with open(file_path, "w") as file:
  for item in formatted_dataset["test"]["data"]:
    json_str = json.dumps(item)
    file.write(json_str + "\n")
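
Before handing the files to the repository's own tooling, a quick structural check can catch obvious problems. The sketch below is a hypothetical helper (`check_jsonl_lines` is not part of `mistral-finetune`) and only checks the basics assumed from the format above: a `messages` list, known roles, and an assistant turn at the end.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # roles used in this notebook


def check_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; empty means no issues found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages'"))
        elif any(not isinstance(m, dict) or m.get("role") not in ALLOWED_ROLES
                 for m in messages):
            problems.append((number, "unknown role"))
        elif messages[-1]["role"] != "assistant":
            problems.append((number, "last message is not from the assistant"))
    return problems


# Two toy lines: the second one is missing its assistant reply.
good = json.dumps({"messages": [{"role": "user", "content": "hi"},
                                {"role": "assistant", "content": "hello"}]})
bad = json.dumps({"messages": [{"role": "user", "content": "hi"}]})
print(check_jsonl_lines([good, bad]))
```

To run it on the files written above, pass an open file handle as `lines`, since iterating over a file yields one line at a time.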

### Verifying the Dataset

We can use the provided tools to verify that our dataset is in the correct shape - let's first pass our dataset through the reformat to clean up, or skip, any potential issues!

In [None]:
!python -m utils.reformat_data /content/data/train_instruct.jsonl

In [None]:
!python -m utils.reformat_data /content/data/test_instruct.jsonl

Now that our reformat completed with no issues - we can move to validating our data - but before we do, we need to talk about the `.yaml` file that acts as a guide for our training process.

Let's make it together in the following cells - we'll start by adding a reference to our data.

Notice that our data is under the `data` header.

In [None]:
training_dataset_path = "/content/data/train_instruct.jsonl"
eval_dataset_path = "/content/data/test_instruct.jsonl"

training_yaml = f"""\
data:
  instruct_data: '{training_dataset_path}'
  eval_instruct_data: '{eval_dataset_path}'
"""

Next, we'll add a reference to our downloaded and extracted model!

In [None]:
model_path = "/content/mistral_models"

training_yaml += f"\nmodel_id_or_path: '{model_path}'"

Now we can add some additional training parameters.

These are typical, and similar to what you'd see in something like `transformers` from Hugging Face!

In [None]:
LORA_RANK = 64
SEQ_LEN = 4092
BATCH_SIZE = 1
NUM_MICROBATCHES = 8
MAX_STEPS = 300

LEARNING_RATE = 1e-4
WEIGHT_DECAY = 0.1

OUTPUT_DIR = "/content/limit_test"
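
It helps to translate these settings into a token budget. The sketch below assumes that `num_microbatches` behaves as gradient accumulation (so each optimizer step sees `batch_size * num_microbatches` sequences per GPU) and that every sequence is packed to `seq_len`. Both are assumptions about the trainer rather than statements from this notebook. `NUM_GPUS` is a value introduced here to match the single-process `torchrun` call used later.

```python
# Values copied from the configuration cell above.
SEQ_LEN = 4092
BATCH_SIZE = 1
NUM_MICROBATCHES = 8
MAX_STEPS = 300
NUM_GPUS = 1  # assumption: a single Colab GPU

sequences_per_step = BATCH_SIZE * NUM_MICROBATCHES * NUM_GPUS
tokens_per_step = sequences_per_step * SEQ_LEN
total_tokens = tokens_per_step * MAX_STEPS

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"tokens per optimizer step:    {tokens_per_step:,}")
print(f"tokens over the whole run:    {total_tokens:,}")
```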

In [None]:
training_yaml += f"""
# optim
seq_len: {SEQ_LEN}
batch_size: {BATCH_SIZE}
num_microbatches: {NUM_MICROBATCHES}
max_steps: {MAX_STEPS}

optim:
  lr: {LEARNING_RATE}
  weight_decay: {WEIGHT_DECAY}
  pct_start: 0.05

# other
seed: 0
log_freq: 1
eval_freq: 100
no_eval: False
ckpt_freq: 100

save_adapters: True

run_dir: '{OUTPUT_DIR}'
"""

### Weights and Biases Integration

Now we can add references to our Weights and Biases project, API key, and run name!

This integration is straightforward and lets us monitor our fine-tuning very easily!

In [None]:
!pip install -qU wandb

Now we can add these Weights and Biases configurations to our `.yaml` file!

In [None]:
import getpass

WANDB_PROJECT = "MistralFinetune"
WANDB_RUN_NAME = "DollyInstruct"
API_KEY = getpass.getpass("WandB API Key:")

In [None]:
training_yaml += f"""
wandb:
  project: '{WANDB_PROJECT}'
  run_name: '{WANDB_RUN_NAME}'
  key: '{API_KEY}'
  offline: False
"""

Now let's save our `.yaml` file and use it to validate our data!

In [None]:
import yaml
with open('/content/instruct_tune_mistral_7B.yaml', 'w') as file:
    yaml.dump(yaml.safe_load(training_yaml), file)

In [None]:
!python -m utils.validate_data --train_yaml /content/instruct_tune_mistral_7B.yaml

## Model Training

Now that we have our `.yaml` file - we can go ahead and train our model!

We need to do a bit of bookkeeping for the Colab environment before moving on.

In [None]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

We'll also make sure that our `OUTPUT_DIR` does not exist to avoid errors.

In [None]:
!rm -rf /content/limit_test

Now - we can train!

We'll use `torchrun` to run our `train` script leveraging the created `.yaml` file - and away we go!

In [None]:
!torchrun --nproc-per-node 1 -m train /content/instruct_tune_mistral_7B.yaml

## Inference with Mistral

Now that we have a trained model - let's see how it responds!

First up - let's install the `mistral_inference` library.

In [None]:
!pip install -qU mistral_inference

Similar to the `transformers` library - we have a set of useful imports that, for the most part, just do what they say!

In [None]:
from mistral_inference.model import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage, SystemMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

Now we can load our downloaded model, our downloaded tokenizer, and our fine-tuned adapter!

In [None]:
tokenizer = MistralTokenizer.from_file("/content/mistral_models/tokenizer.model.v3")
model = Transformer.from_folder("/content/mistral_models")
model.load_lora("/content/limit_test/checkpoints/checkpoint_000100/consolidated/lora.safetensors")
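
The adapter path above follows a pattern: the step number is zero-padded to six digits. With `ckpt_freq: 100` and `max_steps: 300`, we would expect checkpoints at steps 100, 200, and 300, assuming the trainer saves on every multiple of `ckpt_freq`. The helper below is hypothetical and simply rebuilds the path used above for a given step.

```python
def lora_checkpoint_path(run_dir: str, step: int) -> str:
    """Build the adapter path for `step`, mirroring the layout used in this notebook."""
    return f"{run_dir}/checkpoints/checkpoint_{step:06d}/consolidated/lora.safetensors"


# Assumed checkpoint steps: every multiple of ckpt_freq up to max_steps.
for step in range(100, 301, 100):
    print(lora_checkpoint_path("/content/limit_test", step))
```

Swapping in a later step makes it easy to compare adapters from different points in training.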

In a very familiar format - we can create a request to our model!

We'll be sure to use the Instruction template we created before, and give a sample request!

In [None]:
completion_request = ChatCompletionRequest(
    messages=
      [
        SystemMessage(content="Below is an instruction that describes a task. Write a response that appropriately completes the request."),
        UserMessage(content="Explain Machine Learning to me in a nutshell.")
      ]
)

We'll go ahead and tokenize our chat completion!

In [None]:
tokens = tokenizer.encode_chat_completion(completion_request).tokens

Now we can generate a response and see how it did!

In [None]:
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
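
Setting `temperature=0.0` asks for deterministic, greedy decoding: at each step the highest-scoring token is chosen instead of sampling. The sketch below illustrates that idea on a toy list of logits. `pick_token` is a hypothetical function for illustration only and does not claim to reproduce the internals of `mistral_inference`.

```python
import math
import random


def pick_token(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Return a token index: argmax when temperature is 0, otherwise a softmax sample."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0]  # made-up scores for a three-token vocabulary
rng = random.Random(0)
print("greedy choice:", pick_token(toy_logits, 0.0, rng))
print("sampled choice:", pick_token(toy_logits, 1.0, rng))
```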

In [None]:
print(result)

This is a suitable response! Great job model!