This notebook fine-tunes Mistral 7B on UltraChat with TRL and QLoRA.

More details in this article: [Mistral 7B: Recipes for Fine-tuning and Quantization on Your Computer](https://kaitchup.substack.com/p/mistral-7b-recipes-for-fine-tuning)

First, we need all these dependencies:

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl

Import all the necessary packages.

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

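Load the tokenizer and configure padding. The Mistral 7B tokenizer does not define a padding token, so the unknown token is reused for padding.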

In [None]:
model_name = "mistralai/Mistral-7B-v0.1"
# Load the tokenizer and use the unknown token as the padding token
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id =  tokenizer.unk_token_id
tokenizer.padding_side = 'left'
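
To illustrate what `padding_side = 'left'` does, here is a minimal pure-Python sketch that pads a batch of token ID lists on the left and builds the matching attention mask. The `left_pad` helper and the toy token IDs are hypothetical, and the pad ID of 0 is an assumption (the notebook only sets it to `tokenizer.unk_token_id`). The real tokenizer handles this internally.

```python
def left_pad(batch, pad_id):
    """Pad every sequence on the left to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding positions, 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


PAD_ID = 0  # assumption: stands in for tokenizer.unk_token_id
toy_batch = [[1, 851, 349], [1, 851, 349, 264, 1369]]  # made-up token IDs

ids, mask = left_pad(toy_batch, PAD_ID)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

With left padding, the real tokens of every sequence end at the same position, and the attention mask tells the model which positions to ignore.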

Load the custom UltraChat dataset. I randomly subsampled the original dataset and flattened each dialogue into a single sequence. You can see some examples here: [kaitchup/ultrachat-100k-flattened](https://huggingface.co/datasets/kaitchup/ultrachat-100k-flattened)

In the appendix section of this notebook (below), I provide the code I used to make this dataset.

In [None]:
dataset = load_dataset("kaitchup/ultrachat-100k-flattened")

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5140 [00:00<?, ? examples/s]

Load the model quantized to 4-bit (NF4 with double quantization) and prepare it for QLoRA fine-tuning.

In [None]:
compute_dtype = torch.float16
# 4-bit NF4 quantization with double quantization, computing in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
# Load the whole quantized model on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
# Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id
# The KV cache is incompatible with gradient checkpointing, which
# prepare_model_for_kbit_training enables by default
model.config.use_cache = False


Check that the model is properly quantized by printing its structure: the linear layers should appear as "Linear4bit".

In [None]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
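
The printed shapes are enough to estimate how much memory the quantized linear layers occupy. The sketch below counts the weights of the `Linear4bit` layers listed above and converts them to gigabytes at 4 bits per weight. The overhead of 0.127 bits per weight for the quantization constants is the figure reported in the QLoRA paper for double quantization and is used here as an assumption. Embeddings, normalization layers, and any module missing from the truncated printout are ignored.

```python
# (in_features, out_features) of each Linear4bit layer, copied from the printout
LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # "(0-31): 32 x MistralDecoderLayer"

weights_per_layer = sum(n_in * n_out for n_in, n_out in LINEAR_SHAPES.values())
quantized_weights = NUM_LAYERS * weights_per_layer

BITS_PER_WEIGHT = 4
OVERHEAD_BITS = 0.127  # assumption: double-quantization overhead per weight

size_gb = quantized_weights * (BITS_PER_WEIGHT + OVERHEAD_BITS) / 8 / 1e9
print(f"{quantized_weights:,} quantized weights, about {size_gb:.2f} GB")
```

This is only a rough estimate of the weight storage. Activations, the LoRA adapters, and the optimizer states need additional memory during training.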

Define the LoRA configuration. Adapters are attached to all the attention and MLP projection layers.

In [None]:
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
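
As a sanity check on this configuration, the number of trainable LoRA parameters can be derived by hand. Each adapted linear layer of shape (in, out) receives two low-rank matrices, A of shape r × in and B of shape out × r, which adds r × (in + out) parameters. The sketch below applies this to the layer shapes printed earlier. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# (in_features, out_features) of the targeted modules, from the model printout
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32
RANK = 16  # r in LoraConfig

per_layer = sum(lora_param_count(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_trainable = NUM_LAYERS * per_layer
print(f"{total_trainable:,} trainable parameters")
```

The result, 41,943,040, matches the "Number of trainable parameters" reported in the training log.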

For this tutorial, I trained for only 100 steps. Since the examples are very long, training for one full epoch would take more than 200 hours on a T4 GPU. With a V100 or an RTX 40xx, this may drop to around 100 hours.
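
To put these 100 steps in perspective, the following sketch computes how much of the dataset they cover. It assumes the settings used in this notebook (100,000 training examples, a per-device batch size of 4, and 4 gradient accumulation steps) and a single GPU.

```python
import math

num_examples = 100_000
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1  # assumption: training on a single GPU
max_steps = 100

# Number of examples consumed by one optimizer step
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
examples_seen = max_steps * effective_batch_size
fraction_of_epoch = examples_seen / num_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Steps for one epoch: {steps_per_epoch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} ({fraction_of_epoch:.1%} of an epoch)")
```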

In [None]:
training_arguments = TrainingArguments(
        output_dir="./results",
        #evaluation_strategy="steps",
        #do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=4,
        log_level="debug",
        save_steps=20,
        logging_steps=10,
        learning_rate=4e-4,
        #eval_steps=200,
        #num_train_epochs=1,
        max_steps=100,
        warmup_steps=100,
        lr_scheduler_type="linear",
)

Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
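
With `warmup_steps=100` and `max_steps=100`, the warmup spans the whole run. The sketch below re-derives the usual linear schedule with warmup in plain Python to show the learning rate at a few steps. It is an illustration, not the library's code, and `linear_lr` is a hypothetical helper.

```python
def linear_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


PEAK_LR = 4e-4
for step in (0, 10, 50, 99):
    # Settings from the training arguments above
    print(step, linear_lr(step, PEAK_LR, warmup_steps=100, total_steps=100))
```

Under these settings the learning rate increases linearly throughout training and the decay phase is never used. A smaller `warmup_steps` value would make the schedule reach the peak earlier and then decay.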


Run the training. Validation is disabled (the evaluation arguments are commented out).

In [None]:
trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        #eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5140 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 100,000
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 41,943,040


Saving model checkpoint to ./results/checkpoint-20
tokenizer config file saved in ./results/checkpoint-20/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-20/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-40
tokenizer config file saved in ./results/checkpoint-40/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-40/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-60
tokenizer config file saved in ./results/checkpoint-60/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-60/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-80
tokenizer config file saved in ./results/checkpoint-80/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-80/special_tokens_map.json


| Step | Training Loss |
|------|---------------|
| 10   | 1.2564        |
| 20   | 1.1419        |
| 30   | 1.1687        |
| 40   | 1.1038        |
| 50   | 1.0909        |
| 60   | 1.0741        |
| 70   | 1.067         |
| 80   | 1.0742        |
| 90   | 1.0785        |
| 100  | 1.0624        |


Saving model checkpoint to ./results/checkpoint-100
tokenizer config file saved in ./results/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=100, training_loss=1.1117890453338624, metrics={'train_runtime': 10576.024, 'train_samples_per_second': 0.151, 'train_steps_per_second': 0.009, 'total_flos': 3.51564749340672e+16, 'train_loss': 1.1117890453338624, 'epoch': 0.02})
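
The throughput figures in `TrainOutput` follow directly from the runtime and the batch settings. The sketch below recomputes them and extrapolates the time a full epoch would take on the same hardware. The extrapolation assumes the throughput stays constant, which depends on the GPU and on the sequence lengths.

```python
train_runtime_s = 10576.024  # 'train_runtime' from TrainOutput
global_steps = 100
effective_batch_size = 16    # 4 per device x 4 accumulation steps
num_examples = 100_000

samples_seen = global_steps * effective_batch_size
samples_per_second = samples_seen / train_runtime_s
steps_per_second = global_steps / train_runtime_s
epoch_fraction = samples_seen / num_examples

# Assumption: constant throughput over the whole epoch
full_epoch_hours = train_runtime_s / epoch_fraction / 3600

print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s:   {steps_per_second:.3f}")
print(f"epoch:     {epoch_fraction:.2f}")
print(f"estimated full epoch: {full_epoch_hours:.0f} hours")
```

The recomputed values agree with the reported metrics, and the extrapolation gives about 184 hours for one epoch at this throughput.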

Test inference with the last adapter checkpoint saved during training.

In [None]:
from transformers import GenerationConfig

model.config.use_cache = True  # Re-enable the KV cache for generation
model = PeftModel.from_pretrained(model, "./results/checkpoint-100/")

def generate(instruction):
    # Same format as the flattened training examples
    prompt = "### Human: " + instruction + "### Assistant: "
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
        input_ids=input_ids,
        # do_sample is not set, so decoding is greedy and
        # temperature, top_p, and top_k have no effect
        generation_config=GenerationConfig(pad_token_id=tokenizer.pad_token_id, temperature=1.0, top_p=1.0, top_k=50, num_beams=1),
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256,
    )
    for seq in generation_output.sequences:
        output = tokenizer.decode(seq)
        # Print only the text generated after the assistant marker
        print(output.split("### Assistant: ")[1].strip())

generate("Tell me about gravitation.")

1. Gravitation is the force of attraction between two objects. 2. Gravitation is a fundamental force in nature. 3. Gravitation is the force that keeps us on the ground. 4. Gravitation is the force that keeps the planets in orbit around the sun. 5. Gravitation is the force that keeps the moon in orbit around the earth. 6. Gravitation is the force that keeps the stars in orbit around the galaxy. 7. Gravitation is the force that keeps the galaxies in orbit around the universe. 8. Gravitation is the force that keeps the universe in orbit around the multiverse. 9. Gravitation is the force that keeps the multiverse in orbit around the omniverse. 10. Gravitation is the force that keeps the omniverse in orbit around the everything. 11. Gravitation is the force that keeps the everything in orbit around the nothing. 12. Gravitation is the force that keeps the nothing in orbit around the everything. 13. Gravitation is the force that keeps the everything in orbit around the nothing. 14. Gravitatio
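
The prompt format and the response extraction can be tried without a GPU. The sketch below uses two hypothetical helpers, `build_prompt` and `extract_response`, that mirror the string handling in `generate`. As an extra that is not in the original code, `extract_response` also cuts the text at the next `### Human:` marker, in case the model continues the dialogue with another turn. The decoded string is made up for the example.

```python
HUMAN = "### Human: "
ASSISTANT = "### Assistant: "


def build_prompt(instruction):
    """Format an instruction like the flattened training examples."""
    return HUMAN + instruction + ASSISTANT


def extract_response(decoded):
    """Keep the text after the first assistant marker, up to the next human turn."""
    response = decoded.split(ASSISTANT, 1)[1]
    # Extra step (assumption): drop any follow-up turn generated by the model
    response = response.split(HUMAN.strip(), 1)[0]
    return response.strip()


# Made-up decoded output standing in for tokenizer.decode(seq)
decoded = build_prompt("Tell me about gravitation.") + "Gravitation is a force.### Human: Thanks!"
print(extract_response(decoded))
```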

# Appendix

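Code used to flatten UltraChat. Each dialogue is joined into a single sequence, with the turns alternately prefixed by "### Human: " and "### Assistant: ".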

In [None]:
from datasets import load_dataset

ultrachat = load_dataset('stingning/ultrachat', split='train')

# Hold out 0.35% of the dialogues as a test split
ultrachat = ultrachat.train_test_split(test_size=0.0035)

flattened_ultrachat = dict()
for split in ultrachat:
    flattened_ultrachat[split] = []
    for example in ultrachat[split]:
        dialog = example['data']
        role_and_turn = []
        for turn, utterance in enumerate(dialog):
            # Even-numbered turns are the human, odd-numbered turns the assistant
            role = 'Assistant' if turn % 2 != 0 else 'Human'
            role_and_turn.append('### ' + role + ': ' + utterance)
        # Concatenate the turns without any separator
        flattened_dialog = ''.join(role_and_turn)
        flattened_ultrachat[split].append({'text': flattened_dialog})


{'id': '0', 'data': ['How can cross training benefit groups like runners, swimmers, or weightlifters?', 'Cross training can benefit groups like runners, swimmers, or weightlifters in the following ways:\n\n1. Reduces the risk of injury: Cross training involves different types of exercises that work different muscle groups. This reduces the risk of overuse injuries that may result from repetitive use of the same muscles.\n\n2. Improves overall fitness: Cross training helps improve overall fitness levels by maintaining a balance of strength, endurance, flexibility, and cardiovascular fitness.\n\n3. Breaks monotony: Cross training adds variety to your fitness routine by introducing new exercises, which can help you stay motivated and avoid boredom that often comes with doing the same exercises repeatedly.\n\n4. Increases strength: Cross training helps in building strength by incorporating exercises that target different muscle groups. This helps you build strength in areas that may be und
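
The flattening logic can be checked on a toy record without downloading the dataset. The sketch below wraps the same logic in a hypothetical `flatten_dialog` function and applies it to a made-up two-turn dialogue with the same `data` layout as the record above.

```python
def flatten_dialog(dialog):
    """Join alternating human/assistant turns into one training sequence."""
    parts = []
    for turn, utterance in enumerate(dialog):
        # Even-numbered turns are the human, odd-numbered turns the assistant
        role = "Assistant" if turn % 2 != 0 else "Human"
        parts.append("### " + role + ": " + utterance)
    return "".join(parts)


# Made-up record for illustration
toy_record = {"id": "0", "data": ["What is 2 + 2?", "2 + 2 equals 4."]}
flattened = flatten_dialog(toy_record["data"])
print(flattened)
```

The turns are concatenated without a separator, which is why the inference prompt also places `### Assistant: ` directly after the instruction.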