# Fine-tune Mixtral 8x7B (MoE) model with QLoRA

----
This notebook shows how to fine-tune the Mixtral 8x7B model with QLoRA, using Hugging Face's PEFT [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) implementation and bitsandbytes.


----
This notebook was tested with the following configurations:
1. Instance Type: `ml.g5.48xlarge` (192 vCPUs, 768 GB memory, 8 A10G GPUs with 24 GB of GPU memory each) -- You may choose a smaller GPU instance, but make sure that the model fits (the model itself takes up around 100 GB of GPU memory)
2. SageMaker Notebook with a Python3 kernel
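
A rough way to check whether the model fits on a given instance is to multiply the parameter count by the number of bytes stored per weight. The sketch below uses the roughly 45 billion parameters quoted in Step 1. `weight_memory_gb` is a hypothetical helper, and the estimate ignores activations, optimizer state, and quantization metadata, so treat the results as lower bounds rather than measurements.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 45e9          # parameter count quoted in this notebook
GPU_MEMORY_GB = 8 * 24     # ml.g5.48xlarge: 8 A10G GPUs with 24 GB each

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    needed = weight_memory_gb(NUM_PARAMS, bits)
    print(f"{label}: ~{needed:.1f} GB of {GPU_MEMORY_GB} GB available")
```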

## Step 0: Prerequisites

To get started, install the required libraries, including the relevant Hugging Face libraries. The commands below also check whether Python 3 and NVIDIA CUDA are installed. If you run into errors here, make sure you use a configuration similar to the one above and come back to this step.

In this step, we will also load our dataset. The dataset we use here is the [GEM/Viggo](https://huggingface.co/datasets/GEM/viggo) dataset available on Hugging Face. Feel free to swap out this dataset with one of your own. If you do, you may also have to change the system prompt defined in the function `generate_and_tokenize_prompt()` below.

To be able to download the dataset and the model from Hugging Face, you would need to follow these steps.
1. On Hugging Face, visit the [Mixtral Model Page](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1). On there, you will see that it is a gated repo. Please click "Request Access". If prompted to log-in or create-account, please do so.
2. Once you have access to the model, you also need a [Hugging Face API Key/Security Token](https://huggingface.co/docs/hub/en/security-tokens). Please follow the steps to create the token. Make sure your token has at least READ permissions.
3. On your SageMaker Notebook, open up a terminal (`Launcher/+ --> Other --> Terminal`)
4. On the terminal, install the Hugging Face CLI: `pip3 install -U huggingface_hub[cli]`
5. Lastly, run `huggingface-cli login` to log in to your Hugging Face account and, when prompted, paste the access token you created in step 2.

In [None]:
!python3 --version

Note: The commands below are optional, but they are helpful for:
1. Checking NVIDIA GPU Memory Usage
2. Checking storage on your instance

In [None]:
!nvidia-smi
!df -h

In [None]:
!pip3 install -qU torch bitsandbytes transformers peft datasets
!pip3 install -qU tensorboardX
# Accelerate is needed for device_map="auto" when loading the model in Step 1. It is also the package to use if you want FSDP.
!pip3 install -qU accelerate
# Matplotlib is used for plotting input lengths. This is optional, please comment out if not required.
!pip3 install -qU matplotlib

In [None]:
# Add installed cuda runtime to path for bitsandbytes 
import nvidia
import os

cuda_install_dir = '/'.join(nvidia.__file__.split('/')[:-1]) + '/cuda_runtime/lib/'
os.environ['LD_LIBRARY_PATH'] =  cuda_install_dir

If you'd like to use the accelerate package for Flash Attention and/or FSDP, check out [Accelerate](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp).

In [None]:
from datasets import load_dataset
import torch

train_set = load_dataset("gem/viggo", split="train")
validation_set = load_dataset("gem/viggo", split="validation")
test_set = load_dataset("gem/viggo", split="test")

## Step 1: Loading the Quantized Base Model

In this step, we will load the quantized base model (Mixtral 8x7B) from Hugging Face.

### What is the Mixtral model?
Architectural Details: The Mixtral 8x7B model is a decoder-only transformer model.

Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP, with a total of roughly 45 billion parameters. Despite that total, the compute required for a single forward pass is about the same as that of a 14 billion parameter model, because each token is routed to only two of the eight experts in each MoE layer.

To learn more about Mixture-of-Experts, please refer to the blog post. 
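
As a toy illustration of the routing idea: in each MoE layer a router scores the eight experts for every token, and only the two best-scoring experts are evaluated. The sketch below is pure Python for intuition only. The names `top2_route` and `moe_layer` and the scalar "experts" are illustrative assumptions, not Mixtral's actual implementation.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and softmax-normalize their scores."""
    top2 = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:2]
    exps = [math.exp(router_logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(x, experts, router_logits):
    """Weighted sum of the outputs of the selected experts only."""
    routes = top2_route(router_logits)
    return sum(weight * experts[i](x) for i, weight in routes), routes

# Eight toy "experts": expert k multiplies its input by (k + 1)
experts = [lambda x, k=k: (k + 1) * x for k in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]

output, routes = moe_layer(1.0, experts, router_logits)
print(routes)   # two (expert_index, weight) pairs; the other six experts never run
print(output)
```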

In [None]:
# Using 4 bit quantization
# FOR SAGEMAKER NOTEBOOKS: Change cache_dir to /home/ec2-user/SageMaker (since EBS is mounted there)
# FOR STUDIO NOTEBOOKS (DEFAULT): Change cache_dir to /mnt/sagemaker-nvme (since nvme is mounted there)

from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", cache_dir="/mnt/sagemaker-nvme")
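
The idea behind 4-bit loading is to store each block of weights as small codes plus one scale per block, and to convert back to a higher-precision dtype for computation. The toy sketch below uses plain absmax rounding to signed 4-bit integers. It is a simplification for intuition only and does not reproduce the actual 4-bit data types or kernels used by bitsandbytes. `quantize_block` and `dequantize_block` are hypothetical helpers.

```python
def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the codes and the block scale."""
    return [c * scale for c in codes]

weights = [0.42, -0.17, 0.03, -0.61, 0.25, 0.0, -0.08, 0.55]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

print(codes)      # integers in [-7, 7], storable in 4 bits each
print(max(abs(w - r) for w, r in zip(weights, restored)))   # worst-case rounding error
```

The rounding error per weight is bounded by half the block scale, which is why smaller blocks (and therefore more scales) give a closer approximation at the cost of extra metadata.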

## Step 2: Tokenization

To train our model, we need to convert our input data to tokens. We do this using the Hugging Face Transformers Tokenizer. To learn more about the Hugging Face "Auto" classes, check out [Auto Classes](https://huggingface.co/docs/transformers/en/model_doc/auto). To learn more about the Tokenizer, along with the EOS and BOS tokens, check out [Tokenizer](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    model_id, 
    add_eos_token=True,    # End-of-sequence (EOS) token
    add_bos_token=True     # Beginning-of-sequence (BOS) token
)

We now need to tokenize our entire training and validation datasets. Before we do that, we also need to format all our data points so that:
1. The labels that we pass in during our fine-tuning job are defined. In this case, we set `labels == input_ids`, which, as you will see later, is just the entire tokenized prompt itself.
2. A system prompt is passed in to the LLM while fine-tuning (or even for simple inference).

In this example, along with the prompt, we pass in the "Target Sentence" from the GEM/viggo dataset, and expect the LLM to generate the "Meaning Representation".
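
Setting the labels equal to the input IDs works because a causal language model is trained to predict each next token: the output at position `t` is scored against the token at position `t + 1`. The sketch below shows that pairing on a made-up list of token IDs. The IDs and the helper name `next_token_pairs` are purely illustrative.

```python
def next_token_pairs(input_ids, labels):
    """Pair each context position with the label it is trained to predict."""
    # Position t is scored against labels[t + 1]; the last position has no target.
    return list(zip(input_ids[:-1], labels[1:]))

input_ids = [1, 5431, 264, 2718, 2]   # made-up IDs: BOS, three tokens, EOS
labels = input_ids.copy()             # same convention as tokenize() in this step

for seen, target in next_token_pairs(input_ids, labels):
    print(f"after token {seen} -> predict {target}")
```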

In [None]:
# Formatting Step 1
# Method to tokenize using the loaded tokenizer from above
def tokenize(prompt):
    result = tokenizer(prompt)
    result["labels"] = result["input_ids"].copy()   # Setting the labels and input_ids to be the same
    return result

In [None]:
# Formatting Step 2
def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
{data_point["target"]}

### Meaning representation:
{data_point["meaning_representation"]}
"""
    return tokenize(full_prompt)

In [None]:
# Tokenize the training and validation dataset
tokenized_train_dataset = train_set.map(generate_and_tokenize_prompt)
tokenized_validation_dataset = validation_set.map(generate_and_tokenize_prompt)

## Step 3: Padding & Re-tokenizing
When passing training data to the LLM for fine-tuning, it's important to ensure that the inputs all have a uniform length. To achieve this, we first visualize the distribution of the input token lengths (or, alternatively, directly find the maximum length). Based on these results, we pick the maximum input token length and use padding to bring every input to that length.
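
To make the padding step concrete, here is a minimal pure-Python sketch of left padding with an attention mask. The token IDs are made up, `pad_left` is a hypothetical helper, and the pad ID of 2 simply mirrors the choice made later in this step of reusing the EOS token for padding. The real work is done by the tokenizer.

```python
def pad_left(input_ids, max_length, pad_id):
    """Truncate to max_length, then left-pad; returns (padded_ids, attention_mask)."""
    ids = input_ids[:max_length]
    n_pad = max_length - len(ids)
    return [pad_id] * n_pad + ids, [0] * n_pad + [1] * len(ids)

batch = [[1, 11, 12, 2], [1, 21, 2], [1, 31, 32, 33, 34, 2]]
max_length = max(len(ids) for ids in batch)   # 6 for this toy batch

for ids in batch:
    padded, mask = pad_left(ids, max_length, pad_id=2)
    print(padded, mask)
```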

Option 1: Using Matplotlib, we visualize the distribution of the input token lengths of the entire dataset.

In [None]:
import matplotlib.pyplot as plt

def plot_data_lengths(tokenized_train_dataset, tokenized_validation_dataset):
    lengths1 = [len(x["input_ids"]) for x in tokenized_train_dataset]
    lengths2 = [len(x["input_ids"]) for x in tokenized_validation_dataset]
    lengths = lengths1 + lengths2
    
    plt.figure(figsize=(10,6))
    plt.hist(lengths, bins=20, alpha=0.7, color="blue")
    plt.xlabel("input_ids lengths")
    plt.ylabel("Frequency")
    plt.title("Distribution of lengths of input_ids")
    plt.show()

plot_data_lengths(tokenized_train_dataset, tokenized_validation_dataset)

Option 2: Alternatively, use Python's built-in `max` function to find the maximum input length directly:

In [None]:
print(max([len(x["input_ids"]) for x in tokenized_train_dataset]))

In [None]:
# In our example, max length comes out to 344, so that's what we will be padding all the input data to.
max_length = 344

# Reload the tokenizer with left padding so that all tensors can be padded to the same size
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True
)

tokenizer.pad_token = tokenizer.eos_token    # Arbitrarily using the EOS Token as the Pad token
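
One side effect of reusing the EOS token for padding is worth keeping in mind. As far as we understand it, the `DataCollatorForLanguageModeling(tokenizer, mlm=False)` used in Step 6 rebuilds the labels from `input_ids` and replaces every position equal to the pad token ID with `-100`, the value the loss ignores. When pad and EOS share an ID, the real EOS at the end of each example is masked too. The sketch below mimics that rule on toy IDs. `mask_pad_labels` is a hypothetical stand-in, not the library function.

```python
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss

def mask_pad_labels(input_ids, pad_id):
    """Copy input_ids to labels, masking every position that equals pad_id."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

EOS_ID = 2
padded = [2, 2, 1, 11, 12, 2]   # left-padded toy example: pad, pad, BOS, two tokens, EOS

print(mask_pad_labels(padded, pad_id=EOS_ID))             # the trailing EOS is masked as well
print(mask_pad_labels([0, 0, 1, 11, 12, 2], pad_id=0))    # a distinct pad ID keeps the EOS label
```

If the fine-tuned model tends not to stop generating, a dedicated pad token is one thing to try.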

In [None]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding="max_length"
    )

    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
tokenized_train_dataset = train_set.map(generate_and_tokenize_prompt)
tokenized_validation_dataset = validation_set.map(generate_and_tokenize_prompt)

In [None]:
# Run the plot function to confirm that your input tokens are all of the same size
plot_data_lengths(tokenized_train_dataset, tokenized_validation_dataset)

## Step 4: Testing the Pre-finetuned Model

To compare the out-of-the-box quantized Mixtral model with the fine-tuned model, we first need to check how the base model performs. We do this by feeding it a test input (the "Target Sentence") and comparing its output with the ground-truth "Meaning Representation" from our GEM/viggo dataset.

After fine-tuning, we will run the same test to verify if the model performs better.

In [None]:
# Arbitrarily picking from the test dataset
print("TARGET: " + test_set[1]["target"])
print("MEANING REPRESENTATION: " + test_set[1]["meaning_representation"])

In [None]:
# Defining an evaluation tokenizer (similar definition as above) to tokenize the input data
eval_tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    add_bos_token=True
)

In [None]:
eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?

### Meaning representation:
"""

In [None]:
# Using your GPUs to perform inference on the above prompt
device = "cuda"
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to(device)

model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(model.generate(**model_input, max_new_tokens=128)[0], skip_special_tokens=True))

Here are the results from this run:

- **Actual meaning representation** (Ground Truth): verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation]) <br>
- **Mixtral's (current) output for meaning representation** (Model Inference): inform(name(Little Big Adventure), has_multiplayer(Little Big Adventure))
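
To put a number on how far a prediction is from the ground truth, one option is to parse both strings and compare the function name and the attributes separately. The following is a minimal sketch using only the standard library. `parse_mr` and `compare_mr` are hypothetical helpers, and the regular expressions assume the `function(attr[value], ...)` layout seen in the ground truth above.

```python
import re

def parse_mr(text):
    """Split 'function(attr[value], ...)' into (function, {attr: value})."""
    match = re.match(r"\s*(\w+)\((.*)\)\s*$", text, re.DOTALL)
    if match is None:
        return None, {}
    function, body = match.groups()
    return function, dict(re.findall(r"(\w+)\[([^\]]*)\]", body))

def compare_mr(truth, prediction):
    """Return (function_matches, number_of_matching_attributes, total_attributes)."""
    true_fn, true_attrs = parse_mr(truth)
    pred_fn, pred_attrs = parse_mr(prediction)
    hits = sum(1 for key, value in true_attrs.items() if pred_attrs.get(key) == value)
    return true_fn == pred_fn, hits, len(true_attrs)

truth = "verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])"
base_output = "inform(name(Little Big Adventure), has_multiplayer(Little Big Adventure))"
print(compare_mr(truth, base_output))
```

Because the base model writes attribute values in parentheses instead of square brackets, none of its attributes count as matches under this strict comparison.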

## Step 5: Setting up QLoRA

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
# Printing the model here to check the layers. This step is important so you can see what layers have been added after the QLoRA step.
print(model)

In the model, we have the linear layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `w1`, `w2`, `w3`, `lm_head`). 

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )    

In [None]:
print_trainable_parameters(model)

As expected, the base quantized model shows 0 trainable parameters. With QLoRA, we will add adapters to the existing model.
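
Conceptually, a LoRA adapter leaves a frozen weight matrix `W` untouched and learns a low-rank update, so the layer computes `W x + (lora_alpha / r) * B (A x)`. Below is a tiny pure-Python sketch of that forward pass. The 2 x 2 matrices, the helper names, and the rank of 1 are toy assumptions chosen for readability (the configuration in this step uses `r=8` and `lora_alpha=32`). In PEFT, `B` starts at zero by default, so the adapted layer initially behaves exactly like the base layer.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (2 x 2)
A = [[0.5, -0.5]]              # trainable, shape (r x d_in) with r = 1
B_init = [[0.0], [0.0]]        # trainable, shape (d_out x r), zero at initialization
x = [1.0, 2.0]

print(lora_forward(W, A, B_init, x, r=1, lora_alpha=4))            # same as the base layer
print(lora_forward(W, A, [[1.0], [2.0]], x, r=1, lora_alpha=4))    # after B has been updated
```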

In [None]:
from peft import LoraConfig, PeftModel, get_peft_model

target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3", "lm_head"]
config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=target_modules, 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

In [None]:
print_trainable_parameters(model)

The increase in the number of trainable parameters indicates the addition of the QLoRA Adapters! You may also choose to print the model to visualize the additional adapters added.
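
The size of that increase can be estimated by hand: for a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` adapter adds `r * (d_in + d_out)` weights. The sketch below applies this to the target modules listed above. The layer shapes (hidden size 4096, expert MLP size 14336, 32 layers, 8 experts, 8 key/value heads of dimension 128, vocabulary of 32000) are assumptions taken from the published Mixtral 8x7B configuration, not values printed in this notebook, so compare the result against the output of `print_trainable_parameters(model)` rather than relying on it.

```python
R = 8  # LoRA rank from the LoraConfig above

# Assumed shapes from the published Mixtral 8x7B config
HIDDEN, MLP, KV_DIM, VOCAB = 4096, 14336, 8 * 128, 32000
LAYERS, EXPERTS = 32, 8

def lora_params(d_in, d_out, r=R):
    """Weights added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

attention = (
    lora_params(HIDDEN, HIDDEN)      # q_proj
    + lora_params(HIDDEN, KV_DIM)    # k_proj
    + lora_params(HIDDEN, KV_DIM)    # v_proj
    + lora_params(HIDDEN, HIDDEN)    # o_proj
)
# w1 and w3 map HIDDEN -> MLP, w2 maps MLP -> HIDDEN; all three add the same count
per_expert = 3 * lora_params(HIDDEN, MLP)

total = LAYERS * (attention + EXPERTS * per_expert) + lora_params(HIDDEN, VOCAB)
print(f"estimated LoRA parameters: {total:,}")
```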

In [None]:
print(model)

## Step 6: Now we are ready to Fine-Tune!

In [None]:
# If more than one GPU is available, flag the model as model-parallel so the Trainer keeps the existing split across GPUs (from device_map="auto") instead of wrapping it in DataParallel.
dev_count = torch.cuda.device_count()
if dev_count > 1:
    model.is_parallelizable = True
    model.model_parallel = True

For the training job below, you may choose to use the estimator of your choice! 

In [None]:
# Name of the S3 bucket (and prefix) that will receive our training metrics for TensorBoard.
bucket = "mixtral-qlora-finetune-results"
log_bucket = f"s3://{bucket}/qlora-finetuning"

In [None]:
# Run this command and check if a bucket with the name specified above exists in your account
!aws s3 ls | grep $bucket

In [None]:
# pip install tensorboard
# To learn more about the training arguments used, check out https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments 

import transformers

tokenizer.pad_token = tokenizer.eos_token

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    max_steps=1000,
    output_dir="mixtral-outputs",
    logging_dir=log_bucket,
    logging_steps=2,
    learning_rate=2e-4,
    fp16=True,
    save_strategy="steps",
    save_steps=50,
    eval_strategy="steps",
    eval_steps=50,
    do_eval=True,
    warmup_steps=5,
    gradient_checkpointing=True,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    report_to="tensorboard",
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False    
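
A few quantities follow directly from these arguments and are handy to sanity-check before launching a long run. The sketch below derives them with plain arithmetic. It assumes a single training process (the model is sharded across GPUs rather than replicated), and the `num_train_examples` value is a placeholder to replace with `len(tokenized_train_dataset)`.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
max_steps = 1000
save_steps = 50
num_train_examples = 5000   # placeholder; use len(tokenized_train_dataset)

# Examples consumed per optimizer step (single process assumed)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch_size * max_steps
approx_epochs = examples_seen / num_train_examples
num_checkpoints = max_steps // save_steps

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen} (~{approx_epochs:.1f} epochs)")
print(f"checkpoints written: {num_checkpoints}")
```

Each checkpoint lands in a `checkpoint-<step>` folder under `output_dir`, so it is worth confirming there is enough disk space for all of them.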

In [None]:
# Start training
trainer.train()

In [None]:
# Runs evaluation on the validation dataset and returns the resulting metrics
trainer.evaluate()

## Step 7: Compare!

Let's compare the fine-tuned model to how the quantized out-of-the-box model performed.

In [None]:
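from peft import PeftModel

# Reload the quantized base model, then attach the fine-tuned adapter to it
base_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", cache_dir="/mnt/sagemaker-nvme")
# Checkpoints are saved under the output_dir set in Step 6 ("mixtral-outputs")
ft_model = PeftModel.from_pretrained(base_model, "mixtral-outputs/checkpoint-500")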

In [None]:
eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?

### Meaning representation:
"""

device = "cuda"
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to(device)

ft_model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

This is what we had before fine-tuning:
- **Actual meaning representation** (Ground Truth): verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation]) <br>
- **Mixtral's (old) output for meaning representation** (Model Inference): inform(name(Little Big Adventure), has_multiplayer(Little Big Adventure))

This is what we have after fine-tuning:

- **Actual meaning representation** (Ground Truth): verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])
- **Mixtral's (new) output for meaning representation** (Model Inference): verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])