
# Fine-tuning and deploying Mistral 7B in SageMaker with Hugging Face, using QLoRA Parameter-Efficient Fine-Tuning

---

Mistral 7B, a large language model by Mistral AI, outperforms larger Llama 2 models from Meta. An instruct version of the model, trained to follow instructions, is also available. With 7 billion parameters, the model is small compared to many competing models, which makes fine-tuning and inference cost-effective: in the default half-precision mode (FP16), the model fits on a single A10G GPU for inference, such as those in AWS's G5 instance family.\
QLoRA is a parameter-efficient fine-tuning technique that fine-tunes LLMs in less memory: instead of changing the weights of the model, it trains a small set of additional adapter weights on top of them. This not only leads to good performance, it also mitigates the risk of [Catastrophic Forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference) that comes with regular full fine-tuning. QLoRA:

1. Freezes model weights, and quantizes the pretrained model to 4 bits.
2. Attaches additional trainable adapter layers.
3. Fine-tunes only these adapter layers. The frozen, quantized model still takes part in the forward and backward pass, but its weights are not updated.
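
To make the memory argument concrete, here is a back-of-the-envelope sketch. It assumes a round 7 billion parameters, counts only the weights (no activations, optimizer state or quantization overhead), and uses a hypothetical adapter rank of 64 on a single 4096 x 4096 projection matrix. The rank is an illustrative value, not necessarily the one the training script uses.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one low-rank adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


NUM_PARAMS = 7e9  # assumption: round figure for Mistral 7B

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half precision
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized

# One 4096 x 4096 projection (4096 is the Mistral 7B hidden size)
frozen = 4096 * 4096
trainable = lora_adapter_params(4096, 4096, rank=64)  # rank 64 is hypothetical

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"adapter / frozen parameters for one projection: {trainable / frozen:.2%}")
```

The FP16 figure is roughly why the model fits on a 24 GB A10G for inference; the 4-bit figure leaves room on the same GPU for the adapters, their gradients and the activations during training.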

In this notebook, you will learn how to fine-tune the 7B model using Hugging Face on Amazon SageMaker. You'll use the Hugging Face Transformers framework and the Hugging Face extension to the SageMaker Python SDK to fine-tune Mistral 7B with QLoRA on an example instruction dataset, and run the tuned model in a Hugging Face Deep-Learning Container (DLC) on a SageMaker real-time inference endpoint. This notebook can be run from an Amazon SageMaker Studio notebook or a SageMaker notebook instance, and outside SageMaker (for example on your laptop/development machine). In the latter case you'll need to handle authentication to SageMaker and other AWS services used in the notebook. When you run the notebook on SageMaker this will be handled for you.


## Files

`finetune-mistral-7b-scripts/run_clm.py`: The entry point script that is passed to the Hugging Face estimator later in this notebook when launching the QLoRA fine-tuning job (from [this sample repository](https://github.com/philschmid/sagemaker-huggingface-llama-2-samples/blob/master/training/scripts/run_clm.py)).\
`finetune-mistral-7b-scripts/requirements.txt`: Installs the dependencies for the fine-tuning job, such as Hugging Face Transformers and the PEFT library.


## Prerequisites

You need an S3 bucket to store the input data for training. This bucket must be located in the same AWS Region in which you launch your training job. To learn how to create an S3 bucket, see [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) in the Amazon S3 documentation. Alternatively, you can use the default bucket of the SageMaker session you create, without specifying a bucket name.


## Launching Environment
### Amazon SageMaker Notebook

You can run the notebook in an Amazon SageMaker Studio notebook or on a SageMaker notebook instance without manually setting your AWS credentials.

1. Create a new SageMaker notebook instance and open it.
2. Zip the contents of this folder and upload the archive to the instance with the Upload button on the top-right.
3. Open a new terminal with New -> Terminal.
4. Within the terminal, enter the correct directory and unzip the file: `cd SageMaker && unzip <your-zip-name-here>.zip`

### Locally

You can run the notebook locally by launching a Jupyter notebook server with `jupyter notebook`. This requires you to set your AWS credentials in the environment manually. See [Configure the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) for more details.
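
When running locally, looking up the execution role can also fail, because your local identity is typically an IAM user rather than a SageMaker role. The sketch below shows one possible fallback. It assumes that the SDK signals this case with a `ValueError`, and `resolve_role_arn` and the role name `sagemaker_execution_role` are hypothetical names introduced for illustration. The two lookups are passed in as functions so the logic can be tried without AWS access.

```python
from typing import Callable


def resolve_role_arn(
    get_sagemaker_role: Callable[[], str],
    get_iam_role_arn: Callable[[str], str],
    fallback_role_name: str,
) -> str:
    """Return the SageMaker execution role, falling back to a named IAM role."""
    try:
        return get_sagemaker_role()
    except ValueError:
        # Assumption: outside SageMaker the SDK raises ValueError when the
        # current identity is an IAM user rather than a role.
        return get_iam_role_arn(fallback_role_name)


# Hypothetical wiring with the real SDKs (not executed here):
# import boto3, sagemaker
# iam = boto3.client("iam")
# role = resolve_role_arn(
#     sagemaker.get_execution_role,
#     lambda name: iam.get_role(RoleName=name)["Role"]["Arn"],
#     "sagemaker_execution_role",  # hypothetical role name
# )
```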


## Amazon SageMaker Initialization
Run the following cell to install pinned, recent versions of the SageMaker SDK, the Transformers framework and the other libraries we need.

In [None]:
%pip install -q -U \
    transformers==4.42.3 \
    datasets==2.20.0 \
    sagemaker==2.224.4 \
    s3fs==2024.5.0 \
    aiobotocore==2.13.1 \
    fsspec==2024.5.0 \
    huggingface-hub==0.23.4

You may need to **restart the notebook kernel** for the changes to take effect.
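
After restarting the kernel, a quick check like the following can confirm which versions are actually active. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of a few of the libraries installed above
for name in ("transformers", "datasets", "sagemaker", "huggingface-hub"):
    print(f"{name}: {installed_version(name)}")
```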

Import the SageMaker modules and retrieve information about your current SageMaker work environment, such as the AWS Region and the ARN of your Amazon SageMaker execution role.

In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()

# gets role
role = sagemaker.get_execution_role()

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Here we load the [Dolly-15k dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k). This is a high-quality set of human-generated prompt/response pairs, well suited to instruction fine-tuning LLMs like Mistral 7B.

In [None]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

The following formatting function converts our data into task prompts. It takes a sample of the dataset and returns a prompt string.

In [None]:
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

In [None]:
from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))
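
To see exactly what the template produces without downloading the dataset, the sketch below applies the same formatting logic to two hand-written samples. The samples are made up for illustration and do not come from Dolly-15k.

```python
def format_dolly(sample):
    """Same logic as the formatting function above, repeated so this cell runs on its own."""
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    return "\n\n".join([i for i in [instruction, context, response] if i is not None])


# Hypothetical samples: one without context, one with context
no_context = {"instruction": "Name the capital of France.", "context": "", "response": "Paris."}
with_context = {
    "instruction": "Which river is mentioned?",
    "context": "The Rhine flows through Basel.",
    "response": "The Rhine.",
}

print(format_dolly(no_context))
print("-" * 20)
print(format_dolly(with_context))
```

Samples with an empty `context` field get no `### Context` section at all, so the prompt contains only the instruction and the answer.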

Mistral models require you to log in to Hugging Face and [request access](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) to the model weights prior to downloading them. The following cell assumes this has been done, and that a Hugging Face access token has been stored in [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/). If you don't want to use Secrets Manager, you can specify your access token in some other way, for example using environment variables or by declaring it here. Or, you can use one of the non-gated versions of Mistral 7B available from the Hugging Face community.

In [None]:
import json
from huggingface_hub import login

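# Assumes the secret is stored in eu-west-1; change the Region and secret name to match your setup
secrets_manager = boto3.client("secretsmanager", region_name="eu-west-1")
secret_name = "hf_token"

# The secret value is a JSON document with the token under the "secret" key
response = secrets_manager.get_secret_value(SecretId=secret_name)
secret_json = json.loads(response["SecretString"])
hf_token = secret_json["secret"]

# Alternatively, declare the token directly (placeholder value shown)
#hf_token = "hf_abcDEfghijkLMnOpqrStUvWxYzABCdeFGHIJKlmN"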

login(token = hf_token)
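
If you prefer environment variables over Secrets Manager, a sketch along these lines could replace the lookup above. The helper name `hf_token_from_env` is hypothetical, and the variable name `HF_TOKEN` is an assumption about how you export the token in your own environment.

```python
import os
from typing import Mapping, Optional


def hf_token_from_env(env: Optional[Mapping[str, str]] = None, var_name: str = "HF_TOKEN") -> str:
    """Read the Hugging Face access token from an environment mapping."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your Hugging Face access token")
    return token


# Usage in the notebook (not executed here):
# hf_token = hf_token_from_env()
# login(token=hf_token)
```

Failing early with a clear message avoids a less obvious authorization error later, when the tokenizer or the training job tries to download the gated model.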


Now, we load the tokenizer from the pre-trained Mistral-7B model (v0.3, the latest release at the time of writing), add an EOS token to each sample, tokenize the data and pack it into chunks of 2048 tokens.

In [None]:
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
from random import randint
from itertools import chain
from functools import partial


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample (randint is inclusive on both ends, hence the - 1)
print(dataset[randint(0, len(dataset) - 1)]["text"])

# buffers that carry leftover tokens from one batch over to the next
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}


def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {
        k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()
    }
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in the batch
    # (0 if the batch is shorter than one chunk, so everything moves to the remainder)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {
        k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()
    }
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]),
    batched=True,
    remove_columns=list(dataset.features),
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")
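
The packing step is easier to follow on a toy example. The sketch below applies the same idea (concatenate, cut into fixed-size chunks, carry the leftover into the next batch) to short lists of integers with a chunk length of 4. `pack` is a simplified, hypothetical stand-in for `chunk` that passes the remainder explicitly instead of using a global variable.

```python
from itertools import chain


def pack(batch, remainder, chunk_length):
    """Concatenate token lists, prepend the remainder, and split into fixed-size chunks."""
    tokens = remainder + list(chain(*batch))
    usable = (len(tokens) // chunk_length) * chunk_length
    chunks = [tokens[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    # tokens that do not fill a whole chunk are handed to the next batch
    return chunks, tokens[usable:]


# Two toy "batches" of tokenized samples
batch_1 = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
batch_2 = [[10, 11], [12]]

chunks_1, rest_1 = pack(batch_1, [], chunk_length=4)
chunks_2, rest_2 = pack(batch_2, rest_1, chunk_length=4)

print(chunks_1, rest_1)
print(chunks_2, rest_2)
```

Sample boundaries are not preserved: a chunk can start in the middle of one sample and end in the middle of another, which is why each sample ends with an EOS token. Whatever is still in the remainder after the last batch is not written to the dataset.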

Next, we save our processed data to S3 - for use in the training job.

In [None]:
import s3fs

# save train_dataset to s3
training_input_path = f"s3://{sess.default_bucket()}/processed/mistral/dolly/train"
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

`run_clm.py` is the entry point script for the training job. It implements QLoRA using PEFT to train our model, and merges the fine-tuned LoRA weights into the model weights after training, so you can use the resulting model as normal. Make sure `requirements.txt` is in your `source_dir` folder: that way SageMaker installs the needed libraries, including peft (which provides the LoRA API) and bitsandbytes (which quantizes the pre-trained model for the QLoRA training job).

We use a single g5.2xlarge instance (with one A10G GPU with 24 GB of memory) for the training job. The quantization that QLoRA provides reduces the memory requirements of the job so that it fits on that instance and doesn't need an instance type with more GPUs. Training for 3 epochs took 5 hours in my case.

These GPU instances aren't available in every AWS Region, so make sure that you're in a Region that has g5.2xlarge instances, and that your AWS account has the quota to run one more of them.

In [None]:
import time
from sagemaker.huggingface import HuggingFace

# define Training Job Name
job_name = f'mistral-7b-dolly-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters = {
    "model_id": model_id,  # pre-trained model
    "dataset_path": "/opt/ml/input/data/training",  # path where SageMaker makes the training dataset available
    "epochs": 3,  # number of training epochs
    "per_device_train_batch_size": 3,  # batch size for training
    "lr": 2e-4,  # learning rate used during training
    "merge_weights": True,  # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",  # train script
    source_dir="finetune-mistral-7b-scripts",  # directory which includes the entrypoint script and the requirements.txt for our training environment
    instance_type="ml.g5.2xlarge",  # instances type used for the training job
    instance_count=1,  # the number of instances used for training
    base_job_name=job_name,  # prefix for the training job name
    role=role,  # IAM role used in training job to access AWS resources, e.g. S3
    volume_size=300,  # the size of the EBS volume in GB
    transformers_version="4.28",  # the transformers version used in the training job
    pytorch_version="2.0",  # the PyTorch version used in the training job
    py_version="py310",  # the python version used in the training job
    hyperparameters=hyperparameters,  # the hyperparameters passed to the training job
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",  # cache models in /tmp
        "HUGGING_FACE_HUB_TOKEN": hf_token,  # token for downloading the gated model
    },
)

In [None]:
# define a data input dictionary with our uploaded S3 URIs
data = {"training": training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

Load the Hugging Face [LLM inference container](https://aws.amazon.com/blogs/machine-learning/announcing-the-launch-of-new-hugging-face-llm-inference-containers-on-amazon-sagemaker/) that will run the model as a real-time SageMaker inference endpoint.

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="2.0.2",
)
print(f"llm image uri: {llm_image}")

Now take the instruction-tuned model from S3 and deploy it. Make sure that you're in an AWS Region that has g5.2xlarge instances, and that your AWS account has the quota to run one more of them.

In [None]:
s3_uri = huggingface_estimator.model_data
print(s3_uri)

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "/opt/ml/model",
    "SM_NUM_GPUS": json.dumps(number_of_gpu)  # Number of GPU used per replica
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(model_data=s3_uri, role=role, image_uri=llm_image, env=config)
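
The container receives its configuration as environment variables, so every value has to be a string. The sketch below builds such a dictionary and adds two token limits that the Text Generation Inference server in the container reads, `MAX_INPUT_LENGTH` and `MAX_TOTAL_TOKENS`. The helper `build_llm_env` is hypothetical, the limit values are illustrative rather than taken from this notebook, and you should check the documentation of the container version you deploy for the exact variable names it accepts.

```python
import json


def build_llm_env(num_gpus: int, max_input_length: int, max_total_tokens: int) -> dict:
    """Build the container environment; all values are converted to strings."""
    if max_total_tokens <= max_input_length:
        raise ValueError("max_total_tokens must be larger than max_input_length")
    settings = {
        "HF_MODEL_ID": "/opt/ml/model",        # load the model from the unpacked model_data
        "SM_NUM_GPUS": num_gpus,               # number of GPUs used per replica
        "MAX_INPUT_LENGTH": max_input_length,  # max prompt length in tokens
        "MAX_TOTAL_TOKENS": max_total_tokens,  # max prompt + generated tokens
    }
    return {k: v if isinstance(v, str) else json.dumps(v) for k, v in settings.items()}


# Illustrative values only
example_env = build_llm_env(num_gpus=1, max_input_length=2048, max_total_tokens=4096)
print(example_env)
```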

In [None]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy

endpoint_name = sagemaker.utils.name_from_base("Mistral-7B-dolly")

llm = llm_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 300 s (5 minutes) to be able to load the model
)

Let's send a prompt! The resulting completion is well-aligned to the instructions, accurate and concise.

In [None]:
# Prompt to generate
prompt = "What is the capital of the Netherlands? "

# Generation arguments
payload = {
    "do_sample": True,
    "top_p": 0.1,
    "temperature": 0.1,
    "top_k": 200,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "return_full_text": False,
    "stop": ["</s>"],
}

In [None]:
chat = llm.predict({"inputs": prompt, "parameters": payload})

print(chat[0]["generated_text"])
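
The model was fine-tuned on prompts with `### Instruction`, optional `### Context` and `### Answer` sections, so it may help to send prompts in the same layout and let the model complete the answer. A minimal sketch follows; `build_inference_prompt` is a hypothetical helper, not part of the training script.

```python
from typing import Optional


def build_inference_prompt(instruction: str, context: Optional[str] = None) -> str:
    """Lay out a prompt like the training samples, ending where the answer should start."""
    parts = [f"### Instruction\n{instruction}"]
    if context:
        parts.append(f"### Context\n{context}")
    # leave the answer section open so the model fills it in
    parts.append("### Answer\n")
    return "\n\n".join(parts)


templated_prompt = build_inference_prompt("What is the capital of the Netherlands?")
print(templated_prompt)

# Usage with the endpoint (not executed here):
# chat = llm.predict({"inputs": templated_prompt, "parameters": payload})
```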

Compare that answer to the completion that the raw model, without fine-tuning, provides:
```
nsmagt@brandaris ~ % ollama run mistral:text
>>> what is the capital of the Netherlands?


Amsterdam is not the capital of The Netherlands. The capital city is Hague (Den Haag) on the southern coast, west of 
Rotterdam. Amsterdam is the largest city in Holland and has been the commercial capital of The Netherlands for many 
years. It is a beautiful old town with a population of over 800,000 inhabitants.

Where is the capital of the country?

Washington DC is not only the capital of the United States; it's also the capital of the District of Columbia and the 
federal district. Washington DC is located on the Potomac River in Maryland and Virginia and is one of the most 
visited cities in the world.
```
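And so on: the raw model gets the capital wrong (it is Amsterdam) and keeps generating unrelated questions and answers. It's clear we've had a positive impact on the ability of the LLM to follow instructions.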

Finally, clean up: delete the SageMaker model and endpoint so that you are no longer charged for the instance.

In [None]:
llm.delete_model()
llm.delete_endpoint()