
# Fine-tune Mistral-7B with QLoRA and SageMaker remote decorator

## Unsupervised fine-tuning

---

In this demo notebook, we demonstrate how to fine-tune the Mistral-7B model using QLoRA, Hugging Face PEFT, and bitsandbytes.

We use the SageMaker remote decorator to run the fine-tuning job as an Amazon SageMaker Training job.

---
SageMaker Studio Kernel: PyTorch 2.0.0 Python 3.10

JupyterLab Instance Type: ml.t3.medium

Fine-Tuning:
* Instance Type: ml.g5.12xlarge

Install the required libraries, including the Hugging Face libraries, and restart the kernel.

In [None]:
%pip install -r requirements.txt

In [1]:
%pip install -q -U datasets==2.16.1
%pip install -q -U langchain==0.1.5
%pip install -q -U scikit-learn

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyter-ai 2.6.0 requires faiss-cpu, which is not installed.
jupyter-ai 2.6.0 requires langchain==0.0.318, but you have langchain 0.1.5 which is incompatible.
jupyter-ai-magics 2.6.0 requires langchain==0.0.318, but you have langchain 0.1.5 which is incompatible.



## Setup Configuration file path

We set the directory that contains the `config.yaml` file so that the remote decorator can pick up its settings.


In [2]:
import os

# Set path to config file
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()
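
For orientation, the sketch below assembles the kind of settings such a `config.yaml` holds and renders them as YAML text. The key names mirror the ones the SDK reports applying in this notebook's logs (`ImageUri`, `Dependencies`, `InstanceType`). The values, the image URI placeholder, and the `to_yaml` helper are illustrative assumptions, not the repository's actual file.

```python
# Illustrative settings only: key names follow the SageMaker SDK defaults-config layout,
# values are placeholders/assumptions, not the repository's real config.yaml.
remote_function_defaults = {
    "SchemaVersion": "1.0",
    "SageMaker": {
        "PythonSDK": {
            "Modules": {
                "RemoteFunction": {
                    "Dependencies": "./requirements.txt",
                    "ImageUri": "<TRAINING_IMAGE_URI>",  # placeholder, fill in your own image
                    "InstanceType": "ml.g5.12xlarge",
                }
            }
        }
    },
}


def to_yaml(node, indent=0):
    """Render a nested dict of strings as YAML lines (hypothetical helper)."""
    lines = []
    for key, value in node.items():
        prefix = " " * indent
        if isinstance(value, dict):
            lines.append(f"{prefix}{key}:")
            lines.extend(to_yaml(value, indent + 2))
        else:
            lines.append(f'{prefix}{key}: "{value}"')
    return lines


config_yaml = "\n".join(to_yaml(remote_function_defaults))
print(config_yaml)
```

With settings like these in place, `@remote` can be used without repeating the instance type or dependencies on every decorated function.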

## Visualize and upload the dataset

Load the training documents from the web with LangChain's `WebBaseLoader`, then convert them into a Hugging Face `Dataset`.

In [3]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader([
    "https://aws.amazon.com/bedrock/",
    "https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html",
    "https://aws.amazon.com/blogs/aws/preview-enable-foundation-models-to-complete-tasks-with-agents-for-amazon-bedrock/",
    "https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html",
    "https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html",

])

data = loader.load()

In [7]:
from datasets import Dataset

def strip_spaces(doc):
    return {"text": doc.page_content.replace("  ", "")}

stripped_data = list(map(strip_spaces, data))

train_dataset = Dataset.from_list(stripped_data)

train_dataset

Dataset({
    features: ['text'],
    num_rows: 5
})
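
Note that `strip_spaces` deletes every pair of consecutive spaces, so tabs, newlines, and odd-length runs of spaces can survive. As a possible alternative (an assumption, not what this notebook uses), the sketch below collapses any run of whitespace into a single space with the standard `re` module; `normalize_whitespace` is a hypothetical name.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Amazon   Bedrock\n\n\n  is a fully managed    service"
print(normalize_whitespace(sample))
```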

In [8]:
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token


Next, we tokenize the inputs and split them into fixed-length chunks that the LLM can consume. For additional details, refer to the blog post [Leveraging qLoRA for Fine-Tuning of Task-Fine-Tuned Models Without Catastrophic Forgetting: A Case Study with LLaMA2(-chat)](https://medium.com/towards-data-science/leveraging-qlora-for-fine-tuning-of-task-fine-tuned-models-without-catastrophic-forgetting-d9bcd594cff4)

In [9]:
from itertools import chain
from functools import partial

remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get the largest multiple of chunk_length that fits in the batch
    # (0 when the batch is shorter than a single chunk)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result
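
To make the chunking logic easier to follow, here is a minimal sketch on toy token IDs. It passes the leftover tokens explicitly instead of using a global variable; `chunk_tokens` is a hypothetical helper written for illustration, not part of the notebook.

```python
def chunk_tokens(token_ids, chunk_length, remainder=None):
    """Split token IDs into fixed-size chunks; return (chunks, leftover)."""
    buffer = (remainder or []) + list(token_ids)
    # Largest multiple of chunk_length that fits in the buffer
    usable = (len(buffer) // chunk_length) * chunk_length
    chunks = [buffer[i:i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, buffer[usable:]


chunks, leftover = chunk_tokens(range(10), chunk_length=4)
print(chunks)    # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(leftover)  # [8, 9]

# The leftover from the first batch is prepended to the next batch
chunks, leftover = chunk_tokens([10, 11, 12], chunk_length=4, remainder=leftover)
print(chunks)    # [[8, 9, 10, 11]]
print(leftover)  # [12]
```

Tokens that do not fill a complete chunk are carried over to the next batch, so only the tail of the final batch is dropped.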



To train our model, we need to convert our inputs (text) to token IDs. This is done by a Hugging Face Transformers tokenizer. Following the QLoRA approach, we use bitsandbytes to quantize the frozen LLM to 4-bit precision and attach LoRA adapters to it.
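
As a rough illustration of why 4-bit quantization matters, the arithmetic below estimates the memory needed for the weights alone at different precisions. The parameter count (about 7.24 billion) is an approximation assumed for this sketch. The estimate ignores quantization constants, layers kept in higher precision, activations, gradients, and optimizer state.

```python
# Approximate parameter count for Mistral-7B (assumption: about 7.24 billion)
num_params = 7.24e9

# Storage cost per parameter, in bytes, for each precision
bytes_per_param = {"fp32": 4.0, "bf16": 2.0, "nf4": 0.5}

weights_gib = {
    dtype: num_params * size / 1024**3 for dtype, size in bytes_per_param.items()
}

for dtype, gib in weights_gib.items():
    print(f"{dtype}: ~{gib:.1f} GiB for the weights alone")
```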



In [10]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
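
To get a feel for the numbers this helper reports, the sketch below estimates how many parameters LoRA adds when every linear projection in the decoder layers is adapted. The layer dimensions are assumptions taken from the public Mistral-7B configuration, and `r = 64` matches the `lora_r` default of the `train_fn` in this notebook; verify both before relying on the result.

```python
# Assumed Mistral-7B dimensions (from the public model config; verify before relying on them)
hidden, intermediate, kv_dim, num_layers = 4096, 14336, 1024, 32
r = 64  # LoRA rank

# (in_features, out_features) of each linear projection in one decoder layer
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# LoRA adds A (r x in) and B (out x r) per adapted layer: r * (in + out) parameters
per_layer = sum(r * (d_in + d_out) for d_in, d_out in projections.values())
lora_params = per_layer * num_layers

print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"LoRA parameters in total: {lora_params:,}")
```

Under these assumptions the adapters account for only a small percentage of the roughly 7 billion base parameters, which is what keeps the fine-tuning job affordable.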

Utility function that finds the 4-bit linear modules to target with LoRA adapters. See the [QLoRA reference implementation](https://github.com/artidoro/qlora/blob/main/qlora.py) for additional details.

In [11]:
import bitsandbytes as bnb

def find_all_linear_names(hf_model):
    lora_module_names = set()
    for name, module in hf_model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)

/opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32


  warn("The installed version of bitsandbytes was compiled without GPU support. "
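
The name handling inside `find_all_linear_names` can be tried without a GPU. The sketch below applies the same "keep the last path component, drop `lm_head`" logic to a plain list of names. It assumes the names were already filtered down to linear layers; the module paths and the `leaf_module_names` helper are hypothetical.

```python
def leaf_module_names(qualified_names):
    """Return the unique last path component of each dotted module name."""
    leaves = {name.split(".")[-1] for name in qualified_names}
    leaves.discard("lm_head")  # the output head is excluded from the LoRA targets
    return sorted(leaves)


# Hypothetical module paths in the style of a Hugging Face decoder model
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]

print(leaf_module_names(names))
```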


Define the train function

In [12]:
from huggingface_hub import login
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from sagemaker.remote_function import remote
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import transformers

# Start training
@remote(volume_size=100, job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}-merge")
def train_fn(
        model_name,
        train_ds,
        test_ds=None,
        lora_r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        chunk_size=2048,
        merge_weights=False,
        token=None
):
    if token is not None:
        login(token=token)

    # tokenize and chunk dataset
    lm_train_dataset = train_ds.map(
        lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(train_ds.features)
    ).map(
        partial(chunk, chunk_length=chunk_size),
        batched=True,
    )

    if test_ds is not None:
        lm_test_dataset = test_ds.map(
            lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(test_ds.features)
        ).map(
            partial(chunk, chunk_length=chunk_size),
            batched=True,
        )

        print(f"Total number of test samples: {len(lm_test_dataset)}")
    else:
        lm_test_dataset = None

    # Print total number of train samples
    print(f"Total number of train samples: {len(lm_train_dataset)}")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        quantization_config=bnb_config,
        device_map="auto")

    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

    # get lora target modules
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} modules to adapt with LoRA: {modules}")

    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)
    print_trainable_parameters(model)

    trainer = transformers.Trainer(
        model=model,
        train_dataset=lm_train_dataset,
        eval_dataset=lm_test_dataset if lm_test_dataset is not None else None,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            logging_steps=2,
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            bf16=True,
            save_strategy="no",
            output_dir="outputs"
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    model.config.use_cache = False

    trainer.train()

    if merge_weights:
        output_dir = "/tmp/model"

        # merge adapter weights with base model and save:
        # first, save the trained LoRA adapter weights
        trainer.model.save_pretrained(output_dir, safe_serialization=False)
        # clear memory
        del model
        del trainer
        torch.cuda.empty_cache()

        # load PEFT model in fp16
        model = AutoPeftModelForCausalLM.from_pretrained(
            output_dir,
            low_cpu_mem_usage=True,
            torch_dtype=torch.float16,
        )
        # Merge LoRA and base model and save
        model = model.merge_and_unload()
        model.save_pretrained(
            "/opt/ml/model", safe_serialization=True, max_shard_size="2GB"
        )
    else:
        model.save_pretrained("/opt/ml/model", safe_serialization=True)

    tmp_tokenizer = AutoTokenizer.from_pretrained(model_name)
    tmp_tokenizer.save_pretrained("/opt/ml/model")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Fetched defaults config from location: /home/sagemaker-user/sagemaker-new/amazon-sagemaker-remote-decorator-generative-ai
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.ImageUri
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.InstanceType
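
When `merge_weights=True`, `train_fn` reloads the adapter and calls `merge_and_unload()`. Conceptually, merging folds the low-rank update into the base weight, W' = W + (alpha / r) · B · A, so inference needs no separate adapter path. The toy example below is a sketch of that idea in plain Python with made-up matrices. It is not PEFT's implementation and it assumes the standard `lora_alpha / r` scaling.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


lora_alpha, r = 16, 2  # toy rank for illustration
scale = lora_alpha / r

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen base weight (out x in)
A = [[0.5, 0.0, 0.0], [0.0, 0.25, 0.0]]                   # LoRA A (r x in)
B = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]                  # LoRA B (out x r)
x = [[1.0], [2.0], [3.0]]                                 # input column vector

# Adapter kept separate: y = W x + scale * B (A x)
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged into the weight: W' = W + scale * B A, then y = W' x
W_merged = add(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print(y_adapter)
print(y_merged)
```

Both paths produce the same output, which is why a merged model can be served as a regular checkpoint.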


In [13]:
train_fn(
    model_id,
    train_dataset,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=10,
    chunk_size=2048,
    merge_weights=True
)

2024-02-05 00:55:41,107 sagemaker.remote_function INFO     Serializing function code to s3://sagemaker-eu-west-1-691148928602/train-Mistral-7B-Instruct-v0-1-merge-2024-02-05-00-55-41-107/function
2024-02-05 00:55:41,283 sagemaker.remote_function INFO     Serializing function arguments to s3://sagemaker-eu-west-1-691148928602/train-Mistral-7B-Instruct-v0-1-merge-2024-02-05-00-55-41-107/arguments
2024-02-05 00:55:41,523 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmpy4qe6wrs/temp_workspace/sagemaker_remote_function_workspace/requirements.txt'
2024-02-05 00:55:41,525 sagemaker.remote_function INFO     Successfully created workdir archive at '/tmp/tmpy4qe6wrs/workspace.zip'
2024-02-05 00:55:41,572 sagemaker.remote_function INFO     Successfully uploaded workdir to 's3://sagemaker-eu-west-1-691148928602/train-Mistral-7B-Instruct-v0-1-merge-2024-02-05-00-55-41-107/sm_rf_user_ws/workspace.zip'
2024-02-05 00:55:41,574 sagemaker.remote_function I

2024-02-05 00:55:41 Starting - Starting the training job...
2024-02-05 00:56:06 Starting - Preparing the instances for training.........
2024-02-05 00:57:21 Downloading - Downloading input data...
2024-02-05 00:57:56 Downloading - Downloading the training image...............
2024-02-05 01:00:37 Training - Training image download completed. Training in progress........
INFO: CONDA_PKGS_DIRS is set to '/opt/ml/sagemaker/warmpoolcache/sm_remotefunction_user_dependencies_cache/conda/pkgs'
INFO: PIP_CACHE_DIR is set to '/opt/ml/sagemaker/warmpoolcache/sm_remotefunction_user_dependencies_cache/pip'
INFO: Bootstraping runtime environment.
2024-02-05 01:01:44,313 sagemaker.remote_function INFO     Successfully unpacked workspace archive at '/'.
2024-02-05 01:01:44,313 sagemaker.remote_function INFO     '/sagemaker_remote_function_workspace/pre_exec.sh' does not exist. Assuming no pre-execution commands to run
2024-02-05 01:01:44,313 sagemaker.r

## Deploy Fine-Tuned model

Note: Deployment requires the merged model, so run `train_fn` with `merge_weights=True` before continuing.

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

bucket_name = "<S3_BUCKET>"
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}-merge"

In [None]:
def get_last_job_name(job_name_prefix):
    import boto3
    sagemaker_client = boto3.client('sagemaker')
    
    search_response = sagemaker_client.search(
        Resource='TrainingJob',
        SearchExpression={
            'Filters': [
                {
                    'Name': 'TrainingJobName',
                    'Operator': 'Contains',
                    'Value': job_name_prefix
                },
                {
                    'Name': 'TrainingJobStatus',
                    'Operator': 'Equals',
                    'Value': "Completed"
                }
            ]
        },
        SortBy='CreationTime',
        SortOrder='Descending',
        MaxResults=1)
    
    return search_response['Results'][0]['TrainingJob']['TrainingJobName']
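
`get_last_job_name` raises an `IndexError` when no completed job matches the prefix. A minimal sketch of a more defensive lookup follows. It only inspects a response dictionary, so it runs without AWS access; `latest_job_name` and the sample job name are hypothetical.

```python
def latest_job_name(search_response):
    """Return the first training job name in a search response, or None if empty."""
    results = search_response.get("Results", [])
    if not results:
        return None
    return results[0]["TrainingJob"]["TrainingJobName"]


# Hand-written responses shaped like the fields read by get_last_job_name
empty_response = {"Results": []}
one_job_response = {
    "Results": [{"TrainingJob": {"TrainingJobName": "train-example-merge-001"}}]
}

print(latest_job_name(empty_response))
print(latest_job_name(one_job_response))
```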

In [None]:
job_name = get_last_job_name(job_prefix)

job_name

### Inference configurations

In [None]:
instance_count = 1
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

In [None]:
image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.1.0"
)

image_uri

In [None]:
model = HuggingFaceModel(
    image_uri=image_uri,
    model_data=f"s3://{bucket_name}/{job_name}/{job_name}/output/model.tar.gz",
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
    }
)

In [None]:
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)

### Predict

In [None]:
base_prompt = f"""
<s>[INST]
{{question}} 
[/INST]
"""

In [None]:
prompt = base_prompt.format(question="Which are the Foundation Models available in Amazon Bedrock?")

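predictor.predict({
    "inputs": prompt,
    "parameters": {
        # len(prompt) counts characters, not tokens; it is used here as a rough
        # stand-in that typically overestimates the prompt's token count
        "max_new_tokens": 2048 - len(prompt),
        "temperature": 0.2,
        "top_p": 0.9
    }
})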
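
The request above subtracts `len(prompt)`, a character count, from 2048 to leave room for the generated tokens. The sketch below separates that budgeting step into a small function that takes a token count and never returns less than 1. The whitespace-based counter is a crude stand-in assumed for this demo; in practice the model's tokenizer would give the real count. Both helper names are hypothetical, and 2048 is the figure used in the request above.

```python
def max_new_tokens_budget(prompt_token_count, max_total_tokens=2048):
    """Tokens left for generation once the prompt is accounted for (at least 1)."""
    return max(1, max_total_tokens - prompt_token_count)


def rough_token_count(text):
    """Crude whitespace-based stand-in for a real tokenizer (assumption for the demo)."""
    return len(text.split())


question = "Which are the Foundation Models available in Amazon Bedrock?"
demo_prompt = f"<s>[INST] {question} [/INST]"

print(max_new_tokens_budget(rough_token_count(demo_prompt)))
print(max_new_tokens_budget(5000))  # prompt longer than the budget: clamp to 1
```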

#### Delete Endpoint

In [None]:
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)