# Fine-tune Llama-13B with QLoRA and SageMaker remote decorator

## Unsupervised fine-tuning

---

In this demo notebook, we demonstrate how to fine-tune the Llama-13B model using QLoRA, Hugging Face PEFT, and bitsandbytes.

We are using SageMaker remote decorator for runinng the fine-tuning job on Amazon SageMaker Training job
---
SageMaker Studio Kernel: PyTorch 2.0.0 Python 3.10

Instance Type: ml.g5.12xlarge

Install the required libriaries, including the Hugging Face libraries, and restart the kernel.

In [None]:
%pip install -r requirements.txt

In [None]:
%pip install -q -U transformers==4.35.1
%pip install -q -U datasets==2.13.1
%pip install -q -U peft==0.6.2
%pip install -q -U accelerate==0.24.1
%pip install -q -U bitsandbytes==0.41.2.post2
%pip install -q -U boto3
%pip install -q -U langchain==0.0.283
%pip install -q -U sagemaker==2.154.0
%pip install -q -U scikit-learn


## Setup Configuration file path

We are setting the directory in which the config.yaml file resides so that remote decorator can make use of the settings.


In [2]:
import os

# Set path to config file
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()

## Visualize and upload the dataset

Read train dataset in a Pandas dataframe

In [3]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader([
    "https://aws.amazon.com/bedrock/",
    "https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html",
    "https://aws.amazon.com/blogs/aws/preview-enable-foundation-models-to-complete-tasks-with-agents-for-amazon-bedrock/",
    "https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html",
    "https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html",

])

data = loader.load()

In [4]:
from datasets import Dataset

def strip_spaces(doc):
    return {"text": doc.page_content.replace("  ", "")}

stripped_data = list(map(strip_spaces, data))

train_dataset = Dataset.from_list(stripped_data)

train_dataset

Dataset({
    features: ['text'],
    num_rows: 5
})



To train our model, we need to convert our inputs (text) to token IDs. This is done by a Hugging Face Transformers Tokenizer. In addition to QLoRA, we will use bitsanbytes 4-bit precision to quantize out frozen LLM to 4-bit and attach LoRA adapters on it.



In [5]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Use the Hugging Face Trainer class to fine-tune the model. Define the hyperparameters we want to use. We also create a DataCollator that will take care of padding our inputs and labels.

In [6]:
! huggingface-cli login --token <HF_TOKEN>

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/bpistone/.cache/huggingface/token
Login successful


In [7]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)

tokenizer.pad_token = tokenizer.eos_token



Downloading (…)okenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

Creating chunks and tokenizing the inputs for making it usable by the LLM. For additional details, please refer to the blog [Leveraging qLoRA for Fine-Tuning of Task-Fine-Tuned Models Without Catastrophic Forgetting: A Case Study with LLaMA2(-chat)](https://medium.com/towards-data-science/leveraging-qlora-for-fine-tuning-of-task-fine-tuned-models-without-catastrophic-forgetting-d9bcd594cff4)

In [8]:
from itertools import chain
from functools import partial

remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result

Utility method for finding the target modules and update the necessary matrices. Visit [this](https://github.com/artidoro/qlora/blob/main/qlora.py) link for additional info.

In [9]:
import bitsandbytes as bnb

def find_all_linear_names(hf_model):
    lora_module_names = set()
    for name, module in hf_model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)

  warn("The installed version of bitsandbytes was compiled without GPU support. "


'NoneType' object has no attribute 'cadam32bit_grad_fp32'


Define the train function

In [10]:
from huggingface_hub import login
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from sagemaker.remote_function import remote
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import transformers

# Start training
@remote(volume_size=100)
def train_fn(
        model_name,
        train_ds,
        lora_r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        chunk_size=2048,
        merge_weights=False,
        token=None
):
    if token is not None:
        login(token=token)

    # tokenize and chunk dataset
    lm_dataset = train_ds.map(
        lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(train_ds.features)
    ).map(
        partial(chunk, chunk_length=chunk_size),
        batched=True,
    )

    # Print total number of samples
    print(f"Total number of train samples: {len(lm_dataset)}")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        quantization_config=bnb_config,
        device_map="auto")

    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

    # get lora target modules
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} modules to quantize: {modules}")

    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)
    print_trainable_parameters(model)

    trainer = transformers.Trainer(
        model=model,
        train_dataset=lm_dataset,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            logging_steps=2,
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            bf16=True,
            save_strategy="no",
            output_dir="outputs"
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    model.config.use_cache = False

    trainer.train()

    if merge_weights:
        output_dir = "/tmp/model"

        # merge adapter weights with base model and save
        # save int 4 model
        trainer.model.save_pretrained(output_dir, safe_serialization=False)
        # clear memory
        del model
        del trainer
        torch.cuda.empty_cache()

        # load PEFT model in fp16
        model = AutoPeftModelForCausalLM.from_pretrained(
            output_dir,
            low_cpu_mem_usage=True,
            torch_dtype=torch.float16,
        )
        # Merge LoRA and base model and save
        model = model.merge_and_unload()
        model.save_pretrained(
            "/opt/ml/model", safe_serialization=True, max_shard_size="2GB"
        )
    else:
        model.save_pretrained("/opt/ml/model", safe_serialization=True)

    tmp_tokenizer = AutoTokenizer.from_pretrained(model_name)
    tmp_tokenizer.save_pretrained("/opt/ml/model")

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Fetched defaults config from location: /Users/bpistone/development/amazon/amazon-sagemaker-remote-decorator-generative-ai


INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Fetched defaults config from location: /Users/bpistone/development/amazon/amazon-sagemaker-remote-decorator-generative-ai
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.ImageUri
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.InstanceType
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.RoleArn


In [11]:
train_fn(
    model_id,
    train_dataset,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=20,
    chunk_size=2048,
    merge_weights=True,
    token="<HF_TOKEN>"
)

2023-11-15 08:28:23,163 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/var/folders/1d/p7dclqcx4934dybvv117p3640000gr/T/tmp_jnrt7xr/temp_workspace/sagemaker_remote_function_workspace/requirements.txt'
2023-11-15 08:28:23,164 sagemaker.remote_function INFO     Successfully created workdir archive at '/var/folders/1d/p7dclqcx4934dybvv117p3640000gr/T/tmp_jnrt7xr/workspace.zip'
2023-11-15 08:28:23,273 sagemaker.remote_function INFO     Successfully uploaded workdir to 's3://sagemaker-eu-west-1-691148928602/train-fn-2023-11-15-08-28-22-524/sm_rf_user_ws/workspace.zip'
2023-11-15 08:28:23,274 sagemaker.remote_function INFO     Serializing function code to s3://sagemaker-eu-west-1-691148928602/train-fn-2023-11-15-08-28-22-524/function
2023-11-15 08:28:25,050 sagemaker.remote_function INFO     Serializing function arguments to s3://sagemaker-eu-west-1-691148928602/train-fn-2023-11-15-08-28-22-524/arguments
2023-11-15 08:28:25,283 sagemaker.remote_functi

2023-11-15 08:28:25 Starting - Starting the training job...
2023-11-15 08:28:50 Starting - Preparing the instances for training......
2023-11-15 08:29:55 Downloading - Downloading input data...
2023-11-15 08:30:20 Training - Downloading the training image...........................
2023-11-15 08:34:52 Training - Training image download completed. Training in progress.......[34mINFO: CONDA_PKGS_DIRS is set to '/opt/ml/sagemaker/warmpoolcache/sm_remotefunction_user_dependencies_cache/conda/pkgs'[0m
[34mINFO: PIP_CACHE_DIR is set to '/opt/ml/sagemaker/warmpoolcache/sm_remotefunction_user_dependencies_cache/pip'[0m
[34mINFO: Bootstraping runtime environment.[0m
[34m2023-11-15 08:35:49,511 sagemaker.remote_function INFO     Successfully unpacked workspace archive at '/'.[0m
[34m2023-11-15 08:35:49,511 sagemaker.remote_function INFO     '/sagemaker_remote_function_workspace/pre_exec.sh' does not exist. Assuming no pre-execution commands to run[0m
[34m2023-11-15 08:35:49,511 sagema

DeserializationError: Corrupt metadata file. SHA256 hash for the serialized data does not exist. Please make sure to install SageMaker SDK version >= 2.156.0 on the client side and try again.

## Load Fine-Tuned model

Note: Run `train_fn` with `merge_weights=False`

### Download model

In [2]:
import boto3

s3_client = boto3.client("s3")

In [4]:
bucket_name = "<S3_BUCKET>"
job_name = "<JOB_NAME>"

In [24]:
s3_client.download_file(bucket_name, f"{job_name}/{job_name}/output/model.tar.gz", "model.tar.gz")

In [25]:
! rm -rf ./model && mkdir -p ./model && tar -xf model.tar.gz -C ./model

Now we are loading the PEFT weights trained

In [5]:
! huggingface-cli login --token <HF_TOKEN>

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [6]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token

In [7]:
from peft import PeftModel, PeftConfig
import torch
from transformers import AutoModelForCausalLM

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'

config = PeftConfig.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(model, "./model")
model.to(device)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 5120, padding_idx=0)
        (layers): ModuleList(
          (0-39): 40 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear(
                in_features=5120, out_features=5120, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=5120, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=5120, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear(
                in_features=5120, out_features=5120, bias=False
       

Load a test dataset and try a random sample for Q&A.

In [14]:
# format sample
prompt = f"""
<s>[INST] {{question}} [/INST]
"""

test_sample = prompt.format(question="What are Amazon Bedrock Agents?")

In [15]:
input_ids = tokenizer(test_sample, return_tensors="pt").input_ids

In [16]:
#set the tokens for the summary evaluation
tokens_for_answer = 150
output_tokens = input_ids.shape[1] + tokens_for_answer

outputs = model.generate(inputs=input_ids.to(device), do_sample=True, max_length=output_tokens)
gen_text = tokenizer.batch_decode(outputs)[0]

print(gen_text)

<s> ### Instruction
What are Amazon Bedrock Agents?

### Answer
Amazon Bedrock agents are a new class of agents available in Amazon Bedrock that make it easier for developers to build and manage generative AI applications without having to manage the underlying infrastructure. Agents provide a simple API for you to define the actions and intent that your generative AI application can perform in response to a user request. Agents handle the complexity of building and managing the underlying generative AI application, including data access, API calls, and infrastructure management.

Agents are a highly available, fully managed service that can scale to handle the concurrent user traffic and request load. They can also integrate with other Amazon Bedrock features, such as Amazon Bedrock Serverless Compute, to provide a


## Deploy Fine-Tuned model

In [12]:
import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

In [13]:
bucket_name = "<S3_BUCKET>"
job_name = "<JOB_NAME>"

### Inference configurations

In [15]:
instance_count = 1
instance_type = "ml.g5.48xlarge"
number_of_gpu = 8
health_check_timeout = 300

In [16]:
image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.1.0"
)

image_uri

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Fetched defaults config from location: /Users/bpistone/development/amazon/amazon-sagemaker-remote-decorator-generative-ai


INFO:sagemaker.image_uris:Defaulting to only available Python version: py39
INFO:sagemaker.image_uris:Defaulting to only supported image scope: gpu.


'763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04'

In [17]:
model = HuggingFaceModel(
    image_uri=image_uri,
    model_data=f"s3://{bucket_name}/{job_name}/{job_name}/output/model.tar.gz",
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
    }
)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Fetched defaults config from location: /Users/bpistone/development/amazon/amazon-sagemaker-remote-decorator-generative-ai
sagemaker.config INFO - Applied value from config key = SageMaker.Model.ExecutionRoleArn
sagemaker.config INFO - Applied value from config key = SageMaker.Model.VpcConfig
sagemaker.config INFO - Applied value from config key = SageMaker.Model.EnableNetworkIsolation


INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Fetched defaults config from location: /Users/bpistone/development/amazon/amazon-sagemaker-remote-decorator-generative-ai


In [18]:
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)

INFO:sagemaker:Creating model with name: huggingface-pytorch-tgi-inference-2023-11-15-08-54-07-671
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-tgi-inference-2023-11-15-08-54-08-877
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-tgi-inference-2023-11-15-08-54-08-877


------------!

### Predict

In [32]:
prompt = f"""
<s>[INST] {{question}} [/INST]
"""

In [34]:
predictor.predict({
	"inputs": prompt.format(question="What are Amazon Bedrock Agents?"),
    "parameters": {
        "max_new_tokens": 2048 - len(prompt),
        "temperature": 0.2,
        "top_p": 0.9,
        "stop": ["."]
    }
})

[{'generated_text': 'Amazon Bedrock is a cloud-based platform that makes it easy to build managed agents for Amazon Bedrock agents are agents that dynamically generate and execute business-specific agents that can perform complex tasks outside of the agent itself—from identifying and engaging with other agents and services, to managing transactions, and filling knowledge gaps with additional research—all while maintaining privacy and security principles.'}]

#### Delete Endpoint

In [35]:
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting model with name: huggingface-pytorch-tgi-inference-2023-11-15-08-54-07-671
INFO:sagemaker:Deleting endpoint configuration with name: huggingface-pytorch-tgi-inference-2023-11-15-08-54-08-877
INFO:sagemaker:Deleting endpoint with name: huggingface-pytorch-tgi-inference-2023-11-15-08-54-08-877
