# Fine-tune LLaMA 2 on Amazon SageMaker

This notebook is adapted from Hugging Face's notebook: https://github.com/philschmid/sagemaker-huggingface-llama-2-samples/blob/master/training/sagemaker-notebook.ipynb

Main differences: this notebook 1) uses a custom dataset, 2) uses both a training and a validation dataset, 3) logs to TensorBoard, and 4) uses a custom evaluation metric.

## 1. Setup Development Environment

In [1]:
!pip install "transformers==4.31.0" datasets sagemaker --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.32.27 requires botocore==1.34.27, but you have botocore 1.34.132 which is incompatible.


To access any LLaMA 2 asset, we need to log in to our Hugging Face account. We can do this by running the following command:

In [3]:
# Replace YOUR-HUGGINGFACE-TOKEN with your access token
!huggingface-cli login --token YOUR-HUGGINGFACE-TOKEN

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [4]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::827930657850:role/service-role/AmazonSageMaker-ExecutionRole-20221027T154083
sagemaker bucket: sagemaker-us-east-1-827930657850
sagemaker session region: us-east-1


## 2. Load and prepare the dataset

We will use a medical QA dataset from the Hugging Face Hub.

In [21]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("mamachang/medical", split="train")
print(f"dataset size: {len(dataset)}")

dataset size: 10178


In [22]:
train_and_test_dataset = dataset.train_test_split(test_size=0.1, seed=40)

# Keep the training and test splits in separate variables for later use.
train_dataset = train_and_test_dataset["train"]
test_dataset = train_and_test_dataset["test"]
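
As a quick sanity check on the split sizes, the arithmetic can be sketched as follows. This assumes the test share is rounded up to a whole example and the rest goes to training; `split_sizes` is a hypothetical helper written for illustration, not a function of the `datasets` library.

```python
import math
from fractions import Fraction


def split_sizes(n_examples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and training gets the rest.
    """
    # Fraction(str(...)) keeps the arithmetic exact and avoids float noise
    n_test = math.ceil(n_examples * Fraction(str(test_size)))
    return n_examples - n_test, n_test


n_train, n_test = split_sizes(10178, 0.1)
print(n_train, n_test)  # 9160 1018
```

The 9160 training examples agree with the count shown by the `Map` progress bar in the tokenization output later in this section.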

In [24]:
dataset = train_dataset

To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function, `format_medical`, that takes a sample and returns a string in our instruction format.

In [28]:
def format_medical(sample):
    # fixed instruction shared by every sample (plain string, no placeholders needed)
    instruction = "### Instruction\nPlease answer with one of the option in the bracket.\n\n"
    # the context block is skipped when the sample has no input text
    context = f"### Context\n{sample['input']}\n\n" if len(sample["input"]) > 0 else None
    response = f"### Answer\n{sample['output']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

A randomly chosen formatted example is printed in the templating cell further below.
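
To see the shape of the template without loading the dataset, here is a minimal sketch that applies the same formatting logic to a made-up sample. The sample content is invented for illustration and is not taken from the dataset; the function body mirrors `format_medical` so the block runs on its own.

```python
def format_medical(sample):
    # Same template as the notebook's formatting function
    instruction = "### Instruction\nPlease answer with one of the option in the bracket.\n\n"
    context = f"### Context\n{sample['input']}\n\n" if len(sample["input"]) > 0 else None
    response = f"### Answer\n{sample['output']}"
    return "\n\n".join(part for part in [instruction, context, response] if part is not None)


# Hypothetical sample, invented for illustration only
toy_sample = {
    "input": "Q: Which vitamin deficiency causes scurvy? {'A': 'Vitamin C', 'B': 'Vitamin D'}",
    "output": "A: Vitamin C",
}

prompt = format_medical(toy_sample)
print(prompt)
```

Note that a sample with an empty `input` produces a prompt without the `### Context` block.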

In addition to formatting our samples, we also want to pack multiple samples into one sequence to make training more efficient.
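
The idea behind packing can be sketched on toy data: token lists are concatenated into one stream, cut into fixed-size chunks, and whatever does not fill a whole chunk is carried over to the next batch. The `pack` helper below is a simplified illustration with made-up integers in place of token ids, not the notebook's actual helper.

```python
def pack(token_lists, chunk_length, remainder=None):
    """Concatenate token lists and cut them into fixed-size chunks.

    Returns (chunks, leftover); leftover holds the tail that did not fill
    a whole chunk and can be passed back in with the next batch.
    """
    stream = list(remainder or [])
    for tokens in token_lists:
        stream.extend(tokens)
    # largest multiple of chunk_length that fits into the stream
    usable = (len(stream) // chunk_length) * chunk_length
    chunks = [stream[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, stream[usable:]


# Three short "tokenized samples" (toy integers instead of real token ids)
batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
chunks, leftover = pack(batch, chunk_length=4)
print(chunks)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(leftover)  # [9, 10]
```

Every chunk has exactly `chunk_length` tokens, so no padding is needed, at the cost of sample boundaries falling in the middle of a sequence.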

Please go to https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and agree to the License Agreement. Approval takes around 10-20 minutes.

In [26]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf" # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


We define some helper functions to pack our samples into sequences of a given length and then tokenize them.

In [29]:
from random import randint
from itertools import chain
from functools import partial


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_medical(sample)}{tokenizer.eos_token}"
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample (randint is inclusive on both ends, hence the -1)
print(dataset[randint(0, len(dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in the batch
    # (0 when the batch is shorter than one chunk, so everything moves to the remainder)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

Map:   0%|          | 0/9160 [00:00<?, ? examples/s]

### Instruction
Please answer with one of the option in the bracket.



### Context
Q:A 16-year-old boy is brought to his primary care physician for evaluation of visual loss and is found to have lens subluxation. In addition, he is found to have mild scoliosis that is currently being monitored. Physical exam reveals a tall and thin boy with long extremities. Notably, his fingers and toes are extended and his thumb and little finger can easily encircle his wrist. On this visit, the boy asks his physician about a friend who has a very similar physical appearance because his friend was recently diagnosed with a pheochromocytoma. He is worried that he will also get a tumor but is reassured that he is not at increased risk for any endocrine tumors. Which of the following genetic principles most likely explains why this patient and his friend have a similar physical appearance and yet only one is at increased risk of tumors?? 
{'A': 'Anticipation', 'B': 'Incomplete penetrance', 'C': 'Locus 

Total number of samples: 1415


After processing the datasets, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload them to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will pass the S3 paths to the training job later.

In [30]:
# split for training and validation
temp = lm_dataset.train_test_split(test_size=0.2, seed=40)
train_dataset_temp = temp["train"]
eval_dataset_temp = temp["test"]

In [31]:
# save the training and evaluation splits to S3
training_input_path = f's3://{sess.default_bucket()}/processed/llama/medical/train/'
eval_input_path = f's3://{sess.default_bucket()}/processed/llama/medical/eval/'
train_dataset_temp.save_to_disk(training_input_path)
eval_dataset_temp.save_to_disk(eval_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")
print(f"evaluation dataset to: {eval_input_path}")

severe performance issues, see also https://github.com/dask/dask/issues/10276

To fix, you should specify a lower version bound on s3fs, or
update the current installation.



Saving the dataset (0/1 shards):   0%|          | 0/1132 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/283 [00:00<?, ? examples/s]

uploaded data to:
training dataset to: s3://sagemaker-us-east-1-827930657850/processed/llama/medical-dec27/train/
evaluation dataset to: s3://sagemaker-us-east-1-827930657850/processed/llama/medical-dec27/eval/


## 3. Fine-Tune LLaMA 7B with QLoRA on Amazon SageMaker

In [22]:
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder
from sagemaker.debugger import TensorBoardOutputConfig

# define Training Job Name 
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
model_id = "meta-llama/Llama-2-7b-hf"
# timestamp that gives each run its own TensorBoard output path
str_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())

LOG_DIR = "/opt/ml/output/tensorboard"
tb_output_config = TensorBoardOutputConfig(s3_output_path=f"s3://{sess.default_bucket()}/tensorboard/{str_time}", container_local_output_path=LOG_DIR)

# hyperparameters, which are passed into the training job
hyperparameters = {
  'model_id': model_id,                             # pre-trained model
  'train_dataset_path': '/opt/ml/input/data/training', # where the 'training' channel is mounted
  'eval_dataset_path': '/opt/ml/input/data/testing',   # where the 'testing' channel is mounted
  'epochs': 10,                                     # number of training epochs
  'per_device_train_batch_size': 2,                 # batch size for training
  'per_device_eval_batch_size': 2,                  # batch size for validation
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': HfFolder.get_token(),                 # huggingface token to access llama 2
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # prefix for the training job name (SageMaker appends a timestamp)
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch_version version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    tensorboard_output_config=tb_output_config,
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)
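
SageMaker hands each entry of `hyperparameters` to the entry point as a `--key value` command-line argument and mounts every input channel under `/opt/ml/input/data/<channel>`. The contents of `scripts/run_clm.py` are not shown in this notebook, so the following is only a hypothetical sketch of how such a script might read its arguments; `parse_args` and all default values are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters a training job passes as `--key value` pairs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--train_dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--eval_dataset_path", type=str, default="/opt/ml/input/data/testing")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=1)
    parser.add_argument("--lr", type=float, default=5e-5)
    parser.add_argument("--hf_token", type=str, default=None)
    # Booleans arrive as text, so convert them explicitly
    parser.add_argument("--merge_weights", type=lambda v: str(v).lower() == "true", default=False)
    # parse_known_args tolerates extra arguments the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the argument list a job with the hyperparameters above could produce
args = parse_args([
    "--model_id", "meta-llama/Llama-2-7b-hf",
    "--epochs", "10",
    "--lr", "0.0002",
    "--merge_weights", "True",
])
print(args.epochs, args.lr, args.merge_weights)
```

Passing `argv` explicitly keeps the parser testable outside a training container; inside the job, `parse_args()` without arguments would read `sys.argv`.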

We can now start our training job with the `.fit()` method, passing the S3 paths of our datasets as input channels.

In [None]:
data = {'training': training_input_path,
        'testing': eval_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
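
To get a feel for how long the job will run, the number of optimizer steps can be estimated up front. The sketch below assumes a single GPU and no gradient accumulation (neither is confirmed here, since the training script is not shown), and `total_steps` is a hypothetical helper. The 1132 training sequences are the size of the training split saved to S3 earlier.

```python
import math


def total_steps(n_samples, per_device_batch_size, epochs, n_devices=1, grad_accum=1):
    """Estimate optimizer steps for a run (a final partial batch counts as a step)."""
    effective_batch = per_device_batch_size * n_devices * grad_accum
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    return steps_per_epoch * epochs


steps = total_steps(n_samples=1132, per_device_batch_size=2, epochs=10)
print(steps)  # 5660

# At the roughly 7.96 s/step reported in the job log, that is about 12.5 hours
print(round(steps * 7.96 / 3600, 1))  # 12.5
```

The estimate agrees with the total of 5660 steps shown in the progress bar of the job log.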

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2023-12-31-02-40-02-2023-12-31-02-40-06-250


2023-12-31 02:40:06 Starting - Starting the training job...
2023-12-31 02:40:21 Starting - Preparing the instances for training......
2023-12-31 02:41:27 Downloading - Downloading input data...
2023-12-31 02:41:52 Downloading - Downloading the training image...
1%|▏         | 73/5660 [09:40<12:20:51,  7.96s/it]
1%|▏         | 74/5660 [09:48<12:20:41,  7.96s/it]
1%|▏         | 75/5660 [09:56<12:20:33,  7.96s/it]
1%|▏         | 76/5660 [10:04<12:20:26,  7.96s/it]
1%|▏         | 77/5660 [10:12<12:20:20,  7.96s/it]
1%|▏         | 78/5660 [10:20<12:20:12,  7.96s/it]
1%|▏         | 79/5660 [10:28<12:20:03,  7.96s/it]
1%|▏         | 80/5660 [10:36<12:19:55,  7.96s/it]
{'loss': 0.9988, 'learning_rate': 0.0001971731448763251, 'epoch': 0.14}
1%|▏         | 80/5660 [10:36<12:19:55,  7.96s/it]
1%|▏         | 81/5660 [10:44<12:19:47,  7.96s/it]
1%|▏         | 82/5660 [10:52<12:19:43,  7.96s/it]

## Deploy Fine-Tuned Model on SageMaker Endpoint

You can deploy your fine-tuned LLaMA model to a SageMaker endpoint and use it for inference. See [Deploy Falcon 7B & 40B on Amazon SageMaker](https://www.philschmid.de/sagemaker-falcon-llm) and [Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker](https://www.philschmid.de/sagemaker-llm-vpc) for more details.

In [78]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.9.3"
)

print(f"llm image uri: {llm_image}")


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


INFO:sagemaker.image_uris:Defaulting to only available Python version: py39
INFO:sagemaker.image_uris:Defaulting to only supported image scope: gpu.


llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04


Add the S3 URI of the fine-tuned model weights `tar.gz` file:

In [79]:
s3_uri = 'YOUR-MODEL-S3-URI'

In [80]:
import json
from sagemaker.huggingface import HuggingFaceModel

instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 500

In [81]:
# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # local path where model_data is unpacked inside the container
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # Limits the number of tokens that can be processed in parallel during the generation
  'HUGGING_FACE_HUB_TOKEN': "YOUR-HUGGINGFACE-TOKEN", # replace with your access token; never commit a real token
  'HF_DATASETS_CACHE': '/tmp',
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data=s3_uri,
  env=config
)
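
The three token limits in `config` depend on each other. The sketch below checks the ordering as we understand it for the text-generation-inference container: the input limit must stay below the total limit, and the total limit must not exceed the batch limit. `check_token_limits` is a hypothetical helper, and the rules are stated as an assumption rather than taken from the container's documentation.

```python
import json


def check_token_limits(env):
    """Validate the ordering of the token limits in a container env dict.

    Assumed rules: MAX_INPUT_LENGTH < MAX_TOTAL_TOKENS <= MAX_BATCH_TOTAL_TOKENS.
    Returns the number of tokens left for generation after a full-length prompt.
    """
    max_input = int(json.loads(env["MAX_INPUT_LENGTH"]))
    max_total = int(json.loads(env["MAX_TOTAL_TOKENS"]))
    max_batch = int(json.loads(env["MAX_BATCH_TOTAL_TOKENS"]))
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_total > max_batch:
        raise ValueError("MAX_TOTAL_TOKENS must not exceed MAX_BATCH_TOTAL_TOKENS")
    return max_total - max_input


# Same values as in the config above
example_env = {
    "MAX_INPUT_LENGTH": json.dumps(2048),
    "MAX_TOTAL_TOKENS": json.dumps(4096),
    "MAX_BATCH_TOTAL_TOKENS": json.dumps(8192),
}
print(check_token_limits(example_env))  # 2048
```

With these values, a prompt that uses the full 2048-token input budget still leaves 2048 tokens for generation, which covers the `max_new_tokens` of 1024 used in the inference section.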


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [82]:
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=3600, # give the container up to 1 hour to load the model and pass health checks
  model_data_download_timeout=3600, # allow up to 1 hour for the model data download
)

INFO:sagemaker:Creating model with name: huggingface-pytorch-tgi-inference-2023-12-14-22-34-03-215
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-tgi-inference-2023-12-14-22-34-04-173
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-tgi-inference-2023-12-14-22-34-04-173


----------!

## Inference

Replace "YOUR-MODEL-NAME" with the name of the created endpoint.

In [61]:
from sagemaker.huggingface.model import HuggingFacePredictor
llm = HuggingFacePredictor("YOUR-MODEL-NAME")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [62]:
payload_params = {
    # "do_sample": True,
    "top_p": 0.99,
    "temperature": 0.01,
    # "top_k": 250,
    "max_new_tokens": 1024,
    # "repetition_penalty": 1.03,
    "stop": ["</answer>"],
}

In [63]:
def predict_llm(sample, payload_params, llm):
    """Send a prompt to the endpoint and return the generated text."""

    payload = {
        "inputs": sample,
        "parameters": payload_params,
    }

    # send request to endpoint
    response = llm.predict(payload)
    result = response[0]["generated_text"]

    return result
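
The request and response shapes can be exercised without a live endpoint. The sketch below uses `FakePredictor`, a hypothetical stand-in that records the payload and returns a canned answer in the same list-of-dicts shape the function expects; it is meant for offline testing only and says nothing about what the real model returns.

```python
class FakePredictor:
    """Hypothetical stand-in for an endpoint predictor, for offline testing only."""

    def __init__(self):
        self.last_payload = None

    def predict(self, payload):
        self.last_payload = payload
        # Mimics the response shape used above: a list with one dict
        return [{"generated_text": "B"}]


def predict_llm(sample, payload_params, llm):
    """Send a prompt to the predictor and return the generated text."""
    payload = {"inputs": sample, "parameters": payload_params}
    response = llm.predict(payload)
    return response[0]["generated_text"]


fake = FakePredictor()
answer = predict_llm("### Instruction\n...", {"max_new_tokens": 16}, fake)
print(answer)             # B
print(fake.last_payload)  # the payload that would be sent to the endpoint
```

Because `predict_llm` only relies on a `.predict(payload)` method, the same function works unchanged with the real `HuggingFacePredictor`.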


In [64]:
# use a name that does not shadow the built-in `input`
sample_input = test_dataset[0].get('input')
prompt = f"### Instruction\nPlease answer with one of the option in the bracket.\n### Context\n{sample_input}"
result = predict_llm(prompt, payload_params, llm)
print(result)