## Model Management for LoRA Fine-tuned models using Llama2 & Amazon SageMaker (Separate Adapter and Base Models)

In this notebook, we walk through an example of using LoRA techniques to fine-tune a Llama 2 7B model on Amazon SageMaker, and then add model governance using SageMaker Model Registry. This notebook focuses on saving and managing the LoRA adapter and the base model separately.
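
To give a feel for why it can make sense to store the adapter separately from the base model, here is a small arithmetic sketch. It assumes a single 4096 x 4096 projection matrix (4096 is the hidden size of Llama 2 7B) and a LoRA rank of 8; the rank is an illustrative assumption, not a value taken from the training script used later.

```python
# Illustrative arithmetic only: the rank and the layer shape are assumptions.
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full_params, adapter_params) for one weight matrix."""
    full = d_out * d_in              # parameters in the frozen base matrix W
    adapter = rank * (d_out + d_in)  # parameters in B (d_out x r) plus A (r x d_in)
    return full, adapter


full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"base matrix: {full:,} params | adapter: {adapter:,} params "
      f"({adapter / full:.2%} of the base)")
```

Because the adapter is a small fraction of the base weights, several adapters can be versioned against one registered base model without duplicating it.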

The example was tested on the following kernel and instance type:

<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel:</strong> PyTorch 2.0.0 Python 3.10 GPU Optimized, <strong>Instance Type:</strong> ml.g4dn.xlarge
</div>

In [None]:
!pip install -Uq pip

In [None]:
!pip install -Uq datasets
!pip install -Uq transformers==4.31.0
!pip install -Uq accelerate==0.21.0
!pip install -Uq "safetensors>=0.3.1"
!pip install -Uq botocore
!pip install -Uq boto3
!pip install -q sagemaker==2.177.0

In [None]:
!apt-get update && apt-get install -y -qq graphviz

In [None]:
!pip install -q anytree==2.8.0 pydot==1.4.2

## Setup

In [None]:
import os
import glob
import boto3
import pprint
from tqdm import tqdm
import sagemaker
from sagemaker.collection import Collection
from sagemaker.utils import name_from_base

In [None]:
sagemaker_session =  sagemaker.session.Session(boto3.session.Session(region_name="us-east-1")) #sagemaker.session.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
sm_client = boto3.client('sagemaker', region_name=region)
model_collector = Collection(sagemaker_session=sagemaker_session)

## Define Parameters 

In [None]:
# define base model name
model_group_for_base = "llama-2-7b" # we'll group all llama-2 variants under this collection 
model_id = f"Mikael110/{model_group_for_base}-guanaco-fp16" 
# define a base dataset to finetune this base model
dataset_name = "databricks/databricks-dolly-15k"

# s3 prefix
s3_key_prefix = model_id.replace('/', '-')

# base model collection name
model_registry_name_base = f"{s3_key_prefix}-base"
# finetuned model collection name
model_registry_name_finetuned = f"{s3_key_prefix}-finetuned"
model_group_for_finetune = dataset_name.split('/')[-1] # we will group all dataset finetunes to this and attach it back to the parent model

## Prepare Dataset

Split the data into training and validation sets and preview a sample from each.

In [None]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
train_dataset = load_dataset(dataset_name, split="train[:05%]")
validation_dataset = load_dataset(dataset_name, split="train[95%:]")

print(f"Training size: {len(train_dataset)} | Validation size: {len(validation_dataset)}")
print("\nTraining sample:\n")
print(train_dataset[randrange(len(train_dataset))])
print("\nValidation sample:\n")
print(validation_dataset[randrange(len(validation_dataset))])

In [None]:
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

Format the data for instruction fine tuning

In [None]:
from random import randrange

print(format_dolly(train_dataset[randrange(len(train_dataset))]))
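
As a quick offline check of the prompt template, the sketch below repeats `format_dolly` so that it runs standalone and applies it to two hypothetical records (made up for illustration, not taken from the dataset), one with a context and one without.

```python
def format_dolly(sample):
    # Same logic as the cell above, repeated so this snippet runs on its own.
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    return "\n\n".join([i for i in [instruction, context, response] if i is not None])


# Hypothetical records used only to illustrate the template.
with_context = {
    "instruction": "What color is the sky in the text?",
    "context": "The sky was green that day.",
    "response": "Green.",
}
without_context = {"instruction": "Name a primary color.", "context": "", "response": "Red."}

prompt_with = format_dolly(with_context)
prompt_without = format_dolly(without_context)

print(prompt_with)
print("-----")
print(prompt_without)
```

The `### Context` section is dropped entirely when the record has an empty context, so the model never sees an empty header.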

Load the tokenizer for Llama2

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
from random import randint
from itertools import chain
from functools import partial


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample


# apply prompt template per sample
# train
train_dataset = train_dataset.map(template_dataset, remove_columns=list(train_dataset.features))
# validation
validation_dataset = validation_dataset.map(template_dataset, remove_columns=list(validation_dataset.features))
# print random sample
print(validation_dataset[randint(0, len(validation_dataset) - 1)]["text"])  # randint is inclusive on both ends

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in this batch (0 if the batch is shorter than one chunk)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset

# training
lm_train_dataset = train_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(train_dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

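# validation (reset the remainder so leftover training tokens are not prepended to the validation set)
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}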
lm_valid_dataset = validation_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(validation_dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of chunked samples
print(f"Training samples: {len(lm_train_dataset)} | Validation samples: {len(lm_valid_dataset)}")
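
The packing step can be hard to follow at a chunk length of 2048, so here is a minimal sketch of the same idea on toy token IDs with a chunk length of 4. The function name `pack` and the explicit `remainder` argument are illustrative choices (the cell above keeps the remainder in a global variable instead).

```python
from itertools import chain


def pack(batch, remainder, chunk_length):
    """Concatenate token lists, cut them into fixed-size chunks, carry the tail over."""
    # Prepend whatever was left over from the previous batch.
    joined = {k: remainder.get(k, []) + list(chain(*batch[k])) for k in batch}
    total = len(joined["input_ids"])
    # Largest multiple of chunk_length that fits (0 if the batch is too short).
    usable = (total // chunk_length) * chunk_length
    chunks = {
        k: [v[i : i + chunk_length] for i in range(0, usable, chunk_length)]
        for k, v in joined.items()
    }
    new_remainder = {k: v[usable:] for k, v in joined.items()}
    # For causal language modeling the labels are a copy of the inputs.
    chunks["labels"] = [list(c) for c in chunks["input_ids"]]
    return chunks, new_remainder


first_chunks, leftover = pack({"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}, {}, chunk_length=4)
print(first_chunks["input_ids"], leftover["input_ids"])

second_chunks, leftover2 = pack({"input_ids": [[10, 11, 12]]}, leftover, chunk_length=4)
print(second_chunks["input_ids"], leftover2["input_ids"])
```

Nine tokens produce two full chunks and one leftover token, which is then completed by the next batch. Passing the remainder explicitly also makes it obvious when it should be reset, for example between the training and validation sets.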

## Upload dataset to S3

In [None]:
# save train_dataset to s3
training_input_path = f's3://{default_bucket}/largelanguagemodels/{model_id}/dataset/train'
lm_train_dataset.save_to_disk(training_input_path)

print(f"saving training dataset to: {training_input_path}")

# save validation_dataset to s3
validation_input_path = f's3://{default_bucket}/largelanguagemodels/{model_id}/dataset/validation'
lm_valid_dataset.save_to_disk(validation_input_path)

print(f"saving validation dataset to: {validation_input_path}")

## Register Base model into Model Registry

We register the base model in SageMaker Model Registry. This gives you a central repository to manage and version the base model, so you don't need to download it from the hub again each time you want to experiment or deploy.

---
Download and save the model.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_save_dir = f"./base_model/{model_id}"
os.makedirs(base_model_save_dir, exist_ok=True)

# save_pretrained() does not return the object itself, so load first and save in a separate step
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained(base_model_save_dir)

model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    device_map="auto"
)
model.save_pretrained(base_model_save_dir)

Remove the model to free GPU memory.

In [None]:
del model
import torch; torch.cuda.empty_cache()

Tar and upload the model to S3

In [None]:
model_tar_filename = "base-model.tar.gz"
print(f"Model tar file name: {model_tar_filename}")

In [None]:
%%time
!cd ./base_model && tar -czvf ./{model_tar_filename} ./{model_id}
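
A file named `.tar.gz` is expected to be gzip-compressed. As a minimal sketch, the same packaging step can be done with Python's standard `tarfile` module; the helper name `make_tar_gz` and the toy directory below are hypothetical and stand in for the real model folder.

```python
import os
import tarfile
import tempfile


def make_tar_gz(source_dir, archive_path):
    """Create a gzip-compressed tar archive containing source_dir."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the folder name inside the archive, not the full path.
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    # Toy stand-in for the saved model directory.
    model_dir = os.path.join(tmp, "toy-model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    archive = make_tar_gz(model_dir, os.path.join(tmp, "base-model.tar.gz"))

    # Opening with "r:gz" fails if the file is not actually gzip-compressed.
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)
```

For a multi-gigabyte model, compression takes noticeably longer than plain `tar`, so expect this step to be slow.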

In [None]:
%%time
model_data_uri = sagemaker.s3.S3Uploader.upload(
    local_path=f"./base_model/{model_tar_filename}",
    desired_s3_uri=f's3://{default_bucket}/largelanguagemodels/{model_id}/models/base',
)
print(model_data_uri)

### Create a Model Package Group 

In [None]:
# Model Package Group Vars
base_package_group_name = name_from_base(model_id.replace('/', '-'))
base_package_group_desc = f"Source: https://huggingface.co/{model_id}"
base_tags = [
    { 
        "Key": "modelType",
        "Value": "BaseModel"
    },
    { 
        "Key": "fineTuned",
        "Value": "False"
    },
    { 
        "Key": "sourceDataset",
        "Value": "None"
    }
]

model_package_group_input_dict = {
    "ModelPackageGroupName" : base_package_group_name,
    "ModelPackageGroupDescription" : base_package_group_desc,
    "Tags": base_tags
    
}
create_model_package_group_response = sm_client.create_model_package_group(
    **model_package_group_input_dict
)
print(f'Created ModelPackageGroup Arn : {create_model_package_group_response["ModelPackageGroupArn"]}')

# note: this variable holds the group ARN, which is accepted wherever a group name is expected below
base_model_pkg_group_name = create_model_package_group_response["ModelPackageGroupArn"]

### Register the Base Model

In [None]:
from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.28',
    pytorch_version='2.0',  
    py_version='py310',
    model_data=model_data_uri,
    role=role,
)

In [None]:
base_model_package = huggingface_model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=[
        "ml.p2.16xlarge", 
        "ml.p3.16xlarge", 
        "ml.g4dn.4xlarge", 
        "ml.g4dn.8xlarge", 
        "ml.g4dn.12xlarge", 
        "ml.g4dn.16xlarge"
    ],
    transform_instances=[
        "ml.p2.16xlarge", 
        "ml.p3.16xlarge", 
        "ml.g4dn.4xlarge", 
        "ml.g4dn.8xlarge", 
        "ml.g4dn.12xlarge", 
        "ml.g4dn.16xlarge"
    ],
    model_package_group_name=base_model_pkg_group_name,
    approval_status="Approved"
)

### Add Base Model to Model Collection
We can associate the base model and the fine-tuned model in a model collection. If you get a permission error while creating the collection, refer to the prerequisites or to this [AWS documentation on adding the IAM policy](https://docs.aws.amazon.com/sagemaker/latest/dg/modelcollections-permissions.html).

In [None]:
# create model collection
collection_name = name_from_base(model_group_for_base)
base_collection = model_collector.create(
    collection_name=collection_name
)

In [None]:
_response = model_collector.add_model_groups(
    collection_name=base_collection["Arn"], 
    model_groups=[base_model_pkg_group_name]
)

print(f"Model collection creation status: {_response}")

## Create A Fine Tuning Job

We will use a Hugging Face training estimator to fine-tune the Llama 2 model.

In [None]:
rm -rf `find -type d -name .ipynb_checkpoints`

In [None]:
from datetime import datetime
from sagemaker.huggingface import HuggingFace
from sagemaker.experiments.run import Run

# define Training Job Name 
time_suffix = datetime.now().strftime('%y%m%d%H%M')
job_name = f'huggingface-qlora-{time_suffix}'
experiments_name = f"exp-{model_id.replace('/', '-')}"
run_name = f"qlora-finetune-run-{time_suffix}"

with Run(
    experiment_name=experiments_name, 
    run_name=run_name, 
    sagemaker_session=sagemaker_session  # reuse the region-pinned session created in Setup
) as run:
    # create the Estimator
    huggingface_estimator = HuggingFace(
        entry_point='finetune_llm.py',      
        source_dir='code',         
        instance_type='ml.g5.2xlarge',   
        instance_count=1,       
        role=role,              
        volume_size=300,               
        transformers_version='4.28',            
        pytorch_version='2.0',             
        py_version='py310',           
        hyperparameters={
            'base_model_group_name': base_package_group_name,
            'model_id': model_id,                             
            'dataset_path': '/opt/ml/input/data/training',    
            'epochs': 1,                                      
            'per_device_train_batch_size': 2,                 
            'lr': 1e-4,
            'region': region,
        },
        sagemaker_session=sagemaker_session
    )

    # starting the train job with our uploaded datasets as input
    data = {
        'training': training_input_path, 
        'validation': validation_input_path
    }
    huggingface_estimator.fit(
        data, 
        wait=True,
        job_name=job_name
    )
    
    run.log_parameters(data)    

## Register the FineTuned model into Model Registry
Create model package group

In [None]:
# Model Package Group Vars
ft_package_group_name = name_from_base(f"{model_id.replace('/', '-')}-finetuned-sql")
ft_package_group_desc = f"QLoRA for model {model_id}"
ft_tags = [
    { 
        "Key": "modelType",
        "Value": "QLoRAModel"
    },
    { 
        "Key": "fineTuned",
        "Value": "True"
    },
    { 
        "Key": "sourceDataset",
        "Value": f"{dataset_name}"
    }
]

model_package_group_input_dict = {
    "ModelPackageGroupName" : ft_package_group_name,
    "ModelPackageGroupDescription" : ft_package_group_desc,
    "Tags": ft_tags
    
}
create_model_package_group_response = sm_client.create_model_package_group(
    **model_package_group_input_dict
)
print(f'Created ModelPackageGroup Arn : {create_model_package_group_response["ModelPackageGroupArn"]}')

# note: this variable holds the group ARN, which is accepted wherever a group name is expected below
ft_model_pkg_group_name = create_model_package_group_response["ModelPackageGroupArn"]

Register a New Model into Fine-Tuned Model Group

In [None]:
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.28',
    pytorch_version='2.0',  
    py_version='py310',
    model_data=huggingface_estimator.model_data,
    role=role,
)

In [None]:
LoRA_package = huggingface_model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=[
        "ml.p2.16xlarge", 
        "ml.p3.16xlarge", 
        "ml.g4dn.4xlarge", 
        "ml.g4dn.8xlarge", 
        "ml.g4dn.12xlarge", 
        "ml.g4dn.16xlarge"
    ],
    transform_instances=[
        "ml.p2.16xlarge", 
        "ml.p3.16xlarge", 
        "ml.g4dn.4xlarge", 
        "ml.g4dn.8xlarge", 
        "ml.g4dn.12xlarge", 
        "ml.g4dn.16xlarge"
    ],
    model_package_group_name=ft_model_pkg_group_name,
    approval_status="Approved"
)

Add FineTuned Model to Model Collection with Parent Base

In [None]:
_model_group_for_finetune = name_from_base(model_group_for_finetune)

In [None]:
# create model collection for finetuned and link it back to the base
finetuned_collection = model_collector.create(
    collection_name=_model_group_for_finetune,
    parent_collection_name=collection_name
)

In [None]:
# add finetuned model package group to the new finetuned collection
_response = model_collector.add_model_groups(
    collection_name=_model_group_for_finetune,
    model_groups=[ft_model_pkg_group_name]
)

print(f"Model collection creation status: {_response}")

## Understanding Parent (Base) - Child (QLoRA) Model Registry Relationship

In [None]:
from PIL import Image
from io import BytesIO
from collections import OrderedDict
from anytree import (
    AnyNode as Node, 
    RenderTree, 
    DoubleStyle
)
from anytree.dotexport import RenderTreeGraph


def recursively_build_model_tree(
    root_model_package_group, 
    output_dict, 
    level=0
):
    """ Recursively extracts model collections 
    to build a relationship dictionary """
    output_dict[root_model_package_group] = []
    
    model_packages = model_collector.list_collection(root_model_package_group)
    
    for model_package in model_packages:
        if model_package['Type'] == 'Collection':
            
            output_dict[root_model_package_group].append(
                {
                    "package_name": model_package['Name'],
                    "type": model_package["Type"]
                }
            )
            
            recursively_build_model_tree(
                model_package['Name'], 
                output_dict, 
                level+1
            )
        elif model_package['Type'] == 'AWS::SageMaker::ModelPackageGroup':
            output_dict[root_model_package_group].append(
                {
                    "package_name": model_package['Name'],
                    "type": model_package["Type"]
                }
            )
    
    return output_dict


def build_tree(raw_data):
    """ Builds a tree using dictionary input """
    source_dict = {}
    for k, values in raw_data.items():
        if not any(source_dict):
            source_dict[k] = Node(name=k, type_of="root")
        for v in values:
            source_dict[v['package_name']] = Node(
                name=v['package_name'],
                type_of=v['type'].split(':')[-1],
                parent=source_dict[k]
            )
    return RenderTree(
        source_dict[collection_name], 
        style=DoubleStyle()
    ), source_dict[collection_name]


raw_data = recursively_build_model_tree(
    root_model_package_group=collection_name, 
    output_dict=OrderedDict()
)

_tree, raw_node = build_tree(raw_data=raw_data)
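
To make the shape of `raw_data` concrete without calling SageMaker, here is a dependency-free sketch that renders a hypothetical dictionary of the same form as indented text. The collection and group names are made up for illustration, and `render_tree` is not part of `anytree` or the SageMaker SDK.

```python
from collections import OrderedDict


def render_tree(raw_data, root, depth=0, lines=None):
    """Render {collection: [children]} as indented lines, recursing into sub-collections."""
    lines = [] if lines is None else lines
    if depth == 0:
        lines.append(f"{root} (root)")
    for child in raw_data.get(root, []):
        kind = child["type"].split(":")[-1]  # e.g. 'ModelPackageGroup' or 'Collection'
        lines.append(f"{'    ' * (depth + 1)}{child['package_name']} ({kind})")
        if child["type"] == "Collection":
            render_tree(raw_data, child["package_name"], depth + 1, lines)
    return lines


# Hypothetical data in the shape produced by recursively_build_model_tree.
example = OrderedDict(
    [
        (
            "llama-2-7b",
            [
                {"package_name": "base-group", "type": "AWS::SageMaker::ModelPackageGroup"},
                {"package_name": "databricks-dolly-15k", "type": "Collection"},
            ],
        ),
        (
            "databricks-dolly-15k",
            [{"package_name": "finetuned-group", "type": "AWS::SageMaker::ModelPackageGroup"}],
        ),
    ]
)

tree_lines = render_tree(example, "llama-2-7b")
print("\n".join(tree_lines))
```

The base model group sits directly under the root collection, while the fine-tuned group hangs off the child collection named after the dataset.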

In [None]:
image_path = "test.jpg"
RenderTreeGraph(raw_node).to_picture(image_path)
Image.open(image_path)

## Deploy the model
Step 1: Download and repack the base model.

In [None]:
!aws s3 cp {base_model_package.model_data} .

In [None]:
!tar -xvf {model_tar_filename} -C ./deepspeed/

!mv ./deepspeed/{model_id} ./deepspeed/base

!rm -rf ./deepspeed/{model_id.split('/')[0]}

Step 2: Download and repackage the LoRA weights.

In [None]:
!aws s3 cp {LoRA_package.model_data} .

In [None]:
!mkdir -p ./deepspeed/lora/

!tar -xzf model.tar.gz -C ./deepspeed/lora/

Create a new model package to deploy. This may take up to 10 minutes to package and upload due to the file size.


In [None]:
rm -rf `find -type d -name .ipynb_checkpoints`

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz -C deepspeed .
s3_code_artifact_deepspeed = sagemaker_session.upload_data("model.tar.gz", default_bucket, f"{s3_key_prefix}/inference")
print(f"S3 Code or Model tar for deepspeed uploaded to --- > {s3_code_artifact_deepspeed}")

### Define the serving container
Here we define the container used to serve the model. We use the SageMaker Large Model Inference (LMI) container with DeepSpeed.

In [None]:
inference_image_uri = sagemaker.image_uris.retrieve(
    "djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

In [None]:
model_name_ds = name_from_base(model_group_for_base)

create_model_response = sm_client.create_model(
    ModelName=model_name_ds,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact_deepspeed},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
endpoint_config_name = f"{model_name_ds}-config"
endpoint_name = f"{model_name_ds}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name_ds,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        },
    ],
)
endpoint_config_response

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
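
The polling loop above waits for as long as the endpoint reports `Creating`. A minimal sketch of the same loop with a timeout is shown below; `wait_for_endpoint` is a hypothetical helper, and the `describe` argument is injected so the example can run against a stand-in instead of a real endpoint.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint leaves 'Creating' or the timeout is reached."""
    waited = 0
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"{endpoint_name} still Creating after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for sm_client.describe_endpoint so the sketch runs offline.
_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(_statuses)}


final_status = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda seconds: None)
print(final_status)
```

In the notebook this would be called as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. If the returned status is not `InService`, the `FailureReason` field of the `describe_endpoint` response is the first place to look. boto3 also provides an `endpoint_in_service` waiter that may be a simpler option.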

## Run Inference

Large models such as Llama 2 have a very high accelerator memory footprint, so a very large input payload or a long generation can cause out-of-memory errors. The inference examples below are calibrated to work on the ml.g5.12xlarge instance within the SageMaker response time limit of 60 seconds. If increasing the input length or generation length leads to CUDA out-of-memory errors, reduce those lengths again or deploy to an instance type with more accelerator memory.

In [None]:
from random import randint

validation_dataset = load_dataset(dataset_name, split="train[95%:]")

sample = validation_dataset[randint(0, len(validation_dataset) - 1)]  # randint is inclusive on both ends

instruction = f"### Instruction\n{sample['instruction']}"
context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
response = f"### Answer\n"
# join all the parts together
prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])

    
prompt

In [None]:
import json
smr_client = boto3.client("sagemaker-runtime", region_name=region)

data = {
    "text": prompt,
    "properties": {
        "min_length": 10,
        "max_length": 100,
        "do_sample": True,
    },
}

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(data),
    ContentType="application/json",
)

outputs = json.loads(response_model["Body"].read().decode("utf8"))['outputs']

generated_text = outputs[0]['generated_text']
generated_text
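
The request and response handling can be exercised offline. The sketch below builds the same payload shape as the cell above and parses a hand-written response body of the form `{"outputs": [{"generated_text": ...}]}`. The helper names are hypothetical, and whether the endpoint echoes the prompt at the start of the generated text depends on the serving handler, so the prefix stripping is an assumption.

```python
import json


def build_payload(prompt, min_length=10, max_length=100, do_sample=True):
    """Serialize a request in the shape used by the invoke_endpoint call above."""
    return json.dumps(
        {
            "text": prompt,
            "properties": {
                "min_length": min_length,
                "max_length": max_length,
                "do_sample": do_sample,
            },
        }
    )


def extract_answer(body_bytes, prompt):
    """Parse the response body and drop the echoed prompt if it is present."""
    outputs = json.loads(body_bytes.decode("utf8"))["outputs"]
    text = outputs[0]["generated_text"]
    if text.startswith(prompt):  # assumption: the handler may echo the prompt
        text = text[len(prompt):]
    return text.strip()


prompt = "### Instruction\nName a primary color.\n\n### Answer\n"
payload = build_payload(prompt)

# Hand-written response body standing in for response_model["Body"].read().
fake_body = json.dumps({"outputs": [{"generated_text": prompt + "Red."}]}).encode("utf8")
answer = extract_answer(fake_body, prompt)
print(answer)
```

Keeping the parsing in a small function makes it easier to compare the generated answer with the ground truth in the next cell.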

In [None]:
groundtruth = sample['response']
print(f"GroundTruth -> {groundtruth}")

## Clean Up

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
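
Deleting the endpoint stops the instance charges, but the endpoint configuration and the model object created earlier remain in the account. A minimal sketch of removing all three is shown below. The helper name is hypothetical, and the recording client is only a stand-in so the snippet runs without AWS access; with the real client you would pass `sm_client`, `endpoint_name`, `endpoint_config_name`, and `model_name_ds`.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model object."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stand-in for the boto3 SageMaker client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, **kwargs):
        self.calls.append(("delete_endpoint", kwargs))

    def delete_endpoint_config(self, **kwargs):
        self.calls.append(("delete_endpoint_config", kwargs))

    def delete_model(self, **kwargs):
        self.calls.append(("delete_model", kwargs))


client = RecordingClient()
delete_endpoint_resources(client, "demo-endpoint", "demo-config", "demo-model")
for name, kwargs in client.calls:
    print(name, kwargs)
```

The model packages, model package groups, and collections registered earlier are left in place by this sketch, since they are the governance records this notebook set out to create.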