# Serve a custom tiiuae/falcon-7b-instruct model with Amazon SageMaker Hosting

👋 Hey there! 🌟 

Let's walk through how to deploy a custom **tiiuae/falcon-7b-instruct** model and run inference on it using the **Large Model Inference (LMI)** container provided by AWS with the help of **DJL Serving**. 😄

Since **tiiuae/falcon-7b-instruct** is a relatively small large language model (LLM) that fits on a single GPU, we'll use the `ml.g5.2xlarge` instance, which comes with **1** GPU. 🖥️
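As a rough back-of-the-envelope check of that claim, the sketch below estimates the memory needed for the weights alone. It assumes roughly 7 billion parameters, 2 bytes per parameter (bfloat16, the dtype used later in this walkthrough), and the 24 GB of the NVIDIA A10G GPU found in `ml.g5` instances. Activations and the KV cache need extra headroom on top of this figure.

```python
# Rough estimate of the GPU memory needed for the model weights alone.
# Assumptions (not taken from the notebook): ~7e9 parameters, 24 GB per A10G GPU.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

falcon_7b_params = 7e9   # approximate parameter count
a10g_memory_gb = 24      # memory of the single GPU on ml.g5.2xlarge

needed = weight_memory_gb(falcon_7b_params)  # bfloat16 -> 2 bytes per parameter
fits_on_one_gpu = needed < a10g_memory_gb
print(f"Weights need ~{needed:.0f} GB of {a10g_memory_gb} GB available")
```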

**Note:** Use an environment that has PyTorch 2.0.1 (`torch==2.0.1`) installed, or install it by running `pip install torch==2.0.1` in a notebook cell.

## Setup

To get started, you'll need to install the necessary dependencies for packaging your model and running inferences on Amazon SageMaker. Don't worry, it's a simple process! Just make sure to update SageMaker and boto3 too. 🚀

In [1]:
!pip install sagemaker boto3 --upgrade  --quiet
!pip install transformers einops==0.5.0 tiktoken accelerate --q
# Uncomment if you are building your own env; otherwise select an environment with torch==2.0.1 pre-installed
#!pip install torch==2.0.1 --q

## Imports and variables

In [2]:
import os
import time
import json
import torch
import boto3
import jinja2
import sagemaker
from pathlib import Path
from sagemaker import image_uris
from sagemaker.utils import name_from_base
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

In [3]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = sess.default_bucket()  # bucket to house the model weights
hf_model_id = 'tiiuae/falcon-7b-instruct'
model_id = hf_model_id.replace('/','-')
s3_code_prefix_accelerate = f"hf-large-model/{model_id}/accelerate"  # folder within bucket where code artifact will go
s3_model = f"hf-large-model/{model_id}/model"
region = sess.boto_region_name  # public attribute instead of the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

First, let's load the tiiuae/falcon-7b-instruct model along with its tokenizer and save it to S3. 🚀 The great thing is that you can also train it on your own custom dataset and store the trained model on S3.

**Note:** Even if you don't plan to fine-tune the model, letting the container download it directly from the HuggingFace Hub can result in **long download times.** ⏳ It's highly recommended to download the model from the HuggingFace Hub to your local host first and then upload it to an S3 bucket. 📥

In [4]:
model_name = 'tiiuae/falcon-7b-instruct'
local_rank = int(os.getenv("LOCAL_RANK", "0"))
tokenizer = AutoTokenizer.from_pretrained(model_name, 
                                      trust_remote_code=True,
                                      cache_dir=os.path.join(os.environ['PWD'],'hf/tokenizer_cache/')
                                     )
model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto",
                                             trust_remote_code=True,
                                             cache_dir=os.path.join(os.environ['PWD'],'hf/model_cache/')
                                            )

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Save it locally

In [5]:
# Training code to finetune model on a custom dataset

In [6]:
local_model_path = os.path.join(os.environ['PWD'],'sm_model/')
model.save_pretrained(local_model_path)
tokenizer.save_pretrained(local_model_path)

('/home/ec2-user/SageMaker/sm_model/tokenizer_config.json',
 '/home/ec2-user/SageMaker/sm_model/special_tokens_map.json',
 '/home/ec2-user/SageMaker/sm_model/tokenizer.json')

Upload the model to S3. The container will load it from there during deployment, which is faster than downloading it from the HuggingFace Hub.

In [7]:
def upload_directory_to_s3(local_directory, s3_bucket, s3_prefix):
    s3_client = boto3.client('s3')
    
    for root, dirs, files in os.walk(local_directory):
        for file in files:
            local_path = os.path.join(root, file)
            s3_key = os.path.join(s3_prefix, os.path.relpath(local_path, local_directory))
            s3_client.upload_file(local_path, s3_bucket, s3_key)
            print(f"Uploaded {local_path} to s3://{s3_bucket}/{s3_key}")

# Example usage
upload_directory_to_s3(local_model_path, model_bucket, s3_model)

Uploaded /home/ec2-user/SageMaker/sm_model/generation_config.json to s3://sagemaker-eu-west-1-069230569860/hf-large-model/tiiuae-falcon-7b-instruct/model/generation_config.json
Uploaded /home/ec2-user/SageMaker/sm_model/pytorch_model-00001-of-00002.bin to s3://sagemaker-eu-west-1-069230569860/hf-large-model/tiiuae-falcon-7b-instruct/model/pytorch_model-00001-of-00002.bin
Uploaded /home/ec2-user/SageMaker/sm_model/tokenizer_config.json to s3://sagemaker-eu-west-1-069230569860/hf-large-model/tiiuae-falcon-7b-instruct/model/tokenizer_config.json
Uploaded /home/ec2-user/SageMaker/sm_model/pytorch_model.bin.index.json to s3://sagemaker-eu-west-1-069230569860/hf-large-model/tiiuae-falcon-7b-instruct/model/pytorch_model.bin.index.json
Uploaded /home/ec2-user/SageMaker/sm_model/config.json to s3://sagemaker-eu-west-1-069230569860/hf-large-model/tiiuae-falcon-7b-instruct/model/config.json
Uploaded /home/ec2-user/SageMaker/sm_model/pytorch_model-00002-of-00002.bin to s3://sagemaker-eu-west-1-069
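
The upload function builds S3 keys with `os.path.join`, which uses the local path separator. That is fine on the Linux notebook instance used here, but it would produce backslashes on Windows. A minimal sketch of a separator-safe key builder follows; `build_s3_key` is a hypothetical helper name and the prefix is a placeholder, neither is part of the original notebook.

```python
from pathlib import Path, PurePosixPath

def build_s3_key(s3_prefix: str, local_path: str, local_directory: str) -> str:
    """Map a local file to an S3 key that always uses forward slashes."""
    relative = Path(local_path).relative_to(local_directory)
    return str(PurePosixPath(s3_prefix, *relative.parts))

# Placeholder prefix and paths, for illustration only
key = build_s3_key("hf-large-model/demo/model", "sm_model/config.json", "sm_model")
print(key)
```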

### 1. Create SageMaker compatible model artifacts

To get our model ready for deployment on a SageMaker Endpoint, we need to prepare a few things for both SageMaker and our container. No worries, it's a straightforward process! We'll use a local folder to store these files, including some important ones like serving.properties (which defines parameters for the LMI container) and requirements.txt (to specify the dependencies we need to install). 📂

In [8]:
directory_name = f"code_{model_id.replace('-','_')}_accelerate"
os.makedirs(directory_name, exist_ok=True)

In the serving.properties file, you'll need to define the engine to use and the model you want to host. Pay attention to the tensor_parallel_degree parameter, as it's essential in this scenario. If a single GPU doesn't have enough memory to handle the entire model, you can use tensor parallelism >1 to divide the model into multiple parts.

For your deployment, we'll be using a 'ml.g5.2xlarge' instance, which comes with 1 GPU and is sufficient for loading our model. Just make sure not to specify a value larger than what the instance provides, or your deployment might encounter issues. ❌🙅‍♂️
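
To make the constraint concrete, here is a small sketch that validates a tensor parallel degree against the number of GPUs on an instance. The lookup table is an illustrative subset of `ml.g5` instance types, and `check_tensor_parallel_degree` is a hypothetical helper, not something the LMI container provides.

```python
# GPU counts for a few ml.g5 instance types (illustrative subset).
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.48xlarge": 8,
}

def check_tensor_parallel_degree(instance_type: str, degree: int) -> None:
    """Raise ValueError if the degree is not between 1 and the GPU count."""
    gpus = GPUS_PER_INSTANCE[instance_type]
    if not 1 <= degree <= gpus:
        raise ValueError(
            f"tensor_parallel_degree={degree} is invalid for {instance_type} "
            f"({gpus} GPU(s) available)"
        )

check_tensor_parallel_degree("ml.g5.2xlarge", 1)  # passes silently
```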

Finally, we need to specify the S3 link where the model can be found.

In [9]:
%%writefile ./{directory_name}/serving.properties
engine=Python
option.tensor_parallel_degree=1
option.s3url={{s3url}}

Overwriting ./code_tiiuae_falcon_7b_instruct_accelerate/serving.properties


In [10]:
%%writefile ./{directory_name}/requirements.txt
torch==2.0.1
einops==0.5.0
tiktoken
transformers==4.30.2
accelerate

Overwriting ./code_tiiuae_falcon_7b_instruct_accelerate/requirements.txt


In [11]:
# we plug in the appropriate model location into our `serving.properties` file based on the region in which this notebook is running
template = jinja_env.from_string(Path(f"{directory_name}/serving.properties").open().read())
Path(f"{directory_name}/serving.properties").open("w").write(
    template.render(s3url=f"s3://{model_bucket}/{s3_model}")
)
!pygmentize {directory_name}/serving.properties | cat -n

     1	engine=Python
     2	option.tensor_parallel_degree=1
     3	option.s3url=s3://sagemaker-eu-west-1-069230569860/hf-large-model/tiiuae-falcon-7b-instruct/model
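
Before packaging, it can be worth checking that the template placeholder was actually replaced. The sketch below parses `key=value` lines with the standard library and fails if a `{{...}}` placeholder survives. `parse_properties` is a hypothetical helper, and the sample text uses a placeholder bucket name.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# Sample rendered file content (placeholder bucket name)
rendered = """engine=Python
option.tensor_parallel_degree=1
option.s3url=s3://example-bucket/hf-large-model/demo/model
"""
props = parse_properties(rendered)
unrendered = [k for k, v in props.items() if "{{" in v]
assert not unrendered, f"Unrendered placeholders in: {unrendered}"
print(props["option.s3url"])
```

In the notebook you could pass `Path(f"{directory_name}/serving.properties").read_text()` instead of the sample string.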


### 2. Create a model.py with custom inference code

With SageMaker, you have the flexibility to bring your own script for inference. In this case, we need to create a model.py file with the necessary code for the tiiuae/falcon-7b-instruct model.

The script below calls the model's generate() API directly rather than going through the pipeline() API, which produces slightly faster responses.

If you'd like more information on the difference between the pipeline and generate APIs, you can check out this helpful [Ref](https://discuss.huggingface.co/t/pipeline-vs-model-generate/26203). 📚

In [12]:
%%writefile ./{directory_name}/model.py
from djl_python import Input, Output
import os
import torch
import transformers
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from typing import Any, Dict, Tuple
import warnings

predictor = None
print("transformers version: " + transformers.__version__)


def get_model(properties):
    """
    In case you want to have a look at the env variables set by SageMaker or the paths inside the container
    try:
        print('properties')
        for key, value in properties.items():
            print(key, value)
        print(20*'--')
    except:
        pass
    try:
        print('os.environ')
        for key, value in os.environ.items():
            print(key, value)
        print(20*'--')
    except:
        pass
    try:
        root_dir = '/opt'
        print('root_dir')
        for root, dirs, files in os.walk(root_dir):
            for file in files:
                file_path = os.path.join(root, file)
                print(file_path)
    except:
        pass
    """
    
    model_name = properties["model_id"]
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    tokenizer = AutoTokenizer.from_pretrained(model_name, 
                                          trust_remote_code=True,
                                         )
    model = AutoModelForCausalLM.from_pretrained(model_name, 
                                                 torch_dtype=torch.bfloat16,
                                                 device_map="auto",
                                                 trust_remote_code=True,
                                                )
    predictor = {"model": model, "tokenizer": tokenizer}
    return predictor 


def handle(inputs: Input) -> Output:
    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None
    model, tokenizer = predictor["model"], predictor["tokenizer"]
    data = inputs.get_as_json()
    text = data.pop("text", data)
    # Fall back to an empty dict so that **params works when no parameters are sent
    params = data.pop("parameters", None) or {}
    encoding = tokenizer(text, return_tensors="pt")
    with torch.inference_mode():
        sample = model.generate(input_ids=encoding.input_ids,
                                attention_mask=encoding.attention_mask,
                                **params)
    result = {"generated_text": tokenizer.decode(sample[0])}
    return Output().add_as_json(result)

Overwriting ./code_tiiuae_falcon_7b_instruct_accelerate/model.py
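
The request handling in `handle()` boils down to splitting the JSON payload into a prompt and a set of generation parameters. The sketch below isolates that step so it can be tried without a container. `split_payload` is a hypothetical helper, and dropping `None`-valued parameters is an assumption about the desired behavior rather than something the container does for you.

```python
import json

def split_payload(body: str):
    """Return (text, params) extracted from a JSON request body."""
    data = json.loads(body)
    text = data.pop("text", data)
    params = data.pop("parameters", None) or {}
    # Drop keys explicitly set to null, e.g. "best_of": null
    params = {k: v for k, v in params.items() if v is not None}
    return text, params

body = json.dumps({"text": "Hello", "parameters": {"max_new_tokens": 20, "best_of": None}})
text, params = split_payload(body)
print(text, params)
```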


### 3. Create the Tarball and then upload to S3 location

Now, let's package our artifacts as a `model.tar.gz` archive, which we'll upload to S3. SageMaker will use this archive for deployment. 📦💨

In [13]:
!rm -f model.tar.gz
!rm -rf {directory_name}/.ipynb_checkpoints
!tar czvf model.tar.gz -C {directory_name} .
s3_code_artifact_accelerate = sess.upload_data("model.tar.gz", bucket, s3_code_prefix_accelerate)
print(f"S3 Code or Model tar for accelerate uploaded to --- > {s3_code_artifact_accelerate}")

./
./requirements.txt
./serving.properties
./model.py
S3 Code or Model tar for accelerate uploaded to --- > s3://sagemaker-eu-west-1-069230569860/hf-large-model/tiiuae-falcon-7b-instruct/accelerate/model.tar.gz
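
If you prefer to avoid shell commands, the same archive can be built with Python's standard `tarfile` module. This is a minimal sketch under the assumption that all artifacts sit directly in one folder; `make_tarball` is a hypothetical helper name.

```python
import os
import tarfile

def make_tarball(source_dir: str, output_path: str = "model.tar.gz") -> list:
    """Archive the entries of source_dir at the archive root, skipping notebook checkpoints."""
    added = []
    with tarfile.open(output_path, "w:gz") as tar:
        for name in sorted(os.listdir(source_dir)):
            if name == ".ipynb_checkpoints":
                continue
            # arcname keeps the files at the top level of the archive
            tar.add(os.path.join(source_dir, name), arcname=name)
            added.append(name)
    return added
```

In the notebook this would be called as `make_tarball(directory_name)` before `sess.upload_data(...)`.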


### 4. Define a serving container, SageMaker Model and SageMaker endpoint

Now, we can move on to creating a SageMaker endpoint to serve our model. 🚀

#### Define the serving container
In this step, we'll specify the container to be used for the model during inference. For optimal performance, we'll be using SageMaker's Large Model Inference (LMI) container with Accelerate. ⚡️🧪 

In [14]:
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.22.1-deepspeed0.8.3-cu118"
)

print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.eu-west-1.amazonaws.com/djl-inference:0.22.1-deepspeed0.8.3-cu118


#### Create SageMaker model, endpoint configuration and endpoint.


In [15]:
model_name_acc = name_from_base(model_id)
print(model_name_acc)

tiiuae-falcon-7b-instruct-2023-06-29-17-37-27-513


In [16]:
create_model_response = sm_client.create_model(
    ModelName=model_name_acc,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact_accelerate},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

Created Model: arn:aws:sagemaker:eu-west-1:069230569860:model/tiiuae-falcon-7b-instruct-2023-06-29-17-37-27-513


In [17]:
model_name = model_name_acc
print(f"Building EndpointConfig and Endpoint for: {model_name}")

Building EndpointConfig and Endpoint for: tiiuae-falcon-7b-instruct-2023-06-29-17-37-27-513


In [18]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            # "VolumeSizeInGB": 512
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:eu-west-1:069230569860:endpoint-config/tiiuae-falcon-7b-instruct-2023-06-29-17-37-27-513-config',
 'ResponseMetadata': {'RequestId': '20784c10-0f05-4f0c-92ec-a2683c01bf02',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '20784c10-0f05-4f0c-92ec-a2683c01bf02',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '137',
   'date': 'Thu, 29 Jun 2023 17:37:28 GMT'},
  'RetryAttempts': 0}}

In [19]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:eu-west-1:069230569860:endpoint/tiiuae-falcon-7b-instruct-2023-06-29-17-37-27-513-endpoint


In [20]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:eu-west-1:069230569860:endpoint/tiiuae-falcon-7b-instruct-2023-06-29-17-37-27-513-endpoint
Status: InService
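
The polling loop above runs until the status changes, however long that takes. A sketch of the same idea with an upper time bound is shown below. `wait_for_status` is a hypothetical helper that takes any zero-argument callable, so it can be tried without AWS access.

```python
import time

def wait_for_status(get_status, pending="Creating", poll_seconds=60, timeout_seconds=3600):
    """Poll get_status() until it leaves the pending state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{pending}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = get_status()
    return status
```

In the notebook it could be called as `wait_for_status(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])`. If the returned status is not `InService`, the `FailureReason` field of the `describe_endpoint` response is the first place to look. boto3 also ships an `endpoint_in_service` waiter that may be a simpler fit.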


### Let's use the endpoint & run Inference

In [21]:
%%timeit -r 5
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"text": "The population of Greece is",
                     "parameters": {
                          "max_new_tokens": 500, #the higher the longer the response time
                          "temperature": 0.1,
                          "top_p": 0.85,
                          "top_k": 40,
                          "repetition_penalty": 1.9,
                          "do_sample": True,
                          "num_return_sequences": 1,
                          #"return_full_text":False, # avoid returning pr
                          "best_of": None, 
                          "truncate": None,
                     }}
                     ),
    ContentType="application/json",
)

The slowest run took 5.89 times longer than the fastest. This could mean that an intermediate result is being cached.
15 s ± 6.74 s per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [24]:
prompt = "The population of Greece is"
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"text": prompt,
                     "parameters": {
                          "max_new_tokens": 500, #the higher the longer the response time
                          "temperature": 0.1,
                          "top_p": 0.85,
                          "top_k": 40,
                          "repetition_penalty": 1.9,
                          "do_sample": True,
                          "num_return_sequences": 1,
                          #"return_full_text":False, # avoid returning pr
                          "best_of": None, 
                          "truncate": None,
                     }}
                     ),
    ContentType="application/json",
)

r = response_model["Body"].read().decode("utf8")

# Load the JSON string as a dictionary
data_dict = json.loads(r)

# Access the dictionary elements
generated_text = data_dict['generated_text']

# Print the generated text
print(generated_text)

The population of Greece is 10.7 million people.
Greece has a population density of 268 people per square kilometer.
Greece's population growth rate is 0.2%.
Greece's population is predominantly urban, with 82.4% of the population living in cities.
Greece's population distribution is relatively evenly spread out across its various regions.<|endoftext|>


In [25]:
def process_generated_text(text, stopwords, prompt=None):
    if prompt:
        text = text[len(prompt):]
    
    for word in stopwords:
        position = text.find(word)
        if position != -1:
            text = text[:position]
    return text

# Print the generated text
print(process_generated_text(generated_text, ['<|endoftext|>'], prompt=prompt))

 10.7 million people.
Greece has a population density of 268 people per square kilometer.
Greece's population growth rate is 0.2%.
Greece's population is predominantly urban, with 82.4% of the population living in cities.
Greece's population distribution is relatively evenly spread out across its various regions.
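
`process_generated_text` slices off `len(prompt)` characters unconditionally, which would remove real output if the response did not start with the prompt. A slightly more defensive sketch follows; `strip_prompt_and_stop` is a hypothetical name, and it cuts at the earliest stop sequence found.

```python
def strip_prompt_and_stop(text: str, stop_sequences, prompt: str = "") -> str:
    """Remove a leading prompt (only if present) and cut at the earliest stop sequence."""
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    cut = len(text)
    for stop in stop_sequences:
        position = text.find(stop)
        if position != -1:
            cut = min(cut, position)
    return text[:cut]

print(strip_prompt_and_stop("Hi there<|endoftext|>", ["<|endoftext|>"], prompt="Hi"))
```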


### Clean Up

In [26]:
# Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'b48dc2f0-d001-4e6f-9828-b819df84607a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b48dc2f0-d001-4e6f-9828-b819df84607a',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 29 Jun 2023 17:48:17 GMT'},
  'RetryAttempts': 0}}

In [27]:
# Delete the model and the endpoint configuration
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

{'ResponseMetadata': {'RequestId': '34ead684-7907-41ee-abc4-f2143cd12058',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '34ead684-7907-41ee-abc4-f2143cd12058',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 29 Jun 2023 17:48:18 GMT'},
  'RetryAttempts': 0}}

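#### Delete all endpoints, endpoint configurations & models

**Note:** The cell below deletes every model, endpoint, and endpoint configuration that the list calls return for the current account and region, not only the resources created in this walkthrough.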

In [28]:
import boto3

def delete_resources(resource_type):
    client = boto3.client('sagemaker')
    list_method = getattr(client, f"list_{resource_type}s")
    delete_method = getattr(client, f"delete_{resource_type}")
    resource_type_name = resource_type.replace('_', ' ').title().replace(' ', '')
    resources = list_method()[f"{resource_type_name}s"]
    for resource in resources:
        resource_name = resource[f"{resource_type_name}Name"]
        print(f"Deleting {resource_type}: {resource_name}")
        try:
            delete_method(**{f"{resource_type_name}Name": resource_name})
            print("Deleted")
        except Exception as e:
            print("An error occurred:", str(e))

def main():
    resource_types = ['model', 'endpoint', 'endpoint_config']  # Add more resource types if needed

    for resource_type in resource_types:
        delete_resources(resource_type)

if __name__ == "__main__":
    main()

Deleting model: tiiuae-falcon-7b-instruct-2023-06-29-17-30-23-944
Deleted
Deleting endpoint_config: tiiuae-falcon-7b-instruct-2023-06-29-17-30-23-944-config
Deleted
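
A narrower variant is to collect only the resources whose names start with a given prefix and to page through all results. The sketch below uses boto3 paginators for `list_models`, `list_endpoints`, and `list_endpoint_configs`; `list_resource_names` is a hypothetical helper, and the client is passed in so the function can be exercised with a stub.

```python
def list_resource_names(client, resource_type: str, name_prefix: str):
    """Collect names of models, endpoints or endpoint_configs that start with name_prefix."""
    key = resource_type.replace("_", " ").title().replace(" ", "")  # e.g. EndpointConfig
    names = []
    paginator = client.get_paginator(f"list_{resource_type}s")
    for page in paginator.paginate():
        for resource in page[f"{key}s"]:
            name = resource[f"{key}Name"]
            if name.startswith(name_prefix):
                names.append(name)
    return names
```

For example, `list_resource_names(boto3.client("sagemaker"), "model", model_id)` would return only the models created from this notebook's naming scheme, which you can review before passing each name to the matching delete call.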
