# Introduction to Large Language Model Hosting on SageMaker with DeepSpeed Container

<i>Note: The code below has been adapted from https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop/lab1-deploy-llm </i>

In this notebook, we explore how to host a large language model on SageMaker using the [Large Model Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-inference.html) container that is optimized for hosting large models using DJLServing. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent [blog post](https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI's 175 billion parameter GPT-3 and the similarly sized open-source BLOOM 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

In this notebook, we deploy the open-source GPT-J model, which has 6B parameters, on a single GPU. Along the way, we explore approaches that allow us to scale to larger models with practically no code changes.
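
As a rough check on why a 6B-parameter model can fit on a single GPU, the sketch below estimates the memory needed for the weights alone. It assumes 2 bytes per parameter for fp16 and 4 bytes for fp32, and it ignores activations, the attention cache, and framework overhead, so real usage will be higher. The helper name `weight_memory_gib` is illustrative and not part of any library.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the model weights only, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Parameter counts quoted in the text above.
models = {"BERT-large": 340e6, "GPT-J": 6e9, "GPT-3": 175e9}

for name, n in models.items():
    fp32 = weight_memory_gib(n, 4)
    fp16 = weight_memory_gib(n, 2)
    print(f"{name:>10}: fp32 ~ {fp32:7.1f} GiB, fp16 ~ {fp16:7.1f} GiB")

# Growth factor from BERT-large to GPT-3 (the "more than 500x" in the text).
growth = models["GPT-3"] / models["BERT-large"]
print(f"GPT-3 / BERT-large ~ {growth:.0f}x")
```

In fp16, the GPT-J weights come to roughly 11 GiB, which leaves some headroom on a 16 GB GPU such as the NVIDIA T4 in the `ml.g4dn.4xlarge` instance used later in this notebook.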

This notebook was tested on an `ml.t3.medium` instance using the `Python 3 (Data Science)` kernel on SageMaker Studio.

## Create a SageMaker Model for Deployment
As a first step, we'll import the relevant libraries and configure several global variables, such as the hosting image that will be used and the S3 location of our model artifacts.

In [5]:
!pip install sagemaker boto3 --upgrade  --quiet


In [6]:
import sagemaker
from sagemaker.model import Model
from sagemaker import serializers, deserializers
from sagemaker import image_uris
import boto3
import os
import time
import json
import jinja2
from pathlib import Path

In [7]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house code artifacts
model_bucket = sess.default_bucket()  # bucket to house model artifacts (same default bucket here)
s3_code_prefix = "large-model-djl-gptj6b/code"  # folder within bucket where code artifact will go
s3_model_prefix = "hf-large-model-djl-gptj6b/model"  # folder where model checkpoint will go

region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

s3_client = boto3.client("s3")  # client to interact with S3 API
sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker Endpoints
jinja_env = jinja2.Environment()  # jinja environment to generate model configuration templates

Let's retrieve the location of our pre-trained model in Amazon S3.

In [8]:
# lookup the S3 model location based on our region
pretrained_model_location = f"s3://sagemaker-examples-files-prod-{region}/models/gpt-j-6b-model/"
print(f"Pretrained model will be downloaded from ---- > {pretrained_model_location}")


Pretrained model will be downloaded from ---- > s3://sagemaker-examples-files-prod-eu-west-1/models/gpt-j-6b-model/
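
Before deploying, it can be useful to confirm that the model prefix is readable from the current account and region. The sketch below splits an S3 URI into bucket and key prefix and, given AWS credentials, lists a few objects under it. The helper names `split_s3_uri` and `list_model_files` are hypothetical; `list_objects_v2` is the standard boto3 S3 call.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_model_files(uri: str, max_keys: int = 20) -> list[str]:
    """List a few object keys under the prefix (requires AWS credentials)."""
    import boto3  # imported lazily so the parsing helper works without boto3

    bucket_name, prefix = split_s3_uri(uri)
    response = boto3.client("s3").list_objects_v2(
        Bucket=bucket_name, Prefix=prefix, MaxKeys=max_keys
    )
    return [obj["Key"] for obj in response.get("Contents", [])]


print(split_s3_uri("s3://sagemaker-examples-files-prod-eu-west-1/models/gpt-j-6b-model/"))
```

In the notebook, `list_model_files(pretrained_model_location)` would return the first few keys under the model prefix, or an empty list if nothing is found there.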


## Deploying a Large Language Model
The DJL inference image that we use ships with built-in inference handlers for a wide variety of tasks, including:
- `text-generation`
- `question-answering`
- `text-classification`
- `token-classification`

You can refer to this [GitHub repository](https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python/setup/djl_python) for a list of additional handlers and available NLP tasks. <br>
These handlers can be used as is, without writing any custom inference code. We simply need to create a `serving.properties` text file with our desired hosting options and package it into a `tar.gz` artifact.


Next, we explore the approach for deploying large language models using [DeepSpeed](https://www.deepspeed.ai/). DeepSpeed provides various [inference optimizations](https://www.deepspeed.ai/tutorials/inference-tutorial/) for compatible transformer-based models, including model sharding, optimized inference kernels, and quantization. To leverage DeepSpeed, we need to provide a `serving.properties` file.

In [9]:
template = jinja_env.from_string(Path("deepspeed_src/serving.template").read_text())
Path("deepspeed_src/serving.properties").write_text(
    template.render(s3url=pretrained_model_location)
)
!cat deepspeed_src/serving.properties

engine=DeepSpeed
option.entryPoint=djl_python.deepspeed
option.s3url=s3://sagemaker-examples-files-prod-eu-west-1/models/gpt-j-6b-model/
option.tensor_parallel_degree=1
option.task=text-generation
option.dtype=fp16


There are a few options specified here. Let's go through them in turn:<br>
1. `engine` - specifies the engine that will be used for this workload. Here it is set to `DeepSpeed`; the built-in handlers are part of the [DJL Python Engine](https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python).
2. `option.entryPoint` - specifies the entrypoint code that will be used to host the model, here the built-in `djl_python.deepspeed` handler. Python scripts that use DeepSpeed cannot be launched as ordinary Python scripts (i.e. `python deepspeed.py` would not work). Setting `engine=DeepSpeed` automatically configures the environment and launches the inference script appropriately.
3. `option.s3url` - specifies the location of the model files. Alternatively, `option.model_id` can be used to specify a model from the Hugging Face Hub (e.g. `EleutherAI/gpt-j-6B`), in which case the model is downloaded automatically from the Hub. The `s3url` approach is recommended because it lets you host the model artifact within your own environment and enables faster deployments: the DJL inference container uses an optimized approach to transfer the model from S3 to the hosting instance.
4. `option.task` - specifies the task this model will be used for (here `text-generation`); it is read by the built-in inference handler.
5. `option.tensor_parallel_degree` - specifies the number of GPU devices across which the model is sharded.
6. `option.dtype` - specifies the data type in which the model weights are loaded; `fp16` (half precision) roughly halves the memory footprint compared with `fp32`.
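
Since `serving.properties` is a plain `key=value` file, the options above can also be produced from a Python dictionary. The sketch below is a standard-library stand-in for the Jinja template used in this notebook (whose contents are not shown here); the helper names `render_serving_properties` and `parse_serving_properties` are hypothetical.

```python
def render_serving_properties(options: dict[str, object]) -> str:
    """Render options as the key=value lines used in serving.properties."""
    return "\n".join(f"{key}={value}" for key, value in options.items()) + "\n"


def parse_serving_properties(text: str) -> dict[str, str]:
    """Parse key=value lines back into a dict, skipping blanks and # comments."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        parsed[key.strip()] = value.strip()
    return parsed


# The same options that appear in the rendered file above.
options = {
    "engine": "DeepSpeed",
    "option.entryPoint": "djl_python.deepspeed",
    "option.s3url": "s3://sagemaker-examples-files-prod-eu-west-1/models/gpt-j-6b-model/",
    "option.tensor_parallel_degree": 1,
    "option.task": "text-generation",
    "option.dtype": "fp16",
}

text = render_serving_properties(options)
print(text)
```

Parsing the file back is a cheap way to check, before uploading, that an option such as `option.tensor_parallel_degree` has the value you expect for the chosen instance type.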

DeepSpeed uses tensor parallelism, where individual layers (tensors) are sharded across devices. For example, each GPU can hold a slice of each layer. The diagram below provides a high-level illustration of how this works. <br>

<img src="images/TensorShard.png" width="800"/>

Whereas with a layer-wise approach the data flows through each GPU device sequentially, here the data is sent to all GPU devices and a partial result is computed on each GPU. The partial results are then collected through an All-Gather operation to compute the final result.
Tensor parallelism generally provides higher GPU utilization and better performance.
For more information on the available options, please refer to the [SageMaker Large Model Inference Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html).
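
The idea of computing partial results per device and gathering them can be illustrated with a toy example. The sketch below shards the columns of a single weight matrix across simulated "devices", multiplies the same input by each shard, and concatenates the partial outputs. It is a pure-Python illustration of the principle only, not how DeepSpeed implements it: the real system runs fused GPU kernels and uses collective communication between devices. All function names are illustrative.

```python
def matmul(x, w):
    """Multiply x (batch x d_in) by w (d_in x d_out), both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def shard_columns(w, num_devices):
    """Split the output columns of w evenly across devices."""
    d_out = len(w[0])
    if d_out % num_devices != 0:
        raise ValueError("d_out must be divisible by num_devices")
    step = d_out // num_devices
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_devices)]


def all_gather(partials):
    """Concatenate the per-device partial outputs along the feature axis."""
    return [sum((p[r] for p in partials), []) for r in range(len(partials[0]))]


x = [[1, 2], [3, 4]]              # input batch: 2 rows, 2 features
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # one layer's weights: 2 inputs, 4 outputs

full = matmul(x, w)                                  # single-device reference
shards = shard_columns(w, num_devices=2)             # each device holds 2 of the 4 columns
partials = [matmul(x, shard) for shard in shards]    # every device sees the full input
gathered = all_gather(partials)

print(full)
print(gathered)
```

Both printed matrices are identical, which is the point: sharding changes where the work happens, not the result.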

In [13]:
# lookup the inference image uri based on our current region
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.20.0-deepspeed0.7.5-cu116"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.eu-west-1.amazonaws.com/djl-inference:0.20.0-deepspeed0.7.5-cu116
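
The image URI above is assembled by hand from the region. The sketch below wraps that string formatting in a helper and, as an alternative, shows a lookup through `sagemaker.image_uris.retrieve`. The framework name `"djl-deepspeed"` and the availability of version `"0.20.0"` are assumptions that depend on the installed SageMaker SDK version, so verify them before relying on the lookup. Both helper names are hypothetical.

```python
def build_djl_image_uri(
    region: str,
    tag: str = "0.20.0-deepspeed0.7.5-cu116",
    account: str = "763104351884",
) -> str:
    """Format the ECR image URI the same way the notebook cell does."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


def lookup_djl_image_uri(region: str, version: str = "0.20.0") -> str:
    """Ask the SageMaker SDK for the image URI (assumed framework name, see lead-in)."""
    from sagemaker import image_uris  # imported lazily; requires the sagemaker package

    return image_uris.retrieve(framework="djl-deepspeed", region=region, version=version)


print(build_djl_image_uri("eu-west-1"))
```

The SDK lookup avoids hard-coding the registry account, while the explicit string makes the exact container tag visible in the notebook.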


We place the `serving.properties` file into a tarball and upload it to S3.

In [14]:
!tar czvf ds_model.tar.gz deepspeed_src/

deepspeed_src/
deepspeed_src/serving.template
deepspeed_src/serving.properties
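
The same packaging step can be done from Python with the standard `tarfile` module, which is convenient when the shell is not available. This is a minimal sketch run against a throwaway directory; the helper name `package_directory` is hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def package_directory(src_dir: str, archive_path: str) -> list[str]:
    """Create a gzip-compressed tarball of src_dir and return the member names."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the top-level folder name inside the archive,
        # matching `tar czvf ds_model.tar.gz deepspeed_src/`
        tar.add(src, arcname=src.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demonstrate on a temporary directory so the notebook's real files are untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_src = Path(tmp) / "deepspeed_src"
    demo_src.mkdir()
    (demo_src / "serving.properties").write_text("engine=DeepSpeed\n")
    members = package_directory(str(demo_src), str(Path(tmp) / "ds_model.tar.gz"))
    print(members)
```

Listing the members after writing is a quick check that `serving.properties` sits under the expected folder in the archive.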


In [15]:
ds_s3_code_artifact = sess.upload_data("ds_model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {ds_s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-eu-west-1-477886989750/large-model-djl-gptj6b/code/ds_model.tar.gz


In [16]:
def deploy_model(image_uri, model_data, role, endpoint_name, instance_type, sagemaker_session):
    """Helper function to create the SageMaker Endpoint resources and return a predictor"""
    # pass the session through so the model and the predictor use the same one
    model = Model(
        image_uri=image_uri, model_data=model_data, role=role, sagemaker_session=sagemaker_session
    )

    model.deploy(initial_instance_count=1, instance_type=instance_type, endpoint_name=endpoint_name)

    # our requests and responses will be in json format so we specify the serializer and the deserializer
    predictor = sagemaker.Predictor(
        endpoint_name=endpoint_name,
        sagemaker_session=sagemaker_session,
        serializer=serializers.JSONSerializer(),
        deserializer=deserializers.JSONDeserializer(),
    )

    return predictor

In [17]:
ds_endpoint_name = sagemaker.utils.name_from_base("gptj-ds")
ds_predictor = deploy_model(
    image_uri=inference_image_uri,
    model_data=ds_s3_code_artifact,
    role=role,
    endpoint_name=ds_endpoint_name,
    instance_type="ml.g4dn.4xlarge",
    sagemaker_session=sess,
)

----------------------!
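
The dashes above are printed while the deployment call waits for the endpoint to come up. If you need to check on an endpoint from another session, a polling loop over `describe_endpoint` is one option. The sketch below takes the boto3 SageMaker client as an argument (for example the `sm_client` created earlier); the helper name `wait_for_endpoint` and its defaults are illustrative.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService, fails, or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Example (requires AWS credentials and an existing endpoint):
# wait_for_endpoint(sm_client, ds_endpoint_name)
```

boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which covers the common case; the explicit loop makes it easier to surface the failure reason.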

In [14]:
ds_predictor.predict(
    {"inputs": "Large model inference is", "parameters": {"max_length": 50, "temperature": 0.5}}
)

[[{'generated_text': 'Large model inference is an active research area in the machine learning community, and in particular in the field of probabilistic graphical models. We consider the setting where we have a finite number of samples, and where we want to learn a function of the'}]]
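
The `Predictor` object handles JSON serialization for us. Applications that do not depend on the SageMaker SDK can call the endpoint through the low-level runtime client instead, such as the `smr_client` created at the start of the notebook. The sketch below assumes the endpoint accepts and returns `application/json`, as it does in the call above; the helper name `invoke_json` is hypothetical.

```python
import json


def invoke_json(smr_client, endpoint_name, payload):
    """Send a JSON payload with the runtime client and decode the JSON reply."""
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # "Body" is a streaming object; read() returns the raw bytes of the response.
    return json.loads(response["Body"].read())


# Example (requires AWS credentials and a running endpoint):
# invoke_json(
#     smr_client,
#     ds_endpoint_name,
#     {"inputs": "Large model inference is", "parameters": {"max_length": 50, "temperature": 0.5}},
# )
```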

In [15]:
print(
    ds_predictor.predict(
        {
            "inputs": """Message: Support has been terrible for 2 weeks...
                                Sentiment: Negative
                                ###
                                Message: I love your API, it is simple and so fast!
                                Sentiment: Positive
                                ###
                                Message: GPT-J has been released 12 months ago.
                                Sentiment: Neutral
                                ###
                                Message: The responsiveness of your team has been amazing, thank you so much!
                                Sentiment:""",
            "parameters": {"max_length": 50, "temperature": 0.5},
        }
    )[0][0]["generated_text"]
)

Message: Support has been terrible for 2 weeks...
                                Sentiment: Negative
                                ###
                                Message: I love your API, it is simple and so fast!
                                Sentiment: Positive
                                ###
                                Message: GPT-J has been released 12 months ago.
                                Sentiment: Neutral
                                ###
                                Message: The responsiveness of your team has been amazing, thank you so much!
                                Sentiment: Positive
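
The prompt above carries the indentation of the notebook cell into the text sent to the model. A small helper can assemble the same few-shot prompt without that extra whitespace and pull the predicted label out of the response. This sketch assumes, as in the output above, that the generated text begins with the prompt itself; the helper names are hypothetical, and removing the indentation may change the model's output.

```python
def build_few_shot_prompt(examples, query):
    """Join (message, sentiment) pairs and a final unlabeled message into one prompt."""
    blocks = [f"Message: {message}\nSentiment: {label}" for message, label in examples]
    blocks.append(f"Message: {query}\nSentiment:")
    return "\n###\n".join(blocks)


def extract_label(generated_text, prompt):
    """Return the first word produced after the prompt (assumes the prompt is echoed back)."""
    completion = generated_text[len(prompt):].strip()
    return completion.split()[0] if completion else ""


examples = [
    ("Support has been terrible for 2 weeks...", "Negative"),
    ("I love your API, it is simple and so fast!", "Positive"),
    ("GPT-J has been released 12 months ago.", "Neutral"),
]
prompt = build_few_shot_prompt(
    examples, "The responsiveness of your team has been amazing, thank you so much!"
)
print(prompt)
```

With the endpoint running, `extract_label(ds_predictor.predict({"inputs": prompt, ...})[0][0]["generated_text"], prompt)` would return just the sentiment word.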


In [16]:
%%timeit -n3 -r1
ds_predictor.predict(
    {"inputs": "Large model inference is", "parameters": {"max_length": 50, "temperature": 0.5}}
)

3.01 s ± 0 ns per loop (mean ± std. dev. of 1 run, 3 loops each)
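
`%%timeit` reports a single mean. To see how latency varies between calls, the requests can be timed individually. The sketch below uses a stand-in workload so it runs anywhere; in the notebook the callable would wrap `ds_predictor.predict(...)`. The helper names `measure_latency` and `summarize` are illustrative.

```python
import statistics
import time


def measure_latency(fn, runs=5):
    """Call fn() `runs` times and return the per-call wall-clock latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Summarize latencies with mean, median, and worst case."""
    return {
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


# Stand-in workload; replace the lambda with a call to the endpoint in the notebook.
stats = summarize(measure_latency(lambda: time.sleep(0.01), runs=3))
print(stats)
```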


In [None]:
# Uncomment to delete the endpoint when you are done, so that it stops incurring charges
# ds_predictor.delete_endpoint()
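
A deployment creates three resources: the endpoint, its endpoint configuration, and the model. If you prefer to remove all of them explicitly, the sketch below looks up the associated names and deletes each one through the boto3 SageMaker client (for example the `sm_client` created earlier). The helper name `delete_endpoint_resources` is hypothetical.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint configuration and model(s)."""
    # Look up the names first; they are no longer retrievable once the endpoint is gone.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return {"endpoint": endpoint_name, "config": config_name, "models": model_names}


# Example (requires AWS credentials; this permanently deletes the resources):
# delete_endpoint_resources(sm_client, ds_endpoint_name)
```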