# Serve OPT-30b on SageMaker with DeepSpeed Container without custom inference code.


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

---


In this notebook, we explore how to host a large language model on SageMaker using the latest container that packages DeepSpeed and DJL. DJL provides the serving framework, while DeepSpeed is the key sharding library we leverage to host large models. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
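To make the idea of GPUs working on the same layer concrete, here is a toy sketch of column-wise tensor parallelism for a single linear layer. It is only an illustration: plain Python lists stand in for GPU tensors, the function names are hypothetical, and it is not how DeepSpeed implements sharding internally.

```python
# Toy illustration of tensor (column) parallelism for one linear layer.
# Pure Python lists stand in for GPU tensors; no real devices are involved.

def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split the columns of w into num_shards equal contiguous shards."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def tensor_parallel_matmul(x, w, num_shards):
    """Each 'device' multiplies x by its own column shard; outputs are concatenated."""
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, num_shards=4)
print(full == sharded)  # the sharded result matches the single-device result
```

Every shard sees the full input but holds only a slice of the weights, which is why the per-device memory drops while all devices stay busy on the same layer.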

SageMaker has rolled out a DeepSpeed container that lets users leverage its managed serving capabilities and takes care of the undifferentiated heavy lifting.

In this notebook, we deploy the open source OPT 30B model across GPUs on an ml.g5.24xlarge instance. DeepSpeed is used for tensor-parallel inference, while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed, you can refer to https://arxiv.org/pdf/2207.00032.pdf
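A rough, back-of-the-envelope estimate shows why the model has to be sharded. The numbers below are assumptions rather than values taken from this notebook: weights stored in fp16 (2 bytes per parameter) and an ml.g5.24xlarge exposing 4 GPUs with 24 GB of memory each. Only the weights are counted; activations and other runtime buffers need additional headroom.

```python
# Back-of-the-envelope GPU memory estimate for the model weights only.
# Assumptions (not from the notebook): fp16 weights (2 bytes each) and
# 4 GPUs with 24 GB of memory each on the instance.
NUM_PARAMETERS = 30e9
BYTES_PER_PARAMETER = 2
NUM_GPUS = 4
GPU_MEMORY_GB = 24

weights_gb = NUM_PARAMETERS * BYTES_PER_PARAMETER / 1e9
per_gpu_gb = weights_gb / NUM_GPUS

print(f"Total weights: {weights_gb:.0f} GB")
print(f"Per GPU with {NUM_GPUS}-way sharding: {per_gpu_gb:.0f} GB")
print(f"Fits on one GPU: {weights_gb <= GPU_MEMORY_GB}")
print(f"Fits when sharded: {per_gpu_gb <= GPU_MEMORY_GB}")
```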


## Licence agreement
View the license information for this model at https://huggingface.co/facebook/opt-30b/blob/main/LICENSE.md, including the restrictions in Section 2, before using the model.
 
OPT-175B is licensed under the OPT-175B license, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
 
 


In [None]:
# Install boto3 and the AWS CLI to create the model and run inference workloads
%pip install -Uqq boto3 awscli

In [None]:
import sagemaker

bucket = sagemaker.session.Session().default_bucket()
print(bucket)

## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or postprocessing of the model's predictions.

In this notebook, we demonstrate how to deploy a model without any inference code.

SageMaker needs the model artifacts to be in a tarball format. In this example, we provide only a single file - `serving.properties`.

The tarball has the following layout:

```
code_opt30
└── serving.properties
```

- `serving.properties` is the configuration file that can be used to configure the model server.
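As an alternative to shelling out to `tar`, the archive can be built with Python's standard `tarfile` module. The sketch below is illustrative only: it works in a temporary directory, writes a one-line placeholder `serving.properties`, and uses the directory name `code_opt30` that this notebook creates.

```python
import os
import tarfile
import tempfile

# Work in a throwaway directory so the notebook's working directory is untouched.
workdir = tempfile.mkdtemp()
code_dir = os.path.join(workdir, "code_opt30")
os.makedirs(code_dir)

# Placeholder content; the real file is written later in the notebook.
with open(os.path.join(code_dir, "serving.properties"), "w") as f:
    f.write("engine=DeepSpeed\n")

archive_path = os.path.join(workdir, "model.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    # arcname keeps paths inside the archive relative ("code_opt30/...").
    tar.add(code_dir, arcname="code_opt30")

# List the archive contents to confirm the layout.
with tarfile.open(archive_path, "r:gz") as tar:
    members = sorted(tar.getnames())
print(members)
```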


In [None]:
!mkdir -p code_opt30

#### Create serving.properties 
Here is a list of settings that we use in this configuration file:
- `engine`: The engine for DJL to use. In this case, it is **DeepSpeed**.
- `option.entryPoint`: The entrypoint Python file or module. This should align with the engine that is being used.
- `option.model_id`: The model id of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models). The container uses this model id to download the corresponding model repository from huggingface.co. This is an optional setting and is not needed if you are bringing your own model. In that case, you can set `option.s3url` to the URI of the Amazon S3 bucket that contains the model.
- `option.tensor_parallel_degree`: Set to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, if we have an 8-GPU machine and we create 8 partitions, then we have 1 worker per model to serve the requests. For further reading on DeepSpeed, you can follow the link https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference.
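The relationship between GPU count, `option.tensor_parallel_degree`, and workers per model can be sketched in a few lines. This is a minimal illustration that generalizes the 8-GPU example above: `parse_properties` and `workers_per_model` are hypothetical helpers, not part of DJL Serving, and the property values mirror the file this notebook writes.

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and '#' comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def workers_per_model(num_gpus, tensor_parallel_degree):
    """Hypothetical helper: one worker per group of tensor_parallel_degree GPUs."""
    if num_gpus % tensor_parallel_degree != 0:
        raise ValueError("tensor_parallel_degree should divide the GPU count evenly")
    return num_gpus // tensor_parallel_degree


example = """
engine = DeepSpeed
option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=4
option.model_id=facebook/opt-30b
"""

props = parse_properties(example)
degree = int(props["option.tensor_parallel_degree"])
print(props["engine"], degree)
print(workers_per_model(num_gpus=8, tensor_parallel_degree=8))  # the 8-GPU example
print(workers_per_model(num_gpus=4, tensor_parallel_degree=degree))
```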


In [None]:
%%writefile ./code_opt30/serving.properties
engine = DeepSpeed
option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=4
option.model_id=facebook/opt-30b

In [None]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json

#### Create and initialize the variables needed to create the endpoint (we use boto3 for this)

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house the code artifact
model_bucket = sess.default_bucket()  # bucket to house model artifacts (same default bucket here)
s3_code_prefix = (
    "hf-large-model-djl-opt30b/code_opt30"  # folder within bucket where code artifact will go
)

region = sess.boto_region_name  # public property; avoids the private _region_name attribute
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

**Image URI of the DJL container used here**

In [None]:
# inference_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/djl-ds:latest"
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.20.0-deepspeed0.7.5-cu116"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

**Create the Tarball and then upload to S3 location**

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz code_opt30

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

### Optional: specify a VpcConfig when creating the endpoint

For more details, refer to https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html

The command below is just an example of how to extract the Security Group and Subnet information needed for the configuration.

In [None]:
!aws ec2 describe-security-groups --filters Name=vpc-id,Values=<use vpcId> | python3 -c "import sys, json; print(json.load(sys.stdin)['SecurityGroups'])"

In [None]:
# - provide networking configs if needed.
security_group_ids = []  # add the security group IDs
subnets = []  # add the subnet IDs for this VPC
privateVpcConfig = {"SecurityGroupIds": security_group_ids, "Subnets": subnets}
print(privateVpcConfig)

### To create the endpoint, the steps are:

1. Create the Model using the Image container and the Model Tarball uploaded earlier
2. Create the endpoint config using the following key parameters

    a) Instance Type is ml.g5.24xlarge

    b) ContainerStartupHealthCheckTimeoutInSeconds is 2400, which gives the container time to download and load the model before the health check must pass
3. Create the endpoint using the endpoint config created
    

#### Create the Model
Use the image URI for the DJL container and the s3 location to which the tarball was uploaded.

We load the model into the `/tmp` space on the container because SageMaker maps `/tmp` to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB.

For instances like p4d, which come with a pre-built instance volume, we can continue to leverage `/tmp` on the container. The size of this mount is large enough to hold the model.

The container downloads the model using the huggingface library, which caches it at `~/.cache/huggingface/hub` by default. Since we want to leverage the space in `/tmp`, we change the default location using the environment variables `HUGGINGFACE_HUB_CACHE` and `TRANSFORMERS_CACHE`. The huggingface library uses these environment variables to decide where the model is downloaded. For more information, refer to https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#caching-models.

In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("opt30b-djl20-ds")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
        "Environment": {"HUGGINGFACE_HUB_CACHE": "/tmp", "TRANSFORMERS_CACHE": "/tmp"},
    },
    # Uncomment if providing networking configs
    # VpcConfig=privateVpcConfig
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.24xlarge",
            "InitialInstanceCount": 1,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

#### Wait for the endpoint to be created.
While that happens, let us look at the configuration file we are using to load the model:
1. `serving.properties`, to see the properties that configure the model server

In [None]:
# Show the properties in serving.properties that customize the runtime
! sed -n '1,4p' code_opt30/serving.properties

### This step can take ~ 15 min or longer so please be patient

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
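The polling loop above runs until the status leaves `Creating`, but it has no upper time limit and does not distinguish `InService` from `Failed`. A stricter variant could look like the sketch below. `wait_for_endpoint` and its parameters are hypothetical names chosen for illustration; the describe call is passed in as a callable so the helper can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe() until the status leaves 'Creating' or the timeout elapses.

    describe is any zero-argument callable returning a dict with an
    'EndpointStatus' key, for example:
        lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)
    """
    waited = 0
    status = describe()["EndpointStatus"]
    while status == "Creating":
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still 'Creating' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = describe()["EndpointStatus"]
    if status != "InService":
        raise RuntimeError(f"Endpoint ended in status {status!r}")
    return status


# Offline demonstration with a fake describe function and a no-op sleep.
fake_statuses = iter(["Creating", "Creating", "InService"])
result = wait_for_endpoint(
    lambda: {"EndpointStatus": next(fake_statuses)}, sleep=lambda seconds: None
)
print(result)
```

boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be simpler if you do not need custom status logging.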

#### Use Boto3 to invoke the endpoint.

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the result.


In [None]:
%%time
smr_client.invoke_endpoint(
    EndpointName=endpoint_name, Body="Amazon.com is the best", ContentType="text/plain"
)["Body"].read().decode("utf8")

You can also pass a batch of prompts as input to the model. This is done by setting `inputs` to the list of prompts. The model then returns a result for each prompt. The text generation can be configured using appropriate parameters. These `parameters` need to be passed to the endpoint as a dictionary of `kwargs`. Refer to this documentation for more details: https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig

The code sample below illustrates invoking the endpoint with a batch of prompts and some generation parameters.

In [None]:
%%time
prompts = ["Amazon.com is the best", "Amazon.com is the best"]
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompts,
            "parameters": {
                "max_length": 50,
                "do_sample": False,
            },
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")
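If you invoke the endpoint from several places, it can help to build the request body in one spot. The sketch below wraps the `inputs`/`parameters` structure used in the cell above; `build_payload` is a hypothetical helper name, and it only serializes the request without contacting the endpoint.

```python
import json


def build_payload(prompts, **generation_parameters):
    """Serialize prompts and generation parameters into the JSON request body."""
    if isinstance(prompts, str):
        prompts = [prompts]  # accept a single prompt for convenience
    return json.dumps({"inputs": list(prompts), "parameters": generation_parameters})


body = build_payload(["Amazon.com is the best"], max_length=50, do_sample=False)
print(body)
```

The resulting string can be passed as `Body` to `smr_client.invoke_endpoint` together with `ContentType="application/json"`.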

## Conclusion
In this notebook, we demonstrated how to use SageMaker large model inference containers to host OPT-30B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker machine learning instance. For more details about Amazon SageMaker and its large model inference capabilities, refer to the following:

* Amazon SageMaker now supports deploying large models through configurable volume size and timeout quotas (https://aws.amazon.com/about-aws/whats-new/2022/09/amazon-sagemaker-deploying-large-models-volume-size-timeout-quotas/)
* Real-time inference – Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)



## Clean Up

In [None]:
# - Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
# - Even if the endpoint failed, we still want to delete the endpoint config and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|nlp|realtime|llm|opt30b|djl_deepspeed_deploy_opt30b_no_custom_inference_code.ipynb)
