# Extended Examples on Deploying LLMs on SageMaker

Examples included in this notebook:
1. Packaging and deploying from S3 compressed artifacts
2. Deploying in network isolation mode
3. Deploying with Deep Java Library (DJL) and DeepSpeed (requires a SageMaker instance type with at least 2 GPUs)

In [None]:
!pip install "sagemaker>=2.140.0" boto3 "huggingface_hub>=0.14.0" "hf-transfer" --upgrade --quiet

In [None]:
!apt-get update && apt-get install -y pigz

## Packaging and deploying from S3 compressed artifacts

You can skip the packaging steps in the interest of time, provided the archive referenced by `model_data` in the deployment cells already exists in S3.

In [None]:
import os
from pathlib import Path

# set HF_HUB_ENABLE_HF_TRANSFER env var to enable hf-transfer for faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

HF_MODEL_ID="OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
# create model dir
model_tar_dir = Path(HF_MODEL_ID.split("/")[-1])
model_tar_dir.mkdir(exist_ok=True)

# Download model from Hugging Face into model_dir
snapshot_download(HF_MODEL_ID, local_dir=str(model_tar_dir), local_dir_use_symlinks=False)

In [None]:
parent_dir=os.getcwd()
# change to model dir
os.chdir(str(model_tar_dir))
# use pigz for faster and parallel compression
!tar -cf oasst-pythia-12b.tar.gz --use-compress-program=pigz *
# change back to parent dir
os.chdir(parent_dir)
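
SageMaker extracts the archive into `/opt/ml/model`, which is why the cell above runs `tar` from inside the model directory: the files need to sit at the archive root rather than under a parent folder. As a minimal sketch of the same layout using only Python's standard `tarfile` module (slower than `pigz` for a 12B model, and the `demo-model` directory and file names here are hypothetical stand-ins), the following builds a flat archive and lists its members:

```python
import tarfile
import tempfile
from pathlib import Path


def make_flat_archive(model_dir: Path, archive_path: Path) -> list:
    """Create a .tar.gz whose members sit at the archive root and return their names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in sorted(model_dir.iterdir()):
            # arcname drops the parent directory so each file lands at the root
            tar.add(item, arcname=item.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Hypothetical stand-in for the downloaded model directory
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "demo-model"
    demo_dir.mkdir()
    (demo_dir / "config.json").write_text("{}")
    (demo_dir / "tokenizer.json").write_text("{}")
    # The archive is written outside demo_dir so it never includes itself
    member_names = make_flat_archive(demo_dir, Path(tmp) / "demo-model.tar.gz")
    print(member_names)
```

Listing the members before uploading is a cheap way to catch an archive that accidentally nests everything under a top-level folder.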

In [None]:
!aws s3 cp oasst-sft-4-pythia-12b-epoch-3.5/oasst-pythia-12b.tar.gz s3://sagemaker-us-west-2-224755080010/

### Deployment with Text Generation Inference 

We use a modified version of Hugging Face's [Text Generation Inference](https://github.com/huggingface/text-generation-inference). We modified `v0.4.0` with the following `Dockerfile` and `sagemaker-entrypoint.sh`.

`Dockerfile`:
```
FROM 224755080010.dkr.ecr.us-west-2.amazonaws.com/sagemaker-text-generation-inference:0.4
COPY sagemaker-entrypoint.sh entrypoint.sh
RUN chmod +x entrypoint.sh

ENTRYPOINT ["./entrypoint.sh"]
```

`sagemaker-entrypoint.sh`:
```
#!/bin/bash

if [[ -z "${HF_MODEL_ID}" ]]; then
  echo "HF_MODEL_ID must be set"
  exit 1
fi

if [[ -n "${HF_MODEL_REVISION}" ]]; then
  export REVISION="${HF_MODEL_REVISION}"
fi

if [[ -n "${SM_NUM_GPUS}" ]]; then
  export NUM_SHARD="${SM_NUM_GPUS}"
fi

if [[ -n "${HF_MODEL_QUANTIZE}" ]]; then
  export QUANTIZE="${HF_MODEL_QUANTIZE}"
fi

text-generation-launcher --port 8080 --model-id "${HF_MODEL_ID}"
```

This Docker image is hosted at `224755080010.dkr.ecr.us-west-2.amazonaws.com/sagemaker-text-generation-inference:0.4-mod`.
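
To make the variable mapping in `sagemaker-entrypoint.sh` easier to follow, here is a minimal Python sketch of the same logic. It is only an illustration: `launcher_env` is a hypothetical helper, not part of the image, and it mirrors the script's checks (a variable is forwarded only when it is set and non-empty).

```python
def launcher_env(container_env: dict) -> dict:
    """Mirror the environment mapping performed by sagemaker-entrypoint.sh."""
    if not container_env.get("HF_MODEL_ID"):
        # Same guard as the script's `-z "${HF_MODEL_ID}"` check
        raise ValueError("HF_MODEL_ID must be set")

    # SageMaker-facing name -> name exported for text-generation-launcher
    mapping = {
        "HF_MODEL_REVISION": "REVISION",
        "SM_NUM_GPUS": "NUM_SHARD",
        "HF_MODEL_QUANTIZE": "QUANTIZE",
    }
    result = dict(container_env)
    for source, target in mapping.items():
        value = container_env.get(source)
        if value:  # like bash's [[ -n "$VAR" ]]: set and non-empty
            result[target] = value
    return result


example = launcher_env(
    {"HF_MODEL_ID": "/opt/ml/model/", "SM_NUM_GPUS": "1", "HF_MODEL_QUANTIZE": "true"}
)
print(example["NUM_SHARD"], example["QUANTIZE"])
```

Because `HF_MODEL_REVISION` is absent in this example, no `REVISION` key is produced, just as the script skips that export.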

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

In [None]:
import json
import boto3
import sagemaker
from sagemaker.model import Model

# Define Model and Endpoint configuration parameter
hf_model_id = "/opt/ml/model/" # change to model directory where SageMaker model is loaded
use_quantized_model = True # whether to use quantization or not
instance_type = "ml.g5.xlarge"
number_of_gpu = 1 # number of gpus to use for inference and tensor parallelism

health_check_timeout = 1800 # 1800 seconds = 30 minutes, to give the container time to load the model

# account_id = sess.account_id()
account_id= "224755080010"
region =  sagemaker.Session(boto_session=boto3.Session(region_name='us-west-2')).boto_region_name

image_uri = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-text-generation-inference:0.4-mod".format(
    account_id,region)

hf_model = Model(
  role=role,
  model_data="s3://sagemaker-us-west-2-224755080010/oasst-pythia-12b.tar.gz",
  sagemaker_session=sagemaker.Session(boto_session=boto3.Session(region_name='us-west-2')),
  image_uri=image_uri,
  env={
    'HF_MODEL_ID':hf_model_id,
    'HF_MODEL_QUANTIZE': json.dumps(use_quantized_model),
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
  }
)
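
Container environment variables must be strings, which is why the cell above passes the boolean and integer settings through `json.dumps`. The short sketch below shows what the container actually receives; `build_container_env` is a hypothetical helper introduced only for illustration.

```python
import json


def build_container_env(model_path: str, quantize: bool, num_gpus: int) -> dict:
    """Stringify settings the same way the deployment cell does."""
    return {
        "HF_MODEL_ID": model_path,
        "HF_MODEL_QUANTIZE": json.dumps(quantize),  # True -> "true", not "True"
        "SM_NUM_GPUS": json.dumps(num_gpus),        # 1 -> "1"
    }


env = build_container_env("/opt/ml/model/", True, 1)
print(env)

# json.dumps(False) is the non-empty string "false", so the entrypoint's
# `-n` test would still export QUANTIZE in that case.
disabled = build_container_env("/opt/ml/model/", False, 1)
print(disabled["HF_MODEL_QUANTIZE"])
```

One consequence worth noting: setting `use_quantized_model = False` does not leave `QUANTIZE` unset, since the entrypoint only checks for a non-empty value. How the launcher interprets `"false"` is not covered here, so omitting the key entirely may be the safer way to disable quantization.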

In [None]:
import uuid
endpoint_name = 'oa-pythia-s3-{}'.format(str(uuid.uuid4()))
hf_model.deploy(
  endpoint_name=endpoint_name,
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout
)

In [None]:
from sagemaker.huggingface.model import HuggingFacePredictor
predictor = HuggingFacePredictor(endpoint_name=endpoint_name,
                                 sagemaker_session=sagemaker.Session(boto_session=boto3.Session(region_name='us-west-2')))
payload = """
<|prompter|>Summarize the following passage.
Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. 
It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don't have to manage servers. 
It also provides common machine learning algorithms that are optimized to run efficiently against extremely large data in a distributed environment. 
With native support for bring-your-own-algorithms and frameworks, SageMaker offers flexible distributed training options that adjust to your specific workflows. 
Deploy a model into a secure and scalable environment by launching it with a few clicks from SageMaker Studio or the SageMaker console.<|endoftext|><|assistant|>"""

parameters = {
  "max_new_tokens":512,"temperature":0.7,"do_sample":True,"top_k":40,"top_p":0.1
}

# Run prediction
predictor.predict({
  "inputs": payload,
  "parameters": parameters
})
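
The payload above follows the OpenAssistant prompt layout: the user turn is wrapped in `<|prompter|>` and `<|endoftext|>`, and a trailing `<|assistant|>` cues the model to answer. A minimal sketch of helpers that assemble the same request body (the function names are hypothetical, and this version omits the leading newline that the triple-quoted string introduces):

```python
PROMPTER_TOKEN = "<|prompter|>"
END_TOKEN = "<|endoftext|>"
ASSISTANT_TOKEN = "<|assistant|>"


def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in the special tokens used by the payload above."""
    return f"{PROMPTER_TOKEN}{user_message}{END_TOKEN}{ASSISTANT_TOKEN}"


def build_request(user_message: str, **parameters) -> dict:
    """Assemble the JSON body in the shape passed to predictor.predict."""
    return {"inputs": build_prompt(user_message), "parameters": parameters}


request = build_request(
    "Summarize the following passage.\nAmazon SageMaker is a fully managed machine learning service.",
    max_new_tokens=512,
    temperature=0.7,
)
print(request["inputs"])
```

The resulting dictionary can be passed to `predictor.predict(request)` in place of the hand-written payload.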

In [None]:
predictor.delete_endpoint()

## Deploying in network isolation mode

In [None]:
import json
import boto3
import sagemaker
from sagemaker.model import Model

# Define Model and Endpoint configuration parameter
hf_model_id = "/opt/ml/model/" # change to model directory where SageMaker model is loaded
use_quantized_model = True # whether to use quantization or not
instance_type = "ml.g5.xlarge"
number_of_gpu = 1 # number of gpus to use for inference and tensor parallelism

health_check_timeout = 1800 # 1800 seconds = 30 minutes, to give the container time to load the model

# account_id = sess.account_id()
account_id= "224755080010"
region =  sagemaker.Session(boto_session=boto3.Session(region_name='us-west-2')).boto_region_name

image_uri = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-text-generation-inference:0.4-mod-new".format(
    account_id,region)

hf_model = Model(
  role=role,
  model_data="s3://sagemaker-us-west-2-224755080010/oasst-pythia-12b.tar.gz",
  sagemaker_session=sagemaker.Session(boto_session=boto3.Session(region_name='us-west-2')),
  image_uri=image_uri,
  env={
    'HF_MODEL_ID':hf_model_id,
    'HF_MODEL_QUANTIZE': json.dumps(use_quantized_model),
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
  },
  enable_network_isolation=True
)
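
With `enable_network_isolation=True` the container cannot make outbound network calls, so it cannot pull weights from the Hugging Face Hub at startup. The weights have to come from `model_data`, which SageMaker extracts under `/opt/ml/model`. The sketch below is a hypothetical pre-deployment sanity check (`isolation_problems` is not an SDK function) that flags a configuration which would need network access:

```python
MODEL_MOUNT = "/opt/ml/model"  # where SageMaker extracts model_data in the container


def isolation_problems(env: dict, model_data, network_isolation: bool) -> list:
    """Return human-readable problems; an empty list means the config looks consistent."""
    if not network_isolation:
        return []
    problems = []
    model_id = env.get("HF_MODEL_ID", "")
    if not model_id.startswith(MODEL_MOUNT):
        # A Hub ID such as "org/model" would require a download at startup
        problems.append(f"HF_MODEL_ID {model_id!r} is not a path under {MODEL_MOUNT}")
    if not str(model_data or "").startswith("s3://"):
        problems.append("model_data should be an s3:// URI to the packaged weights")
    return problems


ok = isolation_problems({"HF_MODEL_ID": "/opt/ml/model/"}, "s3://my-bucket/model.tar.gz", True)
bad = isolation_problems({"HF_MODEL_ID": "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"}, None, True)
print(ok)
print(bad)
```

The bucket name `my-bucket` is a placeholder. Running a check like this before `deploy` avoids waiting for a health check timeout only to find that the container tried to reach the Hub.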

In [None]:
import uuid
endpoint_name = 'oa-pythia-s3-{}'.format(str(uuid.uuid4()))
hf_model.deploy(
  endpoint_name=endpoint_name,
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout
)

In [None]:
from sagemaker.huggingface.model import HuggingFacePredictor
predictor = HuggingFacePredictor(endpoint_name=endpoint_name,
                                 sagemaker_session=sagemaker.Session(boto_session=boto3.Session(region_name='us-west-2')))
payload = """
<|prompter|>Summarize the following passage.
Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. 
It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don't have to manage servers. 
It also provides common machine learning algorithms that are optimized to run efficiently against extremely large data in a distributed environment. 
With native support for bring-your-own-algorithms and frameworks, SageMaker offers flexible distributed training options that adjust to your specific workflows. 
Deploy a model into a secure and scalable environment by launching it with a few clicks from SageMaker Studio or the SageMaker console.<|endoftext|><|assistant|>"""

parameters = {
  "max_new_tokens":512,"temperature":0.7,"do_sample":True,"top_k":40,"top_p":0.1
}

# Run prediction
predictor.predict({
  "inputs": payload,
  "parameters": parameters
})

In [None]:
predictor.delete_endpoint()

## Deploying with Deep Java Library (DJL) and DeepSpeed
[DeepSpeed](https://github.com/microsoft/DeepSpeed) is a library for optimizing deep learning model training and inference. [DJL](https://github.com/deepjavalibrary/djl-serving) is a model deployment library intended for deep learning models. According to [Hugging Face's own benchmark](https://huggingface.co/blog/bloom-inference-pytorch-scripts), DeepSpeed enables high-throughput inference and can distribute model weights that do not fit on a single GPU across multiple GPUs.

SageMaker's Python SDK is integrated with DJL and DeepSpeed. Check out the full documentation [here](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/using_djl.html#inference-code-and-model-server-properties). 

For this tutorial, you need access to an instance with at least 2 GPUs. We use `ml.g4dn.12xlarge` to deploy [OpenChatKit](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B), a recently released model, built on the older GPT-NeoX 20B, that is tuned on an instruction dataset.
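
A rough back-of-the-envelope estimate shows why the cell below combines `data_type="int8"` with `tensor_parallel_degree=2`. The sketch counts weights only and ignores activations, the KV cache, and framework overhead, so treat the numbers as lower bounds. It assumes roughly 20 billion parameters and 16 GB of memory per GPU (an `ml.g4dn.12xlarge` has four NVIDIA T4 GPUs).

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_gib_per_gpu(num_params: float, data_type: str, tensor_parallel_degree: int) -> float:
    """Approximate GiB of model weights held by each GPU under tensor parallelism."""
    total_bytes = num_params * BYTES_PER_PARAM[data_type]
    return total_bytes / tensor_parallel_degree / 2**30


NUM_PARAMS = 20e9  # assumption: about 20B parameters

for data_type in ("fp16", "int8"):
    per_gpu = weight_gib_per_gpu(NUM_PARAMS, data_type, tensor_parallel_degree=2)
    print(f"{data_type}: ~{per_gpu:.1f} GiB of weights per GPU")
```

Under these assumptions, fp16 weights split across two GPUs would already exceed a single T4's memory, while int8 weights leave headroom on each device.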

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

In [None]:
from sagemaker.djl_inference.model import DeepSpeedModel

model = DeepSpeedModel(
    "togethercomputer/GPT-NeoXT-Chat-Base-20B",
    role,
    tensor_parallel_degree=2,
    task="text-generation",
    data_type="int8",
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.12xlarge")