# Serve large models on SageMaker with model parallel inference and DJLServing

In this notebook, we explore how to host a large language model on SageMaker using model parallelism from DeepSpeed and DJLServing.

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
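To make the idea of several GPUs working on the same layer concrete, here is a toy sketch in plain Python. It splits the output columns of a single linear layer across two workers and checks that the concatenated partial results match the unsplit computation. The function names are hypothetical and the shards run sequentially here; DeepSpeed's actual partitioning and GPU communication are considerably more involved.

```python
# Illustrative only: a real tensor-parallel layer places each shard on its own GPU.

def matvec(weight_columns, x):
    """Multiply input vector x by a weight matrix stored as a list of columns."""
    return [sum(w * xi for w, xi in zip(column, x)) for column in weight_columns]


def split_columns(weight_columns, num_shards):
    """Split the output columns of a layer evenly across num_shards workers."""
    if len(weight_columns) % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    per_shard = len(weight_columns) // num_shards
    return [
        weight_columns[i * per_shard:(i + 1) * per_shard]
        for i in range(num_shards)
    ]


def tensor_parallel_matvec(weight_columns, x, num_shards):
    """Each shard computes its slice of the output; the slices are concatenated."""
    output = []
    for shard in split_columns(weight_columns, num_shards):
        # On real hardware the shards would be computed concurrently.
        output.extend(matvec(shard, x))
    return output


# A toy layer with 2 inputs and 4 outputs, stored column by column.
weights = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [10, 1]

full = matvec(weights, x)
sharded = tensor_parallel_matvec(weights, x, num_shards=2)
print(full, sharded)
```

Each worker only needs to hold its own half of the weights, which is what relieves the per-GPU memory pressure.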

In this notebook, we deploy a PyTorch GPT-J model from Hugging Face with 6 billion parameters across two GPUs on an Amazon SageMaker ml.g5.48xlarge instance. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. 
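As a rough back-of-the-envelope check, the sketch below estimates how much memory the model weights alone occupy. It assumes 6 billion parameters (from the text above) stored as 4-byte float32 values, and it ignores activations, the generation cache, and framework overhead, so real usage will be higher. The helper name is hypothetical.

```python
def weight_memory_gib(num_parameters, bytes_per_parameter, tensor_parallel_degree=1):
    """Approximate memory (GiB) for the weights held by each GPU."""
    total_bytes = num_parameters * bytes_per_parameter
    return total_bytes / tensor_parallel_degree / 2**30


# Assumption: 6e9 parameters in float32 (4 bytes each).
full_fp32 = weight_memory_gib(6e9, 4)
per_gpu = weight_memory_gib(6e9, 4, tensor_parallel_degree=2)
print(f"{full_fp32:.1f} GiB in total, {per_gpu:.1f} GiB per GPU with two-way tensor parallelism")
```

Comparing the per-GPU figure with the memory of the target GPU gives a first indication of which tensor parallel degree is needed.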

## Step 1: Creating image for SageMaker endpoint
We first pull the Docker image `djl-serving:0.18.0-deepspeed`.

In [None]:
%%sh
docker pull deepjavalibrary/djl-serving:0.18.0-deepspeed

In [None]:
!docker images

You should see the image `djl-serving` listed from running the code above. Please note the `IMAGE ID`. We will need it for the next step.

### Push image to ECR
The following code pushes the `djl-serving` image, downloaded in the previous step, to ECR.

In [None]:
%%sh

# The name of our container
img=djl_deepspeed


account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration
region=$(aws configure get region)

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${img}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${img}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${img}" > /dev/null
fi

# Get a login password from ECR and pipe it to docker login for our registry
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"


# Tag the image pulled in the previous step with the ECR name, then push it to ECR
image_id=$(docker images -q deepjavalibrary/djl-serving:0.18.0-deepspeed)
docker tag "${image_id}" "${fullname}"

docker push "${fullname}"

## Step 2: Create a `model.py` and `serving.properties`

In [None]:
%%writefile model.py

from djl_python import Input, Output
import os
import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

predictor = None


def get_model():
    model_name = "EleutherAI/gpt-j-6B"
    tensor_parallel = int(os.getenv("TENSOR_PARALLEL_DEGREE", "2"))
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    model = AutoModelForCausalLM.from_pretrained(
        model_name, revision="float32", torch_dtype=torch.float32
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model = deepspeed.init_inference(
        model,
        mp_size=tensor_parallel,
        dtype=model.dtype,
        replace_method="auto",
        replace_with_kernel_inject=True,
    )
    generator = pipeline(
        task="text-generation", model=model, tokenizer=tokenizer, device=local_rank
    )
    return generator


def handle(inputs: Input) -> "Output | None":
    global predictor
    if not predictor:
        predictor = get_model()

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_string()
    result = predictor(data, do_sample=True, min_length=200, max_new_tokens=256)
    return Output().add(result)

### Setup serving.properties

You need to set the engine to Rubikon, as shown below. To control the number of worker groups, you can set

```
gpu.minWorkers=1
gpu.maxWorkers=1
```
by adding these lines to the file below. By default, as many worker groups as possible are created, based on `gpu_numbers/tensor_parallel_degree`.

In [None]:
%%writefile serving.properties

engine = Rubikon
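The default worker group count described above can be sketched as simple arithmetic. This is only an illustration: the helper name is hypothetical, it assumes the division rounds down, and it assumes the 8 GPUs of an ml.g5.48xlarge instance with the tensor parallel degree of 2 used in this notebook.

```python
def default_worker_groups(gpu_count, tensor_parallel_degree):
    """Number of worker groups that fit when each group spans tensor_parallel_degree GPUs."""
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be at least 1")
    # Assumption: leftover GPUs that cannot form a full group stay unused.
    return gpu_count // tensor_parallel_degree


# Assumption: 8 GPUs on the instance, 2 GPUs per model copy.
print(default_worker_groups(8, 2))
```

Setting `gpu.minWorkers` and `gpu.maxWorkers` to 1 would instead keep a single model copy on two GPUs.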

The code below creates the SageMaker model file (`gpt-j.tar.gz`) and uploads it to S3.

In [None]:
import sagemaker, boto3

session = sagemaker.Session()
account = session.account_id()
region = session.boto_region_name
img = "djl_deepspeed"
fullname = account + ".dkr.ecr." + region + ".amazonaws.com/" + img + ":latest"

bucket = session.default_bucket()
path = "s3://" + bucket + "/DEMO-djl-big-model/"

In [None]:
%%sh
# Always start fresh
if [ -d gpt-j ]; then
  rm -rf gpt-j
fi

mkdir -p gpt-j
mv model.py gpt-j
mv serving.properties gpt-j
tar -czvf gpt-j.tar.gz gpt-j/
#aws s3 cp gpt-j.tar.gz {path}

In [None]:
!aws s3 cp gpt-j.tar.gz {path}
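If you prefer to stay in Python, the packaging step can be sketched with the standard `tarfile` module. The helper name `package_model` is hypothetical; it mirrors the shell cell by placing the files under a top-level `gpt-j/` directory inside the archive, but it copies the files into the archive rather than moving them.

```python
import tarfile
from pathlib import Path


def package_model(files, archive_path, top_level_dir="gpt-j"):
    """Write the given files into a gzip-compressed tar under top_level_dir/."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for file_path in files:
            file_path = Path(file_path)
            # Store each file as <top_level_dir>/<file name>, regardless of its source folder.
            archive.add(file_path, arcname=f"{top_level_dir}/{file_path.name}")
    return archive_path


# Example usage (expects both files in the working directory):
# package_model(["model.py", "serving.properties"], "gpt-j.tar.gz")
```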

## Step 3: Create SageMaker endpoint

First, let us make sure we have the latest awscli.

In [None]:
!pip3 install --upgrade --user awscli

Running `docker images` after the push in Step 1 should list two images. Note the image whose name looks like `<AWS_account_ID>.dkr.ecr.us-east-1.amazonaws.com/djl_deepspeed`. This is the ECR image URL that we need in the next cell.

Now we create our [SageMaker model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html). Make sure you provide an IAM role that SageMaker can assume to access the model artifacts and the Docker image for deployment on ML compute hosting instances. The same IAM role also governs the permissions that the inference code needs. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details.

 <span style="color:red"> You must enter ECR image name, S3 path for the model file, and an execution-role-arn</span> in the code below.

In [None]:
!aws sagemaker create-model \
--model-name gpt-j \
--primary-container \
Image=<ECR image>,ModelDataUrl={path},Environment={TENSOR_PARALLEL_DEGREE=2} \
--execution-role-arn <your execution-role-arn>
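The same model can also be created from Python with boto3. The sketch below only builds the request for `create_model`; the helper name is hypothetical, and the image URI, S3 path, and role ARN must be supplied by you, just like the placeholders in the CLI command above.

```python
def build_create_model_request(
    model_name, image_uri, model_data_url, role_arn, tensor_parallel_degree=2
):
    """Build keyword arguments for the SageMaker create_model call."""
    return {
        "ModelName": model_name,
        "PrimaryContainer": {
            "Image": image_uri,
            "ModelDataUrl": model_data_url,
            # Environment values must be strings.
            "Environment": {"TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree)},
        },
        "ExecutionRoleArn": role_arn,
    }


# Example usage (requires AWS credentials; fullname and path come from the earlier cells):
# import boto3
# request = build_create_model_request("gpt-j", fullname, path + "gpt-j.tar.gz", role_arn)
# boto3.client("sagemaker").create_model(**request)
```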

Note that we configure `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to accommodate the large size of our model.

In [None]:
%%sh
aws sagemaker create-endpoint-config \
    --region $(aws configure get region) \
    --endpoint-config-name gpt-j-config \
    --production-variants '[
      {
        "ModelName": "gpt-j",
        "VariantName": "AllTraffic",
        "InstanceType": "ml.g5.48xlarge",
        "InitialInstanceCount": 1,
        "ModelDataDownloadTimeoutInSeconds": 1800,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600
      }
    ]'

In [None]:
%%sh
aws sagemaker create-endpoint \
--endpoint-name gpt-j \
--endpoint-config-name gpt-j-config

The creation of the SageMaker endpoint might take a while. After the endpoint is created, you can test it out using the following code. 
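Rather than checking the console by hand, you can poll the endpoint status until it is ready. The following is a minimal sketch: `wait_for_endpoint` is a hypothetical helper, the polling interval and timeout are arbitrary defaults, and `client` is expected to be a boto3 SageMaker client (anything with a compatible `describe_endpoint` method works).

```python
import time


def wait_for_endpoint(
    client, endpoint_name, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep
):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    waited = 0
    while True:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage (requires AWS credentials):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), "gpt-j")
```

boto3 also ships a built-in waiter for this, available through `client.get_waiter("endpoint_in_service")`.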

In [None]:
import boto3, json

client = boto3.client("sagemaker-runtime")

endpoint_name = "gpt-j"  # Your endpoint name.
content_type = "text/plain"  # The MIME type of the input data in the request body.
payload = "Amazon.com is the best"  # Payload for inference.
response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType=content_type, Body=payload
)
print(response["Body"].read())
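The response body arrives as raw bytes. The sketch below shows one way to pull out the generated text, assuming the body is a JSON list of objects with a `generated_text` field, which is the shape the Hugging Face text-generation pipeline returns. Check the actual output of your endpoint before relying on this; the helper name and the sample payload are made up for illustration and are not real model output.

```python
import json


def extract_generated_text(body):
    """Return the generated strings from a JSON-encoded text-generation result."""
    records = json.loads(body)
    return [record["generated_text"] for record in records]


# Hypothetical payload for illustration only.
sample_body = b'[{"generated_text": "Amazon.com is the best place to shop"}]'
print(extract_generated_text(sample_body))
```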

## Step 4: Clean up

In [None]:
%%sh
aws sagemaker delete-endpoint --endpoint-name gpt-j
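Deleting the endpoint stops the instance charges, but the endpoint configuration and the model created in Step 3 remain in your account. A minimal sketch for removing all three with boto3 follows; `delete_endpoint_resources` is a hypothetical helper, and the names in the usage comment are the ones used earlier in this notebook.

```python
def delete_endpoint_resources(client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model."""
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)


# Example usage (requires AWS credentials):
# import boto3
# delete_endpoint_resources(boto3.client("sagemaker"), "gpt-j", "gpt-j-config", "gpt-j")
```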

## Conclusion

In this notebook, you used tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once, allowing for lower inference latency when a low batch size is used. Here, we used open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving as the model serving solution.

As a next step, you can experiment with larger models from Hugging Face such as GPT-NeoX. You can also adjust the tensor parallel degree to see the impact on latency with models of different sizes.