# Large model inference with DeepSpeed


In this notebook, we demonstrate how to run inference for large models with DeepSpeed locally and then deploy the model to a SageMaker inference endpoint.


<font color="red">Note: run the notebook `1_train_gptj_smp_tensor_parallel` first to produce the model artifact used in this notebook. In that notebook, refer to the variable/cell `model_location`, and set `model_s3_uri` in the download stage below to that value.</font>

## 1. Download trained model


First, let's clear some space on the notebook instance.

In [None]:
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_mxnet_p36/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p36/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow2_p27
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow2_p36
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p27/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p36/
!rm -rf /home/ec2-user/anaconda3/envs/chainer_p
!rm -rf /home/ec2-user/anaconda3/envs/chainer_p27/
!rm -rf /home/ec2-user/anaconda3/envs/chainer_p36/
!rm -rf /home/ec2-user/anaconda3/envs/mxnet_latest_p37/
!rm -rf /home/ec2-user/anaconda3/envs/mxnet_p27/
!rm -rf /home/ec2-user/anaconda3/envs/mxnet_p36/
!rm -rf /home/ec2-user/anaconda3/envs/python2/
!rm -rf /home/ec2-user/anaconda3/envs/python3/
!rm -rf /home/ec2-user/anaconda3/envs/pytorch_p27/
!rm -rf /home/ec2-user/anaconda3/envs/pytorch_p36/
!rm -rf /home/ec2-user/anaconda3/envs/tensorflow2_p36/
!rm -rf /home/ec2-user/anaconda3/envs/tensorflow_p27/
!rm -rf /home/ec2-user/anaconda3/envs/tensorflow_p36/
!rm -rf /home/ec2-user/anaconda3/envs/R/
!docker system prune -f

Download the trained model for local testing. Set `model_s3_uri` to the S3 URI of the trained model. It should be of the form:
`s3://sagemaker-us-west-2-855988369404/smp-tensorparallel-outputdir/smp-gpt-j-xl-p38xl-tp4-pp1-bs8-2022-06-22-21-03-26-813/output/model.tar.gz`

In [None]:
model_s3_uri = ""

The next cell sets the local path that the model is downloaded to.

In [None]:
local_model_dir = "./model/"

Next, we download the `model.tar.gz` file produced by SageMaker training in the previous GPT-J notebook and extract it.

In [None]:
! chmod +x ./download.sh
! ./download.sh $local_model_dir $model_s3_uri
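
The contents of `download.sh` are not shown in this notebook. As a rough Python equivalent of the download-and-extract step, the sketch below assumes the artifact is an ordinary `model.tar.gz` object in S3. The helper names `split_s3_uri` and `download_and_extract` are hypothetical, and the actual script may do more than this.

```python
import os
import tarfile
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"Not a valid S3 object URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_and_extract(s3_uri, local_dir):
    """Download a tar.gz archive from S3 and extract it into local_dir."""
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, key = split_s3_uri(s3_uri)
    os.makedirs(local_dir, exist_ok=True)
    archive_path = os.path.join(local_dir, os.path.basename(key))
    boto3.client("s3").download_file(bucket, key, archive_path)
    # Only extract archives you trust: tar members can carry arbitrary paths.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(local_dir)
    return local_dir
```

With the variables defined above, the call would be `download_and_extract(model_s3_uri, local_model_dir)`.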

## 2. Prepare Docker image

The `build.sh` bash script performs the following steps:

* Makes `serve` executable and builds the Docker image
* Optionally, runs the container for local testing

To build the image and run the container for local testing, use the following command:

In [None]:
! ./build.sh gptj-inference-endpoint $local_model_dir test_local

Or, to run without local testing, run:

```sh
./build.sh gptj-inference-endpoint
```

To test the locally running container, run the following cells:

In [None]:
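import json

import requests

# Address of the container started by `build.sh ... test_local`
URL = "http://127.0.0.1:8080/invocations"
HEADERS = {"Content-type": "application/json", "Accept": "*/*"}


def test_endpoint(text, parameters):
    """POST a prompt and its generation parameters; return the raw response body."""
    data = {
        "inputs": {
            "text_inputs": text,
            "parameters": parameters,
        }
    }
    # `json=data` serializes the payload, so no separate json.dumps call is needed
    response = requests.post(URL, json=data, headers=HEADERS)
    return response.text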


In [None]:
text = """This is a creative writing exercise. Below, you'll be given a prompt. Your story should be based on the prompt.

Prompt: A scary story about a haunted mouse
Story: On a dark and stormy night, the mouse crept in the shadows. """

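parameters = {
    "do_sample": True,
    "temperature": 0.9,
    "max_new_tokens": 200,
    "min_tokens": 100,
    "repetition_penalty": 1.1,
    # NOTE: in Hugging Face generation, top_p is a probability in (0, 1];
    # a value of 500 looks like it was intended for top_k.
    "top_p": 500,
}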

response = json.loads(test_endpoint(text, parameters))
print(response['response'][0]['generated_text'])
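
Before sending a request, it can help to sanity-check the generation parameters. The sketch below uses the value ranges documented for Hugging Face text generation (`top_p` in (0, 1], `temperature` strictly positive, `max_new_tokens` a positive integer). The function name `check_generation_parameters` is hypothetical, and whether the serving container enforces these ranges is an assumption.

```python
def check_generation_parameters(parameters):
    """Return a list of human-readable problems found in a parameter dict."""
    problems = []

    top_p = parameters.get("top_p")
    if top_p is not None and not (0 < top_p <= 1):
        problems.append(f"top_p should be in (0, 1], got {top_p}")

    temperature = parameters.get("temperature")
    if temperature is not None and temperature <= 0:
        problems.append(f"temperature should be > 0, got {temperature}")

    max_new_tokens = parameters.get("max_new_tokens")
    if max_new_tokens is not None and not (
        isinstance(max_new_tokens, int) and max_new_tokens > 0
    ):
        problems.append(
            f"max_new_tokens should be a positive integer, got {max_new_tokens}"
        )

    return problems


# Print any problems found in an example parameter dict
for problem in check_generation_parameters({"temperature": 0.9, "top_p": 500}):
    print(problem)
```

An empty list means none of the checked parameters is out of range; parameters the function does not know about are ignored.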

## 3. Deployment

When you're satisfied with your container, you can rebuild it and push it to Amazon ECR using the `push_to_ecr.sh` script. The script requires the name of your Docker image, which is "gptj-inference-endpoint" for the image built above. The call below also passes the local model directory and a new S3 URI, `new_s3_uri`, which is used later as the model data location for the endpoint.

In [None]:
import os
new_s3_uri = os.path.join(os.path.dirname(model_s3_uri), "infer_model.tar.gz")
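
To make the effect of this cell concrete, the sketch below applies the same dirname-and-join logic to a made-up URI. The bucket and prefix are hypothetical. It uses `posixpath` rather than `os.path` so that the result always uses forward slashes, which is what S3 URIs expect; on a Linux notebook instance the two behave identically.

```python
import posixpath

# Hypothetical URI, used only to illustrate the transformation
example_model_s3_uri = "s3://my-bucket/prefix/output/model.tar.gz"

# Replace the file name while keeping the same S3 prefix
example_new_s3_uri = posixpath.join(
    posixpath.dirname(example_model_s3_uri), "infer_model.tar.gz"
)
print(example_new_s3_uri)  # s3://my-bucket/prefix/output/infer_model.tar.gz
```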

In [None]:
! chmod +x push_to_ecr.sh
! ./push_to_ecr.sh gptj-inference-endpoint $local_model_dir $new_s3_uri

The script pushes your image to ECR. Note the address of the repository that the image is pushed to, because you need it for the `image` variable in section 4.1. It should appear below the line `Login Succeeded` in the output of `push_to_ecr.sh`.
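
If the address is hard to find in the script output, it can usually be reconstructed, since ECR image URIs in the standard AWS partition follow the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The sketch below assumes that the repository is named after the image and that the tag is `latest`; neither is confirmed by this notebook, so check the result against what `push_to_ecr.sh` prints. The function name `ecr_image_uri` and the account ID are hypothetical.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build an ECR image URI for the standard AWS partition."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Hypothetical account ID and Region, for illustration only
print(ecr_image_uri("123456789012", "us-west-2", "gptj-inference-endpoint"))
```

The account ID of the current credentials can be looked up with `boto3.client("sts").get_caller_identity()["Account"]`.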

## 4. Inference

Now, you can deploy your endpoint as follows:

### 4.1 Initialize configuration variables

If a rerun fails with an error saying that the endpoint already exists, change `sm_model_name` and `endpoint_name`.
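
One way to avoid name collisions on reruns is to append a timestamp to a base name. This is a minimal sketch rather than part of the original workflow: the helper `timestamped_name` is hypothetical, and the 63-character cap reflects the documented length limit for SageMaker endpoint names.

```python
import time


def timestamped_name(base, max_length=63):
    """Append a UTC timestamp to base, truncating base to fit max_length."""
    suffix = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    # Reserve room for the suffix and the joining hyphen
    trimmed = base[: max_length - len(suffix) - 1].rstrip("-")
    return f"{trimmed}-{suffix}"


print(timestamped_name("gptj-completion-gpu-test"))
```

The returned value can then be assigned to `sm_model_name` and `endpoint_name` in the next cell.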

In [None]:
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor  # named RealTimePredictor before sagemaker 2

role = sagemaker.get_execution_role()

# S3 URI of the model archive (new_s3_uri from section 3)
model_data = new_s3_uri

# URI of the gptj-inference-endpoint image in ECR
image = ""

# SageMaker model name
sm_model_name = "gptj-completion-gpu-test"

# Endpoint name
endpoint_name = "gptj-completion-gpu-test"

# Instance type
instance_type = "ml.g4dn.2xlarge"

# Initial instance count
initial_instance_count = 1


### 4.2 Initialize endpoint

In [None]:
sm_model = Model(
    model_data=model_data,
    image_uri=image,
    role=role,
    predictor_cls=Predictor,
    name=sm_model_name,
)

predictor = sm_model.deploy(
    instance_type=instance_type,
    initial_instance_count=initial_instance_count,
    endpoint_name=endpoint_name,
)

Note: the class `RealTimePredictor` was renamed to `Predictor` in sagemaker>=2.
See https://sagemaker.readthedocs.io/en/stable/v2.html for details.
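
By default, `deploy` blocks until the endpoint has been created. If the kernel restarts in the meantime, the endpoint status can be checked separately. The sketch below uses the boto3 SageMaker client's `describe_endpoint` call; the helper name `endpoint_status` is hypothetical.

```python
def endpoint_status(sagemaker_client, endpoint_name):
    """Return the status string of an endpoint, e.g. 'Creating' or 'InService'."""
    description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return description["EndpointStatus"]
```

Typical usage, assuming AWS credentials and a Region are configured, is `endpoint_status(boto3.client("sagemaker"), endpoint_name)`. The endpoint is ready for queries once the status is `InService`.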


### 4.3 Query model

To query your endpoint, use the code below. You can pass any parameters accepted by the Hugging Face `"text-generation"` pipeline.

#### Initialize the SageMaker Runtime client

In [None]:
import boto3
import json 

# Create a boto3 session
sess = boto3.Session()

# Use the session's default AWS Region
aws_region = sess.region_name

# Create a low-level client representing Amazon SageMaker Runtime
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=aws_region)

In [None]:
%%time

text = """This is a creative writing exercise. Below, you'll be given a prompt. Your story should be based on the prompt.

Prompt: A scary story about a haunted mouse
Story: On a dark and stormy night, the mouse crept in the shadows. """

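parameters = {
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 200,
    "min_tokens": 100,
    "repetition_penalty": 1.1,
    # NOTE: in Hugging Face generation, top_p is a probability in (0, 1];
    # a value of 500 looks like it was intended for top_k.
    "top_p": 500,
}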

data = {
    "inputs": {
        "text_inputs": text,
        "parameters": parameters
    }
}


body = json.dumps(data)

# The response body is not read in this cell; the next cell repeats the call and parses it
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=body,
    ContentType="application/json",
)

In [None]:
%%time

body = json.dumps(data)

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=body,
    ContentType="application/json",
)

# The body is a streaming object; read and decode it before parsing the JSON
result = json.loads(response["Body"].read().decode("utf-8"))

In [None]:
result
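
The request and parsing steps above can be wrapped in a single helper. This is a minimal sketch: the function name `generate` is hypothetical, and it assumes that the deployed endpoint returns the same response shape as the local container test in section 2, that is `result["response"][0]["generated_text"]`.

```python
import json


def generate(runtime_client, endpoint_name, text, parameters):
    """Invoke the endpoint with a prompt and return the generated text."""
    payload = {"inputs": {"text_inputs": text, "parameters": parameters}}
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
    )
    result = json.loads(response["Body"].read().decode("utf-8"))
    # Assumption: same response shape as the local container test
    return result["response"][0]["generated_text"]
```

With the objects defined above, the call would be `generate(sagemaker_runtime, endpoint_name, text, parameters)`.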