# Large model inference with DeepSpeed


In this notebook, we demonstrate how to run inference for large models with DeepSpeed locally and then deploy it in a SageMaker Inference Endpoint. 


<font color="red">Note: run the notebook `1_train_gptj_smp_tensor_parallel` first to produce the model artifact used in this notebook. The `model_location` variable in that notebook holds the S3 URI of the artifact; set `model_s3_uri` in the download stage below to that value.</font>

## 1. Download trained model


First, let's clear some space on the notebook instance.

In [10]:
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_mxnet_p36/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p36/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow2_p27
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow2_p36
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p27/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p36/
!rm -rf /home/ec2-user/anaconda3/envs/chainer_p
!rm -rf /home/ec2-user/anaconda3/envs/chainer_p27/
!rm -rf /home/ec2-user/anaconda3/envs/chainer_p36/
!rm -rf /home/ec2-user/anaconda3/envs/mxnet_latest_p37/
!rm -rf /home/ec2-user/anaconda3/envs/mxnet_p27/
!rm -rf /home/ec2-user/anaconda3/envs/mxnet_p36/
!rm -rf /home/ec2-user/anaconda3/envs/python2/
!rm -rf /home/ec2-user/anaconda3/envs/python3/
!rm -rf /home/ec2-user/anaconda3/envs/pytorch_p27/
!rm -rf /home/ec2-user/anaconda3/envs/pytorch_p36/
!rm -rf /home/ec2-user/anaconda3/envs/tensorflow2_p36/
!rm -rf /home/ec2-user/anaconda3/envs/tensorflow_p27/
!rm -rf /home/ec2-user/anaconda3/envs/tensorflow_p36/
!rm -rf /home/ec2-user/anaconda3/envs/R/
!docker system prune -f

Total reclaimed space: 0B


To test locally, first set `model_s3_uri` to the S3 location of the trained model. It should be of the form
`s3://sagemaker-us-west-2-855988369404/smp-tensorparallel-outputdir/smp-gpt-j-xl-p38xl-tp4-pp1-bs8-2022-06-22-21-03-26-813/output/model.tar.gz`

In [11]:
model_s3_uri = ""
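
An empty or mistyped URI only fails later, inside `download.sh`, so it can help to check the value first. The following is a minimal sketch; the helper name `check_model_s3_uri` is hypothetical and not part of this notebook's scripts, and the check assumes the artifact is always named `model.tar.gz` as in the example above.

```python
def check_model_s3_uri(uri: str) -> str:
    """Return the URI unchanged if it looks like s3://<bucket>/<prefix>/model.tar.gz."""
    if not uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    # Split "bucket/key" at the first slash after the scheme.
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket or not key.endswith("model.tar.gz"):
        raise ValueError(f"expected s3://<bucket>/<prefix>/model.tar.gz, got {uri!r}")
    return uri
```

Calling `check_model_s3_uri(model_s3_uri)` before the download cell raises a `ValueError` while the variable is still empty.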

The next cell sets the local directory the model is downloaded to.

In [12]:
local_model_dir = "./model/"

Next, we download the `model.tar.gz` file produced by SageMaker training in the previous GPT-J notebook and extract it.

In [13]:
! chmod +x ./download.sh
! ./download.sh $local_model_dir $model_s3_uri

+ MODEL_PATH=s3://sagemaker-us-west-2-855988369404/smp-tensorparallel-outputdir/smp-gpt-j-xl-p38xl-tp4-pp1-bs8-2022-06-22-21-03-26-813/6b_output/model.tar.gz
+ DIR=/dev/shm/model/
+ rm -rf /dev/shm/model/
+ mkdir -p /dev/shm/model/
+ aws s3 cp s3://sagemaker-us-west-2-855988369404/smp-tensorparallel-outputdir/smp-gpt-j-xl-p38xl-tp4-pp1-bs8-2022-06-22-21-03-26-813/6b_output/model.tar.gz /dev/shm/model/
download: s3://sagemaker-us-west-2-855988369404/smp-tensorparallel-outputdir/smp-gpt-j-xl-p38xl-tp4-pp1-bs8-2022-06-22-21-03-26-813/6b_output/model.tar.gz to ../../../../dev/shm/model/model.tar.gz
+ cd /dev/shm/model/
+ tar -xvf model.tar.gz
gptj.pt
special_tokens_map.json
config.json
code/
code/inference.py
added_tokens.json
merges.txt
tokenizer_config.json
tokenizer.json
vocab.json
+ rm model.tar.gz
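
The trace above shows the steps the script takes: recreate the target directory, copy the archive from S3, extract it, and delete the archive. For readers who prefer to stay in Python, the same steps might be sketched as follows. This is an approximation based only on that trace, not the contents of `download.sh`; the function names are hypothetical, and `boto3` is assumed to be installed with credentials configured.

```python
import shutil
import tarfile
from pathlib import Path
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, key


def extract_and_remove(archive_path, target_dir):
    """Extract a tar archive into target_dir, then delete the archive."""
    archive_path = Path(archive_path)
    with tarfile.open(archive_path) as tar:
        tar.extractall(target_dir)
    archive_path.unlink()


def download_model(model_s3_uri, local_dir):
    """Recreate local_dir, fetch the archive from S3, and unpack it there."""
    import boto3  # imported here so the helpers above work without boto3

    bucket, key = split_s3_uri(model_s3_uri)
    local_dir = Path(local_dir)
    shutil.rmtree(local_dir, ignore_errors=True)
    local_dir.mkdir(parents=True)
    archive = local_dir / "model.tar.gz"
    boto3.client("s3").download_file(bucket, key, str(archive))
    extract_and_remove(archive, local_dir)
```

With the variables defined earlier, this would be called as `download_model(model_s3_uri, local_model_dir)`.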


## 2. Prepare Docker image

The `build.sh` bash script performs the following steps:

* Makes `serve` executable and builds our Docker image
* Optionally, runs the container for local testing

To build the image and run the container for local testing, use the following command:

In [14]:
! ./build.sh gptj-inference-endpoint $local_model_dir test_local

Sending build context to Docker daemon  23.04kB
Step 1/13 : FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel
 ---> 7afd9b52a068
Step 2/13 : LABEL com.amazon.image.authors.email="sage-learner@amazon.com"
 ---> Using cache
 ---> ebe03e021f48
Step 3/13 : LABEL com.amazon.image.authors.name="Amazon AI"
 ---> Using cache
 ---> 2dc0d8bb8ea4
Step 4/13 : ENV PYTHONUNBUFFERED=TRUE
 ---> Using cache
 ---> 6e2de100516d
Step 5/13 : ENV PYTHONDONTWRITEBYTECODE=TRUE
 ---> Using cache
 ---> 8c7498c062e1
Step 6/13 : ENV PATH="/opt/program:${PATH}"
 ---> Using cache
 ---> ceb5844ea6d5
Step 7/13 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> 396b707be231
Step 8/13 : ENV TZ=Etc/UTC
 ---> Using cache
 ---> f9d427bb8835
Step 9/13 : RUN apt-key del 7fa2af80     && rm /etc/apt/sources.list.d/nvidia-ml.list /etc/apt/sources.list.d/cuda.list     && apt-get -y update && apt-get install -y --no-install-recommends         wget     && wget https://developer.download.nvidia.com/compute/cuda/repos/ubun

Or, to build the image without local testing, run:

```sh
./build.sh gptj-inference-endpoint
```

To test the locally running container, run the following cells:

In [15]:
import json

import requests

URL = "http://127.0.0.1:8080/invocations"
HEADERS = {"Content-type": "application/json", "Accept": "*/*"}


def test_endpoint(text, parameters):
    """POST a generation request to the local container and return the raw response body."""
    data = {
        "inputs": {
            "text_inputs": text,
            "parameters": parameters,
        }
    }
    # requests serializes `data` to JSON itself when it is passed via `json=`
    response = requests.post(URL, json=data, headers=HEADERS)
    return response.text


In [16]:
text = """This is a creative writing exercise. Below, you'll be given a prompt. Your story should be based on the prompt.

Prompt: A scary story about a haunted mouse
Story: On a dark and stormy night, the mouse crept in the shadows. """

parameters = {
    "do_sample": True,
    "temperature": 0.9,
    "max_new_tokens":200,
    "min_tokens": 100,
    "repetition_penalty": 1.1,
    "top_p": 500,
    }

response = json.loads(test_endpoint(text, parameters))
print(response['response'][0]['generated_text'])

This is a creative writing exercise. Below, you'll be given a prompt. Your story should be based on the prompt.

Prompt: A scary story about a haunted mouse
Story: On a dark and stormy night, the mouse crept in the shadows. _____________________ In a room that was so cold, it hurt to breathe. Everything seemed dead and deserted - like an old building abandoned after the owner took off for some faraway land. The little creature shivered like a leaf blown by the wind. __________ But no one had lived there for years - not even the mice.




The squeaking of the floorboards grew louder as he made his way toward the stairs. He was not afraid of walking through the rooms because they had become his new home. His mother never saw him again. _________ Now it was time to find out if something new would live here with him. He put his hand on one of the steps, preparing himself to start climbing when suddenly from above there came a shrill cry. The mouse froze, expecting any moment to hear the cr
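
As the output above shows, `generated_text` echoes the prompt before the continuation. If only the newly generated text is needed, a small helper can strip the prompt. This is a sketch that assumes the response layout used in the cell above (`response['response'][0]['generated_text']`); the name `extract_completion` is hypothetical.

```python
def extract_completion(response_json, prompt):
    """Return the generated text with the leading prompt removed, if present."""
    generated = response_json["response"][0]["generated_text"]
    if generated.startswith(prompt):
        return generated[len(prompt):]
    # Fall back to the full text if the model did not echo the prompt verbatim.
    return generated
```

For example, `extract_completion(response, text)` applied to the parsed response above would return the story continuation only.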

## 3. Deployment

When you're satisfied with your container, you can rebuild it and push it to Amazon ECR using the `push_to_ecr.sh` script, which takes the name of your Docker image as its argument.

For example, to push the image we built above, named `gptj-inference-endpoint`:

In [17]:
! chmod +x push_to_ecr.sh
! ./push_to_ecr.sh gptj-inference-endpoint

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
The push refers to repository [855988369404.dkr.ecr.us-west-2.amazonaws.com/gptj-inference-endpoint]

latest: digest: sha256:9900a842095286bbf30a3ce9c6e9fc156ba9578ecf03f923b132757b2e552bf8 size: 3687


The script pushes your image to ECR. For later reference, note the address of the repository that the image is pushed to. It appears below the line `Login Succeeded` in the output of `push_to_ecr.sh`.
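
The repository address in the output follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>`, so it can also be assembled in code rather than copied by hand. The sketch below assumes that pattern and the `latest` tag seen in the push output; the helper name `ecr_image_uri` is hypothetical.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build an ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
```

The account ID can be looked up with `boto3.client("sts").get_caller_identity()["Account"]` and the Region with `boto3.Session().region_name`, which avoids hard-coding either value.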

## 4. Inference

Now, you can deploy your endpoint as follows:

### 4.1 Initialize configuration variables

If a rerun fails with an error saying that the endpoint already exists, change `sm_model_name` and `endpoint_name`.

In [20]:
import sagemaker
from sagemaker.model import Model

# Predictor is the SageMaker Python SDK v2 name for the former RealTimePredictor
from sagemaker.predictor import Predictor

role = sagemaker.get_execution_role()

# S3 URI of model.tar.gz
model_data = model_s3_uri

# URI of the gptj-inference-endpoint image in ECR
image = "855988369404.dkr.ecr.us-west-2.amazonaws.com/gptj-inference-endpoint"

# SageMaker model name
sm_model_name = "gptj-completion-gpu-test7"

# Endpoint name
endpoint_name = "gptj-completion-gpu-test7"

# Instance type for the endpoint
instance_type = "ml.g4dn.2xlarge"

# Number of instances behind the endpoint
initial_instance_count = 1


### 4.2 Initialize endpoint

In [None]:
sm_model = Model(
    model_data=model_data,
    image_uri=image,
    role=role,
    predictor_cls=Predictor,
    name=sm_model_name,
)

predictor = sm_model.deploy(
    instance_type=instance_type,
    initial_instance_count=initial_instance_count,
    endpoint_name=endpoint_name,
)
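
Deployment of a large model can take several minutes. When checking on the endpoint from another session, or after deploying without waiting, its status can be polled through the SageMaker `describe_endpoint` API. The following is a minimal sketch; the function name, polling interval, and timeout are illustrative assumptions rather than part of this notebook.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Here `sm_client` would be a SageMaker client such as `boto3.client("sagemaker")`, for example `wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)`.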

----

### 4.3 Query model

To query your endpoint, use the code below. You can pass any parameters accepted by the Hugging Face `"text-generation"` pipeline.

#### Initialize the SageMaker Runtime client

In [6]:
import boto3
import json

# Create a boto3 session to look up the default AWS Region
sess = boto3.Session()

# AWS Region of the endpoint
aws_region = sess.region_name

# Create a low-level client for Amazon SageMaker Runtime
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=aws_region)



In [7]:
%%time

text = """This is a creative writing exercise. Below, you'll be given a prompt. Your story should be based on the prompt.

Prompt: A scary story about a haunted mouse
Story: On a dark and stormy night, the mouse crept in the shadows. """

parameters = {
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens":200,
    "min_tokens": 100,
    "repetition_penalty": 1.1,
    "top_p": 500,
    }

data = {
    "inputs": {
        "text_inputs": text,
        "parameters": parameters
    }
}


body = json.dumps(data)


response = sagemaker_runtime.invoke_endpoint( 
        EndpointName=endpoint_name, 
        Body = body, 
        ContentType = 'application/json'
)

CPU times: user 14.6 ms, sys: 542 µs, total: 15.1 ms
Wall time: 10.8 s


In [8]:
%%time

body = json.dumps(data)


response = sagemaker_runtime.invoke_endpoint( 
        EndpointName=endpoint_name, 
        Body = body, 
        ContentType = 'application/json'
)

result = json.loads(response['Body'].read().decode("utf-8"))

CPU times: user 14.9 ms, sys: 35 µs, total: 14.9 ms
Wall time: 10.9 s


In [9]:
result

{'response': [{'generated_text': 'This is a creative writing exercise. Below, you\'ll be given a prompt. Your story should be based on the prompt.\n\nPrompt: A scary story about a haunted mouse\nStory: On a dark and stormy night, the mouse crept in the shadows.  \n---\n\n 1. The mouse began to pace up and down his little cage, but it did not seem as if he was doing anything. After several minutes of pacing, he sat down for a moment.  \n2. He looked around at the darkness with narrowed eyes. "I can\'t take this anymore." The mouse began to sniffle. "You\'re all alone out here," the mouse said, hiccupping. "I feel so scared."  \n3. Suddenly, the mouse heard footsteps coming closer and closer; he then saw a figure walking into the room. His heart raced with fear, and he ran to the corner of the room that contained his food dish.  \n4. The figure was wearing a hooded sweatshirt and black jeans. He looked like a teenager. The mouse could just make out the outline of the person\'s face throu
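
The request and response handling in the cells above can be wrapped in one function for repeated queries. This is a sketch under the same assumptions as those cells: the payload layout (`inputs`, `text_inputs`, `parameters`) and the response layout (`response[0].generated_text`) are taken from this notebook, while the function name `generate` is hypothetical.

```python
import json


def generate(runtime_client, endpoint_name, text, parameters):
    """Invoke the endpoint with a prompt and return the generated text."""
    body = json.dumps({"inputs": {"text_inputs": text, "parameters": parameters}})
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=body,
        ContentType="application/json",
    )
    # The response body is a stream that must be read and decoded before parsing.
    result = json.loads(response["Body"].read().decode("utf-8"))
    return result["response"][0]["generated_text"]
```

With the objects defined above, a call would look like `generate(sagemaker_runtime, endpoint_name, text, parameters)`.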