# Deploy the Llama 2 70B model with high performance on SageMaker using the SageMaker LMI container



In this notebook, we explore how to host a Llama 2 large language model with FP16 precision on SageMaker using DeepSpeed. We use DJLServing, which is bundled in the LMI container, as the model serving solution in this example. DJLServing is a high-performance, universal model serving solution powered by the Deep Java Library (DJL) that is programming-language agnostic. To learn more about DJL and DJLServing, you can refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-bloom-176b-and-opt-30b-on-amazon-sagemaker-with-large-model-inference-deep-learning-containers-and-deepspeed/).


Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
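
To make the idea of tensor parallelism concrete, here is a toy sketch in plain Python. It is not how DeepSpeed or the LMI container implement it; the helper names (`split_columns`, `tensor_parallel_matmul`) are hypothetical and the "devices" are just list slices. It only illustrates that the columns of a single layer's weight matrix can be split into shards, each shard can compute its slice of the output independently, and the slices can then be concatenated.

```python
# Toy illustration of tensor (intra-layer) parallelism using plain Python lists.
# Each "device" holds a column slice of one layer's weight matrix.

def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, degree):
    """Partition the columns of w into `degree` contiguous shards."""
    cols = len(w[0])
    assert cols % degree == 0, "columns must divide evenly across devices"
    step = cols // degree
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(degree)]

def tensor_parallel_matmul(x, w, degree):
    """Each shard computes its slice of the output; the slices are concatenated."""
    shards = split_columns(w, degree)
    # On real hardware these partial products would run concurrently, one per GPU.
    partial_outputs = [matmul(x, shard) for shard in shards]
    return [value for part in partial_outputs for value in part]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, degree=2)
print(full == sharded)
```

Because every shard works on the same layer at the same time, no device has to wait for an earlier stage to finish, which is the latency advantage over pipeline parallelism mentioned above.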

SageMaker provides a DeepSpeed container that lets users leverage SageMaker's managed serving capabilities and takes care of the undifferentiated heavy lifting.

In this notebook, we deploy the https://huggingface.co/TheBloke/Llama-2-70B-fp16 model across the GPUs of an ml.g5.48xlarge instance.
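
As a rough sanity check on why the model has to be partitioned, the following back-of-the-envelope estimate compares the size of the FP16 weights with the GPU memory on the instance. The figures are approximations (about 70 billion parameters, 8 NVIDIA A10G GPUs with 24 GB each on ml.g5.48xlarge), and the estimate ignores the KV cache, activations, and framework overhead, so treat it as a sketch rather than a sizing guide.

```python
# Back-of-the-envelope memory estimate for the FP16 weights only.
PARAMS = 70e9          # approximate parameter count of Llama 2 70B
BYTES_PER_PARAM = 2    # FP16 stores each parameter in 2 bytes
NUM_GPUS = 8           # ml.g5.48xlarge has 8 NVIDIA A10G GPUs
GPU_MEMORY_GB = 24     # each A10G has 24 GB of memory

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # total size of the weights
per_gpu_gb = weights_gb / NUM_GPUS            # share held by each GPU
total_gpu_gb = NUM_GPUS * GPU_MEMORY_GB       # aggregate GPU memory

print(f"FP16 weights: ~{weights_gb:.0f} GB")
print(f"Per GPU with 8-way tensor parallelism: ~{per_gpu_gb:.1f} GB of {GPU_MEMORY_GB} GB")
print(f"Aggregate GPU memory: {total_gpu_gb} GB")
```

The weights alone are far larger than a single 24 GB GPU, but an 8-way split leaves some headroom on each GPU for the KV cache and activations.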



In [None]:
!pip install sagemaker boto3 huggingface_hub --upgrade #--quiet

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

print(bucket)

In [None]:
model_bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_prefix = (
    "hf-large-model-djl/meta-llama/Llama-2-70B-fp16/code"  # folder within bucket where code artifact will go
)

s3_model_prefix = (
    "hf-large-model-djl/meta-llama/Llama-2-70B-fp16/model"  # folder within bucket where model artifact will go
)
region = sess.boto_region_name  # public accessor for the session's region
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

In [None]:
model_bucket

### [OPTIONAL] Download the model from Hugging Face and upload the model artifacts on Amazon S3

If you want to download your own copy of the model and upload it to an S3 location in your AWS account, follow the steps below. Otherwise, you can skip to the next step.

In [None]:
api_key = os.environ.get('HUGGING_FACE_HUB_TOKEN')

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path
import os

# - This will download the model into the current directory, wherever the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = "TheBloke/Llama-2-70B-fp16"
# Match every file that has an extension; narrow the patterns to download only specific checkpoint files
allow_patterns = ["*.*"]

# - Use snapshot_download to fetch the model, since the repository stores the weights using Git LFS
# model_download_path = snapshot_download(
#     repo_id=model_name,
#     cache_dir=local_model_path,
#     allow_patterns=allow_patterns,
#     use_auth_token=api_key
# )

### Define a variable to contain the s3url of the location that has the model
- Run this cell if you have the model weights in S3

In [None]:
# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
print(f"Pretrained model will be uploaded to ---- > {pretrained_model_location}")

In [None]:
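# Uncomment the next three lines only if you downloaded the model in the optional step above;
# otherwise `model_download_path` and `model_artifact` are undefined.
# model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
# print(f"Model uploaded to --- > {model_artifact}")
# print(f"We will set option.s3url={model_artifact}")
print(f"Pre-trained model location={pretrained_model_location}")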

## Create a SageMaker-compatible model artifact and upload it to S3

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or postprocessing of the model's predictions.

SageMaker needs the model artifacts to be in a tarball format. In this example, the tarball contains a single file: `serving.properties`.

The tarball is in the following format:

code
│   
└── serving.properties

`serving.properties` is the configuration file that can be used to configure the model server.
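
For readers who prefer to build the tarball from Python instead of the shell, here is a minimal sketch using the standard-library `tarfile` module. The helper name `build_model_tarball` and the temporary `code` directory are assumptions made for illustration; the notebook itself creates the archive with `tar` in a later cell.

```python
import tarfile
import tempfile
from pathlib import Path

def build_model_tarball(code_dir: Path, output_path: Path) -> list:
    """Pack code_dir (which holds serving.properties) into a gzip-compressed tarball."""
    with tarfile.open(output_path, "w:gz") as tar:
        # arcname keeps only the directory name, not the full local path, inside the archive
        tar.add(code_dir, arcname=code_dir.name)
    with tarfile.open(output_path, "r:gz") as tar:
        return tar.getnames()  # list the archive members to confirm the layout

with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code"
    code_dir.mkdir()
    # Placeholder content; the real file is written later in the notebook
    (code_dir / "serving.properties").write_text("engine=MPI\n")
    members = build_model_tarball(code_dir, Path(tmp) / "model.tar.gz")

print(members)
```

Listing the members after writing is a cheap way to confirm that `serving.properties` sits directly under the top-level directory, as in the layout shown above.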


#### Create serving.properties 
This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.

Here is a list of settings that we use in this configuration file -

- `engine`: The engine for DJL to use. In this case, we have set it to MPI.
- `option.model_id`: The model ID of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models), or the S3 path to the model artifacts.
- `option.tensor_parallel_degree`: Set to the number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, if we have an 8-GPU machine and we create 8 partitions, then we have 1 worker per model to serve the requests.

For more details on the configuration options and an exhaustive list, you can refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.
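
The relationship between `option.tensor_parallel_degree` and the number of workers can be sketched as follows. This is a simplified illustration, not DJL Serving's own logic: the helpers `parse_properties` and `workers_per_model` are hypothetical, and the rule that the GPU count must be divisible by the degree is an assumption of this sketch.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def workers_per_model(props: dict, num_gpus: int) -> int:
    """Return how many workers fit when each worker spans `tensor_parallel_degree` GPUs."""
    degree = int(props["option.tensor_parallel_degree"])
    if degree < 1 or num_gpus % degree != 0:
        # Assumption of this sketch: partitions must tile the available GPUs evenly
        raise ValueError(f"tensor_parallel_degree={degree} does not fit {num_gpus} GPUs")
    return num_gpus // degree

sample = """
engine=MPI
option.model_id = TheBloke/Llama-2-70B-fp16
option.tensor_parallel_degree=8
"""

props = parse_properties(sample)
workers = workers_per_model(props, num_gpus=8)
print(props["option.model_id"], workers)
```

With 8 GPUs and a degree of 8, the sketch yields a single worker, matching the example in the list above.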



In [None]:
!rm -rf code_llama2_70b_fp16
!mkdir -p code_llama2_70b_fp16

In [None]:
%%writefile code_llama2_70b_fp16/serving.properties
engine=MPI
option.tensor_parallel_degree=8
option.rolling_batch=auto
option.max_rolling_batch_size=16
option.model_loading_timeout=3600
option.model_id = {{model_artifact}}
option.dtype=fp16
option.paged_attention=true
option.trust_remote_code=true
option.max_rolling_batch_prefill_tokens=16080
option.enable_streaming=False

In [None]:
pretrained_model_location="TheBloke/Llama-2-70B-fp16" #the modelid on HF model hub. Can also be set to S3 location

In [None]:
# plug the model location (Hugging Face model ID or S3 path) into the `serving.properties` template
properties_path = Path("code_llama2_70b_fp16/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(
    template.render(s3url=pretrained_model_location, model_artifact=pretrained_model_location)
)
!pygmentize code_llama2_70b_fp16/serving.properties | cat -n

**Retrieve the image URI for the DJL container**

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

**Create the Tarball and then upload to S3 location**

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz code_llama2_70b_fp16

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

### To create the endpoint, the steps are:

1. Create the model using the container image and the model tarball uploaded earlier
2. Create the endpoint config using the following key parameters

    a) Instance type is ml.g5.48xlarge
    
    b) ContainerStartupHealthCheckTimeoutInSeconds is 3600, so that the health check does not fail while the model is still loading
3. Create the endpoint using the endpoint config created
    

#### Create the Model
Use the image URI for the DJL container and the s3 location to which the tarball was uploaded.

The container downloads the model into the `/tmp` space on the instance because SageMaker maps `/tmp` to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB. It leverages `s5cmd` (https://github.com/peak/s5cmd), which offers very fast download speeds and is therefore extremely useful when downloading large models.

For instances such as p4d, which come with a pre-built instance volume, we can continue to leverage `/tmp` on the container. The size of this mount is large enough to hold the model.


In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("Llama-2-70B-fp16-mpi")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.48xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        },
    ],
)
endpoint_config_response

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

### This step can take ~20 minutes or longer, so please be patient

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
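
The polling loop above runs for as long as the endpoint stays in the `Creating` state. One possible variation, sketched below, wraps the loop in a helper with an upper bound on the waiting time. The helper name `wait_for_endpoint` and its parameters are hypothetical, and the describe call is passed in as an argument so that the sketch can be exercised with a stand-in instead of a real AWS client.

```python
import time

def wait_for_endpoint(describe, endpoint_name, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint leaves the 'Creating' state or the timeout elapses."""
    waited = 0
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            return status  # for example "InService" or "Failed"
        if waited >= timeout_seconds:
            raise TimeoutError(f"{endpoint_name} still creating after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Stand-in for sm_client.describe_endpoint so the sketch runs without AWS access
_statuses = iter(["Creating", "Creating", "InService"])

def fake_describe(EndpointName):
    return {"EndpointStatus": next(_statuses)}

final_status = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `sm_client.describe_endpoint` would be passed as `describe` and the default `time.sleep` would be kept.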

#### While you wait for the endpoint to be created, you can read more about:
- [Deep Learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html)

#### Use Boto3 to invoke the endpoint

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the result.

You can pass a batch of prompts as input to the model. This is done by setting `inputs` to the list of prompts. The model then returns a result for each prompt. The text generation can be configured using appropriate parameters. These `parameters` need to be passed to the endpoint as a dictionary of `kwargs`.

The code sample below illustrates the invocation of the endpoint using prompts and also sets some parameters.
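
As a minimal sketch of the request body described above, the helper below builds the JSON payload for either a single prompt or a batch of prompts. The function name `build_payload` is hypothetical, and the second prompt is only a placeholder; the `inputs` and `parameters` keys are the ones used in the invocation cells of this notebook.

```python
import json

def build_payload(prompts, **parameters):
    """Serialize one prompt (str) or a batch of prompts (list) plus generation parameters."""
    inputs = prompts if isinstance(prompts, str) else list(prompts)
    return json.dumps({"inputs": inputs, "parameters": parameters})

body = build_payload(
    [
        "The diamondback terrapin was the first reptile to",
        "Robin Hood is",  # placeholder second prompt for the batch
    ],
    do_sample=True,
    max_new_tokens=64,
    temperature=0.7,
)

decoded = json.loads(body)
print(len(decoded["inputs"]), decoded["parameters"])
```

The resulting string can be passed as the `Body` argument of `invoke_endpoint` with `ContentType="application/json"`.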

Here's a list of default arguments that are used by the model for inference. You can pass specific values based on your use case:
```
default_args = dict(
            inputs_embeds=None,
            beam_width=1,
            max_seq_len=200,
            top_k=1,
            top_p=0.0,
            beam_search_diversity_rate=0.0,
            temperature=1.0,
            len_penalty=0.0,
            repetition_penalty=1.0,
            presence_penalty=None,
            min_length=0,
            random_seed=0,
            is_return_output_log_probs=False,
            is_return_cum_log_probs=False,
            is_return_cross_attentions=False,
            bad_words_list=None,
            stop_words_list=None
        )
```


In [None]:
%%time
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": "The diamondback terrapin was the first reptile to",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 1000,
                "temperature": 0.7,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

In [None]:
%%time
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": "The diamondback terrapin was the first reptile to",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 300,
                "temperature": 0.7,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")
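
The invocation cells return the raw response body as a string. The sketch below shows one way to pull the generated text out of it, assuming the container replies with a JSON object (or a list of objects for a batch) that has a `generated_text` field. That response shape is an assumption and should be checked against the actual output of your endpoint; the helper name `extract_generated_text` and the sample string are made up for illustration and are not real model output.

```python
import json

def extract_generated_text(raw: str) -> list:
    """Return the generated text for each result in the response body."""
    payload = json.loads(raw)
    # Normalize a single result and a batch of results to the same list shape
    items = payload if isinstance(payload, list) else [payload]
    return [
        item.get("generated_text", "") if isinstance(item, dict) else str(item)
        for item in items
    ]

# Illustrative response body only (assumed shape, not produced by the model)
sample_response = '{"generated_text": " example completion"}'
texts = extract_generated_text(sample_response)
print(texts)
```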

In [None]:
"""
Summarize the text. In 1766, Peter Kalm published a description of the turtle in his journal Travels into North America.\\nThe diamondback terrapin was designated the official state reptile of Maryland in 1994.\\nThe diamondback terrapin is the only North American turtle that lives exclusively in brackish water.\\nThe diamondback terrapin is the only turtle in the United States that lives exclusively in brackish water. This means that they are able to live in water that has more salinity than fresh water, but not as much as sea water. Diamondback terrapins live in tidal creeks, salt marshes, and shallow areas of bays.\\nThe diamondback terrapin is the only North American turtle that lives exclusively in brackish water. Brackish water is a mixture of fresh and salt water, which is found in estuaries and other coastal areas. The diamondback terrapin is well-adapted to this environment, with a salt-tolerant kidney that allows it to drink seawater.\\nThe diamondback terrapin is a medium-sized turtle, reaching a maximum length of about 9 inches (23 cm). It has a flattened, oval-shaped shell, which is usually dark brown or black in color. The upper shell (carapace) is covered with a distinctive pattern of diamond-shaped markings, hence the name \\u201cdiamondback.\\u201d The lower shell (plastron) is usually plain white or cream-colored.\\nDiamondback terrapins are found along the Atlantic coast from Massachusetts to Texas. They are also found in the Gulf of Mexico and the Caribbean Sea. In the United States, they are most common in the Chesapeake Bay region.\\nDiet: Diamondback terrapins are omnivorous, feeding on a variety of plants and animals. Their diet includes crabs, snails, clams, worms, and aquatic plants.\\nReproduction: Diamondback terrapins mate in the springtime. Females lay their eggs in nests dug into the sandy banks of tidal creeks. 
The eggs hatch in the fall, and the hatchlings make their way to the water\\u2019s edge where they live until they are large enough to fend for themselves.\\nThreats: Diamondback terrapins are threatened by a variety of human activities. Habitat destruction and degradation, pollution, and over-collection for food and the pet trade are all major threats to this species.\\nConservation: Diamondback terrapins are protected by a number of state and federal laws. In some states, they are also listed as a threatened or endangered species.\\nQ: What is the diamondback terrapin\\u2019s scientific name?\\nA: The diamondback terrapin\\u2019s scientific name is Malaclemys terrapin.\\nQ: Where does the diamondback terrapin live?\\nA: The diamondback terrapin is found along the Atlantic coast from Massachusetts to Texas. They are also found in the Gulf of Mexico and the Caribbean Sea. In the United States, they are most common in the Chesapeake Bay region.\\nQ: What does the diamondback terrapin eat?\\nA: Diamondback terrapins are omnivorous, feeding on a variety of plants and animals. Their diet includes crabs, snails, clams, worms, and aquatic plants.\\nQ: How does the diamondback terrapin reproduce?\\nA: Diamondback terrapins mate in the springtime. Females lay their eggs in nests dug into the sandy banks of tidal creeks. The eggs hatch in the fall, and the hatchlings make their way to the water\\u2019s edge where they live until they are large enough to fend for themselves.\\nQ: What threats does the diamondback terrapin face?\\nA: Diamondback terrapins are threatened by a variety of human activities. Habitat destruction and degradation, pollution, and over-collection for food and the pet trade are all major threats to this species.\\nQ: What is being done to protect the diamondback terrapin?\\nA: Diamondback terrapins are protected by a number of state and federal laws. In some states, they are also listed as a threatened or endangered species
""".replace("\\n", "").replace("\\", "")

In [None]:
#%%timeit
prompt_data="""
Summarize this. Robin Hood is England's best-loved outlaw. Robin Hood's good traits are easily seen throughout the story. The author did a good job of making his hero come across as a good person, who has often been misinterpreted because of things that he did as a young boy. Showing the change Robin Hood has made since he was a little boy easily allows the reader to better understand how great he really is, and how he is helping not only himself, but all of the poorer community.Robin Hood was faced with issues from very early on in his life. His mothers death was very difficult for him, but living with his fathers love for another women, after his mother had died, was just too much for him and he threatened his father that staying with that women would cost him his only sons love. Robin then left for many years, only to come back and discover that his father had been murdered and that the new leader of Nottingham was the Sheriff. Not only was this a great shock to Robin, but all the people of the land were suffering greatly from the Sheriffs corrupt rule. He was very money hungry and greedy, and the lower class community suffered greatly from his greediness. Robin Hood had many different traits that are quite obvious in the story and the movie. For one he is very set on taking from the wealth of Nottingham and giving back to the poorer community so they can live well. His main idea here is to get as much taken from the Sheriff of Nottingham and his sympathizers so they can easily attack and take the kingdom back. In the end his plan works and Robin kills the Sheriff and the Kingdom is once again his, as well as Maid Marion. His goals are reached because he is persistent in what he wants, and will stop at nothing to get back all the things that the people had lost, and all the things he had lost. Robin Hood seemed almost charismatic in some ways in the story and the movie, however it doesn't seem that he tried to be.
""".replace("'","").replace("\n","")
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompt_data,
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 1000,
                "temperature": 0.7,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

## Clean Up

In [None]:
# - Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
# - Even if the endpoint failed to create, we still want to delete the endpoint config and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
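
If the endpoint never reached a healthy state, one of the delete calls above may raise an error and stop the cell before the remaining resources are removed. The sketch below shows one way to attempt every deletion and collect any failures. The helper name `cleanup` and the `FakeClient` stand-in are hypothetical; in the notebook, `sm_client` would be passed in instead.

```python
def cleanup(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Attempt every deletion and return a dict of the steps that failed."""
    steps = [
        ("endpoint", lambda: sm_client.delete_endpoint(EndpointName=endpoint_name)),
        ("endpoint_config", lambda: sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)),
        ("model", lambda: sm_client.delete_model(ModelName=model_name)),
    ]
    failures = {}
    for label, action in steps:
        try:
            action()
        except Exception as err:  # keep going so later resources are still removed
            failures[label] = str(err)
    return failures

class FakeClient:
    """Stand-in for the SageMaker client; the endpoint deletion fails on purpose."""
    def __init__(self):
        self.deleted = []
    def delete_endpoint(self, EndpointName):
        raise RuntimeError("endpoint not found")
    def delete_endpoint_config(self, EndpointConfigName):
        self.deleted.append(EndpointConfigName)
    def delete_model(self, ModelName):
        self.deleted.append(ModelName)

fake = FakeClient()
failures = cleanup(fake, "demo-endpoint", "demo-config", "demo-model")
print(failures, fake.deleted)
```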