## Lab 0: Warm Up: Deploy Llama 2 Models on ml.inf2.24xlarge for Inference

In this lab, we'll walk you through the process of deploying an open-source Llama 2 model to a SageMaker endpoint for inference. We'll use one `ml.inf2.24xlarge` instance for this and subsequent labs. In practice, you can deploy a SageMaker model behind a single load-balanced endpoint with auto-scaling policies defined, allowing your LLM SaaS endpoint to scale with input demand.

<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel:</strong> Data Science 3.0 <strong>Instance Type:</strong> ml.t3.medium
</div>
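
The auto-scaling setup mentioned in the introduction can be sketched with the Application Auto Scaling API. This is a minimal sketch and not part of the lab itself: the helper names `build_scaling_config` and `apply_scaling_config` are hypothetical, and the variant name `AllTraffic` (the SageMaker Python SDK default), the capacity bounds, and the target of 10 invocations per instance are illustrative assumptions rather than recommended values.

```python
def build_scaling_config(endpoint_name, variant_name="AllTraffic",
                         min_capacity=1, max_capacity=2,
                         invocations_per_instance=10.0):
    """Build the request payloads for Application Auto Scaling (no AWS calls)."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    # Scalable target: which resource may scale, and within which bounds.
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    # Target-tracking policy: keep invocations per instance near the target value.
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


def apply_scaling_config(target, policy, region="us-west-2"):
    """Send both payloads to AWS. Requires credentials and an existing endpoint."""
    import boto3  # imported lazily so the builder above runs without AWS access

    client = boto3.client("application-autoscaling", region_name=region)
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

Keeping the payload construction separate from the AWS calls makes it possible to inspect or review the configuration before anything is applied to a live endpoint.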

### Setup

Let's install the packages required for this and subsequent labs.

In [4]:
!python3 -m pip install sagemaker==2.196.0


In [24]:
import boto3
import sagemaker

In [35]:
REGION = "us-west-2"

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=REGION))
sm_client = boto3.client("sagemaker", region_name=REGION)
role = sagemaker.get_execution_role()

print(f"SageMaker python SDK version ---> {sagemaker.__version__} | Region ---> {sagemaker_session.boto_session.region_name} | Role ---> {role}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
SageMaker python SDK version ---> 2.196.0 | Region ---> us-west-2 | Role ---> arn:aws:iam::280318901237:role/SageMakerEMRAdmin-EMR-SageMakerExecutionRole


### Define Global Variables

In [47]:
_MODEL_NAME = "llama2"
_MODEL_SIZE = "13b"

MODEL_NAME = f"meta-{_MODEL_NAME}-{_MODEL_SIZE}-chat-tg-model"
ENDPOINT_NAME = f"meta-{_MODEL_NAME}-{_MODEL_SIZE}-chat-tg-ep"

## Let's Deploy!

![Llama 2 Model](https://venturebeat.com/wp-content/uploads/2023/07/cfr0z3n_vector_art_cybernetic_llama_wearing_sunglasses_synthwav_d3f82260-2c47-4abd-9599-b91751711f5b.png?fit=750%2C420&strip=all)

Image Credits: https://venturebeat.com/

### Image to Host

In [38]:
from sagemaker import image_uris

In [39]:
image_uri = image_uris.retrieve(
    framework="djl-neuronx",
    region=sagemaker_session.boto_session.region_name,
    version="0.24.0"
)
print(f"Hosting Image URI ---> {image_uri}")

Hosting Image URI ---> 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.24.0-neuronx-sdk2.14.1


### Create a SageMaker Model

In [40]:
from sagemaker.model import Model

Model weights for this workshop are hosted in an S3 location. Alternatively, you can use weights from other sources such as the Hugging Face Hub, SageMaker JumpStart, or a custom storage location.

Please ensure that you have opened and read the Llama 2 license and that you acknowledge its terms and use policy: [LLAMA2 LICENSE](https://aim401-us-east-1-420917551634.s3.us-west-2.amazonaws.com/aim401-models/llama-2/13B/inf2/LICENSE)

In [41]:
MODEL_S3_URI = "s3://aim401-us-east-1-420917551634/aim401-models/llama-2/13B/inf2/"

In [48]:
llama2_13_model = Model(
    image_uri=image_uri,
    model_data={
        'S3DataSource': {
            'CompressionType': 'None',
            'S3DataType': 'S3Prefix',
            'S3Uri': MODEL_S3_URI
        }
    },
    role=role,
    sagemaker_session=sagemaker_session,
    name=MODEL_NAME,
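    env={
        # Shard the model across 12 NeuronCores
        "OPTION_TENSOR_PARALLEL_DEGREE": "12",
        # Maximum sequence length (prompt plus generated tokens)
        "OPTION_N_POSITIONS": "2048",
        # Load the weights in half precision
        "OPTION_DTYPE": "fp16"
    }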
)
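
The tensor parallel degree of 12 lines up with the hardware used in this lab: an `ml.inf2.24xlarge` has 6 Inferentia2 accelerators with 2 NeuronCores each. The rough check below is a back-of-the-envelope sketch, assuming 13 billion parameters at 2 bytes each in fp16 and 32 GB of accelerator memory per Inferentia2 device. It counts weights only and ignores the KV cache and runtime overhead, so real memory use will be higher.

```python
# Hardware figures for ml.inf2.24xlarge (assumptions taken from published specs)
ACCELERATORS = 6
CORES_PER_ACCELERATOR = 2
HBM_GB_PER_ACCELERATOR = 32

# Model figures (assumption: 13e9 parameters stored as fp16)
PARAMS = 13e9
BYTES_PER_PARAM = 2

neuron_cores = ACCELERATORS * CORES_PER_ACCELERATOR
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
weights_gb_per_core = weights_gb / neuron_cores
# Each accelerator's memory is shared by its two cores; this is the even split.
hbm_gb_per_core = HBM_GB_PER_ACCELERATOR / CORES_PER_ACCELERATOR

print(f"NeuronCores available: {neuron_cores}")
print(f"fp16 weights in total: {weights_gb:.1f} GB")
print(f"fp16 weights per core: {weights_gb_per_core:.2f} GB")
print(f"Memory share per core: {hbm_gb_per_core:.1f} GB")
```

Under these assumptions, each core holds only a small slice of the weights, which leaves headroom for the 2048-position context configured above.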

### Deploy!

<img src="https://cdn.jim-nielsen.com/ios/1024/lets-go-rocket-2018-10-15.png" width="512" height="512" />

We're going to deploy our Llama 2 model on AWS Inferentia2 (`Inf2`), Amazon's purpose-built silicon for deep learning (DL) inference. According to AWS, Inf2 instances deliver high performance at the lowest cost in Amazon EC2 for generative artificial intelligence (AI) models, including large language models (LLMs) and vision transformers.

In [50]:
INSTANCE_TYPE = "ml.inf2.24xlarge"

In [54]:
%%time
print("===== SageMaker Deployment =====")
print("\nPreparing to deploy the model...")
predictor = llama2_13_model.deploy(
    initial_instance_count=1,
    instance_type=INSTANCE_TYPE,
    endpoint_name=ENDPOINT_NAME,
    volume_size=128,
    container_startup_health_check_timeout=1200,
)
print("\n===== Deployment Complete =====")

Your model is not compiled. Please compile your model before using Inferentia.


===== SageMaker Deployment =====

Preparing to deploy the model...
-----------------!
===== Deployment Complete =====
CPU times: user 132 ms, sys: 8.74 ms, total: 141 ms
Wall time: 9min 5s
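
Once `deploy()` returns, the endpoint state can be confirmed through the `sm_client` created during setup. The following is a minimal sketch: the helper name `endpoint_status` is hypothetical, and the call requires AWS credentials with permission to describe the endpoint.

```python
def endpoint_status(client, endpoint_name):
    """Return the EndpointStatus string reported by SageMaker, e.g. 'InService'."""
    response = client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"]


# In the notebook, after deployment:
# print(endpoint_status(sm_client, ENDPOINT_NAME))
```

A status of `InService` means the endpoint is ready to accept inference requests in the following labs.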
