#  Llama2-7b
In this notebook we will create and deploy a Llama2-7b model using inference components on the endpoint you created in the first notebook. For this model we will use the SageMaker Large Model Inference (LMI) container, with one GPU for each model copy of the inference component we create. After creating the inference component, we show you how to set auto scaling policies to manage the number of copies of your inference component. We also use managed instance scaling, which scales the number of instances in your endpoint according to the needs of your inference components. This is the fourth notebook in a series of five notebooks used to deploy a model against the endpoint you created in the first notebook. The last notebook will show you other APIs available and clean up the artifacts created.

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

---

Tested using the `Python 3 (Data Science)` kernel on SageMaker Studio and `conda_python3` kernel on SageMaker Notebook Instance.

# Licence agreement
 - View license information https://huggingface.co/meta-llama before using the model.
 - This notebook is a sample notebook and not intended for production use. Please refer to the licence at https://github.com/aws/mit-0. 

### Install dependencies

Upgrade the SageMaker Python SDK.

In [None]:
!pip install sagemaker boto3 huggingface_hub --upgrade

### Import libraries

In [None]:
import boto3
import botocore
import sagemaker
import sys
import time
import jinja2
from sagemaker import image_uris
from sagemaker.session import Session
import os
import json
from pathlib import Path
from datetime import datetime

### Set configurations

`REPLACE` the `endpoint_name` value with the created endpoint from the first notebook

In [None]:
%store -r endpoint_name

if "endpoint_name" not in locals():
    print("Please specify the endpoint_name before proceeding.")
else:
    print(f"Endpoint name: {endpoint_name}")

We start by creating the objects we will need for our notebook: the boto3 clients used to interact with SageMaker, and other variables that are referenced later in the notebook.

In [None]:
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # public accessor for the session's region

In [None]:
role = sagemaker.get_execution_role()
print(f"Role: {role}")

s3_client = boto3.client("s3")

cloudwatch_client = sess.boto_session.client("cloudwatch")
aas_client = sess.boto_session.client("application-autoscaling")

In [None]:
model_bucket = sess.default_bucket()  # bucket to house model artifacts

s3_model_prefix = "hf-large-model-djl/meta-llama/Llama-2-7b-fp16/model"  # folder within bucket where model artifact will go
s3_code_prefix = "hf-large-model-djl/meta-llama/Llama-2-7b-fp16/code"
region = sess.boto_region_name
account_id = sess.account_id()

## Create a SageMaker-compatible model artifact and upload it to S3

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or postprocessing of the model's predictions.

SageMaker needs the model artifacts to be packaged as a tarball. In this example, the tarball contains a single file: `serving.properties`.

#### Create serving.properties 
This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.

Here is a list of settings that we use in this configuration file:

 - `engine`: The engine for DJL to use. In this case, we have set it to MPI.
 - `option.model_id`: The model ID of a pretrained model hosted in a model repository on huggingface.co (https://huggingface.co/models), or the S3 path to the model artifacts.
 - `option.tensor_parallel_degree`: The number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, on a 4-GPU machine with 4 partitions, there is 1 worker per model to serve requests. This notebook sets it to 1, so each model copy uses a single GPU.

For more details on the configuration options and an exhaustive list, refer to the documentation at https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.
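To make the relationship between GPUs, partitions, and workers concrete, here is a minimal sketch. The helper names `parse_serving_properties` and `workers_per_model` are hypothetical and not part of DJL Serving, and the integer-division rule is an assumption generalized from the 4-GPU example above.

```python
def parse_serving_properties(text):
    """Parse `key = value` lines into a dict, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def workers_per_model(gpu_count, tensor_parallel_degree):
    """Assumed rule: each worker spans `tensor_parallel_degree` GPUs."""
    return gpu_count // tensor_parallel_degree


# A shortened copy of the configuration used in this notebook.
sample = """
engine = MPI
option.tensor_parallel_degree = 1
option.enable_streaming=False
"""

props = parse_serving_properties(sample)
degree = int(props["option.tensor_parallel_degree"])

print(props["engine"])
print(workers_per_model(4, 4))       # the 4-GPU, 4-partition example from the text
print(workers_per_model(1, degree))  # one GPU per model copy, as in this notebook
```

With inference components, each copy is placed separately, so the second call reflects the one-GPU-per-copy setup used here.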



In [None]:
!rm -rf llama2_7b_fp16
!mkdir -p llama2_7b_fp16

In [None]:
%%writefile llama2_7b_fp16/serving.properties
engine = MPI
option.tensor_parallel_degree = 1
option.rolling_batch = auto
option.max_rolling_batch_size = 8
option.model_loading_timeout = 3600
option.model_id = s3://sagemaker-example-files-prod-us-west-2/models/llama-2/fp16/7B/
option.paged_attention = true
option.trust_remote_code = true
option.dtype = fp16
option.enable_streaming=False

**Retrieve the image URI for the DJL container**

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

**Create the Tarball and then upload to S3 location**

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz llama2_7b_fp16
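If you prefer to stay in Python rather than shell out, the same packaging step can be sketched with the standard `tarfile` module. The function name `build_model_tarball` is hypothetical, and the demonstration runs in a temporary directory with placeholder file contents so that it does not touch the notebook's working files.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_tarball(source_dir, output_path):
    """Pack source_dir into a gzip-compressed tarball and return the member names."""
    source = Path(source_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        # arcname keeps the top-level folder, like `tar czvf model.tar.gz llama2_7b_fp16`
        tar.add(source, arcname=source.name)
    with tarfile.open(output_path, "r:gz") as tar:
        return tar.getnames()


# Self-contained demonstration with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "llama2_7b_fp16"
    model_dir.mkdir()
    (model_dir / "serving.properties").write_text("engine = MPI\n")
    members = build_model_tarball(str(model_dir), str(Path(tmp) / "model.tar.gz"))
    print(members)
```

Listing the members after writing is a cheap way to confirm that `serving.properties` ended up where the container expects it before uploading.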

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", model_bucket, s3_code_prefix)

In [None]:
prefix = sagemaker.utils.unique_name_from_base("DEMO")

model_name = f"{prefix}-model"
print(f"Test model name: {model_name}")

In [None]:
# check to see the status of our SageMaker endpoint
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

## Create Inference Component

In [None]:
inference_component_name = f"{prefix}-inference-component-0"
print(f"Test inference component name: {inference_component_name}")


initial_copy_count = 1
# inference component names if we deploy multiple of them
max_copy_count_per_instance = 4  # up to 4 llama2 7b fp16 model
inference_component_names = [
    f"{prefix}-inference-component-{i}" for i in range(max_copy_count_per_instance)
]

### Create the model and the inference component

In [None]:
sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)

In [None]:
variant_name = "AllTraffic"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600
min_memory_required_in_mb = 1024  # max memory util is up to 85%
number_of_accelerator_devices_required = 1

sm_client.create_inference_component(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
        },
        "ComputeResourceRequirements": {
            # "NumberOfCpuCoresRequired": number_of_cpu_cores_required,
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": initial_copy_count,
    },
)

In [None]:
while True:
    desc = sm_client.describe_inference_component(InferenceComponentName=inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)
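The polling loop above runs until the component reaches a terminal state. A variant with an upper bound on the waiting time could look like the sketch below. `wait_for_status` is a hypothetical helper, not a SageMaker API, and it takes a callable so that it can be exercised without AWS access.

```python
import time


def wait_for_status(
    get_status,
    terminal_states=("InService", "Failed"),
    poll_seconds=30,
    timeout_seconds=3600,
    sleep=time.sleep,
):
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake status source instead of a real API call.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

In this notebook, the callable would wrap `sm_client.describe_inference_component(InferenceComponentName=inference_component_name)["InferenceComponentStatus"]`.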

In [None]:
ic3_name = inference_component_name
%store \
ic3_name

#### Use Boto3 to invoke the endpoint

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the result.

You pass a prompt to the model by setting `inputs` to the prompt text. The model then returns a result for each prompt. Text generation can be configured using the appropriate parameters, which are passed to the endpoint as a dictionary of kwargs. Refer to the documentation at https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig for more details.

The code sample below invokes the endpoint with a text prompt and sets some generation parameters.

Note that we also pass `InferenceComponentName` to determine which inference component the request is directed to.

In [None]:
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name,
    Body=json.dumps(
        {
            "inputs": "The diamondback terrapin was the first reptile to be",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 256,
                "min_new_tokens": 256,
                "temperature": 0.3,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")
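The call above returns the decoded response body as a string. The sketch below shows one way to pull the completion out of it, assuming the container replies with JSON that carries a `generated_text` field, either as a single object or as a one-element list. That response shape is an assumption; check the actual output of your container version. `extract_generated_text` and the sample bodies are hypothetical.

```python
import json


def extract_generated_text(raw_body):
    """Return the completion from a JSON response body (assumed shape, see above)."""
    payload = json.loads(raw_body)
    if isinstance(payload, list):
        # Some handlers wrap the result in a list with one entry per prompt.
        payload = payload[0]
    return payload["generated_text"]


# Hypothetical response bodies, for illustration only.
object_body = '{"generated_text": " protected by law."}'
list_body = '[{"generated_text": " protected by law."}]'

print(extract_generated_text(object_body))
print(extract_generated_text(list_body))
```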

#### Scalable Target
Application Auto Scaling (AAS) creates two alarms for each autoscaling target:
* one to trigger scale-out: 3 minutes (3 one-minute data points)
* another one to trigger scale-in: 15 minutes (15 one-minute data points)

The time to trigger is usually 1 to 2 minutes longer than these values because it takes time for the endpoint to publish metrics to CloudWatch, and it also takes time for AAS to react.
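As a quick worked example of the timings described above, the following sketch adds the 1 to 2 minute delay to each alarm's evaluation window. The function name `expected_trigger_window` is hypothetical, and the result is a rough estimate rather than a guarantee.

```python
def expected_trigger_window(datapoints, period_minutes=1, delay_minutes=(1, 2)):
    """Return (earliest, latest) minutes until an alarm is expected to trigger."""
    base = datapoints * period_minutes
    return base + delay_minutes[0], base + delay_minutes[1]


scale_out = expected_trigger_window(3)   # 3 one-minute data points
scale_in = expected_trigger_window(15)   # 15 one-minute data points

print(f"Scale-out after roughly {scale_out[0]} to {scale_out[1]} minutes")
print(f"Scale-in after roughly {scale_in[0]} to {scale_in[1]} minutes")
```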

# Application Auto Scaling
In the following cells we will go through how to use Application Auto Scaling to scale your inference component copies. In addition, please note that in our first notebook we set `ManagedInstanceScaling` to be enabled. By doing this SageMaker will automatically scale your endpoint based on the needs of your inference components.

We can first start by setting the number of desired initial and max copies for an inference component. We will also specify a folder for our test results for our scaling test. 

In [None]:
# Upper bound on the number of copies; it must fit within the accelerators remaining across two ml.g5.12xlarge instances
max_copy_count = 3
print(f"Initial copy count: {initial_copy_count}")
print(f"Max copy count: {max_copy_count}")

In [None]:
test_results_folder = "test-results"
print(f"Test results will be saved to folder {test_results_folder}")
test_start_time = datetime.now().strftime("%Y%m%d%H%M%S%f")

We can now set the values we will need to register a scalable target (in this case an inference component) with Application Auto Scaling. 

In [None]:
# Autoscaling parameters
resource_id = f"inference-component/{inference_component_name}"
service_namespace = "sagemaker"
scalable_dimension = "sagemaker:inference-component:DesiredCopyCount"

In [None]:
aas_client.register_scalable_target(
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    MinCapacity=initial_copy_count,
    MaxCapacity=max_copy_count,
)

In [None]:
aas_client.describe_scalable_targets(
    ServiceNamespace=service_namespace,
    ResourceIds=[resource_id],
    ScalableDimension=scalable_dimension,
)

#### Scalable Policy
Now that we have registered our scalable target, we can specify a scaling policy for it. NOTE: If the scale-out cooldown is shorter than the endpoint update time, it has no effect, because it is not possible to update a SageMaker endpoint that is already in “Updating” status.

In [None]:
aas_client.put_scaling_policy(
    PolicyName=endpoint_name,
    PolicyType="TargetTrackingScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy",
        },
        # Low TPS + load TPS; you need to adjust this value based on your use case
        "TargetValue": (4.0 / max_copy_count_per_instance) + 1,
        "ScaleInCooldown": 300,  # default
        "ScaleOutCooldown": 300,  # default
    },
)

In [None]:
aas_client.describe_scaling_policies(
    PolicyNames=[endpoint_name],
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
)

### Run The Test
We can now run a test to see the behavior of instance and managed auto scaling on SageMaker endpoints. 

In [None]:
# define some helper functions
from dataclasses import dataclass
import threading

initial_instance_count = 1
max_instance_count = 2
print(f"Initial instance count: {initial_instance_count}")
print(f"Max instance count: {max_instance_count}")


@dataclass
class AutoscalingStatus:
    status_name: str  # endpoint status or inference component status
    start_time: datetime  # when was the status changed
    current_instance_count: int
    desired_instance_count: int
    current_copy_count: int
    desired_copy_count: int


class WorkerThread(threading.Thread):
    def __init__(self, do_run, *args, **kwargs):
        super(WorkerThread, self).__init__(*args, **kwargs)
        self.__do_run = do_run
        self.__terminate_event = threading.Event()

    def terminate(self):
        self.__terminate_event.set()

    def is_terminated(self):
        return self.__terminate_event.is_set()

    def run(self):
        while not self.__terminate_event.is_set():
            self.__do_run(self.__terminate_event)


invoke_endpoint_sanity_check_sample = {
    "inputs": "The diamondback terrapin was the first reptile to be",
    "parameters": {
        "do_sample": True,
        "max_new_tokens": 100,
        "min_new_tokens": 100,
        "temperature": 0.3,
        "watermark": True,
    },
}
invoke_endpoint_sanity_check_payload = json.dumps(invoke_endpoint_sanity_check_sample)


def invoke_endpoint_sanity_check(
    sagemaker_runtime_client, endpoint_name, container_names=None, inference_component_name=None
):
    try:
        parameters = {
            "EndpointName": endpoint_name,
            "ContentType": "application/json",
            "Body": invoke_endpoint_sanity_check_payload,
        }
        if container_names is not None:
            for container_name in container_names:
                parameters["TargetContainerHostname"] = container_name
                response = sagemaker_runtime_client.invoke_endpoint(**parameters)
        else:
            if inference_component_name is not None:
                parameters["InferenceComponentName"] = inference_component_name
            response = sagemaker_runtime_client.invoke_endpoint(**parameters)
    except Exception as e:
        print(f"Failed to invoke {endpoint_name}: " + str(e))

In [None]:
def invoke_endpoint(terminate_event):
    start_time = datetime.utcnow()
    for _ in range(max_copy_count_per_instance * max_instance_count * 2):
        invoke_endpoint_sanity_check(
            smr_client, endpoint_name, inference_component_name=inference_component_name
        )
        time.sleep(0.1)
    elapsed_seconds = (datetime.utcnow() - start_time).total_seconds()
    if terminate_event.is_set():
        return
    if elapsed_seconds < 60:
        time.sleep(60 - elapsed_seconds)
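To see why this load generator should push the inference component past its scaling target, the sketch below reuses the notebook's values and compares the per-copy request rate with the policy's target value. It assumes that each batch of requests finishes within the 60-second window and that the tracked metric is evaluated per copy per minute; if requests take longer, the real rate is lower.

```python
# Values copied from earlier cells in this notebook.
max_copy_count_per_instance = 4
max_instance_count = 2
max_copy_count = 3

# Same expression as the TargetValue in the scaling policy.
target_value = (4.0 / max_copy_count_per_instance) + 1

# The loop in invoke_endpoint sends this many requests per pass (about one pass per minute).
requests_per_minute = max_copy_count_per_instance * max_instance_count * 2

load_table = []
for copies in range(1, max_copy_count + 1):
    per_copy = requests_per_minute / copies
    load_table.append((copies, round(per_copy, 2), per_copy > target_value))

print(f"Target value: {target_value}")
for copies, per_copy, above in load_table:
    print(f"{copies} copies: {per_copy} invocations per copy, above target: {above}")
```

Under these assumptions the load stays above the target even at `max_copy_count`, so the policy keeps requesting more copies until the registered maximum capacity caps it.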

In [None]:
# Keep invoking the endpoint with test data
invoke_endpoint_thread = WorkerThread(do_run=invoke_endpoint)
invoke_endpoint_thread.start()

statuses = []
while True:
    endpoint_desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = endpoint_desc['EndpointStatus']
    current_instance_count = endpoint_desc['ProductionVariants'][0]['CurrentInstanceCount']
    desired_instance_count = endpoint_desc['ProductionVariants'][0]['DesiredInstanceCount']
    ic_desc = sm_client.describe_inference_component(InferenceComponentName=inference_component_name)
    ic_status = ic_desc['InferenceComponentStatus']
    current_copy_count = ic_desc['RuntimeConfig']['CurrentCopyCount']
    desired_copy_count = ic_desc['RuntimeConfig']['DesiredCopyCount']
    status_name = f"{status}_{ic_status}"
    if not statuses or statuses[-1].status_name != status_name:
        statuses.append(AutoscalingStatus(
            status_name=status_name,
            start_time=datetime.utcnow(),
            current_instance_count=current_instance_count,
            desired_instance_count=desired_instance_count,
            current_copy_count=current_copy_count,
            desired_copy_count=desired_copy_count,
        ))
        print(statuses[-1])
    if status_name == "InService_InService":
        if current_copy_count == max_copy_count:  # stop the load once the registered maximum is reached
            invoke_endpoint_thread.terminate()
        elif current_copy_count == initial_copy_count:
            if invoke_endpoint_thread.is_terminated():
                break
    time.sleep(1)
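The `statuses` list collected above can be summarized and saved once the loop exits. The sketch below computes how long each status lasted and writes the result as JSON. `StatusRecord` is a hypothetical, reduced stand-in for `AutoscalingStatus`, the timestamps and status names are made up for illustration, and the output goes to a temporary directory rather than the notebook's `test_results_folder`.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta
from pathlib import Path


@dataclass
class StatusRecord:  # hypothetical stand-in mirroring AutoscalingStatus
    status_name: str
    start_time: datetime
    current_copy_count: int
    desired_copy_count: int


def summarize(records, end_time):
    """Return one JSON-friendly dict per record, with the seconds spent in that status."""
    rows = []
    for i, rec in enumerate(records):
        next_start = records[i + 1].start_time if i + 1 < len(records) else end_time
        row = asdict(rec)
        row["start_time"] = rec.start_time.isoformat()
        row["duration_seconds"] = (next_start - rec.start_time).total_seconds()
        rows.append(row)
    return rows


# Made-up timeline, for illustration only.
t0 = datetime(2024, 1, 1, 12, 0, 0)
demo = [
    StatusRecord("InService_InService", t0, 1, 1),
    StatusRecord("Updating_Updating", t0 + timedelta(minutes=4), 1, 3),
    StatusRecord("InService_InService", t0 + timedelta(minutes=9), 3, 3),
]
rows = summarize(demo, end_time=t0 + timedelta(minutes=10))

out_dir = Path(tempfile.mkdtemp()) / "test-results"
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / "autoscaling-statuses.json"
out_file.write_text(json.dumps(rows, indent=2))

print([row["duration_seconds"] for row in rows])
```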

In [None]:
invoke_endpoint_thread.terminate()

### Cleanup
We can now delete our scaling policy and deregister our scalable target with Application Auto Scaling.

In [None]:
aas_client.delete_scaling_policy(
    PolicyName=endpoint_name,
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
)

In [None]:
aas_client.deregister_scalable_target(
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
)

That's it! You can now proceed to the last notebook, where we will show you some miscellaneous functions and clean up our resources.

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2c_meta-llama2-7b-lmi-autoscaling.ipynb)