# Lab 6: Deploy a token streaming solution on Amazon SageMaker

**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel. We recommend using the Data Science 3.0 kernel.

In this notebook, you will deploy a small solution using the [AWS Cloud Development Kit (CDK)](https://docs.aws.amazon.com/cdk/v2/guide/home.html). The solution includes an Amazon SageMaker endpoint that serves the [`cerebras/Cerebras-GPT-2.7B`](https://huggingface.co/cerebras/Cerebras-GPT-2.7B) model on an `ml.g5.2xlarge` instance (a single-GPU instance type) using a [Large Model Inference container](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html) (HuggingFace Accelerate engine). It demonstrates a streaming experience using [AWS Lambda's Response Streaming](https://docs.aws.amazon.com/lambda/latest/dg/configuration-response-streaming.html), with the generated tokens returned to an HTTP client as they are generated.

**Notices:**
* Make sure that the `ml.g5.2xlarge` instance type is available in your AWS Region. If not, fall back on a similar instance type (e.g. `ml.g4dn.2xlarge`).
* Make sure that the value of your "`<instance_type>` for endpoint usage" Amazon SageMaker service quota (e.g. "ml.g5.2xlarge for endpoint usage") allows you to deploy one endpoint using the chosen instance type.
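
The quota check above can be sketched in Python. This is a minimal illustration, not part of the lab: `find_endpoint_quota` is a hypothetical helper, and the sample data is hand-written to mimic the `Quotas` entries returned by the Service Quotas API. The commented-out lines show how the list could be collected with `boto3`, assuming your role is allowed to call Service Quotas.

```python
from typing import Any, Dict, List, Optional


def find_endpoint_quota(quotas: List[Dict[str, Any]], instance_type: str) -> Optional[float]:
    """Return the value of the '<instance_type> for endpoint usage' quota, if present."""
    expected_name = f"{instance_type} for endpoint usage"
    for quota in quotas:
        if quota.get("QuotaName") == expected_name:
            return quota.get("Value")
    return None


# Hand-written sample (illustrative values only). In a real account, the list
# could be collected with something like:
#   client = boto3.client("service-quotas")
#   pages = client.get_paginator("list_service_quotas").paginate(ServiceCode="sagemaker")
#   quotas = [quota for page in pages for quota in page["Quotas"]]
sample_quotas = [
    {"QuotaName": "ml.g5.2xlarge for endpoint usage", "Value": 1.0},
    {"QuotaName": "ml.g4dn.2xlarge for endpoint usage", "Value": 0.0},
]

print(find_endpoint_quota(sample_quotas, "ml.g5.2xlarge"))
```

A returned value of at least 1 means one endpoint instance of that type can be deployed; `None` means the quota was not found in the provided list.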

### License information
* The `cerebras/Cerebras-GPT-2.7B` model is under the Apache 2.0 license.
* This notebook is a sample notebook. It is not intended for production use and is released under the [MIT-0 license](https://github.com/aws/mit-0).

### Permissions
This lab involves the deployment of an AWS CloudFormation stack using the AWS CDK. Make sure that you are able to bootstrap a CDK environment in your account and that you are allowed to create and delete all the resources of this lab's stack.

## 1. Environment setup

In [2]:
%pip install sagemaker boto3 huggingface_hub --upgrade  --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.27.111 requires botocore==1.29.111, but you have botocore 1.29.162 which is incompatible.
awscli 1.27.111 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0 which is incompatible.

[notice] A new release of pip is available: 23.0.1 -> 23.1.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


### 1.1 Imports & global variables assignment

In [3]:
import json
import os
from pathlib import Path
import shutil
from typing import Any, Dict, List

import boto3
import botocore
import huggingface_hub
import sagemaker
from sagemaker.predictor import Predictor

In [4]:
SM_DEFAULT_EXECUTION_ROLE_ARN = sagemaker.get_execution_role()
SM_SESSION = sagemaker.session.Session()
SM_ARTIFACT_BUCKET_NAME = SM_SESSION.default_bucket()

REGION_NAME = SM_SESSION.boto_region_name  # Public attribute, preferred over the private `_region_name`
S3_CLIENT = boto3.client("s3", region_name=REGION_NAME)

In [5]:
HOME_DIR = os.environ["HOME"]

# HuggingFace local model storage
HF_LOCAL_CACHE_DIR = Path(HOME_DIR) / ".cache" / "huggingface" / "hub"
HF_LOCAL_DOWNLOAD_DIR = Path.cwd() / "model_repo"
HF_LOCAL_DOWNLOAD_DIR.mkdir(exist_ok=True)

# Selected HuggingFace model
HF_HUB_MODEL_NAME = "cerebras/Cerebras-GPT-2.7B"

# HuggingFace remote model storage (Amazon S3)
HF_MODEL_KEY_PREFIX = f"hf-llm-djl-serving/{HF_HUB_MODEL_NAME}"

### 1.2 Storage utility functions

In [6]:
def list_s3_objects(bucket: str, key_prefix: str) -> List[Dict[str, Any]]:
    paginator = S3_CLIENT.get_paginator("list_objects")
    operation_parameters = {"Bucket": bucket, "Prefix": key_prefix}
    page_iterator = paginator.paginate(**operation_parameters)
    # Pages have no "Contents" key when no object matches the prefix
    return [obj for page in page_iterator for obj in page.get("Contents", [])]


def delete_s3_objects(bucket: str, keys: List[str]) -> None:
    if not keys:  # Nothing to delete: avoid sending a request with an empty object list
        return
    S3_CLIENT.delete_objects(Bucket=bucket, Delete={"Objects": [{"Key": key} for key in keys]})


def get_local_model_cache_dir(hf_model_name: str) -> Path:
    for dir_name in os.listdir(HF_LOCAL_CACHE_DIR):
        if dir_name.endswith(hf_model_name.replace("/", "--")):
            break
    else:
        raise ValueError(f"Could not find HF local cache directory for model {hf_model_name}")
    return HF_LOCAL_CACHE_DIR / dir_name
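
`delete_s3_objects` sends all the keys in a single request, which is fine for the handful of files in this lab. Since the S3 `DeleteObjects` operation accepts at most 1,000 keys per request, a larger prefix would need batching. The following is a minimal sketch of that idea; `batched` and the example keys are hypothetical and not part of the lab's code.

```python
from typing import Iterator, List


def batched(keys: List[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive slices of `keys` holding at most `batch_size` items."""
    for start in range(0, len(keys), batch_size):
        yield keys[start : start + batch_size]


# Hypothetical keys, for illustration only
example_keys = [f"hf-llm-djl-serving/example/file-{i}.bin" for i in range(2500)]
batch_sizes = [len(batch) for batch in batched(example_keys)]
print(batch_sizes)  # → [1000, 1000, 500]
```

Each batch could then be passed to `delete_s3_objects` in turn.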

## 2. Model upload to Amazon S3
Models served by an LMI container can be downloaded to the container in different ways:
* As with all SageMaker Inference containers, by having the container download the model from Amazon S3 as a single `model.tar.gz` file. In the case of LLMs, this approach is discouraged since download and decompression times can become unreasonably high.
* By having the container download the model directly from the HuggingFace Hub for you. This option may involve high download times too and requires access to the public Internet.
* By having the container download the uncompressed model from Amazon S3 with maximal throughput using the [`s5cmd`](https://github.com/peak/s5cmd) utility. This option is specific to LMI containers and is the recommended one. It requires, however, that the model has been previously uploaded to an S3 bucket.

In this section, you will:
1. Download the model from the HuggingFace Hub to your local host,
2. Upload the downloaded model to an S3 bucket. This notebook uses SageMaker's default regional bucket. Feel free to upload the model to the bucket of your choice by modifying the `SM_ARTIFACT_BUCKET_NAME` global variable accordingly.

Each operation takes a few minutes.

In [None]:
huggingface_hub.snapshot_download(
    repo_id=HF_HUB_MODEL_NAME,
    revision="main",
    local_dir=HF_LOCAL_DOWNLOAD_DIR,
    local_dir_use_symlinks="auto",  # Files larger than 5MB are actually symlinked to the local HF cache
    allow_patterns=[
        "*.json",
        "*.pt",
        "*.bin",
        "*.txt",
        "*.model",
        "*.py",
    ],
);

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading (…)e6cbf7c1/config.json:   0%|          | 0.00/361 [00:00<?, ?B/s]

Downloading (…)1e6cbf7c1/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)1e6cbf7c1/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/10.7G [00:00<?, ?B/s]

In [None]:
MODEL_ID = SM_SESSION.upload_data(
    path=HF_LOCAL_DOWNLOAD_DIR.as_posix(),
    bucket=SM_ARTIFACT_BUCKET_NAME,
    key_prefix=HF_MODEL_KEY_PREFIX,
)
print(f"Model artifacts have been successfully uploaded to: {MODEL_ID}")

**Save the returned location for later.**

The `huggingface_hub.snapshot_download` function downloaded the model repository to a cache located in your home directory. The downloaded files were duplicated in the target local download directory, except for large files (larger than 5 MB), which were simply symlinked. Uncompressed LLM artifacts still consume disk space, so the following two cells remove the downloaded files from your local host.

In [None]:
# Clean up - Remove HF model artifacts from the local download directory
shutil.rmtree(HF_LOCAL_DOWNLOAD_DIR)

In [None]:
# Clean up - Remove HF model artifacts from the local HF cache directory
hf_local_cache_dir = get_local_model_cache_dir(hf_model_name=HF_HUB_MODEL_NAME)
shutil.rmtree(hf_local_cache_dir)
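
The `get_local_model_cache_dir` utility matches directory names by suffix because the Hub client stores each model repository in a folder named after the repository ID, with `/` replaced by `--` and a `models--` prefix. A minimal sketch of that naming convention, where `hub_cache_dir_name` is a hypothetical helper written for illustration:

```python
def hub_cache_dir_name(repo_id: str) -> str:
    """Build the cache folder name used for a model repository on the local host."""
    return "models--" + repo_id.replace("/", "--")


print(hub_cache_dir_name("cerebras/Cerebras-GPT-2.7B"))  # → models--cerebras--Cerebras-GPT-2.7B
```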

## 3. Deploy the streaming solution
### 3.1. Custom inference handler code
The following custom Python code will be deployed to the Endpoint. First let's gather all the custom endpoint Python code in a `code` directory by executing the following cells:

In [None]:
Path("code").mkdir(exist_ok=True)

The `code` directory contains:
* A `requirements.txt` file that lists the Python dependencies required by the custom Python code. It can include Python dependencies specific to the chosen model.
* A DJLServing-specific `serving.properties` file, which allows you to inject configuration values into both the DJL model server and the custom Python handler.
* Python source files (`cache.py` and `handler.py`), with `handler.py` being used by the DJL server as the entry point.

In [None]:
%%writefile code/requirements.txt
boto3==1.26.161
transformers==4.27.2

For more information about the content of the `serving.properties` file, refer to the [Configuration and settings](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html) page from the SageMaker service documentation.

In [None]:
%%writefile code/serving.properties
engine = Python
option.entryPoint = handler.py
option.task = text-generation
option.low_cpu_mem_usage = true
option.post_every_x_tokens = 1
option.torch_dtype = fp16
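
The handler defined later in this section receives these settings as a dictionary of strings and looks them up without the `option.` prefix (e.g. `properties.get("torch_dtype")`). The sketch below mimics that behavior with a deliberately simplified parser. It is an illustration only: `parse_properties` is hypothetical and does not reproduce DJL Serving's actual loading logic.

```python
from typing import Dict

SERVING_PROPERTIES = """\
engine = Python
option.entryPoint = handler.py
option.low_cpu_mem_usage = true
option.post_every_x_tokens = 1
option.torch_dtype = fp16
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Parse `key = value` lines and drop the `option.` prefix (simplified)."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("option."):
            key = key[len("option."):]
        properties[key] = value.strip()
    return properties


properties = parse_properties(SERVING_PROPERTIES)
# All values are strings, so booleans and integers must be converted explicitly
low_cpu_mem_usage = properties.get("low_cpu_mem_usage", "true").lower() == "true"
post_every_x_tokens = int(properties.get("post_every_x_tokens", 1))
print(low_cpu_mem_usage, post_every_x_tokens, properties["torch_dtype"])
```

The explicit conversions mirror what the handler's `ConfigFactory` class does for boolean and integer settings.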

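The `cache.py` module provides a dedicated object that writes the sequence generated so far to an Amazon DynamoDB table every `post_every_x_tokens` generated tokens. With the value of 1 set in `serving.properties`, this happens every time a new token is generated.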

In [None]:
%%writefile code/cache.py
from __future__ import annotations

from dataclasses import dataclass, field
from types import TracebackType
from typing import Any, Dict, List, Type

import boto3


@dataclass
class CacheState:
    counter: int = 0
    buffer: List[str] = field(default_factory=list)
    session_id: str = ""
    has_eos_been_generated: bool = False
    is_generation_finished: bool = False


class DynamoDBSequenceCache:
    def __init__(self, config: Dict[str, Any], eos_token: str) -> None:
        self._client = boto3.client("dynamodb", region_name=config["region_name"])
        # Cache config
        self._table_name = config["table_name"]
        self._table_primary_key_name = config["table_primary_key_name"]
        self._post_every_x_tokens = config["post_every_x_tokens"]
        self._eos_token = eos_token
        # Cache state
        self._state = CacheState()

    def __call__(self, session_id: str) -> DynamoDBSequenceCache:
        self._state.session_id = session_id
        return self

    def __enter__(self) -> DynamoDBSequenceCache:
        return self

    def __exit__(self, exc_type: Type, exc_value: BaseException, exc_tb: TracebackType) -> None:
        # Exiting the context manager means that the token generation has stopped:
        # - Either because the EOS token has been generated (state.has_eos_been_generated set to True),
        # - Or because the nb of generated tokens exceeds max_new_tokens (state.has_eos_been_generated remains False)
        self._state.is_generation_finished = True
        self._flush_to_cache()
        self._state = CacheState()

    def put(self, token: str) -> None:
        if token != self._eos_token:
            self._state.buffer.append(token)
        else:
            self._state.has_eos_been_generated = True
        self._state.counter += 1
        if self._do_flush():
            self._flush_to_cache()

    def _do_flush(self) -> bool:
        return (self._state.counter % self._post_every_x_tokens) == 0

    def _flush_to_cache(self) -> None:
        self._client.update_item(
            TableName=self._table_name,
            Key={self._table_primary_key_name: {"S": self._state.session_id}},
            AttributeUpdates={
                "generated_sequence": {
                    "Value": {"S": "".join(self._state.buffer)},
                    "Action": "PUT",
                },
                "is_generation_finished": {
                    "Value": {"BOOL": self._state.is_generation_finished},
                    "Action": "PUT",
                },
                "has_eos_been_generated": {
                    "Value": {"BOOL": self._state.has_eos_been_generated},
                    "Action": "PUT",
                },
            },
        )
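
To illustrate the flush cadence implemented by `put` and `_do_flush`, and why a final flush is needed when the context manager exits, here is a minimal in-memory sketch. `InMemorySequenceCache` is a hypothetical stand-in that stores items in a dictionary instead of calling DynamoDB, and it omits the status flags for brevity.

```python
from typing import Dict, List


class InMemorySequenceCache:
    """Toy stand-in for the DynamoDB-backed cache: items are kept in a dict."""

    def __init__(self, post_every_x_tokens: int, eos_token: str) -> None:
        self.store: Dict[str, str] = {}
        self.flush_count = 0
        self._post_every_x_tokens = post_every_x_tokens
        self._eos_token = eos_token
        self._buffer: List[str] = []
        self._counter = 0

    def put(self, session_id: str, token: str) -> None:
        if token != self._eos_token:
            self._buffer.append(token)
        self._counter += 1
        if self._counter % self._post_every_x_tokens == 0:
            self.flush(session_id)

    def flush(self, session_id: str) -> None:
        # Overwrite the stored item with the full sequence generated so far
        self.store[session_id] = "".join(self._buffer)
        self.flush_count += 1


cache = InMemorySequenceCache(post_every_x_tokens=2, eos_token="<eos>")
for token in ["Hello", ",", " world"]:
    cache.put("session-1", token)
before_final_flush = cache.store["session-1"]
print(repr(before_final_flush))  # → 'Hello,'

cache.flush("session-1")  # final flush, as performed when the context manager exits
print(repr(cache.store["session-1"]))  # → 'Hello, world'
```

With `post_every_x_tokens` greater than 1, the last tokens would be lost without the final flush, which is why `__exit__` always flushes.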

The custom entrypoint `handler.py` uses the utilities provided by the DJL-Python toolkit (DJLServing version 0.22.1 and later) available in the LMI container (`djl_python.streaming_utils`) to create an iterator object from the loaded model artifacts and request payload. On each iteration (i.e. every time the iterator is passed to `next`), it performs a forward pass and returns the generated token.

**Notice**: Token iterators from DJL-Python support batching, i.e. an iterator can be created not only from a single prompt but from a list of input sequences. On each iteration, the iterator returns a list of tokens, one per input sequence. In the present case, batch size is 1.
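
The batching behavior described above can be pictured with a toy iterator. This sketch does not run a model and does not use DJL-Python: `toy_token_iterator` and its pre-computed "generations" are hypothetical, and only the shape of the yielded values (one list per step, one token per input sequence) reflects the description.

```python
from typing import Iterator, List


def toy_token_iterator(sequences: List[List[str]]) -> Iterator[List[str]]:
    """Yield one list per step, holding the next token of every sequence."""
    for step_tokens in zip(*sequences):
        yield list(step_tokens)


# Hypothetical pre-computed "generations" for a batch of two prompts
batch = [["The", " sky", " is"], ["A", " cat", " sat"]]
steps = list(toy_token_iterator(batch))
for token_batch in steps:
    first_token, *_ = token_batch  # keep only the first sequence's token
    print(token_batch, "->", repr(first_token))
```

With a batch size of 1, each yielded list holds a single token, which is what the handler unpacks with `token, *_ = token_batch`.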

In [None]:
%%writefile code/handler.py
import os
import uuid
from typing import Any, Callable, Dict, List, Optional, Tuple

from djl_python import Input, Output
from djl_python.streaming_utils import StreamingUtils
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.modeling_utils import PreTrainedModel
from transformers.tokenization_utils_base import PreTrainedTokenizerBase

from cache import DynamoDBSequenceCache


def get_torch_dtype_from_str(dtype: str) -> Optional[torch.dtype]:
    if dtype == "fp32":
        return torch.float32
    if dtype == "fp16":
        return torch.float16
    if dtype == "bf16":
        return torch.bfloat16
    if dtype == "int8":
        return torch.int8
    if dtype is None:
        return None
    raise ValueError(f"Data type cannot be parsed as valid Torch data type: {dtype}")


def parse_request(inputs: Input) -> Tuple[List[str], Dict[str, Any]]:
    body = inputs.get_as_json()
    prompts = body["inputs"]
    # Accept either a single prompt or a list of prompts; only the first one is used (batch size is 1)
    prompt = prompts[0] if isinstance(prompts, list) else prompts
    generation_config = body.get("parameters", {})
    return [prompt], generation_config


class ConfigFactory:
    def __init__(self, properties: Dict[str, str]) -> None:
        self._properties = properties

    def get_tokenizer_config(self) -> Dict[str, Any]:
        return {
            "trust_remote_code": (
                self._properties.get("trust_remote_code", "false").lower() == "true"
            ),
            "revision": self._properties.get("revision", "main"),
            "padding_side": "left",
        }

    def get_model_config(self) -> Dict[str, Any]:
        dtype = self._properties.get("torch_dtype")
        return {
            "low_cpu_mem_usage": (
                self._properties.get("low_cpu_mem_usage", "true").lower() == "true"
            ),
            "torch_dtype": get_torch_dtype_from_str(dtype=dtype),
            "trust_remote_code": (
                self._properties.get("trust_remote_code", "false").lower() == "true"
            ),
            "revision": self._properties.get("revision", "main"),
        }

    def get_accelerate_config(self) -> Dict[str, Any]:
        return {
            "load_in_8bit": (self._properties.get("load_in_8bit", "false").lower() == "true"),
            "device_map": self._properties.get("device_map", "auto"),
        }

    def get_cache_config(self) -> Dict[str, Any]:
        return {
            "table_name": os.environ["CACHE_TABLE_NAME"],
            "table_primary_key_name": os.environ["CACHE_TABLE_PRIMARY_KEY_NAME"],
            "region_name": os.environ["REGION_NAME"],
            "post_every_x_tokens": int(self._properties.get("post_every_x_tokens", 1)),
        }

    def get_default_generation_config(self) -> Dict[str, Any]:
        return {"max_new_tokens": int(self._properties.get("max_new_tokens", 256))}


class AccelerateInferenceService:
    engine = "Accelerate"

    def __init__(self) -> None:
        self.initialized = False
        self._model_location = None
        self._tokenizer = None
        self._model = None
        self._config = None
        self._generation_request_handler = None

    def _load_tokenizer(self) -> None:
        tokenizer_config = self._config.get_tokenizer_config()
        tokenizer = AutoTokenizer.from_pretrained(self._model_location, **tokenizer_config)
        if not tokenizer.pad_token:
            tokenizer.pad_token = tokenizer.eos_token
        # Since the token generator is able to handle multiple input sequences at once, the length of the input
        # sequences must be normalized. We instruct the tokenizer to add padding tokens to the left of input sequences
        # shorter than the longest input sequence. We therefore make sure that a padding token is set for the tokenizer.
        self._tokenizer = tokenizer

    def _load_model(self) -> None:
        model_config = self._config.get_model_config()
        accelerate_config = self._config.get_accelerate_config()
        model = AutoModelForCausalLM.from_pretrained(
            self._model_location, **model_config, **accelerate_config
        )
        self._model = model

    def initialize(self, properties: Dict[str, str]) -> None:
        print(f"properties: {properties}")
        self._config = ConfigFactory(properties=properties)
        # model_id can point to huggingface model_id or local directory.
        # If option.model_id points to a s3 bucket, we download it and set model_id to the download directory.
        # Otherwise we assume model artifacts are in the model_dir (/opt/ml/model, which is also the cwd)
        self._model_location = properties.get("model_id") or properties.get("model_dir")
        self._load_tokenizer()
        self._load_model()
        cache_config = self._config.get_cache_config()
        self._generation_request_handler = create_generation_request_handler(
            tokenizer=self._tokenizer,
            model=self._model,
            cache_config=cache_config,
            engine=self.engine,
        )
        self.initialized = True

    def handle_generation_request(self, inputs: Input) -> Output:
        try:
            input_seqs, request_generation_config = parse_request(inputs=inputs)
            generation_config = self._config.get_default_generation_config()
            generation_config.update(request_generation_config)
            session_id = str(uuid.uuid4())
            print(f"inputs: {input_seqs}")
            print(f"generation_config: {generation_config}")
            output = (
                Output(code=200)
                .add({"session_id": session_id})
                .finalize(
                    self._generation_request_handler, session_id, input_seqs, generation_config
                )
            )
        except Exception as e:
            output = Output(code=500, message=str(e))
        return output


def create_generation_request_handler(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizerBase,
    cache_config: Dict[str, Any],
    engine: str,
) -> Callable:
    """Creates a generation handler (closure) function"""
    cache = DynamoDBSequenceCache(config=cache_config, eos_token=tokenizer.eos_token)

    def generation_request_handler(
        session_id: str, input_seqs: List[str], generation_config: Dict[str, Any]
    ) -> None:
        stream_generator = StreamingUtils.get_stream_generator(engine)
        token_iterator = stream_generator(model, tokenizer, input_seqs, **generation_config)
        with cache(session_id=session_id):
            for token_batch in token_iterator:
                # The iterator supports multi-sequence generation (i.e. batch_size>1).
                # Here batch_size=1, token_batch is a one-element list.
                token, *_ = token_batch
                cache.put(token=token)

    return generation_request_handler


_service = AccelerateInferenceService()


def handle(inputs: Input) -> Optional[Output]:
    if not _service.initialized:
        _service.initialize(properties=inputs.get_properties())

    if inputs.is_empty():
        return None

    return _service.handle_generation_request(inputs=inputs)

### 3.2. Stack deployment
In this section, you will use the AWS CDK to deploy the solution described in the following diagram:

![Token-streaming-solution](img/stream-llm-tokens-arch-diagram-light.png)


**Notice:** Before performing any operation, the AWS CDK first retrieves credentials [like the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#configure-precedence). You can for example create a dedicated named profile for this lab and then add a `--profile <profile-name>` to all `cdk` commands.

Now, let's open a terminal window and:
1. Cd into the lab's directory
2. Extract the archive content: `tar xzvf token-streaming-project.tar.gz`
3. Cd into the newly created `token-streaming-project` directory: `cd token-streaming-project`
4. Create a Python virtualenv: `python3 -m venv .venv`
5. Activate the Python virtualenv: `source .venv/bin/activate`
6. Install the required Python dependencies: `pip install -r requirements.txt`
7. If not already done in the past for your account and Region, bootstrap your AWS CDK environment: `cdk bootstrap`
8. To deploy the stack, substitute the content of the `MODEL_ID` variable into the following command and run it (This operation takes a few minutes):

```bash
cdk deploy --context ModelArtifactsUri=<model-artifacts-s3-uri> \
--context EndpointInstanceType=ml.g5.2xlarge \
--context CodeSourceDirPath=../code \
--context EndpointStartupTimeoutInSec=360
```

9. Write down the stack's output values. Using these values, fill the following variables appropriately:

In [None]:
ACCOUNT = "<account-id>"
IDENTITY_POOL_ID = "<cognito-id-pool-id>"
FUNCTION_NAME = "<function-name>"
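
Instead of copying the values by hand, the stack outputs could also be read from the CloudFormation `describe_stacks` response. The sketch below only shows how to turn such a response into a dictionary. The stack name and output keys in the sample are hypothetical placeholders, since the actual names depend on the CDK application.

```python
from typing import Any, Dict


def extract_stack_outputs(describe_stacks_response: Dict[str, Any]) -> Dict[str, str]:
    """Map each output key of the first stack in the response to its value."""
    stack = describe_stacks_response["Stacks"][0]
    return {output["OutputKey"]: output["OutputValue"] for output in stack.get("Outputs", [])}


# Hand-written sample shaped like a `describe_stacks` response. A real response
# could be obtained with something like:
#   boto3.client("cloudformation").describe_stacks(StackName="<stack-name>")
sample_response = {
    "Stacks": [
        {
            "StackName": "example-stack",
            "Outputs": [
                {"OutputKey": "ExampleFunctionName", "OutputValue": "example-function"},
                {"OutputKey": "ExampleIdentityPoolId", "OutputValue": "example-pool-id"},
            ],
        }
    ]
}

outputs = extract_stack_outputs(sample_response)
print(outputs["ExampleFunctionName"])  # → example-function
```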

## 4. Test the streaming solution
### 4.1 Post your generation request programmatically using `boto3`
In this section you will invoke the streaming function using `boto3`'s `invoke_with_response_stream` function. You will first retrieve temporary credentials from the identity pool, then invoke the function and get a response stream to iterate on in return. 

**Warning:** The very first request sent to the stack may return an empty stream. Don't hesitate to retry.

Notice that the response body has been designed to have a structure similar to the responses returned by the HuggingFace streaming APIs:

```json
{
   "token":{
      "id":14,
      "text":"\n",
      "logprob":null,
      "special":false
   }
}
```

Only the `id` and `text` fields are actively used in this example. An additional `details` field is added on completion. Example: `"details":{"FinishReason":"length"}`
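
As a minimal sketch of how a client could consume such payloads, the snippet below rebuilds the generated text and captures the finish reason from a list of serialized payloads. `collect_stream` and the sample payloads (token IDs and texts included) are hypothetical and hand-written for illustration. The sketch tolerates a `details` field that is missing or null.

```python
import json
from typing import List, Optional, Tuple


def collect_stream(payloads: List[bytes]) -> Tuple[str, Optional[str]]:
    """Concatenate the token texts and return the last reported finish reason."""
    text_parts = []
    finish_reason = None
    for raw in payloads:
        body = json.loads(raw)
        text_parts.append(body["token"]["text"])
        details = body.get("details") or {}  # "details" may be missing or null
        finish_reason = details.get("FinishReason", finish_reason)
    return "".join(text_parts), finish_reason


sample_payloads = [
    b'{"token": {"id": 1, "text": "Amazon", "logprob": null, "special": false}}',
    b'{"token": {"id": 2, "text": " is", "logprob": null, "special": false}, "details": null}',
    b'{"token": {"id": 3, "text": " big", "logprob": null, "special": false}, '
    b'"details": {"FinishReason": "length"}}',
]

text, reason = collect_stream(sample_payloads)
print(text, "|", reason)  # → Amazon is big | length
```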

In [None]:
FINISH_MESSAGES = {
    "eos_token": "Sequence generation completed successfully: EOS token was generated",
    "length": "Sequence generation completed successfully: Maximum sequence length was reached",
}


def running_invocation_handler(chunk: Dict[str, Any]) -> None:
    deserialized_payload = json.loads(chunk["PayloadChunk"]["Payload"])
    print(deserialized_payload["token"]["text"], end="")
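    details = deserialized_payload.get("details") or {}  # "details" may be missing or None
    finish_reason = details.get("FinishReason")
    if finish_reason is not None:
        default_message = f"Sequence generation stopped: {finish_reason}"
        print("\n" + FINISH_MESSAGES.get(finish_reason, default_message))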


def completed_invocation_handler(chunk: Dict[str, Any]) -> None:
    try:
        serialized_error_details = chunk["InvokeComplete"]["ErrorDetails"]
        error_details = json.loads(serialized_error_details)
        error_message = error_details["errorMessage"]
        print(f"Request raised the following error: {error_message}")
    except KeyError:
        pass

In [None]:
COGNITO_ID_CLIENT = boto3.client("cognito-identity", region_name=REGION_NAME)

get_id_response = COGNITO_ID_CLIENT.get_id(
    AccountId=ACCOUNT,
    IdentityPoolId=IDENTITY_POOL_ID,
)

get_creds_response = COGNITO_ID_CLIENT.get_credentials_for_identity(
    IdentityId=get_id_response["IdentityId"]
)

credentials = {
    "aws_access_key_id": get_creds_response["Credentials"]["AccessKeyId"],
    "aws_secret_access_key": get_creds_response["Credentials"]["SecretKey"],
    "aws_session_token": get_creds_response["Credentials"]["SessionToken"],
}

LAMBDA_CLIENT = boto3.client("lambda", region_name=REGION_NAME, **credentials)

In [None]:
prompt = "What is Amazon? Be concise."

generation_parameters = {
    "max_new_tokens": 128,
    "do_sample": True,
}

request_body = {"inputs": prompt, "parameters": generation_parameters}

response = LAMBDA_CLIENT.invoke_with_response_stream(
    FunctionName=FUNCTION_NAME, InvocationType="RequestResponse", Payload=json.dumps(request_body)
)

for chunk in response["EventStream"]:
    if "InvokeComplete" not in chunk:
        running_invocation_handler(chunk=chunk)
    else:
        completed_invocation_handler(chunk=chunk)
        break

### 4.2 Post your generation request using the command line and `cURL`
If you have `cURL` (or any HTTP client) and the AWS CLI installed on your local host, you can post your generation requests and get streamed tokens using the command line:
1. First retrieve the identity pool ID from the CDK stack's output and substitute it into the following command to get an ID from the Cognito identity pool: `aws cognito-identity get-id --identity-pool-id <cognito-id-pool-id>`
2. Get temporary credentials for your identity: `aws cognito-identity get-credentials-for-identity --identity-id <id>`. The request should return a `Credentials` dictionary with an access key ID, a secret access key and a session token.
3. To post your request, first substitute the credentials and the Lambda function's URL (part of the CDK stack's outputs) into the following cURL command and then run it:

```bash
curl -X POST "<token-streaming-function-url>" \
-d '{"inputs": "What is Amazon? Be concise.", "parameters": {"max_new_tokens": 128, "do_sample": true}}' \
--user "<access-key-id>:<secret-access-key>" \
--header "x-amz-security-token: <session-token>" \
--aws-sigv4 "aws:amz:<region-name>:lambda" \
--no-buffer
```

## 5. Clean-up
Once done, execute the following commands to remove the lab's artifacts and resources from both your local host and AWS account:
1. Destroy the lab's CDK stack: `cdk destroy`
2. If you want to destroy the `CDKToolkit` stack created at the bootstrapping step, see [this comment](https://github.com/aws/aws-cdk/issues/986#issuecomment-644602463).
3. Deactivate the virtualenv: `deactivate`
4. Cd back in the lab's root directory: `cd ..`
5. Remove the `token-streaming-project.tar.gz` file: `rm token-streaming-project.tar.gz`
6. Remove the CDK's application directory: `rm -rf token-streaming-project` (includes the removal of the virtualenv).
7. Remove the code directory: `rm -rf code`
8. Remove the model artifacts from Amazon S3 by executing the following cell:

In [None]:
# Remove HF model artifacts from S3
hf_s3_objects = list_s3_objects(bucket=SM_ARTIFACT_BUCKET_NAME, key_prefix=HF_MODEL_KEY_PREFIX)
hf_s3_objects_keys = [obj["Key"] for obj in hf_s3_objects]
delete_s3_objects(bucket=SM_ARTIFACT_BUCKET_NAME, keys=hf_s3_objects_keys)