# Deploy Falcon 40B on Amazon SageMaker

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

---

In this notebook, we use the [Large Model Inference (LMI) container](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-dlc.html) from [SageMaker Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) to host [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b) on Amazon SageMaker.

We'll also see which configuration parameters can be used to optimize the endpoint for throughput and latency. We deploy to an `ml.g5.12xlarge` instance, whose four GPUs match the `option.tensor_parallel_degree = 4` setting used below.
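
As a rough sanity check on the instance choice, the sketch below estimates the memory needed for the model weights alone. It assumes roughly 40 billion parameters, 2 bytes per parameter for fp16, and 4 GPUs with 24 GB each on an `ml.g5.12xlarge`. The function name and constants are illustrative and not part of any SageMaker API.

```python
def weights_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumptions: ~40e9 parameters, fp16 (2 bytes each), 4 GPUs x 24 GB per GPU.
NUM_PARAMS = 40e9
GPU_COUNT = 4
GPU_MEMORY_GB = 24

weights_gb = weights_memory_gb(NUM_PARAMS)
total_gpu_gb = GPU_COUNT * GPU_MEMORY_GB
headroom_gb = total_gpu_gb - weights_gb

print(f"Weights: ~{weights_gb:.0f} GB of {total_gpu_gb} GB, headroom: ~{headroom_gb:.0f} GB")
```

Under these assumptions the weights leave only a modest amount of GPU memory for activations and caching, which is why the batching and prefill settings later in the notebook need tuning.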

### Import the relevant libraries and configure several global variables using boto3

In [None]:
%pip install sagemaker boto3 awscli --upgrade  --quiet

In [None]:
import boto3
import sagemaker
import jinja2
import json
from pathlib import Path
from sagemaker import Model, image_uris, serializers, deserializers

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # SageMaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account ID of the current SageMaker Studio environment
jinja_env = jinja2.Environment()

sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

## Step 1: Prepare the model artifacts
The LMI container expects the following artifacts for hosting the model:
- `serving.properties` (required): Defines the model server settings and configurations.
- `model.py` (optional): A python script that defines the inference logic.
- `requirements.txt` (optional): Any additional pip wheels that need to be installed.

SageMaker expects the model artifacts in a tarball with the following structure:

```
code
├── serving.properties
├── model.py
└── requirements.txt
```


In this notebook, we provide only a `serving.properties` file. By default, the container runs the [huggingface.py module](https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/huggingface.py) from the DJL Serving repository as the entry point code.

In [None]:
!rm -rf falcon_src
!mkdir -p falcon_src

### Create the serving.properties
This configuration file tells DJL Serving which model parallelization and inference optimization techniques to use. Set the options according to your needs.

The settings used in this configuration file are:
- `option.model_id`: The Hugging Face model ID or S3 location from which the model is downloaded.
- `option.tensor_parallel_degree`: The number of GPU devices over which to partition the model.
- `option.max_rolling_batch_size`: The maximum batch size for rolling (iteration-level) batching. Limits the number of concurrent requests.
- `option.rolling_batch`: The rolling batch strategy. With `auto`, the handler chooses the strategy based on the provided configuration. `scheduler` is a native rolling batch strategy supported for a single GPU. `lmi-dist` and `vllm` support multi-GPU rolling (iteration-level) batching.
- `option.paged_attention`: Enabling this preallocates more GPU memory for caching. Only supported when `option.rolling_batch=lmi-dist` or `option.rolling_batch=auto`.
- `option.max_rolling_batch_prefill_tokens`: Only supported for `option.rolling_batch=lmi-dist`. Limits the number of tokens for caching. Tune this for your workload based on batch size and input sequence length to avoid GPU out-of-memory (OOM) errors.
- `engine`: The runtime engine for the code. `MPI` below refers to the parallel processing framework, which is also used by engines like `DeepSpeed` and `FasterTransformer`.
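
The constraints in the list above can be checked before deploying. The following is a minimal sketch, not part of DJL Serving: `parse_properties` and `check_properties` are hypothetical helpers, the bucket name in `SAMPLE` is a placeholder, and the checks cover only the two rules stated above (tensor parallel degree versus GPU count, and paged attention versus rolling batch strategy).

```python
def parse_properties(text: str) -> dict:
    """Parse simple `key = value` lines, skipping blank lines and `#` comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_properties(props: dict, gpu_count: int) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    tp_degree = int(props.get("option.tensor_parallel_degree", "1"))
    if tp_degree > gpu_count:
        problems.append(f"tensor_parallel_degree={tp_degree} exceeds {gpu_count} GPUs")
    paged = props.get("option.paged_attention", "false").lower() == "true"
    if paged and props.get("option.rolling_batch") not in ("lmi-dist", "auto"):
        problems.append("paged_attention requires rolling_batch=lmi-dist or auto")
    return problems


# Placeholder configuration; the S3 location is hypothetical.
SAMPLE = """\
engine = MPI
option.model_id = s3://example-bucket/falcon-40b/
option.tensor_parallel_degree = 4
option.max_rolling_batch_size = 64
option.rolling_batch = auto
option.paged_attention = true
"""

props = parse_properties(SAMPLE)
print(check_properties(props, gpu_count=4))
```

Running the same check with a smaller `gpu_count` reports the tensor parallel mismatch, which is a cheap way to catch a wrong instance type before waiting on an endpoint.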


In [None]:
%%writefile falcon_src/serving.properties
engine = MPI
option.model_id = {{s3_url}}
option.trust_remote_code = true
option.tensor_parallel_degree = 4
option.max_rolling_batch_size = 64
option.rolling_batch = auto
option.dtype = fp16
option.max_rolling_batch_prefill_tokens = 1024
option.paged_attention = true

Define a variable to store the S3 location that holds the model weights.

In [None]:
pretrained_model_location = (
    f"s3://sagemaker-example-files-prod-{region}/models/hf-large-model-djl-ds/tgi_lmi_falcon40b/"
)
print(f"Pretrained model will be downloaded from ---- > {pretrained_model_location}")

Plug the appropriate model location into the `serving.properties` file. For these publicly hosted model weights, the S3 URL depends on the region in which the notebook is executed.

In [None]:
properties_path = Path("falcon_src/serving.properties")
# read_text/write_text open and close the file for us
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3_url=pretrained_model_location))
!pygmentize falcon_src/serving.properties | cat -n
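
A quick way to confirm that the rendering step worked is to look for leftover `{{ ... }}` placeholders. The sketch below uses only the standard library. `find_unrendered` is a hypothetical helper and the sample strings stand in for the file contents.

```python
import re

# Matches Jinja-style placeholders such as {{s3_url}} or {{ s3_url }}
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def find_unrendered(text: str) -> list:
    """Return every placeholder that is still present in the text."""
    return PLACEHOLDER.findall(text)


# Stand-ins for the file before and after rendering (bucket name is a placeholder)
raw = "option.model_id = {{s3_url}}\n"
rendered = "option.model_id = s3://example-bucket/falcon40b/\n"

print(find_unrendered(raw))
print(find_unrendered(rendered))
```

In the notebook, the same check could be applied to `Path("falcon_src/serving.properties").read_text()` before building the tarball.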

### Create a tarball with the model artifacts

In [None]:
!tar czvf falcon_code.tar.gz falcon_src/
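
Before uploading, it can help to confirm that the archive actually contains a `serving.properties` file. The following sketch uses Python's `tarfile` module. `build_tarball`, `list_members`, and `has_serving_properties` are hypothetical helpers, and the in-memory archive only stands in for `falcon_code.tar.gz`.

```python
import io
import tarfile


def build_tarball(files: dict) -> bytes:
    """Build an in-memory gzip tarball from a {path: text} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_members(tarball: bytes) -> list:
    """Return the member paths stored in a gzip tarball."""
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as tar:
        return tar.getnames()


def has_serving_properties(members: list) -> bool:
    """True if any member is a serving.properties file, at any directory depth."""
    return any(name.split("/")[-1] == "serving.properties" for name in members)


archive = build_tarball({"falcon_src/serving.properties": "engine = MPI\n"})
members = list_members(archive)
print(members, has_serving_properties(members))
```

For the real archive, `list_members(Path("falcon_code.tar.gz").read_bytes())` would return the paths written by the `tar` command above.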

## Step 2: Create the SageMaker endpoint

Define the SageMaker inference image URI to use for model inference.

In [None]:
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118"
)

### Upload artifact to S3 and create a SageMaker model

In [None]:
s3_code_prefix = "falcon40b/code"
bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_artifact = sess.upload_data("falcon_code.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("falcon40b-mpi-engine")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"
instance_type = "ml.g5.12xlarge"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

### This step can take about 10 minutes or longer, so please be patient

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

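print("Arn: " + resp["EndpointArn"])

# Stop here if the endpoint did not come up, rather than failing later at invocation time
if status != "InService":
    raise RuntimeError(f"Endpoint creation ended with status {status}: {resp.get('FailureReason')}")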
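
The polling loop above runs for as long as the endpoint reports `Creating`. A variant with an upper bound on the wait might look like the sketch below. `wait_for_status` is a hypothetical helper, not a boto3 API, and the fake status sequence exists only so the example runs without AWS access.

```python
import time


def wait_for_status(get_status, pending="Creating", poll_seconds=60,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it stops returning `pending` or the timeout elapses."""
    waited = 0
    status = get_status()
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {pending} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Demonstration with a fake status source and a no-op sleep
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)  # prints "InService"
```

In the notebook, `get_status` could be `lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`, which is the same call the loop above makes.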

## Step 3: Invoke the Endpoint

In [None]:
def query_endpoint(payload):
    """Query endpoint and print the response"""

    response_model = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=payload,
        ContentType="application/json",
    )

    generated_text = response_model["Body"].read().decode("utf8")
    print(generated_text)
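
`query_endpoint` prints the raw response body. If only the generated text is needed, the body can be parsed first. The sketch below assumes the container returns JSON with a `generated_text` field, either as an object or as a one-element list. That response shape is an assumption, so the hypothetical `extract_generated_text` helper falls back to the raw text when the assumption does not hold.

```python
import json


def extract_generated_text(body: bytes) -> str:
    """Return the generated text from a response body, or the raw text as a fallback."""
    text = body.decode("utf-8")
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return text  # not JSON, hand back the body unchanged
    # Assumed shape 1: {"generated_text": "..."}
    if isinstance(parsed, dict) and "generated_text" in parsed:
        return parsed["generated_text"]
    # Assumed shape 2: [{"generated_text": "..."}]
    if isinstance(parsed, list) and parsed and isinstance(parsed[0], dict):
        if "generated_text" in parsed[0]:
            return parsed[0]["generated_text"]
    return text


print(extract_generated_text(b'{"generated_text": "Step 1: pick a domain name."}'))
```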

### Generation

In [None]:
payload = json.dumps(
    {
        "inputs": "Building a website can be done in 10 simple steps:",
        "parameters": {"max_new_tokens": 126, "no_repeat_ngram_size": 3},
    }
)

query_endpoint(payload)

### Translation

In [None]:
payload = json.dumps(
    {
        "inputs": """Translate English to French:
                                sea otter => loutre de mer
                                peppermint => menthe poivrée
                                plush giraffe => girafe peluche
                                cheese => """,
        "parameters": {"max_new_tokens": 3},
    }
)
query_endpoint(payload)
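
Triple-quoted prompts written inside indented code carry that indentation into the text sent to the model. One way to avoid this is to assemble few-shot prompts line by line. The sketch below is illustrative: `build_few_shot_prompt` is a hypothetical helper, and whether the cleaner prompt changes the model's output is not something this notebook measures.

```python
import json


def build_few_shot_prompt(instruction, examples, query, separator=" => "):
    """Join an instruction, example pairs, and a final query into one prompt."""
    lines = [instruction]
    lines += [f"{source}{separator}{target}" for source, target in examples]
    lines.append(f"{query}{separator}")  # left open for the model to complete
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French:",
    [
        ("sea otter", "loutre de mer"),
        ("peppermint", "menthe poivrée"),
        ("plush giraffe", "girafe peluche"),
    ],
    "cheese",
)
payload = json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 3}})
print(prompt)
```

The resulting `payload` has the same structure as the ones above and can be passed to `query_endpoint` unchanged.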

### Classification

In [None]:
payload = json.dumps(
    {
        "inputs": """Tweet: "I hate it when my phone battery dies."
                                Sentiment: Negative
                                ###
                                Tweet: "My day has been :+1:"
                                Sentiment: Positive
                                ###
                                Tweet: "This is the link to the article"
                                Sentiment: Neutral
                                ###
                                Tweet: "This new music video was incredible"
                                Sentiment:""",
        "parameters": {"max_new_tokens": 2},
    }
)
query_endpoint(payload)

### Question answering

In [None]:
payload = json.dumps(
    {
        "inputs": "Could you remind me when was the C programming language invented?",
        "parameters": {"max_new_tokens": 50},
    }
)
query_endpoint(payload)

### Summarization

In [None]:
payload = json.dumps(
    {
        "inputs": """Starting today, the state-of-the-art Falcon 40B foundation model from Technology
                                Innovation Institute (TII) is available on Amazon SageMaker JumpStart, SageMaker's machine learning (ML) hub
                                that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get
                                started with ML. You can deploy and use this Falcon LLM with a few clicks in SageMaker Studio or
                                programmatically through the SageMaker Python SDK.
                                Falcon 40B is a 40-billion-parameter large language model (LLM) available under the Apache 2.0 license that
                                ranked #1 in Hugging Face Open LLM leaderboard, which tracks, ranks, and evaluates LLMs across multiple
                                benchmarks to identify top performing models. Since its release in May 2023, Falcon 40B has demonstrated
                                exceptional performance without specialized fine-tuning. To make it easier for customers to access this
                                state-of-the-art model, AWS has made Falcon 40B available to customers via Amazon SageMaker JumpStart.
                                Now customers can quickly and easily deploy their own Falcon 40B model and customize it to fit their specific
                                needs for applications such as translation, question answering, and summarizing information.
                                Falcon 40B is generally available today through Amazon SageMaker JumpStart in US East (Ohio),
                                US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Mumbai),
                                Europe (London), Europe (Frankfurt), Europe (Ireland), and Canada (Central),
                                with availability in additional AWS Regions coming soon. To learn how to use this new feature,
                                please see SageMaker JumpStart documentation, the Introduction to SageMaker JumpStart –
                                Text Generation with Falcon LLMs example notebook, and the blog Technology Innovation Institute trains the
                                state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker. Summarize the article above:""",
        "parameters": {"max_new_tokens": 200},
    }
)
query_endpoint(payload)

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_config_name)
sess.delete_model(model_name)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|deploy-falcon-40b-and-7b|LMI_rolling_batch_Falcon_40B.ipynb)
