# Accelerating LLM Deployments with SageMaker Fast Model Loader

Amazon SageMaker Fast Model Loader represents a significant advancement in deploying Large Language Models (LLMs) for inference. As LLMs continue to grow in size and complexity, with some models requiring hundreds of gigabytes of memory, the traditional model loading process has become a major bottleneck in deployment and scaling.

This notebook demonstrates how to use Fast Model Loader to reduce model loading times. The feature streams model weights directly from Amazon S3 to GPU accelerators, bypassing the sequential loading steps that normally add to deployment latency. In internal testing, this approach has been shown to load large models up to 15 times faster than traditional methods.

We'll walk through deploying the Llama 3.1 70B model using Fast Model Loader, showcasing how to:
- Optimize the model for streaming using ModelBuilder
- Configure tensor parallelism for distributed inference
- Deploy the optimized model to a SageMaker endpoint
- Test the deployment with inference requests

Fast Model Loader introduces two key techniques that work together:
1. Weight Streaming - Directly streams model weights from S3 to GPU memory
2. Model Sharding for Streaming - Pre-shards the model in uniform chunks for optimal loading
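
To make the two techniques concrete, here is a toy sketch of the general idea: an object is split into uniform byte ranges, the ranges are fetched concurrently, and the pieces are reassembled in order. It is only an illustration and not how Fast Model Loader is implemented. `FAKE_OBJECT`, `CHUNK_SIZE`, `chunk_ranges`, `fetch_chunk`, and `stream_object` are hypothetical names, and an in-memory byte string stands in for a weight file in S3.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a model weight file stored in S3.
FAKE_OBJECT = bytes(range(256)) * 40  # 10,240 bytes
CHUNK_SIZE = 1024  # uniform chunk size, chosen only for illustration


def chunk_ranges(total_size, chunk_size):
    """Return (start, end) byte ranges that cover the object in uniform chunks."""
    return [
        (start, min(start + chunk_size, total_size))
        for start in range(0, total_size, chunk_size)
    ]


def fetch_chunk(byte_range):
    """Simulate a ranged read; a real loader would issue a ranged S3 request."""
    start, end = byte_range
    return FAKE_OBJECT[start:end]


def stream_object(total_size, chunk_size, workers=4):
    """Fetch all chunks concurrently and reassemble them in order."""
    ranges = chunk_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though fetches overlap.
        chunks = list(pool.map(fetch_chunk, ranges))
    return b"".join(chunks)


loaded = stream_object(len(FAKE_OBJECT), CHUNK_SIZE)
print(len(chunk_ranges(len(FAKE_OBJECT), CHUNK_SIZE)), loaded == FAKE_OBJECT)
```

Uniform chunks keep the work per request predictable, which is what lets many reads proceed in parallel instead of one large sequential download.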

## Prerequisites
- SageMaker execution role with appropriate permissions
- Access to a GPU instance (ml.p4d.24xlarge recommended for this example)
- SageMaker Python SDK


## Setup Environment
First, we'll set up our SageMaker session and define basic variables:

In [None]:
!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

In [None]:
import boto3
from sagemaker import Session
from sagemaker import get_execution_role

# Get the SageMaker execution role
role = get_execution_role()

# Region, session, and default S3 bucket used throughout the notebook
region = boto3.Session().region_name
sess = Session()
bucket = sess.default_bucket()



## Create Model Builder
We'll use the ModelBuilder class to prepare and package the model inference components. In this example, we're using the Llama 3.1 70B model from SageMaker JumpStart.

Key configurations:
- Model: meta-textgeneration-llama-3-1-70b
- Schema Builder: Defines input/output format

In [None]:
import logging

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",  # SageMaker JumpStart model ID
    role_arn=role,
    sagemaker_session=sess,
    # Sample input/output pair that defines the request/response format
    schema_builder=SchemaBuilder(sample_input="Test", sample_output="Test"),
    #env_vars={
    #   "OPTION_TENSOR_PARALLEL_DEGREE": "8",
    #},
    log_level=logging.WARN,
)

# S3 location where the optimized model shards will be stored
output_path = f"s3://{bucket}/sharding"

Note: If you have already run the model optimization job and the model shards are available in S3, you can uncomment the code below to reuse the existing shards and skip the `Optimize Model for Fast Loading` section.

In [None]:
# model_builder = ModelBuilder(
#     model="meta-textgeneration-llama-3-1-70b",
#     model_metadata={
#         "CUSTOM_MODEL_PATH": output_path,
#     },
#     schema_builder=SchemaBuilder(sample_input="Test", sample_output="Test"),
#     role_arn=role,
#     instance_type="ml.g5.48xlarge",
# )

## Optimize Model for Fast Loading
Now we'll optimize the model using Fast Model Loader. This process:
1. Prepares model shards for deployment
2. Enables direct streaming from S3 to GPU
3. Pre-configures tensor parallelism settings

Note: The optimization process may take a while to complete. The optimized model will be stored in the specified S3 output path.

In [None]:
model_builder.optimize(
    instance_type="ml.p4d.24xlarge",
    accept_eula=True,
    output_path=output_path,
    sharding_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "8"
        }
    },
)
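
If you plan to try other tensor parallel degrees, a small helper can build the `sharding_config` dictionary and catch obvious mistakes before the optimization job starts. This is a minimal sketch: `build_sharding_config` is a hypothetical helper, not part of the SageMaker SDK, and the default of 8 accelerators is an assumption that matches the 8 GPUs on ml.p4d.24xlarge. The only check made here is that the degree does not exceed the accelerator count; the service may enforce further constraints.

```python
def build_sharding_config(tensor_parallel_degree, num_accelerators=8):
    """Build a sharding_config dict after a basic sanity check.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be at least 1")
    if tensor_parallel_degree > num_accelerators:
        raise ValueError(
            f"tensor_parallel_degree ({tensor_parallel_degree}) exceeds "
            f"the number of accelerators ({num_accelerators})"
        )
    # Environment variable values are passed as strings.
    return {
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree)
        }
    }


print(build_sharding_config(8))
```

The resulting dictionary has the same shape as the `sharding_config` argument used in the cell above.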

## Build and Deploy Model
After optimization, we'll build the final model artifacts and deploy them to a SageMaker endpoint. 

Key configurations:
- Instance Type: ml.p4d.24xlarge
- Memory Request: 204800 MB
- Number of Accelerators: 8 (for tensor parallelism)
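
A quick back-of-the-envelope check helps put these numbers in context. The sketch below is only an estimate under stated assumptions: roughly 70 billion parameters, 2 bytes per parameter (16-bit weights), 40 GB of memory per GPU on ml.p4d.24xlarge, and the memory request read as mebibytes. It ignores the KV cache, activations, and runtime overhead.

```python
# Values taken from the configuration above.
MEMORY_REQUEST_MB = 204800
NUM_ACCELERATORS = 8

# Assumptions for the estimate (not taken from the notebook).
NUM_PARAMETERS = 70_000_000_000  # approximate parameter count of a 70B model
BYTES_PER_PARAMETER = 2          # 16-bit weights
GPU_MEMORY_GB = 40               # per-GPU memory on ml.p4d.24xlarge

# Host memory request expressed in GiB (treating MB as MiB).
memory_request_gib = MEMORY_REQUEST_MB / 1024

# Total weight size and the share each GPU holds under tensor parallelism.
weights_gb = NUM_PARAMETERS * BYTES_PER_PARAMETER / 1e9
weights_per_gpu_gb = weights_gb / NUM_ACCELERATORS
headroom_per_gpu_gb = GPU_MEMORY_GB - weights_per_gpu_gb

print(f"Memory request: {memory_request_gib:.1f} GiB")
print(f"Model weights: {weights_gb:.1f} GB total")
print(f"Weights per GPU: {weights_per_gpu_gb:.1f} GB")
print(f"Headroom per GPU: {headroom_per_gpu_gb:.1f} GB")
```

Under these assumptions, splitting the weights across all 8 accelerators leaves each GPU with room to spare for the cache and activations.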

In [None]:
final_model = model_builder.build()

In [None]:
# The deployment must use the sharded model, so make sure the flag is set.
if not final_model._is_sharded_model:
    final_model._is_sharded_model = True
final_model._is_sharded_model

In [None]:
# EnableNetworkIsolation must be False because Fast Model Loader requires network access to load the model.
if final_model._enable_network_isolation:
    final_model._enable_network_isolation = False
final_model._enable_network_isolation

In [None]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

resources_required = ResourceRequirements(
    requests={
        "memory": 204800,       # memory request in MB
        "num_accelerators": 8,  # one accelerator per tensor parallel rank
    }
)

In [None]:
final_model.deploy(
    instance_type="ml.p4d.24xlarge", 
    accept_eula=True, 
    # endpoint_logging=False, 
    resources=resources_required,
)

## Test the Endpoint
Finally, we'll test the deployed endpoint with a simple inference request. 

In [None]:
from sagemaker.predictor import retrieve_default 

endpoint_name = final_model.endpoint_name 
predictor = retrieve_default(endpoint_name, sagemaker_session=sess) 

payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
    },
}
response = predictor.predict(payload) 
print(response) 
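
When sending several prompts, it can be convenient to build the request body with a helper that mirrors the payload structure above. The sketch below is one way to do that. `build_payload` is a hypothetical helper, and the range checks are assumptions based on how these sampling parameters are commonly defined; the serving container may accept or reject different values.

```python
import json


def build_payload(prompt, max_new_tokens=64, top_p=0.9, temperature=0.6):
    """Build a request body shaped like the payload used above.

    Hypothetical helper; the validation ranges are assumptions.
    """
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the range (0, 1]")
    if temperature < 0:
        raise ValueError("temperature must not be negative")
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


example = build_payload("I believe the meaning of life is")
print(json.dumps(example, indent=2))
```

The returned dictionary can be passed to `predictor.predict` in the same way as the hand-written payload.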

## Clean Up
Delete the deployed resources when you are finished so that the endpoint stops incurring charges.

In [None]:
predictor.delete_predictor()
predictor.delete_endpoint()