# Stable Video Diffusion XT 1.1 on Amazon SageMaker

Stability AI's [Stable Video Diffusion XT (SVD-XT) 1.1](https://stability.ai/stable-video) foundation model is a diffusion model that takes a still image as a conditioning frame and generates a short, 4-second video. The notebook walks through configuring, creating, and invoking an [Asynchronous Inference Endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) backed by the SVD-XT foundation model on Amazon SageMaker.

Version 1 of the notebook passes a publicly accessible image URL in the request payload used to invoke the model. Use the corresponding custom inference script, [inference_v1/inference.py](inference_v1/inference.py), when preparing the model archive.

**Author:** Gary A. Stafford  
**Date:** 2024-04-28

![Architecture V1](architecture/async_inference_v1.png)

## Install Required Packages


In [None]:
%%sh

# optional: update OS packages in Amazon SageMaker Studio Ubuntu environment
sudo apt-get update -qq -y && sudo apt-get upgrade -qq -y

In [None]:
%%sh

sudo apt-get install git libgl1 ffmpeg git-lfs wget -y

In [None]:
%pip install sagemaker boto3 botocore ffmpeg-python ipython diffusers pywget -Uq

In [None]:
# restart kernel 1x when installing new packages

import os

os._exit(00)

## Prepare the SVD-XT Model for Inference

Steps to prepare the model for inference: 1/ Download the model artifacts from Hugging Face, 2/ add the custom inference script, 3/ create an archive file from the model artifacts, and 4/ upload the archive file to Amazon S3 for deployment.

Alternatively, if the model archive is already available in Amazon S3, skip steps 1-3 and see [Alternate Method if `model.tar.gz` Already Exists in S3](#Alternate-Method-if-model.tar.gz-Already-Exists-in-S3), below.


In [None]:
import sagemaker
import boto3
from botocore.exceptions import ClientError
import os
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.s3 import s3_path_join

In [None]:
sm_session_bucket = None

sm_session = sagemaker.Session()

if sm_session_bucket is None and sm_session is not None:
    # set to default bucket if a bucket name is not given
    sm_session_bucket = sm_session.default_bucket()
try:
    sm_role = sagemaker.get_execution_role()
except ValueError:
    iam_client = boto3.client("iam")
    sm_role = iam_client.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

In [None]:
print(f"sagemaker role arn: {sm_role}")
print(f"sagemaker bucket: {sm_session_bucket}")
print(f"sagemaker session region: {sm_session.boto_region_name}")

### 1. Download the Model Artifacts from Hugging Face

It will take 6-7 minutes to download model artifacts from Hugging Face. You will need a Hugging Face account to get your personal access token. Requires approx. 34 GB of space.

Check the `/dev/nvme1n1` volume, mounted to `/home/sagemaker-user` to ensure it has enough space, from your terminal:

```sh
df -h /home/sagemaker-user
```
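The same check can be made from a notebook cell. The following is a minimal sketch using only the Python standard library; the helper name `has_free_space` and the `REQUIRED_GB` constant are illustrative and not part of the original notebook.

```python
import shutil

REQUIRED_GB = 34  # approximate space needed for the download, per the note above


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# "." is a stand-in; in SageMaker Studio you would likely pass "/home/sagemaker-user".
print(has_free_space(".", REQUIRED_GB))
```

If this prints `False`, free up space or increase the volume size before cloning the model repository.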


In [None]:
%%sh

git lfs install

In [None]:
%%time
%%sh

user_name="<YOUR_HUGGINGFACE_USERNAME>"
access_token="<YOUR_HUGGING_FACE_ACCESS_TOKEN>"

# "git lfs clone" is deprecated; with Git LFS installed, "git clone" also fetches the LFS files
git clone "https://${user_name}:${access_token}@huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt-1-1.git"

### 2. Add Custom Inference Script

In [None]:
import shutil

destination = "stable-video-diffusion-img2vid-xt-1-1"

shutil.copy("inference_v1/inference.py", destination)
shutil.copy("inference_v1/requirements.txt", destination)

### 3. TAR GZIP Model Artifacts

Important: Final `model.tar.gz` will be 14-15 GB and could take 35-40 minutes to package and compress.

Continuously poll the size of the `model.tar.gz` file every 15 seconds from your terminal:

```sh
while sleep 15; do ls -la model.tar.gz; done
```


In [None]:
%%time

import os
import tarfile

TAR_MODE = "w:gz"


def create_tar_archive(folder_path, output_tar_file):
    """
    Create a tar archive from a folder, excluding hidden files.

    :param folder_path: The path to the folder to be archived.
    :param output_tar_file: The path to the output tar file.
    """
    with tarfile.open(output_tar_file, TAR_MODE) as tar:
        for root, dirs, files in os.walk(folder_path):
            # skip hidden files and do not descend into hidden folders (e.g. .git)
            files = [f for f in files if not f.startswith(".")]
            dirs[:] = [d for d in dirs if not d.startswith(".")]
            for file in files:
                file_path = os.path.join(root, file)
                tar.add(file_path, arcname=os.path.relpath(file_path, folder_path))
                print(f"Added {file_path} to the archive.")


output_tar_file = "model.tar.gz"

create_tar_archive(destination, output_tar_file)
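Because packaging takes so long, it may be worth confirming that the custom inference script and its requirements file made it into the archive before uploading. The sketch below is one way to do that with the standard library; `missing_from_archive` is a hypothetical helper, and it only checks file names, not where in the archive the files sit.

```python
import os
import tarfile


def missing_from_archive(tar_path, expected_names):
    """Return the expected file names that are not found anywhere in the archive."""
    with tarfile.open(tar_path, "r:*") as tar:
        present = {os.path.basename(member.name) for member in tar.getmembers()}
    return [name for name in expected_names if name not in present]


# Only runs if the archive from the previous cell exists in the working directory.
if os.path.exists("model.tar.gz"):
    print(missing_from_archive("model.tar.gz", ["inference.py", "requirements.txt"]))
```

An empty list means both files were found. Note that listing a gzip-compressed archive of this size requires reading through the whole file, so expect it to take a few minutes.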

### Alternate Method if `model.tar.gz` Already Exists in S3

If the `model.tar.gz` file already exists in S3, skip steps 1-3 above. Instead, create an Amazon S3 presigned URL and use it to download the model package. This step takes 4-7 minutes within the same AWS Region.


In [None]:
%%time

import os
from pywget import wget

presigned_s3_url = "<YOUR_PRESIGNED_URL_GOES_HERE>"
save_path = "model.tar.gz"

wget.download(presigned_s3_url, save_path)

### 4. Copy Model Artifacts to S3

Copying the `model.tar.gz` file (approx. 14 GB) to Amazon S3 takes 2-3 minutes within the same AWS Region.


In [None]:
%%time

s3_client = boto3.client("s3")

response = s3_client.upload_file(
    "model.tar.gz",
    sm_session_bucket,
    "async_inference/model/model.tar.gz",
)

## Deploy Model to Amazon SageMaker Endpoint

Deploying the Amazon SageMaker Asynchronous Inference Endpoint takes 5-7 minutes.


In [None]:
env = {
    "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",
    "TS_MAX_RESPONSE_SIZE": "1000000000",
    "TS_MAX_REQUEST_SIZE": "1000000000",
    "MMS_MAX_RESPONSE_SIZE": "1000000000",
    "MMS_MAX_REQUEST_SIZE": "1000000000",
}

huggingface_model = HuggingFaceModel(
    model_data=s3_path_join(
        "s3://", sm_session_bucket, "async_inference/model/model.tar.gz"
    ),
    transformers_version="4.37.0",
    pytorch_version="2.1.0",
    py_version="py310",
    env=env,
    role=sm_role,
)

In [None]:
# where the response payload or error will be stored

async_config = AsyncInferenceConfig(
    output_path=s3_path_join("s3://", sm_session_bucket, "async_inference/output"),
    failure_path=s3_path_join(
        "s3://", sm_session_bucket, "async_inference/output_errors"
    ),
)

In [None]:
%%time

# also successfully tested with a ml.g5.2xlarge instance

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    async_inference_config=async_config,
)

In [None]:
endpoint_name = predictor.endpoint_name

In [None]:
print(endpoint_name)

In [None]:
# if model was previously deployed, then set variable manually

# endpoint_name = "<YOUR_MODEL_ENDPOINT_NAME>"

## Examples of Different Images and Inference Parameters

Select one of the sets of inference parameters below and run that cell. Each variation creates a different video.


In [None]:
movie_title = "rocket_1.mp4"

data = {
    "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "rocket_2.mp4"

data = {
    "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 180,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "smoke.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/smoke.jpg",
    "width": 576,
    "height": 1024,
    "num_frames": 25,
    "num_inference_steps": 50,
    "min_guidance_scale": 0.5,
    "max_guidance_scale": 1.0,
    "fps": 6,
    "motion_bucket_id": 25,
    "noise_aug_strength": 0.8,
    "decode_chunk_size": 8,
    "seed": 111142,
}

In [None]:
movie_title = "color_smoke.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/colored_smoke.jpg",
    "width": 576,
    "height": 1024,
    "num_frames": 25,
    "num_inference_steps": 50,
    "min_guidance_scale": 0.5,
    "max_guidance_scale": 1.0,
    "fps": 6,
    "motion_bucket_id": 25,
    "noise_aug_strength": 0.8,
    "decode_chunk_size": 8,
    "seed": 111142,
}

In [None]:
movie_title = "beach_bike_1.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/beach_bike.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 1234567890,
}

In [None]:
movie_title = "beach_bike_2.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/beach_bike.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 123,
}

In [None]:
movie_title = "waterfall.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/waterfall.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 1234567890,
}

In [None]:
movie_title = "boat_ocean.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/boat_ocean.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "red_car.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/red_car.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "coffee.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/coffee_stream.jpg",
    "width": 576,
    "height": 1024,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "koi.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/koi.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 9288258982,
}

In [None]:
movie_title = "champagne.mp4"

data = {
    "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/champagne2.jpg",
    "width": 576,
    "height": 1024,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}
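Most of the parameter sets above differ in only a few values. As a sketch, a small helper could hold the most common values as defaults and accept overrides; `DEFAULT_PARAMS` and `make_payload` are illustrative names that do not appear in the original notebook, and the result is stored in `payload` rather than `data` so that it does not replace the selection made above.

```python
# Values shared by most of the example cells above.
DEFAULT_PARAMS = {
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}


def make_payload(image_url, **overrides):
    """Build a request payload from the common defaults plus any overrides."""
    unknown = set(overrides) - set(DEFAULT_PARAMS)
    if unknown:
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    return {"image": image_url, **DEFAULT_PARAMS, **overrides}


payload = make_payload(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png",
    motion_bucket_id=180,
)

# Clip length in seconds is the frame count divided by the frame rate.
print(payload["motion_bucket_id"], payload["num_frames"] / payload["fps"])
```

With 25 frames at 6 frames per second, the clip runs a little over 4 seconds. Rejecting unknown keys catches typos in parameter names before the request is sent.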

## Upload Request Payload and Invoke Endpoint

Upload the JSON request payload to Amazon S3 and invoke the endpoint for inference. Invocation time for a video with 25 inference steps is about 2 minutes.


In [None]:
def upload_file(input_location):
    return sm_session.upload_data(
        input_location,
        bucket=sm_session_bucket,
        key_prefix="async_inference/input",
        extra_args={"ContentType": "application/json"},
    )

In [None]:
import json

file_name = "request_payloads/payload.json"

with open(file_name, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

In [None]:
input_s3_location = upload_file(file_name)

In [None]:
sm_runtime_client = boto3.client("sagemaker-runtime")

response = sm_runtime_client.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_s3_location,
    InvocationTimeoutSeconds=3600,
)

In [None]:
print(response["OutputLocation"])

### Poll for Model Response

Poll the Amazon S3 bucket for a response from the model invocation.


In [None]:
import urllib.parse
import time


# function reference: https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough-SageMaker-Python-SDK.ipynb
def get_output(output_location):
    output_url = urllib.parse.urlparse(output_location)
    bucket = output_url.netloc
    key = output_url.path[1:]
    while True:
        try:
            return sm_session.read_s3_file(bucket=bucket, key_prefix=key)
        except ClientError as e:
            if e.response["Error"]["Code"] == "NoSuchKey":
                print("Waiting for model output...")
                time.sleep(15)
                continue
            raise

In [None]:
%%time

output = get_output(response["OutputLocation"])
print(f"Sample of output: {output[:500]}")
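As written, `get_output` keeps polling until the output object appears, so it will wait forever if the invocation fails and only an error object is written to the failure path. One possible refinement is to bound the wait. The sketch below shows the idea in isolation; `poll_until` and its `fetch` argument are hypothetical, with `fetch` standing in for an S3 read that returns `None` while the result is pending.

```python
import time


def poll_until(fetch, timeout_s=900, interval_s=15, sleep=time.sleep):
    """
    Call `fetch()` until it returns a non-None value or `timeout_s` elapses.

    `fetch` is a zero-argument callable that returns the result when it is
    ready and None while it is still pending.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No result after {timeout_s} seconds")
        sleep(interval_s)


# Stand-in for an S3 read: pending twice, then ready.
attempts = iter([None, None, '{"frames": []}'])
print(poll_until(lambda: next(attempts), timeout_s=5, interval_s=0))
```

In the notebook, `fetch` could wrap the `read_s3_file` call and return `None` on a `NoSuchKey` error, with the timeout set somewhat above the expected invocation time.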

## JSON to MP4 Video

Decode the base64-encoded frames in the response into one JPEG file per frame, then combine the frames into an MP4.
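JSON cannot carry raw bytes, which is why the frames arrive as base64 text. The following sketch shows the round trip in isolation. It assumes the inference script returns a JSON object with a `frames` list of base64 strings, which matches how the response is read in this notebook; the server-side encoding step shown here is a guess at the counterpart, and the frame bytes are a stand-in rather than a real image.

```python
import base64
import json

# Stand-in bytes; a real frame would be the contents of a JPEG file.
fake_frame = b"\xff\xd8\xff\xe0 not a real JPEG"

# Server side (assumed): encode each frame as base64 text so it fits in JSON.
response_body = json.dumps({"frames": [base64.b64encode(fake_frame).decode("ascii")]})

# Client side: parse the JSON, then decode each string back to bytes.
frames = json.loads(response_body)["frames"]
decoded = [base64.b64decode(frame) for frame in frames]

print(decoded[0] == fake_frame)
```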


In [None]:
import base64
from PIL import Image
from diffusers.utils import export_to_video, make_image_grid


def load_video_frames(video_frames):
    loaded_video_frames = []

    for idx, video_frame in enumerate(video_frames):
        frame = bytes(video_frame, "raw_unicode_escape")
        # zero-pad the frame number to two digits, e.g. frame_01.jpg
        frame_name = f"frames_out/frame_{idx + 1:02d}.jpg"
        with open(frame_name, "wb") as fh:
            fh.write(base64.decodebytes(frame))

        image = Image.open(frame_name, mode="r")
        loaded_video_frames.append(image)

    return loaded_video_frames

In [None]:
output = get_output(response["OutputLocation"])
data = json.loads(output)
loaded_video_frames = load_video_frames(data["frames"])

export_to_video(loaded_video_frames, f"video_out/{movie_title}", fps=6)
print(f"Video created: {movie_title}")

### Display Frames as Grid

Display the 25 frames as a 5x5 grid.


In [None]:
image = make_image_grid(loaded_video_frames, 5, 5)
(width, height) = (image.width // 2, image.height // 2)
im_resized = image.resize((width, height))
display(im_resized)
im_resized.save("frames.png")
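The 5x5 layout fits exactly because there are 25 frames. If `num_frames` is changed, the grid dimensions need to change with it, since `make_image_grid`, as far as I can tell, expects the number of images to equal rows times columns. A sketch of one way to pick the dimensions follows; `grid_shape` is a hypothetical helper, not part of diffusers.

```python
import math


def grid_shape(num_images):
    """Return (rows, cols) with rows * cols == num_images, as close to square as possible."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    rows = math.isqrt(num_images)
    # Walk down from the square root until we hit a divisor; 1 always divides.
    while num_images % rows != 0:
        rows -= 1
    return rows, num_images // rows


print(grid_shape(25), grid_shape(14))
```

The result could then be passed along as `make_image_grid(loaded_video_frames, *grid_shape(len(loaded_video_frames)))`. For a prime frame count this falls back to a single row.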

### Display Video

Re-encode the video with the H.264 codec and display it in the notebook at 50% of its actual size.


In [None]:
import ffmpeg

output_options = {
    "crf": 20,
    "preset": "slower",
    "movflags": "faststart",
    "pix_fmt": "yuv420p",
    "vcodec": "libx264",
}

(
    ffmpeg.input(f"video_out/{movie_title}")
    .output("video_out/tmp.mp4", **output_options)
    .run(overwrite_output=True, quiet=True)
)

In [None]:
from IPython.display import Video

Video(
    url="video_out/tmp.mp4",
    width=(loaded_video_frames[0].width // 2),
    html_attributes="controls muted autoplay loop",
)

## Delete Amazon SageMaker Endpoint


In [None]:
# client_sm = boto3.client("sagemaker")

# client_sm.delete_endpoint(EndpointName=endpoint_name)

## Generating Multiple Video Variations

Generate multiple video variations by combining the above code in a loop. This example creates three variations, changing the seed each time.


In [None]:
import random
import json
from diffusers.utils import export_to_video

sm_runtime_client = boto3.client("sagemaker-runtime")

for i in range(3):
    seed = random.randrange(1, 9999999999)
    data = {
        "image": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/beach_bike.jpg",
        "width": 1024,
        "height": 576,
        "num_frames": 25,
        "num_inference_steps": 25,
        "min_guidance_scale": 1.0,
        "max_guidance_scale": 3.0,
        "fps": 6,
        "motion_bucket_id": 127,
        "noise_aug_strength": 0.02,
        "decode_chunk_size": 8,
        "seed": seed,
    }
    movie_title = f"beach_bike_{seed}.mp4"

    file_name = f"request_payloads/payload_{i}.json"
    with open(file_name, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    input_s3_location = upload_file(file_name)

    response = sm_runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_location,
        InvocationTimeoutSeconds=3600,
    )

    output = get_output(response["OutputLocation"])
    data = json.loads(output)
    loaded_video_frames = load_video_frames(data["frames"])

    export_to_video(loaded_video_frames, f"video_out/{movie_title}", fps=6)
    print(f"Video created: {movie_title}")
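The loop above draws each seed independently. If you want the seeds to be distinct and the run to be repeatable, one option is to draw them all at once from a seeded generator. This is a sketch under those assumptions; `seed_variations` is a hypothetical helper, and the two-key base payload is shortened for illustration.

```python
import random


def seed_variations(base_payload, count, rng=None):
    """Return `count` copies of `base_payload`, each with a distinct random seed."""
    rng = rng or random.Random()
    # sample() draws without replacement, so no seed repeats.
    seeds = rng.sample(range(1, 9_999_999_999), count)
    return [{**base_payload, "seed": seed} for seed in seeds]


# Passing a seeded generator makes the same seeds come back on every run.
variations = seed_variations({"num_frames": 25, "fps": 6}, 3, rng=random.Random(0))
print([v["seed"] for v in variations])
```

Each returned dictionary can be written to its own payload file and sent to the endpoint exactly as in the loop above.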