# Stable Video Diffusion XT 1.1 on Amazon SageMaker

Stability AI's [Stable Video Diffusion XT (SVD-XT) 1.1](https://stability.ai/stable-video) foundation model is a diffusion model that takes a still image as a conditioning frame and generates a video from it. This notebook walks through creating and invoking an [asynchronous inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) backed by the SVD-XT foundation model on Amazon SageMaker.

__Author:__ Gary A. Stafford  
__Date:__ 2024-04-20

## Install Required Packages


In [None]:
%%sh

sudo apt-get update -y

sudo apt-get install git libgl1 ffmpeg git-lfs -y

In [None]:
%pip install sagemaker boto3 botocore ffmpeg-python opencv-python ipython diffusers -Uq

In [None]:
# restart the kernel once after installing new packages

import os

os._exit(00)
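
After the kernel restarts, it can help to confirm which versions were actually installed. The following is a minimal sketch that uses only the standard library; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the `%pip install` cell above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the %pip install cell above.
PACKAGES = [
    "sagemaker",
    "boto3",
    "botocore",
    "ffmpeg-python",
    "opencv-python",
    "ipython",
    "diffusers",
]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```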

## Download Model, Add Script, Compress, Deploy to S3


In [None]:
import sagemaker
import boto3
from botocore.exceptions import ClientError
import os
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.s3 import s3_path_join

In [None]:
sm_session_bucket = None

sm_session = sagemaker.Session()

if sm_session_bucket is None and sm_session is not None:
    # set to default bucket if a bucket name is not given
    sm_session_bucket = sm_session.default_bucket()

try:
    sm_role = sagemaker.get_execution_role()
except ValueError:
    client_iam = boto3.client("iam")
    sm_role = client_iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

In [None]:
print(f"sagemaker role arn: {sm_role}")
print(f"sagemaker bucket: {sm_session_bucket}")
print(f"sagemaker session region: {sm_session.boto_region_name}")

### Download the Model and Add Additional Files

Downloading the model artifacts from Hugging Face takes approx. 6 minutes and requires approx. 34 GB of disk space.

```text
sagemaker-user@default:stable-video-diffusion-img2vid-xt-1-1$ tree -a | tail -1
95 directories, 118 files
```
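
Because the clone needs roughly 34 GB, checking free disk space first can save a failed download. The following is a minimal sketch using the standard library's `shutil.disk_usage`; the helper names `free_gb` and `has_room` are hypothetical, and the check covers only the clone, not the extra room the later archive step needs.

```python
import shutil


def free_gb(path="."):
    """Free space, in GB (10^9 bytes), on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path=".", required_gb=34.0):
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb


REQUIRED_GB = 34  # approximate clone size quoted above
print(f"free: {free_gb():.1f} GB, enough for the clone: {has_room(required_gb=REQUIRED_GB)}")
```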


In [None]:
%%sh

# https://huggingface.co/docs/sagemaker/inference#create-a-model-artifact-for-deployment
git lfs install

In [None]:
%%time
%%sh

user_name="<YOUR_HUGGINGFACE_USERNAME>"
access_token="<YOUR_HUGGING_FACE_ACCESS_TOKEN>"

git lfs clone "https://${user_name}:${access_token}@huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt-1-1.git"

In [None]:
%%sh

cp inference.py stable-video-diffusion-img2vid-xt-1-1/
cp requirements.txt stable-video-diffusion-img2vid-xt-1-1/

### TAR GZIP Model Artifacts

Important: Final `model.tar.gz` will be approx. 14 GB and takes approx. 35 minutes to compress.

```text
CPU times: user 156 ms, sys: 29.8 ms, total: 186 ms
Wall time: 36min 18s
```

Watch the `/dev/nvme1n1` volume to make sure it does not fill up. From your terminal:

```sh
df -h && ls -alh stable-video-diffusion-img2vid-xt-1-1/model.tar.gz
```

```sh
while sleep 5; do ls -la stable-video-diffusion-img2vid-xt-1-1/model.tar.gz; done
```


In [None]:
%%time
%%sh

cd stable-video-diffusion-img2vid-xt-1-1
tar zcvf model.tar.gz *

### Copy Model Artifacts to S3

Copying `model.tar.gz`, which is approx. 28.2 GB, takes approx. 5 minutes.


In [None]:
%%time
%%sh

cd stable-video-diffusion-img2vid-xt-1-1

# replace with your own bucket name (the value of sm_session_bucket printed earlier)
sm_session_bucket="sagemaker-us-east-1-676164205626"

aws s3 cp model.tar.gz "s3://${sm_session_bucket}/async_inference/model/model.tar.gz"
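
The shell cell above needs the bucket name as a literal. As an alternative, the upload could be driven from Python so that it reuses `sm_session_bucket`. The sketch below assumes the session object exposes `upload_data(path, bucket=..., key_prefix=...)`, which is the same call this notebook uses later for the request payload; `model_s3_uri` and `upload_model` are hypothetical helper names.

```python
def model_s3_uri(bucket, prefix="async_inference/model", filename="model.tar.gz"):
    """Build the S3 URI that the deployment cell expects for the model archive."""
    return f"s3://{bucket}/{prefix.strip('/')}/{filename}"


def upload_model(session, local_path, bucket, prefix="async_inference/model"):
    """Upload the archive through the given session and return its S3 URI.

    `session` is expected to behave like sagemaker.Session: upload_data()
    stores the file under `key_prefix`, keeping the local file name.
    """
    return session.upload_data(local_path, bucket=bucket, key_prefix=prefix)


# Example call inside the notebook (not executed here):
# upload_model(
#     sm_session,
#     "stable-video-diffusion-img2vid-xt-1-1/model.tar.gz",
#     sm_session_bucket,
# )
print(model_s3_uri("my-example-bucket"))
```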

## Deploy Model to SageMaker Asynchronous Inference Endpoint

The model takes approx. 6 minutes to deploy.


In [None]:
env = {
    "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",
    "TS_MAX_RESPONSE_SIZE": "1000000000",
    "TS_MAX_REQUEST_SIZE": "1000000000",
    "MMS_MAX_RESPONSE_SIZE": "1000000000",
    "MMS_MAX_REQUEST_SIZE": "1000000000",
}

huggingface_model = HuggingFaceModel(
    model_data=s3_path_join(
        "s3://", sm_session_bucket, "async_inference/model/model.tar.gz"
    ),
    transformers_version="4.37.0",
    pytorch_version="2.1.0",
    py_version="py310",
    env=env,
    role=sm_role,
)

In [None]:
# https://www.philschmid.de/sagemaker-huggingface-async-inference
# https://sagemaker.readthedocs.io/en/stable/api/inference/async_inference.html
# where the response payload will be stored

async_config = AsyncInferenceConfig(
    output_path=s3_path_join("s3://", sm_session_bucket, "async_inference/output"),
    failure_path=s3_path_join(
        "s3://", sm_session_bucket, "async_inference/output_errors"
    ),
)

In [None]:
%%time

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.16xlarge",
    async_inference_config=async_config,
)

In [None]:
endpoint_name = predictor.endpoint_name

## Examples of Different Images and Inference Parameters
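
The example payloads in this section share most of their parameters and differ mainly in the image URL, dimensions, motion settings, and seed. One way to cut the repetition is a small helper that starts from a set of defaults and applies overrides. This is a sketch only: `make_payload` and `DEFAULT_PARAMS` are hypothetical names, and the defaults are copied from the first rocket example.

```python
# Defaults copied from the first rocket example in this notebook.
DEFAULT_PARAMS = {
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}


def make_payload(image_url, **overrides):
    """Return a request payload: the defaults with any overrides applied."""
    unknown = set(overrides) - set(DEFAULT_PARAMS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return {"inputs": image_url, **DEFAULT_PARAMS, **overrides}


# Same conditioning image as the first example, with a higher motion bucket.
rocket_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"
data_example = make_payload(rocket_url, motion_bucket_id=180)
print(data_example["motion_bucket_id"], data_example["width"], data_example["height"])
```

Rejecting unknown keys is a deliberate choice here: a misspelled parameter name raises immediately instead of being silently ignored.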


In [None]:
# https://huggingface.co/docs/diffusers/v0.27.2/en/using-diffusers/svd
# https://github.com/Stability-AI/generative-models/blob/main/scripts/sampling/simple_video_sample.py

movie_title = "rocket_1.mp4"

data = {
    "inputs": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "rocket_2.mp4"

data = {
    "inputs": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 180,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "smoke_tall_1.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/smoke.jpg",
    "width": 576,
    "height": 1024,
    "num_frames": 25,
    "num_inference_steps": 50,
    "min_guidance_scale": 0.5,
    "max_guidance_scale": 1.0,
    "fps": 6,
    "motion_bucket_id": 25,
    "noise_aug_strength": 0.8,
    "decode_chunk_size": 8,
    "seed": 111142,
}

In [None]:
movie_title = "color_smoke_tall_1.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/colored_smoke.jpg",
    "width": 576,
    "height": 1024,
    "num_frames": 25,
    "num_inference_steps": 50,
    "min_guidance_scale": 0.5,
    "max_guidance_scale": 1.0,
    "fps": 6,
    "motion_bucket_id": 25,
    "noise_aug_strength": 0.8,
    "decode_chunk_size": 8,
    "seed": 111142,
}

In [None]:
movie_title = "beach_bike_1.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/beach_bike.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 1234567890,
}

In [None]:
movie_title = "beach_bike_2.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/beach_bike.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 123,
}

In [None]:
movie_title = "waterfall_2.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/waterfall.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 1234567890,
}

In [None]:
movie_title = "boat_ocean_1.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/boat_ocean.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "red_car_1.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/red_car.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "coffee_1.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/coffee_stream.jpg",
    "width": 576,
    "height": 1024,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "koi_1.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/koi.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 42,
}

In [None]:
movie_title = "koi_2.mp4"

data = {
    "inputs": "https://raw.githubusercontent.com/garystafford/svdxt-sagemaker-huggingface/main/images_scaled/koi.jpg",
    "width": 1024,
    "height": 576,
    "num_frames": 25,
    "num_inference_steps": 25,
    "min_guidance_scale": 1.0,
    "max_guidance_scale": 3.0,
    "fps": 6,
    "motion_bucket_id": 127,
    "noise_aug_strength": 0.02,
    "decode_chunk_size": 8,
    "seed": 9288258982,
}

## Upload Request Payload and Invoke Endpoint


In [None]:
def upload_file(input_location):
    return sm_session.upload_data(
        input_location,
        bucket=sm_session.default_bucket(),
        key_prefix="async_inference/input",
        extra_args={"ContentType": "application/json"},
    )

In [None]:
import json

file_name = "payload.json"

with open(file_name, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
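
A quick client-side check before uploading can catch a malformed payload without waiting for an asynchronous failure. The sketch below is a sanity check under stated assumptions, not the endpoint's actual validation: the key list mirrors the example payloads in this notebook, the divisible-by-8 rule for `width` and `height` comes from the diffusers Stable Video Diffusion pipeline, and `validate_payload` is a hypothetical helper name.

```python
# Keys used by every example payload in this notebook.
REQUIRED_KEYS = {
    "inputs", "width", "height", "num_frames", "num_inference_steps",
    "min_guidance_scale", "max_guidance_scale", "fps", "motion_bucket_id",
    "noise_aug_strength", "decode_chunk_size", "seed",
}
POSITIVE_INT_KEYS = (
    "width", "height", "num_frames", "num_inference_steps", "fps", "decode_chunk_size",
)


def validate_payload(payload):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    missing = REQUIRED_KEYS - set(payload)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in POSITIVE_INT_KEYS:
        value = payload.get(key)
        if value is None:
            continue  # absent keys are already reported as missing
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    for key in ("width", "height"):
        value = payload.get(key)
        if isinstance(value, int) and value >= 1 and value % 8 != 0:
            problems.append(f"{key} must be divisible by 8, got {value}")
    return problems
```

In the notebook, calling `validate_payload(data)` right before the upload and stopping on a non-empty result would be one way to use it.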

In [None]:
input_s3_location = upload_file(file_name)

In [None]:
# use the session's region instead of a hard-coded one
client_sm_runtime = boto3.client(
    "sagemaker-runtime", region_name=sm_session.boto_region_name
)

input_location = input_s3_location
response = client_sm_runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_location,
    InvocationTimeoutSeconds=3600,
)

In [None]:
print(response)

In [None]:
print(response["OutputLocation"])

In [None]:
import time
import urllib.parse


# function reference: https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough-SageMaker-Python-SDK.ipynb
def get_output(output_location):
    # split s3://bucket/key into its bucket and key parts
    output_url = urllib.parse.urlparse(output_location)
    bucket = output_url.netloc
    key = output_url.path[1:]
    while True:
        try:
            return sm_session.read_s3_file(bucket=bucket, key_prefix=key)
        except ClientError as e:
            if e.response["Error"]["Code"] == "NoSuchKey":
                print("waiting for output...")
                time.sleep(5)
                continue
            raise
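
The loop above waits indefinitely, so a failed inference (whose error is written to the failure path rather than the output path) would never end the wait. The following is a minimal sketch of a bounded variant that also checks a failure location. It assumes the `invoke_endpoint_async` response carries a `FailureLocation` key alongside `OutputLocation`; `wait_for_result` is a hypothetical helper, and `read_s3` is a hypothetical callable that returns the object's text, or `None` while the object does not exist yet.

```python
import time


def wait_for_result(
    read_s3,
    output_location,
    failure_location,
    timeout_s=3600.0,
    interval_s=5.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll until the output object appears, the failure object appears, or time runs out."""
    deadline = clock() + timeout_s
    while True:
        body = read_s3(output_location)
        if body is not None:
            return body
        error = read_s3(failure_location)
        if error is not None:
            raise RuntimeError(f"inference failed: {error}")
        if clock() >= deadline:
            raise TimeoutError(f"no result after {timeout_s} seconds")
        sleep(interval_s)
```

In this notebook, `read_s3` could wrap `sm_session.read_s3_file` and return `None` when the `ClientError` code is `NoSuchKey`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.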

In [None]:
%%time

output = get_output(response["OutputLocation"])
print(f"Model response preview: {output[:400]}")

## JSON to MP4 Video

Decode each base64-encoded frame in the response list into a JPEG file, then combine the frames into an MP4.


In [None]:
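import base64
import json
import os

from PIL import Image
from diffusers.utils import export_to_video, make_image_grid

# create the output directories if they do not already exist
os.makedirs("frames_out", exist_ok=True)
os.makedirs("video_out", exist_ok=True)

# parse the response into its own name so the request payload in `data` is preserved
result = json.loads(output)

video_frames = result["frames"]

loaded_video_frames = []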

for idx, video_frame in enumerate(video_frames):
    frame = bytes(video_frame, "raw_unicode_escape")

    frame_name = f"frames_out/imageToSave_{idx+1}.jpg"
    with open(frame_name, "wb") as fh:
        fh.write(base64.decodebytes(frame))

    image = Image.open(frame_name, mode="r")
    loaded_video_frames.append(image)

export_to_video(loaded_video_frames, f"video_out/{movie_title}", fps=6)
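
The cell above writes every frame to disk before loading it. If only the decoded bytes are needed, the frames can be decoded in memory instead. The sketch below assumes, as the cell above does, that the response JSON holds a `frames` list of base64-encoded JPEGs; `decode_frames` is a hypothetical helper, and the start-of-image check is an added sanity test rather than part of the original flow.

```python
import base64
import json

JPEG_SOI = b"\xff\xd8"  # every JPEG stream starts with this two-byte marker


def decode_frames(response_body):
    """Decode the base64 frames in a response body into a list of JPEG byte strings."""
    frames = json.loads(response_body)["frames"]
    decoded = []
    for idx, frame in enumerate(frames, start=1):
        raw = base64.b64decode(frame)
        if not raw.startswith(JPEG_SOI):
            raise ValueError(f"frame {idx} does not look like a JPEG")
        decoded.append(raw)
    return decoded


# In the notebook, each item could then be opened without a temporary file:
# Image.open(io.BytesIO(raw))  # requires `import io` and Pillow
```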

### Display Frames as Grid

Display 25 frames as a 5x5 grid.


In [None]:
image = make_image_grid(loaded_video_frames, 5, 5)
(width, height) = (image.width // 2, image.height // 2)
im_resized = image.resize((width, height))
display(im_resized)
im_resized.save("frames.png")
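
The 5x5 layout above fits exactly 25 frames; `make_image_grid` expects the number of images to equal `rows * cols`. If `num_frames` is changed, the grid shape has to change with it. The following sketch picks the most nearly square exact factorization; `grid_shape` is a hypothetical helper, not part of diffusers.

```python
import math


def grid_shape(n_frames):
    """Return (rows, cols) with rows * cols == n_frames, as close to square as possible."""
    if n_frames < 1:
        raise ValueError("n_frames must be positive")
    rows = math.isqrt(n_frames)
    # Walk down from the integer square root to the nearest exact divisor.
    while n_frames % rows != 0:
        rows -= 1
    return rows, n_frames // rows


print(grid_shape(25))  # (5, 5)
print(grid_shape(14))  # (2, 7)

# In the notebook (not executed here):
# rows, cols = grid_shape(len(loaded_video_frames))
# image = make_image_grid(loaded_video_frames, rows, cols)
```

For a prime frame count this falls back to a single row, which stays valid but gives a very wide image.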

### Display Video

Re-encode the video with the H.264 codec and display it at 50% of its actual size.


In [None]:
# convert video for display in notebook

import ffmpeg

output_options = {
    "crf": 20,
    "preset": "slower",
    "movflags": "faststart",
    "pix_fmt": "yuv420p",
    "vcodec": "libx264",
}

(
    ffmpeg.input(f"video_out/{movie_title}")
    .output("video_out/tmp.mp4", **output_options)
    .run(overwrite_output=True, quiet=True)
)
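
For readers who prefer to call the `ffmpeg` binary directly, the options dictionary above maps roughly onto command-line flags, one `-key value` pair per entry. The sketch below builds such an argument list; `ffmpeg_command` is a hypothetical helper, the mapping is an approximation of what the `ffmpeg-python` call produces, and running it assumes `ffmpeg` is on the `PATH`.

```python
def ffmpeg_command(src, dst, options, overwrite=True):
    """Translate an options dict into an ffmpeg argument list."""
    cmd = ["ffmpeg"]
    if overwrite:
        cmd.append("-y")  # overwrite the output file without prompting
    cmd += ["-i", src]
    for key, value in options.items():
        cmd += [f"-{key}", str(value)]
    cmd.append(dst)
    return cmd


example_options = {
    "crf": 20,
    "preset": "slower",
    "movflags": "faststart",
    "pix_fmt": "yuv420p",
    "vcodec": "libx264",
}
print(" ".join(ffmpeg_command("video_out/in.mp4", "video_out/tmp.mp4", example_options)))

# To execute it (not run here):
# import subprocess
# subprocess.run(ffmpeg_command(src, dst, example_options), check=True)
```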

In [None]:
from IPython.display import Video

Video(
    url="video_out/tmp.mp4",
    width=loaded_video_frames[0].width // 2,
    html_attributes="controls muted autoplay loop",
)