This guide provides instructions for creating and managing a SageMaker HyperPod cluster, as well as implementing the AnimateAnyone algorithm. It is based on the SageMaker HyperPod workshop studio guidance and the Moore-AnimateAnyone repository.
Lifecycle scripts allow customization of your cluster during creation. They can be used to:
- Install software packages
- Set up configurations
- Configure Slurm
- Create users
- Install Conda or Docker
To set up lifecycle scripts:
- Clone the repository and upload scripts to S3:
git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/
aws s3 cp --recursive base-config/ s3://${BUCKET}/src
- Prepare the cluster-config.json and provisioning_parameters.json files.
- Upload the configuration to S3:
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
- Create the cluster:
aws sagemaker create-cluster --cli-input-json file://cluster-config.json --region $AWS_REGION
Examples of cluster-config.json and provisioning_parameters.json can be found in ClusterConfig.
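As a rough sketch of what the two files might contain, the script below writes minimal versions of both. It assumes the field layout of the SageMaker `create-cluster` request and of the provisioning file read by the base lifecycle scripts. The group names `controller-machine` and `worker-group-1`, the instance types and counts, the `dev` partition name, and every default value (bucket, role ARN, subnet, security group, FSx names) are placeholders chosen for illustration, not values from this repository.

```shell
#!/usr/bin/env bash
# Placeholder defaults (assumptions); override them in your environment.
BUCKET="${BUCKET:-my-hyperpod-bucket}"
ROLE="${ROLE:-arn:aws:iam::111122223333:role/ExampleHyperPodRole}"
SUBNET_ID="${SUBNET_ID:-subnet-00000000000000000}"
SECURITY_GROUP="${SECURITY_GROUP:-sg-00000000000000000}"
FSX_DNS_NAME="${FSX_DNS_NAME:-fs-00000000000000000.fsx.us-west-2.amazonaws.com}"
FSX_MOUNTNAME="${FSX_MOUNTNAME:-examplemount}"

# Cluster definition: one controller group and one GPU worker group.
cat > cluster-config.json <<EOF
{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://${BUCKET}/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "${ROLE}",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.g5.2xlarge",
      "InstanceCount": 2,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://${BUCKET}/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "${ROLE}",
      "ThreadsPerCore": 1
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["${SECURITY_GROUP}"],
    "Subnets": ["${SUBNET_ID}"]
  }
}
EOF

# Slurm layout consumed by the lifecycle scripts at node start-up.
cat > provisioning_parameters.json <<EOF
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "worker_groups": [
    { "instance_group_name": "worker-group-1", "partition_name": "dev" }
  ],
  "fsx_dns_name": "${FSX_DNS_NAME}",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOF
```

The names in `controller_group` and `worker_groups` need to match the `InstanceGroupName` values in cluster-config.json, since that is how the lifecycle scripts map instance groups to Slurm roles.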
To increase worker instances:
- Update cluster-config.json with the new instance count.
- Run:
aws sagemaker update-cluster \
    --cluster-name ${my-cluster-name} \
    --instance-groups file://update-cluster-config.json \
    --region $AWS_REGION
An example of update-cluster-config.json can be found in ClusterConfig.
To delete the cluster:
aws sagemaker delete-cluster --cluster-name ${my-cluster-name}
- SageMaker HyperPod supports Amazon FSx for Lustre integration, enabling full bi-directional synchronization with Amazon S3.
- Ensure proper AWS CLI permissions and configurations.
- Review and test configurations before production deployment.
- Monitor cluster usage for cost and performance optimization.
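One way to review a cluster after creating or updating it is to poll its status from the CLI. The sketch below wraps `aws sagemaker describe-cluster`; the function names `cluster_status` and `wait_for_cluster` and the `POLL_SECONDS` variable are hypothetical helpers introduced here, not part of the AWS CLI or this repository.

```shell
#!/usr/bin/env bash
# Print the current status of a HyperPod cluster (e.g. Creating, InService, Failed).
cluster_status() {
  aws sagemaker describe-cluster \
    --cluster-name "$1" \
    --query 'ClusterStatus' \
    --output text
}

# Poll until the cluster reaches InService; return non-zero on Failed
# or if the describe call itself fails.
wait_for_cluster() {
  local name="$1" status
  while true; do
    status="$(cluster_status "$name")" || return 1
    echo "cluster ${name}: ${status}"
    case "$status" in
      InService) return 0 ;;
      Failed)    return 1 ;;
    esac
    # POLL_SECONDS is an assumed knob for the polling interval.
    sleep "${POLL_SECONDS:-30}"
  done
}
```

For example, `wait_for_cluster ml-cluster` blocks until the cluster is usable. `aws sagemaker list-cluster-nodes --cluster-name ml-cluster` can then be used to inspect the individual nodes.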
Follow the guidance on Accessing SageMaker HyperPod cluster nodes.
./easy-ssh.sh -c controller-machine ml-cluster
sudo su - ubuntu
For VS Code connection, follow this guide to set up an SSH Proxy via SSM.
First-time login to the controller node:
cd ~/.ssh
ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
cat id_rsa.pub >> authorized_keys
Allocate and access a worker node:
salloc -N 1
ssh $(srun hostname)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -f -p ~/miniconda3
source ~/miniconda3/bin/activate
conda create -n videogen python=3.10
conda activate videogen
- List partitions and nodes:
sinfo
- List queued/running jobs:
squeue
Based on the Moore-AnimateAnyone repository.
- Activate the conda environment:
source ~/miniconda3/bin/activate
conda activate videogen
- Install required packages:
pip install -r requirements.txt
- Download pre-trained weights:
python tools/download_weights.py
- Test the training script:
accelerate launch train_stage_1.py --config configs/train/stage1.yaml
accelerate launch train_stage_2.py --config configs/train/stage2.yaml
Detailed instructions can be found here.
sbatch submit-animateanyone-algo.sh
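For readers new to Slurm, a batch file for a single-node run could look roughly like the one generated below. This is only a sketch under stated assumptions, not the contents of submit-animateanyone-algo.sh: the file name `example-train-stage1.sbatch`, the job name, and the `#SBATCH` settings are illustrative, and it assumes the Miniconda install and `videogen` environment described earlier are visible on the worker node.

```shell
#!/usr/bin/env bash
# Write a hypothetical single-node batch script; the quoted 'EOF' keeps
# the body literal so nothing is expanded at generation time.
cat > example-train-stage1.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=animateanyone-stage1
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --output=%x_%j.out

# Activate the conda environment created earlier (assumed path).
source ~/miniconda3/bin/activate
conda activate videogen

# Run stage 1 training with the repository's stage 1 config.
accelerate launch train_stage_1.py --config configs/train/stage1.yaml
EOF
```

It would be submitted from the repository root with `sbatch example-train-stage1.sbatch`; `%x_%j.out` places the log in the submission directory, named after the job name and job ID.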
Note: For smaller GPU instances (e.g., g5.2xlarge), set train_bs: 2, train_width: 256, and train_height: 256 to avoid out-of-memory issues. See a configuration example in AlgoSlurm.
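The adjustment in the note can be scripted so the original config stays untouched. The helper below is a minimal sketch: the function name `shrink_train_config` and the output file name are hypothetical, and it assumes each of the three keys sits on its own `key: value` line in the YAML file.

```shell
#!/usr/bin/env bash
# Copy a training config, overriding batch size and resolution for small GPUs.
# Any trailing comment on the three rewritten lines is dropped.
shrink_train_config() {
  local src="$1" dst="$2"
  sed -E \
    -e 's/^([[:space:]]*train_bs:).*/\1 2/' \
    -e 's/^([[:space:]]*train_width:).*/\1 256/' \
    -e 's/^([[:space:]]*train_height:).*/\1 256/' \
    "$src" > "$dst"
}
```

For example, `shrink_train_config configs/train/stage1.yaml configs/train/stage1-small.yaml` produces a reduced copy that can be passed to `accelerate launch train_stage_1.py --config configs/train/stage1-small.yaml`.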
sbatch submit-hyperparameter-testing.sh
Detailed instructions can be found here.
The folder contains the single-node multi-GPU setup as well as the multi-node multi-GPU Slurm launch file.
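To illustrate how a multi-node launch typically differs from a single-node one, the sketch below generates a batch script that starts one `accelerate launch` per node through `srun`. It is not the launch file from the folder: the file name `example-multinode.sbatch`, the node count, `GPUS_PER_NODE`, and the port are assumptions, and it presumes the conda environment and repository live on storage shared by all nodes.

```shell
#!/usr/bin/env bash
# Write a hypothetical two-node batch script; the quoted 'EOF' keeps the
# body literal so Slurm variables are expanded only when the job runs.
cat > example-multinode.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=animateanyone-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive

source ~/miniconda3/bin/activate
conda activate videogen

# Assumption: number of GPUs on each worker node.
export GPUS_PER_NODE=1
# Use the first node of the allocation as the rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one launcher per node; SLURM_NODEID gives each its machine rank.
srun bash -c 'accelerate launch \
    --multi_gpu \
    --num_machines "$SLURM_NNODES" \
    --num_processes "$((SLURM_NNODES * GPUS_PER_NODE))" \
    --machine_rank "$SLURM_NODEID" \
    --main_process_ip "$MASTER_ADDR" \
    --main_process_port "$MASTER_PORT" \
    train_stage_1.py --config configs/train/stage1.yaml'
EOF
```

`--num_processes` is the total across all machines, which is why it is computed as nodes times GPUs per node. The variables are exported so that the shells started by `srun` on each node can see them.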
Use MLflow for visualization:
mlflow ui --backend-store-uri ./mlruns/
You can either try a quick inference on a g5.2xlarge SageMaker notebook instance by walking through the inference code in inference, or deploy an inference endpoint on SageMaker. Please refer to the Inference README for more details.
Recent advancements in video generation have rapidly overcome limitations of earlier models like Animate Anyone. Two notable research papers showcase significant progress in this domain:
- Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance enhances shape alignment and motion guidance. It demonstrates superior ability in generating high-quality human animations that accurately capture both pose and shape variations, with improved generalization on in-the-wild datasets.
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation enables the generation of longer videos, up to one minute, compared to earlier models' limited frame outputs. It introduces a unified noise input supporting both random noised input and first frame conditioned input, enhancing long-term video generation capabilities.
As research in this field rapidly progresses, SageMaker HyperPod proves invaluable for AI research and experimentation. It provides the computational resources and flexibility needed to quickly implement and test new ideas, accelerating advancements in video generation and related AI technologies. Its scalable infrastructure allows researchers to efficiently train and fine-tune large models, reducing the time from concept to implementation. Its integrated development environment streamlines the workflow, enabling faster iterations and more comprehensive experiments. By leveraging such cloud computing solutions, researchers can push the boundaries of what is possible in video generation, potentially leading to breakthroughs in areas like virtual reality, film production, and interactive digital media.
- Original Moore-AnimateAnyone Repository
- SageMaker HyperPod Workshop Studio
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
- Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation
- Ensure proper GPU resources and CUDA setup before running experiments.
- Adjust batch files and configurations as needed for your environment.
- Regularly check the original repository for updates or changes.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.