Automated text-to-image solution for training fine-tuned Stable Diffusion XL 1.0 model with Kohya SS #4481

Status: Open. Wants to merge 52 commits into base: main.
Commits (52), showing changes from all commits:
8876aba
Create README.md
azograby Nov 12, 2023
02a9434
initial commit
azograby Nov 12, 2023
46a429c
Updated s3 bucket name
azograby Nov 13, 2023
2c9d9a2
Updated directions
azograby Nov 13, 2023
2ef05f8
Updated commands and notes
azograby Nov 13, 2023
ee65e05
Added encryption comment
azograby Nov 13, 2023
5518e1f
Reviewed solution, minor changes
azograby Nov 15, 2023
f2ef3bf
Updated instructions
azograby Nov 16, 2023
fd40e2d
Added fix for pytorch/xformers incompatibility
azograby Nov 16, 2023
80c5f4d
No change
azograby Nov 17, 2023
9a302e8
Added notebook test badge
azograby Nov 17, 2023
d19c525
Delete use-cases/text-to-image-fine-tuning/Dockerfile
azograby Nov 17, 2023
39ea383
Delete use-cases/text-to-image-fine-tuning/buildspec.yml
azograby Nov 17, 2023
6392482
Delete use-cases/text-to-image-fine-tuning/train
azograby Nov 17, 2023
dea84c2
Delete use-cases/text-to-image-fine-tuning/kohya-sdxl-config.toml
azograby Nov 17, 2023
7f32a5f
Update README.md with text-to-image solution
azograby Nov 17, 2023
c4a4d3e
Update README.md
azograby Nov 17, 2023
8b1dcad
Update Dockerfile to v22.6.0
azograby Feb 9, 2024
ad5ee8a
Update template.yml to v22.6.0
azograby Feb 9, 2024
ca0f13d
Update Dockerfile to version 22.6.2
azograby Mar 12, 2024
71c9869
Update template.yml to use kohya version 22.6.2
azograby Mar 12, 2024
5b7b2f2
Merge branch 'main' into main
azograby Mar 19, 2024
1047cbe
formatted black-nb, CodeGuru comments updated
Mar 19, 2024
04b9eb0
spelling and grammar updates
Mar 19, 2024
47ad6a7
revise wording for s3 section
azograby Apr 26, 2024
c394900
updated the directory from "root" to "/home/sagemaker-user"
azograby Apr 26, 2024
6e9a167
commented out user in docker
azograby Apr 27, 2024
ac1af31
removed comments in step 2
azograby Apr 28, 2024
671e2a8
updated instructions for inference
azograby Apr 29, 2024
70c79f0
added example output images
azograby Apr 30, 2024
276acdb
updated links
azograby Apr 30, 2024
e3cd6c2
added example high res links
azograby Apr 30, 2024
8529ea6
updated version number
azograby Apr 30, 2024
1664df6
Merge branch 'aws:main' into main
azograby May 1, 2024
213bf41
Merge branch 'main' into main
azograby May 2, 2024
4d7b7f4
updated iam permissions and s3 config
azograby May 2, 2024
2191834
Update notebook metadata
azograby May 2, 2024
2bc9ec6
Update kohya-ss-fine-tuning.ipynb kernelspec
azograby May 2, 2024
24fa714
added prerequisite instructions
azograby May 6, 2024
069fa57
added time to complete certain steps
azograby May 6, 2024
b23b9ea
updated CLI command to work with the sagemaker repo image
azograby May 6, 2024
bf3be9a
updated to work with the sagemaker PR container image
azograby May 6, 2024
74028e2
updated to work with the sagemaker PR container image
azograby May 7, 2024
1aaaaa2
updated to work with the sagemaker PR container image
azograby May 7, 2024
f289e26
updated to work with the sagemaker PR container image
azograby May 7, 2024
1b81339
added clean up instructions, modified studio instructions slightly
azograby May 7, 2024
5b44dcd
updated to use f string formatting
azograby May 7, 2024
6732289
added region variable instead of 'us-west-2'
azograby May 21, 2024
435df13
use ECR image instead of Docker Hub image
azograby May 27, 2024
820045f
updated notebook instructions - changed from 50gb to 100gb and update…
azograby May 28, 2024
182a8a1
updated readme
azograby Jun 5, 2024
ed099f3
updated pipeline to include validation
azograby Jun 5, 2024
7 changes: 7 additions & 0 deletions README.md
@@ -347,6 +347,13 @@ These examples show you how to use model-packages and algorithms from AWS Market
- [Using Dataset Product from AWS Data Exchange with ML model from AWS Marketplace](aws_marketplace/using_data/using_data_with_ml_model) is a sample notebook which shows how a dataset from AWS Data Exchange can be used with an ML Model Package from AWS Marketplace.
- [Using Shutterstock Image Datasets to train Image Classification Models](aws_marketplace/using_data/image_classification_with_shutterstock_image_datasets) provides a detailed walkthrough on how to use the [Free Sample: Images & Metadata of “Whole Foods” Shoppers](https://aws.amazon.com/marketplace/pp/prodview-y6xuddt42fmbu?qid=1623195111604&sr=0-1&ref_=srh_res_product_title#offers) from Shutterstock's Image Datasets to train a multi-label image classification model using Shutterstock's pre-labeled image assets. You can learn more about this implementation [from this blog post](https://aws.amazon.com/blogs/awsmarketplace/using-shutterstocks-image-datasets-to-train-your-computer-vision-models/).

### Using Amazon SageMaker for Generative AI use cases

These examples show you how to use AWS services for Generative AI use cases.

- Text-to-image
- [Fine-tune a Stable Diffusion XL model with Kohya SS](use-cases/text-to-image-fine-tuning) provides an automated solution that creates the components needed to fine-tune a custom Stable Diffusion XL model.

## :balance_scale: License

This library is licensed under the [Apache 2.0 License](http://aws.amazon.com/apache2.0/).
69 changes: 69 additions & 0 deletions use-cases/text-to-image-fine-tuning/README.md
@@ -0,0 +1,69 @@
# Stable Diffusion XL Fine-Tuning with Kohya SS

This solution creates all the components you need to get started quickly with fine-tuning Stable Diffusion XL on a custom dataset, using a custom training container that leverages Kohya SS to perform the fine-tuning. Stable Diffusion lets you generate images from text prompts. Training is coordinated by a SageMaker pipeline that runs a SageMaker training job, and the solution automates many of the tedious infrastructure-setup tasks that training normally requires. You will use the "kohya-ss-fine-tuning" notebook to set up the solution, but first get familiar with the solution components described in this README, which are labeled with their default resource names from the CloudFormation template.

![Architecture Diagram](kohya-ss-fine-tuning.jpg)

*Prerequisites:*
1. A SageMaker Domain configured (to be used with SageMaker Studio).
2. The [required permissions](https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-16550/sagemaker-policy.json) added to the SageMaker Execution Role for your domain (a minimal scripted sketch follows this list).
3. A SageMaker Domain User Profile configured.
4. If you will run the CloudFormation template via the console, your user must be assigned the proper IAM permissions.
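
If you prefer to script prerequisite 2 rather than use the IAM console, the following is a minimal sketch using boto3. The role name and inline policy name are placeholders rather than values defined by this solution; substitute the execution role attached to your domain.

```python
# Sketch only: attach the required permissions as an inline policy on the
# SageMaker execution role. The role and policy names below are placeholders.
import urllib.request

import boto3

POLICY_URL = "https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-16550/sagemaker-policy.json"
ROLE_NAME = "AmazonSageMaker-ExecutionRole-XXXXXXXXXXXX"  # replace with your domain's execution role

policy_document = urllib.request.urlopen(POLICY_URL).read().decode("utf-8")

boto3.client("iam").put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="kohya-ss-fine-tuning-permissions",  # arbitrary name for the inline policy
    PolicyDocument=policy_document,
)
```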

*Follow these steps to get started:*

1. Navigate to Amazon SageMaker Studio in your AWS account. Run your JupyterLab space.
2. Click on "Terminal".
3. Next, check out just the required directories of the SageMaker Examples Git repository (so you don't have to download the entire repo). Run the following commands from the terminal. If successful, you should see the output "Your branch is up to date with 'origin/main'".

git clone --no-checkout https://github.com/azograby/amazon-sagemaker-examples.git
# TODO: change to official aws examples repo once personal repo PR is approved and merged. PR: https://github.com/aws/amazon-sagemaker-examples/pull/4481
# git clone --no-checkout https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/
git sparse-checkout set use-cases/text-to-image-fine-tuning
git checkout

4. In Amazon SageMaker Studio, in the left-hand navigation pane, click the File Browser and navigate to the project directory (amazon-sagemaker-examples/use-cases/text-to-image-fine-tuning). Open the Jupyter Notebook named "kohya-ss-fine-tuning.ipynb".
5. The default runtime kernel is set to Python 3 automatically, so you have a kernel that is ready to run commands. Continue with this notebook to start setting up your solution.

---

## Solution Components:

* **S3 Bucket: sagemaker-kohya-ss-fine-tuning-\<accountid\>**
* S3 bucket where the custom dataset is uploaded (images, captions, and the Kohya SS configuration)
* These files will be uploaded from the notebook (see the sketch at the end of this component list)
* The SageMaker pipeline orchestrates the training and outputs the model to this same S3 bucket

* **CodeCommit Repository: kohya-ss-fine-tuning-container-image**
* Contains the source code to build the training container image (Dockerfile)
* Contains the training code (train.py)
* Contains the build spec used by CodeBuild to create the docker image (buildspec.yml)
* Changes to these files will trigger a new container image to be built

* **ECR Repository: kohya-ss-fine-tuning**
* ECR repository for hosting the training container image
* Container image will be built and pushed to this repository
* Container image contains the [Kohya SS](https://github.com/bmaltais/kohya_ss.git) program (used to train the custom SDXL model)

* **CodeBuild Project: kohya-ss-fine-tuning-build-container**
* Builds the training container image and pushes it to ECR
* Environment variables in template.yml can be modified to change the Kohya SS version
* The GitHub repository (https://github.com/bmaltais/kohya_ss.git) has been tested as of version v22.6.2
* If you use a newer version, check the Dockerfile and the docker-compose.yaml file in that repository, as well as the SDXL training entrypoint (sdxl_train_network.py) called by the custom "train" file in this repository, to see whether any modifications are needed

* **EventBridge Rule: kohya-ss-fine-tuning-trigger-new-image-build-rule**
* Updating the CodeCommit repository code triggers the CodeBuild project that builds the new training container image
* This rule does NOT kick off a training job

* **SageMaker Pipeline: kohya-ss-fine-tuning-pipeline**
* Orchestrates training the custom model
* Takes the custom training image from ECR and the dataset/config located in S3, and initiates a SageMaker training job
* Once complete, it outputs the model to the same S3 bucket, which you can then use to run inference
* This pipeline must be executed manually with specific input parameters, which you may override (see the sketch at the end of this component list)

* **IAM Roles**
* SageMaker Pipeline execution (custom-sagemaker-pipeline-execution-role)
* SageMaker Service role (custom-sagemaker-service-role)
* CodeBuild Service role (custom-codebuild-service-role)
* EventBridge role (custom-build-new-training-container-image-rule-role)
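
As referenced in the S3 bucket and SageMaker Pipeline components above, the notebook handles uploading the dataset and starting the pipeline. The following is a minimal boto3 sketch of the same flow, assuming the default resource names; the object key prefixes and pipeline parameter names are illustrative placeholders, so use the values defined in template.yml and the notebook.

```python
# Sketch only: upload training inputs and start the fine-tuning pipeline.
# Key prefixes and parameter names are placeholders; see the notebook for the real values.
import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
bucket = f"sagemaker-kohya-ss-fine-tuning-{account_id}"

s3 = boto3.client("s3")
# Upload an example image, its caption, and the Kohya SS training config (placeholder keys)
for local_path, key in [
    ("dataset/img/001.png", "dataset/img/001.png"),
    ("dataset/img/001.txt", "dataset/img/001.txt"),
    ("kohya-sdxl-config.toml", "config/kohya-sdxl-config.toml"),
]:
    s3.upload_file(local_path, bucket, key)

# Manually start a pipeline execution, overriding input parameters as needed
response = boto3.client("sagemaker").start_pipeline_execution(
    PipelineName="kohya-ss-fine-tuning-pipeline",
    PipelineParameters=[  # hypothetical parameter names
        {"Name": "InputDataS3Uri", "Value": f"s3://{bucket}/dataset/"},
        {"Name": "TrainingInstanceType", "Value": "ml.g5.8xlarge"},
    ],
)
print(response["PipelineExecutionArn"])
```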
132 changes: 132 additions & 0 deletions use-cases/text-to-image-fine-tuning/code/Dockerfile
@@ -0,0 +1,132 @@
# This Dockerfile has been modified slightly from https://github.com/bmaltais/kohya_ss/blob/v22.6.2/Dockerfile,
# to work with SageMaker training jobs

# syntax=docker/dockerfile:1
ARG UID=1000
ARG VERSION=EDGE
ARG RELEASE=0

# Use the image from ECR to avoid potential Docker Hub rate limits for unauthenticated requests https://www.docker.com/increase-rate-limits/
# FROM python:3.10-slim as build
FROM public.ecr.aws/docker/library/python:3.10-slim as build

# RUN mount cache for multi-arch: https://github.com/docker/buildx/issues/549#issuecomment-1788297892
ARG TARGETARCH
ARG TARGETVARIANT

WORKDIR /app

# Install under /root/.local
ENV PIP_USER="true"
ARG PIP_NO_WARN_SCRIPT_LOCATION=0
ARG PIP_ROOT_USER_ACTION="ignore"

# Install build dependencies
RUN apt-get update && apt-get upgrade -y && \
apt-get install -y --no-install-recommends python3-launchpadlib git curl && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# Install PyTorch and TensorFlow
# The versions must align and be in sync with the requirements_linux_docker.txt
# hadolint ignore=SC2102
RUN --mount=type=cache,id=pip-$TARGETARCH$TARGETVARIANT,sharing=locked,target=/root/.cache/pip \
pip install -U --extra-index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.nvidia.com \
torch==2.1.2 torchvision==0.16.2 \
xformers==0.0.23.post1 \
# Why [and-cuda]: https://github.com/tensorflow/tensorflow/issues/61468#issuecomment-1759462485
tensorflow[and-cuda]==2.14.0 \
ninja \
pip setuptools wheel

# Install requirements
RUN --mount=type=cache,id=pip-$TARGETARCH$TARGETVARIANT,sharing=locked,target=/root/.cache/pip \
--mount=source=requirements_linux_docker.txt,target=requirements_linux_docker.txt \
--mount=source=requirements.txt,target=requirements.txt \
--mount=source=setup/docker_setup.py,target=setup.py \
pip install -r requirements_linux_docker.txt -r requirements.txt

# Replace pillow with pillow-simd (Only for x86)
ARG TARGETPLATFORM
RUN if [ "$TARGETPLATFORM" = "linux/amd64" ]; then \
apt-get update && apt-get install -y --no-install-recommends zlib1g-dev libjpeg62-turbo-dev build-essential && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
pip uninstall -y pillow && \
CC="cc -mavx2" pip install -U --force-reinstall pillow-simd; \
fi

# Use the image from ECR to avoid potential Docker Hub rate limits for unauthenticated requests https://www.docker.com/increase-rate-limits/
# FROM python:3.10-slim as final
FROM public.ecr.aws/docker/library/python:3.10-slim as final

ARG UID
ARG VERSION
ARG RELEASE

LABEL name="bmaltais/kohya_ss" \
vendor="bmaltais" \
maintainer="bmaltais" \
# Dockerfile source repository
url="https://github.com/bmaltais/kohya_ss" \
version=${VERSION} \
# This should be a number, incremented with each change
release=${RELEASE} \
io.k8s.display-name="kohya_ss" \
summary="Kohya's GUI: This repository provides a Gradio GUI for Kohya's Stable Diffusion trainers(https://github.com/kohya-ss/sd-scripts)." \
description="The GUI allows you to set the training parameters and generate and run the required CLI commands to train the model. This is the docker image for Kohya's GUI. For more information about this tool, please visit the following website: https://github.com/bmaltais/kohya_ss."

# Install runtime dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends libgl1 libglib2.0-0 libjpeg62 libtcl8.6 libtk8.6 libgoogle-perftools-dev dumb-init && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# Fix missing libnvinfer7
RUN ln -s /usr/lib/x86_64-linux-gnu/libnvinfer.so /usr/lib/x86_64-linux-gnu/libnvinfer.so.7 && \
ln -s /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7

# Amazon SageMaker: Copy the train file to the specified directory that Amazon SageMaker Training jobs will use
COPY ./train /opt/program/

# Create user
RUN groupadd -g $UID $UID && \
useradd -l -u $UID -g $UID -m -s /bin/sh -N $UID

# Create directories with correct permissions
RUN install -d -m 775 -o $UID -g 0 /dataset && \
install -d -m 775 -o $UID -g 0 /licenses && \
install -d -m 775 -o $UID -g 0 /app

# Copy dist and support arbitrary user ids (OpenShift best practice)
COPY --chown=$UID:0 --chmod=775 \
--from=build /root/.local /home/$UID/.local

WORKDIR /app
COPY --chown=$UID:0 --chmod=775 . .

# Copy licenses (OpenShift Policy)
COPY --chmod=775 LICENSE.md /licenses/LICENSE.md

# Amazon SageMaker: Add /opt/program to the path. The "train" program resides here.
ENV PATH="/home/$UID/.local/bin:$PATH:/opt/program"
ENV PYTHONPATH="${PYTHONPATH}:/home/$UID/.local/lib/python3.10/site-packages"
ENV LD_PRELOAD=libtcmalloc.so
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

VOLUME [ "/dataset" ]

# 7860: Kohya GUI
# 6006: TensorBoard
EXPOSE 7860 6006

# Amazon SageMaker: Commenting out to avoid permission issues with access to /opt/ml when invoked by an Amazon SageMaker training job
# USER $UID

STOPSIGNAL SIGINT

# Use dumb-init as PID 1 to handle signals properly
ENTRYPOINT ["dumb-init", "--"]

# We will not be using the GUI. Instead, we call this program programmatically from the "train" file
CMD ["python3", "kohya_gui.py", "--listen", "0.0.0.0", "--server_port", "7860"]
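
The SageMaker-specific comments above exist so that a SageMaker training job can invoke the bundled "train" program from /opt/program. As a hedged aside (the solution itself uses the SageMaker pipeline for this), a one-off training job could be launched against the pushed ECR image with the SageMaker Python SDK roughly as follows; the image tag, role ARN, instance type, and S3 paths are placeholders.

```python
# Sketch only: launch a one-off SageMaker training job against the custom image,
# outside the pipeline. Image tag, role ARN, instance type, and S3 paths are placeholders.
import boto3
from sagemaker.estimator import Estimator

account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.session.Session().region_name
bucket = f"sagemaker-kohya-ss-fine-tuning-{account_id}"

estimator = Estimator(
    image_uri=f"{account_id}.dkr.ecr.{region}.amazonaws.com/kohya-ss-fine-tuning:latest",
    role="arn:aws:iam::123456789012:role/custom-sagemaker-service-role",  # placeholder ARN
    instance_count=1,
    instance_type="ml.g5.8xlarge",
    output_path=f"s3://{bucket}/models/",
)
# The 'train' channel is copied to /opt/ml/input/data/train inside the container,
# where the train program expects the dataset and kohya-sdxl-config.toml
estimator.fit({"train": f"s3://{bucket}/dataset/"})
```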
35 changes: 35 additions & 0 deletions use-cases/text-to-image-fine-tuning/code/buildspec.yml
@@ -0,0 +1,35 @@
version: 0.2

phases:
pre_build:
commands:
# Log into ECR, and if the ECR repository doesn't exist then create one
- echo Logging in to Amazon ECR...
- aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
- aws ecr describe-repositories --repository-names ${IMAGE_REPO_NAME} || aws ecr create-repository --repository-name ${IMAGE_REPO_NAME}
build:
commands:
- echo Build started on `date`
# Clone the specific version of the GitHub repository for Kohya SS
# Tested as of version v22.6.2. If you use newer versions, you will want to check the Dockerfile and the docker-compose.yaml file in the
# repository, and the training entrypoint for SDXL (sdxl_train_network.py) in the custom "train" file located in this repository
- git clone https://github.com/bmaltais/kohya_ss.git --branch ${KOHYA_SS_VERSION}
# Overwrite the Dockerfile with the custom one we have in the CodeCommit repository
# Our Dockerfile has minor tweaks to enable using SageMaker Training Jobs
- mv -f ./Dockerfile ./kohya_ss/Dockerfile
# Move the custom train file we have in the CodeCommit repository, to the kohya_ss directory since this is the build context for docker-compose
# Right now, the train filename does not conflict with anything in the kohya_ss directory for this version of code,
# but eventually the Docker context may be changed so we don't have to add the train file to this directory
- chmod +x ./train && mv -f ./train ./kohya_ss/train
- cd kohya_ss
- echo Building the Docker image that will be used for training...
# Build and tag the container image
- docker compose build --no-cache=true --progress=plain
# The "kohya-ss-gui:latest" name comes from the Kohya SS Docker config: https://github.com/bmaltais/kohya_ss/blob/v22.6.2/docker-compose.yaml#L5
- docker tag kohya-ss-gui:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
post_build:
commands:
# Push the container image to ECR
- echo Build completed on `date`
- echo Pushing the Docker image...
- docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
65 changes: 65 additions & 0 deletions use-cases/text-to-image-fine-tuning/code/train
@@ -0,0 +1,65 @@
#!/usr/bin/env python

from __future__ import print_function

import os
import sys
import traceback
import subprocess

prefix = '/opt/ml/'

input_path = prefix + 'input/data'
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')

# This algorithm has a single channel of input data called 'train'. Since we run in
# File mode, the input files are copied to the directory specified here
channel_name='train'
training_path = os.path.join(input_path, channel_name)

# The function that executes the training; it is called by the SageMaker training job
def train():
print('Starting the training.')
try:
input_files = [ os.path.join(training_path, file) for file in os.listdir(training_path) ]
if len(input_files) == 0:
error_message = (
f"There are no files in {training_path}.\n"
f"This usually indicates that the channel ({channel_name}) was incorrectly specified,\n"
"the data specification in S3 was incorrectly specified or the role specified\n"
"does not have permission to access the data."
)
raise ValueError(error_message)

# Call the program that was installed in the training container, which uses the kohya-ss libraries to train
# Stable Diffusion XL given the kohya-sdxl-config.toml file that is present in the training S3 bucket
# In the future, parameters such as num_cpu_threads_per_process could be set and retrieved via SageMaker hyperparameters (see the sketch after this file)
subprocess.run(
["accelerate",
"launch",
"--num_cpu_threads_per_process=2",
"./sdxl_train_network.py",
"--config_file",
os.path.join(training_path, 'kohya-sdxl-config.toml')], check=True
)

print('Training complete.')
except Exception as e:
# Write out an error file. This will be returned as the failureReason in the DescribeTrainingJob result
traceback_str = traceback.format_exc()
failure_log_path = os.path.join(output_path, 'failure')

with open(failure_log_path, 'w') as log_file:
log_file.write(f'Exception during training: {str(e)}\n{traceback_str}')
# Printing this causes the exception to be in the training job logs, as well.
print(f'Exception during training: {str(e)}\n{traceback_str}', file=sys.stderr)

# A non-zero exit code causes the training job to be marked as Failed.
sys.exit(255)

if __name__ == '__main__':
train()

# A zero exit code causes the job to be marked as Succeeded.
sys.exit(0)
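
A minimal sketch of the hyperparameter idea mentioned in the comment above the "accelerate launch" call: SageMaker training jobs write the job's hyperparameters as JSON string values to /opt/ml/input/config/hyperparameters.json, so train() could read an optional value instead of hard-coding num_cpu_threads_per_process. The key name below is an assumption; nothing in the current pipeline sets it.

```python
# Sketch only: read an optional hyperparameter supplied to the SageMaker training job.
# SageMaker serializes every hyperparameter value as a string in this JSON file.
import json
import os

HYPERPARAMS_PATH = '/opt/ml/input/config/hyperparameters.json'

def get_hyperparameter(name, default):
    if os.path.exists(HYPERPARAMS_PATH):
        with open(HYPERPARAMS_PATH) as f:
            return json.load(f).get(name, default)
    return default

# Hypothetical usage inside train():
num_threads = get_hyperparameter('num_cpu_threads_per_process', '2')
# subprocess.run(["accelerate", "launch",
#                 f"--num_cpu_threads_per_process={num_threads}", ...], check=True)
```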