Automated text-to-image solution for training fine-tuned Stable Diffusion XL 1.0 model with Kohya SS #4481

Status: Open. Wants to merge 52 commits into base: main.
Commits (52), showing changes from all commits:
8876aba
Create README.md
azograby Nov 12, 2023
02a9434
initial commit
azograby Nov 12, 2023
46a429c
Updated s3 bucket name
azograby Nov 13, 2023
2c9d9a2
Updated directions
azograby Nov 13, 2023
2ef05f8
Updated commands and notes
azograby Nov 13, 2023
ee65e05
Added encryption comment
azograby Nov 13, 2023
5518e1f
Reviewed solution, minor changes
azograby Nov 15, 2023
f2ef3bf
Updated instructions
azograby Nov 16, 2023
fd40e2d
Added fix for pytorch/xformers incompatibility
azograby Nov 16, 2023
80c5f4d
No change
azograby Nov 17, 2023
9a302e8
Added notebook test badge
azograby Nov 17, 2023
d19c525
Delete use-cases/text-to-image-fine-tuning/Dockerfile
azograby Nov 17, 2023
39ea383
Delete use-cases/text-to-image-fine-tuning/buildspec.yml
azograby Nov 17, 2023
6392482
Delete use-cases/text-to-image-fine-tuning/train
azograby Nov 17, 2023
dea84c2
Delete use-cases/text-to-image-fine-tuning/kohya-sdxl-config.toml
azograby Nov 17, 2023
7f32a5f
Update README.md with text-to-image solution
azograby Nov 17, 2023
c4a4d3e
Update README.md
azograby Nov 17, 2023
8b1dcad
Update Dockerfile to v22.6.0
azograby Feb 9, 2024
ad5ee8a
Update template.yml to v22.6.0
azograby Feb 9, 2024
ca0f13d
Update Dockerfile to version 22.6.2
azograby Mar 12, 2024
71c9869
Update template.yml to use kohya version 22.6.2
azograby Mar 12, 2024
5b7b2f2
Merge branch 'main' into main
azograby Mar 19, 2024
1047cbe
formatted black-nb, CodeGuru comments updated
Mar 19, 2024
04b9eb0
spelling and grammar updates
Mar 19, 2024
47ad6a7
revise wording for s3 section
azograby Apr 26, 2024
c394900
updated the directory from "root" to "/home/sagemaker-user"
azograby Apr 26, 2024
6e9a167
commented out user in docker
azograby Apr 27, 2024
ac1af31
removed comments in step 2
azograby Apr 28, 2024
671e2a8
updated instructions for inference
azograby Apr 29, 2024
70c79f0
added example output images
azograby Apr 30, 2024
276acdb
updated links
azograby Apr 30, 2024
e3cd6c2
added example high res links
azograby Apr 30, 2024
8529ea6
updated version number
azograby Apr 30, 2024
1664df6
Merge branch 'aws:main' into main
azograby May 1, 2024
213bf41
Merge branch 'main' into main
azograby May 2, 2024
4d7b7f4
updated iam permissions and s3 config
azograby May 2, 2024
2191834
Update notebook metadata
azograby May 2, 2024
2bc9ec6
Update kohya-ss-fine-tuning.ipynb kernelspec
azograby May 2, 2024
24fa714
added prerequisite instructions
azograby May 6, 2024
069fa57
added time to complete certain steps
azograby May 6, 2024
b23b9ea
updated CLI command to work with the sagemaker repo image
azograby May 6, 2024
bf3be9a
updated to work with the sagemaker PR container image
azograby May 6, 2024
74028e2
updated to work with the sagemaker PR container image
azograby May 7, 2024
1aaaaa2
updated to work with the sagemaker PR container image
azograby May 7, 2024
f289e26
updated to work with the sagemaker PR container image
azograby May 7, 2024
1b81339
added clean up instructions, modified studio instructions slightly
azograby May 7, 2024
5b44dcd
updated to use f string formatting
azograby May 7, 2024
6732289
added region variable instead of 'us-west-2'
azograby May 21, 2024
435df13
use ECR image instead of Docker Hub image
azograby May 27, 2024
820045f
updated notebook instructions - changed from 50gb to 100gb and update…
azograby May 28, 2024
182a8a1
updated readme
azograby Jun 5, 2024
ed099f3
updated pipeline to include validation
azograby Jun 5, 2024
7 changes: 7 additions & 0 deletions README.md
@@ -347,6 +347,13 @@ These examples show you how to use model-packages and algorithms from AWS Market
- [Using Dataset Product from AWS Data Exchange with ML model from AWS Marketplace](aws_marketplace/using_data/using_data_with_ml_model) is a sample notebook which shows how a dataset from AWS Data Exchange can be used with an ML Model Package from AWS Marketplace.
- [Using Shutterstock Image Datasets to train Image Classification Models](aws_marketplace/using_data/image_classification_with_shutterstock_image_datasets) provides a detailed walkthrough on how to use the [Free Sample: Images & Metadata of “Whole Foods” Shoppers](https://aws.amazon.com/marketplace/pp/prodview-y6xuddt42fmbu?qid=1623195111604&sr=0-1&ref_=srh_res_product_title#offers) from Shutterstock's Image Datasets to train a multi-label image classification model using Shutterstock's pre-labeled image assets. You can learn more about this implementation [from this blog post](https://aws.amazon.com/blogs/awsmarketplace/using-shutterstocks-image-datasets-to-train-your-computer-vision-models/).

### Using Amazon SageMaker for Generative AI use cases

These examples show you how to use AWS services for Generative AI use cases.

- Text-to-image
- [Fine-tune a Stable Diffusion XL model with Kohya SS](use-cases/text-to-image-fine-tuning) provides an automated solution that creates the components needed to fine-tune a custom Stable Diffusion XL model.

## :balance_scale: License

This library is licensed under the [Apache 2.0 License](http://aws.amazon.com/apache2.0/).
69 changes: 69 additions & 0 deletions use-cases/text-to-image-fine-tuning/README.md
@@ -0,0 +1,69 @@
# Stable Diffusion XL Fine-Tuning with Kohya SS

This solution creates all the components you need to get started quickly with fine-tuning Stable Diffusion XL on a custom dataset, using a custom training container that leverages Kohya SS to perform the fine-tuning. Stable Diffusion lets you generate images from text prompts. Training is coordinated by a SageMaker pipeline that runs a SageMaker training job, and the solution automates many of the tedious infrastructure-setup tasks that training normally requires. You will use the "kohya-ss-fine-tuning" notebook to set up the solution, but first get familiar with the solution components described in this README, which are labeled with their default resource names from the CloudFormation template.

![Architecture Diagram](kohya-ss-fine-tuning.jpg)

*Prerequisites:*
1. A SageMaker Domain configured (to be used with SageMaker Studio).
2. The [required permissions](https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-16550/sagemaker-policy.json) added to the SageMaker Execution Role for your domain (a minimal scripted sketch follows this list).
3. A SageMaker Domain User Profile configured.
4. If you will run the CloudFormation template via the console, your user must be assigned the proper IAM permissions.
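
If you prefer to script prerequisite 2 rather than use the IAM console, the following is a minimal sketch using boto3. The role name and inline policy name are placeholders rather than values defined by this solution; substitute the execution role attached to your domain.

```python
# Sketch only: attach the required permissions as an inline policy on the
# SageMaker execution role. The role and policy names below are placeholders.
import urllib.request

import boto3

POLICY_URL = "https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-16550/sagemaker-policy.json"
ROLE_NAME = "AmazonSageMaker-ExecutionRole-XXXXXXXXXXXX"  # replace with your domain's execution role

policy_document = urllib.request.urlopen(POLICY_URL).read().decode("utf-8")

boto3.client("iam").put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="kohya-ss-fine-tuning-permissions",  # arbitrary name for the inline policy
    PolicyDocument=policy_document,
)
```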

*Follow these steps to get started:*

1. Navigate to Amazon SageMaker Studio in your AWS account. Run your JupyterLab space.
2. Click on "Terminal".
3. Next, check out just the required directories of the SageMaker Examples Git repository (so you don't have to download the entire repo). Run the following commands from the terminal. If successful, you should see the output "Your branch is up to date with 'origin/main'".

git clone --no-checkout https://github.com/azograby/amazon-sagemaker-examples.git
# TODO: change to official aws examples repo once personal repo PR is approved and merged. PR: https://github.com/aws/amazon-sagemaker-examples/pull/4481
# git clone --no-checkout https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/
git sparse-checkout set use-cases/text-to-image-fine-tuning
git checkout

4. In Amazon SageMaker Studio, in the left-hand navigation pane, click the File Browser and navigate to the project directory (amazon-sagemaker-examples/use-cases/text-to-image-fine-tuning). Open the Jupyter Notebook named "kohya-ss-fine-tuning.ipynb".
5. The default runtime kernel is set to Python 3 automatically, so you have a kernel that is ready to run commands. Continue with this notebook to start setting up your solution.

---

## Solution Components:

* **S3 Bucket: sagemaker-kohya-ss-fine-tuning-\<accountid\>**
* S3 bucket where the custom dataset is uploaded (images, captions, and the Kohya SS configuration)
* These files will be uploaded from the notebook (see the sketch at the end of this component list)
* The SageMaker pipeline orchestrates the training and outputs the model to this same S3 bucket

* **CodeCommit Repository: kohya-ss-fine-tuning-container-image**
* Contains the source code to build the training container image (Dockerfile)
* Contains the training code (train.py)
* Contains the build spec used by CodeBuild to create the docker image (buildspec.yml)
* Changes to these files will trigger a new container image to be built

* **ECR Repository: kohya-ss-fine-tuning**
* ECR repository for hosting the training container image
* Container image will be built and pushed to this repository
* Container image contains the [Kohya SS](https://github.com/bmaltais/kohya_ss.git) program (used to train the custom SDXL model)

* **CodeBuild Project: kohya-ss-fine-tuning-build-container**
* Builds the training container image and pushes it to ECR
* Environment variables in template.yml can be modified to change the Kohya SS version
* The GitHub repository (https://github.com/bmaltais/kohya_ss.git) has been tested as of version v22.6.2
* If you use a newer version, check the Dockerfile and the docker-compose.yaml file in that repository, as well as the SDXL training entrypoint (sdxl_train_network.py) called by the custom "train" file in this repository, to see whether any modifications are needed

* **EventBridge Rule: kohya-ss-fine-tuning-trigger-new-image-build-rule**
* Updating the CodeCommit repository code triggers the CodeBuild project that builds the new training container image
* This rule does NOT kick off a training job

* **SageMaker Pipeline: kohya-ss-fine-tuning-pipeline**
* Orchestrates training the custom model
* Takes the custom training image from ECR and the dataset/config located in S3, and initiates a SageMaker training job
* Once complete, it outputs the model to the same S3 bucket, which you can then use to run inference
* This pipeline must be executed manually with specific input parameters, which you may override (see the sketch at the end of this component list)

* **IAM Roles**
* SageMaker Pipeline execution (custom-sagemaker-pipeline-execution-role)
* SageMaker Service role (custom-sagemaker-service-role)
* CodeBuild Service role (custom-codebuild-service-role)
* EventBridge role (custom-build-new-training-container-image-rule-role)
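
As referenced in the S3 bucket and SageMaker Pipeline components above, the notebook handles uploading the dataset and starting the pipeline. The following is a minimal boto3 sketch of the same flow, assuming the default resource names; the object key prefixes and pipeline parameter names are illustrative placeholders, so use the values defined in template.yml and the notebook.

```python
# Sketch only: upload training inputs and start the fine-tuning pipeline.
# Key prefixes and parameter names are placeholders; see the notebook for the real values.
import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
bucket = f"sagemaker-kohya-ss-fine-tuning-{account_id}"

s3 = boto3.client("s3")
# Upload an example image, its caption, and the Kohya SS training config (placeholder keys)
for local_path, key in [
    ("dataset/img/001.png", "dataset/img/001.png"),
    ("dataset/img/001.txt", "dataset/img/001.txt"),
    ("kohya-sdxl-config.toml", "config/kohya-sdxl-config.toml"),
]:
    s3.upload_file(local_path, bucket, key)

# Manually start a pipeline execution, overriding input parameters as needed
response = boto3.client("sagemaker").start_pipeline_execution(
    PipelineName="kohya-ss-fine-tuning-pipeline",
    PipelineParameters=[  # hypothetical parameter names
        {"Name": "InputDataS3Uri", "Value": f"s3://{bucket}/dataset/"},
        {"Name": "TrainingInstanceType", "Value": "ml.g5.8xlarge"},
    ],
)
print(response["PipelineExecutionArn"])
```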
132 changes: 132 additions & 0 deletions use-cases/text-to-image-fine-tuning/code/Dockerfile
@@ -0,0 +1,132 @@
# This Dockerfile has been modified slightly from https://github.com/bmaltais/kohya_ss/blob/v22.6.2/Dockerfile,
# to work with SageMaker training jobs

# syntax=docker/dockerfile:1
ARG UID=1000
ARG VERSION=EDGE
ARG RELEASE=0

# Use the image from ECR to avoid potential Docker Hub rate limits for unauthenticated requests https://www.docker.com/increase-rate-limits/
# FROM python:3.10-slim as build
FROM public.ecr.aws/docker/library/python:3.10-slim as build

# RUN mount cache for multi-arch: https://github.com/docker/buildx/issues/549#issuecomment-1788297892
ARG TARGETARCH
ARG TARGETVARIANT

WORKDIR /app

# Install under /root/.local
ENV PIP_USER="true"
ARG PIP_NO_WARN_SCRIPT_LOCATION=0
ARG PIP_ROOT_USER_ACTION="ignore"

# Install build dependencies
RUN apt-get update && apt-get upgrade -y && \
apt-get install -y --no-install-recommends python3-launchpadlib git curl && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# Install PyTorch and TensorFlow
# The versions must align and be in sync with the requirements_linux_docker.txt
# hadolint ignore=SC2102
RUN --mount=type=cache,id=pip-$TARGETARCH$TARGETVARIANT,sharing=locked,target=/root/.cache/pip \
pip install -U --extra-index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.nvidia.com \
torch==2.1.2 torchvision==0.16.2 \
xformers==0.0.23.post1 \
# Why [and-cuda]: https://github.com/tensorflow/tensorflow/issues/61468#issuecomment-1759462485
tensorflow[and-cuda]==2.14.0 \
ninja \
pip setuptools wheel

# Install requirements
RUN --mount=type=cache,id=pip-$TARGETARCH$TARGETVARIANT,sharing=locked,target=/root/.cache/pip \
--mount=source=requirements_linux_docker.txt,target=requirements_linux_docker.txt \
--mount=source=requirements.txt,target=requirements.txt \
--mount=source=setup/docker_setup.py,target=setup.py \
pip install -r requirements_linux_docker.txt -r requirements.txt

# Replace pillow with pillow-simd (Only for x86)
ARG TARGETPLATFORM
RUN if [ "$TARGETPLATFORM" = "linux/amd64" ]; then \
apt-get update && apt-get install -y --no-install-recommends zlib1g-dev libjpeg62-turbo-dev build-essential && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
pip uninstall -y pillow && \
CC="cc -mavx2" pip install -U --force-reinstall pillow-simd; \
fi

# Use the image from ECR to avoid potential Docker Hub rate limits for unauthenticated requests https://www.docker.com/increase-rate-limits/
# FROM python:3.10-slim as final
FROM public.ecr.aws/docker/library/python:3.10-slim as final

ARG UID
ARG VERSION
ARG RELEASE

LABEL name="bmaltais/kohya_ss" \
vendor="bmaltais" \
maintainer="bmaltais" \
# Dockerfile source repository
url="https://github.com/bmaltais/kohya_ss" \
version=${VERSION} \
# This should be a number, incremented with each change
release=${RELEASE} \
io.k8s.display-name="kohya_ss" \
summary="Kohya's GUI: This repository provides a Gradio GUI for Kohya's Stable Diffusion trainers(https://github.com/kohya-ss/sd-scripts)." \
description="The GUI allows you to set the training parameters and generate and run the required CLI commands to train the model. This is the docker image for Kohya's GUI. For more information about this tool, please visit the following website: https://github.com/bmaltais/kohya_ss."

# Install runtime dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends libgl1 libglib2.0-0 libjpeg62 libtcl8.6 libtk8.6 libgoogle-perftools-dev dumb-init && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# Fix missing libnvinfer7
RUN ln -s /usr/lib/x86_64-linux-gnu/libnvinfer.so /usr/lib/x86_64-linux-gnu/libnvinfer.so.7 && \
ln -s /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7

# Amazon SageMaker: Copy the train file to the specified directory that Amazon SageMaker Training jobs will use
COPY ./train /opt/program/

# Create user
RUN groupadd -g $UID $UID && \
useradd -l -u $UID -g $UID -m -s /bin/sh -N $UID

# Create directories with correct permissions
RUN install -d -m 775 -o $UID -g 0 /dataset && \
install -d -m 775 -o $UID -g 0 /licenses && \
install -d -m 775 -o $UID -g 0 /app

# Copy dist and support arbitrary user ids (OpenShift best practice)
COPY --chown=$UID:0 --chmod=775 \
--from=build /root/.local /home/$UID/.local

WORKDIR /app
COPY --chown=$UID:0 --chmod=775 . .

# Copy licenses (OpenShift Policy)
COPY --chmod=775 LICENSE.md /licenses/LICENSE.md

# Amazon SageMaker: Add /opt/program to the path. The "train" program resides here.
ENV PATH="/home/$UID/.local/bin:$PATH:/opt/program"
ENV PYTHONPATH="${PYTHONPATH}:/home/$UID/.local/lib/python3.10/site-packages"
ENV LD_PRELOAD=libtcmalloc.so
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

VOLUME [ "/dataset" ]

# 7860: Kohya GUI
# 6006: TensorBoard
EXPOSE 7860 6006

# Amazon SageMaker: Commenting out to avoid permission issues with access to /opt/ml when invoked by an Amazon SageMaker training job
# USER $UID

STOPSIGNAL SIGINT

# Use dumb-init as PID 1 to handle signals properly
ENTRYPOINT ["dumb-init", "--"]

# We will not be using the GUI. Instead, we call this program programmatically from the "train" file
CMD ["python3", "kohya_gui.py", "--listen", "0.0.0.0", "--server_port", "7860"]
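
The SageMaker-specific comments above exist so that a SageMaker training job can invoke the bundled "train" program from /opt/program. As a hedged aside (the solution itself uses the SageMaker pipeline for this), a one-off training job could be launched against the pushed ECR image with the SageMaker Python SDK roughly as follows; the image tag, role ARN, instance type, and S3 paths are placeholders.

```python
# Sketch only: launch a one-off SageMaker training job against the custom image,
# outside the pipeline. Image tag, role ARN, instance type, and S3 paths are placeholders.
import boto3
from sagemaker.estimator import Estimator

account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.session.Session().region_name
bucket = f"sagemaker-kohya-ss-fine-tuning-{account_id}"

estimator = Estimator(
    image_uri=f"{account_id}.dkr.ecr.{region}.amazonaws.com/kohya-ss-fine-tuning:latest",
    role="arn:aws:iam::123456789012:role/custom-sagemaker-service-role",  # placeholder ARN
    instance_count=1,
    instance_type="ml.g5.8xlarge",
    output_path=f"s3://{bucket}/models/",
)
# The 'train' channel is copied to /opt/ml/input/data/train inside the container,
# where the train program expects the dataset and kohya-sdxl-config.toml
estimator.fit({"train": f"s3://{bucket}/dataset/"})
```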
35 changes: 35 additions & 0 deletions use-cases/text-to-image-fine-tuning/code/buildspec.yml
@@ -0,0 +1,35 @@
version: 0.2

phases:
pre_build:
commands:
# Log into ECR, and if the ECR repository doesn't exist then create one
- echo Logging in to Amazon ECR...
- aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
- aws ecr describe-repositories --repository-names ${IMAGE_REPO_NAME} || aws ecr create-repository --repository-name ${IMAGE_REPO_NAME}
build:
commands:
- echo Build started on `date`
# Clone the specific version of the GitHub repository for Kohya SS
# Tested as of version v22.6.2. If you use newer versions, you will want to check the Dockerfile and the docker-compose.yaml file in the
# repository, and the training entrypoint for SDXL (sdxl_train_network.py) in the custom "train" file located in this repository
- git clone https://github.com/bmaltais/kohya_ss.git --branch ${KOHYA_SS_VERSION}
# Overwrite the Dockerfile with the custom one we have in the CodeCommit repository
# Our Dockerfile has minor tweaks to enable using SageMaker Training Jobs
- mv -f ./Dockerfile ./kohya_ss/Dockerfile
# Move the custom train file we have in the CodeCommit repository, to the kohya_ss directory since this is the build context for docker-compose
# Right now, the train filename does not conflict with anything in the kohya_ss directory for this version of code,
# but eventually the Docker context may be changed so we don't have to add the train file to this directory
- chmod +x ./train && mv -f ./train ./kohya_ss/train
- cd kohya_ss
- echo Building the Docker image that will be used for training...
# Build and tag the container image
- docker compose build --no-cache=true --progress=plain
# The "kohya-ss-gui:latest" name comes from the Kohya SS Docker config: https://github.com/bmaltais/kohya_ss/blob/v22.6.2/docker-compose.yaml#L5
- docker tag kohya-ss-gui:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
post_build:
commands:
# Push the container image to ECR
- echo Build completed on `date`
- echo Pushing the Docker image...
- docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
65 changes: 65 additions & 0 deletions use-cases/text-to-image-fine-tuning/code/train
@@ -0,0 +1,65 @@
#!/usr/bin/env python

from __future__ import print_function

import os
import sys
import traceback
import subprocess

prefix = '/opt/ml/'

input_path = prefix + 'input/data'
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')

# This algorithm has a single channel of input data called 'train'. Since we run in
# File mode, the input files are copied to the directory specified here
channel_name='train'
training_path = os.path.join(input_path, channel_name)

# The function that executes the training; it is called by the SageMaker training job
def train():
print('Starting the training.')
try:
input_files = [ os.path.join(training_path, file) for file in os.listdir(training_path) ]
if len(input_files) == 0:
error_message = (
f"There are no files in {training_path}.\n"
f"This usually indicates that the channel ({channel_name}) was incorrectly specified,\n"
"the data specification in S3 was incorrectly specified or the role specified\n"
"does not have permission to access the data."
)
raise ValueError(error_message)

# Call the program that was installed in the training container, which uses the kohya-ss libraries to train
# Stable Diffusion XL given the kohya-sdxl-config.toml file that is present in the training S3 bucket
# In the future, parameters such as num_cpu_threads_per_process could be set and retrieved via SageMaker hyperparameters (see the sketch after this file)
subprocess.run(
["accelerate",
"launch",
"--num_cpu_threads_per_process=2",
"./sdxl_train_network.py",
"--config_file",
os.path.join(training_path, 'kohya-sdxl-config.toml')], check=True
)

print('Training complete.')
except Exception as e:
# Write out an error file. This will be returned as the failureReason in the DescribeTrainingJob result
traceback_str = traceback.format_exc()
failure_log_path = os.path.join(output_path, 'failure')

with open(failure_log_path, 'w') as log_file:
log_file.write(f'Exception during training: {str(e)}\n{traceback_str}')
# Printing this causes the exception to be in the training job logs, as well.
print(f'Exception during training: {str(e)}\n{traceback_str}', file=sys.stderr)

# A non-zero exit code causes the training job to be marked as Failed.
sys.exit(255)

if __name__ == '__main__':
train()

# A zero exit code causes the job to be marked as Succeeded.
sys.exit(0)
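
A minimal sketch of the hyperparameter idea mentioned in the comment above the "accelerate launch" call: SageMaker training jobs write the job's hyperparameters as JSON string values to /opt/ml/input/config/hyperparameters.json, so train() could read an optional value instead of hard-coding num_cpu_threads_per_process. The key name below is an assumption; nothing in the current pipeline sets it.

```python
# Sketch only: read an optional hyperparameter supplied to the SageMaker training job.
# SageMaker serializes every hyperparameter value as a string in this JSON file.
import json
import os

HYPERPARAMS_PATH = '/opt/ml/input/config/hyperparameters.json'

def get_hyperparameter(name, default):
    if os.path.exists(HYPERPARAMS_PATH):
        with open(HYPERPARAMS_PATH) as f:
            return json.load(f).get(name, default)
    return default

# Hypothetical usage inside train():
num_threads = get_hyperparameter('num_cpu_threads_per_process', '2')
# subprocess.run(["accelerate", "launch",
#                 f"--num_cpu_threads_per_process={num_threads}", ...], check=True)
```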