# Fine-Tuning: Teaching AI About Your Specific Data

In the previous notebook, we discovered that pre-trained models don't know about specific subjects. Now we'll solve this using **Dreambooth**, a technique for teaching Stable Diffusion about new concepts while preserving its general capabilities.

## What You'll Learn
1. **Dreambooth Technique** - How to fine-tune diffusion models effectively
2. **Data Preparation** - Organizing training images for best results
3. **Training Configuration** - Optimizing parameters for GPU resources
4. **Model Persistence** - Saving to S3 for pipeline automation and serving

## Why Fine-Tuning Matters in Production
- **Personalization**: Adapt models to your specific domain
- **Brand Consistency**: Generate content matching your style guides
- **Proprietary Knowledge**: Teach models about your unique products/concepts
- **Quality Control**: Improve accuracy for your use cases

Let's teach our model to recognize and generate images of Teddy!

![redhat dog](https://rhods-public.s3.amazonaws.com/sample-data/images/redhat-dog-small.jpg)

### GPU Memory Requirements

**Important**: Fine-tuning requires significant GPU memory (24GB+). 
- **Recommended**: Shut down other notebook kernels to free memory
- **Check GPU**: We need most of the GPU memory available
- **Alternative**: Use the pipeline version for automated resource management

In [None]:
!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader
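
The same check can be scripted. What follows is a minimal sketch, assuming the CSV layout produced by the query above (`name, memory.total, memory.free`, with memory reported in MiB). The helper names and the 20 GiB threshold are illustrative assumptions, not part of the original notebook.

```python
def parse_gpu_line(line):
    """Parse one line of `nvidia-smi --format=csv,noheader` output.

    Assumes the three queried fields: name, memory.total, memory.free,
    e.g. "NVIDIA A10G, 23028 MiB, 22515 MiB".
    """
    name, total, free = [field.strip() for field in line.split(",")]
    # Memory fields look like "23028 MiB"; keep only the number.
    total_mib = int(total.split()[0])
    free_mib = int(free.split()[0])
    return name, total_mib, free_mib


def has_enough_memory(line, required_mib=20 * 1024):
    """Return True if the GPU reports at least `required_mib` MiB free."""
    _, _, free_mib = parse_gpu_line(line)
    return free_mib >= required_mib


sample = "NVIDIA A10G, 23028 MiB, 22515 MiB"  # hypothetical sample output
print(parse_gpu_line(sample))
print(has_enough_memory(sample))
```

In a notebook, the output of the `nvidia-smi` cell could be captured and passed to `has_enough_memory` line by line before starting a long training run.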

## Install Requirements

**Note**: We install diffusers from source so that the library stays compatible with the latest training script. After installation, you may need to restart the kernel for the changes to take effect.

In [None]:
!pip install -r requirements-base.txt
!pip install -r requirements-gpu.txt

In [None]:
!pip list | grep -E "boto|grpcio|pandas|torch|torchvision|diffusers|transformers|accelerate|flash-attn|ftfy|xformers|protobuf"

**Important**: If you see an older version of diffusers (like 0.34.0) instead of a development version (e.g., 0.35.0.dev0), you need to restart the kernel for the source installation to take effect:
1. Kernel → Restart Kernel
2. Re-run the cells from the beginning
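
The version check can also be done from Python. This is a small sketch, assuming that a source install reports a version string containing `.dev` (as in the `0.35.0.dev0` example above); the function names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def is_dev_version(version_string):
    """Return True for development versions such as '0.35.0.dev0'."""
    return ".dev" in version_string


def diffusers_status():
    """Describe the installed diffusers version in one line."""
    try:
        installed = version("diffusers")
    except PackageNotFoundError:
        return "diffusers is not installed"
    if is_dev_version(installed):
        return f"diffusers {installed}: source install detected"
    return f"diffusers {installed}: release version, restart the kernel and re-run"


print(diffusers_status())
```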

## Training Configuration

Configure all training parameters using environment variables. This approach enables:
- Parameterized notebooks for different experiments
- Integration with pipelines using different settings
- Keeping sensitive information out of code

In [None]:
import os
from huggingface_hub import login

# Check if HF_TOKEN environment variable is set
if not os.environ.get('HF_TOKEN'):
    print("HF_TOKEN environment variable not found.")
    print("Please log in to Hugging Face to access models.")
    login()
else:
    print("Using HF_TOKEN from environment variable.")

### Hugging Face Integration

We authenticate with Hugging Face to:
- Download the base model for fine-tuning
- Access gated models that require agreement to terms
- (Optional) Upload our fine-tuned model back to Hugging Face

This integration shows how OpenShift AI workbenches can securely connect to external services.

In [None]:
import os
from datetime import datetime

date = datetime.now()
date_string = date.strftime("%Y%m%d%H%M%S")

VERSION = os.environ.get("VERSION", "notebook-output")
MODEL_NAME = os.environ.get("MODEL_NAME", "stabilityai/stable-diffusion-3.5-medium")
OUTPUT_DIR = os.path.join(os.getcwd(), f"{VERSION}/stable_diffusion_weights/redhat-dog")
DATA_DIR = os.path.join(os.getcwd(), f"{VERSION}/data")
INSTANCE_DATA_URL = os.environ.get("INSTANCE_DATA_URL", "https://rhods-public.s3.amazonaws.com/sample-data/images/redhat-dog.tar.gz")
INSTANCE_DIR = os.path.join(DATA_DIR, "instance_dir")
CLASS_DIR = os.path.join(DATA_DIR, "class_dir")
INSTANCE_PROMPT = os.environ.get("INSTANCE_PROMPT", "photo of a rhteddy dog")
CLASS_PROMPT = os.environ.get("CLASS_PROMPT", "a photo of dog")

NUM_CLASS_IMAGES = int(os.environ.get("NUM_CLASS_IMAGES", "200"))
MAX_TRAIN_STEPS = int(os.environ.get("MAX_TRAIN_STEPS", "800"))

S3_PREFIX = f"models/{VERSION}/redhat-dog"

os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(INSTANCE_DIR, exist_ok=True)

print(f"Weights will be saved at {OUTPUT_DIR}")
print(f"It will be based on the model {MODEL_NAME}")
print(f"Training data will be downloaded from {INSTANCE_DATA_URL}")
print(f"We're going to train the difference between \"{INSTANCE_PROMPT}\" and \"{CLASS_PROMPT}\"")

### Key Training Parameters Explained

- **MODEL_NAME**: Base model to fine-tune (we use SD 3.5 Medium for quality)
- **VERSION**: Helps organize experiments and model versions
- **INSTANCE_PROMPT**: The new concept we're teaching ("photo of a rhteddy dog")
- **CLASS_PROMPT**: The general category to preserve ("a photo of dog")
- **NUM_CLASS_IMAGES**: Regularization images to prevent overfitting
- **MAX_TRAIN_STEPS**: Training duration (800 steps ≈ 15 minutes on A10G)

The Dreambooth technique uses these prompts to:
1. Learn the specific subject (instance)
2. Preserve general knowledge (class)
3. Enable prompts like "rhteddy dog in the snow"
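
How the two prompts interact during training can be illustrated with the prior-preservation objective. The sketch below is a simplified, framework-free illustration rather than the training script's actual code. It assumes the total loss is the instance loss plus `prior_loss_weight` times the class (prior) loss, each computed as a mean squared error. The function names and toy numbers are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target,
                    prior_loss_weight=1.0):
    """Combine the subject loss with the weighted class-preservation loss."""
    instance_loss = mse(instance_pred, instance_target)  # learn the subject
    prior_loss = mse(class_pred, class_target)           # keep "dog" general
    return instance_loss + prior_loss_weight * prior_loss


# Toy values standing in for model predictions and training targets.
loss = dreambooth_loss([0.5, 1.0], [0.0, 1.0], [2.0, 2.0], [1.0, 2.0])
print(loss)
```

Setting `prior_loss_weight` to 0 in this sketch removes the class term entirely, which corresponds to training on the subject images alone.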

## Training Workflow

The training process involves:
1. **Data Preparation** - Download and organize training images
2. **Model Configuration** - Set up the training environment
3. **Training Execution** - Fine-tune the model
4. **Model Persistence** - Save to S3 for later use

### Step 1: Prepare Training Data

For Dreambooth, we need:
- **5-10 images** of our subject (Teddy) from different angles
- **Consistent quality** - Similar resolution and lighting
- **Clear subject** - Teddy should be prominent in each image

The images are stored in S3 and downloaded on-demand, demonstrating OpenShift AI's integration with object storage.

In [None]:
import urllib.request

# Download the training image archive
url = INSTANCE_DATA_URL
output = "instance-images.tar.gz"
urllib.request.urlretrieve(url, output)

# Unpack the images into the instance directory
!tar -xzf instance-images.tar.gz -C $INSTANCE_DIR
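
After extraction, it can help to confirm that the expected number of images actually arrived. The following is a minimal sketch, assuming the images use common file extensions. The helper names, the extension list, and the minimum of 5 (taken from the 5-10 guideline above) are illustrative assumptions.

```python
import os
import tempfile

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def list_training_images(directory):
    """Return sorted paths of image files found anywhere under `directory`."""
    images = []
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            if os.path.splitext(filename)[1].lower() in IMAGE_EXTENSIONS:
                images.append(os.path.join(root, filename))
    return sorted(images)


def has_enough_images(images, minimum=5):
    """Return True if at least `minimum` images were found."""
    return len(images) >= minimum


# Demo on a throwaway directory; in the notebook, pass INSTANCE_DIR instead.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.jpg", "b.PNG", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    found = list_training_images(demo_dir)
    print(len(found), has_enough_images(found))
```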

### Step 2: Configure Accelerate

[Accelerate](https://huggingface.co/docs/accelerate) handles distributed training configuration. For our single-GPU setup, we use the default configuration. In production, you might:
- Use multiple GPUs for faster training
- Configure mixed-precision training for efficiency
- Set up distributed training across nodes

In [None]:
!accelerate config default

### Step 3: Download Training Script

Download the Dreambooth SD3 training script from Hugging Face:

In [None]:
!wget -O train_dreambooth_sd3.py https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_sd3.py

### Step 4: Start Training

Now we launch the training job. Key optimizations for GPU efficiency:
- **Gradient checkpointing**: Trades computation for memory
- **8-bit Adam**: Reduces optimizer memory usage
- **xFormers**: Efficient attention implementation (checked in the package list above; the launch command below passes no explicit flag for it)
- **Prior preservation**: Maintains model's general capabilities

**Expected duration**: ~15 minutes on an A10G GPU

The training script will:
1. Load the base model
2. Generate regularization images
3. Fine-tune on Teddy images
4. Save checkpoints periodically

In [None]:
!accelerate launch train_dreambooth_sd3.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="$INSTANCE_PROMPT" \
  --class_prompt="$CLASS_PROMPT" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --gradient_accumulation_steps=2 \
  --use_8bit_adam \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=$NUM_CLASS_IMAGES \
  --max_train_steps=$MAX_TRAIN_STEPS
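
The batch-related flags above combine into an effective batch size. The arithmetic below is a sketch under the assumption that `max_train_steps` counts optimizer updates and that each update accumulates `gradient_accumulation_steps` batches on every process; the function names are hypothetical.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps,
                         num_processes=1):
    """Samples contributing to one optimizer update."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def instance_samples_seen(max_train_steps, batch_size):
    """Total instance samples processed over the whole run."""
    return max_train_steps * batch_size


# Values from the launch command: batch size 1, accumulation 2, one GPU.
batch = effective_batch_size(train_batch_size=1, gradient_accumulation_steps=2)
print(batch)
print(instance_samples_seen(max_train_steps=800, batch_size=batch))
```

With only a handful of instance images, this means each image is revisited many times over the run, which is one reason prior preservation is used to limit overfitting.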

### Training Results

After training completes, we have:
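- **Model weights** for all components (transformer, VAE, text encoders)
- **Tokenizers** saved alongside the weights (Dreambooth reuses existing tokens for the new concept, so the vocabulary is unchanged)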
- **Configuration files** for easy loading

These artifacts are ready for:
1. Local testing in notebooks
2. Upload to model registries (Hugging Face, S3)
3. Deployment via KServe
4. Pipeline automation

In [None]:
!ls $OUTPUT_DIR
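
For local testing, the saved weights can be loaded back with diffusers. This is a hedged sketch, not a step from the original notebook: it assumes the output directory contains a complete pipeline loadable with `StableDiffusion3Pipeline`, that a CUDA GPU with enough free memory is available, and that half precision is acceptable. The function name and default sampling values are illustrative.

```python
def generate_image(model_dir, prompt, num_inference_steps=28, guidance_scale=7.0):
    """Load a saved pipeline from `model_dir` and generate one image."""
    # Imported lazily so the function can be defined without loading the GPU stack.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        model_dir, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")  # assumes a CUDA device is present
    result = pipe(
        prompt,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
    )
    return result.images[0]
```

In the notebook this could be called as `generate_image(OUTPUT_DIR, "photo of a rhteddy dog in the snow").save("teddy.png")`, ideally after the training process has released its GPU memory.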

## Save to S3: Model Persistence

Save the fine-tuned model to S3 storage for:
- **Version Control**: Track model iterations
- **Pipeline Integration**: Access models from any pipeline step
- **Model Serving**: Deploy directly from S3
- **Collaboration**: Share models across teams

The fine-tuned model isn't doing much good sitting in this notebook. We need to push it to our connected storage location so that we can use it in another notebook or serve it from an application.

**Note**: This requires a data connection to an S3-compatible bucket. As part of the setup for this project, you added `setup-s3.yaml`, which created a local S3 bucket and data connections.

### S3 Integration in OpenShift AI

The boto3 code below uses credentials from the Data Connection we attached to our workbench. OpenShift AI automatically:
- Injects S3 credentials as environment variables
- Configures endpoints for your storage provider  
- Handles secure credential management

This same pattern works with any S3-compatible storage:
- Red Hat OpenShift Data Foundation
- AWS S3
- MinIO
- IBM Cloud Object Storage
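
Before creating the client, it may be worth confirming that the Data Connection variables are actually present. The sketch below assumes the five variable names read by the cell that follows; the helper name is hypothetical, and some S3-compatible providers may not require every variable (the region in particular).

```python
import os

EXPECTED_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_ENDPOINT",
    "AWS_DEFAULT_REGION",
    "AWS_S3_BUCKET",
]


def missing_connection_vars(env=None):
    """Return the expected variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


missing = missing_connection_vars()
if missing:
    print("Missing Data Connection variables: " + ", ".join(missing))
else:
    print("All Data Connection variables are set.")
```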

In [None]:
import os
import boto3
import botocore

aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
endpoint_url = os.environ.get('AWS_S3_ENDPOINT')
region_name = os.environ.get('AWS_DEFAULT_REGION')
bucket_name = os.environ.get('AWS_S3_BUCKET')

session = boto3.session.Session(aws_access_key_id=aws_access_key_id,
                                aws_secret_access_key=aws_secret_access_key)

s3_resource = session.resource(
    's3',
    config=botocore.client.Config(signature_version='s3v4'),
    endpoint_url=endpoint_url,
    region_name=region_name)

bucket = s3_resource.Bucket(bucket_name)


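def upload_directory_to_s3(local_directory, s3_prefix):
    """Upload every file under local_directory, keeping its relative path."""
    for root, dirs, files in os.walk(local_directory):
        for filename in files:
            file_path = os.path.join(root, filename)
            relative_path = os.path.relpath(file_path, local_directory)
            # S3 keys always use forward slashes, whatever the local OS
            s3_key = f"{s3_prefix}/{relative_path.replace(os.sep, '/')}"
            print(f"{file_path} -> {s3_key}")
            bucket.upload_file(file_path, s3_key)


def list_objects(prefix):
    """Print the key of every object stored under prefix."""
    # Iterate the filtered collection directly instead of shadowing `filter`
    for obj in bucket.objects.filter(Prefix=prefix):
        print(obj.key)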
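
To preview which keys an upload would create without touching the bucket, the path mapping can be run on its own. This is a dry-run sketch that mirrors the key construction used by the upload helper; `plan_upload` is a hypothetical name and makes no S3 calls.

```python
import os
import posixpath
import tempfile


def plan_upload(local_directory, s3_prefix):
    """Return (local_path, s3_key) pairs for every file under a directory."""
    plan = []
    for root, _dirs, files in os.walk(local_directory):
        for filename in sorted(files):
            file_path = os.path.join(root, filename)
            relative_path = os.path.relpath(file_path, local_directory)
            # Build the key with forward slashes regardless of the local OS.
            key = posixpath.join(s3_prefix, *relative_path.split(os.sep))
            plan.append((file_path, key))
    return plan


# Demo on a throwaway directory; in the notebook, pass OUTPUT_DIR and S3_PREFIX.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "vae"))
    open(os.path.join(demo_dir, "model_index.json"), "w").close()
    open(os.path.join(demo_dir, "vae", "config.json"), "w").close()
    for _path, key in plan_upload(demo_dir, "models/demo/redhat-dog"):
        print(key)
```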


In [None]:
print(f"Your S3 path is {S3_PREFIX}")

In [None]:
upload_directory_to_s3(OUTPUT_DIR, S3_PREFIX)

In [None]:
list_objects(S3_PREFIX)

### Model Storage Location

Note the S3 prefix where your model is stored:
- Format: `models/{VERSION}/redhat-dog/`
- Example: `models/notebook-output/redhat-dog/`

You'll need this path for:
- Creating model servers in OpenShift AI
- Referencing in pipelines
- Sharing with team members

## Next Steps

Congratulations! You've successfully:
1. ✅ Fine-tuned a Stable Diffusion model on custom data
2. ✅ Saved the model to S3 for persistence
3. ✅ Prepared for pipeline automation and serving

### What's Next?

1. **Create a Pipeline** (Optional)
   - Convert this notebook to a repeatable pipeline
   - Experiment with different parameters
   - Train on multiple subjects

2. **Deploy for Serving** 
   - Use the custom Diffusers runtime
   - Create a KServe inference service
   - Expose as REST API

3. **Test the Deployment**
   - Continue to [Notebook 3 - Remote Inference](3_remote_inference.ipynb)
   - Test your model via REST API
   - Integrate with applications

### Key Takeaways

- **Dreambooth** enables fine-tuning with minimal data (5-10 images)
- **OpenShift AI** provides GPU resources and S3 integration
- **Environment variables** enable parameterization for pipelines
- **S3 storage** enables model versioning and serving

Before proceeding, ensure you've noted your model's S3 path!

In [None]:
print(f"Your S3 path is {S3_PREFIX}")