# 4. Stable Diffusion 3 Medium — Demo on Apple Silicon
**Running the MMDiT Revolution on 16GB**

SD3 Medium introduced the Multi-Modal Diffusion Transformer (MMDiT) architecture, replacing the traditional UNet
with a transformer that jointly processes text and image tokens through two-way attention. This notebook
demonstrates real inference on a memory-constrained 16GB Apple Silicon Mac using strategic optimizations:
dropping the T5-XXL text encoder, using float16 precision, and enabling CPU offloading.
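
To make "jointly processes text and image tokens" concrete, here is a minimal NumPy sketch of joint attention. It is an illustration under simplifying assumptions (single head, no batching, no normalization or timestep modulation), not the SD3 implementation. The function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text_tokens, image_tokens, text_proj, image_proj):
    """Single-head joint attention over text and image tokens.

    Each modality keeps its own q/k/v projection weights, but attention is
    computed over the concatenated sequence, so text attends to image
    tokens and image attends to text tokens.
    """
    q = np.concatenate([text_tokens @ text_proj["q"], image_tokens @ image_proj["q"]], axis=0)
    k = np.concatenate([text_tokens @ text_proj["k"], image_tokens @ image_proj["k"]], axis=0)
    v = np.concatenate([text_tokens @ text_proj["v"], image_tokens @ image_proj["v"]], axis=0)

    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_text + n_img, n_text + n_img)
    out = weights @ v

    n_text = text_tokens.shape[0]
    return out[:n_text], out[n_text:], weights  # split back into the two streams

# Toy usage: 3 text tokens and 4 image tokens with hidden size 8.
rng = np.random.default_rng(0)
dim = 8
text = rng.normal(size=(3, dim))
image = rng.normal(size=(4, dim))
text_proj = {name: rng.normal(size=(dim, dim)) for name in ("q", "k", "v")}
image_proj = {name: rng.normal(size=(dim, dim)) for name in ("q", "k", "v")}

text_out, image_out, attn = joint_attention(text, image, text_proj, image_proj)
print(text_out.shape, image_out.shape, attn.shape)
```

The design point is the separate weights per modality combined with one shared attention matrix: each stream keeps its own representation space while information flows in both directions.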

## Prerequisites

Before running this notebook, ensure the following:

1. **HuggingFace account + token** — Run `huggingface-cli login` in your terminal and paste your access token.
2. **Accept the SD3 Medium license** — Visit [stabilityai/stable-diffusion-3-medium-diffusers](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) and accept the model license agreement.
3. **16GB memory strategy** — SD3 Medium ships with three text encoders (CLIP-L, CLIP-G, T5-XXL). The T5-XXL encoder alone has about 4.7B parameters, which is roughly 10GB of weights in float16, so the full pipeline leaves almost no headroom in 16GB of unified memory. Our strategy:
   - **Drop T5-XXL** (`drop_t5_encoder=True`): CLIP-L + CLIP-G still handle short and medium-length prompts well. Expect weaker results on long, detailed prompts and on rendered text.
   - **Use float16**: Halves memory compared to float32 with negligible quality loss.
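   - **CPU offloading**: Moves inactive pipeline components to CPU RAM during inference. On Apple Silicon the CPU and GPU share the same unified memory, so the saving is typically smaller than on a discrete GPU.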
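
The memory argument behind this strategy is simple arithmetic: parameter count times bytes per parameter. The sketch below works it through. The parameter counts are approximate figures assumed for illustration (the VAE and activation memory are left out), so treat the results as order-of-magnitude estimates rather than measurements.

```python
# Approximate parameter counts in billions (assumptions for illustration only).
APPROX_PARAMS_B = {
    "mmdit": 2.0,
    "t5_xxl_encoder": 4.7,
    "clip_g_text": 0.7,
    "clip_l_text": 0.12,
}
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weights_gb(params_billions: float, dtype: str) -> float:
    """Weight size in GB (1 GB = 1e9 bytes): billions of params times bytes per param."""
    return params_billions * BYTES_PER_PARAM[dtype]

def pipeline_weights_gb(dtype: str, drop_t5: bool) -> float:
    """Sum the weight sizes of all components, optionally skipping T5-XXL."""
    return sum(
        weights_gb(count, dtype)
        for name, count in APPROX_PARAMS_B.items()
        if not (drop_t5 and name == "t5_xxl_encoder")
    )

for dtype in ("float32", "float16"):
    for drop_t5 in (False, True):
        label = "without T5-XXL" if drop_t5 else "with T5-XXL"
        print(f"{dtype:<8} {label:<15} ~{pipeline_weights_gb(dtype, drop_t5):.1f} GB")
```

Weights are only part of the footprint. Activations, the VAE, and the operating system also need room, which is why a configuration that nominally fits can still fail on a 16GB machine.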

In [None]:
import sys
import os
import time

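# Add the project root to sys.path so the `config` and `models` packages resolve.
# When run from notebooks/, the root is the parent directory; otherwise assume cwd is the root.
cwd = os.getcwd()
project_root = os.path.dirname(cwd) if os.path.basename(cwd) == "notebooks" else cwd
if project_root not in sys.path:
    sys.path.insert(0, project_root)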

import torch
import matplotlib.pyplot as plt
from PIL import Image

from config.default import DiffusionConfig
from models.memory_utils import setup_mps_environment, clear_memory, log_memory_usage
from models.pipeline_factory import load_pipeline, generate_image
from models.prompt_bank import BENCHMARK_PROMPTS

In [None]:
config = DiffusionConfig(
    model_name="sd3-medium",
    quantization="none",
    num_inference_steps=28,
    guidance_scale=7.0,
    height=512,
    width=512,
    seed=42,
    dtype="float16",
    enable_cpu_offload=True,
    drop_t5_encoder=True,
)

print(f"Model: {config.model_name}")
print(f"Steps: {config.num_inference_steps}")
print(f"Resolution: {config.width}x{config.height}")
print(f"T5-XXL dropped: {config.drop_t5_encoder} (saves roughly 10GB of float16 weights)")
print(f"CPU offload: {config.enable_cpu_offload}")

## Memory Baseline

We measure process memory (RSS) at three key points to understand the memory profile:
1. **Before loading** — baseline with just Python and imports.
2. **After loading** — the pipeline is in memory (with T5-XXL dropped and CPU offloading active).
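3. **After generation** — memory once inference has finished and the model components have run on the MPS device. This is a point-in-time RSS reading, so it can understate the true peak reached during inference.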
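
Because a point-in-time reading can miss the high-water mark, it can help to record the process's peak RSS as well. The sketch below uses only the standard library and is independent of the project's `log_memory_usage` helper, whose implementation is not shown here. The function names are hypothetical, and `resource` is available on macOS and Linux but not on Windows.

```python
import resource
import sys

def peak_rss_bytes() -> int:
    """Return the peak resident set size of this process in bytes.

    ru_maxrss is reported in bytes on macOS and in kilobytes on Linux,
    so the Linux value is scaled to bytes.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024

def format_gb(num_bytes: int) -> str:
    """Format a byte count as gigabytes (1 GB = 1e9 bytes)."""
    return f"{num_bytes / 1e9:.2f} GB"

print(f"Peak RSS so far: {format_gb(peak_rss_bytes())}")
```

Calling `peak_rss_bytes()` after generation reports the highest RSS seen since the process started, which complements the three snapshots above. RSS may not account for every GPU-side allocation, so treat it as a lower bound.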

In [None]:
log_memory_usage("before loading")
print("Loading SD3 Medium (this may take a few minutes on first run)...")
print("Note: T5-XXL encoder is dropped to fit in 16GB")

pipe = load_pipeline(config)

log_memory_usage("after loading SD3 Medium")
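
The `load_pipeline` helper is project code and its implementation is not shown in this notebook. As a rough idea of what such a loader might do, here is a minimal sketch using the `diffusers` library, where passing `text_encoder_3=None` and `tokenizer_3=None` skips the T5-XXL encoder. The function names `sd3_load_kwargs` and `load_sd3` are hypothetical, and CPU offloading is deliberately left out. The actual helper may differ.

```python
def sd3_load_kwargs(drop_t5_encoder: bool) -> dict:
    """Build extra from_pretrained kwargs; dropping T5-XXL means passing None for it."""
    kwargs = {}
    if drop_t5_encoder:
        kwargs["text_encoder_3"] = None
        kwargs["tokenizer_3"] = None
    return kwargs

def load_sd3(model_id: str = "stabilityai/stable-diffusion-3-medium-diffusers",
             drop_t5_encoder: bool = True):
    """Load SD3 Medium in float16 and move it to MPS when available.

    Heavy imports are kept inside the function so that defining it is cheap.
    """
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        **sd3_load_kwargs(drop_t5_encoder),
    )
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    return pipe.to(device)

# Inspect the kwargs without downloading anything.
print(sd3_load_kwargs(drop_t5_encoder=True))
```

Calling `load_sd3()` downloads the weights on first use and requires the license acceptance and login described in the prerequisites.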

## Single Image Generation

Let's generate our first image with the MMDiT architecture. We use a classic prompt that tests
photorealism, spatial composition, and cinematic lighting.

In [None]:
prompt = "A photorealistic astronaut riding a white horse on Mars, cinematic lighting"
config.prompt = prompt

print(f"Prompt: {prompt}")
print(f"Generating with {config.num_inference_steps} steps...")

start_time = time.perf_counter()
image = generate_image(pipe, config)
elapsed = time.perf_counter() - start_time

print(f"Generation time: {elapsed:.1f}s")
log_memory_usage("after generation")

# Display
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.imshow(image)
ax.set_title(f"SD3 Medium | {config.num_inference_steps} steps | {elapsed:.1f}s", fontsize=13, fontweight='bold')
ax.axis('off')
plt.tight_layout()
plt.show()

# Save
os.makedirs(os.path.join("..", "outputs", "sd3_medium"), exist_ok=True)
image.save(os.path.join("..", "outputs", "sd3_medium", "astronaut.png"))

## Benchmark Prompt Suite

Now we run all 5 benchmark prompts to test different capabilities that improved with the MMDiT
architecture: text rendering, photorealism, spatial reasoning, artistic style, and complex scenes.

In [None]:
images = []
times = []

for i, ps in enumerate(BENCHMARK_PROMPTS):
    print(f"[{i+1}/{len(BENCHMARK_PROMPTS)}] {ps.category}: {ps.prompt[:60]}...")
    config.prompt = ps.prompt

    start = time.perf_counter()
    img = generate_image(pipe, config)
    elapsed = time.perf_counter() - start

    images.append(img)
    times.append(elapsed)
    print(f"  Done in {elapsed:.1f}s")

# Display grid
fig, axes = plt.subplots(1, len(images), figsize=(20, 4))
for ax, img, ps, t in zip(axes, images, BENCHMARK_PROMPTS, times):
    ax.imshow(img)
    ax.set_title(f"{ps.category}\n{t:.1f}s", fontsize=10, fontweight='bold')
    ax.axis('off')

fig.suptitle("SD3 Medium \u2014 Benchmark Prompt Suite (512x512)", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Save all
for img, ps in zip(images, BENCHMARK_PROMPTS):
    img.save(os.path.join("..", "outputs", "sd3_medium", f"{ps.name}.png"))
print(f"Saved {len(images)} images to outputs/sd3_medium/")

## Performance Summary

Summary of inference performance for SD3 Medium on Apple Silicon with the 16GB-optimized configuration.
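
Beyond the average, per-step latency and the spread between runs make the numbers easier to compare across models with different step counts. The helper below is a hypothetical sketch, not part of the project code. The `skip_warmup` option exists because the first run of a session can include one-time setup costs, and is off by default.

```python
from statistics import mean

def summarize_times(times, num_steps, skip_warmup=False):
    """Summarize generation times in seconds.

    times: per-image wall-clock times, in run order.
    num_steps: inference steps per image, used for per-step latency.
    skip_warmup: drop the first run if there is more than one.
    """
    if not times:
        raise ValueError("times must not be empty")
    measured = list(times[1:]) if skip_warmup and len(times) > 1 else list(times)
    avg = mean(measured)
    return {
        "runs": len(measured),
        "mean_s": avg,
        "min_s": min(measured),
        "max_s": max(measured),
        "s_per_step": avg / num_steps,
    }

# Toy usage with made-up timings, not measurements from this notebook.
example = summarize_times([40.0, 28.0, 28.0], num_steps=28, skip_warmup=True)
print(example)
```

In this notebook the call would be `summarize_times(times, config.num_inference_steps)` once the benchmark loop has filled `times`.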

In [None]:
print("=" * 60)
print("SD3 Medium Performance Summary (16GB Apple Silicon)")
print("=" * 60)
print(f"{'Prompt':<20} {'Category':<18} {'Time (s)':<10}")
print("-" * 60)
for ps, t in zip(BENCHMARK_PROMPTS, times):
    print(f"{ps.name:<20} {ps.category:<18} {t:<10.1f}")
print("-" * 60)
print(f"{'Average':<20} {'':<18} {sum(times)/len(times):<10.1f}")
print(f"\nResolution: {config.width}x{config.height}")
print(f"Steps: {config.num_inference_steps}")
print(f"Guidance Scale: {config.guidance_scale}")

log_memory_usage("final")

In [None]:
del pipe
clear_memory()
log_memory_usage("after cleanup")
print("Pipeline unloaded. Memory cleared for next model.")

## Observations

Key observations from the SD3 Medium demo:

- **Text rendering**: MMDiT's two-way attention improves text in images compared to SDXL. The SD3 authors report that T5-XXL matters most for typography and for long, detailed prompts, so text rendering in this CLIP-only configuration is weaker than with the full pipeline.
- **Spatial reasoning**: Better understanding of spatial relationships between objects (e.g., "on top of", "behind") thanks to the joint text-image token processing.
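- **Memory**: Dropping T5-XXL is essential for 16GB. Image quality remains good with the dual CLIP encoders (CLIP-L + CLIP-G), while the full pipeline with T5-XXL would need roughly 15GB for float16 weights alone (about 30GB in float32), leaving no room for activations or the OS.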
- **Speed**: 28 steps (the `diffusers` default for SD3) is a good quality/speed trade-off with SD3's flow-matching Euler scheduler. Fewer steps degrade quality noticeably; more steps offer diminishing returns.
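
The step-count observation can be checked with a small sweep. The sketch below times a generation callable at several step counts. `sweep_steps` is a hypothetical helper, and the stand-in workload only exists to keep the example self-contained.

```python
import time
from typing import Callable, Dict, Iterable

def sweep_steps(generate: Callable[[int], object],
                step_counts: Iterable[int]) -> Dict[int, float]:
    """Call generate(steps) for each step count and record wall-clock seconds."""
    timings = {}
    for steps in step_counts:
        start = time.perf_counter()
        generate(steps)
        timings[steps] = time.perf_counter() - start
    return timings

# Stand-in workload so the sketch runs without a model loaded.
calls = []
timings = sweep_steps(lambda steps: calls.append(steps), [14, 28, 50])
for steps, seconds in timings.items():
    print(f"{steps:>3} steps: {seconds:.4f}s")
```

With the pipeline still loaded (before the cleanup cell), the callable could set `config.num_inference_steps` and then call `generate_image(pipe, config)`. Saving each image alongside its timing allows a side-by-side quality comparison.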

**Next**: See how FLUX.1-schnell compares in notebook 05 — it uses a distilled flow matching approach that needs only 4 steps.