# 5. FLUX.1-schnell --- 12B Parameters on a MacBook

**Running the Largest Open-Source Diffusion Model via GGUF Quantization**

FLUX.1-schnell (Black Forest Labs, Aug 2024) is a 12-billion-parameter rectified flow transformer
that generates high-quality images in just **4 inference steps**. In float16 its transformer
weights alone take ~24 GB, putting it out of reach for most consumer hardware.

In this notebook we run FLUX.1-schnell on a **16 GB Apple Silicon Mac** by applying GGUF 4-bit
quantization to the transformer backbone and combining it with CPU offloading. This is the
centerpiece demo of the project: it shows that state-of-the-art diffusion models can run
locally on modest hardware with the right optimization strategy.

## Why GGUF Quantization?

| Component | Full Precision | With GGUF Q4_K_S |
|-----------|---------------|-------------------|
| FLUX transformer (12B params) | ~24 GB (float16) | ~6.8 GB (4-bit) |
| Text encoders (CLIP + T5) | ~2 GB | ~2 GB (unchanged) |
| VAE decoder | ~0.3 GB | ~0.3 GB (unchanged) |
| **Total peak** | **~26 GB** | **~10 GB** |
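
The transformer row of the table follows from simple arithmetic: parameter count times bits per weight. The sketch below reproduces it, assuming Q4_K_S stores roughly 4.5 bits per weight once block scale metadata is counted (an approximation, not a figure from this project). The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


FLUX_TRANSFORMER_PARAMS = 12e9  # parameter count quoted in this notebook

fp16_gb = weight_memory_gb(FLUX_TRANSFORMER_PARAMS, 16)
# Assumption: ~4.5 effective bits per weight for Q4_K (4-bit codes + block scales).
q4_gb = weight_memory_gb(FLUX_TRANSFORMER_PARAMS, 4.5)

print(f"float16 : {fp16_gb:.1f} GB")
print(f"Q4_K_S  : {q4_gb:.2f} GB")
print(f"Saved   : {fp16_gb - q4_gb:.2f} GB")
```

The 4-bit estimate lands close to the ~6.8 GB listed in the table; the actual file size also depends on which tensors are kept at higher precision.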

Key ideas:

- The FLUX.1 transformer has 12B parameters = ~24 GB in float16
- Our Mac has 16 GB unified memory --- not enough for full precision
- **GGUF Q4_K_S** quantizes weights to 4 bits with super-block scaling, reducing the
  transformer to ~6.8 GB
- Combined with `enable_model_cpu_offload()` and float16 for the text encoders / VAE,
  the full pipeline fits in ~10 GB peak
- **Trade-off**: Slight quality reduction vs full precision, but still impressive for
  consumer hardware and more than good enough for prototyping and experimentation
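
To see why per-block scaling matters, here is a toy comparison in pure Python. It is a simplified illustration, not the actual Q4_K layout: real Q4_K groups weights into 256-weight super-blocks and also quantizes the per-block scales, which this sketch omits. The function names and the synthetic weights are assumptions made for the example.

```python
import random


def quantize_blockwise(weights, block_size=32, bits=4):
    """Quantize each block with its own scale and minimum, then dequantize."""
    levels = (1 << bits) - 1  # 15 distinct steps for 4-bit codes
    dequantized = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        lo, hi = min(block), max(block)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant blocks
        codes = [round((w - lo) / scale) for w in block]  # integers in [0, 15]
        dequantized.extend(lo + c * scale for c in codes)
    return dequantized


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic weights: seven blocks of small values plus one block with a wide range.
weights = [random.gauss(0.0, 0.02) for _ in range(224)]
weights += [random.gauss(0.0, 1.0) for _ in range(32)]

per_block = quantize_blockwise(weights, block_size=32)
single_scale = quantize_blockwise(weights, block_size=len(weights))

err_block = mean_abs_error(weights, per_block)
err_global = mean_abs_error(weights, single_scale)
print(f"per-block scales : mean abs error {err_block:.5f}")
print(f"one global scale : mean abs error {err_global:.5f}")
```

With a single scale, the one wide-range block dictates the step size for every weight, so the small-valued blocks lose most of their resolution. Per-block scales keep that damage local.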

In [None]:
import sys
import os
import time

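# Ensure the project root is on sys.path so we can import config/ and models/.
# When run from notebooks/, the project root is the parent directory;
# otherwise we assume the current working directory is the project root.
cwd = os.getcwd()
project_root = os.path.dirname(cwd) if os.path.basename(cwd) == "notebooks" else cwd
if project_root not in sys.path:
    sys.path.insert(0, project_root)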

import torch
import matplotlib.pyplot as plt
from PIL import Image

from config.default import DiffusionConfig
from models.memory_utils import setup_mps_environment, clear_memory, log_memory_usage
from models.pipeline_factory import load_pipeline, generate_image
from models.prompt_bank import BENCHMARK_PROMPTS

In [None]:
config = DiffusionConfig(
    model_name="flux-schnell",
    quantization="Q4_K_S",
    num_inference_steps=4,
    guidance_scale=0.0,  # FLUX.1-schnell is step-distilled and takes no guidance input
    height=512,
    width=512,
    seed=42,
    dtype="float16",
    enable_cpu_offload=True,
)

print(f"Model: {config.model_name}")
print(f"Quantization: {config.quantization} (4-bit GGUF)")
print(f"Steps: {config.num_inference_steps} (step-distilled)")
print(f"Guidance Scale: {config.guidance_scale} (no CFG needed)")
print(f"Resolution: {config.width}x{config.height}")
print(f"CPU offload: {config.enable_cpu_offload}")

## Why Only 4 Steps?

Traditional diffusion pipelines typically apply **Classifier-Free Guidance (CFG)** at every
denoising step. The model then runs **twice** per step, once with the text condition and once
without, and the two predictions are blended. A typical workflow looks like:

```
Traditional: 50 steps x 2 forward passes = 100 forward passes total
```
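
The blending step can be sketched in a few lines. This is a minimal illustration of the standard CFG formula on plain Python lists, not code from this project. The names `cfg_combine` and `total_forward_passes`, the toy prediction values, and the guidance scale of 7.5 are all assumptions chosen for the example.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def total_forward_passes(num_steps, uses_cfg):
    """CFG needs two model evaluations per step; without it, one is enough."""
    return num_steps * (2 if uses_cfg else 1)


# Toy stand-ins for the model's two noise predictions at a single step.
cond_pred = [0.5, -0.2, 1.0]
uncond_pred = [0.1, 0.0, 0.8]

guided = cfg_combine(cond_pred, uncond_pred, guidance_scale=7.5)
print("guided prediction:", [round(v, 3) for v in guided])
print("50 steps with CFG:", total_forward_passes(50, uses_cfg=True), "forward passes")
```

A scale of 1.0 returns the conditional prediction unchanged, and larger values push further away from the unconditional one.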

FLUX.1-schnell changes the game in two ways:

1. **Step Distillation**: According to its model card, FLUX.1-schnell was trained with
   latent adversarial diffusion distillation, which lets it produce high-quality results in
   1 to 4 steps with a **single forward pass** per step. The model takes no guidance input
   at inference time (guidance distillation is the technique used for the sibling FLUX.1-dev
   model), which is why we set `guidance_scale=0.0`.

2. **Flow Matching with Straight Trajectories**: Instead of the curved denoising paths of
   traditional DDPM/DDIM schedulers, flow matching learns nearly straight trajectories
   from noise to data. Straighter paths need fewer discretization steps to follow accurately.

Combined, this gives us:

```
FLUX.1-schnell: 4 steps x 1 forward pass = 4 forward passes total
```

That is a **25x reduction** in neural network evaluations compared to a traditional 50-step
CFG pipeline --- and the image quality remains remarkably competitive.
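
Why straighter paths need fewer steps can be seen in a one-dimensional toy problem. The sketch below is not the FLUX sampler. It integrates two hand-written velocity fields with fixed-step Euler, one tracing a straight line from "noise" to "data" and one tracing a curved path between the same endpoints. All names and values are assumptions for illustration.

```python
import math


def euler_integrate(velocity, x_start, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x_start
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x


NOISE, DATA = -2.0, 3.0  # toy start and end points


def straight_velocity(x, t):
    # Constant velocity: the path is a straight line from NOISE to DATA.
    return DATA - NOISE


def curved_velocity(x, t):
    # Same endpoints, but the path is x(t) = NOISE + (DATA - NOISE) * sin(pi * t / 2).
    return (DATA - NOISE) * (math.pi / 2) * math.cos(math.pi * t / 2)


errors = {}
for steps in (1, 4, 50):
    err_straight = abs(euler_integrate(straight_velocity, NOISE, steps) - DATA)
    err_curved = abs(euler_integrate(curved_velocity, NOISE, steps) - DATA)
    errors[steps] = (err_straight, err_curved)
    print(f"{steps:>2} steps | straight error {err_straight:.4f} | curved error {err_curved:.4f}")
```

The straight path is followed exactly with any step count, while the curved path still carries a visible error at 4 steps. Real flow-matching trajectories are only approximately straight, so this shows the tendency rather than a guarantee.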

In [None]:
log_memory_usage("before loading")
print("Loading FLUX.1-schnell with GGUF Q4_K_S quantization...")
print("(First run will download ~7 GB of model weights)")

pipe = load_pipeline(config)

log_memory_usage("after loading FLUX.1-schnell")

In [None]:
prompt = "A corgi holding a wooden sign that reads 'FLUX.1'"
config.prompt = prompt

print(f"Prompt: {prompt}")
print(f"Generating with just {config.num_inference_steps} steps...")

start_time = time.perf_counter()
image = generate_image(pipe, config)
elapsed = time.perf_counter() - start_time

print(f"Generation time: {elapsed:.1f}s")
log_memory_usage("after generation")

fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.imshow(image)
ax.set_title(
    f"FLUX.1-schnell | {config.num_inference_steps} steps | {elapsed:.1f}s | GGUF Q4_K_S",
    fontsize=12,
    fontweight="bold",
)
ax.axis("off")
plt.tight_layout()
plt.show()

os.makedirs(os.path.join("..", "outputs", "flux_schnell"), exist_ok=True)
image.save(os.path.join("..", "outputs", "flux_schnell", "corgi_sign.png"))

## Step Count Comparison

How does image quality change with different step counts? Since FLUX.1-schnell is distilled
for few-step sampling, we compare **1, 2, 4, and 8** steps to see the diminishing-returns
curve. Expect coarser, less detailed images at 1 step and strong quality by 4 steps; going to
8 steps roughly doubles inference time relative to 4 for what is usually a marginal improvement.

In [None]:
step_counts = [1, 2, 4, 8]
step_images = []
step_times = []

test_prompt = "A photorealistic astronaut riding a white horse on Mars, cinematic lighting"

for steps in step_counts:
    config.prompt = test_prompt
    config.num_inference_steps = steps

    start = time.perf_counter()
    img = generate_image(pipe, config)
    elapsed = time.perf_counter() - start

    step_images.append(img)
    step_times.append(elapsed)
    print(f"  {steps} steps: {elapsed:.1f}s")

# Reset to default
config.num_inference_steps = 4

fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for ax, img, steps, t in zip(axes, step_images, step_counts, step_times):
    ax.imshow(img)
    ax.set_title(
        f"{steps} step{'s' if steps > 1 else ''} | {t:.1f}s",
        fontsize=12,
        fontweight="bold",
    )
    ax.axis("off")

fig.suptitle(
    "FLUX.1-schnell: Quality vs Step Count", fontsize=14, fontweight="bold"
)
plt.tight_layout()
plt.show()

## Benchmark Suite

Now we run all 5 benchmark prompts from `models.prompt_bank`. These are the same prompts
used in the SD3 Medium notebook, enabling a fair head-to-head comparison later in Notebook 06.
Each prompt targets a different capability: text rendering, photorealism, spatial reasoning,
artistic style transfer, and compositional complexity.
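
The cells below read three attributes from each benchmark entry: `name`, `category`, and `prompt`. A minimal sketch of a compatible entry type is shown here. The class name `PromptSpec` and the two entries are hypothetical placeholders; the real definitions in `models.prompt_bank` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """Hypothetical entry type; the real class in models.prompt_bank may differ."""
    name: str      # short identifier, used as the output filename stem
    category: str  # capability being tested, e.g. "text rendering"
    prompt: str    # full text sent to the pipeline


# Placeholder entries for illustration only, not the project's benchmark set.
EXAMPLE_PROMPTS = [
    PromptSpec(
        "corgi_sign",
        "text rendering",
        "A corgi holding a wooden sign that reads 'FLUX.1'",
    ),
    PromptSpec(
        "astronaut_horse",
        "photorealism",
        "A photorealistic astronaut riding a white horse on Mars, cinematic lighting",
    ),
]

for ps in EXAMPLE_PROMPTS:
    print(f"{ps.name:<20} {ps.category:<18} {ps.prompt[:40]}...")
```

Keeping a stable `name` per prompt lets different notebooks save outputs under matching filenames, which makes later side-by-side comparison straightforward.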

In [None]:
config.num_inference_steps = 4
config.guidance_scale = 0.0

images = []
times = []

for i, ps in enumerate(BENCHMARK_PROMPTS):
    print(f"[{i+1}/{len(BENCHMARK_PROMPTS)}] {ps.category}: {ps.prompt[:60]}...")
    config.prompt = ps.prompt

    start = time.perf_counter()
    img = generate_image(pipe, config)
    elapsed = time.perf_counter() - start

    images.append(img)
    times.append(elapsed)
    print(f"  Done in {elapsed:.1f}s")

fig, axes = plt.subplots(1, len(images), figsize=(20, 4))
for ax, img, ps, t in zip(axes, images, BENCHMARK_PROMPTS, times):
    ax.imshow(img)
    ax.set_title(f"{ps.category}\n{t:.1f}s", fontsize=10, fontweight="bold")
    ax.axis("off")

fig.suptitle(
    "FLUX.1-schnell --- Benchmark Suite (512x512, 4 steps, GGUF Q4_K_S)",
    fontsize=14,
    fontweight="bold",
)
plt.tight_layout()
plt.show()

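output_dir = os.path.join("..", "outputs", "flux_schnell")
os.makedirs(output_dir, exist_ok=True)  # in case the single-image cell was skipped
for img, ps in zip(images, BENCHMARK_PROMPTS):
    img.save(os.path.join(output_dir, f"{ps.name}.png"))
print(f"Saved {len(images)} images to {output_dir}")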

In [None]:
print("=" * 60)
print("FLUX.1-schnell Performance (16GB Apple Silicon, GGUF Q4_K_S)")
print("=" * 60)
print(f"{'Prompt':<20} {'Category':<18} {'Time (s)':<10}")
print("-" * 60)
for ps, t in zip(BENCHMARK_PROMPTS, times):
    print(f"{ps.name:<20} {ps.category:<18} {t:<10.1f}")
print("-" * 60)
print(f"{'Average':<20} {'':<18} {sum(times)/len(times):<10.1f}")
print(f"\nResolution: {config.width}x{config.height}")
print(f"Steps: {config.num_inference_steps}")
print(f"Quantization: {config.quantization}")

log_memory_usage("final")

In [None]:
del pipe
clear_memory()
log_memory_usage("after cleanup")
print("Pipeline unloaded. Memory cleared.")

## Key Observations

- **Speed**: 4 steps is remarkably fast. This is what step distillation combined with
  flow matching enables: each step is a single forward pass through the 12B transformer,
  with no duplicate CFG evaluation.

- **Text rendering**: FLUX.1 excels at rendering legible text within images, a known
  strength of MMDiT-style (Multimodal Diffusion Transformer) architectures, which run
  joint attention over text and image tokens instead of separate cross-attention layers.

- **GGUF trade-off**: Q4_K_S quantization has minimal visible quality impact at 512x512.
  The super-block scaling in Q4_K_S preserves important weight distributions better than
  naive 4-bit quantization.

- **Memory**: Peak usage stays around 10 GB thanks to quantization + CPU offloading,
  comfortably within the 16 GB unified memory budget of entry-level Apple Silicon Macs.

- **Next**: Notebook 06 puts SD3 Medium and FLUX.1-schnell head-to-head on the same
  benchmark prompts, comparing quality, speed, and memory usage side by side.