End-to-end character consistency pipeline built on FLUX.1-dev, combining a
subject-specific LoRA, PuLID-Flux face identity injection, and Union
ControlNet for pose control. Produced for a Senior ML/CV position focused on
interactive AI avatars.
## TL;DR

- 20 photos → LoRA trained in ~3 h on a single RTX A6000 (48 GB, bf16, ai-toolkit)
- 4 configurations compared on 9 fixed prompts: base Flux → +LoRA → +LoRA+PuLID → +LoRA+PuLID+ControlNet
- ArcFace identity similarity jumps from chance level (≈0.00) to 0.78 with the full stack
- The full pipeline runs end-to-end through the ComfyUI REST API and is reproducible via the Python scripts in `scripts/`
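The ComfyUI side boils down to posting a workflow graph (exported in ComfyUI's API format) to the server's `/prompt` endpoint. A minimal sketch of that client pattern is below; the node ids `"6"` and `"3"` are placeholders that depend on the exported graph, and the host/port match the headless setup described later:

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # default ComfyUI port

def build_payload(workflow: dict, prompt_text: str, seed: int) -> dict:
    """Patch prompt text and seed into a workflow exported in API format.

    Node ids "6" (text encoder) and "3" (sampler) are illustrative --
    they depend on the specific exported graph.
    """
    wf = json.loads(json.dumps(workflow))  # deep copy, leave caller's graph intact
    wf["6"]["inputs"]["text"] = prompt_text
    wf["3"]["inputs"]["seed"] = seed
    return {"prompt": wf}

def submit(payload: dict) -> str:
    """POST the graph to ComfyUI's /prompt endpoint; returns the prompt id."""
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]
```

The deep copy matters in practice: the same base graph is re-patched for each of the 9 prompts per configuration.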
## Results

### ArcFace similarity across configurations

| Configuration | Mean | Median | Std | No-face rate |
|---|---|---|---|---|
| base Flux | −0.008 | −0.017 | 0.030 | 11.1 % |
| + LoRA (step 1750) | 0.606 | 0.721 | 0.261 | 0.0 % |
| + LoRA + PuLID v3 | 0.783 | 0.838 | 0.095 | 11.1 % |
| + LoRA + PuLID + CN (strong, arm-up) | 0.710 | 0.718 | 0.141 | 11.1 % |
Base Flux at −0.008 confirms the metric's sanity: unconditioned generations have no identity relationship with the reference.
### What each step adds

- LoRA alone already crosses the "same person" threshold (0.3) on most prompts. The outliers are profile or distant shots, where ArcFace itself breaks down.
- + PuLID boosts the median to 0.84 while mostly preserving visual quality (see notes.md for the iteration story: PuLID was re-tuned three times before it stopped dominating the LoRA). Tight weight/scheduling was critical.
- + ControlNet (applied only to full-body prompts) brings pose determinism at a small identity cost (mean −0.07 vs. PuLID alone), which is the expected trade-off and matches the senior-engineer pattern of choosing per-frame, not globally.

See gallery/final_grid.jpg for the full side-by-side, and gallery/final_grid_small.jpg for a viewable 1100-px version.
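The scoring behind these numbers (in `scripts/arcface_eval.py`) is cosine similarity of L2-normalized face embeddings against a reference centroid. The core math is simple enough to sketch with toy vectors; real ArcFace embeddings are 512-dimensional and come from insightface, so the 3-d lists here are stand-ins only:

```python
import math

def normalize(v):
    """L2-normalize a vector (plain lists, standing in for 512-d embeddings)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(embeddings):
    """Mean of L2-normalized embeddings, re-normalized: the reference identity."""
    normed = [normalize(e) for e in embeddings]
    mean = [sum(col) / len(normed) for col in zip(*normed)]
    return normalize(mean)

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy "embeddings" from two reference photos of the same subject:
ref = centroid([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]])
same = cosine([1.0, 0.05, 0.05], ref)   # near 1: same identity
other = cosine([0.0, 1.0, 0.0], ref)    # near 0: unrelated face
```

Same scale as the table: ≈0 means no identity relationship, values above ~0.3 read as "same person".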
## Pipeline overview

```
           ┌────────────────────────────────────────┐
raw photos │ dataset/raw (26 WhatsApp photos)       │
           └───────────────┬────────────────────────┘
                           │ scripts/analyze.py
                           │ - insightface face detection
                           │ - identity centroid clustering
                           │ - reject distant / multi-face / sub-person shots
                           ▼
           ┌────────────────────────────────────────┐
filtered   │ 20 accepted                            │
           └───────────────┬────────────────────────┘
                           │ scripts/fix_crop.py
                           │ - face-aware cropping excluding non-subjects
                           │ - aspect clamp to [0.5, 2.0]
                           │ - upscale to min 1024 short side
                           │ manual blur pass for images where other faces
                           │ geometrically overlap the subject
                           ▼
           ┌────────────────────────────────────────┐
training   │ dataset/processed/*.jpg + *.txt        │
set        │ - structured captions (close-up /      │
           │   portrait / full body) + "ohwx man"   │
           └───────────────┬────────────────────────┘
                           │ ai-toolkit (Flux LoRA, rank 32, bf16)
                           │ - 2500 steps, bs=2, lr=1e-4
                           │ - 10 sample batches, ema 0.99
                           ▼
           ┌────────────────────────────────────────┐
LoRA       │ output/ohwx_flux_lora_v1_000001750     │
           │ (best checkpoint selected manually     │
           │  from 7 saved, see notes.md)           │
           └───────────────┬────────────────────────┘
                           │ ComfyUI headless (port 8188)
                           │ + scripts/comfy_generate.py
                           │ + PuLID-Flux (patched for ComfyUI 0.19)
                           │ + Flux.1 ControlNet Union Pro
                           ▼
           ┌────────────────────────────────────────┐
outputs    │ gallery/{01_base,02_lora,              │
           │ 03_lora_pulid,04_full}/                │
           │ 00..08_<tag>.jpg                       │
           └───────────────┬────────────────────────┘
                           │ scripts/arcface_eval.py
                           │ scripts/build_grid.py
                           │ scripts/build_plot.py
                           ▼
           ┌────────────────────────────────────────┐
reports    │ eval/*.csv + gallery/final_grid.jpg    │
           │ + gallery/summary_plot.png             │
           └────────────────────────────────────────┘
```
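The geometry rules in the `scripts/fix_crop.py` stage (aspect clamp to [0.5, 2.0], short side upscaled to at least 1024) can be sketched as two small helpers. The face-aware crop-box selection itself is omitted, and the function names are illustrative, not the script's actual API:

```python
def clamp_aspect(w: int, h: int, lo: float = 0.5, hi: float = 2.0):
    """Trim crop dimensions so w/h lands inside [lo, hi] (Flux bucket range)."""
    aspect = w / h
    if aspect < lo:        # too tall: trim height
        h = int(w / lo)
    elif aspect > hi:      # too wide: trim width
        w = int(h * hi)
    return w, h

def upscale_short_side(w: int, h: int, target: int = 1024):
    """Scale so the short side reaches `target`; never downscale."""
    scale = max(1.0, target / min(w, h))
    return round(w * scale), round(h * scale)
```

Clamping before upscaling keeps the resize bounded: a 500×2000 crop is first cut to 500×1000, then scaled to 1024×2048.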
## Final hyperparameters

### LoRA training (`config/ohwx_flux_lora.yaml`)

| Param | Value | Rationale |
|---|---|---|
| linear / linear_alpha (rank) | 32 | Good capacity/overfit balance on 20 imgs |
| optimizer | adamw8bit | Memory-efficient |
| lr | 1e-4 | Standard for Flux LoRA |
| batch_size / grad_accum | 2 / 1 | bf16 allowed 2 samples to fit in 48 GB |
| steps | 2500 | ~125 epochs on 20 imgs; step 1750 picked as best |
| dtype | bf16 | No quantization; A6000 has the VRAM budget |
| resolution | [768, 1024] multi-res | ai-toolkit auto-buckets |
| ema_decay | 0.99 | Reduces per-step noise |
| save_every | 250 | 10 checkpoints for ablation |
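The table maps onto an ai-toolkit config roughly as follows. Key names mirror ai-toolkit's published Flux LoRA examples but are not copied from the repo's actual `config/ohwx_flux_lora.yaml`, so treat this as an illustrative sketch and verify against that file:

```yaml
# Illustrative sketch; verify key names against config/ohwx_flux_lora.yaml
network:
  type: lora
  linear: 32
  linear_alpha: 32
train:
  batch_size: 2
  gradient_accumulation_steps: 1
  steps: 2500
  lr: 1e-4
  optimizer: adamw8bit
  dtype: bf16
  gradient_checkpointing: true
  ema_config:
    use_ema: true
    ema_decay: 0.99
save:
  save_every: 250
datasets:
  - folder_path: dataset/processed
    resolution: [768, 1024]
```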
### PuLID-Flux (v3, final)

| Param | Value | Why |
|---|---|---|
| reference image | dataset/processed/15.jpg | Sharpest frontal close-up, low JPEG noise |
| weight | 0.4 | Higher weights drowned the LoRA → plastic faces |
| start_at | 0.15 | Let the LoRA build composition first |
| end_at | 0.9 | Release at the end so skin detail finishes naturally |
| face encoder | AntelopeV2 ONNX | CPU-only; buffalo_l is lower quality |
| image encoder | EVA02-CLIP-L @ 336 px | Official PuLID-Flux requirement |
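The `start_at`/`end_at` pair acts as a gate over the sampling trajectory. A minimal sketch of the effective injection weight at a given denoising fraction, using the values from the table (the real PuLID-Flux node applies this inside attention, so this is the scheduling idea only, not the node's implementation):

```python
def pulid_weight(t: float, weight: float = 0.4,
                 start_at: float = 0.15, end_at: float = 0.9) -> float:
    """Effective identity-injection weight at sampling fraction t in [0, 1].

    Off outside [start_at, end_at]: early steps let the LoRA lay out
    composition, and the tail lets skin detail finish naturally.
    """
    return weight if start_at <= t <= end_at else 0.0
```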
### Union ControlNet Pro (pose)

| Param | Value | Why |
|---|---|---|
| applied on | only forest_hike, mountain_winter, desert_sunset | Full-body prompts only |
| strength | 0.85 | Strong; forces the arm-up pose against prompt cues |
| start_percent | 0.0 | Active from step 0 |
| end_percent | 0.80 | Hold through most of sampling; release only at the tail |
| resolution | 896 × 1216 (portrait) | Matches the pose skeleton aspect (0.74) |
| pose skeleton | OpenPose, programmatically edited | Full-body standing, left arm raised high; padded 18 % above head |
| CN type | openpose via SetUnionControlNetType | Union Pro mode selector |
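"Programmatically edited" here means moving keypoints in the flat OpenPose `[x, y, confidence]` layout and growing the canvas for the raised arm. The sketch below assumes COCO-order indices and takes the 18 % padding relative to canvas height; both the indices and the helper names are illustrative, not the repo's actual code:

```python
# COCO-order indices (illustrative): 0 = nose, 7 = left wrist
NOSE, L_WRIST = 0, 7

def raise_left_arm(keypoints, canvas_h):
    """Move the left wrist well above the head in a flat
    [x0, y0, c0, x1, y1, c1, ...] keypoint list (y grows downward)."""
    kps = list(keypoints)
    head_y = kps[NOSE * 3 + 1]
    kps[L_WRIST * 3 + 1] = head_y - 0.25 * canvas_h  # wrist above head
    return kps

def pad_above_head(canvas_h, pad_frac=0.18):
    """New canvas height and y-offset after padding above the head,
    so the raised wrist stays inside the rendered skeleton."""
    pad = round(pad_frac * canvas_h)
    return canvas_h + pad, pad
```

All edited keypoint y-values then get shifted down by the returned offset before the skeleton is re-rendered at 896 × 1216.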
## Hardware / performance

| Stage | Time | VRAM peak |
|---|---|---|
| LoRA training, 2500 steps @ 1024 | ~3 h 10 min (4.8 s/step) | 42 GB / 48 GB |
| Base Flux inference (9 imgs, 25 steps) | ~4 min (~24 s/img) | 33 GB |
| + LoRA inference | ~4 min | 34 GB |
| + LoRA + PuLID (first run: model load) | ~5 min first, ~4 min repeats | 36 GB |
| + LoRA + PuLID + ControlNet | ~40 s / 896×1216 img | 40 GB |
All numbers are on the A6000 in bf16 without quantization; `gradient_checkpointing: true` was enabled during training.
## Repository layout

```
avatar/
├── README.md                you are here
├── notes.md                 trade-offs, failed experiments, lessons
├── scripts/
│   ├── analyze.py           dataset auto-filter (insightface)
│   ├── fix_crop.py          face-aware crop for Flux buckets
│   ├── caption.py           structured captioner with trigger word
│   ├── generate.py          diffusers-based base + LoRA generation
│   ├── comfy_generate.py    ComfyUI API client for PuLID / ControlNet
│   ├── arcface_eval.py      ArcFace cosine sim vs reference centroid
│   ├── build_grid.py        4-way comparison grid
│   └── build_plot.py        summary plot with ablation
├── dataset/
│   ├── raw/                 (gitignored – photos)
│   └── processed/           (20 cropped + captioned images)
├── output/ohwx_flux_lora_v1/
│   ├── ohwx_flux_lora_v1_000001750.safetensors   (chosen checkpoint)
│   └── samples/             training checkpoint samples
├── ComfyUI/                 (submodule-style – not in repo)
├── ai-toolkit/              (submodule-style – not in repo)
├── gallery/
│   ├── 01_base/*.jpg
│   ├── 02_lora/*.jpg
│   ├── 03_lora_pulid/*.jpg
│   ├── 04_full/*.jpg
│   ├── final_grid.jpg       full-res comparison
│   ├── final_grid_small.jpg
│   └── summary_plot.png
└── eval/
    ├── 01_base.csv
    ├── 02_lora.csv
    ├── 03_lora_pulid.csv
    └── 04_full.csv
```
## Limitations and next steps

- The ArcFace metric is not perceptual. See notes.md → Goodhart's law section: metric improvement doesn't always mean visually better faces.
- Pose mismatch on close-ups. Only a full-body CN pose was used; a production pipeline should keep a per-composition pose bank (headshot, medium, full) rather than a single reference.
- Source data is WhatsApp-compressed. Original DSLR / raw photos would yield a sharper LoRA and a cleaner PuLID reference.
- The T5 prompt-embedding cache is not shared across configurations; re-encoding the same 9 prompts in each config wastes ~30 s and could be batched.
- Multi-LoRA stacking (a style LoRA on top of the identity LoRA) is not implemented yet; it is the next exploration.