End-to-end character consistency pipeline built on FLUX.1-dev, combining a
subject-specific LoRA, PuLID-Flux face identity injection, and Union
ControlNet for pose control. Produced for a Senior ML/CV position focused on
interactive AI avatars.
## TL;DR

- 20 photos → LoRA trained in ~3 h on a single RTX A6000 (48 GB, bf16, ai-toolkit)
- 4 configurations compared on 9 fixed prompts: base Flux → +LoRA → +LoRA+PuLID → +LoRA+PuLID+ControlNet
- ArcFace identity similarity jumps from chance level (≈0.00) to 0.78 with the full stack
- The full pipeline runs end-to-end through the ComfyUI REST API and is reproducible via the Python scripts in `scripts/`
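The ComfyUI side boils down to posting a workflow graph (exported in ComfyUI's API format) to the server's `/prompt` endpoint. A minimal sketch of that client pattern is below; the node ids `"6"` and `"3"` are placeholders that depend on the exported graph, and the host/port match the headless setup described later:

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # default ComfyUI port

def build_payload(workflow: dict, prompt_text: str, seed: int) -> dict:
    """Patch prompt text and seed into a workflow exported in API format.

    Node ids "6" (text encoder) and "3" (sampler) are illustrative --
    they depend on the specific exported graph.
    """
    wf = json.loads(json.dumps(workflow))  # deep copy, leave caller's graph intact
    wf["6"]["inputs"]["text"] = prompt_text
    wf["3"]["inputs"]["seed"] = seed
    return {"prompt": wf}

def submit(payload: dict) -> str:
    """POST the graph to ComfyUI's /prompt endpoint; returns the prompt id."""
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]
```

The deep copy matters in practice: the same base graph is re-patched for each of the 9 prompts per configuration.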
## Results

### ArcFace similarity across configurations

| Configuration | Mean | Median | Std | No-face rate |
|---|---|---|---|---|
| base Flux | −0.008 | −0.017 | 0.030 | 11.1 % |
| + LoRA (step 1750) | 0.606 | 0.721 | 0.261 | 0.0 % |
| + LoRA + PuLID v3 | 0.783 | 0.838 | 0.095 | 11.1 % |
| + LoRA + PuLID + CN (strong, arm-up) | 0.710 | 0.718 | 0.141 | 11.1 % |
Base Flux at −0.008 confirms the metric's sanity: unconditioned generations have no identity relationship with the reference.
### What each step adds

- LoRA alone already crosses the "same person" threshold (0.3) on most prompts. The outliers are profile or distant shots, where ArcFace itself breaks down.
- + PuLID boosts the median to 0.84 while mostly preserving visual quality (see notes.md for the iteration story: PuLID was re-tuned three times before it stopped dominating the LoRA). Tight weight/scheduling was critical.
- + ControlNet (applied only to full-body prompts) brings pose determinism at a small identity cost (mean −0.07 vs. PuLID alone), which is the expected trade-off and matches the senior-engineer pattern of choosing per-frame, not globally.

See gallery/final_grid.jpg for the full side-by-side, and gallery/final_grid_small.jpg for a viewable 1100-px version.
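The scoring behind these numbers (in `scripts/arcface_eval.py`) is cosine similarity of L2-normalized face embeddings against a reference centroid. The core math is simple enough to sketch with toy vectors; real ArcFace embeddings are 512-dimensional and come from insightface, so the 3-d lists here are stand-ins only:

```python
import math

def normalize(v):
    """L2-normalize a vector (plain lists, standing in for 512-d embeddings)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(embeddings):
    """Mean of L2-normalized embeddings, re-normalized: the reference identity."""
    normed = [normalize(e) for e in embeddings]
    mean = [sum(col) / len(normed) for col in zip(*normed)]
    return normalize(mean)

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy "embeddings" from two reference photos of the same subject:
ref = centroid([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]])
same = cosine([1.0, 0.05, 0.05], ref)   # near 1: same identity
other = cosine([0.0, 1.0, 0.0], ref)    # near 0: unrelated face
```

Same scale as the table: ≈0 means no identity relationship, values above ~0.3 read as "same person".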
## Pipeline overview

```
           ┌────────────────────────────────────────┐
raw photos │ dataset/raw (26 WhatsApp photos)       │
           └───────────────┬────────────────────────┘
                           │ scripts/analyze.py
                           │ - insightface face detection
                           │ - identity centroid clustering
                           │ - reject distant / multi-face / sub-person shots
                           ▼
           ┌────────────────────────────────────────┐
filtered   │ 20 accepted                            │
           └───────────────┬────────────────────────┘
                           │ scripts/fix_crop.py
                           │ - face-aware cropping excluding non-subjects
                           │ - aspect clamp to [0.5, 2.0]
                           │ - upscale to min 1024 short side
                           │ manual blur pass for images where other faces
                           │ geometrically overlap the subject
                           ▼
           ┌────────────────────────────────────────┐
training   │ dataset/processed/*.jpg + *.txt        │
set        │ - structured captions (close-up /      │
           │   portrait / full body) + "ohwx man"   │
           └───────────────┬────────────────────────┘
                           │ ai-toolkit (Flux LoRA, rank 32, bf16)
                           │ - 2500 steps, bs=2, lr=1e-4
                           │ - 10 sample batches, ema 0.99
                           ▼
           ┌────────────────────────────────────────┐
LoRA       │ output/ohwx_flux_lora_v1_000001750     │
           │ (best checkpoint selected manually     │
           │  from 7 saved, see notes.md)           │
           └───────────────┬────────────────────────┘
                           │ ComfyUI headless (port 8188)
                           │ + scripts/comfy_generate.py
                           │ + PuLID-Flux (patched for ComfyUI 0.19)
                           │ + Flux.1 ControlNet Union Pro
                           ▼
           ┌────────────────────────────────────────┐
outputs    │ gallery/{01_base,02_lora,              │
           │ 03_lora_pulid,04_full}/                │
           │ 00..08_<tag>.jpg                       │
           └───────────────┬────────────────────────┘
                           │ scripts/arcface_eval.py
                           │ scripts/build_grid.py
                           │ scripts/build_plot.py
                           ▼
           ┌────────────────────────────────────────┐
reports    │ eval/*.csv + gallery/final_grid.jpg    │
           │ + gallery/summary_plot.png             │
           └────────────────────────────────────────┘
```
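The geometry rules in the `scripts/fix_crop.py` stage (aspect clamp to [0.5, 2.0], short side upscaled to at least 1024) can be sketched as two small helpers. The face-aware crop-box selection itself is omitted, and the function names are illustrative, not the script's actual API:

```python
def clamp_aspect(w: int, h: int, lo: float = 0.5, hi: float = 2.0):
    """Trim crop dimensions so w/h lands inside [lo, hi] (Flux bucket range)."""
    aspect = w / h
    if aspect < lo:        # too tall: trim height
        h = int(w / lo)
    elif aspect > hi:      # too wide: trim width
        w = int(h * hi)
    return w, h

def upscale_short_side(w: int, h: int, target: int = 1024):
    """Scale so the short side reaches `target`; never downscale."""
    scale = max(1.0, target / min(w, h))
    return round(w * scale), round(h * scale)
```

Clamping before upscaling keeps the resize bounded: a 500×2000 crop is first cut to 500×1000, then scaled to 1024×2048.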
## Final hyperparameters

### LoRA training (`config/ohwx_flux_lora.yaml`)

| Param | Value | Rationale |
|---|---|---|
| linear / linear_alpha (rank) | 32 | Good capacity/overfit balance on 20 imgs |
| optimizer | adamw8bit | Memory-efficient |
| lr | 1e-4 | Standard for Flux LoRA |
| batch_size / grad_accum | 2 / 1 | bf16 allowed 2 samples to fit in 48 GB |
| steps | 2500 | ~125 epochs on 20 imgs; step 1750 picked as best |
| dtype | bf16 | No quantization; A6000 has the VRAM budget |
| resolution | [768, 1024] multi-res | ai-toolkit auto-buckets |
| ema_decay | 0.99 | Reduces per-step noise |
| save_every | 250 | 10 checkpoints for ablation |
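The table maps onto an ai-toolkit config roughly as follows. Key names mirror ai-toolkit's published Flux LoRA examples but are not copied from the repo's actual `config/ohwx_flux_lora.yaml`, so treat this as an illustrative sketch and verify against that file:

```yaml
# Illustrative sketch; verify key names against config/ohwx_flux_lora.yaml
network:
  type: lora
  linear: 32
  linear_alpha: 32
train:
  batch_size: 2
  gradient_accumulation_steps: 1
  steps: 2500
  lr: 1e-4
  optimizer: adamw8bit
  dtype: bf16
  gradient_checkpointing: true
  ema_config:
    use_ema: true
    ema_decay: 0.99
save:
  save_every: 250
datasets:
  - folder_path: dataset/processed
    resolution: [768, 1024]
```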
### PuLID-Flux (v3, final)

| Param | Value | Why |
|---|---|---|
| reference image | dataset/processed/15.jpg | Sharpest frontal close-up, low JPEG noise |
| weight | 0.4 | Higher weights drowned the LoRA → plastic faces |
| start_at | 0.15 | Let the LoRA build composition first |
| end_at | 0.9 | Release at the end so skin detail finishes naturally |
| face encoder | AntelopeV2 ONNX | CPU-only; buffalo_l is lower quality |
| image encoder | EVA02-CLIP-L @ 336 px | Official PuLID-Flux requirement |
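The `start_at`/`end_at` pair acts as a gate over the sampling trajectory. A minimal sketch of the effective injection weight at a given denoising fraction, using the values from the table (the real PuLID-Flux node applies this inside attention, so this is the scheduling idea only, not the node's implementation):

```python
def pulid_weight(t: float, weight: float = 0.4,
                 start_at: float = 0.15, end_at: float = 0.9) -> float:
    """Effective identity-injection weight at sampling fraction t in [0, 1].

    Off outside [start_at, end_at]: early steps let the LoRA lay out
    composition, and the tail lets skin detail finish naturally.
    """
    return weight if start_at <= t <= end_at else 0.0
```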
### Union ControlNet Pro (pose)

| Param | Value | Why |
|---|---|---|
| applied on | only forest_hike, mountain_winter, desert_sunset | Full-body prompts only |
| strength | 0.85 | Strong; forces the arm-up pose against prompt cues |
| start_percent | 0.0 | Active from step 0 |
| end_percent | 0.80 | Hold through most of sampling; release only at the tail |
| resolution | 896 × 1216 (portrait) | Matches the pose skeleton aspect (0.74) |
| pose skeleton | OpenPose, programmatically edited | Full-body standing, left arm raised high; padded 18 % above head |
| CN type | openpose via SetUnionControlNetType | Union Pro mode selector |
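"Programmatically edited" here means moving keypoints in the flat OpenPose `[x, y, confidence]` layout and growing the canvas for the raised arm. The sketch below assumes COCO-order indices and takes the 18 % padding relative to canvas height; both the indices and the helper names are illustrative, not the repo's actual code:

```python
# COCO-order indices (illustrative): 0 = nose, 7 = left wrist
NOSE, L_WRIST = 0, 7

def raise_left_arm(keypoints, canvas_h):
    """Move the left wrist well above the head in a flat
    [x0, y0, c0, x1, y1, c1, ...] keypoint list (y grows downward)."""
    kps = list(keypoints)
    head_y = kps[NOSE * 3 + 1]
    kps[L_WRIST * 3 + 1] = head_y - 0.25 * canvas_h  # wrist above head
    return kps

def pad_above_head(canvas_h, pad_frac=0.18):
    """New canvas height and y-offset after padding above the head,
    so the raised wrist stays inside the rendered skeleton."""
    pad = round(pad_frac * canvas_h)
    return canvas_h + pad, pad
```

All edited keypoint y-values then get shifted down by the returned offset before the skeleton is re-rendered at 896 × 1216.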
## Hardware / performance

| Stage | Time | VRAM peak |
|---|---|---|
| LoRA training, 2500 steps @ 1024 | ~3 h 10 min (4.8 s/step) | 42 GB / 48 GB |
| Base Flux inference (9 imgs, 25 steps) | ~4 min (~24 s/img) | 33 GB |
| + LoRA inference | ~4 min | 34 GB |
| + LoRA + PuLID (first run: model load) | ~5 min first, ~4 min repeats | 36 GB |
| + LoRA + PuLID + ControlNet | ~40 s / 896×1216 img | 40 GB |
All numbers are on the A6000 in bf16 without quantization; `gradient_checkpointing: true` was enabled during training.
## Repository layout

```
avatar/
├── README.md                you are here
├── notes.md                 trade-offs, failed experiments, lessons
├── scripts/
│   ├── analyze.py           dataset auto-filter (insightface)
│   ├── fix_crop.py          face-aware crop for Flux buckets
│   ├── caption.py           structured captioner with trigger word
│   ├── generate.py          diffusers-based base + LoRA generation
│   ├── comfy_generate.py    ComfyUI API client for PuLID / ControlNet
│   ├── arcface_eval.py      ArcFace cosine sim vs reference centroid
│   ├── build_grid.py        4-way comparison grid
│   └── build_plot.py        summary plot with ablation
├── dataset/
│   ├── raw/                 (gitignored – photos)
│   └── processed/           (20 cropped + captioned images)
├── output/ohwx_flux_lora_v1/
│   ├── ohwx_flux_lora_v1_000001750.safetensors   (chosen checkpoint)
│   └── samples/             training checkpoint samples
├── ComfyUI/                 (submodule-style – not in repo)
├── ai-toolkit/              (submodule-style – not in repo)
├── gallery/
│   ├── 01_base/*.jpg
│   ├── 02_lora/*.jpg
│   ├── 03_lora_pulid/*.jpg
│   ├── 04_full/*.jpg
│   ├── final_grid.jpg       full-res comparison
│   ├── final_grid_small.jpg
│   └── summary_plot.png
└── eval/
    ├── 01_base.csv
    ├── 02_lora.csv
    ├── 03_lora_pulid.csv
    └── 04_full.csv
```
## Limitations and next steps

- The ArcFace metric is not perceptual. See notes.md → Goodhart's law section: metric improvement doesn't always mean visually better faces.
- Pose mismatch on close-ups. Only a full-body CN pose was used; a production pipeline should keep a per-composition pose bank (headshot, medium, full) rather than a single reference.
- Source data is WhatsApp-compressed. Original DSLR / raw photos would yield a sharper LoRA and a cleaner PuLID reference.
- The T5 prompt-embedding cache is not shared across configurations; re-encoding the same 9 prompts in each config wastes ~30 s and could be batched.
- Multi-LoRA stacking (a style LoRA on top of the identity LoRA) is not implemented yet; it is the next exploration.