azamatsab/avatar
Flux Consistent Character Pipeline

End-to-end character consistency pipeline built on FLUX.1-dev, combining a subject-specific LoRA, PuLID-Flux face identity injection, and Union ControlNet for pose control. Produced for a Senior ML/CV position focused on interactive AI avatars.

[Figure: final 4-way comparison grid (gallery/final_grid.jpg)]

TL;DR

  • 20 photos → LoRA trained in ~3 h on a single RTX A6000 (48 GB, bf16, ai-toolkit)
  • 4 configurations compared on 9 fixed prompts: base Flux, +LoRA, +LoRA+PuLID, +LoRA+PuLID+ControlNet
  • ArcFace identity similarity jumps from random (0.00) to 0.78 after the full stack
  • Full pipeline runs end-to-end through a ComfyUI REST API, reproducible via Python scripts in scripts/
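The "ComfyUI REST API" item above means workflows are queued over HTTP rather than through the UI. A minimal sketch of how a client might submit a workflow graph (the `/prompt` endpoint and JSON envelope are ComfyUI's standard API; the workflow dict and function names here are illustrative, not the actual `scripts/comfy_generate.py` code):

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # headless instance, see the reproduction steps

def build_payload(workflow: dict, client_id: str = "avatar-pipeline") -> dict:
    """Wrap a ComfyUI workflow graph in the JSON envelope /prompt expects."""
    return {"prompt": workflow, "client_id": client_id}

def submit(workflow: dict) -> str:
    """POST the workflow to /prompt and return the queued prompt_id."""
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=json.dumps(build_payload(workflow)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]
```

Results can then be polled from ComfyUI's `/history/<prompt_id>` endpoint until the run completes.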

[Figure: summary plot (gallery/summary_plot.png)]

Results

ArcFace similarity across configurations

| Configuration | Mean | Median | Std | No-face rate |
|---|---|---|---|---|
| base Flux | −0.008 | −0.017 | 0.030 | 11.1 % |
| + LoRA (step 1750) | 0.606 | 0.721 | 0.261 | 0.0 % |
| + LoRA + PuLID v3 | 0.783 | 0.838 | 0.095 | 11.1 % |
| + LoRA + PuLID + CN (strong, arm-up) | 0.710 | 0.718 | 0.141 | 11.1 % |

Base Flux at −0.008 confirms metric sanity: unconditioned generations have no identity relationship with the reference.

What each step adds

  • LoRA alone already crosses the "same person" threshold (0.3) on most prompts. Outliers are profile or distant shots where ArcFace itself breaks.
  • + PuLID boosts median to 0.84 while mostly preserving visual quality (see notes.md for the iteration story — PuLID was re-tuned three times before it stopped dominating LoRA). Tight weight/scheduling was critical.
  • + ControlNet (applied only to full-body prompts) brings pose determinism at a small identity cost (−0.02), the expected trade-off; in production this choice should be made per frame, not globally.

See gallery/final_grid.jpg for the full side-by-side, and gallery/final_grid_small.jpg for a viewable 1100-px version.
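The metric behind the table is cosine similarity between an ArcFace embedding of each generated face and the centroid of the reference-set embeddings. A minimal sketch of the scoring step, assuming embeddings have already been extracted (e.g. via insightface); the function name is illustrative, not the actual `scripts/arcface_eval.py` API:

```python
import numpy as np

def identity_score(gen_emb: np.ndarray, ref_embs: np.ndarray) -> float:
    """Cosine similarity between a generated-face embedding and the
    L2-normalised centroid of the reference embeddings."""
    centroid = ref_embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    gen = gen_emb / np.linalg.norm(gen_emb)
    return float(gen @ centroid)
```

Scores above roughly 0.3 are treated as "same person" in the discussion above; the no-face rate counts generations where the detector finds no face at all.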

Pipeline overview

           ┌────────────────────────────────────────┐
raw photos │ dataset/raw  (26 WhatsApp photos)      │
           └───────────────┬────────────────────────┘
                           │ scripts/analyze.py
                           │   - insightface face detection
                           │   - identity centroid clustering
                           │   - reject distant / multi-face / sub-person shots
                           ▼
           ┌────────────────────────────────────────┐
 filtered  │ 20 accepted                            │
           └───────────────┬────────────────────────┘
                           │ scripts/fix_crop.py
                           │   - face-aware cropping excluding non-subjects
                           │   - aspect clamp to [0.5, 2.0]
                           │   - upscale to min 1024 short side
                           │ manual blur pass for images where other faces
                           │ geometrically overlap the subject
                           ▼
           ┌────────────────────────────────────────┐
 training  │ dataset/processed/*.jpg + *.txt        │
   set     │   - structured captions (close-up /    │
           │     portrait / full body) + "ohwx man" │
           └───────────────┬────────────────────────┘
                           │ ai-toolkit (Flux LoRA, rank 32, bf16)
                           │   - 2500 steps, bs=2, lr=1e-4
                           │   - 10 sample batches, ema 0.99
                           ▼
           ┌────────────────────────────────────────┐
   LoRA    │ output/ohwx_flux_lora_v1_000001750     │
           │ (best checkpoint selected manually     │
           │  from 7 saved, see notes.md)           │
           └───────────────┬────────────────────────┘
                           │ ComfyUI headless (port 8188)
                           │ + scripts/comfy_generate.py
                           │ + PuLID-Flux (patched for ComfyUI 0.19)
                           │ + Flux.1 ControlNet Union Pro
                           ▼
           ┌────────────────────────────────────────┐
  outputs  │ gallery/{01_base,02_lora,              │
           │          03_lora_pulid,04_full}/       │
           │          00..08_<tag>.jpg              │
           └───────────────┬────────────────────────┘
                           │ scripts/arcface_eval.py
                           │ scripts/build_grid.py
                           │ scripts/build_plot.py
                           ▼
           ┌────────────────────────────────────────┐
  reports  │ eval/*.csv  +  gallery/final_grid.jpg  │
           │             +  gallery/summary_plot.png│
           └────────────────────────────────────────┘
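The accept/reject pass shown at the top of the diagram boils down to a per-photo rule over detected faces. A simplified sketch of such a filter (thresholds, field names, and defaults here are illustrative assumptions, not the actual values in scripts/analyze.py):

```python
from dataclasses import dataclass

@dataclass
class Face:
    bbox_area_frac: float   # face bbox area / image area
    centroid_sim: float     # cosine sim to the running identity centroid

def accept_photo(faces: list[Face],
                 min_area: float = 0.02,
                 min_sim: float = 0.3) -> bool:
    """Reject no-face, multi-face, distant, and wrong-person shots."""
    if len(faces) != 1:                  # no face, or multiple people
        return False
    face = faces[0]
    if face.bbox_area_frac < min_area:   # subject too distant for a clean crop
        return False
    return face.centroid_sim >= min_sim  # must match the subject identity
```

On this dataset such a rule would cut the 26 raw photos down to the 20 accepted ones.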

Final hyperparameters

LoRA training (config/ohwx_flux_lora.yaml)

| Param | Value | Rationale |
|---|---|---|
| linear / linear_alpha (rank) | 32 | Good capacity/overfit balance on 20 imgs |
| optimizer | adamw8bit | Memory-efficient |
| lr | 1e-4 | Standard for Flux LoRA |
| batch_size / grad_accum | 2 / 1 | bf16 allowed 2 samples to fit on 48 GB |
| steps | 2500 | ~125 epochs on 20 imgs; picked 1750 as best |
| dtype | bf16 | No quantization; A6000 has the VRAM budget |
| resolution | [768, 1024] multi-res | ai-toolkit auto-buckets |
| ema_decay | 0.99 | Reduces per-step noise |
| save_every | 250 | 10 checkpoints for ablation |
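The table maps onto an ai-toolkit config roughly as follows (a sketch only; key names vary between ai-toolkit versions, and the authoritative file is config/ohwx_flux_lora.yaml):

```yaml
config:
  name: ohwx_flux_lora_v1
  process:
    - type: sd_trainer
      network:
        type: lora
        linear: 32
        linear_alpha: 32
      train:
        steps: 2500
        batch_size: 2
        gradient_accumulation_steps: 1
        lr: 1e-4
        optimizer: adamw8bit
        dtype: bf16
        gradient_checkpointing: true
        ema_config:
          use_ema: true
          ema_decay: 0.99
      save:
        save_every: 250
      datasets:
        - folder_path: dataset/processed
          resolution: [768, 1024]
```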

PuLID-Flux (v3, final)

| Param | Value | Why |
|---|---|---|
| reference image | dataset/processed/15.jpg | Sharpest frontal close-up, low JPEG noise |
| weight | 0.4 | Higher weights drowned the LoRA → plastic faces |
| start_at | 0.15 | Let the LoRA build composition first |
| end_at | 0.9 | Release at the end so skin detail finishes naturally |
| face encoder | AntelopeV2 (ONNX, CPU-only) | buffalo_l is lower quality |
| image encoder | EVA02-CLIP-L @ 336 px | Official PuLID-Flux requirement |
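start_at / end_at are fractions of the sampling schedule rather than step counts. A generic sketch of the convention (not PuLID's internal code) shows how they map to sampler step indices:

```python
def active_step_range(start_at: float, end_at: float, num_steps: int) -> range:
    """Map schedule fractions (0..1) to the sampler step indices where an
    injection patch such as PuLID is applied."""
    return range(int(start_at * num_steps), int(end_at * num_steps))
```

With the final settings and 25 sampling steps, identity injection is active from roughly step 3 through step 21, leaving the first steps to the LoRA's composition and the last steps to unconstrained detail refinement.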

Union ControlNet Pro (pose)

| Param | Value | Why |
|---|---|---|
| applied on | forest_hike, mountain_winter, desert_sunset only | Full-body prompts only |
| strength | 0.85 | Strong: force the arm-up pose against prompt cues |
| start_percent | 0.0 | Active from step 0 |
| end_percent | 0.80 | Hold through most of sampling; release only at the tail |
| resolution | 896 × 1216 (portrait) | Matches the pose-skeleton aspect (0.74) |
| pose skeleton | OpenPose, programmatically edited | Full-body standing, left arm raised high; padded 18 % above head |
| CN type | openpose via SetUnionControlNetType | Union Pro mode selector |
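"Programmatically edited" here amounts to moving the left-arm keypoints of an OpenPose skeleton above the head. A sketch using the 18-point COCO layout (the indices and offsets are illustrative assumptions; the real skeleton ships with the scripts):

```python
# COCO-18 OpenPose indices (subset): 0=nose, 5=l_shoulder, 6=l_elbow, 7=l_wrist
def raise_left_arm(kps: list[tuple[float, float]],
                   lift: float = 0.25) -> list[tuple[float, float]]:
    """Return a copy of normalised (x, y) keypoints with the left arm raised:
    elbow moved to shoulder height, wrist lifted above the head (y grows downward)."""
    kps = list(kps)
    head_y = kps[0][1]
    sx, sy = kps[5]
    kps[6] = (sx + 0.05, sy)             # elbow out to the side, shoulder height
    kps[7] = (sx + 0.08, head_y - lift)  # wrist well above the head
    return kps
```

The edited skeleton is then rendered to a pose image and fed to the Union ControlNet at the resolution in the table.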

Hardware / performance

| Stage | Time | VRAM peak |
|---|---|---|
| LoRA training, 2500 steps @ 1024 | ~3 h 10 min (4.8 s/step) | 42 GB / 48 GB |
| Base Flux inference (9 imgs, 25 steps) | ~4 min (~24 s/img) | 33 GB |
| + LoRA inference | ~4 min | 34 GB |
| + LoRA + PuLID (first run: model load) | ~5 min first, ~4 min repeats | 36 GB |
| + LoRA + PuLID + ControlNet | ~40 s per 896×1216 img | 40 GB |

All runs on a single A6000 in bf16 without quantization; gradient_checkpointing: true was enabled during training.

Repository layout

avatar/
├── README.md                          you are here
├── notes.md                           trade-offs, failed experiments, lessons
├── scripts/
│   ├── analyze.py                     dataset auto-filter (insightface)
│   ├── fix_crop.py                    face-aware crop for Flux buckets
│   ├── caption.py                     structured captioner with trigger word
│   ├── generate.py                    diffusers-based base + LoRA generation
│   ├── comfy_generate.py              ComfyUI API client for PuLID / ControlNet
│   ├── arcface_eval.py                ArcFace cosine sim vs reference centroid
│   ├── build_grid.py                  4-way comparison grid
│   └── build_plot.py                  summary plot with ablation
├── dataset/
│   ├── raw/                           (gitignored – photos)
│   └── processed/                     (20 cropped + captioned images)
├── output/ohwx_flux_lora_v1/
│   ├── ohwx_flux_lora_v1_000001750.safetensors  (chosen checkpoint)
│   └── samples/                       training checkpoint samples
├── ComfyUI/                           (submodule-style – not in repo)
├── ai-toolkit/                        (submodule-style – not in repo)
├── gallery/
│   ├── 01_base/*.jpg
│   ├── 02_lora/*.jpg
│   ├── 03_lora_pulid/*.jpg
│   ├── 04_full/*.jpg
│   ├── final_grid.jpg                 full-res comparison
│   ├── final_grid_small.jpg
│   └── summary_plot.png
└── eval/
    ├── 01_base.csv
    ├── 02_lora.csv
    ├── 03_lora_pulid.csv
    └── 04_full.csv

Reproducing the pipeline

# 1. Environment
python -m venv /venv/flux
source /venv/flux/bin/activate
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
# install ai-toolkit + diffusers stack via uv (pip resolver chokes):
uv pip install -r ai-toolkit/requirements_base.txt insightface onnxruntime-gpu
# patch PuLID-Flux for ComfyUI 0.19 (forward_orig kwargs) – see notes.md

# 2. Download models (35 GB + 23 GB + 6.6 GB + 1.1 GB + 860 MB)
hf download black-forest-labs/FLUX.1-dev --local-dir models/flux1-dev
hf download comfyanonymous/flux_text_encoders clip_l.safetensors t5xxl_fp16.safetensors
hf download Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro diffusion_pytorch_model.safetensors
hf download guozinan/PuLID pulid_flux_v0.9.1.safetensors
hf download QuanSun/EVA-CLIP EVA02_CLIP_L_336_psz14_s6B.pt
# antelopev2 from MonsterMMORPG/tools (see README of PuLID-Flux)

# 3. Dataset preprocessing
python scripts/analyze.py   # filter 26 → 20
python scripts/fix_crop.py  # face-aware crop
python scripts/caption.py   # trigger-word captions

# 4. Train
cd ai-toolkit && python run.py config/ohwx_flux_lora.yaml

# 5. Launch ComfyUI headless
cd ComfyUI && python main.py --listen 127.0.0.1 --port 8188 --disable-auto-launch &

# 6. Generate all 4 galleries
python scripts/generate.py --config base
python scripts/generate.py --config lora
python scripts/comfy_generate.py --config pulid --out gallery/03_lora_pulid
python scripts/comfy_generate.py --config full  --out gallery/04_full

# 7. Eval and build reports
for c in 01_base 02_lora 03_lora_pulid 04_full; do
  python scripts/arcface_eval.py --ref dataset/processed --gen gallery/$c --out eval/$c.csv
done
python scripts/build_grid.py
python scripts/build_plot.py

Known limitations / next steps

  • ArcFace metric is not perceptual. See notes.md → Goodhart's law section. A metric improvement doesn't always mean visually better faces.
  • Pose mismatch on close-ups. Only full-body CN was used; a production pipeline should keep a per-composition pose bank (headshot, medium, full) rather than a single reference.
  • Source data is WhatsApp-compressed. Original DSLR / raw photos would yield a sharper LoRA and a cleaner PuLID reference.
  • T5 prompt embedding cache is not shared across configurations — re-encoding the same 9 prompts in each config is ~30 s wasted; could batch.
  • Multi-LoRA stacking (style LoRA on top of identity LoRA) is not implemented — next exploration.
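The embedding-cache item above could be a simple prompt-keyed memo around the text-encoder call (a sketch; `encode_fn` stands in for whatever T5 encode call the scripts actually use):

```python
import hashlib
from typing import Callable

class PromptEmbedCache:
    """Memoise text-encoder outputs so each of the 9 prompts is encoded once
    across all 4 configurations instead of once per configuration."""
    def __init__(self, encode_fn: Callable[[str], object]):
        self._encode = encode_fn
        self._cache: dict[str, object] = {}

    def __call__(self, prompt: str):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._encode(prompt)
        return self._cache[key]
```

With 9 prompts and 4 configurations this turns 36 T5 encodes into 9, recovering the ~30 s noted above.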
