Skip to content

[adapter,hparams] feat: support resuming from Hugging Face checkpoint paths#160

Merged
Jayce-Ping merged 4 commits into
X-GenGroup:mainfrom
Jayce-Ping:main
May 17, 2026
Merged

[adapter,hparams] feat: support resuming from Hugging Face checkpoint paths#160
Jayce-Ping merged 4 commits into
X-GenGroup:mainfrom
Jayce-Ping:main

Conversation

@Jayce-Ping
Copy link
Copy Markdown
Collaborator

Summary

Extend BaseAdapter.load_checkpoint to transparently resolve a Hugging Face repo spec (either explicit hf://owner/repo[/subdir][@rev] or bare owner/repo[/...]) to a local cache directory via huggingface_hub.snapshot_download. The existing lora / full / state loading branches are reused unchanged.

No new config fields. resume_path: Optional[str] keeps the same shape and gains a second meaning — local directory or HF repo spec — mirroring how model_name_or_path already accepts both.

Logic priority (resolved inside BaseAdapter._resolve_checkpoint_path)

1. `hf://` prefix              -> force HF (overrides any colliding local dir)
2. Local path exists           -> return as-is
3. Otherwise parse as          -> `owner/repo[/subfolder][@revision]` and download

huggingface_hub is already a hard dependency (pyproject.toml:39), so no dependency change.

Multi-node correctness

The download is gated on accelerator.is_local_main_process (one process per node), not is_main_process (one global). Trade-offs across the four FS-vs-gating combinations:

Scenario downloads wall clock asymmetric progress
Shared FS + is_local_main_process (chosen) 1 (HF Hub WeakFileLock dedupes across nodes) T No
Shared FS + is_main_process 1 T No
Non-shared FS + is_local_main_process (chosen) N (one per node, parallel) T No
Non-shared FS + is_main_process N 2T (rank 0 first, then nodes 1..N-1 post-barrier) Yes — rank 0 proceeds while others still downloading

is_local_main_process wins decisively on non-shared filesystems (2x faster wall clock) and avoids a latent asymmetric-progress hazard where rank 0 could enter _load_lora / _load_full_model while other ranks are still inside the resolver — which would deadlock the moment any downstream loader adds a collective op (FSDP state-dict broadcast, DeepSpeed GatheredParameters, etc.).

Fail-fast errors

A single narrow except (RepositoryNotFoundError, HfHubHTTPError) re-raises as FileNotFoundError with full context (path, repo_id, subfolder, revision, HF_TOKEN hint) — complies with no-defensive-except.mdc's "narrow + documented" exception rule.

Operational caveats (documented in the resume_path help string)

  • HF_TOKEN must be set on every node (Slurm --export=NONE and K8s without envFrom can drop env vars on workers).
  • For very large checkpoints, set HF_HUB_ENABLE_HF_TRANSFER=1 (~5x faster) or bump TORCH_NCCL_BLOCKING_WAIT / NCCL_TIMEOUT to avoid the watchdog killing the job while local rank 0 downloads.
  • HPC clusters with firewalled compute nodes may need a manual pre-download from a login node (hf download owner/repo) and a plain local path.

Commits (2 substantive + 1 merge)

  1. 2207c92 [adapter,hparams] feat: support resuming from Hugging Face checkpoint paths — the actual feature.
  2. 540ff6d [adapter,docs] refactor: hoist HF imports to module top; codify rule — small post-merge cleanup: move two huggingface_hub imports from function bodies to the module's import block, and extend constraint [Feat] Support Z-Image-Omni #22 ("Import Style") + the ff-review skill checklist with an explicit "top-level imports only" rule (with the three sanctioned exceptions: optional deps, backend-gated imports, circular-import workarounds). Pre-existing inline FSDP/DeepSpeed imports in models/abc.py are grandfathered.
  3. 5fb0b14 is the fork's merge commit; happy to have you squash if you prefer linear history.

Files changed

  • src/flow_factory/utils/checkpoint.py — added parse_hf_checkpoint_path, download_hf_checkpoint, and HF_PATH_PREFIX.
  • src/flow_factory/models/abc.py — added BaseAdapter._resolve_checkpoint_path; one-line insertion in load_checkpoint to resolve the path before any local-only dispatch.
  • src/flow_factory/hparams/model_args.py — extended resume_path's help with HF + multi-node notes.
  • examples/**/*.yaml (59 files) — normalized the inline comment on resume_path: to document the new HF support; YAML semantics unchanged.
  • .agents/knowledge/constraints.md, .agents/skills/ff-review/SKILL.md — codified the import-style rule (one line each).

Test plan

  • Pure-Python parser unit-tested (8 happy paths + 5 error paths) — all pass.
  • All 59 example YAMLs re-parse via yaml.safe_load after the bulk-comment update.
  • My added lines pass black --check standalone; pre-existing lint debt in the touched files predates this PR.
  • Single-GPU smoke: resume_path: "<small-public-lora>".
  • Multi-node smoke: 2-node x 2-GPU accelerate launch to confirm one download per node and no asymmetric progress.

Out of scope

  • Saving checkpoints to HF Hub (push_to_hub).
  • Reward-model checkpoints (different loading path).
  • A separate resume_revision config field — inline @rev is simpler and can be promoted later if users ask.

Made with Cursor

Jayce-Ping and others added 3 commits May 17, 2026 08:52
… paths

Extend `BaseAdapter.load_checkpoint` to transparently resolve a Hugging
Face repo spec (either `hf://owner/repo[/subdir][@rev]` or bare
`owner/repo[/...]`) to a local cache directory via
`huggingface_hub.snapshot_download`, reusing the existing `lora` /
`full` / `state` loading branches unchanged.

Logic priority at `_resolve_checkpoint_path`:
  1. `hf://` prefix -> force HF (overrides any colliding local dir)
  2. Local path exists -> return as-is
  3. Otherwise parse as `owner/repo[/subfolder][@revision]` and download

Multi-node-safe: download gated on `is_local_main_process` (one per
node), not `is_main_process` (one global). On non-shared filesystems
each node populates its own HF cache once; on shared filesystems
huggingface_hub's per-blob `WeakFileLock` dedupes the concurrent
`snapshot_download` calls so only one node transfers bytes.

Fail-fast: narrow `except (RepositoryNotFoundError, HfHubHTTPError)`
re-raised as `FileNotFoundError` with full context (path, repo_id,
subfolder, revision, HF_TOKEN hint) for actionable error messages.

Also updates the `resume_path` help text and normalizes the inline
comment on all 59 example YAMLs to document the new HF support. No
new config fields; `resume_path: Optional[str]` keeps the same shape.

Note: pre-existing black/isort lint debt in the touched src files is
unrelated to this change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the two `huggingface_hub` imports added in the previous commit
out of function bodies and up to the module's import block:

- `from huggingface_hub import snapshot_download` in
  `src/flow_factory/utils/checkpoint.py`
- `from huggingface_hub.errors import RepositoryNotFoundError,
  HfHubHTTPError` in `src/flow_factory/models/abc.py`

`huggingface_hub` is already a hard dependency (`pyproject.toml:39`),
so there is no import-cost concern; lazy-loading only hid the
dependency surface from readers and `isort`.

Codify the rule so this pattern is caught in review going forward:
- Extend constraint X-GenGroup#22 ("Import Style") in
  `.agents/knowledge/constraints.md` with an explicit "Top-level imports
  only" bullet, listing the three sanctioned exceptions (optional deps
  via `try/except ImportError`, backend-gated imports under runtime
  feature checks like DeepSpeed/FSDP, and unresolvable circular imports).
- Add a matching checklist item to the Code Style section of
  `.agents/skills/ff-review/SKILL.md`.

Pre-existing inline FSDP/DeepSpeed imports in `models/abc.py` (lines
~997-1152) are grandfathered under the new rule's exception (b).

Co-authored-by: Cursor <cursoragent@cursor.com>
[adapter,hparams] feat: support resuming from Hugging Face checkpoint paths
Copilot AI review requested due to automatic review settings May 17, 2026 02:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends adapter checkpoint loading so resume_path can point either to a local directory or to a Hugging Face Hub repo spec, which is resolved to a cached snapshot directory before the existing LoRA/full/state loading logic runs.

Changes:

  • Add HF checkpoint spec parsing + snapshot download helpers in utils/checkpoint.py.
  • Add BaseAdapter._resolve_checkpoint_path() and invoke it from BaseAdapter.load_checkpoint() to transparently download/resolve HF resume paths.
  • Update resume_path help text and normalize example YAML comments to document HF support.

Reviewed changes

Copilot reviewed 64 out of 64 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/flow_factory/utils/checkpoint.py Adds HF path prefix constant, parser, and snapshot_download wrapper for checkpoint resolution.
src/flow_factory/models/abc.py Resolves load_checkpoint() paths via HF Hub when needed (new _resolve_checkpoint_path).
src/flow_factory/hparams/model_args.py Expands resume_path help text to document HF repo specs + multi-node caveats.
.agents/skills/ff-review/SKILL.md Codifies “top-level imports only” guidance in the review checklist.
.agents/knowledge/constraints.md Codifies “top-level imports only” as a repository constraint (with exceptions).
examples/template/sd3_5/async_reward.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/z_image/default.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/wan22/t2v.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/wan21/t2v.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/wan21/i2v.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/sd3_5/default.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/qwen_image/rational_rewards_t2i.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/qwen_image_edit_plus/rational_rewards_edit.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/flux2_klein_base/default.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/flux1/rational_rewards_t2i.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/flux1/default.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/lora/flux1_kontext/rational_rewards_edit.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/full/z_image/default.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/full/z_image_turbo/default.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/full/wan22/t2v.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/full/flux2_klein_base/default.yaml Updates resume_path comment to mention HF repo specs.
examples/nft/full/flux1/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/z_image/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/z_image_turbo/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/wan22/t2v.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/wan22/i2v.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/wan21/v2v.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/wan21/t2v.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/wan21/i2v.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/sd3_5/nocfg.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/sd3_5/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/qwen_image/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/qwen_image_edit_plus/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/ltx2/t2av.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/ltx2/t2av_pickscore.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/ltx2/i2av.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/flux2/t2i.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/flux2/i2i.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/flux2_klein/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/flux2_klein_base/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/flux1/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/lora/flux1_kontext/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/z_image/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/z_image_turbo/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/wan22/t2v.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/wan22/i2v.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/wan21/t2v.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/wan21/i2v.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/sd3_5/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/qwen_image/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/qwen_image_edit_plus/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/flux2/t2i.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/flux2/i2i.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/flux2_klein/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/flux2_klein_base/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/flux1/default.yaml Updates resume_path comment to mention HF repo specs.
examples/grpo/full/flux1_kontext/default.yaml Updates resume_path comment to mention HF repo specs.
examples/dpo/lora/sd3_5/default.yaml Updates resume_path comment to mention HF repo specs.
examples/dgpo/lora/sd3_5/nocfg.yaml Updates resume_path comment to mention HF repo specs.
examples/dgpo/lora/sd3_5/default.yaml Updates resume_path comment to mention HF repo specs.
examples/crd/lora/sd3_5/default.yaml Updates resume_path comment to mention HF repo specs.
examples/awm/lora/sd3_5/default.yaml Updates resume_path comment to mention HF repo specs.
examples/awm/lora/flux2_klein_base/default.yaml Updates resume_path comment to mention HF repo specs.
examples/awm/lora/flux1/default.yaml Updates resume_path comment to mention HF repo specs.
Comments suppressed due to low confidence (1)

src/flow_factory/models/abc.py:1509

  • After the barrier, every process calls download_hf_checkpoint(...), which wraps snapshot_download. Even when the cache is already populated this can cause lock contention and repeated hub metadata checks across all ranks. Consider having only is_local_main_process call snapshot_download, then broadcast the resulting local path to other ranks (or have non-local ranks call with local_files_only=True via an option on download_hf_checkpoint) to make the post-barrier step strictly local/cheap.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/flow_factory/utils/checkpoint.py
Comment thread src/flow_factory/models/abc.py Outdated
Comment on lines +1494 to +1503
repo_id, subfolder, revision = parse_hf_checkpoint_path(spec)

if self.accelerator.is_local_main_process:
local_path = download_hf_checkpoint(repo_id, subfolder, revision)
logger.info(
f"[local rank 0 / global rank {self.accelerator.process_index}] "
f"resolved checkpoint '{path}' -> {local_path}"
)
self.accelerator.wait_for_everyone()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit aa883aa by un-gating the download.

All ranks now call snapshot_download directly. huggingface_hub's per-blob WeakFileLock serializes concurrent calls within each filesystem domain (cross-node on POSIX-locking shared FS, per-node on non-shared FS), so we still get exactly one actual download per filesystem domain — same byte-count as the gated version, but without the failure-before-barrier deadlock.

How the four FS x failure quadrants behave now:

  • Shared FS, success: 1 download globally (cross-node lock dedup).
  • Non-shared FS, success: N downloads in parallel (one per node).
  • Symmetric failure (auth/network/repo-not-found): every rank raises uniformly; no one reaches the trailing barrier; no deadlock.
  • Asymmetric single-rank failure (rare): residual hazard called out explicitly in the docstring — surviving ranks would trip the NCCL watchdog on the trailing wait_for_everyone().

Trailing wait_for_everyone() is kept so downstream _load_lora / _load_full_model (which may add collective ops over time) enter in lockstep.

Side effect: also resolves the post-barrier HEAD-request overhead concern from the low-confidence summary comment (no more post-barrier snapshot_download call).

Comment thread .agents/skills/ff-review/SKILL.md Outdated
- [ ] English comments and docstrings
- [ ] Apache 2.0 license header on new files
- [ ] No unnecessary wildcard imports (except `hparams`)
- [ ] **Top-level imports only** — no `import` / `from ... import ...` inside function or method bodies (constraint #22). Exceptions: documented optional deps via `try/except ImportError`, or genuine unresolvable circular imports.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit aa883aa.

Replaced the duplicated exception list with a reference to constraint #22, so the skill checklist no longer needs to be kept in sync with constraints.md. New text:

- [ ] **Top-level imports only** (constraint #22) — see that file for the three sanctioned exceptions (optional deps via `try/except ImportError`, backend-gated runtime feature checks like DeepSpeed/FSDP, unresolvable circular imports).

Follows the agents-docs-maintenance.mdc principle "Reference constraint numbers ... do not re-explain the rule."

Three fixes from the Copilot review on the upstream PR:

1. Path-traversal validation (utils/checkpoint.py):
   parse_hf_checkpoint_path now rejects '.', '..', and backslash
   segments with an informative ValueError that preserves the original
   spec. Validates at the parser front door rather than at the
   download site so error messages keep the user's exact input.
   Without this, a spec like 'owner/repo/..' would escape the snapshot
   directory via os.path.join.

2. Un-gate _resolve_checkpoint_path (models/abc.py):
   Remove the `if is_local_main_process:` gate and the post-barrier
   double-call pattern. All ranks now call snapshot_download directly;
   huggingface_hub's per-blob WeakFileLock serializes concurrent calls
   within each filesystem domain (cross-node on POSIX-locking shared
   FS, per-node on non-shared FS), so we still get exactly one
   download per filesystem domain.

   This eliminates the distributed-deadlock hazard where a download
   failure on the gated rank would raise before reaching
   wait_for_everyone(), leaving siblings blocked at the barrier until
   NCCL watchdog timeout. The trailing wait_for_everyone() is kept to
   maintain lockstep entry into the downstream loaders.

   Residual asymmetric-failure risk (one rank's network blip while
   others succeed) is documented in the docstring.

3. Skill-checklist alignment (.agents/skills/ff-review/SKILL.md):
   Replace the duplicated import-exception list with a reference to
   constraint X-GenGroup#22, where the full set of three sanctioned exceptions
   (optional deps, backend-gated runtime feature checks, circular
   imports) lives. Prevents future drift between the two documents.

Verified with 8 happy-path + 5 original-error + 6 path-traversal
parser test cases (all pass).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Jayce-Ping Jayce-Ping merged commit bfd0f14 into X-GenGroup:main May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants