fix: krea FP8 + torch.compile incompatibility on H100 (#669)#670

Open
livepeer-tessa wants to merge 1 commit into main from fix/krea-float8-compile-conflict

Conversation


@livepeer-tessa livepeer-tessa commented Mar 11, 2026

Problem

Issue #669 — Krea fails to load on remote inference after using longlive.

NotImplementedError: Float8Tensor dispatch: attempting to run unimplemented
operator/function: func=<OpOverload(op='aten.as_strided', overload='default')>

Root cause

On H100/Hopper, Krea loads with compile=True. If FP8 quantization is also active, the combination of Float8DynamicActivationFloat8WeightConfig and torch.compile(fullgraph=False) crashes during warmup.

Why it happens: With fullgraph=False, graph breaks split the compiled function into sub-graphs. At the boundary between sub-graphs, AOT autograd's gen_alias_from_base calls aten.as_strided on the Float8Tensor output of a compiled linear layer to create an output alias — but as_strided is not implemented for Float8Tensor in torchao.

Why specifically after longlive: torch._dynamo.reset() is never called between pipeline switches. After longlive runs with FP8 (no compile), its Float8Tensor dispatch state persists in the global dynamo compilation cache. When Krea then loads with FP8 + compile, the residual cache state triggers the AOT autograd aliasing code path that would otherwise be avoided on a fresh worker.
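The failure mode can be illustrated without torchao: any __torch_dispatch__ tensor subclass that implements only a whitelist of aten ops raises the same kind of error when something outside that whitelist (here, aten.as_strided) is dispatched to it. A minimal sketch — PartialDispatchTensor is an illustrative stand-in, not torchao's actual Float8Tensor:

```python
import torch

class PartialDispatchTensor(torch.Tensor):
    """Illustrative stand-in for a quantized tensor subclass that
    implements only some aten ops (as torchao's Float8Tensor does)."""

    @staticmethod
    def __new__(cls, data):
        # Wrapper subclass: shape/dtype metadata only; payload kept in _data.
        return torch.Tensor._make_wrapper_subclass(cls, data.shape, dtype=data.dtype)

    def __init__(self, data):
        self._data = data

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.ops.aten.detach.default:
            return cls(args[0]._data)
        # Everything else -- including aten.as_strided -- is unimplemented,
        # mirroring the NotImplementedError shown above.
        raise NotImplementedError(
            f"PartialDispatchTensor dispatch: attempting to run unimplemented "
            f"operator/function: func={func}"
        )

t = PartialDispatchTensor(torch.randn(2, 2))
try:
    t.as_strided((2, 2), (2, 1))  # roughly what gen_alias_from_base attempts
except NotImplementedError as e:
    print(e)
```

This is why the crash only surfaces at the AOT autograd boundary: the whitelisted ops cover the happy path, and the aliasing code path is the first place an unsupported op reaches the subclass.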

Traceback path

wan2_1/blocks/denoise.py → wan2_1/components/generator.py
  → krea_realtime_video/modules/causal_model.py (torch.compiled block)
    → torch/nn/modules/linear.py: F.linear(input, self.weight, self.bias)
      → torchao Float8Tensor __torch_function__ dispatch (aten.linear)
        → aot_autograd runtime_wrappers: gen_alias_from_base
          → aliased_base_tensor.as_strided(...)  ← CRASH

Fix

1. krea_realtime_video/pipeline.py — When FP8 quantization is active, skip block.compile(). The two optimizations are currently mutually exclusive under fullgraph=False; FP8 alone still provides meaningful memory/compute savings on H100.

2. pipeline_manager.py — Call torch._dynamo.reset() on every pipeline unload to flush stale compiled graphs and Float8 dispatch state, preventing cross-pipeline cache pollution.
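The diff itself isn't reproduced here; a minimal sketch of what the two changes might look like (function names are assumptions, not the actual code in the PR):

```python
import logging
import torch

logger = logging.getLogger(__name__)

def maybe_compile_block(block: torch.nn.Module, fp8_active: bool) -> torch.nn.Module:
    """Fix 1 (sketch): skip torch.compile when FP8 quantization is active,
    since the combination crashes under fullgraph=False."""
    if fp8_active:
        logger.warning(
            "FP8 quantization active; skipping block.compile() "
            "(aten.as_strided is unimplemented for Float8Tensor)"
        )
        return block
    return torch.compile(block, fullgraph=False)

def on_pipeline_unload() -> None:
    """Fix 2 (sketch): flush dynamo's global compile cache on unload so
    stale FP8-compiled graphs cannot pollute the next pipeline load."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    torch._dynamo.reset()
```

torch._dynamo.reset() clears the in-process compile caches globally, so calling it on every unload is the conservative choice — it costs a recompile on the next load but guarantees no cross-pipeline state survives.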

Testing

Reproduce with:

  1. Load longlive with FP8 quantization on H100
  2. Unload longlive
  3. Load krea-realtime-video — should now succeed instead of crashing during warmup

Closes #669

Summary by CodeRabbit

  • Bug Fixes
    • Improved FP8 quantization handling by skipping block-wide compilation when active to prevent incompatibilities.
    • Fixed pipeline unloading to properly clear compilation cache, preventing stale compiled graphs from affecting subsequently loaded pipelines.

Float8DynamicActivationFloat8WeightConfig is not compatible with
torch.compile(fullgraph=False). During warmup on H100 (where compile=True),
AOT autograd's gen_alias_from_base calls aten.as_strided on Float8Tensor
outputs, which is not implemented in torchao:

  NotImplementedError: Float8Tensor dispatch: attempting to run unimplemented
  operator/function: func=<OpOverload(op='aten.as_strided', overload='default')>

The crash manifests specifically after longlive (also FP8) because
torch._dynamo's compile cache is never reset between pipeline switches,
allowing longlive's Float8 dispatch state to persist and influence Krea's
subsequent compile attempt.

Two fixes:

1. krea_realtime_video/pipeline.py: when FP8 quantization is active, skip
   block.compile() — the two optimizations are currently mutually exclusive
   with fullgraph=False. FP8 alone still provides meaningful memory/compute
   savings on H100 without compile.

2. pipeline_manager.py: call torch._dynamo.reset() on every pipeline unload
   to clear stale compiled graphs and Float8 dispatch state, preventing
   cross-pipeline cache pollution.

Fixes #669

Signed-off-by: livepeer-robot <robot@livepeer.org>

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

The changes implement safeguards against FP8 quantization incompatibilities by preventing block-wide compilation of attention modules when FP8 is active, and clearing the PyTorch Dynamo compilation cache upon pipeline unload to prevent stale compiled graphs from persisting across loads.

Changes

  • FP8 Quantization Guard — src/scope/core/pipelines/krea_realtime_video/pipeline.py
    Adds conditional logic to skip block compilation when FP8 quantization is active, logging a warning to prevent incompatibility with aten.as_strided on Float8Tensor. Compilation now only proceeds in the non-FP8 code path.
  • Dynamo Compilation Cache Cleanup — src/scope/server/pipeline_manager.py
    Resets the PyTorch Dynamo compilation cache in _unload_pipeline_by_id_unsafe after CUDA cleanup to prevent stale FP8-compiled graphs from leaking into subsequently loaded pipelines.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Float8 tensions and dynamo dreams—
A cleanup ensures smoother streams,
Guard rails in place, caches now clear,
Pipelines load fresh, no ghosts to fear!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Linked Issues check ⚠️ Warning — The PR partially addresses issue #669: it implements fixes to prevent FP8 dispatch state from persisting (dynamo cache reset) and disables compile when FP8 is active. However, it does not include regression tests, CI guards, or improved logging/error messages as specified in the issue requirements. Resolution: add regression tests for the longlive→krea swap case, implement CI checks to prevent reintroduction, and improve logging with actionable remediation steps as outlined in issue #669.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed — The title accurately describes the main change: fixing an FP8 and torch.compile incompatibility on H100, which is the core issue addressed in the PR.
  • Out of Scope Changes check ✅ Passed — All code changes are directly related to fixing the FP8 and torch.compile incompatibility issue. The modifications to krea_realtime_video/pipeline.py and pipeline_manager.py align precisely with the stated objectives.


@github-actions

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-670--preview
WebSocket wss://fal.run/daydream/scope-pr-670--preview/ws
Commit 31ae08d

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-670--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/scope/core/pipelines/krea_realtime_video/pipeline.py (1)

141-160: Add a regression for the LongLive FP8 → Krea reload path.

This guard only works in combination with the unload-side reset in src/scope/server/pipeline_manager.py, and the failure shows up only after the specific unload/reload sequence. Without an automated regression here, either half of the fix can drift and the crash will only reappear during H100 warmup.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7aa4b080-c83b-41e5-9f74-596fcaa22c3b

📥 Commits

Reviewing files that changed from the base of the PR and between 5f6ee61 and 31ae08d.

📒 Files selected for processing (2)
  • src/scope/core/pipelines/krea_realtime_video/pipeline.py
  • src/scope/server/pipeline_manager.py

Comment on lines +698 to +702
  try:
      torch._dynamo.reset()
      logger.info("torch._dynamo cache reset")
  except Exception as e:
      logger.warning(f"torch._dynamo reset failed: {e}")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In PyTorch 2.9.1, if torch._dynamo.reset() is skipped or raises, do previously cached torch.compile graphs remain active for later compiles in the same worker process?

💡 Result:

Yes. In PyTorch 2.9.1, TorchDynamo stores torch.compile results in in-memory caches keyed off Python code objects (compiled bytecode + guards). Those cache entries persist for the lifetime of the worker process unless you explicitly clear them. [2], [3]

torch._dynamo.reset() is the API that clears those compile caches and restores Dynamo’s state as if you started a fresh process (it calls reset_code_caches() and clears multiple global Dynamo structures). If you skip it—or it raises before completing—then whatever caches were already created remain and can still be reused by later torch.compile calls in the same process (when the same underlying __code__ objects are encountered and guards match). [1], [2]

Sources:
[1] PyTorch v2.9.1 torch._dynamo.reset() implementation (torch/_dynamo/__init__.py)
[2] PyTorch 2.9 docs: Dynamo cache entries are stored/retrieved from a function’s __code__ object (_debug_get_cache_entry_list)
[3] PyTorch dev-discuss: each traced code object has its own cache entries (compiled bytecode + guards)


🏁 Script executed:

# Check the implementation of _unload_pipeline_by_id_unsafe around lines 698-702
sed -n '680,710p' src/scope/server/pipeline_manager.py

Repository: daydreamlive/scope

Length of output: 1304


🏁 Script executed:

# Check if there's any state tracking for the reset failure or if compile is disabled conditionally
rg "compile\s*=" src/scope/server/pipeline_manager.py | head -20

Repository: daydreamlive/scope

Length of output: 90


🏁 Script executed:

# Also check what happens next when a pipeline is loaded - does it check any worker state?
rg "_load_pipeline" src/scope/server/pipeline_manager.py -A 5 | head -40

Repository: daydreamlive/scope

Length of output: 1241


🏁 Script executed:

# Check if there's a compile flag or state that gets disabled on reset failure
rg "torch\.compile|compile_kwargs|torch_compile" src/scope/server/pipeline_manager.py | head -20

Repository: daydreamlive/scope

Length of output: 142


🏁 Script executed:

# Get more context around the compile decision
sed -n '1,50p' src/scope/server/pipeline_manager.py

Repository: daydreamlive/scope

Length of output: 1280


🏁 Script executed:

# Search for where compile is determined during pipeline load
rg "compile" src/scope/server/pipeline_manager.py -B 3 -A 3 | head -60

Repository: daydreamlive/scope

Length of output: 942


🏁 Script executed:

# Check if there's any state field tracking reset failure or worker health
rg "self\._.*reset|self\._.*failed|self\._.*error" src/scope/server/pipeline_manager.py | head -20

Repository: daydreamlive/scope

Length of output: 170


🏁 Script executed:

# Check if _load_pipeline_implementation has any fallback or state checking
sed -n '300,400p' src/scope/server/pipeline_manager.py | grep -A 20 "_load_pipeline_implementation"

Repository: daydreamlive/scope

Length of output: 44


🏁 Script executed:

# Look for the full _load_pipeline_implementation method
rg "_load_pipeline_implementation" src/scope/server/pipeline_manager.py -A 30 | head -50

Repository: daydreamlive/scope

Length of output: 2042


🏁 Script executed:

# Check if there's any exception handling around the compile call or pipeline init
rg "KreaRealtimeVideoPipeline\|compile=" src/scope/server/pipeline_manager.py -B 5 -A 5

Repository: daydreamlive/scope

Length of output: 44


🏁 Script executed:

# Search for where KreaRealtimeVideoPipeline is instantiated
rg "KreaRealtimeVideoPipeline" src/scope/server/pipeline_manager.py -B 2 -A 10

Repository: daydreamlive/scope

Length of output: 1127


🏁 Script executed:

# Look for the full context around compile decision
sed -n '250,350p' src/scope/server/pipeline_manager.py

Repository: daydreamlive/scope

Length of output: 4177


Don't treat a Dynamo reset failure as a successful unload.

This fix depends on clearing the stale FP8/compile state here. If torch._dynamo.reset() raises, the code still reports pipeline_unloaded unconditionally, and the next krea-realtime-video load will attempt to compile with the same GPU-based decision (compile=any(...for x in ("h100", "hopper"))) while stale compile caches remain active in the worker process. This can re-enter the warmup crash path from the original FP8-to-Krea conflict.

Either fail the transition (prevent pipeline_unloaded publication on reset failure) or latch a worker-level flag to force compile=False on all subsequent loads until the worker restarts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scope/server/pipeline_manager.py` around lines 698 - 702, The try/except
around torch._dynamo.reset() currently swallows errors and the code still
publishes pipeline_unloaded; change the flow so that if torch._dynamo.reset()
raises you do NOT publish pipeline_unloaded and instead either raise/return
early from the unload path or set a persistent worker-level flag (e.g.,
worker_force_no_compile = True) that subsequent load logic will read to force
compile=False; update the code paths that publish pipeline_unloaded and the load
logic that checks compile (the logic using compile=any(... for x in
("h100","hopper"))) to respect this flag so stale Dynamo/FP8 state cannot cause
re-compilation attempts.

@github-actions

✅ E2E Tests passed

Status passed
fal App daydream/scope-pr-670--preview
Run View logs

Test Artifacts

Check the workflow run for screenshots.

livepeer-tessa pushed a commit that referenced this pull request Mar 11, 2026
…ale-cache recompile

If torch._dynamo.reset() raises during pipeline unload, stale Dynamo/FP8
compile caches remain active in the worker process. Previously the code
swallowed the exception and published pipeline_unloaded unconditionally,
leaving the next krea-realtime-video load free to torch.compile against
those stale caches — re-entering the warmup crash from the FP8→Krea
conflict.

Fix: set self._dynamo_reset_failed = True on reset failure. The Krea load
path now checks this flag and forces compile=False for the lifetime of the
worker, with a clear log warning to restart the process to re-enable
compilation.

Addresses CodeRabbit review comment on PR #670.

Signed-off-by: livepeer-robot <robot@livepeer.org>

@livepeer-tessa livepeer-tessa left a comment


Good catch @coderabbitai. Filed #671 to address this.

The fix goes with the worker-level flag approach: self._dynamo_reset_failed is set to True if torch._dynamo.reset() raises. Subsequent krea-realtime-video loads read it and force compile=False for the rest of the process lifetime, with a warning to restart. The unload itself still completes and pipeline_unloaded is still published (memory is freed — we just can't guarantee the Dynamo cache is clean).

Failing the unload entirely felt like the worse option since it'd leave the pipeline in a limbo state and block future load attempts.
