flashvsr pipeline fails to load with 'No CUDA GPUs are available' on GPU-H100 fal.ai workers #675

@livepeer-tessa

Summary

The flashvsr pipeline reports "No CUDA GPUs are available" at load time on fal.ai GPU-H100 workers where CUDA is otherwise functioning normally. The error recurs across multiple load attempts within the same session, preventing the pipeline from becoming usable.

Error Details

Observed in fal.ai prod logs, 2026-03-12 21:16–21:20 UTC:

scope.server.pipeline_manager - ERROR - Failed to load pipeline flashvsr:
  No CUDA GPUs are available. If this error persists, consider removing the
  models directory '/data/models' and re-downloading models.

scope.server.pipeline_manager - ERROR - Failed to load pipeline: flashvsr
scope.server.pipeline_manager - ERROR - Some pipelines failed to load

Multiple attempts (at least 3 within the same session, spanning about 4 minutes):

  • 2026-03-12 21:16:17 UTC

  • 2026-03-12 21:20:09 UTC

  • 2026-03-12 21:20:33 UTC

Affected job and node:

  • fal.ai job: e416f15d-e5f9-4c66-8422-57623b915c94

  • fal.ai node: 1dfe12b6-1fe9-bed3-17dd-e2ae8651c9fc (worker type: fal-jobs/GPU-H100)
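For reference, the spacing between the three logged attempts can be checked with a few lines of Python. This is only a sketch over the timestamps quoted above; the helper name `gaps_in_seconds` is made up for illustration.

```python
from datetime import datetime, timezone

# Failure timestamps copied from the log excerpts in this issue (all UTC).
attempts = [
    "2026-03-12 21:16:17",
    "2026-03-12 21:20:09",
    "2026-03-12 21:20:33",
]


def gaps_in_seconds(stamps):
    """Return the gap in whole seconds between each consecutive pair of timestamps."""
    parsed = [
        datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        for s in stamps
    ]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


gaps = gaps_in_seconds(attempts)
print(gaps)       # [232, 24]
print(sum(gaps))  # 256
```

The first retry came about 3m52s after the initial failure and the second only 24s later, so the three failures fall inside a window of roughly 4 minutes rather than being evenly spaced.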

Why This Is Suspicious

The same session/job is actively processing frames (NDI output errors, spout errors, other pipelines loading), so the GPU worker is alive and other operations run fine. Only flashvsr raises "No CUDA GPUs are available", which suggests the exception originates in the flashvsr plugin's own initialization rather than from a system-wide absence of CUDA.

Possible causes:

  1. The flashvsr plugin calls torch.cuda.is_available() or checks device_count() in a way that fails on this particular driver/environment configuration
  2. The plugin checks for a specific CUDA capability not present on this H100 node variant
  3. A CUDA_VISIBLE_DEVICES environment variable is set to an unexpected value at plugin load time
  4. Race condition: plugin loads before CUDA context is fully initialized
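To narrow down causes 1 to 3, it might help to log what the process actually sees immediately before flashvsr loads. The snippet below is a rough diagnostic sketch, not code from the plugin: `collect_cuda_diagnostics` is a hypothetical helper, and where to call it inside the pipeline manager is an assumption. It uses only standard `torch.cuda` calls.

```python
import json
import os


def collect_cuda_diagnostics() -> dict:
    """Collect CUDA visibility facts for the current process (hypothetical helper)."""
    info = {
        # Cause 3: an empty or unexpected value here hides GPUs from this process.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_imported": False,
    }
    try:
        import torch
    except (ImportError, OSError) as exc:
        info["error"] = f"torch import failed: {exc}"
        return info

    info["torch_imported"] = True
    info["torch_version"] = torch.__version__
    info["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    # Cause 1: the two checks the plugin may be relying on.
    info["is_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()

    # Cause 2: compute capability reported for each visible device.
    devices = []
    for index in range(info["device_count"]):
        try:
            devices.append({
                "index": index,
                "name": torch.cuda.get_device_name(index),
                "capability": list(torch.cuda.get_device_capability(index)),
            })
        except RuntimeError as exc:
            devices.append({"index": index, "error": str(exc)})
    info["devices"] = devices
    return info


if __name__ == "__main__":
    print(json.dumps(collect_cuda_diagnostics(), indent=2))
```

Running this once at server startup and again right before the flashvsr load would also give a hint about cause 4: if the two snapshots differ, something changes CUDA visibility between the two points.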

Distinction from #574

Issue #574 reports "Invalid pipeline ID: flashvsr" (pipeline registry/model lookup failure). This issue is a different error message ("No CUDA GPUs are available") on a GPU-equipped worker where the ID itself appears to resolve correctly.

Impact

Users who select the flashvsr pipeline on fal.ai cannot use it; the failure is silent from the UI perspective ("Some pipelines failed to load").

Metadata

Labels: bug
