[Bug]: RunPod clusters fail to provision due to minCudaVersion set #3910

@r4victor

Description

Steps to reproduce

  1. Try to provision a RunPod cluster via dstack, e.g. a two-pod H100 cluster, which is currently available.
  2. Provisioning fails as if there were no capacity, with an obscure error from RunPod:
WARNING 2026-05-27T10:33:42.249 dstack._internal.server.background.pipeline_tasks.jobs_submitted
  job(50b3ad)tame-snake-1-0-0: NVIDIA H100 80GB HBM3 launch in runpod/EUR-IS-3 failed:
  RunpodApiClientError([{'message': 'Error creating cluster - cluster creation failed', 'path':
  ['createCluster'], 'extensions': {'code': 'RUNPOD'}}])

Actual behaviour

If I drop minCudaVersion: "12.8" from the request, then provisioning succeeds:

input_fields.append(f'minCudaVersion: "{RunpodProvider.MIN_CUDA_VERSION}"')

But the host does have an NVIDIA driver supporting CUDA 12.8, so minCudaVersion seems to work incorrectly. Moreover, specifying any lower value, e.g. "11" or "11.1", also fails. My guess is that minCudaVersion does not work for CreateCluster even though it's listed in the reference.

Introduced in #3304, so RunPod clusters have not been working since then.

As a workaround, we can drop minCudaVersion from the CreateCluster request until it's clarified/fixed on the RunPod side.
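
A minimal sketch of what the workaround could look like, assuming the request's input fields are collected in a list of strings as in the line quoted above. The names `build_input_fields`, `is_cluster`, and the module-level `MIN_CUDA_VERSION` are hypothetical and only illustrate the idea; they are not the actual dstack code.

```python
from typing import List, Optional

# Hypothetical stand-in for RunpodProvider.MIN_CUDA_VERSION.
MIN_CUDA_VERSION = "12.8"


def build_input_fields(
    base_fields: List[str],
    *,
    is_cluster: bool,
    min_cuda_version: Optional[str] = MIN_CUDA_VERSION,
) -> List[str]:
    """Return the input fields for a create request.

    minCudaVersion is appended only for non-cluster requests, since
    cluster creation appears to fail whenever the field is present.
    """
    fields = list(base_fields)  # copy so the caller's list is not mutated
    if min_cuda_version is not None and not is_cluster:
        fields.append(f'minCudaVersion: "{min_cuda_version}"')
    return fields


# Single-pod request keeps the constraint; cluster request omits it.
pod_fields = build_input_fields(["podCount: 1"], is_cluster=False)
cluster_fields = build_input_fields(["podCount: 2"], is_cluster=True)
```

Gating on a flag rather than deleting the field everywhere keeps the CUDA constraint for single-pod requests, assuming those are unaffected, which this issue does not confirm.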

Expected behaviour

No response

dstack version

master

Server logs

Additional information

No response

Labels

bug (Something isn't working)
