DGX Spark local config: Qwen3-Coder-Next FP8 + Qwen2.5-Coder-3B autocomplete #11713

ztolley · 2026-03-22T17:19:41Z

ztolley
Mar 22, 2026

I wanted to share a tested local Continue setup for a DGX Spark private coding-assistant workflow:

https://github.com/ztolley/dgx-spark-qwen3-coder-next-compose

This is a community-tested reference setup, not an official vendor certification.

Model split:

Qwen/Qwen3-Coder-Next-FP8 for chat, edit, and apply
Qwen/Qwen2.5-Coder-3B for autocomplete

Why this may be useful:

it is a real tested setup on DGX Spark
both models are exposed as local OpenAI-compatible endpoints
the main default context is 32768, with 40960 also validated as feasible
the repo includes a Continue config example and measured memory/performance notes

Measured notes from the default stack:

main model GPU memory: about 88.8 GiB
autocomplete GPU memory: about 11.0 GiB
repeated prompt with prefix caching: 2.37s
autocomplete short completion: 1.56s

I also tested Qwen/Qwen2.5-Coder-7B for autocomplete under vLLM. It fit, but it was slower and not clearly better enough to replace 3B as the default.

If anyone else is running large local coding models with a separate autocomplete endpoint in Continue, I’d be interested in comparing setups.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DGX Spark local config: Qwen3-Coder-Next FP8 + Qwen2.5-Coder-3B autocomplete #11713

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

DGX Spark local config: Qwen3-Coder-Next FP8 + Qwen2.5-Coder-3B autocomplete #11713

Uh oh!

ztolley Mar 22, 2026

Replies: 0 comments

ztolley
Mar 22, 2026