Finally, I got it running on Windows 11 with a 12th-gen Intel CPU and an RTX 4070, using the "--gpu" option.
I'm just trying to help anyone who, like me, is interested in the "dlgo" project on Windows 11.
This is the crash I hit at first (CPU backend):
Loaded Z:\downloads\Qwen3.5-0.8B-Q8_0.gguf (0.3s)
Model: qwen35 Params: 24 layers, 1024 dim, 8 heads, vocab 248320
Context: 8192 tokens
Backend: CPU (20 threads)
Sampling: temp=0.70 top-k=40 top-p=0.90
Type /help for commands, or start chatting.
>>> hey
Exception 0xc000001d 0x0 0x0 0x7ff62c5b9cca
PC=0x7ff62c5b9cca
signal arrived during external code execution

runtime.cgocall(0x7ff62c5af780, 0x3d4a58ffad08)
	C:/Program Files/Go/src/runtime/cgocall.go:167 +0x3e fp=0x3d4a58fface0 sp=0x3d4a58ffac78 pc=0x7ff62c30d47e
github.com/computerex/dlgo/quant._Cfunc_batch_quantize_for_type(0x3d4a8f7cc000, 0x3d4a59b96000, 0x8, 0x400, 0x440, 0x16)
	_cgo_gotypes.go:58 +0x48 fp=0x3d4a58ffad08 sp=0x3d4a58fface0 pc=0x7ff62c395408
github.com/computerex/dlgo/quant.BatchQuantizeForType(...)
	C:/Users/Administrator.BLACKWOLF/go/pkg/mod/github.com/computerex/dlgo@v0.0.0-20260401165446-8383c977f6aa/quant/qq_dot_cgo.go:47
github.com/computerex/dlgo/blas.QBatchGEMMParallel({0x3d4a777cc000, 0x21000, 0x3000000}, 0x3d4a5b1a8140, {0x3d4a8f7cc000, 0x5800, 0x800000}, 0x16, 0x3d4a5906e000)
	C:/Users/Administrator.BLACKWOLF/go/pkg/mod/github.com/computerex/dlgo@v0.0.0-20260401165446-8383c977f6aa/blas/blas.go:656 +0x269 fp=0x3d4a58ffad88 sp=0x3d4a58ffad08 pc=0x7ff62c3a5389
github.com/computerex/dlgo/models/llm.ForwardBatch(0x3d4a5d160000, {0x3d4a59064700, 0x16, 0x20}, 0x0, 0x3d4a58fda4b0, 0x3d4a59078008, 0x3d4a590982c8)
	C:/Users/Administrator.BLACKWOLF/go/pkg/mod/github.com/computerex/dlgo@v0.0.0-20260401165446-8383c977f6aa/models/llm/forward_batch.go:255 +0x572 fp=0x3d4a58ffb1a0 sp=0x3d4a58ffad88 pc=0x7ff62c3fead2
github.com/computerex/dlgo/models/llm.(*Pipeline).GenerateDetailed(0x3d4a5919a000, {0x3d4a59064500?, 0x6?}, {0x200, {0x3f333333, 0x28, 0x3f666666, 0x0, 0x3f8ccccd}, 0xffffffffffffffff, ...})
{Blah Blah Blah}
- Install the Vulkan SDK on the C: drive and use version "1.4.341.1" (hard-coded in the source code)
- Install VS2022 with the full C++ workload
{Source Path}>findstr /s /i "avx512" *.c *.h *.go
quant\cpool.c:#pragma GCC target("avx2,fma,f16c,avx512f,avx512bw,avx512dq,avx512vl,avx512vnni")
quant\simd_dot.c:#pragma GCC target("avx2,fma,f16c,avx512f,avx512bw,avx512dq,avx512vl,avx512vnni")
quant\simd_qq_dot.c:#pragma GCC target("avx2,fma,f16c,avx512f,avx512bw,avx512dq,avx512vl,avx512vnni")
quant\simd_qq_dot.c:#if defined(__AVXVNNI__) || defined(__AVX512VNNI__)
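For context, those pragmas force GCC to emit AVX-512 code unconditionally, regardless of what the host CPU supports. A gentler patch than deleting them outright (just a sketch, not tested against dlgo's actual cgo build flags) would be to guard the pragma on the compiler's own target macros:

```c
/* Sketch: only request AVX-512 codegen when the build is already
 * targeting AVX-512; otherwise fall back to an AVX2/FMA baseline,
 * which 12th-gen Intel CPUs do support. */
#if defined(__AVX512F__)
#pragma GCC target("avx2,fma,f16c,avx512f,avx512bw,avx512dq,avx512vl,avx512vnni")
#else
#pragma GCC target("avx2,fma,f16c")
#endif
```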
- Open "quant\cpool.c", "quant\simd_dot.c", and "quant\simd_qq_dot.c", and comment out the #pragma GCC target("avx2,fma,f16c,avx512f,avx512bw,avx512dq,avx512vl,avx512vnni") line in each:
ex: //#pragma GCC target("avx2,fma,f16c,avx512f,avx512bw,avx512dq,avx512vl,avx512vnni")
- Then build the code:
ex: {Source path}>go build -a -tags vulkan -ldflags "-linkmode internal" -o dlgo.exe ./cmd/dlgo/
- Then run the binary:
ex: {Source path}>dlgo run --gpu Z:\ResearchSource\dlgo_back\Qwen3.5-0.8B-Q8_0.gguf
My result:
Z:\ResearchSource\dlgo>dlgo run --gpu Z:\ResearchSource\dlgo_back\Qwen3.5-0.8B-Q8_0.gguf
Loaded Z:\ResearchSource\dlgo_back\Qwen3.5-0.8B-Q8_0.gguf (0.2s)
[dlgo/gpu] Memory heaps: 2
[dlgo/gpu] heap 0: 12010 MB, flags=0x1 (device-local)
[dlgo/gpu] heap 1: 16254 MB, flags=0x0
[dlgo/gpu] Memory types: 5
[dlgo/gpu] type 0: heap=1, flags=0x0
[dlgo/gpu] type 1: heap=0, flags=0x1 DEV
[dlgo/gpu] type 2: heap=1, flags=0x6 HOST_VIS HOST_COH
[dlgo/gpu] type 3: heap=1, flags=0xe HOST_VIS HOST_COH HOST_CACH
[dlgo/gpu] type 4: heap=0, flags=0x7 DEV HOST_VIS HOST_COH
[dlgo/gpu] VK_KHR_cooperative_matrix: YES
[dlgo/gpu] coopmat[0]: M=16 N=16 K=16 A=f16 B=f16 C=f16 R=f16 sat=0 scope=3
[dlgo/gpu] coopmat[1]: M=16 N=8 K=16 A=f16 B=f16 C=f16 R=f16 sat=0 scope=3
[dlgo/gpu] coopmat[2]: M=16 N=8 K=8 A=f16 B=f16 C=f16 R=f16 sat=0 scope=3
[dlgo/gpu] coopmat[3]: M=16 N=16 K=16 A=f16 B=f16 C=f32 R=f32 sat=0 scope=3
[dlgo/gpu] coopmat[4]: M=16 N=8 K=16 A=f16 B=f16 C=f32 R=f32 sat=0 scope=3
[dlgo/gpu] coopmat[5]: M=16 N=8 K=8 A=f16 B=f16 C=f32 R=f32 sat=0 scope=3
[dlgo/gpu] coopmat[6]: M=16 N=16 K=32 A=u8 B=u8 C=u32 R=u32 sat=0 scope=3
[dlgo/gpu] coopmat[7]: M=16 N=16 K=32 A=s8 B=s8 C=s32 R=s32 sat=0 scope=3
[dlgo/gpu] coopmat[8]: M=16 N=8 K=32 A=u8 B=u8 C=u32 R=u32 sat=0 scope=3
[dlgo/gpu] coopmat[9]: M=16 N=8 K=32 A=s8 B=s8 C=s32 R=s32 sat=0 scope=3
[dlgo/gpu] coopmat[10]: M=16 N=16 K=16 A=f16 B=f16 C=f32 R=f32 sat=0 scope=3
[dlgo/gpu] coopmat[11]: M=16 N=16 K=32 A=f16 B=f16 C=f16 R=f16 sat=0 scope=3
[dlgo/gpu] coopmat[12]: M=16 N=16 K=32 A=f16 B=f16 C=f32 R=f32 sat=0 scope=3
[dlgo/gpu] coopmat[13]: M=16 N=16 K=32 A=f16 B=f16 C=f16 R=f16 sat=0 scope=3
[dlgo/gpu] coopmat[14]: M=16 N=16 K=32 A=f16 B=f16 C=f32 R=f32 sat=0 scope=3
[dlgo/gpu] about to call vkCreateDevice (coopmat=1, ext_count=4)
[dlgo/gpu] Initialized Vulkan on NVIDIA GeForce RTX 4070 (12010 MB VRAM)
[dlgo/gpu] Uploading model to NVIDIA GeForce RTX 4070 (12010 MB total, 11239 MB usable)...
[dlgo/memsafety] Working set limit: 27.7 GB (total RAM: 31.7 GB)
[dlgo/memsafety] Go memory limit: 8.0 GB (was 8589934592.0 GB), system RAM: 22.4/31.7 GB available
[dlgo/gpu] dp4a enabled for attention + FFN + MoE (per-tensor safe types)
[dlgo/gpu] All 24 layers on GPU
[dlgo/gpu] Memory: 1140 MB VRAM + 0 MB host-visible = 1140 MB total (FP16 KV cache)
[dlgo/gpu] SSM state on GPU (18 SSM layers, 16 heads, 16 KV groups, state=128x128)
Model: qwen35
Params: 24 layers, 1024 dim, 8 heads, vocab 248320
Context: 8192 tokens
Backend: GPU (NVIDIA GeForce RTX 4070)
Sampling: temp=0.70 top-k=40 top-p=0.90
Type /help for commands, or start chatting.
>>> hi
[dlgo/gpu] Loaded 137 compute pipelines
The user is greeting me with "hi". This is a simple, friendly opening that I should respond to warmly and offer assistance. There's no need for complex analysis or explanations since it's just a casual hello. I'll keep the response brief and friendly while inviting them to ask questions about what they'd like help with.
</think>
Hello! 👋 How can I assist you today?
176.4 tok/s | 80 tokens | 0.7s
- When I tested other models such as qwen3-7b-instruct-q4_k_m.gguf, Qwen3.5-9B-Q8_0.gguf, and Qwen3.5-9B-Q4_K_M.gguf
=> They did not work correctly. Perhaps some code still needs to be added. ;)
Cause: the C code uses AVX-512, which 12th-gen Intel CPUs do not support, so execution dies on an illegal instruction (exception 0xc000001d is STATUS_ILLEGAL_INSTRUCTION).
Solution (quick and dirty): find the affected files with the "findstr" command shown above and comment out the "#pragma GCC target" lines.
Thanks