Summary
Two related items found while building ds4's ROCm backend natively on Windows (AMD HIP SDK, no WSL) for gfx1151 / Strix Halo:
- A concrete portability bug: the ROCm staged model loader fails at exactly 2.00 GiB on Windows.
- A feature: native Windows (HIP SDK) build support for
ds4-bench.
Bug: >2 GiB staged read truncates through 32-bit off_t (Windows/MSVC)
cuda_pread_full() in rocm/ds4_rocm_runtime.cuh does:
ssize_t n = pread(fd, (char *)buf + done, n_req, (off_t)(offset + done));
Under the MSVC ABI off_t is 32-bit (long), so once offset reaches 2 GiB the (off_t) cast truncates to a negative value and the staged read fails:
ds4: ROCm model staged read failed at 2.00 GiB: Input/output error
i.e. the ~80 GB model never loads on Windows. (On Linux off_t is 64-bit under LFS, so this only manifests on the MSVC build.) Fix: pass the 64-bit offset un-truncated on Windows; POSIX path unchanged.
Feature: native Windows ROCm/HIP build
Building ds4-bench.exe for gfx1151 natively (hipcc/clang from the AMD HIP SDK 7.1, MSVC ABI — no WSL) needs:
- A dependency-free POSIX shim (
ds4_win.h): mmap/anon-mmap/mlock, madvise, sysconf, flock/fcntl/pread, fmemopen, clock_gettime, ftruncate, 64-bit _stat64 remap (so stat'ing a >2 GB model doesn't EOVERFLOW), etc.
- A Win32 pthread shim and a Winsock shim (
ds4_distributed.c is now in CORE_OBJS and pulls in BSD sockets/poll, linked even into the bench).
- Vendoring the full rocWMMA header tree (the Windows HIP SDK ships only
rocwmma-version.hpp, but main's ROCm path now uses rocwmma::fragment directly), plus hipBLASLt and -std=c++17.
- A
windows-rocm build path; all portability is behind #ifdef _WIN32 so Linux/macOS/CUDA/Linux-ROCm builds are byte-for-byte unchanged.
With the >2 GiB fix above, ds4-bench.exe loads the model and runs on gfx1151 (Strix Halo) — measured ~114 tok/s prefill @2k, ~195 @16k (with DS4_CUDA_MANAGED, see the companion OOM issue). CLI (linenoise/termios) and server (full sockets) frontends are deferred; the bench is the perf vehicle.
Happy to send PR(s) for both. (The native CPU MinGW build is a separate, smaller variant.)
Environment
- GPU: AMD Radeon 8060S (gfx1151, Strix Halo), Windows 11. AMD HIP SDK / ROCm 7.1.
Summary
Two related items found while building ds4's ROCm backend natively on Windows (AMD HIP SDK, no WSL) for gfx1151 / Strix Halo:
ds4-bench.Bug: >2 GiB staged read truncates through 32-bit
off_t(Windows/MSVC)cuda_pread_full()inrocm/ds4_rocm_runtime.cuhdoes:Under the MSVC ABI
off_tis 32-bit (long), so onceoffsetreaches 2 GiB the(off_t)cast truncates to a negative value and the staged read fails:i.e. the ~80 GB model never loads on Windows. (On Linux
off_tis 64-bit under LFS, so this only manifests on the MSVC build.) Fix: pass the 64-bit offset un-truncated on Windows; POSIX path unchanged.Feature: native Windows ROCm/HIP build
Building
ds4-bench.exefor gfx1151 natively (hipcc/clang from the AMD HIP SDK 7.1, MSVC ABI — no WSL) needs:ds4_win.h): mmap/anon-mmap/mlock, madvise, sysconf, flock/fcntl/pread, fmemopen, clock_gettime, ftruncate, 64-bit_stat64remap (so stat'ing a >2 GB model doesn't EOVERFLOW), etc.ds4_distributed.cis now in CORE_OBJS and pulls in BSD sockets/poll, linked even into the bench).rocwmma-version.hpp, but main's ROCm path now usesrocwmma::fragmentdirectly), plus hipBLASLt and-std=c++17.windows-rocmbuild path; all portability is behind#ifdef _WIN32so Linux/macOS/CUDA/Linux-ROCm builds are byte-for-byte unchanged.With the >2 GiB fix above,
ds4-bench.exeloads the model and runs on gfx1151 (Strix Halo) — measured ~114 tok/s prefill @2k, ~195 @16k (withDS4_CUDA_MANAGED, see the companion OOM issue). CLI (linenoise/termios) and server (full sockets) frontends are deferred; the bench is the perf vehicle.Happy to send PR(s) for both. (The native CPU MinGW build is a separate, smaller variant.)
Environment