Skip to content

ROCm/Windows: native AMD HIP SDK build for gfx1151 + >2 GiB staged-read off_t truncation #360

@jamesburton

Description

@jamesburton

Summary

Two related items found while building ds4's ROCm backend natively on Windows (AMD HIP SDK, no WSL) for gfx1151 / Strix Halo:

  1. A concrete portability bug: the ROCm staged model loader fails at exactly 2.00 GiB on Windows.
  2. A feature: native Windows (HIP SDK) build support for ds4-bench.

Bug: >2 GiB staged read truncates through 32-bit off_t (Windows/MSVC)

cuda_pread_full() in rocm/ds4_rocm_runtime.cuh does:

ssize_t n = pread(fd, (char *)buf + done, n_req, (off_t)(offset + done));

Under the MSVC ABI off_t is 32-bit (long), so once offset reaches 2 GiB the (off_t) cast truncates to a negative value and the staged read fails:

ds4: ROCm model staged read failed at 2.00 GiB: Input/output error

i.e. the ~80 GB model never loads on Windows. (On Linux off_t is 64-bit under LFS, so this only manifests on the MSVC build.) Fix: pass the 64-bit offset un-truncated on Windows; POSIX path unchanged.

Feature: native Windows ROCm/HIP build

Building ds4-bench.exe for gfx1151 natively (hipcc/clang from the AMD HIP SDK 7.1, MSVC ABI — no WSL) needs:

  • A dependency-free POSIX shim (ds4_win.h): mmap/anon-mmap/mlock, madvise, sysconf, flock/fcntl/pread, fmemopen, clock_gettime, ftruncate, 64-bit _stat64 remap (so stat'ing a >2 GB model doesn't EOVERFLOW), etc.
  • A Win32 pthread shim and a Winsock shim (ds4_distributed.c is now in CORE_OBJS and pulls in BSD sockets/poll, linked even into the bench).
  • Vendoring the full rocWMMA header tree (the Windows HIP SDK ships only rocwmma-version.hpp, but main's ROCm path now uses rocwmma::fragment directly), plus hipBLASLt and -std=c++17.
  • A windows-rocm build path; all portability is behind #ifdef _WIN32 so Linux/macOS/CUDA/Linux-ROCm builds are byte-for-byte unchanged.

With the >2 GiB fix above, ds4-bench.exe loads the model and runs on gfx1151 (Strix Halo) — measured ~114 tok/s prefill @2k, ~195 @16k (with DS4_CUDA_MANAGED, see the companion OOM issue). CLI (linenoise/termios) and server (full sockets) frontends are deferred; the bench is the perf vehicle.

Happy to send PR(s) for both. (The native CPU MinGW build is a separate, smaller variant.)

Environment

  • GPU: AMD Radeon 8060S (gfx1151, Strix Halo), Windows 11. AMD HIP SDK / ROCm 7.1.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions