ROCm/Windows: native AMD HIP SDK build for gfx1151 + >2 GiB staged-read off_t truncation

### Summary

Two related items found while building ds4's ROCm backend **natively on Windows** (AMD HIP SDK, no WSL) for gfx1151 / Strix Halo:

1. A concrete portability **bug**: the ROCm staged model loader fails at exactly 2.00 GiB on Windows.
2. A **feature**: native Windows (HIP SDK) build support for `ds4-bench`.

### Bug: >2 GiB staged read truncates through 32-bit `off_t` (Windows/MSVC)

`cuda_pread_full()` in `rocm/ds4_rocm_runtime.cuh` does:

```c
ssize_t n = pread(fd, (char *)buf + done, n_req, (off_t)(offset + done));
```

Under the MSVC ABI `off_t` is 32-bit (`long`), so once `offset` reaches 2 GiB the `(off_t)` cast truncates to a negative value and the staged read fails:

```
ds4: ROCm model staged read failed at 2.00 GiB: Input/output error
```

i.e. the ~80 GB model never loads on Windows. (On Linux `off_t` is 64-bit under LFS, so this only manifests on the MSVC build.) Fix: pass the 64-bit offset un-truncated on Windows; POSIX path unchanged.

### Feature: native Windows ROCm/HIP build

Building `ds4-bench.exe` for gfx1151 natively (hipcc/clang from the AMD HIP SDK 7.1, MSVC ABI — no WSL) needs:
- A dependency-free POSIX shim (`ds4_win.h`): mmap/anon-mmap/mlock, madvise, sysconf, flock/fcntl/pread, fmemopen, clock_gettime, ftruncate, **64-bit `_stat64` remap** (so stat'ing a >2 GB model doesn't EOVERFLOW), etc.
- A Win32 pthread shim and a **Winsock shim** (`ds4_distributed.c` is now in CORE_OBJS and pulls in BSD sockets/poll, linked even into the bench).
- Vendoring the **full rocWMMA header tree** (the Windows HIP SDK ships only `rocwmma-version.hpp`, but main's ROCm path now uses `rocwmma::fragment` directly), plus hipBLASLt and `-std=c++17`.
- A `windows-rocm` build path; all portability is behind `#ifdef _WIN32` so Linux/macOS/CUDA/Linux-ROCm builds are byte-for-byte unchanged.

With the >2 GiB fix above, `ds4-bench.exe` loads the model and runs on gfx1151 (Strix Halo) — measured ~114 tok/s prefill @2k, ~195 @16k (with `DS4_CUDA_MANAGED`, see the companion OOM issue). CLI (linenoise/termios) and server (full sockets) frontends are deferred; the bench is the perf vehicle.

Happy to send PR(s) for both. (The native CPU MinGW build is a separate, smaller variant.)

### Environment
- GPU: AMD Radeon 8060S (gfx1151, Strix Halo), Windows 11. AMD HIP SDK / ROCm 7.1.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ROCm/Windows: native AMD HIP SDK build for gfx1151 + >2 GiB staged-read off_t truncation #360

Summary

Bug: >2 GiB staged read truncates through 32-bit `off_t` (Windows/MSVC)

Feature: native Windows ROCm/HIP build

Environment

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

ROCm/Windows: native AMD HIP SDK build for gfx1151 + >2 GiB staged-read off_t truncation #360

Description

Summary

Bug: >2 GiB staged read truncates through 32-bit off_t (Windows/MSVC)

Feature: native Windows ROCm/HIP build

Environment

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Bug: >2 GiB staged read truncates through 32-bit `off_t` (Windows/MSVC)