feat(hslm): double-buffered batch prefetch (#319)

## Task
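Use two buffers: while the trainer processes batch N, a prefetch thread loads batch N+1 from disk.
As long as loading a batch takes no longer than computing on one, I/O stalls disappear after the first batch. With mmap, the prefetch is nearly free.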

## Scientific Background

### Asynchronous I/O for Deep Learning (HiPC 2021, Lee et al.)
- Double-buffered I/O cuts the share of training time spent on I/O from **>50% to <15%**
- Lock-free implementation: atomic compare-and-swap for buffer coordination
- Scales to 64 processes without contention
- Buffer sizing: 2 × B samples allocated (two buffers of B samples each, B = batch size)
- Total I/O ops = ceil(N/B) for a dataset of N samples; with overlap, their cost is hidden behind compute
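
A simple pipeline model makes the last bullet concrete. It is a sketch, not taken from the paper: it assumes a constant per-batch load time $t_{io}$ and compute time $t_c$ (both symbols are introduced here), with $K = \lceil N/B \rceil$ batches and exactly two buffers.

$$
T_{\text{serial}} = K\,(t_{io} + t_c), \qquad
T_{\text{overlap}} = t_{io} + (K-1)\,\max(t_{io},\, t_c) + t_c
$$

When $t_{io} \le t_c$, the overlapped time reduces to $t_{io} + K\,t_c$, so only the first load is exposed. When $t_{io} > t_c$, the loader is the bottleneck and the trainer still waits $t_{io} - t_c$ on every step.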

### Implementation Patterns (no framework)
- Two fixed memory buffers, each holds one batch
- I/O thread: monitors empty buffer, fills asynchronously
- Compute thread: consumes full buffer, never waits (ideal case)
- Synchronization: atomic status flags (empty/full), no locks
- With mmap: `madvise(MADV_WILLNEED)` for next batch pages
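
The handoff described above can be exercised end to end with a small two-thread program. This is a minimal sketch under stated assumptions: `Slot`, `batch_len`, `num_batches`, `producer`, and `consumeAll` are hypothetical names, the disk read is replaced by a `@memset`, and it uses plain acquire/release flags rather than compare-and-swap, which is enough when each slot has exactly one producer and one consumer. It targets the Zig 0.12+ standard library.

```zig
const std = @import("std");

const batch_len = 4; // hypothetical batch size, in f32 values
const num_batches = 6; // hypothetical number of batches to stream

const Slot = struct {
    data: [batch_len]f32 = undefined,
    // false = empty (producer may fill), true = full (consumer may read)
    full: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
};

var slots: [2]Slot = .{ .{}, .{} };

// I/O side: fill slots 0, 1, 0, 1, ... waiting for each to be empty first.
fn producer() void {
    var n: usize = 0;
    while (n < num_batches) : (n += 1) {
        const slot = &slots[n % 2];
        while (slot.full.load(.acquire)) std.Thread.yield() catch {};
        // Stand-in for the real disk read: batch n holds the value n.
        @memset(&slot.data, @as(f32, @floatFromInt(n)));
        slot.full.store(true, .release); // publish the filled batch
    }
}

// Compute side: drain slots in the same order and return the running sum.
fn consumeAll() f32 {
    var total: f32 = 0;
    var n: usize = 0;
    while (n < num_batches) : (n += 1) {
        const slot = &slots[n % 2];
        while (!slot.full.load(.acquire)) std.Thread.yield() catch {};
        for (slot.data) |x| total += x;
        slot.full.store(false, .release); // hand the slot back for refill
    }
    return total;
}

pub fn main() !void {
    const t = try std.Thread.spawn(.{}, producer, .{});
    const total = consumeAll();
    t.join();
    std.debug.print("sum = {d}\n", .{total});
}
```

With six batches of four values each, the consumer sums 4 × (0 + 1 + 2 + 3 + 4 + 5) = 60. The spin-and-yield waits are only hit when one side falls behind the other, which is the non-ideal case mentioned in the list.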

### SIMD in Data Loading (Intel DPC++ guide)
- Vectorize data transformations during load (normalize, convert)
- Stride-1 access patterns for maximum SIMD efficiency
- Contiguous `NNNNN...CCCC...` layout vectorizes better than interleaved `NCNCNC...`
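
As an illustration of a vectorized transform over stride-1 data, the sketch below normalizes a contiguous run of floats in place using Zig's `@Vector`. The function name `normalize`, the lane count of 8, and the mean/scale parameters are assumptions for the example, not part of the proposed change.

```zig
const std = @import("std");

const lanes = 8; // hypothetical vector width; tune per target CPU
const Vec = @Vector(lanes, f32);

// Normalize a contiguous (stride-1) slice in place: x = (x - mean) / std_dev.
fn normalize(values: []f32, mean: f32, std_dev: f32) void {
    const inv = 1.0 / std_dev;
    const mean_v: Vec = @splat(mean);
    const inv_v: Vec = @splat(inv);
    var i: usize = 0;
    // Main loop: load 8 adjacent floats, transform them, store them back.
    while (i + lanes <= values.len) : (i += lanes) {
        const v: Vec = values[i..][0..lanes].*;
        values[i..][0..lanes].* = (v - mean_v) * inv_v;
    }
    // Scalar tail for the elements that do not fill a whole vector.
    while (i < values.len) : (i += 1) {
        values[i] = (values[i] - mean) * inv;
    }
}

pub fn main() void {
    var xs = [_]f32{ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 };
    normalize(&xs, 2.0, 2.0);
    std.debug.print("{any}\n", .{xs});
}
```

Because adjacent elements belong to the same field, each vector load touches one contiguous span of memory. An interleaved layout would need strided gathers to do the same work.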

### Zig-specific: std.Thread + std.posix.mmap
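```zig
const std = @import("std");

// One slot of the double buffer: batch storage plus a ready flag.
const Buffer = struct {
    data: []f32,
    // false = empty (prefetch thread may fill), true = full (trainer may read)
    ready: std.atomic.Value(bool),
};
var buffers: [2]Buffer = undefined; // set up before the prefetch thread starts
var current: u1 = 0; // index of the buffer the trainer is consuming

// Prefetch thread: refill `buf` whenever the trainer has marked it empty.
fn prefetchLoop(file: std.fs.File, buf: *Buffer) void {
    while (true) {
        if (!buf.ready.load(.acquire)) {
            loadBatch(file, buf.data); // project helper, defined elsewhere
            // .release publishes the batch data before the flag flips
            buf.ready.store(true, .release);
        }
        // yield() returns an error union, so the result must be handled
        std.Thread.yield() catch {};
    }
}
```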

## Changes
- `src/hslm/data.zig`: double buffer struct + prefetch thread
- `src/hslm/data.zig`: `madvise(MADV_WILLNEED)` for next batch pages
- `src/hslm/trainer.zig`: swap buffers each step, zero-wait design
- Atomic synchronization (no mutex overhead)
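
For the `madvise` change, the range handed to the kernel has to start on a page boundary. The helper below is a sketch of that arithmetic only. `willNeedRange`, `Range`, and the sizes in `main` are hypothetical. The advice call itself is left out: in Zig 0.12+ it is assumed to be `std.posix.madvise` with `std.posix.MADV.WILLNEED`, but its pointer-alignment requirements vary between Zig versions.

```zig
const std = @import("std");

const Range = struct { start: usize, len: usize };

// Page-aligned byte range covering batch `index`, clamped to the file length.
fn willNeedRange(index: usize, batch_bytes: usize, file_len: usize, page: usize) Range {
    const begin = @min(index * batch_bytes, file_len);
    const end = @min(begin + batch_bytes, file_len);
    // Round the start down so the advised region begins on a page boundary.
    const start = std.mem.alignBackward(usize, begin, page);
    return .{ .start = start, .len = end - start };
}

pub fn main() void {
    // Hypothetical numbers: 256 samples of 128 f32 values each, 4 KiB pages.
    const batch_bytes = 256 * 128 * @sizeOf(f32);
    const r = willNeedRange(3, batch_bytes, 1 << 20, 4096);
    std.debug.print("advise offset={d} len={d}\n", .{ r.start, r.len });
}
```

While the trainer works on batch N, the prefetch side would compute this range for batch N+1 and advise it. Batches past the end of the file collapse to a zero-length range.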

## Expected
- **15-25% training speedup** by hiding I/O latency
- Especially impactful with batch=256 (more data per step)
- Combined with mmap: near-zero I/O overhead
- Indirect PPL gain: more steps in the same wall-clock time, so lower PPL
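
As a rough consistency check on the 15-25% figure, suppose I/O takes a fraction $f$ of each serial step and the prefetch hides it completely. Here $f$, $t_{io}$, and $t_c$ are symbols introduced for this estimate, and the model needs $t_{io} \le t_c$.

$$
\text{speedup} = \frac{t_{io} + t_c}{t_c} = \frac{1}{1 - f}, \qquad f = \frac{t_{io}}{t_{io} + t_c}
$$

A 15% speedup corresponds to $f \approx 0.13$ and a 25% speedup to $f = 0.20$. The estimate therefore implies that I/O currently takes roughly 13-20% of step time, a share worth measuring before starting the work.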

## Priority: MEDIUM — solid speedup, low implementation complexity

## References
- Async I/O for DL: https://sdm.lbl.gov/oapapers/hipc2021-lee.pdf
- Data prefetching patterns: https://www.jpatrickpark.com/post/prefetcher/
- SIMD data loading: Intel DPC++ Compiler Guide (SIMD Vectorization)
