Task
Two buffers: while processing batch N, prefetch batch N+1 from disk.
Goal: hide disk reads behind compute so the training step does not stall on I/O (in the ideal case). With mmap, the prefetch is nearly free.
Scientific Background
Asynchronous I/O for Deep Learning (HiPC 2021, Lee et al.)
- Double-buffered I/O reduces I/O as % of training: >50% → <15%
- Lock-free implementation: atomic compare-and-swap for buffer coordination
- Scales to 64 processes without contention
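- Buffer sizing: 2 × B samples allocated (B = batch size), i.e. two batch-sized buffers
- Total I/O ops = ceil(N/B), where N here is the number of samples in the dataset; with overlap, the I/O cost is hidden behind compute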
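The sizing rules above can be checked with a few lines of arithmetic. This is a minimal sketch; the sample count, batch size, and per-sample size in `main` are hypothetical values chosen for illustration, not figures from the paper.

```zig
const std = @import("std");

/// Number of batch reads needed to cover `n_samples` with batches of
/// `batch_size` samples: ceil(N / B). `batch_size` must be non-zero.
fn ioOps(n_samples: usize, batch_size: usize) usize {
    return (n_samples + batch_size - 1) / batch_size;
}

/// Bytes reserved for the two batch buffers (2 * B samples in total).
fn doubleBufferBytes(batch_size: usize, sample_bytes: usize) usize {
    return 2 * batch_size * sample_bytes;
}

pub fn main() void {
    // Hypothetical numbers, for illustration only.
    const n: usize = 1000;
    const b: usize = 256;
    const sample_bytes: usize = 128 * @sizeOf(f32);
    std.debug.print("I/O ops: {d}\n", .{ioOps(n, b)});
    std.debug.print("buffer bytes: {d}\n", .{doubleBufferBytes(b, sample_bytes)});
}
```

With these numbers the last batch is partial, so the loader has to handle a short final read.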
Implementation Patterns (no framework)
- Two fixed memory buffers, each holds one batch
- I/O thread: monitors empty buffer, fills asynchronously
- Compute thread: consumes full buffer, never waits (ideal case)
- Synchronization: atomic status flags (empty/full), no locks
- With mmap: madvise(MADV_WILLNEED) on the next batch's pages
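The mmap variant of the pattern above might look like the following sketch. It assumes the Zig 0.12/0.13 standard library (`std.posix`, `std.mem.page_size`); newer releases rename the page-size constant. `prefetchRange` and `batch_bytes` are hypothetical names, and an anonymous mapping stands in for the dataset file so the example runs without one.

```zig
const std = @import("std");
const posix = std.posix;
// Zig 0.12/0.13 name (assumption); later releases expose a different constant.
const page_size = std.mem.page_size;

/// Hint the kernel to start reading bytes [start, start + len) of the mapping.
/// The call is advisory, so failures are ignored.
fn prefetchRange(map: []align(page_size) u8, start: usize, len: usize) void {
    if (start >= map.len) return; // no next batch to prefetch
    // madvise needs a page-aligned start: round down, then cover up to `end`.
    const aligned = std.mem.alignBackward(usize, start, page_size);
    const end = @min(start + len, map.len);
    const ptr: [*]align(page_size) u8 = @alignCast(map.ptr + aligned);
    posix.madvise(ptr, end - aligned, posix.MADV.WILLNEED) catch {};
}

pub fn main() !void {
    // Real code would pass the dataset file's handle instead of -1 and
    // drop the ANONYMOUS flag.
    const batch_bytes: usize = 3000; // deliberately not page-aligned
    const map = try posix.mmap(
        null,
        4 * batch_bytes,
        posix.PROT.READ | posix.PROT.WRITE,
        .{ .TYPE = .PRIVATE, .ANONYMOUS = true },
        -1,
        0,
    );
    defer posix.munmap(map);

    var n: usize = 0;
    while (n < 4) : (n += 1) {
        // Hint batch n+1 before touching batch n.
        prefetchRange(map, (n + 1) * batch_bytes, batch_bytes);
        // ... process map[n * batch_bytes ..][0..batch_bytes] here ...
    }
}
```

Because the hint is issued before batch N is processed, the kernel's readahead overlaps with compute without a dedicated I/O thread.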
SIMD in Data Loading (Intel DPC++ guide)
- Vectorize data transformations during load (normalize, convert)
- Stride-1 access patterns for maximum SIMD efficiency
- Contiguous NNNNN...CCCC... layout vectorizes better than interleaved NCNCNC... layout
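A convert-and-normalize pass over a contiguous (stride-1) slice could be vectorized as sketched below. The function name `normalizeU8`, the lane count of 8, and the u8 input type are assumptions made for illustration; the Intel guide does not prescribe them.

```zig
const std = @import("std");

const lanes = 8; // assumed vector width; tune per target CPU
const Vec = @Vector(lanes, f32);

/// Convert raw bytes to f32 and normalize: out[i] = (in[i] - mean) / std_dev.
/// Works on contiguous slices so each vector load is stride-1.
fn normalizeU8(in: []const u8, out: []f32, mean: f32, std_dev: f32) void {
    std.debug.assert(in.len == out.len);
    const inv = 1.0 / std_dev;
    const mean_v: Vec = @splat(mean);
    const inv_v: Vec = @splat(inv);

    var i: usize = 0;
    // Vector body: `lanes` elements per iteration.
    while (i + lanes <= in.len) : (i += lanes) {
        const raw: @Vector(lanes, u8) = in[i..][0..lanes].*;
        const f: Vec = @floatFromInt(raw);
        out[i..][0..lanes].* = (f - mean_v) * inv_v;
    }
    // Scalar tail for the remaining elements.
    while (i < in.len) : (i += 1) {
        out[i] = (@as(f32, @floatFromInt(in[i])) - mean) * inv;
    }
}

pub fn main() void {
    const raw = [_]u8{ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 };
    var out: [raw.len]f32 = undefined;
    normalizeU8(&raw, &out, 2.0, 2.0);
    std.debug.print("{any}\n", .{out});
}
```

With an interleaved layout the same loop would need strided gathers per channel, which is why the contiguous layout is preferred here.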
Zig-specific: std.Thread + std.posix.mmap (named std.os.mmap before Zig 0.12)
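    // Thread-based double buffer
    const std = @import("std");

    const Buffer = struct {
        data: []f32,
        // false = empty (prefetch thread may fill), true = full (trainer may read)
        ready: std.atomic.Value(bool),
    };
    var buffers: [2]Buffer = ...;
    var current: u1 = 0; // index of the buffer the trainer is consuming

    // Prefetch thread: refills `buf` whenever the trainer marks it empty.
    // loadBatch is assumed to be defined elsewhere in data.zig.
    fn prefetchLoop(file: std.fs.File, buf: *Buffer) void {
        while (true) {
            if (!buf.ready.load(.acquire)) {
                loadBatch(file, buf.data);
                // .release publishes the filled data before the flag flips
                buf.ready.store(true, .release);
            }
            // yield() returns an error union, so the result must be handled
            std.Thread.yield() catch {};
        }
    }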
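For the consumer side, a self-contained sketch of the full handoff follows. It makes several assumptions not in the original: a single producer thread alternates between the two buffers (so batches arrive in order), the loop is bounded by a hypothetical `num_batches` so the thread can exit, and `loadBatch` is a stand-in that fills the batch with its index instead of reading from disk.

```zig
const std = @import("std");

const batch_len = 4; // hypothetical batch size, in floats
const num_batches = 6; // hypothetical number of batches to stream

const Buffer = struct {
    data: [batch_len]f32 = undefined,
    // false = empty (producer may fill), true = full (consumer may read)
    ready: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
};

var buffers: [2]Buffer = .{ .{}, .{} };

/// Stand-in for a disk read: fills the batch with its own index.
fn loadBatch(index: usize, out: []f32) void {
    for (out) |*x| x.* = @floatFromInt(index);
}

/// Producer: fills buffers 0, 1, 0, 1, ... and waits only while the
/// target buffer is still full.
fn producer() void {
    var n: usize = 0;
    while (n < num_batches) : (n += 1) {
        const buf = &buffers[n % 2];
        while (buf.ready.load(.acquire)) std.Thread.yield() catch {};
        loadBatch(n, &buf.data);
        buf.ready.store(true, .release); // publish the filled batch
    }
}

/// Consumer: processes batches in order and returns the sum of all values.
fn consume() f32 {
    var total: f32 = 0;
    var n: usize = 0;
    while (n < num_batches) : (n += 1) {
        const buf = &buffers[n % 2];
        while (!buf.ready.load(.acquire)) std.Thread.yield() catch {};
        for (buf.data) |x| total += x;
        buf.ready.store(false, .release); // hand the buffer back
    }
    return total;
}

pub fn main() !void {
    const t = try std.Thread.spawn(.{}, producer, .{});
    const total = consume();
    t.join();
    std.debug.print("sum = {d}\n", .{total});
}
```

The acquire/release pair on `ready` is what makes the plain reads and writes of `data` safe without a mutex: each side touches a buffer only while the flag says it owns it.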
Changes
- src/hslm/data.zig: double buffer struct + prefetch thread
- src/hslm/data.zig: madvise(WILLNEED) for next batch pages
- src/hslm/trainer.zig: swap buffers each step, zero-wait design
- Atomic synchronization (no mutex overhead)
Expected
- 15-25% training speedup by hiding I/O latency
- Especially impactful with batch=256 (more data per step)
- Combined with mmap: near-zero I/O overhead
- Indirect perplexity (PPL) gain: more steps in the same wall-clock time → lower PPL
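As a rough sanity check on the 15-25% figure, assume perfect overlap and a per-step I/O time that does not exceed the compute time. The symbols t_c (compute time per step), t_io (I/O time per step), and f (I/O share of the unoverlapped step) are introduced here for illustration and are not taken from the source.

```latex
\begin{aligned}
T_{\text{serial}} &= t_c + t_{io}, \qquad
T_{\text{overlap}} = \max(t_c, t_{io}) = t_c \quad (t_{io} \le t_c) \\
S &= \frac{T_{\text{serial}}}{T_{\text{overlap}}}
   = \frac{t_c + t_{io}}{t_c}
   = \frac{1}{1 - f}, \qquad
f = \frac{t_{io}}{t_c + t_{io}}
\end{aligned}
```

Under these assumptions a 15% speedup (S = 1.15) corresponds to f ≈ 0.13 and a 25% speedup (S = 1.25) to f = 0.20, so the expected range implies that I/O currently takes roughly 13-20% of each step.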
Priority: MEDIUM — solid speedup, low implementation complexity
References