feat(hslm): double-buffered batch prefetch #319

@gHashTag

Task

Two buffers: while processing batch N, prefetch batch N+1 from disk.
The goal is to hide I/O stalls behind compute. With mmap, the prefetch is nearly free.

Scientific Background

Asynchronous I/O for Deep Learning (HiPC 2021, Lee et al.)

  • Double-buffered I/O cuts the I/O share of training time from >50% to <15%
  • Lock-free implementation: atomic compare-and-swap for buffer coordination
  • Scales to 64 processes without contention
  • Buffer sizing: 2B samples allocated (B = batch size)
  • Total I/O ops = ceil(N/B) (N = number of samples); with overlap, the I/O cost is hidden behind compute
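
To make the "I/O cost hidden" claim concrete, here is a rough timing model. It is a sketch under stated assumptions, not taken from the cited paper: N is the number of samples, B the batch size, and t_io and t_c are assumed constant per-batch load and compute times (symbols introduced here for illustration).

```latex
K = \left\lceil \frac{N}{B} \right\rceil
\qquad
T_{\text{serial}} = K\,(t_{io} + t_c)
\qquad
T_{\text{overlap}} = t_{io} + (K-1)\,\max(t_{io},\, t_c) + t_c
```

When t_io ≤ t_c this reduces to T_overlap = t_io + K·t_c, so only the first load is exposed. When t_io > t_c the pipeline is I/O-bound and the compute thread still waits, which is why the patterns below describe never waiting as the ideal case.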

Implementation Patterns (no framework)

  • Two fixed memory buffers, each holds one batch
  • I/O thread: monitors empty buffer, fills asynchronously
  • Compute thread: consumes full buffer, never waits (ideal case)
  • Synchronization: atomic status flags (empty/full), no locks
  • With mmap: madvise(MADV_WILLNEED) for next batch pages

SIMD in Data Loading (Intel DPC++ guide)

  • Vectorize data transformations during load (normalize, convert)
  • Stride-1 access patterns for maximum SIMD efficiency
  • Planar layout (NNNNN...CCCC...) vectorizes better than interleaved layout (NCNCNC...)
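
As an illustration of vectorizing a transform during load, the sketch below normalizes a contiguous (stride-1) f32 slice with Zig's built-in `@Vector` type. This swaps in Zig vectors for the DPC++ guidance cited above. The function name `normalize`, the lane count of 8, and the mean/inverse-std parameters are assumptions made for the example, not part of the issue.

```zig
const std = @import("std");

// Hypothetical vector width; pick whatever suits the target CPU.
const lanes = 8;
const Vec = @Vector(lanes, f32);

/// Normalizes `data` in place: x = (x - mean) * inv_std.
/// Assumes a contiguous slice, i.e. stride-1 access.
fn normalize(data: []f32, mean: f32, inv_std: f32) void {
    const m: Vec = @splat(mean);
    const s: Vec = @splat(inv_std);
    var i: usize = 0;
    // Main loop: process `lanes` elements per iteration.
    while (i + lanes <= data.len) : (i += lanes) {
        const chunk: *[lanes]f32 = data[i..][0..lanes];
        const v: Vec = chunk.*;
        chunk.* = (v - m) * s;
    }
    // Scalar tail for lengths that are not a multiple of `lanes`.
    while (i < data.len) : (i += 1) data[i] = (data[i] - mean) * inv_std;
}
```

With a planar layout each channel is one such contiguous slice, so the main loop covers almost all elements. An interleaved layout would need strided loads or a shuffle before the same arithmetic could be applied.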

Zig-specific: std.Thread + std.posix.mmap (std.os.mmap before Zig 0.12)

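// Thread-based double buffer
const std = @import("std");

const Buffer = struct {
    data: []f32,
    // false = empty (prefetch thread may fill), true = full (trainer may read)
    ready: std.atomic.Value(bool),
};
var buffers: [2]Buffer = ...;
var current: u1 = 0; // index of the buffer the trainer is consuming

// Prefetch thread: refills `buf` whenever the trainer has marked it empty
fn prefetchLoop(file: std.fs.File, buf: *Buffer) void {
    while (true) {
        if (!buf.ready.load(.acquire)) {
            loadBatch(file, buf.data);
            // .release publishes the loaded data before the flag flips
            buf.ready.store(true, .release);
        }
        // yield() returns an error union, so the result must be handled
        std.Thread.yield() catch {};
    }
}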
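
The snippet above shows only the prefetch side. A self-contained sketch of both sides of the handshake might look like the following. It makes several assumptions for illustration: fixed-size in-memory buffers instead of file-backed slices, a hypothetical `fillBatch` standing in for `loadBatch`, a bounded batch count so the threads terminate, and a producer that alternates between both buffers. It targets the Zig 0.12+ atomics API (`std.atomic.Value`, lowercase orderings) already used in this issue.

```zig
const std = @import("std");

// Hypothetical sizes for illustration; the issue does not specify them.
const batch_len = 4;
const num_batches = 6;

const Buffer = struct {
    data: [batch_len]f32 = undefined,
    // false = empty (producer may fill), true = full (consumer may read)
    ready: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
};

var buffers = [2]Buffer{ .{}, .{} };

// Stand-in for loadBatch: fills the buffer with the batch index.
fn fillBatch(buf: *Buffer, batch: usize) void {
    for (&buf.data) |*x| x.* = @floatFromInt(batch);
}

// Producer: fills the buffers in strict alternation (0, 1, 0, 1, ...).
fn producer() void {
    var batch: usize = 0;
    while (batch < num_batches) : (batch += 1) {
        const buf = &buffers[batch % 2];
        // Wait until the consumer has handed this buffer back.
        while (buf.ready.load(.acquire)) std.Thread.yield() catch {};
        fillBatch(buf, batch);
        buf.ready.store(true, .release);
    }
}

// Consumer: reads the buffers in the same order and returns the sum
// of every element it saw.
fn consumeAll() !f32 {
    const t = try std.Thread.spawn(.{}, producer, .{});
    defer t.join();
    var sum: f32 = 0;
    var current: u1 = 0;
    var step: usize = 0;
    while (step < num_batches) : (step += 1) {
        const buf = &buffers[current];
        // Ideally this loop never spins: the next batch is already loaded.
        while (!buf.ready.load(.acquire)) std.Thread.yield() catch {};
        for (buf.data) |x| sum += x;
        buf.ready.store(false, .release); // hand the buffer back
        current +%= 1; // u1 wraps: 0 -> 1 -> 0
    }
    return sum;
}
```

Each buffer has exactly one writer and one reader at any moment, so a release store paired with an acquire load on `ready` is enough and no compare-and-swap is needed in this two-thread case. With the sizes above, `consumeAll()` returns 4 × (0 + 1 + 2 + 3 + 4 + 5) = 60.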

Changes

  • src/hslm/data.zig: double buffer struct + prefetch thread
  • src/hslm/data.zig: madvise(WILLNEED) for next batch pages
  • src/hslm/trainer.zig: swap buffers each step, zero-wait design
  • Atomic synchronization (no mutex overhead)

Expected

  • 15-25% training speedup by hiding I/O latency
  • Especially impactful with batch=256 (more data per step)
  • Combined with mmap: near-zero I/O overhead
  • Indirect perplexity (PPL) gain: same wall-clock time → more steps → lower PPL
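
As a sanity check on the 15-25% figure, one can relate the speedup to the fraction of step time currently spent waiting on I/O. This is a back-of-the-envelope model that assumes the exposed I/O is hidden completely and compute time is unchanged; f is a symbol introduced here, not a measured value.

```latex
S = \frac{1}{1 - f}
\quad\Longrightarrow\quad
f = 1 - \frac{1}{S},
\qquad
S = 1.15 \Rightarrow f \approx 0.13,
\qquad
S = 1.25 \Rightarrow f = 0.20
```

Under these assumptions, the expected range corresponds to a current I/O wait share of roughly 13-20% of each training step.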

Priority: MEDIUM (solid speedup, low implementation complexity)

Metadata

  • Assignees: none
  • Labels: agent:spawn (auto-spawn agent container)
  • Status: Done
  • Milestone: none
  • Development: no branches or pull requests