- Benchmark performance
- Profile benchmark
- Optimize performance (7.91s → 3.35s, 58% faster)
- Handle bad SRTM data
Profiling showed 27.5% of CPU time in HashMap hashing (hash_one 20.4% +
Hasher::write 7.1%). TileCoord is only 4 bytes (two i16s), but the default
HashMap hasher is SipHash. Switched to FxHashMap from the rustc-hash
crate, which uses a fast multiply-xor hash.
- Baseline: 7.91s
- After: 5.16s
- Improvement: 35%
The ring search visits many consecutive cells in the same tile, but each
cache.elevation() call did a hash lookup. Added a CachedTile that
stores a raw pointer to the current tile's elevation data and skips the
hash lookup when the tile coordinate hasn't changed. The pointer is
stable because Vec heap data doesn't move when the HashMap resizes.
- Baseline: 5.16s
- After: 4.61s
- Improvement: 11%
The per-element loop doing i16::from_be_bytes was not vectorized by
LLVM. Replaced with copy_nonoverlapping followed by an in-place
i16::from_be loop, which LLVM vectorizes into SIMD byte-swap
instructions.
- Baseline: 4.61s
- After: 3.61s
- Improvement: 22%
The iteration interleaved top and bottom ring edges, alternating between
opposite ends on every cell. This defeated the CachedTile optimization,
forcing a hash lookup on nearly every cell. Reordering to iterate each
edge contiguously keeps consecutive cells in the same tile.
- Baseline: 3.61s
- After: 3.30s
- Improvement: 9%
Added a max_elevation field to Tile. When the search enters a new
tile, if max_elevation ≤ min_elevation, all cells in that tile are
skipped without reading elevation data, avoiding cache misses entirely.
- Benchmark (35km searches): no significant change (tiles nearby have qualifying peaks)
- Bag 55 (207km search, min 2550m): ~5s → 2.7s (46% faster) — most tiles in the search area are below 2550m and get skipped
div_euclid / rem_euclid compile to idiv instructions (~30
cycles each). Added a fast path: if the global row and column are both in
[0, stride), the cell is in the center tile and no division is needed.
Uses an unsigned-comparison trick to check both bounds in one instruction.
- Baseline: 3.35s
- After: 3.27s
- Improvement: 2%
Attempted to reuse a single Vec across ring iterations instead of
allocating a new one per ring. This caused an 8% regression, likely
because Vec::with_capacity with the exact size produces a better-fitting
allocation than reusing a buffer that has grown past what each ring needs.