Optimize hex decoding performance with LUT-based remainder processing #7

Merged
cfcosta merged 1 commit into cfcosta:main from biryukovmaxim:main
Nov 10, 2025
Conversation

@biryukovmaxim
Contributor

Optimize hex decoding performance with LUT-based remainder processing

Overview

This PR significantly improves hex decoding performance by replacing SIMD-based remainder processing with a lookup table (LUT) approach and optimizing the validation path in the main SIMD loops. The optimizations reduce code complexity and improve instruction cache utilization.

Key Changes

1. Extracted decode_into as a separate function

  • Before: All decoding logic was inlined into the wrapper, creating a massive 2000+ line function
  • After: Core logic lives in decode_into; the wrapper handles allocation only
  • Results in cleaner assembly with better register allocation
  • Enables the compiler to optimize the hot path independently (see the sketch below)
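
A minimal sketch of the shape of this split, assuming a simple Error type and signatures that may differ from the crate's real API (the real decode_into holds the SIMD main loop plus the LUT remainder loop):

// Hedged sketch, not the crate's exact code.
pub struct Error;

pub fn decode(input: &str) -> Result<Vec<u8>, Error> {
    let mut out = vec![0u8; input.len() / 2];
    decode_into(input.as_bytes(), &mut out)?; // hot path lives in its own symbol
    Ok(out)
}

fn decode_into(input: &[u8], output: &mut [u8]) -> Result<(), Error> {
    // Scalar stand-in for the real SIMD main loop + LUT remainder handling.
    for (pair, out) in input.chunks_exact(2).zip(output.iter_mut()) {
        let hi = (pair[0] as char).to_digit(16).ok_or(Error)?;
        let lo = (pair[1] as char).to_digit(16).ok_or(Error)?;
        *out = ((hi << 4) | lo) as u8;
    }
    Ok(())
}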

Assembly comparison:

// Before: Inline everything (~2000 lines)
decode_mu_hex:
    push rbp / r15 / r14 / r13 / r12 / rbx
    and rsp, -32        # 32-byte alignment
    sub rsp, 224        # Huge stack frame
    [... 2000+ lines of inlined SIMD code ...]

// After: Clean separation (~100 lines)  
decode_mu_hex:
    push r15 / r14 / r13 / r12 / rbx
    sub rsp, 16         # Minimal stack (16 bytes vs 224)
    call muhex::decode_into@GOTPCREL  # Separate optimized function
    ret

2. LUT-Based Remainder Processing

  • Problem: SIMD remainder processing generated 16 unrolled vpextrb instructions with bounds checks (500+ lines of assembly)
  • Solution: Simple lookup table for remainders (< 32 bytes)
  • Benefits:
    • No SIMD register setup overhead
    • No data shuffling or byte extraction
    • Predictable branches (better speculation)
    • Minimal stack usage
    • Smaller code footprint (better i-cache)
// Compile-time lookup table: valid hex digits map to 0..=15, all other bytes to 255
const HEX_DECODE_LUT: [u8; 256] = { /* precomputed */ };

// Tight loop for the remainder: two input bytes -> one output byte
while pos < end {
    let hi = HEX_DECODE_LUT[input[pos] as usize];
    let lo = HEX_DECODE_LUT[input[pos + 1] as usize];
    // Valid nibbles are <= 15, so the OR equals 255 iff either lookup failed
    if (hi | lo) == 255 { return Err(...); }  // Single branch
    output[out_pos].write((hi << 4) | lo);    // MaybeUninit write
    out_pos += 1;
    pos += 2;
}
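
For reference, such a table can be precomputed with a const fn; the construction below is a plausible sketch of the idea, not necessarily the crate's exact code:

// Hedged sketch: builds the table at compile time; 255 marks invalid bytes.
const fn build_hex_decode_lut() -> [u8; 256] {
    let mut lut = [255u8; 256];
    let mut i = 0usize;
    while i < 256 {
        lut[i] = match i as u8 {
            b @ b'0'..=b'9' => b - b'0',
            b @ b'a'..=b'f' => b - b'a' + 10,
            b @ b'A'..=b'F' => b - b'A' + 10,
            _ => 255,
        };
        i += 1;
    }
    lut
}

const HEX_DECODE_LUT: [u8; 256] = build_hex_decode_lut();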

3. Optimized Stack Usage

  • Before: 224 bytes stack + 32-byte alignment
  • After: 16 bytes stack + natural alignment
  • Reduced function prologue/epilogue overhead
  • Lower register pressure

Assembly Impact Analysis

Before: ~2000 lines with all SIMD code inlined
After: ~100 lines calling the separated function

# Stack allocation
Before: sub rsp, 224
After:  sub rsp, 16         # 14x smaller

# Function size  
Before: 2000+ lines (includes all SIMD processing inline)
After:  ~100 lines (clean wrapper + call)

Core Loop (decode_into)

The core loop now lives in its own non-inlined function, which the compiler can optimize independently:

  • Tighter SIMD loops (no validation overhead)
  • Better register allocation
  • Improved instruction scheduling
  • More aggressive inlining of helpers

Remainder Processing

Before (SIMD with vpextrb):

vpextrb byte ptr [r14 + r12], xmm0, 0
cmp rcx, 2
je .skip1
vpextrb byte ptr [r14 + rax], xmm0, 1
cmp rcx, 3
je .skip2
# ... 14 more unrolled iterations
# Total: ~500 lines with bounds checks

After (LUT):

movzx eax, byte ptr [rsi]        # Load first input byte
movzx ecx, byte ptr [rsi + 1]    # Load second input byte
movzx eax, byte ptr [rax + LUT]  # Lookup high nibble
movzx ecx, byte ptr [rcx + LUT]  # Lookup low nibble
mov edx, eax
or dl, cl                        # 255 iff either lookup failed
cmp dl, 255                      # Check validity (1 branch)
je .error
shl al, 4
or al, cl                        # Combine nibbles
mov byte ptr [rdi], al           # Store
add rsi, 2
inc rdi
cmp rsi, end
jb .loop
# Total: ~15 lines tight loop

Benchmark Suite Enhancements

Added comprehensive benchmarking infrastructure:

1. Multi-Size Testing

Tests 12 different input sizes from 1B to 1MB:

  • Aligned sizes (64B, 96B, 1KB, 1MB)
  • Unaligned sizes (63B, 33B, 31B, 17B, 15B, 7B, 3B, 1B)
  • Validates performance across all code paths (a harness sketch follows this list)
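
A criterion-style sweep over these sizes might look like the following; the group name, the hex input scheme, and the muhex::decode call are assumptions about the bench setup, not the PR's exact code:

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// Hedged sketch of the multi-size sweep; sizes are decoded (output) bytes,
// so the hex input is twice as long.
fn bench_decode_sizes(c: &mut Criterion) {
    let sizes = [1usize, 3, 7, 15, 17, 31, 33, 63, 64, 96, 1024, 1024 * 1024];
    let mut group = c.benchmark_group("decode");
    for &size in &sizes {
        let input = "ab".repeat(size); // 2 hex chars per output byte
        group.throughput(Throughput::Bytes(size as u64));
        group.bench_with_input(BenchmarkId::from_parameter(size), &input, |b, s| {
            b.iter(|| muhex::decode(s).unwrap()); // assumed API shape
        });
    }
    group.finish();
}

criterion_group!(benches, bench_decode_sizes);
criterion_main!(benches);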

2. Strategy Comparison Benchmarks

New test-util feature enables direct comparison:

#[cfg(feature = "test-util")]
pub fn decode_remainder_simd_bench(...) { }  // Old SIMD path

#[cfg(feature = "test-util")]
pub fn decode_remainder_lut_bench(...) { }   // New LUT path
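
With the feature enabled (e.g. cargo bench --features test-util), both paths run over identical inputs, so their numbers are directly comparable.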

3. Detailed Profiling

  • bench_remainder_strategies: Tests pure remainder (2B-30B)
  • bench_remainder_detailed: Per-pair granularity (1-15 pairs)

…and support for varied input sizes, add utility functions for benchmarking decode strategies.

remove inline from the function with the hot loop

introduce LUT handling of the remainder (more performant according to benchmarks)
@biryukovmaxim
Contributor Author

@cfcosta

cfcosta merged commit 68dd72d into cfcosta:main Nov 10, 2025
1 of 2 checks passed
@cfcosta
Owner

cfcosta commented Nov 10, 2025


Dude, that's impressive! I really like the LUT approach, and honestly the code not only looks much better now, it also consistently performs faster than faster-hex.

I'm really impressed.

