Optimize hex decoding performance with LUT-based remainder processing #7

Merged
cfcosta merged 1 commit into cfcosta:main from biryukovmaxim:main
Nov 10, 2025
Conversation

@biryukovmaxim
Contributor

Optimize hex decoding performance with LUT-based remainder processing

Overview

This PR significantly improves hex decoding performance by replacing SIMD-based remainder processing with a lookup table (LUT) approach and optimizing the validation path in the main SIMD loops. The optimizations reduce code complexity and improve instruction cache utilization.

Key Changes

1. Extracted decode_into as a separate function

  • Before: All decoding logic was inlined into the wrapper, creating a massive 2000+ line function
  • After: Core logic lives in decode_into; the wrapper handles allocation only
  • Results in cleaner assembly with better register allocation
  • Enables the compiler to optimize the hot path independently (see the sketch below)
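
A minimal sketch of the shape of this split, assuming a simple Error type and signatures that may differ from the crate's real API (the real decode_into holds the SIMD main loop plus the LUT remainder loop):

// Hedged sketch, not the crate's exact code.
pub struct Error;

pub fn decode(input: &str) -> Result<Vec<u8>, Error> {
    let mut out = vec![0u8; input.len() / 2];
    decode_into(input.as_bytes(), &mut out)?; // hot path lives in its own symbol
    Ok(out)
}

fn decode_into(input: &[u8], output: &mut [u8]) -> Result<(), Error> {
    // Scalar stand-in for the real SIMD main loop + LUT remainder handling.
    for (pair, out) in input.chunks_exact(2).zip(output.iter_mut()) {
        let hi = (pair[0] as char).to_digit(16).ok_or(Error)?;
        let lo = (pair[1] as char).to_digit(16).ok_or(Error)?;
        *out = ((hi << 4) | lo) as u8;
    }
    Ok(())
}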

Assembly comparison:

// Before: Inline everything (~2000 lines)
decode_mu_hex:
    push rbp / r15 / r14 / r13 / r12 / rbx
    and rsp, -32        # 32-byte alignment
    sub rsp, 224        # Huge stack frame
    [... 2000+ lines of inlined SIMD code ...]

// After: Clean separation (~100 lines)  
decode_mu_hex:
    push r15 / r14 / r13 / r12 / rbx
    sub rsp, 16         # Minimal stack (16 bytes vs 224)
    call muhex::decode_into@GOTPCREL  # Separate optimized function
    ret

2. LUT-Based Remainder Processing

  • Problem: SIMD remainder processing generated 16 unrolled vpextrb instructions with bounds checks (500+ lines of assembly)
  • Solution: Simple lookup table for remainders (< 32 bytes)
  • Benefits:
    • No SIMD register setup overhead
    • No data shuffling or byte extraction
    • Predictable branches (better speculation)
    • Minimal stack usage
    • Smaller code footprint (better i-cache)
// Compile-time lookup table: valid hex digits map to 0..=15, all other bytes to 255
const HEX_DECODE_LUT: [u8; 256] = { /* precomputed */ };

// Tight loop for the remainder: two input bytes -> one output byte
while pos < end {
    let hi = HEX_DECODE_LUT[input[pos] as usize];
    let lo = HEX_DECODE_LUT[input[pos + 1] as usize];
    // Valid nibbles are <= 15, so the OR equals 255 iff either lookup failed
    if (hi | lo) == 255 { return Err(...); }  // Single branch
    output[out_pos].write((hi << 4) | lo);    // MaybeUninit write
    out_pos += 1;
    pos += 2;
}
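
For reference, such a table can be precomputed with a const fn; the construction below is a plausible sketch of the idea, not necessarily the crate's exact code:

// Hedged sketch: builds the table at compile time; 255 marks invalid bytes.
const fn build_hex_decode_lut() -> [u8; 256] {
    let mut lut = [255u8; 256];
    let mut i = 0usize;
    while i < 256 {
        lut[i] = match i as u8 {
            b @ b'0'..=b'9' => b - b'0',
            b @ b'a'..=b'f' => b - b'a' + 10,
            b @ b'A'..=b'F' => b - b'A' + 10,
            _ => 255,
        };
        i += 1;
    }
    lut
}

const HEX_DECODE_LUT: [u8; 256] = build_hex_decode_lut();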

3. Optimized Stack Usage

  • Before: 224 bytes stack + 32-byte alignment
  • After: 16 bytes stack + natural alignment
  • Reduced function prologue/epilogue overhead
  • Lower register pressure

Assembly Impact Analysis

Before: ~2000 lines with all SIMD code inlined
After: ~100 lines calling the separated function

# Stack allocation
Before: sub rsp, 224
After:  sub rsp, 16         # 14x smaller

# Function size  
Before: 2000+ lines (includes all SIMD processing inline)
After:  ~100 lines (clean wrapper + call)

Core Loop (decode_into)

The core loop now lives in its own non-inlined function, which the compiler can optimize independently:

  • Tighter SIMD loops (no validation overhead)
  • Better register allocation
  • Improved instruction scheduling
  • More aggressive inlining of helpers

Remainder Processing

Before (SIMD with vpextrb):

vpextrb byte ptr [r14 + r12], xmm0, 0
cmp rcx, 2
je .skip1
vpextrb byte ptr [r14 + rax], xmm0, 1
cmp rcx, 3
je .skip2
# ... 14 more unrolled iterations
# Total: ~500 lines with bounds checks

After (LUT):

movzx eax, byte ptr [rsi]        # Load first input byte
movzx ecx, byte ptr [rsi + 1]    # Load second input byte
movzx eax, byte ptr [rax + LUT]  # Lookup high nibble
movzx ecx, byte ptr [rcx + LUT]  # Lookup low nibble
mov edx, eax
or dl, cl                        # 255 iff either lookup failed
cmp dl, 255                      # Check validity (1 branch)
je .error
shl al, 4
or al, cl                        # Combine nibbles
mov byte ptr [rdi], al           # Store
add rsi, 2
inc rdi
cmp rsi, end
jb .loop
# Total: ~15 lines tight loop

Benchmark Suite Enhancements

Added comprehensive benchmarking infrastructure:

1. Multi-Size Testing

Tests 12 different input sizes from 1B to 1MB:

  • Aligned sizes (64B, 96B, 1KB, 1MB)
  • Unaligned sizes (63B, 33B, 31B, 17B, 15B, 7B, 3B, 1B)
  • Validates performance across all code paths (a harness sketch follows this list)
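
A criterion-style sweep over these sizes might look like the following; the group name, the hex input scheme, and the muhex::decode call are assumptions about the bench setup, not the PR's exact code:

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// Hedged sketch of the multi-size sweep; sizes are decoded (output) bytes,
// so the hex input is twice as long.
fn bench_decode_sizes(c: &mut Criterion) {
    let sizes = [1usize, 3, 7, 15, 17, 31, 33, 63, 64, 96, 1024, 1024 * 1024];
    let mut group = c.benchmark_group("decode");
    for &size in &sizes {
        let input = "ab".repeat(size); // 2 hex chars per output byte
        group.throughput(Throughput::Bytes(size as u64));
        group.bench_with_input(BenchmarkId::from_parameter(size), &input, |b, s| {
            b.iter(|| muhex::decode(s).unwrap()); // assumed API shape
        });
    }
    group.finish();
}

criterion_group!(benches, bench_decode_sizes);
criterion_main!(benches);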

2. Strategy Comparison Benchmarks

New test-util feature enables direct comparison:

#[cfg(feature = "test-util")]
pub fn decode_remainder_simd_bench(...) { }  // Old SIMD path

#[cfg(feature = "test-util")]
pub fn decode_remainder_lut_bench(...) { }   // New LUT path
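
With the feature enabled (e.g. cargo bench --features test-util), both paths run over identical inputs, so their numbers are directly comparable.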

3. Detailed Profiling

  • bench_remainder_strategies: Tests pure remainder (2B-30B)
  • bench_remainder_detailed: Per-pair granularity (1-15 pairs)

…and support for varied input sizes, add utility functions for benchmarking decode strategies.

remove inline from the function with the hot loop

introduce LUT handling of the remainder (more performant according to benchmarks)
@biryukovmaxim
Contributor Author

@cfcosta

cfcosta merged commit 68dd72d into cfcosta:main Nov 10, 2025
1 of 2 checks passed
@cfcosta
Owner

cfcosta commented Nov 10, 2025


Dude, that's impressive! I really like the LUT approach, and honestly the code not only looks much better now, it also consistently performs faster than faster-hex.

I'm really impressed.

