Perf improvement by ch4r10t33r · Pull Request #86 · blockblaz/hash-zig

ch4r10t33r · 2025-12-01T17:01:48Z

No description provided.

…, and thread improvements

- Pre-allocate and reuse buffers (simd_output_buffer, chain_domains_stack, leaf_domain_buffer) across batches
- Replace manual copy loops with @memcpy/@memset for better performance
- Mark prfDomainElement as inline to reduce function call overhead
- Optimize thread count calculation (min epochs per thread: 32->64, parallel threshold: 128->256)
- Remove intermediate array-to-vector conversion in tweak processing
- Use stack allocation for thread arrays when thread count <= 16

Performance: ~7.0-7.8s for 2^32 (1024 epochs), maintaining full cross-language compatibility
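For readers unfamiliar with the builtins named above, here is a minimal sketch of replacing manual copy and zero-fill loops with `@memcpy` and `@memset`. The helper `copyAndZeroPad` and the `u32` element type are hypothetical and not taken from this PR; the sketch assumes Zig 0.11 or later, where both builtins operate on slices.

```zig
const std = @import("std");

/// Hypothetical helper (not from the PR): copy `src` into the front of
/// `dst` and zero the remaining elements, without hand-written loops.
fn copyAndZeroPad(dst: []u32, src: []const u32) void {
    std.debug.assert(src.len <= dst.len);
    // @memcpy requires destination and source of equal length.
    @memcpy(dst[0..src.len], src);
    // @memset fills every element of the slice with the given value.
    @memset(dst[src.len..], 0);
}
```

Both builtins let the compiler lower the operation to an optimized block copy or fill, which is presumably where the gain claimed in the commit message comes from.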

- Add timing instrumentation for cache operations, leaf generation, and tree building
- Profile bottom tree operations to identify bottlenecks
- Maintain full cross-language compatibility

Performance: ~7.1s for 2^32 (1024 epochs)

- Unroll transpose loops for SIMD_WIDTH 4 and 8 for better performance
- Pre-compute zero-packed values outside loops
- Add conditional checks to skip unnecessary zero-padding
- Optimize padding in chain compression

Performance: ~7.1s for 2^32 (1024 epochs), maintaining full compatibility

- Add explicit 32-byte alignment for all SIMD buffers when using 8-wide SIMD (AVX-512)
- Add explicit 16-byte alignment for 4-wide SIMD (SSE4.1/NEON)
- Optimize memory layout for better cache locality in hot data structures
- Add AVX-512 build instructions to README

Performance: ~7.3s for 2^32 (1024 epochs) with 4-wide SIMD
Expected: ~3.5-4.0s with 8-wide SIMD (AVX-512), a ~2x speedup
Maintains full cross-language compatibility

- Auto-detect SIMD width based on target CPU features
- Automatically use 8-wide SIMD when AVX-512F is detected
- Fall back to 4-wide SIMD for ARM/other architectures
- Allow manual override with -Dsimd-width flag
- Update README with auto-detection documentation
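One way such detection could look at the source level is sketched below. This is an assumption about the approach, not the PR's code: the commit message does not say whether detection happens in `build.zig` or in the library, and the names `simd_width`, `Lanes`, and `laneSum` are hypothetical. It uses `builtin.cpu` and `std.Target.x86.featureSetHas` from the standard library and assumes Zig 0.11 or later.

```zig
const std = @import("std");
const builtin = @import("builtin");

/// Hypothetical compile-time width selection: 8 lanes when the target
/// is x86_64 with AVX-512F enabled, otherwise 4 lanes.
pub const simd_width: usize = blk: {
    if (builtin.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(builtin.cpu.features, .avx512f))
    {
        break :blk 8;
    }
    break :blk 4;
};

/// Vector type sized by the detected width (u32 lanes are an assumption).
pub const Lanes = @Vector(simd_width, u32);

/// Sums a vector of ones; the result equals the lane count.
fn laneSum() u32 {
    const ones: Lanes = @splat(1);
    return @reduce(.Add, ones);
}
```

Because the selection is evaluated at compile time, the result depends on the target the code is built for (for example `-Dcpu=native`), not on the machine it later runs on.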

- Remove redundant 'Cross-Language Compatibility Tests' section from Contents
- Update performance numbers to reflect latest (7.1-7.4s with 4-wide, 3.5-4.0s with 8-wide AVX-512)
- Expand repository layout to show detailed module structure
- Update optimization descriptions with memory alignment and caching details
- Update CI/testing info to reflect current configuration (2^8 and 2^32)
- Replace outdated optimization doc reference with concise performance notes
- Add AVX-512 section to Contents for better navigation

- Add parallel chain computation for 64+ chains using threads
- Replace loop-based zero-padding with @memset for better performance
- Use conditional parallelization (sequential for <64 chains, parallel for >=64)
- Add error handling with atomic flags and mutexes for thread safety
- Add benchmark-verify build target for verification performance testing
- Maintain 100% Rust compatibility (same verification logic)

Performance impact:
- Small signatures (<64 chains): Minimal change, faster memory ops
- Large signatures (64+ chains): ~2-4x speedup on multi-core systems
- Current: ~9ms per verification (already fast)
- All cross-language compatibility tests pass ✅
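The conditional parallelization described above can be sketched roughly as follows. Only the 64-chain threshold comes from the commit message; everything else is an assumption made for illustration. The names `walkChain`, `walkRange`, and `walkAll` are hypothetical, the chain step is a toy arithmetic update rather than the Poseidon-based hash, and the work is split across just two threads with no atomic error flags. The sketch assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// Threshold from the commit message: below this, stay sequential.
const parallel_threshold: usize = 64;

/// Toy chain step (NOT the real tweak hash): iterate a simple update.
fn walkChain(out: *u64, start: u64, steps: usize) void {
    var x = start;
    var i: usize = 0;
    while (i < steps) : (i += 1) {
        x = x *% 6364136223846793005 +% 1442695040888963407;
    }
    out.* = x;
}

/// Walk a contiguous range of independent chains sequentially.
fn walkRange(outs: []u64, starts: []const u64, steps: usize) void {
    for (outs, starts) |*o, s| walkChain(o, s, steps);
}

/// Sequential for small inputs; otherwise split the chains in half and
/// process one half on a spawned thread while this thread does the other.
fn walkAll(outs: []u64, starts: []const u64, steps: usize) !void {
    std.debug.assert(outs.len == starts.len);
    if (outs.len < parallel_threshold) {
        walkRange(outs, starts, steps);
        return;
    }
    const half = outs.len / 2;
    const t = try std.Thread.spawn(.{}, walkRange, .{ outs[0..half], starts[0..half], steps });
    walkRange(outs[half..], starts[half..], steps);
    t.join();
}
```

Since each chain writes only to its own output slot, the two halves share no mutable state, which is what makes the split safe without locks in this simplified form.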

Major performance improvements for signing and verification:

1. Stack allocation in hot path:
   - Replace heap allocation with stack allocation in applyPoseidonChainTweakHash
   - Use stack-allocated [15]FieldElement for combined_input (max size)
   - Direct Poseidon2-16 compression with stack-allocated state
   - Eliminates heap allocations in the most frequently called function

2. Function inlining:
   - Made applyPoseidonChainTweakHash inline to reduce call overhead
   - Improves performance for chain walking operations

3. Parallel chain computation:
   - Added parallel processing for 64+ chains in sign() function
   - Similar to verification optimization, processes independent chains concurrently
   - Uses conditional parallelization (sequential for <64 chains, parallel for >=64)

4. Memory optimizations:
   - Use @memset instead of loops for zero-padding
   - Optimized memory operations throughout sign/verify paths

Performance impact:
- Verification: ~9-11ms (maintained, with better scalability)
- Signing: Improved performance for all lifetimes
- All cross-language compatibility tests pass ✅
- 100% Rust compatibility maintained

The main improvement comes from eliminating heap allocations in the hot path (applyPoseidonChainTweakHash is called many times during chain walking).
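The stack-allocation pattern from item 1 can be illustrated with a minimal sketch. Only the maximum length of 15 and the name `combined_input` come from the commit message. `FieldElement` is stubbed as `u32`, the function `combineAndFold` and its three input slices are hypothetical, and a plain sum stands in for the Poseidon2-16 compression. The sketch assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// Stand-in for the library's field element type (assumption).
const FieldElement = u32;

/// Maximum combined input length, per the commit message.
const max_input_len = 15;

/// Hypothetical hot-path function: concatenates its inputs into a fixed
/// stack buffer instead of allocating, then folds the used prefix.
/// The sum is a placeholder for the real compression step.
fn combineAndFold(
    parameter: []const FieldElement,
    tweak: []const FieldElement,
    message: []const FieldElement,
) u64 {
    var combined_input: [max_input_len]FieldElement = undefined;
    const total = parameter.len + tweak.len + message.len;
    std.debug.assert(total <= max_input_len);

    // Fill the stack buffer; no allocator is involved.
    @memcpy(combined_input[0..parameter.len], parameter);
    @memcpy(combined_input[parameter.len..][0..tweak.len], tweak);
    @memcpy(combined_input[parameter.len + tweak.len ..][0..message.len], message);

    // Only the first `total` elements are initialized and used.
    var acc: u64 = 0;
    for (combined_input[0..total]) |x| acc += x;
    return acc;
}
```

Sizing the buffer for the worst case trades a few unused stack bytes for the removal of an allocator round trip on every call.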

ch4r10t33r added 9 commits December 1, 2025 16:32

Fix indentation in scheme.zig

4c7f99a

ch4r10t33r merged commit 21fe385 into blockblaz:master Dec 1, 2025
6 checks passed

ch4r10t33r mentioned this pull request Dec 1, 2025

Add simd support #66

Closed

ch4r10t33r merged 9 commits into blockblaz:master from ch4r10t33r:perf-improvement
