fix: fence the tsc() call and prevent crash from _cycles_per_sec #11

Draft
BewareMyPower wants to merge 3 commits into fast:main from BewareMyPower:fix-rdtsc-crash

Conversation

@BewareMyPower
Contributor

fixes #5
fixes #7

As #7 points out, _rdtsc can be executed out of order; see https://doc.rust-lang.org/beta/core/arch/x86/fn._rdtsc.html

The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed.

This can make tsc() return an incorrect value, especially when heavy math operations or a memory load follow the call, because any following instruction may be executed before the counter is read.

Hence, this PR uses the rdtscp instruction followed by a fence (_mm_lfence when SSE2 is enabled, falling back to compiler_fence otherwise) to prevent instruction reordering.
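To make the idea concrete, here is a minimal sketch of a fenced read. The name fenced_tsc is hypothetical and this is not the code in this PR; it assumes the CPU supports rdtscp, and it includes a non-x86_64 fallback only so that the snippet compiles everywhere.

```rust
use std::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical helper (not fastant's actual code): read the TSC with
/// `rdtscp`, then fence so later instructions cannot start before the read.
#[cfg(target_arch = "x86_64")]
pub fn fenced_tsc() -> u64 {
    use core::arch::x86_64::{__rdtscp, _mm_lfence};

    // Receives the IA32_TSC_AUX value (commonly a CPU id); unused here.
    let mut aux: u32 = 0;
    // SAFETY: assumes the CPU supports `rdtscp`. SSE2, which provides
    // `lfence`, is part of the x86_64 baseline.
    let tsc = unsafe {
        let t = __rdtscp(&mut aux);
        _mm_lfence();
        t
    };
    // Also keep the compiler from moving memory accesses across this point.
    compiler_fence(Ordering::SeqCst);
    tsc
}

/// Fallback so the sketch compiles on other targets. There is no TSC here,
/// so it returns nanoseconds since the first call (plus one, so the value
/// is never zero) from a monotonic clock.
#[cfg(not(target_arch = "x86_64"))]
pub fn fenced_tsc() -> u64 {
    use std::sync::OnceLock;
    use std::time::Instant;

    static START: OnceLock<Instant> = OnceLock::new();
    compiler_fence(Ordering::SeqCst);
    START.get_or_init(Instant::now).elapsed().as_nanos() as u64 + 1
}
```

The hardware fence and the compiler fence address different reorderings: lfence constrains the CPU, while compiler_fence only constrains the compiler.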

However, #5 can still occur in rare cases:

let tsc1 = tsc();
let tsc2 = tsc();

The thread may be moved to a different core before the second tsc() call, and that core's cycle counter may not be in sync with the first core's. tsc2 can then be slightly smaller than tsc1, in which case computing tsc2 - tsc1 overflows and the application crashes as in #5.
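One way to guard against this, sketched below with hypothetical helper names (not necessarily what this PR does), is to avoid the plain `-` operator. With Rust's default settings, `tsc2 - tsc1` panics in debug builds and wraps to a huge value in release builds when tsc2 is smaller than tsc1.

```rust
/// Hypothetical helper: cycles elapsed between two TSC readings,
/// clamped to zero if the second reading is behind the first.
pub fn elapsed_cycles(tsc1: u64, tsc2: u64) -> u64 {
    tsc2.saturating_sub(tsc1)
}

/// Alternative: surface the anomaly to the caller instead of hiding it.
/// Returns `None` when the second reading is behind the first.
pub fn checked_elapsed_cycles(tsc1: u64, tsc2: u64) -> Option<u64> {
    tsc2.checked_sub(tsc1)
}
```

Clamping keeps the hot path branch-light, while the checked variant lets a calibration routine retry when it sees a backwards reading.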

@BewareMyPower
Contributor Author

After addressing the safety issue, fastant::Instant is no faster than std::time::Instant:

| Implementation | Time (ns) | vs std |
| --- | --- | --- |
| fastant::Instant::now() | 28.94 | ~1.03x (3% slower) |
| quanta::Instant::now() | 11.46 | ~2.46x faster |
| std::Instant::now() | 28.16 | baseline |

quanta is ~2.5x faster than both fastant and std. fastant and std are essentially tied (within noise/overlap of confidence intervals).

@BewareMyPower
Contributor Author

After removing the fence, it's still much slower than quanta.

| Implementation | Time (ns) | vs std |
| --- | --- | --- |
| fastant | 22.46 (22.44-22.49) | 0.80x the time (~1.25x faster) |
| quanta | 11.48 (11.43-11.55) | 0.41x the time (~2.44x faster) |
| std | 28.00 (28.00-28.01) | baseline |

These results come from a CI run on a different branch: https://github.com/BewareMyPower/fastant/actions/runs/25737350865/job/75577818458

I haven't had time to look into quanta's source, but according to an LLM's analysis (below), the ordering overhead of __rdtscp seems to be the major factor.

I will mark this PR as a draft and start a discussion here. @tisonkun @andylokandy


quanta is faster than fastant for one primary reason:

_rdtsc vs __rdtscp

fastant uses __rdtscp (tsc_now.rs:211), the variant that orders the read after earlier instructions (loosely, the "serializing" one). It waits for all prior instructions to execute before reading the counter, so the TSC is read after everything that precedes it in program order.

quanta uses _rdtsc (counter.rs:7), the non-serializing variant. It reads the counter immediately, with no ordering guarantee.

On modern x86, rdtscp costs ~20+ cycles vs rdtsc at ~10 cycles. That alone accounts for most of the 2.5x gap (28.9ns → 11.5ns).

Secondary differences

| | fastant | quanta |
| --- | --- | --- |
| TSC instruction | __rdtscp (serializing) | _rdtsc (non-serializing) |
| Per-call check | is_tsc_available() read + branch every call | Amortized via OnceCell init |
| Hot-path math | wrapping_sub (anchor offset) | saturating_sub + u128::mul + shift |
| Calibration | f64 pre-computed nanos_per_cycle | Power-of-two integer scaling (no float in hot path) |

Note that fastant's wrapping_sub is applied to raw TSC values to normalize them, while quanta's saturating_sub + mul + shift converts TSC ticks → nanoseconds. Both do subtraction per call. The extra mul + shift in quanta is still cheaper than the serializing overhead of rdtscp.
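As an illustration of the "saturating_sub + mul + shift" pattern described above, here is a minimal sketch of power-of-two integer scaling. The names SHIFT, scale_factor and cycles_to_nanos are hypothetical, and this is not quanta's actual code; it only assumes a known, non-zero TSC frequency.

```rust
/// Number of fractional bits in the fixed-point multiplier (an assumption
/// made for this sketch).
const SHIFT: u32 = 32;

/// Pre-compute `mult = 1e9 * 2^SHIFT / cycles_per_sec` once, during
/// calibration, so the hot path needs no division and no floating point.
pub fn scale_factor(cycles_per_sec: u64) -> u64 {
    assert!(cycles_per_sec > 0, "TSC frequency must be non-zero");
    ((1_000_000_000u128 << SHIFT) / cycles_per_sec as u128) as u64
}

/// Hot path: subtract the anchor (clamping at zero), multiply in 128 bits
/// so the product cannot overflow, then shift back down to nanoseconds.
pub fn cycles_to_nanos(now: u64, anchor: u64, mult: u64) -> u64 {
    let delta = now.saturating_sub(anchor);
    ((delta as u128 * mult as u128) >> SHIFT) as u64
}
```

For example, at an assumed 2 GHz the multiplier is exactly 2^31, so 2,000 cycles map to 1,000 ns, and a reading that is behind the anchor yields 0 instead of overflowing.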

Development

Successfully merging this pull request may close these issues.

Is necessary to introduce rdtsc_ordered?
Panic due to overflow in _cycles_per_sec()

1 participant