fix: fence the tsc() call and prevent crash from _cycles_per_sec #11
BewareMyPower wants to merge 3 commits
After addressing the safety issue, the performance of
After removing the fence, it's still much slower than quanta.
The results come from a CI run on a different branch: https://github.com/BewareMyPower/fastant/actions/runs/25737350865/job/75577818458
I didn't have time to look into quanta's source, but according to an LLM's analysis, the serialization overhead seems to be the main cost.
I will mark this PR as a draft and start a discussion here. @tisonkun @andylokandy
quanta is faster than fastant for one primary reason:
| | fastant | quanta |
|---|---|---|
| TSC instruction | `__rdtscp` (serializing) | `_rdtsc` (non-serializing) |
| Per-call check | `is_tsc_available()` read + branch every call | Amortized via `OnceCell` init |
| Hot-path math | `wrapping_sub` (anchor offset) | `saturating_sub` + `u128::mul` + shift |
| Calibration | `f64` pre-computed `nanos_per_cycle` | Power-of-two integer scaling (no float in hot path) |
Note that fastant's wrapping_sub is applied to raw TSC values to normalize them, while quanta's saturating_sub + mul + shift converts TSC ticks → nanoseconds. Both do subtraction per call. The extra mul + shift in quanta is still cheaper than the serializing overhead of rdtscp.
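
To make the two hot-path computations above concrete, here is a minimal sketch. The function names `offset_from_anchor` and `scaled_nanos` and the `scale`/`shift` parameters are hypothetical and are not taken from either crate's source; they only illustrate the arithmetic described in the table.

```rust
/// Hypothetical fastant-style normalization: the offset of a raw TSC
/// reading from an anchor value. `wrapping_sub` never panics, but it
/// wraps around to a huge value if `raw < anchor`.
pub fn offset_from_anchor(raw: u64, anchor: u64) -> u64 {
    raw.wrapping_sub(anchor)
}

/// Hypothetical quanta-style conversion from TSC ticks to nanoseconds
/// using power-of-two integer scaling: ns = (delta * scale) >> shift.
/// `saturating_sub` clamps to 0 if `raw < reference`, and the multiply
/// is done in u128 so it cannot overflow.
pub fn scaled_nanos(raw: u64, reference: u64, scale: u64, shift: u32) -> u64 {
    let delta = raw.saturating_sub(reference);
    ((delta as u128 * scale as u128) >> shift) as u64
}
```

For example, a 2 GHz counter has 0.5 ns per tick, which can be encoded as `scale = 1 << 31` with `shift = 32`. The difference between the two subtractions only shows up when the reading is smaller than the reference: one wraps, the other clamps to zero.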
fixes #5
fixes #7
Just like #7 points out, `_rdtsc` might be executed out of order, see https://doc.rust-lang.org/beta/core/arch/x86/fn._rdtsc.html
This might lead to an incorrect return value from `tsc()`, especially when a heavy math operation or a memory load comes right after `tsc()` returns, because any following instruction could be executed before `tsc()`.
Hence, this PR uses the `rdtscp` instruction followed by a fence (`_mm_lfence` when SSE2 is enabled, falling back to `compiler_fence`) to prevent instruction reordering.
However, there could still be a rare case where #5 happens. The second `tsc()` call might run on a different core, whose cycle counter might not be in sync with the old core's. Then `tsc2` might be slightly smaller than `tsc1`, the subtraction could overflow, and the application would crash as in #5.
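
The following is a minimal sketch of the two ideas in this description, not the actual diff of this PR. The names `fenced_tsc` and `elapsed_cycles` are hypothetical. It assumes an x86_64 target (where SSE2 is part of the baseline, so `_mm_lfence` is available) and a CPU that supports `RDTSCP`. Clamping with `saturating_sub` is one possible guard against `tsc2 < tsc1`, and is an assumption about how the crash could be avoided.

```rust
/// Hypothetical fenced TSC read: RDTSCP followed by LFENCE, so that
/// later instructions do not start executing before the counter is read.
#[cfg(target_arch = "x86_64")]
pub fn fenced_tsc() -> u64 {
    use core::arch::x86_64::{__rdtscp, _mm_lfence};
    // RDTSCP also writes IA32_TSC_AUX here; this sketch ignores it.
    let mut aux: u32 = 0;
    // SAFETY: assumes the CPU supports RDTSCP; `aux` is a valid pointer.
    unsafe {
        let cycles = __rdtscp(&mut aux);
        _mm_lfence();
        cycles
    }
}

/// Hypothetical guard for the cross-core case: if the second reading is
/// smaller than the first, report 0 elapsed cycles instead of letting
/// `tsc2 - tsc1` overflow (which panics in debug builds).
pub fn elapsed_cycles(tsc1: u64, tsc2: u64) -> u64 {
    tsc2.saturating_sub(tsc1)
}
```

With this guard, a pair such as `tsc1 = 100`, `tsc2 = 90` yields 0 rather than a panic or a wrapped value close to `u64::MAX`. The trade-off is that a small backwards jump is silently hidden rather than reported.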