Adjusting `GetCurrentProcessorId` caching to different environments. #467
At sub-context-switch time scales the result of `GetCurrentProcessorId` cannot change, so it can be cached and reused for a short while.
The good news is that this API is fast on recent hardware/OSes and is getting faster. It is not uncommon now to see systems where the call is fast enough that caching is not necessary or, in fact, may make it slower. There are systems where this API is actually faster than a standalone read of the thread-static cache itself.
The goal of this change is to use as little caching as possible while not falling into perf cliffs on systems where the API happens to be slow.
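For context, here is a minimal sketch of the thread-static caching scheme under discussion. This is not the actual runtime code: the `ProcessorIdCache` helper, the field names, and the Windows-only P/Invoke (used here in place of the runtime's native helper) are all illustrative.

```csharp
using System;
using System.Runtime.InteropServices;

internal static class ProcessorIdCache
{
    // Illustration only: the real runtime calls its own native helper;
    // this sketch P/Invokes the Win32 API directly (Windows-only).
    [DllImport("kernel32.dll")]
    private static extern uint GetCurrentProcessorNumber();

    // How many accesses are served from the thread-static cache before it is
    // refreshed; the calibration discussed below may re-tune this value.
    private static int s_refreshRate = 50;

    // Cached processor id and a per-thread countdown until the next refresh.
    [ThreadStatic] private static int t_cachedProcessorId;
    [ThreadStatic] private static int t_remainingUses;

    public static int GetCurrentProcessorId()
    {
        if (t_remainingUses-- > 0)
            return t_cachedProcessorId;

        // Cache expired (or never filled): query the OS and reset the countdown.
        t_cachedProcessorId = (int)GetCurrentProcessorNumber();
        t_remainingUses = s_refreshRate;
        return t_cachedProcessorId;
    }
}
```

With a low refresh rate the cache is bypassed almost every call (good when the OS call is cheap); with a high one most calls are a thread-static read (good when the OS call is expensive).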
In this change:
There are two on-demand checks that estimate the performance of `GetCurrentProcessorId`.
As you may notice, calibration adds 5 msec total to the first 500 accesses (one sample is taken per 50 accesses).
Further accesses will not have that cost, but the results of calibration will affect the API by re-tuning the cache refresh rate from the default to, hopefully, a better value. There could be an observable change, and "better" is a bit of a fuzzy metric.
Also, as long as we accept some degree of error (as I said, we can dial that vs. the cost of calibration), there could be some bimodality from run to run. Ex: in one run the cache refresh is set to 20 accesses, in another to 18.
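To make the numbers above concrete, here is a rough sketch of how per-50-access sampling could add up to ~5 msec over the first 500 accesses and then re-tune the refresh rate. The class, the constants, and the tuning formula are my own invention for illustration; only the "one ~0.5 msec sample per 50 accesses, 10 samples total" arithmetic comes from the description above.

```csharp
using System;
using System.Diagnostics;

internal static class ProcessorIdCalibration
{
    // Illustrative constants matching the numbers quoted above:
    // one ~0.5 msec sample per 50 accesses, 10 samples => ~5 msec over 500 accesses.
    internal const int AccessesPerSample = 50;
    private const int SamplesNeeded = 10;
    private static readonly long s_sampleBudgetTicks = Stopwatch.Frequency / 2000; // ~0.5 msec

    private static int s_samples;
    private static double s_bestNsPerCall = double.MaxValue;

    // Called every AccessesPerSample-th access: times a short loop of uncached
    // processor-id calls and, once enough samples are collected, derives a new
    // cache refresh rate. The formula below is made up for illustration.
    internal static int SampleAndTune(Func<int> uncachedProcessorId, int currentRefreshRate)
    {
        long start = Stopwatch.GetTimestamp();
        int calls = 0;
        do
        {
            uncachedProcessorId();
            calls++;
        } while (Stopwatch.GetTimestamp() - start < s_sampleBudgetTicks);

        double elapsedNs = (Stopwatch.GetTimestamp() - start) * 1e9 / Stopwatch.Frequency;
        s_bestNsPerCall = Math.Min(s_bestNsPerCall, elapsedNs / calls); // keep the least noisy sample

        if (++s_samples < SamplesNeeded)
            return currentRefreshRate;

        // Cheaper API => less caching. A refresh rate of 1 effectively disables
        // the thread-static cache; 50 is kept as an upper bound. Constants are invented.
        return s_bestNsPerCall < 15 ? 1 : Math.Min(50, 5 * (int)(s_bestNsPerCall / 15));
    }
}
```

Because each final refresh rate is derived from slightly noisy samples, two runs can land on neighboring values (e.g. 20 vs. 18), which is the bimodality mentioned above.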
Any thoughts on whether this could be detrimental to benchmarking, and whether benchmarking will be capable of catching any misbehavior (i.e. long-term variability turns out to be much higher than we expected, or there are wild outliers)?
Agree. I would also add high predictability on any recent hardware / OS combinations that have the fast processor ID. BTW: I expect that we may remove the fast check in future once the fast processor ID becomes ubiquitous (on some platforms at least).
The first check collects multiple samples too. Can we make it nearly as robust by applying more statistical methods to the samples? E.g. look at the mins and medians?
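One way this could look, as a sketch (the helper and its names are mine, not from the change): summarize the samples the fast check already collects by their minimum and median rather than relying on a single measurement.

```csharp
using System;

internal static class SampleStats
{
    // Sketch: reduce the per-sample costs (in timer ticks) collected by the
    // fast check to a minimum and a median, so a single noisy sample does not
    // decide the outcome.
    internal static (long Min, long Median) Summarize(long[] sampleTicks)
    {
        var sorted = (long[])sampleTicks.Clone();
        Array.Sort(sorted);

        long min = sorted[0];
        int mid = sorted.Length / 2;
        long median = sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2;

        return (min, median);
    }
}
```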
Can we detect these cases from the data we collect in the fast check? These are the legacy hardware or legacy OS cases. I think it is fine to use the max refresh rate for any of the legacy cases. Also, I am not worried about spending some extra time in the static check if the platform has a bad timer.
The first check has a loop duration of 1 usec. That is just 10 ticks on my main machine and 3 ticks on the machine that was my main just this summer (CoreID is "fast" on that machine, but the timer is not great).
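To make that arithmetic explicit: the number of `Stopwatch` ticks in 1 usec is simply `Stopwatch.Frequency / 1_000_000`, so a 10 MHz timer yields 10 ticks per usec and a ~3 MHz timer only about 3, which is why such a short loop is so coarsely quantized. A tiny standalone example (not part of the change):

```csharp
using System;
using System.Diagnostics;

class TimerResolution
{
    static void Main()
    {
        // Number of Stopwatch ticks that fit into a 1 usec measurement window.
        double ticksPerMicrosecond = Stopwatch.Frequency / 1_000_000.0;

        // e.g. 10 MHz QPC -> 10 ticks per usec; ~3 MHz QPC -> ~3 ticks per usec,
        // so a 1 usec loop measured on the slower timer has very few distinct outcomes.
        Console.WriteLine($"Stopwatch.Frequency = {Stopwatch.Frequency:N0} Hz");
        Console.WriteLine($"Ticks per 1 usec    = {ticksPerMicrosecond:F1}");
    }
}
```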
It is actually possible to have only a boolean check, but that would create a cliff between fast and other systems. The second check smooths that by producing a numerical comparison with acceptable precision (I have seen roughly 25% variance in extreme cases; I think that is acceptable).
As I hear it, we would want the first check to have fewer false negatives on fast systems, maybe at the cost of a slightly higher chance of false positives or of spending a bit more time when failing.
I think we have a whole range of possibilities here. We probably need to discuss which constraints we want to fit into.
Can we top-off the number of iterations for the thread static, similarly to how we top-off the number of iterations for
I think it is fine to have some variance once we start using the thread-static cache. The tuning of the thread-static cache refresh rate is not an exact science. It is not clear to me whether the current formula
In the new change:
What was traded for that:
I think the tradeoffs may be worth it.
I am sorry, I misunderstood the question. Thanks for the clarification, @VSadov!
BenchmarkDotNet has a non-trivial warmup. Let's consider the following example run on my Ubuntu PC using the latest .NET Core 5.0 SDK:
[Benchmark] public int GetCurrentProcessorId() => Thread.GetCurrentProcessorId();
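For completeness, here is a self-contained version of that benchmark. The explicit single-warmup-iteration config below is my own sketch approximating what is described for the performance repo, not its actual configuration.

```csharp
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

public class ProcessorIdBenchmarks
{
    // The micro-benchmark from the example above.
    [Benchmark]
    public int GetCurrentProcessorId() => Thread.GetCurrentProcessorId();
}

public class Program
{
    public static void Main()
    {
        // Sketch of a config that runs a single warmup iteration, similar in
        // spirit to the performance repo setup mentioned below.
        var config = DefaultConfig.Instance
            .AddJob(Job.Default.WithWarmupCount(1));

        BenchmarkRunner.Run<ProcessorIdBenchmarks>(config);
    }
}
```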
First of all, we run the benchmark once to JIT the code. If this single invocation already takes longer than the iteration time, the tuning steps below are skipped.
If it takes less, we run the benchmark once more with manual loop unrolling enabled (again, to JIT it).
Then we start the Pilot phase, which is supposed to find the perfect invocation count (how many times to run the benchmark per single iteration):
(The perfect invocation count above is
After this, BDN runs the warmup phase, which has a simple heuristic: by default it runs at least 6 iterations and stops when the results get stable. In the performance repo we run only 1 warmup iteration.
So the answer to your question is that the code is going to be executed so many times before we start the actual benchmarking that the end result won't contain the "warmup" and "calibration" overhead.
However, if the calibration never stops, the benchmark results might be multimodal. BDN is going to print a warning and a nice histogram.