Adjusting `GetCurrentProcessorId` caching to different environments. #467

Open · wants to merge 8 commits into base: master from VSadov:CoreId
Conversation

@VSadov (Member) commented Dec 3, 2019

GetProcessorNumber has a wide range of possible implementations and performance characteristics. The slowest implementations can be 100x slower than the fast ones, and we may not have seen the full range. Even on the same hardware, performance may vary (e.g., an x86 WOW64 process will have noticeably slower support than an x64 process on the same machine/OS).

At sub-context-switch timescales the result of GetProcessorNumber is fairly stable, since most OSes try to keep a thread from "wandering" to other cores. As such, it is possible to amortize the expense of this API by thread-static caching. However, context switches do happen, and the cached result has diminishing utility as time passes.

The good news is that this API is fast on recent hardware/OSes and is getting faster. It is not uncommon now to see systems where the call is fast enough that caching is unnecessary or may, in fact, make it slower. There are systems where this API is actually faster than a standalone ThreadStatic access (for example, when the RDPID instruction is available and the OS uses it).

The goal of this change is to use as little caching as possible while not falling off perf cliffs on systems where the API happens to be slow.

== In this change:

There are two on-demand checks that estimate the performance of GetProcessorNumber, with a standalone ThreadStatic access used as a baseline. There is still a cache, but the caching strategy is adjusted according to those checks.

  1. On the first access to GetCurrentProcessorId, a quick and conservative check is performed to detect "definitely fast" systems.
    On such systems no caching is required and no further checks are necessary; we get out of the way and the cache is bypassed.
  2. Systems that do not pass the first check operate with the default cache settings (refresh every 50 accesses).
    At each cache refresh a calibration sample is collected.
    When enough samples have been collected, a more precise refresh rate is computed. Typically it will be smaller than the default 50; sometimes it may be larger.
    The total cost of calibration has an upper bound and is not high. The impact is further mitigated by spreading out the measurements, which makes calibration pay-for-play, keeps it away from start-up, and dilutes its effect. As a result, calibration is generally barely noticeable in a profiler. (A rough sketch of this flow follows below.)
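
A minimal sketch of the flow just described - illustrative names, measurement details elided, races treated as benign, and written as if it lived inside the runtime where the internal Thread.GetCurrentProcessorNumber is accessible. This is not the actual implementation:

using System.Threading;

internal static class AdaptiveProcessorIdSketch
{
    private const int CalibrationSamples = 10;  // hypothetical sample count
    private static int s_refreshRate = 50;      // default: refresh every 50 accesses
    private static int s_refreshesSeen;

    // Check 1: runs once; detects "definitely fast" systems.
    private static readonly bool s_reallyFast = ProcessorNumberSpeedCheck();

    [ThreadStatic] private static int t_cachedId;
    [ThreadStatic] private static int t_remaining;

    public static int GetCurrentProcessorId()
    {
        // On "definitely fast" systems the cache is bypassed entirely.
        if (s_reallyFast)
            return Thread.GetCurrentProcessorNumber();

        // Common path: serve the cached id until the countdown runs out.
        if (t_remaining-- > 0)
            return t_cachedId;

        // Check 2: each cache refresh would also collect a calibration sample;
        // once enough samples exist, a more precise refresh rate replaces the
        // default. The timing itself is elided in this sketch.
        if (++s_refreshesSeen == CalibrationSamples)
            s_refreshRate = ComputeRefreshRateFromSamples();

        t_cachedId = Thread.GetCurrentProcessorNumber();
        t_remaining = s_refreshRate;
        return t_cachedId;
    }

    // Stubs standing in for the timed comparisons against a ThreadStatic access.
    private static bool ProcessorNumberSpeedCheck() => false;
    private static int ComputeRefreshRateFromSamples() => 50;
}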
VSadov requested review from jkotas and stephentoub Dec 3, 2019
@VSadov (Member, Author) commented Dec 3, 2019

Thanks @tannergooding for validating this on the Zen2 architecture (known to have RDPID support). #Closed

double tlsMin = double.MaxValue;
for (int i = 0; i < CalibrationSamples; i++)
{
idMin = Math.Min(idMin, calibrationState[i * 2]); //ID

@jkotas (Member) commented Dec 3, 2019

Do we really need to keep the whole array around to just compute the minimums at the end?
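
(For illustration, the samples could instead be folded into running minimums at collection time; MeasureIdOnce and MeasureTlsOnce are hypothetical stand-ins for the timed loops:)

// At each calibration point, fold the new measurement into a running
// minimum instead of storing it for a final pass over the array.
s_idMin = Math.Min(s_idMin, MeasureIdOnce());
s_tlsMin = Math.Min(s_tlsMin, MeasureTlsOnce());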

@jkotas (Member) commented Dec 3, 2019

cc @adamsitnik Any thoughts about the impact of calibration algorithms like this on benchmarking?

VSadov force-pushed the VSadov:CoreId branch from 7e39598 to c646a6f Dec 4, 2019
@adamsitnik (Member) commented Dec 4, 2019

Any thoughts about the impact of calibration algorithms like this on benchmarking?

BenchmarkDotNet does not use GetCurrentProcessorId at all. All the heuristics are based on execution time only.

/cc @AndreyAkinshin

@VSadov (Member, Author) commented Dec 4, 2019

@adamsitnik - I think @jkotas is asking about the impact on benchmarking of the general practice of using calibration like this.

As you may have noticed, calibration adds 5 msec total to the first 500 accesses (one sample taken per 50 accesses).
That assumes we keep the latest calibration durations. We can dial that as a cost-vs-variability tradeoff.

Further accesses will not bear that cost, but the results of calibration will affect the API by re-tuning the cache refresh rate from the default to, hopefully, a better value. There could be an observable change, and "better" is a bit of a fuzzy metric.
I.e., the average access may become slightly more expensive while the precision of the API improves, with the indirect effect of fewer cache misses and less sharing. The impact will depend on how sensitive the use pattern is.
Basically, there is a slight performance "step" after 500 accesses, when calibration is done.

Also, as long as we accept some degree of error (as I said, we can dial that against the cost of calibration), there could be some bimodality from run to run. E.g., in one run the cache refresh is set to 20 accesses, in another to 18.
This can be observed as noise from run to run, even though the effects would be fairly indirect, unless the errors are extreme.

Any thoughts on whether this could be detrimental to benchmarking, and whether benchmarking would be capable of catching any misbehavior (e.g., long-term variability turns out way higher than we expected, or there are wild outliers)?

VSadov force-pushed the VSadov:CoreId branch from 93532b5 to bb4847f Dec 7, 2019
@jkotas (Member) commented Dec 7, 2019

I wanted the whole deal to be optimized for good hardware

Agree. I would also add high predictability on any recent hardware/OS combination that has the fast processor ID. BTW: I expect that we may remove the fast check in the future, once the fast processor ID becomes ubiquitous (on some platforms at least).

As you may see there is a lot of noise even in the second check, but this check is more robust since it has multiple samples and needs only one good measurement of each, regardless of which pair it is in.

The first check collects multiple samples too. Can we make it nearly as robust by applying more statistical methods to the samples, e.g., looking at the mins and medians?

what if GetProcessorNumber is ridiculously slow
what if timer is really bad

Can we detect these cases from the data we collect in the fast check? These are the legacy-hardware or legacy-OS cases. I think it is fine to use the max refresh rate for any of the legacy cases. Also, I am not worried about spending some extra time in the static check if the platform has a bad timer.

@VSadov (Member, Author) commented Dec 8, 2019

The first check has a loop duration of 1 μsec. That is just 10 ticks on my main machine, and 3 ticks on the machine that was my main just this summer (CoreID is "fast" on that machine, but the timer is not great).
Basically, the time we spend in the first check seems enough to give a boolean answer with some confidence, but hardly enough to measure the refresh rate
(if one is 20x slower than another, it is hard to measure that when spending 10 or 3 ticks on each).
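
(For context, rough tick counts under assumed Stopwatch frequencies:

1 μs at 10 MHz (a common QueryPerformanceFrequency value) = 10 ticks
1 μs at ~3.6 MHz (e.g., an ACPI-PM-timer-backed clock)    ≈ 3-4 ticks

so a 1 μs loop sits near the resolution limit of the slower timer.)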

It is actually possible to have only a boolean check, but that would create a cliff between fast systems and the rest. The second check smooths that out by producing a numerical comparison with acceptable precision (I have seen ~25% variance in extreme cases, which I think is acceptable).
The second check also acts as a fallback if something goes wrong with the first check.

As I hear it, we would want the first check to have fewer false negatives on fast systems, perhaps at the cost of a slightly higher chance of false positives, or of spending a bit more time when failing.
That is definitely possible.

I think we have a whole range of possibilities here. We probably need to discuss which constraints we want to fit into.

@jkotas (Member) commented Dec 8, 2019

if one is 20x slower than another, it is hard to measure that when spending 10 or 3 ticks on each

Can we top off the number of iterations for the thread-static access, similarly to how we top off the number of iterations for GetCurrentProcessorNumber? Then you should be able to get a good idea of the relative speed (e.g., that one is about 20x slower than the other).
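
(For illustration, "topping off" here could look something like the following, with MinTicksPerBatch a hypothetical threshold:)

int iters = 1;
long elapsed;
do
{
    // Double the batch size until one timed batch spans enough Stopwatch
    // ticks to measure reliably, so slow and fast operations end up with
    // comparable relative precision.
    iters *= 2;
    long t0 = Stopwatch.GetTimestamp();
    for (int i = 0; i < iters; i++)
        UninlinedThreadStatic();
    elapsed = Stopwatch.GetTimestamp() - t0;
} while (elapsed < MinTicksPerBatch);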

@jkotas (Member) commented Dec 8, 2019

I think it is fine to have some variance once we start using the thread-static cache; tuning the thread-static cache refresh rate is not an exact science. It is not clear to me whether the current formula for the refresh rate, idMin / tlsMin, is the right trade-off. This formula means that we will spend about 50% of the time on refreshing the cache. I would make it lower, e.g., spend about 10% of the time on refreshing the cache.
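
(For reference: with a refresh rate R = k * t_ID / t_TLS, one refresh period costs about R * t_TLS in cached reads plus t_ID for the refresh, so the fraction of time spent refreshing is

t_ID / (R * t_TLS + t_ID) = t_ID / (k * t_ID + t_ID) = 1 / (k + 1)

so k = 1 gives ~50%, k = 5 gives ~17%, and roughly k = 9 would be needed for ~10%.)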

@VSadov (Member, Author) commented Dec 12, 2019

In the new change:

  • The piece-wise calibration pass is gone; only the first pass remains.
  • More robust detection of "fast" cases (do not give up on the first failure; use the best among 10 samples).
  • Substantially simpler, IMO. The sample-array allocation is gone too.

What was traded for that:

  • We may spend more time in the checking routine if the GetCurrentProcessorNumber API is super slow.
    Previously we could call the slow API just once in the worst case; now we may call it up to 16 times.
    More common cases may make fewer calls than before, though.

  • The refresh rate, when used, is noisier.
    I have seen the refresh rate computed at 15 and at 30 on the same machine.
    Generally the range is not that large though - within 10%.

I think the tradeoffs may be worth it.


// If GetCurrentProcessorNumber takes any nontrivial time (compared to TLS access), return false.
// Check more than once - to make sure it was not because TLS was delayed by GC or a context switch.
internal static unsafe int ProcessorNumberSpeedCheck()

@jkotas (Member) commented Dec 12, 2019

I would make this return bool so that all "magic" constants are together. #Resolved

@jkotas (Member) commented Dec 12, 2019

This looks much better to me, modulo some comments. Thank you!

minTLS = Math.Min(minTLS, (double)t1 / iters);
}

s_processorIdRefreshRate = Math.Min((int)(minID / minTLS), MaxIdRefreshRate);

@jkotas (Member) commented Dec 12, 2019

I am not sure whether minID / minTLS is the best tradeoff for the refresh rate. I would make it 5 * minID / minTLS.

@VSadov (Member, Author) commented Dec 12, 2019

5 * minID / minTLS will move the refresh rate on a typical non-fast machine (like x86) from 20-30 to 200-300.
There must be a point where a large refresh rate makes this API less useful than ManagedThreadId. ManagedThreadId is relatively unique, so when threads exchange cores they would get cache misses, but at least they might not share (and ping-pong) those cache lines.
With CoreID and a large refresh rate, they could actually be sharing the data - and it would not be nice if another core swap happened within those 300 accesses.

I think 300 is a bit too much, and we should keep the refresh rate as small as we think affordable - i.e., if GetProcessorNumber is 30x slower, we should just amortize it over 30 accesses.

I have not yet tried to validate or reject this theory.
I need to think of an experiment to at least crudely estimate the effects of long refresh rates.



@jkotas (Member) commented Dec 13, 2019

Yeah, it is hard to get a conclusive answer for this. I can live with the 1:1 amortization ratio if you really believe it is right. Or maybe take the average between our two opinions and make it 1:2 :-)

@VSadov (Member, Author) commented Dec 13, 2019

I am trying to use this in a striped counter, and so far it points to benefits of longer caching than 1:1. Maybe 5 * minID / minTLS is the right ratio...
That is an extreme case though - very sensitive to the cost of accessing a stripe, since an uncontended increment is cheap.

Or maybe I have bugs somewhere :-)
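
For context, a minimal sketch of the kind of striped counter being described, using the public Thread.GetCurrentProcessorId (cache-line padding between stripes and other refinements elided):

using System;
using System.Threading;

// Increments are spread across per-core stripes chosen by the current
// processor id, so threads on different cores mostly touch different slots.
internal sealed class StripedCounter
{
    private readonly long[] _stripes = new long[Environment.ProcessorCount];

    public void Increment()
    {
        // The id may exceed ProcessorCount on some systems (or be a
        // managed-thread-id proxy), hence the modulo.
        int i = (int)((uint)Thread.GetCurrentProcessorId() % (uint)_stripes.Length);
        Interlocked.Increment(ref _stripes[i]);
    }

    public long Sum()
    {
        long sum = 0;
        for (int i = 0; i < _stripes.Length; i++)
            sum += Volatile.Read(ref _stripes[i]);
        return sum;
    }
}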



// doesn't exist on all platforms. On those it doesn't exist on, GetCurrentProcessorNumber()
// returns -1. As a fallback in that case and to spread the threads across the buckets
// by default, we use the current managed thread ID as a proxy.
if (currentProcessorId < 0) currentProcessorId = Environment.CurrentManagedThreadId;

@jkotas (Member) commented Dec 12, 2019

Nit: Formatting. I believe this should be on two lines according to the coding conventions.
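
(i.e., presumably:)

if (currentProcessorId < 0)
    currentProcessorId = Environment.CurrentManagedThreadId;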

minTLS = Math.Min(minTLS, (double)t1 / iters);
}

s_processorIdRefreshRate = Math.Min((int)(minID / minTLS), MaxIdRefreshRate);

@jkotas (Member) commented Dec 12, 2019

Do we need to crop it on the low-end as well to protect against bad clocks, etc.?

Math.Clamp

@VSadov (Member, Author) commented Dec 13, 2019

I think both mins will be positive and not large (relative to the range of Double). If the ratio falls below the "fast" threshold, we will not be using the cache at all, so the effective refresh rate seems bounded between the "fast" threshold and MaxIdRefreshRate.



}

// do a fast check and record in a readonly static so that it could become a JIT constant
private static readonly bool s_isProcessorNumberReallyFast = ProcessorIdCache.ProcessorNumberSpeedCheck();

@adamsitnik (Member) commented Dec 13, 2019

It would be nice to have this field declaration at the top of the file so this initialization would not be "hidden".

https://github.com/dotnet/runtime/blob/master/docs/coding-guidelines/coding-style.md#c-coding-style

  1. Fields should be specified at the top within type declarations.

@jkotas (Member) commented Dec 13, 2019

System.Threading.Thread is a dumping ground for many disjoint pieces of functionality. The existing style used in the file is to have the data declared locally. It is unclear whether moving all of the data to the top would make the file more readable. If you believe it would, we should do it for all of it to maintain a consistent style.

t1 = Stopwatch.GetTimestamp();
for (int j = 0; j < iters; j++)
{
Thread.GetCurrentProcessorNumber();

@adamsitnik (Member) commented Dec 13, 2019

Should the result of Thread.GetCurrentProcessorNumber be consumed to avoid dead-code elimination?
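
(For illustration, e.g., accumulating the result into a value that stays observably live:)

int sink = 0;
for (int j = 0; j < iters; j++)
{
    // Consume the return value so the call cannot be eliminated as dead code.
    sink += Thread.GetCurrentProcessorNumber();
}
GC.KeepAlive(sink); // keep 'sink' observably used past the loop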

double minTLS = Stopwatch.Frequency; // 1 sec

// warm up the code paths.
UninlinedThreadStatic();

@adamsitnik (Member) commented Dec 13, 2019

Should we also call Thread.GetCurrentProcessorNumber() here to make sure that it's JITted before we call it in the loop?
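
(i.e., presumably:)

// warm up both code paths before any timed loop runs
UninlinedThreadStatic();
Thread.GetCurrentProcessorNumber();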

@adamsitnik (Member) commented Dec 13, 2019

I think @jkotas is asking about the impact on benchmarking of the general practice of using calibration like this.

I am sorry, I misunderstood the question. Thanks for the clarification, @VSadov!

BenchmarkDotNet has a non-trivial warmup. Let's consider the following example, run on my Ubuntu PC using the latest .NET Core 5.0 SDK:

[Benchmark]
public int GetCurrentProcessorId() => Thread.GetCurrentProcessorId();
WorkloadJitting  1: 1 op, 337448.00 ns, 337.4480 us/op

WorkloadJitting  2: 16 op, 447148.00 ns, 27.9468 us/op

WorkloadPilot    1: 16 op, 931.00 ns, 58.1875 ns/op
WorkloadPilot    2: 4296464 op, 37519726.00 ns, 8.7327 ns/op
WorkloadPilot    3: 28628048 op, 216613893.00 ns, 7.5665 ns/op
WorkloadPilot    4: 33040416 op, 211301732.00 ns, 6.3953 ns/op
WorkloadPilot    5: 39091520 op, 250231990.00 ns, 6.4012 ns/op
WorkloadPilot    6: 39055280 op, 250907306.00 ns, 6.4244 ns/op
WorkloadPilot    7: 38914064 op, 248882611.00 ns, 6.3957 ns/op
WorkloadPilot    8: 39088784 op, 249162154.00 ns, 6.3743 ns/op
WorkloadPilot    9: 39220240 op, 251730406.00 ns, 6.4184 ns/op

WorkloadWarmup   1: 39220240 op, 248808770.00 ns, 6.3439 ns/op

First of all, we run the benchmark once to JIT the code. If this takes longer than IterationTime (500 ms by default, 250 ms for the performance repo), the warmup is over (to save time for long-running benchmarks) and we run the benchmark once per iteration.

WorkloadJitting  1: 1 op, 337448.00 ns, 337.4480 us/op

If it takes less, we run the benchmark once more with manual loop unrolling enabled (again, to JIT it):

WorkloadJitting  2: 16 op, 447148.00 ns, 27.9468 us/op

Then we start the Pilot phase, which is supposed to find the perfect invocation count (how many times to run the benchmark per single iteration):

WorkloadPilot    1: 16 op, 931.00 ns, 58.1875 ns/op
WorkloadPilot    2: 4296464 op, 37519726.00 ns, 8.7327 ns/op
WorkloadPilot    3: 28628048 op, 216613893.00 ns, 7.5665 ns/op
WorkloadPilot    4: 33040416 op, 211301732.00 ns, 6.3953 ns/op
WorkloadPilot    5: 39091520 op, 250231990.00 ns, 6.4012 ns/op
WorkloadPilot    6: 39055280 op, 250907306.00 ns, 6.4244 ns/op
WorkloadPilot    7: 38914064 op, 248882611.00 ns, 6.3957 ns/op
WorkloadPilot    8: 39088784 op, 249162154.00 ns, 6.3743 ns/op
WorkloadPilot    9: 39220240 op, 251730406.00 ns, 6.4184 ns/op

(The perfect invocation count above is 39220240)

After this, BDN runs the warmup phase, which has a simple heuristic: by default it runs at least 6 iterations and stops when the results get stable. In the performance repo we run only 1 warmup iteration.

WorkloadWarmup   1: 39220240 op, 248808770.00 ns, 6.3439 ns/op

So the answer to your question is that the code is going to be executed so many times before we start the actual benchmarking that the end result won't contain the "warmup" and "calibration" overhead.

However, if the calibration never stops, the benchmark results might be multimodal. BDN is going to print a warning and a nice histogram.

@adamsitnik (Member) commented Dec 13, 2019

BTW, do we have any automated test for this API? Something that starts an affinitized process and then validates the value returned by the API? (I am just curious - I know that writing it and keeping it stable across all OSes might be non-trivial.)
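
(For illustration only, a rough sketch of such a test. It assumes the platform supports Process.ProcessorAffinity (Windows/Linux), ignores the fallback path where the API returns a managed-thread-id proxy, and is still somewhat racy by nature:)

using System;
using System.Diagnostics;
using System.Threading;

class ProcessorIdAffinityTest
{
    static void Main()
    {
        const int core = 0;

        // Pin the whole process to a single core.
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)(1 << core);

        // Let any pre-affinity cached id expire before asserting.
        for (int i = 0; i < 10_000; i++)
            Thread.GetCurrentProcessorId();

        int id = Thread.GetCurrentProcessorId();
        if (id != core)
            throw new Exception($"expected {core}, got {id}");

        Console.WriteLine("PASS");
    }
}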
