
Bump HandleHistogram32::SIZE to 16 #87332

Merged
merged 3 commits into from Jun 10, 2023

Conversation

@EgorBo (Member) commented Jun 9, 2023

We think that one of the reasons we see big noise in microbenchmarks after we enabled Dynamic PGO is mispredicted classes for virtual calls: #87324 (although it could have happened with static PGO as well).

The actual list of affected benchmarks is bigger; it's especially well visualized in these benchmarks:

[image: benchmark results chart]

I investigated this benchmark and confirmed that it's caused by "random" GDV decisions while for P=0.3 we should always perform GDV for the class that has P=0.7.

A simple repro:

using System;
using System.Runtime.CompilerServices;
using System.Threading;

public class B
{
    public virtual int F() => 33;
}

public class D : B
{
    public override int F() => 44;
}

public class Prog
{

    [MethodImpl(MethodImplOptions.NoInlining)]
    static long Test(B b)
    {
        return b.F() + b.F();
    }

    static void Main()
    {
        for (int i = 0; i < 100; i++)
        {
            Thread.Sleep(16);
            Test((i % 10 >= 3) ? new D() : new B());
        }
        Console.ReadKey();
    }
}

(note: run in Release, or change the DOTNET_TC_* args in Checked - Checked tries to tier up too aggressively)

In this case we expect Test to devirtualize both calls to class D because it occurs more frequently; instead we get this:

  0) Class D - 7
  1) Class B - 1
Picked:   'D'

  0) Class B - 5
  1) Class D - 3
Picked:   'B'

and it varies between runs, so we might end up in a situation where two calls on the same object are devirtualized differently. In my case GDV picked B and B, B and D, etc., while it should always be D and D here.
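To make the mechanism concrete, here is a rough Python sketch (an illustration, not the runtime's actual code) of two call sites keeping independent 8-slot tables over the same stream of objects, using a fill-then-randomly-overwrite policy; `run_histogram` and the seeds are made up for the example:

```python
import random

SIZE = 8  # the pre-PR HandleHistogram32::SIZE

def run_histogram(classes, rng):
    """Sketch of a per-call-site class histogram: the first SIZE
    observations fill the table; afterwards each observation
    overwrites a random slot with probability SIZE/32."""
    table = []
    for cls in classes:
        if len(table) < SIZE:
            table.append(cls)
        elif rng.randrange(32) < SIZE:
            table[rng.randrange(SIZE)] = cls
    # Pick the most frequent class (ties broken deterministically).
    return max(sorted(set(table)), key=table.count)

data_rng = random.Random(1234)
# 70% 'D', 30% 'B', like the repro above.
stream = ['D' if data_rng.random() < 0.7 else 'B' for _ in range(30)]

# Two call sites see the same objects but keep separate tables with
# separate random replacement decisions, so their picks can differ.
print(run_histogram(stream, random.Random(1)))
print(run_histogram(stream, random.Random(2)))
```

Because each table keeps only 8 randomly retained samples out of a short profiling window, the argmax is itself a random variable, which is why the two picks above need not agree.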

Once I bumped the size to 16, it became way more stable:

  0) Class D - 10
  1) Class B - 6
Picked:   'D'

  0) Class D - 10
  1) Class B - 6
Picked:   'D'

We might want to bump it to 32 or higher, but I propose we do it iteratively and watch the benchmarks' behavior.

PS: Since we now only instrument hot tier0/r2r code, we no longer allocate these tables for every method, so presumably we can afford an extra 8*sizeof(void*) bytes per instrumented virtual call-site.
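Back-of-the-envelope cost of the table slots per instrumented call site, assuming 8-byte pointers and ignoring any per-table header (a sketch, not the runtime's exact allocation):

```python
# Bytes of class-handle slots per instrumented virtual call site,
# assuming sizeof(void*) == 8 and ignoring per-table overhead.
POINTER_SIZE = 8
for size in (8, 16, 32):
    print(f"SIZE={size}: {size * POINTER_SIZE} bytes")
# SIZE=8: 64 bytes, SIZE=16: 128 bytes, SIZE=32: 256 bytes
```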

cc @AndyAyersMS @jakobbotsch @davidwrighton

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 9, 2023
@ghost ghost assigned EgorBo Jun 9, 2023
@ghost commented Jun 9, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch (Member)

With 16, what is the probability that we actually see D as the dominating class in this benchmark? It would be interesting to understand the theoretical probability of picking the true dominating class with different table sizes (and different true probabilities). It should be computable with some simulation, or maybe there's a closed formula.

@EgorBo (Member, Author) commented Jun 9, 2023

With 16, what is the probability that we actually see D as the dominating class in this benchmark? It would be interesting to understand the theoretical probability of picking the true dominating class with different table sizes (and different true probabilities). It should be computable with some simulation, or maybe there's a closed formula.

I assume it's complicated by 2 things:

  1. We still use random replacement, and I see small artifacts because of it even with size=16, but across many runs D still dominated in all of them.
  2. CallCount is 30, but promotion is very dependent on external factors: if the promotion queue is empty we can promote a method almost immediately after 30 hits, while in other cases we can invoke it 100k times and it will still be pending promotion.

@EgorBo (Member, Author) commented Jun 9, 2023

With 16, what is the probability that we actually see D as the dominating class in this benchmark?

9 to 11 slots in the 16-slot table are 'D' handles => 56.25% - 68.75%

The actual (true) probability is 70%.
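The ranges above come straight from slot counts over the table size:

```python
# 'D' share of a 16-slot table for 9-11 occupied slots, vs the true 0.7.
SIZE = 16
for d_slots in (9, 10, 11):
    print(f"{d_slots}/{SIZE} = {d_slots / SIZE:.2%}")
# 9/16 = 56.25%, 10/16 = 62.50%, 11/16 = 68.75%
```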

@EgorBo EgorBo closed this Jun 9, 2023
@EgorBo EgorBo reopened this Jun 9, 2023
@EgorBo (Member, Author) commented Jun 9, 2023

Oops, accidentally closed

@jakobbotsch (Member) commented Jun 9, 2023

9 to 10 slots in the 16-slot table are 'D' handles => 56.25% - 62.5%

This is not what I mean. I mean that, over many runs, in how many of them do we pick the wrong class to GDV to with N=16? With N=8? This can be simulated and the probability can be computed. It can help us make the decision of what numbers we are comfortable with for various scenarios (where this bimorphic benchmark with P=0.3 is one of them).

@jakobbotsch (Member) commented Jun 9, 2023

E.g. quick and dirty:

using System;

// Simulate the class-handle table for a two-class call site where the
// dominant class has P = 0.7 (true = dominant class).
Random rand = new Random();
bool[] table = new bool[8];
int correct = 0;
for (int i = 0; i < 10_000_000; i++)
{
    int num = 0;
    for (int j = 0; j < 100; j++)
    {
        bool cls = rand.NextDouble() > 0.3; // true with P = 0.7

        // The first table.Length observations fill the table...
        if (num < table.Length)
        {
            table[num++] = cls;
            continue;
        }

        // ...afterwards, overwrite a random slot with probability table.Length/32.
        int x = rand.Next();
        if ((x % 32) >= table.Length)
        {
            continue;
        }

        table[x % table.Length] = cls;
    }

    int count = 0;
    foreach (bool b in table)
    {
        if (b)
            count++;
    }

    if (count == table.Length / 2)
        correct += rand.Next(2) == 0 ? 1 : 0; // break exact ties with a coin flip
    else if (count > table.Length / 2)
        correct++;
}

Console.WriteLine("{0} correct", correct);

With N = 8 we get 8740683, so we make a mistake 12.6% of the time.
With N = 16 we get 9754599, so 97.5% of the time we are right.

Edit: So we still expect 1 out of every 40 runs of this benchmark with 2 classes and P = 0.3 to pick the wrong GDV with N=16.
At N=32 the probability grows to 99.6%.
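If we model each slot of a filled table as an independent draw of the observed class (a reasonable approximation, since every slot ends up holding a distinct sample), the two-class case also has a closed form. A Python sketch, not anything from the runtime; for N=8 it reproduces the ~87.4% simulation result above, while the exact figures for larger N depend on how ties are scored:

```python
from math import comb

def p_pick_dominant(n, p=0.7):
    """Probability that a majority vote over n independent
    Bernoulli(p) slots picks the dominant class, breaking exact
    ties with a coin flip (as in the simulation above)."""
    pmf = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)
    win = sum(pmf(k) for k in range(n // 2 + 1, n + 1))
    return win + 0.5 * pmf(n // 2)

print(p_pick_dominant(8))   # ~0.874, matching 8740683 / 10M above
print(p_pick_dominant(16))
print(p_pick_dominant(32))
```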

@EgorBo (Member, Author) commented Jun 9, 2023

9 to 10 slots in the 16-slot table are 'D' handles => 56.25% - 62.5%

This is not what I mean. I mean that, over many runs, in how many of them do we pick the wrong class to GDV to with N=16? With N=8? This can be simulated and the probability can be computed. It can help us make the decision of what numbers we are comfortable with for various scenarios (where this bimodal benchmark with P=0.3 is one of them).

I did this in PowerShell:

for(;;) { artifacts\bin\coreclr\windows.x64.Checked\corerun.exe ConsoleApp.dll }

and

if (ISMETHOD("Test"))
{
    FILE* f = fopen("log.csv", "a");
    fprintf(f, "%s\n", eeGetClassName(likelyClass));
    fclose(f);
}

in considerGuardedDevirtualization, and got 43% bad decisions (i.e. class 'B' was picked) across 15 runs.

With size=16: 0 bad decisions (which doesn't mean we won't make bad decisions in other cases, of course).

@EgorBo (Member, Author) commented Jun 9, 2023

So we still expect 1 out of every 40 runs of this benchmark

Ah, didn't run that long enough to see B 🙂

@AndyAyersMS (Member)

It would be interesting to plot the two-class error rate vs both table size and true likelihood, but given what we see above I would say 32 seems like a pretty good bet. Also, if we ever want to enable multi-guess we'll want more confidence in the second-best probabilities, and this entails larger sizes too. If the dominant entry takes up say 70% of the table (~22 slots at 32), that leaves an effective table size of 10 slots to establish the second-highest likelihood.

The lab also measures 3-class cases, but unfortunately not for really interesting mixtures like 50/25/25, where we would still hope to get it right most of the time.

Given our logic, we are also willing to guess for something like 30/10/10/10/10/10/10/10, where the dominant case is not the majority case.
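The 30/10/10/... mixture can be sanity-checked with a quick simulation. A Python sketch that models each of 32 slots as an independent draw from the mixture (ignoring sampling-window effects, so only indicative):

```python
import random

rng = random.Random(0)

# Mixture from the comment above: one class at 30%, seven at 10% each.
probs = [0.30] + [0.10] * 7
SIZE = 32        # the proposed table size
TRIALS = 20_000

def draw(rng, probs):
    """Sample a class index from the mixture by inverse CDF."""
    r = rng.random()
    acc = 0.0
    for cls, p in enumerate(probs):
        acc += p
        if r < acc:
            return cls
    return len(probs) - 1

plurality_wins = 0
for _ in range(TRIALS):
    counts = [0] * len(probs)
    for _ in range(SIZE):
        counts[draw(rng, probs)] += 1
    # Did the 30% class end up with a strict plurality of the slots?
    if counts[0] > max(counts[1:]):
        plurality_wins += 1

print(plurality_wins / TRIALS)
```

The printed rate estimates how often the dominant-but-not-majority class would win a strict plurality of a 32-slot table under this simplified model.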

@EgorBo (Member, Author) commented Jun 9, 2023

say 32 seems like a pretty good bet.

Agree, changed

@jakobbotsch (Member)

Should we get rid of SAMPLE_INTERVAL and the logic around that in CheckSample? Or should we increase SAMPLE_INTERVAL? The test using it is always false now.

@AndyAyersMS (Member)

Should we get rid of SAMPLE_INTERVAL and the logic around that in CheckSample? Or should we increase SAMPLE_INTERVAL? The test using it is always false now.

I would increase it, either to 64 or possibly 128.

Note: the true "reservoir" algorithm would actually use the total count for this (which we no longer have), so the likelihood of updates would steadily decrease over time.
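For context, the "true" reservoir algorithm referenced here (classic Algorithm R) tracks the total observation count n and replaces with probability k/n, so updates naturally become rarer over time. A Python sketch, not tied to the runtime code:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Classic Algorithm R: keep a uniform k-sample of a stream by
    tracking the total count n and replacing with probability k/n."""
    table = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            table.append(item)      # fill phase
        else:
            j = rng.randrange(n)    # replace with probability k/n
            if j < k:
                table[j] = item
    return table

print(reservoir_sample(range(1000), 8, random.Random(0)))
```

The per-call-site histograms drop the total count and use a fixed replacement probability instead, which is why SAMPLE_INTERVAL exists at all.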

@EgorBo (Member, Author) commented Jun 9, 2023

Should we get rid of SAMPLE_INTERVAL and the logic around that in CheckSample? Or should we increase SAMPLE_INTERVAL? The test using it is always false now.

I would increase it, either to 64 or possibly 128.

Note: the true "reservoir" algorithm would actually use the total count for this (which we no longer have), so the likelihood of updates would steadily decrease over time.

Changed to 64. Part of the problem is that the CallCount threshold being 30 might leave us with too few samples if we increase the sample window, from my understanding.

@AndyAyersMS (Member)

Should we get rid of SAMPLE_INTERVAL and the logic around that in CheckSample? Or should we increase SAMPLE_INTERVAL? The test using it is always false now.

I would increase it, either to 64 or possibly 128.
Note: the true "reservoir" algorithm would actually use the total count for this (which we no longer have), so the likelihood of updates would steadily decrease over time.

Changed to 64. Part of the problem is that the CallCount threshold being 30 might leave us with too few samples if we increase the sample window, from my understanding.

We should be OK; the sample interval only matters once the table has filled up.

@EgorBo (Member, Author) commented Jun 9, 2023

It would be interesting to plot the two-class error rate vs both table size

I did something like this; here is a more complicated sample:

using System;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;

public interface IValue
{
    public int GetValue();
}

public class Class1 : IValue
{
    public int GetValue() => 33;
}

public class Class2 : IValue
{
    public int GetValue() => 44;
}

public class Class3 : IValue
{
    public int GetValue() => 55;
}

public class Prog
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Test(IValue obj) => 
        obj.GetValue() + 
        obj.GetValue() + 
        obj.GetValue();

    static void Main()
    {
        IValue[] testData = 
            {
                // 50% Class1
                new Class1(),
                new Class1(),
                new Class1(),
                new Class1(),
                new Class1(),

                // 30% Class2
                new Class2(),
                new Class2(),
                new Class2(),

                // 20% Class3
                new Class3(),
                new Class3(),
            };

        for (int i = 0; i < 100; i++)
        {
            // Shuffle in a random order each run
            foreach (IValue obj in testData.OrderBy(_ => Random.Shared.Next()))
            {
                Test(obj);
            }
            Thread.Sleep(10);
        }
    }
}

I ran this app many times, added some logging to the logic that extracts type handles from the histograms, and got this:

[image: per-class pick-rate tables]

So there are 3 tables (one each for Class1, Class2 and Class3); the columns are the table SIZE (8, 16, 32, 64).

Yellow marks the currently proposed value for SIZE.

PS: I made a mistake in my logic, so the SIZE=64 results are slightly misleading (mostly because the tables had too many null slots due to the table size and CallCount=30).

@EgorBo (Member, Author) commented Jun 9, 2023

Overall, 32 sounds like a good choice.

@AndyAyersMS (Member)

Yeah I was going to say the data for 64 looked off, and the data for 32 might be slightly tainted too. If the table is not full then you need to "resize" it to match the actual count value.

Presumably for the 32 case when Class1 is min then Class2 is likely to be max, and so about 3% of the time we will prefer Class2, which seems acceptable given that Class2 is not a bad choice, just not the best choice.

@EgorBo (Member, Author) commented Jun 9, 2023

data for 32 might be slightly tainted too. If the table is not full then you need to "resize" it to match the actual count value.

I've just checked, and 32 looks OK: all tables were completely full, while for 64 they had 4-14 empty slots on average.

which seems acceptable given that Class2 is not a bad choice, just not the best choice.

agree

@EgorBo EgorBo merged commit b02e844 into dotnet:main Jun 10, 2023
109 checks passed
@EgorBo EgorBo deleted the bump-HandleHistogram32-size branch June 10, 2023 12:05
@dotnet dotnet locked as resolved and limited conversation to collaborators Jul 10, 2023