
Bump HandleHistogram32::SIZE to 16 #87332

Merged
merged 3 commits into from Jun 10, 2023

Conversation

@EgorBo (Member) commented Jun 9, 2023

We think that one of the reasons we see big noise in microbenchmarks after we enabled Dynamic PGO is mispredicted classes for virtual calls: #87324 (although it could have happened with static PGO as well).

The actual list of affected benchmarks is bigger; it's especially well visualized in these benchmarks:

[image: benchmark results chart]

I investigated this benchmark and confirmed that it's caused by "random" GDV decisions while for P=0.3 we should always perform GDV for the class that has P=0.7.

A simple repro:

using System;
using System.Runtime.CompilerServices;
using System.Threading;

public class B
{
    public virtual int F() => 33;
}

public class D : B
{
    public override int F() => 44;
}

public class Prog
{

    [MethodImpl(MethodImplOptions.NoInlining)]
    static long Test(B b)
    {
        return b.F() + b.F();
    }

    static void Main()
    {
        for (int i = 0; i < 100; i++)
        {
            Thread.Sleep(16);
            Test((i % 10 >= 3) ? new D() : new B());
        }
        Console.ReadKey();
    }
}

(note: run in Release, or change the DOTNET_TC_* args in Checked - Checked tries to tier up too aggressively)

In this case we expect Test to devirtualize both calls to class D because it occurs more frequently; instead we get this:

  0) Class D - 7
  1) Class B - 1
Picked:   'D'

  0) Class B - 5
  1) Class D - 3
Picked:   'B'

and it varies between runs, so we might end up in a situation where two calls on the same object are devirtualized differently. In my case GDV picked B and B, B and D, etc., while it should always be D and D here.
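To make the mechanism concrete, here is a rough Python sketch (an illustration, not the runtime's actual code) of two call sites keeping independent 8-slot tables over the same stream of objects, using a fill-then-randomly-overwrite policy; `run_histogram` and the seeds are made up for the example:

```python
import random

SIZE = 8  # the pre-PR HandleHistogram32::SIZE

def run_histogram(classes, rng):
    """Sketch of a per-call-site class histogram: the first SIZE
    observations fill the table; afterwards each observation
    overwrites a random slot with probability SIZE/32."""
    table = []
    for cls in classes:
        if len(table) < SIZE:
            table.append(cls)
        elif rng.randrange(32) < SIZE:
            table[rng.randrange(SIZE)] = cls
    # Pick the most frequent class (ties broken deterministically).
    return max(sorted(set(table)), key=table.count)

data_rng = random.Random(1234)
# 70% 'D', 30% 'B', like the repro above.
stream = ['D' if data_rng.random() < 0.7 else 'B' for _ in range(30)]

# Two call sites see the same objects but keep separate tables with
# separate random replacement decisions, so their picks can differ.
print(run_histogram(stream, random.Random(1)))
print(run_histogram(stream, random.Random(2)))
```

Because each table keeps only 8 randomly retained samples out of a short profiling window, the argmax is itself a random variable, which is why the two picks above need not agree.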

Once I bumped the size to 16, it became way more stable:

  0) Class D - 10
  1) Class B - 6
Picked:   'D'

  0) Class D - 10
  1) Class B - 6
Picked:   'D'

We might want to bump it to 32 or higher, but I propose we do it iteratively and watch the benchmarks' behavior.

PS: Since we now only instrument hot tier0/r2r code, we no longer allocate these tables for every method, so presumably we can afford an extra 8*sizeof(void*) bytes per instrumented virtual call-site.
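Back-of-the-envelope cost of the table slots per instrumented call site, assuming 8-byte pointers and ignoring any per-table header (a sketch, not the runtime's exact allocation):

```python
# Bytes of class-handle slots per instrumented virtual call site,
# assuming sizeof(void*) == 8 and ignoring per-table overhead.
POINTER_SIZE = 8
for size in (8, 16, 32):
    print(f"SIZE={size}: {size * POINTER_SIZE} bytes")
# SIZE=8: 64 bytes, SIZE=16: 128 bytes, SIZE=32: 256 bytes
```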

cc @AndyAyersMS @jakobbotsch @davidwrighton

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 9, 2023
@ghost ghost assigned EgorBo Jun 9, 2023
@ghost commented Jun 9, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch (Member)

With 16, what is the probability that we actually see D as the dominating class in this benchmark? It would be interesting to understand the theoretical probability of picking the true dominating class with different table sizes (and different true probabilities). It should be computable with some simulation, or maybe there's a closed formula.

@EgorBo (Member, Author) commented Jun 9, 2023

With 16, what is the probability that we actually see D as the dominating class in this benchmark? It would be interesting to understand the theoretical probability of picking the true dominating class with different table sizes (and different true probabilities). It should be computable with some simulation, or maybe there's a closed formula.

I assume it's complicated by 2 things:

  1. We still use random replacement, and I see small artifacts because of it even with size=16, but across many runs D still dominated in all of them.
  2. CallCount is 30, but promotion is very dependent on external factors: if the promotion queue is empty we can promote a method almost immediately after 30 hits, while in other cases we can invoke it 100k times and it will still be pending promotion.

@EgorBo (Member, Author) commented Jun 9, 2023

With 16, what is the probability that we actually see D as the dominating class in this benchmark?

9 to 11 slots in the 16-slot table are 'D' handles => 56.25% - 68.75%

The actual (true) probability is 70%.
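The ranges above come straight from slot counts over the table size:

```python
# 'D' share of a 16-slot table for 9-11 occupied slots, vs the true 0.7.
SIZE = 16
for d_slots in (9, 10, 11):
    print(f"{d_slots}/{SIZE} = {d_slots / SIZE:.2%}")
# 9/16 = 56.25%, 10/16 = 62.50%, 11/16 = 68.75%
```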

@EgorBo EgorBo closed this Jun 9, 2023
@EgorBo EgorBo reopened this Jun 9, 2023
@EgorBo (Member, Author) commented Jun 9, 2023

Oops, accidentally closed

@jakobbotsch (Member) commented Jun 9, 2023

9 to 10 slots in the 16-slot table are 'D' handles => 56.25% - 62.5%

This is not what I mean. I mean that, over many runs, in how many of them do we pick the wrong class to GDV to with N=16? With N=8? This can be simulated and the probability can be computed. It can help us make the decision of what numbers we are comfortable with for various scenarios (where this bimorphic benchmark with P=0.3 is one of them).

@jakobbotsch (Member) commented Jun 9, 2023

E.g. quick and dirty:

using System;

// Simulate the class-handle table for a two-class call site where the
// dominant class has P = 0.7 (true = dominant class).
Random rand = new Random();
bool[] table = new bool[8];
int correct = 0;
for (int i = 0; i < 10_000_000; i++)
{
    int num = 0;
    for (int j = 0; j < 100; j++)
    {
        bool cls = rand.NextDouble() > 0.3; // true with P = 0.7

        // The first table.Length observations fill the table...
        if (num < table.Length)
        {
            table[num++] = cls;
            continue;
        }

        // ...afterwards, overwrite a random slot with probability table.Length/32.
        int x = rand.Next();
        if ((x % 32) >= table.Length)
        {
            continue;
        }

        table[x % table.Length] = cls;
    }

    int count = 0;
    foreach (bool b in table)
    {
        if (b)
            count++;
    }

    if (count == table.Length / 2)
        correct += rand.Next(2) == 0 ? 1 : 0; // break exact ties with a coin flip
    else if (count > table.Length / 2)
        correct++;
}

Console.WriteLine("{0} correct", correct);

With N = 8 we get 8740683, so we make a mistake 12.6% of the time.
With N = 16 we get 9754599, so 97.5% of the time we are right.

Edit: So we still expect 1 out of every 40 runs of this benchmark with 2 classes and P = 0.3 to pick the wrong GDV with N=16.
At N=32 the probability grows to 99.6%.
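If we model each slot of a filled table as an independent draw of the observed class (a reasonable approximation, since every slot ends up holding a distinct sample), the two-class case also has a closed form. A Python sketch, not anything from the runtime; for N=8 it reproduces the ~87.4% simulation result above, while the exact figures for larger N depend on how ties are scored:

```python
from math import comb

def p_pick_dominant(n, p=0.7):
    """Probability that a majority vote over n independent
    Bernoulli(p) slots picks the dominant class, breaking exact
    ties with a coin flip (as in the simulation above)."""
    pmf = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)
    win = sum(pmf(k) for k in range(n // 2 + 1, n + 1))
    return win + 0.5 * pmf(n // 2)

print(p_pick_dominant(8))   # ~0.874, matching 8740683 / 10M above
print(p_pick_dominant(16))
print(p_pick_dominant(32))
```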

@EgorBo (Member, Author) commented Jun 9, 2023

9 to 10 slots in the 16-slot table are 'D' handles => 56.25% - 62.5%

This is not what I mean. I mean that, over many runs, in how many of them do we pick the wrong class to GDV to with N=16? With N=8? This can be simulated and the probability can be computed. It can help us make the decision of what numbers we are comfortable with for various scenarios (where this bimodal benchmark with P=0.3 is one of them).

I did this in PowerShell:

for(;;) { artifacts\bin\coreclr\windows.x64.Checked\corerun.exe ConsoleApp.dll }

and

if (ISMETHOD("Test"))
{
    FILE* f = fopen("log.csv", "a");
    fprintf(f, "%s\n", eeGetClassName(likelyClass));
    fclose(f);
}

in considerGuardedDevirtualization, and got 43% bad decisions (i.e. class 'B' was picked) across 15 runs.

With size=16: 0 bad decisions (which doesn't mean we won't make bad decisions in other cases, of course).

@EgorBo (Member, Author) commented Jun 9, 2023

So we still expect 1 out of every 40 runs of this benchmark

Ah, didn't run that long enough to see B 🙂

@AndyAyersMS (Member)

It would be interesting to plot the two-class error rate vs both table size and true likelihood, but given what we see above I would say 32 seems like a pretty good bet. Also, if we ever want to enable multi-guess we'll want more confidence in the second-best probabilities, and this entails larger sizes too. If the dominant entry takes up say 70% of the table (~22 slots at 32), that leaves an effective table size of 10 slots to establish the second-highest likelihood.

The lab also measures 3-class cases, but unfortunately not for really interesting mixtures like 50/25/25, where we would still hope to get it right most of the time.

Given our logic, we are also willing to guess for something like 30/10/10/10/10/10/10/10, where the dominant case is not the majority case.
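The 30/10/10/... mixture can be sanity-checked with a quick simulation. A Python sketch that models each of 32 slots as an independent draw from the mixture (ignoring sampling-window effects, so only indicative):

```python
import random

rng = random.Random(0)

# Mixture from the comment above: one class at 30%, seven at 10% each.
probs = [0.30] + [0.10] * 7
SIZE = 32        # the proposed table size
TRIALS = 20_000

def draw(rng, probs):
    """Sample a class index from the mixture by inverse CDF."""
    r = rng.random()
    acc = 0.0
    for cls, p in enumerate(probs):
        acc += p
        if r < acc:
            return cls
    return len(probs) - 1

plurality_wins = 0
for _ in range(TRIALS):
    counts = [0] * len(probs)
    for _ in range(SIZE):
        counts[draw(rng, probs)] += 1
    # Did the 30% class end up with a strict plurality of the slots?
    if counts[0] > max(counts[1:]):
        plurality_wins += 1

print(plurality_wins / TRIALS)
```

The printed rate estimates how often the dominant-but-not-majority class would win a strict plurality of a 32-slot table under this simplified model.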

@EgorBo (Member, Author) commented Jun 9, 2023

say 32 seems like a pretty good bet.

Agree, changed

@jakobbotsch (Member)

Should we get rid of SAMPLE_INTERVAL and the logic around that in CheckSample? Or should we increase SAMPLE_INTERVAL? The test using it is always false now.

@AndyAyersMS (Member)

Should we get rid of SAMPLE_INTERVAL and the logic around that in CheckSample? Or should we increase SAMPLE_INTERVAL? The test using it is always false now.

I would increase it, either to 64 or possibly 128.

Note: the true "reservoir" algorithm would actually use the total count for this (which we no longer have), so the likelihood of updates would steadily decrease over time.
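For context, the "true" reservoir algorithm referenced here (classic Algorithm R) tracks the total observation count n and replaces with probability k/n, so updates naturally become rarer over time. A Python sketch, not tied to the runtime code:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Classic Algorithm R: keep a uniform k-sample of a stream by
    tracking the total count n and replacing with probability k/n."""
    table = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            table.append(item)      # fill phase
        else:
            j = rng.randrange(n)    # replace with probability k/n
            if j < k:
                table[j] = item
    return table

print(reservoir_sample(range(1000), 8, random.Random(0)))
```

The per-call-site histograms drop the total count and use a fixed replacement probability instead, which is why SAMPLE_INTERVAL exists at all.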

@EgorBo (Member, Author) commented Jun 9, 2023

Should we get rid of SAMPLE_INTERVAL and the logic around that in CheckSample? Or should we increase SAMPLE_INTERVAL? The test using it is always false now.

I would increase it, either to 64 or possibly 128.

Note: the true "reservoir" algorithm would actually use the total count for this (which we no longer have), so the likelihood of updates would steadily decrease over time.

Changed to 64. Part of the problem is that the CallCount threshold being 30 might leave us with too few samples if we increase the sample window, from my understanding.

@AndyAyersMS (Member)

Should we get rid of SAMPLE_INTERVAL and the logic around that in CheckSample? Or should we increase SAMPLE_INTERVAL? The test using it is always false now.

I would increase it, either to 64 or possibly 128.
Note: the true "reservoir" algorithm would actually use the total count for this (which we no longer have), so the likelihood of updates would steadily decrease over time.

Changed to 64. Part of the problem is that the CallCount threshold being 30 might leave us with too few samples if we increase the sample window, from my understanding.

We should be OK; the sample interval only matters once the table has filled up.

@EgorBo (Member, Author) commented Jun 9, 2023

It would be interesting to plot the two-class error rate vs both table size

I did something like this; here is a more complicated sample:

using System;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;

public interface IValue
{
    public int GetValue();
}

public class Class1 : IValue
{
    public int GetValue() => 33;
}

public class Class2 : IValue
{
    public int GetValue() => 44;
}

public class Class3 : IValue
{
    public int GetValue() => 55;
}

public class Prog
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Test(IValue obj) => 
        obj.GetValue() + 
        obj.GetValue() + 
        obj.GetValue();

    static void Main()
    {
        IValue[] testData = 
            {
                // 50% Class1
                new Class1(),
                new Class1(),
                new Class1(),
                new Class1(),
                new Class1(),

                // 30% Class2
                new Class2(),
                new Class2(),
                new Class2(),

                // 20% Class3
                new Class3(),
                new Class3(),
            };

        for (int i = 0; i < 100; i++)
        {
            // Shuffle in a random order each run
            foreach (IValue obj in testData.OrderBy(_ => Random.Shared.Next()))
            {
                Test(obj);
            }
            Thread.Sleep(10);
        }
    }
}

I ran this app many times, added some logging to the logic that extracts type handles from the histograms, and got this:

[image: per-class pick-rate tables]

So there are 3 tables (one each for Class1, Class2 and Class3); the columns are the table SIZE (8, 16, 32, 64).

Yellow marks the currently proposed value for SIZE.

PS: I made a mistake in my logic, so the SIZE=64 results are slightly misleading (mostly because the tables had too many null slots due to the table size and CallCount=30).

@EgorBo (Member, Author) commented Jun 9, 2023

Overall, 32 sounds like a good choice.

@AndyAyersMS (Member)

Yeah I was going to say the data for 64 looked off, and the data for 32 might be slightly tainted too. If the table is not full then you need to "resize" it to match the actual count value.

Presumably for the 32 case when Class1 is min then Class2 is likely to be max, and so about 3% of the time we will prefer Class2, which seems acceptable given that Class2 is not a bad choice, just not the best choice.

@EgorBo (Member, Author) commented Jun 9, 2023

data for 32 might be slightly tainted too. If the table is not full then you need to "resize" it to match the actual count value.

I've just checked, and 32 looks OK: all tables were completely full, while for 64 they had 4-14 empty slots on average.

which seems acceptable given that Class2 is not a bad choice, just not the best choice.

agree

@EgorBo EgorBo merged commit b02e844 into dotnet:main Jun 10, 2023
109 checks passed
@EgorBo EgorBo deleted the bump-HandleHistogram32-size branch June 10, 2023 12:05
@dotnet dotnet locked as resolved and limited conversation to collaborators Jul 10, 2023