New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance drawback of type derived from generic #55

Closed
alexandrnikitin opened this Issue Feb 4, 2015 · 7 comments

Comments

Projects
None yet
4 participants
@alexandrnikitin
Contributor

alexandrnikitin commented Feb 4, 2015

Following the stackoverflow question.

I've encountered with one performance problem.
Let's talk code. I simplified the code as much as I could to reproduce the issue.
Suppose we have a generic class. It has an empty list inside and does something with T in constructor. It has Run method that calls an IEnumerable<T> method on the list, e.g. Any().

public class BaseClass<T>
{
    private List<T> _list = new List<T>();

    public BaseClass()
    {
        Enumerable.Empty<T>();
        // or Enumerable.Repeat(new T(), 10);
        // or even new T();
        // or foreach (var item in _list) {}
    }

    public void Run()
    {
        for (var i = 0; i < 8000000; i++)
        {
            if (_list.Any())
            // or if (_list.Count() > 0)
            // or if (_list.FirstOrDefault() != null)
            // or if (_list.SingleOrDefault() != null)
            // or other IEnumerable<T> method
            {
                return;
            }
        }
    }
}

Then we have a derived class which is empty:

public class DerivedClass : BaseClass<object>
{
}

Let's measure the performance of running ClassBase<T>.Run method from both classes. Accessing from derived type is 4 times slower that from base class. And I can't understand why that happens. Compiled in Release mode, result is the same with warm up. It happens on .NET 4.5 only.

public class Program
{
    public static void Main()
    {
        Measure(new DerivedClass());
        Measure(new BaseClass<object>());
    }

    private static void Measure(BaseClass<object> baseClass)
    {
        var sw = Stopwatch.StartNew();
        baseClass.Run();
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }
}

Full listing on gist

I've got an answer on Microsoft Connect

It is related to dictionary lookups in shared generics code. The heuristic in runtime and JIT do not work well for this particular test. We will take a look what can be done about it.

In the meantime, you can workaround it by adding two dummy methods to the BaseClass (do not even need to be called). It will cause the heuristic to work as one would expect.

@jkotas jkotas assigned jkotas and cmckinsey and unassigned jkotas Feb 4, 2015

@jkotas

This comment has been minimized.

Member

jkotas commented Feb 4, 2015

@cmckinsey - we also have TFS bug from the original report on Microsoft Connect

@alexandrnikitin

This comment has been minimized.

Contributor

alexandrnikitin commented Feb 4, 2015

@jkotas What's the status of the bug? Could you share some insights?

@jkotas

This comment has been minimized.

Member

jkotas commented Feb 4, 2015

We did not have a chance to look into the bug.

The heuristic mentioned is implemented in MethodTableBuilder::AllocAndInitMethodDescs: https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L7260

The problem is that the number of generic dictionary slots guessed by MethodTableBuilder::AllocAndInitMethodDescs is less than what the JIT eventually asks for, and forces the JIT to use slower path.

@kangaroo

This comment has been minimized.

Contributor

kangaroo commented Mar 14, 2015

@jkotas Is there an action item here? This issue is aging -- what are the next steps?

@jkotas

This comment has been minimized.

Member

jkotas commented Mar 14, 2015

The issue is assigned to @cmckinsey I expect that @cmckinsey or somebody on his team will look into it once it gets to the top of their priority list, if nobody beats them to it.

@cmckinsey

This comment has been minimized.

Contributor

cmckinsey commented Apr 2, 2015

Alexandr, thanks for the post regarding the performance observation about generics. I think there are two issues here: the first is to explain the current behavior you're observing in this program and the second is whether we can do anything to fix it.

Let me start with trying to explain the behavior. The issue arises because of sharing of the same JITTED code for generic methods on behalf of different types at runtime. When shared methods are executed any applications of generic type variables in their body have to be looked up to get the concrete runtime type. For your example it is List given it's executing the shared method BaseClass.Run. We refer to this process as runtime handle lookup. This process is the most important aspect of making shared generic code as nearly as efficient as fully specialized methods.

Because of the critical performance needs of this feature, both the JIT and runtime cooperate through a series of sophisticated techniques to reduce the overhead. The different timing behaviors you observe in the various versions of your PerformanceProblem.Program example are a reflection of the current implementation artifacts. So these artifacts might change over time as the code base evolves.

So after saying that, lets talk about how the runtime optimizes these lookups. There are essentially a series of caches to avoid the ultimately expensive lookup of types at runtime via the class loader. Without going into too much detail you can abstractly look at the lookup costs like this:

  1. Inline lookup from dictionary slot - An inline sequence of code that can lookup the exact type within a few levels of indirection (think 10 clocks for a hit). These slots are accessible off the methodtable for the object.
  2. Helper global dictionary cache lookup - A call to a helper method that looks up in a global hash table to find handle corresponding to the token. (think about 30 clocks for a hit)
  3. Type hierarchy walk with global dictionary cache lookup - Walk inheritance tree and lookup in the global hash table using the declaring type (think about 50-60 clocks for a hit)
  4. Class-loader - (think 300 clocks)

So ideally all of the handle lookup scenarios get an inline lookup off the methodtable and this costs very little and hence performance near that of specialized generic methods. But like most cache performance it comes down to determining how many slots are necessary and predicting where (in what methods) these lookups will be required. When a class is built at runtime the VM makes a guess about how many slots are needed without having actually compiled all the methods (remember this is lazy execution of code). So it currently makes an estimate of how many slots to reserve based on the number of methods in the class. Then, as the methods are actually JITTED the slots are allocated for specific lookup points in the method. So whether you hit the ideal performance characteristics of #1 above has to do with the number of methods in the class and the order in which the methods are JITTED. So in your example if you add more methods to the base class the class builder in the VM assumes those methods will have points where more runtime handle lookups are required and allocate more slots.

Lets come back to the above in a bit and talk about the JIT. Depending upon which JIT is being used it may be able to optimize away some of these artifacts and speedup the code as well. As mentioned the JIT allocates the slots as it compiles the methods. In some cases, the JIT might be able to determine that a method invocation is dead after inlining it and optimize away it's runtime handle lookup and thereby not need to allocate a dictionary slot for it. This then frees up a slot for use when compiling some later method. So in your example at least one of our JITs will inline Enumerable.Empty() in BaseClass() and remove the runtime handle lookup, thereby leaving the dictionary slot available for use when compiling Run() method with the slot used for the runtime lookup of _list.Any() call. So our 4.5 64-bit JIT will do this whereas the x86 JIT will not. For x86, if you remove the call to Enumerable.Empty from the source code this will allow the slot to be used for the _list.Any() call and should demonstrate a speedup.

There is another aspect of the loop in BaseClass.Run that impacts the performance of this example program. The runtime handle lookup for _list.Any() is invariant and can be hoisted out of the loop. The x86 JIT and the 64-bit JIT in 4.6/CoreCLR (not 64-bit 4.5) will do this. Because this transformation is so effective at mitigating the runtime handle lookup cost you will not measure a significant difference between the cost of the BaseClass and DerivedClass invocations because of this in 4.6. I had to rewrite your test example of put the loop structure up into Measure in order to block the invariant code hoisting in order to reproduce the difference on 4.6.

So the above explains why you see different performance characteristics from various kinds of minor code changes as well as on different versions of the CLR JIT for this example. I don't see any changes in our version control history to suggest that the VM's behavior has changed in the last 5 years. But it doesn't explain why the base and derived classes have different performance characteristics so here's the punch line:

For this example, because the base class has 2 methods it ends up getting 2 dictionary slots allocated, using the 2 reserved slots (preferred #1 above) to cache the lookups for "new List" and "Empty when JITTING the constructor. Obviously the constructor is JITTED before the Run method. So when going to JIT the Run method the dictionary slots have overflowed which means that _list.Any() will fall-back to a runtime helper for the lookup (#2-#4).

It is this slower lookup path that the test case is measuring and shows the difference in lookup cost between base and derived. Basically the base class case lookup is hitting #2 above and the derived class is falling back to #3. I measure a roughly 3x slowdown for the derived class. I believe there is a bug in the global dictionary cache update for the derived class that is responsible for the difference. I have made the change locally and ran some benchmarks. With this change both class lookups give the same performance. So thanks for reporting this I believe this is a real improvement for the runtime that I'll checkin after more desktop testing and code-review.

There is however still the issue that even with the above fix you get a hit in #2 (global dictionary cache) and not #1 (inline dictionary slot) for this case. While the heuristic for pre-allocating the dictionary slots could be bumped up a little bit to allocate extra slots for this case, there is a trade-off between over-allocating unused dictionary slots (bad working set) and under-allocating used dictionary slots (slow lookup). I don't think we have enough justification to change the heuristic given how much performance benchmarking we'd have to do on large server workloads to know whether changing the heuristic would lead to the right trade-offs. I expect the current heuristic works well for a large cross-section of programs using generics. Of course any single program may benefit more or less from this heuristic. We may still want to look into this in the future if it becomes a recurring source of performance problems for customers.

Hope this helps explain things. Thanks again Alexandr.

cmckinsey added a commit to cmckinsey/coreclr that referenced this issue Apr 2, 2015

CQ fix for generic handle lookup in GenericHandleWorker
When looking up the runtime handle in the generic handle cache for a methodtable
we get the declaring methodtable in the hierarchy and then lookup the handle in
the cache from that methodtable. When found we should insert the handle back into
the table using the original methodtable as the key and not the declaring MT
so that later lookups are faster. Inserting the handle back into the table under
the original key was a noop.

Desktop DDR and JITSH testing clean. The test case in Github issue dotnet#55 is now
3x faster. Benchmarked roslyn performance which shows no change in performance.

josteink added a commit to josteink/coreclr that referenced this issue Apr 3, 2015

CQ fix for generic handle lookup in GenericHandleWorker
When looking up the runtime handle in the generic handle cache for a methodtable
we get the declaring methodtable in the hierarchy and then lookup the handle in
the cache from that methodtable. When found we should insert the handle back into
the table using the original methodtable as the key and not the declaring MT
so that later lookups are faster. Inserting the handle back into the table under
the original key was a noop.

Desktop DDR and JITSH testing clean. The test case in Github issue dotnet#55 is now
3x faster. Benchmarked roslyn performance which shows no change in performance.
@alexandrnikitin

This comment has been minimized.

Contributor

alexandrnikitin commented Apr 4, 2015

@cmckinsey Thank you for the great explanation!!

josteink added a commit to josteink/coreclr that referenced this issue Apr 5, 2015

CQ fix for generic handle lookup in GenericHandleWorker
When looking up the runtime handle in the generic handle cache for a methodtable
we get the declaring methodtable in the hierarchy and then lookup the handle in
the cache from that methodtable. When found we should insert the handle back into
the table using the original methodtable as the key and not the declaring MT
so that later lookups are faster. Inserting the handle back into the table under
the original key was a noop.

Desktop DDR and JITSH testing clean. The test case in Github issue dotnet#55 is now
3x faster. Benchmarked roslyn performance which shows no change in performance.

@cmckinsey cmckinsey closed this Apr 6, 2015

ellismg pushed a commit to ellismg/coreclr that referenced this issue Apr 14, 2015

Merge pull request dotnet#55 from aspnet/alphaport
burning samples and readme to alpha build-189
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment