Reduce contention for Datacontract Serialization #73893

Daniel-Svensson · 2022-08-13T11:01:53Z

This is just #70668 which was closed after #71752 was merged giving a merge conflict.
Since the later did not really make any large improvement as explained here I am reopening the PR.

In short it improves RPS by around ~5% without contention and way over 400% (from 137 to near 600 rps) for scenario with 16 threads.

New

Old:

Se PR #70688 for more details.

If it is better to reopen the original PR then feel free to do so (I did not have the option), maybe @StephenMolloy can do it if it is better.

…ing DataContractSerializer

… typeHandle

…tion

Daniel-Svensson · 2022-08-13T11:06:32Z

...es/System.Private.DataContractSerialization/src/System/Runtime/Serialization/DataContract.cs

-                        TryCreateBuiltInDataContract(type, out dataContract);
-                        s_typeToBuiltInContract.Add(type, dataContract);
-                    }
+                    TryCreateBuiltInDataContract(key, out DataContract? dataContract);


TryCreateBuiltInDataContract seems thread safe (it only contains a switch statement and creates new DataContracts) so it should not be a problem to call it concurrently or the slim chance of creating duplicate instances

HongGit · 2023-02-02T22:33:41Z

@mconnew and @StephenMolloy, can you please take a look?

StephenMolloy · 2023-04-01T03:12:36Z

Traditionally ConcurrentDictionary has struggled to keep up with access from multiple threads. Single thread read/write looks good, but actually using the concurrency features slow things down compared to Hashtable. And although ConcurrentDictionary has made strides in recent releases, this is still the "apples to apples" case. Hashtable is just faster for multiple readers and a single writer.

However, this isn't really an "apples to apples" case. In order to use Hashtable, we have to wrap value types and provider our own comparison method. The extra overhead involved does indeed make ConcurrentDictionary faster in this case as far as I can tell.

However however... I suspect these collections are part of the problem causing #77877. So I'm going to need to re-do how this all works anyway. Trying to keep in mind the lessons learned from this performance dive.

stephentoub · 2023-04-01T11:34:16Z

Traditionally ConcurrentDictionary has struggled to keep up with access from multiple threads. Single thread read/write looks good, but actually using the concurrency features slow things down compared to Hashtable. And although ConcurrentDictionary has made strides in recent releases, this is still the "apples to apples" case. Hashtable is just faster for multiple readers and a single writer.

Can you share the benchmarks you're using that lead to these conclusions?

mconnew · 2023-04-03T19:49:26Z

Traditionally ConcurrentDictionary has struggled to keep up with access from multiple threads. Single thread read/write looks good, but actually using the concurrency features slow things down compared to Hashtable. And although ConcurrentDictionary has made strides in recent releases, this is still the "apples to apples" case. Hashtable is just faster for multiple readers and a single writer.

Can you share the benchmarks you're using that lead to these conclusions?

This information was coming from me. Many years ago on .NET Framework we replaced a lock in the DataContractSerializer with a lock only on write semantic. At the time we benchmarked different solutions and ConcurrentDictionary performed worse than the existing always lock code and the final solution of using a HashTable and only locking on write. This was in the 4.6.x time frame if my memory is correct. I don't have any of the benchmarks available now.
The dev who was working on it at the time coded it a couple of different ways and ran the .NET Framework perf benchmarks and the HashTable with lock on write was the fastest.

I did a quick visual comparison of the two implementations (.NET vs NetFx) and I can see multiple changes which would account for the performance difference. On NetFx, every read has a Volative.Read call which on mult-numa node machines can be really expensive as it causes a L1 cache write flush and L1 cache line ownership change which is negotiated across all cores. On multi-numa node machines this can be very expensive if the old owning core and the new owning core are on different numa nodes. This has been eliminated in the .NET implementation with an explanation here. As you're aware, this can cause contention problems between different CPU cores even when accessing different values if they are in the same cache line. Removal of the Volatile.Read call removes the cache flush and reduces memory contention.

A second improvement which will apply in this case is the code here. There is a cheaper code path when getting a value when TKey is a value type and no comparer has been set. The code key.GetHashCode() is going to be cheaper than the code here.

I think it's fair to say that the performance (at least for this scenario) of ConcurrentDictionary has been improved for the scenario with multiple threads reading from the ConcurrentDictionary at the same time, especially for multi-numa node machines (generally Xeon class processors and usually not what we have on our dev machines). With the improvements to ConcurrentDictionary, it looks like it's the better option now as it results in simpler code.

I also think it might be worth making another change in the code. WCF had a break when the Hot Reload feature was introduced. When code is changed, some runtime reflection classes get replaced with new instances. In the WCF case we were using a MethodInfo as the key to a dictionary to look up the SOAP operation. But after a hot reload, the comparison between MethodInfo objects from before and after the reload for the exact same method returned that they are different. Details here. Looking at RuntimeTypeHandle.Equals, it calls ReferenceEquals to compare RuntimeTypeHandle.m_type of the two instances. If the Type instance changes during a hot reload (it's possible it's only MethodInfo instances that get invalidated, I'm presuming here), then this will be fragile and result in creating a new duplicated DataContract effectively identical to the previous one. Would we be better off storing the IntPtr value from RuntimeTypeHandle.Value instead? When calling a method passing a value type, it has to copy the entire struct for each method call. We have quite a bit of indirection where we hop multiple methods before we get to the actual lookup code. If this was changed to the RuntimeTypeHandle.Value it would make the method calls cheaper too as an IntPtr can be passed by register but a RuntimeTypeHandle can't, it must be passed on the stack.

StephenMolloy · 2023-04-03T19:51:51Z

@stephentoub - Nothing fancy. Just simple single-threaded loop doing Hashtable/ConcurrentDictionary gets and adds. This single-access read/write model is the 90% case. I don't think Hashtable being faster than CD in this general scenario is controversial.

Or were you asking about the applied to this situation scenario where ConcurrentDictionary comes out faster? I didn't do anything fancy. I pretty much just took @Daniel-Svensson's simple benchmarks and added some parallelism which is supposed to be where Hashtable shines over traditional synchronized r/w collections. Instead of the original nested-for loops in his benchmark, I did something like this:

            [Params(1, 4, 16, 32)]
            public int concurrency;

            Parallel.For(0, concurrency, c =>
            {
                int l = _typeHandles.Length;
                int s = (l * c) / concurrency;
                for (int i = 0; i < l; ++i)
                    for (int j = 0; j <= i; ++j)
                        sum += GetId_Implementation_Of_Interest(_typeHandles[(j+s)%l]);
            });

Again though, nothing fancy or sophisticated. All just playing around on my virtual 4-core dev box to get an idea of what I was looking at. I ran against on the .Net 6, 7, and 8 runtimes with the various 'GetId()' implementations. The .Net6 numbers were still in favor of CD over HT, but they were closer to a wash. CD performance has gotten better though, so it's a pretty noticeable difference in my basic exploration. The real difference is not having to mess around with the 'TypeHandleRef' class et al.

StephenMolloy · 2023-04-03T19:54:40Z

Or perhaps you were asking for something more along the lines of @mconnew remarks he was apparently typing up at the same time. ;)

Either way, I believe this code is part of the problem with DCS and AssemblyLoadContext. So we'll be doing something different here anyway.

Daniel-Svensson · 2023-05-07T08:53:45Z

I have applied the suggestion from @mconnew and changed the key from RuntimeTypeHandle to RuntimeTypeHandle.Value.
I had looked at the mono source code version and that just implemented Equals and GetHashCode using .Value.

If you prefer the code can be changed to keep RuntimeTypeHandle as key and use a custom comparer instead, the performance difference will be small.

The proposed code should hopefully be both simpler and much faster that the existing code so maybe you can consider merging it even if you might make changes later on? It should hopefully make any future work easier to do rather than harder.

Daniel-Svensson · 2023-05-29T13:22:50Z

Or were you asking about the applied to this situation scenario where ConcurrentDictionary comes out faster? I didn't do anything fancy. I pretty much just took @Daniel-Svensson's simple benchmarks and added some parallelism which is supposed to be where Hashtable shines over traditional synchronized r/w collections. Instead of the original nested-for loops in his benchmark, I did something like this:
            [Params(1, 4, 16, 32)]
            public int concurrency;

            Parallel.For(0, concurrency, c =>
            {
                int l = _typeHandles.Length;
                int s = (l * c) / concurrency;
                for (int i = 0; i < l; ++i)
                    for (int j = 0; j <= i; ++j)
                        sum += GetId_Implementation_Of_Interest(_typeHandles[(j+s)%l]);
            });
Again though, nothing fancy or sophisticated. All just playing around on my virtual 4-core dev box to get an idea of what I was looking at. I ran against on the .Net 6, 7, and 8 runtimes with the various 'GetId()' implementations. The .Net6 numbers were still in favor of CD over HT, but they were closer to a wash. CD performance has gotten better though, so it's a pretty noticeable difference in my basic exploration.

@StephenMolloy you are not benchmarkning what you think you are. Your example will run "concurrency" number of iterations of the loop code. And it will use at most as many threads as your CPU (4).
To actually run on more threads you are better of

creating N threads manually
Wait for all threads to be started
Have all threads wait for a semaphoreslim, barrier or similar and ensure that they are released all at the same time.
- make sure that the inner loop takes enough time (maybe seconds) for any concurrency issues to be more easily detected, it must be much larger than syncronisation overhead of an empty loop.
  (4. Measure and report time for each thread as well as total time)

You can also use the full end to end benchmark originally used, it needs to be started with the "/server" parameter and you also need a load test application such as nettling.

While a micro benchmark might be good to more quickly test and iterate on a solution I think the most important measure are how it might impact real world scenarios and in this case it was more than 400% improvement in RPS. if you manage to make a realistic micro benchmark you should expect the improvement to be larger than that.

The real difference is not having to mess around with the 'TypeHandleRef' class et al.

That not only gave cleaner, easier code but also part of what makes the solution faster.
Hopefully the benchmark is enough to show that the single threaded performance is better, and the load test result to prove that there is a large multithreaded performance win apart from simpler code.

I really hope that is enough to get a review of the code

mconnew · 2023-05-30T19:34:43Z

This can be overcome by passing an instance of ParallelOptions before the delegate setting MaxDegreeOfParallelism to concurrency.
Also I don't know which dev machine @StephenMolloy used, but my current dev machine has 10 physical cores (20 cores with hyperthreading) so all but the 32 concurrency test variant will have had sufficient CPU core count to not be limited by that if testing on my own work dev box.

StephenMolloy · 2023-05-30T19:40:25Z

Just chiming in with a reminder again... this is an interesting and informative discussion, so the PR is still open for the moment. But it's not likely that we merge this because I believe the code involved here is the cause of #77877. We will likely be making a different change here. But I do want that change to be informed by this discussion.

…ss that we no longer need.

StephenMolloy · 2023-08-11T19:06:15Z

It seems like this does not interfere with fixing DCS/ALC after all. Pushing an additional commit to apply the same logic to DCJS and removing some of the supporting code for structs that are no longer used.

Daniel-Svensson · 2023-08-12T09:37:26Z

Thank you for comin back and merging this @StephenMolloy

If you want to benchmark concurrent code in the future you can do something similar to https://github.com/Daniel-Svensson/DataContractBench/blob/main/Program.cs

Example output for this specific case for mutex+dictionary vs Concurrent Dictionary https://github.com/Daniel-Svensson/DataContractBench/blob/main/Results.csv showing the old mutex approach having less than halved throughput with 4 threads while concurrent dictionary instead is 4 times as fast.

Daniel-Svensson · 2023-09-11T20:38:56Z

I reran my benchmarks for 20 000 objects comparing net7 and net8 and the results looks really promising ~460% with 16 threaded load and slightly more with 320 thread loading on my 8 core Ryzen 5800X with much lower latencies.

Net7.0.10 vs 8.0.0-rc.2.23423.11 (daily build)

Thank you @StephenMolloy @stephentoub and @mconnew for your time with this

(16 threads, 1 object request only shows 8% performance increase so the rest should mostly be improvements in serialization)

Daniel-Svensson and others added 9 commits June 13, 2022 15:37

Use ConcurrentDictionary to avoid lock contention when serializing us…

c8f55d8

…ing DataContractSerializer

Improve concurrency for GetId

b2ddfbb

Prevent allocating multiple slots in s_dataContractCache for the same…

aea00eb

… typeHandle

Use RuntimeTypeHandle as key in dictionary

4b91a91

Merge remote-tracking branch 'upstream/main' into datacontract_conten…

03a2886

…tion

add readonly

c16871c

Remove usage of lazy

1a986b8

Merge remote-tracking branch 'upstream/main' into datacontract_conten…

7f866e0

…tion

revert whitespace changes

9785dec

dotnet-issue-labeler bot added the area-Serialization label Aug 13, 2022

ghost added the community-contribution Indicates that the PR has been added by a community member label Aug 13, 2022

ghost added this to Open PRs in WCF Owned Areas Aug 13, 2022

Daniel-Svensson commented Aug 13, 2022

View reviewed changes

Merge branch 'main' into datacontract_contention

a425847

build-analysis bot mentioned this pull request Mar 24, 2023

WasmTestOnBrowser-System.* test failures in CI #83655

Closed

Use RuntimeTypeHandle.Value as key instead of RuntimeTypeHandle

3e105e1

Apply same perf improvement to JDCS; Remove Int/TypeHandleRef silline…

481f0a1

…ss that we no longer need.

StephenMolloy approved these changes Aug 11, 2023

View reviewed changes

StephenMolloy merged commit 21f3bbe into dotnet:main Aug 11, 2023
100 of 103 checks passed

WCF Owned Areas automation moved this from Open PRs to Completed Aug 11, 2023

StephenMolloy mentioned this pull request Aug 11, 2023

DCS work nicely with collectible AssemblyLoadContext #90437

Merged

build-analysis bot mentioned this pull request Aug 11, 2023

Tracking issue for CI build timeouts #76454

Closed

Daniel-Svensson deleted the datacontract_contention branch August 15, 2023 09:20

dotnet locked as resolved and limited conversation to collaborators Oct 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce contention for Datacontract Serialization #73893

Reduce contention for Datacontract Serialization #73893

Daniel-Svensson commented Aug 13, 2022 •

edited

Daniel-Svensson Aug 13, 2022

HongGit commented Feb 2, 2023

StephenMolloy commented Apr 1, 2023

stephentoub commented Apr 1, 2023

mconnew commented Apr 3, 2023

StephenMolloy commented Apr 3, 2023

StephenMolloy commented Apr 3, 2023

Daniel-Svensson commented May 7, 2023 •

edited

Daniel-Svensson commented May 29, 2023 •

edited

mconnew commented May 30, 2023

StephenMolloy commented May 30, 2023

StephenMolloy commented Aug 11, 2023

Daniel-Svensson commented Aug 12, 2023

Daniel-Svensson commented Sep 11, 2023

Reduce contention for Datacontract Serialization #73893

Reduce contention for Datacontract Serialization #73893

Conversation

Daniel-Svensson commented Aug 13, 2022 • edited

Daniel-Svensson Aug 13, 2022

Choose a reason for hiding this comment

HongGit commented Feb 2, 2023

StephenMolloy commented Apr 1, 2023

stephentoub commented Apr 1, 2023

mconnew commented Apr 3, 2023

StephenMolloy commented Apr 3, 2023

StephenMolloy commented Apr 3, 2023

Daniel-Svensson commented May 7, 2023 • edited

Daniel-Svensson commented May 29, 2023 • edited

mconnew commented May 30, 2023

StephenMolloy commented May 30, 2023

StephenMolloy commented Aug 11, 2023

Daniel-Svensson commented Aug 12, 2023

Daniel-Svensson commented Sep 11, 2023

Daniel-Svensson commented Aug 13, 2022 •

edited

Daniel-Svensson commented May 7, 2023 •

edited

Daniel-Svensson commented May 29, 2023 •

edited