Cached type shallow copy availability #1586

Merged

Conversation

dVakulen
Contributor

Addresses #1305,

Checking whether a type can be shallow-copied takes a considerable amount of CPU:
(profiler screenshot)

Benchmark, 3 000 000 repeats:

BenchmarkDotNet=v0.9.1.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz, ProcessorCount=4
Frequency=2435776 ticks, Resolution=410.5468 ns
HostCLR=MS.NET 4.0.30319.34209, Arch=32-bit RELEASE

Type=ShallowCopyableBench  Mode=Throughput  LaunchCount=1  
WarmupCount=1  TargetCount=3  
| Method                        | Median         | StdDev        | ThreadsCount |
|-------------------------------|---------------:|--------------:|-------------:|
| CachedClass                   | 106.5624 ms    | 3.1811 ms     | 1            |
| CachedDecimal                 | 106.4367 ms    | 1.8021 ms     | 1            |
| CachedImmutable               | 104.3393 ms    | 1.7869 ms     | 1            |
| CachedShallowCopyableClass    | 102.3058 ms    | 37.8404 ms    | 1            |
| CachedValueType               | 102.6185 ms    | 25.0215 ms    | 1            |
| NonCachedClass                | 3,847.8685 ms  | 116.0603 ms   | 1            |
| NonCachedDecimal              | 409.7000 ms    | 7.1004 ms     | 1            |
| NonCachedImmutable            | 3,624.7472 ms  | 130.7426 ms   | 1            |
| NonCachedShallowCopyableClass | 6,778.8531 ms  | 433.8992 ms   | 1            |
| NonCachedValueType            | 3,497.3040 ms  | 19.6930 ms    | 1            |
| CachedClass                   | 65.8086 ms     | 4.0328 ms     | 3            |
| CachedDecimal                 | 61.1600 ms     | 2.6875 ms     | 3            |
| CachedImmutable               | 62.1187 ms     | 1.4798 ms     | 3            |
| CachedShallowCopyableClass    | 63.8821 ms     | 1.1557 ms     | 3            |
| CachedValueType               | 61.7941 ms     | 0.1123 ms     | 3            |
| NonCachedClass                | 2,007.5797 ms  | 30.7070 ms    | 3            |
| NonCachedDecimal              | 157.9320 ms    | 1.2766 ms     | 3            |
| NonCachedImmutable            | 2,000.8525 ms  | 173.4787 ms   | 3            |
| NonCachedShallowCopyableClass | 3,852.3215 ms  | 36.5141 ms    | 3            |
| NonCachedValueType            | 1,912.7173 ms  | 21.3129 ms    | 3            |

Benchmark source code:
https://gist.github.com/dVakulen/f34be5ba5024d008d3d2#file-orleansshallowcopyablebenchmark

Expected performance improvement: 3x for the best case (multithreaded checking of a primitive), 67x for the worst one (checking of a class with the ImmutableAttribute).

Note that for lightweight classes, making a full copy may actually be faster than checking for the existence of the ImmutableAttribute.
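
For reference, here is a minimal sketch of the caching idea described above (class and member names are illustrative, not the exact code in this PR). Once a type has been seen, every later check is a single dictionary lookup instead of a chain of reflection calls:

```csharp
using System;
using System.Collections.Concurrent;
using System.Reflection;

// Minimal sketch of the caching approach (illustrative names, not the PR's exact code).
static class ShallowCopyableCache
{
    // One cached answer per Type; ConcurrentDictionary gives mostly lock-free reads once populated.
    private static readonly ConcurrentDictionary<Type, bool> cache =
        new ConcurrentDictionary<Type, bool>();

    public static bool IsShallowCopyable(Type t)
    {
        // The expensive reflection-based check runs at most once per type.
        return cache.GetOrAdd(t, ComputeIsShallowCopyable);
    }

    private static bool ComputeIsShallowCopyable(Type t)
    {
        var typeInfo = t.GetTypeInfo();
        if (typeInfo.IsPrimitive || typeInfo.IsEnum) return true;
        if (t == typeof(string) || t == typeof(DateTime) || t == typeof(decimal)) return true;
        // Further checks (the Immutable<> wrapper, the [Immutable] attribute, value types whose
        // fields are all shallow-copyable) would follow here, as in the original method.
        return false;
    }
}
```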

@dVakulen dVakulen changed the title from "Cached type shallow copy ability" to "Cached type shallow copy availability" on Mar 19, 2016
@dVakulen dVakulen force-pushed the shallow-copyable-perf-improvement branch from f115b09 to 6baf0ce on March 19, 2016 at 07:47
static readonly Dictionary<RuntimeTypeHandle, string> typeNameCache = new Dictionary<RuntimeTypeHandle, string>();
static readonly Dictionary<RuntimeTypeHandle, string> typeKeyStringCache = new Dictionary<RuntimeTypeHandle, string>();
static readonly Dictionary<RuntimeTypeHandle, byte[]> typeKeyCache = new Dictionary<RuntimeTypeHandle, byte[]>();

static TypeUtilities()
{
shallowCopyableValueTypes[typeof(Decimal).TypeHandle] = true;
Contributor Author

RuntimeTypeHandle uses the underlying system type in its GetHashCode() method.

Contributor

Sorry, I don't understand this comment. Do you mean that using typeof() instead of typeof().TypeHandle as a key here is more efficient? I don't understand why.

Contributor Author

Yes, calling GetHashCode() on a RuntimeTypeHandle is ~10% slower than on the Type, and the cost of accessing the TypeHandle property adds an additional 20%.
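
To illustrate the two keying options being compared, here is a sketch with made-up names (not code from this PR): keying by RuntimeTypeHandle pays for the extra TypeHandle property access on every lookup, while keying by Type hashes the Type object directly.

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative comparison of the two dictionary key choices (not the PR's exact code).
static class KeyChoiceSketch
{
    // Keyed by RuntimeTypeHandle: every lookup first reads t.TypeHandle, and
    // RuntimeTypeHandle.GetHashCode() resolves through the underlying Type anyway.
    private static readonly ConcurrentDictionary<RuntimeTypeHandle, bool> byHandle =
        new ConcurrentDictionary<RuntimeTypeHandle, bool>();

    // Keyed by Type: no extra property access, the Type object is hashed directly.
    private static readonly ConcurrentDictionary<Type, bool> byType =
        new ConcurrentDictionary<Type, bool>();

    public static bool LookupByHandle(Type t) =>
        byHandle.TryGetValue(t.TypeHandle, out bool v) && v;

    public static bool LookupByType(Type t) =>
        byType.TryGetValue(t, out bool v) && v;
}
```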

@gabikliot
Contributor

I like it.
I would not expect the price of GetTypeInfo() and IsPrimitive to be larger than a dictionary lookup, but apparently, based on your benchmark, it is.

One thing to try is going back to a dictionary instead of a concurrent dictionary: keep everything you did (cache primitives dynamically), but use a regular dictionary with a lock.
My intuition, which might very well be wrong, is that this might be even better: very soon we stop adding to the dictionary, and a get from a regular dictionary plus a fast lock check might be faster than a get from a concurrent dictionary. Worth checking.

Alternatively, in addition to what you did with the ordering, we could have a smarter two-dictionary approach: a fast one without a lock and a second one under a lock, and periodically move all items from the second one into the first.
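
For reference, a sketch of the "regular dictionary with a lock" variant from the first suggestion (illustrative names, not code from this PR). The point to note is that the lock is taken on every read, not just on the rare writes:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the Dictionary-plus-lock alternative (illustrative, not PR code).
static class LockedCacheSketch
{
    private static readonly Dictionary<Type, bool> cache = new Dictionary<Type, bool>();
    private static readonly object sync = new object();

    public static bool IsShallowCopyable(Type t, Func<Type, bool> compute)
    {
        lock (sync) // taken on every read as well as on every write
        {
            bool result;
            if (!cache.TryGetValue(t, out result))
            {
                result = compute(t);
                cache[t] = result;
            }
            return result;
        }
    }
}
```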

@dVakulen
Contributor Author

I've already tried simple dictionary implementations with locks and with ReaderWriterLockSlim; both were 2-3 times slower than the concurrent dictionary for the single-threaded scenario and up to 7 times slower for the multithreaded one.

@gabikliot
Contributor

Really? You tried exactly the same code as in this PR, once with a concurrent dictionary and once with a regular dictionary with a lock, and the regular dictionary with a lock was 2-3 times slower in single-threaded execution?!? Interesting. OK.
Can you please compare a regular dictionary without a lock vs. a concurrent dictionary in single-threaded execution, just to be sure? It must be faster, right?

@dVakulen
Contributor Author

Yes, the regular dictionary with locks was 2 times slower in single-threaded execution than the concurrent one, perhaps because of the lock on each read, while the concurrent one implements almost lock-free reads. As for the difference between a simple dictionary without locks and ConcurrentDictionary: for the single-threaded scenario the difference is less than 7%, and for the multithreaded one, 3%.

@gabikliot
Contributor

Multithreaded without locks is incorrect, so it makes no sense, right?

@dVakulen
Contributor Author

If the dictionary content doesn't change, it can be accessed concurrently without locks.

@gabikliot
Contributor

OK, this change looks legit to me. You would still want to do a perf. check before merging @sergeybykov, to be sure.

@dVakulen
Contributor Author

Also, the price of the GetTypeInfo() call is 16 ms per 1 million iterations on an i5-4200M CPU @ 2.50GHz. For CoreCLR compatibility, many such calls will have to be added.
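
For illustration, a number like that could be obtained with a BenchmarkDotNet micro-benchmark along these lines (a rough sketch, not the gist linked above):

```csharp
using System;
using System.Reflection;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Rough sketch of measuring the cost of GetTypeInfo() (illustrative, not the linked gist).
public class GetTypeInfoBench
{
    private readonly Type type = typeof(string);

    [Benchmark]
    public TypeInfo GetTypeInfoCall() => type.GetTypeInfo();
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<GetTypeInfoBench>();
}
```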

@gabikliot
Contributor

16 ms for a million operations seems very, very low to me. I wonder if optimizing that is even worth our time.

@jdom
Member

jdom commented Mar 22, 2016

I like it. Although, is there any reason to have this new one use ConcurrentDictionary and Type, but keep all the others as Dictionary using RuntimeTypeHandle?

I would expect to have consistency among them, unless there's a specific reason not to.

I do agree that having immutable dictionaries (not actually using ImmutableDictionary, but never mutating the dictionary that is assigned to the field: just cloning, adding, and then switching it) could be better than even using ConcurrentDictionary, since we expect that after a little warm-up the dictionaries will never be updated again.
I still think that ConcurrentDictionary is much better than reading using locks.
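
For clarity, here is a sketch of the "never mutate, clone-add-switch" idea being described (illustrative names, not code from this PR). Readers only ever see a fully built dictionary, so reads need no lock at all; writers clone the current dictionary, add the new entry, and then publish the new instance:

```csharp
using System;
using System.Collections.Generic;

// Sketch of a copy-on-write cache (illustrative, not PR code).
static class CopyOnWriteCacheSketch
{
    // volatile so that a newly published dictionary becomes visible to all readers.
    private static volatile Dictionary<Type, bool> cache = new Dictionary<Type, bool>();
    private static readonly object writeLock = new object();

    public static bool IsShallowCopyable(Type t, Func<Type, bool> compute)
    {
        bool result;
        if (cache.TryGetValue(t, out result)) return result; // lock-free read

        result = compute(t);
        lock (writeLock) // writers clone, add, then switch the field
        {
            var copy = new Dictionary<Type, bool>(cache) { [t] = result };
            cache = copy;
        }
        return result;
    }
}
```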

@dVakulen dVakulen force-pushed the shallow-copyable-perf-improvement branch from 8583f1a to 45d47c4 on March 23, 2016 at 03:05
@dVakulen
Contributor Author

Updated for consistency among the cache collections.

@sergeybykov
Contributor

I'm running perf tests to confirm that there is no perf regression here. Looking at the code I can't imagine there would be.

@sergeybykov
Contributor

Interestingly enough, I see no perf impact on one of the ping tests, but a significant increase in throughput in the other one.

PingLoadTest_LocalReentrant: 58,984 vs. 58,416 last night
PingLoadTest_RandomReentrant_MultiSilos: 445,613 vs. 352,822 - a 26% jump.

Maybe @gabikliot has an idea how this is possible. I'm dumbfounded so far.

@dVakulen
Contributor Author

@sergeybykov I'd suggest rerunning those tests, but if you believe in their reliability, then I don't have an explanation either.

@sergeybykov
Contributor

@dVakulen I will rerun them. However, the numbers on these tests have been very stable. I looked several days back. That's why I'm puzzled by the sudden jump. It's a good thing, of course. Just uncomfortable that I don't understand how this is possible.

@dVakulen
Contributor Author

@sergeybykov To which builds do these two results correspond?

@sergeybykov
Contributor

Last night it was c427526. What I'm running now is 45d47c4 merged on top of dfadfd4.

@jdom
Member

jdom commented Mar 23, 2016

@sergeybykov, not sure how big the jump is, but I was expecting an increase in throughput. We were locking, and even without contention, locking is very costly. Nevertheless, with many cores, I would expect this lock to normally be contended for reads, depending on the scenario.

@sergeybykov
Contributor

NightlyLoadTest: 175,165 vs. 166,106 last night - ~5% gain.

@jdom I'm not as much surprised by the gain as by the difference between PingLoadTest_LocalReentrant and PingLoadTest_RandomReentrant_MultiSilos.

@sergeybykov
Contributor

ActivationCollectorStressTest: 205,393 vs. 189,404 = ~8% gain.

I think I'll merge it now because there is clearly no perf regression, and will rerun the tests against master.

@sergeybykov sergeybykov merged commit 924e6cb into dotnet:master Mar 23, 2016
@sergeybykov
Contributor

@dVakulen Obviously, big thanks! IIRC this is the highest perf gain from a single PR!

@dVakulen dVakulen deleted the shallow-copyable-perf-improvement branch March 23, 2016 22:16
@dVakulen
Contributor Author

@sergeybykov Glad to hear.

@gabikliot
Contributor

I think I understand the diff between PingLoadTest_LocalReentrant and
PingLoadTest_RandomReentrant_MultiSilos.
The first one only sends a msg from client to silo, and the silo responds. There is zero serialization and deep copy on the silo, since the replies in this particular case are all Task (no return value). So the code he changed had no impact in the one-silo case.
The 2nd one sends from silo to silo, so the silo is doing lots of serialization and deep-copy attempts.

The other question is why this improvement was so big. I would speculate that it is not so much due to ConcurrentDictionary vs. Dictionary plus lock, but more due to a direct check in the dictionary vs. walking the whole `if (typeInfo.IsPrimitive || typeInfo.IsEnum || t == typeof(string) || t == typeof(DateTime) || t == typeof(Decimal) || t == typeof(Immutable<>))` chain.

If I remember correctly, the PingLoadTest sends Immutable<byte[]>, so in the old code this was a heavy check (who could have predicted?!?), while in the new code it is a quick dictionary check.
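
For illustration, a simplified version of the kind of uncached, per-call check being described (paraphrased, not the repo's exact method). Immutable<byte[]> only matches after all the cheaper branches have failed, and the whole chain runs again on every call:

```csharp
using System;
using System.Reflection;
using Orleans.Concurrency; // Orleans' Immutable<T> wrapper

// Paraphrase of the old, uncached check (simplified; the real method has more branches).
static class UncachedCheckSketch
{
    public static bool IsShallowCopyableUncached(Type t)
    {
        var typeInfo = t.GetTypeInfo();                 // reflection work on every invocation
        if (typeInfo.IsPrimitive || typeInfo.IsEnum) return true;
        if (t == typeof(string) || t == typeof(DateTime) || t == typeof(decimal)) return true;
        // Immutable<T> is reached last, after all cheaper checks have failed.
        if (typeInfo.IsGenericType && t.GetGenericTypeDefinition() == typeof(Immutable<>)) return true;
        return false;
    }
}
```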

Great job @dVakulen !

@sergeybykov
Contributor

@gabikliot Makes sense.

@gabikliot
Contributor

Basically, the takeaway, I think, is that operations on TypeInfo and Type are not as cheap as we thought, so it might make sense to look elsewhere in the stack, mostly in serialization, and try to cache/optimize/reuse.

@dVakulen dVakulen mentioned this pull request Sep 5, 2016
@github-actions github-actions bot locked and limited conversation to collaborators Dec 13, 2023