Cached type shallow copy availability #1586

Merged

Conversation

dVakulen
Contributor

Addresses #1305,

Checking whether a type can be shallow-copied takes a considerable amount of CPU:
(profiler screenshot)

Benchmark, 3 000 000 repeats:

BenchmarkDotNet=v0.9.1.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz, ProcessorCount=4
Frequency=2435776 ticks, Resolution=410.5468 ns
HostCLR=MS.NET 4.0.30319.34209, Arch=32-bit RELEASE

Type=ShallowCopyableBench  Mode=Throughput  LaunchCount=1  
WarmupCount=1  TargetCount=3  
| Method                        | Median         | StdDev        | ThreadsCount |
|-------------------------------|---------------:|--------------:|-------------:|
| CachedClass                   | 106.5624 ms    | 3.1811 ms     | 1            |
| CachedDecimal                 | 106.4367 ms    | 1.8021 ms     | 1            |
| CachedImmutable               | 104.3393 ms    | 1.7869 ms     | 1            |
| CachedShallowCopyableClass    | 102.3058 ms    | 37.8404 ms    | 1            |
| CachedValueType               | 102.6185 ms    | 25.0215 ms    | 1            |
| NonCachedClass                | 3,847.8685 ms  | 116.0603 ms   | 1            |
| NonCachedDecimal              | 409.7000 ms    | 7.1004 ms     | 1            |
| NonCachedImmutable            | 3,624.7472 ms  | 130.7426 ms   | 1            |
| NonCachedShallowCopyableClass | 6,778.8531 ms  | 433.8992 ms   | 1            |
| NonCachedValueType            | 3,497.3040 ms  | 19.6930 ms    | 1            |
| CachedClass                   | 65.8086 ms     | 4.0328 ms     | 3            |
| CachedDecimal                 | 61.1600 ms     | 2.6875 ms     | 3            |
| CachedImmutable               | 62.1187 ms     | 1.4798 ms     | 3            |
| CachedShallowCopyableClass    | 63.8821 ms     | 1.1557 ms     | 3            |
| CachedValueType               | 61.7941 ms     | 0.1123 ms     | 3            |
| NonCachedClass                | 2,007.5797 ms  | 30.7070 ms    | 3            |
| NonCachedDecimal              | 157.9320 ms    | 1.2766 ms     | 3            |
| NonCachedImmutable            | 2,000.8525 ms  | 173.4787 ms   | 3            |
| NonCachedShallowCopyableClass | 3,852.3215 ms  | 36.5141 ms    | 3            |
| NonCachedValueType            | 1,912.7173 ms  | 21.3129 ms    | 3            |

Benchmark source code:
https://gist.github.com/dVakulen/f34be5ba5024d008d3d2#file-orleansshallowcopyablebenchmark

Expected performance improvement: 3x for the best case (multithreaded checking of a primitive), 67x for the worst one (checking of a class with the ImmutableAttribute).

Note that for lightweight classes, making a full copy may actually be faster than checking for the existence of the ImmutableAttribute.
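
For reference, here is a minimal sketch of the caching idea described above (class and member names are illustrative, not the exact code in this PR). Once a type has been seen, every later check is a single dictionary lookup instead of a chain of reflection calls:

```csharp
using System;
using System.Collections.Concurrent;
using System.Reflection;

// Minimal sketch of the caching approach (illustrative names, not the PR's exact code).
static class ShallowCopyableCache
{
    // One cached answer per Type; ConcurrentDictionary gives mostly lock-free reads once populated.
    private static readonly ConcurrentDictionary<Type, bool> cache =
        new ConcurrentDictionary<Type, bool>();

    public static bool IsShallowCopyable(Type t)
    {
        // The expensive reflection-based check runs at most once per type.
        return cache.GetOrAdd(t, ComputeIsShallowCopyable);
    }

    private static bool ComputeIsShallowCopyable(Type t)
    {
        var typeInfo = t.GetTypeInfo();
        if (typeInfo.IsPrimitive || typeInfo.IsEnum) return true;
        if (t == typeof(string) || t == typeof(DateTime) || t == typeof(decimal)) return true;
        // Further checks (the Immutable<> wrapper, the [Immutable] attribute, value types whose
        // fields are all shallow-copyable) would follow here, as in the original method.
        return false;
    }
}
```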

@dVakulen dVakulen changed the title from "Cached type shallow copy ability" to "Cached type shallow copy availability" on Mar 19, 2016
@dVakulen dVakulen force-pushed the shallow-copyable-perf-improvement branch from f115b09 to 6baf0ce on March 19, 2016 at 07:47
static readonly Dictionary<RuntimeTypeHandle, string> typeNameCache = new Dictionary<RuntimeTypeHandle, string>();
static readonly Dictionary<RuntimeTypeHandle, string> typeKeyStringCache = new Dictionary<RuntimeTypeHandle, string>();
static readonly Dictionary<RuntimeTypeHandle, byte[]> typeKeyCache = new Dictionary<RuntimeTypeHandle, byte[]>();

static TypeUtilities()
{
shallowCopyableValueTypes[typeof(Decimal).TypeHandle] = true;
Contributor Author

RuntimeTypeHandle uses the underlying system type in its GetHashCode() method.

Contributor

Sorry, I don't understand this comment. Do you mean that using typeof() instead of typeof().TypeHandle as a key here is more efficient? I don't understand why.

Contributor Author

Yes, calling GetHashCode() on a RuntimeTypeHandle is ~10% slower than on the Type, and the cost of accessing the TypeHandle property adds an additional 20%.
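
To illustrate the two keying options being compared, here is a sketch with made-up names (not code from this PR): keying by RuntimeTypeHandle pays for the extra TypeHandle property access on every lookup, while keying by Type hashes the Type object directly.

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative comparison of the two dictionary key choices (not the PR's exact code).
static class KeyChoiceSketch
{
    // Keyed by RuntimeTypeHandle: every lookup first reads t.TypeHandle, and
    // RuntimeTypeHandle.GetHashCode() resolves through the underlying Type anyway.
    private static readonly ConcurrentDictionary<RuntimeTypeHandle, bool> byHandle =
        new ConcurrentDictionary<RuntimeTypeHandle, bool>();

    // Keyed by Type: no extra property access, the Type object is hashed directly.
    private static readonly ConcurrentDictionary<Type, bool> byType =
        new ConcurrentDictionary<Type, bool>();

    public static bool LookupByHandle(Type t) =>
        byHandle.TryGetValue(t.TypeHandle, out bool v) && v;

    public static bool LookupByType(Type t) =>
        byType.TryGetValue(t, out bool v) && v;
}
```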

@gabikliot
Contributor

I like it.
I would not expect the price of GetTypeInfo() and IsPrimitive to be larger than a dictionary lookup, but apparently, based on your benchmark, it is.

One thing to try is going back to a dictionary instead of a concurrent dictionary: keep everything you did (cache primitives dynamically), but use a regular dictionary with a lock.
My intuition, which might very well be wrong, is that this might be even better: very soon we stop adding to the dictionary, and a get from a regular dictionary plus a fast lock check might be faster than a get from a concurrent dictionary. Worth checking.

Alternatively, in addition to what you did with the ordering, we could have a smarter two-dictionary approach: a fast one without a lock and a second one under a lock, and periodically move all items from the second one into the first.
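
For reference, a sketch of the "regular dictionary with a lock" variant from the first suggestion (illustrative names, not code from this PR). The point to note is that the lock is taken on every read, not just on the rare writes:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the Dictionary-plus-lock alternative (illustrative, not PR code).
static class LockedCacheSketch
{
    private static readonly Dictionary<Type, bool> cache = new Dictionary<Type, bool>();
    private static readonly object sync = new object();

    public static bool IsShallowCopyable(Type t, Func<Type, bool> compute)
    {
        lock (sync) // taken on every read as well as on every write
        {
            bool result;
            if (!cache.TryGetValue(t, out result))
            {
                result = compute(t);
                cache[t] = result;
            }
            return result;
        }
    }
}
```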

@dVakulen
Contributor Author

I've already tried simple dictionary implementations with locks and with ReaderWriterLockSlim; both were 2-3 times slower than the concurrent dictionary for the single-threaded scenario and up to 7 times slower for the multithreaded one.

@gabikliot
Contributor

Really? You tried exactly the same code as in this PR, once with a concurrent dictionary and once with a regular dictionary with a lock, and the regular dictionary with a lock was 2-3 times slower in single-threaded execution?!? Interesting. OK.
Can you please compare a regular dictionary without a lock vs. a concurrent dictionary in single-threaded execution, just to be sure? It must be faster, right?

@dVakulen
Contributor Author

Yes, the regular dictionary with locks was 2 times slower in single-threaded execution than the concurrent one, perhaps because of the lock on each read, while the concurrent one implements almost lock-free reads. As for the difference between a simple dictionary without locks and ConcurrentDictionary: for the single-threaded scenario the difference is less than 7%, and for the multithreaded one, 3%.

@gabikliot
Contributor

Multithreaded without locks is incorrect, so it makes no sense, right?

@dVakulen
Contributor Author

If the dictionary content doesn't change, it can be accessed concurrently without locks.

@gabikliot
Contributor

OK, this change looks legit to me. You would still want to do a perf. check before merging @sergeybykov, to be sure.

@dVakulen
Contributor Author

Also, the price of the GetTypeInfo() call is 16 ms per 1 million iterations on an i5-4200M CPU @ 2.50GHz. For CoreCLR compatibility, many such calls will have to be added.
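
For illustration, a number like that could be obtained with a BenchmarkDotNet micro-benchmark along these lines (a rough sketch, not the gist linked above):

```csharp
using System;
using System.Reflection;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Rough sketch of measuring the cost of GetTypeInfo() (illustrative, not the linked gist).
public class GetTypeInfoBench
{
    private readonly Type type = typeof(string);

    [Benchmark]
    public TypeInfo GetTypeInfoCall() => type.GetTypeInfo();
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<GetTypeInfoBench>();
}
```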

@gabikliot
Contributor

16 ms for a million operations seems very, very low to me. I wonder if optimizing that is even worth our time.

@jdom
Member

jdom commented Mar 22, 2016

I like it. Although, is there any reason to have this new one use ConcurrentDictionary and Type, but keep all the others as Dictionary using RuntimeTypeHandle?

I would expect to have consistency among them, unless there's a specific reason not to.

I do agree that having immutable dictionaries (not actually using ImmutableDictionary, but never mutating the dictionary that is assigned to the field: just cloning, adding, and then switching it) could be better than even using ConcurrentDictionary, since we expect that after a little warm-up the dictionaries will never be updated again.
I still think that ConcurrentDictionary is much better than reading using locks.
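
For clarity, here is a sketch of the "never mutate, clone-add-switch" idea being described (illustrative names, not code from this PR). Readers only ever see a fully built dictionary, so reads need no lock at all; writers clone the current dictionary, add the new entry, and then publish the new instance:

```csharp
using System;
using System.Collections.Generic;

// Sketch of a copy-on-write cache (illustrative, not PR code).
static class CopyOnWriteCacheSketch
{
    // volatile so that a newly published dictionary becomes visible to all readers.
    private static volatile Dictionary<Type, bool> cache = new Dictionary<Type, bool>();
    private static readonly object writeLock = new object();

    public static bool IsShallowCopyable(Type t, Func<Type, bool> compute)
    {
        bool result;
        if (cache.TryGetValue(t, out result)) return result; // lock-free read

        result = compute(t);
        lock (writeLock) // writers clone, add, then switch the field
        {
            var copy = new Dictionary<Type, bool>(cache) { [t] = result };
            cache = copy;
        }
        return result;
    }
}
```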

@dVakulen dVakulen force-pushed the shallow-copyable-perf-improvement branch from 8583f1a to 45d47c4 on March 23, 2016 at 03:05
@dVakulen
Contributor Author

Updated for consistency among the cache collections.

@sergeybykov
Contributor

I'm running perf tests to confirm that there is no perf regression here. Looking at the code I can't imagine there would be.

@sergeybykov
Contributor

Interestingly enough, I see no perf impact on one of the ping tests, but a significant increase in throughput in the other one.

PingLoadTest_LocalReentrant: 58,984 vs. 58,416 last night
PingLoadTest_RandomReentrant_MultiSilos: 445,613 vs. 352,822 - a 26% jump.

Maybe @gabikliot has an idea how this is possible. I'm dumbfounded so far.

@dVakulen
Contributor Author

@sergeybykov I'd suggest rerunning those tests, but if you believe in their reliability, then I don't have an explanation either.

@sergeybykov
Contributor

@dVakulen I will rerun them. However, the numbers on these tests have been very stable. I looked several days back. That's why I'm puzzled by the sudden jump. It's a good thing, of course. Just uncomfortable that I don't understand how this is possible.

@dVakulen
Contributor Author

@sergeybykov To which builds do these two results correspond?

@sergeybykov
Contributor

Last night it was c427526. What I'm running now is 45d47c4 merged on top of dfadfd4.

@jdom
Member

jdom commented Mar 23, 2016

@sergeybykov, not sure how big the jump is, but I was expecting an increase in throughput. We were locking, and even without contention, locking is very costly. Nevertheless, with many cores, I would expect this lock to normally be contended for reads, depending on the scenario.

@sergeybykov
Contributor

NightlyLoadTest: 175,165 vs. 166,106 last night - ~5% gain.

@jdom I'm not as much surprised by the gain as by the difference between PingLoadTest_LocalReentrant and PingLoadTest_RandomReentrant_MultiSilos.

@sergeybykov
Contributor

ActivationCollectorStressTest: 205,393 vs. 189,404 = ~8% gain.

I think I'll merge it now because there is clearly no perf regression, and will rerun the tests against master.

@sergeybykov sergeybykov merged commit 924e6cb into dotnet:master Mar 23, 2016
@sergeybykov
Contributor

@dVakulen Obviously, big thanks! IIRC this is the highest perf gain from a single PR!

@dVakulen dVakulen deleted the shallow-copyable-perf-improvement branch March 23, 2016 22:16
@dVakulen
Contributor Author

@sergeybykov Glad to hear.

@gabikliot
Contributor

I think I understand the diff between PingLoadTest_LocalReentrant and
PingLoadTest_RandomReentrant_MultiSilos.
The first one only sends a msg from client to silo, and the silo responds. There is zero serialization and deep copy on the silo, since the replies in this particular case are all Task (no return value). So the code he changed had no impact in the one-silo case.
The 2nd one sends from silo to silo, so the silo is doing lots of serialization and deep-copy attempts.

The other question is why this improvement was so big. I would speculate that it is not so much due to ConcurrentDictionary vs. Dictionary plus lock, but more due to a direct check in the dictionary vs. walking the whole `if (typeInfo.IsPrimitive || typeInfo.IsEnum || t == typeof(string) || t == typeof(DateTime) || t == typeof(Decimal) || t == typeof(Immutable<>))` chain.

If I remember correctly, the PingLoadTest sends Immutable<byte[]>, so in the old code this was a heavy check (who could have predicted?!?), while in the new code it is a quick dictionary check.
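
For illustration, a simplified version of the kind of uncached, per-call check being described (paraphrased, not the repo's exact method). Immutable<byte[]> only matches after all the cheaper branches have failed, and the whole chain runs again on every call:

```csharp
using System;
using System.Reflection;
using Orleans.Concurrency; // Orleans' Immutable<T> wrapper

// Paraphrase of the old, uncached check (simplified; the real method has more branches).
static class UncachedCheckSketch
{
    public static bool IsShallowCopyableUncached(Type t)
    {
        var typeInfo = t.GetTypeInfo();                 // reflection work on every invocation
        if (typeInfo.IsPrimitive || typeInfo.IsEnum) return true;
        if (t == typeof(string) || t == typeof(DateTime) || t == typeof(decimal)) return true;
        // Immutable<T> is reached last, after all cheaper checks have failed.
        if (typeInfo.IsGenericType && t.GetGenericTypeDefinition() == typeof(Immutable<>)) return true;
        return false;
    }
}
```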

Great job @dVakulen !

@sergeybykov
Contributor

@gabikliot Makes sense.

@gabikliot
Contributor

Basically, the takeaway, I think, is that operations on TypeInfo and Type are not as cheap as we thought, so it might make sense to look elsewhere in the stack, mostly in serialization, and try to cache/optimize/reuse.

@dVakulen dVakulen mentioned this pull request Sep 5, 2016
@github-actions github-actions bot locked and limited conversation to collaborators Dec 13, 2023