Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve performance for GetViewBetween and enumerating TreeSubSet #30921

Merged
merged 6 commits into from Aug 2, 2018

Conversation

acerbusace
Copy link
Contributor

Increases performance for SortedSet - GetViewBetween and enumerating TreeSubSet
This also, fixes #30821

  • VersionCheckImpl is not called within the constructor anymore
  • count does not get updated by VersionCheckImpl anymore
  • count is now updated within VersionCheckCount which is only called when retrieving Count property.
    As a result count has been replaced with Count where necessary.
  • For TreeSubSet the enumerator now uses the Count of the parent set (entire tree) when creating the
    stack. This results in a larger stack size, however it eliminates the need to get the count of TreeSubSet,
    which drastically increases the performance.

The performance impacts can be seen below (benchmarkdotnet was used on release build of corefx).

Size of Set: 5000000
Original: original code base
Optimized: changed code base for performance improvements

SubSet of size 10000

  1. Got a SubSet of size 10000 (2495000 - 2505000) (GetView)
  2. Enumerated through the SubSet (EnumerateView)

image

SubSet of size 500000

  1. Got a SubSet of size 1000000 (2000000 - 3000000) (GetView)
  2. Enumerated through the SubSet (EnumerateView)

image

The reason enumerating through a view has similar runtimes, is because the count property in the
original code is updated when the view is gotten and since no changes occur to the set, count doesn't
need to be updated during the start of enumeration.

* VersionCheckImpl is not called within the constructor

* count does not get updated by VersionCheckImpl

* count is now updated within VersionCheckCount which is
  only called when retrieving Count property. As a result
  count has been replaced with Count inside TreeSubSet/SortedSet
  where nessessary.

* For TreeSubSet the enumerator now uses the Count of the
  parent set (entire tree) when creating the stack.
  This results in a larger stack size, however it eliminates the
  need to get the count of TreeSubSet, which drastically
  increases the performance.
@dnfclas
Copy link

dnfclas commented Jul 9, 2018

CLA assistant check
All CLA requirements met.

@karelz karelz added this to the 3.0 milestone Jul 19, 2018
@danmoseley danmoseley requested a review from safern July 24, 2018 02:58
@@ -311,6 +311,8 @@ object ICollection.SyncRoot

// Virtual function for TreeSubSet, which may need to update its count.
internal virtual void VersionCheck() { }
internal virtual void VersionCheckCount() { }
internal virtual int TotalCount() { return Count; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this TotalCount wrapper around Count needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/acerbusace/corefx/blob/6564eb78f7fbe1408c8dbd82a498bb572e017c0f/src/System.Collections/src/System/Collections/Generic/SortedSet.cs#L1955

TotalCount is being used to get the parent set's Count property (when enumerating with a TreeSubSet). This way, enumeration does not require the Count property of TreeSubSet, hence no extra time wasted to updating the Count property first.

If there is another more efficient way, I can make the necessary changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. I got confused and thought you just introduced the parent's TotalCount, missed the overriden version.

{
VersionCheckImpl();

if (versionCount != _underlying.version)
Copy link
Member

@safern safern Jul 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding this new versionCount variable which makes it confusing, it took me a little while to understand what was versionCount needed for. Instead we could change VersionCheck to have a bool parameter that is false by default and when that is true then update count?

i.e

internal override void VersionCheck(bool updateCount = false) => VersionCheckImpl(updateCount);

private void VersionCheckImpl(bool updateCount)
{
    Debug.Assert(_underlying != null);
    if (version != _underlying.version)
    {
        root = _underlying.FindRange(_min, _max, _lBoundActive, _uBoundActive);
        version = _underlying.version;
        if (updateCount)
        {
             ...
        }
    }
}

and then from Count just call it with that bool to true? I think that way it will be more readable and will understand the intention of why we're doing that. I guess with that we would get the same perf improvements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good idea. The only issue is that we have no way of knowing if the current Count property of TreeSubSet is up to date. As a result very time the Count property is accessed (either through within the code or by the end user), Count will always be forcefully updated.

How about adding the updateCount bool like you mentioned (still gets rid of the VersionCheckCount method) but also keeping versionCount variable and renaming it to something easier to understand?

Copy link
Member

@safern safern Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding the updateCount bool like you mentioned (still gets rid of the VersionCheckCount method) but also keeping versionCount variable and renaming it to something easier to understand?

That is true, I'm fine with keeping it the way it is, but please rename the variable to something easier to understand and add a comment in the new method just so that people that reads that understand why we need an extra variable and what method is doing.

@acerbusace
Copy link
Contributor Author

https://github.com/acerbusace/corefx/blob/6564eb78f7fbe1408c8dbd82a498bb572e017c0f/src/System.Collections/src/System/Collections/Generic/SortedSet.TreeSubSet.cs#L191

I was also thinking, how about changing InOrderTreeWalk to use parent set's count property instead? Especially in the case when TreeSubSet's count property is initially 0 and we are trying to update it using InOrderTreeWalk. In the above case since the stack is created with size 0, the run time is a lot worse than just O(n).

@safern
Copy link
Member

safern commented Jul 31, 2018

I was also thinking, how about changing InOrderTreeWalk to use parent set's count property instead? Especially in the case when TreeSubSet's count property is initially 0 and we are trying to update it using InOrderTreeWalk. In the above case since the stack is created with size 0, the run time is a lot worse than just O(n).

Wouldn't using the parent set's count property call: https://github.com/acerbusace/corefx/blob/6564eb78f7fbe1408c8dbd82a498bb572e017c0f/src/System.Collections/src/System/Collections/Generic/SortedSet.TreeSubSet.cs#L317 ? And potentially if something changes the _underlying.Version in between we could end up in an infinite recursion calling InOrderTreeWalk again and going back and forth?

if (updateCount) {
count = 0;
InOrderTreeWalk(n => { count++; return true; });
countVersion = _underlying.version;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where this variable is used now. Shouldn't this be outside of the version != _underlying.version comparing countVersion to _underlying.version if updateCount is true?

@acerbusace
Copy link
Contributor Author

Wouldn't using the parent set's count property call: https://github.com/acerbusace/corefx/blob/6564eb78f7fbe1408c8dbd82a498bb572e017c0f/src/System.Collections/src/System/Collections/Generic/SortedSet.TreeSubSet.cs#L317 ? And potentially if something changes the _underlying.Version in between we could end up in an infinite recursion calling InOrderTreeWalk again and going back and forth?

That is true. Let's leave that performance improvement for another time, then.

@@ -281,7 +281,7 @@ public int Count
{
get
{
VersionCheck();
VersionCheckCount(true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing, sorry for the back and forth iterations, but for bool parameters we like to name them so that when you read this statement is clear and you don't have to go to the method declaration to see what that true stands for, i.e:

VersionCheck(updateCount: true);

and just noticed you're calling VersionCheckCount still, it should now be VersionCheck.

Also, would you mind re running your benchmarks just to make sure we're still getting the perf benefits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed that above issue and sure, I'll rerun the benchmarks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance benefits are still there. Tested for Set of size 5000000 and SubSet (View) of Size 500000.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for the update!

// keeps track of whether the count variable is up to date
// up to date -> countVersion = _underlying.version
// not up to date -> countVersion < _underlying.version
private int countVersion;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we name this with _ I think the reason why version doesn't have the prefix is for serialization compatibility, since this is a new field we're adding, we can add the prefix.

{
Debug.Assert(_underlying != null);
if (version != _underlying.version)
{
root = _underlying.FindRange(_min, _max, _lBoundActive, _uBoundActive);
version = _underlying.version;
}

if (updateCount && countVersion != _underlying.version) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last thing, the opening bracket needs to be in a new line.

Copy link
Member

@safern safern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @acerbusace

@safern safern merged commit cc4d621 into dotnet:master Aug 2, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…tnet/corefx#30921)

* Improve performance for GetViewBetween and enumerating TreeSubSet

* VersionCheckImpl is not called within the constructor

* count does not get updated by VersionCheckImpl

* count is now updated within VersionCheckCount which is
  only called when retrieving Count property. As a result
  count has been replaced with Count inside TreeSubSet/SortedSet
  where nessessary.

* For TreeSubSet the enumerator now uses the Count of the
  parent set (entire tree) when creating the stack.
  This results in a larger stack size, however it eliminates the
  need to get the count of TreeSubSet, which drastically
  increases the performance.

* Cleaned up code and added comments

* Fixed logic error with updateCount

* Removed unnecessary blank line

* Changed VersionCheckCount -> VersionCheck

* Cleaning code


Commit migrated from dotnet/corefx@cc4d621
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
5 participants