
Avoid allocations when adding existing items to the blob heap. #81059

Merged
merged 8 commits into dotnet:main from srm-blob-opt on Apr 19, 2023

Conversation

@teo-tsirpanis (Contributor) commented Jan 23, 2023

To avoid adding them multiple times, System.Reflection.Metadata keeps track of the blobs added to the blob heap through a Dictionary<ImmutableArray<byte>, BlobHandle> with a custom equality comparer that compares the blobs by their content. The problem with this approach is that we have to allocate an immutable array for every blob we are going to add, even if it already exists in the heap. The most obvious solutions to this problem would be to wait for #2010 or to implement a full-featured hash table with span keys for our own internal use.

This PR uses a different approach to solve this problem. Instead of a Dictionary<ImmutableArray<byte>, BlobHandle> we have a Dictionary<int, KeyValuePair<ImmutableArray<byte>, BlobHandle>> where the key is the hash code of our blob, and the value is the blob itself and its handle.

If we want to see if a blob exists:

  1. We set the dictionary key to the blob's hash code.
  2. We check whether the dictionary has an item with this key.
    1. If it doesn't, the blob does not exist. We can add it at the current dictionary key; only then do we have to allocate the immutable array.
    2. If it does, we compare the stored value's key with the user-provided blob.
      1. If they are equal, the blob already exists.
      2. If they are not equal, we derive the next dictionary key (at the moment by running the current one through a linear congruential generator) and go back to step 2.

The simplicity of this solution comes from using a "hash table within a hash table". The Dictionary accepts keys across their full 32-bit range and does its best to let us access them efficiently, so we don't have to do all the heavy lifting ourselves.
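
A minimal sketch of this lookup, assuming a hypothetical BlobHeapSketch class (GetContentHash, AddToHeap, and the LCG constants are illustrative, not the code this PR actually adds):

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Reflection.Metadata;
using System.Reflection.Metadata.Ecma335;
using System.Runtime.InteropServices;

internal sealed class BlobHeapSketch
{
    // Outer "hash table": keyed by a probing key derived from the blob's content hash.
    private readonly Dictionary<int, KeyValuePair<ImmutableArray<byte>, BlobHandle>> _blobs = new();
    private readonly List<byte> _heap = new() { 0 }; // offset 0 is the empty blob

    public BlobHandle GetOrAddBlob(ReadOnlySpan<byte> blob)
    {
        int key = GetContentHash(blob); // step 1: start at the blob's hash code

        while (true) // step 2: probe the dictionary
        {
            if (!_blobs.TryGetValue(key, out KeyValuePair<ImmutableArray<byte>, BlobHandle> entry))
            {
                // Step 2.1: nothing at this key, so the blob is new. Only now do we
                // pay for the ImmutableArray allocation (wrapping a fresh copy).
                BlobHandle handle = AddToHeap(blob);
                ImmutableArray<byte> copy = ImmutableCollectionsMarshal.AsImmutableArray(blob.ToArray()); // .NET 8+
                _blobs.Add(key, new(copy, handle));
                return handle;
            }

            if (entry.Key.AsSpan().SequenceEqual(blob))
            {
                return entry.Value; // step 2.2.1: same content, the blob already exists
            }

            // Step 2.2.2: different content under the same key; advance the key with a
            // linear congruential generator and probe again.
            key = unchecked(key * 1103515245 + 12345);
        }
    }

    private BlobHandle AddToHeap(ReadOnlySpan<byte> blob)
    {
        // Placeholder for the real heap append; MetadataTokens fabricates a handle from the offset.
        int offset = _heap.Count;
        _heap.AddRange(blob.ToArray());
        return MetadataTokens.BlobHandle(offset);
    }

    private static int GetContentHash(ReadOnlySpan<byte> blob)
    {
        // FNV-1a, purely for illustration; the real code uses its own hash function.
        int hash = unchecked((int)2166136261);
        foreach (byte b in blob)
            hash = unchecked((hash ^ b) * 16777619);
        return hash;
    }
}
```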

With this data structure in hand, the MetadataBuilder.GetOrAddBlob* methods were rewritten. An internal overload that takes a ReadOnlySpan<byte> was introduced, and the other overloads were reimplemented on top of it (a rough sketch of these paths follows the list):

  • GetOrAddBlobUTF16 will reinterpret the string directly as a ReadOnlySpan<byte> if we are on a little-endian system, avoiding any allocations and copies.
  • GetOrAddBlob(BlobBuilder) cannot always avoid the allocation, but it will if the blob builder is small or consists of a single segment (in the latter case it also avoids any copies).
    • Other overloads use this one and were indirectly optimized too.
  • GetOrAddBlob(ImmutableArray<byte>) takes care to reuse the provided immutable array, as before.
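
To make the list above concrete, here is a hedged sketch of how the overloads can funnel into a span-based core, reusing the hypothetical BlobHeapSketch from the earlier snippet (this is not the exact MetadataBuilder implementation):

```csharp
using System;
using System.Reflection.Metadata;
using System.Runtime.InteropServices;

internal static class GetOrAddBlobSketch
{
    public static BlobHandle GetOrAddBlobUTF16(this BlobHeapSketch heap, string value)
    {
        if (BitConverter.IsLittleEndian)
        {
            // The heap stores UTF-16LE, so on little-endian machines the string's chars
            // can be viewed directly as bytes: no allocation, no copy.
            return heap.GetOrAddBlob(MemoryMarshal.AsBytes(value.AsSpan()));
        }

        // Big-endian machines still need an explicit little-endian copy.
        byte[] bytes = new byte[value.Length * sizeof(char)];
        for (int i = 0; i < value.Length; i++)
        {
            bytes[2 * i] = (byte)value[i];
            bytes[2 * i + 1] = (byte)(value[i] >> 8);
        }
        return heap.GetOrAddBlob(bytes);
    }

    public static BlobHandle GetOrAddBlob(this BlobHeapSketch heap, BlobBuilder builder)
    {
        // If the builder's content lives in a single segment, look it up straight from
        // that segment's bytes; the copy happens only if the blob turns out to be new.
        foreach (Blob blob in builder.GetBlobs())
        {
            ArraySegment<byte> segment = blob.GetBytes();
            if (segment.Count == builder.Count)
            {
                return heap.GetOrAddBlob(segment.AsSpan());
            }
            break; // more than one segment; fall back to the copying path
        }

        return heap.GetOrAddBlob(builder.ToArray());
    }
}
```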

Let's see what CI thinks. I will open an API proposal to add span APIs that would take advantage of this. I also left the optimization of the string and user string heaps for another time (they don't currently have a public API that would take advantage of this).

@ghost added the community-contribution label Jan 23, 2023
@ghost commented Jan 23, 2023

Tagging subscribers to this area: @dotnet/area-system-reflection-metadata
See info in area-owners.md if you want to be subscribed.


Author: teo-tsirpanis
Assignees: -
Labels: area-System.Reflection.Metadata, community-contribution

Milestone: -

@buyaa-n (Member) commented Mar 6, 2023

Let's see what CI thinks. I will open an API proposal to add span APIs that would take advantage of this.

@teo-tsirpanis I was confused by this; is this PR ready for review? If it's ready, do you have numbers to show the improvements? If it's not ready yet, let's convert it to a draft.

@teo-tsirpanis (Contributor, Author)

@buyaa-n Yes, it's ready for review. I haven't run any benchmarks yet.

@steveharter (Member)

@teo-tsirpanis thanks for this PR; it looks like a significant perf improvement. However, we need benchmarks to show the removed allocations and the CPU impact. This can be done either through the official benchmarks or with one-off BenchmarkDotNet results.

I do not see any existing official benchmarks for MetadataBuilder. I can assist by creating these and verifying them against this PR if needed. Thanks.

@steveharter added the needs-author-action label Apr 3, 2023
@teo-tsirpanis (Contributor, Author)

Thanks @steveharter, I would appreciate some help with the benchmarks.

@ghost removed the needs-author-action label Apr 4, 2023
@steveharter (Member)

@teo-tsirpanis can you merge main into this branch again, please? CI is a bit of a mess here, with almost 2,000 failures. Thanks

@steveharter (Member)

I created some benchmarks and they look good, so thanks! I'll open a PR for them once you verify these results.

It seems the new comparison logic is much faster; the Write_NoReuse benchmark below is almost 2x faster (perhaps because the first byte differs, causing an early exit from the comparison).

Allocations in the no-reuse case increased slightly (3%), but in the reuse case they dropped to a ratio of 0.65 (roughly a one-third reduction).

results:

|        Method |        Job |               Toolchain |      Mean |    Error |   StdDev |   Median |       Min |      Max | Ratio | RatioSD |    Gen0 |    Gen1 | Allocated | Alloc Ratio |
|-------------- |----------- |------------------------ |----------:|---------:|---------:|---------:|----------:|---------:|------:|--------:|--------:|--------:|----------:|------------:|
| Write_NoReuse | Job-VRUVWC |  \TEO_AFTER\corerun.exe | 105.65 us | 0.966 us | 0.904 us | 106.0 us | 104.29 us | 107.2 us |  0.55 |    0.01 | 26.1954 | 10.8108 | 269.83 KB |        1.03 |
| Write_NoReuse | Job-FAUVYJ | \TEO_BEFORE\corerun.exe | 193.31 us | 2.292 us | 2.144 us | 193.0 us | 190.17 us | 197.1 us |  1.00 |    0.00 | 25.2765 |  9.4787 | 261.21 KB |        1.00 |
|               |            |                         |           |          |          |          |           |          |       |         |         |         |           |             |
|   Write_Reuse | Job-VRUVWC |  \TEO_AFTER\corerun.exe |  99.95 us | 2.096 us | 2.414 us | 100.5 us |  96.51 us | 104.8 us |  0.95 |    0.03 | 16.0156 |  3.1250 | 167.72 KB |        0.65 |
|   Write_Reuse | Job-FAUVYJ | \TEO_BEFORE\corerun.exe | 104.73 us | 1.129 us | 1.001 us | 104.7 us | 103.04 us | 106.2 us |  1.00 |    0.00 | 25.4167 |  5.0000 | 259.65 KB |        1.00 |
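
For context, a purely illustrative sketch of the shape such benchmarks could take (the real ones live in the dotnet/performance branch mentioned below and may differ in blob sizes, counts, and what they serialize):

```csharp
using System;
using System.Reflection.Metadata.Ecma335;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class BlobHeapBenchmarks
{
    private byte[][] _distinctBlobs = null!;
    private byte[] _sharedBlob = null!;

    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(42);
        _distinctBlobs = new byte[1000][];
        for (int i = 0; i < _distinctBlobs.Length; i++)
        {
            _distinctBlobs[i] = new byte[64];
            random.NextBytes(_distinctBlobs[i]);
        }
        _sharedBlob = _distinctBlobs[0];
    }

    [Benchmark]
    public MetadataBuilder Write_NoReuse()
    {
        // Every blob is unique, so each call appends to the heap.
        var builder = new MetadataBuilder();
        foreach (byte[] blob in _distinctBlobs)
            builder.GetOrAddBlob(blob);
        return builder;
    }

    [Benchmark]
    public MetadataBuilder Write_Reuse()
    {
        // The same blob is added repeatedly; after this PR the repeated
        // lookups no longer allocate an ImmutableArray per call.
        var builder = new MetadataBuilder();
        for (int i = 0; i < _distinctBlobs.Length; i++)
            builder.GetOrAddBlob(_sharedBlob);
        return builder;
    }
}
```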

To run the benchmarks:

  • Build your main branch and copy the binaries to a folder, e.g. copy artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\ c:\repos\TEO_BEFORE
  • Build your srm-blob-opt branch and copy the binaries to a folder, e.g. copy artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\ c:\repos\TEO_AFTER
  • Clone the performance repo and pull my branch
  • From the performance repo, go to the src\benchmarks\micro directory
  • Run dotnet run -c Release -f net7.0 --filter System.Reflection.Metadata.Blob* --join --corerun c:\repos\TEO_BEFORE\corerun.exe c:\repos\TEO_AFTER\corerun.exe

…ffer.

The blob builder's default starting size is 256 bytes, the same as the stack-allocated buffer; making it bigger does not have any benefit since such large blobs are rare.
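
A hedged sketch of the pattern that commit describes, continuing the GetOrAddBlobSketch class from earlier (the helper name and exact control flow are illustrative): builders of up to 256 bytes are copied into a stack-allocated buffer so the lookup itself never allocates.

```csharp
public static BlobHandle GetOrAddSmallBlob(this BlobHeapSketch heap, BlobBuilder builder)
{
    const int StackBufferSize = 256; // matches BlobBuilder's default starting capacity

    if (builder.Count <= StackBufferSize)
    {
        // Gather the (possibly multi-segment) content into a stack buffer.
        Span<byte> buffer = stackalloc byte[StackBufferSize];
        int written = 0;
        foreach (Blob blob in builder.GetBlobs())
        {
            ArraySegment<byte> segment = blob.GetBytes();
            segment.AsSpan().CopyTo(buffer.Slice(written));
            written += segment.Count;
        }
        return heap.GetOrAddBlob(buffer.Slice(0, written));
    }

    // Larger blobs are rare; fall back to the heap-allocating copy.
    return heap.GetOrAddBlob(builder.ToArray());
}
```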
@steveharter (Member) left a comment


Looks good. @teo-tsirpanis can you merge this or should I? Thanks

@teo-tsirpanis (Contributor, Author)

I can't, I'm not a Microsoft employee. 😅

@buyaa-n (Member) commented Apr 19, 2023

I will open an API proposal to add span APIs that would take advantage of this.

Thank you @teo-tsirpanis, I am looking forward to that proposal and the new ReadOnlySpan-based API that we could use for emitting custom attributes.

The test failure is unrelated and has been reported; merging.

@buyaa-n merged commit 6149ca0 into dotnet:main Apr 19, 2023
104 of 107 checks passed
@teo-tsirpanis deleted the srm-blob-opt branch April 20, 2023 09:11
@steveharter added the tenet-performance label Apr 24, 2023
@dotnet locked as resolved and limited conversation to collaborators May 24, 2023