Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String Deduplication Design Doc #31971

Merged
merged 1 commit into from
Feb 8, 2020
Merged

String Deduplication Design Doc #31971

merged 1 commit into from
Feb 8, 2020

Conversation

Maoni0
Copy link
Member

@Maoni0 Maoni0 commented Feb 8, 2020

@Maoni0 Maoni0 self-assigned this Feb 8, 2020
Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing this down!

@benaadams
Copy link
Member

benaadams commented Feb 8, 2020

Related: Kestrel does string dedupe in byte->string conversion across requests on a single connection (checking if the new string for the request is the same as the one from the previous request; before trying to create it) dotnet/aspnetcore#8374.

image

This helps with http, as the values of some of the more egregious sized strings Cookie, User-Agent, Authorization are sent again and again with every request and are generally always identical for every request on a connection.

While runtime string dedupe wouldn't help necessarily directly with these strings (as they may be unique); since it also makes the strings become longer lived it would automagically widen this deduping to be across connection for the values of other common strings Host, accept-encoding and common User-Agents etc so rather than being one per connection they would naturally tend to become one value per process.

This wider string deduplication for longer lived strings would be a beautiful addition; and also an advantage a non-GC'd environment would find very hard to replicate.

@benaadams
Copy link
Member

Would not deduping pointers on stack and providing a attribute for fields (say: [NoDedupe]) allow a release valve for corner cases and help with adoption? (e.g. you could use that in a library if you were in one of the corner cases, to not prevent someone using your library in an app from switching it on)

@Maoni0
Copy link
Member Author

Maoni0 commented Feb 8, 2020

@benaadams as with any perf feature it's a matter of how much adoption you can get vs how much it requires to implement. stack vars tend to live for a short period of time so the benefit of deduping there isn't as obvious as old gen objects. for this release I'm not thinking about deduping refs on the stack. we might look into deduping long lived stack refs in the future. I can add a note of this in the doc.


- We know when new LOH strings are allocated (all LOH allocations are done via C++ helpers in gchelpers.cpp), we can take the accumulative size of new LOH strings till it reaches a certain percentage of LOH before starting the next dedup cycle.

- We know how much is promoted into gen2 and we can also wait till a certain percentage of gen2 consists of newly promoted memory before starting the next dedup cycle.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to also trigger if there is (a smaller value, numeric rather than percentage?) promoted memory and enough (idle?) time has elapsed, so it can use more idle periods to also dedpue? Also in that phase maybe check if the string count in that memory goes over a threshold prior to starting the scan?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. when we actually start doing work we'll experiment with various tuning options to see what works best. I didn't want to put too much implementation detail in the doc.

@Maoni0
Copy link
Member Author

Maoni0 commented Feb 8, 2020

ok...I believe I've addressed all feedback. thanks all, especially @jkotas for your excellent help!

@Maoni0 Maoni0 merged commit 98d47d5 into dotnet:master Feb 8, 2020
@jkotas
Copy link
Member

jkotas commented Feb 8, 2020

Would not deduping pointers on stack and providing a attribute for fields (say: [NoDedupe]) allow a release valve for corner cases and help with adoption?

@benaadams meant this as a suggestions to deal with the compat problems in existing libraries that depend on string object identity. In his suggestion, not deduping pointers on the stack is a good thing (tm), not something to fix.


- Backup nonconcurrent dedup mode

If you are running into these problematic scenarios and cannot make changes, and you can tolerate longer STW (Stop-The-World) pauses, we will provide a mode where deduping is completely done in either a full blocking GC or during the STW pause of a BGC. Obviously this would make BGC’s STW pauses much, much longer. This would basically just call the function that the dedup thread calls, but in these STW pauses. If you use a CPU profiler you should be able to see preciously how long the dedup part takes during the STW pause.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocking mode cannot be used to workaround the compatibility problems reliably. It would work only if you can tell whether none of the threads are running the problematic code while the world is stopped which is close to impossible. All of the compat problems listed above will still be a problem even in the full blocking mode.

I doubt we would be able to recommend this as a workaround to anybody important. We cannot recommend to use something that causes crash or data corruption some of the time. Maybe as the one-shot deduplication after startup when you can know that other threads in the process are not doing anything problematic (we would need the API for that), but not as the in-flight dedup while the service is running.


1. Using the fixed keyword to modify string content.

I’m seeing that there are almost 600 places in libraries that do `fixed (char*` but hopefully most of them do not actually modify the string content. We should definitely be encouraging folks to switch to using `string.Create` like what [PR#31700](https://github.com/dotnet/runtime/pull/31700) did.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use this pattern internally for implement string constructors. We will need an internal API that will be called at the end of the string constructor to mark the string as "baked" and candidate for deduping.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right, I should add this...sorry I forgot about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we won't need this method if we are not deduping pointers on stack as @benaadams pointed out. Now I remember we talked about it already, but I forgot. Either way, it would be useful to mention it in the doc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, because it can not be assigned to any field before it gets full constructed. makes sense.

@GSPP
Copy link

GSPP commented Feb 9, 2020

Above, an opt-out attribute [NoDedupe] was suggested. An alternative design to an attribute would be a holder struct:

struct NoDedupe
{
 public string Value;
}

The runtime would know not to touch the Value field.

That way, opt-out can be granular per variable.


Also, I think it is an important result of the discussion above that ReferenceEquals can not only flip from false to true but also from true to false. I wonder what kinds of (optimized) code this could potentially break.

@lpereira
Copy link
Contributor

On Linux, it should be possible to use a userfaultfd() and store the deduplicated strings on read-only pages. If unsafe code modifies the contents of a string, the runtime can then copy it to a new location, patch things up, and resume operation. Kind of an automatic CoW mechanism.

I don't know how expensive this would be, but my gut instinct says it's going to be fine given that modifying strings shouldn't happen in most cases, and that string deduping is an opt-in feature that's not 100% free -- all the while providing an alternative to make things just work without requiring people to change their code.

I also don't know if other platforms support handling page faults in user code.

@Maoni0
Copy link
Member Author

Maoni0 commented Feb 11, 2020

Would not deduping pointers on stack and providing a attribute for fields (say: [NoDedupe]) allow a release valve for corner cases and help with adoption?

@benaadams meant this as a suggestions to deal with the compat problems in existing libraries that depend on string object identity. In his suggestion, not deduping pointers on the stack is a good thing (tm), not something to fix.

hmm, seems confusing to me 'cause I never mentioned we were deduping strings specifically pointed to by the stack. in the ref equality example, the string field was deduped because it's pointed to by an object (thus is called "_field") and the stack vars just happened to point to it. are you suggesting that while we are deduping we also check if a ref is also pointed by stack vars? that would be very difficult to do - I dunno of any performant way to do it.

@benaadams
Copy link
Member

hmm, seems confusing to me 'cause I never mentioned we were deduping strings specifically pointed to by the stack

Never mentioned it wasn't deduping string pointers on the stack either, only the generation of the string? 😄

I had assumed it also would be deduping stack pointers due to the worry around mutating strings immediately after construction; as presumably they will only be pointed at by the stack (at that time).

What I was saying is: if it didn't dedupe pointers on the stack (which it doesn't) and you could opt out specific fields with it (via [NoDedupe]); then it would be easier for broad adoption because libraries that did problematic things could mark those few heap pointers and still be used by an application that enabled deduping.

Further to that; assuming the analyzers were robust and the deduping showed good results, there is potential for it to move to opt-out (if something like [NoDedupe]) was provided, as assemblies could be marked as "dedupe-aware", and loading one that didn't have the flag set could switch off the feature (if it was not explicitly overriden to opt-in); with the analyzers setting the flag on compilation?

@Maoni0
Copy link
Member Author

Maoni0 commented Feb 11, 2020

the design outline states "As the dedup thread goes through the old generations linearly, it looks for references to a string object (denoted by the method table) ...". so it does not do anything with the stack vars (at least in my mind that's what it meant 😆) but I can see how this detail can be missed - it certainly can be made more obvious by mentioning it specifically in the design goals section.

I get what you mean now. I'll need to see how much work it'd be to implement something to specify [NoDedup] efficiently. thanks for your explanation!

@Maoni0
Copy link
Member Author

Maoni0 commented Feb 11, 2020

@lpereira, yes, Windows has this too. probably wouldn't want to start out on the path of handling AVs but this is something we can definitely keep in mind. I'll add a note to the doc.


#### **Design outline**

This is an opt in feature. When the runtime detects it’s turned on, it creates a dedup thread to do the work.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate more? Is this a compile flag or .config flag?
How would dll behave? Would it share the same pool? AppDomain?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for chiming in :)

I think that would be turned on through an environment variable. As per our tradition, it would probably be called COMPLUS_DeduplicateString or something similar. The environment variable is read on process startup, and it will impact the current process.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.