Improve Density of GC heap by String Interning (de-duping) on Gen2 GC  #9022

@vancem

Description

It has been noted that some applications (notably ones like MSBuild or Visual Studio) that manipulate many file paths or deserialize large text-based payloads (e.g. JSON or XML) tend to have many duplicated strings. We could reduce the size of the GC heap by interning them (that is, using a single string instance for all occurrences of a particular string value). The runtime already does this for string literals, but strings that are constructed at runtime do not benefit.

To implement this you need a table, indexed by string value, that remembers the existing strings. While you COULD do this table lookup when strings are first created, this is not likely to be a good approach because MANY strings have very short lifetimes, and it would slow down ALL string creation.

Instead, the idea is to do the interning check as part of promoting the object from GC generation 1 to GC generation 2. This is a nice place to do it because

  1. Most strings die before reaching Gen 2
  2. However if they do make it, they are 'expensive' strings in that they are likely to live a long time.
  3. Thus it makes sense to de-dup at that point

Another nice aspect of this feature is that it does not need to be perfect (you don't HAVE to dedup everything). Thus the hash table you keep can be of fixed size with 'replace on collision' semantics, which is simple and bounded, and tends to favor younger strings (all good things).
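To make the table idea concrete, here is a minimal sketch in Python that models the behavior only; it is not how the runtime's GC is or would be implemented. The names `DedupTable`, `canonical`, and `promote_to_gen2` are hypothetical, and returning the canonical object stands in for the reference fix-up a real GC would have to do.

```python
class DedupTable:
    """Fixed-size table with 'replace on collision' semantics (hypothetical model)."""

    def __init__(self, size=1024):
        # Bounded storage: the table never grows, so its memory cost is fixed.
        self.slots = [None] * size

    def canonical(self, s):
        """Return the instance to keep for s, recording s if no match is stored."""
        i = hash(s) % len(self.slots)
        existing = self.slots[i]
        if existing is not None and existing == s:
            # Same value already recorded: reuse the earlier instance.
            return existing
        # Empty slot or a different string hashed here: the newer string wins.
        self.slots[i] = s
        return s


def promote_to_gen2(survivors, table):
    """Model of the promotion step: dedup strings, pass other objects through."""
    return [table.canonical(o) if isinstance(o, str) else o for o in survivors]


if __name__ == "__main__":
    table = DedupTable(size=64)
    # Two equal strings built at runtime (stand-ins for duplicated file paths).
    a = "".join(["src/", "main.cs"])
    b = "".join(["src/", "main.cs"])
    promoted = promote_to_gen2([a, 42, b], table)
    print(promoted[0] is promoted[2])
```

Because a colliding entry is simply overwritten, some duplicates will be missed, which is acceptable given that the dedup does not have to be perfect.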

The expectation is that typical apps have about 20% of their GC heap taken up by strings. Some measurements we have seen suggest that, for at least some applications, 10-30% of strings may be duplicates, so this might save 2-3% of the GC heap. Not huge, but the feature is not that difficult either.
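The estimate above can be reproduced as a back-of-the-envelope calculation. This sketch assumes the savings are simply the string share of the heap times the duplicated share of strings times a reclaim rate; `heap_savings_fraction` and `reclaim_rate` are hypothetical names introduced for illustration, not measured quantities.

```python
def heap_savings_fraction(string_share, duplicate_share, reclaim_rate=1.0):
    """Fraction of the GC heap saved under the simple product model.

    string_share:    fraction of the heap occupied by strings (0.20 in the text)
    duplicate_share: fraction of string space that is duplicated (0.10 to 0.30)
    reclaim_rate:    assumed fraction of duplicates actually removed (hypothetical)
    """
    return string_share * duplicate_share * reclaim_rate


if __name__ == "__main__":
    low = heap_savings_fraction(0.20, 0.10)
    high = heap_savings_fraction(0.20, 0.30)
    print(f"{low:.1%} to {high:.1%}")
```

With every duplicate reclaimed, the product ranges from 2% to 6% of the heap, so the 2-3% figure sits toward the low end of that range. A `reclaim_rate` below 1.0 models duplicates that the fixed-size table misses.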

The first step is to build enough of a prototype that we can run it on a number of interesting apps and get a feel for how much GC heap space it would save and how much overhead it would add to (gen 1 and gen 2) GCs.
