[ Question ] Reduce memory consumption of CoreCLR #10380

Closed
ruben-ayrapetyan opened this Issue Mar 22, 2017 · 17 comments

ruben-ayrapetyan (Contributor) commented Mar 22, 2017

Hello.

I am wondering about possible ways to reduce the memory consumption of CoreCLR.
Do you have any ideas about how the working set size could be reduced?
Please share any related ideas, as well as general opinions about this direction of development.

By the way, is there a defined set of rules for choosing between higher performance and lower memory consumption?
Is it accepted practice to add compile-time or runtime switches that allow switching between the two options?

davidfowl (Contributor) commented Mar 22, 2017

Don't allocate 😄

seanshpark (Contributor) commented Mar 22, 2017

Hi Ruben, do you have any profiling results?

ruben-ayrapetyan (Contributor) commented Mar 22, 2017

Hi SaeHie,

Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.

A typical profile of CoreCLR's memory for these GUI applications is the following:

  1. Mapped assembly images - 4.2 megabytes (50%)
  2. JIT-compiler's memory - 1.7 megabytes (20%)
  3. Execution engine - about 1 megabyte (11%)
  4. Code heap - about 1 megabyte (11%)
  5. Type information - about 0.5 megabyte (6%)
  6. Objects heap - about 0.2 megabyte (2%)
egavrin commented Mar 22, 2017

JIT-compiler memory - 1.7 megabytes (20%)

Compiler itself or generated code?

ruben-ayrapetyan (Contributor) commented Mar 22, 2017

JIT-compiler memory - 1.7 megabytes (20%)

Compiler itself or generated code?

Yes: the memory for the compilation itself, without the size of the JIT-compiled code (the code's size is accounted for under "Code heap").

jkotas (Member) commented Mar 22, 2017

Yes, the memory for compilation itself

This memory should be transient. It is not needed once the JIT is done JITing. The JIT keeps some of it around to avoid asking the OS for it again and again. Is the 1.7MB number the high watermark, or do you see it kept around permanently?

The JIT should need less than 100kB to JIT most methods. You may take a look at which (large?) methods take a large amount of memory to JIT, and do something about them.
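To make that suggestion concrete, here is a hypothetical sketch (all names are invented for illustration): when one huge method, often machine-generated initialization code, is what drives the JIT's transient memory up, splitting it into smaller methods keeps the peak low, because each method is JITed independently.

```csharp
using System.Runtime.CompilerServices;

static class LookupTables
{
    // Before: one enormous generated method. The JIT's transient working
    // memory grows with the size of the method being compiled.
    // After: split the body so that each compilation stays small.
    public static void InitAll()
    {
        InitPart1();
        InitPart2();
    }

    // NoInlining keeps each part a separate compilation, so the JIT never
    // has to hold the whole body in memory at once.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void InitPart1() { /* first half of the generated assignments */ }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void InitPart2() { /* second half of the generated assignments */ }
}
```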

jkotas (Member) commented Mar 22, 2017

Don't allocate :-)

This is not necessarily the right answer for optimizing the fixed footprint that this issue is about. The techniques for avoiding allocations (generics, etc.) often make the fixed footprint worse than simply writing straightforward code that allocates a bit of temporary garbage.

Typical profile of CoreCLR's memory on the GUI applications

Excellent! It is always good to start a performance investigation with a measurement.

Higher performance and lower memory consumption? Is it accepted practice to add compile-time or runtime switches that allow switching between the two options?

We do have prior art here: the server GC vs. workstation GC setting is exactly that. The server GC gives higher performance, but it has higher memory consumption as well. We can discuss other similar switches like this.
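For reference, a minimal sketch of how that existing switch is flipped, via the environment variable CoreCLR reads at startup (project-level settings exist as well):

```sh
# Workstation GC (the default) favors a small footprint;
# server GC trades memory for throughput.
export COMPlus_gcServer=1   # processes launched from here on use server GC
```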

Mapped assembly images - 4.2 megabytes (50%)
JIT-compiler's memory - 1.7 megabytes (20%)

These two are obviously the buckets to focus on. For optimizing the footprint of mapped assembly images, you may take a look at the mono linker (https://github.com/mono/linker); @russellhadley and @erozenfeld are looking into using it for .NET Core.

seanshpark (Contributor) commented Mar 22, 2017

Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.

Thanks for sharing the results!

ruben-ayrapetyan (Contributor) commented Apr 3, 2017

@jkotas,

Thank you very much for your comments.

We clarified the measurements.

We also need to add a few comments about them:

  • The measurements were performed with assemblies precompiled in the ReadyToRun format, which is currently not the default in Tizen. When the Fragile format of assemblies is used, the distribution of memory consumption looks quite different.
  • The measurements show the "Private" memory usage of the process, i.e. only the part that is not shared with other processes; the "Shared" part is not accounted for at all in these measurements. Most of the "Mapped assembly images" bucket above is "Private_Clean" (unmodified) memory, which automatically becomes "shared" as soon as the same assembly is mapped into another process. So the actual per-application consumption of the mapped assembly images is much lower in ReadyToRun mode. Please see the new measurements below.

@seanshpark, @jkotas, please see the clarified measurements below.
The following measurements are for the Puzzle sample application (https://developer.tizen.org/sites/default/files/documentation/puzzle2.zip), started alongside another .NET application (so the mapped files are mostly shared).

ReadyToRun mode means that the Tizen-default set of precompiled assemblies is in the ReadyToRun format.
Fragile mode means that the set is in the Fragile format (the format currently used in Tizen).
The values in the cells represent the "Private" (per-application) memory consumption of CoreCLR.

| Component | ReadyToRun mode | Fragile mode |
| --- | --- | --- |
| Mapped assembly images | 1921 kilobytes (37%) | 5130 kilobytes (76%) |
| Execution engine | 1309 kilobytes (25.2%) | 795 kilobytes (11.8%) |
| Objects heap | 690 kilobytes (13.3%) | 506 kilobytes (7.5%) |
| Code heap | 549 kilobytes (10.5%) | 119 kilobytes (1.7%) |
| Type information | 654 kilobytes (12.6%) | 106 kilobytes (1.5%) |
| JIT-compiler's memory | 64 kilobytes (1.2%) | 64 kilobytes (0.9%) |
| Total | 5187 kilobytes (100%) | 6720 kilobytes (100%) |
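As a side note, "Private" values like those above can be obtained on Linux by summing the Private_Clean and Private_Dirty fields over all mappings in /proc/<pid>/smaps. A minimal sketch of such a measurement (assuming Linux; the PID is taken from the first argument):

```csharp
using System;
using System.IO;
using System.Linq;

static class SmapsPrivate
{
    static void Main(string[] args)
    {
        // "self" inspects the current process when no PID is given.
        string pid = args.Length > 0 ? args[0] : "self";

        // Each mapping reports lines like "Private_Dirty:  128 kB";
        // Private_Clean + Private_Dirty is the process-private footprint.
        long privateKb = File.ReadLines($"/proc/{pid}/smaps")
            .Where(l => l.StartsWith("Private_Clean:") || l.StartsWith("Private_Dirty:"))
            .Sum(l => long.Parse(l.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)[1]));

        Console.WriteLine($"Private: {privateKb} kB");
    }
}
```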

Do we understand correctly that the differences in memory distribution between the ReadyToRun and Fragile modes are caused by storing pre-initialized data in the Fragile format? Could you please point us to some documentation, or to places in the code base, that could explain the difference?

jkotas (Member) commented Apr 3, 2017

differences in memory distribution between ReadyToRun and Fragile mode are caused by storing preinitialised data in the Fragile format?

I think so.

documentation or places in code base that could explain the difference?

The pre-initialized data structures in the Fragile format contain a lot of pointers that need to be updated. This is called "restoring" in the code; e.g., look for MethodTable::Restore. Updating the pointers produces the private memory pages.

Creating the data structures at runtime on demand gives you dense packing for free: the private pages contain just the data structures that are needed. The pre-initialized data structures in the fragile images do not have this property (e.g., the program may need only a 100-byte data structure from a given page, but the whole 4 kB page becomes private memory).

ruben-ayrapetyan (Contributor) commented Apr 3, 2017

@jkotas, thank you for the information!

danmosemsft (Member) commented Apr 13, 2017

@ruben-ayrapetyan, as I read it this is answered now; please reopen if not.

ruben-ayrapetyan (Contributor) commented Jul 5, 2017

@jkotas,

We have performed an initial comparison of CoreCLR and CoreRT from the viewpoint of memory consumption, using benchmarks from http://benchmarksgame.alioth.debian.org.

The initial measurements show that CoreCLR consumes approximately 41% more memory on average than CoreRT, and is approximately 4% slower (x64 release build).

In particular, the binary-trees benchmark (http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=csharpcore&id=5) shows the following:

  • Peak RSS on CoreCLR: about 1.5 gigabytes
  • Peak RSS on CoreRT: about 1 gigabyte
  • Running time on CoreCLR: about 46.7 seconds
  • Running time on CoreRT: about 29.6 seconds

As far as we can currently see, the difference in memory consumption is mostly related to differences in GC heuristics.
In particular, we could reduce the memory consumption of CoreCLR on binary-trees by about a factor of 2 by invoking the GC more frequently (see the sketch below).
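For illustration, a sketch of one way to do that (not necessarily how it was done here; the workload step and the interval are invented): periodically force a full blocking collection so the heap is trimmed before the default heuristics would have triggered one.

```csharp
using System;

static class Program
{
    static void Main()
    {
        const int iterations = 10000;
        for (int i = 0; i < iterations; i++)
        {
            RunStep();  // hypothetical allocation-heavy benchmark step

            // Forcing gen2 collections caps the heap at the cost of pause time.
            if (i % 100 == 0)
                GC.Collect(2, GCCollectionMode.Forced);
        }
    }

    static void RunStep() { /* allocate and drop temporary trees */ }
}
```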

Are we correct that the main cause of the difference is related to the GC?
Could you please clarify what the differences in GC are between CoreRT and CoreCLR?

cc @lemmaa @egavrin @Dmitri-Botcharnikov @sergign60 @BredPet @gbalykov @kvochko

egavrin commented Jul 5, 2017

As far as we currently see, the difference in memory consumption is mostly related to differences in GC heuristics.

Unfortunately, that does not explain why we see performance improvements on memory-intensive benchmarks like binary-trees or spectral-norm.

Launch time, obviously, is better on CoreRT: ~45% faster.

jkotas (Member) commented Jul 5, 2017

The GC PAL is incomplete in CoreRT; the performance-related parts are missing:

  • The concurrent/background GC is not enabled in CoreRT yet (it is the default in CoreCLR). You can try rerunning on CoreCLR with the concurrent GC disabled to see whether it is causing the difference (see the sketch after this list).
  • The L1/L2 cache size detection is missing: https://github.com/dotnet/corert/blob/master/src/Native/gc/unix/gcenv.unix.cpp#L389. You can try hardcoding the number that CoreCLR uses on your machine to see whether it is causing the difference.
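A quick way to run the first experiment, using the standard CoreCLR configuration knob (the benchmark path below is hypothetical):

```sh
# Disable the concurrent/background GC for this run, matching CoreRT's
# current behavior, then compare peak RSS and running time.
export COMPlus_gcConcurrent=0
dotnet bin/Release/binary-trees.dll
```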
ruben-ayrapetyan (Contributor) commented Jul 7, 2017

@jkotas, thank you very much for the advice.

We checked CoreCLR with the concurrent GC turned off.

In this configuration, CoreCLR consumes 2 times less RSS at peak, and is about 30% faster than CoreRT, on the binary-trees benchmark.

jkotas (Member) commented Jul 7, 2017

You may be running into dotnet/corert#3784.

These kinds of differences between CoreCLR and CoreRT are a point-in-time problem. The GC performance characteristics should be within noise between CoreCLR and CoreRT by the time we are done.
