[ Question ] Reduce memory consumption of CoreCLR #10380

Closed
ruben-ayrapetyan opened this Issue Mar 22, 2017 · 17 comments

ruben-ayrapetyan (Contributor) commented Mar 22, 2017

Hello.

I am wondering about possible ways to reduce the memory consumption of CoreCLR.
Do you have any ideas about how the working set size could be reduced?
Please share any related ideas, as well as general opinions about this direction of development.

By the way, is there a defined set of rules for choosing between higher performance and lower memory consumption?
Is it accepted practice to add compile-time or runtime switches that allow switching between the two options?

davidfowl (Contributor) commented Mar 22, 2017

Don't allocate 😄

seanshpark (Contributor) commented Mar 22, 2017

Hi Ruben, do you have any profiling results?

ruben-ayrapetyan (Contributor) commented Mar 22, 2017

Hi SaeHie,

Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.

A typical profile of CoreCLR's memory for these GUI applications is the following:

  1. Mapped assembly images - 4.2 megabytes (50%)
  2. JIT-compiler's memory - 1.7 megabytes (20%)
  3. Execution engine - about 1 megabyte (11%)
  4. Code heap - about 1 megabyte (11%)
  5. Type information - about 0.5 megabyte (6%)
  6. Objects heap - about 0.2 megabyte (2%)
egavrin commented Mar 22, 2017

JIT-compiler memory - 1.7 megabytes (20%)

Compiler itself or generated code?

ruben-ayrapetyan (Contributor) commented Mar 22, 2017

JIT-compiler memory - 1.7 megabytes (20%)

Compiler itself or generated code?

Yes: the memory for the compilation itself, without the size of the JIT-compiled code (the code's size is accounted for under "Code heap").

jkotas (Member) commented Mar 22, 2017

Yes, the memory for compilation itself

This memory should be transient. It is not needed once the JIT is done JITing. The JIT keeps some of it around to avoid asking the OS for it again and again. Is the 1.7MB number the high watermark, or do you see it kept around permanently?

The JIT should need less than 100kB to JIT most methods. You may take a look at which (large?) methods take a large amount of memory to JIT, and do something about them.
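To make that suggestion concrete, here is a hypothetical sketch (all names are invented for illustration): when one huge method, often machine-generated initialization code, is what drives the JIT's transient memory up, splitting it into smaller methods keeps the peak low, because each method is JITed independently.

```csharp
using System.Runtime.CompilerServices;

static class LookupTables
{
    // Before: one enormous generated method. The JIT's transient working
    // memory grows with the size of the method being compiled.
    // After: split the body so that each compilation stays small.
    public static void InitAll()
    {
        InitPart1();
        InitPart2();
    }

    // NoInlining keeps each part a separate compilation, so the JIT never
    // has to hold the whole body in memory at once.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void InitPart1() { /* first half of the generated assignments */ }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void InitPart2() { /* second half of the generated assignments */ }
}
```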

jkotas (Member) commented Mar 22, 2017

Don't allocate :-)

This is not necessarily the right answer for optimizing the fixed footprint that this issue is about. The techniques for avoiding allocations (generics, etc.) often make the fixed footprint worse than simply writing straightforward code that allocates a bit of temporary garbage.

Typical profile of CoreCLR's memory on the GUI applications

Excellent! It is always good to start a performance investigation with a measurement.

Higher performance and lower memory consumption? Is it accepted practice to add compile-time or runtime switches that allow switching between the two options?

We do have prior art here: the server GC vs. workstation GC setting is exactly that. The server GC gives higher performance, but it has higher memory consumption as well. We can discuss other similar switches like this.
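For reference, a minimal sketch of how that existing switch is flipped, via the environment variable CoreCLR reads at startup (project-level settings exist as well):

```sh
# Workstation GC (the default) favors a small footprint;
# server GC trades memory for throughput.
export COMPlus_gcServer=1   # processes launched from here on use server GC
```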

Mapped assembly images - 4.2 megabytes (50%)
JIT-compiler's memory - 1.7 megabytes (20%)

These two are obviously the buckets to focus on. For optimizing the footprint of mapped assembly images, you may take a look at the mono linker (https://github.com/mono/linker); @russellhadley and @erozenfeld are looking into using it for .NET Core.

seanshpark (Contributor) commented Mar 22, 2017

Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.

Thanks for sharing the results!

ruben-ayrapetyan (Contributor) commented Apr 3, 2017

@jkotas,

Thank you very much for your comments.

We clarified the measurements.

We also need to add a few comments about them:

  • The measurements were performed with assemblies precompiled in the ReadyToRun format, which is currently not the default in Tizen. When the Fragile format of assemblies is used, the distribution of memory consumption looks quite different.
  • The measurements show the "Private" memory usage of the process, i.e. only the part that is not shared with other processes; the "Shared" part is not accounted for at all in these measurements. Most of the "Mapped assembly images" bucket above is "Private_Clean" (unmodified) memory, which automatically becomes "shared" as soon as the same assembly is mapped into another process. So the actual per-application consumption of the mapped assembly images is much lower in ReadyToRun mode. Please see the new measurements below.

@seanshpark, @jkotas, please see the clarified measurements below.
The following measurements are for the Puzzle sample application (https://developer.tizen.org/sites/default/files/documentation/puzzle2.zip), started alongside another .NET application (so the mapped files are mostly shared).

ReadyToRun mode means that the Tizen-default set of precompiled assemblies is in the ReadyToRun format.
Fragile mode means that the set is in the Fragile format (the format currently used in Tizen).
The values in the cells represent the "Private" (per-application) memory consumption of CoreCLR.

| Component | ReadyToRun mode | Fragile mode |
| --- | --- | --- |
| Mapped assembly images | 1921 kilobytes (37%) | 5130 kilobytes (76%) |
| Execution engine | 1309 kilobytes (25.2%) | 795 kilobytes (11.8%) |
| Objects heap | 690 kilobytes (13.3%) | 506 kilobytes (7.5%) |
| Code heap | 549 kilobytes (10.5%) | 119 kilobytes (1.7%) |
| Type information | 654 kilobytes (12.6%) | 106 kilobytes (1.5%) |
| JIT-compiler's memory | 64 kilobytes (1.2%) | 64 kilobytes (0.9%) |
| Total | 5187 kilobytes (100%) | 6720 kilobytes (100%) |
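As a side note, "Private" values like those above can be obtained on Linux by summing the Private_Clean and Private_Dirty fields over all mappings in /proc/<pid>/smaps. A minimal sketch of such a measurement (assuming Linux; the PID is taken from the first argument):

```csharp
using System;
using System.IO;
using System.Linq;

static class SmapsPrivate
{
    static void Main(string[] args)
    {
        // "self" inspects the current process when no PID is given.
        string pid = args.Length > 0 ? args[0] : "self";

        // Each mapping reports lines like "Private_Dirty:  128 kB";
        // Private_Clean + Private_Dirty is the process-private footprint.
        long privateKb = File.ReadLines($"/proc/{pid}/smaps")
            .Where(l => l.StartsWith("Private_Clean:") || l.StartsWith("Private_Dirty:"))
            .Sum(l => long.Parse(l.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)[1]));

        Console.WriteLine($"Private: {privateKb} kB");
    }
}
```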

Do we understand correctly that the differences in memory distribution between the ReadyToRun and Fragile modes are caused by storing pre-initialized data in the Fragile format? Could you please point us to some documentation, or to places in the code base, that could explain the difference?

jkotas (Member) commented Apr 3, 2017

differences in memory distribution between ReadyToRun and Fragile mode are caused by storing preinitialised data in the Fragile format?

I think so.

documentation or places in code base that could explain the difference?

The pre-initialized data structures in the Fragile format contain a lot of pointers that need to be updated. This is called "restoring" in the code; e.g., look for MethodTable::Restore. Updating the pointers produces the private memory pages.

Creating the data structures at runtime on demand gives you dense packing for free: the private pages contain just the data structures that are needed. The pre-initialized data structures in the fragile images do not have this property (e.g., the program may need only a 100-byte data structure from a given page, but the whole 4 kB page becomes private memory).

ruben-ayrapetyan (Contributor) commented Apr 3, 2017

@jkotas, thank you for the information!

danmosemsft (Member) commented Apr 13, 2017

@ruben-ayrapetyan, as I read it this is answered now; please reopen if not.

ruben-ayrapetyan (Contributor) commented Jul 5, 2017

@jkotas,

We have performed an initial comparison of CoreCLR and CoreRT from the viewpoint of memory consumption, using benchmarks from http://benchmarksgame.alioth.debian.org.

The initial measurements show that CoreCLR consumes approximately 41% more memory on average than CoreRT, and is approximately 4% slower (x64 release build).

In particular, the binary-trees benchmark (http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=csharpcore&id=5) shows the following:

  • Peak RSS on CoreCLR: about 1.5 gigabytes
  • Peak RSS on CoreRT: about 1 gigabyte
  • Running time on CoreCLR: about 46.7 seconds
  • Running time on CoreRT: about 29.6 seconds

As far as we can currently see, the difference in memory consumption is mostly related to differences in GC heuristics.
In particular, we could reduce the memory consumption of CoreCLR on binary-trees by about a factor of 2 by invoking the GC more frequently (see the sketch below).
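For illustration, a sketch of one way to do that (not necessarily how it was done here; the workload step and the interval are invented): periodically force a full blocking collection so the heap is trimmed before the default heuristics would have triggered one.

```csharp
using System;

static class Program
{
    static void Main()
    {
        const int iterations = 10000;
        for (int i = 0; i < iterations; i++)
        {
            RunStep();  // hypothetical allocation-heavy benchmark step

            // Forcing gen2 collections caps the heap at the cost of pause time.
            if (i % 100 == 0)
                GC.Collect(2, GCCollectionMode.Forced);
        }
    }

    static void RunStep() { /* allocate and drop temporary trees */ }
}
```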

Are we correct that the main cause of the difference is related to the GC?
Could you please clarify what the differences in GC are between CoreRT and CoreCLR?

cc @lemmaa @egavrin @Dmitri-Botcharnikov @sergign60 @BredPet @gbalykov @kvochko

egavrin commented Jul 5, 2017

As far as we currently see, the difference in memory consumption is mostly related to differences in GC heuristics.

Unfortunately, that does not explain why we see performance improvements on memory-intensive benchmarks like binary-trees or spectral-norm.

Launch time, obviously, is better on CoreRT: ~45% faster.

jkotas (Member) commented Jul 5, 2017

The GC PAL is incomplete in CoreRT; the performance-related parts are missing:

  • The concurrent/background GC is not enabled in CoreRT yet (it is the default in CoreCLR). You can try rerunning on CoreCLR with the concurrent GC disabled to see whether it is causing the difference (see the sketch after this list).
  • The L1/L2 cache size detection is missing: https://github.com/dotnet/corert/blob/master/src/Native/gc/unix/gcenv.unix.cpp#L389. You can try hardcoding the number that CoreCLR uses on your machine to see whether it is causing the difference.
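A quick way to run the first experiment, using the standard CoreCLR configuration knob (the benchmark path below is hypothetical):

```sh
# Disable the concurrent/background GC for this run, matching CoreRT's
# current behavior, then compare peak RSS and running time.
export COMPlus_gcConcurrent=0
dotnet bin/Release/binary-trees.dll
```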
ruben-ayrapetyan (Contributor) commented Jul 7, 2017

@jkotas, thank you very much for the advice.

We checked CoreCLR with the concurrent GC turned off.

In this configuration, CoreCLR consumes 2 times less RSS at peak, and is about 30% faster than CoreRT, on the binary-trees benchmark.

jkotas (Member) commented Jul 7, 2017

You may be running into dotnet/corert#3784.

These kinds of differences between CoreCLR and CoreRT are a point-in-time problem. The GC performance characteristics should be within noise between CoreCLR and CoreRT by the time we are done.
