
Reduce the effects of Garbage Collection #1

Closed · PragTob opened this issue Jun 18, 2016 · 11 comments

PragTob commented Jun 18, 2016

Micro benchmarking especially can be affected by garbage collection, as single runs will be much slower than the others, leading to a skyrocketing standard deviation and unreliable measurements. Sadly, to the best of my knowledge, one can't turn off GC on the BEAM.

The best breadcrumb to achieve anything like this so far:

> You can try to use erlang:spawn_opt (http://erlang.org/doc/man/erlang.html#spawn_opt-2), setting fullsweep_after and min_heap_size to high values to reduce chances of garbage collection.

This would then go into a new configuration option like: avoid_gc: true/false.
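
A minimal sketch of what that could look like under the hood - `GcAvoidance` and `run_without_gc` are made-up names, and the option values are illustrative, not tuned:

```elixir
defmodule GcAvoidance do
  # Run `fun` in a fresh process whose heap starts big and whose full
  # sweeps are rare, so GC is unlikely to fire mid-measurement, then
  # wait for the result.
  def run_without_gc(fun) do
    parent = self()

    :erlang.spawn_opt(
      fn -> send(parent, {:benchee_result, fun.()}) end,
      # heap sizes are given in words, so 1_000_000 is ~8 MB on 64 bit
      fullsweep_after: 1_000_000,
      min_heap_size: 1_000_000
    )

    receive do
      {:benchee_result, result} -> result
    end
  end
end
```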

Would also need testing with existing benchmarks to see the effect on standard deviation etc. - likely a large-ish operation :)

PragTob commented Aug 15, 2017

Looked into this; spawn_opt has warnings about data locality and such... so whether it works/helps is questionable. The Elixir spawn also uses spawn_opt under the hood. The problem is that some implementation work is probably coming our way - an OTP-supervision-tree, spawn-processes-ourselves sort of way - because (to the best of my understanding) you can't just pass spawn_opts to Task.async somehow, or set them before/afterwards.

So for this use case - which might work out or not - we'd either reimplement the relevant parts of Task.async/await or build our own OTP supervision tree.

It shouldn't be that much, just more than I can fathom in an evening after a long long day :)

(I wanted to give this a head start before maybe tackling memory measurements, as it should improve results, but yeah, probably not now :))

OvermindDL1 commented:

> So for this use case - which might work out or not - we'd either reimplement the relevant parts of Task.async/await or build our own OTP supervision tree.

Which is pretty trivial. Just spawn processes and link/monitor them and wait for messages or timeouts.
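
For illustration, a bare-bones version of that pattern (`MiniTask` and its function names are hypothetical) that still leaves room to pass spawn_opt options through:

```elixir
defmodule MiniTask do
  # Spawn `fun` with extra spawn_opt options and a monitor attached.
  def async_opt(fun, spawn_opts) do
    parent = self()

    # With :monitor in the options, spawn_opt returns {pid, monitor_ref}.
    :erlang.spawn_opt(
      fn -> send(parent, {:mini_task_result, self(), fun.()}) end,
      [:monitor | spawn_opts]
    )
  end

  # Wait for either the result, a crash, or a timeout.
  def await_result({pid, ref}, timeout \\ 5_000) do
    receive do
      {:mini_task_result, ^pid, result} ->
        Process.demonitor(ref, [:flush])
        {:ok, result}

      {:DOWN, ^ref, :process, ^pid, reason} ->
        {:error, reason}
    after
      timeout ->
        {:error, :timeout}
    end
  end
end
```

Usage would then be something like `fun |> MiniTask.async_opt(min_heap_size: 1_000_000) |> MiniTask.await_result()`.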

PragTob commented Aug 16, 2017

> > So for this use case - which might work out or not - we'd either reimplement the relevant parts of Task.async/await or build our own OTP supervision tree.
>
> Which is pretty trivial. Just spawn processes and link/monitor them and wait for messages or timeouts.

Yup, I just haven't done this a whole lot yet 😅 I guess it's a good way to add to my experience 😁 It's at least "slightly" more difficult than pure async/await, and I'd hoped there'd be an interface into those that we could just use.

devonestes commented:

In thinking about this before, I wondered if this might be something useful to add to Elixir itself. Maybe we should propose this feature for Task.async on the mailing list? My only worry is that we'd technically just be exposing something that's already available in Erlang, but since they've already gone down that road with Task, maybe they won't mind?

PragTob commented Aug 17, 2017

Yeah, I saw a couple of different ways Task spawns processes... it might be worth proposing; I'd go and educate myself before that (right now I feel like spawn_opt is just more general), and as always there is no harm in proposing it :) deep_merge didn't make it, but the discussion also taught me a lot.

Also, I can see an argument being made that if you care that much about specific process options, maybe you shouldn't be using a high-level abstraction anyway 👼

However, I'm pretty sure we'd still need to implement it ourselves here - that'd be Elixir 1.6 at the earliest, I believe, and until we wanna drop support for everything lower than that... :|

michalmuskala commented:

Here's an idea for a different approach.

As I understand it, the problem with the cost of garbage collection is that it is spread unevenly between runs. Disabling GC completely is one way of evening it out - making GC always run is another. This means execution would be slower than it normally would be, but it would be consistent.

The way this could be achieved: during warmup, don't run GC and allow the heap to grow to the desired size, manually run GC with :erlang.garbage_collect() before running any tests to compact everything, then run :erlang.garbage_collect() at the end of each run and count the time it takes towards execution - an algorithm that is faster but creates more garbage can be slower overall.
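
A rough sketch of that measurement loop (a hypothetical helper, not benchee's actual implementation):

```elixir
defmodule EvenGc do
  # Compact the heap once before measuring, then time the function
  # together with a forced collection, so GC cost is paid on every
  # run instead of randomly on some runs.
  def measure_with_gc(fun) do
    :erlang.garbage_collect()

    {time_us, _result} =
      :timer.tc(fn ->
        result = fun.()
        # count the cleanup towards this run's execution time
        :erlang.garbage_collect()
        result
      end)

    time_us
  end
end
```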

PragTob commented Dec 9, 2017

👋

Thanks for the input, Michal!

IIRC we actually run garbage collection before warmup and before the real execution. The idea to run garbage collection after every invocation - without counting it towards the total runtime - is there and is already possible through the after_each hooks. For an experiment with different approaches and combinations, see my comment in #160 :)

Counting garbage collection towards the runtime is undesirable imo. When you look at the example above, with an input size of 1k you can see that the sample size is more than 10 times smaller, which also means that both together are more than 10 times slower (due to the overhead GC incurs) - so especially for small benchmarks, that'd leave you without any distinction, as GC makes up the majority of the run time.

At the same time, the behaviour - if wanted - is easy to get on the user side by just adding :erlang.garbage_collect() to the benchmarking function, if I understood you correctly :)
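
For reference, the hook variant (after_each is real benchee API; the benchmark itself is just the usual flat_map sample) would look roughly like this:

```elixir
list = Enum.to_list(1..10_000)
map_fun = fn i -> [i, i * i] end

Benchee.run(
  %{"flat_map" => fn -> Enum.flat_map(list, map_fun) end},
  # force a collection between invocations; hook time is not measured
  after_each: fn _result -> :erlang.garbage_collect() end
)
```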

michalmuskala commented:

Ah, I saw this discussion some time ago (I think you linked it on twitter), but missed it now when I was looking through issues.

Maybe a setting like gc_after: :every | {1, :second} | {10, :runs} could be useful? This would allow tuning it for the particular benchmark until it's useful. I'm not sure what a sane default could be, though - maybe measure during warmup how often the GC runs and then, during the actual runs, run the GC manually just as often (or slightly more often, so it's not triggered accidentally during runs)?
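
To make the suggestion concrete, the configuration might look like this (gc_after is purely hypothetical - it was never implemented in benchee):

```elixir
list = Enum.to_list(1..10_000)

Benchee.run(
  %{"flat_map" => fn -> Enum.flat_map(list, fn i -> [i, i * i] end) end},
  # hypothetical option; :every or {1, :second} would be the other shapes
  gc_after: {10, :runs}
)
```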

PragTob commented Dec 9, 2017

Huh, quite the intricate/interesting ideas you've got! :) I like the last idea a lot; it sounds nice and self-tuning. As long as people don't benchmark random stuff (which I'd like to add btw. - e.g. the data generation from property-based testing), the intervals in which GCs occur could/should be quite stable - but then again, with a growing and shrinking heap it might be tricky.

A first version can very well be "do it yourself in an after_each however you see fit", and then we can still implement built-in configuration later - which is also nice, as it can then be shown as part of the benchmark configuration.

devonestes commented:

One of the other issues with garbage collection as it stands now is that it makes accurately measuring the memory usage of a particular scenario extremely unreliable. In my earlier spikes on measuring that, sometimes the net memory usage for a scenario would even come out as a negative number, since garbage collection was running randomly.

So while this is inconvenient for measuring runtimes, it actually feels like a blocker when it comes to measuring memory usage. I just learned that the memory measurement feature has even been disabled in benchfella, since they ran into the same issues we have.

PragTob commented Mar 10, 2018

We had an experiment in #160 - that didn't seem to help much and sometimes even made things worse. I checked out that branch again and determined that a simple after_each with a manual garbage collection call creates good enough results, should that be wanted:

```
tobi@speedy ~/github/benchee $ mix run samples/multiple_inputs.exs # avoid_gc branch
Compiling 10 files (.ex)
Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Number of Available Cores: 8
Available memory: 15.61 GB
Elixir 1.5.2
Erlang 20.0
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
parallel: 1
inputs: Bigger, Small
Estimated total run time: 56 s


Benchmarking flat_map with input Bigger...
Benchmarking flat_map with input Small...
Benchmarking map.flatten with input Bigger...
Benchmarking map.flatten with input Small...

##### With input Bigger #####
Name                  ips        average  deviation         median         99th %
flat_map           200.67        4.98 ms     ±2.98%        4.94 ms        5.37 ms
map.flatten        135.24        7.39 ms    ±23.64%        7.26 ms        8.19 ms

Comparison: 
flat_map           200.67
map.flatten        135.24 - 1.48x slower

Extended statistics: 

Name                minimum        maximum    sample size                     mode
flat_map            4.88 ms        8.30 ms         1.00 K                  4.92 ms
map.flatten         7.19 ms       52.29 ms            676                  7.22 ms

##### With input Small #####
Name                  ips        average  deviation         median         99th %
flat_map          18.18 K       55.02 μs   ±894.01%          53 μs          77 μs
map.flatten       13.62 K       73.45 μs   ±633.71%          71 μs          78 μs

Comparison: 
flat_map          18.18 K
map.flatten       13.62 K - 1.33x slower

Extended statistics: 

Name                minimum        maximum    sample size                     mode
flat_map              51 μs      146781 μs        88.98 K                    53 μs
map.flatten           70 μs      120557 μs        67.01 K                    71 μs
tobi@speedy ~/github/benchee $ mix run samples/multiple_inputs.exs # master at time of avoid_gc branch
Compiling 10 files (.ex)
Generated benchee app
Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Number of Available Cores: 8
Available memory: 15.61 GB
Elixir 1.5.2
Erlang 20.0
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
parallel: 1
inputs: Bigger, Small
Estimated total run time: 56 s


Benchmarking flat_map with input Bigger...
Benchmarking flat_map with input Small...
Benchmarking map.flatten with input Bigger...
Benchmarking map.flatten with input Small...

##### With input Bigger #####
Name                  ips        average  deviation         median         99th %
flat_map           152.09        6.58 ms    ±11.91%        6.53 ms        8.49 ms
map.flatten        115.65        8.65 ms    ±13.75%        8.38 ms       11.60 ms

Comparison: 
flat_map           152.09
map.flatten        115.65 - 1.32x slower

Extended statistics: 

Name                minimum        maximum    sample size                     mode
flat_map            4.46 ms       10.33 ms            760         6.53 ms, 6.14 ms
map.flatten         6.75 ms       11.80 ms            578                  7.66 ms

##### With input Small #####
Name                  ips        average  deviation         median         99th %
flat_map          27.38 K       36.53 μs    ±10.64%          36 μs          45 μs
map.flatten       18.88 K       52.96 μs    ±13.04%          52 μs          89 μs

Comparison: 
flat_map          27.38 K
map.flatten       18.88 K - 1.45x slower

Extended statistics: 

Name                minimum        maximum    sample size                     mode
flat_map              36 μs        1090 μs       132.71 K                    36 μs
map.flatten           51 μs        1033 μs        92.46 K                    51 μs
tobi@speedy ~/github/benchee $ mix run samples/multiple_inputs.exs # master at time of avoid_gc branch with after_each garbage collect
Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Number of Available Cores: 8
Available memory: 15.61 GB
Elixir 1.5.2
Erlang 20.0
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
parallel: 1
inputs: Bigger, Small
Estimated total run time: 56 s


Benchmarking flat_map with input Bigger...
Benchmarking flat_map with input Small...
Benchmarking map.flatten with input Bigger...
Benchmarking map.flatten with input Small...

##### With input Bigger #####
Name                  ips        average  deviation         median         99th %
flat_map           252.47        3.96 ms     ±4.87%        3.96 ms        4.42 ms
map.flatten        181.72        5.50 ms     ±4.40%        5.48 ms        5.83 ms

Comparison: 
flat_map           252.47
map.flatten        181.72 - 1.39x slower

Extended statistics: 

Name                minimum        maximum    sample size                     mode
flat_map            3.68 ms        7.27 ms         1.04 K         4.04 ms, 3.97 ms
map.flatten         5.22 ms        9.45 ms            784                  5.44 ms

##### With input Small #####
Name                  ips        average  deviation         median         99th %
flat_map          25.96 K       38.52 μs    ±16.30%          37 μs          68 μs
map.flatten       18.18 K       55.01 μs    ±16.01%          52 μs          89 μs

Comparison: 
flat_map          25.96 K
map.flatten       18.18 K - 1.43x slower

Extended statistics: 

Name                minimum        maximum    sample size                     mode
flat_map              36 μs         119 μs         9.04 K                    37 μs
map.flatten           51 μs         148 μs         8.80 K                    52 μs
tobi@speedy ~/github/benchee $
```

You take a big hit in sample size (depending on what you do), but that is ok if you really wanna avoid GC. Tuning the warmup time up, so that the process heap can grow to a suitable size, might also help.

PragTob closed this as completed Mar 10, 2018