Replace folsom histograms #4650
Going by https://github.com/open-telemetry/oteps/blob/main/text/0149-exponential-histogram.md and https://opentelemetry.io/blog/2022/exponential-histograms/ it seems exponential base-2 histograms are popular. The scheme to replace the Folsom sliding window histograms could be based on persistent terms and counters.

The bucket index can be computed with a simple hack: start with the plain float representation of the value. The exponent can then be scaled by a precision value by shifting left. The temporal (windowing) aspect can be implemented by having 15 or so counter arrays and writing to the one for the current second.
@nickva nicely done demonstrating the lock collision overhead for the histograms. That said, it seems like the underlying histogram implementation is orthogonal to the issue of concurrent updates to the same histogram inducing locks. Your example creates a single global histogram that would be concurrently updated, just like we do with Folsom. So how do the alternatives to Folsom address the issue of concurrent lock collisions? For bucketing-based histograms, given the same accuracy and range, they should be additive in theory. Were you hoping to achieve that with the histograms you linked? It's possible we could use a counter-based ETS backend for creating bucketing histograms, but counters are designed to be incremented and decremented, whereas the sampling-based histogram we use with Folsom directly goes and sets the histogram reservoir values based on the current sample.
It's also worth remembering that the Folsom histograms are front-loaded on the number of samples actually collected (for better or worse), which means that the percentage of couch_stats histogram updates that actually triggers an ETS update goes down as the number of requests per second goes up. I personally think this is not great, as the default bucket size of 1024 entries in a given one-second "moment" leaves huge amounts of data on the floor; for example, if 100k histogram updates are called in a given second, the last update will have something like a 1:99k chance of actually impacting the histogram. So one thing to watch out for is that switching to a bucketing-based histogram that is explicitly accurate, incrementing a bucket for every histogram update, will drastically increase the amount of lock contention. In the example above, all 100k requests would actually modify the histogram, as opposed to Folsom's sampling-based approach, which has a chance to update the histogram inversely proportional to the number of requests made in the current moment. So again, I believe the fundamental issue at hand is concurrent updates to histograms, and any implementation we run with needs to address that issue directly.
I tried to compare the two approaches:

```erlang
-module(upstats).
-export([go/0, go/1, go/2]).

go() ->
    go(1000, 10000).

go(N) ->
    go(N, 10000).

go(N, X) ->
    T0 = erlang:monotonic_time(),
    Workers = [spawn_monitor(fun() ->
        rand:seed(default, os:timestamp()),
        upstat(X)
    end) || _ <- lists:seq(1, N)],
    {_Pids, Refs} = lists:unzip(Workers),
    WorkersSet = sets:from_list(Refs, [{version, 2}]),
    ok = wait_workers(WorkersSet),
    Dt = erlang:monotonic_time() - T0,
    erlang:convert_time_unit(Dt, native, millisecond).

upstat(0) ->
    ok;
upstat(Times) ->
    Hist = case rand:uniform(5) of
        1 -> [couchdb, request_time];
        2 -> [couchdb, dbinfo];
        3 -> [couchdb, httpd, bulk_docs];
        4 -> [fsync, time];
        5 -> [couchdb, db_open_time]
    end,
    couch_stats:update_histogram(Hist, 100),
    upstat(Times - 1).

%% Version 2 sets are maps underneath, so the is_map/1 guard holds.
wait_workers(Workers) when is_map(Workers) ->
    case sets:size(Workers) of
        0 ->
            ok;
        _ ->
            receive
                {'DOWN', Ref, process, _, normal} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    wait_workers(Workers1);
                {'DOWN', Ref, process, _, Err} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    io:format("~n worker ~p crashed: ~p~n", [Ref, Err]),
                    wait_workers(Workers1)
            end
    end.
```

With an example counters update:

```erlang
-module(upstats_counter).
-export([go/0, go/1, go/2]).

go() ->
    go(1000, 10000).

go(N) ->
    go(N, 10000).

go(N, X) ->
    Ref = counters:new(1024, [write_concurrency]),
    persistent_term:put(term, Ref),
    T0 = erlang:monotonic_time(),
    Workers = [spawn_monitor(fun() ->
        rand:seed(default, os:timestamp()),
        upstat(X)
    end) || _ <- lists:seq(1, N)],
    {_Pids, Refs} = lists:unzip(Workers),
    WorkersSet = sets:from_list(Refs, [{version, 2}]),
    ok = wait_workers(WorkersSet),
    Dt = erlang:monotonic_time() - T0,
    erlang:convert_time_unit(Dt, native, millisecond).

upstat(0) ->
    ok;
upstat(Times) ->
    Ref = persistent_term:get(term),
    counters:add(Ref, rand:uniform(1024), 100),
    upstat(Times - 1).

%% Version 2 sets are maps underneath, so the is_map/1 guard holds.
wait_workers(Workers) when is_map(Workers) ->
    case sets:size(Workers) of
        0 ->
            ok;
        _ ->
            receive
                {'DOWN', Ref, process, _, normal} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    wait_workers(Workers1);
                {'DOWN', Ref, process, _, Err} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    io:format("~n worker ~p crashed: ~p~n", [Ref, Err]),
                    wait_workers(Workers1)
            end
    end.
```

Results (`go/2` returns elapsed wall-clock time in milliseconds):

```
(node1@127.0.0.1)18> upstats_counter:go(50000, 1000).
1385
(node1@127.0.0.1)19> upstats_counter:go(50000, 1000).
1425
(node1@127.0.0.1)20> upstats_counter:go(50000, 1000).
1427
(node1@127.0.0.1)21> upstats_counter:go(50000, 1000).
1438
(node1@127.0.0.1)22> upstats:go(50000, 1000).
15131
(node1@127.0.0.1)23> upstats:go(50000, 1000).
16075
(node1@127.0.0.1)24> upstats:go(50000, 1000).
14869
```

Even with Folsom front-loading and updating only 1024 values per second, it's still 10x slower!
The base-2 histograms would be additive and would record every single update.

I chatted with @nickva directly and he mentioned that

That's why I thought counters might work here. With

Seems to check out based on OTP source
The basic idea for preserving the windowed updates, with 10 second windows and 1 second granularity like we have currently, could be to use 10 counter arrays (in a tuple, for instance, to allow quick iteration), one per second.

In order for the trimmer/cleaner process not to step on the toes of current updaters, we'd create 15 or 20 counter arrays, to allow plenty of time between when the trimmer process resets the old counters and some delayed updaters still updating them. Using monotonic time prevents the updaters from going backwards, but there could be some time between them reading the monotonic time and the update itself. In this way we avoid the need to modify (delete old and create new) counter arrays in a persistent term. The persistent histogram term structure remains immutable; the only things that get updated are the counter values themselves, which should be efficient.
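The circular window indexing described above can be sketched in Python for illustration (the real thing would use Erlang counters arrays held in a persistent term; `TIME_WINDOW`, `BIN_COUNT`, and the function names here are made up):

```python
# Sketch of a sliding-window histogram: one bin array per second,
# indexed circularly by monotonic time. TIME_WINDOW is larger than
# the 10 seconds we report so the cleaner and late updaters don't overlap.
TIME_WINDOW = 15   # counter arrays (seconds); only the last 10 are reported
BIN_COUNT = 64

# hist[t][b] plays the role of one counters array per second
hist = [[0] * BIN_COUNT for _ in range(TIME_WINDOW)]

def update(now_sec, bin_index, value=1):
    # Writers pick the slot for the current second; old slots get
    # reused as time wraps around, so no arrays are ever replaced.
    hist[now_sec % TIME_WINDOW][bin_index] += value

def clean(now_sec):
    # The periodic cleaner resets the slot that is about to be reused,
    # leaving a few seconds of buffer for straggling updaters.
    stale = (now_sec + 1) % TIME_WINDOW
    hist[stale] = [0] * BIN_COUNT
```

Because updates only ever mutate integers in an existing slot, the containing structure never needs to be rebuilt, which is exactly the property that lets it live in a persistent term.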
To go with float base-2 exponential binning (to capture as much range as possible in each counters array) I threw this together:

```erlang
-define(EXPONENT_BIAS, 1023).
-define(MANTISSA_BITS, 52).
-define(EXPONENT_BITS, 11).

bin_index(Val, Scale) when is_integer(Val) ->
    bin_index(float(Val), Scale);
bin_index(Val, Scale) when is_float(Val), is_integer(Scale), Scale >= 0, Scale < ?MANTISSA_BITS ->
    {Exponent, Mantissa} = exp(Val),
    ExponentBits = Exponent bsl Scale,
    MantissaBits = Mantissa bsr (?MANTISSA_BITS - Scale),
    ExponentBits bor MantissaBits.

exp(Val) when is_float(Val) ->
    <<0:1, Exponent:?EXPONENT_BITS, Mantissa:?MANTISSA_BITS>> = <<Val/float>>,
    {Exponent - ?EXPONENT_BIAS, Mantissa}.
```

At first I thought of using just the exponent part directly, however the bins end up spaced out a bit too much. So I used the idea from https://github.com/newrelic-experimental/newrelic-sketch-java/tree/main/src/main/java/com/newrelic/nrsketch/indexer to add some more precision from the mantissa into the bin index. First the exponent is shifted left a bit "to make some room" for the extra bits. Then we shift the mantissa bits right and keep only its top most significant bits (with `Scale` = 2, the top two bits). This is a bit of a low-level bit hack; here is a good reminder of the structure of the floating point encoding: https://fabiensanglard.net/floating_point_visually_explained/
https://floating-point-gui.de/formats/fp/ also has some links to pages with visualizations.
Thinking more about how we compute the bin index, we can grab the top bits of the float directly:

```erlang
bin_index(Val) ->
    <<0:1, BinIndex:?INDEX_BITS, _/bitstring>> = <<Val/float>>,
    BinIndex.
```

This however returns the exponent with the bias still in it, but since we already have to handle an offset to transform bin indices into the [1, ..., MaxBins] range, we'll just have a large offset now (8000 or so).
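That the "grab the top bits" shortcut differs from the scaled index only by a constant offset can be checked with a small Python sketch (illustrative; function names are made up, positive inputs assumed):

```python
import struct

MANTISSA_BITS = 52
EXPONENT_BIAS = 1023

def float_bits(val):
    # Reinterpret the IEEE 754 double as a 64-bit integer.
    return struct.unpack(">Q", struct.pack(">d", float(val)))[0]

def top_bits_index(val, scale):
    # Take the (11 + scale) bits right after the sign bit, directly.
    return (float_bits(val) >> (MANTISSA_BITS - scale)) & ((1 << (11 + scale)) - 1)

def scaled_index(val, scale):
    # The earlier scheme: unbias the exponent, then mix in mantissa bits.
    bits = float_bits(val)
    exponent = ((bits >> MANTISSA_BITS) & 0x7FF) - EXPONENT_BIAS
    mantissa = bits & ((1 << MANTISSA_BITS) - 1)
    return (exponent << scale) | (mantissa >> (MANTISSA_BITS - scale))
```

The two differ by exactly `EXPONENT_BIAS << scale`, e.g. 8184 at scale 3, which is where the "8000 or so" offset comes from.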
A quick proof of concept: https://gist.github.com/nickva/d1e5a086107e2f0e824e1c49ff177507

```erlang
test_speed() ->
    H = new(15),
    timer:tc(fun() ->
        lists:foreach(fun(I) -> fhist:update(H, I, 10000.0) end, lists:seq(1, 1000000))
    end).
```

```
> fhist:test_speed().
{243211,ok}
```

We can do a bin selection (normally a
Since the main goal is to improve performance under higher concurrency, and avoid bottlenecking on a single ETS lock, I created a benchmark which compares the old (Folsom) histogram updates with the new ones under concurrency: https://gist.github.com/nickva/c057ee081e59359aca3c6e78ffb3d263

At the 10k concurrency level, taking the best times, the new histogram implementation is 25x faster.

At the 1k concurrency level, using best times, the new histogram implementation is 20x faster.

This confirms that the new histogram implementation is faster and is less of a bottleneck during concurrent updates.
Folsom histograms are a major bottleneck under high concurrency, as described in #4650. This was noticed during performance testing, confirmed using Erlang VM lock counting, then verified by creating a test release with histogram update logic commented out [1].

CouchDB doesn't use most of the Folsom statistics and metrics; we only use counters, gauges, and one type of sliding-window, sampling histogram. Instead of trying to re-design and update Folsom, which is a generic stats and metrics library, take a simpler approach and create just the three metrics we need, then remove the Folsom and Bear dependencies altogether.

All the metrics types we re-implement are based on two relatively new Erlang/OTP features: counters [2] and persistent terms [3]. Counters are mutable arrays of integers which allow fast concurrent updates, and persistent terms allow fast, global, constant-time access to Erlang terms. Gauges and counters are implemented as counter arrays with one element. Histograms are represented as counter arrays where each array element is a histogram bin. Since we're dealing with sliding time window histograms, we have a tuple of counter arrays, where each time instant (each second) is a counter array. The overall histogram object then looks something like:

```
Histogram = {
    1 = [1, 2, ..., ?BIN_COUNT]
    2 = [1, 2, ..., ?BIN_COUNT]
    ...
    TimeWindow = [1, 2, ..., ?BIN_COUNT]
}
```

To keep the structure immutable we need to set a limit on both the number of bins and the time window size. To limit the number of bins we need to set some minimum and maximum value limits. Since almost all our histograms record access times in milliseconds, we pick a range from 10 microseconds up to over one hour. Histogram bin widths increase exponentially in order to keep a reasonable precision across the whole range of values. This encoding is similar to how floating point numbers work. Additional details on how this works are described in the `couch_stats_histogram.erl` module.

To keep the histogram object structure immutable, the time window is used in a circular fashion. The time parameter to the histogram `update/3` function is the monotonic clock time, and the histogram time index is computed as `Time rem TimeWindow`. So, as the monotonic time advances, the histogram time index will loop around. This comes with the minor annoyance of having to allocate a larger time window to accommodate a process which cleans stale (expired) histogram entries, possibly with some extra buffer to ensure the currently updated interval and the interval ready to be cleaned do not overlap. This periodic cleanup is performed in the couch_stats_server process.

Besides performance, the new histograms have two other improvements over the Folsom ones:
- They record every single value. The previous histograms did sampling and recorded mostly just the first 1024 values during each time instant (second).
- They are mergeable. Multiple histograms can be merged with corresponding bins summed together. This could allow cluster-wide histogram summaries, or gathering histograms from individual processes and then combining them at the end in a central process.

Another performance improvement in this commit is eliminating the need to periodically flush or scrape stats in the background in both the couch_stats and prometheus apps. Fetching stats from persistent terms and counters takes less than 5 milliseconds, and the sliding time window histogram will always return the last 10 seconds of data no matter when the stats are queried. Now that work is done only when the stats are actually queried.

Since the Folsom library was abstracted away behind the couch_stats API, the rest of the applications do not need to be updated. They still call `couch_stats:update_histogram/2`, `couch_stats:increment_counter/1`, etc.

Previously couch_stats did not have any tests at all. Folsom and Bear had some tests, but I don't think we ever ran those test suites. To rectify the situation, tests were added to cover the functionality. All the newly added or updated modules should have near or exactly 100% test coverage.

[1] #4650 (comment)
[2] https://www.erlang.org/doc/man/counters.html
[3] https://www.erlang.org/doc/man/persistent_term.html
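Since the bins are plain integer counts, the mergeability mentioned above is just element-wise addition of bin arrays; a minimal sketch in Python (illustrative only, not the actual couch_stats code):

```python
def merge(hist_a, hist_b):
    """Merge two histograms that share the same binning (same scale,
    same bin count) by summing corresponding bins."""
    assert len(hist_a) == len(hist_b), "histograms must share the same binning"
    return [a + b for a, b in zip(hist_a, hist_b)]

# Hypothetical per-bin counts collected on two nodes with identical binning:
node1 = [0, 5, 12, 3]
node2 = [1, 0, 7, 2]
```

Because addition is associative and commutative, histograms can be combined in any order, which is what makes per-process or cluster-wide aggregation straightforward.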
The Folsom and Bear replacement is implemented in #4672. The full implementation preserves the same 20x speedup as in the initial prototype: #4672 (comment)
Folsom histograms are a major bottleneck under high concurrency, as described in #4650. This was noticed during performance testing, confirmed using Erlang VM lock counting, then verified by creating a test release with histogram update logic commented out [1]. CouchDB doesn't use most of the Folsom statistics and metrics; we only use counters, gauges and one type of sliding window, sampling histogram. Instead of trying to re-design and update Folsom, which is a generic stats and metrics library, take a simpler approach and create just the three metrics we need, and then remove Folsom and Bear dependencies altogether. All the metrics types we re-implement are based on two relatively new Erlang/OTP features: counters [2] and persistent terms [3]. Counters are mutable arrays of integers, which allow fast concurrent updates, and persistent terms allow fast, global, constant time access to Erlang terms. Gauges and counters are implemented as counter arrays with one element. Histograms are represented as counter arrays where each array element is a histogram bin. Since we're dealing with sliding time window histograms, we have a tuple of counter arrays, where each time instant (each second) is a counter array. The overall histogram object then looks something like: ``` Histogram = { 1 = [1, 2, ..., ?BIN_COUNT] 2 = [1, 2, ..., ?BIN_COUNT] ... TimeWindow = [1, 2, ..., ?BIN_COUNT] } ``` To keep the structure immutable we need to set a limit on both the number of bins and the time window size. To limit the number of bins we need to set some minimum and maximum value limits. Since almost all our histograms record access times in milliseconds, we pick a range from 10 microseconds up to over one hour. Histogram bin widths are increasing exponentially in order to keep a reasonable precision across the whole range of values. This encoding is similar to how floating point numbers work. 
Additional details on how this works are described in the the `couch_stats_histogram.erl` module. To keep the histogram object structure immutable, the time window is used in a circular fashion. The time parameter to the histogram `update/3` function is the monotonic clock time, and the histogram time index is computed as `Time rem TimeWindow`. So, as the monotonic time is advancing forward, the histogram time index will loop around. This comes with a minor annoyance of having to allocate a larger time window to accommodate some process which cleans stale (expired) histogram entries, possibly with some extra buffers to ensure the currently updated interval and the interval ready to be cleaned would not overlap. This periodic cleanup is performed in the couch_stats_server process. Besides performance, the new histograms have two other improvement over the Folsom ones: - They record every single value. Previous histograms did sampling and recorded mostly just the first 1024 values during each time instant (second). - They are mergeable. Multiple histograms can be merged with corresponding bins summed together. This could allow cluster wide histogram summaries or gathering histograms from individual processes, then combining them at the end in a central process. Other performance improvement in this commit is eliminating the need to periodically flush or scrape stats in the background in both couch_stats and prometheus apps. Stats fetching from persistent terms and counters takes less than 5 milliseconds, and sliding time window histogram will always return the last 10 seconds of data no matter when the stats are queried. Now that will be done only when the stats are actually queried. Since the Folsom library was abstracted away behind a couch_stats API, the rest of the applications do not need to be updated. They still call `couch_stats:update_histogram/2`, `couch_stats:increment_counter/1`, etc. Previously couch_stats did not have any tests at all. 
Folsom and Bear had some tests, but I don't think we ever ran those test suites. To rectify the situation added tests to cover the functionality. All the newly added or updated modules should be have near or exactly 100% test coverage. [1] #4650 (comment) [2] https://www.erlang.org/doc/man/counters.html [3] https://www.erlang.org/doc/man/persistent_term.html
Folsom histograms are a major bottleneck under high concurrency, as described in #4650. This was noticed during performance testing, confirmed using Erlang VM lock counting, then verified by creating a test release with histogram update logic commented out [1]. CouchDB doesn't use most of the Folsom statistics and metrics; we only use counters, gauges and one type of sliding window, sampling histogram. Instead of trying to re-design and update Folsom, which is a generic stats and metrics library, take a simpler approach and create just the three metrics we need, and then remove Folsom and Bear dependencies altogether. All the metrics types we re-implement are based on two relatively new Erlang/OTP features: counters [2] and persistent terms [3]. Counters are mutable arrays of integers, which allow fast concurrent updates, and persistent terms allow fast, global, constant time access to Erlang terms. Gauges and counters are implemented as counter arrays with one element. Histograms are represented as counter arrays where each array element is a histogram bin. Since we're dealing with sliding time window histograms, we have a tuple of counter arrays, where each time instant (each second) is a counter array. The overall histogram object then looks something like: ``` Histogram = { 1 = [1, 2, ..., ?BIN_COUNT] 2 = [1, 2, ..., ?BIN_COUNT] ... TimeWindow = [1, 2, ..., ?BIN_COUNT] } ``` To keep the structure immutable we need to set a limit on both the number of bins and the time window size. To limit the number of bins we need to set some minimum and maximum value limits. Since almost all our histograms record access times in milliseconds, we pick a range from 10 microseconds up to over one hour. Histogram bin widths are increasing exponentially in order to keep a reasonable precision across the whole range of values. This encoding is similar to how floating point numbers work. 
Additional details on how this works are described in the the `couch_stats_histogram.erl` module. To keep the histogram object structure immutable, the time window is used in a circular fashion. The time parameter to the histogram `update/3` function is the monotonic clock time, and the histogram time index is computed as `Time rem TimeWindow`. So, as the monotonic time is advancing forward, the histogram time index will loop around. This comes with a minor annoyance of having to allocate a larger time window to accommodate some process which cleans stale (expired) histogram entries, possibly with some extra buffers to ensure the currently updated interval and the interval ready to be cleaned would not overlap. This periodic cleanup is performed in the couch_stats_server process. Besides performance, the new histograms have two other improvement over the Folsom ones: - They record every single value. Previous histograms did sampling and recorded mostly just the first 1024 values during each time instant (second). - They are mergeable. Multiple histograms can be merged with corresponding bins summed together. This could allow cluster wide histogram summaries or gathering histograms from individual processes, then combining them at the end in a central process. Other performance improvement in this commit is eliminating the need to periodically flush or scrape stats in the background in both couch_stats and prometheus apps. Stats fetching from persistent terms and counters takes less than 5 milliseconds, and sliding time window histogram will always return the last 10 seconds of data no matter when the stats are queried. Now that will be done only when the stats are actually queried. Since the Folsom library was abstracted away behind a couch_stats API, the rest of the applications do not need to be updated. They still call `couch_stats:update_histogram/2`, `couch_stats:increment_counter/1`, etc. Previously couch_stats did not have any tests at all. 
Folsom and Bear had some tests, but I don't think we ever ran those test suites. To rectify the situation added tests to cover the functionality. All the newly added or updated modules should be have near or exactly 100% test coverage. [1] #4650 (comment) [2] https://www.erlang.org/doc/man/counters.html [3] https://www.erlang.org/doc/man/persistent_term.html
Folsom histograms are a major bottleneck under high concurrency, as described in #4650. This was noticed during performance testing, confirmed using Erlang VM lock counting, then verified by creating a test release with histogram update logic commented out [1]. CouchDB doesn't use most of the Folsom statistics and metrics; we only use counters, gauges and one type of sliding window, sampling histogram. Instead of trying to re-design and update Folsom, which is a generic stats and metrics library, take a simpler approach and create just the three metrics we need, and then remove Folsom and Bear dependencies altogether. All the metrics types we re-implement are based on two relatively new Erlang/OTP features: counters [2] and persistent terms [3]. Counters are mutable arrays of integers, which allow fast concurrent updates, and persistent terms allow fast, global, constant time access to Erlang terms. Gauges and counters are implemented as counter arrays with one element. Histograms are represented as counter arrays where each array element is a histogram bin. Since we're dealing with sliding time window histograms, we have a tuple of counter arrays, where each time instant (each second) is a counter array. The overall histogram object then looks something like: ``` Histogram = { 1 = [1, 2, ..., ?BIN_COUNT] 2 = [1, 2, ..., ?BIN_COUNT] ... TimeWindow = [1, 2, ..., ?BIN_COUNT] } ``` To keep the structure immutable we need to set a limit on both the number of bins and the time window size. To limit the number of bins we need to set some minimum and maximum value limits. Since almost all our histograms record access times in milliseconds, we pick a range from 10 microseconds up to over one hour. Histogram bin widths are increasing exponentially in order to keep a reasonable precision across the whole range of values. This encoding is similar to how floating point numbers work. 
Additional details on how this works are described in the the `couch_stats_histogram.erl` module. To keep the histogram object structure immutable, the time window is used in a circular fashion. The time parameter to the histogram `update/3` function is the monotonic clock time, and the histogram time index is computed as `Time rem TimeWindow`. So, as the monotonic time is advancing forward, the histogram time index will loop around. This comes with a minor annoyance of having to allocate a larger time window to accommodate some process which cleans stale (expired) histogram entries, possibly with some extra buffers to ensure the currently updated interval and the interval ready to be cleaned would not overlap. This periodic cleanup is performed in the couch_stats_server process. Besides performance, the new histograms have two other improvement over the Folsom ones: - They record every single value. Previous histograms did sampling and recorded mostly just the first 1024 values during each time instant (second). - They are mergeable. Multiple histograms can be merged with corresponding bins summed together. This could allow cluster wide histogram summaries or gathering histograms from individual processes, then combining them at the end in a central process. Other performance improvement in this commit is eliminating the need to periodically flush or scrape stats in the background in both couch_stats and prometheus apps. Stats fetching from persistent terms and counters takes less than 5 milliseconds, and sliding time window histogram will always return the last 10 seconds of data no matter when the stats are queried. Now that will be done only when the stats are actually queried. Since the Folsom library was abstracted away behind a couch_stats API, the rest of the applications do not need to be updated. They still call `couch_stats:update_histogram/2`, `couch_stats:increment_counter/1`, etc. Previously couch_stats did not have any tests at all. 
Folsom and Bear had some tests, but I don't think we ever ran those test suites. To rectify the situation added tests to cover the functionality. All the newly added or updated modules should be have near or exactly 100% test coverage. [1] #4650 (comment) [2] https://www.erlang.org/doc/man/counters.html [3] https://www.erlang.org/doc/man/persistent_term.html
Folsom histograms are a major bottleneck under high concurrency, as described in #4650. This was noticed during performance testing, confirmed using Erlang VM lock counting, then verified by creating a test release with histogram update logic commented out [1]. CouchDB doesn't use most of the Folsom statistics and metrics; we only use counters, gauges and one type of sliding window, sampling histogram. Instead of trying to re-design and update Folsom, which is a generic stats and metrics library, take a simpler approach and create just the three metrics we need, and then remove Folsom and Bear dependencies altogether. All the metrics types we re-implement are based on two relatively new Erlang/OTP features: counters [2] and persistent terms [3]. Counters are mutable arrays of integers, which allow fast concurrent updates, and persistent terms allow fast, global, constant time access to Erlang terms. Gauges and counters are implemented as counter arrays with one element. Histograms are represented as counter arrays where each array element is a histogram bin. Since we're dealing with sliding time window histograms, we have a tuple of counter arrays, where each time instant (each second) is a counter array. The overall histogram object then looks something like: ``` Histogram = { 1 = [1, 2, ..., ?BIN_COUNT] 2 = [1, 2, ..., ?BIN_COUNT] ... TimeWindow = [1, 2, ..., ?BIN_COUNT] } ``` To keep the structure immutable we need to set a limit on both the number of bins and the time window size. To limit the number of bins we need to set some minimum and maximum value limits. Since almost all our histograms record access times in milliseconds, we pick a range from 10 microseconds up to over one hour. Histogram bin widths are increasing exponentially in order to keep a reasonable precision across the whole range of values. This encoding is similar to how floating point numbers work. 
Additional details on how this works are described in the `couch_stats_histogram.erl` module.

To keep the histogram object structure immutable, the time window is used in a circular fashion. The time parameter to the histogram `update/3` function is the monotonic clock time, and the histogram time index is computed as `Time rem TimeWindow`. So, as the monotonic time advances, the histogram time index will loop around. This comes with the minor annoyance of having to allocate a larger time window to accommodate a process which cleans stale (expired) histogram entries, with some extra buffer to ensure the currently updated interval and the interval ready to be cleaned do not overlap. This periodic cleanup is performed in the couch_stats_server process.

Besides performance, the new histograms have two other improvements over the Folsom ones:
- They record every single value. Previous histograms did sampling and recorded mostly just the first 1024 values during each time instant (second).
- They are mergeable. Multiple histograms can be merged with corresponding bins summed together. This could allow cluster-wide histogram summaries, or gathering histograms from individual processes and then combining them in a central process.

Another performance improvement in this commit is eliminating the need to periodically flush or scrape stats in the background in both the couch_stats and prometheus apps. Fetching stats from persistent terms and counters takes less than 5 milliseconds, and the sliding time window histogram will always return the last 10 seconds of data no matter when the stats are queried. Now that work is done only when the stats are actually queried.

Since the Folsom library was abstracted away behind the couch_stats API, the rest of the applications do not need to be updated. They still call `couch_stats:update_histogram/2`, `couch_stats:increment_counter/1`, etc.

Previously couch_stats did not have any tests at all.
Folsom and Bear had some tests, but I don't think we ever ran those test suites. To rectify the situation, this commit adds tests to cover the functionality. All the newly added or updated modules should have near or exactly 100% test coverage.

[1] #4650 (comment)
[2] https://www.erlang.org/doc/man/counters.html
[3] https://www.erlang.org/doc/man/persistent_term.html
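The circular time window and bin merging described above can be sketched as a toy Python model, with plain lists standing in for Erlang counter arrays; the constants and helper names here are assumptions for illustration, not the couch_stats API.

```python
# Illustrative sizes; the real values depend on the chosen range/precision.
BIN_COUNT = 64
TIME_WINDOW = 15   # seconds of slots, larger than the 10s query window

def new_histogram():
    # One counter array per second in the circular window. In CouchDB
    # these are Erlang counters referenced from a persistent term.
    return [[0] * BIN_COUNT for _ in range(TIME_WINDOW)]

def update(hist, mono_time_sec, bin_index):
    # Monotonic seconds loop around the window: index = Time rem TimeWindow.
    hist[mono_time_sec % TIME_WINDOW][bin_index] += 1

def clean(hist, mono_time_sec):
    # Periodic cleaner (couch_stats_server's job in CouchDB): zero the
    # oldest slot so it is empty by the time the writer loops back to it.
    oldest = (mono_time_sec + 1) % TIME_WINDOW
    hist[oldest] = [0] * BIN_COUNT

def merge(h1, h2):
    # Histograms with identical shapes are additive bin-by-bin.
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(h1, h2)]
```

The extra slots beyond the 10-second query window act as the buffer between the writer's current slot and the slot being cleaned.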
We removed it in the main repo already. Issue: apache/couchdb#4650
Main PR was merged. Closing issue as "fixed". |
Folsom histograms can become a bottleneck when writing docs under high concurrency.
Initially this was observed in a benchmark run with lock counting enabled:
To confirm it further, we created a build based on #4647 with histogram updates disabled.
Then ran a document insertion ramp-up benchmark and measured number of ops (inserts) in 10 second windows (throughput) and latency.
Starting at 1000 concurrent clients, throughput improves by 100% without histogram updates.
Latency increases at a faster rate with histogram updates enabled. P95 latency with 3k concurrent clients is 3x worse with histogram updates than without.