Replace folsom histograms #4650

Closed

nickva opened this issue Jun 20, 2023 · 15 comments

Comments
@nickva
Contributor

nickva commented Jun 20, 2023

Folsom histograms can become a bottleneck when writing docs with high concurrency.

Initially this was observed in a benchmark run with lock counting enabled:

```
> lcnt:rt_opt({copy_save, true}), lcnt:clear(), timer:sleep(10000), lcnt:collect(), lcnt:conflicts().
                       lock    id   #tries  #collisions  collisions [%]  time [us]  duration [%]
                      -----   ---  ------- ------------ --------------- ---------- -------------
               db_hash_slot  2432  3073630       155606          5.0626   80812927      807.3885
                  run_queue    22 12050789      1823812         15.1344   74547468      744.7913
                  proc_main 64973  6287153      2020053         32.1299   19640746      196.2274
                   pix_lock  1024   320419          272          0.0849    3071989       30.6917
                proc_status 64973  5666139        92365          1.6301    1792422       17.9078
             alcu_allocator    10  1146957       108469          9.4571    1674152       16.7262
 dirty_run_queue_sleep_list     2  1920837       276037         14.3707    1355878       13.5464
                  proc_msgq 64973  7011355        24101          0.3437      78924        0.7885
       dist_entry_out_queue     7   367096         1676          0.4566       6023        0.0602
                   atom_tab     1  4760261           46          0.0010       4167        0.0416
            port_sched_lock   101   390590          353          0.0904       1705        0.0170
           dist_entry_links     6    23389           14          0.0599        323        0.0032
                     db_tab   399  4968232           13          0.0003        146        0.0015
               drv_ev_state   128    72977            6          0.0082         40        0.0004
ok
> lcnt:inspect(db_hash_slot).
         lock                   id  #tries  #collisions  collisions [%]  time [us]  duration [%] histogram [log2(us)]
        -----                  --- ------- ------------ --------------- ---------- ------------- ---------------------
 db_hash_slot folsom_slide_uniform   21808        13020         59.7029    7075598       70.6911 |      ............XX.  .      |
 db_hash_slot folsom_slide_uniform   24165        12706         52.5802    6640890       66.3480 |     .............XX..        |
 db_hash_slot folsom_slide_uniform   25239        10766         42.6562    5779694       57.7440 |      ............XX..        |
 db_hash_slot folsom_slide_uniform   23089         9987         43.2544    5082403       50.7774 |      ............XX...       |
```

To confirm it further, I created a build with histogram updates disabled, based on #4647.

Then I ran a document insertion ramp-up benchmark and measured the number of ops (inserts) in 10 second windows (throughput) and latency.

[Two screenshots: throughput and latency graphs from the benchmark run.]

Starting at 1000-client concurrency, our throughput improves 100% without histogram updates.

Latency increases at a faster rate with histogram updates enabled. P95 latency with 3k concurrent clients is 3x as bad with histogram updates as without.

@nickva
Contributor Author

nickva commented Jun 20, 2023

Going by https://github.com/open-telemetry/oteps/blob/main/text/0149-exponential-histogram.md and https://opentelemetry.io/blog/2022/exponential-histograms/ it seems exponential base 2 histograms are popular.

The scheme to replace the Folsom sliding window histogram could be based on persistent terms and counters:

```
persistent_term:put(Histogram, counters:new(1024, [write_concurrency])).
Ref = persistent_term:get(histogram_now(Histogram, NowSec)),
counters:add(Ref, bucket(Value), Value),
```

The bucket index can be computed with a simple hack, starting from the raw float representation:

```
exp(N) when is_integer(N) ->
    exp(float(N));
exp(F) when is_float(F), F < 0.0 ->
    0;
exp(F) when is_float(F) ->
    <<0:1, Exp:11, _Mantissa:52>> = <<F:64/float>>,
    Exp.
```

The exponent can then be scaled by a precision value by shifting left (`bsl 4`, for instance) and combined with the most significant bits of the mantissa (aka the significand): `Index = (Exp bsl 4) bor (Mantissa bsr (52-4))`.

The temporal (windowing) aspect can be implemented by having 15 or so counter arrays and writing to the `NowSec rem 15`-th one. We'll need a similar auto-cleanup / trim process to recycle and clean old entries.
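A minimal, self-contained sketch of how these pieces could fit together (hypothetical module and names, not the eventual implementation), assuming a 15-second window and incrementing each bin count by one per observed value:

```
-module(hist_sketch).
-export([new/1, update/2]).

-define(WINDOWS, 15).
-define(BINS, 2048).

%% Create one counter array per one-second window, keyed by name in a
%% persistent term. The structure itself is immutable afterwards.
new(Name) ->
    Counters = list_to_tuple([counters:new(?BINS, [write_concurrency])
                              || _ <- lists:seq(1, ?WINDOWS)]),
    persistent_term:put(Name, Counters).

%% Pick the counter array for the current second and bump the value's bin.
update(Name, Value) ->
    Counters = persistent_term:get(Name),
    NowSec = erlang:monotonic_time(second),
    %% monotonic time can be negative, so normalize the remainder
    Index = ((NowSec rem ?WINDOWS) + ?WINDOWS) rem ?WINDOWS + 1,
    counters:add(element(Index, Counters), bucket(Value), 1).

%% Crude base-2 binning straight from the float's exponent bits; a real
%% implementation would mix in mantissa bits and subtract a fixed offset.
bucket(Value) when is_number(Value), Value > 0 ->
    <<0:1, Exp:11, _:52>> = <<(float(Value))/float>>,
    min(Exp + 1, ?BINS);
bucket(_) ->
    1.
```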

@chewbranca
Contributor

@nickva nicely done demonstrating the lock collision overhead for the histograms. That said, it seems like the underlying histogram implementation is orthogonal to the issue of concurrent updates to the same histogram inducing locks. Your example for a histogram creates a single global histogram that would be concurrently updated just like we do with Folsom. So how do the alternatives to Folsom address the issue of concurrent lock collisions?

For bucketing based histograms, given the same accuracy and range, they should be additive in theory. Were you hoping to achieve that with the histograms you linked?

It's possible we could use a counter based ets backend for creating bucketing histograms, but counters are designed to be incremented and decremented, whereas the sampling based histogram we use with Folsom will directly go and set the histogram reservoir values based on the current sample.

@chewbranca
Contributor

It's also worth remembering that the Folsom histograms are front loaded on the number of samples actually collected (for better or worse), which means that the percentage of couch_stats histogram updates that actually trigger an ets update goes down as the number of requests per second goes up. I personally think this is not great, as the default bucket size of 1024 entries in a given one second "moment" leaves huge amounts of data on the floor; for example, if 100k histogram updates are called in a given second, the last update will have something like a 1:99k chance of actually impacting the histogram.

So one thing to watch out for is that by switching to a bucketing based histogram that is explicitly accurate in incrementing a bucket for every histogram update, we will drastically increase the amount of lock contention happening. In our example above, all 100k requests will actually modify the histogram, as opposed to Folsom's sampling based approach, which has a chance to update the histogram inversely proportional to the number of requests made in the current moment.

So again, I believe the fundamental issue at hand is concurrent updates to histograms, and any implementation we run with needs to address that issue directly.

@nickva
Contributor Author

nickva commented Jun 20, 2023

I tried to compare the two approaches:

```
-module(upstats).

-export([go/0, go/1, go/2]).

go() ->
    go(1000, 10000).

go(N) ->
    go(N, 10000).

go(N, X) ->
    T0 = erlang:monotonic_time(),
    Workers = [spawn_monitor(fun() ->
        rand:seed(default, os:timestamp()),
        upstat(X)
    end) || _ <- lists:seq(1, N)],
    {_Pids, Refs} = lists:unzip(Workers),
    WorkersSet = sets:from_list(Refs, [{version,2}]),
    ok = wait_workers(WorkersSet),
    Dt = erlang:monotonic_time() - T0,
    erlang:convert_time_unit(Dt, native, millisecond).

upstat(0) ->
    ok;
upstat(Times) ->
    Hist = case rand:uniform(5) of
        1 -> [couchdb, request_time];
        2 -> [couchdb, dbinfo];
        3 -> [couchdb, httpd, bulk_docs];
        4 -> [fsync, time];
        5 -> [couchdb, db_open_time]
    end,
    couch_stats:update_histogram(Hist, 100),
    upstat(Times - 1).

wait_workers(Workers) when is_map(Workers) ->
    case sets:size(Workers) of
        0 ->
            ok;
        _ ->
            receive
                {'DOWN', Ref, process, _, normal} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    wait_workers(Workers1);
                {'DOWN', Ref, process, _, Err} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    io:format("~n worker ~p crashed: ~p~n", [Ref, Err]),
                    wait_workers(Workers1)
            end
    end.
```

With an example counters update:

```
-module(upstats_counter).

-export([go/0, go/1, go/2]).

go() ->
    go(1000, 10000).

go(N) ->
    go(N, 10000).

go(N, X) ->
    Ref = counters:new(1024, [write_concurrency]),
    persistent_term:put(term, Ref),
    T0 = erlang:monotonic_time(),
    Workers = [spawn_monitor(fun() ->
        rand:seed(default, os:timestamp()),
        upstat(X)
    end) || _ <- lists:seq(1, N)],
    {_Pids, Refs} = lists:unzip(Workers),
    WorkersSet = sets:from_list(Refs, [{version,2}]),
    ok = wait_workers(WorkersSet),
    Dt = erlang:monotonic_time() - T0,
    erlang:convert_time_unit(Dt, native, millisecond).

upstat(0) ->
    ok;
upstat(Times) ->
    Ref = persistent_term:get(term),
    counters:add(Ref, rand:uniform(1024), 100),
    upstat(Times - 1).

wait_workers(Workers) when is_map(Workers) ->
    case sets:size(Workers) of
        0 ->
            ok;
        _ ->
            receive
                {'DOWN', Ref, process, _, normal} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    wait_workers(Workers1);
                {'DOWN', Ref, process, _, Err} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    io:format("~n worker ~p crashed: ~p~n", [Ref, Err]),
                    wait_workers(Workers1)
            end
    end.
```

```
(node1@127.0.0.1)18> upstats_counter:go(50000, 1000).
1385
(node1@127.0.0.1)19> upstats_counter:go(50000, 1000).
1425
(node1@127.0.0.1)20> upstats_counter:go(50000, 1000).
1427
(node1@127.0.0.1)21> upstats_counter:go(50000, 1000).
1438
(node1@127.0.0.1)22> upstats:go(50000, 1000).
15131
(node1@127.0.0.1)23> upstats:go(50000, 1000).
16075
(node1@127.0.0.1)24> upstats:go(50000, 1000).
14869
```

Even with Folsom's front loading and updating only 1024 values per second, it's still 10x slower!

@nickva
Contributor Author

nickva commented Jun 20, 2023

> For bucketing based histograms, given the same accuracy and range, they should be additive in theory. Were you hoping to achieve that with the histograms you linked?

The base-2 histograms would be additive and would record every single update.
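As a sketch of what additivity buys us, merging two equally sized bucket counter arrays is just summing corresponding bins (hypothetical helper; `counters:info/1` reports the array size):

```
%% Sum corresponding bins of two equally-sized counter arrays into a list.
merge(RefA, RefB) ->
    #{size := Size} = counters:info(RefA),
    [counters:get(RefA, I) + counters:get(RefB, I) || I <- lists:seq(1, Size)].
```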

@chewbranca
Contributor

I chatted with @nickva directly and he mentioned that the write_concurrency option to the Erlang counters implementation results in per-scheduler counter tracking, avoiding the concurrent updates I was concerned with! So while a bucket approach that records every histogram entry would result in more updates than a sampling based approach, it sounds like we can completely avoid the locking issue with Erlang counters.

@nickva
Contributor Author

nickva commented Jun 20, 2023

> So one thing to watch out for is that by switching to a bucketing based histogram that is explicitly accurate in incrementing a bucket for every histogram update, we will drastically increase the amount of lock contention happening.

That's why I thought counters might work here. With `[write_concurrency]` it does auto-striping across schedulers:
https://www.erlang.org/doc/man/counters.html#new-2

> This is an optimization to achieve very efficient concurrent add and sub operations at the expense of potential read inconsistency and memory consumption per counters. Read operations may see sequentially inconsistent results with regard to concurrent write operations. Even if write operation A is done sequentially before write operation B, a concurrent reader may see any combination of A and B, including only B. A read operation is only guaranteed to see all writes done sequentially before the read. No writes are ever lost, but will eventually all be seen.
>
> The typical use case for write_concurrency is when concurrent calls to add and sub toward the same counters are very frequent, while calls to get and put are much less frequent. The lack of absolute read consistency must also be acceptable.

Seems to check out based on the OTP source.

@nickva
Contributor Author

nickva commented Jun 22, 2023

The basic idea for preserving the windowed updates, with 10 second windows and 1 second granularity like we have currently, could be to use 10 counter arrays (in a tuple, for instance, to allow quick iteration), one per second:

```
Counters = {counters:new(NumberOfBins), counters:new(NumberOfBins), ...}
```

So `update(#hist{}, Val)` would do something like `TimeIndex = erlang:monotonic_time(second) rem tuple_size(Counters) + 1, CounterRef = element(TimeIndex, Counters)`.

In order for the trimmer/cleaner process not to step on the toes of current updaters, we'd create 15 or 20 counter arrays, to allow plenty of time between the trimmer process resetting the old counters and some delayed updaters still updating them. Using monotonic time prevents the updaters from going backwards, but there could be some time between them reading the monotonic time and the update itself.

In this way we avoid the need to modify (delete old and create new) counter arrays in a persistent term. The persistent histogram term structure remains immutable. The only things that get updated are the counter values themselves, which should be efficient.
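A sketch of what that trimmer could look like under those assumptions (hypothetical names; `?WINDOWS` is the number of counter arrays in the tuple and `?BINS` their size):

```
%% Once a second, zero the window "furthest" from the current time index, so
%% delayed updaters working near the current second never race with the reset.
trim_loop(Name) ->
    Counters = persistent_term:get(Name),
    NowSec = erlang:monotonic_time(second),
    Current = ((NowSec rem ?WINDOWS) + ?WINDOWS) rem ?WINDOWS + 1,
    Stale = (Current + ?WINDOWS div 2 - 1) rem ?WINDOWS + 1,
    Ref = element(Stale, Counters),
    [counters:put(Ref, I, 0) || I <- lists:seq(1, ?BINS)],
    timer:sleep(1000),
    trim_loop(Name).
```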

@nickva
Contributor Author

nickva commented Jun 22, 2023

To go with float base-2 exponential binning (to capture as much range as possible in each counters array), I threw this together:

```
-define(EXPONENT_BIAS, 1023).
-define(MANTISSA_BITS, 52).
-define(EXPONENT_BITS, 11).

bin_index(Val, Scale) when is_integer(Val) ->
    bin_index(float(Val), Scale);
bin_index(Val, Scale) when is_float(Val), is_integer(Scale), Scale >= 0, Scale < ?MANTISSA_BITS ->
    {Exponent, Mantissa} = exp(Val),
    ExponentBits = Exponent bsl Scale,
    MantissaBits = Mantissa bsr (?MANTISSA_BITS - Scale),
    ExponentBits bor MantissaBits.

exp(Val) when is_float(Val) ->
    <<0:1, Exponent:?EXPONENT_BITS, Mantissa:?MANTISSA_BITS>> = <<Val/float>>,
    {Exponent - ?EXPONENT_BIAS, Mantissa}.
```

At first I thought of using just the exponent part directly; however, the bins are spaced out a bit too much. So I used the idea from https://github.com/newrelic-experimental/newrelic-sketch-java/tree/main/src/main/java/com/newrelic/nrsketch/indexer to add some more precision from the mantissa part into the bin index.

First, the exponent is shifted a bit "to make some room" for the extra bits: `Exponent bsl Scale`, e.g. `eeee bsl 2 => eeee00`. Now we have a place for the 2 extra precision bits.

Then, we shift the mantissa bits to the right and keep only its two topmost significant bits. For example, `mmmmmm... bsr (52 - 2) => 0000mm`. And finally, we combine them by OR-ing into the shifted exponent part to get the index: `eeee00 bor 0000mm => eeeemm`.

This is a bit of a low-level bit hack; here is a good reminder of the structure of the floating point encoding: https://fabiensanglard.net/floating_point_visually_explained/

@pgj
Contributor

pgj commented Jun 23, 2023

> This is a bit of a low-level bit hack; here is a good reminder of the structure of the floating point encoding: https://fabiensanglard.net/floating_point_visually_explained/

https://floating-point-gui.de/formats/fp/ also has some links to pages with visualizations.

@nickva
Contributor Author

nickva commented Jun 23, 2023

Thinking more about how we compute the bin_index: since Erlang supports bitstrings, we can avoid doing all the shifting and OR-ing gymnastics and just pick the exponent + top most significant bits of the mantissa directly.

```
bin_index(Val) ->
    <<0:1, BinIndex:?INDEX_BITS, _/bitstring>> = <<Val/float>>,
    BinIndex.
```

where `?INDEX_BITS = ?EXPONENT_BITS + Scale = 11 + 3 = 14`.

This, however, uses the biased exponent; but since we already have to apply an offset to map bin indices onto the [1, ..., MaxBins] range, we'll just have a larger offset now (8000 or so).
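For example, in a shell, taking the top 14 bits (Scale = 3) directly:

```
1> [begin <<0:1, Ix:14, _/bitstring>> = <<F/float>>, {F, Ix} end || F <- [1.0, 2.0, 3.0, 4.0]].
[{1.0,8184},{2.0,8192},{3.0,8196},{4.0,8200}]
```

1.0 lands at raw index 1023 bsl 3 = 8184, hence the roughly 8000 offset mentioned above.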

@nickva
Contributor Author

nickva commented Jun 23, 2023

A quick proof of concept: https://gist.github.com/nickva/d1e5a086107e2f0e824e1c49ff177507

```
test_speed() ->
    H = new(15),
    timer:tc(fun() -> lists:foreach(fun(I) -> fhist:update(H, I, 10000.0) end, lists:seq(1, 1000000)) end).
```

```
> fhist:test_speed().
{243211,ok}
```

We can do a bin selection (normally a math:log/1 operation), a counter time index selection, and a counter update in about 200 nanoseconds.

@nickva
Contributor Author

nickva commented Jun 28, 2023

Since the main goal is to improve performance under higher concurrency and avoid bottlenecking on a single ETS lock, I created a benchmark which compares old (Folsom) histogram updates with the new ones under concurrency.

https://gist.github.com/nickva/c057ee081e59359aca3c6e78ffb3d263

At 10k concurrency level:

```
> upstats_new_hist:go_new_hist(10000, 10000).
6083
> upstats_new_hist:go_new_hist(10000, 10000).
6541
> upstats_new_hist:go_new_hist(10000, 10000).
6509

> upstats_new_hist:go_folsom_hist(10000, 10000).
156323
> upstats_new_hist:go_folsom_hist(10000, 10000).
154118
> upstats_new_hist:go_folsom_hist(10000, 10000).
166647
```

Taking the best times:

```
1> 154118 / 6083.
25.335854019398322
```

The new histogram implementation is 25x faster at 10k concurrency.

At 1k concurrency level:

```
> upstats_new_hist:go_new_hist(1000, 10000).
567
> upstats_new_hist:go_new_hist(1000, 10000).
563
> upstats_new_hist:go_new_hist(1000, 10000).
573

> upstats_new_hist:go_folsom_hist(1000, 10000).
12807
> upstats_new_hist:go_folsom_hist(1000, 10000).
12221
> upstats_new_hist:go_folsom_hist(1000, 10000).
12459
```

Using best times:

```
> 12221 / 563.
21.706927175843695
```

The new histogram implementation is 20x faster at 1k concurrency, which confirms that the new histogram implementation is faster and less of a bottleneck during concurrent updates.

nickva added a commit that referenced this issue Jul 11, 2023
Folsom histograms are a major bottleneck under high concurrency, as described
in #4650. This was noticed during performance testing, confirmed using Erlang
VM lock counting, then verified by creating a test release with histogram
update logic commented out [1].

CouchDB doesn't use most of the Folsom statistics and metrics; we only use
counters, gauges and one type of sliding window, sampling histogram. Instead of
trying to re-design and update Folsom, which is a generic stats and metrics
library, take a simpler approach and create just the three metrics we need, and
then remove Folsom and Bear dependencies altogether.

All the metrics types we re-implement are based on two relatively new
Erlang/OTP features: counters [2] and persistent terms [3]. Counters are
mutable arrays of integers, which allow fast concurrent updates, and persistent
terms allow fast, global, constant time access to Erlang terms.

Gauges and counters are implemented as counter arrays with one element.
Histograms are represented as counter arrays where each array element is a
histogram bin. Since we're dealing with sliding time window histograms, we have
a tuple of counter arrays, where each time instant (each second) is a counter
array. The overall histogram object then looks something like:

```
Histogram = {
     1          = [1, 2, ..., ?BIN_COUNT]
     2          = [1, 2, ..., ?BIN_COUNT]
     ...
     TimeWindow = [1, 2, ..., ?BIN_COUNT]
  }
```

To keep the structure immutable we need to set a limit on both the number of
bins and the time window size. To limit the number of bins we need to set some
minimum and maximum value limits. Since almost all our histograms record access
times in milliseconds, we pick a range from 10 microseconds up to over one
hour. Histogram bin widths increase exponentially in order to keep a
reasonable precision across the whole range of values. This encoding is similar
to how floating point numbers work. Additional details on how this works are
described in the `couch_stats_histogram.erl` module.

To keep the histogram object structure immutable, the time window is used in a
circular fashion. The time parameter to the histogram `update/3` function is the
monotonic clock time, and the histogram time index is computed as `Time rem
TimeWindow`. So, as the monotonic time is advancing forward, the histogram time
index will loop around. This comes with a minor annoyance of having to allocate
a larger time window to accommodate some process which cleans stale (expired)
histogram entries, possibly with some extra buffers to ensure the currently
updated interval and the interval ready to be cleaned would not overlap. This
periodic cleanup is performed in the couch_stats_server process.

Besides performance, the new histograms have two other improvements over the
Folsom ones:

  - They record every single value. Previous histograms did sampling and
    recorded mostly just the first 1024 values during each time instant
    (second).

  - They are mergeable. Multiple histograms can be merged with corresponding
    bins summed together. This could allow cluster wide histogram summaries or
    gathering histograms from individual processes, then combining them at the
    end in a central process.

Another performance improvement in this commit is eliminating the need to
periodically flush or scrape stats in the background in both couch_stats and
prometheus apps. Stats fetching from persistent terms and counters takes less
than 5 milliseconds, and sliding time window histograms will always return the
last 10 seconds of data no matter when the stats are queried. Now that work is
done only when the stats are actually queried.

Since the Folsom library was abstracted away behind a couch_stats API, the rest
of the applications do not need to be updated. They still call
`couch_stats:update_histogram/2`, `couch_stats:increment_counter/1`, etc.

Previously couch_stats did not have any tests at all. Folsom and Bear had some
tests, but I don't think we ever ran those test suites. To rectify the
situation, I added tests to cover the functionality. All the newly added or
updated modules should have near or exactly 100% test coverage.

[1] #4650 (comment)
[2] https://www.erlang.org/doc/man/counters.html
[3] https://www.erlang.org/doc/man/persistent_term.html
nickva added a commit that referenced this issue Jul 11, 2023
@nickva
Contributor Author

nickva commented Jul 11, 2023

Folsom and Bear replacement is implemented in #4672

The full implementation preserves the same 20x speedup as the initial prototype: #4672 (comment)

nickva added a commit that referenced this issue Jul 12, 2023
nickva added a commit that referenced this issue Jul 18, 2023
nickva added a commit that referenced this issue Jul 19, 2023
nickva added a commit that referenced this issue Jul 19, 2023
nickva added a commit that referenced this issue Jul 19, 2023
nickva added a commit to apache/couchdb-ioq that referenced this issue Jul 19, 2023
We removed it in the main repo already.

Issue: apache/couchdb#4650
nickva added a commit to apache/couchdb-ioq that referenced this issue Jul 19, 2023
@nickva
Contributor Author

nickva commented Jul 19, 2023

The main PR was merged. Closing this issue as fixed.

@nickva nickva closed this as completed Jul 19, 2023
big-r81 pushed a commit that referenced this issue Jul 23, 2023