Replace folsom histograms #4650

Closed

nickva opened this issue Jun 20, 2023 · 15 comments

Comments
@nickva
Contributor

nickva commented Jun 20, 2023

Folsom histograms can become a bottleneck when writing docs with high concurrency.

Initially this was observed in a benchmark run with lock counting enabled:

```
> lcnt:rt_opt({copy_save, true}), lcnt:clear(), timer:sleep(10000), lcnt:collect(), lcnt:conflicts().
                       lock    id   #tries  #collisions  collisions [%]  time [us]  duration [%]
                      -----   ---  ------- ------------ --------------- ---------- -------------
               db_hash_slot  2432  3073630       155606          5.0626   80812927      807.3885
                  run_queue    22 12050789      1823812         15.1344   74547468      744.7913
                  proc_main 64973  6287153      2020053         32.1299   19640746      196.2274
                   pix_lock  1024   320419          272          0.0849    3071989       30.6917
                proc_status 64973  5666139        92365          1.6301    1792422       17.9078
             alcu_allocator    10  1146957       108469          9.4571    1674152       16.7262
 dirty_run_queue_sleep_list     2  1920837       276037         14.3707    1355878       13.5464
                  proc_msgq 64973  7011355        24101          0.3437      78924        0.7885
       dist_entry_out_queue     7   367096         1676          0.4566       6023        0.0602
                   atom_tab     1  4760261           46          0.0010       4167        0.0416
            port_sched_lock   101   390590          353          0.0904       1705        0.0170
           dist_entry_links     6    23389           14          0.0599        323        0.0032
                     db_tab   399  4968232           13          0.0003        146        0.0015
               drv_ev_state   128    72977            6          0.0082         40        0.0004
ok
> lcnt:inspect(db_hash_slot).
         lock                   id  #tries  #collisions  collisions [%]  time [us]  duration [%] histogram [log2(us)]
        -----                  --- ------- ------------ --------------- ---------- ------------- ---------------------
 db_hash_slot folsom_slide_uniform   21808        13020         59.7029    7075598       70.6911 |      ............XX.  .      |
 db_hash_slot folsom_slide_uniform   24165        12706         52.5802    6640890       66.3480 |     .............XX..        |
 db_hash_slot folsom_slide_uniform   25239        10766         42.6562    5779694       57.7440 |      ............XX..        |
 db_hash_slot folsom_slide_uniform   23089         9987         43.2544    5082403       50.7774 |      ............XX...       |
```

To confirm it further, I created a build with histogram updates disabled, based on #4647.

Then I ran a document insertion ramp-up benchmark and measured the number of ops (inserts) in 10 second windows (throughput) and latency.

[Two screenshots: throughput and latency graphs from the benchmark run.]

Starting at 1000-client concurrency, our throughput improves 100% without histogram updates.

Latency increases at a faster rate with histogram updates enabled. P95 latency with 3k concurrent clients is 3x as bad with histogram updates as without.

@nickva
Contributor Author

nickva commented Jun 20, 2023

Going by https://github.com/open-telemetry/oteps/blob/main/text/0149-exponential-histogram.md and https://opentelemetry.io/blog/2022/exponential-histograms/ it seems exponential base 2 histograms are popular.

The scheme to replace the Folsom sliding window histogram could be based on persistent terms and counters:

```
persistent_term:put(Histogram, counters:new(1024, [write_concurrency])).
Ref = persistent_term:get(histogram_now(Histogram, NowSec)),
counters:add(Ref, bucket(Value), Value),
```

The bucket index can be computed with a simple hack, starting from the raw float representation:

```
exp(N) when is_integer(N) ->
    exp(float(N));
exp(F) when is_float(F), F < 0.0 ->
    0;
exp(F) when is_float(F) ->
    <<0:1, Exp:11, _Mantissa:52>> = <<F:64/float>>,
    Exp.
```

The exponent can then be scaled by a precision value by shifting left (`bsl 4`, for instance) and combined with the most significant bits of the mantissa (aka the significand): `Index = (Exp bsl 4) bor (Mantissa bsr (52-4))`.

The temporal (windowing) aspect can be implemented by having 15 or so counter arrays and writing to the `NowSec rem 15`-th one. We'll need a similar auto-cleanup / trim process to recycle and clean old entries.
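A minimal, self-contained sketch of how these pieces could fit together (hypothetical module and names, not the eventual implementation), assuming a 15-second window and incrementing each bin count by one per observed value:

```
-module(hist_sketch).
-export([new/1, update/2]).

-define(WINDOWS, 15).
-define(BINS, 2048).

%% Create one counter array per one-second window, keyed by name in a
%% persistent term. The structure itself is immutable afterwards.
new(Name) ->
    Counters = list_to_tuple([counters:new(?BINS, [write_concurrency])
                              || _ <- lists:seq(1, ?WINDOWS)]),
    persistent_term:put(Name, Counters).

%% Pick the counter array for the current second and bump the value's bin.
update(Name, Value) ->
    Counters = persistent_term:get(Name),
    NowSec = erlang:monotonic_time(second),
    %% monotonic time can be negative, so normalize the remainder
    Index = ((NowSec rem ?WINDOWS) + ?WINDOWS) rem ?WINDOWS + 1,
    counters:add(element(Index, Counters), bucket(Value), 1).

%% Crude base-2 binning straight from the float's exponent bits; a real
%% implementation would mix in mantissa bits and subtract a fixed offset.
bucket(Value) when is_number(Value), Value > 0 ->
    <<0:1, Exp:11, _:52>> = <<(float(Value))/float>>,
    min(Exp + 1, ?BINS);
bucket(_) ->
    1.
```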

@chewbranca
Contributor

@nickva nicely done demonstrating the lock collision overhead for the histograms. That said, it seems like the underlying histogram implementation is orthogonal to the issue of concurrent updates to the same histogram inducing locks. Your example for a histogram creates a single global histogram that would be concurrently updated just like we do with Folsom. So how do the alternatives to Folsom address the issue of concurrent lock collisions?

For bucketing based histograms, given the same accuracy and range, they should be additive in theory. Were you hoping to achieve that with the histograms you linked?

It's possible we could use a counter based ets backend for creating bucketing histograms, but counters are designed to be incremented and decremented, whereas the sampling based histogram we use with Folsom will directly go and set the histogram reservoir values based on the current sample.

@chewbranca
Contributor

It's also worth remembering that the Folsom histograms are front loaded on the number of samples actually collected (for better or worse), which means that the percentage of couch_stats histogram updates that actually trigger an ets update goes down as the number of requests per second goes up. I personally think this is not great, as the default bucket size of 1024 entries in a given one second "moment" leaves huge amounts of data on the floor; for example, if 100k histogram updates are called in a given second, the last update will have something like a 1:99k chance of actually impacting the histogram.

So one thing to watch out for is that by switching to a bucketing based histogram that is explicitly accurate in incrementing a bucket for every histogram update, we will drastically increase the amount of lock contention happening. In our example above, all 100k requests will actually modify the histogram, as opposed to Folsom's sampling based approach, which has a chance to update the histogram inversely proportional to the number of requests made in the current moment.

So again, I believe the fundamental issue at hand is concurrent updates to histograms, and any implementation we run with needs to address that issue directly.

@nickva
Contributor Author

nickva commented Jun 20, 2023

I tried to compare the two approaches:

```
-module(upstats).

-export([go/0, go/1, go/2]).

go() ->
    go(1000, 10000).

go(N) ->
    go(N, 10000).

go(N, X) ->
    T0 = erlang:monotonic_time(),
    Workers = [spawn_monitor(fun() ->
        rand:seed(default, os:timestamp()),
        upstat(X)
    end) || _ <- lists:seq(1, N)],
    {_Pids, Refs} = lists:unzip(Workers),
    WorkersSet = sets:from_list(Refs, [{version,2}]),
    ok = wait_workers(WorkersSet),
    Dt = erlang:monotonic_time() - T0,
    erlang:convert_time_unit(Dt, native, millisecond).

upstat(0) ->
    ok;
upstat(Times) ->
    Hist = case rand:uniform(5) of
        1 -> [couchdb, request_time];
        2 -> [couchdb, dbinfo];
        3 -> [couchdb, httpd, bulk_docs];
        4 -> [fsync, time];
        5 -> [couchdb, db_open_time]
    end,
    couch_stats:update_histogram(Hist, 100),
    upstat(Times - 1).

wait_workers(Workers) when is_map(Workers) ->
    case sets:size(Workers) of
        0 ->
            ok;
        _ ->
            receive
                {'DOWN', Ref, process, _, normal} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    wait_workers(Workers1);
                {'DOWN', Ref, process, _, Err} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    io:format("~n worker ~p crashed: ~p~n", [Ref, Err]),
                    wait_workers(Workers1)
            end
    end.
```

With an example counters update:

```
-module(upstats_counter).

-export([go/0, go/1, go/2]).

go() ->
    go(1000, 10000).

go(N) ->
    go(N, 10000).

go(N, X) ->
    Ref = counters:new(1024, [write_concurrency]),
    persistent_term:put(term, Ref),
    T0 = erlang:monotonic_time(),
    Workers = [spawn_monitor(fun() ->
        rand:seed(default, os:timestamp()),
        upstat(X)
    end) || _ <- lists:seq(1, N)],
    {_Pids, Refs} = lists:unzip(Workers),
    WorkersSet = sets:from_list(Refs, [{version,2}]),
    ok = wait_workers(WorkersSet),
    Dt = erlang:monotonic_time() - T0,
    erlang:convert_time_unit(Dt, native, millisecond).

upstat(0) ->
    ok;
upstat(Times) ->
    Ref = persistent_term:get(term),
    counters:add(Ref, rand:uniform(1024), 100),
    upstat(Times - 1).

wait_workers(Workers) when is_map(Workers) ->
    case sets:size(Workers) of
        0 ->
            ok;
        _ ->
            receive
                {'DOWN', Ref, process, _, normal} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    wait_workers(Workers1);
                {'DOWN', Ref, process, _, Err} ->
                    Workers1 = sets:del_element(Ref, Workers),
                    io:format("~n worker ~p crashed: ~p~n", [Ref, Err]),
                    wait_workers(Workers1)
            end
    end.
```

```
(node1@127.0.0.1)18> upstats_counter:go(50000, 1000).
1385
(node1@127.0.0.1)19> upstats_counter:go(50000, 1000).
1425
(node1@127.0.0.1)20> upstats_counter:go(50000, 1000).
1427
(node1@127.0.0.1)21> upstats_counter:go(50000, 1000).
1438
(node1@127.0.0.1)22> upstats:go(50000, 1000).
15131
(node1@127.0.0.1)23> upstats:go(50000, 1000).
16075
(node1@127.0.0.1)24> upstats:go(50000, 1000).
14869
```

Even with Folsom's front loading and updating only 1024 values per second, it's still 10x slower!

@nickva
Contributor Author

nickva commented Jun 20, 2023

> For bucketing based histograms, given the same accuracy and range, they should be additive in theory. Were you hoping to achieve that with the histograms you linked?

The base-2 histograms would be additive and would record every single update.
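As a sketch of what additivity buys us, merging two equally sized bucket counter arrays is just summing corresponding bins (hypothetical helper; `counters:info/1` reports the array size):

```
%% Sum corresponding bins of two equally-sized counter arrays into a list.
merge(RefA, RefB) ->
    #{size := Size} = counters:info(RefA),
    [counters:get(RefA, I) + counters:get(RefB, I) || I <- lists:seq(1, Size)].
```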

@chewbranca
Contributor

I chatted with @nickva directly and he mentioned that the write_concurrency option to the Erlang counters implementation results in per-scheduler counter tracking, avoiding the concurrent updates I was concerned with! So while a bucket approach that records every histogram entry would result in more updates than a sampling based approach, it sounds like we can completely avoid the locking issue with Erlang counters.

@nickva
Contributor Author

nickva commented Jun 20, 2023

> So one thing to watch out for is that by switching to a bucketing based histogram that is explicitly accurate in incrementing a bucket for every histogram update, we will drastically increase the amount of lock contention happening.

That's why I thought counters might work here. With `[write_concurrency]` it does auto-striping across schedulers:
https://www.erlang.org/doc/man/counters.html#new-2

> This is an optimization to achieve very efficient concurrent add and sub operations at the expense of potential read inconsistency and memory consumption per counters. Read operations may see sequentially inconsistent results with regard to concurrent write operations. Even if write operation A is done sequentially before write operation B, a concurrent reader may see any combination of A and B, including only B. A read operation is only guaranteed to see all writes done sequentially before the read. No writes are ever lost, but will eventually all be seen.
>
> The typical use case for write_concurrency is when concurrent calls to add and sub toward the same counters are very frequent, while calls to get and put are much less frequent. The lack of absolute read consistency must also be acceptable.

Seems to check out based on the OTP source.

@nickva
Contributor Author

nickva commented Jun 22, 2023

The basic idea for preserving the windowed updates, with 10 second windows and 1 second granularity like we have currently, could be to use 10 counter arrays (in a tuple, for instance, to allow quick iteration), one per second:

```
Counters = {counters:new(NumberOfBins), counters:new(NumberOfBins), ...}
```

So `update(#hist{}, Val)` would do something like `TimeIndex = erlang:monotonic_time(second) rem tuple_size(Counters) + 1, CounterRef = element(TimeIndex, Counters)`.

In order for the trimmer/cleaner process not to step on the toes of current updaters, we'd create 15 or 20 counter arrays, to allow plenty of time between the trimmer process resetting the old counters and some delayed updaters still updating them. Using monotonic time prevents the updaters from going backwards, but there could be some time between them reading the monotonic time and the update itself.

In this way we avoid the need to modify (delete old and create new) counter arrays in a persistent term. The persistent histogram term structure remains immutable. The only things that get updated are the counter values themselves, which should be efficient.
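A sketch of what that trimmer could look like under those assumptions (hypothetical names; `?WINDOWS` is the number of counter arrays in the tuple and `?BINS` their size):

```
%% Once a second, zero the window "furthest" from the current time index, so
%% delayed updaters working near the current second never race with the reset.
trim_loop(Name) ->
    Counters = persistent_term:get(Name),
    NowSec = erlang:monotonic_time(second),
    Current = ((NowSec rem ?WINDOWS) + ?WINDOWS) rem ?WINDOWS + 1,
    Stale = (Current + ?WINDOWS div 2 - 1) rem ?WINDOWS + 1,
    Ref = element(Stale, Counters),
    [counters:put(Ref, I, 0) || I <- lists:seq(1, ?BINS)],
    timer:sleep(1000),
    trim_loop(Name).
```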

@nickva
Contributor Author

nickva commented Jun 22, 2023

To go with float base-2 exponential binning (to capture as much range as possible in each counters array), I threw this together:

```
-define(EXPONENT_BIAS, 1023).
-define(MANTISSA_BITS, 52).
-define(EXPONENT_BITS, 11).

bin_index(Val, Scale) when is_integer(Val) ->
    bin_index(float(Val), Scale);
bin_index(Val, Scale) when is_float(Val), is_integer(Scale), Scale >= 0, Scale < ?MANTISSA_BITS ->
    {Exponent, Mantissa} = exp(Val),
    ExponentBits = Exponent bsl Scale,
    MantissaBits = Mantissa bsr (?MANTISSA_BITS - Scale),
    ExponentBits bor MantissaBits.

exp(Val) when is_float(Val) ->
    <<0:1, Exponent:?EXPONENT_BITS, Mantissa:?MANTISSA_BITS>> = <<Val/float>>,
    {Exponent - ?EXPONENT_BIAS, Mantissa}.
```

At first I thought of using just the exponent part directly; however, the bins are spaced out a bit too much. So I used the idea from https://github.com/newrelic-experimental/newrelic-sketch-java/tree/main/src/main/java/com/newrelic/nrsketch/indexer to add some more precision from the mantissa part into the bin index.

First, the exponent is shifted a bit "to make some room" for the extra bits: `Exponent bsl Scale`, e.g. `eeee bsl 2 => eeee00`. Now we have a place for the 2 extra precision bits.

Then, we shift the mantissa bits to the right and keep only its two topmost significant bits. For example, `mmmmmm... bsr (52 - 2) => 0000mm`. And finally, we combine them by OR-ing into the shifted exponent part to get the index: `eeee00 bor 0000mm => eeeemm`.

This is a bit of a low-level bit hack; here is a good reminder of the structure of the floating point encoding: https://fabiensanglard.net/floating_point_visually_explained/

@pgj
Contributor

pgj commented Jun 23, 2023

> This is a bit of a low-level bit hack; here is a good reminder of the structure of the floating point encoding: https://fabiensanglard.net/floating_point_visually_explained/

https://floating-point-gui.de/formats/fp/ also has some links to pages with visualizations.

@nickva
Contributor Author

nickva commented Jun 23, 2023

Thinking more about how we compute the bin_index: since Erlang supports bitstrings, we can avoid doing all the shifting and OR-ing gymnastics and just pick the exponent + top most significant bits of the mantissa directly.

```
bin_index(Val) ->
    <<0:1, BinIndex:?INDEX_BITS, _/bitstring>> = <<Val/float>>,
    BinIndex.
```

where `?INDEX_BITS = ?EXPONENT_BITS + Scale = 11 + 3 = 14`.

This, however, uses the biased exponent; but since we already have to apply an offset to map bin indices onto the [1, ..., MaxBins] range, we'll just have a larger offset now (8000 or so).
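For example, in a shell, taking the top 14 bits (Scale = 3) directly:

```
1> [begin <<0:1, Ix:14, _/bitstring>> = <<F/float>>, {F, Ix} end || F <- [1.0, 2.0, 3.0, 4.0]].
[{1.0,8184},{2.0,8192},{3.0,8196},{4.0,8200}]
```

1.0 lands at raw index 1023 bsl 3 = 8184, hence the roughly 8000 offset mentioned above.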

@nickva
Contributor Author

nickva commented Jun 23, 2023

A quick proof of concept: https://gist.github.com/nickva/d1e5a086107e2f0e824e1c49ff177507

```
test_speed() ->
    H = new(15),
    timer:tc(fun() -> lists:foreach(fun(I) -> fhist:update(H, I, 10000.0) end, lists:seq(1, 1000000)) end).
```

```
> fhist:test_speed().
{243211,ok}
```

We can do a bin selection (normally a math:log/1 operation), a counter time index selection, and a counter update in about 200 nanoseconds.

@nickva
Contributor Author

nickva commented Jun 28, 2023

Since the main goal is to improve performance under higher concurrency and avoid bottlenecking on a single ETS lock, I created a benchmark which compares old (Folsom) histogram updates with the new ones under concurrency.

https://gist.github.com/nickva/c057ee081e59359aca3c6e78ffb3d263

At 10k concurrency level:

```
> upstats_new_hist:go_new_hist(10000, 10000).
6083
> upstats_new_hist:go_new_hist(10000, 10000).
6541
> upstats_new_hist:go_new_hist(10000, 10000).
6509

> upstats_new_hist:go_folsom_hist(10000, 10000).
156323
> upstats_new_hist:go_folsom_hist(10000, 10000).
154118
> upstats_new_hist:go_folsom_hist(10000, 10000).
166647
```

Taking the best times:

```
1> 154118 / 6083.
25.335854019398322
```

The new histogram implementation is 25x faster at 10k concurrency.

At 1k concurrency level:

```
> upstats_new_hist:go_new_hist(1000, 10000).
567
> upstats_new_hist:go_new_hist(1000, 10000).
563
> upstats_new_hist:go_new_hist(1000, 10000).
573

> upstats_new_hist:go_folsom_hist(1000, 10000).
12807
> upstats_new_hist:go_folsom_hist(1000, 10000).
12221
> upstats_new_hist:go_folsom_hist(1000, 10000).
12459
```

Using best times:

```
> 12221 / 563.
21.706927175843695
```

The new histogram implementation is 20x faster at 1k concurrency, which confirms that the new histogram implementation is faster and less of a bottleneck during concurrent updates.

nickva added a commit that referenced this issue Jul 11, 2023
Folsom histograms are a major bottleneck under high concurrency, as described
in #4650. This was noticed during performance testing, confirmed using Erlang
VM lock counting, then verified by creating a test release with histogram
update logic commented out [1].

CouchDB doesn't use most of the Folsom statistics and metrics; we only use
counters, gauges and one type of sliding window, sampling histogram. Instead of
trying to re-design and update Folsom, which is a generic stats and metrics
library, take a simpler approach and create just the three metrics we need, and
then remove Folsom and Bear dependencies altogether.

All the metrics types we re-implement are based on two relatively new
Erlang/OTP features: counters [2] and persistent terms [3]. Counters are
mutable arrays of integers, which allow fast concurrent updates, and persistent
terms allow fast, global, constant time access to Erlang terms.

Gauges and counters are implemented as counter arrays with one element.
Histograms are represented as counter arrays where each array element is a
histogram bin. Since we're dealing with sliding time window histograms, we have
a tuple of counter arrays, where each time instant (each second) is a counter
array. The overall histogram object then looks something like:

```
Histogram = {
     1          = [1, 2, ..., ?BIN_COUNT]
     2          = [1, 2, ..., ?BIN_COUNT]
     ...
     TimeWindow = [1, 2, ..., ?BIN_COUNT]
  }
```

To keep the structure immutable we need to set a limit on both the number of
bins and the time window size. To limit the number of bins we need to set some
minimum and maximum value limits. Since almost all our histograms record access
times in milliseconds, we pick a range from 10 microseconds up to over one
hour. Histogram bin widths increase exponentially in order to keep a
reasonable precision across the whole range of values. This encoding is similar
to how floating point numbers work. Additional details on how this works are
described in the `couch_stats_histogram.erl` module.

To keep the histogram object structure immutable, the time window is used in a
circular fashion. The time parameter to the histogram `update/3` function is the
monotonic clock time, and the histogram time index is computed as `Time rem
TimeWindow`. So, as the monotonic time is advancing forward, the histogram time
index will loop around. This comes with a minor annoyance of having to allocate
a larger time window to accommodate some process which cleans stale (expired)
histogram entries, possibly with some extra buffers to ensure the currently
updated interval and the interval ready to be cleaned would not overlap. This
periodic cleanup is performed in the couch_stats_server process.

Besides performance, the new histograms have two other improvements over the
Folsom ones:

  - They record every single value. Previous histograms did sampling and
    recorded mostly just the first 1024 values during each time instant
    (second).

  - They are mergeable. Multiple histograms can be merged with corresponding
    bins summed together. This could allow cluster wide histogram summaries or
    gathering histograms from individual processes, then combining them at the
    end in a central process.

Another performance improvement in this commit is eliminating the need to
periodically flush or scrape stats in the background in both couch_stats and
prometheus apps. Stats fetching from persistent terms and counters takes less
than 5 milliseconds, and sliding time window histograms will always return the
last 10 seconds of data no matter when the stats are queried. Now that work is
done only when the stats are actually queried.

Since the Folsom library was abstracted away behind a couch_stats API, the rest
of the applications do not need to be updated. They still call
`couch_stats:update_histogram/2`, `couch_stats:increment_counter/1`, etc.

Previously couch_stats did not have any tests at all. Folsom and Bear had some
tests, but I don't think we ever ran those test suites. To rectify the
situation, I added tests to cover the functionality. All the newly added or
updated modules should have near or exactly 100% test coverage.

[1] #4650 (comment)
[2] https://www.erlang.org/doc/man/counters.html
[3] https://www.erlang.org/doc/man/persistent_term.html
nickva added a commit that referenced this issue Jul 11, 2023
@nickva
Contributor Author

nickva commented Jul 11, 2023

Folsom and Bear replacement is implemented in #4672

The full implementation preserves the same 20x speedup as the initial prototype: #4672 (comment)

nickva added a commit that referenced this issue Jul 12, 2023
nickva added a commit that referenced this issue Jul 18, 2023
nickva added a commit that referenced this issue Jul 19, 2023
nickva added a commit that referenced this issue Jul 19, 2023
nickva added a commit that referenced this issue Jul 19, 2023
nickva added a commit to apache/couchdb-ioq that referenced this issue Jul 19, 2023
We removed it in the main repo already.

Issue: apache/couchdb#4650
nickva added a commit to apache/couchdb-ioq that referenced this issue Jul 19, 2023
@nickva
Contributor Author

nickva commented Jul 19, 2023

The main PR was merged. Closing this issue as fixed.

@nickva nickva closed this as completed Jul 19, 2023
big-r81 pushed a commit that referenced this issue Jul 23, 2023