Skip to content

Race condition in some metrics (counter and gauge are fine) #20

etrepum opened this Issue May 2, 2012 · 11 comments

2 participants

etrepum commented May 2, 2012

These metrics are ok:

  • folsom_metrics_counter (update_counter is atomic)
  • folsom_metrics_gauge (insert is atomic)

These metrics can drop data due to race conditions (interleaved get_value / insert):

  • folsom_metrics_histogram
  • folsom_metrics_meter
  • folsom_metrics_meter_reader

And this one can grow and shrink incorrectly (interleaved ets:info(…, size) / delete / insert):

  • folsom_metrics_history

There are some similar problems with folsom_sample_exdec, folsom_sample_none, folsom_sample_uniform but those problems would go away if folsom_metrics_histogram usage was serialized by key.

The most straightforward way out of this predicament is to set up a process per metric (of these types) for update serialization. The metrics are basically determined at compile time and their count should be smallish (hundreds maybe), so a fixed size pool isn't necessary. Most reads can still go directly to the table since it's updated atomically, but some of the histogram implementations may need some kind of get_values serialization.

  • folsom_sample_exdec: can delete a table before you get its contents, maybe ets:delete_all_objects/1 would be preferable to cycling the table. This would change worst case to returning somewhere between 0 and Size samples instead of crashing. Inserting a list instead of calling insert in a loop would change it such that 0 or Size samples could be returned, since inserting a list is atomic.
  • folsom_sample_none: could be in a state where there are Size - 1 items in the pool (this is likely acceptable)
  • folsom_sample_uniform: this looks like it would always be in a consistent state, so definitely acceptable.

Yup, I am aware of these issues and have started work to resolve them. Here is one option for exdec samples:

I haven't merged this yet as I am not convinced this is the best way to go yet.

etrepum commented May 2, 2012

If you're open to dependencies, a straightforward solution would be to use gproc. This would make it super easy to register worker processes with the name of the metric that they own and do the pub/sub.

Not sure that making exdec a process is the best solution, although it looks like it might solve the problem in that particular place.

etrepum commented May 2, 2012

Also, which of the notify interfaces in folsom_metrics / folsom_ets are actually used? folsom_metrics:notify/1 is the only one that's documented.


Right, I was looking at that last week when I was writing the exdec_gen_server branch. Seemed like a decent way to go.


folsom_metrics is the API module I prefer folks use.

etrepum commented May 2, 2012

What I mean is that there are notify/1, notify/2, notify/3, and notify_existing_metric/3.

notify/1 and notify/2 are essentially the same thing, and notify_existing_metric/3 has similar semantics. The comments lead me to believe that notify/2 is actually preferred, but the documentation only shows notify/1. notify_existing_metric/3 seems reasonable to have, but it's not clear if it's actually worth the trouble. notify/2 could surely be made faster by essentially passing down the result of get_info(Name) to notify/4 and not calling handler_exists at all.

notify/3 would be a good thing to get rid of if you could, or at least clean up all the implementation bloat it causes. I could see more use cases for it if metrics were named by terms and not atoms.

etrepum commented May 2, 2012

So here's what I was thinking last night, I started sketching it out last night but didn't finish it.

Have a function that maps types to a way to create metrics from them. Sometimes the state will be meaningless, sometimes it will be a pid. If it was a behaviour I'd have it look like this:

%% The returned module here is a folsom_metric_handler
-callback new(Name :: atom(), Opts :: [{atom(), term()}]) -> {Handler :: module(), State :: term()}.

Define a behaviour to encapsulate what a metric must do:


-callback delete(Name :: atom(), State :: term()) -> ok.
-callback get_value(Name :: atom(), State :: term()) -> term().
-callback get_value(Name :: atom(), Count :: integer(), State :: term()) -> term().
-callback notify(Name :: atom(), Value :: term(), State :: term()) -> term().

folsom_ets should change its #metric{} to include the state and the handler module returned by the constructor. An example module might look like this:



new(Name, []) ->
    {?MODULE, ets:insert(?COUNTER_TABLE, {Name, 0})}.

notify(Name, Value, _State) when is_integer(Value) ->
    inc(Name, Value);
notify(Name, Value, _State) ->
    case Value of
        inc ->
            inc(Name, 1);
        {inc, Value} ->
            inc(Name, Value);
        dec ->
            inc(Name, -1);
        {dec, Value} ->
            inc(Name, -Value)

delete(Name, _State) ->
    ets:delete(?COUNTER_TABLE, Name).

get_value(Name, _State) ->
    [{_, Value}] = ets:lookup(?COUNTER_TABLE, Name),

get_value(Name, _Count, _State) ->
    get_value(Name, _State).

inc(Name, Value) ->
    ets:update_counter(?COUNTER_TABLE, Name, Value).

There would be a wrapper module that constructs a metric in its own gen_server process, this would itself be a metric that has State as its pid. Metrics that it wraps would have an init/1 to return an initial {ok, State}, and notify casts would return the new state for the gen_server (will often be faster than using ets since we are already serialized).


%% API
-export([start_link/0, new/2]).


%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-record(state, {name :: atom(),
                handler :: module(),
                state :: term()}).

new(Name, Opts) ->
    {ok, Pid} = gen_server:start(?MODULE, [{name, Name} | Opts]),
    {?MODULE, Pid}.

notify(Name, Value, Pid) ->
    gen_server:cast(Pid, {notify, Name, Value}).

delete(Name, Pid) ->
    gen_server:call(Pid, {delete, Name}).

get_value(Name, Pid) ->
    gen_server:call(Pid, {get_value, Name}).

get_value(Name, Count, Pid) ->
    gen_server:call(Pid, {get_value, Name, Count}).

init(Args) ->
    {_, Name} = lists:keyfind(name, 1, Args),
    {_, Handler} = lists:keyfind(handler, 1, Args),
    {_, Opts} = lists:keyfind(opts, 1, Args),
    {ok, S} = Handler:init(Opts),
    {ok, #state{name=Name, handler=Handler, state=S}}.

handle_call({delete, Name}, _From
            State=#state{handler=Handler, state=S}) ->
    Reply = Handler:delete(Name, S),
    {stop, normal, Reply, S#state{state=undefined}};
handle_call({get_history_values, Name, Count}, _From,
            State=#state{handler=Handler, state=S}) ->
    Reply = Handler:get_history_values(Name, Count, S),
    {reply, Reply, State};
handle_call({get_value, Name}, _From,
            State=#state{handler=Handler, state=S}) ->
    Reply = Handler:get_value(Name, S),
    {reply, Reply, State}.

handle_cast({notify, Name, Event},
            State=#state{handler=Handler, state=S}) ->
    S1 = Handler:notify(Name, Event, S),
    {noreply, State#state{state=S1}}.

handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

An example that uses this might look something like this:



new(Name, Opts) ->
    folsom_metric_proc:new(Name, [{handler, ?MODULE}, {opts, Opts}]).

init(Opts) ->
    {Type, Size, Alpha} = lists:foldl(
        fun opt_fold/2,
    Sample = folsom_sample:new(Type, Size, Alpha),
    {ok, #histogram{type=Type, sample=Sample}}.

notify(_Name, Value, Hist=#histogram{type=Type, sample=Sample}) ->
    Hist#histogram{sample=folsom_sample:update(Type, Sample, Value)}.

delete(_Name, _State) ->

% pulls the sample out of the record gotten from ets
get_value(_Name, #histogram{type=Type, sample=Sample}) ->
    folsom_sample:get_values(Type, Sample).

get_value(Name, _Count, Hist) ->
    get_value(Name, Hist).

%% Internal API

opt_fold({type, Type}, {_Type, Size, Alpha}) ->
    {Type, Size, Alpha};
opt_fold({size, Size}, {Type, _Size, Alpha}) ->
    {Type, Size, Alpha};
opt_fold({alpha, Alpha}, {Type, Size, _Alpha}) ->
    {Type, Size, Alpha}.

This seems like a pretty reasonable solution. I assume we also want a metric supervisor to keep an eye on each new metric gen_server that we start up (supervisor:start_child/2).

etrepum commented May 5, 2012

Yeah, a simple_one_for_one or something like that

@joewilliams joewilliams was assigned Aug 10, 2012

Test message, ignore me.

@joewilliams joewilliams removed their assignment Oct 21, 2015

Folsom has moved, please resubmit your issue at Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.