General purpose stats collection
Erlang Makefile
Latest commit e6d26d4 Apr 22, 2016 @lamont-granquist lamont-granquist Merge pull request #14 from chef/tas50/chefize
Opscode -> Chef
Failed to load latest commit information.
doc
include
src
test
.gitignore
LICENSE
Makefile
README.md Opscode -> Chef Feb 26, 2016
rebar.config

README.md

stats_hero your metrics and logging helper

Copyright (c) 2012-2016 Chef Software, Inc.

Authors: Seth Falcon (seth@chef.io).

Overview

The stats_hero application can help you instrument your Erlang application with metrics that will be reported to estatsd and browsable in Graphite.

How to instrument your code to collect metrics

Include the stats_hero.hrl file. Wrap calls to upstream services using the ?SH_TIME(ReqId, Mod, Fun, Args) macro. Here's an example from the chef_db module:

fetch_user(#context{reqid = ReqId, otto_connection = _Server} = _Context, UserName) ->
    case ?SH_TIME(ReqId, chef_sql, fetch_user, (UserName)) of
        {ok, not_found} ->
            not_found;
        {ok, #chef_user{}=User} ->
            User;
        {error, Error} ->
            {error, Error}
    end.

The ReqId argument is a request id generated by our front-end load balancers (via nginx module). It is used to lookup a helper process that aggregates metrics before sending them to estatsd. The macro calls stats_hero:ctime/3 and the stats_hero side of the computation is a no-op if no worker process is found that matches the specified request id. Note, however, that determining if there is a match requires the stats_hero application to be running. So instrumented code will run fine with a missing or crashed worker process, but will not function if the stats_hero application is not running.

In order to generate consistent labels in graphite, stats_hero maintains a mapping of module names to upstream names. The Mod argument serves a double purpose: it is used as the module name for the call to evaluate (and time) and it gets mapped to a generic upstream name (e.g. chef_sql maps to rdbms). A future enhancement would put this mapping in app config for stats_hero. If Mod is not recognized, an error is raised.

Sending metrics with stats_hero

At the begining of your request, start a stats_hero worker process using stats_hero_worker_sup:new_worker/1 like this:

    Config = [{estatsd_host, EstatsdServer},
              {estatsd_port, EstatsdPort},
              {request_id, ReqId},
              {my_app, "ChefAPI"},
              {request_label, RequestLabel},
              {request_action, atom_to_list(wrq:method(Req))},
              {upstream_prefixes, [<<"mysql">>, <<"couch">>, <<"authz">>, <<"solr">>]}],

stats_hero_worker_sup:new_worker(Config)

This will create the stats_hero worker and set the calling process as parent. In the event the calling process terminates prior to cleaning up the worker, then the stats_hero process will log and terminate gracefully. It is possible to change the monitored parent process using the reparent api.

stats_hero:reparent(ReqId),

You can also reparent using the two arity api where you supply a target process.

stats_hero:reparent(ReqId, Pid)

Then proceed to call code instrumented as described above. When your request is complete, tell your worker process and it will send a total time for the request to estatsd along with details on cummulative time spent in the specified upstreams. Use stats_hero:report_metrics/2 for that. For example,

stats_hero:report_metrics(ReqId, Code),

Where ReqId is your request id as a binary and Code is an HTTP response code (or similar integer).

Finally, you need to tell the stats_hero worker that you're done with it. Use stats_her:stop_worker/1 for that.

Obtaining metrics data for request logging

You can retrieve call timing and count data from your stats_hero worker using stats_hero:snapshot/2. This will return a proplist suitable for logging with the fast_log module tuple logger.

Here's the usage from chef_rest_wm:

log_request(Req, #base_state{reqid = ReqId, log_msg = Msg}) ->
    Status = wrq:response_code(Req),
    Tuples = [{req_id, ReqId},
              {status, Status},
              {method, wrq:method(Req)},
              {path, wrq:raw_path(Req)},
              {user, wrq:get_req_header("x-ops-userid", Req)},
              {msg, {raw, Msg}}],
    PerfTuples = stats_hero:snapshot(ReqId, agg),
    Level = log_level(Status),
    fast_log:Level(erchef, Tuples ++ PerfTuples).

log_level(Code) when Code >= 500 ->
    err;
log_level(_) ->
    info.

Design notes

The stats_hero workers are started using a simple_one_for_one dynamic supervisor. Each worker updates a shared ETS table that maintains a two-way mapping between worker pids and request ids. The workers also register with the stats_hero_monitor which monitors each worker to ensure that the pid/reqid mapping is cleaned from the ETS table when the worker exits. The top-level supervisor initializes the ETS table.

Metrics Reported to estatsd

Currently, the metrics sent to estatsd (UDP) are as follows and are hard-coded into the stats_hero worker. At the start of a request, the worker sends the following data:

1|167
test_hero.application.allRequests:1|m
test_hero.test-host.allRequests:1|m
test_hero.application.byRequestType.nodes.PUT:1|m

When stats_hero:report_metrics/2 is called the worker sends:

1|640
test_hero.application.byStatusCode.200:1|m
test_hero.test-host.byStatusCode.200:1|m
test_hero.application.allRequests:108|h
test_hero.test-host.allRequests:108|h
test_hero.application.byRequestType.nodes.PUT:108|h
test_hero.upstreamRequests.rdbms:1200|h
test_hero.upstreamRequests.authz:100|h
test_hero.upstreamRequests.rdbms.nodes.put:200|h
test_hero.upstreamRequests.rdbms.nodes.fetch:1000|h
test_hero.upstreamRequests.authz.nodes.read:100|h
test_hero.application.byRequestType.nodes.PUT.upstreamRequests.rdbms:1200|h
test_hero.application.byRequestType.nodes.PUT.upstreamRequests.authz:100|h

Mapping modules to upstream labels

The current handling of upstream prefixes and module mappings is not very flexible. You can specify a list of upstream prefixes when the worker is initialized and any metrics (ctimer) with a label that matches one of the prefixes will be appropriately aggregated in reporting. However, the helper macro used to instrument code currently relies on stats_hero:label/2 which hard-codes a mapping of module to upstream label. We'd like to make this configurable.

LICENSE

Copyright 2011-2016 Chef Software, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Modules

stats_hero
stats_hero_app
stats_hero_monitor
stats_hero_sender
stats_hero_sender_sup
stats_hero_sup
stats_hero_worker_sup