
Add more dimensions to fsm stats #414

Merged
merged 4 commits into from
Nov 15, 2012
Conversation

russelldb
Member

Exposed the put_fsm per stage timings
Added per stage timings to the get_fsm

sorry about the trailing whitespace clean-ups. Maybe I should disable that in my .emacs.

@russelldb russelldb mentioned this pull request Oct 25, 2012
@russelldb
Member Author

To verify this works, do some gets/puts then attach to the riak console and run something like:

 [{Stat, folsom_metrics:get_metric_value(Stat)} || Stat <- folsom_metrics:get_metrics(),
                                                       is_tuple(Stat), 
                                                       element(1, Stat) == riak_kv, 
                                                       element(2, Stat) == node_get_fsm].

Or

[{Stat, folsom_metrics:get_metric_value(Stat)} || Stat <- folsom_metrics:get_metrics(),
                                                       is_tuple(Stat),
                                                       element(1, Stat) == riak_kv,
                                                       element(2, Stat) == node_put_fsm].

Until I have the query / read stuff written.

Also, if you want to see per-bucket stats, you need to set the stat_tracked bucket property to true for the bucket(s) you're interested in.

Attach to a node and

 riak_core_bucket:set_bucket(<<"a">>, [{stat_tracked, true}]).

Then after putting/getting that bucket you should see stats like:

 {riak_kv,node_put_fsm,time,a,postcommit}

Where one element in the tuple is the bucket in question.

Again, attached to the node:

folsom_metrics:get_histogram_statistics({riak_kv,node_put_fsm,time,a}).

@@ -66,7 +66,11 @@
          bucket_props,
          startnow :: {non_neg_integer(), non_neg_integer(), non_neg_integer()},
          get_usecs :: non_neg_integer(),
-         tracked_bucket=false :: boolean() %% is per bucket stats enabled for this bucket
+         tracked_bucket=false :: boolean(), %% is per bucket stats enabled for this bucket
+         timing = [] :: [{atom(), {non_neg_integer(), non_neg_integer(),


Is this really [{atom(), timestamp()}] (erlang:timestamp/0)?

Member Author


Yes, yes it is, for shame.

@coderoshi
Contributor

Works as advertised +1

calc_timing(Timing, Now, ReplyNow, []).

%% Each timing stage has a start time.
calc_timing([], StageEnd, ReplyNow, Stages) ->


I see that this has been cut-and-pasted from another module, but it seems possible that it might someday be called with ReplyNow as undefined, which will blow up in the call to timer:now_diff/2. Is there a reason that can never happen? Should it be documented in case other things change?

Member Author


The case seems to be (and you're right, I didn't fully understand the function when I hoisted it for re-use) that {reply, TS} is always the ultimate or penultimate entry when you calculate timings. This means line 49 then gets the reply timestamp from the list, since undefined is matched as the third argument to the function.

Options are:

  1. Refactor so the function does what it seems to be supposed to do (calculate times for all states up to reply time), or
  2. Document the assumption that reply is the final or penultimate state in the list of timings.

I think 1 is the right thing to do, but will require more review from you.

Member Author


Ok, found more problems here. The calc_timing fun returns two timing entries for reply on the put_fsm, which means two updates to that stat. I really need to refactor this function so it does what I think it is meant to do. (Time each stage, and the total time from init to reply)
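
For illustration, a minimal sketch of what the refactored function is meant to compute: per-stage durations (summing duplicates rather than reporting them twice) plus the total time from init to reply. The module and function names here are assumptions for the sketch, not the merged riak_kv code; it assumes timings arrive most-recent-first and that a `reply` entry is present, as discussed above.

```erlang
-module(fsm_timing_sketch).
-export([calc_timing/1]).

%% Timings :: [{StageName :: atom(), erlang:timestamp()}], newest first.
%% Returns {TotalMicros, [{StageName, StageMicros}]}.
calc_timing(Timings) ->
    %% A reply entry is required; this match documents (and enforces) it.
    {reply, ReplyNow} = lists:keyfind(reply, 1, Timings),
    %% Reverse into chronological order; the head is the init/start stage.
    [{_FirstStage, StartNow} | _] = Ordered = lists:reverse(Timings),
    Total = timer:now_diff(ReplyNow, StartNow),
    {Total, stage_durations(Ordered, orddict:new())}.

%% Each stage lasts until the next stage's timestamp; duplicate stage
%% names are accumulated with orddict:update_counter/3, so reply is
%% only ever reported once.
stage_durations([{Stage, Now}, {NextStage, Next} | Rest], Acc) ->
    Acc1 = orddict:update_counter(Stage, timer:now_diff(Next, Now), Acc),
    stage_durations([{NextStage, Next} | Rest], Acc1);
stage_durations([{_LastStage, _Now}], Acc) ->
    orddict:to_list(Acc).
```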

@beerriot

Should there be a utility function for deleting tracked-bucket stats? That way we can turn them on for a while, then turn them off, and not have to worry about unused memory consumption?

@russelldb
Member Author

Yes there should. Good point. Same utility should provide a way to turn on / off bucket tracking too.

@russelldb
Member Author

I've opened an issue (#434) for this "utility to delete tracked-bucket stats", it will depend on code in basho/riak_core#252. It may or may not get into the code freeze. Is that acceptable, or does a +1 for this PR depend on it?

riak_kv_fsm_timing seems a better name for the former `util` module
since all it does is timing right now.
Refactored the `calc_timing` function to handle duplicate states
and made explicit the need for a `reply` tuple in the timings list.
@russelldb
Member Author

Added commit that I hope addresses your comments.

@beerriot

Issue #434 works for me. I see no need to hold up this PR for it.

There is no need to do an orddict find, update, and store;
orddict:update_counter/3 exists for that, and this change uses it.
@russelldb
Member Author

Squashed that fun.

@beerriot

It might be a good idea to loosen the function head match on riak_kv_stat:do_put_bucket and :do_get_bucket. As it stands, where only true and false are allowed for the first argument, it's easy to lose all of your stats by setting a bucket property incorrectly.

For instance, using curl:

# put something to update stats
$ curl -X PUT -H "content-type: text/plain" http://localhost:8098/riak/a/1 --data "okay"

# see stats increment
$ curl http://localhost:8098/stats | jsonpp | grep node_puts_total
  "vnode_puts_total": 3,
  "node_puts_total": 1,

# set the 'stat_tracked' bucket property to something not true or false
$ curl -X PUT -H "content-type: application/json" http://localhost:8098/riak/a --data "{\"props\":{\"stat_tracked\":\"not a boolean\"}}"

# put something to update stats again
$ curl -X PUT -H "content-type: text/plain" http://localhost:8098/riak/a/2 --data "bummer"

# see the stats clear
$ curl http://localhost:8098/stats | jsonpp | grep node_puts_total
  "vnode_puts_total": 0,
  "node_puts_total": 0,

Either rewriting the false clause of those functions as a don't-care/catch-all, or using PerBucket==true instead of just PerBucket should fix this.
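
A sketch of the suggested loosening, with hypothetical names (not the actual riak_kv_stat clauses): only `true` enables tracking, and any other property value, including a mis-set non-boolean, falls through to a don't-care clause instead of crashing the stats process.

```erlang
-module(loose_head_sketch).
-export([do_put_bucket/2]).

%% Strict heads matching only true/false would crash on any other
%% term; the catch-all below makes a bad property value harmless.
do_put_bucket(true, Bucket) ->
    {tracked, Bucket};
do_put_bucket(_PerBucket, _Bucket) ->
    untracked.
```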

@russelldb
Member Author

Good catch, maybe here is the right place to handle this (https://github.com/basho/riak_kv/blob/master/src/riak_kv_bucket.erl)?

@beerriot

Aha! I totally forgot that there is bucket-property validation now. Good plan.

@beerriot

One bit about per-bucket stats that's bugging me is the creation of an atom from the bucket name. For tracking a constant handful of buckets, this isn't bad, but for tracking lots of buckets over time, it may lead to atom table exhaustion (think about the case of enabling it as a default bucket prop). It seems like it would be safer (if slightly less efficient during update/query) to use the binary bucket name instead. Am I worrying too much? How hard would such a change be?
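
To make the trade-off concrete, here is a hedged sketch (stat-key shape taken from the examples above; function names are illustrative) contrasting the two key-building strategies. The Erlang atom table is never garbage collected, so the first version can exhaust it when many distinct buckets are tracked; the second keeps the bucket name as a binary, which is garbage collected.

```erlang
-module(atom_vs_binary_key).
-export([atom_key/1, binary_key/1]).

%% Risky: mints a brand-new atom for every tracked bucket name.
atom_key(Bucket) when is_binary(Bucket) ->
    {riak_kv, node_put_fsm, time, binary_to_atom(Bucket, utf8)}.

%% Safer: the binary bucket name sits directly in the stat-key tuple.
binary_key(Bucket) when is_binary(Bucket) ->
    {riak_kv, node_put_fsm, time, Bucket}.
```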

@russelldb
Member Author

Pushed change to validate that the stat_tracked property is a boolean.

I spoke with Jon about the risks of allowing people to track buckets in this way. We're going to warn against tracking too many. In fact, it should be emphasised as a temporary / debugging measure.

I agree with your concerns about the atom table and am investigating making the change to using binaries.

@ghost ghost assigned beerriot Nov 14, 2012
@russelldb
Member Author

I'm going to deal with the atoms for bucket names in a different branch. This PR is the first of a few that build on it, and I think #428 is the place to make that change.

Can I get a view on this PR as it stands, worries about atoms aside?

@russelldb
Member Author

I've made the changes to gh412-tree-stats to use binaries for bucket names in stat keys and it all works fine. I'll pay the price when it comes to offering a REST query interface, since I can't just use list_to_existing_atom/1 for each path token. Discerning what should be a binary vs. what should be an atom will be…fun.

@beerriot

Indeed, +1 for this branch. Thanks!

@russelldb russelldb merged commit 28c937d into master Nov 15, 2012
@ghost ghost assigned russelldb Nov 16, 2012
@seancribbs seancribbs deleted the rdb-fsm-timing-stats branch April 1, 2015 23:28