mgr: increase time resolution of Commit/Apply OSD latencies. #19232

socketpair · 2017-11-29T09:46:45Z

As you can see, millisecond resolution is not enough for monitoring fast (BlueStore, SSD, NVMe) storage.
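Here is a minimal sketch of the resolution problem. It uses an illustrative 250 µs commit latency, which is a made-up sample value and not a measurement from this PR. It also assumes the nanosecond-to-millisecond conversion truncates. Under that assumption, converting nanoseconds to whole milliseconds with integer division reports the latency as 0, while a fractional value keeps it. Both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: whole-millisecond reporting, as with the old u32 *_ms
// fields (assuming truncation). Integer division drops the remainder, so any
// sub-millisecond latency is reported as 0.
uint32_t to_whole_ms(uint64_t latency_ns) {
  return static_cast<uint32_t>(latency_ns / 1000000);
}

// Hypothetical helper: fractional milliseconds computed from nanoseconds
// keep the sub-millisecond detail.
double to_fractional_ms(uint64_t latency_ns) {
  return latency_ns / 1000000.0;
}
```

With these helpers, to_whole_ms(250000) returns 0 and to_whole_ms(1999999) returns 1, while to_fractional_ms(250000) returns 0.25.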

socketpair · 2017-11-29T09:51:51Z

I have not compiled this yet, so please take a look at my idea and tell me whether I should continue.

socketpair · 2017-11-29T09:53:00Z

src/mon/Paxos.cc

@@ -105,7 +105,7 @@ void Paxos::init_logger()
  pcb.add_u64_avg(l_paxos_commit_keys, "commit_keys", "Keys in transaction on commit");
  pcb.add_u64_avg(l_paxos_commit_bytes, "commit_bytes", "Data in transaction on commit");
  pcb.add_time_avg(l_paxos_commit_latency, "commit_latency",
-      "Commit latency", "clat");
+      "Commit latency (ns)", "clat");


I'm unsure what this is.

shinobu-x · 2017-11-29T10:25:51Z

Please follow:
https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst

socketpair · 2017-12-23T20:39:02Z

src/osd/osd_types.cc

-  f->dump_unsigned("commit_latency_ms", os_commit_latency);
-  f->dump_unsigned("apply_latency_ms", os_apply_latency);
+  // *_ms values just for compatibility.
+  f->dump_float("commit_latency_ms", os_commit_latency_ns / 1000000.0);


Note: unsigned -> float here.

Presumably the motivation for having commit_latency_ms in addition to commit_latency_ns is to have backward compatibility, so it probably doesn't make sense to change the type of the _ms field.

But JSON as a format does not distinguish between integers and floats internally, so after my patch any correctly written application that reads the JSON should simply see the improved precision.
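The JSON point can be illustrated with a hypothetical sketch that builds the two fields by hand with std::ostringstream. It is not ceph's Formatter API, and its output is not guaranteed to match Formatter output. With default stream formatting, a whole-millisecond value is still printed without a fractional part, so a consumer that used to parse integers sees the same text. Only sub-millisecond values gain decimals.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for dump_float("commit_latency_ms", ns / 1000000.0)
// next to a nanosecond field; this is hand-built JSON, not Formatter output.
std::string dump_commit_latency(uint64_t commit_latency_ns) {
  std::ostringstream out;
  // Default floating-point formatting prints 3.0 as "3" and 0.25 as "0.25".
  out << "{\"commit_latency_ms\": " << commit_latency_ns / 1000000.0
      << ", \"commit_latency_ns\": " << commit_latency_ns << "}";
  return out.str();
}
```

One caveat: the default stream precision is 6 significant digits, so in this sketch a value such as 1.234567 ms would be printed rounded. A real formatter may choose a different precision.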

jcsp · 2018-01-02T09:55:00Z

retest this please (jenkins)

jcsp · 2018-01-02T09:59:51Z

This seems reasonable to me. I'd be on the fence about whether the _ms fields in objectstore_perf_stat_t::dump still need to be there at all if we're outputting full-resolution nanosecond data. The structure isn't backwards compatible in general. Any thoughts, @tchaikov?

socketpair · 2018-01-02T20:00:02Z

Tests failed. I cannot figure out why.
consoleText.txt

tchaikov

Agreed. I don't think we need to keep the _ms field if we already have _ns, unless there are other issues preventing us from doing so.

tchaikov · 2018-01-03T07:29:05Z

src/osd/osd_types.cc

  ENCODE_FINISH(bl);
 }

 void objectstore_perf_stat_t::decode(bufferlist::iterator &bl)
 {
  DECODE_START(1, bl);
-  ::decode(os_commit_latency, bl);
-  ::decode(os_apply_latency, bl);
+  ::decode(os_commit_latency_ns, bl);


This is not backward compatible: os_commit_latency_ns and os_apply_latency_ns are 64-bit integers, while the fields they replace were 32 bits. That's why a buffer::end_of_buffer exception is thrown when the test uses the new decoder to decode a blob that was encoded before this change.
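Here is a self-contained sketch of the failure described above. It replaces ceph's bufferlist iterator with a hypothetical byte-vector reader. The old encoder writes two 32-bit fields (8 bytes), and a decoder that expects two 64-bit fields (16 bytes) runs past the end of the buffer. The end_of_buffer type here only mimics buffer::end_of_buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical mimic of ceph's buffer::end_of_buffer.
struct end_of_buffer : std::runtime_error {
  end_of_buffer() : std::runtime_error("end of buffer") {}
};

// Append the raw bytes of a fixed-width integer (host byte order).
template <typename T>
void append(std::vector<uint8_t>& buf, T value) {
  uint8_t raw[sizeof(T)];
  std::memcpy(raw, &value, sizeof(T));
  buf.insert(buf.end(), raw, raw + sizeof(T));
}

// Read a fixed-width integer, throwing if fewer than sizeof(T) bytes remain.
template <typename T>
T consume(const std::vector<uint8_t>& buf, std::size_t& pos) {
  if (pos + sizeof(T) > buf.size()) throw end_of_buffer();
  T value;
  std::memcpy(&value, buf.data() + pos, sizeof(T));
  pos += sizeof(T);
  return value;
}

// Old layout: commit and apply latency as two u32 millisecond values.
std::vector<uint8_t> encode_old(uint32_t commit_ms, uint32_t apply_ms) {
  std::vector<uint8_t> buf;
  append(buf, commit_ms);
  append(buf, apply_ms);
  return buf;
}

// New decoder expecting two u64 values; returns true if it hit end_of_buffer.
bool new_decoder_throws(const std::vector<uint8_t>& old_blob) {
  std::size_t pos = 0;
  try {
    consume<uint64_t>(old_blob, pos);  // silently swallows both u32 fields
    consume<uint64_t>(old_blob, pos);  // nothing left: throws
  } catch (const end_of_buffer&) {
    return true;
  }
  return false;
}
```

The first 64-bit read does not fail. It silently reinterprets both old fields as one meaningless value, so the mismatch only surfaces as a buffer overrun on the second read.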

What is this blob intended for? Should I fix the tests or the code? Where is this blob used?

The objectstore stats get encoded and transmitted as part of MPGStats in OSD::send_pg_stats (via store->get_cur_stats).

That means that if you make this change in an OSD and someone has an older monitor, the monitor will crash when it tries to decode the message.

Increase the version of the encoded structure (increment the first 1 in ENCODE_START in ::encode) and it will still be understood by older daemons as long as fields are only added (not changed in place). Annoyingly, in this instance that means the ::encode function needs to output two u32 fields (truncated from os_commit_latency_ns and os_apply_latency_ns) followed by the two u64 values. Then in ::decode you would decode two u32 values, then if the message version was >=2 you'd throw them away and decode two u64 values, else you'd stop.

Check out the (many) places in the ceph tree where " if (struct_v " fragments appear to see how that kind of conditional decoding is done.
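The conditional-decoding pattern described above can be sketched without ceph's macros. This is a hypothetical, simplified model. The Buf, put and get helpers are assumptions, and the header layout is only loosely modeled on ENCODE_START/DECODE_START. Version 2 writes the legacy u32 millisecond fields first and then appends the u64 nanosecond fields. The decoder checks struct_v before it reads the new fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified framing: [u8 struct_v][u8 compat_v][u32 len][payload].
using Buf = std::vector<uint8_t>;

template <typename T>
void put(Buf& b, T v) {
  uint8_t raw[sizeof(T)];
  std::memcpy(raw, &v, sizeof(T));
  b.insert(b.end(), raw, raw + sizeof(T));
}

template <typename T>
T get(const Buf& b, std::size_t& pos) {
  if (pos + sizeof(T) > b.size()) throw std::out_of_range("end of buffer");
  T v;
  std::memcpy(&v, b.data() + pos, sizeof(T));
  pos += sizeof(T);
  return v;
}

struct PerfStat {
  uint64_t commit_latency_ns = 0;
  uint64_t apply_latency_ns = 0;
};

// v2 keeps the v1 fields (two u32 ms values) first and only appends the
// u64 ns values, so a v1 decoder still finds what it expects.
Buf encode_v2(const PerfStat& s) {
  Buf payload;
  put<uint32_t>(payload, static_cast<uint32_t>(s.commit_latency_ns / 1000000));
  put<uint32_t>(payload, static_cast<uint32_t>(s.apply_latency_ns / 1000000));
  put<uint64_t>(payload, s.commit_latency_ns);
  put<uint64_t>(payload, s.apply_latency_ns);
  Buf out;
  put<uint8_t>(out, 2);  // struct_v
  put<uint8_t>(out, 1);  // compat_v: a v1 decoder may still read this
  put<uint32_t>(out, static_cast<uint32_t>(payload.size()));
  out.insert(out.end(), payload.begin(), payload.end());
  return out;
}

// New decoder: accepts v1 blobs (ms only) and v2 blobs (ms + ns).
PerfStat decode(const Buf& b) {
  std::size_t pos = 0;
  const uint8_t struct_v = get<uint8_t>(b, pos);
  get<uint8_t>(b, pos);   // compat_v
  get<uint32_t>(b, pos);  // payload length
  const uint32_t commit_ms = get<uint32_t>(b, pos);
  const uint32_t apply_ms = get<uint32_t>(b, pos);
  PerfStat s;
  if (struct_v >= 2) {
    // Throw away the truncated ms values; decode the u64 ns values instead.
    s.commit_latency_ns = get<uint64_t>(b, pos);
    s.apply_latency_ns = get<uint64_t>(b, pos);
  } else {
    // This sketch's choice for v1 input: widen the old ms values to ns.
    s.commit_latency_ns = uint64_t{commit_ms} * 1000000;
    s.apply_latency_ns = uint64_t{apply_ms} * 1000000;
  }
  return s;
}

// Old v1-only decoder: reads the two u32 fields and never looks at the rest,
// standing in for DECODE_FINISH skipping to the end of the payload.
uint32_t decode_v1_commit_ms(const Buf& b) {
  std::size_t pos = 0;
  get<uint8_t>(b, pos);
  get<uint8_t>(b, pos);
  get<uint32_t>(b, pos);
  return get<uint32_t>(b, pos);
}
```

In this sketch the payload grows from 8 bytes (v1) to 24 bytes (v2). That is the kind of size increase the review weighs against a feature-bit approach.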

@jcsp Huge thanks. I have changed code by your advice.

socketpair · 2018-01-21T21:17:11Z

@tchaikov

Agreed. I don't think we need to keep the _ms field if we already have _ns, unless there are other issues preventing us from doing so.

Should I remove these fields so the review can be closed?

socketpair · 2018-01-22T09:06:10Z

@shinobu-x
@jcsp
@tchaikov

Please review. All tests passed.

jcsp · 2018-01-22T10:57:39Z

Looks correct now, question is just are we okay with the increase in size of the encoded form (I think probably yes?)

tchaikov · 2018-01-23T07:01:25Z

question is just are we okay with the increase in size of the encoded form (I think probably yes?)

Probably we can avoid increasing the size of MPGStats by complicating MPGStats::encode_payload() and MPGStats::decode_payload(). Please note that QOS_DMC was introduced in mimic, so we can reuse this feature bit in this release.

I prepared a change at tchaikov@e64ba65 to pass the features down to objectstore_perf_stat_t's encoder.

probably we need to update the ceph-object-corpus submodule to add MPGStats to archive/12.0.0/forward_incompat.
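The feature-bit approach could look something like the following hypothetical sketch. The encoder receives the peer's feature bits and writes exactly one layout, so it does not spend bytes on both representations. FEATURE_NS_LATENCY is an invented placeholder for the reused QOS_DMC bit, and the one-byte version header is a simplification. Neither matches ceph's actual constants or framing, and the sketch is not necessarily what tchaikov@e64ba65 does.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

using Buf = std::vector<uint8_t>;

// Invented placeholder bit; not ceph's real feature constant or value.
constexpr uint64_t FEATURE_NS_LATENCY = uint64_t{1} << 5;

template <typename T>
void put(Buf& b, T v) {
  uint8_t raw[sizeof(T)];
  std::memcpy(raw, &v, sizeof(T));
  b.insert(b.end(), raw, raw + sizeof(T));
}

// Choose the layout from the receiver's features instead of always sending
// both the legacy u32 ms fields and the new u64 ns fields.
Buf encode_for_peer(uint64_t commit_ns, uint64_t apply_ns,
                    uint64_t peer_features) {
  Buf out;
  if (peer_features & FEATURE_NS_LATENCY) {
    put<uint8_t>(out, 2);  // peer understands v2: ns values only
    put<uint64_t>(out, commit_ns);
    put<uint64_t>(out, apply_ns);
  } else {
    put<uint8_t>(out, 1);  // legacy peer: whole-ms u32 values only
    put<uint32_t>(out, static_cast<uint32_t>(commit_ns / 1000000));
    put<uint32_t>(out, static_cast<uint32_t>(apply_ns / 1000000));
  }
  return out;
}
```

With this shape, a legacy peer receives an 8-byte payload and an upgraded peer receives 16 bytes. Sending both layouts unconditionally would cost every peer 24 bytes.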

tchaikov · 2018-01-23T07:38:10Z

src/osd/osd_types.cc

+  decode(commit_latency_ms, bl);
+  decode(apply_latency_ms, bl);
+  if (struct_v >= 2) {
+      decode(os_commit_latency_ns, bl);


wrong indent.

Is there an auto-indentation script?

@socketpair not yet, AFAICT. Some (if not all) developers use VIM or Emacs for editing source files, and these editors read the file variables in the first lines of the source files. For the coding style we are following, see https://github.com/ceph/ceph/blob/master/CodingStyle

Fixed. FYI:

clang-format-3.9 -i -style="{BasedOnStyle: LLVM, IndentWidth: 2, BreakBeforeBraces: Linux, AllowShortIfStatementsOnASingleLine: false, IndentCaseLabels: false}" makes a huge number of changes. I think someone should reformat the sources with this (or another) tool and also add a GitHub check that patched sources do not change after automatic reformatting.

socketpair · 2018-01-23T09:59:57Z

@tchaikov

i prepared a change at tchaikov/ceph@e64ba65

Should I cherry-pick your changes on top of mine?

tchaikov · 2018-01-23T10:15:46Z

Should I cherry-pick your changes on top of mine?

@socketpair as long as you think it's correct.

Increase precision/resolution of time measurements in performance monitoring. Affects only Commit/Apply OSD latencies. Signed-off-by: Коренберг Марк <socketpair@gmail.com>

socketpair · 2018-01-24T08:33:35Z

@tchaikov Respectfully, I will not merge your changes since I do not understand them. If you want, please make a separate PR after my PR is merged.

socketpair · 2018-02-03T17:39:31Z

@tchaikov ping

tchaikov · 2018-02-05T14:03:56Z

@jcsp what do you think of the idea in #19232 (comment)?

socketpair · 2018-02-05T14:04:33Z

Thanks, @tchaikov. What should I do in order to merge this PR ?

tchaikov · 2018-02-05T14:21:18Z

@socketpair I am adding the needs-qa label so this PR can be picked up in a rados qa suite run. In the meantime, we can wait for insights from @jcsp.

jcsp · 2018-02-05T21:06:05Z

@tchaikov since there is already a feature bit available in the release, your patch looks like a 👍 thing to do here to avoid bloating the encoded form

tchaikov · 2018-02-09T02:44:10Z

the failures are either caused by #19117 or tracked by http://tracker.ceph.com/issues/9356 .

tchaikov · 2018-02-09T03:23:10Z

thanks @jcsp , #20378 is posted.

socketpair · 2018-02-09T07:01:56Z

Thanks @tchaikov !

socketpair commented Nov 29, 2017


socketpair changed the title ~~(WIP) Increase precision of Commit/Apply OSD latencies (in performance moniroting)~~ mgr: increase time resolution of Commit/Apply OSD latencies. Dec 6, 2017

batrick added mgr needs-review labels Dec 20, 2017

socketpair commented Dec 23, 2017


jcsp added bluestore core labels Jan 2, 2018

tchaikov self-requested a review January 2, 2018 13:36

tchaikov requested changes Jan 3, 2018


tchaikov self-requested a review January 22, 2018 16:50

tchaikov removed their request for review January 23, 2018 07:37

tchaikov reviewed Jan 23, 2018


mgr: increase time resolution of Commit/Apply OSD latencies.

3b75db8


tchaikov approved these changes Feb 5, 2018


tchaikov added the needs-qa label Feb 5, 2018

tchaikov added the wip-kefu-testing label Feb 7, 2018

tchaikov merged commit 24d1b2e into ceph:master Feb 9, 2018

tchaikov mentioned this pull request Feb 9, 2018

osd: check feature bits when encoding objectstore_perf_stat_t #20378

Merged

socketpair deleted the precision branch February 9, 2018 05:53
