Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DNM] tracing: use LTTng for logging #22458

Closed
wants to merge 1 commit into from
Closed

[DNM] tracing: use LTTng for logging #22458

wants to merge 1 commit into from

Conversation

mogeb
Copy link
Contributor

@mogeb mogeb commented Jun 7, 2018

See src/tracing/tracetool/README for details. Use -DWITH_LTTNG_LOGGING=ON with cmake to use LTTng for logging.
Using fio's objectstore backend, with the following job:

runtime=120
rw=randwrite
bs=4m
numjobs=16
iodepth=16

I get the following results:

  • with dout and debug bluestore = 10/0:
Average throughput = 119 MiB/s
Average IOPS = 13.68
Disk usage: 48M
  • with LTTng:
Average throughput = 125 MiB/s
Average IOPS = 14.09
Disk usage: 23M

fio is writing to Bluestore on an SSD and the log is being written to an HDD for both dout and LTTng. The throughput difference varies. I've seen LTTng up to 10% faster with more iodepth and numjobs (more contention on dout's internal structures), and I've seen it pretty much the same as dout (with lower iodepth and lower numjobs). However, the size on disk of the log is consistently 50%-60% smaller with LTTng (as expected).

@mogeb mogeb changed the title [DNM] Use LTTng for logging [DNM] tracing: use LTTng for logging Jun 7, 2018
@tchaikov tchaikov self-requested a review June 7, 2018 15:59
src/ceph_osd.cc Outdated
chan.attr.read_timer_interval = 200;
chan.attr.output = LTTNG_EVENT_MMAP;

dout(1) << "mogeb creating session " << tracing_session << " in " << (string("/var/data/ceph/build/out/") + tracing_session) << dendl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe replace "mogeb" with something else? :)

@@ -0,0 +1,28 @@
#!/bin/bash
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file can be ignored, I've been using it locally so I don't have to do_cmake every time to switch between backends.

@@ -445,6 +445,18 @@ struct ghobject_t {
(*this) = temp;
}

operator std::string() {
std::stringstream out;
out << this;
Copy link
Contributor

@chardan chardan Jun 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this will write the address of "this" to the stream, rather than the value, *this. Is that what you want? Is there some other magic happening here that writes the value instead? Do I just need coffee? :-)
(There are a number of these, so I'm guessing that /is/ what you want, but I saw at least one exception so I figure it's worth asking!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OT, but these operator whatever() methods make me very nervous. IIUC, this will silently cast/convert the object to a string if it's in a context that wants a string (e.g., passing to void func(string foo)), right? I've been bitten by silent/accidental conversions in the past. Do we need these, or would an explicit std::string str() method suffice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct, it's for the implicit cast to string for unknown types. A simple str() method would also work, and we'd append the .str() call in tracetool.

@@ -3792,6 +3809,7 @@ class pg_missing_const_i {
virtual bool is_missing(const hobject_t& oid, pg_missing_item *out = nullptr) const = 0;
virtual bool is_missing(const hobject_t& oid, eversion_t v) const = 0;
virtual ~pg_missing_const_i() {}
virtual operator std::string() { return "needs implementation"; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this class is already pure virtual, why not just make this pure virtual as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're obviously right. I was either not thinking, or this was a note to myself to look into implementing the operator in the base class. Either way, thanks!

return out.str();
}

operator std::string() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need two operator std::string()? i think the const one would suffice.

@tchaikov
Copy link
Contributor

tchaikov commented Jul 5, 2018

hey @mogeb may i learn your plan on this PR? please let me know if you need more hands. probably i can help on it.

@mogeb
Copy link
Contributor Author

mogeb commented Jul 5, 2018

@tchaikov sorry for the delay, I got side-tracked on another project for a couple of weeks now, but I'm planning on coming back to this PR next week.

I already have a second iteration of this PR ready and working, I just need to clean it a bit and probably ask you for a code review and to try it for usability. The second iteration gets rid of the tracing/tracetool/subsys files, as I explained here. If you have any thought that approach, please feel free to share it.

@mogeb
Copy link
Contributor Author

mogeb commented Jul 11, 2018

@tchaikov I've updated the PR. This is still experimental so please excuse some less than optimal code. You should be able to try it, and I think it's better to look at the usability before doing a full code review. For now, only Bluestore.cc can be instrumented, I've updated the README accordingly.

@tchaikov tchaikov self-requested a review July 17, 2018 14:52
@@ -178,6 +178,9 @@
/* Define if you want to use LTTng */
#cmakedefine WITH_LTTNG

/* Degine if you want to use LTTng for logging instead of dout */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Degine/Define/

@@ -723,6 +723,13 @@ class interval_set {
other = std::move(m);
}

public:
operator std::string() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark it const if you please.

@@ -355,7 +355,11 @@ int cmp(const hobject_t& l, const hobject_t& r)
return 0;
}


ghobject_t::operator std::string() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, we could use CRTP for this idiom, like

template <typename T>
struct Printable {
  operator std::string() const {
    std::stringstream out;
    out << *static_cast<const T*>(this);
    return out.str();
  }
};

struct Bar : Printable<Bar>
{};

ostream& operator<<(ostream& os, const Bar&) {
  return os << "bar!";
}

so we don't need to repeat this method for every class that needs to be printed in the new logging system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that would make sense. OTOH this comment suggests to use a str() method instead of relying on the implicit operator std::string() to avoid surprises. What are your thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am also for the str() method, or at least make the operator std::string explicit, like

explicit operator std::string() const { 
 /// ....
}

. @mogeb have you considered reusing a single trace point everywhere and send the fmt string to it, i.e. use it as an alternative of syslog(3) without the ... parameters. if we are fine with this, we can avoid all these troubles. but the downside of this apporach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov sure, that's the very first solution I explored (pre-Cephalocon). I initially changed the implementation of dout() to just record a tracepoint. With that approach, none of the actual has to be changed. Another alternative could be to use LTTng's tracef() (similar to printf), but it also needs/uses strings.

My objective here though is to avoid passing strings around for performance reasons.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I think if the legacy dout lines degenerate into a slow string op, we can focus our efforts on replacing dout calls in the fast path and leave the less frequently-run and slow-path code alone until later (e.g., peering).

Copy link
Contributor

@tchaikov tchaikov Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks guys. after re-reading tracing/tracetool/{README.md,parsetool.py}, now i understand the objective and rationale behind this design.

Copy link
Contributor

@tchaikov tchaikov Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mogeb i agree with @liewegas on this. and i was thinking about how to ease the pain to add a logging message in ceph in future. i even explored the possibility to use libclang to parse the source for creating the .tp file. with it, we can coin the event name using the filename, and lineno, and deduce the name of argument type from the argument in the trace_*(...) expression, so we can extract information interesting to us and create the .tp , .h and _impl.h files for us. but this dependency is kind of heavy weight for us at this moment. probably we can revisit it in future.

but, we can always have a compromise here by differentiating following two use cases:

  1. the logging in critical path: to use trace_${event_name}(loglevel, subsys, [arg_type, arg_name]...), so we don't need to render the values of primitive types in string before passing them to LTTng.
  2. the logging in non-critical path: to use trace(loglevel, subsys, fmt_str, [arg]...). we can use fmt::format() offered by libfmt in the pasrsetool.py. as libfmt is already included by ceph as a submodule. and make trace() a template function. which calls tracepoint(subsys, ${default_event_name}_at_${log_evel}) , assuming all combinations of ${default_event_name}_at_${log_level} TRACEPOINT_EVENTs of different ${subsys}s and ${log_level}s combinations have been defined by parsetool.py. they only have a single TP_FIELD.

in this way, we can have

  • a unified logging backend, so developers or users don't need to check different sources for tracing an issue, and
  • a flexible front-end, so developers don't need to worry about the name of a trace event, and the argument type of event payload for all logging messages. they only need to do so when it's really needed.

what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the clang approach would be great. Sounds good, let's rethink it later as you said.

For having a critical and non-critical paths: just to make sure I understand, the non-critical path uses a formatted string using the arguments' operator <<() functions, correct? If that's the case, we could change the implementation of dout to record a tracepoint instead of using trace(loglevel, subsys, fmt_str, [arg]...). This becomes even easier for the developers as nothing changes: dout() just takes the string passed to it and ends up calling tracepoint() at the end. I'd need to figure out the details, but I was able to achieve that at the very beginning. Any thought?

Copy link
Contributor

@tchaikov tchaikov Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the non-critical path uses a formatted string using the arguments' operator <<() functions, correct?

yes. at this moment, we can use the argument's operator<<(), so minimum change is needed. in short and long term, i'd suggest use libfmt's fmt::format() instead, it has better performance than that of std::iostream. probably we can expose two more interfaces

  1. just for the backward compatibility: `dout(1) << "ceph rock" << 5 << '!' << dendl;
  2. as the faster one: trace(bluestore, 1, "{0} rock{1}{2}", "ceph", 5, '!');
  3. as the fatest one: trace_ceph_rocks(1, bluestore, string, who, "ceph", int, s, 5, uint_8, wow, '!', "%s rock %d%d");

so, i'd suggest start using the trace() or trace_${event_name} interface in the new code. and replace the implementaion of dout(1) for the existing code.

'static inline void __%(api)s(%(args)s)',
'{',
api=e.api(e.CEPH_TRACE),
args=", ".join([' '.join(arg[:2]) if is_valid_type(arg[0]) else ' '.join(('string', arg[1])) for arg in e.args]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to differentiate the args which are already std::string or string in Event.build(). probably we could have three Arg types to represent them?

so we don't create a temporary copy of them just for passing them to ctf_string().

@tchaikov
Copy link
Contributor

@mogeb i will try your patch out. probably the best way to improve it is to use it.

if(WITH_LTTNG_LOGGING)
add_dependencies(osd bluestore-parsetool)
target_link_libraries(osd logging_tp)
target_link_libraries(osd lttng-ust-fork)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to add dependencies from os to *_impl.h, as BlueStore.cc now includes them. could use set_source_files_properties() for this purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted, thanks.

@mogeb
Copy link
Contributor Author

mogeb commented Aug 1, 2018

@tchaikov any thought on the usability part? The LTTng usage, the resulting trace, reading/parsing it, etc.

src/ceph_osd.cc Outdated
@@ -19,6 +19,7 @@

#include <iostream>
#include <string>
#include <lttng/lttng.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to add liblttng-ctl-dev as a build dependency.

@tchaikov
Copy link
Contributor

tchaikov commented Aug 6, 2018

@mogeb i am stilling playing around with it. will reply you tomorrow.

@tchaikov
Copy link
Contributor

tchaikov commented Aug 7, 2018

@mogeb following is what we have in the output of lttng view after applying this changeset:

...
[18:07:11.985680268] (+0.000163564) gen8 bluestore:do_write_small: { cpu_id = 4 }, { offset = 0x0, length = 0x1DC }
[18:07:11.985854796] (+0.000174528) gen8 bluestore:write: { cpu_id = 4 }, { cid = "meta", oid = "#-1:c683972f:::inc_osdmap.12:0#", offset = 0x0, length = 0x1D
C, r = 0 }
...

i think this the best compact representation of a (log) event we can have in the fast path, but from a typical user's point of view, it's still not very friendly. we can ship the .tp files in a separated package accompanied by a set of command line tools for analyzing the tracings. one of these tools will be able to act as a filter to utilize the babeltrace parser to compose the tracing events and the .tp file to produce a more readable output, in other words, to printf the fields along with the format string.

also, to integrate with existing logging infrastructure:

  1. to see if we can use lttng rotate along with logrotate
  2. update the systemd script to start/stop lttng sessions and tracing

but we can leave these tooling for following-up PRs. i guess, the first thing we need to have is dout backend and the "faster" mode, and to fix the CMakeLists.txt, to make sure it compiles if WITH_LTTNG_LOGGING is enabled. we can keep WITH_LTTNG_LOGGING disabled by default before we get these tooling and script sorted out.

once this change gets merged. we will

  • improve the tooling,
  • update the systemd scripts.
  • fix the packaging dependencies

and then switch OSD over to LTTng based logging.

what do you think?

@mogeb
Copy link
Contributor Author

mogeb commented Aug 7, 2018

@tchaikov I really like the approach, especially having a tool that knows how to rewrite the trace's events as more human-readable ones. Note however that the .tp files don't have the printf information in them (how to format and what to print), so we'll need to add a file for it. The session rotation option in the 2.10 version of LTTng certainly makes things easier. We can set the rotation to be automatic and either time-based or size-based. Did you measure any performance difference other than what I had reported? I think the reduced size on disk is a given, but let's confirm that the performance is also better to warrant this added complexity. Once I clean this PR I will run benchmarks on a real cluster too.

A couple of small things to clear:

  • For the slow path, do we want to keep dout as it is (plain text logging), or do we want to make dout use LTTng, but with the complete formatted string in the payload of the event?
  • Which log entries do we rewrite as fast-path? Maybe the ones lower than a certain loglevel?

@tchaikov
Copy link
Contributor

tchaikov commented Aug 7, 2018

Did you measure any performance difference other than what I had reported?

@mogeb not yet. i will try to reproduce your test in my test bed.

For the slow path, do we want to keep dout as it is (plain text logging), or do we want to make dout use LTTng, but with the complete formatted string in the payload of the event?

i think it'd be hard for us to read the logging messages if we have a hybrid logging: plain text logging + LTTng tracing. to ease the transition. i'd recommend a runtime option for switching between the plain-text backend and the LTTng backend.

Which log entries do we rewrite as fast-path? Maybe the ones lower than a certain loglevel?

i don't think there is a simple criteria for telling if a log entry is in the fast path by looking at its loglevel. for instance, we use dout(1) for a critical warning or very important message, for example,

dout(1) << "not feeling well.. i am about to re-boot" << dendl;

and use dout(30) for a trivial message,

dout(30) << "tick, tick. i am healthy!" << dendl;

i think, to ease the pain of migration, before we have the tool to "rewrite the trace's events as more human-readable ones.", we'd better leave the logging messages as they are. and leave the tool and the changes (for instance, ff26e95, 2b00fe2, 8e227fb) to LTTngize logging message for another PR.

@tchaikov
Copy link
Contributor

@mogeb in my case, lttng based log outperforms the existing dout(). tested with test/fio/ceph-bluestore.fio. a separated SSD for osd_data. and a separated HDD for log/ust.

lttng enable-event  -u -a

  WRITE: bw=65.0MiB/s (68.2MB/s), 65.0MiB/s-65.0MiB/s (68.2MB/s-68.2MB/s), io=1302MiB (1365MB), run=20018-20018msec

while with existing dout backend

  WRITE: bw=36.2MiB/s (37.0MB/s), 36.2MiB/s-36.2MiB/s (37.0MB/s-37.0MB/s), io=725MiB (760MB), run=20017-20017msec

@mogeb
Copy link
Contributor Author

mogeb commented Aug 14, 2018

@tchaikov so the performance is twice as good with LTTng compared to current dout? This is great. What's the loglevel you set for dout?

Something came up on my side for so I got sidetracked for a few days, but I'll update this PR afterwards.

Edit: Sorry, I missed a part of your comment at first ;)

@tchaikov
Copy link
Contributor

so the performance is twice as good with LTTng compared to current dout?

looks like.

Is that using an actual OSD or fio with Bluestore backend?

fio w/ bluestore backend.

What's the loglevel you set for dout?

debug-bluestore = 10/0

In my tests, the logs were going to a different disk (which is more realistic) than the data with fio. I assume dout does more writes to disk than LTTng, so there's more contention. If the logs are going to the same disk than the data, the performance difference will be more flagrant. Nonetheless, this is good news, thanks.

will double check my settings. and rerun the test .

Something came up on my side for so I got sidetracked for a few days, but I'll update this PR afterwards.

sure. thanks!

#ifndef typeof
#define typeof __typeof__
#endif

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why this is necessary. src/test/fio/CMakeLists.txt has a workaround for this:

# fio headers use typeof(), which requires c++11 extensions
set_target_properties(fio_ceph_objectstore PROPERTIES
  CXX_EXTENSIONS ON
  COMPILE_FLAGS "${FIO_CFLAGS}")

Copy link
Contributor Author

@mogeb mogeb Aug 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably won't be it in the final version of this PR. However, I was planning on submitting it alone as fio_ceph_objectstore.cc wouldn't build in my environment. I haven't tried it since though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i also ran into this issue while compiling libfio_ceph_objectstore.so with GCC 8.2, hence axboe/fio#628 , which renders both workaround not necessary anymore.

@tchaikov
Copy link
Contributor

what i don't understand is why my SSD is 2x slower than it was last week... it was
bw=65.0MiB/s with LTTng logging, while it's now bw=34.7MiB/s with the same settings.

@mogeb
Copy link
Contributor Author

mogeb commented Aug 21, 2018

@tchaikov exactly.. even if a trace wasn't generated last time (tracing disabled), it still wouldn't cause that much of a difference. In any case, I will do proper benchmarking once I'm back on this.

@tchaikov
Copy link
Contributor

@mogeb ping?

@mogeb
Copy link
Contributor Author

mogeb commented Sep 11, 2018

@tchaikov there are now three ways to write an event:

  1. Fast path: trace_<event_name>(loglevel, subsys, type1, arg1, value1 [, type2, arg2, value2, ...], <fmt>);
  2. Slow path: trace(loglevel, subsys, <libfmt-formatted string>, arg1, ...)
  3. Slow path: trace_error(subsys, <libfmt-formatted string>, arg1, ...)

3 is exactly like 2, but the loglevel is not required. For 1, there is no string manipulation, and raw basic types have to be provided. The result in the trace is a compact event. For 2 and 3, arguments are formatted to a string and passed to the event. Here's an example.

The only issue right now is that for 2 and 3, you have to use fmt::ptr(arg) if arg is a pointer, as explained here. It doesn't look like there's a straight way to print pointers (at least in C++).

dout(10) << __func__ << " " << cid << " " << oid
<< " 0x" << std::hex << offset << "~" << length << std::dec
<< " = " << r << dendl;
trace(10, "bluestore", "read {}, {}, {}, {} and {}",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also replace susbsys with __func__.

@mogeb
Copy link
Contributor Author

mogeb commented Oct 5, 2018

Rebased. I'm trying to run this on a real cluster now.

@mogeb mogeb force-pushed the lttng branch 2 times, most recently from ca62802 to be4a212 Compare October 5, 2018 10:29
@tchaikov
Copy link
Contributor

sorry for the latency, i will review this PR early tomorrow.

@tchaikov tchaikov self-assigned this Oct 16, 2018
@mogeb
Copy link
Contributor Author

mogeb commented Oct 16, 2018

@tchaikov no problem, take your time. On my side, I'm trying to build it with a proper build system to deploy it on a real cluster and see what impact it has. I'll update you with the results soon.

@mogeb
Copy link
Contributor Author

mogeb commented Nov 8, 2018

Update: my hardware is currently broken so I'm waiting for it to be fixed to report performance numbers on a real cluster. I'll report soon and sorry for the delay.

@mogeb mogeb force-pushed the lttng branch 2 times, most recently from f13a782 to 42668c2 Compare June 24, 2019 16:23
Signed-off-by: Mohamad Gebai <mgebai@suse.com>
@mogeb
Copy link
Contributor Author

mogeb commented Jun 25, 2019

New results here

@jdurgin
Copy link
Member

jdurgin commented Aug 9, 2019

This was discussed at the perf weekly 2019-06-20, where @mogeb shared those results. Given the difficulty of use of lttng for both users and developers, we decided that it would make more sense to concentrate on optimizing the log statements in the hot path (e.g. reducing the memcpys and stringification of many structures there) rather than converting the logging machinery to lttng.

I'd update the trello card earlier: https://trello.com/c/WAu7nsH4/362-osd-make-logging-on-the-i-o-path-lightweight but this PR being open caused some confusion, so closing it now.

@jdurgin jdurgin closed this Aug 9, 2019
@cbodley
Copy link
Contributor

cbodley commented Aug 9, 2019

@mogeb thanks for all of your efforts in exploring this!

@mogeb
Copy link
Contributor Author

mogeb commented Aug 12, 2019

@cbodley and @jdurgin thanks for your help as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
6 participants