
cli: add debug merge-logs command to merge log files into a single stream. #32790

Merged
merged 2 commits into cockroachdb:master from ajwerner/debug-merge-logs on Dec 7, 2018

Conversation

4 participants
@ajwerner
Collaborator

ajwerner commented Dec 3, 2018

Debugging distributed systems is challenging, often requiring correlating logs
from many servers. This PR adds the new cockroach debug merge-logs command,
which simplifies the workflow for combining logs from multiple servers. The new
command takes a list of log files or glob patterns which expand to log file
names and combines them into a single sorted stream. By default, only logs from
programs named "cockroach" are considered, though this is controllable via the
program flag, which takes a regular expression over the program name.

When logs are merged, a prefix of the form {{ .Details.Host }}> is added to
the front of each log message. This can be overridden by supplying a template
string to the prefix flag, which is evaluated on a *log.FileInfo struct.

The command accepts unexpanded glob patterns to enable use in situations where
thousands of files exist and shell expansion would overflow MAX_ARG_STRLEN. The
command provides functionality to efficiently explore specific time ranges by
only reading files which might contain entries in the specified interval and by
using binary search to find an appropriate offset within files. It also provides
a mechanism for regular expression filtering of log line messages, something
which is difficult to accomplish with just grep.

Within a given set of files for the same node, log messages which occur out of
time order are "ratcheted" to be equal to the last time seen from that node.

More thought could be given to how this command behaves in situations where a
host's clock has moved backwards appreciably, such that file timestamps are not
properly ordered.

Proposed in favor of #32435.

Release note (cli change): Adds debug merge-logs command to combine logs from multiple nodes.

@ajwerner ajwerner requested review from tbg and petermattis Dec 3, 2018

@ajwerner ajwerner requested a review from cockroachdb/cli-prs as a code owner Dec 3, 2018

@cockroach-teamcity

Member

cockroach-teamcity commented Dec 3, 2018

This change is Reviewable

@ajwerner ajwerner changed the title cli: add debug merge-logs command to merge log files into a single stream. [WIP][DNM] cli: add debug merge-logs command to merge log files into a single stream. Dec 3, 2018

@tbg
Member

tbg left a comment

Great, can't wait to use this! Will give a proper review later, but one question I already have is that not all of the test data seems to be there (or reviewable is tricking me somehow):


Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cli/debug_merge_logs_test.go, line 30 at r1 (raw file):

	flags    []string
	args     []string
	expected string

I'd also put the expected output into files.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch from bb8a735 to b3478ee Dec 3, 2018

@ajwerner
Collaborator

ajwerner left a comment

Thanks! I think they were being gitignored; I was perplexed about why it was passing locally but not in TeamCity. I'll move the outputs to files now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@ajwerner

Collaborator

ajwerner commented Dec 3, 2018

I also found the merged logs to be difficult to read, so I created a separate tool, https://github.com/ajwerner/logcolor/, which accepts the merged stream and colorizes each line by hashing its prefix to a color.

[screenshot: merged log stream colorized by host prefix]

@ajwerner ajwerner changed the title [WIP][DNM] cli: add debug merge-logs command to merge log files into a single stream. cli: add debug merge-logs command to merge log files into a single stream. Dec 3, 2018

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch from b3478ee to b7e0764 Dec 3, 2018

@petermattis
Contributor

petermattis left a comment

:lgtm:

Generally looks like what I expected, though I didn't go over this with a fine-toothed comb.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/cli/debug_merge_logs.go, line 44 at r2 (raw file):

func newMergedStream(files []fileInfo, from, to time.Time) *mergedStream {
	// TODO: think about clock movement and PID

Nit: we put author names in parens after the TODO. TODO(ajwerner).


pkg/cli/debug_merge_logs.go, line 53 at r2 (raw file):

	// TODO(ajwerner): could do the peek in parallel though most of the cost
	// has been mitigated with the binary search.
	addLogStream := func() {

There is a bit of complexity in this code due to this concatenated per-node log stream. Did you consider just putting all of the logs in the heap? The downside to that is that you wouldn't want to keep all of the logs open, but I think you could avoid that problem by peeking at the first record in each log first.


pkg/cli/debug_merge_logs.go, line 72 at r2 (raw file):

		}
		// Overwrite the previous file if there is one and the current file starts
		// before from.

I've noticed instances of a single host producing logs with overlapping timestamps. Is that what you're trying to handle here? It will be very unfortunate if this code accidentally drops a log.


pkg/cli/debug_merge_logs_test.go, line 30 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'd also put the expected output into files.

Agreed. If you want to get fancy you can use testutils.datadriven, though that might be overkill here.


pkg/util/log/file.go, line 124 at r2 (raw file):

// NewFileInfo constructs a FileInfo from FileDetails and os.FileInfo.
func NewFileInfo(details FileDetails, info os.FileInfo) FileInfo {

Nit: We usually reserve New for when a function returns a pointer and use the term Make when it returns a value. So this function should be named MakeFileInfo.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch 2 times, most recently from ba32c17 to 8ef10ae Dec 3, 2018

@ajwerner
Collaborator

ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cli/debug_merge_logs.go, line 53 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

There is a bit of complexity in this code due to this concatenated per-node log stream. Did you consider just putting all of the logs in the heap? The downside to that is that you wouldn't want to keep all of the logs open, but I think you could avoid that problem by peeking at the first record in each log first.

I mostly didn't want to put all the files into a single heap so I didn't have to open them all then need a semaphore or something. Adding the logic to peek at the first record, then close the file and then reopen seems like a similar amount of complexity. I also figured that it might be possible to reorder log files for a node such that all files with the same PID are contiguous or something in the case of clock changes but given I didn't type that code and I'm not sure how it would actually work, it seems like a case of YAGNI.

The downside of the current approach is if there are two files with the same timestamp then it's possible we'll look at them out of order, which seems like a big enough downside to change this logic. Will revisit.


pkg/cli/debug_merge_logs.go, line 72 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I've noticed instances of a single host producing logs with overlapping timestamps. Is that what you're trying to handle here? It will be very unfortunate if this code accidentally drops a log.

Can you help me to better understand when this situation occurs? If it does happen regularly then it would warrant opening all of the files and peeking at them here.


pkg/cli/debug_merge_logs_test.go, line 30 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Agreed. If you want to get fancy you can use testutils.datadriven, though that might be overkill here.

Done.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch from 8ef10ae to 8f63d95 Dec 3, 2018

@petermattis
Contributor

petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cli/debug_merge_logs.go, line 53 at r2 (raw file):

Previously, ajwerner wrote…

I mostly didn't want to put all the files into a single heap so I didn't have to open them all then need a semaphore or something. Adding the logic to peek at the first record, then close the file and then reopen seems like a similar amount of complexity. I also figured that it might be possible to reorder log files for a node such that all files with the same PID are contiguous or something in the case of clock changes but given I didn't type that code and I'm not sure how it would actually work, it seems like a case of YAGNI.

The downside of the current approach is if there are two files with the same timestamp then it's possible we'll look at them out of order, which seems like a big enough downside to change this logic. Will revisit.

Ack.


pkg/cli/debug_merge_logs.go, line 72 at r2 (raw file):

Previously, ajwerner wrote…

Can you help me to better understand when this situation occurs? If it does happen regularly then it would warrant opening all of the files and peeking at them here.

I have no idea when it occurs or why it occurs. It is just something I've observed when running various roachtests and then looking at the logs. I think most recently I saw it with the disk-full roachtest.

@ajwerner
Collaborator

ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cli/debug_merge_logs.go, line 53 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ack.

I was wrong about the filename ts resolution. It's definitely on a per second basis. The number after the dot is the PID. I'll change the logic to peek at log entries per file.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch 2 times, most recently from c349d6f to 21a3d0c Dec 3, 2018

@petermattis
Contributor

petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cli/debug_merge_logs.go, line 53 at r2 (raw file):

Previously, ajwerner wrote…

I was wrong about the filename ts resolution. It's definitely on a per second basis. The number after the dot is the PID. I'll change the logic to peek at log entries per file.

Oh, that probably explains what I was seeing. If I recall correctly, the timestamp was identical, but the portion after the decimal point was lower in the second log file. This is in a test that was rapidly restarting a node so that makes sense.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch 4 times, most recently from 3a7747d to 3015781 Dec 4, 2018

@ajwerner
Collaborator

ajwerner left a comment

I got a little carried away tuning the performance here because I was frustrated at how long it was taking to scan and filter large amounts of data. It feels pretty fast now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cli/debug_merge_logs.go, line 53 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Oh, that probably explains what I was seeing. If I recall correctly, the timestamp was identical, but the portion after the decimal point was lower in the second log file. This is in a test that was rapidly restarting a node so that makes sense.

Done.

@petermattis
Contributor

petermattis left a comment

:lgtm:

The implementation looks nice and clean. You mention performance tuning. Is the performance good with that channel-based pipeline?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/cli/debug.go, line 1291 at r4 (raw file):

		return time.Time{}, nil
	}
	t, err := time.Parse("060102 15:04:05.999999", ts)

I wonder if this should support alternative time formats, as this one is pretty verbose. Feel free to leave this suggestion to a future PR.


pkg/cli/debug_merge_logs.go, line 73 at r4 (raw file):

	entryChan := make(chan *entryInfo, chanSize)  // read -> process
	msgChan := make(chan *bytes.Buffer, chanSize) // process -> bufferWrites
	writeChan := make(chan *bytes.Buffer)         // bufferWrites -> write

Whoa, channels! I haven't fully grok'd this yet, but I suspect this might be an overuse of channels. Channels are a neat way to structure a pipeline like this, but they can actually be a bottleneck if the processing is minimal as I suspect it is here. Or perhaps it isn't. On the plus side, this pipeline is nice and clean.

Looking at this a bit more, as long as the performance is good, I like the structure of this code.


pkg/cli/debug_merge_logs.go, line 79 at r4 (raw file):

		for e, ok := s.peek(); ok; e, ok = s.peek() {
			select {
			case entryChan <- &entryInfo{Entry: e, FileInfo: s.fileInfo()}:

Does entryInfo need to be a pointer here? I think you can use chan entryInfo and avoid the allocation. Probably doesn't matter and this is just a knee-jerk comment on my part.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch from 3015781 to d6cdb14 Dec 6, 2018

@ajwerner
Collaborator

ajwerner left a comment

It's now able to pin all the CPUs on my laptop at 100%. The biggest wins were copying the regexp in log.NewEntryDecoder and caching the template-rendered prefix. Now the vast majority of time is spent decoding log entries, as it should be. I think I'm done tuning this.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cli/debug.go, line 1291 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder if this should support alternative time formats, as this one is pretty verbose. Feel free to leave this suggestion to a future PR.

I had some other time stamp formats before that were relative to now. I'll add them back in a different PR.


pkg/cli/debug_merge_logs.go, line 73 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Whoa, channels! I haven't fully grok'd this yet, but I suspect this might be an overuse of channels. Channels are a neat way to structure a pipeline like this, but they can actually be a bottleneck if the processing is minimal as I suspect it is here. Or perhaps it isn't. On the plus side, this pipeline is nice and clean.

Looking at this a bit more, as long as the performance is good, I like the structure of this code.

This pipeline means that all of the blocking IO happens in its own goroutine. My thinking upon reading this was that the goroutine that is probably wasteful here is the one that does "processing", which was actually the first one that existed. Initially I had just the reading goroutine and the processing one, which then wrote directly to out. The problem with that is that to make it performant I wrapped out in a bufio.Writer, which had the problem of sometimes writing just portions of messages to stdout, which is less than pleasant. Anyway, I went and dropped one of the levels of goroutines and the whole buffer-copy thing and it led to slightly higher throughput.


pkg/cli/debug_merge_logs.go, line 79 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Does entryInfo need to be a pointer here? I think you can use chan entryInfo and avoid the allocation. Probably doesn't matter and this is just a knee-jerk comment on my part.

It doesn't seem to make a difference. Anecdotally I swapped it because it seemed to have higher throughput as a pointer with my theory being that the channel write is cheaper but running it a few more times each way it seems like noise.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch from d6cdb14 to 6f5fb57 Dec 6, 2018

@petermattis
Contributor

petermattis left a comment

:shipit:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cli/debug.go, line 1291 at r4 (raw file):

Previously, ajwerner wrote…

I had some other time stamp formats before that were relative to now. I'll add them back in a different PR.

SGTM

@tbg

Member

tbg commented Dec 6, 2018

@ajwerner
Collaborator

ajwerner left a comment

There's currently not a good story about dealing with different log file names. I am happy to extend the functionality to support different naming schemes. I created #32919 to track, please add commentary there.

Somewhat unrelatedly, before I started this I was thinking that it might be nice to allow users to merge logs inside of a compressed archive without first needing to do decompression. I could imagine supporting input patterns that maybe look like cockroach debug merge-logs whole-cluster-logs.zip:logs/*/* where the tool detects the archive and expands the glob inside of it. The reason I got discouraged from jumping on this straight away was the tight coupling of glob expansion logic in filepath.Glob to the filesystem APIs. This functionality too should come in another PR if at all, just proposing it to see if there's interest.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@tbg

tbg approved these changes Dec 7, 2018

if err := g.Wait(); err != nil {
return nil, err
}
filtered := res[:0] // Filter the nil streams.


@tbg

tbg Dec 7, 2018

Member

inconsequential nit, but thought you'd generally like to know: Go comment notation uses proper punctuation when the comment is on its own line, but not in inline comments, so this would technically be // filter nil streams. I don't know that I love this rule, though. Happy to ignore it.

cli: add debug merge-logs command to merge log files into a single stream.


@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch from 6f5fb57 to fdbab5e Dec 7, 2018

@ajwerner
Collaborator

ajwerner left a comment

TFYR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cli/debug_merge_logs.go, line 190 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

inconsequential nit, but thought you'd generally like to know: Go comment notation uses proper punctuation when the comment is on its own line, but not in inline comments, so this would technically be // filter nil streams. I don't know that I love this rule, though. Happy to ignore it.

Done.

cli: clean up and add buffering to debug merge-logs
This commit moves the logic of writing and reading of files into separate
goroutines. The change seems to increase the throughput of debug merge-logs by
almost 4x from ~16MB/s to ~60MB/s on my local laptop when merging 4GB of logs
from 102 hosts into /dev/null.

Release note: None

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/debug-merge-logs branch from fdbab5e to e04511a Dec 7, 2018

@ajwerner

Collaborator

ajwerner commented Dec 7, 2018

bors r+

craig bot pushed a commit that referenced this pull request Dec 7, 2018

Merge #32790
32790: cli: add debug merge-logs command to merge log files into a single stream. r=ajwerner a=ajwerner


Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig


craig bot commented Dec 7, 2018

Build succeeded

@craig craig bot merged commit e04511a into cockroachdb:master Dec 7, 2018

3 checks passed

GitHub CI (Cockroach): TeamCity build finished
bors: Build succeeded
license/cla: Contributor License Agreement is signed.