pkg/util/log: enable bufferedSink usage with fileSink as experimental #120428

abarganier · 2024-03-13T18:28:01Z

Informs: #72452

For some time, two buffering mechanisms have existed within the log
package. One is controlled by buffered-writes, which is only used by
the file sink. buffered-writes buffers writes until a buffer is full,
after which it flushes the buffer to an underlying file. This is not an
asynchronous operation.

Separately, the bufferedSink was introduced to provide an async
buffering mechanism. This was originally intended for network log sinks
such as the fluentSink and httpSink. Originally, the bufferedSink
was not allowed for use with the fileSink as we already had a
buffering mechanism available.

Today, we find ourselves in a situation where we aim to make CRDB even
more resilient in the face of disk failures. However, some of these
efforts like WAL failover are unable to achieve their goals if the
logging subsystem isn't resilient to disk unavailability as well.

Today, if a file sink is configured to write to an unavailable disk, any
goroutine making a logging call to one of those sinks will hang as the
underlying i/o operations used are unable to interact with the disk.
This can have a severe negative impact on cluster performance and
throughput, even when using buffered-writes: true.

As a way to alleviate this pain, this patch experimentally enables the
usage of the bufferedSink with the fileSink. The bufferedSink is
used to "wrap" an underlying sink (such as a fileSink), and
asynchronously outputs buffered data to said sink. It is especially good
at shielding writers from any problems with the underlying sink, which
makes it a great candidate to help us accomplish our resiliency goals
when it comes to disk outage scenarios.

Users can leverage the bufferedSink by using the buffering config option.
For example:

file-defaults:
  buffered-writes: false
  buffering:
    max-staleness: 5s
    flush-trigger-size: 1.0MiB
    max-buffer-size: 50MiB

Note that we don't allow both buffered-writes and buffering to be
used together. This patch adds a validation step to ensure this.

Release note (ops change): Users have the ability to update their log
configuration to enable asynchronous buffering of file-group log
sinks via the buffering configuration
options.

The buffering configuration can be applied to file-defaults or
individual file-groups as needed.

Note however that buffering and buffered-writes are incompatible for
the time being. If you'd like to try the buffering option on a file
log sink, it's necessary to explicitly set buffered-writes: false. For
example:

file-defaults:
  buffered-writes: false
  buffering:
    max-staleness: 1s
    flush-trigger-size: 256KiB
    max-buffer-size: 50MiB

We recommend a max-staleness: 1s and flush-trigger-size: 256KiB when
using this feature with file log sinks.

cockroach-teamcity · 2024-03-13T18:28:12Z

This change is

jbowens

assuming it looks good to @xinhaoz !

Reviewed 5 of 5 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier and @xinhaoz)

-- commits line 65 at r1:
Since it's already opt-in, would we be able to avoid the additional environment variable requirement? It seems like something we could document for the buffering option.

Or, even more aggressively, could we not mark it as experimental but perhaps not document very loudly. Our WAL failover option will not be considered experimental, so it's a bit unfortunate if we have to force them into using experimental options at the same time.

abarganier

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @xinhaoz)

-- commits line 65 at r1:

Previously, jbowens (Jackson Owens) wrote…

Since it's already opt-in, would we be able to avoid the additional environment variable requirement? It seems like something we could document for the buffering option.

Or, even more aggressively, could we not mark it as experimental but perhaps not document very loudly. Our WAL failover option will not be considered experimental, so it's a bit unfortunate if we have to force them into using experimental options at the same time.

Let me discuss with the team, I'm open to ditching the env var and perhaps not advertising the feature quite yet.

abarganier

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jbowens and @xinhaoz)

-- commits line 65 at r1:

Previously, abarganier (Alex Barganier) wrote…

Let me discuss with the team, I'm open to ditching the env var and perhaps not advertising the feature quite yet.

@jbowens envvar removed! We like the idea of not documenting this loudly.

PTAL, and let us know how your testing goes.

Informs: cockroachdb#72452 For some time, two buffering mechanisms have existed within the log package. One is controlled by `buffered-writes`, which is only used by the file sink. `buffered-writes` buffers writes until a buffer is full, after which it flushes the buffer to an underlying file. This is not an asynchronous operation. Separately, the `bufferedSink` was introduced to provide an async buffering mechanism. This was originally intended for network log sinks such as the `fluentSink` and `httpSink`. Originally, the `bufferedSink` was not allowed for use with the `fileSink` as we already had a buffering mechanism available. Today, we find ourselves in a situation where we aim to make CRDB even more resilient in the face of disk failures. However, some of these efforts like WAL failover are unable to achieve their goals if the logging subsystem isn't resilient to disk unavailability as well. Today, if a file sink is configured to write to an unavailable disk, any goroutine making a logging call to one of those sinks will hang as the underlying i/o operations used are unable to interact with the disk. This can have a severe negative impact on cluster performance and throughput, even when using `buffered-writes: true`. As a way to alleviate this pain, this patch experimentally enables the usage of the `bufferedSink` with the `fileSink`. The `bufferedSink` is used to "wrap" an underlying sink (such as a `fileSink`), and asynchronously outputs buffered data to said sink. It is especially good at shielding writers from any problems with the underlying sink, which makes it a great candidate to help us accomplish our resiliency goals when it comes to disk outage scenarios. Now, users can leverage the `bufferedSink` in `file-defaults` and `file-groups by using the `buffering` config option. For example: ``` file-defaults: buffered-writes: false buffering: max-staleness: 1s flush-trigger-size: 256KiB max-buffer-size: 50MiB ``` Note that we don't allow both `buffered-writes` and `buffering` to be used together. This patch adds a validation step to ensure this. We recommend a `max-staleness: 1s` and `flush-trigger-size: 256KiB` when using this feature with file log sinks. Release note (ops change): Users have the ability to update their log configuration to enable asynchronous buffering of `file-group` log sinks via the `buffering` [configuration options](https://www.cockroachlabs.com/docs/stable/configure-logs#log-buffering-for-network-sinks). The `buffering` configuration can be applied to `file-defaults` or individual `file-groups` as needed. Note however that `buffering` and `buffered-writes` are incompatible for the time being. If you'd like to try the `buffering` option on a file log sink, it's necessary to explicitly set `buffered-writes: false`. For example: ``` file-defaults: buffered-writes: false buffering: max-staleness: 1s flush-trigger-size: 256KiB max-buffer-size: 50MiB ``` We recommend a `max-staleness: 1s` and `flush-trigger-size: 256KiB` when using this feature with file log sinks.

jbowens

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier and @xinhaoz)

-- commits line 65 at r1:

Previously, abarganier (Alex Barganier) wrote…

@jbowens envvar removed! We like the idea of not documenting this loudly.

PTAL, and let us know how your testing goes.

Nice!

dhartunian

thank you for the detailed commit messages 💯. just a few wording nits.

Reviewed 1 of 7 files at r3, 5 of 5 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @abarganier, @jbowens, and @xinhaoz)

-- commits line 82 at r6:
nit: grammar

pkg/util/log/logconfig/testdata/validate line 797 at r5 (raw file):

  max-group-size: 100MiB

# Check that buffering works with file sinks when the correct env var is set.

nit: remove reference to env var

As we introduce the ability to enable `buffering` on file-based log sinks as experimental, there's a need for a new `BufferFormat`. Previously, the default format was to introduce a newline between each log message flushed to the underlying log sink as a delimiter. However, when flushing to files, this is not desireable behavior as it causes the log file to look like: ``` I240313 16:38:33.355448 113 1@gossip/gossip.go:1374 ⋮ [T1,Vsystem,n1] 80 node has connected to cluster via gossip I240313 16:38:33.360943 113 kv/kvserver/stores.go:282 ⋮ [T1,Vsystem,n1] 81 wrote 0 node addresses to persistent storage ``` Empty lines do not play well with log parsers, so we introduce a new `BufferFormat`, `BufferFmtNone`, which uses no delimiters, prefixes, or suffixes. With this new option, log files no longer see these empty lines. The validation code for file log sinks forces the `BufferFmtNone` into the buffering configuration, so this is not something that needs to explicitly be configured by the user. Release note: none

abarganier

TFTRs!

I've created #120759 to track follow up work to enable this feature in roachtests, to help put the new config option through its paces.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian, @jbowens, and @xinhaoz)

-- commits line 82 at r6:

Previously, dhartunian (David Hartunian) wrote…

nit: grammar

Done.

abarganier

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian, @jbowens, and @xinhaoz)

pkg/util/log/logconfig/testdata/validate line 797 at r5 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: remove reference to env var

Done.

abarganier · 2024-03-20T15:13:00Z

bors r=dhartunian,jbowens

craig · 2024-03-20T15:53:07Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2024-03-20T16:02:42Z

Build failed (retrying...):

examples_orms

craig · 2024-03-20T17:43:41Z

Build succeeded:

Addresses: cockroachdb#120759 Previously, the `network_logging` roachtest was created to put the `bufferedSink` (an async producer+consumer buffer that we primarily use to wrap network log sinks) through its paces. Originally, this functionality was only available to network (`fluent`,`http`) log sinks, and so the test did not include any file based log sinks. As of cockroachdb#120428, we now support the ability to use async buffering on `file-group` log sinks as well, via the same `bufferedSink`. Therefore, it makes sense that we should expand our existing `network_logging` roachtest to also include file sinks. This commit does so by expanding the log config to enable async buffering for files by default, and adds three `file-groups` log sinks to the config that capture all log channels. This will now run alongside the existing network logging sinks. As a final step, since this test is no longer specific to network logging, we rename the roachtest from `network_logging` to `buffered_logging`. Release note: none

123431: roachtest: change `network_logging` to `buffered_logging`, include files r=dhartunian a=abarganier Addresses: #120759 Previously, the `network_logging` roachtest was created to put the `bufferedSink` (an async producer+consumer buffer that we primarily use to wrap network log sinks) through its paces. Originally, this functionality was only available to network (`fluent`,`http`) log sinks, and so the test did not include any file based log sinks. As of #120428, we now support the ability to use async buffering on `file-group` log sinks as well, via the same `bufferedSink`. Therefore, it makes sense that we should expand our existing `network_logging` roachtest to also include file sinks. This commit does so by expanding the log config to enable async buffering for files by default, and adds three `file-groups` log sinks to the config that capture all log channels. This will now run alongside the existing network logging sinks. As a final step, since this test is no longer specific to network logging, we rename the roachtest from `network_logging` to `buffered_logging`. Release note: none Epic: CRDB-35401 Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>

abarganier requested a review from jbowens March 13, 2024 18:28

abarganier requested a review from a team as a code owner March 13, 2024 18:28

abarganier requested review from xinhaoz and removed request for a team March 13, 2024 18:28

abarganier force-pushed the file-bufferedSink branch from d609729 to 14d9e78 Compare March 13, 2024 18:55

jbowens approved these changes Mar 15, 2024

View reviewed changes

abarganier requested a review from jbowens March 15, 2024 20:02

abarganier commented Mar 15, 2024

View reviewed changes

abarganier force-pushed the file-bufferedSink branch from 14d9e78 to e37f897 Compare March 18, 2024 19:37

abarganier commented Mar 18, 2024

View reviewed changes

abarganier force-pushed the file-bufferedSink branch from e37f897 to 0dfbc08 Compare March 18, 2024 20:26

jbowens approved these changes Mar 19, 2024

View reviewed changes

dhartunian requested a review from jbowens March 19, 2024 22:35

dhartunian approved these changes Mar 19, 2024

View reviewed changes

abarganier force-pushed the file-bufferedSink branch from 0dfbc08 to 8a2087a Compare March 20, 2024 14:59

abarganier mentioned this pull request Mar 20, 2024

roachtest: update network_logging test to enable buffering on file log sinks #120759

Closed

abarganier requested a review from dhartunian March 20, 2024 15:08

abarganier commented Mar 20, 2024

View reviewed changes

craig bot merged commit f796c91 into cockroachdb:master Mar 20, 2024
22 of 36 checks passed

abarganier mentioned this pull request May 1, 2024

roachtest: change network_logging to buffered_logging, include files #123431

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pkg/util/log: enable bufferedSink usage with fileSink as experimental #120428

pkg/util/log: enable bufferedSink usage with fileSink as experimental #120428

abarganier commented Mar 13, 2024 •

edited

cockroach-teamcity commented Mar 13, 2024

jbowens left a comment

abarganier left a comment

abarganier left a comment

jbowens left a comment

dhartunian left a comment

abarganier left a comment

abarganier left a comment

abarganier commented Mar 20, 2024

craig bot commented Mar 20, 2024

craig bot commented Mar 20, 2024

craig bot commented Mar 20, 2024

pkg/util/log: enable bufferedSink usage with fileSink as experimental #120428

pkg/util/log: enable bufferedSink usage with fileSink as experimental #120428

Conversation

abarganier commented Mar 13, 2024 • edited

cockroach-teamcity commented Mar 13, 2024

jbowens left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

dhartunian left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

abarganier commented Mar 20, 2024

craig bot commented Mar 20, 2024

craig bot commented Mar 20, 2024

craig bot commented Mar 20, 2024

abarganier commented Mar 13, 2024 •

edited