Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/util/log: enable bufferedSink usage with fileSink as experimental #120428

Merged
merged 2 commits into from Mar 20, 2024

Conversation

abarganier
Copy link
Member

@abarganier abarganier commented Mar 13, 2024

Informs: #72452

Epic: CRDB-35401

For some time, two buffering mechanisms have existed within the log
package. One is controlled by buffered-writes, which is only used by
the file sink. buffered-writes buffers writes until a buffer is full,
after which it flushes the buffer to an underlying file. This is not an
asynchronous operation.

Separately, the bufferedSink was introduced to provide an async
buffering mechanism. This was originally intended for network log sinks
such as the fluentSink and httpSink. Originally, the bufferedSink
was not allowed for use with the fileSink as we already had a
buffering mechanism available.

Today, we find ourselves in a situation where we aim to make CRDB even
more resilient in the face of disk failures. However, some of these
efforts like WAL failover are unable to achieve their goals if the
logging subsystem isn't resilient to disk unavailability as well.

Today, if a file sink is configured to write to an unavailable disk, any
goroutine making a logging call to one of those sinks will hang as the
underlying i/o operations used are unable to interact with the disk.
This can have a severe negative impact on cluster performance and
throughput, even when using buffered-writes: true.

As a way to alleviate this pain, this patch experimentally enables the
usage of the bufferedSink with the fileSink. The bufferedSink is
used to "wrap" an underlying sink (such as a fileSink), and
asynchronously outputs buffered data to said sink. It is especially good
at shielding writers from any problems with the underlying sink, which
makes it a great candidate to help us accomplish our resiliency goals
when it comes to disk outage scenarios.

Users can leverage the bufferedSink by using the buffering config option.
For example:

file-defaults:
  buffered-writes: false
  buffering:
    max-staleness: 5s
    flush-trigger-size: 1.0MiB
    max-buffer-size: 50MiB

Note that we don't allow both buffered-writes and buffering to be
used together. This patch adds a validation step to ensure this.

Release note (ops change): Users have the ability to update their log
configuration to enable asynchronous buffering of file-group log
sinks via the buffering configuration
options
.

The buffering configuration can be applied to file-defaults or
individual file-groups as needed.

Note however that buffering and buffered-writes are incompatible for
the time being. If you'd like to try the buffering option on a file
log sink, it's necessary to explicitly set buffered-writes: false. For
example:

file-defaults:
  buffered-writes: false
  buffering:
    max-staleness: 1s
    flush-trigger-size: 256KiB
    max-buffer-size: 50MiB

We recommend a max-staleness: 1s and flush-trigger-size: 256KiB when
using this feature with file log sinks.

@abarganier abarganier requested a review from jbowens March 13, 2024 18:28
@abarganier abarganier requested a review from a team as a code owner March 13, 2024 18:28
@abarganier abarganier requested review from xinhaoz and removed request for a team March 13, 2024 18:28
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: assuming it looks good to @xinhaoz !

Reviewed 5 of 5 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier and @xinhaoz)


-- commits line 65 at r1:
Since it's already opt-in, would we be able to avoid the additional environment variable requirement? It seems like something we could document for the buffering option.

Or, even more aggressively, could we not mark it as experimental but perhaps not document very loudly. Our WAL failover option will not be considered experimental, so it's a bit unfortunate if we have to force them into using experimental options at the same time.

@abarganier abarganier requested a review from jbowens March 15, 2024 20:02
Copy link
Member Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @xinhaoz)


-- commits line 65 at r1:

Previously, jbowens (Jackson Owens) wrote…

Since it's already opt-in, would we be able to avoid the additional environment variable requirement? It seems like something we could document for the buffering option.

Or, even more aggressively, could we not mark it as experimental but perhaps not document very loudly. Our WAL failover option will not be considered experimental, so it's a bit unfortunate if we have to force them into using experimental options at the same time.

Let me discuss with the team, I'm open to ditching the env var and perhaps not advertising the feature quite yet.

Copy link
Member Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jbowens and @xinhaoz)


-- commits line 65 at r1:

Previously, abarganier (Alex Barganier) wrote…

Let me discuss with the team, I'm open to ditching the env var and perhaps not advertising the feature quite yet.

@jbowens envvar removed! We like the idea of not documenting this loudly.

PTAL, and let us know how your testing goes.

Informs: cockroachdb#72452

For some time, two buffering mechanisms have existed within the log
package. One is controlled by `buffered-writes`, which is only used by
the file sink. `buffered-writes` buffers writes until a buffer is full,
after which it flushes the buffer to an underlying file. This is not an
asynchronous operation.

Separately, the `bufferedSink` was introduced to provide an async
buffering mechanism. This was originally intended for network log sinks
such as the `fluentSink` and `httpSink`. Originally, the `bufferedSink`
was not allowed for use with the `fileSink` as we already had a
buffering mechanism available.

Today, we find ourselves in a situation where we aim to make CRDB even
more resilient in the face of disk failures. However, some of these
efforts like WAL failover are unable to achieve their goals if the
logging subsystem isn't resilient to disk unavailability as well.

Today, if a file sink is configured to write to an unavailable disk, any
goroutine making a logging call to one of those sinks will hang as the
underlying i/o operations used are unable to interact with the disk.
This can have a severe negative impact on cluster performance and
throughput, even when using `buffered-writes: true`.

As a way to alleviate this pain, this patch experimentally enables the
usage of the `bufferedSink` with the `fileSink`. The `bufferedSink` is
used to "wrap" an underlying sink (such as a `fileSink`), and
asynchronously outputs buffered data to said sink. It is especially good
at shielding writers from any problems with the underlying sink, which
makes it a great candidate to help us accomplish our resiliency goals
when it comes to disk outage scenarios.

Now, users can leverage the `bufferedSink` in `file-defaults` and
`file-groups  by using the `buffering` config option. For example:

```
file-defaults:
  buffered-writes: false
  buffering:
    max-staleness: 1s
    flush-trigger-size: 256KiB
    max-buffer-size: 50MiB
```

Note that we don't allow both `buffered-writes` and `buffering` to be
used together. This patch adds a validation step to ensure this.

We recommend a `max-staleness: 1s` and `flush-trigger-size: 256KiB` when
using this feature with file log sinks.

Release note (ops change): Users have the ability to update their log
configuration to enable asynchronous buffering of `file-group` log
sinks via the `buffering` [configuration
options](https://www.cockroachlabs.com/docs/stable/configure-logs#log-buffering-for-network-sinks).

The `buffering` configuration can be applied to `file-defaults` or
individual `file-groups` as needed.

Note however that `buffering` and `buffered-writes` are incompatible for
the time being. If you'd like to try the `buffering` option on a file
log sink, it's necessary to explicitly set `buffered-writes: false`. For
example:
```
file-defaults:
  buffered-writes: false
  buffering:
    max-staleness: 1s
    flush-trigger-size: 256KiB
    max-buffer-size: 50MiB
```
We recommend a `max-staleness: 1s` and `flush-trigger-size: 256KiB` when
using this feature with file log sinks.
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier and @xinhaoz)


-- commits line 65 at r1:

Previously, abarganier (Alex Barganier) wrote…

@jbowens envvar removed! We like the idea of not documenting this loudly.

PTAL, and let us know how your testing goes.

Nice!

@dhartunian dhartunian requested a review from jbowens March 19, 2024 22:35
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: thank you for the detailed commit messages 💯. just a few wording nits.

Reviewed 1 of 7 files at r3, 5 of 5 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @abarganier, @jbowens, and @xinhaoz)


-- commits line 82 at r6:
nit: grammar


pkg/util/log/logconfig/testdata/validate line 797 at r5 (raw file):

  max-group-size: 100MiB

# Check that buffering works with file sinks when the correct env var is set.

nit: remove reference to env var

As we introduce the ability to enable `buffering` on file-based log
sinks as experimental, there's a need for a new `BufferFormat`.

Previously, the default format was to introduce a newline between each
log message flushed to the underlying log sink as a delimiter. However,
when flushing to files, this is not desireable behavior as it causes the
log file to look like:
```
I240313 16:38:33.355448 113 1@gossip/gossip.go:1374 ⋮ [T1,Vsystem,n1] 80  node has connected to cluster via gossip

I240313 16:38:33.360943 113 kv/kvserver/stores.go:282 ⋮ [T1,Vsystem,n1] 81  wrote 0 node addresses to persistent storage
```

Empty lines do not play well with log parsers, so we introduce a new
`BufferFormat`, `BufferFmtNone`, which uses no delimiters, prefixes, or
suffixes. With this new option, log files no longer see these empty
lines.

The validation code for file log sinks forces the `BufferFmtNone` into
the buffering configuration, so this is not something that needs to
explicitly be configured by the user.

Release note: none
Copy link
Member Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

I've created #120759 to track follow up work to enable this feature in roachtests, to help put the new config option through its paces.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian, @jbowens, and @xinhaoz)


-- commits line 82 at r6:

Previously, dhartunian (David Hartunian) wrote…

nit: grammar

Done.

Copy link
Member Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian, @jbowens, and @xinhaoz)


pkg/util/log/logconfig/testdata/validate line 797 at r5 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: remove reference to env var

Done.

@abarganier
Copy link
Member Author

bors r=dhartunian,jbowens

@craig
Copy link
Contributor

craig bot commented Mar 20, 2024

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 20, 2024

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 20, 2024

@craig craig bot merged commit f796c91 into cockroachdb:master Mar 20, 2024
22 of 36 checks passed
abarganier added a commit to abarganier/cockroach that referenced this pull request May 1, 2024
Addresses: cockroachdb#120759

Previously, the `network_logging` roachtest was created to put the
`bufferedSink` (an async producer+consumer buffer that we primarily use
to wrap network log sinks) through its paces. Originally, this
functionality was only available to network (`fluent`,`http`) log sinks,
and so the test did not include any file based log sinks.

As of cockroachdb#120428, we now
support the ability to use async buffering on `file-group` log sinks as
well, via the same `bufferedSink`. Therefore, it makes sense that we
should expand our existing `network_logging` roachtest to also include
file sinks.

This commit does so by expanding the log config to enable async
buffering for files by default, and adds three `file-groups` log sinks
to the config that capture all log channels. This will now run alongside
the existing network logging sinks.

As a final step, since this test is no longer specific to network
logging, we rename the roachtest from `network_logging` to
`buffered_logging`.

Release note: none
craig bot pushed a commit that referenced this pull request May 3, 2024
123431: roachtest: change `network_logging` to `buffered_logging`, include files r=dhartunian a=abarganier

Addresses: #120759

Previously, the `network_logging` roachtest was created to put the `bufferedSink` (an async producer+consumer buffer that we primarily use to wrap network log sinks) through its paces. Originally, this functionality was only available to network (`fluent`,`http`) log sinks, and so the test did not include any file based log sinks.

As of #120428, we now support the ability to use async buffering on `file-group` log sinks as well, via the same `bufferedSink`. Therefore, it makes sense that we should expand our existing `network_logging` roachtest to also include file sinks.

This commit does so by expanding the log config to enable async buffering for files by default, and adds three `file-groups` log sinks to the config that capture all log channels. This will now run alongside the existing network logging sinks.

As a final step, since this test is no longer specific to network logging, we rename the roachtest from `network_logging` to `buffered_logging`.

Release note: none

Epic: CRDB-35401

Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>
craig bot pushed a commit that referenced this pull request May 3, 2024
123431: roachtest: change `network_logging` to `buffered_logging`, include files r=dhartunian a=abarganier

Addresses: #120759

Previously, the `network_logging` roachtest was created to put the `bufferedSink` (an async producer+consumer buffer that we primarily use to wrap network log sinks) through its paces. Originally, this functionality was only available to network (`fluent`,`http`) log sinks, and so the test did not include any file based log sinks.

As of #120428, we now support the ability to use async buffering on `file-group` log sinks as well, via the same `bufferedSink`. Therefore, it makes sense that we should expand our existing `network_logging` roachtest to also include file sinks.

This commit does so by expanding the log config to enable async buffering for files by default, and adds three `file-groups` log sinks to the config that capture all log channels. This will now run alongside the existing network logging sinks.

As a final step, since this test is no longer specific to network logging, we rename the roachtest from `network_logging` to `buffered_logging`.

Release note: none

Epic: CRDB-35401

Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants