Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/util/log: change default httpSink timeout to 2s #109264

Merged
merged 1 commit into from Aug 25, 2023

Conversation

abarganier
Copy link
Member

@abarganier abarganier commented Aug 22, 2023

Previously, the default timeout for the httpSink was
for there to be no timeout at all.

This means that the first call to output() on where
the http target was unavailable would hang forever.
This would deadlock the calling goroutine, whether
that's the bufferedSink flush goroutine, or (even worse)
a server goroutine in the event that the httpSink is not
buffered.

Our default timeout should not deadlock in the worst case
scenario. Admittedly, 2s would also cause a noticeable
performance degradation in the event that the httpSink was
unbuffered, but it would at least be able to emit logs
indicating the timeout as the cause. Availability would also
be maintained to some degree. Previously, the deadlocks
due to no timeout being set by default meant that no
indication was ever given that the httpSink was unable
to reach the http target.

Release note (ops change): The default value of timeout
for http-servers logging sinks has been changed from
"no timeout" to 2s. This will be reflected in the
http-defaults section of the log configuration. Users
still maintain the ability to override the timeout, or
disable it by explicitly setting it to 0 (e.g. timeout: 0).

Fixes: #109263

@abarganier abarganier requested a review from a team as a code owner August 22, 2023 18:23
@blathers-crl
Copy link

blathers-crl bot commented Aug 22, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@abarganier
Copy link
Member Author

cc @florence-crl - this will require a docs change to update the http-defaults/timeout default value in our log config docs: https://www.cockroachlabs.com/docs/stable/configure-logs#output-to-http-network-collectors

@abarganier abarganier requested a review from a team as a code owner August 22, 2023 20:27
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: do we have same thing with fluent-servers?

Reviewed 4 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)

Copy link
Member Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

fluent-servers don't have timeout config options, but they do have hardcoded dial and write timeouts (5 & 1 seconds, respectively):

const fluentDialTimeout = 5 * time.Second
const fluentWriteTimeout = time.Second

This makes me wonder - should the default http timeout be more aggressive than 60s? Maybe 5s?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)

@abarganier abarganier changed the title pkg/util/log: change default httpSink timeout to 60s pkg/util/log: change default httpSink timeout to 2s Aug 23, 2023
@abarganier
Copy link
Member Author

Based on some internal discussion, I've dropped the default from 60s to 2s. This aligns with our general network timeout.

Previously, the default timeout for the httpSink was
for there to be no timeout at all.

This means that the first call to `output()` on where
the http target was unavailable would hang forever.
This would deadlock the calling goroutine, whether
that's the bufferedSink flush goroutine, or (even worse)
a server goroutine in the event that the httpSink is not
buffered.

Our default timeout should not deadlock in the worst case
scenario. Admittedly, `2s` would also cause a noticeable
performance degradation in the event that the httpSink was
unbuffered, but it would at least be able to emit logs
indicating the timeout as the cause. Availability would also
be maintained to some degree. Previously, the deadlocks
due to no timeout being set by default meant that no
indication was ever given that the httpSink was unable
to reach the http target.

Release note (ops change): The default value of `timeout`
for `http-servers` logging sinks has been changed from
"no timeout" to `2s`. This will be reflected in the
`http-defaults` section of the log configuration. Users
still maintain the ability to override the timeout, or
disable it by explicitly setting it to `0` (e.g. `timeout: 0`).
@sean-
Copy link
Collaborator

sean- commented Aug 23, 2023

Why align with our network timeouts vs max(${network_timeout}, ${storage_timeout})? EBS has a 60s IO timeout for storage. Is the sink synchronous or an async, buffered leaky bucket? And if it is a leaky bucket, do we have any metrics around dropped messages when the bucket gets full, and we start discarding messages?

@abarganier
Copy link
Member Author

abarganier commented Aug 23, 2023

Why align with our network timeouts vs max(${network_timeout}, ${storage_timeout})? EBS has a 60s IO timeout for storage. Is the sink synchronous or an async, buffered leaky bucket? And if it is a leaky bucket, do we have any metrics around dropped messages when the bucket gets full, and we start discarding messages?

See discussion here: https://cockroachlabs.slack.com/archives/C01CDD4HRC5/p1692804661841219

We don't have precedent for setting conditional defaults for log configs, AFAIK (e.g. using a max function).

The async buffer (if enabled, which it is by default, but can be disabled) is leaky once a limit is reached. Regarding metrics for this, see #72453. The infra wasn't there to record this until quite recently, see #106607.

@abarganier abarganier removed the request for review from a team August 23, 2023 20:38
@abarganier
Copy link
Member Author

bors r=dhartunian

@craig
Copy link
Contributor

craig bot commented Aug 25, 2023

Build succeeded:

@craig craig bot merged commit 87fc44f into cockroachdb:master Aug 25, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pkg/util/log: httpSink default timeout should not be "no timeout"
4 participants