
storage: lower the raft-log queue timer from 50ms to 0ms #23869

Merged

Conversation

petermattis
Collaborator

Lower the raft-log queue timer from 50ms to 0ms. This timer was forcing
an artificial delay between raft-log truncation operations, which in turn
allowed the raft log to grow undesirably long in cluster overload
situations. When the raft log for a range grows too large, the eventual
truncation operation can take a prohibitively long time, leading to a
downward spiral of performance, often resulting in Raft snapshots (which
are significantly more expensive) and even further degradation. There is
no downside to a zero duration between raft-log truncations, as other
mechanisms are in place to avoid performing truncations unless they are
necessary (e.g. tracking of the raft-log size and the number of entries).

Release note (performance improvement): Improve cluster performance
during overload scenarios.
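
A minimal Go sketch of the idea, using illustrative names and thresholds rather than the actual identifiers in pkg/storage: the queue timer only paces how soon an eligible range is processed, while a size/entry-count check is what decides whether a truncation is proposed at all, so a zero delay does not create unnecessary work.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative constants; the real values and names in the raft-log queue
// may differ.
const (
	// Previously the queue waited 50ms between truncation operations; this
	// change drops that artificial delay to zero.
	raftLogQueueTimerDuration = 0 * time.Millisecond

	// Truncation is still gated by the tracked raft-log size and entry count.
	raftLogMaxSizeBytes    = 4 << 20 // hypothetical 4 MiB threshold
	raftLogStaleEntryCount = 100     // hypothetical entry-count threshold
)

// shouldTruncate sketches the gating logic: the queue skips ranges whose
// raft log is still small, regardless of how often the queue runs.
func shouldTruncate(entryCount int, logSizeBytes int64) bool {
	return entryCount > raftLogStaleEntryCount || logSizeBytes > raftLogMaxSizeBytes
}

func main() {
	fmt.Println(shouldTruncate(10, 1<<20))   // false: small log, no truncation proposed
	fmt.Println(shouldTruncate(5000, 8<<20)) // true: oversized log gets truncated
	fmt.Println("queue timer between truncations:", raftLogQueueTimerDuration)
}
```

In other words, pacing and gating are separate concerns: removing the 50ms pause changes only how quickly an eligible range is processed, not whether a truncation happens.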

@petermattis petermattis requested a review from a team March 14, 2018 21:24
@cockroach-teamcity
Member

This change is Reviewable

@bdarnell
Member

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved.


@petermattis
Collaborator Author

@jordanlewis I recall you saw this during tpcc testing. So did I, which is what motivated this change. Did you file an issue about it? (I couldn't find one if you did.)

@petermattis
Collaborator Author

FYI, I plan to cherry-pick this for 2.0.

Contributor

@a-robinson a-robinson left a comment

👍👍👍

@jordanlewis
Member

I don't think I filed an issue. It was in the performance weekly notes:

  • [nathan] seeing a lot of raft log truncation events
    • We were seeing the raft log truncation queue operating a lot. This isn’t necessarily indicative of anything going wrong, but we are seeing nonzero snapshots, so that’s an indication that this is maybe not working as intended. Investigating.
    • [peter] We don’t have much tracing around the quota pool, but we should make sure we add a tracing event there, as it’s the last stop before the raft train leaves the tracing station.
    • The quota pool is the thing that proactively rate limits the raft log (see the sketch after these notes).
    • [ben] lots of magic constants baked in above - we should look at tweaking them, perhaps that will fix/mitigate the problems.
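
For context on the quota pool mentioned in these notes, here is a toy sketch of the concept only: proposals acquire quota before being handed to Raft and the quota is returned once the corresponding entries have been applied, so slow followers back-pressure new writes instead of letting the raft log grow without bound. The type and method names are made up for illustration and do not reflect CockroachDB's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// quotaPool is a toy stand-in for a per-range proposal quota pool.
type quotaPool struct {
	mu        sync.Mutex
	cond      *sync.Cond
	available int64
}

func newQuotaPool(capacity int64) *quotaPool {
	qp := &quotaPool{available: capacity}
	qp.cond = sync.NewCond(&qp.mu)
	return qp
}

// acquire blocks until `size` bytes of quota are available, rate limiting
// new proposals when the pool is exhausted.
func (qp *quotaPool) acquire(size int64) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	for qp.available < size {
		qp.cond.Wait()
	}
	qp.available -= size
}

// release returns quota once the corresponding log entries have been applied.
func (qp *quotaPool) release(size int64) {
	qp.mu.Lock()
	qp.available += size
	qp.mu.Unlock()
	qp.cond.Broadcast()
}

func main() {
	qp := newQuotaPool(1 << 20) // 1 MiB of proposal quota
	qp.acquire(512 << 10)       // a 512 KiB proposal consumes quota
	fmt.Println("quota left:", qp.available)
	qp.release(512 << 10) // entries applied; quota returned
	fmt.Println("quota left:", qp.available)
}
```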

@petermattis petermattis merged commit 67a1d96 into cockroachdb:master Mar 14, 2018
@petermattis petermattis deleted the pmattis/raft-log-queue-timer branch March 14, 2018 23:55