Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: add MaxInflightMsgBytes to raft #90314

Closed
tbg opened this issue Oct 20, 2022 · 4 comments
Closed

kvserver: add MaxInflightMsgBytes to raft #90314

tbg opened this issue Oct 20, 2022 · 4 comments
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication KV Replication Team

Comments

@tbg
Copy link
Member

tbg commented Oct 20, 2022

Is your feature request related to a problem? Please describe.

raft.RawNode has two options MaxSizePerMsg, MaxInflightMsgs:

MaxSizePerMsg: storeCfg.RaftMaxSizePerMsg,
MaxInflightMsgs: storeCfg.RaftMaxInflightMsgs,

We set them to 32kb 1 and 1282, respectively.

Consider a slow follower that the leader is replicating a stream of 16mb AddSST entries to. The follower will, for this one range, have to hold up to 128*16m = 2G worth of entries in memory. If this happens on even a handful of ranges, an OOM is likely to follow; we see this happen in practice3.

Note that the quota pool (I think defaulting to 8mb right now) can sort of inhibit this undesirable behavior in the steady state. But once a node falls out of the quota pool, for example since it was recently restarted or something like that (I assume slow nodes frequently fall out of the quota pool, since we haven't really seen range write stalls caused by them, though that doesn't mean they don't happen - see #79756) there's no stopping the raft leader from catching up the follower in this unintended overly aggressive way.

Describe the solution you'd like

RawNode should have options that allow us to specify the max inflight bytes. We could set this to, say, 64MiB to say that no more than 64MiB should be outstanding at any point in time. (This is an etcd/raft change).

In an ideal world, we'd also be able to budget how much can be outstanding across raft instances, but this is a much harder problem. For example, there may be 1000 ranges for which the leader is sending us 16mb AddSSTs. We would need to have these leaders coordinate (unlikely) or the follower would need to reject messages (this being essentially 3).

Describe alternatives you've considered

Additional context

Jira issue: CRDB-20690

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/2c3916be699e01165193d148e6caef7cb79ad110/pkg/base/config.go#L136-L137

  2. https://github.com/cockroachdb/cockroach/blob/2c3916be699e01165193d148e6caef7cb79ad110/pkg/base/config.go#L145-L148

  3. https://github.com/cockroachdb/cockroach/issues/79752 2

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication KV Replication Team labels Oct 20, 2022
@blathers-crl
Copy link

blathers-crl bot commented Oct 20, 2022

cc @cockroachdb/replication

@tbg tbg self-assigned this Oct 20, 2022
@exalate-issue-sync exalate-issue-sync bot assigned pav-kv and unassigned tbg Oct 24, 2022
@pav-kv
Copy link
Collaborator

pav-kv commented Oct 25, 2022

The current prototype is in etcd-io/etcd#14624.
It requires etcd-io/etcd#14633 to be merged first.

@pav-kv
Copy link
Collaborator

pav-kv commented Nov 14, 2022

The etcd/raft side of work is done. Next steps:

craig bot pushed a commit that referenced this issue Jan 16, 2023
94692: kvserver: use MaxInflightBytes raft config option r=erikgrinaker,tbg a=pavelkalinnikov

This change starts using the raft's `MaxInflightBytes` option which limits the inflight total byte size of the log entries sent via `MsgApp`. The default limit is set to a conservatively large value, to not change the current behaviour for now.

Part of #90314
Epic: none
Release note (ops change): Added COCKROACH_RAFT_MAX_INFLIGHT_BYTES env variable which helps strictly limiting inflight traffic from a Raft leader to followers, particularly in situations when many large messages are sent and significantly exceed COCKROACH_RAFT_MAX_SIZE_PER_MSG * COCKROACH_RAFT_MAX_INFLIGHT_MSGS which is a softer limit.

Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@pav-kv
Copy link
Collaborator

pav-kv commented Feb 1, 2023

The option is implemented and plumbed into CRDB. The default is a generous 256MB which should not disrupt existing workflows. We may revisit this setting in the future.

UPD: it is 32 MiB now [#101437].

@pav-kv pav-kv closed this as completed Feb 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication KV Replication Team
Projects
None yet
Development

No branches or pull requests

2 participants