
admission,kvserver: subject snapshot ingestion to admission control #80607

Closed
sumeerbhola opened this issue Apr 27, 2022 · 17 comments

Labels: A-admission-control · A-kv-replication (Relating to Raft, consensus, and coordination) · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) · N-followup (Needs followup) · O-23.2-scale-testing (issues found during 23.2 scale testing) · O-support (Would prevent or help troubleshoot a customer escalation: bugs, missing observability/tooling, docs) · O-testcluster (Issues found or occurred on a test cluster, i.e. a long-running internal cluster) · P-2 (Issues/test failures with a fix SLA of 3 months) · T-admission-control (Admission Control)

Comments

@sumeerbhola
Collaborator

sumeerbhola commented Apr 27, 2022

We currently throttle writes on the receiving store based on store health (e.g. via admission control or via specialized AddSSTable throttling). However, this only takes into account the local store health, and not the associated write cost on followers during replication, which isn't always throttled. We've seen this lead to hotspots where follower stores get overwhelmed, since the follower writes bypass admission control. A similar problem exists with snapshot application.

This has been touched on in several other issues as well:

A snapshot can be 512MB (the maximum size of a range). Rapid ingestion of snapshots can cause an inverted LSM, e.g. https://github.com/cockroachlabs/support/issues/1558, where we were ingesting ~1.6GB of snapshots every minute, of which ~1GB was being ingested into L0.

  • Unlike other activities that integrate with admission control, streaming the snapshot from the sender to receiver can take 16+s (512MB at a rate of 32MB/s, the default value of kv.snapshot_recovery.max_rate). Admission control currently calculates tokens at 15s time granularity, and expects admission work to be short-lived compared to this 15s interval (say < 1s).
  • Snapshots need to be ingested atomically, since they can be used to catch up a replica that has fallen behind. This atomicity is important for ensuring the replica state is consistent in case of a crash. This means the atomic ingest operation can add 512MB to L0 in the worst case.

Solution sketch:

  • We assume that

    • Overloaded stores are rare and so we don't have to worry about differentiating the priority of REBALANCE snapshots and RECOVERY snapshots from the perspective of store-write admission control.
    • The allocator will usually not try to add replicas to a store with a high L0 sublevel count or high file count. This is not an essential requirement.
  • The kv.snapshot_recovery.max_rate setting continues to be driven by the rate of resource consumption on the source and by the network. That is, it has nothing to do with how fast the destination can ingest. This is reasonable since the ingestion is atomic and happens only after all the ssts for the snapshot have been written locally, so it doesn't matter how fast the data for a snapshot arrives. This ignores the fact that writing the ssts also consumes disk write bandwidth, which may be constrained -- that is acceptable since write amplification is the biggest consumer of resources.

  • We keep the Store.snapshotApplySem to limit the number of snapshots that are concurrently streaming their data. After the local ssts are ready to be ingested, we ask for admission tokens equal to the total size of the ssts (see the sketch after this list). This assumes we have switched store-write admission control to use byte tokens (see admission: change store write admission control to use byte tokens #80480). Once the tokens are granted, the ingestion is performed, and after that the snapshotApplySem is released. This is where having a single priority for all snapshot ingestion is convenient -- we don't have to worry about a high-priority snapshot waiting for snapshotApplySem while a low-priority snapshot is waiting for store-write admission.

  • Potential issues:

    • Consuming 512MB of tokens in one shot can starve out other traffic. This is ok since the ingest got to the head of the queue, so there must not have been more important work waiting. Also, the 512MB is spread across 5 non-overlapping sstables, so it will add at most 1 sublevel to L0.
    • Available tokens will become negative. This is ok, since the negative value is proper accounting and the 1s refilling of tokens will eventually bring the count back to positive. If, for an instant, there was no higher-priority work waiting and that allowed the snapshot to consume all the tokens, higher-priority work arriving later will need to wait. If this becomes a problem, we could start differentiating the tokens consumed by snapshot ingests from other work and allow some overcommitment (this may work because a snapshot ingest adds at most 1 sublevel).
    • The snapshot will exceed the deadline while waiting in the admission queue. This is wasted work, which we would ideally avoid, but we can mitigate it by (a) relying on the aforementioned allocator decision-making to make this rare, and (b) adding an extra delay to future acquisitions of snapshotApplySem whenever we exceed the deadline (i.e., some sort of backoff scheme before the streaming of a snapshot starts).
  • Misc things to figure out:

    • In a multi-tenant setting, we will need to decide which tenant these ingestions are attributed to, since we have rough fair sharing across tenants. Should they use the tenant of the range, or the system tenant?
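
To make the flow in the snapshotApplySem bullet concrete, here is a minimal, self-contained Go sketch. All of the names and numbers (byteTokenGranter, waitAndTake, the 64MB refill) are toy stand-ins for the real kvserver/admission machinery, not its actual API:

```go
// Toy model of the proposed flow; illustrative only, not CockroachDB's admission API.
package main

import (
	"fmt"
	"sync"
	"time"
)

// byteTokenGranter models store-write admission with byte tokens (see #80480).
// A large atomic grant may drive the count negative; periodic refills
// eventually bring it back up.
type byteTokenGranter struct {
	mu     sync.Mutex
	tokens int64
}

// waitAndTake blocks until the token count is positive, then deducts n in one
// shot, so a 512MB snapshot ingest can push the count well below zero.
func (g *byteTokenGranter) waitAndTake(n int64) {
	for {
		g.mu.Lock()
		if g.tokens > 0 {
			g.tokens -= n // proper accounting: may go negative
			g.mu.Unlock()
			return
		}
		g.mu.Unlock()
		time.Sleep(10 * time.Millisecond)
	}
}

func (g *byteTokenGranter) refill(n, max int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.tokens += n; g.tokens > max {
		g.tokens = max
	}
}

func main() {
	granter := &byteTokenGranter{tokens: 64 << 20} // start with 64MB of tokens
	snapshotApplySem := make(chan struct{}, 1)     // cap on in-flight snapshots

	// Stand-in for the periodic (1s) token replenishment.
	go func() {
		for range time.Tick(time.Second) {
			granter.refill(64<<20, 512<<20)
		}
	}()

	// 1. Hold the semaphore while the snapshot streams in and its ssts are
	//    written locally at kv.snapshot_recovery.max_rate.
	snapshotApplySem <- struct{}{}
	sstBytes := int64(512 << 20) // worst case: a full 512MB range

	// 2. Only once the ssts are ready, ask for byte tokens equal to their size.
	granter.waitAndTake(sstBytes)

	// 3. Ingest atomically (e.g. via the engine's sst ingestion), then release.
	<-snapshotApplySem

	granter.mu.Lock()
	fmt.Println("snapshot ingested; remaining tokens:", granter.tokens)
	granter.mu.Unlock()
}
```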

@tbg

Jira issue: CRDB-15330

Epic CRDB-34248

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. A-admission-control labels Apr 27, 2022
@tbg
Member

tbg commented Apr 27, 2022

The kv.snapshot_recovery.max_rate setting continues to be driven by the rate of resource consumption on the source and of the network. That is, it has nothing to do with how fast the destination can ingest.

The snapshot rate limits configured in https://github.com/cockroachlabs/support/issues/1558 were dangerously close to maxing out the disks' configured max throughputs. With everything you suggest above done, wouldn't we still see issues owing to the fact that streaming the snapshot alone would overload the disks? There is a mismatch between our L0-centric tracking and snapshots in this regard. I suppose this is fine; the rate limits should really be adaptive to disk throughput or, for now, simply shouldn't be misconfigured like that.


Consuming 512MB of tokens in one shot can starve out other traffic. [...] Available tokens will become negative.

This does worry me a bit. It's just a lot of tokens! Can we acquire the tokens in smaller chunks (say 1mb/s increments) so that the bucket does not go negative, and we're less likely to impose a big delay on a high-priority operation?
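
A rough sketch of that chunked alternative, reusing the toy byteTokenGranter type from the sketch earlier in the thread (the chunk size and function name are illustrative, not real admission-control API):

```go
// acquireInChunks takes tokens in small increments while the snapshot streams
// in, instead of one 512MB deduction at ingest time.
func acquireInChunks(g *byteTokenGranter, totalBytes int64) {
	const chunk = 1 << 20 // 1MB per acquisition
	for remaining := totalBytes; remaining > 0; remaining -= chunk {
		n := int64(chunk)
		if remaining < n {
			n = remaining
		}
		// Each small take keeps the bucket near zero rather than deeply
		// negative, so later high-priority work waits less.
		g.waitAndTake(n)
	}
}
```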


The snapshot will exceed the deadline while waiting in the admission queue.

In that case, we can apply the snapshot, right? It doesn't really matter if we do, since the caller (replicate queue in all likelihood) will assume the snapshot failed & will not make good use of it. But if we "pre-acked" the snapshot the moment we get to the point at which we're waiting for ingestion tokens, the caller could be more lenient and willing to wait a little longer.


Playing devil's advocate:

Admission control currently calculates tokens at 15s time granularity, and expects admission work to be short-lived compared to this 15s interval (say < 1s).

Is it maybe ok to ignore that for snapshots? Let's say we acquire tokens as the snapshot is streamed in, and then ingest no matter what. Since we're doing at most one snapshot per store at any given time, in the scenario we're worried about wouldn't it still be ok:

  • lsm is healthy
  • snapshot gets streamed in, acquires 512mb over time
  • streaming finished
  • lsm suddenly becomes very unhealthy and there are no more tokens
  • 512MB snapshot ingested into L0, so LSM is getting a little more unhealthy (but not too bad as you mentioned, since we added only one sublevel and also these SSTs should compact down rather well due to the contained rangedels, right)
  • next snapshot will be backpressured due to this bad LSM while streaming, unless things have recovered by then anyway

Do I have the scenario wrong? It already feels fairly contrived. I might not be understanding it correctly.

@sumeerbhola
Collaborator Author

sumeerbhola commented Apr 28, 2022

The snapshot rate limits configured in https://github.com/cockroachlabs/support/issues/1558 were dangerously close to maxing out the disks' configured max throughputs. With everything you suggest above done, wouldn't we still see issues owing to the fact that streaming the snapshot alone would overload the disks?

Are you worried about the reads at the source? If so, it depends on how much of this hits the Pebble block cache and the OS page cache. But yes, it is a valid concern -- and admission control at the source can't help yet with disk bandwidth limits (only CPU). We could consider at least hooking up the reads to admission control.

This does worry me a bit. It's just a lot of tokens! Can we acquire the tokens in smaller chunks (say 1mb/s increments) so that the bucket does not go negative, and we're less likely to impose a big delay on a high-priority operation?

That's definitely a potential alternative and I did consider that briefly. I discarded it because:

  • it is lying: the ingest will happen atomically later. And yes, the consequence of this lying is limited in terms of sub-levels, so it is something we could reconsider.
  • it subjectively felt more complicated -- I had a vague unease with consuming tokens as the snapshot is being streamed in, partly because it is unclear what we would do if we block on those tokens (push back on the sender? and what about the deadline?). It seemed simpler to let the sender send everything at the rate it is capable of (and the network can support).

I'll give this alternative some more thought, and flesh out the third alternative I alluded to above, which overcommits by having these snapshots consume tokens that don't touch the normal tokens.

In that case, we can apply the snapshot, right? It doesn't really matter if we do, since the caller (replicate queue in all likelihood) will assume the snapshot failed & will not make good use of it. But if we "pre-acked" the snapshot the moment we get to the point at which we're waiting for ingestion tokens, the caller could be more lenient and willing to wait a little longer.

I need to understand the deadline situation better. IIUC, the deadline exists so that the source can try another healthier node, yes? So the leniency above is not indefinite, yes?

@sumeerbhola
Collaborator Author

Relevant comment by @nicktrav https://github.com/cockroachlabs/support/issues/1558#issuecomment-1112278968 regarding not ignoring the write bandwidth to write the initial sstables while the snapshot is streaming in, as it could be substantial (20% in that case).

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 3, 2022
The approach here is to use a special set of tokens for snapshots
kvStoreTokenGranter.availableRangeSnapshotIOTokens that allows
for over-commitment, so that normal work is not directly
affected. There is a long code comment in kvStoreTokenGranter
with justification.

Informs cockroachdb#80607

Release note: None
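
The commit message above describes a separate, over-committable token pool for snapshots. A toy sketch of that idea follows; the field name mirrors availableRangeSnapshotIOTokens, but this is not the actual kvStoreTokenGranter code:

```go
package tokensketch

import "sync"

// kvStoreTokenGranterSketch is a toy model of the approach in the commit
// above: snapshot ingests draw from their own pool so they never push the
// normal store-write pool negative and starve regular work.
type kvStoreTokenGranterSketch struct {
	mu                             sync.Mutex
	availableIOTokens              int64 // normal store writes
	availableRangeSnapshotIOTokens int64 // snapshot ingests only
}

// tryGetSnapshotTokens deducts from the snapshot pool. Over-commitment is
// deliberate: the pool may go negative, which is tolerable because a snapshot
// ingest adds at most one L0 sublevel.
func (g *kvStoreTokenGranterSketch) tryGetSnapshotTokens(n int64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.availableRangeSnapshotIOTokens <= 0 {
		return false // caller waits for the next refill
	}
	g.availableRangeSnapshotIOTokens -= n
	return true
}
```
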
@sumeerbhola
Collaborator Author

And flesh out the third alternative which I had alluded to above, that overcommits, by consuming tokens for these snapshots that don't touch the normal tokens

I'm leaning towards this, since it is effective and is the simplest -- created a draft PR in #80914

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 4, 2022
The approach here is to use a special set of tokens for snapshots
kvStoreTokenGranter.availableRangeSnapshotIOTokens that allows
for over-commitment, so that normal work is not directly
affected. There is a long code comment in kvStoreTokenGranter
with justification.

Informs cockroachdb#80607

Release note: None
@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@mwang1026 mwang1026 added the O-postmortem Originated from a Postmortem action item. label May 27, 2022
@exalate-issue-sync exalate-issue-sync bot added the T-storage Storage Team label May 27, 2022
@blathers-crl blathers-crl bot added this to Incoming in (Deprecated) Storage May 27, 2022
@tbg tbg removed this from Incoming in (Deprecated) Storage May 31, 2022
@tbg tbg added this to Incoming in KV via automation May 31, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 31, 2022
@tbg
Member

tbg commented May 31, 2022

Moving this to KV (as the new owners of admission control) and tagging @shralex and @mwang1026 to see who can work on this and when.

Note there's a prototype PR #79215 (of Sumeer's) that I've looked at and that I think would be good to experiment with. Before we do that, we should write a roachtest (a la #81516), which I filed as #82116.

@tbg
Member

tbg commented Jun 27, 2022

One approach that was being explored in #82132 is to "ignore" overloaded followers as much as possible - in a sense, saying that when we see an overloaded follower, as long as that follower isn't required for quorum, we are better off leaving it alone for some time because the alternative involves backpressuring client traffic, which is essentially like introducing some amount of voluntary unavailability.

We could do something similar here: avoid raft snapshots to nodes that are overloaded, unless the snapshot is necessary for forward progress (which is something the initiator of the snapshot can reason about, since it is the raft leader).

Note that we already don't rebalance to nodes that have an inverted LSM, see #73714.
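
The rule described above boils down to something like the following sketch; shouldSendSnapshot and its inputs are illustrative names, not existing kvserver functions:

```go
// shouldSendSnapshot sketches the leader-side decision: skip snapshots to an
// overloaded follower unless the snapshot is needed for forward progress.
func shouldSendSnapshot(followerOverloaded, requiredForForwardProgress bool) bool {
	if !followerOverloaded {
		return true
	}
	// Leaving an overloaded, non-essential follower alone for a while is
	// cheaper than backpressuring client traffic.
	return requiredForForwardProgress
}
```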

@exalate-issue-sync exalate-issue-sync bot added the T-admission-control Admission Control label Nov 2, 2023
@williamkulju williamkulju added O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster O-23.2-scale-testing issues found during 23.2 scale testing labels Nov 10, 2023
@andrewbaptist
Collaborator

Closing #113720 in favor of this issue.

The current situation is that we require manual tuning of the rate, and it causes problems because it is “always” either too slow or too fast. This is likely one of the most tuned parameters today, and it is typically tuned reactively and modified frequently.
Additionally, there are options for sending and receiving multiple snapshots in parallel, and there is a tradeoff between sending too many of them slowly vs. running just one and not taking full advantage of the hardware. These limits are applied at the store level and not the node level.

Also, some snapshots are “cheap” in that they are for a new range and rarely disrupt the LSM, while others are “expensive” since they replace existing data. The KV code has some idea of which it is, but doesn't take that into account today.

We have timeouts and rate computations for snapshots, but they don't always work as expected, and we end up with timeouts today.

Short term this is hard because there are throttles on both sides (2 for send, 1 for receive) and the timeout is derived from the configured setting, so if the setting is too high we time out more often even though we send at the same rate. I would like to remove the timeout as a “normal flow control” mechanism and instead only time out in extreme cases.
We also don't take into account the shape of the LSM or other health signals on either node.

@exalate-issue-sync exalate-issue-sync bot added the P-2 Issues/test failures with a fix SLA of 3 months label Dec 8, 2023
@aadityasondhi aadityasondhi self-assigned this Feb 20, 2024
@arulajmani arulajmani added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Feb 23, 2024
@arulajmani
Collaborator

Adding the O-support label given this would have prevented https://github.com/cockroachlabs/support/issues/2841.

@aadityasondhi
Collaborator

aadityasondhi commented Feb 29, 2024

This issue has been resolved through our optimizations in Pebble (see #117116). We no longer see the inverted LSM problem described here, due to our use of excise in the ingestion path. This feature is turned on by default in 24.1 and can be turned on in 23.2.

We verified the behavior through extensive internal experiments. The end result is that with these settings turned on, we see minimal (often zero) ingests landing in L0. Instead, with the help of excise, the big majority lands in L6.
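
To illustrate why excise keeps ingests out of L0, here is a toy model of level placement. It is not Pebble's actual ingest logic; it ignores memtable overlap, file boundaries, and in-progress compactions, and the function names are invented for this sketch:

```go
package excisesketch

// plainIngestLevel models the pre-excise behavior: an ingested sst carries the
// newest sequence number, so it must land above the shallowest level that
// overlaps its key span; overlap near the top of the LSM forces an L0 ingest.
func plainIngestLevel(overlapByLevel [7]bool) int {
	for lvl := 0; lvl <= 6; lvl++ {
		if overlapByLevel[lvl] {
			if lvl == 0 {
				return 0
			}
			return lvl - 1
		}
	}
	return 6 // no overlap anywhere: straight into L6
}

// exciseIngestLevel models ingest-with-excise: the overlapping key span is
// excised from the existing sstables first, so the snapshot's ssts can land
// at the bottom of the LSM regardless of prior overlap.
func exciseIngestLevel(overlapByLevel [7]bool) int {
	_ = overlapByLevel // overlap no longer constrains placement
	return 6
}
```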

Before (left) and after (right): [image]

Pebble logs to confirm:

I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +      |                             |       |       |   ingested   |     moved    |    written   |       |    amp
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +level | tables  size val-bl vtables | score |   in  | tables  size | tables  size | tables  size |  read |   r   w
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +------+-----------------------------+-------+-------+--------------+--------------+--------------+-------+---------
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +    0 |     0     0B     0B       0 |  0.00 |  10GB |     0     0B |     0     0B |  3.8K  8.1GB |    0B |   0  0.8
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +    1 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +    2 |    30   60MB     0B       0 |  0.93 | 8.1GB |     2  2.4KB |    32   82MB |  7.7K   20GB |  21GB |   1  2.5
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +    3 |    81  352MB   977B       0 |  1.00 | 6.3GB |    16   18KB |   497  1.7GB |  3.7K   17GB |  18GB |   1  2.7
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +    4 |   216  1.9GB   517B       1 |  0.99 | 6.2GB |   110  123KB |   203  279MB |  1.9K   17GB |  19GB |   1  2.8
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +    5 |   465  2.1GB  222KB     147 |  0.44 | 1.5GB |  2.9K  3.6MB |   391  1.3GB |   584  5.1GB | 7.9GB |   1  3.5
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +    6 |  1.1K   66GB  645KB      13 |     - | 2.7GB |  3.0K   63GB |    46   58MB |   255   12GB |  19GB |   1  4.3
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +total |  1.9K   70GB  869KB     161 |     - |  74GB |  6.0K   63GB |  1.2K  3.4GB |   18K  152GB |  85GB |   5  2.1
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +-------------------------------------------------------------------------------------------------------------------
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +WAL: 1 files (40MB)  in: 10GB  written: 10GB (1% overhead)
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Flushes: 243
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Compactions: 9723  estimated debt: 0B  in progress: 0 (0B)
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +             default: 6473  delete: 78  elision: 2002  move: 1169  read: 1  rewrite: 0  multi-level: 0
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +MemTables: 1 (64MB)  zombie: 1 (64MB)
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Zombie tables: 0 (0B)
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Backing tables: 127 (1.9GB)
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Virtual tables: 161 (937MB)
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Block cache: 151K entries (3.6GB)  hit rate: 68.9%
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Table cache: 780 entries (616KB)  hit rate: 99.4%
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Secondary cache: 0 entries (0B)  hit rate: 0.0%
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Snapshots: 0  earliest seq num: 0
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Table iters: 0
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Filter utility: 97.8%
I240228 22:29:25.630517 632 kv/kvserver/store.go:3403 ⋮ [T1,Vsystem,n3] 4510 +Ingestions: 998  as flushable: 0 (0B in 0 tables)

Based on these findings, we can close out this issue.

Internal doc with more detail here
Internal thread here

KV automation moved this from On Hold to Closed Feb 29, 2024
@andrewbaptist
Collaborator

Since we are no longer concerned with the snapshot overhead, we have the option to either increase the concurrency of snapshots or simply increase the default ingestion rate (to something like 256MiB). It is kept artificially low with the idea that we would increase it once this enhancement was implemented. Currently many customers manually increase this rate, but doing so is exactly when they get into trouble on 23.2. @itsbilal had looked at options to increase this rate as part of disaggregated storage.

@sumeerbhola
Collaborator Author

sumeerbhola commented Mar 12, 2024

simply increase the default ingestion rate (to something like 256MiB). It is kept artificially low with the idea that we would increase it once this enhancement was implemented.

Until AC has control over disk bandwidth usage, I think increasing the default is risky in that the disk could become saturated.

@andrewbaptist
Collaborator

I am going to re-open this issue even if we don't deal with it now. As part of closing this we should also resolve #14768,
or at a minimum set the defaults much higher than they are now. If we don't want to increase the default because we are concerned that higher values could cause overload, then I don't think it is appropriate to mark this as not an issue.

We are in a strange state where the guidance to customers and customer best practices in many escalations and runbooks is to increase (or at least modify) this rate, but we are always walking a fine line to make this work well across different scenarios.

Either way, if we want to keep this closed, we should update these docs with guidance about what the rate should be in 24.1:

https://cockroachlabs.atlassian.net/wiki/spaces/CKB/pages/2273673403/How+To+Change+Rebalance+and+Recovery+Rates
https://cockroachlabs.atlassian.net/wiki/spaces/CKB/pages/2500003791/How+To+Speed+Up+Range+Replication
https://cockroachlabs.atlassian.net/wiki/spaces/IM/pages/2930409473/Shrink+Disk+Size+for+Operator+Managed+Clusters

@andrewbaptist andrewbaptist reopened this Mar 13, 2024
KV automation moved this from Closed to Incoming Mar 13, 2024
@aadityasondhi
Collaborator

Ack, that's fine with me. It is worth noting that we have a tracking issue for the bandwidth-specific problem: #86857. It is a problem we plan to solve soon. I think the original motivation of this issue was increased r-amp due to snapshots landing in L0 on the receiver, which is no longer the case, which is why I originally closed this issue. But either way, we should be solving the bandwidth issue with snapshots in mind (ref: #86857 (comment)).

we should update with guidance about what it should be in 24.1

I think the concern is with the change to the defaults since we don't know how it will react in bandwidth constrained setups without AC for bandwidth, especially in environments where reads and writes share the same provisioned bandwidth limits. I think it still makes sense to recommend increasing this limit in cases where it can help. FWIW, in the doc referenced above with my results, I was running with this knob tuned up to 256MB.

@sumeerbhola
Collaborator Author

See #120708 for that followup. Closing this issue.

KV automation moved this from Incoming to Closed Mar 19, 2024
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 23, 2024
This patch adds a roachtest for running snapshots with excises enabled.
In this workload, when splits and excises are disabled, we see an
inverted LSM and degraded p99 latencies.

The test asserts that the LSM stays healthy while doing the snapshot
ingest, and p99 latencies don't spike over a threshold.

Informs cockroachdb#80607.

Release note: None
craig bot pushed a commit that referenced this issue May 24, 2024
124591: roachtest: add roachtest for snapshot ingest with excises r=sumeerbhola a=aadityasondhi

This patch adds a roachtest for running snapshots with excises enabled. In this workload, when splits and excises are disabled, we see an inverted LSM and degraded p99 latencies.

The test asserts that the LSM stays healthy while doing the snapshot ingest, and p99 latencies don't spike over a threshold.

Informs #80607.

Release note: None

Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>