Measure read latency of single row SQL reads done by users, without capturing contention or hardware queueing #98305

Open
joshimhoff opened this issue Mar 9, 2023 · 8 comments
Labels
A-kv-observability A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. T-storage Storage Team

Comments

@joshimhoff
Collaborator

joshimhoff commented Mar 9, 2023

Is your feature request related to a problem? Please describe.
It would be nice to measure the point read latency of single-row SQL reads done by users, without capturing contention or hardware overload. CRDB should deliver predictable performance for single-row SQL reads.

Describe the solution you'd like
Implement kv.replica_read_batch_evaluate.single_row.latency. Similar to kv.replica_read_batch_evaluate.latency, but it only captures read requests that touch a single SQL row (or similar, e.g. allow joins by FK). @rytaft suggests a way to check whether a read touches just one SQL row at #71169 (comment). We could do the check in SQL land, attach some metadata to the context indicating that only a single SQL row is affected, plumb that into BatchRequest like we do with tracing info, then plumb it into the kvserver context and use it to implement kv.replica_read_batch_evaluate.single_row.latency.
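To make the plumbing concrete, here is a rough sketch of the shape of the change (hypothetical type and field names, not actual CRDB code): SQL sets a single-row hint on the batch header, and the replica read-evaluation path records into a second histogram when the hint is set.

```go
// Hypothetical sketch only: none of these types exist in CRDB under these
// names; this just illustrates the plumbing shape described above.
package main

import (
	"fmt"
	"time"
)

// BatchHeader stands in for the metadata we would attach to a BatchRequest
// (alongside tracing info) to mark that SQL determined the read touches a
// single row.
type BatchHeader struct {
	SingleRowRead bool // set in SQL land before the request is sent to KV
}

// latencyHistogram is a stand-in for a CRDB latency histogram metric.
type latencyHistogram struct {
	name    string
	samples []time.Duration
}

func (h *latencyHistogram) RecordValue(d time.Duration) {
	h.samples = append(h.samples, d)
}

// evaluateReadBatch mimics the replica read-evaluation path: it times only
// the evaluation itself (no contention / queueing above it) and records into
// the single-row histogram when the SQL-provided hint is set.
func evaluateReadBatch(h BatchHeader, evalHist, singleRowHist *latencyHistogram, eval func()) {
	start := time.Now()
	eval() // the actual read evaluation
	elapsed := time.Since(start)

	evalHist.RecordValue(elapsed) // existing kv.replica_read_batch_evaluate.latency
	if h.SingleRowRead {
		singleRowHist.RecordValue(elapsed) // proposed kv.replica_read_batch_evaluate.single_row.latency
	}
}

func main() {
	evalHist := &latencyHistogram{name: "kv.replica_read_batch_evaluate.latency"}
	singleRowHist := &latencyHistogram{name: "kv.replica_read_batch_evaluate.single_row.latency"}

	// SQL decides the statement reads one row and sets the hint.
	hdr := BatchHeader{SingleRowRead: true}
	evaluateReadBatch(hdr, evalHist, singleRowHist, func() { time.Sleep(time.Millisecond) })

	fmt.Println(len(singleRowHist.samples), "single-row sample(s) recorded")
}
```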

Describe alternatives you've considered
#71169 is a more complete solution but will take much longer to implement.

Additional context
This will help with:

  1. SRE alerting.
  2. Disaggregated storage rollout to CC, as we need to know what latency we are currently providing our users to decide on the right caching solution.

Jira issue: CRDB-25182

@joshimhoff joshimhoff added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 9, 2023
@joshimhoff
Collaborator Author

CC @tbg

@joshimhoff joshimhoff changed the title Measure point read latency of single row SQL reads done by users, without capturing contention or hardware overload Measure read latency of single row SQL reads done by users, without capturing contention or hardware overload Mar 9, 2023
@joshimhoff joshimhoff added A-kv Anything in KV that doesn't belong in a more specific category. A-kv-observability and removed A-kv Anything in KV that doesn't belong in a more specific category. labels Mar 9, 2023
@joshimhoff joshimhoff added O-sre For issues SRE opened or otherwise cares about tracking. A-storage Relating to our storage engine (Pebble) on-disk storage. and removed T-kv-observability labels Mar 9, 2023
@blathers-crl blathers-crl bot added the T-storage Storage Team label Mar 9, 2023
@tbg
Member

tbg commented Mar 14, 2023

hardware overload

What do you mean by that? You will measure hardware overload no matter what, no? Similarly, network latencies are measured too, and if there is network congestion that too will influence the measurement. I'd think all of this is intentional, too.

@joshimhoff
Collaborator Author

joshimhoff commented Mar 14, 2023

Ideally, AC should shift queuing into the AC system rather than the goroutine scheduler, etc. So, for the same reason we expect node liveness HBs & kvprober requests to succeed under hardware overload, we expect kv.replica_read_batch_evaluate.single_row.latency to not increase much under hardware overload. FWIW, I expect it will increase a bit, and that is fine, and also at least in 22.2 there are some known gaps in AC such as no AC on followers that can lead to significant queueing outside of AC.

Perhaps the problem is the term "hardware overload". In this case, by "hardware overload", I mean some hardware resource is saturated but only a little since AC is working.

Re: network, yes, since AC doesn't take this into account. We have not observed that problem in practice in CC to date.

@andrewbaptist
Collaborator

I would replace the word "hardware overload" with "hardware queueing". Specifically, there are a few places I could see hardware queueing:

  1. Network - if the aggregate throughput is greater than the link bandwidth, then gRPC calls will stall before sending (since TCP typically has a small window)
  2. CPU - if the number of runnable goroutines per CPU is > 1, then the Go scheduler will queue some goroutines behind others (a crude way to observe this is sketched below)
  3. Disk - if the number of outstanding I/Os exceeds the device's queue depth, then requests will wait before being issued to the disk (*)
  4. Memory - if the memory reads/writes exceed the memory bus speed, then the CPU will stall

(*) For disk it is a little trickier, since the OS will also queue and reorder operations because we go through the file system and not directly to the raw disk. The file system exposes metrics (dirty bytes, ...) that can be used to determine how far behind it is on writes.

Ideally, AC would reduce the size of these queues to prevent hardware queueing. I'm not sure it will want to consider all of them, though. For instance, memory queueing requires a deep understanding of caches and NUMA architecture, so that is likely outside the scope of what AC should do.
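As a rough illustration of the CPU bullet above (not CRDB code, just a standalone probe): sample how late a short sleep wakes up. When runnable goroutines exceed available CPUs, wake-ups get delayed by scheduler queueing, so the overshoot is a crude proxy for CPU queueing.

```go
// Standalone sketch: measure Go scheduler queueing delay as the amount by
// which a short sleep overshoots its requested duration. Timer granularity
// also contributes, so treat this as a crude proxy only.
package main

import (
	"fmt"
	"time"
)

func schedulerOvershoot(sleep time.Duration) time.Duration {
	start := time.Now()
	time.Sleep(sleep)
	// Extra time spent waiting to be scheduled after the timer fired.
	return time.Since(start) - sleep
}

func main() {
	const probe = time.Millisecond
	var worst time.Duration
	for i := 0; i < 100; i++ {
		if d := schedulerOvershoot(probe); d > worst {
			worst = d
		}
	}
	fmt.Printf("worst wake-up overshoot over 100 probes: %s\n", worst)
}
```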

@dt
Member

dt commented Mar 14, 2023

"single row" seems overly specific and also maybe not what it sounds like you're looking for here -- a "row" could be 200 reads or one read depending on how many column families and plenty of workloads are going to be scans rather than single row reads.

It sounds like what you're getting at is measuring the latency / service time of the storage layer in finding and returning the data requested by the transaction layer, as an indicator of storage performance that is independent of the transaction layer sitting on top of it, including AC queueing, latch acquisitions, and anything else that happens in that layer. You just want a pure indicator of how quickly the storage layer is getting the data to it?

For that, I think it sounds like the iterator stats, specifically the timing of Seek*() and Next*() calls, might get you closer to what you want, i.e. the retrieval performance of the storage engine?
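A minimal sketch of the idea, assuming a generic iterator interface rather than the real Pebble API: wrap Seek*/Next* with timers so retrieval time is measured below latching, AC, and the rest of the transaction layer.

```go
// Sketch only: "iterator" is a hypothetical stand-in for the storage engine
// iterator, not the actual Pebble interface.
package main

import (
	"fmt"
	"time"
)

type iterator interface {
	SeekGE(key []byte) bool
	Next() bool
}

// timedIterator accumulates time spent inside Seek/Next calls, mirroring the
// idea of the iterator stats mentioned above.
type timedIterator struct {
	iterator
	SeekDur, NextDur time.Duration
}

func (t *timedIterator) SeekGE(key []byte) bool {
	start := time.Now()
	ok := t.iterator.SeekGE(key)
	t.SeekDur += time.Since(start)
	return ok
}

func (t *timedIterator) Next() bool {
	start := time.Now()
	ok := t.iterator.Next()
	t.NextDur += time.Since(start)
	return ok
}

// fakeIter simulates a storage read so the example runs standalone.
type fakeIter struct{ n int }

func (f *fakeIter) SeekGE(key []byte) bool { time.Sleep(100 * time.Microsecond); return true }
func (f *fakeIter) Next() bool             { f.n++; time.Sleep(10 * time.Microsecond); return f.n < 5 }

func main() {
	it := &timedIterator{iterator: &fakeIter{}}
	for ok := it.SeekGE([]byte("k")); ok; ok = it.Next() {
	}
	fmt.Printf("seek: %s, next: %s\n", it.SeekDur, it.NextDur)
}
```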

@joshimhoff
Collaborator Author

I'd think all of this is intentional, too.

Just noticed this bit. Hardware overload is a problem for the customer experience, so it makes sense to measure it with certain metrics. But we don't want to wake up an SRE at 3am because a cluster is hardware overloaded. As a result, for this metric, I like that it mostly doesn't capture hardware overload.

I would replace the word "hardware overload" with "hardware queueing".

Good idea. Updating ticket title.

Ideally, AC would reduce the size of these queues to prevent hardware queuing. I'm not sure it will want to consider all of them though.

Agreed! I think it's okay that AC is not perfect. AC needs to be good enough that SRE / dev can page on things like kvprober / this & not receive too many false positive pages (pages with cause being hardware overload). Some false positive pages are fine tho.

For that, I think it sounds like the iterator stats, specifically the timing of Seek*() and Next*() calls, might get you closer to what you want, i.e. the retrieval performance of the storage engine?

I'll take a look at that. Thanks for the idea!

Re: goal, the goal you are laying out is perhaps not the same as the goal I have. I think the goal is to measure E2E SQL perf in a latency sense for the single SQL row SELECT workload, but with certain pieces of latency that are generally caused by customer workload choices, e.g. contention + hardware queueing, excluded. You might say: Josh, you are not measuring any SQL stuff. Agreed, but that is more for expediency! One way to think about this ticket is as a very scrappy & minimal version of #71169 (comment).

"single row" seems overly specific and also maybe not what it sounds like you're looking for here -- a "row" could be 200 reads or one read depending on how many column families and plenty of workloads are going to be scans rather than single row reads.

Re: 200 reads or one read depending on column families, let's see if that is a problem in practice.

Re: scans, I think that is out of scope. It is useful to know perf in a latency sense for non-scan workloads, even tho people do scans. Scans are tricky since perf is expected to be extremely variable.

@joshimhoff joshimhoff changed the title Measure read latency of single row SQL reads done by users, without capturing contention or hardware overload Scrappily measure read latency of single row SQL reads done by users, without capturing contention or hardware queueing Mar 14, 2023
@joshimhoff joshimhoff changed the title Scrappily measure read latency of single row SQL reads done by users, without capturing contention or hardware queueing Measure read latency of single row SQL reads done by users, without capturing contention or hardware queueing Mar 14, 2023
@dt
Member

dt commented Mar 14, 2023

I guess I'm arguing with the goal; why is your goal "single row SELECT latency that excludes $stuff"?

Is that a goal because we think that if we get it, we will be able to answer questions like "how does disagg storage affect performance?" or "does this node on this cluster have a dodgy EBS volume and need to trigger an alert?", without the confounding variables of per-workload differences like contention, AC, schema, etc.? If so, then I think a metric for "single row SQL read excluding $stuff" isn't actually the metric that, if we had it, would best give us those answers, for reasons like the ones I gave above. In particular, I think exposing the pebble iterator operation latencies, measured below all of the "$stuff", would be a better goal.

@joshimhoff
Collaborator Author

Ya, it's an interesting point. I shall dig more into details of exactly what the iterator stats are measuring & then report back!
