admission: priority inversion caused by SQLKVResponseWork and SQLSQLResponseWork #85471
Still trying to wrap my head around admission control, but I have a question prompted by the work to measure SQL CPU time: would this story be improved by measuring the time to process each SQL batch? Instead of scheduling admission of work into the SQL layer / across RPC boundaries, we could cooperatively schedule processing of each batch by a SQL operator, aiming for some target duration (e.g. 20ms). Could that be sufficient to have KV and SQL share a slot pool?
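A minimal Go sketch of that cooperative scheme, with all names invented for illustration (the real admission package's API differs): the operator holds a slot for roughly the target duration before yielding and re-queueing, instead of asking for admission on every batch.

```go
package sketch

import (
	"context"
	"time"
)

// admitter stands in for the admission-control hook; both methods are
// hypothetical. Admit may queue until a slot is free; Done returns it.
type admitter interface {
	Admit(ctx context.Context) error
	Done()
}

// pacer lets an operator hold a slot for roughly target (e.g. 20ms)
// of processing before yielding and re-queueing.
type pacer struct {
	adm      admitter
	target   time.Duration
	start    time.Time
	admitted bool
}

// beforeBatch is called before processing each SQL batch. Wall time is
// used here as a cheap proxy for CPU time; this is a sketch only.
func (p *pacer) beforeBatch(ctx context.Context) error {
	if p.admitted && time.Since(p.start) < p.target {
		return nil // still inside the current quantum
	}
	if p.admitted {
		p.adm.Done() // yield so other KV/SQL work can be admitted
	}
	if err := p.adm.Admit(ctx); err != nil {
		return err
	}
	p.admitted = true
	p.start = time.Now()
	return nil
}
```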
Can you point me to the SQL instrumentation for measuring CPU time? It is possible I have been thinking about this wrong, assuming we need to intercept the start and end of the SQL work in the places where SQL and KV interact (assuming you have read the Slack thread linked above). We just want to intercept inside SQL wherever there is the potential for significant work. The interception will take the form of asking for permission to do some CPU work and then reporting when the work is done (so we can return slots and account for CPU consumed).
We have this in `pkg/sql/colflow/stats.go`, lines 121 to 138 (at 46631a4).
Currently this is only done for some queries when statistics are being collected, but you could imagine a version of this that is "always on". We'd want to run some experiments to ensure it doesn't add significant overhead, but I think it would be workable.
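A sketch of what an "always on" variant could look like, assuming a simplified operator shape; cockroach's `pkg/util/grunning` does expose per-goroutine running time, though the wrapper type here is invented:

```go
package sketch

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/col/coldata"
	"github.com/cockroachdb/cockroach/pkg/util/grunning"
)

// cpuInstrumentedOp wraps an operator and attributes the goroutine's
// on-CPU time spent inside the wrapped Next call to this operator.
type cpuInstrumentedOp struct {
	input   interface{ Next() coldata.Batch } // simplified operator shape
	cpuTime time.Duration
}

func (o *cpuInstrumentedOp) Next() coldata.Batch {
	start := grunning.Time() // on-CPU time of this goroutine so far
	batch := o.input.Next()
	o.cpuTime += grunning.Time() - start
	return batch
}
```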
This seems like it could be handled easily enough by storing the information on whatever operator implements the behavior. I expect we'd have everything needed during operator creation time (or could easily plumb it).
I think this is already the case, but +cc @yuzefovich.
SQL batches are 1024 SQL rows by default. AFAIK KV batch sizes are up to 100,000 keys, and so may end up being larger. For a prototype the easiest thing would probably be to hook into admission control on a per-batch basis, but with a little extra work we could do it more or less often than once per batch. Since we'd be measuring CPU time anyway, we could wait to intercept until the usage reached some limit (say, 20ms) to keep the overhead small. I think some of the elastic CPU stuff already works this way, right?
Yes, all SQL operators have a relevant

I agree with Drew that we should be able to plan additional "admission" operators into the tree (with low overhead) that would be responsible for providing hooks into the admission control system. E.g. if we have a query that involves a scan followed by a sort, we'd get a tree of operators like

IIUC the main difficulty is estimating how CPU-intensive particular SQL work will be prior to admitting it. In KV land we have parameters (like the number of requests in a BatchRequest) to use for the estimate, but in SQL land we'd need to come up with a costing scheme for all of the operators, which might be the wrong way to integrate admission control. Instead, perhaps we would use an integration like this one:
and the following scheme:
Similar to how we compute SQL CPU time, we'd need to be careful around operators that call into KV, but there are only a few of those. We also might need to be careful in dealing with the admission system delay on the child

Does this make sense / sound reasonable?
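To make the "admission" operator idea concrete, here is a sketch against a simplified stand-in for `colexecop.Operator` (the real interface differs, and error handling is simplified); for a scan followed by a sort, the planner would wrap each stage with one of these pass-through operators:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/col/coldata"
)

// op is a simplified stand-in for colexecop.Operator.
type op interface {
	Next() coldata.Batch
}

// admissionOp is a pass-through operator planted between existing
// operators; it blocks for admission before its child does more work.
type admissionOp struct {
	ctx   context.Context
	child op
	adm   interface{ beforeBatch(context.Context) error } // e.g. the pacer sketched earlier
}

func (a *admissionOp) Next() coldata.Batch {
	if err := a.adm.beforeBatch(a.ctx); err != nil {
		// Real colexec operators report errors by panicking via
		// colexecerror; simplified here.
		panic(err)
	}
	return a.child.Next()
}
```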
How is this deduping done with the current
AC does not need any up front estimation of CPU work (we do ask for 100ms of tokens for certain elastic work admission, but that is not how this SQL code will integrate). I imagine something like the following (we can wrap more things in a library so it looks cleaner than below):
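The inline snippet from this comment did not survive extraction; the following is only a reconstruction of the general shape being described (ask for admission, do the CPU-bound work, report usage on completion), with an invented queue interface rather than the real `admission.WorkQueue` API:

```go
package sketch

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/grunning"
)

// slotQueue approximates the shape of an admission queue; the real
// admission.WorkQueue API differs.
type slotQueue interface {
	Admit(ctx context.Context) error        // may queue behind other work
	AdmittedWorkDone(cpuUsed time.Duration) // returns the slot, accounts CPU
}

type result struct{} // stand-in for a KV response Result

// processResults shows the intercept pattern described above:
// permission first, then the work, then a completion notification.
func processResults(
	ctx context.Context, q slotQueue, results []result, consume func(result),
) error {
	if err := q.Admit(ctx); err != nil {
		return err
	}
	start := grunning.Time()
	defer func() { q.AdmittedWorkDone(grunning.Time() - start) }()
	for _, r := range results {
		consume(r) // CPU-bound SQL processing of the KV response
	}
	return nil
}
```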
Each
Thanks. I forgot to mention another constraint. The thing that is calling
That's a good point. Maybe then we would have to handle blocking at the root of an operator tree like you said earlier. Alternatively, what if
Then, the root of the tree would call
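A sketch of how that alternative might work, with invented names: interior operators record CPU "debt" without ever blocking, and only the root settles it against the admission queue, blocking there if needed:

```go
package sketch

import (
	"context"
	"sync/atomic"
	"time"
)

// admissionDebt accumulates CPU consumed since the last settle; interior
// operators touch only the atomic counter and never block.
type admissionDebt struct {
	owedNanos atomic.Int64
}

func (d *admissionDebt) account(cpu time.Duration) {
	d.owedNanos.Add(cpu.Nanoseconds())
}

// blockingQueue is a hypothetical admission hook that may block the caller.
type blockingQueue interface {
	AdmitWithUsage(ctx context.Context, used time.Duration) error
}

// settle runs at the root of the operator tree, the only place where
// blocking for admission is safe in this scheme.
func (d *admissionDebt) settle(ctx context.Context, q blockingQueue) error {
	owed := d.owedNanos.Swap(0)
	if owed == 0 {
		return nil
	}
	return q.AdmitWithUsage(ctx, time.Duration(owed))
}
```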
I don't expect this to be expensive at the batch granularity. But as I mentioned above, one common case to consider is an operator that performs buffering, like a sort: its input will do all the work it will ever do on the first call to `Next`.
What sort of special handling would be needed for SQL-KV interactions? Would it be enough to measure the CPU time spent during a call to `Next`?
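One plausible shape for that special handling (a sketch with invented names): stop the SQL-side clock around the KV call, so work that KV admission already accounts for is not double-counted:

```go
package sketch

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/grunning"
)

// timedOp keeps a running total of SQL CPU time; sliceStart marks where
// the currently open measurement slice began.
type timedOp struct {
	cpuTime    time.Duration
	sliceStart time.Duration
}

// callKV closes the SQL measurement slice, performs the KV call (whose
// CPU is accounted for by KV-side admission), then reopens the slice.
func (o *timedOp) callKV(ctx context.Context, do func(context.Context) error) error {
	o.cpuTime += grunning.Time() - o.sliceStart
	err := do(ctx)
	o.sliceStart = grunning.Time()
	return err
}
```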
Not sure I understand. It seems there are still multiple calls to `Admit`, starting from the root down, which will all call (in sequence)
btw, we don't have to be perfect. If expensive sorts that consume 100+ms on a single
Regarding whether we need to do the
The idea is that there could be multiple calls to `Admit`
I think I'm missing something here. For the lookup-join case (for example), would it not be enough to handle the call to
I see. I think that makes sense.
Possibly. I assumed your statement that this would 'only work for "pipelined" operators' meant this option was not fully on the table.
Hmm, I didn't think of that, but that's a good point. If we are using the same goroutine to go through to KV via

I'll mock up some interfaces for SQL to use, in line with what we discussed above, and you could try using them, and then we can iteratively refine. Sound good?
Lookup joins could be a problem for the mechanism I proposed, but only because they potentially buffer input rows, similar to sorts. The problem is when there isn't a one-to-one mapping between calling
The current CPU measuring logic actually already handles this, though it misses some SQL work. We'll probably want to make it more precise going forward.
SGTM
Admission control handles admission into the SQL layer via SQLKVResponseWork and SQLSQLResponseWork. Because it cannot estimate how many CPU tokens to give such work, admission control sets up a hierarchy in which such work is not admitted while there is KVWork waiting. This can cause unfairness and other issues, since remotely submitted low-priority KVWork can prevent allocation of tokens to locally originating SQLKVResponseWork.
The current behavior is a side-effect of integrating admission control into our volcano-like iterator model. Let's start with SQLKVResponseWork. This is called in `kvstreamer.workerCoordinator.performRequestAsync`, `tableWriterBase`, `txnKVFetcher.fetch`. We don't know when the processing of the `[]Result` is complete in the tree of `colexecop.Operator`s that works with this `[]Result`. If we had a completion indicator we could convert this CPU-bound SQLKVResponseWork (and SQLSQLResponseWork) to use admission "slots", and have KV and SQL share from the same slot pool. Also, we have recently started considering cpu-nano tokens for some elastic KV-work admission, and the same problem applies in a token-based design: we need to know when the goroutine(s) processing the `[]Result` are done with that processing and can tell admission control how many tokens they used.

This is a summary of the more detailed discussion in https://cockroachlabs.slack.com/archives/C01SRKWGHG8/p1657286381660829
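To make the missing completion indicator concrete, a sketch with invented names: admitting a KV response hands back a handle that travels with the `[]Result` into the operator tree, and whoever finishes consuming it closes the handle, returning the slot and reporting CPU used. Note that `grunning.Time` measures only the current goroutine, so multi-goroutine processing of a `[]Result` would need per-goroutine accounting:

```go
package sketch

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/grunning"
)

// slotQueue approximates an admission queue; names are illustrative.
type slotQueue interface {
	Admit(ctx context.Context) error
	AdmittedWorkDone(cpuUsed time.Duration)
}

// responseHandle is the completion indicator: it travels with the
// []Result into the tree of operators.
type responseHandle struct {
	q     slotQueue
	start time.Duration // goroutine running time at admission
}

func admitKVResponse(ctx context.Context, q slotQueue) (*responseHandle, error) {
	if err := q.Admit(ctx); err != nil {
		return nil, err
	}
	return &responseHandle{q: q, start: grunning.Time()}, nil
}

// Done is called once the tree of operators has fully consumed the
// []Result; it returns the slot and reports how much CPU was used.
func (h *responseHandle) Done() {
	h.q.AdmittedWorkDone(grunning.Time() - h.start)
}
```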
@irfansharif @yuzefovich
Jira issue: CRDB-18259
Epic: CRDB-25469