sql: call DrainMeta and Next from a single goroutine #48785
Comments
Hi @asubiotto, please add a C-ategory label to your issue. Check out the label system docs. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
another occurrence here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests/1939915? |
@yuzefovich based on a git bisect it looks like #48668 is the first bad commit. Would you be able to look into this? |
Sure. |
This issue is quite puzzling. I've been playing around with this slightly modified logic test:
and what's interesting is that the failure seems to occur only if both tables do not have any data on the gateway node (that's the reason for I feel like the problem might be with the parallel unordered synchronizer, and that the PR to short-circuit the hash joiner just exposed it (namely, made it more likely to occur on this logic test). I think so because if the hash joiner were to blame, I'd expect the failure to occur every time, but it occurs quite rarely and only under stress, which indicates that there is some race going on. |
I stared at it for another hour or two but didn't make any progress. Another logic test that can be used to repro is
and I wasn't successful in converting it into using My theory still remains the same - that somehow parallel unordered synchronizer is to blame. @asubiotto do you mind taking a look? |
Sure. |
I just got this failure on a CI run. Is it related? |
Yeah, I think it's the same underlying problem. |
In case it helps anything, I got
|
Forcing the use of a serial synchronizer (slower but simpler logic):
still reproduces this problem so I don't think this is a parallel unordered synchronizer problem, and as soon as I comment out the short-circuiting logic in the hash joiner, this problem doesn't reproduce. Another change that makes this test pass is exhausting the left side:
Which indicates that not calling |
Hm, since we started populating batches of projecting operators upfront, we're able to short-circuit execution of operators when they receive a zero-length batch from their inputs; however, in all previous scenarios we would always call I'm thinking that #49147, which shows a data race when cleaning up disk spilling infrastructure, might be the root cause of this issue as well, but I haven't looked too closely into that issue. |
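To make the difference between the two behaviors discussed above concrete, here is a minimal Go sketch. It is a toy model, not CockroachDB code: `source`, `joinEmptyBuild`, and the `drainProbe` flag are hypothetical names invented for illustration. It only shows that short-circuiting on an empty build side leaves the probe (left) input untouched, while the "exhaust the left side" variant reads it to completion.

```go
package main

import "fmt"

// source is a hypothetical stand-in for an operator input: Next returns
// a batch of rows, and an empty batch signals that the input is exhausted.
type source struct {
	batches [][]int
	calls   int // number of times Next has been called
}

func (s *source) Next() []int {
	s.calls++
	if len(s.batches) == 0 {
		return nil
	}
	b := s.batches[0]
	s.batches = s.batches[1:]
	return b
}

// joinEmptyBuild models only the empty-build-side path of a hash join.
// It reports whether the build side was empty. If drainProbe is set, the
// probe side is read to exhaustion before returning; otherwise the probe
// side is never touched (the short-circuit).
func joinEmptyBuild(probe, build *source, drainProbe bool) bool {
	if len(build.Next()) != 0 {
		return false // non-empty build side: not modeled in this sketch
	}
	if drainProbe {
		for len(probe.Next()) != 0 {
		}
	}
	return true
}

func main() {
	mk := func() *source { return &source{batches: [][]int{{1, 2}, {3}}} }

	shortCircuited := mk()
	joinEmptyBuild(shortCircuited, &source{}, false)

	drained := mk()
	joinEmptyBuild(drained, &source{}, true)

	// Number of Next calls seen by the probe side in each variant.
	fmt.Println(shortCircuited.calls, drained.calls)
}
```

In the short-circuit variant the probe input never sees a `Next` call, so anything that depends on that input being driven to completion does not happen in this model.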
I took a look at that and fixed the race and it doesn't make a difference for the erroneous test output. |
Going to merge the single call to |
Got to the bottom of this. This is the EXPLAIN (DISTSQL) plan for the repro shared in #48785 (comment) In the repro, it looks like This row can be traced back to the inbox pointed out on the diagram. The row is correctly sent over the wire but
To properly fix this, I think we need to make the outbox call |
Nice find! |
That is a very nice find indeed. |
50922: importccl: support `unique_rowid()` as default expression for IMPORT INTO r=Anzoteh96 a=Anzoteh96

The PR #50295 supports non-targeted columns with constant expression. This PR is a follow up to that in adding support to `unique_rowid()`.

Previously, the only support given to `rowid` as a default expression is for hidden column, which is a function of timestamp, row number, and source ID (the ID of processor). To accommodate for more usage of `unique_rowid()`, this PR modifies the `unique_rowid` function by making `unique_rowid` as a function of:
1. timestamp;
2. row number;
3. source ID;
4. the total occurrences of `unique_rowid` in the table schema;
5. instances of each `unique_rowid` within each row.

In addition, this PR also modifies the visitor method #51390 by adding override methods for volatile methods like `unique_rowid`. Annotations containing the total occurrences of `unique_rowid` and `unique_rowid` instances within a row are stored inside `evalCtx`, which will be read and updated when visitor walks through the default expression at the sanitization stage, and when default expression is evaluated at each row.

Partially addresses #48253

Release note (general change): IMPORT INTO now supports `unique_rowid()` as a default expression.

51518: rowflow,colexec: make routers propagate errors to all non-closed outputs r=yuzefovich a=yuzefovich

This commit changes the way we propagate the errors in the hash router so that the error metadata is sent on all non-closed streams. Previously, we would be sending it over only the first non-closed stream which could result in the processors on the same stage as that single stream end to treat the absence of rows and errors as the input being exhausted successfully, which is wrong because the input did encounter an error.

The same thing has been happening in the vectorized flow, but in that case the problem is less severe - the issue will present itself only when we have wrapped processors (because the materializers will prevent the propagation throughout the whole flow as described below):

In the vectorized engine we use panic-catch mechanism of error propagation, and we end up with the following sequence of events:
1. an operator encounters an error on any node (e.g. `colBatchScan` encounters RWUI error on a remote node). It is not an internal vectorized error, so the operator will panic with `colexecerror.ExpectedError`.
2. the panic is caught by one of the catchers (it can be a parallel unordered synchronizer goroutine, an outbox goroutine, a materializer, a hash router)
3. that component will then decide how to propagate the error further:
3.1 if it is a parallel unordered synchronizer, then it will cancel all of its inputs and will repanic
3.2 if it is an outbox, the error is sent as metadata which will be received by an inbox which will panic with it
3.3 if it is a materializer, then it might swallow the error (this is the reason we need for the vectorized hash router to send the error to all of its inputs). The swallowing is acceptable if it is the root materializer though.
3.4 if it is a hash router, it'll cancel all of its outputs and will forward the error on each of the outputs.

Fixes: #51458.

Release note (bug fix): Previously, CockroachDB could return incorrect results on query that encountered ReadWithinUncertaintyInterval error, and this has been fixed.

52016: colexec: re-enable short-circuiting in the hash joiner r=yuzefovich a=yuzefovich

This commit re-enables short-circuiting logic in the hash joiner when the build side is empty (it was temporarily disabled because of #48785 which has been fixed).

Fixes: #49631.

Release note: None

52027: sql: skip TestQueryProgress r=yuzefovich a=yuzefovich

This test started failing more often, so we'll skip it temporarily until we figure it out.

Addresses: #51356.

Release note: None

Co-authored-by: anzoteh96 <anzot@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
The inner-join logictest fails under stress with the fakedist configuration. Repro:
Looks like sometimes rows are being only partially returned: