[C++][Python] Sporadic asof_join failures in PyArrow #41149
Comments
Also @raulcd |
The tests are new, so this is not necessarily a regression. |
It's not the same test that's failing, though the underlying cause might be the same. |
Ah, I didn't actually look at the test output you posted here, sorry ;) |
From playing around with this locally a bit I can reproduce the flake. However, if I force |
Yes, it's certainly on the C++ side since PyArrow is a wrapper. |
I don't think this is a blocker so I've created RC0 for 16.0.0 without it. |
Is someone looking into this? Those failures are quite annoying for the python CI, so maybe we could allow them (both this one and #40675) to fail for now? (although that's also an easy way to forget about it then ..) |
I managed to write a C++ unit test that fails the same way as #40675 (let's assume the cause is the same as this one, which is very likely) with a fair probability (about one run in tens). I'll keep looking, though I can't tell how long it will take as I'm not very familiar with asof join. If anyone else is interested as well, feel free to go ahead with the test.
|
This is causing wrong results, so I'm adding the critical label. It's the first label I've added as a collaborator, so please correct me if it's not proper :) |
Also reproduced the symptom in this issue using C++, with about the same probability as above:
|
### Rationale for this change

Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in #40675 and #41149. Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens.

### What changes are included in this PR?

Eliminate the logical race of emptiness by combining multiple call-sites of `Empty()`.

### Are these changes tested?

Include the UT to reproduce the issue.

### Are there any user-facing changes?

None.

**This PR contains a "Critical Fix".**

In #40675 and #41149, incorrect results are produced.

* GitHub Issue: #41149
* Also closes #40675

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
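To make the idea of a "logical race of emptiness" concrete, here is a minimal sketch. It is not the Arrow implementation: `BatchQueue`, `Decision`, `DecideRacy`, `DecideFixed` and the `between` callback are all hypothetical names invented for illustration. The assumption is that a consumer asks a thread-safe queue whether it is empty at two separate call sites, and a producer may push in between. Each call is individually locked, so there is no data race for a sanitizer to find, yet the two answers can disagree.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>

// Hypothetical stand-in for one join input. Every method takes the lock,
// so there is no physical (data) race on rows_.
class BatchQueue {
 public:
  void Push(int row) {
    std::lock_guard<std::mutex> lock(mutex_);
    rows_.push_back(row);
  }
  bool Empty() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return rows_.empty();
  }

 private:
  mutable std::mutex mutex_;
  std::deque<int> rows_;
};

enum class Decision { kWaitForInput, kEmitNull, kEmitMatch };

// `between` simulates whatever another thread does between the two steps,
// which keeps the example deterministic without spawning real threads.

// Racy variant: emptiness is observed at two call sites.
Decision DecideRacy(const BatchQueue& queue,
                    const std::function<void()>& between) {
  const bool memoized = !queue.Empty();  // first observation
  between();                             // a producer may push here
  if (queue.Empty()) return Decision::kWaitForInput;  // second observation
  // If the queue went from empty to non-empty in between, we proceed
  // without a memoized row and wrongly report "no match".
  return memoized ? Decision::kEmitMatch : Decision::kEmitNull;
}

// Fixed variant: emptiness is observed once and reused for both decisions.
Decision DecideFixed(const BatchQueue& queue,
                     const std::function<void()>& between) {
  const bool empty = queue.Empty();  // single observation
  between();                         // a late push is picked up next round
  if (empty) return Decision::kWaitForInput;
  return Decision::kEmitMatch;
}
```

In this sketch, a push that lands between the two `Empty()` calls makes the racy variant return `kEmitNull`, which mirrors the unexpected `null` seen in the failing test, while the single-snapshot variant simply waits for the next round.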
Issue resolved by pull request 41614 |
### Describe the bug, including details regarding any error messages, version, and platform.

We see sporadic CI failures in `test_dataset_join_asof_empty_by`. Example: https://github.com/ursacomputing/crossbow/actions/runs/8597009571/job/23554761859#step:10:787

The useful information is unfortunately truncated by stupid pytest, but we can still see that there is a `null` value somewhere in the `colVals` result, which is probably unexpected.

### Component(s)

C++, Python