Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] Sporadic asof_join failures in PyArrow #41149

Closed
pitrou opened this issue Apr 11, 2024 · 14 comments
Closed

[C++][Python] Sporadic asof_join failures in PyArrow #41149

pitrou opened this issue Apr 11, 2024 · 14 comments

Comments

@pitrou
Copy link
Member

pitrou commented Apr 11, 2024

Describe the bug, including details regarding any error messages, version, and platform.

We see sporadic CI failures in test_dataset_join_asof_empty_by. Example:
https://github.com/ursacomputing/crossbow/actions/runs/8597009571/job/23554761859#step:10:787

=================================== FAILURES ===================================
_______________________ test_dataset_join_asof_empty_by ________________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_dataset_join_asof_empty_b0')

    @pytest.mark.dataset
    def test_dataset_join_asof_empty_by(tempdir):
        t1 = pa.table({
            "on": [1, 2, 3],
        })
        ds.write_dataset(t1, tempdir / "t1", format="ipc")
        ds1 = ds.dataset(tempdir / "t1", format="ipc")
    
        t2 = pa.table({
            "colVals": ["Z", "B", "A"],
            "on": [2, 3, 4],
        })
        ds.write_dataset(t2, tempdir / "t2", format="ipc")
        ds2 = ds.dataset(tempdir / "t2", format="ipc")
    
        result = ds1.join_asof(
            ds2, on="on", by=[], tolerance=1
        )
>       assert result.to_table() == pa.table({
            "on": [1, 2, 3],
            "colVals": ["Z", "Z", "B"],
        })
E       assert pyarrow.Table...null,"Z","B"]] == pyarrow.Table...["Z","Z","B"]]
E         
E         Use -v to get more diff

usr/local/lib/python3.9/site-packages/pyarrow/tests/test_dataset.py:5154: AssertionError

The useful information is unfortunately truncated by stupid pytest, but we can still see that there is a null value somewhere in the colVals result, which is probably unexpected.

Component(s)

C++, Python

@pitrou
Copy link
Member Author

pitrou commented Apr 11, 2024

@pitrou
Copy link
Member Author

pitrou commented Apr 11, 2024

Also @raulcd

@pitrou
Copy link
Member Author

pitrou commented Apr 11, 2024

The tests are new, so this is not necessarily a regression.

@jorisvandenbossche
Copy link
Member

@pitrou you already opened an issue for this a few weeks ago, I think: #40675

(although my feeling is that it happens more often the last days than before, but maybe I am just paying more attention)

@pitrou
Copy link
Member Author

pitrou commented Apr 11, 2024

It's not the same test that's failing, though the underlying cause might be the same.

@jorisvandenbossche
Copy link
Member

Ah, I didn't actually look at the test output you posted here, sorry ;)
The ones that I have been noticing the last days was the one with the nulls in the output, so the one described in #40675

@judahrand
Copy link
Contributor

From playing around with this locally a bit I can reproduce the flake. However, if I force use_threads=False in the Python code I can't. To me this suggests the failure is threading related and on the C++ side rather than the Python side.

@pitrou
Copy link
Member Author

pitrou commented Apr 15, 2024

Yes, it's certainly on the C++ side since PyArrow is a wrapper.

@raulcd raulcd removed this from the 16.0.0 milestone Apr 16, 2024
@raulcd
Copy link
Member

raulcd commented Apr 16, 2024

I don't think this is a blocker so I've created RC0 for 16.0.0 without it.

@jorisvandenbossche
Copy link
Member

Is someone looking into this?

Those failures are quite annoying for the python CI, so maybe we could allow them (both this one and #40675) to fail for now? (although that's also an easy way to forget about it then ..)

@zanmato1984
Copy link
Collaborator

zanmato1984 commented May 9, 2024

I managed to make a C++ unit test that fails as #40675 (let's assume the cause is the same as this one, which is very likely), at a fair chance (one of tens). I'll keep looking, though I can't tell how long I'm going to take as I'm not quite familiar with asof join. If anyone else is interested as well, feel free to go ahead with the test.

TEST(AsofJoinTest, Flaky) {
  std::vector<TypeHolder> left_types = {int64(), utf8()};
  auto left_batch = ExecBatchFromJSON(
      left_types, R"([[1, "a"], [1, "b"], [5, "a"], [6, "b"], [7, "f"]])");
  std::vector<TypeHolder> right_types = {int64(), utf8(), float64()};
  auto right_batch =
      ExecBatchFromJSON(right_types, R"([[2, "a", 1.0], [9, "b", 3.0], [15, "g", 5.0]])");

  Declaration left{
      "exec_batch_source",
      ExecBatchSourceNodeOptions(schema({field("colA", int64()), field("col2", utf8())}),
                                 {std::move(left_batch)})};
  Declaration right{
      "exec_batch_source",
      ExecBatchSourceNodeOptions(schema({field("colB", int64()), field("col3", utf8()),
                                         field("colC", float64())}),
                                 {std::move(right_batch)})};
  AsofJoinNodeOptions asof_join_opts({{{"colA"}, {{"col2"}}}, {{"colB"}, {{"col3"}}}}, 1);
  Declaration asof_join{"asofjoin", {left, right}, asof_join_opts};

  ASSERT_OK_AND_ASSIGN(auto result, DeclarationToExecBatches(asof_join));

  std::vector<TypeHolder> exp_types = {int64(), utf8(), float64()};
  auto exp_batch = ExecBatchFromJSON(
      exp_types,
      R"([[1, "a", 1.0], [1, "b", null], [5, "a", null], [6, "b", null], [7, "f", null]])");
  AssertExecBatchesEqualIgnoringOrder(result.schema, {exp_batch}, result.batches);
}

@zanmato1984
Copy link
Collaborator

This is causing wrong result so I'm adding critical label.

My first label added as a collaborator, please correct me if it's not proper :)

@zanmato1984
Copy link
Collaborator

Also reproduced the sympton in this issue using C++, about the same chance as above:

TEST(AsofJoinTest, Flaky2) {
  std::vector<TypeHolder> left_types = {int64()};
  auto left_batch = ExecBatchFromJSON(left_types, R"([[1], [2], [3]])");
  std::vector<TypeHolder> right_types = {utf8(), int64()};
  auto right_batch = ExecBatchFromJSON(right_types, R"([["Z", 2], ["B", 3], ["A", 4]])");

  Declaration left{"exec_batch_source",
                   ExecBatchSourceNodeOptions(schema({field("on", int64())}),
                                              {std::move(left_batch)})};
  Declaration right{
      "exec_batch_source",
      ExecBatchSourceNodeOptions(schema({field("colVals", utf8()), field("on", int64())}),
                                 {std::move(right_batch)})};
  AsofJoinNodeOptions asof_join_opts({{{"on"}, {}}, {{"on"}, {}}}, 1);
  Declaration asof_join{"asofjoin", {left, right}, asof_join_opts};

  ASSERT_OK_AND_ASSIGN(auto result, DeclarationToExecBatches(asof_join));

  std::vector<TypeHolder> exp_types = {int64(), utf8()};
  auto exp_batch = ExecBatchFromJSON(exp_types, R"([[1, "Z"], [2, "Z"], [3, "B"]])");
  AssertExecBatchesEqualIgnoringOrder(result.schema, {exp_batch}, result.batches);
}

zanmato1984 added a commit to zanmato1984/arrow that referenced this issue May 10, 2024
pitrou pushed a commit that referenced this issue May 14, 2024
### Rationale for this change

Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in #40675 and #41149.

Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens.

### What changes are included in this PR?

Eliminate the logical race of emptiness by combining multiple call-sites of `Empty()`.

### Are these changes tested?

Include the UT to reproduce the issue.

### Are there any user-facing changes?

None.

**This PR contains a "Critical Fix".**
In #40675 and #41149 , incorrect results are produced.
* GitHub Issue: #41149 
* Also closes #40675

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 17.0.0 milestone May 14, 2024
@pitrou
Copy link
Member Author

pitrou commented May 14, 2024

Issue resolved by pull request 41614
#41614

@pitrou pitrou closed this as completed May 14, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
### Rationale for this change

Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in apache#40675 and apache#41149.

Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens.

### What changes are included in this PR?

Eliminate the logical race of emptiness by combining multiple call-sites of `Empty()`.

### Are these changes tested?

Include the UT to reproduce the issue.

### Are there any user-facing changes?

None.

**This PR contains a "Critical Fix".**
In apache#40675 and apache#41149 , incorrect results are produced.
* GitHub Issue: apache#41149 
* Also closes apache#40675

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue May 29, 2024
### Rationale for this change

Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in apache#40675 and apache#41149.

Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens.

### What changes are included in this PR?

Eliminate the logical race of emptiness by combining multiple call-sites of `Empty()`.

### Are these changes tested?

Include the UT to reproduce the issue.

### Are there any user-facing changes?

None.

**This PR contains a "Critical Fix".**
In apache#40675 and apache#41149 , incorrect results are produced.
* GitHub Issue: apache#41149 
* Also closes apache#40675

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants