Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][CI] Sporadic test failures in AsofJoinBasicTest #36482

Closed
pitrou opened this issue Jul 5, 2023 · 17 comments · Fixed by #36499
Closed

[C++][CI] Sporadic test failures in AsofJoinBasicTest #36482

pitrou opened this issue Jul 5, 2023 · 17 comments · Fixed by #36499
Assignees
Labels
Component: C++ Component: Continuous Integration Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Milestone

Comments

@pitrou
Copy link
Member

pitrou commented Jul 5, 2023

Describe the bug, including details regarding any error messages, version, and platform.

We routinely see this kind of failures on CI:
https://github.com/apache/arrow/actions/runs/5464016379/jobs/9945469106#step:7:3365


[ RUN      ] AsofJoinNodeTest/AsofJoinBasicTest.TestBasic7Forward/2
/arrow/cpp/src/arrow/testing/gtest_util.cc:505: Failure
Failed
Unequal at absolute position 3

@@ -1, +1 @@
-11
+null
Expected:
  [
    10,
    null,
    12
  ]
Actual:
  [
    10,
    11,
    12
  ]
Google Test trace:
/arrow/cpp/src/arrow/acero/asof_join_node_test.cc:630: Right-1 type: timestamp[ms, tz=UTC]
/arrow/cpp/src/arrow/acero/asof_join_node_test.cc:628: Right-0 type: double
/arrow/cpp/src/arrow/acero/asof_join_node_test.cc:626: Left type: uint16
/arrow/cpp/src/arrow/acero/asof_join_node_test.cc:624: Key type: date32[day]
/arrow/cpp/src/arrow/acero/asof_join_node_test.cc:622: Time type: timestamp[us, tz=UTC]
/arrow/cpp/src/arrow/acero/asof_join_node_test.cc:620: Iteration: 3
/arrow/cpp/src/arrow/acero/asof_join_node_test.cc:611: Types seed: 1688560093425811692
/arrow/cpp/src/arrow/acero/asof_join_node_test.cc:998: AsofJoinBasicTest_TestBasic7_MutateByKey
/arrow/cpp/src/arrow/acero/asof_join_node_test.cc:997: AsofJoinBasicTest_TestBasic7Forward_MutateByKey
AsofJoinTest debug:
AsofjoinNode(0xffffc2670910): received batch from input 2:
time:   [
    1970-01-01 00:00:00.000000,
    1970-01-01 00:00:00.001500,
    1970-01-01 00:00:00.002500
  ]
key2:   [
    1970-01-01,
    1970-01-01,
    1970-01-01
  ]
key:   [
    1970-01-02,
    1970-01-02,
    1970-01-02
  ]
r1_v0:   [
    1970-01-01 00:00:00.100,
    1970-01-01 00:00:00.101,
    1970-01-01 00:00:00.102
  ]

AsofjoinNode(0xffffc2670910): received batch from input 1:
time:   [
    1970-01-01 00:00:00.000000,
    1970-01-01 00:00:00.001500,
    1970-01-01 00:00:00.002500
  ]
key2:   [
    1970-01-01,
    1970-01-01,
    1970-01-01
  ]
key:   [
    1970-01-02,
    1970-01-02,
    1970-01-02
  ]
r0_v0:   [
    10,
    11,
    12
  ]

AsofjoinNode(0xffffc2670910): received batch from input 0:
time:   [
    1970-01-01 00:00:00.000000
  ]
key2:   [
    1970-01-01
  ]
key:   [
    1970-01-02
  ]
l_v0:   [
    1
  ]

AsofjoinNode(0xffffb33fb440): Advancing input 1
AsofjoinNode(0xffffb33fb240): key hasher 1 got hashes [10621589963589458110, 10621589963589458110, 10621589963589458110]
AsofjoinNode(0xffffb33fb340): memo 1 store: for_time=0 row=0 time=0 key=10621589963589458110
AsofjoinNode(0xffffb33fb440): Advancing input 1 hit distant time=1500 at=0
AsofjoinNode(0xffffb33fb360): memo 1 remove: ts=0
AsofjoinNode(0xffffb33fb440): Advancing input 1 updated=0
AsofjoinNode(0xffffb33fb440): Advancing input 2
AsofjoinNode(0xffffb33fb240): key hasher 2 got hashes [10621589963589458110, 10621589963589458110, 10621589963589458110]
AsofjoinNode(0xffffb33fb340): memo 2 store: for_time=0 row=0 time=0 key=10621589963589458110
AsofjoinNode(0xffffb33fb440): Advancing input 2 hit distant time=1500 at=0
AsofjoinNode(0xffffb33fb360): memo 2 remove: ts=0
AsofjoinNode(0xffffb33fb440): Advancing input 2 updated=0
AsofjoinNode(0xffffb33fb240): key hasher 0 got hashes [10621589963589458110]
AsofjoinNode(0xffffb33fb440): Emplace: key=10621589963589458110 lhs_latest_row=0 lhs_latest_time=0
AsofjoinNode(0xffffb33fb440):   i=1 has_entry=1 time=0 row=0 accepted=1
AsofjoinNode(0xffffb33fb440):   i=2 has_entry=1 time=0 row=0 accepted=1
AsofjoinNode(0xffffb33fb6a0): produce batch 0:
time:   [
    1970-01-01 00:00:00.000000
  ]
key2:   [
    1970-01-01
  ]
key:   [
    1970-01-02
  ]
l_v0:   [
    1
  ]
r0_v0:   [
    10
  ]
r1_v0:   [
    1970-01-01 00:00:00.100
  ]

AsofjoinNode(0xffffc2670910): received batch from input 0:
time:   [
    1970-01-01 00:00:00.001000
  ]
key2:   [
    1970-01-01
  ]
key:   [
    1970-01-03
  ]
l_v0:   [
    2
  ]

AsofjoinNode(0xffffb33fb440): Advancing input 1
AsofjoinNode(0xffffb33fb340): memo 1 store: for_time=1000 row=1 time=1500 key=10621589963589458110
AsofjoinNode(0xffffb33fb440): Advancing input 1 hit distant time=2500 at=1000
AsofjoinNode(0xffffb33fb340): memo 1 store: for_time=1000 row=2 time=2500 key=10621589963589458110
AsofjoinNode(0xffffb33fb360): memo 1 remove: ts=1000
AsofjoinNode(0xffffb33fb440): Advancing input 1 updated=1
AsofjoinNode(0xffffb33fb440): Advancing input 2
AsofjoinNode(0xffffb33fb340): memo 2 store: for_time=1000 row=1 time=1500 key=10621589963589458110
AsofjoinNode(0xffffb33fb440): Advancing input 2 hit distant time=2500 at=1000
AsofjoinNode(0xffffb33fb340): memo 2 store: for_time=1000 row=2 time=2500 key=10621589963589458110
AsofjoinNode(0xffffb33fb360): memo 2 remove: ts=1000
AsofjoinNode(0xffffb33fb440): Advancing input 2 updated=1
AsofjoinNode(0xffffb33fb440): Emplace: key=10621589963589458110 lhs_latest_row=0 lhs_latest_time=1000
AsofjoinNode(0xffffb33fb440):   i=1 has_entry=1 time=1500 row=1 accepted=1
AsofjoinNode(0xffffb33fb440):   i=2 has_entry=1 time=1500 row=1 accepted=1
AsofjoinNode(0xffffb33fb6a0): produce batch 1:
time:   [
    1970-01-01 00:00:00.001000
  ]
key2:   [
    1970-01-01
  ]
key:   [
    1970-01-03
  ]
l_v0:   [
    2
  ]
r0_v0:   [
    11
  ]
r1_v0:   [
    1970-01-01 00:00:00.101
  ]

AsofjoinNode(0xffffc2670910): received batch from input 0:
time:   [
    1970-01-01 00:00:00.002000
  ]
key2:   [
    1970-01-01
  ]
key:   [
    1970-01-02
  ]
l_v0:   [
    3
  ]

AsofjoinNode(0xffffb33fb440): Advancing input 1
AsofjoinNode(0xffffb33fb360): memo 1 remove: ts=2000
AsofjoinNode(0xffffb33fb440): Advancing input 2
AsofjoinNode(0xffffb33fb360): memo 2 remove: ts=2000
AsofjoinNode(0xffffb33fb440): Emplace: key=10621589963589458110 lhs_latest_row=0 lhs_latest_time=2000
AsofjoinNode(0xffffb33fb440):   i=1 has_entry=1 time=2500 row=2 accepted=1
AsofjoinNode(0xffffb33fb440):   i=2 has_entry=1 time=2500 row=2 accepted=1
AsofjoinNode(0xffffb33fb6a0): produce batch 2:
time:   [
    1970-01-01 00:00:00.002000
  ]
key2:   [
    1970-01-01
  ]
key:   [
    1970-01-02
  ]
l_v0:   [
    3
  ]
r0_v0:   [
    12
  ]
r1_v0:   [
    1970-01-01 00:00:00.102
  ]

Comparing flattened expected table:
time: timestamp[us, tz=UTC]
key2: date32[day]
key: date32[day]
l_v0: uint16
r0_v0: double
r1_v0: timestamp[ms, tz=UTC]
----
time:
  [
    [
      1970-01-01 00:00:00.000000,
      1970-01-01 00:00:00.001000,
      1970-01-01 00:00:00.002000
    ]
  ]
key2:
  [
    [
      1970-01-01,
      1970-01-01,
      1970-01-01
    ]
  ]
key:
  [
    [
      1970-01-02,
      1970-01-03,
      1970-01-02
    ]
  ]
l_v0:
  [
    [
      1,
      2,
      3
    ]
  ]
r0_v0:
  [
    [
      10,
      null,
      12
    ]
  ]
r1_v0:
  [
    [
      1970-01-01 00:00:00.100,
      null,
      1970-01-01 00:00:00.102
    ]
  ]

with flattened result table:
time: timestamp[us, tz=UTC]
key2: date32[day]
key: date32[day]
l_v0: uint16
r0_v0: double
r1_v0: timestamp[ms, tz=UTC]
----
time:
  [
    [
      1970-01-01 00:00:00.000000
    ],
    [
      1970-01-01 00:00:00.001000
    ],
    [
      1970-01-01 00:00:00.002000
    ]
  ]
key2:
  [
    [
      1970-01-01
    ],
    [
      1970-01-01
    ],
    [
      1970-01-01
    ]
  ]
key:
  [
    [
      1970-01-02
    ],
    [
      1970-01-03
    ],
    [
      1970-01-02
    ]
  ]
l_v0:
  [
    [
      1
    ],
    [
      2
    ],
    [
      3
    ]
  ]
r0_v0:
  [
    [
      10
    ],
    [
      11
    ],
    [
      12
    ]
  ]
r1_v0:
  [
    [
      1970-01-01 00:00:00.100
    ],
    [
      1970-01-01 00:00:00.101
    ],
    [
      1970-01-01 00:00:00.102
    ]
  ]


[  FAILED  ] AsofJoinNodeTest/AsofJoinBasicTest.TestBasic7Forward/2, where GetParam() = AsofJoinBasicParams: MutateByKey (52 ms)

Component(s)

C++, Continuous Integration

@pitrou
Copy link
Member Author

pitrou commented Jul 5, 2023

@rtpsw @icexelloss This seems to point to flaky/incorrect logic in the asof-join implementation.

@rtpsw
Copy link
Contributor

rtpsw commented Jul 5, 2023

@pitrou, when did this start? Perhaps since #36092?

@pitrou
Copy link
Member Author

pitrou commented Jul 5, 2023

No idea, sorry.

@rtpsw
Copy link
Contributor

rtpsw commented Jul 5, 2023

Do we have any systematic way to locate past CI job failures?

@pitrou
Copy link
Member Author

pitrou commented Jul 5, 2023

Not easily. You could browse through the Github Actions results, but that's not handy.

I would recommend finding a way to reproduce locally on your machine, then git bisect.

@rtpsw
Copy link
Contributor

rtpsw commented Jul 5, 2023

Yeah, I'd like to repro locally, if I can, but past experience with similar cases suggests this may be hard. I'll see what the best way forward is.

@pitrou
Copy link
Member Author

pitrou commented Jul 5, 2023

Also, it seems it's always either TestBasic7Forward/1 or TestBasic7Forward/2, so perhaps you can try reasoning about the test instead.

@pitrou
Copy link
Member Author

pitrou commented Jul 5, 2023

I'm able to reproduce after about 400 iterations with:

$ /build/build-test/debug/arrow-acero-asof-join-node-test --gtest_filter="*AsofJoinBasicTest.TestBasic7Forward/2*" --gtest_repeat=-1 --gtest_break_on_failure

[...]

Repeating all tests (iteration 428) . . .

Note: Google Test filter = *AsofJoinBasicTest.TestBasic7Forward/2*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AsofJoinNodeTest/AsofJoinBasicTest
[ RUN      ] AsofJoinNodeTest/AsofJoinBasicTest.TestBasic7Forward/2
/home/antoine/arrow/dev/cpp/src/arrow/testing/gtest_util.cc:505: Failure
Failed
Unequal at absolute position 3

@@ -1, +1 @@
-1970-01-01 00:00:00.000000011
+null
Expected:
  [
    1970-01-01 00:00:00.000000010,
    null,
    1970-01-01 00:00:00.000000012
  ]
Actual:
  [
    1970-01-01 00:00:00.000000010,
    1970-01-01 00:00:00.000000011,
    1970-01-01 00:00:00.000000012
  ]
Google Test trace:
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:630: Right-1 type: uint16
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:628: Right-0 type: timestamp[ns, tz=UTC]
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:626: Left type: large_binary
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:624: Key type: uint32
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:622: Time type: date64[ms]
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:620: Iteration: 30
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:611: Types seed: 1688567126938002996
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:998: AsofJoinBasicTest_TestBasic7_MutateByKey
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:997: AsofJoinBasicTest_TestBasic7Forward_MutateByKey
Trappe pour point d'arrêt et de trace

@rtpsw
Copy link
Contributor

rtpsw commented Jul 5, 2023

I could work with that. Just to clarify, which iterations do you mean? There are internal test iterations (up to 100) and there are repetitions of gtest. Do you mean the latter?

@pitrou
Copy link
Member Author

pitrou commented Jul 5, 2023

The latter (see excerpt above).

@rtpsw
Copy link
Contributor

rtpsw commented Jul 5, 2023

Unfortunately, on my platform (virtualized Ubuntu 22.04 on Intel) I didn't get a single failure after 7000 iterations, when I gave up. @pitrou, what platform did you use when you got the failure?

@pitrou
Copy link
Member Author

pitrou commented Jul 5, 2023

Ubuntu 22.04 on AMD64, debug mode.

@rtpsw
Copy link
Contributor

rtpsw commented Jul 5, 2023

Mine was debug mode too. I tried again in parallel with stress -c 8 and it still didn't fail on my platform. I'll see if I can get another platform where the failure occurs.

@rtpsw
Copy link
Contributor

rtpsw commented Jul 6, 2023

I was able to reproduce on a different platform and then fix using the code in the PR I just created.

@rtpsw
Copy link
Contributor

rtpsw commented Jul 6, 2023

@pitrou, could you change the title of this issue to match that of the PR?

@raulcd
Copy link
Member

raulcd commented Jul 6, 2023

@pitrou, could you change the title of this issue to match that of the PR?

The title of the issue and the PR doesn't have to match. The commit message used will be the title of the PR. I think the title of the issue being Sporadic test failures in AsofJoinBasicTest is good already as it shows what the problem was but happy to change it if you really fell it must be updated :)

@rtpsw
Copy link
Contributor

rtpsw commented Jul 6, 2023

The title of the issue and the PR doesn't have to match.

Ah, in this case, never mind.

rtpsw added a commit to rtpsw/arrow that referenced this issue Jul 9, 2023
westonpace pushed a commit that referenced this issue Jul 12, 2023
…36499)

### What changes are included in this PR?

The key hasher is invalidated before the first invocation of `GetKey` (via `GetLatestKey`) after a new batch arrives. In the pre-PR code, this invalidation happens within `Advance`, which is called from `AdvanceAndMemoize` only after `GetLatestKey` is called. The change adds synchronization between the input-receiving- and processing- threads, because avoiding that would require a more complicated and brittle change, e.g., one that involves detecting in the processing thread when a new batch was added to the queue in order to invalidate the key hasher at that time.

### Are these changes tested?

Yes, by existing tests.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".**
* Closes: #36482

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 13.0.0 milestone Jul 12, 2023
raulcd pushed a commit that referenced this issue Jul 13, 2023
…36499)

### What changes are included in this PR?

The key hasher is invalidated before the first invocation of `GetKey` (via `GetLatestKey`) after a new batch arrives. In the pre-PR code, this invalidation happens within `Advance`, which is called from `AdvanceAndMemoize` only after `GetLatestKey` is called. The change adds synchronization between the input-receiving- and processing- threads, because avoiding that would require a more complicated and brittle change, e.g., one that involves detecting in the processing thread when a new batch was added to the queue in order to invalidate the key hasher at that time.

### Are these changes tested?

Yes, by existing tests.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".**
* Closes: #36482

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this issue Jul 20, 2023
…Test (apache#36499)

### What changes are included in this PR?

The key hasher is invalidated before the first invocation of `GetKey` (via `GetLatestKey`) after a new batch arrives. In the pre-PR code, this invalidation happens within `Advance`, which is called from `AdvanceAndMemoize` only after `GetLatestKey` is called. The change adds synchronization between the input-receiving- and processing- threads, because avoiding that would require a more complicated and brittle change, e.g., one that involves detecting in the processing thread when a new batch was added to the queue in order to invalidate the key hasher at that time.

### Are these changes tested?

Yes, by existing tests.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".**
* Closes: apache#36482

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…Test (apache#36499)

### What changes are included in this PR?

The key hasher is invalidated before the first invocation of `GetKey` (via `GetLatestKey`) after a new batch arrives. In the pre-PR code, this invalidation happens within `Advance`, which is called from `AdvanceAndMemoize` only after `GetLatestKey` is called. The change adds synchronization between the input-receiving- and processing- threads, because avoiding that would require a more complicated and brittle change, e.g., one that involves detecting in the processing thread when a new batch was added to the queue in order to invalidate the key hasher at that time.

### Are these changes tested?

Yes, by existing tests.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".**
* Closes: apache#36482

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@raulcd raulcd added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Component: Continuous Integration Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Projects
None yet
4 participants