Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI][C++] Nightly tests for valgrind have been failing for the last #33699

Closed
raulcd opened this issue Jan 16, 2023 · 4 comments · Fixed by #33886
Closed

[CI][C++] Nightly tests for valgrind have been failing for the last #33699

raulcd opened this issue Jan 16, 2023 · 4 comments · Fixed by #33886

Comments

@raulcd
Copy link
Member

raulcd commented Jan 16, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Based on http://crossbow.voltrondata.com/
The test-conda-cpp-valgrind job has been failing on our nightlies for the last 26 days.
The latest error includes several test errors:

89% tests passed, 8 tests failed out of 70

Label Time Summary:
arrow-tests        = 1201.63 sec*proc (35 tests)
arrow_compute      = 1991.41 sec*proc (12 tests)
arrow_dataset      = 1085.71 sec*proc (10 tests)
arrow_substrait    =  45.52 sec*proc (1 test)
filesystem         =  70.61 sec*proc (3 tests)
parquet-tests      = 799.20 sec*proc (9 tests)
plasma-tests       = 294.49 sec*proc (3 tests)
unittest           = 5417.95 sec*proc (70 tests)

Total Test time (real) = 2787.52 sec

The following tests FAILED:
	 31 - arrow-compute-scalar-test (Timeout)
	 32 - arrow-compute-vector-test (Timeout)
	 37 - arrow-compute-hash-join-node-test (Timeout)
	 39 - arrow-compute-tpch-node-test (Timeout)
	 40 - arrow-compute-union-node-test (Timeout)
	 49 - arrow-dataset-scanner-test (Timeout)
	 50 - arrow-dataset-file-csv-test (Timeout)
	 65 - parquet-arrow-test (Timeout)
Errors while running CTest

Component(s)

C++, Continuous Integration

@westonpace
Copy link
Member

This is an odd collection of tests. I don't think I've ever seen arrow-compute-union-node-test fail and parquet-reader-test doesn't really fit in with the mix. Note that the server seems to be under extreme stress (even more than usual for CI):

2023-01-16T00:36:55.9251228Z  5/70 Test  #6: arrow-extension-type-test ................   Passed    9.28 sec

That test takes, at most, 15ms on my server. If I compare results with the last successful build it seems things are slow across the board, even on the tests that pass:

test name last failure last success
arrow-csv-test 50.19 18.68
arrow-compute-aggregate-test 209.02 45.40
arrow-array-test 63.27 21.42

I don't notice any considerable difference in compilation flags. The only change I can see is that the passing build includes -ggdb and the failing build does not.

@westonpace
Copy link
Member

I tried looking into this a bit more today. I ran the parquet-reader-test on master, on the same commit that last passed (df4cb95) and on a really old commit (54ff2d8) and finally on the 8.0.0 release build (from May). All runs came very close to the 5 minute mark (actually the third run went over). However, I didn't notice any significant differences. The most recent build performed best.

If anything, the oddity is that the valgrind job passed. The test-conda-cpp-valgrind job has been failing regularly all the way back to September and only passed for a few days in December.

I recommend increasing the timeout similar to what we did with the TSAN build and then revisit in a few weeks. @pitrou any thoughts?

@westonpace
Copy link
Member

Alternatively, we could try reducing the runtime of these tests when valgrind is enabled. parquet-arrow-test for example tries many different type variations (8 different combinations of decimal) and we could probably trim that down when valgrind is enabled.

@pitrou
Copy link
Member

pitrou commented Jan 26, 2023

I think the strategy should be two-pronged:

  • increase the test timeout on Valgrind CI jobs specifically
  • analyze the runtimes of these long-running tests and see if any particular test cases can be blamed; sometimes a random or stress test can take a lot of time, in which case we can tweak the number of iterations for Valgrind jobs

westonpace added a commit that referenced this issue Feb 2, 2023
…rind and shorten long tests (#33886)

This partially addresses #33699.  Once these tests are passing we can analyze the runtime and try and scope our valgrind coverage to something appropriate.
* Closes: #33699

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 12.0.0 milestone Feb 2, 2023
sjperkins pushed a commit to sjperkins/arrow that referenced this issue Feb 10, 2023
…r valgrind and shorten long tests (apache#33886)

This partially addresses apache#33699.  Once these tests are passing we can analyze the runtime and try and scope our valgrind coverage to something appropriate.
* Closes: apache#33699

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023
…r valgrind and shorten long tests (apache#33886)

This partially addresses apache#33699.  Once these tests are passing we can analyze the runtime and try and scope our valgrind coverage to something appropriate.
* Closes: apache#33699

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023
…r valgrind and shorten long tests (apache#33886)

This partially addresses apache#33699.  Once these tests are passing we can analyze the runtime and try and scope our valgrind coverage to something appropriate.
* Closes: apache#33699

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment