[CI][C++] Nightly tests for valgrind have been failing for the last #33699

raulcd · 2023-01-16T13:20:52Z

Describe the bug, including details regarding any error messages, version, and platform.

Based on http://crossbow.voltrondata.com/
The test-conda-cpp-valgrind job has been failing on our nightlies for the last 26 days.
The latest error includes several test errors:

89% tests passed, 8 tests failed out of 70

Label Time Summary:
arrow-tests        = 1201.63 sec*proc (35 tests)
arrow_compute      = 1991.41 sec*proc (12 tests)
arrow_dataset      = 1085.71 sec*proc (10 tests)
arrow_substrait    =  45.52 sec*proc (1 test)
filesystem         =  70.61 sec*proc (3 tests)
parquet-tests      = 799.20 sec*proc (9 tests)
plasma-tests       = 294.49 sec*proc (3 tests)
unittest           = 5417.95 sec*proc (70 tests)

Total Test time (real) = 2787.52 sec

The following tests FAILED:
	 31 - arrow-compute-scalar-test (Timeout)
	 32 - arrow-compute-vector-test (Timeout)
	 37 - arrow-compute-hash-join-node-test (Timeout)
	 39 - arrow-compute-tpch-node-test (Timeout)
	 40 - arrow-compute-union-node-test (Timeout)
	 49 - arrow-dataset-scanner-test (Timeout)
	 50 - arrow-dataset-file-csv-test (Timeout)
	 65 - parquet-arrow-test (Timeout)
Errors while running CTest

Component(s)

C++, Continuous Integration

The text was updated successfully, but these errors were encountered:

westonpace · 2023-01-16T14:21:11Z

This is an odd collection of tests. I don't think I've ever seen arrow-compute-union-node-test fail and parquet-reader-test doesn't really fit in with the mix. Note that the server seems to be under extreme stress (even more than usual for CI):

2023-01-16T00:36:55.9251228Z  5/70 Test  #6: arrow-extension-type-test ................   Passed    9.28 sec

That test takes, at most, 15ms on my server. If I compare results with the last successful build it seems things are slow across the board, even on the tests that pass:

test name	last failure	last success
arrow-csv-test	50.19	18.68
arrow-compute-aggregate-test	209.02	45.40
arrow-array-test	63.27	21.42

I don't notice any considerable difference in compilation flags. The only change I can see is that the passing build includes -ggdb and the failing build does not.

westonpace · 2023-01-20T14:57:37Z

I tried looking into this a bit more today. I ran the parquet-reader-test on master, on the same commit that last passed (df4cb95) and on a really old commit (54ff2d8) and finally on the 8.0.0 release build (from May). All runs came very close to the 5 minute mark (actually the third run went over). However, I didn't notice any significant differences. The most recent build performed best.

If anything, the oddity is that the valgrind job passed. The test-conda-cpp-valgrind job has been failing regularly all the way back to September and only passed for a few days in December.

I recommend increasing the timeout similar to what we did with the TSAN build and then revisit in a few weeks. @pitrou any thoughts?

westonpace · 2023-01-20T15:00:20Z

Alternatively, we could try reducing the runtime of these tests when valgrind is enabled. parquet-arrow-test for example tries many different type variations (8 different combinations of decimal) and we could probably trim that down when valgrind is enabled.

pitrou · 2023-01-26T13:21:38Z

I think the strategy should be two-pronged:

increase the test timeout on Valgrind CI jobs specifically
analyze the runtimes of these long-running tests and see if any particular test cases can be blamed; sometimes a random or stress test can take a lot of time, in which case we can tweak the number of iterations for Valgrind jobs

…rind and shorten long tests (#33886) This partially addresses #33699. Once these tests are passing we can analyze the runtime and try and scope our valgrind coverage to something appropriate. * Closes: #33699 Authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>

…r valgrind and shorten long tests (apache#33886) This partially addresses apache#33699. Once these tests are passing we can analyze the runtime and try and scope our valgrind coverage to something appropriate. * Closes: apache#33699 Authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>

raulcd added the Type: bug label Jan 16, 2023

github-actions bot added Component: C++ Component: Continuous Integration labels Jan 16, 2023

raulcd mentioned this issue Jan 16, 2023

[R] Intermittent memory leaks in the valgrind nightly test #33094

Open

westonpace mentioned this issue Jan 26, 2023

GH-33699: [C++] Increase timeout of c++ tests when running under valgrind and shorten long tests #33886

Merged

github-actions bot assigned westonpace Jan 26, 2023

westonpace closed this as completed in #33886 Feb 2, 2023

westonpace added this to the 12.0.0 milestone Feb 2, 2023

wjones127 mentioned this issue Apr 10, 2023

GH-35008: [C++] Add printers for REETestData and PageIndexReaderParam to placate Valgrind #35011

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI][C++] Nightly tests for valgrind have been failing for the last #33699

[CI][C++] Nightly tests for valgrind have been failing for the last #33699

raulcd commented Jan 16, 2023

westonpace commented Jan 16, 2023

westonpace commented Jan 20, 2023

westonpace commented Jan 20, 2023

pitrou commented Jan 26, 2023

[CI][C++] Nightly tests for valgrind have been failing for the last #33699

[CI][C++] Nightly tests for valgrind have been failing for the last #33699

Comments

raulcd commented Jan 16, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

westonpace commented Jan 16, 2023

westonpace commented Jan 20, 2023

westonpace commented Jan 20, 2023

pitrou commented Jan 26, 2023