
[CI][C++] Nightly test-conda-cpp-valgrind is flaky due to arrow-compute-vector-test taking closer to 500 seconds #35636

Closed
raulcd opened this issue May 17, 2023 · 7 comments · Fixed by #36401

Comments

@raulcd
Member

raulcd commented May 17, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Sometimes the test-conda-cpp-valgrind nightly test fails due to a timeout on arrow-compute-vector-test. Examples:

with the failure:

```
The following tests FAILED:
	 25 - arrow-compute-vector-test (Timeout)
```

When it succeeds, the runtime is still close to 500 seconds. Example:

I am not sure whether we should reduce the tests' runtime or split the test so that we stop getting these timeouts.

Component(s)

C++, Continuous Integration

@westonpace
Member

I think we could probably be more selective about which types we test (e.g. testing one or two of the 8 integer types is often sufficient) if we really wanted to trim down the runtimes.
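
As an illustration, here is a minimal GoogleTest sketch (not Arrow's actual test code; the suite and test names are hypothetical) of instantiating a typed test suite over a trimmed list of integer types instead of all eight:

```
// Minimal sketch: a typed suite instantiated for a reduced set of integer
// types. Switching the list passed to TYPED_TEST_SUITE cuts the number of
// instantiations without touching the test bodies.
#include <cstdint>

#include <gtest/gtest.h>

template <typename T>
class IntegerVectorKernelTest : public ::testing::Test {};

// Exhaustive list: every TYPED_TEST below would run eight times.
using AllIntegerTypes =
    ::testing::Types<int8_t, int16_t, int32_t, int64_t,
                     uint8_t, uint16_t, uint32_t, uint64_t>;

// Reduced list: one signed and one unsigned representative.
using ReducedIntegerTypes = ::testing::Types<int32_t, uint64_t>;

TYPED_TEST_SUITE(IntegerVectorKernelTest, ReducedIntegerTypes);

TYPED_TEST(IntegerVectorKernelTest, RoundTrip) {
  // Placeholder body; a real test would exercise a vector kernel with
  // TypeParam values here.
  TypeParam value = static_cast<TypeParam>(42);
  EXPECT_EQ(value, static_cast<TypeParam>(42));
}
```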

@raulcd
Member Author

raulcd commented Jun 27, 2023

@westonpace @pitrou should we do something about this flaky one?

@pitrou
Member

pitrou commented Jun 27, 2023

@raulcd Someone should ideally take a look into it and try to reduce the runtime under Valgrind.

For example, it's not obvious that testing all datatypes exhaustively with random input is useful (why all integers, both signed and unsigned, for example?). Test sizes could also be reduced a bit.

@felipecrv Since you've been working quite a bit on kernels, would you like to take a look at this?
Note that you can use Archery to run the Valgrind CI job, for example archery docker run conda-cpp-valgrind.
I would recommend shelling into the Docker image, though (archery docker run conda-cpp-valgrind bash), which lets you build and test more selectively.
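
Returning to the point about reducing test sizes, here is a minimal GoogleTest sketch (not Arrow's actual test code) of shrinking random-input lengths when running under Valgrind. The kUnderValgrind switch and the length constants are hypothetical; such a switch could be wired to a build define or an environment variable set by the Valgrind CI job:

```
// Minimal sketch: pick a much smaller random-input length when the tests run
// under Valgrind, keeping coverage while cutting runtime.
#include <cstdint>

#include <gtest/gtest.h>

constexpr bool kUnderValgrind = false;  // hypothetical switch
constexpr int64_t kRegularLength = 1 << 16;
constexpr int64_t kValgrindLength = 1 << 10;

int64_t RandomInputLength() {
  return kUnderValgrind ? kValgrindLength : kRegularLength;
}

TEST(VectorSortTest, RandomInput) {
  const int64_t length = RandomInputLength();
  // A real test would generate `length` random values and exercise the sort
  // kernel here.
  ASSERT_GT(length, 0);
}
```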

@pitrou
Member

pitrou commented Jun 27, 2023

An orthogonal solution is to split arrow-compute-vector-test into two or three independent test executables (for example: 1) sort+rank+top/bottomK 2) select+filter 3) all the rest). But while that would work around the timeout problem, it wouldn't reduce the total CI runtime.

@felipecrv felipecrv self-assigned this Jun 27, 2023
@felipecrv
Contributor

> An orthogonal solution is to split arrow-compute-vector-test into two or three independent test executables (for example: 1) sort+rank+top/bottomK 2) select+filter 3) all the rest). But while that would work around the timeout problem, it wouldn't reduce the total CI runtime.

I will try to reduce the number of allocations in the tests (I hope that's possible), but I will also split these tests, as I don't feel safe picking which tests to disable. When developing kernels, the exhaustive tests were very helpful in catching unexpected UB and mistakes in the type-dispatching logic.
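
A minimal GoogleTest sketch (not the actual Arrow tests; all names here are hypothetical) of one way to cut per-test allocations: generate shared random input once per suite instead of once per test, so Valgrind tracks far fewer allocations overall.

```
// Minimal sketch: build the random input once in SetUpTestSuite and let every
// test in the suite reuse it.
#include <algorithm>
#include <cstdint>
#include <memory>
#include <random>
#include <vector>

#include <gtest/gtest.h>

class VectorKernelTest : public ::testing::Test {
 protected:
  // Runs once for the whole suite, not once per test.
  static void SetUpTestSuite() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int32_t> dist;
    shared_input_ = std::make_shared<std::vector<int32_t>>(1 << 12);
    for (auto& value : *shared_input_) value = dist(rng);
  }

  static std::shared_ptr<std::vector<int32_t>> shared_input_;
};

std::shared_ptr<std::vector<int32_t>> VectorKernelTest::shared_input_;

TEST_F(VectorKernelTest, SortsSharedInput) {
  // Copying is cheap relative to regenerating; generation happens only once.
  std::vector<int32_t> input = *shared_input_;
  std::sort(input.begin(), input.end());
  ASSERT_TRUE(std::is_sorted(input.begin(), input.end()));
}
```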

@felipecrv
Contributor

felipecrv commented Jun 29, 2023

I generate a lot of test cases for run-end encoding (and decoding), but they run really fast compared to the other kernels' tests. The number of variations is a poor predictor of test cost.

```
[----------] Global test environment tear-down
[==========] 423 tests from 1 test suite ran. (326 ms total)
[  PASSED  ] 423 tests.
debug/arrow-compute-vector-test --gtest_filter="EncodeArrayTest*"  0.44s user 0.06s system 96% cpu 0.511 total
```

@felipecrv
Contributor

felipecrv commented Jun 29, 2023

I tried looking for inefficiencies in the test code, but none of the changes I tried made a difference. So I opened a PR (#36401) that only extracts tests into separate executables.

pitrou pushed a commit that referenced this issue Jul 3, 2023
…-test (#36401)

### Rationale for this change

`arrow-compute-vector-test` is too big and takes a long time to run because of that.

### What changes are included in this PR?

Extracting two tests.

Timings on my machine (Debug builds with ASAN).

```
debug/arrow-compute-vector-test > /dev/null  11.54s user 0.47s system 99% cpu 12.023 total
debug/arrow-compute-vector-sort-test > /dev/null  13.30s user 0.26s system 99% cpu 13.579 total
debug/arrow-compute-vector-selection-test > /dev/null  6.97s user 0.22s system 99% cpu 7.207 total
```

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* Closes: #35636

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 13.0.0 milestone Jul 3, 2023
westonpace pushed a commit to westonpace/arrow that referenced this issue Jul 7, 2023
…vector-test (apache#36401)
