
DISPATCH-1739 Add GitHub Action w/ sharding and bubblewrap #809

Closed

Conversation

jiridanek
Contributor

This is a CI job triggered by pushes and pull requests sent to GitHub.

Output artifacts include parts of the workspace captured before running tests; this is
what the machines running the test shards get. At the end, each shard outputs
an XML CTest report.

Note that ctest -j2 results in test log lines being mixed up in stdout.
Refer to the output XML report, or grep for lines starting with the test
number, to get understandable logs.

  • Using Python 3.6, as this is the version that works on all platforms tested.
  • Using four shards, as adding more does not help much (a sketch of how the sharding can be wired up follows this list).
  • Using ctest -j2, mostly just to exercise bubblewrap. Bwrap is more useful on Travis.
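A minimal sketch of how this kind of sharded run can be expressed in a workflow. The job layout, names, and the `-I Start,,Stride` striding below are illustrative assumptions, not the exact contents of this PR's build.yaml; striding is just one common way to split a ctest run across shards.

```yaml
# Hypothetical sketch only; the real .github/workflows/build.yaml may differ.
name: Build
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        shard: [1, 2, 3, 4]        # four shards, each runs every 4th test
    steps:
      - uses: actions/checkout@v2
      # ... build/restore steps elided; assume the compiled tree is in ./build ...
      - name: Run this test shard
        run: |
          set -o pipefail
          cd build
          # -I Start,,Stride runs tests Start, Start+Stride, Start+2*Stride, ...
          # -T Test also writes the XML CTest report under Testing/.
          ctest -j2 -I ${{ matrix.shard }},,4 --output-on-failure -T Test 2>&1 | tee ctest-console.log
          # ctest prefixes each output line with the test number, so one test's log
          # can be untangled from the -j2 interleaving with e.g.: grep '^53:' ctest-console.log
      - uses: actions/upload-artifact@v2
        if: ${{ ! cancelled() }}
        with:
          name: Test_Results_shard_${{ matrix.shard }}
          path: build/Testing/**/Test.xml
```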

@jiridanek jiridanek self-assigned this Aug 2, 2020
@jiridanek
Contributor Author

One issue I know of: for some reason, Dispatch skips the SASL tests, even though I believe I installed everything that's needed:

53: test_ssl_sasl_client_invalid (system_tests_ssl.RouterTestSslClient) ... skipped 'Cyrus library not available. skipping test'
53: test_ssl_sasl_client_valid (system_tests_ssl.RouterTestSslClient) ... skipped 'Cyrus library not available. skipping test'

I also couldn't find a way to reuse, in this workflow, the artifacts that the Proton pipeline is building.

CC @gemmellr

@gemmellr
Member

gemmellr commented Aug 4, 2020

Note that ctest -j2 results in test log lines being mixed up in stdout.
Refer to the output XML report, or grep for lines starting with the test
number, to get understandable logs.

Or not use -j2? How much of a difference does it make? You commented later that it's mainly for exercising bwrap and is more useful on Travis.

Using four shards, as adding more does not help much.

Presumably due to all the setup costs relative to the time spent on tests, or the fact that one shard is clearly longer than the others and maybe still ends up that way?

Also, did you try fewer than 4? I see one of them has a test run around twice as long as most of the others, so unless all the long-running tests are perfectly spaced to fall in that shard, it is possible that having fewer shards would distribute the test time more evenly, so that the run isn't much slower overall (if at all; it could even be faster if the tests group better), while giving fewer separate output areas to look at. (I'm not personally a huge fan of the sharding: it significantly complicates the build config and results, and may point to some inefficient tests if it seems to be needed in the first place.)

I also couldn't find a way to reuse, in this workflow, the artifacts that the Proton pipeline is building.

I think that's fine, possibly even preferable. It's not that lengthy a build, and we will likely eventually want different versions, as e.g. the Travis tests use.

@jiridanek
Contributor Author

jiridanek commented Aug 4, 2020

Or not use -j2?

Probably. I'll leave -DUSE_BWRAP=ON and remove the -j2. On Travis it cuts the total time in half; here, most of the time savings comes from the sharding.
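For concreteness, the configure-and-test step might then look roughly like this; apart from the -DUSE_BWRAP=ON option discussed above, the commands and paths are illustrative assumptions:

```yaml
      - name: Configure, build, and test (bwrap on, serial ctest)
        run: |
          cmake -S . -B build -DUSE_BWRAP=ON   # keep the bubblewrap sandboxing
          cmake --build build -j "$(nproc)"
          cd build
          ctest --output-on-failure            # no -j2, so the logs stay readable
```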

Also, did you try fewer than 4?

Yes, there is about a one-minute improvement in total time between 3 and 4 shards. I did not try 2.

The Python tox test is the lengthiest one; it essentially reruns the other tests on py27 and then py36. I was thinking of giving it a shard of its own, but that seemed overly elaborate.

.github/workflows/build.yaml (outdated review thread, resolved)
.github/workflows/build.yaml (outdated review thread, resolved)
.github/workflows/build.yaml (outdated review thread, resolved)
@jiridanek jiridanek force-pushed the jd_2020_08_github_action_build branch from 26b6fea to 637b79a on August 4, 2020 19:08
@gemmellr
Member

gemmellr commented Aug 5, 2020

The test times look a fair bit longer in the latest version; presumably that's as much or more due to removing -j2? Did you also try doing that while leaving it with 4 shards first, to see the difference each change made? Maybe using 4 (or more :o) shards would be better if not using the -j2.

(I dislike the complexity sharding adds, but if we have it then it should at least result in much faster runs to offset that. Right now it's probably barely halving the test time, so the overall job still looks well over half the time of a basic singular run.)

@jiridanek
Contributor Author

jiridanek commented Aug 5, 2020

I am now thinking of these opportunities for speedup:

  • fix the Proton build to not build test and example programs (either there isn't an option for it, or it does not work; not sure now)
  • set the Dispatch build not to build the console, as that takes about 2 minutes, caching npm does not help much, and it is already tested on Travis and in the Console GH Action
  • run the Action with all reasonable combinations of shard count and -j, to see what the run times are
  • shard the tests smarter, based on knowledge of their past durations, with some provision not to forget running newly added tests

Having a target for the test runtime would help. I am now thinking of aiming for under 10 minutes for the ASan-instrumented run.

Using larger -j values is risky as it increases the probability of spurious failures in the tests, caused by timeouts.

Long term, it would help to have more efficient tests. A lot of what's there now is something a QE team would traditionally write: orchestrated end-to-end system tests, except that they are better integrated with the build process, and therefore somewhat easier to debug than a stand-alone test suite (possibly running on multiple machines or Docker containers) would be.

edit: I like splitting the compile and test phases into separate GHA jobs. I think the complexity of that is outweighed by the clear separation of the two phases, the knowledge of precisely what is needed to compile and which files are needed to run the tests, and the ability to fetch the workspace, replicate the environment (there is a PPA repository with Python 3.6 for Focal), and run whatever tests I want on my machine from sources compiled in Travis.
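A rough sketch of the compile/test split described above, with the built workspace handed between jobs as an artifact. Job names, the tarball, and the file selection are illustrative assumptions; the shard handling is the same idea as in the earlier sketch:

```yaml
jobs:
  compile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build
        run: |
          cmake -S . -B build
          cmake --build build -j "$(nproc)"
      - name: Pack the parts of the workspace the tests need
        run: tar -czf workspace.tar.gz build   # hypothetical file selection
      - uses: actions/upload-artifact@v2
        with:
          name: compiled_workspace
          path: workspace.tar.gz

  test:
    needs: compile
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        shard: [1, 2]
    steps:
      - uses: actions/checkout@v2              # sources are still needed by the system tests
      - uses: actions/download-artifact@v2
        with:
          name: compiled_workspace
      - name: Unpack the workspace and run this shard
        run: |
          tar -xzf workspace.tar.gz
          cd build
          ctest -j2 -I ${{ matrix.shard }},,2 --output-on-failure
```

The same artifact can also be downloaded from the run's page in the GitHub UI to rerun selected tests on a developer machine, which is the workflow described above.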

@gemmellr
Member

gemmellr commented Aug 5, 2020

fix the Proton build to not build test and example programs (either there isn't an option for it, or it does not work; not sure now)

Seems reasonable.

set the Dispatch build not to build the console, as that takes about 2 minutes, caching npm does not help much, and it is already tested on Travis and in the Console GH Action

Yep, I had just noticed the main build was caching there, and was actually thinking of suggesting that maybe the workflows should be combined. Alternatively, ensuring the build isn't duplicated would make sense if it's that slow.

shard the tests smarter, based on knowledge of their past durations, with some provision not to forget running newly added tests

I'd try to avoid the CI being overly smart here; it tends to lead to hassle or test holes later. I think looking at separating the overly long-running tests up in the build itself, or finding whether they are burning unnecessary time and eliminating [some of] it, such that there is a more uniform spread in general, might make the most sense.

Having a target for the test runtime would help. I am now thinking of aiming for under 10 minutes for the ASan-instrumented run.

Hard to say. In part it depends on how often folks are going to use it and actually look at it. Thus far I think most folks rely on their local testing and don't much care about the CI, though maybe that's because it is slower. It might help if the main developers actually chipped in here.

Using larger -j values is risky as it increases the probability of spurious failures in the tests, caused by timeouts.

Yeah, I'd avoid those personally, simply due to making the simple logs unreadable :)

edit: I like splitting the compile and test phases into separate GHA jobs. I think the complexity of that is outweighed by the clear separation of the two phases, the knowledge of precisely what is needed to compile and which files are needed to run the tests, and the ability to fetch the workspace, replicate the environment (there is a PPA repository with Python 3.6 for Focal), and run whatever tests I want on my machine from sources compiled in Travis.

Fair enough. I lean towards simpler CI that's easier to maintain and more like what people using the source do. I'll admit that working on mostly Java stuff means I have things easier, generally having little need to consider replicating environments or using compiled-elsewhere bits.

Bubblewrap (https://github.com/containers/bubblewrap) is
an unprivileged sandbox. It allows running tests in parallel,
without any port clashes with already running servers.
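For context, this is roughly the kind of isolation bwrap provides. The exact wrapper that -DUSE_BWRAP=ON generates is not shown in this thread, so the invocation and the test name below are only an illustration:

```yaml
      - name: Run one test inside an unprivileged bubblewrap sandbox (illustration)
        run: |
          # --unshare-net gives the command its own network namespace, so the ports
          # its routers listen on cannot clash with servers already running on the
          # host or started by other, parallel tests. (A real wrapper may also need
          # to bring the loopback interface up inside that namespace.)
          bwrap --dev-bind / / --unshare-net --die-with-parent \
                ctest -R system_tests_one_router --output-on-failure
```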
@jiridanek jiridanek force-pushed the jd_2020_08_github_action_build branch from 637b79a to 27102c0 on August 5, 2020 22:34
@jiridanek
Contributor Author

jiridanek commented Aug 5, 2020

Caching npm is not helpful, but caching compiles with ccache is. The Build/Compile step in this PR takes under 2 minutes when the cache is ready.

I've checked a few combinations of shard count and thread count. I certainly don't want to go higher than -j2, due to spurious failures on timeouts that happen often with more threads. Keeping the shard count low is good for ergonomics, to keep the number of PR jobs small. So I am going with two shards and two threads in the end. It runs for about 12 minutes (build plus the slower of the shards), which should be OK for now. The equivalent job on Travis takes 30 minutes.

Skipping Proton tests/examples won't help much when ccache is already deployed. Skipping the console build for the ASan GH Action is helpful; it saves almost two minutes of compile time. Partitioning the tests into shards in an intelligent manner (using knowledge of their past durations) would help a lot, but that adds complications. Maybe next time.
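A sketch of the ccache wiring being referred to; the 400 MB limit matches the commit message below, while the cache action, key, and CMake launcher variables are illustrative assumptions about how it can be hooked up:

```yaml
      - name: Cache the ccache directory between runs
        uses: actions/cache@v2
        with:
          path: ~/.ccache
          key: ccache-${{ matrix.os }}-${{ matrix.buildType }}-${{ github.sha }}
          restore-keys: ccache-${{ matrix.os }}-${{ matrix.buildType }}-
      - name: Configure with ccache as the compiler launcher
        run: |
          ccache --max-size=400M   # 400 MB limit, to avoid purges during the build
          cmake -S . -B build \
            -DCMAKE_C_COMPILER_LAUNCHER=ccache \
            -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
      - name: Build (fast once the cache is warm)
        run: cmake --build build -j "$(nproc)"
```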

This is a CI job triggered by pushes and pull requests sent to GitHub.

Output artifacts include parts of the workspace captured before running tests; this is
what the machines running the test shards get. At the end, each shard outputs
an XML CTest report.

Note that `ctest -j2` results in test log lines being mixed up in stdout.
Refer to the output XML report, or grep for lines starting with the test
number, to get understandable logs.

* Using Python 3.6 as this is the version that works on all platforms tested.
* Using two shards with -j2, which should balance speed and convenience.
* Using ccache w/ 400MB size limit, to avoid purges during the build.
@jiridanek
Contributor Author

Table of test durations for combinations of shard count and -j count. Red numbers mean that at least one shard had at least one failed test.

[attached image: table of test durations]

Member

@gemmellr gemmellr left a comment


Glad to see you found newlines (between steps), much nicer :)

uses: actions/upload-artifact@v2
if: ${{ ! cancelled() }}
with:
name: Test_Results_${{matrix.os}}_${{matrix.buildType}}_${{matrix.shard}}
Member


If there are to be different runtimeCheck values, these will clash. If there aren't, maybe drop the related matrix element.

uses: actions/upload-artifact@v2
if: failure()
with:
name: cores_${{matrix.os}}_${{matrix.buildType}}_${{matrix.shard}}
Member


Ditto.

@asfgit asfgit closed this in 16ac2c5 Aug 6, 2020
@gemmellr
Member

gemmellr commented Aug 6, 2020

I just added the runtimeCheck value to the names, consistent with the workspace filename used, and pushed the change.
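Presumably the uploaded artifact names then ended up along these lines; the runtimeCheck addition is inferred from the comment above, and the path is a hypothetical stand-in since the original step's path is not quoted in this thread:

```yaml
      - uses: actions/upload-artifact@v2
        if: ${{ ! cancelled() }}
        with:
          name: Test_Results_${{matrix.os}}_${{matrix.buildType}}_${{matrix.runtimeCheck}}_${{matrix.shard}}
          path: build/Testing/**/Test.xml   # hypothetical path
```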

@jiridanek jiridanek deleted the jd_2020_08_github_action_build branch August 6, 2020 14:13
@jiridanek
Contributor Author

The GH Action on master is done, while Travis has not even started the rat check ;) Beating Travis at speed is not much of an accomplishment, but I'm happy with the result anyway.

@gemmellr
Member

gemmellr commented Aug 6, 2020

Give it time ;)

Everyone will flock to GHA, the Apache org might not have a shared quota big enough to keep huge projects from causing queues for the rest, and then Travis/other could become more responsive. Repeat cycle.

It happens every time something becomes the shiny new CI env, hehe. The ASF Jenkins servers long used to be uselessly overloaded; now they easily beat Travis/AppVeyor most of the time, as enough bigger jobs moved away from Jenkins to those services.

Hence I like having CI jobs in all the envs. Invariably, at any given point, at least one of them is backlogged, down, or walking wounded, but the others are in a better state.
