
Container-based interop testing #408

Merged
branlwyd merged 11 commits into main from bran/build-interop-container on Aug 19, 2022
Conversation

@branlwyd
Contributor

@branlwyd branlwyd commented Aug 18, 2022

  • We build the interop test Janus container at build time, so that
    we can embed it in the interop test binary. This allows the interop
    testcontainer library to load the container at runtime as needed
    (see the build-script sketch after this description).
  • If we published a container (or added some magic to check whether we
    want to try pulling from a private repository?), we could test with a
    published container version by default and trigger this behavior only
    with a feature/environment variable. This would eliminate the
    build-time hit. (Or, if the build-time hit is acceptable, we might even
    want to run different published versions against the code under test,
    to guard against regressions relative to previous versions of our code.)
  • I build a shim container for Daphne; this is a temporary solution to
    allow easy networking setup & task configuration. The next step is to
    contribute to Daphne a Dockerfile to build something like this
    container, then set up our repository to build (or, eventually, load)
    that container instead.
  • My next steps are the contribution to Daphne I mention above, and
    then using this so that the end-to-end test uses the real interop
    containers (extending it to exercise the Dockerfiles as well).
  • I moved the non-`/` serving path regression test to the new interop
    API end-to-end test, since it's now easier to express there. But we
    may want to move it back once the end-to-end test uses the
    containers too, to keep this test API-centric. If it's controversial,
    I'll figure out a way to move the regression test back in this PR.

Also sets `-p` on builds to avoid workspace feature coalescence enabling unexpected features in our binary builds (see #394 for more details).

Closes #394.
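As a rough illustration of the first bullet, here is a minimal sketch of a build script that builds the image and exports it into `OUT_DIR` for embedding. The image tag, Dockerfile path, and error handling are illustrative assumptions, not the PR's actual `build.rs`:

```rust
use std::{env, fs::File, process::Command};

fn main() {
    // Rebuild the image when its inputs change (path here is illustrative).
    println!("cargo:rerun-if-changed=Dockerfile.interop_aggregator");

    let out_dir = env::var("OUT_DIR").expect("OUT_DIR is set by cargo");

    // Build the interop aggregator image.
    let status = Command::new("docker")
        .args([
            "build",
            "--tag",
            "janus_interop_aggregator:build", // illustrative tag
            "--file",
            "Dockerfile.interop_aggregator",
            ".",
        ])
        .status()
        .expect("failed to run `docker build`");
    assert!(status.success(), "docker build failed");

    // Export the image as a tar into OUT_DIR so the crate can include_bytes! it.
    let tar = File::create(format!("{out_dir}/interop_aggregator.tar"))
        .expect("failed to create tar file");
    let status = Command::new("docker")
        .args(["save", "janus_interop_aggregator:build"])
        .stdout(tar)
        .status()
        .expect("failed to run `docker save`");
    assert!(status.success(), "docker save failed");
}
```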

@branlwyd branlwyd requested a review from a team as a code owner August 18, 2022 06:00
@branlwyd branlwyd force-pushed the bran/build-interop-container branch from b906df5 to 2816d44 on August 18, 2022 06:26
Contributor

@tgeoghegan tgeoghegan left a comment


I am a little concerned about the implications here for build time. I wonder if there's some config file we can drop in the repo so that we could have rust-analyzer opt out of building the container images? Or maybe there's a `Cargo.toml` section that rust-analyzer honors? Additionally, maybe invoking `cargo test` on one's workstation shouldn't have to build all this stuff. It's kinda painful having to wait on container builds when I might just quickly want to see if a unit test passes.

Comment on interop_binaries/src/testcontainer.rs, lines +14 to +15:
const INTEROP_AGGREGATOR_IMAGE_BYTES: &[u8] =
include_bytes!(concat!(env!("OUT_DIR"), "/interop_aggregator.tar"));
Contributor


How big is the interop aggregator image? If we pull it in with `include_bytes!`, does it all get loaded into memory at once, or will the `child_stdin.write_all(INTEROP_AGGREGATOR_IMAGE_BYTES)` call below gracefully read a chunk of it at a time?

Collaborator


Oh dear, the tar file from when I built it is 750 MB! It would get mapped in as part of the `.rdata` section, I think. The two biggest contributors are the final layer, with `janus_interop_aggregator` only, and the layer that installed postgres, perl, etc.

Contributor


Streaming the bytes from a `.tar` on disk and then into `child_stdin` seems like it ought to be about the same as streaming the bytes from the binary on disk, but yeah, maybe 750 MB in `.rdata` is a bit much.

Contributor Author


Tests should be self-contained: this library packages the Docker image it needs in a self-contained way. If the Docker image lived on disk, it could become separated from the binaries that need it. Therefore we build it into the binary -- embedding it is one way of packaging the library with the data it needs to operate.

Much of the image size comes from the interop aggregator binary, which is ~330 MB on my machine. If we're OK with binaries that size, why not ~750 MB? (Especially since these are test binaries that will be built & run, then thrown away.)

Contributor Author

@branlwyd branlwyd Aug 18, 2022


(that said, no objection from me if we want to try to shrink these binaries/images somehow. ~330 MB for the amount of functionality in an aggregator binary, or ~750 MB for aggregator binary + DB + base for the aggregator image, is wild to me.)
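For reference, here is a minimal sketch of piping the embedded archive into `docker load`, the pattern this thread discusses. Only the constant name and the `write_all` call come from the snippets quoted above; the rest is assumed:

```rust
use std::{
    io::Write,
    process::{Command, Stdio},
};

// Stand-in for the include_bytes!(concat!(env!("OUT_DIR"), ...)) shown above.
const INTEROP_AGGREGATOR_IMAGE_BYTES: &[u8] = &[];

fn load_embedded_image() {
    // Spawn `docker load`, which reads an image tar archive from stdin.
    let mut child = Command::new("docker")
        .arg("load")
        .stdin(Stdio::piped())
        .spawn()
        .expect("failed to spawn `docker load`");
    let mut child_stdin = child.stdin.take().expect("stdin was piped");
    // write_all streams the array through the pipe in chunks, but the array
    // itself is mapped into the binary's read-only data in its entirety.
    child_stdin
        .write_all(INTEROP_AGGREGATOR_IMAGE_BYTES)
        .expect("failed to write image to `docker load`");
    drop(child_stdin); // close the pipe so `docker load` sees EOF
    assert!(child.wait().expect("failed to wait on `docker load`").success());
}
```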

Collaborator


I checked out 10c92e7, cleaned my `target` directory, and triggered the OOM killer a few times when trying to build, either with `cargo build` or `cargo test`. For example:

error: could not compile `interop_binaries`

Caused by:
  process didn't exit successfully: `rustc --crate-name interop_binaries --edition=2021 interop_binaries/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="testcontainer"' -C metadata=fd8a365b13b98fba -C extra-filename=-fd8a365b13b98fba --out-dir /home/david/Code/ppm/janus/target/debug/deps -C incremental=/home/david/Code/ppm/janus/target/debug/incremental -L dependency=/home/david/Code/ppm/janus/target/debug/deps --extern ... -L native=/home/david/Code/ppm/janus/target/debug/build/ring-bce3495b5618ba54/out` (signal: 9, SIGKILL: kill)
kernel: Out of memory: Killed process 287248 (rustc) total-vm:9795372kB, anon-rss:3633724kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:13796kB oom_score_adj:0
kernel: Out of memory: Killed process 284952 (rustc) total-vm:25262432kB, anon-rss:9123824kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:27788kB oom_score_adj:0

Building succeeded eventually after some retries. I figure the SIGKILL in the CI logs is due to the same cause. The `interop_aggregator.tar` file is 750M, as before. I suspect rustc and LLVM are both making several copies of the byte array as it passes through the pipeline.

How about we add `RUSTFLAGS="-C debuginfo=0"` to the `build.rs` invocation, via a new argument?
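A hedged sketch of how that suggestion could look (assumed plumbing, not the PR's actual code): `build.rs` forwards RUSTFLAGS into the image build as a build arg, so the aggregator binary baked into the image is compiled without debuginfo. This assumes the Dockerfile declares a matching `ARG RUSTFLAGS` and exports it with `ENV RUSTFLAGS=$RUSTFLAGS` before running cargo; the tag is illustrative:

```rust
use std::process::Command;

fn build_image_without_debuginfo() {
    let status = Command::new("docker")
        .args([
            "build",
            // Strip debuginfo from the nested cargo build inside the image.
            "--build-arg",
            "RUSTFLAGS=-C debuginfo=0",
            "--tag",
            "janus_interop_aggregator:build", // illustrative tag
            "--file",
            "Dockerfile.interop_aggregator",
            ".",
        ])
        .status()
        .expect("failed to run `docker build`");
    assert!(status.success(), "docker build failed");
}
```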

This is now effectively default, since we always want it, and presumably
any other project that might use this library in the future would use it
for its ability to turn up a test Janus instance.

Also minimize deps a bit -- Cargo.toml had fallen fairly badly out of
date.

Note: I had to specify `-p interop_binaries` in the interop aggregator
Dockerfile since otherwise workspace feature coalescence causes the
interop_binaries package to always be built with the `testcontainer`
feature. I guess this proves that specifying the package when building
binaries works to avoid feature coalescence.
@branlwyd branlwyd force-pushed the bran/build-interop-container branch from 2816d44 to 28bea87 on August 18, 2022 19:44
@branlwyd
Contributor Author

branlwyd commented Aug 18, 2022

> I am a little concerned about the implications here for build time. I wonder if there's some config file we can drop in the repo so that we could have rust-analyzer opt out of building the container images?

There is a rust-analyzer setting for this: `rust-analyzer.cargo.buildScripts.enable` (https://rust-analyzer.github.io/manual.html). I can't imagine setting this will substantively change the output of the (IIUC) `cargo check` that rust-analyzer does to get its symbols.

I'm not sure if we can configure our repo to have rust-analyzer do this by default (I'm going to do a little more digging here), but we can set it up in our IDEs.

> Additionally, maybe invoking cargo test on one's workstation shouldn't have to build all this stuff. It's kinda painful having to wait on container builds when I might just quickly want to see if a unit test passes.

Today, running only unit tests would be e.g. `cargo test -p janus_server`; an unqualified `cargo test` will run both unit & integration tests. That said, maybe we don't want to run integration tests by default? We could hide them behind a feature. I don't object to doing this if desirable, as long as there is some way for contributors to run these tests on their workstation/pre-CI -- otherwise, if most of our behavioral testing is in integration tests, contributors will be likely to run into CI-only errors.

@branlwyd
Contributor Author

(also it's pretty surprising to me that rust-analyzer runs build scripts by default -- if we were e.g. packaging some C library, every rust-analyzer run would try to build the library!)

I think this is necessary to prevent workspace feature coalescence from
enabling various undesirable features (e.g. `test-util`) on our release
binaries. That's important since monolithic_integration_test now enables
`test-util` features by default on several of the Janus crates it
depends on.
@branlwyd
Contributor Author

branlwyd commented Aug 18, 2022

I just pushed a change to specify `-p janus_server` for our release builds, too. This is important for a couple of reasons:

  1. It lets us use features like test-util in e.g. monolithic_integration_test without having workspace feature coalescence also enable them while building binaries in totally different packages.

  2. More importantly, I believe it skips trying to build packages in the workspace that aren't included in the binary, which is even more important now that building some packages will end up building a container. (edit: this part was wrong, I did a local test & didn't see the container getting built even without `-p janus_server`)

I think our `cargo clippy` & `cargo check` runs are also running the build script; I guess that's a good idea, but we don't need an actual image to be built. I'm going to investigate having our CI invocations of clippy/check skip building the image.

Contributor

@tgeoghegan tgeoghegan left a comment


I say: let's take the change, and see where the pain is, and we can iterate and improve on this stuff later.

Collaborator

@divergentdave divergentdave left a comment


LGTM; my one remaining blocker is a solution for the OOM killer in CI.

Comment on lines +8 to +9
static ref CONTAINER_CLIENT: testcontainers::clients::Cli =
testcontainers::clients::Cli::default();
Collaborator


FYI, the `Drop` implementation of `Cli`'s internals is responsible for cleaning up Docker networks, so this will leak them. We could accept this, as they just represent some BoltDB rows and iptables rules. Alternately, we could create a new client object at the top of each test, or use a global `RwLock<Weak<Cli>>` with lazy initialization (see the sketch after this thread).

Contributor Author


I think you're correct, but I'm going to leave this cleanup for now. Container instances include an internal reference to the client that created them, so I think I'd need to do some refactoring to make the lifetimes work out.
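For reference, a minimal sketch of the lazily-initialized weak-reference idea from the comment above. This is assumed code, not the PR's, and it uses a `Mutex` rather than the `RwLock` mentioned, for brevity:

```rust
use std::sync::{Arc, Mutex, Weak};
use testcontainers::clients::Cli;

lazy_static::lazy_static! {
    // Holds a weak reference so the Cli (and its Docker networks) is dropped
    // once the last test releases its Arc, instead of leaking at process exit.
    static ref CONTAINER_CLIENT: Mutex<Weak<Cli>> = Mutex::new(Weak::new());
}

/// Returns a shared container client, creating a fresh one if none is alive.
fn container_client() -> Arc<Cli> {
    let mut guard = CONTAINER_CLIENT.lock().unwrap();
    if let Some(client) = guard.upgrade() {
        client
    } else {
        let client = Arc::new(Cli::default());
        *guard = Arc::downgrade(&client);
        client
    }
}
```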


Note that the `zstd` crate compiles the C zstd library. I think this is
OK since it is only used during build & tests, i.e. this does not
introduce a new non-Rust dependency into our release binaries. I went
with zstd as it is generally considered to have an excellent tradeoff in
compression speed & compression ratio. Testing the `zstd` CLI tools with
default configuration against the current image that is built, the image
is reduced from 751 MiB to 191 MiB, ~25% of the original size.
Anecdotally, I don't notice a difference in build script time -- which
makes sense, as the CLI tool compressed the image in a few seconds.

Also, factor some of the more complex common logic between the Janus &
Daphne image-handling code out to common libraries. Since some of this
logic lived in build scripts, I needed to create a new crate
`build_script_utils` for the build scripts to depend on.
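As an illustration of the shape the compression step might take (assumed code, not the PR's): stream `docker save` output through zstd into `OUT_DIR`, so the crate embeds a ~25%-sized compressed archive instead of the raw tar. `zstd::stream::copy_encode` with level 0 uses the crate's default compression level:

```rust
use std::{
    fs::File,
    process::{Command, Stdio},
};

fn save_compressed_image(image_tag: &str, out_path: &str) {
    // Stream the image tar from `docker save` rather than buffering it.
    let mut child = Command::new("docker")
        .args(["save", image_tag])
        .stdout(Stdio::piped())
        .spawn()
        .expect("failed to spawn `docker save`");
    let tar_stream = child.stdout.take().expect("stdout was piped");
    let out_file = File::create(out_path).expect("failed to create output file");
    // Level 0 selects the zstd crate's default compression level.
    zstd::stream::copy_encode(tar_stream, out_file, 0).expect("zstd compression failed");
    assert!(
        child.wait().expect("failed to wait on `docker save`").success(),
        "docker save failed"
    );
}
```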
@branlwyd branlwyd force-pushed the bran/build-interop-container branch from 6d9c841 to 7c373cb on August 19, 2022 00:55
@branlwyd
Contributor Author

I didn't see OOMs on my own workstation, but I'm pretty sure the CI failure is due to OOM.

I compressed the images with zstd, which is fast & has an excellent compression ratio for its speed -- at the cost of a few seconds of build time, it compresses the image to ~25% of the original size. Anecdotally, I don't really notice a difference in build time or test-run time.

This build profile is intended to produce minimally-sized binaries; this
change shrinks the compressed image from 191 MiB to 144 MiB. With this
profile, the actual binary is only 11 MiB, so I think I'm running out of
room to shrink the binary itself.

This change also makes our ordinary builds of interop binaries use the
release build profile again, which I think is what we want.
This shrinks the small-binary, zstd-compressed image from 144 MiB to
93 MiB.
@branlwyd
Contributor Author

OK, I think this is good to go, or at least small enough that it doesn't regularly crash our CI actions. 😆

I forgot to mention in the commit description that the approaches used to minimize the binary size came from https://github.com/johnthagen/min-sized-rust.
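For reference, the kind of Cargo profile min-sized-rust describes looks roughly like this. These are assumed settings shown for illustration; the PR's actual profile isn't reproduced here:

```toml
# Hypothetical size-optimized profile; each key is a standard Cargo option.
[profile.small]
inherits = "release"
opt-level = "z"   # optimize aggressively for size
lto = true        # whole-program link-time optimization
codegen-units = 1 # trade build parallelism for better optimization
strip = true      # strip symbols and debuginfo from the binary
panic = "abort"   # omit unwinding machinery
```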

@branlwyd branlwyd requested a review from divergentdave August 19, 2022 03:52
@branlwyd branlwyd merged commit acd8545 into main Aug 19, 2022
@branlwyd branlwyd deleted the bran/build-interop-container branch August 19, 2022 17:00

Development

Successfully merging this pull request may close these issues:

Revisit workspace feature coalescing