Container-based interop testing#408
Conversation
b906df5 to
2816d44
Compare
* We build the interop test Janus container during build time, so that we can embed it in the interop test binary. This allows the interop test binary to load the container at runtime as needed. * I build a shim container for Daphne; this is a temporary solution to allow easy networking setup & task configuration. The next step is to contribute to Daphne a Dockerfile to build something like this container, then set up our repository to build (or, eventually, load) that container instead. * My next steps are the contribution to Daphne I mention above, and then using this so that the end-to-end test uses the real interop containers (extending it to exercise the Dockerfiles as well). * I moved the non-`/` serving path regression test to the new interop API end-to-end test, since it's now easier to express there. But we may want to move it back once the end-to-end test uses the containers too, to keep this test API-centric. If it's controversial, I'll figure out a way to move the regression test back in this PR.
tgeoghegan
left a comment
There was a problem hiding this comment.
I am a little concerned about the implications here for build time. I wonder if there's some config file we can drop in the repo so that we could have rust-analyzer opt out of building the container images? Or maybe there's a Cargo.toml section that rust-analyzer honors? Additionally, maybe invoking cargo test on one's workstation shouldn't have to build all this stuff. It's kinda painful having to wait on container builds when I might just quickly want to see if a unit test passes.
| const INTEROP_AGGREGATOR_IMAGE_BYTES: &[u8] = | ||
| include_bytes!(concat!(env!("OUT_DIR"), "/interop_aggregator.tar")); |
There was a problem hiding this comment.
How big is the interop aggregator image? If we pull it in with include_bytes!, does it all get loaded into memory at once, or will the child_stdin.write_all(INTEROP_AGGREGATOR_IMAGE_BYTES) call below gracefully read a chunk of it at a time?
There was a problem hiding this comment.
Oh dear, the tar file from when I built it is 750MB! It would get mapped in as part of the .rdata section, I think. The two biggest contributors are the final layer, with janus_interop_aggregator only, and the layer that installed postgres, perl, etc.
There was a problem hiding this comment.
Streaming the bytes from a .tar on disk and then into child_stdin seems like it ought to be about the same as streaming the bytes from the binary on disk, but yeah, maybe 750 MB in .rdata is a bit much.
There was a problem hiding this comment.
Tests should be self-contained: this library knows how to write a docker image in a self-contained way. If the Docker image was on-disk, it could become separated from the binaries that need it. Therefore, we need to build it into the binary -- embedding it is one way of packaging the library with the data that it needs to operate.
Much of the image size comes from the interop aggregator binary, which is ~330 MB on my machine. We're OK with binaries that size, why not ~750 MB? (especially since these are test binaries that will be built & run, then thrown away)
There was a problem hiding this comment.
(that said, no objection from me if we want to try to shrink these binaries/images somehow. ~330 MB for the amount of functionality in an aggregator binary, or ~750 MB for aggregator binary + DB + base for the aggregator image, is wild to me.)
There was a problem hiding this comment.
I checked out 10c92e7, cleaned my target directory, and triggered the OOM killer a few times when trying to build, either with cargo build or cargo test. For example:
error: could not compile `interop_binaries`
Caused by:
process didn't exit successfully: `rustc --crate-name interop_binaries --edition=2021 interop_binaries/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="testcontainer"' -C metadata=fd8a365b13b98fba -C extra-filename=-fd8a365b13b98fba --out-dir /home/david/Code/ppm/janus/target/debug/deps -C incremental=/home/david/Code/ppm/janus/target/debug/incremental -L dependency=/home/david/Code/ppm/janus/target/debug/deps --extern ... -L native=/home/david/Code/ppm/janus/target/debug/build/ring-bce3495b5618ba54/out` (signal: 9, SIGKILL: kill)
kernel: Out of memory: Killed process 287248 (rustc) total-vm:9795372kB, anon-rss:3633724kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:13796kB oom_score_adj:0
kernel: Out of memory: Killed process 284952 (rustc) total-vm:25262432kB, anon-rss:9123824kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:27788kB oom_score_adj:0
Building succeeded eventually after some retries. I figure the SIGKILL in the CI logs is due to the same cause. The interop_aggregator.tar file is 750M, as before. I figure rustc and LLVM are both making several copies of the byte array as it passes through the pipeline.
How about we add RUSTFLAGS="-C debuginfo=0" in the build.rs invocation, via a new argument?
This is now effectively default, since we always want it, and presumably any other project that might use this library in the future would use it for its ability to turn up a test Janus instance. Also minimize deps a bit -- Cargo.toml had fallen fairly badly out of date. Note: I had to specify `-p interop_binaries` in the interop aggregator Dockerfile since otherwise workplace feature coalescence causes the interop_binaries package to always be built with the `testcontainer` feature. I guess this proves that specifying the package when building binaries works to avoid feature coalescence.
2816d44 to
28bea87
Compare
There is a rust-analyzer setting for this: I'm not sure if we can configure our repo to have rust-analyzer do this by default (I'm going to do a little more digging here), but we can set it up in our IDEs.
Today running only unit tests would be e.g. |
|
(also it's pretty surprising to me that rust-analyzer runs build scripts by default -- if we were e.g. packaging some C library, every rust-analyzer run would try to build the library!) |
I think this is necessary to avoid workspace feature coalescence from enabling various undesirable features (e.g. `test-util`) on our release binaries. That's important since monolithic_integration_test now enables `test-util` features by default on several of the Janus crates it depends on.
|
I just pushed a change to specify
I think our |
tgeoghegan
left a comment
There was a problem hiding this comment.
I say: let's take the change, and see where the pain is, and we can iterate and improve on this stuff later.
divergentdave
left a comment
There was a problem hiding this comment.
LGTM, my one remaining blocker is a solution for OOM-killer in CI
| static ref CONTAINER_CLIENT: testcontainers::clients::Cli = | ||
| testcontainers::clients::Cli::default(); |
There was a problem hiding this comment.
FYI, the Drop implementation of Cli's internals is responsible for cleaning up Docker networks, so this will leak them. We could accept this, as they just represent some BoltDB rows and iptables rules. Alternately, we could create a new client object at the top of each test, or use a global RwLock<Weak<Cli>> with lazy initialization.
There was a problem hiding this comment.
I think you're correct, but I'm going to leave this cleanup for now. Container instances include an internal reference to the client that created them, so I think I'd need to do some refactoring to make the lifetimes work out.
| const INTEROP_AGGREGATOR_IMAGE_BYTES: &[u8] = | ||
| include_bytes!(concat!(env!("OUT_DIR"), "/interop_aggregator.tar")); |
There was a problem hiding this comment.
I checked out 10c92e7, cleaned my target directory, and triggered the OOM killer a few times when trying to build, either with cargo build or cargo test. For example:
error: could not compile `interop_binaries`
Caused by:
process didn't exit successfully: `rustc --crate-name interop_binaries --edition=2021 interop_binaries/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="testcontainer"' -C metadata=fd8a365b13b98fba -C extra-filename=-fd8a365b13b98fba --out-dir /home/david/Code/ppm/janus/target/debug/deps -C incremental=/home/david/Code/ppm/janus/target/debug/incremental -L dependency=/home/david/Code/ppm/janus/target/debug/deps --extern ... -L native=/home/david/Code/ppm/janus/target/debug/build/ring-bce3495b5618ba54/out` (signal: 9, SIGKILL: kill)
kernel: Out of memory: Killed process 287248 (rustc) total-vm:9795372kB, anon-rss:3633724kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:13796kB oom_score_adj:0
kernel: Out of memory: Killed process 284952 (rustc) total-vm:25262432kB, anon-rss:9123824kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:27788kB oom_score_adj:0
Building succeeded eventually after some retries. I figure the SIGKILL in the CI logs is due to the same cause. The interop_aggregator.tar file is 750M, as before. I figure rustc and LLVM are both making several copies of the byte array as it passes through the pipeline.
How about we add RUSTFLAGS="-C debuginfo=0" in the build.rs invocation, via a new argument?
Note that the `zstd` crate compiles the C zstd library. I think this is OK since it is only used during build & tests, i.e. this does not introduce a new non-Rust dependency into our release binaries. I went with zstd as it is generally considered to have an excellent tradeoff in compression speed & compression ratio. Testing the `zstd` CLI tools with default configuration against the current image that is built, the image is reduced from 751 MiB to 191 MiB, ~25% of the original size. Anecdotally, I don't notice a different in build script time -- which makes sense, as the CLI tool compressed the image in a few seconds. Also, factor some of the more complex common logic between the Janus & Daphne image-handling code out to common libraries. Since some of this logic lived in build scripts, I needed to create a new crate `build_script_utils` for the build scripts to depend on.
6d9c841 to
7c373cb
Compare
|
I didn't see OOMs on my own workstation, but I'm pretty sure the CI failure is due to OOM. I compressed the images with |
This build profile is intended to produce minimally-sized binaries; this change shrinks the compressed image from 191MiB to 144MiB. With this profile, the actual binary is only 11MiB, so I think I'm running out of room to shrink the binary itself. This change also makes our ordinary builds of interop binaries use the release build profile again, too, which I think is what we probably want.
This shrinks the small-binary, zstd-compressed image from 144MiB to 93MiB.
|
OK, I think this is good to go, or at least small enough that it doesn't regularly crash our CI actions. 😆 I forgot to mention in the commit description that the approaches used to minimize the binary size came from https://github.com/johnthagen/min-sized-rust. |
we can embed it in the interop test binary. This allows the interop
testcontainer library to load the container at runtime as needed.
to try pulling from a private repository?), we could test with a
published container version by default and trigger this behavior only
with a feature/environment variable. This would eliminate the
buildtime hit. (Or, if the buildtime hit is acceptable, we might even
want to run different versions against the code under test as a test
against regressions against previous versions of our code.)
allow easy networking setup & task configuration. The next step is to
contribute to Daphne a Dockerfile to build something like this
container, then set up our repository to build (or, eventually, load)
that container instead.
then using this so that the end-to-end test uses the real interop
containers (extending it to exercise the Dockerfiles as well).
/serving path regression test to the new interopAPI end-to-end test, since it's now easier to express there. But we
may want to move it back once the end-to-end test uses the
containers too, to keep this test API-centric. If it's controversial,
I'll figure out a way to move the regression test back in this PR.
Also sets
-pon builds to avoid workspace feature coalescence enabling unexpected features in our binary builds (see #394 for more details).Closes #394.