Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add backwards-compatible test matrix #4010

Merged
merged 5 commits into from
Jan 26, 2024

Conversation

bradleystachurski
Copy link
Member

@bradleystachurski bradleystachurski commented Jan 4, 2024

Building off of #3982
Progress on #3962

Adds backwards-compatibility testing for fedimintd, gatewayd, fedimint-cli, and gateway-cli defined as a Cartesian product of the binaries and versions defined in scripts/tests/cross-test-ci-all.sh (just v0.2.1 for now). This approach won't scale well, however this is useful to begin identifying breaking changes in master, which already exist.

In CI, this step takes ~30 minutes so I've decided to run on each push instead of only in the merge queue. I think it's a good future improvement to parallelize the runs and to consider running only on merge.

Backwards-compatibility matrix is logged at the end of the run.

Backwards-compatibility tests summary:
fed_version  client_version  gateway_version  exit_code
v0.2.1       v0.2.1          current          1
v0.2.1       current         v0.2.1           1
v0.2.1       current         current          1
current      v0.2.1          v0.2.1           1
current      v0.2.1          current          1
current      current         v0.2.1           1

If any combination fails, this step will fail CI.

Copy link

codecov bot commented Jan 4, 2024

Codecov Report

Attention: 213 lines in your changes are missing coverage. Please review.

Comparison is base (a099fff) 58.30% compared to head (e4e0098) 58.10%.

Files Patch % Lines
devimint/src/util.rs 0.00% 158 Missing ⚠️
devimint/src/main.rs 0.00% 44 Missing ⚠️
devimint/src/external.rs 0.00% 7 Missing ⚠️
devimint/src/federation.rs 0.00% 2 Missing ⚠️
devimint/src/lib.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4010      +/-   ##
==========================================
- Coverage   58.30%   58.10%   -0.20%     
==========================================
  Files         192      192              
  Lines       42757    42923     +166     
==========================================
+ Hits        24928    24941      +13     
- Misses      17829    17982     +153     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bradleystachurski bradleystachurski force-pushed the overridable_commands branch 29 times, most recently from 05e0dbc to 5d9df4e Compare January 5, 2024 23:21
@douglaz
Copy link
Contributor

douglaz commented Jan 16, 2024

Good stuff

@elsirion
Copy link
Contributor

I agree it's useful to test old/new gateway-cli with old/new gatewayd. It's already intentionally included in the test matrix for this PR 😄

Sorry 🙈 I went by the table and didn't check the code 😅

@bradleystachurski
Copy link
Member Author

bradleystachurski commented Jan 17, 2024

Sorry 🙈 I went by #4010 (comment) and didn't check the code 😅

No worries!

Also, for visibility to folks following this PR, I'm currently debugging some issues in CI in a separate PR (#4050) to avoid cluttering this one. The GH actions runner started running out of disk space before finishing the backwards-compatibility run.

@@ -126,6 +126,17 @@ jobs:
if: (github.event_name != 'pull_request' || matrix.build-in-pr) && matrix.run-tests && (matrix.host != 'macos')
run: nix build -L .#wasm32-unknown.ci.wasmTest --keep-failed

- name: Cleanup disk space
if: (github.event_name != 'pull_request' || matrix.build-in-pr) && matrix.run-tests
run: sudo rm -rf /usr/local/lib/android
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buildjet runner started running out of space during the backwards compatibility tests. It's not obvious to me why this just started happening after several successful CI runs previously without a rebase. In a separate branch I identified some low hanging fruit to cleanup, along with deleting the devimint-* directory, which was adding ~2GB per matrix iteration.

#4050

https://github.com/fedimint/fedimint/actions/runs/7554418498/job/20567202313?pr=4050#step:11:4336

Filesystem      Size  Used Avail Use% Mounted on
udev             14G     0   14G   0% /dev
tmpfs           2.8G  992K  2.8G   1% /run
/dev/sda1       114G  112G  1.6G  99% /
tmpfs            14G  4.0K   14G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            14G     0   14G   0% /sys/fs/cgroup
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/loop0       64M   64M     0 100% /snap/core20/1822
/dev/loop1       50M   50M     0 100% /snap/snapd/17950
/dev/loop2       92M   92M     0 100% /snap/lxd/24061
tmpfs           2.8G     0  2.8G   0% /run/user/0
/dev/loop3       64M   64M     0 100% /snap/core20/2105
Did not have permissions for all directories
6.8G       ┌── CodeQL          │███▒▒░                                    │   6%
 10G     ┌─┴ hostedtoolcache   │█████░                                    │  10%
 13G   ┌─┴ opt                 │██████                                    │  12%
 11G   │               ┌── deps│█████▓▓▓                                  │  10%
 17G   │             ┌─┴ debug │███████▓                                  │  17%
 17G   │           ┌─┴ target  │███████▓                                  │  17%
 17G   │         ┌─┴ fedimint  │███████▓                                  │  17%
 17G   │       ┌─┴ fedimint    │███████▓                                  │  17%
 17G   │     ┌─┴ _work         │███████▓                                  │  17%
 18G   │   ┌─┴ actions-runner  │████████                                  │  17%
 19G   │ ┌─┴ ubuntu            │████████                                  │  19%
 20G   ├─┴ home                │████████                                  │  19%
 31G   │ ┌── store             │█████████████                             │  30%
 31G   ├─┴ nix                 │█████████████                             │  30%
6.2G   │ ┌── lib               │███░░░░░░░░░░░                            │   6%
 12G   │ │     ┌── sdk         │█████▓▒▒▒░░░░░                            │  11%
 12G   │ │   ┌─┴ android       │█████▓▒▒▒░░░░░                            │  11%
 13G   │ │ ┌─┴ lib             │██████▒▒▒░░░░░                            │  13%
 22G   │ ├─┴ local             │█████████░░░░░                            │  21%
 36G   ├─┴ usr                 │██████████████                            │  34%
106G ┌─┴ /                     │█████████████████████████████████████████ │ 100%

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradleystachurski The problem is that cargo build etc. is very wasteful.

Nix makes it better because crane compresses ./target directory and keeps it deduplicated between runs.

Now this change introduces a whole new build of ./target, just like a developer would have locally, and that does not decompose with with anything else.

We could avoid it, by making these scripts a little smarter and use binaries built already via Nix, instead of running an extra raw cargo build.

This can be made conditional by the CI env var , that would skip all cargo builds, and for the "testing current version" use nix build to get the binaries of the current source code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh... but this won't work for running tests ... Oh dear ...

In that case maybe running the whole thing on a separate GHA workflow, that runs after the main one passes is the way to go.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case maybe running the whole thing on a separate GHA workflow, that runs after the main one passes is the way to go.

@dpc Is that something that can be addressed in a followup PR along with other optimizations, or do you think that's a blocker for merging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing this might require twisting the solution inside out, so I'm not sure if there's a point landing it...

We might want to think a bit deeper about the whole thing. :/

douglaz
douglaz previously approved these changes Jan 19, 2024
Copy link
Contributor

@douglaz douglaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps there are some optimizations that could be done, but looks good enough for now

@bradleystachurski
Copy link
Member Author

bradleystachurski commented Jan 24, 2024

The latest change moves the backwards-compatibility tests to a separate github action workflow that isn't required to pass.

  • Doesn't add any additional time to the required Build on linux workflow
  • Gives a clear pass/fail of backwards-compatibility in CI if we wait for CI to finish
  • Doesn't block merging if backwards-compatibility doesn't finish or fails

@bradleystachurski
Copy link
Member Author

Rebasing to include the fix from #4104

branches: [ "main", "master", "devel", "releases/v*" ]

# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid keeping it as a PR forever, but I'm worried that even non-required run doesn't buy us anything, will cost money to run, will still cause PRs to wait for it to finish slowing us down and will make the ✔️ to turn into ❎ in the UI.

That's why I was thinking about making it available as just ... command for now, and maybe we could make it a workflow that one can dispatch manually, and just not run it on PRs at all for now.

But I guess we can land first, see how it goes, and tweak later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will cost money to run

Yep, I agree.

Some followups will help:

  • Run tests in parallel Add backwards-compatible test matrix #4010 (comment)
  • Consider running in merge queue instead of on each push
    • Main drawback is the additional review cycles introduced by finding out about breaking changes at the end of the review process instead of throughout
  • Run backwards-compatibility on self-hosted infra

will still cause PRs to wait for it to finish slowing us down

That's fair, I think that makes running tests in a parallel the highest priority followup.

will make the ✔️ to turn into ❎ in the UI

This is good! Master has been broken for several weeks so I think it's valuable it shows up in the UI 😃

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: costs

Some back-of-the-envelope math

0.016 $/min * 30 min * 70 commits/month => ~$34 / month

Ballpark, $34 / month seems reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime cost isn't our biggest concern imo, but number of blocked runners are (and upping our runner count would be $$$). I'm ok with turning it on and seeing how it goes.

.github/workflows/ci-nix.yml Outdated Show resolved Hide resolved
nix/flakebox.nix Outdated Show resolved Hide resolved
branches: [ "main", "master", "devel", "releases/v*" ]

# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime cost isn't our biggest concern imo, but number of blocked runners are (and upping our runner count would be $$$). I'm ok with turning it on and seeing how it goes.

@elsirion elsirion added this pull request to the merge queue Jan 26, 2024
Merged via the queue into fedimint:master with commit 5ed187d Jan 26, 2024
21 of 22 checks passed
@bradleystachurski bradleystachurski deleted the overridable_commands branch January 26, 2024 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants