Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

endpoint: store state in ep_config.json #31559

Merged
merged 2 commits into from
Apr 5, 2024

Conversation

lmb
Copy link
Contributor

@lmb lmb commented Mar 22, 2024

endpoint: remove stat retry on restore

Commit d81a5cdea5 ("pkg/endpoint: Simplify search for C header file") 
retained earlier retry behaviour around stat, with a comment saying that we
can remove the logic if the bug hasn't manifested in a while.

That was in 2020. The time to remove the check is now.

Signed-off-by: Lorenz Bauer <lmb@isovalent.com>

endpoint: store state in ep_config.json

On startup, the agent attemps to restore existing endpoint state from disk.
It does this by reading the per Endpoint ep_config.h, parsing that file for
a line containing a magic string and then decoding a base64 encoded JSON
blob from that. ep_config.h is in turn used by the loader to compile the per
endpoint BPF programs. In effect, the concern of persisting endpoint state
is mixed up with how we compile per endpoint programs.

Instead, write the JSON state into a separate ep_config.json file in
addition to ep_config.h. The old behaviour is retained so that downgrades do
not lose endpoint state unnecessarily.

Signed-off-by: Lorenz Bauer <lmb@isovalent.com>

@lmb lmb added sig/loader Impacts the loading of BPF programs into the kernel. release-note/misc This PR makes changes that have no direct user impact. labels Mar 22, 2024
@lmb lmb force-pushed the pr/lmb/endpoint-config-json branch 2 times, most recently from 9b326f7 to f910f95 Compare March 22, 2024 14:50
@lmb
Copy link
Contributor Author

lmb commented Mar 22, 2024

/test

@lmb lmb force-pushed the pr/lmb/endpoint-config-json branch from f910f95 to f1553ac Compare March 22, 2024 17:23
@lmb
Copy link
Contributor Author

lmb commented Mar 22, 2024

/test

@lmb lmb force-pushed the pr/lmb/endpoint-config-json branch from f1553ac to 2c566c3 Compare March 26, 2024 09:17
@lmb
Copy link
Contributor Author

lmb commented Mar 26, 2024

/test

@lmb lmb marked this pull request as ready for review March 26, 2024 10:17
@lmb lmb requested review from a team as code owners March 26, 2024 10:17
@lmb lmb requested review from tommyp1ckles and squeed March 26, 2024 10:17
Copy link
Contributor

@squeed squeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some quick upgrade / downgrade test cases?

@lmb
Copy link
Contributor Author

lmb commented Mar 26, 2024

Can you add some quick upgrade / downgrade test cases?

If you can tell me how?

@squeed
Copy link
Contributor

squeed commented Mar 26, 2024

Can you add some quick upgrade / downgrade test cases?

If you can tell me how?

Maybe :-)

This will have to be a ginkgo test, since we will need to test adding and removing Cilium. The easiest will be basing it off of a test in /test/k8s/chaos.go, which should work nicely.

@lmb
Copy link
Contributor Author

lmb commented Mar 26, 2024

Ok, I'll take a look. How are the tests you would like different than the other upgrade / downgrade tests we already do? If I understand correctly we have some CI which does upgrade / downgrade from the previous stable.

@squeed
Copy link
Contributor

squeed commented Mar 27, 2024

We don't have that many upgrade tests, only ipsec and cluster-mesh. So a quick ginkgo test wouldn't be bad idea.

If this winds up being a total mess, then skip it. It would be good to have, though.

@ti-mo
Copy link
Contributor

ti-mo commented Mar 27, 2024

We don't have that many upgrade tests, only ipsec and cluster-mesh. So a quick ginkgo test wouldn't be bad idea.

If this winds up being a total mess, then skip it. It would be good to have, though.

Wait, why should we ever add more ginkgo tests at this point? The decision was made a long time ago to move off of ginkgo in favor of cilium-cli driven tests, and many efforts have been made in this direction over the past few years. If there's an issue with up/downgrades, the cilium-cli e2e tests you mentioned will catch those. I think the goal was to add more up/downgrade tests at some point, but since these are slow by definition, I think we're not adding more.

Summoning @brb for a bit more context here.

The only kind of tests that I think could make sense here are unit tests based on golden files rendered by older/newer versions of Cilium, but there's no real downgrade path that needs to be tested here. This PR only adds a double-write and opportunistically uses it. Maybe some other form of unit test could work, I dunno.

pkg/endpoint/endpoint.go Outdated Show resolved Hide resolved
pkg/common/const.go Show resolved Hide resolved
@ti-mo ti-mo self-requested a review March 28, 2024 13:48
@lmb
Copy link
Contributor Author

lmb commented Mar 28, 2024

@squeed we discussed this on our weekly call. Quick summary:

  • If the restore fails we log a warning, which will fail CI runs.
  • The real test for this behaviour is an upgrade from a previous version to main, and then a downgrade again. This is exactly what ci-ipsec-upgrade does. I'm not sure how I could add a better e2e test tbh.
  • I could add unit tests which clobber some state or similar, but that can't cover upgrade / downgrade issues. Seems like policy_test.go has some restore tests.

Please let me know what you'd like to see to get this unblocked.

@squeed
Copy link
Contributor

squeed commented Mar 28, 2024

If you’re comfortable with the test coverage, that’s OK with me. The IPsec upgrade test has a habit of being disabled, though.

I realize we try to avoid new ginkgo tests, sometimes they are hard to avoid.

@lmb lmb requested a review from tommyp1ckles April 2, 2024 10:04
pkg/endpoint/restore.go Outdated Show resolved Hide resolved
pkg/endpoint/restore.go Show resolved Hide resolved
@lmb lmb force-pushed the pr/lmb/endpoint-config-json branch from 2c566c3 to 0ac5e51 Compare April 4, 2024 09:26
@lmb
Copy link
Contributor Author

lmb commented Apr 4, 2024

/ci-e2e-upgrade

@lmb lmb force-pushed the pr/lmb/endpoint-config-json branch from 0ac5e51 to e358a8a Compare April 4, 2024 13:35
@lmb
Copy link
Contributor Author

lmb commented Apr 4, 2024

Got an almost clean e2e-upgrade run. https://github.com/cilium/cilium/actions/runs/8552399781

@lmb
Copy link
Contributor Author

lmb commented Apr 4, 2024

@tommyp1ckles can you take another look please?

@lmb
Copy link
Contributor Author

lmb commented Apr 4, 2024

/test

@lmb
Copy link
Contributor Author

lmb commented Apr 4, 2024

/ci-e2e-upgrade

lmb added 2 commits April 5, 2024 08:48
Commit d81a5cd ("pkg/endpoint: Simplify search for C header file")
retained earlier retry behaviour around stat, with a comment saying
that we can remove the logic if the bug hasn't manifested in a while.

That was in 2020. The time to remove the check is now.

Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
On startup, the agent attempts to restore existing endpoint state
from disk. It does this by reading the per Endpoint ep_config.h,
parsing that file for a line containing a magic string and then
decoding a base64 encoded JSON blob from that. ep_config.h is in
turn used by the loader to compile the per endpoint BPF programs.
In effect, the concern of persisting endpoint state is mixed up
with how we compile per endpoint programs.

Instead, write the JSON state into a separate ep_config.json file
in addition to ep_config.h. The old behaviour is retained so that
downgrades do not lose endpoint state unnecessarily.

Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
@lmb lmb force-pushed the pr/lmb/endpoint-config-json branch from e358a8a to f315d2a Compare April 5, 2024 07:49
@lmb
Copy link
Contributor Author

lmb commented Apr 5, 2024

/test

@lmb
Copy link
Contributor Author

lmb commented Apr 5, 2024

/ci-e2e-upgrade

1 similar comment
@lmb
Copy link
Contributor Author

lmb commented Apr 5, 2024

/ci-e2e-upgrade

@lmb lmb enabled auto-merge April 5, 2024 09:25
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Apr 5, 2024
@lmb lmb added this pull request to the merge queue Apr 5, 2024
Merged via the queue into cilium:main with commit 169e53c Apr 5, 2024
63 checks passed
@lmb lmb deleted the pr/lmb/endpoint-config-json branch April 5, 2024 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/loader Impacts the loading of BPF programs into the kernel.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants