Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removes CEP subresource. #15632

Merged
merged 1 commit into from Jun 17, 2021
Merged

Removes CEP subresource. #15632

merged 1 commit into from Jun 17, 2021

Conversation

Weil0ng
Copy link
Contributor

@Weil0ng Weil0ng commented Apr 9, 2021

This is part 2/2 of trimmming CEP subresource to improve scalability.
Part 1/2 is PR #15230.

This will bump cilium CRD schema version and is only backward-compatible
with agent that has part 1/2.

Signed-off-by: Weilong Cui cuiwl@google.com

Fixes: #15153

@Weil0ng Weil0ng requested a review from a team as a code owner April 9, 2021 22:38
@Weil0ng Weil0ng requested review from a team, rolinh and christarazi April 9, 2021 22:38
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 9, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Apr 9, 2021
@Weil0ng Weil0ng added the release-note/misc This PR makes changes that have no direct user impact. label Apr 9, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 9, 2021
@Weil0ng Weil0ng requested a review from aanm April 9, 2021 22:39
@rolinh rolinh removed their assignment Apr 12, 2021
@aanm aanm removed their assignment Apr 12, 2021
@Weil0ng Weil0ng added this to the 1.11 milestone Apr 12, 2021
@Weil0ng
Copy link
Contributor Author

Weil0ng commented Apr 12, 2021

@aanm, as we discussed this should NOT go into 1.10, we need users to upgrade to part 1 first with 1.10, is there a way to do that except for manual monitoring?

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. In terms of signaling that this PR should only be merged after the release, we have a label for that. Applying.

@christarazi christarazi added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Apr 12, 2021
@christarazi christarazi removed their assignment Apr 12, 2021
@Weil0ng
Copy link
Contributor Author

Weil0ng commented Apr 16, 2021

CI should be fixed by PR #15731, will rebase once that is merged.

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Apr 16, 2021

test-me-please

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Apr 18, 2021

retest-net-next

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Apr 19, 2021

test-only --focus="K8sServicesTest.*Checks service across nodes Tests NodePort BPF Tests with direct routing Tests LoadBalancer" --kernel_version=net-next

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Apr 19, 2021

Reran the flaky test (failure in k8s-1.16-kernel-netnext (test-1.16-netnext)) in Focused-Test-Run, it passes, so I think all CI is "green" now, PTAL.

@aanm aanm added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Apr 28, 2021
@Weil0ng
Copy link
Contributor Author

Weil0ng commented May 11, 2021

Rebased with current master

@Weil0ng Weil0ng removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label May 11, 2021
@aanm
Copy link
Member

aanm commented May 17, 2021

test-me-please

@Weil0ng
Copy link
Contributor Author

Weil0ng commented May 21, 2021

Shall we remove dont-merge/wait-until-release since 1.10 is released now :)

@christarazi christarazi added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. and removed dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs labels May 22, 2021
@aanm
Copy link
Member

aanm commented May 27, 2021

@Weil0ng can you rebase against master so that we can run a full CI in this PR?

This is part 2/2 of trimmming CEP subresource to improve scalability.
Part 1/2 is PR cilium#15230.

This will bump cilium CRD schema version and is only backward-compatible
with agent that has part 1/2.

Signed-off-by: Weilong Cui <cuiwl@google.com>
@Weil0ng
Copy link
Contributor Author

Weil0ng commented Jun 2, 2021

test-me-please

@aanm
Copy link
Member

aanm commented Jun 11, 2021

test-1.21-4.9 (previous build was gone)

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Jun 11, 2021

It seems like some vbox/ssh failure? odd...

@aanm
Copy link
Member

aanm commented Jun 16, 2021

test-1.21-4.9

@aanm
Copy link
Member

aanm commented Jun 17, 2021

The CI hit a flake that was fixed by #16381 so merging.

@aanm aanm merged commit 0681343 into cilium:master Jun 17, 2021
1.10.0 automation moved this from In progress to Done Jun 17, 2021
nbusseneau added a commit to nbusseneau/cilium that referenced this pull request Jul 23, 2021
This reverts commit 0681343, initially
from PR cilium#15632.

Note: we revert the initial changes as-is, but bump up
`CustomResourceDefinitionSchemaVersion` to `1.23.4` since the revert
itself is a change of the CRD schema.

Rationale:

The commit introduced a regression in clustermesh connectivity with
external workloads.

Identifying the regression:

We initially worked on adding external workloads testing to `cilium`
with a new workflow running a `cilium-cli` connectivity test on a GKE
cluster / GCP VM clustermesh in PR cilium#16789, but were consistently hitting
a connectivity issue soon after the GCP VM joined the clustermesh with
the GCP VM suddenly being unable to communicate with the cluster.

Example of failing runs:
- https://github.com/cilium/cilium/actions/runs/1050302990
- https://github.com/cilium/cilium/actions/runs/1056640820

This failure was not happening on the `cilium-cli` repository with a
similar workflow, with the difference being that `cilium-cli` uses a
stable version of Cilium (1.10.3 at the moment).

To check, we cherry-picked the changes from PR cilium#16789 on top of tag
`v1.10.3` in a secondary PR cilium#16946, and verified that it worked. This
indicated the changes from PR cilium#16789 are sound and a regression in
external workloads behavior had happened on `cilium`'s `master` branch.

We bisected / cherry-picked the workflow on top of older commits in
order to find the regression, still via secondary PR cilium#16946.

We confirm the regression is due to
0681343 by using these 3 scenarios:

1. Running `cilium-cli` connectivity test right on top of
  0681343:
  - Link: https://github.com/cilium/cilium/commits/db3e9108b1c9020d9cd0549b85aae552fb0bb7ba
  - Log:
      db3e9108b DO NOT MERGE
      da0fbd714 workflows: add external workload conformance test
      783dc9a62 k8s: Fix External Workloads service access
      0681343 Removes CEP subresource.
      3a55d74 fix warning log for list IPV6 address: move IPV4 to IPv6

2. Running `cilium-cli` connectivity test on top of previous `master`
  commit:
  - Link: https://github.com/cilium/cilium/commits/746f9062dc4fe081e1a6d03921fe9c2abe58acb7
  - Log:
      746f9062d DO NOT MERGE
      6cb37b377 workflows: add external workload conformance test
      627d025e8 k8s: Fix External Workloads service access
      3a55d74 fix warning log for list IPV6 address: move IPV4 to IPv6

3. Running `cilium-cli` connectivity test on top of current `master`
  with 0681343 reverted:
- Link: https://github.com/cilium/cilium/commits/6bea7087a41c149bfa1781ab7b4a6be88764ec45
- Log:
    6bea7087a DO NOT MERGE
    d6a769efc workflows: add external workload conformance test
    55da4e5f0 Revert "Removes CEP subresource."
    189cf7f contrib: Improve release script guard rails

Note: since the regression comes from a commit anterior to the external
workloads compatibility fix for `cilium-cli` with Cilium 1.10, added in
929c28f (PR cilium#16662), backporting only
the GitHub workflow while searching for the regression was insufficient
and we also had cherry-pick the fix in scenarios 1 and 2.

Results:

1. Failing: https://github.com/cilium/cilium/actions/runs/1059788504
2. Successful: https://github.com/cilium/cilium/actions/runs/1059727108
3. Successful: https://github.com/cilium/cilium/actions/runs/1059836909

This failure is consistent, and is the same failure as the one happening
in runs of initial workflow PR cilium#16789. This confirms the regression.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
nbusseneau added a commit to nbusseneau/cilium that referenced this pull request Jul 23, 2021
In 0681343 (from PR cilium#15632), we changed
CEP CRD schema and removed the `status` subresource.

This broke clustermesh logic as it was still trying to update CEP using
the now removed `status` subresource. In particular, this resulted in
a loss of connectivity in clustermeshes with external workloads: the VM
could initially join the cluster but would immediately lose connectivity
after failing to update the CEP resource (see cilium#16984 for full context).

We change the clustermesh logic to adhere to the new CEP update CRD
schema.

Fixes: 0681343

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
brb pushed a commit that referenced this pull request Jul 28, 2021
In 0681343 (from PR #15632), we changed
CEP CRD schema and removed the `status` subresource.

This broke clustermesh logic as it was still trying to update CEP using
the now removed `status` subresource. In particular, this resulted in
a loss of connectivity in clustermeshes with external workloads: the VM
could initially join the cluster but would immediately lose connectivity
after failing to update the CEP resource (see #16984 for full context).

We change the clustermesh logic to adhere to the new CEP update CRD
schema.

Fixes: 0681343

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
krishgobinath pushed a commit to krishgobinath/cilium that referenced this pull request Oct 20, 2021
In 0681343 (from PR cilium#15632), we changed
CEP CRD schema and removed the `status` subresource.

This broke clustermesh logic as it was still trying to update CEP using
the now removed `status` subresource. In particular, this resulted in
a loss of connectivity in clustermeshes with external workloads: the VM
could initially join the cluster but would immediately lose connectivity
after failing to update the CEP resource (see cilium#16984 for full context).

We change the clustermesh logic to adhere to the new CEP update CRD
schema.

Fixes: 0681343

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/misc This PR makes changes that have no direct user impact. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Removing CEP status subresource
4 participants