Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tidy up Kubernetes watcher synchronization #17145

Merged
merged 4 commits into from Oct 21, 2021

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented Aug 12, 2021

Cilium should no longer register or wait to sync k8s resources that are disabled such as LRP.

See commits for more details.

@joestringer joestringer requested a review from a team August 12, 2021 00:43
@joestringer joestringer added the release-note/misc This PR makes changes that have no direct user impact. label Aug 12, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 12, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 12, 2021
@joestringer joestringer marked this pull request as draft August 12, 2021 00:43
close(cachesSynced)
}()

go func() {
select {
case <-cachesSynced:
log.Info("All pre-existing resources related to policy have been received; continuing")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These log messages mentioning this is "related to policy" have been wrong for some time now since we wait for just about every type of resource, so I just dropped those bits from the messages.

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one small comment

pkg/k8s/watchers/watcher.go Outdated Show resolved Hide resolved
@joestringer joestringer force-pushed the submit/k8s-subsystem-init-cleanup branch from f5b24a8 to 439d7d7 Compare August 12, 2021 16:08
@joestringer joestringer marked this pull request as ready for review August 12, 2021 16:08
@joestringer joestringer force-pushed the submit/k8s-subsystem-init-cleanup branch from 439d7d7 to 14d8fa2 Compare August 12, 2021 17:08
@joestringer
Copy link
Member Author

joestringer commented Aug 13, 2021

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' hit: #17060 (85.43% similarity)

func (k *K8sWatcher) EnableK8sWatcher(ctx context.Context) error {
// EnableK8sWatcher watches for policy, services and endpoint changes on the
// Kubernetes api server defined in the receiver's daemon k8sClient.
func (k *K8sWatcher) EnableK8sWatcher(ctx context.Context, resources []string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on whether it might make sense to pass a map of handler functions instead of string slice here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It crossed my mind while reading this code, it might make it easier to read as opposed to long switch/case block, just thought we could discuss that :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I briefly thought about it, it does sound tidier. Only thing is that the function prototypes are different for the different init functions. I wasn't sure if it would make sense to force them all to conform to the same signature just for this cleanup.

Copy link
Contributor

@errordeveloper errordeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@joestringer
Copy link
Member Author

Hitting #16938.

@joestringer joestringer marked this pull request as draft September 27, 2021 23:51
@joestringer joestringer force-pushed the submit/k8s-subsystem-init-cleanup branch 4 times, most recently from 117f3ea to 5b2a4c7 Compare September 28, 2021 00:28
@joestringer
Copy link
Member Author

/test

@joestringer joestringer marked this pull request as ready for review September 28, 2021 07:13
@joestringer joestringer requested review from a team as code owners September 28, 2021 07:13
@joestringer
Copy link
Member Author

I've also triaged ci-eks failures as #16938 / cilium/cilium-cli#367 .

@joestringer joestringer requested review from jibi and removed request for errordeveloper September 28, 2021 20:49
@joestringer
Copy link
Member Author

joestringer commented Sep 28, 2021

Adding @jibi for review as janitor. @nebril , your review request is just for a one-liner variable -> function rename in the test directory so I think that's fine to bypass as long as we have maybe one extra sanity check from janitor.

EDIT: Actually, I realize that the only reason janitor is pulled in is because of the equivalent change in clustermesh-apiserver. So then the main remaining item on this PR is just to settle out how to deal with the CI failures which seem unrelated to the changes being introduced here.

Refactor this logic into a dedicated function to simplify later commits.
No functional changes.

Signed-off-by: Joe Stringer <joe@cilium.io>
Previously, the list of resources to be initialized would be prepared in
k.EnableK8sWatcher(), but the logic to handle the wait for cache sync
could select a different set of resources.

This PR tidies these up into smaller functions divided by the
functionality they provide (core CNI or feature additions), and
consistently enables these using the pkg/option configurations.

In future we can potentially even remove unused CRDs from the
cluster and Cilium will not complain that they are unavailable; this
commit doesn't try to go quite that far yet.

Signed-off-by: Joe Stringer <joe@cilium.io>
Shift the declaration of the CRD resources used by the agent & other
applications into pkg/k8s/synced, then declare a mapping from that
canonical list of resources to the corresponding groups inside
pkg/k8s/watchers. This makes both pieces of code synchronized. If for
instance CiliumEndpoint CRDs are disabled, they are globally disabled
everywhere rather than just in part of the agent code.

Signed-off-by: Joe Stringer <joe@cilium.io>
Some resources may be dynamically disabled by configuration options in
the agent / operator. It doesn't make sense to register these resources
into Kubernetes if they are disabled in the agent, so don't register
them in that case.

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer joestringer force-pushed the submit/k8s-subsystem-init-cleanup branch from 5b2a4c7 to 6c6ad94 Compare October 19, 2021 23:38
@joestringer
Copy link
Member Author

joestringer commented Oct 19, 2021

/test

EDIT: CI was broken due to VirtualBox 6.1 upgrade. Need to re-run.

@joestringer
Copy link
Member Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 20, 2021
@kkourt kkourt merged commit 5a2c816 into cilium:master Oct 21, 2021
@joestringer joestringer deleted the submit/k8s-subsystem-init-cleanup branch October 21, 2021 17:14
nbusseneau added a commit to nbusseneau/cilium that referenced this pull request Oct 22, 2021
This reverts commits b5c9b55 through
5a2c816.

Rationale: it seems this broke clustermesh functionality, as both
external workloads and multicluster testing workflows are consistently
failing since merge. The two workflows also did not pass in the original
PR, with the same errors.

- External workloads: standalone Cilium agent is unable to come up while
  trying to reach clustermesh apiserver.
- Multicluster: cross-cluster traffic not being able to reach the other
  cluster during connectivity test.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
@nbusseneau
Copy link
Member

nbusseneau commented Oct 22, 2021

Opened #17675 for reverting the changes in this PR as I believe we broke clustermesh functionality (both external workloads and multicluster consistently failing since merge).

joestringer pushed a commit that referenced this pull request Oct 22, 2021
This reverts commits b5c9b55 through
5a2c816.

Rationale: it seems this broke clustermesh functionality, as both
external workloads and multicluster testing workflows are consistently
failing since merge. The two workflows also did not pass in the original
PR, with the same errors.

- External workloads: standalone Cilium agent is unable to come up while
  trying to reach clustermesh apiserver.
- Multicluster: cross-cluster traffic not being able to reach the other
  cluster during connectivity test.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
vincentmli pushed a commit to f5devcentral/cilium that referenced this pull request Oct 25, 2021
This reverts commits b5c9b55 through
5a2c816.

Rationale: it seems this broke clustermesh functionality, as both
external workloads and multicluster testing workflows are consistently
failing since merge. The two workflows also did not pass in the original
PR, with the same errors.

- External workloads: standalone Cilium agent is unable to come up while
  trying to reach clustermesh apiserver.
- Multicluster: cross-cluster traffic not being able to reach the other
  cluster during connectivity test.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
michi-covalent pushed a commit that referenced this pull request Oct 27, 2021
This reverts commits b5c9b55 through
5a2c816.

Rationale: it seems this broke clustermesh functionality, as both
external workloads and multicluster testing workflows are consistently
failing since merge. The two workflows also did not pass in the original
PR, with the same errors.

- External workloads: standalone Cilium agent is unable to come up while
  trying to reach clustermesh apiserver.
- Multicluster: cross-cluster traffic not being able to reach the other
  cluster during connectivity test.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants