Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

monitor: Initialize agent in deamon early #17407

Merged

Conversation

gandro
Copy link
Member

@gandro gandro commented Sep 15, 2021

The pkg/monitor/agent package mainly performs three tasks:

  • Managing subscribers of monitor events (such as cilium monitor
    clients or the Hubble observer)
  • Receiving and broadcasting user-space notifications ("agent events")
    to the above subscribers.
  • Polling the perf event ring buffer for datapath notifications and
    broadcasting them to subscribers.

The third task can only be performed once the perf event ring buffer map
has been set up by the daemon. However, some subsystems will try to send
user-space notifications before the events map has been set up. This
often causes an unnecessary log warning to be emitted, which regularly
confuses users (see #14146). Fundamentally however, there is no reason
to drop user-space events if the datapath is not set up yet.

Therefore, this commit splits the monitor agent constructor into two
functions: The first one, NewAgent initializes the subscriber
management and allows user-space notifications to be received early, the
second function AttachToEventsMap opens the perf event ring buffer to
poll datapath events.

By splitting up the initialization phase and invoking the constructor
earlier, we can avoid the data race in #17167 and we also avoid emitting
a unhelpful warning if user-space events are created early.

Note: Because subscribers (i.e. Hubble or cilium monitor clients) are
still attached relatively late, we will still drop early events if there
is no one listening for them. This commit does not address that - but
instead of emitting a confusing warning that the user cannot do anything
about, it will drop the event the same way Cilium up to version 1.8 did.

Fixes: #17167

@gandro gandro added area/monitor Impacts monitoring, access logging, flow logging, visibility of datapath traffic. release-note/misc This PR makes changes that have no direct user impact. kind/bug/race-detector labels Sep 15, 2021
@gandro gandro requested review from a team and joamaki September 15, 2021 13:42
@gandro
Copy link
Member Author

gandro commented Sep 15, 2021

test-me-please

Edit: ConfirmanceGKE hit known flake #17337

Copy link
Contributor

@joamaki joamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one comment.

a.Lock()
defer a.Unlock()

if a.events != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this check be before we call LoadPinnedMap?

Copy link
Member Author

@gandro gandro Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch, we should. Especially since the code doesn't close the map fd if this case is hit in the current code. Will fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the call LoadPinnedMap below the check. Unfortunately this means that we need to keep the lock while loading the map (to avoid a TOCTOW issue with a.events), but I think that should be fine.

The `pkg/monitor/agent` package mainly performs three tasks:

 - Managing subscribers of monitor events (such as `cilium monitor`
   clients or the Hubble observer)
 - Receiving and broadcasting user-space notifications ("agent events")
   to the above subscribers.
 - Polling the perf event ring buffer for datapath notifications and
   broadcasting them to subscribers.

The third task can only be performed once the perf event ring buffer map
has been set up by the daemon. However, some subsystems will try to send
user-space notifications before the events map has been set up. This
often causes an unnecessary log warning to be emitted, which regularly
confuses users (see cilium#14146). Fundamentally however, there is no reason
to drop user-space events if the datapath is not set up yet.

Therefore, this commit splits the monitor agent constructor into two
functions: The first one, `NewAgent` initializes the subscriber
management and allows user-space notifications to be received early, the
second function `AttachToEventsMap` opens the perf event ring buffer to
poll datapath events.

By splitting up the initialization phase and invoking the constructor
earlier, we can avoid the data race in cilium#17167 and we also avoid emitting
a unhelpful warning if user-space events are created early.

Note: Because subscribers (i.e. Hubble or `cilium monitor` clients) are
still attached relatively late, we will still drop early events if there
is no one listening for them. This commit does not address that - but
instead of emitting a confusing warning that the user cannot do anything
about, it will drop the event the same way Cilium up to version 1.8 did.

Fixes: cilium#17167

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/initialize-monitor-early branch from 4994e40 to 8cd9c6e Compare September 15, 2021 16:14
@gandro
Copy link
Member Author

gandro commented Sep 15, 2021

test-me-please

Job 'Cilium-PR-K8s-GKE' has 1 failure but they might be new flake since it also hit 1 known flake: #17403 (79.85)

@gandro
Copy link
Member Author

gandro commented Sep 16, 2021

Marking this ready to merge. Both GKE failures are known flakes and this PR fixes some other unit test flakes, so it should be exempt from the zero flake merge policy.

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 16, 2021
@christarazi christarazi merged commit f1463a8 into cilium:master Sep 16, 2021
@joestringer
Copy link
Member

This can mitigate issues on v1.10 where Cilium can erroneously print warnings in the log on startup like level=warning msg="Failed to send policy update as monitor notification" error="monitor agent is not set up". I raised the topic of backporting with @gandro and it seems low-risk and will help to filter out some of the noise on startup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/monitor Impacts monitoring, access logging, flow logging, visibility of datapath traffic. kind/bug/race-detector ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

race: in pkg/k8s/watchers.(*K8sWatcher).ciliumNodeInit() (daemon/cmd and pkg/monitor/agent)
6 participants