Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hubble: Add --hubble-monitor-events flag #24828

Merged
merged 1 commit into from Apr 22, 2023

Conversation

epk
Copy link
Contributor

@epk epk commented Apr 12, 2023

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Related #19929

Adds a --hubble-monitor-events flag that let's a user control the event types that get to the hubble subsystem.

Add `--hubble-monitor-events` flag, to control the event types that get to the hubble subsystem.

Why

Nodes with traffic heavy workloads are doing 30k+ flows per second. This causes very high CPU utilization for the cilium-agent container. We'd like to be able to only monitor drop and other events and ignore the normal flows.

Flame graph showing that most of the CPU time is spent in the hubble subsystem decoding the events:

image

cc @michi-covalent

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 12, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Apr 12, 2023
@epk epk force-pushed the epk/hubble-event-filter branch 3 times, most recently from cce7da8 to d85ad93 Compare April 12, 2023 17:25
@maintainer-s-little-helper
Copy link

Commit 3d92136363dac97bac8e4a67607185225ccf699f does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Apr 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Apr 12, 2023
@epk epk changed the title hubble: Add --hubble-monitor-events flag hubble: Add monitor event filter Apr 12, 2023
@epk epk changed the title hubble: Add monitor event filter hubble: Add --hubble-monitor-events flag Apr 12, 2023
@epk epk marked this pull request as ready for review April 12, 2023 17:47
@epk epk requested review from a team as code owners April 12, 2023 17:47
@michi-covalent michi-covalent added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Apr 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 12, 2023
@maintainer-s-little-helper
Copy link

Commit e4632624f7a096c7fbd1792e5bc7c5a81f8bbee4 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Apr 12, 2023
@maintainer-s-little-helper
Copy link

Commit e4632624f7a096c7fbd1792e5bc7c5a81f8bbee4 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Apr 12, 2023
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor usability/UX suggestion below.

pkg/hubble/monitor/filter.go Outdated Show resolved Hide resolved
@epk epk requested review from a team as code owners April 20, 2023 17:59
@epk epk requested a review from nathanjsweet April 20, 2023 17:59
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉

@joestringer
Copy link
Member

It looks like the two test failures are pointing at the same thing: The cmdref docs need to be updated: https://github.com/cilium/cilium/actions/runs/4757401849/jobs/8454194805?pr=24828#step:8:150

@epk epk force-pushed the epk/hubble-event-filter branch 2 times, most recently from 746a246 to 38484ac Compare April 20, 2023 18:46
@michi-covalent
Copy link
Contributor

/test

Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@michi-covalent
Copy link
Contributor

test-1.27-net-next failure looks like #24966. it might have gotten fixed by #24785. i'll rebase and re-run the test

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
Signed-off-by: Aditya Sharma <aditya.sharma@shopify.com>

Co-authored-by: Michi Mutsuzaki <michi@isovalent.com>
Co-authored-by: Aditya Sharma <aditya.sharma@shopify.com>
Copy link
Member

@nathanjsweet nathanjsweet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for api changes.

@michi-covalent
Copy link
Contributor

/test

@michi-covalent
Copy link
Contributor

test-runtime: hit #22373

@michi-covalent
Copy link
Contributor

/test-runtime

@michi-covalent
Copy link
Contributor

ci-multicluster: filed #25064

@michi-covalent
Copy link
Contributor

/ci-multicluster

@michi-covalent michi-covalent merged commit 2834365 into cilium:main Apr 22, 2023
56 checks passed
Copy link
Contributor

@marqc marqc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you misinterpreted the meaning of return value from the OnMonitorEvent handler. Please check my comments within a PR.

return &monitorFilter, nil
}

func (m *monitorFilter) OnMonitorEvent(ctx context.Context, event *observerTypes.MonitorEvent) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return bool flag is used as stopPropagation indicator https://github.com/cilium/cilium/blob/main/pkg/hubble/observer/local_observer.go#L119

So if this function returns true the event won't be propagated to hubble observers. If it returns false, the event will be propagated to hubble observers.

Either return value of this function needs to be reverted, or configuration flag should be changed to clearly indicate that this settings is excluding given event types from getting observed by hubble.

Currently when I use --hubble-monitor-events="drop debug capture trace policy-verdict recorder l7 agent" I expect to see everything, but actually there is nothing reaching hubble subsystem.

flags.StringSlice(option.HubbleMonitorEvents, []string{},
fmt.Sprintf(
"Cilium monitor events for Hubble to observe: [%s]. By default, Hubble observes all monitor events.",
strings.Join(monitorAPI.AllMessageTypeNames(), " "),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation is to "exclude", not to "observe"

switch payload := event.Payload.(type) {
case *observerTypes.PerfEvent:
if len(payload.Data) == 0 {
return false, errors.ErrEmptyData
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to my above comment: return false actually passes this event to the hubble subsystem. If you want to not pass empty events (which makes a lot of sense) this should return true


type testEvent struct {
event *observerTypes.MonitorEvent
allowed bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be consistent with https://github.com/cilium/cilium/blob/main/pkg/hubble/observer/local_observer.go#L119 please rename allowed to stop or stopPropagation

assert.NoError(t, err)

for _, event := range tc.events {
allowed, err := mf.OnMonitorEvent(context.Background(), event.event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename allowed to stop or stopPropagation

@wayt
Copy link

wayt commented Apr 27, 2023

Can we back-port this to 1.12 / 1.13? that would really help with managing Hubble CPU usage.

Cheers.

@epk epk deleted the epk/hubble-event-filter branch April 27, 2023 13:19
@epk
Copy link
Contributor Author

epk commented Apr 27, 2023

Opened #25167 addressing @marqc's comments.

@joestringer
Copy link
Member

joestringer commented Apr 27, 2023

Can we back-port this to 1.12 / 1.13? that would really help with managing Hubble CPU usage.

In general, only bugfixes are backported, see the backporting policy here. This is considered a new feature. This feature will be made available in an upcoming v1.14 monthly snapshot release which you're welcome to try out and provide feedback. Furthermore, as we can see from recent posts on this PR, there's still active development on it, so it would be premature to release at this stage without further testing and feedback.

@wayt
Copy link

wayt commented Apr 27, 2023

@joestringer that make sense, thank you for the details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/community-contribution This was a contribution made by a community member. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet