hubble: Add GetNamespaces to observer API #25563

Merged
merged 1 commit into from Jun 13, 2023

Conversation

chancez
Contributor

@chancez chancez commented May 19, 2023

Fixes: #25266

hubble: Add GetNamespaces to observer API

I also have a branch adding support to Hubble CLI ready once this is merged. https://github.com/cilium/hubble/tree/pr/chancez/get_namespaces

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 19, 2023
@chancez chancez force-pushed the pr/chancez/observer_get_namespaces branch 4 times, most recently from afb8e21 to e364994 Compare May 19, 2023 21:21
@chancez chancez marked this pull request as ready for review May 19, 2023 21:42
@chancez chancez requested review from a team as code owners May 19, 2023 21:42
@chancez chancez requested review from pippolo84 and kaworu May 19, 2023 21:42
@chancez chancez force-pushed the pr/chancez/observer_get_namespaces branch from e364994 to 209ecfe Compare May 19, 2023 21:50
@chancez chancez requested a review from a team as a code owner May 19, 2023 21:50
@chancez chancez requested a review from qmonnet May 19, 2023 21:50
Contributor Author

chancez commented May 19, 2023

/test

Member

@kaworu kaworu left a comment

Thanks @chancez! Overall LGTM. I would like more testing around the namespaceManager, and I would like the Relay part reworked, as I think we can clean it up.

@@ -28,6 +28,9 @@ service Observer {
// GetNodes returns information about nodes in a cluster.
rpc GetNodes(GetNodesRequest) returns (GetNodesResponse) {}

// GetNamespaces returns information about namespaces in a cluster.
Member

I think the "memory period" of 1h and the order should be documented here as well.

Member

thanks for the update, still missing the bits about ordering.

Contributor Author

Do we want to document ordering, or should we leave that unspecified in case we decide we don't want to sort anymore? Glad to do it either way.

Member

I think users will rely on the order regardless of whether it is documented or not, so I would say let's document it. It's nice to output a consistent order.
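
To make the ordering point concrete, here is a minimal Go sketch of returning namespaces in a stable order. The `Namespace` type, its field names, and the choice of sorting by cluster and then by namespace name are assumptions for illustration only; the thread does not state which sort key the PR uses.

```go
package main

import (
	"fmt"
	"sort"
)

// Namespace is a hypothetical stand-in for an entry returned by
// GetNamespaces; the field names are assumptions, not the PR's types.
type Namespace struct {
	Cluster   string
	Namespace string
}

// sortNamespaces orders entries by cluster, then by namespace name, so
// that the same set of namespaces is always returned in the same sequence.
func sortNamespaces(nss []Namespace) {
	sort.Slice(nss, func(i, j int) bool {
		if nss[i].Cluster != nss[j].Cluster {
			return nss[i].Cluster < nss[j].Cluster
		}
		return nss[i].Namespace < nss[j].Namespace
	})
}

func main() {
	nss := []Namespace{
		{Cluster: "prod", Namespace: "kube-system"},
		{Cluster: "dev", Namespace: "default"},
		{Cluster: "prod", Namespace: "default"},
	}
	sortNamespaces(nss)
	for _, ns := range nss {
		fmt.Printf("%s/%s\n", ns.Cluster, ns.Namespace)
	}
}
```

Documenting whatever key is actually used lets clients depend on the order deliberately rather than by accident.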

Comment on lines 177 to 184
var wg sync.WaitGroup
wg.Add(1)
go func() {
	namespaceManager.Run(d.ctx)
	wg.Done()
}()
defer wg.Wait()
Member

What happens on Hubble initialization failure? As far as I understand, we'll wait indefinitely on this wg.Wait() since d.ctx is not cancelled.

To me, the use of sync.WaitGroup here indicates that we need something in namespaceManager.Run to finish cleanly before returning, but that doesn't seem to be the case. Instead, we want to shut down the namespaceManager when we return. If that is correct, I would like to suggest

Suggested change
- var wg sync.WaitGroup
- wg.Add(1)
- go func() {
- 	namespaceManager.Run(d.ctx)
- 	wg.Done()
- }()
- defer wg.Wait()
+ ctx, cancel := context.WithCancel(d.ctx)
+ defer cancel()
+ go namespaceManager.Run(ctx)

Contributor Author

@chancez chancez May 22, 2023

Good point on initialization failure, and the approach to do cancellation seems like the right way to do it.

The waitGroup is just best practice from my perspective. In my opinion you should always ensure every goroutine returns (even on shutdown), and the waitGroup here is to ensure that it does. We just aren't doing a very good job of it (IMO) in the rest of this function currently, and I don't want to rewrite the whole function right now either.

This is just to avoid potential memory leaks. It's hard to know how code might be used in the future. For example, if we changed the agent to restart Hubble automatically on failure (right now it does not) and we didn't check that the goroutine returns before returning ourselves, we might end up starting an unbounded number of goroutines.

Member

Taking another look: before this patch, launchHubble is executed in its own goroutine. In the happy path, the "launch" goroutine sets up a goroutine for the "local server" and a corresponding cleanup goroutine, another goroutine for the "remote server" and a corresponding cleanup goroutine, and then returns. So we start the "launch" goroutine, which itself starts 4 other goroutines, and then the "launch" goroutine terminates.

My suggestion therefore doesn't work, because when launchHubble returns the context is cancelled and the namespaceManager shuts down.

In the current version of the patch, the "launch" goroutine will be stuck on the deferred wg.Wait() until d.ctx is cancelled, which cancels the local ctx and shuts down the namespaceManager (which will eventually call wg.Done() and allow the "launch" goroutine to return from launchHubble and terminate).

If that is correct, then although I agree with you on the best practice in principle, in practice I don't see the point of keeping the waitGroup. Technically it only makes the "launch" goroutine stick around, and that goroutine no longer has any purpose. As I see it, it is one more goroutine that is sleeping or that we might leak.

Contributor Author

@chancez chancez May 23, 2023

Interesting, I rebuilt and tested locally with this change but didn't actually notice any issues. I guess that means the namespace manager just wasn't going to expire resources, so I didn't catch it. I'll take another look at this.

Member

@qmonnet qmonnet left a comment

Doc update looks good. I only glanced quickly at the rest of the PR (and have no particular concern).

If anything, I'd like to have a bit more context and motivation in the commit description. The description used for #25266 would do just fine, for example.

@chancez chancez force-pushed the pr/chancez/observer_get_namespaces branch from 209ecfe to a303b45 Compare May 22, 2023 22:23
@chancez chancez requested a review from kaworu May 22, 2023 22:24
Contributor Author

chancez commented May 22, 2023

Thanks for the review @kaworu. Please take another look.

@kaworu kaworu added kind/feature This introduces new functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay labels May 23, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 23, 2023
Member

@kaworu kaworu left a comment

Thanks for the update @chancez, looking good! A few more comments, and also:

  1. golangci-lint is failing:

    pkg/hubble/observer/namespace_manager_test.go:10: File is not `goimports`-ed with -local github.com/cilium/cilium (goimports)
        observerpb "github.com/cilium/cilium/api/v1/observer"
    
  2. generate-api CI is failing:

    Please run 'make generate-api generate-health-api generate-hubble-api generate-operator-api' and submit your changes
    

Comment on lines 92 to 93
ctx, cancel := context.WithCancel(d.ctx)
defer cancel()
Member

launchHubble is not blocking, so if we use this new derived context and defer its cancellation, it will be cancelled as soon as launchHubble returns.
I don't think that is what we want, looking at how ctx is used in namespaceManager.Run() and in the two select blocks at lines 284 and 311.

Comment on lines 177 to 184
namespaceManager := observer.NewNamespaceManager()
var wg sync.WaitGroup
wg.Add(1)
go func() {
	namespaceManager.Run(ctx)
	wg.Done()
}()
defer wg.Wait()
Member

To manage background goroutines while remaining "sensitive" to external context cancellation (and ensuring proper cleanup), I suggest looking into pkg/controller. Alternatively, you can consider the new pkg/hive/job (I think OneShot is what you're looking for here), even though it is meant to be integrated with the hive/cell framework.

@chancez chancez force-pushed the pr/chancez/observer_get_namespaces branch from a303b45 to c97f292 Compare May 23, 2023 19:56
Contributor Author

chancez commented May 23, 2023

Okay, I believe I got everything addressed. I opted to just remove the waitGroup from launchHubble and start the go routine without any guards about it returning, like the other go routines in the function. I'd rather spend the effort on refactoring all of launchHubble, perhaps as a Hive/Cell module, which presumably is something we'll need to do soon anyways.

@chancez chancez requested review from pippolo84 and kaworu May 23, 2023 20:13
Member

@pippolo84 pippolo84 left a comment

LGTM for agent related changes 💯

@kaworu kaworu requested a review from rolinh May 24, 2023 15:11
Member

@kaworu kaworu left a comment

Thanks @chancez!

Contributor Author

chancez commented May 24, 2023

/test

Contributor Author

chancez commented Jun 8, 2023

/test-1.26-net-next

@chancez chancez force-pushed the pr/chancez/observer_get_namespaces branch from c97f292 to 3994bb0 Compare June 8, 2023 19:46
Contributor Author

chancez commented Jun 8, 2023

/test

Member

@rolinh rolinh left a comment

Thanks! lgtm overall but I left a few comments to address.

@chancez chancez force-pushed the pr/chancez/observer_get_namespaces branch from 3994bb0 to 08ff93a Compare June 9, 2023 20:39
Contributor Author

chancez commented Jun 9, 2023

/test

@chancez chancez requested a review from rolinh June 9, 2023 20:40
Signed-off-by: Chance Zibolski <chance.zibolski@gmail.com>
@chancez chancez force-pushed the pr/chancez/observer_get_namespaces branch from 08ff93a to 2e8059e Compare June 9, 2023 21:01
Member

rolinh commented Jun 12, 2023

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 12, 2023
@borkmann borkmann merged commit 6ce4074 into main Jun 13, 2023
63 of 64 checks passed
@borkmann borkmann deleted the pr/chancez/observer_get_namespaces branch June 13, 2023 20:11
Comment on lines +46 to +59
func (m *namespaceManager) Run(ctx context.Context) {
	ticker := time.NewTicker(checkNamespaceAgeFrequency)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// periodically remove any namespaces which haven't been seen in flows
			// for the last hour
			m.cleanupNamespaces()
		}
	}
}
Member

@chancez Did you consider using pkg/controller in order to run this logic? The benefits are that it's easier to get visibility into when this periodic logic runs, whether there are errors, and so on. pkg/controller will automatically hook the logic into metrics as well as registering the status each time it runs with the cilium status reporter, so that users can understand whether the logic ran, when it ran, and whether it's stuck.

Contributor Author

No, but Hubble also is a bit of a snowflake in general, and hasn't had nearly as much focus as the rest of Cilium, including building better reusable primitives.

Labels
kind/feature This introduces new functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay
Development

Successfully merging this pull request may close these issues.

Add GetNamespaces support to Hubble
7 participants