hubble: Add GetNamespaces to observer API #25563
Conversation
Force-pushed afb8e21 to e364994
Force-pushed e364994 to 209ecfe
/test
Thanks @chancez! Overall LGTM. I would like more testing around the namespaceManager, and I'd also like the Relay part to be reworked, as I think we can clean it up.
@@ -28,6 +28,9 @@ service Observer {
  // GetNodes returns information about nodes in a cluster.
  rpc GetNodes(GetNodesRequest) returns (GetNodesResponse) {}

+ // GetNamespaces returns information about namespaces in a cluster.
I think the "memory period" of 1h and the order should be documented here as well.
thanks for the update, still missing the bits about ordering.
Do we want to document ordering, or should we leave that unspecified in case we decide we don't want to sort anymore? Glad to do it either way.
I think users will rely on the order regardless of whether it is documented or not, so I would say let's document it. It's nice to output a consistent order.
daemon/cmd/hubble.go
Outdated
var wg sync.WaitGroup
wg.Add(1)
go func() {
	namespaceManager.Run(d.ctx)
	wg.Done()
}()
defer wg.Wait()
What happens on Hubble initialization failure? As far as I understand, we'll wait indefinitely on this wg.Wait() since d.ctx is not cancelled.
To me, the usage of sync.WaitGroup here indicates that we need something from namespaceManager.Run to finish cleanly before returning, but that doesn't seem to be the case. Instead, we want to shut down the namespaceManager when we return. If that's correct, I would like to suggest:
Suggested change:
- var wg sync.WaitGroup
- wg.Add(1)
- go func() {
- 	namespaceManager.Run(d.ctx)
- 	wg.Done()
- }()
- defer wg.Wait()
+ ctx, cancel := context.WithCancel(d.ctx)
+ defer cancel()
+ go namespaceManager.Run(ctx)
Good point on initialization failure, and the cancellation approach seems like the right way to do it.
The waitGroup is just best practice from my perspective. In my opinion you should always ensure every goroutine returns (even on shutdown), and the waitGroup here is there to ensure that it does. We just aren't doing a very good job of that (IMO) in the rest of this function currently, and I don't want to rewrite the whole function right now either.
This is just to avoid potential goroutine leaks. It's hard to know how code might be used in the future. For example, if we changed the agent to restart Hubble automatically on failure (right now it does not) and we didn't check that the goroutine returns before returning, we could end up starting an unbounded number of goroutines.
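To make the intent concrete, here is a generic sketch of the pattern being argued for (a blocking owner function that both cancels and waits for its goroutine). This is an illustration of the principle, not code from this PR, and as the discussion below shows it doesn't map directly onto launchHubble, which returns immediately:

ctx, cancel := context.WithCancel(d.ctx)

var wg sync.WaitGroup
wg.Add(1)
go func() {
	defer wg.Done()
	namespaceManager.Run(ctx)
}()

// Deferred in this order so that, on function return, cancel() fires first
// (defers run LIFO) and tells Run to stop, and wg.Wait() then observes its exit.
defer wg.Wait()
defer cancel()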
Taking another look: before this patch, launchHubble is executed in its own goroutine. In the happy path the "launch" goroutine sets up a goroutine for the "local server" and a corresponding cleanup goroutine, another goroutine for the "remote server" and a corresponding cleanup goroutine, and then returns. So we start the "launch" goroutine, it starts 4 other goroutines, and then the "launch" goroutine terminates.
My suggestion then doesn't work, because when launchHubble returns the context is cancelled and the namespaceManager shuts down.
In the current version of the patch, the "launch" goroutine will be stuck on the deferred wg.Wait() until d.ctx is cancelled, cancelling the local ctx and shutting down the namespaceManager (which will eventually call wg.Done() and allow the "launch" goroutine to return from launchHubble and terminate).
If that's correct, although I agree with you on the best-practice principle, practically I don't see the point of keeping the waitGroup. Technically it only makes the "launch" goroutine stick around, which doesn't serve any purpose anymore. As I see it, it is one more goroutine that is sleeping or that we might leak.
Interesting, I rebuilt and tested locally with this change but didn't actually notice any issues. I guess that means the namespace manager just wasn't going to expire resources, so I didn't catch it. I'll take another look at this.
Doc update looks good. I only glanced quickly at the rest of the PR (and have no particular concern).
If anything, I'd like to have a bit more context and motivation in the commit description. The description used for #25266 would do just fine, for example.
Force-pushed 209ecfe to a303b45
Thanks for the review @kaworu. Please take another look.
Thanks for the update @chancez, looking good! A few more comments, and also:
- pkg/hubble/observer/namespace_manager_test.go:10: File is not `goimports`-ed with -local github.com/cilium/cilium (goimports) observerpb "github.com/cilium/cilium/api/v1/observer"
- Please run 'make generate-api generate-health-api generate-hubble-api generate-operator-api' and submit your changes
daemon/cmd/hubble.go
Outdated
ctx, cancel := context.WithCancel(d.ctx)
defer cancel()
launchHubble is not blocking, so if we use this new derived context and defer its cancellation, it will be cancelled as soon as launchHubble returns.
I don't think that is what we want, looking at how ctx is used in namespaceManager.Run() and in the two select blocks at lines 284 and 311.
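A minimal, standalone illustration of the problem (a toy example, not cilium code): when the setup function is non-blocking, a deferred cancel fires the moment the function returns, so the background worker is told to stop almost immediately.

package main

import (
	"context"
	"fmt"
	"time"
)

// launch mimics a non-blocking setup function: it derives a context,
// defers its cancellation, starts a worker, and returns right away.
func launch(parent context.Context) {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // fires as soon as launch returns

	go func() {
		<-ctx.Done()
		fmt.Println("worker stopped:", ctx.Err())
	}()
}

func main() {
	launch(context.Background())
	time.Sleep(100 * time.Millisecond) // prints "worker stopped: context canceled" almost immediately
}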
daemon/cmd/hubble.go
Outdated
namespaceManager := observer.NewNamespaceManager()
var wg sync.WaitGroup
wg.Add(1)
go func() {
	namespaceManager.Run(ctx)
	wg.Done()
}()
defer wg.Wait()
To manage background goroutines while remaining "sensitive" to external context cancellation (and ensuring proper cleanup), I suggest looking into pkg/controller. Alternatively, you can consider the new pkg/hive/job (I think the OneShot job is what you're looking for here), even if it is targeted at being integrated with the hive/cell framework.
Force-pushed a303b45 to c97f292
Okay, I believe I got everything addressed. I opted to just remove the waitGroup from …
LGTM for agent related changes 💯
Thanks @chancez!
/test
/test-1.26-net-next
Force-pushed c97f292 to 3994bb0
/test
Thanks! lgtm overall but I left a few comments to address.
Force-pushed 3994bb0 to 08ff93a
/test
Signed-off-by: Chance Zibolski <chance.zibolski@gmail.com>
Force-pushed 08ff93a to 2e8059e
/test
func (m *namespaceManager) Run(ctx context.Context) {
	ticker := time.NewTicker(checkNamespaceAgeFrequency)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// periodically remove any namespaces which haven't been seen in flows
			// for the last hour
			m.cleanupNamespaces()
		}
	}
}
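For context, a hypothetical sketch of the bookkeeping cleanupNamespaces could be doing (the struct fields here are illustrative assumptions, not the PR's actual implementation): remember when each namespace was last seen in a flow and drop entries that have gone unseen for the one-hour window.

// Hypothetical sketch: namespaces seen in flows are remembered with a
// last-seen timestamp and dropped once they go unseen for an hour.
type namespaceManager struct {
	mu         sync.Mutex
	namespaces map[string]time.Time // namespace name -> last seen in a flow
}

func (m *namespaceManager) cleanupNamespaces() {
	m.mu.Lock()
	defer m.mu.Unlock()
	for ns, lastSeen := range m.namespaces {
		if time.Since(lastSeen) > time.Hour {
			delete(m.namespaces, ns)
		}
	}
}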
@chancez Did you consider using pkg/controller in order to run this logic? The benefits are that it's easier to get visibility into when this periodic logic runs, whether there are errors, and so on. pkg/controller will automatically hook the logic into metrics, as well as register the status each time it runs with the cilium status reporter, so that users can understand whether the logic ran, when it ran, and whether it's stuck.
No, but Hubble is also a bit of a snowflake in general, and hasn't had nearly as much focus as the rest of Cilium, including on building better reusable primitives.
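For reference, a rough sketch of what the suggested pkg/controller approach could look like, assuming its Manager.UpdateController / ControllerParams API (DoFunc, RunInterval); this is an illustration of the idea, not code from the PR:

import (
	"context"

	"github.com/cilium/cilium/pkg/controller"
)

// Sketch only: run the cleanup as a named controller so it is hooked into
// metrics and status reporting, instead of a hand-rolled ticker loop.
func startNamespaceCleanup(mgr *controller.Manager, m *namespaceManager) {
	mgr.UpdateController("hubble-namespace-cleanup", controller.ControllerParams{
		RunInterval: checkNamespaceAgeFrequency,
		DoFunc: func(ctx context.Context) error {
			m.cleanupNamespaces()
			return nil
		},
	})
}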
Fixes: #25266
I also have a branch adding support to Hubble CLI ready once this is merged. https://github.com/cilium/hubble/tree/pr/chancez/get_namespaces