hubble: Add GetNamespaces to observer API #25563
Conversation
Force-pushed afb8e21 to e364994
Force-pushed e364994 to 209ecfe
/test
Thanks @chancez! Overall LGTM. I would like more testing around the namespaceManager, and I'd also like the Relay part to be reworked, as I think we can clean it up.
@@ -28,6 +28,9 @@ service Observer {
  // GetNodes returns information about nodes in a cluster.
  rpc GetNodes(GetNodesRequest) returns (GetNodesResponse) {}

+ // GetNamespaces returns information about namespaces in a cluster.
I think the "memory period" of 1h and the order should be documented here as well.
thanks for the update, still missing the bits about ordering.
Do we want to document ordering, or should we leave that unspecified in case we decide we don't want to sort anymore? Glad to do it either way.
I think users will rely on the order regardless of whether it is documented or not, so I would say let's document it. It's nice to output a consistent order.
daemon/cmd/hubble.go
Outdated
var wg sync.WaitGroup
wg.Add(1)
go func() {
	namespaceManager.Run(d.ctx)
	wg.Done()
}()
defer wg.Wait()
What happens on Hubble initialization failure? As far as I understand, we'll wait indefinitely on this wg.Wait() since d.ctx is not cancelled.
To me, the usage of sync.WaitGroup here indicates that we need something from namespaceManager.Run to finish cleanly before returning, but that doesn't seem to be the case. Instead, we want to shut down the namespaceManager when we return. If that's correct, I would like to suggest:
Suggested change:
- var wg sync.WaitGroup
- wg.Add(1)
- go func() {
- 	namespaceManager.Run(d.ctx)
- 	wg.Done()
- }()
- defer wg.Wait()
+ ctx, cancel := context.WithCancel(d.ctx)
+ defer cancel()
+ go namespaceManager.Run(ctx)
Good point on initialization failure, and the cancellation approach seems like the right way to do it.
The waitGroup is just best practice from my perspective. In my opinion you should always ensure every goroutine returns (even on shutdown), and the waitGroup here is there to ensure that it does. We just aren't doing a very good job of that (IMO) in the rest of this function currently, and I don't want to rewrite the whole function right now either.
This is just to avoid potential goroutine leaks. It's hard to know how code might be used in the future. For example, if we changed the agent to restart Hubble automatically on failure (right now it does not) and we didn't check that the goroutine returns before returning, we could end up starting an unbounded number of goroutines.
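To make the intent concrete, here is a generic sketch of the pattern being argued for (a blocking owner function that both cancels and waits for its goroutine). This is an illustration of the principle, not code from this PR, and as the discussion below shows it doesn't map directly onto launchHubble, which returns immediately:

ctx, cancel := context.WithCancel(d.ctx)

var wg sync.WaitGroup
wg.Add(1)
go func() {
	defer wg.Done()
	namespaceManager.Run(ctx)
}()

// Deferred in this order so that, on function return, cancel() fires first
// (defers run LIFO) and tells Run to stop, and wg.Wait() then observes its exit.
defer wg.Wait()
defer cancel()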
Taking another look: before this patch, launchHubble is executed in its own goroutine. In the happy path the "launch" goroutine sets up a goroutine for the "local server" and a corresponding cleanup goroutine, another goroutine for the "remote server" and a corresponding cleanup goroutine, and then returns. So we start the "launch" goroutine, it starts 4 other goroutines, and then the "launch" goroutine terminates.
My suggestion then doesn't work, because when launchHubble returns the context is cancelled and the namespaceManager shuts down.
In the current version of the patch, the "launch" goroutine will be stuck on the deferred wg.Wait() until d.ctx is cancelled, cancelling the local ctx and shutting down the namespaceManager (which will eventually call wg.Done() and allow the "launch" goroutine to return from launchHubble and terminate).
If that's correct, although I agree with you on the best-practice principle, practically I don't see the point of keeping the waitGroup. Technically it only makes the "launch" goroutine stick around, which doesn't serve any purpose anymore. As I see it, it is one more goroutine that is sleeping or that we might leak.
Interesting, I rebuilt and tested locally with this change but didn't actually notice any issues. I guess that means the namespace manager just wasn't going to expire resources, so I didn't catch it. I'll take another look at this.
Doc update looks good. I only glanced quickly at the rest of the PR (and have no particular concern).
If anything, I'd like to have a bit more context and motivation in the commit description. The description used for #25266 would do just fine, for example.
Force-pushed 209ecfe to a303b45
Thanks for the review @kaworu. Please take another look.
Thanks for the update @chancez, looking good! A few more comments, and also:
- pkg/hubble/observer/namespace_manager_test.go:10: File is not `goimports`-ed with -local github.com/cilium/cilium (goimports) observerpb "github.com/cilium/cilium/api/v1/observer"
- Please run 'make generate-api generate-health-api generate-hubble-api generate-operator-api' and submit your changes
daemon/cmd/hubble.go
Outdated
ctx, cancel := context.WithCancel(d.ctx)
defer cancel()
launchHubble is not blocking, so if we use this new derived context and defer its cancellation, it will be cancelled as soon as launchHubble returns.
I don't think that is what we want, looking at how ctx is used in namespaceManager.Run() and in the two select blocks at lines 284 and 311.
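A minimal, standalone illustration of the problem (a toy example, not cilium code): when the setup function is non-blocking, a deferred cancel fires the moment the function returns, so the background worker is told to stop almost immediately.

package main

import (
	"context"
	"fmt"
	"time"
)

// launch mimics a non-blocking setup function: it derives a context,
// defers its cancellation, starts a worker, and returns right away.
func launch(parent context.Context) {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // fires as soon as launch returns

	go func() {
		<-ctx.Done()
		fmt.Println("worker stopped:", ctx.Err())
	}()
}

func main() {
	launch(context.Background())
	time.Sleep(100 * time.Millisecond) // prints "worker stopped: context canceled" almost immediately
}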
daemon/cmd/hubble.go
Outdated
namespaceManager := observer.NewNamespaceManager()
var wg sync.WaitGroup
wg.Add(1)
go func() {
	namespaceManager.Run(ctx)
	wg.Done()
}()
defer wg.Wait()
To manage background goroutines while remaining "sensitive" to external context cancellation (and ensuring proper cleanup), I suggest looking into pkg/controller. Alternatively, you can consider the new pkg/hive/job (I think the OneShot job is what you're looking for here), even if it is targeted at being integrated with the hive/cell framework.
Force-pushed a303b45 to c97f292
Okay, I believe I got everything addressed. I opted to just remove the waitGroup from …
LGTM for agent related changes 💯
Thanks @chancez!
/test
/test-1.26-net-next
Force-pushed c97f292 to 3994bb0
/test
Thanks! lgtm overall but I left a few comments to address.
Force-pushed 3994bb0 to 08ff93a
/test
Signed-off-by: Chance Zibolski <chance.zibolski@gmail.com>
Force-pushed 08ff93a to 2e8059e
/test
func (m *namespaceManager) Run(ctx context.Context) {
	ticker := time.NewTicker(checkNamespaceAgeFrequency)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// periodically remove any namespaces which haven't been seen in flows
			// for the last hour
			m.cleanupNamespaces()
		}
	}
}
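For context, a hypothetical sketch of the bookkeeping cleanupNamespaces could be doing (the struct fields here are illustrative assumptions, not the PR's actual implementation): remember when each namespace was last seen in a flow and drop entries that have gone unseen for the one-hour window.

// Hypothetical sketch: namespaces seen in flows are remembered with a
// last-seen timestamp and dropped once they go unseen for an hour.
type namespaceManager struct {
	mu         sync.Mutex
	namespaces map[string]time.Time // namespace name -> last seen in a flow
}

func (m *namespaceManager) cleanupNamespaces() {
	m.mu.Lock()
	defer m.mu.Unlock()
	for ns, lastSeen := range m.namespaces {
		if time.Since(lastSeen) > time.Hour {
			delete(m.namespaces, ns)
		}
	}
}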
@chancez Did you consider using pkg/controller in order to run this logic? The benefits are that it's easier to get visibility into when this periodic logic runs, whether there are errors, and so on. pkg/controller will automatically hook the logic into metrics, as well as register the status each time it runs with the cilium status reporter, so that users can understand whether the logic ran, when it ran, and whether it's stuck.
No, but Hubble is also a bit of a snowflake in general, and hasn't had nearly as much focus as the rest of Cilium, including on building better reusable primitives.
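For reference, a rough sketch of what the suggested pkg/controller approach could look like, assuming its Manager.UpdateController / ControllerParams API (DoFunc, RunInterval); this is an illustration of the idea, not code from the PR:

import (
	"context"

	"github.com/cilium/cilium/pkg/controller"
)

// Sketch only: run the cleanup as a named controller so it is hooked into
// metrics and status reporting, instead of a hand-rolled ticker loop.
func startNamespaceCleanup(mgr *controller.Manager, m *namespaceManager) {
	mgr.UpdateController("hubble-namespace-cleanup", controller.ControllerParams{
		RunInterval: checkNamespaceAgeFrequency,
		DoFunc: func(ctx context.Context) error {
			m.cleanupNamespaces()
			return nil
		},
	})
}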
Fixes: #25266
I also have a branch adding support to Hubble CLI ready once this is merged. https://github.com/cilium/hubble/tree/pr/chancez/get_namespaces