Improve logging statements in CES usage and reduce code reuse #22428
@Weil0ng , could you take a look at this or pull in someone with CES background? |
@yanggangtony do you have an example stack trace for the failure? That would be helpful also to include in the commit message and PR description. |
@joestringer |
Ah yes, I missed it since it wasn't in the commit message or PR description. That one should do 👍 |
Thanks for the PR, I think it's a good catch. Please consider expanding it a bit as explained in the comment.
@nebril |
Hello, thanks for handling my issue. |
@hongkunyoo |
I agree, it would be very useful to investigate how it got into this state in the first place. If there's a way to reproduce the original bug, that would be very helpful. Either way, I think that if we catch the error to avoid a nil pointer exception and log the error, that is better than the current state. Then over time hopefully we can find out more about the underlying bug that causes this to happen. |
@yanggangtony the code doesn't compile, please investigate. |
@joestringer |
/test |
Rebased onto master. |
Several reviewers have asked for the commit description to be expanded. Adding the stack trace there would probably be good enough.
Ah, thanks for the reminder. |
I didn't take too deep of a look, but I noticed some aspects of the logging that deviate from the usual approach to logging in Cilium. I'm also not sure I understand whether the new logging situations are expected or unexpected conditions. Typically log.Info() should be limited to mostly once-off messages that do not recur frequently. If we're hiding an error condition, then maybe we should be reflecting those errors somewhere, but I'm not exactly sure where. Probably someone with better CES and/or k8s knowledge could provide more concrete feedback in this regard. cc @Weil0ng @cilium/sig-k8s
@varunmar |
Thanks for the changes. Looks good to me now.
Weilong is out of office for several weeks. I'm fine with this change though - can we merge it? |
Thanks for the review. |
/test |
Sorry I was OOO, thank you for spotting the issue and sending in the mitigation! Change LGTM. Just curious, was the cluster under heavy pod churn when you encountered the panic?
/cc @dlapcevic |
Hi, I already fixed the nil access panic bug in getCESQueueDelayInSeconds() - #22884
The explanation of how this happens is in the description.
I was unaware of this PR and the issue, we should’ve closed #22417 when my PR was merged.
I think there is no way it will try to access nil pointer right now, because all parts of the code that use getCESTrackerOnly() have checks for it, either by checking if it’s nil or using getCESName() to see if it exists.
getCESName() is also reliable because createCES() and insertCEP() are used together, so if cepNametoCESName exists, a cesTracker for that CEP will also exist.
Code:
getCESTrackerOnly - https://github.com/cilium/cilium/blob/master/operator/pkg/ciliumendpointslice/ceptocesmap.go#L89
getCESName - https://github.com/cilium/cilium/blob/master/operator/pkg/ciliumendpointslice/ceptocesmap.go#L47
createCES - https://github.com/cilium/cilium/blob/master/operator/pkg/ciliumendpointslice/manager.go#L200
insertCEP - https://github.com/cilium/cilium/blob/master/operator/pkg/ciliumendpointslice/ceptocesmap.go#L34
Anyway, it’s good to remove getCESTrackerOnly() to avoid future issues with unsafe access.
LGTM
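The invariant described in this comment can be illustrated with a minimal sketch. It is not the operator's actual code: the types cesMap and cesTracker, the helper cepCount, and the choice to have insertCEP call createCES are simplifications assumed here for illustration. The point is only that a single accessor returning the comma-ok pair lets callers check for a missing tracker instead of dereferencing nil.

```go
package main

import "fmt"

// cesTracker is a hypothetical stand-in for the per-CES bookkeeping struct.
type cesTracker struct {
	name string
	ceps map[string]struct{}
}

// cesMap is a simplified, hypothetical model of the CEP-to-CES mapping.
type cesMap struct {
	cepNameToCESName map[string]string
	trackers         map[string]*cesTracker
}

func newCESMap() *cesMap {
	return &cesMap{
		cepNameToCESName: make(map[string]string),
		trackers:         make(map[string]*cesTracker),
	}
}

// createCES registers a tracker for cesName if none exists and returns it.
func (m *cesMap) createCES(cesName string) *cesTracker {
	if t, ok := m.trackers[cesName]; ok {
		return t
	}
	t := &cesTracker{name: cesName, ceps: make(map[string]struct{})}
	m.trackers[cesName] = t
	return t
}

// insertCEP records the CEP-to-CES mapping. In this sketch it creates the
// tracker first, so a mapped CEP always has a tracker (an assumption made
// here to keep the invariant in one place).
func (m *cesMap) insertCEP(cepName, cesName string) {
	t := m.createCES(cesName)
	t.ceps[cepName] = struct{}{}
	m.cepNameToCESName[cepName] = cesName
}

// getCESTracker is the single accessor; callers must check the bool.
func (m *cesMap) getCESTracker(cesName string) (*cesTracker, bool) {
	t, ok := m.trackers[cesName]
	return t, ok
}

// cepCount returns an error instead of dereferencing a nil tracker.
func (m *cesMap) cepCount(cesName string) (int, error) {
	t, ok := m.getCESTracker(cesName)
	if !ok {
		return 0, fmt.Errorf("no tracker for CES %q", cesName)
	}
	return len(t.ceps), nil
}

func main() {
	m := newCESMap()
	m.insertCEP("default/pod-a", "ces-1")
	fmt.Println(m.cepCount("ces-1"))
	fmt.Println(m.cepCount("ces-missing"))
}
```

Keeping one accessor with an explicit existence result means there is no second lookup path that could skip the check.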
@sayboras I believe all team review requests are already covered, that is cilium/cli and cilium/operator. |
getCESTrackerOnly() is just a slightly different version of getCESTracker(), we only need one such function. Remove the other one.
Signed-off-by: yanggang <gang.yang@daocloud.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Thanks for your review @dlapcevic! I took a closer look and I couldn't see any bug that this PR was fixing. I've updated the PR title accordingly. Additionally, I did the following:
No functional changes should have been made since the prior test run, so if this passes the basic checks then I will be confident to merge it into the tree. |
Reuse logfields for common names, add some missing fields from logging statements, and add additional debugging statements in error conditions.
Signed-off-by: yanggang <gang.yang@daocloud.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: yanggang gang.yang@daocloud.io
/kind bug
Fix panic when the map returns nil.
fix: #22417