Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: Simplify k8sErrorHandler time management #1899

Merged
merged 1 commit into from
Oct 31, 2017

Conversation

raybejjani
Copy link
Contributor

@raybejjani raybejjani commented Oct 30, 2017

This function is hard :(
I tried to keep the same behaviour as before, but also tried to be very explicit about each scenario (to the point of repeating code).

  • We used to do all of the side-effects and special messages only on the first time we saw a particular error message. Chatting with @aanm it turns out only once case was intended to be a one-off. We now restartCiliumController every time the corresponding error is seen. We also print warnings about TPR/CRD conflicts every minute, along with the error message that caused it.
  • I used timestamps to do the rate-limiting on the assumption that they are smaller than timers. I discovered that time.Time is a struct and not a uint64, which makes this assertion less true :/
cilium-agent correctly restarts listening for CiliumNetworkPolicy changes when it sees decoding errors

@raybejjani raybejjani added area/daemon Impacts operation of the Cilium daemon. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. labels Oct 30, 2017
@raybejjani raybejjani requested review from aanm and a team October 30, 2017 17:48
@raybejjani raybejjani added the wip label Oct 30, 2017
@ianvernon
Copy link
Member

ianvernon commented Oct 30, 2017

Your PR description says We also print warnings about - can you finish this description?

Edit: heh, I commented on this right as you added the wip label.

@raybejjani
Copy link
Contributor Author

gah, I just realised I need separate instances of the Once object. I shall fix this and ping again. Ignore this PR for now!

@raybejjani raybejjani force-pushed the daemon-k8s-simpler-error-handler branch 2 times, most recently from d726c9d to 54e174c Compare October 30, 2017 19:11
@raybejjani
Copy link
Contributor Author

@ianvernon ha, yeah, that's what I get for being distracted I have updated it.
@aanm ok, up for review.

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When an error appears multiple times, for example every second, how are you skipping it? It seems that you're always incrementing the unmuteDeadline

// k8sErrMsg stores a timer for each k8s error message received
k8sErrMsg = map[string]*time.Timer{}
// k8sErrMsgMU guards additions and removals to k8sErrMsg
k8sErrMsgMU lock.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read it somewhere but I can't find it now there is a convention that defines the mutexes should be right before the variables they are protecting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea anyway. Done.

}

k8sErrMsg[errstr] = unmuteDeadline.Add(k8sErrLogTimeout)
return unmuteDeadline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you wanted to do this because Add doesn't add the time to the receiver, it returns a new value. For this reason you should return the assigned value.

	k8sErrMsg[errstr] = unmuteDeadline.Add(k8sErrLogTimeout)
	// k8sErrMsg[errstr] == unmuteDeadline // since it's false we need to return the assigned value
        return k8sErrMsg[errstr]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had meant to return the original, looked up value but my logic was wrong as you mentioned above. I wasn't doing the rate-limiting correctly. This is fixed up.

}
return

case (strings.Contains(errstr, "Unable to decode an event from the watch stream: unable to decode watch event") ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you replace the || with a ,?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realise I could do that! done.

@raybejjani raybejjani force-pushed the daemon-k8s-simpler-error-handler branch from 9bc884d to 90365a0 Compare October 31, 2017 10:20
@raybejjani
Copy link
Contributor Author

ok, @aanm now it has a test! It's probably correct :/

@raybejjani raybejjani requested a review from aanm October 31, 2017 10:22
@raybejjani raybejjani force-pushed the daemon-k8s-simpler-error-handler branch from 90365a0 to 2496a54 Compare October 31, 2017 10:25
@raybejjani raybejjani added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Oct 31, 2017
@raybejjani
Copy link
Contributor Author

@tgraf I added a release note about the only user-visible change. Is this what you had in mind?

@raybejjani raybejjani force-pushed the daemon-k8s-simpler-error-handler branch from 2496a54 to f0ed2a6 Compare October 31, 2017 12:48
@raybejjani
Copy link
Contributor Author

@aanm fixed the "connection refused" limiting, it now has the same rate-limit as the other special cases.

}
return

case strings.Contains(errstr, "Failed to list *v1.NetworkPolicy: the server could not find the requested resource"):
if k8sErrorUpdateCheckUnmuteTime(errstr, now) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this if since we want this message to be printed as it occurs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@raybejjani raybejjani force-pushed the daemon-k8s-simpler-error-handler branch from f0ed2a6 to b295322 Compare October 31, 2017 15:17
@raybejjani raybejjani requested a review from aanm October 31, 2017 15:44
@raybejjani raybejjani force-pushed the daemon-k8s-simpler-error-handler branch from b295322 to 66444b0 Compare October 31, 2017 17:11
This function was difficult to follow and had side-effects that happened
only once because of the code flow. In one case, this was an accident.
The new code is more explicit about what is a one-time side-effect and
what isn't.

Signed-off-by: Ray Bejjani <ray@covalent.io>
@raybejjani raybejjani force-pushed the daemon-k8s-simpler-error-handler branch from 66444b0 to 128632c Compare October 31, 2017 17:32
@tgraf tgraf merged commit 9c642f0 into master Oct 31, 2017
@tgraf tgraf deleted the daemon-k8s-simpler-error-handler branch October 31, 2017 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants