New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check to ensure no two consecutive check runs share a timestamp. #2025
Conversation
cmd/bosun/sched/alertRunner.go
Outdated
break | ||
} | ||
collect.Add("check.context_not_changed", opentsdb.TagSet{"alert": a.Name}, 1) | ||
time.Sleep(time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this should be some fraction of the checkFrequency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially chose a millisecond. If this occurs at all, it is because the alert and the global context are perfectly lock stepped, and this one won the race. A second should be more than adequate to "unlock" it in a timely manner.
If the counter explodes, we have a bigger problem, and some bad assumptions.
What about something like instead of the routines sleeping on time, a timer tells them when they can run. That way the timer can always set the context before sending the signal |
I thought about something like that, maybe using sync.Cond as a signal. But I think there is an additional race there. If the context is updated and signalled between when the alert detects no-change, and when it subscribes to the signal, you miss a whole check interval. My way you only lose a second. |
cmd/bosun/sched/alertRunner.go
Outdated
break | ||
} | ||
collect.Add("check.context_not_changed", opentsdb.TagSet{"alert": a.Name}, 1) | ||
time.Sleep(time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should comment in the code on why were are doing the sleep
cmd/bosun/sched/alertRunner.go
Outdated
@@ -39,15 +48,32 @@ func (s *Schedule) RunAlert(a *conf.Alert) { | |||
s.checksRunning.Add(1) | |||
// ensure when an alert is done it is removed from the wait group | |||
defer s.checksRunning.Done() | |||
var lastCheckID int64 = -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The checkId and lastCheckID thing is confusing me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
each alert tracks the last id that it ran a check on. we actually could dispense with the field and just track lastCheckTime in the same way. Good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how's that look now?
Kyle and I had a better idea in hangouts. will implement that instead. |
Uses a simple counter that gets incremented every time we make a new check context. When a check runs, if the context has not changed since last time, we wait a second and look again.
Also fixes race condition.
Fixes #2023