Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check to ensure no two consecutive check runs share a timestamp. #2025

Closed
wants to merge 5 commits into from

Conversation

captncraig
Copy link
Contributor

Uses a simple counter that gets incremented every time we make a new check context. When a check runs, if the context has not changed since last time, we wait a second and look again.

Also fixes race condition.

Fixes #2023

break
}
collect.Add("check.context_not_changed", opentsdb.TagSet{"alert": a.Name}, 1)
time.Sleep(time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should be some fraction of the checkFrequency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially chose a millisecond. If this occurs at all, it is because the alert and the global context are perfectly lock stepped, and this one won the race. A second should be more than adequate to "unlock" it in a timely manner.

If the counter explodes, we have a bigger problem, and some bad assumptions.

@kylebrandt
Copy link
Member

What about something like instead of the routines sleeping on time, a timer tells them when they can run. That way the timer can always set the context before sending the signal

@captncraig
Copy link
Contributor Author

I thought about something like that, maybe using sync.Cond as a signal. But I think there is an additional race there.

If the context is updated and signalled between when the alert detects no-change, and when it subscribes to the signal, you miss a whole check interval. My way you only lose a second.

break
}
collect.Add("check.context_not_changed", opentsdb.TagSet{"alert": a.Name}, 1)
time.Sleep(time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should comment in the code on why were are doing the sleep

Craig Peterson added 2 commits February 17, 2017 14:48
@@ -39,15 +48,32 @@ func (s *Schedule) RunAlert(a *conf.Alert) {
s.checksRunning.Add(1)
// ensure when an alert is done it is removed from the wait group
defer s.checksRunning.Done()
var lastCheckID int64 = -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checkId and lastCheckID thing is confusing me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each alert tracks the last id that it ran a check on. we actually could dispense with the field and just track lastCheckTime in the same way. Good point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how's that look now?

@captncraig
Copy link
Contributor Author

Kyle and I had a better idea in hangouts. will implement that instead.

@captncraig captncraig closed this Feb 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Checks that run at CheckFrequency*1 may run twice with same now
2 participants