Check to ensure no two consecutive check runs share a timestamp. #2025

captncraig · 2017-02-17T20:17:08Z

Uses a simple counter that gets incremented every time we make a new check context. When a check runs, if the context has not changed since last time, we wait a second and look again.

Also fixes a race condition.

Fixes #2023

kylebrandt · 2017-02-17T21:10:41Z

cmd/bosun/sched/alertRunner.go

+				break
+			}
+			collect.Add("check.context_not_changed", opentsdb.TagSet{"alert": a.Name}, 1)
+			time.Sleep(time.Second)


maybe this should be some fraction of the checkFrequency?

captncraig · 2017-02-17 (edited)

I initially chose a millisecond. If this occurs at all, it is because the alert and the global context are in perfect lockstep, and this one won the race. A second should be more than adequate to "unlock" it in a timely manner.

If the counter explodes, we have a bigger problem, and some bad assumptions.

kylebrandt · 2017-02-17T21:13:53Z

What about, instead of the routines sleeping on time, having a timer tell them when they can run? That way the timer can always set the context before sending the signal.

captncraig · 2017-02-17T21:16:33Z

I thought about something like that, maybe using sync.Cond as a signal. But I think there is an additional race there.

If the context is updated and signalled between when the alert detects no-change, and when it subscribes to the signal, you miss a whole check interval. My way you only lose a second.

kylebrandt · 2017-02-17T21:19:23Z

cmd/bosun/sched/alertRunner.go

+				break
+			}
+			collect.Add("check.context_not_changed", opentsdb.TagSet{"alert": a.Name}, 1)
+			time.Sleep(time.Second)


We should comment in the code on why we are doing the sleep.

kylebrandt · 2017-02-17T21:57:38Z

cmd/bosun/sched/alertRunner.go

@@ -39,15 +48,32 @@ func (s *Schedule) RunAlert(a *conf.Alert) {
 	s.checksRunning.Add(1)
 	// ensure when an alert is done it is removed from the wait group
 	defer s.checksRunning.Done()
+	var lastCheckID int64 = -1


The checkId and lastCheckID thing is confusing me.

captncraig · 2017-02-17

Each alert tracks the last ID that it ran a check on. We could actually dispense with the field and just track lastCheckTime in the same way. Good point.

captncraig · 2017-02-17

How's that look now?

captncraig · 2017-02-24T19:59:02Z

Kyle and I had a better idea in Hangouts. Will implement that instead.

Craig Peterson added 2 commits February 17, 2017 13:14

Check to ensure no two consecutive check runs share a timestamp.

1f4c519

fix

d7214f2

kylebrandt reviewed Feb 17, 2017


Craig Peterson added 2 commits February 17, 2017 14:48

fix

c85da46

comment

1bf9ca5

kylebrandt reviewed Feb 17, 2017


no need extra field

ddb8f33

captncraig closed this Feb 24, 2017
