leveraging knowledge of past results #719

Open
Dieterbe opened this Issue Feb 13, 2015 · 0 comments

Dieterbe commented Feb 13, 2015

it's common with bosun to check a current value against a past value, for example the median of the current hour compared to the median of the same hour on the same day a week ago.
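to make the kind of check described above concrete, here is a minimal Python sketch (not bosun's expression language). the function name, the sample values and the 0.5 threshold are all assumptions made up for illustration.

```python
from statistics import median

def week_over_week_ratio(current_hour, same_hour_last_week):
    """Median of the current hour divided by the median of the
    same hour on the same day a week ago."""
    return median(current_hour) / median(same_hour_last_week)

# Hypothetical samples, e.g. requests per second.
current_hour = [40, 42, 38]
same_hour_last_week = [100, 101, 99]

ratio = week_over_week_ratio(current_hour, same_hour_last_week)

# Assumed rule: alert when the current median drops below half of last week's.
should_alert = ratio < 0.5
```
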

if there was an outage in that past timeframe, it distorts the logic: an outage or problem now might go undetected (and, perhaps less likely, a good value now might be seen as bad and alerted on).
a way to address this is to incorporate more past timeframes with banding: not just the same hour and day a week ago, but also two and three weeks ago.
this makes the problem less likely, with some caveats:

  • it is still not impossible: there could have been outages at similar times in each period
  • the faulty data is still within the band output, so you have to know what you're doing and which statistical operations apply. for example, median should be safe unless a significant fraction of the time considered was having trouble; avg is unsafe because it always includes all data
  • finally, your alerts only start working properly after a warm-up time of x timeframes (3 weeks in this example)
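the caveats above can be sketched in a few lines of Python. this is an illustration under assumptions, not bosun's band implementation: `band` here is a hypothetical helper that pools the samples of the last `num` past periods, and the data is made up so that one of the three weeks contains an outage.

```python
from statistics import mean, median

def band(history, num):
    """Pool the samples of the last `num` past periods into one list.

    Returns None while fewer than `num` periods exist (the warm-up time).
    """
    if len(history) < num:
        return None
    return [v for period in history[-num:] for v in period]

# Hypothetical samples for the same hour and weekday, oldest first.
history = [
    [100, 102, 98, 101],  # 3 weeks ago, healthy
    [0, 0, 1, 0],         # 2 weeks ago, outage
    [99, 100, 103, 97],   # 1 week ago, healthy
]

pooled = band(history, 3)
band_median = median(pooled)  # stays close to the healthy level
band_mean = mean(pooled)      # dragged down by the outage samples

# With only two weeks of data the band is not usable yet.
warming_up = band(history[:2], 3)
```

with one bad week out of three, the median stays near the healthy values while the average drops by about a third. if two of the three weeks were faulty, the median would no longer be safe either.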

I want to have something a bit more intelligent. I'm not sure yet what it should look like, but here are some ideas:

  • a band function that automatically excludes periods in which we had a non-normal state.
  • a band function that automatically switches from "same hour every x past days" to "same hour on same day every x past weeks" as data becomes available, because in my experience same hour, same day, different weeks is often more representative than same hour of different days.
  • retaining "last known good" values for the values we use in our computations. e.g. if we track median and standard deviation and hit a period that is known to be faulty, stick to the last known good values until we are past the faulty period.
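a rough Python sketch of the first and third ideas, to have something concrete to discuss. everything in it is hypothetical: the `Period` type, the `abnormal` flag (assumed to be recorded by something that knows whether an alert fired in that period), and the `filtered_band` and `Baseline` names. none of this exists in bosun.

```python
from dataclasses import dataclass
from statistics import median, pstdev

@dataclass
class Period:
    values: list
    abnormal: bool  # assumption: recorded elsewhere, e.g. from alert state

def filtered_band(history, num):
    """Pool the most recent `num` periods that were not flagged abnormal."""
    good = [p for p in history if not p.abnormal]
    return [v for p in good[-num:] for v in p.values]

class Baseline:
    """Keeps the last known good median and standard deviation."""

    def __init__(self):
        self.median = None
        self.stdev = None

    def update(self, period):
        # Only healthy periods refresh the baseline; faulty ones are skipped,
        # so the previous values stay in effect until the fault has passed.
        if not period.abnormal:
            self.median = median(period.values)
            self.stdev = pstdev(period.values)
        return self.median, self.stdev

history = [
    Period([100, 102, 98], abnormal=False),
    Period([0, 1, 0], abnormal=True),  # known faulty period
    Period([99, 101, 100], abnormal=False),
]

clean = filtered_band(history, 2)  # the faulty period is left out

baseline = Baseline()
baseline.update(history[0])
median_during_fault, _ = baseline.update(history[1])  # still from period 0
```

the open question in this sketch is where the `abnormal` flag comes from. if it is derived from the same alert that uses the band, a long incident could keep the baseline frozen indefinitely.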

thoughts?
