
cumbersome to validate/optimize algorithms historically #636

Open
Dieterbe opened this Issue Jan 7, 2015 · 6 comments

@Dieterbe
Contributor

Dieterbe commented Jan 7, 2015

so far, I've been doing my alerting via graphite, graph-explorer and grafana.
A common example for me looks like so:
[screenshot: raw graphite signal on top, processed anomaly signal below]

what we're seeing here: on top, a signal that comes straight out of graphite.
on this signal I want to detect sudden drops and spikes. as you can see there are a few of those,
and the bottom graph has some processing applied to represent how "anomalous" the signal is, going into the orange/red area whenever there's something bad in the input signal.

often I need to do some fine-tuning of the algorithm, and for this to work well I need to be able to correlate the bottom signal to the top one.
I was hoping to implement this workflow with bosun using the timeline view, but it doesn't seem to work very well, I think for these reasons:

  • hard to see the correlation between the ok/warn/crit state and the rest of the data used, such as the input series or the intermediate processed series. it would be nice to have a more extensive timeline view that lets me plot the series I want in addition to the alerting state, so I can correlate things to each other.
  • too coarse: from what I understand, if I want to run the algorithm on a week's or a month's worth of data and see how it looks, bosun will execute a request for each "alerting rule run", so you're basically forced to bring the number of intervals down in order not to overload your backend (admittedly, graphite might be more sensitive to this than opentsdb).
    in the example above we get all the data we need in only 2 requests, each covering the entire timeframe, so the processing logic has all the data it needs at all times. that starts to pay off a lot when you want to run the alert at short intervals but each run requires a lot of prior data (see the sketch after this list).
  • if we wanted to use graphing like the expression view, the svg rendering is still very slow; canvas rendering would help a lot (there are a few js libs that do it)
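to make the "2 requests covering the entire timeframe" idea concrete, here's a minimal sketch of that workflow in Go: fetch the series once for the whole range, then evaluate the detection logic over a sliding window at data resolution. all names here (Point, fetchSeries, anomalyScore) are made up for illustration; this is not bosun or graphite API:

```go
package main

import "fmt"

// Point is one (timestamp, value) sample; a hypothetical type for illustration.
type Point struct {
	TS  int64
	Val float64
}

// fetchSeries fakes a single broad query; a real implementation would hit
// graphite/opentsdb once for the whole [start, end) range.
func fetchSeries(metric string, start, end int64) []Point {
	var s []Point
	for ts := start; ts < end; ts += 60 { // minutely resolution
		s = append(s, Point{TS: ts, Val: float64(ts % 3600)})
	}
	return s
}

// anomalyScore stands in for whatever detection logic the alert runs; here it
// just measures how far the newest point deviates from the window mean.
func anomalyScore(window []Point) float64 {
	var sum float64
	for _, p := range window[:len(window)-1] {
		sum += p.Val
	}
	mean := sum / float64(len(window)-1)
	return window[len(window)-1].Val - mean
}

func main() {
	// one request covering the entire timeframe, instead of one request
	// per alerting-rule run
	series := fetchSeries("my.metric", 0, 24*3600)

	const lookback = 10 // points of history each evaluation needs
	for i := lookback; i < len(series); i++ {
		score := anomalyScore(series[i-lookback : i+1])
		fmt.Printf("t=%d score=%.2f\n", series[i].TS, score)
	}
}
```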

@Dieterbe Dieterbe changed the title from cumbersome to validate algorithms historically to cumbersome to validate/optimize algorithms historically Jan 7, 2015

@Dieterbe


Contributor

Dieterbe commented Jan 7, 2015

@kylebrandt @mjibson thoughts?

@kylebrandt


Member

kylebrandt commented Jan 7, 2015

As far as "too course" I have thought about this too. Seems like it might be kind of hard though, the expression tree will need to be inspected, and the min max time of all queries needs to be discovered (tricker when you start to bring in things like the band function) and then chop up the results - then query those results instead.

Agree that it would be very nice because it will make the testing page very fast.

As far as the view goes, that is tough one, I have alerts that query many different time series, going to be hard to create visualization for that. This all sounds good - but seems like a good deal of work to me.
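for illustration, a rough sketch of that chopping idea, with made-up types (Node, Point, chop) rather than bosun's actual expression code: walk the tree to find the widest lookback any query needs, issue one broad request covering it, then answer each per-run query from a sub-slice of the cached result:

```go
package main

import "fmt"

// Node is a hypothetical expression-tree node; leaves carry the seconds of
// history they need relative to the evaluation time.
type Node struct {
	Lookback int64
	Children []*Node
}

// maxLookback walks the tree to find the widest time range any query needs,
// so one broad request can cover every evaluation.
func maxLookback(n *Node) int64 {
	max := n.Lookback
	for _, c := range n.Children {
		if l := maxLookback(c); l > max {
			max = l
		}
	}
	return max
}

// Point is one (timestamp, value) sample.
type Point struct {
	TS  int64
	Val float64
}

// chop answers a per-run query from the cached broad result instead of
// re-querying the backend: it returns the sub-slice covering
// [checkTime-lookback, checkTime].
func chop(series []Point, checkTime, lookback int64) []Point {
	var out []Point
	for _, p := range series {
		if p.TS >= checkTime-lookback && p.TS <= checkTime {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	tree := &Node{Children: []*Node{{Lookback: 3600}, {Lookback: 7200}}}
	lb := maxLookback(tree) // one request must cover at least this much history

	var series []Point
	for ts := int64(0); ts < 12000; ts += 60 {
		series = append(series, Point{TS: ts, Val: float64(ts)})
	}
	win := chop(series, 9000, lb)
	fmt.Printf("lookback=%ds, window has %d points\n", lb, len(win))
}
```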

@Dieterbe


Contributor

Dieterbe commented Jan 7, 2015

As far as the view goes, that is a tough one; I have alerts that query many different time series,
and it's going to be hard to create a visualization for that. This all sounds good, but it seems like a good deal of work to me.

what if there was a template for the timeline and the user could override it?
like the default:

{{.Eval .Status}}

which would show the colored bar like it does now, whereas

{{.Eval .SomeSeries}}
{{.Eval .Status}}

would show the colored bar as before, but with a minimal graph above it (of similar or the same height) showing the requested series.
seems to me like a lot of the needed code (timeline and graphs) already exists, and it seems like a powerful and flexible solution. thoughts? the hardest part might end up being making sure the time axes of the graphs and the timeline are in sync, but if we use the same lib for both the timeline and the graphs then they should be automatically aligned. we should even be able to render everything in 1 visualisation, see for example http://www.flotcharts.org/flot/examples/series-types/index.html

@Dieterbe


Contributor

Dieterbe commented Jan 14, 2015

i've made some progress towards the timeline view with arbitrary series graphs, but after further discussion with Matt it became apparent we would only be able to plot the points at each "time of check". I had been assuming I could plot them at my data resolution (1 minute) across the entire timespan, and since that won't be possible, there's not much point in this anymore.

would love to take this back up once we can do the checks at the same resolution as the data (minutely), even over long timeframes such as weeks or months. so maybe we should aim for that.

@Dieterbe


Contributor

Dieterbe commented Jan 14, 2015

suggestion:

  • in web/expr.go, Rule(), take out the worker stuff and issue all requests at once. leave it up to the backend (graphite, opentsdb, ...) to implement its own worker pool. this allows fine-tuning of concurrency/rate-limiting on a per-backend basis, which seems to make more sense
  • the additional benefit is that in scenarios where the code knows it will execute a bunch of (possibly adjacent) requests, it can let the backend know before issuing the first one and again once all requests are issued.
  • that mechanism enables the backend, once it receives the "about to issue a whole bunch of requests" message, to start collecting requests without executing them until it receives the "done issuing requests" message. it can then inspect the requests, possibly execute broader merged requests, and once it receives the results, generate small sub-responses which it feeds back to the original requests (see the sketch below).

this should allow us to do rule testing across weeks or months of data at the same resolution the data is in. as far as I can tell there would not be much code overhead, and we wouldn't bombard the server.
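here's a minimal sketch of that begin/collect/done protocol, with made-up types (Request, Response, Backend) that are not bosun's actual code, assuming a single metric and minutely resolution for brevity:

```go
package main

import "fmt"

// Request/Response are hypothetical stand-ins for a backend query and result.
type Request struct {
	Metric     string
	Start, End int64
}
type Response struct {
	Req  Request
	Data []float64
}

// Backend collects requests between Begin and Done, then answers them all
// from one merged broad query instead of one backend query per rule run.
type Backend struct {
	batching bool
	pending  []Request
}

// Begin signals "about to issue a whole bunch of requests".
func (b *Backend) Begin() { b.batching = true }

// Issue queues the request while batching (the immediate path is omitted).
func (b *Backend) Issue(r Request) {
	if b.batching {
		b.pending = append(b.pending, r)
	}
}

// Done signals "done issuing requests": merge all pending requests into one
// broad request, execute it, and fan sub-responses back out.
func (b *Backend) Done() []Response {
	b.batching = false
	if len(b.pending) == 0 {
		return nil
	}
	min, max := b.pending[0].Start, b.pending[0].End
	for _, r := range b.pending[1:] {
		if r.Start < min {
			min = r.Start
		}
		if r.End > max {
			max = r.End
		}
	}
	broad := execute(Request{Metric: b.pending[0].Metric, Start: min, End: max})
	var out []Response
	for _, r := range b.pending {
		// slice the broad result down to this request's window
		lo, hi := (r.Start-min)/60, (r.End-min)/60
		out = append(out, Response{Req: r, Data: broad[lo:hi]})
	}
	b.pending = nil
	return out
}

// execute fakes one broad backend query at minutely resolution.
func execute(r Request) []float64 {
	data := make([]float64, (r.End-r.Start)/60)
	for i := range data {
		data[i] = float64(i)
	}
	return data
}

func main() {
	b := &Backend{}
	b.Begin()
	b.Issue(Request{"my.metric", 0, 3600})
	b.Issue(Request{"my.metric", 1800, 7200})
	for _, resp := range b.Done() {
		fmt.Printf("%+v -> %d points\n", resp.Req, len(resp.Data))
	}
}
```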

One thing to keep in mind is request profiling: we currently only start timing once a request starts getting processed by a worker, and we don't do any request merging, so these profiling stats could look a little different. that doesn't seem like a big issue though.

anything else I'm forgetting? or would this cover it?

@mjibson @kylebrandt

@Dieterbe


Contributor

Dieterbe commented Jan 14, 2015

also, while we're at it, it would be nice to be able to zoom in on the status & associated graphs (and have a reset zoom button somewhere)
