
cumbersome to validate/optimize algorithms historically #636

Open
Dieterbe opened this Issue Jan 7, 2015 · 6 comments

@Dieterbe
Contributor

Dieterbe commented Jan 7, 2015

so far, I've been doing my alerting via graphite, graph-explorer and grafana.
A common example for me looks like so:
[screenshot: raw graphite signal on top, processed anomaly signal below]

what we're seeing here: on top, a signal that comes straight out of graphite.
on this signal I want to detect sudden drops and spikes. as you can see there are a few of those,
and the bottom graph has some processing applied to represent how "anomalous" the signal is, going into the orange/red area whenever there's something bad in the input signal.

often I need to do some fine-tuning of the algorithm, and for this to work well I need to be able to correlate the bottom signal to the top one.
I was hoping to implement this workflow with bosun using the timeline view, but it doesn't seem to work very well, I think for these reasons:

  • hard to see the correlation between the ok/warn/crit state and the rest of the data used, such as the input series or the intermediate processed series. it would be nice to have a more extensive timeline view that lets me plot the series I want in addition to the alerting state, so I can correlate things to each other.
  • too coarse: from what I understand, if I want to run the algorithm on a week's or a month's worth of data and see how it looks, bosun will execute a request for each "alerting rule run", so you're basically forced to bring the number of intervals down in order not to overload your backend (admittedly, graphite might be more sensitive to this than opentsdb).
    in the example above we get all the data we need in only 2 requests, each covering the entire timeframe, so the processing logic has all the data it needs at all times. that starts to pay off a lot when you want to run the alert at short intervals but each run requires a lot of prior data (see the sketch after this list).
  • if we wanted to use graphing like the expression view, the svg rendering is still very slow; canvas rendering would help a lot (there are a few js libs that do it)
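to make the "2 requests covering the entire timeframe" idea concrete, here's a minimal sketch of that workflow in Go: fetch the series once for the whole range, then evaluate the detection logic over a sliding window at data resolution. all names here (Point, fetchSeries, anomalyScore) are made up for illustration; this is not bosun or graphite API:

```go
package main

import "fmt"

// Point is one (timestamp, value) sample; a hypothetical type for illustration.
type Point struct {
	TS  int64
	Val float64
}

// fetchSeries fakes a single broad query; a real implementation would hit
// graphite/opentsdb once for the whole [start, end) range.
func fetchSeries(metric string, start, end int64) []Point {
	var s []Point
	for ts := start; ts < end; ts += 60 { // minutely resolution
		s = append(s, Point{TS: ts, Val: float64(ts % 3600)})
	}
	return s
}

// anomalyScore stands in for whatever detection logic the alert runs; here it
// just measures how far the newest point deviates from the window mean.
func anomalyScore(window []Point) float64 {
	var sum float64
	for _, p := range window[:len(window)-1] {
		sum += p.Val
	}
	mean := sum / float64(len(window)-1)
	return window[len(window)-1].Val - mean
}

func main() {
	// one request covering the entire timeframe, instead of one request
	// per alerting-rule run
	series := fetchSeries("my.metric", 0, 24*3600)

	const lookback = 10 // points of history each evaluation needs
	for i := lookback; i < len(series); i++ {
		score := anomalyScore(series[i-lookback : i+1])
		fmt.Printf("t=%d score=%.2f\n", series[i].TS, score)
	}
}
```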

@Dieterbe Dieterbe changed the title from cumbersome to validate algorithms historically to cumbersome to validate/optimize algorithms historically Jan 7, 2015

@Dieterbe


Contributor

Dieterbe commented Jan 7, 2015

@kylebrandt @mjibson thoughts?

@kylebrandt


Member

kylebrandt commented Jan 7, 2015

As far as "too course" I have thought about this too. Seems like it might be kind of hard though, the expression tree will need to be inspected, and the min max time of all queries needs to be discovered (tricker when you start to bring in things like the band function) and then chop up the results - then query those results instead.

Agree that it would be very nice because it will make the testing page very fast.

As far as the view goes, that is tough one, I have alerts that query many different time series, going to be hard to create visualization for that. This all sounds good - but seems like a good deal of work to me.
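for illustration, a rough sketch of that chopping idea, with made-up types (Node, Point, chop) rather than bosun's actual expression code: walk the tree to find the widest lookback any query needs, issue one broad request covering it, then answer each per-run query from a sub-slice of the cached result:

```go
package main

import "fmt"

// Node is a hypothetical expression-tree node; leaves carry the seconds of
// history they need relative to the evaluation time.
type Node struct {
	Lookback int64
	Children []*Node
}

// maxLookback walks the tree to find the widest time range any query needs,
// so one broad request can cover every evaluation.
func maxLookback(n *Node) int64 {
	max := n.Lookback
	for _, c := range n.Children {
		if l := maxLookback(c); l > max {
			max = l
		}
	}
	return max
}

// Point is one (timestamp, value) sample.
type Point struct {
	TS  int64
	Val float64
}

// chop answers a per-run query from the cached broad result instead of
// re-querying the backend: it returns the sub-slice covering
// [checkTime-lookback, checkTime].
func chop(series []Point, checkTime, lookback int64) []Point {
	var out []Point
	for _, p := range series {
		if p.TS >= checkTime-lookback && p.TS <= checkTime {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	tree := &Node{Children: []*Node{{Lookback: 3600}, {Lookback: 7200}}}
	lb := maxLookback(tree) // one request must cover at least this much history

	var series []Point
	for ts := int64(0); ts < 12000; ts += 60 {
		series = append(series, Point{TS: ts, Val: float64(ts)})
	}
	win := chop(series, 9000, lb)
	fmt.Printf("lookback=%ds, window has %d points\n", lb, len(win))
}
```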

@Dieterbe


Contributor

Dieterbe commented Jan 7, 2015

As far as the view goes, that is a tough one; I have alerts that query many different time series,
and it's going to be hard to create a visualization for that. This all sounds good, but it seems like a good deal of work to me.

what if there was a template for the timeline and the user could override it?
like the default:

{{.Eval .Status}}

which would show the colored bar like it does now, whereas

{{.Eval .SomeSeries}}
{{.Eval .Status}}

would show the colored bar as before, but with a minimal graph above it (of similar or the same height) showing the requested series.
seems to me like a lot of the needed code (timeline and graphs) already exists, and it seems like a powerful and flexible solution. thoughts? the hardest part might end up being making sure the time axes of the graphs and the timeline are in sync, but if we use the same lib for both the timeline and the graphs then they should be automatically aligned. we should even be able to render everything in 1 visualisation, see for example http://www.flotcharts.org/flot/examples/series-types/index.html

@Dieterbe


Contributor

Dieterbe commented Jan 14, 2015

i've made some progress towards the timeline view with arbitrary series graphs, but after further discussion with Matt it became apparent we would only be able to plot the points at each "time of check". I had been assuming I could plot them at my data resolution (1 minute) across the entire timespan, and since that won't be possible, there's not much point in this anymore.

would love to take this back up once we can do the checks at the same resolution as the data (minutely), even over long timeframes such as weeks or months. so maybe we should aim for that.

@Dieterbe


Contributor

Dieterbe commented Jan 14, 2015

suggestion:

  • in web/expr.go, Rule(), take out the worker stuff and issue all requests at once. leave it up to the backend (graphite, opentsdb, ...) to implement its own worker pool. this allows fine-tuning of concurrency/rate-limiting on a per-backend basis, which seems to make more sense
  • the additional benefit is that in scenarios where the code knows it will execute a bunch of (possibly adjacent) requests, it can let the backend know before issuing the first one and again once all requests are issued.
  • that mechanism enables the backend, once it receives the "about to issue a whole bunch of requests" message, to start collecting requests without executing them until it receives the "done issuing requests" message. it can then inspect the requests, possibly execute broader merged requests, and once it receives the results, generate small sub-responses which it feeds back to the original requests (see the sketch below).

this should allow us to do rule testing across weeks or months of data at the same resolution the data is in. as far as I can tell there would not be much code overhead, and we wouldn't bombard the server.
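here's a minimal sketch of that begin/collect/done protocol, with made-up types (Request, Response, Backend) that are not bosun's actual code, assuming a single metric and minutely resolution for brevity:

```go
package main

import "fmt"

// Request/Response are hypothetical stand-ins for a backend query and result.
type Request struct {
	Metric     string
	Start, End int64
}
type Response struct {
	Req  Request
	Data []float64
}

// Backend collects requests between Begin and Done, then answers them all
// from one merged broad query instead of one backend query per rule run.
type Backend struct {
	batching bool
	pending  []Request
}

// Begin signals "about to issue a whole bunch of requests".
func (b *Backend) Begin() { b.batching = true }

// Issue queues the request while batching (the immediate path is omitted).
func (b *Backend) Issue(r Request) {
	if b.batching {
		b.pending = append(b.pending, r)
	}
}

// Done signals "done issuing requests": merge all pending requests into one
// broad request, execute it, and fan sub-responses back out.
func (b *Backend) Done() []Response {
	b.batching = false
	if len(b.pending) == 0 {
		return nil
	}
	min, max := b.pending[0].Start, b.pending[0].End
	for _, r := range b.pending[1:] {
		if r.Start < min {
			min = r.Start
		}
		if r.End > max {
			max = r.End
		}
	}
	broad := execute(Request{Metric: b.pending[0].Metric, Start: min, End: max})
	var out []Response
	for _, r := range b.pending {
		// slice the broad result down to this request's window
		lo, hi := (r.Start-min)/60, (r.End-min)/60
		out = append(out, Response{Req: r, Data: broad[lo:hi]})
	}
	b.pending = nil
	return out
}

// execute fakes one broad backend query at minutely resolution.
func execute(r Request) []float64 {
	data := make([]float64, (r.End-r.Start)/60)
	for i := range data {
		data[i] = float64(i)
	}
	return data
}

func main() {
	b := &Backend{}
	b.Begin()
	b.Issue(Request{"my.metric", 0, 3600})
	b.Issue(Request{"my.metric", 1800, 7200})
	for _, resp := range b.Done() {
		fmt.Printf("%+v -> %d points\n", resp.Req, len(resp.Data))
	}
}
```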

One thing to keep in mind is request profiling: we currently only start timing once a request starts getting processed by a worker, and we don't do any request merging, so these profiling stats could look a little different. that doesn't seem like a big issue though.

anything else I'm forgetting? or would this cover it?

@mjibson @kylebrandt

@Dieterbe


Contributor

Dieterbe commented Jan 14, 2015

also, while we're at it, it would be nice to be able to zoom in on the status & associated graphs (and have a reset zoom button somewhere)
