
Median Absolute Deviation #26681

Closed
carlosvega opened this issue Sep 17, 2017 · 15 comments

@carlosvega

carlosvega commented Sep 17, 2017

Feature request:

Since we are already able to calculate the median, which is a robust estimator, it would be great if we could also calculate the Median Absolute Deviation (MAD), for more robust statistics.

These robust estimators aim to prevent individual outliers from heavily influencing the result. MAD is more resilient to outliers in a data set than the standard deviation. For example, the breakdown point of the median is 50%, whereas the breakdown point of the mean is 0%, since a single value is enough to shift the mean of a set of values. This would be very helpful for analysis performed through Elasticsearch.
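To illustrate the resilience claim, here is a quick Python sketch (not from the thread) comparing MAD and standard deviation on a small data set with one extreme outlier:

```python
import statistics

data = [2, 3, 4, 5, 100]  # one extreme outlier

median = statistics.median(data)                        # 4
mad = statistics.median(abs(x - median) for x in data)  # 1
stdev = statistics.stdev(data)                          # ~43.2

# The single outlier barely moves the MAD but inflates the
# standard deviation by more than an order of magnitude.
print(mad, stdev)
```

Replacing the outlier `100` with a value near the rest of the data leaves the MAD unchanged, while the standard deviation collapses to about 1.3.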

@danielmitterdorfer
Member

@colings86, @polyfractal do you have any thoughts on this?

@colings86
Contributor

This is an interesting idea and looks like a good candidate for a new aggregation. However, we would need to investigate methods for calculating (or estimating) this in a streaming, one-pass fashion. The basic way to calculate it involves first iterating over the values to find the median, and then iterating over the values again to calculate each value's deviation from that median. In the aggregation collect phase we only have one pass over the data, so we can do either one of the above steps but not both. If there is a method of calculating or estimating this value in a one-pass fashion, then we could explore implementing it.

Note that you could calculate the MAD today by calculating the median of the values in the normal way in a first request, and then using a script in the percentiles aggregation in a second request to subtract each value from the median obtained from the first request. I appreciate it would be nicer to have this in a single request though.
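In Python terms, that two-request workaround amounts to the following two-pass computation (a sketch using exact medians where Elasticsearch would use percentile estimates):

```python
import statistics

def two_pass_mad(values):
    # Pass 1: find the median of the data (the first request).
    median = statistics.median(values)
    # Pass 2: median of absolute deviations from that median
    # (the scripted percentiles aggregation in the second request).
    return statistics.median(abs(v - median) for v in values)
```

For example, `two_pass_mad([1, 2, 3, 4, 5])` returns 1: the median is 3 and the deviations are [2, 1, 0, 1, 2].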

@carlosvega
Author

carlosvega commented Sep 18, 2017

There are some methods available, with a bit of literature about them. There might be some inaccuracies in the result [1][2], but it can be done online.

However, I think that if you first need a value like the median, which itself requires one pass, it is not possible to compute the MAD with only one pass over the data.

@colings86
Contributor

One way to make this algorithm streaming might be to use a binning approach: estimate the median based on the median of previous 'bins' of values, updating the median at the end of every bin and using that value for the deviation of values from the median in the next bin (described for mean absolute deviation in https://stats.stackexchange.com/a/3378). I think this relies on the assumption that the median does not change quickly (where 'quickly' in this context means within a bin), so we can assume that the median of the previous bin is a good estimator of the median for the next bin. This means that the size of the bin we choose will be important and should probably also be configurable (at least to start with).

@carlosvega
Author

> One way to make this algorithm streaming might be to use a binning approach to estimate the median based on the median of previous 'bins' of values so you update the median at the end of every bin and use that value for the deviation of values from the median for the next bin

I think this method would be great. Also, the median should not change that quickly, at least if the bin is relatively small. The median would need 50% of its values to be different in order to change from its previous value.

> This means that the size of the bin we choose will be important and should also probably be configurable (at least to start with)

Couldn't the size of the bin be, for example, the size of the date histogram bin?

@colings86
Contributor

Actually, I think there is also an assumption that the median does not change significantly over the whole dataset, since the estimates of the median at the start of the stream don't know what the median will be at the end of the stream. If, for example, the median value was 10 after 1 million values but 10 million after 5 million values, the estimates of the deviations from the median would be quite off, since the estimate of the median was off by a factor of 10 for at least 20% of the data. We largely collect in insertion order, which for a lot of use cases translates to timestamp order, so I think there is some investigation to be done to find an approach that is not too sensitive to median changes over time (or at least to find and document all the assumptions) to make this a useful tool in Elasticsearch. Very interesting subject though 😄

P.S. we actually already use T-Digest to estimate percentiles (and hence the median) in the percentiles aggregation. We would almost certainly use it here too, but unfortunately it doesn't solve the fact that we need the median (or at least an estimate of the median) for the whole dataset to be able to calculate each value's deviation from that median. This is something the binning technique might solve, but it's a fairly naive approach, so we may need to be a bit more clever with the implementation.

@polyfractal
Contributor

So amusingly enough, I have some old code that implemented MAD in an aggregation. I'm trying to track down the branch that had the agg, but the algo was essentially:

  1. Set approxMedian equal to the value of the first point, currentValue
  2. For each point:
    1. Add the point's value to a T-Digest sketch (values)
    2. Calculate the deviation as Math.abs(approxMedian - currentValue), and add it to a secondary T-Digest sketch (deviations)
  3. Every n points, recalculate the approximate median by setting approxMedian = values.quantile(0.5)
  4. When all data has been visited, the MAD is deviations.quantile(0.5) * 1.4826
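The steps above can be sketched roughly like this in Python, with exact lists and `statistics.median` standing in for the two T-Digest sketches (`recalc_every` plays the role of "every n points"):

```python
import statistics

def streaming_mad(points, recalc_every=100):
    """One-pass MAD estimate; exact lists stand in for T-Digests."""
    values, deviations = [], []
    approx_median = None
    for i, v in enumerate(points, start=1):
        if approx_median is None:           # step 1: seed with the first point
            approx_median = v
        values.append(v)                                # step 2.1
        deviations.append(abs(approx_median - v))       # step 2.2
        if i % recalc_every == 0:                       # step 3
            approx_median = statistics.median(values)
    # step 4: 1.4826 is the consistency constant that makes the MAD a
    # consistent estimator of the standard deviation for normal data
    return statistics.median(deviations) * 1.4826
```

Note how the cold-start assumption shows up directly: the first `recalc_every` deviations are all measured against the very first value, so a stream whose median drifts early will skew the result.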

It's very similar to what @colings86 suggested with binning. One T-Digest is used to record the streaming median, and it is periodically inspected to set the "approximate global median". This approximate global median is used to calculate the "approximate deviation" for every point. These deviations are then placed into a T-Digest and used to calculate the median of deviations at the end.

It worked well enough in my informal testing at the time. As @colings86 noted, it does assume you have "enough data" to smooth out the median, that the cold startup period isn't a big influence on the final MAD, and that the median doesn't change too drastically over the lifetime of the stream.

At the time, I also contemplated adding some sort of "replay" mechanism so that the first n% of points could be replayed through the algo once the median starts to stabilize. It'd add complexity and runtime, and potentially skew the MAD itself (since you're replaying values), but it might provide a better estimate. I never tested it though.

@carlosvega
Author

It would be great to have some index stats and to use warmers to recalculate them for this kind of thing. Something like: "each day, recalculate some statistics (like the median) over my previous X days of data and store them to be used with these aggregations". That could speed up certain visualizations and aggregations. Your approach sounds great. Thanks for the answers.

@colings86
Contributor

We discussed this in FixItFriday and decided that we would like to add this as a pipeline aggregation. It would use the T-Digest produced by a percentiles aggregator to get the median, and then use the centroids of that t-digest as an approximation of the data, as input to a new t-digest for the deviations from the median.
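Conceptually, that centroid-based reduce step might look like the following naive Python sketch. Representing centroids as `(mean, weight)` pairs matches what a t-digest centroid carries, but the function name and the use of an exact list in place of the second t-digest are illustrative assumptions, not the real implementation:

```python
import statistics

def mad_from_centroids(centroids, median):
    """centroids: (mean, weight) pairs from the values t-digest.

    Treat each centroid as `weight` copies of its mean and take the
    median of the resulting absolute deviations. A real implementation
    would feed these deviations into another t-digest instead of a list.
    """
    deviations = []
    for mean, weight in centroids:
        deviations.extend([abs(mean - median)] * weight)
    return statistics.median(deviations)
```

The appeal of this approach is that the collect phase stays a plain percentiles collection; the deviations sketch is built entirely in the reduce phase from the already-compressed centroids.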

@colings86 added the "help wanted" and "adoptme" labels and removed the "discuss" label on Sep 22, 2017
@polyfractal
Contributor

@elastic/es-search-aggs

@andyb-elastic
Contributor

Some rally benchmarks for the implementations we've looked at so far - these just cover service performance and don't say anything about accuracy. This is using the noaa dataset track with 4 shards and default compression settings

For the method Zach described in this comment, which is calculating an approximate median and accumulating deviation measurements in the collection phase

For the same algorithm as the one above, but recalculating the median less frequently after a threshold of documents has been reached

This method was pretty slow, and recalculating the median less frequently seemed, anecdotally, to decrease accuracy enough that tests failed. It does seem like the median recalculation is a bottleneck; taking jstacks always showed a thread in the t-digest's quantile calculation method.

For another method Zach described elsewhere, which is collecting a tdigest sketch of values, then in reduce using the percentile results of that sketch to construct a similar tdigest sketch of deviations from the median

For the method Colin described in this comment, which is collecting a tdigest sketch of values, and then in reduce using that tdigest's centroids to construct a similar tdigest sketch of deviations from the median

Anecdotally, from testing, the centroids version seems more accurate. In the percentiles version I didn't make the number of quantiles calculated configurable, though.

@andyb-elastic
Contributor

I'll have some accuracy benchmarks next. It's looking like the centroid method is likely going to be the one to use

@andyb-elastic
Contributor

Here are some accuracy benchmarks that I've roughly digested by hand (well, with jq).

Results

tl;dr: it looks like the centroids method is the most accurate. With the default compression settings it's within 0.01 relative error in all benchmarking scenarios except for sparsely Pareto-distributed data, and for that distribution, increasing the compression (higher -> more accuracy, more resource usage) brought it under the 0.01 threshold.

Calculating an approximate median and accumulating deviation measurements in the collection phase

This method has many scenarios with high relative error across all benchmarking variables. I'm not aware of any modifications to this algorithm that could make it more accurate. It was also by far the slowest during Rally benchmarking, even with optimizations that were not included in these speed benchmarks. We can rule this method out.

Collecting a tdigest sketch of values, then in reduce using the percentile results of that sketch to construct a similar tdigest sketch of deviations from the median

This method has similar results to the centroids method at the 0.05 relative error threshold, but it has more scenarios above the 0.01 threshold. It seems to deal with small data sizes poorly. This algorithm can be modified by changing the percentile intervals used to build the deviation t-digest sketch, which I did not test here. These tests use thousandth-quantiles.

Collecting a tdigest sketch of values, and then in reduce using that tdigest's centroids to construct a similar tdigest sketch of deviations from the median

This method's results are summarized at the top of this comment. This method behaves similarly to the percentiles method but with greater accuracy. I think we can safely choose this method at this point, and it looks like our default compression factor choice is good but could be higher if we want to be safe.

Methodology

The benchmarking driver code is in the mad-client-test project. It's the same on all three of the commits mentioned in this comment.

The variables tested were dataset size = [100, 1000, 10000], tdigest compression factor = [20, 100, 1000] (100 is the default we have set for this agg and the percentiles agg), data distribution = [uniform, normal, pareto], and data density = [dense, sparse]. If you want to run these benchmarks with other values for these variables, you can set system properties benchmark.sizes, benchmark.compressions, benchmark.distributions, and benchmark.densities in gradle run.

Each scenario (that is, a combination of these variables) is run through ten iterations, with each iteration creating a new index and data set. Each iteration takes ten measurements of the aggregation on the same dataset, and then records the median of those measurements. That median of measurements is used as the approximate MAD for that iteration. For a scenario to be listed as exceeding the error thresholds mentioned above, at least one of its iterations' approximate MADs had to be outside of the error threshold.

@colings86
Contributor

I'm happy to go with the centroids method given the above results. @polyfractal do you agree?

@andyb-elastic
Contributor

This was implemented in #34482
