median/nanmedian #46

shoyer · 2015-02-23T00:49:00Z

I'm not entirely sure what algorithms for an out-of-core median look like, but if this is relatively straightforward, it would be a nice addition to dask.

mrocklin · 2015-02-23T01:05:59Z

My guess is that it isn't particularly straightforward. My intuition says that this is one of the things that we give up when we move to data that might live outside of memory. I can take a look though. How should I prioritize this? Is this important for xray?

shoyer · 2015-02-23T01:15:31Z

No, definitely not essential. Typically, I would guess that the intermediate arrays could each fit into memory, just not all at once, e.g., taking the median along one axis of a 1000x1000x10000 array. I'm not entirely sure if this sort of thing belongs in dask or not. Summary statistics of billions of items are usually not interesting -- subsampling is the appropriate strategy.
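To illustrate the one-axis case, here is a minimal sketch in plain NumPy (not dask's API). The helper name `blockwise_median` and the `block_size` parameter are hypothetical. It assumes the array has at least two dimensions and that the full reduced axis fits in memory for each slab, so the result is exact rather than approximate.

```python
import numpy as np

def blockwise_median(arr, axis, block_size):
    """Exact median along `axis`, computed one slab at a time.

    Hypothetical helper: slabs are cut along a *different* axis, so each
    slab still holds the complete reduced axis and np.median stays exact.
    """
    # Pick an axis to split along that is not the one being reduced.
    split_axis = 0 if axis != 0 else 1
    pieces = []
    for start in range(0, arr.shape[split_axis], block_size):
        index = [slice(None)] * arr.ndim
        index[split_axis] = slice(start, start + block_size)
        slab = arr[tuple(index)]  # only this slab needs to be in memory
        pieces.append(np.median(slab, axis=axis))
    # Reducing `axis` shifts the split axis left by one if it came after it.
    out_axis = split_axis - 1 if axis < split_axis else split_axis
    return np.concatenate(pieces, axis=out_axis)

# Small stand-in for the 1000x1000x10000 case.
x = np.random.default_rng(0).normal(size=(10, 12, 50))
result = blockwise_median(x, axis=2, block_size=3)
```

With an array backed by disk (for example a memory-mapped file), only one slab would need to be loaded at a time.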

mrocklin · 2015-02-23T01:16:39Z

Just as topk can replace sorted in a distributed context, what is the appropriate replacement for median?

Sample then median?
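A rough sketch of what "sample then median" could look like, in plain NumPy. The function name `sampled_median` and its parameters are hypothetical, and it assumes the chunks are roughly equal in size; with unequal chunks, a fixed sample per chunk would over-weight the small ones.

```python
import numpy as np

def sampled_median(chunks, sample_size, seed=0):
    """Approximate the global median from a random sample of each chunk.

    Hypothetical helper: draws up to `sample_size` values per chunk without
    replacement, then takes the exact median of the pooled sample.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for chunk in chunks:
        flat = np.asarray(chunk).ravel()
        k = min(sample_size, flat.size)
        samples.append(rng.choice(flat, size=k, replace=False))
    return float(np.median(np.concatenate(samples)))

# Twenty equally sized chunks standing in for out-of-core blocks.
data_rng = np.random.default_rng(42)
chunks = [data_rng.normal(size=10_000) for _ in range(20)]
estimate = sampled_median(chunks, sample_size=500)
```

The estimate is only as good as the pooled sample; its error shrinks roughly with the square root of the total number of sampled values.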

eriknw · 2015-02-24T19:05:57Z

I think the equivalent would be approximate percentiles, which can be improved if you know the form of the pdf or cdf. For instance, Boost.Accumulators has streaming versions of such things:

http://www.boost.org/doc/libs/1_57_0/doc/html/accumulators/user_s_guide.html#accumulators.user_s_guide.the_statistical_accumulators_library.median
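For comparison, one simple way to get approximate percentiles in a single pass is to accumulate a fixed-bin histogram across chunks and interpolate its cumulative distribution. This is a sketch of that histogram approach, not the algorithm Boost.Accumulators uses. The name `histogram_quantile` and its parameters are hypothetical, and it assumes the data range `[lo, hi]` is known in advance; values outside that range are silently ignored.

```python
import numpy as np

def histogram_quantile(chunks, q, lo, hi, bins=1000):
    """Approximate the q-th quantile (0 <= q <= 1) from streamed chunks.

    Hypothetical helper: keeps only `bins` counters in memory, so the
    accuracy is limited by the bin width (hi - lo) / bins.
    """
    edges = np.linspace(lo, hi, bins + 1)
    counts = np.zeros(bins, dtype=np.int64)
    for chunk in chunks:
        chunk_counts, _ = np.histogram(chunk, bins=edges)
        counts += chunk_counts
    # Empirical CDF evaluated at each bin edge, starting from 0 at `lo`.
    cdf = np.concatenate([[0.0], np.cumsum(counts) / counts.sum()])
    # Invert the CDF by linear interpolation between bin edges.
    return float(np.interp(q, cdf, edges))

data_rng = np.random.default_rng(7)
chunks = [data_rng.uniform(0.0, 1.0, size=10_000) for _ in range(10)]
approx_median = histogram_quantile(chunks, q=0.5, lo=0.0, hi=1.0)
```

Knowing more about the distribution, as suggested above, would allow better bin placement than the uniform edges assumed here.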

mrocklin · 2015-02-25T20:29:52Z

This is sufficiently far down the priority list with dask that I'm going to close for now. Approximate quantiles sounds like a nicer and more immediate issue that could probably be used to spoof median in a pinch.

mrocklin closed this as completed Feb 25, 2015

stsievert mentioned this issue Jul 26, 2018: MAINT: implement median #3819 (Closed)

jangorecki mentioned this issue Jan 10, 2019: median function for dask dataframe #4362 (Closed)
