median/nanmedian #46

Closed
shoyer opened this issue Feb 23, 2015 · 5 comments

Comments

@shoyer
Member

shoyer commented Feb 23, 2015

I'm not entirely sure what algorithms for an out-of-core median look like, but if this is relatively straightforward, it would be a nice addition to dask.

@mrocklin
Member

My guess is that it isn't particularly straightforward. My intuition says that this is one of the things that we give up when we move to data that might live outside of memory. I can take a look though. How should I prioritize this? Is this important for xray?

@shoyer
Member Author

shoyer commented Feb 23, 2015

No, definitely not essential. Typically, I would guess that the intermediate arrays could fit into memory, but not all at once, e.g., taking the median along one axis of a 1000x1000x10000 array. I'm not entirely sure if this sort of thing belongs in dask or not. Summary statistics of billions of items are usually not interesting -- subsampling is the appropriate strategy.
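
For illustration, the "fits in memory, but not all at once" case can be sketched with plain NumPy by reducing one slab at a time. This is only a sketch under stated assumptions, not dask code: `blockwise_median` and its `block` parameter are hypothetical names, the input is assumed to have at least two dimensions, and the full reduction axis of each slab is assumed to fit in memory.

```python
import numpy as np

def blockwise_median(arr, axis=-1, block=100):
    """Exact median along `axis`, computed one slab at a time.

    `arr` may be any array-like with NumPy-style slicing (e.g. np.memmap)
    and at least 2 dimensions. Slabs are cut along an axis other than the
    reduction axis, so each slab holds the complete reduction axis.
    """
    axis = axis % arr.ndim
    split_axis = 0 if axis != 0 else 1          # axis we cut slabs along
    out_shape = tuple(n for i, n in enumerate(arr.shape) if i != axis)
    out = np.empty(out_shape, dtype=np.float64)
    # Position of split_axis in the output, once `axis` has been dropped.
    out_split = split_axis - (1 if axis < split_axis else 0)
    for start in range(0, arr.shape[split_axis], block):
        stop = min(start + block, arr.shape[split_axis])
        src = [slice(None)] * arr.ndim
        src[split_axis] = slice(start, stop)
        dst = [slice(None)] * out.ndim
        dst[out_split] = slice(start, stop)
        # Only this slab is loaded into memory at once.
        out[tuple(dst)] = np.median(np.asarray(arr[tuple(src)]), axis=axis)
    return out
```

The result is exact because the median is only taken along one axis; the slabs never split that axis, so no approximation is needed in this case.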

@mrocklin
Member

Just as topk can replace sorted in a distributed context, what is the appropriate replacement for median?

Sample then median?
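
A minimal sketch of "sample then median", assuming the data arrives as an iterable of NumPy-compatible chunks. The name `sampled_median` and the `fraction` and `seed` parameters are hypothetical, chosen for illustration; the result is an estimate, not the exact median.

```python
import numpy as np

def sampled_median(chunks, fraction=0.01, seed=0):
    """Estimate the global median from a uniform subsample of each chunk."""
    rng = np.random.default_rng(seed)
    samples = []
    for chunk in chunks:
        flat = np.asarray(chunk).ravel()
        if flat.size == 0:
            continue
        # Same sampling fraction per chunk, so larger chunks contribute
        # proportionally more values to the pooled sample.
        k = max(1, int(round(fraction * flat.size)))
        samples.append(rng.choice(flat, size=k, replace=False))
    return float(np.median(np.concatenate(samples)))
```

Only the pooled sample has to fit in memory, so `fraction` trades accuracy against memory use.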

@eriknw
Member

eriknw commented Feb 24, 2015

I think the equivalent would be approximate percentiles, which can be improved if you know the form of the pdf or cdf. For instance, Boost.Accumulators has streaming versions of such things:

http://www.boost.org/doc/libs/1_57_0/doc/html/accumulators/user_s_guide.html#accumulators.user_s_guide.the_statistical_accumulators_library.median
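
As a rough illustration of approximate percentiles over chunked data, here is a two-pass histogram sketch. It is a simpler alternative, not the streaming estimators that Boost.Accumulators provides, and it makes no use of a known pdf or cdf. The name `approx_quantile` and the `bins` parameter are hypothetical, and `chunks` is assumed to be a sequence of non-empty arrays that can be iterated more than once.

```python
import numpy as np

def approx_quantile(chunks, q=0.5, bins=1024):
    """Approximate the q-quantile of all values in `chunks`.

    Pass 1 finds the global range; pass 2 accumulates a fixed-bin
    histogram, which is then inverted by linear interpolation.
    """
    lo = min(float(np.min(c)) for c in chunks)
    hi = max(float(np.max(c)) for c in chunks)
    if lo == hi:
        return lo
    edges = np.linspace(lo, hi, bins + 1)
    counts = np.zeros(bins, dtype=np.int64)
    for c in chunks:
        counts += np.histogram(c, bins=edges)[0]   # per-chunk counts add up
    cum = np.cumsum(counts)
    target = q * cum[-1]
    # First bin whose cumulative count reaches the target rank.
    i = min(int(np.searchsorted(cum, target, side="left")), bins - 1)
    below = cum[i - 1] if i > 0 else 0
    frac = (target - below) / counts[i] if counts[i] else 0.0
    return float(edges[i] + frac * (edges[i + 1] - edges[i]))
```

Because the per-chunk histograms simply add, only `bins` counters need to be kept in memory. For densely populated data the estimate is typically within about one bin width, `(hi - lo) / bins`, of the true quantile.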

@mrocklin
Member

This is sufficiently far down the priority list for dask that I'm going to close it for now. Approximate quantiles sound like a nicer and more immediate issue, and they could probably be used to spoof median in a pinch.

3 participants