median/nanmedian #46
Comments
My guess is that it isn't particularly straightforward. My intuition says that this is one of the things that we give up when we move to data that might live outside of memory. I can take a look though. How should I prioritize this? Is this important for xray?
No, definitely not essential. I would guess that each intermediate array typically fits into memory, just not all of them at once, e.g., when taking the median along one axis of a 1000x1000x10000 array. I'm not entirely sure if this sort of thing belongs in dask or not. Summary statistics of billions of items are usually not interesting; subsampling is the appropriate strategy.
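To make the single-axis case above concrete, here is a minimal sketch using plain NumPy rather than any dask API. It assumes the array can be sliced lazily (for example a memory-mapped or on-disk array) and that one slab spanning the full reduced axis fits in memory. The function name `median_along_axis0_in_slabs` and the `slab_width` parameter are made up for illustration.

```python
import numpy as np

def median_along_axis0_in_slabs(arr, slab_width):
    """Exact median along axis 0, computed one slab of axis 1 at a time.

    Assumes arr has at least two dimensions and supports NumPy-style slicing.
    """
    n_cols = arr.shape[1]
    out = np.empty(arr.shape[1:], dtype=float)
    for start in range(0, n_cols, slab_width):
        stop = min(start + slab_width, n_cols)
        # Only this slab is materialized; it contains the whole reduced axis,
        # so the median of each column in it is exact.
        slab = np.asarray(arr[:, start:stop])
        out[start:stop] = np.median(slab, axis=0)
    return out
```

This only works because the reduction axis stays whole inside each slab. If the reduced axis itself is too long to hold in memory, an exact median needs a different approach.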
So just sample, then take the median?
I think the equivalent would be approximate percentiles, which can be improved if you know the form of the PDF or CDF. For instance, Boost.Accumulators has streaming versions of such things.
This is sufficiently far down the priority list with dask that I'm going to close for now. Approximate quantiles sounds like a nicer and more immediate issue that could probably be used to spoof median in a pinch.
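For reference, the "sample, then median" idea from this thread could look roughly like the sketch below. It is not dask code and not the Boost.Accumulators algorithm. It uses reservoir sampling (Algorithm R) to keep a fixed-size uniform sample while streaming over chunks, then takes the median of that sample. The name `sampled_median` and its parameters are hypothetical.

```python
import numpy as np

def sampled_median(chunks, sample_size, seed=0):
    """Approximate median of a stream of array chunks via reservoir sampling."""
    rng = np.random.default_rng(seed)
    reservoir = np.empty(sample_size, dtype=float)
    seen = 0  # number of values consumed so far
    for chunk in chunks:
        for value in np.ravel(chunk):
            if seen < sample_size:
                # Fill the reservoir with the first sample_size values.
                reservoir[seen] = value
            else:
                # Keep the new value with probability sample_size / (seen + 1).
                j = rng.integers(0, seen + 1)
                if j < sample_size:
                    reservoir[j] = value
            seen += 1
    # If fewer values than sample_size were seen, the result is exact.
    return float(np.median(reservoir[:min(seen, sample_size)]))
```

Memory use is bounded by `sample_size` regardless of how many chunks are streamed. The per-element Python loop is slow and is kept only for clarity.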
I'm not entirely sure what algorithms for out-of-core median look like, but if this is relatively straightforward it would be a nice addition to dask.