Adding median reduction #245
Hi, thanks for opening this issue. The heart of the matter for me is that I have highly noisy data, and my naive implementation of multi-dimensional binning (kidpixo/multibinner: a Python module based on pandas to generate multidimensional bins of large datasets) takes ages. Now, datashader literally blew my mind and I would really like to see this thing working. About my naive code: I simply copied the code. Correct me if I'm wrong:
Could this Stack Overflow discussion, "Array of ints in numba", help to add at least an implementation of the mode? Maybe that would be better to implement.
Unfortunately, our existing reduction examples are all for operations that can be implemented incrementally, without having to store all the points that map into each pixel. E.g. a sum() operator can be implemented by simply accumulating a single value, adding each new one without ever increasing any data storage requirements. That's a big part of what makes datashader so efficient, avoiding multiple passes through the dataset or ever having to store the entire dataset in memory. Unfortunately, finding the median of the distribution in a pixel requires access to that distribution, or some approximation of it, so you'd have to take a very different approach from our current operators. @jcrist wrote our existing reductions, but neither he nor I have the bandwidth to write such a function at the moment, though we agree it would be a great thing to support. So alas, for now this is something you'd have to figure out yourself (or hire someone to do here, if you've got the money). Or maybe someone who sees this message will jump in with something!
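To make the incremental property concrete, here is a standalone sketch (not datashader's actual internals, and all names are hypothetical): a sum reduction needs only one accumulator per pixel, updated as points stream by, so memory use is fixed by the canvas size rather than the dataset size.

```python
import numpy as np

# Sketch of an incremental sum reduction over a 1-D "canvas" of pixels.
# Memory is fixed by n_pixels, not by the number of points processed.
def incremental_sum(xs, values, n_pixels):
    agg = np.zeros(n_pixels)
    # Map each x in [0, 1) to a pixel index.
    px = np.minimum((xs * n_pixels).astype(int), n_pixels - 1)
    for p, v in zip(px, values):
        agg[p] += v  # constant work per point, no storage growth
    return agg

xs = np.array([0.05, 0.15, 0.17, 0.95])
vals = np.array([1.0, 2.0, 3.0, 4.0])
agg = incremental_sum(xs, vals, 10)  # pixel 1 accumulates 2 + 3 = 5
```

A median cannot be maintained this way, because no fixed-size accumulator summarizes the full distribution of values seen so far.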
Ok, give me your opinion on this trick: sort the column to be processed by the reduction beforehand. It sounds crazy and computationally heavy, but then I get, for free, for each bin of the final image:
In your opinion, is extracting a certain element from the
Maybe I'm misunderstanding the proposal, but I don't see how sorting the column in the original data frame would help. To compute the median in each pixel, you need the full set of values that map into that pixel, regardless of any ordering in the original non-aggregated data structure. None of our current reduction operators ever keep track of the full set of values that map into a pixel, which is why they are so efficient. If you want to write one that does keep track of all datapoints that fall into a certain pixel, it will be very expensive to run, but computing the median or any other statistic at that point will be trivial. The API allows you to do that, but it's just not something we have previously implemented, and so it would take some thought to put that together, particularly if we want it to be at all efficient. We would welcome a contribution of code that does that!
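The "expensive but trivial" approach described above can be sketched in a few lines (illustrative only, with hypothetical names; not datashader's API): keep every value that maps into each pixel, then take the median per pixel at the end. Memory grows with the dataset, unlike the built-in incremental reductions.

```python
import numpy as np

# Exact per-pixel median by storing the full distribution for each pixel.
def median_aggregate(xs, values, n_pixels):
    buckets = [[] for _ in range(n_pixels)]
    px = np.minimum((xs * n_pixels).astype(int), n_pixels - 1)
    for p, v in zip(px, values):
        buckets[p].append(v)  # keep every datapoint: memory grows with input
    # Finalizing is trivial once the full distributions are available.
    return np.array([np.median(b) if b else np.nan for b in buckets])

xs = np.array([0.11, 0.12, 0.13, 0.71])
vals = np.array([1.0, 10.0, 2.0, 7.0])
agg = median_aggregate(xs, vals, 10)  # pixel 1 holds [1, 10, 2] -> median 2
```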
Correct me if I'm wrong, but:
Now, if I am not wrong, my idea would be to return the count of elements that map into that pixel. Here is where I need some ideas on how to do that, some trick maybe. The "higher order" reductions like `mean` use simpler reduction methods:

```python
@property
def _bases(self):
    return (sum(self.column), count(self.column))

@staticmethod
def _finalize(bases, **kwargs):
    sums, counts = bases
    with np.errstate(divide='ignore', invalid='ignore'):
        x = sums/counts
    return xr.DataArray(x, **kwargs)
```
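The reason `mean` can be built from `sum` and `count` bases is that those bases combine correctly across parallel chunks; a median has no such decomposition. A quick standalone check of both claims:

```python
import numpy as np

# Two chunks of data, as if processed by two parallel workers.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0])

# mean: per-chunk (sum, count) pairs combine exactly into the global mean.
combined_mean = (a.sum() + b.sum()) / (len(a) + len(b))
exact_mean = np.concatenate([a, b]).mean()

# median: no small per-chunk summary recovers the global median;
# e.g. the median of the per-chunk medians is simply wrong here.
median_of_medians = np.median([np.median(a), np.median(b)])  # 8.5
exact_median = np.median(np.concatenate([a, b]))             # 3.0
```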
Right.
Well, not necessarily: if the reduction operation is parallelized, as all of our current ones are, then there is no guarantee that any original ordering in the data frame is preserved. You can certainly implement a sequential aggregation that guarantees to preserve the ordering, but it would have lower performance during the aggregation step. Moreover, it's vastly cheaper to sort a bunch of small lists (per pixel) than one huge list of the same total length, so sorting after aggregation would actually be much faster.

Still, such an approach could make sense if:

(a) you do enough different aggregations on the same dataframe that you don't mind doing an expensive operation once,
(b) you don't have many processing cores to use during aggregation, so you don't mind it being a sequential algorithm, and
(c) you provide a way to tell the reduction function that the original data frame is sorted.

In any case, relying on the original dataframe's ordering is only an implementation detail; it won't make the code shorter, it won't use less memory, and only in the special case outlined above would it make things faster. What's crucial, no matter how you do it, is to keep track of the full set of datapoints per pixel during aggregation, so that you can later extract the median. That's not hard, just not something we have an example of yet, and it will require having enough memory to store two copies of the entire dataset.
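The "many small sorts beat one big sort" point follows from sorting's n log n cost. A rough comparison-count estimate (the sizes below are hypothetical, just to show the scale of the effect):

```python
import math

n = 10_000_000   # total points (hypothetical)
k = 100_000      # pixels, each holding n/k points on average (hypothetical)

one_big_sort = n * math.log2(n)            # sort everything at once
many_small_sorts = n * math.log2(n // k)   # k sorts of n/k elements each

# Per-pixel sorting needs several times fewer comparisons in this scenario.
ratio = one_big_sort / many_small_sorts
```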
BTW, I fully support us adding such an example, because having the full set of datapoints per pixel allows any conceivable reduction operation to be performed, not just a median. So even though it will be much worse in performance than our current ones, having it as an example to follow is a great idea. We just don't have anyone with free time to do it at present; we're using datashader in a lot of different projects, none of which currently requires such an approach, so it would need to be developed by someone else if it's to happen in the near future.
Ok, I get it now. I'm looking forward to an example to follow, if you or someone else finds the time. Thanks a lot for your comments!
Unfortunately, I still don't have an example to provide, but I do have a suggestion from @jcrist. He's got a new library just released on PyPI, with source at https://github.com/jcrist/crick, that provides a few useful approximate algorithms, including one for approximating an unknown distribution in a single pass. This would be very practical to use in datashader, unlike the above completely correct but very impractical approach. So what would really be great is if someone would implement both: the correct but slow approach, and the approximate but very fast and memory-efficient approach from crick.
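I haven't verified crick's exact API, so rather than guess at it, here is the single-pass flavor illustrated with a simpler stand-in: reservoir sampling, which likewise keeps fixed memory per pixel while approximating the distribution (a t-digest as in crick would give much tighter quantile estimates).

```python
import numpy as np

# Single-pass approximate median via reservoir sampling (Algorithm R):
# fixed memory, one pass over the stream, illustrative stand-in for a t-digest.
def reservoir_median(stream, capacity=1000, seed=0):
    rng = np.random.default_rng(seed)
    reservoir = []
    for i, v in enumerate(stream):
        if len(reservoir) < capacity:
            reservoir.append(v)
        else:
            j = rng.integers(0, i + 1)  # keep each item with prob capacity/(i+1)
            if j < capacity:
                reservoir[j] = v
    return np.median(reservoir)

data = np.random.default_rng(42).uniform(0, 1, 100_000)
est = reservoir_median(data)  # should land near the true median, 0.5
```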
It would be great to have a reduction operator that computes a median. Here's a first pass:
https://gist.github.com/kidpixo/16e76ac201375e4d3fb4bf89b8a9b6ad
(Notes adapted from @jcrist:) Unfortunately, median is one of the few reductions that can't be parallelized trivially. In the code above, `self.column` is just a string name, not the column itself, hence the error shown. Implementing `median` will require custom `append`, `combine`, and `finalize` methods, so it will be fairly complicated. There are various parallel median algorithms findable on Google, but it's not clear that they would apply in this case.
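To make the shape of that work concrete, here is a standalone sketch of the three-phase structure for an exact (store-everything) median. The names mirror the append/combine/finalize roles but none of this is datashader's real API; it only shows why `combine` is the awkward part, since merging two chunk aggregates means carrying full per-pixel value lists.

```python
import numpy as np

# append: each worker accumulates the raw values per pixel for its chunk.
def append(agg, pixel, value):
    agg.setdefault(pixel, []).append(value)

# combine: merging two chunk aggregates concatenates their per-pixel lists,
# so the intermediate state is as large as the data itself.
def combine(agg1, agg2):
    out = {p: list(v) for p, v in agg1.items()}
    for p, v in agg2.items():
        out.setdefault(p, []).extend(v)
    return out

# finalize: only now is the median computable, from the full distributions.
def finalize(agg, n_pixels):
    return np.array([np.median(agg[p]) if p in agg else np.nan
                     for p in range(n_pixels)])

# Two "workers" process disjoint chunks, then their results are merged.
a, b = {}, {}
for p, v in [(0, 1.0), (0, 9.0)]:
    append(a, p, v)
for p, v in [(0, 5.0), (2, 3.0)]:
    append(b, p, v)
result = finalize(combine(a, b), 3)  # pixel 0: median of [1, 9, 5] -> 5
```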