
API design question: default value of `index` for ndmeasure functions #116

Open
GenevieveBuckley opened this issue May 6, 2019 · 4 comments

@GenevieveBuckley
Contributor

commented May 6, 2019

This is a question about the API design for label indices in dask_image.ndmeasure functions.

When labels is given but index is None, the label array is overwritten internally and becomes a binary mask. This was not very intuitive to me, and it makes the behaviour of the ndmeasure functions inconsistent: in some cases (index=None) you get a single aggregate value, while in others you get a value for each individual label (even when there are multiple indices).
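
For example, with a small toy image (a sketch of the behaviour as I understand it; the return values assume area simply counts labelled pixels):

import numpy as np
import dask.array as da
import dask_image.ndmeasure as ndmeasure

image = da.ones((3, 3), chunks=(3, 3))
labels = da.from_array(np.array([[1, 1, 0], [1, 0, 3], [0, 7, 0]]), chunks=(3, 3))

# index=None: labels is silently collapsed to a mask, and a single
# aggregate value comes back (the total area of all labelled pixels).
print(ndmeasure.area(image, labels).compute())                   # -> 5

# Explicit index: one value per label, which is what I usually want.
print(ndmeasure.area(image, labels, index=[1, 3, 7]).compute())  # -> [3 1 1]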

I think there's an argument to be made that when labels is given and index=None, the default should be the range of all non-zero labels (e.g. [1, 2, 3, ..., n]). This would mean (a) no nasty surprise aggregations, and (b) you wouldn't need to write index=da.arange(da.max(labels)) near-constantly, or fall back on the slightly clunkier labeled_comprehension() syntax (I can never remember the six input arguments).

Questions

  1. Are the majority of use cases different from what I imagine here? If what I expect to be the most common usage scenario is actually pretty uncommon, I may need to rethink my opinion.
  2. What is your opinion on replacing:
def _norm_input_labels_index(input, labels=None, index=None):
    ...
    elif index is None:
        labels = (labels > 0).astype(int)
        index = dask.array.ones(tuple(), dtype=int, chunks=tuple())

with this instead:

def _norm_input_labels_index(input, labels=None, index=None):
    ...
    elif index is None:
        index = dask.array.arange(dask.array.max(labels) + 1)[1:]

and making a separate mask() convenience function available instead?

In my view, it's much clearer that area(input, mask(labels)) is expected to return an aggregate value than it is with area(input, labels, index=None).
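
A rough sketch of what I have in mind for that mask() helper, just pulling the masking step out of _norm_input_labels_index:

def mask(labels):
    # Collapse a label image into a binary mask, so that "aggregate over
    # everything labelled" becomes an explicit, opt-in choice.
    return (labels > 0).astype(int)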

@jakirkham

Member

commented May 6, 2019

This behavior comes from scipy.ndimage, although I'm not totally sure why they settled on it there. I agree it seems unintuitive, and from a usability standpoint something along the lines of your proposal sounds reasonable and is probably preferable.
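
For comparison, scipy.ndimage shows the same pattern; a quick sketch with scipy.ndimage.sum:

import numpy as np
from scipy import ndimage

image = np.ones((3, 3))
labels = np.array([[1, 1, 0], [1, 0, 3], [0, 7, 0]])

# index=None aggregates over every pixel where labels > 0...
print(ndimage.sum(image, labels))                   # 5.0
# ...while an explicit index returns one value per label.
print(ndimage.sum(image, labels, index=[1, 3, 7]))  # [3. 1. 1.]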

Have you been using this argument to index thus far? How has that been working for you? I would assume that dask.array.arange would force a computation of the value provided, but maybe that is not the case. If it is causing a computation, we might want to think a bit more about how to avoid that, or fix arange to accept unknown values. Thoughts on this?
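
To illustrate the concern (a sketch; the explicit compute here stands in for the coercion da.arange would otherwise have to do internally to get a concrete stop value):

import numpy as np
import dask.array as da

labels = da.from_array(np.array([[1, 1, 0], [1, 0, 3], [0, 7, 0]]), chunks=(1, 3))

# da.arange needs a concrete stop value to decide its shape and chunking,
# so building the default index this way evaluates max(labels) eagerly:
stop = int(da.max(labels).compute()) + 1
index = da.arange(1, stop)
print(index.compute())  # [1 2 3 4 5 6 7]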

It's worth noting this change would be a behavioral break, so we would want to convey it to users somehow (perhaps with a recommendation on how to keep the old behavior and avoid the warning?). I'm a little unsure how best to do that, though. Do you have any thoughts?

@GenevieveBuckley

Contributor Author

commented May 6, 2019

  1. I'm glad you agree; I think this is one case where usability trumps strict adherence to the scipy API.

  2. Avoiding immediate computation: I think you're right, good catch! I've only used this with small toy data. I have one idea; I'll post code in a follow-up comment.

  3. Deprecation cycles: scikit-learn suggests raising a FutureWarning like this:

import warnings

def example_function(n_clusters='warn'):
    if n_clusters == 'warn':
        warnings.warn("The default value of n_clusters will change from "
                      "5 to 10 in 0.22.", FutureWarning)
        n_clusters = 5

More info on scikit-learn's deprecation policy is available here.
Scikit-image also makes its deprecation-cycle policy available.
Both projects are more established and more widely used, so there may or may not be latitude for us to be less strict than a two-release deprecation cycle. I'm not sure, and would want to seek advice from some of the dask core devs on this point.

I couldn't find a policy page for deprecations in dask, but I'm sure there's at least an unofficial policy tucked away in someone's head. We should talk to them.
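
For concreteness, here's how that FutureWarning pattern might translate to the ndmeasure functions (a rough sketch; the sentinel value and message are placeholders):

import warnings

def area(input, labels=None, index='warn'):
    # Placeholder sketch of a deprecation shim for the ndmeasure functions.
    if labels is not None and index == 'warn':
        warnings.warn(
            "The default value of index will change: instead of aggregating "
            "over a mask, one value per non-zero label will be returned. "
            "Pass index explicitly to keep the old behaviour and silence "
            "this warning.", FutureWarning)
        index = None  # keep the old implicit-mask behaviour for now
    ...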

@GenevieveBuckley

Contributor Author

commented May 6, 2019

This might work for a lazy computation of index values. Thoughts?

import dask.array as da
from dask.array import Array
from dask.base import tokenize
from dask.highlevelgraph import HighLevelGraph
import numpy as np
from operator import add

def _index_default(labels):
    # Largest label value, as a lazy zero-dimensional dask array.
    max_label = da.max(labels)
    name = 'index-default-' + tokenize(labels)
    # np.arange(max) gives [0, ..., max - 1]; adding 1 shifts this to the
    # non-zero label values [1, ..., max].
    dsk = {(name, 0): (add, 1, (np.arange, (max_label.name,)))}
    graph = HighLevelGraph.from_collections(name, dsk, dependencies=[max_label])
    # The length of the result is unknown until compute time, hence nan chunks.
    return Array(graph, name, ((np.nan,),), int)


x = da.from_array([[1, 1, 0], [1, 0, 3], [0, 7, 0]], chunks=(1, 3))

result = _index_default(x)

print(result.compute())  # [1 2 3 4 5 6 7]
result.visualize()

[task graph rendered by result.visualize()]

@GenevieveBuckley

Contributor Author

commented May 6, 2019

Hm, this could be a problem though: there are several references to index.shape in labeled_comprehension() (and also in minimum_position() and maximum_position()), and the lazily-built index above has an unknown shape until it is computed.
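
To make the problem concrete (reusing _index_default and x from my previous comment):

result = _index_default(x)
print(result.shape)  # (nan,) -- the length is unknown until compute time
# Anything that inspects index.shape up front, e.g. to preallocate an
# output of matching size, will choke on the nan dimension.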
