with_dependencies method for adding arbitrary dependencies #1519

Open
shoyer opened this issue Sep 1, 2016 · 3 comments

@shoyer
Member

shoyer commented Sep 1, 2016

For example, suppose x is a dask array or dataframe that we want to verify contains only valid (non-null) values:

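import dask
import dask.array as da

@dask.delayed
def assert_(condition):
    assert condition

# with_dependencies is the method proposed in this issue, not an existing dask API
x = x.with_dependencies([assert_(da.notnull(x).all())])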

Under the covers, this would create a new dask object in which every task additionally depends on the supplied dependency tasks, so those tasks must finish successfully before any part of the result is computed.

Potentially, this would be quite useful for tools like xarray, so we could defer equality checks until we've built the entire graph. One concern is that this might trigger a failure case for the dask scheduler (#874), since checks will inevitably require looking at the entire dataset.

This is a more general solution to #97, inspired by the corresponding design from TensorFlow:
https://www.tensorflow.org/versions/r0.10/api_docs/python/check_ops.html#assert_negative
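
A rough sketch of how the behaviour proposed in this comment might be emulated at the graph level. Nothing here is an existing dask method: with_dependencies and _passthrough are hypothetical names, the layer layout is only one possible choice, and ~da.isnan stands in for da.notnull to keep the example free of a pandas requirement.

```python
import numpy as np
import dask
import dask.array as da
from dask.base import tokenize
from dask.core import flatten
from dask.highlevelgraph import HighLevelGraph


@dask.delayed
def assert_(condition):
    assert condition, "dependency check failed"


def _passthrough(block, _deps):
    # _deps holds the already-computed dependency results; they are ignored.
    return block


def with_dependencies(x, deps):
    """Hypothetical helper: copy of ``x`` whose every block also waits on ``deps``."""
    deps = list(deps)
    name = "with-dependencies-" + tokenize(x, *deps)
    dep_keys = [d.key for d in deps]
    # One pass-through task per block of x; each task references the original
    # block key plus the keys of all dependency tasks.
    layer = {
        (name,) + tuple(key[1:]): (_passthrough, key, dep_keys)
        for key in flatten(x.__dask_keys__())
    }
    graph = HighLevelGraph.from_collections(name, layer, dependencies=[x, *deps])
    return da.Array(graph, name, chunks=x.chunks, dtype=x.dtype)


x = da.from_array(np.array([1.0, 2.0, 3.0, 4.0]), chunks=2)
checked = with_dependencies(x, [assert_((~da.isnan(x)).all())])
result = checked.compute(scheduler="synchronous")
```

If the check fails, computing any block of the returned array should raise the AssertionError from assert_. Note that the check itself still reads all of x, which is the scheduler concern raised above.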

@jcrist
Member

jcrist commented Sep 6, 2016

Just to clarify, these are equivalent?

res1 = x.with_dependencies([assert_(da.notnull(x).all())])

# Is semantically equivalent to:

def check(x, cond):
    assert cond
    return x

res2 = x.map_blocks(check, da.notnull(x).all())

As you stated above, global checks will inevitably lead to computing and caching the entire array, removing any out-of-core benefits. As such, I'm reluctant to add a method for this when it will result in poor performance in many use cases (I would rather not make it easy to do poorly performing things).
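
For reference, a minimal runnable sketch of the map_blocks form described in this comment. The helper name require_no_nan is hypothetical, ~da.isnan is used in place of da.notnull to avoid needing pandas, and dtype/meta are passed explicitly as an assumption about how best to keep dask from calling check on placeholder data while inferring the output type.

```python
import numpy as np
import dask.array as da


def check(block, cond):
    # cond arrives as the computed 0-d boolean result of the global reduction.
    assert cond, "array contains NaN values"
    return block


def require_no_nan(x):
    """Hypothetical helper: every output block depends on a global NaN check."""
    all_valid = (~da.isnan(x)).all()
    return x.map_blocks(
        check,
        all_valid,
        dtype=x.dtype,
        meta=np.array((), dtype=x.dtype),  # skip meta inference via check()
    )


x = da.from_array(np.array([1.0, 2.0, 3.0, 4.0]), chunks=2)
res2 = require_no_nan(x)
values = res2.compute(scheduler="synchronous")
```

Because all_valid is a reduction over the whole array, every block of res2 depends on every block of x, which is the source of the performance concern described in this comment.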

@shoyer
Member Author

shoyer commented Sep 6, 2016

> Just to clarify, these are equivalent?

Yes.

> As such, I'm reluctant to add a method for this when it will result in poor performance in many use cases (I would rather not make it easy to do poorly performing things).

Agreed. I do think this is further evidence of how significant a shortcoming this is for the scheduler, though, because the need for this sort of pattern is quite common.

@jakirkham
Member

@shoyer, is this issue still relevant for you?
