Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add name kwarg to from_zarr (#1) #4663

Merged
merged 6 commits into from Apr 18, 2019
Merged

Add name kwarg to from_zarr (#1) #4663

merged 6 commits into from Apr 18, 2019

Conversation

@mpeaton
Copy link
Contributor

@mpeaton mpeaton commented Apr 2, 2019

  • Add name kwarg to from_zarr

  • added default from_zarr naming convention per jrbourbeau

  • flake8 fix

  • [ na] Tests added / passed
  • [ x] Passes flake8 dask
@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Apr 2, 2019

For some context, currently the from_zarr function doesn't always produce dask arrays with a unique .name attribute. This is because from_zarr uses the input zarr array repr to generate the output dask array name. For example,

import zarr
import dask
import dask.array as da

a = zarr.array([1, 2, 3])
b = zarr.array([4, 5, 6])
a_dsk = da.from_zarr(a)
b_dsk = da.from_zarr(b)
print(a_dsk.name)
print(b_dsk.name)
print(dask.base.tokenize(a_dsk) == dask.base.tokenize(b_dsk))

produces the following output on the current master:

zarr-<zarr.core.Array (3,) int64>
zarr-<zarr.core.Array (3,) int64>
True

The goal of this PR is to provide a fix such that from_zarr provides unique dask array names.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Thank you for the PR @mpeaton! Could you add a test to dask/array/tests/test_array_core.py to demonstrate that the changes here provide the desired behavior for from_zarr

@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Apr 2, 2019

cc @jakirkham as you may be interested based on your discussion in zarr-developers/zarr-python#202

@mpeaton
Copy link
Contributor Author

@mpeaton mpeaton commented Apr 2, 2019

I have run such a test locally. The capability is innate to the from_array module, so in the original pass-through design, a test was not needed. The bug fix required was to allow proper unique hashing.

Here is the test code I have been using. I suppose it would be possible to built a separate pytest to test "again" the tokenizer, though I would think it to be redundant at present. With regards to how Dask handles creating unique hashes of large datasets in general, my intuition would be that this implies a bigger problem. For the sake of the immediate customer requirement, it seems low risk to utilize the proposed strategy for the bug fix, and devise a new strategy in a performance enhancement. At any rate, here is my current (external) unit test:

import dask.array as da
import zarr
a = zarr.open('a.zarr')
a.require_dataset('x', shape=(1000,), chunks=(100,), dtype='f8')
a['x'][()] = 3
b = zarr.open('b.zarr')
b.require_dataset('x', shape=(1000,), chunks=(100,), dtype='f8')
b['x'][()] = 5

print((a['x'][()] == b['x'][()]).any())
print(da.from_zarr(a['x']).name )
print(da.from_zarr(b['x']).name)
print(dask.base.tokenize(da.from_zarr(a['x'])))
print(dask.base.tokenize(da.from_zarr(b['x'])))

assert(da.from_zarr(a['x']).name != da.from_zarr(b['x']).name)
assert(da.from_zarr(a['x'],name='aasdf').all())
assert(da.from_zarr(b['x'],name='asdf').all())```

@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Apr 3, 2019

I agree that from_zarr ultimately passes name to from_array. I was suggesting to test the default case for from_zarr where name is None and a unique value for name is generated using tokenize. By running locally I can see that a unique name is indeed created, but a short test along the lines of

def test_from_zarr_unique_name():
    a = zarr.array([1, 2, 3])
    b = zarr.array([4, 5, 6])

    assert da.from_zarr(a).name != da.from_zarr(b).name

would act as a safeguard to ensure that this bug doesn't pop up again in the future.

In addition, since this PR adds the new name keyword to the from_zarr function signature, another short test that makes sure name is properly passed to from_array when specified would also be useful here. For example:

def test_from_zarr_name():
    a = zarr.array([1, 2, 3])
    assert da.from_zarr(a, name='foo').name == 'foo'

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 5, 2019

You may want to rebase on master to take those image and arrow xfails out of the diff

@mpeaton
Copy link
Contributor Author

@mpeaton mpeaton commented Apr 5, 2019

Why is appveryor having trouble with asciitree? I am running conda 4.6.8 locally, and the dependency of zarr on asciitree and fasteners is handled appropriately. I am new to appveyor, what is going on here?

ModuleNotFoundError: No module named 'asciitree'
!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!
==================== 10 skipped, 1 error in 10.20 seconds =====================

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Thanks for adding a test @mpeaton! I've included a few comments

I think we can avoid adding zarr to the Appveyor CI in this PR. While I'm in favor of this addition generally (thank you @mpeaton for noticing zarr is not being installed in the Appveyor environment), since it doesn't seem to be as simple as adding conda install zarr we can open up a separate issue for this.

Also there are a couple of minor code linting issues https://travis-ci.org/dask/dask/jobs/516317003#L3142-L3145

dask/array/tests/test_array_core.py Outdated Show resolved Hide resolved
dask/array/tests/test_array_core.py Outdated Show resolved Hide resolved
dask/array/tests/test_array_core.py Outdated Show resolved Hide resolved
dask/array/tests/test_array_core.py Outdated Show resolved Hide resolved
dask/array/core.py Outdated Show resolved Hide resolved
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Looks like other commits are showing up here, you may want to rebase on master to take the parquet fix out of the diff.

Since we're adding the name keyword to from_zarr, could you add a test to ensure that when name is given, the output dask array has the specified name. Something like the following should work:

def test_from_zarr_name():
    zarr = pytest.importorskip('zarr')
    a = zarr.array([1, 2, 3])
    assert da.from_zarr(a, name='foo').name == 'foo'

dask/array/tests/test_array_core.py Show resolved Hide resolved
@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Apr 10, 2019

The test failures showing up on Travis CI are related to a recent bokeh update and have been fixed in #4680.

Other than updating the name docstring description (ref #4663 (comment)), this PR looks good to me

@mrocklin
Copy link
Member

@mrocklin mrocklin commented Apr 12, 2019

Checking in here. Is there anything left to do on this PR?

@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Apr 12, 2019

There is one lingering (minor) comment about modifying the name docstring description (see #4663 (comment)). @mpeaton do you have any comments on the proposed description?

Copy link
Contributor Author

@mpeaton mpeaton left a comment

name: str, optional
The key name used for the array. Defaults to the string 'from-zarr-' prepended to a hash of x.

dask/array/tests/test_array_core.py Show resolved Hide resolved
@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Apr 13, 2019

Thanks, I like your suggestion @mpeaton. One very minor comment, since there is no input parameter x for from_zarr, could we say "a hash of the input zarr array" instead of "a hash of x"?

Also, just as a note, currently the output from tokenize(z, ...) when z is a zarr array is random string. It corresponds to uuid.uuid4().hex returned here

return normalize_function(o) if callable(o) else uuid.uuid4().hex

@mpeaton
Copy link
Contributor Author

@mpeaton mpeaton commented Apr 13, 2019

@jrbourbeau I see. Would it be concise, yet accurate enough to replace x with url?
Which raises the question: What happens in the case where isinstance(url, str)? Does the mapper immediately evaluate the mapper or is it a lazy reference?

@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Apr 13, 2019

Would it be concise, yet accurate enough to replace x with url?

Thinking about this again, perhaps the most accurate, yet concise, description would be a general statement about hashing the input (since we're tokenize-ing all the from_zarr inputs like component and chunks). For example, the name description in the DataFrame.from_pandas function is:

name: string, optional
An optional keyname for the dataframe. Defaults to hashing the input

A similar statement would work here too. "An optional key name for the array. Defaults to hashing the input." Apologies for going back and forth on this point.

Does the mapper immediately evaluate the mapper or is it a lazy reference?

Data should be loaded lazily

mpeaton and others added 4 commits Apr 15, 2019
* Add name kwarg to from_zarr

* added default from_zarr naming convention per jrbourbeau

* flake8 fix
dask/array/core.py Outdated Show resolved Hide resolved
Co-Authored-By: mpeaton <mpeaton@users.noreply.github.com>
Copy link
Member

@jrbourbeau jrbourbeau left a comment

LGTM, merging this afternoon if there are no additional comments

@jrbourbeau jrbourbeau merged commit a9fbcb2 into dask:master Apr 18, 2019
3 checks passed
@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Apr 18, 2019

Thanks @mpeaton! I noticed this was your first code contribution to this repository, welcome!

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019
* Add name kwarg to from_zarr (dask#1)

* Add name kwarg to from_zarr

* added default from_zarr naming convention per jrbourbeau

* flake8 fix

* Added test for from_zarr name hashing

* added name test

* update from_zarr docstring

* grammer curl

Co-Authored-By: mpeaton <mpeaton@users.noreply.github.com>

* python-snappy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants