Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add name kwarg to from_zarr (#1) #4663

Merged
merged 6 commits into from Apr 18, 2019

Conversation

Projects
None yet
5 participants
@mpeaton
Copy link
Contributor

commented Apr 2, 2019

  • Add name kwarg to from_zarr

  • added default from_zarr naming convention per jrbourbeau

  • flake8 fix

  • [ na] Tests added / passed
  • [ x] Passes flake8 dask
@jrbourbeau

This comment has been minimized.

Copy link
Member

commented Apr 2, 2019

For some context, currently the from_zarr function doesn't always produce dask arrays with a unique .name attribute. This is because from_zarr uses the input zarr array repr to generate the output dask array name. For example,

import zarr
import dask
import dask.array as da

a = zarr.array([1, 2, 3])
b = zarr.array([4, 5, 6])
a_dsk = da.from_zarr(a)
b_dsk = da.from_zarr(b)
print(a_dsk.name)
print(b_dsk.name)
print(dask.base.tokenize(a_dsk) == dask.base.tokenize(b_dsk))

produces the following output on the current master:

zarr-<zarr.core.Array (3,) int64>
zarr-<zarr.core.Array (3,) int64>
True

The goal of this PR is to provide a fix such that from_zarr provides unique dask array names.

@jrbourbeau
Copy link
Member

left a comment

Thank you for the PR @mpeaton! Could you add a test to dask/array/tests/test_array_core.py to demonstrate that the changes here provide the desired behavior for from_zarr

@jrbourbeau

This comment has been minimized.

Copy link
Member

commented Apr 2, 2019

cc @jakirkham as you may be interested based on your discussion in zarr-developers/zarr#202

@mpeaton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 2, 2019

I have run such a test locally. The capability is innate to the from_array module, so in the original pass-through design, a test was not needed. The bug fix required was to allow proper unique hashing.

Here is the test code I have been using. I suppose it would be possible to built a separate pytest to test "again" the tokenizer, though I would think it to be redundant at present. With regards to how Dask handles creating unique hashes of large datasets in general, my intuition would be that this implies a bigger problem. For the sake of the immediate customer requirement, it seems low risk to utilize the proposed strategy for the bug fix, and devise a new strategy in a performance enhancement. At any rate, here is my current (external) unit test:

import dask.array as da
import zarr
a = zarr.open('a.zarr')
a.require_dataset('x', shape=(1000,), chunks=(100,), dtype='f8')
a['x'][()] = 3
b = zarr.open('b.zarr')
b.require_dataset('x', shape=(1000,), chunks=(100,), dtype='f8')
b['x'][()] = 5

print((a['x'][()] == b['x'][()]).any())
print(da.from_zarr(a['x']).name )
print(da.from_zarr(b['x']).name)
print(dask.base.tokenize(da.from_zarr(a['x'])))
print(dask.base.tokenize(da.from_zarr(b['x'])))

assert(da.from_zarr(a['x']).name != da.from_zarr(b['x']).name)
assert(da.from_zarr(a['x'],name='aasdf').all())
assert(da.from_zarr(b['x'],name='asdf').all())```
@jrbourbeau

This comment has been minimized.

Copy link
Member

commented Apr 3, 2019

I agree that from_zarr ultimately passes name to from_array. I was suggesting to test the default case for from_zarr where name is None and a unique value for name is generated using tokenize. By running locally I can see that a unique name is indeed created, but a short test along the lines of

def test_from_zarr_unique_name():
    a = zarr.array([1, 2, 3])
    b = zarr.array([4, 5, 6])

    assert da.from_zarr(a).name != da.from_zarr(b).name

would act as a safeguard to ensure that this bug doesn't pop up again in the future.

In addition, since this PR adds the new name keyword to the from_zarr function signature, another short test that makes sure name is properly passed to from_array when specified would also be useful here. For example:

def test_from_zarr_name():
    a = zarr.array([1, 2, 3])
    assert da.from_zarr(a, name='foo').name == 'foo'
@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 5, 2019

You may want to rebase on master to take those image and arrow xfails out of the diff

@mpeaton mpeaton force-pushed the Quansight:master branch from 88d3ae0 to 097b8a7 Apr 5, 2019

@mpeaton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 5, 2019

Why is appveryor having trouble with asciitree? I am running conda 4.6.8 locally, and the dependency of zarr on asciitree and fasteners is handled appropriately. I am new to appveyor, what is going on here?

ModuleNotFoundError: No module named 'asciitree'
!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!
==================== 10 skipped, 1 error in 10.20 seconds =====================

@jrbourbeau
Copy link
Member

left a comment

Thanks for adding a test @mpeaton! I've included a few comments

I think we can avoid adding zarr to the Appveyor CI in this PR. While I'm in favor of this addition generally (thank you @mpeaton for noticing zarr is not being installed in the Appveyor environment), since it doesn't seem to be as simple as adding conda install zarr we can open up a separate issue for this.

Also there are a couple of minor code linting issues https://travis-ci.org/dask/dask/jobs/516317003#L3142-L3145

Show resolved Hide resolved dask/array/tests/test_array_core.py Outdated
Show resolved Hide resolved dask/array/tests/test_array_core.py Outdated
Show resolved Hide resolved dask/array/tests/test_array_core.py Outdated
Show resolved Hide resolved dask/array/tests/test_array_core.py Outdated
Show resolved Hide resolved dask/array/core.py Outdated

@mpeaton mpeaton force-pushed the Quansight:master branch from bad70a5 to e7a4823 Apr 8, 2019

@jrbourbeau
Copy link
Member

left a comment

Looks like other commits are showing up here, you may want to rebase on master to take the parquet fix out of the diff.

Since we're adding the name keyword to from_zarr, could you add a test to ensure that when name is given, the output dask array has the specified name. Something like the following should work:

def test_from_zarr_name():
    zarr = pytest.importorskip('zarr')
    a = zarr.array([1, 2, 3])
    assert da.from_zarr(a, name='foo').name == 'foo'
Show resolved Hide resolved dask/array/tests/test_array_core.py

@mpeaton mpeaton force-pushed the Quansight:master branch from e7a4823 to 0836033 Apr 9, 2019

@jrbourbeau

This comment has been minimized.

Copy link
Member

commented Apr 10, 2019

The test failures showing up on Travis CI are related to a recent bokeh update and have been fixed in #4680.

Other than updating the name docstring description (ref #4663 (comment)), this PR looks good to me

@mpeaton mpeaton force-pushed the Quansight:master branch from 0836033 to 29623d3 Apr 10, 2019

@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 12, 2019

Checking in here. Is there anything left to do on this PR?

@jrbourbeau

This comment has been minimized.

Copy link
Member

commented Apr 12, 2019

There is one lingering (minor) comment about modifying the name docstring description (see #4663 (comment)). @mpeaton do you have any comments on the proposed description?

@mpeaton
Copy link
Contributor Author

left a comment

name: str, optional
The key name used for the array. Defaults to the string 'from-zarr-' prepended to a hash of x.

Show resolved Hide resolved dask/array/tests/test_array_core.py
@jrbourbeau

This comment has been minimized.

Copy link
Member

commented Apr 13, 2019

Thanks, I like your suggestion @mpeaton. One very minor comment, since there is no input parameter x for from_zarr, could we say "a hash of the input zarr array" instead of "a hash of x"?

Also, just as a note, currently the output from tokenize(z, ...) when z is a zarr array is random string. It corresponds to uuid.uuid4().hex returned here

return normalize_function(o) if callable(o) else uuid.uuid4().hex

@mpeaton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 13, 2019

@jrbourbeau I see. Would it be concise, yet accurate enough to replace x with url?
Which raises the question: What happens in the case where isinstance(url, str)? Does the mapper immediately evaluate the mapper or is it a lazy reference?

@jrbourbeau

This comment has been minimized.

Copy link
Member

commented Apr 13, 2019

Would it be concise, yet accurate enough to replace x with url?

Thinking about this again, perhaps the most accurate, yet concise, description would be a general statement about hashing the input (since we're tokenize-ing all the from_zarr inputs like component and chunks). For example, the name description in the DataFrame.from_pandas function is:

name: string, optional
An optional keyname for the dataframe. Defaults to hashing the input

A similar statement would work here too. "An optional key name for the array. Defaults to hashing the input." Apologies for going back and forth on this point.

Does the mapper immediately evaluate the mapper or is it a lazy reference?

Data should be loaded lazily

mpeaton added some commits Apr 2, 2019

Add name kwarg to from_zarr (#1)
* Add name kwarg to from_zarr

* added default from_zarr naming convention per jrbourbeau

* flake8 fix

@mpeaton mpeaton force-pushed the Quansight:master branch from 29623d3 to 4c102a9 Apr 15, 2019

Show resolved Hide resolved dask/array/core.py Outdated
Show resolved Hide resolved continuous_integration/appveyor/setup_conda_environment.cmd Outdated
grammer curl
Co-Authored-By: mpeaton <mpeaton@users.noreply.github.com>
@jrbourbeau
Copy link
Member

left a comment

LGTM, merging this afternoon if there are no additional comments

@jrbourbeau jrbourbeau merged commit a9fbcb2 into dask:master Apr 18, 2019

3 checks passed

codecov/project 91.22% (target 90%)
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jrbourbeau

This comment has been minimized.

Copy link
Member

commented Apr 18, 2019

Thanks @mpeaton! I noticed this was your first code contribution to this repository, welcome!

asmith26 added a commit to asmith26/dask that referenced this pull request Apr 22, 2019

Add name kwarg to from_zarr (dask#4663)
* Add name kwarg to from_zarr (dask#1)

* Add name kwarg to from_zarr

* added default from_zarr naming convention per jrbourbeau

* flake8 fix

* Added test for from_zarr name hashing

* added name test

* update from_zarr docstring

* grammer curl

Co-Authored-By: mpeaton <mpeaton@users.noreply.github.com>

* python-snappy

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Add name kwarg to from_zarr (dask#4663)
* Add name kwarg to from_zarr (dask#1)

* Add name kwarg to from_zarr

* added default from_zarr naming convention per jrbourbeau

* flake8 fix

* Added test for from_zarr name hashing

* added name test

* update from_zarr docstring

* grammer curl

Co-Authored-By: mpeaton <mpeaton@users.noreply.github.com>

* python-snappy

Thomas-Z added a commit to Thomas-Z/dask that referenced this pull request May 17, 2019

Add name kwarg to from_zarr (dask#4663)
* Add name kwarg to from_zarr (#1)

* Add name kwarg to from_zarr

* added default from_zarr naming convention per jrbourbeau

* flake8 fix

* Added test for from_zarr name hashing

* added name test

* update from_zarr docstring

* grammer curl

Co-Authored-By: mpeaton <mpeaton@users.noreply.github.com>

* python-snappy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.