DOC: DataFrame→Array conversion and unknown chunks #4516

Merged
merged 6 commits into from Mar 2, 2019

2 participants
@stsievert
Contributor

commented Feb 20, 2019

What does this PR implement?
This PR documents how to convert a Dask DataFrame to a Dask Array while computing the chunk sizes in the process, so that the resulting chunks are known rather than nan.

Reference issues/PRs

  • "I had to pass lengths to to_dask_array which I didn't realize existed."—#3293 (comment)
  • dask/dask-ml#465, "How to chunk a Dask Array with unknown chunks"
@stsievert

Contributor Author

commented Feb 20, 2019

I had some debate over whether to include this in the DataFrame or Array documentation. I chose the Array documentation because the chunks are an Array object.

@jrbourbeau

Member

commented Feb 20, 2019

Thanks for adding this @stsievert! It might be worth including a note that lengths=True will trigger an immediate computation.

@stsievert

Contributor Author

commented Feb 21, 2019

Thanks @jrbourbeau! I've added the note. I mention that this enables downstream computations, but don't point to any examples in case they're fixed (e.g., slicing an array raises a NotImplementedError at slicing.py#L933).

I can see another use case with arrays:

>>> import numpy as np
>>> import dask.array as da
>>> x = np.random.choice([-1, 0, 1], size=100)
>>> y = da.from_array(x, chunks=50)
>>> y[y != -1]
# dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,)>

I think computing the chunk size could be useful (e.g., with the slicing example above). Looks like #3293 (comment) is the relevant work.
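For readers on a later Dask release, the idea can be sketched with `Array.compute_chunk_sizes()`, which was not available when this comment was written; treat its use here as an assumption about the installed version. The seeded random state is only there to make the example repeatable.

```python
import numpy as np
import dask.array as da

# Seeded generator so the example is repeatable (an illustrative choice).
rng = np.random.RandomState(0)
x = rng.choice([-1, 0, 1], size=100)
y = da.from_array(x, chunks=50)

# Boolean-mask indexing produces an array with unknown (nan) chunk sizes.
z = y[y != -1]
print(z.shape)

# Assumes a Dask version that provides compute_chunk_sizes().
# It evaluates the chunks once and updates the array in place.
z.compute_chunk_sizes()
print(z.shape)
```

As with `lengths=True`, this triggers computation immediately, so it is best called only when known chunk sizes are actually required.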

@jrbourbeau

Member

commented Feb 27, 2019

After looking at this again, converting a Dask DataFrame to a Dask array (and the issue of .values giving unknown chunk sizes) is brought up in the "From Dask DataFrame" section of the Dask array creation documentation (dask/docs/source/array-creation.rst). Do you think this additional to_dask_array documentation is a better fit in that section?

@stsievert stsievert force-pushed the stsievert:array-dataframe-chunks-doc branch from c137456 to d1efd1f Feb 27, 2019

@stsievert

Contributor Author

commented Feb 27, 2019

Thanks for that @jrbourbeau! I think that's a better place, and I still link to it from the chunks page.

I also improved that page a bit – that page didn't have a mention of to_dask_array, so I highlighted it there.

@jrbourbeau
Member

left a comment

A few nitpicky comments. Otherwise LGTM

docs/source/array-chunks.rst (resolved, outdated)
docs/source/array-creation.rst (resolved, outdated)
docs/source/array-creation.rst (resolved, outdated)

jrbourbeau and others added some commits Mar 1, 2019

Update docs/source/array-chunks.rst
Co-Authored-By: stsievert <stsievert@users.noreply.github.com>
Update docs/source/array-creation.rst
Co-Authored-By: stsievert <stsievert@users.noreply.github.com>
Update docs/source/array-creation.rst
Co-Authored-By: stsievert <stsievert@users.noreply.github.com>

@jrbourbeau jrbourbeau merged commit f3c2d5d into dask:master Mar 2, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@jrbourbeau

Member

commented Mar 2, 2019

Thanks @stsievert!

@stsievert stsievert deleted the stsievert:array-dataframe-chunks-doc branch Mar 2, 2019

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

DOC: DataFrame to Array conversion and unknown chunks (dask#4516)
* DOC: DataFrame chunks when converting to array

* DOC: add note about immediate computation

* MAINT: move doc note to array creation

* Update docs/source/array-chunks.rst

Co-Authored-By: stsievert <stsievert@users.noreply.github.com>

* Update docs/source/array-creation.rst

Co-Authored-By: stsievert <stsievert@users.noreply.github.com>

* Update docs/source/array-creation.rst

Co-Authored-By: stsievert <stsievert@users.noreply.github.com>