DOC: DataFrame→Array conversion and unknown chunks #4516

Merged
merged 6 commits into from Mar 2, 2019

@stsievert
Member

@stsievert stsievert commented Feb 20, 2019

What does this PR implement?
This PR documents how to convert a Dask DataFrame to a Dask Array while computing the chunk sizes in the process (so that chunks is known rather than nan).

Reference issues/PRs

  • "I had to pass lengths to to_dask_array which I didn't realize existed."—#3293 (comment)
  • dask/dask-ml#465, "How to chunk a Dask Array with unknown chunks"
@stsievert
Member Author

@stsievert stsievert commented Feb 20, 2019

I debated whether to include this in the DataFrame or the Array documentation. I chose the Array documentation because chunks are an attribute of the Array object.

@jrbourbeau
Member

@jrbourbeau jrbourbeau commented Feb 20, 2019

Thanks for adding this @stsievert! It might be worth including a note that lengths=True will trigger an immediate computation.

@stsievert
Member Author

@stsievert stsievert commented Feb 21, 2019

Thanks @jrbourbeau! I've added the note. I mention that this enables downstream computations, but don't point to any examples in case they're fixed (e.g., slicing an array raises a NotImplementedError at slicing.py#L933).

I can see another use case with arrays:

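>>> import numpy as np
>>> import dask.array as da
>>> x = np.random.choice([-1, 0, 1], size=100)
>>> y = da.from_array(x, chunks=50)
>>> y[y != -1]  # boolean-mask indexing yields unknown shape and chunks
# dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,)>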

I think computing the chunk size could be useful (e.g., with the slicing example above). Looks like #3293 (comment) is the relevant work.
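
A sketch of what computing the chunk sizes looks like for the masked array above. It assumes a Dask release that provides `Array.compute_chunk_sizes()`, which is not part of this PR; the seeded generator is only there to make the example repeatable:

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)  # seeded for repeatability (an assumption)
x = rng.choice([-1, 0, 1], size=100)
y = da.from_array(x, chunks=50)

# Boolean-mask indexing leaves the shape and chunks unknown (nan).
masked = y[y != -1]

# Triggers a computation and updates the chunk sizes in place.
masked.compute_chunk_sizes()

# The known length should match the count computed directly with NumPy.
expected_len = int((x != -1).sum())
print(masked.shape, expected_len)
```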

@jrbourbeau
Member

@jrbourbeau jrbourbeau commented Feb 27, 2019

After looking at this again, converting a Dask DataFrame to a Dask array (and the issue of .values giving unknown chunk sizes) is brought up in the "From Dask DataFrame" section of the Dask array creation documentation (dask/docs/source/array-creation.rst). Do you think this additional to_dask_array documentation is a better fit in that section?

@stsievert stsievert force-pushed the array-dataframe-chunks-doc branch from c137456 to d1efd1f Feb 27, 2019
@stsievert
Member Author

@stsievert stsievert commented Feb 27, 2019

Thanks for that @jrbourbeau! I think that's a better place, and I still link to it from the chunks page.

I also improved that page a bit: it didn't mention to_dask_array, so I highlighted it there.

Member

@jrbourbeau jrbourbeau left a comment

A few nitpicky comments. Otherwise LGTM.

docs/source/array-chunks.rst (outdated, resolved)
docs/source/array-creation.rst (outdated, resolved)
docs/source/array-creation.rst (outdated, resolved)
jrbourbeau and others added 3 commits Mar 1, 2019
Co-Authored-By: stsievert <stsievert@users.noreply.github.com>
Co-Authored-By: stsievert <stsievert@users.noreply.github.com>
Co-Authored-By: stsievert <stsievert@users.noreply.github.com>
@jrbourbeau jrbourbeau merged commit f3c2d5d into dask:master Mar 2, 2019
2 checks passed
@jrbourbeau
Member

@jrbourbeau jrbourbeau commented Mar 2, 2019

Thanks @stsievert!

@stsievert stsievert deleted the array-dataframe-chunks-doc branch Mar 2, 2019
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019
* DOC: DataFrame chunks when converting to array

* DOC: add note about immediate computation

* MAINT: move doc note to array creation

* Update docs/source/array-chunks.rst

Co-Authored-By: stsievert <stsievert@users.noreply.github.com>

* Update docs/source/array-creation.rst

Co-Authored-By: stsievert <stsievert@users.noreply.github.com>

* Update docs/source/array-creation.rst

Co-Authored-By: stsievert <stsievert@users.noreply.github.com>

2 participants