WIP: Update test_to_records to test with lengths argument, added co…#4515
TomAugspurger merged 10 commits into dask:master from asmith26:issue-4469
… On Wed, Feb 20, 2019 at 10:43 AM asmith26 ***@***.***> wrote:
…rresponding functionality based on .to_dask_array()
- Tests added / passed
- Passes flake8 dask
Commit Summary
- WIP: Update test_to_records to test with `lengths` argument, added
corresponding functionality based on `.to_dask_array()`
File Changes
- *M* dask/dataframe/core.py <https://github.com/dask/dask/pull/4515/files#diff-0> (29)
- *M* dask/dataframe/io/tests/test_io.py <https://github.com/dask/dask/pull/4515/files#diff-1> (18)
Patch Links:
- https://github.com/dask/dask/pull/4515.patch
- https://github.com/dask/dask/pull/4515.diff
dask/dataframe/core.py
Outdated
    records = to_records(self)

    if isinstance(lengths, Sequence):
Move this to a _validate_chunks method and call it like this:
    chunks = self._validate_chunks(records, lengths)
    records._chunks = chunks
When lengths is None, return records._chunks from _validate_chunks. Call it from to_dask_array as well.
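To make the suggestion concrete, here is a minimal standalone sketch of what such a validation helper might look like. It is not the code from this PR: the function name `validate_chunks` and its parameters `current_chunks`, `npartitions` and `ncolumns` are hypothetical stand-ins for state that the real method would read from `self` and from the array.

```python
from collections.abc import Sequence


def validate_chunks(current_chunks, npartitions, ncolumns, lengths=None):
    """Hypothetical stand-in for the suggested `_validate_chunks` method.

    current_chunks: chunks already attached to the array.
    npartitions:    number of partitions in the dataframe (assumed input).
    ncolumns:       number of columns, used as the second dimension.
    lengths:        optional per-partition row counts.
    """
    if lengths is None:
        # Nothing to validate: keep whatever chunks the array already has.
        return current_chunks

    if isinstance(lengths, Sequence) and not isinstance(lengths, (str, bytes)):
        lengths = tuple(lengths)
        if len(lengths) != npartitions:
            raise ValueError(
                "expected {} lengths, got {}".format(npartitions, len(lengths))
            )
        # One tuple of block sizes per dimension: rows, then columns.
        return (lengths, (ncolumns,))

    raise ValueError("unexpected value for 'lengths': {!r}".format(lengths))


# Chunks with unknown row counts, as a placeholder for the "before" state.
unknown = ((float("nan"), float("nan")), (2,))
known = validate_chunks(unknown, npartitions=2, ncolumns=2, lengths=[3, 5])
```

Keeping the `lengths is None` branch inside the helper means both callers can assign its return value unconditionally, which is what the two-line call pattern above relies on.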
Would we need to test the new _validate_chunks() method?
dask/dataframe/io/tests/test_io.py
Outdated
    # if as_frame:
    #     expected_chunks = expected_chunks + ((1,),)
    #
    # assert result.chunks == expected_chunks
Add some tests for the cases that raise as well:
- len(lengths) != npartitions
- an invalid value for lengths.
In progress (hopefully this week, possibly next).
…rresponding functionality based on `.to_dask_array()`
      aggfunc=aggfunc)

- def to_records(self, index=False):
+ def to_records(self, index=False, lengths=None):
Reordered the arguments following #4515 (comment).
However, index is not currently being used. Should it be added to https://github.com/asmith26/dask/blob/c0cf9f7c1717d7ec40fbda4d6cebbc4421e3d137/dask/dataframe/core.py#L3264 ?

Thanks for your feedback! I think I've completed all of your requests. The tests generally mirror the to_dask_array tests and pass. However, I'm a bit concerned that I have commented out the check for equality: https://github.com/asmith26/dask/blob/issue-4469/dask/dataframe/io/tests/test_io.py#L521

If I uncomment it, I still get a KeyError. I understand what the error is, but I am not familiar enough with Dask internals to understand why the key computed by Dask differs from the keys in the graph. In particular, this key:

    key = ('to_records-7efaccf97358a6d0b69926d9bf624a17', 0, 0)

is used to index the dict:

    dsk = {('from_pandas-b581c7965fed6d41f34e734ae95a309c', 0): x y
    ind
    1.0 a 2
    2.0 b 3, ('from_pandas-b581c7965...ubgraph_callable, ('from_pandas-b581c7965fed6d41f34e734ae95a309c', 1), 'from_pandas-b581c7965fed6d41f34e734ae95a309c')}

which clearly does not contain it. The key appears to be looked up within: https://github.com/asmith26/dask/blob/master/dask/base.py#L206-L207

Any thoughts?

I'll commit with the check for equality below to make it clearer:
TomAugspurger left a comment
I think the issue with the KeyError is coming from the dimensionality of the chunks being incorrect. Fixing that should fix up everything.
dask/dataframe/core.py
Outdated
    records = to_records(self)

    chunks = self._validate_chunks(records, lengths)
    records._chunks = chunks
I'm not very familiar with record arrays, but apparently they're 1-dimensional?
    In [6]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    In [7]: df.to_records().ndim
    Out[7]: 1
So this should (maybe) be records._chunks = (chunks[0],).
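The link between the chunk dimensionality and the KeyError reported earlier can be illustrated without Dask itself. The sketch below assumes that a chunked array's task keys take the form `(name, i, j, ...)` with one block index per dimension; `block_keys` and the name `"to_records-abc"` are made up for the illustration and are not Dask API.

```python
from itertools import product


def block_keys(name, chunks):
    """List the keys a chunked array would ask for: one index per dimension."""
    return [(name,) + idx for idx in product(*(range(len(c)) for c in chunks))]


# Keys present in the graph of a 1-D record array made of two blocks.
graph = {("to_records-abc", 0): "block 0", ("to_records-abc", 1): "block 1"}

chunks_2d = ((2, 2), (2,))   # rows and columns: one dimension too many
chunks_1d = (chunks_2d[0],)  # keep only the row dimension, as suggested

# With 2-D chunks every requested key carries an extra trailing index.
missing = [k for k in block_keys("to_records-abc", chunks_2d) if k not in graph]
# With 1-D chunks the requested keys line up with the graph.
present = [k for k in block_keys("to_records-abc", chunks_1d) if k in graph]
```

Under these assumptions, 2-D chunks request keys such as `('to_records-abc', 0, 0)`, which have the same shape as the failing key quoted earlier in the thread, while the graph only holds one-index keys.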

I pushed a commit that addresses your review comments @TomAugspurger. Tests are passing now, so a quick re-review whenever you get a chance would be appreciated.

Thanks @asmith26 and @jrbourbeau!

Thank you very much for everyone's help! I thoroughly enjoyed tackling this, and I've learned loads from everyone's feedback. Hope I'm able to help out with this brilliant library in the future! :)