WIP: Update test_to_records to test with `lengths` argument, added co… by asmith26 · Pull Request #4515 · dask/dask

asmith26 · 2019-02-20T18:43:30Z

…rresponding functionality based on .to_dask_array()

Tests added / passed
Passes flake8 dask

TomAugspurger · 2019-03-13T14:33:26Z

dask/dataframe/core.py

+
+        records = to_records(self)
+
+        if isinstance(lengths, Sequence):


Move this to a _validate_chunks method and call it like

chunks = self._validate_chunks(records, lengths)
records._chunks = chunks

When lengths is None, you'll return records._chunks from _validate_chunks.

Call it from to_dask_array as well.

Would we need to test the new _validate_chunks() method?
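
The refactor suggested in this thread can be sketched as a standalone function. This is only an illustration under stated assumptions, not the method that was merged: the name `validate_chunks` and its `current_chunks` and `npartitions` parameters are hypothetical, and the sketch handles only `lengths=None` and a sequence of lengths, so any other forms the real method accepts are out of scope here.

```python
from collections.abc import Sequence


def validate_chunks(current_chunks, lengths, npartitions):
    """Check user-supplied partition lengths and return chunks to use.

    Hypothetical stand-in for the `_validate_chunks` method discussed in
    the review; the real method lives on the dask DataFrame and may differ.
    """
    if lengths is None:
        # Nothing supplied: keep whatever chunks the array already has.
        return current_chunks
    if isinstance(lengths, Sequence) and not isinstance(lengths, str):
        lengths = tuple(lengths)
        if len(lengths) != npartitions:
            raise ValueError(
                f"expected {npartitions} lengths (one per partition), "
                f"got {len(lengths)}"
            )
        # One tuple of block sizes for the single row dimension.
        return (lengths,)
    raise ValueError(f"invalid value for lengths: {lengths!r}")
```

Keeping the validation in one helper means both callers raise the same errors, and the helper can be tested directly without building a DataFrame.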

TomAugspurger · 2019-03-13T14:36:45Z

dask/dataframe/io/tests/test_io.py

+    # if as_frame:
+    #     expected_chunks = expected_chunks + ((1,),)
+    #
+    # assert result.chunks == expected_chunks


Add some tests for the cases that raise as well:

- len(lengths) != npartitions
- an invalid value for lengths

In progress (hopefully this week, possibly next).

asmith26 · 2019-04-22T22:05:08Z

dask/dataframe/core.py

                           aggfunc=aggfunc)

-    def to_records(self, index=False):
+    def to_records(self, index=False, lengths=None):


Reordered arguments following #4515 (comment)

However, index is not currently being used. Should it be added to https://github.com/asmith26/dask/blob/c0cf9f7c1717d7ec40fbda4d6cebbc4421e3d137/dask/dataframe/core.py#L3264 ?

asmith26 · 2019-04-23T12:12:43Z

Hi @TomAugspurger

Thanks for your feedback! I think I've completed all of your requests.

The tests generally mirror the to_dask_array tests and pass. However, I'm a bit concerned that I have commented out the check for equality: https://github.com/asmith26/dask/blob/issue-4469/dask/dataframe/io/tests/test_io.py#L521

If I uncomment this, I still get a KeyError. I understand what the error is, but I am not familiar enough with Dask internals to understand why the key computed by Dask is different from the one in the graph. In particular:

This key

key = ('to_records-7efaccf97358a6d0b69926d9bf624a17', 0, 0)

is being used to index the dict:

dsk = {('from_pandas-b581c7965fed6d41f34e734ae95a309c', 0):      x  y
ind      
1.0  a  2
2.0  b  3, ('from_pandas-b581c7965...ubgraph_callable, ('from_pandas-b581c7965fed6d41f34e734ae95a309c', 1), 'from_pandas-b581c7965fed6d41f34e734ae95a309c')}

where that key clearly doesn't exist.

It appears to be getting the key within: https://github.com/asmith26/dask/blob/master/dask/base.py#L206-L207

Any thoughts?

asmith26 · 2019-04-23T12:13:25Z

I'll commit with check for equality below to make it clearer:

TomAugspurger

I think the issue with the KeyError is coming from the dimensionality of the chunks being incorrect. Fixing that should fix up everything.

TomAugspurger · 2019-05-14T13:33:31Z

dask/dataframe/core.py

+        records = to_records(self)
+
+        chunks = self._validate_chunks(records, lengths)
+        records._chunks = chunks


I'm not very familiar with record arrays, but apparently they're 1-dimensional?

In [6]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

In [7]: df.to_records().ndim
Out[7]: 1

So this should (maybe) be records._chunks = (chunks[0],).
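
A small pure-Python sketch may help show why the dimensionality of the chunks matters for the KeyError reported earlier. It assumes that a dask array's task keys consist of the array name followed by one block index per dimension; the helper `block_keys` and the shortened name `"to_records-abc"` are made up for illustration and are not part of dask.

```python
from itertools import product


def block_keys(name, chunks):
    """List the task keys implied by a chunks tuple: one index per dimension."""
    blocks_per_dim = [range(len(dim)) for dim in chunks]
    return [(name, *idx) for idx in product(*blocks_per_dim)]


two_dim = ((2, 2), (1,))   # row blocks plus an extra column axis
one_dim = (two_dim[0],)    # the (chunks[0],) form suggested above

print(block_keys("to_records-abc", two_dim))
# [('to_records-abc', 0, 0), ('to_records-abc', 1, 0)]
print(block_keys("to_records-abc", one_dim))
# [('to_records-abc', 0), ('to_records-abc', 1)]
```

With two-dimensional chunks the requested keys carry two indices, like the `('to_records-…', 0, 0)` key in the error, while a graph built for a one-dimensional record array would only hold keys with a single index.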

jrbourbeau · 2019-06-18T18:20:43Z

I pushed a commit that addresses your review comments @TomAugspurger. Tests are passing now, so a quick re-review whenever you get a chance would be appreciated.

TomAugspurger · 2019-06-18T18:30:19Z

Thanks @asmith26 and @jrbourbeau!

asmith26 · 2019-06-18T19:44:50Z

Thank you very much for everyone's help! I thoroughly enjoyed tackling this, and I've learned loads from everyone's feedback.

Hope I'm able to help out with this brilliant library in the future! :)

asmith26 mentioned this pull request Feb 20, 2019

Feature Request/Suggestion: Ability to .reshape(...) array when chunk size unknown #4469

Closed

TomAugspurger reviewed Mar 13, 2019

asmith26 added 6 commits March 18, 2019 21:56

WIP: Update test_to_records to test with lengths argument, added co…

c7d0a47

…rresponding functionality based on `.to_dask_array()`

Merge branch 'master' of https://github.com/dask/dask into issue-4469

d757ca5

Reordered arguments following #4515 (comment)

182996a

Create _validate_chunks method following #4515 (comment)

89b7f1b

Update test_to_records to mirror test_to_dask_array

ef8ce1c

Add test_to_records_raises (similar to test_to_dask_array_raises) fol…

3e9e599

…lowing #4515 (comment)

asmith26 commented Apr 22, 2019

Update test_to_records_with_lengths to test for equality.

04ad867

mrocklin marked this pull request as ready for review April 30, 2019 19:37

Merge branch 'master' into issue-4469

b0c6454

TomAugspurger reviewed May 14, 2019

jrbourbeau added 2 commits June 18, 2019 11:38

Merge remote-tracking branch 'upstream/master' into issue-4469

1a8eacf

Address reviewer comments

eb8c92e

TomAugspurger added array dataframe labels Jun 18, 2019

TomAugspurger merged commit 84ff737 into dask:master Jun 18, 2019
