Skip to content

WIP: Update test_to_records to test with lengths argument, added co…#4515

Merged
TomAugspurger merged 10 commits intodask:masterfrom
asmith26:issue-4469
Jun 18, 2019
Merged

WIP: Update test_to_records to test with lengths argument, added co…#4515
TomAugspurger merged 10 commits intodask:masterfrom
asmith26:issue-4469

Conversation

@asmith26
Copy link
Copy Markdown
Contributor

…rresponding functionality based on .to_dask_array()

  • Tests added / passed
  • Passes flake8 dask

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Feb 20, 2019 via email


records = to_records(self)

if isinstance(lengths, Sequence):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to a _validate_chunks method and call like

chunks = self._validate_chunks(records, lengths)
records._chunks = chunks

When lengths is None, you'll return records._chunks from _validat_chunks.

and call from to_dask_array as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we need to test the new _validate_chunks() method?

# if as_frame:
# expected_chunks = expected_chunks + ((1,),)
#
# assert result.chunks == expected_chunks
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some tests for the cases that raise as well

  1. len(lenghts) != npartitions
  2. an invalid value for lengths.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In progress (hopefully this week, possibly next).

aggfunc=aggfunc)

def to_records(self, index=False):
def to_records(self, index=False, lengths=None):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reordered arguments following #4515 (comment)

Although index is not currently being used? Should it be added to https://github.com/asmith26/dask/blob/c0cf9f7c1717d7ec40fbda4d6cebbc4421e3d137/dask/dataframe/core.py#L3264 ?

@asmith26
Copy link
Copy Markdown
Contributor Author

Hi @TomAugspurger

Thanks for your feedback! I think I've completed all of your requests.

The tests generally mirror the to_dask_array tests and pass, however I'm a bit concerned that I have commented out the check for equality: https://github.com/asmith26/dask/blob/issue-4469/dask/dataframe/io/tests/test_io.py#L521

If I uncomment this I get a Key error still- I understand what the error is, but I am not familiar enough with Dask internals to understand why the key computed by Dask is different to that in the graph; in particular:

This key

key = ('to_records-7efaccf97358a6d0b69926d9bf624a17', 0, 0)

Is trying to index the dict:

dsk = {('from_pandas-b581c7965fed6d41f34e734ae95a309c', 0):      x  y
ind      
1.0  a  2
2.0  b  3, ('from_pandas-b581c7965...ubgraph_callable, ('from_pandas-b581c7965fed6d41f34e734ae95a309c', 1), 'from_pandas-b581c7965fed6d41f34e734ae95a309c')}

which clearly doesn't exist.

It appears to be getting the key within: https://github.com/asmith26/dask/blob/master/dask/base.py#L206-L207

Any thoughts?

@asmith26
Copy link
Copy Markdown
Contributor Author

I'll commit with check for equality below to make it clearer:

@mrocklin mrocklin marked this pull request as ready for review April 30, 2019 19:37
Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue with the KeyError is coming from the dimensionality of the chunks being incorrect. Fixing that should fix up everything.

records = to_records(self)

chunks = self._validate_chunks(records, lengths)
records._chunks = chunks
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very familiar with record arrays, but apparently they're 1-dimensional?

In [6]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

In [7]: df.to_records().ndim
Out[7]: 1

So this should (maybe) be records._chunks = (chunks[0],).

@jrbourbeau
Copy link
Copy Markdown
Member

I pushed a commit that addresses your review comments @TomAugspurger. Tests are passing now, so a quick re-review whenever you get a chance would be appreciated.

@TomAugspurger
Copy link
Copy Markdown
Member

Thanks @asmith26 and @jrbourbeau!

@asmith26
Copy link
Copy Markdown
Contributor Author

Thank you very much for everyone's help! I thoroughly enjoyed tackling this, and I've learned loads from everyone's feedback.

Hope I'm able to help out with this brilliant library in the future! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants