
A few dataframe metadata fixes #4695

Merged
merged 5 commits into dask:master on Apr 22, 2019

Conversation

@jcrist (Member) commented Apr 11, 2019

Due to a bug in dd.utils.assert_eq, we weren't properly testing that index metadata was correct. This fixes that, as well as issues this turned up.

  • Use _meta_nonempty in assign
  • Use _meta_nonempty in groupby aggregations
  • Error nicely when trying to infer metadata from a dask.delayed
    object (which we can't do since it could be any dtype).
  • Fix dd.utils.assert_eq to properly compare index dtypes
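The delayed case in the list above can be sketched in plain Python. This is a hypothetical check (`infer_meta` and the message text are illustrative, not the actual dask code path), showing the kind of clear error the PR adds:

```python
def infer_meta(value):
    # Hypothetical sketch of the check this PR adds: metadata cannot be
    # inferred from a dask.delayed object, since it could wrap any dtype.
    if type(value).__name__ == "Delayed":
        raise ValueError(
            "Cannot infer dataframe metadata from a dask.delayed object; "
            "please provide `meta=` explicitly."
        )
    return value


class Delayed:
    """Stand-in for dask.delayed.Delayed, for illustration only."""


try:
    infer_meta(Delayed())
except ValueError as e:
    print(e)
```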

Fixes #2564.
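The class of bug in `assert_eq` can be illustrated with plain pandas. In this sketch (`indexes_match` is a hypothetical helper, not `dd.utils.assert_eq` itself), a values-only comparison passes while a dtype-aware one correctly fails:

```python
import pandas as pd

def indexes_match(a, b):
    # Hypothetical helper: value equality alone is not enough,
    # the index dtypes must also agree.
    return a.index.equals(b.index) and a.index.dtype == b.index.dtype

left = pd.Series([1, 2], index=pd.Index([0, 1], dtype="int64"))
right = pd.Series([1, 2], index=pd.Index([0.0, 1.0], dtype="float64"))

# pandas Index.equals compares values only, so this passes...
print(left.index.equals(right.index))   # True
# ...but the dtype-aware check catches the metadata mismatch.
print(indexes_match(left, right))       # False
```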

@mrocklin (Member) commented Apr 12, 2019

Thanks for finding and resolving this @jcrist !

I'm curious, are the parquet errors related, or are they due to something else?

@jcrist (Member, Author) commented Apr 12, 2019

Not sure, I can take a look tomorrow.

jcrist added some commits Apr 11, 2019

A few dataframe metadata fixes
- Use `_meta_nonempty` in assign
- Use `_meta_nonempty` in groupby aggregations
- Error nicely when trying to infer metadata from a `dask.delayed`
object (which we can't do since it could be any dtype).
- Fix `dd.utils.assert_eq` to properly compare index dtypes
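The first two commits hinge on one idea, which can be sketched with plain pandas (`infer_result_dtype` is an illustrative name, not dask internals): inferring a result dtype by running the operation on a tiny non-empty sample, the way `_meta_nonempty` does, rather than on a zero-row frame where dtype inference can go wrong.

```python
import pandas as pd

def infer_result_dtype(func, dtype="int64"):
    # Sketch of _meta_nonempty-style inference: run the operation on a
    # tiny non-empty sample so the result dtype reflects real values.
    dummy = pd.Series([1], dtype=dtype)
    return func(dummy).dtype

print(infer_result_dtype(lambda s: s + 0.5))        # float64
print(infer_result_dtype(lambda s: s.astype(str)))  # object
```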
@jcrist (Member, Author) commented Apr 19, 2019

This is due to a bug in our index metadata handling for pyarrow 0.13.0. I spent 20 minutes or so trying to debug it and haven't found an easy solution (the test fixes here turned up a few bad code paths). Given our desire to rewrite all the parquet handling anyway, and the other pyarrow 0.13.0 issues (no pickle support), would it be fine to just skip these tests?

@jcrist (Member, Author) commented Apr 19, 2019

Alternatively, we could just error nicely if pyarrow 0.13.0 is installed and recommend people downgrade/upgrade. I'm not sure what the complete parquet & pyarrow 0.13.0 status is, but until we do a clean rewrite there may be too many bugs for me to completely trust the index handling logic here.

@martindurant (Member) commented Apr 19, 2019

Some things are still broken after #4668?

The general problem with inference from parquet is that there are genuinely many possibilities in the schema contents, the file organisation, the further metadata present in the files, and the engines' view of it all (the latter of which may depend on version, as in the present case of arrow 0.13). So the code is convoluted partly because it grew organically to accommodate a moving target, but in no small part it also reflects a genuinely complex situation. I've tried to be as clear as I can about this in the comments in #4336.
A "clean rewrite" should probably include stricter version requirements.

Since arrow is actively working on the parquet story for the next release, including pickle and metadata-file handling, I would be OK skipping some tests for now. Whether the rewrite will succeed in making a more maintainable situation remains to be seen.

@mrocklin (Member) commented Apr 19, 2019

> A "clean rewrite" should probably include stricter version requirements.

+1

> So partly the code is convoluted because it grew organically to accommodate a moving target

I recall early conversations when you were writing the first integration where there were open questions of "should this logic go in fastparquet, or in dask.dataframe?" to which we really didn't have any good answers :)

> but also in no small part it reflects a genuinely complex situation

Indeed

> Whether the rewrite will succeed in making a more maintainable situation remains to be seen.

I think that we're more than capable of making this happen if we act as a group. I don't think that I am independently capable of this though.

I think that the parquet situation is important for Dask.

@martindurant (Member) commented Apr 19, 2019

> Whether the rewrite will succeed in making a more maintainable situation remains to be seen.
>
> I think that we're more than capable of making this happen if we act as a group.

I should clarify that by "success" here, I mean greatly reducing the code complexity and maintenance burden. I'm sure we can make something that works well for the current situation :)

@jcrist (Member, Author) commented Apr 19, 2019

> Some things are still broken after #4668?

This PR fixes a bug in our test suite that let metadata issues slip through. Yes, things are still broken after #4668, and the fix seems to require significantly refactoring the pyarrow code path. To be clear, I'm not attacking the state of the code; I'm just saying that for now it seems easier to either say "don't use pyarrow 0.13.0 with dask" or skip these tests. I don't care which one we choose.

@mrocklin (Member) commented Apr 19, 2019

I'm totally OK with a "don't use pyarrow 0.13.0 with Dask" warning.

@martindurant (Member) commented Apr 19, 2019

> don't use pyarrow 0.13.0

Agreed, since it doesn't work with distributed anyway.

@jcrist (Member, Author) commented Apr 19, 2019

Cool, I'll push that here.

Drop parquet pyarrow 0.13.0 support
Due to incompatibilities and bugs in pyarrow 0.13.0 we drop support for
it in our parquet implementation and tests.
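The guard this commit adds can be sketched as follows. The function name and message text are assumptions for illustration, not dask's actual import-time check:

```python
def check_pyarrow_version(version):
    # Illustrative version guard: reject pyarrow builds known to be
    # incompatible (names and messages are assumptions, not dask source).
    parts = tuple(int(p) for p in version.split(".")[:3])
    if parts < (0, 8, 0):
        raise RuntimeError("PyArrow version >= 0.8.0 required")
    if parts == (0, 13, 0):
        raise RuntimeError(
            "PyArrow version 0.13.0 isn't supported, please "
            "upgrade or downgrade pyarrow"
        )

check_pyarrow_version("0.12.1")  # accepted silently
try:
    check_pyarrow_version("0.13.0")
except RuntimeError as exc:
    print(exc)
```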

@jcrist jcrist force-pushed the jcrist:dd-metadata-fixups branch from 4197631 to 3dabb4d Apr 19, 2019

@jcrist (Member, Author) commented Apr 20, 2019

I think this is good to go now.

@martindurant (Member) left a comment

I am fine with the parquet part of this

```python
raise RuntimeError("PyArrow version >= 0.8.0 required")
elif pa_version == '0.13.0':
    raise RuntimeError("PyArrow versino 0.13.0 isn't supported, please "
```

@jcrist (Member, Author) commented Apr 22, 2019

Merging later today if there are no comments.

@jcrist jcrist merged commit 87954ec into dask:master Apr 22, 2019

@jcrist jcrist deleted the jcrist:dd-metadata-fixups branch Apr 22, 2019

asmith26 added a commit to asmith26/dask that referenced this pull request Apr 22, 2019

A few dataframe metadata fixes (dask#4695)
* A few dataframe metadata fixes

- Use `_meta_nonempty` in assign
- Use `_meta_nonempty` in groupby aggregations
- Error nicely when trying to infer metadata from a `dask.delayed`
object (which we can't do since it could be any dtype).
- Fix `dd.utils.assert_eq` to properly compare index dtypes

* More fixups

* Drop parquet pyarrow 0.13.0 support

Due to incompatibilities and bugs in pyarrow 0.13.0 we drop support for
it in our parquet implementation and tests.

* py27 fixup

* Spelling [skip ci]

@fjetter fjetter referenced this pull request May 6, 2019

Merged

Minimal build setup #4

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

A few dataframe metadata fixes (dask#4695)

Thomas-Z added a commit to Thomas-Z/dask that referenced this pull request May 17, 2019

A few dataframe metadata fixes (dask#4695)