Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes parquet breakages with arrow 0.13.0 #4668

Merged
merged 8 commits into from Apr 6, 2019

Conversation

Projects
None yet
2 participants
@martindurant
Copy link
Member

commented Apr 4, 2019

  • Tests added / passed
  • Passes flake8 dask
@martindurant

This comment has been minimized.

Copy link
Member Author

commented Apr 4, 2019

Two of the failures are OK, where things can now be loaded where before they could not.

The code here is minimum-lines, which means somewhat convoluted in the flow.

One remaining problem is that the index produced by pyarrow contains names that refer to none of the input columns (they are in metadata only). If the user were to choose another index, a real column, it's not clear what we should do: materialise the range index as a real column?

Also, there is a question around infer_divisions, which for range indexes is similar to having to load statistics - you need the number of rows in each file, or to read the version of the range index metadata in each file's metadata - but in any case, needs some extra special handling. Note that we didn't previously apply divisions to the implicit default index for non-pandas parquet reading, but we should (i.e., if the first row-group has 10 rows, the second's rangeindex should start at 10, so we do know the divisions, but only if we can percolate the information into the read function).

@martindurant

This comment has been minimized.

Copy link
Member Author

commented Apr 4, 2019

Fixes #4666 (eventually)

martindurant added some commits Apr 4, 2019

skip test where exception went away and add note
TODO points out where the problem is
@martindurant

This comment has been minimized.

Copy link
Member Author

commented Apr 5, 2019

This will still fail for those runs that use older arrow. This is v0.10.0 on the py27 build, although v0.13.0 is available. The error that shows up in the log is totally incomprehensible to me.
Suggestions?

@martindurant

This comment has been minimized.

Copy link
Member Author

commented Apr 5, 2019

(presumably we get the older versions because of the numpy pins)

martindurant added some commits Apr 5, 2019

@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 6, 2019

Everything here looks sensible to me. Thanks for prioritizing this @martindurant . +1 from me

@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 6, 2019

I'd be happy to merge after tests pass (which looks likely shortly), though I notice that there is still a WIP label in the title, so holding off for now in case there is more that you're planning to do.

@martindurant

This comment has been minimized.

Copy link
Member Author

commented Apr 6, 2019

@martindurant martindurant changed the title (WIP) Fixes parquet breakages with arrow 0.13.0 Fixes parquet breakages with arrow 0.13.0 Apr 6, 2019

@martindurant martindurant merged commit 0f97f40 into dask:master Apr 6, 2019

3 of 4 checks passed

codecov/patch 90.9% of diff hit (target 91.21%)
Details
codecov/project 91.24% (+0.03%) compared to 5dc48d2
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@martindurant martindurant deleted the martindurant:arrow13-fix branch Apr 6, 2019

mpeaton added a commit to Quansight/dask that referenced this pull request Apr 8, 2019

Fixes parquet breakages with arrow 0.13.0 (dask#4668)
* Fixes some of the breakages

* skip test where exception went away and add note

TODO points out where the problem is

* make things pass

* lint

* backcompat

* over-eager merge

* for coverage

asmith26 added a commit to asmith26/dask that referenced this pull request Apr 22, 2019

Fixes parquet breakages with arrow 0.13.0 (dask#4668)
* Fixes some of the breakages

* skip test where exception went away and add note

TODO points out where the problem is

* make things pass

* lint

* backcompat

* over-eager merge

* for coverage

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Fixes parquet breakages with arrow 0.13.0 (dask#4668)
* Fixes some of the breakages

* skip test where exception went away and add note

TODO points out where the problem is

* make things pass

* lint

* backcompat

* over-eager merge

* for coverage
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.