
Fixes parquet breakages with arrow 0.13.0 #4668

Merged
merged 8 commits into dask:master from arrow13-fix on Apr 6, 2019

Conversation

@martindurant (Member) commented Apr 4, 2019

  • Tests added / passed
  • Passes flake8 dask
@martindurant (Member, Author) commented Apr 4, 2019

Two of the test failures are expected: they cover cases that can now be loaded where before they could not.

The code here aims for a minimum-lines change, which makes the control flow somewhat convoluted.

One remaining problem is that the index produced by pyarrow carries names that refer to none of the input columns (the range index exists in the metadata only). If the user were to choose another index, a real column, it's not clear what we should do: materialise the range index as a real column?
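To illustrate the metadata-only case (the metadata shape below follows the pandas-integration format that arrow 0.13 writes into the parquet footer; the helper name and values are hypothetical, not code from this PR): a RangeIndex appears as a descriptor dict in `index_columns` rather than as the name of a stored column, so a reader can distinguish the two cases like this:

```python
# Illustrative pandas metadata as arrow 0.13 serialises it into the parquet
# footer (shape per the Arrow pandas-integration format; values hypothetical).
pandas_meta = {
    "index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 100, "step": 1}
    ],
    "columns": [{"name": "x", "field_name": "x", "pandas_type": "int64"}],
}

def index_is_metadata_only(meta):
    """True when every index entry is a RangeIndex descriptor (a dict)
    rather than the name of a real stored column (a string)."""
    return all(
        isinstance(ix, dict) and ix.get("kind") == "range"
        for ix in meta["index_columns"]
    )

print(index_is_metadata_only(pandas_meta))  # True: index exists only in metadata
```

When the user asks for a real column as the index instead, the descriptor above corresponds to no stored data, which is exactly why "materialise it as a real column" is the open question.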

There is also a question around infer_divisions. For range indexes this is similar to having to load statistics: you need the number of rows in each file, or you must read the range-index metadata in each file's footer; either way, it needs some extra special handling. Note that we previously did not apply divisions to the implicit default index for non-pandas parquet reading, but we should: if the first row-group has 10 rows, the second's range index should start at 10, so we do know the divisions, but only if we can percolate that information into the read function.
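The bookkeeping described above can be sketched as follows (a hypothetical helper, not the code in this PR): given the per-file (or per-row-group) row counts, the divisions of the implicit range index are the cumulative starts, with the last division being the final valid index:

```python
from itertools import accumulate

def rangeindex_divisions(row_counts):
    """Divisions for an implicit RangeIndex over consecutive pieces.

    Partition i covers indices [starts[i], starts[i+1]). Dask's convention
    is len(divisions) == npartitions + 1, with the last division being the
    maximum index value (total rows - 1), so divisions are inclusive.
    """
    starts = [0] + list(accumulate(row_counts))
    return tuple(starts[:-1]) + (starts[-1] - 1,)

# Three pieces of 10, 10 and 5 rows: the second piece starts at index 10,
# the third at 20, and the last valid index is 24.
print(rangeindex_divisions([10, 10, 5]))  # (0, 10, 20, 24)
```

This only works if the row counts can reach the read function, which is the "percolate the information" point above.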


@martindurant (Member, Author) commented Apr 4, 2019

Fixes #4666 (eventually)


Martin Durant added 3 commits Apr 4, 2019
@martindurant (Member, Author) commented Apr 5, 2019

This will still fail on runs that use an older arrow: the py27 build picks up v0.10.0, even though v0.13.0 is available. The error that shows up in the log is totally incomprehensible to me. Suggestions?


@martindurant (Member, Author) commented Apr 5, 2019

(presumably we get the older versions because of the numpy pins)


@mrocklin (Member) commented Apr 6, 2019

Everything here looks sensible to me. Thanks for prioritizing this, @martindurant. +1 from me.


@mrocklin (Member) commented Apr 6, 2019

I'd be happy to merge once tests pass (which looks likely to happen shortly), though I notice that the title still carries a WIP label, so I'm holding off for now in case there is more you're planning to do.


@martindurant (Member, Author) commented Apr 6, 2019

@martindurant martindurant changed the title (WIP) Fixes parquet breakages with arrow 0.13.0 Fixes parquet breakages with arrow 0.13.0 Apr 6, 2019
@martindurant martindurant merged commit 0f97f40 into dask:master Apr 6, 2019
3 of 4 checks passed
@martindurant martindurant deleted the arrow13-fix branch Apr 6, 2019
mpeaton added a commit to Quansight/dask that referenced this issue Apr 8, 2019
* Fixes some of the breakages

* skip test where exception went away and add note

TODO points out where the problem is

* make things pass

* lint

* backcompat

* over-eager merge

* for coverage
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019