Fixes parquet breakages with arrow 0.13.0 #4668
Two of the failures are acceptable: in those cases, data can now be loaded that previously could not be.
The change here uses the minimum number of lines, which makes the control flow somewhat convoluted.
One remaining problem is that the index produced by pyarrow contains names that refer to none of the input columns (they exist in the metadata only). If the user were to choose another index, a real column, it's not clear what we should do: should we materialise the range index as a real column?
Also, there is a question around infer_divisions, which for range indexes is similar to having to load statistics: you need either the number of rows in each file or the range index description stored in each file's metadata, and in either case this needs some extra special handling. Note that we didn't previously apply divisions to the implicit default index for non-pandas parquet reading, but we should. For example, if the first row-group has 10 rows, the second's RangeIndex should start at 10, so we do know the divisions, but only if we can percolate that information into the read function.
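To make the divisions point concrete, here is a minimal sketch of deriving divisions for an implicit range index from per-partition row counts. The function name `range_index_divisions` is hypothetical and not part of this PR; it assumes the row counts have already been read from each file's (or row-group's) metadata, and it follows the dask convention that the final division is the last index value, inclusive.

```python
from itertools import accumulate


def range_index_divisions(row_counts):
    """Divisions for a 0-based range index split into consecutive partitions.

    Hypothetical helper: ``row_counts`` holds the number of rows in each
    partition, assumed to come from the parquet metadata.
    """
    if not row_counts or any(n <= 0 for n in row_counts):
        raise ValueError("every partition needs a positive row count")
    # Exclusive end of each partition, which is also the start of the next one.
    ends = list(accumulate(row_counts))
    starts = [0] + ends[:-1]
    # Each partition's first index value, then the last index value overall.
    return tuple(starts) + (ends[-1] - 1,)


# First row-group has 10 rows, so the second partition starts at 10.
print(range_index_divisions([10, 5, 7]))
```

Each partition's reader would also need its own start offset (the matching entry of `starts`) to build the right index, which is the information that has to be percolated into the read function.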
WIP should have been removed, sorry. At airport, can edit maybe in a bit.
On April 6, 2019 11:09:15 AM EDT, Matthew Rocklin wrote:
> I'd be happy to merge after tests pass (which looks likely shortly), though I notice that there is still a WIP label in the title, so holding off for now in case there is more that you're planning to do.