
[Python][C++][Parquet] Support reading Parquet files having a permutation of column order #18353

Closed
asfimport opened this issue Mar 29, 2018 · 6 comments

Comments

@asfimport

asfimport commented Mar 29, 2018

See discussion in dask/fastparquet#320

Reporter: Wes McKinney / @wesm
Assignee: Alenka Frim / @AlenkaF


Note: This issue was originally created as ARROW-2366. Please see the migration documentation for further details.

@asfimport
Author

Uwe Korn / @xhochy:
This is the first iteration of schema evolution. While the issue here is quite simple, I would like to add general schema evolution / resolution rules to Arrow. My preference would be to stick as closely as possible to Avro's rules: http://avro.apache.org/docs/current/spec.html#Schema+Resolution
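
For illustration, a minimal sketch of name-based schema resolution as today's pyarrow exposes it via pyarrow.unify_schemas (using that function here is an editorial assumption for illustration; it is not part of the original proposal):

import pyarrow as pa

# two schemas describing the same columns in a different order
schema1 = pa.schema([('a', pa.int64()), ('b', pa.float64())])
schema2 = pa.schema([('b', pa.float64()), ('a', pa.int64())])

# fields are matched by name, so a pure column-order permutation
# resolves to a single schema (field order follows the first schema)
unified = pa.unify_schemas([schema1, schema2])
print(unified)  # a: int64, b: double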

@asfimport
Author

Wes McKinney / @wesm:
This will need to be addressed as part of general schema conformance in the C++ Datasets API.

cc @pitrou @nealrichardson

@asfimport
Author

Joris Van den Bossche / @jorisvandenbossche:
This is now implemented in the C++ Datasets project:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# create dummy dataset with column order permutation
import pathlib
basedir = pathlib.Path(".")
case = basedir / "dataset_column_order_permutation"
case.mkdir(exist_ok=True)

table1 = pa.table([[1, 2, 3], [.1, .2, .3]], names=['a', 'b'])
pq.write_table(table1, case / "data1.parquet")

table2 = pa.table([[.4, .5, .6], [4, 5, 6]], names=['b', 'a'])
pq.write_table(table2, case / "data2.parquet")

# reading with the old pyarrow.parquet implementation indeed raises on the schema mismatch
try:
    pq.read_table(str(case))
except Exception as exc:
    print("pyarrow.parquet raised:", exc)

# reading through the datasets API handles the column-order permutation fine
ds.dataset(str(case)).to_table().to_pandas()

So once the datasets API is used under the hood in pyarrow.parquet (ARROW-8039), this issue should be solved; we can still add a test for it before closing this issue.
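
A minimal sketch of such a regression test, assuming pytest and its tmp_path fixture (illustrative only, not necessarily the test that was eventually added):

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq


def test_read_dataset_with_permuted_column_order(tmp_path):
    # two files with the same columns, but in a different physical order
    pq.write_table(pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]}),
                   tmp_path / "data1.parquet")
    pq.write_table(pa.table({'b': [.4, .5, .6], 'a': [4, 5, 6]}),
                   tmp_path / "data2.parquet")

    result = ds.dataset(str(tmp_path)).to_table()

    # the dataset presents a single schema regardless of per-file column order
    assert result.column_names == ['a', 'b']
    assert sorted(result.column('a').to_pylist()) == [1, 2, 3, 4, 5, 6]
    assert sorted(result.column('b').to_pylist()) == [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]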

@asfimport
Author

Antoine Pitrou / @pitrou:
@jorisvandenbossche Can we close this now? Perhaps we just need to add a test to ensure this works properly?

@asfimport
Author

Joris Van den Bossche / @jorisvandenbossche:
Yes, it's on my "needs a test and then can be closed" list, so I would prefer to keep it open for now.

@asfimport
Author

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request #11561
