[Python] Order of columns in pyarrow.feather.read_table #32634

asfimport · 2022-08-09T18:59:05Z

In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# pandas main branch / 1.5
In [2]: df.to_orc("abc")

In [3]: pd.read_orc("abc", columns=['b', 'a'])
Out[3]:
   a  b
0  1  a
1  2  b
2  3  c

In [4]: import pyarrow.orc as orc

In [5]: orc_file = orc.ORCFile("abc")

# reordered to a, b
In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
Out[6]:
   a  b
0  1  a
1  2  b
2  3  c

# reordered to a, b
In [7]: orc_file.read(columns=['b', 'a'])
Out[7]:
pyarrow.Table
a: int64
b: string
----
a: [[1,2,3]]
b: [["a","b","c"]]

Reporter: Matthew Roeschke / @mroeschke
Assignee: Alenka Frim / @AlenkaF

Note: This issue was originally created as ARROW-17360. Please see the migration documentation for further details.

asfimport · 2022-10-20T11:20:08Z

Alenka Frim / @AlenkaF:
Thank you for reporting!

I would say this is not the expected behaviour. For the Parquet and Feather formats, the read methods preserve the order of the selected columns:

import pyarrow as pa
table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})

import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
pq.read_table('example.parquet', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]

import pyarrow.feather as feather
feather.write_feather(table, 'example_feather')
feather.read_table('example_feather', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]

From what I understand, looking at the code in pyarrow/_orc.pyx and arrow/adapters/orc/adapter.cc, the behaviour comes from Apache ORC itself, which follows the column order of the original schema. It could therefore be opened as an issue there.

Nevertheless, we have two options to make this work correctly:

1. Add a re-ordering step in pyarrow, as is done in the feather implementation.
2. Even better, pandas could use the new dataset API to read ORC files, like so:

import pyarrow.dataset as ds
dataset = ds.dataset("example.orc", format="orc")
dataset.to_table(columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]

asfimport · 2022-10-20T12:14:28Z

Joris Van den Bossche / @jorisvandenbossche:
For reference, we had a similar issue for Feather, where the underlying C++ reader always follows the order of the schema (ARROW-8641). And there we solved this by reordering the columns on the Python side in pyarrow.feather.read_table (as Alenka linked above).

asfimport · 2022-10-27T11:04:51Z

Alenka Frim / @AlenkaF:
Unfortunately, the easy solution of reordering the columns in pyarrow isn't feasible: pyarrow.Table.select() does not accept a "dotted path", whereas a "dotted path" can be used to select a column with ORCFile.read():

>       result4 = orc_file.read(columns=["struct.middle.inner"])

opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_orc.py:584: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/orc.py:189: in read
    table = table.select(columns)
pyarrow/table.pxi:3053: in pyarrow.lib.Table.select
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   KeyError: 'Field "struct.middle.inner" does not exist in table schema'

To close this issue, I will add a note to the orc.read_table() docstring stating that the output always follows the column order of the file.

A workaround for users affected by the ordering issue:

add .select(['b', 'a']) after reading the Table from the ORC file.

I still think a better solution would be that pandas starts using the new dataset API as mentioned above.

asfimport · 2022-11-09T10:55:28Z

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 14528
#14528

asfimport closed this as completed Nov 9, 2022

asfimport assigned AlenkaF Jan 11, 2023

asfimport added this to the 11.0.0 milestone Jan 11, 2023
