[Python] Order of columns in pyarrow.feather.read_table #32634

Closed
asfimport opened this issue Aug 9, 2022 · 4 comments

@asfimport

xref pandas-dev/pandas#47944

In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# pandas main branch / 1.5
In [2]: df.to_orc("abc")

In [3]: pd.read_orc("abc", columns=['b', 'a'])
Out[3]:
   a  b
0  1  a
1  2  b
2  3  c

In [4]: import pyarrow.orc as orc

In [5]: orc_file = orc.ORCFile("abc")

# reordered to a, b
In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
Out[6]:
   a  b
0  1  a
1  2  b
2  3  c

# reordered to a, b
In [7]: orc_file.read(columns=['b', 'a'])
Out[7]:
pyarrow.Table
a: int64
b: string
----
a: [[1,2,3]]
b: [["a","b","c"]] 

Reporter: Matthew Roeschke / @mroeschke
Assignee: Alenka Frim / @AlenkaF

Note: This issue was originally created as ARROW-17360. Please see the migration documentation for further details.

@asfimport

Alenka Frim / @AlenkaF:
Thank you for reporting!

I would say this is not the expected behaviour. For the Parquet and Feather formats, the read methods preserve the order of the selected columns:

import pyarrow as pa
table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})

import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
pq.read_table('example.parquet', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]

import pyarrow.feather as feather
feather.write_feather(table, 'example_feather')
feather.read_table('example_feather', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]

From what I understand of the code in pyarrow/_orc.pyx and arrow/adapters/orc/adapter.cc, the behaviour comes from Apache ORC itself, so it could be opened as an issue there (about following the order of the original schema).

Nevertheless, we have two options to make this work correctly:

  • Add a re-ordering step in pyarrow, as is done in the feather implementation.
  • Even better, pandas could use the new dataset API to read ORC files, like so:

    import pyarrow.dataset as ds
    dataset = ds.dataset("example.orc", format="orc")
    dataset.to_table(columns=['b', 'a'])
    # pyarrow.Table
    # b: string
    # a: int64
    # ----
    # b: [["a","b","c"]]
    # a: [[1,2,3]]

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
For reference, we had a similar issue for Feather, where the underlying C++ reader always follows the order of the schema (ARROW-8641). And there we solved this by reordering the columns on the Python side in pyarrow.feather.read_table (as Alenka linked above).

@asfimport

Alenka Frim / @AlenkaF:
Unfortunately, the easy solution of reordering the columns in pyarrow isn't feasible: pyarrow.Table.select() doesn't accept a "dotted path", whereas a "dotted path" can be used to select a nested column with ORCFile.read():

>       result4 = orc_file.read(columns=["struct.middle.inner"])

opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_orc.py:584: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/orc.py:189: in read
    table = table.select(columns)
pyarrow/table.pxi:3053: in pyarrow.lib.Table.select
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   KeyError: 'Field "struct.middle.inner" does not exist in table schema'

To close this issue I will add information to the orc.read_table() docstrings that we always follow the order of the file.

A workaround for users hitting the ordering issue:

  • add .select(columns=['b', 'a']) after reading the Table from the ORC file

I still think a better solution would be for pandas to start using the new dataset API, as mentioned above.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 14528
#14528

@asfimport asfimport added this to the 11.0.0 milestone Jan 11, 2023