[C++] Support for union type in ORC writer #34262

Closed
hinxx opened this issue Feb 20, 2023 · 9 comments · Fixed by #34416

Comments

@hinxx

hinxx commented Feb 20, 2023

Describe the enhancement requested

I've built a union manually with pa.UnionArray.from_dense(), not using inference.

In Python, I'm trying to write that union to an ORC file, and I get the following error:

pyarrow.lib.ArrowNotImplementedError: Unknown or unsupported Arrow type: dense_union<int64: int64=0, float64: double=1, string: string=2>

It seems that this is not supported even in the C++ code that, I believe, Python uses:

return Status::NotImplemented("Unknown or unsupported Arrow type: ",

Are there any plans to add this feature to Arrow?

Component(s)

C++, Python

@kou changed the title from "Support for union type in ORC writer" to "[C++] Support for union type in ORC writer" Feb 20, 2023
@kou
Member

kou commented Feb 20, 2023

Could you share a script that reproduces this case?

There is no plan for this. We need a volunteer.

It seems that there is Type::type::DENSE_UNION in the switch:

case Type::type::DENSE_UNION:

@hinxx
Author

hinxx commented Feb 22, 2023

Here is the script:

import pandas as pd
import pyarrow as pa
from pyarrow import orc
import json


names = pd.Series(['row1', 'row2', 'row3', 'row4', 'row5', 'row6'], dtype='string', name='name')
timestamps = pd.Series([pd.Timestamp('2022-07-15 12:40:13.439549952'), pd.Timestamp('2022-07-15 12:40:13.439546880'), pd.Timestamp('2023-02-08 09:13:32.287076352'), pd.Timestamp('2023-02-08 09:13:32.587076352'), pd.Timestamp('2022-07-07 14:23:10.092787968'), pd.Timestamp('2022-07-15 12:40:13.839546624')], dtype='datetime64[ns]', name='timestamp')
tags = pd.Series([1, 1, 0, 0, 2, 1], dtype='uint8', name='tag')
offsets = pd.Series([0, 1, 0, 1, 0, 2], dtype='uint32', name='offset')
integers = pd.Series([5, 53], dtype='int64', name='integer')
floats = pd.Series([0.011021, -32580.0, -33580.0], dtype='float64', name='float')
strings = pd.Series(['3.10.0'], dtype='string', name='string')

union_schema = pa.union([
    pa.field('int64', pa.int64()),
    pa.field('float64', pa.float64()),
    pa.field('string', pa.string())
    ], 'dense')
schema = pa.schema([
    ('name', pa.string()),
    ('timestamp', pa.timestamp('ns')),
    ('value', union_schema)
    ])
union = pa.UnionArray.from_dense(
    pa.array(tags, type='int8'),
    pa.array(offsets, type='int32'),
    [   pa.Array.from_pandas(integers),
        pa.Array.from_pandas(floats),
        pa.Array.from_pandas(strings)
    ],
    ['int64', 'float64', 'string']
    )
table = pa.Table.from_arrays([
    pa.Array.from_pandas(names),
    pa.Array.from_pandas(timestamps),
    union
    ], schema=schema)

print('table', table)
writer = orc.ORCWriter('union1.orc', dictionary_key_size_threshold=1)
writer.write(table)
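
For readers less familiar with the dense union layout used above: each row's tag selects a child array, and the row's offset selects a position inside that child. The following is a minimal pure-Python sketch of that lookup, using the same values as the script. It does not use pyarrow, and `decode_dense_union` is a hypothetical helper name introduced only for illustration.

```python
# Sketch of how a dense union resolves each row to a value.
# decode_dense_union is a hypothetical helper, not a pyarrow API.

def decode_dense_union(type_ids, offsets, children):
    """Return the logical value of every row of a dense union.

    type_ids[i] picks the child array for row i, and offsets[i]
    picks the position inside that child.
    """
    return [children[t][o] for t, o in zip(type_ids, offsets)]


# Same values as the tags, offsets, integers, floats, and strings above.
type_ids = [1, 1, 0, 0, 2, 1]
offsets = [0, 1, 0, 1, 0, 2]
children = [
    [5, 53],                         # child 0: int64
    [0.011021, -32580.0, -33580.0],  # child 1: float64
    ["3.10.0"],                      # child 2: string
]

values = decode_dense_union(type_ids, offsets, children)
print(values)
# [0.011021, -32580.0, 5, 53, '3.10.0', -33580.0]
```
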

And this is the output I get:

$ python union1.py 
table pyarrow.Table
name: string
timestamp: timestamp[ns]
value: dense_union<int64: int64=0, float64: double=1, string: string=2>
  child 0, int64: int64
  child 1, float64: double
  child 2, string: string
----
name: [["row1","row2","row3","row4","row5","row6"]]
timestamp: [[2022-07-15 12:40:13.439549952,2022-07-15 12:40:13.439546880,2023-02-08 09:13:32.287076352,2023-02-08 09:13:32.587076352,2022-07-07 14:23:10.092787968,2022-07-15 12:40:13.839546624]]
value: [  -- is_valid: all not null  -- type_ids: [1,1,0,0,2,1]  -- value_offsets: [0,1,0,1,0,2]
  -- child 0 type: int64
[5,53]
  -- child 1 type: double
[0.011021,-32580,-33580]
  -- child 2 type: string
["3.10.0"]]
Traceback (most recent call last):
  File "union1.py", line 42, in <module>
    writer.write(table)
  File "/data/data/Code/orc/python/venv/lib/python3.8/site-packages/pyarrow/orc.py", line 289, in write
    self.writer.write(table)
  File "pyarrow/_orc.pyx", line 443, in pyarrow._orc.ORCWriter.write
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unknown or unsupported Arrow type: dense_union<int64: int64=0, float64: double=1, string: string=2>

@kou
Member

kou commented Feb 22, 2023

Thanks.

@wgtmac Do you want to take a look at this as an Apache ORC PMC member?

@wgtmac
Member

wgtmac commented Feb 23, 2023

Thanks for mentioning me. Could you please assign it to me? @kou

@kou
Member

kou commented Feb 23, 2023

Thanks!
Could you add a comment here that contains only "take", like #33849 (comment)?
See also: https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment

@wgtmac
Member

wgtmac commented Feb 23, 2023

take

wgtmac added a commit to wgtmac/arrow that referenced this issue Mar 2, 2023
wgtmac added a commit to wgtmac/arrow that referenced this issue Mar 3, 2023
wgtmac added a commit to wgtmac/arrow that referenced this issue Mar 3, 2023
westonpace pushed a commit that referenced this issue Mar 6, 2023
### Rationale for this change

The ORC adapter does not yet support the union type.

### What changes are included in this PR?

Add union type support to both the ORC reader and writer.

### Are these changes tested?

To be added.

### Are there any user-facing changes?

No.
* Closes: #34262

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace added this to the 12.0.0 milestone Mar 6, 2023
@wgtmac
Member

wgtmac commented Mar 7, 2023

@hinxx This is fixed and you can try it out from the latest main branch. Let me know if there is any feedback. Thanks!

@hinxx
Author

hinxx commented Mar 7, 2023

@wgtmac I no longer get the error from the writer.write() call. Also, the data read back from the written file looks as expected to me:

>>> table = orc.read_table('union1.orc')
>>> table
pyarrow.Table
name: string
timestamp: timestamp[ns]
value: sparse_union<_union_0: int64=0, _union_1: double=1, _union_2: string=2>
  child 0, _union_0: int64
  child 1, _union_1: double
  child 2, _union_2: string
----
name: [["row1","row2","row3","row4","row5","row6"]]
timestamp: [[2022-07-15 12:40:13.439549952,2022-07-15 12:40:13.439546880,2023-02-08 09:13:32.287076352,2023-02-08 09:13:32.587076352,2022-07-07 14:23:10.092787968,2022-07-15 12:40:13.839546624]]
value: [  -- is_valid: all not null  -- type_ids: [1,1,0,0,2,1]
  -- child 0 type: int64
[null,null,5,53,null,null]
  -- child 1 type: double
[0.011021,-32580,null,null,null,-33580]
  -- child 2 type: string
[null,null,null,null,"3.10.0",null]]
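
Note that the union comes back as a sparse_union with different child names, while the original was a dense_union, so the two tables are best compared by logical value rather than by layout. The sketch below is pure Python without pyarrow, and it rebuilds the sparse children from the original dense inputs to check that both layouts decode to the same rows. The helper names `dense_to_sparse` and `decode_sparse` are hypothetical, and `None` stands in for the slots printed as null above.

```python
# Sketch comparing the dense layout that was written with the sparse
# layout that was read back. Helper names are hypothetical, not pyarrow APIs.

def dense_to_sparse(type_ids, offsets, dense_children):
    """Expand compact dense children into full-length sparse children."""
    n_rows = len(type_ids)
    sparse = [[None] * n_rows for _ in dense_children]
    for row, (t, o) in enumerate(zip(type_ids, offsets)):
        # In a sparse union the value lives at the row's own index.
        sparse[t][row] = dense_children[t][o]
    return sparse


def decode_sparse(type_ids, sparse_children):
    """Return the logical value of every row of a sparse union."""
    return [sparse_children[t][row] for row, t in enumerate(type_ids)]


type_ids = [1, 1, 0, 0, 2, 1]
offsets = [0, 1, 0, 1, 0, 2]
dense_children = [[5, 53], [0.011021, -32580.0, -33580.0], ["3.10.0"]]

sparse_children = dense_to_sparse(type_ids, offsets, dense_children)
for child in sparse_children:
    print(child)
# [None, None, 5, 53, None, None]
# [0.011021, -32580.0, None, None, None, -33580.0]
# [None, None, None, None, '3.10.0', None]

# Both layouts decode to the same logical rows.
dense_values = [dense_children[t][o] for t, o in zip(type_ids, offsets)]
sparse_values = decode_sparse(type_ids, sparse_children)
print(dense_values == sparse_values)
# True
```

The printed children match the three child arrays in the readback, which suggests the round trip preserved the data even though the physical layout changed.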

Great job! Thank you for such a quick resolution of this issue.

@wgtmac
Member

wgtmac commented Mar 7, 2023

Thanks for your confirmation! Good to know it works.
