[Python] How to add one level of nesting to flat table? #38912

Closed
sergun opened this issue Nov 28, 2023 · 5 comments
Labels
Component: Python Type: usage Issue is a user question

Comments

@sergun

sergun commented Nov 28, 2023

Describe the usage question you have. Please include as many useful details as possible.

I have a flat pa.Table:

table = pa.table({"a": [1, 2, 3], "b": [3, 4, 5]})

How can I create a new table from this one by adding one level of nesting?
I want a new table with a single column "c" of type struct with two fields, "a" and "b", that keeps the data from the original table.

Component(s)

Python

@sergun sergun added the Type: usage Issue is a user question label Nov 28, 2023
@AlenkaF
Member

AlenkaF commented Nov 28, 2023

If you can work with record batches, I would suggest using the to_struct_array() method:

import pyarrow as pa
batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3], "b": [3, 4, 5]})
struct_array = batch.to_struct_array()
batch_result = pa.RecordBatch.from_arrays([struct_array], names=["c"])
# pyarrow.RecordBatch
# c: struct<a: int64, b: int64>
#   child 0, a: int64
#   child 1, b: int64
# ----
# c: -- is_valid: all not null
# -- child 0 type: int64
# [1,2,3]
# -- child 1 type: int64
# [3,4,5]

If you need to work with tables, you can do the same for each individual batch of the table:

# I think this should work
table = pa.table({"a": [1, 2, 3], "b": [3, 4, 5]})
batches = []
for b in table.to_batches():
    batches.append(pa.RecordBatch.from_arrays([b.to_struct_array()], names=["c"]))
table_result = pa.Table.from_batches(batches)
# pyarrow.Table
# c: struct<a: int64, b: int64>
#   child 0, a: int64
#   child 1, b: int64
# ----
# c: [
#   -- is_valid: all not null
#   -- child 0 type: int64
# [1,2,3]
#   -- child 1 type: int64
# [3,4,5]]

@sergun
Author

sergun commented Nov 28, 2023

Thanks a lot @AlenkaF !

Am I right that such Table <-> RecordBatch transformations cost close to zero, according to:
https://arrow.apache.org/docs/cpp/tables.html#record-batches
?

"However, a table can be converted to and built from a sequence of record batches easily without needing to copy the underlying array buffers. A table can be streamed as an arbitrary number of record batches using a arrow::TableBatchReader. Conversely, a logical sequence of record batches can be assembled to form a table using one of the arrow::Table::FromRecordBatches() factory function overloads."

@AlenkaF
Member

AlenkaF commented Nov 28, 2023

Table to/from RecordBatches transformations are zero-copy.

@jorisvandenbossche
Member

We might want to add to_struct_array to Table as well? (returning a ChunkedArray of struct type) To make this a bit more convenient in case of a table.

@AlenkaF
Member

AlenkaF commented Dec 1, 2023

Oh, I forgot that an issue for this already exists, with an open PR! :) #38520

@AlenkaF AlenkaF closed this as completed Dec 1, 2023