
[Python] Info about offset is lost when converting struct array to record batch #34639

Closed
lccnl opened this issue Mar 20, 2023 · 5 comments · Fixed by #34691

Comments

@lccnl

lccnl commented Mar 20, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Hello,
it seems that when a struct array points to part of a larger array via an offset and we convert it to a record batch, that offset information is lost and we get a record batch whose columns have the larger array's length.

For now, a workaround is to select (take) the actual values in the array before converting it to a record batch (though this does not scale well; slicing does not work).

The following code reproduces the error and shows the workaround:


import pyarrow as pa

# Create a struct array and a table with the struct as its column,
# then split the table into record batches.
struct_1 = pa.StructArray.from_arrays(
    [pa.array([1.0, 2.0]), pa.array(["a", "b"])], names=["col1", "col2"]
)
out = pa.Table.from_arrays(arrays=[struct_1], names=["struct"])
batches = out.to_batches(max_chunksize=1)

# Convert back to struct arrays; each array has an offset referencing the table.
arrays = [
    pa.StructArray.from_arrays(batch.columns, names=batch.schema.names)
    for batch in batches
]
# Select the struct column.
modified_arrays = [array.field("struct") for array in arrays]

# Take the first element of each array to show the difference.
taken_arrays = [array.take(pa.array([0])) for array in modified_arrays]

for standard, taken in zip(modified_arrays, taken_arrays):
    # The arrays are equal...
    assert standard == taken
    assert len(standard) == len(taken) == 1
    # ...but the record batches differ!
    assert len(pa.RecordBatch.from_struct_array(standard).column(0)) == 2
    assert len(pa.RecordBatch.from_struct_array(taken).column(0)) == 1

### Component(s)

Python
westonpace added a commit that referenced this issue Mar 27, 2023
…rray has nulls/offsets (#34691)

### Rationale for this change

A struct array can have a validity map and an offset.  A record batch cannot.  When converting from a struct array to a record batch we were throwing an error if a validity map was present and returning the wrong data if an offset was present.

### What changes are included in this PR?

If a validity map or offset is present then StructArray::Flatten is used to push the offset and validity map down into the child arrays.  Note: this means that RecordBatch::FromStructArray will not be zero-copy if it has to push down a validity map.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes, RecordBatch::FromStructArray now takes in a memory pool because it might have to make allocations when pushing validity bitmaps down.

* Closes: #34639

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 12.0.0 milestone Mar 27, 2023
@lccnl
Author

lccnl commented May 3, 2023

Hello @westonpace ,
I tested the fix on pyarrow 12 and it seems that the bug is still there:

assert len(pa.RecordBatch.from_struct_array(standard).column(0)) == len(pa.RecordBatch.from_struct_array(taken).column(0))

yields an error

The problem is with the first record batch, which has both rows rather than only the first; the second one works fine. By the way, I tried using array.flatten()[0] rather than array.field('struct'), but that does not solve the problem.

@westonpace westonpace reopened this May 5, 2023
@westonpace
Member

I've reopened this so we can verify, but I think it is actually doing the right thing. That said, I think there are additional bugs in to_struct_array and to_pandas (:face_exhaling:)

> pa.RecordBatch.from_struct_array(standard)

This will give you a record batch that has length 1 with two child arrays that each have length 2. This is allowed because it lets us use zero-copy.

>>> x = pa.RecordBatch.from_struct_array(standard)
>>> print(x) # Sadly, we don't print the contents here
pyarrow.RecordBatch
col1: double
col2: string
>>> print(x.num_rows) # This is correct
1
>>> print(x.column(0)) # This is arguably correct but misleading
[
  1,
  2
]
>>> print(x.to_pylist()) # This is correct
[{'col1': 1.0, 'col2': 'a'}]
>>> print(x.to_struct_array()) # This is wrong
-- is_valid: all not null
-- child 0 type: double
  [
    1,
    2
  ]
-- child 1 type: string
  [
    "a",
    "b"
  ]
>>> print(x.to_pandas()) # this is also wrong
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 852, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 2506, in pyarrow.lib.RecordBatch._to_pandas
  File "pyarrow/table.pxi", line 4075, in pyarrow.lib.Table._to_pandas
  File "/home/pace/dev/arrow/python/pyarrow/pandas_compat.py", line 823, in table_to_blockmanager
    return BlockManager(blocks, axes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pace/miniconda3/envs/conbench3/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 1040, in __init__
    self._verify_integrity()
  File "/home/pace/miniconda3/envs/conbench3/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 1047, in _verify_integrity
    raise construction_error(tot_items, block.shape[1:], self.axes)
ValueError: Shape of passed values is (2, 2), indices imply (1, 2)

I will open up two new issues for to_struct_array and to_pandas. Arguably, we should also modify to_batches to push "short lengths" into the arrays themselves. I'll have to ask on the ML if it's legal for a record batch and its arrays to have different lengths.

@westonpace
Member

Alright, I've filed #35450 and #35452 and I've asked on the ML about these kinds of arrays.

@lccnl
Author

lccnl commented May 6, 2023

OK, thank you for the explanation and for opening specific issues!

One thing I have noticed is that if you do your same test on the second batch:

x=pa.RecordBatch.from_struct_array(modified_arrays[1])

everything works fine (the to_pandas and to_struct_array methods yield the correct array with one row).

In the first case (taking the first batch), the method to_pydict also returns the wrong result (I guess it must be related to how to_struct_array works?).

If you want to close this issue, I can follow the two you have opened.

@westonpace
Member

One thing I have noticed is that if you do your same test on the second batch:

Yes, since the second batch has a non-zero offset we follow a different code path that handles the non-matching lengths better. I will close this back up.

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…ruct array has nulls/offsets (apache#34691)
