
[Python] Info about offset is lost when converting struct array to record batch #34639

Closed
lccnl opened this issue Mar 20, 2023 · 5 comments · Fixed by #34691

Comments

@lccnl

lccnl commented Mar 20, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Hello,
it seems that when a struct array points to part of a larger array via an offset and we convert it to a record batch, that offset information is lost and we get a record batch whose columns have the larger array's length.

For now, a workaround is to select (take) the actual values in the array before converting it to a record batch (though this does not scale well; slicing does not work).

The following code reproduces the error and shows the workaround:


import pyarrow as pa

# Create a struct array and a table with the struct as its column,
# then split the table into record batches.
struct_1 = pa.StructArray.from_arrays(
    [pa.array([1.0, 2.0]), pa.array(["a", "b"])], names=["col1", "col2"]
)
out = pa.Table.from_arrays(arrays=[struct_1], names=["struct"])
batches = out.to_batches(max_chunksize=1)

# Convert back to struct arrays; each array has an offset referencing the table.
arrays = [
    pa.StructArray.from_arrays(batch.columns, names=batch.schema.names)
    for batch in batches
]
# Select the struct column.
modified_arrays = [array.field("struct") for array in arrays]

# Take the first element of each array to show the difference.
taken_arrays = [array.take(pa.array([0])) for array in modified_arrays]

for standard, taken in zip(modified_arrays, taken_arrays):
    # The arrays are equal...
    assert standard == taken
    assert len(standard) == len(taken) == 1
    # ...but the record batches differ!
    assert len(pa.RecordBatch.from_struct_array(standard).column(0)) == 2
    assert len(pa.RecordBatch.from_struct_array(taken).column(0)) == 1

### Component(s)

Python
westonpace added a commit that referenced this issue Mar 27, 2023
…rray has nulls/offsets (#34691)

### Rationale for this change

A struct array can have a validity map and an offset.  A record batch cannot.  When converting from a struct array to a record batch we were throwing an error if a validity map was present and returning the wrong data if an offset was present.

### What changes are included in this PR?

If a validity map or offset is present then StructArray::Flatten is used to push the offset and validity map down into the child arrays.  Note: this means that RecordBatch::FromStructArray will not be zero-copy if it has to push down a validity map.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes, RecordBatch::FromStructArray now takes in a memory pool because it might have to make allocations when pushing validity bitmaps down.

* Closes: #34639

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 12.0.0 milestone Mar 27, 2023
@lccnl
Author

lccnl commented May 3, 2023

Hello @westonpace ,
I tested the fix on pyarrow 12 and it seems that the bug is still there:

assert len(pa.RecordBatch.from_struct_array(standard).column(0)) == len(pa.RecordBatch.from_struct_array(taken).column(0))

yields an error

The problem is with the first record batch, which has both rows rather than only the first; the second one works fine. By the way, I tried using array.flatten()[0] rather than array.field('struct'), but that does not solve the problem.

@westonpace westonpace reopened this May 5, 2023
@westonpace
Member

I've reopened this so we can verify, but I think it is actually doing the right thing. That said, I think there are additional bugs in to_struct_array and to_pandas (:face_exhaling:)

> pa.RecordBatch.from_struct_array(standard)

This will give you a record batch that has length 1 with two child arrays that each have length 2. This is allowed because it lets us use zero-copy.

>>> x = pa.RecordBatch.from_struct_array(standard)
>>> print(x) # Sadly, we don't print the contents here
pyarrow.RecordBatch
col1: double
col2: string
>>> print(x.num_rows) # This is correct
1
>>> print(x.column(0)) # This is arguably correct but misleading
[
  1,
  2
]
>>> print(x.to_pylist()) # This is correct
[{'col1': 1.0, 'col2': 'a'}]
>>> print(x.to_struct_array()) # This is wrong
-- is_valid: all not null
-- child 0 type: double
  [
    1,
    2
  ]
-- child 1 type: string
  [
    "a",
    "b"
  ]
>>> print(x.to_pandas()) # this is also wrong
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 852, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 2506, in pyarrow.lib.RecordBatch._to_pandas
  File "pyarrow/table.pxi", line 4075, in pyarrow.lib.Table._to_pandas
  File "/home/pace/dev/arrow/python/pyarrow/pandas_compat.py", line 823, in table_to_blockmanager
    return BlockManager(blocks, axes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pace/miniconda3/envs/conbench3/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 1040, in __init__
    self._verify_integrity()
  File "/home/pace/miniconda3/envs/conbench3/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 1047, in _verify_integrity
    raise construction_error(tot_items, block.shape[1:], self.axes)
ValueError: Shape of passed values is (2, 2), indices imply (1, 2)

I will open up two new issues for to_struct_array and to_pandas. Arguably, we should also modify to_batches to push "short lengths" into the arrays themselves. I'll have to ask on the ML if it's legal for a record batch and its arrays to have different lengths.

@westonpace
Member

Alright, I've filed #35450 and #35452 and I've asked on the ML about these kinds of arrays.

@lccnl
Author

lccnl commented May 6, 2023

OK, thank you for the explanation and for opening specific issues!

One thing I have noticed is that if you do your same test on the second batch:

x=pa.RecordBatch.from_struct_array(modified_arrays[1])

everything works fine (the to_pandas and to_struct_array methods yield the correct array with one row).

In the first case (taking the first batch), the method to_pydict also returns the wrong result (I guess it must be related to how to_struct_array works?).

If you want to close this issue, I can follow the two you have opened.

@westonpace
Member

One thing I have noticed is that if you do your same test on the second batch:

Yes, since the second batch has a non-zero offset we follow a different code path that handles the non-matching lengths better. I will close this back up.

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…ruct array has nulls/offsets (apache#34691)
