[Python] Info about offset is lost when converting struct array to record batch #34639
Comments
…rray has nulls/offsets (#34691)

### Rationale for this change

A struct array can have a validity map and an offset. A record batch cannot. When converting from a struct array to a record batch we were throwing an error if a validity map was present and returning the wrong data if an offset was present.

### What changes are included in this PR?

If a validity map or offset is present, then StructArray::Flatten is used to push the offset and validity map down into the child arrays. Note: this means that RecordBatch::FromStructArray will not be zero-copy if it has to push down a validity map.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, RecordBatch::FromStructArray now takes in a memory pool because it might have to make allocations when pushing validity bitmaps down.

* Closes: #34639

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
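For illustration only, a small pyarrow sketch of the flattening behaviour this change builds on; the field names and values below are made up rather than taken from the PR:

```python
import pyarrow as pa

# A struct array with a validity bitmap (one null) and, after slicing,
# a non-zero offset -- the two things a record batch cannot carry itself.
struct = pa.array(
    [{"a": 1, "b": "x"}, None, {"a": 3, "b": "z"}],
    type=pa.struct([("a", pa.int64()), ("b", pa.string())]),
)
sliced = struct.slice(1)  # offset = 1, length = 2, one null

# flatten() pushes the parent offset and validity bitmap down into the
# child arrays -- the same flattening the PR applies inside FromStructArray.
for child in sliced.flatten():
    print(len(child), child.null_count)  # each child: length 2 with the parent's null
```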
Hello @westonpace ,
yields an error. The problem is with the first record batch, which ends up with both rows rather than only the first; the second one works fine. By the way, I tried to use
I've reopened this so we can verify, but I think it is actually doing the right thing. Although I think there is another bug in to_struct_array and to_pandas (:face_exhaling:).
This will give you a record batch that has length 1 with two child arrays that each have length 2. This is allowed because it lets us use zero-copy.
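For concreteness, a hedged sketch of the situation described above; the reproducer itself is not shown in this copy of the issue, so the field names and values are assumptions:

```python
import pyarrow as pa

struct = pa.array(
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
    type=pa.struct([("a", pa.int64()), ("b", pa.string())]),
)
first = struct.slice(0, 1)  # zero offset, length 1; the children still have length 2

batch = pa.RecordBatch.from_struct_array(first)
print(batch.num_rows)        # 1
print(len(batch.column(0)))  # 2 on the version discussed here: child buffers reused as-is (zero-copy)
```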
I will open up two new issues for to_struct_array and to_pandas. Arguably, we should also modify
Alright, I've filed #35450 and #35452, and I've asked on the mailing list about these kinds of arrays.
Ok, thank you for the explanation and for opening specific issues! One thing I have noticed is that if you do your same test on the second batch:
everything works fine (the methods to_pandas and to_struct_array yield the correct array with one row). In the first case (taking the first batch), the method to_pydict also returns the wrong result (I guess it must be related to how to_struct_array works?). If you want to close this issue, I can follow the two you have opened.
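For illustration, a hedged sketch of the kind of check described in the comment above; the field names, values, and use of slice() are assumptions, and to_struct_array is assumed to be available on the pyarrow version under discussion:

```python
import pyarrow as pa

struct = pa.array(
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
    type=pa.struct([("a", pa.int64()), ("b", pa.string())]),
)
second = struct.slice(1, 1)  # non-zero offset, so the flatten path is taken

batch2 = pa.RecordBatch.from_struct_array(second)
print(batch2.num_rows)           # 1
print(batch2.to_pydict())        # one row, as expected
print(batch2.to_struct_array())  # length-1 struct array
```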
Yes, since the second batch has a non-zero offset, we follow a different code path that handles the non-matching lengths better. I will close this back up.
Describe the bug, including details regarding any error messages, version, and platform.
Hello,
It seems that when a struct array points to part of a larger array via an offset and we try to convert it to a record batch, that information is lost and we get a record batch whose columns have the length of the larger array.
For now, a workaround is to select all of the actual values in the array before converting it to a record batch (though this solution does not scale well; slicing does not work).
The following code reproduces the error and shows the workaround:
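The snippet itself is not preserved in this copy of the issue. Below is a hedged reconstruction along the lines described above; the field names and values are made up, and using take() for the selection is an assumption:

```python
import pyarrow as pa

# A struct array of length 3; slicing it yields a view into the larger
# array via an offset/length rather than a copy.
struct = pa.array(
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}, {"a": 3, "b": "z"}],
    type=pa.struct([("a", pa.int64()), ("b", pa.string())]),
)
part = struct.slice(0, 1)  # a one-row view into the three-row array

# Reported bug: the slice information is dropped, so the columns of the
# record batch end up with the length of the underlying (larger) array.
batch = pa.RecordBatch.from_struct_array(part)
print(batch.num_rows, len(batch.column(0)))

# Workaround: materialise the selected rows (here with take(), which copies)
# before converting; this does not scale as well as a zero-copy slice.
selected = struct.take(pa.array([0]))
batch_ok = pa.RecordBatch.from_struct_array(selected)
print(batch_ok.num_rows, len(batch_ok.column(0)))
```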