[C++] RecordBatch::ToStructArray does not respect the length of the record batch if it differs from the length of the child arrays #35450
Comments
Should we fix |
The docs for arrow/cpp/src/arrow/record_batch.h Lines 44 to 50 in ed99693
Whether that should be the case is debatable though. If we were to slice the arrays then we'd probably want to do it on construction so methods like |
As you mention, we can get into this situation by slicing. So this GH issue is specifically that |
What does |
Ok, so the record batch is invalid. I'm not sure it makes sense to check specifically for that when converting to struct. |
This seems to disagree with your statement on the ML thread: https://lists.apache.org/thread/6jtyf5xhfdocb2rlx1jfjwx0rj4hn6o1
However, in the ML, we were talking about struct arrays, and not strictly record batches, and it's probably ok for those two things to act differently. If we want to align on If we instead want record batches to behave like struct arrays were described in the ML discussion then |
Record batches are a bit different from struct arrays. Slicing a struct array changes the offset of the struct array itself. Slicing a record batch slices each column individually, since record batches do not have an offset. So there's no need to allow record batch columns with a different length than the record batch itself.
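To make the distinction above concrete, here is a minimal sketch using simplified stand-in types. `OffsetStructArray` and `ColumnwiseBatch` are hypothetical names invented for this illustration and are not Arrow's classes; columns are modeled as plain `std::vector<int64_t>`, and no bounds checking is done.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a struct array: slicing only adjusts offset/length
// and leaves the child arrays at their full, unsliced length.
// (This toy copies the children; a real implementation would share them.)
struct OffsetStructArray {
  int64_t offset;
  int64_t length;
  std::vector<std::vector<int64_t>> children;

  OffsetStructArray Slice(int64_t start, int64_t count) const {
    return {offset + start, count, children};
  }

  // Logical value of child `field` at logical row `row`.
  int64_t Value(std::size_t field, int64_t row) const {
    return children[field][static_cast<std::size_t>(offset + row)];
  }
};

// Hypothetical model of a record batch: there is no offset, so slicing
// slices every column and each column length stays equal to num_rows.
struct ColumnwiseBatch {
  int64_t num_rows;
  std::vector<std::vector<int64_t>> columns;

  ColumnwiseBatch Slice(int64_t start, int64_t count) const {
    ColumnwiseBatch out{count, {}};
    for (const auto& col : columns) {
      out.columns.emplace_back(col.begin() + start, col.begin() + start + count);
    }
    return out;
  }
};
```

In this model a sliced `OffsetStructArray` legitimately has children longer than its own `length`, whereas a sliced `ColumnwiseBatch` never does, which is why a length mismatch in a record batch can be treated as invalid.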
That sounds like the best solution to me. |
… with mismatched column lengths (#36654)

### Rationale for this change

If a `RecordBatch` is created with column lengths that don't match the provided `num_rows` (technically invalid), then there are some circumstances where `ToStructArray` will successfully return an array whose length doesn't match `num_rows`. Instead, we should return an error.

### What changes are included in this PR?

* Add a small validation check to `ToStructArray` before constructing the output array
* Add a test

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

No

* Closes: #35450

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
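The kind of check the commit message describes can be sketched as follows. This is only an illustration under stated assumptions, not the code from the PR: `ToyRecordBatch`, `ToyStructArray`, `ToyResult`, and `ToStructArrayChecked` are hypothetical stand-ins built on the standard library, not Arrow's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of a record batch: a declared row count plus columns.
struct ToyRecordBatch {
  int64_t num_rows;
  std::vector<std::vector<int64_t>> columns;
};

// Hypothetical model of a struct array: a length plus child arrays.
struct ToyStructArray {
  int64_t length;
  std::vector<std::vector<int64_t>> children;
};

// Either a value or an error message, loosely imitating a result type.
struct ToyResult {
  bool ok;
  ToyStructArray value;
  std::string error;
};

// Validate every column length against num_rows before building the
// output, and return an error instead of a mismatched array.
ToyResult ToStructArrayChecked(const ToyRecordBatch& batch) {
  for (std::size_t i = 0; i < batch.columns.size(); ++i) {
    const int64_t len = static_cast<int64_t>(batch.columns[i].size());
    if (len != batch.num_rows) {
      return {false, ToyStructArray{0, {}},
              "column " + std::to_string(i) + " has length " +
                  std::to_string(len) + ", expected " +
                  std::to_string(batch.num_rows)};
    }
  }
  return {true, ToyStructArray{batch.num_rows, batch.columns}, ""};
}
```

With this sketch, a batch declared with `num_rows = 2` but holding three-element columns yields an error rather than a struct array whose length disagrees with its children.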
Describe the bug, including details regarding any error messages, version, and platform.
I'm not entirely certain if it's legal for a record batch to be shorter than its child arrays. However, if it is, then ToStructArray is not working properly.
Component(s)
C++