Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] RecordBatch::ToStructArray does not respect the length of the record batch if it differs from the length of the child arrays #35450

Closed
westonpace opened this issue May 5, 2023 · 8 comments · Fixed by #36654
Assignees
Milestone

Comments

@westonpace
Copy link
Member

Describe the bug, including details regarding any error messages, version, and platform.

I'm not entirely certain if it's legal for a record batch to be shorter than its child arrays. However, if it is, then ToStructArray is not working properly.

TEST_F(TestRecordBatch, SlicedToStructArray) {
  std::shared_ptr<Array> x_arr = ArrayFromJSON(int64(), "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]");
  auto batch = RecordBatch::Make(::arrow::schema({field("x", int64())}), 5, {x_arr});
  ASSERT_EQ(batch->num_rows(), 5); // This works
  ASSERT_OK_AND_ASSIGN(auto struct_array, batch->ToStructArray());
  ASSERT_EQ(struct_array->length(), 5); // This fails
}

Component(s)

C++

@wgtmac
Copy link
Member

wgtmac commented May 8, 2023

Should we fix ToStructArray() by slicing the struct array before return? @westonpace

@benibus
Copy link
Collaborator

benibus commented Jun 23, 2023

The docs for Make say that the columns' lengths should match the requested length, so I suppose it's technically illegal.

/// \param[in] schema The record batch schema
/// \param[in] num_rows length of fields in the record batch. Each array
/// should have the same length as num_rows
/// \param[in] columns the record batch fields as vector of arrays
static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema,
int64_t num_rows,
std::vector<std::shared_ptr<Array>> columns);

Whether that should be the case is debatable though. If we were to slice the arrays then we'd probably want to do it on construction so methods like Equals would also work.

@westonpace
Copy link
Member Author

As you mention, we can get into this situation by slicing. So this GH issue is specifically that ToStructArray works once we've gotten in this situation. If we want to forbid Make with invalid lengths that is ok (though we should probably at least DCHECK this constraint). However, even in that case, there is still a bug with ToStructArray that needs to be fixed.

@pitrou
Copy link
Member

pitrou commented Jul 13, 2023

What does RecordBatch::Validate say in this situation?

@benibus
Copy link
Collaborator

benibus commented Jul 13, 2023

Validate checks that the column lengths are equal to num_rows

@pitrou
Copy link
Member

pitrou commented Jul 13, 2023

Ok, so the record batch is invalid. I'm not sure it makes sense to check specifically for that when converting to struct.

@westonpace
Copy link
Member Author

Ok, so the record batch is invalid.

@pitrou

This seems to disagree with your statement on the ML thread: https://lists.apache.org/thread/6jtyf5xhfdocb2rlx1jfjwx0rj4hn6o1

The fact that a struct field may be backed by a
physically larger C++ ArrayData is irrelevant, as long as it's logically
interpreted as having "the same length".

However, in the ML, we were talking about struct arrays, and not strictly record batches, and it's probably ok for those two things to act differently. If we want to align on RecordBatch::Validate then I think we need to update RecordBatch::FromStructArray to check for this case and repair it (instead of using the struct array's length directly).

If we instead want record batches to behave like struct arrays were described in the ML discussion then RecordBatch::Validate is wrong and this should be valid.

@pitrou
Copy link
Member

pitrou commented Jul 24, 2023

Record batches are a bit different from struct arrays. Slicing a struct array changes the offset of the struct array itself. Slicing a record batch slices each column individually, since record batches do not have an offset. So there's no need to allow record batch columns with a different length than the record batch itself.

If we want to align on RecordBatch::Validate then I think we need to update RecordBatch::FromStructArray to check for this case and repair it (instead of using the struct array's length directly).

That sounds like the best solution to me.

pitrou pushed a commit that referenced this issue Aug 10, 2023
… with mismatched column lengths (#36654)

### Rationale for this change

If a `RecordBatch` is created with column lengths that don't match the provided `num_rows` (technically invalid), then there are some circumstances where `ToStructArray` will successfully return an array whose length doesn't match `num_rows`. Instead, we should return an error.

### What changes are included in this PR?

* Add a small validation check to `ToStructArray` before constructing the output array
* Add a test

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

No

* Closes: #35450

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 14.0.0 milestone Aug 10, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…called with mismatched column lengths (apache#36654)

### Rationale for this change

If a `RecordBatch` is created with column lengths that don't match the provided `num_rows` (technically invalid), then there are some circumstances where `ToStructArray` will successfully return an array whose length doesn't match `num_rows`. Instead, we should return an error.

### What changes are included in this PR?

* Add a small validation check to `ToStructArray` before constructing the output array
* Add a test

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

No

* Closes: apache#35450

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants