-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-2328: [C++] Fixed and unit tested feather writing with slice #1784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…f a table: - scalar data is written from the offset - null bitmap is written from the offset.
|
I will review this when I can, it has been a very busy week |
…led because of the extra test added. Alos boolean didn't need to be treated special, all types data length should be calcutated as the rounded up to the nearest multiple of 8 of length*bit-width.
|
BTW, feel no need to put these "ARROW-2328: [C++]" messages in your commits. The only one that matters is in the PR title |
|
Will do, thanks. |
cpp/src/arrow/ipc/test-common.h
Outdated
| } | ||
|
|
||
| Status MakeStringTypesRecordBatch(std::shared_ptr<RecordBatch>* out) { | ||
| Status MakeStringTypesRecordBatchWithNulls(std::shared_ptr<RecordBatch>* out, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style note: if it's called MakeStringTypesRecordBatchWithNulls, it shouldn't take an argument indicating whether you want to include nulls. Consider renaming or using a private helper function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or simply add a bool with_nulls = true optional argument to MakeStringTypesRecordBatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the default argument didn't work, because MakeStringTypesRecordBatch is used as a function pointer in other tests. I'll rename is to MakeStringTypesRecordBatchWithoutNulls abd this one to MakeStringTypesRecordBatch.
cpp/src/arrow/ipc/feather-test.cc
Outdated
| TEST_F(TestTableWriter, SliceRoundTrip) { | ||
| std::shared_ptr<RecordBatch> batch; | ||
| ASSERT_OK(MakeIntBatchSized(300, &batch)); | ||
| batch = batch->Slice(100, 100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if you could iterate over a couple of different slice starts / lengths, especially non-multiples of 8.
cpp/src/arrow/ipc/feather-test.cc
Outdated
| ASSERT_OK(MakeStringTypesRecordBatchWithNulls(&batch, false)); | ||
| batch = batch->Slice(320, 30); | ||
|
|
||
| ASSERT_OK(writer_->Append("f0", *batch->column(0))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of duplicate code here with SliceRoundTrip.
cpp/src/arrow/ipc/feather-test.cc
Outdated
|
|
||
| std::shared_ptr<Column> col; | ||
| ASSERT_OK(reader_->GetColumn(0, &col)); | ||
| SCOPED_TRACE(col->data()->chunk(0)->ToString() + "\n" + batch->column(0)->ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes(ish) It helps give a helpfull message when the test fails.
cpp/src/arrow/ipc/feather-test.cc
Outdated
| ASSERT_EQ("f1", col->name()); | ||
| } | ||
|
|
||
| TEST_F(TestTableWriter, SliceAtNonEightOffsetStringsWithNullsRoundTrip) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need writing independent methods here, just iterate over different slice starts / lengths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem with loops in unit tests is that you only see the first failure. That said I should be able to convert these tests to one parameterised test, which hopefully also answers the above concerns..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool :-) Though I don't think using a loop is a big deal either...
| int64_t bit_offset, const int64_t length, | ||
| int64_t* bytes_written) { | ||
| data = data + bit_offset / 8; | ||
| uint8_t bit_shift = static_cast<uint8_t>(bit_offset % 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the cast to int8 necessary here? (for performance perhaps?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, but I think I got some warning, maybe with clang.
cpp/src/arrow/ipc/feather.cc
Outdated
| @@ -558,15 +595,18 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { | |||
| // byte boundary, and we write this much data into the stream | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment needs updating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this. I added a couple review comments, please read.
|
Thanks for the review comments. I’ve rewritten the tests as parameter unit tests and did most of the rest. |
cpp/src/arrow/ipc/feather-test.cc
Outdated
|
|
||
| INSTANTIATE_TEST_CASE_P( | ||
| TestTableWriterSliceOffsets, TestTableWriterSlice, | ||
| ::testing::Values(std::make_tuple(300, 30), std::make_tuple(301, 30), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... Does gtest allow making a Cartesian product here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does, yesterday I was just too lazy to find out how.
cpp/src/arrow/ipc/feather-test.cc
Outdated
| ASSERT_OK(MakeIntBatchSized(start * 2, &batch)); | ||
| batch = batch->Slice(start, size); | ||
|
|
||
| ASSERT_OK(writer_->Append("f0", *batch->column(0))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part and below is still duplicated in the other two test cases. Do you think you can factor that out, for example as a template function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right the duplication can be removed without adding any complexity.
cpp/src/arrow/ipc/test-common.h
Outdated
| return Status::OK(); | ||
| } | ||
|
|
||
| Status MakeStringTypesRecordBatchWithoutNulls(std::shared_ptr<RecordBatch>* out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it called "...WithoutNulls" if it passes with_nulls = true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops.. I'll rename it.
cpp/src/arrow/ipc/test-common.h
Outdated
| } else { | ||
| const std::string& value = values[values_index]; | ||
| const std::string value = | ||
| i < int64_t(values.size()) ? values[values_index] : std::to_string(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you need this change? Is this just a debugging leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original random strings were not very random, just a repetition of the same few strings. I admit the new ones aren't very random either, in particular there are no repetitions which you would expect in a in real data. I probably should have added a new method (MakeRangeBinaryArray?).
Of course you don't want true random data either in a test.
If I revert the change I wouldn't be confident that my tests are still testing anything, the repetition might just align in a lucky pattern.
You ok with leaving it, I think it is better than the previous code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather revert this change. If we want better random generation of binary arrays we should do it more thoroughly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not knowing the history of this particular function, I don't know what would be "better" in there. For my test I pretty much just want the consecutive numbers, otherwise it is very difficult to see what has gone wrong (I like the tests to give me the answer to that question if possible). I made a change to add a second function that implements that, but turns out this function is only used from MakeStringTypesRecordBatch, so we can't really have two implementations.
…rayWithUniqueValues and used that in MakeStringTypesRecordBatch.
|
Thank you! I will merge once the AppVeyor build passes (the Travis-CI failures in the Rust and glib builds are unrelated). |
When writing a slice of a table to feather, both the scalar data and the null bitmap were written out wrongly, i.e. from the beginning of the original table instead of from the offset.
I messed up my original pull request #1766, hence this replaces it. Hope that helps.