Skip to content

Conversation

@Adriandorr
Copy link
Contributor

When writing a slice of a table to feather, both the scalar data and the null bitmap were written out wrongly, i.e. from the beginning of the original table instead of from the offset.

  • scalar data is written from the offset
  • null bitmap is written from the offset.

I messed up my original pull request #1766, hence this replaces it. Hope that helps.

…f a table:

- scalar data is written from the offset
- null bitmap is written from the offset.
@wesm wesm changed the title (ARROW-2328: [C++] Fixed and unit tested feather writing with slice ARROW-2328: [C++] Fixed and unit tested feather writing with slice Mar 23, 2018
@wesm
Copy link
Member

wesm commented Mar 23, 2018

I will review this when I can, it has been a very busy week

…led because of the extra test added.

Alos boolean didn't need to be treated special, all types data length should be calcutated as the rounded up to the nearest multiple of 8 of length*bit-width.
@wesm
Copy link
Member

wesm commented Mar 25, 2018

BTW, feel no need to put these "ARROW-2328: [C++]" messages in your commits. The only one that matters is in the PR title

@Adriandorr
Copy link
Contributor Author

Will do, thanks.

}

Status MakeStringTypesRecordBatch(std::shared_ptr<RecordBatch>* out) {
Status MakeStringTypesRecordBatchWithNulls(std::shared_ptr<RecordBatch>* out,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style note: if it's called MakeStringTypesRecordBatchWithNulls, it shouldn't take an argument indicating whether you want to include nulls. Consider renaming or using a private helper function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or simply add a bool with_nulls = true optional argument to MakeStringTypesRecordBatch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the default argument didn't work, because MakeStringTypesRecordBatch is used as a function pointer in other tests. I'll rename is to MakeStringTypesRecordBatchWithoutNulls abd this one to MakeStringTypesRecordBatch.

TEST_F(TestTableWriter, SliceRoundTrip) {
std::shared_ptr<RecordBatch> batch;
ASSERT_OK(MakeIntBatchSized(300, &batch));
batch = batch->Slice(100, 100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if you could iterate over a couple of different slice starts / lengths, especially non-multiples of 8.

ASSERT_OK(MakeStringTypesRecordBatchWithNulls(&batch, false));
batch = batch->Slice(320, 30);

ASSERT_OK(writer_->Append("f0", *batch->column(0)));
Copy link
Member

@pitrou pitrou Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of duplicate code here with SliceRoundTrip.


std::shared_ptr<Column> col;
ASSERT_OK(reader_->GetColumn(0, &col));
SCOPED_TRACE(col->data()->chunk(0)->ToString() + "\n" + batch->column(0)->ToString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes(ish) It helps give a helpfull message when the test fails.

ASSERT_EQ("f1", col->name());
}

TEST_F(TestTableWriter, SliceAtNonEightOffsetStringsWithNullsRoundTrip) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need writing independent methods here, just iterate over different slice starts / lengths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem with loops in unit tests is that you only see the first failure. That said I should be able to convert these tests to one parameterised test, which hopefully also answers the above concerns..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool :-) Though I don't think using a loop is a big deal either...

int64_t bit_offset, const int64_t length,
int64_t* bytes_written) {
data = data + bit_offset / 8;
uint8_t bit_shift = static_cast<uint8_t>(bit_offset % 8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the cast to int8 necessary here? (for performance perhaps?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, but I think I got some warning, maybe with clang.

@@ -558,15 +595,18 @@ class TableWriter::TableWriterImpl : public ArrayVisitor {
// byte boundary, and we write this much data into the stream
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment needs updating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. I added a couple review comments, please read.

@Adriandorr
Copy link
Contributor Author

Thanks for the review comments. I’ve rewritten the tests as parameter unit tests and did most of the rest.


INSTANTIATE_TEST_CASE_P(
TestTableWriterSliceOffsets, TestTableWriterSlice,
::testing::Values(std::make_tuple(300, 30), std::make_tuple(301, 30),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... Does gtest allow making a Cartesian product here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does, yesterday I was just too lazy to find out how.

ASSERT_OK(MakeIntBatchSized(start * 2, &batch));
batch = batch->Slice(start, size);

ASSERT_OK(writer_->Append("f0", *batch->column(0)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part and below is still duplicated in the other two test cases. Do you think you can factor that out, for example as a template function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right the duplication can be removed without adding any complexity.

return Status::OK();
}

Status MakeStringTypesRecordBatchWithoutNulls(std::shared_ptr<RecordBatch>* out) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it called "...WithoutNulls" if it passes with_nulls = true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops.. I'll rename it.

} else {
const std::string& value = values[values_index];
const std::string value =
i < int64_t(values.size()) ? values[values_index] : std::to_string(i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you need this change? Is this just a debugging leftover?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original random strings were not very random, just a repetition of the same few strings. I admit the new ones aren't very random either, in particular there are no repetitions which you would expect in a in real data. I probably should have added a new method (MakeRangeBinaryArray?).
Of course you don't want true random data either in a test.
If I revert the change I wouldn't be confident that my tests are still testing anything, the repetition might just align in a lucky pattern.
You ok with leaving it, I think it is better than the previous code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather revert this change. If we want better random generation of binary arrays we should do it more thoroughly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not knowing the history of this particular function, I don't know what would be "better" in there. For my test I pretty much just want the consecutive numbers, otherwise it is very difficult to see what has gone wrong (I like the tests to give me the answer to that question if possible). I made a change to add a second function that implements that, but turns out this function is only used from MakeStringTypesRecordBatch, so we can't really have two implementations.

…rayWithUniqueValues and used that in MakeStringTypesRecordBatch.
@pitrou
Copy link
Member

pitrou commented Apr 9, 2018

Thank you! I will merge once the AppVeyor build passes (the Travis-CI failures in the Rust and glib builds are unrelated).

@pitrou pitrou closed this in 27417b2 Apr 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants