Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1712: [C++] Add method to BinaryBuilder to reserve space for value data #1481

Closed
wants to merge 23 commits into from
Closed

Conversation

xuepanchen
Copy link
Contributor

Modified BinaryBuilder::Resize(int64_t) so that when building BinaryArrays with a known size, space is also reserved for value_data_builder_ to prevent internal reallocation.

When building BinaryArrays with a known size using Resize and Reserve methods, space is also reserved for value_data_builder_ to prevent internal reallocation
Update BinaryBuilder::Resize(int64_t capacity) in builder.cc
@xuepanchen
Copy link
Contributor Author

Plz let me know if I should create a new method and add allocation of value data in the new function, instead of directly putting inside the Resize method.

@xuepanchen xuepanchen changed the title [C++] Add method to BinaryBuilder to reserve space for value data ARROW-1712: [C++] Add method to BinaryBuilder to reserve space for value data Jan 16, 2018
@wesm
Copy link
Member

wesm commented Jan 16, 2018

@xuepanchen thank you for your contribution. We need to add a new method to BinaryBuilder for this use case.

To give an example, suppose that we anticipate building an array with 1000 elements, each of which has an expected size of around 100 bytes. You would want to write something like:

RETURN_NOT_OK(builder.Reserve(1000));
RETURN_NOT_OK(builder.ReserveData(100 * 1000));

(@xhochy do you have an opinion on what to call this?)

Please also add a method to return the capacity of the internal value_data_builder_ and add unit tests to array-test.cc. Thanks!

@wesm
Copy link
Member

wesm commented Jan 16, 2018

@xuepanchen note that each time you push any commits to GitHub on an open PR, it creates CI builds in our Travis CI and Appveyor queues, so small incremental pushes can impact other developers who are waiting on their builds to run. In general, it's a good practice to wait to open a PR on a WIP patch until you're ready to validate a completed patch and/or need code review.

If you enable Travis CI and Appveyor on your fork of Arrow, you can see CI builds on your branches without having to open a PR to the Arrow repo. e.g. https://travis-ci.org/wesm/arrow/branches

@xuepanchen
Copy link
Contributor Author

xuepanchen commented Jan 16, 2018

@wesm thank you for the reminder. Will pay more attention next time.


Status BinaryBuilder::ReserveData(int64_t capacity) {
DCHECK_LT(capacity, std::numeric_limits<int32_t>::max());
return value_data_builder_.Resize(capacity * sizeof(int64_t));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we multiply here with int64_t? I would expect that ReserveData(x) will lead to value_data_capacity() = x.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be Resize(capacity + value_data_builder_.length()). The sizeof(int64_t) looks like a copy-paste error from implementation of BinaryBuilder::Resize above (and that should be sizeof(int32_t) there, so we should fix that)

We should check that extra_capacity + length does not exceed INT32_MAX but probably return Status::Invalid since overflowing a BinaryBuilder is likely to happen somewhat more regularly

I note also that BufferBuilder and TypedBufferBuilder<T> don't have a shrink_to_fit option in their Resize method, that would be good to add to avoid unnecessary reallocations

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you two for the comments.

@xhochy you are right. I will make the change accordingly.

@wesm I have modified BinaryBuilder::Resize to use sizeof(int32_t) in previous commit. Will change to Resize(extra_capacity + curr_length)

Replace parameter name from "bytes" to "capacity" to avoid confusion.
Add TestCapacityReserve to test space reservation for BinaryBuilder
data_capacity_ represents the indicated capacity for value_data_builder and it is always smaller than or equal to the actual capacity of underlying value_data_builder (data_capacity_ <= value_data_builder.capacity()). That's because when we say:

ReserveData(capacity);

The new capacity is max(data_capacity_, data length + capacity), and data_capacity_ is set to be equal to new capacity but underlying buffer size is set to BitUtil::RoundUpToMultipleOf64(new capacity) to ensure that the capacity of the buffer is a multiple of 64 bytes as defined in Layout.md.

That's why data_capacity_ is needed to show the indicated capacity of the BinaryBuilder, just like ArrayBuilder::capacity_ indicates the indicated capacity of ArrayBuilder.

A safety check is added in BinaryBuilder::Append() to update data_capacity_ if data length is greater than data_capacity_. The reason is that data_capacity is updated in ResearveData(). But if users make mistakes to append too much data, data length might be larger than data_capacity_ (data length <= actual capacity of underlying value_data_builder). If this happens data_capacity_ is set equal to data length to avoid confusion.
Update ReserveData(int64_t) method for BinaryBuilder
@xuepanchen
Copy link
Contributor Author

Update ReserveData method based on feedbacks and add test case for BinaryBuilder.

ASSERT_EQ(builder_->length(), length);
ASSERT_EQ(builder_->capacity(), BitUtil::NextPower2(capacity));
ASSERT_EQ(builder_->value_data_length(), data_length);
ASSERT_EQ(builder_->value_data_capacity(), capacity);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not a power of 2?

if (data_length <= capacity) {
ASSERT_EQ(builder_->value_data_capacity(), capacity);
} else {
ASSERT_EQ(builder_->value_data_capacity(), data_length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why these assertions would hold true

ASSERT_EQ(builder_->capacity(), BitUtil::NextPower2(capacity));
ASSERT_EQ(builder_->value_data_capacity(), capacity);

I would think that value_data_capacity() is always the power of 2 greater than or equal to the amount of data appended so far, i.e. ASSERT_EQ(BitUtil::NextPower2(data_length), builder_->value_data_capacity())

Can you make the strings you are appending much larger, at least 10 length each?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm value_data_capacity() is actually always a multiple of 64 greater than or equal to the amount of data appended so far because the underlying buffer size is set to ensure that the capacity of the buffer is a multiple of 64 bytes as defined in Layout.md, i.e.

ASSERT_EQ(BitUtil::RoundUpToMultipleOf64(data_length), builder_->value_data_capacity())

So if you call ReserveData(capacity) at the very beginning, then we have

ASSERT_EQ(BitUtil::RoundUpToMultipleOf64(capacity), builder_->value_data_capacity())

ASSERT_OK(builder_->Reserve(capacity));
ASSERT_OK(builder_->ReserveData(capacity));

ASSERT_EQ(builder_->length(), length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In googletest, the 1st parameter passed to ASSERT_EQ should be the expected result, so flip the argument order here and below

int64_t capacity = N;

ASSERT_OK(builder_->Reserve(capacity));
ASSERT_OK(builder_->ReserveData(capacity));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These reservations should be viewed as different, because the size of the offsets buffer and the data buffer grow at different rates. The way this unit test should read is:

  • Call ReserveData with enough space for some large-ish amount of data (say 4K bytes or so)
  • Append <= N bytes incrementally
  • Check that the capacity remains invariant at the end (i.e. the initial ReserveData made sure that no additional reallocations took place)

@@ -1208,7 +1208,7 @@ ArrayBuilder* ListBuilder::value_builder() const {
// String and binary

BinaryBuilder::BinaryBuilder(const std::shared_ptr<DataType>& type, MemoryPool* pool)
: ArrayBuilder(type, pool), offsets_builder_(pool), value_data_builder_(pool) {}
: ArrayBuilder(type, pool), offsets_builder_(pool), value_data_builder_(pool), data_capacity_(0) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this extra member; we can just use value_data_capacity() wherever data_capacity_ is currently being used

return Status::Invalid("Cannot reserve capacity larger than 2^31 - 1 in length for binary data");
}

RETURN_NOT_OK(value_data_builder_.Resize(value_data_length() + capacity));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you rebase on master, you can use value_data_builder_.Reserve(capacity) here

}

RETURN_NOT_OK(value_data_builder_.Resize(value_data_length() + capacity));
data_capacity_ = value_data_length() + capacity;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed

@@ -1241,6 +1253,9 @@ Status BinaryBuilder::Append(const uint8_t* value, int32_t length) {
RETURN_NOT_OK(Reserve(1));
RETURN_NOT_OK(AppendNextOffset());
RETURN_NOT_OK(value_data_builder_.Append(value, length));
if (data_capacity_ < value_data_length()) {
data_capacity_ = value_data_length();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this

@@ -682,10 +682,13 @@ class ARROW_EXPORT BinaryBuilder : public ArrayBuilder {

Status Init(int64_t elements) override;
Status Resize(int64_t capacity) override;
Status ReserveData(int64_t capacity);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a doxygen comment, since this is a new API

I note that the Reserve-type methods in this header have different semantics from their STL counterparts. They are reserving additional space rather than absolute space (e.g. std::vector::reserve takes an absolute length as argument)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also call the argument to ReserveData size (or elements) instead of capacity (to avoid confusion about whether we are passing an incremental value vs. absolute)

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is looking pretty close, just a couple of minor comments

@@ -682,10 +682,15 @@ class ARROW_EXPORT BinaryBuilder : public ArrayBuilder {

Status Init(int64_t elements) override;
Status Resize(int64_t capacity) override;
/// Ensures there is enough space for adding the number of value elements
/// by checking value buffer capacity and resizing if necessary.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add \brief to the method description? Can we rephrase this to "Ensures there is enough allocated capacity to append the indicated number of bytes to the value data buffer without additional allocations"

ASSERT_EQ(length, builder_->value_data_length());
ASSERT_EQ(BitUtil::RoundUpToMultipleOf64(capacity), builder_->value_data_capacity());
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add another call to ReserveData here, like builder_->ReserveData(500) to show that the input argument is an incremental amount rather than an absolute amount?

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 -- thank you. I will merge once the style issues are fixed and the build is passing. There are a number of linting problems, see https://github.com/apache/arrow/tree/master/cpp#continuous-integration

@wesm
Copy link
Member

wesm commented Jan 22, 2018

In the future, it is better to make pull requests from a branch on your fork, so that you can work on multiple changes at once. Here are some links from the pandas project about this

@wesm
Copy link
Member

wesm commented Jan 23, 2018

@kou do you know what is causing this error?

Building dependency tree...
Reading state information...
ccache is already the newest version (3.1.9-1).
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libgirepository1.0-dev : Depends: libgirepository-1.0-1 (= 1.40.0-1) but 1.40.0-1ubuntu0.2 is to be installed
                          Depends: gobject-introspection (= 1.40.0-1) but it is not going to be installed
                          Depends: gir1.2-glib-2.0 (= 1.40.0-1) but 1.40.0-1ubuntu0.2 is to be installed
                          Depends: gir1.2-freedesktop (= 1.40.0-1) but 1.40.0-1ubuntu0.2 is to be installed
E: Unable to correct problems, you have held broken packages.

TEST_F(TestBinaryBuilder, TestCapacityReserve) {
vector<string> strings = {"aaaaa", "bbbbbbbbbb", "ccccccccccccccc", "dddddddddddddddddddd", "eeeeeeeeee"};
vector<string> strings = {"aaaaa", "bbbbbbbbbb", "ccccccccccccccc", "dddddddddd"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, you can run make format (which uses clang-format) to fix these long lines without having to make code changes

@kou
Copy link
Member

kou commented Jan 24, 2018

@wesm I think that it's a network problem:

https://travis-ci.org/apache/arrow/jobs/332472808#L552

E: Failed to fetch http://us-central1.gce.archive.ubuntu.com/ubuntu/dists/trusty-updates/main/binary-amd64/Packages.gz Hash Sum mismatch

We will be able to reduce the case by changing the apt-get update command line with the following:

for i in {1..3}; do
  sudo -E apt-get -yq update &>> ~/apt-get-update.log && break
done

@wesm wesm closed this in 0a49022 Jan 24, 2018
@wesm
Copy link
Member

wesm commented Jan 24, 2018

@kou I see. These commands are initiated by Travis CI from https://github.com/apache/arrow/blob/master/.travis.yml#L21. We could install our package toolchain outside of Travis CI's built-in commands, if that might help improve the flakiness

I opened https://issues.apache.org/jira/browse/ARROW-2021

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants