-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size #14545
Conversation
@emkornfield could you take a look? |
values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_; | ||
this->num_decoded_values_ = this->num_buffered_values_; | ||
const int64_t available_values = this->available_values_current_page(); | ||
if (values_to_skip >= available_values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line has the main change for this request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks.
if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) { | ||
values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_; | ||
this->num_decoded_values_ = this->num_buffered_values_; | ||
const int64_t available_values = this->available_values_current_page(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring to re-use this function instead of doing the manipulation.
const int64_t available_values = this->available_values_current_page(); | ||
if (values_to_skip >= available_values) { | ||
values_to_skip -= available_values; | ||
this->ConsumeBufferedValues(available_values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring to use this function instead of doing the manipulation.
This is not so minor IMHO, can you open a JIRA for this? |
Opened a Jira ticket. |
Can you explain what this benchmark does? |
|
@pitrou, @emkornfield regarding your comments, I am running the same benchmark that is being checked in here #14523. I just cleaned up the output a bit here to be more readable. What the benchmark does is it creates a few pages with 100K entries each. Then it calls Skip(100K) repeatedly. Before this change, we would read the 100K values and throw them away. With this change, we skip reading from the page. |
values_.begin() + static_cast<int>(5 * static_cast<double>(levels_per_page)), | ||
values_.begin() + 5.5 * levels_per_page); | ||
ASSERT_TRUE(vector_equal(sub_values, vresult)); | ||
|
||
// 3) skip_size < page_size (skip limited to a single page) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fix the numbering in the comments? :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, will wait for CI
In the current code, we will read this page because we are using > and not >= in the branch that decides to skip the rest of the page.
Also includes minor refactoring to reuse the function available_values_current_page(), and use ConsumeBufferedValues() accordingly.
Benchmark results for when batch size = 100K and number of values per page = 100K.