PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size #14545

Merged: 4 commits, Nov 2, 2022

fatemehp
Contributor

In the current code, when the number of values to skip equals the number of values left in the page, we still read the page, because the branch that decides whether to skip the rest of the page uses > rather than >=.

This also includes a minor refactoring: the code now reuses available_values_current_page() and calls ConsumeBufferedValues() instead of updating the counters by hand.
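
The boundary case can be illustrated with a small toy model. This is a hypothetical sketch, not the Arrow reader: ToyPageState and TrySkipRestOfPage are made-up names, and the bodies of available_values_current_page() and ConsumeBufferedValues() are assumptions based on what those names suggest.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the reader's per-page bookkeeping.
// Method names mirror the ones mentioned in this PR; their bodies are
// assumptions, not the actual Arrow implementation.
struct ToyPageState {
  int64_t num_buffered_values = 0;  // values held by the current page
  int64_t num_decoded_values = 0;   // values already consumed from it

  int64_t available_values_current_page() const {
    return num_buffered_values - num_decoded_values;
  }

  void ConsumeBufferedValues(int64_t n) {
    assert(n >= 0 && n <= available_values_current_page());
    num_decoded_values += n;
  }
};

// Returns true when the rest of the page can be dropped without decoding it.
// inclusive == false models the old `>` check, inclusive == true the new `>=`.
bool TrySkipRestOfPage(ToyPageState& page, int64_t& values_to_skip,
                       bool inclusive) {
  const int64_t available_values = page.available_values_current_page();
  const bool skip_page = inclusive ? values_to_skip >= available_values
                                   : values_to_skip > available_values;
  if (!skip_page) return false;
  values_to_skip -= available_values;
  page.ConsumeBufferedValues(available_values);
  return true;
}
```

With a page of 100K values and values_to_skip equal to 100K, the strict check returns false, so a caller would fall through to decoding the page, while the inclusive check consumes the whole page and leaves values_to_skip at zero.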

Benchmark results for when batch size = 100K and number of values per page = 100K.

BEFORE
-------------------------------------------------------------------------------
Benchmark       Time             CPU         Iterations
-------------------------------------------------------------------------------
REQUIRED      96831 ns        96326 ns         1000
OPTIONAL     623897 ns       621734 ns         1000
REPEATED    1006153 ns       997482 ns         1000

AFTER
-------------------------------------------------------------------------------
Benchmark      Time             CPU          Iterations
-------------------------------------------------------------------------------
REQUIRED       2175 ns         2164 ns         1000
OPTIONAL       2743 ns         2719 ns         1000
REPEATED       2368 ns         2424 ns         1000
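
For a quick read of the two tables, the speedup per benchmark is the BEFORE time divided by the AFTER time. The sketch below only restates the wall-clock numbers reported above; BenchRow, kRows, and Speedup are illustrative names, not part of the benchmark code.

```cpp
#include <cassert>

// Wall-clock times in nanoseconds, copied from the BEFORE and AFTER tables.
struct BenchRow {
  const char* name;
  double before_ns;
  double after_ns;
};

const BenchRow kRows[] = {
    {"REQUIRED", 96831.0, 2175.0},
    {"OPTIONAL", 623897.0, 2743.0},
    {"REPEATED", 1006153.0, 2368.0},
};

// Speedup factor: how many times faster the AFTER run is than the BEFORE run.
double Speedup(const BenchRow& row) {
  assert(row.after_ns > 0.0);
  return row.before_ns / row.after_ns;
}
```

This works out to roughly 45x for REQUIRED, 227x for OPTIONAL, and 425x for REPEATED.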

@fatemehp
Contributor Author

@emkornfield could you take a look?

-   values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
-   this->num_decoded_values_ = this->num_buffered_values_;
+ const int64_t available_values = this->available_values_current_page();
+ if (values_to_skip >= available_values) {
Contributor Author

This line has the main change for this request.

Member

I see, thanks.

- if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) {
-   values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
-   this->num_decoded_values_ = this->num_buffered_values_;
+ const int64_t available_values = this->available_values_current_page();
Contributor Author

Refactoring to re-use this function instead of doing the manipulation.

+ const int64_t available_values = this->available_values_current_page();
+ if (values_to_skip >= available_values) {
+   values_to_skip -= available_values;
+   this->ConsumeBufferedValues(available_values);
Contributor Author

Refactoring to use this function instead of doing the manipulation.

@pitrou
Member

pitrou commented Oct 31, 2022

This is not so minor IMHO, can you open a JIRA for this?

@fatemehp fatemehp changed the title MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size Oct 31, 2022
@fatemehp
Contributor Author

Opened a Jira ticket.

@pitrou
Member

pitrou commented Oct 31, 2022

> Benchmark results for when batch size = 100K and number of values per page = 100K.

Can you explain what this benchmark does?

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@emkornfield
Contributor

emkornfield commented Oct 31, 2022

@fatemehp This looks like a different benchmark than the one in #14523. Is it an existing one, or can it be included in this PR?

@fatemehp
Contributor Author

@pitrou, @emkornfield Regarding your comments: I am running the same benchmark that is being checked in with #14523. I only cleaned up the output a bit here to make it more readable.

The benchmark creates a few pages with 100K entries each, then calls Skip(100K) repeatedly. Before this change, we would read the 100K values and throw them away. With this change, we skip the page without reading its values.
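
The shape of that benchmark can be mimicked with a toy model that counts how many values would be decoded and thrown away. This is a hypothetical sketch under simplifying assumptions (ToyReader and its fields are made-up names, and no Parquet data is read); it only contrasts the `>` and `>=` checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy model: equal-sized pages, Skip(n) called repeatedly.
// Only counts values that would have to be decoded and discarded.
struct ToyReader {
  std::vector<int64_t> page_sizes;  // number of values in each page
  bool inclusive;                   // true models `>=`, false models `>`
  std::size_t page_index = 0;
  int64_t consumed_in_page = 0;
  int64_t decoded_and_discarded = 0;

  ToyReader(std::vector<int64_t> sizes, bool inclusive_check)
      : page_sizes(std::move(sizes)), inclusive(inclusive_check) {}

  // Returns the number of values actually skipped.
  int64_t Skip(int64_t values_to_skip) {
    assert(values_to_skip >= 0);
    int64_t skipped = 0;
    while (values_to_skip > 0 && page_index < page_sizes.size()) {
      const int64_t available = page_sizes[page_index] - consumed_in_page;
      const bool drop_page = inclusive ? values_to_skip >= available
                                       : values_to_skip > available;
      if (drop_page) {
        // Rest of the page is dropped without decoding anything.
        values_to_skip -= available;
        skipped += available;
        ++page_index;
        consumed_in_page = 0;
      } else {
        // Values are decoded only to be thrown away.
        decoded_and_discarded += values_to_skip;
        consumed_in_page += values_to_skip;
        skipped += values_to_skip;
        values_to_skip = 0;
        if (consumed_in_page == page_sizes[page_index]) {
          ++page_index;
          consumed_in_page = 0;
        }
      }
    }
    return skipped;
  }
};
```

With three pages of 100K values and three calls to Skip(100000), the `>` variant decodes and discards all 300K values in this model, while the `>=` variant decodes none.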

values_.begin() + static_cast<int>(5 * static_cast<double>(levels_per_page)),
values_.begin() + 5.5 * levels_per_page);
ASSERT_TRUE(vector_equal(sub_values, vresult));

// 3) skip_size < page_size (skip limited to a single page)
Member

Can you fix the numbering in the comments? :-)

Contributor Author

Done.

Member

@pitrou pitrou left a comment

LGTM, will wait for CI

@pitrou pitrou merged commit 04bb068 into apache:master Nov 2, 2022