Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] [Parquet] Use std::count in parquet ColumnReader #39398

Closed
Hattonuri opened this issue Dec 30, 2023 · 0 comments · Fixed by #39397
Closed

[C++] [Parquet] Use std::count in parquet ColumnReader #39398

Hattonuri opened this issue Dec 30, 2023 · 0 comments · Fixed by #39397

Comments

@Hattonuri
Copy link
Contributor

Describe the enhancement requested

I've found that for-loop here

void ReadLevels(int64_t batch_size, int16_t* def_levels, int16_t* rep_levels,
int64_t* num_def_levels, int64_t* values_to_read) {
batch_size =
std::min(batch_size, this->num_buffered_values_ - this->num_decoded_values_);
// If the field is required and non-repeated, there are no definition levels
if (this->max_def_level_ > 0 && def_levels != nullptr) {
*num_def_levels = this->ReadDefinitionLevels(batch_size, def_levels);
// TODO(wesm): this tallying of values-to-decode can be performed with better
// cache-efficiency if fused with the level decoding.
for (int64_t i = 0; i < *num_def_levels; ++i) {
if (def_levels[i] == this->max_def_level_) {
++(*values_to_read);
}
}
} else {
// Required field, read all values
*values_to_read = batch_size;
}

transforms into

0xc0c2f0 <ReadLevels()+96> inc %rdx
0xc0c2f3 <ReadLevels()+99> cmp %rax,%rdx
0xc0c2f6 <ReadLevels()+102> jge 0xc0c30c <ReadLevels()+124>
0xc0c2f8 <ReadLevels()+104> cmp %cx,(%r14,%rdx,2)
0xc0c2fd <ReadLevels()+109> jne 0xc0c2f0 <ReadLevels()+96>
0xc0c2ff <ReadLevels()+111> incq 0x0(%rbp)
0xc0c303 <ReadLevels()+115> mov (%rbx),%rax
0xc0c306 <ReadLevels()+118> jmp 0xc0c2f0 <ReadLevels()+96>

That means that it uses iteration element by element and changes reference with incq
I think that the reason is that values_to_read and num_def_levels are not set as restrict. So the compiler can not optimize this to a more efficient way(for example using simd)

On my flamegraph this part showed ~10% of time spent

Component(s)

C++, Parquet

mapleFU pushed a commit that referenced this issue Jan 9, 2024
…39397)

### Rationale for this change

I've found that for-loop here
https://github.com/apache/arrow/blob/7c3480e2f028f5881242f227f42155cf833efee7/cpp/src/parquet/column_reader.cc#L1055-L1073

transforms into

0xc0c2f0 <ReadLevels()+96>      inc    %rdx
0xc0c2f3 <ReadLevels()+99>      cmp    %rax,%rdx
0xc0c2f6 <ReadLevels()+102>     jge    0xc0c30c <ReadLevels()+124>
0xc0c2f8 <ReadLevels()+104>     cmp    %cx,(%r14,%rdx,2)
0xc0c2fd <ReadLevels()+109>     jne    0xc0c2f0 <ReadLevels()+96>
0xc0c2ff <ReadLevels()+111>     incq   0x0(%rbp)                                                   
0xc0c303 <ReadLevels()+115>     mov    (%rbx),%rax
0xc0c306 <ReadLevels()+118>     jmp    0xc0c2f0 <ReadLevels()+96>

That means that it uses iteration element by element and changes reference with incq
I think that the reason is that values_to_read and num_def_levels are not set as restrict. So the compiler can not optimize this to a more efficient way(for example using simd)

On my flamegraph this part showed ~10% of time spent

In this file there also some for loops which could easily be changed to std::count, but they do not touch references and I don't know the reason why std::count was not used in the all cpp/src/parquet/ directory - so I didn't change much

### What changes are included in this PR?

Using `std::count` in `parquet/column_reader.cc` to avoid loop not being optimized

### Are these changes tested?
They are tested with unittest but not benched because I don't know what bench will show performance rise here(

### Are there any user-facing changes?

* Closes: #39398

Authored-by: Dmitry Stasenko <dmitry.stasenko@pinely.com>
Signed-off-by: mwish <maplewish117@gmail.com>
@mapleFU mapleFU added this to the 16.0.0 milestone Jan 9, 2024
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…els (apache#39397)

### Rationale for this change

I've found that for-loop here
https://github.com/apache/arrow/blob/7c3480e2f028f5881242f227f42155cf833efee7/cpp/src/parquet/column_reader.cc#L1055-L1073

transforms into

0xc0c2f0 <ReadLevels()+96>      inc    %rdx
0xc0c2f3 <ReadLevels()+99>      cmp    %rax,%rdx
0xc0c2f6 <ReadLevels()+102>     jge    0xc0c30c <ReadLevels()+124>
0xc0c2f8 <ReadLevels()+104>     cmp    %cx,(%r14,%rdx,2)
0xc0c2fd <ReadLevels()+109>     jne    0xc0c2f0 <ReadLevels()+96>
0xc0c2ff <ReadLevels()+111>     incq   0x0(%rbp)                                                   
0xc0c303 <ReadLevels()+115>     mov    (%rbx),%rax
0xc0c306 <ReadLevels()+118>     jmp    0xc0c2f0 <ReadLevels()+96>

That means that it uses iteration element by element and changes reference with incq
I think that the reason is that values_to_read and num_def_levels are not set as restrict. So the compiler can not optimize this to a more efficient way(for example using simd)

On my flamegraph this part showed ~10% of time spent

In this file there also some for loops which could easily be changed to std::count, but they do not touch references and I don't know the reason why std::count was not used in the all cpp/src/parquet/ directory - so I didn't change much

### What changes are included in this PR?

Using `std::count` in `parquet/column_reader.cc` to avoid loop not being optimized

### Are these changes tested?
They are tested with unittest but not benched because I don't know what bench will show performance rise here(

### Are there any user-facing changes?

* Closes: apache#39398

Authored-by: Dmitry Stasenko <dmitry.stasenko@pinely.com>
Signed-off-by: mwish <maplewish117@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…els (apache#39397)

### Rationale for this change

I've found that for-loop here
https://github.com/apache/arrow/blob/7c3480e2f028f5881242f227f42155cf833efee7/cpp/src/parquet/column_reader.cc#L1055-L1073

transforms into

0xc0c2f0 <ReadLevels()+96>      inc    %rdx
0xc0c2f3 <ReadLevels()+99>      cmp    %rax,%rdx
0xc0c2f6 <ReadLevels()+102>     jge    0xc0c30c <ReadLevels()+124>
0xc0c2f8 <ReadLevels()+104>     cmp    %cx,(%r14,%rdx,2)
0xc0c2fd <ReadLevels()+109>     jne    0xc0c2f0 <ReadLevels()+96>
0xc0c2ff <ReadLevels()+111>     incq   0x0(%rbp)                                                   
0xc0c303 <ReadLevels()+115>     mov    (%rbx),%rax
0xc0c306 <ReadLevels()+118>     jmp    0xc0c2f0 <ReadLevels()+96>

That means that it uses iteration element by element and changes reference with incq
I think that the reason is that values_to_read and num_def_levels are not set as restrict. So the compiler can not optimize this to a more efficient way(for example using simd)

On my flamegraph this part showed ~10% of time spent

In this file there also some for loops which could easily be changed to std::count, but they do not touch references and I don't know the reason why std::count was not used in the all cpp/src/parquet/ directory - so I didn't change much

### What changes are included in this PR?

Using `std::count` in `parquet/column_reader.cc` to avoid loop not being optimized

### Are these changes tested?
They are tested with unittest but not benched because I don't know what bench will show performance rise here(

### Are there any user-facing changes?

* Closes: apache#39398

Authored-by: Dmitry Stasenko <dmitry.stasenko@pinely.com>
Signed-off-by: mwish <maplewish117@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…els (apache#39397)

### Rationale for this change

I've found that for-loop here
https://github.com/apache/arrow/blob/7c3480e2f028f5881242f227f42155cf833efee7/cpp/src/parquet/column_reader.cc#L1055-L1073

transforms into

0xc0c2f0 <ReadLevels()+96>      inc    %rdx
0xc0c2f3 <ReadLevels()+99>      cmp    %rax,%rdx
0xc0c2f6 <ReadLevels()+102>     jge    0xc0c30c <ReadLevels()+124>
0xc0c2f8 <ReadLevels()+104>     cmp    %cx,(%r14,%rdx,2)
0xc0c2fd <ReadLevels()+109>     jne    0xc0c2f0 <ReadLevels()+96>
0xc0c2ff <ReadLevels()+111>     incq   0x0(%rbp)                                                   
0xc0c303 <ReadLevels()+115>     mov    (%rbx),%rax
0xc0c306 <ReadLevels()+118>     jmp    0xc0c2f0 <ReadLevels()+96>

That means that it uses iteration element by element and changes reference with incq
I think that the reason is that values_to_read and num_def_levels are not set as restrict. So the compiler can not optimize this to a more efficient way(for example using simd)

On my flamegraph this part showed ~10% of time spent

In this file there also some for loops which could easily be changed to std::count, but they do not touch references and I don't know the reason why std::count was not used in the all cpp/src/parquet/ directory - so I didn't change much

### What changes are included in this PR?

Using `std::count` in `parquet/column_reader.cc` to avoid loop not being optimized

### Are these changes tested?
They are tested with unittest but not benched because I don't know what bench will show performance rise here(

### Are there any user-facing changes?

* Closes: apache#39398

Authored-by: Dmitry Stasenko <dmitry.stasenko@pinely.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment