
ARROW-6876: [C++][Parquet] Use shared_ptr to avoid copying ReaderContext struct, fix performance regression with reading many columns #5653

Closed
wants to merge 3 commits into master from wesm/ARROW-6876

Conversation

wesm (Member) commented Oct 14, 2019

This also adds a benchmark to see how the performance of reading Parquet files with many columns (up to 10,000) scales.
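
For illustration, the shared_ptr change amounts to something like the following sketch (ReaderContext is the struct named in the title; the class and function names around it are assumptions for the example, not the actual Arrow code):

#include <memory>
#include <vector>

// Simplified stand-in for parquet::arrow's per-file reader context.
struct ReaderContext {
  // ... file reader, options, included-leaves set, etc.
};

// Before the fix, every column reader received the context by value, so
// opening a file with 10,000 columns made 10,000 copies of the struct.
// Sharing one heap-allocated instance makes each handoff a pointer copy.
class ColumnReader {
 public:
  explicit ColumnReader(std::shared_ptr<ReaderContext> ctx)
      : ctx_(std::move(ctx)) {}

 private:
  std::shared_ptr<ReaderContext> ctx_;
};

std::vector<ColumnReader> MakeColumnReaders(int num_columns) {
  auto ctx = std::make_shared<ReaderContext>();
  std::vector<ColumnReader> readers;
  readers.reserve(num_columns);
  for (int i = 0; i < num_columns; ++i) {
    readers.emplace_back(ctx);  // copies the shared_ptr, not the struct
  }
  return readers;
}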

wesm (Member, Author) commented Oct 14, 2019

before:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")

In [2]: %timeit res = pq.read_table("test_wide.parquet")
1.87 s ± 70 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

after:

In [2]: %timeit pq.read_table('test_wide.parquet')
235 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

0.14.x is still faster here:

In [2]: %timeit res = pq.read_table("test_wide.parquet")
126 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I don't think more micro-performance work for the many-column use case (>10,000 columns) is really worthwhile, but I'm going to take a quick look to see if there's anything simple that would make things a bit faster.

    self.buf = out.getvalue()

def time_read_time(self, num_cols):
    pq.read_table(self.buf)
jorisvandenbossche (Member) commented Oct 14, 2019:

Maybe we can directly add a timing for writing the file as well?

def time_write(self, num_cols):
    out = pa.BufferOutputStream()
    pq.write_table(self.table, out)

wesm (Member, Author):

done

wesm (Member, Author) commented Oct 14, 2019

I looked at the perf output for this, and I don't think we should be doing significant refactoring over 5-10 microseconds of overhead per column.


bool IncludesLeaf(int leaf_index) const {
  return (!this->filter_leaves ||
          (included_leaves.find(leaf_index) != included_leaves.end()));
}
Review comment from a project member:

Why not keep the unordered_set? Does the vector make things faster?
(or you may want to use a vector<bool> indexed by node index)
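
The two lookup strategies being contrasted, as a standalone sketch (hypothetical helpers, not the PR's code):

#include <unordered_set>
#include <vector>

// unordered_set: average O(1) lookup, but pays a hash per query; works
// for sparse, unbounded leaf indices.
bool IncludesLeafSet(const std::unordered_set<int>& included, int leaf_index) {
  return included.find(leaf_index) != included.end();
}

// vector<bool> indexed by leaf index: a direct bit test, but the maximum
// leaf index must be known when the bitmap is built.
bool IncludesLeafBitmap(const std::vector<bool>& included, int leaf_index) {
  return leaf_index < static_cast<int>(included.size()) &&
         included[leaf_index];
}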

Review comment from a contributor:

On the subject, can you also elide filter_leaves and use included_leaves.empty() instead?
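
The suggested simplification would look roughly like this (a sketch; as noted in the reply below, the PR ultimately kept the explicit flag):

#include <unordered_set>

// Treat an empty included_leaves set as "include every leaf", which
// removes the need for a separate filter_leaves boolean.
bool IncludesLeaf(const std::unordered_set<int>& included_leaves,
                  int leaf_index) {
  return included_leaves.empty() ||
         included_leaves.count(leaf_index) > 0;
}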

wesm (Member, Author) replied:

Reverted to unordered_set. Leaving the filter_leaves flag unchanged since there is other business logic involved.

codecov-io commented:
Codecov Report

Merging #5653 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #5653      +/-   ##
==========================================
- Coverage   89.49%   89.49%   -0.01%     
==========================================
  Files         898      796     -102     
  Lines      125350   118740    -6610     
  Branches     1501        0    -1501     
==========================================
- Hits       112182   106264    -5918     
+ Misses      13157    12476     -681     
+ Partials       11        0      -11
Impacted Files Coverage Δ
cpp/src/parquet/arrow/reader_internal.h 93.18% <100%> (+0.15%) ⬆️
cpp/src/parquet/arrow/reader.cc 83.5% <100%> (ø) ⬆️
cpp/src/arrow/filesystem/s3_internal.h 90.74% <0%> (-3.71%) ⬇️
python/pyarrow/tests/test_parquet.py 95.24% <0%> (-0.06%) ⬇️
cpp/src/arrow/compute/kernels/hash_test.cc 100% <0%> (ø) ⬆️
js/src/util/fn.ts
js/src/builder/index.ts
js/src/enum.ts
js/src/Arrow.node.ts
js/src/schema.ts
... and 101 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d7ad509...baeb1c7.

pitrou (Member) left a review:

+1

pitrou closed this in 2ce62df on Oct 16, 2019.
wesm deleted the ARROW-6876 branch on October 16, 2019 at 18:34.
bobfang1992 commented:
Out of curiosity, why does this project not use GitHub's built-in way of merging a pull request, instead pushing to master and then closing the PR here?

wesm (Member, Author) commented Oct 17, 2019

We use a merge tool, https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py, which does several things for us:

  • Creates a clean, squashed commit
  • Preserves the PR description in the commit message (GitHub does not do this)
  • Closes the corresponding JIRA issue (GitHub does not do this)
  • Includes names/email addresses of authors, co-authors, and sign-off person in the commit message (GitHub does not do this)

kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019

ARROW-6876: [C++][Parquet] Use shared_ptr to avoid copying ReaderContext struct, fix performance regression with reading many columns

This also adds a benchmark to be able to see how the performance of reading Parquet files with many columns (up to 10,000) scales.

Closes apache#5653 from wesm/ARROW-6876 and squashes the following commits:

e97c430 <Wes McKinney> Revert member to unordered_set
0b8d690 <Wes McKinney> Add write benchmark, write more rows
1d60433 <Wes McKinney> Use shared_ptr to avoid copying ReaderContext, fix performance regression with reading many columns. Add asv benchmark

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Antoine Pitrou <pitrou@free.fr>