
ARROW-6876: [C++][Parquet] Use shared_ptr to avoid copying ReaderContext struct, fix performance regression with reading many columns #5653

Closed
wants to merge 3 commits into master from wesm/ARROW-6876

Conversation

wesm (Member) commented Oct 14, 2019

This also adds a benchmark to see how the performance of reading Parquet files with many columns (up to 10,000) scales.
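
For illustration, the shared_ptr change amounts to something like the following sketch (ReaderContext is the struct named in the title; the class and function names around it are assumptions for the example, not the actual Arrow code):

#include <memory>
#include <vector>

// Simplified stand-in for parquet::arrow's per-file reader context.
struct ReaderContext {
  // ... file reader, options, included-leaves set, etc.
};

// Before the fix, every column reader received the context by value, so
// opening a file with 10,000 columns made 10,000 copies of the struct.
// Sharing one heap-allocated instance makes each handoff a pointer copy.
class ColumnReader {
 public:
  explicit ColumnReader(std::shared_ptr<ReaderContext> ctx)
      : ctx_(std::move(ctx)) {}

 private:
  std::shared_ptr<ReaderContext> ctx_;
};

std::vector<ColumnReader> MakeColumnReaders(int num_columns) {
  auto ctx = std::make_shared<ReaderContext>();
  std::vector<ColumnReader> readers;
  readers.reserve(num_columns);
  for (int i = 0; i < num_columns; ++i) {
    readers.emplace_back(ctx);  // copies the shared_ptr, not the struct
  }
  return readers;
}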

wesm (Member, Author) commented Oct 14, 2019

before:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")

In [2]: %timeit res = pq.read_table("test_wide.parquet")
1.87 s ± 70 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

after:

In [2]: %timeit pq.read_table('test_wide.parquet')
235 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

0.14.x is still faster here:

In [2]: %timeit res = pq.read_table("test_wide.parquet")
126 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I don't think more micro-performance work for the many-column use case (>10,000 columns) is really worthwhile, but I'm going to take a quick look to see if there's anything simple that would make things a bit faster.

    self.buf = out.getvalue()

def time_read_time(self, num_cols):
    pq.read_table(self.buf)
jorisvandenbossche (Member) commented Oct 14, 2019:

Maybe we can directly add a timing for writing the file as well?

def time_write(self, num_cols):
    out = pa.BufferOutputStream()
    pq.write_table(self.table, out)

wesm (Member, Author):

done

wesm (Member, Author) commented Oct 14, 2019

I looked at the perf output for this, and I don't think we should be doing significant refactoring over 5-10 microseconds of overhead per column.


bool IncludesLeaf(int leaf_index) const {
  return (!this->filter_leaves ||
          (included_leaves.find(leaf_index) != included_leaves.end()));
}
Review comment from a project member:

Why not keep the unordered_set? Does the vector make things faster?
(or you may want to use a vector<bool> indexed by node index)
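
The two lookup strategies being contrasted, as a standalone sketch (hypothetical helpers, not the PR's code):

#include <unordered_set>
#include <vector>

// unordered_set: average O(1) lookup, but pays a hash per query; works
// for sparse, unbounded leaf indices.
bool IncludesLeafSet(const std::unordered_set<int>& included, int leaf_index) {
  return included.find(leaf_index) != included.end();
}

// vector<bool> indexed by leaf index: a direct bit test, but the maximum
// leaf index must be known when the bitmap is built.
bool IncludesLeafBitmap(const std::vector<bool>& included, int leaf_index) {
  return leaf_index < static_cast<int>(included.size()) &&
         included[leaf_index];
}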

Review comment from a contributor:

On the subject, can you also elide filter_leaves and use included_leaves.empty() instead?
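
The suggested simplification would look roughly like this (a sketch; as noted in the reply below, the PR ultimately kept the explicit flag):

#include <unordered_set>

// Treat an empty included_leaves set as "include every leaf", which
// removes the need for a separate filter_leaves boolean.
bool IncludesLeaf(const std::unordered_set<int>& included_leaves,
                  int leaf_index) {
  return included_leaves.empty() ||
         included_leaves.count(leaf_index) > 0;
}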

wesm (Member, Author) replied:

Reverted to unordered_set. Leaving the filter_leaves flag unchanged since there is other business logic involved.

codecov-io commented:
Codecov Report

Merging #5653 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #5653      +/-   ##
==========================================
- Coverage   89.49%   89.49%   -0.01%     
==========================================
  Files         898      796     -102     
  Lines      125350   118740    -6610     
  Branches     1501        0    -1501     
==========================================
- Hits       112182   106264    -5918     
+ Misses      13157    12476     -681     
+ Partials       11        0      -11
Impacted Files Coverage Δ
cpp/src/parquet/arrow/reader_internal.h 93.18% <100%> (+0.15%) ⬆️
cpp/src/parquet/arrow/reader.cc 83.5% <100%> (ø) ⬆️
cpp/src/arrow/filesystem/s3_internal.h 90.74% <0%> (-3.71%) ⬇️
python/pyarrow/tests/test_parquet.py 95.24% <0%> (-0.06%) ⬇️
cpp/src/arrow/compute/kernels/hash_test.cc 100% <0%> (ø) ⬆️
js/src/util/fn.ts
js/src/builder/index.ts
js/src/enum.ts
js/src/Arrow.node.ts
js/src/schema.ts
... and 101 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d7ad509...baeb1c7.

pitrou (Member) left a review:

+1

pitrou closed this in 2ce62df on Oct 16, 2019.
wesm deleted the ARROW-6876 branch on October 16, 2019 at 18:34.
bobfang1992 commented:
Out of curiosity, why does this project not use GitHub's built-in way of merging a pull request, instead pushing to master and then closing the PR here?

wesm (Member, Author) commented Oct 17, 2019

We use a merge tool, https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py, which does several things for us:

  • Creates a clean, squashed commit
  • Preserves the PR description in the commit message (GitHub does not do this)
  • Closes the corresponding JIRA issue (GitHub does not do this)
  • Includes names/email addresses of authors, co-authors, and sign-off person in the commit message (GitHub does not do this)

kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019

ARROW-6876: [C++][Parquet] Use shared_ptr to avoid copying ReaderContext struct, fix performance regression with reading many columns

This also adds a benchmark to be able to see how the performance of reading Parquet files with many columns (up to 10,000) scales.

Closes apache#5653 from wesm/ARROW-6876 and squashes the following commits:

e97c430 <Wes McKinney> Revert member to unordered_set
0b8d690 <Wes McKinney> Add write benchmark, write more rows
1d60433 <Wes McKinney> Use shared_ptr to avoid copying ReaderContext, fix performance regression with reading many columns. Add asv benchmark

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Antoine Pitrou <pitrou@free.fr>