ARROW2: Performance benchmark #1652

Closed · houqp opened this issue Jan 24, 2022 · 24 comments
Labels: question (Further information is requested) · Milestone: arrow2

Comments

@houqp (Member) commented Jan 24, 2022

Perform thorough testing to make sure the arrow2 branch will indeed result in better performance across the board.

Please give https://github.com/apache/arrow-datafusion/tree/arrow2 a try for your workloads and report any regression you find here.

houqp added the question (Further information is requested) label Jan 24, 2022
houqp added this to the arrow2 milestone Jan 24, 2022
@houqp (Member, Author) commented Jan 24, 2022

Here are some of the TPCH results I got from running our tpch benchmark suite on an 8-core x86-64 Linux machine.

The base commit I used as the arrow-rs baseline is 2008b1d; the arrow2 commit is c0c9c72.

Default single-partition CSV files generated from our tpch gen script (--batch-size 4096):

[image: TPCH benchmark results, single-partition CSV]

CSV tables partitioned into 16 files and processed with 8 datafusion partitions (--batch-size 4096 --partitions 8):

[image: TPCH benchmark results, partitioned CSV]

Data generated with:

PARTITIONS=16
for i in $(seq 1 ${PARTITIONS}); do
  echo "generating partition $i"
  # -s 16: scale factor 16; -C/-S: emit chunk $i of ${PARTITIONS} chunks
  ./dbgen -vf -s 16 -C ${PARTITIONS} -S $i
  ls -lh
done

Parquet tables partitioned into 8 files and processed with 8 datafusion partitions (--batch-size 4096 --partitions 8):

[image: TPCH benchmark results, partitioned Parquet]

Note that query 7 could not complete with arrow2 due to OOM. It is a known issue that the arrow2 parquet reader currently consumes almost double the memory of arrow-rs. Related upstream issue: jorgecarleitao/arrow2#768.

Q1 is significantly slower in arrow2 than the other queries (perhaps related to predicate pushdown?).

I think both of these regressions need a deep dive before we merge arrow2 into master.

Overall, arrow2 is around 5% faster for CSV tables and 10-20% faster for parquet tables across the board.
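
For anyone reproducing this, a minimal sketch of how the --batch-size and --partitions knobs map onto the DataFusion API (method names follow a recent DataFusion release and have shifted between versions; the benchmark binary itself takes them as CLI flags):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Mirror `--batch-size 4096 --partitions 8` from the tpch benchmark CLI.
    let config = SessionConfig::new()
        .with_batch_size(4096)
        .with_target_partitions(8);
    let ctx = SessionContext::new_with_config(config);

    // Register one TPCH table; dbgen's .tbl output is '|'-delimited, no header.
    let options = CsvReadOptions::new().has_header(false).delimiter(b'|');
    ctx.register_csv("lineitem", "data/lineitem.tbl", options).await?;

    let batches = ctx.sql("SELECT count(*) FROM lineitem").await?.collect().await?;
    println!("{batches:?}");
    Ok(())
}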

@tustvold (Contributor) commented Jan 24, 2022

Big 👍 to this; getting some concrete numbers would be really nice.

FWIW, here are some ideas for whoever picks this up that I at least would be very interested in:

@alamb (Contributor) commented Jan 24, 2022

A number of the performance improvements will be in arrow-rs 8.0.0, though some, such as apache/arrow-rs#1180, will not be released until arrow-rs 9.0.0.

@houqp (Member, Author) commented Jan 24, 2022

Bummer that the dictionary encoding optimization will need to wait for the 9.0 release, but we can test it with a 9.0 datafusion branch if needed.

@sunchao (Member) commented Jan 27, 2022

I think a big part of the perf improvement comes from Parquet predicate pushdown (e.g., stats, dictionary, bloom filter, and column index). Does either arrow-rs or arrow2 implement (some of) these currently?

@alamb (Contributor) commented Jan 27, 2022

DataFusion does implement row group pruning based on the statistics that arrow-rs writes.

arrow-rs creates statistics. In versions after 8.0.0 it will also preserve dictionary encoding (meaning the output of reading a dictionary-encoded column will also be dictionary encoded), which should be a large win for low-cardinality columns.

It does not currently create or use column indexes or bloom filters.

I believe that @tustvold has a plan for implementing more sophisticated predicate pushdown (i.e., a filter on one column could be used to avoid decoding swaths of others), but I am not sure what the timeline on that is -- apache/arrow-rs#1191
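
As a concrete illustration, row group pruning boils down to reading the footer statistics and skipping row groups whose min/max cannot satisfy the predicate. A minimal sketch against the Rust parquet crate of that era (the statistics accessors were renamed in later releases; the i64 column and threshold are illustrative assumptions):

use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::file::statistics::Statistics;

// Return the indices of row groups that *might* contain rows satisfying
// `col > threshold`, based on footer min/max statistics alone.
fn row_groups_to_scan(path: &str, col_idx: usize, threshold: i64) -> Vec<usize> {
    let reader = SerializedFileReader::new(File::open(path).unwrap()).unwrap();
    let meta = reader.metadata();
    (0..meta.num_row_groups())
        .filter(|&rg| match meta.row_group(rg).column(col_idx).statistics() {
            // Prune the row group only when its max proves no row can match.
            Some(Statistics::Int64(s)) if s.has_min_max_set() => *s.max() > threshold,
            // No usable statistics: must scan the row group.
            _ => true,
        })
        .collect()
}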

@sunchao (Member) commented Jan 27, 2022

In versions after 8.0.0 it will also preserve dictionary encoding (meaning the output of reading a dictionary-encoded column will also be dictionary encoded), which should be a large win for low-cardinality columns.

This is nice 👍 . By dictionary-based row group filtering I mean that we can first decode only the dictionary pages for the column chunks that pushed-down predicates (e.g., an IN predicate) apply to, and skip the whole row group if no keys in the dictionary match.
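
The core of the idea as a hedged sketch (read_dictionary is a hypothetical helper; neither arrow-rs nor arrow2 exposed dictionary pages this way at the time):

// Hypothetical sketch of dictionary-based row group filtering for a
// predicate like `l_shipmode IN ('AIR', 'MAIL')`. If the column chunk is
// fully dictionary encoded, decoding only the dictionary page is enough
// to decide whether the whole row group can be skipped.
fn row_group_may_match(dictionary: &[String], wanted: &[&str]) -> bool {
    dictionary.iter().any(|v| wanted.contains(&v.as_str()))
}

// for rg in 0..num_row_groups {
//     let dict = read_dictionary(rg, col)?; // hypothetical helper
//     if !row_group_may_match(&dict, &["AIR", "MAIL"]) {
//         continue; // no dictionary key matches: skip the entire row group
//     }
//     scan(rg)?;
// }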

@alamb (Contributor) commented Jan 27, 2022

This is nice 👍 . By dictionary-based row group filtering I mean that we can first decode only the dictionary pages for the column chunks that pushed-down predicates (e.g., an IN predicate) apply to, and skip the whole row group if no keys in the dictionary match.

This would definitely be a neat optimization.

@tustvold (Contributor)

skip the whole row group if no keys in the dictionary match.

You'd need to check that the column chunk hadn't spilled out of dictionary encoding (i.e., fallen back to plain-encoded pages once the dictionary grew too large), but yes, that is broadly speaking the access pattern I wish to support with the work in apache/arrow-rs#1191. Specifically: exploit apache/arrow-rs#1180 to cheaply get the dictionary data, evaluate predicates on it, and use the result to construct a refined row selection to fetch.

As @alamb alludes to, the timelines for this effort are not clear, and there may be other, more pressing things for me to spend time on, but it is definitely something that I hope to deliver in the next month or so.
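
For context, the "refined row selection" shape that eventually landed in arrow-rs looks roughly like this (API names from later parquet crate releases; at the time of this comment it was still being designed):

use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

// Read rows 0..100 and 1000..1100 of the file, skipping everything else.
// A dictionary-derived predicate would be used to compute these ranges.
fn example_selection() -> RowSelection {
    RowSelection::from(vec![
        RowSelector::select(100),
        RowSelector::skip(900),
        RowSelector::select(100),
    ])
}
// The selection is then applied via
// `ParquetRecordBatchReaderBuilder::with_row_selection(selection)`.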

@jorgecarleitao (Member)

  • arrow-rs: row group filter pushdown
  • arrow2: row group filter pushdown, page filter pushdown

AFAIK both support reading and writing dictionary-encoded arrays to dictionary-encoded column chunks (but neither supports pushdown based on dictionary values atm).

TBH, IMO the biggest limiting factor in implementing anything in parquet is its lack of documentation: it is just a lot of work to decipher what is being requested, and the situation is not improving. For example, I spent a lot of time understanding the encodings, have a PR that tries to help future folks implementing them, and it has been lingering for ~9 months now. I wish parquet were a bit more inspired by e.g. Arrow or ORC in this respect.

Note that datafusion's benchmarks only use "required" / non-nullable data, so most optimizations on null values are not visible from datafusion's perspective. The last time I benchmarked, arrow2/parquet2 was much faster on nullable data; I am a bit surprised to see so many differences on non-nullable data.

@alamb (Contributor) commented Jan 27, 2022

arrow-rs 8 now has some improvements for nullable columns as well, courtesy of @tustvold -- apache/arrow-rs#1054

@matthewmturner (Contributor)

Checking in on this: has anyone rerun TPCH on the master and arrow2 branches since the arrow-rs 9 release?

@tustvold (Contributor) commented Mar 5, 2022

I added some further benchmarks in #1738, which I would also be interested in seeing the numbers for.

@xudong963 (Member)

More than half a year has passed, and I believe arrow-rs has made great progress since. Does anyone have the latest benchmark data?

@alamb (Contributor) commented Nov 22, 2022

I have not really been following arrow2 very closely -- I am not sure when anyone last got datafusion working well enough with arrow2 to run the TPCH benchmarks 🤷

@xudong963 (Member)

I will bump the arrow branch and make a comparison.

@tustvold (Contributor)

My 2 cents is that, as mentioned by @sunchao above, parquet predicate pushdown makes a drastic difference to performance. Whilst there are still some kinks to iron out in the DataFusion integration, arrow-rs now supports the full suite, including late materialisation.

Given the two were previously roughly in the same ballpark, and significant optimisations have landed in arrow-rs since then, I think that with predicate pushdown the performance story for arrow-rs is pretty compelling, at least w.r.t. parquet. Arrow2 only supports row-group-level pruning, as page pruning requires skippable decode and more sophisticated handling of repetition levels, so it is a fair way behind on this.

This has been borne out by some benchmarks I did, but the TPCH benchmarks would be very interesting to see as well.

I mention this mainly because previous attempts to update the arrow2 branch have gotten stuck in merge conflict hell, and perhaps I can save you some time, and sanity 😅

@Dandandan (Contributor)

For TPC-H it might be that pruning wouldn't help a lot, as most of the data is AFAIK very evenly distributed, so it is quite hard to prune based on min/max values without sorting the data first - at least at the row group level this was the case. But maybe page pruning works in some cases; that would be great.

@tustvold (Contributor)

But maybe page pruning works in some cases; that would be great.

FWIW we now support late materialization, in addition to row group and page pruning, and it doesn't rely on aggregate statistics. Instead it evaluates predicates on the subset of the columns needed by the predicate, then uses the result to conditionally materialize the other projected columns. This isn't always advantageous, e.g. for predicates that don't eliminate large numbers of rows, and we need to develop better heuristics / bailing logic, but where it works the benefits can be huge.
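
A minimal sketch of what this looks like through the parquet crate's row-filter API (module paths and kernel names follow recent arrow-rs releases and have moved between versions; the column index and Int64 predicate are illustrative assumptions):

use std::fs::File;
use arrow::array::Int64Array;
use arrow::compute::kernels::cmp::gt;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open("lineitem.parquet")?)?;

    // Decode only the predicate column (assumed here to be leaf 4)...
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [4]);
    let predicate = ArrowPredicateFn::new(mask, |batch| {
        let threshold = Int64Array::new_scalar(45);
        gt(batch.column(0), &threshold) // boolean mask of rows to keep
    });

    // ...and conditionally materialize the remaining projected columns
    // only for the rows that passed the predicate.
    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;
    for batch in reader {
        println!("{} rows survived the filter", batch?.num_rows());
    }
    Ok(())
}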

@Dandandan (Contributor)

FWIW we now support late materialization, in addition to row group and page pruning [...]

Yes, I realized that, but having also seen some of the TPC-H generated data and filters, I would be (pleasantly) surprised if it can prune out many pages.

@alamb (Contributor) commented Nov 22, 2022

Yes, I realized that, but having also seen some of the TPC-H generated data and filters, I would be (pleasantly) surprised if it can prune out many pages.

I have it on my list to propose enabling pushdown filtering by default (see #3828) -- I plan to run the TPCH benchmarks as part of that exercise to see how much, if at all, they help.
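
For anyone wanting to try this ahead of any default change, a sketch of opting in (config key as in recent DataFusion releases; this example is a judgment call, not taken from #3828 itself):

use datafusion::prelude::*;

fn pushdown_context() -> SessionContext {
    // Opt in to evaluating filters during the parquet decode itself.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    SessionContext::new_with_config(config)
}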

@xudong963 (Member)

I plan to run the TPCH benchmarks as part of that exercise to see how much, if at all, they help

From the benchmarks, I think pushdown filtering helps a lot.

@Ted-Jiang (Member)

Yes, I realized that, but having also seen some of the TPC-H generated data and filters, I would be (pleasantly) surprised if it can prune out many pages.

I tried modifying the filter condition in TPCH q1; with the page index it really helped 😂. It can give a big boost in the real world sometimes.

@alamb (Contributor) commented Mar 10, 2023

Given jorgecarleitao/arrow2#1429, I don't think this issue is relevant any more -- please let me know if you feel otherwise.

alamb closed this as completed Mar 10, 2023