Vectorize reads and deserialize to Arrow #9
There's more context and discussion on the issue in the old Netflix project: Netflix/iceberg#90
@rdblue .. I ran some benchmarks on production-sized data and we're seeing a large enough gap in scan performance between Spark's core reader implementation (with vectorization) and Iceberg's Parquet reader that I'd be glad to work on this effort. If you have thoughts on this from previous discussions, and on the general approach and challenges, please let me know. I understand this is probably a pretty large undertaking because it involves other formats as well. Meanwhile, I'm trying to get something working so I can understand the challenges involved and work on a proposal. But I'd appreciate your thoughts on things we should address in a potential solution.
I would totally support this effort. The benchmarks in this PR also confirm that vectorized execution is important. I should mention that Iceberg's Parquet reader seems to be significantly more efficient on nested data.
I would certainly review and collaborate on this! @aokolnychyi Spark disables vectorized reads on nested data, which would then indicate that Iceberg without vectorized reads is faster than Spark without vectorized reads.
@mccheah you are right that we don't have vectorized reads on nested data in Spark. The benchmarks in the PR above test vectorized and non-vectorized reads on flat and nested data. On flat data, Iceberg is slightly faster. The real difference is seen on nested data. As I also said in that PR, Iceberg DS is V2 and the file source is still V1, which complicates the comparison. I am working on benchmarks for readers alone without Spark.
It would be great to have people working on this! Feel free to pick it up and I'll help review it. I think there are good tests that we can copy that are used to validate the row-based readers and writers. @julienledem may also be interested. He has given talks on how to efficiently deserialize Parquet to Arrow and can hopefully help answer questions.
For anyone interested, Spark already has a good reference implementation in Scala here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala One thing to note is that there isn't an implementation for map type. I think Arrow was planning on using list<struct<key,value>>, but engines may need something closer to struct<keys:list<>, values:list<>> in order to overlay the columnar implementations directly. It might be necessary to provide both. I'm happy to help anyone who's interested in working on this.
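To make the two candidate map layouts concrete, here is a minimal sketch using Arrow's Java `Field`/`FieldType`/`ArrowType` POJOs. This is an illustration only, not code from the thread; the field names (`m`, `entries`, `item`) are assumptions.

```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class MapLayouts {
  // Option 1: map<string, int> encoded as list<struct<key, value>>
  static Field mapAsListOfStructs() {
    Field key = new Field("key", FieldType.notNullable(new ArrowType.Utf8()), null);
    Field value = new Field("value", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Field entries = new Field("entries", FieldType.notNullable(new ArrowType.Struct()),
        Arrays.asList(key, value));
    return new Field("m", FieldType.nullable(new ArrowType.List()),
        Collections.singletonList(entries));
  }

  // Option 2: map<string, int> encoded as struct<keys: list<string>, values: list<int>>,
  // which lets an engine overlay its own columnar representation on the two lists directly
  static Field mapAsParallelLists() {
    Field keyItem = new Field("item", FieldType.notNullable(new ArrowType.Utf8()), null);
    Field valueItem = new Field("item", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Field keys = new Field("keys", FieldType.notNullable(new ArrowType.List()),
        Collections.singletonList(keyItem));
    Field values = new Field("values", FieldType.notNullable(new ArrowType.List()),
        Collections.singletonList(valueItem));
    return new Field("m", FieldType.nullable(new ArrowType.Struct()),
        Arrays.asList(keys, values));
  }
}
```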
@aokolnychyi I just made a pull request for a schema converter between Iceberg and Arrow, which is the first step in getting set up for building vectors: #194
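For readers following along, a minimal sketch of what such a schema converter might look like for a handful of primitive types. This is not the code in #194; the class name and the exact type mapping are assumptions for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class ArrowSchemaUtil {
  // Convert an Iceberg schema to an Arrow schema, flat primitive columns only
  public static org.apache.arrow.vector.types.pojo.Schema convert(Schema iceberg) {
    List<Field> fields = iceberg.columns().stream()
        .map(ArrowSchemaUtil::convertField)
        .collect(Collectors.toList());
    return new org.apache.arrow.vector.types.pojo.Schema(fields);
  }

  private static Field convertField(Types.NestedField field) {
    ArrowType arrowType;
    switch (field.type().typeId()) {
      case BOOLEAN: arrowType = new ArrowType.Bool(); break;
      case INTEGER: arrowType = new ArrowType.Int(32, true); break;
      case LONG:    arrowType = new ArrowType.Int(64, true); break;
      case FLOAT:   arrowType = new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE); break;
      case DOUBLE:  arrowType = new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE); break;
      case STRING:  arrowType = new ArrowType.Utf8(); break;
      default:
        throw new UnsupportedOperationException("Unsupported type: " + field.type());
    }
    // Iceberg's optional/required maps onto Arrow's nullable/non-nullable
    FieldType type = field.isOptional()
        ? FieldType.nullable(arrowType)
        : FieldType.notNullable(arrowType);
    return new Field(field.name(), type, null);
  }
}
```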
Thanks @danielcweeks! I also found https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala, which has some nice plumbing for Batch to …
Another point about the current reader is that we first construct … Here are benchmark results for readers/writers alone (without Spark):
The code is available in the above PR.
Is there a plan for how we're going to tackle this? So far I've only seen concrete work in #194. Is there a plan to build out the entire Arrow integration? Perhaps it's appropriate to break down this issue into multiple sub-issues so that PRs can target the sub-issues - then we can merge everything knowing that each piece will be part of a known complete picture?
@mccheah We have started working on a more complete integration with Arrow and we'll be following up incrementally as things progress. @anjalinorwood and @samarthjain are working on this here at Netflix. There are still some things that will need to be proved out, but we plan to chunk the work into smaller pieces for better review and comment.
I've added a WIP branch with a working POC for vectorization of primitive types in Iceberg. Implementation Notes:
P.S. There's some unused code under … Let me know what folks think of the approach. I'm getting this working for our scale test benchmark and will report back with numbers. Feel free to run your own benchmarks and share.
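As an illustration of the kind of primitive-type vectorization this POC targets, here is a minimal sketch that copies a decoded batch of Parquet values into an Arrow `IntVector`. It is not code from the branch; the `values` array and `isNull` mask stand in for values and definition levels already decoded from a Parquet page.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class VectorizedIntReadSketch {
  // Fill an Arrow IntVector from a decoded batch of int values plus a null mask
  public static IntVector fill(int[] values, boolean[] isNull) {
    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
    IntVector vector = new IntVector("col", allocator);
    vector.allocateNew(values.length);
    for (int i = 0; i < values.length; i++) {
      if (isNull[i]) {
        vector.setNull(i);          // validity bit cleared, no value written
      } else {
        vector.set(i, values[i]);   // value written directly into the data buffer
      }
    }
    vector.setValueCount(values.length);
    return vector;
  }
}
```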
A few of us met with @prodeezy to discuss the POC implementation (great work btw) and look at the future work we all hope to be contributing to. I've captured some of the discussion, ideas on where we're going, and considerations here: https://docs.google.com/document/d/1qVcowrYP6xBoB9C4htwEA0QvbHpdstzieNsX26SMG2k/edit#heading=h.yun6jblu7cfi Feel free to add comments. I think with a little cleanup we might be close to having an initial implementation that we can start iterating on. @rdblue had proposed creating a branch, which I'm all for, so we can work more openly and get more eyes on the implementation.
Will be using https://github.com/apache/incubator-iceberg/tree/vectorized-read going forward to iterate on this feature.
Vectorization Perf Meeting notes (Aug 1): After running benchmarks, I met with @samarthjain, @anjalinorwood, and @rdblue to go over some possible improvements, and we came up with the following. Possible low-hanging fruit for perf:
Deeper look:
Since #828 was merged, I'm going to close this. Let's track the remaining work in the vectorized read milestone.
Iceberg does not use vectorized reads to produce data for Spark. For cases where Spark can use its vectorized read path (flat schemas, no evolution), Spark will be faster. Iceberg should solve this problem by adding a vectorized read path that deserializes to Arrow RecordBatch. Spark already has support for Arrow data from PySpark.
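For context on the Spark side, Arrow-backed columns can be handed to Spark without a per-row copy by wrapping them in Spark's `ArrowColumnVector` and `ColumnarBatch`. A minimal sketch, assuming a single already-populated Arrow `IntVector` (for example, one produced by a vectorized Parquet read):

```java
import org.apache.arrow.vector.IntVector;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class ArrowToSparkSketch {
  public static ColumnarBatch wrap(IntVector idVector) {
    // ArrowColumnVector adapts an Arrow ValueVector to Spark's ColumnVector API,
    // so Spark reads the Arrow buffers in place instead of copying row by row
    ColumnVector[] columns = new ColumnVector[] { new ArrowColumnVector(idVector) };
    ColumnarBatch batch = new ColumnarBatch(columns);
    batch.setNumRows(idVector.getValueCount());
    return batch;
  }
}
```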