[SUPPORT][SPARK][NATIVE] make hudi integrate into gluten/velox #10252

YannByron · 2023-12-06T05:15:28Z

Currently, the integration between Spark and Gluten/Velox delivers good performance on Parquet and lake formats. @vinothchandar also mentioned this in #8679, so I think Hudi should take part.

Here is a design I proposed in Gluten earlier, along with some discussion: apache/incubator-gluten#3378

Right now, all the scan types that Gluten supports are file-based, like BatchScan or FileSourceScanExec. The data source provides the list of files during planning, then Gluten passes them to the native library, and the native reader (Parquet/ORC/...) loads them.

A Hudi COW table without hoodie.schema.on.read.enable can return HadoopFsRelation (which is file-based) when createRelation is called. So this integration may be easy to build if the native reader can load the Hudi files correctly.

But other Hudi tables return HoodieBaseRelation (with BaseRelation, FileRelation, PrunedFilteredScan), which is transformed to RowDataSourceScanExec, and Gluten does not support that. To solve this, there are maybe two ways:

1. Make Gluten support it. IMO, this is not easy, and it is not a high priority in Gluten.
2. Make Hudi use a file-based scan. However, MOR tables need to merge data, and Hudi uses the Spark DataSource V1 interface, which has no way to merge data. I guess we have to migrate to DSV2 and use BatchScan, which can load data through a Hudi-defined reader. In addition, a native C++ Hudi reader is required in Velox. With these two pieces, Hudi MOR tables can be queried in the native environment.
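
The scan-path reasoning above can be summarized as a small decision table. The sketch below is a plain-Python model, not Hudi or Gluten code: the names `planned_scan_node` and `gluten_can_offload` are hypothetical, and the mapping only encodes what this post states (a COW table without hoodie.schema.on.read.enable yields HadoopFsRelation, which Spark plans as a file scan; other tables yield HoodieBaseRelation, which becomes RowDataSourceScanExec).

```python
# Toy model of the planning outcome described in this post.
# Not Hudi or Gluten code; all names below are hypothetical.

# Scan nodes this post lists as supported by Gluten (file-based scans).
GLUTEN_SUPPORTED_SCANS = {"BatchScan", "FileSourceScanExec"}


def planned_scan_node(table_type: str, schema_on_read: bool) -> str:
    """Return the Spark scan node a Hudi table ends up with, per the post."""
    table_type = table_type.upper()
    if table_type not in ("COW", "MOR"):
        raise ValueError(f"unknown table type: {table_type}")
    if table_type == "COW" and not schema_on_read:
        # createRelation returns HadoopFsRelation, planned as a file scan.
        return "FileSourceScanExec"
    # HoodieBaseRelation is planned as a row-based DataSource V1 scan.
    return "RowDataSourceScanExec"


def gluten_can_offload(table_type: str, schema_on_read: bool = False) -> bool:
    """True if the planned scan node is one Gluten can pass to a native reader."""
    return planned_scan_node(table_type, schema_on_read) in GLUTEN_SUPPORTED_SCANS


if __name__ == "__main__":
    for table_type, sor in [("COW", False), ("COW", True), ("MOR", False)]:
        print(table_type, sor, planned_scan_node(table_type, sor),
              gluten_can_offload(table_type, sor))
```

In this model only the first case (COW, schema-on-read disabled) reaches a scan node that Gluten can offload, which is why the two options above both aim at getting every table type onto a file-based scan.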

Gluten: https://github.com/oap-project/gluten
Velox: https://github.com/facebookincubator/velox

@vinothchandar @xushiyan

vinothchandar · 2023-12-07T18:01:51Z

@YannByron Great to hear from you. @rmahindra123 is actively exploring this as well.

But a lot of work is going on to build a new vectorized read path for all queries. cc @jonvex @yihua @linliu-code. Can you check out some of their recent work?

jonvex · 2023-12-07T18:13:30Z

We are actually switching to HadoopFsRelation for all query types, so it sounds like this will make the integration easier.

yihua · 2023-12-07T18:17:26Z

Hey @YannByron, great that you brought this up.

@jonvex, @linliu-code, and I are actively working on improving Spark read and write performance, and one aspect is to return HadoopFsRelation for all query types (including MOR snapshot queries with log merging). As of now, on the latest master, the DefaultSource already returns HadoopFsRelation for snapshot, RO, and CDC queries on both COW and MOR tables in Spark.

yihua · 2023-12-07T18:18:53Z

We need to check whether our read and merging logic is compatible with Velox. cc @jonvex @linliu-code

linliu-code · 2023-12-07T23:39:33Z

After we support HadoopFsRelation for all query types, what else is left for the Gluten/Velox integration?

YannByron · 2023-12-08T02:25:56Z

> After we support HadoopFsRelation for all query types, what else is left for the Gluten/Velox integration?

1. A native reader, for cases where the existing Parquet reader cannot load the data correctly on its own, similar to the Iceberg v2 (positional deletes) reader: "Support Iceberg positional deletes", facebookincubator/velox#5897.
2. A gluten-hudi module, as a bridge between spark-hudi and Velox.
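
To illustrate why the first item matters for MOR tables: a snapshot read has to merge base-file rows with newer log records, so a plain Parquet reader that only sees the base file would return stale or deleted rows. The sketch below is a deliberately simplified Python model. The record layout, the `merge_on_read` name, and the "latest log record wins" rule are assumptions for illustration, not Hudi's actual merging logic.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]
# A log record is (record_key, new_row); new_row=None models a delete.
LogRecord = Tuple[str, Optional[Row]]


def merge_on_read(base_rows: Iterable[Row],
                  log_records: Iterable[LogRecord],
                  key_field: str = "id") -> List[Row]:
    """Apply log records, in order, on top of the rows from a base file."""
    merged: Dict[str, Row] = {str(row[key_field]): row for row in base_rows}
    for key, new_row in log_records:
        if new_row is None:
            merged.pop(key, None)   # delete: drop the row if it exists
        else:
            merged[key] = new_row   # insert or update: latest record wins
    # Sort by key only to make the output deterministic.
    return [merged[k] for k in sorted(merged)]


if __name__ == "__main__":
    base = [{"id": "1", "v": "a"}, {"id": "2", "v": "b"}]
    log = [("2", {"id": "2", "v": "b2"}), ("1", None), ("3", {"id": "3", "v": "c"})]
    # Reading only `base` would still return id 1 and the old value of id 2.
    print(merge_on_read(base, log))
```

A native reader for MOR would need to perform an equivalent merge inside Velox, in the same way the Iceberg reader has to apply positional deletes on top of the data files.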

vinothchandar · 2023-12-08T20:42:10Z

@YannByron Expected.

To confirm, CoW snapshot queries should work after we support HadoopFsRelation for all queries, right? We will be happy to work with you on items 1 & 2, if you have time/interest. Let @linliu-code & team know.

YannByron · 2023-12-11T02:39:32Z

@vinothchandar, I'm also glad to work with you guys.

Honestly, item 1 (a native reader in Velox for MOR tables) is beyond my ability.
I can implement a gluten-hudi module in Gluten and verify that CoW snapshot queries work with it.

vinothchandar · 2023-12-11T19:37:42Z

> Honestly, item 1 (a native reader in Velox for MOR tables) is beyond my ability.

@linliu-code or @rmahindra123 can help here, being the C++ nerds here.

> I can implement a gluten-hudi module in Gluten and verify that CoW snapshot queries work with it.

We are seeing great results with the new read path that the team is implementing, so something like this will help us make some decisions. I am also working on NVIDIA RAPIDS along similar lines; the snapshot queries are already accelerated there.

vinothchandar · 2024-02-28T12:26:57Z

@YannByron Pinging on this again. Is there a WIP integration for CoW that we could build as a quick prototype? How hard would that be?

danny0405 added the feature-enquiry, spark-sql, and performance labels Dec 7, 2023

vinothchandar self-assigned this Dec 7, 2023
