[GLUTEN-3378][VL] Feat: Support read iceberg mor table for Velox backend #4779

liujiayi771 · 2024-02-26T07:49:36Z

What changes were proposed in this pull request?

Velox added Iceberg MOR (merge-on-read) table read support in facebookincubator/velox#7847. This PR adds support for reading Iceberg MOR tables in the Velox backend.

How was this patch tested?

Added the MOR table read test case "iceberg read mor table".

github-actions · 2024-02-26T07:49:56Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2024-02-26T07:50:11Z

Run Gluten Clickhouse CI

liujiayi771 · 2024-02-26T07:51:11Z

cc @yma11 @YannByron, thanks.

yma11 · 2024-02-26T08:19:46Z

gluten-iceberg/src/main/java/io/glutenproject/substrait/rel/IcebergLocalFilesNode.java

+                  .setEnableRowGroupMaxminIndex(
+                      GlutenConfig.getConf().enableParquetRowGroupMaxMinIndex())
+                  .build();
+          deleteFileBuilder.setParquet(parquetReadOptions);


Should the delete files share the same read options as the data file?

Iceberg allows a delete file and its data file to use different formats, but in most cases they are the same.
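To illustrate the point about formats, here is a minimal C++ sketch in which each delete file carries its own format rather than inheriting the data file's. All type and function names (`FileFormatSketch`, `DeleteFileSketch`, `DataFileSketch`, `hasMixedFormats`) are hypothetical and are not taken from Gluten or Velox.

```cpp
#include <string>
#include <vector>

// Hypothetical format tag; the real code uses the Substrait-generated enum.
enum class FileFormatSketch { kParquet, kOrc };

// Each delete file records its own format, independent of the data file.
struct DeleteFileSketch {
  std::string path;
  FileFormatSketch format;
};

struct DataFileSketch {
  std::string path;
  FileFormatSketch format;
  std::vector<DeleteFileSketch> deleteFiles;
};

// Returns true when at least one delete file uses a different format
// than the data file it applies to.
bool hasMixedFormats(const DataFileSketch& dataFile) {
  for (const auto& deleteFile : dataFile.deleteFiles) {
    if (deleteFile.format != dataFile.format) {
      return true;
    }
  }
  return false;
}
```

With a per-delete-file format field, read options can be chosen per delete file instead of being copied from the data file.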

yma11 · 2024-02-26T08:23:27Z

gluten-iceberg/src/test/scala/io/glutenproject/execution/VeloxIcebergSuite.scala

+                    |insert into table iceberg_mor_tb
+                    |values (1, 'a1', 'p1'), (2, 'a2', 'p1'), (3, 'a3', 'p2');
+                    |""".stripMargin)
+        // Delete row.


Could you add cases for multiple deletions?

+1. At a minimum, the tests should cover the UPDATE operation. MERGE INTO would be nice to have.

@yma11 @YannByron While adding test cases, I discovered a bug in the Velox code. I will fix this bug first before updating the current PR. Later on, I will add test cases for MERGE INTO and UPDATE, as well as multiple DELETE operations.

yma11 · 2024-02-26T08:25:13Z

cpp/velox/compute/WholeStageResultIterator.cc

+        // Set Iceberg split.
+        std::unordered_map<std::string, std::string> customSplitInfo{{"table_format", "hive-iceberg"}};
+        auto deleteFilesFind = icebergSplitInfo->deleteFilesMap.find(paths[idx]);
+        auto deleteFiles = deleteFilesFind != icebergSplitInfo->deleteFilesMap.end() ? deleteFilesFind->second


Is it possible to pass deleteFiles without this map?

With a normal SplitInfo, file information is stored in parallel lists, so there is no way to recover the mapping between a data file and its delete files. I did consider placing the delete file map under protobuf's LocalFiles. The current logic extracts the map information into the IcebergReadOptions of each FileOrFiles in Java, and then combines it back into a map in C++. This is somewhat redundant, but it requires the fewest changes to the current protobuf definitions.
Alternatively, we could add a table_format field to LocalFiles that defaults to hive and can also be hive-iceberg. When the format is hive-iceberg, the delete file map it contains would be read.
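The "combine them back into a map" step described above could look roughly like the following C++ sketch. It assumes each file entry arrives with its own list of delete files; `FileEntrySketch`, `buildDeleteFilesMap`, and `lookupDeleteFiles` are hypothetical names, not the PR's actual code.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-file entry, standing in for one FileOrFiles message
// together with the delete files from its IcebergReadOptions.
struct FileEntrySketch {
  std::string path;
  std::vector<std::string> deleteFiles;
};

using DeleteFilesMapSketch =
    std::unordered_map<std::string, std::vector<std::string>>;

// Rebuilds the data-file-path -> delete-files mapping from per-file entries.
// Files without delete files get no map entry.
DeleteFilesMapSketch buildDeleteFilesMap(
    const std::vector<FileEntrySketch>& files) {
  DeleteFilesMapSketch result;
  for (const auto& file : files) {
    if (!file.deleteFiles.empty()) {
      result[file.path] = file.deleteFiles;
    }
  }
  return result;
}

// Returns the delete files for a data file, or an empty list if none exist.
std::vector<std::string> lookupDeleteFiles(
    const DeleteFilesMapSketch& map, const std::string& path) {
  auto it = map.find(path);
  return it != map.end() ? it->second : std::vector<std::string>{};
}
```

The lookup mirrors the `find` / `end()` check in the diff excerpt above: a data file missing from the map simply has no delete files.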

In the current code, we group the fields of all files for each task (paths, starts, lengths, etc.) together, each as its own list. I think we could refactor this into a single list of files or splits, where each entry contains a path, start, length, and a list of deleteFiles, so that we would not need this map. But how would that affect the protobuf part? Any ideas?

@yma11 This is also a feasible solution, but not all data files have a delete file, and some data files have multiple delete files. We would need a two-dimensional vector to maintain this relationship, with an empty vector for each data file that has no corresponding delete file. Do you think this is a better solution? I'm OK with switching to this approach.
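A minimal C++ sketch of the two-dimensional vector idea, assuming the delete-file lists are kept parallel to the existing `paths` list. `SplitInfo2DSketch` and `addDataFile` are hypothetical names used only for illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical split info: deleteFiles[i] holds the delete files for paths[i].
struct SplitInfo2DSketch {
  std::vector<std::string> paths;
  std::vector<std::vector<std::string>> deleteFiles;
};

// Appends a data file and its delete files together so that both vectors
// stay the same length. A file without delete files gets an empty vector.
void addDataFile(
    SplitInfo2DSketch& info,
    std::string path,
    std::vector<std::string> deletes = {}) {
  info.paths.push_back(std::move(path));
  info.deleteFiles.push_back(std::move(deletes));
}
```

Always pushing an entry, even an empty one, is what keeps index `i` valid for both vectors.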

Yeah. scanInfo would hold a list of SplitInfo, and each SplitInfo would contain its own path, start, and length, as well as a list of deleteFiles that may be empty. I think this is conceptually clearer and avoids using idx anywhere. Spark-delta has a similar structure called AddFile that is organized this way. It's okay to merge this PR first and do this as a follow-up.
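The per-split layout proposed here might be sketched in C++ as follows. This is only an illustration of the idea under discussion; `FileSplitSketch`, `ScanInfoSketch`, and `totalDeleteFiles` are hypothetical names and do not reflect the merged code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-file split: everything about one file lives together,
// so no index or map is needed to find its delete files.
struct FileSplitSketch {
  std::string path;
  uint64_t start = 0;
  uint64_t length = 0;
  std::vector<std::string> deleteFiles;  // may be empty
};

struct ScanInfoSketch {
  std::vector<FileSplitSketch> splits;
};

// Example traversal: counts delete files across all splits without using idx.
std::size_t totalDeleteFiles(const ScanInfoSketch& scan) {
  std::size_t count = 0;
  for (const auto& split : scan.splits) {
    count += split.deleteFiles.size();
  }
  return count;
}
```

Compared with parallel lists, this keeps each file's fields in one place, at the cost of a larger change to how splits are serialized.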

yma11 · 2024-02-26T08:27:36Z

cpp/velox/compute/iceberg/IcebergPlanConverter.cc

+      ::substrait::ReadRel_LocalFiles_FileOrFiles::IcebergReadOptions::DeleteFile::FileFormatCase;
+  auto icebergSplitInfo = std::dynamic_pointer_cast<IcebergSplitInfo>(splitInfo)
+      ? std::dynamic_pointer_cast<IcebergSplitInfo>(splitInfo)
+      : std::make_shared<IcebergSplitInfo>(*splitInfo);


When would it not be an IcebergSplitInfo?

Since substrait::ReadRel_LocalFiles_FileOrFiles contains multiple files, each file passes through this function during iteration. The first time, splitInfo arrives as a plain SplitInfo; on subsequent calls, it has already been replaced by an IcebergSplitInfo.
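The "upgrade on first call, reuse afterwards" behavior described above can be sketched in isolation as follows. The types here are simplified, hypothetical stand-ins (`SplitInfoSketch`, `IcebergSplitInfoSketch`, `asIcebergSplitInfo`), not the actual Gluten classes.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical base type; polymorphic so dynamic_pointer_cast can be used.
struct SplitInfoSketch {
  virtual ~SplitInfoSketch() = default;
  std::vector<std::string> paths;
};

// Hypothetical derived type that adds the Iceberg-specific delete files.
struct IcebergSplitInfoSketch : SplitInfoSketch {
  IcebergSplitInfoSketch() = default;
  // Copies the common fields from a plain split info.
  explicit IcebergSplitInfoSketch(const SplitInfoSketch& base)
      : SplitInfoSketch(base) {}
  std::vector<std::vector<std::string>> deleteFiles;
};

// Returns the same object if it is already Iceberg-specific; otherwise
// creates a new Iceberg split info that copies the base fields.
std::shared_ptr<IcebergSplitInfoSketch> asIcebergSplitInfo(
    const std::shared_ptr<SplitInfoSketch>& splitInfo) {
  if (auto iceberg =
          std::dynamic_pointer_cast<IcebergSplitInfoSketch>(splitInfo)) {
    return iceberg;
  }
  return std::make_shared<IcebergSplitInfoSketch>(*splitInfo);
}
```

Storing the cast result in a local variable avoids performing `dynamic_pointer_cast` twice, which the ternary in the diff excerpt does.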

YannByron · 2024-02-27T06:17:57Z

gluten-core/src/main/resources/substrait/proto/substrait/algebra.proto

+        message DeleteFile {
+          FileContent fileContent = 1;
+          string filePath = 2;
+          uint64 fileSize = 5;


Maybe a stupid question: why does it skip 3 and 4?

I removed some redundant fields and forgot to update the sequence numbers; it shouldn't skip 3 and 4.

github-actions · 2024-03-11T10:08:24Z

Run Gluten Clickhouse CI

github-actions · 2024-03-12T15:00:18Z

Run Gluten Clickhouse CI

yma11 · 2024-03-13T01:00:46Z

@liujiayi771 It seems the code has Scala style violations. Please update.

github-actions · 2024-03-13T01:54:46Z

Run Gluten Clickhouse CI

liujiayi771 · 2024-03-13T01:55:23Z

@yma11 I have modified the map in SplitInfo to a two-dimensional vector.

zhouyuan · 2024-03-14T08:31:45Z

@liujiayi771 There's a small conflict, could you please help to do a rebase?

thanks,
-yuan

github-actions · 2024-03-14T08:49:14Z

Run Gluten Clickhouse CI

github-actions · 2024-03-14T08:52:00Z

Run Gluten Clickhouse CI

liujiayi771 · 2024-03-15T01:53:38Z

> @liujiayi771 There's a small conflict, could you please help to do a rebase?
>
> thanks, -yuan

Done.

github-actions · 2024-03-15T02:23:47Z

#3378

zhouyuan

👍

GlutenPerfBot · 2024-03-15T04:43:45Z

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_4779_time.csv | log/native_master_03_14_2024_58a459bf4_time.csv | difference | percentage |
|---|---|---|---|---|
| q1 | 35.35 | 36.11 | 0.758 | 102.14% |
| q2 | 23.88 | 23.82 | -0.067 | 99.72% |
| q3 | 36.72 | 36.95 | 0.227 | 100.62% |
| q4 | 36.39 | 38.44 | 2.051 | 105.64% |
| q5 | 69.22 | 69.68 | 0.459 | 100.66% |
| q6 | 5.92 | 7.39 | 1.464 | 124.73% |
| q7 | 82.27 | 82.24 | -0.035 | 99.96% |
| q8 | 85.04 | 84.83 | -0.215 | 99.75% |
| q9 | 118.26 | 125.41 | 7.147 | 106.04% |
| q10 | 44.72 | 45.43 | 0.702 | 101.57% |
| q11 | 20.37 | 20.82 | 0.450 | 102.21% |
| q12 | 28.62 | 24.93 | -3.690 | 87.11% |
| q13 | 47.31 | 47.32 | 0.014 | 100.03% |
| q14 | 18.38 | 20.91 | 2.533 | 113.78% |
| q15 | 29.84 | 31.01 | 1.167 | 103.91% |
| q16 | 13.23 | 12.67 | -0.562 | 95.76% |
| q17 | 99.41 | 100.39 | 0.985 | 100.99% |
| q18 | 142.58 | 141.86 | -0.715 | 99.50% |
| q19 | 16.82 | 14.82 | -2.005 | 88.08% |
| q20 | 27.58 | 28.88 | 1.300 | 104.71% |
| q21 | 226.44 | 226.25 | -0.193 | 99.91% |
| q22 | 14.96 | 13.94 | -1.015 | 93.21% |
| total | 1223.32 | 1234.08 | 10.760 | 100.88% |

…end (apache#4779) Velox add iceberg mor table read support in facebookincubator/velox#7847. This PR supports read iceberg mor table for Velox backend.

yma11 reviewed Feb 26, 2024

YannByron reviewed Feb 27, 2024

liujiayi771 force-pushed the iceberg-mor branch from 4a48906 to 20c5b98 Compare March 11, 2024 09:51

liujiayi771 force-pushed the iceberg-mor branch from 20c5b98 to 032bf3e Compare March 12, 2024 14:59

yma11 approved these changes Mar 13, 2024

yma11 approved these changes Mar 14, 2024

liujiayi771 added 4 commits March 14, 2024 16:38

Support read iceberg mor table for Velox backend

5c3b7ed

Add more test case

ca82741

Use two-dimensional vector to store delete files

86d1769

Rebase

73c8eb5

liujiayi771 force-pushed the iceberg-mor branch from d2ebfb7 to 73c8eb5 Compare March 14, 2024 08:48

Remove useless ut case

1d95748

zhouyuan changed the title ~~[VL] Support read iceberg mor table for Velox backend~~ [VL] Feat: Support read iceberg mor table for Velox backend Mar 15, 2024

zhouyuan changed the title ~~[VL] Feat: Support read iceberg mor table for Velox backend~~ [GLUTEN-3378][VL] Feat: Support read iceberg mor table for Velox backend Mar 15, 2024

zhouyuan approved these changes Mar 15, 2024

zhouyuan merged commit 80bb0cf into apache:main Mar 15, 2024
19 checks passed
