[GLUTEN-3378][VL] Feat: Support read iceberg mor table for Velox backend #4779

Merged (5 commits) on Mar 15, 2024

liujiayi771
Contributor

What changes were proposed in this pull request?

Velox added support for reading Iceberg merge-on-read (MOR) tables in facebookincubator/velox#7847. This PR enables reading Iceberg MOR tables with the Velox backend.

How was this patch tested?

Added a MOR table read test case, "iceberg read mor table".

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename the commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

Run Gluten Clickhouse CI

@liujiayi771
Contributor Author

cc @yma11 @YannByron, thanks.

.setEnableRowGroupMaxminIndex(
GlutenConfig.getConf().enableParquetRowGroupMaxMinIndex())
.build();
deleteFileBuilder.setParquet(parquetReadOptions);
Contributor

Should the delete files share the same read options as the data file?

Contributor Author

Iceberg allows a delete file and its data file to use different formats, but in most cases they are the same.
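
To illustrate the point about differing formats, here is a minimal C++ sketch that derives read options from each delete file's own format instead of reusing the data file's. `FileFormat`, `ReadOptions`, `DeleteFile`, and `readOptionsFor` are hypothetical names used only for illustration, not Gluten or Velox APIs, and treating the row-group max/min index as a Parquet-only setting is an assumption based on the config name in the diff above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins; these are not the Gluten or Velox types.
enum class FileFormat { kParquet, kOrc };

struct ReadOptions {
  FileFormat format;
  bool enableRowGroupMaxMinIndex; // assumed to matter only for Parquet
};

struct DeleteFile {
  std::string path;
  FileFormat format;
};

// Build read options from the delete file's own format, so that a Parquet
// data file paired with an ORC delete file does not leak Parquet-only settings.
ReadOptions readOptionsFor(const DeleteFile& deleteFile, bool parquetMaxMinIndex) {
  assert(!deleteFile.path.empty()); // precondition: the delete file has a path
  const bool isParquet = deleteFile.format == FileFormat::kParquet;
  return ReadOptions{deleteFile.format, isParquet && parquetMaxMinIndex};
}
```

With this shape, `readOptionsFor(DeleteFile{"d.orc", FileFormat::kOrc}, true)` would yield ORC options with the Parquet index flag switched off.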

|insert into table iceberg_mor_tb
|values (1, 'a1', 'p1'), (2, 'a2', 'p1'), (3, 'a3', 'p2');
|""".stripMargin)
// Delete row.
Contributor

Could you add cases for multiple deletions?

Contributor

+1. At a minimum, the tests should cover the UPDATE operation. MERGE INTO would be nice to have.

Contributor Author

@yma11 @YannByron While adding test cases, I discovered a bug in the Velox code. I will fix this bug first before updating the current PR. Later on, I will add test cases for MERGE INTO and UPDATE, as well as multiple DELETE operations.

// Set Iceberg split.
std::unordered_map<std::string, std::string> customSplitInfo{{"table_format", "hive-iceberg"}};
auto deleteFilesFind = icebergSplitInfo->deleteFilesMap.find(paths[idx]);
auto deleteFiles = deleteFilesFind != icebergSplitInfo->deleteFilesMap.end() ? deleteFilesFind->second
Contributor

Is it possible to pass deleteFiles without this map?

Contributor Author

A normal SplitInfo stores file information in lists, so it cannot express the mapping between a data file and its delete files. I did consider placing the delete file map under protobuf's LocalFiles. The current logic extracts the map information into the IcebergReadOptions of each FileOrFiles in Java and then combines it back into a map in C++. This is somewhat redundant, but it has the least impact on the current protobuf definitions.
Alternatively, we could add a table_format definition to LocalFiles, defaulting to hive, with hive-iceberg as another possible value. When the format is hive-iceberg, the delete file map contained in LocalFiles would be read.
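
As a rough illustration of the "combine them back into a map in C++" step described above, the sketch below rebuilds a map from per-file delete lists and looks up a data file's delete files with an empty fallback, similar in spirit to the `find` call in the diff. `FileEntry`, `DeleteFilesMap`, `buildDeleteFilesMap`, and `deleteFilesFor` are hypothetical names, and delete files are reduced to plain path strings for brevity; the real code works with protobuf messages.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one FileOrFiles entry carrying its own delete files.
struct FileEntry {
  std::string path;
  std::vector<std::string> deleteFilePaths;
};

using DeleteFilesMap = std::unordered_map<std::string, std::vector<std::string>>;

// Combine the per-file delete lists into one map keyed by data file path.
// Data files without delete files are left out of the map.
DeleteFilesMap buildDeleteFilesMap(const std::vector<FileEntry>& entries) {
  DeleteFilesMap result;
  for (const auto& entry : entries) {
    assert(!entry.path.empty()); // precondition: every data file has a path
    if (!entry.deleteFilePaths.empty()) {
      auto& target = result[entry.path];
      target.insert(target.end(), entry.deleteFilePaths.begin(), entry.deleteFilePaths.end());
    }
  }
  return result;
}

// Look up the delete files for a data file, falling back to an empty list.
std::vector<std::string> deleteFilesFor(const DeleteFilesMap& map, const std::string& path) {
  auto it = map.find(path);
  return it != map.end() ? it->second : std::vector<std::string>{};
}
```

The redundancy mentioned in the comment is visible here: the information already arrives grouped per file, and the map only regroups it by path.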

Contributor

In the current code, we put the fields of all files in a task (paths, starts, lengths, etc.) together, each as a separate list. I think we can refactor this into a single list of files or splits, where each entry contains its path, start, length, and a list of deleteFiles, so that we won't need this map. But how would that affect the protobuf part? Any ideas?

Contributor Author

@yma11 This is also a feasible solution, but not all data files have a delete file, and some data files may have multiple delete files. We would need a two-dimensional vector to maintain this relationship, and it would need an empty vector for each data file that has no corresponding delete file. Do you think this is a better solution? I'm OK with changing to this approach.
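
A minimal sketch of the two-dimensional vector idea, assuming delete files are identified by path strings: `deleteFiles[i]` belongs to `paths[i]`, and a data file with no delete files still gets an empty inner vector so the indices stay aligned. `SplitInfoSketch` and `addFile` are hypothetical names, not the actual Gluten definitions.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical layout: parallel lists indexed by data file position.
struct SplitInfoSketch {
  std::vector<std::string> paths;
  // deleteFiles[i] holds the delete files for paths[i]; empty if there are none.
  std::vector<std::vector<std::string>> deleteFiles;

  // Always append to both lists so a file without deletes keeps its slot.
  void addFile(std::string path, std::vector<std::string> deletes = {}) {
    paths.push_back(std::move(path));
    deleteFiles.push_back(std::move(deletes));
    assert(paths.size() == deleteFiles.size()); // invariant: lists stay aligned
  }
};
```

The cost of this layout is the invariant itself: every code path that appends a data file must also append an inner vector, even when it is empty.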

Contributor

Yeah. scanInfo will have a list of SplitInfo, and each SplitInfo contains its own path, start, and length as well as a list of deleteFiles, which may be empty. I think this is conceptually clearer and avoids using idx anywhere. Spark-delta has a similar structure called AddFile that is organized like this. It's okay to merge this PR first and do this as a follow-up.
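
The proposed follow-up could look roughly like the sketch below, which converts the current parallel lists into one list of per-file entries. `FileSplit` and `toSplits` are hypothetical names chosen for illustration, delete files are again reduced to path strings, and nothing here reflects how the refactor was eventually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-file split: everything about one data file lives together.
struct FileSplit {
  std::string path;
  std::uint64_t start;
  std::uint64_t length;
  std::vector<std::string> deleteFiles; // may be empty
};

// Convert parallel lists into a single list of splits. After this step,
// callers iterate over splits directly and never need a shared index.
std::vector<FileSplit> toSplits(
    const std::vector<std::string>& paths,
    const std::vector<std::uint64_t>& starts,
    const std::vector<std::uint64_t>& lengths,
    const std::vector<std::vector<std::string>>& deleteFiles) {
  assert(paths.size() == starts.size() && paths.size() == lengths.size() &&
         paths.size() == deleteFiles.size());
  std::vector<FileSplit> splits;
  splits.reserve(paths.size());
  for (std::size_t i = 0; i < paths.size(); ++i) {
    splits.push_back(FileSplit{paths[i], starts[i], lengths[i], deleteFiles[i]});
  }
  return splits;
}
```

Compared with the two-dimensional vector, the alignment check happens once at conversion time instead of being an invariant every caller has to maintain.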

::substrait::ReadRel_LocalFiles_FileOrFiles::IcebergReadOptions::DeleteFile::FileFormatCase;
auto icebergSplitInfo = std::dynamic_pointer_cast<IcebergSplitInfo>(splitInfo)
? std::dynamic_pointer_cast<IcebergSplitInfo>(splitInfo)
: std::make_shared<IcebergSplitInfo>(*splitInfo);
Contributor

When would it not be an IcebergSplitInfo?

Contributor Author

Since substrait::ReadRel_LocalFiles_FileOrFiles contains multiple files, each file enters this function during iteration. The first time, it enters as a SplitInfo; on subsequent calls, it has already been replaced by an IcebergSplitInfo.
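
The "upgrade on first call, reuse afterwards" behavior can be sketched as follows. The `SplitInfo` and `IcebergSplitInfo` structs below are simplified stand-ins with assumed fields, and `asIcebergSplitInfo` is a hypothetical helper. The sketch also performs the `dynamic_pointer_cast` only once, whereas the diff fragment above casts twice.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for the real types; only the casting pattern matters.
struct SplitInfo {
  virtual ~SplitInfo() = default;
  std::vector<std::string> paths;
};

struct IcebergSplitInfo : SplitInfo {
  explicit IcebergSplitInfo(const SplitInfo& base) : SplitInfo(base) {}
  std::vector<std::vector<std::string>> deleteFiles;
};

// Return the Iceberg view of splitInfo: reuse it if it already is one,
// otherwise copy the base fields into a new IcebergSplitInfo.
std::shared_ptr<IcebergSplitInfo> asIcebergSplitInfo(
    const std::shared_ptr<SplitInfo>& splitInfo) {
  assert(splitInfo != nullptr); // precondition: dereferenced below
  if (auto iceberg = std::dynamic_pointer_cast<IcebergSplitInfo>(splitInfo)) {
    return iceberg;
  }
  return std::make_shared<IcebergSplitInfo>(*splitInfo);
}
```

On the first file the helper returns a fresh object that copies the base fields; on later files it returns the same pointer, so delete files accumulated so far are preserved.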

message DeleteFile {
FileContent fileContent = 1;
string filePath = 2;
uint64 fileSize = 5;
Contributor

Maybe a stupid question: why does it skip 3 and 4?

Contributor Author

I removed some redundant fields and forgot to update the field numbers; it shouldn't skip 3 and 4.

@yma11
Contributor

yma11 commented Mar 13, 2024

@liujiayi771 It seems the code has Scala style violations. Please update.

@liujiayi771
Contributor Author

@yma11 I have modified the map in SplitInfo to a two-dimensional vector.

@zhouyuan
Contributor

@liujiayi771 There's a small conflict, could you please help to do a rebase?

thanks,
-yuan

@liujiayi771
Contributor Author

> @liujiayi771 There's a small conflict, could you please help to do a rebase?
>
> thanks, -yuan

Done.

@zhouyuan changed the title from "[VL] Support read iceberg mor table for Velox backend" to "[VL] Feat: Support read iceberg mor table for Velox backend" on Mar 15, 2024
@zhouyuan changed the title from "[VL] Feat: Support read iceberg mor table for Velox backend" to "[GLUTEN-3378][VL] Feat: Support read iceberg mor table for Velox backend" on Mar 15, 2024
#3378

Contributor

@zhouyuan zhouyuan left a comment

👍

@zhouyuan zhouyuan merged commit 80bb0cf into apache:main Mar 15, 2024
19 checks passed
@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_4779_time.csv | log/native_master_03_14_2024_58a459bf4_time.csv | difference | percentage |
|-------|--------------------------|--------------------------------------------------|------------|------------|
| q1 | 35.35 | 36.11 | 0.758 | 102.14% |
| q2 | 23.88 | 23.82 | -0.067 | 99.72% |
| q3 | 36.72 | 36.95 | 0.227 | 100.62% |
| q4 | 36.39 | 38.44 | 2.051 | 105.64% |
| q5 | 69.22 | 69.68 | 0.459 | 100.66% |
| q6 | 5.92 | 7.39 | 1.464 | 124.73% |
| q7 | 82.27 | 82.24 | -0.035 | 99.96% |
| q8 | 85.04 | 84.83 | -0.215 | 99.75% |
| q9 | 118.26 | 125.41 | 7.147 | 106.04% |
| q10 | 44.72 | 45.43 | 0.702 | 101.57% |
| q11 | 20.37 | 20.82 | 0.450 | 102.21% |
| q12 | 28.62 | 24.93 | -3.690 | 87.11% |
| q13 | 47.31 | 47.32 | 0.014 | 100.03% |
| q14 | 18.38 | 20.91 | 2.533 | 113.78% |
| q15 | 29.84 | 31.01 | 1.167 | 103.91% |
| q16 | 13.23 | 12.67 | -0.562 | 95.76% |
| q17 | 99.41 | 100.39 | 0.985 | 100.99% |
| q18 | 142.58 | 141.86 | -0.715 | 99.50% |
| q19 | 16.82 | 14.82 | -2.005 | 88.08% |
| q20 | 27.58 | 28.88 | 1.300 | 104.71% |
| q21 | 226.44 | 226.25 | -0.193 | 99.91% |
| q22 | 14.96 | 13.94 | -1.015 | 93.21% |
| total | 1223.32 | 1234.08 | 10.760 | 100.88% |
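
Judging from the numbers alone, the last two columns of the report appear to be the master time minus the PR time, and the master time divided by the PR time. That relationship is inferred here rather than taken from the bot's source, so the sketch below is an assumption; `Comparison` and `compare` are hypothetical names. Per-query rows may differ in the last digit because the report seems to compute from unrounded timings.

```cpp
#include <cassert>

// Assumed relationship, inferred from the reported numbers:
//   difference = baseline - pr
//   percentage = baseline / pr * 100
struct Comparison {
  double difference;
  double percentage;
};

Comparison compare(double prTime, double baselineTime) {
  assert(prTime > 0.0); // precondition: avoid dividing by zero
  return Comparison{baselineTime - prTime, baselineTime / prTime * 100.0};
}
```

Under this reading, a percentage above 100% means the PR build was faster than master for that query, and for the total row `compare(1223.32, 1234.08)` gives a difference of about 10.76 and a percentage of about 100.88.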

taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Mar 25, 2024
…end (apache#4779)

Velox add iceberg mor table read support in facebookincubator/velox#7847. This PR supports read iceberg mor table for Velox backend.
taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 8, 2024
taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 9, 2024