-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feature](iceberg)support read iceberg complex type,iceberg.orc format and position delete. #33935
[feature](iceberg)support read iceberg complex type,iceberg.orc format and position delete. #33935
Conversation
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website. |
clang-tidy review says "All clean, LGTM! 👍" |
be/.clion.source.upload.marker
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this
@@ -1684,6 +1699,9 @@ Status OrcReader::get_next_block(Block* block, size_t* read_rows, bool* eof) { | |||
for (auto& conjunct : _non_dict_filter_conjuncts) { | |||
filter_conjuncts.emplace_back(conjunct); | |||
} | |||
for (auto& [missing_col, conjunct] : _lazy_read_ctx.predicate_missing_columns) { | |||
filter_conjuncts.emplace_back(conjunct); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to resolve conjuncts like missing_column1 = missing_column2
, and other column = missing_column
?
@@ -177,6 +177,12 @@ class OrcReader : public GenericReader { | |||
Status get_parsed_schema(std::vector<std::string>* col_names, | |||
std::vector<TypeDescriptor>* col_types) override; | |||
|
|||
Status get_parsed_col_name_iceberg_ids(std::vector<std::string>* col_names, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other components like tvf
will get the file schema/metadata, How to provide a public interface to get the schema/metadata. get_parsed_col_name_iceberg_ids
may be correct, but it is not elegant enough.
@@ -136,6 +138,32 @@ Status IcebergTableReader::init_reader( | |||
|
|||
return status; | |||
} | |||
Status IcebergTableReader::init_reader_for_orc( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we abstract the abstract class for the iceberg reader, using parquet/orc as the concrete implementation? If there are reuse functions between two formats, it can be moved to the parent class. The code style like if-else
often increases the difficulty of later maintenance and modification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
//Read position_delete_file TFileRangeDesc, generate DeleteFile | ||
virtual Status _read_position_delete_file(const TFileRangeDesc*, DeleteFile*) = 0; | ||
|
||
void _gen_position_delete_file_range(Block& block, DeleteFile* const position_delete, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: parameter 'position_delete' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
void _gen_position_delete_file_range(Block& block, DeleteFile* const position_delete, | |
void _gen_position_delete_file_range(Block& block, DeleteFile* position_delete, |
@@ -1684,6 +1702,10 @@ Status OrcReader::get_next_block(Block* block, size_t* read_rows, bool* eof) { | |||
for (auto& conjunct : _non_dict_filter_conjuncts) { | |||
filter_conjuncts.emplace_back(conjunct); | |||
} | |||
//include missing_columns != missing_columns ; missing_column is null; missing_column != file_columns etc... | |||
for (auto& [missing_col, conjunct] : _lazy_read_ctx.predicate_missing_columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parquet reader also has the same issue.
PR approved by at least one committer and no changes requested. |
PR approved by anyone and no changes requested. |
1 similar comment
PR approved by anyone and no changes requested. |
run buildall |
TeamCity be ut coverage result: |
TPC-H: Total hot run time: 40001 ms
|
TPC-DS: Total hot run time: 188541 ms
|
a7516ba
to
2cf74bf
Compare
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
TPC-H: Total hot run time: 40274 ms
|
TeamCity be ut coverage result: |
TPC-DS: Total hot run time: 185820 ms
|
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
TPC-H: Total hot run time: 41682 ms
|
TeamCity be ut coverage result: |
TPC-DS: Total hot run time: 186092 ms
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR approved by at least one committer and no changes requested. |
…t and position delete. (apache#33935) 1. support read iceberg complex type : struct map. 2. support read the ORC storage format of Iceberg tables 3. Supports reading of the iceberg table after schema change when using the orc storage format . In addition to the rename operation performed inside the struct. [iceberg schema change](https://iceberg.apache.org/docs/latest/evolution/#schema-evolution) 4. support iceberg position delete. Related changes: 1. Add in orc_reader: perform filtering operation when the column to be filtered does not exist in the current orc file. eg. iceberg_table : a int, b string add column ( c int ) to iceberg_table. sql : `select * from iceberg_table where c is not null.`
Proposed changes
Issue Number: close #xxx
Related changes:
eg.
iceberg_table : a int, b string
add column ( c int ) to iceberg_table.
sql :
select * from iceberg_table where c is not null.
Further comments
todo :
If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...