Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deltalake/iceberg/hudi improvements #47307

Merged
merged 55 commits into from
Apr 14, 2023

Conversation

kssenii
Copy link
Member

@kssenii kssenii commented Mar 7, 2023

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Several improvements around data lakes:

  • Make StorageIceberg work with non-partitioned data.
  • Support Iceberg format version V2 (previously only V1 was supported)
  • Support reading partitioned data for DeltaLake/Hudi
  • Faster reading of DeltaLake metadata by using Delta's checkpoint files
  • Fixed incorrect Hudi reads: previously it incorrectly chose which data to read and therefore was able to read correctly only small size tables
  • Made these engines to pickup updates of changed data (previously the state was set on table creation)
  • Make proper testing for Iceberg/DeltaLake/Hudi using spark

@robot-ch-test-poll robot-ch-test-poll added the pr-improvement Pull request with some product improvements label Mar 7, 2023
@kssenii kssenii force-pushed the better-tests-for-data-lakes branch 2 times, most recently from ead85a7 to e832457 Compare March 7, 2023 23:19
@kssenii kssenii changed the title Fix metadata reading and rewrite tests using spark for deltalake/iceberg/hudi Some fixes and rewrite tests using spark for deltalake/iceberg/hudi Mar 7, 2023
@kssenii kssenii force-pushed the better-tests-for-data-lakes branch 2 times, most recently from d03d6de to 89edd80 Compare March 8, 2023 14:10
@kssenii kssenii force-pushed the better-tests-for-data-lakes branch from 89edd80 to 0240ad4 Compare March 8, 2023 14:45
@AmosG
Copy link

AmosG commented Mar 16, 2023

can a test be added for partitioned delta tables?
i think there is an issue that clickhouse ignores the partitions

@amosdoublec
Copy link

what about support for deltalake timetravel
ie
select * from table@v2
union all
select * from table@v1

@kssenii
Copy link
Member Author

kssenii commented Mar 21, 2023

what about support for deltalake timetravel
ie
select * from table@v2
union all
select * from table@v1

yes, but not in this PR, could you please create an issue feature request for this?

@kssenii kssenii changed the title Some fixes and rewrite tests using spark for deltalake/iceberg/hudi deltalake/iceberg/hudi improvements Mar 24, 2023
@ucasfl
Copy link
Collaborator

ucasfl commented Mar 25, 2023

Also, it worth to support reading the table with cluster.

@antonio2368 antonio2368 self-assigned this Mar 27, 2023
@kssenii kssenii force-pushed the better-tests-for-data-lakes branch from a5de483 to 13f29a7 Compare March 28, 2023 16:57
Copy link
Member

@antonio2368 antonio2368 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review of the code only, didn't check tests.
Looks really clean, good job!

src/Storages/DataLakes/DeltaLakeMetadataParser.cpp Outdated Show resolved Hide resolved
src/Storages/DataLakes/DeltaLakeMetadataParser.cpp Outdated Show resolved Hide resolved
src/Storages/DataLakes/DeltaLakeMetadataParser.cpp Outdated Show resolved Hide resolved
src/Storages/DataLakes/DeltaLakeMetadataParser.cpp Outdated Show resolved Hide resolved
@kssenii
Copy link
Member Author

kssenii commented Apr 5, 2023

Hi @kssenii
Thanks for this very valuable PR, it's amazing !
Can you please tell us what do you mean by "Make StorageIceberg work, previously it didn't." ?
Thanks

@alifirat, when I and some other people tested it - it did not work. I checked what changes I made to make it work and looks like it was able to work only if data was partitioned (because it incorrectly worked with paths).

@kssenii kssenii force-pushed the better-tests-for-data-lakes branch from 4513857 to 0c8d65b Compare April 12, 2023 21:12
src/Core/NamesAndTypes.h Outdated Show resolved Hide resolved
src/Storages/DataLakes/HudiMetadataParser.cpp Outdated Show resolved Hide resolved
src/Storages/DataLakes/HudiMetadataParser.cpp Outdated Show resolved Hide resolved
src/Storages/DataLakes/HudiMetadataParser.cpp Outdated Show resolved Hide resolved
@kssenii kssenii force-pushed the better-tests-for-data-lakes branch from 2e2c634 to 6f53784 Compare April 13, 2023 13:57
Copy link
Member

@antonio2368 antonio2368 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

docker/test/integration/runner/Dockerfile Outdated Show resolved Hide resolved
@kssenii
Copy link
Member Author

kssenii commented Apr 14, 2023

Integration tests (tsan) [5/6] — fail: 2

test_tcp_handler_interserver_listen_host/test_case.py::test_request_to_node_with_interserver_listen_host
test_tcp_handler_interserver_listen_host/test_case.py::test_request_to_node_without_interserver_listen_host

Integration tests flaky check (asan)

test_storage_s3/test.py::test_select_columns

 DB::Exception: Message: The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.

not related to changes

Stress test (debug) — Killed by signal

#48777

@kssenii kssenii merged commit 9eee7c7 into ClickHouse:master Apr 14, 2023
134 of 136 checks passed
@alifirat
Copy link

Hi @kssenii I just see that you have merged this PR. Do you think it's going to be available in your monthly release or backport to the existing releases ? Thanks !

@kssenii
Copy link
Member Author

kssenii commented Apr 14, 2023

@alifirat, it will be available only in the next release, to the existing releases it will not be backported.

@ucasfl
Copy link
Collaborator

ucasfl commented Apr 14, 2023

For Iceberg V2, it may have delete files, current implementation seems can not handle it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-improvement Pull request with some product improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants