
[Improvement]: Large-scale native Iceberg table self-optimization #1057

Closed · 3 tasks done
XBaith opened this issue Feb 6, 2023 · 3 comments

XBaith commented Feb 6, 2023

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

A native Iceberg table can accumulate too many small files or delete files. This is unacceptable to AMS, because scanning the table's files consumes a lot of memory, which can cause AMS to OOM and crash.
Therefore, we need to figure out how to use less memory when optimizing native Iceberg tables.

How should we improve?

I have some initial ideas for everyone to discuss:

  1. Scan Iceberg files in batches: for example, create a scan queue so that memory consumption stays controllable (see the first sketch after this list).
  2. External/separable optimize planner: separate the planner from the AMS service to avoid service unavailability caused by scanning files, and distribute heavy table file-scanning workloads across multiple planners.
  3. Rewrite small files/delete files by submitting a new Spark/Flink rewrite action for unoptimized tables (see the second sketch after this list).
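
For idea 1, here is a minimal sketch of what a scan queue could look like, assuming the planner consumes tasks from Iceberg's lazy `planFiles()` iterator. The `ScanQueue` class, the `QUEUE_CAPACITY` constant, and the producer-thread layout are hypothetical illustrations, not existing AMS APIs:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class ScanQueue {

  // Hypothetical capacity: once this many tasks are buffered, the scanning
  // thread blocks, so the memory held for pending tasks is capped no matter
  // how many files the table has.
  private static final int QUEUE_CAPACITY = 1_000;

  public static BlockingQueue<FileScanTask> startScan(Table table) {
    BlockingQueue<FileScanTask> queue = new ArrayBlockingQueue<>(QUEUE_CAPACITY);
    Thread producer = new Thread(() -> {
      // planFiles() returns a lazy CloseableIterable, so iterating it does not
      // materialize the whole file list in memory at once.
      try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
        for (FileScanTask task : tasks) {
          queue.put(task); // blocks when the queue is full -> back-pressure on the scan
        }
      } catch (Exception e) {
        throw new RuntimeException("Table scan failed", e);
      }
    }, "iceberg-scan-producer");
    producer.setDaemon(true);
    producer.start();
    return queue;
  }
}
```

A real implementation would also need an end-of-scan marker (e.g. a sentinel task) so consumers know when the scan has finished, plus proper error propagation instead of the bare RuntimeException.

For idea 3, Iceberg already ships a Spark rewrite action, so the submitted job could be roughly the following sketch; running it as a separate Spark job moves the rewrite's memory cost out of AMS (the `spark` session and `table` handle are assumed to be obtained elsewhere):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class RewriteJob {

  /** Submits Iceberg's built-in Spark compaction for one table. */
  public static void rewriteSmallFiles(SparkSession spark, Table table) {
    RewriteDataFiles.Result result =
        SparkActions.get(spark)
            .rewriteDataFiles(table)
            // compact small data files toward ~128 MB targets
            .option("target-file-size-bytes", String.valueOf(128L * 1024 * 1024))
            .execute();
    System.out.printf("Rewrote %d data files into %d files%n",
        result.rewrittenDataFilesCount(), result.addedDataFilesCount());
  }
}
```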

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Subtasks

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct


XBaith commented Feb 6, 2023

cc @wangtaohz , @hameizi


wangtaohz commented Feb 6, 2023

Thanks for your report! @XBaith

Making the memory consumption of optimizing controllable is of great value. There are two main issues:

  • the memory consumption of planning, which means scanning files
  • the memory consumption of executing, which means reading and writing files

Here, we are discussing the planning issue.

I think that if each plan only processes a limited number of files at a time, the OOM could be avoided.
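
As a rough illustration of this idea (a hypothetical sketch, not existing AMS code): each planning round could stop consuming the lazy scan once it reaches a file cap, leaving the remaining files for later rounds:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class BoundedPlanRound {

  /**
   * Collects at most maxFilesPerPlan scan tasks for one planning round.
   * Files beyond the cap are simply left for a later round, so each round's
   * memory footprint stays bounded no matter how large the table is.
   */
  public static List<FileScanTask> planRound(Table table, int maxFilesPerPlan) throws Exception {
    List<FileScanTask> round = new ArrayList<>(maxFilesPerPlan);
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        round.add(task);
        if (round.size() >= maxFilesPerPlan) {
          break; // stop early; remaining files are picked up in the next round
        }
      }
    }
    return round;
  }
}
```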

  1. Scan Iceberg files in batches: for example, create a scan queue so that memory consumption stays controllable.

This seems like a reasonable solution. Would the file scan still run inside AMS? Could you give more details about what the scan queue would look like?

zhoujinsong commented

Closed as not updated for a long time.
