
[Improvement]: Large-scale native Iceberg table self-optimization #1057

Closed · 3 tasks done
XBaith opened this issue Feb 6, 2023 · 3 comments

XBaith commented Feb 6, 2023

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

A native Iceberg table can accumulate too many small files or delete files. This is unacceptable to AMS, because scanning the table's files consumes a lot of memory, which can cause AMS to OOM and crash.
Therefore, we need to figure out how to use less memory when optimizing native Iceberg tables.

How should we improve?

I have some initial ideas for everyone to discuss:

  1. Scan Iceberg files in batches: for example, create a scan queue so that memory consumption stays controllable (see the first sketch after this list).
  2. External/separable optimize planner: separate the planner from the AMS service to avoid service unavailability caused by scanning files, and distribute heavy table file-scanning workloads across multiple planners.
  3. Rewrite small files/delete files by submitting a new Spark/Flink rewrite action for unoptimized tables (see the second sketch after this list).
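
For idea 1, here is a minimal sketch of what a scan queue could look like, assuming the planner consumes tasks from Iceberg's lazy `planFiles()` iterator. The `ScanQueue` class, the `QUEUE_CAPACITY` constant, and the producer-thread layout are hypothetical illustrations, not existing AMS APIs:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class ScanQueue {

  // Hypothetical capacity: once this many tasks are buffered, the scanning
  // thread blocks, so the memory held for pending tasks is capped no matter
  // how many files the table has.
  private static final int QUEUE_CAPACITY = 1_000;

  public static BlockingQueue<FileScanTask> startScan(Table table) {
    BlockingQueue<FileScanTask> queue = new ArrayBlockingQueue<>(QUEUE_CAPACITY);
    Thread producer = new Thread(() -> {
      // planFiles() returns a lazy CloseableIterable, so iterating it does not
      // materialize the whole file list in memory at once.
      try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
        for (FileScanTask task : tasks) {
          queue.put(task); // blocks when the queue is full -> back-pressure on the scan
        }
      } catch (Exception e) {
        throw new RuntimeException("Table scan failed", e);
      }
    }, "iceberg-scan-producer");
    producer.setDaemon(true);
    producer.start();
    return queue;
  }
}
```

A real implementation would also need an end-of-scan marker (e.g. a sentinel task) so consumers know when the scan has finished, plus proper error propagation instead of the bare RuntimeException.

For idea 3, Iceberg already ships a Spark rewrite action, so the submitted job could be roughly the following sketch; running it as a separate Spark job moves the rewrite's memory cost out of AMS (the `spark` session and `table` handle are assumed to be obtained elsewhere):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class RewriteJob {

  /** Submits Iceberg's built-in Spark compaction for one table. */
  public static void rewriteSmallFiles(SparkSession spark, Table table) {
    RewriteDataFiles.Result result =
        SparkActions.get(spark)
            .rewriteDataFiles(table)
            // compact small data files toward ~128 MB targets
            .option("target-file-size-bytes", String.valueOf(128L * 1024 * 1024))
            .execute();
    System.out.printf("Rewrote %d data files into %d files%n",
        result.rewrittenDataFilesCount(), result.addedDataFilesCount());
  }
}
```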

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Subtasks

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct


XBaith commented Feb 6, 2023

cc @wangtaohz , @hameizi


wangtaohz commented Feb 6, 2023

Thanks for your report! @XBaith

Making the memory consumption of optimizing controllable is of great value. There are two main issues:

  • the memory consumption of planning, which means scanning files
  • the memory consumption of executing, which means reading and writing files

Here, we are discussing the planning issue.

I think that if each plan only processes a limited number of files at a time, the OOM could be avoided.
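
As a rough illustration of this idea (a hypothetical sketch, not existing AMS code): each planning round could stop consuming the lazy scan once it reaches a file cap, leaving the remaining files for later rounds:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class BoundedPlanRound {

  /**
   * Collects at most maxFilesPerPlan scan tasks for one planning round.
   * Files beyond the cap are simply left for a later round, so each round's
   * memory footprint stays bounded no matter how large the table is.
   */
  public static List<FileScanTask> planRound(Table table, int maxFilesPerPlan) throws Exception {
    List<FileScanTask> round = new ArrayList<>(maxFilesPerPlan);
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        round.add(task);
        if (round.size() >= maxFilesPerPlan) {
          break; // stop early; remaining files are picked up in the next round
        }
      }
    }
    return round;
  }
}
```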

  1. Scan Iceberg files in batches: for example, create a scan queue so that memory consumption stays controllable.

This seems like a reasonable solution. Would the file scan still run inside AMS? Could you give more details about what the scan queue would look like?

zhoujinsong commented

Closed as not updated for a long time.
