
Iceberg table maintenance/compaction within AWS #5997

Closed
vshel opened this issue Oct 17, 2022 · 6 comments

vshel commented Oct 17, 2022

Query engine: Spark3

Question

Hello, I have a ~6TB Iceberg table with ~10,000 partitions in S3, and I am using the Glue catalog. What is the correct way to run compaction on such a table?

From the documentation (https://iceberg.apache.org/docs/latest/maintenance/) I can run:

SparkActions
    .get()
    .rewriteDataFiles(table)
    .filter(Expressions.equal("date", "2020-08-18"))
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();

This is going to execute on a single AWS instance. How do I scale it to many instances so that compaction runs in parallel on many partitions at once? Is there out-of-the-box support for this within AWS, or do I need to create my own Spark cluster? I am not familiar with Spark at this point.

My biggest concern: the table is constantly updated. Am I supposed to pause all updates until compaction finishes?

Thank you.

@ismailsimsek (Contributor)

@vshel any reason you are not using Athena to do compaction?

vshel (Author) commented Oct 25, 2022

@ismailsimsek I tried running OPTIMIZE with Athena on a partition with ~25,000 files totalling 2.6 GB (so a pretty small dataset); it failed with an internal error after 8 minutes. I created a support ticket for AWS to investigate, but it's not looking promising, considering the whole table is 6 TB.

Additionally, after experimenting, Athena read performance is terrible unless I run compaction. I tested a small 25 MB dataset: it takes Athena 50 seconds to retrieve 100,000 records from this 25 MB Iceberg table or to run a COUNT(*), and after compaction it takes 8 seconds for Athena to do the same retrieval and count operations.
Every file in the dataset has a corresponding delete file, because I am doing upserts of streaming data. So it looks like upserting (delete + write) slows down Athena read performance, and compaction fixes it because it removes the delete files. I also tested without deletes by doing only writes while streaming the same 25 MB dataset, and read performance was 8 seconds even without running compaction.

So Athena read performance on Iceberg is looking very slow: non-Iceberg Athena tables spanning 60 GB of data can run a COUNT(*) in just 4 seconds, compared to Iceberg's 8 seconds for 25 MB.
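
For reference, the rewriteDataFiles action also exposes a delete-file-threshold option aimed at exactly this case: it selects data files for rewriting once they have at least that many associated delete files. A minimal sketch, assuming an Iceberg release that includes this option and reusing the table handle from the snippet above:

SparkActions
    .get()
    .rewriteDataFiles(table)
    // rewrite any data file with at least one delete file attached,
    // so the deletes produced by the streaming upserts get compacted away
    .option("delete-file-threshold", "1")
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();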

@Samrose-Ahmed

I would recommend running a Spark job. An AWS Glue job is the easiest way to get started, but since you're running this once, it will likely be cheaper to run it on EMR (serverless or provisioned). Also, Spark on Glue/EMR doesn't run on a single instance; it parallelizes the work across nodes.

In the future, since you're doing streaming appends/inserts, I would recommend regular table maintenance so you don't end up in this situation again. You can check this blog post, Automated Iceberg table maintenance on AWS, for how we do it in Matano, but it's fairly simple: you just need to run compaction regularly. A minimal job sketch follows below.
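
A minimal sketch of such a Spark job, assuming the Iceberg Spark runtime and AWS SDK jars are on the classpath; the catalog name (glue_catalog), database/table names, and warehouse path are placeholders:

import org.apache.spark.sql.SparkSession;

public class CompactIcebergTable {
    public static void main(String[] args) {
        // Configure a Spark catalog backed by the AWS Glue Data Catalog with S3 file IO.
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-compaction")
            .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
            .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
            .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse")
            .getOrCreate();

        // Bin-pack small files; the where clause limits the rewrite to a single partition.
        spark.sql(
            "CALL glue_catalog.system.rewrite_data_files(" +
            "table => 'db.my_table', " +
            "where => \"date = '2020-08-18'\")");

        spark.stop();
    }
}

The rewrite is split into file groups that Spark distributes across the executors, so adding workers to the Glue job or EMR cluster scales the compaction out.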

maswin commented Dec 9, 2022

You have to set the following option to a higher value (the default is 1) for the rewrite to process multiple file groups in parallel across the cluster:

max-concurrent-file-group-rewrites

https://iceberg.apache.org/javadoc/1.1.0/org/apache/iceberg/actions/RewriteDataFiles.html#MAX_CONCURRENT_FILE_GROUP_REWRITES

Not sure why this is not documented.
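
A sketch of how that option slots into the rewriteDataFiles call from the original post; the concurrency value and the partial-progress settings are illustrative, not tuned recommendations:

SparkActions
    .get()
    .rewriteDataFiles(table)
    // rewrite up to 20 file groups concurrently instead of the default of 1
    .option("max-concurrent-file-group-rewrites", "20")
    // commit file groups as they finish so a long rewrite makes incremental progress
    .option("partial-progress.enabled", "true")
    .option("partial-progress.max-commits", "10")
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();

Partial progress should also help with the concern about concurrent updates, since each smaller commit leaves less room for conflicts with writes landing during the rewrite.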

@github-actions

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions bot added the stale label Jun 23, 2023
github-actions bot commented Jul 8, 2023

This issue has been closed because it has not received any activity in the 14 days since being marked as 'stale'.

github-actions bot closed this as not planned Jul 8, 2023