
Iceberg table maintenance/compaction within AWS #5997

Closed
vshel opened this issue Oct 17, 2022 · 6 comments

vshel commented Oct 17, 2022

Query engine: Spark3

Question

Hello, I have a ~6TB Iceberg table with ~10,000 partitions in S3, and I am using the Glue catalog. What is the correct way to run compaction on such a table?

From the documentation (https://iceberg.apache.org/docs/latest/maintenance/) I can run:

SparkActions
    .get()
    .rewriteDataFiles(table)
    .filter(Expressions.equal("date", "2020-08-18"))
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();

This is going to execute on a single AWS instance. How do I scale it to many instances so that compaction runs in parallel on many partitions at once? Is there out-of-the-box support for this within AWS, or do I need to create my own Spark cluster? I am not familiar with Spark at this point.

My biggest concern: the table is constantly updated. Am I supposed to pause all updates until compaction finishes?

Thank you.

@ismailsimsek (Contributor)

@vshel any reason you are not using Athena to do compaction?

vshel (Author) commented Oct 25, 2022

@ismailsimsek I tried running OPTIMIZE with Athena on a partition with ~25,000 files totalling 2.6 GB (so a pretty small dataset); it failed with an internal error after 8 minutes. I created a support ticket for AWS to investigate, but it's not looking promising, considering the whole table is 6 TB.

Additionally, after experimenting, Athena read performance is terrible unless I run compaction. I tested a small 25 MB dataset: it takes Athena 50 seconds to retrieve 100,000 records from this 25 MB Iceberg table or to run a COUNT(*), and after compaction it takes 8 seconds for Athena to do the same retrieval and count operations.
Every file in the dataset has a corresponding delete file, because I am doing upserts of streaming data. So it looks like upserting (delete + write) slows down Athena read performance, and compaction fixes it because it removes the delete files. I also tested without deletes by doing only writes while streaming the same 25 MB dataset, and read performance was 8 seconds even without running compaction.

So Athena read performance on Iceberg is looking very slow: non-Iceberg Athena tables spanning 60 GB of data can run a COUNT(*) in just 4 seconds, compared to Iceberg's 8 seconds for 25 MB.
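
For reference, the rewriteDataFiles action also exposes a delete-file-threshold option aimed at exactly this case: it selects data files for rewriting once they have at least that many associated delete files. A minimal sketch, assuming an Iceberg release that includes this option and reusing the table handle from the snippet above:

SparkActions
    .get()
    .rewriteDataFiles(table)
    // rewrite any data file with at least one delete file attached,
    // so the deletes produced by the streaming upserts get compacted away
    .option("delete-file-threshold", "1")
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();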

@Samrose-Ahmed

I would recommend running a Spark job. An AWS Glue job is the easiest way to get started, but since you're running this once, it will likely be cheaper to run it on EMR (serverless or provisioned). Also, Spark on Glue/EMR doesn't run on a single instance; it parallelizes the work across nodes.

In the future, since you're doing streaming appends/inserts, I would recommend regular table maintenance so you don't end up in this situation again. You can check this blog post, Automated Iceberg table maintenance on AWS, for how we do it in Matano, but it's fairly simple: you just need to run compaction regularly. A minimal job sketch follows below.
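
A minimal sketch of such a Spark job, assuming the Iceberg Spark runtime and AWS SDK jars are on the classpath; the catalog name (glue_catalog), database/table names, and warehouse path are placeholders:

import org.apache.spark.sql.SparkSession;

public class CompactIcebergTable {
    public static void main(String[] args) {
        // Configure a Spark catalog backed by the AWS Glue Data Catalog with S3 file IO.
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-compaction")
            .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
            .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
            .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse")
            .getOrCreate();

        // Bin-pack small files; the where clause limits the rewrite to a single partition.
        spark.sql(
            "CALL glue_catalog.system.rewrite_data_files(" +
            "table => 'db.my_table', " +
            "where => \"date = '2020-08-18'\")");

        spark.stop();
    }
}

The rewrite is split into file groups that Spark distributes across the executors, so adding workers to the Glue job or EMR cluster scales the compaction out.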

maswin commented Dec 9, 2022

You have to set the following option to a higher value (the default is 1) for the rewrite to process multiple file groups in parallel across the cluster:

max-concurrent-file-group-rewrites

https://iceberg.apache.org/javadoc/1.1.0/org/apache/iceberg/actions/RewriteDataFiles.html#MAX_CONCURRENT_FILE_GROUP_REWRITES

Not sure why this is not documented.
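
A sketch of how that option slots into the rewriteDataFiles call from the original post; the concurrency value and the partial-progress settings are illustrative, not tuned recommendations:

SparkActions
    .get()
    .rewriteDataFiles(table)
    // rewrite up to 20 file groups concurrently instead of the default of 1
    .option("max-concurrent-file-group-rewrites", "20")
    // commit file groups as they finish so a long rewrite makes incremental progress
    .option("partial-progress.enabled", "true")
    .option("partial-progress.max-commits", "10")
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();

Partial progress should also help with the concern about concurrent updates, since each smaller commit leaves less room for conflicts with writes landing during the rewrite.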

@github-actions

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions bot added the stale label Jun 23, 2023
github-actions bot commented Jul 8, 2023

This issue has been closed because it has not received any activity in the 14 days since being marked as 'stale'.

github-actions bot closed this as not planned Jul 8, 2023