
Spark: Add SparkSQLProperty to control split-size #10336

Open · wants to merge 1 commit into base: main
Conversation

sumedhsakdeo

We have a scheduled job that deletes rows in an Iceberg table. The job is authored in SQL. Since we use the copy-on-write (CoW) technique for data deletion, the job rewrites the files without the deleted rows. We want to tune this job so that it creates files that are ~512MB on HDFS. We are unable to use read options because the job uses SparkSQL, and setting the read.split.target-size table property is not desired as it impacts all readers of the table.

This PR adds the ability to control the split size for a given Spark SQL job by introducing a property, spark.sql.iceberg.split-size, which can be set as a Spark session conf.
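A minimal sketch of the intended usage, assuming the session conf name from this PR and a hypothetical table `db.events` with a hypothetical predicate:

```sql
-- Set the session-level split size to 512 MB (536870912 bytes) before the delete;
-- spark.sql.iceberg.split-size is the session conf introduced by this PR.
SET spark.sql.iceberg.split-size = 536870912;

-- The scheduled copy-on-write delete job (table and predicate are illustrative);
-- the rewritten files are targeted at ~512 MB without touching any table property.
DELETE FROM db.events WHERE event_date < '2024-01-01';
```

Because the conf is scoped to the session, other readers of the table keep the default split behavior from read.split.target-size.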

@github-actions github-actions bot added the spark label May 15, 2024
Contributor

@shardulm94 shardulm94 left a comment


Thanks @sumedhsakdeo for the PR! It looks good to me. It seems useful to allow setting split size for reads in Spark SQL scripts. Support for passing options at a scan-level through SQL would have been ideal, but attempts to do that in Spark have been unsuccessful apache/spark#34072 apache/spark#41683. A session level conf would be useful to have in the meantime.

I haven't been actively reviewing Iceberg code recently though, so it would be good if @aokolnychyi or @amogh-jahagirdar could provide their feedback as well (I see you both have contributed to Iceberg Spark recently :) ).

@sumedhsakdeo
Author

Thanks Shardul for taking a look. Would appreciate your review, Anton and Amogh.
Also adding @wmoustafa!

@szehon-ho
Collaborator

szehon-ho commented May 22, 2024

Yea, it's really something that would be great to support in Spark. I hacked together another attempt, apache/spark#46707, based on the last comment in apache/spark#41683.
