
Spark: Add SparkSQLProperty to control split-size #10336

Open · wants to merge 1 commit into base: main
Conversation

sumedhsakdeo

We have a scheduled job that deletes rows in an Iceberg table. The job is authored in SQL. Since we use the copy-on-write (CoW) technique for data deletion, the job rewrites the files without the deleted rows. We want to tune this job so that it creates files that are ~512MB on HDFS. We are unable to use read options because the job uses SparkSQL, and setting the read.split.target-size table property is not desired as it impacts all readers of the table.

This PR adds the ability to control the split size for a given Spark SQL job by introducing a property, spark.sql.iceberg.split-size, which can be set as a Spark session conf.
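A minimal sketch of the intended usage, assuming the session conf name from this PR and a hypothetical table `db.events` with a hypothetical predicate:

```sql
-- Set the session-level split size to 512 MB (536870912 bytes) before the delete;
-- spark.sql.iceberg.split-size is the session conf introduced by this PR.
SET spark.sql.iceberg.split-size = 536870912;

-- The scheduled copy-on-write delete job (table and predicate are illustrative);
-- the rewritten files are targeted at ~512 MB without touching any table property.
DELETE FROM db.events WHERE event_date < '2024-01-01';
```

Because the conf is scoped to the session, other readers of the table keep the default split behavior from read.split.target-size.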

@github-actions github-actions bot added the spark label May 15, 2024
Contributor

@shardulm94 shardulm94 left a comment


Thanks @sumedhsakdeo for the PR! It looks good to me. It seems useful to allow setting split size for reads in Spark SQL scripts. Support for passing options at a scan-level through SQL would have been ideal, but attempts to do that in Spark have been unsuccessful apache/spark#34072 apache/spark#41683. A session level conf would be useful to have in the meantime.

I haven't been actively reviewing Iceberg code recently though, so it would be good if @aokolnychyi or @amogh-jahagirdar could provide their feedback as well (I see you both have contributed to Iceberg Spark recently :) ).

@sumedhsakdeo
Author

Thanks Shardul for taking a look. Would appreciate your review, Anton and Amogh.
Also adding @wmoustafa!

@szehon-ho
Collaborator

szehon-ho commented May 22, 2024

Yea, it's really something that would be great to support in Spark. I hacked together another attempt, apache/spark#46707, based on the last comment in apache/spark#41683.
