Change Default Write Distribution Mode #6679
@aokolnychyi + @danielcweeks + @rdblue + @jackye1995 + @szehon-ho Please also ping anyone else who would have strong opinions about this change.
Thank you @RussellSpitzer, I understand where this change is coming from, but some GDPR-like deletions on V1 tables will benefit from the none write distribution mode (to avoid a shuffle if possible). I am aware we can currently configure it via table properties like …
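For reference, the table-level knob this comment refers to is the Iceberg `write.distribution-mode` table property. A minimal sketch of setting it via Spark SQL (the table name `db.sample` is hypothetical):

```sql
-- Keep the "none" distribution mode on a specific table to avoid a
-- shuffle before the write (valid values: none, hash, range)
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.distribution-mode' = 'none'
);
```

A per-table setting like this would let GDPR-style deletion jobs opt out of the shuffle even if the global default changes.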
The "none" mode in GDPR cases only helps when the data is already aligned with the partitioning of the table. That is rarely the case in my experience.
+1 for using range as the default. Overall, we probably need a dedicated section in the Iceberg Spark documentation about how to configure these parameters, so people can make informed decisions.
+1 on changing the default from none and adding a dedicated doc section for configuring these. Happy to contribute to this if possible. Side note: I also see …
Yeah @singhpk234, I noticed that before and made an attempt in #5280 to fix it, but I need some help on …
+1 for range as default. |
I would be careful with … The upcoming Spark 3.4 has support for rebalancing partitions via AQE for hash distributions requested by v2 writes. That means we can request a hash distribution without worrying about having too much data per task and OOMs. I'd rather switch to …
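As a hedged sketch, the AQE settings involved in this comment would look something like the following in Spark SQL (conf names as of Spark 3.x; verify values and defaults against your Spark version):

```sql
-- Enable adaptive query execution so Spark can split or coalesce
-- output partitions at runtime instead of writing skewed tasks
SET spark.sql.adaptive.enabled = true;

-- Advisory target size for partitions that AQE rebalances;
-- tune this to control output file sizes per write task
SET spark.sql.adaptive.advisoryPartitionSizeInBytes = 128m;
```

With these enabled, a hash distribution requested by a v2 write can be rebalanced by AQE, which is the basis for the "no OOM worry" argument above.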
We have examples in |
@dramaticlly Did you want to write up another issue for specifying write distribution mode as a Spark SqlConf option? |
+1 on @dramaticlly's comment. Changing the write distribution mode affects Spark job performance (it causes a heavy shuffle) when using Spark SQL like … or …, and setting …
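For illustration, the kind of row-level Spark SQL statement these comments describe might be a MERGE such as the following (table and column names are hypothetical); with a range or hash distribution mode, Spark shuffles the written rows to match the table's partitioning before the write:

```sql
-- Upsert into an Iceberg table; the requested write distribution
-- mode determines whether a shuffle precedes the file writes
MERGE INTO db.target t
USING db.updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```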
I will submit a PR to change the default distribution modes for insert and merge. I'll also be happy to review a PR for #6741.
Feature Request / Improvement
Merge writes, as well as some inserts, end up generating many files with our default write distribution mode of None. While this is the cheapest method and is our old default behavior, we now have several reasons to default to Range (or Hash).
I suggest we change the default distribution mode to Range and add documentation around configuring AQE to the Spark docs. I think this will be better behavior for most first-time users, and power users can still manually configure a different mode for their specific requirements.
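A sketch of how a power user could keep manual control per operation, assuming Iceberg's per-operation distribution-mode table properties (`write.merge.distribution-mode` and friends; the table name `db.sample` is hypothetical):

```sql
-- Override the proposed default on one table, per write operation
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.distribution-mode'        = 'range',  -- plain inserts
  'write.merge.distribution-mode'  = 'range',  -- MERGE INTO
  'write.update.distribution-mode' = 'range',  -- UPDATE
  'write.delete.distribution-mode' = 'none'    -- DELETE, e.g. GDPR jobs
);
```

The per-operation properties are what make a global default change tolerable: a table can still opt any single operation back to a cheaper mode.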
Query engine
Spark