Spark - Accept an output-spec-id
that allows writing to a desired partition spec
#7120
Conversation
Should this be in Spark 2.4 as well?
@edgarRd I looked into it, but I'm not sure whether it is worthwhile to add it to Spark 2.4:
For our use-case, adding it only on Spark 3 should be enough.
I think @RussellSpitzer was interested in this as well (for rewrite to different spec id).
This seems reasonable to me, left a few minor comments.
Would also ask @aokolnychyi to see if I am missing some potential problems by doing so.
Force-pushed 62592b2 to 684f1cf
Looks good to me, just a style nit
Thank you for the review @szehon-ho!
Thanks for the code changes, those look good now. Missed some questions/comments on the test.
Thanks @gustavoatt for all the changes. Let me wait a little bit to see if there are any other concerns, and I will merge if not.
@@ -127,6 +129,16 @@ class SparkWrite {
    this.dsSchema = dsSchema;
    this.extraSnapshotMetadata = writeConf.extraSnapshotMetadata();
    this.partitionedFanoutEnabled = writeConf.fanoutWriterEnabled();

    if (writeConf.outputSpecId() == null) {
Why not have this inside SparkWriteConf and make outputSpecId() return int?
Done, I moved this logic to SparkWriteConf. The main reason why I initially did not do it there was that I did not want to store the specs and current spec there, but I think that should be ok.
We could keep a reference to Table, just like we do in SparkReadConf.
Follow up PR #7348
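The thread above settles on resolving the spec ID inside SparkWriteConf: when no option is supplied, fall back to the table's current spec, and otherwise require the requested ID to be one of the table's known specs. A minimal standalone sketch of that resolution logic follows; the method name `resolveOutputSpecId` and the `Set`-based stand-in for the table's spec map are illustrative assumptions, not Iceberg API.

```java
import java.util.Set;

// Sketch of the output-spec-id resolution discussed above (not Iceberg code):
// null means the caller did not set the write option, so we default to the
// table's current spec; an explicit ID must belong to the table's specs.
class OutputSpecIdSketch {

  static int resolveOutputSpecId(Integer requested, Set<Integer> tableSpecIds, int currentSpecId) {
    if (requested == null) {
      return currentSpecId; // default: write with the table's current partition spec
    }
    if (!tableSpecIds.contains(requested)) {
      throw new IllegalArgumentException(
          "Output spec id " + requested + " is not a valid spec id for this table");
    }
    return requested;
  }

  public static void main(String[] args) {
    // e.g. spec 0 = older daily spec, spec 1 = current hourly spec
    Set<Integer> specs = Set.of(0, 1);
    System.out.println(resolveOutputSpecId(null, specs, 1)); // prints 1 (falls back to current)
    System.out.println(resolveOutputSpecId(0, specs, 1));    // prints 0 (explicit older spec)
  }
}
```

Returning a primitive `int` this way is what lets the writer code drop its own null check, which is the simplification the reviewer asked for.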
@@ -115,6 +115,10 @@ public String wapId() {
    return sessionConf.get("spark.wap.id", null);
  }

  public Integer outputSpecId() {
Initially we were not sure whether to make spec IDs public but I also don't see a good alternative.
I am OK with the overall idea of exposing this.
The change looks good to me, I had a few minor comments.
Thanks for the review @aokolnychyi and @szehon-ho. There is a check failure on this PR but it looks unrelated to my changes.
Yea, I think it's appeared a few times, tracked by #7321; let's just retrigger (close and re-open, for example).
@@ -115,6 +121,20 @@ public String wapId() {
    return sessionConf.get("spark.wap.id", null);
  }

  public int outputSpecId() {
    final int outputSpecId =
Nit: Would you mind removing 'final' here? I know from @aokolnychyi (when he reviewed my change) that he prefers not to have extra finals except on class fields, as modern compilers usually add it anyway. ex: #4293 (comment)
Done. I usually keep the final as a way of having something like const, to avoid accidentally modifying something I did not intend to. But I think it is not necessary in this case and I would prefer to remove it to keep consistency in the repo.
Merged, thanks @gustavoatt, and also @aokolnychyi, @edgarRd for additional review.
Thanks for the review and merging @szehon-ho @aokolnychyi! Appreciate the effort spent reviewing!
Summary
Allow passing output-spec-id to the Spark writer, so that we can customize which partition spec to write to. This is useful when a table has more than one active spec (e.g. a daily and an hourly partition spec).
Fixes #6932
Testing