[SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory #40064
@cloud-fan @boneanxs could you please take a look when you have time?
@yikf can you provide a test case, or at least the error stack trace you hit in your environment?
@cloud-fan This is the error that Apache Kyuubi encountered when upgrading from Spark 3.3.1 to 3.3.2; the error stack trace can be found at this link.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWriterFactory.scala
Updated. Verified with Kyuubi on Spark 3.3.2, and all tests passed.
LGTM
Kindly ping @cloud-fan, @boneanxs. Any suggestions?
I have not followed the changes in this part of the code too much in a while - but this specific PR will result in a different
@mridulm Thanks for your review, that is a good question. How about this idea that
@yikf Agree - we only specify two parts for the Something like this instead:
Thoughts?
@mridulm Nice suggestion, and we can simplify it as follows:

    private[this] val jobTrackerID = SparkHadoopWriterUtils.createJobTrackerID(new Date)
    @transient private lazy val jobId = new JobID(jobTrackerID, 0)
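The pattern suggested above (serialize only a plain string, and rebuild the non-serializable ID lazily on the receiving side) can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: `FakeJobID` is a hypothetical stand-in for Hadoop's `JobID`, and `SketchWriterFactory`, `BrokenWriterFactory`, and `roundTrip` are invented names, not code from this PR.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for Hadoop's JobID: deliberately NOT Serializable.
final class FakeJobID(val jtIdentifier: String, val id: Int) {
  override def toString: String = "job_%s_%04d".format(jtIdentifier, id)
}

// Problem case: a Serializable class holding a non-serializable field.
class BrokenWriterFactory(val jobId: FakeJobID) extends Serializable

// Sketch of the remedy: keep only the String in the serialized form and
// rebuild the ID object lazily after deserialization.
class SketchWriterFactory(jobTrackerID: String) extends Serializable {
  @transient private lazy val jobId = new FakeJobID(jobTrackerID, 0)
  def jobIdString: String = jobId.toString
}

object SerializationSketch {
  // Serialize and deserialize a value with plain Java serialization.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

In this sketch, round-tripping `BrokenWriterFactory` throws `java.io.NotSerializableException`, while `SketchWriterFactory` survives the round trip and recomputes the same ID string from the serialized tracker string.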
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWriterFactory.scala
Thanks, merging to master/3.4!
…lizable JobID in FileWriterFactory

### What changes were proposed in this pull request?
Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

### Why are the changes needed?
[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) make consistent MR job IDs in FileBatchWriter and FileFormatWriter, but it breaks a serializable issue, JobId is non-serializable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #40064 from Yikf/write-job-id.

Authored-by: Yikf <yikaifei@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d46b15d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Hi, @cloud-fan. SPARK-41448 landed in master/3.3/3.2, but this PR was merged to master/3.4 only. I'm wondering if we are planning to backport it to branch-3.3 and branch-3.2.
Also, cc @sunchao
@yikf can you help to open a backport PR for 3.2/3.3? Thanks!
Sure
… non-serializable JobID in FileWriterFactory

This is a backport of #40064 for branch-3.3

### What changes were proposed in this pull request?
Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

### Why are the changes needed?
[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) make consistent MR job IDs in FileBatchWriter and FileFormatWriter, but it breaks a serializable issue, JobId is non-serializable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #40290 from Yikf/backport-SPARK-42478-3.3.

Authored-by: Yikf <yikaifei@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… non-serializable JobID in FileWriterFactory

This is a backport of #40064 for branch-3.2

### What changes were proposed in this pull request?
Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

### Why are the changes needed?
[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) make consistent MR job IDs in FileBatchWriter and FileFormatWriter, but it breaks a serializable issue, JobId is non-serializable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #40289 from Yikf/backport-SPARK-42478-3.2.

Authored-by: Yikf <yikaifei@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ent task attempts

### What changes were proposed in this pull request?
After #40064 , we always get the same TaskAttemptId for different task attempts which has the same partitionId. This would lead different task attempts write to the same directory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…ent task attempts

### What changes were proposed in this pull request?
After #40064 , we always get the same TaskAttemptId for different task attempts which has the same partitionId. This would lead different task attempts write to the same directory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 67d11b1)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
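The collision described in this follow-up commit can be modelled with a small sketch. It is not the actual fix: `SketchTaskAttemptID`, `AttemptPathSketch`, the `attemptNumber` parameter, and the simplified staging layout are all hypothetical names and assumptions, chosen only to show why a hard-coded attempt component makes retries of the same partition share one directory.

```scala
// Hypothetical, simplified model of a Hadoop-style task attempt ID.
final case class SketchTaskAttemptID(
    jobTrackerID: String,
    jobNumber: Int,
    partitionId: Int,
    attemptNumber: Int) {
  // Mirrors the general shape attempt_<jt>_<job>_m_<task>_<attempt>.
  override def toString: String =
    "attempt_%s_%04d_m_%06d_%d".format(jobTrackerID, jobNumber, partitionId, attemptNumber)
}

object AttemptPathSketch {
  // Assumed staging layout: one temporary directory per attempt ID.
  def stagingDir(outputPath: String, id: SketchTaskAttemptID): String =
    s"$outputPath/_temporary/0/_temporary/$id"

  // Problem case: the attempt component is hard-coded, so every retry
  // of the same partition produces an identical ID.
  def collidingId(jobTrackerID: String, partitionId: Int): SketchTaskAttemptID =
    SketchTaskAttemptID(jobTrackerID, 0, partitionId, 0)

  // One possible remedy: feed in a value that differs between attempts.
  def distinctId(jobTrackerID: String, partitionId: Int, attemptNumber: Int): SketchTaskAttemptID =
    SketchTaskAttemptID(jobTrackerID, 0, partitionId, attemptNumber)
}
```

With `collidingId`, two attempts for partition 3 resolve to the same staging directory; with `distinctId`, passing different attempt numbers yields different directories.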
What changes were proposed in this pull request?
Store a serializable jobTrackerId in FileWriterFactory instead of a non-serializable JobID.
Why are the changes needed?
SPARK-41448 made the MR job IDs consistent in FileBatchWriter and FileFormatWriter, but it introduced a serialization problem: JobID is not serializable.
Does this PR introduce any user-facing change?
No
How was this patch tested?
GA (GitHub Actions)