Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory #40064

Closed
wants to merge 3 commits into from

Conversation

yikf
Copy link
Contributor

@yikf yikf commented Feb 17, 2023

What changes were proposed in this pull request?

Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

Why are the changes needed?

SPARK-41448 make consistent MR job IDs in FileBatchWriter and FileFormatWriter, but it breaks a serializable issue, JobId is non-serializable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

GA

@github-actions github-actions bot added the SQL label Feb 17, 2023
@yikf
Copy link
Contributor Author

yikf commented Feb 17, 2023

@cloud-fan @boneanxs could you please take a look if you find a time?

@yikf yikf force-pushed the write-job-id branch 2 times, most recently from 931e76d to 61f4592 Compare February 20, 2023 08:48
@cloud-fan
Copy link
Contributor

@yikf can you provide a test case, or at least the error stacktrace you hit in your environment?

@yikf
Copy link
Contributor Author

yikf commented Feb 21, 2023

@cloud-fan This case is the error that Apache kyuubi encountered when upgrading from spark 3.3.1 to 3.3.2, can see this link to find the error stacktrace.

@yikf
Copy link
Contributor Author

yikf commented Feb 21, 2023

updated, verified w/ kyuubi on spark 3.3.2 and all tests passed

build/mvn clean test -pl :kyuubi-spark-connector-hive_2.12 -am -Pspark-3.3 -Dspark.version=3.3.2

Copy link

@boneanxs boneanxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yikf
Copy link
Contributor Author

yikf commented Feb 23, 2023

kindly ping @cloud-fan , @boneanxs Any suggestions?

@mridulm
Copy link
Contributor

mridulm commented Feb 23, 2023

I have not followed the changes in this part of the code too much in a while - but this specific PR will result in a different jobId each time the class is deserialized - I would expect that to cause issues with output formats ?

@yikf
Copy link
Contributor Author

yikf commented Feb 23, 2023

@mridulm Thanks your review, this is a nice question for me, JobId maybe is different when each time the class is deserialized.

How about this idea that SparkHadoopWriterUtils.createJobTrackerID to generate an ID for a job tracker, and the job tracker is unique, JobId is constructed using a unique tackerid when the class is deserialized.

@mridulm
Copy link
Contributor

mridulm commented Feb 23, 2023

@yikf Agree - we only specify two parts for the JobID - the String jtIdentifier and int id.
We can persist those in the class - and make jobId a transient lazy val which recreates it each time.

Something like this instead:


  private[this] val jobIdParts: (String, Int) = {
    val id = SparkHadoopWriterUtils.createJobID(new Date, 0)
    (id.getJtIdentifier, id.getId)
  }

  @transient private lazy val jobId = new JobID(jobIdParts._1, jobIdParts._2)

Thoughts ?

@yikf
Copy link
Contributor Author

yikf commented Feb 23, 2023

@mridulm Nice suggestion, and we can simplify to as follow since int id is unique.

  private[this] val jobTrackerID = SparkHadoopWriterUtils.createJobTrackerID(new Date)
  @transient private lazy val jobId = new JobID(jobTrackerID, 0)

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in d46b15d Feb 27, 2023
cloud-fan pushed a commit that referenced this pull request Feb 27, 2023
…lizable JobID in FileWriterFactory

### What changes were proposed in this pull request?

Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

### Why are the changes needed?

[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) make consistent MR job IDs in FileBatchWriter and FileFormatWriter, but it breaks a serializable issue, JobId is non-serializable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes #40064 from Yikf/write-job-id.

Authored-by: Yikf <yikaifei@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d46b15d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Copy link
Member

Hi, @cloud-fan . SPARK-41448 landed to master/3.3/3.2 and this is merge this to master/3.4 only. I'm wondering if we are planning backporting to branch-3.3 and 3.2.

@dongjoon-hyun
Copy link
Member

Also, cc @sunchao

@cloud-fan
Copy link
Contributor

@yikf can you help to open a backport PR for 3.2/3.3? Thanks!

@yikf
Copy link
Contributor Author

yikf commented Mar 6, 2023

@yikf can you help to open a backport PR for 3.2/3.3? Thanks!

Sure

dongjoon-hyun pushed a commit that referenced this pull request Mar 6, 2023
… non-serializable JobID in FileWriterFactory

This is a backport of #40064 for branch-3.3

### What changes were proposed in this pull request?

Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

### Why are the changes needed?

[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) make consistent MR job IDs in FileBatchWriter and FileFormatWriter, but it breaks a serializable issue, JobId is non-serializable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes #40290 from Yikf/backport-SPARK-42478-3.3.

Authored-by: Yikf <yikaifei@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Mar 6, 2023
… non-serializable JobID in FileWriterFactory

This is a backport of #40064 for branch-3.2

### What changes were proposed in this pull request?

Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

### Why are the changes needed?

[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) make consistent MR job IDs in FileBatchWriter and FileFormatWriter, but it breaks a serializable issue, JobId is non-serializable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes #40289 from Yikf/backport-SPARK-42478-3.2.

Authored-by: Yikf <yikaifei@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
… non-serializable JobID in FileWriterFactory

This is a backport of apache#40064 for branch-3.2

### What changes were proposed in this pull request?

Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

### Why are the changes needed?

[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) make consistent MR job IDs in FileBatchWriter and FileFormatWriter, but it breaks a serializable issue, JobId is non-serializable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes apache#40289 from Yikf/backport-SPARK-42478-3.2.

Authored-by: Yikf <yikaifei@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…lizable JobID in FileWriterFactory

### What changes were proposed in this pull request?

Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

### Why are the changes needed?

[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) make consistent MR job IDs in FileBatchWriter and FileFormatWriter, but it breaks a serializable issue, JobId is non-serializable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes apache#40064 from Yikf/write-job-id.

Authored-by: Yikf <yikaifei@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d46b15d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
LuciferYang pushed a commit that referenced this pull request May 31, 2024
…ent task attempts

### What changes were proposed in this pull request?
After #40064 , we always get the same TaskAttemptId for different task attempts which has the same partitionId. This would lead different task attempts write to the same directory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
LuciferYang pushed a commit that referenced this pull request May 31, 2024
…ent task attempts

### What changes were proposed in this pull request?
After #40064 , we always get the same TaskAttemptId for different task attempts which has the same partitionId. This would lead different task attempts write to the same directory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 67d11b1)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
LuciferYang pushed a commit that referenced this pull request May 31, 2024
…ent task attempts

### What changes were proposed in this pull request?
After #40064 , we always get the same TaskAttemptId for different task attempts which has the same partitionId. This would lead different task attempts write to the same directory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 67d11b1)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
riyaverm-db pushed a commit to riyaverm-db/spark that referenced this pull request Jun 7, 2024
…ent task attempts

### What changes were proposed in this pull request?
After apache#40064 , we always get the same TaskAttemptId for different task attempts which has the same partitionId. This would lead different task attempts write to the same directory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants