[SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID" #30141
Conversation
…JobUUID" to description.uuid. Change-Id: I5f45f58be17c809f7eb15dc1ab5840eeabb9f955
@dongjoon-hyun FYI
Moving the generation down into the HadoopMapReduceCommitProtocol so that wherever a job is set up (SQL, RDD) it gets a consistent UUID. I'm going to modify the S3A staging committer to have an option which requires the UUID to be set. This can be used as a way to verify that the property is propagating correctly. Consistent settings across jobs and tasks will be inferred simply from whether jobs complete with the expected set of files.
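For context, here is a minimal sketch of how a committer could consume the property, falling back to generating its own UUID. The object name and the `requireUuid` flag are hypothetical, for illustration only; they are not the actual S3A committer API.

```scala
import java.util.UUID
import org.apache.hadoop.conf.Configuration

// Sketch only: how a committer might pick up the job UUID that Spark sets,
// falling back to a self-generated one. The requireUuid flag is hypothetical.
object JobUuidSketch {
  val WriteJobUuid = "spark.sql.sources.writeJobUUID"

  def resolveJobUuid(conf: Configuration, requireUuid: Boolean): String = {
    val uuid = conf.get(WriteJobUuid, "")
    if (uuid.nonEmpty) {
      uuid
    } else if (requireUuid) {
      // Failing fast here is one way to verify the property is propagating.
      throw new IllegalStateException(s"$WriteJobUuid is not set in the job configuration")
    } else {
      UUID.randomUUID().toString
    }
  }
}
```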
Kubernetes integration test starting
Kubernetes integration test status success
Thank you, @steveloughran !
Do you think you can add a test case, @steveloughran ?
@@ -164,6 +164,10 @@ object FileFormatWriter extends Logging {
     SQLExecution.checkSQLExecutionId(sparkSession)
+
+    // propagate the description UUID into the jobs, so that committers
+    // get an ID guaranteed to be unique.
+    job.getConfiguration.set("spark.sql.sources.writeJobUUID", description.uuid)
Do you mean the Apache Hadoop S3A committer uses this? If not, `spark.sql.sources.writeJobUUID`
is not used inside the Spark 3.x/2.x code base.
As you wrote, in Spark 1.6 it was explicitly part of the file name:
val uniqueWriteJobId = conf.get("spark.sql.sources.writeJobUUID")
...
val filename = f"part-r-$partition%05d-$uniqueWriteJobId.orc"
It picked it up if set, so yes, it was being used. We've hit a problem where, if more than one job kicks off in the same second for that user, the generated app ID is the same for both, so the staging committers end up using the same directory in HDFS. The committers already use the writeJobUUID property if set: restoring the original config option will mean that the shipping artifacts will work.
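To illustrate the collision: a sketch assuming a second-granularity, timestamp-derived job tracker ID (the exact format string Spark uses may differ; this is not the PR's code).

```scala
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.hadoop.mapreduce.JobID

// Two jobs created within the same second get identical JobIDs, so a
// committer keying its staging directory off the JobID will clash.
object JobIdCollisionSketch extends App {
  def makeJobId(time: Date, jobNumber: Int): JobID = {
    val trackerId = new SimpleDateFormat("yyyyMMddHHmmss").format(time)
    new JobID(trackerId, jobNumber)
  }

  val now = new Date()
  val jobA = makeJobId(now, 0)
  val jobB = makeJobId(now, 0)
  println(s"$jobA == $jobB: ${jobA == jobB}") // prints true: same second, same ID
}
```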
I think what we ended up doing was to generate a UUID in … I'm okay adding this. It matches what we do in v2 writes. The new API also passes a UUID in, so I think there is a reasonable precedent for it, even if the committer or writer could generate one itself.
Test build #130204 has finished for PR 30141 at commit
@rdblue I am going to add two things to the committers
Not easily. It would need a new Hadoop committer (a subclass of FileOutputCommitter would be easiest) which then failed if the option wasn't set on Spark queries, and somewhere to put that. If you've got suggestions as to where I could put it and point to a test I could work off, I'll do my best. I like ScalaTest.
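A hypothetical sketch of such a test-only committer (class name and wording are illustrative, not part of this PR):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Hypothetical test committer: refuses to set up a job unless the
// writeJobUUID property has been propagated into the job configuration.
class UuidRequiringCommitter(outputPath: Path, context: TaskAttemptContext)
  extends FileOutputCommitter(outputPath, context) {

  override def setupJob(jobContext: JobContext): Unit = {
    val uuid = jobContext.getConfiguration.get("spark.sql.sources.writeJobUUID")
    require(uuid != null && uuid.nonEmpty,
      "spark.sql.sources.writeJobUUID was not set; UUID propagation is broken")
    super.setupJob(jobContext)
  }
}
```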
cc @sunchao
+1, LGTM. Thank you, @steveloughran and @rdblue.
I agree that this is helpful for the Hadoop `StagingCommitter`.
Merged to master/3.0/2.4.
…ql.sources.writeJobUUID" ### What changes were proposed in this pull request? This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique jobId in the jobconf so that hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter matches that of what JobID requires, but as they are generated per-thread, may not always be unique within a cluster. ### Why are the changes needed? If a committer (e.g s3a staging committer) uses job-attempt-ID as a unique ID then any two jobs started within the same second have the same ID, so can clash. ### Does this PR introduce _any_ user-facing change? Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away" ### How was this patch tested? Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost. Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run...I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID Closes #30141 from steveloughran/SPARK-33230-jobId. Authored-by: Steve Loughran <stevel@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 02fa19f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ql.sources.writeJobUUID" ### What changes were proposed in this pull request? This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique jobId in the jobconf so that hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter matches that of what JobID requires, but as they are generated per-thread, may not always be unique within a cluster. ### Why are the changes needed? If a committer (e.g s3a staging committer) uses job-attempt-ID as a unique ID then any two jobs started within the same second have the same ID, so can clash. ### Does this PR introduce _any_ user-facing change? Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away" ### How was this patch tested? Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost. Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run...I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID Closes #30141 from steveloughran/SPARK-33230-jobId. Authored-by: Steve Loughran <stevel@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 02fa19f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
cc @cloud-fan
…ql.sources.writeJobUUID" ### What changes were proposed in this pull request? This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique jobId in the jobconf so that hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter matches that of what JobID requires, but as they are generated per-thread, may not always be unique within a cluster. ### Why are the changes needed? If a committer (e.g s3a staging committer) uses job-attempt-ID as a unique ID then any two jobs started within the same second have the same ID, so can clash. ### Does this PR introduce _any_ user-facing change? Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away" ### How was this patch tested? Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost. Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run...I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID Closes apache#30141 from steveloughran/SPARK-33230-jobId. Authored-by: Steve Loughran <stevel@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Thanks!
What changes were proposed in this pull request?
This reinstates the old option `spark.sql.sources.writeJobUUID` to set a unique jobId in the jobconf so that Hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter matches what JobID requires, but as they are generated per-thread, they may not always be unique within a cluster.
Why are the changes needed?
If a committer (e.g. the S3A staging committer) uses the job-attempt-ID as a unique ID, then any two jobs started within the same second have the same ID and so can clash.
Does this PR introduce any user-facing change?
Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away".
How was this patch tested?
Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost.
Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run. I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID.