[SPARK-24238][SQL] HadoopFsRelation can't append the same table with multi job at the same time #21286
What changes were proposed in this pull request?
When there are multiple jobs appending to the same table at the same time, one job's commit can delete the other jobs' pending output, so data is lost.
The main reason for this problem is that multiple jobs will use the same temporary output directory (_temporary).
So the core idea of this patch is to give each job its own temporary output directory, and copy the files to the final destination when the job commits.
How was this patch tested?
I tested this manually.
I think the Hadoop design does not allow two jobs to share the same output folder.
Hadoop has a related patch that can partially solve this problem: you can configure a parameter so that the _temporary directory is not cleaned up. But I don't think this is a good solution.
For this problem, we'd better use different temporary output directories for different jobs, and then copy the files.
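A minimal sketch of the idea (the helper name and layout here are illustrative, not the actual patch):

```scala
import java.util.UUID
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative sketch only: stage each job's output under its own unique
// temp dir, then rename the finished files into the destination on commit.
def commitWithPerJobTempDir(fs: FileSystem, dest: Path)(write: Path => Unit): Unit = {
  // A unique staging dir per job, so concurrent jobs never touch each
  // other's pending files.
  val jobTempDir = new Path(dest, s"_temporary-${UUID.randomUUID()}")
  write(jobTempDir) // the job writes all of its output here
  // "Commit" by renaming each finished file into the destination.
  fs.listStatus(jobTempDir).foreach { f =>
    fs.rename(f.getPath, new Path(dest, f.getPath.getName))
  }
  fs.delete(jobTempDir, true)
}
```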
However, the current implementation breaks some unit tests. There are two ways to fix it.
cc @cloud-fan. Which do you think is better? Could you give me some advice?
I suppose "Stepped through the FileOutputCommit operations with a debugger and a pen and paper" counts, given the complexity there. There's still a lot of corner cases which I'm not 100% sure on (or confident that the expectations of the job coordinators are met). Otherwise its more folklore "we did this because of job A failed with error...", plus some experiments with fault injection. I'd point to @rdblue as having put in this work too.
- Using a temp dir and then renaming in is ~what the FileOutputCommitter v1 algorithm does.
- Adding an extra directory with another rename has some serious issues.
- Which is why I think it's the wrong solution.
Normally Spark rejects work to the destination if it's already there, so only one job will have a temp dir. This conflict will only be an issue if overwrite is allowed, which is going to have other adverse consequences if files with the same name are ever created. If the two jobs commit simultaneously, you'll get a mixture of results. This is partly why the S3A committers insert UUIDs into their filenames by default, the other being S3's lack of update consistency.
Ignoring that little issue, @cloud-fan is right: giving jobs a unique ID should be enough to ensure that FileOutputCommitter does all its work in isolation.
Any ID known to be unique across all work that could potentially write to the same dest dir. Hadoop MR has a strict ordering requirement so that it can attempt to recover from job attempt failures (it looks for _temporary/$job_id_$(job_attempt_id - 1)/ to find committed work from the previous attempt). Spark should be able to just create a UUID.
Finally, regarding the MAPREDUCE cleanup JIRAs, the most recent is MAPREDUCE-7029. That includes comments from the Google team on their store's behaviour.
Thanks @steveloughran for your in-depth explanation!
Spark does have a unique job id, but it's only unique within a SparkContext; we may have 2 different Spark applications writing to the same directory. I think timestamp+uuid should be good enough as a job id. Spark doesn't retry jobs, so we can always set the job attempt id to 0.
that would work. Like you say, no need to worry about job attempt IDs, just uniqueness. If you put the timestamp first, you could still sort the listing by time, which might be good for diagnostics.
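A minimal sketch of such an ID (the exact format is illustrative):

```scala
import java.time.format.DateTimeFormatter
import java.time.{Instant, ZoneOffset}
import java.util.UUID

// Timestamp first so directory listings sort by time; UUID for uniqueness
// across concurrent Spark applications.
def newJobId(): String = {
  val ts = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")
    .withZone(ZoneOffset.UTC)
    .format(Instant.now())
  s"$ts-${UUID.randomUUID()}" // e.g. 20180510093045-1b9f...
}
```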
Some org.apache.hadoop code snippets do attempt to parse the YARN app attempt strings into numeric job & task IDs in exactly the way they shouldn't. It would already have surfaced if it were a problem in the committer codepaths, but it's worth remembering & maybe replicating in the new IDs.
...this makes me think that the FileOutputCommitter actually has an assumption that nobody has called out before, specifically "only one application will be writing data to the target FS with the same job id". It's probably been implicit in MR with a local HDFS for a long time, first on the assumption of all jobs getting unique job Ids from the same central source and nothing outside the cluster writing to the same destinations. With cloud stores, that doesn't hold; it's conceivable that >1 YARN cluster could start jobs with the same dest. As the timestamp of YARN launch is used as the initial part of the identifier, if >1 cluster was launched in the same minute, things are lined up to collide. Oops.
FWIW, the parsing code I mentioned is
I think I may not have described this issue clearly.
First of all, the scene of the problem is this: multiple applications simultaneously append data to the same Parquet datasource table.
They run at the same time and share the same output directory.
But once one job is successfully committed, it will run cleanupJob, which recursively deletes the pendingJobAttemptsPath.
The pendingJobAttemptsPath is $outputPath/_temporary, and every job writing to this table shares it.
After the job is committed, the whole _temporary directory is gone, so the pending output of the other still-running jobs is deleted along with it.
Meanwhile, because all applications share the same app attempt id, they also write their temporary data to the same temporary dir.
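For concreteness, a sketch of the colliding layout (paths abbreviated, attempt names illustrative):

```
$table/_temporary/0/                                    <- shared by all apps
$table/_temporary/0/_temporary/attempt_..._m_000000_0/  <- pending files, app A
$table/_temporary/0/_temporary/attempt_..._m_000001_0/  <- pending files, app B
```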
I see. Yes, that's a real problem.
Like I said, this proposed patch breaks all the blobstore-specific committer work, causes problems at scale with HDFS alone, and adds a new problem: how do you clean up from failed jobs writing to the same destination?
It's causing these problems because it's using another layer of temp dir and then the rename.
No, that's not needed. Once spark has unique job IDs across apps, all that's needed is to set
that's the issue we've been discussing related to job IDs. If each spark driver comes up with a unique job ID, that conflict will go away. So turn off the full directory cleanup and you will get that multiple job feature. Up to you to make sure that the files generated are unique across jobs though, as job commit is not atomic.
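For example, a sketch assuming the cleanup-skip option from the MAPREDUCE JIRAs above is present in your Hadoop version (verify the property against your release; `spark` here is the SparkSession):

```scala
// Skip the recursive delete of $dest/_temporary in commitJob, so one job's
// commit no longer wipes out another job's pending output.
// "mapreduce.fileoutputcommitter.cleanup.skipped" is the switch added by
// MAPREDUCE-6478; check that it exists in your Hadoop version.
spark.sparkContext.hadoopConfiguration
  .setBoolean("mapreduce.fileoutputcommitter.cleanup.skipped", true)
```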
Does Spark have a jobID in the writing path? The path below is an example from my debugging log:
parquettest2 is a non-partitioned table. Seems that the
If I understand correctly, this PR proposes to add a jobId outside the _temporary directory, and the writing path format is like below:
@jinxing64 from my reading of the code, the original patch proposed creating a temp dir for every query, which could then do its own work & cleanup in parallel, with a new meta-commit on each job commit, moving stuff from this per-job temp dir into the final dest.
This is to address
And the reason for that '0' is that Spark's job id is just a counter of queries done from app start, whereas on Hadoop MR it's unique across a live YARN cluster. Spark deploys in different ways, and can't rely on that value.
The job id discussion proposes generating unique job IDs for every Spark app, so allowing multiple apps to write to the same destination without their temporary directories colliding.
Exactly. It also makes it a simpler change, which is good as the commit algorithms are pretty complex and its hard to test all the failure modes.
@jinxing64 yes, with the detail that, the way some bits of Hadoop parse a job attempt, they like it to be an integer. Some random number used as the upper digits of a counter could work; it'd still give meaningful job IDs like "45630001" for the first and "45630002" for the second from the process which came up with "4563" as its prefix. Yes, eventually it'll wrap, but that's integers for you.
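A minimal sketch of that scheme (the widths are illustrative):

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.util.Random

object JobIds {
  // Random upper digits chosen once per app, e.g. 4563.
  private val prefix = 1000 + Random.nextInt(9000)
  private val counter = new AtomicInteger(0)

  // With prefix 4563: 45630001, 45630002, ... Still a plain integer, which
  // keeps the Hadoop ID-parsing code paths mentioned above happy; it spills
  // into the next prefix once the counter passes 9999.
  def next(): Int = prefix * 10000 + counter.incrementAndGet()
}
```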