HADOOP-17981. resilient commit through etag validation #3611
Conversation
Somehow the test FS with rename raising exceptions is being picked up by other test suites... caching?
Agree with the rename recovery logic.
As discussed, for the ABFS case we will work towards a long-term solution where the driver and server can handle the case, avoiding the client having to do the source/destination etag checks as in ResilientCommitByRenameHelper.
Tested against ABFS Cardiff. Two failures with threads=8: HADOOP-17797, transient failure of ITestAbfsListStatusRemoteIterator; and HADOOP-17989, ITestAzureBlobFileSystemDelete failing with "Operations has null HTTP response".
Checkstyle on imports.
Pushed up the import fixup patch.
LGTM overall. Will approve once the Yetus run is fixed.
Nice tests.
 * has the same etag. as a result, this is considered a success.
 */
@Test
public void testCommitMissingDestDir() throws Throwable {
I think the description and javadoc are wrong here?
Force-pushed from 9e33d29 to f7763f8.
Thanks, updated javadocs. Bear in mind this is not going to be merged in as is; once the review is in I plan to
…commit
Contributed by Rajesh Balamohan

Change-Id: I8d2cedb0ba8b6aba2f1d0f814f52d77b156f1206
Compared to the previous pull request, this moves the recovery handling into the application code. The nominally simpler "fs api" strategy had actually become more complicated, with interfaces in RawLocalFS for testing, etc., and judging by the comments, the hive team were showing a dangerous amount of interest in using the same API. This solution adds one new interface in hadoop fs, which is straightforward to declare stable, as S3A, ABFS and others can all serve this field up immediately.

With the interface and path capabilities probes, anything is free to use this for:
* tracking changes across files
* verifying that objects are unchanged
* this special case of resilience recovery

What else do we need? We would like, but don't need, an option for ABFS to throw meaningful exceptions on rename failures and for it not to attempt to use modtimes to recover from source-file-not-found events.

Change-Id: I10e6657f37769df3917b3f34b3b4cd066827fba7
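The one-interface approach described above can be sketched roughly as follows. EtagSource is the interface name used in this PR; the mock status class, method shapes, and etag values are illustrative stand-ins, not the real hadoop-common code:

```java
// Sketch of the "one new interface" pattern: a FileStatus-like object
// can expose its etag, and callers use it for change tracking.
// EtagSource mirrors the interface added in this PR; everything else
// here is a hypothetical stand-in for illustration only.
interface EtagSource {
    /** @return the entity tag of the file, or null if none is available. */
    String getEtag();
}

public class EtagTracking {

    /** Minimal stand-in for a FileStatus subclass that carries an etag. */
    static class MockEtagFileStatus implements EtagSource {
        private final String etag;

        MockEtagFileStatus(String etag) {
            this.etag = etag;
        }

        @Override
        public String getEtag() {
            return etag;
        }
    }

    /** Change tracking: is the object unchanged between two listings? */
    static boolean unchanged(EtagSource before, EtagSource after) {
        // a null etag means the store gave us nothing to compare
        return before.getEtag() != null
            && before.getEtag().equals(after.getEtag());
    }

    public static void main(String[] args) {
        EtagSource first = new MockEtagFileStatus("etag-1");
        EtagSource second = new MockEtagFileStatus("etag-1");
        System.out.println(unchanged(first, second)); // true
    }
}
```

In the real codebase the status objects returned by list/getFileStatus calls implement the interface directly, so callers only need an `instanceof` check plus a path capability probe to know whether etags are available.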
abfs client to
* stop trying to recover on modtime (which was broken anyway)
* optionally raise exceptions on rename failure
* not yet cut the unused obsolete methods.

Algorithm tuning during review:
* only tries on FNFE
* if rename returns false, escalation will raise FNFE on source not found, so automatically fall into recovery
* recovery attempts can be disabled

Change-Id: If979166775333560c26c93d1d20c171b168a0100
abfs integration test suites, with and without rename raising exceptions. The commit helper counts how many operations were recovered from; this is logged in the committer.

Change-Id: Id9a3bb49e71bccc5fa267ffb9077bfbed8008117

completely cut tests for rename idempotency in abfs and the obsolete method in AbfsClient

Change-Id: I21a42d9c3858e3cb846e0b2dc90003c47209e912
Change-Id: I2d2984e6225834d067ec550e50766e343fb11ede
Change-Id: I922802b0f8251c56878bdd5419d4524b39b39ccd
Change-Id: I2bc9aefa0755ed68115526d5f6f3610d9eec6704
Force-pushed from c67143e to b3cb4f8.
Converting this to a draft as I don't want it merged in; the etag PR and adoption of the resilient commit helper is how I intend to do it.
The manifest committer is faster than Rajesh's patch on object stores where listings are slow (done during task commit), and because it doesn't rename in task commit, it is atomic even on GCS. With rate limiting we can even keep it below the load threshold for Azure Storage.
Note: setting "fs.azure.rename.raises.exceptions" to "true" highlights that the default trash policy can't handle exceptions raised there.
* Pull out from HADOOP-17981/apache#3611
* Add S3A support
* Add requirement that located status API calls MUST also support the API, and do this for ABFS.

s3a code retains now-deprecated getETag entries; lots of references in the s3guard code which I am leaving alone.

Change-Id: I16c10c73873b3cbe60433adb9bbc7c1441221dbb
Compared to #3597 this moves the recovery handling into
the application code.
Draft commit message:
Uses a new EtagSource API in hadoop-common (implemented
and supported in ABFS) to implement recovery from failures
during job commit of the form
"java.io.IOException: Failed to rename VersionedFileStatus"
After a rename() call fails, the etag of the file at the
destination path is compared with that of the source file
as determined when listing the parent directory.
If the two match and the source file is no longer present,
the rename is considered successful.
This algorithm relies on
* etags changing whenever file contents change
(at a minimum: when different data is uploaded to the same
destination)
* etags being preserved across rename operations
(true for ADLS Gen2, not for S3)
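The recovery check described above can be sketched as a single predicate. The class and method names here are illustrative, not the committer's actual code (the real logic lives in ResilientCommitByRenameHelper):

```java
// Hedged sketch of etag-based rename recovery: after rename() fails,
// decide whether the rename actually completed by comparing the
// source's etag (captured when listing the parent directory) with
// the etag of whatever now sits at the destination path.
public class RenameRecovery {

    /**
     * @param sourceEtag etag of the source file before the rename, or null
     * @param destEtag etag of the file now at the destination, or null
     * @param sourceStillExists whether the source file is still present
     * @return true if the failed rename can be treated as successful
     */
    static boolean renameRecovered(String sourceEtag, String destEtag,
                                   boolean sourceStillExists) {
        // success only when the destination carries the source's etag
        // AND the source file is gone: the rename completed despite
        // the reported failure.
        return sourceEtag != null
            && sourceEtag.equals(destEtag)
            && !sourceStillExists;
    }

    public static void main(String[] args) {
        // rename completed: destination has the source's etag, source gone
        System.out.println(renameRecovered("etag-1", "etag-1", false)); // true
        // destination holds different data: must not claim success
        System.out.println(renameRecovered("etag-1", "etag-2", false)); // false
    }
}
```

Note why both conditions in the sketch matter: a matching destination etag alone is not enough, since the source still existing would mean the rename never completed; and with no source etag available there is nothing to validate, so recovery must not be attempted.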
To enable this, users MUST set
mapreduce.fileoutputcommitter.algorithm.version.v1.experimental.parallel.rename.recovery
to "true"
Users SHOULD set
fs.azure.rename.raises.exceptions to true
That is not mandatory, but it gives better error reporting on all forms of rename
failures, and when handling recoverable errors in job commits, reduces the
number of IO requests by one, so reducing load in an IO-intensive phase
of the job.
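Taken together, the two settings above would land in a cluster configuration roughly like this. The property names are copied verbatim from the text above; the values are the ones it recommends:

```xml
<!-- MUST: enables etag-based rename recovery during job commit. -->
<property>
  <name>mapreduce.fileoutputcommitter.algorithm.version.v1.experimental.parallel.rename.recovery</name>
  <value>true</value>
</property>

<!-- SHOULD: better error reporting on rename failures, and one fewer
     IO request when handling recoverable errors in job commit. -->
<property>
  <name>fs.azure.rename.raises.exceptions</name>
  <value>true</value>
</property>
```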
For code changes:
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?