HADOOP-16490. Avoid/handle cached 404s during S3A file creation #1407
Conversation
* S3Guard retry policy on failures in open/select/copy is independently configurable from other retry policies, and uses a setting with exponential backoff
* new config names
* copy raises a RemoteFileChangedException which is *not* caught in rename() and downgraded to false. Thus: when a rename is unrecoverable, that fact is propagated
* tests for this

Also: tests turning auth mode on/off have to handle the auth state being set through an authoritative path, rather than a single flag. This caught me out, as the first tests I saw failing were the ITestS3ARemoteFileChanged rename ones, and I assumed it was my new code. It was actually due to me setting an auth path last week.

I'm unsure about the name "RemoteFileChangedException", but it could be a remote file change, especially for an etag where the file has been deleted since being recorded. I don't know if/how we should adapt to that in S3Guard if the problem persists. The file is either *no longer there* or *changes not yet visible*. Maybe consider marking the parent dir as nonauth? Log an error, always?

Change-Id: I7bb468aca0f4019537d82bc083f0a9887eaa282b
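To make the exponential-backoff behaviour above concrete, here is a minimal standalone sketch of how such a retry schedule computes its delays. This is illustrative only: the method and parameter names are hypothetical, not the actual S3A retry policy classes or config keys.

```java
public class BackoffSketch {
    // Returns the delay in milliseconds before the given retry attempt,
    // doubling from an initial delay (e.g. the 2s used later in this PR)
    // and capping at a maximum so retries never wait unboundedly long.
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long d = initialMillis;
        for (int i = 0; i < attempt && d < maxMillis; i++) {
            d *= 2;
        }
        return Math.min(d, maxMillis);
    }
}
```

With an initial delay of 2s the schedule runs 2s, 4s, 8s, ... until the cap is reached.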
- different text when S3Guard is off
- better toString values for logging

Change-Id: I559d7a099e2720fee48d432c4fc7ae18c6af137f
…ently by not using deleteOnExit we save an RPC call on the normal path and reduce the risk of negative cached entries. Change-Id: Ie067c3b5179a1602488eadf35419805c089a3373
Change-Id: I3163cd3b561da0ab6bfb9f9d56d9097ec6d4438c
- initial delay = 2s
- tweaked the new error messages to make clear this happened during rename, and to highlight when a table is unguarded
- checkstyles

Change-Id: Ibf71c433fc8342a20ffbf4d822cd9ea834a492a8
Change-Id: I1d08d64b27afd348261e0fa93444f14510b47746
No explicit testing of those probe switches; it will be straightforward.

Change-Id: I00f8987b445bfd192845a69d10a2ba6fd02bd2a1
…s at debug, rather than lose them. Change-Id: Icced1ac648c7a97e322362bbc03a245cf496edcd
Change-Id: Ie4a7c9aca2a505fce3b8eedd07fbb0127ad4379a
TestCopy was failing because the removal of deleteOnExit broke the mock invocation count.
- correct for current values
- reinstate deleteOnExit before rename for the existing behaviour
- add a few extra probes
- add a test of what happens on a failure in write, rather than create

Change-Id: Ib8362ac3a5e36819255a35b8f54c8cfc523c1eb1
Gabor reported an NPE here on a test run of this PR only; added asserts that the lists from the MS are never null Change-Id: I9154eadabb486f601f86e51121d2f05d912c8e46
Change-Id: Ifa586e1d04998594af17b46b24379df7882bed2a
💔 -1 overall
This message was automatically generated.
Tested against S3 Ireland without S3Guard; one failure in test teardown. I have been seeing this a lot recently and suspect it may be throttling, or just AWS performance.
ITest run against Ireland; there were some errors. I need to get back to this, as my console buffer was not big enough to save all the output, so here are the results.
Seq tests seem like the usual failures, but the others are worrying me. At the same time I was running the same tests against us-west, just to have a second opinion.
OK, I will have a look at the commit MR jobs. You know all the output goes into target/failsafe, right? It doesn't matter if the console fills up, as we get everything there, including logs. If I'm not seeing this and you are, you'll be able to set the s3a logs to debug and rerun so I get the details. FWIW, all the DT/session failures imply it's the location of the STS service which isn't being found.
FWIW, in the HADOOP-16207 code there was a bug in a recent iteration where each task was committing its staging .pendingset files to the temp dir system property of its task, which was being set to a different path on each NM; only work written by tasks running on the same NM as the job commit was found. Wonder if this is similar.
To provide more details on Gabor's reports. Change-Id: I57c79b26552c11ee733d8d4257c94486648a6215
Just pushed up a patch with more detailed asserts and some better tuning of staging dirs.
I like the extra logging and the improvements in general. Added some comments to the code.
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
 * @throws FileNotFoundException when the path does not exist
 * @throws IOException on other problems.
 */
private S3AFileStatus resolveFileStatus(final Path path,
I don't like this: why do we need another wrapper for this call? I mean, getFileStatus calls innerGetFileStatus, which calls resolveFileStatus, and I don't see why we need the last call here; IMHO there's no need for another method in the same class. It will just be another cmd+click for most of us in the IDE, and I don't see any particular gain from it. A better way would be to factor the method out into its own class, or at least create a JIRA for this.
I have plans here. Specifically, I want to push it down a layer in any refactored S3A FS; innerGetFileStatus is essentially the obsolete one.
Once this PR is in, I want to switch those operations whose callers expect to see directory lists of some kind to do the dir list first, probably in the order LIST, HEAD /, HEAD (HADOOP-16465), and massively cut the number of calls made during treewalks in query planning. So we need new args.
Let me review this; I could make that StatusProbeEnum an explicit param to resolveFileStatus or innerGetFileStatus() after moving the entry point/qualification code up to getFileStatus.
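The probe-ordering idea above (try the cheaper/more likely request first, stop at the first that answers) can be sketched generically. This is an illustration of the pattern only; the names here are hypothetical and not S3A code.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

public class ProbeOrderSketch {
    // Run probes in the given order (e.g. LIST before HEAD) and return the
    // first non-empty result, so later, more expensive probes are skipped.
    static <T> Optional<T> firstResult(List<Supplier<Optional<T>>> probes) {
        for (Supplier<Optional<T>> probe : probes) {
            Optional<T> r = probe.get();
            if (r.isPresent()) {
                return r;
            }
        }
        return Optional.empty();
    }
}
```

During a treewalk where most paths are directories, putting LIST first means most lookups never issue a HEAD at all.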
OK, moved it into innerGetFileStatus, which removes the entryPoint check and requires the set of probes; using them as appropriate in production and test code.
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3ARemoteFileChanged.java
@@ -356,4 +358,14 @@ public void testListingFilteredExpiredItems() throws Exception {
  }
}

protected DirListingMetadata getDirListingMetadata(final MetadataStore ms,
I don't see how this change is connected to the original issue.
I was getting an NPE during test runs where a list op was returning null but the tests assumed it never did. Added for debugging
Correction: you were reporting an NPE on test runs which I wasn't seeing. This change is to debug your test failures.
* the resolveFileStatus command is folded back into innerGetFileStatus, which now requires a set of probes to make against S3
* getFileStatus() makes the entryPoint() call and then asks for everything
* the other uses ask for everything too, except create(), which chooses based on the overwrite flag
* there's a new probe in ITestS3AFileOperationCost to verify that the HEAD doesn't get made (just realised I need to do an unguarded test run here)

Change-Id: Idf12fa6793388554e75cb31fcbcbbc5c081c638f
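The create()/overwrite choice above can be pictured with a small EnumSet sketch. This is an illustrative stand-in for the StatusProbeEnum discussed in the review, not the actual S3A enum or its members.

```java
import java.util.EnumSet;

public class ProbeSketch {
    // Hypothetical probe set: which S3 requests a status lookup may issue.
    enum StatusProbe { HEAD, LIST }

    // With overwrite=true, create() can skip the HEAD entirely (so no 404
    // is ever issued, and none gets cached by the S3 load balancers) and
    // only needs a LIST to check the target is not a directory; with
    // overwrite=false it needs all probes to detect an existing file.
    static EnumSet<StatusProbe> probesForCreate(boolean overwrite) {
        return overwrite
            ? EnumSet.of(StatusProbe.LIST)
            : EnumSet.allOf(StatusProbe.class);
    }
}
```

The new ITestS3AFileOperationCost probe mentioned above then only has to assert that the HEAD count stays at zero on the overwrite=true path.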
Updated PR; tested successfully against S3 Ireland w/ S3Guard & DDB, with one failure in the DDB table delete test testDynamoDBInitDestroyCycle. Testing without S3Guard now to verify it is happy too, as more checks of S3 are made there.
💔 -1 overall
This message was automatically generated.
Spurious test failure in org.apache.hadoop.util.TestDiskCheckerWithDiskIo; assumption: the Jenkins disk is the root cause.
Tested against Ireland, parallel run:
Sequential tests, the usual failures:
We have a new failure which is quite consistent for me:
LGTM. Failures unrelated, but we should deal with those separately.
+1
This is now committed. Thanks for the reviews!
This patch avoids issuing any HEAD path request when creating a file with overwrite=true, so 404s will not end up in the S3 load balancers unless someone calls getFileStatus/exists in their own code.
Special S3Guard FNFE retry policy, independently configurable from other retry policies.
This is a rebased successor to #1229
Change-Id: I7bb468aca0f4019537d82bc083f0a9887eaa282b