Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HADOOP-16490. Avoid/handle cached 404s during S3A file creation #1407

Conversation

steveloughran
Copy link
Contributor

@steveloughran steveloughran commented Sep 5, 2019

This patch avoids issuing any HEAD path request when creating a file with overwrite=true, so 404s will not end up in the S3 load balancers unless someone calls getFileStatus/exists in their own code

Special S3Guard FNFE retry policy independently configurable from other retry policies,

  • and use a setting with exponential backoff
  • new config names
  • copy raises a RemoteFileChangedException which is not caught in rename() and downgraded to false. Thus: when a rename is unrecoverable, this fact is propagated
  • tests for this
  • More logging @ debug in change policies as to policy type and when options are not set, as well as being set. Currently to work out the policy involves looking for the absence of messages, not the presence. It makes the file more verbose, but will aid with debugging these problems.

Also: tests turning auth mode on/off have to handle the auth state being set through an authoritative path, rather than a single flag. Caught me out as of course the first test I saw with this was the ITestS3ARemoteFileChanged rename ones, and I assumed that it was my new code. It was actually due to me setting an auth path last week.

This is a rebased successor to #1229

Change-Id: I7bb468aca0f4019537d82bc083f0a9887eaa282b

S3Guard retry policy on failures in open/select/copy to be independently configurable from other retry policies,

* and use a setting with exponential backoff
* new config names
* copy raises a RemoteFileChangedException which is *not* caught in rename() and downgraded to false. Thus: when a rename is unrecoverable, this fact is propagated
* tests for this

Also: tests turning auth mode on/off have to handle the auth state being set through an authoritative path, rather than a single flag. Caught me out as of course the first test I saw with this was the ITestS3ARemoteFileChanged rename ones, and I assumed that it was my new code. It was actually due to me setting an auth path last week.

I'm unsure about the use of "RemoteFileChangedException", but it could be a remote file change, especially for an etag, where the file has been deleted since being recorded.I don't know if/how we should adapt to that in S3Guard if the problem persists. The file is either *no longer there* or *changes not yet visible*. Maybe consider marking parent dir as nonauth? Logging error, always?

Change-Id: I7bb468aca0f4019537d82bc083f0a9887eaa282b
- different text when s3guard is off
- better tostring values for logging

Change-Id: I559d7a099e2720fee48d432c4fc7ae18c6af137f
…ently

by not using deleteOnExit we save an RPC call on the normal path and reduce the risk of negative cached entries.

Change-Id: Ie067c3b5179a1602488eadf35419805c089a3373
Change-Id: I3163cd3b561da0ab6bfb9f9d56d9097ec6d4438c
-initial delay = 2s
-tweaked the new error messages to make clear this happened during rename and to highlight when a table is unguarded.
-checkstyles

Change-Id: Ibf71c433fc8342a20ffbf4d822cd9ea834a492a8
Change-Id: I1d08d64b27afd348261e0fa93444f14510b47746
no explicit testing of those probe switches -it will be straightforward

Change-Id: I00f8987b445bfd192845a69d10a2ba6fd02bd2a1
…s at debug, rather than lose them.

Change-Id: Icced1ac648c7a97e322362bbc03a245cf496edcd
Change-Id: Ie4a7c9aca2a505fce3b8eedd07fbb0127ad4379a
TestCopy was failing because of the removal of deleteOnExit broke the mock invocation count.
-correct for current values
-reinstate deleteOnExit before rename for the existing behaviour
-add a few extra probes
-and a test on what happens on a fail in write, rather than create.

Change-Id: Ib8362ac3a5e36819255a35b8f54c8cfc523c1eb1
Gabor reported an NPE here on a test run of this PR only; added asserts that the lists from the MS are never null

Change-Id: I9154eadabb486f601f86e51121d2f05d912c8e46
Change-Id: Ifa586e1d04998594af17b46b24379df7882bed2a
@bgaborg bgaborg self-requested a review September 5, 2019 17:55
@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 70 Docker mode activated.
_ Prechecks _
+1 dupname 0 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 6 new or modified test files.
_ trunk Compile Tests _
0 mvndep 67 Maven dependency ordering for branch
+1 mvninstall 1033 trunk passed
+1 compile 975 trunk passed
+1 checkstyle 146 trunk passed
+1 mvnsite 133 trunk passed
+1 shadedclient 1015 branch has no errors when building and testing our client artifacts.
+1 javadoc 114 trunk passed
0 spotbugs 70 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 187 trunk passed
_ Patch Compile Tests _
0 mvndep 25 Maven dependency ordering for patch
+1 mvninstall 80 the patch passed
+1 compile 922 the patch passed
+1 javac 922 the patch passed
+1 checkstyle 142 the patch passed
+1 mvnsite 129 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 1 The patch has no ill-formed XML file.
+1 shadedclient 687 patch has no errors when building and testing our client artifacts.
+1 javadoc 111 the patch passed
+1 findbugs 206 the patch passed
_ Other Tests _
-1 unit 517 hadoop-common in the patch failed.
+1 unit 95 hadoop-aws in the patch passed.
+1 asflicense 52 The patch does not generate ASF License warnings.
6750
Reason Tests
Failed junit tests hadoop.util.TestDiskCheckerWithDiskIo
Subsystem Report/Notes
Docker Client=19.03.2 Server=19.03.2 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1407/1/artifact/out/Dockerfile
GITHUB PR #1407
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 7b0b7fb6dfa7 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 511df1e
Default Java 1.8.0_222
unit https://builds.apache.org/job/hadoop-multibranch/job/PR-1407/1/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1407/1/testReport/
Max. process+thread count 1368 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1407/1/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@steveloughran
Copy link
Contributor Author

tested s3 ireland without s3guard -one failure in test teardown. I have been seeing this is a lot recently and suspect it may be throttling. Or just AWS performance

@bgaborg
Copy link

bgaborg commented Sep 7, 2019

itest run against ireland, there were some errors. I need to get back to this as my console buffer was not big enough to save all output, so here goes the results mvn clean verify -Dparallel-tests -DtestsThreadCount=8 -Ddynamo -Ds3guard:

[ERROR] Failures:
[ERROR]   ITestSessionDelegationInFileystem.testDTUtilShell:710->dtutil:699->Assert.assertEquals:631->Assert.assertEquals:645->Assert.failNotEquals:834->Assert.fail:88 expected:<0> but was:<1>
[ERROR] Errors:
[ERROR]   ITestS3ATemporaryCredentials.testInvalidSTSBinding:257 ? SdkClient Unable to f...
[ERROR]   ITestS3ATemporaryCredentials.testSTS:130 ? SdkClient Unable to find a region v...
[ERROR]   ITestS3ATemporaryCredentials.testSessionRequestExceptionTranslation:441->lambda$testSessionRequestExceptionTranslation$5:442 ? SdkClient
[ERROR]   ITestS3ATemporaryCredentials.testSessionTokenExpiry:222 ? SdkClient Unable to ...
[ERROR]   ITestS3ATemporaryCredentials.testSessionTokenPropagation:193 ? SdkClient Unabl...
[ERROR]   ITestDelegatedMRJob.testJobSubmissionCollectsTokens:286 ? SdkClient Unable to ...
[ERROR]   ITestSessionDelegationInFileystem.testAddTokensFromFileSystem:235 ? SdkClient ...
[ERROR]   ITestSessionDelegationInFileystem.testCanRetrieveTokenFromCurrentUserCreds:260->createDelegationTokens:292->AbstractDelegationIT.mkTokens:88 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testDTCredentialProviderFromCurrentUserCreds:278->createDelegationTokens:292->AbstractDelegationIT.mkTokens:88 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testDelegatedFileSystem:308->createDelegationTokens:292->AbstractDelegationIT.mkTokens:88 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testDelegationBindingMismatch1:432->createDelegationTokens:292->AbstractDelegationIT.mkTokens:88 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testFileSystemBoundToCreator:681 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testGetDTfromFileSystem:212 ? SdkClient Unab...
[ERROR]   ITestSessionDelegationInFileystem.testHDFSFetchDTCommand:606->lambda$testHDFSFetchDTCommand$3:607 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testYarnCredentialPickup:576 ? SdkClient Una...
[ERROR]   ITestSessionDelegationTokens.testCreateAndUseDT:176 ? SdkClient Unable to find...
[ERROR]   ITestSessionDelegationTokens.testSaveLoadTokens:121 ? SdkClient Unable to find...
[ERROR]   ITestS3GuardConcurrentOps.testConcurrentTableCreations:178->deleteTable:87 ? TestTimedOut
[INFO]
[ERROR] Tests run: 1057, Failures: 1, Errors: 18, Skipped: 166```

Seqential test run:
[ERROR] Errors:
[ERROR]   ITestMagicCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestDirectoryCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestPartitionCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestStagingCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[INFO]
[ERROR] Tests run: 94, Failures: 0, Errors: 4, Skipped: 67

Seq tests seem like the usual failures, but the others are worrying me.


At the same time I was running the same tests on aws against us-west, just to have a second opinion:
There were no failures with mvn clean verify -Dparallel-tests -DtestsThreadCount=8 -Ddynamo -Ds3guard during the parallel tests, and the usual errors during the sequential tests:

[ERROR]   ITestMagicCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[ERROR]   ITestDirectoryCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[ERROR]   ITestPartitionCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[ERROR]   ITestStagingCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound```

@steveloughran
Copy link
Contributor Author

OK. I will have a look at the commit MR jobs.

you know all the output goes into target/failsafe right? it doesn;t matter if the console fills up as we get everything there, inc logs

If I'm not seeing this and you are, you'll be able to set the s3a logs to debug and rerun so i ge5t the details.

FWIW, all the DT/session failures imply its the location of the STS service which isn't being found

@steveloughran
Copy link
Contributor Author

FWIW, in the HADOOP-16207 code there was a bug in a recent iteration where each task was committing the staging file .pendingset files[ to to the temp dir system property of its task, which was being set to a different path on each NM; only work written by tasks running on the same NM as job commit were found. Wonder if this is similar

To provide more details on Gabor's reports.

Change-Id: I57c79b26552c11ee733d8d4257c94486648a6215
@steveloughran
Copy link
Contributor Author

just pushed up a patch with more detailed asserts and some better tuning of staging dirs

Copy link

@bgaborg bgaborg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the more logging and improvement in general. Added some comments to the code.

* @throws FileNotFoundException when the path does not exist
* @throws IOException on other problems.
*/
private S3AFileStatus resolveFileStatus(final Path path,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this: why do we need to create another wrapper for this call? I mean getFileStatus calls innerGetFileStatus calls resolveFileStatus and I don't see why do we need to do the last call here - imho there's no need for another method in the same class.. It will be just another command+click for most of us in the IDE, while I don't see any particular gain from this - a better way would be the factor out the method call to it's own class, or create a jira for this at least.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have plans here. Specifically I want to push it down a layer in any refactored s3a FS. InnerGetFileStatus is essentially the obsolete one.

Once this PR is I want to switch those operations to list paths and whose callers expect to see directory lists of some kind to do the dir list first, probably in the order: LIST, HEAD / HEAD (HADOOP-16465) and massively eliminate the #of calls made during treewalks in query planning. So a we need new args.

Let me review this; I could make that StatusProbeEnum an explicit param to resolveFileStatus or innerGetFileStatus() after moving the entry point/qualification code up there to getFileStatus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, moved it into an innerGetFIleStatus which removes the entryPoint check and requires the set of probes; using them as appropriate in production and test code

@@ -356,4 +358,14 @@ public void testListingFilteredExpiredItems() throws Exception {
}
}

protected DirListingMetadata getDirListingMetadata(final MetadataStore ms,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this change is connected to the original issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting an NPE during test runs where a list op was returning null but the tests assumed it never did. Added for debugging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correction. You were reporting an NPE on test runs which I wasn't seeing. This change is to debug your test failures

* the resolveFileStatus command is back into innerGetFileStatus -which now requires a set of probes to make against S3
* getFileStatus() makes the entryPoint() call and then asks for everything
* the other uses ask for everything too, except in create(), which chooses based on the overwrite flag
* and there's a new probe in ITestS3AFileOperationCost to verify that the HEAD doesn't get made (just realised I need to do an unguarded test run here)

Change-Id: Idf12fa6793388554e75cb31fcbcbbc5c081c638f
@apache apache deleted a comment from hadoop-yetus Sep 10, 2019
@steveloughran
Copy link
Contributor Author

Updated PR tests successfully against s3 ireland w/ s3guard & ddb; one failure in DDB table delete testDynamoDBInitDestroyCycle.

testing without s3guard now to verify it is happy too, as more checks of s3 are made there

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 189 Docker mode activated.
_ Prechecks _
+1 dupname 1 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 11 new or modified test files.
_ trunk Compile Tests _
0 mvndep 105 Maven dependency ordering for branch
+1 mvninstall 1051 trunk passed
+1 compile 1017 trunk passed
+1 checkstyle 149 trunk passed
+1 mvnsite 131 trunk passed
+1 shadedclient 976 branch has no errors when building and testing our client artifacts.
+1 javadoc 111 trunk passed
0 spotbugs 69 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 197 trunk passed
_ Patch Compile Tests _
0 mvndep 27 Maven dependency ordering for patch
+1 mvninstall 80 the patch passed
+1 compile 983 the patch passed
+1 javac 983 the patch passed
+1 checkstyle 143 the patch passed
+1 mvnsite 133 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 1 The patch has no ill-formed XML file.
+1 shadedclient 715 patch has no errors when building and testing our client artifacts.
+1 javadoc 117 the patch passed
+1 findbugs 206 the patch passed
_ Other Tests _
-1 unit 519 hadoop-common in the patch failed.
+1 unit 97 hadoop-aws in the patch passed.
+1 asflicense 54 The patch does not generate ASF License warnings.
7046
Reason Tests
Failed junit tests hadoop.util.TestDiskCheckerWithDiskIo
Subsystem Report/Notes
Docker Client=19.03.2 Server=19.03.2 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1407/3/artifact/out/Dockerfile
GITHUB PR #1407
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux fd34b8e9e3dd 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 38c1a10
Default Java 1.8.0_222
unit https://builds.apache.org/job/hadoop-multibranch/job/PR-1407/3/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1407/3/testReport/
Max. process+thread count 1367 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1407/3/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@steveloughran
Copy link
Contributor Author

steveloughran commented Sep 10, 2019

spurious test failure in org.apache.hadoop.util.TestDiskCheckerWithDiskIo, Assumption: jenkins disk is root cause.
The hadoop-aws tests were successful

@bgaborg
Copy link

bgaborg commented Sep 11, 2019

tested against ireland:

parallel:

[ERROR] Failures:
[ERROR]   ITestSessionDelegationInFileystem.testDTUtilShell:710->dtutil:699->Assert.assertEquals:631->Assert.assertEquals:645->Assert.failNotEquals:834->Assert.fail:88 expected:<0> but was:<1>
[ERROR] Errors:
[ERROR]   ITestS3ATemporaryCredentials.testInvalidSTSBinding:257 ? SdkClient Unable to f...
[ERROR]   ITestS3ATemporaryCredentials.testSTS:130 ? SdkClient Unable to find a region v...
[ERROR]   ITestS3ATemporaryCredentials.testSessionRequestExceptionTranslation:441->lambda$testSessionRequestExceptionTranslation$5:442 ? SdkClient
[ERROR]   ITestS3ATemporaryCredentials.testSessionTokenExpiry:222 ? SdkClient Unable to ...
[ERROR]   ITestS3ATemporaryCredentials.testSessionTokenPropagation:193 ? SdkClient Unabl...
[ERROR]   ITestDelegatedMRJob.testJobSubmissionCollectsTokens:286 ? SdkClient Unable to ...
[ERROR]   ITestSessionDelegationInFileystem.testAddTokensFromFileSystem:235 ? SdkClient ...
[ERROR]   ITestSessionDelegationInFileystem.testCanRetrieveTokenFromCurrentUserCreds:260->createDelegationTokens:292->AbstractDelegationIT.mkTokens:88 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testDTCredentialProviderFromCurrentUserCreds:278->createDelegationTokens:292->AbstractDelegationIT.mkTokens:88 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testDelegatedFileSystem:308->createDelegationTokens:292->AbstractDelegationIT.mkTokens:88 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testDelegationBindingMismatch1:432->createDelegationTokens:292->AbstractDelegationIT.mkTokens:88 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testFileSystemBoundToCreator:681 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testGetDTfromFileSystem:212 ? SdkClient Unab...
[ERROR]   ITestSessionDelegationInFileystem.testHDFSFetchDTCommand:606->lambda$testHDFSFetchDTCommand$3:607 ? SdkClient
[ERROR]   ITestSessionDelegationInFileystem.testYarnCredentialPickup:576 ? SdkClient Una...
[ERROR]   ITestSessionDelegationTokens.testCreateAndUseDT:176 ? SdkClient Unable to find...
[ERROR]   ITestSessionDelegationTokens.testSaveLoadTokens:121 ? SdkClient Unable to find...

sequential tests usual failures:

[ERROR] Errors:
[ERROR]   ITestMagicCommitMRJob>AbstractITCommitMRJob.testMRJob:150 ? FileNotFound MR jo...
[ERROR]   ITestDirectoryCommitMRJob>AbstractITCommitMRJob.testMRJob:150 ? FileNotFound M...
[ERROR]   ITestPartitionCommitMRJob>AbstractITCommitMRJob.testMRJob:150 ? FileNotFound M...
[ERROR]   ITestStagingCommitMRJob>AbstractITCommitMRJob.testMRJob:150 ? FileNotFound MR ...

we have a new failure which is quite constant for me:
com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
we should address that in a different jira.
This is unrelated to this change.

Copy link

@bgaborg bgaborg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Failures unrelated, but we should deal with those separately.
+1

@steveloughran steveloughran changed the title HADOOP-16490. Improve S3Guard handling of FNFEs in copy HADOOP-16490. Avoid/handle cached 404s during S3A file creation Sep 11, 2019
@steveloughran
Copy link
Contributor Author

this is now committed -Thanks for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants