
HADOOP-15183. S3Guard store becomes inconsistent after partial failure of rename #951

Conversation

steveloughran
Contributor

Contributed by Steve Loughran.

This is the squashed patch of PR #843 commit 115fb77

Contains

  • HADOOP-13936. S3Guard: DynamoDB can go out of sync with S3AFileSystem.delete()

  • HADOOP-15604. Bulk commits of S3A MPUs place needless excessive load on S3 & S3Guard

  • HADOOP-15658. Memory leak in S3AOutputStream

  • HADOOP-16364. S3Guard table destroy to map IllegalArgumentExceptions to IOEs

This work adds two notions to the S3Guard metastore APIs:

  • the notion of a "BulkOperation": a store-specific class which is requested before initiating bulk work (put, purge, rename) and which can then be used to cache table changes performed during the bulk operation. This allows rename and commit operations to avoid duplicate creation of parent entries in the tree: the store can track what has already been created/found.

  • the notion of a "RenameTracker", which factors out the task of updating a metastore with changes to the filesystem during a rename (files added and deleted) and after the operation completes, successfully or not.

The original rename behaviour (updating the store only at the end of the rename) is implemented as the DelayedUpdateRenameTracker, while a new ProgressiveRenameTracker updates the store as individual files are copied and when bulk deletes complete. To avoid performance problems, stores must provide a BulkOperation implementation which remembers the ancestors already added; the DynamoDB metastore does this.
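
As an illustration of the tracker contract (a minimal sketch under the description above; the method names are simplified assumptions, not the patch's exact signatures):

```java
import java.io.IOException;
import java.util.Collection;
import org.apache.hadoop.fs.Path;

// Minimal sketch of the RenameTracker idea; simplified, not the patch's API.
public abstract class RenameTracker {
  private final Path sourceRoot;
  private final Path dest;

  protected RenameTracker(Path sourceRoot, Path dest) {
    this.sourceRoot = sourceRoot;
    this.dest = dest;
  }

  /** Called as each file is copied; a progressive tracker updates the store here. */
  public abstract void fileCopied(Path sourcePath, Path destPath) throws IOException;

  /** Called when a bulk delete of source objects completes. */
  public abstract void sourceObjectsDeleted(Collection<Path> paths) throws IOException;

  /** Called when the rename finishes successfully. */
  public abstract void completeRename() throws IOException;

  /** Called if the rename fails partway, letting the tracker record what did change. */
  public abstract IOException renameFailed(Exception ex);
}
```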

Some of the new features are implemented as part of a gradual refactoring of the S3AFileSystem itself: the handling of partial delete failures is in its own class, org.apache.hadoop.fs.s3a.impl.MultiObjectDeleteSupport, which, rather than being given a reference back to the owning S3AFileSystem, is handed a StoreContext containing restricted attributes and callbacks. As this refactoring continues in future patches and the different layers of a new store model are factored out, this will be extended.
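
A hedged sketch of that idea; the real StoreContext carries more attributes, and the fields below are purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative only: a narrow context handed to helper classes such as
// MultiObjectDeleteSupport, instead of a reference to the whole S3AFileSystem.
public class StoreContext {
  private final String bucket;
  private final String username;
  private final Configuration configuration;

  public StoreContext(String bucket, String username, Configuration configuration) {
    this.bucket = bucket;
    this.username = username;
    this.configuration = configuration;
  }

  public String getBucket() { return bucket; }
  public String getUsername() { return username; }
  public Configuration getConfiguration() { return configuration; }
}
```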

Change-Id: Ie0bd96ab861f0f30170b75f78e5503fc0e929524

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 34 Docker mode activated.
_ Prechecks _
+1 dupname 2 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 32 new or modified test files.
_ trunk Compile Tests _
0 mvndep 67 Maven dependency ordering for branch
+1 mvninstall 1056 trunk passed
+1 compile 1027 trunk passed
+1 checkstyle 147 trunk passed
+1 mvnsite 126 trunk passed
+1 shadedclient 978 branch has no errors when building and testing our client artifacts.
+1 javadoc 84 trunk passed
0 spotbugs 61 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 177 trunk passed
_ Patch Compile Tests _
0 mvndep 22 Maven dependency ordering for patch
+1 mvninstall 74 the patch passed
+1 compile 987 the patch passed
+1 javac 987 the patch passed
-0 checkstyle 149 root: The patch generated 20 new + 100 unchanged - 2 fixed = 120 total (was 102)
+1 mvnsite 127 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 688 patch has no errors when building and testing our client artifacts.
+1 javadoc 106 the patch passed
+1 findbugs 204 the patch passed
_ Other Tests _
-1 unit 533 hadoop-common in the patch failed.
+1 unit 288 hadoop-aws in the patch passed.
+1 asflicense 56 The patch does not generate ASF License warnings.
6954
Reason Tests
Failed junit tests hadoop.ha.TestZKFailoverController
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/1/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 9aa4e01e5952 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 23c0379
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/1/artifact/out/diff-checkstyle-root.txt
unit https://builds.apache.org/job/hadoop-multibranch/job/PR-951/1/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/1/testReport/
Max. process+thread count 1392 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/1/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

Testing: S3A Ireland. All good except for ITestCommitOperations.testBulkCommitFiles, which only fails on parallel test runs. Which is very, very annoying, as it is hard to track down, especially as the scale tests now take 30 minutes. Plan: have AncestorState.toString list the paths added, and extend the assert to include the before and after string values.

Hypotheses:

  1. We are recreating parent paths
  2. More files are somehow being created and committed
  3. Parallel test runs are in subdirectories, and this increases the count
[ERROR] testBulkCommitFiles(org.apache.hadoop.fs.s3a.commit.ITestCommitOperations)  Time elapsed: 9.071 s  <<< FAILURE!
java.lang.AssertionError: Number of records written after second commit; first commit had 4: s3guard_metadatastore_record_writes expected:<2> but was:<8>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:645)
	at org.apache.hadoop.fs.s3a.S3ATestUtils$MetricDiff.assertDiffEquals(S3ATestUtils.java:882)
	at org.apache.hadoop.fs.s3a.commit.ITestCommitOperations.testBulkCommitFiles(ITestCommitOperations.java:626)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)

@bgaborg

bgaborg commented Jun 12, 2019

I have 2 failures while testing the latest PR against Ireland:

[ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITestDirectoryCommitMRJob)  Time elapsed: 42.965 s  <<< ERROR!
java.io.FileNotFoundException: Path s3a://gabota-versioned-bucket-ireland/fork-0005/test/DELAY_LISTING_ME/testMRJob is recorded as deleted by S3Guard at 2019-06-12T11:20:48.612Z
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2857)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2827)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertIsDirectory(ContractTestUtils.java:559)
	at org.apache.hadoop.fs.contract.AbstractFSContractTestBase.assertIsDirectory(AbstractFSContractTestBase.java:327)
	at org.apache.hadoop.fs.s3a.commit.AbstractITCommitMRJob.testMRJob(AbstractITCommitMRJob.java:137)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)

[INFO] Running org.apache.hadoop.fs.s3a.impl.ITestPartialRenamesDeletes
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 93.858 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitMRJob
[ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitMRJob)  Time elapsed: 44.661 s  <<< ERROR!
java.io.FileNotFoundException: Path s3a://gabota-versioned-bucket-ireland/fork-0004/test/testMRJob is recorded as deleted by S3Guard at 2019-06-12T11:21:16.180Z
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2857)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2827)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertIsDirectory(ContractTestUtils.java:559)
	at org.apache.hadoop.fs.contract.AbstractFSContractTestBase.assertIsDirectory(AbstractFSContractTestBase.java:327)
	at org.apache.hadoop.fs.s3a.commit.AbstractITCommitMRJob.testMRJob(AbstractITCommitMRJob.java:137)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
0 reexec 37 Docker mode activated.
_ Prechecks _
+1 dupname 3 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 32 new or modified test files.
_ trunk Compile Tests _
0 mvndep 22 Maven dependency ordering for branch
+1 mvninstall 1032 trunk passed
+1 compile 1027 trunk passed
+1 checkstyle 142 trunk passed
+1 mvnsite 132 trunk passed
+1 shadedclient 1002 branch has no errors when building and testing our client artifacts.
+1 javadoc 105 trunk passed
0 spotbugs 59 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 187 trunk passed
_ Patch Compile Tests _
0 mvndep 20 Maven dependency ordering for patch
+1 mvninstall 73 the patch passed
+1 compile 976 the patch passed
+1 javac 976 the patch passed
-0 checkstyle 137 root: The patch generated 19 new + 100 unchanged - 2 fixed = 119 total (was 102)
+1 mvnsite 121 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 3 The patch has no ill-formed XML file.
+1 shadedclient 654 patch has no errors when building and testing our client artifacts.
+1 javadoc 110 the patch passed
+1 findbugs 207 the patch passed
_ Other Tests _
+1 unit 514 hadoop-common in the patch passed.
+1 unit 289 hadoop-aws in the patch passed.
+1 asflicense 42 The patch does not generate ASF License warnings.
6859
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/2/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux b01e374367f9 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 23c0379
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/2/artifact/out/diff-checkstyle-root.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/2/testReport/
Max. process+thread count 1410 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/2/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@bgaborg

bgaborg commented Jun 12, 2019

I got 4 errors during verify against Ireland:

[ERROR]   ITestMagicCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestDirectoryCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestPartitionCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestStagingCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 67.914 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.staging.integration.ITestPartitionCommitMRJob
[ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITestPartitionCommitMRJob)  Time elapsed: 40.515 s  <<< ERROR!
java.io.FileNotFoundException: Path s3a://gabota-versioned-bucket-ireland/fork-0004/test/DELAY_LISTING_ME/testMRJob is recorded as deleted by S3Guard at 2019-06-12T12:47:07.966Z
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2857)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2827)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertIsDirectory(ContractTestUtils.java:559)
	at org.apache.hadoop.fs.contract.AbstractFSContractTestBase.assertIsDirectory(AbstractFSContractTestBase.java:327)
	at org.apache.hadoop.fs.s3a.commit.AbstractITCommitMRJob.testMRJob(AbstractITCommitMRJob.java:137)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)
[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 91.891 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.staging.integration.ITestStagingCommitMRJob
[ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITestStagingCommitMRJob)  Time elapsed: 38.71 s  <<< ERROR!
java.io.FileNotFoundException: Path s3a://gabota-versioned-bucket-ireland/fork-0008/test/DELAY_LISTING_ME/testMRJob is recorded as deleted by S3Guard at 2019-06-12T12:47:27.663Z
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2857)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2827)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertIsDirectory(ContractTestUtils.java:559)
	at org.apache.hadoop.fs.contract.AbstractFSContractTestBase.assertIsDirectory(AbstractFSContractTestBase.java:327)
	at org.apache.hadoop.fs.s3a.commit.AbstractITCommitMRJob.testMRJob(AbstractITCommitMRJob.java:137)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 67.194 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.staging.integration.ITestDirectoryCommitMRJob
[ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITestDirectoryCommitMRJob)  Time elapsed: 40.606 s  <<< ERROR!
java.io.FileNotFoundException: Path s3a://gabota-versioned-bucket-ireland/fork-0007/test/DELAY_LISTING_ME/testMRJob is recorded as deleted by S3Guard at 2019-06-12T12:48:25.435Z
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2857)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2827)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertIsDirectory(ContractTestUtils.java:559)
	at org.apache.hadoop.fs.contract.AbstractFSContractTestBase.assertIsDirectory(AbstractFSContractTestBase.java:327)
	at org.apache.hadoop.fs.s3a.commit.AbstractITCommitMRJob.testMRJob(AbstractITCommitMRJob.java:137)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 71.741 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitMRJob
[ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitMRJob)  Time elapsed: 43.631 s  <<< ERROR!
java.io.FileNotFoundException: Path s3a://gabota-versioned-bucket-ireland/fork-0008/test/testMRJob is recorded as deleted by S3Guard at 2019-06-12T12:48:54.008Z
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2857)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2827)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertIsDirectory(ContractTestUtils.java:559)
	at org.apache.hadoop.fs.contract.AbstractFSContractTestBase.assertIsDirectory(AbstractFSContractTestBase.java:327)
	at org.apache.hadoop.fs.s3a.commit.AbstractITCommitMRJob.testMRJob(AbstractITCommitMRJob.java:137)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)

@steveloughran
Contributor Author

@gabor, thanks for that. I have sometimes seen that failure on ITestMagicCommitMRJob, hence we now log when the path was deleted. What was the actual time when the test was run?

What I can do is add some extra diags in the operations where the committers update the DDB tables on commit, because this failure implies they didn't create an entry for the parent dir.

This all happens in finishedWrite(), which first calls the metastore's addAncestors(); in DDB this goes up the tree to find the first parent dir which is in the store, and stops there. Then, in the metastore.put() afterwards, we add the new file and its parents, skipping those which already have an entry.
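
A sketch of that walk as described (illustrative only: entryExists() and dirMetadata() are hypothetical helpers, not real methods):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.s3guard.MetadataStore;
import org.apache.hadoop.fs.s3a.s3guard.PathMetadata;

// Illustrative sketch of the ancestor walk described above; entryExists()
// and dirMetadata() are hypothetical helpers, not real Hadoop methods.
public abstract class AncestorWalkSketch {

  void addAncestors(MetadataStore store, Path path) throws IOException {
    List<PathMetadata> missing = new ArrayList<>();
    Path parent = path.getParent();
    // Walk up until the root, or the first parent already in the store.
    while (parent != null && !parent.isRoot()) {
      if (entryExists(store, parent)) {
        break;                          // stop at the first known ancestor
      }
      missing.add(dirMetadata(parent)); // a directory entry for this parent
      parent = parent.getParent();
    }
    store.put(missing);                 // the later put() skips these entries
  }

  abstract boolean entryExists(MetadataStore store, Path p) throws IOException;
  abstract PathMetadata dirMetadata(Path p);
}
```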

I wonder if we can/should do more here

  1. I'll add a check in addAncestors to throw a PathIOE if the ancestor scan finds a file. Let me know if you see it :)
  2. We should consider whether to do the addAncestors work at all, rather than just do the put() and have it create the entire ancestor tree instead of stopping the moment it finds a parent entry in the DDB. That would implement more recovery of inconsistent state, at the cost (over the entire bulk operation) of one more DDB write per directory-level entry and one fewer get for every parent which doesn't have an entry in the store.

@bgaborg bgaborg left a comment

Thanks @steveloughran for providing this massive contribution. Looks good overall, but I proposed some changes and added some notes.
Running the tests with dynamo and local I get the same issues:

[ERROR] Errors:
[ERROR]   ITestMagicCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestDirectoryCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestPartitionCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestStagingCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound

dstMetas = new ArrayList<>();
}
// TODO S3Guard HADOOP-13761: retries when source paths are not visible yet
// Validation completed: time to begin the operation.

Maybe it would be worth creating a separate method for the validation, and for the other parts of this method that could be moved and tested separately from innerRename. This method is huge, 300+ lines; splitting it up would help maintainability.

Contributor Author

That I will do. It might also line me up better for when I open up the rename/3 operation which will always throw an exception on any invalid state (rather than return "false" with no explanation)

Contributor Author

done. Not doing any tests on the validation alone, as the contract tests are expected to generate the failure conditions, but it does help isolate the two stages

* @return the store context of this FS.
*/
@InterfaceAudience.Private
public StoreContext createStoreContext() {

This is a lot of parameters for a constructor. I think it would be worth using the builder pattern for readability and maintainability.

Contributor Author

I must disagree. The builder pattern is best for when you want to have partial config or support change, where you don't want to add many, many constructors; it substitutes for Java's lack of named params in constructors (compare with Groovy, Scala, Python).

Here: all parameters must be supplied, and this is exclusively for use in the s3a connector. Nobody should be creating these elsewhere, and if they do, it's not my problem if it breaks.
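
To make the tradeoff concrete, a hypothetical contrast; neither class is from the patch:

```java
// Hypothetical classes purely to illustrate the argument above.
final class Ctx {
  final String bucket;
  final String username;

  // All-args constructor: adding a field breaks every call site at
  // compile time, forcing each use (production and test) to supply it.
  Ctx(String bucket, String username) {
    this.bucket = bucket;
    this.username = username;
  }

  // Builder alternative: call sites keep compiling when a field is added,
  // so a forgotten parameter only surfaces (if at all) at runtime.
  static final class Builder {
    private String bucket;
    private String username;
    Builder withBucket(String b) { bucket = b; return this; }
    Builder withUsername(String u) { username = u; return this; }
    Ctx build() { return new Ctx(bucket, username); }
  }
}
```

Which of those failure modes is preferable is exactly the disagreement here.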

String key,
S3AEncryptionMethods serverSideEncryptionAlgorithm,
String serverSideEncryptionKey,
String eTag,
String versionId) {
String versionId, final long len) {

nit: if it's one parameter per line we should keep it that way (put len on a new line).

Contributor Author

done. IDE refactoring at work

* No attempt to use a builder here as outside tests
* this should only be created in the S3AFileSystem.
*/
public StoreContext(

This would be worth a builder pattern; as this is new with this PR, it would scale well in the future.

Contributor Author

No, for the reasons as discussed.

  1. If we add more args, we want 100% of uses (production, test cases) to add every param. Doing that in the refactor operations of the IDE is straightforward.
  2. I'm thinking that as more FS-level operations are added (path to key, temp file...) I'd just add a new interface "FSLevelOperations" which we'd implement in S3A and for any test suite. This would avoid the need for a new field & constructor arg on every operation, though I'd still require code to call an explicit method for each such operation (i.e. no direct access to FSLevelOperations).

No need to do that now, precisely because this is all private; we can do that on the next iteration.

@@ -1474,6 +1474,18 @@ Caused by: java.lang.NullPointerException
... 1 more
```

### Error `Attempt to change a resource which is still in use: Table is being deleted`

As I see it, this is an entirely different issue, which should be resolved in a separate commit. Do you plan to do that instead of squashing all your changes into the same commit?

Contributor Author

I've been working on this patch for too long and am wrapping up stuff to try and get all the tests to work; a lot of work has gone into ITestDynamoDB there. I don't want to split them up this late in the patch. Sorry.

Contributor

Worth discussing, I think: I'm with @bgaborg on this. I've been on projects in the past that auto-reject conflated commits. Why not maintain a list of commits on your branch and commit them intact instead of squashing them? Git rebase -i and force push (your personal branch only) are your friends here. Gives you optimal commit history, and it's not that hard to manage IMO. Maybe try it out next time you are working on a patch set. I wouldn't force you to go break apart these commits this late in the game, though.

Contributor Author

I do have branches of the unsquashed PRs, so I should be able to do that. By popular request I will give it a go, but leave them in here. If people get the other PR in first, I'll deal with that.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
0 reexec 34 Docker mode activated.
_ Prechecks _
+1 dupname 2 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 32 new or modified test files.
_ trunk Compile Tests _
0 mvndep 20 Maven dependency ordering for branch
+1 mvninstall 1026 trunk passed
+1 compile 1006 trunk passed
+1 checkstyle 130 trunk passed
+1 mvnsite 109 trunk passed
+1 shadedclient 896 branch has no errors when building and testing our client artifacts.
+1 javadoc 82 trunk passed
0 spotbugs 63 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 174 trunk passed
_ Patch Compile Tests _
0 mvndep 23 Maven dependency ordering for patch
+1 mvninstall 77 the patch passed
+1 compile 963 the patch passed
+1 javac 963 the patch passed
-0 checkstyle 145 root: The patch generated 21 new + 100 unchanged - 2 fixed = 121 total (was 102)
+1 mvnsite 126 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 683 patch has no errors when building and testing our client artifacts.
+1 javadoc 94 the patch passed
+1 findbugs 200 the patch passed
_ Other Tests _
+1 unit 515 hadoop-common in the patch passed.
+1 unit 286 hadoop-aws in the patch passed.
+1 asflicense 53 The patch does not generate ASF License warnings.
6679
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/3/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux fc5a5141e6c7 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 3b31694
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/3/artifact/out/diff-checkstyle-root.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/3/testReport/
Max. process+thread count 1407 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/3/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@steveloughran steveloughran force-pushed the s3/HADOOP-15183-s3guard-rename-failures branch 2 times, most recently from 2fa4cb3 to da8e05a on June 12, 2019 20:02
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
0 reexec 33 Docker mode activated.
_ Prechecks _
+1 dupname 3 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 32 new or modified test files.
_ trunk Compile Tests _
0 mvndep 60 Maven dependency ordering for branch
+1 mvninstall 1057 trunk passed
+1 compile 1122 trunk passed
+1 checkstyle 147 trunk passed
+1 mvnsite 115 trunk passed
+1 shadedclient 902 branch has no errors when building and testing our client artifacts.
+1 javadoc 86 trunk passed
0 spotbugs 61 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 174 trunk passed
_ Patch Compile Tests _
0 mvndep 26 Maven dependency ordering for patch
+1 mvninstall 75 the patch passed
+1 compile 1041 the patch passed
+1 javac 1041 the patch passed
-0 checkstyle 146 root: The patch generated 11 new + 100 unchanged - 2 fixed = 111 total (was 102)
+1 mvnsite 116 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 3 The patch has no ill-formed XML file.
+1 shadedclient 621 patch has no errors when building and testing our client artifacts.
+1 javadoc 85 the patch passed
+1 findbugs 194 the patch passed
_ Other Tests _
+1 unit 525 hadoop-common in the patch passed.
+1 unit 287 hadoop-aws in the patch passed.
+1 asflicense 51 The patch does not generate ASF License warnings.
6875
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/5/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 8781dc2ab72a 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 1732312
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/5/artifact/out/diff-checkstyle-root.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/5/testReport/
Max. process+thread count 1463 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/5/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
0 reexec 32 Docker mode activated.
_ Prechecks _
+1 dupname 2 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 32 new or modified test files.
_ trunk Compile Tests _
0 mvndep 22 Maven dependency ordering for branch
+1 mvninstall 1030 trunk passed
+1 compile 1140 trunk passed
+1 checkstyle 136 trunk passed
+1 mvnsite 111 trunk passed
+1 shadedclient 911 branch has no errors when building and testing our client artifacts.
+1 javadoc 86 trunk passed
0 spotbugs 61 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 181 trunk passed
_ Patch Compile Tests _
0 mvndep 22 Maven dependency ordering for patch
+1 mvninstall 81 the patch passed
+1 compile 1043 the patch passed
+1 javac 1043 the patch passed
-0 checkstyle 141 root: The patch generated 11 new + 100 unchanged - 2 fixed = 111 total (was 102)
+1 mvnsite 107 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 604 patch has no errors when building and testing our client artifacts.
+1 javadoc 88 the patch passed
+1 findbugs 198 the patch passed
_ Other Tests _
+1 unit 515 hadoop-common in the patch passed.
+1 unit 288 hadoop-aws in the patch passed.
+1 asflicense 38 The patch does not generate ASF License warnings.
6792
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/6/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 03025a03bd22 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 1732312
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/6/artifact/out/diff-checkstyle-root.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/6/testReport/
Max. process+thread count 1463 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/6/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 45 Docker mode activated.
_ Prechecks _
+1 dupname 2 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 32 new or modified test files.
_ trunk Compile Tests _
0 mvndep 61 Maven dependency ordering for branch
+1 mvninstall 1141 trunk passed
+1 compile 1584 trunk passed
+1 checkstyle 294 trunk passed
+1 mvnsite 392 trunk passed
+1 shadedclient 2026 branch has no errors when building and testing our client artifacts.
+1 javadoc 328 trunk passed
0 spotbugs 83 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 417 trunk passed
_ Patch Compile Tests _
0 mvndep 22 Maven dependency ordering for patch
+1 mvninstall 81 the patch passed
+1 compile 1593 the patch passed
+1 javac 1593 the patch passed
-0 checkstyle 288 root: The patch generated 11 new + 100 unchanged - 2 fixed = 111 total (was 102)
+1 mvnsite 420 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 7 The patch has no ill-formed XML file.
+1 shadedclient 1244 patch has no errors when building and testing our client artifacts.
+1 javadoc 342 the patch passed
+1 findbugs 286 the patch passed
_ Other Tests _
-1 unit 551 hadoop-common in the patch failed.
+1 unit 293 hadoop-aws in the patch passed.
+1 asflicense 48 The patch does not generate ASF License warnings.
11225
Reason Tests
Failed junit tests hadoop.ha.TestZKFailoverControllerStress
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/4/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 334677821fa6 4.4.0-143-generic #169~14.04.2-Ubuntu SMP Wed Feb 13 15:00:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 1732312
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/4/artifact/out/diff-checkstyle-root.txt
unit https://builds.apache.org/job/hadoop-multibranch/job/PR-951/4/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/4/testReport/
Max. process+thread count 1379 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/4/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@ajfabbri ajfabbri (Contributor) left a comment

Still a couple more files to look at... here are my comments so far.

/**
* Saves metadata for exactly one path, potentially
* using any bulk operation state to eliminate duplicate work.
*
Contributor

So, if operationState is not null, may implementations defer the write to the metastore until a later time, or must they write immediately but be allowed to elide subsequent "duplicate" writes? I don't think the "when is it durable" contract is super important currently, but it might be worth clarifying if you roll another version of the patch.

Contributor Author

None of the metastores are doing anything with delayed operations, just tracking them. Clarified in the javadocs.


@Override
public int compare(Path pathL, Path pathR) {
// exist fast on equal values.
Contributor

nit: "exit fast"

Contributor Author

done

}
if (compare > 0) {
return -1;
}
Contributor

Could this function just be return -super.compare(pathL, pathR)?

Contributor Author

You'd think so, but when I tried it my sort tests were failing, and even stepping through the code with the debugger I couldn't work out why. Doing it like this fixed the tests.
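
One plausible, unconfirmed explanation for the plain negation failing: a comparator may legally return Integer.MIN_VALUE, and in Java -Integer.MIN_VALUE overflows back to Integer.MIN_VALUE, so the sign never flips. A small demonstration:

```java
import java.util.Comparator;

public class NegatedComparatorPitfall {
  public static void main(String[] args) {
    int lessThan = Integer.MIN_VALUE;  // a legal "less than" result
    int negated = -lessThan;           // integer overflow: still MIN_VALUE
    System.out.println(negated > 0);   // prints false: the order never reversed

    // Safer reversals: explicit branches (as in the patch), swapping the
    // arguments, or Comparator.reversed().
    Comparator<Integer> reversed = Comparator.<Integer>naturalOrder().reversed();
    System.out.println(reversed.compare(1, 2)); // positive: 1 sorts after 2
  }
}
```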

}
}

// outside the lock, the entriesToAdd list has all new entries to create.
Contributor

which chunks of data is this lock protecting? Since these are vanilla Lists you need a lock to read as well or you get undefined results, right?

Contributor Author

This is just a variable, the list of new entries to add. I exit the synchronized(this) block so that the move call doesn't block.
Looking at it, I think I'll add DurationInfo around it, and some more comments as to what is happening.
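
A sketch of that pattern, with illustrative names rather than the patch's own: collect and snapshot under the lock, then make the slow store call outside it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Illustrative collect-under-lock pattern; not the actual tracker code.
public class LockedBatcher<T> {

  public interface BatchSink<T> {
    void put(List<T> batch) throws IOException;
  }

  private final List<T> pending = new ArrayList<>();

  public void addAndFlush(Collection<T> newEntries, BatchSink<T> sink)
      throws IOException {
    List<T> snapshot;
    synchronized (this) {
      pending.addAll(newEntries);
      snapshot = new ArrayList<>(pending); // copy while holding the lock
      pending.clear();
    }
    // The slow store call runs outside the lock, so other threads adding
    // entries are not blocked behind the I/O.
    sink.put(snapshot);
  }
}
```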

// and finish off; including deleting source directories.
// TODO: is this making too many calls?
LOG.debug("Rename completed for {}", this);
getMetadataStore().move(pathsToDelete, destMetas, getOperationState());
Contributor

No lock here on a read of pathsToDelete, etc. Per the previous comment, I want to understand which data you are guarding with the lock so we can ensure we have coverage.

* Originally it implemented the logic to probe for an add ancestors,
* but with the addition of a store-specific bulk operation state
* it became unworkable.
*
Contributor

I wouldn't hold up this patch for it, but I am curious how this became unworkable. I imagine de-duping metadata writes (ancestors) using a bulk context, such that S3A repeating ancestor writes within an op is not an issue.

Again, always trying to keep MetadataStore as simple as possible and specific to logging metadata operations on a FileSystem.


@ajfabbri ajfabbri (Contributor) left a comment

Ok I think I've gone through everything. I could spend more time meditating on the new rename code but these are my comments so far. Most are nits or discussion for fun, but there was one question about synchronization that I'd like clarification on.

* lowest entry first.
*
* This is to ensure that if a client failed partway through the update,
* there will no entries in the table which lack parent entries.
Contributor

Nice use of topological sorting here.

Contributor Author

thanks. We do need it, just for the extra resilience
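
A small sketch of the property being praised: writing entries shallowest-first means a client failing partway through never leaves a child entry without its ancestors.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.fs.Path;

public class ParentFirstOrder {
  public static void main(String[] args) {
    List<Path> updates = new ArrayList<>();
    updates.add(new Path("/a/b/c/file"));
    updates.add(new Path("/a"));
    updates.add(new Path("/a/b/c"));
    updates.add(new Path("/a/b"));

    // Sort by depth so every parent is written before any of its children.
    updates.sort(Comparator.comparingInt(Path::depth));
    System.out.println(updates); // [/a, /a/b, /a/b/c, /a/b/c/file]
  }
}
```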

* An attempt is made in {@link #deleteTestDirInTeardown()} to prune these test
* files.
*/
@SuppressWarnings("ThrowableNotThrown")
Contributor

(screenshot attached in the original comment)

assertIsDirectory(readOnlyDir);
Path renamedDestPath = new Path(readOnlyDir, writableDir.getName());
assertRenameOutcome(roleFS, writableDir, readOnlyDir, true);
assertIsFile(renamedDestPath);
Contributor

These tests seem to accomplish what you need. Did you think of reaching below into the MetadataStore (via S3AFileSystem.getMetadataStore()) and asserting on its state after failures? I don't see a specific need (you are getting good coverage through the FS)... just wondering if there are additional cases we could expose that way.

Contributor Author

no, didn't actually. good point though

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
0 reexec 34 Docker mode activated.
_ Prechecks _
+1 dupname 2 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 32 new or modified test files.
_ trunk Compile Tests _
0 mvndep 58 Maven dependency ordering for branch
+1 mvninstall 1117 trunk passed
+1 compile 1046 trunk passed
+1 checkstyle 138 trunk passed
+1 mvnsite 125 trunk passed
+1 shadedclient 944 branch has no errors when building and testing our client artifacts.
+1 javadoc 97 trunk passed
0 spotbugs 68 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 188 trunk passed
_ Patch Compile Tests _
0 mvndep 23 Maven dependency ordering for patch
+1 mvninstall 78 the patch passed
+1 compile 989 the patch passed
+1 javac 989 the patch passed
-0 checkstyle 143 root: The patch generated 11 new + 100 unchanged - 2 fixed = 111 total (was 102)
+1 mvnsite 120 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 691 patch has no errors when building and testing our client artifacts.
+1 javadoc 105 the patch passed
+1 findbugs 189 the patch passed
_ Other Tests _
+1 unit 496 hadoop-common in the patch passed.
+1 unit 292 hadoop-aws in the patch passed.
+1 asflicense 38 The patch does not generate ASF License warnings.
6959
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/7/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 61d3cd8d79bb 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / bcfd228
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/7/artifact/out/diff-checkstyle-root.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/7/testReport/
Max. process+thread count 1463 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/7/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.


// check if UNGUARDED_FLAG is passed and use NullMetadataStore in
// config to avoid side effects like creating the table if not exists
Configuration conf0 = getConf();
Contributor

Please rename this to unguardedConf or something like that

Contributor Author

done.

This is commit a22fc98 merged atop trunk and the HADOOP-16279 OOB Delete JIRA.

* OperationState as arg to the put operations
* Still some tuning/review of AddAncestors needed from where it was pushed into the metastore (so it can use/update the ancestors).

Change-Id: Idf34e5c7e88a765aa0aeadccd4f9bffdc8bca420
Change-Id: I5b0e5f0991cd26429f5ab9463073f79220ac9bd2
@steveloughran
Contributor Author

I'm going to close this and re-open a new patch with everything merged atop the OOB patch. It's not that they conflict functionality-wise; it's just that, as they both pass a param down to the metastore put operations, they create merge conflicts.

FWIW, I'm now unsure why the TTL needs to go down, rather than be set during the init phase.

…ameOperation class, as requested.

This does provide just one place to look at the code. There are eight methods in S3AFileSystem used during the rename; I've created an interface for these and an inner class in S3AFS which bridges to them.

Looking at the methods, you can see what we should export in a lower layer in the S3AFS refactoring; ideally these should all be one level down.
This takes most of the new code out of the S3AFS, though the new callback interface adds some again. What is key is that:

1. The new algorithm for renaming is in its own class, with src and dest params all as final fields.
2. It is broken up into separate methods for file and dir rename.
3. It has helper methods for queuing operations etc.

The StoreContext has backed off from lambda-expressions to invoke S3AFS ops, as there were getting to be too many of them, moving to an interface and, again, an implementation: ContextAccessors.
This adds some more code in the S3AFS, but it makes it easier to see how the methods are being used, while still allowing tests to provide their own mock implementation class.
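
A hedged sketch of the interface-over-lambdas shape; the method names are illustrative guesses at the kind of FS-level operations involved:

```java
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.fs.Path;

// Illustrative only: a narrow set of S3AFileSystem operations exposed to
// helper classes; S3AFS implements it in production, tests can stub it.
public interface ContextAccessors {

  /** Convert a path to an S3 object key. */
  String pathToKey(Path path);

  /** Create a local temporary file for buffering data. */
  File createTempFile(String prefix, long size) throws IOException;
}
```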

+ InternalConstants class for internal constants
+ Move the FunctionsRaisingIOE over to fs.impl. With the move to interfaces over lambda-expressions these aren't being used so much, but they can be picked up elsewhere. Marked as private/unstable.
* Slightly better diags for the AbstractCommitTerasortITests on a missing _SUCCESS marker, and a cleanup operation at the end to delete the files (the normal per-fork paths aren't used, so they would otherwise avoid deletion).
* AbstractStoreOperation implements getConf() as it's that common to use.

Change-Id: I9e4420d343fb87422779c11ce89fe7710edd180c
@steveloughran steveloughran force-pushed the s3/HADOOP-15183-s3guard-rename-failures branch from a22fc98 to dbdbfe8 on June 17, 2019 19:58
@steveloughran
Contributor Author

OK, I unintentionally force-pushed rather than closed. Hopefully that won't be too disruptive. If need be I can switch to a new PR.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 45 Docker mode activated.
_ Prechecks _
+1 dupname 2 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 34 new or modified test files.
_ trunk Compile Tests _
0 mvndep 70 Maven dependency ordering for branch
+1 mvninstall 1163 trunk passed
+1 compile 1062 trunk passed
+1 checkstyle 151 trunk passed
+1 mvnsite 125 trunk passed
+1 shadedclient 1087 branch has no errors when building and testing our client artifacts.
+1 javadoc 95 trunk passed
0 spotbugs 137 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 291 trunk passed
-0 patch 199 Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
0 mvndep 33 Maven dependency ordering for patch
+1 mvninstall 89 the patch passed
+1 compile 1268 the patch passed
+1 javac 1268 the patch passed
-0 checkstyle 157 root: The patch generated 40 new + 109 unchanged - 2 fixed = 149 total (was 111)
+1 mvnsite 129 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 783 patch has no errors when building and testing our client artifacts.
-1 javadoc 33 hadoop-tools_hadoop-aws generated 4 new + 1 unchanged - 0 fixed = 5 total (was 1)
+1 findbugs 204 the patch passed
_ Other Tests _
+1 unit 548 hadoop-common in the patch passed.
+1 unit 304 hadoop-aws in the patch passed.
+1 asflicense 53 The patch does not generate ASF License warnings.
7848
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/8/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux a9f8d5df3c95 4.4.0-144-generic #170~14.04.1-Ubuntu SMP Mon Mar 18 15:02:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 3d020e9
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/8/artifact/out/diff-checkstyle-root.txt
javadoc https://builds.apache.org/job/hadoop-multibranch/job/PR-951/8/artifact/out/diff-javadoc-javadoc-hadoop-tools_hadoop-aws.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/8/testReport/
Max. process+thread count 1715 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/8/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

…d from a get.

This is to debug why some of the root contract tests are failing. I'm going to switch to trunk to see if it has the same issue.

Change-Id: I3d7a177495e83880a179bc76ab81c8dfbaf0a53e
@steveloughran
Contributor Author

Last little refactoring caused checkstyle issues. Will fix:

./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/InternalConstants.java:24:public class InternalConstants {:1: Utility classes should not have a public or default constructor. [HideUtilityClassConstructor]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:153:  public RenameOperation(:10: More than 7 parameters (found 8). [ParameterNumber]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:399:      final S3ALocatedFileStatus sourceStatus,:34: 'sourceStatus' hides a field. [HiddenField]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:438:  private Path copySourceAndUpdateTracker(:16: More than 7 parameters (found 8). [ParameterNumber]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:439:      final RenameTracker renameTracker,:27: 'renameTracker' hides a field. [HiddenField]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:440:      final Path sourcePath,:18: 'sourcePath' hides a field. [HiddenField]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:444:      final Path destPath,:18: 'destPath' hides a field. [HiddenField]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:445:      final String destKey,:20: 'destKey' hides a field. [HiddenField]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:487:      final RenameTracker renameTracker,:27: 'renameTracker' hides a field. [HiddenField]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:488:      final List<DeleteObjectsRequest.KeyVersion> keysToDelete,:51: 'keysToDelete' hides a field. [HiddenField]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:489:      final List<Path> pathsToDelete):24: 'pathsToDelete' hides a field. [HiddenField]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:541:        final Path path,:9: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:542:        final String eTag,:9: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:543:        final String versionId,:9: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:544:        final long len);:9: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:552:        final S3AFileStatus fileStatus);:9: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:561:        final FileStatus fileStatus);:9: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:563:    /**: First sentence should end with a period. [JavadocStyle]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:633:        final List<DeleteObjectsRequest.KeyVersion> keysToDelete,:9: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:634:        final boolean deleteFakeDir,:9: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java:635:        final List<Path> undeletedObjectsOnFailure):9: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/StoreContext.java:119:  public StoreContext(:10: More than 7 parameters (found 17). [ParameterNumber]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/S3Guard.java:50:import org.apache.hadoop.fs.s3a.Tristate;:8: Unused import - org.apache.hadoop.fs.s3a.Tristate. [UnusedImports]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/S3Guard.java:525:   * {@link MetadataStore#addAncestors(Path, ITtlTimeProvider, BulkOperationState)}.: Line is longer than 80 characters (found 84). [LineLength]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/S3GuardTool.java:1163:        clearBucketOption(unguardedConf, fsURI.getHost(), S3_METADATA_STORE_IMPL);: Line is longer than 80 characters (found 82). [LineLength]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/commit/terasort/AbstractCommitTerasortIT.java:243:  public void test_200_directory_deletion() throws Throwable {:15: Name 'test_200_directory_deletion' must match pattern '^[a-z][a-zA-Z0-9]*$'. [MethodName]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/impl/TestPartialDeleteFailures.java:74:  private static final ContextAccessors contextAccessors:41: Name 'contextAccessors' must match pattern '^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$'. [ConstantName]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/impl/TestPartialDeleteFailures.java:349:    public BulkOperationState initiateBulkWrite(final BulkOperationState.OperationType operation,: Line is longer than 80 characters (found 97). [LineLength]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/s3guard/ITestDynamoDBMetadataStore.java:988:          () -> ddbms.prune(PruneMode.ALL_BY_MODTIME,0));:53: ',' is not followed by whitespace. [WhitespaceAfter]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/s3guard/ITestDynamoDBMetadataStore.java:1169:        null );:14: ')' is preceded with whitespace. [ParenPad]

* The add-ancestors code is now pushed down into the stores, so they can use any bulk state tracking (see the usage sketch below).
* Which also means that the stores need to do the patching of TTL values (now done).
* But they also need to do it in put when completing ancestors there (not done).
* And the state tracking in DDB, with the addAncestors integration, doesn't overwrite grandparent paths which have been replaced with a tombstone.
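
As referenced in the first bullet, the usage pattern is roughly the following sketch. The put(entry, state) overload, the OperationType.Rename constant, and the state being Closeable are assumptions for illustration; initiateBulkWrite mirrors the signature visible in the checkstyle output above:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.s3a.s3guard.BulkOperationState;
    import org.apache.hadoop.fs.s3a.s3guard.MetadataStore;
    import org.apache.hadoop.fs.s3a.s3guard.PathMetadata;

    class BulkWriteSketch {
      static void bulkPut(MetadataStore store, Path dest,
          List<PathMetadata> entries) throws IOException {
        // request the store-specific state before the bulk work begins
        BulkOperationState state = store.initiateBulkWrite(
            BulkOperationState.OperationType.Rename, dest);
        try {
          for (PathMetadata entry : entries) {
            // the store consults 'state' to skip ancestors already written
            store.put(entry, state);
          }
        } finally {
          state.close();   // assuming the state is Closeable
        }
      }
    }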

Change-Id: I4009ed5ef03549453db2e8c7389903e4c66114a8
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 513 Docker mode activated.
_ Prechecks _
+1 dupname 3 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 34 new or modified test files.
_ trunk Compile Tests _
0 mvndep 68 Maven dependency ordering for branch
+1 mvninstall 1021 trunk passed
+1 compile 1104 trunk passed
+1 checkstyle 141 trunk passed
+1 mvnsite 117 trunk passed
+1 shadedclient 970 branch has no errors when building and testing our client artifacts.
+1 javadoc 91 trunk passed
0 spotbugs 63 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 185 trunk passed
-0 patch 95 Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
0 mvndep 23 Maven dependency ordering for patch
+1 mvninstall 78 the patch passed
+1 compile 1065 the patch passed
+1 javac 1065 the patch passed
-0 checkstyle 144 root: The patch generated 18 new + 107 unchanged - 4 fixed = 125 total (was 111)
+1 mvnsite 108 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 657 patch has no errors when building and testing our client artifacts.
-1 javadoc 27 hadoop-tools_hadoop-aws generated 4 new + 1 unchanged - 0 fixed = 5 total (was 1)
+1 findbugs 208 the patch passed
_ Other Tests _
+1 unit 543 hadoop-common in the patch passed.
+1 unit 285 hadoop-aws in the patch passed.
+1 asflicense 48 The patch does not generate ASF License warnings.
7472
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/9/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 9e5c25f7ae9b 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / b14f056
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/9/artifact/out/diff-checkstyle-root.txt
javadoc https://builds.apache.org/job/hadoop-multibranch/job/PR-951/9/artifact/out/diff-javadoc-javadoc-hadoop-tools_hadoop-aws.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/9/testReport/
Max. process+thread count 1393 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/9/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

* DynamoDB.completeAncestry() sets the TTL on entries it creates.
* DynamoDB.addAncestors() doesn't just stop at the first entry; it goes up the tree.
  When it finds a tombstone or a missing entry as the parent of a valid entry, it logs at WARN and adds it to the list.

As a result, there are now O(depth) GETs in every finishedWrite, where before it stopped at the first entry (bad);
but completeAncestry() was doing O(depth) PUTs anyway.

S3AFileSystem.finishedWrite() calls addAncestors before put().

For a bulk commit, because a bulk operation spans the add and the put, there's no duplication, and the cost of a write is lower:
it's O(depth) GETs, with PUT operations only for those files which need them.

For a normal single-file write we can do the same by creating a temporary bulk-write instance and using it purely for that single operation.
It does seem overkill, but it lets us glue together both operations in the sequence, which is the whole point.
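
A worked example with illustrative numbers, assuming the cost model depth * GET + (1 + missing parents) * PUT: writing /a/b/c/file (four entries deep) when only /a/b/c is missing from the table:

    before:  completeAncestry issued ~4 PUTs (the file plus every ancestor),
             plus whatever GETs/PUTs addAncestors did separately
    after:   addAncestors walks the tree once        -> 4 GETs
             put writes the file and /a/b/c only     -> 2 PUTs
             total: 4 GET + 2 PUT instead of ~4-5 PUT plus extra GETs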

Change-Id: I48ad7b2657b0d14fffc0318934af73bde8368482
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 31 Docker mode activated.
_ Prechecks _
+1 dupname 2 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 34 new or modified test files.
_ trunk Compile Tests _
0 mvndep 22 Maven dependency ordering for branch
+1 mvninstall 1024 trunk passed
+1 compile 1071 trunk passed
+1 checkstyle 144 trunk passed
+1 mvnsite 130 trunk passed
+1 shadedclient 1010 branch has no errors when building and testing our client artifacts.
+1 javadoc 100 trunk passed
0 spotbugs 67 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 182 trunk passed
-0 patch 107 Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
0 mvndep 23 Maven dependency ordering for patch
+1 mvninstall 79 the patch passed
+1 compile 1024 the patch passed
+1 javac 1024 the patch passed
-0 checkstyle 140 root: The patch generated 19 new + 107 unchanged - 4 fixed = 126 total (was 111)
+1 mvnsite 126 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 3 The patch has no ill-formed XML file.
+1 shadedclient 688 patch has no errors when building and testing our client artifacts.
-1 javadoc 55 hadoop-tools_hadoop-aws generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1)
+1 findbugs 206 the patch passed
_ Other Tests _
+1 unit 519 hadoop-common in the patch passed.
+1 unit 292 hadoop-aws in the patch passed.
+1 asflicense 52 The patch does not generate ASF License warnings.
7025
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/10/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux db3ed9314a41 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 37bd5bb
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/10/artifact/out/diff-checkstyle-root.txt
javadoc https://builds.apache.org/job/hadoop-multibranch/job/PR-951/10/artifact/out/diff-javadoc-javadoc-hadoop-tools_hadoop-aws.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/10/testReport/
Max. process+thread count 1413 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/10/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

…d ancestors and put call are integrated

finishedWrite() now creates a BulkUpdate if one wasn't already present and closes it afterwards. This is to ensure that the findings of the addAncestors call are used in the putAndReturn call, which will not issue PUT requests for entries we know exist. Makes the DDB cost of writing a single file depth * GET + (1 + missing parent count) * PUT. Before: depth * PUT, as well as extra GET/PUT calls in addAncestors. PUTs cost more than GETs, so this is a net saving.

The failing test ITestCommitOperations was tracked down to clock skew triggering a writeback of the getFileStatus result on the probes after the first commit, causing an intermittent failure in parallel test runs (more load means worse skew).

Filed HADOOP-16382 for the underlying issue; for now simply resetting the MetricDiff counter after the various probes.

Change-Id: I85f60bc517cb0ae683961607f1f48b6f35a7004b
@steveloughran
Contributor Author

Latest patch: doing full matrix of test runs (s3guard/non, local/ddb, auth/non-auth)

S3AFileSystem.finishedWrite() now initiates a BulkUpdate if one wasn't already present and closes it afterwards. This is to ensure that the findings of the addAncestors call are used in the putAndReturn call, which will not issue PUT requests for entries we know exist. This makes the DDB cost of writing a single file depth * GET + (1 + missing parent count) * PUT. Before: depth * PUT, as well as extra GET/PUT calls in addAncestors. PUTs cost more than GETs, so this is a net saving.

The failing test ITestCommitOperations was tracked down to clock skew triggering a writeback of the getFileStatus result on the probes after the first commit, causing an intermittent failure in parallel test runs (more load means worse skew).

Filed HADOOP-16382 for the underlying issue; for now simply resetting the MetricDiff counter after the various probes.

I'm reaching the point where I can't see any more issues and really need the insight/approval/criticism of others. In particular:

  1. Is the ancestor tracking efficient and yet sufficient? It aims to eliminate the many spurious parent entries put in s3a commit operations and in parallel renames, as well as in simple file writes.
  2. Does the metadata update strategy in ProgressiveRenameTracker hold together?
  3. Is the rename algorithm in org.apache.hadoop.fs.s3a.impl.RenameOperation understandable and correct?

Feedback strongly encouraged.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
0 reexec 31 Docker mode activated.
_ Prechecks _
+1 dupname 2 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 34 new or modified test files.
_ trunk Compile Tests _
0 mvndep 45 Maven dependency ordering for branch
+1 mvninstall 1084 trunk passed
+1 compile 1154 trunk passed
+1 checkstyle 144 trunk passed
+1 mvnsite 118 trunk passed
+1 shadedclient 971 branch has no errors when building and testing our client artifacts.
+1 javadoc 87 trunk passed
0 spotbugs 62 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 187 trunk passed
-0 patch 95 Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
0 mvndep 24 Maven dependency ordering for patch
+1 mvninstall 82 the patch passed
+1 compile 1068 the patch passed
+1 javac 1068 the patch passed
-0 checkstyle 140 root: The patch generated 16 new + 107 unchanged - 4 fixed = 123 total (was 111)
+1 mvnsite 118 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 688 patch has no errors when building and testing our client artifacts.
+1 javadoc 90 the patch passed
+1 findbugs 211 the patch passed
_ Other Tests _
+1 unit 545 hadoop-common in the patch passed.
+1 unit 282 hadoop-aws in the patch passed.
+1 asflicense 43 The patch does not generate ASF License warnings.
7126
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/11/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 223b95d9426e 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / d3ac516
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/11/artifact/out/diff-checkstyle-root.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/11/testReport/
Max. process+thread count 1393 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/11/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

* Minimising the diff between trunk and the branch.
* completeAncestry doesn't break on the first ancestor found; it continues up the path (see the sketch below).

This is due diligence: I haven't encountered problems arising from not doing this; I'm just making sure those parents exist. Operations now span more than one write, and the normal file write includes the addAncestors check, which builds up the same list, complete with probes for whether the files actually exist.
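
The walk up the tree looks roughly like this sketch (illustrative; Path.getParent() returns null once past the root):

    import org.apache.hadoop.fs.Path;

    class AncestorWalkSketch {
      static void walkUp(Path path) {
        // keep climbing past the first known ancestor rather than breaking
        for (Path parent = path.getParent();
             parent != null && !parent.isRoot();
             parent = parent.getParent()) {
          // probe the entry; create it (and warn) if missing or tombstoned
        }
      }
    }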

Change-Id: I2edf9de75ea2546de7f97322ee0bcf838dd7591b
@steveloughran
Contributor Author

Need to create the TTL time provider in the DB for a non-FS init, or else you get an NPE in the CLI prune:

~/P/h/h/t/hadoop-3.3.0-SNAPSHOT (s3/HADOOP-15183-s3guard-rename-failures ⚡↩☡+) bin/hadoop s3guard prune s3a://hwdev-steve-ireland-new/
2019-06-19 18:12:59,376 [main] INFO  s3guard.S3GuardTool (S3GuardTool.java:initMetadataStore(320)) - Metadata store DynamoDBMetadataStore{region=eu-west-1, tableName=hwdev-steve-ireland-new, tableArn=arn:aws:dynamodb:eu-west-1:980678866538:table/hwdev-steve-ireland-new} is initialized.
java.lang.NullPointerException
	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.completeAncestry(DynamoDBMetadataStore.java:858)
	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.innerPut(DynamoDBMetadataStore.java:1226)
	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.removeAuthoritativeDirFlag(DynamoDBMetadataStore.java:1569)
	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.innerPrune(DynamoDBMetadataStore.java:1497)
	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.prune(DynamoDBMetadataStore.java:1461)
	at org.apache.hadoop.fs.s3a.s3guard.S3GuardTool$Prune.run(S3GuardTool.java:1094)
	at org.apache.hadoop.fs.s3a.s3guard.S3GuardTool.run(S3GuardTool.java:400)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.fs.s3a.s3guard.S3GuardTool.run(S3GuardTool.java:1659)
	at org.apache.hadoop.fs.s3a.s3guard.S3GuardTool.main(S3GuardTool.java:1668)
2019-06-19 18:12:59,945 [main] INFO  util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status -1: java.lang.NullPointerException
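
The fix amounts to something like the sketch below; whether S3Guard.TtlTimeProvider can be constructed straight from a Configuration is an assumption here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.s3a.s3guard.ITtlTimeProvider;
    import org.apache.hadoop.fs.s3a.s3guard.S3Guard;

    class CliInitSketch {
      private ITtlTimeProvider ttlTimeProvider;

      void initMetadataStore(Configuration conf) {
        if (ttlTimeProvider == null) {
          // CLI/non-FS path: no S3AFileSystem supplied a provider, so build one
          ttlTimeProvider = new S3Guard.TtlTimeProvider(conf);
        }
      }
    }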

Contributor

@mackrorysd mackrorysd left a comment

I have some comments & questions attached, and I have some opinions about how RenameOperation could be better. I'm dreading the backport pain that factoring rename() out is going to cause, but it's only going to get worse, so let's do it! I'm +1 to committing if there's nothing else blocking it; I think it's time to move on and address everything else independently.

@@ -418,9 +434,11 @@ private void initThreadPools(Configuration conf) {
unboundedThreadPool = new ThreadPoolExecutor(
maxThreads, Integer.MAX_VALUE,
Contributor

This is where I had suggested we drop the first argument to 0 (for boundedThreadPool too) as that's core threads, not max threads. Otherwise we actually lock ourselves at the maximum and grow from there. Only if you rebuild and retest anyway - otherwise I'll submit a patch once this is in to avoid further conflicts...
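
For reference, a sketch of the distinction (not the actual S3AFileSystem code): ThreadPoolExecutor's first constructor argument is the core pool size, which the pool keeps alive even when idle unless core-thread timeout is enabled.

    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    class PoolSizingSketch {
      ThreadPoolExecutor unbounded(int maxThreads) {
        return new ThreadPoolExecutor(
            0,                       // corePoolSize: the suggestion; was maxThreads
            Integer.MAX_VALUE,       // maximumPoolSize
            60L, TimeUnit.SECONDS,   // idle threads above core are reclaimed
            new SynchronousQueue<>());
      }
    }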

@@ -689,6 +707,7 @@ public String getBucketLocation() throws IOException {
* @return the region in which a bucket is located
* @throws IOException on any failure.
*/
@VisibleForTesting
Contributor

... I could kinda see clients wanting to use this. I know for HBOSS I almost did, when I was toying with a potential DynamoDB locking implementation.

Contributor Author

The new StoreContext API exports this as an on-demand operation: fs.createStoreContext().getBucketLocation()
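
That is, something like this sketch; the cast and the configuration plumbing are illustrative:

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.s3a.S3AFileSystem;

    class BucketLocationSketch {
      static String bucketLocation(URI uri, Configuration conf)
          throws IOException {
        S3AFileSystem fs = (S3AFileSystem) FileSystem.get(uri, conf);
        return fs.createStoreContext().getBucketLocation();
      }
    }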

public void move(
@Nullable Collection<Path> pathsToDelete,
@Nullable Collection<PathMetadata> pathsToCreate,
ITtlTimeProvider ttlTimeProvider,
Contributor

I still don't immediately seeing ttlTimeProvider being used everywhere it's passed. Did we get to the bottom of that? Now's the time to remove it, maybe.


Yes, we did. I started a discussion about whether we want to pass it in the metastore init instead of to every method which will use it. We ended up passing it to every method instead of the init method. We need to fix that; if it is not fixed in this PR I will fix it tomorrow under a new issue.

Contributor

Makes sense. If you need to override it for a single test operation, you can always add a @VisibleForTesting setter.
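
For example, a sketch of such a setter; the holder class and method name are hypothetical:

    import com.google.common.annotations.VisibleForTesting;
    import org.apache.hadoop.fs.s3a.s3guard.ITtlTimeProvider;

    class TimeSourceHolder {
      private ITtlTimeProvider ttlTimeProvider;

      @VisibleForTesting
      void setTtlTimeProvider(ITtlTimeProvider provider) {
        this.ttlTimeProvider = provider;
      }
    }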

@@ -899,6 +915,9 @@ public void addAncestors(
// a directory entry will go in.
PathMetadata directory = get(parent);
if (directory == null || directory.isDeleted()) {
if (entryFound) {
LOG.warn("Inconsistent S3Guard table: adding directory {}", parent);
}
Contributor

@ajfabbri ajfabbri Jun 19, 2019

Interesting change to this function: slower but more robust (the removed break below, that is, not this log message).

Contributor

Also, we might as well do the depth(path) GET operations in parallel if they always happen, given the break behaviour you removed is not configurable. In terms of write latency it would remove depth(path)-1 round trips (approx.). Proposing this as a follow-up JIRA, not doing it here.
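
A hedged sketch of what that follow-up could look like, assuming the MetadataStore.get(Path) call and an executor to run the lookups on:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.s3a.s3guard.MetadataStore;
    import org.apache.hadoop.fs.s3a.s3guard.PathMetadata;

    class ParallelAncestorGets {
      static List<PathMetadata> getAll(MetadataStore store,
          List<Path> ancestors, ExecutorService executor) {
        List<CompletableFuture<PathMetadata>> futures = new ArrayList<>();
        for (Path p : ancestors) {
          futures.add(CompletableFuture.supplyAsync(() -> {
            try {
              return store.get(p);
            } catch (IOException e) {
              throw new UncheckedIOException(e);
            }
          }, executor));
        }
        // one combined wait instead of depth(path) sequential round trips
        List<PathMetadata> results = new ArrayList<>();
        for (CompletableFuture<PathMetadata> f : futures) {
          results.add(f.join());
        }
        return results;
      }
    }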


// the maximum number of tasks cached if all threads are already uploading
public static final String MAX_TOTAL_TASKS = "fs.s3a.max.total.tasks";


Nit: remove empty line

…e a time source

Change-Id: Ic3ec71dc1d4c7bef4866ca4d598c20aba4e17575
Contributor

@ajfabbri ajfabbri left a comment

Changes since last review LGTM. +1 overall.

@Nullable ITtlTimeProvider ttlTimeProvider) {
return ttlTimeProvider != null ? ttlTimeProvider : timeProvider;
}

Contributor

As we discussed w/ @bgaborg @mackrorysd this will go away soon, and is fine for now.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
0 reexec 1490 Docker mode activated.
_ Prechecks _
+1 dupname 2 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 34 new or modified test files.
_ trunk Compile Tests _
0 mvndep 91 Maven dependency ordering for branch
+1 mvninstall 1278 trunk passed
+1 compile 1403 trunk passed
+1 checkstyle 149 trunk passed
+1 mvnsite 147 trunk passed
+1 shadedclient 1093 branch has no errors when building and testing our client artifacts.
+1 javadoc 101 trunk passed
0 spotbugs 68 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 198 trunk passed
-0 patch 120 Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
0 mvndep 25 Maven dependency ordering for patch
+1 mvninstall 83 the patch passed
+1 compile 1169 the patch passed
+1 javac 1169 the patch passed
-0 checkstyle 139 root: The patch generated 16 new + 107 unchanged - 4 fixed = 123 total (was 111)
+1 mvnsite 117 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 3 The patch has no ill-formed XML file.
+1 shadedclient 656 patch has no errors when building and testing our client artifacts.
+1 javadoc 86 the patch passed
+1 findbugs 201 the patch passed
_ Other Tests _
+1 unit 550 hadoop-common in the patch passed.
+1 unit 307 hadoop-aws in the patch passed.
+1 asflicense 47 The patch does not generate ASF License warnings.
9345
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-951/12/artifact/out/Dockerfile
GITHUB PR #951
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux e1d78ead575e 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 71ecd2e
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-951/12/artifact/out/diff-checkstyle-root.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-951/12/testReport/
Max. process+thread count 1386 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-951/12/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

steveloughran commented Jun 20, 2019

Thanks for the reviews; I will commit as is and file some follow-ups:

  • Doing that depth check in parallel: nice.
  • Moving off the per-method TTL argument. I did actually do a revision of the patch with that, but reverted it because the tests which were patching the TTL were failing. Merge this in and Gabor can make use of the IDE's refactor-delete-argument feature.
  • Extra prune resilience. Now that there are checks for inconsistency on bulk writeback, things go wrong when your table is in a bit of a mess. As my table is in that state, I have a great opportunity to debug this by writing new tests.

@bgaborg

bgaborg commented Jun 20, 2019

Tested it with -Dscale against ireland. I have the following failures:

[ERROR] Failures:
[ERROR]   ITestS3AEmptyDirectory.testDirectoryBecomesEmpty:48->assertEmptyDirectory:56->Assert.assertEquals:118->Assert.failNotEquals:834->Assert.fail:88 dir is empty expected:<TRUE> but was:<FALSE>
[ERROR] Errors:
[ERROR]   ITestMagicCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestDirectoryCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestPartitionCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[ERROR]   ITestStagingCommitMRJob>AbstractITCommitMRJob.testMRJob:137->AbstractFSContractTestBase.assertIsDirectory:327 ? FileNotFound
[INFO]
[ERROR] Tests run: 1023, Failures: 1, Errors: 4, Skipped: 130

This is new for me:

ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 11.097 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.ITestS3AEmptyDirectory
[ERROR] testDirectoryBecomesEmpty(org.apache.hadoop.fs.s3a.ITestS3AEmptyDirectory)  Time elapsed: 4.515 s  <<< FAILURE!
java.lang.AssertionError: dir is empty expected:<TRUE> but was:<FALSE>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.apache.hadoop.fs.s3a.ITestS3AEmptyDirectory.assertEmptyDirectory(ITestS3AEmptyDirectory.java:56)
	at org.apache.hadoop.fs.s3a.ITestS3AEmptyDirectory.testDirectoryBecomesEmpty(ITestS3AEmptyDirectory.java:48)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)

We know about the other 3 testMRJob failures. I'm not happy that we have those issues, but we know about them. Have we created an issue already to stabilize those?

I see some failures in the sequential-integration-tests as well, but those are still running. Not just timeouts - e.g. [ERROR] Tests run: 7, Failures: 1, Errors: 2, Skipped: 0, Time elapsed: 72.408 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortMagicCommitter. I'll comment with the results once those are completed.

@bgaborg

bgaborg commented Jun 20, 2019

Here are the sequential-integration-test issues:

[ERROR] Failures:
[ERROR]   ITestTerasortMagicCommitter>AbstractCommitTerasortIT.test_110_teragen:167->AbstractCommitTerasortIT.executeStage:143->Assert.assertEquals:645->Assert.failNotEquals:834->Assert.fail:88 Teragen(1000, s3a://gabota-versioned-bucket-ireland/terasort-ITestTerasortMagicCommitter/sortin) failed expected:<0> but was:<1>
[ERROR] Errors:
[ERROR]   ITestS3AContractRootDir>AbstractContractRootDirectoryTest.testRecursiveRootListing:219 ? TestTimedOut
[ERROR]   ITestS3AContractRootDir>AbstractContractRootDirectoryTest.testRmEmptyRootDirNonRecursive:95 ? TestTimedOut
[ERROR]   ITestTerasortMagicCommitter>AbstractCommitTerasortIT.test_120_terasort:177->AbstractCommitITest.loadSuccessFile:499 ? FileNotFound
[ERROR]   ITestTerasortMagicCommitter>AbstractCommitTerasortIT.test_130_teravalidate:192->AbstractCommitITest.loadSuccessFile:499 ? FileNotFound
[ERROR]   ITestDynamoDBMetadataStoreScale.lambda$execute$10:494->lambda$test_040_get$1:296 ? FileNotFound

I'm a bit worried about the FNFEs here:

[ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 373.769 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.s3guard.ITestDynamoDBMetadataStoreScale
[ERROR] test_040_get(org.apache.hadoop.fs.s3a.s3guard.ITestDynamoDBMetadataStoreScale)  Time elapsed: 7.065 s  <<< ERROR!
java.io.FileNotFoundException: get on s3a://example.org/get: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: 2IOBP94FLJ86OIPKUUPDVR40G7VV4KQNSO5AEMVJF66Q9ASUAAJG)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateDynamoDBException(S3AUtils.java:424)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:206)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:314)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:406)
	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:310)
	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:285)
	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.getConsistentItem(DynamoDBMetadataStore.java:644)
	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.innerGet(DynamoDBMetadataStore.java:688)
	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.get(DynamoDBMetadataStore.java:666)
	at org.apache.hadoop.fs.s3a.s3guard.ITestDynamoDBMetadataStoreScale.lambda$test_040_get$1(ITestDynamoDBMetadataStoreScale.java:296)
	at org.apache.hadoop.fs.s3a.s3guard.ITestDynamoDBMetadataStoreScale.lambda$execute$10(ITestDynamoDBMetadataStoreScale.java:494)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: 2IOBP94FLJ86OIPKUUPDVR40G7VV4KQNSO5AEMVJF66Q9ASUAAJG)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4279)
	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4246)
	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeGetItem(AmazonDynamoDBClient.java:2054)
	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.getItem(AmazonDynamoDBClient.java:2020)
	at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.doLoadItem(GetItemImpl.java:77)
	at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.getItem(GetItemImpl.java:66)
	at com.amazonaws.services.dynamodbv2.document.Table.getItem(Table.java:608)
	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.lambda$getConsistentItem$3(DynamoDBMetadataStore.java:649)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
	... 13 more
[ERROR] test_130_teravalidate(org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortMagicCommitter)  Time elapsed: 1.609 s  <<< ERROR!
java.io.FileNotFoundException: Expected file: not found s3a://gabota-versioned-bucket-ireland/terasort-ITestTerasortMagicCommitter/sortout/_SUCCESS in s3a://gabota-versioned-bucket-ireland/terasort-ITestTerasortMagicCommitter/sortout
	at org.apache.hadoop.fs.contract.ContractTestUtils.verifyPathExists(ContractTestUtils.java:940)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertPathExists(ContractTestUtils.java:918)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertIsFile(ContractTestUtils.java:826)
	at org.apache.hadoop.fs.s3a.commit.AbstractCommitITest.loadSuccessFile(AbstractCommitITest.java:499)
	at org.apache.hadoop.fs.s3a.commit.terasort.AbstractCommitTerasortIT.test_130_teravalidate(AbstractCommitTerasortIT.java:192)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://gabota-versioned-bucket-ireland/terasort-ITestTerasortMagicCommitter/sortout/_SUCCESS
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2804)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2693)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2589)
	at org.apache.hadoop.fs.contract.ContractTestUtils.verifyPathExists(ContractTestUtils.java:934)
	... 19 more
[ERROR] test_120_terasort(org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortMagicCommitter)  Time elapsed: 1.658 s  <<< ERROR!
java.io.FileNotFoundException: Expected file: not found s3a://gabota-versioned-bucket-ireland/terasort-ITestTerasortMagicCommitter/sortin/_SUCCESS in s3a://gabota-versioned-bucket-ireland/terasort-ITestTerasortMagicCommitter/sortin
	at org.apache.hadoop.fs.contract.ContractTestUtils.verifyPathExists(ContractTestUtils.java:940)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertPathExists(ContractTestUtils.java:918)
	at org.apache.hadoop.fs.contract.ContractTestUtils.assertIsFile(ContractTestUtils.java:826)
	at org.apache.hadoop.fs.s3a.commit.AbstractCommitITest.loadSuccessFile(AbstractCommitITest.java:499)
	at org.apache.hadoop.fs.s3a.commit.terasort.AbstractCommitTerasortIT.test_120_terasort(AbstractCommitTerasortIT.java:177)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://gabota-versioned-bucket-ireland/terasort-ITestTerasortMagicCommitter/sortin/_SUCCESS
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2804)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2693)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2589)
	at org.apache.hadoop.fs.contract.ContractTestUtils.verifyPathExists(ContractTestUtils.java:934)
	... 19 more
[ERROR] Tests run: 7, Failures: 1, Errors: 2, Skipped: 0, Time elapsed: 72.408 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortMagicCommitter
[ERROR] test_110_teragen(org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortMagicCommitter)  Time elapsed: 21.174 s  <<< FAILURE!
java.lang.AssertionError: Teragen(1000, s3a://gabota-versioned-bucket-ireland/terasort-ITestTerasortMagicCommitter/sortin) failed expected:<0> but was:<1>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:645)
	at org.apache.hadoop.fs.s3a.commit.terasort.AbstractCommitTerasortIT.executeStage(AbstractCommitTerasortIT.java:143)
	at org.apache.hadoop.fs.s3a.commit.terasort.AbstractCommitTerasortIT.test_110_teragen(AbstractCommitTerasortIT.java:167)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)

@steveloughran
Copy link
Contributor Author

@bgaborg Thanks for those results; we need to look at them to see if they are related.

test_040_get(org.apache.hadoop.fs.s3a.s3guard.ITestDynamoDBMetadataStoreScale)

That FileNotFoundException wraps a ResourceNotFoundException: the DDB table isn't there. What happens on a rerun?

Terasort:

The tests are in an ordered chain: each only runs if the previous test stage completed, which is inferred from the _SUCCESS marker in the previous directory.

The only one to worry about (at least at first) is test_110_teragen, where the exec'd operation returned a non-zero value: it failed. But we don't know why.

One thing I have never worked out is where in the miniyarn cluster the logs from the MR job actually collect. We have those of the JUnit process, but not the forked processes which are actually logging what's going on. If you have any insight here, that'd help us debug. Otherwise, what happens when you rerun this?

@steveloughran
Contributor Author

@mackrorysd FWIW, I wasn't planning to backport this too far. All the new files are far away from existing code, so it's the DDB changes and the changes in the S3AFS which will be the sources of merge pain.

@steveloughran steveloughran deleted the s3/HADOOP-15183-s3guard-rename-failures branch October 15, 2021 19:45