HDDS-11253. Handle corrupted merkle tree files #7083
Conversation
…rrupt-files

* HDDS-10239-container-reconciliation: (183 commits)
  HDDS-10376. Add a Datanode API to supply a merkle tree for a given container. (apache#6945)
  HDDS-11289. Bump docker-maven-plugin to 0.45.0 (apache#7024)
  HDDS-11287. Code cleanup in XceiverClientSpi (apache#7043)
  HDDS-11283. Refactor KeyValueStreamDataChannel to avoid spurious IDE build issues (apache#7040)
  HDDS-11257. Ozone write does not work when http proxy is set for the JVM. (apache#7036)
  HDDS-11249. Bump ozone-runner to 20240729-jdk17-1 (apache#7003)
  HDDS-10517. Recon - Add a UI component for showing DN decommissioning detailed status and info (apache#6724)
  HDDS-10926. Block deletion should update container merkle tree. (apache#6875)
  HDDS-11270. [hsync] Add DN layout version (HBASE_SUPPORT/version 8) upgrade test. (apache#7021)
  HDDS-11272. Statistics some node status information (apache#7025)
  HDDS-11278. Move code out of Hadoop util package (apache#7028)
  HDDS-11274. (addendum) Replace Hadoop annotations/configs with Ozone-specific ones
  HDDS-11274. Replace Hadoop annotations/configs with Ozone-specific ones (apache#7026)
  HDDS-11280. Add Synchronize in AbstractCommitWatcher.addAckDataLength (apache#7032)
  HDDS-11235. Spare InfoBucket RPC call in FileSystem#mkdir() call. (apache#6990)
  HDDS-11273. Bump commons-compress to 1.26.2 (apache#7023)
  HDDS-11225. Increase ipc.server.read.threadpool.size (apache#7007)
  HDDS-11224. Increase hdds.datanode.handler.count (apache#7011)
  HDDS-11259. [hsync] DataNode should verify HBASE_SUPPORT layout version for every PutBlock. (apache#7012)
  HDDS-11214. Added config to set rocksDB's max log file size and num of log files (apache#7014)
  ...

Conflicts:
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
  hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/TestContainerChecksumTreeManager.java
Also deserialize inside the client
Cannot get the permissions to work to simulate a rename failure. The existing write failure test covers it well enough.
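For illustration only, here is a minimal, self-contained sketch of the permission-based approach mentioned above: revoking write permission on the directory that holds the tree file so that a write fails. The class name, file name, and use of plain java.nio are assumptions for the sketch; this is not the PR's actual test (which, as noted, could not reproduce the rename failure this way).

```java
import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.junit.jupiter.api.Assumptions.assumeTrue;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

// Hypothetical sketch: simulate a checksum-file write failure by making the
// directory read-only. Skipped (not failed) on platforms or users (e.g. root)
// where removing write permission has no effect.
class ChecksumWriteFailureSketch {
  @TempDir
  File metadataDir;

  @Test
  void writeFailsWhenDirectoryIsReadOnly() {
    assumeTrue(metadataDir.setWritable(false), "could not revoke write permission");
    try {
      // Stand-in for writing the container checksum file; creating any file in
      // the read-only directory should now fail with an IOException.
      File treeFile = new File(metadataDir, "1.tree");
      assertThrows(IOException.class, () -> Files.createFile(treeFile.toPath()));
    } finally {
      metadataDir.setWritable(true);
    }
  }
}
```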
Thanks for working on this @errose28. I have a minor nit; apart from that I'm good with the changes.
First round.
public static File getContainerChecksumFile(ContainerData data) {
  return new File(data.getMetadataPath(), data.getContainerID() + ".tree");
}

@VisibleForTesting
public static File getTmpContainerChecksumFile(ContainerData data) {
Nit: Simpler signature than passing a massive object.
Suggested change:
- public static File getTmpContainerChecksumFile(ContainerData data) {
+ public static File getTmpContainerChecksumFile(String path, long containerID) {
This signature matches getContainerChecksumFile for consistency. Changing them both as this suggests would require the caller to know more internal details, like where in the container the file(s) belong and whether the path is relative to the container root or not. Also, in practice callers already have ContainerData to refer to the container in the first place, so this ends up being less verbose because callers do not need to unpack the object themselves.
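To make the verbosity argument concrete, here is a hypothetical caller under the current signature; the class and method wrapping it are illustrative only, and the (String path, long containerID) overload mentioned in the comment is the reviewer's suggestion, not an existing method.

```java
import java.io.File;
import org.apache.hadoop.ozone.container.checksum.ContainerChecksumTreeManager;
import org.apache.hadoop.ozone.container.common.impl.ContainerData;

// Hypothetical call site: with the ContainerData-based signature the caller
// simply forwards the object it already holds. With the suggested
// (String path, long containerID) signature it would first have to call
// data.getMetadataPath() and data.getContainerID() itself, and know that the
// checksum file is resolved relative to the metadata path, not the container root.
final class ChecksumFileCallSite {
  static File resolveTmpChecksumFile(ContainerData data) {
    return ContainerChecksumTreeManager.getTmpContainerChecksumFile(data);
  }
}
```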
…rrupt-files

* HDDS-10239-container-reconciliation: (61 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
All should be addressed in the latest commits.
…rrupt-files

* HDDS-10239-container-reconciliation:
  HDDS-10379. Datanodes should generate initial container merkle tree during container close. (apache#7065)

Conflicts:
  hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeTestUtils.java
  hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/TestContainerChecksumTreeManager.java
  hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestOzoneContainerWithTLS.java
  hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/checksum/TestContainerCommandReconciliation.java
All tests are expected to pass after the rebase.
Merged commit 9765a6a into apache:HDDS-10239-container-reconciliation.
…an-on-error

* HDDS-10239-container-reconciliation:
  HDDS-11253. Handle corrupted merkle tree files (apache#7083)
  HDDS-10379. Datanodes should generate initial container merkle tree during container close. (apache#7065)

Conflicts:
  hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueHandlerWithUnhealthyContainer.java
What changes were proposed in this pull request?
Summary
A merkle tree file may be corrupted during write or by some other error. This should only be a temporary problem, since the scanner will rewrite the file on its next run. Until that happens, failures should be handled accordingly (a sketch of the read-side fallback follows the details below):
Details
- … the merge algorithm that will be implemented in HDDS-10928 will pick up this information again from other replicas.
- … HddsUtils#isReadOnly. Tests in this PR caught the issue and it is now fixed.
- getContainerMerkleTree is renamed to getContainerChecksumInfo to match the proto value returned:
  - There are two protos, ContainerMerkleTree and ContainerChecksumInfo. ContainerChecksumInfo contains a ContainerMerkleTree along with other metadata.
  - The API was previously called getContaienrMerkleTree but returned a ByteString that actually deserialized into a ContainerChecksumInfo, which was unintuitive. Calling getContainerMerkleTree and deserializing as a ContainerMerkleTree would fail.
  - Since we will use ContainerChecksumInfo to capture all information about the container (the merkle tree alone will not suffice), it is intuitive to rename the API to getContainerChecksumInfo.
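As a concrete picture of the read-side handling described in the list above, here is a minimal sketch: a corrupted or unreadable checksum file is treated as absent, on the assumption that the scanner will regenerate it on its next run. The helper name readChecksumInfo and the import path of the generated ContainerChecksumInfo class are assumptions for illustration; this is not the PR's actual implementation.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Optional;
// Import path of the generated proto class is assumed for this sketch.
import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos.ContainerChecksumInfo;

// Illustrative sketch only: read the on-disk checksum info, treating a missing
// or corrupted file as "no checksum info yet" instead of failing the caller.
final class ChecksumFileReadSketch {
  static Optional<ContainerChecksumInfo> readChecksumInfo(File checksumFile) {
    if (!checksumFile.exists()) {
      return Optional.empty();
    }
    try (FileInputStream in = new FileInputStream(checksumFile)) {
      return Optional.of(ContainerChecksumInfo.parseFrom(in));
    } catch (IOException ex) {
      // InvalidProtocolBufferException (a corrupted file) is an IOException too.
      // Drop the unreadable content; the scanner rewrites the file on its next run.
      return Optional.empty();
    }
  }
}
```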
What is the link to the Apache JIRA
HDDS-11253
How was this patch tested?