
HDDS-11253. Handle corrupted merkle tree files #7083

Conversation

@errose28 (Contributor) commented Aug 16, 2024

What changes were proposed in this pull request?

Summary

A merkle tree file may be corrupted during a write or by some other error. This should only be a temporary problem, since the scanner will rewrite the file on its next run. Until that happens, failures should be handled accordingly:

  • Clients calling the read API should fail cleanly on a corrupted file. The server does not deserialize the file before sending it.
  • Any operation that updates the file should regenerate the local checksum file if it is corrupted, treating it as if it did not exist.
  • File writes are done by writing to a temp file and renaming it into place, which avoids corruption from partial writes. If a write fails, the file is either present or absent, never partially written (see the sketch below).
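A minimal sketch of the temp-file-plus-rename pattern from the last bullet, assuming a hypothetical helper class, method, and payload; the real write path presumably lives in ContainerChecksumTreeManager:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch only: AtomicChecksumWriter and writeAtomically are illustrative names.
public final class AtomicChecksumWriter {
  public static void writeAtomically(Path finalFile, byte[] serializedTree)
      throws IOException {
    Path tmpFile = finalFile.resolveSibling(finalFile.getFileName() + ".tmp");
    try (OutputStream out = Files.newOutputStream(tmpFile)) {
      out.write(serializedTree);
    }
    // An atomic move means readers observe either the old complete file or
    // the new complete file, never a partial write.
    Files.move(tmpFile, finalFile,
        StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
  }

  private AtomicChecksumWriter() { }
}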

Details

  • When a checksum file is corrupted, we overwrite it with a new copy (see the sketch after this list). This loses any block delete information present in the file.
    • This is not a problem because the merge algorithm that will be implemented in HDDS-10928 will pick up this information again from other replicas.
    • If no other replicas have the delete information then all replicas match and there is no problem.
  • When a checksum file does not exist on the datanode, an error is returned to the client. This prevents accidental errors that could result from substituting a default protobuf or a null value.
  • The read checksum API now automatically fails on unclosed containers. These containers will never have the files generated.
  • Exception handling for the current read API was incorrect because the request was not marked read-only in HddsUtils#isReadOnly. Tests in this PR caught the issue, and it is now fixed.
  • The API was renamed from getContainerMerkleTree to getContainerChecksumInfo to match the proto value returned.
    • We have protos called ContainerMerkleTree and ContainerChecksumInfo. ContainerChecksumInfo contains a ContainerMerkleTree along with other metadata.
    • The API was called getContainerMerkleTree but returned a ByteString that actually deserialized into a ContainerChecksumInfo, which was unintuitive. Calling getContainerMerkleTree and deserializing the result as a ContainerMerkleTree would fail.
    • Since the API needs to return a ContainerChecksumInfo to capture all information about the container (the merkle tree alone will not suffice), it is intuitive to rename the API to getContainerChecksumInfo.
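As referenced in the first bullet above, here is a minimal sketch of the "corrupted is treated as missing" rule on an update path. The package of the generated ContainerChecksumInfo message and the regenerateEmptyInfo helper are assumptions for illustration:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import com.google.protobuf.InvalidProtocolBufferException;
// Assumed location of the generated message; the real package may differ.
import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos.ContainerChecksumInfo;

final class ChecksumFileLoader {
  static ContainerChecksumInfo loadOrRegenerate(Path checksumFile) throws IOException {
    if (Files.exists(checksumFile)) {
      try (InputStream in = Files.newInputStream(checksumFile)) {
        return ContainerChecksumInfo.parseFrom(in);
      } catch (InvalidProtocolBufferException corrupted) {
        // Corrupted bytes are handled exactly like a missing file. Any block
        // delete info in the old file is lost here and recovered later by the
        // HDDS-10928 merge from other replicas.
      }
    }
    return regenerateEmptyInfo();
  }

  private static ContainerChecksumInfo regenerateEmptyInfo() {
    // Stand-in for whatever rebuilds the tree from container contents.
    return ContainerChecksumInfo.getDefaultInstance();
  }
}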

What is the link to the Apache JIRA

HDDS-11253

How was this patch tested?

  • Unit tests added for handling file corruption within the datanode.
  • Integration tests added to test error propagation from server to client when reading the file.

…rrupt-files

* HDDS-10239-container-reconciliation: (183 commits)
  HDDS-10376. Add a Datanode API to supply a merkle tree for a given container. (apache#6945)
  HDDS-11289. Bump docker-maven-plugin to 0.45.0 (apache#7024)
  HDDS-11287. Code cleanup in XceiverClientSpi (apache#7043)
  HDDS-11283. Refactor KeyValueStreamDataChannel to avoid spurious IDE build issues (apache#7040)
  HDDS-11257. Ozone write does not work when http proxy is set for the JVM. (apache#7036)
  HDDS-11249. Bump ozone-runner to 20240729-jdk17-1 (apache#7003)
  HDDS-10517. Recon - Add a UI component for showing DN decommissioning detailed status and info (apache#6724)
  HDDS-10926. Block deletion should update container merkle tree. (apache#6875)
  HDDS-11270. [hsync] Add DN layout version (HBASE_SUPPORT/version 8) upgrade test. (apache#7021)
  HDDS-11272. Statistics some node status information (apache#7025)
  HDDS-11278. Move code out of Hadoop util package (apache#7028)
  HDDS-11274. (addendum) Replace Hadoop annotations/configs with Ozone-specific ones
  HDDS-11274. Replace Hadoop annotations/configs with Ozone-specific ones (apache#7026)
  HDDS-11280. Add Synchronize in AbstractCommitWatcher.addAckDataLength (apache#7032)
  HDDS-11235. Spare InfoBucket RPC call in FileSystem#mkdir() call. (apache#6990)
  HDDS-11273. Bump commons-compress to 1.26.2 (apache#7023)
  HDDS-11225. Increase ipc.server.read.threadpool.size (apache#7007)
  HDDS-11224. Increase hdds.datanode.handler.count (apache#7011)
  HDDS-11259. [hsync] DataNode should verify HBASE_SUPPORT layout version for every PutBlock. (apache#7012)
  HDDS-11214. Added config to set rocksDB's max log file size and num of log files (apache#7014)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/TestContainerChecksumTreeManager.java
Cannot get the permissions approach to work to simulate a rename failure. The existing write failure test covers this well enough.
@aswinshakil (Member) left a comment


Thanks for working on this @errose28. I have a minor nit, apart from that I'm good with the changes.

@kerneltime (Contributor) left a comment


First round.

public static File getContainerChecksumFile(ContainerData data) {
  return new File(data.getMetadataPath(), data.getContainerID() + ".tree");
}

@VisibleForTesting
public static File getTmpContainerChecksumFile(ContainerData data) {

Nit: Simpler signature than passing a massive object.

Suggested change:
-public static File getTmpContainerChecksumFile(ContainerData data) {
+public static File getTmpContainerChecksumFile(String path, long containerID) {

@errose28 (Contributor Author) replied:

This signature matches getContainerChecksumFile for consistency. Changing them both as this suggests would require the caller to know more internal details, like where in the container the file(s) belong and whether the path is relative to the container root. Also, in practice, callers already have a ContainerData referring to the container, so this ends up being less verbose because they do not need to unpack the object themselves. The contrast is sketched below.
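For illustration only, a self-contained contrast of the two signatures; the ContainerData interface here is a stand-in with just the accessors the helper needs:

import java.io.File;

class SignatureContrast {
  interface ContainerData {
    String getMetadataPath();
    long getContainerID();
  }

  // Chosen style: callers hand over the object they already hold.
  static File checksumFile(ContainerData data) {
    return new File(data.getMetadataPath(), data.getContainerID() + ".tree");
  }

  // Suggested style: callers must unpack the object and know which path
  // (metadata directory vs. container root) the helper expects.
  static File checksumFile(String metadataPath, long containerID) {
    return new File(metadataPath, containerID + ".tree");
  }
}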

…rrupt-files

* HDDS-10239-container-reconciliation: (61 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
@errose28 (Contributor Author) commented:

All should be addressed in the latest commits.

…rrupt-files

* HDDS-10239-container-reconciliation:
  HDDS-10379. Datanodes should generate initial container merkle tree during container close. (apache#7065)

Conflicts:
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeTestUtils.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/TestContainerChecksumTreeManager.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestOzoneContainerWithTLS.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/checksum/TestContainerCommandReconciliation.java
@errose28 (Contributor Author) commented Sep 6, 2024

All tests are expected to pass after the rebase. TestContainerCommandReconciliation is passing locally for me after the updates. Waiting for CI on my fork before taking this out of draft.

@errose28 errose28 marked this pull request as ready for review September 6, 2024 19:01
@kerneltime kerneltime merged commit 9765a6a into apache:HDDS-10239-container-reconciliation Sep 10, 2024
51 checks passed
errose28 added a commit to errose28/ozone that referenced this pull request Oct 17, 2024
…an-on-error

* HDDS-10239-container-reconciliation:
  HDDS-11253. Handle corrupted merkle tree files (apache#7083)
  HDDS-10379. Datanodes should generate initial container merkle tree during container close. (apache#7065)

Conflicts:
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueHandlerWithUnhealthyContainer.java