HDDS-15150. Container scanner should not mark container as UNHEALTHY when FD exhausted #10214

Open
sarvekshayr wants to merge 1 commit into apache:master from sarvekshayr:HDDS-15150

@sarvekshayr
Contributor

What changes were proposed in this pull request?

Fixed a bug where background container scanners marked containers as UNHEALTHY due to resource issues rather than actual data corruption. Specifically, when the system encountered a FileNotFoundException or FileSystemException caused by file-descriptor exhaustion ("Too many open files"), the scanners incorrectly flagged these as corruption events.

The logic has been updated to explicitly catch these resource-related exceptions, ensuring that containers remain in their current state when the scanner cannot perform its check due to system limits.
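As a rough illustration of the distinction described above, the sketch below builds the two exception shapes seen in the logs and sorts them into "resource exhausted" versus "possible corruption". The class name `FdExhaustionExample`, the `Outcome` enum, and the shortened file paths are hypothetical and are not part of this patch; matching on the message text is an assumption about how such failures could be recognized.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.FileSystemException;

/** Hypothetical illustration only; not one of the classes changed by this PR. */
final class FdExhaustionExample {
  // Error text that appears in both log excerpts in this description.
  private static final String TOO_MANY_OPEN_FILES = "Too many open files";

  enum Outcome { RESOURCE_EXHAUSTED, POSSIBLE_CORRUPTION }

  /** Classifies a scan failure by looking for the FD-exhaustion text in its message. */
  static Outcome classify(IOException e) {
    String message = e.getMessage();
    boolean fdExhausted = message != null && message.contains(TOO_MANY_OPEN_FILES);
    return fdExhausted ? Outcome.RESOURCE_EXHAUSTED : Outcome.POSSIBLE_CORRUPTION;
  }

  public static void main(String[] args) {
    // Shape reported by the metadata scanner: "<path> (Too many open files)".
    IOException metadataFailure =
        new FileNotFoundException("/tmp/example.container (Too many open files)");
    // Shape reported by the data scanner: "<path>: Too many open files".
    IOException dataFailure =
        new FileSystemException("/tmp/example.block", null, TOO_MANY_OPEN_FILES);
    // A failure without the FD-exhaustion text is still treated as possible corruption.
    IOException otherFailure = new IOException("Checksum mismatch");

    System.out.println(classify(metadataFailure));
    System.out.println(classify(dataFailure));
    System.out.println(classify(otherFailure));
  }
}
```

Under these assumptions, the first two failures are classified as `RESOURCE_EXHAUSTED` and the container state would be left alone, while the third still counts as `POSSIBLE_CORRUPTION`.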

ContainerMetadataScanner

2026-04-20 22:01:43,978 ERROR [ContainerMetadataScanner]-org.apache.hadoop.ozone.container.ozoneimpl.BackgroundContainerMetadataScanner: Corruption detected in container [3980819]. Marking it UNHEALTHY.
java.io.FileNotFoundException: /data6/hadoop-ozone/datanode/data/hdds/CID-637fe7c5-f40b-4e49-98b3-52154bd669e2/current/containerDir95/3980819/metadata/3980819.container (Too many open files)

ContainerDataScanner

2026-04-20 22:01:43,982 ERROR [ContainerDataScanner(/data12/hadoop-ozone/datanode/data/hdds)]-org.apache.hadoop.ozone.container.ozoneimpl.BackgroundContainerDataScanner: Corruption detected in container [16326340]. Marking it UNHEALTHY.
java.nio.file.FileSystemException: /data12/hadoop-ozone/datanode/data/hdds/CID-637fe7c5-f40b-4e49-98b3-52154bd669e2/current/containerDir143/16326340/chunks/115816904944438982.block: Too many open files

What is the link to the Apache JIRA

HDDS-15150

How was this patch tested?

Added unit tests in TestBackgroundContainerDataScanner and TestBackgroundContainerMetadataScanner.
Verified that, with the fix, containers are no longer incorrectly marked as UNHEALTHY.

@sarvekshayr changed the title from "HDDS-15150. Datanode scanner should not mark container as UNHEALTHY when FD exhausted" to "HDDS-15150. Container scanner should not mark container as UNHEALTHY when FD exhausted" on May 8, 2026
@sarvekshayr requested a review from ChenSammi on May 8, 2026 09:41
@ChenSammi requested a review from Copilot on May 11, 2026 11:34
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Prevents background container scanners from incorrectly marking containers as UNHEALTHY when scan failures are caused by file-descriptor exhaustion (“Too many open files”) rather than real corruption.

Changes:

  • Added ScanTransientIOUtil to detect transient “Too many open files” IO failures from scan results / exception chains.
  • Updated ContainerScanHelper to skip marking containers UNHEALTHY when all scan errors are FD-exhaustion related.
  • Added unit tests covering the new transient-error classification and the “do not mark unhealthy” behavior for both data and metadata scanners.
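The "all scan errors are FD-exhaustion related" condition in the list above could be sketched roughly as follows. This is not the code from `ContainerScanHelper` or `ScanTransientIOUtil`: the class `ScanDecisionSketch`, its method names, and the use of a plain list of `Throwable` in place of the real scan result type are all assumptions made for illustration. Treating an empty error list as "nothing to mark" is likewise an assumption.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; a list of Throwables stands in for the real scan result. */
final class ScanDecisionSketch {
  /** Simplified check on a single error's own message (no cause-chain walk). */
  static boolean isFdExhaustion(Throwable error) {
    String message = error.getMessage();
    return message != null && message.contains("Too many open files");
  }

  /** True only when there is at least one error and every error is FD exhaustion. */
  static boolean allErrorsAreFdExhaustion(List<? extends Throwable> errors) {
    return !errors.isEmpty() && errors.stream().allMatch(ScanDecisionSketch::isFdExhaustion);
  }

  /** A container is marked UNHEALTHY only if some error is not FD exhaustion. */
  static boolean shouldMarkUnhealthy(List<? extends Throwable> errors) {
    return !errors.isEmpty() && !allErrorsAreFdExhaustion(errors);
  }

  public static void main(String[] args) {
    Throwable fd = new IOException("/tmp/example.block: Too many open files");
    Throwable corrupt = new IOException("Checksum mismatch");

    System.out.println(shouldMarkUnhealthy(Arrays.asList(fd, fd)));
    System.out.println(shouldMarkUnhealthy(Arrays.asList(fd, corrupt)));
    System.out.println(shouldMarkUnhealthy(Collections.<Throwable>emptyList()));
  }
}
```

The point of requiring every error to match is that a single genuine corruption error mixed in with FD-exhaustion errors should still lead to the container being marked UNHEALTHY.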

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ScanTransientIOUtil.java | New utility to classify scan failures as “Too many open files” via error inspection. |
| hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java | Skips unhealthy marking when the scan result indicates only FD exhaustion. |
| hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestScanTransientIOUtil.java | New unit tests validating the transient IO detection logic. |
| hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerMetadataScanner.java | Test ensuring metadata scan “Too many open files” does not mark container unhealthy. |
| hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerDataScanner.java | Test ensuring data scan “Too many open files” does not mark container unhealthy. |

Comment on lines +123 to 126
    if (ScanTransientIOUtil.scanErrorsAreOnlyTooManyOpenFiles(result)) {
      return;
    }
    long containerID = containerData.getContainerID();

Comment on lines +46 to +54
  public static boolean isTooManyOpenFiles(Throwable throwable) {
    for (Throwable cause = throwable; cause != null; cause = cause.getCause()) {
      String message = cause.getMessage();
      if (message != null && containsTooManyOpenFiles(message)) {
        return true;
      }
    }
    return false;
  }

Comment on lines +48 to +50
      String message = cause.getMessage();
      if (message != null && containsTooManyOpenFiles(message)) {
        return true;
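
To see why the loop in `isTooManyOpenFiles` walks the cause chain, the excerpt can be wrapped in a small self-contained class and exercised with a wrapped exception. The body of `containsTooManyOpenFiles` is not shown in the excerpt, so the version below is a hypothetical stand-in that does a plain substring match; the class name `CauseChainSketch` is also made up for this example.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical wrapper around the excerpted method, for illustration only. */
final class CauseChainSketch {
  // Assumed implementation: the real helper's body is not visible in the excerpt.
  static boolean containsTooManyOpenFiles(String message) {
    return message.contains("Too many open files");
  }

  // Same loop as the excerpt: check each throwable in the cause chain in turn.
  static boolean isTooManyOpenFiles(Throwable throwable) {
    for (Throwable cause = throwable; cause != null; cause = cause.getCause()) {
      String message = cause.getMessage();
      if (message != null && containsTooManyOpenFiles(message)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    IOException root =
        new FileNotFoundException("/tmp/example.container (Too many open files)");
    // The outer message says nothing about file descriptors; only the cause does.
    Throwable wrapped = new UncheckedIOException("metadata scan failed", root);

    System.out.println(isTooManyOpenFiles(wrapped));
    System.out.println(isTooManyOpenFiles(new IOException("Checksum mismatch")));
  }
}
```

With this stand-in helper, the wrapped failure is detected through its cause even though the outer message does not mention it, and the unrelated `IOException` is not.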