HDDS-10614. Avoid decreasing cached space usage below zero #6508
adoroszlai merged 13 commits into apache:master
@devmadhuu @adoroszlai Could you please take a look?
adoroszlai left a comment
Thanks @ArafatKhan2198 for working on this.
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java
devmadhuu left a comment
Thanks @ArafatKhan2198 for working on this patch. Please check and address the comments.
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java
hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/fs/TestCachingSpaceUsageSource.java
@ArafatKhan2198 @devmadhuu This PR hopefully fixes the source of the negative space usage on the Datanode. Does Recon need any additional fix? Is space usage stored in the Recon DB, which might prevent startup if Recon has already saved an invalid value?
Thanks @ArafatKhan2198 for the fix, @devmadhuu for the review.
In summary, Recon gets storage information from the periodic heartbeat messages sent by the individual Datanodes in the cluster. ReconNodeManager aggregates these per-Datanode statistics to provide cluster-wide storage information to other components such as ClusterStateEndpoint. This is my understanding from the code. We do not store space usage in any of Recon's DBs (Derby or RocksDB). @devmadhuu @dombizita please correct me if I am making a wrong assumption anywhere.
(cherry picked from commit cc023e7)
apache#6508) Change-Id: Id8953111d32c280936099e899aad02dd4c1d138a
What changes were proposed in this pull request?
The root cause seems to be an error during the refresh operation of CachingSpaceUsageSource. Specifically, the underlying SpaceUsageSource (likely an instance of DU, which uses the Unix du command to calculate disk usage) is failing due to a permission issue when trying to read the /data3/lost+found directory. This failure might cause the getUsedSpace() method to return an incorrect value (possibly zero), which, when decremented, results in a negative value.
This PR introduces error handling and validation in the CachingSpaceUsageSource class to ensure data integrity. Specifically, it prevents negative values for used space by validating new values before updating the cache and handles exceptions, including UncheckedIOException, by maintaining the last known good value and logging errors. These changes ensure that temporary issues, such as permission errors, do not result in invalid state transitions or data corruption.
We catch UncheckedIOException because it indicates a problem occurred when the program tried to read or write data, and we saw it during operations like calculating disk space usage. This specific exception wraps lower-level errors, making it a clear sign that something went wrong with I/O operations, which are crucial for accurately tracking disk space.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10614
How was this patch tested?
CI ran green: https://github.com/ArafatKhan2198/ozone/actions/runs/8627744703
Unit tests will be added if the approach is correct.
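
For reviewers who want a concrete picture of the approach described under "What changes were proposed", here is a minimal, self-contained sketch. It is not the merged code: the class name CachedUsageSketch, the LongSupplier stand-in for the underlying source, and the boolean return value of refresh() are all assumptions made for illustration. It shows the two ideas from the description: keep the last known good value when the refresh throws UncheckedIOException or reports a negative number, and never let a decrement push the cached value below zero.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical sketch only; the real class is CachingSpaceUsageSource. */
final class CachedUsageSketch {
  // Stand-in for the underlying source (for example DU); an assumption.
  private final LongSupplier source;
  private final AtomicLong cachedUsed = new AtomicLong();

  CachedUsageSketch(LongSupplier source) {
    this.source = source;
  }

  long getUsedSpace() {
    return cachedUsed.get();
  }

  void incUsedSpace(long bytes) {
    cachedUsed.addAndGet(bytes);
  }

  /** Decreases the cached value, clamping the result at zero. */
  void decUsedSpace(long bytes) {
    cachedUsed.updateAndGet(current -> Math.max(0L, current - bytes));
  }

  /**
   * Refreshes the cache from the source. On failure or on a negative
   * reading the last known good value is kept. Returns true if updated.
   */
  boolean refresh() {
    try {
      long fresh = source.getAsLong();
      if (fresh < 0) {
        System.err.println("Ignoring negative used space: " + fresh);
        return false;
      }
      cachedUsed.set(fresh);
      return true;
    } catch (UncheckedIOException e) {
      System.err.println("Refresh failed, keeping cached value: " + e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    long[] calls = {0};
    CachedUsageSketch cache = new CachedUsageSketch(() -> {
      if (calls[0]++ == 0) {
        return 100L; // first refresh succeeds
      }
      // later refreshes fail, similar to du hitting an unreadable directory
      throw new UncheckedIOException(new IOException("permission denied"));
    });
    cache.refresh();          // cache is now 100
    cache.refresh();          // fails, cache stays 100
    cache.decUsedSpace(150);  // clamped instead of going to -50
    System.out.println(cache.getUsedSpace()); // prints 0
  }
}
```

Clamping inside updateAndGet keeps the read, subtract and write as one atomic step, so concurrent decrements cannot briefly expose a negative value. Whether the actual patch clamps, rejects, or only logs in this situation should be checked against the diff.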