Skip to content

HDDS-15024. Track pending containers in SCM to prevent Datanode over-allocation#10073

Open
ashishkumar50 wants to merge 8 commits intoapache:masterfrom
ashishkumar50:HDDS-15024
Open

HDDS-15024. Track pending containers in SCM to prevent Datanode over-allocation#10073
ashishkumar50 wants to merge 8 commits intoapache:masterfrom
ashishkumar50:HDDS-15024

Conversation

@ashishkumar50
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Introduce PendingContainerTracker in SCM to track container allocations that are issued but not yet fully realized on DataNodes.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15024

How was this patch tested?

Unit test

@adoroszlai adoroszlai changed the title HDDS-15024. Introduce PendingContainerTracker in SCM to prevent container over-allocation per DataNode. HDDS-15024. Track pending containers in SCM to prevent Datanode over-allocation Apr 13, 2026
Copy link
Copy Markdown
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashishkumar50 , thanks a lot for splitting the PR

  • For simplicity, let's don't remove buckets. The number of datanodes is small (~10k). We just keep all of them in the map.
  • Then, the map becomes very simple. Let's create a class.
  • Always roll before return.
  class DatanodeBuckets {
    private final ConcurrentHashMap<DatanodeID, TwoWindowBucket> map = new ConcurrentHashMap<>();

    TwoWindowBucket get(DatanodeID id) {
      final TwoWindowBucket bucket = map.compute(id, (k, b) -> b != null ? b : new TwoWindowBucket(rollIntervalMs));
      bucket.rollIfNeeded();
      return bucket;
    }

    TwoWindowBucket get(DatanodeDetails dn) {
      return map.get(dn.getID());
    }
  }

Comment on lines +119 to +124
previousWindow.clear();
currentWindow.clear();
lastRollTime = now;
} else if (elapsed >= rollIntervalMs) {
previousWindow = currentWindow;
currentWindow = new HashSet<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent, use clear() and swap:

        previousWindow.clear();
        final Set<ContainerID> tmp = previousWindow;
        previousWindow = currentWindow;
        currentWindow = tmp;

long usableSpace = VolumeUsage.getUsableSpace(report);
long containersOnThisDisk = usableSpace / containerSize;
effectiveAllocatableSpace += containersOnThisDisk * containerSize;
if (effectiveAllocatableSpace - pendingAllocationBytes >= containerSize) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use:

      if (usableSpace - pendingBytes >= containerSize) {

Comment on lines +43 to +49
private Pipeline pipeline;
private DatanodeDetails dn1;
private DatanodeDetails dn2;
private DatanodeDetails dn3;
private ContainerID container1;
private ContainerID container2;
private ContainerID container3;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 pipeline, 3 DNs and 3 containers are too small.

How about 1k DNs, 1k pipelines and 10k containers?

* @param node The DataNode
* @return Set of pending container IDs
*/
public Set<ContainerID> getPendingContainers(DatanodeDetails node) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this method since it is very easy to be misused. E.g. the test calls it just for the size. Why copying the set to get the size?

//TestPendingContainerTracker
tracker.getPendingContainers(dn1).size())

@ashishkumar50
Copy link
Copy Markdown
Contributor Author

@szetszwo thanks for the review, handled the comments.

@ashishkumar50 ashishkumar50 requested a review from szetszwo April 14, 2026 15:34
Copy link
Copy Markdown
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashishkumar50 , thanks for the update! The change looks mostly good.

  • All the synchronized (bucket) should be removed.
  • Are there legitimate cases to pass null to the methods and ignore it? If yes, please add a comment describing the cases. Otherwise, please replace the null check with Objects.requireNonNull(..).
    We usually use Objects.requireNonNull(..) to detect bugs. If we ignore and return, it hides the bug and may lead to more serious problem such as data loss.

@ashishkumar50 ashishkumar50 requested a review from szetszwo April 14, 2026 18:16
Copy link
Copy Markdown
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashishkumar50 , thanks for the quick update!

+1 the change looks good.

if (elapsed >= 2 * rollIntervalMs) {
previousWindow.clear();
currentWindow.clear();
lastRollTime = now;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a log message for the full drop as well.

        int dropped = previousWindow.size() + currentWindow.size();
        previousWindow.clear();
        currentWindow.clear();
        lastRollTime = now;
        LOG.debug("Double roll interval elapsed ({}ms): dropped {} pending containers from both windows",
            elapsed, dropped);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getCount() is called after both windows are cleared. It will always print 0.

Can you change it like,

int dropped = getCount(); 
previousWindow.clear();
currentWindow.clear();
lastRollTime = now;
LOG.debug("Double roll interval elapsed ({}ms): dropped {} pending containers", elapsed, dropped);

previousWindow = currentWindow;
currentWindow = tmp;
lastRollTime = now;
LOG.debug("Rolled window. Previous window size: {}, Current window reset to empty", previousWindow.size());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add elapsed time here as well.

    LOG.debug("Rolled window after {}ms. Previous window size: {}, Current window reset to empty",
        elapsed, previousWindow.size());

Copy link
Copy Markdown
Contributor

@rakeshadr rakeshadr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ashishkumar50 for the continuous efforts. Added two log improvements, please take care.

+1 LGTM

if (storageReports.isEmpty()) {
return false;
}
for (StorageReportProto report : storageReports) {
Copy link
Copy Markdown
Contributor

@rakeshadr rakeshadr Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashishkumar50
Point-1) Can you add tests for this logic with multiple volumes in a datanode.

Test scenario:

pendingAllocationBytes = 15GB
Volume-0: capacity=100GB, usableSpace=20GB <---- (20-15 >= 5) <--- return true
Volume-1: capacity=100GB, usableSpace=1GB
Volume-2: capacity=100GB, usableSpace=1GB

Point-2) There is a corner case. Say, all 3 volumes have 15GB free and pendingAllocationBytes is 15GB. A 5GB container fits easily. But SCM wrongly rejects the entire DN because it applied 15GB of pending (which in reality may all be on one volume) to every volume.

Test scenario: False negative case, since pendingAllocationBytes is the total pending across the entire DN, not per-volume and causing the trouble. Good thing is it won't result into write failure but it will result into unused space eventhough volumes has space.

pendingAllocationBytes = 15GB
Volume-0: capacity=100GB, usableSpace=15GB <---- (15-15 >= 0) <--- return false
Volume-1: capacity=100GB, usableSpace=15GB <---- (15-15 >= 0) <--- return false
Volume-2: capacity=100GB, usableSpace=15GB <---- (15-15 >= 0) <--- return false

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szetszwo We changed to use only usableSpace after this comment but i think we should use effectiveAllocatableSpace across all the volumes as mentioned by Rakesh as well. I have changed it to use effectiveAllocatableSpace, what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I have misunderstood the calculation. There are three different methods below for a single calculation. Let's combine them into a single method.

  • hasEffectiveAllocatableSpaceForNewContainer
  • getPendingAllocationSize
  • hasAllocatableSpaceAfterPending

There are also two different sizes

  • containerSize
  • maxContainerSize

Are they supposed to be different?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combined to single method also maxContainerSize is only used as both are same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants