HDDS-15024. Track pending containers in SCM to prevent Datanode over-allocation by ashishkumar50 · Pull Request #10073 · apache/ozone

ashishkumar50 · 2026-04-13T06:08:50Z

What changes were proposed in this pull request?

Introduce PendingContainerTracker in SCM to track container allocations that are issued but not yet fully realized on DataNodes.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15024

How was this patch tested?

Unit test


szetszwo

@ashishkumar50, thanks a lot for splitting the PR.

For simplicity, let's not remove buckets. The number of datanodes is small (~10k), so we can just keep all of them in the map.
Then the map becomes very simple. Let's create a class for it.
Always roll before returning a bucket.

  class DatanodeBuckets {
    private final ConcurrentHashMap<DatanodeID, TwoWindowBucket> map = new ConcurrentHashMap<>();

    TwoWindowBucket get(DatanodeID id) {
      // Buckets are created on first use and never removed.
      final TwoWindowBucket bucket = map.compute(id, (k, b) -> b != null ? b : new TwoWindowBucket(rollIntervalMs));
      bucket.rollIfNeeded();
      return bucket;
    }

    TwoWindowBucket get(DatanodeDetails dn) {
      // Delegate so that this overload also creates and rolls the bucket.
      return get(dn.getID());
    }
  }

szetszwo · 2026-04-13T16:32:06Z

+        previousWindow.clear();
+        currentWindow.clear();
+        lastRollTime = now;
+      } else if (elapsed >= rollIntervalMs) {
+        previousWindow = currentWindow;
+        currentWindow = new HashSet<>();


To be consistent, use clear() and swap:

previousWindow.clear();
final Set<ContainerID> tmp = previousWindow;
previousWindow = currentWindow;
currentWindow = tmp;
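A minimal sketch of how the clear-and-swap roll and the double-interval full drop from the diff above could fit together. Everything here is illustrative rather than the PR's actual code: the class name `TwoWindowBucketSketch`, the injectable clock, the use of plain `long` container IDs, and the `synchronized` methods are all assumptions made to keep the example self-contained.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongSupplier;

/** Hypothetical two-window bucket; container IDs are plain longs here. */
class TwoWindowBucketSketch {
  private final long rollIntervalMs;
  private final LongSupplier clock; // assumption: injectable so tests control time
  private Set<Long> previousWindow = new HashSet<>();
  private Set<Long> currentWindow = new HashSet<>();
  private long lastRollTime;

  TwoWindowBucketSketch(long rollIntervalMs, LongSupplier clock) {
    this.rollIntervalMs = rollIntervalMs;
    this.clock = clock;
    this.lastRollTime = clock.getAsLong();
  }

  synchronized void add(long containerId) {
    rollIfNeeded();
    // Re-adding an ID refreshes it: keep it only in the current window.
    previousWindow.remove(containerId);
    currentWindow.add(containerId);
  }

  synchronized boolean remove(long containerId) {
    rollIfNeeded();
    // Non-short-circuit OR so that both windows are always checked.
    return currentWindow.remove(containerId) | previousWindow.remove(containerId);
  }

  synchronized int getCount() {
    rollIfNeeded();
    return previousWindow.size() + currentWindow.size();
  }

  /** Rolls if due and returns how many pending entries were dropped. */
  private int rollIfNeeded() {
    final long now = clock.getAsLong();
    final long elapsed = now - lastRollTime;
    if (elapsed >= 2 * rollIntervalMs) {
      // Capture the count before clearing; afterwards it would always be 0.
      final int dropped = previousWindow.size() + currentWindow.size();
      previousWindow.clear();
      currentWindow.clear();
      lastRollTime = now;
      return dropped;
    }
    if (elapsed >= rollIntervalMs) {
      // Clear and swap: the old previous set is reused as the new current set.
      final int dropped = previousWindow.size();
      previousWindow.clear();
      final Set<Long> tmp = previousWindow;
      previousWindow = currentWindow;
      currentWindow = tmp;
      lastRollTime = now;
      return dropped;
    }
    return 0;
  }
}
```

Reusing the two sets avoids allocating a new `HashSet` on every roll, which matches the reviewer's point about being consistent with the `clear()` calls in the full-drop branch.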

szetszwo · 2026-04-13T16:42:36Z

+      long usableSpace = VolumeUsage.getUsableSpace(report);
+      long containersOnThisDisk = usableSpace / containerSize;
+      effectiveAllocatableSpace += containersOnThisDisk * containerSize;
+      if (effectiveAllocatableSpace - pendingAllocationBytes >= containerSize) {


Just use:

if (usableSpace - pendingBytes >= containerSize) {

szetszwo · 2026-04-13T16:48:12Z

+  private Pipeline pipeline;
+  private DatanodeDetails dn1;
+  private DatanodeDetails dn2;
+  private DatanodeDetails dn3;
+  private ContainerID container1;
+  private ContainerID container2;
+  private ContainerID container3;


1 pipeline, 3 DNs and 3 containers are too small.

How about 1k DNs, 1k pipelines and 10k containers?

szetszwo · 2026-04-13T16:53:21Z

+   * @param node The DataNode
+   * @return Set of pending container IDs
+   */
+  public Set<ContainerID> getPendingContainers(DatanodeDetails node) {


Remove this method since it is very easy to misuse. E.g. the test calls it just for the size. Why copy the set to get the size?

//TestPendingContainerTracker
tracker.getPendingContainers(dn1).size())
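To illustrate the concern, here is a small sketch contrasting a copying getter with a count accessor. The class and the method name `getPendingContainerCount` are hypothetical and not taken from the PR; container IDs are plain `long` values for brevity.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker for a single datanode. */
class PendingCountSketch {
  private final Set<Long> pending = ConcurrentHashMap.newKeySet();

  void add(long containerId) {
    pending.add(containerId);
  }

  /** Defensive copy: allocates and fills a new set on every call. */
  Set<Long> getPendingContainers() {
    return new HashSet<>(pending);
  }

  /** Answers the common question directly, without copying the set. */
  int getPendingContainerCount() {
    return pending.size();
  }
}
```

With only the count accessor exposed, a caller that needs just the size cannot accidentally pay for a copy of the whole set.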

ashishkumar50 · 2026-04-14T15:34:10Z

@szetszwo thanks for the review, handled the comments.

szetszwo

@ashishkumar50, thanks for the update! The change looks mostly good.

All the synchronized (bucket) blocks should be removed.
Are there legitimate cases where null is passed to the methods and should be ignored? If yes, please add a comment describing the cases. Otherwise, please replace the null check with Objects.requireNonNull(..).
We usually use Objects.requireNonNull(..) to detect bugs. If we ignore the null and return, it hides the bug and may lead to more serious problems such as data loss.
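A minimal sketch of the difference between the two null-handling styles. The class and method names are hypothetical, and container IDs are strings only to keep the example short.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical example; not the PR's PendingContainerTracker. */
class NullCheckSketch {
  private final Set<String> pending = new HashSet<>();

  /** Lenient style: a null argument is silently ignored, hiding a caller bug. */
  void recordLenient(String containerId) {
    if (containerId == null) {
      return;
    }
    pending.add(containerId);
  }

  /** Strict style: a null argument fails fast at the call site. */
  void recordStrict(String containerId) {
    Objects.requireNonNull(containerId, "containerId == null");
    pending.add(containerId);
  }

  int size() {
    return pending.size();
  }
}
```

In the lenient version a caller that loses track of an ID never finds out, so the pending count quietly drifts from reality. The strict version throws a `NullPointerException` with a message that points at the faulty argument.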

szetszwo

@ashishkumar50 , thanks for the quick update!

+1 the change looks good.

rakeshadr · 2026-04-15T15:12:44Z

+      if (elapsed >= 2 * rollIntervalMs) {
+        previousWindow.clear();
+        currentWindow.clear();
+        lastRollTime = now;


Can you add a log message for the full drop as well?

int dropped = previousWindow.size() + currentWindow.size();
previousWindow.clear();
currentWindow.clear();
lastRollTime = now;
LOG.debug("Double roll interval elapsed ({}ms): dropped {} pending containers from both windows", elapsed, dropped);

getCount() is called after both windows are cleared, so it will always print 0.

Can you change it to something like this:

int dropped = getCount();
previousWindow.clear();
currentWindow.clear();
lastRollTime = now;
LOG.debug("Double roll interval elapsed ({}ms): dropped {} pending containers", elapsed, dropped);

rakeshadr · 2026-04-15T15:15:34Z

+        previousWindow = currentWindow;
+        currentWindow = tmp;
+        lastRollTime = now;
+        LOG.debug("Rolled window. Previous window size: {}, Current window reset to empty", previousWindow.size());


Can you add the elapsed time here as well?

LOG.debug("Rolled window after {}ms. Previous window size: {}, Current window reset to empty", elapsed, previousWindow.size());

rakeshadr

Thanks @ashishkumar50 for the continuous efforts. I suggested two log improvements; please address them.

+1 LGTM

rakeshadr · 2026-04-16T02:20:38Z

+    if (storageReports.isEmpty()) {
+      return false;
+    }
+    for (StorageReportProto report : storageReports) {


@ashishkumar50
Point 1) Can you add tests for this logic with multiple volumes in a datanode?

Test scenario:

pendingAllocationBytes = 15GB
Volume-0: capacity=100GB, usableSpace=20GB <---- (20-15 >= 5) <--- return true
Volume-1: capacity=100GB, usableSpace=1GB
Volume-2: capacity=100GB, usableSpace=1GB

Point 2) There is a corner case. Say all 3 volumes have 15GB free and pendingAllocationBytes is 15GB. A 5GB container fits easily, but SCM wrongly rejects the entire DN because it applies the 15GB of pending (which in reality may all be on one volume) to every volume.

Test scenario: a false negative. pendingAllocationBytes is the total pending across the entire DN, not per volume, and that causes the trouble. The good thing is that it won't result in a write failure, but it will leave space unused even though the volumes have room.

pendingAllocationBytes = 15GB
Volume-0: capacity=100GB, usableSpace=15GB <---- (15-15 >= 5) <--- return false
Volume-1: capacity=100GB, usableSpace=15GB <---- (15-15 >= 5) <--- return false
Volume-2: capacity=100GB, usableSpace=15GB <---- (15-15 >= 5) <--- return false

@szetszwo We changed this to use only usableSpace after this comment, but I think we should use effectiveAllocatableSpace across all the volumes, as Rakesh also mentioned. I have changed it to use effectiveAllocatableSpace. What do you think?

Sorry, I misunderstood the calculation. There are three different methods below for a single calculation. Let's combine them into a single method.

hasEffectiveAllocatableSpaceForNewContainer

getPendingAllocationSize

hasAllocatableSpaceAfterPending

There are also two different sizes

containerSize

maxContainerSize

Are they supposed to be different?

Combined into a single method. Also, only maxContainerSize is used now, since both sizes are the same.
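The two test scenarios in this thread can be worked through with a small sketch that compares the per-volume check against a check on the effective allocatable space summed across volumes. This is an illustration under stated assumptions, not the merged implementation: the class and method names are hypothetical, and volumes are reduced to an array of usable byte counts.

```java
/** Hypothetical comparison of the two space checks discussed in the review. */
class SpaceCheckSketch {
  static final long GB = 1L << 30;

  /** Subtracts the datanode-wide pending total from every single volume. */
  static boolean perVolumeCheck(long[] usable, long pendingBytes, long containerSize) {
    for (long u : usable) {
      if (u - pendingBytes >= containerSize) {
        return true;
      }
    }
    return false;
  }

  /** Sums whole-container slots over all volumes, then subtracts pending once. */
  static boolean effectiveCheck(long[] usable, long pendingBytes, long containerSize) {
    long effectiveAllocatableSpace = 0;
    for (long u : usable) {
      // Only space that can hold a complete container counts.
      effectiveAllocatableSpace += (u / containerSize) * containerSize;
    }
    return effectiveAllocatableSpace - pendingBytes >= containerSize;
  }
}
```

For volumes of 20GB, 1GB and 1GB with 15GB pending and a 5GB container, both checks accept the datanode. For three volumes of 15GB each, the per-volume check rejects it (15 - 15 = 0 < 5) while the summed check accepts it (45 - 15 = 30 >= 5), which is the false negative described in Point 2.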

HDDS-15024. Introduce PendingContainerTracker in SCM to prevent conta…

28a209c

…iner over-allocation per DataNode.

ashishkumar50 requested review from aswinshakil, rakeshadr, sumitagrawl and szetszwo April 13, 2026 06:08

ashishkumar50 mentioned this pull request Apr 13, 2026

HDDS-14921. Improve space accounting in SCM with In-Flight container allocation tracking. #10000

Open

peterxcli reviewed Apr 13, 2026


Comment thread ...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/PendingContainerTracker.java

adoroszlai changed the title ~~HDDS-15024. Introduce PendingContainerTracker in SCM to prevent container over-allocation per DataNode.~~ HDDS-15024. Track pending containers in SCM to prevent Datanode over-allocation Apr 13, 2026

szetszwo reviewed Apr 13, 2026


Fix review comments

704fcce

ashishkumar50 requested a review from szetszwo April 14, 2026 15:34

szetszwo reviewed Apr 14, 2026


ashishkr200 added 2 commits April 14, 2026 23:33

Fix comments

30b5233

Fix null check

be01c09

ashishkumar50 requested a review from szetszwo April 14, 2026 18:16

szetszwo approved these changes Apr 14, 2026


rakeshadr reviewed Apr 15, 2026


rakeshadr approved these changes Apr 15, 2026


Update logs

319b0d9

rakeshadr reviewed Apr 16, 2026


ashishkr200 added 3 commits April 16, 2026 11:28

Use effective space across volume

65b0426

Refactor code

10be8c0

Update log

1a2da8a

ashishkumar50 requested review from rakeshadr and szetszwo April 16, 2026 10:53

rakeshadr approved these changes Apr 16, 2026

