HDDS-15024. Track pending containers in SCM to prevent Datanode over-allocation#10073
HDDS-15024. Track pending containers in SCM to prevent Datanode over-allocation#10073ashishkumar50 wants to merge 8 commits intoapache:masterfrom
Conversation
…iner over-allocation per DataNode.
There was a problem hiding this comment.
@ashishkumar50 , thanks a lot for splitting the PR
- For simplicity, let's don't remove buckets. The number of datanodes is small (~10k). We just keep all of them in the map.
- Then, the map becomes very simple. Let's create a class.
- Always roll before return.
class DatanodeBuckets {
private final ConcurrentHashMap<DatanodeID, TwoWindowBucket> map = new ConcurrentHashMap<>();
TwoWindowBucket get(DatanodeID id) {
final TwoWindowBucket bucket = map.compute(id, (k, b) -> b != null ? b : new TwoWindowBucket(rollIntervalMs));
bucket.rollIfNeeded();
return bucket;
}
TwoWindowBucket get(DatanodeDetails dn) {
return map.get(dn.getID());
}
}| previousWindow.clear(); | ||
| currentWindow.clear(); | ||
| lastRollTime = now; | ||
| } else if (elapsed >= rollIntervalMs) { | ||
| previousWindow = currentWindow; | ||
| currentWindow = new HashSet<>(); |
There was a problem hiding this comment.
To be consistent, use clear() and swap:
previousWindow.clear();
final Set<ContainerID> tmp = previousWindow;
previousWindow = currentWindow;
currentWindow = tmp;| long usableSpace = VolumeUsage.getUsableSpace(report); | ||
| long containersOnThisDisk = usableSpace / containerSize; | ||
| effectiveAllocatableSpace += containersOnThisDisk * containerSize; | ||
| if (effectiveAllocatableSpace - pendingAllocationBytes >= containerSize) { |
There was a problem hiding this comment.
Just use:
if (usableSpace - pendingBytes >= containerSize) {| private Pipeline pipeline; | ||
| private DatanodeDetails dn1; | ||
| private DatanodeDetails dn2; | ||
| private DatanodeDetails dn3; | ||
| private ContainerID container1; | ||
| private ContainerID container2; | ||
| private ContainerID container3; |
There was a problem hiding this comment.
1 pipeline, 3 DNs and 3 containers are too small.
How about 1k DNs, 1k pipelines and 10k containers?
| * @param node The DataNode | ||
| * @return Set of pending container IDs | ||
| */ | ||
| public Set<ContainerID> getPendingContainers(DatanodeDetails node) { |
There was a problem hiding this comment.
Remove this method since it is very easy to be misused. E.g. the test calls it just for the size. Why copying the set to get the size?
//TestPendingContainerTracker
tracker.getPendingContainers(dn1).size())|
@szetszwo thanks for the review, handled the comments. |
There was a problem hiding this comment.
@ashishkumar50 , thanks for the update! The change looks mostly good.
- All the synchronized (bucket) should be removed.
- Are there legitimate cases to pass null to the methods and ignore it? If yes, please add a comment describing the cases. Otherwise, please replace the null check with
Objects.requireNonNull(..).
We usually useObjects.requireNonNull(..)to detect bugs. If we ignore and return, it hides the bug and may lead to more serious problem such as data loss.
szetszwo
left a comment
There was a problem hiding this comment.
@ashishkumar50 , thanks for the quick update!
+1 the change looks good.
| if (elapsed >= 2 * rollIntervalMs) { | ||
| previousWindow.clear(); | ||
| currentWindow.clear(); | ||
| lastRollTime = now; |
There was a problem hiding this comment.
Can you add a log message for the full drop as well.
int dropped = previousWindow.size() + currentWindow.size();
previousWindow.clear();
currentWindow.clear();
lastRollTime = now;
LOG.debug("Double roll interval elapsed ({}ms): dropped {} pending containers from both windows",
elapsed, dropped);
There was a problem hiding this comment.
getCount() is called after both windows are cleared. It will always print 0.
Can you change it like,
int dropped = getCount();
previousWindow.clear();
currentWindow.clear();
lastRollTime = now;
LOG.debug("Double roll interval elapsed ({}ms): dropped {} pending containers", elapsed, dropped);
| previousWindow = currentWindow; | ||
| currentWindow = tmp; | ||
| lastRollTime = now; | ||
| LOG.debug("Rolled window. Previous window size: {}, Current window reset to empty", previousWindow.size()); |
There was a problem hiding this comment.
Can you add elapsed time here as well.
LOG.debug("Rolled window after {}ms. Previous window size: {}, Current window reset to empty",
elapsed, previousWindow.size());
rakeshadr
left a comment
There was a problem hiding this comment.
Thanks @ashishkumar50 for the continuous efforts. Added two log improvements, please take care.
+1 LGTM
| if (storageReports.isEmpty()) { | ||
| return false; | ||
| } | ||
| for (StorageReportProto report : storageReports) { |
There was a problem hiding this comment.
@ashishkumar50
Point-1) Can you add tests for this logic with multiple volumes in a datanode.
Test scenario:
pendingAllocationBytes = 15GB
Volume-0: capacity=100GB, usableSpace=20GB <---- (20-15 >= 5) <--- return true
Volume-1: capacity=100GB, usableSpace=1GB
Volume-2: capacity=100GB, usableSpace=1GB
Point-2) There is a corner case. Say, all 3 volumes have 15GB free and pendingAllocationBytes is 15GB. A 5GB container fits easily. But SCM wrongly rejects the entire DN because it applied 15GB of pending (which in reality may all be on one volume) to every volume.
Test scenario: False negative case, since pendingAllocationBytes is the total pending across the entire DN, not per-volume and causing the trouble. Good thing is it won't result into write failure but it will result into unused space eventhough volumes has space.
pendingAllocationBytes = 15GB
Volume-0: capacity=100GB, usableSpace=15GB <---- (15-15 >= 0) <--- return false
Volume-1: capacity=100GB, usableSpace=15GB <---- (15-15 >= 0) <--- return false
Volume-2: capacity=100GB, usableSpace=15GB <---- (15-15 >= 0) <--- return false
There was a problem hiding this comment.
Sorry that I have misunderstood the calculation. There are three different methods below for a single calculation. Let's combine them into a single method.
- hasEffectiveAllocatableSpaceForNewContainer
- getPendingAllocationSize
- hasAllocatableSpaceAfterPending
There are also two different sizes
- containerSize
- maxContainerSize
Are they supposed to be different?
There was a problem hiding this comment.
Combined to single method also maxContainerSize is only used as both are same.
What changes were proposed in this pull request?
Introduce PendingContainerTracker in SCM to track container allocations that are issued but not yet fully realized on DataNodes.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15024
How was this patch tested?
Unit test