HDDS-8168. Make deadlines inside MoveManager for move commands configurable #4415
Conversation
# Conflicts:
#	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerTask.java
LGTM, with one minor comment.
DU is run in a DN every 60 minutes by default (If we increase …
@SaketaChalamchala can you please take a look?
Considering improvements made in HDDS-6577, it's probably best to have the iteration interval default to 70 minutes, same as before. We can revisit these values later if needed.
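For context, the relationship between the DU refresh period and the iteration interval can be sketched as follows. This is a minimal illustration, not code from the PR; the 60- and 70-minute values are the ones mentioned in this thread, and `hasFreshUsageData` is a hypothetical helper:

```java
import java.time.Duration;

public class IterationIntervalCheck {
    // Assumed values from the discussion: DU refresh runs every 60 minutes
    // on a datanode, and the balancer iteration interval defaults to 70.
    static final Duration DU_REFRESH = Duration.ofMinutes(60);
    static final Duration ITERATION_INTERVAL = Duration.ofMinutes(70);

    static boolean hasFreshUsageData(Duration iterationInterval) {
        // Each iteration should start only after the datanodes have had a
        // chance to refresh their disk-usage numbers at least once.
        return iterationInterval.compareTo(DU_REFRESH) > 0;
    }

    public static void main(String[] args) {
        System.out.println(hasFreshUsageData(ITERATION_INTERVAL)); // true
    }
}
```

With a 70-minute interval each balancer iteration sees usage numbers that were refreshed after the previous iteration finished, which is why the default is kept slightly above the DU period.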
@@ -451,6 +451,8 @@ message ContainerBalancerConfigurationProto {

  required bool shouldRun = 18;
  optional int32 nextIterationIndex = 19;

There is an extra blank line here - probably best to remove it.
if (conf.getMoveReplicationTimeout().toMillis() >=
    conf.getMoveTimeout().toMillis()) {
  LOG.warn("hdds.container.balancer.move.replication.timeout {} should " +
      "be lesser than hdds.container.balancer.move.timeout {}.",
typo: lesser -> less
      conf.getMoveTimeout().toMinutes());
  throw new InvalidContainerBalancerConfigurationException(
      "hdds.container.balancer.move.replication.timeout should " +
      "be lesser than hdds.container.balancer.move.timeout.");
typo: lesser -> less
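The validation under review can be sketched as a self-contained snippet. The config key names come from the diff above; `validate` and its `Duration` parameters are illustrative, not the actual MoveManager or ContainerBalancerConfiguration API:

```java
import java.time.Duration;

public class MoveTimeoutValidation {
    // Sketch of the check in this PR: the replication timeout must be
    // strictly less than the overall move timeout, since replication is
    // only one part of a move (replication + deletion).
    static void validate(Duration moveTimeout, Duration replicationTimeout) {
        if (replicationTimeout.toMillis() >= moveTimeout.toMillis()) {
            throw new IllegalArgumentException(
                "hdds.container.balancer.move.replication.timeout should " +
                "be less than hdds.container.balancer.move.timeout.");
        }
    }

    public static void main(String[] args) {
        // The defaults discussed in this PR pass the check.
        validate(Duration.ofMinutes(65), Duration.ofMinutes(50));
        // Reversed values are rejected.
        try {
            validate(Duration.ofMinutes(50), Duration.ofMinutes(65));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```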
 * moveTimeout and replicationTimeout are set by ContainerBalancer.
 */
private long moveTimeout = 1000 * 65 * 60;
private long replicationTimeout = 1000 * 50 * 60;
private static final double MOVE_DEADLINE_FACTOR = 0.95;
I wonder if using a factor like this makes sense for these longer-duration timeouts.
The idea I had is that the datanode should time out before SCM does, so that when SCM abandons the command, we know the DN has given up on it too.
If we have a replication timeout of 50 mins, then 95% of that is 47.5, so the DN will give up 2.5 mins before SCM does. It feels like the DN is then giving up too early. Whether we have a 10-minute timeout or a 60-minute timeout, the DN doesn't need to give up earlier for the longer timeout.
Perhaps we could take the factor away from here completely, and let RM decide what the DN timeout should be. That would simplify the API into RM slightly, as we would only need to pass a single timeout.
Perhaps 30 seconds less than the SCM timeout for all values, rather than a percentage like we have now. It could then be a single configuration in RM, rather than having the factor defined here and also in RM.
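The arithmetic behind this suggestion can be sketched as follows. This is a minimal illustration comparing the two alternatives; `byFactor` and `byOffset` are hypothetical helpers, not code from this PR:

```java
import java.time.Duration;

public class DatanodeDeadlineSketch {
    // The two alternatives discussed: scale the SCM timeout by a factor,
    // or subtract a fixed offset. The values are the ones in the thread.
    static final double MOVE_DEADLINE_FACTOR = 0.95;
    static final Duration FIXED_OFFSET = Duration.ofSeconds(30);

    static Duration byFactor(Duration scmTimeout) {
        return Duration.ofMillis(
            (long) (scmTimeout.toMillis() * MOVE_DEADLINE_FACTOR));
    }

    static Duration byOffset(Duration scmTimeout) {
        return scmTimeout.minus(FIXED_OFFSET);
    }

    public static void main(String[] args) {
        Duration t = Duration.ofMinutes(50);
        // How early the DN gives up relative to SCM in each scheme:
        System.out.println(t.minus(byFactor(t)).toSeconds()); // 150 (2.5 min)
        System.out.println(t.minus(byOffset(t)).toSeconds()); // 30
    }
}
```

The factor makes the DN's head start grow with the timeout (2.5 minutes at 50 minutes, 3 minutes at 60), while a fixed offset keeps it constant regardless of the configured value.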
We could have another small PR to change the factor if you think it will complicate this PR too much. I am happy either way.
> If we have a replication timeout of 50 mins, then 95% of that is 47.5, so the DN will give up 2.5 mins before SCM does. Feels like the DN is then giving up too early. If we have a 10 minute timeout or a 60 minute timeout, the DN doesn't need to give up earlier for the longer timeout.
That's a good point. Created HDDS-8230 for fixing this.
LGTM
Thanks @siddhantsangwan for the patch, @sodonnel for the review.
* master: (43 commits)
  HDDS-8148. Improve log for Pipeline creation failure (apache#4385)
  HDDS-7853. Add support for RemoveSCM in SCMRatisServer. (apache#4358)
  HDDS-8042. Display certificate issuer in cert list command. (apache#4429)
  HDDS-8189. [Snapshot] renamedKeyTable should only track keys in buckets that has at least one active snapshot. (apache#4436)
  HDDS-8154. Perf: Reuse Mac instances in S3 token validation (apache#4433)
  HDDS-8245. Info log for keyDeletingService when nonzero number of keys are deleted. (apache#4451)
  HDDS-8233. ReplicationManager: Throttle delete container commands from over-replication handlers (apache#4447)
  HDDS-8220. [Ozone-Streaming] Trigger volume check on IOException in StreamDataChannelBase (apache#4428)
  HDDS-8173. Fix to remove enrties from RocksDB after container gets deleted. (apache#4445)
  HDDS-7975. Rebalance acceptance tests (apache#4437)
  HDDS-8152. Reduce S3 acceptance test setup time (apache#4393)
  HDDS-8172. ECUnderReplicationHandler should consider commands already sent when processing the container (apache#4435)
  HDDS-7883. [Snapshot] Accommodate FSO, key renames and implement OMSnapshotPurgeRequest for SnapshotDeletingService (apache#4407)
  HDDS-8168. Make deadlines inside MoveManager for move commands configurable (apache#4415)
  HDDS-7918. EC: ECBlockReconstructedStripeInputStream should check for spare replicas before failing an index (apache#4441)
  HDDS-8222. EndpointBase#getBucket should handle BUCKET_NOT_FOUND (apache#4431)
  HDDS-8068. Fix Exception: JMXJsonServlet, getting attribute RatisRoles of Hadoop:service=OzoneManager. (apache#4352)
  HDDS-8139. Datanodes should not drop block delete transactions based on transaction ID (apache#4384)
  HDDS-8216. EC: OzoneClientConfig is overwritten in ECKeyOutputStream (apache#4425)
  HDDS-8054. Fix NPE in metrics for failed volume (apache#4340)
  ...
What changes were proposed in this pull request?
MoveManager currently has a hardcoded deadline of 60 minutes. We should make this configurable, and that configuration should be present in ContainerBalancerConfiguration.
Move is replication + deletion. Replication involves copying a container between two datanodes, so it's generally more expensive than deletion. So we probably need an additional configuration for deciding how much of the total move deadline is allowed for replication; the rest is allowed for deletion. The move deadline's configuration is "hdds.container.balancer.move.timeout".
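The budget split can be illustrated with the 65- and 50-minute defaults from the diff in this PR. `deletionBudget` is a hypothetical helper for illustration, not part of MoveManager:

```java
import java.time.Duration;

public class MoveDeadlineSplit {
    // hdds.container.balancer.move.timeout bounds the whole move;
    // hdds.container.balancer.move.replication.timeout bounds the
    // replication phase. Whatever remains is left for deletion.
    static Duration deletionBudget(Duration moveTimeout,
                                   Duration replicationTimeout) {
        return moveTimeout.minus(replicationTimeout);
    }

    public static void main(String[] args) {
        Duration moveTimeout = Duration.ofMinutes(65);
        Duration replicationTimeout = Duration.ofMinutes(50);
        System.out.println(
            deletionBudget(moveTimeout, replicationTimeout).toMinutes()); // 15
    }
}
```

With these defaults, replication may use up to 50 of the 65 minutes, leaving 15 minutes for the cheaper deletion step.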
Changes introduced:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8168
How was this patch tested?
Added unit tests