
HDDS-8168. Make deadlines inside MoveManager for move commands configurable #4415

Merged
merged 8 commits into apache:master on Mar 21, 2023

Conversation

siddhantsangwan
Contributor

What changes were proposed in this pull request?

MoveManager currently has a hardcoded deadline of 60 minutes. We should make this configurable, and that configuration should be present in ContainerBalancerConfiguration.
A move is replication plus deletion. Replication involves copying a container between two datanodes, so it's generally more expensive than deletion. We therefore need an additional configuration for deciding how much time out of the total move deadline is allowed for replication; the rest is allowed for deletion. The total move deadline is configured by "hdds.container.balancer.move.timeout".

Changes introduced:

  1. Added a new configuration "hdds.container.balancer.move.replication.timeout". Made MoveManager use this configuration.
  2. Added validations for this configuration in ContainerBalancer.
  3. Added this configuration to ContainerBalancerConfigurationProto.
  4. Other small changes.
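
Below is a minimal, hypothetical Java sketch (not the actual MoveManager or ContainerBalancer code) of how the two timeouts are meant to split the move deadline, including the validation described in point 2 above. All names are illustrative; the 65/50 minute values mirror the defaults in the diff further down.

// Illustrative only: splitting the move deadline between replication and deletion.
import java.time.Duration;

public final class MoveDeadlineSketch {
  public static void main(String[] args) {
    // Hypothetical values for hdds.container.balancer.move.timeout and
    // hdds.container.balancer.move.replication.timeout.
    Duration moveTimeout = Duration.ofMinutes(65);
    Duration replicationTimeout = Duration.ofMinutes(50);

    // Validation along the lines of point 2 above: the replication timeout
    // must be less than the overall move timeout.
    if (replicationTimeout.compareTo(moveTimeout) >= 0) {
      throw new IllegalArgumentException(
          "hdds.container.balancer.move.replication.timeout must be less than "
              + "hdds.container.balancer.move.timeout");
    }

    // Whatever is left after the replication budget is the budget for deletion.
    Duration deletionBudget = moveTimeout.minus(replicationTimeout);
    System.out.println("Replication budget: " + replicationTimeout.toMinutes() + " min");
    System.out.println("Deletion budget: " + deletionBudget.toMinutes() + " min");
  }
}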

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8168

How was this patch tested?

Added unit tests

@adoroszlai adoroszlai left a comment

LGTM, with one minor comment.

@siddhantsangwan
Contributor Author

DU is run in a DN every 60 minutes by default (hdds.datanode.du.refresh.period). That means SCM gets updated information on a DN's free space every 60 minutes. This information is used by Container Balancer to compare DNs based on their space utilisation (free space divided by total space). Container Balancer is stateless between iterations - all the space utilisation information is recalculated every iteration. So, it's good to have a default Container Balancer iteration interval that's greater than the DU interval. It prevents the balancer from making moves with the same stale information that was used in the previous iteration.

If we increase moveTimeout to 90 minutes, then at the latest we expect moves to complete close to the 90th minute. In the worst case, DU will run before moves have completed. This means if our Container Balancer iteration interval is close to (and greater than) 90 minutes, it'll start a new iteration that does not account for moves made by the previous iteration because DU hasn't calculated the latest space yet. That's why I think the iteration interval should be greater than 90 + 60 minutes. 160 minutes seems like a good default to me. If anyone wants to make it more aggressive, they can enable trigger.du.before.move.enable and reduce the iteration interval.
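
As a rough, self-contained sketch of the arithmetic in this comment (hypothetical names, not Ozone code): the iteration interval should exceed the move timeout plus the DU refresh period, so that a new iteration sees post-move space usage.

import java.time.Duration;

public final class IterationIntervalSketch {
  // Lower bound on a safe iteration interval: moves can take up to moveTimeout,
  // and DU needs up to one refresh period after that to report the new usage.
  static Duration minSafeIterationInterval(Duration moveTimeout, Duration duRefreshPeriod) {
    return moveTimeout.plus(duRefreshPeriod);
  }

  public static void main(String[] args) {
    Duration moveTimeout = Duration.ofMinutes(90);   // hypothetical 90-minute move deadline
    Duration duRefresh = Duration.ofMinutes(60);     // hdds.datanode.du.refresh.period default
    // Prints 150; a 160-minute iteration interval leaves some slack above this bound.
    System.out.println(minSafeIterationInterval(moveTimeout, duRefresh).toMinutes() + " min");
  }
}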

@kerneltime
Contributor

@SaketaChalamchala can you please take a look?

@siddhantsangwan
Contributor Author

Considering improvements made in HDDS-6577, it's probably best to have the iteration interval default to 70 minutes, same as before. We can revisit these values later if needed.

@@ -451,6 +451,8 @@ message ContainerBalancerConfigurationProto {

required bool shouldRun = 18;
optional int32 nextIterationIndex = 19;

Contributor

There is an extra blank line here - probably best to remove it.

if (conf.getMoveReplicationTimeout().toMillis() >=
conf.getMoveTimeout().toMillis()) {
LOG.warn("hdds.container.balancer.move.replication.timeout {} should " +
"be lesser than hdds.container.balancer.move.timeout {}.",

Contributor

typo: lesser -> less

conf.getMoveTimeout().toMinutes());
throw new InvalidContainerBalancerConfigurationException(
"hdds.container.balancer.move.replication.timeout should " +
"be lesser than hdds.container.balancer.move.timeout.");

Contributor

typo: lesser -> less

moveTimeout and replicationTimeout are set by ContainerBalancer.
*/
private long moveTimeout = 1000 * 65 * 60;
private long replicationTimeout = 1000 * 50 * 60;
private static final double MOVE_DEADLINE_FACTOR = 0.95;

Contributor

I wonder if using a factor like this makes sense for these longer duration timeouts.

The idea I had is that the datanode should time out before SCM does, so that when SCM abandons the command, we know the DN has given up on it too.

If we have a replication timeout of 50 mins, then 95% of that is 47.5, so the DN will give up 2.5 mins before SCM does. Feels like the DN is then giving up too early. If we have a 10 minute timeout or a 60 minute timeout, the DN doesn't need to give up earlier for the longer timeout.

Perhaps we could take the factor away from here completely, and then let RM decide what the DN timeout should be. That would simplify the API into RM slightly, as we would only need to pass a single timeout.

Perhaps use 30 seconds less than the SCM timeout for all values, rather than a percentage like we have now. It could be a single configuration in RM, rather than having the factor defined here and also in RM.
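
A small, hypothetical sketch (not the real MoveManager or ReplicationManager API) contrasting the two options discussed here, using the 0.95 factor and 50-minute replication timeout from the diff above; the 30-second margin is only the suggested example value.

import java.time.Duration;

public final class DatanodeDeadlineSketch {
  private static final double MOVE_DEADLINE_FACTOR = 0.95;
  private static final Duration FIXED_MARGIN = Duration.ofSeconds(30);

  // Current approach: the datanode deadline is a fraction of the SCM deadline,
  // so the DN's head start grows with the timeout (2.5 min for a 50 min timeout).
  static Duration factorBased(Duration scmTimeout) {
    return Duration.ofMillis((long) (scmTimeout.toMillis() * MOVE_DEADLINE_FACTOR));
  }

  // Suggested alternative: a fixed margin, so the DN always gives up the same
  // small amount of time before SCM, regardless of how long the timeout is.
  static Duration fixedOffset(Duration scmTimeout) {
    return scmTimeout.minus(FIXED_MARGIN);
  }

  public static void main(String[] args) {
    Duration replicationTimeout = Duration.ofMinutes(50);
    System.out.println("Factor based: "
        + factorBased(replicationTimeout).toMillis() / 60000.0 + " min"); // 47.5
    System.out.println("Fixed margin: "
        + fixedOffset(replicationTimeout).toMillis() / 60000.0 + " min"); // 49.5
  }
}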

Contributor

We could have another small PR to change the factor if you think it will complicate this PR too much. I am happy either way.

Contributor Author

If we have a replication timeout of 50 mins, then 95% of that is 47.5, so the DN will give up 2.5 mins before SCM does. Feels like the DN is then giving up too early. If we have a 10 minute timeout or a 60 minute timeout, the DN doesn't need to give up earlier for the longer timeout.

That's a good point. Created HDDS-8230 for fixing this.

@sodonnel sodonnel left a comment

LGTM

@adoroszlai adoroszlai merged commit 4a43f34 into apache:master Mar 21, 2023
25 checks passed
@adoroszlai
Contributor

Thanks @siddhantsangwan for the patch, @sodonnel for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Mar 23, 2023
* master: (43 commits)
  HDDS-8148. Improve log for Pipeline creation failure (apache#4385)
  HDDS-7853. Add support for RemoveSCM in SCMRatisServer. (apache#4358)
  HDDS-8042. Display certificate issuer in cert list command. (apache#4429)
  HDDS-8189. [Snapshot] renamedKeyTable should only track keys in buckets that has at least one active snapshot. (apache#4436)
  HDDS-8154. Perf: Reuse Mac instances in S3 token validation (apache#4433)
  HDDS-8245. Info log for keyDeletingService when nonzero number of keys are deleted. (apache#4451)
  HDDS-8233. ReplicationManager: Throttle delete container commands from over-replication handlers (apache#4447)
  HDDS-8220. [Ozone-Streaming] Trigger volume check on IOException in StreamDataChannelBase (apache#4428)
  HDDS-8173. Fix to remove enrties from RocksDB after container gets deleted. (apache#4445)
  HDDS-7975. Rebalance acceptance tests (apache#4437)
  HDDS-8152. Reduce S3 acceptance test setup time (apache#4393)
  HDDS-8172. ECUnderReplicationHandler should consider commands already sent when processing the container (apache#4435)
  HDDS-7883. [Snapshot] Accommodate FSO, key renames and implement OMSnapshotPurgeRequest for SnapshotDeletingService (apache#4407)
  HDDS-8168. Make deadlines inside MoveManager for move commands configurable (apache#4415)
  HDDS-7918. EC: ECBlockReconstructedStripeInputStream should check for spare replicas before failing an index (apache#4441)
  HDDS-8222. EndpointBase#getBucket should handle BUCKET_NOT_FOUND (apache#4431)
  HDDS-8068. Fix Exception: JMXJsonServlet, getting attribute RatisRoles of Hadoop:service=OzoneManager. (apache#4352)
  HDDS-8139. Datanodes should not drop block delete transactions based on transaction ID (apache#4384)
  HDDS-8216. EC: OzoneClientConfig is overwritten in ECKeyOutputStream (apache#4425)
  HDDS-8054. Fix NPE in metrics for failed volume (apache#4340)
  ...