Skip to content

HDDS-9635. Trying to close a closed container from CLI results in indefinite retries.#5710

Merged
adoroszlai merged 6 commits intoapache:masterfrom
sadanand48:HDDS-9635
Dec 13, 2023
Merged

HDDS-9635. Trying to close a closed container from CLI results in indefinite retries.#5710
adoroszlai merged 6 commits intoapache:masterfrom
sadanand48:HDDS-9635

Conversation

@sadanand48
Copy link
Contributor

What changes were proposed in this pull request?

SCM should let client know if container is closed when it tries to close a container instead of retrying.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9635

How was this patch tested?

Unit tests

Sadanand Shenoy added 2 commits December 1, 2023 00:20
…efinite retries

(cherry picked from commit bfafdfc98191054873f11ff94239bf1a2852e1a2)
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sadanand48 for the patch.

@adoroszlai
Copy link
Contributor

Thanks @sadanand48 for updating the patch.

Now that I've tested it, I noticed that while the container is closing, we may still get some ugly exceptions a few times, before finally getting Container ... already closed:

com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.scm.exceptions.SCMException): Cannot close a CLOSING container.
  at org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.closeContainer(SCMClientProtocolServer.java:669)
  ...
, while invoking $Proxy20.submitRequest over nodeId=scmNodeId,nodeAddress=scm/172.19.0.8:9860. Trying to failover after sleeping for 2000ms. Current retry count: 0.
...
Container 2 already closed

I also noticed that after Container ... already closed, the command exits with error code.

I would like to suggest considering the following enhancements if possible:

  1. show Container ... already (CLOSING|CLOSED) (only the actual state, not both)
  2. let the command exit successfully

What do you think?

@adoroszlai adoroszlai requested a review from xBis7 December 4, 2023 14:10
@sadanand48
Copy link
Contributor Author

Thanks, it makes sense, will update the patch.

Copy link
Contributor

@xBis7 xBis7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sadanand48 Thanks for the patch.

How does the client know when to retry and when not to? Does it have to do with the exception type or a non-0 exit code?

Will it be better to return the response proto back to the client instead of returning void and throwing an exception?

@sadanand48
Copy link
Contributor Author

Command exits now

bash-4.2$ ozone admin container close 1
Unable to close container
java.io.IOException: Container 1 is in closing state
	at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.closeContainer(StorageContainerLocationProtocolClientSideTranslatorPB.java:592)
	at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.closeContainer(ContainerOperationClient.java:423)
	at org.apache.hadoop.hdds.scm.cli.container.CloseSubcommand.execute(CloseSubcommand.java:53)
	at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:39)
	at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
	at org.apache.hadoop.hdds.cli.OzoneAdmin.lambda$execute$0(OzoneAdmin.java:92)
	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
	at org.apache.hadoop.hdds.cli.OzoneAdmin.execute(OzoneAdmin.java:91)
	at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
	at org.apache.hadoop.hdds.cli.OzoneAdmin.main(OzoneAdmin.java:84)
bash-4.2$ ozone admin container close 1
Unable to close container
java.io.IOException: Container 1 already closed
	at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.closeContainer(StorageContainerLocationProtocolClientSideTranslatorPB.java:592)
	at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.closeContainer(ContainerOperationClient.java:423)
	at org.apache.hadoop.hdds.scm.cli.container.CloseSubcommand.execute(CloseSubcommand.java:53)
	at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:39)
	at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
	at org.apache.hadoop.hdds.cli.OzoneAdmin.lambda$execute$0(OzoneAdmin.java:92)
	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
	at org.apache.hadoop.hdds.cli.OzoneAdmin.execute(OzoneAdmin.java:91)
	at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
	at org.apache.hadoop.hdds.cli.OzoneAdmin.main(OzoneAdmin.java:84)

@sadanand48
Copy link
Contributor Author

Will it be better to return the response proto back to the client instead of returning void and throwing an exception?

Thanks @xBis7 , yes the current patch returns the exception state (CLOSED/CLOSING) in the response proto and lets client know to not retry. In regular case, the retry happens whenever there is an exception.

@adoroszlai
Copy link
Contributor

Command exits now

bash-4.2$ ozone admin container close 1
Unable to close container
java.io.IOException: Container 1 is in closing state
	at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.closeContainer(StorageContainerLocationProtocolClientSideTranslatorPB.java:592)
...

@sadanand48, I think we should print these stack traces only in verbose mode (--verbose).

@sadanand48
Copy link
Contributor Author

@sadanand48, I think we should print these stack traces only in verbose mode (--verbose).

Thanks @adoroszlai , updated.

bash-4.2$ ozone admin container close 1
Unable to close container : Container 1 already closed
bash-4.2$ ozone admin --verbose container close 1
Unable to close container
java.io.IOException: Container 1 already closed
	at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.closeContainer(StorageContainerLocationProtocolClientSideTranslatorPB.java:592)
	at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.closeContainer(ContainerOperationClient.java:423)
	at org.apache.hadoop.hdds.scm.cli.container.CloseSubcommand.execute(CloseSubcommand.java:58)
	at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:39)
	at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
	at org.apache.hadoop.hdds.cli.OzoneAdmin.lambda$execute$0(OzoneAdmin.java:92)
	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
	at org.apache.hadoop.hdds.cli.OzoneAdmin.execute(OzoneAdmin.java:91)
	at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
	at org.apache.hadoop.hdds.cli.OzoneAdmin.main(OzoneAdmin.java:84)

Copy link
Contributor

@xBis7 xBis7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sadanand48 I've tested it locally. LGTM!

@adoroszlai
Copy link
Contributor

Thanks @sadanand48 for updating the patch, LGTM.

@adoroszlai
Copy link
Contributor

@errose28 @kaijchen @sodonnel would you like to take a look?

@adoroszlai adoroszlai merged commit 09ec0b1 into apache:master Dec 13, 2023
@adoroszlai
Copy link
Contributor

Thanks @sadanand48 for the patch, @xBis7 for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants