
Make admin operations on Statestore non blocking #9348

Merged

Conversation

pkumar-singh
Member

Motivation

Admin operations should be non-blocking.
Typical admin operations, particularly delete table/namespace operations, should not block.
If a delete operation fails, the worst case is that it leaves garbage behind in the state store, which an operator can always clean up manually later.

Modifications

Delete the table in a non-blocking way and do not wait for the operation to complete.
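
For illustration only, here is a minimal sketch of the fire-and-forget pattern described above. It is not the actual Pulsar code: the `StateStoreAdmin` interface, the `deleteTable` method, and the logging setup are assumptions standing in for the real state store admin client, whose async calls are assumed to return a `CompletableFuture`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.logging.Level;
import java.util.logging.Logger;

public class NonBlockingDeleteSketch {

    private static final Logger log = Logger.getLogger(NonBlockingDeleteSketch.class.getName());

    // Hypothetical stand-in for the state store admin client used in this PR; the real
    // client's async delete is assumed to return a CompletableFuture.
    interface StateStoreAdmin {
        CompletableFuture<Boolean> deleteTable(String namespace, String table);
    }

    static void deleteStateTable(StateStoreAdmin admin, String namespace, String table) {
        // Fire the delete and attach a completion callback instead of blocking on get()/join().
        admin.deleteTable(namespace, table).whenComplete((deleted, error) -> {
            if (error != null) {
                // Best effort: log the failure so an operator (or a cron job) can clean up later.
                log.log(Level.SEVERE, "Failed to delete state table " + namespace + "/" + table
                        + "; leaving it for manual cleanup", error);
            }
        });
        // No wait here: the caller (e.g. deregisterFunction) returns even if the delete
        // is still in flight or eventually fails.
    }
}
```

The trade-off, discussed in the review below, is that a failed delete only shows up in the log and the table is left behind for manual cleanup.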

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

@wolfstudy
Member

/pulsarbot run-failure-checks

@jerrypeng
Contributor

/pulsarbot run-failure-checks

1 similar comment
@pkumar-singh
Member Author

/pulsarbot run-failure-checks

Contributor

@eolivelli left a comment


If we do not wait for the operation to complete, we could end up in a situation where the execution of deregisterFunction finishes but we still have ongoing operations.

This may be a problem, and it may also make tests less predictable (and so more flaky).

Am I correct? (I hope I am missing some part of the story here and the change is indeed fine.)

@jerrypeng
Contributor

/pulsarbot run-failure-checks

@pkumar-singh
Member Author

pkumar-singh commented Jan 28, 2021

@eolivelli I agree, as I mentioned in the Modifications section of the PR:
"Delete the table in a non-blocking way and do not wait for the operation to complete." I agree that this makes table deletion best effort; the intention was not to make it unpredictable, but to make it best effort.

As for tests, I am not sure we have test coverage for this class, and I mentioned as much in the PR description.

@eolivelli
Contributor

Other tests probably use this function, and if we do not have tests then this is a good time to add them :)

@pkumar-singh
Member Author

Yes, other tests certainly use the deregisterFunction method. But in tests I believe worker().getStateStoreAdminClient() is null (I will confirm again), so the table deletion code was never executed before. With this change the deletion sits behind the same null check (StorageAdminClient adminClient = worker().getStateStoreAdminClient();), so it will not be executed now either. I agree that writing tests to cover these scenarios would be good, but as far as tests are concerned, this PR has no bearing on the existing ones.
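
As a minimal sketch of the null-guarded path described in this comment: the `Worker` and `StorageAdminClient` interfaces below are simplified stand-ins, and `deleteTable` is an illustrative method name rather than the real client API.

```java
import java.util.concurrent.CompletableFuture;

public class GuardedCleanupSketch {

    // Minimal stand-in for the worker accessor mentioned above; only the null guard matters here.
    interface Worker {
        StorageAdminClient getStateStoreAdminClient();
    }

    // Placeholder type standing in for the real StorageAdminClient from the BookKeeper
    // stream storage client; deleteTable is an illustrative method, not the real API.
    interface StorageAdminClient {
        CompletableFuture<Boolean> deleteTable(String namespace, String table);
    }

    static void cleanupStateTable(Worker worker, String namespace, String table) {
        StorageAdminClient adminClient = worker.getStateStoreAdminClient();
        if (adminClient == null) {
            // Tests (and workers without a state store) take this branch, so the
            // deletion path is skipped both before and after the change.
            return;
        }
        // Non-blocking delete: failures surface through the future and are only reported,
        // never propagated to the caller.
        adminClient.deleteTable(namespace, table).exceptionally(error -> {
            System.err.println("State table cleanup failed for " + namespace + "/" + table + ": " + error);
            return null;
        });
    }
}
```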

@pkumar-singh
Member Author

/pulsarbot run-failure-checks

@eolivelli
Contributor

I am sorry, but I am not sure this is the right way.
If you do not have feedback that something went wrong, you are going to have garbage and not know about it.

Why is waiting for this operation so annoying? Does it take too much time?

@pkumar-singh
Member Author

@eolivelli I know, and I mentioned in the commit message that we may end up leaving garbage behind, which can be cleaned up manually later since the failure is logged at error level.

I could be wrong, but my thinking is:

Deleting a table in the state store can sometimes take a long time (we are trying to address that), but regardless, deletion is a heavy operation and it may fail at multiple points. In those scenarios, should we allow ingestion to fail or block because a table could not be deleted? Or should we leave the garbage behind and make progress, while a cron job or an operator scans the logs and deletes the garbage?

I am leaning towards leaving the garbage and making progress.

@jerrypeng
Contributor

jerrypeng commented Jan 29, 2021

@eolivelli thanks for chiming in!

If you do not have feedback that something went wrong, you are going to have garbage and not know about it.

This is not entirely true. If an error occurs, it is logged.

The reasons for this change:

  1. Currently, all cleanup of external resources used by a Pulsar Function is best effort. There is no guarantee that a resource like a subscription or a table will be cleaned up; we make a best effort, but there is no guarantee. Thus, this change does not really alter any guarantees from that perspective. We will look into ways to improve that in the future.

  2. Since our current model is best effort, there is no need to wait for the table deletion operation to complete. Even though this is not an issue for Pulsar Functions itself, the state store client has some issues that we are investigating which cause it to block indefinitely. That is another reason not to wait and risk being blocked forever without making any progress.

Contributor

@eolivelli left a comment


@jerrypeng thanks for your explanation.

+1

@jerrypeng
Contributor

/pulsarbot run-failure-checks

@jerrypeng merged commit 13faf63 into apache:master on Feb 5, 2021
merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 6, 2021
Co-authored-by: Prashant <prashantk@splunk.com>