
Make admin operations on Statestore non blocking #9348

Merged

Conversation

pkumar-singh
Member

Motivation

Admin operations should be non-blocking.
Typical admin operations, particularly delete table/namespace operations, should not block.
If a delete operation fails, the worst case is that it leaves garbage behind in the state store, which an operator can always clean up manually later.

Modifications

Delete the table in a non-blocking way and do not wait for the operation to complete.
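
For illustration only, here is a minimal sketch of the fire-and-forget pattern described above. It is not the actual Pulsar code: the `StateStoreAdmin` interface, the `deleteTable` method, and the logging setup are assumptions standing in for the real state store admin client, whose async calls are assumed to return a `CompletableFuture`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.logging.Level;
import java.util.logging.Logger;

public class NonBlockingDeleteSketch {

    private static final Logger log = Logger.getLogger(NonBlockingDeleteSketch.class.getName());

    // Hypothetical stand-in for the state store admin client used in this PR; the real
    // client's async delete is assumed to return a CompletableFuture.
    interface StateStoreAdmin {
        CompletableFuture<Boolean> deleteTable(String namespace, String table);
    }

    static void deleteStateTable(StateStoreAdmin admin, String namespace, String table) {
        // Fire the delete and attach a completion callback instead of blocking on get()/join().
        admin.deleteTable(namespace, table).whenComplete((deleted, error) -> {
            if (error != null) {
                // Best effort: log the failure so an operator (or a cron job) can clean up later.
                log.log(Level.SEVERE, "Failed to delete state table " + namespace + "/" + table
                        + "; leaving it for manual cleanup", error);
            }
        });
        // No wait here: the caller (e.g. deregisterFunction) returns even if the delete
        // is still in flight or eventually fails.
    }
}
```

The trade-off, discussed in the review below, is that a failed delete only shows up in the log and the table is left behind for manual cleanup.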

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

@wolfstudy
Member

/pulsarbot run-failure-checks

@jerrypeng
Contributor

/pulsarbot run-failure-checks

1 similar comment
@pkumar-singh
Member Author

/pulsarbot run-failure-checks

Contributor

@eolivelli left a comment


If we do not wait for the operation to complete, we could end up in a situation where the execution of deregisterFunction finishes but we still have ongoing operations.

This may be a problem, and it may also make tests less predictable (and so more flaky).

Am I correct? (I hope I am missing some part of the story here and the change is indeed fine.)

@jerrypeng
Contributor

/pulsarbot run-failure-checks

@pkumar-singh
Member Author

pkumar-singh commented Jan 28, 2021

@eolivelli I agree, as I mentioned in the Modifications section of the PR:
"Delete the table in a non-blocking way and do not wait for the operation to complete." I agree that this makes table deletion best effort; the intention was not to make it unpredictable, but to make it best effort.

As for tests, I am not sure we have test coverage for this class, and I mentioned as much in the PR description.

@eolivelli
Contributor

Other tests probably use this function, and if we do not have tests then this is a good time to add them :)

@pkumar-singh
Member Author

Yes, other tests certainly use the deregisterFunction method. But in tests I believe worker().getStateStoreAdminClient() is null (I will confirm again), so the table deletion code was never executed before. With this change the deletion sits behind the same null check (StorageAdminClient adminClient = worker().getStateStoreAdminClient();), so it will not be executed now either. I agree that writing tests to cover these scenarios would be good, but as far as tests are concerned, this PR has no bearing on the existing ones.
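
As a minimal sketch of the null-guarded path described in this comment: the `Worker` and `StorageAdminClient` interfaces below are simplified stand-ins, and `deleteTable` is an illustrative method name rather than the real client API.

```java
import java.util.concurrent.CompletableFuture;

public class GuardedCleanupSketch {

    // Minimal stand-in for the worker accessor mentioned above; only the null guard matters here.
    interface Worker {
        StorageAdminClient getStateStoreAdminClient();
    }

    // Placeholder type standing in for the real StorageAdminClient from the BookKeeper
    // stream storage client; deleteTable is an illustrative method, not the real API.
    interface StorageAdminClient {
        CompletableFuture<Boolean> deleteTable(String namespace, String table);
    }

    static void cleanupStateTable(Worker worker, String namespace, String table) {
        StorageAdminClient adminClient = worker.getStateStoreAdminClient();
        if (adminClient == null) {
            // Tests (and workers without a state store) take this branch, so the
            // deletion path is skipped both before and after the change.
            return;
        }
        // Non-blocking delete: failures surface through the future and are only reported,
        // never propagated to the caller.
        adminClient.deleteTable(namespace, table).exceptionally(error -> {
            System.err.println("State table cleanup failed for " + namespace + "/" + table + ": " + error);
            return null;
        });
    }
}
```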

@pkumar-singh
Member Author

/pulsarbot run-failure-checks

@eolivelli
Contributor

I am sorry, but I am not sure this is the right way.
If you do not have feedback that something went wrong, you are going to have garbage and not know about it.

Why is waiting for this operation so annoying? Does it take too much time?

@pkumar-singh
Member Author

@eolivelli I know, and I mentioned in the commit message that we may end up leaving garbage behind, which can be cleaned up manually later since the failure is logged at error level.

I could be wrong, but my thinking is:

Deleting a table in the state store can sometimes take a long time (we are trying to address that), but regardless, deletion is a heavy operation and it may fail at multiple points. In those scenarios, should we allow ingestion to fail or block because a table could not be deleted? Or should we leave the garbage behind and make progress, while a cron job or an operator scans the logs and deletes the garbage?

I am leaning towards leaving the garbage and making progress.

@jerrypeng
Contributor

jerrypeng commented Jan 29, 2021

@eolivelli thanks for chiming in!

If you do not have feedback that something went wrong, you are going to have garbage and not know about it.

This is not entirely true. If an error occurs, it is logged.

The reasons for this change:

  1. Currently, all cleanup of external resources used by a Pulsar Function is best effort. There is no guarantee that a resource like a subscription or a table will be cleaned up; we make a best effort, but there is no guarantee. Thus, this change does not really alter any guarantees from that perspective. We will look into ways to improve that in the future.

  2. Since our current model is best effort, there is no need to wait for the table deletion operation to complete. Even though this is not an issue for Pulsar Functions itself, the state store client has some issues that we are investigating which cause it to block indefinitely. That is another reason not to wait and risk being blocked forever without making any progress.

Contributor

@eolivelli left a comment


@jerrypeng thanks for your explanation.

+1

@jerrypeng
Contributor

/pulsarbot run-failure-checks

@jerrypeng merged commit 13faf63 into apache:master on Feb 5, 2021
merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 6, 2021
Co-authored-by: Prashant <prashantk@splunk.com>