Skip to content

NIFI-11294 - Add support for component state as checkpointing strategy#7076

Closed
malthe wants to merge 14 commits intoapache:mainfrom
malthe:NIFI-11294-use-processor-state-for-checkpointing
Closed

NIFI-11294 - Add support for component state as checkpointing strategy#7076
malthe wants to merge 14 commits intoapache:mainfrom
malthe:NIFI-11294-use-processor-state-for-checkpointing

Conversation

@malthe
Copy link
Contributor

@malthe malthe commented Mar 23, 2023

Summary

NIFI-11294 ConsumeAzureEventHub should default to processor state for checkpointing

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 11
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@malthe
Copy link
Contributor Author

malthe commented Mar 23, 2023

Note that the existing tests (which I have adapted, somewhat unsuccessfully) are not very good – they do not actually exercise the processor's logic end-to-end, but only the "meat". I think the whole test setup for this processor really needs to be reworked entirely.

@pvillard31 pvillard31 changed the title Add support for component state as checkpointing strategy NIFI-11294 - Add support for component state as checkpointing strategy Mar 23, 2023
@turcsanyip turcsanyip self-requested a review March 23, 2023 16:29
@turcsanyip
Copy link
Contributor

Will review...

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing this feature @malthe! Just noting that the automated builds flagged a couple Checkstyle violations, and the unit tests failed due to unnecessary stubbing. That usually means there are unnecessary mock calls that need to be removed or adjusted so the when expectations are only called when needed.

@turcsanyip
Copy link
Contributor

@malthe Thanks for working on this feature, it would be a valuable addition in ConsumeAzureEventHub.

I tried out the processor and also had a look at the code.

There is an essential issue with the new component state strategy: it does not handle the partition ownership.
Please note, not only the checkpoints but also the ownership info is need to be handled and stored by a CheckpointStore implementation. An example can be found in BlobCheckpointStore.
If you run the processor on a NiFi cluster, all nodes will own and process all partitions but the partitions should be load-balanced among the consumer group members (instances of the same ConsumeAzureEventHub processor on the cluster nodes).

Also the processor cannot be stopped cleanly but throws NullPointerExceptions in a loop after Stop.

Please fix these first.

@malthe
Copy link
Contributor Author

malthe commented Mar 23, 2023

@exceptionfactory thanks – these issues have been fixed now.

I have updated the tests in adef599 so that they're exercising the public API (i.e. onTrigger).

@malthe
Copy link
Contributor Author

malthe commented Mar 27, 2023

@turcsanyip I have now implemented proper load-balancing in a9bd297 – and to fix the issues with stopping the processor, I have removed most of the processor state fields, shifting them into a event batch processing closure instead.

However, what remains at this point to be tested is the component-state checkpoint store in isolation.

@turcsanyip
Copy link
Contributor

Thanks @malthe !
Apologies, I did not have time to check it and I'm going to be AFK for the next week. Will review when I'm back.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates on this new feature @malthe! I noted several test implementation concerns, and a couple recommendations on the implementation itself.

public void testReceiveOne() {
setProperties();
testRunner.run(1, false);
@EnumSource(CheckpointStrategy.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing all of these test methods to run for both checkpoint strategy is unnecessary and results in unnecessary processing. Recommend creating a couple separate methods instead.

@BeforeEach
public void setupProcessor() {
processor = new MockConsumeAzureEventHub();
processor = spy(new MockConsumeAzureEventHub());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MockConsumeAzureEventHub Processor already isolates certain behavior for testing, so introducing a spy() creates and additional level of indirection. Is there a particular reason for this approach as opposed to other alternatives?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, the processor wasn't tested through the onTrigger but through some internal methods.

I changed the tests so that all testing happens through the actually exposed interface. However, this requires some additional mocking in order to implement the checkpoint strategy for example.

(The reason being, we don't have the luxury of testing against a real event hub here.)

But instead of spying on the processor, I suppose I could implement some alternative methods on the mock class. I don't remember if I tried that and faced some issues perhaps.

EVENT_HUB_NAME,
CONSUMER_GROUP
).blockFirst();
assert listed.getETag().equals(claimed.getETag());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert keyword is not intended for testing. All instances should be replaces with assertEquals() or other applicable JUnit 5 assert methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4aa8240.

Comment on lines +575 to +586
new ComponentStateCheckpointStore(
identifier,
new ComponentStateCheckpointStore.State() {
public StateMap getState() throws IOException {
return session.getState(Scope.CLUSTER);
}

public boolean replaceState(StateMap oldValue, Map<String, String> newValue) {
return false;
}
}
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inner anonymous class results in multiple levels of nesting that make this hard to follow. Recommend breaking it out to an explicit class.

} else {
checkpointStore = new ComponentStateCheckpointStore(
identifier,
new ComponentStateCheckpointStore.State() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having two different anonymous inner class implementations make this hard to follow. Recommend breaking this out to a distinct class.


final Map<String, String> newState = new HashMap<>(oldState.toMap());
long timestamp = System.currentTimeMillis();
String eTag = identifier + "/" + timestamp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the eTag be created based on the state version as opposed the current timestamp? That seems like it would align better with the purpose of the tag for tracking ownership and ensuring consistent version control across nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure – but that logic was copied from the blob-based checkpoint store.

In fact, whenever I could, I have used the exact same logic between them.

Comment on lines +192 to +211
final String key = entry.getKey();
final String[] parts = key.split("/", 5);
if (parts.length != 5) {
throw new ProcessException(
String.format("Invalid %s key: %s", kind, entry.getKey())
);
}
if (!parts[0].equals(kind)) {
continue;
}
final String fullyQualifiedNamespace = parts[1];
final String eventHubName = parts[2];
final String consumerGroup = parts[3];
final String partitionId = parts[4];
PartitionContext partitionContext = new PartitionContext(
fullyQualifiedNamespace,
eventHubName,
consumerGroup,
partitionId
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend pulling this logic out to a separate method so the formatting approach is clear.

Copy link
Contributor

@turcsanyip turcsanyip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malthe I tried out the processor but it seems to process all messages again and again after restarts. I cannot see the checkpoint data saved in the state (only ownership) so I think this is the root cause. Could you please check it?

Could you please also rebase your changes onto the current main branch because there are some conflicting changes?
Thanks

Comment on lines +22 to +23
AZURE_BLOB_STORAGE("azure-blob-storage", "Use Azure Blob Storage to store partition checkpoints and ownership"),
COMPONENT_STATE("component-state", "Use component state to store partition checkpoints");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use "user friendly" labels like Component State on the UI?

.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
.expressionLanguageSupported(ExpressionLanguageScope.VARIABLE_REGISTRY)
.required(true)
.dependsOn(CHECKPOINT_STRATEGY, CheckpointStrategy.AZURE_BLOB_STORAGE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other Storage * properties below should also depend on CheckpointStrategy.AZURE_BLOB_STORAGE.
Storage Account Key property can also be mandatory as it will be shown only when it is required.


public boolean replaceState(StateMap oldValue, Map<String, String> newValue) throws IOException {
final ProcessSession session = processSessionFactory.createSession();
if (!session.replaceState(oldValue, newValue, Scope.CLUSTER)) {
Copy link
Contributor

@turcsanyip turcsanyip May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProcessSession.replaceState() does not provide fully proper optimistic locking because the state can be changed by another session between replaceState() and commit(). So even if replaceState() returns true, the state change may be omitted at commit time due to a concurrent update.

I suggest using ProcessContext.getStateManager().replace() instead which has the right optimistic locking semantics which is required from a checkpoint store implementation.

Please note ProcessContext.getStateManager().replace() cannot initialize the state currently (see also NIFI-11595) so it needs to be created with setState() before using replace(). E.g. in @OnScheduled like this:

    @OnScheduled
    public void onScheduled(ProcessContext context) throws IOException {
        if (getNodeTypeProvider().isPrimary()) {
            final StateManager stateManager = context.getStateManager();
            final StateMap state = stateManager.getState(Scope.CLUSTER);

            if (!state.getStateVersion().isPresent()) {
                stateManager.setState(new HashMap<>(), Scope.CLUSTER);
            }
        }
    }

malthe and others added 2 commits May 26, 2023 08:44
Co-authored-by: Peter Turcsanyi <35004384+turcsanyip@users.noreply.github.com>
@turcsanyip
Copy link
Contributor

@malthe Just a heads-up that NIFI-11595 has been merged fixing the replaceState() logic I mentioned in my previous comment. It makes the state handling of the checkpoints simpler, though it was not a blocker.

Considering that there was no activity on the PR in the past months, I would like to ask if you are still interested in this feature? Do you intend to finish the PR? I'm happy to jump in and continue the implementation based on your PR if you have other interests now.

Thanks

@malthe
Copy link
Contributor Author

malthe commented Sep 19, 2023

@turcsanyip please do – I am currently not actively using NiFi so just following along from the sideline.

@turcsanyip
Copy link
Contributor

Thanks @malthe! I'll go ahead and try to finish the implementation.

@exceptionfactory
Copy link
Contributor

Thanks for your work on this @malthe, and thanks for planning to take this on @turcsanyip. I am closing the pull request for now given the state of several conflicts, but feel free to open a new one when it is ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants