
NIFI-5788: Introduce batch size limit in PutDatabaseRecord processor #3128

Closed
wants to merge 3 commits

Conversation


@vadimar vadimar commented Nov 5, 2018

Contributor

@mattyb149 mattyb149 left a comment


Is it possible to add unit tests for this? Maybe a mock PreparedStatement that counts the number of executeBatch() invocations?
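One way to build such a mock is with a `java.lang.reflect.Proxy` that tallies `executeBatch()` calls. The sketch below is a hypothetical illustration of the idea, not the project's actual test code; `countingStatement` is an invented helper name:

```java
import java.lang.reflect.Proxy;
import java.sql.PreparedStatement;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchCountingMock {
    // Wraps a counter in a dynamic proxy so every executeBatch() call is tallied.
    public static PreparedStatement countingStatement(AtomicInteger counter) {
        return (PreparedStatement) Proxy.newProxyInstance(
                PreparedStatement.class.getClassLoader(),
                new Class<?>[]{PreparedStatement.class},
                (proxy, method, args) -> {
                    if ("executeBatch".equals(method.getName())) {
                        counter.incrementAndGet();
                        return new int[0]; // pretend every batch succeeded
                    }
                    // addBatch(), setObject(), etc. become no-ops here
                    return null;
                });
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger executions = new AtomicInteger();
        PreparedStatement ps = countingStatement(executions);
        // Simulate 10 records with a batch size of 4: flushes after records 4 and 8,
        // plus one final flush for the remaining 2 records.
        int batchSize = 4, pending = 0;
        for (int i = 0; i < 10; i++) {
            ps.addBatch();
            if (++pending == batchSize) {
                ps.executeBatch();
                pending = 0;
            }
        }
        if (pending > 0) {
            ps.executeBatch();
        }
        System.out.println(executions.get()); // 3
    }
}
```

A test would then feed N records through the processor with a given batch size and assert the expected number of `executeBatch()` invocations on the mock.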

@@ -265,6 +265,17 @@
.expressionLanguageSupported(ExpressionLanguageScope.VARIABLE_REGISTRY)
.build();

static final PropertyDescriptor BATCH_SIZE = new PropertyDescriptor.Builder()
Contributor


We should be consistent here with "batch size" and "bulk size" in the naming of variables, documentation, etc. Maybe "Maximum Batch Size"?

Author


Agree regarding "Maximum Batch Size". Sounds better. What's "bulk size"? Is it relevant to this change?

Author


Oh. I see it now. The display label is "Bulk Size". I'll fix it to be "Maximum Batch Size". Thanks

.displayName("Bulk Size")
.description("Specifies batch size for INSERT and UPDATE statements. This parameter has no effect for other statements specified in 'Statement Type'."
+ " Non-positive value has the effect of infinite bulk size.")
.defaultValue("-1")
Contributor

@mattyb149 mattyb149 Nov 5, 2018


What does a value of zero do? Would anyone ever use it? If not, perhaps zero is the best default to indicate infinite batch size. If you do change it to zero, please change the validator to a NONNEGATIVE_INTEGER_VALIDATOR to match.

Contributor


I agree that 0 should be the default, and would replicate the current behavior of the processor, "All records in one batch".

Author


I'll change the default to be zero and the validator to NONNEGATIVE_INTEGER_VALIDATOR

@@ -669,11 +685,20 @@ private void executeDML(ProcessContext context, ProcessSession session, FlowFile
}
}
ps.addBatch();
if (++currentBatchSize == batchSize) {
Contributor


Would it be beneficial to capture currentBatchSize*batchIndex as an attribute, with batchIndex being incremented only after a successful call to executeBatch()? My thinking is: if you have a failure and only part of a batch was loaded, you could store how many rows were loaded successfully as an attribute.

Author


I'm not sure this would be beneficial. PutDatabaseRecord works without autoCommit. It's all or nothing.

Contributor


True, I missed that override before, but I see it now. So definitely less valuable, the only thing it would provide would be troubleshooting guidance, "your bad data is roughly in this part of the file". Probably not worth it. Thanks!
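The flush logic under discussion, with the agreed default of zero, can be sketched as a standalone model (a simplified illustration, not the processor's actual code; `batchSizes` is an invented helper name):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {
    // Returns the sizes of the batches that would be executed for the given
    // record count and maximum batch size (0 = unlimited: all records in one batch).
    public static List<Integer> batchSizes(int recordCount, int maxBatchSize) {
        List<Integer> flushed = new ArrayList<>();
        int currentBatchSize = 0;
        for (int i = 0; i < recordCount; i++) {
            currentBatchSize++;                       // ps.addBatch()
            if (currentBatchSize == maxBatchSize) {   // never true when maxBatchSize == 0
                flushed.add(currentBatchSize);        // ps.executeBatch()
                currentBatchSize = 0;
            }
        }
        if (currentBatchSize > 0) {
            flushed.add(currentBatchSize);            // final executeBatch() for the remainder
        }
        return flushed;
    }

    public static void main(String[] args) {
        System.out.println(batchSizes(10, 4)); // [4, 4, 2]
        System.out.println(batchSizes(10, 0)); // [10]
    }
}
```

Note that the equality check relies on `currentBatchSize` being reset after each flush; since it is always at least 1 after the increment, a maximum batch size of 0 never triggers a mid-stream flush, which gives exactly the "all records in one batch" behavior of the existing processor.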

        Renamed 'batch size' to 'Maximum Batch Size'.
        Changed default value of max_batch_size to zero (INFINITE)
        Fixed parameter validation.
        Added unit tests
@vadimar
Author

vadimar commented Nov 8, 2018

Hi,
Can you please review the latest commits? I committed the changes that address all the issues raised by reviewers.
Thanks

@asfgit asfgit closed this in d319a3e Nov 15, 2018
@mattyb149
Contributor

+1 LGTM, tested with various batch sizes and ran unit tests. Thanks for this improvement! Merged to master
