NIFI-5788: Introduce batch size limit in PutDatabaseRecord processor #3128
Conversation
Is it possible to add unit tests for this? Maybe a mock PreparedStatement that counts the number of executeBatch() invocations?
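One lightweight way to build such a mock without pulling in a mocking library (a sketch only — `CountingStatements` and `countingStatement` are hypothetical names, not NiFi test code) is a `java.lang.reflect.Proxy` over `PreparedStatement` that counts `executeBatch()` invocations:

```java
import java.lang.reflect.Proxy;
import java.sql.PreparedStatement;
import java.util.concurrent.atomic.AtomicInteger;

public class CountingStatements {
    // PreparedStatement proxy that increments `calls` on every executeBatch().
    static PreparedStatement countingStatement(AtomicInteger calls) {
        return (PreparedStatement) Proxy.newProxyInstance(
                PreparedStatement.class.getClassLoader(),
                new Class<?>[]{PreparedStatement.class},
                (proxy, method, args) -> {
                    if ("executeBatch".equals(method.getName())) {
                        calls.incrementAndGet();
                        return new int[0];            // pretend every row succeeded
                    }
                    // Benign defaults for any other method the code under test touches.
                    Class<?> rt = method.getReturnType();
                    if (rt == boolean.class) return false;
                    if (rt == int.class) return 0;
                    if (rt == long.class) return 0L;
                    return null;
                });
    }
}
```

A test could then run the processor's batching logic against this statement and assert the expected number of `executeBatch()` calls for a given record count and batch size.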
@@ -265,6 +265,17 @@
            .expressionLanguageSupported(ExpressionLanguageScope.VARIABLE_REGISTRY)
            .build();

    static final PropertyDescriptor BATCH_SIZE = new PropertyDescriptor.Builder()
We should be consistent here with "batch size" and "bulk size" in the naming of variables, documentation, etc. Maybe "Maximum Batch Size"?
Agree regarding "Maximum Batch Size". Sounds better. What's "bulk size"? Is it relevant to this change?
Oh. I see it now. The display label is "Bulk Size". I'll fix it to be "Maximum Batch Size". Thanks
            .displayName("Bulk Size")
            .description("Specifies batch size for INSERT and UPDATE statements. This parameter has no effect for other statements specified in 'Statement Type'."
                    + " Non-positive value has the effect of infinite bulk size.")
            .defaultValue("-1")
What does a value of zero do? Would anyone ever use it? If not, perhaps zero is the best default to indicate infinite batch size. If you do change it to zero, please change the validator to a NONNEGATIVE_INTEGER_VALIDATOR to match
I agree that `0` should be the default, and it would replicate the current behavior of the processor, "All records in one batch".
I'll change the default to be zero and the validator to NONNEGATIVE_INTEGER_VALIDATOR
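Putting the agreed changes together, the descriptor could end up looking roughly like this (a sketch only, not the merged code — the property `name` string is a hypothetical placeholder, and in NiFi's `StandardValidators` the constant is spelled `NON_NEGATIVE_INTEGER_VALIDATOR`):

```java
static final PropertyDescriptor MAX_BATCH_SIZE = new PropertyDescriptor.Builder()
        .name("put-db-record-max-batch-size")   // hypothetical property name
        .displayName("Maximum Batch Size")
        .description("Specifies maximum batch size for INSERT and UPDATE statements. This parameter has no effect"
                + " for other statements specified in 'Statement Type'."
                + " Zero means the batch size is not limited.")
        .defaultValue("0")
        .required(false)
        .addValidator(StandardValidators.NON_NEGATIVE_INTEGER_VALIDATOR)
        .expressionLanguageSupported(ExpressionLanguageScope.VARIABLE_REGISTRY)
        .build();
```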
@@ -669,11 +685,20 @@ private void executeDML(ProcessContext context, ProcessSession session, FlowFile
                }
            }
            ps.addBatch();
            if (++currentBatchSize == batchSize) {
Would it be beneficial to capture `currentBatchSize * batchIndex`, with `batchIndex` being incremented only after a successful call to `executeBatch()`, as an attribute? My thinking is, if you have a failure and only part of a batch was loaded, you could store how many rows were loaded successfully as an attribute.
I'm not sure this would be beneficial. PutDatabaseRecord works without autoCommit, so it's all or nothing.
True, I missed that override before, but I see it now. So it's definitely less valuable; the only thing it would provide is troubleshooting guidance ("your bad data is roughly in this part of the file"). Probably not worth it. Thanks!
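For readers following the thread, the batching semantics being discussed can be sketched as follows (a self-contained illustration with hypothetical names, not the processor's actual code): with the agreed default of zero, the `==` check never fires, so all records land in a single `executeBatch()` call.

```java
import java.lang.reflect.Proxy;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchingLoop {
    // Adds `records` rows and returns how many executeBatch() calls were issued.
    static int flushInBatches(PreparedStatement ps, int records, int batchSize) throws SQLException {
        int currentBatchSize = 0;
        int executions = 0;
        for (int i = 0; i < records; i++) {
            ps.addBatch();
            if (++currentBatchSize == batchSize) {  // never true when batchSize == 0
                ps.executeBatch();
                executions++;
                currentBatchSize = 0;
            }
        }
        if (currentBatchSize > 0) {  // flush the remainder (or everything, when batchSize == 0)
            ps.executeBatch();
            executions++;
        }
        return executions;
    }

    // No-op PreparedStatement stub, just enough to exercise the loop.
    static PreparedStatement noopStatement() {
        return (PreparedStatement) Proxy.newProxyInstance(
                PreparedStatement.class.getClassLoader(),
                new Class<?>[]{PreparedStatement.class},
                (p, m, a) -> m.getReturnType() == int[].class ? new int[0] : null);
    }
}
```

For example, 10 records with a maximum batch size of 4 yield three `executeBatch()` calls (4 + 4 + 2), while a batch size of 0 yields exactly one.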
Renamed 'batch size' to 'Maximum Batch Size'. Changed the default value of max_batch_size to zero (infinite). Fixed parameter validation. Added unit tests.
Hi, +1 LGTM. Tested with various batch sizes and ran unit tests. Thanks for this improvement! Merged to master.
https://issues.apache.org/jira/browse/NIFI-5788