NIFI-1202: Site-to-Site batch settings.#1306
NIFI-1202: Site-to-Site batch settings.#1306ijokarumawak wants to merge 2 commits intoapache:masterfrom
Conversation
|
Confirmed that the batch configuration works as expected by following tests. Test detail
HTTP
RAW
Flow File Distribution resultConfirmed 100 flow files are evenly distributed among nodes if batch setting is configured as following image. Confirmed the same behavior with pulling from an output port. |
|
While I was working on this PR, I've found some RPG and RPG Port configurations are not taken into account for Audit and Fingerprint. I created two JIRAs to address these. |
511cdbb to
e3df849
Compare
e3df849 to
1fbcd45
Compare
|
@ijokarumawak sorry for the delay but it seems this needs rebasing. :-( |
|
Updated:
Retest load distribution in 3 node NiFi cluster. Confirmed batch settings are taking effect. |
|
Thanks for rebasing... Will review |
|
So far the changes look good. One suggestion I'd like to offer is rendering something in this batch settings when there is nothing configured. Typically, we use a string like 'No value set' or relay this fact. Have we considered providing default values for any of the batch settings? We effectively already have that for the duration but the user is only aware of that through the tooltip. Maybe the default value for the duration is 500 millis? |
| * @return preferred number of flow files to include in a transaction | ||
| */ | ||
| @ApiModelProperty( | ||
| value = "Preferred number of bytes to include in a transaction." |
There was a problem hiding this comment.
The description should say 'number of flow files' (not bytes).
There was a problem hiding this comment.
Thanks for the catch. Fixed.
| dto -> dto.getUseCompression() != null, RemoteGroupPort::isUseCompression) | ||
| .setConvertName(PORT_NAME_CONVERT) | ||
| .setConvertName(PORT_NAME_CONVERT), | ||
| new ConfigurationRecorder<RemoteGroupPort, RemoteProcessGroupPortDTO>("Batch Count", |
There was a problem hiding this comment.
Should the batch settings also be using PORT_NAME_CONVERT?
There was a problem hiding this comment.
Yes, we should. I forgot to use the name converter. Thanks! Fixed this and related unit test.
| final Integer batchCount = remoteProcessGroupPortDto.getBatchCount(); | ||
| final String batchSize = remoteProcessGroupPortDto.getBatchSize(); | ||
| final String batchDuration = remoteProcessGroupPortDto.getBatchDuration(); | ||
| if (isAnyNotNull(batchCount, batchSize, batchDuration)) { |
There was a problem hiding this comment.
Is there a reason we are setting all three batch settings when any one of them have been specified? In our other endpoints, we support partial updating. For instance, if a field is not specified in the incoming request it's underlying value will not be changed. With this approach, if the incoming request only contains a batchCount, it will result in setting the batchSize and batchDuration to null effectively clearing their current value. I think we want separate if statements for each field to remain consistent.
There was a problem hiding this comment.
There is one reason why I did this way, that is to let user clear these batch settings. Especially batchCount is difficult to clear since it is Integer field in DTO. I thought we won't be able to know if user wanted to clear a batch setting or leave it as is but update other data partially. If it is a String field, we maybe able to distinct these two operation by checking if it's null (leave as it is) or empty string (clear, empty value). Having null values have different meaning in aspect of controlling batch operation.
So, I grouped these three batch settings to treat as one single setting. Would documenting these details at @ApiModelProperty description suffice?
|
Hi @mcgilman, I've pushed another commit to incorporate your feedback. As for the default value rendering, I've used 'No value set' as shown below: However, I didn't set a default value for batch duration, because users may want to limit batch by only flow file count or data size, for those use-cases, we shouldn't use duration to limit batch. If we specifies default value from UI layer, we will not be able to specify blank setting. Instead, I made the tooltip hint more clearly explain how those configuration affect to control batch. It works differently for input-port (client side throttle) and output-port (remote side throttle, client only ask its preference). Finally, for the partial update against DTO, I couldn't come up with better alternatives so I just updated API documentation. I hope it would be enough. Please let me know if these updates address your concerns. Thanks again for your review comments! |
|
Thanks for the updates. Continuing to follow up on the partial update... is there a reason that batchCount, batchSize, and batchDuration must be updated as a single unit? Should a user not but able to just set just one of them? It seems the logic in I don't see a reason to prevent setting just the batchSize for instance. I think this can accomplish by updating Please let me know if I'm missing something about why all of these fields must be updated in the same request. Thanks again! |
|
@mcgilman I considered following code once, but it can not clear batchCount to null, because batchCount is an Integer (not String): if (isNotNull(batchCount)) {
port.setBatchCount(batchCount);
}Blank string can tell user want to clear string field, but numeric ones can not: {"batchCount": null, "batchSize": ""}and {"batchSize": ""}are identical when those are converted into DTO layer. Is this understanding wrong? |
|
@ijokarumawak Yes, your understanding here is correct. Sorry, I was missing what's unique about this case which is that it is an optional number. I don't believe we have this case elsewhere. Other places where we have numbers, they have default values and support being updated but not unconfigured. I have one thought that may make the DTO/API a little more clear. As is, the behavior of setting the fields differ within the same DTO and even amongst the same fields depending on the presence of other fields. Since we want the batch settings to be treated as a single unit in terms of configuration, what are your thoughts introducing a With this approach, in the Thoughts? Not trying to be super picky here but need to be very conscious of the changes as they are part of the public REST API. Just trying to make this as clear as possible. If someone directly invokes these endpoints I'd hate if they inadvertently cleared the existing desired configuration. |
|
@mcgilman Thanks for the suggestion, it perfectly makes sense. I agree with that we need to be very carefully to make public REST API change. I didn't have time today, but will do tomorrow! |
- Added batchCount, batchSize, batchDuration to limit flow files to be included in a single Site-to-Site transaction. - Added batch throttling logic when StandardRemoteGroupPort transfers flow files to a remote input port using the batch limit configurations, so that users can limit batch not only for pulling data, but also pushing data. - Added destination list shuffle to provide better load distribution. Previously, the load distribution algorithm produced the same host consecutively. - Added new batch settings to FlowConfiguration.xsd. - Added new batch settings to Flow Fingerprint. - Added new batch settings to Audit. - Sort ports by name at 'Remote Process Group Ports' dialog.
Incorporated review comments: - Show 'No value set' when a batch configuration is not set - Updated batch settings tooltip to clearly explain how it works the configuration works differently for input and output ports. - Updated DTO by separating batch settings to BatchSettingsDTO to indicate count, size and duration are a set of configurations.
|
@mcgilman I have updated this PR to use |
|
Thanks @ijokarumawak. This has been merged to master. |







Thank you for submitting a contribution to Apache NiFi.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
included in a single Site-to-Site transaction.
flow files to a remote input port using the batch limit configurations, so that users can limit batch
not only for pulling data, but also pushing data.
Previously, the load distribution algorithm produces the same host consecutively.