Skip to content

NIFI-1202: Site-to-Site batch settings.#1306

Closed
ijokarumawak wants to merge 2 commits intoapache:masterfrom
ijokarumawak:nifi-1202
Closed

NIFI-1202: Site-to-Site batch settings.#1306
ijokarumawak wants to merge 2 commits intoapache:masterfrom
ijokarumawak:nifi-1202

Conversation

@ijokarumawak
Copy link
Member

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

  • Added batchCount, batchSize, batchDuration to limit flow files to be
    included in a single Site-to-Site transaction.
  • Added batch throttling logic when StandardRemoteGroupPort transfers
    flow files to a remote input port using the batch limit configurations, so that users can limit batch
    not only for pulling data, but also pushing data.
  • Added destination list shuffle to provide better load distribution.
    Previously, the load distribution algorithm produces the same host consecutively.

@ijokarumawak
Copy link
Member Author

Confirmed that the batch configuration works as expected by following tests.

Test detail

  • 3 node cluster
  • Generate 100 flow files using GenerateFlowFile on primary node
  • Send and receive 100 flow files using HTTP/RAW transport protocols with different batch settings
  • Confirmed flow files are distributed evenly based on batch configuration

image

HTTP

  • Push to remote input port
    • default setting: single remote peer gets all 100 files
    • limit with 10 files: each remote peer gets even flow files
  • Pull from remote output port
    • default setting: single client node gets all 100 files
    • limit with 10 files: each client node gets even flow files

RAW

  • Push to remote input port
    • default setting: single remote peer gets all 100 files
    • limit with 10 files: each remote peer gets even flow files
  • Pull from remote output port
    • default setting: single client node gets all 100 files
    • limit with 10 files: each client node gets even flow files

Flow File Distribution result

Confirmed 100 flow files are evenly distributed among nodes if batch setting is configured as following image. Confirmed the same behavior with pulling from an output port.

image

@ijokarumawak
Copy link
Member Author

While I was working on this PR, I've found some RPG and RPG Port configurations are not taken into account for Audit and Fingerprint. I created two JIRAs to address these.

@ijokarumawak ijokarumawak changed the title NIFI-1202: Site-to-Site batch settings. [WIP] NIFI-1202: Site-to-Site batch settings. Dec 28, 2016
@ijokarumawak
Copy link
Member Author

ijokarumawak commented Dec 28, 2016

Rebased with the latest master to resolve conflicts, and squashed commits, but still Work In Progress, as this depends on NIFI-3163, which is under review process now. Once NIFI-3163 is merged, I will update this PR with similar fingerprinting logic.

@trixpan
Copy link
Contributor

trixpan commented Apr 11, 2017

@ijokarumawak sorry for the delay but it seems this needs rebasing. :-(

@ijokarumawak
Copy link
Member Author

Updated:

  • Rebased with the latest master, conflicts are resolved
  • Incorporated NIFI-3162: RPG configurations Audit with newly added batch configs
  • Incorporated NIFI-3163: Flow Fingerprint with newly added batch configs

Retest load distribution in 3 node NiFi cluster. Confirmed batch settings are taking effect.
This PR is ready for review! Thank you :)

@ijokarumawak ijokarumawak changed the title [WIP] NIFI-1202: Site-to-Site batch settings. NIFI-1202: Site-to-Site batch settings. Apr 19, 2017
@mcgilman
Copy link
Contributor

Thanks for rebasing... Will review

@mcgilman
Copy link
Contributor

So far the changes look good. One suggestion I'd like to offer is rendering something in this batch settings when there is nothing configured. Typically, we use a string like 'No value set' or relay this fact.

screen shot 2017-04-20 at 9 48 13 am

Have we considered providing default values for any of the batch settings? We effectively already have that for the duration but the user is only aware of that through the tooltip. Maybe the default value for the duration is 500 millis?

* @return preferred number of flow files to include in a transaction
*/
@ApiModelProperty(
value = "Preferred number of bytes to include in a transaction."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description should say 'number of flow files' (not bytes).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch. Fixed.

dto -> dto.getUseCompression() != null, RemoteGroupPort::isUseCompression)
.setConvertName(PORT_NAME_CONVERT)
.setConvertName(PORT_NAME_CONVERT),
new ConfigurationRecorder<RemoteGroupPort, RemoteProcessGroupPortDTO>("Batch Count",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the batch settings also be using PORT_NAME_CONVERT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should. I forgot to use the name converter. Thanks! Fixed this and related unit test.

final Integer batchCount = remoteProcessGroupPortDto.getBatchCount();
final String batchSize = remoteProcessGroupPortDto.getBatchSize();
final String batchDuration = remoteProcessGroupPortDto.getBatchDuration();
if (isAnyNotNull(batchCount, batchSize, batchDuration)) {
Copy link
Contributor

@mcgilman mcgilman Apr 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we are setting all three batch settings when any one of them have been specified? In our other endpoints, we support partial updating. For instance, if a field is not specified in the incoming request it's underlying value will not be changed. With this approach, if the incoming request only contains a batchCount, it will result in setting the batchSize and batchDuration to null effectively clearing their current value. I think we want separate if statements for each field to remain consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one reason why I did this way, that is to let user clear these batch settings. Especially batchCount is difficult to clear since it is Integer field in DTO. I thought we won't be able to know if user wanted to clear a batch setting or leave it as is but update other data partially. If it is a String field, we maybe able to distinct these two operation by checking if it's null (leave as it is) or empty string (clear, empty value). Having null values have different meaning in aspect of controlling batch operation.

So, I grouped these three batch settings to treat as one single setting. Would documenting these details at @ApiModelProperty description suffice?

@ijokarumawak
Copy link
Member Author

Hi @mcgilman,

I've pushed another commit to incorporate your feedback.

As for the default value rendering, I've used 'No value set' as shown below:
image

However, I didn't set a default value for batch duration, because users may want to limit batch by only flow file count or data size, for those use-cases, we shouldn't use duration to limit batch. If we specifies default value from UI layer, we will not be able to specify blank setting.

Instead, I made the tooltip hint more clearly explain how those configuration affect to control batch. It works differently for input-port (client side throttle) and output-port (remote side throttle, client only ask its preference).

For input port:
image

For output port:
image

Finally, for the partial update against DTO, I couldn't come up with better alternatives so I just updated API documentation. I hope it would be enough.

image

Please let me know if these updates address your concerns. Thanks again for your review comments!

@mcgilman
Copy link
Contributor

Thanks for the updates. Continuing to follow up on the partial update... is there a reason that batchCount, batchSize, and batchDuration must be updated as a single unit? Should a user not but able to just set just one of them? It seems the logic in StandardRemoteGroupPort considers each field individually and each, if set, could allow for the termination of a transaction.

if (maxBatchCount > 0 && flowFilesSent.size() >= maxBatchCount) {
  flowFile = null; // if maxBatchCount is set and current transaction exceeds, stop
} else if (maxBatchBytes > 0 && bytesSent >= maxBatchBytes) {
  flowFile = null; // if maxBatchBytes is set and current transaction exceeds, stop
} else if (sendingNanos >= maxBatchDuration) {
  flowFile = null; // if maxBatchDuration is set (default of 500 millis if not) and current transaction exceeds, stop
} else {
  flowFile = session.get(); // continue the transaction
}

I don't see a reason to prevent setting just the batchSize for instance. I think this can accomplish by updating StandardRemoteProcessGroupDAO and treating each field independently. For instance in updatePort(...)

if (isNotNull(batchCount)) {
  port.setBatchCount(batchCount);
}
if (isNotNull(batchSize)) {
  port.setBatchSize(batchSize);
}
if (isNotNull(batchDuration)) {
  port.setBatchDuration(batchDuration);
}

Please let me know if I'm missing something about why all of these fields must be updated in the same request.

Thanks again!

@ijokarumawak
Copy link
Member Author

@mcgilman I considered following code once, but it can not clear batchCount to null, because batchCount is an Integer (not String):

if (isNotNull(batchCount)) {
  port.setBatchCount(batchCount);
}

Blank string can tell user want to clear string field, but numeric ones can not:

{"batchCount": null, "batchSize": ""}

and

{"batchSize": ""}

are identical when those are converted into DTO layer. Is this understanding wrong?

@mcgilman
Copy link
Contributor

@ijokarumawak Yes, your understanding here is correct. Sorry, I was missing what's unique about this case which is that it is an optional number. I don't believe we have this case elsewhere. Other places where we have numbers, they have default values and support being updated but not unconfigured.

I have one thought that may make the DTO/API a little more clear. As is, the behavior of setting the fields differ within the same DTO and even amongst the same fields depending on the presence of other fields. Since we want the batch settings to be treated as a single unit in terms of configuration, what are your thoughts introducing a BatchSettingsDTO for relaying this configuration? For clarity, I'm suggesting the possibility of the RemoteProcessGroupPortDTO having a BatchSettingsDTO which has batchCount, batchSize, and batchDuration.

With this approach, in the RemoteProcessGroupPortDTO the behavior is consistent. If a field is set, attempt to configure it. If a field is null, leave the existing configuration in place. Within the BatchSettingsDTO we can stipulate that these fields are a single unit and the proposed configuration will replace the existing configuration. I think this approach would be functionally equivalent but may be slightly less confusing.

Thoughts?

Not trying to be super picky here but need to be very conscious of the changes as they are part of the public REST API. Just trying to make this as clear as possible. If someone directly invokes these endpoints I'd hate if they inadvertently cleared the existing desired configuration.

@ijokarumawak
Copy link
Member Author

@mcgilman Thanks for the suggestion, it perfectly makes sense. I agree with that we need to be very carefully to make public REST API change. I didn't have time today, but will do tomorrow!

- Added batchCount, batchSize, batchDuration to limit flow files to be
  included in a single Site-to-Site transaction.
- Added batch throttling logic when StandardRemoteGroupPort transfers
  flow files to a remote input port using the batch limit configurations,
  so that users can limit batch not only for pulling data, but also pushing data.
- Added destination list shuffle to provide better load distribution.
  Previously, the load distribution algorithm produced the same host consecutively.
- Added new batch settings to FlowConfiguration.xsd.
- Added new batch settings to Flow Fingerprint.
- Added new batch settings to Audit.
- Sort ports by name at 'Remote Process Group Ports' dialog.
Incorporated review comments:

- Show 'No value set' when a batch configuration is not set
- Updated batch settings tooltip to clearly explain how it works the configuration works differently for input and output ports.
- Updated DTO by separating batch settings to BatchSettingsDTO to indicate count, size and duration are a set of configurations.
@ijokarumawak
Copy link
Member Author

@mcgilman I have updated this PR to use BatchSettingsDTO to group count, size and duration batch settings as you suggested. Would you review this once again? Many thanks!

@asfgit asfgit closed this in a41a2a9 Apr 27, 2017
@mcgilman
Copy link
Contributor

Thanks @ijokarumawak. This has been merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants