Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-12772 Expose REMOTE_POLL_BATCH_SIZE property for ListSFTP #8390

Closed
wants to merge 2 commits into from

Conversation

tombrisland
Copy link
Contributor

@tombrisland tombrisland commented Feb 9, 2024

Summary

NIFI-12772

Allows use of the REMOTE_POLL_BATCH_SIZE property in the ListSFTP processor. Previously this was supported but not exposed for configuration.

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@tombrisland
Copy link
Contributor Author

Need to do some manual testing

@EndzeitBegins
Copy link
Contributor

EndzeitBegins commented Feb 11, 2024

Thank you for working on this @tombrisland.
I'm really looking forward to this improvement and hope we can backport it to 1.x as well.

I haven't had time to do any testing, but just glanced over the code changes and they look straightforward.

It seems that even though the SFTPTransfer supports the property already, the underlying sshj does not support a limit to ls directly but instead we rather stop adding results to our list. The listing is continued nonetheless.
I think supporting the property is still an improvement, as we at least can limit the amount of FlowFiles / records created this way.

However, I wonder if it makes sense to move the existing max result check

if (listing.size() >= maxResults) {
    return false;
}

to the top of the RemoteResourceFilter? The limit only ever gets relevant, when there is a large amount of files at the remote source. Even though we cannot abort the listing prematurely, at least not without modifications to sshj, this way the checks needed for all files after the limit is reached is kept to a minimum. This seems beneficial in situations where the listing takes a long time due files accumulated on the remote source.

What do you think about that?

@tombrisland
Copy link
Contributor Author

tombrisland commented Feb 11, 2024

Makes sense to me @EndzeitBegins, would also be great if we can get your early termination support in!

We may also need to have a think about which tracking strategies are supported when you select batch size. E.g. I don't think that time tracking will work with batch size since it relies on keeping track of the last timestamp processed?

@EndzeitBegins
Copy link
Contributor

EndzeitBegins commented Feb 11, 2024

Thank you for the adjustments.
While I hope we can get the changes to land in sshj I wouldn't wait for them. After introducing the general support for REMOTE_POLL_BATCH_SIZE in ListSFTP, we still can integrate a new version of sshj that includes the improvements in a following PR.

That's a very sensible remark. I think it makes sense to limit the support for REMOTE_POLL_BATCH_SIZE to the strategies BY_ENTITIES and NO_TRACKING only. I'm not familiar with BY_TIME_WINDOW but I think combining the batch size with one of the other strategies might result in some files being skipped / never be picked up. On the other hand, ListFTP does not restrict the use to selected strategies, so there's that. 🤷🏻‍♂️

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this change @tombrisland, and thanks for the feedback thus far @EndzeitBegins.

Regarding making the property dependent, this should not be changed in FileTransfer because it also impacts ListFTP and GetFTP.

It does not seem necessary to make it dependent, because those other strategies could still work under some circumstances. The fact that it is not currently a dependent property for the FTP Processors is also worth noting, so at least for the initial change, it seems better to avoid making it a dependent property.

@joewitt
Copy link
Contributor

joewitt commented Feb 12, 2024

I don't have a lot to add to the comments I see already. But the main gist is if the intent of the JIRA is now broader than the original title please change the JIRA. If the intent has not changed then please narrow the PR. So either this is about adding this option to ListSFTP or it is about others. Would be important to validate each case which expands the scope of the review.

Given we don't actually get to, it appears, change the behavior of the SFTP commands executed on the remote end (ie: once you start a listing in a large directory...much of the damage is already done in terms of time/waiting) then the benefit of this property being available is mostly to protect the NiFi side. Still useful just probably not the truly desired/most beneficial impact.

I'd avoid any changes for now that aren't central to the specific desired improvement.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @tombrisland. I noticed one import that no longer seems necessary, and a had a question on one other relocation.

Also, if you could do this change as a regular push, instead of a force-push, that would be helpful for tracking. The initial PR should have a single commit, but after the review process is complete, then all the changes will be squashed into a single commit for merging into the main branch.

@EndzeitBegins
Copy link
Contributor

Thanks for taking on the merge process @joewitt.
Could we backport these changes to the support/nifi-1.x branch as well?

@joewitt
Copy link
Contributor

joewitt commented Feb 20, 2024

@EndzeitBegins It would need a separate JIRA/PR would be needed and I dont think the 'removeIf' language mechanism would work as NiFi 1.x is Java 8 or higher.

@EndzeitBegins
Copy link
Contributor

Thank you for the fast response @joewitt and letting me know.

Hey @tombrisland, are you interested to work on a backport for 1.x or should I have a go at it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants