NIFI-11472 Make PutFTP processor more multithread friendly #7184

Closed
wants to merge 1 commit into from

@MormonJesus69420 MormonJesus69420 commented Apr 20, 2023

Add an extra check during directory creation to see whether the directory was already created by another thread.

From Issue:

The problem occurs when PutFTP is set to run several concurrent tasks and two (or more) FlowFiles arrive that both need to create the same directory. One task creates the directory and succeeds immediately, while the other tries to create it, fails because it already exists, and throws an error. That FlowFile is then penalized and succeeds on the second run.

It is not the biggest error, since the files are transferred in the end, but the bulletins and errors are annoying, especially in a production environment where you don't want unnecessary errors.
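
The race described above can be reproduced without an FTP server. The following is a minimal sketch, not NiFi code: an in-memory set stands in for the remote directory tree, `contains` plays the role of the failed change-directory check, `add` plays the role of `makeDirectory`, and a `CyclicBarrier` forces every task past the existence check before any of them attempts the create. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class DirectoryRaceSketch {

    /** Runs `tasks` concurrent check-then-create attempts and returns how many creates failed. */
    public static int countFailedCreates(String dir, int tasks) {
        // Hypothetical in-memory stand-in for the remote server's directories.
        Set<String> remoteDirs = ConcurrentHashMap.newKeySet();
        CyclicBarrier barrier = new CyclicBarrier(tasks);
        AtomicInteger failures = new AtomicInteger();
        Thread[] threads = new Thread[tasks];

        for (int i = 0; i < tasks; i++) {
            threads[i] = new Thread(() -> {
                // Every task sees the directory as missing (the "cd" fails).
                boolean exists = remoteDirs.contains(dir);
                try {
                    barrier.await(); // hold all tasks here until each has done its check
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
                // Only one "mkdir" can succeed; the others fail because the directory now exists.
                if (!exists && !remoteDirs.add(dir)) {
                    failures.incrementAndGet();
                }
            });
            threads[i].start();
        }

        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println(countFailedCreates("/upload/incoming", 2)); // prints 1
    }
}
```

With two tasks, exactly one create fails, which corresponds to the single penalized FlowFile and bulletin described in the issue.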

We found that the solution involves a simple change to the FTPTransfer.java class in:
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/util/FTPTransfer.java
In the ensureDirectoryExists method, around line 398, you can add another if check that double-checks whether the directory exists when creating it fails.

final boolean cdSuccessful = setWorkingDirectory(remoteDirectory);

if (!cdSuccessful) {
    if (client.makeDirectory(remoteDirectory)) {
        logger.debug("Remote Directory not found: created directory [{}]", remoteDirectory);
    } else if (!setWorkingDirectory(remoteDirectory)) {
        // Double check that the dir exists as it might have been created in another thread
        throw new IOException("Failed to create remote directory " + remoteDirectory);
    }
}
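
The control flow of the snippet above can be exercised on its own, without an FTP server. This is a hedged sketch under stated assumptions: `RemoteFs` and `RacingFs` are hypothetical stand-ins invented for illustration (they are not NiFi or FTP client types), and `RacingFs` simulates a competing task that creates the directory immediately after the first existence check.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class EnsureDirectorySketch {

    /** Hypothetical stand-in for the two remote operations the snippet relies on. */
    public interface RemoteFs {
        boolean changeDirectory(String path) throws IOException; // true if the directory exists
        boolean makeDirectory(String path) throws IOException;   // false if creation failed
    }

    /** Fake server on which "another task" creates the directory right after our first check. */
    public static final class RacingFs implements RemoteFs {
        private final Set<String> dirs = new HashSet<>();
        private boolean firstCheck = true;

        @Override
        public boolean changeDirectory(String path) {
            boolean exists = dirs.contains(path);
            if (firstCheck) {
                firstCheck = false;
                dirs.add(path); // simulate the competing task winning the race
            }
            return exists;
        }

        @Override
        public boolean makeDirectory(String path) {
            return dirs.add(path); // false when the directory already exists
        }
    }

    /** Same shape as the proposed fix: check, create, and re-check before giving up. */
    public static void ensureDirectoryExists(RemoteFs fs, String dir) throws IOException {
        if (fs.changeDirectory(dir)) {
            return; // directory already present
        }
        if (fs.makeDirectory(dir)) {
            return; // this task created it
        }
        // Creation failed: the directory may have been created by another task in the meantime.
        if (!fs.changeDirectory(dir)) {
            throw new IOException("Failed to create remote directory " + dir);
        }
    }

    public static void main(String[] args) throws IOException {
        ensureDirectoryExists(new RacingFs(), "/upload/incoming");
        System.out.println("directory available");
    }
}
```

Without the final re-check, the `RacingFs` run would end in an `IOException` even though the directory exists; with it, the exception is reserved for cases where the directory is still missing after the failed create.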

Summary

NIFI-11472

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 11
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@arpadboda
Contributor

Hello @MormonJesus69420 , thanks for your contribution!
I'm not sure that using multiple threads to transfer to the same FTP server makes sense, since this operation is limited by bandwidth anyway.
So I would prefer to restrict it to single threaded usage, but the current implementation allows uploading to multiple hosts based on flowfile attributes, in which case multiple threads might make sense, so I'm not against this change.
Weak +1, I'm ok to merge with a 2nd approval.

@MormonJesus69420
Contributor Author

MormonJesus69420 commented Apr 20, 2023

Hi @arpadboda, I might not have described it well, but the issue we face occurs when we configure the processor to run several concurrent tasks.

[Image: Concurrent tasks setting in PutFTP processor configuration]
When we change the number to two or more, we start receiving "errors" about the processor being unable to create the directory because it already exists. While the issue resolves itself, it is rather distracting to see when it's not a "real" error.

We have noticed a significant performance boost when using a PutFTP processor with more than one concurrent task. I don't have the numbers on me at the moment, but switching to two or three concurrent tasks significantly sped up the transfer time.

@MormonJesus69420
Contributor Author

I am sorry, I managed to make the most basic mistake in such a small change: I forgot to add a ! to the setWorkingDirectory(remoteDirectory) method call.

@MormonJesus69420
Contributor Author

Strange, I don't understand why it failed on the Windows action. I don't have the ability to test it on Windows either, since we use Linux for development at work. Also my branch is based off of the latest nifi/main branch.

Contributor

@exceptionfactory exceptionfactory left a comment


Thanks for the contribution @MormonJesus69420, and thanks for the review @arpadboda!

The Windows build often takes longer than the others, so it timed out; I restarted it to try again.

The use case of multiple concurrent tasks for PutFTP makes sense, and introducing the additional call to setWorkingDirectory seems reasonable under the circumstances. Within the context of a single NiFi server, a more robust solution might be to add synchronized to this method, but that would not help when running PutFTP across multiple NiFi nodes in a cluster.

With that background, attempting to change to the directory if the makeDirectory command fails seems like the best approach. I will monitor the Windows build status for verification.
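
For comparison, the synchronized alternative mentioned above could look roughly like the following within a single JVM. This is a sketch under assumptions: the shared `LOCK` object and the in-memory set are hypothetical, and nothing here reflects how NiFi instantiates or shares its transfer classes. A JVM-local lock cannot coordinate tasks running on different cluster nodes, which is the limitation noted above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SynchronizedEnsureSketch {

    // Hypothetical lock shared by every task in this JVM only.
    private static final Object LOCK = new Object();

    /** Runs `tasks` threads that check and create under one lock; returns the number of failed creates. */
    public static int countFailedCreates(String dir, int tasks) {
        // Hypothetical in-memory stand-in for the remote server's directories.
        Set<String> remoteDirs = ConcurrentHashMap.newKeySet();
        AtomicInteger failures = new AtomicInteger();
        Thread[] threads = new Thread[tasks];

        for (int i = 0; i < tasks; i++) {
            threads[i] = new Thread(() -> {
                // The check and the create happen as one unit, so no other local task can interleave.
                synchronized (LOCK) {
                    if (!remoteDirs.contains(dir) && !remoteDirs.add(dir)) {
                        failures.incrementAndGet();
                    }
                }
            });
            threads[i].start();
        }

        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println(countFailedCreates("/upload/incoming", 4)); // prints 0
    }
}
```

A second NiFi node would hold its own `LOCK`, so two nodes could still race on the same remote directory; re-checking after a failed create handles both the local and the clustered case.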

Contributor

@exceptionfactory exceptionfactory left a comment


Thanks again @MormonJesus69420! +1 merging

exceptionfactory pushed a commit that referenced this pull request Apr 24, 2023
- Multiple threads can attempt to create a remote directory when PutFTP has multiple concurrent tasks.

This closes #7184

Signed-off-by: David Handermann <exceptionfactory@apache.org>
(cherry picked from commit d3f2bf1)
@MormonJesus69420
Contributor Author

Hi, I see that the PR is already closed, but I thought I might add the example where we tested running the PutFTP processor with 1 and 2 concurrent tasks. As long as we were able to provide the data to send, the upload speed almost doubled. The bottleneck was therefore how fast we could provide data to send, rather than the sending itself.
[Image: Comparison in MB/s transfer speeds between 1 and 2 concurrent tasks in PutFTP]

3 participants