NIFI-7830: Support large files in PutAzureDataLakeStorage #4556

Closed
turcsanyip wants to merge 6 commits

@turcsanyip turcsanyip commented Sep 25, 2020

https://issues.apache.org/jira/browse/NIFI-7830

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • Have you verified that the full build is successful on JDK 8?
  • Have you verified that the full build is successful on JDK 11?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.

public int available() {
    // com.azure.storage.common.Utility.convertStreamToByteBuffer() throws an exception
    // if there are more available bytes in the stream after reading the chunk
    return 0;
}
Contributor Author

@MuazmaZ Do you happen to know why Utility.convertStreamToByteBuffer() throws an exception when available() > 0?
https://github.com/Azure/azure-sdk-for-java/blob/0345889402425191b7003e73b7b3d6ea3c0a5175/sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/Utility.java#L268

Because of this, a longer input stream cannot be processed in portions / chunks.
As a workaround, I added a fake available() method that always reports that there is no more data in the input stream. It is not really nice, but it works.
Another option would be to read the chunks in a loop into a byte array on our side and pass a stream over that byte array to the Azure client lib. But I would rather avoid the extra copy and the extra memory for the buffer.
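
For readers following the thread, here is a minimal sketch of the workaround described above, using only the JDK. All names (ChunkedStreamSketch, ChunkInputStream, readChunk, splitIntoChunks) are hypothetical and are not taken from the PR. readChunk is only a stand-in that mimics the behavior reported in this comment (fail if available() > 0 after the declared length has been read); it is not the Azure SDK's actual code.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class ChunkedStreamSketch {

    /** Hypothetical wrapper: exposes at most `limit` bytes and always reports available() == 0. */
    static final class ChunkInputStream extends FilterInputStream {
        private long remaining;

        ChunkInputStream(InputStream in, long limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int b = in.read();
            if (b != -1) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            if (remaining <= 0) {
                return -1;
            }
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = in.skip(Math.min(n, remaining));
            remaining -= skipped;
            return skipped;
        }

        @Override
        public int available() {
            // The workaround: pretend nothing is left, even if the source has more data.
            return 0;
        }

        @Override
        public void close() {
            // Intentionally left open so the next chunk can continue reading the source.
        }
    }

    /** Stand-in consumer (assumption): reads `length` bytes, then rejects streams that report more data. */
    static byte[] readChunk(InputStream in, int length) throws IOException {
        byte[] buffer = new byte[length];
        int total = 0;
        while (total < length) {
            int n = in.read(buffer, total, length - total);
            if (n == -1) {
                throw new IOException("Stream ended before " + length + " bytes were read");
            }
            total += n;
        }
        if (in.available() > 0) {
            throw new IOException("More bytes available than the declared length");
        }
        return buffer;
    }

    /** Feeds the source to the consumer chunk by chunk without buffering it on the caller's side. */
    static List<byte[]> splitIntoChunks(InputStream source, long totalLength, int chunkSize) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        for (long offset = 0; offset < totalLength; offset += chunkSize) {
            int length = (int) Math.min(chunkSize, totalLength - offset);
            chunks.add(readChunk(new ChunkInputStream(source, length), length));
        }
        return chunks;
    }
}
```

Passing the unwrapped source stream to readChunk with a length shorter than the stream trips the available() check, which is the situation the comment describes. The wrapper avoids that without the extra copy the second option would need.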

Contributor

@turcsanyip , the patch looks good to me, but maybe you could use the BoundedInputStream from the Apache Commons library [1] instead of the workaround. What do you think?
[1] https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/input/BoundedInputStream.html

Contributor Author

@adenes Thanks for the idea. BoundedInputStream works properly here.

Contributor

@turcsanyip I am looking into this and I will respond by tomorrow based on the internal team's response.

Contributor Author

@MuazmaZ Thanks. The BoundedInputStream approach is much better than my original workaround, so this is not so critical anymore. However, I'm still wondering why it is not possible to pass a longer stream to Utility.convertStreamToByteBuffer().

@@ -216,13 +216,15 @@ public void testFetchNonExistentFile() {
     testFailedFetch(fileSystemName, directory, filename, inputFlowFileContent, inputFlowFileContent, 404);
 }

-@Ignore("Takes some time, only recommended for manual testing.")
+//@Ignore("Takes some time, only recommended for manual testing.")
Contributor

Does it no longer take "some time"? :)

Contributor

@pgyori pgyori left a comment

Checked and tested, LGTM. +1

@asfgit asfgit closed this in f9ae3bb Sep 30, 2020
@pvillard31
Contributor

Merged to main, thanks @turcsanyip and everyone who reviewed.
@MuazmaZ - let us know if you have feedback recommending a different approach here.

thenatog pushed a commit to thenatog/nifi that referenced this pull request Oct 9, 2020
Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com>

This closes apache#4556.
thenatog pushed a commit to thenatog/nifi that referenced this pull request Oct 20, 2020
driesva pushed a commit to driesva/nifi that referenced this pull request Mar 19, 2021
adenes pushed a commit to adenes/nifi that referenced this pull request Jul 5, 2021
krisztina-zsihovszki pushed a commit to krisztina-zsihovszki/nifi that referenced this pull request Jun 28, 2022