NIFI-7830: Support large files in PutAzureDataLakeStorage #4556

turcsanyip · 2020-09-25T22:36:36Z

https://issues.apache.org/jira/browse/NIFI-7830

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
Have you written or updated unit tests to verify your changes?
Have you verified that the full build is successful on JDK 8?
Have you verified that the full build is successful on JDK 11?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.

turcsanyip · 2020-09-25T22:56:50Z

...ocessors/src/main/java/org/apache/nifi/processors/azure/storage/PutAzureDataLakeStorage.java

+                             public int available() {
+                                 // com.azure.storage.common.Utility.convertStreamToByteBuffer() throws an exception
+                                 // if there are more available bytes in the stream after reading the chunk
+                                 return 0;


@MuazmaZ Do you happen to know why Utility.convertStreamToByteBuffer() throws an exception when available() > 0?
https://github.com/Azure/azure-sdk-for-java/blob/0345889402425191b7003e73b7b3d6ea3c0a5175/sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/Utility.java#L268

Because of this, it is not possible to process a longer input stream in portions (chunks).
As a workaround, I added a fake available() method that reports no more data in the input stream. It is not really nice, but it works.
Another option would be to read the chunks in a loop into a byte array on our side and pass a stream over that byte array to the Azure client library. But I would rather avoid the extra copy and the extra memory for the buffer.
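
For illustration, the second option mentioned above (buffering each chunk on our side) might look roughly like the sketch below. This is not code from the PR: `ChunkSink` is a hypothetical stand-in for the Azure client call, and the class name, method names, and chunk size are assumptions. The reusable `byte[]` is the extra memory the comment would prefer to avoid.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class BufferedChunkUpload {

    /** Hypothetical stand-in for the client call that uploads one chunk. */
    interface ChunkSink {
        void accept(InputStream chunk, long offset, int length) throws IOException;
    }

    /**
     * Copies the source into a reusable buffer one chunk at a time and hands each
     * chunk to the sink as its own ByteArrayInputStream. The sink must consume the
     * chunk before returning, because the buffer is overwritten by the next chunk.
     * Returns the total number of bytes forwarded.
     */
    static long uploadInChunks(InputStream source, int chunkSize, ChunkSink sink) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[chunkSize]; // the extra copy and memory discussed above
        long offset = 0;
        while (true) {
            int filled = 0;
            int n = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (filled < chunkSize
                    && (n = source.read(buffer, filled, chunkSize - filled)) >= 0) {
                filled += n;
            }
            if (filled > 0) {
                sink.accept(new ByteArrayInputStream(buffer, 0, filled), offset, filled);
                offset += filled;
            }
            if (n < 0) {
                return offset; // end of the source stream
            }
        }
    }

    /** Demo helper: records the length of every chunk produced for the given data. */
    static List<Integer> chunkLengths(byte[] data, int chunkSize) {
        List<Integer> lengths = new ArrayList<>();
        try {
            uploadInChunks(new ByteArrayInputStream(data), chunkSize,
                    (chunk, offset, length) -> lengths.add(length));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lengths;
    }

    public static void main(String[] args) {
        System.out.println(chunkLengths(new byte[10], 4)); // prints [4, 4, 2]
    }
}
```

Each chunk stream here ends exactly where the chunk ends, so nothing is left to report from available() once it has been read. The price is one full copy of the data through the buffer.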

adenes · 2020-09-28

@turcsanyip, the patch looks good to me, but maybe you could use BoundedInputStream from the Apache Commons IO library [1] instead of the workaround. What do you think?
[1] https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/input/BoundedInputStream.html

turcsanyip · 2020-09-28

@adenes Thanks for the idea. BoundedInputStream works properly here.

MuazmaZ · 2020-09-28

@turcsanyip I am looking into this and will respond by tomorrow, based on the internal team's response.

turcsanyip · 2020-09-28

@MuazmaZ Thanks. The BoundedInputStream approach is much better than my original workaround, so it is not so critical anymore. However, I'm still wondering why it is not possible to pass a longer stream to Utility.convertStreamToByteBuffer().
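
To make the windowing idea concrete, here is a minimal sketch using only the JDK. The `Window` class is a hand-written stand-in, not the commons-io `BoundedInputStream` that the PR adopted, and the real class may differ in details. All names here are assumptions for illustration. It shows one shared source stream being consumed through successive bounded views, with no intermediate copy of the whole chunk and with available() reporting 0 once a window is exhausted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class WindowedStreamSketch {

    /** Bounded view over a shared stream (illustrative stand-in, not the commons-io class). */
    static final class Window extends FilterInputStream {
        private long remaining;

        Window(InputStream in, long size) {
            super(in);
            this.remaining = Math.max(0, size);
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = in.read();
            if (b >= 0) remaining--;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            if (remaining <= 0) return -1; // window exhausted, even if the source has more
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = in.skip(Math.min(n, remaining));
            remaining -= skipped;
            return skipped;
        }

        @Override
        public int available() throws IOException {
            // Never report more than what is left in this window
            return (int) Math.min(in.available(), remaining);
        }

        @Override
        public boolean markSupported() {
            return false;
        }

        @Override
        public void close() {
            // Deliberately a no-op: the caller owns the shared source stream
        }
    }

    /** Demo helper: reads the data window by window and returns each window as text. */
    static List<String> splitIntoWindows(byte[] data, int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        List<String> parts = new ArrayList<>();
        InputStream source = new ByteArrayInputStream(data);
        try {
            for (int offset = 0; offset < data.length; offset += windowSize) {
                Window window = new Window(source, Math.min(windowSize, data.length - offset));
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[3];
                int n;
                while ((n = window.read(buf, 0, buf.length)) >= 0) {
                    out.write(buf, 0, n);
                }
                if (window.available() != 0) {
                    throw new IllegalStateException("window should be exhausted");
                }
                window.close(); // leaves the shared source open for the next window
                parts.add(new String(out.toByteArray(), StandardCharsets.US_ASCII));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parts;
    }

    public static void main(String[] args) {
        byte[] data = "abcdefghij".getBytes(StandardCharsets.US_ASCII);
        System.out.println(splitIntoWindows(data, 4)); // prints [abcd, efgh, ij]
    }
}
```

Compared with overriding available() to always return 0, a bounded view keeps the stream contract honest: the consumer sees a stream that genuinely ends at the chunk boundary.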

tpalfy · 2020-09-28T17:31:19Z

...sors/src/test/java/org/apache/nifi/processors/azure/storage/ITFetchAzureDataLakeStorage.java

@@ -216,13 +216,15 @@ public void testFetchNonExistentFile() {
        testFailedFetch(fileSystemName, directory, filename, inputFlowFileContent, inputFlowFileContent, 404);
    }

-    @Ignore("Takes some time, only recommended for manual testing.")
+    //@Ignore("Takes some time, only recommended for manual testing.")


Does it no longer take "some time"? :)

pgyori

Checked and tested, LGTM. +1

pvillard31 · 2020-09-30T09:17:42Z

Merged to main, thanks @turcsanyip and everyone who reviewed.
@MuazmaZ - let us know if you have feedback recommending a different approach here.

Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4556.

NIFI-7830: Support large files in PutAzureDataLakeStorage

c027b27

turcsanyip added 4 commits September 28, 2020 14:05

NIFI-7830: Use commons-io BoundedInputStream to window the original stream.

a18e16a

NIFI-7830: Use the same upload logic in Fetch ADLS tests as in Put ADLS.

1832d9b

NIFI-7830: Use random data in Fetch/Put ADLS tests with large files.

2fbfdc1

NIFI-7830: Fixed Checkstyle violation.

c6e9f6b

tpalfy reviewed Sep 29, 2020

NIFI-7830: Ignore ITFetchAzureDataLakeStorage.testFetchLargeFile() as it was ignored originally.

8ffed9e

pgyori approved these changes Sep 30, 2020

pvillard31 approved these changes Sep 30, 2020

asfgit closed this in f9ae3bb Sep 30, 2020

thenatog pushed a commit to thenatog/nifi that referenced this pull request Oct 9, 2020

NIFI-7830: Support large files in PutAzureDataLakeStorage

479a4be

Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4556.

thenatog pushed a commit to thenatog/nifi that referenced this pull request Oct 20, 2020

NIFI-7830: Support large files in PutAzureDataLakeStorage

da546d1

Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4556.

driesva pushed a commit to driesva/nifi that referenced this pull request Mar 19, 2021

NIFI-7830: Support large files in PutAzureDataLakeStorage

0e45944

Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4556.

adenes pushed a commit to adenes/nifi that referenced this pull request Jul 5, 2021

NIFI-7830: Support large files in PutAzureDataLakeStorage

62fe613

Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4556.

krisztina-zsihovszki pushed a commit to krisztina-zsihovszki/nifi that referenced this pull request Jun 28, 2022

NIFI-7830: Support large files in PutAzureDataLakeStorage

34987ad

Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4556.
