Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NIFI-10792] Fixed bug to allow for processing files larger than 10MB… #7016

Closed
wants to merge 2 commits into from

Conversation

dan-s1
Copy link
Contributor

@dan-s1 dan-s1 commented Mar 6, 2023

…. Also refactored to reduce the cognitive complexity in the onTrigger method.

Summary

NIFI-10792

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 11
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

…. Also refactored to reduce the cognitive complexity in the onTrigger method.
@dan-s1
Copy link
Contributor Author

dan-s1 commented Mar 8, 2023

@exceptionfactory I did not end up including the unit test I had as it was a unit which tested with a 20MB file. I would have thought there should be a unit test to exercise the change I made. Please advise.

@mh013370
Copy link
Contributor

mh013370 commented Mar 8, 2023

@exceptionfactory I did not end up including the unit test I had as it was a unit which tested with a 20MB file. I would have thought there should be a unit test to exercise the change I made. Please advise.

Anecdotally, the NiFi mock framework will read the entire FF contents into memory (I discovered this in #6369) so I think you're correct in not including unit tests testing larger FFs.

Here's my except from that PR

I attempted to unit test this, but due to how the NiFi mock framework works, every flowfile is read into memory in full even if that FlowFile is backed by a NullInputStream. I couldn't find a nice alternative to verify with a unit test.

@exceptionfactory
Copy link
Contributor

Thanks for the comment @mh013370, this issue is similar to the size limits for Tar files resolved in #6369.

@dan-s1, Although unit tests are the preferred way to confirm expected behavior, unit tests are not optimal for testing these kinds of scenarios. For this particular situation, confirming existing functionality is good, and runtime verification is better than introducing large files or long-running tests into the repository.

@dan-s1
Copy link
Contributor Author

dan-s1 commented Mar 10, 2023

@exceptionfactory When you get a chance can you please look over this PR? I was hoping I could reuse the Excel part of this code for NIFI-11167. I noticed a drawback with fastexcel-reader that it

discards styles, graphs, and many other stuff

therefore I would like to use excel-streaming-reader as I did in this PR as I believes it preserves these things or at least a subset of them.

@exceptionfactory
Copy link
Contributor

Thanks @dan-s1, I plan to take a closer look at this pull request soon.

The excel-streaming-reader seems acceptable, although discarding styles, graphs, and other elements does not necessarily sound like a problem because that would not necessarily translate to record-oriented data for other services.

@dan-s1
Copy link
Contributor Author

dan-s1 commented Mar 10, 2023

@exceptionfactory Thanks! That is a good point regarding the formatting. Though I would think you would want to have an option to preserve formatting for use with an Excel record writer.

@exceptionfactory
Copy link
Contributor

Good point @dan-s1, I'm not sure if the formatting information could be carried through to an Excel Writer, but it would be useful if it can be supported.

Copy link
Contributor

@emiliosetiadarma emiliosetiadarma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I was able to verify that I was able to convert an excel spreadsheet > 10 MB successfully into CSV, I just have a few questions

@dan-s1
Copy link
Contributor Author

dan-s1 commented Mar 14, 2023

@emiliosetiadarma @exceptionfactory I believe I addressed the issues brought up. Is there anything else needed?

- Removed unnecessary graphics2d dependency
Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this improvement @dan-s1, and thanks for the testing @emiliosetiadarma.

The code changes for streaming look good, existing unit tests pass and I was able to convert a 20 MB XLSX file at runtime with these changes.

I pushed a commit that adds the log4j-to-slf4j dependency, which was something missing after recent Apache POI upgrades, supporting log routing from Log4j 2 to the NiFi standard SLF4J. I also removed the explicit dependency on graphics2d, which has a holdover from earlier Apache POI versions and no longer used. Lastly, I reverted the reformatting of the properties and relationships, since it is easier to maintain the list of one per line versus wrapping lines.

With these changes, planning to merge following a successful automated build.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated builds look good. +1 merging

exceptionfactory pushed a commit that referenced this pull request Mar 15, 2023
- Refactored to reduce the cognitive complexity in the onTrigger method
- Added log4j-to-slf4j library and adjusted formatting
- Removed unnecessary graphics2d dependency

This closes #7016

Signed-off-by: David Handermann <exceptionfactory@apache.org>
r-vandenbos pushed a commit to r-vandenbos/nifi that referenced this pull request Apr 11, 2023
- Refactored to reduce the cognitive complexity in the onTrigger method
- Added log4j-to-slf4j library and adjusted formatting
- Removed unnecessary graphics2d dependency

This closes apache#7016

Signed-off-by: David Handermann <exceptionfactory@apache.org>
asfgit pushed a commit that referenced this pull request Jun 21, 2023
…eGroovyScript

This closes #7016

Signed-off-by: Mike Thomsen <mthomsen@apache.org>
asfgit pushed a commit that referenced this pull request Jun 21, 2023
…eGroovyScript

This closes #7016

Signed-off-by: Mike Thomsen <mthomsen@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants