NIFI-8437 RecordReader 'Infer Schema' for large records may throw BufferedInputStream error#5011
adenes wants to merge 2 commits into apache:main
@markap14, indeed, the removal of the …
…feredInputStream error
  // We expect to be able to mark/reset any length because we expect that the underlying stream here will be a ContentClaimInputStream, which is able to
  // re-read the content regardless of how much data is read.
- contentStream.mark(10_000_000);
+ contentStream.mark(Integer.MAX_VALUE);
I think we want to keep this at a smaller value. The general contract of InputStream says that when mark is called, the stream must remember at least that many bytes but is free to remember more. As the comment above explains, the general expectation is that the stream will be a ContentClaimInputStream. In that case, the value passed to mark is ignored because the stream can always roll back to the beginning (by re-reading the file under the hood).
But keeping the 10 MB limit (even a 1 MB limit would probably be okay) means that if the caller does wrap the InputStream in a BufferedInputStream, then we still have the chance to remember this amount of data. If it's changed to Integer.MAX_VALUE, that bound is basically removed, which can very easily lead to an OutOfMemoryError, and that should be avoided at almost any cost because the entire JVM can become defunct.
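To make the trade-off concrete, here is a minimal sketch of how the argument to mark bounds what a BufferedInputStream will hold. It uses only java.io classes; the class name MarkLimitDemo, the helper resetSucceeds, and the tiny buffer sizes are made up for illustration and are not part of the NiFi code under review.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class, not NiFi code.
public class MarkLimitDemo {

    // Marks the start of a buffered stream, reads bytesToRead bytes one at a time,
    // then tries to reset. Returns true if reset() succeeded, false if it threw.
    static boolean resetSucceeds(int bufferSize, int readLimit, int bytesToRead) {
        byte[] data = new byte[bytesToRead + 1];
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), bufferSize)) {
            in.mark(readLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            // Throws IOException if the mark was invalidated while reading.
            in.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Read limit larger than the data read: the internal buffer grows and reset works.
        System.out.println(resetSucceeds(8, 1_000, 100)); // prints "true"
        // Read limit (and buffer) smaller than the data read: the mark is dropped.
        System.out.println(resetSucceeds(8, 4, 100));     // prints "false"
    }
}
```

In the first call the buffer has to grow to cover everything read since the mark, which is the memory cost a bounded limit caps. In the second call the stream fails with an IOException on reset instead of buffering more, which is the behavior the 10 MB limit preserves for large records.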
I see your point and share your concern.
Then I'll revert this change, remove the newly added test case, and keep only the removal of the BufferedInputStream wrapping.
I also checked all the other references to org.apache.nifi.serialization.SchemaRegistryService#getSchema() and no other BufferedInputStream wrapping occurs, at least not directly before the getSchema() call.
markap14 left a comment
Thanks @adenes! I think the changes look good, except that I'm concerned about changing the argument passed to InputStream.mark. That value is there as a safeguard, specifically for this situation where an InputStream was wrapped in a BufferedInputStream. We do not want to buffer an unbounded (or bounded at Integer.MAX_VALUE) amount of data in BufferedInputStream. We'd rather throw an Exception.
…feredInputStream error Addressing review comments
Thanks @adenes, all looks good to me. +1, merged to main.
…feredInputStream error This closes apache#5011. Signed-off-by: Mark Payne <markap14@hotmail.com>