
NIFI-8437 RecordReader 'Infer Schema' for large records may throw BufferedInputStream error#5011

Closed
adenes wants to merge 2 commits into apache:main from adenes:NIFI-8437

Conversation

adenes (Contributor) commented Apr 20, 2021

  • increased the mark readLimit in InferSchemaAccessStrategy.getSchema() to Integer.MAX_VALUE
  • removed the unnecessary BufferedInputStream wrapping in JsonPathReader.createRecordReader()
  • added a new test case to test with 50MB record
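The pattern behind these changes can be sketched as follows. This is a minimal, hypothetical illustration (the class and method names are invented, not NiFi APIs): schema inference marks the stream, reads it once to infer the schema, then resets so the record reader can re-read the same bytes. Wrapping such a stream in a BufferedInputStream, as the PR removes, caps how much data can be read before reset() fails.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the mark/infer/reset pattern the PR touches.
public class InferSchemaSketch {

    // Marks the stream, scans it once to "infer" a trivial schema
    // (here just a comma-separated field count), then resets so the
    // real record reader can start again from byte 0.
    public static int inferFieldCount(InputStream in) throws IOException {
        in.mark(10_000_000);          // remember position before the inference pass
        int commas = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == ',') {
                commas++;
            }
        }
        in.reset();                   // rewind so the real read starts from byte 0
        return commas + 1;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("a,b,c".getBytes());
        int fields = inferFieldCount(in);
        // after reset, the stream is back at the start for the record reader
        System.out.println(fields + " " + (char) in.read()); // prints: 3 a
    }
}
```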


adenes (Contributor, Author) commented Apr 20, 2021

@markap14, indeed, removing the BufferedInputStream wrapping in JsonPathReader.createRecordReader() solves the issue (as you suggested on the Jira), because ContentClaimInputStream doesn't use the readLimit in the mark call. Regardless, I think it might be worth increasing the readLimit, as there's no guarantee that the provided InputStream implementation doesn't take it into account.
What do you think?

// We expect to be able to mark/reset any length because we expect that the underlying stream here will be a ContentClaimInputStream, which is able to
// re-read the content regardless of how much data is read.
- contentStream.mark(10_000_000);
+ contentStream.mark(Integer.MAX_VALUE);
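A small demonstration of the point above (class and method names invented for illustration): the readLimit passed to mark() is only a minimum the stream must honor, and rewindable streams may ignore it entirely. ByteArrayInputStream behaves like NiFi's ContentClaimInputStream in this respect, since it can always seek back to the mark.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Shows that a rewindable stream can reset far past its stated readLimit.
public class MarkLimitAdvisory {

    // Marks with a tiny readLimit, consumes the whole stream, then resets.
    // Returns the number of bytes still readable after the reset.
    public static int bytesAfterReset(int size) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[size]);
        in.mark(1);                    // far smaller than what we read next
        while (in.read() != -1) { }    // consume every byte, well past the limit
        in.reset();                    // still succeeds: the limit was ignored
        return in.available();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(bytesAfterReset(1024)); // prints: 1024
    }
}
```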
Contributor:
I think we want to keep this at a smaller value. The general contract of InputStream says that when mark is called, the stream must remember at least that many bytes but is free to remember more. As the comment above explains, the general expectation is that the stream will be of type ContentClaimInputStream. In that case, the value passed to mark is ignored because it can always roll back to the beginning of the stream (by re-reading the file under the hood).

But keeping the 10 MB limit (or even a 1 MB limit would probably be okay) means that if the caller does wrap the InputStream in a BufferedInputStream, we have the chance to remember this amount of data. If it's changed to Integer.MAX_VALUE, that bound is essentially removed, which can very easily lead to an OutOfMemoryError; that should be avoided at almost any cost, as the entire JVM can become defunct.
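The failure mode described here can be sketched as follows (a hypothetical illustration with invented names, using a tiny buffer and limit purely to make the behavior easy to observe): once a BufferedInputStream reads past its mark limit it may drop the mark, and reset() then throws. A modest limit bounds how much heap the wrapper holds; a huge limit invites buffering an entire large record in memory.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Demonstrates BufferedInputStream invalidating its mark past the limit.
public class BufferedMarkOverflow {

    // Returns true if reset() still works after reading bytesToRead bytes
    // beyond a mark placed with the given limit.
    public static boolean resetSurvives(int bufSize, int markLimit, int bytesToRead) {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[1024]), bufSize);
        try {
            in.mark(markLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            in.reset();
            return true;
        } catch (IOException e) {
            return false;                    // "Resetting to invalid mark"
        }
    }

    public static void main(String[] args) {
        System.out.println(resetSurvives(16, 16, 8));   // within the limit: true
        System.out.println(resetSurvives(16, 16, 32));  // past the limit: false
    }
}
```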

adenes (Contributor, Author):
I see your point and share your concern.
I'll revert this change, remove the newly added test case, and keep only the removal of the BufferedInputStream wrapping.
I also checked all the other references to org.apache.nifi.serialization.SchemaRegistryService#getSchema(), and no other BufferedInputStream wrapping occurs, at least not directly before the getSchema() call.

markap14 (Contributor) left a comment:
Thanks @adenes! I think the changes look good, except that I'm concerned about changing the argument passed to InputStream.mark. That value is there as a safeguard, specifically for this situation where an InputStream was wrapped in a BufferedInputStream. We do not want to buffer an unbounded (or Integer.MAX_VALUE-bounded) amount of data in a BufferedInputStream; we'd rather throw an Exception.

…feredInputStream error

Addressing review comments
@markap14 markap14 closed this in 9f45b48 Apr 23, 2021
markap14 (Contributor):
Thanks @adenes, all looks good to me. +1, merged to main.

krisztina-zsihovszki pushed a commit to krisztina-zsihovszki/nifi that referenced this pull request Jun 28, 2022
…feredInputStream error

This closes apache#5011.

Signed-off-by: Mark Payne <markap14@hotmail.com>
