
NIFI-8437 RecordReader 'Infer Schema' for large records may throw BufferedInputStream error#5011

Closed
adenes wants to merge 2 commits into apache:main from adenes:NIFI-8437

Conversation

adenes (Contributor) commented Apr 20, 2021

  • increased the mark readLimit in InferSchemaAccessStrategy.getSchema() to Integer.MAX_VALUE
  • removed the unnecessary BufferedInputStream wrapping in JsonPathReader.createRecordReader()
  • added a new test case to test with 50MB record
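The pattern behind these changes can be sketched as follows. This is a minimal, hypothetical illustration (the class and method names are invented, not NiFi APIs): schema inference marks the stream, reads it once to infer the schema, then resets so the record reader can re-read the same bytes. Wrapping such a stream in a BufferedInputStream, as the PR removes, caps how much data can be read before reset() fails.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the mark/infer/reset pattern the PR touches.
public class InferSchemaSketch {

    // Marks the stream, scans it once to "infer" a trivial schema
    // (here just a comma-separated field count), then resets so the
    // real record reader can start again from byte 0.
    public static int inferFieldCount(InputStream in) throws IOException {
        in.mark(10_000_000);          // remember position before the inference pass
        int commas = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == ',') {
                commas++;
            }
        }
        in.reset();                   // rewind so the real read starts from byte 0
        return commas + 1;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("a,b,c".getBytes());
        int fields = inferFieldCount(in);
        // after reset, the stream is back at the start for the record reader
        System.out.println(fields + " " + (char) in.read()); // prints: 3 a
    }
}
```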


adenes (Contributor, Author) commented Apr 20, 2021

@markap14, indeed, removing the BufferedInputStream wrapping in JsonPathReader.createRecordReader() solves the issue (as you suggested on the Jira), because ContentClaimInputStream doesn't use the readLimit in the mark call. Regardless, I think it might be worth increasing the readLimit, as there's no guarantee that the provided InputStream implementation doesn't take it into account.
What do you think?

// We expect to be able to mark/reset any length because we expect that the underlying stream here will be a ContentClaimInputStream, which is able to
// re-read the content regardless of how much data is read.
- contentStream.mark(10_000_000);
+ contentStream.mark(Integer.MAX_VALUE);
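A small demonstration of the point above (class and method names invented for illustration): the readLimit passed to mark() is only a minimum the stream must honor, and rewindable streams may ignore it entirely. ByteArrayInputStream behaves like NiFi's ContentClaimInputStream in this respect, since it can always seek back to the mark.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Shows that a rewindable stream can reset far past its stated readLimit.
public class MarkLimitAdvisory {

    // Marks with a tiny readLimit, consumes the whole stream, then resets.
    // Returns the number of bytes still readable after the reset.
    public static int bytesAfterReset(int size) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[size]);
        in.mark(1);                    // far smaller than what we read next
        while (in.read() != -1) { }    // consume every byte, well past the limit
        in.reset();                    // still succeeds: the limit was ignored
        return in.available();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(bytesAfterReset(1024)); // prints: 1024
    }
}
```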
Contributor:
I think we want to keep this at a smaller value. The general contract of InputStream says that when mark is called, the stream must remember at least that many bytes but is free to remember more. As the comment above explains, the general expectation is that the stream will be of type ContentClaimInputStream. In that case, the value passed to mark is ignored because it can always roll back to the beginning of the stream (by re-reading the file under the hood).

But keeping the 10 MB limit (or even a 1 MB limit would probably be okay) means that if the caller does wrap the InputStream in a BufferedInputStream, we have the chance to remember this amount of data. If it's changed to Integer.MAX_VALUE, that bound is essentially removed, which can very easily lead to an OutOfMemoryError; that should be avoided at almost any cost, as the entire JVM can become defunct.
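The failure mode described here can be sketched as follows (a hypothetical illustration with invented names, using a tiny buffer and limit purely to make the behavior easy to observe): once a BufferedInputStream reads past its mark limit it may drop the mark, and reset() then throws. A modest limit bounds how much heap the wrapper holds; a huge limit invites buffering an entire large record in memory.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Demonstrates BufferedInputStream invalidating its mark past the limit.
public class BufferedMarkOverflow {

    // Returns true if reset() still works after reading bytesToRead bytes
    // beyond a mark placed with the given limit.
    public static boolean resetSurvives(int bufSize, int markLimit, int bytesToRead) {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[1024]), bufSize);
        try {
            in.mark(markLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            in.reset();
            return true;
        } catch (IOException e) {
            return false;                    // "Resetting to invalid mark"
        }
    }

    public static void main(String[] args) {
        System.out.println(resetSurvives(16, 16, 8));   // within the limit: true
        System.out.println(resetSurvives(16, 16, 32));  // past the limit: false
    }
}
```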

adenes (Contributor, Author):
I see your point and share your concern.
I'll revert this change, remove the newly added test case, and keep only the removal of the BufferedInputStream wrapping.
I also checked all the other references to org.apache.nifi.serialization.SchemaRegistryService#getSchema(), and no other BufferedInputStream wrapping occurs, at least not directly before the getSchema() call.

markap14 (Contributor) left a comment:
Thanks @adenes! I think the changes look good, except that I'm concerned about changing the argument passed to InputStream.mark. That value is there as a safeguard, specifically for this situation where an InputStream was wrapped in a BufferedInputStream. We do not want to buffer an unbounded (or Integer.MAX_VALUE-bounded) amount of data in a BufferedInputStream; we'd rather throw an Exception.

…feredInputStream error

Addressing review comments
@markap14 markap14 closed this in 9f45b48 Apr 23, 2021
markap14 (Contributor):
Thanks @adenes, all looks good to me. +1, merged to main.

krisztina-zsihovszki pushed a commit to krisztina-zsihovszki/nifi that referenced this pull request Jun 28, 2022
…feredInputStream error

This closes apache#5011.

Signed-off-by: Mark Payne <markap14@hotmail.com>
