Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3560: throw SchemaParseException on dangling content in avsc beyond end of schema #1748

Merged
merged 1 commit into from Jun 14, 2023

Conversation

radai-rosenblatt
Copy link
Contributor

@radai-rosenblatt radai-rosenblatt commented Jul 1, 2022

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit test:
    • TestSchema.testContentAfterAvsc()
    • TestSchema.testContentAfterAvscInInputStream()

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@github-actions github-actions bot added the Java Pull Requests for Java binding label Jul 1, 2022
@RyanSkraba RyanSkraba changed the title AVRO-3560 - throw SchemaParseException on dangling content in avsc beyond end of schema AVRO-3560: throw SchemaParseException on dangling content in avsc beyond end of schema Jul 1, 2022
@radai-rosenblatt
Copy link
Contributor Author

radai-rosenblatt commented Jul 1, 2022

ended up finding a few bad schemas in other unit tests as well ...

@KalleOlaviNiemitalo
Copy link

Perhaps this change should not apply to parse(InputStream in) where the caller might expect to be able to read content after the schema.

@radai-rosenblatt
Copy link
Contributor Author

updated the PR to allow dangling content when parsing from InputStream

@KalleOlaviNiemitalo
Copy link

Could you add tests for trailing content in a UTF-8 file parsed using Schema.parse(File file)? JsonFactory.createParser(File f) apparently creates an InputStream for that, and JsonFactory._createParser(InputStream in, IOContext ctxt) calls ByteSourceJsonBootstrapper.constructParser, which creates an UTF8StreamJsonParser in that case. UTF8StreamJsonParser is a "byte-based" parser and inherits JsonParser.releaseBuffered(Writer w), which just returns -1, but UTF8StreamJsonParser.releaseBuffered(OutputStream out) can write something to the OutputStream. I think this means that, to correctly detect trailing content in a UTF-8 file, Schema.parse would have to call not only JsonParser.releaseBuffered(Writer) but also JsonParser.releaseBuffered(OutputStream), or first call JsonParser.getInputSource() and then use the type of the result to guess whether the parser is byte-based or char-based.

If parse(InputStream) parses a stream that has trailing content, and JsonParser buffers that content, it would be best to return that content to the InputStream so that the caller can then read it. However I don't see how to do that.

Is it possible that JsonParser buffers some content from a Reader, and the buffered content is all space characters and thus ignored by Avro.Schema, but the Reader has more content that JsonContent did not even read because its buffer filled up? In which case, Avro.Schema would have to check the Reader as well.

@radai-rosenblatt
Copy link
Contributor Author

i'll add a check for the File method.
as for the case of a buffer full of whitespace (and presumably content beyond that) - I'm ok with missing that validation (as i expect it to be extremely uncommon). otherwise i'd need to read till EOF?

@radai-rosenblatt
Copy link
Contributor Author

that was an excellent catch @KalleOlaviNiemitalo. i've added a test for File input and adapted the code to work for all cases of dangling input.

except for the case of a buffer full of whitespace with more content hiding in the underlying input (either a reader or an inputstream). I'm ok with not solving for that case as I expect it to be rare (and the code changes required more invasive)

@radai-rosenblatt
Copy link
Contributor Author

@RyanSkraba - any chance of this making it in ?

@RyanSkraba
Copy link
Contributor

Hello! We released 1.11.1 the other day. It was kind of in a crunch, and I wasn't looking at any of the non-1.11.1 tagged issues :/ My apologies for not getting to this earlier -- it should be in the next release.

To confirm, there's really little or no security issues here right? We strictly just never look farther than the JSON beginning of File.

I like the branch name :D

@RyanSkraba RyanSkraba self-requested a review August 11, 2022 08:14
if (!allowDanglingContent) {
String dangling;
StringWriter danglingWriter = new StringWriter();
int numCharsReleased = parser.releaseBuffered(danglingWriter);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the parser didn't read all bytes available in a file?
AFAICT, JsonParser only releases what it read but didn't parse.

(I think we're ok for Strings; I'm not sure about files)

@@ -101,7 +101,7 @@ private static Object[][] dataForResolvingTests() {
"{\"type\":\"record\",\"name\":\"outer\",\"fields\":[" + "{\"name\": \"g1\", "
+ "\"type\":{\"type\":\"record\",\"name\":\"inner\",\"fields\":["
+ "{\"name\":\"f1\", \"type\":\"int\", \"default\": 101}," + "{\"name\":\"f2\", \"type\":\"int\"}]}}, "
+ "{\"name\": \"g2\", \"type\": \"long\"}]}}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are nice catches!

@RyanSkraba RyanSkraba merged commit 017918e into apache:master Jun 14, 2023
RyanSkraba pushed a commit that referenced this pull request Jun 14, 2023
@RyanSkraba
Copy link
Contributor

Ooops, it looks like this broke on master due to the unit test migration to JUnit 5. I'm taking a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding
Projects
None yet
4 participants