Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ReaderInputStream implementation #293

Conversation

Marcono1234
Copy link
Contributor

The new implementation follows more closely the required behavior defined by CharsetEncoder, instead of assuming that buffers of certain size will work without having to resize them.

Relates to:

Sorry for creating this pull request so late and not participating in the review of the previous pull requests for these issues.
If you don't think this pull request is worth it, feel free to close it. The current implementation will probably work in most situations correctly, but here it overwrites the lastCoderResult, even when the previous encode was unsuccessful, e.g. OVERFLOW or ERROR.

I also tried refactoring org.apache.commons.io.input.CharSequenceInputStream to use ReaderInputStream to reduce error-prone usage of CharsetEncoder there. However, this causes some behavior changes:

  • available() will in most cases return 0
  • readLimit parameter of mark is now actually considered, so code calling mark(0) will not work anymore

I have marked this pull request as draft for now because I have not extensively tested it. Please let me know what you think.

The new implementation follows more closely the required behavior defined by
CharsetEncoder, instead of assuming that buffers of certain size will work
without having to resize them.
This avoids having to implement error-prone CharsetEncoder logic in
CharSequenceInputStream. However, to keep support for `mark` some extra code
is needed, which does not add much value compared to using BufferedInputStream
instead.
@Marcono1234 Marcono1234 force-pushed the marcono1234/ReaderInputStream-refactor branch from 0566dbd to 7ab9ef3 Compare November 1, 2021 01:09
@@ -217,62 +218,60 @@ private void testIO_356(final int bufferSize, final int dataSize, final int read
is.close();

// data buffers should be identical
assertArrayEquals(data1, data2, "bufferSize=" + bufferSize + " dataSize=" + dataSize);
assertArrayEquals(data1, data2, "dataSize=" + dataSize);
}

@Test
public void testIO_356_B10_D10_S0_UTF16() throws Exception {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably have to change the method names of these tests (removing B10).

@garydgregory
Copy link
Member

"Please let me know what you think."

All builds are red 🙁

@garydgregory
Copy link
Member

Closing, no feedback.

@Marcono1234
Copy link
Contributor Author

Sorry for not having responded earlier, and for not properly following up on this.

The issues mentioned in the description still exist on master:

  • ReaderInputStream erroneously overwrites lastCoderResult
    This can for example be seen with the following code snippet. The string has a trailing unpaired surrogate, which is therefore malformed. The encoder is created to throw on malformed input. However, currently read() erroneously just returns -1 instead of throwing an exception.

    // Encoder which throws on malformed or unmappable input
    CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    ReaderInputStream in = new ReaderInputStream(new StringReader("\uD800"), encoder);
    System.out.println("Read: " + in.read());
  • CharSequenceInputStream.available() makes erroneous assumptions about conversion ratio
    Currently the implementation assumes that for each char at least one byte will be encoded. This might be incorrect for some specific encodings (have not found an example for this yet though), but is definitely incorrect for the default malformed input action of REPLACE. With this, encodings replace supplementary code points (consisting of two chars) with a single byte. This can been seen here:

    Charset charset = Charset.forName("Big5");
    CharSequenceInputStream in = new CharSequenceInputStream("\uD800\uDC00", charset);
    System.out.println("Available: " + in.available());
    // Note: readAllBytes() is a method added in Java 9
    System.out.println("Actually read: " + in.readAllBytes().length);

    available() returns 2, but only 1 byte can be read.

I asked for feedback for the following reasons, but did not describe them very well, or at all, in the original description:

  • I was not sure what your opinion about these larger refactorings are
  • The refactored CharSequenceInputStream made the available() implementation more correct (see above mentioned issue), but made the results less useful. This is also why the tests were failing because they were still expecting specific available() results, which cannot necessarily be guaranteed anymore. The tests could be rewritten to only check the available() result against the maximum, but not impose any minimum anymore. But I first wanted to hear your opinion on whether that is fine.
  • The new CharSequenceInputStream implementation is basically a wrapped ReaderInputStream with a bunch of boilerplate code to support mark and reset. Additionally the new mark implementation considers its parameter value now, which differs from the previous behavior. Is this new CharSequenceInputStream implementation reasonable?

if (bufferSize < 1) {
throw new IllegalArgumentException("Buffer size must be >= 1");
}
// TODO: bufferSize is unused
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change this to pass it to ReaderInputStream and use it as inital encoderOut size there.

@Marcono1234
Copy link
Contributor Author

Also, will these changes require me to sign the Apache CLA?

@garydgregory
Copy link
Member

Also, will these changes require me to sign the Apache CLA?

I don't think so but IANAL.

@Marcono1234
Copy link
Contributor Author

Regardless of what will happen with this pull request, should I report the two issues mentioned above on Jira?

@Marcono1234
Copy link
Contributor Author

For completeness sake I have pushed one more commit to this marcono1234/ReaderInputStream-refactor branch which uses the bufferSize parameter of the CharSequenceInputStream constructor again. (Probably does not show up here because this PR is closed.)

Though that branch would still need adjustments to the tests as mentioned in my comments above.

Regardless of what will happen with this pull request I have now created two issues on Jira which track the bugs mentioned above:

  • IO-780: ReaderInputStream discards some encoding errors
  • IO-781: CharSequenceInputStream.available() returns too large numbers in some cases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants