Refactor `ReaderInputStream` implementation #293

Marcono1234 · 2021-11-01T01:04:22Z

The new implementation follows more closely the required behavior defined by CharsetEncoder, instead of assuming that buffers of certain size will work without having to resize them.

Relates to:

Sorry for creating this pull request so late and not participating in the review of the previous pull requests for these issues.
If you don't think this pull request is worth it, feel free to close it. The current implementation will probably work in most situations correctly, but here it overwrites the lastCoderResult, even when the previous encode was unsuccessful, e.g. OVERFLOW or ERROR.

I also tried refactoring org.apache.commons.io.input.CharSequenceInputStream to use ReaderInputStream to reduce error-prone usage of CharsetEncoder there. However, this causes some behavior changes:

available() will in most cases return 0
readLimit parameter of mark is now actually considered, so code calling mark(0) will not work anymore

I have marked this pull request as draft for now because I have not extensively tested it. Please let me know what you think.

The new implementation follows more closely the required behavior defined by CharsetEncoder, instead of assuming that buffers of certain size will work without having to resize them.

This avoids having to implement error-prone CharsetEncoder logic in CharSequenceInputStream. However, to keep support for `mark` some extra code is needed, which does not add much value compared to using BufferedInputStream instead.

Marcono1234 · 2021-11-01T16:05:22Z

src/test/java/org/apache/commons/io/input/CharSequenceInputStreamTest.java

@@ -217,62 +218,60 @@ private void testIO_356(final int bufferSize, final int dataSize, final int read
        is.close();

        // data buffers should be identical
-        assertArrayEquals(data1, data2, "bufferSize=" + bufferSize + " dataSize=" + dataSize);
+        assertArrayEquals(data1, data2, "dataSize=" + dataSize);
    }

    @Test
    public void testIO_356_B10_D10_S0_UTF16() throws Exception {


Probably have to change the method names of these tests (removing B10).

garydgregory · 2021-11-11T19:51:20Z

"Please let me know what you think."

All builds are red 🙁

garydgregory · 2022-04-04T19:48:50Z

Closing, no feedback.

Marcono1234 · 2022-04-16T23:50:51Z

Sorry for not having responded earlier, and for not properly following up on this.

The issues mentioned in the description still exist on master:

ReaderInputStream erroneously overwrites lastCoderResult
This can for example be seen with the following code snippet. The string has a trailing unpaired surrogate, which is therefore malformed. The encoder is created to throw on malformed input. However, currently read() erroneously just returns -1 instead of throwing an exception.
```
// Encoder which throws on malformed or unmappable input
CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
ReaderInputStream in = new ReaderInputStream(new StringReader("\uD800"), encoder);
System.out.println("Read: " + in.read());
```
CharSequenceInputStream.available() makes erroneous assumptions about conversion ratio
Currently the implementation assumes that for each char at least one byte will be encoded. This might be incorrect for some specific encodings (have not found an example for this yet though), but is definitely incorrect for the default malformed input action of REPLACE. With this, encodings replace supplementary code points (consisting of two chars) with a single byte. This can been seen here:
```
Charset charset = Charset.forName("Big5");
CharSequenceInputStream in = new CharSequenceInputStream("\uD800\uDC00", charset);
System.out.println("Available: " + in.available());
// Note: readAllBytes() is a method added in Java 9
System.out.println("Actually read: " + in.readAllBytes().length);
```
available() returns 2, but only 1 byte can be read.

I asked for feedback for the following reasons, but did not describe them very well, or at all, in the original description:

I was not sure what your opinion about these larger refactorings are
The refactored CharSequenceInputStream made the available() implementation more correct (see above mentioned issue), but made the results less useful. This is also why the tests were failing because they were still expecting specific available() results, which cannot necessarily be guaranteed anymore. The tests could be rewritten to only check the available() result against the maximum, but not impose any minimum anymore. But I first wanted to hear your opinion on whether that is fine.
The new CharSequenceInputStream implementation is basically a wrapped ReaderInputStream with a bunch of boilerplate code to support mark and reset. Additionally the new mark implementation considers its parameter value now, which differs from the previous behavior. Is this new CharSequenceInputStream implementation reasonable?

Marcono1234 · 2022-04-16T23:52:28Z

src/main/java/org/apache/commons/io/input/CharSequenceInputStream.java

+        if (bufferSize < 1) {
+            throw new IllegalArgumentException("Buffer size must be >= 1");
+        }
+        // TODO: bufferSize is unused


Will change this to pass it to ReaderInputStream and use it as inital encoderOut size there.

Marcono1234 · 2022-04-16T23:55:10Z

Also, will these changes require me to sign the Apache CLA?

garydgregory · 2022-04-17T15:32:18Z

Also, will these changes require me to sign the Apache CLA?

I don't think so but IANAL.

Marcono1234 · 2022-04-25T19:28:33Z

Regardless of what will happen with this pull request, should I report the two issues mentioned above on Jira?

Marcono1234 · 2022-09-11T14:30:54Z

For completeness sake I have pushed one more commit to this marcono1234/ReaderInputStream-refactor branch which uses the bufferSize parameter of the CharSequenceInputStream constructor again. (Probably does not show up here because this PR is closed.)

Though that branch would still need adjustments to the tests as mentioned in my comments above.

Regardless of what will happen with this pull request I have now created two issues on Jira which track the bugs mentioned above:

IO-780: ReaderInputStream discards some encoding errors
IO-781: CharSequenceInputStream.available() returns too large numbers in some cases

Marcono1234 added 2 commits November 1, 2021 02:07

TEMP: Refactor ReaderInputStream implementation

944df8c

The new implementation follows more closely the required behavior defined by CharsetEncoder, instead of assuming that buffers of certain size will work without having to resize them.

Marcono1234 force-pushed the marcono1234/ReaderInputStream-refactor branch from 0566dbd to 7ab9ef3 Compare November 1, 2021 01:09

Marcono1234 commented Nov 1, 2021

View reviewed changes

garydgregory closed this Apr 4, 2022

Marcono1234 commented Apr 16, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor `ReaderInputStream` implementation #293

Refactor `ReaderInputStream` implementation #293

Marcono1234 commented Nov 1, 2021

Marcono1234 Nov 1, 2021

garydgregory commented Nov 11, 2021

garydgregory commented Apr 4, 2022

Marcono1234 commented Apr 16, 2022

Marcono1234 Apr 16, 2022

Marcono1234 commented Apr 16, 2022

garydgregory commented Apr 17, 2022

Marcono1234 commented Apr 25, 2022

Marcono1234 commented Sep 11, 2022

Refactor ReaderInputStream implementation #293

Refactor ReaderInputStream implementation #293

Conversation

Marcono1234 commented Nov 1, 2021

Marcono1234 Nov 1, 2021

Choose a reason for hiding this comment

garydgregory commented Nov 11, 2021

garydgregory commented Apr 4, 2022

Marcono1234 commented Apr 16, 2022

Marcono1234 Apr 16, 2022

Choose a reason for hiding this comment

Marcono1234 commented Apr 16, 2022

garydgregory commented Apr 17, 2022

Marcono1234 commented Apr 25, 2022

Marcono1234 commented Sep 11, 2022

Refactor `ReaderInputStream` implementation #293

Refactor `ReaderInputStream` implementation #293