Incorrect result for decodeUtf8 of "\194" ByteString with text 1.0 #61

Closed
snoyberg opened this Issue Dec 29, 2013 · 6 comments


@snoyberg
Contributor

Consider the following code:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.IO as TIO
import Data.Text.Encoding (decodeUtf8)

main :: IO ()
main = TIO.putStrLn $ decodeUtf8 "\194"

With text versions before 1.0, this throws an exception, which is also the expected behavior. In particular:

text-bug.hs: Cannot decode byte '\xc2': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream

Beginning with release 1.0, this prints out an empty string. This behavior was initially reported as a bug in conduit's text decoding at snoyberg/conduit#127.
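
For anyone who wants to check a given text version without catching an exception, here is a minimal sketch using decodeUtf8' from the same module, which returns an Either instead of throwing. The helper name describe is hypothetical, and the comments state what a correct decoder should do, not output captured from any particular release:

```haskell
import qualified Data.ByteString as B
import Data.Text.Encoding (decodeUtf8')

-- Hypothetical helper: report whether a byte string decodes as UTF-8.
describe :: B.ByteString -> String
describe bs = case decodeUtf8' bs of
  Left err -> "invalid: " ++ show err
  Right t  -> "ok: " ++ show t

main :: IO ()
main = do
  -- 0xC2 (decimal 194) is the lead byte of a two-byte sequence with no
  -- continuation byte, so a correct decoder should reject it.
  putStrLn (describe (B.pack [0xC2]))
  -- 0xC2 0xA9 is the complete encoding of U+00A9 and should decode.
  putStrLn (describe (B.pack [0xC2, 0xA9]))
```

On an affected release, the first line would presumably report "ok" with an empty string, matching the behavior described above.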

@snoyberg snoyberg referenced this issue in snoyberg/conduit Dec 29, 2013
Closed

test failed: text utf8 raw bytes #127

@bos
Owner
bos commented Dec 30, 2013

This bisects back to aca0971 which was written by @bgamari. Sorry for the regression, I'm looking into it now.

@snoyberg
Contributor

Thanks Bryan, let me know if I can be of any assistance.

@bos bos added a commit that referenced this issue Dec 30, 2013
@bos Ensure that t_utf8_err gets fed *only* invalid UTF-8 inputs
This test currently fails due to gh-61.

--HG--
extra : amend_source : ca66a1e6503a0cb9cf6cf5f2b82f2199133a6512
9c11b44
@bos bos added a commit that referenced this issue Dec 30, 2013
@bos Merge fix for gh-61 into 1.0 branch f91d749
@bos
Owner
bos commented Dec 30, 2013

OK, please test the 1.0 branch. It works for me. If it looks good to you, I'll release it.

The fix is in 7c09f3c, and a much improved test case is in 9c11b44. The new test case reveals the regression that was introduced in aca0971, and is fixed by the alleged fix :-)
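
To illustrate the idea of feeding a decoder only invalid input, here is a rough sketch. It is not the test from 9c11b44; it assumes only base, bytestring and text, and the names truncations and rejectsTruncated are made up for this example. It cuts the UTF-8 encoding of a character short and checks that every truncated prefix is rejected:

```haskell
import qualified Data.ByteString as B
import Data.Either (isLeft)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- All proper, non-empty prefixes of the UTF-8 encoding of one character.
-- Each prefix stops in the middle of a multi-byte sequence, so it is
-- invalid UTF-8. For a single-byte (ASCII) character the list is empty.
truncations :: Char -> [B.ByteString]
truncations c = [B.take n bytes | n <- [1 .. B.length bytes - 1]]
  where
    bytes = encodeUtf8 (T.singleton c)

-- True when the decoder rejects every truncated encoding of the character.
rejectsTruncated :: Char -> Bool
rejectsTruncated = all (isLeft . decodeUtf8') . truncations

main :: IO ()
main =
  -- U+00A9, U+20AC and U+1F600 encode to 2, 3 and 4 bytes respectively.
  print (all rejectsTruncated "\169\8364\128512")
```

The single byte "\194" from the original report is the first truncation of U+00A9, so a decoder with the gh-61 regression should fail this check.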

@bos bos added a commit that closed this issue Dec 30, 2013
@bos Improve on previous fix
This version tries to force the real decoding function to be inlined
into each of its callers, which in turn each have different criteria
for backing up a byte. This avoids an extra test at the end of
strict decoding.

While this seems to fix gh-61, I want to beef up the test suite so
that it will correctly detect the bug.
7c09f3c
@bos bos closed this in 7c09f3c Dec 30, 2013
@snoyberg
Contributor

I can confirm that with the 1.0 branch, the conduit test suite passes, so I believe the issue is resolved. Thank you!

@bos
Owner
bos commented Dec 30, 2013

OK, it's up: http://hackage.haskell.org/package/text-1.0.0.1

I also added a couple of final invalid UTF-8 generators, for good measure, in 494d7d9.

@AnneTheAgile

Cross-reference to the blog post: http://www.serpentine.com/blog/2013/12/30/testing-a-utf-8-decoder-with-vigour/ (Pet peeve: I wish GitHub showed dates, not 'ago' timestamps. 2013-12-30 is not 2 months ago.)
