This package contains a test case for a UTF-8 decoding problem I am running into.
Tested with GHC 7.8.3 (Haskell Platform 2014.2.0.0) on OS X 10.8.5.

To reproduce:

- Build the executable `bug`:

      cabal sandbox init
      cabal install --only-dependencies --force-reinstalls
      cabal build

- Run the tests:

      $ cabal repl
      *Main> main1
      *** Exception: Cannot decode byte '\xc2': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream
      *Main> main1a
      output written to file output1a.html
      *Main> main2
      output written to file output2.html
      *Main> main3
      in file 1510-3.html, number of bytes: 75632 chars: 75046 spaces: 4396
      *Main> main4
      Message count: 71
      Found Unicode special character in: M {messageId_ = "633099125665929015", parentId_ = ...

In the output of `main4`, look for the code points `\65533\65533`.

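
For readers unfamiliar with the two decoding modes, here is a minimal sketch, separate from the package's own code, of how strict and lenient decoding in the `text` package treat a lone `\xc2` byte. It uses the bytes `C2 A0` (the UTF-8 encoding of U+00A0) as an illustrative input. Splitting them into two separately decoded pieces is only one situation that could produce both the exception and a `\65533\65533` pair; whether that is what happens in this test case is an assumption, not something established here. The names `whole`, `firstHalf`, `secondHalf`, `strictOk` and `lenient` are made up for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- U+00A0 (no-break space) is encoded in UTF-8 as the two bytes C2 A0.
whole :: B.ByteString
whole = B.pack [0xC2, 0xA0]

-- The same two bytes, split as if a boundary fell between them
-- (hypothetical; not taken from 1510-3.html).
firstHalf, secondHalf :: B.ByteString
firstHalf  = B.pack [0xC2]
secondHalf = B.pack [0xA0]

-- Strict decoding: True if the bytes are valid UTF-8 on their own.
strictOk :: B.ByteString -> Bool
strictOk = either (const False) (const True) . decodeUtf8'

-- Lenient decoding: invalid input is replaced by U+FFFD ('\65533').
lenient :: B.ByteString -> T.Text
lenient = decodeUtf8With lenientDecode

main :: IO ()
main = do
  print (strictOk whole)      -- both bytes together
  print (strictOk firstHalf)  -- lead byte with no continuation byte
  print (lenient whole)
  print (T.append (lenient firstHalf) (lenient secondHalf))
```

Decoded as a whole, the two bytes are valid. Decoded separately, neither piece is valid UTF-8 on its own, so the strict check rejects it and the lenient decoder substitutes U+FFFD.
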
- The program uses `xml-conduit` to scrape messages from a web forum page (file `1510-3.html`).
- The module `NewDOM.hs` is a copy of `Text.HTML.DOM` with `lenientDecode` changed to `strictDecode`.
- `main1` reads the HTML file using `strictDecode` and writes it back out as `output1.html`.
- `main1a` reads the HTML file (as a strict ByteString) using `strictDecode` and writes it back out as `output1a.html`.
- `main2` reads the HTML file using `lenientDecode` and writes it back out as `output2.html`.
- `main3` reads the HTML file (as a strict ByteString) using `strictDecode` and counts the number of code points and spaces.
- `main4` reads the HTML file using `lenientDecode`, downcasing tag and attribute names, and scrapes messages from it using `Text.XML.Cursor`.
- I believe the HTML file contains well-formed UTF-8. There is a Perl script, `check-utf8`, which validates it. In particular, the file does not contain the code point `\65533`.
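
The same two checks (the file is valid UTF-8, and it contains no U+FFFD) can also be made from Haskell on a strict ByteString. The sketch below is not the `check-utf8` script and not the package's `main3`; it is an independent illustration, and the names `stats`, `hasReplacement` and `replacementBytes` are hypothetical. It assumes that "spaces" means the ASCII space character only, which may differ from what `main3` counts.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')
import System.Environment (getArgs)

-- UTF-8 encoding of U+FFFD (decimal 65533): the bytes EF BF BD.
replacementBytes :: B.ByteString
replacementBytes = B.pack [0xEF, 0xBF, 0xBD]

-- True if the raw bytes already contain an encoded U+FFFD.
hasReplacement :: B.ByteString -> Bool
hasReplacement = B.isInfixOf replacementBytes

-- (bytes, code points, ASCII spaces) if the input is valid UTF-8,
-- Nothing if strict decoding of the whole input fails.
stats :: B.ByteString -> Maybe (Int, Int, Int)
stats bs = case decodeUtf8' bs of
  Left _  -> Nothing
  Right t -> Just (B.length bs, T.length t, T.length (T.filter (== ' ') t))

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      bs <- B.readFile path  -- strict read: the whole file in one piece
      putStrLn ("contains U+FFFD: " ++ show (hasReplacement bs))
      case stats bs of
        Nothing -> putStrLn "not valid UTF-8"
        Just (nBytes, nChars, nSpaces) ->
          putStrLn ("bytes: " ++ show nBytes
                    ++ " chars: " ++ show nChars
                    ++ " spaces: " ++ show nSpaces)
    _ -> putStrLn "usage: pass the path of the HTML file as the only argument"
```

Run against `1510-3.html`, a result of "contains U+FFFD: False" together with a successful decode would support the claim that the replacement characters are introduced during decoding rather than being present in the input.
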