GitHub - erantapaa/decode-bug at https://coder.social

This package contains a test case for a UTF-8 decoding problem I am running into.

Tested with GHC 7.8.3 (Haskell Platform 2014.2.0.0) on OS X 10.8.5.

To reproduce:

Build the executable bug:

 cabal sandbox init
 cabal install --only-dependencies --force-reinstalls
 cabal build

Run the tests:

 $ cabal repl
 *Main> main1
 *** Exception: Cannot decode byte '\xc2': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream
 *Main> main1a
 output written to file output1a.html
 *Main> main2
 output written to file output2.html
 *Main> main3
 in file 1510-3.html, number of bytes: 75632 chars: 75046 spaces: 4396
 *Main> main4
 Message count: 71
 Found Unicode special character in:
 M {messageId_ = "633099125665929015", parentId_ = ...

In the output of main4, look for the code points \65533\65533. This is U+FFFD, the Unicode replacement character, appearing twice in a row.
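The notation \65533 is how Haskell's show renders U+FFFD in decimal. As a rough sketch of how one could check a decoded Text value for it, the helpers below count replacement characters. The names replacementChar and countReplacements are hypothetical and are not part of this package.

```haskell
import qualified Data.Text as T

-- U+FFFD, the Unicode replacement character; show renders it as '\65533'.
replacementChar :: Char
replacementChar = '\xFFFD'

-- Hypothetical helper: how many replacement characters a Text contains.
countReplacements :: T.Text -> Int
countReplacements = T.length . T.filter (== replacementChar)
```

A non-zero count on text decoded from a file that is known to be well-formed UTF-8 suggests that the replacement characters were introduced during decoding rather than read from the file.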

Notes

The program uses xml-conduit to scrape messages from a web forum page (file 1510-3.html).
The module NewDOM.hs is a copy of Text.HTML.DOM with lenientDecode changed to strictDecode.
main1 reads the HTML file using strictDecode and writes it back out as output1.html.
main1a reads the HTML file (as a strict ByteString) using strictDecode and writes it back out as output1a.html.
main2 reads the HTML file using lenientDecode and writes it back out as output2.html.
main3 reads the HTML file (as a strict ByteString) using strictDecode and counts the number of code points and spaces.
main4 reads the HTML file using lenientDecode, downcasing tag and attribute names, and scrapes messages from it using Text.XML.Cursor.
I believe the HTML file contains well-formed UTF-8. There is a Perl script, check-utf8, which validates it. In particular, the file does not contain the code point \65533 (U+FFFD).
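To illustrate how strictDecode and lenientDecode can disagree on input that is well-formed as a whole, here is a minimal sketch using only Data.Text.Encoding. It assumes, as one possible explanation that this README does not confirm, that the input is decoded chunk by chunk and that a chunk boundary falls inside a two-byte sequence starting with 0xC2. The sample bytes and the names wholeInput, firstChunk, secondChunk, isValidUtf8 and lenientPerChunk are hypothetical and are not taken from Main.hs.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- "a", U+00A0 (UTF-8 bytes 0xC2 0xA0), "b": well-formed UTF-8 as a whole.
wholeInput :: B.ByteString
wholeInput = B.pack [0x61, 0xC2, 0xA0, 0x62]

-- Hypothetical chunking that cuts the two-byte sequence in half.
firstChunk, secondChunk :: B.ByteString
(firstChunk, secondChunk) = B.splitAt 2 wholeInput

-- Strict check: True when the bytes are valid UTF-8 on their own.
isValidUtf8 :: B.ByteString -> Bool
isValidUtf8 bytes = case decodeUtf8' bytes of
  Left _  -> False
  Right _ -> True

-- Decode each chunk independently and leniently, then concatenate.
lenientPerChunk :: [B.ByteString] -> T.Text
lenientPerChunk = T.concat . map (decodeUtf8With lenientDecode)
```

In GHCi, isValidUtf8 wholeInput is True while isValidUtf8 firstChunk is False, and lenientPerChunk [firstChunk, secondChunk] contains two adjacent replacement characters between "a" and "b". That matches the shape of both symptoms above (a strict failure on byte 0xC2 and a \65533\65533 pair under lenient decoding), but whether this is what actually happens inside the library is an assumption.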

