"source sequence is illegal/malformed utf-8" works with 1.7.7, but not with 1.8.x #242

viktorbenei · 2015-04-13T19:34:41Z

The issue: starting with version 1.8.0, the json gem raises a "source sequence is illegal/malformed utf-8" error for the same input that worked fine in 1.7.7.

I've created a repository to reproduce the issue: https://github.com/viktorbenei/test-illegal-malformed-ruby-json


flori · 2015-04-22T11:33:40Z

You should convert line to UTF-8 instead of just falsely tagging the line string as UTF-8 via force_encoding. You can for example use line.encode("UTF-8", invalid: :replace) to accomplish that.
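As a minimal sketch of the difference between tagging and converting (the sample bytes and variable names here are made up for illustration and are not taken from the reproduction repository): force_encoding only changes the label on the string, while encode actually transcodes it. The sketch transcodes from a binary-tagged source with both invalid: and undef: replacement, which is one variation on the call suggested above.

```ruby
# A lone 0x80 byte is never valid UTF-8; .b returns a binary (ASCII-8BIT) copy.
raw = "caf\x80 ok".b

# force_encoding only relabels the string; the bytes stay exactly as they were.
tagged = raw.dup.force_encoding("UTF-8")
tagged_valid = tagged.valid_encoding?   # the label says UTF-8, the bytes disagree

# encode transcodes from the binary source and replaces every byte that has
# no UTF-8 mapping with the given replacement string.
converted = raw.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
converted_valid = converted.valid_encoding?
```

The replacement character is a choice: "?" keeps the output ASCII-friendly, while leaving replace: out falls back to Ruby's default replacement for the target encoding.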

viktorbenei · 2015-04-22T11:54:12Z

@flori you're right, but

Shouldn't it work if the conversion line (line = line.to_s.force_encoding("UTF-8").encode("UTF-8")) does not raise an error?
Is this a known change in 1.8.x, given that it works perfectly in 1.7.x?
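A short sketch of why that chain can finish without an error even for broken input: once a string is tagged as UTF-8, calling encode("UTF-8") with no options has nothing to transcode, so it returns the bytes unchanged and never inspects them. Checking valid_encoding? is one way to detect the problem explicitly. The sample bytes are an assumption for illustration.

```ruby
# Two bytes that cannot stand alone in UTF-8, mislabelled as UTF-8.
broken = "\x80\x9C".b.force_encoding("UTF-8")

# Same source and target encoding, no options: the bytes pass through untouched
# and no exception is raised.
same = broken.encode("UTF-8")

# The string is still invalid; only an explicit check reveals that.
still_valid = same.valid_encoding?
```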

flori · 2015-04-22T13:41:44Z

It will only raise an error if you tag the line for example as ASCII-8BIT:

line = File.read('sample-log.txt', encoding: 'ASCII-8BIT')

=> "[line] : .\x80\x9CProductV"

line.encoding

=> #<Encoding:ASCII-8BIT>

line.encode('UTF-8')
Encoding::UndefinedConversionError: "\x80" from ASCII-8BIT to UTF-8
…
The 1.8.x behaviour is the correct behaviour, because the JSON output should be UTF-8 and just relying on the tagged encoding doesn't guarantee that. This change was made in commit ca25df02.
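To illustrate the behaviour described here, the following sketch feeds a mislabelled string to the generator and then a cleaned copy. It assumes a json version at or after 1.8.0 and Ruby 2.1 or later (for String#scrub, which is used here in place of the encode call mentioned earlier); the sample string is modelled on the output shown above, not taken from the real log file.

```ruby
require "json"

# Bytes that are not valid UTF-8, tagged as UTF-8 anyway.
bad = "[line] : .\x80\x9CProductV".b.force_encoding("UTF-8")

# The generator validates the bytes instead of trusting the tag.
error = begin
  JSON.generate([bad])
  nil
rescue JSON::GeneratorError => e
  e
end

# After replacing the invalid bytes, generation succeeds.
safe = bad.scrub("?")
json = JSON.generate([safe])
```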

viktorbenei · 2015-04-22T17:50:21Z

@flori thank you for the detailed explanation. We'll use .force_encoding("UTF-8").encode("UTF-8") first, with line.encode("UTF-8", invalid: :replace) as a fallback. That is the only reliable way we could find to preserve as many characters as possible while still being able to encode and transfer the content. We have to transfer the logs of user-defined scripts, so we can't guarantee that they will always be in a correct encoding.
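One possible shape for such a fallback, as a sketch rather than the code actually used in the project: to_safe_utf8 is a hypothetical helper name, and it swaps in valid_encoding? and String#scrub (Ruby 2.1+) for the exact chain described above. Valid UTF-8 input is kept byte for byte; only invalid byte sequences are replaced.

```ruby
require "json"

# Hypothetical helper: returns a UTF-8 string that JSON.generate will accept.
def to_safe_utf8(line)
  # Work on a copy so the caller's string is not relabelled.
  text = line.to_s.dup.force_encoding(Encoding::UTF_8)
  # Keep the text untouched when the bytes really are valid UTF-8.
  return text if text.valid_encoding?
  # Otherwise replace only the invalid byte sequences with U+FFFD.
  text.scrub("\uFFFD")
end

good = to_safe_utf8("caf\xC3\xA9".b)  # valid UTF-8 bytes arriving with a binary tag
bad  = to_safe_utf8("caf\x80".b)      # one byte that is invalid in UTF-8
payload = JSON.generate([good, bad])
```

Relabelling first and scrubbing second preserves every character that was already valid UTF-8, which transcoding from a binary-tagged string would not do.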

viktorbenei closed this as completed Apr 22, 2015
