"source sequence is illegal/malformed utf-8" works with 1.7.7, but not with 1.8.x #242

Closed
viktorbenei opened this issue Apr 13, 2015 · 4 comments

Comments

@viktorbenei

The issue: starting with version 1.8.0, the json gem raises "source sequence is illegal/malformed utf-8" for the same input that worked fine in 1.7.7.

I've created a repository to reproduce the issue: https://github.com/viktorbenei/test-illegal-malformed-ruby-json

@flori
Owner

flori commented Apr 22, 2015

You should convert line to UTF-8 instead of just falsely tagging the string as UTF-8 via force_encoding, which changes the encoding label without touching the bytes. For example, you can use line.encode("UTF-8", invalid: :replace) to accomplish that.
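
A minimal sketch of the difference, assuming a hypothetical log line that contains a stray 0x80 byte (the sample string and variable names are illustrative, not taken from the reproduction repository). It relies on recent Ruby versions, where a UTF-8 to UTF-8 encode with an explicit invalid: :replace option substitutes the invalid bytes; older Rubies may treat that call as a no-op.

```ruby
# Hypothetical input: raw bytes from a log, including one byte (0x80)
# that is not valid UTF-8 on its own. String#b returns a binary-tagged copy.
raw = "ok \x80 done".b

# force_encoding only changes the label; the bytes stay the same,
# so the string is now *tagged* UTF-8 but still malformed.
tagged = raw.dup.force_encoding("UTF-8")
puts tagged.valid_encoding?    # the tag alone does not make it valid

# encode with invalid: :replace actually rewrites the bad bytes,
# substituting the Unicode replacement character U+FFFD.
cleaned = tagged.encode("UTF-8", invalid: :replace)
puts cleaned.valid_encoding?
puts cleaned.include?("\uFFFD")
```

Only the cleaned string is guaranteed to be well-formed UTF-8, which is what the generator checks for.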

@viktorbenei
Author

@flori you're right, but

  1. Shouldn't it work if the conversion line (line = line.to_s.force_encoding("UTF-8").encode("UTF-8")) does not raise an error?
  2. Is this a known change in 1.8.x, given that it works perfectly in 1.7.x?

@flori
Owner

flori commented Apr 22, 2015

  1. The conversion only raises an error if the line is tagged with a different encoding, for example ASCII-8BIT:

    line = File.read('sample-log.txt', encoding: 'ASCII-8BIT')

    => "[line] : .\x80\x9CProductV"

    line.encoding

    => #<Encoding:ASCII-8BIT>

    line.encode('UTF-8')
    Encoding::UndefinedConversionError: "\x80" from ASCII-8BIT to UTF-8

  2. The 1.8.x behaviour is the correct one, because the JSON output should be UTF-8 and relying on the tagged encoding alone doesn't guarantee that. This change was made in ca25df02.
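
The two cases from point 1 can be compared side by side. This is a sketch using two hypothetical bytes rather than the sample-log.txt file from the thread: converting from ASCII-8BIT raises, while relabelling the same bytes as UTF-8 and calling encode("UTF-8") raises nothing and validates nothing.

```ruby
# Two bytes that do not form a valid UTF-8 sequence, tagged as binary.
bytes = "\x80\x9C".b

# Case A: a real conversion from ASCII-8BIT to UTF-8 has no mapping
# for byte 0x80, so Ruby raises Encoding::UndefinedConversionError.
error = begin
  bytes.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end
puts error.class

# Case B: once the string is relabelled as UTF-8, encode("UTF-8") sees
# source and target encodings as identical and returns the bytes as-is.
relabelled = bytes.dup.force_encoding("UTF-8").encode("UTF-8")
puts relabelled.valid_encoding?
```

So the absence of an exception in case B says nothing about whether the string is well-formed UTF-8.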

@viktorbenei
Author

@flori thank you for the detailed explanation. We'll use a combination of .force_encoding("UTF-8").encode("UTF-8") and, as a fallback, line.encode("UTF-8", invalid: :replace). That's the only reliable way we could find to preserve as many characters as possible while still being able to encode and transfer the content. We have to transfer the logs of scripts, and because the scripts are user-defined, we can't guarantee that their output is always correctly encoded.
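
One way such a fallback could look, as a rough sketch only: the helper name to_utf8 and the sample input are hypothetical, and this is not necessarily the code that was actually deployed. It keeps the line untouched when the bytes are already valid UTF-8 and replaces only the offending bytes otherwise.

```ruby
require "json"

# Hypothetical helper: return a string that is safe to hand to the
# JSON generator, changing as few characters as possible.
def to_utf8(line)
  # dup so a frozen or shared input string is never mutated.
  candidate = line.to_s.dup.force_encoding("UTF-8")

  # Fast path: the bytes already form valid UTF-8, keep them as-is.
  return candidate if candidate.valid_encoding?

  # Fallback: replace only the invalid byte sequences with U+FFFD.
  candidate.encode("UTF-8", invalid: :replace)
end

# Usage with an illustrative malformed log line.
safe = to_utf8("build log \x80\x9C finished".b)
payload = JSON.generate({ "line" => safe })
puts payload
```

The explicit valid_encoding? check matters here because the relabel-and-encode step by itself never raises, so there is no exception to trigger the fallback.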
