Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Bad UTF-8 content blows up parser #138

headius opened this Issue May 31, 2012 · 2 comments


None yet
2 participants

headius commented May 31, 2012

I know, I know...if it's bad content it's bad content. But this represents a difference from MRI.

Here's the case, again a reduced version of one I got from @rkh:

# encoding: utf-8
require 'json'
x = "{\"foo\":\"\xC3\"}"
h = JSON.parse(x)
p h['foo']
p h['foo'].encoding

So basically there's a bad byte in a UTF-8 string, and the MRI version walks right by it and allows it to come through to the resulting parsed json structure.

I have a totally broken patch for this:

diff --git a/java/src/json/ext/ByteListTranscoder.java b/java/src/json/ext/ByteListTranscoder.java
index ed9e54b..a7e42ba 100644
--- a/java/src/json/ext/ByteListTranscoder.java
+++ b/java/src/json/ext/ByteListTranscoder.java
@@ -78,9 +78,10 @@ abstract class ByteListTranscoder {
             return head;
         if (head <= 0xbf) { // 0b10xxxxxx
-            throw invalidUtf8(); // tail byte with no head
+            return head; //throw invalidUtf8(); // tail byte with no head
         if (head <= 0xdf) { // 0b110xxxxx
+            if (pos + 1 > srcEnd) return head;
             int cp = ((head  & 0x1f) << 6)
                      | nextPart();

Again, I'm not sure this is actually something that needs to be fixed, but because the MRI version of json does not blow up on this content, there's something to be addressed.


flori commented Jun 7, 2012

I think the other variants should also raise an exception. I already implemented this for Ruby 1.9 extension and pure variants , but 1.8 is a bit more difficult.


headius commented Jun 8, 2012

I'm not particularly concerned about 1.8. The report we got was that JRuby errored in 1.9 mode and I believe the user's only concerned with 1.9 mode. Hopefully the upstream data can be fixed too, since obviously the "fix" of actually erroring out will now cause the "bug" for MRI too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment