
GetStringUTFChars() does not actually return modified UTF-8 #283

Closed
ntrrgc opened this issue Jan 21, 2017 · 11 comments

ntrrgc commented Jan 21, 2017

Description

I defined a String in Java like this:

String string = "\uD83D\uDE3A"; // 😺 (U+1F63A)

And passed it to a native function. Inside it I called env->GetStringUTFChars(myJstring, nullptr) and inspected the return value in the Android Studio debugger (screenshot not shown).

According to the docs, the string should be in Modified UTF-8 format, so the cat character should take 6 bytes... but it's taking 4 instead, as in Standard UTF-8.

JNI docs on Modified UTF-8: https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/types.html#modified_utf_8_strings
JNI docs on GetStringUTFChars: https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringUTFChars
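
To make the difference concrete, here is a minimal self-contained C++ sketch (illustrative, not from the report) that encodes U+1F63A both ways: standard UTF-8 encodes the code point directly as one 4-byte sequence, while modified UTF-8 first splits it into the UTF-16 surrogate pair U+D83D / U+DE3A and encodes each surrogate as a separate 3-byte sequence, for 6 bytes total:

#include <cstdint>
#include <cstdio>
#include <vector>

// Standard UTF-8: a code point above U+FFFF becomes one 4-byte sequence.
static std::vector<uint8_t> StandardUtf8(uint32_t cp) {
    return { (uint8_t)(0xF0 | (cp >> 18)),
             (uint8_t)(0x80 | ((cp >> 12) & 0x3F)),
             (uint8_t)(0x80 | ((cp >> 6) & 0x3F)),
             (uint8_t)(0x80 | (cp & 0x3F)) };
}

// Modified UTF-8 (CESU-8 style): encode each UTF-16 surrogate separately
// as a 3-byte sequence, so a supplementary code point takes 6 bytes.
static std::vector<uint8_t> ModifiedUtf8(uint32_t cp) {
    uint32_t v = cp - 0x10000;
    uint16_t units[2] = { (uint16_t)(0xD800 | (v >> 10)),     // high surrogate
                          (uint16_t)(0xDC00 | (v & 0x3FF)) }; // low surrogate
    std::vector<uint8_t> out;
    for (uint16_t u : units) {
        out.push_back((uint8_t)(0xE0 | (u >> 12)));
        out.push_back((uint8_t)(0x80 | ((u >> 6) & 0x3F)));
        out.push_back((uint8_t)(0x80 | (u & 0x3F)));
    }
    return out;
}

int main() {
    for (uint8_t b : StandardUtf8(0x1F63A)) printf("%02X ", b); // F0 9F 98 BA
    printf("\n");
    for (uint8_t b : ModifiedUtf8(0x1F63A)) printf("%02X ", b); // ED A0 BD ED B8 BA
    printf("\n");
}

The 6-byte form is exactly what appears (sign-extended) in the log further down; the 4-byte form is what was observed here.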

Environment Details

Not all of these will be relevant to every bug, but please provide as much
information as you can.

  • NDK Version: 13.1.3345770
  • Build system: CMake
  • Host OS: Arch Linux
  • Compiler: Clang

kneth commented Feb 17, 2017

It could be that the native debugger in Android Studio is showing you the wrong thing. I have written a small C++ function to test it:

#include <cstring>
#include <jni.h>
#include <android/log.h>

#define TAG "FunWithStrings"

extern "C" JNIEXPORT void JNICALL
Java_net_zigzak_funwithstrings_MainActivity_toJNI(JNIEnv *env, jobject instance, jstring str_) {
    const char *str = env->GetStringUTFChars(str_, 0);
    const jint length = env->GetStringLength(str_); // length in UTF-16 code units

    __android_log_print(ANDROID_LOG_DEBUG, TAG, "length = %d / %zu", length, strlen(str));
    for (size_t i = 0; i < strlen(str); ++i) {
        // char is signed here, so bytes >= 0x80 sign-extend to 0xFFFFFFxx in the log.
        __android_log_print(ANDROID_LOG_DEBUG, TAG, "str[%zu] = 0x%X", i, str[i]);
    }

    env->ReleaseStringUTFChars(str_, str);
}

And I am using it from Java:

String emoji = "\uD83D\uDE3A";
toJNI(emoji);

I see the following in the log:

02-17 15:07:16.619 14279-14279/? D/FunWithStrings: length = 2 / 6
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[0] = 0xFFFFFFED
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[1] = 0xFFFFFFA0
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[2] = 0xFFFFFFBD
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[3] = 0xFFFFFFED
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[4] = 0xFFFFFFB8
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[5] = 0xFFFFFFBA

Putting a breakpoint in my C++ code, I get about the same picture:

[screenshot: "screen shot 2017-02-17 at 15 03 35" — debugger view at the breakpoint, showing the same six bytes]

By the way, I am using NDK r10e.


ntrrgc commented Feb 17, 2017

> It could be that the native debugger in Android Studio is showing you the wrong thing.

I don't think the native debugger would ever do something like that, no matter how buggy it might be: it's just a char *. Still, to prove it innocent, I ran your function:

02-17 16:04:57.204 5128-5128/? D/FunWithStrings: length = 2 / 4
02-17 16:04:57.204 5128-5128/? D/FunWithStrings: str[0] = 0xFFFFFFF0
02-17 16:04:57.204 5128-5128/? D/FunWithStrings: str[1] = 0xFFFFFF9F
02-17 16:04:57.204 5128-5128/? D/FunWithStrings: str[2] = 0xFFFFFF98
02-17 16:04:57.204 5128-5128/? D/FunWithStrings: str[3] = 0xFFFFFFBA


enh commented Feb 17, 2017

(sorry, i thought i'd commented on this when it was first raised and i asked our Java runtime folks to improve the docs...)

This is actually WAI (working as intended) for Android. Android has always done the standard UTF-8 thing for emoji. It was a long time ago, but iirc they tried both and "standard" broke fewer apps' assumptions than "modified".

i'll ping that team again about getting the docs improved...


enh commented Feb 17, 2017

i've raised internal bug http://b/35469153 to track improving the documentation.


ntrrgc commented Feb 17, 2017

Could you please fix the link? 😕


enh commented Feb 17, 2017

it's internal to google. that link is for our benefit when we're trying to see whether the docs have actually been fixed next time we're pinged on this bug... :-)

kneth commented Feb 20, 2017

@enh Does the conversion depend on locale settings? When I run the test function, I get a length of 6, while @ntrrgc gets 4.


enh commented Feb 20, 2017

i don't think so, though it might differ between platform versions. i don't know what dalvik did, but art should (as i understand it) always use regular UTF-8.
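
For native code that must not depend on which behavior a given runtime implements, a common workaround is to bypass GetStringUTFChars entirely and call String.getBytes("UTF-8") through JNI. A minimal sketch, with exception checks omitted and the helper name ToStandardUtf8 purely illustrative:

#include <jni.h>
#include <string>

// Hypothetical helper (not from this thread): returns the contents of `str`
// as standard UTF-8 by calling String.getBytes("UTF-8") via JNI, so the
// result does not depend on what GetStringUTFChars happens to do.
// ExceptionCheck/ExceptionClear handling is omitted for brevity.
static std::string ToStandardUtf8(JNIEnv *env, jstring str) {
    jclass string_class = env->GetObjectClass(str);
    jmethodID get_bytes = env->GetMethodID(string_class, "getBytes",
                                           "(Ljava/lang/String;)[B");
    jstring charset = env->NewStringUTF("UTF-8");  // ASCII-only, safe here
    jbyteArray bytes =
        (jbyteArray) env->CallObjectMethod(str, get_bytes, charset);
    jsize len = env->GetArrayLength(bytes);
    std::string out((size_t) len, '\0');
    env->GetByteArrayRegion(bytes, 0, len, reinterpret_cast<jbyte *>(&out[0]));
    env->DeleteLocalRef(bytes);
    env->DeleteLocalRef(charset);
    env->DeleteLocalRef(string_class);
    return out;
}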

alexcohn commented Feb 27, 2017

@enh: unfortunately, "has always done the standard UTF-8 thing for emoji" sounds like a quotation from G. Orwell. This commit suggests that until 2014, 4-byte UTF-8 sequences could not work at all, and after it they were internally converted to surrogate pairs. Even today, JNI checks whether |bytes| is valid modified UTF-8 but also accepts 4-byte UTF-8 sequences in place of encoded surrogate pairs.

@kneth: which platform did you use for testing?

kneth commented Feb 27, 2017

I was running on an x86 emulator, API 22.

DanAlbert commented

(doc issue isn't fixed, but that's tracked elsewhere and there's nothing to do from the NDK side)
