
GetStringUTFChars() does not actually return modified UTF-8 #283

Closed
ntrrgc opened this issue Jan 21, 2017 · 11 comments

ntrrgc commented Jan 21, 2017

Description

I defined a String in Java like this:

String string = "\uD83D\uDE3A"; // 😺 (U+1F63A)

And passed it to a native function. Inside it I called env->GetStringUTFChars(myJstring, nullptr) and inspected the return value in the Android Studio debugger (screenshot not shown).

According to the docs, the string should be in Modified UTF-8 format, so the cat character should take 6 bytes... but it's taking 4 instead, as in Standard UTF-8.

JNI docs on Modified UTF-8: https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/types.html#modified_utf_8_strings
JNI docs on GetStringUTFChars: https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringUTFChars
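
To make the difference concrete, here is a minimal self-contained C++ sketch (illustrative, not from the report) that encodes U+1F63A both ways: standard UTF-8 encodes the code point directly as one 4-byte sequence, while modified UTF-8 first splits it into the UTF-16 surrogate pair U+D83D / U+DE3A and encodes each surrogate as a separate 3-byte sequence, for 6 bytes total:

#include <cstdint>
#include <cstdio>
#include <vector>

// Standard UTF-8: a code point above U+FFFF becomes one 4-byte sequence.
static std::vector<uint8_t> StandardUtf8(uint32_t cp) {
    return { (uint8_t)(0xF0 | (cp >> 18)),
             (uint8_t)(0x80 | ((cp >> 12) & 0x3F)),
             (uint8_t)(0x80 | ((cp >> 6) & 0x3F)),
             (uint8_t)(0x80 | (cp & 0x3F)) };
}

// Modified UTF-8 (CESU-8 style): encode each UTF-16 surrogate separately
// as a 3-byte sequence, so a supplementary code point takes 6 bytes.
static std::vector<uint8_t> ModifiedUtf8(uint32_t cp) {
    uint32_t v = cp - 0x10000;
    uint16_t units[2] = { (uint16_t)(0xD800 | (v >> 10)),     // high surrogate
                          (uint16_t)(0xDC00 | (v & 0x3FF)) }; // low surrogate
    std::vector<uint8_t> out;
    for (uint16_t u : units) {
        out.push_back((uint8_t)(0xE0 | (u >> 12)));
        out.push_back((uint8_t)(0x80 | ((u >> 6) & 0x3F)));
        out.push_back((uint8_t)(0x80 | (u & 0x3F)));
    }
    return out;
}

int main() {
    for (uint8_t b : StandardUtf8(0x1F63A)) printf("%02X ", b); // F0 9F 98 BA
    printf("\n");
    for (uint8_t b : ModifiedUtf8(0x1F63A)) printf("%02X ", b); // ED A0 BD ED B8 BA
    printf("\n");
}

The 6-byte form is exactly what appears (sign-extended) in the log further down; the 4-byte form is what was observed here.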

Environment Details

Not all of these will be relevant to every bug, but please provide as much
information as you can.

  • NDK Version: 13.1.3345770
  • Build system: CMake
  • Host OS: Arch Linux
  • Compiler: Clang

kneth commented Feb 17, 2017

It could be that the native debugger in Android Studio is showing you the wrong thing. I have written a small C++ function to test it:

#include <cstring>
#include <jni.h>
#include <android/log.h>

#define TAG "FunWithStrings"

extern "C" JNIEXPORT void JNICALL
Java_net_zigzak_funwithstrings_MainActivity_toJNI(JNIEnv *env, jobject instance, jstring str_) {
    const char *str = env->GetStringUTFChars(str_, 0);
    const jint length = env->GetStringLength(str_); // length in UTF-16 code units

    __android_log_print(ANDROID_LOG_DEBUG, TAG, "length = %d / %zu", length, strlen(str));
    for (size_t i = 0; i < strlen(str); ++i) {
        // char is signed here, so bytes >= 0x80 sign-extend to 0xFFFFFFxx in the log.
        __android_log_print(ANDROID_LOG_DEBUG, TAG, "str[%zu] = 0x%X", i, str[i]);
    }

    env->ReleaseStringUTFChars(str_, str);
}

And I am using it from Java:

String emoji = "\uD83D\uDE3A";
toJNI(emoji);

I see the following in the log:

02-17 15:07:16.619 14279-14279/? D/FunWithStrings: length = 2 / 6
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[0] = 0xFFFFFFED
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[1] = 0xFFFFFFA0
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[2] = 0xFFFFFFBD
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[3] = 0xFFFFFFED
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[4] = 0xFFFFFFB8
02-17 15:07:16.619 14279-14279/? D/FunWithStrings: str[5] = 0xFFFFFFBA

Putting a breakpoint in my C++ code, I get about the same picture:

[screenshot: "screen shot 2017-02-17 at 15 03 35" — debugger view at the breakpoint, showing the same six bytes]

By the way, I am using NDK r10e.


ntrrgc commented Feb 17, 2017

> It could be that the native debugger in Android Studio is showing you the wrong thing.

I don't think the native debugger would ever do something like that, no matter how buggy it might be: it's just a char *. Still, to prove it innocent, I ran your function:

02-17 16:04:57.204 5128-5128/? D/FunWithStrings: length = 2 / 4
02-17 16:04:57.204 5128-5128/? D/FunWithStrings: str[0] = 0xFFFFFFF0
02-17 16:04:57.204 5128-5128/? D/FunWithStrings: str[1] = 0xFFFFFF9F
02-17 16:04:57.204 5128-5128/? D/FunWithStrings: str[2] = 0xFFFFFF98
02-17 16:04:57.204 5128-5128/? D/FunWithStrings: str[3] = 0xFFFFFFBA


enh commented Feb 17, 2017

(sorry, i thought i'd commented on this when it was first raised and i asked our Java runtime folks to improve the docs...)

This is actually WAI (working as intended) for Android. Android has always done the standard UTF-8 thing for emoji. It was a long time ago, but iirc they tried both and "standard" broke fewer apps' assumptions than "modified".

i'll ping that team again about getting the docs improved...


enh commented Feb 17, 2017

i've raised internal bug http://b/35469153 to track improving the documentation.


ntrrgc commented Feb 17, 2017

Could you please fix the link? 😕


enh commented Feb 17, 2017

it's internal to google. that link is for our benefit when we're trying to see whether the docs have actually been fixed next time we're pinged on this bug... :-)

kneth commented Feb 20, 2017

@enh Does the conversion depend on locale settings? When I run the test function, I get a length of 6, while @ntrrgc gets 4.


enh commented Feb 20, 2017

i don't think so, though it might differ between platform versions. i don't know what dalvik did, but art should (as i understand it) always use regular UTF-8.
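
For native code that must not depend on which behavior a given runtime implements, a common workaround is to bypass GetStringUTFChars entirely and call String.getBytes("UTF-8") through JNI. A minimal sketch, with exception checks omitted and the helper name ToStandardUtf8 purely illustrative:

#include <jni.h>
#include <string>

// Hypothetical helper (not from this thread): returns the contents of `str`
// as standard UTF-8 by calling String.getBytes("UTF-8") via JNI, so the
// result does not depend on what GetStringUTFChars happens to do.
// ExceptionCheck/ExceptionClear handling is omitted for brevity.
static std::string ToStandardUtf8(JNIEnv *env, jstring str) {
    jclass string_class = env->GetObjectClass(str);
    jmethodID get_bytes = env->GetMethodID(string_class, "getBytes",
                                           "(Ljava/lang/String;)[B");
    jstring charset = env->NewStringUTF("UTF-8");  // ASCII-only, safe here
    jbyteArray bytes =
        (jbyteArray) env->CallObjectMethod(str, get_bytes, charset);
    jsize len = env->GetArrayLength(bytes);
    std::string out((size_t) len, '\0');
    env->GetByteArrayRegion(bytes, 0, len, reinterpret_cast<jbyte *>(&out[0]));
    env->DeleteLocalRef(bytes);
    env->DeleteLocalRef(charset);
    env->DeleteLocalRef(string_class);
    return out;
}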

alexcohn commented Feb 27, 2017

@enh: unfortunately, "has always done the standard UTF-8 thing for emoji" sounds like a quotation from G. Orwell. This commit suggests that until 2014, 4-byte UTF-8 sequences could not work at all, and after it they were internally converted to surrogate pairs. Even today, JNI checks whether |bytes| is valid modified UTF-8 but also accepts 4-byte UTF-8 sequences in place of encoded surrogate pairs.

@kneth: which platform did you use for testing?

kneth commented Feb 27, 2017

I was running on an x86 emulator, API 22.

DanAlbert commented

(doc issue isn't fixed, but that's tracked elsewhere and there's nothing to do from the NDK side)
