Fix utf8 conversions for s390x#128394
Conversation
while adding eventpipe support s390x microsoft/perfview I came across some bugs where string utf8_to_utf16 convertsion file was incorrect in the .nettrace file for s390x.
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
|
cc: @am11 |
|
@Dotnet-s390x build |
|
Build Queued.. To cancel the current build, please comment: |
|
If you haven't already, please run PAL tests to validate coverage https://github.com/dotnet/runtime/blob/3366f6acb14d67cbff4663cc6021a6207582ad65/docs/workflow/testing/coreclr/testing.md#building-pal-tests |
|
Build Failed Build Error Summary |
|
The CI threw the following error. now ilasm uses this runtime/src/coreclr/ilasm/main.cpp Line 226 in 48a9212 https://github.com/dotnet/runtime/blob/main/src/coreclr/pal/src/locale/unicode.cpp#L184 |
|
with all the fixes in place clr.PalTests pass successfully. |
|
@Dotnet-s390x build |
|
Build Queued.. To cancel the current build, please comment: |
| IN int cchWideChar) | ||
| { | ||
| INT retval =0; | ||
| dwFlags |= MINIPAL_TREAT_AS_LITTLE_ENDIAN; |
There was a problem hiding this comment.
This does not look right. MultiByteToWideChar should use native endian.
There was a problem hiding this comment.
@jkotas ,
Based on the previous behavior before PR
MultiByteToWideChar called UTF8ToUnicode
https://github.com/dotnet/runtime/blob/v7.0.0/src/coreclr/pal/src/locale/unicode.cpp#L265
where as UTF8ToUnicode called GetChars
https://github.com/dotnet/runtime/blob/v7.0.0/src/coreclr/pal/src/locale/utf8.cpp#L2880
GetChars already did the endian conversions by-default for big endian systems.
https://github.com/dotnet/runtime/blob/v7.0.0/src/coreclr/pal/src/locale/utf8.cpp#L1938-L1949
There was a problem hiding this comment.
The links you have shared show that MultiByteToWideChar used native endian output. It does not make sense to force little endian output by passing MINIPAL_TREAT_AS_LITTLE_ENDIAN flag.
Looking at the whole change, I think you may be trying to change MINIPAL_TREAT_AS_LITTLE_ENDIAN to mean opposite of what its name says.
There was a problem hiding this comment.
I believe I'm confused here, I believe the intent of MINIPAL_TREAT_AS_LITTLE_ENDIAN is to treat the data as little endian?
I agree with what you said previously code in the #if BIGENDIAN produces output in native endian.
while the latter should always produce output in little endian always,
I was trying to debug on why eventpipe handler is writing in-correctly into a .nettrace file (LE-only) on big-endian systems (I misunderstood previously)
the raw .nettrace file on big-endian system looks like this
000005f0 00 00 00 00 00 00 00 2f 00 68 00 6d 00 6f 00 61 |......./.h.m.o.a| <----- marker
00000600 00 73 00 2f 00 65 00 6d 00 61 00 6a 00 6e 00 72 |.s./.e.m.a.j.n.r| <------ marker
00000610 00 6f 00 63 00 2f 00 72 00 6c 00 63 00 65 00 74 |.o.c./.r.l.c.e.t|
00000620 00 6f 00 64 00 2f 00 2d 00 74 00 65 00 6e 00 30 |.o.d./.-.t.e.n.0|
00000630 00 39 00 33 00 73 00 68 00 73 00 2f 00 78 00 64 |.9.3.s.h.s./.x.d|
00000640 00 65 00 72 00 61 00 63 00 69 00 4d 00 2f 00 6f |.e.r.a.c.i.M./.o|
00000650 00 73 00 6f 00 72 00 4e 00 2e 00 74 00 66 00 6f |.s.o.r.N...t.f.o|
00000660 00 43 00 54 00 45 00 41 00 2e 00 65 00 72 00 31 |.C.T.E.A...e.r.1|
00000670 00 2f 00 70 00 70 00 2e 00 30 00 2e 00 30 00 65 |./.p.p...0...0.e|
00000680 00 64 00 2d 00 31 00 69 00 6c 00 2f 00 76 00 72 |.d.-.1.i.l./.v.r|
00000690 00 6f 00 63 00 62 00 72 00 6c 00 63 00 65 00 2e |.o.c.b.r.l.c.e..|
000006a0 00 73 00 6f 00 00 81 03 fc 1c 22 09 00 00 52 00 |.s.o......"...R.|
on the debugger we have this
#0 GetChars (self=0x3ffc9877380, bytes=0x3ffc9877af8 "/home/sanjam/coreclr/dotnet-s390x/shared/Microsoft.NETCore.App/10.0.1-dev/libcoreclr.so", byteCount=87, chars=0x2aa0cbf7d2b, charCount=87)
at /home/sanjam/runtime/src/native/minipal/utf8.c:1116
#1 0x000003ffacd160fc in minipal_convert_utf8_to_utf16 (source=0x3ffc9877af8 "/home/sanjam/coreclr/dotnet-s390x/shared/Microsoft.NETCore.App/10.0.1-dev/libcoreclr.so", sourceLength=87, destination=0x2aa0cbf7d2b,
destinationLength=87, flags=30) at /home/sanjam/runtime/src/native/minipal/utf8.c:2119
#2 0x000003ffacaa16de in g_utf8_to_utf16le_custom_alloc_impl (str=0x3ffc9877af8 "/home/sanjam/coreclr/dotnet-s390x/shared/Microsoft.NETCore.App/10.0.1-dev/libcoreclr.so", len=87, items_read=0x0, items_written=0x0,
custom_alloc_func=0x3ffacaa3280 <monoeg_g_fixed_buffer_custom_allocator>, custom_alloc_data=0x3ffc9877628, err=0x0, treatAsLE=true) at /home/sanjam/runtime/src/mono/mono/eglib/giconv.c:260
for example let consider string "e/sa" in "/home/sanjam/coreclr/dotnets390x/shared/Microsoft.NETCore.App/10.0.1-dev/libcoreclr.so"
for this specific code
https://github.com/dotnet/runtime/blob/main/src/native/minipal/utf8.c#L1081-L1125
where as char* pSrc = "e/sa" (truncated for simplicity) = 0x65 0x2f 0x73 0x61
now
int ch = *(int*)pSrc
little - endian big - endian
ch = 0x61732f65 ch = 0x652f7361
*pTarget = (CHAR16_T)(ch & 0x7F);
*(pTarget + 1) = (CHAR16_T)((ch >> 8) & 0x7F);
*(pTarget + 2) = (CHAR16_T)((ch >> 16) & 0x7F);
*(pTarget + 3) = (CHAR16_T)((ch >> 24) & 0x7F);
little-endian big endian
*pTarget = 65 00 *pTarget = 00 61
*(pTarget+1) = 2f 00 *(pTarget+1) = 00 73
*(pTarget+2) = 73 00 *(pTarget+2) = 00 2f
*(pTarget + 3) = 61 00 *(pTarget+3) = 00 65
which I think is incorrect on a big endian platform (I think bytes should be in reverse order to comform with the LE format), we can validate this based on the output shown in the raw hex bytes with the marker.
There was a problem hiding this comment.
I agree with you that this does not look right.
I think the code should be doing the following on BE systems for your example:
*pTarget = 00 65
*(pTarget+1) = 00 2f
*(pTarget+2) = 00 73
*(pTarget+3) = 00 61
There was a problem hiding this comment.
shouldn't it be this on big endian systems ?
*pTarget = 65 00
*(pTarget+1) = 2f 00
*(pTarget+2) = 73 00
*(pTarget+3) = 61 00
because we are trying to write it down in little-endian utf-16 format. the above one is still in big endian utf-16 format
There was a problem hiding this comment.
@jkotas Thank you for you're comments, I will close this and open a new PR with the fixes.
|
Build Failed Build Error Summary |
|
@Dotnet-s390x build |
|
Build Queued.. To cancel the current build, please comment: |
|
Build Failed Build Error Summary |
while adding eventpipe support for s390x in microsoft/perfview I came across some bugs where string utf8_to_utf16 convertsion file was incorrect in the .nettrace file for s390x.
here we set the MINIPAL_TREAT_AS_LITTLE_ENDIAN flags
runtime/src/native/eventpipe/ep-string.c
Lines 13 to 41 in 99f87de
and here we invalidate the flag even if the flag is set.
https://github.com/dotnet/runtime/blob/main/src/native/minipal/utf8.c#L1095-L1110