Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions src/coreclr/pal/src/locale/unicode.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -61,14 +61,15 @@ MultiByteToWideChar(
IN int cchWideChar)
{
INT retval =0;
dwFlags |= MINIPAL_TREAT_AS_LITTLE_ENDIAN;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look right. MultiByteToWideChar should use native endian.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas ,
Based on the previous behavior before PR

MultiByteToWideChar called UTF8ToUnicode

https://github.com/dotnet/runtime/blob/v7.0.0/src/coreclr/pal/src/locale/unicode.cpp#L265

where as UTF8ToUnicode called GetChars

https://github.com/dotnet/runtime/blob/v7.0.0/src/coreclr/pal/src/locale/utf8.cpp#L2880

GetChars already did the endian conversions by-default for big endian systems.

https://github.com/dotnet/runtime/blob/v7.0.0/src/coreclr/pal/src/locale/utf8.cpp#L1938-L1949

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The links you have shared show that MultiByteToWideChar used native endian output. It does not make sense to force little endian output by passing MINIPAL_TREAT_AS_LITTLE_ENDIAN flag.

Looking at the whole change, I think you may be trying to change MINIPAL_TREAT_AS_LITTLE_ENDIAN to mean opposite of what its name says.

Copy link
Copy Markdown
Contributor Author

@saitama951 saitama951 May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I'm confused here, I believe the intent of MINIPAL_TREAT_AS_LITTLE_ENDIAN is to treat the data as little endian?

I agree with what you said previously code in the #if BIGENDIAN produces output in native endian.
while the latter should always produce output in little endian always,

I was trying to debug on why eventpipe handler is writing in-correctly into a .nettrace file (LE-only) on big-endian systems (I misunderstood previously)

the raw .nettrace file on big-endian system looks like this

000005f0  00 00 00 00 00 00 00 2f  00 68 00 6d 00 6f 00 61  |......./.h.m.o.a| <----- marker 
00000600  00 73 00 2f 00 65 00 6d  00 61 00 6a 00 6e 00 72  |.s./.e.m.a.j.n.r| <------ marker
00000610  00 6f 00 63 00 2f 00 72  00 6c 00 63 00 65 00 74  |.o.c./.r.l.c.e.t|
00000620  00 6f 00 64 00 2f 00 2d  00 74 00 65 00 6e 00 30  |.o.d./.-.t.e.n.0|
00000630  00 39 00 33 00 73 00 68  00 73 00 2f 00 78 00 64  |.9.3.s.h.s./.x.d|
00000640  00 65 00 72 00 61 00 63  00 69 00 4d 00 2f 00 6f  |.e.r.a.c.i.M./.o|
00000650  00 73 00 6f 00 72 00 4e  00 2e 00 74 00 66 00 6f  |.s.o.r.N...t.f.o|
00000660  00 43 00 54 00 45 00 41  00 2e 00 65 00 72 00 31  |.C.T.E.A...e.r.1|
00000670  00 2f 00 70 00 70 00 2e  00 30 00 2e 00 30 00 65  |./.p.p...0...0.e|
00000680  00 64 00 2d 00 31 00 69  00 6c 00 2f 00 76 00 72  |.d.-.1.i.l./.v.r|
00000690  00 6f 00 63 00 62 00 72  00 6c 00 63 00 65 00 2e  |.o.c.b.r.l.c.e..|
000006a0  00 73 00 6f 00 00 81 03  fc 1c 22 09 00 00 52 00  |.s.o......"...R.|

on the debugger we have this

#0  GetChars (self=0x3ffc9877380, bytes=0x3ffc9877af8 "/home/sanjam/coreclr/dotnet-s390x/shared/Microsoft.NETCore.App/10.0.1-dev/libcoreclr.so", byteCount=87, chars=0x2aa0cbf7d2b, charCount=87)
    at /home/sanjam/runtime/src/native/minipal/utf8.c:1116

#1  0x000003ffacd160fc in minipal_convert_utf8_to_utf16 (source=0x3ffc9877af8 "/home/sanjam/coreclr/dotnet-s390x/shared/Microsoft.NETCore.App/10.0.1-dev/libcoreclr.so", sourceLength=87, destination=0x2aa0cbf7d2b, 
    destinationLength=87, flags=30) at /home/sanjam/runtime/src/native/minipal/utf8.c:2119

#2  0x000003ffacaa16de in g_utf8_to_utf16le_custom_alloc_impl (str=0x3ffc9877af8 "/home/sanjam/coreclr/dotnet-s390x/shared/Microsoft.NETCore.App/10.0.1-dev/libcoreclr.so", len=87, items_read=0x0, items_written=0x0, 
    custom_alloc_func=0x3ffacaa3280 <monoeg_g_fixed_buffer_custom_allocator>, custom_alloc_data=0x3ffc9877628, err=0x0, treatAsLE=true) at /home/sanjam/runtime/src/mono/mono/eglib/giconv.c:260

for example let consider string "e/sa" in "/home/sanjam/coreclr/dotnets390x/shared/Microsoft.NETCore.App/10.0.1-dev/libcoreclr.so"

for this specific code
https://github.com/dotnet/runtime/blob/main/src/native/minipal/utf8.c#L1081-L1125

where as char* pSrc = "e/sa" (truncated for simplicity) = 0x65 0x2f 0x73 0x61
now

int ch = *(int*)pSrc
little - endian                                   big - endian

ch = 0x61732f65                               ch = 0x652f7361
                     *pTarget = (CHAR16_T)(ch & 0x7F);
                    *(pTarget + 1) = (CHAR16_T)((ch >> 8) & 0x7F);
                    *(pTarget + 2) = (CHAR16_T)((ch >> 16) & 0x7F);
                    *(pTarget + 3) = (CHAR16_T)((ch >> 24) & 0x7F);
little-endian                                  big endian 
                                                 
*pTarget	= 65 00    	                *pTarget = 00 61
                                                 
*(pTarget+1) = 2f 00                    *(pTarget+1) = 00 73
                                                 
*(pTarget+2) = 73 00		              *(pTarget+2) = 00 2f
                                                
*(pTarget + 3) = 61 00                  *(pTarget+3) = 00 65

which I think is incorrect on a big endian platform (I think bytes should be in reverse order to comform with the LE format), we can validate this based on the output shown in the raw hex bytes with the marker.

Copy link
Copy Markdown
Member

@jkotas jkotas May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that this does not look right.

I think the code should be doing the following on BE systems for your example:

*pTarget = 00 65                                                 
*(pTarget+1) = 00 2f
*(pTarget+2) = 00 73                                                
*(pTarget+3) = 00 61

Copy link
Copy Markdown
Contributor Author

@saitama951 saitama951 May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't it be this on big endian systems ?

*pTarget = 65 00                                              
*(pTarget+1) = 2f 00
*(pTarget+2) = 73 00                                                
*(pTarget+3) = 61 00

because we are trying to write it down in little-endian utf-16 format. the above one is still in big endian utf-16 format

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas Thank you for you're comments, I will close this and open a new PR with the fixes.


PERF_ENTRY(MultiByteToWideChar);
ENTRY("MultiByteToWideChar(CodePage=%u, dwFlags=%#x, lpMultiByteStr=%p (%s),"
" cbMultiByte=%d, lpWideCharStr=%p, cchWideChar=%d)\n",
CodePage, dwFlags, lpMultiByteStr?lpMultiByteStr:"NULL", lpMultiByteStr?lpMultiByteStr:"NULL",
cbMultiByte, lpWideCharStr, cchWideChar);

if (dwFlags & ~(MB_ERR_INVALID_CHARS | MB_PRECOMPOSED))
if (dwFlags & ~(MB_ERR_INVALID_CHARS | MB_PRECOMPOSED | MINIPAL_TREAT_AS_LITTLE_ENDIAN))
{
ASSERT("Error dwFlags(0x%x) parameter is invalid\n", dwFlags);
SetLastError(ERROR_INVALID_FLAGS);
Expand Down Expand Up @@ -137,6 +138,7 @@ WideCharToMultiByte(
INT retval =0;
char defaultChar = '?';
BOOL usedDefaultChar = FALSE;
dwFlags |= MINIPAL_TREAT_AS_LITTLE_ENDIAN;

PERF_ENTRY(WideCharToMultiByte);
ENTRY("WideCharToMultiByte(CodePage=%u, dwFlags=%#x, lpWideCharStr=%p (%S), "
Expand All @@ -146,7 +148,7 @@ WideCharToMultiByte(
cchWideChar, lpMultiByteStr, cbMultiByte,
lpDefaultChar, lpUsedDefaultChar);

if (dwFlags & ~WC_NO_BEST_FIT_CHARS)
if (dwFlags & ~(WC_NO_BEST_FIT_CHARS | MINIPAL_TREAT_AS_LITTLE_ENDIAN))
{
ERROR("dwFlags %d invalid\n", dwFlags);
SetLastError(ERROR_INVALID_FLAGS);
Expand Down
18 changes: 9 additions & 9 deletions src/native/minipal/utf8.c
Original file line number Diff line number Diff line change
Expand Up @@ -620,14 +620,14 @@ static size_t GetCharCount(UTF8Encoding* self, unsigned char* bytes, size_t coun
LongCodeWithMask32 :
#if BIGENDIAN
// be careful about the sign extension
if (!self->treatAsLE) ch = (int)(((unsigned int)ch) >> 16);
if (self->treatAsLE) ch = (int)(((unsigned int)ch) >> 16);
else
#endif
ch &= 0xFF;

LongCodeWithMask16:
#if BIGENDIAN
if (!self->treatAsLE) ch = (int)(((unsigned int)ch) >> 8);
if (self->treatAsLE) ch = (int)(((unsigned int)ch) >> 8);
else
#endif
ch &= 0xFF;
Expand Down Expand Up @@ -1061,7 +1061,7 @@ static size_t GetChars(UTF8Encoding* self, unsigned char* bytes, size_t byteCoun

// Unfortunately, this is endianness sensitive
#if BIGENDIAN
if (!self->treatAsLE)
if (self->treatAsLE)
{
*pTarget = (CHAR16_T)((ch >> 8) & 0x7F);
pSrc += 2;
Expand Down Expand Up @@ -1093,7 +1093,7 @@ static size_t GetChars(UTF8Encoding* self, unsigned char* bytes, size_t byteCoun

// Unfortunately, this is endianness sensitive
#if BIGENDIAN
if (!self->treatAsLE)
if (self->treatAsLE)
{
*pTarget = (CHAR16_T)((ch >> 24) & 0x7F);
*(pTarget + 1) = (CHAR16_T)((ch >> 16) & 0x7F);
Expand Down Expand Up @@ -1126,14 +1126,14 @@ static size_t GetChars(UTF8Encoding* self, unsigned char* bytes, size_t byteCoun
LongCodeWithMask32 :
#if BIGENDIAN
// be careful about the sign extension
if (!self->treatAsLE) ch = (int)(((unsigned int)ch) >> 16);
if (self->treatAsLE) ch = (int)(((unsigned int)ch) >> 16);
else
#endif
ch &= 0xFF;

LongCodeWithMask16:
#if BIGENDIAN
if (!self->treatAsLE) ch = (int)(((unsigned int)ch) >> 8);
if (self->treatAsLE) ch = (int)(((unsigned int)ch) >> 8);
else
#endif
ch &= 0xFF;
Expand Down Expand Up @@ -1599,7 +1599,7 @@ static size_t GetBytes(UTF8Encoding* self, CHAR16_T* chars, size_t charCount, un

// Unfortunately, this is endianness sensitive
#if BIGENDIAN
if (!self->treatAsLE)
if (self->treatAsLE)
{
*pTarget = (unsigned char)(ch >> 16);
*(pTarget + 1) = (unsigned char)ch;
Expand All @@ -1624,7 +1624,7 @@ static size_t GetBytes(UTF8Encoding* self, CHAR16_T* chars, size_t charCount, un
LongCodeWithMask:
#if BIGENDIAN
// be careful about the sign extension
if (!self->treatAsLE) ch = (int)(((unsigned int)ch) >> 16);
if (self->treatAsLE) ch = (int)(((unsigned int)ch) >> 16);
else
#endif
ch = (CHAR16_T)ch;
Expand Down Expand Up @@ -2002,7 +2002,7 @@ static size_t GetByteCount(UTF8Encoding* self, CHAR16_T *chars, size_t count)
LongCodeWithMask:
#if BIGENDIAN
// be careful about the sign extension
if (!self->treatAsLE) ch = (int)(((unsigned int)ch) >> 16);
if (self->treatAsLE) ch = (int)(((unsigned int)ch) >> 16);
else
#endif
ch = (CHAR16_T)ch;
Expand Down
Loading