Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "sim: Specify -fshort-wchar as NuttX wchar_t is 16-bit" #4754

Merged
merged 2 commits into from
Dec 9, 2021

Conversation

xiaoxiang781216
Copy link
Contributor

@xiaoxiang781216 xiaoxiang781216 commented Oct 31, 2021

Summary

It's better to apply the default compiler option to improve the compatibility
This reverts commit 3fc06ff.
Here is the output from x64 gcc:

gcc -dM -E - < /dev/null | grep -i wchar
#define __WCHAR_MAX__ 0x7fffffff
#define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1)
#define __WCHAR_WIDTH__ 32
#define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2
#define __WCHAR_TYPE__ int
#define __SIZEOF_WCHAR_T__ 4

And arm gcc:

arm-none-eabi-gcc -dM -E - < /dev/null | grep -i wchar
#define __ARM_SIZEOF_WCHAR_T 4
#define __WCHAR_MAX__ 0xffffffffU
#define __WCHAR_MIN__ 0U
#define __WCHAR_WIDTH__ 32
#define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 1
#define __WCHAR_TYPE__ unsigned int
#define __SIZEOF_WCHAR_T__ 4

Impact

sim

Testing

pass CI and ostest

@yamt
Copy link
Contributor

yamt commented Oct 31, 2021

It's better to apply the default compiler option to improve the compatibility

compatibility with what?

@xiaoxiang781216
Copy link
Contributor Author

xiaoxiang781216 commented Nov 1, 2021

It's better to apply the default compiler option to improve the compatibility

compatibility with what?

3rd party application, sorry I can't disclose the name publicly.
The major reason to revert this patch is that it's wrong to make the different size of wchar_t between sim and real device(e.g. arm) as I list in the summary.

@xiaoxiang781216
Copy link
Contributor Author

xiaoxiang781216 commented Nov 1, 2021

From https://pubs.opengroup.org/onlinepubs/007908799/xsh/stddef.h.html:

wchar_t
Integral type whose range of values can represent distinct wide-character codes for all members of the largest character set specified among the locales supported by the compilation environment: the null character has the code value 0 and each member of the Portable Character Set has a code value equal to its value when used as the lone character in an integer character constant.

wchar_t require to save all possible character encoding.
And from https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html:

Data type: wchar_t
This data type is used as the base type for wide character strings. In other words, arrays of objects of this type are the equivalent of char[] for multibyte character strings. The type is defined in stddef.h.

The ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation. It only requires that this type is capable of storing all elements of the basic character set. Therefore it would be legitimate to define wchar_t as char, which might make sense for embedded systems.

But in the GNU C Library wchar_t is always 32 bits wide and, therefore, capable of representing all UCS-4 values and, therefore, covering all of ISO 10646. Some Unix systems define wchar_t as a 16-bit type and thereby follow Unicode very strictly. This definition is perfectly fine with the standard, but it also means that to represent all characters from Unicode and ISO 10646 one has to use UTF-16 surrogate characters, which is in fact a multi-wide-character encoding. But resorting to multi-wide-character encoding contradicts the purpose of the wchar_t type.

To cover all possible Unicode encoding, wchar_t require at least 4 bytes.

So, from the standard perspective, this patch should be revert too.

@yamt
Copy link
Contributor

yamt commented Nov 1, 2021

It's better to apply the default compiler option to improve the compatibility

compatibility with what?

3rd party application, sorry I can't disclose the name publicly.

what assumption does the app make?

The major reason to revert this patch is that it's wrong to make the different size of wchar_t between sim and real device(e.g. arm) as I list in the summary.

how about using -fshort-wchar for arm then?

@yamt
Copy link
Contributor

yamt commented Nov 1, 2021

From https://pubs.opengroup.org/onlinepubs/007908799/xsh/stddef.h.html:

wchar_t
Integral type whose range of values can represent distinct wide-character codes for all members of the largest character set specified among the locales supported by the compilation environment: the null character has the code value 0 and each member of the Portable Character Set has a code value equal to its value when used as the lone character in an integer character constant.

wchar_t require to save all possible character encoding. And from https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html:

Data type: wchar_t
This data type is used as the base type for wide character strings. In other words, arrays of objects of this type are the equivalent of char[] for multibyte character strings. The type is defined in stddef.h.

The ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation. It only requires that this type is capable of storing all elements of the basic character set. Therefore it would be legitimate to define wchar_t as char, which might make sense for embedded systems.

But in the GNU C Library wchar_t is always 32 bits wide and, therefore, capable of representing all UCS-4 values and, therefore, covering all of ISO 10646. Some Unix systems define wchar_t as a 16-bit type and thereby follow Unicode very strictly. This definition is perfectly fine with the standard, but it also means that to represent all characters from Unicode and ISO 10646 one has to use UTF-16 surrogate characters, which is in fact a multi-wide-character encoding. But resorting to multi-wide-character encoding contradicts the purpose of the wchar_t type.

To cover all possible Unicode encoding, wchar_t require at least 4 bytes.

So, from the standard perspective, this patch should be revert too.

no standard mandates unicode or how it's implemented as far as i know.
__STDC_ISO_10646__ is optional.
i feel it makes sense only if we are going to implement unicode wchar_t. are we?

@yamt
Copy link
Contributor

yamt commented Nov 1, 2021

anyway, if we are going to revert this, don't forget to update this. https://github.com/apache/incubator-nuttx/blob/b58379b7380dcef3346bf9ff1ac76d40dc094095/include/sys/types.h#L170

@xiaoxiang781216
Copy link
Contributor Author

It's better to apply the default compiler option to improve the compatibility

compatibility with what?

3rd party application, sorry I can't disclose the name publicly.

what assumption does the app make?

It assume wchar_t is four bytes and then has the enough space to encode UCS4 encoding.

The major reason to revert this patch is that it's wrong to make the different size of wchar_t between sim and real device(e.g. arm) as I list in the summary.

how about using -fshort-wchar for arm then?

short will make it's very hard to process surrogate pair(https://en.wikipedia.org/wiki/UTF-16) as I mention in the last reply. So I don't see there is any benefit to add the seldom used flag(-fshort-wchar), could you explain why you stick with this flag?

@yamt
Copy link
Contributor

yamt commented Nov 1, 2021

It's better to apply the default compiler option to improve the compatibility

compatibility with what?

3rd party application, sorry I can't disclose the name publicly.

what assumption does the app make?

It assume wchar_t is four bytes and then has the enough space to encode UCS4 encoding.

The major reason to revert this patch is that it's wrong to make the different size of wchar_t between sim and real device(e.g. arm) as I list in the summary.

how about using -fshort-wchar for arm then?

short will make it's very hard to process surrogate pair(https://en.wikipedia.org/wiki/UTF-16) as I mention in the last reply. So I don't see there is any benefit to add the seldom used flag(-fshort-wchar), could you explain why you stick with this flag?

actually i don't insist.
nuttx wchar_t has been 16 bit. (i don't know why.)
i just haven't seen a good reason to change it, especially when the change increases memory usage.

@yamt
Copy link
Contributor

yamt commented Nov 1, 2021

It's better to apply the default compiler option to improve the compatibility

compatibility with what?

3rd party application, sorry I can't disclose the name publicly.

what assumption does the app make?

It assume wchar_t is four bytes and then has the enough space to encode UCS4 encoding.

does it assume __STDC_ISO_10646__?

@xiaoxiang781216
Copy link
Contributor Author

xiaoxiang781216 commented Nov 1, 2021

no standard mandates unicode or how it's implemented as far as i know.

unicode is accepted by many countries as a standard to represent the character worldwide. If we want to support language other English, Unicode is the best option.

__STDC_ISO_10646__ is optional. i feel it makes sense only if we are going to implement unicode wchar_t. are we?

But if we want to support wchar_t, it's better to follow __STDC_ISO_10646__.

anyway, if we are going to revert this, don't forget to update this.

https://github.com/apache/incubator-nuttx/blob/b58379b7380dcef3346bf9ff1ac76d40dc094095/include/sys/types.h#L170

Fix here: 339e54b

actually i don't insist. nuttx wchar_t has been 16 bit. (i don't know why.) i just haven't seen a good reason to change it, especially when the change increases memory usage.

The size increasement should be small because:

  1. The code which use wchar_t is very rare.
  2. wchar_t is builtin type in C++ which mean it is already 4 bytes

@yamt
Copy link
Contributor

yamt commented Nov 1, 2021

if your app is just abusing wchar_t where it should use uint32_t, maybe this PR is enough.
otherwise, you likely need a proper implementation of mbrtowc etc.
which is the case?

@xiaoxiang781216
Copy link
Contributor Author

if your app is just abusing wchar_t where it should use uint32_t, maybe this PR is enough. otherwise, you likely need a proper implementation of mbrtowc etc. which is the case?

Yes, we plan to implement mbrtowc for UTF8 encoding.

It's better to apply the default compiler option to improve the compatibility
This reverts commit 3fc06ff.
@xiaoxiang781216
Copy link
Contributor Author

xiaoxiang781216 commented Dec 8, 2021

@yamt please merge this PR if you don't have more concern. #4881 depends on it.

include/sys/types.h Outdated Show resolved Hide resolved
Signed-off-by: Xiang Xiao <xiaoxiang@xiaomi.com>
Copy link
Contributor

@yamt yamt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@yamt yamt merged commit 6357523 into apache:master Dec 9, 2021
@masayuki2009
Copy link
Contributor

@xiaoxiang781216

Hmm, the following commit has a problem with uSD card on Spresense.
If I revert the commit locally, it works fine.

nsh> df
  Block    Number
  Size     Blocks       Used   Available Mounted on
 16384     971908       5718      966190 /mnt/sd0
  4096       1024          8        1016 /mnt/spif
     0          0          0           0 /proc
nsh> ls -l /mnt/sd0
/mnt/sd0:
nsh: ls: stat failed: 22

commit 6357523
Author: Xiang Xiao xiaoxiang@xiaomi.com
Date: Mon Nov 1 12:40:51 2021 +0800

arch: Add _wchar_t typedef like other basic types

Signed-off-by: Xiang Xiao <xiaoxiang@xiaomi.com>

@masayuki2009
Copy link
Contributor

masayuki2009 commented Dec 9, 2021

@xiaoxiang781216

Hmm, the following commit has a problem with uSD card on Spresense.

@xiaoxiang781216
stm32f4discovery with uSD over SPI has the same problem too.

@xiaoxiang781216
Copy link
Contributor Author

Ok, we will take a look tonight.

@masayuki2009
Copy link
Contributor

masayuki2009 commented Dec 10, 2021

Hmm, the following commit has a problem with uSD card on Spresense.
If I revert the commit locally, it works fine.

@xiaoxiang781216
Because the nuttx/fs/fat uses wchar_t inside, the side effect happened with this PR.
If I added -fshort-wchar to the compile options, it worked again.

However, there are many linker warnings.

arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/thumb/v7e-m+fp/hard/libgcc.a(_udivmoddi4.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/thumb/v7e-m+fp/hard/libgcc.a(_fixunsdfdi.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-w_pow.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-e_pow.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-e_sqrt.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_fabs.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_finite.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_lib_ver.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_nan.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_rint.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_scalbn.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail

@yamt
Copy link
Contributor

yamt commented Dec 10, 2021

Hmm, the following commit has a problem with uSD card on Spresense.
If I revert the commit locally, it works fine.

@xiaoxiang781216 Because the nuttx/fs/fat uses wchar_t inside, the side effect happened with this PR. If I added -fshort-wchar to the compile options, it worked again.

However, there are many linker warnings.

arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/thumb/v7e-m+fp/hard/libgcc.a(_udivmoddi4.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/thumb/v7e-m+fp/hard/libgcc.a(_fixunsdfdi.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-w_pow.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-e_pow.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-e_sqrt.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_fabs.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_finite.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_lib_ver.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_nan.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_rint.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail
arm-none-eabi-ld: warning: /opt/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libm.a(lib_a-s_scalbn.o) uses 4-byte wchar_t yet the output is to use 2-byte wchar_t; use of wchar_t values across objects may fail

i guess nuttx/fs/fat should use uint16_t instead.
using wchar_t to mean particular encoding (utf16, utf32, or whatever) is a bad practice.

@masayuki2009
Copy link
Contributor

i guess nuttx/fs/fat should use uint16_t instead.
using wchar_t to mean particular encoding (utf16, utf32, or whatever) is a bad practice.

@yamt
Thanks for the comments.
As far as I tried your advice, it worked.
I will create a new PR later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants