Status: Draft
Owner: Immo Landwerth
UTF-8 is currently the most popular encoding on the web, used by more than 95% of all websites, and it is the default encoding on Linux. When .NET was created, many character encodings were in wide use and UTF-8 was not yet dominant, so .NET chose UTF-16. As UTF-8 gained popularity, its memory advantages over other encodings, including UTF-16, became increasingly apparent; those advantages drove its broad adoption.
- Reduces heap allocations - the memory needed to store text can be cut by up to 2x compared to UTF-16
- Improves quality, readability, and maintainability of code - performance-oriented apps frequently store UTF-8 data as byte arrays or work around the platform limitations in other ways (see the sketch below)
- Improves integration with subsystems that are UTF-8 or mostly UTF-8 - for example, on Linux most APIs take UTF-8 strings by convention, so today every API call has to transcode from/to UTF-16, sometimes multiple times
- Competitiveness with other platforms that offer UTF-8 primitives - most popular languages currently offer some form of UTF-8 string support, while .NET only allows transcoding to/from byte arrays
- Reduced CPU time for transcoding between UTF-16 and UTF-8 - many apps receive data as UTF-8 and/or need to produce UTF-8 output; while transcoding is very fast and will not show up high in the performance profile of a typical complex app, it can still add up to a couple of percent of CPU cycles
- On the other hand, when an app performs heavy text processing or makes heavy use of globalization-related features (e.g., a GUI application localized to the user's native language), UTF-16 is often faster than UTF-8 due to implementation details
- For strings that consist primarily of CJK characters, UTF-8 is less memory-efficient than UTF-16
- WinRT and some Windows APIs only accept UTF-16 strings, so integration with them would be worse
- Possible loss of quality, readability, and maintainability in code that strictly requires UTF-16
These conditions aren't expected to occur frequently in web applications or other application types where time to first byte (TTFB) and memory utilization are critical metrics.
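To make the size and transcoding points above concrete, here is a minimal, runnable sketch using only today's APIs (`System.Text.Encoding`); the sample strings are illustrative and the numbers will vary with the text:

```csharp
using System;
using System.Text;

class TranscodingSketch
{
    static void Main()
    {
        // Today, code that needs UTF-8 has to round-trip through byte arrays:
        // UTF-16 string -> UTF-8 bytes -> UTF-16 string.
        string text = "Hello, UTF-8!";
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);          // transcode out
        string roundTripped = Encoding.UTF8.GetString(utf8Bytes); // transcode back
        Console.WriteLine(roundTripped);

        // Size comparison: ASCII-heavy text is about half the size in UTF-8,
        // while CJK-heavy text is typically larger in UTF-8 than in UTF-16.
        string ascii = "The quick brown fox jumps over the lazy dog";
        string cjk = "こんにちは世界";
        Console.WriteLine($"ASCII: UTF-16 = {ascii.Length * 2} bytes, UTF-8 = {Encoding.UTF8.GetByteCount(ascii)} bytes");
        Console.WriteLine($"CJK:   UTF-16 = {cjk.Length * 2} bytes, UTF-8 = {Encoding.UTF8.GetByteCount(cjk)} bytes");
    }
}
```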
- A scenario doc describes where UTF-8 would be valuable
- A scenario doc describes the acceptance criteria - what the desired improvements would be within the use case
- A scenario doc can include code blocks that illustrate the use case
- A scenario doc should not prescribe what the solution should be
To be added
- Better optimizations, for example when enumerating runes
- It's consistent with other ecosystems, though not all of them; e.g., in Go validity of the encoding is a soft requirement
- Better safety and reliability
- For example, placing String/UTF-8 string literals in read-only memory (the read-only section of the PE file) would allow strings to be shared between processes
- For example, zero-terminated UTF-8 strings could be passed directly to many native APIs (see the interop sketch after this list)
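As a hedged illustration of the last point: today, handing a string to native code that expects a zero-terminated UTF-8 `char*` means transcoding and allocating an unmanaged copy at the interop boundary on every call, either manually as below or via `[MarshalAs(UnmanagedType.LPUTF8Str)]`. A zero-terminated UTF-8 string type could hand its bytes to such APIs directly. The native call itself is omitted because the callee would be hypothetical:

```csharp
using System;
using System.Runtime.InteropServices;

class Utf8InteropSketch
{
    static void Main()
    {
        string message = "hello from .NET";

        // Transcode the UTF-16 string into a zero-terminated UTF-8 buffer
        // in unmanaged memory - this happens on every call today.
        IntPtr utf8Ptr = Marshal.StringToCoTaskMemUTF8(message);
        try
        {
            // A real program would pass utf8Ptr to a native API expecting
            // a zero-terminated const char* here.
        }
        finally
        {
            Marshal.FreeCoTaskMem(utf8Ptr);
        }
    }
}
```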
- How can we avoid the confusion of "Should I use UTF-8 String or System.String?"
- Should the framework duplicate APIs (provide UTF-8 and UTF-16 versions)?
- Should System.String/char be redefined to use UTF-8 underneath?
- Should UTF-8 literals be prefixed, e.g. u8"foo", or should a plain "bar" suffice, with the compiler inferring the type (target typing)? A rough sketch follows this list.
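To make the target-typing option easier to picture, here is a rough, runnable sketch. `Utf8String` below is a hypothetical placeholder type defined locally, and the user-defined implicit conversion only approximates the ergonomics; a real language feature would avoid the intermediate UTF-16 string entirely:

```csharp
using System;
using System.Text;

// Hypothetical placeholder type - not a proposal for the actual API shape.
readonly struct Utf8String
{
    private readonly byte[] _bytes;
    public Utf8String(string s) => _bytes = Encoding.UTF8.GetBytes(s);

    // Approximates target typing: the compiler picks this conversion
    // because the target of the assignment is Utf8String.
    public static implicit operator Utf8String(string s) => new Utf8String(s);

    public int ByteLength => _bytes.Length;
}

class LiteralSketch
{
    static void Main()
    {
        // Prefixed option (not valid C# here): Utf8String name = u8"foo";
        // The target-typed option reads like this:
        Utf8String greeting = "foo";
        Console.WriteLine(greeting.ByteLength); // 3
    }
}
```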
To be added
To be added
To be added
To be added