UTF-16 string literals #14670

Closed
straight-shoota opened this issue Jun 6, 2024 · 14 comments · Fixed by #14676

Comments

@straight-shoota
Member

straight-shoota commented Jun 6, 2024

When working with Windows APIs, it's common that we need UTF-16 strings (instead of Crystal's String which is UTF-8).
String#to_utf16 is available for conversion.

But most use cases of this method in stdlib are actually for string literals (e.g. "Content Type".to_utf16). This is wasteful because the string transformation happens at runtime, while it could happen entirely at compile time, avoiding extra computation and allocation.

A particularly intricate use case is in #14659 where we must not allocate at all. So it ends up using the following mechanism to achieve compile-time conversion: UInt16.static_array({% for chr in "CRYSTAL_TRACE".chars %}{{chr.ord}}, {% end %} 0).

This certainly works, at least for this limited use case. But it fails for code points outside the Basic Multilingual Plane. So it's not a generic solution.
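To make the limitation concrete, here is a small runtime sketch (not code from #14659) that compares per-character `ord` values with the code units produced by `String#to_utf16`. The sample string and variable names are made up for illustration.

```crystal
# Illustration only: the string and names below are hypothetical.
str = "a😐"

# Code points, one per character, as a per-char `ord` loop would emit them.
code_points = str.chars.map(&.ord)

# Actual UTF-16 code units as produced by the stdlib conversion.
code_units = str.to_utf16

code_points.each do |cp|
  fits = cp <= UInt16::MAX
  puts "U+#{cp.to_s(16).upcase}: fits in UInt16? #{fits}"
end

puts "code points: #{code_points.size}, UTF-16 code units: #{code_units.size}"
```

The second character lies outside the Basic Multilingual Plane, so its code point does not fit into a single `UInt16` and the stdlib encodes it as two code units instead.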

It would be nice if we had an easy tool for creating UTF-16 encoded strings.

Maybe the conversion algorithm from String#to_utf16 could be implemented as a macro method? It's a bit complex, but not too much. I don't think we can explicitly do math operations on 16-bit integers in the macro language, though.

An alternative would be to expose a compiler primitive for UTF-16 conversion.

Related: #2886

@BlobCodes
Contributor

BlobCodes commented Jun 6, 2024

Maybe the conversion algorithm from String#to_utf16 could be implemented as a macro method? It's a bit complex, but not too much. I don't think we can explicitly do math operations on 16-bit integers in the macro language, though.

Simply porting #14671 works fine, explicit math operations on 16-bit integers are not needed.

class String
  macro utf16_literal(data)
    {%
      arr = [] of NumberLiteral
      data.chars.each do |c|
        c = c.ord
        if c < 0x1_0000
          arr << c
        else
          c -= 0x1_0000
          arr << 0xd800 + ((c >> 10) & 0x3ff)
          arr << 0xdc00 + (c & 0x3ff)
        end
      end
      arr << 0
    %}
    Slice(UInt16).literal({{arr.splat}})[0, {{arr.size - 1}}]
  end
end

s = String.utf16_literal("TEST 😐🐙 ±∀ の")
# => Slice[84, 69, 83, 84, 32, 55357, 56848, 55357, 56345, 32, 177, 8704, 32, 12398]

String.from_utf16(s)
# => "TEST 😐🐙 ±∀ の"

Encoding 10000 characters takes around 300ms.
That's certainly not fast, but probably good enough.

EDIT: Added a final 0 byte
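The surrogate-pair arithmetic in the macro's `else` branch can be checked by hand at runtime. The following is a minimal sketch; the helper name `surrogate_pair` is hypothetical and only mirrors the macro's math.

```crystal
# Hypothetical helper mirroring the macro's `else` branch at runtime.
# Only valid for code points >= 0x1_0000 (outside the BMP).
def surrogate_pair(code_point : Int32) : {UInt16, UInt16}
  offset = code_point - 0x1_0000
  high = 0xd800 + ((offset >> 10) & 0x3ff) # top 10 bits
  low = 0xdc00 + (offset & 0x3ff)          # bottom 10 bits
  {high.to_u16, low.to_u16}
end

high, low = surrogate_pair('😐'.ord)
puts high # => 55357
puts low  # => 56848

# Cross-check against the stdlib conversion.
units = "😐".to_utf16
puts units[0] == high && units[1] == low # => true
```

These two values match the `55357, 56848` pair in the macro output shown above.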

@straight-shoota
Member Author

straight-shoota commented Jun 6, 2024

Looks like a winner, then 🚀

That's certainly not fast, but probably good enough

Yeah, this is mainly for relatively short strings, so performance should not be an issue.
We can always push it up into the compiler if the need arises.

Btw. CharLiteral#ord was only added in 1.11 (#13910), so this wouldn't have been possible before.

@straight-shoota
Member Author

In order to make it actually static data, we'd also need a slice literal (#2886).

@BlobCodes
Contributor

The version from my comment uses the literals from #13716, so it is static data in this case.
It is still an experimental API, though.
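As a rough illustration of what that looks like without the macro, the sketch below builds a null-terminated UTF-16 slice by hand with `Slice(UInt16).literal`, the same call the macro above expands to. It assumes a compiler version that ships this experimental API; the code units are written out manually for the string "Hi".

```crystal
# Assumes the experimental Slice.literal API is available in the compiler.
# Code units for "Hi" plus a trailing 0, which is then cut off from the
# slice's visible size.
units = Slice(UInt16).literal(72, 105, 0)[0, 2]

puts units.size                # => 2
puts String.from_utf16(units)  # => Hi
```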

@stakach
Contributor

stakach commented Jun 6, 2024

Worth noting that Windows supports UTF8 now and encourages use of those APIs

https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page#-a-vs--w-apis

So the conversions could be avoided entirely

@straight-shoota
Member Author

So the conversions could be avoided entirely

Would be nice. But I believe we're quite a bit away from that. The Windows ecosystem is huge and it has 30 years of wide chars in it.

@ysbaddaden
Contributor

ysbaddaden commented Jun 7, 2024

@straight-shoota this is reusing the "old" ANSI API to use the UTF-8 codepage, so it might just work 🤷

It took me a while to find this: the above link explains how to set the Active Code Page (ACP) to UTF-8, which requires a manifest and calling an EXE to "add the manifest" to an executable. For that executable, the ANSI variant of the Windows API will then use UTF-8.

That being said, it requires Windows 10 v1903 (2019) and GDI applications won't support it unless the user activates a beta setting.

@ysbaddaden
Contributor

ysbaddaden commented Jun 7, 2024

The macro is nice, but if we want to eventually have the compiler optimize it, maybe we could just expose the String.to_utf16 to macros directly? For example {{ "CRYSTAL_TRACE".to_utf16 }} would be lovely & fast.
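A minimal sketch of how that could be used, assuming a compiler that provides a `StringLiteral#to_utf16` macro method as proposed here (compilers without it will reject this code). The variable name is made up for illustration.

```crystal
# Assumption: the compiler implements the macro method StringLiteral#to_utf16
# and it expands to an expression yielding a Slice(UInt16).
trace_name = {{ "CRYSTAL_TRACE".to_utf16 }}

puts trace_name.size                 # number of UTF-16 code units
puts String.from_utf16(trace_name)   # round-trips to the original text
```

The conversion happens during macro expansion, so no call to `String#to_utf16` remains in the compiled program.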

@straight-shoota
Member Author

Hm, that's an interesting idea. Exposing StringLiteral#to_utf16 would certainly have the benefit that the resulting literal is easily available in macro land.
I like that it's exactly identical to the runtime version, but inside a macro expansion, which makes it clear that this happens at compile time.

FTR: Eventual compiler optimization would also be possible with String.utf16_literal. We could turn this macro into a primitive later.

Let's focus on UTF-16 string literals here and continue the discussion about UTF-8 support on Win32 in a different issue. I'm pretty sure we won't lose all use cases for UTF-16 string literals overnight, so this will still be useful.

@ysbaddaden
Contributor

The difficulty in implementing StringLiteral#to_utf16 is that there is no SliceLiteral, so we would have to generate a Slice(UInt16).literal(..., 0) call, and I have no idea how to achieve that.

@BlobCodes
Contributor

BlobCodes commented Jun 7, 2024

The difficulty to implement StringLiteral#to_utf16 is that there is no SliceLiteral and we should generate a Slice(UInt16).literal(..., 0) and I have no idea how to achieve that.

It could return ArrayLiteral(NumberLiteral) (or Call(@receiver=Generic(@name=Slice, @type_vars=[UInt16]) @name="literal", @args=[0, 1, 2, 3, 4, 5, ...]))


Btw I just tested the performance of my macro code a bit more.
Simply replacing the line {{ arr.splat }} with {% arr.splat %} 0 (so the resulting splat is not parsed) improves the runtime of encoding 10000 characters from ~300ms to ~20ms.

The macro language actually isn't that slow - the parser is.

Implementing StringLiteral#to_utf16 wouldn't improve performance in a perceivable manner since it would only remove <10% of the runtime.

Maybe there should be a way to create AST nodes directly inside the macro language, so we don't have to parse everything again.

@stakach
Contributor

stakach commented Jun 7, 2024

GDI applications won't support it unless the user activates a beta setting.

You can activate the code pages in code, this is how applications like MS Edge browser run.
MS Edge being a react native app, so runs using JS and UTF8 (although Microsoft is removing react)

ysbaddaden added a commit to ysbaddaden/crystal that referenced this issue Jun 8, 2024
@straight-shoota
Member Author

Do we want to proceed with StringLiteral#to_utf16 then?
I think it's more elegant than String#utf16_literal (#14670 (comment)). And apparently more performant (#14676 (comment)).

@bcardiff
Member

I like StringLiteral#to_utf16, and if doing that means we end up with a SliceLiteral, even one without first-class syntax yet, it would be a double win, because embedding resources could then leverage a similar StringLiteral#to_slice at compile time.

5 participants