Depend on Erlang for computing grapheme clusters #11024
Erlang has shipped its own embedded copy of the Unicode database for a couple of releases, and this change makes Elixir depend on it to compute grapheme clusters. The Erlang implementation was up to 2x faster on low codepoints (such as latin1), while the Elixir one could be up to 3x faster on high codepoints (such as emoji), so overall the performance is roughly the same. As a benefit, we no longer need to ship our own copy of the grapheme cluster algorithm, which took up to 250kB on disk and more than 15 seconds to compile. Note that we still keep our own String downcase and upcase algorithms, as our version is considerably more efficient in all cases (more than 5x faster) because it works exclusively with binaries.
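To illustrate what depending on Erlang means here, the following is a minimal sketch of walking grapheme clusters with `:unicode_util.gc/1`. The module and function names (`GraphemeSketch`, `graphemes/1`) are hypothetical and are not the code from this PR; the sketch only assumes the documented return shapes of `:unicode_util.gc/1` (`[gc | rest]`, `[]`, or `{:error, rest}`).

```elixir
defmodule GraphemeSketch do
  # Hypothetical helper for illustration only, not the implementation in this PR.
  def graphemes(string) when is_binary(string) do
    do_graphemes(:unicode_util.gc(string), [])
  end

  # A cluster is either a single codepoint or a list of codepoints.
  # Wrapping it in a list makes it valid chardata in both cases.
  defp do_graphemes([gc | rest], acc) do
    do_graphemes(:unicode_util.gc(rest), [:unicode.characters_to_binary([gc]) | acc])
  end

  # End of input.
  defp do_graphemes([], acc), do: Enum.reverse(acc)

  # Invalid UTF-8: emit the offending byte on its own and keep going.
  defp do_graphemes({:error, <<byte, rest::bitstring>>}, acc) do
    do_graphemes(:unicode_util.gc(rest), [<<byte>> | acc])
  end
end

IO.inspect(GraphemeSketch.graphemes("e\u0301x"))
```

Here `"e\u0301x"` is an `e` followed by a combining acute accent and an `x`, so the accent stays attached to the `e` as a single cluster.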
# 4. Update String.Unicode.version/0 and on String module docs (version and link)
# 5. make unicode

data_path = Path.join(__DIR__, "UnicodeData.txt")
I have moved properties.ex to this file and renamed String.Unicode to String.Casing. Other than that, the contents are strictly the same.
The other way around, right?
Correct, renamed Casing to Unicode.
lib/elixir/lib/string.ex
Outdated
defp do_last([gc | rest], _), do: do_last(:unicode_util.gc(rest), gc)
defp do_last([], acc) when is_binary(acc), do: acc
defp do_last([], acc), do: :unicode.characters_to_binary([acc])
defp do_last({:error, <<byte, rest::bitstring>>}, _), do: do_last(rest, <<byte>>)
- defp do_last({:error, <<byte, rest::bitstring>>}, _), do: do_last(rest, <<byte>>)
+ defp do_last({:error, <<byte, rest::bitstring>>}, _), do: do_last(:unicode_util.gc(rest), <<byte>>)
For example, String.last(<<?a, 200, ?a>>) #=> "a"
Awesome find. Fixed and regression test added.
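For readers following the thread, here is a self-contained sketch of the corrected clauses, with the reviewer's example as a check. The `LastSketch` module, its `last/1` entry point, and the `nil` clause for empty input are assumptions made so the snippet runs on its own; only the four `do_last` clauses come from the diff and the suggestion in this review.

```elixir
defmodule LastSketch do
  # Hypothetical stand-in for String.last/1 (assumed entry point).
  def last(string) when is_binary(string), do: do_last(:unicode_util.gc(string), nil)

  defp do_last([gc | rest], _), do: do_last(:unicode_util.gc(rest), gc)
  # Assumed clause: an empty string has no last grapheme.
  defp do_last([], nil), do: nil
  defp do_last([], acc) when is_binary(acc), do: acc
  defp do_last([], acc), do: :unicode.characters_to_binary([acc])

  # The fix under review: recurse on :unicode_util.gc(rest) rather than on the
  # raw binary, which matches none of the clauses above.
  defp do_last({:error, <<byte, rest::bitstring>>}, _),
    do: do_last(:unicode_util.gc(rest), <<byte>>)
end

# The reviewer's example: an invalid byte in the middle of the string.
IO.inspect(LastSketch.last(<<?a, 200, ?a>>))
```

With the original clause, the call after the invalid byte would receive the binary `"a"` directly, so no `do_last` clause would apply; with the suggestion, the remaining input goes through `:unicode_util.gc/1` first.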
💚 💙 💜 💛 ❤️