Depend on Erlang for computing grapheme clusters #11024

Merged: josevalim merged 6 commits into master from jv-erlang-unicode on Jun 1, 2021

josevalim
Member

Erlang has shipped with its own embedding of the Unicode
Character Database for a couple of releases, and this release
changes Elixir to depend on it in order to compute grapheme
clusters.
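
As a rough sketch of what depending on Erlang looks like, the module below walks a binary one grapheme cluster at a time with `:unicode_util.gc/1`. `GraphemeSketch` and its function names are hypothetical, and the return shapes (`[gc | rest]`, `[]`, `{:error, binary}`) are assumed from the clauses quoted later in this review rather than from documented API, since `:unicode_util` is an internal OTP module.

```elixir
defmodule GraphemeSketch do
  # Hypothetical helper, not the code from this pull request.
  # Splits a binary into grapheme clusters using Erlang's :unicode_util.gc/1.
  def graphemes(binary) when is_binary(binary) do
    walk(:unicode_util.gc(binary), [])
  end

  # gc is either a single codepoint or a list of codepoints forming one cluster.
  defp walk([gc | rest], acc) do
    walk(:unicode_util.gc(rest), [:unicode.characters_to_binary([gc]) | acc])
  end

  # An empty list signals the end of the input.
  defp walk([], acc), do: Enum.reverse(acc)

  # On invalid UTF-8, emit the offending byte on its own and keep going.
  defp walk({:error, <<byte, rest::bitstring>>}, acc) do
    walk(:unicode_util.gc(rest), [<<byte>> | acc])
  end
end

IO.inspect(GraphemeSketch.graphemes("abc"))
```

A base letter followed by a combining mark, such as `"e\u0301"`, comes back as a single element because `gc/1` groups the codepoints into one cluster.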

The Erlang implementation was up to 2x faster on low
codepoints (such as latin1), while the Elixir one could
be up to 3x faster on high codepoints (such as emoji),
so in the end the performance is roughly the same. As a
benefit, we no longer need to ship our own copy of the
grapheme cluster algorithm, which took up to 250kB on
disk and more than 15 seconds to compile.

Note we still keep our own String downcase and upcase
algorithms, as our version is considerably more efficient
in all cases (more than 5x faster), since it works
exclusively with binaries.
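
For reference, the two casing entry points being compared can be called side by side. This is only an illustration, assuming OTP 20 or later (where `:string.uppercase/1` is available); it shows that both accept and return a binary for this input and does not attempt to reproduce the performance difference.

```elixir
# Elixir's own casing implementation, which this pull request keeps.
elixir_result = String.upcase("olá")

# Erlang's casing function, which accepts chardata in general.
erlang_result = :string.uppercase("olá")

IO.inspect({elixir_result, erlang_result})
```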

# 4. Update String.Unicode.version/0 and on String module docs (version and link)
# 5. make unicode

data_path = Path.join(__DIR__, "UnicodeData.txt")

Member Author

I have moved properties.ex to this file and renamed String.Unicode to String.Casing. Other than that, the contents are strictly the same.

Member

The other way around, right?

Member Author

Correct, renamed Casing to Unicode.

defp do_last([gc | rest], _), do: do_last(:unicode_util.gc(rest), gc)
defp do_last([], acc) when is_binary(acc), do: acc
defp do_last([], acc), do: :unicode.characters_to_binary([acc])
defp do_last({:error, <<byte, rest::bitstring>>}, _), do: do_last(rest, <<byte>>)

Member

@fertapric, May 31, 2021

Suggested change
- defp do_last({:error, <<byte, rest::bitstring>>}, _), do: do_last(rest, <<byte>>)
+ defp do_last({:error, <<byte, rest::bitstring>>}, _), do: do_last(:unicode_util.gc(rest), <<byte>>)

For example, String.last(<<?a, 200, ?a>>) #=> "a"

Member Author

Awesome find. Fixed and regression test added.
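
To see why the suggestion matters: every `do_last/2` clause expects the result of `:unicode_util.gc/1` (a list or an error tuple), but the `rest` extracted in the error clause is a raw binary, so passing it along directly would match no clause whenever more input follows an invalid byte. A self-contained sketch with the fix applied follows; `LastSketch` and its `last/1` wrapper are hypothetical, and only the four `do_last/2` clauses come from this review.

```elixir
defmodule LastSketch do
  # Hypothetical wrapper around the do_last/2 clauses discussed in this review.
  def last(""), do: nil

  def last(binary) when is_binary(binary) do
    do_last(:unicode_util.gc(binary), nil)
  end

  # Keep the most recent grapheme cluster as the accumulator.
  defp do_last([gc | rest], _), do: do_last(:unicode_util.gc(rest), gc)

  # The accumulator is already a binary when the last element was an invalid byte.
  defp do_last([], acc) when is_binary(acc), do: acc

  # Otherwise it is a codepoint or a list of codepoints and must be encoded.
  defp do_last([], acc), do: :unicode.characters_to_binary([acc])

  # Fixed clause: re-enter through :unicode_util.gc/1 instead of passing the raw binary.
  defp do_last({:error, <<byte, rest::bitstring>>}, _) do
    do_last(:unicode_util.gc(rest), <<byte>>)
  end
end

# The input from the review comment: a valid byte, an invalid byte, a valid byte.
IO.inspect(LastSketch.last(<<?a, 200, ?a>>))
```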

@josevalim josevalim merged commit 3f5b3f0 into master Jun 1, 2021
@josevalim josevalim deleted the jv-erlang-unicode branch June 1, 2021 10:44
@josevalim
Member Author

💚 💙 💜 💛 ❤️
