Depend on Erlang for computing grapheme clusters #11024

josevalim · 2021-05-31T13:59:02Z

Erlang has shipped its own copy of the Unicode database
for a couple of releases, and this release changes
Elixir to depend on it in order to compute grapheme
clusters.

The Erlang implementation was up to 2x faster on low
codepoints (such as latin1), while the Elixir one could
be up to 3x faster on high codepoints (such as emoji),
so in the end the performance results are roughly the
same. As a benefit, we no longer need to ship our own copy
of the grapheme cluster algorithm, which took up
to 250kB on disk and more than 15 seconds to compile.

Note that we still keep our own String downcase and upcase
algorithms, as our version is considerably more efficient
in all cases (more than 5x faster) since it works
exclusively with binaries.
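To make the description concrete, here is a minimal sketch of walking a binary with Erlang's `:unicode_util.gc/1`, which returns the next grapheme cluster and the rest of the input. The module `GraphemeSketch` and its functions are hypothetical names for illustration only; this is not the code in `lib/elixir/lib/string.ex`, just the same idea under the assumption that invalid bytes are emitted one at a time.

```elixir
defmodule GraphemeSketch do
  # Hypothetical helper for illustration; not part of Elixir itself.
  # Splits a binary into grapheme clusters by repeatedly calling
  # Erlang's :unicode_util.gc/1.
  def graphemes(string) when is_binary(string) do
    collect(:unicode_util.gc(string), [])
  end

  # No input left: return the clusters in their original order.
  defp collect([], acc), do: Enum.reverse(acc)

  # gc/1 returns [cluster | rest]; the cluster is either a single
  # codepoint or a list of codepoints, so convert it to a binary.
  defp collect([gc | rest], acc) do
    collect(:unicode_util.gc(rest), [:unicode.characters_to_binary([gc]) | acc])
  end

  # Invalid UTF-8: emit the offending byte on its own and continue.
  defp collect({:error, <<byte, rest::bitstring>>}, acc) do
    collect(:unicode_util.gc(rest), [<<byte>> | acc])
  end
end

# "e" followed by a combining acute accent is a single cluster.
IO.inspect(GraphemeSketch.graphemes("e\u0301x"))
```

The result should agree with `String.graphemes/1` for the same input, including for binaries that contain invalid bytes.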


lib/elixir/test/elixir/string_test.exs

josevalim · 2021-05-31T14:00:21Z

lib/elixir/unicode/unicode.ex

+# 4. Update String.Unicode.version/0 and on String module docs (version and link)
+# 5. make unicode
+
+data_path = Path.join(__DIR__, "UnicodeData.txt")


I have moved properties.ex to this file and renamed String.Unicode to String.Casing. Other than that, the contents are strictly the same.

The other way around, right?

Correct, renamed Casing to Unicode.

lib/elixir/lib/string.ex

fertapric · 2021-05-31T16:13:41Z

lib/elixir/lib/string.ex

+  defp do_last([gc | rest], _), do: do_last(:unicode_util.gc(rest), gc)
+  defp do_last([], acc) when is_binary(acc), do: acc
+  defp do_last([], acc), do: :unicode.characters_to_binary([acc])
+  defp do_last({:error, <<byte, rest::bitstring>>}, _), do: do_last(rest, <<byte>>)


Suggested change

-  defp do_last({:error, <<byte, rest::bitstring>>}, _), do: do_last(rest, <<byte>>)
+  defp do_last({:error, <<byte, rest::bitstring>>}, _), do: do_last(:unicode_util.gc(rest), <<byte>>)

For example, String.last(<<?a, 200, ?a>>) #=> "a"

Awesome find. Fixed and regression test added.
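For readers following the thread, the issue can be reproduced in isolation. The sketch below wraps the four `do_last` clauses from the diff (with the suggested fix applied) in a hypothetical module `LastSketch`; the public `last/1` entry points are an assumption, since the diff does not show them. With only these clauses, passing the raw `rest` binary to `do_last/2` would match none of them, because every clause expects the result of `:unicode_util.gc/1` (a list or an `{:error, binary}` tuple).

```elixir
defmodule LastSketch do
  # Hypothetical wrapper for illustration; the entry points below
  # are assumed, only the do_last/2 clauses come from the diff.
  def last(""), do: nil
  def last(string) when is_binary(string), do: do_last(:unicode_util.gc(string), nil)

  # Keep the most recent cluster and move on to the next one.
  defp do_last([gc | rest], _), do: do_last(:unicode_util.gc(rest), gc)

  # End of input: the accumulator is already a binary if it came
  # from the error clause, otherwise it is a codepoint or a list.
  defp do_last([], acc) when is_binary(acc), do: acc
  defp do_last([], acc), do: :unicode.characters_to_binary([acc])

  # Invalid byte: keep it as the candidate and continue by calling
  # :unicode_util.gc/1 on the rest, as suggested in the review.
  defp do_last({:error, <<byte, rest::bitstring>>}, _),
    do: do_last(:unicode_util.gc(rest), <<byte>>)
end

IO.inspect(LastSketch.last(<<?a, 200, ?a>>))
```

An invalid byte in the middle of the binary no longer stops the traversal, so the trailing `"a"` is returned, matching the example in the review comment.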

josevalim · 2021-06-01T10:44:59Z

💚 💙 💜 💛 ❤️

Update lib/elixir/test/elixir/string_test.exs

794f402

josevalim added 2 commits May 31, 2021 16:07

Fix compilation error on OTP 22

8414d98

Simplify length/slice traversal

98c7392

fertapric reviewed May 31, 2021

josevalim added 2 commits May 31, 2021 21:06

Fix feedback

d96e6fb

Fix format

f47f6a6

fertapric approved these changes Jun 1, 2021

josevalim merged commit 3f5b3f0 into master Jun 1, 2021

josevalim deleted the jv-erlang-unicode branch June 1, 2021 10:44

taylordowns2000 mentioned this pull request Sep 30, 2022

String.next_grapheme returns tuple with grapheme and list #12160

Closed
