Draft of directional confusability protection & allowing script mixing in identifiers when separated by underscores #13693

mrluc · 2024-06-26T03:44:13Z

PR following up on Directional confusability: _א1 and _1א not detected -- unicode 15.1's new TR55, and updated TR39, proposing two changes:

per discussion of recent updates to Unicode's UTS39/55 in the issue linked above, we allow safe script mixing in identifiers, using a heuristic derived from the concept of identifier chunks from uts55. Basically we allow limited script mixing in identifiers like foo_bar_baz, only when separated by underscore, where foo, bar, baz chunks must be single-script or highly-restrictive.
provide directional confusability detection (which was needed even without script mixing, as the example from the issue above shows) by reversing spans of rtl-direction chars in bidi_skeleton, so that we default to checking confusability based on display order versus byte order in these relatively rare cases (eg, tools like github/vs code impl. the bidi algo from uax9).

These changes are also in harmony with the latest versions of the unicode technical reports, example from UTS39 as of version 15+ (similar reasoning in 55):

... it is inappropriate to enforce that strings be single-script, or at least single-directionality; this is the case in programming language identifiers ...

This PR maintains Elixir's script-mixing protections; for instance, we still don't allow an identifier like аdmin_, and the error shows us why:

Mixed-script identifiers are not supported for security reasons. 'аdmin_' is made of the following scripts:

  \u0430 а {Cyrillic}
  \u0064 d {Latin}
  \u006D m {Latin}
  \u0069 i {Latin}
  \u006E n {Latin}
  \u005F _

All characters in identifier chunks should resolve to a single script, or a highly restrictive set of scripts.

However, for identifiers separated by an underscore, we now make an exception:

    http_सर्वर = 1
    http_cервер = 1

... as long as the chunks, themselves, are 'single-script' or 'highly restrictive'. (If they aren't, we will see the same kind of error message as above; see test cases w/this PR).

Similarly, the confusability additions build on existing messages, but now include a breakdown of directionality if tokens are confusable and they include any RTL characters:

warning: confusable identifier: '_1א' looks like '_א1' on line 1, but they are written using different characters

'_1א' includes right-to-left characters:
  \u005F _ neutral
  \u0031 1 neutral
  \u05D0 א rtl

'_א1' includes right-to-left characters:
  \u005F _ neutral
  \u05D0 א rtl
  \u0031 1 neutral

@josevalim this is a first pass -- here's what I think is good/bad/ugly

I think/hope this should have basically no perf impact because of where the additions live

the .beam file sizes are bumped a little bit because of the new rtl? functions in Tokenizer.Security used for bidi_skeleton, by approx. 10kb:

➜  elixir-clean git:(main) ls -l lib/elixir/ebin/Elixir.String.Tokenizer*
-rw-r--r--  1 mrluc  staff    2660 Jun 25 06:55 lib/elixir/ebin/Elixir.String.Tokenizer.ScriptSet.beam
-rw-r--r--  1 mrluc  staff  134312 Jun 25 06:55 lib/elixir/ebin/Elixir.String.Tokenizer.Security.beam
-rw-r--r--  1 mrluc  staff   35456 Jun 25 06:55 lib/elixir/ebin/Elixir.String.Tokenizer.beam
➜  elixir git:(unicode-15-ltr-confusability-chunked-idents) ls -lh lib/elixir/ebin/Elixir.String.Tokenizer*
-rw-r--r--  1 mrluc  staff   2.6K Jun 25 09:46 lib/elixir/ebin/Elixir.String.Tokenizer.ScriptSet.beam
-rw-r--r--  1 mrluc  staff   141K Jun 25 09:46 lib/elixir/ebin/Elixir.String.Tokenizer.Security.beam
-rw-r--r--  1 mrluc  staff    36K Jun 25 09:46 lib/elixir/ebin/Elixir.String.Tokenizer.beam

I switched the order of Tokenizer and Tokenizer.Security in the makefile, and export an undocumented function from Tokenizer, so that Security can use it to filter the list of RTL and directionally 'neutral' characters to be just those that we know could legally appear in an identifier. Let me know if that's undesirable/I should do that a better way.
maybe this should have tests of directional confusability w/more complex identifiers, lmk if you agree; I never circled back to more realistic examples.

* follow uts39's recco that it's not necessary to require idents to be single-script (they call out proglang idents, reference the new uts55-5). We use a heuristic derived from the concept of identifier chunks from uts55-5, to allow idents like foo_bar_baz where each chunk around the _ can be single-or-highly-restrictive * provide directional confusability detection, by reversing spans of direction-changed chars in idents for bidi_skeleton, see issue elixir-lang#12929

lib/elixir/pages/references/unicode-syntax.md

lib/elixir/unicode/security.ex

josevalim · 2024-06-27T09:09:46Z

This looks great, I have added some comments mostly to move things around, and one discussion to perhaps rollback to single script now that we support chunks.

the .beam file sizes are bumped a little bit because of the new rtl? functions in Tokenizer.Security used for bidi_skeleton, by approx. 10kb:

The 10kb increase is not an issue but we definitely want to keep compilation times low, so I suggest moving the direction checking to String.Tokenizer.

maybe this should have tests of directional confusability w/more complex identifiers, lmk if you agree; I never circled back to more realistic examples.

More tests are always welcome!

lib/elixir/pages/references/unicode-syntax.md

Co-authored-by: José Valim <jose.valim@gmail.com>

* better tests/impl of directional confusability * move dir/1 to from security.ex -> tokenizer.ex

mrluc · 2024-07-02T01:22:46Z

lib/elixir/test/elixir/kernel/warning_test.exs

+      # test that the implementation of String.Tokenizer.Security.unbidify/1 agrees
+      # w/Unicode Bidi Algo (UAX9) for these (identifier-specific, no-bracket) examples
+      #
+      # you can create new examples with: https://util.unicode.org/UnicodeJsps/bidic.jsp?s=foo_%D9%84%D8%A7%D9%85%D8%AF%D8%A7_baz&b=0&u=140&d=2
+      # inspired by (none of these are directly usable for our idents): https://www.unicode.org/Public/UCD/latest/ucd/BidiCharacterTest.txt
+      #
+      # there's a spurious ;A; after the identifier, because the semicolon is dir-neutral, and
+      # deleting it makes these examples hard to read in many/most editors!
+      """
+      foo;A;0066 006F 006F;0 1 2
+      _foo_ ;A;005F 0066 006F 006F 005F;0 1 2 3 4
+      __foo__ ;A;005F 005F 0066 006F 006F 005F 005F;0 1 2 3 4 5 6
+      لامدا_foo ;A;0644 0627 0645 062F 0627 005F 0066 006F 006F;4 3 2 1 0 5 6 7 8
+      foo_لامدا_baz ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 0062 0061 007A;0 1 2 3 8 7 6 5 4 9 10 11 12
+      foo_لامدا ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627;0 1 2 3 8 7 6 5 4
+      foo_لامدا1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 0031;0 1 2 3 9 8 7 6 5 4
+      foo_لامدا_حدد ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F;0 1 2 3 12 11 10 9 8 7 6 5 4
+      foo_لامدا_حدد1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4
+      foo_لامدا_حدد1_bar ;A; 0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17
+      foo_لامدا_حدد1_bar1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17 18


@josevalim I added these example-based tests, which fed back into the implementation by exercising things I hadn't thought through, like 'which span a neutral character or a number will attach to'.

Unicode has a great Bidi Algo testing tool, which will spit out the list of rearranged indices.

Let me know if we don't like this format for these tests, and if so what kind of format would be better; in the absence of anything else, I liked BidiCharacterTest.txt's approach of "each test case has a list of codepoints, and a list of indices that they should be reordered to".

mrluc · 2024-07-02T01:25:29Z

@josevalim I took a pass at moving those things around, and adding a handful of more complex LTR-confusability examples in the test -- which ended up changing the implementation of unbidify for the better I think.

josevalim · 2024-07-02T11:52:42Z

lib/elixir/unicode/tokenizer.ex

+  end
+
+  def dir(i) when is_integer(i), do: :ltr
+  def dir(_), do: {:error, :codepoint_must_be_integer}


Is there a reason we don't want to raise here? This will always be a bug, right?

josevalim · 2024-07-02T12:07:25Z

💚 💙 💜 💛 ❤️

josevalim · 2024-07-02T13:05:22Z

Follow up commits you may be interested in:

f01dc17

9924aff

mrluc · 2024-07-02T13:29:10Z

Ah, nice -- love this refactor -- also, literally just learned about Enum.reverse(ls, tail) 😆 -- and Highly Restrictive is no more, sweet! 🥳 🎉

mrluc changed the title ~~from unicode 15~~ Draft of directional confusability protection & allowing script mixing in identifiers when separated by underscores Jun 26, 2024