Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft of directional confusability protection & allowing script mixing in identifiers when separated by underscores #13693

Merged

Conversation

mrluc
Copy link
Contributor

@mrluc mrluc commented Jun 26, 2024

PR following up on Directional confusability: _א1 and _1א not detected -- unicode 15.1's new TR55, and updated TR39, proposing two changes:

  1. per discussion of recent updates to Unicode's UTS39/55 in the issue linked above, we allow safe script mixing in identifiers, using a heuristic derived from the concept of identifier chunks from uts55. Basically we allow limited script mixing in identifiers like foo_bar_baz, only when separated by underscore, where foo, bar, baz chunks must be single-script or highly-restrictive.

  2. provide directional confusability detection (which was needed even without script mixing, as the example from the issue above shows) by reversing spans of rtl-direction chars in bidi_skeleton, so that we default to checking confusability based on display order versus byte order in these relatively rare cases (eg, tools like github/vs code impl. the bidi algo from uax9).

These changes are also in harmony with the latest versions of the unicode technical reports, example from UTS39 as of version 15+ (similar reasoning in 55):

... it is inappropriate to enforce that strings be single-script, or at least single-directionality; this is the case in programming language identifiers ...

This PR maintains Elixir's script-mixing protections; for instance, we still don't allow an identifier like аdmin_, and the error shows us why:

Mixed-script identifiers are not supported for security reasons. 'аdmin_' is made of the following scripts:

  \u0430 а {Cyrillic}
  \u0064 d {Latin}
  \u006D m {Latin}
  \u0069 i {Latin}
  \u006E n {Latin}
  \u005F _

All characters in identifier chunks should resolve to a single script, or a highly restrictive set of scripts.

However, for identifiers separated by an underscore, we now make an exception:

    http_सर्वर = 1
    http_cервер = 1

... as long as the chunks, themselves, are 'single-script' or 'highly restrictive'. (If they aren't, we will see the same kind of error message as above; see test cases w/this PR).

Similarly, the confusability additions build on existing messages, but now include a breakdown of directionality if tokens are confusable and they include any RTL characters:

warning: confusable identifier: '_1א' looks like '_א1' on line 1, but they are written using different characters

'_1א' includes right-to-left characters:
  \u005F _ neutral
  \u0031 1 neutral
  \u05D0 א rtl

'_א1' includes right-to-left characters:
  \u005F _ neutral
  \u05D0 א rtl
  \u0031 1 neutral

@josevalim this is a first pass -- here's what I think is good/bad/ugly

  • I think/hope this should have basically no perf impact because of where the additions live

  • the .beam file sizes are bumped a little bit because of the new rtl? functions in Tokenizer.Security used for bidi_skeleton, by approx. 10kb:

    ➜  elixir-clean git:(main) ls -l lib/elixir/ebin/Elixir.String.Tokenizer*
    -rw-r--r--  1 mrluc  staff    2660 Jun 25 06:55 lib/elixir/ebin/Elixir.String.Tokenizer.ScriptSet.beam
    -rw-r--r--  1 mrluc  staff  134312 Jun 25 06:55 lib/elixir/ebin/Elixir.String.Tokenizer.Security.beam
    -rw-r--r--  1 mrluc  staff   35456 Jun 25 06:55 lib/elixir/ebin/Elixir.String.Tokenizer.beam
    ➜  elixir git:(unicode-15-ltr-confusability-chunked-idents) ls -lh lib/elixir/ebin/Elixir.String.Tokenizer*
    -rw-r--r--  1 mrluc  staff   2.6K Jun 25 09:46 lib/elixir/ebin/Elixir.String.Tokenizer.ScriptSet.beam
    -rw-r--r--  1 mrluc  staff   141K Jun 25 09:46 lib/elixir/ebin/Elixir.String.Tokenizer.Security.beam
    -rw-r--r--  1 mrluc  staff    36K Jun 25 09:46 lib/elixir/ebin/Elixir.String.Tokenizer.beam
    
  • I switched the order of Tokenizer and Tokenizer.Security in the makefile, and export an undocumented function from Tokenizer, so that Security can use it to filter the list of RTL and directionally 'neutral' characters to be just those that we know could legally appear in an identifier. Let me know if that's undesirable/I should do that a better way.

  • maybe this should have tests of directional confusability w/more complex identifiers, lmk if you agree; I never circled back to more realistic examples.

* follow uts39's recco that it's not necessary to require
  idents to be single-script (they call out proglang idents,
  reference the new uts55-5). We use a heuristic derived from
  the concept of identifier chunks from uts55-5, to allow
  idents like foo_bar_baz where each chunk around the _ can be
  single-or-highly-restrictive

* provide directional confusability detection, by reversing
  spans of direction-changed chars in idents for bidi_skeleton,
  see issue elixir-lang#12929
@mrluc mrluc changed the title from unicode 15 Draft of directional confusability protection & allowing script mixing in identifiers when separated by underscores Jun 26, 2024
@josevalim
Copy link
Member

This looks great, I have added some comments mostly to move things around, and one discussion to perhaps rollback to single script now that we support chunks.

the .beam file sizes are bumped a little bit because of the new rtl? functions in Tokenizer.Security used for bidi_skeleton, by approx. 10kb:

The 10kb increase is not an issue but we definitely want to keep compilation times low, so I suggest moving the direction checking to String.Tokenizer.

maybe this should have tests of directional confusability w/more complex identifiers, lmk if you agree; I never circled back to more realistic examples.

More tests are always welcome!

mrluc and others added 3 commits June 27, 2024 05:08
Co-authored-by: José Valim <jose.valim@gmail.com>
* better tests/impl of directional confusability
* move dir/1 to from security.ex -> tokenizer.ex
Comment on lines +165 to +184
# test that the implementation of String.Tokenizer.Security.unbidify/1 agrees
# w/Unicode Bidi Algo (UAX9) for these (identifier-specific, no-bracket) examples
#
# you can create new examples with: https://util.unicode.org/UnicodeJsps/bidic.jsp?s=foo_%D9%84%D8%A7%D9%85%D8%AF%D8%A7_baz&b=0&u=140&d=2
# inspired by (none of these are directly usable for our idents): https://www.unicode.org/Public/UCD/latest/ucd/BidiCharacterTest.txt
#
# there's a spurious ;A; after the identifier, because the semicolon is dir-neutral, and
# deleting it makes these examples hard to read in many/most editors!
"""
foo;A;0066 006F 006F;0 1 2
_foo_ ;A;005F 0066 006F 006F 005F;0 1 2 3 4
__foo__ ;A;005F 005F 0066 006F 006F 005F 005F;0 1 2 3 4 5 6
لامدا_foo ;A;0644 0627 0645 062F 0627 005F 0066 006F 006F;4 3 2 1 0 5 6 7 8
foo_لامدا_baz ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 0062 0061 007A;0 1 2 3 8 7 6 5 4 9 10 11 12
foo_لامدا ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627;0 1 2 3 8 7 6 5 4
foo_لامدا1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 0031;0 1 2 3 9 8 7 6 5 4
foo_لامدا_حدد ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F;0 1 2 3 12 11 10 9 8 7 6 5 4
foo_لامدا_حدد1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4
foo_لامدا_حدد1_bar ;A; 0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17
foo_لامدا_حدد1_bar1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17 18
Copy link
Contributor Author

@mrluc mrluc Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim I added these example-based tests, which fed back into the implementation by exercising things I hadn't thought through, like 'which span a neutral character or a number will attach to'.

Unicode has a great Bidi Algo testing tool, which will spit out the list of rearranged indices.

Let me know if we don't like this format for these tests, and if so what kind of format would be better; in the absence of anything else, I liked BidiCharacterTest.txt's approach of "each test case has a list of codepoints, and a list of indices that they should be reordered to".

@mrluc
Copy link
Contributor Author

mrluc commented Jul 2, 2024

@josevalim I took a pass at moving those things around, and adding a handful of more complex LTR-confusability examples in the test -- which ended up changing the implementation of unbidify for the better I think.

end

def dir(i) when is_integer(i), do: :ltr
def dir(_), do: {:error, :codepoint_must_be_integer}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we don't want to raise here? This will always be a bug, right?

@josevalim josevalim merged commit c83334e into elixir-lang:main Jul 2, 2024
10 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

@josevalim
Copy link
Member

Follow up commits you may be interested in:

f01dc17

9924aff

@mrluc
Copy link
Contributor Author

mrluc commented Jul 2, 2024

Ah, nice -- love this refactor -- also, literally just learned about Enum.reverse(ls, tail) 😆 -- and Highly Restrictive is no more, sweet! 🥳 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants