-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft of directional confusability protection & allowing script mixing in identifiers when separated by underscores #13693
Draft of directional confusability protection & allowing script mixing in identifiers when separated by underscores #13693
Conversation
* follow uts39's recco that it's not necessary to require idents to be single-script (they call out proglang idents, reference the new uts55-5). We use a heuristic derived from the concept of identifier chunks from uts55-5, to allow idents like foo_bar_baz where each chunk around the _ can be single-or-highly-restrictive * provide directional confusability detection, by reversing spans of direction-changed chars in idents for bidi_skeleton, see issue elixir-lang#12929
This looks great, I have added some comments mostly to move things around, and one discussion to perhaps rollback to single script now that we support chunks.
The 10kb increase is not an issue but we definitely want to keep compilation times low, so I suggest moving the direction checking to String.Tokenizer.
More tests are always welcome! |
Co-authored-by: José Valim <jose.valim@gmail.com>
* better tests/impl of directional confusability * move dir/1 to from security.ex -> tokenizer.ex
# test that the implementation of String.Tokenizer.Security.unbidify/1 agrees | ||
# w/Unicode Bidi Algo (UAX9) for these (identifier-specific, no-bracket) examples | ||
# | ||
# you can create new examples with: https://util.unicode.org/UnicodeJsps/bidic.jsp?s=foo_%D9%84%D8%A7%D9%85%D8%AF%D8%A7_baz&b=0&u=140&d=2 | ||
# inspired by (none of these are directly usable for our idents): https://www.unicode.org/Public/UCD/latest/ucd/BidiCharacterTest.txt | ||
# | ||
# there's a spurious ;A; after the identifier, because the semicolon is dir-neutral, and | ||
# deleting it makes these examples hard to read in many/most editors! | ||
""" | ||
foo;A;0066 006F 006F;0 1 2 | ||
_foo_ ;A;005F 0066 006F 006F 005F;0 1 2 3 4 | ||
__foo__ ;A;005F 005F 0066 006F 006F 005F 005F;0 1 2 3 4 5 6 | ||
لامدا_foo ;A;0644 0627 0645 062F 0627 005F 0066 006F 006F;4 3 2 1 0 5 6 7 8 | ||
foo_لامدا_baz ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 0062 0061 007A;0 1 2 3 8 7 6 5 4 9 10 11 12 | ||
foo_لامدا ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627;0 1 2 3 8 7 6 5 4 | ||
foo_لامدا1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 0031;0 1 2 3 9 8 7 6 5 4 | ||
foo_لامدا_حدد ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F;0 1 2 3 12 11 10 9 8 7 6 5 4 | ||
foo_لامدا_حدد1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4 | ||
foo_لامدا_حدد1_bar ;A; 0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17 | ||
foo_لامدا_حدد1_bar1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17 18 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@josevalim I added these example-based tests, which fed back into the implementation by exercising things I hadn't thought through, like 'which span a neutral character or a number will attach to'.
Unicode has a great Bidi Algo testing tool, which will spit out the list of rearranged indices.
Let me know if we don't like this format for these tests, and if so what kind of format would be better; in the absence of anything else, I liked BidiCharacterTest.txt's approach of "each test case has a list of codepoints, and a list of indices that they should be reordered to".
@josevalim I took a pass at moving those things around, and adding a handful of more complex LTR-confusability examples in the test -- which ended up changing the implementation of |
end | ||
|
||
def dir(i) when is_integer(i), do: :ltr | ||
def dir(_), do: {:error, :codepoint_must_be_integer} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we don't want to raise here? This will always be a bug, right?
💚 💙 💜 💛 ❤️ |
Ah, nice -- love this refactor -- also, literally just learned about Enum.reverse(ls, tail) 😆 -- and Highly Restrictive is no more, sweet! 🥳 🎉 |
PR following up on Directional confusability: _א1 and _1א not detected -- unicode 15.1's new TR55, and updated TR39, proposing two changes:
per discussion of recent updates to Unicode's UTS39/55 in the issue linked above, we allow safe script mixing in identifiers, using a heuristic derived from the concept of identifier chunks from uts55. Basically we allow limited script mixing in identifiers like
foo_bar_baz
, only when separated by underscore, wherefoo
,bar
,baz
chunks must be single-script or highly-restrictive.provide directional confusability detection (which was needed even without script mixing, as the example from the issue above shows) by reversing spans of rtl-direction chars in bidi_skeleton, so that we default to checking confusability based on display order versus byte order in these relatively rare cases (eg, tools like github/vs code impl. the bidi algo from uax9).
These changes are also in harmony with the latest versions of the unicode technical reports, example from UTS39 as of version 15+ (similar reasoning in 55):
This PR maintains Elixir's script-mixing protections; for instance, we still don't allow an identifier like
аdmin_
, and the error shows us why:However, for identifiers separated by an underscore, we now make an exception:
... as long as the chunks, themselves, are 'single-script' or 'highly restrictive'. (If they aren't, we will see the same kind of error message as above; see test cases w/this PR).
Similarly, the confusability additions build on existing messages, but now include a breakdown of directionality if tokens are confusable and they include any RTL characters:
@josevalim this is a first pass -- here's what I think is good/bad/ugly
I think/hope this should have basically no perf impact because of where the additions live
the .beam file sizes are bumped a little bit because of the new
rtl?
functions inTokenizer.Security
used for bidi_skeleton, by approx. 10kb:I switched the order of
Tokenizer
andTokenizer.Security
in the makefile, and export an undocumented function fromTokenizer
, so thatSecurity
can use it to filter the list of RTL and directionally 'neutral' characters to be just those that we know could legally appear in an identifier. Let me know if that's undesirable/I should do that a better way.maybe this should have tests of directional confusability w/more complex identifiers, lmk if you agree; I never circled back to more realistic examples.