Support named unicode characters in f-strings #160

Merged: 2 commits, Nov 22, 2020

@thatch thatch commented Nov 21, 2020

Fixes #154

The previous behavior misinterpreted the curly braces as enclosing an
expression. This change does some cursory validation so we can still
get parse errors in the most egregious cases, but does not validate that
the names are actually valid, only that they are name-shaped and have a
chance of being valid.
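
To illustrate the gap between "name-shaped" and "actually valid", here is a minimal sketch using the standard library's `unicodedata.lookup`. The helper `is_known_name` is a hypothetical name for illustration only and is not part of this patch.

```python
import unicodedata


def is_known_name(name: str) -> bool:
    """Hypothetical helper: True if the Unicode database knows this name."""
    try:
        unicodedata.lookup(name)
    except KeyError:
        # lookup raises KeyError for names it does not recognise
        return False
    return True


print(is_known_name("BULLET"))  # a real character name
print(is_known_name("bullet"))  # lookup is case insensitive
print(is_known_name("A B"))     # name-shaped, but not a real name
```

A tokenizer that only checks the shape would accept all three of these; deciding whether the name really exists is left to a later stage.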

The character names appear to obey a few rules:

  • Case insensitive
  • Name characters are [A-Z0-9 \-]
  • Whitespace before or after is not allowed
  • Whitespace in the middle may only be a single space between words
  • Dashes may occur at the start, middle, or end of a word (e.g. `TIBETAN MARK BKA- SHOG YIG MGO`)

```py
f"\N{A B}"           # might be legal
f"\N{a b}"           # equivalent to above
f"\N{A     B}"       # no way
f"\N{    A B     }"  # no way
f"""\N{A
B}"""                # no way
```
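
The rules above can be sketched as a small shape check. This is only an illustration, assuming the same pattern as the verification script in this description; `looks_like_unicode_name` is a hypothetical helper name, not the function added by the patch.

```python
import re

# Words made of letters, digits and dashes, separated by single spaces.
NAME_SHAPE = re.compile(r"[A-Za-z0-9\-]+(?: [A-Za-z0-9\-]+)*")


def looks_like_unicode_name(text: str) -> bool:
    """Hypothetical helper: True if the text between the braces is name-shaped."""
    return NAME_SHAPE.fullmatch(text) is not None


# The same candidates as in the examples above.
for candidate in ["A B", "a b", "A     B", "    A B     ", "A\nB"]:
    print(repr(candidate), looks_like_unicode_name(candidate))
```

Using `fullmatch` rather than `match` is what rejects trailing whitespace and embedded newlines.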

For confirming this regex matches all (current) unicode character names:

```py
import re
import sys
import unicodedata

R = re.compile(r"[A-Za-z0-9\-]+(?: [A-Za-z0-9\-]+)*")

# sys.maxunicode is itself a valid code point, so include it in the range.
for i in range(sys.maxunicode + 1):
    try:
        name = unicodedata.name(chr(i))
    except ValueError:
        # Code points without a name (e.g. controls like 0 and 1) raise ValueError
        continue
    m = R.fullmatch(name)
    if m is None:
        print("FAIL", repr(name))
```

thatch added a commit to thatch/LibCST that referenced this pull request Nov 21, 2020
@isidentical isidentical left a comment


LGTM. Could you please add a few tests where these unicode escapes are used in the middle and at the end of the f-string? For example, `f"some {stuff} and \N{escape}"`.
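
As a rough reference for the positions mentioned in this review comment, the sketch below checks what CPython itself accepts via the built-in `compile`. It is not taken from parso's test suite; `cpython_accepts` is a hypothetical helper, and `BULLET` is just an arbitrary real character name.

```python
def cpython_accepts(source: str) -> bool:
    """Hypothetical helper: True if CPython compiles the source without a SyntaxError."""
    try:
        # compile() only parses; the names inside the f-string are never evaluated
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


cases = [
    r'f"\N{BULLET} at the start, then {stuff}"',
    r'f"some {stuff} and \N{BULLET} in the middle of {more}"',
    r'f"some {stuff} and \N{BULLET}"',
]
for case in cases:
    print(case, cpython_accepts(case))
```

Tests in parso would exercise its own tokenizer instead; this only pins down the behavior those tests are expected to agree with.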


thatch commented Nov 22, 2020

Additional tests in latest push.

@isidentical isidentical merged commit d39aadc into davidhalter:master Nov 22, 2020
@isidentical

Thanks for the patch @thatch! Also loved your work on https://github.com/thatch/python-grammar-changes

jimmylai pushed a commit to Instagram/LibCST that referenced this pull request Nov 30, 2020
* Support named unicode characters in f-strings

This is the same as my pull request
davidhalter/parso#160

* A small bugfix to what is allowed in f-string expressions

Thanks to davidhalter/parso#159 for catching
that yield (as an expression, I suppose) is allowed on 3.6.
Successfully merging this pull request may close these issues.

Named unicode escapes with spaces marked as syntax error in f-strings