
Carets/offsets are wrong for unicode characters #10

Open
ammaraskar opened this issue May 16, 2021 · 12 comments

Comments

@ammaraskar

Using the code:

# -*- coding: utf-8 -*-

1 + "Ĥellö Wörld"

leads to

Traceback (most recent call last):
  File "C:\Users\ammar\junk\test.py", line 3, in <module>
    1 + "Ĥellö Wörld"
    ^^^^^^^^^^^^^^^^^^^^
TypeError: unsupported operand type(s) for +: 'int' and 'str'
@pablogsal

Maybe we need to use something like https://github.com/python/cpython/blob/main/Parser/pegen.c#L143

but on the other hand, we handle this correctly for syntax errors:

>>> f(4, "Ĥellö Wörld" for x in range(1))
  File "<stdin>", line 1
    f(4, "Ĥellö Wörld" for x in range(1))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized

and in reality the offsets used are the same.

@ammaraskar
Author

Huh, I wonder if there is some magic going on in the syntax error handling, because the AST column offsets are definitely byte-based rather than character-based.

> python -m ast -a C:\Users\ammar\junk\test.py
Module(
   body=[
      Expr(
         value=BinOp(
            left=Constant(
               value=1,
               lineno=3,
               col_offset=0,
               end_lineno=3,
               end_col_offset=1),
            op=Add(),
            right=Constant(
               value='Ĥellö Wörld',
               lineno=3,
               col_offset=4,
               end_lineno=3,
               end_col_offset=20),
            lineno=3,
            col_offset=0,
            end_lineno=3,
            end_col_offset=20),
         lineno=3,
         col_offset=0,
         end_lineno=3,
         end_col_offset=20)],
   type_ignores=[])
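
The byte-based offsets are easy to observe from Python itself. A minimal check (standard library only): the string literal ends at character 17, but the AST reports `end_col_offset=20` because `Ĥ` and the two `ö` each take two bytes in UTF-8:

```python
import ast

src = '1 + "Ĥellö Wörld"'
string_node = ast.parse(src).body[0].value.right

# The line is 17 characters but 20 UTF-8 bytes: Ĥ, ö, and ö
# each take two bytes, and the AST offsets count bytes.
print(len(src))                    # 17 characters
print(string_node.col_offset)      # 4
print(string_node.end_col_offset)  # 20, a byte offset
```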

@pablogsal

> Huh, I wonder if there is some magic going on in the syntax error handling, because the AST column offsets are definitely byte-based rather than character-based.

Surprise surprise:

https://github.com/python/cpython/blob/fdc7e52f5f1853e350407c472ae031339ac7f60c/Parser/pegen.c#L496-L501
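
In pure Python, the conversion that helper performs looks roughly like this (a sketch mirroring the C helper's name, not CPython's actual implementation): re-encode the line as UTF-8, cut at the byte offset, and count the characters that survive decoding.

```python
def byte_offset_to_character_offset(line: str, byte_offset: int) -> int:
    # Re-encode the line as utf-8, slice at the byte offset, and
    # count the characters in the decoded prefix.
    data = line.encode("utf-8")
    return len(data[:byte_offset].decode("utf-8", errors="replace"))

print(byte_offset_to_character_offset('1 + "Ĥellö Wörld"', 20))  # 17
```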

@ammaraskar
Author

So there's a weird edge case here around non-utf-8 encoded source files that we need to consider if we want to add special handling:

Consider a file like:

# -*- coding: cp437 -*-

f(4, "Hëllo World τΦΘ" for x in range(1))

Now in the compiler I think this ends up magically getting handled because the tokenizer re-encodes the source file's lines to utf-8. When it goes into the pegen error handling code, the call to PyErr_ProgramTextObject fails, because it tries to decode the raw line from the file as utf-8 and the cp437 bytes are not valid utf-8. That failure invokes the fallback logic for getting the line, which reads from the tokenizer's buffer and decodes it as utf-8; this succeeds because the tokenizer already re-encoded the line to utf-8. Thus the assumption in byte_offset_to_character_offset that the line is utf-8 holds, and it all works.
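
The encoding-detection half of this mechanism is observable from Python via `tokenize.detect_encoding` (a sketch of the behavior, not the C tokenizer itself). Byte `0x9b` is `¢` in cp437 but is invalid as a lone byte in utf-8:

```python
import io
import tokenize

# 0x9b is '¢' in cp437; on its own it is not valid utf-8
raw = b"# -*- coding: cp437 -*-\nx = '\x9b\x9b'\n"
encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)

print(encoding)                              # cp437
print(raw.splitlines()[1].decode(encoding))  # x = '¢¢'
```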

(As an aside, I think there's a bug here if we use a custom # -*- coding: that encodes characters differently from utf-8 but still parses as valid utf-8.)

However, this case is a bit trickier to handle in the traceback machinery. In the traceback code we only know the filename and the offsets; we can't rely on the tokenizer to have done the work of figuring out the proper source encoding for the file. I wonder if, since this is such an edge case, we should just assume it's utf-8 and decode in replace mode, or whether we should actually handle it properly.
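
The assume-utf-8 option would be a one-liner (a hypothetical helper for illustration, not existing CPython code): undecodable bytes become U+FFFD instead of raising, so the caret can still be printed.

```python
def decode_source_line(raw_line: bytes) -> str:
    # Assume utf-8; invalid bytes are replaced with U+FFFD rather
    # than raising, so rendering the caret never fails.
    return raw_line.decode("utf-8", errors="replace")

print(decode_source_line(b"f(4, '\x9b' for x in range(1))"))
```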

@ammaraskar
Author

> (As an aside, I think there's a bug here if we use a custom # -*- coding: that encodes characters differently from utf-8 but still parses as valid utf-8.)

I was able to trigger this bug:

# -*- coding: cp437 -*-

f(4, '¢¢¢¢¢¢' for x in range(1)) # Test
  File "test-weird-encoding.py", line 3
    f(4, '¢¢¢¢¢¢' for x in range(1)) # Test
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized
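
The overlong caret span is just the difference in byte width of `¢`: one byte in cp437 (the file's actual bytes), two bytes in the tokenizer's internal utf-8 buffer, so byte offsets computed against one don't line up with the other. A quick check:

```python
line = "f(4, '¢¢¢¢¢¢' for x in range(1)) # Test"

in_cp437 = line.encode("cp437")  # the bytes actually in the file
in_utf8 = line.encode("utf-8")   # what the tokenizer keeps internally

# cp437 is a single-byte encoding, so byte offsets equal character
# offsets there; each '¢' costs two bytes in utf-8, shifting columns.
print(len(in_cp437), len(in_utf8))  # the utf-8 form is 6 bytes longer
```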

Let me make a bpo issue for it.

@pablogsal

Excellent find @ammaraskar. This is going to be a bit of a pain to fix, I am afraid :(

@pablogsal

This also affects older versions of the interpreter; it is basically a bug in PyErr_ProgramTextObject:

  File "/Users/pgalindo3/github/cpython/lel.py", line 3
    f('¢¢¢¢¢¢', 4 for x in range(1)) # Test
                                                                ^
SyntaxError: Generator expression must be parenthesized

@ammaraskar
Author

bpo issue created: https://bugs.python.org/issue44349

@ammaraskar
Author

I think for our case in the traceback, since this is such a weird edge case, we should probably just assume the input is utf-8 and ignore/replace decoding errors? @isidentical would it be possible to quickly check with your AST tooling how many Python packages use a custom # -*- coding: line that isn't utf-8?

@pablogsal

> I think for our case in the traceback, since this is such a weird edge case, we should probably just assume the input is utf-8 and ignore/replace decoding errors?

I very much think so. The reason is that the best we can do is print the error as utf-8, and while that is still slightly weird, it is consistent with the rest of the interpreter. That is what I did here:

python#26611

@pablogsal

pablogsal commented Jun 8, 2021

> @isidentical would it be possible to quickly check with your AST tooling how many Python packages use a custom # -*- coding: line that isn't utf-8?

Also, notice that the problem happens when the encoding is not utf-8 and the line with the error contains something that is decodable as utf-8. But this also happens with older versions when reporting any error with a caret, as the initial position will also be wrong.

On the other hand, we could add some code to make sure we are decoding a utf-8 file by checking the BOM, and not show the caret if that's not true.
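
A BOM check is straightforward (a sketch of the idea only; whether to also honor the coding cookie is a separate question, and the helper name is hypothetical):

```python
import codecs

def starts_with_utf8_bom(raw: bytes) -> bool:
    # Only report True when the file explicitly starts with a utf-8 BOM;
    # under the proposal, no BOM would mean no caret.
    return raw.startswith(codecs.BOM_UTF8)

print(starts_with_utf8_bom(codecs.BOM_UTF8 + b"print(1)\n"))  # True
print(starts_with_utf8_bom(b"# -*- coding: cp437 -*-\n"))     # False
```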

@ammaraskar
Author

> However, this case is a bit trickier to handle in the traceback machinery.

Aha, I spoke too soon! The traceback machinery is already set up to detect the encoding so we are all set here 😅
https://github.com/python/cpython/blob/main/Python/traceback.c#L414
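
The equivalent behavior is exposed in Python as `tokenize.open`, which honors both the BOM and the coding cookie when decoding a source file:

```python
import os
import tempfile
import tokenize

src = "# -*- coding: cp437 -*-\nx = '¢¢'\n"
fd, path = tempfile.mkstemp(suffix=".py")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(src.encode("cp437"))
    # tokenize.open detects the coding cookie and decodes accordingly
    with tokenize.open(path) as f:
        text = f.read()
finally:
    os.remove(path)

print("¢¢" in text)  # True
```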
