
Carets/offsets are wrong for unicode characters #10

Open
ammaraskar opened this issue May 16, 2021 · 12 comments

Comments

@ammaraskar

Using the code:

# -*- coding: utf-8 -*-

1 + "Ĥellö Wörld"

leads to

Traceback (most recent call last):
  File "C:\Users\ammar\junk\test.py", line 3, in <module>
    1 + "Ĥellö Wörld"
    ^^^^^^^^^^^^^^^^^^^^
TypeError: unsupported operand type(s) for +: 'int' and 'str'
@pablogsal

Maybe we need to use something like https://github.com/python/cpython/blob/main/Parser/pegen.c#L143

but on the other hand, we handle this correctly for syntax errors:

>>> f(4, "Ĥellö Wörld" for x in range(1))
  File "<stdin>", line 1
    f(4, "Ĥellö Wörld" for x in range(1))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized

and in reality the offsets used are the same.

@ammaraskar
Author

Huh, I wonder if there is some magic going on in the syntax error handling, because the AST column offsets are definitely byte-based rather than character-based.

> python -m ast -a C:\Users\ammar\junk\test.py
Module(
   body=[
      Expr(
         value=BinOp(
            left=Constant(
               value=1,
               lineno=3,
               col_offset=0,
               end_lineno=3,
               end_col_offset=1),
            op=Add(),
            right=Constant(
               value='Ĥellö Wörld',
               lineno=3,
               col_offset=4,
               end_lineno=3,
               end_col_offset=20),
            lineno=3,
            col_offset=0,
            end_lineno=3,
            end_col_offset=20),
         lineno=3,
         col_offset=0,
         end_lineno=3,
         end_col_offset=20)],
   type_ignores=[])
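
The byte-based offsets are easy to observe from Python itself. A minimal check (standard library only): the string literal ends at character 17, but the AST reports `end_col_offset=20` because `Ĥ` and the two `ö` each take two bytes in UTF-8:

```python
import ast

src = '1 + "Ĥellö Wörld"'
string_node = ast.parse(src).body[0].value.right

# The line is 17 characters but 20 UTF-8 bytes: Ĥ, ö, and ö
# each take two bytes, and the AST offsets count bytes.
print(len(src))                    # 17 characters
print(string_node.col_offset)      # 4
print(string_node.end_col_offset)  # 20, a byte offset
```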

@pablogsal

> Huh, I wonder if there is some magic going on in the syntax error handling, because the AST column offsets are definitely byte-based rather than character-based.

Surprise surprise:

https://github.com/python/cpython/blob/fdc7e52f5f1853e350407c472ae031339ac7f60c/Parser/pegen.c#L496-L501
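
In pure Python, the conversion that helper performs looks roughly like this (a sketch mirroring the C helper's name, not CPython's actual implementation): re-encode the line as UTF-8, cut at the byte offset, and count the characters that survive decoding.

```python
def byte_offset_to_character_offset(line: str, byte_offset: int) -> int:
    # Re-encode the line as utf-8, slice at the byte offset, and
    # count the characters in the decoded prefix.
    data = line.encode("utf-8")
    return len(data[:byte_offset].decode("utf-8", errors="replace"))

print(byte_offset_to_character_offset('1 + "Ĥellö Wörld"', 20))  # 17
```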

@ammaraskar
Author

So there's a weird edge case here around non-utf-8 encoded source files that we need to consider if we want to add special handling:

Consider a file like:

# -*- coding: cp437 -*-

f(4, "Hëllo World τΦΘ" for x in range(1))

Now in the compiler I think this ends up magically getting handled because the tokenizer re-encodes the source file's lines to utf-8. When it goes into the pegen error handling code, the call to PyErr_ProgramTextObject fails, because it tries to decode the raw line from the file as utf-8 and the cp437 bytes are not valid utf-8. That failure invokes the fallback logic for getting the line, which reads from the tokenizer's buffer and decodes it as utf-8; this succeeds because the tokenizer already re-encoded the line to utf-8. Thus the assumption in byte_offset_to_character_offset that the line is utf-8 holds, and it all works.
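
The encoding-detection half of this mechanism is observable from Python via `tokenize.detect_encoding` (a sketch of the behavior, not the C tokenizer itself). Byte `0x9b` is `¢` in cp437 but is invalid as a lone byte in utf-8:

```python
import io
import tokenize

# 0x9b is '¢' in cp437; on its own it is not valid utf-8
raw = b"# -*- coding: cp437 -*-\nx = '\x9b\x9b'\n"
encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)

print(encoding)                              # cp437
print(raw.splitlines()[1].decode(encoding))  # x = '¢¢'
```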

(As an aside, I think there's a bug here if we use a custom # -*- coding: that encodes characters differently from utf-8 but still parses as valid utf-8.)

However, this case is a bit trickier to handle in the traceback machinery. In the traceback code we only know the filename and the offsets; we can't rely on the tokenizer to have done the work of figuring out the proper source encoding for the file. I wonder if, since this is such an edge case, we should just assume it's utf-8 and decode in replace mode, or whether we should actually handle it properly.
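
The assume-utf-8 option would be a one-liner (a hypothetical helper for illustration, not existing CPython code): undecodable bytes become U+FFFD instead of raising, so the caret can still be printed.

```python
def decode_source_line(raw_line: bytes) -> str:
    # Assume utf-8; invalid bytes are replaced with U+FFFD rather
    # than raising, so rendering the caret never fails.
    return raw_line.decode("utf-8", errors="replace")

print(decode_source_line(b"f(4, '\x9b' for x in range(1))"))
```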

@ammaraskar
Author

> (As an aside, I think there's a bug here if we use a custom # -*- coding: that encodes characters differently from utf-8 but still parses as valid utf-8.)

I was able to trigger this bug:

# -*- coding: cp437 -*-

f(4, '¢¢¢¢¢¢' for x in range(1)) # Test
  File "test-weird-encoding.py", line 3
    f(4, '¢¢¢¢¢¢' for x in range(1)) # Test
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized
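
The overlong caret span is just the difference in byte width of `¢`: one byte in cp437 (the file's actual bytes), two bytes in the tokenizer's internal utf-8 buffer, so byte offsets computed against one don't line up with the other. A quick check:

```python
line = "f(4, '¢¢¢¢¢¢' for x in range(1)) # Test"

in_cp437 = line.encode("cp437")  # the bytes actually in the file
in_utf8 = line.encode("utf-8")   # what the tokenizer keeps internally

# cp437 is a single-byte encoding, so byte offsets equal character
# offsets there; each '¢' costs two bytes in utf-8, shifting columns.
print(len(in_cp437), len(in_utf8))  # the utf-8 form is 6 bytes longer
```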

Let me make a bpo issue for it.

@pablogsal

Excellent find @ammaraskar. This is going to be a bit of a pain to fix, I am afraid :(

@pablogsal

This also affects older versions of the interpreter; it is basically a bug in PyErr_ProgramTextObject:

  File "/Users/pgalindo3/github/cpython/lel.py", line 3
    f('¢¢¢¢¢¢', 4 for x in range(1)) # Test
                                                                ^
SyntaxError: Generator expression must be parenthesized

@ammaraskar
Author

bpo issue created: https://bugs.python.org/issue44349

@ammaraskar
Author

I think for our case in the traceback, since this is such a weird edge case, we should probably just assume the input is utf-8 and ignore/replace decoding errors? @isidentical would it be possible to quickly check with your AST tooling how many Python packages use a custom # -*- coding: line that isn't utf-8?

@pablogsal

> I think for our case in the traceback, since this is such a weird edge case, we should probably just assume the input is utf-8 and ignore/replace decoding errors?

I very much think so. The reason is that the best we can do is print the error as utf-8, and while that is still slightly weird, it is consistent with the rest of the interpreter. That is what I did here:

python#26611

@pablogsal

pablogsal commented Jun 8, 2021

> @isidentical would it be possible to quickly check with your AST tooling how many Python packages use a custom # -*- coding: line that isn't utf-8?

Also, notice that the problem happens when the encoding is not utf-8 and the line with the error contains something that is decodable as utf-8. But this also happens with older versions when reporting any error with a caret, as the initial position will also be wrong.

On the other hand, we could add some code to make sure we are decoding a utf-8 file by checking the BOM, and not show the caret if that's not true.
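
A BOM check is straightforward (a sketch of the idea only; whether to also honor the coding cookie is a separate question, and the helper name is hypothetical):

```python
import codecs

def starts_with_utf8_bom(raw: bytes) -> bool:
    # Only report True when the file explicitly starts with a utf-8 BOM;
    # under the proposal, no BOM would mean no caret.
    return raw.startswith(codecs.BOM_UTF8)

print(starts_with_utf8_bom(codecs.BOM_UTF8 + b"print(1)\n"))  # True
print(starts_with_utf8_bom(b"# -*- coding: cp437 -*-\n"))     # False
```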

@ammaraskar
Author

> However, this case is a bit trickier to handle in the traceback machinery.

Aha, I spoke too soon! The traceback machinery is already set up to detect the encoding so we are all set here 😅
https://github.com/python/cpython/blob/main/Python/traceback.c#L414
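
The equivalent behavior is exposed in Python as `tokenize.open`, which honors both the BOM and the coding cookie when decoding a source file:

```python
import os
import tempfile
import tokenize

src = "# -*- coding: cp437 -*-\nx = '¢¢'\n"
fd, path = tempfile.mkstemp(suffix=".py")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(src.encode("cp437"))
    # tokenize.open detects the coding cookie and decodes accordingly
    with tokenize.open(path) as f:
        text = f.read()
finally:
    os.remove(path)

print("¢¢" in text)  # True
```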
