Carets/offsets are wrong for unicode characters #10
Maybe we need to use something like https://github.com/python/cpython/blob/main/Parser/pegen.c#L143. On the other hand, we handle this correctly for syntax errors, and the numbers used there are in fact the same.
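For reference, the linked pegen helper converts a byte column into a character column. A rough Python sketch of that C logic (not the actual implementation) looks like this:

```python
def byte_offset_to_character_offset(line: str, col_offset: int) -> int:
    """Sketch of the C helper linked above: re-encode the line as UTF-8,
    take the byte prefix up to col_offset, and count the characters it
    decodes to. errors="replace" guards against slicing mid-character."""
    as_utf8 = line.encode("utf-8")
    return len(as_utf8[:col_offset].decode("utf-8", errors="replace"))
```

For a line like `'"é" + f()'`, byte column 7 (the `f`) maps back to character column 6, since `é` occupies two UTF-8 bytes.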
Huh, I wonder if there is some magic going on in the syntax error handling, because the AST column offsets are definitely byte-based rather than character-based.
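A quick way to see the byte-based offsets (the source string here is just for illustration):

```python
import ast

# The AST reports col_offset as a UTF-8 byte offset, not a character index.
src = '"é" + f()'               # "é" is one character but two UTF-8 bytes
call = ast.parse(src, mode="eval").body.right   # the f() call
print(call.col_offset)          # -> 7 (byte offset of "f")
print(src.index("f"))           # -> 6 (character offset of "f")
```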
Surprise surprise:
So there's a weird edge case here around non-UTF-8 encoded source files that we need to consider if we want to add special handling. Consider a file like:

```python
# -*- coding: cp437 -*-
f(4, "Hëllo World τΦΘ" for x in range(1))
```

Now, in the compiler, I think this ends up magically getting handled because the tokenizer re-encodes the source file lines to UTF-8. When it goes into the pegen error handling code, this call to … (as an aside, I think there's a bug here if we use a custom …).

However, this case is a bit trickier to handle in the traceback machinery. In the traceback code we really only know the filename and the offsets; we can't rely on the tokenizer to have done the work of figuring out the proper source encoding for the file. I wonder if we should just assume it's UTF-8 and maybe decode in the …
I was able to trigger this bug:

```python
# -*- coding: cp437 -*-
f(4, '¢¢¢¢¢¢' for x in range(1))  # Test
```

Let me make a bpo issue for it.
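A self-contained way to run the reproducer above (the temp-file scaffolding is just illustrative): write the source with its cp437 coding cookie, then execute it in a subprocess so the interpreter has to render the caret itself.

```python
import subprocess
import sys
import tempfile

# The file must actually be encoded as cp437 for the coding cookie to matter.
src = "# -*- coding: cp437 -*-\nf(4, '¢¢¢¢¢¢' for x in range(1))  # Test\n"
with tempfile.NamedTemporaryFile("w", suffix=".py", encoding="cp437",
                                 delete=False) as fh:
    fh.write(src)

proc = subprocess.run([sys.executable, fh.name], capture_output=True)
# Decode leniently: with the bug, the caret line may be sliced mid-character.
print(proc.stderr.decode("utf-8", errors="replace"))
```

The generator expression as a non-sole argument is a SyntaxError, so the interpreter has to print the offending line with a caret, and the offsets go wrong.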
Excellent find @ammaraskar. This is going to be a bit of a pain to fix, I am afraid :(
This also affects older versions of the interpreter; it is basically a bug in …
bpo issue created: https://bugs.python.org/issue44349
I think for our case in the traceback, since this is such a weird edge case, we should probably just assume the input is UTF-8 and ignore/replace decoding errors. @isidentical, would it be possible to quickly check with your AST tooling how many Python packages use a custom …?
I very much think so. The reason is that the best we can do is print the error as UTF-8; that is still slightly weird, but it is consistent with the rest of the interpreter. That is what I did here:
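Decoding as UTF-8 with replacement characters is the usual lenient fallback. For example, with the cp437 bytes of `¢` (this snippet is illustrative only, not the actual patch):

```python
# cp437 encodes '¢' as the single byte 0x9B, which is not a valid UTF-8
# sequence on its own, so a lenient decode yields U+FFFD replacement
# characters instead of raising UnicodeDecodeError.
raw_line = "f(4, '¢¢' for x in range(1))".encode("cp437")
printable = raw_line.decode("utf-8", errors="replace")
print(printable)  # the two ¢ become \ufffd replacement characters
```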
Also, notice that the problem happens when the encoding is not UTF-8 and the line with the error has something that is decodable as UTF-8. But this also happens with older versions when reporting any error with a caret, as the initial position will be wrong as well. On the other hand, we could add some code to make sure we are decoding a UTF-8 file by checking the BOM, and not show the caret if that's not the case.
Aha, I spoke too soon! The traceback machinery is already set up to detect the encoding, so we are all set here 😅
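Indeed, the stdlib exposes the detection the traceback machinery can lean on: `tokenize.detect_encoding` honours both the PEP 263 coding cookie and a UTF-8 BOM.

```python
import io
import tokenize

# detect_encoding reads at most the first two lines of the byte stream and
# returns the declared encoding plus the raw lines it consumed.
source = b"# -*- coding: cp437 -*-\nf(4, '\x9b\x9b' for x in range(1))\n"
encoding, read_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # -> cp437
```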
Using the code:

…

leads to:

…