Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeDecodeError when using cp-1252 file enconding #2612

Closed
stefan-hoelzl opened this issue Dec 19, 2023 · 6 comments · Fixed by #2657
Closed

UnicodeDecodeError when using cp-1252 file enconding #2612

stefan-hoelzl opened this issue Dec 19, 2023 · 6 comments · Fixed by #2657
Assignees
Labels
Type: Bug Something isn't working

Comments

@stefan-hoelzl
Copy link

How do you use Sentry?

Sentry Saas (sentry.io)

Version

1.39.1

Steps to Reproduce

When I run the following script

# -*- coding: cp1252 -*-
import sentry_sdk
sentry_sdk.init(dsn=SENTRY_DSN)
raise RuntimeError("Ä")

I do not get any Issues reported

I poked around a bit and found out it is because of this line

length = len(value.encode("utf-8"))

a UnicodeDecodeError gets raised here.

I am using Python 2.7

Expected Result

The RuntimeError appears as Issue in the web console.

Actual Result

No Issue appears

@mcuppi
Copy link

mcuppi commented Dec 19, 2023

On Python 2, this error can also occur when Unicode characters are present in the source code.

When Sentry captures an exception, it retrieves the source code surrounding the error. During this process, Sentry checks the length of each line and may truncate it using strip_string()1:

def strip_string(value, max_length=None):
    # type: (str, Optional[int]) -> Union[AnnotatedValue, str]
    if not value:
        return value

    if max_length is None:
        max_length = DEFAULT_MAX_VALUE_LENGTH

    length = len(value.encode("utf-8"))

    if length > max_length:
        return AnnotatedValue(
            value=value[: max_length - 3] + "...",
            metadata={
                "len": length,
                "rem": [["!limit", "x", max_length - 3, max_length]],
            },
        )
    return value

The problematic line here is this one: UnicodeDecodeError: length = len(value.encode("utf-8")). This line was modified on October 27, 2022, changing from length = len(value) to length = len(value.encode("utf-8"))2. The reason for the change is that a Unicode string returns its length in characters, not bytes3:

# -*- coding: utf-8 -*-

print(len(u"••••")) # -> 4
print(len("••••")) # -> 12

The proposed solution was to use encode("utf-8") to convert a Unicode string into bytes and then measure its length. For example:

print(len(u"••••".encode("utf-8"))) # -> 12

However, this approach causes an issue: if the value is a byte string containing Unicode characters, an exception will be raised:

print(len("••••".encode("utf-8"))) # -> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

To summarize, if a Python 2 project uses Unicode characters in parts of its source code, then Sentry might raise an exception when capturing an event. Sentry's strip_string() needs to be changed for Python 2 compatibility.

Footnotes

  1. https://github.com/getsentry/sentry-python/blob/1.39.1/sentry_sdk/utils.py#L1121-L1139

  2. https://github.com/getsentry/sentry-python/commit/e2674d4006df4f50b82cb41405f5d78ab18a2719

  3. https://github.com/getsentry/sentry-python/issues/1691

@antonpirker antonpirker added the Type: Bug Something isn't working label Dec 20, 2023
@antonpirker
Copy link
Member

Hey @stefan-hoelzl and @mcuppi thanks for reporting this.

We are currently in the process to drop support for Python 2.x. (#2581) so this bug will probably never be fixed. Sorry.

@mcuppi
Copy link

mcuppi commented Dec 20, 2023

@antonpirker; Thank you for the quick reply. I understand that Python 2 is end-of-life and support is being phased out universally, but I think this bug is pretty severe.

This bug not only masks the original exception, but it also raises an entirely new one.

Currently, sentry-sdk is advertised as supporting Python 2.7+1. Given that many projects rely on the accuracy and reliability of claimed version support, it seems inappropriate not to address this issue. Especially since version 2.0 isn't even out yet.

The fix is probably as simple as something like this (pseudo code):

if isinstance(value, binary_type):
    length = len(value)
else:
    length = len(value.encode("utf-8"))

Footnotes

  1. https://web.archive.org/web/20231203160847/https://docs.sentry.io/platforms/python/integrations/django/#supported-versions

@antonpirker
Copy link
Member

Lets see if we can flash out some time in January to address this.
PRs are also always welcome, so if you want to give it a go you can submit a PR that fixes this problem, we are happy to review and merge it!

@mcuppi
Copy link

mcuppi commented Dec 21, 2023

@antonpirker; Thank you. I'll see if I can submit one soon.

@sentrivana
Copy link
Contributor

Hey folks, a fix for this will be out tomorrow in 1.40.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug Something isn't working
Projects
Archived in project
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants