Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle UnicodeDecodeError #77

Closed
domenukk opened this issue Jul 18, 2020 · 14 comments
Closed

Handle UnicodeDecodeError #77

domenukk opened this issue Jul 18, 2020 · 14 comments

Comments

@domenukk
Copy link
Member

If we pass illegal utf-8 into ensure_unicode, teams should be Mumble, but a checker error occurs.

@MMunier
Copy link
Member

MMunier commented Jul 20, 2020

I'd disagree that the checker should handle that, since in most cases UnicodeDecodeErrors mean that the checker-script is kind of broken. Seeing a random checker error appear is exactly the right response to that.

@ldruschk
Copy link
Member

The question is would there be any situation where the checker would not raise a BrokenServiceException as a consequence of this? I would let the ensure_unicode function raise UnicodeDecodeErrors so that checker authors can handle them if they want, but would catch them in the run-method and raise a BrokenServiceException there based on the assumption that if this was not what should be raised, the checker author would have handled that explicitly.

@Trolldemorted
Copy link
Member

The question is would there be any situation where the checker would not raise a BrokenServiceException as a consequence of this?

Any service that doesn't really care about unicode and displays user input may output invalid unicode output. The checker error has the advantage of us knowing that something is wrong, where a mumbleexception would rely on team complaints.

@domenukk
Copy link
Member Author

While I do agree in general, we should probably document this in the ensure_unicode function, it was not obvious to me at least. Or even strip invalid chars automatically?
Also we could think about renaming it to ensure_str, now that py2 is no longer supported

@ldruschk
Copy link
Member

Unfortunately there is no standardized way to specify thrown exceptions using type annotations, the best way would be putting this into the docstring raising the question whether people actually read that.

@domenukk
Copy link
Member Author

How about removing illegal chars with a logger warning? That way the checker won't fail unless some later comparison fails(?)
I don't see obvious downsides, however it could be used to attack other teams without breaking the checks which would be kinda hilarious

@ldruschk
Copy link
Member

If we go this way I would make this an configurable (and off by default) option, otherwise this is not something one would expect from the function and sounds like something that will lead to a weird bug somewhere down the line (which will be hard to find for the checker author since they would not expect such behavior).

Also, stripping bytes which are not invalid UTF-8 is neither trivial nor necessarily deterministic, If you have a start byte of 2 byte sequence and the two continuation bytes, which one do you strip? If you have a start byte of a 3 byte sequence followed by a valid 2 byte sequence (meaning there are two start bytes following each other), do you strip only the first byte or the whole three bytes?

But imo even having this as a configurable option does not seem very useful, if you don't care about the invalid encoded bytes you could simply encode the UTF-8 string you are trying to find and search for bytes in bytes without ever having to deal with UTF-8 decoding issues.

@domenukk
Copy link
Member Author

Not all string functions exist for bytes and just unicode allthethings makes it easy. But yeah lets add a warning to docstring then?

@Trolldemorted
Copy link
Member

Did you consider removing it? If you know the argument's type it can be replaced in one line (either by decode or str), and if you don't you should solve that problem instead of hoping ensure_unicode gets it right.

@domenukk
Copy link
Member Author

Any new ideas on this topic? Else I would close this issue, and hope for the best.

@ldruschk
Copy link
Member

Is ensure_unicode even used much? If you know whether you are dealing with bytes or a string (which in my opinion is not as confusing these days with python3 as it was when supporting both, python2 and python3) you can just explicitly encode/decode one or the other (or both), in which case you would have to deal with these errors yourself anyways.

So I would agree that this issue can be closed, but it might be worth considering deprecating the function altogether

@ldruschk
Copy link
Member

Oh, and by deprecating I mean removing, since basically all checkers were broken by the update to v2 anyways and would need to be reworked for other reasons, during which we could simply fix checkers using ensure_unicode/ensure_bytes

@domenukk
Copy link
Member Author

domenukk commented Apr 26, 2021

I agree, I will open a PR to remove ensure_unicode, as it's a potential footgun, however I would keep ensure_bytes around for now.
It is used internally for SimpleSocket and makes it more convenient to use in my eyes (since bytestrings do not support f strings...), with no obvious drawback (like DecodeErrors).

@domenukk
Copy link
Member Author

Openend #108 for this. Closing this issue now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants