Handle `UnicodeDecodeError` #77

domenukk · 2020-07-18T13:28:04Z

If we pass illegal utf-8 into ensure_unicode, teams should be Mumble, but a checker error occurs.

The text was updated successfully, but these errors were encountered:

MMunier · 2020-07-20T11:40:14Z

I'd disagree that the checker should handle that, since in most cases UnicodeDecodeErrors mean that the checker-script is kind of broken. Seeing a random checker error appear is exactly the right response to that.

ldruschk · 2020-07-20T11:44:46Z

The question is would there be any situation where the checker would not raise a BrokenServiceException as a consequence of this? I would let the ensure_unicode function raise UnicodeDecodeErrors so that checker authors can handle them if they want, but would catch them in the run-method and raise a BrokenServiceException there based on the assumption that if this was not what should be raised, the checker author would have handled that explicitly.

Trolldemorted · 2020-07-20T19:43:00Z

The question is would there be any situation where the checker would not raise a BrokenServiceException as a consequence of this?

Any service that doesn't really care about unicode and displays user input may output invalid unicode output. The checker error has the advantage of us knowing that something is wrong, where a mumbleexception would rely on team complaints.

domenukk · 2020-07-20T20:44:14Z

While I do agree in general, we should probably document this in the ensure_unicode function, it was not obvious to me at least. Or even strip invalid chars automatically?
Also we could think about renaming it to ensure_str, now that py2 is no longer supported

ldruschk · 2020-07-20T20:58:53Z

Unfortunately there is no standardized way to specify thrown exceptions using type annotations, the best way would be putting this into the docstring raising the question whether people actually read that.

domenukk · 2020-07-20T21:33:31Z

How about removing illegal chars with a logger warning? That way the checker won't fail unless some later comparison fails(?)
I don't see obvious downsides, however it could be used to attack other teams without breaking the checks which would be kinda hilarious

ldruschk · 2020-07-20T21:53:35Z

If we go this way I would make this an configurable (and off by default) option, otherwise this is not something one would expect from the function and sounds like something that will lead to a weird bug somewhere down the line (which will be hard to find for the checker author since they would not expect such behavior).

Also, stripping bytes which are not invalid UTF-8 is neither trivial nor necessarily deterministic, If you have a start byte of 2 byte sequence and the two continuation bytes, which one do you strip? If you have a start byte of a 3 byte sequence followed by a valid 2 byte sequence (meaning there are two start bytes following each other), do you strip only the first byte or the whole three bytes?

But imo even having this as a configurable option does not seem very useful, if you don't care about the invalid encoded bytes you could simply encode the UTF-8 string you are trying to find and search for bytes in bytes without ever having to deal with UTF-8 decoding issues.

domenukk · 2020-07-20T22:00:12Z

Not all string functions exist for bytes and just unicode allthethings makes it easy. But yeah lets add a warning to docstring then?

Trolldemorted · 2020-07-21T09:29:43Z

Did you consider removing it? If you know the argument's type it can be replaced in one line (either by decode or str), and if you don't you should solve that problem instead of hoping ensure_unicode gets it right.

domenukk · 2021-04-26T14:45:59Z

Any new ideas on this topic? Else I would close this issue, and hope for the best.

ldruschk · 2021-04-26T14:57:49Z

Is ensure_unicode even used much? If you know whether you are dealing with bytes or a string (which in my opinion is not as confusing these days with python3 as it was when supporting both, python2 and python3) you can just explicitly encode/decode one or the other (or both), in which case you would have to deal with these errors yourself anyways.

So I would agree that this issue can be closed, but it might be worth considering deprecating the function altogether

ldruschk · 2021-04-26T15:01:19Z

Oh, and by deprecating I mean removing, since basically all checkers were broken by the update to v2 anyways and would need to be reworked for other reasons, during which we could simply fix checkers using ensure_unicode/ensure_bytes

domenukk · 2021-04-26T15:15:08Z

I agree, I will open a PR to remove ensure_unicode, as it's a potential footgun, however I would keep ensure_bytes around for now.
It is used internally for SimpleSocket and makes it more convenient to use in my eyes (since bytestrings do not support f strings...), with no obvious drawback (like DecodeErrors).

domenukk · 2021-04-26T15:19:05Z

Openend #108 for this. Closing this issue now.

ldruschk added the discussion label Jul 22, 2020

domenukk closed this as completed Apr 26, 2021

This was referenced May 3, 2021

Rust checker enowars/enowars-service-example#10

Merged

Unicode Error Should Not Implicitly Lead to Mumble enowars/enochecker-rs#2

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle `UnicodeDecodeError` #77

Handle `UnicodeDecodeError` #77

domenukk commented Jul 18, 2020

MMunier commented Jul 20, 2020

ldruschk commented Jul 20, 2020

Trolldemorted commented Jul 20, 2020

domenukk commented Jul 20, 2020

ldruschk commented Jul 20, 2020

domenukk commented Jul 20, 2020

ldruschk commented Jul 20, 2020

domenukk commented Jul 20, 2020

Trolldemorted commented Jul 21, 2020

domenukk commented Apr 26, 2021

ldruschk commented Apr 26, 2021

ldruschk commented Apr 26, 2021

domenukk commented Apr 26, 2021 •

edited

Loading

domenukk commented Apr 26, 2021

Handle UnicodeDecodeError #77

Handle UnicodeDecodeError #77

Comments

domenukk commented Jul 18, 2020

MMunier commented Jul 20, 2020

ldruschk commented Jul 20, 2020

Trolldemorted commented Jul 20, 2020

domenukk commented Jul 20, 2020

ldruschk commented Jul 20, 2020

domenukk commented Jul 20, 2020

ldruschk commented Jul 20, 2020

domenukk commented Jul 20, 2020

Trolldemorted commented Jul 21, 2020

domenukk commented Apr 26, 2021

ldruschk commented Apr 26, 2021

ldruschk commented Apr 26, 2021

domenukk commented Apr 26, 2021 • edited Loading

domenukk commented Apr 26, 2021

Handle `UnicodeDecodeError` #77

Handle `UnicodeDecodeError` #77

domenukk commented Apr 26, 2021 •

edited

Loading