autodecode should use replacementDchar rather than throwing on invalid #9777

Open
dlangBugzillaToGithub opened this issue Aug 15, 2019 · 5 comments


bugzilla (@WalterBright) reported this on 2019-08-15T22:51:17Z

Transferred from https://issues.dlang.org/show_bug.cgi?id=20134

Description

Currently, when the autodecoder encounters an invalid UTF sequence, it throws a UTFException. In contrast, the byUTF conversions return a replacementDchar instead, as does foreach() when decoding.

This enhancement would make the behavior consistent, with the additional benefits of character processing being @nogc and nothrow and even pure.
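
For illustration only (this is not part of the original report), here is a minimal sketch of the difference described above, assuming a Phobos release in which auto-decoding still throws. The helper names `malformed`, `strictFrontThrows`, and `lossyHasReplacement` are hypothetical; `front`, `byDchar`, `replacementDchar`, and `UTFException` are the existing Phobos symbols.

```d
import std.algorithm.searching : canFind;
import std.range.primitives : front;
import std.utf : byDchar, replacementDchar, UTFException;

/// Hypothetical helper: 'a', a lone continuation byte 0x80
/// (malformed UTF-8), then 'b'.
string malformed()
{
    immutable(ubyte)[] raw = [0x61, 0x80, 0x62];
    return cast(string) raw;
}

/// True if the auto-decoding `front` throws UTFException on `s`.
bool strictFrontThrows(string s)
{
    try
    {
        cast(void) s.front; // decodes the first code point of `s`
        return false;
    }
    catch (UTFException)
    {
        return true;
    }
}

/// True if decoding `s` through byDchar yields U+FFFD somewhere.
bool lossyHasReplacement(string s)
{
    return s.byDchar.canFind(replacementDchar);
}

void main()
{
    string s = malformed();

    // Auto-decoding: `front` throws once the slice starts at the bad byte.
    assert(strictFrontThrows(s[1 .. $]));

    // byDchar: the bad byte is replaced and iteration continues.
    assert(lossyHasReplacement(s));

    // Well-formed input behaves the same either way.
    assert(!strictFrontThrows("ab"));
    assert(!lossyHasReplacement("ab"));
}
```

The second helper can be `nothrow` in principle because nothing on its path needs to construct an exception, which is the consistency and attribute benefit the report refers to.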

bugzilla (@WalterBright) commented on 2019-08-15T22:58:29Z

Over time, common practice has evolved from rejecting malformed UTF to replacing it with replacementDchar, which enables the application (like a web browser) to continue processing.

Code should also be faster with this change.


dlang-bot commented on 2019-08-16T00:01:58Z

@WalterBright updated dlang/phobos pull request #7144 "fix Issue 20134 - autodecode should use replacementDchar rather than throwing on invalid" fixing this issue:

- fix Issue 20134 - autodecode should use replacementDchar rather than throwing on invalid

https://github.com/dlang/phobos/pull/7144


dlang-bugzilla (@CyberShadow) commented on 2019-08-16T00:20:47Z

(In reply to Walter Bright from comment #1)
> Over time, common practice has evolved from rejecting malformed UTF to
> replacing it with replacementDchar, which enables the application (like a
> web browser) to continue processing.

In applications where not crashing is preferable to corrupting data, yes, but I don't think we can make that decision in place of the user. Corrupted data spreads and seeps into archives and can be very hard to rectify once it's discovered, but crashes are immediately visible and usually easily fixable.

> Code should also be faster with this change.

So should either assuming that the strings are valid, or throwing Errors instead of Exceptions, right?


dlang-bugzilla (@CyberShadow) commented on 2019-08-16T00:22:59Z

(In reply to Walter Bright from comment #1)
> Over time, common practice has evolved from rejecting malformed UTF to
> replacing it with replacementDchar, which enables the application (like a
> web browser) to continue processing.

BTW, I don't think this is quite correct. Web browsers both raise an error (in the dev console) AND continue processing. By using replacementDchar implicitly, D programs would not know that there was ever a problem.


jrdemail2000-dlang commented on 2019-08-16T05:47:08Z

Correct handling of invalid UTF sequences is often known only by the application. That is, it is task dependent. And in some applications, the appropriate handling may not be known until runtime, making compile-time decisions problematic.

A related piece of the puzzle is that in many high performance string processing applications, it is useful to switch between modes of processing where strings are handled as bytes for some algorithms, then switch back to modes where strings are character sequences. When operating as bytes, UTF interpretation is not needed or desired (so no detection of invalid UTF sequences). But when algorithms are operating on characters, then invalid UTF detection/handling is desired/required. (Note: Many of these algorithms are possible because ASCII characters in UTF-8 can be used as single byte markers without interpretation of other parts of the byte stream.)
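
As an illustration of the byte/character mode switch described above (a sketch, not code from the commenter), the following splits a line on an ASCII tab at the byte level and validates a field only when it is about to be treated as text. The function names `countFields` and `isWellFormed` are hypothetical; `representation`, `splitter`, `walkLength`, and `validate` are existing Phobos functions.

```d
import std.algorithm.iteration : splitter;
import std.range : walkLength;
import std.string : representation;
import std.utf : validate, UTFException;

/// Byte mode: counts tab-separated fields by scanning raw code units.
/// No UTF decoding happens, so malformed bytes are never inspected.
size_t countFields(string line)
{
    return line.representation.splitter(cast(ubyte) '\t').walkLength;
}

/// Character mode: checks a single field for well-formed UTF-8
/// before it is handled as text.
bool isWellFormed(string field)
{
    try
    {
        validate(field);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "a", TAB, a lone continuation byte 0x80, TAB, "c"
    immutable(ubyte)[] raw = [0x61, 0x09, 0x80, 0x09, 0x63];
    string line = cast(string) raw;

    // The byte-level split is unaffected by the malformed field.
    assert(countFields(line) == 3);

    // Only the field that is actually decoded needs a policy decision.
    assert(isWellFormed(line[0 .. 1]));
    assert(!isWellFormed(line[2 .. 3]));
}
```

Splitting on the tab byte is safe because ASCII bytes never occur inside a multi-byte UTF-8 sequence, which is the property the note at the end of the paragraph relies on.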

This makes it difficult for libraries to implement a single policy and still nicely support the wide range of application use-cases. Especially when there may be many layers of code between the application layer making a call and the lower level function where opportunity for detection occurs.

As an application developer, what I'd really like to have is a magical context object where the current detection and handling policies are set, and have all code invoked within the scope of that object obey them. I'd gladly take a performance hit to get it. This may be too big a change, but it's worth considering how well other solutions compare from an application development perspective.

@LightBender LightBender removed the P4 label Dec 6, 2024