fix Issue 20134 - autodecode should use replacementDchar rather than throwing on invalid #7144
Thanks for your pull request, @WalterBright! Bugzilla references
Testing this PR locally: If you don't have a local development environment set up, you can use Digger to test this PR: dub run digger -- build "master + phobos#7144" |
2e4420a to aed9a33
aed9a33 to 97c5ca2
Yes it does. |
It needs to be in the commit message. GitHub pull request titles don't end up in git history and remain only on GitHub. |
97c5ca2 to 5958320
Pretty sure we discussed this a few years ago and concluded it was a bad idea? As it can lead to silent data corruption.
Generally any "sanitization" where garbage (which, really, is almost always just data that's in the wrong format for some reason or another) is replaced with some replacement dummy placeholder needs to be opt in.
I amended it. Still no joy. |
|
Here is the discussion: |
|
Looks like it needed a reopen to see the new commit message. |
|
I think the correct fix here is to throw an Error, not an Exception. Allowing strings (which should have been sanitized at the inputs) to get all the way to any autodecoding machinery is a program bug. |
|
Hmm, I had thought my PR to fix foreach() had gone through. |
|
Curiously, the foreach() throws a UnicodeException while std.utf throws UTFException. Right hand, meet left hand. |
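The mismatch described above can be observed with a small sketch. The helper names `thrownByForeach` and `thrownByStdUtf` are hypothetical and exist only for this illustration; the sketch assumes a compiler and runtime where the decoding `foreach` still throws on invalid input, as it did when this comment was written.

```d
import std.utf : decode;

/// Class name of whatever a decoding `foreach` throws, or null if it completes.
/// (Hypothetical helper, for illustration only.)
string thrownByForeach(string s)
{
    try
    {
        foreach (dchar c; s) {} // decoding here is done by druntime
    }
    catch (Exception e)
    {
        return typeid(e).name;
    }
    return null;
}

/// Same, but decoding explicitly through std.utf.decode.
/// (Hypothetical helper, for illustration only.)
string thrownByStdUtf(string s)
{
    try
    {
        size_t i = 0;
        while (i < s.length)
            decode(s, i); // advances i past each decoded code point
    }
    catch (Exception e)
    {
        return typeid(e).name;
    }
    return null;
}

void main()
{
    import std.stdio : writeln;

    // 0x80 is a stray continuation byte, so this is not well-formed UTF-8.
    immutable ubyte[] raw = [0x61, 0x80, 0x62];
    auto bad = cast(string) raw;

    writeln("foreach: ", thrownByForeach(bad));
    writeln("std.utf: ", thrownByStdUtf(bad));
}
```

Note that `std.utf.UTFException` derives from `core.exception.UnicodeException`, so a handler for the latter also catches the former, but not the other way around.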
Looks like |
|
While your argument that this change could result in data loss, where someone was relying on invalid UTF data and read it and then wrote it back out, is pedantically true, I am doubtful it matters in practice. Consider:
And lastly, we cannot remove autodecode and yet keep these exceptions. |
Yep, but probably because no one's using D for anything too serious in that area, or if they are, they're not using Phobos.
I don't see how that's related! D programs can be processing all kinds of data using strings, and it may not be necessarily "text files" that you would open in a text editor. (Also decent text editors don't do that any more)
Is that actually true? Pretty sure that's not what Rust does, for example. I think the advice only applies when displaying strings, not when converting them in general.
Maybe they don't need to be caught! The program crashing with UnicodeException / UTFException is a good signal that there is something wrong with the input and it needs to be fixed.
That's bad but I don't know how to fix it. For something as low-level as implicit autodecoding maybe we can do something like |
|
BTW, a problem with the current design of Edit: Also, it would prevent the errors from propagating. As soon as you tried to put (encode) the invalid char/dchar into a string, the encoder will throw and catch the bug... |
|
I'm in favor of this change in general as it unblocks a common problem: string algorithms are hard to do in nothrow and @nogc (dip1008 is still not a thing). However, we at least need to provide a range that allows opt-in validation (i.e. expose the current front/back etc.) to give users a choice. |
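As a rough sketch of the choice being asked for, the two policies can already be expressed with existing `std.utf` functions: `validate` for opt-in strict checking and `byDchar` for a non-throwing walk. The function names `isWellFormed` and `countCodePoints` are hypothetical, and the `nothrow @nogc` annotation assumes those attributes are inferred for `byDchar`, which is the stated purpose of that range.

```d
import std.utf : byDchar, validate, UTFException;

/// Opt-in strict check: reports whether `s` is well-formed UTF-8.
/// (Hypothetical helper, for illustration only.)
bool isWellFormed(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on the first invalid sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Non-throwing path: byDchar yields U+FFFD for invalid sequences,
/// so this loop needs neither exceptions nor the GC.
/// (Hypothetical helper, for illustration only.)
size_t countCodePoints(const(char)[] s) nothrow @nogc
{
    size_t n = 0;
    foreach (dchar c; s.byDchar)
        ++n;
    return n;
}

void main()
{
    // 0x80 is a stray continuation byte: one invalid code unit.
    immutable ubyte[] raw = [0x61, 0x80, 0x62];
    auto bad = cast(string) raw;

    assert(isWellFormed("h\u00E9llo"));
    assert(!isWellFormed(bad));

    assert(countCodePoints("h\u00E9llo") == 5);
    assert(countCodePoints(bad) == 3); // 'a', U+FFFD, 'b'
}
```

With this split, code that must not lose data calls the strict check at its input boundary, while inner algorithms can stay `nothrow @nogc`.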
|
Everywhere I look, I see that (unless the particular situation calls to do otherwise) the default or accepted practice is to signal an error on invalid UTF. Other programming languages do it. The standards recommend or demand it. It is very easy to find sources for this. I did not want to outright change the article to state the opposite of what it previously was implying. And, in this particular case, I disagree that removing that text did not improve the article. |
Except for @lesderid 's quote from the Unicode consortium. As for being default practice, that could very well be legacy, which is outdated now. Besides, D is hardly consistent with this. byUTF has been there for years, and did not throw. Nobody complained. Autodecoding does not work for ranges of char, which is why byUTF came about. |
We must be reading it differently. None of the text he quoted recommends one practice over the other.
This is a bold assumption. Do you have anything to back it up? Rust is newer than D and raises errors on bad UTF. You are saying that Rust is using legacy practice?
I agree. However, we are fixing the inconsistency in the wrong direction.
I have been pretty vocal that this is the wrong practice for five years now. Let me turn that around. |
I looked. Here are the top three search results I get on the forum for
Why is this still even an argument? |
It says both are acceptable.
Yes, but you deleted it.
It offers both forms, not a default as far as I can tell.
They come in the form of poor performance, not being nothrow, and using the gc. These are general problems with using exceptions for reporting errors. Having them at the root of all string processing just makes for problems. It's not the string processing itself that is made slow, it's everything that calls it because they have to build exception unwinding sections for the RAII objects. Exception unwinding code turns off data flow optimizations, register allocation, etc. I expect this is why Rust uses Option to report errors, and Herb Sutter has been working on a similar thing for C++. The 1990's sales job of "zero cost exceptions" has turned out to be a snow job. I anticipate that exceptions will retreat into legacy territory, and that Option types will become the preferred solution. We should get in front of this change before it drives over us. I've done a lot of work to remove gc dependencies from Phobos, and we should be working on ways to remove throwing exceptions, too. |
|
If your argument is standing solely on an uncited sentence on Wikipedia, well ...
Yes. But, to be fair, the functions are
All these are problems with D's model of handling exceptions, not with how auto-decoding treats bad UTF. Why not handle it in the same way as an out-of-memory error, then? You don't need to be able to catch the exception, because you can achieve the same effect by requesting decoding through a means that signals errors in other ways. |
What? No, it won't! There is exactly one valid encoding for any code point in UTF-8, UTF-16, and UTF-32. Normalization is completely unrelated to this.
That's completely beside the point. It could very well be any function that does need to auto-decode. Could be the same |
|
https://issues.dlang.org/show_bug.cgi?id=20140 Note how this change completely avoids the problem with invalid UTF sequences, pushing the decision up to the user to make an explicit choice. |
|
https://unicode.org/reports/tr15/ While I think normalization is a bug in Unicode's design, it certainly is a thing. |
Here is a better analogy. Why not handle it in the same way that integer division by zero is handled? This situation is actually quite similar to a proposal to make integer division by zero result in zero:
I agree with you 100% here: this is the correct solution. However, most unfortunately, we cannot do this in D everywhere, because of how deeply ingrained auto-decoding is. And, I don't know if that change to
Normalization is completely unrelated here because it happens on a different layer. Decoding, encoding, and autodecoding work with and transform code units. Normalization transforms code points. |
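The layering argument can be illustrated with a minimal sketch using `std.uni.normalize` and `std.string.representation`. The sample strings are chosen for illustration and do not come from the thread.

```d
import std.string : representation;
import std.uni : normalize, NFC;

void main()
{
    string composed = "\u00E9";    // one code point: U+00E9
    string decomposed = "e\u0301"; // two code points: U+0065 U+0301

    // Encoding layer: each code point has exactly one UTF-8 form.
    immutable ubyte[] composedBytes = [0xC3, 0xA9];
    immutable ubyte[] decomposedBytes = [0x65, 0xCC, 0x81];
    assert(composed.representation == composedBytes);
    assert(decomposed.representation == decomposedBytes);

    // Normalization layer: different code point sequences can be
    // canonically equivalent, and NFC maps one onto the other.
    assert(composed != decomposed);
    assert(normalize!NFC(decomposed) == composed);
}
```

Both strings are well-formed UTF-8, so neither would trigger a decoding error; the equivalence between them only exists at the code point level.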
I'm not so sure about that. I managed to fix quite a bit of std.string so it did not do autodecoding, with no effect on the user (I don't remember why I didn't do My evolved opinion is that an app should decide which encoding to use, and stick with it everywhere. Any decoding/encoding should be restricted to input/output. My further evolved opinion is that D should not have been agnostic about character types, and just been UTF-8 everywhere. UTF-16 and UTF-32 should be treated as library types only. |
|
BTW https://issues.dlang.org/show_bug.cgi?id=20140 is a fine example of what I'm talking about with regard to defining an error condition out of existence so Phobos can be made nothrow and @nogc. |
|
Agreed 100% there, that's great. Still, I disagree with the change in this pull request, because it allows mis-encoded data to be silently corrupted into unrecoverable errors and propagated before they're noticed, and because it is a change that will affect existing D programs. Please consider one of the alternative approaches discussed above. |
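The data-loss concern can be sketched as follows, using `std.utf.byDchar` as a stand-in for a replacement-based decoder. The helper name `lossyDecode` and the sample byte sequences are hypothetical and chosen only for illustration.

```d
import std.array : array;
import std.utf : byDchar, replacementDchar;

/// Decodes with replacement instead of throwing.
/// (Hypothetical helper, for illustration only.)
dchar[] lossyDecode(string s)
{
    return s.byDchar.array;
}

void main()
{
    // Two different inputs, each with one stray continuation byte.
    immutable ubyte[] rawA = [0x61, 0x80, 0x62];
    immutable ubyte[] rawB = [0x61, 0x81, 0x62];
    auto a = cast(string) rawA;
    auto b = cast(string) rawB;

    assert(a != b);

    // After replacement decoding the two are indistinguishable,
    // so the original bytes can no longer be recovered.
    assert(lossyDecode(a) == lossyDecode(b));
    assert(lossyDecode(a)[1] == replacementDchar);
}
```

Whether this matters in practice is exactly what the two sides of the thread disagree about; the sketch only shows that the mapping is not reversible.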
|
Ok I'll think about it. In the meantime, I could use help with the two failing tests - I have no idea what's wrong from reading the logs. Though the buildkite one might be caused by linking with an older library. |
|
DAutoTest is failing due to a network error. Just rebase and force-push. The relevant line from the log is:
|
|
I don't know what's wrong with buildkite, looks like it died before it even started. Probably a rebase will fix it. CC @wilzbach |
bb37989 to 4c8345b
|
just rebased |
|
It's very likely unrelated, but I would be careful with this PR here. |
|
Jumping in because of the FWIW I did run into an issue with I like @CyberShadow's approach here -- make it throw an Error. But most of this is moot if we just get rid of autodecoding altogether, which is the approach I'm taking. |
…throwing on invalid
4c8345b to 822ec98
|
I took the liberty to rebase this. @WalterBright @CyberShadow should we merge this? |
|
No, this is a breaking change that may result in silent data corruption. We've discussed this thoroughly. This change is also going against what modern languages are doing. @RazvanN7 Sorry to waste your work, but I'm going to go ahead and close this :) |
|
@CyberShadow No worries. I just wanted to get some resolution for this PR. Thanks! |
|
Yeah, I hope we can put this behind us, since I know @WalterBright feels very strongly about this (because it simplifies things a lot on the implementation side). But it is a really bad compromise, and it would put us in a situation that we can't get out of easily. |
https://issues.dlang.org/show_bug.cgi?id=20134