Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix Issue 20134 - autodecode should use replacementDchar rather than throwing on invalid #7144

Closed
wants to merge 1 commit into from

Conversation

WalterBright
Copy link
Member

@WalterBright WalterBright commented Aug 15, 2019

@dlang-bot
Copy link
Contributor

dlang-bot commented Aug 15, 2019

Thanks for your pull request, @WalterBright!

Bugzilla references

Auto-close Bugzilla Severity Description
20134 enhancement autodecode should use replacementDchar rather than throwing on invalid

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + phobos#7144"

@WalterBright WalterBright force-pushed the replacementDchar branch 3 times, most recently from 2e4420a to aed9a33 Compare August 15, 2019 23:17
@WalterBright WalterBright added the Phobos 2: No Autodecode Make Phobos work without autodecoding label Aug 15, 2019
@WalterBright WalterBright changed the title fix Issue 20134 - autodecode should use replacementDchar rather than throwing on invalid fix Issue 20134 - autodecode should use replacementDchar rather than throwing on invalid Aug 15, 2019
@WalterBright
Copy link
Member Author

Your PR doesn't reference any Bugzilla issue.

Yes it does.

@CyberShadow
Copy link
Member

Yes it does.

It needs to be in the commit message. GitHub pull request titles don't end up in git history and remain only on GitHub.

Copy link
Member

@CyberShadow CyberShadow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure we discussed this a few years ago and concluded it was a bad idea? As it can lead to silent data corruption.

Generally any "sanitization" where garbage (which, really, is almost always just data that's in the wrong format for some reason or another) is replaced with some replacement dummy placeholder needs to be opt in.

@WalterBright
Copy link
Member Author

It needs to be in the commit message.

I amended it. Still no joy.

@CyberShadow
Copy link
Member

Here is the discussion:
https://issues.dlang.org/show_bug.cgi?id=14519

@CyberShadow CyberShadow reopened this Aug 16, 2019
@CyberShadow
Copy link
Member

Looks like it needed a reopen to see the new commit message.

@CyberShadow
Copy link
Member

I think the correct fix here is to throw an Error, not an Exception. Allowing strings (which should have been sanitized at the inputs) to get all the way to any autodecoding machinery is a program bug.

@WalterBright
Copy link
Member Author

WalterBright commented Aug 16, 2019

Hmm, I had thought my PR to fix foreach() had gone through.

dlang/druntime#1240

@WalterBright
Copy link
Member Author

Curiously, the foreach() throws a UnicodeException while std.utf throws UTFException. Right hand, meet left hand.

@CyberShadow
Copy link
Member

Curiously, the foreach() throws a UnicodeException while std.utf throws UTFException. Right hand, meet left hand.

Looks like UnicodeException is in Druntime (as the foreach decoding would be) and UTFException is in std.utf.

@WalterBright
Copy link
Member Author

While your arguments that this change could result in data loss where some was relying on invalid UTF data and read then wrote it is pedantically true, I am doubtful it matters in practice. Consider:

  1. toUTF uses replacementDchar and nobody notices
  2. the same thing happens with invalid floating point values - they just get converted to NaN
  3. text editors often change the file read and then written, even if nothing else changes. It'll, for example, reset how line endings are done, it'll truncate anything past a 0 or a ^Z in the file, etc. Nobody cares.
  4. Normal practice in the industry is replacementDchar, not rejecting the entire string
  5. even Errors are expensive in D
  6. the fact that nobody has noticed UnicodeException and UTFException are both used for the same thing suggests that nobody depends on catching these

And lastly, we cannot remove autodecode and yet keep these exceptions.

@CyberShadow
Copy link
Member

CyberShadow commented Aug 16, 2019

toUTF uses replacementDchar and nobody notices

Yep, but probably because no one's using D for anything too serious in that area, of they are they're not using Phobos.

text editors often change the file read and then written, even if nothing else changes

I don't see how that's related! D programs can be processing all kinds of data using strings, and it may not be necessarily "text files" that you would open in a text editor. (Also decent text editors don't do that any more)

Normal practice in the industry is replacementDchar, not rejecting the entire string

Is that actually true? Pretty sure that's not what Rust does, for example. I think the advice only applies when displaying strings, not when converting them in general.

the fact that nobody has noticed UnicodeException and UTFException are both used for the same thing suggests that nobody depends on catching these

Maybe they don't need to be caught! The program crashing with UnicodeException / UTFException is a good signal that there is something wrong with the input and it needs to be fixed.

even Errors are expensive in D

That's bad but I don't know how to fix it. For something as low-level as implicit autodecoding maybe we can do something like onOutOfMemory (which isn't too different from an abort()).

@CyberShadow
Copy link
Member

CyberShadow commented Aug 16, 2019

BTW, a problem with the current design of byUTF / decode / etc: When it returns replacementDchar, how can you tell if it was because of bad UTF in its input, or because the input string literally contained a '\uFFFD' character? A better return value to signal bad UTF would be some value that's unencodable in a valid UTF stream, such as cast(char)0xFF...

Edit: Also, it would prevent the errors from propagating. As soon as you tried to put (encode) the invalid char/dchar into a string, the encoder will throw and catch the bug...

@CyberShadow CyberShadow dismissed their stale review August 16, 2019 00:43

argument has been noted

@wilzbach
Copy link
Member

I'm in favor of this change in general as it unblocks a common problem: string algorithms are hard to do in nothrow and @nogc (dip1008 is still not a thing). However, we at least need to provide a range that allows opt-in validation (i.e. expose the current front/back etc.) to give users a choice.

@CyberShadow
Copy link
Member

Everywhere I look, I see that (unless the particular situation calls to do otherwise) the default or accepted practice is to signal an error on invalid UTF. Other programming languages do it. The standards recommend or demand it. It is very easy to find sources for this. I did not want to outright change the article to state the opposite of what it previously was implying. And, in this particular case, I disagree that removing that text did not improve the article.

@WalterBright
Copy link
Member Author

Everywhere I look, I see that (unless the particular situation calls to do otherwise) the default or accepted practice is to signal an error on invalid UTF.

Except for @lesderid 's quote from the Unicode consortium.

As for being default practice, that could very well be legacy, which is outdated now.

Besides, D is hardly consistent with this. byUTF has been there for years, and did not throw. Nobody complained. Autodecoding does not work for ranges of char, which is why byUTF came about.

@CyberShadow
Copy link
Member

CyberShadow commented Aug 18, 2019

Except for @lesderid 's quote from the Unicode consortium.

We must be reading it differently. None of the text he quoted recommends one practice over the other.

As for being default practice, that could very well be legacy, which is outdated now.

This is a bold assumption. Do you have anything to back it up?

Rust is newer than D and raises errors on bad UTF. You are saying that Rust is using legacy practice?

Besides, D is hardly consistent with this.

I agree. However, we are fixing the inconsistency it in the wrong direction.

byUTF has been there for years, and did not throw. Nobody complained.

I have been pretty vocal that this is the wrong practice for five years now.

Let me turn that around. foreach throws when decoding UTF. Where are the complaints?

@CyberShadow
Copy link
Member

CyberShadow commented Aug 18, 2019

Where are the complaints?

I looked. UnicodeException is what's thrown when foreach encounters bad UTF, right?

Here are the top three search results I get on the forum for UnicodeException:

  1. https://forum.dlang.org/post/vkhugmslgbmlfupnvwma@forum.dlang.org

    The user is trying to load a CSV that is not (entirely) UTF-8. Without the exception, the mis-encoded data would have been corrupted and the errors would have propagated to the output.

  2. https://forum.dlang.org/thread/cxlmmnjnqnaqlkilnned@forum.dlang.org

    The problem is a mis-use of Windows APIs. Without the exception, the operation would have silently succeeded and resulted in garbage.

  3. https://forum.dlang.org/post/mvkbemmgatrfbxnmjada@forum.dlang.org

    The user explicitly wants to work with a "badly encoded" string. Never mind that strings should not contain bad UTF, or that std.string.tr should not autodecode. Though we don't know if the user cares about preserving non-UTF parts of the string in this case, decoding happens as part of a different operation. Without the exception, the non-UTF parts would be replaced implicitly and unexpectedly, without the user's consent.

Why is this still even an argument?

@WalterBright
Copy link
Member Author

None of the text he quoted recommends one practice over the other.

It says both are acceptable.

Do you have anything to back it up?

Yes, but you deleted it.

Rust is newer than D and raises errors on bad UTF.

It offers both forms, not a default as far as I can tell.

Where are the complaints?

They come in the form of poor performance, not being nothrow, and using the gc. These are general problems with using exceptions for reporting errors. Having them at the root of all string processing just makes for problems. It's not the string processing itself that is made slow, it's everything that calls it because they have to build exception unwinding sections for the RAII objects. Exception unwinding code turns off data flow optimizations, register allocation, etc.

I expect this is why Rust uses Option to report errors, and Herb Sutter has been working on a similar thing for C++.

The 1990's sales job of "zero cost exceptions" has turned out to be a snow job.

I anticipate that exceptions will retreat into legacy territory, and that Option types will become the preferred solution. We should get in front of this change before it drives over us. I've done a lot of work to remove gc dependencies from Phobos, and we should be working on ways to remove throwing exceptions, too.

@WalterBright
Copy link
Member Author

  1. one wonders why the user was decoding the line in the first place. Unicode is not symmetric and doing input<=>output conversions is going to result in differences, and not just from replacement characters. (Normalization, etc.)
  2. that was a buffer overflow. Not really fair, as the overflow bytes could have just as easily been valid ASCII
  3. string.tr shouldn't be decoding strings, it's a bug in it.

@CyberShadow
Copy link
Member

Yes, but you deleted it.

If your argument is standing solely on an uncited sentence on Wikipedia, well ...

It offers both forms, not a default as far as I can tell.

Yes. But, to be fair, the functions are from_utf8 and from_utf8_lossy, and not from_utf8 and from_utf8_non_lossy. Due to the naming, the code explicitly needs to request the lossy version. We can't do that in D because autodecoding usually happens a few layers below.

They come in the form of poor performance, not being nothrow, and using the gc.

All these are problems with D's model of handling exceptions, not with how auto-decoding treats bad UTF.

Why not handle it in the same way as an out-of-memory error, then? You don't need to be able to catch the exception, because you can achieve the same effect by requesting decoding through a means that signals errors in other ways.

@CyberShadow
Copy link
Member

CyberShadow commented Aug 18, 2019

  1. Unicode is not symmetric and doing input<=>output conversions is going to result in differences, and not just from replacement characters. (Normalization, etc.)

What? No, it won't! There is exactly one valid encoding for any code point in UTF-8, UTF-16, and UTF-32. Normalization is completely unrelated to this.

string.tr shouldn't be decoding strings, it's a bug in it.

That's completely beside the point. It could very well be any function that does need to auto-decode. Could be the same tr but with dchar arguments and char[] string, then it would be forced to decode.

@WalterBright
Copy link
Member Author

https://issues.dlang.org/show_bug.cgi?id=20140

Note how this change completely avoids the problem with invalid UTF sequences, pushing the decision up to the user to make an explicit choice.

@WalterBright
Copy link
Member Author

https://unicode.org/reports/tr15/

While I think normalization is a bug in Unicode's design, it certainly is a thing.

@CyberShadow
Copy link
Member

CyberShadow commented Aug 18, 2019

Why not handle it in the same way as an out-of-memory error, then?

Here is a better analogy. Why not handle it in the same way that integer division by zero is handled? This situation is actually quite similar to a proposal to make integer division by zero result in zero:

  • Would be more consistent with floating-point division, by never raising an error.
  • Allows making arithmetic code actually-really-nothrow
  • Some programming languages already do exactly this! Though, not to great public acclaim.

This leaves open what to do if from and to contain invalid UTF sequences if the conversion of them is necessary. The most pragmatic solution is to reject from and to arguments to tr that are not of the same UTF encoding as str.

I agree with you 100% here: this is the correct solution. However, most unfortunately, we cannot do this in D everywhere, because of how deeply ingrained auto-decoding is. And, I don't know if that change to tr would be possible due to being breaking, and even more so on a larger scale. Anyway, this is not what this argument is about.

While I think normalization is a bug in Unicode's design, it certainly is a thing.

Normalization is completely unrelated here because it happens on a different layer. Decoding, encoding, autodecoding work with and transform code units. Normalization transfroms code points.

@WalterBright
Copy link
Member Author

I agree with you 100% here: this is the correct solution. However, most unfortunately, we cannot do this in D everywhere, because of how deeply ingrained auto-decoding is.

I'm not so sure about that. I managed to fix quite a bit of std.string so it did not do autodecoding, with no effect on the user (I don't remember why I didn't do tr()). I got rid of the autodecoding in std.path, too. We need to get rid of autodecoding one way or another, or D will get stuck in the past.

My evolved opinion is that an app should decide which encoding to use, and stick with it everywhere. Any decoding/encoding should be restricted to input/output.

My further evolved opinion is that D should not have been agnostic about character types, and just been UTF-8 everywhere. UTF-16 and UTF-32 should be treated as library types only.

@WalterBright
Copy link
Member Author

BTW https://issues.dlang.org/show_bug.cgi?id=20140 is a fine example of what I'm talking about with regard to defining an error condition out of existence so Phobos can be made nothrow and @nogc.

@CyberShadow
Copy link
Member

Agreed 100% there, that's great. Still, I disagree with the change in this pull request, because it allows mis-encoded data to be silently corrupted into unrecoverable errors and propagated before they're noticed, and because it is a change that will affect existing D programs. Please consider one of the alternative approaches discussed above.

@WalterBright
Copy link
Member Author

Ok I'll think about it. In the meantime, I could use help with the two failing tests - I have no idea what's wrong from reading the logs. Though the buildkite one might be caused by linking with an older library.

@CyberShadow
Copy link
Member

CyberShadow commented Aug 18, 2019

DAutoTest is failing due to a network error. Just rebase and force-push.

The relevant line from the log is:

Failed to download package libdparse from https://code.dlang.org/packages/libdparse/0.8.7.zip

@CyberShadow
Copy link
Member

I don't know what's wrong with buildkite, looks like it died before it even started. Probably a rebase will fix it. CC @wilzbach

@WalterBright
Copy link
Member Author

just rebased

@wilzbach
Copy link
Member

It's very likely unrelated, but I would be careful with this PR here.
BTW with Buildkite you can also login and just restart the failing job in question, that might save time.

@schveiguy
Copy link
Member

Jumping in because of the no-autodecode tag, I'm starting an effort to remove autodecoding completely from Phobos as an option. See #7595

FWIW I did run into an issue with wstring.front not being @nogc. But I worked around it by fixing std.algorithm.comparison.equals.

I like @CyberShadow's approach here -- make it throw an Error. But most of this is moot if we just get rid of autodecoding altogether, which is the approach I'm taking.

@RazvanN7
Copy link
Collaborator

I took the liberty to rebase this. @WalterBright @CyberShadow should we merge this?

@CyberShadow
Copy link
Member

No, this is a breaking change that may result in silent data corruption. We've discussed this thoroughly. This change is also going against what modern languages are doing.

@RazvanN7 Sorry to waste your work, but I'm going to go ahead and close this :)

@RazvanN7
Copy link
Collaborator

@CyberShadow No worries. I just wanted to get some resolution for this PR. Thanks!

@CyberShadow
Copy link
Member

CyberShadow commented Sep 14, 2021

Yeah, I hope we can put this behind us, since I know @WalterBright feels very strongly about this (because it simplifies things a lot on the implementation side). But, it really is a really bad compromise, and it would put us in a bad situation that we can't get out of easily.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Phobos 2: No Autodecode Make Phobos work without autodecoding Severity:Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants