Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix security issue 22495 #14538

Merged
merged 9 commits into from
Oct 28, 2022
Merged

Fix security issue 22495 #14538

merged 9 commits into from
Oct 28, 2022

Conversation

dukc
Copy link
Contributor

@dukc dukc commented Oct 8, 2022

No description provided.

@dukc
Copy link
Contributor Author

dukc commented Oct 8, 2022

Hmm it appears this solution would have to unfix https://issues.dlang.org/show_bug.cgi?id=13512.

So we'd have to somehow know if the file is in UTF-8 or some ASCII-extended code page. How could we reliably detect that?

@ibuclaw
Copy link
Member

ibuclaw commented Oct 9, 2022

Hmm it appears this solution would have to unfix https://issues.dlang.org/show_bug.cgi?id=13512.

So we'd have to somehow know if the file is in UTF-8 or some ASCII-extended code page. How could we reliably detect that?

Isn't the BOM always present when any utf16 or 32 characters exist within a file?

compiler/src/dmd/root/utf.d Outdated Show resolved Hide resolved
compiler/src/dmd/lexer.d Outdated Show resolved Hide resolved
@dukc
Copy link
Contributor Author

dukc commented Oct 12, 2022

Isn't the BOM always present when any utf16 or 32 characters exist within a file?

The problem exists with UTF-8 too.

Thinking about this, the only solution would be to detect the native encoding of the system, and then allow it in the shebang if and only if the rest of the file is pure ASCII.

The problem is, that's too complicated to do as a bugfix PR. So essentially two choices:

  • unfixing 13512
  • leaving this issue unsolved for shebangs.

Doing the latter would not allow "invisible" D code, but it would still allow disguising shebang line as something else than it really is. I should note that even with option 1, you can probably invisibly disable (didn't test) D code at the top of file, by using other line endings than "\n" so that the compiler will think the following lines belong to the shebang line.

@dlang-bot
Copy link
Contributor

Thanks for your pull request and interest in making D better, @dukc! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close Bugzilla Severity Description
22495 blocker SECURITY: unicode directionality overrides should be rejected

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "stable + dmd#14538"

@ibuclaw
Copy link
Member

ibuclaw commented Oct 14, 2022

Isn't the BOM always present when any utf16 or 32 characters exist within a file?

The problem exists with UTF-8 too.

Which UTF-8 character controls the flow of text? Looking at the list of cases in the switch present here, there aren't any.

@ibuclaw
Copy link
Member

ibuclaw commented Oct 14, 2022

OK, I see that UTF decided to use a range of high bits to denote a UTF-16 or 32 character, which didn't match my expectation of using bits xFF and xFFFF respectively. :-)

I also see that this range of UTF bits are also in conflict with the high end bits used for KOI-8 characters.

But could there really be an ambiguity?

Just looking at the individual short hex codes against the KOI-8 table.
'\u061C' - 0xD8+0xA8 - ь+invalid character
'\u200E' - 0xE2+0x80+0x8E - Б+invalid character+invalid character
'\u200F' - 0xE2+0x80+0x8F - Б+invalid character+invalid character
'\u202A' - 0xE2+0x80+0xAA - Б+invalid character+invalid character
'\u202E' - 0xE2+0x80+0xAE - Б+invalid character+invalid character
'\u2066' - 0xE2+0x81+0xA6 - Б+invalid character+invalid character
'\u2069' - 0xE2+0x81+0xA9 - Б+invalid character+invalid character

Yeah, if the aim is to only match and reject bidirectional characters, then it looks like there can't be any ambiguity - at least for this KOI-8 encoding.

If the problem is that utf_decodeChar fails after encoutering a cyrillic B character (because it proceeded to attempt decoding it as something else), then the reasonable default would be to suppress the error, and assume that it must be a 1 byte character of another encoding, then move onto the next byte.

If the problem is that uft_decodeChar could ever incorrectly succeed, then that might be something to think about.

@ibuclaw
Copy link
Member

ibuclaw commented Oct 14, 2022

On Posix complaint environments, I'd assume that LC_ALL would be set appropriately if you are not using UTF-8.

@dukc
Copy link
Contributor Author

dukc commented Oct 15, 2022

If I reject only Bidi characters in the shebang but allow otherwise malformed UTF-8, there will be no problem with KOI-8, but potentially with other code pages such as Windows-1252. Putting just any garbage (from Unicode perspective) at the shebang line will usually work, but someone may someday get a strange error.

That's unlikely enough though that it's probably the least evil option here. I'll do it that way.

@dukc
Copy link
Contributor Author

dukc commented Oct 15, 2022

Done.

@CyberShadow Got ya! Raw binary data in strings!

EDIT: Not binary data, but non-graphic Unicode anyway.

@CyberShadow
Copy link
Member

@CyberShadow Got ya! Raw binary data in strings!

EDIT: Not binary data, but non-graphic Unicode anyway.

What is the problem, please?

Also, wouldn't it make more sense to look for undesirable characters after decoding, not before?

@dukc
Copy link
Contributor Author

dukc commented Oct 15, 2022

What is the problem, please?

Oh sorry, I was probably too aggressive because I thought it was funny. Anyway, raw Unicode directionality overriding characters embedded in those strings.

@CyberShadow
Copy link
Member

Got it, thanks. Thought it was something more benign like a no-break space. I agree it makes little sense to allow raw control characters like that (or at least only do so for specific circumstances, like if the entire line is part of a string literal).

I'll fix it in ae tomorrow, feel free to send a PR to expedite that.

@dukc
Copy link
Contributor Author

dukc commented Oct 15, 2022

BuildKite doesn't seem to be required so it can wait to tomorrow I think.

@ibuclaw
Copy link
Member

ibuclaw commented Oct 16, 2022

If I reject only Bidi characters in the shebang but allow otherwise malformed UTF-8, there will be no problem with KOI-8, but potentially with other code pages such as Windows-1252. Putting just any garbage (from Unicode perspective) at the shebang line will usually work, but someone may someday get a strange error.

That's unlikely enough though that it's probably the least evil option here. I'll do it that way.

Having a further think about it, who's parsing the shebang anyway? Surely not us, so what issue could arise if we just keep on incrementitg the buffer pointer until the first newline?

It's not a security risk if we don't read it.

I know it sounds like I'm saying just make it someone else's problem, but bidirectional controls being parsed in the shebang would be a shell bug, not compiler.

CyberShadow added a commit to CyberShadow/ae that referenced this pull request Oct 17, 2022
@CyberShadow
Copy link
Member

I'll fix it in ae tomorrow, feel free to send a PR to expedite that.

Fixed and tagged.

@dukc
Copy link
Contributor Author

dukc commented Oct 17, 2022

Having a further think about it, who's parsing the shebang anyway? Surely not us, so what issue could arise if we just keep on incrementitg the buffer pointer until the first newline?

The potential issue is that if the shell has a vulnerability, a shebang line that seems to be doing something completely different can instead call DMD, where it could not if we kept this check.

This is very theoretical though. If the shell indeed allows doing that it has far bigger problems than DMD not rechecking this one. Also it would be very unlikely for anyone to want to secretly invoke DMD on something that is openly a D source file. So I agree, not worth to keep the shebang check around.

@dukc
Copy link
Contributor Author

dukc commented Oct 17, 2022

Done.

I left the refactorings that were needed by the shebang check in place. They are, in my opinion, in the right direction anyway so no point in removing them since they're already in place. I'll remove them if you disagree.

@ibuclaw
Copy link
Member

ibuclaw commented Oct 18, 2022

Having a further think about it, who's parsing the shebang anyway? Surely not us, so what issue could arise if we just keep on incrementitg the buffer pointer until the first newline?

The potential issue is that if the shell has a vulnerability, a shebang line that seems to be doing something completely different can instead call DMD, where it could not if we kept this check.

Maybe, but I'm just seeing it from the view of: if someone invoked the compiler by running ./foo.d the damage has already been done. And the best thing we can do for damage limitation is:

  1. Hope people test with dmd foo.d before executing via shebang.
  2. Hope that any detection errors aren't false positive because the 8-bit sequence is valid for character encoding X-123.

Better I think to focus on the part we have full control over, and directly affects us if we get it wrong.

Copy link
Member

@ibuclaw ibuclaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm for now, if someone else has a grand idea for shebangs later, we'll run with that.

@ibuclaw
Copy link
Member

ibuclaw commented Oct 18, 2022

@dukc a brief changelog entry to make this visible would be appreciated.

@dukc
Copy link
Contributor Author

dukc commented Oct 18, 2022

I think it's better that someone else writes the entry. Since this is a security fix it may justify some special way to do the changelog entry, so I think it's better that you or another central contributor decides about that and writes it.

@dukc
Copy link
Contributor Author

dukc commented Oct 24, 2022

Still awaiting answer.

string msg;
auto result = decodeUTFpure(msg);

if (msg) error("%.*s", cast(int)msg.length, msg.ptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if-statement body should be on its own line

@dkorpel
Copy link
Contributor

dkorpel commented Oct 24, 2022

Since this is a security fix it may justify some special way to do the changelog entry

I don't know what kind of special way you're thinking of, but here's my suggestion:

Source files may no longer contain Unicode directionality overrides

[Trojan Source: Invisible Vulnerabilities](https://github.com/nickboucher/trojan-source) shows how they can be used to maliciously hide code in various programming languages. 
[D is also affected](https://github.com/nickboucher/trojan-source/pull/16).

To prevent this, the compiler now raises an error when they appear in source files.
If you need a string literal containing directionality overrides, use escape sequences instead:
---
string s = "\u202E\u2066";
---

@dukc
Copy link
Contributor Author

dukc commented Oct 26, 2022

I don't know what kind of special way you're thinking of

Red fonts, or possibly asking to update immediately. I don't know whether that would be appreciated or needless drama, so I thought it'd be faster for someone who has authority to decide to word the message himself.

Anyway, I added the entry exactly as you wrote it.

@dukc
Copy link
Contributor Author

dukc commented Oct 27, 2022

ping @ibuclaw

@ibuclaw
Copy link
Member

ibuclaw commented Oct 27, 2022

Every security vulnerability disclosed since meltdown that gives itself a name, a website... is needless drama. :-)

@CyberShadow
Copy link
Member

I would not call this a vulnerability / (and the change a security fix). I think this is better described as a security precaution and improvement.

Copy link
Member

@ibuclaw ibuclaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just noticed the changelog name, that should be properly prefixed so it's picked up. :-)

@dukc
Copy link
Contributor Author

dukc commented Oct 28, 2022

@ibuclaw I assume you mean it the casing had to be corrected. Done. Also fixed the if condition style Dennis said about earlier.

@dukc
Copy link
Contributor Author

dukc commented Oct 28, 2022

Oh it really needed a prefix. No other entries in my local fork have that prefix so I didn't know what you were talking about.

@ibuclaw
Copy link
Member

ibuclaw commented Oct 28, 2022

Oh it really needed a prefix. No other entries in my local fork have that prefix so I didn't know what you were talking about.

Yes, see #14600 for the change that enforces this. As part of the dmd/runtime merger we still need a way to differentiate between compiler and runtime changes (moving dmd changelog to compiler was rejected, so this is the next least worst thing we can do to keep mostly the same status quo for now).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants