Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failure to parse invalid Unicode on Windows in v2.33.0 (but not master) #1905

Closed
oconnor663 opened this issue May 5, 2020 · 10 comments · Fixed by #1907
Closed

failure to parse invalid Unicode on Windows in v2.33.0 (but not master) #1905

oconnor663 opened this issue May 5, 2020 · 10 comments · Fixed by #1907
Labels
C-bug Category: Updating dependencies

Comments

@oconnor663
Copy link
Contributor

The following is for the current release version of Clap, v2.33.0. (The bug does not appear to repro in Clap master. More on that below.) I've set up a repository containing a minimal repro of this problem: https://github.com/oconnor663/clap_unicode_test. You can see a CI run demonstrating the failure here: https://github.com/oconnor663/clap_unicode_test/runs/645077320. To repro the failure yourself, run cargo test in that repository.

The issue is that passing invalid Unicode on the command line to a Clap binary causes an "unexpected invalid UTF-8 code point" error, but only on Windows. Here's the main.rs:

use clap::{App, Arg};

fn main() {
    let matches = App::new("test")
        .arg(Arg::with_name("ARG").multiple(true))
        .get_matches();
    if let Some(args) = matches.values_of_os("ARG") {
        for arg in args {
            if arg.to_str().is_some() {
                println!("{:?} (valid Unicode)", arg);
            } else {
                println!("{:?} (INVALID Unicode!!!)", arg);
            }
        }
    }
}

Here's the test that runs on Linux and macOS. This one passes:

#[test]
#[cfg(unix)]
fn test_invalid_unicode_on_unix() {
    use std::os::unix::ffi::OsStringExt;
    let mut cmd = Command::new(env!("CARGO_BIN_EXE_test"));
    cmd.arg("abcdef");
    cmd.arg(OsString::from_vec(b"abc\xffdef".to_vec()));
    let status = cmd.status().unwrap();
    assert!(status.success());
}

In that case, the actual output of the (successful) binary is:

"abcdef" (valid Unicode)
"abc\xFFdef" (INVALID Unicode!!!)

And here's the test that fails on Windows:

#[test]
#[cfg(windows)]
fn test_invalid_unicode_on_windows() {
    use std::os::windows::ffi::OsStringExt;
    let mut cmd = Command::new(env!("CARGO_BIN_EXE_test"));
    cmd.arg("abcdef");
    let surrogate_char = 0xDC00;
    let bad_unicode_wchars = [
        'a' as u16,
        'b' as u16,
        'c' as u16,
        surrogate_char,
        'd' as u16,
        'e' as u16,
        'f' as u16,
    ];
    let bad_osstring = OsString::from_wide(&bad_unicode_wchars);
    cmd.arg(bad_osstring);
    let status = cmd.status().unwrap();
    assert!(status.success());
}

In this one, the binary crashes in the clap::App::get_matches call.

The tests are different between Unix and Windows because the way to construct an invalid Unicode OsString is different. On Unix, we just insert a non-UTF-8 byte into a byte slice. On Windows, we need to insert a unpaired surrogate character into a slice of u16. Other than that, the tests are effectively the same.

When I try the master branch of Clap (see the clap_master branch of my repository and this CI run), the problem appears fixed. That's awesome! However, it looks like Clap might be in the middle of a substantial refactoring, and I don't know how long it will take for these fixes to see a release. Should something be backported in the meantime?

@pksunkara
Copy link
Member

I don't think we will be backporting this fix because 2.x branch is frozen and only security fixes are going to be allowed. We are working hard on clap v3 and I am confident to say that the full release will be no more than 3 months away if we keep up the current pace of development.

@oconnor663
Copy link
Contributor Author

If I was able to figure out a fix on the 2.x branch, and it didn't look too hairy, would you be willing to release it? I'm interested in fixing this for my own work (it's a blocker for properly testing b3sum --check), and it might be valuable for other command line apps that never bother to upgrade out of the 2.x series.

@CreepySkeleton
Copy link
Contributor

@oconnor663 I say we wouldn't because the code in question is a few years old code. The last time someone touched it was a year ago or so, nobody here (maybe short of @kbknapp) is familiar with that codebase. We simply can't review your fix and be certain that it won't break something else, even if it's just a few lines.

@kbknapp
Copy link
Member

kbknapp commented May 5, 2020

@oconnor663

We appreciate the very detailed bug report!

@pksunkara is correct, 2.x is currently frozen except security fixes. However, I'm willing to take a look at a PR to fix the issue, since I'm familiar with the codebase. I can't promise a merge if it looks to be too high a maintenance burden. Lets at least see how far reaching the fix is, since causing a crash on valid input could be considered a security issue if we wend the optics a bit to include remote binaries with DDoS potential.

oconnor663 added a commit to oconnor663/clap that referenced this issue May 5, 2020
Previously, starts_with() would panic on Windows if the OsStr contained
any invalid Unicode. Because this method is in the critical path for
parsing command line arguments, this meant that Clap up to v2.33.0 on
Windows would panic on non-Unicode filenames, even when the application
tried to use OsString specifically to account for this case. As a
workaround, this commit adds a custom implementation of
OsStr::starts_with() for Windows that doesn't rely on as_bytes().

Fixes clap-rs#1905.
@oconnor663
Copy link
Contributor Author

Thanks for all the prompt replies. It looks like the fix can be quite targeted, essentially a one-line change to the existing OsStr::starts_with method, plus a new pure helper function and some tests. I haven't tried to comb the codebase for all the places where OsStr is assumed to be valid Unicode, but this single change at least seems to fix my use case. I want to add a couple more tests, then I'll submit a proper PR.

@kbknapp
Copy link
Member

kbknapp commented May 5, 2020

There shouldn't be many places where OsStr is assumed valid Unicode. Also see the potentially relevant PR #1893 the underlying code is different since its on the 3.x master branch, but deals with OsStr::starts_with

oconnor663 added a commit to oconnor663/clap that referenced this issue May 5, 2020
Previously, starts_with() would panic on Windows if the OsStr contained
any invalid Unicode. Because this method is in the critical path for
parsing command line arguments, this meant that Clap up to v2.33.0 on
Windows would panic on non-Unicode filenames, even when the application
tried to use OsString specifically to account for this case. As a
workaround, this commit adds a custom implementation of
OsStr::starts_with() for Windows that doesn't rely on as_bytes().

Fixes clap-rs#1905.
oconnor663 added a commit to oconnor663/clap that referenced this issue May 5, 2020
Previously, starts_with() would panic on Windows if the OsStr contained
any invalid Unicode. Because this method is in the critical path for
parsing command line arguments, this meant that Clap up to v2.33.0 on
Windows would panic on non-Unicode filenames, even when the application
tried to use OsString specifically to account for this case. As a
workaround, this commit adds a custom implementation of
OsStr::starts_with() for Windows that doesn't rely on as_bytes().

Fixes clap-rs#1905.
oconnor663 added a commit to oconnor663/clap that referenced this issue May 5, 2020
Previously, starts_with() would panic on Windows if the OsStr contained
any invalid Unicode. Because this method is in the critical path for
parsing command line arguments, this meant that Clap up to v2.33.0 on
Windows would panic on non-Unicode filenames, even when the application
tried to use OsString specifically to account for this case. As a
workaround, this commit adds a custom implementation of
OsStr::starts_with() for Windows that doesn't rely on as_bytes().

Fixes clap-rs#1905.
@pksunkara
Copy link
Member

I think this was fixed in master because of the recent refactor done on OsStr by @dylni. It was a pretty big change (IMO).

oconnor663 added a commit to oconnor663/clap that referenced this issue May 5, 2020
Previously, starts_with() would panic on Windows if the OsStr contained
any invalid Unicode. Because this method is in the critical path for
parsing command line arguments, this meant that Clap up to v2.33.0 on
Windows would panic on non-Unicode filenames, even when the application
tried to use OsString specifically to account for this case. As a
workaround, this commit adds a custom implementation of
OsStr::starts_with() for Windows that doesn't rely on as_bytes().

Fixes clap-rs#1905.
oconnor663 added a commit to oconnor663/clap that referenced this issue May 5, 2020
Previously, OsStr::starts_with() would panic on Windows if the OsStr
contained any invalid Unicode. Because this method is in the critical
path for parsing command line arguments, Clap up to v2.33.0 on Windows
panics on non-Unicode filenames, even when the application specifically
requests OsString to account for this case. As a workaround, this commit
adds a custom implementation of OsStr::starts_with() for Windows that
doesn't rely on OsStr::as_bytes().

With this change, examples like the following can parse successfully.
Here "X" represents invalid Unicode:

    clap.exe X
    clap.exe --some-arg X

However, examples like these will still panic:

    clap.exe --some-arg=X
    clap.exe -sX

These still panic, because they require string splitting operations like
OsStr::split_at_byte() and OsStr::trim_left_matches(). Fixing these
would require either breaking open the in-memory representation of
OsStr, which is not stable, or changing all these APIs to allow for an
allocated OsString/Cow<OsStr>. This fix is aiming to be minimal and also
not wildly unsafe, so we leave these bugs in place. Hopefully the
majority of invalid Unicode in the wild occurs in filepaths given as
standalone arguments, and many of the rest can be converted from flags
with `=` to flags with a space.

Fixes clap-rs#1905.
oconnor663 added a commit to oconnor663/clap that referenced this issue May 5, 2020
Previously, OsStr::starts_with() would panic on Windows if the OsStr
contained any invalid Unicode. Because this method is in the critical
path for parsing command line arguments, Clap up to v2.33.0 on Windows
panics on non-Unicode filenames, even when the application specifically
requests OsString to account for this case. As a workaround, this commit
adds a custom implementation of OsStr::starts_with() for Windows that
doesn't rely on OsStr::as_bytes().

With this change, examples like the following can parse successfully.
Here "X" represents invalid Unicode:

    clap.exe X
    clap.exe --some-arg X

However, examples like these will still panic:

    clap.exe --some-arg=X
    clap.exe -sX

These still panic, because they require string splitting operations like
OsStr::split_at_byte() and OsStr::trim_left_matches(). Fixing these
would require either breaking open the in-memory representation of
OsStr, which is not stable, or changing all these APIs to allow for an
allocated OsString/Cow<OsStr>. This fix is aiming to be minimal and also
not wildly unsafe, so we leave these bugs in place. Hopefully the
majority of invalid Unicode in the wild occurs in filepaths given as
standalone arguments, and many of the rest can be converted from flags
with `=` to flags with a space.

Fixes clap-rs#1905.
@oconnor663
Copy link
Contributor Author

I've submitted #1907. Despite appearances, I think this is a targeted fix :) Most of the code in there is new tests. The meat of the change is that I've defined the windows_osstr_starts_with standalone function on Windows only, and I've changed OsStr::starts_with to call that instead on Windows. This fixes the simplest cases of invalid Unicode parsing, but it leaves panics in place for cases like --arg=[invalid], which would require string splitting and therefore allocation and a broader API change. So this is not nearly as thorough as the refactoring that @pksunkara mentioned, but hopefully it's small enough that it's unlikely to introduce new bugs.

@dylni
Copy link
Contributor

dylni commented May 5, 2020

Thanks for the ping @pksunkara. Yes, this was fixed in this pull request.

@pksunkara pksunkara linked a pull request May 6, 2020 that will close this issue
@kbknapp
Copy link
Member

kbknapp commented May 7, 2020

I think this was fixed in master because of the recent refactor done on OsStr by @dylni. It was a pretty big change (IMO).

Yeah that's why I referenced the other PR (part of the refactor). I haven't had time the last day or so, but I'll look at this new PR by this weekend and see if it's something that could go on the 2.x branch with the understanding that if v2-master is currently broken, so long as we don't break more we're in a slightly better spot even if its not a total fix.

Once v3 is fully released I fully intend to EOL v2 except for hard security patches

bors bot added a commit that referenced this issue May 11, 2020
1907: prevent some panics when parsing invalid Unicode on Windows r=kbknapp a=oconnor663

Previously, OsStr::starts_with() would panic on Windows if the OsStr
contained any invalid Unicode. Because this method is in the critical
path for parsing command line arguments, Clap up to v2.33.0 on Windows
panics on non-Unicode filenames, even when the application specifically
requests OsString to account for this case. As a workaround, this commit
adds a custom implementation of OsStr::starts_with() for Windows that
doesn't rely on OsStr::as_bytes().

With this change, examples like the following can parse successfully.
Here "X" represents invalid Unicode:

    clap.exe X
    clap.exe --some-arg X

However, examples like these will still panic:

    clap.exe --some-arg=X
    clap.exe -sX

These still panic, because they require string splitting operations like
OsStr::split_at_byte() and OsStr::trim_left_matches(). Fixing these
would require either breaking open the in-memory representation of
OsStr, which is not stable, or changing all these APIs to allow for an
allocated OsString/Cow<OsStr>. This fix is aiming to be minimal and also
not wildly unsafe, so we leave these bugs in place. Hopefully the
majority of invalid Unicode in the wild occurs in filepaths given as
standalone arguments, and many of the rest can be converted from flags
with `=` to flags with a space.

Fixes #1905.

<!--
If your PR closes some issues, please write `Closes #XXXX`
where `XXXX` is the number of the issue you want to fix.
Each issue goes on it's own line.
-->


Co-authored-by: Jack O'Connor <oconnor663@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: Updating dependencies
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants