New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode in project names is very poorly handled (in 'stack new') #1337

Closed
kadoban opened this Issue Nov 12, 2015 · 11 comments

Comments

Projects
None yet
3 participants
@kadoban
Collaborator

kadoban commented Nov 12, 2015

I apologize in advance for not diligently following the CONTRIBUTING.md format, but I'm not actually sure what the expected result of these commands should be (read below for why). But regardless of what it should be, there's a problem.

To preface, 'ば', '日' and '本' are all letters (according to isLetter from Data.Char). Now let's see what happens when we try to create a new project with each of them as a name:

This one just completely gives you an incorrect project name:

$ stack new "ば"                      
Downloading template "new-template" to create project "p" in p/ ...

This one fails with a fairly bizarre-to-the-user error:

$ stack new "日"
Cannot decode byte '\xe5': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

And this one is a complicated ball of I'm not sure what:

$ stack new "本"
Expected valid package name, but got: 本
  • '本' is a letter, according to isLetter from Data.Char at least, so it should be allowed
  • But are cabal package names in ASCII, or can you actually use unicode?
  • If it is supposed to fail, it only accidentally fails. Turns out it's truncating '本' down to ',' via Data.ByteString.Char8.pack and then fails because ',' isn't a valid package name.

What's happening is:
the optparse-applicative parser is calling parsePackageNameFromString which uses Data.ByteString.Char8.pack, truncating any Chars outside of the Word8 range. Then packageNameParser is used to parse the result, yielding some pretty odd outcomes.

The decodeUtf8 error message on "日" is via packageNameText in some logging in the New command code

So, my question is: what should actually be done to fix this? In my mind, there's two options:

  1. Detect package names outside of ASCII range and reject them outright with a sane error message.
  2. Correctly handle unicode characters, which I'm not sure how extensive the changes required would be. Probably a lot of code paths would have to at least be looked at, and PackageName would have to change at least a moderate amount.

I suspect that the first choice is the correct one, but can anyone confirm?

The above behavior is identical between the following two versions:

$ stack --version
Version 0.1.7.0, Git revision 2f00f7bd350192cef1c61a8f07cbe7341c1e735f (2555 commits) x86_64
$ stack --version
Version 0.1.6.0, Git revision e22271f5ce9afa2cb5be3bad9cafa392c623f85c (2313 commits) x86_64
@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Nov 14, 2015

I'm definitely in favor of (2), since it seems that cabal does support unicode package names. I don't think any code will need to change, since nearly all of the uses of package name go through functions which covert to unicode-aware types. There is a usage of packageNameByteString, here, but that's an acceptable usage.

@kadoban

This comment has been minimized.

Collaborator

kadoban commented Nov 15, 2015

Interesting, I'll see if I can figure out how to make choice number 2 happen then.

kadoban added a commit to kadoban/stack that referenced this issue Nov 17, 2015

use Text instead of ByteString for PackageName
This is a WIP change towards correctly accepting unicode package
names, see issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 17, 2015

do a minimal change to get this type error fixed
TODO: this change probably needs to be more extensive in this area

Related to issue commercialhaskell#1337 . It is now possible to build 'stack' in
attempting to fix this issue.
@kadoban

This comment has been minimized.

Collaborator

kadoban commented Nov 17, 2015

I've started trying to fix this, but it still needs some work and a lot more testing. fix_unicode_handling is the WIP branch.

There were an unfortunate number of types using the same ASCII-focused ByteString based parsing, so the changes are more than I'd hoped originally. It's also not finished yet, there's at least GhcPkgId using the old way, which is breaking registering of packages with unicode names.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Nov 17, 2015

Cool, thanks for taking this on! The changes look good so far.

@kadoban

This comment has been minimized.

Collaborator

kadoban commented Nov 17, 2015

Well, I've gotten it to the point where you can stack new and stack build a project with a unicode name and it seems to work. There's still a pretty long way to go though, the current version is more of a basic proof of concept that it could potentially work rather than a finished product.

I'll be looking into how to make more principled changes to the PackageDump and PackageIndex parts, those were more just quick hacks than actual fixes. I think I'll need to try to understand what that code actually does and what it's used for before I can actually make it work well.

I said above it works, but so far only for an arbitrary subset of unicode letters that I tried and I'm not sure why. Hopefully it's related to the hacky fixes I mentioned and once I improve those it'll be more sane.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Nov 18, 2015

Great! Looking forward to when it's mergeable!

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

use Text instead of ByteString for PackageName
This is a WIP change towards correctly accepting unicode package
names, see issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

do a minimal change to get this type error fixed
TODO: this change probably needs to be more extensive in this area

Related to issue commercialhaskell#1337 . It is now possible to build 'stack' in
attempting to fix this issue.

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

kadoban added a commit to kadoban/stack that referenced this issue Nov 20, 2015

kadoban added a commit to kadoban/stack that referenced this issue Nov 20, 2015

@kadoban

This comment has been minimized.

Collaborator

kadoban commented Nov 20, 2015

Well, some progress but still having troubles.

I've fixed up PackageIndex and PackageDump to more sanely handle non-ASCII (instead of just ... blindly fixing type errors in the minimal way possible, as I'd done before), but I'm still getting the following error when trying to build some package names, not all. I can't tell what the pattern is yet or why, that'll be my next investigation I suppose:

本-0.1.0.0: build
setup-Simple-Cabal-1.22.4.0-x86_64-linux-ghc-7.10.2: Saved package config file header is corrupt. Try re-running the 'configure' command.

--  While building package \26412-0.1.0.0 using:
      /home/mud/.stack/setup-exe-cache/setup-Simple-Cabal-1.22.4.0-x86_64-linux-ghc-7.10.2 --builddir=.stack-work/dist/x86_64-linux/Cabal-1.22.4.0/ build lib:本 exe:本-exe --ghc-options " -ddump-hi -ddump-to-file"
    Process exited with code: ExitFailure 1

Starting to think I'm going to have to look at every single place that Data.ByteString.Char8 is used.

kadoban added a commit to kadoban/stack that referenced this issue Nov 20, 2015

@kadoban

This comment has been minimized.

Collaborator

kadoban commented Nov 21, 2015

Starting to think that the error I'm seeing is just a cabal bug. I found haskell/cabal#2557 .

So I'm not exactly sure where to go with this for now. I think I'm going to stop here, review the changes I've done and make sure all of them are completely well-founded and work on some integration tests.

I may have to dig into cabal's bug and try to work that out for this use-case to be perfect, though it's currently much better than it was originally at least.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Nov 21, 2015

Yeah, that sounds like a likely explanation. Sounds like a good plan!

@borsboom borsboom added the type: bug label Nov 22, 2015

@borsboom borsboom added this to the P3: Optional milestone Nov 22, 2015

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

use Text instead of ByteString for PackageName
This is a WIP change towards correctly accepting unicode package
names, see issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

do a minimal change to get this type error fixed
TODO: this change probably needs to be more extensive in this area

Related to issue commercialhaskell#1337 . It is now possible to build 'stack' in
attempting to fix this issue.

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

change some basic internal types to handle unicode
Previously only ASCII really works correctly, everything else
breaks pretty badly.

This is a step towards fixing issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Dec 12, 2015

@borsboom

This comment has been minimized.

Contributor

borsboom commented Dec 20, 2015

@kadoban: Looks like the integration tests you added are failing on Windows. I've done some trials and I can't even get GHC on its own to successfully work with unicode filenames (after ensuring I'm on a UTF-8 code page), so I think for now I'll just disable these tests on Windows since I really doubt we can fix them in Stack.

borsboom added a commit that referenced this issue Dec 20, 2015

@kadoban

This comment has been minimized.

Collaborator

kadoban commented Dec 20, 2015

@borsboom: That's fine. I'm uncertain if they should be on by default even on other platforms ... they're really quite fragile unfortunately (not because of a stack issue, as far as I know, but due to Cabal ... and apparently GHC on windows).

I'll soon (next few days) be opening a PR or issue with information on what's not working just so we have a reference for it. (and I'll include the more extensive tests that I really would have liked to have included originally if they actually worked on linux (but they don't).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment