unicode in project names is very poorly handled (in 'stack new') #1337

kadoban · 2015-11-12T07:48:17Z

I apologize in advance for not diligently following the CONTRIBUTING.md format, but I'm not actually sure what the expected result of these commands should be (read below for why). But regardless of what it should be, there's a problem.

To preface, 'ば', '日' and '本' are all letters (according to isLetter from Data.Char). Now let's see what happens when we try to create a new project with each of them as a name:

This one just completely gives you an incorrect project name:

$ stack new "ば"
Downloading template "new-template" to create project "p" in p/ ...

This one fails with a fairly bizarre-to-the-user error:

$ stack new "日"
Cannot decode byte '\xe5': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

And this one is a complicated ball of I'm not sure what:

$ stack new "本"
Expected valid package name, but got: 本

'本' is a letter, according to isLetter from Data.Char at least, so it should be allowed.
But are cabal package names in ASCII, or can you actually use unicode?
If it is supposed to fail, it only fails by accident. It turns out Data.ByteString.Char8.pack truncates '本' (U+672C) down to ',' (0x2C), and parsing then fails because ',' isn't a valid package name.

What's happening is:
The optparse-applicative parser calls parsePackageNameFromString, which uses Data.ByteString.Char8.pack and so truncates any Char outside of the Word8 range. packageNameParser is then used to parse the result, yielding some pretty odd outcomes.
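
To make the truncation concrete, here is a minimal sketch of what happens to the three names above when they pass through Data.ByteString.Char8. The helper names (truncateViaChar8, report) are made up for illustration and are not functions from stack:

```haskell
import qualified Data.ByteString.Char8 as B8
import Data.Char (isLetter)

-- | Round-trip a String through Data.ByteString.Char8, roughly what a
-- ByteString-based parser ends up seeing. Char8.pack keeps only the
-- low 8 bits of each Char's code point.
truncateViaChar8 :: String -> String
truncateViaChar8 = B8.unpack . B8.pack

-- | The three project names from the report.
samples :: [String]
samples = ["ば", "日", "本"]

-- | Describe one name: is it all letters, and what survives truncation?
report :: String -> String
report s =
  s ++ ": all letters = " ++ show (all isLetter s)
    ++ ", after Char8 round trip = " ++ show (truncateViaChar8 s)
```

Running mapM_ (putStrLn . report) samples in GHCi shows 'ば' (U+3070) collapsing to "p" (0x70), '日' (U+65E5) to the lone byte 0xE5, and '本' (U+672C) to "," (0x2C), which matches the three behaviours above.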

The decodeUtf8 error message on "日" comes from packageNameText, in some logging in the New command code.
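
The decode failure can be reproduced in isolation. This is only a sketch: truncatedByte, properBytes and decodeName are illustrative names, and it uses decodeUtf8' from Data.Text.Encoding (the non-throwing variant) rather than whatever stack's logging code actually calls:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- | The single byte left after "日" (U+65E5) is truncated to 8 bits.
-- 0xE5 is the lead byte of a three-byte UTF-8 sequence, so on its own
-- it is not a valid UTF-8 stream.
truncatedByte :: B.ByteString
truncatedByte = B.pack [0xE5]

-- | The real UTF-8 encoding of the same character, for comparison.
properBytes :: B.ByteString
properBytes = encodeUtf8 (T.pack "日")

-- | Strict decoding that reports failure as a value instead of throwing.
decodeName :: B.ByteString -> Either String T.Text
decodeName bytes =
  case decodeUtf8' bytes of
    Left err   -> Left ("not valid UTF-8: " ++ show err)
    Right name -> Right name
```

decodeName truncatedByte yields a Left, while decodeName properBytes yields the original text, so the bytes were already damaged before decoding was attempted.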

So, my question is: what should actually be done to fix this? In my mind, there are two options:

1. Detect package names outside of ASCII range and reject them outright with a sane error message.
2. Correctly handle unicode characters. I'm not sure how extensive the required changes would be; probably a lot of code paths would at least have to be looked at, and PackageName would have to change at least a moderate amount.

I suspect that the first choice is the correct one, but can anyone confirm?
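
For reference, the first option could look something like the sketch below. checkAsciiName is a hypothetical helper, not an existing stack function, and the error wording is just a placeholder:

```haskell
import Data.Char (isAscii)

-- | Hypothetical pre-check: reject any name containing non-ASCII
-- characters before it reaches a ByteString-based parser, and say
-- which characters caused the rejection.
checkAsciiName :: String -> Either String String
checkAsciiName name =
  case filter (not . isAscii) name of
    []  -> Right name
    bad -> Left ("Package name '" ++ name
                 ++ "' contains non-ASCII characters: " ++ bad)
```

With this, checkAsciiName "本" gives a Left naming the offending character instead of the truncated ',', and ASCII names pass through unchanged.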

The above behavior is identical between the following two versions:

$ stack --version
Version 0.1.7.0, Git revision 2f00f7bd350192cef1c61a8f07cbe7341c1e735f (2555 commits) x86_64

$ stack --version
Version 0.1.6.0, Git revision e22271f5ce9afa2cb5be3bad9cafa392c623f85c (2313 commits) x86_64

mgsloan · 2015-11-14T03:29:58Z

I'm definitely in favor of (2), since it seems that cabal does support unicode package names. I don't think any code will need to change, since nearly all of the uses of package name go through functions which convert to unicode-aware types. There is a usage of packageNameByteString, here, but that's an acceptable usage.

kadoban · 2015-11-15T21:20:29Z

Interesting, I'll see if I can figure out how to make choice number 2 happen then.

This is a WIP change towards correctly accepting unicode package names, see issue commercialhaskell#1337

TODO: this change probably needs to be more extensive in this area Related to issue commercialhaskell#1337 . It is now possible to build 'stack' in attempting to fix this issue.

kadoban · 2015-11-17T00:50:24Z

I've started trying to fix this, but it still needs some work and a lot more testing. fix_unicode_handling is the WIP branch.

There were an unfortunate number of types using the same ASCII-focused, ByteString-based parsing, so the changes are more than I'd hoped originally. It's also not finished yet: at least GhcPkgId still uses the old way, which is breaking registering of packages with unicode names.
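
As a rough illustration of the direction (not the actual code on the branch), a Text-backed name type with a unicode-aware check might look like this. The PackageName definition and parsePackageName here are assumptions for the sketch, and the rule used (hyphen-separated alphanumeric components, each with at least one letter) is my understanding of Cabal's package name syntax:

```haskell
import Data.Char (isAlphaNum, isLetter)
import qualified Data.Text as T

-- | Hypothetical Text-backed package name; not stack's real definition.
newtype PackageName = PackageName T.Text
  deriving (Eq, Show)

-- | Accept hyphen-separated components made of letters and digits,
-- each containing at least one letter. The Data.Char predicates are
-- unicode-aware, so no truncation to 8 bits happens anywhere.
parsePackageName :: T.Text -> Maybe PackageName
parsePackageName t
  | all validComponent parts = Just (PackageName t)
  | otherwise                = Nothing
  where
    parts = T.splitOn (T.pack "-") t
    validComponent c =
      not (T.null c) && T.all isAlphaNum c && T.any isLetter c
```

Under these assumptions parsePackageName (T.pack "本") succeeds, while "," and names with an all-digit component such as "foo-1" are rejected.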

mgsloan · 2015-11-17T02:30:04Z

Cool, thanks for taking this on! The changes look good so far.

kadoban · 2015-11-17T07:41:47Z

Well, I've gotten it to the point where you can stack new and stack build a project with a unicode name and it seems to work. There's still a pretty long way to go though, the current version is more of a basic proof of concept that it could potentially work rather than a finished product.

I'll be looking into how to make more principled changes to the PackageDump and PackageIndex parts, those were more just quick hacks than actual fixes. I think I'll need to try to understand what that code actually does and what it's used for before I can actually make it work well.

I said above it works, but so far only for an arbitrary subset of unicode letters that I tried and I'm not sure why. Hopefully it's related to the hacky fixes I mentioned and once I improve those it'll be more sane.

mgsloan · 2015-11-18T06:13:57Z

Great! Looking forward to when it's mergeable!

kadoban · 2015-11-20T17:41:32Z

Well, some progress but still having troubles.

I've fixed up PackageIndex and PackageDump to more sanely handle non-ASCII (instead of just ... blindly fixing type errors in the minimal way possible, as I'd done before), but I'm still getting the following error when trying to build some package names, not all. I can't tell what the pattern is yet or why, that'll be my next investigation I suppose:

本-0.1.0.0: build
setup-Simple-Cabal-1.22.4.0-x86_64-linux-ghc-7.10.2: Saved package config file header is corrupt. Try re-running the 'configure' command.

--  While building package \26412-0.1.0.0 using:
      /home/mud/.stack/setup-exe-cache/setup-Simple-Cabal-1.22.4.0-x86_64-linux-ghc-7.10.2 --builddir=.stack-work/dist/x86_64-linux/Cabal-1.22.4.0/ build lib:本 exe:本-exe --ghc-options " -ddump-hi -ddump-to-file"
    Process exited with code: ExitFailure 1

Starting to think I'm going to have to look at every single place that Data.ByteString.Char8 is used.

kadoban · 2015-11-21T03:36:39Z

Starting to think that the error I'm seeing is just a cabal bug. I found haskell/cabal#2557 .

So I'm not exactly sure where to go with this for now. I think I'm going to stop here, review the changes I've done and make sure all of them are completely well-founded and work on some integration tests.

I may have to dig into cabal's bug and try to work that out for this use-case to be perfect, though it's currently much better than it was originally at least.

mgsloan · 2015-11-21T04:50:58Z

Yeah, that sounds like a likely explanation. Sounds like a good plan!

borsboom · 2015-12-20T13:59:41Z

@kadoban: Looks like the integration tests you added are failing on Windows. I've done some trials and I can't even get GHC on its own to successfully work with unicode filenames (after ensuring I'm on a UTF-8 code page), so I think for now I'll just disable these tests on Windows since I really doubt we can fix them in Stack.

kadoban · 2015-12-20T23:00:57Z

@borsboom: That's fine. I'm uncertain if they should be on by default even on other platforms ... they're really quite fragile unfortunately (not because of a stack issue, as far as I know, but due to Cabal ... and apparently GHC on windows).

I'll soon (next few days) be opening a PR or issue with information on what's not working, just so we have a reference for it. I'll also include the more extensive tests that I really would have liked to include originally, if they had actually worked on linux (but they don't).

kadoban added a commit to kadoban/stack that referenced this issue Nov 17, 2015

use Text instead of ByteString for PackageName

7492320

This is a WIP change towards correctly accepting unicode package names, see issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

use Text instead of ByteString for PackageName

f79359b

This is a WIP change towards correctly accepting unicode package names, see issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

use Text instead of ByteString for Version

5b97845

issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

cleanup parsing of PackageName to lessen packing/unpacking

349e5dd

issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

use Text instead of BS for FlagName too

13fe3ba

issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

use Text instead of ByteString for PackageIdentifier

357d5f2

issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

fix up type errors due to the ByteString/Text switch

2857548

issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 19, 2015

use Text instead of ByteString for GhcPkgId as well

a5f5394

issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 20, 2015

handle non-ASCII in PackageDump

4f03232

issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Nov 20, 2015

handle non-ASCII better in PackageIndex

83f2bdc

issue commercialhaskell#1337 related

kadoban added a commit to kadoban/stack that referenced this issue Nov 20, 2015

go down another rabbithole turning BS into Text

0374267

issue commercialhaskell#1337 related

borsboom added the type: bug label Nov 22, 2015

borsboom added this to the P3: Optional milestone Nov 22, 2015

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

use Text instead of ByteString for PackageName

1c924d5

This is a WIP change towards correctly accepting unicode package names, see issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

handle non-ASCII in PackageDump

b1d832d

issue commercialhaskell#1337

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

handle non-ASCII better in PackageIndex

673129e

issue commercialhaskell#1337 related

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

go down another rabbithole turning BS into Text

1e82eb7

issue commercialhaskell#1337 related

kadoban added a commit to kadoban/stack that referenced this issue Dec 10, 2015

change some basic internal types to handle unicode

f2092e9

Previously only ASCII really works correctly, everything else breaks pretty badly. This is a step towards fixing issue commercialhaskell#1337

kadoban mentioned this issue Dec 10, 2015

improve unicode handling #1492

Merged

mgsloan closed this as completed in #1492 Dec 11, 2015

kadoban added a commit to kadoban/stack that referenced this issue Dec 12, 2015

test non-ascii package names

b355e6e

related to (closed) issue commercialhaskell#1337

borsboom added a commit that referenced this issue Dec 20, 2015

Disable unicode filename integration tests on Windows (#1337)

98f2332

This was referenced Dec 1, 2017

Unicode normalization on OS X can break package name parsing #3616

Open

Corrupt file header when dealing with Unicode normalization issues haskell/cabal#4920

Open
