New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode normalized comparison #1810

Closed
harendra-kumar opened this Issue Feb 19, 2016 · 37 comments

Comments

Projects
None yet
5 participants
@harendra-kumar
Collaborator

harendra-kumar commented Feb 19, 2016

Fix the problem described in revert commit cf18703.

harendra-kumar referenced this issue Feb 19, 2016

Revert "init: check pkg name and .cabal file name match"
This reverts commit 63c7f48.

@harendra-kumar The reverted commit seems to break the
1336-1337-new-package-names integration test on Mac OS X.  I did
a bit of debugging and it appears that, somewhere along the way,
some unicode characters are changing (they look the same when
rendered, but the UTF-8 bytes are different) and that is causing
the comparison to fail.

For example:

```
$ stack new ば日本-4本
[...]
Package name as defined in the .cabal file must match the .cabal file name.
Please fix the following packages and try again:
- ば日本-4本/ば日本-4本.cabal
```

`show (FP.takeBaseName $ toFilePath fp,show $ gpdPackageName gpd)`
in this case gives me
`("\12399\12441\26085\26412-4\26412","\12400\26085\26412-4\26412")`;
note the first two bytes in the fst are different than the first byte
in the snd.

@borsboom borsboom added this to the P1: Must milestone Feb 20, 2016

@borsboom borsboom added the type: bug label Feb 20, 2016

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 20, 2016

I copied the characters from inside the .cabal file (package name field) and pasted them on the terminal to list the file using ls and it correctly listed the file. This means the filename and the package name inside the cabal file are identical.

The problem is likely to be with the comparison check in init as well as in build.

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 20, 2016

This turns out to be pretty interesting!

@borsboom noted the following in his update in cf18703:

show (FP.takeBaseName $ toFilePath fp,show $ gpdPackageName gpd) in this case gives me
("\12399\12441\26085\26412-4\26412","\12400\26085\26412-4\26412");
note the first two bytes in the fst are different than the first byte in the snd.

It happens that those are two equivalent unicode representations of the same visual character representation. I printed those on my terminal after converting the decimal into hex representation. Here is the output:

$ echo -e "\u306f + \u3099 = \u306f\u3099 = \u3070"
は + ゙ = ば = ば

So it seems you can combine two codepoints and it gives you the same character as a third pre-combined codepoint.

So I googled and found this (https://en.wikipedia.org/wiki/Combining_character) on wikipedia:

------quote------
Unicode also contains many precomposed characters, so that in many cases it is possible to use both combining diacritics and precomposed characters, at the user's or application's choice. This leads to a requirement to perform Unicode normalization before comparing two Unicode strings and to carefully design encoding converters to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss.
------unquote-----

In previous update I wrote that the filename and the package name are identical. But that's not correct, hexdump shows that they are actually different but the shell and the editor treat them as equivalent.

It seems we need to use some sort of a normalized comparison. So there are two questions to be answered:

  • How do we do a normalized comparison?
  • How did we end up with two different representations - filename using one representation and package name a different one? The source was the same for both of them i.e. the name given on command line and it was a combined character. Perhaps some API somewhere is splitting a codepoint into two. This is just for curiosity. If we do a normalized comparison it won't matter.
@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 20, 2016

https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Normalize.html has a normalize function, we can use that for comparison.

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 20, 2016

A bigger question is how to find and fix all places in the code susceptible to the "normalization bug"? And make sure we do not introduce them in future?

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 20, 2016

@borsboom since this is a pre-existing bug and not a regression you may want to go ahead with the release irrespective of this. Though, we can quickly fix the two known places i.e. stack init and stack build still we don't know if there are other places which may need fixing.

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 20, 2016

Here is a FAQ about unicode normalization: http://unicode.org/faq/normalization.html. Its looking like a PITA to me. And this report - http://www.unicode.org/reports/tr15/ .

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 20, 2016

I think this needs to be understood better before we decide on a fix. There has to be a better way than explicitly normalizing at all places. Not sure if normalization happens implicitly somewhere? Perhaps String and Text are using different normalization forms? Need to dig more.

Any Unicode expert out there?

@harendra-kumar harendra-kumar changed the title from stack init: check whether pkg name and .cabal file name match to Unicode normalized comparison Feb 20, 2016

@luigy

This comment has been minimized.

Contributor

luigy commented Feb 20, 2016

Just wanted to chime in with a test I ran:
Oh unicode! o.O

λ stack new ば日本-4本
λ ls -a
.                 ..                ば日本-4本
λ stack runghc ../stack-1810.hs .
Normalization mode: None
[".","..","\12399\12441\26085\26412-4\26412"]
Normalization mode: NFD
[".","..","\12399\12441\26085\26412-4\26412"]
Normalization mode: NFKD
[".","..","\12399\12441\26085\26412-4\26412"]
Normalization mode: NFC
[".","..","\12400\26085\26412-4\26412"]
Normalization mode: NFKC
[".","..","\12400\26085\26412-4\26412"]
Normalization mode: FCD
[".","..","\12399\12441\26085\26412-4\26412"]
-- ../stack-1810.hs
import qualified Data.Text as T
import Data.Text.ICU.Normalize
import System.Directory
import System.Environment

main :: IO ()
main = do
  [path] <- getArgs
  contents <- map T.pack <$> getDirectoryContents path
  mapM_ (logNormalize contents) modes
  where
    modes = [None,NFD,NFKD,NFC,NFKC,FCD]
    logNormalize contents mode = do
      putStrLn $ "Normalization mode: " ++ show mode
      print $ map (normalize mode) contents
@borsboom

This comment has been minimized.

Contributor

borsboom commented Feb 20, 2016

If at all possible, we want to avoid using text-icu in Stack as that requires installing extra system libraries.

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 21, 2016

I was wondering if ghc (7.10.3) handles the same thing correctly. Observations from an experiment:

  • It allows precomposed form in an identifier and correctly handles the case when the filename is in decomposed form.
  • produces an error when the decomposed form is used in a module name - I guess this should be a bug
cueball $ cat main.hs 
-- Using decomposed form U+306f U+3099
-- GHC produces a lexical error with this
--import Aば日本4本

-- Using precomposed form U+3070
import Aば日本4本

main = func "Hello"

cueball $ cat Aば日本4本.hs 
-- prcomposed form U+3070
module Aば日本4本 where

func = putStrLn

cueball $ runghc main.hs 
Hello

Note the filename is in decomposed form and therefore different than the module name inside the file. If ghc treated them differently it would have produced an error like this:

A.hs:1:8:
    File name does not match module name:
    Saw: ‘B’
    Expected: ‘A’
@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 21, 2016

Now let's see if ghc compares strings correctly:

cueball $ cat string-equal.hs 
main = if "ば" == "ば" then putStrLn "Pass" else putStrLn "Fail"
cueball $ runghc string-equal.hs 
Fail

If ghc used a normalized form to store strings then the test above would have passed and even our comparison code leading to this issue would have worked correctly. I guess we should discuss this on ghc dev list.

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 21, 2016

Text comparison also fails:

cueball $ cat string-equal.hs 
import Data.Text

main = if pack "ば"  == pack "ば" then putStrLn "Pass" else putStrLn "Fail"

cueball:/vol/hosts/cueball/projects/tracoz(master)*$ runghc string-equal.hs 
Fail
@luigy

This comment has been minimized.

Contributor

luigy commented Feb 21, 2016

hmm but even if handling unicode normalized comparison it will still be blocked by haskell/cabal/issues/2557

@luigy

This comment has been minimized.

Contributor

luigy commented Feb 21, 2016

Doubled checked again the example after realizing I had an extra character and I get the same behavior with both forms, but using your vanilla example does give me the lexical error

haha nvm yep indeed I have the same behavior. I need to go by the actual bytes next time and oh boy many conditions that can mess with the normalization of your unicode (tmux, terminal, filesystem, etc)

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 21, 2016

Yeah, I had a hard time playing with this. I used hexdump on the character to verify it. But still not sure if something in the pipeline could be changing the form. What is being displayed may not be the same as what is stored on disk.

Also, not sure if ghc is really handling it or its working just because of some lucky transform in the pipeline - I would like to see the ghc code where it is being handled if it is doing so explicitly.

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 21, 2016

The text package github page (https://github.com/bos/text) mentions that it does not handle normalization and you should use text-icu package for normalization.

@borsboom

This comment has been minimized.

Contributor

borsboom commented Feb 21, 2016

This is a nutty thought but what if, in order to compare package names with .cabal filenames, Stack writes a temporary file with the same filename as the package name, then reads the directory to get the package name as it's normalized by the filesystem, and uses that version to compare to the .cabal filename?

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Feb 21, 2016

@borsboom That is a nutty thought, but I like it! We can save on overhead by only doing it when the direct equality test fails.

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 21, 2016

Interesting hack. Use the filesystem as a normalization API! It may be ok for this particular case but We cannot use this as a general normalization API.

In general we might have the same problem in other cases as well. For example, one can manually specify a package dir in stack.yaml in a different form than the filesystem. There may be other such examples. Wherever we compare a unicode string to make a decision it has to be a normalized comparison for it to work correctly. We need to find a real solution to the problem.

I found this lightweight C library (https://github.com/JuliaLang/utf8proc) which is just two C files, around 600 lines of code and some data. It is maintained under the Julia project. We could create a Haskell package by writing Haskell bindings to this library and use that package? This package can ship the C code to avoid any system package dependency.

We can also look around if there is a library providing a normalize function (other than icu) which is guaranteed to be available on a standard base system. This seems to be a basic unicode problem, how do others (not using icu) solve this problem?

@luigy

This comment has been minimized.

Contributor

luigy commented Feb 22, 2016

In general we might have the same problem in other cases as well. For example, one can manually specify a package dir in stack.yaml in a different form than the filesystem. There may be other such examples.

Currently parsing a PackageName from a decomposed unicode string, in example what we get back when looking for a cabal file, will fail since a composing character like U+3099 returns False for Data.Char.isAlpha

We will need a way to normalize into precomposed form to try keep things a bit simpler

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Feb 22, 2016

Ah, interesting point. I never thought about that. That means form is important for such operations to work correctly. What else we might be missing?

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Mar 19, 2016

I sent an email to haskell-cafe about normalization without using text-icu but got no responses yet. So I am going to try creating a lightweight unicode normalization package myself. I will need some review help at the minimum.

@harendra-kumar harendra-kumar self-assigned this Mar 19, 2016

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Mar 23, 2016

I created a lightweight unicode processing package currently exporting only the normalization API. Please take a look at this: https://github.com/harendra-kumar/unicode-transforms . I would appreciate any feedback and suggestions.

I have tested it on Linux and Mac. I do not have a Windows setup so I will need help for testing it on Windows.

This is a first draft. It works pretty well to solve the problem at hand in the stack code. More functionality can be added easily as needed.

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Mar 23, 2016

Lest I forget, Data.ByteString.Char8 is lossy because it truncates the unicode characters. All uses of this module must be examined if we want unicode support to work correctly. There are a quite a few places where this module is being imported and used. We can instead use Data.Text.Encoding.encodeUtf8 . Data.Text.pack or utf8-string package to convert a String to UTF8 encoded ByteString.

We should perhaps ban the import of Data.ByteString.Char8 in general.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Mar 24, 2016

Wow, awesome work @harendra-kumar ! Feel free to add that to stack as an extra-dep once there's a release on hackage.

@luigy

This comment has been minimized.

Contributor

luigy commented Mar 25, 2016

Yeah, good stuff @harendra-kumar

I was wondering if ghc (7.10.3) handles the same thing correctly. Observations from an experiment:

It allows precomposed form in an identifier and correctly handles the case when the filename is in decomposed form.
produces an error when the decomposed form is used in a module name - I guess this should be a bug

what I asked some time ago on irc might be relevant and there seems to be no decision on how to go about it yet

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Mar 25, 2016

Got a response on cafe - there seems to exist a Haskell implementation for some Unicode ops but its not yet on Hackage - https://github.com/llelf/prose. I am going to take a look at that as well.

@luigy the irc discussion seems to be in agreement with what I observed. It will be nice to have normalization support in GHC but that will require the unicode database for normalization to be shipped with GHC. More stuff to add and maintain.

I am not sure how much all this Unicode thing is worth for programming software especially - do people really use identifiers in local languages? The basic language and documentation all being mostly in English would it make sense to use some local identifiers? for fun?

@luigy

This comment has been minimized.

Contributor

luigy commented Mar 25, 2016

the irc discussion seems to be in agreement with what I observed.

indeed

I am not sure how much all this Unicode thing is worth for programming software especially - do people really use identifiers in local languages? The basic language and documentation all being mostly in English would it make sense to use some local identifiers? for fun?

😂 I know, right

borsboom added a commit that referenced this issue May 2, 2016

@borsboom

This comment has been minimized.

Contributor

borsboom commented May 2, 2016

I've disabled the failing part of the integration test for now, so that we can get stack-1.1.0 out the door without any failures. So once the root cause is resolved, remember to:

  • re-enable failing part of new package names test (i.e., revert 87fec57)
@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Jun 16, 2016

I have uploaded a candidate package on hackage. Can someone take a look if the API looks fine and can stack depend on this package for fixing this issue? This is my first package on Hackage and I am not sure if there is a checklist to check against. I have tested it on Linux/Mac/Windows/GHC-7.8.4/7.10.3/8.0.1.

Currently it is based on a C library but I am in the process of replacing that with all Haskell.

@borsboom

This comment has been minimized.

Contributor

borsboom commented Jun 16, 2016

The API looks good to me.

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Jun 16, 2016

In fact I would like to use the higher level convenience package https://github.com/ppelleti/normalization-insensitive which @ppelleti created on top of this package. It provides a Normalizable class and instances for String, ByteString and Text types which is used to provide a NI s type which keeps the type s normalized in a given form or normalization insensitive.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Jun 16, 2016

Looks good to me too!

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Jun 16, 2016

Thanks @borsboom and @mgsloan !

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Jun 17, 2016

I gave it a thought and it occurs to me that its hard and error prone to handle normalization on a case to case basis or by using normalized data types for certain chosen fields. The best way would be to use Text for all text processing (ban the use of String or ByteString) and then replace the Text implementation under the hood with a normalization aware text package without any change of API. I can write more details in a writeup about how I arrived at this conclusion.

@Blaisorblade Blaisorblade referenced this issue Jul 21, 2016

Closed

On Hackage? #2

Blaisorblade added a commit to Blaisorblade/stack that referenced this issue Jul 21, 2016

Normalize to NFC before comparison (fix commercialhaskell#1810)
The filename gets normalized to NFD on OS X causing spurious failures.

Let's normalize both sides to NFC in case the Cabal file is in some
other normalization for any reason.

This is a slight hack.

Blaisorblade added a commit to Blaisorblade/stack that referenced this issue Jul 21, 2016

Revert "integration-tests: disable failing unicode package name on OS…
… X, for now (commercialhaskell#1810)"

This reverts commit 87fec57, now that
we have a proper fix.
@Blaisorblade

This comment has been minimized.

Collaborator

Blaisorblade commented Jul 21, 2016

I looked through longest-open high-priority issues and found this, so a few comments:

How did we end up with two different representations - filename using one representation and package name a different one?

Since nobody yet answered here: OSX uses NFD instead of NFC.

The minimal fix to 87fec57 is to switch FP.takeBaseName . toFilePath to normalize NFC . Text.pack . FP.takeBaseName . toFilePath — and do the same on the other side.

I gave it a thought and it occurs to me that its hard and error prone to handle normalization on a case to case basis or by using normalized data types for certain chosen fields.

I totally agree. But I assume that both refactorings you propose, while worthwhile, would take quite a while. So what about fixing the present issue meanwhile? Attempting this in #2397.

But I don't mean to scoop you or sth., especially since you did all the hard work 😉 .

@harendra-kumar

This comment has been minimized.

Collaborator

harendra-kumar commented Jul 21, 2016

@Blaisorblade no problem at all, I welcome the fix. Its great to get things done! I got disinterested in a point fix because the problem could be all pervasive. But anyway its good to get this one out of the way and pursue the problem independently.

I looked at your PR. It looks good to me. But it will be good if someone other than me also can take a look at it. @mgsloan?

Blaisorblade added a commit to Blaisorblade/stack that referenced this issue Aug 2, 2016

Revert "integration-tests: disable failing unicode package name on OS…
… X, for now (commercialhaskell#1810)"

This reverts commit 87fec57, now that
we have a proper fix.

Blaisorblade added a commit that referenced this issue Aug 2, 2016

Merge pull request #2397 from Blaisorblade/1810-unicode-normalized-pk…
…g-name-comparison

Fix #1810: Unicode-normalized pkg name comparison
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment