Unicode normalized comparison #1810

harendra-kumar · 2016-02-19T22:12:56Z

Fix the problem described in revert commit cf18703.

The text was updated successfully, but these errors were encountered:

@harendra-kumar

This reverts commit 63c7f48. @harendra-kumar The reverted commit seems to break the 1336-1337-new-package-names integration test on Mac OS X. I did a bit of debugging and it appears that, somewhere along the way, some unicode characters are changing (they look the same when rendered, but the UTF-8 bytes are different) and that is causing the comparison to fail. For example: ``` $ stack new ば日本-4本 [...] Package name as defined in the .cabal file must match the .cabal file name. Please fix the following packages and try again: - ば日本-4本/ば日本-4本.cabal ``` `show (FP.takeBaseName $ toFilePath fp,show $ gpdPackageName gpd)` in this case gives me `("\12399\12441\26085\26412-4\26412","\12400\26085\26412-4\26412")`; note the first two bytes in the fst are different than the first byte in the snd.

harendra-kumar · 2016-02-20T02:18:22Z

I copied the characters from inside the .cabal file (package name field) and pasted them on the terminal to list the file using ls and it correctly listed the file. This means the filename and the package name inside the cabal file are identical.

The problem is likely to be with the comparison check in init as well as in build.

harendra-kumar · 2016-02-20T05:10:44Z

This turns out to be pretty interesting!

@borsboom noted the following in his update in cf18703:

show (FP.takeBaseName $ toFilePath fp,show $ gpdPackageName gpd) in this case gives me
("\12399\12441\26085\26412-4\26412","\12400\26085\26412-4\26412");
note the first two bytes in the fst are different than the first byte in the snd.

It happens that those are two equivalent unicode representations of the same visual character representation. I printed those on my terminal after converting the decimal into hex representation. Here is the output:

$ echo -e "\u306f + \u3099 = \u306f\u3099 = \u3070"
は + ゙ = ば = ば

So it seems you can combine two codepoints and it gives you the same character as a third pre-combined codepoint.

So I googled and found this (https://en.wikipedia.org/wiki/Combining_character) on wikipedia:

------quote------
Unicode also contains many precomposed characters, so that in many cases it is possible to use both combining diacritics and precomposed characters, at the user's or application's choice. This leads to a requirement to perform Unicode normalization before comparing two Unicode strings and to carefully design encoding converters to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss.
------unquote-----

In previous update I wrote that the filename and the package name are identical. But that's not correct, hexdump shows that they are actually different but the shell and the editor treat them as equivalent.

It seems we need to use some sort of a normalized comparison. So there are two questions to be answered:

How do we do a normalized comparison?
How did we end up with two different representations - filename using one representation and package name a different one? The source was the same for both of them i.e. the name given on command line and it was a combined character. Perhaps some API somewhere is splitting a codepoint into two. This is just for curiosity. If we do a normalized comparison it won't matter.

harendra-kumar · 2016-02-20T05:16:36Z

https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Normalize.html has a normalize function, we can use that for comparison.

harendra-kumar · 2016-02-20T05:24:31Z

A bigger question is how to find and fix all places in the code susceptible to the "normalization bug"? And make sure we do not introduce them in future?

harendra-kumar · 2016-02-20T05:33:02Z

@borsboom since this is a pre-existing bug and not a regression you may want to go ahead with the release irrespective of this. Though, we can quickly fix the two known places i.e. stack init and stack build still we don't know if there are other places which may need fixing.

harendra-kumar · 2016-02-20T05:40:09Z

Here is a FAQ about unicode normalization: http://unicode.org/faq/normalization.html. Its looking like a PITA to me. And this report - http://www.unicode.org/reports/tr15/ .

harendra-kumar · 2016-02-20T06:36:16Z

I think this needs to be understood better before we decide on a fix. There has to be a better way than explicitly normalizing at all places. Not sure if normalization happens implicitly somewhere? Perhaps String and Text are using different normalization forms? Need to dig more.

Any Unicode expert out there?

luigy · 2016-02-20T23:38:28Z

Just wanted to chime in with a test I ran:
Oh unicode! o.O

λ stack new ば日本-4本
λ ls -a
.                 ..                ば日本-4本
λ stack runghc ../stack-1810.hs .
Normalization mode: None
[".","..","\12399\12441\26085\26412-4\26412"]
Normalization mode: NFD
[".","..","\12399\12441\26085\26412-4\26412"]
Normalization mode: NFKD
[".","..","\12399\12441\26085\26412-4\26412"]
Normalization mode: NFC
[".","..","\12400\26085\26412-4\26412"]
Normalization mode: NFKC
[".","..","\12400\26085\26412-4\26412"]
Normalization mode: FCD
[".","..","\12399\12441\26085\26412-4\26412"]

-- ../stack-1810.hs
import qualified Data.Text as T
import Data.Text.ICU.Normalize
import System.Directory
import System.Environment

main :: IO ()
main = do
  [path] <- getArgs
  contents <- map T.pack <$> getDirectoryContents path
  mapM_ (logNormalize contents) modes
  where
    modes = [None,NFD,NFKD,NFC,NFKC,FCD]
    logNormalize contents mode = do
      putStrLn $ "Normalization mode: " ++ show mode
      print $ map (normalize mode) contents

borsboom · 2016-02-20T23:42:55Z

If at all possible, we want to avoid using text-icu in Stack as that requires installing extra system libraries.

harendra-kumar · 2016-02-21T05:33:27Z

I was wondering if ghc (7.10.3) handles the same thing correctly. Observations from an experiment:

It allows precomposed form in an identifier and correctly handles the case when the filename is in decomposed form.
produces an error when the decomposed form is used in a module name - I guess this should be a bug

cueball $ cat main.hs 
-- Using decomposed form U+306f U+3099
-- GHC produces a lexical error with this
--import Aば日本4本

-- Using precomposed form U+3070
import Aば日本4本

main = func "Hello"

cueball $ cat Aば日本4本.hs 
-- prcomposed form U+3070
module Aば日本4本 where

func = putStrLn

cueball $ runghc main.hs 
Hello

Note the filename is in decomposed form and therefore different than the module name inside the file. If ghc treated them differently it would have produced an error like this:

A.hs:1:8:
    File name does not match module name:
    Saw: ‘B’
    Expected: ‘A’

harendra-kumar · 2016-02-21T05:39:00Z

Now let's see if ghc compares strings correctly:

cueball $ cat string-equal.hs 
main = if "ば" == "ば" then putStrLn "Pass" else putStrLn "Fail"
cueball $ runghc string-equal.hs 
Fail

If ghc used a normalized form to store strings then the test above would have passed and even our comparison code leading to this issue would have worked correctly. I guess we should discuss this on ghc dev list.

harendra-kumar · 2016-02-21T05:47:36Z

Text comparison also fails:

cueball $ cat string-equal.hs 
import Data.Text

main = if pack "ば"  == pack "ば" then putStrLn "Pass" else putStrLn "Fail"

cueball:/vol/hosts/cueball/projects/tracoz(master)*$ runghc string-equal.hs 
Fail

luigy · 2016-02-21T06:03:10Z

hmm but even if handling unicode normalized comparison it will still be blocked by haskell/cabal/issues/2557

luigy · 2016-02-21T06:14:23Z

~~Doubled checked again the example after realizing I had an extra character and I get the same behavior with both forms, but using your vanilla example does give me the lexical error~~

haha nvm yep indeed I have the same behavior. I need to go by the actual bytes next time and oh boy many conditions that can mess with the normalization of your unicode (tmux, terminal, filesystem, etc)

harendra-kumar · 2016-02-21T07:06:48Z

Yeah, I had a hard time playing with this. I used hexdump on the character to verify it. But still not sure if something in the pipeline could be changing the form. What is being displayed may not be the same as what is stored on disk.

Also, not sure if ghc is really handling it or its working just because of some lucky transform in the pipeline - I would like to see the ghc code where it is being handled if it is doing so explicitly.

harendra-kumar · 2016-02-21T07:23:47Z

The text package github page (https://github.com/bos/text) mentions that it does not handle normalization and you should use text-icu package for normalization.

borsboom · 2016-02-21T13:23:07Z

This is a nutty thought but what if, in order to compare package names with .cabal filenames, Stack writes a temporary file with the same filename as the package name, then reads the directory to get the package name as it's normalized by the filesystem, and uses that version to compare to the .cabal filename?

mgsloan · 2016-02-21T19:11:05Z

@borsboom That is a nutty thought, but I like it! We can save on overhead by only doing it when the direct equality test fails.

harendra-kumar · 2016-02-21T23:02:40Z

Interesting hack. Use the filesystem as a normalization API! It may be ok for this particular case but We cannot use this as a general normalization API.

In general we might have the same problem in other cases as well. For example, one can manually specify a package dir in stack.yaml in a different form than the filesystem. There may be other such examples. Wherever we compare a unicode string to make a decision it has to be a normalized comparison for it to work correctly. We need to find a real solution to the problem.

I found this lightweight C library (https://github.com/JuliaLang/utf8proc) which is just two C files, around 600 lines of code and some data. It is maintained under the Julia project. We could create a Haskell package by writing Haskell bindings to this library and use that package? This package can ship the C code to avoid any system package dependency.

We can also look around if there is a library providing a normalize function (other than icu) which is guaranteed to be available on a standard base system. This seems to be a basic unicode problem, how do others (not using icu) solve this problem?

luigy · 2016-02-22T00:37:54Z

In general we might have the same problem in other cases as well. For example, one can manually specify a package dir in stack.yaml in a different form than the filesystem. There may be other such examples.

Currently parsing a PackageName from a decomposed unicode string, in example what we get back when looking for a cabal file, will fail since a composing character like U+3099 returns False for Data.Char.isAlpha

We will need a way to normalize into precomposed form to try keep things a bit simpler

harendra-kumar · 2016-02-22T02:14:41Z

Ah, interesting point. I never thought about that. That means form is important for such operations to work correctly. What else we might be missing?

harendra-kumar · 2016-03-19T10:29:44Z

I sent an email to haskell-cafe about normalization without using text-icu but got no responses yet. So I am going to try creating a lightweight unicode normalization package myself. I will need some review help at the minimum.

harendra-kumar · 2016-03-23T15:31:22Z

I created a lightweight unicode processing package currently exporting only the normalization API. Please take a look at this: https://github.com/harendra-kumar/unicode-transforms . I would appreciate any feedback and suggestions.

I have tested it on Linux and Mac. I do not have a Windows setup so I will need help for testing it on Windows.

This is a first draft. It works pretty well to solve the problem at hand in the stack code. More functionality can be added easily as needed.

harendra-kumar · 2016-03-23T21:11:08Z

Lest I forget, Data.ByteString.Char8 is lossy because it truncates the unicode characters. All uses of this module must be examined if we want unicode support to work correctly. There are a quite a few places where this module is being imported and used. We can instead use Data.Text.Encoding.encodeUtf8 . Data.Text.pack or utf8-string package to convert a String to UTF8 encoded ByteString.

We should perhaps ban the import of Data.ByteString.Char8 in general.

mgsloan · 2016-03-24T00:05:00Z

Wow, awesome work @harendra-kumar ! Feel free to add that to stack as an extra-dep once there's a release on hackage.

luigy · 2016-03-25T01:09:59Z

Yeah, good stuff @harendra-kumar

I was wondering if ghc (7.10.3) handles the same thing correctly. Observations from an experiment:

It allows precomposed form in an identifier and correctly handles the case when the filename is in decomposed form.
produces an error when the decomposed form is used in a module name - I guess this should be a bug

what I asked some time ago on irc might be relevant and there seems to be no decision on how to go about it yet

harendra-kumar · 2016-03-25T10:43:54Z

Got a response on cafe - there seems to exist a Haskell implementation for some Unicode ops but its not yet on Hackage - https://github.com/llelf/prose. I am going to take a look at that as well.

@luigy the irc discussion seems to be in agreement with what I observed. It will be nice to have normalization support in GHC but that will require the unicode database for normalization to be shipped with GHC. More stuff to add and maintain.

I am not sure how much all this Unicode thing is worth for programming software especially - do people really use identifiers in local languages? The basic language and documentation all being mostly in English would it make sense to use some local identifiers? for fun?

luigy · 2016-03-25T18:05:30Z

the irc discussion seems to be in agreement with what I observed.

indeed

I am not sure how much all this Unicode thing is worth for programming software especially - do people really use identifiers in local languages? The basic language and documentation all being mostly in English would it make sense to use some local identifiers? for fun?

😂 I know, right

…now (#1810)

borsboom · 2016-05-02T23:17:17Z

I've disabled the failing part of the integration test for now, so that we can get stack-1.1.0 out the door without any failures. So once the root cause is resolved, remember to:

re-enable failing part of new package names test (i.e., revert 87fec57)

harendra-kumar · 2016-06-16T15:11:41Z

I have uploaded a candidate package on hackage. Can someone take a look if the API looks fine and can stack depend on this package for fixing this issue? This is my first package on Hackage and I am not sure if there is a checklist to check against. I have tested it on Linux/Mac/Windows/GHC-7.8.4/7.10.3/8.0.1.

Currently it is based on a C library but I am in the process of replacing that with all Haskell.

borsboom · 2016-06-16T15:39:54Z

The API looks good to me.

harendra-kumar · 2016-06-16T15:53:44Z

In fact I would like to use the higher level convenience package https://github.com/ppelleti/normalization-insensitive which @ppelleti created on top of this package. It provides a Normalizable class and instances for String, ByteString and Text types which is used to provide a NI s type which keeps the type s normalized in a given form or normalization insensitive.

mgsloan · 2016-06-16T19:30:18Z

Looks good to me too!

harendra-kumar · 2016-06-16T20:11:18Z

Thanks @borsboom and @mgsloan !

harendra-kumar · 2016-06-17T09:48:23Z

I gave it a thought and it occurs to me that its hard and error prone to handle normalization on a case to case basis or by using normalized data types for certain chosen fields. The best way would be to use Text for all text processing (ban the use of String or ByteString) and then replace the Text implementation under the hood with a normalization aware text package without any change of API. I can write more details in a writeup about how I arrived at this conclusion.

The filename gets normalized to NFD on OS X causing spurious failures. Let's normalize both sides to NFC in case the Cabal file is in some other normalization for any reason. This is a slight hack.

… X, for now (commercialhaskell#1810)" This reverts commit 87fec57, now that we have a proper fix.

Blaisorblade · 2016-07-21T00:33:23Z

I looked through longest-open high-priority issues and found this, so a few comments:

How did we end up with two different representations - filename using one representation and package name a different one?

Since nobody yet answered here: OSX uses NFD instead of NFC.

The minimal fix to 87fec57 is to switch FP.takeBaseName . toFilePath to normalize NFC . Text.pack . FP.takeBaseName . toFilePath — and do the same on the other side.

I gave it a thought and it occurs to me that its hard and error prone to handle normalization on a case to case basis or by using normalized data types for certain chosen fields.

I totally agree. But I assume that both refactorings you propose, while worthwhile, would take quite a while. So what about fixing the present issue meanwhile? Attempting this in #2397.

But I don't mean to scoop you or sth., especially since you did all the hard work 😉 .

harendra-kumar · 2016-07-21T05:13:41Z

@Blaisorblade no problem at all, I welcome the fix. Its great to get things done! I got disinterested in a point fix because the problem could be all pervasive. But anyway its good to get this one out of the way and pursue the problem independently.

I looked at your PR. It looks good to me. But it will be good if someone other than me also can take a look at it. @mgsloan?

… X, for now (commercialhaskell#1810)" This reverts commit 87fec57, now that we have a proper fix.

…g-name-comparison Fix #1810: Unicode-normalized pkg name comparison

borsboom added this to the P1: Must milestone Feb 20, 2016

borsboom added the type: bug label Feb 20, 2016

harendra-kumar changed the title ~~stack init: check whether pkg name and .cabal file name match~~ Unicode normalized comparison Feb 20, 2016

harendra-kumar self-assigned this Mar 19, 2016

borsboom added a commit that referenced this issue May 2, 2016

integration-tests: disable failing unicode package name on OS X, for …

87fec57

…now (#1810)

harendra-kumar mentioned this issue May 31, 2016

add to hackage composewell/unicode-transforms#3

Closed

Blaisorblade mentioned this issue Jul 21, 2016

On Hackage? ppelleti/normalization-insensitive#2

Closed

Blaisorblade added a commit to Blaisorblade/stack that referenced this issue Jul 21, 2016

Revert "integration-tests: disable failing unicode package name on OS…

65c4a37

… X, for now (commercialhaskell#1810)" This reverts commit 87fec57, now that we have a proper fix.

Blaisorblade mentioned this issue Jul 21, 2016

Fix #1810: Unicode-normalized pkg name comparison #2397

Merged

Blaisorblade added a commit to Blaisorblade/stack that referenced this issue Aug 2, 2016

Revert "integration-tests: disable failing unicode package name on OS…

12f9d01

… X, for now (commercialhaskell#1810)" This reverts commit 87fec57, now that we have a proper fix.

Blaisorblade closed this as completed in 7ec9dd8 Aug 2, 2016

Blaisorblade added a commit that referenced this issue Aug 2, 2016

Merge pull request #2397 from Blaisorblade/1810-unicode-normalized-pk…

ccc7229

…g-name-comparison Fix #1810: Unicode-normalized pkg name comparison

This was referenced Nov 30, 2017

Don't parse cabal files twice #3615

Merged

Unicode normalization on OS X can break package name parsing #3616

Open

Corrupt file header when dealing with Unicode normalization issues haskell/cabal#4920

Open

Unicode normalized comparison #1810

Unicode normalized comparison #1810

Comments

harendra-kumar commented Feb 19, 2016

harendra-kumar commented Feb 20, 2016

harendra-kumar commented Feb 20, 2016

harendra-kumar commented Feb 20, 2016

harendra-kumar commented Feb 20, 2016

harendra-kumar commented Feb 20, 2016

harendra-kumar commented Feb 20, 2016

harendra-kumar commented Feb 20, 2016

luigy commented Feb 20, 2016

borsboom commented Feb 20, 2016

harendra-kumar commented Feb 21, 2016

harendra-kumar commented Feb 21, 2016

harendra-kumar commented Feb 21, 2016

luigy commented Feb 21, 2016

luigy commented Feb 21, 2016

harendra-kumar commented Feb 21, 2016

harendra-kumar commented Feb 21, 2016

borsboom commented Feb 21, 2016

mgsloan commented Feb 21, 2016

harendra-kumar commented Feb 21, 2016

luigy commented Feb 22, 2016

harendra-kumar commented Feb 22, 2016

harendra-kumar commented Mar 19, 2016

harendra-kumar commented Mar 23, 2016

harendra-kumar commented Mar 23, 2016

mgsloan commented Mar 24, 2016

luigy commented Mar 25, 2016

harendra-kumar commented Mar 25, 2016

luigy commented Mar 25, 2016

borsboom commented May 2, 2016

harendra-kumar commented Jun 16, 2016

borsboom commented Jun 16, 2016

harendra-kumar commented Jun 16, 2016 • edited Loading

mgsloan commented Jun 16, 2016

harendra-kumar commented Jun 16, 2016

harendra-kumar commented Jun 17, 2016

Blaisorblade commented Jul 21, 2016

harendra-kumar commented Jul 21, 2016

harendra-kumar commented Jun 16, 2016 •

edited

Loading