Unicode normalized comparison #1810
This reverts commit 63c7f48. @harendra-kumar The reverted commit seems to break the 1336-1337-new-package-names integration test on Mac OS X. I did a bit of debugging, and it appears that somewhere along the way some Unicode characters are changing (they look the same when rendered, but the UTF-8 bytes are different), and that is causing the comparison to fail. For example:

```
$ stack new ば日本-4本
[...]
Package name as defined in the .cabal file must match the .cabal file name. Please fix the following packages and try again:
- ば日本-4本/ば日本-4本.cabal
```

`show (FP.takeBaseName $ toFilePath fp, show $ gpdPackageName gpd)` in this case gives me `("\12399\12441\26085\26412-4\26412","\12400\26085\26412-4\26412")`; note that the first two code points in the fst differ from the first code point in the snd.
I copied the characters from inside the .cabal file (package name field) and pasted them on the terminal to list the file. The problem is likely to be with the comparison check in init as well as in build.
This turns out to be pretty interesting! @borsboom noted the following in his update in cf18703:
It happens that those are two equivalent unicode representations of the same visual character representation. I printed those on my terminal after converting the decimal into hex representation. Here is the output:
So it seems you can combine two code points and get the same character as a third, pre-combined code point. So I googled and found this article on combining characters on Wikipedia: https://en.wikipedia.org/wiki/Combining_character

> In a previous update I wrote that the filename and the package name are identical. But that's not correct: hexdump shows that they are actually different, but the shell and the editor treat them as equivalent. It seems we need to use some sort of normalized comparison.
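The two forms can be inspected with a tiny base-only program (names below are mine), showing that the precomposed and decomposed spellings of ば render alike but carry different code points:

```haskell
import Data.Char (chr, ord)

-- Precomposed ば (U+3070) vs. は (U+306F) followed by the
-- combining dakuten (U+3099); they render identically.
precomposed, decomposed :: String
precomposed = [chr 0x3070]
decomposed  = [chr 0x306F, chr 0x3099]

-- Expose the underlying code points.
codepoints :: String -> [Int]
codepoints = map ord
```

`codepoints precomposed` gives `[12400]` and `codepoints decomposed` gives `[12399,12441]`, matching the decimal escapes in the `show` output above, while `precomposed == decomposed` is False.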
https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Normalize.html has a normalize function; we can use that for comparison.
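Such a comparison could be a small wrapper; a sketch, assuming the text-icu package (and its ICU system dependency) is available, with `normalizedEq` being a name I made up:

```haskell
import qualified Data.Text as T
import Data.Text.ICU.Normalize (NormalizationMode (NFC), normalize)

-- Bring both sides to a common normalization form before comparing.
normalizedEq :: T.Text -> T.Text -> Bool
normalizedEq a b = normalize NFC a == normalize NFC b
```

With this, `normalizedEq (T.pack "\12399\12441") (T.pack "\12400")` should be True even though the two Texts differ code point by code point.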
A bigger question is how to find and fix all the places in the code susceptible to this "normalization bug", and how to make sure we do not introduce new ones in the future.
@borsboom since this is a pre-existing bug and not a regression, you may want to go ahead with the release irrespective of this. Though we can quickly fix the two known places.
Here is a FAQ about Unicode normalization: http://unicode.org/faq/normalization.html. It's looking like a PITA to me. See also this report: http://www.unicode.org/reports/tr15/.
I think this needs to be understood better before we decide on a fix. There has to be a better way than explicitly normalizing in all places. Not sure if normalization happens implicitly somewhere? Perhaps String and Text use different normalization forms? Need to dig more. Any Unicode experts out there?
Just wanted to chime in with a test I ran:

```haskell
-- ../stack-1810.hs
import qualified Data.Text as T
import Data.Text.ICU.Normalize
import System.Directory
import System.Environment

main :: IO ()
main = do
  [path] <- getArgs
  contents <- map T.pack <$> getDirectoryContents path
  mapM_ (logNormalize contents) modes
  where
    modes = [None, NFD, NFKD, NFC, NFKC, FCD]
    logNormalize contents mode = do
      putStrLn $ "Normalization mode: " ++ show mode
      print $ map (normalize mode) contents
```
If at all possible, we want to avoid using text-icu.
I was wondering whether GHC (7.10.3) handles the same thing correctly. Observations from an experiment:

Note that the filename is in decomposed form and therefore different from the module name inside the file. If GHC treated them differently, it would have produced an error like this:
Now let's see if ghc compares strings correctly:
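The original snippet isn't shown here, but a minimal String-equality check along those lines (using the two spellings of ば日本 from the earlier `show` output) could be:

```haskell
-- The two visually identical spellings, as plain Strings.
decomposed, composed :: String
decomposed = "\12399\12441\26085\26412"  -- は + combining dakuten, then 日本
composed   = "\12400\26085\26412"        -- precomposed ば, then 日本

-- String equality is a plain code-point-by-code-point comparison.
same :: Bool
same = decomposed == composed
```

Here `same` is False (and `length` reports 4 vs. 3 Chars), so GHC stores the strings exactly as written, with no implicit normalization.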
If GHC used a normalized form to store strings, then the test above would have passed, and even our comparison code leading to this issue would have worked correctly. I guess we should discuss this on the GHC dev list.
Text comparison also fails:
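The failing comparison, which wasn't captured above, can presumably be reproduced like this (same two spellings as before; text ships with GHC):

```haskell
import qualified Data.Text as T

-- The decomposed and precomposed spellings as Text values.
decomposedT, composedT :: T.Text
decomposedT = T.pack "\12399\12441\26085\26412"
composedT   = T.pack "\12400\26085\26412"
```

`decomposedT == composedT` is False, and `T.length` gives 4 vs. 3: like String, Text compares the stored code points and does not normalize.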
hmm, but even if we handle Unicode-normalized comparison, it will still be blocked by haskell/cabal/issues/2557
haha, never mind, yep, I see the same behavior. I need to go by the actual bytes next time. And oh boy, there are so many things that can mess with the normalization of your Unicode (tmux, terminal, filesystem, etc.)
Yeah, I had a hard time playing with this. I used hexdump on the character to verify it. But I am still not sure if something in the pipeline could be changing the form; what is being displayed may not be the same as what is stored on disk. Also, I am not sure if GHC is really handling it or it's working just because of some lucky transform in the pipeline. I would like to see the GHC code where it is being handled, if it is doing so explicitly.
The text package's GitHub page (https://github.com/bos/text) mentions that it does not handle normalization and that you should use the text-icu package for normalization.
This is a nutty thought, but what if, in order to compare package names with file names, we let the filesystem itself normalize them for us?
@borsboom That is a nutty thought, but I like it! We can save on overhead by only doing it when the direct equality test fails. |
Interesting hack: use the filesystem as a normalization API! It may be OK for this particular case, but we cannot use it as a general normalization API; in general we might have the same problem in other places as well. For example, one can manually specify a package dir in stack.yaml in a different form than the filesystem uses. There may be other such examples. Wherever we compare a Unicode string to make a decision, it has to be a normalized comparison for it to work correctly. We need to find a real solution to the problem.

I found this lightweight C library (https://github.com/JuliaLang/utf8proc), which is just two C files, around 600 lines of code, and some data. It is maintained under the Julia project. We could create a Haskell package by writing Haskell bindings to this library and use that package. The package can ship the C code to avoid any system package dependency.

We can also look around for a library providing a normalize function (other than icu) that is guaranteed to be available on a standard base system. This seems to be a basic Unicode problem; how do others (not using icu) solve it?
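Such a binding could be quite thin. Here is a hypothetical sketch (all Haskell names below are mine; it assumes libutf8proc is installed and the program is linked with -lutf8proc, and it relies on utf8proc's convenience function `utf8proc_NFC`, which returns a freshly malloc'ed, NUL-terminated, NFC-normalized copy of its input):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.String (CString)
import Foreign.Marshal.Alloc (free)
import GHC.Foreign (peekCString, withCString)
import GHC.IO.Encoding (utf8)
import System.IO.Unsafe (unsafePerformIO)

-- utf8proc's convenience wrapper: mallocs an NFC-normalized copy.
foreign import ccall unsafe "utf8proc_NFC"
  c_utf8proc_NFC :: CString -> IO CString

-- Normalize to NFC, marshalling through UTF-8 explicitly.
-- unsafePerformIO is justifiable here: the C function is a pure
-- function of its input.
toNFC :: String -> String
toNFC s = unsafePerformIO $
  withCString utf8 s $ \inp -> do
    out <- c_utf8proc_NFC inp
    res <- peekCString utf8 out
    free out
    return res

-- Normalization-insensitive equality.
normalizedEq :: String -> String -> Bool
normalizedEq a b = toNFC a == toNFC b
```

With this, `normalizedEq "\12399\12441" "\12400"` should be True, since NFC composes は plus the combining dakuten into ば.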
We will need a way to normalize into precomposed form, to try to keep things a bit simpler.
Ah, interesting point. I never thought about that. That means the form is important for such operations to work correctly. What else might we be missing?
I sent an email to haskell-cafe about normalization without using text-icu.
I created a lightweight Unicode processing package, currently exporting only the normalization API. Please take a look: https://github.com/harendra-kumar/unicode-transforms. I would appreciate any feedback and suggestions. I have tested it on Linux and Mac; I do not have a Windows setup, so I will need help testing it on Windows. This is a first draft, but it works pretty well to solve the problem at hand in the stack code. More functionality can be added easily as needed.
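For the comparison at hand, usage would presumably look much like the text-icu version, minus the ICU system dependency; a sketch against unicode-transforms' normalization API (module and function names as in recent releases of the package; early releases may differ):

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFC), normalize)

-- Normalization-insensitive comparison without any ICU dependency.
normalizedEq :: Text -> Text -> Bool
normalizedEq a b = normalize NFC a == normalize NFC b
```

For example, `normalizedEq (T.pack "\12399\12441") (T.pack "\12400")` should be True.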
Lest I forget, we should perhaps ban the import of …
Wow, awesome work @harendra-kumar! Feel free to add that to stack as an extra-dep once there's a release on Hackage.
Yeah, good stuff @harendra-kumar
What I asked some time ago on IRC might be relevant; there seems to be no decision on how to go about it yet.
Got a response on cafe: there is a Haskell implementation of some Unicode ops, but it's not yet on Hackage: https://github.com/llelf/prose. I am going to take a look at that as well.

@luigy the IRC discussion seems to be in agreement with what I observed. It would be nice to have normalization support in GHC, but that would require the Unicode normalization database to be shipped with GHC; more stuff to add and maintain. I am not sure how much all this Unicode support is worth for programming especially: do people really use identifiers in local languages? With the base language and documentation mostly in English, would it make sense to use local identifiers? For fun?
indeed
😂 I know, right |
I've disabled the failing part of the integration test for now, so that we can get stack-1.1.0 out the door without any failures. Once the root cause is resolved, remember to:
I have uploaded a candidate package on Hackage. Can someone take a look to check whether the API looks fine and whether stack can depend on this package to fix this issue? This is my first package on Hackage, and I am not sure if there is a checklist to check against. I have tested it on Linux, Mac, and Windows with GHC 7.8.4, 7.10.3, and 8.0.1. Currently it is based on a C library, but I am in the process of replacing that with pure Haskell.
The API looks good to me. |
In fact, I would like to use the higher-level convenience package https://github.com/ppelleti/normalization-insensitive, which @ppelleti created on top of this package. It provides a `Normalizable` class and instances for the common string types.
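Patterned after the case-insensitive package, usage would presumably look something like this (a sketch; the module name and the existence of a String instance are assumptions on my part):

```haskell
import qualified Data.Unicode.NormalizationInsensitive as NI

-- Wrapping both values makes (==) normalization-insensitive, the
-- same way case-insensitive's CI makes comparisons case-insensitive.
sameName :: String -> String -> Bool
sameName a b = NI.mk a == NI.mk b
```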
Looks good to me too! |
I gave it some thought, and it occurs to me that it is hard and error-prone to handle normalization on a case-by-case basis, or by using normalized data types for certain chosen fields. The best way would be to use Text for all text processing (ban the use of String or ByteString for text), and then replace the Text implementation under the hood with a normalization-aware text package, without any change of API. I can write more details in a writeup about how I arrived at this conclusion.
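Short of swapping out Text itself, the same invariant can be prototyped with a newtype whose smart constructor normalizes on the way in (a sketch with names of my own; `normalize` here stands in for whichever implementation is picked, e.g. unicode-transforms' Data.Text.Normalize):

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFC), normalize)

-- Invariant: the wrapped Text is always NFC, so the derived Eq and
-- Ord are automatically normalization-insensitive.
newtype NText = NText { unNText :: Text }
  deriving (Eq, Ord, Show)

-- Intended as the only way to construct an NText (in a real module,
-- export NText abstractly and only this smart constructor).
mkNText :: Text -> NText
mkNText = NText . normalize NFC
```

Any comparison routed through `mkNText` then agrees for the decomposed and precomposed spellings.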
The filename gets normalized to NFD on OS X causing spurious failures. Let's normalize both sides to NFC in case the Cabal file is in some other normalization for any reason. This is a slight hack.
… X, for now (commercialhaskell#1810)" This reverts commit 87fec57, now that we have a proper fix.
I looked through the longest-open high-priority issues and found this one, so a few comments:
Since nobody has answered here yet: OS X uses NFD instead of NFC. The minimal fix to 87fec57 is to switch …
I totally agree. But I assume that both refactorings you propose, while worthwhile, would take quite a while. So what about fixing the present issue in the meantime? Attempting this in #2397. But I don't mean to scoop you or anything, especially since you did all the hard work 😉.
@Blaisorblade no problem at all, I welcome the fix. It's great to get things done! I got disinterested in a point fix because the problem could be all-pervasive. But anyway, it's good to get this one out of the way and pursue the larger problem independently. I looked at your PR and it looks good to me, but it would be good if someone other than me could also take a look at it. @mgsloan?
…g-name-comparison Fix #1810: Unicode-normalized pkg name comparison
Fix the problem described in revert commit cf18703.