Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add General_Category and further predicates #40

Merged
merged 19 commits into from
Dec 21, 2021

Conversation

wismill
Copy link
Collaborator

@wismill wismill commented Nov 10, 2021

  • Parser
    • Fix parsing of ranges in UnicodeData.txt
    • Generate lib/Unicode/Internal/Char/UnicodeData/GeneralCategory.hs
  • Library
    • Add GeneralCategory data type and corresponding generalCategoryAbbr, generalCategory functions.
    • Add the following functions: isAlpha, isAlphaNum, isControl, isMark, isNumber, isPrint, isPunctuation, isSeparator, isSymbol.
    • Breaking change: isLetter is renamed isAlphabetic and isSpace is renamed isWhiteSpace.
    • Breaking change: isLetter and isSpace now match the base’s Data.Char behaviour.
    • Add lookupIntN in Unicode.Internal.Bits for low-level stuff.
    • Re-export some functions from Data.Char in order to make Unicode.Char a drop-in replacement.
  • Test
    • Add tests for Unicode.Char to ensure we get the same result than Data.Char.
      Note that this true only if the compiler used have have the same Unicode version as this package.

Fixes #38, Fixes #39

Note: ucd.sh generate should be re-run if PR #36 is merge (before or after this PR).

@harendra-kumar
Copy link
Member

@Bodigrim would you like to review the changes to lib/Unicode/Internal/Bits.hs?

@Bodigrim
Copy link
Collaborator

@wismill I appreciate the effort, but I think sub-byte operations are not worth it. There are 31 constructors, so you need 5 bits per one value. It's most likely faster just use a byte per value, but fetch all data in one go instead of conditional fetching of one or two words.

@wismill
Copy link
Collaborator Author

wismill commented Nov 11, 2021

@Bodigrim Ok, I will check that. Meanwhile, I added a benchmark and disable tests with incompatible GHC versions.

@wismill
Copy link
Collaborator Author

wismill commented Nov 12, 2021

@Bodigrim I have implemented your suggestion. It is roughly 60 to 80% faster than my previous implementation!

If my benchmark is correct, we have a nice speedup of 16x compared to some functions from Data.Char.

All
  Unicode.Char.Case
    isLower
      base:           OK (0.52s)
         17 ms ± 585 μs
      unicode-data:   OK (0.33s)
        315 μs ±  14 μs
    isUpper
      base:           OK (2.07s)
         16 ms ± 879 μs
      unicode-data:   OK (0.31s)
        314 μs ±  12 μs
  Unicode.Char.General
    generalCategory
      base:           OK (0.86s)
        122 ms ± 4.2 ms
      unicode-data:   OK (0.33s)
        109 ms ± 3.4 ms
    isAlpa
      base:           OK (0.24s)
         16 ms ± 795 μs
      unicode-data:   OK (0.25s)
        945 μs ±  53 μs
    isAlpabetic
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    isAlpaNum
      base:           OK (1.01s)
         16 ms ± 263 μs
      unicode-data:   OK (0.49s)
        946 μs ±  35 μs
    isControl
      base:           OK (0.49s)
         16 ms ± 424 μs
      unicode-data:   OK (0.20s)
        783 μs ±  46 μs
    isLetter
      base:           OK (0.20s)
        783 μs ±  43 μs
      unicode-data:   OK (0.20s)
        784 μs ±  44 μs
    isMark
      base:           OK (0.25s)
         16 ms ± 693 μs
      unicode-data:   OK (0.20s)
        783 μs ±  44 μs
    isNumber
      base:           OK (0.52s)
         17 ms ± 843 μs
      unicode-data:   OK (0.49s)
        947 μs ±  28 μs
    isPrint
      base:           OK (0.50s)
         16 ms ± 901 μs
      unicode-data:   OK (0.24s)
        943 μs ±  53 μs
    isPunctuation
      base:           OK (0.47s)
         15 ms ± 813 μs
      unicode-data:   OK (0.20s)
        783 μs ±  42 μs
    isSeparator
      base:           OK (0.51s)
         16 ms ± 644 μs
      unicode-data:   OK (0.24s)
        949 μs ±  50 μs
    isSpace
      base:           OK (0.40s)
        1.6 ms ±  43 μs
      unicode-data:   OK (0.24s)
        1.9 ms ± 109 μs
    isSymbol
      base:           OK (0.21s)
         15 ms ± 714 μs
      unicode-data:   OK (0.24s)
        940 μs ±  49 μs
    isWhiteSpace
      unicode-data:   OK (0.32s)
        315 μs ±  12 μs
    isHangul
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    isHangulLV
      unicode-data:   OK (0.32s)
        314 μs ±  12 μs
    isJamo
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    jamoLIndex
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    jamoVIndex
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    jamoTIndex
      unicode-data:   OK (0.32s)
        314 μs ±  12 μs
  Unicode.Char.Identifiers
    isIDContinue
      unicode-data:   OK (0.33s)
        314 μs ±  11 μs
    isIDStart
      unicode-data:   OK (0.33s)
        314 μs ±  12 μs
    isXIDContinue
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    isXIDStart
      unicode-data:   OK (0.32s)
        629 μs ±  24 μs
    isPatternSyntax
      unicode-data:   OK (0.33s)
        314 μs ±  13 μs
    isPatternWhitespace
      unicode-data:   OK (0.33s)
        315 μs ±  13 μs
  Unicode.Char.Normalization
    isCombining
      unicode-data:   OK (0.33s)
        314 μs ±  11 μs
    combiningClass
      unicode-data:   OK (0.21s)
        3.3 ms ± 188 μs
    isCombiningStarter
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    isDecomposable
      Canonical
        unicode-data: OK (0.33s)
          315 μs ±  13 μs
      Kompat
        unicode-data: OK (0.33s)
          315 μs ±  12 μs
    decomposeHangul
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs

All 48 tests passed (18.65s)

@wismill
Copy link
Collaborator Author

wismill commented Nov 12, 2021

I have merged latest changes from master.

All tests successful, except Ubuntu+GHC7103, which seems to have a linker issue. Could you check it?

I propose to bump the package version from 0.2 to 0.3, as there are some breaking changes.

…egory

# Conflicts:
#	appveyor.yml
#	lib/Unicode/Internal/Bits.hs
@wismill
Copy link
Collaborator Author

wismill commented Nov 22, 2021

@harendra-kumar @adithyaov @Bodigrim I have merged master, added support for big-endian architectures (based on @Bodigrim work). There are some CI failures, although they seem unrelated to this PR.

@harendra-kumar
Copy link
Member

@adithyaov can you help with appveyor CI failure and GHC 9.2.1 CI, I think you fixed a similar issue elsewhere. GHC 9.2.1 CI example can be found here.

@Bodigrim
Copy link
Collaborator

Appveyor failure is a known issue with lts-18.17, try to update to lts-18.18 or newer.

@harendra-kumar
Copy link
Member

I will take a better look at the APIs, naming, structuring etc once I get some time.

Copy link
Member

@harendra-kumar harendra-kumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty good. I have some minor comments.

We can add a cabal docspec CI to check the properties in haddock. We have a similar CI in the streamly repo.

Changelog.md Show resolved Hide resolved
Changelog.md Outdated Show resolved Hide resolved
bench/Main.hs Outdated Show resolved Hide resolved
bench/Main.hs Outdated Show resolved Hide resolved
bench/Main.hs Outdated Show resolved Hide resolved
lib/Unicode/Char/General.hs Outdated Show resolved Hide resolved
lib/Unicode/Char/General.hs Show resolved Hide resolved
test/Unicode/CharSpec.hs Outdated Show resolved Hide resolved
unicode-data.cabal Outdated Show resolved Hide resolved
bench/Main.hs Outdated Show resolved Hide resolved
Changelog.md Outdated Show resolved Hide resolved
bench/Main.hs Outdated Show resolved Hide resolved
exe/Parser/Text.hs Outdated Show resolved Hide resolved
@adithyaov
Copy link
Member

@wismill I apologize for the delayed review. The PR looks good overall. I don't have any major comments other than the minor scrutiny already mentioned. I've added in one small question along as well.

You might need to rebase and resolve conflicts.

I can send a PR addressing this minor scrutiny to this branch in your repo.

@harendra-kumar
Copy link
Member

The CI's are failing with this message:

The following files are committed to the git repository but do not exist in the source distribution.

.editorconfig
Please consider adding them to your cabal file under 'extra-source-files' or 'extra-doc-files'.
If you do not want to add them in the source distribution then add them to .packcheck.ignore file at the root of the git repository.

You can add .editorconfig to .packcheck.ignore since we do not want to pack it in the distribution.

@wismill
Copy link
Collaborator Author

wismill commented Dec 16, 2021

Fixed .packcheck.ignore

@harendra-kumar
Copy link
Member

@adithyaov can you suggest a fix for appveyor CI failure. I think you fixed a similar issue elsewhere.

@harendra-kumar
Copy link
Member

@wismill this looks good for merge. You can squash any commits if required before the merge.

@wismill
Copy link
Collaborator Author

wismill commented Dec 17, 2021

New benchmark with relative speedup:

All
  Unicode.Char.Case
    isLower
      base:           OK (0.17s)
         21 ms ± 1.4 ms
      unicode-data:   OK (0.12s)
        3.9 ms ± 340 μs, 0.18x
    isUpper
      base:           OK (0.15s)
         22 ms ± 1.7 ms
      unicode-data:   OK (0.23s)
        3.7 ms ± 168 μs, 0.17x
  Unicode.Char.General
    generalCategory
      base:           OK (0.37s)
        124 ms ± 4.2 ms
      unicode-data:   OK (0.32s)
        106 ms ± 5.5 ms, 0.85x
    isAlpha
      base:           OK (0.15s)
         21 ms ± 1.4 ms
      unicode-data:   OK (0.25s)
        3.9 ms ± 180 μs, 0.18x
    isAlphabetic
      unicode-data:   OK (0.16s)
        312 μs ±  23 μs
    isAlphaNum
      base:           OK (0.15s)
         22 ms ± 1.5 ms
      unicode-data:   OK (0.14s)
        4.6 ms ± 337 μs, 0.21x
    isControl
      base:           OK (0.15s)
         22 ms ± 1.5 ms
      unicode-data:   OK (0.55s)
        4.5 ms ± 365 μs, 0.21x
    isLetter
      base:           OK (0.32s)
         21 ms ± 904 μs
      unicode-data:   OK (0.12s)
        3.9 ms ± 342 μs, 0.18x
    isMark
      base:           OK (0.16s)
         23 ms ± 2.2 ms
      unicode-data:   OK (0.27s)
        4.3 ms ± 175 μs, 0.19x
    isPrint
      base:           OK (0.34s)
         22 ms ± 1.4 ms
      unicode-data:   OK (0.13s)
        4.2 ms ± 378 μs, 0.18x
    isPunctuation
      base:           OK (0.15s)
         21 ms ± 2.1 ms
      unicode-data:   OK (0.13s)
        4.3 ms ± 351 μs, 0.20x
    isSeparator
      base:           OK (0.34s)
         23 ms ± 847 μs
      unicode-data:   OK (0.12s)
        4.0 ms ± 369 μs, 0.17x
    isSpace
      base:           OK (2.97s)
         11 ms ±  60 μs
      unicode-data:   OK (0.15s)
        4.8 ms ± 381 μs, 0.42x
    isSymbol
      base:           OK (0.33s)
         22 ms ± 1.0 ms
      unicode-data:   OK (0.14s)
        4.4 ms ± 377 μs, 0.20x
    isWhiteSpace
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    isHangul
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isHangulLV
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    isJamo
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    jamoLIndex
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    jamoVIndex
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    jamoTIndex
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
  Unicode.Char.Identifiers
    isIDContinue
      unicode-data:   OK (0.16s)
        312 μs ±  23 μs
    isIDStart
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isXIDContinue
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isXIDStart
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isPatternSyntax
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    isPatternWhitespace
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
  Unicode.Char.Normalization
    isCombining
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    combiningClass
      unicode-data:   OK (0.19s)
        2.9 ms ± 221 μs
    isCombiningStarter
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isDecomposable
      Canonical
        unicode-data: OK (0.16s)
          624 μs ±  42 μs
      Kompat
        unicode-data: OK (0.16s)
          625 μs ±  46 μs
    decomposeHangul
      unicode-data:   OK (0.16s)
        623 μs ±  43 μs
  Unicode.Char.Numeric
    isNumber
      base:           OK (0.16s)
         23 ms ± 1.7 ms
      unicode-data:   OK (0.15s)
        4.8 ms ± 345 μs, 0.21x

All 48 tests passed (12.06s)

@harendra-kumar
Copy link
Member

The benchmarks look pretty good compared to base. It may be a good idea for GHC/base to use the Haskell native unicode-data instead of using FFI to C library functions. And we have a good chunk of unicode functionality supported now, I am sure we will keep adding more stuff and one day this will be comprehensive.

Good stuff @wismill !

@wismill
Copy link
Collaborator Author

wismill commented Dec 17, 2021

Ok, I think this PR is almost ready. Should I squash all the commits? Should I bump the package’s version to 0.3?

@harendra-kumar
Copy link
Member

You can either squash them into a single commit or maybe squash them into a few logically related commits, or just squash the trivial/fixup type commits and leave the rest as it is, whatever you deem fit.

We can bump the package version, but let's give it a soak time of a week before we upload to hackage. We can also generate a bench-show report with a side-by-side benchmark comparison with base and put it in the readme, like we have in unicode-transforms.

@adithyaov
Copy link
Member

This needs to be added to the stack config.

flags:
  mintty:
    Win32-2-13-1: false

See https://stackoverflow.com/questions/70045586/could-not-find-module-system-console-mintty-win32-when-compiling-test-framework

@wismill
Copy link
Collaborator Author

wismill commented Dec 20, 2021

We can also generate a bench-show report with a side-by-side benchmark comparison with base and put it in the readme, like we have in unicode-transforms.

Unfortunately it seems that bench-show does not support tasty-bench output. I will add an excerpt of the current output.

This needs to be added to the stack config.

I will add it.

@adithyaov
Copy link
Member

Unfortunately it seems that bench-show does not support tasty-bench output. I will add an excerpt of the current output.

I'll need to update bench-show and generate the benchmarks. I'll need to do this as a release task anyway.

Add benchmark result in README.
Fix stack config
Move ambiguous functions to compatibility module: Unicode.Char.General.Compat.
Fix benchmark for old GHC
Improve doc about isSpace and isWhiteSpace
Improve benchmark by using bcompare.
Add .editorconfig to .packcheck.ignore
Fixes for review.
Add .editorconfig
Fixes for review
appveyor: lts-18.17 → lts-18.18
Revert "Add support for big-endian architectures"

This reverts commit 137f201.
Update Changelog.md
Simplify tests
Add support for big-endian architectures
@wismill
Copy link
Collaborator Author

wismill commented Dec 20, 2021

Unfortunately it seems that bench-show does not support tasty-bench output. I will add an excerpt of the current output.

I'll need to update bench-show and generate the benchmarks. I'll need to do this as a release task anyway.

Ok. Apart from this and GHC 7.10.3 failing, I think this PR is ready to merge.

@wismill wismill mentioned this pull request Dec 20, 2021
@wismill
Copy link
Collaborator Author

wismill commented Dec 21, 2021

Let’s merge and continue work in specific issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Code point ranges are not parsed in UnicodeData.txt Add support for General_Category property
4 participants