Add General_Category and further predicates #40

wismill · 2021-11-10T14:58:26Z

Parser
- Fix parsing of ranges in UnicodeData.txt
- Generate lib/Unicode/Internal/Char/UnicodeData/GeneralCategory.hs
Library
- Add GeneralCategory data type and corresponding generalCategoryAbbr, generalCategory functions.
- Add the following functions: isAlpha, isAlphaNum, isControl, isMark, isNumber, isPrint, isPunctuation, isSeparator, isSymbol.
- Breaking change: isLetter is renamed to isAlphabetic and isSpace is renamed to isWhiteSpace.
- Breaking change: isLetter and isSpace now match the behaviour of base’s Data.Char.
- Add lookupIntN in Unicode.Internal.Bits for low-level stuff.
- Re-export some functions from Data.Char in order to make Unicode.Char a drop-in replacement.
Test
- Add tests for Unicode.Char to ensure we get the same results as Data.Char.
  Note that this is true only if the compiler used has the same Unicode version as this package.
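
As a rough illustration of how a predicate such as isPunctuation relates to the general category, the sketch below uses only base's Data.Char. The names abbreviations, abbrSketch, isPunctuationSketch and agreesOnLatin are hypothetical and are not the functions added by this PR; the two-letter codes are the standard Unicode General_Category abbreviations, listed in the constructor order of base's GeneralCategory.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isPunctuation)

-- Standard two-letter General_Category abbreviations, in the constructor
-- order of base's GeneralCategory (UppercaseLetter .. NotAssigned).
abbreviations :: [String]
abbreviations =
  words
    "Lu Ll Lt Lm Lo Mn Mc Me Nd Nl No Pc Pd Ps Pe Pi Pf Po \
    \Sm Sc Sk So Zs Zl Zp Cc Cf Cs Co Cn"

-- Hypothetical stand-in for an abbreviation lookup: index by constructor.
abbrSketch :: GeneralCategory -> String
abbrSketch gc = abbreviations !! fromEnum gc

-- A predicate expressed purely in terms of the general category.
isPunctuationSketch :: Char -> Bool
isPunctuationSketch c = case generalCategory c of
  ConnectorPunctuation -> True
  DashPunctuation -> True
  OpenPunctuation -> True
  ClosePunctuation -> True
  InitialQuote -> True
  FinalQuote -> True
  OtherPunctuation -> True
  _ -> False

-- Sanity check against base on a small range of code points.
agreesOnLatin :: Bool
agreesOnLatin =
  all (\c -> isPunctuationSketch c == isPunctuation c) ['\0' .. '\x24F']
```

In GHCi, abbrSketch (generalCategory 'A') gives "Lu", since 'A' is an uppercase letter.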

Fixes #38, Fixes #39

Note: ucd.sh generate should be re-run if PR #36 is merged (before or after this PR).
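
To make the isSpace / isWhiteSpace distinction from the description concrete, here is a minimal sketch using only base. It assumes that isWhiteSpace follows the Unicode White_Space property; isWhiteSpaceSketch and disagreements are hypothetical names, and the code point list is written out by hand from PropList.txt rather than taken from the package's generated tables.

```haskell
import Data.Char (isSpace, ord)

-- Code points with the Unicode White_Space property, listed by hand
-- for this sketch (assumption: this is what isWhiteSpace checks).
isWhiteSpaceSketch :: Char -> Bool
isWhiteSpaceSketch c =
  (cp >= 0x0009 && cp <= 0x000D)
    || (cp >= 0x2000 && cp <= 0x200A)
    || cp `elem` singles
  where
    cp = ord c
    singles =
      [0x0020, 0x0085, 0x00A0, 0x1680, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000]

-- Characters up to U+3000 on which base's isSpace and the sketch disagree.
disagreements :: [Char]
disagreements =
  [c | c <- ['\0' .. '\x3000'], isSpace c /= isWhiteSpaceSketch c]
```

U+0085, U+2028 and U+2029 show up in disagreements: they carry the White_Space property but are not accepted by base's isSpace, which is why keeping both predicates under distinct names matters.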


harendra-kumar · 2021-11-11T15:16:44Z

@Bodigrim would you like to review the changes to lib/Unicode/Internal/Bits.hs?

Bodigrim · 2021-11-11T19:11:53Z

@wismill I appreciate the effort, but I think sub-byte operations are not worth it. There are 31 constructors, so you need 5 bits per value. It's most likely faster to just use a byte per value and fetch all the data in one go, instead of conditionally fetching one or two words.
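
A minimal sketch of the byte-per-value idea, using only base: each entry of a static table is one byte holding the constructor index, so a lookup is a single byte read followed by toEnum. The table here is a hypothetical toy covering only U+0028..U+002F, and toyTable and toyCategory are illustrative names; this is not the code in lib/Unicode/Internal/Bits.hs.

```haskell
{-# LANGUAGE MagicHash #-}

import Data.Char (GeneralCategory, generalCategory, ord)
import Data.Word (Word8)
import Foreign.Storable (peekByteOff)
import GHC.Ptr (Ptr (..))
import System.IO.Unsafe (unsafeDupablePerformIO)

-- One byte per code point for '(' .. '/': each byte is the constructor
-- index (fromEnum) of the character's GeneralCategory in base.
toyTable :: Ptr Word8
toyTable = Ptr "\x0d\x0e\x11\x12\x11\x0c\x11\x11"#

-- Look up the category with a single byte read; no bit shifting or
-- masking is needed because values are not packed below byte size.
toyCategory :: Char -> Maybe GeneralCategory
toyCategory c
  | i < 0 || i > 7 = Nothing
  | otherwise = Just (toEnum (fromIntegral byte))
  where
    i = ord c - 0x28
    -- Reading from an immutable string literal is pure in effect.
    byte = unsafeDupablePerformIO (peekByteOff toyTable i) :: Word8

-- The toy table agrees with base on the range it covers.
toyAgreesWithBase :: Bool
toyAgreesWithBase =
  all (\c -> toyCategory c == Just (generalCategory c)) ['(' .. '/']
```

The trade-off is table size: a byte per value spends 8 bits where 5 would do, in exchange for a simpler and more predictable read path.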

wismill · 2021-11-11T22:42:18Z

@Bodigrim Ok, I will check that. Meanwhile, I added a benchmark and disabled tests for incompatible GHC versions.

wismill · 2021-11-12T11:39:26Z

@Bodigrim I have implemented your suggestion. It is roughly 60 to 80% faster than my previous implementation!

If my benchmark is correct, we have a nice speedup of 16x compared to some functions from Data.Char.

All
  Unicode.Char.Case
    isLower
      base:           OK (0.52s)
         17 ms ± 585 μs
      unicode-data:   OK (0.33s)
        315 μs ±  14 μs
    isUpper
      base:           OK (2.07s)
         16 ms ± 879 μs
      unicode-data:   OK (0.31s)
        314 μs ±  12 μs
  Unicode.Char.General
    generalCategory
      base:           OK (0.86s)
        122 ms ± 4.2 ms
      unicode-data:   OK (0.33s)
        109 ms ± 3.4 ms
    isAlpa
      base:           OK (0.24s)
         16 ms ± 795 μs
      unicode-data:   OK (0.25s)
        945 μs ±  53 μs
    isAlpabetic
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    isAlpaNum
      base:           OK (1.01s)
         16 ms ± 263 μs
      unicode-data:   OK (0.49s)
        946 μs ±  35 μs
    isControl
      base:           OK (0.49s)
         16 ms ± 424 μs
      unicode-data:   OK (0.20s)
        783 μs ±  46 μs
    isLetter
      base:           OK (0.20s)
        783 μs ±  43 μs
      unicode-data:   OK (0.20s)
        784 μs ±  44 μs
    isMark
      base:           OK (0.25s)
         16 ms ± 693 μs
      unicode-data:   OK (0.20s)
        783 μs ±  44 μs
    isNumber
      base:           OK (0.52s)
         17 ms ± 843 μs
      unicode-data:   OK (0.49s)
        947 μs ±  28 μs
    isPrint
      base:           OK (0.50s)
         16 ms ± 901 μs
      unicode-data:   OK (0.24s)
        943 μs ±  53 μs
    isPunctuation
      base:           OK (0.47s)
         15 ms ± 813 μs
      unicode-data:   OK (0.20s)
        783 μs ±  42 μs
    isSeparator
      base:           OK (0.51s)
         16 ms ± 644 μs
      unicode-data:   OK (0.24s)
        949 μs ±  50 μs
    isSpace
      base:           OK (0.40s)
        1.6 ms ±  43 μs
      unicode-data:   OK (0.24s)
        1.9 ms ± 109 μs
    isSymbol
      base:           OK (0.21s)
         15 ms ± 714 μs
      unicode-data:   OK (0.24s)
        940 μs ±  49 μs
    isWhiteSpace
      unicode-data:   OK (0.32s)
        315 μs ±  12 μs
    isHangul
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    isHangulLV
      unicode-data:   OK (0.32s)
        314 μs ±  12 μs
    isJamo
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    jamoLIndex
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    jamoVIndex
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    jamoTIndex
      unicode-data:   OK (0.32s)
        314 μs ±  12 μs
  Unicode.Char.Identifiers
    isIDContinue
      unicode-data:   OK (0.33s)
        314 μs ±  11 μs
    isIDStart
      unicode-data:   OK (0.33s)
        314 μs ±  12 μs
    isXIDContinue
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    isXIDStart
      unicode-data:   OK (0.32s)
        629 μs ±  24 μs
    isPatternSyntax
      unicode-data:   OK (0.33s)
        314 μs ±  13 μs
    isPatternWhitespace
      unicode-data:   OK (0.33s)
        315 μs ±  13 μs
  Unicode.Char.Normalization
    isCombining
      unicode-data:   OK (0.33s)
        314 μs ±  11 μs
    combiningClass
      unicode-data:   OK (0.21s)
        3.3 ms ± 188 μs
    isCombiningStarter
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs
    isDecomposable
      Canonical
        unicode-data: OK (0.33s)
          315 μs ±  13 μs
      Kompat
        unicode-data: OK (0.33s)
          315 μs ±  12 μs
    decomposeHangul
      unicode-data:   OK (0.32s)
        314 μs ±  11 μs

All 48 tests passed (18.65s)


wismill · 2021-11-12T17:36:53Z

I have merged latest changes from master.

All tests successful, except Ubuntu+GHC7103, which seems to have a linker issue. Could you check it?

I propose to bump the package version from 0.2 to 0.3, as there are some breaking changes.


wismill · 2021-11-22T11:43:43Z

@harendra-kumar @adithyaov @Bodigrim I have merged master and added support for big-endian architectures (based on @Bodigrim's work). There are some CI failures, although they seem unrelated to this PR.


harendra-kumar · 2021-11-24T07:47:26Z

@adithyaov can you help with the appveyor CI failure and the GHC 9.2.1 CI? I think you fixed a similar issue elsewhere. GHC 9.2.1 CI example can be found here.

Bodigrim · 2021-11-25T20:05:20Z

Appveyor failure is a known issue with lts-18.17, try to update to lts-18.18 or newer.

harendra-kumar · 2021-11-30T22:18:52Z

I will take a better look at the APIs, naming, structuring etc once I get some time.

harendra-kumar

Overall looks pretty good. I have some minor comments.

We can add a cabal docspec CI to check the properties in haddock. We have a similar CI in the streamly repo.


adithyaov · 2021-12-10T21:14:37Z

@wismill I apologize for the delayed review. The PR looks good overall. I don't have any major comments other than the minor points already mentioned. I've also added one small question.

You might need to rebase and resolve conflicts.

I can send a PR addressing these minor points to this branch in your repo.

harendra-kumar · 2021-12-15T23:08:49Z

The CI jobs are failing with this message:

The following files are committed to the git repository but do not exist in the source distribution.

.editorconfig
Please consider adding them to your cabal file under 'extra-source-files' or 'extra-doc-files'.
If you do not want to add them in the source distribution then add them to .packcheck.ignore file at the root of the git repository.

You can add .editorconfig to .packcheck.ignore since we do not want to pack it in the distribution.

wismill · 2021-12-16T15:00:55Z

Fixed .packcheck.ignore

harendra-kumar · 2021-12-16T23:43:02Z

@adithyaov can you suggest a fix for the appveyor CI failure? I think you fixed a similar issue elsewhere.

harendra-kumar · 2021-12-16T23:52:52Z

@wismill this looks good for merge. You can squash any commits if required before the merge.

wismill · 2021-12-17T09:38:33Z

New benchmark with relative speedup:

All
  Unicode.Char.Case
    isLower
      base:           OK (0.17s)
         21 ms ± 1.4 ms
      unicode-data:   OK (0.12s)
        3.9 ms ± 340 μs, 0.18x
    isUpper
      base:           OK (0.15s)
         22 ms ± 1.7 ms
      unicode-data:   OK (0.23s)
        3.7 ms ± 168 μs, 0.17x
  Unicode.Char.General
    generalCategory
      base:           OK (0.37s)
        124 ms ± 4.2 ms
      unicode-data:   OK (0.32s)
        106 ms ± 5.5 ms, 0.85x
    isAlpha
      base:           OK (0.15s)
         21 ms ± 1.4 ms
      unicode-data:   OK (0.25s)
        3.9 ms ± 180 μs, 0.18x
    isAlphabetic
      unicode-data:   OK (0.16s)
        312 μs ±  23 μs
    isAlphaNum
      base:           OK (0.15s)
         22 ms ± 1.5 ms
      unicode-data:   OK (0.14s)
        4.6 ms ± 337 μs, 0.21x
    isControl
      base:           OK (0.15s)
         22 ms ± 1.5 ms
      unicode-data:   OK (0.55s)
        4.5 ms ± 365 μs, 0.21x
    isLetter
      base:           OK (0.32s)
         21 ms ± 904 μs
      unicode-data:   OK (0.12s)
        3.9 ms ± 342 μs, 0.18x
    isMark
      base:           OK (0.16s)
         23 ms ± 2.2 ms
      unicode-data:   OK (0.27s)
        4.3 ms ± 175 μs, 0.19x
    isPrint
      base:           OK (0.34s)
         22 ms ± 1.4 ms
      unicode-data:   OK (0.13s)
        4.2 ms ± 378 μs, 0.18x
    isPunctuation
      base:           OK (0.15s)
         21 ms ± 2.1 ms
      unicode-data:   OK (0.13s)
        4.3 ms ± 351 μs, 0.20x
    isSeparator
      base:           OK (0.34s)
         23 ms ± 847 μs
      unicode-data:   OK (0.12s)
        4.0 ms ± 369 μs, 0.17x
    isSpace
      base:           OK (2.97s)
         11 ms ±  60 μs
      unicode-data:   OK (0.15s)
        4.8 ms ± 381 μs, 0.42x
    isSymbol
      base:           OK (0.33s)
         22 ms ± 1.0 ms
      unicode-data:   OK (0.14s)
        4.4 ms ± 377 μs, 0.20x
    isWhiteSpace
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    isHangul
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isHangulLV
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    isJamo
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    jamoLIndex
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    jamoVIndex
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    jamoTIndex
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
  Unicode.Char.Identifiers
    isIDContinue
      unicode-data:   OK (0.16s)
        312 μs ±  23 μs
    isIDStart
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isXIDContinue
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isXIDStart
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isPatternSyntax
      unicode-data:   OK (0.16s)
        312 μs ±  22 μs
    isPatternWhitespace
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
  Unicode.Char.Normalization
    isCombining
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    combiningClass
      unicode-data:   OK (0.19s)
        2.9 ms ± 221 μs
    isCombiningStarter
      unicode-data:   OK (0.16s)
        312 μs ±  21 μs
    isDecomposable
      Canonical
        unicode-data: OK (0.16s)
          624 μs ±  42 μs
      Kompat
        unicode-data: OK (0.16s)
          625 μs ±  46 μs
    decomposeHangul
      unicode-data:   OK (0.16s)
        623 μs ±  43 μs
  Unicode.Char.Numeric
    isNumber
      base:           OK (0.16s)
         23 ms ± 1.7 ms
      unicode-data:   OK (0.15s)
        4.8 ms ± 345 μs, 0.21x

All 48 tests passed (12.06s)
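
For reading the relative figures in this output: the number after each unicode-data timing is that timing divided by the base timing, so values below 1 mean unicode-data is faster. A small sketch of the arithmetic, with hypothetical helper names and the rounded isLower timings above as input:

```haskell
-- Relative figure as printed next to a compared benchmark:
-- candidate time divided by baseline time (same unit for both).
relative :: Double -> Double -> Double
relative baseline candidate = candidate / baseline

-- The same comparison expressed as a speedup factor.
speedup :: Double -> Double -> Double
speedup baseline candidate = baseline / candidate

-- isLower from the log above: base 21 ms, unicode-data 3.9 ms.
isLowerRelative, isLowerSpeedup :: Double
isLowerRelative = relative 21 3.9
isLowerSpeedup = speedup 21 3.9
```

With the rounded timings, isLowerRelative comes out just under 0.19, in line with the 0.18x shown in the log (which is computed from unrounded measurements), and isLowerSpeedup is roughly 5.4.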

harendra-kumar · 2021-12-17T12:54:20Z

The benchmarks look pretty good compared to base. It may be a good idea for GHC/base to use the Haskell native unicode-data instead of using FFI to C library functions. And we have a good chunk of unicode functionality supported now, I am sure we will keep adding more stuff and one day this will be comprehensive.

Good stuff @wismill !

wismill · 2021-12-17T15:03:00Z

Ok, I think this PR is almost ready. Should I squash all the commits? Should I bump the package’s version to 0.3?

harendra-kumar · 2021-12-18T00:04:37Z

You can either squash them into a single commit or maybe squash them into a few logically related commits, or just squash the trivial/fixup type commits and leave the rest as it is, whatever you deem fit.

We can bump the package version, but let's give it a soak time of a week before we upload to hackage. We can also generate a bench-show report with a side-by-side benchmark comparison with base and put it in the readme, like we have in unicode-transforms.

adithyaov · 2021-12-18T04:53:49Z

This needs to be added to the stack config.

flags:
  mintty:
    Win32-2-13-1: false

See https://stackoverflow.com/questions/70045586/could-not-find-module-system-console-mintty-win32-when-compiling-test-framework

wismill · 2021-12-20T08:28:36Z

> We can also generate a bench-show report with a side-by-side benchmark comparison with base and put it in the readme, like we have in unicode-transforms.

Unfortunately it seems that bench-show does not support tasty-bench output. I will add an excerpt of the current output.

> This needs to be added to the stack config.

I will add it.

adithyaov · 2021-12-20T08:47:55Z

> Unfortunately it seems that bench-show does not support tasty-bench output. I will add an excerpt of the current output.

I'll need to update bench-show and generate the benchmarks. I'll need to do this as a release task anyway.

- Add benchmark result in README.
- Fix stack config
- Move ambiguous functions to compatibility module: Unicode.Char.General.Compat.
- Fix benchmark for old GHC
- Improve doc about isSpace and isWhiteSpace
- Improve benchmark by using bcompare.
- Add .editorconfig to .packcheck.ignore
- Fixes for review.
- Add .editorconfig
- Fixes for review
- appveyor: lts-18.17 → lts-18.18
- Revert "Add support for big-endian architectures" (this reverts commit 137f201).
- Update Changelog.md
- Simplify tests
- Add support for big-endian architectures

wismill · 2021-12-20T09:55:02Z

> > Unfortunately it seems that bench-show does not support tasty-bench output. I will add an excerpt of the current output.
>
> I'll need to update bench-show and generate the benchmarks. I'll need to do this as a release task anyway.

Ok. Apart from this and GHC 7.10.3 failing, I think this PR is ready to merge.

wismill · 2021-12-21T12:45:59Z

Let’s merge and continue work in specific issues.

wismill added 8 commits November 10, 2021 14:55

Add general category and some character predicates.

fd96270

Breaking change: rename isLetter to isAlphabetic and isSpace to isWhiteSpace, in order to be closer to Data.Char.

779b79c

Re-export some Data.Char functions.

c01eec6

Fix Unicode Data range parsing

ab1ddc9

Comment only

22fdb0d

Add tests

430331b

Update appveyor.yml to lts-18.16

edf0fa7

Doc only

204f9fc

[skip ci]

harendra-kumar requested a review from adithyaov November 11, 2021 15:02

wismill added 2 commits November 11, 2021 21:41

Add benchmark

89125be

Disable tests for GHC with incompatible Unicode versions.

63efcf6

Simplify Enum list encoding using byte instead of sub-byte.

25fc5b5

wismill added 4 commits November 12, 2021 16:41

Merge remote-tracking branch 'origin/master' into feature/general_category

595bcdc

Regenerate GeneralCategory.hs

0aa511d

Fix build for GHC 9.2.1. Update tests to run only for GHC >= 9.2.1 (Unicode 14.0).

7d56aa8

Make hlint happy.

eb03cfa

Fix base bound for benchmark. Add default language.

Merge remote-tracking branch 'origin/master' into feature/general_category # Conflicts: # appveyor.yml # lib/Unicode/Internal/Bits.hs

2ace84e

Bodigrim reviewed Nov 22, 2021

lib/Unicode/Internal/Bits.hs

Bodigrim approved these changes Nov 25, 2021

harendra-kumar requested changes Dec 10, 2021

adithyaov reviewed Dec 10, 2021

Changelog.md

adithyaov reviewed Dec 10, 2021

bench/Main.hs

adithyaov reviewed Dec 10, 2021

exe/Parser/Text.hs

adithyaov reviewed Dec 10, 2021

lib/Unicode/Char/General.hs

wismill added 2 commits December 20, 2021 10:01

Bump package version and add since annotations.

4260b7e

wismill force-pushed the feature/general_category branch from 405118a to 4260b7e December 20, 2021 09:01

Fix stack.yaml

6e68e5f

wismill force-pushed the feature/general_category branch from 3068c75 to 6e68e5f December 20, 2021 09:40

wismill mentioned this pull request Dec 20, 2021

Add code formatter #49

Open

wismill merged commit fc5a08b into composewell:master Dec 21, 2021

wismill mentioned this pull request Dec 21, 2021

Update the benchmarks #50

Open

wismill mentioned this pull request Jun 16, 2023

Excessive inlining may optimize away the function to benchmark Bodigrim/tasty-bench#48

Closed
