Unicode normalization combinators #698

adithyaov · 2020-09-22T10:39:19Z

Todo:

Add benchmarks

adithyaov · 2020-09-22T11:30:11Z

Ignore cabal.project. We'll release unicode-data before merging this.

harendra-kumar · 2021-07-29T20:58:51Z

@adithyaov can you rebase this, port it to the new unicode-data package, and add the benchmarks from the unicode-transforms package?

harendra-kumar

Let's add the normalization routines to the Unicode.Char module. We will use this module for operations on the Char type, including transformations from one stream of Char to another. The Unicode.Stream module should go away: we will move its encode/decode routines to Unicode.Utf8 and Unicode.Latin1.
Let's also add some basic benchmarks so that we can start working on performance optimizations of these modules. Just take an input file and normalize it. The existing Unicode.Stream benchmarks already take input files via environment variables.

Once the normalization tests pass and we have benchmarks in place, we can commit this and do the performance optimizations later.
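
As a rough illustration of the benchmark described above (read an input file named by an environment variable, normalize it, time the pass), here is a minimal base-only sketch. The names `timeNormalize` and `benchFromEnv`, and the choice to pass the normalizer in as a plain `String -> String` function, are assumptions made for this example. They are not the PR's API, and the actual benchmarks would presumably use a benchmarking library and stream types instead.

```haskell
import Control.Exception (evaluate)
import Data.List (foldl')
import System.CPUTime (getCPUTime)
import System.Environment (lookupEnv)

-- Walk the whole string, forcing every Char, and return its length.
forceString :: String -> Int
forceString = foldl' (\n c -> c `seq` n + 1) 0

-- Time one pass of a normalizer over an input, in picoseconds of CPU time.
-- 'normalizer' is a stand-in (assumption) for a real normalization function.
timeNormalize :: (String -> String) -> String -> IO Integer
timeNormalize normalizer input = do
  -- Force the input first so that lazy file reading is not measured.
  _ <- evaluate (forceString input)
  start <- getCPUTime
  _ <- evaluate (forceString (normalizer input))
  end <- getCPUTime
  pure (end - start)

-- Look up the input path in the given environment variable and time the
-- normalizer on that file. Returns Nothing when the variable is not set.
benchFromEnv :: String -> (String -> String) -> IO (Maybe Integer)
benchFromEnv var normalizer = do
  mpath <- lookupEnv var
  case mpath of
    Nothing -> pure Nothing
    Just path -> Just <$> (readFile path >>= timeNormalize normalizer)
```

For instance, `benchFromEnv "NORMALIZE_INPUT" id` would time the identity function over the file named by the (hypothetical) `NORMALIZE_INPUT` variable, which gives a baseline to compare a real normalizer against.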

src/Streamly/Internal/Unicode/Normalize.hs

harendra-kumar · 2021-08-04T22:30:18Z

+-- decomposed.
+{-# INLINE_EARLY partialComposeD #-}
+partialComposeD :: Monad m => Stream m Char -> Stream m Char
+partialComposeD (Stream step state) = Stream step' (ComposeNone state)


Call this composeD.

adithyaov · 2021-08-06

This only partially composes: it does not compose a semi-decomposed stream, so it shouldn't be used directly.
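
One way to read the reply above: a composer that only merges a starter with the marks that follow it yields the composed form when its input is fully decomposed, but can miss compositions when the input is only partly decomposed. The sketch below shows this on plain lists with a three-entry toy table. `composePair` and `composeDecomposed` are hypothetical names, the code ignores combining-class blocking, and it is not the PR's `partialComposeD`.

```haskell
-- Toy composition table with three real Unicode pairs (hypothetical helper;
-- a real implementation would use the full Unicode composition data).
composePair :: Char -> Char -> Maybe Char
composePair s m = lookup (s, m) table
  where
    table =
      [ (('e', '\x0302'), '\x00EA')      -- e + circumflex         = U+00EA
      , (('e', '\x0323'), '\x1EB9')      -- e + dot below          = U+1EB9
      , (('\x1EB9', '\x0302'), '\x1EC7') -- U+1EB9 + circumflex    = U+1EC7
      ]

-- Compose a string that is assumed to be fully decomposed and canonically
-- ordered. Each mark is tried against the current starter; marks that do
-- not combine are kept after it. Combining-class blocking is ignored.
composeDecomposed :: String -> String
composeDecomposed [] = []
composeDecomposed (c : cs) = go c [] cs
  where
    -- go <current starter> <uncombined marks, reversed> <remaining input>
    go s pend [] = s : reverse pend
    go s pend (x : xs) =
      case composePair s x of
        Just s' -> go s' pend xs
        Nothing
          | isMark x -> go s (x : pend) xs
          | otherwise -> s : reverse pend ++ go x [] xs

    -- Only the two combining marks used in the toy table.
    isMark x = x `elem` ['\x0302', '\x0323']
```

With this table, the fully decomposed input `"e\x0323\x0302"` composes to `"\x1EC7"`, while the semi-decomposed `"\x00EA\x0323"` is returned unchanged because the table has no entry for that pair, even though both inputs spell the same character (U+1EC7 decomposes canonically to U+1EB9 U+0302).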

harendra-kumar

Looks good at a high level. I did not check the details.

harendra-kumar mentioned this pull request Apr 20, 2021

Incremental normalization? composewell/unicode-transforms#60

Open

adithyaov force-pushed the unicode-normalize branch 6 times, most recently from 0a6debe to fa8f961, on August 4, 2021 14:12

harendra-kumar requested changes Aug 4, 2021

adithyaov force-pushed the unicode-normalize branch 2 times, most recently from 7424d87 to 131eff1, on August 6, 2021 13:08

harendra-kumar approved these changes Aug 6, 2021

adithyaov added 2 commits August 8, 2021 06:37

Add unicode-data to nix env

c7a28df

Add unicode-data to stack env

26ff76e

adithyaov force-pushed the unicode-normalize branch from 131eff1 to 5061dd1 on August 8, 2021 01:07

adithyaov and others added 3 commits August 8, 2021 06:38

Add normalization combinators to Streamly.Internal.Unicode.Char

1871bad

Add unicode normalization test-suite

939bea1

Add unicode normalization bench-suite

2a638b0

adithyaov force-pushed the unicode-normalize branch from 5061dd1 to 2a638b0 on August 8, 2021 01:08

adithyaov merged commit 69e3b2c into master Aug 9, 2021

harendra-kumar deleted the unicode-normalize branch March 29, 2022 17:37