Unicode normalization combinators #698

Merged
merged 5 commits into from
Aug 9, 2021

Conversation

Member

@adithyaov adithyaov commented Sep 22, 2020

Todo:

  • Add benchmarks

@adithyaov
Member Author

Ignore cabal.project. We'll release unicode-data before merging this.

@harendra-kumar
Member

@adithyaov can you rebase this, port it to the new unicode-data package, and add the benchmarks from the unicode-transforms package?

@adithyaov adithyaov force-pushed the unicode-normalize branch 6 times, most recently from 0a6debe to fa8f961 Compare August 4, 2021 14:12
Member

@harendra-kumar harendra-kumar left a comment

  1. Let's add the normalization routines to the Unicode.Char module. We will use this module for operations on the Char type, including transformations from one stream of Char to another. The Unicode.Stream module should go away: we will move its encode/decode routines to Unicode.Utf8 and Unicode.Latin1.
  2. Let's also add some basic benchmarks so that we can start working on performance optimizations of these modules. Just take an input file and normalize it. The existing Unicode.Stream benchmarks already take input files via environment variables.

Once the normalization tests pass and we have benchmarks in place, we can commit this and do the performance optimizations later.
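
As a rough illustration of the benchmark described in item 2 (read an input file named by an environment variable, normalize it, measure the time), here is a minimal sketch using only `base`. The variable name `NORMALIZE_BENCH_INPUT` and the function `normalizeStub` are hypothetical placeholders, not names from this PR, and a real benchmark would most likely use a benchmarking library rather than raw CPU-time measurements.

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)
import System.Environment (lookupEnv)

-- Hypothetical placeholder: stands in for whichever normalization routine
-- is being benchmarked. It is the identity here, so it normalizes nothing.
normalizeStub :: String -> String
normalizeStub = id

-- Run the given function over the input, forcing the whole result, and
-- return the elapsed CPU time in picoseconds.
timeNormalize :: (String -> String) -> String -> IO Integer
timeNormalize f input = do
    -- Force the (lazily read) input first so that file I/O is not timed.
    _ <- evaluate (length input)
    start <- getCPUTime
    _ <- evaluate (length (f input))
    end <- getCPUTime
    return (end - start)

main :: IO ()
main = do
    -- Assumption: the input file path is passed through this variable.
    mpath <- lookupEnv "NORMALIZE_BENCH_INPUT"
    case mpath of
        Nothing ->
            putStrLn "Set NORMALIZE_BENCH_INPUT to an input file path."
        Just path -> do
            input <- readFile path
            ps <- timeNormalize normalizeStub input
            -- 1 microsecond = 10^6 picoseconds
            putStrLn ("CPU time: " ++ show (ps `div` 1000000) ++ " us")
```

Swapping `normalizeStub` for the routine under test keeps the harness unchanged, which is the main point of taking the function as a parameter.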

src/Streamly/Internal/Unicode/Normalize.hs
-- decomposed.
{-# INLINE_EARLY partialComposeD #-}
partialComposeD :: Monad m => Stream m Char -> Stream m Char
partialComposeD (Stream step state) = Stream step' (ComposeNone state)
Member

call this composeD.

Member Author

@adithyaov adithyaov Aug 6, 2021

This only partially composes: it does not compose a semi-decomposed stream, so it shouldn't be used directly.
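
For readers unfamiliar with the shape of the snippet under review (`Stream step' (ComposeNone state)`), the following is a self-contained toy sketch of that pattern: a stream transformer that wraps the inner stream's state in its own state type and holds back one character to see whether the next one combines with it. It is not the PR's implementation. The `Stream` type here is simplified (the type in streamly may carry extra arguments), and `ComposeHold`, `ComposeDone`, `composePair`, and `composeToy` are hypothetical names. The two-entry composition table is for illustration only, and combining classes are ignored.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.Functor.Identity (Identity (..))

-- Simplified state-machine stream (an assumption, not streamly's exact type).
data Step s a = Yield a s | Skip s | Stop

data Stream m a = forall s. Stream (s -> m (Step s a)) s

-- Hypothetical composer state wrapping the inner stream state.
data ComposeState s
    = ComposeNone s      -- nothing buffered yet
    | ComposeHold Char s -- one character held back, waiting for the next
    | ComposeDone        -- inner stream finished and buffer flushed

-- Toy composition table: e + U+0301 and a + U+0308 only.
composePair :: Char -> Char -> Maybe Char
composePair 'e' '\x0301' = Just '\x00E9'
composePair 'a' '\x0308' = Just '\x00E4'
composePair _ _ = Nothing

composeToy :: Monad m => Stream m Char -> Stream m Char
composeToy (Stream step state) = Stream step' (ComposeNone state)
  where
    step' (ComposeNone st) = do
        r <- step st
        return $ case r of
            Yield c s -> Skip (ComposeHold c s)
            Skip s    -> Skip (ComposeNone s)
            Stop      -> Stop
    step' (ComposeHold prev st) = do
        r <- step st
        return $ case r of
            Yield c s -> case composePair prev c of
                Just x  -> Skip (ComposeHold x s)       -- combined, keep holding
                Nothing -> Yield prev (ComposeHold c s) -- emit old, hold new
            Skip s -> Skip (ComposeHold prev s)
            Stop   -> Yield prev ComposeDone            -- flush the buffer
    step' ComposeDone = return Stop

fromList :: Monad m => [a] -> Stream m a
fromList = Stream step
  where
    step []       = return Stop
    step (x : xs) = return (Yield x xs)

toList :: Monad m => Stream m a -> m [a]
toList (Stream step s0) = go s0
  where
    go s = do
        r <- step s
        case r of
            Yield x s' -> (x :) <$> go s'
            Skip s'    -> go s'
            Stop       -> return []

main :: IO ()
main =
    -- 'e' followed by a combining acute accent, then 'x'
    print (map fromEnum (runIdentity (toList (composeToy (fromList ['e', '\x0301', 'x'])))))
```

Running `main` prints `[233,120]`: the first two characters are combined into U+00E9, and `x` passes through unchanged.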

@adithyaov adithyaov force-pushed the unicode-normalize branch 2 times, most recently from 7424d87 to 131eff1 Compare August 6, 2021 13:08
Member

@harendra-kumar harendra-kumar left a comment

Looks good at a high level. I did not check the details.

@adithyaov adithyaov merged commit 69e3b2c into master Aug 9, 2021
@harendra-kumar harendra-kumar deleted the unicode-normalize branch March 29, 2022 17:37