Efficient parsing using generalized unfolds #942

merged 5 commits into from Feb 20, 2021

harendra-kumar

Instead of using a closed loop, we can stop an unfold and then resume it
later by using an explicit loop state. This allows us to break the loop
and restart it at some other point. For example, we can parse a block
from an input stream read from a file handle or a socket and then return
the socket/handle plus any buffered data (due to backtracking) so that we
can resume reading from it later on after doing some processing.
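
To make the stop-and-resume idea concrete, here is a minimal sketch. It assumes a producer is a step function plus an `inject` (seed to state) and an `extract` (state back to seed); those two names come from this thread, but the exact definition, `fromList`, and `takeN` are illustrative assumptions and not necessarily the code in this PR.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of the loop: emit a value and a new state, or stop.
data Step s b = Yield b s | Stop

-- Sketch of a resumable unfold (assumed shape, not the library's API):
-- a step function, inject (seed -> state) and extract (state -> seed).
data Producer m a b =
    forall s. Producer (s -> m (Step s b)) (a -> m s) (s -> m a)

-- Hypothetical producer whose seed is a list; the state is the
-- part of the list that has not been emitted yet.
fromList :: Monad m => Producer m [b] b
fromList = Producer step pure pure
  where
    step [] = pure Stop
    step (x : xs) = pure (Yield x xs)

-- Hypothetical driver: run at most n steps, then hand back the
-- outputs together with the leftover seed so the caller can resume.
takeN :: Monad m => Int -> Producer m a b -> a -> m ([b], a)
takeN n (Producer step inject extract) seed = inject seed >>= go n []
  where
    go k acc s
        | k <= 0 = finish acc s
        | otherwise = do
            r <- step s
            case r of
                Yield b s' -> go (k - 1) (b : acc) s'
                Stop -> finish acc s
    finish acc s = do
        a <- extract s
        pure (reverse acc, a)
```

With this sketch, `takeN 2 fromList [1,2,3,4,5]` gives `([1,2],[3,4,5])`, and passing the leftover `[3,4,5]` to another `takeN` call picks up where the first one stopped. A handle or socket seed would play the same role as the list here.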

@harendra-kumar

Allocations and bytesCopied are 0.

Data.Parser(cpuTime)
Benchmark                                                               default(μs)
----------------------------------------------------------------------- -----------
Data.Parser/o-1-space/parseMany/Unfold/1000 arrays/take 1                    356.82
Data.Parser/o-1-space/parseMany/Unfold/1000 arrays/take all                  122.11
Data.Parser/o-1-space/parseMany (take all)                                    66.04
Data.Parser/o-1-space/parseMany (take 1)                                      48.27

The type can be simplified further: we can remove inject and extract. It should simply be a Step runner function.

@adithyaov adithyaov left a comment

I've taken a look. I think I can use something like this in streamly-lz4.

--
-- /Internal/
{-# INLINE_NORMAL lmap #-}
lmap :: (a -> a) -> Producer m a b -> Producer m a b

@adithyaov adithyaov Feb 20, 2021

We should probably remove or rename this. I don't think it justifies the name lmap.

Member Author

A search for f :: (a -> a) -> g a -> g a on Hoogle gave only fmap/liftA/liftM. It's a map even though it is restricted to an endomorphism. Any alternative names? If we remove it, can we achieve the same functionality in some other way?
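
A sketch of why the mapped function ends up restricted to an endomorphism, assuming a producer carries both an `inject :: a -> m s` and an `extract :: s -> m a`. The seed type `a` then appears as an input and as an output, so a general `c -> a` cannot be applied on the input side alone. The definitions below are illustrative assumptions, not necessarily the code under review; `fromList` and `takeN` are hypothetical helpers.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s b = Yield b s | Stop

-- Assumed shape: step, inject (seed -> state), extract (state -> seed).
data Producer m a b =
    forall s. Producer (s -> m (Step s b)) (a -> m s) (s -> m a)

-- Pre-compose a function on the seed before it is injected.
-- With f :: c -> a the result would have to be Producer m c b, which
-- needs extract :: s -> m c, and no function a -> c is available to
-- build it. Hence f :: a -> a.
lmap :: (a -> a) -> Producer m a b -> Producer m a b
lmap f (Producer step inject extract) = Producer step (inject . f) extract

-- Hypothetical list producer and bounded driver, for demonstration only.
fromList :: Monad m => Producer m [b] b
fromList = Producer step pure pure
  where
    step [] = pure Stop
    step (x : xs) = pure (Yield x xs)

takeN :: Monad m => Int -> Producer m a b -> a -> m ([b], a)
takeN n (Producer step inject extract) seed = inject seed >>= go n []
  where
    go k acc s
        | k <= 0 = finish acc s
        | otherwise = do
            r <- step s
            case r of
                Yield b s' -> go (k - 1) (b : acc) s'
                Stop -> finish acc s
    finish acc s = do
        a <- extract s
        pure (reverse acc, a)
```

For example, `takeN 1 (lmap (drop 1) fromList) [1,2,3]` gives `([2],[3])`: the seed is trimmed before injection, and the extracted leftover keeps the same seed type.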

import Streamly.Internal.Data.Producer.Type
import Prelude hiding (concat)

-- XXX We should write unfolds as producers where possible and define

I don't think we need to define unfolds using simplify. Users can do that themselves if they ever need to. (Ideally, they should use producers where they can.)

Comment on lines +529 to +530
extract (ReadUState (ForeignPtr end contents) (Ptr p)) =
    return $ Array (ForeignPtr p contents) (Ptr end) (Ptr end)

The type of extract :: s -> m a puts a constraint on a few things. I might want to return the entire array along with the index (ptr) up to which we generated. No use case yet, but...

I'm sure we'll encounter a use case that forces us to change the type of extract to s -> m c.
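
A rough sketch of what the generalized `extract :: s -> m c` could look like. The type `Producer2` and the helpers `indexed` and `takeN` are hypothetical names invented for illustration, and a list stands in for the array: the seed is the list, while the extracted value is the whole list together with the index reached.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s b = Yield b s | Stop

-- Hypothetical variant in which the seed type (a) and the extracted
-- type (c) are allowed to differ.
data Producer2 m a c b =
    forall s. Producer2 (s -> m (Step s b)) (a -> m s) (s -> m c)

-- The state keeps the original list, the number of elements emitted
-- so far, and the remaining elements. Extract returns the original
-- list together with the index reached.
indexed :: Monad m => Producer2 m [b] ([b], Int) b
indexed = Producer2 step inject extract
  where
    inject xs = pure (xs, 0, xs)
    step (_, _, []) = pure Stop
    step (whole, i, y : ys) = pure (Yield y (whole, i + 1, ys))
    extract (whole, i, _) = pure (whole, i)

-- Hypothetical driver: run at most n steps, then extract.
takeN :: Monad m => Int -> Producer2 m a c b -> a -> m ([b], c)
takeN n (Producer2 step inject extract) seed = inject seed >>= go n []
  where
    go k acc s
        | k <= 0 = finish acc s
        | otherwise = do
            r <- step s
            case r of
                Yield b s' -> go (k - 1) (b : acc) s'
                Stop -> finish acc s
    finish acc s = do
        c <- extract s
        pure (reverse acc, c)
```

Here `takeN 2 indexed "abcde"` gives `("ab",("abcde",2))`. The trade-off is that the extracted value can no longer be fed straight back into `inject`, so resuming needs a separate conversion from `c` to `a`.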

@adithyaov adithyaov left a comment

Please address the comments.
I haven't looked at the tests too thoroughly.

Instead of using a closed loop we can stop an unfold and then resume it
later. This allows us to break the loop and restart it at some other
point. For example, we can parse a block from an input stream from a
file handle or a socket and then return the socket/handle plus any
buffered data (due to backtracking) so that we can resume reading from
it later on after doing some processing.
Plus:

* Fix returning buffer in the "parse" function
* Some more minor formatting/doc fixups.
@harendra-kumar harendra-kumar merged commit c8b9fac into master Feb 20, 2021
@harendra-kumar harendra-kumar deleted the parse-unfold branch August 5, 2021 11:07