zero-copy ini parser example #7

Closed
wants to merge 13 commits

Conversation

@Byron (Contributor) commented Oct 15, 2020

I thought it might be helpful for you to witness my humble beginnings; my raw notes may shed some light on where I stumbled. And of course, I would be glad to learn how you would do things differently.

Notes 2020-10-15

Please take everything as just my opinion, based on possibly biased or downright wrong intuition. The following points are the thoughts in my head while trying to achieve a certain goal, and thus might be quite 'raw' and maybe hard to follow.

  • consume_u8 allows specifying the byte to consume, but peek_u8 does not
  • it's very easy to attempt to use functions returning an error, like peek_u8, from a function that doesn't constrain E at all, causing errors that rustc suggests fixing in a 'wrong/cumbersome' way (as it doesn't know it should use the Error<'i> constraint). I tried to do that from skip_whitespace(…).
  • I would have expected peek_eq() to also have a version that takes a single byte. Maybe that's a special case though
  • skip_while(…) has no trait bound on the error, but skip(…) has one. It's confusing because the skip_whitespace function taken from the 'json' example doesn't have an Error constraint, but now that seems to be possible only rarely. Note to self: it's probably best to just add the Error<'i> constraint at all times.
  • Even though it's easy to configure how input is displayed in errors, it would be nice if there was an easy way to see 'bytes as byte-ascii' in accidentally failing tests (during development). This might be useful for all byte-ascii type parsers.
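The trait-bound stumbling block from the notes above can be sketched with a stripped-down, hypothetical stand-in for the dangerous types (ParseError and this Reader are illustrations, not the real Error<'i>/Reader API): a helper that calls a fallible reader method has to carry the error bound itself, which rustc's suggestions don't always make obvious.

```rust
// A minimal error trait standing in for dangerous's `Error<'i>` (hypothetical).
trait ParseError {
    fn end_of_input() -> Self;
}

impl ParseError for String {
    fn end_of_input() -> Self {
        "unexpected end of input".to_owned()
    }
}

struct Reader<'i, E> {
    input: &'i [u8],
    pos: usize,
    _marker: std::marker::PhantomData<E>,
}

impl<'i, E: ParseError> Reader<'i, E> {
    fn new(input: &'i [u8]) -> Self {
        Reader { input, pos: 0, _marker: std::marker::PhantomData }
    }

    // Fallible like peek_u8: at end of input it must produce an E.
    fn peek_u8(&self) -> Result<u8, E> {
        self.input.get(self.pos).copied().ok_or_else(E::end_of_input)
    }

    fn skip(&mut self, n: usize) {
        self.pos = (self.pos + n).min(self.input.len());
    }
}

// Because skip_whitespace calls the fallible peek_u8, the `E: ParseError`
// bound must appear here too -- leaving E unconstrained is the compile error
// the note above ran into with dangerous's `Error<'i>`.
fn skip_whitespace<'i, E: ParseError>(r: &mut Reader<'i, E>) {
    while let Ok(b) = r.peek_u8() {
        if b.is_ascii_whitespace() {
            r.skip(1);
        } else {
            break;
        }
    }
}

fn main() {
    let mut r: Reader<'_, String> = Reader::new(b"  \t key = value");
    skip_whitespace(&mut r);
    assert_eq!(r.peek_u8().ok(), Some(b'k'));
}
```

Dropping the `E: ParseError` bound from `skip_whitespace` reproduces the error: the call to `peek_u8` can no longer name a type for the `Err` case.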

@Byron commented Oct 15, 2020

I am starting to get a feel for this and begin to see why parser combinators were invented - they seem to happen quite naturally. It's interesting to see how one has to implement everything by hand, for example handling EOF or that there might simply not be anything to parse.

Depending on the complexity of the format, this approach might come most naturally, whereas in others combinators might appear 'easier' to use (especially with prior experience).

@Byron marked this pull request as ready for review October 15, 2020 14:46
@Byron commented Oct 15, 2020

Unbelievably, I was able to take the time and finish the first version of a basic ini parser. I am very much looking forward to your ideas on how to improve it so that it uses dangerous even better.

Once I got the hang of it, it was quite easy to get productive, nearly without any surprises at all, even though it took me a moment to figure out how to return my own error more explicitly.

When comparing this parser with the one provided in the combine examples, I noticed that the error messages of dangerous were actually much more helpful than the ones generated by combine. Furthermore, the combine parser was way too lenient, to the point where I believed it could parse 'the wrong thing' (when testing it with my doctored input), whereas dangerous would handle most obvious errors well right away.

For git-config, I would have to keep spans instead of borrowed values, and would be glad for a suggestion on how best to accomplish that. I imagine dangerous could provide convenience functions for this (if it doesn't already); with 1.47 a new pointer offset method was stabilized, which might make that even easier.

Thanks a lot for sharing your thoughts.

(six review comments on examples/ini.rs, since outdated and resolved)
@avitex (Owner) commented Oct 16, 2020

> consume_u8 allows specifying the byte to consume, but peek_u8 does not

I would say peek_u8 is closer to read_u8 than consume_u8. consume_u8 exists to escape the lifetime requirement of consume when you only care about a single u8. For example:

this wouldn't work due to the lifetime:

fn read_tag<'i, E>(r: &mut Reader<'i, E>, tag: u8) .. {
  r.consume(&[tag])?;
  // ..
}

so we do this:

fn read_tag<'i, E>(r: &mut Reader<'i, E>, tag: u8) .. {
  r.consume_u8(tag)?;
  // ..
}
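The lifetime escape described above can be reproduced with a minimal, hypothetical reader (not the real dangerous API): here consume borrows the expected bytes for the input lifetime 'i (so an error could reference them later), which a local temporary like &[tag] cannot satisfy, while consume_u8 takes the byte by value and sidesteps the borrow entirely.

```rust
struct Reader<'i> {
    input: &'i [u8],
}

impl<'i> Reader<'i> {
    // Like consume: the expected bytes must live as long as the input,
    // e.g. so an error value could keep borrowing them for display.
    fn consume(&mut self, bytes: &'i [u8]) -> Result<(), String> {
        if self.input.starts_with(bytes) {
            self.input = &self.input[bytes.len()..];
            Ok(())
        } else {
            Err(format!("expected {:?}", bytes))
        }
    }

    // Like consume_u8: the byte is passed by value, so no 'i borrow is needed.
    fn consume_u8(&mut self, byte: u8) -> Result<(), String> {
        match self.input.split_first() {
            Some((b, rest)) if *b == byte => {
                self.input = rest;
                Ok(())
            }
            _ => Err(format!("expected {:?}", byte as char)),
        }
    }
}

fn read_tag<'i>(r: &mut Reader<'i>, tag: u8) -> Result<(), String> {
    // r.consume(&[tag])?; // would not compile: `[tag]` is a local temporary,
    //                     // but consume demands a `&'i [u8]` borrow.
    r.consume_u8(tag)
}

fn main() {
    let mut r = Reader { input: b"[core]" };
    assert!(read_tag(&mut r, b'[').is_ok());
    assert_eq!(r.input, &b"core]"[..]);
}
```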

> I would have expected peek_eq() to also have a version that takes a single byte. Maybe that's a special case though

Totally doable and a good suggestion. Reader::peek_u8_eq(..) seems more intuitive than Reader::peek_eq_u8(..); however, I see peek_eq as separate from peek_u8 and read_* because it exists to avoid having to handle an error. I'm tossing up the idea of a Reader::peek_u8_opt(..) -> Option<u8> too.

A couple of thoughts:

  • I feel peeking should only be used to choose the correct path and the interface should prevent peeking and then skipping a length to consume a value.
  • expect/try_expect/verify/try_verify shouldn't be used to produce errors where length is expected as it doesn't return a RetryRequirement.

The reason peek_u8 has an error is to supply a RetryRequirement for input re-processing and to prevent usage like this:

fn read_ini<'i, E>(r: &mut Reader<'i, E>) -> Result<Document<'i>, E>
where
    E: Error<'i>,
{
    skip_whitespace_or_comment(r, ConsumeTo::NextToken);
    match r.peek_u8() {
        // this feels super hacky just to get the retry requirement
        None => Err(r.read_u8().unwrap_err()),
        Some(b'[') => Ok(Document {
            globals: vec![],
            sections: read_sections(r)?,
        }),
        Some(_) => Ok(Document {
            globals: read_zero_or_more_properties_until_section(r)?,
            sections: read_sections(r)?,
        }),
    }
}

> skip_while(…) has no trait bound on the error, but skip(…) has. It's confusing as the skip_whitespace function taken from the 'json' example doesn't have an Error constraint, but now that seems very rarely to be possible. Note to self: It's probably best to just pass the Error<'i> constraint at all times.

I'll admit most cases will just use Error<'i>, but I like the fact that each function defined on Reader only requires exactly what is needed. I only use Result where there is something tangible expected; that way you don't have to handle errors where there never would be any (even if ? is super easy). It also self-documents what could occur.

Not having Result where an error isn't produced also adds the benefit of being able to use the Reader in very simple formats where there would never be errors by using Reader::read_infallible(..). Perhaps you only needed to split a string with a few delimiters using just Reader::take_while.
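As a sketch of that simpler use case, with a hand-rolled stand-in for Reader::take_while (the real dangerous types differ): splitting a delimited string needs no Result anywhere, because an empty match is simply an empty slice.

```rust
struct Reader<'i> {
    input: &'i [u8],
}

impl<'i> Reader<'i> {
    // take_while never fails: it returns as many bytes as satisfy the
    // predicate, possibly zero, so no Result (and no error bound) is needed.
    fn take_while(&mut self, mut pred: impl FnMut(u8) -> bool) -> &'i [u8] {
        let end = self
            .input
            .iter()
            .position(|&b| !pred(b))
            .unwrap_or(self.input.len());
        let (taken, rest) = self.input.split_at(end);
        self.input = rest;
        taken
    }

    fn skip_u8(&mut self) {
        if !self.input.is_empty() {
            self.input = &self.input[1..];
        }
    }

    fn is_empty(&self) -> bool {
        self.input.is_empty()
    }
}

// Split the input on a delimiter using nothing but take_while.
fn split(input: &[u8], delim: u8) -> Vec<&[u8]> {
    let mut r = Reader { input };
    let mut fields = Vec::new();
    loop {
        fields.push(r.take_while(|b| b != delim));
        if r.is_empty() {
            break;
        }
        r.skip_u8(); // step over the delimiter
    }
    fields
}

fn main() {
    let fields = split(b"user.name=Byron", b'=');
    assert_eq!(fields, vec![&b"user.name"[..], &b"Byron"[..]]);
}
```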

> Even though it's easy to configure how input is displayed in errors, it would be nice if there was an easy way to see 'bytes as byte-ascii' in accidentally failing tests (during development). This might be useful for all byte-ascii type parsers.

I'm not exactly sure what you mean here, sorry! There are only two other ways beyond {:?} vs {:#?} or manual configuration via the InputDisplay interface that I've thought about: providing an env variable to control a few parameters, or compile-time features.

> I am starting to get a feel for this and begin to see why parser combinators were invented - they seem to happen quite naturally. It's interesting to see how one has to implement everything by hand, for example handling EOF or that there might simply not be anything to parse.

What is missing from dangerous that I don't want to add is all the tiny helper functions that catch a lot of common cases. I want them to be in a separate crate with fn read_ini<'i, E>(r: &mut Reader<'i, E>) -> .. style interfaces. That way dangerous stays lean and clean. Things like delimited values, or structures like std::net::IpAddr.

> Depending on the complexity of the format, this approach might come most naturally, whereas in others combinators might appear 'easier' to use (especially with prior experience).

nom is a fantastic library, but I find all the simple functions stacked up can be hard to reason about, including the split between streaming and complete. I don't find reading the parsers nice and linear, which is one of my goals with dangerous. I've seen bad usage of nom parsers (including my own) with simple things like handling UTF-8 strings differently from other sequences (say, input that stops halfway through a UTF-8 codepoint becoming a fatal error instead of returning a retry requirement).

> Once I got the hang of it, it was quite easy to get productive, nearly without any surprises at all, even though it took me a moment to figure out how to return my own error more explicitly.

I think I need to make this more clear in the docs! But this is exactly the use case :D

> For git-config, I would have to keep spans instead of borrowed values, and would be glad for a suggestion on how best to accomplish that. I imagine dangerous could provide convenience functions for this (if it doesn't already); with 1.47 a new pointer offset method was stabilized, which might make that even easier.

I'm going to sit on this problem this weekend as I have a similar use case. I'm not sure how deeply it is a dangerous concern. One solution could be to provide a function fn inclusive_range(&self, sub: &Input) -> Option<Range<usize>>, as I previously had internally here (keep in mind that implementation was broken).
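The span idea can be sketched without any dangerous internals: because a zero-copy parser only ever hands out subslices of the original input, plain pointer arithmetic recovers the byte offsets. The name inclusive_range is borrowed from the suggestion above; this is an illustrative sketch over `&[u8]`, not the eventual API.

```rust
use std::ops::Range;

/// Returns where `sub` sits inside `parent`, if `sub` is a subslice of it.
/// Zero-copy parsers return subslices of the original input, so comparing
/// the slices' start pointers yields the span's byte offsets.
fn inclusive_range(parent: &[u8], sub: &[u8]) -> Option<Range<usize>> {
    let parent_start = parent.as_ptr() as usize;
    let sub_start = sub.as_ptr() as usize;
    // Reject anything starting before the parent allocation.
    if sub_start < parent_start {
        return None;
    }
    let offset = sub_start - parent_start;
    // Reject anything extending past the parent's end.
    if offset + sub.len() > parent.len() {
        return None;
    }
    Some(offset..offset + sub.len())
}

fn main() {
    let input = b"[core]\n\tbare = false";
    // The borrowed section name "core" a zero-copy parser would return.
    let section = &input[1..5];
    assert_eq!(inclusive_range(input, section), Some(1..5));
    assert_eq!(section, &b"core"[..]);
}
```

The user keeps the top-level input around and converts each borrowed value into a `Range<usize>` span for storage, which is exactly the git-config use case above.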

I loved reading your thoughts and I'll have another look over your implementation this weekend! Thank you very much for the amount of time you have put into this and providing the feedback! :D

@avitex commented Oct 16, 2020

> I feel peeking should only be used to choose the correct path and the interface should prevent peeking and then skipping a length to consume a value.

There is a shortcoming here with enum variant tags of variable lengths. For example, say you have a lot of variant tags like "imShort" or "imLonger"; consuming them into an enum is not easy today. If you take(..) the length of the longest variant tag, you'll be taking input that isn't specific to a shorter variant tag. Using peek_eq(..) and consume(..)/skip(..) with a bunch of if statements seems to be the only solution here, or a complex mess of take(..)/consume(..)s. I think a Reader::consume_one_of(variants: &[&[u8]]) is needed, but it seems like an unperformant solution as you have to iterate through each of the variants till you find a hit.
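The delimiter-based alternative (taking up to a delimiter once, then matching the borrowed slice) can be sketched with plain slices standing in for dangerous's Reader; the Tag enum and tag names come from the hypothetical example above.

```rust
#[derive(Debug, PartialEq)]
enum Tag {
    ImShort,
    ImLonger,
}

// Take bytes up to the delimiter once, then match the borrowed slice
// against every variant tag -- no peeking or variable-length take needed.
fn read_tag<'i>(input: &'i [u8], delim: u8) -> Result<(Tag, &'i [u8]), String> {
    let end = input
        .iter()
        .position(|&b| b == delim)
        .unwrap_or(input.len());
    let (tag, rest) = input.split_at(end);
    let tag = match tag {
        b"imShort" => Tag::ImShort,
        b"imLonger" => Tag::ImLonger,
        other => return Err(format!("unknown tag: {:?}", other)),
    };
    Ok((tag, rest))
}

fn main() {
    assert_eq!(
        read_tag(b"imShort,rest", b','),
        Ok((Tag::ImShort, &b",rest"[..]))
    );
    assert_eq!(read_tag(b"imLonger", b','), Ok((Tag::ImLonger, &b""[..])));
    assert!(read_tag(b"imWrong,", b',').is_err());
}
```

The match on the byte slice lets the compiler generate the comparison, instead of iterating a `&[&[u8]]` of variants at runtime.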

@Byron commented Oct 17, 2020

> Totally doable and a good suggestion. Reader::peek_u8_eq(..) seems more intuitive than Reader::peek_eq_u8(..); however, I see peek_eq as separate from peek_u8 and read_* because it exists to avoid having to handle an error. I'm tossing up the idea of a Reader::peek_u8_opt(..) -> Option<u8> too.

Reader::peek_u8_opt(..) seems like a very useful addition.

> expect/try_expect/verify/try_verify shouldn't be used to produce errors where length is expected as it doesn't return a RetryRequirement.

I believe I understand now that the RetryRequirement is used to encode information for a streaming parser - this was lost on me on my first read, but I admit to having skimmed the error docs to some extent. Maybe it would help to provide a high-level overview that runs down the main ideas of the architecture and helps people choose the correct methods and error-handling strategies right off the bat.

> I'll admit most cases will just use Error<'i>, but I like the fact that each function defined on Reader only requires exactly what is needed. I only use Result where there is something tangible expected; that way you don't have to handle errors where there never would be any (even if ? is super easy). It also self-documents what could occur.

I like that too. The point I tried to make is that while 'playfully' changing a method that doesn't handle errors, it's easy to choose a method on the reader that does error handling, and even if the error won't leave the calling method, a trait bound is now required. Only in the first moment was that surprising to me, and I quickly figured out what I would have to do to fix it. This might be more of a Rust compiler/typesystem surprise than a surprise in dangerous.

> Not having Result where an error isn't produced also adds the benefit of being able to use the Reader in very simple formats where there would never be errors by using Reader::read_infallible(..). Perhaps you only needed to split a string with a few delimiters using just Reader::take_while.

That's cool! I didn't see that one before :)!

> I'm not exactly sure what you mean here, sorry! There are only two other ways beyond {:?} vs {:#?} or manual configuration via the InputDisplay interface that I've thought about: providing an env variable to control a few parameters, or compile-time features.

It feels like you kind of did, because controlling the display of input in an error case via environment variables would definitely help test cases. What happened to me is that during development, unit tests that should not fail did actually produce an error. The input was displayed in hex, even though byte-ascii would have been preferred. In the actual program, printing works the way I would have hoped.

[Screenshot 2020-10-17: failing test output with the input rendered in hex]

> What is missing from dangerous that I don't want to add is all the tiny helper functions that catch a lot of common cases. I want them to be in a separate crate with fn read_ini<'i, E>(r: &mut Reader<'i, E>) -> .. style interfaces. That way dangerous stays lean and clean. Things like delimited values, or structures like std::net::IpAddr.

Totally agree! And I am looking forward to your take on it. With dangerous as underpinnings it's definitely a novel take on things, and I for one believe there is still space for improvement in this field.

> nom is a fantastic library, but I find all the simple functions stacked up can be hard to reason about, including the split between streaming and complete. I don't find reading the parsers nice and linear, which is one of my goals with dangerous. I've seen bad usage of nom parsers (including my own) with simple things like handling UTF-8 strings differently from other sequences (say, input that stops halfway through a UTF-8 codepoint becoming a fatal error instead of returning a retry requirement).

I agree, even though thus far I might not have hit that complexity boundary. The most complex things I have parsed with it were git objects (trees, tags, commits), and for those it really did work well enough. However, when thinking about the error type it uses, my brain twists around itself, and that's even without a streaming parser. Maybe doing things the nom way helps with performance, but I for one would be happy to (potentially) sacrifice some of it when great error reporting and maintainability are valuable to the project.
combine makes it easy to chain parsers to avoid nesting (and reading things a little 'backwards'), even though nom 5.0 also supports much of that.

> I'm going to sit on this problem this weekend as I have a similar use case. I'm not sure how deeply it is a dangerous concern. One solution could be to provide a function fn inclusive_range(&self, sub: &Input) -> Option<Range<usize>>, as I previously had internally here (keep in mind that implementation was broken).

I agree, and also believe this doesn't need to be more than one function. After all, the top-level input will probably have to be kept around as state by the user anyway to do the necessary pointer arithmetic, which avoids affecting dangerous state. I am definitely looking forward to what you come up with!

> I loved reading your thoughts and I'll have another look over your implementation this weekend! Thank you very much for the amount of time you have put into this and providing the feedback! :D

Whoop whoop, and I loved reading yours! That alone makes me want to use dangerous more; it feels a bit like co-evolution, allowing me to try to solve an actual problem with feedback from the one person who knows how dangerous should be used, while relaying my thoughts back to you to help fix pitfalls other people might be likely to run into.

Please feel free to push changes directly to this branch; this should save you some time writing changes down as comments, even though I am certain you already have them locally. Looking at diffs is fine, and it's easy to comment in case I have questions or suggestions.

> I think a Reader::consume_one_of(variants: &[&[u8]]) is needed, but it seems like an unperformant solution as you have to iterate through each of the variants till you find a hit.

True, I wonder how that ends up being implemented in nom or combine - I presume longTag(i).or(short_tag) ends up consuming temporaries, and on error one can go back to the unconsumed input to try the other parser. With dangerous one indeed might end up with more operations that are harder to optimize away. Maybe a piece of the puzzle is missing - something like a restore point, maybe? It should be 'out-of-state' of the reader, though, to support recursion (if it's a possible solution at all, I mean 😅)
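The restore-point idea might look something like this hypothetical sketch: because the reader's cursor is just a (pointer, length) pair, the caller can copy it before each attempt and roll back on failure, keeping the checkpoint out of the reader's own state so recursion composes. None of these names are real dangerous API.

```rust
#[derive(Clone)]
struct Reader<'i> {
    input: &'i [u8],
}

impl<'i> Reader<'i> {
    fn consume(&mut self, bytes: &'i [u8]) -> Result<(), ()> {
        if self.input.starts_with(bytes) {
            self.input = &self.input[bytes.len()..];
            Ok(())
        } else {
            Err(())
        }
    }
}

// A hypothetical out-of-state restore point: the checkpoint is a copy of
// the cheap slice cursor held by the caller, not a stack inside the reader.
fn one_of<'i, T: Copy>(
    r: &mut Reader<'i>,
    alternatives: &[(&'i [u8], T)],
) -> Result<T, ()> {
    for &(tag, value) in alternatives {
        let checkpoint = r.clone();
        if r.consume(tag).is_ok() {
            return Ok(value);
        }
        *r = checkpoint; // roll back to the unconsumed input, try the next
    }
    Err(())
}

fn main() {
    let mut r = Reader { input: b"imLonger!" };
    let alts = [(&b"imShort"[..], 1), (&b"imLonger"[..], 2)];
    assert_eq!(one_of(&mut r, &alts), Ok(2));
    assert_eq!(r.input, &b"!"[..]);
}
```

This mirrors the `longTag(i).or(short_tag)` shape: each failed alternative restores the unconsumed input before the next attempt.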

@avitex commented Oct 18, 2020

@Byron Sorry, slow on uptake for this, will reply soon :)

@Byron commented Oct 19, 2020

No worries, same same 😅. However, I will try to get the first pieces of git-config parsing done today, using span computation based on the example you provided.

I will put my changes into a PR that will be shared here, instead of pushing to master directly, to enable you to take an occasional look should you find the time :).

@avitex commented Oct 19, 2020

Just a small update. For the time being I've decided against a Reader::consume_one_of(variants: &[&[u8]]) function. On second thought, I can't think of an example where variants wouldn't be delimited by something, so if you take_while(|b| b != delim) you can then match on the taken input. It was going down a rabbit hole of unnecessary complexity.

I've improved the peek interface to reflect this as fn peek<'p, E>(&'p self, len: usize) -> Result<&'p Input, E>, so the returned Input does not carry the 'i lifetime. This prevents using the peeked Input in returned values, but the interface change makes it easy to match on. Win-win!
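A minimal sketch of that signature trick, with a plain `&'p [u8]` standing in for the `&'p Input` (again not the real dangerous types): tying the peeked value to `&'p self` instead of `'i` means it can be matched on for branching, but the API stops it from being smuggled into `'i`-lived return values.

```rust
struct Reader<'i> {
    input: &'i [u8],
}

impl<'i> Reader<'i> {
    // The peeked slice borrows from &self ('p), not from the input ('i):
    // useful for choosing a parse path, useless for zero-copy output,
    // which is exactly the constraint the new interface encodes.
    fn peek<'p>(&'p self, len: usize) -> Result<&'p [u8], ()> {
        self.input.get(..len).ok_or(())
    }
}

fn main() {
    let r = Reader { input: b"[section]" };
    // Easy to match on, per the win-win above.
    let looking_at_section = matches!(r.peek(1), Ok(b"["));
    assert!(looking_at_section);
    assert_eq!(r.peek(1), Ok(&b"["[..]));
    // Peeking past the end is an error, which could carry a retry requirement.
    assert!(r.peek(100).is_err());
}
```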

See PR #8 for these changes

@avitex commented Oct 27, 2020

Hey @Byron, #8 is merged, so you'll be able to make the changes around peeking. I still have more to look at and clean up, and if you like I'll merge this and make the changes.

Thanks!

@Byron commented Oct 27, 2020

Thanks @avitex for your continued updates and improvements! I went in quickly, merged master, and believe I have picked more suitable peek methods.
Maybe that makes this PR mergeable, but if not, I wouldn't mind if you could make the remaining modifications by pushing to this branch. I will definitely take a look then :).

Also: Looking forward to the other improvements and cleanups you have planned!

@avitex commented Oct 27, 2020

No, thank you @Byron for your wonderful feedback! I'm going to merge this in :)
As for the changes, I've put most things I can think of in GitHub issues. If you have any issues yourself, please do raise them!

@avitex commented Oct 27, 2020

Rebased and merged this locally. Closing, and thanks!

@avitex closed this Oct 27, 2020
@Byron commented Oct 29, 2020

I just stumbled over this parser combinator crate, which seems to do interesting chaining that I haven't seen before. No word about what the errors look like, though, but it may be useful as inspiration in case you create a parser combinator crate yourself.

@avitex commented Oct 29, 2020

I appreciate that @Byron :D
