Skip to content
This repository has been archived by the owner on Sep 24, 2022. It is now read-only.

Rewrite the parser #78

Closed
alexcrichton opened this issue Nov 22, 2015 · 10 comments
Closed

Rewrite the parser #78

alexcrichton opened this issue Nov 22, 2015 · 10 comments

Comments

@alexcrichton
Copy link
Collaborator

There's a number of pain points with the current parser:

  • Not the fastest in the world. It currently makes a huge number of allocations all over the place, but can it be better? Strings should at least be Cow to slice back to the original source and only unescape when necessary. Arrays and maps may be able to be based on an iterator?
  • Overall, parsing should probably not go directly to a toml::Value but instead some form of AST. This AST may not have validation like no duplicate keys, no redefinitions, or arrays are homogneous. That kind of verification can perhaps be delayed to another step.
  • Parsing should return a Result with an opaque error type which contains source information directly.
  • Should preserve source structure of whitespace, comments, etc. Basically make it easy to delete keys/tables and/or insert keys/tables.
@waynr
Copy link

waynr commented Mar 5, 2016

I would like to take on some subset of this parser rewrite once I am finished with withoutboats/notty#23 (never closer), unfortunately I will be back at my dayjob next week so I'm not sure how much time/energy I will have to spend on it after that.

Would it be possible to break this work up into smaller chunks/issues that can be incrementally tackled and which would target a "1.x" branch on this repository? Those smaller-chunk'd issues could be grouped under a "1.0" milestone in this github issue interface. Before work starts on interface changes, more details about what you have in mind would be nice so that whoever picks up the work can proceed with confidence in the implementation.

I'd actually be more than happy to break this issue up into smaller ones based on the current outline which you can then fill out with additional details in comments whenever you have time. Let me know if this approach sounds sane.

@alexcrichton
Copy link
Collaborator Author

Thanks for the interest @waynr! Awhile back I managed to rewrite the lexer at least, although I haven't gotten around to morphing that into a parser. I'd be totally fine breaking up the work here, and that branch would probably need some cleanup as well before pushing much more on it :(

@joelself
Copy link

@alexcrichton I have a work-in-progress TOML parser here. It's nearing its first release. There are some things you might like about it:

  • Preserves all whitespace and comments in the original document. If you parse then print the parser you will get an exact copy of the document you parsed.
  • Parses to an AST
  • Parsing to the AST doesn't require allocations because it uses nom and the AST can hold string slices (it can also hold Strings for when it is modified)
  • I haven't benchmarked it, but it seems fast. Do you have any huge TOML files you need to parse or are you just parsing tons of files?

It parses to a Value type, but this is part of the AST and unavoidable. The parser returns a sanitized TOMLValue type to the user to keep the AST and it's ugliness internal and because the AST is subject to constant change (I just made a major change to how inline tables are represented in the AST, but the TOMLValue::InlineTable remained stable). The goal is to never reveal the AST to the end user and only provide abstract methods to query and manipulate the parsed document.

It does do duplicate key checks, invalid table checks, heterogeneous array checks, and invalid datetime checks on the fly. But duplicate key and invalid tables are required otherwise they would overwrite and invalidate key/values already in the map. Mixed array checks are very simple, I don't see any benefit of deferring this to a post-parse step. The datetime check is one that can be deferred. If this slows down the parser too much I can make this optional depending on a flag and move it to a post-parsing validate_values method if you don't set the flag. In the future I want to add other validation such as integer overflow/underflow, float overflow/underflow, float loss of precision, invalid bool (wrong casing), but depending on feedback, which I currently don't have, maybe this should only be done on the fly if you specify a flag.

Parsing should return a Result with an opaque error type which contains source information directly.

I don't understand what you mean by 'opaque error type which contains source information directly'. Currently my parser returns itself due to the way parsing structs are implemented in nom (you can read about how I ended up with this implementation here). Right now you have to call get_result to get the result of the parse which ranges over:

  • Full (it parsed the whole document with no errors)
  • FullError (it parsed the whole document, but recorded some non-failure errors that are returned in FullError
  • Partial (it parsed part of the document, but there was some leftover)
  • PartialError (it parsed part of the document and it found errors which it returns to you in the PartialError
  • Failure (it failed to parse the document and it returns the last line number it recorded as part of the Failure

For the first release it will only be able to parse a document, print a parsed document, get values, get a key's child keys, and set values (and there are a few caveats with setting arrays and inline tables). Adding new key/values and new tables, deleting key/values and tables, and key modification (except for inline table key modification which is possible right now) are slated for the second release which I hope will come quickly after the first.

@alexcrichton
Copy link
Collaborator Author

Holy cow, awesome work @joelself! Some thoughts of mine:

  • Ah I wouldn't worry too much about parser performance, it's almost guaranteed to be better than this crate is today! The current parser does tons of allocations all over the place, but I've also never seen it in any Cargo profiles at least (always small TOML documents).
  • The API sounds great! Sounds particularly flexible in terms of internal refactorings :)
  • When I mentioned errors, I was basically just saying that the current situation is pretty bad. You get an error, but then you need to hand it back to the parser to figure out file/line information. In theory you shouldn't have to hand it back, it should just inherently have it inside the error.

I'm pretty excited about this, though, the ability for Cargo to modify a TOML document preserving formatting and comments would be huge! It would unlock quite a bit of functionality I've wanted to have for quite a while now :)

Would you be interested in merging your work back into this repo? Or would you like to keep it separate for now?

@joelself
Copy link

The external API has to be stable because I'm constantly changing how the internals work. Partly because I'm indecisive and partly because this is babby's first Rust project, so I find myself writing some code then looking at it a few weeks later and thinking "What idiot wrote this crap?"

I should probably change my errors. At first I had them return a key or vector of keys (if a key was involved) and then a Value or vector of values and a line number. But I recently changed it to only return a line number and a key[s] that you have to look up in the parser to get the value[s]. Going back to returning the value with the error will be more consistent since every error type that involves a value will contain a value instead of just the errors that can't store their values in the internal map containing a value.

I started the project from scratch, so it doesn't have anything in common with toml-rs. I don't know how I could merge into it.

Are there any particular features or APIs you want? Because right now I'm just targeting features for cargo-bump (for 0.1.0) and then for cargo-edit (for 0.2.0, although 0.1.0 will enable cargo list). I don't really have much of a roadmap after that.

@alexcrichton
Copy link
Collaborator Author

Ah ok, was just curious! Some idea features I can think of for a library might be:

  • When parsing, must be able to get file/line information for all parse errors.
  • Ideally, parsing can return more than one error if there is more than one.
  • A document should support insertion or removal of keys in a table. All other formatting should be preserved in a file.
  • A document could also support inserting a key with a particular style. For example in Cargo we custom print Cargo.lock to ensure consistent style, but representing it programmatically might perhaps be nicer.
  • Modifying existing keys in a document would also be nice.

I think that's what I could think of at least? If you're interested in merging back here I can definitely try to help out, but if not then sounds good to me as well :)

@joelself
Copy link

tomllib (I renamed toml_parser) is out now at version 0.1.1. It is currently limited to parsing documents, getting values, setting/changing values and getting subkeys of a key. This should be enough to get cargo-bump going. For the next set of features I'm targeting cargo-edit.

@alexcrichton
Copy link
Collaborator Author

Thanks for the update @joelself! Sounds like some great progress is being made :)

@alexcrichton
Copy link
Collaborator Author

I've rewritten the parser in the serde 0.9 rewrite: #137

It does everything here except preserve whitespace/comments, but the tokenizer at the lowest level now supports this, so I hope it'll be easier to prototype such support in the future.

@alexcrichton
Copy link
Collaborator Author

Fixed in spirit in #137

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants