advice on parsing + state regexp lexer prototype #70

Sylvain303 · 2019-09-17T06:49:12Z

Hi,

I discovered participle while looking for a grammar parser in Go.
Good job! Participle looks fun and powerful.

Sorry for the very long message; it gathers everything I collected while using participle. Maybe I can split it into multiple issues?

docopt context

Here is some of my context.

I'm the maintainer of the project docopts (note the final "S", by analogy with getopts).

docopts is a CLI parser for bash, implementing the docopt language.
I'm currently using the Go docopt library to parse docopt. This library is almost a word-for-word translation of the original Python docopt parser.

I would like to write a new parser from scratch to re-parse, and then enhance, the docopt language, which seems to be at a standstill. In particular, I want it to issue helpful parser errors or warnings during docopt parsing.
I'm just beginning with participle and I'm looking for some advice.

The docopt language is line-based and composed of sections; each section is parsed differently.

Here is a valid docopt example that describes the docopt language itself:

https://github.com/docopt/docopts/blob/dev-participle/grammar/docopt_language.docopt

I modified docopts in the branch so that it parses the above example using the current parser written in Go:

$ ./docopts -h "$(cat grammar/docopt_language.docopt)" --debug : required-action pipo | sed -n '/# bash #/,$ p'
$ ./docopts -h "$(cat grammar/docopt_language.docopt)" --print-ast : required-action pipo

testing participle

I first wrote a simple regexp lexer and a top-level grammar for splitting by section, then I thought about parsing each section with a different lexer + section parser.

I have since changed my mind and finished writing a PoC of a state lexer, like GNU flex, based on your lexer/regexp.

https://github.com/docopt/docopts/blob/dev-participle/grammar/lexer_state/lexer_state.go

The idea is that the lexer uses a different set of regexps depending on the token it has just extracted, as I used to do with GNU flex.
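To make the idea concrete, here is a minimal sketch of a state-driven regexp lexer using only the Go standard library. It is not the code from lexer_state.go: the state names ("free", "usage"), the rule table and the Lex function are all assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Token is one lexeme, tagged with the name of the rule that matched it.
type Token struct {
	Type  string
	Value string
}

// rule is only tried while the lexer is in the state that owns it.
// next names the state to enter after a match ("" keeps the current state).
// A rule with an empty name consumes input without emitting a token.
type rule struct {
	name string
	re   *regexp.Regexp
	next string
}

// states maps each state to its ordered rules; the first match wins.
// All names here are hypothetical, not taken from lexer_state.go.
var states = map[string][]rule{
	"free": {
		{"USAGE", regexp.MustCompile(`^Usage:`), "usage"},
		{"NEWLINE", regexp.MustCompile(`^\n`), ""},
		{"LINE_OF_TEXT", regexp.MustCompile(`^[^\n]+`), ""},
	},
	"usage": {
		{"NEWLINE", regexp.MustCompile(`^\n`), ""},
		{"LONG_BLANK", regexp.MustCompile(`^ {2,}`), ""},
		{"", regexp.MustCompile(`^ `), ""},
		{"SECTION", regexp.MustCompile(`^[A-Za-z]+:`), "free"},
		{"LONG", regexp.MustCompile(`^--[a-z][a-z-]*`), ""},
		{"ARGUMENT", regexp.MustCompile(`^<[^>\n]+>`), ""},
		{"PUNCT", regexp.MustCompile(`^[\[\]()|=]`), ""},
		{"IDENT", regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_-]*`), ""},
	},
}

// Lex splits input into tokens, switching rule sets as the state changes.
func Lex(input string) ([]Token, error) {
	state := "free"
	var tokens []Token
	for pos := 0; pos < len(input); {
		matched := false
		for _, r := range states[state] {
			m := r.re.FindString(input[pos:])
			if m == "" {
				continue // no match at this position: try the next rule
			}
			if r.name != "" {
				tokens = append(tokens, Token{r.name, m})
			}
			if r.next != "" {
				state = r.next
			}
			pos += len(m)
			matched = true
			break
		}
		if !matched {
			return tokens, fmt.Errorf("state %q: no rule matches at offset %d", state, pos)
		}
	}
	return tokens, nil
}

func main() {
	tokens, err := Lex("Tool --fast\nUsage: prog --fast <file>\n")
	if err != nil {
		fmt.Println("lex error:", err)
		return
	}
	for _, t := range tokens {
		fmt.Printf("%s, %q\n", t.Type, t.Value)
	}
}
```

In this sketch the same text "--fast" comes out as part of a LINE_OF_TEXT in the "free" state and as a LONG token once "Usage:" has switched the lexer to the "usage" state.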

The following docopt grammar becomes parsable, as some tokens are only recognized in specific sections of the language.

I wrote a prototype with participle, which should be able to parse a grammar that looks like the following. The definition may not exactly reflect the real code (#1 would be great for posting issues too 😉)

Docopt =
  Prologue?
  Usage_section
  Options_section?
  Free_Section*

Prologue            =  Free_text+
Free_text           =  LONG_BLANK? LINE_OF_TEXT "\n" | "\n"
Usage_section       =    "Usage:" Usage_expr "\n" Usage_line*
                       | "Usage:" "\n" Usage_line+
Usage_line          =  ( LONG_BLANK Usage_expr | Comment ) "\n"
Comment             =  LINE_OF_TEXT | "\n"+
Usage_expr          =  Seq  ( "|" Seq )*
Seq                 =  ( Atom "..."? )*
Atom                =    "(" Expr ")"
                       | "[" Expr "]"
                       | "options"
                       | Long_def
                       | Shorts_option
                       | ARGUMENT
                       | Command
Shorts_option       =  SHORT | SHORT ARGUMENT
Long_def            =  LONG | LONG "="? ARGUMENT
Options_section     =  "Options:" "\n" Options_line+
Options_line        =  LONG_BLANK Options_flag LONG_BLANK Option_description
Option_description  =  (LONG_BLANK LINE_OF_TEXT "\n")*
                       (LONG_BLANK LINE_OF_TEXT Default_value "\n")?
Default_value       =  "[" DEFAULT LINE_OF_TEXT "]"
Free_Section        = SECTION "\n" Free_text*

My questions

Why negative runes for token types?

Why not use a matching array index for tokens?
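For background on this question: Go's standard text/scanner package uses negative runes for its token classes, so that they can never collide with a literal character returned as its own token. Whether participle chose negative runes for exactly this reason is an assumption here, not something confirmed in this thread. The sketch below only demonstrates the text/scanner behaviour; scanTypes is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// scanTypes returns the token type of every token found in src.
// Token classes such as scanner.Ident are negative; single-character
// tokens are returned as the (positive) rune itself.
func scanTypes(src string) []rune {
	var s scanner.Scanner
	s.Init(strings.NewReader(src))
	var types []rune
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		types = append(types, tok)
	}
	return types
}

func main() {
	for _, t := range scanTypes("a+b") {
		fmt.Printf("%d %s\n", t, scanner.TokenString(t))
	}
}
```

With this convention a single rune value can carry either a named token class or a literal character, which an array index alone would not distinguish.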

Capture without using @@ own type : Usage_first: structs can only be parsed with @@ or by implementing the Capture interface

I first encountered the error: Capture without using @@ own type : Usage_first: structs can only be parsed with @@ or by implementing the Capture interface

I did not manage to implement the Capture interface, either for the parent node or for the child node.

It happened when I was trying to assign a token to an array. (I reproduced it like this, but it may be an invalid grammar.)

type Usage struct {
  Pos lexer.Position

  Usage_lines       []Usage_line   `"Usage:" @Usage_line? "\n" @@* )`
}

So I ended up writing the single-line case this way.
I would have preferred to have a single array holding one or many Usage_line here.

type Usage struct {
  Pos lexer.Position

  Usage_first       *Usage_def      `( "Usage:" @@ "\n"`
  Usage_next_lines  []*Usage_line   `           @@*`
  Usage_lines       []*Usage_line   `| "Usage:" "\n"  @@+ )`
}

lexer state

Is it a good idea to delegate more to lexer state?

As opposed to having a top-level grammar controlling multiple lexers + sub-parsers.

Capital parser struct definition panic error

Forgetting the capital letter on a field name in a parser struct definition gives a panic: panic: reflect: reflect.Copy using value obtained using unexported field

This message is a bit cryptic.
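The underlying cause is that Go's reflect package refuses to write to unexported (lower-case) struct fields. Here is a minimal standard-library sketch of that behaviour, independent of participle; Node, settable and trySet are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// Node mimics a grammar struct: Name is exported, value is not.
type Node struct {
	Name  string
	value string
}

// settable reports, per field name, whether reflection may assign to it.
func settable(ptr interface{}) map[string]bool {
	v := reflect.ValueOf(ptr).Elem()
	out := make(map[string]bool)
	for i := 0; i < v.NumField(); i++ {
		out[v.Type().Field(i).Name] = v.Field(i).CanSet()
	}
	return out
}

// trySet assigns s to the named string field and converts the reflect
// panic raised for unexported fields into an ordinary error.
func trySet(ptr interface{}, field, s string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	reflect.ValueOf(ptr).Elem().FieldByName(field).SetString(s)
	return nil
}

func main() {
	n := &Node{}
	fmt.Println(settable(n))
	fmt.Println(trySet(n, "Name", "ok"))
	fmt.Println(trySet(n, "value", "boom"))
}
```

A check along the lines of settable, run once when the parser is built, is one way a library could turn this panic into a message that names the offending field.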

Error handling reporting

My main goal in refactoring the docopt parser is to produce good, helpful error reporting during docopt parsing. How do I customize and handle parse errors in participle?

Thanks for reading. 😄


alecthomas · 2019-09-18T00:33:35Z

Hi... it's a bit unclear what you're asking, but if you want to jump in the Gophers Slack I have a channel where we can chat (invite link is here).

alecthomas · 2019-09-18T01:01:40Z

Capture without using @@ own type : Usage_first: structs can only be parsed with @@ or by implementing the Capture interface

  Usage_lines       []Usage_line   `"Usage:" @Usage_line? "\n" @@* )`

You're hitting the lack of sub-lexers here. A single token can't be parsed by a sub-node. See solution in #25.

lexer state
Is it a good idea to delegate more to lexer state?
As opposed to having a top-level grammar controlling multiple lexers + sub-parsers.

This is an open issue. My current idea is to support sub-lexers, see #25

Capital parser struct definition panic error

Yeah that's not good. Do you have the panic stack trace?

alecthomas · 2019-09-18T01:07:35Z

Error handling reporting
My main goal in refactoring the docopt parser is to produce good, helpful error reporting during docopt parsing. How do I customize and handle parse errors in participle?

You'll often get a lexer.Error, which has the position of the error (I should make participle.Error include this information in addition to the failing grammar path). Unfortunately this is not guaranteed, so you might also get a plain formatted error. Additionally a partially constructed AST will be returned, up to the point where parsing failed.
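As a rough sketch of how calling code could cope with both cases (a positioned error or a plain formatted one), using only the standard library. PosError below is a hypothetical stand-in for a positioned error; it is not participle's lexer.Error, whose exact fields are not shown in this thread. Report is likewise a made-up helper.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// PosError is a hypothetical positioned error. It is NOT a participle type.
type PosError struct {
	Line   int // 1-based
	Column int // 1-based
	Msg    string
}

func (e *PosError) Error() string {
	return fmt.Sprintf("%d:%d: %s", e.Line, e.Column, e.Msg)
}

// Report renders err for the user. If a PosError is found in the error
// chain, the offending source line is shown with a caret under the column;
// otherwise the plain message is returned.
func Report(source string, err error) string {
	var pe *PosError
	if !errors.As(err, &pe) {
		return "error: " + err.Error()
	}
	lines := strings.Split(source, "\n")
	if pe.Line < 1 || pe.Line > len(lines) || pe.Column < 1 {
		return pe.Error() // position is out of range: no excerpt
	}
	caret := strings.Repeat(" ", pe.Column-1) + "^"
	return pe.Error() + "\n" + lines[pe.Line-1] + "\n" + caret
}

func main() {
	src := "Usage: prog [--fast\n"
	err := fmt.Errorf("parsing usage: %w", &PosError{Line: 1, Column: 20, Msg: `expected "]"`})
	fmt.Println(Report(src, err))
	fmt.Println(Report(src, errors.New("unexpected end of input")))
}
```

Keeping the formatting outside the parser like this means the error message style can change without touching the grammar.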

I'd be interested in improving this, what specifically did you have in mind?

Sylvain303 · 2019-09-18T06:12:02Z

Hi... it's a bit unclear what you're asking, but if you want to jump in the Gophers Slack I have a channel where we can chat (invite link is here).

Hi @alecthomas, judging by the timestamp of this reply, we have a big time difference: I'm located in France (GMT+2). It may be tricky to find spare time at the same moment.

alecthomas · 2019-09-18T06:15:28Z

It's 4pm here (Sydney), I'll be online until around 10pm.

Sylvain303 · 2019-09-18T07:35:41Z

Capture without using @@ own type : Usage_first: structs can only be parsed with @@ or by implementing the Capture interface
You're hitting the lack of sub-lexers here. A single token can't be parsed by a sub-node. See solution in #25.

I'm not sure I understand the point here. I'm not yet familiar with all participle concepts, and I may also lack some vocabulary or code knowledge. In #25 you mention participle.SubLexer(), which doesn't exist yet, right? I will think about it and post in the related issue. Sorry, I put too much content into this one.

lexer state
Is it a good idea to delegate more to lexer state?
As opposed to having a top-level grammar controlling multiple lexers + sub-parsers.

This is an open issue. My current idea is to support sub-lexers, see #25

I also split this part of my comment off into #25.

Capital parser struct definition panic error

Yeah that's not good. Do you have the panic stack trace?

I opened a separate issue about it: #71

Sylvain303 · 2019-09-29T05:57:47Z

OK, I've continued my experimentation. I have now completed my grammar with rules for parsing the Options: section, and I ran into the following choice.

Here is an input ending with an extra dot "."; handling this corner case would make my grammar more complex:

  --which-support=<argument>  The <argument> for this option has a [default: value_here_is_parsed].

My current lexer extracts it as expected:

LONG_BLANK, "  "
LONG, "--which-support"
PUNCT, "="
ARGUMENT, "<argument>"
LONG_BLANK, "  "
LINE_OF_TEXT, "The <argument> for this option has a "
PUNCT, "["
DEFAULT, "default: "
LINE_OF_TEXT, "value_here_is_parsed"
PUNCT, "]"
LINE_OF_TEXT, "."

My idea here is to remove the "[", "default:", "value", "]" token extraction from the lexer
and to delegate the default: extraction to the parsing step, with a simple regexp on LINE_OF_TEXT, in order to simplify the grammar. It would also allow keeping the full documentation verbatim from the parsed text.

The regexp from the legacy docopt parser is:

reDefault := regexp.MustCompile(`(?i)\[default: (.*)\]`)

How am I supposed to extract value_here_is_parsed and put it back into the AST node?

Is it a job for Capture(), Parseable() or a build Option?
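For reference, the regexp step described above can be sketched with the standard library alone, applied to the description text after parsing. OptionDoc and NewOptionDoc are hypothetical names; how such a node would be wired into a participle grammar is left open here.

```go
package main

import (
	"fmt"
	"regexp"
)

// reDefault is the regexp quoted above from the legacy docopt parser.
var reDefault = regexp.MustCompile(`(?i)\[default: (.*)\]`)

// OptionDoc is a hypothetical AST node: it keeps the description verbatim
// and stores the default value extracted from it, if any.
type OptionDoc struct {
	Text       string
	Default    string
	HasDefault bool
}

// NewOptionDoc builds an OptionDoc from a raw description line.
func NewOptionDoc(text string) OptionDoc {
	doc := OptionDoc{Text: text}
	if m := reDefault.FindStringSubmatch(text); m != nil {
		doc.Default = m[1] // first capture group: the value itself
		doc.HasDefault = true
	}
	return doc
}

func main() {
	doc := NewOptionDoc("The <argument> for this option has a [default: value_here_is_parsed].")
	fmt.Printf("default=%q has=%v\n", doc.Default, doc.HasDefault)
	fmt.Printf("text=%q\n", doc.Text)
}
```

The trailing dot needs no special handling in this approach, because the regexp only looks at the bracketed part and the text is stored untouched.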

alecthomas · 2019-09-29T08:47:26Z

It's tempting to put more and more logic into lexing/parsing, but I would keep it simple and do something like this:

type Options_line struct {
	Pos lexer.Position

	Option_def     Option_def     `(   LONG_BLANK @@`
	Option_doc     Option_doc     `    @@`
	Option_default Option_default `    @@? "\n"`
	Option_dot     bool           `    @"."?`
	Comment        []string       `|   @( LINE_OF_TEXT "\n" | "\n"+ ) )`
}

func (o *Options_line) Doc() string {
	out := o.Option_doc
	if o.Option_dot {
		out += "."
	}
	return out
}

alecthomas/participle#70

This was referenced Sep 18, 2019

panic: reflect: reflect.Value.Set using value obtained using unexported field while omitting Capital letter in struct #71

Closed

Support for sub-lexers #25

Closed

Sylvain303 pushed a commit to docopt/docopts that referenced this issue Sep 30, 2019

Add test for reading Options from AST with "."?

590e559

alecthomas/participle#70

alecthomas closed this as completed Dec 11, 2019
