Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] How do I combine these parsers? #15

Closed
AndrewSav opened this issue Sep 11, 2017 · 20 comments
Closed

[Question] How do I combine these parsers? #15

AndrewSav opened this issue Sep 11, 2017 · 20 comments

Comments

@AndrewSav
Copy link
Contributor

Let's assume I have a Group 1 of parser a,b,c and d and Group 2 of parser w,x,y and z.
I would like to combine these parses so that the resulting parser would parse any combination of parsers from group 1 and exactly 1 parser from group 2 in any order.

Is this possible?

It is quite possible that I'm attacking the problem from a wrong angle (XY-problem), here is what I'm really trying to solve.

I'd like to parse a command line switch that starts with - or / and then a number of options follows for example -abwdc among these options there could be as many as desired binary ones (that is a,b,c or d, that do not require a parameter to follow) but only one that requires a parameter (w,x,u or z). The options can be specified in any order.

@AndrewSav
Copy link
Contributor Author

AndrewSav commented Sep 11, 2017

Apologies, this was a case of rubber duck debugging. I went with the following (not prime-time ready, just to demonstrate the idea):

class Atom
{
	public char CharName { get; set; }
}
class BinaryAtom : Atom
{
}
class ValueAtom : Atom
{
}
class Switch
{
	public IEnumerable<BinaryAtom> bins { get; set; }
	public ValueAtom val {get; set;}
}
void Main()
{
	TextParser<BinaryAtom> bins = Character.In('a', 'b', 'c', 'd').Select(x => new BinaryAtom {CharName = x});
	TextParser<ValueAtom> vals = Character.In('w', 'x', 'y', 'z').Select(x => new ValueAtom {CharName = x});
	TextParser<Switch> bla =
		from first in bins.Many()
		from second in vals.OptionalOrDefault()
		from third in bins.Many()
		select new Switch { bins = first.Concat(third), val = second};
	bla = bla.AtEnd();

	bla.Parse("abcx").Dump();
	bla.Parse("wac").Dump();
	bla.Parse("czb").Dump();
	bla.Parse("ac").Dump();
	bla.Parse("abcwx").Dump();
}

@nblumhardt
Copy link
Member

Thanks for the follow-up!

Just to throw another idea in there - when I find I'm trying to do context-sensitive parsing like this, I usually fall back to trying to use the parser just for the underlying grammar - i.e. splitting the options out, in this case - and then do the validation as a completely separate second step on the results of the parse. (Hope this makes sense.)

Cheers,
Nick

@AndrewSav
Copy link
Contributor Author

I'm going to write some thoughts here, my objective being:

  • Understand Superpower better
  • May be take an opportunity to make a suggestion for improvement
  • Provide an additional use case to complement ones that we have in readme and in the sample project

I invite you @nblumhardt to have a look at what I've done and validate my thinking. This is optional as I know we are all quite busy, but I'm hoping that we both get something from it, I will get a better understanding how to use your library, and you'd have a chance for another perspective and possibly opportunities for improvements. Having said that I do fully appreciate if there are more important things in your life right now.

In the next few minutes I'll start posting my thought below in this thread. This is the repository with my experiment that I'm going to walk through below. It can be referred to as something working together in case certain piece from the explanation below do not make sense.

@AndrewSav
Copy link
Contributor Author

So I have a practical task at hand which I wanted to use to try out Superpower. My requirements can be sum up as following:

  • Written in C# so command line arguments come in a string array.
  • All arguments are switches some of those come with vaules (value switches) and some do not (binary switches)
  • Values in value switches are never optional. Values do not start with '-' or '/'
  • Every element of the input array of a conforming command line belongs to a switch (or its value). That is there are no commands, just switches.
  • An example of binary switch: -a; an example of value switch: -b fidget
  • Switches start with either - (but not --) or / - those are treated the same
  • Switches can have two forms representing the same switch, the short form was shown above the long form start with --. Example: --thingamabob bleh
  • Both binary and value switches can have a long or a short form
  • Switches can have either long or short form or both
  • Several short switches can be comnibed into one: -abcd. This combination cannot contain more than one value switch.
  • There are four types of value switches: integer (non-negative), string, list (e.g --servers server1 server2 server3) and quoted list.
  • Quoted list looks like this: --switches-to-pass-through "-a something -b -c".
  • The quoted part will be sent as a single array element to the C# program from command line. Note that quotes themselves are not passed by OS to the argument list.
  • Whatever is inside the quoted list value does not need to be paresed further.
  • It expected that the parser is aware of the type of every switch, so that if it's a value switch it will know to treat the subsequent input array item as the switch argument (or several argurments for list switches).

I tried not to over-complicate this description, yet I wanted the task to be non-trivial. I understand that these requirements are quite specific, and it is not likely that some one else would need to do the same same thing, I did not try to be generic. I tried to leverage Superpower for something non-trivial.

Overall I'm quite please with the result, the whole thing takes about 150 lines excluding comments and except for one method is quite easy to understand. Here is how I approached it.

@AndrewSav
Copy link
Contributor Author

Let's start with SwitchDescription class:

    public class SwitchDescription
    {
        public char? ShortName { get; set; }
        public string LongName { get; set; }
        public SwitchType Type { get; set; }
        public string Description { get; set; }
        public Action<string> SetValue { get; set; }
    }

We are going to be passing an array of these to our command line parser. Most of the fields are self-explanatory and logically follow from the requirements. I'll talk a bit more aboutSetValue below. SwitchType looks like this:

    public enum SwitchType
    {
        Binary,
        Int,
        String,
        QuotedList,
        List
    }

Those directly map to the for value switches from the requirements and the binary switch.

A program using the parser will probably also have an Options class, something similar to this:

    public class Options
    {
        public bool Binary1 { get; set; }
        public bool Binary2 { get; set; }
        public bool Binary3 { get; set; }
        public bool Binary4 { get; set; }
        public String String1 { get; set; }
        public String String2 { get; set; }
        public String String3 { get; set; }
        public String String4 { get; set; }
        public int Int1 { get; set; }
        public int Int2 { get; set; }
        public int Int3 { get; set; }
        public int Int4 { get; set; }
        public String[] List1 { get; set; }
        public String[] List2 { get; set; }
        public String[] List3 { get; set; }
        public String[] List4 { get; set; }
    }

The SetAction property of the SwitchDescription class can be set to fill in a propery in the Options class like this:

Options o = new Options();
switch.SetValue = a => o.String1 = a

@AndrewSav
Copy link
Contributor Author

So far we talked about stuff outside of the CommandLineParser, Those are pretty standard, now let's see how we can design the parser with help of Superpower.

We face a particular challenge here Tokenizer/Parser model of Superpower assumes that we are tokenizing and parsing some continuous test. In our case we don't have that test, instead we are passed a string array of args from the OS. This means that the standard model requires a bit of adjustment.

The two passes are still useful. We can view it like this.

Each args[x] value is either a switch(es) or a value. If it's a switch(es), it's either does not require a value (a binary switch or several binary switches in a single args[x]), requires a single value which will be provided in args[x+1] or require a list of values which will be provided in args[x+1] ... args[x+n]. In the latter two cases, the args[x] itself can be a single value switch or a value switch with some binary switches.

The paragraph above defines what our parser (second pass) will look like. In order for it to do its job each incoming token (corresponding to each of args[x]) need to be market with the ArgumentType. It's that type (TKind) that will be driving the parser (TokenListParser).

Defining ArgumentType is easy, based on the discussion above. It will look like this:

        private enum ArgumnetType
        {
            NoValueSwitch,
            SingleValueSwitch, 
            ListOfValuesSwitch,
            Value
        }

Note, that SingleValueSwitch will correspond to the following SwitchType values from above: Int, String, QuoutedList.

@AndrewSav
Copy link
Contributor Author

Now we need something that we can parse our each single args[x] into. Something that will form the Token that will go into TokenListParser that our parser is going to work off. To understand better what this something is, let's consider again what our non-value args[x] may look like. From the requirements there are a few possibilities:

  • A long binary switch --bugger-all
  • A long value switch --all-that, it can require one ore more values
  • A combination of short switches, all binaries -abcd
  • A combination of short switches, one is value switch -abwc, it can require one ore more values

From the example above it distills to the following class that we call Argument:

        private class Argument
        {
            public IEnumerable<SwitchDescription> BinarySwitches { get; set; }
            public SwitchDescription ValueSwitch { get; set; }
            ...
        }

This is quite straight forward. We can have zero or more binary switches, we can have zero or one value switch, and in case if there are zero binary switches and zero value switches, an instance of this class would signify a value (that is not a switch, but a value of a preceding switch). Let's codify this relationship, by adding a property ArgumentType to the Argument class:

public ArgumnetType ArgumentType
            {
                get
                {
                    if (BinarySwitches == null && ValueSwitch == null)
                    {
                        return ArgumnetType.Value;
                    }
                    if (ValueSwitch != null)
                    {
                        return MapSwitchTypeToTokenType(ValueSwitch);
                    }
                    return ArgumnetType.NoValueSwitch;
                }
            }

Hopefully, what's going on right above is obvious, it was just explained in the previous paragraph. The only unknown part is MapSwitchTypeToTokenType call that describes the relationship between the SwitchType and ArgumentType that we mentioned before. This relationship can be represented like this:

            private static ArgumnetType MapSwitchTypeToTokenType(SwitchDescription sd)
            {
                return new Dictionary<SwitchType, ArgumnetType>
                {
                    { SwitchType.Binary, ArgumnetType.NoValueSwitch},
                    { SwitchType.Int, ArgumnetType.SingleValueSwitch},
                    { SwitchType.String, ArgumnetType.SingleValueSwitch},
                    { SwitchType.QuotedList, ArgumnetType.SingleValueSwitch},
                    { SwitchType.List, ArgumnetType.ListOfValuesSwitch},
                }[sd.Type];
            }

@AndrewSav
Copy link
Contributor Author

Now when we have a good idea what our args[x] are going to be parsed to (why, into the Argument class!), we can look at defining our tokenizer parser.

Since Superpower is parser combinator it makes sense to start with defining partial parsers that we will combine into the final parser. Now the most tricky part of this discussion follows. First, I want to construct a parser, that would parse a part of args[x] into a SwitchDescription. For example in -abwc argument w symbol can be parsed into the SwitchDescription that defines switch -w. Or in --bugger-all the "bugger-all" string can be parsed into the SwitchDescription that defines --bugger-all.

Now with short names this is not too difficult we can use Characte.In method and supply all our switchDescription.ShortName to it. It is less straightforward with long names. Unfortunately Superpower does not have built in method for Span.EqualToAny that could have the following signature:

public static TextParser<TextSpan> EqualToAny(string[] text)

If it did, we'd have slightly easier time. As it is let us define a null parser:

private readonly TextParser<TextSpan> _nullParser = input => Result.Empty<TextSpan>(TextSpan.None);

We will use the parser as the seed (source) in Linq Aggregate method, to combine a list of parsers for each of the long name with TextParser.Or combinator from Superpower.

Another peculiarity of the next method we'll write is that it needs to return separate parsers for binary and non binary switches. This is so that we could use those separately for constructing a rule that allows only one value switch but amy number of binary switches. With that in mind here is the code to achieve that:

        private TextParser<SwitchDescription> GetSwitchParser(SwitchDescription[] switches, bool isBinary, bool isShort)
        {
            return isShort ? GetSwitchParserInner(s => s.ShortName) : GetSwitchParserInner(s => s.LongName);
            TextParser<SwitchDescription> GetSwitchParserInner<T>(Func<SwitchDescription, T> nameSelector)
            {
                var u = switches.Where(s => s.Type == SwitchType.Binary == isBinary && nameSelector(s) != null).Select(nameSelector);
                switch (u)
                {
                    case IEnumerable<char?> z:
                        return Character.In(z.Select(x => x.Value).ToArray()).Select(x => switches.First(s => nameSelector(s) as char? == x));
                    case IEnumerable<string> z:
                        return z.Select(Span.EqualTo).Aggregate(_nullParser, (a, b) => a.Try().Or(b)).Select(x => switches.First(s => nameSelector(s) as string == x.ToStringValue()));
                    default:
                        throw new InvalidOperationException();
                }
            }
        }

Unfortunately this code is a bit difficult to parse, so I'm open for any improvement suggestions. What's important here is that by providing different combinations of isBinary and isShort parameters we can get four different partial parsers that parse binary/value short/long switches.

@AndrewSav
Copy link
Contributor Author

For the next step let's tackle the two parsers for short names. We need to construct a rule, that allow as to have zero or more binary switches and zero or one value one. Here is how we do it:

            TextParser<SwitchDescription> binarySwitchShort = GetSwitchParser(switches, isBinary: true, isShort: true);
            TextParser<SwitchDescription> valueSwitchShort = GetSwitchParser(switches, isBinary: false, isShort: true);

            TextParser<Argument> intermediate =
                from first in binarySwitchShort.Many()
                from second in valueSwitchShort.OptionalOrDefault()
                from third in binarySwitchShort.Many()
                select new Argument { BinarySwitches = first.Concat(third), ValueSwitch = second };

            TextParser<Argument> withValueShort = Character.In('/', '-').Then(x => intermediate);
            TextParser<Argument> withoutValueShort = Character.In('/', '-').Then(x => binarySwitchShort.Many().Select(u => new Argument { BinarySwitches = u }));

This is much easier to understand than the previous piece of code. First, we use our previous function to get the binary and the value parsers, and then we combine those in the rule we have in the requirements. Finally we make sure that the input includes the switch signs / or - and that the result of parsing is an instance of our Argument class.

Dealing with long names is even easier:

            TextParser<SwitchDescription> binarySwitchLong = GetSwitchParser(switches, isBinary: true, isShort: false);
            TextParser<SwitchDescription> valueSwitchLong = GetSwitchParser(switches, isBinary: false, isShort: false);

            TextParser<Argument> withValueLong = Span.EqualTo("--").Then(x => valueSwitchLong).Select(u => new Argument { ValueSwitch = u });
            TextParser<Argument> withoutValueLong = Span.EqualTo("--").Then(x => binarySwitchLong).Select(u => new Argument { BinarySwitches = new [] {u} });

Finally, we combine each pair like this:

            _withoutValue = withoutValueLong.Try().Or(withoutValueShort).AtEnd();
            _withValue = withValueLong.Try().Or(withValueShort).AtEnd();

These to parser will be used in the final tokenizer to parse an args[x] without values and args[x] with values respectively. You will notice that we assign these two parser "globally", we'll talk as to why later.

@AndrewSav
Copy link
Contributor Author

Now we are ready to write the tokenizer code:

        private TokenList<ArgumnetType> Tokenize(string[] args)
        {
            TextParser<Argument> any = Character.AnyChar.Many().Value(new Argument()).AtEnd();
            TextParser<Argument> tokenizer = _withoutValue.Try().Or(_withValue).Try().Or(any);
            return new TokenList<ArgumnetType>(args.Select(a => new Token<ArgumnetType>(tokenizer.Parse(a).ArgumentType, new TextSpan(a))).ToArray());
        }

Again, this is pretty straightforward. One thing that you'll notice is that Superpower won't let you crate a token based on anything but TextSpan. I, frankly, would want Token to have a generic parameter that would allow any type of the underlying token, not just text. Because I cannot pass already parsed Argument I will have to parse all those again in the parser (pass 2). This is the reason why I had to save _withoutValue and _withValue parsers "globally", so that I could call them again, to re-parse respective arguments.

The tokenizer will give us a list of tokens for our parser, and the type of these tokens will be of ArgumentType that we defined above. The value of that type for each token is derived from the Argument instance as we saw earlier.

@AndrewSav
Copy link
Contributor Author

We are getting to the end. We will need to helper methods. First of those two should really be part of Superpower:

        private static TokenListParser<TKind, Token<TKind>> TokenAnyKind<TKind>()
        {
            return input =>
            {
                var next = input.ConsumeToken();
                if (!next.HasValue)
                    return TokenListParserResult.Empty<TKind, Token<TKind>>(input);
                return next;
            };
        }

This is similar to Character.AnyChar but works on the Token not on the Character.

The second method is for getting use if the SetValue property of SwitchDescription that was mentioned before to give the program using the command line parser the values of the command line switches that were parsed:

        private Unit SetSwitchValue(Argument sw, string val)
        {
            sw.ValueSwitch?.SetValue(val);
            if (sw.BinarySwitches != null)
            {
                foreach (var bin in sw.BinarySwitches)
                {
                    bin.SetValue(null);
                }
            }
            return Unit.Value;
        }

Unit is used instead of void in Superpower because you cannot chain things nicely with void: void cannot be a generic type argument.

@AndrewSav
Copy link
Contributor Author

The final part of the solution is the parser:

        private void Parse(TokenList<ArgumnetType> tokens)
        {
            TokenListParser<ArgumnetType, Unit> binary =
                from sw in Token.EqualTo(ArgumnetType.NoValueSwitch).Apply(x => _withoutValue)
                select SetSwitchValue(sw, null);

            TokenListParser<ArgumnetType, Unit> single =
                from sw in Token.EqualTo(ArgumnetType.SingleValueSwitch).Apply(x => _withValue)
                from val in TokenAnyKind<ArgumnetType>()
                select SetSwitchValue(sw, val.ToStringValue());

            TokenListParser<ArgumnetType, Unit> list =
                from sw in Token.EqualTo(ArgumnetType.ListOfValuesSwitch).Apply(x => _withValue)
                from val in Token.EqualTo(ArgumnetType.Value).AtLeastOnce()
                select SetSwitchValue(sw, string.Join(" ", val.Select(v => v.ToStringValue())));

            TokenListParser<ArgumnetType, Unit[]> parser = binary.Try().Or(single).Try().Or(list).Many().AtEnd();

            parser.Parse(tokens);
        }

The above is also more or less straightforward. Note how we use .Apply to re-parse out arguments again - something, that ideally should have been done once only. The we call SetSwitchValue to pass the parsed value onto the calling program.

Now we are done. We should chain tokenizer and parser for ease of use like this:

        public void Parse(string[] args)
        {
            Parse(Tokenize(args));
        }

        public static void Parse(string[] args, SwitchDescription[] switches)
        {
            (new CommandLineParser(switches)).Parse(args);
        }

@AndrewSav
Copy link
Contributor Author

AndrewSav commented Sep 12, 2017

@nblumhardt and I'm done. I'd appreciate your opinion. The entirety of the code is here.

  • Is this a sane approach? Are there any obvious improvements?
  • Do you think that Superpower could be made to support generic type of underlying Model.Token data not just TextSpan?
  • Do you think that Parsers.Token class can be extended to support more flexible selection by token type, where you could select by several token types?
  • Do you think Parsers.Span can be extended to match on any given strings from a string array, similar to how Character.In works?
  • You notice that I often need to use .Try().Or construct, from this particular use case it looks like in most case not failing on partial match would be a good default. Could you explain why it makes sense to fail on partial match by default?

Thank you in advance, appreciate you time if you could have a look at this, I know it's a big ask ;)

@nblumhardt
Copy link
Member

Hi Andrew! I'd love to dig into this - working on Superpower is a lot of fun, and it definitely does need thoughtful input like this - but I'm short of the time to properly digest such an enormous thread right now :-)

Perhaps, leaving this here for reference and for others to explore, we could consider one smaller point at a time?

(Didn't intend to close this earlier, sorry - comment button is badly placed on GitHub :-))

@nblumhardt nblumhardt reopened this Sep 13, 2017
@nblumhardt
Copy link
Member

Struggling to move anything forward for you on this one, may have to pass - but hope it's all going well!

@AndrewSav
Copy link
Contributor Author

@nblumhardt Was not really holding my breath;) Thank you!

@AndrewSav
Copy link
Contributor Author

Some further thoughts here:

Do you think that Superpower could be made to support generic type of underlying Model.Token data not just TextSpan?

In order for good error reporting which is the major design goal of superpower, you have to have TextSpan data, so that the error position (line/column) can be tracked.

Your normal tokenizer will usually build a Token from Result like this. result.Value here represent the result of Recognizer parsing, which is usually a token type similar to this.

It appears that the result of the Recoginzer parsing does not have to be a enum it can be any valid c# class. If we use the actual parsing result, then the parse will not need to re-parse the TextSpan again, because the result will already be contained inside the Token.

In this case though TKind will become a misnomer, because it is no longer Token kind it's token value itself. Parser may still need to find out the token kind and Token.EqualTo will not longer work because TKind is not a enum value, so won't be impossible to compare.

A solution to this could be a new Token.Matching method, demonstrated here, which can then be use like this

Do you think that Parsers.Token class can be extended to support more flexible selection by token type, where you could select by several token types?

One possible solution is to use a proposed OneOf method. This way it's easier to combine several marchers in to a single one.

Do you think Parsers.Span can be extended to match on any given strings from a string array, similar to how Character.In works?

Again, OneOf mentioned above can solve that. Instead for solving this for particular case of string array, it can be used to combine any kind of parsers, so just combine the string matchers with it.

You notice that I often need to use .Try().Or construct, from this particular use case it looks like in most case not failing on partial match would be a good default. Could you explain why it makes sense to fail on partial match by default?

So the thing is that you only need Try when you need backtracking. That is in the cases where failed parser consumed some input. If the failed parser did NOT consume any input you don't need the Try at all. Not having it is better for performance.

@jzabroski
Copy link

@AndrewSav I'm pretty thankful for your stream of conciousness. I was looking to see if anyone had implemented a command-line parser library using SuperPower, as the whole tokenization feature seems like it would allow for people to extend it with varying grammars. And the parser combinator feature would allow for nesting grammars.

@jzabroski
Copy link

@AndrewSav

  • You notice that I often need to use .Try().Or construct, from this particular use case it looks like in most case not failing on partial match would be a good default. Could you explain why it makes sense to fail on partial match by default?

Try only makes sense for built-in parsers for identifiers, reserved symbols, operators, and reserved operators (e.g. for future use). Arbitrary lookahead is not a good default, because it increases run-time cost and makes analyzing performance of parser combinator-based grammars more difficult. The risk of looking lookahead with Or is you could end up with ambiguous parse forest. To provide some lookahead, the parser combinators I mentioned are oftened called lexeme parsers because they only require one lookahead.

@AndrewSav
Copy link
Contributor Author

AndrewSav commented Aug 15, 2020

@jzabroski

Try only makes sense for built-in parsers for identifiers, reserved symbols, operators, and reserved operators (e.g. for future use). Arbitrary lookahead is not a good default, because it increases run-time cost and makes analyzing performance of parser combinator-based grammars more difficult.

An example would be helpful, otherwise it's a bit too abstract. I'm sure there are examples, I just did not come across any in anything I've done so far with Superpower.

The risk of looking lookahead with Or is you could end up with ambiguous parse forest.

Superpower is never ambiguous. It always prefers the first valid match.

To provide some lookahead, the parser combinators I mentioned are oftened called lexeme parsers because they only require one lookahead.

Which parser combinators did you mention? I must have missed that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants