A C++17 recursive-descent parser library that can parse left-recursive grammars.
Parserlib allows writing recursive-descent parsers in C++, using the language's operators to imitate Extended Backus-Naur Form (EBNF) syntax.
The library can handle left recursion.
Here is a Calculator grammar example:
extern Rule<> add;
const auto val = +terminalRange('0', '9');
const auto num = val
| '(' >> add >> ')';
Rule<> mul = mul >> '*' >> num
| mul >> '/' >> num
| num;
Rule<> add = add >> '+' >> mul
| add >> '-' >> mul
| mul;
(NOTE: Rule 'add' needs a forward declaration because it is first used in 'num', before its definition.)
The above grammar is a direct translation of the following left-recursive EBNF grammar:
add = add + mul
| add - mul
| mul
mul = mul * num
| mul / num
| num
num = val
| ( add )
val = (0..9)+
The library is header-only, since every class is a template.
To use it, add the path to its root include folder to your project's include folder list.
Then, either include the root header, like this:
#include "parserlib.hpp"
Or include the individual headers in the subfolder 'parserlib'.
All the code is in the namespace parserlib:
using namespace parserlib;
A grammar can be written as a series of parsing expressions, formed by operators and by functions that create parser objects.
The most basic parser is the TerminalParser, which is used to parse a terminal. A terminal expression is written as follows:
terminal('x')
terminal("abc")
The terminals in this library are of type char by default, but the type can be customized.
Other types of terminal parsers are:
terminalRange('a', 'z') //parses all values between 'a' and 'z'.
terminalSet('+', '-') //parses '+' or '-'.
Terminals can be combined in sequences using the operator >>:
const auto ab = terminal('a') >> terminal('b');
const auto abc = ab >> terminal('c');
A sequence parses successfully only if all of its members parse successfully.
Expressions can have branches:
const auto this_or_that = terminal("this")
| terminal("that");
Branches are tried from top to bottom. If a branch fails to parse, the next branch is tried, until a branch succeeds or no more branches remain.
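The sequence and branch rules above can be illustrated with a small standalone sketch. It does not use parserlib and is not its implementation; the names `Parser`, `lit`, `seq`, `alt` and `run` are hypothetical, and the sketch assumes that a failed parser leaves the input position unchanged.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical mini-combinators, NOT parserlib's implementation.
// A parser receives the input and a position. On success it advances the
// position and returns true; on failure it leaves the position unchanged.
using Parser = std::function<bool(const std::string&, std::size_t&)>;

// Matches the literal `text` at the current position.
Parser lit(std::string text) {
    return [text](const std::string& in, std::size_t& pos) {
        if (in.compare(pos, text.size(), text) != 0) return false;
        pos += text.size();
        return true;
    };
}

// Sequence: every member must succeed, otherwise the position is restored.
Parser seq(Parser a, Parser b) {
    return [a, b](const std::string& in, std::size_t& pos) {
        const std::size_t start = pos;
        if (a(in, pos) && b(in, pos)) return true;
        pos = start;  // backtrack: the sequence as a whole failed
        return false;
    };
}

// Branches: try the first branch, and only if it fails try the second.
Parser alt(Parser a, Parser b) {
    return [a, b](const std::string& in, std::size_t& pos) {
        return a(in, pos) || b(in, pos);
    };
}

// Runs a parser from the start of `in`.
// Returns the number of characters consumed, or -1 on failure.
long run(const Parser& p, const std::string& in) {
    std::size_t pos = 0;
    return p(in, pos) ? static_cast<long>(pos) : -1;
}
```

For instance, `run(alt(lit("this"), lit("that")), "that!")` tries the branch "this" first, fails, and then consumes the four characters of "that".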
- The operator * parses an expression 0 or more times.
- The operator + parses an expression 1 or more times.
+(terminalRange('0', '9')) //parse a digit 1 or more times.
A parser can be made optional by using the operator -:
-terminalSet('+', '-') >> +terminalRange('0', '9') //parse a number; the sign is optional.
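As a rough illustration of what the loop and optional operators do, here is a standalone sketch that does not use parserlib. The names `range`, `zeroOrMore`, `oneOrMore`, `opt` and `run` are hypothetical stand-ins, not the library's API.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical mini-combinators, NOT parserlib's implementation.
// A failed parser leaves the position unchanged.
using Parser = std::function<bool(const std::string&, std::size_t&)>;

// Matches one character in the inclusive range [lo, hi].
Parser range(char lo, char hi) {
    return [lo, hi](const std::string& in, std::size_t& pos) {
        if (pos >= in.size() || in[pos] < lo || in[pos] > hi) return false;
        ++pos;
        return true;
    };
}

// Zero or more: repeats p until it fails; the loop itself always succeeds.
Parser zeroOrMore(Parser p) {
    return [p](const std::string& in, std::size_t& pos) {
        while (true) {
            const std::size_t before = pos;
            // Stop on failure, or when p succeeded without consuming anything.
            if (!p(in, pos) || pos == before) break;
        }
        return true;
    };
}

// One or more: the first attempt must succeed, the rest is a zero-or-more loop.
Parser oneOrMore(Parser p) {
    return [p](const std::string& in, std::size_t& pos) {
        return p(in, pos) && zeroOrMore(p)(in, pos);
    };
}

// Optional: succeeds whether or not p matches.
Parser opt(Parser p) {
    return [p](const std::string& in, std::size_t& pos) {
        p(in, pos);  // the result is deliberately ignored
        return true;
    };
}

// Returns the number of characters consumed from the start, or -1 on failure.
long run(const Parser& p, const std::string& in) {
    std::size_t pos = 0;
    return p(in, pos) ? static_cast<long>(pos) : -1;
}
```

The progress check in `zeroOrMore` is a design choice of this sketch: it prevents an endless loop when the repeated parser can succeed on empty input.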
- The operator & allows parsing an expression without consuming any tokens; it returns true if the parsing succeeds, false if it fails. It can be used to test a specific series of tokens before parsing.
- The operator ! inverts the result of a parsing expression; it returns true if the expression returns false and vice versa.
!terminalSet('+', '-') >> +terminalRange('0', '9') //parse an integer without a sign.
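The two predicates can be sketched in standalone C++ as follows. This is not parserlib code: `oneOf`, `andPred`, `notPred`, `seq` and `run` are hypothetical names, and the sketch assumes that neither predicate consumes input, which the text above states only for the operator &.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical mini-combinators, NOT parserlib's implementation.
using Parser = std::function<bool(const std::string&, std::size_t&)>;

// Matches one character contained in `set`.
Parser oneOf(std::string set) {
    return [set](const std::string& in, std::size_t& pos) {
        if (pos >= in.size() || set.find(in[pos]) == std::string::npos) return false;
        ++pos;
        return true;
    };
}

// And-predicate: reports whether p matches here, without consuming input.
Parser andPred(Parser p) {
    return [p](const std::string& in, std::size_t& pos) {
        std::size_t probe = pos;  // work on a copy so `pos` stays untouched
        return p(in, probe);
    };
}

// Not-predicate: succeeds only if p does NOT match here (assumed non-consuming).
Parser notPred(Parser p) {
    return [p](const std::string& in, std::size_t& pos) {
        std::size_t probe = pos;
        return !p(in, probe);
    };
}

// Sequence of two parsers; restores the position if either one fails.
Parser seq(Parser a, Parser b) {
    return [a, b](const std::string& in, std::size_t& pos) {
        const std::size_t start = pos;
        if (a(in, pos) && b(in, pos)) return true;
        pos = start;
        return false;
    };
}

// Returns the number of characters consumed from the start, or -1 on failure.
long run(const Parser& p, const std::string& in) {
    std::size_t pos = 0;
    return p(in, pos) ? static_cast<long>(pos) : -1;
}
```

With these definitions, `seq(notPred(oneOf("+-")), oneOf("0123456789"))` accepts "7" but rejects "-7", because the predicate sees the sign before the digit parser runs.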
- The operator == allows the assignment of a match id to a production; the created match does not have any children.
- The operator >= allows the assignment of a match id to a production; the created match has child matches.
(-terminalSet('+', '-') >> +terminalRange('0', '9')) == std::string("int")
In order to invoke a parser, the appropriate ParseContext instance must be created.
//declare a grammar
const auto grammar = (-terminalSet('+', '-') >> +terminalRange('0', '9')) == std::string("int");
//declare an input
std::string input = "123";
//declare a parse context over the input
ParseContext<> pc(input);
//parse
const bool ok = grammar(pc);
//iterate over recognized matches
for(const auto& match : pc.matches()) {
    if (match.id() == "int") {
        const auto parsedString = match.content();
        //process int
    }
}
Rules allow the writing of recursive grammars.
//whitespace
const auto whitespace = terminal(' ');
//integer
const auto integer = terminalRange('0', '9');
//forward declaration of recursive rule
extern Rule<> values;
//value; it is recursive
const auto value = integer
| terminal('(') >> values >> terminal(')');
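//rule: a value, optionally followed by whitespace and more values;
//the optional tail gives the recursion a way to terminate
Rule<> values = value >> -(whitespace >> values);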
The library can parse left-recursive grammars.
//the recursive rule
extern Rule<> expression;
//an integer is a series of digits
const auto integer = +terminalRange('0', '9');
//a value is either an integer or a parenthesized expression
Rule<> value = integer
| '(' >> expression >> ')';
//multiplication
Rule<> mul = mul >> '*' >> value
| mul >> '/' >> value
| value;
//addition
Rule<> add = add >> '+' >> mul
| add >> '-' >> mul
| mul;
//the root rule
Rule<> expression = add;
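To show what the left-recursive rules mean in terms of results, here is a hand-written evaluator for the same grammar. It is a sketch only and says nothing about how parserlib handles left recursion internally. Each rule of the form `r = r op x | x` is rewritten as `x (op x)*` and folded from the left, which gives the left associativity that the left-recursive form expresses. The names `Calc` and `evaluate` are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical hand-written evaluator, NOT parserlib code.
struct Calc {
    const std::string& in;
    std::size_t pos = 0;

    // Consumes `c` if it is the next character.
    bool eat(char c) {
        if (pos < in.size() && in[pos] == c) { ++pos; return true; }
        return false;
    }

    bool atDigit() const {
        return pos < in.size() && std::isdigit(static_cast<unsigned char>(in[pos]));
    }

    // value = integer | '(' expression ')'
    std::optional<long> value() {
        if (eat('(')) {
            const auto inner = add();
            if (!inner || !eat(')')) return std::nullopt;
            return inner;
        }
        if (!atDigit()) return std::nullopt;
        long n = 0;
        while (atDigit()) n = n * 10 + (in[pos++] - '0');
        return n;
    }

    // mul = mul '*' value | mul '/' value | value, folded from the left
    std::optional<long> mul() {
        auto left = value();
        while (left) {
            if (eat('*')) {
                const auto right = value();
                if (!right) return std::nullopt;
                left = *left * *right;
            } else if (eat('/')) {
                const auto right = value();
                if (!right || *right == 0) return std::nullopt;  // reject division by zero
                left = *left / *right;
            } else {
                break;
            }
        }
        return left;
    }

    // add = add '+' mul | add '-' mul | mul, folded from the left
    std::optional<long> add() {
        auto left = mul();
        while (left) {
            if (eat('+')) {
                const auto right = mul();
                if (!right) return std::nullopt;
                left = *left + *right;
            } else if (eat('-')) {
                const auto right = mul();
                if (!right) return std::nullopt;
                left = *left - *right;
            } else {
                break;
            }
        }
        return left;
    }
};

// Evaluates the whole input; returns no value on a syntax error.
std::optional<long> evaluate(const std::string& text) {
    Calc calc{text};
    const auto result = calc.add();
    if (!result || calc.pos != text.size()) return std::nullopt;
    return result;
}
```

Left associativity is what makes `evaluate("8-3-2")` yield (8-3)-2 = 3 rather than 8-(3-2) = 7.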
The class ParseContext is a template and has the following signature:
template <class SourceType, class MatchIdType, class SourcePositionType> class ParseContext;
It allows customizing the source type, the match id type and the source position type.
By default, a ParseContext instance will use an std::string as an input source. But this can be changed to accommodate any STL-like container.
For example, the source can be a static array of integers:
ParseContext<std::array<int, 1000>> pc(input);
The default match id type is std::string, but in practice it is usually an integer or an enumeration. Replacing std::string with a numeric value is also good for performance, since match ids are created and destroyed as parsing is performed.
Example:
ParseContext<std::string, int> pc(input);
The parse context's parameter named SourcePositionType allows the customization of character processing:
- customizing comparison of elements, for example in order to implement case insensitive parsing.
- providing extra information regarding the source, for example line and column numbers.
- customizing the newline character sequence.
The library already provides two classes for the above:
- class SourcePosition<class SourceType, bool CaseSensitive> is the most basic class that just contains an iterator for the current position; it allows for statically using either case sensitive or case insensitive parsing.
- class LineCountingSourcePosition<class SourceType, bool CaseSensitive, class NewlineTraits> extends the class SourcePosition with line and column information, and it also allows the specification of the newline sequence, which, by default, is implemented by the class DefaultNewlineTraits that recognizes the character \n as the newline separator.
Examples:
//case insensitive parsing
ParseContext<std::string, int, SourcePosition<std::string, false>> pc(input);
//case sensitive parsing with line counting
ParseContext<std::string, int, LineCountingSourcePosition<std::string>> pc(input);
//case insensitive parsing with line counting and custom newline traits
ParseContext<std::string, int, LineCountingSourcePosition<std::string, false, CustomNewlineTraits>> pc(input);
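To make the three responsibilities of a source position type more concrete, here is a standalone sketch of a position that compares elements, counts lines and columns, and delegates newline detection to a traits class. It is not parserlib's SourcePosition or LineCountingSourcePosition; `Cursor`, `LfNewline`, `CrLfNewline`, `newlineLength` and `locate` are hypothetical names, and the real classes may be organized differently.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical newline traits: length of the newline sequence at `pos`, or 0.
struct LfNewline {
    static std::size_t newlineLength(const std::string& s, std::size_t pos) {
        return (pos < s.size() && s[pos] == '\n') ? 1 : 0;
    }
};

struct CrLfNewline {
    static std::size_t newlineLength(const std::string& s, std::size_t pos) {
        return (pos + 1 < s.size() && s[pos] == '\r' && s[pos + 1] == '\n') ? 2 : 0;
    }
};

// Hypothetical position type, NOT parserlib's implementation.
template <bool CaseSensitive = true, class Newline = LfNewline>
struct Cursor {
    const std::string* source;
    std::size_t index = 0;
    std::size_t line = 1;
    std::size_t column = 1;

    bool atEnd() const { return index >= source->size(); }

    // Compares the current element with `c`, folding case if requested.
    bool matches(char c) const {
        if (atEnd()) return false;
        const char cur = (*source)[index];
        if constexpr (CaseSensitive) {
            return cur == c;
        } else {
            return std::tolower(static_cast<unsigned char>(cur)) ==
                   std::tolower(static_cast<unsigned char>(c));
        }
    }

    // Advances by one element, or by one whole newline sequence,
    // and keeps line and column in sync.
    void advance() {
        assert(!atEnd());
        const std::size_t n = Newline::newlineLength(*source, index);
        if (n > 0) {
            index += n;
            ++line;
            column = 1;
        } else {
            ++index;
            ++column;
        }
    }
};

using CaseInsensitiveCursor = Cursor<false>;
using CrLfCursor = Cursor<true, CrLfNewline>;

// Walks from the start of `text` until `target` is reached and returns the cursor.
template <class CursorType>
CursorType locate(const std::string& text, std::size_t target) {
    CursorType cur{&text};
    while (!cur.atEnd() && cur.index < target) cur.advance();
    return cur;
}
```

Both the case handling and the newline sequence are template parameters here, so they are selected at compile time, in the same spirit as the CaseSensitive and NewlineTraits parameters described above.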
The operator == allows the creation of a match when an expression parses successfully. The right hand side should be an expression which evaluates to the match id expected by the parse context. Example:
enum TYPE {
A, B, C
};
const auto a = terminal('A') == A;
const auto b = terminal('B') == B;
const auto c = terminal('C') == C;
const auto grammar = a >> b >> c;
std::string input = "ABC";
ParseContext<std::string, TYPE> pc(input);
const bool ok = grammar(pc);
for(const auto& match : pc.matches()) {
    std::cout << match.content() << " = " << match.id() << std::endl;
}
The above produces the output:
A = 0
B = 1
C = 2
The operator >= allows the creation of a match, like the operator ==, with a difference: all matches created within the context of the expression are placed as child matches.
This allows matches to also be trees, instead of a flat list.
In the following example, an IP4 address is returned as a tree match, with the following structure:
IP4_ADDRESS
    HEX_BYTE
        HEX_DIGIT
        HEX_DIGIT
    HEX_BYTE
        HEX_DIGIT
        HEX_DIGIT
    HEX_BYTE
        HEX_DIGIT
        HEX_DIGIT
    HEX_BYTE
        HEX_DIGIT
        HEX_DIGIT
Here is the code:
enum TYPE {
ZERO,
ONE,
TWO,
THREE,
FOUR,
FIVE,
SIX,
SEVEN,
EIGHT,
NINE,
A,
B,
C,
D,
E,
F,
HEX_DIGIT,
HEX_BYTE,
IP4_ADDRESS
};
const auto zero = terminal('0') == ZERO ;
const auto one = terminal('1') == ONE ;
const auto two = terminal('2') == TWO ;
const auto three = terminal('3') == THREE;
const auto four = terminal('4') == FOUR ;
const auto five = terminal('5') == FIVE ;
const auto six = terminal('6') == SIX ;
const auto seven = terminal('7') == SEVEN;
const auto eight = terminal('8') == EIGHT;
const auto nine = terminal('9') == NINE ;
const auto a = terminal('A') == A;
const auto b = terminal('B') == B;
const auto c = terminal('C') == C;
const auto d = terminal('D') == D;
const auto e = terminal('E') == E;
const auto f = terminal('F') == F;
const auto hexDigit = (zero | one | two | three | four | five | six | seven | eight | nine | a | b | c | d | e | f) >= HEX_DIGIT;
const auto hexByte = (hexDigit >> hexDigit) >= HEX_BYTE;
const auto ip4Address = (hexByte >> terminal('.') >> hexByte >> terminal('.') >> hexByte >> terminal('.') >> hexByte) >= IP4_ADDRESS;
const std::string input = "FF.12.DC.A0";
ParseContext<std::string, TYPE> pc(input);
using Match = typename ParseContext<std::string, TYPE>::Match;
const bool ok = ip4Address(pc);
assert(ok);
assert(pc.matches().size() == 1);
const Match& match = pc.matches()[0];
std::stringstream stream;
stream << match.children()[0].children()[0].content();
stream << match.children()[0].children()[1].content();
stream << '.';
stream << match.children()[1].children()[0].content();
stream << match.children()[1].children()[1].content();
stream << '.';
stream << match.children()[2].children()[0].content();
stream << match.children()[2].children()[1].content();
stream << '.';
stream << match.children()[3].children()[0].content();
stream << match.children()[3].children()[1].content();
const std::string output = stream.str();
std::cout << output;
The above prints the input, which is the value FF.12.DC.A0.
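The fixed indexing in the example works because the shape of the tree is known in advance. For trees of arbitrary shape, a recursive walk is the usual approach. The sketch below uses a hypothetical `Node` struct as a stand-in for a match, so that it compiles without the library; with parserlib, the same walk would presumably go through the id(), content() and children() accessors used above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a match in a match tree, NOT parserlib's Match class.
struct Node {
    std::string id;
    std::string content;
    std::vector<Node> children;
};

// Concatenates the content of all leaf nodes, from left to right.
std::string leafText(const Node& node) {
    if (node.children.empty()) return node.content;
    std::string out;
    for (const Node& child : node.children) out += leafText(child);
    return out;
}

// Renders the tree as an indented outline, one id per line.
std::string outline(const Node& node, int depth = 0) {
    std::string out(static_cast<std::size_t>(depth) * 4, ' ');
    out += node.id;
    out += '\n';
    for (const Node& child : node.children) out += outline(child, depth + 1);
    return out;
}
```

Note that `leafText` only returns text covered by leaf nodes; separators such as '.' that create no match of their own would not appear in its result.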
In order to resume from errors, the special operator ~() can be used to create an error resume point.
An error resume point is combined with operator >>() to create a sequence of parsers. If a parser before the error resume point produces an error, the error parser tries to resume parsing from the error resume point.
Here is an example of parsing a terminal enclosed in single quotes:
const auto ws = *terminal(' ');
const auto letter = terminalRange('a', 'z') | terminalRange('A', 'Z');
const auto digit = terminalRange('0', '9');
const auto character = letter | digit;
const auto terminal_ = ('\'' >> *(character - '\'') >> ~terminal('\'')) == "terminal";
const auto grammar = ws >> *(terminal_ >> ws);
If an error happens when parsing a terminal, then the parser will look for the single quote symbol ' in order to continue parsing.
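The general idea of resuming at a known token can be sketched by hand as follows. This is not parserlib's implementation, and it makes its own assumptions: `Result` and `parseQuoted` are hypothetical names, a terminal that contained an error is dropped rather than reported as a match, and an error outside quotes stops parsing because no resume point applies there.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type, NOT parserlib's.
struct Result {
    std::vector<std::string> terminals;  // contents of well-formed terminals
    std::vector<std::size_t> errors;     // input positions where an error was found
};

// Parses a series of single-quoted alphanumeric terminals separated by spaces.
// On an error inside a terminal, skips ahead to the next single quote
// (the resume point) and continues from there.
Result parseQuoted(const std::string& in) {
    Result result;
    std::size_t pos = 0;
    while (pos < in.size()) {
        if (in[pos] == ' ') { ++pos; continue; }  // skip whitespace
        if (in[pos] != '\'') {                    // not a terminal: no resume point here
            result.errors.push_back(pos);
            break;
        }
        ++pos;                                    // opening quote
        const std::size_t start = pos;
        while (pos < in.size() && std::isalnum(static_cast<unsigned char>(in[pos]))) ++pos;
        if (pos < in.size() && in[pos] == '\'') { // closing quote: terminal is complete
            result.terminals.push_back(in.substr(start, pos - start));
            ++pos;
            continue;
        }
        result.errors.push_back(pos);             // unexpected character or end of input
        const std::size_t resume = in.find('\'', pos);
        if (resume == std::string::npos) break;   // nothing to resume from
        pos = resume + 1;                         // continue after the resume point
    }
    return result;
}
```

For the input `'ab$c' 'de'`, this sketch records one error at the position of `$`, resumes after the following quote, and still recognizes the terminal `de`.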