LL parsing for case classes made simple.
Author: Gidon Ernst gidonernst@gmail.com
Feedback is welcome! I'd appreciate to hear whether anyone found this library useful.
Scala's case classes are a fine way to represent syntax trees. For example, consider a simple Lisp-like language for expressions with identifiers, numbers and function application:
trait Expr
case class Id(name: String) extends Expr
case class App(args: List[Expr]) extends Expr
This definition corresponds (almost) directly to the following attributed grammar:
expr := id | app
id := string { Id(_) }
app := "(" expr* ")" { App(_) }
The aim of this library is to minimize the fuss to create a parser for such a language. The main idea is that, given a parser for the arguments of a case class constructor, the parser for the case class is instantiated automatically. With arse you can specify this as follows:
val expr: Parser[Expr] = P(app | id)
val id = Id(name)
val app = App("(" ~ (expr *) ~ ")")
This example demonstrates a few features of the library:
- Sequential composition
p ~ q
, which is strict inq
ifp
succeeds - Ordered choice
p | q
, which backtracks ifp
fails and triesq
- Recursive parsers through
P(p)
, which evaluatesp
lazily - Implicit lifting of functions to parsers, where
f(p1 ~ ... ~ pn)
parsesp1 ~ ... ~ pn
and appliesf(a1, ..., an)
to the resultsa1, ..., an
Dependencies
Compile & Install
mill easyparse.jar
Parsers p: Parser[A]
parse a piece of text and return a result a: A
.
p.parse(in)
parsep
onin: String
or fail with an errorp.parse(in, cm=false)
try to parsep
and fail withbacktrack()
(do not commit, internally used for backtracking, see also bk)p.parseAt(in, pos, cm)
parsein
at positionpos
int
,double
,char
,string
: predefined scanners for numbers, character literals'x'
and strings"..."
(support escaping\'
resp\"
)val n: Parser[T] = P(p)
declares a recursive parser with namen
that defers evaluation ofp
to allow forward declaration ofp
; works in class/object scope whenp
refers to fields, but not whenp
refers to local variables defined laterret(a)
succeeds without consuming input and returnsa
p ~ q
parsesp
and returns a pair(a,b)
of the resultsp ~> q
andp <~ q
are variants of sequential composition that discard the right resp. left resultp | q
triesp
and if it fails parsesq
"lit" ~ p
matches a literal "lit", parsesp
and returns it's result, similarlyp ~ "lit"
p $
parsesp
but causes an error if there is remaining input afterp
p ?
,p *
,p +
: optional and sequence parsers, returningOption[_]
resp.List[_]
p ~* q
andp ~+ q
: parsep ~ q ~ ... ~ q ~ p
and return the results ofp
asList[_]
; the latter requires at least onep
resultp.filter(f)
,p.filterNot(f)
,p.map(f)
: filter/map a result by a predicate/functionf
p.reduceLeft(f)
,p.reduceRight(f)
: parsep+
and reduce the result withf
p.foldLeft(z)(f)
,p.foldRight(z)(f)
, parsez ~ p+
resp.p+ ~ z
and fold the result withf
Beware that this library takes a nuanced view on backtracking:
Ordered choice p | q
permits p
to fail and tries q
only if p
has 1) not consumed input and 2) not produced a result (which includes ret(a)
).
Similarly, p*
returns an error when p
can be parsed partially.
This leads to somewhat reasonable automatic error reporting and it is usually the behavior that one needs.
However, it means that you cannot backtrack over arbitrary complex pieces of syntax, which can sometimes be a nuisance.
For example, (int ~ "a") | (int ~ "b")
fails on the input "0b"
because it commits to the first choice after successfully parsing "0"
with int
.
-
M(p, op, ap, s)
parses mixfix expressionsp
is used to parse inner expressionsop
: parse an operator such as "+"ap
: apply subexpressions as arguments to an operator previously returned byop
s: Syntax
: a database of mixfix operators
trait Syntax
is parametrized by
def prefix_ops: Map[Op, Int]
def postfix_ops: Map[Op, Int]
def infix_ops: Map[Op, (Assoc, Int)]
which classify an operator returned by op
into prefix, postfix, and infix
and return the respective precedence.
If you want a dynamic set of mixfix operators, just implement these as var
s
in your Syntax
object, no need to re-create the M(...)
parser everytime you change the set of operators.
Note that it is possible to overload operators in the different categories,
the parsing algorithm can discern e.g. between unary and binary -
(minus).
Infix takes precedence over postfix,
i.e., parsing repetition *
will precede binary multiplication
only if the postfix priority of *
is high enough to prohibit a right
argument (at the given source location).
See also:
- http://javascript.crockford.com/tdop/tdop.html
- http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm
- http://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf (about the Parser Monad)
- Project ulang for a larger example
(
source/Parser.scala
in branch master resp.source/Grammar.scala
in refactor-parser)
The library is probably slow and would suffer from backtracking on ambiguous grammars.
The combinator library (no longer) shipped with Scala is much more feature complete and supports left recursion by packrat parsing.
The fastparse library for Scala provides a similar interface, but has many additional (cool!) features and it is designed for speed.
The parseback
handles all context free grammars through
parsing with derivatives.
Give this library a try if you need a more flexible grammar
or run into limitations of arse
s naive backtracking.
If you need guaranteed linear runtime complexity, try a LALR parser generator instead, such as beaver.
- line and position tracking
- default implementation of syntax trees that store positioning information
- mixfix pretty printer