# djspiewak/gll-combinators


====================================
Implementation Notes and Experiences
====================================

The original GLL algorithm is quite dependent upon an unrestricted ``goto``
statement. In fact, the form of ``goto`` required by GLL is unavailable even in
C, forcing the original authors to implement a workaround of their own by using
a "big switch" within the ``L0`` branch. Obviously, this algorithm is not
immediately amenable to implementation in a functional language, much less a
cleanly-separated implementation using combinators.

The critical observation which allows a goto-less implementation of the algorithm
concerns the nature of the ``L0`` branch. Upon close examination of the
algorithm, it becomes apparent that ``L0`` can be viewed as a *trampoline*, a
concept which is quite common in functional programming as a way of implementing
stackless mutual tail-recursion. In the case of GLL, this trampoline function
must not only dispatch the various alternate productions (also represented as
functions) but also have some knowledge of the GSS and the dispatch queue itself.
In short, ``L0`` is a trampoline function with some additional smarts to deal
with divergent and convergent branches.

Once this observation is made, the rest of the implementation just falls into
place. Continuations (wrapped up in anonymous functions) can be used to satisfy
the functionality of an unrestricted ``goto``, assuming a trampoline function
as described above. Surprisingly, this scheme divides itself quite cleanly into
combinator-like constructs, further reinforcing the claim that GLL is just another
incarnation of recursive-descent.
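
As a rough illustration of the trampoline idea mentioned above, the following
is a minimal sketch of stackless mutual tail-recursion in Scala. The names
``Bounce``, ``Done``, ``More`` and ``run`` are hypothetical and are not taken
from this project. A GLL-style ``L0`` would additionally need to track the GSS
and a dispatch queue, which this sketch deliberately leaves out.

```scala
import scala.annotation.tailrec

// Hypothetical types for illustration only: a computation is either finished
// (Done) or a thunk describing the next step (More).
sealed trait Bounce[A]
final case class Done[A](result: A) extends Bounce[A]
final case class More[A](next: () => Bounce[A]) extends Bounce[A]

object TrampolineSketch {
  // The trampoline itself: keeps invoking the next continuation in a loop,
  // so the call stack never grows with the depth of the recursion.
  @tailrec
  def run[A](b: Bounce[A]): A = b match {
    case Done(r) => r
    case More(k) => run(k())
  }

  // Two mutually recursive functions. Instead of calling each other directly
  // (the moral equivalent of a goto), each returns a continuation.
  def isEven(n: Int): Bounce[Boolean] =
    if (n == 0) Done(true) else More(() => isOdd(n - 1))

  def isOdd(n: Int): Bounce[Boolean] =
    if (n == 0) Done(false) else More(() => isEven(n - 1))
}
```

Calling ``TrampolineSketch.run(TrampolineSketch.isEven(100000))`` completes
without deep recursion, because every "jump" goes back through ``run``.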


Bumps in the Road
=================

Computation of true PREDICT sets is impossible because no ``Parser`` instance
actually knows what its successor is. Thus, we cannot compute FOLLOW sets
without "stepping out" into the parent parser. To avoid this, we say that
whenever FIRST(a) = { }, PREDICT(a) = \Sigma. Less formally, if a parser goes
to \epsilon, then its (uncomputed) PREDICT set is satisfied by *any* input.

Our GSS seems to be somewhat less effective than that of GLL due to the fact that
parallel sequential parsers with shared suffixes do not actually share state.
Thus, we could easily get the following situation in our GSS::

              C -- D -- F
             /
       A -- B
             \
              E -- D -- F

Notice that the ``D -- F`` suffix is shared, but because it is in separate parsers,
it will not be merged. Note however that if these two branches *reduce* to the
same value, that result will be merged. Alternatively, these branches may reduce
to differing values but eventually go to the same parser. When this happens, it
will be considered as a common prefix and merged accordingly (*not sure if this
is sound*).

Greedy vs lazy matching seems to be a problem. Consider the following grammar::

    A ::= 'a' A
        |

This grammar is actually quite ambiguous. The input string ``"aaa"`` may parse
as ``Success("", Stream('a', 'a', 'a'))``, ``Success("a", Stream('a', 'a'))``,
``Success("aa", Stream('a'))`` or ``Success("aaa", Stream())``. Obviously, this
is a problem. Or rather, this is a problem if we want to maintain PEG semantics.
In order to solve this problem, we need to define ``apply(...)`` for ``NonTerminalParser``
so that any ``Success`` with a tail ``!= Stream()`` becomes a
``Failure("Expected end of stream", tail)``.
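
To make the ambiguity above concrete, here is a small sketch that enumerates
every way the grammar ``A ::= 'a' A |`` can match a prefix of the input, and
then keeps only the results with an empty tail. It uses a plain ``List[Char]``
and tuples instead of the project's ``Stream``, ``Success`` and ``Failure``
types; ``parseA`` and ``complete`` are hypothetical names, and the filter only
approximates the ``apply(...)`` behaviour described above.

```scala
object AmbiguitySketch {
  // All matches of `A ::= 'a' A | <empty>` against a prefix of `in`,
  // as pairs of (consumed text, remaining input).
  def parseA(in: List[Char]): List[(String, List[Char])] = {
    // The empty production always succeeds without consuming anything.
    val viaEmpty = List(("", in))
    // The recursive production consumes one 'a' and then tries A again.
    val viaRecursion = in match {
      case 'a' :: rest => parseA(rest).map { case (m, tail) => ("a" + m, tail) }
      case _           => Nil
    }
    viaEmpty ++ viaRecursion
  }

  // Top-level entry point: discard every result that leaves input unconsumed,
  // which corresponds to turning such a success into a failure.
  def complete(in: List[Char]): List[String] =
    parseA(in).collect { case (m, Nil) => m }
}
```

For the input ``"aaa"``, ``parseA`` yields four results (one per possible
prefix), while ``complete`` keeps only the one that consumes the whole input.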

Parser equality is a very serious issue. Consider the following parser
declaration::

    def p: Parser[Any] = p | "a"

It would be nice to say that ``p == p'``, where ``p'`` is the "inner ``p``",
the recursive case. Unfortunately, these are actually two distinct instances of
``DisjunctiveParser``. This means that we cannot simply check equality to avoid
infinite recursion.

To solve this, we need to get direct access to the ``p`` thunk and check its
*class* rather than its *instance*. To do this, we will use Java reflection to
access the field value without allowing the Scala compiler to transparently
invoke the thunk. Once we have this value, we can invoke ``getClass`` and quickly
perform the comparison. The only problem with this solution is that it forces all of
the thunk uses to be logical constants. Thus, we cannot define a parser in the
following way::

    def p = make() | make()

    def make() = literal(Math.random.toString)

The ``DisjunctiveParser`` contained by ``p`` will consider both the left ``make()``
and the right ``make()`` to be exactly identical. Fortunately, we can safely
assume that grammars are constructed in a declarative fashion. The downside is
that when people *do* try something like this, the result will be fairly bizarre from
a user's standpoint.

Another interesting issue is one which arises in conjunction with left-recursion.
Consider the following grammar::

    def p: Parser[Any] = p ~ "a" | "a"

This grammar is quite unambiguous (so long as the parse is greedy), but it will
still lead to non-terminating execution for an input of ``Stream('a')``.
This is because the parser will handle the single character using the second production
while simultaneously queueing up the first production rule against the untouched
stream (``Stream('a')``). This rule will in turn queue up two more parsers: the
first and second rules again. The second rule will immediately match, produce a
duplicate result and be discarded. However, the *first* rule will behave exactly
as it did before, queueing up two more parsers without consuming any of the stream.
Needless to say, this is a slight issue.

The solution here is that the second queueing of the first rule must lead to a
memoization of the relevant parse. The second pass over the second rule should
return that result through the second queueing, saving that result in ``popped``
and avoiding the divergence. Thus, left-recursive rules will go *one* extra
queueing, but this extra step will be pruned as the successful parse will avoid
any additional repetition. Unfortunately, this solution is made more difficult
to implement due to the fact that disjunctive parsers are never themselves pushed
onto the dispatch queue. ``Trampoline`` does not know of any connection between
the first and second productions of a disjunction. It only knows that the two
separate productions have been pushed.

To solve this problem in a practical way, we need to introduce another ``Parser``
subtype: ``ThunkParser``. This parser just delegates everything to its wrapped
parser with the exception of ``queue``, which it leaves abstract. This parser
is instantiated using an anonymous inner class within ``DisjunctiveParser`` to
handle the details of queueing up the separate productions without "losing" the
disjunction itself.
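
The memoization idea for left-recursion described above can be sketched as a
small recognizer for ``p ::= p "a" | "a"`` driven by a dispatch queue. This is
*not* the project's ``Trampoline`` or ``ThunkParser`` code; it is a simplified
continuation-passing sketch under the assumption of a single nonterminal, and
``Entry``, ``recognize``, ``lit`` and ``succeed`` are hypothetical names. The
key point is that a second request for ``p`` at the same position only
registers a continuation instead of queueing the productions again.

```scala
import scala.collection.mutable

object LeftRecursionSketch {
  type Cont = Int => Unit // continuation receiving the end position of a match

  // Memo entry for `p` at one start position: who is waiting, what was found.
  final class Entry {
    val conts   = mutable.ListBuffer.empty[Cont]
    val results = mutable.LinkedHashSet.empty[Int]
  }

  def recognize(input: String): Boolean = {
    val memo  = mutable.Map.empty[Int, Entry]   // keyed by start position
    val queue = mutable.Queue.empty[() => Unit] // dispatch queue
    var accepted = false

    // Literal parser: continues only if the expected character is present.
    def lit(c: Char)(pos: Int, k: Cont): Unit =
      if (pos < input.length && input.charAt(pos) == c) k(pos + 1)

    def p(pos: Int, k: Cont): Unit = memo.get(pos) match {
      case Some(e) =>
        // Second (or later) request at this position: do not queue the
        // productions again, just subscribe and replay known results.
        e.conts += k
        e.results.toList.foreach(r => queue.enqueue(() => k(r)))
      case None =>
        val e = new Entry
        memo(pos) = e
        e.conts += k
        // Record a result once and hand it to every waiting continuation.
        def succeed(endPos: Int): Unit =
          if (e.results.add(endPos))
            e.conts.toList.foreach(c => queue.enqueue(() => c(endPos)))
        queue.enqueue(() => p(pos, mid => lit('a')(mid, succeed))) // p ~ "a"
        queue.enqueue(() => lit('a')(pos, succeed))                // "a"
    }

    p(0, endPos => if (endPos == input.length) accepted = true)
    while (queue.nonEmpty) queue.dequeue()() // the trampoline loop
    accepted
  }
}
```

With this arrangement the left-recursive production is entered exactly one
extra time per position, and the queue drains once no new end positions are
discovered.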

Another problem encountered while attempting to implement the trampoline is that
Scala's ``Stream`` implementation isn't quite what one would expect. In particular,
equality is defined on a reference basis, rather than by logical value. Thus,
two streams which have the same contents may not necessarily be equivalent according
to ``equals(...)``. This isn't normally an issue, but it does cause problems
with the ``Seq#toStream`` method::

    "".toStream == "".toStream      // => false!!

For non-left-recursive grammars, this will lead to duplicate results from the
parse. However, for left-recursive grammars, this could actually lead to
divergence. This isn't really a problem with GLL or the combinator implementation.
Rather, it is an issue with the Scala ``Stream`` implementation. To avoid this,
we must ensure that all input streams are created using ``Stream()``, ``Stream.cons``
and ``Stream.empty``.


From Recognizer to Parser
=========================

*TODO*