Release 1.1.0
gkellogg committed Dec 12, 2016
2 parents 9fada88 + cf248d4 commit 19680f1
Showing 58 changed files with 1,640 additions and 527 deletions.
1 change: 0 additions & 1 deletion .gemspec

This file was deleted.

4 changes: 2 additions & 2 deletions .travis.yml
Expand Up @@ -4,8 +4,8 @@ script: "bundle exec rspec spec"
env:
- CI=true
rvm:
- 2.2.5
- 2.3.1
- 2.2.6
- 2.3.3
- jruby
- rbx
cache: bundler
Expand Down
134 changes: 123 additions & 11 deletions README.md
Expand Up @@ -8,12 +8,20 @@
[![Dependency Status](https://gemnasium.com/gkellogg/ebnf.png)](https://gemnasium.com/gkellogg/ebnf)

## Description
This is a [Ruby][] implementation of an [EBNF][] and [BNF][] parser and parser generator.
It parses [EBNF][] grammars to [BNF][], generates [First/Follow and Branch][] tables for
[LL(1)][] grammars, which can be used with the stream [Tokenizer][] and [LL(1) Parser][].
This is a [Ruby][] implementation of an [EBNF][] and [BNF][] parser and parser generator. It parses [EBNF][] grammars to [BNF][], generates [First/Follow][] and Branch tables for [LL(1)][] grammars, which can be used with the stream [Tokenizer][] and [LL(1) Parser][].

Of note in this implementation is that the tokenizer and parser are streaming, so that they can
process inputs of arbitrary size.
As LL(1) grammars operate using `alt` and `seq` primitives, allowing for a match on alternative productions or a sequence of productions, generating a parser requires turning the EBNF rules into BNF:

* Transform `a ::= b?` into `a ::= _empty | b`
* Transform `a ::= b+` into `a ::= b b*`
* Transform `a ::= b*` into `a ::= _empty | (b a)`
* Transform `a ::= op1 (op2)` into two rules:
```
a ::= op1 _a_1
_a_1 ::= op2
```
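
The transformations above can be sketched in a few lines of Ruby. This is only an illustration, not the gem's implementation: `ebnf_to_bnf` is a hypothetical helper, rules are represented as `[lhs, expression]` pairs, and the `_a_1` naming of the helper rule is assumed from the example above.

```ruby
# Hypothetical helper (not part of the gem's API): rewrites one EBNF postfix
# rule into BNF rules, each expressed as an [lhs, expression] pair.
def ebnf_to_bnf(lhs, op, operand)
  case op
  when :opt
    # a ::= b?  becomes  a ::= _empty | b
    [[lhs, [:alt, :_empty, operand]]]
  when :star
    # a ::= b*  becomes  a ::= _empty | (b a)
    [[lhs, [:alt, :_empty, [:seq, operand, lhs]]]]
  when :plus
    # a ::= b+  becomes  a ::= b b*, with b* moved into a helper rule
    star = :"_#{lhs}_1"
    [[lhs, [:seq, operand, star]]] + ebnf_to_bnf(star, :star, operand)
  else
    raise ArgumentError, "unsupported operator: #{op.inspect}"
  end
end

ebnf_to_bnf(:a, :plus, :b).each { |rule| p rule }
```

Each resulting expression uses only `alt` and `seq`, which is the property the LL(1) tables rely on.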

Of note in this implementation is that the tokenizer and parser are streaming, so that they can process inputs of arbitrary size.

## Usage
### Parsing an LL(1) Grammar
Expand All @@ -36,16 +44,37 @@ Generate [First/Follow][] rules for BNF grammars

ebnf.first_follow(start_tokens)

Generate Terminal, [First/Follow and Branch][] tables as Ruby for parsing grammars
Generate Terminal, [First/Follow][], Cleanup and Branch tables as Ruby for parsing grammars

ebnf.to_ruby

Generate formatted grammar using HTML (requires [Haml][Haml] gem)

ebnf.to_html

### Creating terminal definitions and parser rules to parse generated grammars
### Parser S-Expressions
Intermediate representations of the grammar may be serialized to Lisp-like S-Expressions. For example, the rule `[1] ebnf ::= (declaration | rule)*` is serialized as `(rule ebnf "1" (star (alt declaration rule)))`.

Once the [LL(1)][] conversion is made and the [First/Follow][] table is generated, this rule expands as follows:

(rule ebnf "1"
(start #t)
(first "@pass" "@terminals" LHS _eps)
(follow _eof)
(cleanup star)
(alt _empty _ebnf_2))
(rule _ebnf_1 "1.1"
(first "@pass" "@terminals" LHS)
(follow "@pass" "@terminals" LHS _eof)
(alt declaration rule))
(rule _ebnf_2 "1.2"
(first "@pass" "@terminals" LHS)
(follow _eof)
(cleanup merge)
(seq _ebnf_1 ebnf))
(rule _ebnf_3 "1.3" (first "@pass" "@terminals" LHS _eps) (follow _eof) (seq ebnf))
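
As a rough illustration of the serialization, nested Ruby arrays map naturally onto S-Expressions. The `to_sxp` function below is a stand-alone sketch written for this example; it is not claimed to be how the gem serializes rules.

```ruby
# Minimal sketch: arrays become parenthesized lists, strings are quoted,
# and everything else (symbols, numbers) is written bare.
def to_sxp(node)
  case node
  when Array  then "(" + node.map { |child| to_sxp(child) }.join(" ") + ")"
  when String then node.inspect
  else node.to_s
  end
end

rule = [:rule, :ebnf, "1", [:star, [:alt, :declaration, :rule]]]
puts to_sxp(rule)
# prints: (rule ebnf "1" (star (alt declaration rule)))
```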

### Creating terminal definitions and parser rules to parse generated grammars
The parser is initialized with callbacks invoked on entry to and exit from
each `terminal` and `production`. A trivial parser loop can be described as follows:

Expand Down Expand Up @@ -76,9 +105,10 @@ to each `terminal` and `production`. A trivial parser loop can be described as f

def initialize(input)
parser_options = {
:branch => BRANCH,
:first => FIRST,
:follow => FOLLOW
branch: BRANCH,
first: FIRST,
follow: FOLLOW,
cleanup: CLEANUP
}
parse(input, start_symbol, parser_options) do |context, *data|
# Process calls from callback from productions
Expand All @@ -88,10 +118,92 @@ to each `terminal` and `production`. A trivial parser loop can be described as f
raise RDF::ReaderError, e.message if validate?
end

### Branch Table
The Branch table is a hash mapping production rules to a hash relating terminals appearing in the input to the sequence of productions to follow when the corresponding input terminal is found. This allows either the `seq` primitive, where all terminals map to the same sequence of productions, or the `alt` primitive, where each terminal may map to a different production.

BRANCH = {
:alt => {
"(" => [:seq, :_alt_1],
:ENUM => [:seq, :_alt_1],
:HEX => [:seq, :_alt_1],
:O_ENUM => [:seq, :_alt_1],
:O_RANGE => [:seq, :_alt_1],
:RANGE => [:seq, :_alt_1],
:STRING1 => [:seq, :_alt_1],
:STRING2 => [:seq, :_alt_1],
:SYMBOL => [:seq, :_alt_1],
},
...
:declaration => {
"@pass" => [:pass],
"@terminals" => ["@terminals"],
},
...
}

In this case the `alt` rule is `seq ('|' seq)*`, which can be entered when any of the specified tokens appears on the input stream. They all cause the same token to be passed to the `seq` rule, followed by `_alt_1`, which handles the `('|' seq)*` portion of the rule after the first sequence is matched.

The `declaration` rule is `'@terminals' | pass`, using the `alt` primitive to determine the production to run based on the terminal appearing on the input stream. Eventually, a terminal production is found and the token is consumed.
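
A lookup against such a table might look like the following. This is a sketch only: `BRANCH_SKETCH` is a trimmed copy of the table above and `next_productions` is a hypothetical helper, not a method provided by the parser.

```ruby
# Trimmed copy of the branch table shown above, for illustration only.
BRANCH_SKETCH = {
  alt: {
    "("     => [:seq, :_alt_1],
    :SYMBOL => [:seq, :_alt_1],
  },
  declaration: {
    "@pass"      => [:pass],
    "@terminals" => ["@terminals"],
  },
}.freeze

# Hypothetical helper: returns the productions to push for the current
# production and lookahead token, or nil when the token is not expected.
def next_productions(branch, production, token)
  branch.fetch(production, {})[token]
end

p next_productions(BRANCH_SKETCH, :alt, :SYMBOL)          # seq behaviour
p next_productions(BRANCH_SKETCH, :declaration, "@pass")  # alt behaviour
p next_productions(BRANCH_SKETCH, :declaration, :SYMBOL)  # unexpected token
```

A `nil` result is where a real parser would report a syntax error or attempt recovery.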

### First/Follow Table
The [First/Follow][] table is a hash mapping production rules to the terminals that may precede or follow the rule. For example:

FIRST = {
:alt => [
:HEX,
:SYMBOL,
:ENUM,
:O_ENUM,
:RANGE,
:O_RANGE,
:STRING1,
:STRING2,
"("],
...
}
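
For readers unfamiliar with how first sets arise, the textbook fixed-point computation can be sketched as below. This is not the gem's algorithm: `first_sets` and the grammar encoding (each production maps to a list of alternatives, each alternative to a list of symbols, with any symbol that is not a key treated as a terminal) are assumptions made for the example.

```ruby
require 'set'

# Textbook FIRST computation: iterate until no set grows any further.
def first_sets(grammar)
  first = Hash.new { |hash, key| hash[key] = Set.new }
  changed = true
  while changed
    changed = false
    grammar.each do |lhs, alternatives|
      alternatives.each do |sequence|
        before = first[lhs].size
        nullable = true            # an empty sequence derives the empty string
        sequence.each do |symbol|
          if grammar.key?(symbol)  # non-terminal: inherit its first set
            first[lhs].merge(first[symbol] - [:_eps])
            nullable = first[symbol].include?(:_eps)
          else                     # terminal: it starts the sequence itself
            first[lhs] << symbol
            nullable = false
          end
          break unless nullable
        end
        first[lhs] << :_eps if nullable
        changed ||= first[lhs].size != before
      end
    end
  end
  first
end

grammar = {
  ebnf:        [[], [:_ebnf_1, :ebnf]],
  _ebnf_1:     [[:declaration], [:rule]],
  declaration: [["@terminals"], ["@pass"]],
  rule:        [[:LHS, :expression]],
  expression:  [[:SYMBOL]],
}
p first_sets(grammar)[:ebnf].to_a
```

For the `ebnf` production this yields `"@pass"`, `"@terminals"`, `LHS` and `_eps`, matching the `first` entry in the S-Expression shown earlier.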

### Terminals Table
This table is a simple list of the terminal productions found in the grammar. For example:

TERMINALS = ["(", ")", "-",
"@pass", "@terminals",
:ENUM, :HEX, :LHS, :O_ENUM, :O_RANGE,:POSTFIX,
:RANGE, :STRING1, :STRING2, :SYMBOL,"|"
].freeze

### Cleanup Table
This table identifies productions which used EBNF rules, which are transformed to BNF for actual parsing. This allows the parser, in some cases, to reproduce *star*, *plus*, and *opt* rule matches. For example:

CLEANUP = {
:_alt_1 => :star,
:_alt_3 => :merge,
:_diff_1 => :opt,
:ebnf => :star,
:_ebnf_2 => :merge,
:_postfix_1 => :opt,
:seq => :plus,
:_seq_1 => :star,
:_seq_2 => :merge,
}.freeze

In this case the `ebnf` rule was `(declaration | rule)*`. As BNF does not support a star operator, this is decomposed into a set of rules using `alt` and `seq` primitives:

ebnf ::= _empty | _ebnf_2
_ebnf_1 ::= declaration | rule
_ebnf_2 ::= _ebnf_1 ebnf
_ebnf_3 ::= ebnf

The `_empty` production matches an empty string, so allows for no value. `_ebnf_2` matches `declaration | rule` (using the `alt` primitive) followed by `ebnf`, creating a sequence of zero or more `declaration` or `rule` members.
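
The effect of the decomposition, and why a cleanup step is useful, can be sketched as follows. Both functions are hypothetical and operate on an array of pre-classified tokens rather than a real token stream; the gem's actual cleanup logic may differ.

```ruby
# Follows the decomposed rules literally:
#   ebnf    ::= _empty | _ebnf_2
#   _ebnf_2 ::= _ebnf_1 ebnf
#   _ebnf_1 ::= declaration | rule
# The recursion produces a right-nested result instead of a flat list.
def match_ebnf(tokens)
  return [] if tokens.empty?                    # _empty alternative
  head, *rest = tokens
  unless %i[declaration rule].include?(head)    # _ebnf_1 alternatives
    raise ArgumentError, "unexpected token: #{head.inspect}"
  end
  [head, match_ebnf(rest)]                      # _ebnf_1 followed by ebnf
end

# Flattens the nesting back into the list implied by the star operator.
def cleanup_star(nested)
  nested.empty? ? [] : [nested[0]] + cleanup_star(nested[1])
end

nested = match_ebnf(%i[declaration rule rule])
p nested                # [:declaration, [:rule, [:rule, []]]]
p cleanup_star(nested)  # [:declaration, :rule, :rule]
```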

## EBNF Grammar
The [EBNF][] variant used here is based on [W3C](http://w3.org/) [EBNF][] (see {file:etc/ebnf.ebnf EBNF grammar}) as defined in the
[XML 1.0 recommendation](http://www.w3.org/TR/REC-xml/), with minor extensions.
[XML 1.0 recommendation](http://www.w3.org/TR/REC-xml/), with minor extensions:

* Comments include `//` and `#` through end of line (other than hex character) and `/* ... */` or `(* ... *)`, which may cross lines
* All rules **MAY** start with an identifier, contained within square brackets. For example `[1] rule`, where the value within the brackets is a symbol `([a-z] | [A-Z] | [0-9] | "_" | ".")+`
* `@terminals` causes following rules to be treated as terminals. Any symbols which are entirely upper-case are also treated as terminals
* `@pass` defines the expression used to detect whitespace, which is removed in processing.
* No support for `wfc` (well-formedness constraint) or `vc` (validity constraint).
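
Two of these extensions lend themselves to a small sketch. The names below are hypothetical and the upper-case test is one plausible reading of "entirely upper-case" (no lower-case ASCII letters); neither is taken from the gem.

```ruby
# Optional rule identifier in square brackets, e.g. "[1]" or "[9a.1]".
RULE_ID = /\A\[([a-zA-Z0-9_.]+)\]/

# Returns the identifier inside the leading brackets, or nil if absent.
def rule_identifier(line)
  line[RULE_ID, 1]
end

# Assumed reading: a name counts as a terminal when it contains at least one
# upper-case letter and no lower-case letters.
def terminal_name?(name)
  name.match?(/[A-Z]/) && !name.match?(/[a-z]/)
end

p rule_identifier("[1] ebnf ::= (declaration | rule)*")
p rule_identifier("ebnf ::= (declaration | rule)*")
p terminal_name?("O_RANGE")
p terminal_name?("primary")
```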

Parsing this grammar yields an S-Expression version: {file:etc/ebnf.ll1.sxp}.

Expand Down
6 changes: 3 additions & 3 deletions Rakefile
Expand Up @@ -15,8 +15,8 @@ namespace :gem do
end

desc 'Default: run specs.'
task :default => :spec
task :specs => :spec
task default: :spec
task specs: :spec

require 'rspec/core/rake_task'
desc 'Run specifications'
Expand All @@ -41,7 +41,7 @@ namespace :doc do
end

desc 'Create versions of ebnf files in etc'
task :etc => %w{
task etc: %w{
etc/ebnf.sxp etc/ebnf.ll1.sxp etc/ebnf.html etc/ebnf.rb
etc/turtle.sxp etc/turtle.ll1.sxp etc/turtle.html etc/turtle.rb
etc/sparql.sxp etc/sparql.ll1.sxp etc/sparql.html etc/sparql.rb
Expand Down
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
1.0.2
1.1.0
6 changes: 3 additions & 3 deletions bin/ebnf
Expand Up @@ -12,9 +12,9 @@ require 'getoptlong'
require 'ebnf'

options = {
:output_format => :sxp,
:prefix => "ttl",
:namespace => "http://www.w3.org/ns/formats/Turtle#",
output_format: :sxp,
prefix: "ttl",
namespace: "http://www.w3.org/ns/formats/Turtle#",
}

input, out = nil, STDOUT
Expand Down
24 changes: 15 additions & 9 deletions etc/ebnf.ebnf
Expand Up @@ -20,6 +20,8 @@

[9] primary ::= HEX
| SYMBOL
| ENUM
| O_ENUM
| RANGE
| O_RANGE
| STRING1
Expand All @@ -36,29 +38,33 @@

[13] HEX ::= '#x' ([0-9]|[a-f]|[A-F])+

[14] ENUM ::= '[' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR)) '-' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR)) ']'

[15] O_ENUM ::= '[^' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR)) '-' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR)) ']'

# Range is any combination of R_CHAR '-' R_CHAR or R_CHAR+
[14] RANGE ::= '[' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR))+ ']'
[16] RANGE ::= '[' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR))+ ']'

# Range is any combination of R_CHAR '-' R_CHAR or R_CHAR+ preceded by ^
[15] O_RANGE ::= '[^' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR))+ ']'
[17] O_RANGE ::= '[^' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR))+ ']'

# Strings are unescaped Unicode, excepting control characters and hash (#)
[16] STRING1 ::= '"' (CHAR - '"')* '"'
[18] STRING1 ::= '"' (CHAR - '"')* '"'

[17] STRING2 ::= "'" (CHAR - "'")* "'"
[19] STRING2 ::= "'" (CHAR - "'")* "'"

[18] CHAR ::= HEX
[20] CHAR ::= HEX
| [#x20#x21#x22]
| [#x24-#x00FFFFFF]

[19] R_CHAR ::= CHAR - ']'
[21] R_CHAR ::= CHAR - ']'

[20] R_BEGIN ::= (HEX | R_CHAR) "-"
[22] R_BEGIN ::= (HEX | R_CHAR) "-"

# Should be able to do this inline, but not until terminal regular expressions are created automatically
[21] POSTFIX ::= [?*+]
[23] POSTFIX ::= [?*+]

[22] PASS ::= ( [#x00-#x20]
[24] PASS ::= ( [#x00-#x20]
| ( '#' | '//' ) [^#x0A#x0D]*
| '/*' (( '*' [^/] )? | [^*] )* '*/'
| '(*' (( '*' [^)] )? | [^*] )* '*)'
Expand Down
44 changes: 35 additions & 9 deletions etc/ebnf.html
Expand Up @@ -76,6 +76,8 @@
<td>
<a href="#grammar-production-HEX">HEX</a>
<code>|</code> <a href="#grammar-production-SYMBOL">SYMBOL</a>
<code>|</code> <a href="#grammar-production-ENUM">ENUM</a>
<code>|</code> <a href="#grammar-production-O_ENUM">O_ENUM</a>
<code>|</code> <a href="#grammar-production-RANGE">RANGE</a>
<code>|</code> <a href="#grammar-production-O_RANGE">O_RANGE</a>
<code>|</code> <a href="#grammar-production-STRING1">STRING1</a>
Expand Down Expand Up @@ -119,8 +121,32 @@
(<code>[</code> <code class="grammar-literal">0-9</code><code>]</code> <code>|</code> <code>[</code> <code class="grammar-literal">a-f</code><code>]</code> <code>|</code> <code>[</code> <code class="grammar-literal">A-F</code><code>]</code> )<code>+</code>
</td>
</tr>
<tr id='grammar-production-RANGE'>
<tr id='grammar-production-ENUM'>
<td>[14]</td>
<td><code>ENUM</code></td>
<td>::=</td>
<td>
"<code class="grammar-literal">[</code>"
<code>(</code> <a href="#grammar-production-R_BEGIN">R_BEGIN</a> <code>(</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code> <code>|</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code>
"<code class="grammar-literal">-</code>"
<code>(</code> <a href="#grammar-production-R_BEGIN">R_BEGIN</a> <code>(</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code> <code>|</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code>
"<code class="grammar-literal">]</code>"
</td>
</tr>
<tr id='grammar-production-O_ENUM'>
<td>[15]</td>
<td><code>O_ENUM</code></td>
<td>::=</td>
<td>
"<code class="grammar-literal">[^</code>"
<code>(</code> <a href="#grammar-production-R_BEGIN">R_BEGIN</a> <code>(</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code> <code>|</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code>
"<code class="grammar-literal">-</code>"
<code>(</code> <a href="#grammar-production-R_BEGIN">R_BEGIN</a> <code>(</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code> <code>|</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code>
"<code class="grammar-literal">]</code>"
</td>
</tr>
<tr id='grammar-production-RANGE'>
<td>[16]</td>
<td><code>RANGE</code></td>
<td>::=</td>
<td>
Expand All @@ -130,7 +156,7 @@
</td>
</tr>
<tr id='grammar-production-O_RANGE'>
<td>[15]</td>
<td>[17]</td>
<td><code>O_RANGE</code></td>
<td>::=</td>
<td>
Expand All @@ -140,7 +166,7 @@
</td>
</tr>
<tr id='grammar-production-STRING1'>
<td>[16]</td>
<td>[18]</td>
<td><code>STRING1</code></td>
<td>::=</td>
<td>
Expand All @@ -150,7 +176,7 @@
</td>
</tr>
<tr id='grammar-production-STRING2'>
<td>[17]</td>
<td>[19]</td>
<td><code>STRING2</code></td>
<td>::=</td>
<td>
Expand All @@ -160,7 +186,7 @@
</td>
</tr>
<tr id='grammar-production-CHAR'>
<td>[18]</td>
<td>[20]</td>
<td><code>CHAR</code></td>
<td>::=</td>
<td>
Expand All @@ -170,7 +196,7 @@
</td>
</tr>
<tr id='grammar-production-R_CHAR'>
<td>[19]</td>
<td>[21]</td>
<td><code>R_CHAR</code></td>
<td>::=</td>
<td>
Expand All @@ -179,7 +205,7 @@
</td>
</tr>
<tr id='grammar-production-R_BEGIN'>
<td>[20]</td>
<td>[22]</td>
<td><code>R_BEGIN</code></td>
<td>::=</td>
<td>
Expand All @@ -188,15 +214,15 @@
</td>
</tr>
<tr id='grammar-production-POSTFIX'>
<td>[21]</td>
<td>[23]</td>
<td><code>POSTFIX</code></td>
<td>::=</td>
<td>
<code>[</code> <code class="grammar-literal">?*+</code><code>]</code>
</td>
</tr>
<tr id='grammar-production-PASS'>
<td>[22]</td>
<td>[24]</td>
<td><code>PASS</code></td>
<td>::=</td>
<td>
Expand Down
