Release 1.1.0

dryruby · Dec 12, 2016 · 19680f1
2 parents 9fada88 + cf248d4
commit 19680f1

Showing 58 changed files with 1,640 additions and 527 deletions.
diff --git a/.gemspec b/.gemspec
diff --git a/.travis.yml b/.travis.yml
@@ -4,8 +4,8 @@ script: "bundle exec rspec spec"
 env:
   - CI=true
 rvm:
-  - 2.2.5
-  - 2.3.1
+  - 2.2.6
+  - 2.3.3
   - jruby
   - rbx
 cache: bundler

diff --git a/README.md b/README.md
@@ -8,12 +8,20 @@
 [![Dependency Status](https://gemnasium.com/gkellogg/ebnf.png)](https://gemnasium.com/gkellogg/ebnf)
 
 ## Description
-This is a [Ruby][] implementation of an [EBNF][] and [BNF][] parser and parser generator.
-It parses [EBNF][] grammars to [BNF][], generates [First/Follow and Branch][] tables for
-[LL(1)][] grammars, which can be used with the stream [Tokenizer][] and [LL(1) Parser][].
+This is a [Ruby][] implementation of an [EBNF][] and [BNF][] parser and parser generator. It parses [EBNF][] grammars to [BNF][] and generates [First/Follow][] and Branch tables for [LL(1)][] grammars, which can be used with the stream [Tokenizer][] and [LL(1) Parser][].
 
-Of note in this implementation is that the tokenizer and parser are streaming, so that they can
-process inputs of arbitrary size.
+As LL(1) grammars operate using `alt` and `seq` primitives, allowing for a match on alternative productions or a sequence of productions, generating a parser requires turning the EBNF rules into BNF:
+
+* Transform `a ::= b?` into `a ::= _empty | b`
+* Transform `a ::= b+` into `a ::= b b*`
+* Transform `a ::= b*` into `a ::= _empty | (b a)`
+* Transform `a ::= op1 (op2)` into two rules:
+  ```
+  a    ::= op1 _a_1
+  _a_1 ::= op2
+  ```
+
+Of note in this implementation is that the tokenizer and parser are streaming, so that they can process inputs of arbitrary size.
 
 ## Usage
 ### Parsing an LL(1) Grammar
@@ -36,16 +44,37 @@ Generate [First/Follow][] rules for BNF grammars
 
     ebnf.first_follow(start_tokens)
 
-Generate Terminal, [First/Follow and Branch][] tables as Ruby for parsing grammars
+Generate Terminal, [First/Follow][], Cleanup and Branch tables as Ruby for parsing grammars
 
     ebnf.to_ruby
 
 Generate formatted grammar using HTML (requires [Haml][Haml] gem)
 
     ebnf.to_html
 
-### Creating terminal definitions and parser rules to parse generated grammars
+### Parser S-Expressions
+Intermediate representations of the grammar may be serialized to Lisp-like S-Expressions. For example, the rule `[1] ebnf        ::= (declaration | rule)*` is serialized as `(rule ebnf "1" (star (alt declaration rule)))`.
+
+Once the [LL(1)][] conversion is made and the [First/Follow][] table is generated, this rule expands as follows:
+
+     (rule ebnf "1"
+      (start #t)
+      (first "@pass" "@terminals" LHS _eps)
+      (follow _eof)
+      (cleanup star)
+      (alt _empty _ebnf_2))
+     (rule _ebnf_1 "1.1"
+      (first "@pass" "@terminals" LHS)
+      (follow "@pass" "@terminals" LHS _eof)
+      (alt declaration rule))
+     (rule _ebnf_2 "1.2"
+      (first "@pass" "@terminals" LHS)
+      (follow _eof)
+      (cleanup merge)
+      (seq _ebnf_1 ebnf))
+     (rule _ebnf_3 "1.3" (first "@pass" "@terminals" LHS _eps) (follow _eof) (seq ebnf))
 
+### Creating terminal definitions and parser rules to parse generated grammars
 The parser is initialized with callbacks invoked on entry and exit
 to each `terminal` and `production`. A trivial parser loop can be described as follows:
 
@@ -76,9 +105,10 @@ to each `terminal` and `production`. A trivial parser loop can be described as f
 
       def initialize(input)
         parser_options = {
-          :branch => BRANCH,
-          :first => FIRST,
-          :follow => FOLLOW
+          branch: BRANCH,
+          first: FIRST,
+          follow: FOLLOW,
+          cleanup: CLEANUP
         }
         parse(input, start_symbol, parser_options) do |context, *data|
           # Process calls from callback from productions
@@ -88,10 +118,92 @@ to each `terminal` and `production`. A trivial parser loop can be described as f
           raise RDF::ReaderError, e.message if validate?
         end
 
+### Branch Table
+The Branch table is a hash mapping production rules to a hash relating terminals appearing in the input to the sequence of productions to follow when the corresponding input terminal is found. This allows either the `seq` primitive, where all terminals map to the same sequence of productions, or the `alt` primitive, where each terminal may map to a different production.
+
+    BRANCH = {
+      :alt => {
+        "(" => [:seq, :_alt_1],
+        :ENUM => [:seq, :_alt_1],
+        :HEX => [:seq, :_alt_1],
+        :O_ENUM => [:seq, :_alt_1],
+        :O_RANGE => [:seq, :_alt_1],
+        :RANGE => [:seq, :_alt_1],
+        :STRING1 => [:seq, :_alt_1],
+        :STRING2 => [:seq, :_alt_1],
+        :SYMBOL => [:seq, :_alt_1],
+      },
+      ...
+      :declaration => {
+        "@pass" => [:pass],
+        "@terminals" => ["@terminals"],
+      },
+      ...
+    }
+
+In this case the `alt` rule, `seq ('|' seq)*`, can be entered when any of the specified tokens appears on the input stream. They all cause the same token to be passed to the `seq` rule, followed by `_alt_1`, which handles the `('|' seq)*` portion of the rule after the first sequence is matched.
+
+The `declaration` rule is `'@terminals' | pass`, using the `alt` primitive to determine the production to run based on the terminal appearing on the input stream. Eventually, a terminal production is found and the token is consumed.
+
+### First/Follow Table
+The [First/Follow][] table is a hash mapping production rules to the terminals that may begin or follow the rule. For example:
+
+    FIRST = {
+      :alt => [
+        :HEX,
+        :SYMBOL,
+        :ENUM,
+        :O_ENUM,
+        :RANGE,
+        :O_RANGE,
+        :STRING1,
+        :STRING2,
+        "("],
+      ...
+    }
+
+### Terminals Table
+This table is a simple list of the terminal productions found in the grammar. For example:
+
+    TERMINALS = ["(", ")", "-",
+      "@pass", "@terminals",
+      :ENUM, :HEX, :LHS, :O_ENUM, :O_RANGE, :POSTFIX,
+      :RANGE, :STRING1, :STRING2, :SYMBOL, "|"
+    ].freeze
+
+### Cleanup Table
+This table identifies productions which used EBNF rules, which are transformed to BNF for actual parsing. This allows the parser, in some cases, to reproduce *star*, *plus*, and *opt* rule matches. For example:
+
+    CLEANUP = {
+      :_alt_1 => :star,
+      :_alt_3 => :merge,
+      :_diff_1 => :opt,
+      :ebnf => :star,
+      :_ebnf_2 => :merge,
+      :_postfix_1 => :opt,
+      :seq => :plus,
+      :_seq_1 => :star,
+      :_seq_2 => :merge,
+    }.freeze
+
+In this case the `ebnf` rule was `(declaration | rule)*`. As BNF does not support a star operator, this is decomposed into a set of rules using `alt` and `seq` primitives:
+
+    ebnf    ::= _empty | _ebnf_2
+    _ebnf_1 ::= declaration | rule
+    _ebnf_2 ::= _ebnf_1 ebnf
+    _ebnf_3 ::= ebnf
+
+The `_empty` production matches an empty string, so allows for no value. `_ebnf_2` matches `declaration | rule` (using the `alt` primitive) followed by `ebnf`, creating a sequence of zero or more `declaration` or `rule` members.
 
 ## EBNF Grammar
 The [EBNF][] variant used here is based on [W3C](http://w3.org/) [EBNF][] (see {file:etc/ebnf.ebnf EBNF grammar}) as defined in the
-[XML 1.0 recommendation](http://www.w3.org/TR/REC-xml/), with minor extensions.
+[XML 1.0 recommendation](http://www.w3.org/TR/REC-xml/), with minor extensions:
+
+* Comments include `//` and `#` through end of line (other than hex character) and `/* ... */` and `(* ... *)`, which may cross lines
+* All rules **MAY** start with an identifier, contained within square brackets. For example `[1] rule`, where the value within the brackets is a symbol `([a-z] | [A-Z] | [0-9] | "_" | ".")+`
+* `@terminals` causes following rules to be treated as terminals. Any symbols which are entirely upper-case are also treated as terminals
+* `@pass` defines the expression used to detect whitespace, which is removed in processing.
+* No support for `wfc` (well-formedness constraint) or `vc` (validity constraint).
 
 Parsing this grammar yields an S-Expression version: {file:etc/ebnf.ll1.sxp}.
 
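Note on the README hunk above: the three postfix transformations it lists (`?`, `+`, `*`) can be illustrated with a small Ruby sketch. This is not code from this commit or from the gem's API; `rewrite_postfix` and the array-based rule representation are hypothetical, chosen only to mirror the S-Expression forms shown in the README.

```ruby
# Hypothetical helper, not part of the ebnf gem.
# Rewrites a single EBNF postfix expression on rule `name` into the
# alt/seq form described by the README's transformation list.
# Expressions are modelled as arrays such as [:star, :b].
def rewrite_postfix(name, expr)
  op, operand = expr
  case op
  when :opt  then [:alt, :_empty, operand]               # a ::= b?  =>  a ::= _empty | b
  when :plus then [:seq, operand, [:star, operand]]      # a ::= b+  =>  a ::= b b*
  when :star then [:alt, :_empty, [:seq, operand, name]] # a ::= b*  =>  a ::= _empty | (b a)
  else expr                                              # no postfix operator: unchanged
  end
end

p rewrite_postfix(:a, [:opt, :b])
p rewrite_postfix(:a, [:plus, :b])
p rewrite_postfix(:a, [:star, :b])
```

The `plus` case still contains a `star`, so a real converter would presumably apply the rewrite repeatedly and introduce helper rules (such as `_a_1`) for nested expressions; this sketch stops at one level.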

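Note on the Branch table added to the README: the lookup it describes (production, then next input terminal, yielding the productions to follow) can be sketched as below. The table is an excerpt of the README example; `next_productions` is a hypothetical helper for illustration, not the parser's actual method.

```ruby
# Excerpt of the BRANCH table from the README example.
BRANCH = {
  :alt => {
    "("     => [:seq, :_alt_1],
    :SYMBOL => [:seq, :_alt_1],
  },
  :declaration => {
    "@pass"      => [:pass],
    "@terminals" => ["@terminals"],
  },
}.freeze

# Hypothetical helper: given the current production and the next input
# terminal, return the productions to follow, or nil if there is no branch.
def next_productions(branch, production, token)
  branch.fetch(production, {})[token]
end

p next_productions(BRANCH, :alt, "(")             # seq-style: every terminal maps to the same list
p next_productions(BRANCH, :declaration, "@pass") # alt-style: the terminal selects the production
p next_productions(BRANCH, :alt, ")")             # no entry for this terminal
```

A `nil` result corresponds to a terminal that cannot start the production; how the real parser reports or recovers from that case is not shown in this diff.
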
diff --git a/Rakefile b/Rakefile
@@ -15,8 +15,8 @@ namespace :gem do
 end
 
 desc 'Default: run specs.'
-task :default => :spec
-task :specs => :spec
+task default: :spec
+task specs: :spec
 
 require 'rspec/core/rake_task'
 desc 'Run specifications'
@@ -41,7 +41,7 @@ namespace :doc do
 end
 
 desc 'Create versions of ebnf files in etc'
-task :etc => %w{
+task etc: %w{
     etc/ebnf.sxp etc/ebnf.ll1.sxp etc/ebnf.html etc/ebnf.rb
     etc/turtle.sxp etc/turtle.ll1.sxp etc/turtle.html etc/turtle.rb
     etc/sparql.sxp etc/sparql.ll1.sxp etc/sparql.html etc/sparql.rb

diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-1.0.2
+1.1.0
diff --git a/bin/ebnf b/bin/ebnf
@@ -12,9 +12,9 @@ require 'getoptlong'
 require 'ebnf'
 
 options = {
-  :output_format => :sxp,
-  :prefix => "ttl",
-  :namespace => "http://www.w3.org/ns/formats/Turtle#",
+  output_format: :sxp,
+  prefix: "ttl",
+  namespace: "http://www.w3.org/ns/formats/Turtle#",
 }
 
 input, out = nil, STDOUT
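
Note on the hash-syntax changes in this file, the Rakefile, and the README: switching from `:key => value` to `key: value` should not change behaviour, since both forms produce Symbol keys (the shorthand requires Ruby 1.9 or later). A minimal check, using two of the option values from the hunk above:

```ruby
# Both literals build the same Hash with Symbol keys.
old_style = { :output_format => :sxp, :prefix => "ttl" }
new_style = { output_format: :sxp, prefix: "ttl" }

p old_style == new_style
p new_style.keys
```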

diff --git a/etc/ebnf.ebnf b/etc/ebnf.ebnf
@@ -20,6 +20,8 @@
 
     [9] primary     ::= HEX
                     |   SYMBOL
+                    |   ENUM
+                    |   O_ENUM
                     |   RANGE
                     |   O_RANGE
                     |   STRING1
@@ -36,29 +38,33 @@
 
     [13] HEX        ::= '#x' ([0-9]|[a-f]|[A-F])+
 
+    [14] ENUM       ::= '[' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR)) '-' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR)) ']'
+
+    [15] O_ENUM     ::= '[^' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR)) '-' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR)) ']'
+
     # Range is any combination of R_CHAR '-' R_CHAR or R_CHAR+
-    [14] RANGE      ::= '[' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR))+ ']'
+    [16] RANGE      ::= '[' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR))+ ']'
 
     # Range is any combination of R_CHAR '-' R_CHAR or R_CHAR+ preceded by ^
-    [15] O_RANGE    ::= '[^' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR))+ ']'
+    [17] O_RANGE    ::= '[^' ((R_BEGIN (HEX | R_CHAR)) | (HEX | R_CHAR))+ ']'
 
     # Strings are unescaped Unicode, excepting control characters and hash (#)
-    [16] STRING1    ::= '"' (CHAR - '"')* '"'
+    [18] STRING1    ::= '"' (CHAR - '"')* '"'
 
-    [17] STRING2    ::= "'" (CHAR - "'")* "'"
+    [19] STRING2    ::= "'" (CHAR - "'")* "'"
 
-    [18] CHAR       ::= HEX
+    [20] CHAR       ::= HEX
                       | [#x20#x21#x22]
                       | [#x24-#x00FFFFFF]
 
-    [19] R_CHAR     ::= CHAR - ']'
+    [21] R_CHAR     ::= CHAR - ']'
 
-    [20] R_BEGIN    ::= (HEX | R_CHAR) "-"
+    [22] R_BEGIN    ::= (HEX | R_CHAR) "-"
 
     # Should be able to do this inline, but not until terminal regular expressions are created automatically
-    [21] POSTFIX    ::= [?*+]
+    [23] POSTFIX    ::= [?*+]
 
-    [22] PASS       ::= ( [#x00-#x20]
+    [24] PASS       ::= ( [#x00-#x20]
                         | ( '#' | '//' ) [^#x0A#x0D]*
                         | '/*' (( '*' [^/] )? | [^*] )* '*/'
                         | '(*' (( '*' [^)] )? | [^*] )* '*)'

diff --git a/etc/ebnf.html b/etc/ebnf.html
@@ -76,6 +76,8 @@
       <td>
         <a href="#grammar-production-HEX">HEX</a>
         <code>|</code> <a href="#grammar-production-SYMBOL">SYMBOL</a>
+        <code>|</code> <a href="#grammar-production-ENUM">ENUM</a>
+        <code>|</code> <a href="#grammar-production-O_ENUM">O_ENUM</a>
         <code>|</code> <a href="#grammar-production-RANGE">RANGE</a>
         <code>|</code> <a href="#grammar-production-O_RANGE">O_RANGE</a>
         <code>|</code> <a href="#grammar-production-STRING1">STRING1</a>
@@ -119,8 +121,32 @@
         (<code>[</code> <code class="grammar-literal">0-9</code><code>]</code>  <code>|</code> <code>[</code> <code class="grammar-literal">a-f</code><code>]</code>  <code>|</code> <code>[</code> <code class="grammar-literal">A-F</code><code>]</code> )<code>+</code>
       </td>
     </tr>
-    <tr id='grammar-production-RANGE'>
+    <tr id='grammar-production-ENUM'>
       <td>[14]</td>
+      <td><code>ENUM</code></td>
+      <td>::=</td>
+      <td>
+        "<code class="grammar-literal">[</code>"
+        <code>(</code> <a href="#grammar-production-R_BEGIN">R_BEGIN</a> <code>(</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code>  <code>|</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code> 
+        "<code class="grammar-literal">-</code>"
+        <code>(</code> <a href="#grammar-production-R_BEGIN">R_BEGIN</a> <code>(</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code>  <code>|</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code> 
+        "<code class="grammar-literal">]</code>"
+      </td>
+    </tr>
+    <tr id='grammar-production-O_ENUM'>
+      <td>[15]</td>
+      <td><code>O_ENUM</code></td>
+      <td>::=</td>
+      <td>
+        "<code class="grammar-literal">[^</code>"
+        <code>(</code> <a href="#grammar-production-R_BEGIN">R_BEGIN</a> <code>(</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code>  <code>|</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code> 
+        "<code class="grammar-literal">-</code>"
+        <code>(</code> <a href="#grammar-production-R_BEGIN">R_BEGIN</a> <code>(</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code>  <code>|</code> <a href="#grammar-production-HEX">HEX</a> <code>|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a><code>)</code> 
+        "<code class="grammar-literal">]</code>"
+      </td>
+    </tr>
+    <tr id='grammar-production-RANGE'>
+      <td>[16]</td>
       <td><code>RANGE</code></td>
       <td>::=</td>
       <td>
@@ -130,7 +156,7 @@
       </td>
     </tr>
     <tr id='grammar-production-O_RANGE'>
-      <td>[15]</td>
+      <td>[17]</td>
       <td><code>O_RANGE</code></td>
       <td>::=</td>
       <td>
@@ -140,7 +166,7 @@
       </td>
     </tr>
     <tr id='grammar-production-STRING1'>
-      <td>[16]</td>
+      <td>[18]</td>
       <td><code>STRING1</code></td>
       <td>::=</td>
       <td>
@@ -150,7 +176,7 @@
       </td>
     </tr>
     <tr id='grammar-production-STRING2'>
-      <td>[17]</td>
+      <td>[19]</td>
       <td><code>STRING2</code></td>
       <td>::=</td>
       <td>
@@ -160,7 +186,7 @@
       </td>
     </tr>
     <tr id='grammar-production-CHAR'>
-      <td>[18]</td>
+      <td>[20]</td>
       <td><code>CHAR</code></td>
       <td>::=</td>
       <td>
@@ -170,7 +196,7 @@
       </td>
     </tr>
     <tr id='grammar-production-R_CHAR'>
-      <td>[19]</td>
+      <td>[21]</td>
       <td><code>R_CHAR</code></td>
       <td>::=</td>
       <td>
@@ -179,7 +205,7 @@
       </td>
     </tr>
     <tr id='grammar-production-R_BEGIN'>
-      <td>[20]</td>
+      <td>[22]</td>
       <td><code>R_BEGIN</code></td>
       <td>::=</td>
       <td>
@@ -188,15 +214,15 @@
       </td>
     </tr>
     <tr id='grammar-production-POSTFIX'>
-      <td>[21]</td>
+      <td>[23]</td>
       <td><code>POSTFIX</code></td>
       <td>::=</td>
       <td>
         <code>[</code> <code class="grammar-literal">?*+</code><code>]</code>
       </td>
     </tr>
     <tr id='grammar-production-PASS'>
-      <td>[22]</td>
+      <td>[24]</td>
       <td><code>PASS</code></td>
       <td>::=</td>
       <td>