Skip to content
This repository has been archived by the owner on Oct 28, 2023. It is now read-only.

Tutorial MiniJava

Christophe VG edited this page Mar 6, 2017 · 3 revisions

In this tutorial we will create a parser for MiniJava. MiniJava is a subset of Java, introduced in Appel and Palsberg's Modern Compiler Implementation in Java. We'll start with a verbatim copy of the grammar as presented and go through several steps to transform it to an optimised grammar for the Human Parser Generator.

Prerequisites

The Human Parser Generator

The first thing we need is, of course, the Human Parser Generator. As of now a binary build of the HPG is available from the GitHub repository's release page. We will use this binary release in this tutorial, thus avoiding having to [build the generator](Building HPG) from the repository.

If you prefer to build the binary yourself, the minimal survival commands are:

$ git clone https://github.com/christophevg/human-parser-generator.git
Cloning into 'human-parser-generator'...
remote: Counting objects: 1390, done.
remote: Compressing objects: 100% (588/588), done.
remote: Total 1390 (delta 381), reused 0 (delta 0), pack-reused 790
Receiving objects: 100% (1390/1390), 2.62 MiB | 599.00 KiB/s, done.
Resolving deltas: 100% (908/908), done.

$ cd human-parser-generator/

$ msbuild
Microsoft (R) Build Engine version 14.1.0.0
Copyright (C) Microsoft Corporation. All rights reserved.

Build started 3/6/2017 2:58:02 PM.
Project "/Users/xtof/Workspace/human-parser-generator/hpg.csproj" on node 1 (default targets).
Gen0Parser:
  /Library/Frameworks/Mono.framework/Versions/4.6.2/lib/mono/4.5/csc.exe /debug+ /out:bin/Debug/hpg.gen0.exe /target:exe generator/parsable.cs generator/generator.cs generator/factory.cs generator/emitter.csharp.cs generator/emitter.bnf.cs generator/format.csharp.cs generator/AssemblyInfo.cs generator/grammar.cs generator/bootstrap.cs
Gen1Source:
  mono bin/Debug/hpg.gen0.exe generator/hpg.bnf | LC_ALL="C" astyle -s2 -xt0 -xe -Y -xC80 > generator/parser.gen1.cs
Gen1Parser:
  /Library/Frameworks/Mono.framework/Versions/4.6.2/lib/mono/4.5/csc.exe /debug+ /out:bin/Debug/hpg.gen1.exe /target:exe generator/parsable.cs generator/generator.cs generator/factory.cs generator/emitter.csharp.cs generator/emitter.bnf.cs generator/format.csharp.cs generator/AssemblyInfo.cs generator/parser.gen1.cs generator/hpg.cs
HPGSource:
  mono bin/Debug/hpg.gen1.exe generator/hpg.bnf | LC_ALL="C" astyle -s2 -xt0 -xe -Y -xC80 > generator/parser.cs
Build:
  /Library/Frameworks/Mono.framework/Versions/4.6.2/lib/mono/4.5/csc.exe /debug+ /out:bin/Debug/hpg.exe /target:exe generator/parsable.cs generator/generator.cs generator/factory.cs generator/emitter.csharp.cs generator/emitter.bnf.cs generator/format.csharp.cs generator/AssemblyInfo.cs generator/parser.cs generator/hpg.cs
Done Building Project "/Users/xtof/Workspace/human-parser-generator/hpg.csproj" (default targets).

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:02.65

$ mono bin/Debug/hpg.exe --help
Human Parser Generator version 1.1.6274.26942
Usage: hpg.exe [options] [file ...]

    --help, -h              Show usage information
    --version, -v           Show version information

    --output, -o FILENAME   Output to file, not stdout

Output options.
Select one of the following:
    --parser, -p            Generate parser (DEFAULT)
    --ast, -a               Show AST
    --model, -m             Show parser model
    --grammar, -g           Show grammar
Formatting options.
    --text, -t              Generate textual output (DEFAULT).
    --dot, -d               Generate Graphviz/Dot format output. (model)
Emission options.
    --info, -i              Suppress generation of info header
    --rule, -r              Suppress generation of rule comment
    --namespace, -n NAME    Embed parser in namespace

A MiniJava Grammar

Just Google for 'MiniJava Grammar' and the first hit gives us a BNF grammar:

Program	::=	MainClass ( ClassDeclaration )* <EOF>
MainClass	::=	"class" Identifier "{" "public" "static" "void" "main" "(" "String" "[" "]" Identifier ")" "{" Statement "}" "}"
ClassDeclaration	::=	"class" Identifier ( "extends" Identifier )? "{" ( VarDeclaration )* ( MethodDeclaration )* "}"
VarDeclaration	::=	Type Identifier ";"
MethodDeclaration	::=	"public" Type Identifier "(" ( Type Identifier ( "," Type Identifier )* )? ")" "{" ( VarDeclaration )* ( Statement )* "return" Expression ";" "}"
Type	::=	"int" "[" "]"
|	"boolean"
|	"int"
|	Identifier
Statement	::=	"{" ( Statement )* "}"
|	"if" "(" Expression ")" Statement "else" Statement
|	"while" "(" Expression ")" Statement
|	"System.out.println" "(" Expression ")" ";"
|	Identifier "=" Expression ";"
|	Identifier "[" Expression "]" "=" Expression ";"
Expression	::=	Expression ( "&&" | "<" | "+" | "-" | "*" ) Expression
|	Expression "[" Expression "]"
|	Expression "." "length"
|	Expression "." Identifier "(" ( Expression ( "," Expression )* )? ")"
|	<INTEGER_LITERAL>
|	"true"
|	"false"
|	Identifier
|	"this"
|	"new" "int" "[" Expression "]"
|	"new" Identifier "(" ")"
|	"!" Expression
|	"(" Expression ")"
Identifier	::=	<IDENTIFIER>

This grammar defines the MiniJava language. An example of a MiniJava program that a parser for this grammar would be able to parse is:

class Factorial {
  public static void main(String[] a){
    System.out.println(new Fac().ComputeFac(10));
  }
}

class Fac {
  public int ComputeFac(int num) {
    int num_aux ;
    if (num < 1)
        num_aux = 1 ;
    else
        num_aux = num * (this.ComputeFac(num-1)) ;
    return num_aux ;
  }
}

Let's Get Started

Although I'd love to tell you differently, right now, you can't simply copy and paste this grammar into a file and feed it to HPG. You can try, but trust me, you won't get far ;-)

Some Syntax Transformations

What are the minimal steps we need to take to be able to give this grammar to HPG:

  1. Add semi-colons ; at the end of each rule. I've started the development from a EBNF-like grammar specification and require a ; at the end of each rule.
  2. Remove <EOF>. HPG (for now) does this implicitly and there is (currently) no way to simulate an explicit <EOF> token.
  3. Implement <IDENTIFIER>. You could implement this in a very traditional way, by enumerating all possible characters and aggregate those into character classes and so on. The [HPG Grammar](HPG Grammar) has an extension in its EBNF-like syntax that introduces so called Extractors. These are basically patterns that match strings, allowing to match and extract parts of the input. In this case we could implement it as
    Identifier ::= ::= /([A-Za-z_][A-Za-z0-9-_]*)/ ;.
    Yes that looks a lot like a regular expression ;-)
  4. Implement <INTEGER_LITERAL> in the same way:
    Integer ::= /(-?[1-9][0-9]*)/ ;.
  5. Again, adhering more to an EBNF-like syntax, replace all occurrences of repetitive blocks ( ... )* by { ... } and all occurrences of optional blocks ( ... )? by [ ... ]. In the near future, these alternative syntaxes will also be supported, as I try to make the HPG grammar syntax as much as possible compatible with all these variations.).
  6. Optionally you could also replace ::= with =, which is used in EBNF. Both are supported by HPG, so you can choose ;-)

Let's take a look at the grammar so far:

Program = MainClass { ClassDeclaration } ;

MainClass = "class" Identifier "{"
              "public" "static" "void" "main" "(" "String" "[" "]" Identifier ")" "{"
                Statement
              "}"
            "}" ;

ClassDeclaration = "class" Identifier [ "extends" Identifier ] "{"
                     { VarDeclaration }
                     { MethodDeclaration }
                   "}" ;

VarDeclaration = Type Identifier ";" ;

MethodDeclaration = "public" Type Identifier "(" [ Type Identifier { "," Type Identifier } ] ")" "{"
                      { VarDeclaration }
                      { Statement }
                      "return" Expression ";"
                    "}" ;

Type = "int" "[" "]"
     | "boolean"
     | "int"
     | Identifier
     ;

Statement = "{" { Statement } "}"
          | "if" "(" Expression ")" Statement "else" Statement
          | "while" "(" Expression ")" Statement
          | "System.out.println" "(" Expression ")" ";"
          | Identifier "=" Expression ";"
          | Identifier "[" Expression "]" "=" Expression ";"
          ;

Expression = Expression ( "&&" | "<" | "+" | "-" | "*" ) Expression
           | Expression "[" Expression "]"
           | Expression "." "length"
           | Expression "." Identifier "(" [ Expression { "," Expression } ] ")"
           | Integer
           | "true"
           | "false"
           | Identifier
           | "this"
           | "new" "int" "[" Expression "]"
           | "new" Identifier "(" ")"
           | "!" Expression
           | "(" Expression ")"
           ;

Identifier  = ? /([A-Za-z_][A-Za-z0-9-_]*)/ ? ;
Integer     = ? /(-?[1-9][0-9]*)/ ? ;

At this point we could be ready in the case of many grammars. But not this grammar. This grammar contains left recursion and HPG is a straightforward recursive descent parser and will go into infinite loops before you know it.

Eliminating Left Recursion

There is a lot to read about the elimination of left recursion. In most cases it can even be automated. Maybe I'll look into that in a future version, but for now, we'll have to eliminate it manually.

The problem area is Expression:

Expression = Expression ( "&&" | "<" | "+" | "-" | "*" ) Expression
           | Expression "[" Expression "]"
           | Expression "." "length"
           | Expression "." Identifier "(" [ Expression { "," Expression } ] ")"
           | Integer
           | "true"
           | "false"
           | Identifier
           | "this"
           | "new" "int" "[" Expression "]"
           | "new" Identifier "(" ")"
           | "!" Expression
           | "(" Expression ")"
           ;

So what is the problem?

HPG is a recursive descent parser generator. This means that it will follow the rules in the grammar, going to the next rule as a rule dictates it. So if we want to parse an Expression, we start with the first alternative Expression ( "&&" | "<" | "+" | "-" | "*" ) Expression, which first requires an Expression. So we want to parse an Expression and we start with the first alternative Expression ( "&&" | "<" | "+" | "-" | "*" ) Expression, which first requires and Expression. So we ... I guess you get the picture ;-)

So is this grammar bad/incorrect/...?

No, the grammar is fine, but it just requires a more capable parser. Read more about it online. For now, HPG can't cope with it so we need to eliminate it.

Left Recursion ... Be Gone!

The problem is situated in more detail in the first for alternatives, which all start with a direct left recursion on Expression itself.

Expression = Expression ( "&&" | "<" | "+" | "-" | "*" ) Expression
           | Expression "[" Expression "]"
           | Expression "." "length"
           | Expression "." Identifier "(" [ Expression { "," Expression } ] ")"
...

The first alternative deals with so called binary operators, while the the others deal with postfix extensions.

We need to redesign this flat structure of Expression alternatives and turn them into a hierarchy that allows us to differentiate between all actual expressions. This involves taking into account _operator precedence and reducing the scope of the first part of the Expression, avoiding looping back to it.

The HPG grammar also deals with this typical expression left-recursion. Here I've applied the same pattern:

expression               = postfix-expression
                         | non-postfix-expression
                         ;

postfix-expression       = indexed-expression
                         | method-call-expression
                         | length-expression
                         ;

indexed-expression       = non-postfix-expression "[" expression "]" ;
method-call-expression   = non-postfix-expression "." identifier "(" [ expression { "," expression } ] ")" ;
length-expression        = non-postfix-expression "." "length" ;

non-postfix-expression   = prefix-expression
                         | non-prefix-expression
                         ;

prefix-expression        = not-expression ;

not-expression           = "!" prefix-expression ;

non-prefix-expression    = times-expression
                         | non-times-expression
                         ;

times-expression         = non-times-expression "*" non-prefix-expression ;

non-times-expression     = additive-expression
                         | non-additive-expression
                         ;

additive-expression      = non-additive-expression [ "+" | "-" ] non-times-expression ;

non-additive-expression  = less-than-expression 
                         | non-less-than-expression
                         ;

less-than-expression     = non-less-than-expression "<" non-additive-expression ;

non-less-than-expression = and-expression 
                         | non-and-expression
                         ;
                         
and-expression           = non-and-expression "&&" non-less-than-expression ;


non-and-expression       = group-expression
                         | primary-expression
                         ;

group-expression         = "(" expression ")" ;

We start with the operation(s) with the most precedense or tighter binding: indexing, length property and method call with optional arguments. Next we handle the negating operator, followed by multiplication, addition, comparing and logical.

At each level, the same pattern is applied: extract an expression of this level, or try extracting one of the level below. When dealing with left recursion at a certain level, the recursion isn't performed to the same level, but also to the level below. This way, we always tumble down the hierarchy, right down to primary-expression.

And now we have a complete grammar that HPG is capable of generating a parser for. You can consult it in the MiniJava example folder.

Generating Your First Human Parser

Push the button!

I advise to use astyle to format the generated code ;-)

$ mono hpg.exe mini-java.bnf | astyle -s2 -xt0 -xe -Y

will output the following C# code on standard output:

/// DO NOT EDIT THIS FILE
// This file was generated using the Human Parser Generator
// (https://github.com/christophevg/human-parser-generator)
// on Tuesday, February 21, 2017 at 10:50:34 PM
// Source : mini-java.bnf

using System;
using System.IO;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Linq;

// Program ::= MainClass { ClassDeclaration } ;
public class Program {
  public Mainclass Mainclass { get; set; }
  public List<Classdeclaration> Classdeclarations { get; set; }
  public Program() {
    this.Classdeclarations = new List<Classdeclaration>();
  }
  public override string ToString() {
    return
    "Program(" +
    "Mainclass=" + this.Mainclass + "," +
    "Classdeclarations=" + "[" +
    string.Join(",", this.Classdeclarations.Select(x => x.ToString())) +
    "]" +
    ")";
  }
}
...

Ok, those 842 lines of code look like a parser, but does it actually parses e.g. the example we saw earlier? To build this parser, we need two additional files from the HPG framework: parsable.cs and dump-ast.cs. The former contains some support to make the generated parser easier to generate and prettier to look at. The latter is a small runner application that takes a filename as an argument, parses the content of that file and dumps the resulting AST using its default generate ToString support.

The MiniJava example is also included in the HPG repository as example/mini-java. The following build starts from this location and accesses the additional generator files through the repository:

$ mono hpg.exe mini-java.bnf > mini-java.cs
$ mcs ../../generator/{parsable,dump-ast}.cs mini-java.cs -out:parse-mini-java.exe 
$ mono parse-mini-java.exe example.mini-java | LC_ALL="C" astyle -s2 -xt0 -xe -Y -xC80

parse-mini-java.exe is built using generator/dump-ast.cs. This parses an input file provided as a command line argument and dumps the resulting AST. It does this by calling ToString() on the AST. The generated ToString() outputs the C# code that constructs the corresponding AST.

The output is:

new Program() {
  MainClass = new MainClass() {
    Identifier0 = "Factorial",
    Identifier1 = "a",
    Statement = new Statement() {
      Statement0s = new List<Statement>() {},
      Expression0 = null,
      Statement1 = null,
      Statement2 = null,
      Expression1 = null,
      Statement3 = null,
      Expression2 = new MethodCallExpression() {
        NonPostfixExpression = new PrimaryExpression() {
          Integer = null,
          HasTrue = False,
          HasFalse = False,
          HasThis = False,
          Expression = null,
          Identifier0 = "Fac",
          Identifier1 = null
        },
...

So we're done?

Well, take a look at the generated parser code and tell me if you think that looks human ;-) Or even better, let's generate a quick UML view of the classes that will be used to construct the AST. hpg.exe has support for generating GraphViz'a dot format:

$ mono hpg.exe -m -d mini-java.bnf | unflatten -f -l 1 -c 4 | dot -T png -o mini-java.png

This produces the following diagram:

AST Classes of Imported Grammar

Just looking at AST above or at the classes and their properties in the diagram shows a lot of generated properties that we're not interested in, little useful structure,...

But this was only the first step, taking a verbatim grammar and make it generatable by HPG. Next we'll start applying some of the [HPG Grammar extensions](HPG Grammar) to refactor the resulting AST. With a few simple steps we can already make a big difference.

Putting the Human back in the Parser Generator

The Human Parser Generator has support at several levels for improving the generated parser and resulting AST. Let's start with some simple cosmetic changes:

The Case of Rule Names

When I earlier mentioned the syntactic changes that are needed to make the grammar generatable, I didn't mention the fact that, although HPG supports EBNF grammars and EBNF allows for spaces in rule names, HPG doesn't. In the MiniJava grammar these didn't occur, so I didn't mention them at that time.

A best practice that I follow here is to replace spaces with dashes (-). The generator will remove these dashes and apply Camel casing (or Pascal casing) to create (nice) class names and variable names. So for example less-than-expression becomes LessThanExpression. If you look carefully at the grammar and resulting classes, you'll notice that in the grammar currently defines LessThanExpression, and this is generated as Lessthanexpression. So the generator (or I?) forces us to adhere to this naming convention.

As a first improvement step, we change all rule names to dash-separated names.

Name Your Rules

Let's take a look at the first lines of the resulting AST:

new Program() {
  MainClass = new MainClass() {
    Identifier0 = "Factorial",
    Identifier1 = "a",
    Statement = 

Without having discussed all aspects of how the generator interprets grammars, it's easy to understand this AST and its relation to the corresponding rule:

main-class = "class" identifier "{"
               "public" "static" "void" "main" "(" "String" "[" "]" identifier ")" "{"
                 statement
               "}"
             "}" ;

Syntax vs Semantics

"class", "public", ... are all syntactic sugar that in a way help the parser to identify the actual content: identifier, identifier and statement.

From a parsing point of view, the first identifier and the second identifier are the same, but semantically they mean something else. Therefore the generator generates separate properties for them, but beyond giving them a sightly different name, using an index, it can't do much more.

Enter annotations!

The HPG grammar provides a way to change the default name that is given to properties of terminals. Terminals in the HPG grammar are rules that match a string, extract a string or refer to another rule, identifiers. Let's introduce them by applying them to the main-class example rule:

main-class = "class" class-name @ identifier "{"
               "public" "static" "void" "main" "(" "String" "[" "]" args @ identifier ")" "{"
                 statement
               "}"
             "}" ;

Notice the added class-name @ and args @ prefixes before the two instances of identifier.

Now, rebuild your parser and try it again:

new Program() {
  MainClass = new MainClass() {
    ClassName = "Factorial",
    Args = "a",
    Statement = ...

Much better! This way you can now annotate all generic rules in different semantical situations. The resulting grammar can be found in the MiniJava example folder.

Annotations are not only useful for differentiating between the same rules, but also to give a more meaningful name to an single instance. An example in the grammar is: "if" "(" condition @ expression ")" ...

Note that you could do this also in a different way:

main-class = "class" class-name  "{"
               "public" "static" "void" "main" "(" "String" "[" "]" args ")" "{"
                 statement
               "}"
             "}" ;

class-name = identifier ;
args       = identifier ;

From these examples, you will by now understand that reference to other rules, in the expansion of a rule, will (by default) generate properties on a class (more generally called an entity) with the name of the rule. This is why the alternative way works, and is sometimes used to obtain the desired generator behaviour.

Polymorphic Rules

Annotations solved a lot of semantical ambiguities in the grammar, but not all situations are easily resolved this way. Let's take a look at an example of a Statement in the AST, the first one we encounter in the output:

Statement = new Statement() {
  BlockStatements = new List<Statement>() {},
  Condition = null,
  TrueStatement = null,
  FalseStatement = null,
  Expression = null,
  RepeatedStatement = null,
  Output = ... ,
  Variable = null,
  Value = null,
  IndexedVariable = null,
  Index = null,
  IndexedValue = null
}

This results from the rule:

statement = "{" { block-statement @ statement } "}"
          | "if" "(" condition @ expression ")"
              true-statement @ statement
            "else"
              false-statement @ statement
          | "while" "(" expression ")" repeated-statement @ statement
          | "System.out.println" "(" output @ expression ")" ";"
          | variable @ identifier "=" value @ expression ";"
          | indexed-variable @ identifier "[" index @ expression "]"
            "=" indexed-value @ expression ";"
          ;

It's really hard to find good annotation names for alternatives and the fact that they all result in properties on the same entity, isn't helpful either. Let's introduce something we know from object-orientation: inheritance and polymorphism.

Remember we could replace annotation by addition more rules? Well, let's do that again:

statement            = block-statement
                     | if-statement
                     | while-statement
                     | print-statement
                     | assignment-statement
                     ;

block-statement      = "{" { statement } "}" ;
if-statement         = "if" "(" condition @ expression ")"
                         true-statement @ statement
                       "else"
                         false-statement @ statement
                     ;
while-statement      = "while" "(" expression ")" statement ;
print-statement      = "System.out.print" [ "ln" ] "(" expression ")" ";" ;
assignment-statement = variable @ identifier [ "[" index @ expression "]" ]
                       "=" expression ";"
                     ;

What has been done here? All alternatives for the statement rule, have been refactored into a rule of their own. The generator detects that each alternative would produce exactly one property, which means that they are all true alternatives. The statement entity can now push down its type to the other entities, who become implementations of statement.

If we now look again at the same part in the AST:

Statement = new PrintStatement() {
  HasLn = True,
  Expression = ... ,
}

I've added some additional grammar here: [ "ln" ]. This allows me to introduce the following section, but also hides an inherent problem with a not-so-realistic grammar. The fact that the print statement is explicitly part of the grammar, is because it's a very basic grammar. Without the added HasLn property, the PrintStatement would consist of a single property, which would allow it to push down its type down to Expression. The semantic meaning of the PrintStatement would be lost. Normally this would be some sort of MethodCall with System.out.Println the value of a property.

But first we apply this same polymorphism pattern to primary-expression. In its current form, this is also not applicable for inheritance, which is a shame.

primary-expression       = integer-expression
                         | boolean-expression
                         | this-expression
                         | constructor-expression
                         | identifier-expression
                         ;

integer-expression       = value @ integer ;
boolean-expression       = value @ boolean ;
this-expression          = "this" ;
constructor-expression   = "new" "int" "[" size @ expression "]"
                         | "new" class-name @ identifier "(" ")"
                         ;
identifier-expression    = name @ identifier ;

Another entity that needs this treatment is type:

type                     = primitive-type
                         | list-type
                         | complex-type
                         ;

primitive-type           = integer-type
                         | boolean-type
                         ;

integer-type             = "int" ;
boolean-type             = "boolean" ;

list-type                = integer-type "[" "]";

complex-type             = type @ identifier ;

The resulting grammar can be found in the MiniJava example folder.

Suppressing Properties

Now let's get back to the addition I made earlier to the print-statement: [ "nl" ].

Normally, string literals in a rule are just tools for the parser to match a rule based on certain keywords or syntactic sugar. But when such strings are optional, the parser wants to provide this information to the user and the generator will create boolean properties of the form has<String>. This is what happens here, and it allows us to create a second property on print-statement.

But, as we look at the current AST, there are more instances of this pattern that we don't want in our clean AST object model. Luckily, HPG provides a conventional mechanism to suppress the creation of properties: the underscore (_) name.

We've already seen that we can use annotations to change the name of terminal/properties. If we use an underscore for the name in the annotation, the generator will not generate a property for that terminal.

Let's try that. The current AST contains HasCommas, which tracks the presence of commas separating the arguments of the method-call-expression:

new Program() {
  MainClass = new MainClass() {
    ClassName = "Factorial",
    Args = "a",
    Statement = new PrintStatement() {
      HasLn = True,
      Expression = new MethodCallExpression() {
        Object = new ConstructorExpression() {
          Size = null,
          ClassName = "Fac"
        },
        Identifier = "ComputeFac",
        HasCommas = new List<bool>() {},
        Arguments = new List<Expression>() {
          new IntegerExpression() {
            Value = "10"
          }
...  

We can now suppress this property with a name annotation: _ @ ",".

Other redundant properties that can be suppressed this way are : HasExtends and HasOpenSquareBracket and HasCloseSquareBracket.

Another instance of this can be found in the definition of additive-expression:

additive-expression = lhs @ non-additive-expression [ "+" | "-" ] rhs @ non-times-expression ;

This produces HasPlus and HasDash, while we would like to have the operator:

additive-expression = lhs @ non-additive-expression operator @ addition-operator rhs @ non-times-expression ;

addition-operator   = ? /(\+|-)/ ? ;

The resulting grammar can be found in the MiniJava example folder.

List Pattern

So far, the generator has been helping us without use noticing it. The following rule isn't generated in the same way as we've seen so far:

method-call-expression   = object @ non-postfix-expression "." identifier
                           "(" [ argument @ expression { _ @ "," argument @ expression } ] ")" ;

The arguments of the method call are constructed using an optional part and a nested repetition part. Normally we would expect that the generator would generate separate properties with an index, but that's not the case. The generator recognises the List pattern and rewrites it into an actual list:

Expression = new MethodCallExpression() {
  Object = new ConstructorExpression() {
    Size = null,
    ClassName = "Fac"
  },
  Identifier = "ComputeFac",
  Arguments = new List<Expression>() {
    new IntegerExpression() {
      Value = "10"
    }
  }
}

There is another instance of such a list, but it doesn't adhere to the pattern well enough for the generator to recognise it and rewrite it:

method-declaration = "public" return-type @ type method-name @ identifier
                     "(" [ type identifier { _ @ "," type identifier } ] ")" "{"
                       { var-declaration }
                       { statement }
                       "return" returns @ expression ";"
                     "}" ;

The combination of type and identifier is too complex, and this shows in the AST:

new MethodDeclaration() {
  ReturnType = new IntegerType() { },
  MethodName = "ComputeFac",
  Type0 = new IntegerType() { },
  Identifier0 = "num",
  Type1s = new List<Type>() {},
  Identifier1s = new List<string>() {},
  VarDeclarations = // ... ,
  Statements = // ... ,
  Returns = new IdentifierExpression() {
    Name = "num_aux"
  }
}

If we refactor the rule to look like this:

method-declaration = "public" return-type @ type method-name @ identifier
                     "(" [ parameter { _ @ "," parameter } ] ")" "{"
                       { var-declaration }
                       { statement }
                       "return" returns @ expression ";"
                     "}" ;

parameter          = type identifier ;

This turns the previous method-declaration into:

new MethodDeclaration() {
  ReturnType = new IntegerType() { },
  MethodName = "ComputeFac",
  Parameters = new List<Parameter>() {
    new Parameter() {
      Type = new IntegerType() { },
	   Identifier = "num"
	 }
  },
  VarDeclarations = // ... ,
  Statements = // ... ,
  Returns = new IdentifierExpression() {
    Name = "num_aux"
  }
}

The resulting grammar can be found in the MiniJava example folder.

The (Final) AST

All this refactoring now results in the following AST object model for the given example:

new Program() {
  MainClass = new MainClass() {
    ClassName = "Factorial",
    Args = "a",
    Statement = new PrintStatement() {
      HasLn = True,
      Expression = new MethodCallExpression() {
        Object = new ConstructorExpression() {
          Size = null,
          ClassName = "Fac"
        },
        Identifier = "ComputeFac",
        Arguments = new List<Expression>() {
          new IntegerExpression() { Value = "10" }
        }
      }
    }
  },
  ClassDeclarations = new List<ClassDeclaration>() {
    new ClassDeclaration() {
      ClassName = "Fac",
      Extends = null,
      VarDeclarations = new List<VarDeclaration>() {},
      MethodDeclarations = new List<MethodDeclaration>() {
        new MethodDeclaration() {
          ReturnType = new IntegerType() {},
          MethodName = "ComputeFac",
          Parameters = new List<Parameter>() {
            new Parameter() {
              Type = new IntegerType() {},
              Identifier = "num"
            }
          },
          VarDeclarations = new List<VarDeclaration>() {
            new VarDeclaration() {
              Type = new IntegerType() {},
              Identifier = "num_aux"
            }
          },
          Statements = new List<Statement>() {
            new IfStatement() {
              Condition = new LessThanExpression() {
                Lhs = new IdentifierExpression() { Name = "num"},
                Rhs = new IntegerExpression() { Value = "1" }
              },
              TrueStatement = new AssignmentStatement() {
                Variable = "num_aux",
                Index = null,
                Expression = new IntegerExpression() { Value = "1" }
              },
              FalseStatement = new AssignmentStatement() {
                Variable = "num_aux",
                Index = null,
                Expression = new TimesExpression() {
                  Lhs = new IdentifierExpression() { Name = "num"},
                  Rhs = new GroupExpression() {
                    Expression = new MethodCallExpression() {
                      Object = new ThisExpression() { },
                      Identifier = "ComputeFac",
                      Arguments = new List<Expression>() {
                        new AdditiveExpression() {
                          Lhs = new IdentifierExpression() { Name = "num"},
                          Operator = "-",
                          Rhs = new IntegerExpression() { Value = "1" }
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          Returns = new IdentifierExpression() { Name = "num_aux" }
        }
      }
    }
  }
}

And although some names or constructions might be modelled differently, this example shows how some often encountered patterns in grammar design can be constructed in the right way for HPG to produce a reasonable human parser that produces a reasonable human AST.

Have fun!