Skip to content

Implementing a BNF parser

Artyom V. Gorchakov edited this page Aug 27, 2023 · 15 revisions

General approach

Concrete Syntax Tree (CST) generation

Per the grammar production rules, csly will generate a concrete syntax tree. In the previous section, we explained how syntax trees are defined in csly using the SyntaxNode type. Nodes can be terminal or non-terminal. Terminal nodes are referred to as leaves.

Grammar definition and CST traversal

Grammar rules and visitor methods are defined in a single C# class that you provide, the expression parser. Each method that is defined in the class is associated with a production rule. See grammar definition below for further details.

Inspecting the CST

You can retrieve the whole CST in a number of ways. Once the CST is built, csly provides a way to visit each node programmatically using LINQ syntax. One way to examine the whole tree is by calling the SyntaxTree property from the ParseResult object and examining the object with your debugger. SyntaxTree is implemented as a single root SyntaxNode which can be traversed by iterating the Children property. It is also possible to generate a graphical tree representation with a number of tools. See View syntax tree for an example using VizGraph.

Grammar definition

The grammar defining the parser is defined using the C# attribute [Production("some grammar rule")] mapped to methods on your expression parser.

The rules follow the classical BNF notation:

<rule_name> : <terminal or non-terminal> <some other terminal or non-terminal> ...

A terminal rule name on the attribute string must exactly match the enum value of your expression tokens. Following C# conventions, the rules are case-sensitive. If the tokens don't match those on the attribute string, or if the parser cannot be generated, the builder.BuildParser(...) method will return null and a NullReferenceException will be thrown on the call to parser.Parse(...).

Each method takes as many parameters as rule clauses. Each parameter can be typed according to the return value of the clause:

  • for a terminal: the Token<T> corresponding to the token,
  • for a non-terminal: the result of the evaluation of the non-terminal, i.e., the value returned by the matching static method. As the parser output is typed <TOut>, as seen in the previous section, the result of an evaluation for a non-terminal is necessarily of type TOut.

Example: Revisiting the mathematical expression evaluator

In this section, we continue working on our expression parser from the Getting started section. We'd like to add parentheses and the subtraction operation to our expression parser. For simplicity, in this example we'll continue using whole numbers. We'll make modifications to our expression parser as follows. For a complete working demo, see the expression parser sample in the repo. Replace your expression parser and token enum with the following code:

public enum ExpressionToken
{
    [Lexeme("[0-9]+")]
    INT = 1,

    [Lexeme("\\+")]
    PLUS = 2,

    [Lexeme("\\-")]
    MINUS = 3,

    [Lexeme("\\(")]
    LPAREN = 4,

    [Lexeme("\\)")]
    RPAREN = 5,

    [Lexeme("[ \\t]+", isSkippable: true)] 
    WS = 6,

}

public class ExpressionParser
{
    [Production("primary: INT")]
    public int Primary(Token<ExpressionToken> intToken)
    {
        return intToken.IntValue;
    }

    [Production("primary: LPAREN [d] expression RPAREN [d]")]
    public int Group(int groupValue)
    {
        return groupValue;
    }

    [Production("expression: term PLUS expression")]
    [Production("expression: term MINUS expression")]
    public int Expression(int left, Token<ExpressionToken> operatorToken, int right)
    {
        var result = 0;

        switch (operatorToken.TokenID)
        {
            case ExpressionToken.PLUS:
                {
                    result = left + right;
                    break;
                }
            case ExpressionToken.MINUS:
                {
                    result = left - right;
                    break;
                }
        }

        return result;
    }

    [Production("expression: term")]
    public int Expression_Term(int termValue)
    {
        return termValue;
    }

    [Production("term: factor")]
    public int Term_Factor(int factorValue)
    {
        return factorValue;
    }

    [Production("factor: primary")]
    public int primaryFactor(int primValue)
    {
        return primValue;
    }

    [Production("factor: MINUS factor")]
    public int Factor(Token<ExpressionToken> discardedMinus, int factorValue)
    {
        return - factorValue;
    }
    
}

Note that we've added a few more tokens for the right and left parenthesis ( and ) as well as a new group method and multiple Production decorators to the Expression(...) method so that it will handle both addition and subtraction. The grammar rules are implemented with the [Production("")] attribute.

Building and using the parser

The code to build and use the parser has not changed. The parser instance is obtained by calling ParserBuilder.BuildParser(...). The builder method takes 3 parameters:

  1. an instance of the class containing the lexer and parser definitions,
  2. the kind of parser to be used,
  3. the root rule for the parser.
Parser<ExpressionToken,int> parser = null;

ExpressionParser expressionParserDefinition = new ExpressionParser()
// here see the typing :
//  ExpressionToken is the token enum type
//  int is the type of a parse evaluation
BuildResult<Parser<WhileToken, WhileAST>> ParserResult = ParserBuilder.BuildParser<ExpressionToken,int>(expressionParserDefinition,
                                                                            ParserType.LL_RECURSIVE_DESCENT,
                                                                            "expression");
if (parserResult.IsOk) {
    // everythin'fine : we have a configured parser
    parser = parserResult.Result;
}
else {
    // something's wrong
    foreach(var error in parserResult.Errors) {
        Console.WriteLine($"{error.Code} : {error.Message}");
    }    
}

then calling

parser.Parse("some source code")

will return the evaluation of the syntax tree. the parser returns a ParseResult instance containing the evaluation value or a list of errors.

	string expression = "1 + 1";

    ParseResult<ExpressionToken> r = Parser.Parse(expression);


    if (!r.IsError && r.Result != null && r.Result is int)
    {
        Console.WriteLine($"result of {expression}  is {(int)r.Result}");
    }
    else
    {
        if (r.Errors !=null && r.Errors.Any())
        {
        	// display errors
            r.Errors.ForEach(error => Console.WriteLine(error.ErrorMessage));
        }
    }

Currently, only a recursive descent parser is available in csly. The implementation is limited to an LL grammar with no left recursion. There are two possible LL grammar choices:

  • ParserType.LL_RECURSIVE_DESCENT: a BNF notation grammar parser,
  • ParserType.EBNF_LL_RECURSIVE_DESCENT: a EBNF notation grammar parser.

The EBNF notation in the second case provides the additional multiplier notations * and +. The generated parser is typed according to the token type and parser return type. When calling the ParserBuilder.BuildParser() method you get:

  • a flag stating if parser has correctly been built todo: confirm that this functionality is not available. If the parser build fails a null parser is returned.
  • a parser instance if the generation was successful,
  • a list of errors if something went bad. Errors contain
    • an error code : seed error code,
    • an error message.

Using a context during syntax-tree traversal

When traversing a syntax tree, the parser sometimes uses a symbolic context. As an example, imagine an expression parser allowing the use of variables, which can be declared and initialized before parsing. In the case of a visitor with context, an additional parameter is passed at the end of each visitor method. The Parser.ParseWithContext(source, context) method is thread-safe and ensures that the correct context will be parsed.

The parse-with-context provision ensures the parser will only allow one thread to access the state-dependent implementation in the visitor method at any one time. See below for example of parsing using a parsing context. For more information on thread-safety and concurrency see Joe Albahari or Jon Skeet's excellent articles on thread-safe C# implementations.

example of parsing using a parsing context

A parse using a context is started using the ParseWithContext of the Parser. the context can be of any type. the example below extends the expression parser described above to allow the use of variable valued in a dictionary context.

// --------------------------------------------------
//       the additional production rule for  a variable
// --------------------------------------------------

  [Production("primary_value : IDENTIFIER")]
        public int OperandVariable(Token<ExpressionToken> identifier,Dictionary<string,int> context)
        {
            if (context.ContainsKey(identifier.Value))
            {
                return context[identifier.Value];
            }
            else
            {
                return 0;
            }
        }

// --------------------------------------------------
//       calling a parse
// --------------------------------------------------

var res = parser.ParseWithContext("2 + a", new Dictionary<string, int> {{"a", 2}}); // will return 2

access lexer and parsers

One build a parser expose :

  • a main API through the Parse(string content) method (chain lexical analysis, syntax parsing and finally call your parsing methods)

  • the lexer through the Lexer property

  • the syntax parser through the SyntaxParser property (which type is a ISyntaxParser)

Defining your parser ⬅️ Implementing a BNF Parser ➡️ EBNF-parser