
GenericLexerExtension

olivier edited this page Mar 8, 2018 · 15 revisions

Generic Lexer Extensions

For performance reasons, the generic lexer restricts the lexeme definitions it supports. Nevertheless, an extension mechanism makes it possible to add new lexeme patterns by relying on the FSM that backs the generic lexer. To extend a lexer, we add transitions and nodes to the underlying FSM.

New generic tokens are denoted with the specific GenericToken.Extension value. For instance, we can define a date lexeme like this:

public enum Extensions {
	[Lexeme(GenericToken.Extension)] 
	DATE,

	[Lexeme(GenericToken.Double)] 
	DOUBLE,
}

building extensions

The extensions are built in a callback function that is called for every token tagged with the GenericToken.Extension value. The callback signature is the following delegate:

 
public delegate void BuildExtension<IN>( IN token, LexemeAttribute lexem,  GenericLexer<IN> lexer) where IN : struct;

where :

  • IN is the enum name used for the lexer (Extensions above)
  • token is the token to build the extension for
  • lexem is the lexeme attribute: it gives access to optional parameters for the lexeme through lexem.GenericTokenParameters (string[])
  • lexer is the lexer.

predefined FSM

Here is the predefined FSM that matches all standard tokens (ID, keyword, string, double and sugar):

[Figure: predefined FSM]

As many string lexemes may be used, all the nodes are suffixed with a number.

adding states and transitions

The general form of a transition is:

[Figure: general form of a transition]

A transition goes from a node "start" to a node "end". Here the character '.' is the one that triggers the transition. A precondition (a predicate of type Func<string, bool>) refines the transition check: it tests whether the value parsed so far matches some condition. Examples are shown below.
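To make the role of the precondition concrete, here is a minimal sketch that models a single transition check in isolation. It does not use the lexer API: CanFollow and atLeastTwoChars are hypothetical names introduced only for illustration, and the real FSM may evaluate transitions differently.

```csharp
using System;

public static class PreconditionDemo
{
    // Hypothetical model of one transition check (not the library's code):
    // the current character must match the expected one and, when a
    // precondition is supplied, the text parsed so far must satisfy it.
    public static bool CanFollow(char expected, char current, string parsedSoFar,
                                 Func<string, bool> precondition = null)
    {
        return current == expected
            && (precondition == null || precondition(parsedSoFar));
    }

    public static void Main()
    {
        // Illustrative precondition: at least two characters already parsed.
        Func<string, bool> atLeastTwoChars = value => value.Length >= 2;

        Console.WriteLine(CanFollow('.', '.', "20", atLeastTwoChars)); // True
        Console.WriteLine(CanFollow('.', '.', "2", atLeastTwoChars));  // False: precondition rejects
        Console.WriteLine(CanFollow('.', 'x', "20", atLeastTwoChars)); // False: wrong character
    }
}
```

The point of the sketch is that the character test and the precondition are combined: a matching character alone is not enough when a precondition is attached to the transition.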

Transitions and nodes are added to the FSM through an FSMBuilder. In the BuildExtension callback, the FSMBuilder is retrieved through the lexer:

var fsmBuilder = lexer.FSMBuilder;

Then FSMBuilder exposes a fluent API to create nodes and transitions.

Moving to a node:

  • GoTo(string nodeName) : move to a state.

Adding transitions:

  • Transition(char token, Func<string,bool> precondition) : move to a new state using a given char and an optional precondition.
  • RangeTransition(char startingToken, char endingToken, Func<string,bool> precondition) : move to a new state when the char is between startingToken and endingToken, from 'a' to 'z' for example. An optional precondition may be used.
  • ExceptTransition(char rejectToken, Func<string,bool> precondition) : move to a new state when the char is different from rejectToken. An optional precondition may be used.
  • AnyTransition(Func<string,bool> precondition) : the unconditional transition. An optional precondition may be used.

Node actions: some information may be added to nodes:

  • Mark(string name) : give a name to the current node. This name can be used later with the GoTo function.
  • End(GenericToken genericToken) : mark the node as an ending node for the given GenericToken value. Reaching such a node means a token has been matched. genericToken must be GenericToken.Extension when extending a GenericLexer.
  • CallBack(Func<FSMMatch<IN>,FSMMatch<IN>>) : on an ending node (see above), set a callback that can add the targeted token (from your lexer enum) to the resulting match.

Building the extended lexer

The extended lexer is built exactly the same way as a generic lexer. The only difference is the BuildExtension callback passed as a parameter.

BuildExtension<Extensions> extensionBuilder = (Extensions token, LexemeAttribute lexem, GenericLexer<Extensions> lexer) => {
	if (token == Extensions.DATE) {
		// do some fsm modifications here
	}
};

var lexerRes = LexerBuilder.BuildLexer<Extensions>(new BuildResult<ILexer<Extensions>>(), extensionBuilder);

Example

This example demonstrates how to extend a generic lexer to match a date lexeme.

public enum Extensions {
   [Lexeme(GenericToken.Extension)] 
   DATE,

   [Lexeme(GenericToken.Double)] 
   DOUBLE,
}

The date lexeme format is 'dd.mm.yyyy'. This format overlaps the double lexeme, as "20.02.2018" could be interpreted as

  • 20 February 2018
  • or the double value 20.02 followed by ".2018", which fails to match any other lexeme
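
The second reading can be illustrated with a small standalone sketch. DoublePrefixLength is a hypothetical helper, not part of the generic lexer; it only assumes that a double is read greedily as digits, a dot, then digits.

```csharp
using System;

public static class DoubleOverlapDemo
{
    // Hypothetical helper: length of the longest "digits[.digits]" prefix.
    public static int DoublePrefixLength(string input)
    {
        int i = 0;
        // integer part
        while (i < input.Length && char.IsDigit(input[i])) i++;
        if (i == 0) return 0;

        // optional fractional part: a dot followed by at least one digit
        if (i + 1 < input.Length && input[i] == '.' && char.IsDigit(input[i + 1]))
        {
            i++;
            while (i < input.Length && char.IsDigit(input[i])) i++;
        }
        return i;
    }

    public static void Main()
    {
        const string source = "20.02.2018";
        int length = DoublePrefixLength(source);

        Console.WriteLine(source.Substring(0, length)); // 20.02
        Console.WriteLine(source.Substring(length));    // .2018
    }
}
```

A greedy double reader stops at the second '.', which is exactly the position where the extension has to take over to recognize a date.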

Extended FSM

Before moving from in_double to in_date on the char '.', we check whether the value parsed so far matches the 'dd.mm' pattern using the checkDate precondition.

public static void AddExtension(Extensions token, LexemeAttribute lexem, GenericLexer<Extensions> lexer) {
   if (token == Extensions.DATE) {

      // precondition: the value parsed so far must match the dd.mm format
      Func<string, bool> checkDate = (string value) => {
         bool ok = false;
         if (value.Length == 5) {
            ok = char.IsDigit(value[0]);
            ok = ok && char.IsDigit(value[1]);
            ok = ok && value[2] == '.';
            ok = ok && char.IsDigit(value[3]);
            ok = ok && char.IsDigit(value[4]);
         }
         return ok;
      };

      // callback on the ending date node
      NodeCallback<GenericToken> callback = (FSMMatch<GenericToken> match) =>
      {
         // store the token id in the FSMMatch object so that GenericLexer.Tokenize can return it later
         match.Properties[GenericLexer<Extensions>.DerivedToken] = Extensions.DATE;
         return match;
      };

      var fsmBuilder = lexer.FSMBuilder;

      fsmBuilder.GoTo(GenericLexer<Extensions>.in_double) // start from the in_double node
         .Transition('.', checkDate) // add a transition on '.' guarded by the precondition
         .Mark("start_date") // set the node name
         .RangeTransition('0','9') // first year digit
         .Mark("y1")
         .RangeTransition('0','9') // second year digit
         .Mark("y2")
         .RangeTransition('0','9') // third year digit
         .Mark("y3")
         .RangeTransition('0','9') // fourth year digit
         .Mark("y4")
         .End(GenericToken.Extension) // mark as ending node
         .CallBack(callback); // set the ending callback
   }
}