### Generic Lexer ###

The generic lexer aims to solve the performance issues of the Regex Lexer. The idea is to start from a limited set of classical lexemes and to refine this set to fit your needs. These lexemes are recognized by a finite state machine, which is far more efficient than looping through a set of regexes.

- [Generic Lexer](#generic-lexer)
  - [Lexer configuration](#lexer-configuration)
  - [Basic lexemes](#basic-lexemes)
  - [Strings](#strings)
  - [Comments](#comments)

#### Lexer configuration

The lexer can be configured with a `[Lexer]` attribute, available from version 2.4.0.6. The `[Lexer]` attribute has several properties:

- `IgnoreWS`: Ignore whitespace characters. If `false`, any whitespace occurring in the lexed text must be explicitly handled in the lexer. Default is `true`.
- `IgnoreEOL`: Ignore end of line characters. If `false`, any end of line characters occurring in the lexed text must be explicitly handled in the lexer. Default is `true`.
- `WhiteSpace`: An array of characters that are considered whitespace if `IgnoreWS` is `true`. Default is `' '` (space) and `'\t'` (tab).
- `KeyWordIgnoreCase`: If `true`, any keywords (`[Lexeme(GenericToken.KeyWord, "...")]`) are matched ignoring case. That is, the keyword `if` also matches `IF`, `If`, etc. Default is `false`.

#### Basic lexemes

The basic lexemes are:

- `GenericToken.Identifier`: An identifier. From version 2.0.3, `Identifier` accepts an extra parameter to specify an identifier pattern:
  - `IdentifierType.Alpha`: Only alpha characters (default value, and the only pattern available before version 2.0.3).
  - `IdentifierType.AlphaNum`: Starts with an alpha character, followed by alpha or numeric characters.
  - `IdentifierType.AlphaNumDash`: Starts with an alpha or '\_' (underscore) character, followed by alphanumeric, '-' (minus) or '\_' (underscore) characters.
  - `IdentifierType.Custom`: Accepts two parameters: the _starting_ character pattern and the _rest_ character pattern.
    The pattern string contains '_c_' (allowed char) and '_l-u_' (allowed char range). If '-' (dash) should be an allowed character, it must be the first character in the pattern. An example that duplicates `IdentifierType.AlphaNumDash` is `[Lexeme(GenericToken.Identifier, IdentifierType.Custom, "_A-Za-z", "-_0-9A-Za-z")]`. _(From version 2.4.0.6)_
- `GenericToken.String`: A classical string delimited by double quotes (`"`). See below for more details.
- `GenericToken.Int`: An integer (i.e. a series of one or more digits).
- `GenericToken.Double`: A floating-point number. The decimal separator can be specified; the default is dot '.'.
- `GenericToken.Hexa`: A hexadecimal number. Hexadecimal numbers are denoted with a configurable prefix (default is `0x`). ⚠️ Beware that a badly chosen prefix can make a hexadecimal number match the identifier lexeme (if used), leading to conflicts and lexing errors.
- `GenericToken.Date`: A date. Needs 2 parameters:
  - the format: either `DateFormat.YYYYMMDD` or `DateFormat.DDMMYYYY` (default)
  - the separator char used to separate year, month and day (default is '-')
- `GenericToken.KeyWord`: A keyword is an identifier with a special meaning (it comes with the same constraints as `GenericToken.Identifier`). Here again, performance comes at the price of less flexibility. This lexeme is configurable.
- `GenericToken.SugarToken`: A general purpose lexeme with no special constraint except that it must not start with an alpha character. This lexeme is configurable.
- `GenericToken.UpTo`: Matches everything until one of the given patterns is found. For instance `Lexeme(GenericToken.UpTo,"<","{")` matches all characters until `<` or `{`.

To build a generic lexer `Lexeme` attribute, there are 2 different constructors:

- static generic lexeme: this constructor allows a 1 to 1 mapping between a generic token and your lexer token.
  It takes a single parameter, the mapped generic token: ```[Lexeme(GenericToken.String)]``` (static lexemes are String, Int, Double and Identifier)
- configurable lexemes (KeyWord and SugarToken). This constructor takes 2 parameters:
  * the mapped GenericToken
  * the value of the keyword or sugar token.

There are also short code attributes for each basic lexeme type:

- `GenericToken.Identifier`:
  - `IdentifierType.Alpha`: ```[AlphaId]```
  - `IdentifierType.AlphaNum`: ```[AlphaNumId]```
  - `IdentifierType.AlphaNumDash`: ```[AlphaNumDashId]```
  - `IdentifierType.Custom`: ```[CustomId(startingPattern, endingPattern)]```
- `GenericToken.String`: ```[String(delimiterChar, escapeChar)]```
- `GenericToken.Int`: ```[Int]```
- `GenericToken.Double`: ```[Double]```
- `GenericToken.Hexa`: ```[Hexa]```
- `GenericToken.Date`: ```[Date]```
- `GenericToken.KeyWord`: ```[Keyword(pattern)]```
- `GenericToken.SugarToken`: ```[Sugar(pattern)]```. ⚠️ A sugar token cannot start like a valid identifier (that is, with a letter, `_` or `-`).
- `GenericToken.UpTo`: `[UpTo(string pattern1, pattern2 .... patternn)]`

#### Strings

String lexeme definitions take 2 parameters:

* a string delimiter char. Default is `"` (double quote).
* an escape char that allows the delimiter char to be used inside a string. Default is `\` (backslash). Using the same char for delimiter and escape is allowed.

*examples*

```c#
// matches 'hello \' world' => 'hello ' world'
[Lexeme(GenericToken.String,"'","\\")]
STRING
```

or

```c#
// matches 'that''s my hello world' => 'that's my hello world'
[Lexeme(GenericToken.String,"'","'")]
STRING
```

*Many string patterns*

Several string patterns are allowed in the same lexer. For instance, you may want to match double-quote delimited strings as well as single-quote delimited strings.
For this, simply apply several lexeme attributes to the same enum value:

```c#
// matches 'hello \' world' => 'hello ' world'
// as well as "hello \" world" => "hello " world"
[Lexeme(GenericToken.String,"\"","\\")]
[Lexeme(GenericToken.String,"'","\\")]
STRING
```

#### Comments

The generic lexer supports comments. Comments are removed from the token stream before parsing starts, so the parser never sees them. You can still retrieve them, for any special purpose, by using the lexer directly.

*Comment declaration*

Comments are declared with dedicated attributes, placed on an enum value, that define the comment delimiters:

```c#
[Comment(singleline, multilinestart, multilineend)]
COMMENT,
```

* singleline : the single line comment delimiter (`"//"` in all C derived languages)
* multilinestart : the opening multi line comment delimiter (`"/*"` in all C derived languages)
* multilineend : the closing multi line comment delimiter (`"*/"` in all C derived languages)

```c#
[SingleLineComment(singleline)]
SINGLE_LINE_COMMENT,
```

* singleline : the single line comment delimiter (`"//"` in all C derived languages)

```c#
[MultiLineComment(multilinestart, multilineend)]
MULTI_LINE_COMMENT,
```

* multilinestart : the opening multi line comment delimiter (`"/*"` in all C derived languages)
* multilineend : the closing multi line comment delimiter (`"*/"` in all C derived languages)

### Full example, for a dumb language (generic token based) ###

```c#
public enum WhileTokenGeneric
{
    #region keywords 0 -> 19

    [Lexeme(GenericToken.KeyWord, "if")] IF = 1,
    [Lexeme(GenericToken.KeyWord, "then")] THEN = 2,
    [Lexeme(GenericToken.KeyWord, "else")] ELSE = 3,
    [Lexeme(GenericToken.KeyWord, "while")] WHILE = 4,
    [Lexeme(GenericToken.KeyWord, "do")] DO = 5,
    [Lexeme(GenericToken.KeyWord, "skip")] SKIP = 6,
    [Lexeme(GenericToken.KeyWord, "true")] TRUE = 7,
    [Lexeme(GenericToken.KeyWord, "false")] FALSE = 8,
    [Lexeme(GenericToken.KeyWord, "not")] NOT = 9,
    [Lexeme(GenericToken.KeyWord, "and")] AND = 10,
    [Lexeme(GenericToken.KeyWord, "or")] OR = 11,
    [Lexeme(GenericToken.KeyWord, "print")] PRINT = 12,

    #endregion

    #region literals 20 -> 29

    // identifier with IdentifierType.AlphaNumDash pattern
    [Lexeme(GenericToken.Identifier, IdentifierType.AlphaNumDash)] IDENTIFIER = 20,
    [Lexeme(GenericToken.String)] STRING = 21,
    [Lexeme(GenericToken.Int)] INT = 22,

    #endregion

    #region operators 30 -> 49

    [Lexeme(GenericToken.SugarToken, ">")] GREATER = 30,
    [Lexeme(GenericToken.SugarToken, "<")] LESSER = 31,
    [Lexeme(GenericToken.SugarToken, "==")] EQUALS = 32,
    [Lexeme(GenericToken.SugarToken, "!=")] DIFFERENT = 33,
    [Lexeme(GenericToken.SugarToken, ".")] CONCAT = 34,
    [Lexeme(GenericToken.SugarToken, ":=")] ASSIGN = 35,
    [Lexeme(GenericToken.SugarToken, "+")] PLUS = 36,
    [Lexeme(GenericToken.SugarToken, "-")] MINUS = 37,
    [Lexeme(GenericToken.SugarToken, "*")] TIMES = 38,
    [Lexeme(GenericToken.SugarToken, "/")] DIVIDE = 39,

    #endregion

    #region sugar 50 -> 99

    [Lexeme(GenericToken.SugarToken, "(")] LPAREN = 50,
    [Lexeme(GenericToken.SugarToken, ")")] RPAREN = 51,
    [Lexeme(GenericToken.SugarToken, ";")] SEMICOLON = 52,

    #endregion

    #region comments : C like comments

    [Comment("//", "/*", "*/")] COMMENTS = 100,

    #endregion

    EOF = 0
}
```

The same using short code attributes:

```c#
public enum ShortWhileTokenGeneric
{
    #region keywords 0 -> 19

    [Keyword("IF")] [Keyword("if")] IF = 1,
    [Keyword("THEN")] [Keyword("then")] THEN = 2,
    [Keyword("ELSE")] [Keyword("else")] ELSE = 3,
    [Keyword("WHILE")] [Keyword("while")] WHILE = 4,
    [Keyword("DO")] [Keyword("do")] DO = 5,
    [Keyword("SKIP")] [Keyword("skip")] SKIP = 6,
    [Keyword("TRUE")] [Keyword("true")] TRUE = 7,
    [Keyword("FALSE")] [Keyword("false")] FALSE = 8,
    [Keyword("NOT")] [Keyword("not")] NOT = 9,
    [Keyword("AND")] [Keyword("and")] AND = 10,
    [Keyword("OR")] [Keyword("or")] OR = 11,
    [Keyword("PRINT")] [Keyword("print")] PRINT = 12,

    #endregion

    #region literals 20 -> 29

    [AlphaId] IDENTIFIER = 20,
    [String] STRING = 21,
    [Int] INT = 22,

    #endregion

    #region operators 30 -> 49

    [Sugar(">")] GREATER = 30,
    [Sugar("<")] LESSER = 31,
    [Sugar("==")] EQUALS = 32,
    [Sugar("!=")] DIFFERENT = 33,
    [Sugar(".")] CONCAT = 34,
    [Sugar(":=")] ASSIGN = 35,
    [Sugar("+")] PLUS = 36,
    [Sugar("-")] MINUS = 37,
    [Sugar("*")] TIMES = 38,
    [Sugar("/")] DIVIDE = 39,

    #endregion

    #region sugar 50 -> 99

    [Sugar("(")] LPAREN = 50,
    [Sugar(")")] RPAREN = 51,
    [Sugar(";")] SEMICOLON = 52,

    EOF = 0

    #endregion
}
```
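
The short code example declares every keyword twice, once in upper case and once in lower case. The `KeyWordIgnoreCase` property of the `[Lexer]` attribute, described under Lexer configuration, might make that duplication unnecessary. The following is only a minimal sketch under stated assumptions: the enum name `CaseInsensitiveWhileToken` is hypothetical, the `sly.lexer` namespace in the `using` directive is assumed, and so is the named-argument syntax used to set the `[Lexer]` properties.

```c#
using sly.lexer; // assumed namespace for the lexer attributes

// Hypothetical enum, for illustration only.
// Assumption: [Lexer] properties are set with named-argument syntax.
[Lexer(IgnoreWS = true, IgnoreEOL = true, KeyWordIgnoreCase = true)]
public enum CaseInsensitiveWhileToken
{
    EOF = 0,

    // With KeyWordIgnoreCase = true, "if" is expected to also match "IF", "If", etc.,
    // so a single [Keyword] attribute per keyword should be enough.
    [Keyword("if")] IF = 1,
    [Keyword("then")] THEN = 2,
    [Keyword("else")] ELSE = 3,

    [AlphaId] IDENTIFIER = 20,
    [String] STRING = 21,
    [Int] INT = 22,

    [Sugar(":=")] ASSIGN = 35,
    [Sugar(";")] SEMICOLON = 52
}
```

`IgnoreWS` and `IgnoreEOL` are set to their documented default values here only to make the configuration explicit; they could be left out.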