Generic lexer, WIP.
Run
npm install gelex
Get the library reference
const gelex = require('gelex')
Create a lexer definition
const def = gelex.definition()
Define a rule, with token name and expression:
def.define('integer', '[0123456789][0123456789]*');
The text between [
and ]
describe optional characters.
The asterisc *
indicates zero or more occurrences
Optional characters could be defined in ranges using -
:
def.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');
Define a delimited text (a string):
def.defineText('string', '"', '"');
The second argument is the starting text delimiter and the third argument is the ending text delimiter.
Escaped characters could be optionally defined:
def.defineText('string', '"', '"',
{
escape: '\\',
escaped: { 'n': '\n', 'r': '\r', 't': '\t' }
}
);
The escaped
field is a map from mapped character and its final
representation. A escaped character not included in this map
is mapped to itself, ie: an escaped double quote is mapped
to a double quote in the above definition.
Define many rules in one, using an array:
def.define('delimiter', [ '{', '}', ',', ';' ]);
def.define('operator', [ '+', '-', '*', '/', '==', '===', '**', '^', '!', '|', '||', '&', '&&' ]);
It is equivalent to define each rule:
def.define('delimiter', '{' );
def.define('delimiter', '}' );
...
Matching any character using [.]
:
def.define('anychar', '[.]');
Define a rule with transform
def.define('symbol', '#[a-zA-Z_][a-zA-Z_]*', trfn);
where trfn
is a function that receives the scanned text value and returns
another value. Example:
def.define('symbol', '#[a-zA-Z_][a-zA-Z_]*',
function (value) {
return value.substring(1); // removing initial #
});
Define a custom rule
def.define('character', rule);
where rule
is an object with two functions:
first()
: returns the initial characters for the rulematch(scanner)
: returns null if no match, or the detected value as string if there is a match.
Example, a rule detecting $a
as the character a
(as in Smalltalk):
function CharacterRule(ch) {
this.first = function () { return ch; };
this.match = function (scanner) {
if (scanner.peek() !== ch)
return null;
scanner.scan();
return scanner.scan();
};
}
Define a comment
def.defineComment('/*', '*/');
The first argument is the starting text delimiter. The second argument is the ending text delimiter. Current version does not support nested comments, yet.
A comment is processed like an space character.
Define a line comment, giving only one argument:
def.defineComment('//');
Create and use a lexer:
const lexer = def.lexer();
const token = lexer.next();
Each token is retrieved in order invoking lexer next
function.
It returns null
when the tokens are exhausted.
Each token is an object with fields:
type
: the token type name, defined using thedefine
function; ieinteger
.value
: the string value of the tokenbegin
: start position in input textend
: end position in input text
Example:
const gelex = require('../..');
const def = gelex.definition();
def.define('integer', '[0123456789][0123456789]*');
def.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');
def.define('delimiter', [ '{', '}', ',', ';' ]);
def.define('operator', [ '+', '-', '*', '/', '==', '===', '**', '^', '!', '|', '||', '&', '&&' ]);
def.defineText('string', "'", "'");
def.defineText('string', '"', '"');
const lexer = def.lexer('1 2 42 foo bar + * {},===== "foo" "bar"');
let token;
while (token = lexer.next())
console.dir(token);
Expected output:
Expected output:
{ type: 'unknown', value: '1', begin: 0, end: 0 }
{ type: 'unknown', value: '2', begin: 2, end: 2 }
{ type: 'integer', value: '42', begin: 4, end: 5 }
{ type: 'name', value: 'foo', begin: 7, end: 9 }
{ type: 'name', value: 'bar', begin: 11, end: 13 }
{ type: 'operator', value: '+', begin: 15, end: 15 }
{ type: 'operator', value: '*', begin: 17, end: 17 }
{ type: 'delimiter', value: '{', begin: 19, end: 19 }
{ type: 'delimiter', value: '}', begin: 20, end: 20 }
{ type: 'delimiter', value: ',', begin: 21, end: 21 }
{ type: 'operator', value: '===', begin: 22, end: 24 }
{ type: 'operator', value: '==', begin: 25, end: 26 }
{ type: 'string', value: 'foo', begin: 28, end: 32 }
{ type: 'string', value: 'bar', begin: 34, end: 38 }
- Version 0.0.1, first version.
- Version 0.0.2, fixing ManyRule.
- Version 0.0.3, detecting unclosed strings, match any character rule.
- Version 0.0.4, custom rule
- Version 0.0.4, custom rule
- Version 0.0.5, transform function in define
- Version 0.0.6, char function in Lexer
- Version 0.0.7, seek function in Lexer
TBD
- Support nested comments
- Detect unclosed comments
- Programming language sample
MIT
Feel free to file issues and submit pull requests — contributions are welcome.
If you submit a pull request, please be sure to add or update corresponding
test cases, and ensure that npm test
continues to pass.