Gelex

Generic lexer, WIP.

Install

Run

npm install gelex

Use

Get the library reference:

const gelex = require('gelex')

Create a lexer definition:

const def = gelex.definition()

Define a rule with a token name and an expression:

def.define('integer', '[0123456789][0123456789]*');

The text between [ and ] describes a set of accepted characters: any single character in the set matches.

The asterisk * indicates zero or more occurrences.

Character sets can also be defined using ranges with -:

def.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');
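
For instance, a lexer built from these two rules should report digits as integer tokens and identifiers as name tokens. A minimal sketch (the token fields shown in the comments follow the output format documented below):

const gelex = require('gelex');

const def = gelex.definition();
def.define('integer', '[0123456789][0123456789]*');
def.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');

const lexer = def.lexer('answer 42');

console.dir(lexer.next()); // expected: { type: 'name', value: 'answer', ... }
console.dir(lexer.next()); // expected: { type: 'integer', value: '42', ... }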

Define a delimited text (a string):

def.defineText('string', '"', '"');

The second argument is the starting text delimiter and the third argument is the ending text delimiter.

Escaped characters can optionally be defined:

def.defineText('string', '"', '"',
    {
        escape: '\\',
        escaped: { 'n': '\n', 'r': '\r', 't': '\t' }
    }
);

The escaped field is a map from an escaped character to its final representation. An escaped character not included in this map is mapped to itself, i.e. an escaped double quote is mapped to a double quote in the above definition.
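
As a sketch, with the definition above an escaped n in the input should become a real newline in the token value:

const lexer = def.lexer('"foo\\nbar"');

const token = lexer.next();
// expected: token.type === 'string'
// expected: token.value === 'foo\nbar' (with a real newline character)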

Define many rules at once, using an array:

def.define('delimiter', [ '{', '}', ',', ';' ]);
def.define('operator', [ '+', '-', '*', '/', '==', '===', '**', '^', '!', '|', '||', '&', '&&' ]);

It is equivalent to defining each rule separately:

def.define('delimiter', '{' );
def.define('delimiter', '}' );
...

Match any character using [.]:

def.define('anychar', '[.]');
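
This can be used as a catch-all rule. A hedged sketch (assuming positions are reported as in the example further below):

def.define('anychar', '[.]');

const lexer = def.lexer('@');

console.dir(lexer.next()); // expected: { type: 'anychar', value: '@', begin: 0, end: 0 }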

Define a rule with a transform function:

def.define('symbol', '#[a-zA-Z_][a-zA-Z_]*', trfn);

where trfn is a function that receives the scanned text value and returns another value. Example:

def.define('symbol', '#[a-zA-Z_][a-zA-Z_]*', 
    function (value) {
        return value.substring(1); // removing initial #
    });
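
A minimal sketch of the expected result:

const lexer = def.lexer('#foo');

const token = lexer.next();
// expected: token.type === 'symbol'
// expected: token.value === 'foo' (the leading # removed by the transform)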

Define a custom rule:

def.define('character', rule);

where rule is an object with two functions:

  • first(): returns the initial characters for the rule
  • match(scanner): returns null if there is no match, or the detected value as a string if there is a match.

For example, a rule detecting $a as the character a (as in Smalltalk):

function CharacterRule(ch) {
    // first character that can start a match for this rule
    this.first = function () { return ch; };
    
    this.match = function (scanner) {
        if (scanner.peek() !== ch)
            return null;
        
        // consume the prefix character (e.g. '$')
        scanner.scan();
        
        // return the following character as the token value
        return scanner.scan();
    };
}
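
A hedged usage sketch, registering the rule and lexing the character literal $a:

def.define('character', new CharacterRule('$'));

const lexer = def.lexer('$a');

const token = lexer.next();
// expected: token.type === 'character'
// expected: token.value === 'a'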

Define a comment:

def.defineComment('/*', '*/');

The first argument is the starting comment delimiter and the second argument is the ending comment delimiter. The current version does not yet support nested comments.

A comment is processed like a space character.

Define a line comment, giving only one argument:

def.defineComment('//');
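
Because comments are treated as whitespace, the lexer simply skips them. A minimal sketch, assuming a name rule as defined above:

def.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');
def.defineComment('/*', '*/');
def.defineComment('//');

const lexer = def.lexer('foo /* block comment */ bar // line comment');

console.dir(lexer.next()); // expected: { type: 'name', value: 'foo', ... }
console.dir(lexer.next()); // expected: { type: 'name', value: 'bar', ... }
console.dir(lexer.next()); // expected: null, no more tokens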

Create and use a lexer, passing the text to tokenize:

const lexer = def.lexer(text);

const token = lexer.next();

Each token is retrieved in order by invoking the lexer's next function. It returns null when the tokens are exhausted.

Each token is an object with fields:

  • type: the token type name, defined using the define function, e.g. integer
  • value: the string value of the token
  • begin: start position in the input text
  • end: end position in the input text

Example:

const gelex = require('gelex');
const def = gelex.definition();

def.define('integer', '[0123456789][0123456789]*');
def.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');
def.define('delimiter', [ '{', '}', ',', ';' ]);
def.define('operator', [ '+', '-', '*', '/', '==', '===', '**', '^', '!', '|', '||', '&', '&&' ]);
def.defineText('string', "'", "'");
def.defineText('string', '"', '"');

const lexer = def.lexer('1 2 42 foo bar + * {},===== "foo" "bar"');

let token;

while (token = lexer.next())
    console.dir(token);

Expected output:

{ type: 'unknown', value: '1', begin: 0, end: 0 }
{ type: 'unknown', value: '2', begin: 2, end: 2 }
{ type: 'integer', value: '42', begin: 4, end: 5 }
{ type: 'name', value: 'foo', begin: 7, end: 9 }
{ type: 'name', value: 'bar', begin: 11, end: 13 }
{ type: 'operator', value: '+', begin: 15, end: 15 }
{ type: 'operator', value: '*', begin: 17, end: 17 }
{ type: 'delimiter', value: '{', begin: 19, end: 19 }
{ type: 'delimiter', value: '}', begin: 20, end: 20 }
{ type: 'delimiter', value: ',', begin: 21, end: 21 }
{ type: 'operator', value: '===', begin: 22, end: 24 }
{ type: 'operator', value: '==', begin: 25, end: 26 }
{ type: 'string', value: 'foo', begin: 28, end: 32 }
{ type: 'string', value: 'bar', begin: 34, end: 38 }

Versions

  • Version 0.0.1, first version.
  • Version 0.0.2, fixing ManyRule.
  • Version 0.0.3, detecting unclosed strings, match any character rule.
  • Version 0.0.4, custom rule.
  • Version 0.0.5, transform function in define.
  • Version 0.0.6, char function in Lexer.
  • Version 0.0.7, seek function in Lexer.

Previous work

Samples

References

TBD

To Do

  • Support nested comments
  • Detect unclosed comments
  • Programming language sample

License

MIT

Contribution

Feel free to file issues and submit pull requests — contributions are welcome.

If you submit a pull request, please be sure to add or update corresponding test cases, and ensure that npm test continues to pass.
