In [1]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

The following example has been adapted from the official Ply documentation and ported to TypeScript using Chevrotain.

## A Tokenizer for Numbers and the Arithmetical Operators

The module `chevrotain` contains the code that is necessary to create a scanner.

In [2]:
const { execSync } = await import('child_process');
console.log(execSync('npm install chevrotain@10').toString());




up to date, audited 9 packages in 2s

found 0 vulnerabilities



In [3]:
import { createToken, Lexer, IToken, ILexingError } from "chevrotain";

We start with the definition of <em style="color:blue">tokens</em>.

In Chevrotain, each token is created using the `createToken` function, which takes a configuration object with a `name` and a `pattern` (regular expression).

In [4]:
const Plus    = createToken({ name: "PLUS",    pattern: /\+/ });
const Minus   = createToken({ name: "MINUS",   pattern: /-/ });
const Times   = createToken({ name: "TIMES",   pattern: /\*/ });
const Divide  = createToken({ name: "DIVIDE",  pattern: /\// });
const LParen  = createToken({ name: "LPAREN",  pattern: /\(/ });
const RParen  = createToken({ name: "RPAREN",  pattern: /\)/ });

The pattern for numbers uses the regular expression `/0|[1-9][0-9]*/`.

A number is therefore either
- exactly `0`, or
- a digit from 1 to 9 followed by any number of further digits.

This rules out numbers with leading zeros: the input `007` is not matched as a single number but is split into the three tokens `0`, `0`, and `7`.
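The behaviour of this pattern can be checked without Chevrotain. The following is a minimal sketch in plain TypeScript: it uses a sticky copy of the pattern, and the helper `splitNumbers` is a hypothetical name introduced only for this illustration. It is not how Chevrotain is implemented internally.

```typescript
// Sticky ("y") copy of the number pattern: it can only match exactly at lastIndex.
const numberPattern = /0|[1-9][0-9]*/y;

// Hypothetical helper: repeatedly match the pattern from left to right.
function splitNumbers(text: string): string[] {
  const pieces: string[] = [];
  let pos = 0;
  while (pos < text.length) {
    numberPattern.lastIndex = pos;
    const m = numberPattern.exec(text);
    if (m === null) break; // no number starts at this position
    pieces.push(m[0]);
    pos += m[0].length;
  }
  return pieces;
}

console.log(splitNumbers("007"));  // [ '0', '0', '7' ]
console.log(splitNumbers("1024")); // [ '1024' ]
console.log(splitNumbers("100"));  // [ '100' ]
```

Zeros inside or at the end of a number are unaffected; only a leading `0` ends the match immediately.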


In [5]:
const NumberLiteral = createToken({ 
  name: "NUMBER", 
  pattern: /0|[1-9][0-9]*/ 
});

Tokens that should be ignored (spaces, tabs, carriage returns, and newlines) are assigned to the group `Lexer.SKIPPED`.

This tells Chevrotain to:
- Recognize these tokens for proper position tracking (line and column numbers)
- Not include them in the output token stream
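To make the two points above concrete, here is a small hand-written scanner in plain TypeScript. It is only a sketch of the idea, not Chevrotain's implementation: the types `SketchRule` and `SketchToken`, the flag `skip`, and the function `scan` are all hypothetical names. A skipped match still advances the offset and the line counter, but it is never added to the output.

```typescript
// Hypothetical rule shape for this sketch; `skip` plays the role of `group: Lexer.SKIPPED`.
interface SketchRule {
  name: string;
  pattern: RegExp; // must carry the sticky "y" flag
  skip?: boolean;
}

interface SketchToken {
  type: string;
  image: string;
  line: number;   // 1-based
  offset: number; // 0-based
}

const sketchRules: SketchRule[] = [
  { name: "WS", pattern: /[ \t\r]+/y, skip: true },
  { name: "NEWLINE", pattern: /\n+/y, skip: true },
  { name: "PLUS", pattern: /\+/y },
  { name: "NUMBER", pattern: /0|[1-9][0-9]*/y },
];

function scan(text: string): SketchToken[] {
  const out: SketchToken[] = [];
  let offset = 0;
  let line = 1;
  while (offset < text.length) {
    let matched = false;
    for (const rule of sketchRules) {
      rule.pattern.lastIndex = offset;
      const m = rule.pattern.exec(text);
      if (m === null) continue;
      if (!rule.skip) out.push({ type: rule.name, image: m[0], line, offset });
      // Skipped matches still move the position forward.
      line += m[0].split("\n").length - 1;
      offset += m[0].length;
      matched = true;
      break;
    }
    if (!matched) offset += 1; // this sketch silently drops unknown characters
  }
  return out;
}

console.log(scan("1 +\n 2"));
```

For the input `"1 +\n 2"` the result contains only the tokens `1`, `+`, and `2`, yet the last token is reported on line 2 at offset 5 because the skipped whitespace and newline were still consumed.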

In [6]:
const Newline = createToken({
    name: "NEWLINE",
    pattern: /\n+/,
    group: Lexer.SKIPPED
});

const WhiteSpace = createToken({
    name: "WS",
    pattern: /[ \t\r]+/,
    group: Lexer.SKIPPED
});

Finally, we collect all token definitions in an array.

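**Important:** The order matters! Chevrotain tries the token patterns in the order in which they appear in this array and uses the first one that matches, so whenever two patterns can match the same text, the more specific one has to come first. The patterns defined above never overlap, so here the order does not affect the result; we simply follow the common convention of listing the whitespace tokens first.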
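The effect of the order becomes visible as soon as two patterns can match at the same position. The sketch below is plain TypeScript and assumes a hypothetical `POWER` token for `**`, which is not part of this notebook's token set; `OrderedRule` and `firstMatch` are likewise illustrative names. It mimics a "first match wins" strategy and shows that a shorter pattern listed first hides a longer one.

```typescript
// Hypothetical rule type for this illustration.
interface OrderedRule {
  name: string;
  pattern: RegExp;
}

// Return the first rule (in array order) that matches at the very start of the text.
function firstMatch(rules: OrderedRule[], text: string): string {
  for (const rule of rules) {
    const m = rule.pattern.exec(text);
    if (m !== null && m.index === 0) return `${rule.name}:${m[0]}`;
  }
  return "no match";
}

// POWER ("**") is an assumed extra operator, used only to create an overlap with TIMES.
const timesFirst: OrderedRule[] = [
  { name: "TIMES", pattern: /\*/ },
  { name: "POWER", pattern: /\*\*/ },
];
const powerFirst: OrderedRule[] = [
  { name: "POWER", pattern: /\*\*/ },
  { name: "TIMES", pattern: /\*/ },
];

console.log(firstMatch(timesFirst, "**2")); // TIMES:*
console.log(firstMatch(powerFirst, "**2")); // POWER:**
```

With `TIMES` listed first, the input `**` would be read as two multiplication signs, so the longer pattern has to precede the shorter one.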

In [7]:
const allTokens = [WhiteSpace, Newline, Plus, Minus, Times, Divide, LParen, RParen, NumberLiteral];

We can inspect all defined tokens and their patterns using a table:

In [8]:
console.table(
  allTokens.map(t => ({
    name: t.name,
    pattern: t.PATTERN.toString(),
  }))
);

┌─────────┬───────────┬───────────────────┐
│ (index) │ name      │ pattern           │
├─────────┼───────────┼───────────────────┤
│ 0       │ 'WS'      │ '/[ \\t\\r]+/'    │
│ 1       │ 'NEWLINE' │ '/\\n+/'          │
│ 2       │ 'PLUS'    │ '/\\+/'           │
│ 3       │ 'MINUS'   │ '/-/'             │
│ 4       │ 'TIMES'   │ '/\\*/'           │
│ 5       │ 'DIVIDE'  │ '/\\//'           │
│ 6       │ 'LPAREN'  │ '/\\(/'           │
│ 7       │ 'RPAREN'  │ '/\\)/'           │
│ 8       │ 'NUMBER'  │ '/0|[1-9][0-9]*/' │
└─────────┴───────────┴───────────────────┘


Now we create the lexer using Chevrotain's `Lexer` class.

The option `positionTracking: "full"` ensures that each token carries complete position information: the start and end values of its offset, line, and column.

In [9]:
const lexer = new Lexer(allTokens, { positionTracking: "full" });

Let's test the generated lexer with the following string:

In [10]:
const data = `
       3 + 4 * 10 + 007 + (-20) * 2
       42
       a
       `;

Here is the input string we will tokenize:

In [11]:
console.log(data)


       3 + 4 * 10 + 007 + (-20) * 2
       42
       a
       


Now we tokenize the input string by calling the `tokenize` method.

We then iterate over all recognized tokens and display them. Any unrecognized characters will be reported as errors.

In [12]:
// Tokenize the input
const result = lexer.tokenize(data);

// Display all tokens
for (const token of result.tokens) {
  const displayValue = token.image.replace(/\n/g, '\\n');
  console.log(`LexToken(${token.tokenType.name},'${displayValue}',${token.startLine},${token.startOffset})`);
}

// Display errors (if any)
for (const error of result.errors) {
  // Read the offending character directly from the input instead of parsing the error message.
  const char = data.charAt(error.offset);
  console.log(`Illegal character '${char}' at line ${error.line}.`);
  console.log(`This is the ${error.offset}th character.`);
}

LexToken(NUMBER,'3',2,8)
LexToken(PLUS,'+',2,10)
LexToken(NUMBER,'4',2,12)
LexToken(TIMES,'*',2,14)
LexToken(NUMBER,'10',2,16)
LexToken(PLUS,'+',2,19)
LexToken(NUMBER,'0',2,21)
LexToken(NUMBER,'0',2,22)
LexToken(NUMBER,'7',2,23)
LexToken(PLUS,'+',2,25)
LexToken(LPAREN,'(',2,27)
LexToken(MINUS,'-',2,28)
LexToken(NUMBER,'20',2,29)
LexToken(RPAREN,')',2,31)
LexToken(TIMES,'*',2,33)
LexToken(NUMBER,'2',2,35)
LexToken(NUMBER,'42',3,44)
Illegal character 'a' at line 4.
This is the 54th character.


We see that each generated token contains the following information:

1. The **type** of the token (e.g., `NUMBER`, `PLUS`)
2. The **value** of the token - the actual matched string
3. The **line number** - starts at 1 (note that the first line of `data` is empty)
4. The **character offset** - the position in the input string, starts at 0

For example, the character `a` in line 4 sits at offset 54. Because offsets start at 0, it is the 55th character of the input string.
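The reported offsets and line numbers can be verified by hand. The following sketch rebuilds the test input with explicit `\n` characters (assuming seven spaces of indentation per line, as in the cell that defines `data`) and uses a hypothetical helper `lineOf` to turn a 0-based offset into a 1-based line number.

```typescript
// Same text as the notebook's input, with the line breaks written out explicitly.
const indent = " ".repeat(7);
const sample =
  "\n" + indent + "3 + 4 * 10 + 007 + (-20) * 2" +
  "\n" + indent + "42" +
  "\n" + indent + "a" +
  "\n" + indent;

// Hypothetical helper: 1-based line number of a 0-based offset.
function lineOf(text: string, offset: number): number {
  return text.slice(0, offset).split("\n").length;
}

console.log(sample.charAt(54));    // a
console.log(lineOf(sample, 54));   // 4
console.log(sample.indexOf("42")); // 44
console.log(lineOf(sample, 44));   // 3
```

The leading newline occupies offset 0, which is why the first number `3` is reported on line 2 at offset 8.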