In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css: string = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

The following example has been adapted from the official Ply documentation and ported to TypeScript using Chevrotain.

## A Tokenizer for Numbers and the Arithmetical Operators

The module `chevrotain` provides everything needed to build a scanner (also called a lexer or tokenizer).

In [None]:
import { createToken, Lexer, TokenType, IToken, ILexingError, ILexingResult } from "chevrotain";

We start with the definition of <em style="color:blue">tokens</em>.

In Chevrotain, each token type is created with the `createToken` function. It takes a configuration object with a `name` and a `pattern` (a regular expression) and returns a `TokenType` object.

In [None]:
const Plus : TokenType    = createToken({ name: "PLUS",    pattern: /\+/ });
const Minus : TokenType   = createToken({ name: "MINUS",   pattern: /-/ });
const Times : TokenType   = createToken({ name: "TIMES",   pattern: /\*/ });
const Divide : TokenType  = createToken({ name: "DIVIDE",  pattern: /\// });
const LParen : TokenType  = createToken({ name: "LPAREN",  pattern: /\(/ });
const RParen : TokenType  = createToken({ name: "RPAREN",  pattern: /\)/ });

The pattern for numbers uses the regular expression `/0|[1-9][0-9]*/`.

This means a number is either:
- exactly `0`, or
- a digit from 1 to 9 followed by any number of further digits (possibly none).

As a result, a single number token can never have a leading zero: an input like `007` is tokenized as three separate numbers, `0`, `0`, and `7`.
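
The effect of this pattern on `007` can be checked with plain regular expressions, without Chevrotain. The following is a minimal sketch: the helper `splitNumbers` is a hypothetical name introduced only for this illustration, and it merely mimics how a lexer repeatedly matches the pattern at the current position.

```typescript
// Hypothetical helper: repeatedly match the number pattern at the current
// position, the way a lexer would, and collect the matched pieces.
function splitNumbers(input: string): string[] {
  // The sticky flag "y" forces each match to start exactly at lastIndex.
  const numberPattern: RegExp = new RegExp("0|[1-9][0-9]*", "y");
  const pieces: string[] = [];
  let pos: number = 0;
  while (pos < input.length) {
    numberPattern.lastIndex = pos;
    const match: RegExpExecArray | null = numberPattern.exec(input);
    if (match === null) break; // no number starts here: stop
    pieces.push(match[0]);
    pos += match[0].length;
  }
  return pieces;
}

console.log(splitNumbers("007"));  // [ '0', '0', '7' ]
console.log(splitNumbers("1024")); // [ '1024' ]
```

The first alternative `0` matches a single zero and stops, which is why each leading zero becomes a token of its own.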


In [None]:
const NumberLiteral : TokenType = createToken({
  name: "NUMBER", 
  pattern: /0|[1-9][0-9]*/ 
});

Characters that should be ignored (spaces, tabs, carriage returns, and newlines) are matched by tokens marked with `group: Lexer.SKIPPED`.

This tells Chevrotain to:
- Recognize these tokens for proper position tracking (line and column numbers)
- Not include them in the output token stream

In [None]:
const Newline : TokenType = createToken({
    name: "NEWLINE",
    pattern: /\n+/,
    group: Lexer.SKIPPED
});

const WhiteSpace : TokenType = createToken({
    name: "WS",
    pattern: /[ \t\r]+/,
    group: Lexer.SKIPPED
});

Finally, we collect all token definitions in an array.

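**Important:** The order matters! When several patterns match at the same position, Chevrotain picks the one that appears first in this array. The patterns used here do not overlap, so the order does not change the result for this example. Whitespace and newlines are listed first because they occur very often, and the Chevrotain documentation recommends placing frequent tokens early for speed.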
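
To see why the order can matter once patterns overlap, consider a hypothetical power operator `**` next to `*`. The sketch below is a simplified model of "first match wins", written with plain regular expressions. It is not Chevrotain's implementation, and the names `Rule`, `firstMatch`, `powerRule`, and `timesRule` are assumptions made for this illustration.

```typescript
// Simplified model of a token definition: a name plus a pattern.
type Rule = { name: string; pattern: RegExp };

// Try the rules in array order at position `pos` and return the name of the
// first rule whose pattern matches exactly there.
function firstMatch(rules: Rule[], input: string, pos: number): string | undefined {
  for (const rule of rules) {
    // The sticky flag "y" anchors the match at lastIndex.
    const sticky: RegExp = new RegExp(rule.pattern.source, "y");
    sticky.lastIndex = pos;
    if (sticky.test(input)) return rule.name;
  }
  return undefined;
}

const powerRule: Rule = { name: "POWER", pattern: /\*\*/ };
const timesRule: Rule = { name: "TIMES", pattern: /\*/ };

console.log(firstMatch([powerRule, timesRule], "**", 0)); // POWER
console.log(firstMatch([timesRule, powerRule], "**", 0)); // TIMES
```

With the shorter pattern listed first, `**` would be split into two `TIMES` tokens, so the longer pattern has to come first in such a case.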

In [None]:
const allTokens : TokenType[] = [WhiteSpace, Newline, Plus, Minus, Times, Divide, LParen, RParen, NumberLiteral];

We can inspect all defined tokens and their patterns using a table:

In [None]:
console.table(
  allTokens.map(token => ({
    name: token.name,
    pattern: String(token.PATTERN), // PATTERN is optional in the TokenType typings
  }))
);

Now we create the lexer using Chevrotain's `Lexer` class.

The option `positionTracking: "full"` ensures that we get complete position information (line, column, offset) for each token.

In [None]:
const lexer : Lexer = new Lexer(allTokens, { positionTracking: "full" });

Let's test the generated lexer with the following string:

In [None]:
const data : string = `
       3 + 4 * 10 + 007 + (-20) * 2
       42
       a
       `;

Here is the input string we will tokenize:

In [None]:
data;

Now we tokenize the input string by calling the `tokenize` method.

We then iterate over all recognized tokens and display them. Any unrecognized characters will be reported as errors.

In [None]:
const result: ILexingResult = lexer.tokenize(data);

for (const token of result.tokens as IToken[]) {
  const displayValue: string = token.image.replace(/\n/g, "\\n");
  console.log(
    `LexToken(${token.tokenType.name},'${displayValue}',${token.startLine},${token.startOffset})`
  );
}

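for (const error of result.errors as ILexingError[]) {
  // Chevrotain's default message is expected to look like "unexpected character: ->a<- ...".
  const charFromMessage: string | undefined = error.message.match(/->(.)<-/)?.[1];
  // Fall back to slicing the input if the message has a different format.
  const illegalChar: string =
    charFromMessage || data.slice(error.offset, error.offset + error.length) || "?";

  console.log(`Illegal character '${illegalChar}' at line ${error.line}.`);
  console.log(`It is located at offset ${error.offset}.`);
}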

We see that each generated token contains the following information:

1. The **type** of the token (e.g., `NUMBER`, `PLUS`)
2. The **value** of the token - the actual matched string
3. The **line number** - starts at 1 (note that the first line of `data` is empty)
4. The **character offset** - the position in the input string, starts at 0

For example, the character `a` on line 4 has offset 54. Since offsets start at 0, this means that 54 characters precede it in the input string.
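
The line number and offset of `a` can be verified by hand. The following sketch rebuilds the input string explicitly, assuming that every non-empty line is indented by seven spaces as in the `data` cell. The names `indent`, `sample`, `offsetOfA`, and `lineOfA` are introduced only for this check.

```typescript
// Rebuild the test input; the seven-space indentation is an assumption
// taken from the layout of the `data` cell.
const indent: string = " ".repeat(7);
const sample: string =
  "\n" +
  indent + "3 + 4 * 10 + 007 + (-20) * 2\n" +
  indent + "42\n" +
  indent + "a\n" +
  indent;

// Zero-based offset of the illegal character.
const offsetOfA: number = sample.indexOf("a");
// One-based line number: count the newlines before it and add one.
const lineOfA: number = sample.slice(0, offsetOfA).split("\n").length;

console.log(offsetOfA); // 54
console.log(lineOfA);   // 4
```

The offset adds up as 1 (leading newline) + 36 (second line including its newline) + 10 (third line including its newline) + 7 (indentation of the fourth line) = 54.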