In [1]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

The following example is adapted from the official documentation of Ply and has been ported to TypeScript, where the library Chevrotain takes the place of Ply.

## A Tokenizer for Numbers and the Arithmetical Operators

The module `chevrotain` contains the code that is necessary to create a scanner.

In [2]:
const { execSync } = await import('child_process');
console.log(execSync('npm install chevrotain@10').toString());




up to date, audited 8 packages in 959ms

found 0 vulnerabilities



In [12]:
import { createToken, Lexer, IToken, ILexingError } from "chevrotain";

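We start with a definition of the <em style="color:blue">tokens</em>.  Each token type is created by calling `createToken` with a name and a regular expression that serves as its pattern.  By convention, the token names are written in capital letters.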

In [13]:
const Plus    = createToken({ name: "PLUS",    pattern: /\+/ });
const Minus   = createToken({ name: "MINUS",   pattern: /-/ });
const Times   = createToken({ name: "TIMES",   pattern: /\*/ });
const Divide  = createToken({ name: "DIVIDE",  pattern: /\// });
const LParen  = createToken({ name: "LPAREN",  pattern: /\(/ });
const RParen  = createToken({ name: "RPAREN",  pattern: /\)/ });

A number is either an integer or a decimal number whose integer part may be empty, so both `42` and `.5` are accepted.  The token definition below does not transform the value of the token.  The string that makes up the token is stored in the property `image` of the token.  If we need the numeric value, we have to convert this string after scanning, for example via the predefined function `parseFloat`.

In [14]:
const NumberLiteral = createToken({ name: "NUMBER", pattern: /(?:\d*\.\d+|\d+)/ });
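The following cell is a minimal sketch of how the pattern of the `NUMBER` token behaves and how the matched text could be converted into a number afterwards.  It uses only plain regular expressions, not Chevrotain.  The helpers `matchNumber` and `numberValue` are hypothetical names introduced for this illustration.

```typescript
// Same pattern as the NUMBER token, made sticky (flag `y`) so that it
// only matches at the position given by `lastIndex`.
const numberPattern = /(?:\d*\.\d+|\d+)/y;

// Hypothetical helper: returns the text matched at the start of `text`,
// or null if `text` does not start with a number.
function matchNumber(text: string): string | null {
  numberPattern.lastIndex = 0;
  const m = numberPattern.exec(text);
  return m === null ? null : m[0];
}

// Hypothetical helper: converts the text of a NUMBER token into a number.
function numberValue(image: string): number {
  return parseFloat(image);
}

const samples = ["42", "007", "3.14", ".5", "3.", "a"];
for (const s of samples) {
  const image = matchNumber(s);
  console.log(s, "->", image, image === null ? NaN : numberValue(image));
}
```

Note that `007` is matched as a whole and converts to the number 7, whereas for `3.` only the prefix `3` is matched, since the pattern requires at least one digit after the decimal point.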

The token below matches one or more newline characters.  We do not have to keep track of line numbers ourselves: when position tracking is enabled, the scanner records the line and the column of every token in the properties `startLine` and `startColumn`.  This information enables us to specify the precise location of unknown characters in error messages.

In [15]:
const Newline = createToken({ name: "NEWLINE", pattern: /\n+/ });
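The bookkeeping behind line and column numbers can be sketched in plain TypeScript.  The function `lineAndColumn` is a hypothetical helper and not part of Chevrotain.  It assumes that lines and columns are counted from 1, offsets are counted from 0, and `\n` is the only line terminator.

```typescript
// Hypothetical helper: computes the 1-based line and column of the
// character at the given 0-based offset.
function lineAndColumn(text: string, offset: number): { line: number; column: number } {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset && i < text.length; i++) {
    if (text[i] === "\n") {
      line += 1;   // a newline starts the next line ...
      column = 1;  // ... and resets the column
    } else {
      column += 1;
    }
  }
  return { line, column };
}

// The first lines of the test string that is scanned later in this notebook.
const sampleText = "\n  3 + 4 * 10 + 007 + (-20) * 2\n  42\n  a\n";
console.log(lineAndColumn(sampleText, 39)); // position of the character `a`
```

For the character `a` at offset 39 this yields line 4 and column 3, which agrees with the position that the scanner reports for this character further below.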

The option `group: Lexer.SKIPPED` specifies that a token should be discarded.
In the following cell it is used to specify that space characters, tabulator characters, and carriage returns are to be ignored.

In [16]:
const WhiteSpace = createToken({ name: "WS", pattern: /[ \t\r]+/, group: Lexer.SKIPPED });

All characters not recognized by any of the defined tokens are reported as lexing errors.  The scanner skips the character that has not been recognized, resumes scanning after this character has been discarded, and collects the errors in the array `errors` of the lexing result.  The function `t_error` below prints these errors together with their line and column numbers.

In [18]:
function t_error(errors: ILexingError[]): void {
  if (errors.length === 0) return;

  console.error("Lexing errors detected:");

  for (const err of errors) {
    console.error(`  → ${err.message}`);
    if (err.line !== undefined && err.column !== undefined) {
      console.error(`    at line ${err.line}, column ${err.column}`);
    }
  }

  // Optional: throw an exception to stop execution
  // throw new Error("Lexing failed due to errors.");
}
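To illustrate the idea of skipping an unrecognized character and resuming afterwards, the following cell contains a hand-written sketch of a scanning loop.  It is not the implementation used by Chevrotain: the names `MiniRule`, `MiniToken`, `MiniError`, `miniRules`, and `miniScan` are hypothetical, only a few token types are covered, and newlines are simply treated as white space.

```typescript
// All names in this sketch are hypothetical and not part of Chevrotain.
interface MiniRule { kind: string; pattern: RegExp; skip?: boolean }
interface MiniToken { kind: string; image: string; offset: number }
interface MiniError { offset: number; message: string }

// Sticky patterns (flag `y`) only match at the position `lastIndex`.
const miniRules: MiniRule[] = [
  { kind: "WS",     pattern: /[ \t\r\n]+/y, skip: true },
  { kind: "PLUS",   pattern: /\+/y },
  { kind: "TIMES",  pattern: /\*/y },
  { kind: "NUMBER", pattern: /(?:\d*\.\d+|\d+)/y },
];

function miniScan(text: string): { tokens: MiniToken[]; errors: MiniError[] } {
  const tokens: MiniToken[] = [];
  const errors: MiniError[] = [];
  let offset = 0;
  while (offset < text.length) {
    let matched = false;
    for (const rule of miniRules) {
      rule.pattern.lastIndex = offset;
      const m = rule.pattern.exec(text);
      if (m !== null) {
        // Skipped rules consume input but do not produce a token.
        if (!rule.skip) tokens.push({ kind: rule.kind, image: m[0], offset });
        offset += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // No rule matches: record the error, discard one character, resume.
      errors.push({ offset, message: `unexpected character: ${text[offset]}` });
      offset += 1;
    }
  }
  return { tokens, errors };
}

const miniResult = miniScan("3 + a * 10");
console.log(miniResult.tokens.map(t => `${t.kind}:${t.image}`).join(" "));
console.log(miniResult.errors);
```

For the input `3 + a * 10` the sketch produces the four tokens `3`, `+`, `*`, and `10` and a single error for the character `a` at offset 4, so one unknown character does not prevent the rest of the input from being scanned.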

In [19]:
const allTokens = [WhiteSpace, Newline, Plus, Minus, Times, Divide, LParen, RParen, NumberLiteral];

Below, the constructor `Lexer` creates the scanner from the list `allTokens` defined above.  The option `positionTracking: "full"` instructs the scanner to record the position of every token, which includes its line and its column.  The resulting scanner is stored in the variable `lexer`.

In [20]:
const lexer = new Lexer(allTokens, { positionTracking: "full" });

Let's test the generated scanner, which is stored in `lexer`, with the following string:

In [21]:
const data = `
  3 + 4 * 10 + 007 + (-20) * 2
  42
  a
  b
  c
`;

Before we feed the string `data` to the scanner, we print it.  The scanning itself is done by calling the method `tokenize` of the generated scanner.  The function `runLexer` defined below wraps this call.

In [22]:
console.log(data)


  3 + 4 * 10 + 007 + (-20) * 2
  42
  a
  b
  c



In [23]:
function runLexer(input: string): void {
  const result = lexer.tokenize(input);

  t_error(result.errors);

  if (result.errors.length === 0) {
    console.log("Tokens:");
    for (const token of result.tokens) {
      console.log(
        `${token.image} → ${token.tokenType.name} (line ${token.startLine}, column ${token.startColumn})`
      );
    }
  }
}

In [24]:
runLexer(data);

Lexing errors detected:
  → unexpected character: ->a<- at offset: 39, skipped 1 characters.
    at line 4, column 3
  → unexpected character: ->b<- at offset: 43, skipped 1 characters.
    at line 5, column 3
  → unexpected character: ->c<- at offset: 47, skipped 1 characters.
    at line 6, column 3


We do not have to reset the line number in order to run the scanner multiple times: each call of `tokenize` starts counting at line `1`, so scanning the same string again yields the same result.

In [26]:
runLexer(data)

Lexing errors detected:
  → unexpected character: ->a<- at offset: 39, skipped 1 characters.
    at line 4, column 3
  → unexpected character: ->b<- at offset: 43, skipped 1 characters.
    at line 5, column 3
  → unexpected character: ->c<- at offset: 47, skipped 1 characters.
    at line 6, column 3


Finally, we print an overview of all token types together with the regular expressions that define them.

In [25]:
console.table(
  allTokens.map(t => ({
    name: t.name,
    pattern: t.PATTERN.toString(),
  }))
);

┌─────────┬───────────┬──────────────────────────┐
│ (index) │ name      │ pattern                  │
├─────────┼───────────┼──────────────────────────┤
│ 0       │ 'WS'      │ '/[ \\t\\r]+/'           │
│ 1       │ 'NEWLINE' │ '/\\n+/'                 │
│ 2       │ 'PLUS'    │ '/\\+/'                  │
│ 3       │ 'MINUS'   │ '/-/'                    │
│ 4       │ 'TIMES'   │ '/\\*/'                  │
│ 5       │ 'DIVIDE'  │ '/\\//'                  │
│ 6       │ 'LPAREN'  │ '/\\(/'                  │
│ 7       │ 'RPAREN'  │ '/\\)/'                  │
│ 8       │ 'NUMBER'  │ '/(?:\\d*\\.\\d+|\\d+)/' │
└─────────┴───────────┴──────────────────────────┘


Every token that is returned by `tokenize` contains, among other things, four pieces of information:
 1. The *type* of the token, which is stored in `tokenType`.
 2. The *text* of the token, which is stored in `image`.  This is always a string.
 3. The *line number* of the token, which is stored in `startLine`.  The line number starts with 1.
    However, note that the first line of `data` is empty.
 4. The *position* of the token.  The offset `startOffset` starts with `0`, while the column `startColumn` starts with `1`.
    For example, the unknown character `a` is reported at offset 39, which is column 3 of line 4.