In [2]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

The following example has been extracted from the official documentation of Ply.

## A Tokenizer for Numbers and the Arithmetical Operators

The module `chevrotain` contains the code that is necessary to create a scanner.

In [3]:
import { createToken, Lexer } from "chevrotain";
import type { IToken } from "chevrotain";

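We start with a definition of the <em style="color:blue">tokens</em>.  Each token is created by calling `createToken` with a name and a regular expression that serves as its pattern.  Here, all token names are written 
in capital letters.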

In [4]:
const Plus    = createToken({ name: "PLUS",    pattern: /\+/ });
const Minus   = createToken({ name: "MINUS",   pattern: /-/ });
const Times   = createToken({ name: "TIMES",   pattern: /\*/ });
const Divide  = createToken({ name: "DIVIDE",  pattern: /\// });
const LParen  = createToken({ name: "LPAREN",  pattern: /\(/ });
const RParen  = createToken({ name: "RPAREN",  pattern: /\)/ });

The token for numbers is defined via a regular expression, too.  The pattern `/(?:\d*\.\d+|\d+)/` matches either a decimal number with an optional integer part, 
for example `.5` or `3.14`, or a non-empty sequence of digits.  This definition does not transform the value of the token:
the string that makes up the token is stored in the property `image` of the token.  If we need a number instead,
we have to convert this string ourselves, for example via the predefined function `Number`.

In [None]:
const NumberLiteral = createToken({ name: "NUMBER", pattern: /(?:\d*\.\d+|\d+)/ });
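
The conversion of the matched text into a number can be sketched in plain TypeScript, without any library.  The helper `toNumber` is hypothetical and only serves as an illustration; it reuses the regular expression of the `NUMBER` token, anchored at both ends so that the whole string has to match.

```typescript
// Same regular expression as in the NUMBER token, anchored so that the
// complete string has to be a number literal.
const numberPattern = /^(?:\d*\.\d+|\d+)$/;

// Hypothetical helper (not part of any library): converts the text of a
// NUMBER token into a numeric value and rejects everything else.
function toNumber(image: string): number {
    if (!numberPattern.test(image)) {
        throw new Error("Not a number literal: " + image);
    }
    return Number(image);
}

// Leading zeros are ignored and a missing integer part is allowed.
const examples: number[] = [toNumber("007"), toNumber("42"), toNumber(".5")];
```

Note that `Number("007")` yields `7`, since `Number` interprets a string with leading zeros as a decimal number.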

The rule below is used to keep track of line numbers. We use the function `len` since there might be
more than one newline.  The member variable `lexer.lineno` keeps track of the current line number.  This variable
is maintained so that we are able to specify the precise location of unknown characters in error messages.

In [None]:
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
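
The same bookkeeping can be sketched in plain TypeScript.  The function `countLines` is a hypothetical helper, not part of any library: it searches for runs of newlines and advances the line counter by the length of each run, which is what the rule above does with `len(t.value)`.

```typescript
// Hypothetical helper: returns the line number that is reached after
// scanning `text`, starting at line `firstLine`.
function countLines(text: string, firstLine: number = 1): number {
    const newlines = /\n+/g;           // one match per run of newlines
    let lineno = firstLine;
    let match: RegExpExecArray | null;
    while ((match = newlines.exec(text)) !== null) {
        // A run may contain more than one newline, hence we add its length.
        lineno += match[0].length;
    }
    return lineno;
}

// The text contains runs of 1, 2, and 1 newlines.
const lastLine = countLines("\n3 + 4\n\n42\n");
```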

The variable `t_ignore` specifies the characters that should be discarded.
In the following cell, it specifies that space characters and tab characters are to be ignored.  Note that we **must not** use a raw string here, since otherwise `\t` would not denote a tab character.

In [None]:
t_ignore = ' \t'

All characters that are not recognized by any of the defined tokens are handled by the function `t_error`.
The call `t.lexer.skip(1)` skips one character, namely the character that has not been recognized.  Scanning resumes after this character has been discarded.

In [None]:
def t_error(t):
    print(f"Illegal character '{t.value[0]}' at line {t.lexer.lineno}.")
    print(f"This is the {t.lexpos}th character.")
    t.lexer.skip(1)
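
To make this recovery strategy concrete, here is a minimal hand-written scanner in plain TypeScript.  It is only a sketch under stated assumptions: the names `SimpleToken`, `ScanResult`, `rules`, and `scan` are hypothetical and do not belong to any library.  An unrecognized character is reported and then skipped, after which scanning resumes.

```typescript
interface SimpleToken { type: string; image: string; offset: number; line: number; }
interface ScanResult { tokens: SimpleToken[]; errors: string[]; }

// Token rules; every pattern is anchored at the start of the remaining text.
const rules: Array<[string, RegExp]> = [
    ["NUMBER", /^(?:\d*\.\d+|\d+)/],
    ["PLUS",   /^\+/],
    ["MINUS",  /^-/],
    ["TIMES",  /^\*/],
    ["DIVIDE", /^\//],
    ["LPAREN", /^\(/],
    ["RPAREN", /^\)/],
];

function scan(text: string): ScanResult {
    const tokens: SimpleToken[] = [];
    const errors: string[] = [];
    let pos = 0;   // offset of the next character, starting with 0
    let line = 1;  // current line number, starting with 1
    while (pos < text.length) {
        const rest = text.slice(pos);
        const newlines = /^\n+/.exec(rest);
        if (newlines !== null) {          // keep track of line numbers
            line += newlines[0].length;
            pos += newlines[0].length;
            continue;
        }
        const blanks = /^[ \t]+/.exec(rest);
        if (blanks !== null) {            // ignore spaces and tabs
            pos += blanks[0].length;
            continue;
        }
        let matched = false;
        for (const [type, pattern] of rules) {
            const m = pattern.exec(rest);
            if (m !== null) {
                tokens.push({ type: type, image: m[0], offset: pos, line: line });
                pos += m[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) {
            // Report the illegal character, skip it, and resume scanning.
            errors.push(`Illegal character '${text.charAt(pos)}' at line ${line}, offset ${pos}.`);
            pos += 1;
        }
    }
    return { tokens: tokens, errors: errors };
}
```

For example, `scan("3 + a * 2")` returns four tokens and a single error message for the character `a`; the tokens after `a` are still recognized.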

Below, the function `lex.lex()` creates the lexer specified above.  This code is expected to be part 
of some Python file.  Since it is placed in a Jupyter notebook instead, we have to set the variable 
`__file__` manually to fool the system into believing that the code given above is located in a file 
called `hugo.py`.  The name `hugo` is irrelevant and could be replaced by any other name.

In [None]:
__file__ = 'hugo'
lexer = lex.lex(debug=True)

Now `lexer` is the scanner that has been created by the previous command. 

In [None]:
lexer

Let us test the generated scanner, which is stored in `lexer`, with the following string:

In [None]:
data = """
       3 + 4 * 10 + 007 + (-20) * 2
       42
       a
       """

The string `data` is displayed below.  Note that it starts with a newline.  Later, we feed the scanner with this string by calling the method `input` of the generated scanner.

In [None]:
data

We reset the line number to `1` before calling `input` so that the scanner can be run multiple times: every run of the scanner changes the line number.

In [None]:
lexer.lineno = 1
lexer.input(data)

Now we put the lexer to work by using it as an *iterable*.  This way, we can simply iterate over all the tokens that our scanner recognizes.

In [None]:
for token in lexer:
    print(token)

We see that the generated tokens contain four pieces of information:
 1. The *type* of the token.
 2. The *value* of the token.  This is either a number or a string.
 3. The *line number* of the token.  The line number starts with 1.
    However, note that the first line of `data` is empty.
 4. The *character count*.  For example, the last token is the $54^{\textrm{th}}$ character.
    The character count starts with `0`.
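
The relation between the character count and the line number can be checked with a small sketch in plain TypeScript.  It assumes that every line of the test string is indented by seven spaces; the names `sample`, `SourcePosition`, and `locate` are hypothetical and only used for this illustration.

```typescript
// Assumption: every line of the test string is indented by seven spaces.
const indent = new Array(8).join(" ");
const sample = "\n" + indent + "3 + 4 * 10 + 007 + (-20) * 2\n"
             + indent + "42\n"
             + indent + "a\n"
             + indent;

interface SourcePosition { line: number; column: number; }

// Hypothetical helper: converts a character count (starting with 0) into
// a line number and a column number (both starting with 1).
function locate(text: string, offset: number): SourcePosition {
    let line = 1;
    let lineStart = 0;  // offset of the first character of the current line
    for (let i = 0; i < offset; i++) {
        if (text.charAt(i) === "\n") {
            line += 1;
            lineStart = i + 1;
        }
    }
    return { line: line, column: offset - lineStart + 1 };
}

const offsetOfA = sample.indexOf("a");
const positionOfA = locate(sample, offsetOfA);
```

Under this assumption, the character `a` has the character count 54 and is located in line 4, because the first line of the string is empty.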