Tokenization is the first phase of parsing. It is the point where characters are extracted from the source string and grouped into tokens. There are four kinds of tokens: numbers, functions, variables, and operators.
All numbers extracted from the source string are positive, and are defined as anything that matches the following regular expression:
Even though "." is technically recognized by this regular expression, it is not evaluated as a number.
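As an illustration only, the Python sketch below uses a stand-in pattern (an assumption, not the parser's actual expression) that shows the two behaviors described: number literals carry no sign, and a lone "." matches the pattern but is rejected as a number. `NUMBER_PATTERN` and `read_number` are hypothetical names.

```python
import re

# Stand-in pattern (an assumption): optional digits, an optional decimal
# point, optional digits. There is no leading sign, so every match is positive.
NUMBER_PATTERN = re.compile(r"\d*\.?\d*")

def read_number(source, start=0):
    """Return (text, end) for a number token at `start`, or None."""
    # The pattern can match the empty string, so match() never returns None.
    match = NUMBER_PATTERN.match(source, start)
    text = match.group(0)
    # "" and a lone "." satisfy the pattern but contain no digits.
    if not any(ch.isdigit() for ch in text):
        return None
    return text, match.end()
```

With this sketch, `read_number("-5")` returns `None` because the "-" is left for the operator tokenizer, while `read_number("-5", 1)` picks up the "5".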
No effort is made to allow for locale-sensitive numbers. Allowing things like thousands groupings or the comma as the decimal separator can introduce ambiguity in the parser. For example, if "," were recognized as the decimal separator, then an expression such as "max(1,2,3)" is ambiguous: should that be parsed as the maximum of three numbers (1, 2, and 3), or the maximum of two (1 and 2.3)? Similar problems arise when dealing with thousands groupings. As such, numbers are not locale-sensitive.
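To make the ambiguity concrete, here is a small Python sketch that enumerates every reading of an argument list when "," may act as either an argument separator or a decimal separator. The input "1,2,3" and the function name `readings` are assumptions chosen for illustration; nothing like this exists in the parser, which avoids the problem by not supporting locale-sensitive numbers at all.

```python
from itertools import product

def readings(argument_text):
    """List every way to read `argument_text` if "," may be either an
    argument separator or a decimal separator (hypothetical rule)."""
    parts = argument_text.split(",")
    results = []
    # Each comma is independently a separator ("|") or a decimal point (".").
    for choice in product("|.", repeat=len(parts) - 1):
        joined = parts[0]
        for mark, part in zip(choice, parts[1:]):
            joined += mark + part
        candidate = joined.split("|")
        # A number may contain at most one decimal point.
        if all(piece.count(".") <= 1 for piece in candidate):
            results.append([float(piece) for piece in candidate])
    return results

for numbers in readings("1,2,3"):
    print(numbers, "-> max", max(numbers))
```

Three readings come back for "1,2,3" (three arguments, or two arguments with the decimal point in either position), and they do not all have the same maximum.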
Function tokens are strictly the name of a function. For example, given the string "sin(0)", the extracted function token is "sin".
Function names can contain letters (upper and lowercase), decimal digits, and underscores.
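A minimal Python sketch of that naming rule follows. `FUNCTION_NAME` and `read_function_name` are hypothetical names, and the sketch assumes that number tokens have already been tried first, since a run of digits alone would also satisfy this character set.

```python
import re

# Letters (either case), decimal digits, and underscores, per the rule above.
FUNCTION_NAME = re.compile(r"[A-Za-z0-9_]+")

def read_function_name(source, start=0):
    """Return the function-name token at `start`, or None if there is none."""
    # Assumption: the caller has already ruled out a number token here.
    match = FUNCTION_NAME.match(source, start)
    return match.group(0) if match else None
```

Only the name is extracted; the parenthesis that follows is left in the stream for the operator tokenizer.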
Variables follow the same rules as functions, except that they must be prefixed with a "$" character. Thus, the following are all legal variable names:
Variables may also be quoted strings:
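As a rough Python sketch of both variable forms: the "$" form reuses the function-name character set, while the quoted form here assumes matching single or double quotes, no "$" prefix, and no escape sequences. Those quoting details are assumptions, and `read_variable` is a hypothetical name.

```python
import re

# "$" followed by letters, digits, and underscores.
DOLLAR_VARIABLE = re.compile(r"\$([A-Za-z0-9_]+)")
# Assumption: a quoted variable uses matching single or double quotes,
# needs no "$" prefix, and has no escape sequences.
QUOTED_VARIABLE = re.compile(r"""(["'])(.*?)\1""")

def read_variable(source, start=0):
    """Return (name, end) for a variable token at `start`, or None."""
    match = DOLLAR_VARIABLE.match(source, start)
    if match:
        return match.group(1), match.end()
    match = QUOTED_VARIABLE.match(source, start)
    if match:
        return match.group(2), match.end()
    return None
```

In this sketch the returned name drops the "$" or the surrounding quotes; whether the real token keeps them is not stated here.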
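Operators are essentially all other characters in the string: anything that is not whitespace and is not consumed as part of a number, function, or variable token.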
Parentheses are parsed as operator tokens, even though they are not listed as part of the built-in operators. Parentheses used to denote order of operations and function arguments are eliminated during term grouping.
Whitespace is seen as a logical break in the token stream. That means that "3 4" will be parsed as the 3 token followed by the 4 token. Because of the logic for recognizing implicit multiplication, a multiplication operator will then be injected into the stream. Thus, "3 4" is recognized as "3*4", and evaluates to 12.
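The whitespace behavior can be sketched in Python as follows. This is a simplified stand-in, not the parser's implementation: the `TOKEN` pattern, `tokenize`, and `inject_multiplication` are hypothetical, and the injection rule here only handles two adjacent number tokens, whereas real implicit-multiplication logic presumably covers more cases.

```python
import re

# Hypothetical token pattern: number, "$" variable, name, or any other
# single non-space character (treated as an operator, parentheses included).
TOKEN = re.compile(r"""
    (?P<number>\d+(?:\.\d*)?|\.\d+)
  | (?P<variable>\$[A-Za-z0-9_]+)
  | (?P<function>[A-Za-z0-9_]+)
  | (?P<operator>\S)
""", re.VERBOSE)

def tokenize(source):
    """Split `source` into (kind, text) pairs; whitespace only separates."""
    # finditer skips characters that match nothing, which here is whitespace.
    return [(m.lastgroup, m.group(0)) for m in TOKEN.finditer(source)]

def inject_multiplication(tokens):
    """Insert "*" between two adjacent number tokens (simplified rule)."""
    result = []
    for token in tokens:
        if result and result[-1][0] == "number" and token[0] == "number":
            result.append(("operator", "*"))
        result.append(token)
    return result

tokens = inject_multiplication(tokenize("3 4"))
print("".join(text for _, text in tokens))
```

Note that "34" yields a single number token while "3 4" yields two, which is what lets the injection step tell them apart.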