Alexander Liao edited this page Aug 14, 2017 · 4 revisions

The first step in interpreting Proton code is to break the source code down into tokens. Each token has a type and a content (as well as **kwargs, but those can be ignored since I never got around to using them). The type determines which component the token becomes in the AST, and the content is kept in the AST, which is especially useful for literals and operators.
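As a rough illustration of the token shape described above, here is a minimal sketch. The class name `Token`, the attribute names, and the type strings such as `"literal:int"` are assumptions made for this example, not necessarily what Proton uses.

```python
class Token:
    """Hypothetical token: a type, a content, and optional extra data."""

    def __init__(self, type, content, **kwargs):
        self.type = type          # decides which AST component the token becomes
        self.content = content    # kept in the AST (e.g. the literal text or operator)
        self.extra = kwargs       # accepted but unused, as noted above

    def __repr__(self):
        return "Token(%r, %r)" % (self.type, self.content)


# What a lexer might produce for the source text "2 + 3" (type names are made up).
tokens = [
    Token("literal:int", "2"),
    Token("operator", "+"),
    Token("literal:int", "3"),
]
```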

The lexer works by comparing the code against a set of rules in a fixed order of precedence. Once a rule matches, the lexer breaks off the matched section of the code and yields a token. The lexer is a generator, though it is only ever consumed by a list(...) call.
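The loop described above can be sketched as follows. This is not Proton's actual lexer: the function name `lex`, the `(type, content)` tuples, the type strings, and the reduced rule set are all assumptions chosen to keep the example short.

```python
import re

# A small, illustrative subset of rules, in precedence order.
# A type of None means "match and discard" (comments, whitespace).
RULES = [
    (re.compile(r"#.+"), None),
    (re.compile(r"\d*\.\d+"), "float"),
    (re.compile(r"\d+"), "int"),
    (re.compile(r"[A-Za-z_][A-Za-z_0-9]*"), "identifier"),
    (re.compile(r"\s+"), None),
]


def lex(code):
    """Yield (type, content) pairs, trying each rule in order at the current position."""
    pos = 0
    while pos < len(code):
        for pattern, token_type in RULES:
            match = pattern.match(code, pos)
            if match:
                if token_type is not None:
                    yield (token_type, match.group())
                pos = match.end()  # break off the matched section
                break
        else:
            raise SyntaxError("no rule matches at position %d" % pos)


# The generator is consumed in one go, as described above.
tokens = list(lex("x1 3.5 42 # trailing comment"))
```

Because the float rule is tried before the integer rule, `3.5` comes out as one token rather than being split at the decimal point.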

The rules are, in the particular order in which the lexer matches code:

Regex                   Type                           Description
#.+                     Comment                        If this is matched, skip the rest of the line
/\*([^*]|\*[^/])*\*/    Comment                        If this is matched, skip everything in it (/* ... */)
\d*\.\d+j               Literal Complex Number         If it is in the form <0+ digits>.<1+ digits>j, then it is a complex literal. Note that the 2.3j in 2.3ja is still matched as a complex literal
\d+j                    Literal Complex Number         Same as above, except for an integer multiple of the imaginary unit
\d*\.\d+                Literal Floating Point Number  Digits before the decimal point are optional, but at least one digit must follow it (this avoids a bug that I can no longer recall)
\d+                     Literal Integer                If it's not a complex number, a rational, or a floating point number, and it has a bunch of digits, it's probably an integer. Note that a trailing space is not necessary
"([^"\\]|\\.)*"         Literal string                 A quote, then any number of characters that are neither quotes nor backslashes, or backslashes followed by any character, then a closing quote. The string is evaluated using ast.literal_eval, so strings work just like in Python. Actual newlines can be present in strings (no triple quotes needed)
'([^'\\]|\\.)*'         Literal string                 Same as above, but with single quotes
"([^"\\]|\\.)*          UnclosedStringError            Raises an error if the string is unclosed. Used by the shell to determine when to continue code across multiple lines
'([^'\\]|\\.)*          UnclosedStringError            Same as above, but with single quotes
(no regex)              Keyword                        Keywords are matched here. See the keywords subsection below on this page
[A-Za-z_][A-Za-z_0-9]*  Identifier                     If it starts with a letter or an underscore and is followed by any number of underscores, letters, or digits, then it's an identifier, unless it's an operator. See the operators subsection below on this page
(;|,|\?|:>|->|=>)       Special Statements             These are simple tokens whose functionality need not be explained until later
(no regex)              Operator                       Operators are matched here (if not already by the Identifier section). See the operators subsection below on this page
[\(\)\[\]\{\}]          Bracket                        These single-byte tokens become important in the parser
\s+                     Whitespace                     If it matches whitespace, skip it. Like a comment, whitespace is not actually tokenized
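To make the ordering of the literal rules concrete, the sketch below applies only the numeric and string regexes from the table, in the listed order, to the start of a piece of code. The helper `first_token`, the type strings, and the stand-in `UnclosedStringError` class are assumptions for this example; the real lexer may structure this differently.

```python
import ast
import re


class UnclosedStringError(Exception):
    """Stand-in for the error named in the table; the real class may differ."""


# Numeric and string rules only, in the order given in the table.
RULES = [
    (r"\d*\.\d+j", "complex"),
    (r"\d+j", "complex"),
    (r"\d*\.\d+", "float"),
    (r"\d+", "int"),
    (r'"([^"\\]|\\.)*"', "string"),
    (r"'([^'\\]|\\.)*'", "string"),
    (r'"([^"\\]|\\.)*', UnclosedStringError),
    (r"'([^'\\]|\\.)*", UnclosedStringError),
]


def first_token(code):
    """Return (type, value) for the first rule that matches the start of code."""
    for regex, kind in RULES:
        match = re.match(regex, code)
        if match is None:
            continue
        text = match.group()
        if kind is UnclosedStringError:
            # Only reached when neither closed-string rule matched.
            raise UnclosedStringError(text)
        if kind == "string":
            # Escape sequences are resolved the way Python resolves them.
            return (kind, ast.literal_eval(text))
        return (kind, text)
    return None
```

With these rules, `first_token("2.3ja")` returns `("complex", "2.3j")`, leaving the trailing `a` for a later rule, and `first_token("'oops")` raises `UnclosedStringError`, which is the signal a shell could use to keep reading input.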

Keywords are special words that mean things, like for and if. This is a list of all keywords, followed by the meaning of those that are not found in most mainstream programming languages:

if, else, unless, while, for, try, except, exist not, exist, exists not, exists, break, continue, import, include, as, from, to, by, timeof

unless - essentially a backwards if statement. More in the Control Flow section
exist not, exist, exists not, exists - check whether a variable is defined; if the argument is a literal, the result is True, since literals are always defined. More in the Unique Features section
timeof - time how long an expression takes to be evaluated. More in the Unique Features section
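Since the keyword step has no regex in the table, here is one way it could be done, purely as a sketch. Trying longer keywords first and requiring a word boundary afterwards are assumptions of this example (they keep `exists not` from being cut short and stop `format` from matching `for`); Proton's own keyword matching may work differently. The name `match_keyword` is hypothetical.

```python
import re

KEYWORDS = [
    "if", "else", "unless", "while", "for", "try", "except",
    "exist not", "exist", "exists not", "exists",
    "break", "continue", "import", "include", "as", "from", "to", "by", "timeof",
]

# Longest alternatives first, so "exists not" is tried before "exists" and "exist".
_alternatives = sorted(KEYWORDS, key=len, reverse=True)

# The trailing \b stops a keyword from matching the start of a longer identifier.
KEYWORD_RE = re.compile("(" + "|".join(_alternatives) + r")\b")


def match_keyword(code, pos=0):
    """Return the keyword starting at pos, or None if there is none."""
    match = KEYWORD_RE.match(code, pos)
    return match.group(1) if match else None
```

For example, `match_keyword("exists not x")` gives `"exists not"`, while `match_keyword("format")` gives `None` and the text falls through to the identifier rule.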

Operators are things that join or modify expressions to compute things. This is a list of all operators, in descending order of precedence:

  • .
  • **
  • >>, <<
  • *, /, //
  • %
  • +, -
  • >, <, <=, >=
  • &
  • |
  • ^
  • **=, *=, /=, //=, +=, -=, >>=, <<=, %=, &=, |=, &&=, ||=
  • ==, !=, :=, =, =:
  • &&, and
  • ||, or
  • in, not in, is, are, is not, are not

is, are, is not, and are not check whether or not an object is of a certain type.
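The precedence list above can be written down as data, which also gives a simple way to match symbolic operators. This is a sketch under assumptions: the names `PRECEDENCE`, `match_operator`, and `level_of` are invented here, the last level is read as in, not in, is, are, is not, are not, and longest-first matching is one plausible strategy rather than a description of Proton's code.

```python
# Precedence levels as listed above, highest first.
PRECEDENCE = [
    ["."],
    ["**"],
    [">>", "<<"],
    ["*", "/", "//"],
    ["%"],
    ["+", "-"],
    [">", "<", "<=", ">="],
    ["&"],
    ["|"],
    ["^"],
    ["**=", "*=", "/=", "//=", "+=", "-=", ">>=", "<<=", "%=", "&=", "|=", "&&=", "||="],
    ["==", "!=", ":=", "=", "=:"],
    ["&&", "and"],
    ["||", "or"],
    ["in", "not in", "is", "are", "is not", "are not"],
]

# Word operators (and, or, in, ...) look like identifiers, so only the
# symbolic ones are matched here, longest first so "**=" beats "**" and "*".
SYMBOLIC = sorted(
    (op for level in PRECEDENCE for op in level if not op[0].isalpha()),
    key=len,
    reverse=True,
)


def match_operator(code, pos=0):
    """Return the longest symbolic operator starting at pos, or None."""
    for op in SYMBOLIC:
        if code.startswith(op, pos):
            return op
    return None


def level_of(op):
    """Return the precedence level of op; a lower index binds more tightly."""
    for index, level in enumerate(PRECEDENCE):
        if op in level:
            return index
    raise KeyError(op)
```

Here `match_operator("**= 2")` returns `"**="` rather than `"**"` or `"*"`, and `level_of("**") < level_of("+")` holds because exponentiation appears earlier in the list.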
