Stateful Lexical Analyzer Compiler.
A slex file describes a stateful scanner which can be compiled into an Erlang module for tokenizing text.
A slex file consists of:
- Attributes describing properties of the scanner.
- Rules for tokenizing the text.
- Token transformation definitions.
- Additional code, not affecting the compilation of the scanner.
The slex syntax is rather basic.
Identifiers start with a letter, followed by any number of alphanumeric or underscore characters.
Example: foo123.
Integers are positive integers only.
Example: 123.
Strings are either quoted, using single or double quotes (i.e. ' or "), or, in the case of a "symbols only" string, the quotes can be omitted. Symbols are those characters that don't fit in any other type.
Backslash is used to escape the next character. A few characters have special meaning when escaped (e.g. r, n, t and s); any other escaped character is simply included in the string as is.
Example strings:
"foo bar"
'another string'
!@#$
{{\ }}
In the last string, note how the space is escaped so the string doesn't end there.
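The escaping rule above can be illustrated with a small Erlang sketch. The module name slex_unescape_sketch is hypothetical and is not part of slex. The mapping of r, n, t and s to carriage return, newline, tab and space is an assumption based on the usual Erlang escape sequences, since the text only says that these characters have special meaning.

```erlang
-module(slex_unescape_sketch).
-export([unescape/1]).

%% Hypothetical helper, not taken from the slex sources.
%% Replaces each backslash escape in a string with the character it denotes.
unescape([$\\, C | Rest]) -> [escaped(C) | unescape(Rest)];
unescape([C | Rest])      -> [C | unescape(Rest)];
unescape([])              -> [].

%% Assumed meaning of the special characters (same as in Erlang strings).
escaped($r) -> $\r;
escaped($n) -> $\n;
escaped($t) -> $\t;
escaped($s) -> $\s;
%% Any other escaped character is included as is.
escaped(C)  -> C.
```

With this sketch, the unquoted symbols string {{\ }} from the examples above becomes the five character string "{{ }}".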
Comments start with %% and run until the end of the line.
Example: %% this is a comment.
Comments prefixed with three or more percent signs are included in the generated .erl source file.
Example: %%% comment to be included in the generated source.
A few identifiers have special meaning, and are treated accordingly.
These are:
any
: can be used to indicate "any prefix" or "any state".
skip
: is used to indicate that a rule doesn't affect the scanned tokens.
until
: used to associate a string that ends a state.
There are also a number of symbols that are used as delimiters, and thus have special meaning. In other words, these need to be escaped if used in symbols strings.
These are:
+
: can indicate either that a rule appends to a scanned token, or that a rule's prefix should also match the associated string that ends the state.
-
: used to indicate a state that has no associated ending string.
:
: used as delimiter between the different parts of a rule, or as the end of input prefix marker.
,
: just a delimiter.
.
: marks the end of a definition.
Attributes are prefixed with a dash (-), followed by an identifier and optional arguments, and terminated by a period:
-my_attr foo 123.
Supported attributes:
module
function
init_state
include
Rules make up the main functionality of the scanner. Each rule is applied to the input text, in order of rule priority.
A rule has five parts: prio prefix state [, guard] : body.
- Priority. The rules are matched in order of their priority (lowest number first).
- Prefix. The prefix that should match the input text.
- Current state. The state of the scanner for the rule to apply.
- Guard expression. Optional expression to further refine if the rule applies.
- Rule body, as described below.
Example rule: 10 <? in_text: open_tag, in_tag until ?>.
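Will match on input text of "<?" when the scanner is in state in_text, save an open_tag token, and continue scanning in state in_tag. The until ?> part sets the extra data for the state, which can be used by another rule to detect when this state should end.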
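The priority ordering described above (lowest number first) can be sketched in Erlang as follows. This is only an illustration: the module name slex_prio_sketch and the {Prio, Prefix, State} tuple shape are assumptions made for the example, and do not reflect how the slex compiler represents rules internally.

```erlang
-module(slex_prio_sketch).
-export([order/2]).

%% Hypothetical rule representation: {Prio, Prefix, State}.
%% Combines the rules from a slex file with additional rules and sorts
%% them in ascending priority, so that lower numbers are tried first.
order(FileRules, ExtraRules) ->
    %% lists:keysort/2 is stable, so rules with equal priority keep
    %% their relative order in this sketch.
    lists:keysort(1, FileRules ++ ExtraRules).
```

For example, calling order/2 with the rules at priority 10 and 20 and one additional rule at priority 5 places the additional rule first.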
- priority
The priority makes it possible to inject new rules, passed in to the compiler with the extra_data option, among the existing rules.
All rules are sorted in ascending priority when building the scanner function clauses.
- prefix
The prefix is either the keyword any or a string that should match the input text for the rule to apply.
- current state
The state of the scanner that the rule applies to. There are a few tricks in play here to specify the properties of the state, see [states].
In addition to the rules given in [states], if the state is suffixed with a + sign, the state extra data must also match the prefix in order for the rule to apply.
Example, to close the in_tag state given in an earlier example:
20 ?> in_tag+: close_tag, in_text-.
This will match on input text of "?>" when the scanner is in state in_tag with extra data "?>", save a close_tag token, and continue scanning in state in_text with no extra data.
Instead of in_tag+, we could have said any+, and it would then match any state with extra data "?>".
- guard expression (optional)
The guard expression, if provided, should be an Erlang guard expression of the form: expr <guard code...> end.
- body
The body is either a combination of actions and a state transition, separated by a comma (,), or an Erlang expression that implements the rule body, of the form: expr <rule body code...> end.
Actions are any number of tokens to save, or the keyword skip if no tokens are to be saved when this rule applies.
An action token can be an identifier or a string. For identifiers, the saved token is of the form: {<identifier>, {Row, Col}, Prefix}, while for strings it is of the form: {'<string>', {Row, Col}}.
If the identifier is prefixed by a + sign and the last saved token was also an <identifier> token, then Prefix is added to the value of that token instead of adding a new token.
If the identifier is suffixed by -<string>, then the <string> will be used as the value for the token instead of Prefix.
Example actions:
- literal, will save {literal, Pos, Prefix}.
- +string, if previous token was not a string token: {string, Pos, Prefix}, or if previous token was a string, will update it: {string, _, Prefix ++ _} (_ indicates previous value).
- foo 'bar' baz-'moot', will save three tokens: {baz, Pos, "moot"}, {bar, Pos}, {foo, Pos, Prefix}.
After the actions, an optional state transition can be specified by providing a state name and optionally the extra data in the form of a string for the state, using the until keyword.
Example state transitions:
- plain-, switch to state plain, with no extra data.
- my_state, switch to my_state and keep the extra data from the current state.
- custom until 'bail', switch to state custom with extra data "bail".
If a state transition is provided, the actions can be left out entirely, including the separating comma.
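The difference between a plain identifier action and a + prefixed action, as described above, can be sketched in Erlang. The module slex_action_sketch and its functions are hypothetical and only model the token shapes given in the text; the code generated by slex may well look different. The sketch assumes that saved tokens are kept in a list with the most recent token first, and it follows the Prefix ++ _ notation from the example literally.

```erlang
-module(slex_action_sketch).
-export([save/4, append/4]).

%% Hypothetical model of the actions; Tokens holds the saved tokens,
%% most recent first (an assumption made for this sketch).

%% Plain identifier action, e.g. "literal": always saves a new token.
save(Name, Pos, Prefix, Tokens) ->
    [{Name, Pos, Prefix} | Tokens].

%% "+" prefixed action, e.g. "+string": if the last saved token has the
%% same name, update its value and keep its position.
append(Name, _Pos, Prefix, [{Name, Pos0, Value} | Rest]) ->
    [{Name, Pos0, Prefix ++ Value} | Rest];
%% Otherwise it behaves like a plain identifier action.
append(Name, Pos, Prefix, Tokens) ->
    save(Name, Pos, Prefix, Tokens).
```

Note that a string action token, of the form {'<string>', {Row, Col}}, has no value to extend, so in this sketch it never matches the first clause of append/4.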
There are two kinds of states: stateless and stateful. A stateless state doesn't carry any extra information besides its name, while a stateful state carries an extra term, usually a string indicating the text that will end the state (but it can be anything).
When specifying states in slex, a state name by itself refers to a stateful state. To refer to a stateless state, add a dash (-) suffix to the state name.
Rules for post processing tokens.
Erlang code can be supplied in two distinct contexts: at the module level, as forms, or in a rule, as expressions.
At the module level, this allows arbitrary code to be added to the compiled module, in the form of additional attributes, functions, include directives and so on.
In rules, expressions can be used to add custom guards or to provide the implementation body for the rule.
Code can run over multiple lines when newlines are escaped with a backslash.
Note: there can be at most one form/expression on a single line (e.g. only a single end per line is supported).
Example:
form -record(foo, {bar, baz}) end
form foo(Bar) -> {baz, Bar} end
expr H >= $0 andalso H =< $9 end