Documentation in doc/ply.md lacks important details about creating a custom lexer. #299
I appreciate the interest in PLY, but it is not something that I actually work on right now. As such, I have no plans (or time) to work on updates to the documentation or the code. The above description is also describing an out-of-date version of PLY, not the current version of the code as found in GitHub. I'm more than willing to consider minor bug fixes, but they would have to be explicit and point to specific things in the code/docs.
@dabeaz This is describing https://github.com/dabeaz/ply/blob/master/doc/ply.md. Is this not the current version you are referring to? Would anyone like to volunteer to incorporate my OP into doc/ply.md?
The problem is that there are all these references to features that have been long removed.
I had an older version (11/2/20) of the
The OP is now as I want it. The only thing I needed to do was remove the information about optimized usage, which is no longer supported. |
The function `ply.lex.lex()` is used to create a Lexer object; this Lexer is used to tokenize an input string. Readers can use the following information to help them do this.

I had a hard time figuring out how to create my own customized lexer, involving much trial and error. I would like to see doc/ply.md explain things more clearly. The following is my best understanding of how it works; somebody please correct this if it is incorrect.
**Information which `lex()` requires**

- An environment. This is a place where `lex()` finds some of the information it needs. It takes the form of a dict of names. With `lex(module=x)` or `lex(object=x)`, the names are those in `dir(x)`, and the values are the corresponding attributes of `x`. If both keywords are present, `object=x` is used. With neither keyword, the names are those visible at the point of the `lex()` call at the time of the call, along with their corresponding values.
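The environment idea above can be sketched in a few lines of plain Python. This is a hypothetical illustration of the `dir()`/`getattr()` harvesting described, not PLY's actual implementation; `harvest` and `MyLexerSpec` are made-up names.

```python
# Hypothetical sketch of the "environment": lex() effectively sees a
# dict mapping names to values, harvested from a module or object.
def harvest(x):
    """Build the name -> value dict, as for lex(module=x) or lex(object=x)."""
    return {name: getattr(x, name) for name in dir(x)}

class MyLexerSpec:
    tokens = ('NUMBER', 'PLUS')   # required: the token names
    t_PLUS = r'\+'                # a str rule

env = harvest(MyLexerSpec())
assert 'tokens' in env and env['t_PLUS'] == r'\+'
```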
**A description of the lexer's behavior -- names in the environment**

- `tokens` (required) = the names of tokens. This is a tuple or list of `str`s.
- `states` (optional) = the additional states of the state machine. It is a tuple or list of (state name, state type) pairs. Default is `()`. The state type is either `'inclusive'` or `'exclusive'`. `lex()` also adds `('INITIAL', 'inclusive')` to the states; `'INITIAL'` is not allowed in `states`.
- `t_(state_)*token` = a rule. Each rule defines:
  - A regex: the rule itself if it is a `str`, or its docstring if it is a function. When the regex matches the input at the current position, the lexer creates a `LexToken` object, which designates the token type, the matched input text, and the Lexer itself.
  - If the rule is a `str`, that `LexToken` is the result.
  - If the rule is a function, it is called with the `LexToken` as its argument. If the function returns something, it will be the same, or another, `LexToken`; this will be the result of the rule. If the function returns None, then the lexer treats this as though the regex didn't match in the first place.
  - [EDIT] The rule is associated with the following states: the named states, or `INITIAL` if none are named; if `ANY` is one of the named states, then all states; and a rule associated with `INITIAL` is also associated with any other state which is `'inclusive'`.
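The str-rule vs. function-rule behavior above can be illustrated with a toy matcher. This is a minimal sketch of the described semantics, not PLY's code; `apply_rule` and this `LexToken` class are invented for the example.

```python
import re

class LexToken:
    """Toy stand-in for PLY's LexToken: a type and the matched value."""
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

def apply_rule(name, rule, text, pos):
    """Try one t_<name> rule at text[pos:]; return (token or None, new pos)."""
    regex = rule if isinstance(rule, str) else rule.__doc__  # docstring = regex
    m = re.match(regex, text[pos:])
    if not m:
        return None, pos
    tok = LexToken(name, m.group(0))
    if callable(rule):
        tok = rule(tok)        # may return the token, a replacement, or None
        if tok is None:
            return None, pos   # treated as though the regex didn't match
    return tok, pos + m.end()

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

tok, pos = apply_rule('NUMBER', t_NUMBER, '42+1', 0)
assert tok.type == 'NUMBER' and tok.value == 42 and pos == 2
```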
- Special rules, as above, with token name `ignore`, `eof`, or `error` (all optional):
  - `t_(state_)*ignore`. Must be a string. Any character in the string is skipped if it is the input character at the current position.
  - `t_(state_)*eof`. Must be a function. When there is no more input data, the lexer creates a `LexToken` with a type of 'eof' and a value of ''. If this function exists, it is called. It can provide more data to the lexer and return the next `LexToken`, or it can return None. The result is returned to the caller of `lexer.token()`.
  - `t_(state_)*error`. Must be a function. If all rules fail, the lexer creates a `LexToken` with a type of 'error' and a value of the entire remaining input data. The function must advance the current data position. Any `LexToken` it returns will be returned to the caller of `lexer.token()`. If None is returned, the lexer will start over at the new position. If this function is not defined, an exception is raised.
- `literals` (optional) = a string or other iterable of single-character strings. Each character `c` is like a rule which matches `c`, and whose type is `c`. This is NOT qualified by any state names; thus these rules apply equally, whatever state the lexer is in at the time.
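The `literals` behavior above amounts to one-character rules whose token type is the character itself. A minimal sketch of that idea (not PLY's code; `match_literal` is a made-up helper):

```python
# Each character in `literals` acts like a rule matching itself, with the
# token type equal to the character, in every lexer state.
literals = '+-()'

def match_literal(text, pos):
    """Return (token dict, new pos) if text[pos] is a literal, else (None, pos)."""
    c = text[pos]
    if c in literals:
        return {'type': c, 'value': c}, pos + 1
    return None, pos

tok, pos = match_literal('+x', 0)
assert tok == {'type': '+', 'value': '+'} and pos == 1
```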
[EDIT] I originally had a section about optimized `lex.lex()`, which could simply write out a lot of internal variables to a file, and at a later time read them back in. That was based on a 2002 version of this repo. This feature is no longer part of the product; therefore, I deleted the entire section.

**Getting the next token**
The next token can be obtained either by calling `Lexer.token(self) -> LexToken | None`, or by iteration, as in `for tok in lexer:`.

The lexer applies the rules which are associated with the current state, in the following order, until one succeeds. After a successful rule, it moves the current input position past the matched text and returns the `LexToken` returned by the rule. If all of the rules fail, `token()` returns None.

1. `t_ignore`, if defined. If the next input character is in it, that character is skipped and the input position moves to the next character; `token()` then starts over at the new position.
2. `t_eof()`, if defined and there is no more input. If it fails, or is not defined, `token()` returns None.
3. The `t_` rules. Note: the order of function rules is determined by the line number in the module containing the function definition, so to ensure the correct order, all function rules should come from the same module.
4. `literals`.
5. `t_error()`, if defined.
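The line-number ordering noted in step 3 can be seen with plain Python introspection: a function's definition line is available as `func.__code__.co_firstlineno`. A hypothetical sketch of how a PLY-style framework could order function rules (not PLY's actual code):

```python
# Two function rules: the more specific t_IF is defined first, so sorting
# by co_firstlineno tries it before the more general t_ID.
def t_IF(t):
    r'if'
    return t

def t_ID(t):
    r'[a-z]+'
    return t

rules = [t_ID, t_IF]
rules.sort(key=lambda f: f.__code__.co_firstlineno)
assert [f.__name__ for f in rules] == ['t_IF', 't_ID']
```

This is why mixing function rules from several modules breaks the ordering: line numbers are only comparable within one file.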
**Changing Lexer state**

The Lexer is always in one of the states found in the environment's `states`. It starts out in the `'INITIAL'` state. The current state (a `str`) is available as `lexer.current_state()`. The state may be changed by one of the following methods:

- `lexer.begin(state: str = 'INITIAL') -> None`: changes the current state.
- `lexer.push_state(state: str) -> None`: remembers the present state, then changes the current state.
- `lexer.pop_state() -> None`: changes the current state back to the one saved by the matching `push_state()`.

The methods can be called outside of a `lexer.token()` call, or within any rule function (which is called if its docstring matches the current input). In a rule function `rule(t: LexToken)`, the lexer is found in `t.lexer`.
**Changing or skipping input data**

The lexer has a `str` of input data, and a current position within that string.

- `lexer.input(data: str)` sets the input data, and the current position to 0. This call is required before any tokens can be obtained. It is possible to call `lexer.input()` more than once; typically this would be done in a `t_eof()` rule when the input data is only available in pieces.
- `lexer.lexdata` and `lexer.lexpos` provide the current data and position.
- `lexer.skip(n: int)` adds `n` to the current position, so that the intervening characters are completely ignored.

When a rule's regex is matched, `lexer.lexpos` is moved to the end of the matched data before any function rule is called.
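The data/position model above can be sketched as a tiny class, assuming only what the section states (a `lexdata` string, a `lexpos` index, and `skip()` adding to it). This is an illustration, not PLY's Lexer:

```python
class InputLexer:
    """Toy model of the input-data interface: input(), lexdata, lexpos, skip()."""
    def input(self, data):
        self.lexdata = data   # the full input string
        self.lexpos = 0       # current position, reset on every input() call

    def skip(self, n):
        self.lexpos += n      # intervening characters are simply ignored

lx = InputLexer()
lx.input('abc def')
lx.skip(4)                    # jump past 'abc '
assert lx.lexdata[lx.lexpos:] == 'def'
```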