
Documentation in doc/ply.md lacks important details about creating a custom lexer. #299

Closed
mrolle45 opened this issue Jan 10, 2024 · 6 comments

mrolle45 commented Jan 10, 2024

The function ply.lex.lex() creates a Lexer object, which is then used to tokenize an input string.
Readers can use the following information to help them do this.

I had a hard time figuring out how to create my own customized lexer; it involved much trial and error. I would like to see doc/ply.md explain things more clearly. The following is my best understanding of how it works. Somebody please correct this if it is incorrect.

Information which lex() requires.

An environment.

This is a place where lex() finds some of the information it needs. It takes the form of a dict of names.

  • If called with the keyword module=x or object=x, the names are those in dir(x), and the values are the corresponding attributes of x. If both keywords are present, object=x takes precedence. (Both forms are shown in the sketch after this list.)
  • If called without either of these keywords, the names are all those visible at the location of the lex() call at the time of the call, along with their corresponding values.
    • If the name is a local or nonlocal variable with an assigned value, that value is used.
    • Otherwise the name is a global variable, and its value is used.
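Here is a minimal sketch of both forms; the token name NUMBER and the rules in it are illustrative, not part of the API:

```python
import ply.lex as lex

# Implicit environment: lex() collects the names visible at the
# point of the call -- here, this module's globals.
tokens = ('NUMBER',)
t_NUMBER = r'\d+'
t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

# Explicit environment: pass an object; the names come from dir(x).
# Rule functions then become methods taking (self, t).
class MyLexer:
    tokens = ('NUMBER',)
    t_NUMBER = r'\d+'
    t_ignore = ' \t'

    def t_error(self, t):
        t.lexer.skip(1)

object_lexer = lex.lex(object=MyLexer())
```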

A description of the lexer's behavior -- names in the environment.

  1. tokens (required) = the names of tokens. This is a tuple or list of strings.

  2. states (optional) = the additional states of the state machine, as a tuple or list of (state name, state type) pairs. Default is (). The state type is either 'inclusive' or 'exclusive'. lex() also adds ('INITIAL', 'inclusive') to the states; 'INITIAL' itself is not allowed in states.

  3. t_(state_)*token = a rule. The name consists of t_, zero or more state-name prefixes (each followed by _), and a token name. (A complete sketch combining items 1-5 follows this list.)
    Each rule defines:

    • A regular expression, or regex, as used by Python's re module. When the lexer applies this rule, it matches the regex to the current position of the input string. If there is a match, it constructs a LexToken object, which designates the token type, the matched input text, and the Lexer itself.
    • What to do if there is a match. This depends on what kind of object the rule is.
      • If it is a string, the regex is that string, and the LexToken is the result.
      • If it is a callable object, the regex is its doc string (required), and it is called with the LexToken as its argument.
        If the function returns something, it will be the same, or another, LexToken. This will be the result of the rule.
        If the function returns None, then the lexer treats this as though the regex didn't match in the first place.

    The rule is associated with the following states:

    • Every state named in the rule name.
    • If no state is specified, INITIAL is used.
    • If ANY is one of the named states, the rule is associated with all states.
    • If the rule is associated with INITIAL, it is also associated with every state of 'inclusive' type.
  4. Special rules, as above, but with the token name ignore, eof, or error (all optional).

  • t_(state_)*ignore (optional). Must be a string. If the input character at the current position appears in this string, it is skipped.
  • t_(state_)*eof (optional). Must be a function. When there is no more input data, the lexer creates a LexToken with a type of 'eof' and a value of '', and calls this function with it (if it exists). The function can supply more data to the lexer and return the next LexToken, or it can return None. The result is returned to the caller of lexer.token().
  • t_(state_)*error (optional). Must be a function. If all rules fail, the lexer creates a LexToken with a type of 'error' and a value of the entire remaining input data, and calls this function with it. The function must advance the current data position (e.g. with t.lexer.skip()). Any LexToken it returns is returned to the caller of lexer.token(); if it returns None, the lexer starts over at the new position. If this function is not defined, an exception is raised instead.
  5. literals (optional) = a string or other iterable of single-character strings. Each character c acts like a rule that matches c and whose token type is c.
    This is NOT qualified by any state names; these rules apply equally, whatever state the lexer is in at the time.
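To make the five items above concrete, here is a minimal sketch of a complete lexer description. The token names, state name, and input are all illustrative:

```python
import ply.lex as lex

tokens = ('NUMBER', 'ID')        # 1. required token names

states = (                       # 2. lex() adds ('INITIAL', 'inclusive') itself
    ('comment', 'exclusive'),
)

t_ID = r'[A-Za-z_]\w*'           # 3. string rule: the token type is 'ID'

def t_NUMBER(t):                 # 3. function rule: the regex is the docstring
    r'\d+'
    t.value = int(t.value)       # the function may transform the LexToken
    return t

def t_comment(t):                # an INITIAL rule; returning None emits no token
    r'\#\*'
    t.lexer.begin('comment')     # enter the exclusive 'comment' state

def t_comment_end(t):            # applies only in the 'comment' state
    r'\*\#'
    t.lexer.begin('INITIAL')

def t_comment_body(t):           # consume comment text, emit nothing
    r'[^*]+|\*(?!\#)'

t_ignore = ' \t'                 # 4. skipped in INITIAL (and inclusive states)
t_comment_ignore = ''            # an exclusive state needs its own ignore string

def t_ANY_error(t):              # 4. ANY associates the rule with every state
    print(f'Illegal character {t.value[0]!r}')
    t.lexer.skip(1)

literals = '+-*/'                # 5. one-char tokens, active in every state

lexer = lex.lex()
lexer.input('12 + ab #* skipped *# 34')
for tok in lexer:
    print(tok.type, tok.value)   # NUMBER 12, '+' '+', ID ab, NUMBER 34
```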

[EDIT]

I originally had a section about optimized lex.lex(), which could simply write out a lot of internal variables to a file, and at a later time read them back in. That was based on a 2002 version of this repo. This feature is no longer part of the product. Therefore, I deleted the entire section.

Getting the next token.

The next token can be obtained either by calling
Lexer.token(self) -> LexToken | None
or by iteration, as in for tok in lexer:.
The lexer applies the rules associated with the current state, in the following order, until one succeeds (the ordering is demonstrated in the sketch after the list below).
After a successful rule, token() moves the current input position past the matched text and returns the LexToken produced by the rule.
If all of the rules fail, it returns None.

  1. t_ignore, if defined. If the next input character appears in the ignore string, it is skipped and the input position moves to the next character. token() then starts over at the new position.
  2. t_eof(), if defined and there is no more input. If it returns None, or is not defined, token() returns None.
  3. Function rules, in the order they are defined.
    Note, the order is determined by the line number in the module containing the function definition. So to ensure the correct order, all function rules should come from the same module.
  4. String rules, in order of decreasing regex length. This ensures that a rule matching a longer input wins over one matching only a prefix of it.
  5. Any single character in literals.
  6. t_error(), if defined.
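Here is a small sketch of how this ordering plays out; the token names are illustrative:

```python
import ply.lex as lex

tokens = ('IF', 'ID', 'EQ', 'ASSIGN')

t_ignore = ' '

# Function rules are tried first, in definition order, so t_IF
# claims the keyword before the more general t_ID sees it.
def t_IF(t):
    r'if\b'
    return t

def t_ID(t):
    r'[A-Za-z_]\w*'
    return t

# String rules come after all function rules, longest regex first,
# so '==' wins over '=' even though t_ASSIGN is defined first.
t_ASSIGN = r'='
t_EQ = r'=='

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('if x == y')
while True:
    tok = lexer.token()          # None once the input is exhausted
    if tok is None:
        break
    print(tok.type, tok.value)   # IF if, ID x, EQ ==, ID y
```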

Changing Lexer state

The Lexer is always in one of the states declared via states (see above). It starts out in the 'INITIAL' state. The current state (a str) is available as lexer.current_state().
The state may be changed by one of the following methods.

  • lexer.begin(state: str = 'INITIAL') -> None: Changes the current state.
  • lexer.push_state(state: str) -> None: remembers the present state, then changes the current state.
  • lexer.pop_state() -> None: Changes the current state back to the one saved by the matching push_state() call.

These methods can be called outside of a lexer.token() call, or within any rule function (which is called when its regex matches the current input). In a rule function rule(t: LexToken), the lexer is available as t.lexer.
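For example, push_state() and pop_state() make nested constructs straightforward; this sketch (with illustrative names) tracks brace nesting:

```python
import ply.lex as lex

tokens = ('TEXT', 'CODE')
states = (('code', 'exclusive'),)

def t_lbrace(t):                   # in INITIAL: '{' opens a code block
    r'\{'
    t.lexer.push_state('code')     # remember INITIAL, switch to 'code'

def t_code_lbrace(t):              # a nested '{' pushes 'code' again
    r'\{'
    t.lexer.push_state('code')

def t_code_rbrace(t):              # '}' returns to the saved state
    r'\}'
    t.lexer.pop_state()

def t_code_body(t):
    r'[^{}]+'
    t.type = 'CODE'
    return t

def t_TEXT(t):
    r'[^{}]+'
    return t

t_code_ignore = ''

def t_ANY_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('a { b { c } d } e')
print(lexer.current_state())       # INITIAL
for tok in lexer:
    print(tok.type, tok.value)
```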

Changing or skipping input data

The lexer has a str of input data, and a current position within that string.
lexer.input(data: str) sets the input data, and the current position to 0. This call is required before any tokens can be obtained.
It is possible to call lexer.input() more than once. Typically this is done in a t_eof() rule when the input data only becomes available in pieces.
lexer.lexdata and lexer.lexpos provide the current data and position.
lexer.skip(n: int) adds n to the current position, so that intervening characters are completely ignored.
When a rule's regex is matched, lexer.lexpos is moved to the end of the matched data, before any function rule is called.
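Putting these together, here is a sketch of feeding input in pieces from a t_eof() rule and skipping bad characters; the chunks list stands in for whatever source delivers more data:

```python
import ply.lex as lex

tokens = ('WORD',)
t_WORD = r'\w+'
t_ignore = ' '

chunks = ['gamma delta']              # more data, delivered on demand

def t_eof(t):
    # lexpos has reached the end of lexdata.  Supplying more data
    # and asking for the next token keeps the stream going.
    if chunks:
        t.lexer.input(chunks.pop(0))  # resets lexdata and lexpos to 0
        return t.lexer.token()
    return None                       # really done

def t_error(t):
    t.lexer.skip(1)                   # move lexpos past the offending character

lexer = lex.lex()
lexer.input('alpha beta')             # required before the first token
print([tok.value for tok in lexer])   # ['alpha', 'beta', 'gamma', 'delta']
```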

mrolle45 changed the title from "Documentation in doc/ply.md lacks some important details." to "Documentation in doc/ply.md lacks important details about creating a custom lexer." on Jan 10, 2024
dabeaz (Owner) commented Jan 10, 2024

I appreciate the interest in PLY, but it is not something that I actually work on right now. As such, I have no plans (or time) to work on updates to the documentation or the code. The above description is also describing an out of date version of PLY, not the current version of code as found in GitHub. I'm more than willing to consider minor bug fixes, but they would have to be explicit and point to specific things in the code/docs.

mrolle45 (Author) commented

@dabeaz This is describing https://github.com/dabeaz/ply/blob/master/doc/ply.md. Is this not the current version you are referring to?

Would anyone like to volunteer to incorporate my OP into doc/ply.md?

dabeaz (Owner) commented Jan 11, 2024

The problem is that there are all these references to features that have been long removed. optimize, outdir, lextab, etc. None of that exists in the current codebase. As such, it doesn't appear in the docs either. I don't know what version of PLY you're working with, but it doesn't seem to match what's currently in GitHub.


mrolle45 (Author) commented Jan 12, 2024

I had an older version (11/2/20) of the ply source code and the current version of ply.md. My issue is with the current version of ply.md, but some things no longer apply.
I now have the current version of the ply source code. I'll edit the OP soon, after I look at it and make sure that I can get my own application working with it.

mrolle45 (Author) commented

The OP is now as I want it. The only thing I needed to do was remove the information about optimized usage, which is no longer supported.

dabeaz closed this as completed Mar 31, 2024