
Documentation in doc/ply.md lacks important details about creating a custom lexer. #299

Closed
mrolle45 opened this issue Jan 10, 2024 · 6 comments

mrolle45 commented Jan 10, 2024

The function ply.lex.lex() creates a Lexer object, which is then used to tokenize an input string.
Readers can use the following information to help them do this.

I had a hard time figuring out how to create my own customized lexer; it involved much trial and error. I would like to see doc/ply.md explain things more clearly. The following is my best understanding of how it works. Somebody please correct this if it is incorrect.

Information which lex() requires.

An environment.

This is a place where lex() finds some of the information it needs. It takes the form of a dict of names.

  • If called with the keyword module=x or object=x, the names are those in dir(x), and the values are the corresponding attributes of x. If both keywords are present, object=x takes precedence. (Both forms are shown in the sketch after this list.)
  • If called without either of these keywords, the names are all those visible at the location of the lex() call at the time of the call, along with their corresponding values.
    • If the name is a local or nonlocal variable with an assigned value, that value is used.
    • Otherwise the name is a global variable, and its value is used.
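Here is a minimal sketch of both forms; the token name NUMBER and the rules in it are illustrative, not part of the API:

```python
import ply.lex as lex

# Implicit environment: lex() collects the names visible at the
# point of the call -- here, this module's globals.
tokens = ('NUMBER',)
t_NUMBER = r'\d+'
t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

# Explicit environment: pass an object; the names come from dir(x).
# Rule functions then become methods taking (self, t).
class MyLexer:
    tokens = ('NUMBER',)
    t_NUMBER = r'\d+'
    t_ignore = ' \t'

    def t_error(self, t):
        t.lexer.skip(1)

object_lexer = lex.lex(object=MyLexer())
```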

A description of the lexer's behavior -- names in the environment.

  1. tokens (required) = the names of tokens. This is a tuple or list of strings.

  2. states (optional) = the additional states of the state machine, as a tuple or list of (state name, state type) pairs. Default is (). The state type is either 'inclusive' or 'exclusive'. lex() also adds ('INITIAL', 'inclusive') to the states; 'INITIAL' itself is not allowed in states.

  3. t_(state_)*token = a rule. The name consists of t_, zero or more state-name prefixes (each followed by _), and a token name. (A complete sketch combining items 1-5 follows this list.)
    Each rule defines:

    • A regular expression, or regex, as used by Python's re module. When the lexer applies this rule, it matches the regex to the current position of the input string. If there is a match, it constructs a LexToken object, which designates the token type, the matched input text, and the Lexer itself.
    • What to do if there is a match. This depends on what kind of object the rule is.
      • If it is a string, the regex is that string, and the LexToken is the result.
      • If it is a callable object, the regex is its doc string (required), and it is called with the LexToken as its argument.
        If the function returns something, it will be the same, or another, LexToken. This will be the result of the rule.
        If the function returns None, then the lexer treats this as though the regex didn't match in the first place.

    The rule is associated with the following states:

    • Every state named in the rule name.
    • If no state is specified, INITIAL is used.
    • If ANY is one of the named states, the rule is associated with all states.
    • If the rule is associated with INITIAL, it is also associated with every state of 'inclusive' type.
  4. Special rules, as above, but with the token name ignore, eof, or error (all optional).

  • t_(state_)*ignore (optional). Must be a string. If the input character at the current position appears in this string, it is skipped.
  • t_(state_)*eof (optional). Must be a function. When there is no more input data, the lexer creates a LexToken with a type of 'eof' and a value of '', and calls this function with it (if it exists). The function can supply more data to the lexer and return the next LexToken, or it can return None. The result is returned to the caller of lexer.token().
  • t_(state_)*error (optional). Must be a function. If all rules fail, the lexer creates a LexToken with a type of 'error' and a value of the entire remaining input data, and calls this function with it. The function must advance the current data position (e.g. with t.lexer.skip()). Any LexToken it returns is returned to the caller of lexer.token(); if it returns None, the lexer starts over at the new position. If this function is not defined, an exception is raised instead.
  5. literals (optional) = a string or other iterable of single-character strings. Each character c acts like a rule that matches c and whose token type is c.
    This is NOT qualified by any state names; these rules apply equally, whatever state the lexer is in at the time.
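To make the five items above concrete, here is a minimal sketch of a complete lexer description. The token names, state name, and input are all illustrative:

```python
import ply.lex as lex

tokens = ('NUMBER', 'ID')        # 1. required token names

states = (                       # 2. lex() adds ('INITIAL', 'inclusive') itself
    ('comment', 'exclusive'),
)

t_ID = r'[A-Za-z_]\w*'           # 3. string rule: the token type is 'ID'

def t_NUMBER(t):                 # 3. function rule: the regex is the docstring
    r'\d+'
    t.value = int(t.value)       # the function may transform the LexToken
    return t

def t_comment(t):                # an INITIAL rule; returning None emits no token
    r'\#\*'
    t.lexer.begin('comment')     # enter the exclusive 'comment' state

def t_comment_end(t):            # applies only in the 'comment' state
    r'\*\#'
    t.lexer.begin('INITIAL')

def t_comment_body(t):           # consume comment text, emit nothing
    r'[^*]+|\*(?!\#)'

t_ignore = ' \t'                 # 4. skipped in INITIAL (and inclusive states)
t_comment_ignore = ''            # an exclusive state needs its own ignore string

def t_ANY_error(t):              # 4. ANY associates the rule with every state
    print(f'Illegal character {t.value[0]!r}')
    t.lexer.skip(1)

literals = '+-*/'                # 5. one-char tokens, active in every state

lexer = lex.lex()
lexer.input('12 + ab #* skipped *# 34')
for tok in lexer:
    print(tok.type, tok.value)   # NUMBER 12, '+' '+', ID ab, NUMBER 34
```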

[EDIT]

I originally had a section about optimized lex.lex(), which could simply write out a lot of internal variables to a file, and at a later time read them back in. That was based on a 2002 version of this repo. This feature is no longer part of the product. Therefore, I deleted the entire section.

Getting the next token.

The next token can be obtained either by calling
Lexer.token(self) -> LexToken | None
or by iteration, as in for tok in lexer:.
The lexer applies the rules associated with the current state, in the following order, until one succeeds (the ordering is demonstrated in the sketch after the list below).
After a successful rule, token() moves the current input position past the matched text and returns the LexToken produced by the rule.
If all of the rules fail, it returns None.

  1. t_ignore, if defined. If the next input character appears in the ignore string, it is skipped and the input position moves to the next character. token() then starts over at the new position.
  2. t_eof(), if defined and there is no more input. If it returns None, or is not defined, token() returns None.
  3. Function rules, in the order they are defined.
    Note, the order is determined by the line number in the module containing the function definition. So to ensure the correct order, all function rules should come from the same module.
  4. String rules, in order of decreasing regex length. This ensures that a rule matching a longer input wins over one matching only a prefix of it.
  5. Any single character in literals.
  6. t_error(), if defined.
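Here is a small sketch of how this ordering plays out; the token names are illustrative:

```python
import ply.lex as lex

tokens = ('IF', 'ID', 'EQ', 'ASSIGN')

t_ignore = ' '

# Function rules are tried first, in definition order, so t_IF
# claims the keyword before the more general t_ID sees it.
def t_IF(t):
    r'if\b'
    return t

def t_ID(t):
    r'[A-Za-z_]\w*'
    return t

# String rules come after all function rules, longest regex first,
# so '==' wins over '=' even though t_ASSIGN is defined first.
t_ASSIGN = r'='
t_EQ = r'=='

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('if x == y')
while True:
    tok = lexer.token()          # None once the input is exhausted
    if tok is None:
        break
    print(tok.type, tok.value)   # IF if, ID x, EQ ==, ID y
```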

Changing Lexer state

The Lexer is always in one of the states declared via states (see above). It starts out in the 'INITIAL' state. The current state (a str) is available as lexer.current_state().
The state may be changed by one of the following methods.

  • lexer.begin(state: str = 'INITIAL') -> None: Changes the current state.
  • lexer.push_state(state: str) -> None: remembers the present state, then changes the current state.
  • lexer.pop_state() -> None: Changes the current state back to the one saved by the matching push_state() call.

These methods can be called outside of a lexer.token() call, or within any rule function (which is called when its regex matches the current input). In a rule function rule(t: LexToken), the lexer is available as t.lexer.
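For example, push_state() and pop_state() make nested constructs straightforward; this sketch (with illustrative names) tracks brace nesting:

```python
import ply.lex as lex

tokens = ('TEXT', 'CODE')
states = (('code', 'exclusive'),)

def t_lbrace(t):                   # in INITIAL: '{' opens a code block
    r'\{'
    t.lexer.push_state('code')     # remember INITIAL, switch to 'code'

def t_code_lbrace(t):              # a nested '{' pushes 'code' again
    r'\{'
    t.lexer.push_state('code')

def t_code_rbrace(t):              # '}' returns to the saved state
    r'\}'
    t.lexer.pop_state()

def t_code_body(t):
    r'[^{}]+'
    t.type = 'CODE'
    return t

def t_TEXT(t):
    r'[^{}]+'
    return t

t_code_ignore = ''

def t_ANY_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('a { b { c } d } e')
print(lexer.current_state())       # INITIAL
for tok in lexer:
    print(tok.type, tok.value)
```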

Changing or skipping input data

The lexer has a str of input data, and a current position within that string.
lexer.input(data: str) sets the input data, and the current position to 0. This call is required before any tokens can be obtained.
It is possible to call lexer.input() more than once. Typically this is done in a t_eof() rule when the input data only becomes available in pieces.
lexer.lexdata and lexer.lexpos provide the current data and position.
lexer.skip(n: int) adds n to the current position, so that intervening characters are completely ignored.
When a rule's regex is matched, lexer.lexpos is moved to the end of the matched data, before any function rule is called.
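Putting these together, here is a sketch of feeding input in pieces from a t_eof() rule and skipping bad characters; the chunks list stands in for whatever source delivers more data:

```python
import ply.lex as lex

tokens = ('WORD',)
t_WORD = r'\w+'
t_ignore = ' '

chunks = ['gamma delta']              # more data, delivered on demand

def t_eof(t):
    # lexpos has reached the end of lexdata.  Supplying more data
    # and asking for the next token keeps the stream going.
    if chunks:
        t.lexer.input(chunks.pop(0))  # resets lexdata and lexpos to 0
        return t.lexer.token()
    return None                       # really done

def t_error(t):
    t.lexer.skip(1)                   # move lexpos past the offending character

lexer = lex.lex()
lexer.input('alpha beta')             # required before the first token
print([tok.value for tok in lexer])   # ['alpha', 'beta', 'gamma', 'delta']
```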

mrolle45 changed the title from "Documentation in doc/ply.md lacks some important details." to "Documentation in doc/ply.md lacks important details about creating a custom lexer." on Jan 10, 2024
dabeaz (Owner) commented Jan 10, 2024

I appreciate the interest in PLY, but it is not something that I actually work on right now. As such, I have no plans (or time) to work on updates to the documentation or the code. The above description is also describing an out of date version of PLY, not the current version of code as found in GitHub. I'm more than willing to consider minor bug fixes, but they would have to be explicit and point to specific things in the code/docs.

mrolle45 (Author) commented

@dabeaz This is describing https://github.com/dabeaz/ply/blob/master/doc/ply.md. Is this not the current version you are referring to?

Would anyone like to volunteer to incorporate my OP into doc/ply.md?

dabeaz (Owner) commented Jan 11, 2024

The problem is that there are all these references to features that have been long removed. optimize, outdir, lextab, etc. None of that exists in the current codebase. As such, it doesn't appear in the docs either. I don't know what version of PLY you're working with, but it doesn't seem to match what's currently in GitHub.


mrolle45 (Author) commented Jan 12, 2024

I had an older version (11/2/20) of the ply source code and the current version of ply.md. My issue is with the current version of ply.md, but some things no longer apply.
I now have the current version of the ply source code. I'll edit the OP soon, after I look at it and make sure that I can get my own application working with it.

mrolle45 (Author) commented

The OP is now as I want it. The only thing I needed to do was remove the information about optimized usage, which is no longer supported.

dabeaz closed this as completed Mar 31, 2024