Antlr python 2 not recognizing the correct end line for class, function, statements and loops #4153
Comments
What is the input? The above input text is illegal Python2 source code. (Try it in https://onecompiler.com/python2/42j94gnxp.)
Further, we cannot tell if you are using |
Thank you kaby76 for suggesting that the input be posted in a triple-backtick quoted block. Adding to Himanshu's issue, I am sharing the code snippet below, where function test() starts at line 1 and ends at line 3 (the print statement), but the ANTLR Python 2.7.18 grammar reports the end line of test() as the start of the next function, greet(), which is at line 5.
|
DEDENT token is placed in the 5th line because it is detected there. It also places the DEDENT token in the 5th line. |
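To make the point about DEDENT placement concrete, here is a minimal Java sketch of an indentation stack. It is a simplified model and not the PythonLexerBase code from the grammar: the class name `DedentSketch`, the method `layoutTokens`, and the `TYPE@line` output format are invented for illustration, and it assumes space-only indentation with no comment-only lines. It shows that a DEDENT can only be emitted when the next less-indented, non-blank line is seen, which for the sample input is line 5.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class DedentSketch {
    // Returns the INDENT/DEDENT tokens for the source as "TYPE@line" strings.
    // Simplified model: spaces only, blank lines ignored, no comment handling.
    static List<String> layoutTokens(String source) {
        List<String> out = new ArrayList<>();
        Deque<Integer> indents = new ArrayDeque<>();
        indents.push(0);
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.trim().isEmpty()) {
                continue; // blank lines never change the indentation level
            }
            int width = 0;
            while (width < line.length() && line.charAt(width) == ' ') {
                width++;
            }
            if (width > indents.peek()) {
                indents.push(width);
                out.add("INDENT@" + (i + 1));
            }
            // A block is closed only when a less-indented line is detected.
            while (width < indents.peek()) {
                indents.pop();
                out.add("DEDENT@" + (i + 1));
            }
        }
        // Close any blocks that are still open at end of input.
        while (indents.peek() > 0) {
            indents.pop();
            out.add("DEDENT@EOF");
        }
        return out;
    }

    public static void main(String[] args) {
        String src = "def test():\n    x = 1\n    print x\n\ndef greet():\n    print 'hi'\n";
        // Prints [INDENT@2, DEDENT@5, INDENT@6, DEDENT@EOF]
        System.out.println(layoutTokens(src));
    }
}
```

In this model the DEDENT that closes test() carries line 5, even though the last statement of the body is on line 3.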
I agree, I'm not sure what the problem is here. Input:
Or in file: xxx.txt. The parse tree is:
The tokens are:
According to the Official Python2 grammar, https://docs.python.org/2.7/reference/grammar.html, a If you want to get the interval for the statements within function "test()", then you have to get the last char of the 2nd
|
The only thing that would be nice to change is the text for the INDENT and DEDENT tokens. They are The "problem" is on the trtext-side of things. trtext reconstructs the text of the input by concatenating the text of the leaves of the parse tree. So, I see |
Thanks for bringing this to my attention. I will fix it in all PythonLexerBase ports:
|
Thanks kaby76 and RobEin for checking on this issue. Waiting for your update if it is fixed in PythonLexerBase for Java. |
On second thought, no repair is needed after all. |
I don't understand. These two statements are inconsistent. The first statement says that the INDENT and DEDENT tokens need to be deleted from the parse tree in order to reconstruct the source. The second statement says that they cannot be deleted because they are essential to reconstruct the source. Currently, I have to delete the INDENT and DEDENT tokens to reconstruct the text because if I don't I get
Text reconstruction in Trash follows a basic concept that has existed in CS since the 1960s: the input text is simply the concatenation of the text of the frontier of the parse tree. The text for INDENT and DEDENT tokens are
NB: trtext outputs an extra newline character because it calls Console.WriteLine() instead of a |
The second statement was about the original Python tokenizer.
Now I understand what the problem is. I can imagine two alternatives in this case:
I recommend the second solution. |
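As an illustration of the frontier-concatenation idea discussed in this exchange, the following Java sketch rebuilds source text from the leaves of a small hand-built tree. Everything in it is hypothetical: `FrontierSketch`, `Node`, and the placeholder leaf texts `<INDENT>` and `<DEDENT>` are invented here and are not what Trash or the grammar actually produce. The only point is that any non-source text carried by INDENT/DEDENT leaves ends up in the reconstruction unless those leaves are skipped or given empty text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FrontierSketch {
    // Hypothetical tree node: leaves carry text, rule nodes carry children.
    static final class Node {
        final String type;
        final String text;
        final List<Node> children;

        Node(String type, String text, List<Node> children) {
            this.type = type;
            this.text = text;
            this.children = children;
        }

        static Node leaf(String type, String text) {
            return new Node(type, text, Collections.<Node>emptyList());
        }

        static Node rule(String type, Node... kids) {
            return new Node(type, "", Arrays.asList(kids));
        }
    }

    // Concatenates leaf text left to right; optionally skips INDENT/DEDENT leaves.
    static String frontierText(Node n, boolean skipLayout) {
        if (n.children.isEmpty()) {
            boolean layout = n.type.equals("INDENT") || n.type.equals("DEDENT");
            return (skipLayout && layout) ? "" : n.text;
        }
        StringBuilder sb = new StringBuilder();
        for (Node child : n.children) {
            sb.append(frontierText(child, skipLayout));
        }
        return sb.toString();
    }

    // Hand-built tree for the source "def f():\n    pass\n" (whitespace kept as leaves).
    static Node sample() {
        return Node.rule("funcdef",
            Node.leaf("TEXT", "def f():"),
            Node.leaf("NEWLINE", "\n"),
            Node.rule("suite",
                Node.leaf("INDENT", "<INDENT>"),   // placeholder text, invented
                Node.leaf("WS", "    "),
                Node.leaf("TEXT", "pass"),
                Node.leaf("NEWLINE", "\n"),
                Node.leaf("DEDENT", "<DEDENT>"))); // placeholder text, invented
    }

    public static void main(String[] args) {
        System.out.print(frontierText(sample(), true));    // the original source
        System.out.println(frontierText(sample(), false)); // polluted by placeholders
    }
}
```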
I didn't understand. Can you explain what has to be changed? Do I need to change any grammar files? We are trying to parse a Python 2.x file using Java. When I tried to print FuncdefContext.suite.getText() for the test() function in this example,
Output:
and the end line for this test() function is 5. Can you tell me what should be done here to get the correct end line? |
Trash doesn't represent the parse tree like Antlr. It incorporates the entire input, including white space and comments. It's done this way so that it's fully serializable, with no loss of text, and fully editable. The way Antlr splits the parse tree from the token stream, and char stream, is unnatural, difficult/slow to serialize and edit. |
Hi, thanks for your response. I understand that you have suggested how to get the text from the ANTLR parse tree. Our use case is to parse an input Python file and identify the start line and end line of each class, function, statement, comment, etc. in the file. While doing so, we are facing an issue fetching the end line from the function and statement contexts (for loops, while loops, ...).
|
The easiest solution would be to just delete the INDENT and DEDENT leaves, then just get the Interval for the sub-tree. But, the Antlr runtime doesn't have tree editing. Instead, do this:
In C#:
I would think so, but don't quote me. |
Hi, thanks for your response. We will check the suggestion you have provided, as our implementation is in Java. For example:
Could you tell us whether this handles the end lines correctly? |
Not quite. Try this. MyListener.java
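For readers who want a self-contained picture of the approach, the sketch below is not the MyListener.java from this thread. It is a minimal Java model of the idea of ignoring layout tokens when computing an end line. The `Tok` class, `endLine`, and the sample token list are hypothetical stand-ins for the ANTLR runtime. With the real runtime, one would presumably start from `ctx.getStop().getTokenIndex()` and step backwards through the token stream with `TokenStream.get(int)`, comparing `Token.getType()` against the grammar's INDENT, DEDENT and NEWLINE types.

```java
import java.util.Arrays;
import java.util.List;

class EndLineSketch {
    // Hypothetical stand-in for an ANTLR token: a type name and a 1-based line.
    static final class Tok {
        final String type;
        final int line;

        Tok(String type, int line) {
            this.type = type;
            this.line = line;
        }
    }

    // Layout tokens carry no visible source text of their own.
    static boolean isLayout(Tok t) {
        return t.type.equals("INDENT") || t.type.equals("DEDENT") || t.type.equals("NEWLINE");
    }

    // Walks backwards from the rule's stop token to its last non-layout token.
    static int endLine(List<Tok> tokens, int startIndex, int stopIndex) {
        for (int i = stopIndex; i >= startIndex; i--) {
            if (!isLayout(tokens.get(i))) {
                return tokens.get(i).line;
            }
        }
        return -1; // the rule contained only layout tokens
    }

    // Invented token list: test() spans lines 1-3, the next def starts on line 5.
    static List<Tok> sampleTokens() {
        return Arrays.asList(
            new Tok("DEF", 1), new Tok("NAME", 1), new Tok("COLON", 1), new Tok("NEWLINE", 1),
            new Tok("INDENT", 2), new Tok("NAME", 2), new Tok("NEWLINE", 2),
            new Tok("PRINT", 3), new Tok("NEWLINE", 3),
            new Tok("DEDENT", 5),  // stop token of the first funcdef (index 9)
            new Tok("DEF", 5));
    }

    public static void main(String[] args) {
        List<Tok> toks = sampleTokens();
        System.out.println("raw stop line: " + toks.get(9).line);            // 5
        System.out.println("adjusted end line: " + endLine(toks, 0, 9));     // 3
    }
}
```

Bounding the walk by the rule's start index keeps the search from wandering into a preceding rule when a context happens to contain nothing but layout tokens.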
|
ANTLR 4 does not recognize the end lines correctly. Below is the link to the grammar file we used with an example
Link: https://github.com/antlr/grammars-v4/tree/master/python/python2_7_18
The start line for test is 1, but the reported end line is the start line of the next function, which is 4.