Antlr python 2 not recognizing the correct end line for class, function, statements and loops #4153

Himanshu-portfolio · 2024-07-05T12:48:01Z

ANTLR 4 does not recognize the end lines correctly. Below is the link to the grammar file we used with an example

Link: https://github.com/antlr/grammars-v4/tree/master/python/python2_7_18

def test ()
print hello
#3
def another ()
print hello 2

The start line for test is 1, but the reported end line is 4, which is the start line of the next function, another.

kaby76 · 2024-07-05T13:13:16Z

def test ()

print hello

#3

def another ()

print hello 2

What is the input? The above input text is illegal Python2 source code. (Try it in https://onecompiler.com/python2/42j94gnxp.)

def test() does not end in a colon.
print hello is not indented within the definition of test().

Further, we cannot tell if you are using \n or \r\n or \n\r newline character sequences. It's only possible to know which if you attach a .txt file. In lieu of that, please edit the above comment with the input nested in a triple-backtick quoted block. See https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code.

Ranjani-devz · 2024-07-05T15:33:44Z

Thank you kaby76 for suggesting to put the input in a triple-backtick quoted block.

Adding on to Himanshu's issue, sharing the below code snippet where function test() starts at line number 1 and ends at line number 3 at the print statement but the ANTLR Python 2.7.18 grammar finds the end line of test() function as the start of the next function greet() which is at line number 5.

def test():
    xxx=1
    print xxx

def greet():
  print 'Hello World'
  
greet();

RobEin · 2024-07-05T16:11:04Z

The DEDENT token is placed on line 5 because that is where the dedent is detected.
Also try Python's tokenizer:
python -m tokenize test.py -e

It also places the DEDENT token in the 5th line.
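
The claim about Python's tokenizer can also be checked programmatically. The following is a minimal sketch added for illustration, assuming Python 3's standard `tokenize` module (it only tokenizes, so the Python 2 `print` statement is not a problem); `dedent_lines` is a hypothetical helper name, not part of any library.

```python
import io
import tokenize

# The sample input from the comment above, with "\n" line endings.
SOURCE = (
    "def test():\n"
    "    xxx=1\n"
    "    print xxx\n"
    "\n"
    "def greet():\n"
    "  print 'Hello World'\n"
    "  \n"
    "greet();\n"
)

def dedent_lines(source):
    """Return the line number on which each DEDENT token is reported."""
    readline = io.StringIO(source).readline
    return [tok.start[0]
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.DEDENT]

if __name__ == "__main__":
    # The first entry is the DEDENT that closes test().
    print(dedent_lines(SOURCE))
```

The DEDENT is reported at the position of the first token after the block, which is why the blank line 4 does not end the function as far as the tokenizer is concerned.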

kaby76 · 2024-07-05T16:36:56Z

I agree, I'm not sure what the problem is here.

Input:

def test():
    xxx=1
    print xxx

def greet():
  print 'Hello World'
  
greet();

Or in file: xxx.txt.

The parse tree is:


( file_input
  ( stmt
    ( compound_stmt
      ( funcdef
        ( DEF
          (  text:'def' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
        ) ) 
        ( Attribute WS Value ' ' chnl:HIDDEN
        ) 
        ( NAME
          (  text:'test' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
        ) ) 
        ( parameters
          ( LPAR
            (  text:'(' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
          ) ) 
          ( RPAR
            (  text:')' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
        ) ) ) 
        ( COLON
          (  text:':' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
        ) ) 
        ( suite
          ( NEWLINE
            (  text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
          ) ) 
          ( Attribute WS Value '    ' chnl:HIDDEN
          ) 
          ( INDENT
            (  text:'<INDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
          ) ) 
          ( stmt
            ( simple_stmt
              ( small_stmt
                ( expr_stmt
                  ( testlist
                    ( test
                      ( or_test
                        ( and_test
                          ( not_test
                            ( comparison
                              ( expr
                                ( xor_expr
                                  ( and_expr
                                    ( shift_expr
                                      ( arith_expr
                                        ( term
                                          ( factor
                                            ( power
                                              ( atom
                                                ( NAME
                                                  (  text:'xxx' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
                  ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 
                  ( EQUAL
                    (  text:'=' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
                  ) ) 
                  ( testlist
                    ( test
                      ( or_test
                        ( and_test
                          ( not_test
                            ( comparison
                              ( expr
                                ( xor_expr
                                  ( and_expr
                                    ( shift_expr
                                      ( arith_expr
                                        ( term
                                          ( factor
                                            ( power
                                              ( atom
                                                ( NUMBER
                                                  (  text:'1' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
              ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 
              ( NEWLINE
                (  text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
          ) ) ) ) 
          ( stmt
            ( simple_stmt
              ( small_stmt
                ( print_stmt
                  ( Attribute WS Value '    ' chnl:HIDDEN
                  ) 
                  ( PRINT
                    (  text:'print' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
                  ) ) 
                  ( test
                    ( or_test
                      ( and_test
                        ( not_test
                          ( comparison
                            ( expr
                              ( xor_expr
                                ( and_expr
                                  ( shift_expr
                                    ( arith_expr
                                      ( term
                                        ( factor
                                          ( power
                                            ( atom
                                              ( Attribute WS Value ' ' chnl:HIDDEN
                                              ) 
                                              ( NAME
                                                (  text:'xxx' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
              ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 
              ( Attribute NEWLINE Value '\r\n' chnl:HIDDEN
              ) 
              ( NEWLINE
                (  text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
          ) ) ) ) 
          ( DEDENT
            (  text:'<DEDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
  ) ) ) ) ) ) 
  ( stmt
    ( compound_stmt
      ( funcdef
        ( DEF
          (  text:'def' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
        ) ) 
        ( Attribute WS Value ' ' chnl:HIDDEN
        ) 
        ( NAME
          (  text:'greet' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
        ) ) 
        ( parameters
          ( LPAR
            (  text:'(' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
          ) ) 
          ( RPAR
            (  text:')' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
        ) ) ) 
        ( COLON
          (  text:':' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
        ) ) 
        ( suite
          ( NEWLINE
            (  text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
          ) ) 
          ( Attribute WS Value '  ' chnl:HIDDEN
          ) 
          ( INDENT
            (  text:'<INDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
          ) ) 
          ( stmt
            ( simple_stmt
              ( small_stmt
                ( print_stmt
                  ( PRINT
                    (  text:'print' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
                  ) ) 
                  ( test
                    ( or_test
                      ( and_test
                        ( not_test
                          ( comparison
                            ( expr
                              ( xor_expr
                                ( and_expr
                                  ( shift_expr
                                    ( arith_expr
                                      ( term
                                        ( factor
                                          ( power
                                            ( atom
                                              ( Attribute WS Value ' ' chnl:HIDDEN
                                              ) 
                                              ( STRING
                                                (  text:''Hello World'' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
              ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 
              ( Attribute NEWLINE Value '\r\n' chnl:HIDDEN
              ) 
              ( Attribute WS Value '  ' chnl:HIDDEN
              ) 
              ( NEWLINE
                (  text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
          ) ) ) ) 
          ( DEDENT
            (  text:'<DEDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
  ) ) ) ) ) ) 
  ( stmt
    ( simple_stmt
      ( small_stmt
        ( expr_stmt
          ( testlist
            ( test
              ( or_test
                ( and_test
                  ( not_test
                    ( comparison
                      ( expr
                        ( xor_expr
                          ( and_expr
                            ( shift_expr
                              ( arith_expr
                                ( term
                                  ( factor
                                    ( power
                                      ( atom
                                        ( NAME
                                          (  text:'greet' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
                                      ) ) ) 
                                      ( trailer
                                        ( LPAR
                                          (  text:'(' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
                                        ) ) 
                                        ( RPAR
                                          (  text:')' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
      ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 
      ( SEMI
        (  text:';' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
      ) ) 
      ( NEWLINE
        (  text:'<NEWLINE>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
  ) ) ) ) 
  ( EOF
    (  text:'' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) )

The tokens are:

[@0,0:2='def',<9>,1:0]
[@1,3:3=' ',<84>,channel=1,1:3]
[@2,4:7='test',<79>,1:4]
[@3,8:8='(',<34>,1:8]
[@4,9:9=')',<37>,1:9]
[@5,10:10=':',<40>,1:10]
[@6,11:12='\r\n',<82>,1:11]
[@7,13:16='    ',<84>,channel=1,2:0]
[@8,17:16='<INDENT>',<1>,2:4]
[@9,17:19='xxx',<79>,2:4]
[@10,20:20='=',<51>,2:7]
[@11,21:21='1',<80>,2:8]
[@12,22:23='\r\n',<82>,2:9]
[@13,24:27='    ',<84>,channel=1,3:0]
[@14,28:32='print',<27>,3:4]
[@15,33:33=' ',<84>,channel=1,3:9]
[@16,34:36='xxx',<79>,3:10]
[@17,37:38='\r\n',<82>,channel=1,3:13]
[@18,39:40='\r\n',<82>,4:0]
[@19,41:40='<DEDENT>',<2>,5:0]
[@20,41:43='def',<9>,5:0]
[@21,44:44=' ',<84>,channel=1,5:3]
[@22,45:49='greet',<79>,5:4]
[@23,50:50='(',<34>,5:9]
[@24,51:51=')',<37>,5:10]
[@25,52:52=':',<40>,5:11]
[@26,53:54='\r\n',<82>,5:12]
[@27,55:56='  ',<84>,channel=1,6:0]
[@28,57:56='<INDENT>',<1>,6:2]
[@29,57:61='print',<27>,6:2]
[@30,62:62=' ',<84>,channel=1,6:7]
[@31,63:75=''Hello World'',<81>,6:8]
[@32,76:77='\r\n',<82>,channel=1,6:21]
[@33,78:79='  ',<84>,channel=1,7:0]
[@34,80:81='\r\n',<82>,7:2]
[@35,82:81='<DEDENT>',<2>,8:0]
[@36,82:86='greet',<79>,8:0]
[@37,87:87='(',<34>,8:5]
[@38,88:88=')',<37>,8:6]
[@39,89:89=';',<42>,8:7]
[@40,90:89='<NEWLINE>',<82>,8:8]
[@41,90:89='<EOF>',<-1>,8:8]

According to the Official Python2 grammar, https://docs.python.org/2.7/reference/grammar.html, a funcdef is funcdef: 'def' NAME parameters ':' suite. It extends from the first character 'd' of def, and goes all the way to the last character of DEDENT, since suite is defined as suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT.

If you want to get the interval for the statements within function "test()", then you have to get the last char of the 2nd stmt. It says there are two statements for function "test()":

$ trparse xxx.txt | trquery grep ' //stmt/compound_stmt/funcdef[NAME/text() = "test"]/suite/stmt' | trtext -c
CSharp 0 xxx.txt success 0.0428541
2
07/05-12:35:54 ~/issues/g4-new-csharp/python/python2_7_18/Generated-CSharp-0
$ trparse -l xxx.txt | trquery grep ' //stmt/compound_stmt/funcdef[NAME/text() = "test"]/suite/stmt[1]' | trcaret
CSharp 0 xxx.txt success 0.0425021
L2:     xxx=1
        ^
07/05-12:36:00 ~/issues/g4-new-csharp/python/python2_7_18/Generated-CSharp-0
$ trparse -l xxx.txt | trquery grep ' //stmt/compound_stmt/funcdef[NAME/text() = "test"]/suite/stmt[2]' | trcaret
CSharp 0 xxx.txt success 0.0426761
L3:     print xxx
        ^

kaby76 · 2024-07-05T18:44:36Z

The only thing that would be nice to change is the text for the INDENT and DEDENT tokens. They are <INDENT> and <DEDENT> respectively. But the text is inconsistent with the computed length of the token, which is end index - start index + 1 = 0. So, for the first INDENT token, [@8,17:16='<INDENT>',<1>,2:4], the attributes of the token are:

start index is 17.
end index is 16.
text is <INDENT>.
type is 1 (the channel is the default, 0).
line is 2.
column is 4.

The "problem" is on the trtext-side of things. trtext reconstructs the text of the input by concatenating the text of the leaves of the parse tree. So, I see <INDENT> and <DEDENT> sprinkled in the reconstructed text. I can easily remove these from the tree using trquery delete.

RobEin · 2024-07-05T19:49:39Z

Thanks for bringing this to my attention.
I really forgot about that.
In other words, the token stream must ensure that the original source code can be restored.
And this is not possible with "<INDENT>" and "<DEDENT>" token text.

I will fix it in all PythonLexerBase ports:

Java
C#
Python
JavaScript
TypeScript
Go
Dart
CPP

Ranjani-devz · 2024-07-06T11:25:29Z

Thanks kaby76 and RobEin for checking on this issue. Waiting for your update if it is fixed in PythonLexerBase for Java.

RobEin · 2024-07-07T10:20:54Z

On second thought, no repair is needed after all.
The rule is very simple to restore the original source code by the token stream.
You just have to take out the INDENT and DEDENT tokens.
Python's tokenizer works differently.
The INDENT and DEDENT tokens must be inserted there to restore the original code.
I'm still wondering if there's any advantage to this, but probably not.

kaby76 · 2024-07-07T12:39:31Z

The rule is very simple to restore the original source code by the token stream[:] You just have to take out the INDENT and DEDENT tokens.
...
The INDENT and DEDENT tokens must be inserted there to restore the original code.

I don't understand. These two statements are inconsistent. The first statement says that the INDENT and DEDENT tokens need to be deleted from the parse tree in order to reconstruct the source. The second statement says that they cannot be deleted because they are essential to reconstruct the source.

Currently, I have to delete the INDENT and DEDENT tokens to reconstruct the text because if I don't I get INDENT and DEDENT strings sprinkled in the reconstructed text, e.g., this:

07/07-08:05:20 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ trparse ../examples/atexit.py | trtext
CSharp 0 ../examples/atexit.py success 0.0601923
"""
atexit.py - allow programmer to define multiple exit functions to be executed
upon normal program termination.

One public function, register, is defined.
"""

__all__ = ["register"]

import sys

_exithandlers = []
def _run_exitfuncs():
    <INDENT>"""run any registered exit functions

    _exithandlers is traversed in reverse order so functions are executed
    last in, first out.
    """

    exc_info = None
    while _exithandlers:
        <INDENT>func, targs, kargs = _exithandlers.pop()
        try:
            <INDENT>func(*targs, **kargs)
        <DEDENT>except SystemExit:
            <INDENT>exc_info = sys.exc_info()
        <DEDENT>except:
            <INDENT>import traceback
            print >> sys.stderr, "Error in atexit._run_exitfuncs:"
            traceback.print_exc()
            exc_info = sys.exc_info()

    <DEDENT><DEDENT>if exc_info is not None:
        <INDENT>raise exc_info[0], exc_info[1], exc_info[2]


<DEDENT><DEDENT>def register(func, *targs, **kargs):
    <INDENT>"""register a function to be executed upon normal program termination

    func - function to be called at exit
    targs - optional arguments to pass to func
    kargs - optional keyword arguments to pass to func

    func is returned to facilitate usage as a decorator.
    """
    _exithandlers.append((func, targs, kargs))
    return func

<DEDENT>if hasattr(sys, "exitfunc"):
    # Assume it's another registered exit function - append it to our list
    <INDENT>register(sys.exitfunc)
<DEDENT>sys.exitfunc = _run_exitfuncs

if __name__ == "__main__":
    <INDENT>def x1():
        <INDENT>print "running x1"
    <DEDENT>def x2(n):
        <INDENT>print "running x2(%r)" % (n,)
    <DEDENT>def x3(n, kwd=None):
        <INDENT>print "running x3(%r, kwd=%r)" % (n, kwd)

    <DEDENT>register(x1)
    register(x2, 12)
    register(x3, 5, "bar")
    register(x3, "no kwd args")
<DEDENT>
07/07-08:05:40 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$

Text reconstruction in Trash follows the basic concept that existed in CS since the 1960's: the input text is simply the concatenation of the text of the frontier of the parse tree. The text for INDENT and DEDENT tokens are <INDENT> and <DEDENT>. This is why I need to either erase the text (which I currently cannot do with Trash), or the tokens need to be deleted from the parse tree, e.g.,:

07/07-07:59:04 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ trparse !$ | trquery 'delete //(DEDENT | INDENT)' | trtext
trparse ../examples/atexit.py | trquery 'delete //(DEDENT | INDENT)' | trtext
CSharp 0 ../examples/atexit.py success 0.0612294
"""
atexit.py - allow programmer to define multiple exit functions to be executed
upon normal program termination.

One public function, register, is defined.
"""

__all__ = ["register"]

import sys

_exithandlers = []
def _run_exitfuncs():
    """run any registered exit functions

    _exithandlers is traversed in reverse order so functions are executed
    last in, first out.
    """

    exc_info = None
    while _exithandlers:
        func, targs, kargs = _exithandlers.pop()
        try:
            func(*targs, **kargs)
        except SystemExit:
            exc_info = sys.exc_info()
        except:
            import traceback
            print >> sys.stderr, "Error in atexit._run_exitfuncs:"
            traceback.print_exc()
            exc_info = sys.exc_info()

    if exc_info is not None:
        raise exc_info[0], exc_info[1], exc_info[2]


def register(func, *targs, **kargs):
    """register a function to be executed upon normal program termination

    func - function to be called at exit
    targs - optional arguments to pass to func
    kargs - optional keyword arguments to pass to func

    func is returned to facilitate usage as a decorator.
    """
    _exithandlers.append((func, targs, kargs))
    return func

if hasattr(sys, "exitfunc"):
    # Assume it's another registered exit function - append it to our list
    register(sys.exitfunc)
sys.exitfunc = _run_exitfuncs

if __name__ == "__main__":
    def x1():
        print "running x1"
    def x2(n):
        print "running x2(%r)" % (n,)
    def x3(n, kwd=None):
        print "running x3(%r, kwd=%r)" % (n, kwd)

    register(x1)
    register(x2, 12)
    register(x3, 5, "bar")
    register(x3, "no kwd args")

07/07-07:59:38 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ trparse ../examples/atexit.py | trquery 'delete //(DEDENT | INDENT)' | trtext > save
CSharp 0 ../examples/atexit.py success 0.0600218
07/07-07:59:48 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ diff save ../examples/atexit.py
66d65
<
07/07-07:59:57 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0

NB: trtext outputs an extra newline character because it calls Console.WriteLine() instead of a Console.Write(). It has to do this because dotnet programs don't work perfectly with a Cygwin/MSYS shell. Instead, one should use trsponge to perform the reconstruction and outputting.

RobEin · 2024-07-07T20:37:42Z

The second statement says that they cannot be deleted because they are essential to reconstruct the source.

The second statement was about the original Python tokenizer.

... Trash follows the basic concept that existed in CS since the 1960's: the input text is simply the concatenation of the text of the frontier of the parse tree ...

Now I understand what the problem is.
I didn't know this recommendation.

I can imagine two alternatives in this case:

Solution 1:
The text of the INDENT/DEDENT tokens would contain the indentation similar to Python's tokenizer.
Currently, the indentation text is stored in the WS tokens before the INDENT/DEDENT tokens.
This is problematic because it may cause compatibility problems with older applications that use the PythonLexerBase class.
Solution 2:
This is simpler and less likely to cause compatibility issues.
That is, INDENT/DEDENT tokens would store an empty string.
Currently, the text property of INDENT tokens is consistently "<INDENT>" and similarly that of DEDENT tokens is "<DEDENT>".
If these are now empty strings, then the text properties of the tokens should only be concatenated to restore the original source code.
This would be similar to deleting INDENT/DEDENT tokens.

I recommend the second solution.
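
To make the comparison concrete, here is a small sketch (not the PythonLexerBase code) that uses the tokens of the first function from the dump earlier in this thread. The type names are taken from the parse tree; `concat`, `drop_synthetic` and `blank_synthetic` are hypothetical helper names. It suggests that deleting the INDENT/DEDENT tokens and giving them empty text reconstruct the same source.

```python
# (type, text) pairs for tokens @0..@19 of the dump above.
TOKENS = [
    ("DEF", "def"), ("WS", " "), ("NAME", "test"), ("LPAR", "("), ("RPAR", ")"),
    ("COLON", ":"), ("NEWLINE", "\r\n"), ("WS", "    "), ("INDENT", "<INDENT>"),
    ("NAME", "xxx"), ("EQUAL", "="), ("NUMBER", "1"), ("NEWLINE", "\r\n"),
    ("WS", "    "), ("PRINT", "print"), ("WS", " "), ("NAME", "xxx"),
    ("NEWLINE", "\r\n"), ("NEWLINE", "\r\n"), ("DEDENT", "<DEDENT>"),
]
ORIGINAL = "def test():\r\n    xxx=1\r\n    print xxx\r\n\r\n"
SYNTHETIC = {"INDENT", "DEDENT"}

def concat(tokens):
    """Concatenate the text of every token, in order."""
    return "".join(text for _, text in tokens)

def drop_synthetic(tokens):
    """Current workaround: delete INDENT/DEDENT before concatenating."""
    return [(t, x) for t, x in tokens if t not in SYNTHETIC]

def blank_synthetic(tokens):
    """Solution 2: keep the tokens but give them empty text."""
    return [(t, "" if t in SYNTHETIC else x) for t, x in tokens]

if __name__ == "__main__":
    print(concat(drop_synthetic(TOKENS)) == ORIGINAL)
    print(concat(blank_synthetic(TOKENS)) == ORIGINAL)
```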

Ranjani-devz · 2024-07-08T05:43:57Z

I didn't understand. Can you explain to me what has to be changed? Do I need to change any of the grammar files?

We are trying to parse a Python 2.x file using Java. When I tried to print FuncdefContext.suite.getText() for the test() function in this example,

def test():
    xxx=1
    print xxx

def greet():
  print 'Hello World'
  
greet();

Output:

<INDENT>xxx=1
printxxx
<DEDENT>

and endline for this test() function is 5.

Can you tell me what should be done here to get the correct end line?

kaby76 · 2024-07-08T07:28:00Z

tree.getText() doesn't reconstruct the text of the input. It never does for virtually every Antlr grammar! This is because Antlr parse trees don't contain all the tokens of the input, like comments and white space, nor does it contain strings that are "skipped." Grammars that define lexer rules with -> skip or -> channel(HIDDEN) cause input strings to be not tokenized or tokenized with the channel property to be 1. The leaves in the parse tree don't contain these tokens. For python2_7_18, the DEDENT and INDENT tokens contain text as strings <DEDENT> and <INDENT> and these tokens are part of the Antlr parse tree. This is why you see tree.getText() contain strings for the DEDENT and INDENT tokens. The "approved" way to get the text from an Antlr parse tree is to query the input char stream directly, using the parse tree to get the bounds of the indices of the text. See https://stackoverflow.com/a/55852474/4779853 or antlr/antlr4#1302
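
The difference can be illustrated without the ANTLR runtime. This is a sketch only: the tuples below stand in for tokens @14..@16 of the dump above, and `tree_text` and `source_text` are hypothetical names that mimic, respectively, joining the default-channel leaves and slicing the char stream by start/stop index.

```python
# First function of the sample input, with the "\r\n" endings seen in the dump.
SOURCE = "def test():\r\n    xxx=1\r\n    print xxx\r\n\r\n"

# (text, channel, start_char, stop_char) for the statement "print xxx".
PRINT_STMT = [
    ("print", 0, 28, 32),
    (" ", 1, 33, 33),    # whitespace sits on the hidden channel
    ("xxx", 0, 34, 36),
]

def tree_text(tokens):
    """Join only default-channel tokens, as the parse tree leaves would."""
    return "".join(text for text, channel, _, _ in tokens if channel == 0)

def source_text(source, tokens):
    """Slice the input from the first start index to the last stop index (inclusive)."""
    return source[tokens[0][2]: tokens[-1][3] + 1]

if __name__ == "__main__":
    print(repr(tree_text(PRINT_STMT)))
    print(repr(source_text(SOURCE, PRINT_STMT)))
```

The first result loses the space between the two names, which matches the `printxxx` seen in the reported output; the slice keeps the input exactly as written.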

Trash doesn't represent the parse tree like Antlr. It incorporates the entire input, including white space and comments. It's done this way so that it's fully serializable, with no loss of text, and fully editable. The way Antlr splits the parse tree from the token stream, and char stream, is unnatural, difficult/slow to serialize and edit.

Ranjani-devz · 2024-07-09T06:00:35Z

Hi, Thanks for your response. I understand that you have suggested on how to get the text from ANTLR parse tree.

Our use case is to parse an input Python file and identify the start line and end line of each class, function, statement, comment, etc. in the file. While doing so, we are facing an issue fetching the end line from the function and statement contexts (for, while loop, ...).

Can you help me understand how this endline can be fetched correctly or is there any workaround you would like to suggest.
Also, does ANTLR python 2.7.18 grammars support python 2.6 version too?

kaby76 · 2024-07-09T11:04:39Z

The easiest solution would be to just delete the INDENT and DEDENT leaves, then just get the Interval for the sub-tree. But, the Antlr runtime doesn't have tree editing.

Instead, do this:

Get the Interval of the node for the funcdef or stmt. The Interval is the start and end indices of the tokens for that sub-tree (i.e., not the start and end of the character buffer).
Write a loop that starts at the ending token index. Working backwards, skip all INDENT and DEDENT tokens until you find something else, something that is not an INDENT or DEDENT. Do not back up further than the starting token index. We now have the end token index of the funcdef or stmt.
Get the end token from its end token index.
Get the end character index from the end token.
Write a loop that starts at the end character index and walks backwards through the character buffer. Stop when you find a character that is not a newline; this is the character index of the last non-newline character for the funcdef or stmt.
You can now return 1 + the character index of the last non-newline character for the funcdef or stmt.

In C#:

        var funcdefs = new Antlr4.Runtime.Tree.Xpath.XPath(parser, "//funcdef").Evaluate(tree);
        var funcdef = funcdefs.FirstOrDefault();
        var token_interval = funcdef.SourceInterval;
        int end_token_index = token_interval.b;
        for (; end_token_index >= token_interval.a; --end_token_index)
        {
            if (tokens.Get(end_token_index).Type != PythonParser.INDENT
                && tokens.Get(end_token_index).Type != PythonParser.DEDENT
                && tokens.Get(end_token_index).Type != PythonParser.WS
                && tokens.Get(end_token_index).Type != PythonParser.NEWLINE
                && tokens.Get(end_token_index).Channel == 0)
            {
                break;
            }
        }
        var start_token = tokens.Get(token_interval.a);
        var end_token = tokens.Get(end_token_index);
        var start_char_index = start_token.StartIndex;
        var end_char_index = end_token.StopIndex;
        System.Console.WriteLine("funcdef text:");
        System.Console.WriteLine(str.GetText(new Interval(start_char_index, end_char_index)));
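
The backward walk itself does not depend on the runtime, so it can be tried on plain data. The sketch below is an illustration under assumptions: `Tok` and `last_real_token` are hypothetical names, and the list reproduces tokens @0..@19 of the dump above with type names taken from the parse tree.

```python
from collections import namedtuple

Tok = namedtuple("Tok", "type text channel line")

# Tokens @0..@19: the whole funcdef for test(), including the trailing DEDENT.
FUNCDEF = [
    Tok("DEF", "def", 0, 1), Tok("WS", " ", 1, 1), Tok("NAME", "test", 0, 1),
    Tok("LPAR", "(", 0, 1), Tok("RPAR", ")", 0, 1), Tok("COLON", ":", 0, 1),
    Tok("NEWLINE", "\r\n", 0, 1), Tok("WS", "    ", 1, 2),
    Tok("INDENT", "<INDENT>", 0, 2), Tok("NAME", "xxx", 0, 2),
    Tok("EQUAL", "=", 0, 2), Tok("NUMBER", "1", 0, 2),
    Tok("NEWLINE", "\r\n", 0, 2), Tok("WS", "    ", 1, 3),
    Tok("PRINT", "print", 0, 3), Tok("WS", " ", 1, 3), Tok("NAME", "xxx", 0, 3),
    Tok("NEWLINE", "\r\n", 1, 3), Tok("NEWLINE", "\r\n", 0, 4),
    Tok("DEDENT", "<DEDENT>", 0, 5),
]
SKIP = {"INDENT", "DEDENT", "WS", "NEWLINE"}

def last_real_token(tokens, first, last):
    """Walk back from `last` to the nearest visible, non-layout token."""
    for i in range(last, first - 1, -1):
        tok = tokens[i]
        if tok.type not in SKIP and tok.channel == 0:
            return i
    return first  # nothing else found: fall back to the start token

if __name__ == "__main__":
    end = last_real_token(FUNCDEF, 0, len(FUNCDEF) - 1)
    print(FUNCDEF[end].text, FUNCDEF[end].line)
```

The stop token of the sub-tree is the DEDENT on line 5, while the last token that belongs to the body is `xxx` on line 3, which is the end line the question is after.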

[D]oes ANTLR python 2.7.18 grammars support python 2.6 version too?

I would think so, but don't quote me.

Ranjani-devz · 2024-07-12T08:06:01Z

Hi, thanks for your response. We will check the suggestion you have provided, as our implementation is in Java.
Also, in our case we are using custom listener class to identify the endlines for each class, function, statement, etc. by overriding the base listener enter and exit methods.

For Example:

@Override
public void enterFuncdef(FuncdefContext ctx) {
     int start = ctx.getStart().getLine();
     int stop = ctx.getStop().getLine();
}

Would you like to suggest if we can handle the endlines correctly here?

kaby76 · 2024-07-12T11:48:19Z

[W]e are using custom listener class to identify the endlines for each class, function, statement, etc. by overriding the base listener enter and exit methods.

For Example:
@Override
public void enterFuncdef(FuncdefContext ctx) {
     int start = ctx.getStart().getLine();
     int stop = ctx.getStop().getLine();
}
Would you like to suggest if we can handle the endlines correctly here?

Not quite. Try this.

MyListener.java

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.*;

public class MyListener extends PythonParserBaseListener {

    CommonTokenStream tokens_;
    CharStream str_;

    public MyListener(CommonTokenStream tokens, CharStream str)
    {
	tokens_ = tokens;
	str_ = str;
    }
    
    @Override public void enterFuncdef(PythonParser.FuncdefContext ctx)
    {
	var start = ctx.getStart().getLine();
	var token_interval = ctx.getSourceInterval();
	var end_token_index = token_interval.b;
	var tokens = this.tokens_;
	var str = this.str_;
	for (; end_token_index >= token_interval.a; --end_token_index)
	{
	    if (tokens.get(end_token_index).getType() != PythonParser.INDENT
		  && tokens.get(end_token_index).getType() != PythonParser.DEDENT
		  && tokens.get(end_token_index).getType() != PythonParser.WS
		  && tokens.get(end_token_index).getType() != PythonParser.NEWLINE
		  && tokens.get(end_token_index).getChannel() == 0)
	    {
		break;
	    }
	}
	var start_token = tokens.get(token_interval.a);
	var end_token = tokens.get(end_token_index);
	var start_char_index = start_token.getStartIndex();
	var end_char_index = end_token.getStopIndex();
	var stop_line_number = end_token.getLine();
	System.out.println("stop = " + stop_line_number);
	System.out.println("funcdef text:");
	System.out.println(str.getText(new Interval(start_char_index, end_char_index)));
    }
}
