Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go target] Generated lexer uses "l" for some functions, "p" for other functions. #4306

Closed
kaby76 opened this issue Jun 7, 2023 · 26 comments

Comments

@kaby76
Copy link
Contributor

kaby76 commented Jun 7, 2023

This is in regards to writing a "target agnostic grammar", where we try to reuse a grammar even though actions ("semantic predicates") in a grammar are always target specific.

The javascript grammar contains action code for tokenization. For example, here. Notice that the action is platform-specific code. In Java, the "this." syntax is a qualified method call in Java. It also so happens that the syntax works in CSharp, Dart, JavaScript (modern), and TypeScript. For the other targets, the grammar .g4's have to be sed -i to a syntax that works for that platform. "Target agnostic" isn't far enough. For PHP, it needs to be changed to "self::'". For Python3, "self.". For Cpp, "this->". For Go, it's harder.

The reason it is harder is because actions in a grammar turn into functions that use either "l" or "p", depending on whether the action is a predicate or not. For a non-predicate action, e.g., this, generated code is the following.

func (l *JavaScriptLexer) OpenBrace_Action(localctx antlr.RuleContext, actionIndex int) {
	switch actionIndex {
	case 0:
		l.ProcessOpenBrace()

	default:
		panic("No registered action for: " + fmt.Sprint(actionIndex))
	}
}

For a predicate, this action in the lexer grammar is transformed to a function with a parameter that is named "p".

func (p *JavaScriptLexer) HashBangLine_Sempred(localctx antlr.RuleContext, predIndex int) bool {
	switch predIndex {
	case 0:
		return l.IsStartOfFile()

	default:
		panic("No predicate with index: " + fmt.Sprint(predIndex))
	}
}

This is rather annoying because I have keep track of which action uses a "p" vs "l".

Could we change the Go templates just always use "this" as a parameter name for all these generated functions that contain action code? It's not critical, because I can write a script to rewrite the "this." to either "p." or "l." depending on whether the action is a predicate or not.

@jimidle
Copy link
Collaborator

jimidle commented Jun 7, 2023 via email

@jimidle
Copy link
Collaborator

jimidle commented Jun 7, 2023 via email

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 7, 2023

This would be a "nice to have", but not critical. I can use Trash to extract plain actions vs semantic predicate actions, modify those actions with "l." or "p." accordingly, then reconstruct the grammar. The build tools can handle this. I just don't quite get why it's not a consistent name like "self" in the generated code, instead of all this "l" or "p".

@jimidle
Copy link
Collaborator

jimidle commented Jun 8, 2023

self or this are different concepts really. But we should be using consistent receiver variable names.

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 8, 2023

Actually, there sort of is one, which I accidentally stumbled upon it with a typo I wrote. #3508 $parser. It's still pretty much an undocumented feature, except for this added last year in the python target documentation. $parser "works", but there are two flaws. One is that it's only good for the parser, i.e., there's no $lexer object reference. The second problem is that for Cpp, the syntax for object dereference uses the -> operator, not the . operator. So, for it to work in the Cpp target, the action must be $parser->, not $parser., which otherwise works fine for all the other targets.

@ericvergnaud
Copy link
Contributor

The cpp version could be a reference rather than a pointer, that would avoid the -> vs . issue.

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 8, 2023

The cpp version could be a reference rather than a pointer, that would avoid the -> vs . issue.

Yes, instead of $parser->foobar(), use $parser.foobar(), and in the Antlr tool, generate code as (*this).foobar(). That would work for at least $parser. Still missing it for the lexer. Either need to add $lexer or better, $this and always use the . operator afterwards.

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Jun 8, 2023 via email

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 8, 2023

Yes, we could use $parser and $lexer (or $this, which I would need to think about what you said), then I could pitch in 2 of the 4 cases the transformGrammar.py used in grammars-v4, thus getting "target agnostic format" much closer to reality. (Note, I only have templates for 8 of the 9 targets; I don't target and test Swift cause I still cannot understand how to install the toolchain--it does not work on Windows.)

I'll still need the transformGrammar.py hack because there are a couple of other issues where it corrects other problems with codegen. In fortran/fortran90, I have to sed this line to turn it into @header {#include "Fortran90LexerBase.h"} for the Cpp target, and @header {require "Fortran90LexerBase.php";} for the PHP target. Honestly, we go to the trouble to declare in a grammar file options { superClass = Fortran90LexerBase; }, just to get the code to compile, and the tool doesn't generate the proper #include's or import's to work. Why is that?

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Jun 8, 2023

Well I guess whilst it's obviously good practice to name the abstract parser header from its name, it's definitely not something we can enforce. We're not generating or providing that file are we ?
I would personally be ok to enforce the naming in principle, but that would likely break a number of working grammars, notably in the frequent case where developers place the generated files in a dedicated folder...
I'm personally supportive of a macro system, where grammar code would be conditionally included based on the active target

#if PHP
require "Fortran90LexerBase.php"
#elif CPP
#include "Fortran90LexerBase.h"
#endif

IIRC you hate the idea (or maybe that's Ivan?), but tbh I haven't seen a better one :-(

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 8, 2023

This would be okay. I think my concern was lexing the target code between the #if and #else. The target code itself may have Cpp preprocessor directives:

grammar Foobar;

#if Cpp
// Cpp code   vvv.
#if defined(XXX)
#include "xxx.h"
#else                         <- bound to "#if Cpp" or "#if defined(XXX)"??
#include "other.h"
#endif  
// Cpp code ^^^
#endif  

It just makes the lexing more difficult, e.g., is the #else part of the Cpp target code, or the start of code for all other targets? @member uses an action block, so it only has to count and match open and closing braces outside of string literals within the action blocks. We really don't parse target-specific code. It's just one lexed string. The idea is fine, but I think we'd need to come up with a better syntax. Maybe just #if Cpp { ... } #else if Go { ... } #endif? Since we seem to have a fondness to '@membersor@lexer::membersor ... and the@-sign, maybe use that? Or just pitch the @`-sign altogether.

@ericvergnaud
Copy link
Contributor

We could certainly pick a different pattern but tbh if developers are clever enough to introduce such complexity then I expect they'd also be able to fix them.

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Jun 8, 2023 via email

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 8, 2023

That's cool. Works for me!

@KvanTTT
Copy link
Member

KvanTTT commented Jun 10, 2023

IIRC you hate the idea (or maybe that's Ivan?), but tbh I haven't seen a better one :-(

Yes, I still think the idea with preprocessor directives is not good. I have had experience with Objective-C language processing and encountered a lot of problems with preprocessor directives. Moreover, directives are not part of target language. Probably you meant this topic: Support predicates. I suggest using something like this (it was suggested by @udif in Unified Actions Language):

stat
    : { java5() }? 'goto' ID ';'

...

@parser::members::java {
  boolean java5() {
    ...
  }
}
@parser::members::cpp {
  bool java5() {
    ...
  }
}

We can use universal function calls and target-specific function declarations.

BTW, currently I'm working on Multiplatform feature in Kotlin and it looks like the idea above (in Kotlin there are expect and actual functions).

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 10, 2023

We can use universal function calls and target-specific function declarations.

Exactly! Thank you @KvanTTT. Anltr already has $parser, which somehow someone snuck into the Antlr tool. And, we also have syntax for accessing attributes in actions. But, now we have an impasse where can't agree on something like $parser for lexers, and the translation of the operator that follows the reference into the target-specific syntax in order to make it a little more bearable over in grammars-v4. We don't need to devise a general-purpose language for actions, just a common way to access code in the target. That's a plane of features that I can use.

The only other alternative I can think of is to add syntax to actions and semantic predicates in Antlr grammars to declare the target. It completely pitches any and all notions of attributes, parser and lexer instances, the options { target=... } incomplete solution, all attempts to substitute anything in the target code--everything--and only have a notation for saying everything in an action { ... } region is target-specific code, such as:

@parser::members::java {
  boolean java5() {
    ...
  }
}
@parser::members::cpp {
  bool java5() {
    ...
  }
}

OpenBrace:                      '{' 
    {this.ProcessOpenBrace();}::(Java, CSharp, ....) | {this->ProcessOpenBrace();}::Cpp | ... ;

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Jun 11, 2023 via email

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Jun 11, 2023 via email

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 11, 2023

Mmm… the bad news is that whilst this may work for Java which has an implicit this, it won’t for other targets, notably JS, TS and Python.

Not sure what you are looking at, but "this." is perfectly valid for JavaScript. Here's the constructor code that's generated for the lexer.

    constructor(input) {
        super(input)
        this._interp = new antlr4.atn.LexerATNSimulator(this, atn, decisionsToDFA, new antlr4.atn.PredictionContextCache());
    }

Generated-JavaScript-0.zip

I can prove it for TypeScript and Python, too, if you need to see a functioning, complete example.

But, it doesn't matter if syntax is added to @member and every side-effect-action/semantic-predicate to distinguish the specific target the action-block. Either that, or really translate $this and operator correctly from the action-block itself.

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 11, 2023

Yes, you are right for Python3. There is no "this." But, the code passes a "self" in each of the wrapper functions. So, my "transformGrammar.py" hack rewrites the "this." to "self.". The generated code also works fine.

Generated-Python3-0.zip

@KvanTTT
Copy link
Member

KvanTTT commented Jun 11, 2023

Yes, you are right for Python3. There is no "this." But, the code passes a "self" in each of the wrapper functions. So, my "transformGrammar.py" hack rewrites the "this." to "self.". The generated code also works fine.

To be precise in Python use can use any word as receiver, not only self:

class C:
    def f(foo):
        print(foo.x)

c = C()
c.x = 42
c.f() // prints 42

@KvanTTT
Copy link
Member

KvanTTT commented Jun 11, 2023

Mmm… the bad news is that whilst this may work for Java which has an implicit this, it won’t for other targets, notably JS, TS and Python…

It will work work other targets if consider first parameter as receiver (actually on IR level all class members represented as members with first this parameter).

@ericvergnaud
Copy link
Contributor

Let's move this to a discussion ?

@KvanTTT
Copy link
Member

KvanTTT commented Jun 11, 2023

Yes but I suggest continuing existing ones: Unified Actions Language or Support predicates.

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 12, 2023

I'm moving this to the discussions as suggested. But I recommend looking at the hacks that I have to employ in order to get a grammar to work across 8 different targets here.

@kaby76
Copy link
Contributor Author

kaby76 commented Jun 12, 2023

(I can't convert the issue to a discussion.)

@antlr antlr locked and limited conversation to collaborators Jun 12, 2023
@ericvergnaud ericvergnaud converted this issue into discussion #4311 Jun 12, 2023

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants