Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JavaCC 21 workgroup #685

Open
revusky opened this issue Jan 3, 2023 · 56 comments
Open

JavaCC 21 workgroup #685

revusky opened this issue Jan 3, 2023 · 56 comments

Comments

@revusky
Copy link

revusky commented Jan 3, 2023

Here's the proposal. I'll do it for you.

By that, I mean the migration to JavaCC 21, specifically adapting this file. Since most of the file is devoted to parsing what is basically a Java grammar, and with JavaCC 21, there is an up-to-date Java grammar that you can just INCLUDE, this would take quite a bit of burden off of you. For example, issues like supporting lambda expressions, #675, at least as regards the parsing side (not the runtime side, I grant) will be taken care of, because you can automatically parse (and build AST for) anything in the current Java language. In reality, the current situation where you (or any broadly similar project) have to maintain this yourselves just doesn't make much sense. I made that comment in this article: https://javacc.com/2021/03/01/reference-java-grammar/

The only thing I would ask in return is a serious commitment that, once I do the work, you review it and starting using it ASAP. You do understand that if I do this for you and then you don't use it or possibly even look at it, then that's kind of a crisis. You'd be putting me in a very uncomfortable position.

Oh,I should mention that another thing that you will get for free moving to JavaCC 21 is a much more serious treatment of fault-tolerant and error recovery. You might want to consider this article: https://javacc.com/2021/02/18/the-promised-land-fault-tolerant-parsing/ and you'll realize that the legacy JavaCC has simply zero concept of error recovery really. Meanwhile, the stuff that that article describes is all implemented in JavaCC 21. That said, there may be some glitches, but for that very reason, I would love to have a demanding "customer" that can help me get all the kinks out.

@nickl-
Copy link
Member

nickl- commented Jan 3, 2023

Here's the proposal. I'll do it for you.

Wow is it already Christmas again? =) You just made my day and probably the rest of my year!
You don't have to work alone we can certainly pitch in, and having your expertise at wheel we'll dust this off in no time.

By that, I mean the migration to JavaCC 21, specifically adapting this file.

Even though we have just about resolved all the outstanding parser issues now, having only one outstanding item remaining, I do think the best long term strategy will be to consider the upstream grammar as master and attempt to port only the variations over from the BeanShell grammar.

From my initial and very brief assessment I already noticed some naming alterations and if we want to streamline future updates with the least amount of friction then it behooves BeanShell to conform to what is defined upstream instead of the other way around. There is going to be work that needs done and we might as well suffer some growing pains now to have more pleasure in the future. What do you think @revusky?

These are some of the things which we may need to considered.

More relaxed statement or expression grammar:

void StatementExpression() :
{ }
{
/*
This is looser than normal Java to simplify the grammar. This allows
us to type arbitrary expressions on the command line, e.g. "1+1;"
We should turn this off in the implementation in strict java mode.
*/
Expression()
/*
// This was the original Java grammar.
// Original comment:
// The last expansion of this production accepts more than the legal
// Java expansions for StatementExpression.
PreIncrementExpression()
|
PreDecrementExpression()
|
LOOKAHEAD( PrimaryExpression() AssignmentOperator() )
Assignment() { }
|
PostfixExpression()
*/
}

This should allow even basic 1 + 1 expressions

Being able to specify import and package statements anywhere

|
// Allow BeanShell imports in any block
ImportDeclaration()
|
// Allow BeanShell package declarations in any block
PackageDeclaration()
|

The same kind of goes for variables and methods too which can also be declared just about anywhere, and it is loosely typed so wherever there is a Type definition it is optional:
The basic variable consists of an identifier with an optional "=" + initializer

void VariableDeclarator() #VariableDeclarator :
{
Token t;
}
{
t=<IDENTIFIER>
( "[" "]" { jjtThis.dimensions++; } )*
[ "=" VariableInitializer() ]
{ jjtThis.name = t.image; }
}

Formal parameters can also be only an identifier.
void FormalParameter() #FormalParameter :
{
Token t;
}
{
( LOOKAHEAD(2) [ "final" { jjtThis.isFinal = true; } ]
Type()[ <ELLIPSIS> { jjtThis.isVarArgs = true; } ]
t=<IDENTIFIER> { jjtThis.name = t.image; }
|
t=<IDENTIFIER> { jjtThis.name = t.image; }
)
( LOOKAHEAD({ isFormalParameterDimensions() }) "[" "]" { jjtThis.dimensions++; } )*
}

Method declaration can also just be only an identifier followed by a "("
void MethodDeclaration() #MethodDeclaration :
{
Token t = null;
Modifiers mods;
int count;
}
{
mods = Modifiers( Modifiers.METHOD, false ) { if (null != mods) jjtThis.modifiers = mods; }
(
LOOKAHEAD( <IDENTIFIER> "(" )
t = <IDENTIFIER> { jjtThis.name = t.image; }
|
ReturnType()
t = <IDENTIFIER> { jjtThis.name = t.image; }
)
FormalParameters()
[ "throws" count=NameList() { jjtThis.numThrows=count; } ]
( Block() | ";" )
}

Then there are some other conveniences like array dimensions allows us to also just use an arrayinitializer instead of defined dimensions:

void ArrayDimensions() #ArrayDimensions :
{}
{
// e.g. int [4][3][][];
LOOKAHEAD(2)
( LOOKAHEAD(2) "[" Expression() "]" { jjtThis.addDefinedDimension(); } )+
( LOOKAHEAD(2) "[" "]" { jjtThis.addUndefinedDimension(); } )*
|
// e.g. int [][] { {1,2}, {3,4} };
( "[" "]" { jjtThis.addUndefinedDimension(); } )+ ArrayInitializer()
|
// e.g. { {1,2}, {3,4} };
ArrayInitializer() { jjtThis.numUndefinedDims = -1; }
}

We added python style slices
|
"[" [ Expression() { jjtThis.hasLeftIndex = true; } ]
( ":" { jjtThis.slice = true; }
[ Expression() { jjtThis.hasRightIndex = true; } ]
( ":" { jjtThis.step = true; } [ Expression() ] )? )? "]" {
jjtThis.operation = BSHPrimarySuffix.INDEX;
}
|

Lets us basically do arrays. maps, lists and general collections all through the same syntax

String and character literals are interchangeable, we deal with a character in the engine any character string longer than 1 is considered a string.

< CHARACTER_LITERAL: "'" (~["'","\n","\r","\\"] | <ESCAPE>)* "'" >
|
< STRING_LITERAL: "\"" (~["\"","\n","\r","\\"] | <ESCAPE>)* "\"" >

We BeanShell numbers have extensive integer and floating point literal definitions through which we have turned BigInteger and BigDecimal into glorified primitives with auto widening and narrowing support and just yesterday have turned all primitives into math classes
Integer Literals

< INTEGER_LITERAL:
<DECIMAL_LITERAL> (["o","O","s","S","i","I","l","L","w","W"])?
| <HEX_LITERAL> (["o","O","s","S","i","I","l","L","w","W"])?
| <BINARY_LITERAL> (["o","O","s","S","i","I","l","L","w","W"])?
| <OCTAL_LITERAL> (["o","O","s","S","i","I","l","L","w","W"])?

Floating point literals
|
< FLOATING_POINT_LITERAL:
(["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D","w","W"])?
| "." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D","w","W"])?
| (["0"-"9"])+ <EXPONENT> (["f","F","d","D","w","W"])?
| (["0"-"9"])+ ["f","F","d","D"]

For example script source files there are the test scripts and commands which are scripts that can be called as functions. These should give a good idea of what should be parseable.

But while all of this fun stuff is true, it does not come at the expense of JAVA compatibility and nothing added must break our ability to process standard JAVA. Additions on top of the JAVA grammar with a liberal helping of freedom from constraints. Our motto or credo is, that while BeanShell works like JAVA does it doesn't also have to break like JAVA does.

Welcome on board!!

@nickl- nickl- mentioned this issue Jan 3, 2023
@nickl- nickl- changed the title Proposal (further to issue #591) JavaCC 21 workgroup Jan 3, 2023
@revusky
Copy link
Author

revusky commented Jan 4, 2023

Well, to be clear, it's unlikely that I'm going to do much work on the semantics of things or how your runtime works. My goal right now is just to get you on the rails in terms of using JavaCC 21 for parser generation. Then you can better focus on the semantics, since most of the grammar/parsing side will be handled by the included Java grammar, except in the cases where you want to override it.

Probably most (or maybe even all) of the BSHXXX.java classes that you have to check in and maintain by hand can be just generated, which really is a big win.

As for StatementExpression, I'm pretty sure that this is the correct version of which expressions can stand as a statement: https://github.com/javacc21/javacc21/blob/master/examples/java/Java.javacc#L980

In any case, my offer is pretty much solely to get the parsing side using JavaCC 21

@nickl-
Copy link
Member

nickl- commented Apr 5, 2023

From off topic issue:

If you can migrate to the grammar I donated, you automatically can parse lambdas, method references, new-style switch with yield, and all the rest of it. It's a massive leg-up. So, properly understood, you should really expend your energy trying to get that grammar working for you.

That is the plan yes, and your contributions do not go unnoticed. thank you very much. For as long as we still remain on the legacy parser, wasteful energy expenditure is to be expected, and mustn't be avoided less the wheels start coming off. The sooner we get this migration to a point the better for everyone.

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

Further from off topic issue:

I donated that a couple of months ago. It requires you guys to do some incremental work to integrate it,

There now we are on topic and the correct place to find clarity of the current understanding, where we should report on progress, and where we are free to make new decisions or change prior commitments. As long as the issue remains open it also communicates that the task is not complete. We've had several discussions off line to date but unless it is recorded here there is really no way to keep track.

Lets recap the current understanding to date:

....revusky: Here's the proposal. I'll do it for you.
........nickl-: You don't have to work alone we can certainly pitch in, and having your expertise at wheel we'll dust this off in no time.
....revusky: In any case, my offer is pretty much solely to get the parsing side using JavaCC 21

My commitment remains the same, you can be certain that I do not expect you to work alone, I will help. I am with the understanding that you are at the wheel.

While you are driving I can make conversation to ensure you stay awake, point out land marks or potholes in the road, even take incentive like opening a cooldrink so you don't spill or light your cigarette to avoid distraction. But ultimately if you want something specific from me like opening the window a smidge, put on the heater or find a place for you on the map, you will need to tell me. I may know where we are going but as the driver only you know how to get there.

The expectation remains clear, you are only driving until BeanShell is parsing against the Congo grammar. At which point I will hopefully be able to see what remains of the road left to travel so that I can take the wheel. If I cannot see the road we are going to get lost or worse, have a terrible accident and get stranded without transport.

If you recall, you requested a review of what has been done, we had a discussion about it and my conclusion was that I was not yet able to see the road ahead. I humbly admitted in conclusion, that unless you can clearly point out the route, like on a map for example, you will have to take the wheel again and drive us to the next milestone. If you leave it with me like this, I am afraid all the work will be for nothing.

This is not me backing out of my commitment, or leaving you in the lurch, but simply a statement of fact. If you leave me here to pick it up on my own, we will not reach our destination. I did do another audit over the last week and will report my findings in the next post.

but understand that I can't do any more on that because (a) I have so many other things I want to get to in my own projects and (b) I just don't have any clarity that you guys will ever pick up the ball.

Hopefully I was successful to mitigate line item b) to satisfaction, however should even the slightest doubt, suspicion, disbelief or confusion remain, please talk to me for my intentions are admirable and true. In light hereof if you are unwilling, unable or simply too busy to recant the above statement please know you are under no obligation to deliver on the proposal or commitment. No harm done so no foul we simply close the issue and forget about it.

If like me you'd prefer a win instead, please clearly state your commitment and where we go next.

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

Rebased the congocc branch on top of latest master, it required a few minor tweaks to realign with package changes and permission scope, which after application the build again completes with all tests ending in success. These are the findings...

Slow build:

After updating to the new congo jar file it reverted the custom jar file which included the proposed fix to ignore generating files that exist which has not been merged, which is what the current solution is capable of. It may only be due to more files being physically generated and written to disk but in the absence of a solution current effects are:

  • 132% slower build from clean
  • 160% slower rebuild time

This will negatively impact development costs. Further benchmarks on script parsing and running costs still outstanding.

No congo parsing:

Since all tests are completing with success it has the appearance that everything is done, but it turns out the congocc branch still uses javacc in its entirety, with all files currently generated now being merged under the bsh.legacy package, including the legacy Parser class.

Since the current parser is still in use there is no change, when for example trying to parse a lambda expression.

Current master branch:

BeanShell 3.0.0-SNAPSHOT.5569
bsh % array = new int[10];
--> $0 = {0I, 0I, 0I, 0I, 0I, 0I, 0I, 0I, 0I, 0I} :int[]
bsh % Arrays.setAll(array, p -> p > 9 ? 0 : p);
// Error: Parser Error: Unable to parse code syntax. Encountered: -> at line 2, column 24
bsh % // Error: Parser Error: Unable to parse code syntax. Encountered: ) at line 1, column 15

From the new congocc branch:

BeanShell 3.0.0-SNAPSHOT.CONGOCC
bsh % array = new int[10];
--> $0 = {0I, 0I, 0I, 0I, 0I, 0I, 0I, 0I, 0I, 0I} :int[]
bsh % Arrays.setAll(array, p -> p > 9 ? 0 : p);
// Error: Parser Error: Unable to parse code syntax. Encountered: -> at line 2, column 24
bsh % // Error: Parser Error: Unable to parse code syntax. Encountered: ) at line 1, column 15

Both equally incapable of parsing the arrow ->.

Where to next:

As it stands now it is unclear what needs to happen next. It would be understandable if BeanShell tests are failing because the new AST has not been completely migrated yet and with an example or two we would be able to pick it up and continue implementing the rest. Unfortunately the proposed solution and work done on the congocc branch for which we are very grateful, has yet to migrate to the new parser. So what are we supposed to do with that?

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

What is required:

As a minimum I think the following two items, if completed, would set us on a clear path ahead.

  1. Migrate to new parser:
    Change over to the new parser and prove that unimplemented but valid JAVA syntax (like lambda expressions) are now parsed without exceptions.

  2. Successfully interpret the prompt command:
    If the new implementation is able to successfully interpret the getBshPrompt command, we would be able to use the example and REPL client to develop the rest of the migration.

@revusky
Copy link
Author

revusky commented Apr 6, 2023

Okay, here is what I would like you guys to do now. There is a file TestHarness.java that I updated just a few minutes ago. (I also updated the congocc-full.jar in the root directory, BTW.) You can run the test harness on any .bsh file and see what it outputs, as in:

   java bsh.TestHarness somescript.bsh

Could you try it out a bit and report back? I would also like @opengo to do this as well. Actually, anybody who is interested in development on this project.

I'll answer your other points separately after you have tried out the above. Just let me know...

Actually, to be clear:

    git fetch origin
    git checkout congocc
    mvn compile
    java -classpath target/classes bsh.TestHarness

Run this yourself and try to get other people to run it. I'll answer your other points, but after you've verified that you've done the above.

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

getBshPrompt.bsh.txt

Output for:

$ java -classpath target/classes bsh.TestHarness src/main/resources/bsh/commands/getBshPrompt.bsh

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

I notice the new parser has dropped the constructor which takes a Reader, it monitors for input.

/**
* Constructor with InputStream.
* @param stream char stream
*/
public Parser(final java.io.Reader stream) {
jj_input_stream = new JavaCharStream(stream, 1, 1);
token_source = new ParserTokenManager(jj_input_stream);
token = new Token();
jj_ntk = -1;
jj_gen = 0;
for (int i = 0; i < 118; i++)
jj_la1[i] = -1;
for (int i = 0; i < jj_2_rtns.length; i++)
jj_2_rtns[i] = new JJCalls();
}
/**
* Reinitialise
* @param stream char stream
*/
public void ReInit(final java.io.Reader stream) {
if (jj_input_stream == null) {
jj_input_stream = new JavaCharStream(stream, 1, 1);
} else {
jj_input_stream.reInit(stream, 1, 1);
}
if (token_source == null) {
token_source = new ParserTokenManager(jj_input_stream);
}

We use it for the REPL client, takes input from STDIN.

Is there a replacement for this functionality or will we have to write a work around?

@revusky
Copy link
Author

revusky commented Apr 6, 2023

Okay, did you try it on other files? (It would be hard to imagine that you didn't. Curiosity killed the cat and all....)

You surely understand what the TestHarness is outputting, right?

Also, I'd like to ask: is this the first time that you ran that TestHarness?

@revusky
Copy link
Author

revusky commented Apr 6, 2023

I notice the new parser has dropped the constructor which takes a Reader, it monitors for input.

/**
* Constructor with InputStream.
* @param stream char stream
*/
public Parser(final java.io.Reader stream) {
jj_input_stream = new JavaCharStream(stream, 1, 1);
token_source = new ParserTokenManager(jj_input_stream);
token = new Token();
jj_ntk = -1;
jj_gen = 0;
for (int i = 0; i < 118; i++)
jj_la1[i] = -1;
for (int i = 0; i < jj_2_rtns.length; i++)
jj_2_rtns[i] = new JJCalls();
}
/**
* Reinitialise
* @param stream char stream
*/
public void ReInit(final java.io.Reader stream) {
if (jj_input_stream == null) {
jj_input_stream = new JavaCharStream(stream, 1, 1);
} else {
jj_input_stream.reInit(stream, 1, 1);
}
if (token_source == null) {
token_source = new ParserTokenManager(jj_input_stream);
}

We use it for the REPL client, takes input from STDIN.

Is there a replacement for this functionality or will we have to write a work around?

It is true that the parsers generated by CongoCC do not have the constructor that takes a Reader or InputStream as a parameter. This is because of a major refactoring where the thing does not buffer any input. It just slurps in all the input (typically a file) and works on it.

It's true that that did (temporarily) screw up this use-case of the interactive interpreter, which is not currently being supported. IOW, we don't have the use-case of blocking I/O working. But it's not big deal, believe me. I'll do the incremental work (not much) to get that sort of thing working again. You see, the thing is that having a Reader or InputStream as the argument does take advantage of blocking input. So, when the thing tries to get the next character from input and there is none, it just blocks until somebody types some more input. But that's not really very hard to get working again, trust me...

So, yeah, it's not very hard to get this working again probably, but I tore out that a good while ago, probably having in the back of my mind, to have a better version of this working later. And I just never had any cause to get that working again. Until now.

BUT.... you can certainly use the existing parser to parse/run any non-interactive script. So, IOW, there is no need for you guys to be sitting around on your hands waiting for me to get the REPL thing going again. Just do your end of getting the interpreter working with actual script files and I'll certainly do my end of getting the REPL working.

And, you know, now that I think about it, this may be the source of some earlier misunderstanding. I meant to tell you this, that there was some work pending from me regarding supporting interactive interpreters. However, this does not really prevent you from moving forward on this on your end. You can certainly test/tweak the new parser against scripts that are non-interactive, i.e. you just read them in and parse the whole thing and run it. Meanwhile, I can get the interactive stuff going in parallel, and actually, once I concentrate on that, it will almost certainly be working better than it ever was, because frankly, the whole existing thing is a bit lame really. And, you know, there is also the potential of using the fault-tolerant machinery, which really needs a good test case to get all the kinks out. Have you seen this article? This is basically working actually, but Beanshell could provide a very good use case for getting it really polished. I suppose, needless to say, that you know that legacy JavaCC has NOTHING like this! (And it never will...)

So, anyway, to answer your original question, yes, I need to put back in some functionality to support a REPL, because the parser that the tool generates just assumes that it can slurp in all the input at one go at the start. And that is not the case with a REPL. This is totally doable and the fact that that piece is not currently working does NOT present any obstacle to you guys proceeding with this. And again, the way it is working in the legacy code is not that great really. It's just leveraging the fact that you have blocking in put, so if there is pending input, like the stream is open, it just sits there waiting for more input. We can do better than that.

But look, if you commit to getting it going sans REPL, I'll commit to getting the necessary piece going for the REPL.

And, really, I think that if we commit to this, we can certainly set a realistic goal of throwing away the older parser by the end of this month. But my main point I made elsewhere was NOT off-topic. To move forward, you do have to have the courage to throw away the older system at some point. It relates to why Hernan Cortes burned his boats after getting to the New World. He told his men that there was no going back. It's a similar situation. I think you have to burn the boats within the next few weeks.

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

It just slurps in all the input (typically a file) and works on it.

I figured, which is perhaps the better route to go.

I'll do the incremental work (not much) to get that sort of thing working again.

Don't sweat about that for now, something that might be more useful is being able to take an instance of the parser and feed that instance lines. But lets see what happens there might be more pressing stuff.

@nickl-

This comment was marked as off-topic.

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

Have a problem, Tokens are not Nodes. =(

INJECT Delimiter : extends bsh.BSHPrimarySuffix
INJECT AdditiveExpression : extends bsh.BSHBinaryExpression

Not sure how we're going to treat these ones.

The problem is with the UnsupportedOperationException one of these methods are getting hit:

    public void setChild(int i, Node n) {
        throw new UnsupportedOperationException();
    }

    public void addChild(Node n) {
        throw new UnsupportedOperationException();
    }

    public void addChild(int i, Node n) {
        throw new UnsupportedOperationException();
    }

    public Node removeChild(int i) {
        throw new UnsupportedOperationException();
    }

@revusky

This comment was marked as off-topic.

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

I may also be incorrectly mapping Delimiter, this is not going to be easy.

INJECT Delimiter : extends bsh.BSHPrimarySuffix
INJECT AdditiveExpression : extends bsh.BSHBinaryExpression

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

I've already said that I will do whatever incremental work to get the REPL going.

YOU DON'T HAVE TO DO THE REPL!!!!

@revusky
Copy link
Author

revusky commented Apr 6, 2023

Have a problem, Tokens are not Nodes. =(

Well, you need to get rid of the legacy code as soon as possible. Tokens ARE Nodes in CongoCC. Or, to be more precise, they ARE Nodes in the sense that they implement the Node interface. But the concrete implementation of setChild or addChild etcetera in Token.java is to just throw UnsupportedOperationException, because the assumption is that a Token, being a terminal node, never has any children, so any attempt to add/remove children must be a programming error.

And, in fact, that is CORRECT. Tokens are the terminal nodes of the parse tree. At least, that is how any natural treatment of the problem would handle this. The reason that tokens are not nodes in the legacy JJTree thing is because it was implemented as a preprocessor and the guy who did it (I think his name is Rob Duncan) back in 1997 (!) did not feel comfortable changing anything in the core JavaCC, so the Token class was never retrofitted to implement the Node API.

So, as for the UnsupportedOperationException being thrown when you try to add/remove a Node from a Token, well, that is because Tokens are the terminal nodes and do not have children. However, look at the generated BaseNode.java and you'll see that a non-terminal node does not throw the exception when you try to add/remove a child node.

Really, once you get into this stuff, I think you'll see that it all kinda does make sense. But if something doesn't seem to make sense, by all means raise the issue!

Are you actually hitting that exception somehow? It is perfectly normal that an attempt to add/remove a child from a Token, i.e. a terminal node, hits an exception.

INJECT Delimiter : extends bsh.BSHPrimarySuffix
INJECT AdditiveExpression : extends bsh.BSHBinaryExpression

Not sure how we're going to treat these ones.

The problem is with the UnsupportedOperationException one of these methods are getting hit:

    public void setChild(int i, Node n) {
        throw new UnsupportedOperationException();
    }

    public void addChild(Node n) {
        throw new UnsupportedOperationException();
    }

    public void addChild(int i, Node n) {
        throw new UnsupportedOperationException();
    }

    public Node removeChild(int i) {
        throw new UnsupportedOperationException();
    }

@revusky
Copy link
Author

revusky commented Apr 6, 2023

I may also be incorrectly mapping Delimiter, this is not going to be easy.

INJECT Delimiter : extends bsh.BSHPrimarySuffix
INJECT AdditiveExpression : extends bsh.BSHBinaryExpression

Well, Delimiter is a Token subclass, so it can't extend BSHPrimarySuffix, which, I suppose is a BaseNode, i.e. a NonTerminal. If you did need Delimiter and BSHPrimarySuffix to have a common API somehow, they would have to implement an interface in common.

@nickl-

This comment was marked as off-topic.

@nickl-

This comment was marked as off-topic.

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

Well, Delimiter is a Token subclass, so it can't extend BSHPrimarySuffix, which, I suppose is a BaseNode, i.e. a NonTerminal. If you did need Delimiter and BSHPrimarySuffix to have a common API somehow, they would have to implement an interface in common

Well Delimeter is returned as the next child so it is happy to be a Node when it is not wanted.

@revusky
Copy link
Author

revusky commented Apr 6, 2023

I've already said that I will do whatever incremental work to get the REPL going.

YOU DON'T HAVE TO DO THE REPL!!!!

Well, in principle, I don't absolutely have to. But the fact remains that it is currently a flaw in CongoCC that this use-case isn't really handled. And surely Beanshell won't be the only project that needs something like this, so it makes sense to handle this upstream.

But also, I do hope you understand that there is no urgent need to get the REPL functionality working immediately. Getting to the point where you can just read in a .bsh file, parse it and run it, that is really orthogonal to the REPL issue. That can be solved separately at a later point.

The basic problem that you can (optionally) have an incomplete input buffer and if you reach the end, you block until there is new input -- well, it's really a pretty well understood problem. The way Beanshell currently handles this is just by counting on the blocking I/O in the core Java class. So the read() method just blocks until there is new input.

But it's possible to do something more elegant and flexible than that. And as I have said, legacy JavaCC has ZERO concept of fault-tolerant parsing -- error recovery, backtracking... So you're bound to discover that you're opening up so many interesting possibilities by using the more advanced tool.

@revusky
Copy link
Author

revusky commented Apr 6, 2023

Well, Delimiter is a Token subclass, so it can't extend BSHPrimarySuffix, which, I suppose is a BaseNode, i.e. a NonTerminal. If you did need Delimiter and BSHPrimarySuffix to have a common API somehow, they would have to implement an interface in common

Well Delimeter is returned as the next child so it is happy to be a Node when it is not wanted.

Well, it could be that your first pass on an interpreter should just be an instance of Node.Visitor. Actually you already worked with Node.Visitor when you were tweaking the JavaFormatter and the Reaper (maybe it was still called DeadCodeEliminator at the time). Both of those classes extend Node.Visitor. You know, here and here.

As you surely saw, you can just walk the tree recursively and execute any node handling methods that are encountered.

But, I mean, if you don't want to do anything with Tokens, well, then just:

     void visit(Token tok) {}

Or actually, what I am I saying? If you don't want to do anything with Tokens, you just don't define a handler! It just recurses into the children, but since there aren't any children....

You can also define a handler solely for the Token subclass you're interested in:

  void visit(Identifier id) {do something....}

So, well, you can "visit" the Tokens in the tree or not, if you want. With JJTree, you can't put in visit handlers for the Tokens, because, as you point out, Tokens are NOT Nodes! In Congo, they are Nodes when it's convenient, but if you want to ignore them, that's not a problem!

Oh, there is also a TOKENS_ARE_NODES=false option and then they aren't automatically added as the terminal nodes to the tree. So there is that too. The bottom line is that you have a superset of the functionality from before. So it's all much more flexible.

@revusky

This comment was marked as off-topic.

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

Oh, there is also a TOKENS_ARE_NODES=false option

Looks like this was helpful. Thanx.

Ast still all over the place...

@nickl-
Copy link
Member

nickl- commented Apr 6, 2023

Pushed new grammar with additional mappings added.

Here is a modified BaseNode with dump output changes.
BaseNode.java.txt

@revusky
Copy link
Author

revusky commented Apr 6, 2023

The TOKENS_ARE_NODES=false is not going to work,

Actually, this probably can't work for this case because I think too much of the logic in the Java grammar assumes that the tokens are in the tree. For example, by default, the tree-building logic will build a node if there is >1 nodes on the stack, and that includes tokens, so it will build a completely different tree if TOKENS_ARE_NODES is false.

Well, come to think of it, you could change the aforementioned tree-building logic to build a node for every production, i.e.

  TOKENS_ARE_NODES=false;
   SMART_NODE_CREATION=false;

up top. But I'm not sure offhand if that would set things straight. Probably the better approach is to use the defaults, that the tokens are nodes, and it builds a new node if there is more than one node on the stack.

Now, if you really need to massage the tree, you could use the close hook, like:

   INJECT BinaryExpression : 
   {
         public void close() {
              //put in some logic  to adjust the child nodes
         }
   }

So there are various options. But, probably, in the end, you're running into problems trying to keep the old system working and the new alongside one another. At some point, it will be simpler just to cut loose from the older system, I suppose.

@revusky

This comment was marked as off-topic.

@nickl-
Copy link
Member

nickl- commented Apr 7, 2023

So there are various options.

I solved it by filtering the children collection and using firstChildOfType.

But, probably, in the end, you're running into problems trying to keep the old system working and the new alongside one another. At some point, it will be simpler just to cut loose from the older system, I suppose.

Yip it's broken =/

@nickl-

This comment was marked as off-topic.

@nickl-
Copy link
Member

nickl- commented Apr 7, 2023

We need to resolve these grammar issues, please I need your help:

Assignment, PrimaryExpression and a missing AmbiguousName: String

Here are some more examples.

Expected: 1+1;

Assignment
 BinaryExpression: "+"
  PrimaryExpression
   Literal: 1
  PrimaryExpression
   Literal: 1

Actual:

BSHAssignment
 BSHBinaryExpression: +
  BSHLiteral: 1
  BSHLiteral: 1

Expected for: System.out.println("Hello Congo");

Assignment
 PrimaryExpression
  MethodInvocation
   AmbiguousName: System.out.println
   Arguments
    Assignment
     PrimaryExpression
      Literal: Hello Congo

Actual:

BSHAssignment
 BSHMethodInvocation
  BSHAmbiguousName: System.out.println
  BSHArguments
   BSHLiteral: "Hello Congo"

Completed ambiguous names and the operators now which reminds me I have a question about that will make a separate post. Here is the getBshPrompt.bsh now

Expected: src/main/resources/bsh/commands/getBshPrompt.bsh

MethodDeclaration: getBshPrompt
 ReturnType: void=false
  Type
   AmbiguousName: String
 FormalParameters
 Block: static=false, synchronized=false
  IfStatement
   Assignment
    BinaryExpression: "&&"
     BinaryExpression: "!="
      PrimaryExpression
       AmbiguousName: bsh
      PrimaryExpression
       Literal: void
     BinaryExpression: "!="
      PrimaryExpression
       AmbiguousName: bsh.prompt
      PrimaryExpression
       Literal: void
   ReturnStatement: "return" null:
    Assignment
     PrimaryExpression
      AmbiguousName: bsh.prompt
   ReturnStatement: "return" null:
    Assignment
     PrimaryExpression
      Literal: bsh %

Actual:

BSHFormalComment: null
BSHMethodDeclaration: null
 BSHReturnType: void=false
  BSHType
 BSHFormalParameters
 BSHBlock: static=false, synchronized=false
  BSHIfStatement
   BSHBinaryExpression: &&
    BSHBinaryExpression: !=
     BSHAmbiguousName: bsh
     BSHLiteral: void
    BSHBinaryExpression: !=
     BSHAmbiguousName: bsh.prompt
     BSHLiteral: void
   BSHReturnStatement
    BSHAmbiguousName: bsh.prompt
   BSHReturnStatement
    BSHLiteral: "bsh % "

Expected: Integer.parseInt("1");

Assignment
 PrimaryExpression
  MethodInvocation
   AmbiguousName: Integer.parseInt
   Arguments
    Assignment
     PrimaryExpression
      Literal: 1

Actual:

BSHAssignment
 BSHMethodInvocation
  BSHAmbiguousName: Integer.parseInt
  BSHArguments
   BSHLiteral: "1"

Expected: src/main/resources/bsh/commands/pwd.bsh

MethodDeclaration: pwd
 FormalParameters
 Block: static=false, synchronized=false
  Assignment
   PrimaryExpression
    MethodInvocation
     AmbiguousName: print
     Arguments
      Assignment
       PrimaryExpression
        AmbiguousName: bsh.cwd

Actual:

BSHFormalComment: null
BSHMethodDeclaration: null
 BSHReturnType: void=false
 BSHFormalParameters
 BSHBlock: static=false, synchronized=false
  BSHAssignment
   BSHMethodInvocation
    BSHAmbiguousName: print
    BSHArguments
     BSHAmbiguousName: bsh.cwd

Expected: src/main/resources/bsh/commands/exit.bsh

Assignment: "="
 PrimaryExpression
  AmbiguousName: bsh.help.exit
 Assignment
  PrimaryExpression
   Literal: usage: exit()
MethodDeclaration: exit
 FormalParameters
 Block: static=false, synchronized=false
  IfStatement
   Assignment
    BinaryExpression: "!="
     PrimaryExpression
      AmbiguousName: bsh.system.shutdownOnExit
     PrimaryExpression
      Literal: false
   Assignment
    PrimaryExpression
     MethodInvocation
      AmbiguousName: System.exit
      Arguments
       Assignment
        PrimaryExpression
         Literal: 0

Actual:

BSHFormalComment: null
BSHAssignment
 BSHAssignment: =
  BSHAmbiguousName: bsh.help.exit
  BSHLiteral: "usage: exit()"
BSHMethodDeclaration: null
 BSHReturnType: void=false
 BSHFormalParameters
 BSHBlock: static=false, synchronized=false
  BSHIfStatement
   BSHBinaryExpression: !=
    BSHAmbiguousName: bsh.system.shutdownOnExit
    BSHLiteral: false
   BSHAssignment
    BSHMethodInvocation
     BSHAmbiguousName: System.exit
     BSHArguments
      BSHLiteral: 0

Expected: src/main/resources/bsh/commands/error.bsh

MethodDeclaration: error
 ReturnType: void=true
 FormalParameters
  FormalParameter: item, final=false, varargs=false
 Block: static=false, synchronized=false
  Assignment
   PrimaryExpression
    MethodInvocation
     AmbiguousName: this.interpreter.error
     Arguments
      Assignment
       PrimaryExpression
        MethodInvocation
         AmbiguousName: String.valueOf
         Arguments
          Assignment
           PrimaryExpression
            AmbiguousName: item

Actual:

Exception in thread "main" bsh.congo.parser.ParseException:
Encountered an error at (or somewhere around) src/main/resources/bsh/commands/error.bsh:7:6
Was expecting one of the following:
SEMICOLON
Found string "error" of type IDENTIFIER
        at bsh.congo.parser.BeanshellParser.handleUnexpectedTokenType(BeanshellParser.java:36983)
        at bsh.congo.parser.BeanshellParser.consumeToken(BeanshellParser.java:36972)
        at bsh.congo.parser.BeanshellParser.ExpressionStatement(BeanshellParser.java:7450)
        at BeanshellParser.Statement(/include/java/Java.ccc:947)
        at bsh.congo.parser.BeanshellParser.Statement(BeanshellParser.java:6744)
        at BeanshellParser.BlockStatement(src/main/congo/Beanshell.ccc:482)
        at bsh.congo.parser.BeanshellParser.BlockStatement(BeanshellParser.java:7048)
        at BeanshellParser.Statements(src/main/congo/Beanshell.ccc:344)
        at bsh.congo.parser.BeanshellParser.Statements(BeanshellParser.java:10081)

Oops!! =)

The major hold up is Assignment and PrimaryExpression. The former is the general Expression in the current jjt grammar, which is scattered all over the place.

/*
* This expansion has been written this way instead of:
* Assignment() | ConditionalExpression()
* for performance reasons.
*/
void Expression() #Assignment :
{ int op ; }
{
ConditionalExpression()
[ op=AssignmentOperator() { jjtThis.operator = op; } Expression() ]
}

And the PrimaryExpression defines the split between Prefix and Suffix definitions.

void PrimaryExpression() #PrimaryExpression : { }
{
PrimaryPrefix() ( PrimarySuffix() )*
}

@nickl-
Copy link
Member

nickl- commented Apr 7, 2023

RE: Operator vs TokenType

I know how to get the enum type constant from an operator Operator->getType the question is, given a token enum constant, how do I get the corresponding operator string?

@nickl-
Copy link
Member

nickl- commented Apr 7, 2023

New commit 6098986 updating changes.

@revusky
Copy link
Author

revusky commented Apr 7, 2023

We need to resolve these grammar issues, please I need your help:

Okay, well, I assume you know that the Java grammar being INCLUDEd into the beanshell grammar is the one here. And, of course, the errors.bsh cannot be parsed because the FormalParameter production requires a Type parameter. Or, more concretely:

  void error( item ) {
       this.interpreter.error( String.valueOf(item) );
  }

can't be parsed because the item needs a type in front of it. This could parsed perfectly well:

  void error(int item ) {
       this.interpreter.error( String.valueOf(item) );
  }

I tried to adjust things so the cases where Beanshell is looser than regular Java worked, but I somehow neglected this one.

It could be that the solution would be to redefine the FormalParameter production, which is currently:

FormalParameter : 
        {permissibleModifiers = EnumSet.of(FINAL);}# 
        Modifiers 
        Type 
        [ (Annotation)* <VAR_ARGS> ] 
         VariableDeclaratorId 
;

to redefine it in your grammar to make the Type parameter optional, i.e.

FormalParameter : 
        {permissibleModifiers = EnumSet.of(FINAL);}# 
        Modifiers 
        [Type] 
        [ (Annotation)* <VAR_ARGS> ] 
         VariableDeclaratorId 
;

However, I have to admit that I'm not 100% sure that this wouldn't have some implications elsewhere! You'd have to try it, I guess.

Assignment, PrimaryExpression and a missing AmbiguousName: String

As for AmbiguousName, I'm not sure what the problem is. I guess you could pretty much always use Name instead. And I see that you've noticed that Name actually subclasses AmbiguousName, so, in principle, you can use that name as if it was an AmbiguousName (though it's true that the devil is in the details!) Probably you would need some changes to what is defined in AmbiguousName.java so that it can do the right thing even when it is really a Name, not an AmbiguousName.

As for PrimaryExpression, I'm not sure what to say about that. First of all, surely you understan that there is no one unique "correct" parse tree that you can generate from a given construct. So, the fact that the newer grammar generates a somewhat different tree... well, okay, I understand that it is a problem for you at this juncture, but it is not a bug per se. But anyway, if we characterize the problem as being that it generates a tree that is different from the one that your client code "expects", then there are, in principle, two basic solutions:

  • adjust the grammar (or maybe use the close hook or a separate tree manipulation post-parse pass) somehow so that it does generate the desired tree
  • adjust the client code so that it does the right thing with the somewhat different parse tree shape

Both the above look feasible, but my tendency would be to think that the second approach will have better (more robust) results.

By the way, there is a feature I am considering adding (actually much certainly will add) that could be useful to you. At the moment, you cannot redefine the tree-building annotation on a production without copy-pasting the entire production. So, let's say, for example, you had:

             SomeProduction #(>2)  : blah blah....blah blah ;

So you build a new node if there are more than 2 nodes on tree-building stack. Otherwise, you leave them there.

Now, suppose you want to change the tree-building annotation, let's say to >3. (Or possibly, to some more complex condition, whatever....) Then you have to copy-paste the entire production (potentially long) and change the >2 to >3 or whatever. I'm thinking about being able to just change the tree-building annotation but reusing everything else. You see what I mean?

But that currently isn't implemented. It's on my mental TODO list.

Well, I think that's all I can answer for now. I'm glad you're getting into this, Nick. I would say that even if this is a rough slog right now, at some point, the whole thing will unravel, kind of like doing a crossword puzzle or something. You get past some key point and the rest of the thing just kind of falls into place. And I think you will end up seeing that getting this transition done is not the multi-month project that you have thought it is. Though, I could be wrong, but you would have to accept that I have more prior experience with this sort of stuff -- transitioning grammars and adjusting the API and all that. So my point estimate on how much time this entails is likely to be more accurate. But I could still be wrong!

@revusky
Copy link
Author

revusky commented Apr 7, 2023

RE: Operator vs TokenType

I know how to get the enum type constant from an operator Operator->getType the question is, given a token enum constant, how do I get the corresponding operator string?

Well, if you have the operator object, then the string representation of the operator is just operator.getImage(), no?

@nickl-
Copy link
Member

nickl- commented Apr 7, 2023

I know how to get the enum type constant from an operator Operator->getType the question is, given a token enum constant, how do I get the corresponding operator string?

Well, if you have the operator object, then the string representation of the operator is just operator.getImage(), no?

Wow we clearly have a huge communication gap, happened several times yesterday as well. I am always eager to blame myself as obviously I did not express myself clearly enough otherwise you would understood me. Reading my question again and again I am unable to see how saying that I have the token constant and want the operator leads you to understand I have an operator and want an operator. You are trolling me right?

Since I am desperate for an answer I have to repeat the question again.

I have a token constant, for example TokenType.ELVIS and nothing else, how do I get the corresponding operator, which should be: ?:, returned?

JavaCC provides an array lookup under ParserConstants called tokenImage which has the token string indexed to the constant's value. So the answer would be to use tokenImage[ELVIS] == "?:". In the absence of an obvious solution I attempted to retain the Operator object but shot myself in the foot having to revert that effort. Because the logic is reusable, and gets reused in parts where only TokenType is available.

@nickl-
Copy link
Member

nickl- commented Apr 7, 2023

Probably you would need some changes to what is defined in AmbiguousName.java so that it can do the right thing even when it is really a Name, not an AmbiguousName.

Actually it's the other way around, the classes can already identify correctly interpret multiple types of similar form, what we need now is to massage the congo grammar to also label the same content for us as before. This still leaves us with a ton of work, all the data injections need to be reimplemented, the node traversal has changed, with or without tokens as nodes, and the data types have changed of which Operator and TokenType was hopefully the biggest, which is done. Every one of these fixes requires research and investigation to find a solution which is very time consuming.

But at least each class will remain contained within their specific domains along with the flow diagram between the objects. Ripping the logic apart to accommodate the AST will be a futile effort, and humpty dumpty will never be put back together again. Which is why these are raised as show stoppers and why I am begging for your expertise to assist.

If there is one thing which this migration can prove, yes congo is opinionated, abandoned backwards compatibility, with many unique and fantastic new features, that it is flexible enough that BeanShell could migrate without changing their existing logic. That seems to me like a worthwhile goal to strive for. Making this an accomplished case study worthy to follow with your own javacc projects.

@revusky
Copy link
Author

revusky commented Apr 7, 2023

I know how to get the enum type constant from an operator Operator->getType the question is, given a token enum constant, how do I get the corresponding operator string?

Well, if you have the operator object, then the string representation of the operator is just operator.getImage(), no?

Wow we clearly have a huge communication gap, happened several times yesterday as well. I am always eager to blame myself as obviously I did not express myself clearly enough otherwise you would understood me. Reading my question again and again I am unable to see how saying that I have the token constant and want the operator leads you to understand I have an operator and want an operator. You are trolling me right?

No, absolutely not. I found the question perplexing. It honestly doesn't occur to me a context where you have the TokenType but don't have the Token.

If you are storing the tokens as terminal child nodes, then it's something like:

              binaryExpression.firstChildOfType(TokenType.OPERATOR).getImage()

You see, you have to understand that I am just so accustomed to the idea that you have the Tokens themselves as terminal children in the Node, so I would always be able to write:

                binaryExpression.firstChildOfType(TokenType.OPERATOR).getImage()

assuming I wanted the string image of the token in question.

Since I am desperate for an answer I have to repeat the question again.

I have a token constant, for example TokenType.ELVIS and nothing else, how do I get the corresponding operator, which should be: ?:, returned?

JavaCC provides an array lookup under ParserConstants called tokenImage which has the token string indexed to the constant's value. So the answer would be to use tokenImage[ELVIS] == "?:".

Yeah, I remember that now. I guess I removed it as some point because I didn't think it was that useful. I think, BTW, I think that that tokenImage thingy was mostly used to generate error messages, and since I wasn't using that mechanism any more, I eventually got rid of it (rightly or wrongly) because it just looked like code bloat. But look, if you really want a lookup for TokenType->String, it's easy to have it.

You could just inject this somewhere, most likely into Token or BaseNode:

    INJECT Token :
        import java.util.EnumMap;
    {
           static private EnumMap<TokenType.class> imageLookup = new EnumMap<>();
           static {
                    imageLookup.put(TokenType.PLUS, "+");
                    imageLokup.put(TokenType.MINUS, "-"); 
                    etc.
           }
           static public String imageFromTokenType(TokenType type) {return (String) imageLookup.get(type);}
   }

But, you know, when you have INJECT, you can just inject whatever you need, in principle.

OR... just store the string as a field in the appropriate place.

In the absence of an obvious solution I attempted to retain the Operator object but shot myself in the foot having to revert that effort. Because the logic is reusable, and gets reused in parts where only TokenType is available.

@revusky
Copy link
Author

revusky commented Apr 7, 2023

Probably you would need some changes to what is defined in AmbiguousName.java so that it can do the right thing even when it is really a Name, not an AmbiguousName.

Actually it's the other way around, the classes can already identify correctly interpret multiple types of similar form, what we need now is to massage the congo grammar to also label the same content for us as before. This still leaves us with a ton of work, all the data injections need to be reimplemented, the node traversal has changed, with or without tokens as nodes, and the data types have changed of which Operator and TokenType was hopefully the biggest, which is done. Every one of these fixes requires research and investigation to find a solution which is very time consuming.

But at least each class will remain contained within their specific domains along with the flow diagram between the objects. Ripping the logic apart to accommodate the AST will be a futile effort, and humpty dumpty will never be put back together again. Which is why these are raised as show stoppers and why I am begging for your expertise to assist.

If there is one thing which this migration can prove, yes congo is opinionated, abandoned backwards compatibility, with many unique and fantastic new features, that it is flexible enough that BeanShell could migrate without changing their existing logic. That seems to me like a worthwhile goal to strive for. Making this an accomplished case study worthy to follow with your own javacc projects.

Here is an alternative pattern:

            INJECT BinaryExpression : 
            {
                    @Property String opString;
            }

and then in the appropriate place in the grammar:

            BinaryExpression :
                   LHS
                   (<PLUS>|<MINUS>|<TIMES>|<DIVIDE>)
                   {CURRENT_NODE.setOpString(lastConsumedToken.getImage());}
                   RHS
           ;

Something like that. As you doubtless surmise, the @Property String opString above actually injects this:

           private String opString;
           public String getOpString() {return opString;}
           public void setOpString(String opString) {this.opString = opString;}

i mean, this is a much more powerful and you'll see that there are many ways to skin a cat, once you really come to some familiarity with the feature set, anyway.

BTW, maybe I mentioned this, I have the new C# 11 raw string literals working. (In the C# grammar obviously.) See here. I don't honestly think that legacy JavaCC is powerful enough to express this. (Particularly the interpolated raw strings.) Well, it might be possible to get it working, but the implementation would just be something horrific. The main implementation with Congo is pretty clean but some messy details are hidden in a TOKEN_HOOK routine. Well, there, I'm just showing off, I guess, since this is not directly relevant to anything you're doing or were asking me about. Well, I guess I do want to convey how much more powerful the tool you'll be using is. It will be an especially good situation if you manage to stay on good terms with the implementor. Like, you know, welcome to real software development. (As opposed to nothingburger-ism.

Oh, and by the way, I did experiment with tweaking the FormalParameter production in the beanshell grammar and it seems okay. But I have to admit I don't have any extensive tests or anything. But hey, it ain't my project really!

But I mean, if you just insert this:

FormalParameter : 
  {permissibleModifiers = EnumSet.of(FINAL);}# 
   Modifiers 
   [Type]
   [ (Annotation)* <VAR_ARGS> ] 
   VariableDeclaratorId 
;

into the Beanshell.ccc, it does seem to handle certain Beanshell constructs that it didn't handle before. It just makes the Type before the parameter optional.

@revusky
Copy link
Author

revusky commented Apr 7, 2023

Actually it's the other way around, the classes can already identify correctly interpret multiple types of similar form, what we need now is to massage the congo grammar to also label the same content for us as before. This still leaves us with a ton of work,

Well, it's some work for sure. This isn't for the work-shy! But I'd say it's still more or less moderate. You can quantify it. More or less. There are 42 bsh.BSHXXX classes, right? Though, some of them are quite trivial. My scheme for transition, which I thought was pretty clever, providing a basis to migrate, was to generate Node classes that are subclasses of the corresponding BSHXXX classes. So, in principle, they have the legacy functionality inside of them. And so, during an initial stage, you could keep the old system working and gradually get the new parser to replace the old parser... BUT...

gradually

That's the idea basically. So, as best I can figure, the best approach is that you pick the low-hanging fruit first. For example, the most simple thing to interpret, I guess, is just a plain variable, like:

   foo

And that's probably basically just a hash lookup. Or maybe multiple hash lookup. It looks in a hash that represents local variables and then failing that, checks the more outer scopes. More or less.. Right?

Well, I guess evaluating foo.bar.baz is you evaluate foo and then in that scope, look for a bar, and then in the resulting scope, look for a baz.... Right?

But, in principle, the Node generated by the newer parser should be able to do that stuff pretty easily. One would have to think that switch-case or a for-loop is more difficult to deal with. But the basic approach should be clear, I think. But the thing is, I guess, that in an initial stage of the process, you find yourself adding code to the BSHXXX classes to deal with the fact that you either have that class or are in a generated subclass. But then, you'll finally reach an inversion point, where you're not adding code any more. And then when you're confident that you have the new parser working, then a whole bunch of old crufty code just melts away, because so much of what you have is a kind of scaffolding that was necessary to keep the legacy system working. So when you switch entirely to the new system, a whole bunch of scaffolding code isn't needed any more.

And, you know, that kind of thing can be very (very very) satisfying in the end, when you reach the point where you're largely geting rid of the old scaffolding. Because, after that, you reach this point where you step back and just look at the code you now have, and you'll see how much better structured it is! That extremely satisfying moment is kind of like orgasm or something. (Except that it lasts longer!) That's what happened when I was refactoring/rewriting all the legacy JavaCC code. There would be some part of it that was just horrific and I'd manage to come up with an incremental process to replace it, but you reach a point where you look at what you now have, and it's just... well... like the last piece I rewrote was the regexp part. And I swear, after I got that rewritten, at times, I'd just look at it and it was like... "man, this is fucking beautiful". (Sorry for my language.) But the thing is that you can't reach these incredibly satisfying points if you approach the code with some sort of ultra fearful attitude. There has to be the courage and belief in oneself to get in there and hack away. I think that once you hack away at this some more, the fog will clear and the path foward will just get much clearer.

If there is one thing which this migration can prove, yes congo is opinionated, abandoned backwards compatibility,

Well, correct. Basically. But really, the word "opinionated" is a bit loaded. It implies controversy. Actually, the things that Congo is opinionated about are not really very controversial for the most part... fields should be properly encapsulated.... All those static final int things should be type-safe enums, post-editing generated files is a no-no.... That last point I mention should not be controversial, but I guess sometimes it is, because people get so accustomed to a totally f'ed up situation, that it seems normal to them that you post-edit generated files! But, when you have INJECT you don't post-edit the generated files. You simply inject the code you specifically need in there. And then when you rebuild the project, all the generated files just get regenerated. If you post-edit generate files you can basically never do a genuine clean rebuild. I tried to explain all this to that French kid (the one who wrote that ant file with all the regexp kludges) and I think he got enraged at me and decided to cancel me. (And that was without even me telling him that I didn't believe in "gay marriage" and didn't think that Donald Trump was the devil incarnate....LOL.) But, you know, these narcissistic younger guys really really hate it when you try to explain something to them. They take it as a real insult. I always kind of had an inkling of that, but the sheer extent of it has taken me aback recently.

that it is flexible enough that BeanShell could migrate without changing their existing logic.

Well, one point to bear in mind about this is that what we are doing now is attempting to leverage the existing Java grammar that is built into the tool. We're doing INCLUDE JAVA and just including the java grammar that is in the actual jarfile. It's the same Java grammar that is used internally also. So, yes, you're absolutely right that the ability to leverage the existing Java grammar is a very significant proof of concept. MASSIVE. That is true enough. But the point I would make about this is that all the stumbling blocks we are running into relate to that atttempt to reuse (and modify/extend) the existing Java grammar. If we took the other approach of simply rewriting bsh.jjt using Congo, then it would be much easier in a way, much more closed-ended task, because we could just consciously replicate the various logic that is already there. But when you try to leverage the existing Java grammar, yeah, you hit the problem that the tree, at various points, is shaped differently, like with PrimaryExpression and other things you've brought up. So, I mean, if the goal was just to transition your existing grammar to Congo, it would not be so difficult probably.

BUT... there is a lot more long-term gain if you succeed in (largely) reusing the existing Java grammar. Because, obviously, for one thing, you just immediately have all sorts of things like Lambdas and Annotations and all the rest of it. And also, you free yourself from the need to maintain all these things yourself! But the transition is more difficult.

Well, also, I have to admit that Beanshell is much more different syntactically from standard Java than I had initially thought. So, when I initially volunteered to do the initial work on this, I really thought it was much easier than it turned out to be.

BUT.. live and learn. And it's a good thing to meet difficult challenges.

@nickl-
Copy link
Member

nickl- commented Apr 9, 2023

I tried to adjust things so the cases where Beanshell is looser than regular Java worked, but I somehow neglected this one.

@revusky In the absence of seeing any commits from you, not wanting to rush to conclusion, to be absolutely clear that everybody understands exactly where we stand. You have no intention of fixing any problems or shortcomings with the work you have delivered?

@revusky
Copy link
Author

revusky commented Apr 10, 2023

I tried to adjust things so the cases where Beanshell is looser than regular Java worked, but I somehow neglected this one.

@revusky In the absence of seeing any commits from you, not wanting to rush to conclusion, to be absolutely clear that everybody understands exactly where we stand. You have no intention of fixing any problems or shortcomings with the work you have delivered?

Okay, I did a few things since you wrote the above. At that point in time, the parser I gave you parsed about 62% of the scripts in src/**/*.bsh. I honestly thought it was higher than that, but I quantified it. (I certainly knew it was far short of 100%!) But, do understand that I put far more work into this than I ever anticipated when I made my initial offer. Originally, when I made the offer up top, I thought it would involve a day or two. Maybe three days at the most. Then I spent a much longer time on it. The thing is that, when I made the offer to get the new parser working for you, I didn't realize just how much the Beanshell syntax diverged from standard Java. I must have thought that it was just a question of including the standard Java grammar and then making a few adjustments to have a Beanshell grammar.

In any case, after the final round of work on the parser, it really does parse just about everything. There seems to be one legitimate parsing failure (relating to the ELVIS operator) but I leave that for you to fix, since you really know how it should work, and really, to be honest, I'm ******* tired of this now.

So, here. Do this. (And this is addressed not to just you, Nick, but to everybody in this community.)

   git clone https://github.com/beanshell/beanshell.git -b congocc
   cd beanshell 
   mvn compile
   java -cp target/classes bsh.TestHarness $(find src -name "*.bsh")

Actually, I would suggest that everybody reading these lines execute the above magic incantation in a command-line shell and report back success/failure.

So, you or anybody can pose any question about this, including even questions on the order of: WTF is this?

Now, I have to say that my position is that this ends my involvement with all this. I have to say (I actually don't have to say it but I will...) that if you guys are handed off this and can't pick up the ball and run with it, then maybe you should find some other branch of activity to devote yourselves to. (Maybe professional tiddlywinks or something...)

Now, I don't want to get in some kind of amateur lawyer conversation, but if you look up top in this conversation, I wrote on 4 January (3 months ago +) the following:

Well, to be clear, it's unlikely that I'm going to do much work on the semantics of things or how your runtime works. My goal right now is just to get you on the rails in terms of using JavaCC 21 for parser generation. (snip) In any case, my offer is pretty much solely to get the parsing side using JavaCC 21. (Of course JavaCC 21 CongoCC)

With what I just handed you, your parser (I say your, because you now have to take ownership of this) can handle all sorts of things that your older parser does not.

  • generics
  • lambdas
  • method references
  • annotations
  • yield statement
  • assert statement
  • records
  • instanceof with pattern matching (since JDK 16)
  • new style switch (even with pattern matching, a preview feature that is implemented in Congo.)
  • Since the new parser can handle 32-bit unicode, you are now working with the fully correct version of what characters are valid in an identifier. You can also add lexical specifications that specify things in terms of full unicode.

(All the above is just off the top of my head. I might be forgetting a couple of things. Oh, I did forget to mention that the machinery is there to generate a fault-tolerant parser. That's actually a biggie. Legacy JavaCC has basically zero concept of error recovery or backtracking. (And it never will!))

NOTA BENE: When I say above it can "handle" these things, I just mean it can parse these constructs. Obviously, if you want to handle them in a real sense, i.e. the semantics, not just parsing, you have to write the code. This, after all, is just a parser, which is all I ever offered you.

Anyway, to answer your question, Nick, I think this is just about it. I am now really just handing this off to you. And you have to run with it now. I've done what I promised to do (the parser) and it is surely pretty obvious that I never promised to adjust all your client code to work against a somewhat different parse tree. I just committed to giving you a parser.

Now, I would also add that, though I have no intention of doing any more work on this, our community is certainly available to answer any questions. But really they should be specific questions, where you tried your best to figure something out and need a leg-up. (Nothing wrong with that.) Preferably any such questions should be asked here. That's my preference anyway....

So, this was a belated Christmas gift, I guess. Christmas is now 8 months away, at least by the Gregorian calendar. So, kids, behave yourselves and maybe Santa will bring you something next Christmas. But, for now, I have to get back to my own work.

I hope I've made myself clear.

@nickl-
Copy link
Member

nickl- commented Apr 11, 2023

Wow, ok! Was going for a yes or no answer, I really didn't expect you to jump in and finish up. But much appreciated, many thanks! Awesome!!! You rock!!!!

Not sure why you are so adamant to be done with this, in fact you haven't actually answered my question. You say it was more work than expected, and I certainly won't deny that, but that has nothing to do with it. It is not like you are scared of work, nor did it bother you while you were working it. You did twice remember, once on the congo branch attempting a walker, until the light bulb lit up and you started again with extended model on the congocc branch.

You obviously won't be expected to do semantic stuff, you made that clear, but the threat of having to work on BeanShell code is not the reason either. On the congocc branch you reimplemented SimpleNode which is at the very heart of BeanShell, and had no problem doing migrations on several of the nodes you started mapping. That is great but your time is best spent on congo stuff, which you know. I don't know, and even simple migrations like where am I going to get the operator, takes trial and error, meaning time. Implementing the new parser, from the TestHarness example, almost took no time.

You certainly can't say you are not invested, there are volumes in this thread alone that says otherwise. A bystander might say it is because you already made so much effort but it goes way beyond just the payload. You are actually eager to see it work. I bet it crosses your mind often, something akin to that song you just can't get out of your head, and why shouldn't it? Congo is meant to be implemented and BeanShell is pretty neat implementation. Something you can be proud of, even if you had no part in this project.

I think you were invested long before you even thought of contributing to BeanShell. When I suggested to add you as committer to the project, it never even crossed your mind. Lol if I recall it took another week or two for you to come around to the idea that it would probably be easier to have commit access =) The fact that you were invested, is what planted the seed, which grew uncontrollably, prompting you...no compelling you to start this thread with: I'll do it for you.

So you are done hey? But in the meantime, will everyone who comes across this post please run the test, and let us know if it breaks. It is certainly not out of hubris, nothing quite like writing code to keep a man humble, if the last commit shows anything it is an ability to accept imperfection. We are getting close though because this "thing" does overwhelm the desire to strive for perfection, at least to some degree. Not more important, something else... One thing is for sure if someone finds a bug you won't be able to resist. =)

Whatever it is doesn't really matter, I just don't get it. It is an important milestone for congo, it really can't fail, it is that important. Personally I prefer working with someone, perhaps you don't, we motivate each other, stick it through the grind together, share the trials and tribulations, and someone to celebrate the small things with. Only a teammate understands, you can try and tell others who may even be able to relate, but they just won't get it.


I can't say it's fun when you're acting like it doesn't matter: you can try this, or maybe that, it may cause other issues but hey you should try it, it's really up to you, you know what completely rewriting BeanShell over from scratch to work with congo is going to get rid of the scaffolding and produce the best stable code. Ahh go f...

Maybe it doesn't matter?

It just isn't a priority for me right now, I would make time to work on it with you but on a list of things I want to do myself, it doesn't rank very high. I will definitely do it at some point, and we have made enough progress that starting from scratch with something new or even attempting to hack the javacc grammar to fix a bug or add new feature wouldn't make much sense. It hasn't been a waste, we did accomplish substantially.

It will get done eventually, I just wouldn't hold my breath if I were you.

@nickl-
Copy link
Member

nickl- commented Apr 11, 2023

Output from TestHarness:

------
2 failures
241 successes
Parse failed on: src/test/resources/test-scripts/Data/syntaxerrors.bsh
Parse failed on: src/test/resources/test-scripts/operators.bsh

Which is probably what everyone gets. Busy running more tests and will report findings...

@nickl-
Copy link
Member

nickl- commented Apr 11, 2023

BeanShell script source .bsh parser fails on

  • Elvis Operator
  • Null Coalesce Operator

Can be treated the same as BOOL_OR ||

  • Null Coalesce Assign Operator

Should just work once NULLCOALESCE is understood

  • Safe Navigate Operator

Allows an optional ? before dots in dot notation

Foo?.Bar?.Baz();

Beanshell.ccc.bsh.exp.but.fnd.txt

@revusky
Copy link
Author

revusky commented Apr 11, 2023

Output from TestHarness:

------
2 failures
241 successes
Parse failed on: src/test/resources/test-scripts/Data/syntaxerrors.bsh
Parse failed on: src/test/resources/test-scripts/operators.bsh

Yep, that is no surprise. Thanks for reporting back.

Which is probably what everyone gets.

Well, I think it probably is what everybody gets, assuming they try it! But, as seems to be the usual situation, the only person who is reporting back is you. Why is that?

Busy running more tests and will report findings...

Here is another point about this that I forgot to mention....

This new Congo-based beanshell parser can also parse regular Java. For example, if, in the above, you do:

  java -cp target/classes bsh.TestHarness $(find src - name "*.java")

(i.e. run it over all the Java files instead of all the .bsh files, it parses them all. And that includes at least some new syntax with lambdas and so on.)

I've tried it over quite a range of test input, by the way, and it is not perfect. It fails in certain cases. For example, there was a case, kind of like:

      if (someCondition()) {
           //blah blah blah
      }
      /**
         And now for something completely different!
       */
     else {
         // blah blah
     }

and that was failing. It took me some effort to figure out why. Here is why. The "formal comment" above, which is the "and now for something completely different" is actually taken to be a Statement!

But what that means is that in the case above, the if statement terminates because there is a new "statement" after the first block. And then the else that follows is an orphan basically. You see the problem? If you simply get rid of the extra * that starts off that comment and turn it into a regular multiline comment, the above parses okay. But this is a consequence of saying that the comment that starts with /** is a Statement. But, that was your idea, not mine!

I had resolved not to put any more effort into this, but just a little while ago, I addressed a couple of other issues I had become aware of. Due to an oversight on my part, it was not handling hexadecimal floating point literals. (That is something that I never used in my life anyway, but...). It also was not handling void.class. But I just fixed those quickly. There is still some problem with annotation type declarations, which I have not got to the bottom of. Of course, your legacy parser has zero concept of an annotation or annotation type declaration, so....

I found the above issues running over some Java standard library code. (From JDK 19!).

 private static final double C =  0x1.15f15f15f15f1p-1;

The 0x1.15f15f15f15f1p-1 is a hexadecimal floating point literal, which, before working on this Java parser, I did not even know such a thing existed. But it is failing to handle this.

@revusky
Copy link
Author

revusky commented Apr 11, 2023

BeanShell script source .bsh parser fails on

  • Elvis Operator
  • Null Coalesce Operator

I know it's failing on those things, but I decided to let you fix it.

@nickl-
Copy link
Member

nickl- commented Apr 11, 2023

JAVA source .java parser fails in several scenarios

Beanshell.ccc.exp.but.fnd.txt - Was expecting Found report
Beanshell.ccc.exp.but.fnd.txt

fail.zip.txt - Zip archive of java source files unable to be parsed with Beanshell grammar
fail.zip.txt

@nickl-
Copy link
Member

nickl- commented Apr 11, 2023

I know it's failing on those things, but I decided to let you fix it.

I'm good thanks, you can treat them like BOOL_OR || it's basically the same.

@nickl-
Copy link
Member

nickl- commented Apr 11, 2023

I've tried it over quite a range of test input, by the way, and it is not perfect. It fails in certain cases. For example, there was a case, kind of like:

Java needs to parse, even more important than bean script syntax

But, that was your idea, not mine!

Not my idea but I did fix it in jjtree grammar.

   < FORMAL_COMMENT:
        "/**" (~["*"])* "*" ("*" | (~["*","/"] (~["*"])* "*"))* "/"
    >

and

| <MULTI_LINE_COMMENT:
    ("/***" (["*"])* | "/*") (~["*"])* "*" ("*" | (~["*","/"] (~["*"])* "*"))* "/">

Should let them through...

I had resolved not to put any more effort into this, but just a little while ago, I addressed a couple of other issues I had become aware of. Due to an oversight on my part, it was not handling hexadecimal floating point literals.

I knew you wouldn't be able to resist =)

which, before working on this Java parser, I did not even know such a thing existed. But it is failing to handle this.

Also binary and octal...

  < #HEX_LITERAL: "0" ["x","X"] (["0"-"9","a"-"f","A"-"F","_"])+ >
|
  < #BINARY_LITERAL: "0" ["b","B"] ((["0","1","_"])*(["0","1"])?) >
|
  < #OCTAL_LITERAL: "0" (["0"-"7"])* >

I never knew about the underscores _

  < #DECIMAL_LITERAL: ["1"-"9"] (["0"-"9","_"])* >

@revusky
Copy link
Author

revusky commented Apr 11, 2023

JAVA source .java parser fails in several scenarios

Beanshell.ccc.exp.but.fnd.txt - Was expecting Found report Beanshell.ccc.exp.but.fnd.txt

fail.zip.txt - Zip archive of java source files unable to be parsed with Beanshell grammar fail.zip.txt

I looked at this a bit. For starters, there are 969 .java files in there but 417 of them are module-info.java files. There is no reason to think that the test harness I handed to you deals with those. (It could, since I have the grammar to parse them but it doesn't because the test harness just assumes that the input is a series of Beanshell/Java "statements". So it's automatically failing on all these module-info.java files!

So that leaves 552 Java files. My own (really quite robust) Java parser is failing on 171 of them, so I think those ones are pretty much all invalid Java. I suppose you mean they are supposed to be valid Beanshell. But, you know, as far as I can tell, there is no really formal spec of what is valid (or not) in Beanshell.

By the way, you know how to test the Java parser that is built into Congo?

       git clone https://github.com/congo-cc/congo-parser-generator.bsh congo
       cd congo
       ant test
       java -cp examples/java JParse <directories or files>

You can see what the original Java parser can parse or not. My point here is that the Beanshell parser I gave you is basically just hacked/modified version of this Java parser, so it is, by and large, going to fail to parse things that the actual Java parser rejects. This is going to be the case unless there is some specific adjust for it to accept things that the Java parser does not...

Or, to put it another way, I guess you could say it is a design goal that the Beanshell parser should accept standard Java AND in addition, some other things that are not standard Java. But it doesn't seem to be very well specified what it is supposed to accept or reject. What I'm handing off to you is based on the sample of .bsh files that I had to use as a test suite, i.e. what you get from find src -name "*.bsh"in the top level directory. It's doing a pretty good job on those, except for the ELVIS and NULL_COALESCE operators which I've left as an exercise for you.

@nickl-
Copy link
Member

nickl- commented Apr 12, 2023

I looked at this a bit. For starters, there are 969 .java files in there but 417 of them are module-info.java files.

Did not look at the exceptions only ran the parser through all the java files on my system, copied the offending files to a faults directory, each into their own folder as to not mistakenly overwrite duplicate filenames. It is certainly easier to now only run against the 969 known to be offending files instead of the 10 minute test to parse everything one by one.

My own (really quite robust) Java parser is failing on 171 of them, so I think those ones are pretty much all invalid Java. I suppose you mean they are supposed to be valid Beanshell.

Not sure what you consider to be valid JAVA, for BeanShell we require everything which the JAVA compiler accepts to be parseable at the very least, even if not functional. I have no reason to suspect that any of the offending files would not compile based on erroneous syntax, they all come from existing public projects. Certainly not to the degree suggested by 171 of them which you say the standard congo JAVA grammar chokes on, this is disconcerting.

This is going to be the case unless there is some specific adjust for it to accept things that the Java parser does not...

Don't know why you are suggesting this, the only motivation for moving to congo is because it can parse valid JAVA and promises to continue to do so as new language features are added. Do you suggest to say that this is an unreasonable expectation?

Beanshell parser should accept standard Java AND in addition, some other things that are not standard Java.

I would rather use "valid" instead of "standard" in so far as valid JAVA syntax is accepted by the JAVA compiler, the latter could suggest conforming to a code style and is not readily verifiable. The "other things" are mostly just much looser JAVA
syntax, which is why the JAVA grammar should be easily adaptable to BeanShell, even the most far out syntactical sugar aims to be JAVA like.

It's doing a pretty good job on those, except for the ELVIS and NULL_COALESCE operators which I've left as an exercise for you.

I have to assume you think you are helping but you are not. The purpose of this project is to migrate to congo, and there is enough outstanding without you adding to the workload by thinking up "exercises" for me to do. I can just as easily learn by looking at what would've taken you only a minute to implement, how is me wasting time figuring it out helping?

If only avoiding parse errors was the goal it is setting a very low bar. Then it is easy to take shortcuts like ignoring StatementExpression just to avoid parse errors and claim a proof for a job well done.

StatementExpression : Expression;

The bottom line is, as long as we are unable to produce node instances which at least resemble the structure that BeanShell expects to find, this project stays dead in the water. But 10 points for effort, well done.

@nickl-
Copy link
Member

nickl- commented Apr 12, 2023

New commit b0232a5

Parser now captures PrimaryExpression nodes.
Fixed Null Coalesce, Elvis, SpaceShip, and Safe Traverse operators
Removed unchanged binary expression overwrites
Fixed StatementExpression which is why we now get PrimaryExpression
Includes a hacked ExpressionStatement which is far too greedy but good for now

Now even parses the syntaxerrors.bsh script lol

Expected:

Assignment
 PrimaryExpression
  MethodInvocation
   AmbiguousName: Integer.parseInt
   Arguments
    Assignment
     PrimaryExpression
      Literal: 1

Actual:

BSHAssignment
 BSHPrimaryExpression
  BSHMethodInvocation
   BSHAmbiguousName: Integer.parseInt
   BSHArguments
    BSHLiteral: "1"

Getting closer...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants