Upgrade ANTLR to version 4.11.1 #12016

reta · 2022-12-13T17:29:07Z

Signed-off-by: Andriy Redko andriy.redko@aiven.io

Description

Apache Lucene is using a quite old version of ANTLR, 4.5.1-1. By itself that is not a showstopper, but the more profound issue is that some ANTLR 3.x bits are used [1]. Since ANTLR 4.10.x (or even earlier), the compatibility layer with the 3.x release line has been dropped from 4.x (please see [2]), which makes Apache Lucene impossible to use with recent ANTLR 4.10.x+ releases [3]. A sample exception is below.

   >         java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with version 3 (expected 4).
   >             at org.antlr.antlr4.runtime@4.11.1/org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:56)
   >             at org.antlr.antlr4.runtime@4.11.1/org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:48)
   >             at org.apache.lucene.expressions@10.0.0-SNAPSHOT/org.apache.lucene.expressions.js.JavascriptLexer.<clinit>(JavascriptLexer.java:279)

[1] https://github.com/apache/lucene/blob/main/lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java#L189
[2] antlr/antlr4@c68e127
[3] opensearch-project/OpenSearch#4546

Closes #11788

reta · 2022-12-13T17:34:21Z

@rmuir @uschindler what kind of (performance? jmh?) testing would help to prove or rule out that moving to 4.11.x makes sense? You have definitely seen traps in the past; I would very much appreciate pointers / guidance on exploring possible regressions, thank you in advance.

rmuir · 2022-12-13T19:45:00Z

the only way i know to prevent the traps is to do like painless and "enable picky mode" which fails test instead of doing slow things. and to have 100% test coverage of grammar!

rmuir · 2022-12-13T19:47:07Z

Looks like this in painless: https://github.com/opensearch-project/OpenSearch/blob/04757607c5aead788b465c77cec6ef459720f625/modules/lang-painless/src/main/java/org/opensearch/painless/antlr/Walker.java#L224-L245

reta · 2022-12-13T19:48:51Z

> Looks like this in painless: https://github.com/opensearch-project/OpenSearch/blob/04757607c5aead788b465c77cec6ef459720f625/modules/lang-painless/src/main/java/org/opensearch/painless/antlr/Walker.java#L224-L245

Thanks a lot, @rmuir , I will take it from there

rmuir · 2022-12-13T19:49:55Z

@reta I remember doing this adds overhead, that's why it is a boolean there. so it really just needs to be something we do from tests. for example it could be a package-private setter or similar?
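A minimal sketch of the shape being discussed here, under stated assumptions: `PickyModeSketch`, `AmbiguityListener`, and the `??` trigger are hypothetical stand-ins invented for illustration, not ANTLR or Lucene API. In the linked Painless `Walker`, the equivalent role is played by ANTLR's `DiagnosticErrorListener` together with `PredictionMode.LL_EXACT_AMBIG_DETECTION` and an error listener that throws. The point is only that the public entry point stays lenient and cheap, while a package-private constructor lets tests opt into failing fast.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a generated parser wrapper; not ANTLR or Lucene API. */
public class PickyModeSketch {

    /** Receives a report whenever the parser would fall back to a slow path. */
    interface AmbiguityListener {
        void reportAmbiguity(String message);
    }

    private final List<AmbiguityListener> listeners = new ArrayList<>();

    /** Public entry point: lenient, no diagnostic overhead. */
    public PickyModeSketch() {
        this(false);
    }

    /** Package-private: only tests in the same package pass picky = true. */
    PickyModeSketch(boolean picky) {
        if (picky) {
            // Turn every ambiguity report into a hard test failure.
            listeners.add(message -> {
                throw new AssertionError("ambiguous parse: " + message);
            });
        }
    }

    /** Pretend parse: input containing "??" plays the role of an ambiguous construct. */
    public String parse(String source) {
        if (source.contains("??")) {
            for (AmbiguityListener listener : listeners) {
                listener.reportAmbiguity(source);
            }
        }
        return source.trim();
    }

    public static void main(String[] args) {
        // Lenient mode silently tolerates the ambiguous input.
        System.out.println(new PickyModeSketch().parse(" a ?? b "));
        try {
            new PickyModeSketch(true).parse(" a ?? b ");
        } catch (AssertionError e) {
            System.out.println("picky mode failed the test: " + e.getMessage());
        }
    }
}
```

Keeping the flag on a package-private constructor means no public API has to change just to support the stricter test-only behavior.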

rmuir · 2022-12-13T19:51:34Z

As far as inspecting coverage, I suspect it is pretty good. But there are instructions in https://github.com/apache/lucene/blob/main/help/tests.txt on how to generate reports.

rmuir · 2022-12-13T19:54:41Z

here is coverage report using the current antlr. I guess i dont know why so much is missing here:

https://ci-builds.apache.org/job/Lucene/job/Lucene-Coverage-main/618/jacoco/org.apache.lucene.expressions.js/

But yeah, if we exercise the possibilities of grammar from tests, AND tests use picky mode, then build will fail if grammar is ambiguous. It is definitely a PITA compared to antlr 3 :(

rmuir · 2022-12-13T19:58:46Z

cc: @jdconrad who might remember a lot more about this than me

rmuir · 2022-12-13T21:40:04Z

In general, sorry if i discouraged before, it is really just a frustrating situation

If you get stuck, just leave the PR open. I will try to dig into this too, I have been through it before. I do agree we can't stay on old releases forever.

But the crazy shit they did between antlr 3.x and 4.x caused me pain:

* made the mistake of not spelunking thru antlr guts to figure out how to prevent performance traps
* had users report performance bugs (in all cases it was some grammar tweak to fix)
* fix these performance bugs in subsequent releases
* get frustrated with the leniency and figure out how to make the shit picky again.

So I don't wish that on anyone. Currently the expressions is great because it is simple and performs well: Users can represent needs in a simple javascript-like fashion and trust that it has same performance as writing custom java code.

reta · 2022-12-13T21:51:23Z

Thanks for the encouragement @rmuir ! I will be working on the matter this week and will share my findings, thank you!

jdconrad · 2022-12-13T23:42:53Z

I agree with @rmuir that having an ambiguity check for tests similar to Painless would be great for expressions. I'm a bit surprised this change didn't require much in addition to regenerating the lexer/parser. One other thing I'm curious about is whether this suffers from the same issue that Painless did with multiple nested conditionals, where we had to separate them into their own rule in the grammar so they didn't do needless backtracking. So that may be something else worth adding a test for. Thanks for looking into upgrading this @reta!
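As a rough illustration of the kind of input such a test could use, the sketch below builds a chain of nested, unparenthesized conditionals of arbitrary depth. The class and method names are hypothetical, and it is an assumption that the expressions grammar accepts this exact syntax; the call into the expression compiler is deliberately left out so the example stays self-contained.

```java
/** Hypothetical helper that generates deeply nested conditional expression source. */
public class NestedConditionalSource {

    /**
     * Builds source such as "x0 > 0 ? 0 : x1 > 0 ? 1 : 2" for depth 2.
     * Each level adds one more conditional in the else-branch of the previous one.
     */
    static String nestedTernary(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("x").append(i).append(" > 0 ? ").append(i).append(" : ");
        }
        // Innermost else-branch is a plain constant.
        sb.append(depth);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(nestedTernary(3));
        // A large depth gives a stress input for a picky-mode parser test.
        System.out.println(nestedTernary(200).length());
    }
}
```

Feeding a deep chain like this through a parser that fails on ambiguity would show whether nested conditionals trigger the slow path in this grammar too.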

rmuir · 2022-12-14T06:12:45Z

Simple package-private static method to turn on the pickiness should do it.

I don't want to see 100 new constructors/abstractions added with booleans, just because antlr made a bad decision. Our APIs should not have to suffer for it.

Don't care what java developers or static analysis tools or whatever else wants to say about it: minimal abstractions here.

uschindler · 2022-12-14T08:57:38Z

Hi, I agree with everything Robert says. I would also like to make another suggestion (that could also be applied after this has been fixed). To me it looks like a bad decision by ANTLR that the tool compiles code and creates some Java/class files, but at runtime it also needs a dependency JAR in exactly the correct version. This is a problem for a library like Lucene that gets included into other projects. The same also applies to ASM, although this is no longer a big problem, because ASM's APIs are now very stable and it does not matter if a project uses ASM 8 or ASM 9 if minimum requirements are guaranteed (so the user has more flexibility).

In contrast javacc/jflex do not have that problem, because javacc/jflex generate all of the code and you don't need any runtime library to execute and access the AST of bytecode or syntax.

In Lucene both (ANTLR and ASM) are dependencies of one artifact: lucene-expressions, so it is only a real problem there. My suggestion would be: Let's shade ANTLR (and possibly also ASM - until the JDK-included bytecode generator is out of incubation/preview) into the JAR file. With the Apache Ant build this was hard to do, but with Gradle it is quite easy. Just add another configuration with those two dependencies and use the "Gradle Shadow Plugin" (https://imperceptiblethoughts.com/shadow/) to transform their package names into the Lucene namespace.

I know some people don't like this, but I had exactly the same problems with ASM in the forbiddenapis plugin, and shading ASM in there was the only way to work around different, old, incompatible versions of ASM inside Maven's or Gradle's classpath, while forbiddenapis always needs the newest one.

If you like I could make a PR to demonstrate the shading for Lucene's Gradle build in Expressions. @dweiss : What do you think? Of course we should not use shading anywhere else, but ANTLR and ASM are the candidates that always bring problems. Their size is also small so the overhead is small, but you have a consistent package that is unlikely to break when other projects use different library versions.

dweiss · 2022-12-14T11:53:18Z

> With Apache Ant this was hard to do but with Gradle it is quite easy. Just add another configuration with those two dependencies and use the "Gradle Shadow Plugin" (https://imperceptiblethoughts.com/shadow/) to transform their package names into the Lucene namespace.

Technically - easy to do. Question is whether we want to do it (I don't see any problems here).

rmuir · 2022-12-14T13:13:13Z

personally i am against the shading. I think it is a huge antipattern, it hides third-party artifacts completely. think about someone trying to do security or license audit and they have "secret dependencies" hidden from view, as an example.

I don't understand the issue myself. if you want to use different antlr version in two places just use two classloaders. thats what we did in elastic/opensearch.
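A rough sketch of what classloader separation means in practice, using only the JDK. Everything here is illustrative: `TwoClassLoaders`, `LibraryClass`, and `IsolatingLoader` are hypothetical names, and the class bytes are re-read from the current classpath purely to keep the example self-contained. In a real setup each loader would instead point at a different version of the library JAR.

```java
import java.io.IOException;
import java.io.InputStream;

/** Illustrates that two loaders can each hold their own copy of the "same" class. */
public class TwoClassLoaders {

    /** Stand-in for a library class that two components need in different versions. */
    public static class LibraryClass {}

    /** Defines its own copy of one named class instead of delegating to the parent. */
    static final class IsolatingLoader extends ClassLoader {
        private final String isolatedName;

        IsolatingLoader(String isolatedName, ClassLoader parent) {
            super(parent);
            this.isolatedName = isolatedName;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(isolatedName)) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    String resource = name.replace('.', '/') + ".class";
                    try (InputStream in = getParent().getResourceAsStream(resource)) {
                        if (in == null) {
                            throw new ClassNotFoundException(name);
                        }
                        byte[] bytes = in.readAllBytes();
                        c = defineClass(name, bytes, 0, bytes.length);
                    } catch (IOException e) {
                        throw new ClassNotFoundException(name, e);
                    }
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }
    }

    /** Loads a private copy of the given class through a fresh isolating loader. */
    static Class<?> loadIsolated(Class<?> original) {
        try {
            String name = original.getName();
            return new IsolatingLoader(name, original.getClassLoader()).loadClass(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?> a = loadIsolated(LibraryClass.class);
        Class<?> b = loadIsolated(LibraryClass.class);
        System.out.println("same name: " + a.getName().equals(b.getName()));
        System.out.println("distinct classes: " + (a != b));
    }
}
```

Because the two `Class` objects are distinct at runtime, code loaded by one loader never sees the other loader's copy, which is what lets two library versions coexist in one JVM.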

but like i said, i'm not opposed to upgrading IF concerns about the new version are taken care of.

rmuir · 2022-12-14T13:22:02Z

I also dont understand the issue where ppl think they can modify arbitrary versions of lucene dependencies.

Can we specify our dependencies in a different way (e.g. exact version) in our maven stuff so this won't happen? e.g. you can do this in python, and specify that you depend on antlr == x.y.z rather than just depend on antlr.

dweiss · 2022-12-14T13:29:09Z

> Can we specify our dependencies in a different way (e.g. exact version) in our maven stuff so this won't happen? e.g. you can do this in python, and specify that you depend on antlr == x.y.z rather than just depend on antlr.

You can - I think what Uwe is describing is a problem for downstream projects where Lucene has antlr x.y.z and some other dependency has antlr a.b.c - then namespaces clash and the conflict is not easily resolved. Classloader separation is possible, of course, but it's hardly an easy alternative. :)

I personally don't mind shading artifacts but I do agree they are a pain... even tracking down which version a project is actually using is a problem then (because shaded artifacts don't manifest their versions as clearly as a maven dependency). Corporate environments will hate them for legal reasons (for reasons Rob mentioned).

rmuir · 2022-12-14T13:37:10Z

Yes, java showed what a disaster it is around supply chains with the log4j vulnerability. Shading should not even be considered as an option.

uschindler · 2022-12-14T13:38:45Z

> > Can we specify our dependencies in a different way (e.g. exact version) in our maven stuff so this won't happen? e.g. you can do this in python, and specify that you depend on antlr == x.y.z rather than just depend on antlr.
>
> You can - I think what Uwe is describing is a problem for downstream projects where Lucene has antlr x.y.z and some other dependency has antlr a.b.c - then namespaces clash and the conflict is not easily resolved. Classloader separation is possible, of course, but it's hardly an easy alternative. :)
>
> I personally don't mind shading artifacts but I do agree they are a pain... even tracking down which version a project is actually using is a problem then (because shaded artifacts don't manifest their versions as clearly as a maven dependency). Corporate environments will hate them for legal reasons (for reasons Rob mentioned).

Hi,
as said before, this was just a suggestion from my experience with forbiddenapis and some artifacts like "ASM" and "ANTLR". They are always a pain. From the security perspective there are problems, but you can also see it as "code copied" - we can of course also do this, but that's a lot more hassle.

What you can always do: Offer a shaded version without those 2 dependencies as a separate artifact. Often seen on maven as "uber" or "shaded" behind version number. If you want to use shaded version, you know consequences.

Setting an exact version in Maven POMs is not possible, unfortunately. Maven has some tricks (it won't silently upgrade across major versions, but bugfix releases are automatically applied). I don't know exactly how this is handled internally by the Maven resolver.

My idea would be: publish "lucene-expressions-shaded-x.y.z" in addition to "lucene-expressions-x.y.z"
That is also what the gradle-plugin is doing, it adds another coordinate.

rmuir · 2022-12-14T13:42:00Z

I feel that if we publish a shaded jar we become guilty of "contributing to the delinquency" of terrible supply chain management, by hiding third party dependencies and their versions from view. It causes problems for checkers and even humans if there were ever some issue with one of the dependencies (example log4j). They might miss it entirely and get hacked, as an example. Then they will be mad at us for shading in the first place, even though its arguably their fault because they used a shaded jar.

Let's just not hand them the gun.

rmuir · 2022-12-14T13:44:23Z

another way to say it: shading is a terrible, TERRIBLE idea and you only hear about it in java, because java is the only language with developers that are bad enough to consider it.

Please, let's not talk about it ever again.

reta · 2022-12-14T14:01:10Z

> Please, let's not talk about it ever again.

Hard to disagree with you @rmuir, I think shading is the last resort which might be taken if no other options are possible, I could count only a few legitimate reasons when it makes sense (speaking in general). I think we should not take this path in Lucene, at least with ANTLR4.

jdconrad · 2022-12-14T15:46:21Z

One other possibility would be to hand-roll the expression lexer/parser. This would get rid of the need for any additional dependencies and generated code. From what I can tell the API has been stable for a long time, so I don't think there would be a problem with maintenance. This would have the added benefit of likely improved performance as ANTLR isn't the fastest, and potentially better error messages.
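To give a feel for what "hand-rolled" could look like, here is a minimal recursive-descent sketch for plain arithmetic. It is an illustration only: `TinyExpressionParser` is a hypothetical class, the grammar covers just numbers, `+ - * /`, unary minus, and parentheses, and it evaluates directly rather than compiling to bytecode as the real expressions module does.

```java
/** Hypothetical hand-written parser/evaluator; not the Lucene expressions grammar. */
public class TinyExpressionParser {
    private final String src;
    private int pos;

    private TinyExpressionParser(String src) {
        this.src = src;
    }

    /** Parses and evaluates the whole input, rejecting trailing garbage. */
    public static double evaluate(String source) {
        TinyExpressionParser p = new TinyExpressionParser(source);
        double value = p.expression();
        p.skipWhitespace();
        if (p.pos != p.src.length()) {
            throw p.error("unexpected input");
        }
        return value;
    }

    // expression := term (('+' | '-') term)*
    private double expression() {
        double value = term();
        while (true) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double value = factor();
        while (true) {
            if (accept('*')) value *= factor();
            else if (accept('/')) value /= factor();
            else return value;
        }
    }

    // factor := '-' factor | '(' expression ')' | number
    private double factor() {
        if (accept('-')) return -factor();
        if (accept('(')) {
            double value = expression();
            if (!accept(')')) throw error("expected ')'");
            return value;
        }
        return number();
    }

    private double number() {
        skipWhitespace();
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) throw error("expected number");
        try {
            return Double.parseDouble(src.substring(start, pos));
        } catch (NumberFormatException e) {
            throw error("malformed number");
        }
    }

    /** Consumes the given character if it is next (after whitespace). */
    private boolean accept(char c) {
        skipWhitespace();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipWhitespace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at offset " + pos + " in \"" + src + "\"");
    }
}
```

Each grammar rule maps to one method with a single token of lookahead, so there is no backtracking to guard against, and error messages can name the exact offset.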

uschindler · 2022-12-14T16:38:09Z

> One other possibility would be to hand-roll the expression lexer/parser. This would get rid of the need for any additional dependencies and generated code. From what I can tell the API has been stable for a long time, so I don't think there would be a problem with maintenance. This would have the added benefit of likely improved performance as ANTLR isn't the fastest, and potentially better error messages.

javacc?

jdconrad · 2022-12-14T16:59:44Z

> javacc?

I was under the impression this was EOL? If it's still well supported I'm not familiar enough with the code generation to know if this would avoid the pitfalls ANTLR has.

reta · 2022-12-14T20:02:26Z

@rmuir first pass over pickiness: all tests are now run both with and without the diagnostics listener (picky / not picky mode). The rough observations so far are encouraging; there are no significant changes in test duration between these two modes (some random timings below):

I am moving further to compare the numbers against main and to conduct more precise measurements (instead of rough test timings).

rmuir · 2022-12-14T21:07:49Z

we don't need to parameterize the pickiness IMO, we can just turn it on in these tests. Thanks for getting this hooked up. I will look at your branch and try to play with it.

rmuir · 2022-12-14T21:44:53Z

I pushed a proposal to your branch (only fixing one of the tests in .js to use it). If you are really against it, just revert the commit. But i think it keeps tests simpler.

reta · 2022-12-14T21:52:42Z

> I pushed a proposal to your branch (only fixing one of the tests in .js to use it). If you are really against it, just revert the commit. But i think it keeps tests simpler.

👍 No objections, thanks a lot for helping out @rmuir , it keeps tests simpler indeed

rmuir · 2022-12-14T21:53:30Z

sorry i didnt do the other tests, I gotta run for now. i can do them later tonight or tomorrow if you need but just wanted to prototype it out

lucene/expressions/src/java/module-info.java

uschindler · 2022-12-14T21:57:08Z

lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptCompilerSettings.java

+package org.apache.lucene.expressions.js;
+
+/** Settings for expression compiler for javascript expressions. */
+final class JavascriptCompilerSettings {


Do we really need this additional class only holding one boolean? I think a simple additional PKG private compile method taking the boolean picky would have same effect.

No, not really, but it could be useful to add more settings later on

Personally I hate mutable instances with getters and setters. 🤬

I agree with Uwe, let's move to a boolean (actually only need one package-private ctor with the boolean, the way the new base CompilerTestCase class works).

If we have more options to this thing later, we can deal with the issue at that time. It is all package-private anyway.

reta · 2022-12-14T22:15:52Z

> sorry i didnt do the other tests, I gotta run for now. i can do them later tonight or tomorrow if you need but just wanted to prototype it out

@rmuir tests have been migrated, thank you

uschindler

Looks fine to me.

rmuir · 2022-12-14T22:47:34Z

I will check it out and inspect the coverage report from .js tests and see if there are any holes. If i find them I will push more tests. I am just really paranoid about some of the slow things antlr4 will do, having fought thru these issues with @jdconrad once before.

uschindler · 2022-12-14T22:56:19Z

> I will check it out and inspect the coverage report from .js tests and see if there are any holes. If i find them I will push more tests. I am just really paranoid about some of the slow things antlr4 will do, having fought thru these issues with @jdconrad once before.

I am a bit afraid because those blobs with DFAs grew quite a lot in the patch. So yes, let's review how it behaves.

showed as untested in coverage report, only need one codepath thru this stuff

rmuir · 2022-12-15T15:13:25Z

i wrestled with it, I think we are good testing-wise. Lots of impossible branches in the generated code so it looks bad at a glance, but I feel like the functionality is covered.

uschindler · 2022-12-15T15:15:24Z

Cool, thanks for the "huge whitespace" (public void testEnormousExpressionSource) test!

reta · 2022-12-15T15:18:52Z

@rmuir @uschindler thanks a lot for HUGE help here guys!

rmuir · 2022-12-16T03:26:48Z

I forced regeneration with ./gradlew -p lucene/expressions regenerate --rerun-tasks just to ensure there were no source code changes and regeneration is idempotent / reproducible.

Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing.

This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md

"Q: What are the main design decisions in ANTLR4? Ease-of-use over performance. I will worry about performance later."

It allows us to move forward with newer antlr but hopefully prevent the associated headaches.

Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
Co-authored-by: Robert Muir <rmuir@apache.org>

reta · 2022-12-16T13:49:24Z

@rmuir ~~would it be possible to backport it to 9.5? thank you!~~ nwd, I saw it in lucene_9x already, thanks again

uschindler · 2022-12-16T16:54:13Z

See 6700b7e
We just cherrypick, no need for PRs (they are only required if backports are complex).

Upgrade ANTLR to version 4.11.1

1207ce0

Signed-off-by: Andriy Redko <andriy.redko@aiven.io>

Add JavascriptCompilerSettings to allow configuring picky parsing

c662a23

Signed-off-by: Andriy Redko <andriy.redko@aiven.io>

reta force-pushed the issue-11788 branch from f6c9e36 to c662a23 Compare December 14, 2022 17:39

add base test class and simplify one of the tests as an example

f580cc6

uschindler reviewed Dec 14, 2022

Addressed code review comments: migrated tests to CompilerTestCase

22f8927

Signed-off-by: Andriy Redko <andriy.redko@aiven.io>

Addressing code review comments: removed JavascriptCompilerSettings i…

1d3b4e3

…n favor of boolean Signed-off-by: Andriy Redko <andriy.redko@aiven.io>

uschindler approved these changes Dec 14, 2022

reta marked this pull request as ready for review December 14, 2022 23:05

rmuir added 2 commits December 15, 2022 09:27

remove unnecessary factory and ctor method

331648e

showed as untested in coverage report, only need one codepath thru this stuff

increase coverage a tiny bit for some corner cases

69246f0

rmuir approved these changes Dec 15, 2022

tidy

fd66734

CHANGES

e3af70f

rmuir merged commit 945d7fe into apache:main Dec 16, 2022

rmuir added this to the 9.5.0 milestone Dec 16, 2022

reta mentioned this pull request Dec 16, 2022

Bump antlr4 from 4.9.3 to 4.11.1 opensearch-project/OpenSearch#4546

Merged
