Fix painless's regex lexer and error messages #23634

nik9000 · 2017-03-17T16:36:44Z

Without this change, if you write a script with multiple regexes,
sometimes the lexer will treat them as one big regex followed by
some trailing garbage, as in this discuss post:
https://discuss.elastic.co/t/error-with-the-split-function-in-painless-script/79021

def val = /\\\\/.split(ctx._source.event_data.param17);
if (val[2] =~ /\\./) {
  def val2 = /\\./.split(val[2]);
  ctx._source['user_crash'] = val2[0]
} else {
  ctx._source['user_crash'] = val[2]
}

The error message you get from the lexer is lexer_no_viable_alt_exception
right after the second regex.

With this change each regex is lexed as its own token, like it ought to be.

As a bonus, while looking into this issue I found that the error
reporting for regexes wasn't very nice. If you specify an invalid
pattern, the error marker sits at the start of the pattern and the
message is the JVM's regex error message. That message attempts to
point you to the location in the regex but is totally unreadable in
the JSON response.

This change fixes the location to point to the appropriate spot
inside the pattern and removes the portion of the JVM's error message
that doesn't render well. It is no longer needed now that we point
users to the appropriate spot in the pattern.

nik9000 · 2017-03-17T16:37:40Z

modules/lang-painless/src/main/antlr/PainlessLexer.g4

@@ -120,7 +120,7 @@ INTEGER: ( '0' | [1-9] [0-9]* ) [lLfFdD]?;
 DECIMAL: ( '0' | [1-9] [0-9]* ) (DOT [0-9]+)? ( [eE] [+\-]? [0-9]+ )? [fFdD]?;

 STRING: ( '"' ( '\\"' | '\\\\' | ~[\\"] )*? '"' ) | ( '\'' ( '\\\'' | '\\\\' | ~[\\'] )*? '\'' );
-REGEX: '/' ( ~('/' | '\n') | '\\' ~'\n' )+ '/' [cilmsUux]* { slashIsRegex() }?;
+REGEX: '/' ( '\\' ~'\n' | ~('/' | '\n') )+? '/' [cilmsUux]* { slashIsRegex() }?;


This non-greedy pattern works well for strings so I figured I should use it for regexes as well. I had to reorder the escape characters to the front so they wouldn't be ignored because order is important in non-greedy rules.
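To illustrate the two points in this comment, here is a minimal hand-rolled sketch. It is not ANTLR and only approximates the lexer's behavior under stated assumptions: the greedy rule is modeled as "longest token that can end in `/`", the non-greedy rule as "take the first alternative that applies and stop at the first `/` that can close the token", and the trailing flags are ignored. The class and method names are hypothetical and do not appear in the PR.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

class RegexTokenSketch {
    // Approximates the old greedy rule: explore every way of consuming body
    // elements and keep the longest token that ends in '/'. Returns the end
    // index (exclusive) or -1 if no token starts at 'start'.
    static int longestMatchEnd(String src, int start) {
        if (start >= src.length() || src.charAt(start) != '/') {
            return -1;
        }
        BitSet seen = new BitSet();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(start + 1);
        seen.set(start + 1);
        int best = -1;
        while (!work.isEmpty()) {
            int p = work.pop();
            if (p >= src.length()) {
                continue;
            }
            char c = src.charAt(p);
            // A '/' closes the token once at least one body element was consumed.
            if (c == '/' && p > start + 1) {
                best = Math.max(best, p + 1);
            }
            // Alternative ~('/' | '\n'): consume one plain character.
            if (c != '/' && c != '\n' && !seen.get(p + 1)) {
                seen.set(p + 1);
                work.push(p + 1);
            }
            // Alternative '\\' ~'\n': consume a backslash plus the next character.
            if (c == '\\' && p + 1 < src.length() && src.charAt(p + 1) != '\n' && !seen.get(p + 2)) {
                seen.set(p + 2);
                work.push(p + 2);
            }
        }
        return best;
    }

    // Approximates a non-greedy rule with ordered alternatives. When
    // escapesFirst is false the plain-character alternative always wins for a
    // backslash, so an escaped '/' is never recognized.
    static int nonGreedyEnd(String src, int start, boolean escapesFirst) {
        if (start >= src.length() || src.charAt(start) != '/') {
            return -1;
        }
        int p = start + 1;
        while (p < src.length()) {
            char c = src.charAt(p);
            boolean escape = c == '\\' && p + 1 < src.length() && src.charAt(p + 1) != '\n';
            boolean plain = c != '/' && c != '\n';
            if (escapesFirst && escape) {
                p += 2;
            } else if (plain) {
                p += 1;
            } else if (c == '/' && p > start + 1) {
                return p + 1;
            } else {
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Painless source: /\\/.split(x); return y ==~ /\./
        String two = "/\\\\/.split(x); return y ==~ /\\./";
        // Greedy swallows everything up to the second regex's opening slash.
        System.out.println(two.substring(0, longestMatchEnd(two, 0)));
        // Non-greedy with escapes first stops after the first regex: /\\/
        System.out.println(two.substring(0, nonGreedyEnd(two, 0, true)));

        // Painless source: /\/\//
        String escaped = "/\\/\\//";
        System.out.println(nonGreedyEnd(escaped, 0, true));  // whole token, 6
        System.out.println(nonGreedyEnd(escaped, 0, false)); // cut short at 3
    }
}
```

In this model the greedy scan leaves `\./` behind as unlexable input, while the non-greedy scan with the old alternative order cuts `/\/\//` off after the first escaped slash.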

nik9000 · 2017-03-17T16:38:18Z

modules/lang-painless/src/test/java/org/elasticsearch/painless/RegexTests.java

+    }
+
+    public void testRegexIsNonGreedy() {
+        assertEquals(true, exec("def s = /\\\\/.split('.\\\\.'); return s[1] ==~ /\\./"));


This blew up mightily without the lexer change.
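For readers less familiar with Painless, the script in this test corresponds roughly to the plain Java below. This is a sketch that assumes Painless's regex `split` and `==~` behave like `Pattern.split` and `Matcher.matches`; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class NonGreedyTestSketch {
    // Mirrors: def s = /\\/.split('.\\.'); return s[1] ==~ /\./
    static boolean splitThenMatch(String input) {
        // Split on a literal backslash.
        String[] s = Pattern.compile("\\\\").split(input);
        // The second piece must be exactly one literal dot.
        return s.length > 1 && Pattern.compile("\\.").matcher(s[1]).matches();
    }

    public static void main(String[] args) {
        // The Java literal ".\\." is the three characters: dot, backslash, dot.
        System.out.println(splitThenMatch(".\\."));
    }
}
```

The test only passes if the lexer sees two separate regex tokens in the script, which is what makes it a useful regression check.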

nik9000 · 2017-03-17T16:39:05Z

modules/lang-painless/src/test/java/org/elasticsearch/painless/RegexTests.java

-    public void testSlashesEscapePattern() {
-        assertEquals(true, exec("return '//' ==~ /\\/\\//"));
+    public void testBackslashEscapesForwardSlash() {
+        assertEquals(true, exec("'//' ==~ /\\/\\//"));


This one blows up if you don't reorder the lexer rules after you switch it to non-greedy.

nik9000 · 2017-03-17T16:40:08Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/node/ERegex.java

-        } catch (PatternSyntaxException exception) {
-            throw createError(exception);
+        } catch (PatternSyntaxException e) {
+            throw new Location(location.getSourceName(), location.getOffset() + 1 + e.getIndex()).createError(


I found this while debugging this issue and it bothered me more than it should have. I'm quite happy PatternSyntaxException seems designed for you to extract this information.

The 1 is for the leading slash.

Is it not possible to fix this value where the Location object is originally generated?

I need the offset from the exception to get the location right. The +1 bit is to get the leading /. What kind of fix were you thinking about? An override of createError maybe?
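The offset arithmetic discussed in this thread can be sketched in isolation. This is a minimal example, not the `ERegex` code: `tokenOffset` stands in for `location.getOffset()`, the class and method names are hypothetical, and using `getDescription()` to get a message without the multi-line pointer is an assumption about how the unreadable portion might be dropped.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexErrorOffsetSketch {
    // Returns the offset in the script where the pattern error sits, or -1
    // if the pattern compiles. tokenOffset is where the regex token (its
    // leading slash) starts in the script.
    static int errorOffset(int tokenOffset, String pattern) {
        try {
            Pattern.compile(pattern);
            return -1;
        } catch (PatternSyntaxException e) {
            // getIndex() may be -1 when the JVM does not know the position.
            int index = e.getIndex();
            // The +1 skips the leading slash of the regex literal.
            return index < 0 ? tokenOffset : tokenOffset + 1 + index;
        }
    }

    // Returns the one-line description of the syntax error, or null if the
    // pattern is valid. getMessage() would also include the pattern and a
    // caret line, which is the part that renders badly in JSON.
    static String shortMessage(String pattern) {
        try {
            Pattern.compile(pattern);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // "a(b" has an unclosed group, so compilation fails.
        System.out.println(errorOffset(10, "a(b"));
        System.out.println(shortMessage("a(b"));
    }
}
```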

jdconrad

LGTM! Thanks for fixing this. Left one minor comment.

* master:
  Docs: Fix language on a few snippets
  Painless: Fix regex lexer and error messages (elastic#23634)
  Skip 5.4 bwc test for new name for now
  Count through the primary in list of strings test
  Skip testing new name if it isn't known
  Wait for all shards in list of strings test
  Deprecate request_cache for clear-cache (elastic#23638)

nik9000 added :Plugin Lang Painless v5.4.0 v6.0.0-alpha1 labels Mar 17, 2017

nik9000 requested a review from jdconrad March 17, 2017 16:36

nik9000 added review >bug labels Mar 17, 2017

jdconrad approved these changes Mar 17, 2017

nik9000 merged commit 257a7d7 into elastic:master Mar 22, 2017

clintongormley added :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed :Plugin Lang Painless labels Feb 14, 2018
