Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix painless's regex lexer and error messages #23634

Merged
merged 1 commit into from
Mar 22, 2017

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Mar 17, 2017

Without this change, if write a script with multiple regexes
sometimes the lexer will decide to look at them like one
big regex and then some trailing garbage. Like this discuss post:
https://discuss.elastic.co/t/error-with-the-split-function-in-painless-script/79021

def val = /\\\\/.split(ctx._source.event_data.param17);
if (val[2] =~ /\\./) {
  def val2 = /\\./.split(val[2]);
  ctx._source['user_crash'] = val2[0]
} else {
  ctx._source['user_crash'] = val[2]
}

The error message you get from the lexer is lexer_no_viable_alt_exception
right after the second regex.

With this change each regex is just a single regex like it ought to be.

As a bonus, while looking into this issue I found that the error
reporting for regexes wasn't very nice. If you specify an invalid
pattern then you get an error marker on the start of the pattern
with the JVM's regex error message which attempts to point you to the
location in the regex but is totally unreadable in the JSON response.

This change fixes the location to point to the appropriate spot
inside the pattern and removes the portion of the JVM's error message
that doesn't render well. It is no longer needed now that we point
users to the appropriate spot in the pattern.

Without this change, if write a script with multiple regexes
*sometimes* the lexer will decide to look at them like one
big regex and then some trailing garbage. Like this discuss post:
https://discuss.elastic.co/t/error-with-the-split-function-in-painless-script/79021

```
def val = /\\\\/.split(ctx._source.event_data.param17);
if (val[2] =~ /\\./) {
  def val2 = /\\./.split(val[2]);
  ctx._source['user_crash'] = val2[0]
} else {
  ctx._source['user_crash'] = val[2]
}
```

The error message you get from the lexer is `lexer_no_viable_alt_exception`
right after the *second* regex.

With this change each regex is just a single regex like it ought to be.

As a bonus, while looking into this issue I found that the error
reporting for regexes wasn't very nice. If you specify an invalid
pattern then you get an error marker on the start of the pattern
with the JVM's regex error message which attempts to point you to the
location in the regex but is totally unreadable in the JSON response.

This change fixes the location to point to the appropriate spot
inside the pattern and removes the portion of the JVM's error message
that doesn't render well. It is no longer needed now that we point
users to the appropriate spot in the pattern.
@@ -120,7 +120,7 @@ INTEGER: ( '0' | [1-9] [0-9]* ) [lLfFdD]?;
DECIMAL: ( '0' | [1-9] [0-9]* ) (DOT [0-9]+)? ( [eE] [+\-]? [0-9]+ )? [fFdD]?;

STRING: ( '"' ( '\\"' | '\\\\' | ~[\\"] )*? '"' ) | ( '\'' ( '\\\'' | '\\\\' | ~[\\'] )*? '\'' );
REGEX: '/' ( ~('/' | '\n') | '\\' ~'\n' )+ '/' [cilmsUux]* { slashIsRegex() }?;
REGEX: '/' ( '\\' ~'\n' | ~('/' | '\n') )+? '/' [cilmsUux]* { slashIsRegex() }?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This non-greedy pattern works well for strings so I figured I should use it for regexes as well. I had to reorder the escape characters to the front so they wouldn't be ignored because order is important in non-greedy rules.

}

public void testRegexIsNonGreedy() {
assertEquals(true, exec("def s = /\\\\/.split('.\\\\.'); return s[1] ==~ /\\./"));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This blew up mightily without the lexer change.

public void testSlashesEscapePattern() {
assertEquals(true, exec("return '//' ==~ /\\/\\//"));
public void testBackslashEscapesForwardSlash() {
assertEquals(true, exec("'//' ==~ /\\/\\//"));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one blows up if you don't reorder the lexer rules after you switch it to non-greedy.

} catch (PatternSyntaxException exception) {
throw createError(exception);
} catch (PatternSyntaxException e) {
throw new Location(location.getSourceName(), location.getOffset() + 1 + e.getIndex()).createError(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this while debugging this issue and it bothered me more than it should have. I'm quite happy PatternSyntaxException seems designed for you to extract this information.

The 1 is for the leading slash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to fix this value where the Location object is originally generated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need the offset from the exception to get the location right. The +1 bit is to get the leading /. What kind of fix were you thinking about? An override of createError maybe?

Copy link
Contributor

@jdconrad jdconrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for fixing this. Left one minor comment.

@nik9000 nik9000 merged commit 257a7d7 into elastic:master Mar 22, 2017
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 22, 2017
* master:
  Docs: Fix language on a few snippets
  Painless: Fix regex lexer and error messages (elastic#23634)
  Skip 5.4 bwc test for new name for now
  Count through the primary in list of strings test
  Skip testing new name if it isn't known
  Wait for all shards in list of strings test
  Deprecate request_cache for clear-cache (elastic#23638)
nik9000 added a commit that referenced this pull request Mar 22, 2017
Without this change, if write a script with multiple regexes
*sometimes* the lexer will decide to look at them like one
big regex and then some trailing garbage. Like this discuss post:
https://discuss.elastic.co/t/error-with-the-split-function-in-painless-script/79021

```
def val = /\\\\/.split(ctx._source.event_data.param17);
if (val[2] =~ /\\./) {
  def val2 = /\\./.split(val[2]);
  ctx._source['user_crash'] = val2[0]
} else {
  ctx._source['user_crash'] = val[2]
}
```

The error message you get from the lexer is `lexer_no_viable_alt_exception`
right after the *second* regex.

With this change each regex is just a single regex like it ought to be.

As a bonus, while looking into this issue I found that the error
reporting for regexes wasn't very nice. If you specify an invalid
pattern then you get an error marker on the start of the pattern
with the JVM's regex error message which attempts to point you to the
location in the regex but is totally unreadable in the JSON response.

This change fixes the location to point to the appropriate spot
inside the pattern and removes the portion of the JVM's error message
that doesn't render well. It is no longer needed now that we point
users to the appropriate spot in the pattern.
@clintongormley clintongormley added :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed :Plugin Lang Painless labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache v5.4.0 v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants