NIFI-5826 Fix to escaped backslash #3183

bdesert · 2018-11-27T18:03:20Z

Fixing escaped backslash by unescaping it in java just before compiling regex.

For a consistency with expression language regex-based evaluators, all record path operators will follow the same escape sequence for regex. Single backslash should be defined as double backslash.
Examples:

replaceRegex(/col1, '\s', "_") - will replace all whitespaces (spaces, tabs) with underscore.
replaceRegex(/col1, '\.', ",") - will replace all a period with
replaceRegex(/col1, '\\', "/") - will replace backslash with forward slash

EL References: StringLiteralEvaluator. Special characters (backslash sequences) will be handled for any string literal value, and will affect the way regex and escaped chars must be defined. Same logic has been added to RecordPathUtils.unescapeBackslash.

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

Fixing escaped backslash by unescaping it in java just before compiling regex. For a consistency with expression language regex-based evaluators, all record path operators will follow the same escape sequence for regex. Single backslash should be defined as double backslash. Examples: - replaceRegex(/col1, '\\s', "_") - will replace all whitespaces (spaces, tabs) with underscore. - replaceRegex(/col1, '\\.', ",") - will replace all a period with - replaceRegex(/col1, '\\\\', "/") - will replace backslash with forward slash NIFI-5826 Fix to escaped backslash Fixing escaped backslash by unescaping it in java just before compiling regex. For a consistency with expression language regex-based evaluators, all record path operators will follow the same escape sequence for regex. Single backslash should be defined as double backslash. Examples: - replaceRegex(/col1, '\\s', "_") - will replace all whitespaces (spaces, tabs) with underscore. - replaceRegex(/col1, '\\.', ",") - will replace all a period with - replaceRegex(/col1, '\\\\', "/") - will replace backslash with forward slash

ottobackwards

A couple of comments

ottobackwards · 2018-11-27T18:14:31Z

...commons/nifi-record-path/src/main/java/org/apache/nifi/record/path/util/RecordPathUtils.java

+        boolean lastCharIsBackslash = false;
+        for (int i = 0; i < value.length(); i++) {
+            final char c = value.charAt(i);
+


Is this the entire list that has to be escaped? Maybe you can reference where that list is specified for maintenance.

Added to the PR description:
EL References: StringLiteralEvaluator. Special characters (backslash sequences) will be handled for any string literal value, and will affect the way regex and escaped chars must be defined. Same logic has been added to RecordPathUtils.unescapeBackslash.

I am sorry, I meant if you could document the code. That way someone who goes in can tell that you are escaping these specific characters per some specification

@ottobackwards I've added Javadocs with reference to antlr parser and explaining the reasons for this transformation.

ottobackwards · 2018-11-27T18:15:24Z

nifi-commons/nifi-record-path/src/test/java/org/apache/nifi/record/path/TestRecordPath.java

@@ -1008,12 +1008,16 @@ public void testReplaceRegex() {
        final List<RecordField> fields = new ArrayList<>();
        fields.add(new RecordField("id", RecordFieldType.INT.getDataType()));
        fields.add(new RecordField("name", RecordFieldType.STRING.getDataType()));


There should be tests for all the escapable chars

added tests to cover all of special chars. Also added backslash escape sequences to test exceptions.

…es with backslash

ottobackwards · 2018-11-27T22:00:34Z

+1, LGTM

ijokarumawak

Hi @bdesert, first of all, thanks for trying to improve this! I was trying to understand the issue and the fix to merge, but couldn't get convinced fully if this approach is the best way to address the issue.

I should say I'm not an expert of ANTLR, so excuse me if I'm saying nonsense..

I found following code block in RecordPathLexer.g. My understanding is the ESC fragment converts \[ to \\[, that leads the RegularExpression parse error. And this PR is trying to support such regex argument work.

fragment
ESC
	:	'\\'
		(
				'"'		{ setText("\""); }
			|	'\''	{ setText("\'"); }
			|	'r'		{ setText("\r"); }
			|	'n'		{ setText("\n"); }
			|	't'		{ setText("\t"); }
			|	'\\'	{ setText("\\\\"); }
			|	nextChar = ~('"' | '\'' | 'r' | 'n' | 't' | '\\')
				{
					StringBuilder lBuf = new StringBuilder(); lBuf.append("\\\\").appendCodePoint(nextChar); setText(lBuf.toString());
				}
		)
	;

https://github.com/apache/nifi/blob/master/nifi-commons/nifi-record-path/src/main/antlr3/org/apache/nifi/record/path/RecordPathLexer.g#L155

Especially the line 155, StringBuilder lBuf = new StringBuilder(); lBuf.append("\\\\").appendCodePoint(nextChar); setText(lBuf.toString()); does the conversion.
Do you know why this is needed? Shouldn't it append "\\" instead of "\\\\" like below? :

StringBuilder lBuf = new StringBuilder(); lBuf.append("\\").appendCodePoint(nextChar); setText(lBuf.toString());

I believe this ESC fragment does unescaping escaped characters such as \t or \n. If so, adding another unescaping at each function does not make sense.

Would you please educate me a bit with some explanation? Thanks a lot!

ijokarumawak · 2018-11-29T07:11:50Z

...commons/nifi-record-path/src/main/java/org/apache/nifi/record/path/util/RecordPathUtils.java

+    /**
+     * This method handles backslash sequences after ANTLR parser converts all backslash into double ones
+     * with exception for \t, \r and \n. See
+     * <a href="file:../../../../../../../../../src/main/antlr3/org/apache/nifi/record/path/RecordPathParser.g">org/apache/nifi/record/path/RecordPathParser.g</a>


I wonder you wanted to link to RecordPathLexer.g instead of RecordPathParser.g.

Yeah, that was supposed to be Lexer (I'll update PR later just in case I'll need to make more changes).
Regarding your question on Lexer.
When I started working on this issue, I also first started looking into Lexer, trying to understand all these escapes for backslash.
After that I took a look at Expression Language Lexer and found the same code there:
https://github.com/apache/nifi/blob/master/nifi-commons/nifi-expression-language/src/main/antlr3/org/apache/nifi/attribute/expression/language/antlr/AttributeExpressionLexer.g#L235

Then I started digging into actual code in Java and understood the reasons.
ANTLR parses the code coming from a textbox. All special cases for backslash sequences (\r,\n,\t) should be supported and escaped to be able to capture actual new lines and tab chars. But then, if you support backslash sequences, you would also need to escape "" itself, so there is another escape for backslash on a line 152. And then, if you escape single backslash, as special character, to avoid confusion with \r,\n,\t,\, the rest of the characters are being escaped with double backslash + character (line 155)- to escape actual two char sequence of "\t" or "\r" , etc...
Changing Lexer would change default behavior for backslash sequences not only for regex functions, but for all record based input. That would create backward compatibility issue, and that is the main reason I decided not to change Lexer.
And also, as I mentioned in already above, current "escape" policy is consistent with EL, so i preferred to keep consistency and just fix a bug of not escaping back for regular expressions.

Let me know if that makes sense, or if you have better ideas.

@bdesert Thanks for the explanation. I read through your comment over and over, and tested different input string. However, that differs from what what I'm seeing.

I debugged how Lexer works, too, and found that:

The ESC fragment handles an escaped special character in String representation. I.e. String \t will be converted to actual tab character.

The string values user input from NiFi UI are passed to RecordPath.compile method as it is. E.g. the input string replaceRegex(/name, '\[', '') is passed to as is, then the single back-slash is converted to double back-slash by the ESC fragment line 155.

Since the escaped characters such as \n are already unescaped by this Lexer, the RecordPathUtils.unescapeBackslash method added by this PR will not see any escaped characters other than the doubled back-slash //. And the doubled back-slash is converted back to a single back-slash by RecordPathUtils.unescapeBackslash.

I believe the line 153-156 is aimed to preserve escaped characters as it is, because such escape doesn't mean anything for the RecordPath/AttrExpLang spec. And those should be unescaped later by underlying syntaxes such as RegEx.

And current line 155 does it wrongly. It should append a single back-slash..

Other Lexers (AttributeExpressionLexer.g and HL7QueryLexer.g) have the same issue.

So, I think we should fix all Lexers instead of adding another conversion.

Here is the Lexer code for reference:

143 fragment 144 ESC 145 : '\\' 146 ( 147 '"' { setText("\""); } 148 | '\'' { setText("\'"); } 149 | 'r' { setText("\r"); } 150 | 'n' { setText("\n"); } 151 | 't' { setText("\t"); } 152 | '\\' { setText("\\\\"); } 153 | nextChar = ~('"' | '\'' | 'r' | 'n' | 't' | '\\') 154 { 155 StringBuilder lBuf = new StringBuilder(); lBuf.append("\\\\").appendCodePoint(nextChar); setText(lBuf.toString()); 156 } 157 ) 158 ;

Can share any input String value that a user would set into a processor property text box (or passed to RecordPath.compile method), that will go the proposed RecordPathUtils.unescapeBackslash method lines of code for \n, \r or \t? Current test case does have regex test cases using those escaped characters, but coverage shows those branches are not used:

@ijokarumawak ,
I understand what you are talking about, that was my first idea as well.
the problem is coming with regression on WORKING functionality, like EL.
If we change Lexer in all three cases, we will have a problem of backward compatibility.
As an example: define GFF with attribute "a1" having value (with actual new line):

"this is new line and this is just a backslash \n"

(doesn't matter with quotes or w/o).
next: UpdateAttribute with:
a1: "${a:replaceAll('\\n','_')}"
and
a2: "${a:replaceAll('\n','_')}"
(note single and double backslash)
In both cases the only character that will be replaced will be actual new line, resulting in:
a1=a2="this is new line_and this is just a backslash \n"
If we change Lexer for EL, this will be changed and will behave differently and will result in:
a1 = "his is new line
and this is just a backslash _"
a2 = "his is new line_and this is just a backslash \n"

Of course, we can do "right" way for RecordPath regex-based functions, but then we will have different user experience with regex in RecordPath and EL, which I think should be avoided.

Regarding Java method "unescapeBackslash". As a name suggests, this function it to treat string values having "backslash" in their values. I do agree that for the test cases we have, some parts of the code are dead, but since this is public method in utility class, it can have generic functionality to support wider varieties of the use cases.

Would appreciate your feedback!

What would help, is one or more clear, failing unit tests against the current code that illustrate the problem.

We have regex escape routines in more than one place, some for places without grammars ( replace text). Where we do have a grammar, the correct thing, or the preferred thing in a vacuum would be to have the grammar handle this. It is centralized and more maintainable. We may want to evaluate the regression as a community, based on the fix and the implications wrt maintainability and correctness.

I would almost say that we would want to have discussion and review of both approaches.

That would be my preference.

Such discussion would solicit input from other experienced contributors as well.

@bdesert Thanks for sharing the example. I tried it out but result was different. I believe the content (attribute value) itself won't processed by the Lexer, so changing Lexer wouldn't affect that EL usages. Please look at #3200 for detail. I shared flow template and results. I hope I have set up the flow as you mentioned, but please let me know if I misunderstood!

Providing generic util method such as unescapeBackslash is fine. But I think using it to address NIFI-5826 is not a good idea. As it would make just another confusion around how these actually work.

@ottobackwards Agreed to add more unit tests and keep discussion going for both approaches with more examples. I created #3200 for comparison.

bdesert · 2019-01-03T04:26:59Z

Closing this PR. Please refer to PR 3200 with better solution.

ottobackwards reviewed Nov 27, 2018

View reviewed changes

bdesert added 3 commits November 27, 2018 13:55

NIFI-5826 Adding tests to cover all special chars and special use cas…

8a9c863

…es with backslash

added javadoc

b4cb3a3

fix for javadoc

adf9520

ijokarumawak reviewed Nov 29, 2018

View reviewed changes

bdesert closed this Jan 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NIFI-5826 Fix to escaped backslash #3183

NIFI-5826 Fix to escaped backslash #3183

bdesert commented Nov 27, 2018 •

edited

ottobackwards left a comment

ottobackwards Nov 27, 2018

bdesert Nov 27, 2018

ottobackwards Nov 27, 2018

bdesert Nov 27, 2018

ottobackwards Nov 27, 2018

bdesert Nov 27, 2018

ottobackwards commented Nov 27, 2018

ijokarumawak left a comment •

edited

ijokarumawak Nov 29, 2018

bdesert Nov 29, 2018

ijokarumawak Nov 30, 2018

bdesert Dec 4, 2018

ottobackwards Dec 4, 2018

ottobackwards Dec 4, 2018

ijokarumawak Dec 5, 2018

bdesert commented Jan 3, 2019

NIFI-5826 Fix to escaped backslash #3183

NIFI-5826 Fix to escaped backslash #3183

Conversation

bdesert commented Nov 27, 2018 • edited

Fixing escaped backslash by unescaping it in java just before compiling regex.

For all changes:

For code changes:

For documentation related changes:

Note:

ottobackwards left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ottobackwards commented Nov 27, 2018

ijokarumawak left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bdesert commented Jan 3, 2019

bdesert commented Nov 27, 2018 •

edited

ijokarumawak left a comment •

edited