Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-5826 Fix to escaped backslash #3183

Closed
wants to merge 4 commits into from

Conversation

bdesert
Copy link
Contributor

@bdesert bdesert commented Nov 27, 2018

Fixing escaped backslash by unescaping it in java just before compiling regex.

For a consistency with expression language regex-based evaluators, all record path operators will follow the same escape sequence for regex. Single backslash should be defined as double backslash.
Examples:

  • replaceRegex(/col1, '\s', "_") - will replace all whitespaces (spaces, tabs) with underscore.
  • replaceRegex(/col1, '\.', ",") - will replace all a period with
  • replaceRegex(/col1, '\\', "/") - will replace backslash with forward slash

EL References: StringLiteralEvaluator. Special characters (backslash sequences) will be handled for any string literal value, and will affect the way regex and escaped chars must be defined. Same logic has been added to RecordPathUtils.unescapeBackslash.


Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

Fixing escaped backslash by unescaping it in java just before compiling regex.
For a consistency with expression language regex-based evaluators, all record path operators will follow the same escape sequence for regex. Single backslash should be defined as double backslash.
Examples:
- replaceRegex(/col1, '\\s', "_") - will replace all whitespaces (spaces, tabs) with underscore.
- replaceRegex(/col1, '\\.', ",") - will replace all a period with
- replaceRegex(/col1, '\\\\', "/") - will replace backslash with forward slash

NIFI-5826 Fix to escaped backslash

Fixing escaped backslash by unescaping it in java just before compiling regex.
For a consistency with expression language regex-based evaluators, all record path operators will follow the same escape sequence for regex. Single backslash should be defined as double backslash.
Examples:
- replaceRegex(/col1, '\\s', "_") - will replace all whitespaces (spaces, tabs) with underscore.
- replaceRegex(/col1, '\\.', ",") - will replace all a period with
- replaceRegex(/col1, '\\\\', "/") - will replace backslash with forward slash
Copy link
Contributor

@ottobackwards ottobackwards left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments

boolean lastCharIsBackslash = false;
for (int i = 0; i < value.length(); i++) {
final char c = value.charAt(i);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the entire list that has to be escaped? Maybe you can reference where that list is specified for maintenance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to the PR description:
EL References: StringLiteralEvaluator. Special characters (backslash sequences) will be handled for any string literal value, and will affect the way regex and escaped chars must be defined. Same logic has been added to RecordPathUtils.unescapeBackslash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry, I meant if you could document the code. That way someone who goes in can tell that you are escaping these specific characters per some specification

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ottobackwards I've added Javadocs with reference to antlr parser and explaining the reasons for this transformation.

@@ -1008,12 +1008,16 @@ public void testReplaceRegex() {
final List<RecordField> fields = new ArrayList<>();
fields.add(new RecordField("id", RecordFieldType.INT.getDataType()));
fields.add(new RecordField("name", RecordFieldType.STRING.getDataType()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be tests for all the escapable chars

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests to cover all of special chars. Also added backslash escape sequences to test exceptions.

@ottobackwards
Copy link
Contributor

+1, LGTM

Copy link
Member

@ijokarumawak ijokarumawak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bdesert, first of all, thanks for trying to improve this! I was trying to understand the issue and the fix to merge, but couldn't get convinced fully if this approach is the best way to address the issue.

I should say I'm not an expert of ANTLR, so excuse me if I'm saying nonsense..

I found following code block in RecordPathLexer.g. My understanding is the ESC fragment converts \[ to \\[, that leads the RegularExpression parse error. And this PR is trying to support such regex argument work.

fragment
ESC
	:	'\\'
		(
				'"'		{ setText("\""); }
			|	'\''	{ setText("\'"); }
			|	'r'		{ setText("\r"); }
			|	'n'		{ setText("\n"); }
			|	't'		{ setText("\t"); }
			|	'\\'	{ setText("\\\\"); }
			|	nextChar = ~('"' | '\'' | 'r' | 'n' | 't' | '\\')
				{
					StringBuilder lBuf = new StringBuilder(); lBuf.append("\\\\").appendCodePoint(nextChar); setText(lBuf.toString());
				}
		)
	;

https://github.com/apache/nifi/blob/master/nifi-commons/nifi-record-path/src/main/antlr3/org/apache/nifi/record/path/RecordPathLexer.g#L155

Especially the line 155, StringBuilder lBuf = new StringBuilder(); lBuf.append("\\\\").appendCodePoint(nextChar); setText(lBuf.toString()); does the conversion.
Do you know why this is needed? Shouldn't it append "\\" instead of "\\\\" like below? :

StringBuilder lBuf = new StringBuilder(); lBuf.append("\\").appendCodePoint(nextChar); setText(lBuf.toString());

I believe this ESC fragment does unescaping escaped characters such as \t or \n. If so, adding another unescaping at each function does not make sense.

Would you please educate me a bit with some explanation? Thanks a lot!

/**
* This method handles backslash sequences after ANTLR parser converts all backslash into double ones
* with exception for \t, \r and \n. See
* <a href="file:../../../../../../../../../src/main/antlr3/org/apache/nifi/record/path/RecordPathParser.g">org/apache/nifi/record/path/RecordPathParser.g</a>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder you wanted to link to RecordPathLexer.g instead of RecordPathParser.g.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was supposed to be Lexer (I'll update PR later just in case I'll need to make more changes).
Regarding your question on Lexer.
When I started working on this issue, I also first started looking into Lexer, trying to understand all these escapes for backslash.
After that I took a look at Expression Language Lexer and found the same code there:
https://github.com/apache/nifi/blob/master/nifi-commons/nifi-expression-language/src/main/antlr3/org/apache/nifi/attribute/expression/language/antlr/AttributeExpressionLexer.g#L235

Then I started digging into actual code in Java and understood the reasons.
ANTLR parses the code coming from a textbox. All special cases for backslash sequences (\r,\n,\t) should be supported and escaped to be able to capture actual new lines and tab chars. But then, if you support backslash sequences, you would also need to escape "" itself, so there is another escape for backslash on a line 152. And then, if you escape single backslash, as special character, to avoid confusion with \r,\n,\t,\, the rest of the characters are being escaped with double backslash + character (line 155)- to escape actual two char sequence of "\t" or "\r" , etc...
Changing Lexer would change default behavior for backslash sequences not only for regex functions, but for all record based input. That would create backward compatibility issue, and that is the main reason I decided not to change Lexer.
And also, as I mentioned in already above, current "escape" policy is consistent with EL, so i preferred to keep consistency and just fix a bug of not escaping back for regular expressions.

Let me know if that makes sense, or if you have better ideas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdesert Thanks for the explanation. I read through your comment over and over, and tested different input string. However, that differs from what what I'm seeing.

I debugged how Lexer works, too, and found that:

  • The ESC fragment handles an escaped special character in String representation. I.e. String \t will be converted to actual tab character.
  • The string values user input from NiFi UI are passed to RecordPath.compile method as it is. E.g. the input string replaceRegex(/name, '\[', '') is passed to as is, then the single back-slash is converted to double back-slash by the ESC fragment line 155.
  • Since the escaped characters such as \n are already unescaped by this Lexer, the RecordPathUtils.unescapeBackslash method added by this PR will not see any escaped characters other than the doubled back-slash //. And the doubled back-slash is converted back to a single back-slash by RecordPathUtils.unescapeBackslash.
  • I believe the line 153-156 is aimed to preserve escaped characters as it is, because such escape doesn't mean anything for the RecordPath/AttrExpLang spec. And those should be unescaped later by underlying syntaxes such as RegEx.
    • And current line 155 does it wrongly. It should append a single back-slash..
    • Other Lexers (AttributeExpressionLexer.g and HL7QueryLexer.g) have the same issue.
  • So, I think we should fix all Lexers instead of adding another conversion.

Here is the Lexer code for reference:

143 fragment
144 ESC
145   :  '\\'
146     (
147         '"'    { setText("\""); }
148       |  '\''  { setText("\'"); }
149       |  'r'   { setText("\r"); }
150       |  'n'   { setText("\n"); }
151       |  't'   { setText("\t"); }
152       |  '\\'  { setText("\\\\"); }
153       |  nextChar = ~('"' | '\'' | 'r' | 'n' | 't' | '\\')
154        {
155          StringBuilder lBuf = new StringBuilder(); lBuf.append("\\\\").appendCodePoint(nextChar); setText(lBuf.toString());
156        }
157    )
158  ;

Can share any input String value that a user would set into a processor property text box (or passed to RecordPath.compile method), that will go the proposed RecordPathUtils.unescapeBackslash method lines of code for \n, \r or \t? Current test case does have regex test cases using those escaped characters, but coverage shows those branches are not used:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ijokarumawak ,
I understand what you are talking about, that was my first idea as well.
the problem is coming with regression on WORKING functionality, like EL.
If we change Lexer in all three cases, we will have a problem of backward compatibility.
As an example: define GFF with attribute "a1" having value (with actual new line):

"this is new line
and this is just a backslash \n"

(doesn't matter with quotes or w/o).
next: UpdateAttribute with:
a1: "${a:replaceAll('\\n','_')}"
and
a2: "${a:replaceAll('\n','_')}"
(note single and double backslash)
In both cases the only character that will be replaced will be actual new line, resulting in:
a1=a2="this is new line_and this is just a backslash \n"
If we change Lexer for EL, this will be changed and will behave differently and will result in:
a1 = "his is new line
and this is just a backslash _"
a2 = "his is new line_and this is just a backslash \n"

Of course, we can do "right" way for RecordPath regex-based functions, but then we will have different user experience with regex in RecordPath and EL, which I think should be avoided.


Regarding Java method "unescapeBackslash". As a name suggests, this function it to treat string values having "backslash" in their values. I do agree that for the test cases we have, some parts of the code are dead, but since this is public method in utility class, it can have generic functionality to support wider varieties of the use cases.

Would appreciate your feedback!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would help, is one or more clear, failing unit tests against the current code that illustrate the problem.

We have regex escape routines in more than one place, some for places without grammars ( replace text). Where we do have a grammar, the correct thing, or the preferred thing in a vacuum would be to have the grammar handle this. It is centralized and more maintainable. We may want to evaluate the regression as a community, based on the fix and the implications wrt maintainability and correctness.

I would almost say that we would want to have discussion and review of both approaches.

That would be my preference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such discussion would solicit input from other experienced contributors as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdesert Thanks for sharing the example. I tried it out but result was different. I believe the content (attribute value) itself won't processed by the Lexer, so changing Lexer wouldn't affect that EL usages. Please look at #3200 for detail. I shared flow template and results. I hope I have set up the flow as you mentioned, but please let me know if I misunderstood!

Providing generic util method such as unescapeBackslash is fine. But I think using it to address NIFI-5826 is not a good idea. As it would make just another confusion around how these actually work.

@ottobackwards Agreed to add more unit tests and keep discussion going for both approaches with more examples. I created #3200 for comparison.

@bdesert bdesert closed this Jan 3, 2019
@bdesert
Copy link
Contributor Author

bdesert commented Jan 3, 2019

Closing this PR. Please refer to PR 3200 with better solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants