Apparent Test expression fail 2x #813

Closed
PaltryProgrammer opened this issue Jan 30, 2023 · 15 comments

@PaltryProgrammer

Greetings, kind regards, and thank you for dnGrep. The attached GIF demonstrates two apparent Test expression failures: first, it failed to match "#include"; second, it failed to open the dialog on the second attempt. Please also see the attached log. Thank you kindly.
Grep_Error_Log.zip
Test expression Fail

@doug24
Contributor

doug24 commented Jan 30, 2023

The first problem is with the "Whole word" option and the # character. Unlike case-sensitive, there is no 'whole word' flag in regular expressions to automatically find whole words, so dnGrep tries to help the user by adding a \b word boundary anchor on both ends of the user-supplied expression. The problem here is that the # character is not a word character, so \b# will never match - there is no word boundary transition between another non-word character and the # character.
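The failure is easy to reproduce outside dnGrep. The following is a minimal sketch using Python's `re` module, which is assumed to treat `\b` the same way as .NET for this ASCII input; it is an illustration, not dnGrep's code, and the sample line is made up.

```python
import re

line = "#include <stdio.h>"  # hypothetical sample line

# '#' is a non-word character, and the position before the start of the
# string also counts as non-word, so there is no word boundary before '#'.
wrapped = re.search(r"\b#include\b", line)

# Without the leading \b the same text is found.
trailing_only = re.search(r"#include\b", line)

print(wrapped)                # None
print(trailing_only.group())  # #include
```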

Adding the \b characters to match whole words was done in dnGrep back in January 2010, long before I started working on the code. It's the kind of thing that seems helpful until you get to edge cases like this. I'm not going to remove the code adding the \b - too many people rely on it - but I will see if I can make it smarter to check for non-word characters before adding them.

The second problem, the exception, is an easy fix. It's not clear why it happened here or why it hasn't come up as a problem before.

@doug24
Contributor

doug24 commented Jan 30, 2023

After thinking about it some more, I don't think there is a way to make the Whole word option work any better in regular expressions when the user-supplied expression starts or ends with a non-word character. So, I think I will leave the existing code as-is. For these cases, uncheck the whole word option and you have the full regex language to define the expression you need.

@PaltryProgrammer
Author

Please see the attached GIF, which demonstrates that a RegEx search for #include with Whole Word works in Visual Studio.
RegEx With Whole Word Works in Visual Studio

@doug24
Contributor

doug24 commented Jan 31, 2023

That's nice, I wonder what code they have added to do that.

Notepad++ disables 'Match whole word only' with regular expressions.

So why have you enabled regular expressions? You aren't using a regular expression.

You can always uncheck the Whole word option and supply your own regular expression to do exactly what you want. Maybe \W?#include\b or \W?#include\W? will work for you?

@PaltryProgrammer
Author

Is the attached GIF, a demonstration of RegEx Find in Visual Studio without Whole Word, helpful?
RegEx WithOut Whole Word Works in Visual Studio

@doug24
Contributor

doug24 commented Jan 31, 2023

I'm not clear on what you are trying to show here? If you mean this to be a demonstration of whole word matching, then I think abc#include should NOT match because there is no non-word character before the #. Just adding a trailing \b is not adequate for the general case.

We already know that the \b anchor before and after the search pattern is not going to work if the search pattern starts with or ends with a non-word character but does work for patterns that begin and end with word characters.

I'm going to run some tests on other expressions to see if something works for the general case. Maybe something like this:
(?<=\W|^)(__pattern__)(?=\W|$)

@PaltryProgrammer
Author

i am attempting to show whole word matching can be performed by placing \b at boundary of text which is not otherwise bounded . as for abc#include matching i do not believe it matches rather only the later portion i.e. #include is matched which is as expected and desired . as for the regular expression provided in your response it is over my head as i know nothing of capture groups . i do not understand the difficulty of whole word matching as it seems to me one merely as stated places a \b as described above . ¯_(ツ)_/¯

@doug24
Contributor

doug24 commented Jan 31, 2023

If you want to use just a trailing \b for your special case, go ahead. But that does not work for the general case, and I can't put it in the code.

For example, using just a trailing \b this is an incorrect whole word match:

As for abc#include - it should not generate any match, not even on the later portion - that is not what whole word means. Forget the # symbol for a minute: if you search for book as a whole word, you should not expect to get a match on abcbook.
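One way to see the kind of false positive described here is the short sketch below. It uses Python's `re` module as a stand-in for the .NET engine (an assumption; the two agree on `\b` for these ASCII inputs), and the test strings are illustrative.

```python
import re

# Trailing-only \b: nothing constrains what comes before the pattern.
book_trailing = re.search(r"book\b", "abcbook")
include_trailing = re.search(r"#include\b", "abc#include")

# \b on both ends rejects the 'book' case, as a whole-word search should.
book_both = re.search(r"\bbook\b", "abcbook")

print(book_trailing.group())     # book
print(include_trailing.group())  # #include
print(book_both)                 # None
```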

As I mentioned above, the existing dnGrep code for the past 13 years does this for whole word matches:
\b(pattern)\b
and that works if the pattern does not begin or end with a non-word character. The problem with \b is that it is looking both forward and backward for a non-word character. We need something that only looks backward before the pattern, and only forward after the pattern.

I think this alternative will handle both cases:

(?<=\W|^)(pattern)(?=\W|$)

These are grouping constructs, but not capture groups:
The first is a Zero-Width Positive Lookbehind Assertion, and it says there must be a non-word character \W or start of line ^ before the pattern.
The second is a Zero-Width Positive Lookahead Assertion, and says there must be a non-word character \W or end of line $ after the pattern.
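A rough sketch of how the two assertions behave, written in Python rather than .NET. Two things here are assumptions and not from dnGrep: the helper name `whole_word` is hypothetical, and the lookbehind is restructured because Python's `re` rejects a lookbehind whose alternatives differ in width (`\W` is one character, `^` is zero). The condition expressed is the same.

```python
import re

def whole_word(pattern):
    # Hypothetical helper, not dnGrep code.
    # .NET accepts (?<=\W|^); in Python the same condition is written as
    # "(?<=\W) or ^" with the alternation moved outside the lookbehind.
    return re.compile(r"(?:(?<=\W)|^)(" + pattern + r")(?=\W|$)", re.MULTILINE)

inc = whole_word("#include")
book = whole_word("book")

print(bool(inc.search("#include <stdio.h>")))  # True
print(bool(inc.search("abc#include")))         # False
print(bool(book.search("a book here")))        # True
print(bool(book.search("abcbook")))            # False
```

With this form, abc#include produces no match at all, which is the behavior argued for above.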

@doug24
Contributor

doug24 commented Feb 1, 2023

After having spent hours on this... I now have this expression:

(?<=\W|\b|^)(pattern)(?=\W|\b|$)

In addition to the normal whole word matches, it handles the cases where the pattern begins with or ends with punctuation, symbols, or spaces (where the pattern begins with or ends with something that normally signifies the start or end of a word).

I added about 50 unit tests, including 20 that failed with the existing implementation, and now work with the pattern above. I'm sure I haven't begun to include all possibilities.

So, it does fix the original issue:
Search whole word #include in #include, and it does match #include

It will match a pattern beginning with or ending with a non-word character immediately adjacent to a word character. But this is the same behavior you show above in Visual Studio.
Search whole word #include in abc#include, and it does match #include
Search whole word string\. in string text = string.Empty, and it does match string.
which I guess is no different than:
Search whole word chemist in the chemist's lab, and it does match chemist
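The cases listed above can be checked with a small script. This is a sketch in Python, not the dnGrep implementation or its unit tests: `whole_word` is a hypothetical helper, the lookbehind is split up because Python's `re` requires fixed-width lookbehinds, and the last case (book in abcbook) is taken from earlier in this thread as a negative control.

```python
import re

def whole_word(pattern):
    # Hypothetical helper. (?<=\W|\b|^) is split up for Python;
    # \b and ^ are already zero-width, so they can sit outside the lookbehind.
    prefix = r"(?:(?<=\W)|\b|^)"
    suffix = r"(?=\W|\b|$)"
    return re.compile(prefix + "(" + pattern + ")" + suffix, re.MULTILINE)

cases = [
    ("#include", "#include"),
    ("#include", "abc#include"),
    (r"string\.", "string text = string.Empty"),
    ("chemist", "the chemist's lab"),
    ("book", "abcbook"),
]

results = []
for pattern, text in cases:
    m = whole_word(pattern).search(text)
    results.append(m.group(1) if m else None)
    print(f"{pattern!r} in {text!r}: {results[-1]!r}")
```

The added `\b` alternative is what lets #include match inside abc#include: the transition from c to # is itself a word boundary.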

@doug24
Contributor

doug24 commented Feb 21, 2023

Added in v3.2.279

@PaltryProgrammer
Author

i must be stupid because to me it seems simple id est a word boundary i presume is just that exempli gratia as stated here learn Microsoft Word Boundary: \b . the notion of a word is well defined id est a series of characters matched by \w . so if a whole word flag is not available in a regex engine then it seems to my ignorant self to offer such a selectable option it is only necessary to scan the pattern for all words and surround each w/ \b . of course a user may do both id est select the whole word option and install the \b anchor in which case ¯\_(ツ)_/¯ .
i was not aware "whole word" is not part of the regex engine only that it is commonly provided . i will therefore endeavor to take your kind advice and utilize \b w/o the proffered whole word option selected .
i apologize for taxing your kind and helpful patience w/ my ignorance .

@doug24
Contributor

doug24 commented Dec 27, 2023

Go ahead and experiment. You will very quickly find that the beginning of line and end of line do not match the \b word boundary token.

So in many cases the simple \b does not work where my complicated expression above does work. As I said above:

In addition to the normal whole word matches, it handles the cases where the pattern begins with or ends with punctuation, symbols, or spaces (where the pattern begins with or ends with something that normally signifies the start or end of a word).
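A quick experiment of the kind suggested here, sketched in Python (assumed to treat `\b` like .NET for these ASCII inputs; the sample strings are made up). At a line edge, `\b` matches only when the adjacent character is a word character, so it fails exactly when the pattern starts or ends with punctuation.

```python
import re

# Start of line: \b needs a word character right after it.
starts_with_word = re.search(r"\binclude", "include")
starts_with_hash = re.search(r"\b#include", "#include")

# End of line: \b needs a word character right before it.
ends_with_dot = re.search(r"string\.\b", "x = string.")

print(bool(starts_with_word))  # True
print(bool(starts_with_hash))  # False
print(bool(ends_with_dot))     # False
```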

@doug24
Contributor

doug24 commented Dec 27, 2023

Just one more thing - you opened this issue because the way dnGrep made a regex 'whole word' didn't work the way you wanted or expected. And that method was to add a \b before and after the pattern entered by the user. I went through a lot of work to come up with a better method. And now, 11 months later, you think that adding a \b is all that is necessary. I am at a loss for words.

@PaltryProgrammer
Author

i must not understand something as your knowledge i am certain is superior to mine . i do not understand your reference to beginning and end of line as i believed we were speaking of words not lines . i am assuming the < and > characters are not word characters therefore if i were writing the code i would instead utilize the "whole word" pattern <\baaa\b> which succeeds as shown. further as shown below it seems sufficient to eschew "whole word" anchors and instead merely utilize \w+ .
in any event i wish you a happy and prosperous New Year free of my troublesome self .
word boundary as i understand it
apparently no need for word boundary

@doug24
Contributor

doug24 commented Dec 27, 2023

Yes, you can write many different expressions to get what you want, but I can't write code that discerns what you intend when you enter any pattern and then say, 'make that a whole word, please'. The only thing the code can do is try to make your entire pattern a 'whole word'. There is no way the code can go from a user entering <aaa> and modifying it for them to <\baaa\b>. The dnGrep code doesn't parse the regular expressions.
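To illustrate the difference between anchors written inside a pattern and anchors wrapped around it, here is a minimal Python sketch. `wrap_whole_word` is a hypothetical stand-in for the idea of wrapping an opaque pattern string; it is not the dnGrep function, and the lookbehind is restructured for Python's fixed-width requirement.

```python
import re

def wrap_whole_word(pattern):
    # Hypothetical helper: treats the pattern as an opaque string and only
    # adds a prefix and a suffix; it never looks inside the pattern.
    return r"(?:(?<=\W)|\b|^)(?:" + pattern + r")(?=\W|\b|$)"

text = "<aaa> <aaab>"

hand_written = re.findall(r"<\baaa\b>", text)            # anchors inside, placed by the user
wrapped = re.findall(wrap_whole_word("<aaa>"), text)     # anchors around the whole pattern
plain_b = re.findall(r"\b<aaa>\b", text)                 # the older \b...\b wrapping

print(hand_written)  # ['<aaa>']
print(wrapped)       # ['<aaa>']
print(plain_b)       # []
```

The wrapper reaches the same result on this input without knowing anything about the structure of the pattern, while the plain `\b` wrapping finds nothing because < and > are non-word characters.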

I think the current solution is pretty good in that respect - as good as Visual Studio, I'd say.
