Extend Regexp support to accommodate expanding $ variables #1351

abhinavdangeti · 2020-03-16T22:57:46Z

regexp filter's should replace a match with a
single entry of replace and not by a number of times.
For example ..
- Consider the following regex:
  ([a-z])\s+(\d)
- For the string "temp 1", the above regex matches:
  "p 1"
- Let the replacement be "$1-$2", so the expectation
  is that "p 1" gets replaced by "p-1".
- The code before the fix replaces "p 1" with:
  "$1-$2$1-$2$1-$2"

abhinavdangeti · 2020-03-16T22:58:11Z

@mschoch Not sure if I'm overlooking something obvious here, but could you review this for me?

mschoch · 2020-03-16T23:15:12Z

So, the reason it was coded this way was that if the length of the input is changed, you can no longer use the byte offsets for client-side highlighting. So, this version attempts to make replacements but keep the overall length, and relative offsets for the tokens which will later be generated the same.

Is there some use-case we need to support that is broken by this? Can we maybe make the behavior an option?

abhinavdangeti · 2020-03-16T23:18:56Z

I see. So the highlighting makes sense to me.
I've added a test case that generates erroneous output - how to support ${..} replacements with bytes.Repeat.

mschoch · 2020-03-16T23:20:47Z

It's also possible from reading your examples that the behavior I intended, actually does stupidly wrong things in other cases as well. So, maybe I need to read this all again closer.

mschoch · 2020-03-16T23:29:29Z

I mean, I think there are just 2 different use cases here. It was obviously coded in a way that expected a literal replacement, not a backreference. In fact, it doesn't even seem like it would work right if the replacement isn't the exact same length as the matched expression.

So, with that limitation, it seems there isn't a good way to support what I was trying to achieve.

I don't remember why I thought this was so important in the first place. Maybe it makes more sense to say, if client-side highlighting matters, you can't alter the text positions with a charfilter.

abhinavdangeti · 2020-03-17T00:16:35Z

Hmm, it seems the current code only works when the replacement is a single byte. Also here's an example that concerns me..

Lets take the same example regexp again: ([a-z])\s+(\d)
Say my input were temp 1
And replace were A
The output I'd get is temAAA, this doesn't look right to me.
But substituting the occurrence p 1 with the A and filling the rest of the length with ' ' somewhat does look ok to me .. so that's temA .

Any thoughts/concerns with that^

mschoch · 2020-03-17T00:20:40Z

So, I think the intention here was more about removing problematic characters and replacing them with less problematic ones. Something like replace “ or ” with ".

Again, I'm not opposed to changing things here, but would feel better if you have a use-case in mind.

abhinavdangeti · 2020-03-17T00:30:01Z

Got it, thanks for explaining why.
Let me think on this a bit, and put up a change here, and then we'll revisit this again.

abhinavdangeti · 2020-03-17T01:32:19Z

So the code change I'm proposing here is to accommodate regexp's power to Expand $ variables, while making sure highlighting doesn't break. Here're some examples showing new behavior ..

regexp	replace	input	output (before)	output (after)
`([a-z])\s+(\d)`	`$1-$2`	`temp 1`	`tem$1-$2$1-$2$1-$2`	`temp-1`
`foo.?`	`X`	`seafood, fool`	`seaXXXX, XXXX`	`seaX , X`
`def`	`_`	`abcdefghi`	`abc___ghi`	`abc_ ghi`
`456`	`000000`	`123456789`	`123000000000000000000789`	`123000000789`
`“\|”`	`"`	`“hello”`	`"""hello"""`	`" hello"`

abhinavdangeti · 2020-03-17T21:09:43Z

Hey @mschoch , have you had a chance to look at my proposition^

mschoch · 2020-03-18T03:56:41Z

I have, again I think I'm still hung up on the same thing. The original expectation was that users would replace a matched byte sequence with another of the same length, and it's up to them to choose a replacement that makes sense. Your last example, almost matches what I describe, except that “ is 3 bytes in utf8, and the replacement " is only a single byte. This case, they might have chosen to replace it with " or " and then the "output (before)" wouldn't seem so ridiculous.

Again I still can't fully defend the way this works, as I admit it has obvious limitations. But, I'm still left wondering if there is a way to offer both. Then we can argue about what the default should be?

abhinavdangeti · 2020-03-18T17:11:52Z

Hmm, so I suppose I can look into a way to support both ways of doing things.
However, I didn't quite follow one thing from your last comment ..

Your last example, almost matches what I describe, except that “ is 3 bytes in utf8, and the replacement " is only a single byte. This case, they might have chosen to replace it with " or " and then the "output (before)" wouldn't seem so ridiculous.

Yes “ takes 3 bytes as opposed to " which takes 1 byte. Assuming that the user did replace the 3byte “ with the 1byte ", the current code (without this PR) replaces “ with """ - which does look pretty off to me - is this what you meant?

mschoch · 2020-03-18T17:17:28Z

No, I'm saying you're example is wrong. I tried to use github formatting to illustrate that, but perhaps it didn't help.

Why is your example wrong? You replaced a 3 byte sequence with a 1 byte sequence, which then causes it to be repeated 3 times. That is what the code will do if you ask it, but I'm saying that is incorrect usage. Correct usage is to provide your own 3 byte sequence. Going to try again with different formatting. Something like <space><quote><space> or <space><space><quote>. If you do this, you do not end up with an unsightly """.

mschoch · 2020-03-18T17:37:54Z

OK, I'm changing my opinion 180 degrees. This should just be your original patch. Just do a regexp replace all, and document that it alters offsets relative to the original, so client-side highlighting can no longer be done.

Some of the unit tests I had will fail, so we should remove them as well.

abhinavdangeti · 2020-03-18T17:38:19Z

Got it, ok cool that makes sense.

Now, here's my one last argument before I look into making expansion an optional setting..

Here's why I think we don't need to offer 2 ways of doing this - if a user were to use a replacement string of the same length as the regexp match, with this PR - the output will remain as it were before. Taking the same example we're talking about ..

regexp	replace	input	output (before)	output (after)
`“\|”`	`_"_`	`“hello”`	`_"_hello_"_`	`_"_hello_"_`

If however the length of the replacement string is different, this PR will -

just substitute the regexp match with the replacement if len(output) >= len(input).
substitute and fill in whitespaces to get the output string's length to equal the input's.
in case of $ variables in the replacement string, the correct length of the replacement is determined.

This example shows the change in behavior ..

regexp	replace	input	output (before)	output (after)
`foo.?`	`X`	`seafood, fool`	`seaXXXX, XXXX`	`seaX , X`

Do you see a usecase where output(before) generates a more preferred outcome?

mschoch · 2020-03-18T17:39:35Z

So, we're replying to each other out of sync now. But, I just realized you had change the code with some special handling of whitespace, and I really don't like that. So, that's why I'm suggesting you just go back to your original proposal.

abhinavdangeti · 2020-03-18T17:39:50Z

I'd have to make some changes to the highlighting code to not panic if I were to just use ReplaceAll.

mschoch · 2020-03-18T17:41:21Z

I'd have to make some changes to the highlighting code to not panic if I were to just use ReplaceAll.

Wasn't this your original proposal?

return s.r.ReplaceAll(input, s.replacement)

mschoch · 2020-03-18T17:42:13Z

Oh I see, it doesn't just break client-side highlighting, it breaks our highlighting as well, because the original (prior to char filter) is what gets stored.

abhinavdangeti · 2020-03-18T17:43:02Z

Yeah and it breaks our highlighting as well :)

abhinavdangeti · 2020-04-14T19:32:23Z

For now, my new proposal is to just follow elastic's approach as documented here.

Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting
If replacement string is much greater in length than the original text, highlighting may bail entirely to avoid panic-ing on out of bounds.

regexp	replace	input	output (before)	output (after)
`([a-z])\s+(\d)`	`$1-$2`	`temp 1`	`tem$1-$2$1-$2$1-$2`	`temp-1`
`foo.?`	`X`	`seafood, fool`	`seaXXXX, XXXX`	`seaX, X`
`def`	`_`	`abcdefghi`	`abc___ghi`	`abc_ghi`
`456`	`000000`	`123456789`	`123000000000000000000789`	`123000000789`
`“\|”`	`"`	`“hello”`	`"""hello"""`	`"hello"`

@mschoch Do let me know if you think this is ok to start with..

mschoch · 2020-04-14T20:14:08Z

So, my main complaint is this change assumes, you'd rather have variable length replacement working than being able to do some replacements and keep highlighting working.

I can see value in both use cases, but this PR just trades one that works today for a different one. You still haven't include the example as I intended which was to replace “ with a 3 byte sequence . And to replace ” with the 3 byte sequence . This correctly replaces a 3 byte sequence with another 3 byte sequence, one that should actually tokenize like the original, and preserve highlighting. I think maybe it works in your code as well, but can you add it to see?

Second, I don't think it's correct to say this is exactly what ES does. That is what is documented in the Pattern Replace documentation. But, some of the other character filters describe behavior which indicates that for some of them, they track changes in positions and do something a bit better (still not perfect in many cases).

Third, you describe highlighting bailing early to avoid panic if the replacement is much greater. What does much greater mean here. Isn't just a single byte greater potentially going to return an offset that is invalid for highlighting?

abhinavdangeti · 2020-04-14T21:13:26Z

My main motivation for this change is: to support expanding $ variables within replacements.

So let me summarize everything else I have so far ..

Firstly, highlighting bails at times in the current code as well, if replacement's length differs the pattern matched by the regexp.
Like you mention there will be behavior change noted in highlighting with variable length replacement. Let's take the example you speak of ..

regexp	replace	input	output (before)	output (after)
`“\|”`	`"`	`“hello”`	`"""hello"""`	`"hello"`

Here's how highlighting will differ with the change made ..
Before: “hello”
After: “hello”

So you're right, the output in the above example is less desirable with this change ..
For the sake of my argument here, lets take another example ..

regexp	replace	input	output (before)	output (after)
`def`	`@#$`	`abcdefghi`	`abc@#$@#$@#$ghi`	`abc@#$ghi`

In this case, highlighting is more desirable with the code change ..
Before: abcdefghi (highlighting bails)
After: abcdefghi

So, the current code behavior is erratic if length of replacement is greater than a byte.
You're likely right on the second point you mention on what ES does, I was just referring to them mentioning that highlighting may be incorrect owing to the replacement used.

Now let me talk about a case that isn't handled within our highlighting code that'd cause the panic I mention, note that this is without any code change. Take the new example I added to search_test.go (TestSearchHighlightingWithRegexpReplacement) for this. The output generated for the regexp-replacement is fooooooo$1-$2ooooo$1-$2ooooo$1-$20, and with the "unicode" tokenizer the tokens generated are ["1", "20", "2ooooo", "fooooooo"].

The start location for the token "1" is at 9 in the replacement, and when the highlighter tries to map this on the original value "fool 10", it panics as the length of the original value is 7.
I believe the fix within search/highlight/fragmenter/simple/simple.go is needed no matter what ..

--- FAIL: TestSearchHighlightingWithRegexpReplacement (0.08s)
panic: runtime error: slice bounds out of range [:9] with capacity 8 [recovered]
	panic: runtime error: slice bounds out of range [:9] with capacity 8

goroutine 24 [running]:
testing.tRunner.func1(0xc0002ac100)
	/usr/local/Cellar/go/1.13.7/libexec/src/testing/testing.go:874 +0x3a3
panic(0x1693a80, 0xc0003e4040)
	/usr/local/Cellar/go/1.13.7/libexec/src/runtime/panic.go:679 +0x1b2
github.com/blevesearch/bleve/search/highlight/fragmenter/simple.(*Fragmenter).Fragment(0xc0003cc220, 0xc0003cc248, 0x7, 0x8, 0xc0003ee6c0, 0x7, 0x8, 0x8, 0xc000408360, 0x120)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/search/highlight/fragmenter/simple/simple.go:69 +0x9f8
github.com/blevesearch/bleve/search/highlight/highlighter/simple.(*Highlighter).BestFragmentsInField(0xc0003f2480, 0xc0003fce40, 0xc0003ee480, 0xc000026890, 0x6, 0x1, 0xc0003c1890, 0x10151d1, 0x120)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/search/highlight/highlighter/simple/highlighter_simple.go:96 +0xa66
github.com/blevesearch/bleve.LoadAndHighlightFields(0xc0003fce40, 0xc0003ce000, 0x16ddbfa, 0x7, 0x17c1c20, 0xc000314500, 0x17be040, 0xc0003f2480, 0x0, 0xc0003f4460)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/index_impl.go:658 +0x15e
github.com/blevesearch/bleve.(*indexImpl).SearchInContext(0xc000179180, 0x17b7580, 0xc0000ba008, 0xc0003ce000, 0x0, 0x0, 0x0)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/index_impl.go:562 +0x16a5
github.com/blevesearch/bleve.(*indexImpl).Search(0xc000179180, 0xc0003ce000, 0x0, 0x0, 0xc0000202d0)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/index_impl.go:369 +0x4d
github.com/blevesearch/bleve.TestSearchHighlightingWithRegexpReplacement(0xc0002ac100)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/search_test.go:1700 +0x8e8
testing.tRunner(0xc0002ac100, 0x16fefc8)
	/usr/local/Cellar/go/1.13.7/libexec/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
	/usr/local/Cellar/go/1.13.7/libexec/src/testing/testing.go:960 +0x350

mschoch · 2020-04-14T21:27:54Z

I still think you're not getting my example. It is not the same replacement for left and right quote, they are unique replacements that have the exact same length as the original they are replacing, so that none of the character repeating stuff ever comes into play. Can we have a call on this tomorrow?

mschoch · 2020-04-14T21:32:40Z

Alright, let's just forget my example.

How involved are the changes so that the highlighter never panics? Like every slice into the original stored field, should have bounds checking.

abhinavdangeti · 2020-04-14T21:40:00Z

For the situation where we replace a pattern with a unique replacement of the exact same length - the behavior is better with this code change, as I illustrated earlier with the abcdefghi example. Take this as another example ..

regexp	replace	input	output (before)	output (after)
`“\|”`	`@#$`	`“hello”`	`@#$@#$@#$hello@#$@#$@#$`	`@#$hello@#$`

So highlighting ..
Before: “hello”
After: “hello”

Yes, the highlighter changes involve a bounds check for every stored field.

abhinavdangeti · 2020-04-14T21:40:24Z

Oh, and open to a call tomorrow if you'd prefer it.

The old code works well only if the replacement length is no more than 1 byte.

Addresses blevesearch#1370

+ Expand essentially interprets $ signs, so for example $1 represents the text of the first submatch. + For example .. - Consider the following regex: ([a-z])\s+(\d) - For the string "temp 1", the above regex matches: "p 1" - Let the replacement be "$1-$2", so the expectation is that "p 1" gets replaced by "p-1". - The code before the fix replaces "p 1" with: "$1-$2$1-$2$1-$2"

+ I've tracked this down as a regression that was introduced when I changed the regexp character filter's replacement behavior here - #1351 + The above change is still necessary for highlighting to work correctly when regexp character filter is used. + The HTML character filter which uses the regexp character filter will need to replace every character of the matched sequence with whitespace so that the number of whitespaces equals the length of the HTML tag. + Tracking ticket: https://issues.couchbase.com/browse/MB-50002

+ I've tracked this down as a regression that was introduced when I changed the regexp character filter's replacement behavior here - #1351 + The above change is still necessary for highlighting to work correctly when regexp character filter is used and for replacements that involve '$' variables. + The HTML character filter which uses the regexp character filter will need to replace every character of the matched sequence with whitespace so that the number of whitespaces equals the length of the HTML tag. + Tracking ticket: https://issues.couchbase.com/browse/MB-50002

abhinavdangeti requested review from steveyen, mschoch and sreekanth-cb March 16, 2020 22:57

abhinavdangeti force-pushed the regexp branch from 60ac437 to 89353ac Compare March 17, 2020 01:07

abhinavdangeti changed the title ~~Regexp fix for replacing a substring~~ Extend Regexp support to accommodate expanding $ variables Mar 17, 2020

abhinavdangeti added 3 commits April 15, 2020 10:28

Fix slice out-of-bounds panic within highlighter

487ec05

Addresses blevesearch#1370

MB-38312: More test cases for regexp-replace

e94966d

abhinavdangeti force-pushed the regexp branch from 6314dd8 to e94966d Compare April 15, 2020 17:36

abhinavdangeti mentioned this pull request Apr 15, 2020

Regexp-replace character filter is erratic when replacement is more than 1 byte #1372

Closed

mschoch approved these changes Apr 15, 2020

View reviewed changes

abhinavdangeti merged commit ba61f7e into blevesearch:master Apr 15, 2020

abhinavdangeti mentioned this pull request Dec 8, 2021

Use of HTML character filter breaks highlighting #1641

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend Regexp support to accommodate expanding $ variables #1351

Extend Regexp support to accommodate expanding $ variables #1351

abhinavdangeti commented Mar 16, 2020 •

edited

abhinavdangeti commented Mar 16, 2020

mschoch commented Mar 16, 2020

abhinavdangeti commented Mar 16, 2020

mschoch commented Mar 16, 2020

mschoch commented Mar 16, 2020

abhinavdangeti commented Mar 17, 2020

mschoch commented Mar 17, 2020

abhinavdangeti commented Mar 17, 2020

abhinavdangeti commented Mar 17, 2020

abhinavdangeti commented Mar 17, 2020

mschoch commented Mar 18, 2020

abhinavdangeti commented Mar 18, 2020

mschoch commented Mar 18, 2020

mschoch commented Mar 18, 2020 •

edited

abhinavdangeti commented Mar 18, 2020

mschoch commented Mar 18, 2020

abhinavdangeti commented Mar 18, 2020

mschoch commented Mar 18, 2020

mschoch commented Mar 18, 2020

abhinavdangeti commented Mar 18, 2020

abhinavdangeti commented Apr 14, 2020

mschoch commented Apr 14, 2020

abhinavdangeti commented Apr 14, 2020 •

edited

mschoch commented Apr 14, 2020

mschoch commented Apr 14, 2020

abhinavdangeti commented Apr 14, 2020

abhinavdangeti commented Apr 14, 2020 •

edited

Extend Regexp support to accommodate expanding $ variables #1351

Extend Regexp support to accommodate expanding $ variables #1351

Conversation

abhinavdangeti commented Mar 16, 2020 • edited

abhinavdangeti commented Mar 16, 2020

mschoch commented Mar 16, 2020

abhinavdangeti commented Mar 16, 2020

mschoch commented Mar 16, 2020

mschoch commented Mar 16, 2020

abhinavdangeti commented Mar 17, 2020

mschoch commented Mar 17, 2020

abhinavdangeti commented Mar 17, 2020

abhinavdangeti commented Mar 17, 2020

abhinavdangeti commented Mar 17, 2020

mschoch commented Mar 18, 2020

abhinavdangeti commented Mar 18, 2020

mschoch commented Mar 18, 2020

mschoch commented Mar 18, 2020 • edited

abhinavdangeti commented Mar 18, 2020

mschoch commented Mar 18, 2020

abhinavdangeti commented Mar 18, 2020

mschoch commented Mar 18, 2020

mschoch commented Mar 18, 2020

abhinavdangeti commented Mar 18, 2020

abhinavdangeti commented Apr 14, 2020

mschoch commented Apr 14, 2020

abhinavdangeti commented Apr 14, 2020 • edited

mschoch commented Apr 14, 2020

mschoch commented Apr 14, 2020

abhinavdangeti commented Apr 14, 2020

abhinavdangeti commented Apr 14, 2020 • edited

abhinavdangeti commented Mar 16, 2020 •

edited

mschoch commented Mar 18, 2020 •

edited

abhinavdangeti commented Apr 14, 2020 •

edited

abhinavdangeti commented Apr 14, 2020 •

edited