Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend Regexp support to accommodate expanding $ variables #1351

Merged
merged 3 commits into from Apr 15, 2020

Conversation

abhinavdangeti
Copy link
Member

@abhinavdangeti abhinavdangeti commented Mar 16, 2020

  • regexp filter's should replace a match with a
    single entry of replace and not by a number of times.
  • For example ..
    • Consider the following regex:
      ([a-z])\s+(\d)
    • For the string "temp 1", the above regex matches:
      "p 1"
    • Let the replacement be "$1-$2", so the expectation
      is that "p 1" gets replaced by "p-1".
    • The code before the fix replaces "p 1" with:
      "$1-$2$1-$2$1-$2"

@abhinavdangeti
Copy link
Member Author

@mschoch Not sure if I'm overlooking something obvious here, but could you review this for me?

@mschoch
Copy link
Member

mschoch commented Mar 16, 2020

So, the reason it was coded this way was that if the length of the input is changed, you can no longer use the byte offsets for client-side highlighting. So, this version attempts to make replacements but keep the overall length, and relative offsets for the tokens which will later be generated the same.

Is there some use-case we need to support that is broken by this? Can we maybe make the behavior an option?

@abhinavdangeti
Copy link
Member Author

I see. So the highlighting makes sense to me.
I've added a test case that generates erroneous output - how to support ${..} replacements with bytes.Repeat.

@mschoch
Copy link
Member

mschoch commented Mar 16, 2020

It's also possible from reading your examples that the behavior I intended, actually does stupidly wrong things in other cases as well. So, maybe I need to read this all again closer.

@mschoch
Copy link
Member

mschoch commented Mar 16, 2020

I mean, I think there are just 2 different use cases here. It was obviously coded in a way that expected a literal replacement, not a backreference. In fact, it doesn't even seem like it would work right if the replacement isn't the exact same length as the matched expression.

So, with that limitation, it seems there isn't a good way to support what I was trying to achieve.

I don't remember why I thought this was so important in the first place. Maybe it makes more sense to say, if client-side highlighting matters, you can't alter the text positions with a charfilter.

@abhinavdangeti
Copy link
Member Author

Hmm, it seems the current code only works when the replacement is a single byte. Also here's an example that concerns me..

  • Lets take the same example regexp again: ([a-z])\s+(\d)
  • Say my input were temp 1
  • And replace were A
  • The output I'd get is temAAA, this doesn't look right to me.
  • But substituting the occurrence p 1 with the A and filling the rest of the length with ' ' somewhat does look ok to me .. so that's temA .

Any thoughts/concerns with that^

@mschoch
Copy link
Member

mschoch commented Mar 17, 2020

So, I think the intention here was more about removing problematic characters and replacing them with less problematic ones. Something like replace or with ".

Again, I'm not opposed to changing things here, but would feel better if you have a use-case in mind.

@abhinavdangeti
Copy link
Member Author

Got it, thanks for explaining why.
Let me think on this a bit, and put up a change here, and then we'll revisit this again.

@abhinavdangeti
Copy link
Member Author

So the code change I'm proposing here is to accommodate regexp's power to Expand $ variables, while making sure highlighting doesn't break. Here're some examples showing new behavior ..

regexp replace input output (before) output (after)
([a-z])\s+(\d) $1-$2 temp 1 tem$1-$2$1-$2$1-$2 temp-1
foo.? X seafood, fool seaXXXX, XXXX seaX , X
def _ abcdefghi abc___ghi abc_ ghi
456 000000 123456789 123000000000000000000789 123000000789
“|” " “hello” """hello""" " hello"

@abhinavdangeti abhinavdangeti changed the title Regexp fix for replacing a substring Extend Regexp support to accommodate expanding $ variables Mar 17, 2020
@abhinavdangeti
Copy link
Member Author

Hey @mschoch , have you had a chance to look at my proposition^

@mschoch
Copy link
Member

mschoch commented Mar 18, 2020

I have, again I think I'm still hung up on the same thing. The original expectation was that users would replace a matched byte sequence with another of the same length, and it's up to them to choose a replacement that makes sense. Your last example, almost matches what I describe, except that is 3 bytes in utf8, and the replacement " is only a single byte. This case, they might have chosen to replace it with " or " and then the "output (before)" wouldn't seem so ridiculous.

Again I still can't fully defend the way this works, as I admit it has obvious limitations. But, I'm still left wondering if there is a way to offer both. Then we can argue about what the default should be?

@abhinavdangeti
Copy link
Member Author

Hmm, so I suppose I can look into a way to support both ways of doing things.
However, I didn't quite follow one thing from your last comment ..

Your last example, almost matches what I describe, except that is 3 bytes in utf8, and the replacement " is only a single byte. This case, they might have chosen to replace it with " or " and then the "output (before)" wouldn't seem so ridiculous.

Yes takes 3 bytes as opposed to " which takes 1 byte. Assuming that the user did replace the 3byte with the 1byte ", the current code (without this PR) replaces with """ - which does look pretty off to me - is this what you meant?

@mschoch
Copy link
Member

mschoch commented Mar 18, 2020

No, I'm saying you're example is wrong. I tried to use github formatting to illustrate that, but perhaps it didn't help.

Why is your example wrong? You replaced a 3 byte sequence with a 1 byte sequence, which then causes it to be repeated 3 times. That is what the code will do if you ask it, but I'm saying that is incorrect usage. Correct usage is to provide your own 3 byte sequence. Going to try again with different formatting. Something like <space><quote><space> or <space><space><quote>. If you do this, you do not end up with an unsightly """.

@mschoch
Copy link
Member

mschoch commented Mar 18, 2020

OK, I'm changing my opinion 180 degrees. This should just be your original patch. Just do a regexp replace all, and document that it alters offsets relative to the original, so client-side highlighting can no longer be done.

Some of the unit tests I had will fail, so we should remove them as well.

@abhinavdangeti
Copy link
Member Author

Got it, ok cool that makes sense.

Now, here's my one last argument before I look into making expansion an optional setting..

Here's why I think we don't need to offer 2 ways of doing this - if a user were to use a replacement string of the same length as the regexp match, with this PR - the output will remain as it were before. Taking the same example we're talking about ..

regexp replace input output (before) output (after)
“|” _"_ “hello” _"_hello_"_ _"_hello_"_

If however the length of the replacement string is different, this PR will -

  • just substitute the regexp match with the replacement if len(output) >= len(input).
  • substitute and fill in whitespaces to get the output string's length to equal the input's.
  • in case of $ variables in the replacement string, the correct length of the replacement is determined.

This example shows the change in behavior ..

regexp replace input output (before) output (after)
foo.? X seafood, fool seaXXXX, XXXX seaX , X

Do you see a usecase where output(before) generates a more preferred outcome?

@mschoch
Copy link
Member

mschoch commented Mar 18, 2020

So, we're replying to each other out of sync now. But, I just realized you had change the code with some special handling of whitespace, and I really don't like that. So, that's why I'm suggesting you just go back to your original proposal.

@abhinavdangeti
Copy link
Member Author

I'd have to make some changes to the highlighting code to not panic if I were to just use ReplaceAll.

@mschoch
Copy link
Member

mschoch commented Mar 18, 2020

I'd have to make some changes to the highlighting code to not panic if I were to just use ReplaceAll.

Wasn't this your original proposal?

return s.r.ReplaceAll(input, s.replacement)

@mschoch
Copy link
Member

mschoch commented Mar 18, 2020

Oh I see, it doesn't just break client-side highlighting, it breaks our highlighting as well, because the original (prior to char filter) is what gets stored.

@abhinavdangeti
Copy link
Member Author

Yeah and it breaks our highlighting as well :)

@abhinavdangeti
Copy link
Member Author

For now, my new proposal is to just follow elastic's approach as documented here.

  • Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting
  • If replacement string is much greater in length than the original text, highlighting may bail entirely to avoid panic-ing on out of bounds.
regexp replace input output (before) output (after)
([a-z])\s+(\d) $1-$2 temp 1 tem$1-$2$1-$2$1-$2 temp-1
foo.? X seafood, fool seaXXXX, XXXX seaX, X
def _ abcdefghi abc___ghi abc_ghi
456 000000 123456789 123000000000000000000789 123000000789
“|” " “hello” """hello""" "hello"

@mschoch Do let me know if you think this is ok to start with..

@mschoch
Copy link
Member

mschoch commented Apr 14, 2020

So, my main complaint is this change assumes, you'd rather have variable length replacement working than being able to do some replacements and keep highlighting working.

I can see value in both use cases, but this PR just trades one that works today for a different one. You still haven't include the example as I intended which was to replace with a 3 byte sequence . And to replace with the 3 byte sequence . This correctly replaces a 3 byte sequence with another 3 byte sequence, one that should actually tokenize like the original, and preserve highlighting. I think maybe it works in your code as well, but can you add it to see?

Second, I don't think it's correct to say this is exactly what ES does. That is what is documented in the Pattern Replace documentation. But, some of the other character filters describe behavior which indicates that for some of them, they track changes in positions and do something a bit better (still not perfect in many cases).

Third, you describe highlighting bailing early to avoid panic if the replacement is much greater. What does much greater mean here. Isn't just a single byte greater potentially going to return an offset that is invalid for highlighting?

@abhinavdangeti
Copy link
Member Author

abhinavdangeti commented Apr 14, 2020

My main motivation for this change is: to support expanding $ variables within replacements.

So let me summarize everything else I have so far ..

  • Firstly, highlighting bails at times in the current code as well, if replacement's length differs the pattern matched by the regexp.
  • Like you mention there will be behavior change noted in highlighting with variable length replacement. Let's take the example you speak of ..
regexp replace input output (before) output (after)
“|” " “hello” """hello""" "hello"

Here's how highlighting will differ with the change made ..
Before: <mark>“hello”</mark>
After: <mark>“hell</mark>o”

  • So you're right, the output in the above example is less desirable with this change ..
  • For the sake of my argument here, lets take another example ..
regexp replace input output (before) output (after)
def @#$ abcdefghi abc@#$@#$@#$ghi abc@#$ghi

In this case, highlighting is more desirable with the code change ..
Before: abcdefghi (highlighting bails)
After: <mark>abcdefghi</mark>

  • So, the current code behavior is erratic if length of replacement is greater than a byte.
  • You're likely right on the second point you mention on what ES does, I was just referring to them mentioning that highlighting may be incorrect owing to the replacement used.

Now let me talk about a case that isn't handled within our highlighting code that'd cause the panic I mention, note that this is without any code change. Take the new example I added to search_test.go (TestSearchHighlightingWithRegexpReplacement) for this. The output generated for the regexp-replacement is fooooooo$1-$2ooooo$1-$2ooooo$1-$20, and with the "unicode" tokenizer the tokens generated are ["1", "20", "2ooooo", "fooooooo"].

  • The start location for the token "1" is at 9 in the replacement, and when the highlighter tries to map this on the original value "fool 10", it panics as the length of the original value is 7.
  • I believe the fix within search/highlight/fragmenter/simple/simple.go is needed no matter what ..
--- FAIL: TestSearchHighlightingWithRegexpReplacement (0.08s)
panic: runtime error: slice bounds out of range [:9] with capacity 8 [recovered]
	panic: runtime error: slice bounds out of range [:9] with capacity 8

goroutine 24 [running]:
testing.tRunner.func1(0xc0002ac100)
	/usr/local/Cellar/go/1.13.7/libexec/src/testing/testing.go:874 +0x3a3
panic(0x1693a80, 0xc0003e4040)
	/usr/local/Cellar/go/1.13.7/libexec/src/runtime/panic.go:679 +0x1b2
github.com/blevesearch/bleve/search/highlight/fragmenter/simple.(*Fragmenter).Fragment(0xc0003cc220, 0xc0003cc248, 0x7, 0x8, 0xc0003ee6c0, 0x7, 0x8, 0x8, 0xc000408360, 0x120)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/search/highlight/fragmenter/simple/simple.go:69 +0x9f8
github.com/blevesearch/bleve/search/highlight/highlighter/simple.(*Highlighter).BestFragmentsInField(0xc0003f2480, 0xc0003fce40, 0xc0003ee480, 0xc000026890, 0x6, 0x1, 0xc0003c1890, 0x10151d1, 0x120)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/search/highlight/highlighter/simple/highlighter_simple.go:96 +0xa66
github.com/blevesearch/bleve.LoadAndHighlightFields(0xc0003fce40, 0xc0003ce000, 0x16ddbfa, 0x7, 0x17c1c20, 0xc000314500, 0x17be040, 0xc0003f2480, 0x0, 0xc0003f4460)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/index_impl.go:658 +0x15e
github.com/blevesearch/bleve.(*indexImpl).SearchInContext(0xc000179180, 0x17b7580, 0xc0000ba008, 0xc0003ce000, 0x0, 0x0, 0x0)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/index_impl.go:562 +0x16a5
github.com/blevesearch/bleve.(*indexImpl).Search(0xc000179180, 0xc0003ce000, 0x0, 0x0, 0xc0000202d0)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/index_impl.go:369 +0x4d
github.com/blevesearch/bleve.TestSearchHighlightingWithRegexpReplacement(0xc0002ac100)
	/Users/abhinavdangeti/Documents/go/src/github.com/blevesearch/bleve/search_test.go:1700 +0x8e8
testing.tRunner(0xc0002ac100, 0x16fefc8)
	/usr/local/Cellar/go/1.13.7/libexec/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
	/usr/local/Cellar/go/1.13.7/libexec/src/testing/testing.go:960 +0x350

@mschoch
Copy link
Member

mschoch commented Apr 14, 2020

I still think you're not getting my example. It is not the same replacement for left and right quote, they are unique replacements that have the exact same length as the original they are replacing, so that none of the character repeating stuff ever comes into play. Can we have a call on this tomorrow?

@mschoch
Copy link
Member

mschoch commented Apr 14, 2020

Alright, let's just forget my example.

How involved are the changes so that the highlighter never panics? Like every slice into the original stored field, should have bounds checking.

@abhinavdangeti
Copy link
Member Author

For the situation where we replace a pattern with a unique replacement of the exact same length - the behavior is better with this code change, as I illustrated earlier with the abcdefghi example. Take this as another example ..

regexp replace input output (before) output (after)
“|” @#$ “hello” @#$@#$@#$hello@#$@#$@#$ @#$hello@#$

So highlighting ..
Before: “hello”
After: <mark>“hello”</mark>

Yes, the highlighter changes involve a bounds check for every stored field.

@abhinavdangeti
Copy link
Member Author

abhinavdangeti commented Apr 14, 2020

Oh, and open to a call tomorrow if you'd prefer it.

The old code works well only if the replacement length is no more than 1 byte.

+ Expand essentially interprets $ signs, so for example
  $1 represents the text of the first submatch.
+ For example ..
    - Consider the following regex:
        ([a-z])\s+(\d)
    - For the string "temp 1", the above regex matches:
        "p 1"
    - Let the replacement be "$1-$2", so the expectation
      is that "p 1" gets replaced by "p-1".
    - The code before the fix replaces "p 1" with:
        "$1-$2$1-$2$1-$2"
@abhinavdangeti abhinavdangeti merged commit ba61f7e into blevesearch:master Apr 15, 2020
abhinavdangeti added a commit that referenced this pull request Dec 8, 2021
+ I've tracked this down as a regression that was introduced
  when I changed the regexp character filter's replacement
  behavior here - #1351
+ The above change is still necessary for highlighting to work
  correctly when regexp character filter is used.

+ The HTML character filter which uses the regexp character
  filter will need to replace every character of the matched
  sequence with whitespace so that the number of whitespaces
  equals the length of the HTML tag.
+ Tracking ticket: https://issues.couchbase.com/browse/MB-50002
abhinavdangeti added a commit that referenced this pull request Dec 8, 2021
+ I've tracked this down as a regression that was introduced
  when I changed the regexp character filter's replacement
  behavior here - #1351
+ The above change is still necessary for highlighting to work
  correctly when regexp character filter is used.

+ The HTML character filter which uses the regexp character
  filter will need to replace every character of the matched
  sequence with whitespace so that the number of whitespaces
  equals the length of the HTML tag.
+ Tracking ticket: https://issues.couchbase.com/browse/MB-50002
abhinavdangeti added a commit that referenced this pull request Dec 8, 2021
+ I've tracked this down as a regression that was introduced
  when I changed the regexp character filter's replacement
  behavior here - #1351
+ The above change is still necessary for highlighting to work
  correctly when regexp character filter is used and for
  replacements that involve '$' variables.

+ The HTML character filter which uses the regexp character
  filter will need to replace every character of the matched
  sequence with whitespace so that the number of whitespaces
  equals the length of the HTML tag.
+ Tracking ticket: https://issues.couchbase.com/browse/MB-50002
abhinavdangeti added a commit that referenced this pull request Dec 9, 2021
+ I've tracked this down as a regression that was introduced
  when I changed the regexp character filter's replacement
  behavior here - #1351
+ The above change is still necessary for highlighting to work
  correctly when regexp character filter is used and for
  replacements that involve '$' variables.

+ The HTML character filter which uses the regexp character
  filter will need to replace every character of the matched
  sequence with whitespace so that the number of whitespaces
  equals the length of the HTML tag.
+ Tracking ticket: https://issues.couchbase.com/browse/MB-50002
abhinavdangeti added a commit that referenced this pull request Dec 15, 2021
+ I've tracked this down as a regression that was introduced
  when I changed the regexp character filter's replacement
  behavior here - #1351
+ The above change is still necessary for highlighting to work
  correctly when regexp character filter is used and for
  replacements that involve '$' variables.

+ The HTML character filter which uses the regexp character
  filter will need to replace every character of the matched
  sequence with whitespace so that the number of whitespaces
  equals the length of the HTML tag.
+ Tracking ticket: https://issues.couchbase.com/browse/MB-50002
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants