Extend Regexp support to accommodate expanding $ variables #1351
@mschoch Not sure if I'm overlooking something obvious here, but could you review this for me? |
So, the reason it was coded this way was that if the length of the input is changed, you can no longer use the byte offsets for client-side highlighting. So, this version attempts to make replacements but keep the overall length, and relative offsets for the tokens which will later be generated the same. Is there some use-case we need to support that is broken by this? Can we maybe make the behavior an option? |
I see. So the highlighting makes sense to me. |
It's also possible from reading your examples that the behavior I intended, actually does stupidly wrong things in other cases as well. So, maybe I need to read this all again closer. |
I mean, I think there are just 2 different use cases here. It was obviously coded in a way that expected a literal replacement, not a backreference. In fact, it doesn't even seem like it would work right if the replacement isn't the exact same length as the matched expression. So, with that limitation, it seems there isn't a good way to support what I was trying to achieve. I don't remember why I thought this was so important in the first place. Maybe it makes more sense to say, if client-side highlighting matters, you can't alter the text positions with a charfilter. |
Hmm, it seems the current code only works when the replacement is a single byte. Also here's an example that concerns me..
Any thoughts/concerns with that^ |
So, I think the intention here was more about removing problematic characters and replacing them with less problematic ones. Something like replace Again, I'm not opposed to changing things here, but would feel better if you have a use-case in mind. |
Got it, thanks for explaining why. |
So the code change I'm proposing here is to accommodate regexp's power to
|
Hey @mschoch , have you had a chance to look at my proposition^ |
I have, again I think I'm still hung up on the same thing. The original expectation was that users would replace a matched byte sequence with another of the same length, and it's up to them to choose a replacement that makes sense. Your last example, almost matches what I describe, except that Again I still can't fully defend the way this works, as I admit it has obvious limitations. But, I'm still left wondering if there is a way to offer both. Then we can argue about what the default should be? |
Hmm, so I suppose I can look into a way to support both ways of doing things.
Yes |
No, I'm saying your example is wrong. I tried to use github formatting to illustrate that, but perhaps it didn't help. Why is your example wrong? You replaced a 3 byte sequence with a 1 byte sequence, which then causes it to be repeated 3 times. That is what the code will do if you ask it, but I'm saying that is incorrect usage. Correct usage is to provide your own 3 byte sequence. Going to try again with different formatting. Something like |
OK, I'm changing my opinion 180 degrees. This should just be your original patch. Just do a regexp replace all, and document that it alters offsets relative to the original, so client-side highlighting can no longer be done. Some of the unit tests I had will fail, so we should remove them as well. |
Got it, ok cool that makes sense. Now, here's my one last argument before I look into making expansion an optional setting.. Here's why I think we don't need to offer 2 ways of doing this - if a user were to use a replacement string of the same length as the regexp match, with this PR - the output will remain as it was before. Taking the same example we're talking about ..
If however the length of the replacement string is different, this PR will -
This example shows the change in behavior ..
Do you see a usecase where |
So, we're replying to each other out of sync now. But, I just realized you had changed the code with some special handling of whitespace, and I really don't like that. So, that's why I'm suggesting you just go back to your original proposal. |
I'd have to make some changes to the highlighting code to not panic if I were to just use ReplaceAll. |
Wasn't this your original proposal?
|
Oh I see, it doesn't just break client-side highlighting, it breaks our highlighting as well, because the original (prior to char filter) is what gets stored. |
Yeah and it breaks our highlighting as well :) |
For now, my new proposal is to just follow elastic's approach as documented here.
@mschoch Do let me know if you think this is ok to start with.. |
So, my main complaint is this change assumes you'd rather have variable length replacement working than being able to do some replacements and keep highlighting working. I can see value in both use cases, but this PR just trades one that works today for a different one. You still haven't included the example as I intended which was to replace Second, I don't think it's correct to say this is exactly what ES does. That is what is documented in the Pattern Replace documentation. But, some of the other character filters describe behavior which indicates that for some of them, they track changes in positions and do something a bit better (still not perfect in many cases). Third, you describe highlighting bailing early to avoid panic if the replacement is much greater. What does much greater mean here? Isn't just a single byte greater potentially going to return an offset that is invalid for highlighting? |
My main motivation for this change is: to support expanding $ variables within replacements. So let me summarize everything else I have so far ..
Here's how highlighting will differ with the change made ..
In this case, highlighting is more desirable with the code change ..
Now let me talk about a case that isn't handled within our highlighting code that'd cause the panic I mention, note that this is without any code change. Take the new example I added to search_test.go (TestSearchHighlightingWithRegexpReplacement) for this. The output generated for the regexp-replacement is
|
I still think you're not getting my example. It is not the same replacement for left and right quote, they are unique replacements that have the exact same length as the original they are replacing, so that none of the character repeating stuff ever comes into play. Can we have a call on this tomorrow? |
Alright, let's just forget my example. How involved are the changes so that the highlighter never panics? Like every slice into the original stored field, should have bounds checking. |
For the situation where we replace a pattern with a unique replacement of the exact same length - the behavior is better with this code change, as I illustrated earlier with the
So highlighting .. Yes, the highlighter changes involve a bounds check for every stored field. |
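As a rough illustration of the bounds check being discussed, the sketch below guards a slice into the stored field before using it. `safeSlice` is a hypothetical name, not a function from bleve's highlighter, and the offsets are made up for the example.

```go
package main

import "fmt"

// safeSlice is a hypothetical helper: it returns stored[start:end] and true
// only when the offsets fall inside the stored field; otherwise it returns
// nil and false instead of panicking.
func safeSlice(stored []byte, start, end int) ([]byte, bool) {
	if start < 0 || end < start || end > len(stored) {
		return nil, false
	}
	return stored[start:end], true
}

func main() {
	stored := []byte("temp 1")

	// Offsets that fit inside the stored field.
	if frag, ok := safeSlice(stored, 0, 4); ok {
		fmt.Printf("fragment: %q\n", frag)
	}

	// Offsets produced against a longer, filtered text may overrun the
	// stored original; the check lets the caller skip the fragment.
	if _, ok := safeSlice(stored, 4, 12); !ok {
		fmt.Println("offsets outside stored field, skipping highlight")
	}
}
```

A check like this only avoids the panic. It does not make the highlighted fragment correct when the char filter has shifted the offsets.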
Oh, and open to a call tomorrow if you'd prefer it. The old code works well only if the replacement length is no more than 1 byte. |
+ Expand essentially interprets $ signs, so for example $1 represents the text of the first submatch.
+ For example ..
  - Consider the following regex: ([a-z])\s+(\d)
  - For the string "temp 1", the above regex matches: "p 1"
  - Let the replacement be "$1-$2", so the expectation is that "p 1" gets replaced by "p-1".
  - The code before the fix replaces "p 1" with: "$1-$2$1-$2$1-$2"
+ I've tracked this down as a regression that was introduced when I changed the regexp character filter's replacement behavior here - #1351
+ The above change is still necessary for highlighting to work correctly when regexp character filter is used.
+ The HTML character filter which uses the regexp character filter will need to replace every character of the matched sequence with whitespace so that the number of whitespaces equals the length of the HTML tag.
+ Tracking ticket: https://issues.couchbase.com/browse/MB-50002
+ I've tracked this down as a regression that was introduced when I changed the regexp character filter's replacement behavior here - #1351
+ The above change is still necessary for highlighting to work correctly when regexp character filter is used and for replacements that involve '$' variables.
+ The HTML character filter which uses the regexp character filter will need to replace every character of the matched sequence with whitespace so that the number of whitespaces equals the length of the HTML tag.
+ Tracking ticket: https://issues.couchbase.com/browse/MB-50002
single entry of replace and not by a number of times.