Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugs in scenarios of Chinese characters or incorrect using of match.Index #50

Closed
nako-ruru opened this issue May 12, 2022 · 2 comments
Closed

Comments

@nako-ruru
Copy link

nako-ruru commented May 12, 2022

the following codes fails

package main

import (
	"fmt"
	"github.com/dlclark/regexp2"
)

func main()  {
	regex := regexp2.MustCompile("<style", regexp2.IgnoreCase|regexp2.Singleline)
	match, err := regex.FindStringMatch(sample)
	if err != nil {
		panic(err)
	}
	if match != nil {
		t, err := regex.Replace(sample, "xxx", match.Index, -1)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s", t)
	}
}

var sample = "<title>错<style"

if i search some words/regex successfully, and then replace something from match.Index instead of -1, the codes fails.

however, if removed the Chinese character , the codes succeeds.

so, in such scenario, what should beginning index be if I want to replace all and don't want to replace from -1(begining)

@nako-ruru nako-ruru changed the title bugs caused by Chinese characters or incorrect using of match.Index bugs in scenarios of Chinese characters or incorrect using of match.Index May 12, 2022
@dlclark
Copy link
Owner

dlclark commented May 12, 2022

Thanks for finding this!

This is because internally the matching engine always uses a rune slice and match.Index is the index into that rune slice, which for anything after a multi-byte rune will not be the same as the index into the original string.

If the user passes in a rune slice (e.g. FindRunesMatch) then indexing from the runes makes sense, but if the user passes in a string (e.g. FindStringMatch) then this is confusing. I think I should make a change to sort out this confusion and make the API easier to use with only strings.

I believe the best backward-compatible fix would be to keep match.Index as the index into the rune slice and add a match.StringIndex as the index of the original string. I'd also clean up the documentation associated with this to make it more clear what the Index field actually is.

The Replace function also could use a ReplaceRunes version for compatibility with the FindRunesMatch and if someone needs to call it repeatedly they could reduce the number of string to []rune conversions.

For now, to work around this issue you'll need to map the runes to the string index yourself:

func getRunesAndMap(in string) ([]rune, map[int]int) {
	ret := make([]rune, len(in))
	idxMap := make(map[int]int)

	i := 0
	for strIdx, r := range in {
		idxMap[i] = strIdx
		ret[i] = r
		i++
	}
	return ret[:i], idxMap
}

func TestReplaceMultiByte(t *testing.T) {
	var sample = "<title>错<style"
	var runes, idxMap = getRunesAndMap(sample)

	regex := MustCompile("<style", IgnoreCase|Singleline)
	match, err := regex.FindRunesMatch(runes)
	if err != nil {
		panic(err)
	}
	if match != nil {
		t, err := regex.Replace(sample, "xxx", idxMap[match.Index], -1)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s", t)
	}
}

@nako-ruru
Copy link
Author

I find another solution.
I think I can do some extra initialization when MatchEvaluator of ReplaceFunc is called first time, in this way I can do replacements from -1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants