Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture.Length undefined, making FindAllString hard to implement #35

Closed
slimsag opened this issue Oct 11, 2020 · 2 comments
Closed

Capture.Length undefined, making FindAllString hard to implement #35

slimsag opened this issue Oct 11, 2020 · 2 comments

Comments

@slimsag
Copy link

slimsag commented Oct 11, 2020

Hello! First, thanks for this great library - this is an impressive feat!

I needed an equivalent function for https://golang.org/pkg/regexp/#Regexp.FindAllString which ideally would be a part of this library, but unfortunately doesn't exist today. I took a stab at implementing it (without the n parameter):

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	for {
		match, _ := re.FindStringMatch(s)
		if match == nil {
			break
		} else {
			matches = append(matches, match.String())
			s = s[match.Index+match.Length:]
		}
	}
	return matches
}

At first glance, this seemed correct and appeared to work - however I realized that it in fact is incompatible with unicode because match.Length appears to report length in runes not bytes. I'm not sure whether or not Capture.Index reports bytes or runes either, and the docs don't define this:

    // the position in the original string where the first character of
    // captured substring was found.
    Index int
    // the length of the captured substring.
    Length int

From testing, it appears that Capture.Index oddly is in bytes and not runes. A corrected implementation is:

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	for {
		match, _ := re.FindStringMatch(s)
		if match == nil {
			break
		} else {
			matches = append(matches, match.String())
-			s = s[match.Index+match.Length:]
+			s = s[match.Index+len(match.String()):]
		}
	}
	return matches
}

This brings me to my points of feedback:

  1. Index in bytes and Length in runes is an odd inconsistency, I imagine they should be the same.
  2. The docstrings should ideally clarify this.
  3. It would be great if the library exposed a FindAllString implementation

Thanks again for the great library!

@slimsag
Copy link
Author

slimsag commented Oct 11, 2020

OK, it turns out I was wrong and both Index and Length are in runes. They are consistent. So my only feedback is points 2 & 3 above 😃

Here is the final working regexp2FindAllString implementation for anyone else who may be looking for one:

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var (
		matches []string
		r = []rune(s)
	)
	for {
		match, _ := re.FindStringMatch(string(r))
		if match == nil {
			break
		} else {
			matches = append(matches, match.String())
			r = r[match.Index+match.Length:]
		}
	}
	return matches
}

(I would send this as a PR, but I realize it is incomplete without the n parameter and proper tests)

@dlclark
Copy link
Owner

dlclark commented Oct 12, 2020

To accomplish this you'll want to use FindNextMatch since it'll re-use the converted []rune for you instead of re-generating a new []rune each time.

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	m, _ := re.FindStringMatch(s)
	for m != nil {
		matches = append(matches, m.String())
		m, _ = re.FindNextMatch(m)
	}
	return matches
}

I'm going to close this issue since this looks like a pretty simple function to implement. However, I will update the README.md to give an example usage of FindNextMatch.

@dlclark dlclark closed this as completed Oct 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants