Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check subject UTF8 validity just once for String#gsub, #scan, #split #13406

Merged

Conversation

HertzDevil
Copy link
Contributor

As of #13313, regex matching uses neither NO_UTF_CHECK nor MATCH_INVALID_UTF, which means the subject string is checked for UTF-8 validity once per match. String#gsub, #scan, and #split are special in that the regex doesn't change and is expected to be matched many times, so for these methods we should only need to do this check exactly once; if the first match does succeed, we know that the subject string is validly encoded and we can apply NO_UTF_CHECK in subsequent matches, regardless of whether other optimizations are in place.

Benchmarks:

require "benchmark"

class String
  def scan_new(pattern : Regex, &) : self
    # this PR's implementation
  end
end

data = "a" * 10000
count = 0

Benchmark.ips do |b|
  b.report("old") do
    count = 0
    data.scan(/a/) { count += 1 }
  end

  b.report("new") do
    count = 0
    data.scan_new(/a/) { count += 1 }
  end
end
old  72.43  ( 13.81ms) (± 0.66%)  313kB/op  38.50× slower
new   2.79k (358.60µs) (± 0.94%)  313kB/op        fastest

If #13353 is merged then options |= :no_utf_check would apply to the parameters of the respective methods themselves.

@HertzDevil
Copy link
Contributor Author

Another candidate is StringScanner whose subject string is supplied in the constructor and also never changes afterwards.

Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good improvement, but maybe (hopefully?) only termporary. It will become obsolete if/when #12020 gets implemented.

@straight-shoota straight-shoota added this to the 1.9.0 milestone Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants