Improve perf of `String::char_len` and `String::is_valid_encoding` #1491

lopopolo · 2021-11-13T00:47:26Z

Introduce bytecount and simdutf8 for SIMD-accelerated UTF-8 validation and character counting.

Performance

I've tested this PR out on the tip of the WIP PR to add String to the VM #1222. These numbers compare with and without this PR as well as against MRI 3.0.2.

This PR speeds up the existing implementation of String::length by a factor of 10 and brings String#length performance to the same order of magnitude as MRI in all cases except pure ASCII strings. MRI caches which valid encodings a String has. For ASCII Strings this fast path is very fast: it returns the length embedded in the RString.

Benchmark script `string_length.rb`

# frozen_string_literal: true

def bench(s, name:)
  s = s * 100_000
  before = Time.now
  result = s.length
  after = Time.now

  puts "#{name}; length: #{result}; time: #{after - before}"
end

s = "a" * 10_000
bench(s, name: 'UTF-8 only ascii')

t = s + '💎' * 10_000 + s
bench(t, name: 'UTF-8 with emojis')

u = 'a💎' * 10_000
bench(u, name: 'ascii + emoji interspersed')

v = 'a' * 10_000
v << '\xFF'
bench(v, name: 'invalid UTF-8 at tail')

w = s + '💎' * 10_000 + s
w << '\xFF'
bench(w, name: 'invalid UTF-8 at tail with emojis')

MRI 3.0.2

$ /usr/local/var/rbenv/versions/3.0.2/bin/ruby ../string_length.rb
UTF-8 only ascii; length: 1000000000; time: 2.0e-06
UTF-8 with emojis; length: 3000000000; time: 0.768025
ascii + emoji interspersed; length: 2000000000; time: 0.614303
invalid UTF-8 at tail; length: 1000400000; time: 1.0e-06
invalid UTF-8 at tail with emojis; length: 3000400000; time: 0.746079

$ /usr/local/var/rbenv/versions/3.0.2/bin/ruby ../string_length.rb
UTF-8 only ascii; length: 1000000000; time: 1.0e-06
UTF-8 with emojis; length: 3000000000; time: 0.739436
ascii + emoji interspersed; length: 2000000000; time: 0.624905
invalid UTF-8 at tail; length: 1000400000; time: 2.0e-06
invalid UTF-8 at tail with emojis; length: 3000400000; time: 0.75076

Without this PR

$ cargo run --bin artichoke --release -q -- ../string_length.rb
UTF-8 only ascii; length: 1000000000; time: 0.729179
UTF-8 with emojis; length: 3000000000; time: 7.913318
ascii + emoji interspersed; length: 2000000000; time: 7.228271
invalid UTF-8 at tail; length: 1000400000; time: 0.719593
invalid UTF-8 at tail with emojis; length: 3000400000; time: 7.944082

$ cargo run --bin artichoke --release -q -- ../string_length.rb
UTF-8 only ascii; length: 1000000000; time: 0.731228
UTF-8 with emojis; length: 3000000000; time: 7.867516
ascii + emoji interspersed; length: 2000000000; time: 7.173596
invalid UTF-8 at tail; length: 1000400000; time: 0.712632
invalid UTF-8 at tail with emojis; length: 3000400000; time: 7.86214

With this PR

$ cargo run --bin artichoke --release -q -- ../string_length.rb
UTF-8 only ascii; length: 1000000000; time: 0.06711499999999999
UTF-8 with emojis; length: 3000000000; time: 0.899454
ascii + emoji interspersed; length: 2000000000; time: 0.763296
invalid UTF-8 at tail; length: 1000400000; time: 0.064452
invalid UTF-8 at tail with emojis; length: 3000400000; time: 0.929182

Introduce `bytecount` and `simdutf8` for SIMD-accelerated UTF-8 validation and character counting.

lopopolo · 2021-11-13T05:49:38Z

with native target CPU:

$ RUSTFLAGS='-C target-cpu=native' cargo run --bin artichoke --release -q -- ../string_length.rb
UTF-8 only ascii; length: 1000000000; time: 0.064621
UTF-8 with emojis; length: 3000000000; time: 0.832497
ascii + emoji interspersed; length: 2000000000; time: 0.758753
invalid UTF-8 at tail; length: 1000400000; time: 0.063723
invalid UTF-8 at tail with emojis; length: 3000400000; time: 0.901933

@ekroon

Implement `String` in Rust **Implement the `String` class from Ruby core in Rust with `spinoso-string` in `artichoke-backend`.** 💎 🎉 💎 🎉 💎 [All 144 APIs available on `String`][ruby-core-string] in MRI 3.0.2 are either implemented or stubbed to raise `NotImplementedError`. [ruby-core-string]: https://ruby-doc.org/core-3.0.2/String.html ## Design Artichoke's `String` is based on [`spinoso_string::String`[spinoso-string-doc]. `String` is a byte buffer and associated encoding. `String` supports UTF-8, ASCII, and binary encodings. Encodings are conventional (a UTF-8 `String` may contain invalid UTF-8 byte sequences, an ASCII `String` may contain bytes greater than `0x7F`). Character oriented APIs like `String::center`, `String::get_char` and `String::char_len` are implemented according to the `String`'s conventional encoding. Unlike `String` in MRI, Artichoke's `String` does not cache whether a `String`'s encoding is valid, nor does it cache whether the `String` is `ascii_only`. These enhancements are expected to come in a future set of changes. I've [discussed how I would like to restructure `String`'s internals][twitter-string-nirvana] to make this happen on Twitter `spinoso-string` uses state of the art Rust dependencies for encoding-oriented operations with first-class SIMD implementations. `spinoso-string` is built on top of `bstr` (ASCII and UTF-8 decoding, `find`-oriented APIs), `simdutf8` (UTF-8 validity checks), and `bytecount` (UTF-8 character length). [spinoso-string-doc]: https://artichoke.github.io/artichoke/spinoso_string/struct.String.html [twitter-string-nirvana]: https://twitter.com/artichokeruby/status/1464117285002440707 ### mruby FFI Compatibility Much like earlier work on `Array`, the `MRB_API` functions for strings are used heavily in the VM in addition to being exposed to Ruby code. The mruby VM treats strings as memory managed byte buffers and uses them extensively in its internals. It is not sufficient to `#ifdef` out the C API since it is used outside of the `string.c` compilation unit. Artichoke includes several reimplementations of mruby C APIs, mostly in Rust and some in the `mruby-sys` C extension. During the development of these FFI compat functions, I encountered 1 segfault and 1 buffer overrun, both documented in the commits that address them: - 7a13a30 - 279c698 See https://twitter.com/artichokeruby/status/1463024367948812293 for some more color around these bugs. ## API Coverage Artichoke implements the following String APIs in mostly native code with some pure Ruby implementations and the `mruby-pack` mrbgem: ### Native Rust Implementations - `String#*` - `String#+` - `String#<<` - `String#<=>` - `String#==` - `String#[]` - `String#[]=` (stubbed, raises `NotImplementedError`) - `String#ascii_only?` - `String#b` - `String#bytes` - `String#bytesize` - `String#byteslice` - `String#capitalize` - `String#capitalize!` - `String#casecmp` - `String#casecmp?` - `String#center` - `String#chars` - `String#chomp` - `String#chomp!` - `String#chop` - `String#chop!` - `String#chr` - `String#clear` - `String#codepoints` - `String#concat` (stubbed, raises `NotImplementedError`) - `String#downcase` - `String#downcase!` - `String#empty?` - `String#eql?` - `String#getbyte` - `String#hash` - `String#include?` - `String#index` (only supports `String` patterns) - `String#initialize` - `String#initialize_copy` - `String#inspect` - `String#intern` - `String#length` - `String#ord` - `String#replace` - `String#reverse` - `String#reverse!` - `String#rindex` (only supports `String` patterns) - `String#scan` - `String#setbyte` - `String#size` - `String#slice` - `String#slice!` (stubbed, raises `NotImplementedError`) - `String#to_f` (stubbed, raises `NotImplementedError`) - `String#to_i` - `String#to_s` - `String#to_sym` - `String#upcase` - `String#upcase!` - `String#valid_encoding?` ### Pure Ruby Implementations - `String::try_convert` - `String#%` - `String#+@` - `String#-@` - `String#/` - `String#===` - `String#=~` - `String#count` (stubbed, raises `NotImplementedError`) - `String#crypt` (stubbed, raises `NotImplementedError`, not intended to be implemented for security considerations) - `String#delete` - `String#delete!` - `String#delete_prefix` - `String#delete_suffix` - `String#delete_suffix!` - `String#dump` (stubbed, raises `NotImplementedError`) - `String#each_byte` - `String#each_char` - `String#each_codepoint` - `String#each_grapheme_cluster` (stubbed, raises `NotImplementedError`) - `String#each_line` - `String#encode` (stubbed, pending completion of #183) - `String#encode!` (stubbed, pending completion of #183) - `String#encoding` (stubbed, pending completion of #183) - `String#end_with?` - `String#force_encoding` (stubbed, pending completion of #183) - `String#grapheme_clusters` (stubbed, raises `NotImplementedError`) - `String#gsub` - `String#gsub!` - `String#hex` (stubbed, raises `NotImplementedError`) - `String#insert` - `String#lines` - `String#ljust` - `String#lstrip` - `String#lstrip!` - `String#match` - `String#match?` - `String#next` (stubbed, raises `NotImplementedError`) - `String#next!` (stubbed, raises `NotImplementedError`) - `String#oct` (stubbed, raises `NotImplementedError`) - `String#partition` - `String#prepend` - `String#rjust` - `String#rpartition` (stubbed, raises `NotImplementedError`) - `String#rstrip` - `String#rstrip!` - `String#scrub` (stubbed, raises `NotImplementedError`, this should be implemented in native code) - `String#scrub!` (stubbed, raises `NotImplementedError`, this should be implemented in native code) - `String#split` (partial implementation, buggy, does not support block arguments) - `String#squeeze` (partial implementation, does not support character set arguments) - `String#squeeze!` (partial implementation, does not support character set arguments) - `String#start_with?` - `String#strip` - `String#strip!` - `String#sub` - `String#sub!` - `String#succ` (stubbed, raises `NotImplementedError`) - `String#succ!` (stubbed, raises `NotImplementedError`) - `String#sum` (stubbed, raises `NotImplementedError`) - `String#swapcase` (stubbed, raises `NotImplementedError`) - `String#swapcase!` (stubbed, raises `NotImplementedError`) - `String#to_a` - `String#to_c` (stubbed, raises `NotImplementedError`, Artichoke does not support `Complex` numbers) - `String#to_r` (stubbed, raises `NotImplementedError`, Artichoke does not support `Rational` numbers) - `String#to_str` - `String#tr` - `String#tr!` - `String#tr_s` - `String#tr_s!` - `String#undump` (stubbed, raises `NotImplementedError`) - `String#unicode_normalize` (stubbed, raises `NotImplementedError`) - `String#unicode_normalize!` (stubbed, raises `NotImplementedError`) - `String#unicode_normalized?` (stubbed, raises `NotImplementedError`) - `String#upto` ### mruby mrbgem C extension - `String#unpack` - `String#unpack1` ## ruby/spec compliance This PR improves ruby/spec compliance for Artichoke 📈 🚀 On current trunk, when running `rake spec`, Artichoke has the following stats: ``` Passed 1837, skipped 238, not implemented 14, failed 0 specs. ``` On the tip of #1222, when running `rake spec`, Artichoke has the following stats: ``` Passed 1845, skipped 238, not implemented 14, failed 0 specs. ``` This PR additionally allows for `SecureRandom#random_bytes` specs to pass since the returned `String` can properly declare their binary encoding. Previously, the returned `String`s had UTF-8 encoding which meant that if the random bytes happened to contain a multi-byte UTF-8 character byte sequence, the reported length would be less than `bytesize`. See: - fac36b9 ## Future Work - Implement `String#[]=`. - Add fast paths when `String`s of any encoding are valid per their encoding. - Add fast paths when `String`s of any encoding are ASCII only. - Remove dependency on `mruby-pack`. - Expose `Encoding` in Ruby, see #183. - Use encoding-aware `String` constructors in native APIs where possible/required per ruby/spec. ## Linked Issues Closes #184. Partially addresses #183. These PRs were foundational to getting this PR to a mergeable state: - #1012, thanks @ekroon - #1045 - #1047 - #1048 - #1049 - #1051 - #1088 - #1141 - #1271 - #1348 - #1349 - #1353 - #1357 - #1402 - #1403 - #1449 - #1450 (comment) and #1510, thanks @rurban - #1451 - #1488, thanks @masonforest - #1489 - #1490 - #1491 - #1505, thanks @briankung - #1512 - #1513 - #1514 - #1515 - #1516 - #1517 - #1519

Improve perf of String::char_len and String::is_valid_encoding

a188a2c

Introduce `bytecount` and `simdutf8` for SIMD-accelerated UTF-8 validation and character counting.

lopopolo added A-ruby-core Area: Ruby Core types. A-performance Area: Performance improvements and optimizations. A-deps Area: Source and library dependencies. labels Nov 13, 2021

lopopolo merged commit 0ac710e into trunk Nov 13, 2021

lopopolo deleted the lopopolo/speed-up-string-length branch November 13, 2021 01:19

lopopolo mentioned this pull request Nov 27, 2021

Implement String in Rust #1222

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve perf of `String::char_len` and `String::is_valid_encoding` #1491

Improve perf of `String::char_len` and `String::is_valid_encoding` #1491

lopopolo commented Nov 13, 2021

lopopolo commented Nov 13, 2021

Improve perf of String::char_len and String::is_valid_encoding #1491

Improve perf of String::char_len and String::is_valid_encoding #1491

Conversation

lopopolo commented Nov 13, 2021

Performance

Benchmark script string_length.rb

MRI 3.0.2

Without this PR

With this PR

lopopolo commented Nov 13, 2021

Improve perf of `String::char_len` and `String::is_valid_encoding` #1491

Improve perf of `String::char_len` and `String::is_valid_encoding` #1491

Benchmark script `string_length.rb`