New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the beginnings of the spinoso-string crate #1045
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This commit adds a WIP spinoso-string crate, which will be used to implement the `String` Ruby Core class in Artichoke. Spinoso `String` is modeled as a `Vec<u8>` with an associated `Encoding`. The `Encoding` is conventional only, but does influence several read-oriented APIs, like `String#ord` and `String#chr` as well as new constructors like `String#center`. Because a Spinoso `String` is meant to be packed into an mruby `RArray`, it can be converted to raw pointer. `Encoding` can be serialize to a `u8` bit flag. Spinoso `String` intentionally only supports three Ruby encodings: - UTF-8 - ASCII - Binary / ASCII-8BIT For UTF-8 support, `spinoso-string` makes use of `bstr`.
Linking to #184. |
lopopolo
added a commit
that referenced
this pull request
Nov 27, 2021
Implement `String` in Rust **Implement the `String` class from Ruby core in Rust with `spinoso-string` in `artichoke-backend`.** 💎 🎉 💎 🎉 💎 [All 144 APIs available on `String`][ruby-core-string] in MRI 3.0.2 are either implemented or stubbed to raise `NotImplementedError`. [ruby-core-string]: https://ruby-doc.org/core-3.0.2/String.html ## Design Artichoke's `String` is based on [`spinoso_string::String`[spinoso-string-doc]. `String` is a byte buffer and associated encoding. `String` supports UTF-8, ASCII, and binary encodings. Encodings are conventional (a UTF-8 `String` may contain invalid UTF-8 byte sequences, an ASCII `String` may contain bytes greater than `0x7F`). Character oriented APIs like `String::center`, `String::get_char` and `String::char_len` are implemented according to the `String`'s conventional encoding. Unlike `String` in MRI, Artichoke's `String` does not cache whether a `String`'s encoding is valid, nor does it cache whether the `String` is `ascii_only`. These enhancements are expected to come in a future set of changes. I've [discussed how I would like to restructure `String`'s internals][twitter-string-nirvana] to make this happen on Twitter `spinoso-string` uses state of the art Rust dependencies for encoding-oriented operations with first-class SIMD implementations. `spinoso-string` is built on top of `bstr` (ASCII and UTF-8 decoding, `find`-oriented APIs), `simdutf8` (UTF-8 validity checks), and `bytecount` (UTF-8 character length). [spinoso-string-doc]: https://artichoke.github.io/artichoke/spinoso_string/struct.String.html [twitter-string-nirvana]: https://twitter.com/artichokeruby/status/1464117285002440707 ### mruby FFI Compatibility Much like earlier work on `Array`, the `MRB_API` functions for strings are used heavily in the VM in addition to being exposed to Ruby code. The mruby VM treats strings as memory managed byte buffers and uses them extensively in its internals. It is not sufficient to `#ifdef` out the C API since it is used outside of the `string.c` compilation unit. Artichoke includes several reimplementations of mruby C APIs, mostly in Rust and some in the `mruby-sys` C extension. During the development of these FFI compat functions, I encountered 1 segfault and 1 buffer overrun, both documented in the commits that address them: - 7a13a30 - 279c698 See https://twitter.com/artichokeruby/status/1463024367948812293 for some more color around these bugs. ## API Coverage Artichoke implements the following String APIs in mostly native code with some pure Ruby implementations and the `mruby-pack` mrbgem: ### Native Rust Implementations - `String#*` - `String#+` - `String#<<` - `String#<=>` - `String#==` - `String#[]` - `String#[]=` (stubbed, raises `NotImplementedError`) - `String#ascii_only?` - `String#b` - `String#bytes` - `String#bytesize` - `String#byteslice` - `String#capitalize` - `String#capitalize!` - `String#casecmp` - `String#casecmp?` - `String#center` - `String#chars` - `String#chomp` - `String#chomp!` - `String#chop` - `String#chop!` - `String#chr` - `String#clear` - `String#codepoints` - `String#concat` (stubbed, raises `NotImplementedError`) - `String#downcase` - `String#downcase!` - `String#empty?` - `String#eql?` - `String#getbyte` - `String#hash` - `String#include?` - `String#index` (only supports `String` patterns) - `String#initialize` - `String#initialize_copy` - `String#inspect` - `String#intern` - `String#length` - `String#ord` - `String#replace` - `String#reverse` - `String#reverse!` - `String#rindex` (only supports `String` patterns) - `String#scan` - `String#setbyte` - `String#size` - `String#slice` - `String#slice!` (stubbed, raises `NotImplementedError`) - `String#to_f` (stubbed, raises `NotImplementedError`) - `String#to_i` - `String#to_s` - `String#to_sym` - `String#upcase` - `String#upcase!` - `String#valid_encoding?` ### Pure Ruby Implementations - `String::try_convert` - `String#%` - `String#+@` - `String#-@` - `String#/` - `String#===` - `String#=~` - `String#count` (stubbed, raises `NotImplementedError`) - `String#crypt` (stubbed, raises `NotImplementedError`, not intended to be implemented for security considerations) - `String#delete` - `String#delete!` - `String#delete_prefix` - `String#delete_suffix` - `String#delete_suffix!` - `String#dump` (stubbed, raises `NotImplementedError`) - `String#each_byte` - `String#each_char` - `String#each_codepoint` - `String#each_grapheme_cluster` (stubbed, raises `NotImplementedError`) - `String#each_line` - `String#encode` (stubbed, pending completion of #183) - `String#encode!` (stubbed, pending completion of #183) - `String#encoding` (stubbed, pending completion of #183) - `String#end_with?` - `String#force_encoding` (stubbed, pending completion of #183) - `String#grapheme_clusters` (stubbed, raises `NotImplementedError`) - `String#gsub` - `String#gsub!` - `String#hex` (stubbed, raises `NotImplementedError`) - `String#insert` - `String#lines` - `String#ljust` - `String#lstrip` - `String#lstrip!` - `String#match` - `String#match?` - `String#next` (stubbed, raises `NotImplementedError`) - `String#next!` (stubbed, raises `NotImplementedError`) - `String#oct` (stubbed, raises `NotImplementedError`) - `String#partition` - `String#prepend` - `String#rjust` - `String#rpartition` (stubbed, raises `NotImplementedError`) - `String#rstrip` - `String#rstrip!` - `String#scrub` (stubbed, raises `NotImplementedError`, this should be implemented in native code) - `String#scrub!` (stubbed, raises `NotImplementedError`, this should be implemented in native code) - `String#split` (partial implementation, buggy, does not support block arguments) - `String#squeeze` (partial implementation, does not support character set arguments) - `String#squeeze!` (partial implementation, does not support character set arguments) - `String#start_with?` - `String#strip` - `String#strip!` - `String#sub` - `String#sub!` - `String#succ` (stubbed, raises `NotImplementedError`) - `String#succ!` (stubbed, raises `NotImplementedError`) - `String#sum` (stubbed, raises `NotImplementedError`) - `String#swapcase` (stubbed, raises `NotImplementedError`) - `String#swapcase!` (stubbed, raises `NotImplementedError`) - `String#to_a` - `String#to_c` (stubbed, raises `NotImplementedError`, Artichoke does not support `Complex` numbers) - `String#to_r` (stubbed, raises `NotImplementedError`, Artichoke does not support `Rational` numbers) - `String#to_str` - `String#tr` - `String#tr!` - `String#tr_s` - `String#tr_s!` - `String#undump` (stubbed, raises `NotImplementedError`) - `String#unicode_normalize` (stubbed, raises `NotImplementedError`) - `String#unicode_normalize!` (stubbed, raises `NotImplementedError`) - `String#unicode_normalized?` (stubbed, raises `NotImplementedError`) - `String#upto` ### mruby mrbgem C extension - `String#unpack` - `String#unpack1` ## ruby/spec compliance This PR improves ruby/spec compliance for Artichoke 📈 🚀 On current trunk, when running `rake spec`, Artichoke has the following stats: ``` Passed 1837, skipped 238, not implemented 14, failed 0 specs. ``` On the tip of #1222, when running `rake spec`, Artichoke has the following stats: ``` Passed 1845, skipped 238, not implemented 14, failed 0 specs. ``` This PR additionally allows for `SecureRandom#random_bytes` specs to pass since the returned `String` can properly declare their binary encoding. Previously, the returned `String`s had UTF-8 encoding which meant that if the random bytes happened to contain a multi-byte UTF-8 character byte sequence, the reported length would be less than `bytesize`. See: - fac36b9 ## Future Work - Implement `String#[]=`. - Add fast paths when `String`s of any encoding are valid per their encoding. - Add fast paths when `String`s of any encoding are ASCII only. - Remove dependency on `mruby-pack`. - Expose `Encoding` in Ruby, see #183. - Use encoding-aware `String` constructors in native APIs where possible/required per ruby/spec. ## Linked Issues Closes #184. Partially addresses #183. These PRs were foundational to getting this PR to a mergeable state: - #1012, thanks @ekroon - #1045 - #1047 - #1048 - #1049 - #1051 - #1088 - #1141 - #1271 - #1348 - #1349 - #1353 - #1357 - #1402 - #1403 - #1449 - #1450 (comment) and #1510, thanks @rurban - #1451 - #1488, thanks @masonforest - #1489 - #1490 - #1491 - #1505, thanks @briankung - #1512 - #1513 - #1514 - #1515 - #1516 - #1517 - #1519
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This commit adds a WIP spinoso-string crate, which will be used to
implement the
String
Ruby Core class in Artichoke.Spinoso
String
is modeled as aVec<u8>
with an associatedEncoding
. TheEncoding
is conventional only, but does influenceseveral read-oriented APIs, like
String#ord
andString#chr
as wellas new constructors like
String#center
.Because a Spinoso
String
is meant to be packed into an mrubyRArray
,it can be converted to raw pointer.
Encoding
can be serialize to au8
bit flag.Spinoso
String
intentionally only supports three Ruby encodings:For UTF-8 support,
spinoso-string
makes use ofbstr
.