-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split UTF-8 string variant into owned and borrowed types #1909
Conversation
2c65395
to
b9c32ea
Compare
17fcf5a
to
b9de856
Compare
b9de856
to
4481cb8
Compare
spinoso-string
@b-n tagging you as a reviewer here since you're familiar with the |
If this change sounds like a good idea, I'll follow it up with the same refactor for the Binary and ASCII encoding variants. |
4481cb8
to
3132f14
Compare
`spinoso-string` implements the various encodings it supports with one type per encoding. For the UTF-8 encoding, the type is `Utf8String`. `Utf8String` is conceptually similar to `Vec<u8>` and `String` in `std`. Both of these types deref to a slice reference counterpart (`&[u8]` and `&str`). All methods which do not require growing the backing vector (or dynamic allocation more generally) are implemented on the slice ref part of the type pair. All of the encoding variants in `spinoso-string` do not have this flexible API. This presents challenges when wanting to perform encoding-oriented operations on a borrowed byte slice. A recent example of this was the refactoring required to get UTF-8 char len calculation shared among various parts of the code: - 67726d1 - #2554 This commit experiments with breaking apart `Utf8String` into an owned and borrowed pair of types: `Utf8String` and `&Utf8Str`/`&mut Utf8Str`. The borrowed types are heavily inspired by `bstr::BStr`. The benefits of the reference types that associate a byte slice to an encoding mostly arise from avoiding the need to allocate an owned `Utf8String`. I've long viewed a `String`'s encoding as a "view" or "cursor" over the underlying byte content. Being able to interrogate an arbitrary byte slice via a specific encoding makes the API much more powerful. Eventually I would like to extract APIs like `Utf8String`/`Utf8Str` to their own crates in the workspace. The code is sufficiently complicated and there is _a lot_ of it. Doing so would allow exposing the various encodings in a `no-std::no-alloc` configuration. This commit contains one breaking change which led to the version bump in `spinoso-string` (removed `impl Extend<&'a mut u8>`), but surprisingly little of the crate had to change outside of the UTF-8 internals. I still consider this approach experimental and am looking for feedback on the design. One WIP in the design: ASCII and Binary encoded strings will be able to implement `reverse()` on their reference type. Due to the complexity in a character-wise reversal of the conventionally UTF-8 byte contents in `Utf8String`, allocation is required so this API still lives on `Utf8String`. A benefit of the borrowed type design is (at least for me) it makes it more obvious how a crate like `encoding_rs` could be used to support many more encodings than Artichoke does today. As an aside, sorry for the massive commit. I ended up working on #2589 _after_ I made these changes and I couldn't resolve the merge conflicts.
3132f14
to
8c1dd04
Compare
Expanding on this a bit, fully committing to using custom slice types to implement a view into the underlying The Onigmo encodings are basically a set of function pointers (aka V tables) which are very similar conceptually to wrapping the raw byte slice in something like This means that pub struct String {
encoding: Encoding,
buf: Buf
}
|
Maybe a paraphrase, but thinking conceptually:
I kinda like it, and it seems way more scalable. e.g. trait Encoding {
fn char_len(buf: &[u8]) -> usize {}
fn get_char(buf: &[u8], index: usize) -> Option<&'_ [u8]> {}
} Whether or not you use a trait or not is irrelevant since as you said you could do it with vtables (I don't know how this would look in rust tbh). That would make a huge bonus when it comes to adding support for new encodings, it's essentially just add the right encoding aware functions and you're done. Right now it would require mass edits to https://github.com/artichoke/artichoke/blob/trunk/spinoso-string/src/enc/mod.rs which will eventually hurt hah. |
Yea something along these lines, but I was actually thinking something like this to match the trait Encoding {
fn char_len(&self) -> usize {}
fn get_char(&self, index: usize) -> Option<&'_ [u8]> {}
}
impl Encoding for Utf8Str {
todo!();
}
#[derive(Clone, Copy)]
enum EncodingType {
Utf8,
Ascii,
Binary,
}
impl EncodingType {
fn to_view(self, bytes: &[u8]) -> &dyn Encoding {
match self {
Self::Utf8 => Utf8Str::new(bytes),
_ => todo!(),
}
}
fn to_view_mut(self, bytes: &mut [u8]) -> &dyn mut Encoding {
match self {
Self::Utf8 => Utf8Str::new_mut(bytes),
_ => todo!(),
}
}
}
struct String {
encoding: EncodingType,
bytes: Buf,
}
impl String {
pub fn char_len(&self) -> usize {
let bytes = self.buf.as_slice();
let view = self.encoding.to_view(bytes);
view.char_len()
}
} |
I had a quick play to see if there was a way to make it play with static functions (for some reason in my head it felt nicer than having to initialize the underlying encoding types - even though the compiler is likely optimizing these away). The closest I got was: type Buf = Vec<u8>;
trait Encoding {
fn char_len(buf: &[u8]) -> usize;
fn get_char(buf: &[u8], index: usize) -> Option<&'_ [u8]>;
}
struct Utf8 {}
impl Encoding for Utf8 {
fn char_len(buf: &[u8]) -> usize {
todo!();
}
fn get_char(buf: &[u8], index: usize) -> Option<&'_ [u8]> {
todo!();
}
}
struct String<E: Encoding> {
encoding: E,
bytes: Buf,
}
impl<E: Encoding> String<E> {
pub fn char_len(&self) -> usize {
let bytes = self.bytes.as_slice();
self.encoding.char_len(bytes)
}
} However,
And any attempts to get around this was looking exactly what you have anyway 😅. (e.g. Wrapping the encoding in an Enum was always an eventuality). So yes, what you're doing makes a whole bunch of sense 😄. |
Quick thought: Having someway to represent |
cool cool, it looks like at least there's buy in for experimenting here. I'll merge this and make roughly the same refactor to Ascii and Binary types. Then we can talk about what to do re: |
spinoso-string
implements the various encodings it supports with one type per encoding. For the UTF-8 encoding, the type isUtf8String
.Utf8String
is conceptually similar toVec<u8>
andString
instd
. Both of these types deref to a slice reference counterpart (&[u8]
and&str
). All methods which do not require growing the backing vector (or dynamic allocation more generally) are implemented on the slice ref part of the type pair.All of the encoding variants in
spinoso-string
do not have this flexible API. This presents challenges when wanting to perform encoding-oriented operations on a borrowed byte slice. A recent example of this was the refactoring required to get UTF-8 char len calculation shared among various parts of the code:String::index
andString::rindex
#2554This PR experiments with breaking apart
Utf8String
into an owned and borrowed pair of types:Utf8String
and&Utf8Str
/&mut Utf8Str
. The borrowed types are heavily inspired bybstr::BStr
.The benefits of the reference types that associate a byte slice to an encoding mostly arise from avoiding the need to allocate an owned
Utf8String
. I've long viewed aString
's encoding as a "view" or "cursor" over the underlying byte content. Being able to interrogate an arbitrary byte slice via a specific encoding makes the API much more powerful.Eventually I would like to extract APIs like
Utf8String
/Utf8Str
to their own crates in the workspace. The code is sufficiently complicated and there is a lot of it. Doing so would allow exposing the various encodings in ano-std::no-alloc
configuration.This PR contains one breaking change which led to the version bump in
spinoso-string
(removedimpl Extend<&'a mut u8>
), but surprisingly little of the crate had to change outside of the UTF-8 internals.I still consider this approach experimental and am looking for feedback on the design.
One WIP in the design: ASCII and Binary encoded strings will be able to implement
reverse()
on their reference type. Due to the complexity in a character-wise reversal of the conventionally UTF-8 byte contents inUtf8String
, allocation is required so this API still lives onUtf8String
.A benefit of the borrowed type design is (at least for me) it makes it more obvious how a crate like
encoding_rs
could be used to support many more encodings than Artichoke does today.As an aside, sorry for the massive commit in this PR. I ended up working on #2589 after I made these changes and I couldn't resolve the merge conflicts.