Split UTF-8 string variant into owned and borrowed types #1909

Merged
merged 4 commits into from
Jun 7, 2023

@lopopolo lopopolo commented Jun 20, 2022

spinoso-string implements the various encodings it supports with one type per encoding. For the UTF-8 encoding, the type is Utf8String.

Utf8String is conceptually similar to Vec<u8> and String in std. Both of these types deref to a slice reference counterpart (&[u8] and &str). All methods which do not require growing the backing vector (or dynamic allocation more generally) are implemented on the slice ref part of the type pair.
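As a small illustration of the `std` pattern described above, the sketch below uses only standard library types. The helper name `count_spaces` is made up for this example; the point is that a read-only operation written against `&[u8]` is reachable from the owned types and from a plain borrowed slice alike.

```rust
/// Works on any borrowed byte slice; no allocation is needed.
/// (`count_spaces` is a hypothetical helper for illustration.)
fn count_spaces(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| b == b' ').count()
}

/// Calls the same slice-based helper from three different sources.
fn demo() -> (usize, usize, usize) {
    let owned_bytes: Vec<u8> = b"a b c".to_vec();
    let owned_str: String = String::from("a b c");

    // `&Vec<u8>` coerces to `&[u8]` through `Deref`.
    let from_vec = count_spaces(&owned_bytes);
    // `String` derefs to `str`, and `str::as_bytes` yields `&[u8]`.
    let from_string = count_spaces(owned_str.as_bytes());
    // A borrowed literal works too, with no owned value involved.
    let from_literal = count_spaces(b"a b c");

    (from_vec, from_string, from_literal)
}
```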

None of the encoding variants in spinoso-string have this flexible API. This presents challenges when performing encoding-oriented operations on a borrowed byte slice. A recent example of this was the refactoring required to get UTF-8 char len calculation shared among various parts of the code:

  • 67726d1
  • #2554

This PR experiments with breaking apart Utf8String into an owned and borrowed pair of types: Utf8String and &Utf8Str/&mut Utf8Str. The borrowed types are heavily inspired by bstr::BStr.

The benefits of the reference types that associate a byte slice to an encoding mostly arise from avoiding the need to allocate an owned Utf8String. I've long viewed a String's encoding as a "view" or "cursor" over the underlying byte content. Being able to interrogate an arbitrary byte slice via a specific encoding makes the API much more powerful.
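A minimal sketch of what such an owned/borrowed pair could look like, loosely following the `repr(transparent)` approach that `bstr::BStr` takes. This is not the code in this PR: the method set, the `From<Vec<u8>>` impl, and the policy of counting each byte of an invalid sequence as one character are assumptions made for the example.

```rust
use std::ops::Deref;

/// Borrowed view: a byte slice treated as conventionally UTF-8.
#[repr(transparent)]
pub struct Utf8Str([u8]);

impl Utf8Str {
    /// Wrap a borrowed byte slice without copying or allocating.
    pub fn new(bytes: &[u8]) -> &Utf8Str {
        // SAFETY: `Utf8Str` is `repr(transparent)` over `[u8]`, so both
        // types share a layout and the pointer cast is sound.
        unsafe { &*(bytes as *const [u8] as *const Utf8Str) }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }

    /// Count characters. Assumption of this sketch: every byte of an
    /// invalid sequence counts as one character.
    pub fn char_len(&self) -> usize {
        let mut rest: &[u8] = &self.0;
        let mut count = 0;
        loop {
            match std::str::from_utf8(rest) {
                Ok(valid) => return count + valid.chars().count(),
                Err(err) => {
                    // The prefix before the error is valid UTF-8.
                    let (valid, tail) = rest.split_at(err.valid_up_to());
                    count += std::str::from_utf8(valid).map_or(0, |s| s.chars().count());
                    // `error_len` is `None` when the input ends mid-sequence.
                    let bad = err.error_len().unwrap_or(tail.len());
                    count += bad;
                    rest = &tail[bad..];
                }
            }
        }
    }
}

/// Owned counterpart; only operations that grow the buffer belong here.
pub struct Utf8String(Vec<u8>);

impl From<Vec<u8>> for Utf8String {
    fn from(bytes: Vec<u8>) -> Self {
        Self(bytes)
    }
}

impl Deref for Utf8String {
    type Target = Utf8Str;

    fn deref(&self) -> &Utf8Str {
        Utf8Str::new(&self.0)
    }
}
```

With this shape, `Utf8Str::new(some_slice).char_len()` works on any borrowed bytes, and the owned type picks up the same method through `Deref`.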

Eventually I would like to extract APIs like Utf8String/Utf8Str to their own crates in the workspace. The code is sufficiently complicated and there is a lot of it. Doing so would allow exposing the various encodings in a no-std::no-alloc configuration.

This PR contains one breaking change which led to the version bump in spinoso-string (removed impl Extend<&'a mut u8>), but surprisingly little of the crate had to change outside of the UTF-8 internals.

I still consider this approach experimental and am looking for feedback on the design.

One WIP in the design: ASCII and Binary encoded strings will be able to implement reverse() on their reference type. Due to the complexity in a character-wise reversal of the conventionally UTF-8 byte contents in Utf8String, allocation is required so this API still lives on Utf8String.
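To make the difference concrete, here is a rough sketch using only `std`. The function names are hypothetical, and the UTF-8 version assumes fully valid input via `&str`; the real `Utf8String` holds conventionally UTF-8 bytes that may contain invalid sequences, which is part of the complexity mentioned above.

```rust
/// Byte-wise reversal: fine for ASCII and binary data, and it works
/// in place on a borrowed `&mut [u8]`.
fn reverse_bytes(bytes: &mut [u8]) {
    bytes.reverse();
}

/// Character-wise reversal of valid UTF-8. This simple version builds
/// a new buffer, because reversing the raw bytes would scramble the
/// byte order inside every multi-byte character.
fn reverse_utf8_chars(s: &str) -> String {
    s.chars().rev().collect()
}
```

Applying `reverse_bytes` to the two bytes of "é" puts the continuation byte first, so the result is no longer valid UTF-8.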

A benefit of the borrowed type design is (at least for me) it makes it more obvious how a crate like encoding_rs could be used to support many more encodings than Artichoke does today.

As an aside, sorry for the massive commit in this PR. I ended up working on #2589 after I made these changes and I couldn't resolve the merge conflicts.

@lopopolo lopopolo added the A-ruby-core, A-performance, S-wip, and S-speculative labels Jun 20, 2022
@lopopolo lopopolo force-pushed the lopopolo/enc-str-slice branch 2 times, most recently from 17fcf5a to b9de856 June 4, 2023 22:22
@lopopolo lopopolo removed the S-wip and S-speculative labels Jun 4, 2023
@lopopolo lopopolo added the S-speculative label Jun 4, 2023
@lopopolo lopopolo changed the title from "[WIP] Experiment with encoding-specific slice types in spinoso-string" to "Split UTF-8 string variant into owned and borrowed types" Jun 4, 2023
@lopopolo
Member Author

lopopolo commented Jun 4, 2023

@b-n tagging you as a reviewer here since you're familiar with the String code and I think such a big change warrants a second pair of eyes.

@lopopolo lopopolo requested a review from b-n June 4, 2023 22:57
@lopopolo
Member Author

lopopolo commented Jun 4, 2023

If this change sounds like a good idea, I'll follow it up with the same refactor for the Binary and ASCII encoding variants.

`spinoso-string` implements the various encodings it supports with one
type per encoding. For the UTF-8 encoding, the type is `Utf8String`.

`Utf8String` is conceptually similar to `Vec<u8>` and `String` in `std`.
Both of these types deref to a slice reference counterpart (`&[u8]` and
`&str`). All methods which do not require growing the backing vector (or
dynamic allocation more generally) are implemented on the slice ref part
of the type pair.

None of the encoding variants in `spinoso-string` have this
flexible API. This presents challenges when performing
encoding-oriented operations on a borrowed byte slice. A recent example
of this was the refactoring required to get UTF-8 char len calculation
shared among various parts of the code:

- 67726d1
- #2554

This commit experiments with breaking apart `Utf8String` into an owned
and borrowed pair of types: `Utf8String` and `&Utf8Str`/`&mut Utf8Str`.
The borrowed types are heavily inspired by `bstr::BStr`.

The benefits of the reference types that associate a byte slice to an
encoding mostly arise from avoiding the need to allocate an owned
`Utf8String`. I've long viewed a `String`'s encoding as a "view" or
"cursor" over the underlying byte content. Being able to interrogate an
arbitrary byte slice via a specific encoding makes the API much more
powerful.

Eventually I would like to extract APIs like `Utf8String`/`Utf8Str` to
their own crates in the workspace. The code is sufficiently complicated
and there is _a lot_ of it. Doing so would allow exposing the various
encodings in a `no-std::no-alloc` configuration.

This commit contains one breaking change which led to the version bump
in `spinoso-string` (removed `impl Extend<&'a mut u8>`), but
surprisingly little of the crate had to change outside of the UTF-8
internals.

I still consider this approach experimental and am looking for feedback
on the design.

One WIP in the design: ASCII and Binary encoded strings will be able to
implement `reverse()` on their reference type. Due to the complexity in
a character-wise reversal of the conventionally UTF-8 byte contents in
`Utf8String`, allocation is required so this API still lives on
`Utf8String`.

A benefit of the borrowed type design is (at least for me) it makes it
more obvious how a crate like `encoding_rs` could be used to support
many more encodings than Artichoke does today.

As an aside, sorry for the massive commit. I ended up working on #2589
_after_ I made these changes and I couldn't resolve the merge conflicts.
@lopopolo
Member Author

lopopolo commented Jun 4, 2023

> A benefit of the borrowed type design is (at least for me) it makes it more obvious how a crate like encoding_rs could be used to support many more encodings than Artichoke does today.

Expanding on this a bit, fully committing to using custom slice types to implement a view into the underlying Buf is basically exactly the same code structure used in MRI with an RString* for holding the bytes and the various encodings in Onigmo for performing character-oriented operations on those bytes.

The Onigmo encodings are basically a set of function pointers (aka vtables), which is conceptually very similar to wrapping the raw byte slice in something like Utf8Str.

This means that Utf8String doesn't really need to be a thing (modulo figuring out how to handle the case changing routines which may change the bytesize of the string). spinoso_string::String will look like this:

pub struct String {
    encoding: Encoding,
    buf: Buf
}

EncodedString goes away and the encoding-aware methods in String will match on self.encoding, wrapping the underlying bytes in the appropriate ref type to swap in the right impl of the method.
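A rough sketch of that dispatch, under several assumptions: `Buf` is stood in for by `Vec<u8>`, the struct is called `EncString` to avoid shadowing `std`'s `String`, and the views are plain wrapper structs holding a borrowed slice (hypothetical `Utf8View`, `AsciiView`, `BinaryView`) rather than the slice types from this PR. The UTF-8 character count is also simplified to counting non-continuation bytes, which is only exact for valid UTF-8.

```rust
type Buf = Vec<u8>; // stand-in for the real buffer type

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Encoding {
    Utf8,
    Ascii,
    Binary,
}

// Hypothetical borrowed views, one per encoding.
struct Utf8View<'a>(&'a [u8]);
struct AsciiView<'a>(&'a [u8]);
struct BinaryView<'a>(&'a [u8]);

impl Utf8View<'_> {
    fn char_len(&self) -> usize {
        // Simplification: count bytes that are not UTF-8 continuation
        // bytes (0b10xx_xxxx). Exact only for valid UTF-8.
        self.0.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }
}

impl AsciiView<'_> {
    fn char_len(&self) -> usize {
        self.0.len()
    }
}

impl BinaryView<'_> {
    fn char_len(&self) -> usize {
        self.0.len()
    }
}

pub struct EncString {
    encoding: Encoding,
    buf: Buf,
}

impl EncString {
    pub fn new(encoding: Encoding, buf: Buf) -> Self {
        Self { encoding, buf }
    }

    /// Encoding-aware method: wrap the bytes in the matching view and
    /// delegate to that view's implementation.
    pub fn char_len(&self) -> usize {
        let bytes = self.buf.as_slice();
        match self.encoding {
            Encoding::Utf8 => Utf8View(bytes).char_len(),
            Encoding::Ascii => AsciiView(bytes).char_len(),
            Encoding::Binary => BinaryView(bytes).char_len(),
        }
    }
}
```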

@b-n
Member

b-n commented Jun 5, 2023

Maybe a paraphrase, but thinking conceptually:

  • A RString is just a series of bytes (null terminated), and an encoding.
  • There are a number of methods which are not encoding aware (e.g. push bytes, read bytes, etc), and thus can access and return the underlying buf (well ref to the buf)
  • There are a number of encoding aware methods, but these can pass a ref of the buf to the relevant encoding aware function, which then can perform encoding aware functionality?

I kinda like it, and it seems way more scalable.

e.g.

trait Encoding {
    fn char_len(buf: &[u8]) -> usize;
    fn get_char(buf: &[u8], index: usize) -> Option<&'_ [u8]>;
}

Whether you use a trait or not is irrelevant since, as you said, you could do it with vtables (I don't know how this would look in Rust tbh).

That would be a huge bonus when it comes to adding support for new encodings: essentially, just add the right encoding-aware functions and you're done. Right now it would require mass edits to https://github.com/artichoke/artichoke/blob/trunk/spinoso-string/src/enc/mod.rs which will eventually hurt hah.
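For what it's worth, a hand-rolled vtable in Rust could look roughly like a struct of function pointers with one static instance per encoding. Everything named here (`EncodingVTable`, `UTF8`, `BINARY`, `VString`) is hypothetical, and the UTF-8 functions are simplified: `utf8_char_len` counts non-continuation bytes and `utf8_get_char` returns `None` for input that is not valid UTF-8.

```rust
/// One function pointer per encoding-aware operation.
pub struct EncodingVTable {
    pub char_len: fn(&[u8]) -> usize,
    pub get_char: for<'a> fn(&'a [u8], usize) -> Option<&'a [u8]>,
}

fn binary_char_len(buf: &[u8]) -> usize {
    buf.len()
}

fn binary_get_char(buf: &[u8], index: usize) -> Option<&[u8]> {
    // Every byte is its own character.
    buf.get(index..=index)
}

fn utf8_char_len(buf: &[u8]) -> usize {
    // Simplification: exact only for valid UTF-8.
    buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

fn utf8_get_char(buf: &[u8], index: usize) -> Option<&[u8]> {
    // Simplification: gives up on input that is not valid UTF-8.
    let s = std::str::from_utf8(buf).ok()?;
    let (start, ch) = s.char_indices().nth(index)?;
    buf.get(start..start + ch.len_utf8())
}

pub static BINARY: EncodingVTable = EncodingVTable {
    char_len: binary_char_len,
    get_char: binary_get_char,
};

pub static UTF8: EncodingVTable = EncodingVTable {
    char_len: utf8_char_len,
    get_char: utf8_get_char,
};

/// A string that carries a pointer to its encoding's vtable.
pub struct VString {
    vtable: &'static EncodingVTable,
    bytes: Vec<u8>,
}

impl VString {
    pub fn new(vtable: &'static EncodingVTable, bytes: Vec<u8>) -> Self {
        Self { vtable, bytes }
    }

    pub fn char_len(&self) -> usize {
        (self.vtable.char_len)(&self.bytes)
    }
}
```

Adding an encoding in this shape means writing its functions and one more static, which is the scalability benefit described above. A trait with `dyn` dispatch gets the compiler to build an equivalent table.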

@lopopolo
Member Author

lopopolo commented Jun 6, 2023

Yea something along these lines, but I was actually thinking something like this to match the Utf8Str slice type in this PR:

trait Encoding {
    fn char_len(&self) -> usize;
    fn get_char(&self, index: usize) -> Option<&'_ [u8]>;
}

impl Encoding for Utf8Str {
    fn char_len(&self) -> usize {
        todo!()
    }

    fn get_char(&self, index: usize) -> Option<&'_ [u8]> {
        todo!()
    }
}

#[derive(Clone, Copy)]
enum EncodingType {
    Utf8,
    Ascii,
    Binary,
}

impl EncodingType {
    fn to_view(self, bytes: &[u8]) -> &dyn Encoding {
        match self {
            Self::Utf8 => Utf8Str::new(bytes),
            _ => todo!(),
        }
    }

    fn to_view_mut(self, bytes: &mut [u8]) -> &mut dyn Encoding {
        match self {
            Self::Utf8 => Utf8Str::new_mut(bytes),
            _ => todo!(),
        }
    }
}

struct String {
    encoding: EncodingType,
    bytes: Buf,
}

impl String {
    pub fn char_len(&self) -> usize {
        let bytes = self.bytes.as_slice();
        let view = self.encoding.to_view(bytes);
        view.char_len()
    }
}
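One wrinkle worth noting: as far as I can tell, a reference to an unsized slice type such as `&Utf8Str` cannot be coerced to `&dyn Encoding`, because unsizing to a trait object requires a sized source type. The compiling sketch below sidesteps that with two assumptions of its own: the views are small sized wrappers (hypothetical `Utf8View` and `ByteView`), and a hypothetical `with_view` helper hands a `&dyn Encoding` to a closure instead of returning it. The struct is named `EncString` to avoid shadowing `std`'s `String`.

```rust
trait Encoding {
    fn char_len(&self) -> usize;
    fn get_char(&self, index: usize) -> Option<&[u8]>;
}

// Sized wrappers around a borrowed slice (hypothetical names).
struct Utf8View<'a>(&'a [u8]);
struct ByteView<'a>(&'a [u8]); // one byte per character: ASCII, binary

impl Encoding for Utf8View<'_> {
    fn char_len(&self) -> usize {
        // Simplification: exact only for valid UTF-8.
        self.0.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }

    fn get_char(&self, index: usize) -> Option<&[u8]> {
        // Simplification: gives up on input that is not valid UTF-8.
        let s = std::str::from_utf8(self.0).ok()?;
        let (start, ch) = s.char_indices().nth(index)?;
        self.0.get(start..start + ch.len_utf8())
    }
}

impl Encoding for ByteView<'_> {
    fn char_len(&self) -> usize {
        self.0.len()
    }

    fn get_char(&self, index: usize) -> Option<&[u8]> {
        self.0.get(index..=index)
    }
}

#[derive(Clone, Copy)]
enum EncodingType {
    Utf8,
    Ascii,
    Binary,
}

impl EncodingType {
    /// Build the matching view on the stack and lend it to `f`.
    fn with_view<R>(self, bytes: &[u8], f: impl FnOnce(&dyn Encoding) -> R) -> R {
        match self {
            Self::Utf8 => f(&Utf8View(bytes)),
            Self::Ascii | Self::Binary => f(&ByteView(bytes)),
        }
    }
}

struct EncString {
    encoding: EncodingType,
    bytes: Vec<u8>,
}

impl EncString {
    fn new(encoding: EncodingType, bytes: Vec<u8>) -> Self {
        Self { encoding, bytes }
    }

    fn char_len(&self) -> usize {
        self.encoding.with_view(&self.bytes, |view| view.char_len())
    }
}
```

The closure form means results cannot borrow from the view itself, so a method like `get_char` would need to copy or return indices in this variant; matching on the enum directly avoids that limitation.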

@b-n
Member

b-n commented Jun 7, 2023

I had a quick play to see if there was a way to make it play with static functions (for some reason in my head it felt nicer than having to initialize the underlying encoding types - even though the compiler is likely optimizing these away).

The closest I got was:

type Buf = Vec<u8>;

trait Encoding {
    fn char_len(buf: &[u8]) -> usize;
    fn get_char(buf: &[u8], index: usize) -> Option<&'_ [u8]>;
}

struct Utf8 {}

impl Encoding for Utf8 {
    fn char_len(buf: &[u8]) -> usize {
        todo!();
    }
    fn get_char(buf: &[u8], index: usize) -> Option<&'_ [u8]> {
        todo!();
    }
}

struct String<E: Encoding> {
    encoding: E,
    bytes: Buf,
}

impl<E: Encoding> String<E> {
    pub fn char_len(&self) -> usize {
        let bytes = self.bytes.as_slice();
        self.encoding.char_len(bytes)
    }
}

However, self.encoding.char_len is not a method; it's an associated function (it takes no self parameter), so of course the call doesn't compile:

= note: found the following associated functions; to be used as methods, functions must have a self parameter

And any attempt to get around this ended up looking exactly like what you have anyway 😅 (e.g. wrapping the encoding in an enum was always an eventuality).

So yes, what you're doing makes a whole bunch of sense 😄.
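For reference, the associated-function version can be made to compile by calling through the type parameter (`E::char_len(..)`) and carrying the encoding as a zero-sized `PhantomData` marker. This is a sketch with hypothetical names (`TypedString`, `Binary`), and the UTF-8 count is simplified to non-continuation bytes. It also shows why the enum was always an eventuality: the encoding is fixed at compile time here, whereas a Ruby string can change its encoding at runtime (e.g. via `force_encoding`).

```rust
use std::marker::PhantomData;

trait Encoding {
    fn char_len(buf: &[u8]) -> usize;
}

struct Utf8;
struct Binary;

impl Encoding for Utf8 {
    fn char_len(buf: &[u8]) -> usize {
        // Simplification: exact only for valid UTF-8.
        buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }
}

impl Encoding for Binary {
    fn char_len(buf: &[u8]) -> usize {
        buf.len()
    }
}

/// The encoding lives in the type, not in a runtime field.
struct TypedString<E: Encoding> {
    bytes: Vec<u8>,
    _encoding: PhantomData<E>,
}

impl<E: Encoding> TypedString<E> {
    fn new(bytes: Vec<u8>) -> Self {
        Self { bytes, _encoding: PhantomData }
    }

    fn char_len(&self) -> usize {
        // Associated functions are called on the type, not on a value.
        E::char_len(&self.bytes)
    }
}
```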

@b-n
Member

b-n commented Jun 7, 2023

Quick thought:

Having some way to represent Utf8 as a single struct would still make life easier for Encoding types. Specifically, having a single place to call for Encoding.names in Ruby would make life easier. But this could also be extracted from spinoso-string altogether and put somewhere else too - just if you saw a win along the way 😄

@lopopolo
Member Author

lopopolo commented Jun 7, 2023

cool cool, it looks like at least there's buy in for experimenting here.

I'll merge this and make roughly the same refactor to Ascii and Binary types. Then we can talk about what to do re: EncodedString, simplifying, and making this Encoding trait

@lopopolo lopopolo merged commit 747b9ca into trunk Jun 7, 2023
19 checks passed
@lopopolo lopopolo deleted the lopopolo/enc-str-slice branch June 7, 2023 16:00
Labels
A-performance Area: Performance improvements and optimizations. A-ruby-core Area: Ruby Core types. S-speculative Status: This is just an idea.

2 participants