Split UTF-8 string variant into owned and borrowed types #1909

lopopolo · 2022-06-20T03:21:25Z

spinoso-string implements the various encodings it supports with one type per encoding. For the UTF-8 encoding, the type is Utf8String.

Utf8String is conceptually similar to Vec<u8> and String in std. Both of these types deref to a slice reference counterpart (&[u8] and &str). All methods which do not require growing the backing vector (or dynamic allocation more generally) are implemented on the slice ref part of the type pair.

None of the encoding variants in spinoso-string has this flexible API. This presents challenges when wanting to perform encoding-oriented operations on a borrowed byte slice. A recent example of this was the refactoring required to share the UTF-8 char len calculation among various parts of the code: 67726d1 and #2554.

This PR experiments with breaking apart Utf8String into an owned and borrowed pair of types: Utf8String and &Utf8Str/&mut Utf8Str. The borrowed types are heavily inspired by bstr::BStr.
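As a minimal sketch of what such an owned/borrowed pair can look like, the block below mirrors the `Vec<u8>`/`[u8]` relationship with `Deref`. It is not the code from this PR: the field names, the `push_byte` helper, and the simplified `char_len` (which counts non-continuation bytes and ignores invalid byte sequences) are assumptions made for illustration. The pointer cast follows the general pattern used by slice newtypes such as `bstr::BStr`.

```rust
use std::ops::Deref;

/// Borrowed view: a byte slice tagged as (conventionally) UTF-8.
#[repr(transparent)]
pub struct Utf8Str {
    bytes: [u8],
}

impl Utf8Str {
    /// Reinterpret a byte slice as a `Utf8Str` without copying or allocating.
    pub fn new(bytes: &[u8]) -> &Utf8Str {
        // SAFETY: `Utf8Str` is `repr(transparent)` over `[u8]`, so both
        // pointer types share the same layout and slice-length metadata.
        unsafe { &*(bytes as *const [u8] as *const Utf8Str) }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }

    /// Simplified character count (assumption): every byte that is not a
    /// UTF-8 continuation byte (`0b10xx_xxxx`) is counted as a character.
    pub fn char_len(&self) -> usize {
        self.bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }
}

/// Owned counterpart: only operations that may grow the buffer live here.
pub struct Utf8String {
    inner: Vec<u8>,
}

impl Utf8String {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { inner: bytes }
    }

    /// Hypothetical growing operation, shown only to contrast with the view.
    pub fn push_byte(&mut self, byte: u8) {
        self.inner.push(byte);
    }
}

impl Deref for Utf8String {
    type Target = Utf8Str;

    fn deref(&self) -> &Utf8Str {
        Utf8Str::new(&self.inner)
    }
}
```

With this shape, `Utf8Str::new(some_slice).char_len()` works on any borrowed bytes, and an owned `Utf8String` picks up the same method through `Deref`.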

The benefits of the reference types that associate a byte slice to an encoding mostly arise from avoiding the need to allocate an owned Utf8String. I've long viewed a String's encoding as a "view" or "cursor" over the underlying byte content. Being able to interrogate an arbitrary byte slice via a specific encoding makes the API much more powerful.

Eventually I would like to extract APIs like Utf8String/Utf8Str to their own crates in the workspace. The code is sufficiently complicated and there is a lot of it. Doing so would allow exposing the various encodings in a no-std::no-alloc configuration.

This PR contains one breaking change which led to the version bump in spinoso-string (removed impl Extend<&'a mut u8>), but surprisingly little of the crate had to change outside of the UTF-8 internals.

I still consider this approach experimental and am looking for feedback on the design.

One WIP in the design: ASCII and Binary encoded strings will be able to implement reverse() on their reference type. Due to the complexity in a character-wise reversal of the conventionally UTF-8 byte contents in Utf8String, allocation is required so this API still lives on Utf8String.
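To illustrate the difference, here is a rough sketch using only `std`. The function names are hypothetical, and treating each invalid byte as its own character is an assumption about the intended semantics, not a description of the spinoso-string implementation. The byte-wise version works in place on a mutable slice, while this character-wise version collects per-character slices and writes them into a new buffer.

```rust
/// Byte-wise reversal for ASCII/binary content: works in place on a
/// mutable borrowed slice, so no allocation is needed.
pub fn reverse_bytes(bytes: &mut [u8]) {
    bytes.reverse();
}

/// Record one byte slice per character of a valid UTF-8 string.
fn push_chars<'a>(s: &'a str, pieces: &mut Vec<&'a [u8]>) {
    for (start, ch) in s.char_indices() {
        pieces.push(&s.as_bytes()[start..start + ch.len_utf8()]);
    }
}

/// Character-wise reversal of conventionally UTF-8 bytes into a new buffer.
/// Assumption: each invalid byte is treated as a one-byte character.
pub fn reverse_utf8(bytes: &[u8]) -> Vec<u8> {
    let mut pieces: Vec<&[u8]> = Vec::new();
    let mut rest = bytes;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                push_chars(valid, &mut pieces);
                break;
            }
            Err(err) => {
                // Everything before `valid_up_to` is known to be valid UTF-8.
                let (valid, after) = rest.split_at(err.valid_up_to());
                let valid = std::str::from_utf8(valid).expect("prefix was validated");
                push_chars(valid, &mut pieces);
                // `error_len` is `None` when the input ends mid-sequence.
                let bad = err.error_len().unwrap_or(after.len());
                for byte in after[..bad].chunks(1) {
                    pieces.push(byte);
                }
                rest = &after[bad..];
            }
        }
    }
    // Emit the characters in reverse order, keeping each one's bytes intact.
    let mut out = Vec::with_capacity(bytes.len());
    for piece in pieces.iter().rev() {
        out.extend_from_slice(piece);
    }
    out
}
```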

A benefit of the borrowed type design is (at least for me) it makes it more obvious how a crate like encoding_rs could be used to support many more encodings than Artichoke does today.

As an aside, sorry for the massive commit in this PR. I ended up working on #2589 after I made these changes and I couldn't resolve the merge conflicts.

lopopolo · 2023-06-04T22:57:25Z

@b-n tagging you as a reviewer here since you're familiar with the String code and I think such a big change warrants a second pair of eyes.

lopopolo · 2023-06-04T23:09:52Z

If this change sounds like a good idea, I'll follow it up with the same refactor for the Binary and ASCII encoding variants.


lopopolo · 2023-06-04T23:53:08Z

> A benefit of the borrowed type design is (at least for me) it makes it more obvious how a crate like encoding_rs could be used to support many more encodings than Artichoke does today.

Expanding on this a bit: fully committing to custom slice types as views into the underlying Buf gives essentially the same code structure as MRI, which uses an RString* to hold the bytes and the various encodings in Onigmo to perform character-oriented operations on those bytes.

The Onigmo encodings are basically sets of function pointers (aka vtables), which is conceptually very similar to wrapping the raw byte slice in something like Utf8Str.

This means that Utf8String doesn't really need to be a thing (modulo figuring out how to handle the case changing routines which may change the bytesize of the string). spinoso_string::String will look like this:

pub struct String {
    encoding: Encoding,
    buf: Buf,
}

EncodedString goes away and the encoding-aware methods in String will match on self.encoding, wrapping the underlying bytes in the appropriate ref type to swap in the right impl of the method.
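A rough sketch of that dispatch, under some loud assumptions: `RubyString` stands in for `spinoso_string::String` (renamed here only to avoid shadowing `std::string::String`), `Buf` is aliased to `Vec<u8>`, and `Utf8View`/`ByteView` are hypothetical stand-ins for the borrowed ref types. The UTF-8 character count is simplified and ignores invalid byte sequences.

```rust
type Buf = Vec<u8>;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Encoding {
    Utf8,
    Ascii,
    Binary,
}

/// Hypothetical borrowed view for UTF-8 content.
struct Utf8View<'a>(&'a [u8]);

impl Utf8View<'_> {
    fn char_len(&self) -> usize {
        // Simplification: count bytes that are not UTF-8 continuation bytes.
        self.0.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }
}

/// Hypothetical borrowed view for single-byte encodings (ASCII, binary).
struct ByteView<'a>(&'a [u8]);

impl ByteView<'_> {
    fn char_len(&self) -> usize {
        self.0.len()
    }
}

pub struct RubyString {
    encoding: Encoding,
    buf: Buf,
}

impl RubyString {
    pub fn new(encoding: Encoding, buf: Buf) -> Self {
        Self { encoding, buf }
    }

    /// Encoding-aware method: wrap the bytes in the matching view type.
    pub fn char_len(&self) -> usize {
        let bytes = self.buf.as_slice();
        match self.encoding {
            Encoding::Utf8 => Utf8View(bytes).char_len(),
            Encoding::Ascii | Encoding::Binary => ByteView(bytes).char_len(),
        }
    }
}
```

The same bytes report a different character count depending only on the `encoding` tag, which is the "view over the bytes" idea in miniature.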

b-n · 2023-06-05T07:00:18Z

Maybe a paraphrase, but thinking conceptually:

- An RString is just a series of bytes (null terminated) and an encoding.
- There are a number of methods which are not encoding aware (e.g. push bytes, read bytes, etc.), and thus can access and return the underlying buf (well, a ref to the buf).
- There are a number of encoding-aware methods, but these can pass a ref of the buf to the relevant encoding-aware function, which can then perform the encoding-aware functionality?

I kinda like it, and it seems way more scalable.

e.g.

trait Encoding {
    fn char_len(buf: &[u8]) -> usize;
    fn get_char(buf: &[u8], index: usize) -> Option<&'_ [u8]>;
}

Whether or not you use a trait is irrelevant since, as you said, you could do it with vtables (I don't know how this would look in Rust tbh).

That would be a huge bonus when it comes to adding support for new encodings: essentially you just add the right encoding-aware functions and you're done. Right now it would require mass edits to https://github.com/artichoke/artichoke/blob/trunk/spinoso-string/src/enc/mod.rs which will eventually hurt hah.
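For what a hand-rolled vtable could look like in Rust, here is one possible sketch: a struct of plain function pointers with one `static` instance per encoding, selected at runtime by an enum. All names (`EncodingVTable`, `EncodingType::vtable`, the helper functions) are hypothetical, and the UTF-8 helpers are simplified to ignore invalid byte sequences.

```rust
/// A hand-rolled "vtable": one set of function pointers per encoding.
pub struct EncodingVTable {
    pub name: &'static str,
    pub char_len: fn(&[u8]) -> usize,
    pub get_char: fn(&[u8], usize) -> Option<&[u8]>,
}

fn utf8_char_len(buf: &[u8]) -> usize {
    // Simplification: count bytes that are not UTF-8 continuation bytes.
    buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

fn utf8_get_char(buf: &[u8], index: usize) -> Option<&[u8]> {
    // Offsets of every byte that starts a character.
    let mut starts = buf
        .iter()
        .enumerate()
        .filter(|&(_, &b)| (b & 0xC0) != 0x80)
        .map(|(i, _)| i);
    let start = starts.nth(index)?;
    let end = starts.next().unwrap_or(buf.len());
    Some(&buf[start..end])
}

fn byte_char_len(buf: &[u8]) -> usize {
    buf.len()
}

fn byte_get_char(buf: &[u8], index: usize) -> Option<&[u8]> {
    buf.get(index..=index)
}

pub static UTF8: EncodingVTable = EncodingVTable {
    name: "UTF-8",
    char_len: utf8_char_len,
    get_char: utf8_get_char,
};

pub static ASCII: EncodingVTable = EncodingVTable {
    name: "US-ASCII",
    char_len: byte_char_len,
    get_char: byte_get_char,
};

pub static BINARY: EncodingVTable = EncodingVTable {
    name: "ASCII-8BIT",
    char_len: byte_char_len,
    get_char: byte_get_char,
};

#[derive(Clone, Copy)]
pub enum EncodingType {
    Utf8,
    Ascii,
    Binary,
}

impl EncodingType {
    /// Pick the function table for this encoding at runtime.
    pub fn vtable(self) -> &'static EncodingVTable {
        match self {
            Self::Utf8 => &UTF8,
            Self::Ascii => &ASCII,
            Self::Binary => &BINARY,
        }
    }
}
```

Adding an encoding in this shape means adding one more `static` table and one more match arm; a call looks like `(EncodingType::Utf8.vtable().char_len)(bytes)`.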

lopopolo · 2023-06-06T03:18:42Z

Yea something along these lines, but I was actually thinking something like this to match the Utf8Str slice type in this PR:

trait Encoding {
    fn char_len(&self) -> usize;
    fn get_char(&self, index: usize) -> Option<&'_ [u8]>;
}

impl Encoding for Utf8Str {
    fn char_len(&self) -> usize {
        todo!()
    }

    fn get_char(&self, index: usize) -> Option<&'_ [u8]> {
        todo!()
    }
}

#[derive(Clone, Copy)]
enum EncodingType {
    Utf8,
    Ascii,
    Binary,
}

impl EncodingType {
    fn to_view(self, bytes: &[u8]) -> &dyn Encoding {
        match self {
            Self::Utf8 => Utf8Str::new(bytes),
            _ => todo!(),
        }
    }

    fn to_view_mut(self, bytes: &mut [u8]) -> &mut dyn Encoding {
        match self {
            Self::Utf8 => Utf8Str::new_mut(bytes),
            _ => todo!(),
        }
    }
}

struct String {
    encoding: EncodingType,
    buf: Buf,
}

impl String {
    pub fn char_len(&self) -> usize {
        let bytes = self.buf.as_slice();
        let view = self.encoding.to_view(bytes);
        view.char_len()
    }
}

b-n · 2023-06-07T08:42:18Z

I had a quick play to see if there was a way to make it play with static functions (for some reason in my head it felt nicer than having to initialize the underlying encoding types - even though the compiler is likely optimizing these away).

The closest I got was:

type Buf = Vec<u8>;

trait Encoding {
    fn char_len(buf: &[u8]) -> usize;
    fn get_char(buf: &[u8], index: usize) -> Option<&'_ [u8]>;
}

struct Utf8 {}

impl Encoding for Utf8 {
    fn char_len(buf: &[u8]) -> usize {
        todo!();
    }
    fn get_char(buf: &[u8], index: usize) -> Option<&'_ [u8]> {
        todo!();
    }
}

struct String<E: Encoding> {
    encoding: E,
    bytes: Buf,
}

impl<E: Encoding> String<E> {
    pub fn char_len(&self) -> usize {
        let bytes = self.bytes.as_slice();
        self.encoding.char_len(bytes)
    }
}

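However, self.encoding.char_len(bytes) does not compile: char_len has no self parameter, so it is an associated function (a static method) rather than a method, and the compiler says as much: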

= note: found the following associated functions; to be used as methods, functions must have a self parameter

And any attempt to get around this ended up looking exactly like what you have anyway 😅 (e.g. wrapping the encoding in an enum was always an eventuality).
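For reference, one way the associated-function version can be made to compile is to carry the encoding as a type parameter and call `E::char_len` directly. This is only a sketch: `TypedString`, the `PhantomData` field, and the simplified UTF-8 count are assumptions for illustration. It fixes the encoding at compile time, so it does not cover a string whose encoding is chosen or changed at runtime, which is where an enum comes back in.

```rust
use std::marker::PhantomData;

trait Encoding {
    fn char_len(buf: &[u8]) -> usize;
}

struct Utf8;

impl Encoding for Utf8 {
    fn char_len(buf: &[u8]) -> usize {
        // Simplification: count bytes that are not UTF-8 continuation bytes.
        buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }
}

struct Binary;

impl Encoding for Binary {
    fn char_len(buf: &[u8]) -> usize {
        buf.len()
    }
}

/// The encoding is a compile-time marker, so no encoding value is stored.
struct TypedString<E: Encoding> {
    encoding: PhantomData<E>,
    bytes: Vec<u8>,
}

impl<E: Encoding> TypedString<E> {
    fn new(bytes: Vec<u8>) -> Self {
        Self {
            encoding: PhantomData,
            bytes,
        }
    }

    fn char_len(&self) -> usize {
        // Associated functions are called through the type, not a value.
        E::char_len(&self.bytes)
    }
}
```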

So yes, what you're doing makes a whole bunch of sense 😄.

b-n · 2023-06-07T08:50:55Z

Quick thought:

Having some way to represent Utf8 as a single struct would still make life easier for Encoding types. Specifically, Encoding.names in Ruby would be easier to implement with a single place to call. But this could also be extracted from spinoso-string altogether and put somewhere else too - just if you saw a win along the way 😄

lopopolo · 2023-06-07T16:00:47Z

cool cool, it looks like at least there's buy in for experimenting here.

I'll merge this and make roughly the same refactor to Ascii and Binary types. Then we can talk about what to do re: EncodedString, simplifying, and making this Encoding trait

lopopolo added the A-ruby-core, A-performance, S-wip, and S-speculative labels Jun 20, 2022

lopopolo force-pushed the lopopolo/enc-str-slice branch from 2c65395 to b9c32ea Compare June 21, 2022 00:40

This was referenced Jun 2, 2023

MatchData#offset return utf8 character offset #2581

Open

Split string buffer internals out from spinoso-string into a new scolapasta-strbuf crate #2589

Merged

lopopolo force-pushed the lopopolo/enc-str-slice branch 2 times, most recently from 17fcf5a to b9de856 Compare June 4, 2023 22:22

lopopolo removed the S-wip and S-speculative labels Jun 4, 2023

lopopolo force-pushed the lopopolo/enc-str-slice branch from b9de856 to 4481cb8 Compare June 4, 2023 22:55

lopopolo added the S-speculative label Jun 4, 2023

lopopolo changed the title ~~[WIP] Experiment with encoding-specific slice types in spinoso-string~~ Split UTF-8 string variant into owned and borrowed types Jun 4, 2023

lopopolo requested a review from b-n June 4, 2023 22:57

lopopolo force-pushed the lopopolo/enc-str-slice branch from 4481cb8 to 3132f14 Compare June 4, 2023 23:28

lopopolo added 4 commits June 4, 2023 16:29

Address clippy lint violations

b62a0b1

Derive Hash, PartialEq, Eq on Utf8String

34566f9

Implement Hash for Utf8Str

8c1dd04

lopopolo force-pushed the lopopolo/enc-str-slice branch from 3132f14 to 8c1dd04 Compare June 4, 2023 23:29

lopopolo merged commit 747b9ca into trunk Jun 7, 2023
19 checks passed

lopopolo deleted the lopopolo/enc-str-slice branch June 7, 2023 16:00
