Lossy utf-8 string support #314

ladislavmacoun · 2022-11-11T09:01:49Z

The reference c++ library allows users to decide how to approach invalid utf8 encoded text.
for example.

struct Message {
    stringField @0 :Text;
}

...
::capnp::FlatArrayMessageReader reader = ::capnp::FlatArrayMessageReader(
    kj::ArrayPtr<capnp::word>(reinterpret_cast<capnp::word*>(buffer), sz)
);
Message::Reader root = reader.getRoot<Message>();
std::string msg = root.hasStringField() ? root.getStringField() : "";
std::string valid_utf8;
utf8::replace_invalid(msg.begin(), msg.end(), std::back_inserter(valid_utf8));
handle_valid_utf8(std::move(valid_utf8));

Here we have a Message with one text stringField, if stringField contains invalid utf8 data, the replace_invalid()1 method replaces the invalid characters with the 0xfffd, the same approach which is the standard rust String doing for String::from_utf8_lossy(). We are losing some information but still, this is a better approach than throwing out the Message entirely (depending on the use case ofc.).

I am having a hard time replicating something similar in rust with this library. From what I have tried I am only able to extract a valid UTF-8 &str or error. But I cannot find a way to replace the invalid characters with Unicode replacement characters2.

let msg = capnp::serialize::read_message_from_flat_slice(
    &mut buf,
    capnp::message::ReaderOptions::new(),
)
.unwrap();
let root = msg.get_root::<http_log_capnp::message::Reader>().unwrap();
if root.has_string_field() {
    match root.get_string_field() {
        Ok(string_field) => println!("valid utf8 {:?}", string_field),
        Err(e) => println!("invalid utf8 {:?}", e),
    }
};

Is there any way how achieve something similar to the c++ example above?

The text was updated successfully, but these errors were encountered:

dwrensha · 2022-11-11T14:41:57Z

Yeah, currently capnp::text::Reader is just an alias of Rust's &str, so there's not a good way to get a handle on one without already doing the utf8 validation. A terrible hack you could do is to define a second message type

struct Message2 {
    stringField @0 :Data;
}

and read the message at that type.

I think the real fix will be to make capnp::text::Reader a proper wrapper that only optionally does the utf8 validation. That can help with performance too, because not all consumers actually need to get read the value as a&str.

ladislavmacoun · 2022-11-15T13:41:07Z

Thank you for the reply.
I will try changing the rust consumer scheme to data, looking at the serialization the only difference between the Text and Data fields in capnp is trailing a null byte so it should be fine. Although, the proper fix you are proposing would be nice!

osiewicz · 2023-02-03T11:18:19Z

I think the real fix will be to make capnp::text::Reader a proper wrapper that only optionally does the utf8 validation. That can help with performance too, because not all consumers actually need to get read the value as a&str.

That would immensely help my use case too, which involves calling .get_text several times. UTF validation does show up in my flamegraphs.

antoinecordelle · 2023-08-16T09:47:01Z

Hello,
We're looking into addressing this issue implementing your solution David, so make text::Reader a proper wrapper. The idea would be:

To make text::Reader a wrapper around data::Reader, so only containing an array of bytes.
To expose a method for text::Reader to_str(self) -> Result<&str> performing the UTF8 check.
On the codegen side, for a given Text field foo, expose the following methods:
- get_foo(self) -> Result<&str> (no API break as it previously returned Result<text::Reader>) that would perform the UTF8 check and the conversion to string.
- get_foo_as_bytes(self) that would return either the underlying data::Reader or directly the &[u8] (this is most likely equivalent though).

This should allow to have no breaking change on the user's side once the API is generated as the get_foo method will behave the same and still return a Result<&str>, it will only additionally perform the UTF8 check itself, compared to doing it at construction time like before. It will however require schema regeneration.

Thoughts? Happy to make a PR once I've finished implementing if that sounds suitable.

ladislavmacoun · 2023-08-16T09:54:03Z

Sounds good to me!

dwrensha · 2023-08-30T17:20:41Z

@antoinecordelle could you describe more about why you need this feature? In particular, is it for performance reasons, or is it just to be able to recover the data in the event of encountering some non-utf8 bytes? If it's the latter, then there's a possibility that we could get away with a much smaller change than #429.

I'd also be curious to hear about what @osiewicz is doing. Why does get_text() need to be called multiple times, what how would making utf-8 validation lazier help that?

In any case, having more context will help justify whatever decision we make.

dwrensha · 2023-08-30T17:22:21Z

#429 is going to require a lot of people to update their code and will make things more verbose. I want to make sure we have a solid justification for doing that.

osiewicz · 2023-08-30T17:28:31Z

Hey, I've moved on from the project since making that comment. @Shaddy should be able to give more perspective.
.get_text being called multiple times was admittedly specific to that project. Having the validation be optional would mean one could run the validation once and then use the non-validating API (with something like from_utf8_unchecked) if the first validation was successful.
It's not my cup of tea anymore though, so feel free to ignore.

antoinecordelle · 2023-08-31T08:59:06Z

Sure. In this case, this is mostly to be able to recover the data from Text fields that may have some non-utf8 bytes. Performance reasons can be a nice to have but the main reason was non-utf8 text. One note that differs from the initial discussions in this issue though, is that I'd need to be able to recover the data in a non-lossy way, so not replacing invalid bytes by placeholders.

I mostly implemented it as such in #429 as that seemed like the change that conceptually made the most sense to me. But happy to hear about the simpler way of achieving that if that can limit the breaking change and not require people to change their code.

ladislavmacoun changed the title ~~Lossy string support~~ Lossy utf-8 string support Nov 11, 2022

dwrensha added the breaking change requires version bump label Nov 27, 2022

antoinecordelle mentioned this issue Aug 15, 2023

Move capnp::text::Reader to a wrapper of data::Reader antoinecordelle/capnproto-rust#1

Closed

antoinecordelle mentioned this issue Aug 23, 2023

Make text::Reader a wrapper around data::Reader instead of &str #429

Closed

dwrensha mentioned this issue Aug 31, 2023

lazy utf-8 validation #433

Merged

dwrensha closed this as completed in #433 Sep 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lossy utf-8 string support #314

Lossy utf-8 string support #314

ladislavmacoun commented Nov 11, 2022 •

edited

dwrensha commented Nov 11, 2022 •

edited

ladislavmacoun commented Nov 15, 2022

osiewicz commented Feb 3, 2023

antoinecordelle commented Aug 16, 2023 •

edited

ladislavmacoun commented Aug 16, 2023

dwrensha commented Aug 30, 2023

dwrensha commented Aug 30, 2023

osiewicz commented Aug 30, 2023 •

edited

antoinecordelle commented Aug 31, 2023

Lossy utf-8 string support #314

Lossy utf-8 string support #314

Comments

ladislavmacoun commented Nov 11, 2022 • edited

dwrensha commented Nov 11, 2022 • edited

ladislavmacoun commented Nov 15, 2022

osiewicz commented Feb 3, 2023

antoinecordelle commented Aug 16, 2023 • edited

ladislavmacoun commented Aug 16, 2023

dwrensha commented Aug 30, 2023

dwrensha commented Aug 30, 2023

osiewicz commented Aug 30, 2023 • edited

antoinecordelle commented Aug 31, 2023

ladislavmacoun commented Nov 11, 2022 •

edited

dwrensha commented Nov 11, 2022 •

edited

antoinecordelle commented Aug 16, 2023 •

edited

osiewicz commented Aug 30, 2023 •

edited