Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lossy utf-8 string support #314

Closed
ladislavmacoun opened this issue Nov 11, 2022 · 9 comments · Fixed by #433
Closed

Lossy utf-8 string support #314

ladislavmacoun opened this issue Nov 11, 2022 · 9 comments · Fixed by #433
Labels
breaking change requires version bump

Comments

@ladislavmacoun
Copy link

ladislavmacoun commented Nov 11, 2022

The reference c++ library allows users to decide how to approach invalid utf8 encoded text.
for example.

struct Message {
    stringField @0 :Text;
}
...
::capnp::FlatArrayMessageReader reader = ::capnp::FlatArrayMessageReader(
    kj::ArrayPtr<capnp::word>(reinterpret_cast<capnp::word*>(buffer), sz)
);
Message::Reader root = reader.getRoot<Message>();
std::string msg = root.hasStringField() ? root.getStringField() : "";
std::string valid_utf8;
utf8::replace_invalid(msg.begin(), msg.end(), std::back_inserter(valid_utf8));
handle_valid_utf8(std::move(valid_utf8));

Here we have a Message with one text stringField, if stringField contains invalid utf8 data, the replace_invalid()1 method replaces the invalid characters with the 0xfffd, the same approach which is the standard rust String doing for String::from_utf8_lossy(). We are losing some information but still, this is a better approach than throwing out the Message entirely (depending on the use case ofc.).

I am having a hard time replicating something similar in rust with this library. From what I have tried I am only able to extract a valid UTF-8 &str or error. But I cannot find a way to replace the invalid characters with Unicode replacement characters2.

let msg = capnp::serialize::read_message_from_flat_slice(
    &mut buf,
    capnp::message::ReaderOptions::new(),
)
.unwrap();
let root = msg.get_root::<http_log_capnp::message::Reader>().unwrap();
if root.has_string_field() {
    match root.get_string_field() {
        Ok(string_field) => println!("valid utf8 {:?}", string_field),
        Err(e) => println!("invalid utf8 {:?}", e),
    }
}; 

Is there any way how achieve something similar to the c++ example above?

@ladislavmacoun ladislavmacoun changed the title Lossy string support Lossy utf-8 string support Nov 11, 2022
@dwrensha
Copy link
Member

dwrensha commented Nov 11, 2022

Yeah, currently capnp::text::Reader is just an alias of Rust's &str, so there's not a good way to get a handle on one without already doing the utf8 validation. A terrible hack you could do is to define a second message type

struct Message2 {
    stringField @0 :Data;
}

and read the message at that type.

I think the real fix will be to make capnp::text::Reader a proper wrapper that only optionally does the utf8 validation. That can help with performance too, because not all consumers actually need to get read the value as a&str.

@ladislavmacoun
Copy link
Author

Thank you for the reply.
I will try changing the rust consumer scheme to data, looking at the serialization the only difference between the Text and Data fields in capnp is trailing a null byte so it should be fine. Although, the proper fix you are proposing would be nice!

@dwrensha dwrensha added the breaking change requires version bump label Nov 27, 2022
@osiewicz
Copy link

osiewicz commented Feb 3, 2023

I think the real fix will be to make capnp::text::Reader a proper wrapper that only optionally does the utf8 validation. That can help with performance too, because not all consumers actually need to get read the value as a&str.

That would immensely help my use case too, which involves calling .get_text several times. UTF validation does show up in my flamegraphs.

@antoinecordelle
Copy link

antoinecordelle commented Aug 16, 2023

Hello,
We're looking into addressing this issue implementing your solution David, so make text::Reader a proper wrapper. The idea would be:

  • To make text::Reader a wrapper around data::Reader, so only containing an array of bytes.
  • To expose a method for text::Reader to_str(self) -> Result<&str> performing the UTF8 check.
  • On the codegen side, for a given Text field foo, expose the following methods:
    • get_foo(self) -> Result<&str> (no API break as it previously returned Result<text::Reader>) that would perform the UTF8 check and the conversion to string.
    • get_foo_as_bytes(self) that would return either the underlying data::Reader or directly the &[u8] (this is most likely equivalent though).

This should allow to have no breaking change on the user's side once the API is generated as the get_foo method will behave the same and still return a Result<&str>, it will only additionally perform the UTF8 check itself, compared to doing it at construction time like before. It will however require schema regeneration.

Thoughts? Happy to make a PR once I've finished implementing if that sounds suitable.

@ladislavmacoun
Copy link
Author

Sounds good to me!

@dwrensha
Copy link
Member

@antoinecordelle could you describe more about why you need this feature? In particular, is it for performance reasons, or is it just to be able to recover the data in the event of encountering some non-utf8 bytes? If it's the latter, then there's a possibility that we could get away with a much smaller change than #429.

I'd also be curious to hear about what @osiewicz is doing. Why does get_text() need to be called multiple times, what how would making utf-8 validation lazier help that?

In any case, having more context will help justify whatever decision we make.

@dwrensha
Copy link
Member

#429 is going to require a lot of people to update their code and will make things more verbose. I want to make sure we have a solid justification for doing that.

@osiewicz
Copy link

osiewicz commented Aug 30, 2023

Hey, I've moved on from the project since making that comment. @Shaddy should be able to give more perspective.
.get_text being called multiple times was admittedly specific to that project. Having the validation be optional would mean one could run the validation once and then use the non-validating API (with something like from_utf8_unchecked) if the first validation was successful.
It's not my cup of tea anymore though, so feel free to ignore.

@antoinecordelle
Copy link

Sure. In this case, this is mostly to be able to recover the data from Text fields that may have some non-utf8 bytes. Performance reasons can be a nice to have but the main reason was non-utf8 text. One note that differs from the initial discussions in this issue though, is that I'd need to be able to recover the data in a non-lossy way, so not replacing invalid bytes by placeholders.

I mostly implemented it as such in #429 as that seemed like the change that conceptually made the most sense to me. But happy to hear about the simpler way of achieving that if that can limit the breaking change and not require people to change their code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change requires version bump
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants