Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing Serializer in the public API #18

Open
trepetti opened this issue Mar 2, 2021 · 14 comments
Open

Exposing Serializer in the public API #18

trepetti opened this issue Mar 2, 2021 · 14 comments
Assignees
Projects

Comments

@trepetti
Copy link

trepetti commented Mar 2, 2021

Hi all,

I saw that Nathaniel mentioned the possibility of exposing the Serializer in the public API in pyfisch/cbor#179 (comment). I was wondering what steps or contributions you would be looking for getting it into a state where that would be a possibility. Not sure how many other people share my use-case, but I have been using Serializer::serialize_seq() for outputting logs to disk (with some intervening compression) with serde_json and was curious about moving to CBOR. I know there is more involved than just dropping pub in front of

struct Serializer<W: Write>(Encoder<W>);

I can help by coming up with doc strings for rustdoc to put it on par with serde_json's treatment of Serializer if that would help.

Let me know what you think and thanks for all your work on this crate.

-Tom

@enarxbot enarxbot added this to Triage in Planning Mar 2, 2021
@npmccallum
Copy link
Member

@trepetti Basically, I find my current approach to the different types to be not something I want to stabilize yet. I see two approaches:

  1. We can clean up the Serializer and Deserializer types and expose them as stable.

  2. We can add a feature flag like unstableapis which exposes them. This would allows us to get the types public ASAP but without committing to any stability guarantee.

@trepetti
Copy link
Author

trepetti commented Mar 3, 2021

@npmccallum Thank you for laying out these options. For the first option, would making an API that focuses on analogy to serde_json be a direction you would be interested in going in? I implemented a minimally modified, proof of concept public Serializer that still keeps the encoding provided courtesy of ciborium-ll inaccessible from the top-level Serializer again by analogy to serde_json. My other goal was to make sure that all the signatures in serde trait implementations match serde 1:1 as written (even if it doesn't affect the implementation in practice): trepetti@1c01b99

If this isn't the direction you wanted to go in towards a stable API, I completely understand. And regardless, I think putting it initially behind a feature flag would be prudent (have not done that in this POC, though). It looks like there is no way to have the CollectionSerializer of ciborium as pub(crate), but it appears that serde_json had this same issue with their corresponding Compound type and just opted for #[doc(hidden)] as well.

Let me know what you think. I would implement the corresponding Deserializer API changes and open a draft PR based on your feedback.

@quininer
Copy link

erased_serde also depend on Serializer and Deserializer, exposing it would be useful.

@connorkuehl connorkuehl moved this from Triage to Backlog in Planning Mar 16, 2021
@trepetti
Copy link
Author

trepetti commented Mar 16, 2021

In looking at the Deserializer side of things, I definitely get why Enarx would not want to make assumptions about how/where scratch is allocated for embedded scenarios, although it does present a challenge for implementing Deserializer::{new, from_bytes, from_str, from_reader} in exactly the same way as serde_json. I see a few ways to approach this:

  1. We could maintain the flexibility in the choice of allocation strategy by modifying these functions with an extra argument for scratch, which I think would be the cleanest option.

  2. Alternatively, we could define

enum Scratch {
    Preallocated(&'b mut [u8]),
    Dynamic(Box<[u8]>),
}

and modify everywhere that references scratch accordingly to match on the enum.

  1. We could convert scratch to a Vec as serde_json does since de/mod.rs is already pulling in std::alloc::Box and std::alloc::String anyway. We could default to a 4 kiB with_capacity starting out to reduce runtime allocation overhead.

Let me know what you think. Since Serializer doesn't need any additional memory beyond a few conversions to String, we can keep it 1:1 with serde_json as in the draft above, but for Deserializer there is that large amount of scratch memory needed, so it is worth thinking through what you'd want it to end up looking like in terms of the allocation strategy.

@npmccallum
Copy link
Member

@trepetti Apologies for the lack of response. I was on vacation. :)

I'm fine with the first approach too. The scratch argument definitely needs to be carefully documented.

@trepetti
Copy link
Author

@npmccallum Hope you had a good vacation! One last pair of questions before I put together a PR: if we expose scratch, is it also worth exposing the recursion limit, too? For either or both we could have pub consts DEFAULT_SCRATCH_SIZE and DEFAULT_RECURSION_LIMIT exported, if you think that makes sense, based on the values used in de::from_reader(). That way in addition to explaining the purposes of both arguments in the documentation, users could stick with sensible defaults without having to think about it too much if they don't want to.

@npmccallum
Copy link
Member

@trepetti We should probably just work through these details in the PR review.

@jonasbb
Copy link
Contributor

jonasbb commented Jun 15, 2021

A couple of serde helper libraries also require the Serializer/Deserializer types to be exposed:

  1. We can clean up the Serializer and Deserializer types and expose them as stable.

What would be required to clean them up? I did not see that yet in the thread.

@npmccallum
Copy link
Member

  1. We can clean up the Serializer and Deserializer types and expose them as stable.

What would be required to clean them up? I did not see that yet in the thread.

They currently leverage some internal generic types which are intended to promote code sharing, but probably aren't a good public interface. These would have to be reworked to something more reasonable.

@jonasbb
Copy link
Contributor

jonasbb commented Jun 15, 2021

Thanks for the info. The internal generic types could still remain private and only the Serializer and Deserializer types made public. I imagine something like this: https://github.com/jonasbb/ciborium/commit/4324dc3791fa638ada4c6d059b42e7d98aca34e1

For the Serializer the only constraint is to also expose CollectionSerializer.

For the Deserializer the scratch space "leaks" because of the lifetime. You can abstract it away and provide a constructor for a Deserializer<'static, R>, without requiring the user to know anything about the scratch space. I made a Scratch struct, which abstracts over borrowed or owned scratch space. This is like the 2. option mentioned by @trepetti. The 1. option can always be added by providing new construction functions for Deserializer.

@trepetti
Copy link
Author

I apologize for dropping the ball on this.

I have been using this initial attempt at exposing Serializer and have been using SerializeSeq for streaming serialization in one of my own projects successfully, but it needed more synthetic tests done in-tree before opening it as a PR: https://github.com/trepetti/ciborium/commits/public-deserializer-and-serializer. I can take a break from some prose writing tonight to add those tests tonight and possibly break it into a separate PR for the simpler Serializer case, if that would be better. I modified some of the naming conventions in the existing unexposed traits to make side-by-side comparison with serde and serde_json easier but can also revert those if you want a cleaner commit history.

On the deserialization side, this branch is a reasonably well fleshed-out attempt at what going with option 1 for Deserializer might look like: forcing any scratch space to be preallocated and forcing a recursion limit to be known ahead-of-time in the constructor, which are possibly nice to be explicit about in the resource constrained use case.I expose some sensible defaults for these as constants in the top-level API as de::DEFAULT_SCRATCH_SIZE and de::DEFAULT_RECURSION_LIMIT. Again, I will take a stab at finishing some more tests tonight.

In going further to flesh out the equivalent of StreamDeserializer to allow feature parity with serde_json, I ran into the fact that there would need to be some changes in ciborium-ll in order to support the equivalent of serde_json's Deserializer::end(). We would have to expose part of the decoder in the ciborium-ll to report the impending end of an I/O stream. This is where it gets tricky because I am a security novice and wasn't sure how much would/could/should be exposed in a potential interface to the inner workings of ciborium-ll or if this is even a valid concern in the first place. The larger project is sensitive about this, so I thought before moving forward with StreamDeserializer and modifying ciborium-ll, it would be best to submit a PR omitting Deserializer::end() and consequently Deserializer::into_iter() and StreamDeserializer to flesh it out further at a later time if that is something there is also interest in: https://github.com/trepetti/ciborium/blob/dffbe59a2afc0ae67715be45b12dd52a9ce5779e/ciborium/src/de/mod.rs#L653-L659

@jonasbb
Copy link
Contributor

jonasbb commented Jun 17, 2021

@trepetti I'm not sure where the best place to post feedback is, so just place it here to keep the discussion in one place.
Your changes already look quite complete. I didn't spot any obvious problems in them. I do have two comments:

  1. Given that ciborium::de::from_reader exists without taking any scratch or recursion arguments, I think it might be worthwhile to have a ciborium::de::Deserializer::??? which also does not require those. To achieve that 2. might need to be implemented.
  2. Types which do not borrow are often more convenient to use, for example when returning them from a function or storing somewhere. This would require dropping the scratch requirement or having the option of an owned type like enum Scratch above.

@pickfire
Copy link

I am also looking into using the public Serializer and Deserializer for https://github.com/pickfire/babelfish which uses serde-transcode given that serde_cbor is now unmaintained, can't really do a switch from serde_cbor since it does not have Deserializer::from_reader.

@dominikwerder
Copy link

erased_serde depends on having access to the Serializer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: New
Planning
  
Backlog
Development

No branches or pull requests

6 participants