New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default fact serializer for durability #262

Open
WilliamParker opened this Issue Jan 30, 2017 · 5 comments

Comments

Projects
None yet
2 participants
@WilliamParker
Member

WilliamParker commented Jan 30, 2017

The optimal way to serialize facts is heavily dependent on the exact fact types and structures. Clara's durability implementation, which is discussed at #198, currently both allows and requires users to supply a serializer for their facts in the form of an implementation of IWorkingMemorySerializer. However, it makes sense to me for Clara to provide a default, perhaps naive, IWorkingMemorySerializer for Clojure data types. This could potentially be as simple as an implementation that uses pr-str. Even if serious use ends up requiring a more specialized implementation tailored to a user's specific case a default that could be used in initial experiments and prototyping would lower the barrier to entry of using the durability API.

@mrrodriguez

This comment has been minimized.

Show comment
Hide comment
@mrrodriguez

mrrodriguez Jan 30, 2017

Collaborator

This could potentially be as simple as an implementation that uses pr-str.

I don't really think this is a "simple" or very tractable path. I originally tried to write the session serializer with a pr-str based default and there were many issues with its preservation of types, metadata, and conflicts with other usages of pr-str throughout clj, popular REPL middleware, etc.

Also, the pr-str format is very bloated and very slow serializing non-code-centric data. I really don't think it is an appropriate usage of it.

However, I think a default Fressian implementation for Clojure data structures could be provided fairly easily since there has already been a lot of work done to make Fressian work well for the Clj to Clj SerDe use-case we are going for here.

It's an "optional" dependency, but I still think that would be useful for a consumer that only needed a pure-clj-based serialization impl.

Keep in mind that this SerDe doesn't really lend itself too well to schema evolution of the underlying models though.

Collaborator

mrrodriguez commented Jan 30, 2017

This could potentially be as simple as an implementation that uses pr-str.

I don't really think this is a "simple" or very tractable path. I originally tried to write the session serializer with a pr-str based default and there were many issues with its preservation of types, metadata, and conflicts with other usages of pr-str throughout clj, popular REPL middleware, etc.

Also, the pr-str format is very bloated and very slow serializing non-code-centric data. I really don't think it is an appropriate usage of it.

However, I think a default Fressian implementation for Clojure data structures could be provided fairly easily since there has already been a lot of work done to make Fressian work well for the Clj to Clj SerDe use-case we are going for here.

It's an "optional" dependency, but I still think that would be useful for a consumer that only needed a pure-clj-based serialization impl.

Keep in mind that this SerDe doesn't really lend itself too well to schema evolution of the underlying models though.

@WilliamParker

This comment has been minimized.

Show comment
Hide comment
@WilliamParker

WilliamParker Jan 31, 2017

Member

Conflicts with other libraries could be a problem. Metadata on facts is dangerous in Clara anyway since Clara uses equality semantics, not identity semantics, but if you guarantee that no two facts exist such that they are equal but their metadata is unequal it is usable and ideally a durability implementation wouldn't break things allowed without durability. Performance might not be such a large concern if this ended up just being a way to demo/prototype durability and any serious usage involved writing a custom memory serializer, but of course a default that both allowed a quick demo and was performant enough for use in real use cases would be better.

Member

WilliamParker commented Jan 31, 2017

Conflicts with other libraries could be a problem. Metadata on facts is dangerous in Clara anyway since Clara uses equality semantics, not identity semantics, but if you guarantee that no two facts exist such that they are equal but their metadata is unequal it is usable and ideally a durability implementation wouldn't break things allowed without durability. Performance might not be such a large concern if this ended up just being a way to demo/prototype durability and any serious usage involved writing a custom memory serializer, but of course a default that both allowed a quick demo and was performant enough for use in real use cases would be better.

@mrrodriguez

This comment has been minimized.

Show comment
Hide comment
@mrrodriguez

mrrodriguez Jan 31, 2017

Collaborator

Conflicts with other libraries could be a problem.

Fressian is an optional dependency. The consumer must explicitly add it if they wish to use a namespace that loads Fressian. So there is no conflict here.
If you want a reasonable Fressian default impl, you are free to use it. It'll be tested in the :dev profile of Clara against some (modern) Fressian version.

Metadata on facts is dangerous in Clara anyway ...

This is only true if you consider the metadata to be the distinguishing characteristic between two or more facts. That is certainly not the only use of metadata. Losing all metadata post-serialization can definitely lead to mysteries in the consuming application.

Performance might not be such a large concern if this ended up just being a way to demo/prototype durability and any serious usage involved writing a custom memory serializer, but of course a default that both allowed a quick demo and was performant enough for use in real use cases would be better.

I worked through implementing it with print-dup (and later print-method) at length. It was really messy. Many things are not implemented there that are typically needed when trying to use it as a full-blown serialization format. The recommendation from the community was to use print-method instead of print-dup for something like this.
print-method had many nasty edge cases and hierarchical conflicts when trying to apply this to general-purpose Clojure data.

I really think print-method or print-dup are really only practical serialization formats for transmitting code-as-data or restricted subset of Clojure data structures.

Also, the performance characteristics are really bad. Especially if you plan to have data types involved like deftype or defrecord created Classes. There is a ton of non-cached reflection done by the Clojure Reader when parsing this. It really is atrocious. :)
print-dup is even worse in terms of reflection in the reader since it attempts (but not very reliably) to preserve data types.

I really just believe this use-case is an abuse of the purpose of print-dup and print-method and would be opposed to seeing it be a default maintained and provided by Clara.

The Fressian approach seems much more practical at this point since so many of the custom . handlers that are needed are already available in the codebase. Again, this would be an optional dependency. The d/IWorkingMemorySerializer impl would be in a namespace like clara.rules.durability.fressian in the same way the d/ISessionSerializer currently is.


I'll also note, the only Clara-provided d/ISessionSerializer impl is in clara.rules.durability.fressian right now. And it requires the consumer to provide a Fressian impl.

It is mostly undesirable for a consumer to implement d/ISessionSerializer themselves since it will be fragile and tightly coupled to Clara internal structure changes. This was only done as a protocol so that
(1) Fressian could be optional
(2) Clara could offer alternatives to Fressian impl's of d/ISessionSerializer for consumers that didn't want Fressian. However, this would be at the discretion of Clara to find appropriate alternatives, like perhaps

  • Kryo (extensible and achievable),
  • Transit, Thrift, Avro (would be pretty restricted and difficult to pull off)
  • Java SerDe (would need to provide wrapping handlers for most Clojure structures to avoid leaking Clojure impl details that are liable to change across versions)
  • etc.
Collaborator

mrrodriguez commented Jan 31, 2017

Conflicts with other libraries could be a problem.

Fressian is an optional dependency. The consumer must explicitly add it if they wish to use a namespace that loads Fressian. So there is no conflict here.
If you want a reasonable Fressian default impl, you are free to use it. It'll be tested in the :dev profile of Clara against some (modern) Fressian version.

Metadata on facts is dangerous in Clara anyway ...

This is only true if you consider the metadata to be the distinguishing characteristic between two or more facts. That is certainly not the only use of metadata. Losing all metadata post-serialization can definitely lead to mysteries in the consuming application.

Performance might not be such a large concern if this ended up just being a way to demo/prototype durability and any serious usage involved writing a custom memory serializer, but of course a default that both allowed a quick demo and was performant enough for use in real use cases would be better.

I worked through implementing it with print-dup (and later print-method) at length. It was really messy. Many things are not implemented there that are typically needed when trying to use it as a full-blown serialization format. The recommendation from the community was to use print-method instead of print-dup for something like this.
print-method had many nasty edge cases and hierarchical conflicts when trying to apply this to general-purpose Clojure data.

I really think print-method or print-dup are really only practical serialization formats for transmitting code-as-data or restricted subset of Clojure data structures.

Also, the performance characteristics are really bad. Especially if you plan to have data types involved like deftype or defrecord created Classes. There is a ton of non-cached reflection done by the Clojure Reader when parsing this. It really is atrocious. :)
print-dup is even worse in terms of reflection in the reader since it attempts (but not very reliably) to preserve data types.

I really just believe this use-case is an abuse of the purpose of print-dup and print-method and would be opposed to seeing it be a default maintained and provided by Clara.

The Fressian approach seems much more practical at this point since so many of the custom . handlers that are needed are already available in the codebase. Again, this would be an optional dependency. The d/IWorkingMemorySerializer impl would be in a namespace like clara.rules.durability.fressian in the same way the d/ISessionSerializer currently is.


I'll also note, the only Clara-provided d/ISessionSerializer impl is in clara.rules.durability.fressian right now. And it requires the consumer to provide a Fressian impl.

It is mostly undesirable for a consumer to implement d/ISessionSerializer themselves since it will be fragile and tightly coupled to Clara internal structure changes. This was only done as a protocol so that
(1) Fressian could be optional
(2) Clara could offer alternatives to Fressian impl's of d/ISessionSerializer for consumers that didn't want Fressian. However, this would be at the discretion of Clara to find appropriate alternatives, like perhaps

  • Kryo (extensible and achievable),
  • Transit, Thrift, Avro (would be pretty restricted and difficult to pull off)
  • Java SerDe (would need to provide wrapping handlers for most Clojure structures to avoid leaking Clojure impl details that are liable to change across versions)
  • etc.
@WilliamParker

This comment has been minimized.

Show comment
Hide comment
@WilliamParker

WilliamParker Jan 31, 2017

Member

Regarding metadata, I think we're in agreement. You'll only have problems if you're inserting facts that are equal but with different metadata and you have expectations about which one you keep if, for example, you retract one.

To clarify, my concern with Fressian was definitely not about needing to bring it in as a dependency. I see the goals here as just to create something that is super easy for someone experimenting with Clara durability to play around with before committing to the effort needed to use durability in a real use case. In many ways an IMemorySerializer that didn't actually serialize anything, but was just an object the user could keep in the REPL, would meet this need. A serializer that was equally simple to use but could be used for real use cases would be preferable though. My main concern with Fressian was that binary would be confusing to users since the serialized facts wouldn't be human-readable. However, on further thought we could just stamp both the session and serialized memory with a UUID and throw an exception if a user tries to deserialize an exception and those don't match. This could be done in a passive way if the durability implementation only did this if the memory serializer implemented a "SerializedFactsVersion" protocol. This sort of error reporting would mitigate most of my concerns there.

Member

WilliamParker commented Jan 31, 2017

Regarding metadata, I think we're in agreement. You'll only have problems if you're inserting facts that are equal but with different metadata and you have expectations about which one you keep if, for example, you retract one.

To clarify, my concern with Fressian was definitely not about needing to bring it in as a dependency. I see the goals here as just to create something that is super easy for someone experimenting with Clara durability to play around with before committing to the effort needed to use durability in a real use case. In many ways an IMemorySerializer that didn't actually serialize anything, but was just an object the user could keep in the REPL, would meet this need. A serializer that was equally simple to use but could be used for real use cases would be preferable though. My main concern with Fressian was that binary would be confusing to users since the serialized facts wouldn't be human-readable. However, on further thought we could just stamp both the session and serialized memory with a UUID and throw an exception if a user tries to deserialize an exception and those don't match. This could be done in a passive way if the durability implementation only did this if the memory serializer implemented a "SerializedFactsVersion" protocol. This sort of error reporting would mitigate most of my concerns there.

@mrrodriguez

This comment has been minimized.

Show comment
Hide comment
@mrrodriguez

mrrodriguez Feb 1, 2017

Collaborator

So I think I understand where you are coming from. You just don't like the binary format due to its opaqueness? That certainly does make it difficult to troubleshoot, I'd agree. Too bad it isn't like Avro where you can encode as binary or JSON with a switch of encoder/decoder.

However, the data format here is just difficult to express - more difficult than something restrictive like JSON/Avro.

Collaborator

mrrodriguez commented Feb 1, 2017

So I think I understand where you are coming from. You just don't like the binary format due to its opaqueness? That certainly does make it difficult to troubleshoot, I'd agree. Too bad it isn't like Avro where you can encode as binary or JSON with a switch of encoder/decoder.

However, the data format here is just difficult to express - more difficult than something restrictive like JSON/Avro.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment