Serialize and/or persist working memory #16

Closed
rbrush opened this Issue Sep 28, 2013 · 16 comments

rbrush (Collaborator) commented Sep 28, 2013

From the Google Group discussion at [1], add the ability to serialize and/or persist the working memory of a session.

Internally, the working memory is simply a set of Clojure data structures (maps, lists, and vectors), and facts are Clojure records, so a simple first version could expose this data structure and let the caller use EDN (or an optimized binary equivalent) to persist and transport it.

[1]
https://groups.google.com/forum/#!topic/clojure/pfeFyZkoRdU
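
Since facts are ordinary records, the EDN round trip is straightforward to sketch. The Temperature fact type and the user namespace below are assumptions for illustration; note that reading a record back requires registering a reader for its tag:

(require '[clojure.edn :as edn])

;; Facts in Clara are ordinary Clojure records.
(defrecord Temperature [value location])

;; Printing a record yields a tagged literal such as
;; #user.Temperature{:value 80, :location "MCI"}
(def serialized (pr-str (->Temperature 80 "MCI")))

;; Reading it back requires a reader for the record's tag.
(def restored
  (edn/read-string {:readers {'user.Temperature map->Temperature}}
                   serialized))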

rbrush (Collaborator) commented Nov 8, 2013

Some thoughts on how I'm planning to approach this one. We should be able to represent working-memory updates as a change log: a sequence of updated facts, accumulator values, and pending rule activations. An entire working memory could be persisted by scanning it and creating a change log from scratch, and later changes could simply be appended, giving us a way to quickly persist updates to the working memory as they occur.

This pattern could be quite powerful, allowing us to move through time via the change log. Also, I plan on designing the data structure so it can be persisted to things like Kafka's compacted logs [1], which means we could run rules in a distributed environment, persist rule state to the logs, and reload from those logs in the event of failure.

[1]
https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction
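
To make the idea concrete, here is one possible shape for such a log; the entry keys and ops are illustrative assumptions, not Clara's actual internals:

;; Hypothetical change-log entries; names are illustrative only.
(def change-log
  [{:op :insert  :fact {:type 'Temperature :value 80 :location "MCI"}}
   {:op :retract :fact {:type 'Temperature :value 75 :location "MCI"}}
   {:op :accum-result :node-id 3 :result 80}
   {:op :pending-activation :rule 'high-temp-alert :bindings {:?t 80}}])

;; A full snapshot is then just a log that rebuilds memory from
;; scratch, so restoring reduces over the entries.
(defn apply-change [memory {:keys [op fact]}]
  (case op
    :insert  (update memory :facts (fnil conj #{}) fact)
    :retract (update memory :facts disj fact)
    memory)) ; accumulator/activation ops elided in this sketch

(reduce apply-change {} change-log)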


eslick commented May 18, 2014

I've been toying with building a distributed rules system on top of Immutant, and Clara appears to be a fantastic starting point. My overall system happens to use Cassandra, which provides a nice facility for giving row values a time to live (TTL), so we can have a transaction log with finite storage consumption coupled with periodic complete snapshots of working memory kept on a longer-term TTL. I also have Datomic in the mix, but I'd have to think about how it would fit in.

The transaction logs can be kept for a few weeks or even months for forensic purposes, but a complete snapshot plus the last hour/day of logs is enough to recover in the event of a node loss. There are some interesting coordination issues to consider in the distributed case, so that snapshots are coherent and logs from multiple nodes can be replayed properly, especially if cluster sizes (and thus hash distribution) change over time.
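
As a sketch of that recovery path, assuming snapshots and log entries share a monotonic :ts timestamp, storage details (Cassandra TTLs and such) are handled elsewhere, and apply-change is the hypothetical function from the earlier sketch:

(defn recover
  "Rebuild a working memory from the latest snapshot plus the log tail."
  [latest-snapshot log-entries]
  (let [tail (filter #(> (:ts %) (:ts latest-snapshot)) log-entries)]
    (reduce apply-change (:memory latest-snapshot) tail)))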


rbrush (Collaborator) commented May 19, 2014

Ian, that does sound like an interesting problem. Adding support for this sort of thing via a change log and snapshot feature is something that I've wanted to tackle for a while, but more immediate needs kept pushing it down the list. But this problem makes me want to get support for it going.

I think we could get a simple transaction log and snapshot going reasonably easily, although early passes likely won't be passive between releases. I really want to be able to use Kafka's self-compacting logs as an option to back this, but haven't figured out quite how to achieve that yet. If nothing else we could create a very simple change log and get back to Kafka down the road.
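
For context, the constraint Kafka compaction imposes is that records must be keyed so the latest record per key captures the full state for that key, with a nil value acting as a tombstone. A rough sketch of mapping change entries onto such records; the key and value shapes here are assumptions, not a settled design:

;; Turn a change-log entry into a keyed record suitable for a
;; compacted topic. Retractions become tombstones (nil values), so
;; compaction eventually drops the retracted state.
(defn change->kafka-record [{:keys [op node-id fact] :as change}]
  {:key   (str node-id ":" (hash fact)) ; stable key per memory element
   :value (when (not= op :retract) change)})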



eslick commented May 19, 2014

Durability is a central feature of the system we're building, both for fault tolerance and for operational purposes such as rolling upgrades. Clara would be receiving a stream of system events and producing new workflow jobs, alerts, or log information based on productions. The product features this would support are on deck for this summer, so while we won't be able to contribute directly to Clara for a few weeks at least, I'd be game to integrate a first-pass attempt and/or submit some pull requests on top of an initial effort once we free up.

I've been scanning the source in my free time and am prototyping some of our features on our dev system in the meantime. Is there a preferred place to collect requirements and design ideas for change logs and snapshots?

I'm also curious how this will interact with a distributed version of Clara; it's not yet clear to me how to map what you did with clara-storm to our environment. An early use case of scanning our debugging streams for interesting cases probably doesn't require durability, but would benefit from being distributed given the volume of data we produce across the cluster.


rbrush (Collaborator) commented May 19, 2014

For the time being, I'd suggest just posting any design ideas to this issue. Once we get an idea of how to approach the problem we can break it up into sub-issues. (I would be open to creating a Google Group for Clara if that seems better.)

Regarding distributed versions of Clara: I think we should focus the snapshot and change log on a single in-process working memory for now. This would actually be used to support a future clara-storm, which really needs durability before it can be considered for production use. A truly distributed working memory with durability is a bigger project, but many problems could be solved with many durable, single-process working memories running on many machines, as opposed to a unified, distributed working memory. This assumes you can partition your data/streams accordingly.

By the way, clara-storm isn't much more than an experiment at this point, but getting a change log will be a step toward making it real.
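
As a sketch of that partitioned approach, where each fact is routed to one of N independent sessions by a partition key (the :entity-id key and the routing scheme are assumptions for illustration):

(require '[clara.rules :as rules])

(defn partition-of [fact n]
  (mod (hash (:entity-id fact)) n))

(defn route-insert
  "Insert fact into the session owning its partition; sessions is a vector."
  [sessions fact]
  (update sessions (partition-of fact (count sessions)) rules/insert fact))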


eslick commented May 19, 2014

So I think the big issues are:

  • Log event generation
    • Sessions optionally accept an entity that implements an IDurableLog protocol? Your names may be better than mine! (A rough protocol sketch follows after this list.)
    • A seq of node EDN transitions for all downstream consequences of an insert/retract or fire-rules? i.e., each top-level command is transactional?
    • Or is it enough to simply log the top-level insert/retract/fire-rules arguments before executing them? (Probably not if we want to support distributed operation, where node updates are forwarded to foreign nodes that may not have the initial insert context available.)
  • Snapshot generation (optional if you are using compaction logs)
    • IDurableSnapshot (snapshot, restore).
    • How do we correlate a snapshot record to an equivalent node if there is a new graph or the graph was regenerated? Are ids identical across compilation passes?
    • Can generate in the background using an older session, so long as we know the timestamp of the last log entry that preceded the snapshot.

Re: Distributed. I think we're on the same page. We would have N copies of a local rule system, but break up which ones maintain the working memory for a subset of nodes via the transport's routing? If a node fails, we simply need to have a node recover from that node's snapshot + logs, either into an existing process or into a fresh one.
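
A rough sketch of what those protocols might look like; the names come straight from the bullets above and are placeholders, not a committed API:

(defprotocol IDurableLog
  (append-changes! [this changes]
    "Durably append a seq of working-memory transitions.")
  (read-changes [this from-ts]
    "Return all logged changes with a timestamp >= from-ts."))

(defprotocol IDurableSnapshot
  (snapshot! [this session ts]
    "Persist the full state of session as of log timestamp ts.")
  (restore-latest [this]
    "Return [session-state last-ts] for the most recent snapshot."))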


rbrush (Collaborator) commented May 20, 2014

We should be able to model the event log as a collection of state changes (insert/retract/accumulate result/activate rule) for this case. I'll probably model the "snapshot" as a sequence of the same state changes...enough to go from an empty working memory to the full state. This should be efficient and solve both the change log and the full snapshot using the same model.

I'm going to punt on exposing durability with distributed working memory, but solve it for in-process memory in a way that a distributed system can leverage. It will probably be as simple as:

(let [change-seq (get-changes my-session)] 
  ;; Write change seq to the log of the user's choice.
)

And to restore a session, the user simply loads the sequence of changes from the store and replays it:

(let [restored-session (load-changes change-seq)]
  ;; Use restored session.
  )

This way the interface to Clara is very simple, and consuming code can simply write the sequence to whatever underlying storage is appropriate. A couple other functions will be necessary -- such as clearing the change log from the working memory after we write it -- but it should be straightforward.

There are a couple tricky pieces internally if we want to do this efficiently. We could trivially just store and re-play every single insertion, but this would cause multiple rule firings which would be a problem if that triggers side effects. So we'll need to plug into the internals a bit. I'll spend some time looking into how we can do so.
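
Under that proposed API, wiring it to a concrete store could be as small as the following sketch; get-changes and load-changes are the hypothetical functions above, and a real implementation would also need EDN readers for the fact record types:

(require '[clojure.edn :as edn])

(defn save-session! [session path]
  (spit path (pr-str (get-changes session))))

(defn restore-session [path]
  (load-changes (edn/read-string (slurp path))))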


rbrush (Collaborator) commented May 22, 2014

I stubbed out an idea of what the API for this might look like in this gist.

These changes won't be trivial, since we'll need to plug into the internals of Clara to track the information we need. I do feel like the approach is promising, and by simply exposing the changes as sequences we can let the caller use whatever durable mechanism is at hand. I can imagine providing durable log implementations as well, but I'm scoping that out of this issue.

One side note: I'm going to be traveling for a few days so I probably won't be jumping into this until at least next week.


eslick commented May 29, 2014

Interesting idea from a related question asked on the Datomic list recently...

"One far fetched idea I had was to use datomic as the working memory component within Clara... it could be any db instead of datomic but I think datomic would be a better fit - you might look Into writing an alternate IWorkingMemory protocol implementation that uses datomic... YMMV. I still think this could hold some promise but don't have the bandwidth to pursue it by myself."


eslick commented May 29, 2014

In the replay, I can see how you get back to a coherent Rete state, but the side effects are a fundamental issue if you are, for example, issuing alerts. Wouldn't the RHS functions need to either restrict themselves to inserting facts or be able to inhibit side effects on replay?


rbrush (Collaborator) commented May 29, 2014

Yes, we wouldn't trigger the RHS during replay for exactly those reasons. I was planning on handling this by explicitly storing pending RHS activation records as part of the persisted state, and restoring those pending activations during the replay. The caller could then invoke (fire-rules), and only the pending activations would trigger, as expected.

In short, the rule behavior of a persisted and replayed session should be indistinguishable from the same session that wasn't persisted/replayed.

Regarding Datomic: I can definitely see using it to persist the working memory. We should make the data store pluggable, but Datomic could be as good an option as any. I'll need to learn more about Datomic before I could talk specifics, though.
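
In code, the contract described above would look roughly like this; restore-session and change-seq are the hypothetical names from the earlier sketches, while fire-rules is Clara's existing entry point:

(require '[clara.rules :as rules])

;; Restoring replays state only: no RHS runs, no side effects.
(def session (restore-session change-seq))

;; An explicit fire-rules then triggers only the restored pending
;; activations, just as the original session would have.
(rules/fire-rules session)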


rbrush (Collaborator) commented May 31, 2014

Added #57 as a first step in this direction, so we can detect state changes that need to be persisted.


@rbrush rbrush added this to the 0.6.0 milestone May 31, 2014

rbrush (Collaborator) commented Jun 9, 2014

A minimal but working version of durability is on the branch linked below. It basically just exposes methods to get a session state as a structure that can be written to the data store of the user's choice via serialized EDN or Fressian. Feel free to start experimenting with this, with the understanding that it's likely to change non-passively prior to release. Some unit tests cover simple use cases, but this testing isn't very robust at this point either.

The next step is to address some edge cases in the current implementation and build on listener functionality in issue #57. This will allow the caller to get a "diff" of the session state as compared to a prior save point. This enables writing efficient write-ahead logs of state changes of Clara sessions.

https://github.com/rbrush/clara-rules/tree/issue-16
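
For anyone experimenting, a minimal sketch of round-tripping such a state structure with Fressian; the session-state map is a placeholder for whatever structure the branch exposes:

(require '[clojure.data.fressian :as fress])

(def session-state {:facts #{{:type 'Temperature :value 80}}}) ; placeholder

(def buf (fress/write session-state)) ; returns a java.nio.ByteBuffer
(def restored (fress/read buf))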


@rbrush rbrush closed this Jun 9, 2014

@rbrush rbrush reopened this Jun 9, 2014

rbrush (Collaborator) commented Jun 9, 2014

Disregard this issue being closed. User error on my part; the development branch above is a work in progress, so this issue remains open until it is complete.


rbrush (Collaborator) commented Jul 25, 2014

Alright, so the above commit gets us to the point where we can persist complete snapshots of sessions. I'm going to use issue #38 to track incremental changes for a transaction log, so this should wrap up what I wanted to accomplish in this issue.

I'll probably close this issue in the next couple days in preparation for the 0.6.0 release unless there are objections. We can track the related changes on #38 or on new issues as needed.

A draft of documentation for durability is here: https://github.com/rbrush/clara-rules/wiki/Durability


rbrush (Collaborator) commented Jul 27, 2014

Closing issue per the above comments. Feel free to track #38 or open other issues for related items.


@rbrush rbrush closed this Jul 27, 2014
