
Optimize LocalTransport retraction #183

Merged
rbrush merged 1 commit into oracle-samples:master on Apr 22, 2016

Conversation

mrrodriguez (Collaborator)

Background

In clara.rules.engine, the LocalTransport implementations of retract-elements and retract-tokens have the potential to waste quite a bit of time. The issue is not present in the forward direction: send-elements and send-tokens look to have been optimized a while back in 12f0cdb#diff-fc240c836498f0d897381182d11d3df8L88. That optimization was beneficial for batch propagation to nodes with no join bindings, and it also produced smaller map keys when doing a group-by (or our tuned variety of it).

Since retract-elements and retract-tokens are essentially the same implementation, I'll just refer to retract-elements from here on.

Problem

retract-elements always performs something of the form (platform/tuned-group-by :bindings elements), regardless of the join bindings of the nodes on which it will call right-retract. This degrades performance for three reasons that I can immediately see.
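For reference, a minimal sketch of the current retraction shape described above (memory/transport argument threading is elided, so this is illustrative rather than the literal source):

;; Every element's full :bindings map is hashed to build the group
;; keys, and each group is then propagated separately to every node.
(doseq [[bindings element-group] (platform/tuned-group-by :bindings elements)
        node nodes]
  (right-retract node bindings element-group))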

(1)
(platform/tuned-group-by :bindings elements) must hash all of the :bindings of each element, since these become the map keys. We only really need the :bindings involved in join bindings to be the keys of this map. The extra hashing can become expensive, especially when the values of the :bindings have slow hash code implementations, which is more likely when hashing :fact-bindings or :result-bindings from accumulators. In practice, we also wouldn't typically join on something like a fact binding or a large accumulated :result-binding.
e.g.

(defrule my-rule
  [?big <- BigFact]
  [?markers <- (acc/all) :from [MarkerFact]]
  =>
  (insert-unconditional! (->NewFact (get-things-from ?big) (get-things-from ?markers))))

In my (very contrived) my-rule, there are no explicit join bindings between ?big and ?markers. Because insert-unconditional! is used, we never actually need to hash the BigFact at all. However, we currently do hash it upon retraction of previously accumulated values. Note: this case is actually an issue with retract-tokens, but as I said, the implementation details are the same.

Overall this is a sneaky consequence, but we have rules of this form in practice, and calculating hash codes for our "BigFact"s can be expensive; ideally the engine should avoid it when it is not necessary. Actually using a BigFact as a real join binding would be much rarer.
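To make the hashing point concrete with my-rule (the bindings map shown is illustrative, and clojure.core/group-by stands in for the engine's tuned variant):

;; Illustrative :bindings for an element produced under my-rule:
;;   {:?big <BigFact instance>, :?markers [<MarkerFact> ...]}
;; Grouping on the full map hashes the BigFact and the accumulated
;; collection, even though neither is a join key:
(group-by :bindings elements)
;; Selecting only the join keys (none here) hashes nothing at all and
;; leaves the whole batch in one group:
(group-by #(select-keys (:bindings %) #{}) elements)
;; => {{} [...every element in one group...]}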

(2)
When a node has no join bindings, such as a RootJoinNode [1], the elements to retract are still batched by their :bindings prior to propagation across the network. This unnecessarily splits up batched propagations. Batched propagation across the network tends to yield significant performance gains, so it is good to keep these batches as large as possible. Note: other nodes may also have no join bindings. It is fairly common for rules to have independent conditions, in which case we get HashJoinNodes, AccumulateNodes, NegationNodes, etc. with no join bindings.

(3)
Related to point (2), if a node has no join bindings we do not need to perform a group-by on :bindings at all. Since group-by is hit pretty hard within the engine and has been a bottleneck before, it is good to avoid it whenever it is unnecessary.
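In sketch form, points (2) and (3) amount to the following per-node decision (join-keys comes from the node's get-join-keys; clojure.core/group-by again stands in for the tuned variant):

(let [join-keys (get-join-keys node)]
  (if (seq join-keys)
    ;; Group and hash on only the join-binding keys.
    (group-by #(select-keys (:bindings %) join-keys) elements)
    ;; No join bindings: one whole batch under empty bindings,
    ;; with no group-by (and no hashing) performed at all.
    {{} elements}))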

Proposed changes

The changes in this commit use the same pattern as send-elements and send-tokens in the retract-elements and retract-tokens paths of the LocalTransport. This should be functionally equivalent to the previous behavior, only with less potential to waste time in the retraction paths. Since all 4 of these paths are now nearly identical, I pulled the shared logic out into a helper function to avoid the maintenance burden and the risk of updating one path while missing another when they are intended to be the same.
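For illustration, a minimal sketch of the helper's shape (the helper name, its argument list, and the use of clojure.core/group-by in place of the engine's tuned variant are my assumptions, not the literal source):

;; Hypothetical shared helper for send-elements, send-tokens,
;; retract-elements, and retract-tokens. propagate-fn closes over the
;; direction, e.g. right-activate for sends vs. right-retract for
;; retractions of elements.
(defn- propagate-items-to-nodes
  [nodes items propagate-fn]
  (doseq [node nodes
          :let [join-keys (get-join-keys node)]]
    (if (seq join-keys)
      ;; Hash and group on only the join-binding keys.
      (doseq [[join-bindings item-group]
              (group-by #(select-keys (:bindings %) join-keys) items)]
        (propagate-fn node join-bindings item-group))
      ;; No join bindings: propagate the whole batch in one call.
      (propagate-fn node {} items))))

I don't think this extraction adds any overhead worth worrying about. I profiled it with criterium and the difference looks negligible: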

;; REPL setup; assumes the clara.rules API (mk-session, insert-all,
;; fire-rules, query), dsl/parse-query, and the Temperature test fact
;; are already referred into the namespace.
(def q (dsl/parse-query [] [[Temperature (= ?t temperature)]]))
(def tn 10000)
(def temps (mapv #(->Temperature % "MCI") (range tn)))
(def s (mk-session [q]))

(require '[criterium.core :as cc])

;; criterium for insert
(cc/bench
 (let [inserted (insert-all s temps)]
   (-> inserted
       fire-rules
       (query q)
       count)))

;;;; BEFORE
Evaluation count : 11880 in 60 samples of 198 calls.
             Execution time mean : 5.345061 ms
    Execution time std-deviation : 561.303402 µs
   Execution time lower quantile : 4.994397 ms ( 2.5%)
   Execution time upper quantile : 6.851082 ms (97.5%)
                   Overhead used : 1.997513 ns

Found 9 outliers in 60 samples (15.0000 %)
    low-severe   9 (15.0000 %)
 Variance from outliers : 72.0634 % Variance is severely inflated by outliers
nil

;;;; AFTER

WARNING: Final GC required 7.352561061504426 % of runtime
WARNING: Final GC required 24.650959898300002 % of runtime
Evaluation count : 114 in 6 samples of 19 calls.
             Execution time mean : 5.411021 ms
    Execution time std-deviation : 204.595005 µs
   Execution time lower quantile : 5.251963 ms ( 2.5%)
   Execution time upper quantile : 5.743773 ms (97.5%)
                   Overhead used : 1.997513 ns

Found 1 outliers in 6 samples (16.6667 %)
    low-severe   1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil

I don't necessarily have any great tests to add, since the functionality is the same. A benchmark may be a good idea; however, this isn't really a good scenario for a Drools vs Clara analysis, since it is much more a detail of Clara's implementation. I can, though, demonstrate in the REPL a significant difference in how the old behavior hurts batched retraction propagation.
The following uses the same setup as the criterium benchmark above. I didn't use criterium here because the BEFORE version takes 30+ seconds while AFTER takes less than 3 seconds, and running it manually many times yields similar numbers anyway. Profiling shows that in the BEFORE case many small changes ripple through the (in this case small) network, while AFTER a single change is propagated in batch across the nodes of the network.

(let [inserted (insert-all s temps)
      retracted (time (apply retract
                             inserted
                             temps))]
  [(-> inserted
       fire-rules
       (query q)
       count)
   (-> retracted
       fire-rules
       (query q)
       count)])

;;; BEFORE

"Elapsed time: 34107.628297 msecs"
[10000 0]

;;; AFTER

"Elapsed time: 2390.955789 msecs"
[10000 0]

Results

When I ran with these changes in place in our production environment, overall time was slightly better than before. What was significant, though, was that our worst-case scenarios improved substantially. In our environment we identify when a particular rule session against a particular dataset runs much longer than others, and I've seen a significant reduction in those showing up as long-running now.

Furthermore, the issue I mentioned in point (1) regarding excessive hashing of large objects is even more critical in some workflows we are working on, where we need quicker one-off runs of the rules for a given dataset.

Notes

[1] In RootJoinNode's case I think (get-join-keys [node] binding-keys) will always return an empty set, because only join bindings seem to be included there by clara.rules.compiler. The RootJoinNode shouldn't have any joins since it is the root "dummy" node starting the beta network.

rbrush (Contributor) commented Apr 22, 2016

Looks good to me, and thanks for the detailed analysis and explanation. I hadn't anticipated such a load on retraction, but this optimization definitely makes sense (and you reduced the total code along the way!)

rbrush merged commit e67cfed into oracle-samples:master on Apr 22, 2016
rbrush (Contributor) commented Apr 22, 2016

I went ahead and merged it since it seems solid. Thanks!
