Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 255: Provide better feedback when constraints throw exceptions #256

Closed

Conversation

carlosphillips
Copy link
Contributor

@carlosphillips carlosphillips commented Jan 7, 2017

Please see issue #255.

([exp-seq equality-only-variables]
(compile-constraints exp-seq equality-only-variables 0))
([exp-seq equality-only-variables constraint-idx]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like constraint-idx is ultimately part of the exception thrown in throw-constraint-exception. I don't see it being used used, am I overlooking something? It also seems to me like including the constraint itself in the exception thrown provides a superset of the information that the constraint-idx does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the original implementation published constraint-idx and the node reconciled it. There is no need for constraint-idx now.

(if next
(recur (into rest (:children next)) (conj flattened next))
(->> flattened
(keep :production)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constraints could be from either rules or queries; queries won't have the :productions key. This won't pick up query names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let me look at that.


(defn node-rule-names
[children]
(loop [[next & rest] children flattened []]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but this would probably be a little cleaner if you used something like tree-seq to handle the tree traversal instead of implementing it yourself. https://clojuredocs.org/clojure.core/tree-seq

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, I didn't know that tree-seq existed!

exp (:constraint-exp error)
rule-names (node-rule-names children)
condition (or (:condition node)
(-> children first :condition))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second branch of this or is for an AlphaNode, correct? I'm not sure the children will all have the same condition, I'll have to think on it a bit more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Let me know. This or expression is a bit opaque here anyway.

(sort)
(map #(if (nil? %) unnamed-rule-msg %))
(map #(str " " %))
(clojure.string/join "\n"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: "clojure.string" is aliased as "string" in the namespace header in this file. I suggest that we use the alias here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

"when processing fact"
(str " " (pr-str fact))
bindings-description])]
(binding [*out* *err*] (println message))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason to directly print a message here instead of throwing the exception to the caller and letting them decide what to do with it (print it out, log it somewhere, etc.)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, mostly just to demo the feature. Will remove. In emacs, the error message is formatted strangely in the stack trace, though.

(str " " (pr-str fact))
bindings-description])]
(binding [*out* *err*] (println message))
(throw (Exception. message (:exception error)))))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think building the exception like this will work in ClojureScript; I think this relies on JVM interop. If you look at the Travis build you'll see the message "WARNING: Use of undeclared Var clara.rules.engine/Exception at line 444 /home/travis/build/cerner/clara-rules/src/main/clojure/clara/rules/engine.cljc" and a similar one for *err*.

My suggestion is to use ex-info. Basically, Clojure and ClojureScript define their own exception type that carries a payload of user-defined data. This user-defined data can then be inspected in a REPL when the exception is thrown during development. I created a quick gist demonstrating this functionality at https://gist.github.com/WilliamParker/d033cb89ea6ac996115e1712adf5a56a.

The docs on ex-info are at https://clojuredocs.org/clojure.core/ex-info

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that makes sense. Thank you for the demo. I'll change to ex-info.


The returned expression evaluates to a two item vector, [bindings error], which contains
the evaluated bindings in the first argument and a map with information on any exception
that might have been thrown by evaluating a constraint as the second vector entry."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the docstring.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for being slow here...I had been traveling and then got swamped with other items.

Overall I think this looks good. One thing that's not clear to me: rather than returning a [bindings error] tuple, could the compiled function simple return the bindings (as it did before) or throw a specific type of an exception that the caller could handle when an error occurs? The thrown exception could include the same information as seen in the error in the returned tuple, so we wouldn't lose anything.

It's probably not a huge deal, but throw/catching a specialized exception here would save us from creating and de-structuring an extra for every time we run our constraint expression.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know Carlos is busy with other commitments so I'm happy to help make these adjustments, but first I just wanted to understand if my reasoning is valid.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the cost of using the error/bindings tuples, that is a good point @rbrush and I agree. Obviously the cost of instantiating and destructuring the vector is small but it could add up. We've seen some very tiny optimizations in our rule constraints that end up being part of the AlphaNode filters in particular be noticeable before.

I think the reason this was implemented this way is that it derives a lot of information about the context in which the exception was thrown - i.e. the entire condition, the productions(rules/queries) that the constraint is a part of, etc. by examining the rule network at the time the exception is thrown. In principle this information could be determined at compilation time rather than re-derived but it looked to me like this would involve significant development work in the compiler and since this analysis is only done when an exception is thrown, and thus should not be a performance problem, I suggested that we stick with deriving the information at runtime.

However, on further examination this does not preclude the approach you suggest. Take part of the code from the current PR:

(defn- alpha-node-matches
  [facts env activation node]
  (for [fact facts
        :let [[bindings error] (activation fact env)]
        :when (or bindings error)] ; FIXME: add env.
    (if error
      (throw-constraint-exception error node fact env)
      [fact bindings])))

The throw-constraint-exception function is called with error, node, fact, and env. The contents of error are derived from fact and env by the activation call. Therefore, if we called the activation with the node in addition to its existing arguments it could derive all this information itself and then throw the exception itself. I don't think this would be too difficult to do; I would anticipate that we could just keep the existing infrastructure Carlos has built to gather the information, format the messages, etc. and just call it elsewhere i.e. in the compiled filter or join function.

Regarding throwing a specialized exception, are you suggesting that we implement our own exception class? If so I suggest that it implement IExceptionInfo, which would not be too difficult to do. I've found the ability to inspect data relevant to an exception programmatically at the REPL to be quite valuable. However, I'm not sure this is something that adds much value. It isn't obvious to me why a consumer would only want to catch such a specialized exception type as opposed to a more general exception type. For example, obviously I'd hope this doesn't happen, but if Clara itself threw an exception due to a bug I'd think a consumer would want to handle it in much the same way as an exception from a function call in their rule constraints. I'm not opposed to making our own type if we land on use-cases for it though, it just seems somewhat premature to me if we don't know of a need for it yet, especially since this would be platform-specific work i.e. different between Clojure and ClojureScript.

I agree that especially since Carlos has other commitments we should be willing to get this over the finish line. This sort of error handling is likely to be very beneficial. Regarding the procedural aspects of that I'm not quite sure what form that would take; I do want to be sure that the Git history ends up appropriately crediting Carlos for this significant piece of work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again I think I may have misinterpreted parts of @rbrush 's comment. We could do something like

;;; compiler
(compile-constraints
 (try constraint-stuff
      (catch Exception e
        (throw (ex-info "message" {:exception e
                                   :constraint-exp val-here
                                   :bindings val-here
                                   :env env
                                   ::constraint-exception true})))))
;; engine
(defn- alpha-node-matches
  [facts env activation node]
  (for [fact facts
        :let [bindings (try (activation fact env)
                            (catch #?(:clj Exception :cljs :default) e
                              (if (-> e ex-data :clara.rules.compiler/constraint-exception)
                                (throw-constraint-exception
                                 (-> e ex-data
                                     (dissoc :clara.rules.compiler/constraint-exception)
                                     node
                                     fact
                                     env))
                                (throw e)
                                ;; Or add some info in the else if we want but that isn't necessary for this enhancement.
                                )))]
        :when bindings] ; FIXME: add env.
    [fact bindings]))

I think this is what Ryan was actually getting at, but he can speak to whether my first interpretation or this one is correct, or if I was misinterpreting both times. :) This sort of approach is probably even easier than what I suggested before. The one thing I did change here is to use a key in an ex-data map to dispatch on whether the exception was thrown from a constraint rather than using a different exception type which I think is probably easier than needing to define a new exception type for both Clojure and ClojureScript.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you're right about warnings in users' builds, not a good idea.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this some thought and I think I see a way out here.

Steps:

  1. We can just eliminate the try-catch in build-constraints if we're compiling ClojureScript.
    In essence, this would bring us back to the previous behavior in ClojureScript while keeping the improvement in Clojure.
  2. We could then add an option to add such try-catch blocks for ClojureScript that would primarily be used at development time (and would create unreachable code warnings in some cases).
  3. We can then add information to the "else" case where there the exception isn't from a constraint
    in the engine. The error message could specify that the exception might be from either Clara itself or user code if that is desired. If the session is a ClojureScript one the message could reference the option created in (2) above.

Normally I'd be pretty reluctant to add the complication of another user-facing option, but this one would have the nice property that anyone who needs it can have it very directly pointed out by the exception message, and the user can then recompile with that option on if they want more information. For immediate purposes, I'd be fine with just implementing (1) for now and logging work to do (2) and (3).

Thoughts? @rbrush @carlosphillips

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A variation of Option 1 would be to keep the outer, node-level try/catch in ClojureScript, and the fine-grained, per-expression try/catch in Clojure. This way ClojureScript users could at least see the entire expression that threw an exception (although not the specific constraint.)

I'd be in favor of that variation, or Will's Option 1 as he described it as well. I think it's okay that we offer slightly different error reporting on these platforms based on their capabilities, especially since the overall flow is the same.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbrush to clarify, that wasn't exactly an options list. 1, 2, and 3 above are proposed sequential pieces of work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WilliamParker Gotcha. What I meant was I'd be fine with step 1 for now, even if we deferred 2 and 3 for ClojureScript indefinitely.

@WilliamParker
Copy link
Collaborator

I think we'll want some tests on this, at least enough to validate that we aren't suppressing exceptions. My suggestion would be to use the functionality in test_rules.clj for testing that an ExceptionInfo with the expected data is thrown. Testing such data tends to be less brittle than testing exception messages (i.e. asserting the content and formatting of a string) directly in my experience.


(defn throw-constraint-exception
[error node fact env]
"Adds a useful error message when executing a constraint node raises an exception."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc strings go before the arg vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

carlosphillips added a commit to carlosphillips/clara-rules that referenced this pull request Jan 10, 2017
@carlosphillips
Copy link
Contributor Author

I pushed a changeset trying to address all of the recommendations. Let me know whether this was the right way to proceed?

@WilliamParker
Copy link
Collaborator

@carlosphillips We prefer to have a single commit in the history for each enhancement, bug fix, etc. Could you squash your commits for this together into one commit and update the pull request? I'll plan to look at the changes sometime over the next couple days.

@carlosphillips
Copy link
Contributor Author

Ok, I have squashed the commits and pushed it through.

condition (or (:condition node) ; beta node
(-> node :children first :condition) ; alpha node
(-> node :children first :accum-condition) ; accum node
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked over the compiler and I believe the strategy of just getting the condition for an alpha node off the first child is mostly valid. Specifically, to-alpha-graph groups structures used to create AlphaNode instances by condition [1] so the children of an AlphaNode should all have the same condition if I'm understanding this correctly. The only exception I see is that it is possible to construct a scenario where an AlphaNode has some combination of accumulator children, negation children, and children that are neither, and the condition you're adding in the message includes the extra information that accumulator and negation conditions have. This happens since to-alpha-graph uses the condition after it is transformed in condition-to-node, specifically these lines [2], to group the beta nodes. My thinking would be to resolve the issue by just using the value that condition-to-node uses for each condition type since the constraints are what is throwing an exception anyway and the rule and query names provide the necessary information to find the original productions. I'm also not too concerned if we just say that it is a weird edge case for now and leave dealing with it for a future enhancement since this is just for error reporting, this sort of case probably isn't going to come up often in practice, the rule names and query names in the exception provide the necessary information in any case, and the current error reporting for errors in constraints is quite poor so this would be an improvement.

;; This session has one AlphaNode for Cold, which has ordinary, accumulator, and negation children.
(let [q1 (dsl/parse-query [] [[?cs <- (acc/all) :from [Cold (< temperature 0)]]])
        q2 (dsl/parse-query [] [[Cold (< temperature 0)]])
        r (dsl/parse-rule [[:not [Cold (< temperature 0)]]]
                          (insert! (->Hot 100)))]
    (mk-session [q1 q2 r] :cache false))

It looks like this doesn't handle AccumulateWithJoinFilterNode though since those conditions will be under :accum-condition, not :condition, although the case where it is the child is handled. This could handled with a 4-way or like

(or (:condition node)
      (:accum-condition node)
       (-> node :children first :condition) ; alpha node
       (-> node :children first :accum-condition) ; accum node)

However, my thinking here is that it would be cleaner for each relevant node to implement a protocol defining how to retrieve its condition rather than covering each case here. Something like

(defprotocol IConditionNode
       (get-condition [this]))

(defrecord AlphaNode ..... IConditionNode (get-condition [this] (-> children first get-condition)))

(defrecord AccumulateNode ... IConditionNode (get-condition [this] (:from accum-condition)))

(defrecord NegationNode ... IConditionNode (get-condition [this] (second condition)))

and so forth for the relevant node types, and then throw-constraint-exception could just call that protocol function.

  1. https://github.com/cerner/clara-rules/blob/0.13.0-RC7/src/main/clojure/clara/rules/compiler.clj#L1429
  2. https://github.com/cerner/clara-rules/blob/0.13.0-RC7/src/main/clojure/clara/rules/compiler.clj#L827

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up implementing a couple of small protocols and a couple of new functions based on your comments. We can now handle multiple different conditions containing the same constraint. Here's an example output.

Constraint exception.
Exception raised during execution of constraint
  (> 1 rate)
when processing fact
  #msg_test.RateToUSD{:currency "CAD", :rate nil}
with no bindings

Conditions:

1. [msg_test.RateToUSD (> 1 rate)]
   rules:
     rule1-name
     rule3-name
   queries:
     query1-name

2. [:?wind <- (clara.rules.accumulators/all) :from [msg_test.RateToUSD (> 1 rate)]]
   rules:
     rule2-name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still am not handling TestNode properly, but I'll probably need someone else to take that over. It's getting harder to find free time.

(assert-ex-data {:constraint-exp '(> temperature 0)
:bindings {:?location "MCI"}
:fact temperature-fact
:queries ["temperature-query1" "temperature-query2"] ; ordering may change with implementation details
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're ordering these names in node-rule-names I don't think the ordering actually will change. Consistency is good from a user-facing point of view in general I think, and these names are used in the error message, so +1 to sorting. There isn't any logical ordering to the names though so I think we could use a sorted-set here. That would make this testing (and any user code interacting with the data returned in the ExceptionInfo) more straightforward in that it wouldn't need to care about the ordering. Sorted-set is also nil safe.

user=> (sorted-set "world" "hello" nil)
#{nil "hello" "world"}
user=> (sorted-set nil "hello" "world")
#{nil "hello" "world"}
user=> (= (sorted-set nil "world" "hello") #{"hello" "world" nil})
true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

(assert-ex-data {:constraint-exp '(> temperature 0)
:bindings {:?location "MCI"}
:fact temperature-fact
:queries ["temperature-query"] ; ordering may change with implementation details
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ordering won't change a one-element vector; copy-paste error I presume? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

:bindings {}
:fact temperature-fact
:queries ["temperature-query"] ; ordering may change with implementation details
:productions []}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you have a reason to omit the :condition from the assertion here or was this just an omission?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition structure was too brittle and I worried that it would easily in the future. Not a problem with the new protocols.

@WilliamParker
Copy link
Collaborator

WilliamParker commented Jan 12, 2017

I've looked over the changes @carlosphillips . There were some outstanding issues with how this is retrieving conditions from nodes and a couple of other minor things that should be quick to fix. Otherwise +1 from me. On retrieving conditions, I'd like to resolve the issue with getting the condition off of AccumulateWithJoinFilterNode but don't personally mind if we leave the edge case of dealing with alpha nodes with children with different conditions for another time; I don't know others views on that though. Note that the conditions would only be different in being ordinary conditions, negations, or accumulation; the constraints would the same. From previous comments it sounds like @rbrush may want to look over this as well before merging.

@carlosphillips carlosphillips force-pushed the constraint-messages branch 4 times, most recently from 54aeef5 to 4ec1807 Compare January 16, 2017 03:19
@carlosphillips
Copy link
Contributor Author

Hi guys, thank you for having a look.

WilliamParker
WilliamParker previously approved these changes Jan 16, 2017
@WilliamParker
Copy link
Collaborator

A couple notes on the latest changeset:

  1. The comment here isn't accurate:
;; Protocol for getting a node's condition expression.
(defprotocol ITerminalNode
  (terminal-node-type [this]))

This comment is identical to the comment on IConditionNode:

;; Protocol for getting a node's condition expression.
(defprotocol IConditionNode
  (get-condition-description [this]))

but these nodes aren't actually doing the same thing. terminal-node-type returns a tuple of a type keyword and and a name, not a condition. I don't mind just fixing this myself after a merge if you don't get to it before then.

  1. I tried this out in my Emacs REPL and I also got weird formatting on the message string, although when I directly printed out the exception string with
(-> *e .getMessage println)

it looked the way I'd expect. I'm not sure whether something is off in the string building here or if something in my development tooling is misbehaving, but in any case this sort of thing shouldn't block merging IMO since useful information with strange formatting is far better than a lack of useful information, which is the current state.

  1. You mentioned that TestNode isn't being handled, but I wouldn't expect TestNode to be a direct child of an AlphaNode so an AlphaNode wouldn't get its conditions from TestNode. TestNode corresponds to a :test condition in a rule like
(defrule cold-rule
    [Temperature (= ?t temperature)]
    [:test (< ?t 0)]
    =>
    (insert! (->Cold ?t)))

I'd consider adding useful error information to TestNode to be a separate, although related, issue that shouldn't block merging this.

In short, these changes look good to me and I don't see any further blockers to merging. I'll hold off on merging for now though to give @rbrush a chance to look at this if he wants to do so.

@carlosphillips
Copy link
Contributor Author

I'm getting the same thing as you in the Emacs REPL. I'm not sure how to fix that. I did fix the comment.

@carlosphillips
Copy link
Contributor Author

Btw, you provide excellent feedback and it has been a pleasure working with you.

(defn- alpha-node-matches
[facts env activation node]
(for [fact facts
:let [[bindings error] (activation fact env)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above comments...rather than returning a [bindings error] tuple, I wonder if we could just throw a specific exception we'd only catch for this case that contains the error, saving us the creation/destructuring of a tuple that only helps in an exception case.

@rbrush
Copy link
Collaborator

rbrush commented Jan 23, 2017

One more general comment on this: we need to make sure the presence of a try/catch for each constraint doesn't hurt performance. try/catch is cheap on the JVM, but this flow here could add tens of millions of flows through try/catch blocks for some of our workloads.

If there is a significant performance penalty to this, we might revert to having the outer, node-level try/catch. It won't tell you the exact sub-expression, but it will at least get you to the right top-level constraint.

@rbrush
Copy link
Collaborator

rbrush commented Jan 23, 2017

Another thing we could do when eval'ing the constraint is to attach metadata to each expression in the constraint that includes a line number. So a constraint like:

 [Temperature (>temperature 40) (= "LAX" location)] 

could generate code where (>temperature 40) is line 1, and (= "LAX" location) is line 2, which would show up in the exception. We could then catch an exception on each node, and look at the line number in the stack trace to determine exactly which expression threw it.

Of course, this flow would only work on the JVM, but is a way to get back to the exact expression that threw the exception without needing try/catch blocks everywhere. Something we could consider if we're worried about code generation size or performance here.

@WilliamParker
Copy link
Collaborator

WilliamParker commented Jan 24, 2017

@rbrush Regarding the performance of try-catch, I did some reading on the subject and it doesn't sound like we should expect much if any performance impact. It seems that when the JVM throws an exception in any bytecode instruction, it looks up the handler for the exception in a table that is indexed on the range of bytecode instruction positions for which the handler should be used. The program counter, the position of the bytecode instruction the VM is currently on, is used to look up the needed handler in this table. Thus, adding try-catch catch shouldn't add additional bytecode instructions to the main path. It would add to the total memory taken by compiled classes, but my expectation would be that in the environments Java typically runs in (i.e. server-size machines) this probably isn't a major issue. I suppose it could be an issue if anyone were trying to run huge rulesets in a highly memory-limited JVM environment but I'm not aware of any such use cases.

It does sound like the introduction of more finely-grained try-catch blocks increases the complexity of JIT optimizations that reorder bytecode instructions at runtime, and thus could reduce them. I'd expect any such impact to be heavily dependent on the exact constraints, so it will be hard/impossible to prove that such impacts don't exist and in any event would likely be dependent on the VM implementation, possibly OS and processor architecture, etc. I'll go ahead and test Clara with these changes against one our workflows though and validate that performance doesn't significantly worsen.

Sources:

@WilliamParker
Copy link
Collaborator

@rbrush I'm very surprised at this, but I tried this change against one of our workflows that I know taxes the alpha network extremely heavily and it seems this results in double-digit percentage increases in runtime for this workflow. Relating to the conversation above, I'm guessing this has something to do with hitting less optimal paths in the JIT compiler, which unfortunately is likely to be a pretty difficult issue to diagnose. I tried inlining the check in alpha-node-matches into the AlphaNode (i.e. putting the code directly in the body instead of in a function) to avoid the Var indirection but this did not noticeably reduce runtime.

Perhaps we can think of a way to bring execution time down or just check on the condition as a whole rather than the constraints. I wonder if the right solution here is to introduce a :debug-mode option to mk-session that enables this and any other checks that could be too expensive to perform at runtime in production environments. Aggressive schema checking would be another example of this. I don't think the additional complexity in the code would be too bad. Needing to complicate the user-facing API would be unfortunate but is perhaps our best option. Such an option would have the added bonus of resolving our issues with dead code in ClojureScript.

@rbrush
Copy link
Collaborator

rbrush commented Jan 25, 2017

Gah, bummer on the performance hit.

Even so, I'd still be in favor of checking on the condition as a whole rather than each constraint to avoid the performance penalty. If we can get a user to the right condition, he or she has a much better chance of figuring out the right cause.

@mrrodriguez
Copy link
Collaborator

mrrodriguez commented Jan 25, 2017

@WilliamParker @rbrush

I'm very surprised at this, but I tried this change against one of our workflows that I know taxes the alpha network extremely heavily and it seems this results in double-digit percentage increases in runtime for this workflow. Relating to the conversation above, I'm guessing this has something to do with hitting less optimal paths in the JIT compiler, which unfortunately is likely to be a pretty difficult issue to diagnose. I tried inlining the check in alpha-node-matches into the AlphaNode (i.e. putting the code directly in the body instead of in a function) to avoid the Var indirection but this did not noticeably reduce runtime.

Did you get any profiling data to be sure it had anything to do with the try-catch situation?
Was it only the alpha network affected?

Another point to consider is if the issue is not the try-catch, but rather causing too large bytecode size for inlining due to the code movement. You can enable flags to watch for this for the JVM. I'd do so with Java7+ though due to some odd results I found for some of the inlining output for Java6 before (the Mac version of JDK6 at least).

I originally thought: It is hard to believe that the try-catch blocks are costing time given the actual byte code that is generated by it. It essentially doesn't add any runtime overhead and only minimal bytecodes.

Then I looked for some actual evidence:
I'll demonstrate what I'm seeing from Clojure's compiled code regarding try-catch.

;;;; WITHOUT try-catch
(binding [*compile-files* true]
  (eval '(conj [:a] :b)))

;;;; WITH try-catch
(binding [*compile-files* true]
  (eval '(try
           (conj [:a] :b)
              (catch Exception e
                (throw e)))))

Now look at the *compile-path*, which is something like
"/Users/<user>/<path-to-repls-project>/target/classes"

Using javap, I'll truncate to the most relevant part of the discussion.

WITHOUT try-catch

  public static java.lang.Object invokeStatic();
    Code:
       0: getstatic     #15                 // Field const__0:Lclojure/lang/Var;
       3: invokevirtual #20                 // Method clojure/lang/Var.getRawRoot:()Ljava/lang/Object;
       6: checkcast     #22                 // class clojure/lang/IFn
       9: getstatic     #26                 // Field const__2:Lclojure/lang/AFn;
      12: getstatic     #30                 // Field const__3:Lclojure/lang/Keyword;
      15: invokeinterface #34,  3           // InterfaceMethod clojure/lang/IFn.invoke:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
      20: areturn

WITH try-catch

  public static java.lang.Object invokeStatic();
    Code:
       0: getstatic     #15                 // Field const__0:Lclojure/lang/Var;
       3: invokevirtual #20                 // Method clojure/lang/Var.getRawRoot:()Ljava/lang/Object;
       6: checkcast     #22                 // class clojure/lang/IFn
       9: getstatic     #26                 // Field const__2:Lclojure/lang/AFn;
      12: getstatic     #30                 // Field const__3:Lclojure/lang/Keyword;
      15: invokeinterface #34,  3           // InterfaceMethod clojure/lang/IFn.invoke:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
      20: astore_0
      21: goto          36
      24: astore_1
      25: aload_1
      26: aconst_null
      27: astore_1
      28: checkcast     #36                 // class java/lang/Throwable
      31: athrow
      32: astore_0
      33: goto          36
      36: aload_0
      37: areturn
    Exception table:
       from    to  target type
           0    21    24   Class java/lang/Exception

It looks like Clojure's compiler is doing 2 potentially bad things:
(1) Adding quite a bit more bytecodes. If the JIT inlining threshold was bordering on the edge of being too high before, this could be enough to send it over the top.
(2) The return value is stored into local variable 0 with and there is a goto for the regular flow of execution. Then the value is loaded back from local variable 0 to return.

I do not think (1) is likely just because I think many of Clara's functions are already compiling into too large of methods for inlining. This is unfortunate, but nothing that has changed here.

(2) could be suspect though.

Now I'm also suspicious that perhaps Clojure's compiler doesn't make try-catch as non-obtrusive as, for example, Java's compiler. I haven't been able to find any discussions online related to this.

Edit:
Based on a little research, if http://www.javaworld.com/article/2076868/learn-java/how-the-java-virtual-machine-handles-exceptions.html is to be trusted, it is odd that Clojure's compiler does this variable store, goto, and variable load. The exception table should already handle the case in a way that doesn't affect the runtime flow of normal execution. i.e. The Java compiler would have immediately had the areturn in the normal flow and if an exception was thrown it would have just consulted the exception table.

@WilliamParker
Copy link
Collaborator

Frankly, I was pretty stunned as well to see the performance degradation and I'm still not ruling out that I'm missing something or that there is some other piece of this other than the try-catch that is causing problems. It merits further thought/investigation.

  • I went with something that tested the alpha network heavily because that is the only dataset I know of that would test one of these conditions that heavily. In this case, it taxes alpha conditions that call functions that we've heavily optimized where we found that things like using ThreadLocals instead of dynamic vars and doing identity rather than equality checks on keywords yielded large performance benefits. As a result of this optimization the cost of the rules engine is (I think) pretty isolated. I did do some profiling but it isn't quite as conclusive as I'd like since it consists of randomly selected profiles from a MapReduce job.
  • The difference in the bytecode in the main path is interesting since the literature I read on the JVM suggests that the only difference should be the addition of the exception handler. See my comment at Issue 255: Provide better feedback when constraints throw exceptions #256 (comment) Comparing this with the bytecode generated from similar logic in Java might be enlightening. That said, I do think we need to keep in mind that the bytecode is ultimately an abstraction, not what actually gets executed. As I discussed in my comment there is some fairly authoritative information that adding try-catch blocks prevents some reordering operations of the JVM that would otherwise be allowed to optimize performance. Unfortunately I have yet to find any sources that elaborate on what sorts of optimizations would be prevented. Note that my tests were against Java 6.
  • If we can achieve our performance goals while wrapping all conditions, rather than constraints, in try-catch blocks that would be a reasonable solution IMO. As @carlosphillips pointed out it does have the downside of making internal exceptions not necessarily readily distinguishable from exceptions in user code, but the benefits of providing the condition rather than nothing outweigh this downside by a considerable margin IMO. That said, we'd need to check the performance of that as well. The case I used has predominantly two constraints in the poorly-performing conditions; if performance degradation were a straight linear function of the number of try-catch blocks I don't think a third of the degradation I saw would be acceptable. That said I doubt that the relationship between try-catch blocks and performance is quite that simple.

@mrrodriguez
Copy link
Collaborator

The difference in the bytecode in the main path is interesting since the literature I read on the JVM suggests that the only difference should be the addition of the exception handler. See my comment at #256 (comment) Comparing this with the bytecode generated from similar logic in Java might be enlightening. That said, I do think we need to keep in mind that the bytecode is ultimately an abstraction, not what actually gets executed. As I discussed in my comment there is some fairly authoritative information that adding try-catch blocks prevents some reordering operations of the JVM that would otherwise be allowed to optimize performance. Unfortunately I have yet to find any sources that elaborate on what sorts of optimizations would be prevented. Note that my tests were against Java 6.

Your references were with respect to the Java compiler (i.e. javac). Clojure's compiler is completely different and can do things much differently. I'm actually not too happy now that I've seen how Clojure compiles the try-catch in a way that adds obvious runtime overhead.
The bytecode isn't purely "an abstraction". Adding these local variable stores and loads are additional work. They could maybe be optimized, but there isn't much guarantees on that.
I'd like to see some additional feedback on this, which I think would have to be asking on the Clojurians #clojure slack or the google groups.

The other issue is with inlining, that's what you're mostly discussing. Certain sorts of "happens before" relationships have to hold in optimizations and the body of code in the try-catch has to be done in a controlled scope so that when an exception is thrown it is caught at the right time and able to jump to the exception table location. That has many different ways it could manifest itself as a place that can break up optimizations.

The highest level way to put it is just that the code between try-catch must all stay one unit and can't shift around before the try starts or after the catch. That sort of thing. Places where that matter would be non-obvious.

However, there is also just a straightforward bytecode # limit for JIT inlining. The numbers vary, but say it is 35 (there are actually several different thresholds). That just means any method longer than 35 bytecodes cannot be inlined into the caller.

The try-catch adds about 10 bytecodes to the compiled bytecodes I demonstrate above. That is 10 bytecodes closer to exceeding an inlining limit threshold value.

@WilliamParker
Copy link
Collaborator

I agree that there seems to be some variance in your example between what the Clojure compiler is doing and what the Java compiler (according to my references) does for similar logic that could explain performance differences. Extra bytecode instructions are definitely noticeable/concerning. I haven't yet compared with the actual output of a Java compiler yet.

The general idea of the sorts of optimizations this could disrupt makes sense to me. In general, from Clara's POV, the exact impacts would likely be dependent on the actual alpha condition so investigating exactly what is going on with my example wouldn't necessarily say anything about the general case.

@WilliamParker WilliamParker dismissed their stale review January 25, 2017 23:07

Performance concerns

@mrrodriguez
Copy link
Collaborator

mrrodriguez commented Jan 25, 2017

Microbenchmarks always have their quirks, but in cases with optimized operations like this:

(time (dotimes [_ 1E9]
        (+ 1 2)))
;; "Elapsed time: 503.603477 msecs"

(time (dotimes [_ 1E9]
        (try (+ 1 2) (catch Exception e e))))

;; "Elapsed time: 4250.413995 msecs"

The performance effect is very large.
The compiled bytecode:

NO try-catch:

  public static java.lang.Object invokeStatic();
    Code:
       0: lconst_1
       1: ldc2_w        #12                 // long 2l
       4: invokestatic  #19                 // Method clojure/lang/Numbers.add:(JJ)J
       7: invokestatic  #23                 // Method clojure/lang/Numbers.num:(J)Ljava/lang/Number;
      10: areturn

WITH try-catch

  public static java.lang.Object invokeStatic();
    Code:
       0: lconst_1
       1: ldc2_w        #12                 // long 2l
       4: invokestatic  #19                 // Method clojure/lang/Numbers.add:(JJ)J
       7: invokestatic  #23                 // Method clojure/lang/Numbers.num:(J)Ljava/lang/Number;
      10: astore_0
      11: goto          22
      14: astore_1
      15: aload_1
      16: aconst_null
      17: astore_1
      18: astore_0
      19: goto          22
      22: aload_0
      23: areturn
    Exception table:
       from    to  target type
           0    11    14   Class java/lang/Exception

It looks like the Clojure number optimizations are all in place in both cases. So that wasn't messed up due to something unforeseen in the microbenchmark. However, the movement of the number in memory to and from the local variable seems significant. Or perhaps it is not able to take some optimizing path due to this.

Just an example.

@WilliamParker
Copy link
Collaborator

Hmm.. I get very different results from that numerical benchmark.

;;; Clojure 1.8.0
user=> (time (dotimes [_ 1E9]                                             
        (try (+ 1 2) (catch Exception e e))))
"Elapsed time: 315.164 msecs"
nil
user=> (System/getProperty "java.version")
"1.6.0_65"
user=> (time (dotimes [_ 1E9]
        (try (+ 1 2) (catch Exception e e))))
"Elapsed time: 303.749 msecs"
nil
user=> 

What sort of environment (Clojure version, JVM, OS, etc.) are you running it in?

@mrrodriguez
Copy link
Collaborator

Comparing with the Java compiler for similar code as the Number.add() example above:

package test;

import clojure.lang.Numbers;

public class Test {
    public static Object invokeStatic() {
        return Numbers.num(Numbers.add(1L, 2L));
    }

    public static Object invokeStaticWithTry() {
        try {
            return Numbers.num(Numbers.add(1L, 2L));
            } catch (Exception e) {
            throw e;
        }
    }
}

Generates

NO try-catch:

  public static java.lang.Object invokeStatic();
    Code:
       0: lconst_1
       1: ldc2_w        #2                  // long 2l
       4: invokestatic  #4                  // Method clojure/lang/Numbers.add:(JJ)J
       7: invokestatic  #5                  // Method clojure/lang/Numbers.num:(J)Ljava/lang/Number;
      10: areturn

WITH try-catch:

  public static java.lang.Object invokeStaticWithTry();
    Code:
       0: lconst_1
       1: ldc2_w        #2                  // long 2l
       4: invokestatic  #4                  // Method clojure/lang/Numbers.add:(JJ)J
       7: invokestatic  #5                  // Method clojure/lang/Numbers.num:(J)Ljava/lang/Number;
      10: areturn
      11: astore_0
      12: aload_0
      13: athrow
    Exception table:
       from    to  target type
           0    10    11   Class java/lang/Exception

Notice that the "WITH try-catch" case doesn't do any more work on the normal path. It doesn't need to. The "Exception table" is all the info the JVM needs to know how to jump in the case of an exception.

@mrrodriguez
Copy link
Collaborator

mrrodriguez commented Jan 25, 2017

Hmm.. I get very different results from that numerical benchmark.

I wonder if it has to do with JVM. Although, I'd typically expect 8 to outperform 6.

Clojure:
1.8

JVM:
"1.8.0_112"

Edit: I do think it was environmental in that particular example. I just ran it raw instead to be sure.

java -jar ~/.m2/repository/org/clojure/clojure/1.8.0/clojure-1.8.0.jar -e "(time (dotimes [_ 1e9] (+ 1 2)))"

;; "Elapsed time: 318.542907 msecs"

vs

java -jar ~/.m2/repository/org/clojure/clojure/1.8.0/clojure-1.8.0.jar -e "(time (dotimes [_ 1e9] (try (+ 1 2) (catch Exception e e))))"

;; "Elapsed time: 331.059534 msecs"

Numbers were fairly consistent for a few runs. So the difference is much smaller.
Regardless, I just didn't have a good example, doesn't negate the potential (and apparent?) negatives we already have been discussing here.

@WilliamParker
Copy link
Collaborator

So getting back to this after test runs using a particularly heavy workflow we have in production against various iterations of this:

  • Running with the full constraints try-check increased the runtime of this case by about 32% relative to master.
  • Interestingly, when I tried the full constraints try-check but without the exception data map, the runtime was about 17% longer than the baseline.
  • When I tried running without the exception data map by removing the let binding around the constraint expression result, and instead directly inlining it into a when conditional, results were similar to those in the case with only the exception data map removed.
  • I tried running with the JIT disabled to compare the baseline with the changes without JIT (and thus determine whether to what extent the impacts were due to JIT or the actual bytecode) , but the performance of that was so atrocious that it was obvious the job wouldn't complete.
  • When I changed this to just have a try-catch on the condition, rather than the individual constraints, runtime increased by about 1.5%. It should be noted that fluctuations of several percentage points in runtimes on our production compute cluster are fairly normal.

I didn't find anything in the bytecode of the generated functions I looked at in this case that jumped out at me as likely to cause these kinds of performance problems. I find the fact that the runtime decreased drastically with the removal of the exception data map quite notable. My suspicion is that the need to maintain variables across the different constraint try-catch loops is causing problems for the HotSpot JIT compiler. Unfortunately, I'm skeptical that we'll be able to precisely diagnose what is going on there. I did some research on the HotSpot JIT to try and determine what here could potentially cause problems with it, but while there is some general information out there there seems to be a lack of really detailed descriptions of the behavior (short of the openJDK source of course).

As far as how to move forward on this, it should be kept in mind that the workflow I tested is a remarkably demanding workflow (which I used for this reason) and that the constraints involved have been heavily optimized well beyond what idiomatic Clojure allows. I wouldn't expect a more normal workflow to be so heavily impact, but nevertheless I don't think a 32% increase is going to be acceptable. My inclination is to move forward with the option of just wrapping a try-catch on the condition rather than the individual constraints. It would be nice to have the exact constraint that failed, but I suspect that practically speaking narrowing down to a particular condition, especially with the fact and bindings from prior nodes in the exception data map, will deliver most of the value here. Regarding the downside @carlosphillips pointed out about also catching exceptions from within Clara, on reflection I think this is probably a good thing on the whole. Clara shouldn't be throwing such exceptions and if they are localized to a particular condition it should hopefully be easier to diagnose any such exceptions that occur. In the case of JVM Clojure, one intriguing potential future enhancement I see is that since eval is available, we could potentially catch an exception from the condition and only then compile the condition with try-catch on individual constraints. However, this would be pretty complex and IMO is best left for another time rather than blocking progress here.

Thoughts? @rbrush @mrrodriguez

@rbrush
Copy link
Collaborator

rbrush commented Jan 30, 2017

From @WilliamParker:

My inclination is to move forward with the option of just wrapping a try-catch on the condition rather than the individual constraints. It would be nice to have the exact constraint that failed, but I suspect that practically speaking narrowing down to a particular condition, especially with the fact and bindings from prior nodes in the exception data map, will deliver most of the value here.

These are my thoughts as well.

+1 to implementing that here, and we can look at doing something more involved that matches typically code style on the JVM (like attaching line-number metadata as I briefly discussed earlier) t a later point.

@mrrodriguez
Copy link
Collaborator

Yeah, I'd just say any significant performance degradation is worth it. It sounds like the more minimal approach is useful enough for now. Anything excessively heavy-weight probably needs to be something that could be turned-on in a "debug" mode if it came down to features like that. Doesn't sound necessary yet here.

@WilliamParker
Copy link
Collaborator

I have created a branch that changes this to only wrap the entire condition in a try-catch block, rather than each individual constraint, at https://github.com/WilliamParker/clara-rules/tree/fix-constraint-messages. The commit is at WilliamParker@055b930. Note that this is commit is on top of the commit that adds a try-catch block to each individual constraint.

I'll test this against the problematic workflow discussed earlier when I get a chance. It is basically but not exactly the same as the version I tested against that ran with no major perf degradation, so I don't anticipate different performance characteristics, but I want to make sure.

@WilliamParker
Copy link
Collaborator

It looks like the draft I posted at https://github.com/WilliamParker/clara-rules/tree/fix-constraint-messages basically isn't impacting performance. :)

@rbrush I've posted more details on the runs I did internally at Cerner since unfortunately I can't post them here.

@carlosphillips
Copy link
Contributor Author

I'm excited to see that you have found a way to make some of this work.

@rbrush
Copy link
Collaborator

rbrush commented Feb 8, 2017

+1 to merge. (Sorry for the slow responses...been kind of a crazy week.)

@WilliamParker
Copy link
Collaborator

OK, regarding how to merge this, I'll plan to push the state of https://github.com/WilliamParker/clara-rules/commits/issue255-condition-messages-only onto master after validating that the commit history (credited appropriately to Carlos and with appropriate ordering of the commits) looks the same way in master on my fork as in that branch unless anyone has other ideas/suggestions/objections. I don't particularly care about the mechanics as long as we get those things right and I'm not experienced with these sorts of manual merges of PRs from others in Git so I could certainly be missing better options.

A colleague of mine is looking through the manual performance testing I did but I anticipate putting this in master today or tomorrow unless something comes up.

@WilliamParker
Copy link
Collaborator

I pushed these changes to master manually since I needed to add some changes on top of the PR was modified. You can see the commits here:

0f85ae7
da0845a

The Travis build for the second commit passed at https://travis-ci.org/cerner/clara-rules/builds/200391914

I don't see anything left to do on this PR. Closing.

@rbrush
Copy link
Collaborator

rbrush commented Feb 10, 2017

Awesome. Thanks everyone for pushing this through!

@WilliamParker WilliamParker added this to the 0.13.0 milestone Feb 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants