
Use lazy indexing repeated expression caching to optimize predicates #715

Merged
47 commits merged from repeated-expr-2 into getodk:master on May 17, 2023

Conversation

@seadowg (Member) commented Apr 18, 2023

Closes #689
Closes #714

This adds two extra levels of caching for predicates. Both take advantage of expressions with the structure (in either direction):

```
<relative (to the nodes being filtered) path expression> <operator> <"absolute" expression>
```

The "absolute" expression can currently be a path expression, a numeric literal or a string literal. It's likely we could extend this to any "idempotent" (where XPathExpression#isIdempotent is true) expression though.

For the first level of caching (CompareChildToAbsoluteExpressionFilter), we support these expressions with the operators =, !=, <, >, <= and >=. These expressions can be broken up into a "relative" part and an "absolute" part; the absolute part can then be evaluated and used as a key (along with the predicate and some other variables) to cache the results for that expression within a particular evaluation. For example, an expression like name = /data/choice can be partly evaluated to name = "Bill", and then we have enough information to cache the children where name is "Bill". This means that repeated predicates of this form with a previously evaluated answer are resolved in O(1) rather than O(n), where n is the number of children we are filtering.
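
As an illustration only (not JavaRosa's actual classes), here is a minimal sketch of that idea for the = case, assuming the absolute side has already been evaluated to a plain string:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only: not CompareChildToAbsoluteExpressionFilter itself,
// just the shape of "evaluate the absolute side once, then key a cache on it".
class CompareToAbsoluteCacheSketch {

    record Child(Map<String, String> fields) {}

    // key = relative side + operator + evaluated absolute side
    private final Map<String, List<Child>> cache = new HashMap<>();

    List<Child> filterEq(List<Child> children, String relativeField, String evaluatedAbsolute) {
        String key = relativeField + "=" + evaluatedAbsolute;

        // First evaluation: O(n) scan over the children. Any repeat of the same
        // predicate with the same evaluated absolute value: O(1) cache hit.
        return cache.computeIfAbsent(key, k -> children.stream()
                .filter(child -> evaluatedAbsolute.equals(child.fields().get(relativeField)))
                .collect(Collectors.toList()));
    }
}
```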

The second level of caching (IndexPredicateFilter) only works on =. This uses a similar approach to PredicateCachingFilter, but lazily builds an index for the "relative" side. From the example before, this would mean we end up creating an index of all the possible name values in the set of children we're filtering, so that when we evaluate the absolute side to "Bill", we can just grab the matching children out of the index. This means that repeated predicates of this form, where we've previously filtered on that relative path, are evaluated in O(1) rather than O(n), where n is the number of children we are filtering. It's worth noting that this will probably increase loading times for forms, as indexing will most likely happen then. I think we can probably extend this indexing approach to the other operators, but I'd need to consult a grimoire to work those out, and it may just not be worth it.
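
Again as an illustration only (not IndexPredicateFilter itself), a minimal sketch of the lazily built index might look like this:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a lazily built value -> children index that makes
// repeated equality lookups O(1). Child and string-valued fields are
// simplifications, not JavaRosa types.
class LazyEqualityIndexSketch {

    record Child(Map<String, String> fields) {}

    // relative path (e.g. "name") -> (value -> children with that value)
    private final Map<String, Map<String, List<Child>>> indexes = new HashMap<>();

    List<Child> filterEq(List<Child> children, String relativeField, String evaluatedAbsolute) {
        // Building the index is O(n) and happens the first time we filter on
        // this relative path (most likely at form load); later lookups for any
        // value of the absolute side are O(1).
        Map<String, List<Child>> index = indexes.computeIfAbsent(relativeField, field -> {
            Map<String, List<Child>> byValue = new HashMap<>();
            for (Child child : children) {
                byValue.computeIfAbsent(child.fields().get(field), v -> new ArrayList<>()).add(child);
            }
            return byValue;
        });

        return index.getOrDefault(evaluatedAbsolute, Collections.emptyList());
    }
}
```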

There are a few caveats/edge cases here. I don't think any of these block this PR, but it would be good to discuss whether these are things we should create issues for, or just leave for a potential future where we feel they're worth addressing:

  • No caching is currently used in cases where there are multiple predicates like /data/children[<predicate>][<another predicate>]. This fixes "Cascade-level filtered expression caching results in bad evaluations with multiple predicates" (#714). As discussed, we now use caching when evaluating the first predicate in a chain but not for any following ones; we'll most likely want to do extra work to fully support these expressions or add support for && predicates in the future.
  • Less caching is currently used in cases where there are nested predicates like /data/children[<predicate>]/grandChild[<another predicate>]. This causes problems for indexing, as we need to index the whole possible nodeset rather than the nested one. Similar to the above caveat, there might be a nice compromise where we just use caching with the first predicate in the path.
  • No caching is used in cases where a predicate compares two relative paths with any of the operators above. An example would be something like name = nickname. If we feel like this would be impactful, we can probably add it down the road, but the implementation would differ from what we have so far (as both sides need to be evaluated for every child).

I've also added an API to allow clients to add a PredicateFilter:

```java
formEntryController.addPredicateFilter(new MyPredicateFilter());
```

This should allow Collect to add caching that works between form sessions (speeding up form loading times).

This only works for predicates of the form:

```
<path expression> = <path expression>
```

It's very likely there are untested edge cases that this implementation will break.
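
For illustration, a client-supplied filter could look something like the following sketch. The filter method signature here is a simplified assumption, not the exact PredicateFilter interface (which works in terms of JavaRosa types such as TreeReference and XPathExpression and delegates to the rest of the filter chain):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative sketch only: a client filter that caches results across
// form sessions. The signature is simplified for the example.
class SessionCachingFilterSketch {

    // In Collect this map might be backed by storage that outlives a single
    // form session, so repeated loads of the same form skip re-evaluation.
    private final Map<String, List<String>> persistentCache = new HashMap<>();

    List<String> filter(String predicateKey, Supplier<List<String>> next) {
        // On a miss, fall through to the rest of the filter chain and
        // remember its answer for the next session.
        return persistentCache.computeIfAbsent(predicateKey, key -> next.get());
    }
}
```
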
@seadowg marked this pull request as ready for review May 9, 2023 13:57
@seadowg marked this pull request as draft May 9, 2023 14:05
@seadowg marked this pull request as ready for review May 9, 2023 14:21
@seadowg requested a review from lognaturel May 9, 2023 14:21
@seadowg marked this pull request as draft May 10, 2023 11:05
@seadowg marked this pull request as ready for review May 10, 2023 11:10
@seadowg marked this pull request as draft May 11, 2023 16:13
@seadowg (Member, Author) commented May 11, 2023

As discussed, JavaRosa should expose a way to add a PredicateFilter rather than override the TreeReferenceIndex. We think this will provide more flexibility and might be required for things like Entities.

@seadowg marked this pull request as ready for review May 12, 2023 08:26
@lognaturel (Member) left a comment

I haven't done a deep review yet but I think it would be helpful to get this in beta so we can get feedback on any crashes or issues that users notice.

@lognaturel merged commit 34f4cbd into getodk:master May 17, 2023
3 checks passed
@seadowg deleted the repeated-expr-2 branch May 18, 2023 06:47
```diff
 passed.addAll(predicateCache.get(nodeSetRef, xpe, () -> {
     List<TreeReference> predicatePassed = new ArrayList<>(treeReferences.size());
     for (int i = 0; i < treeReferences.size(); ++i) {
         //if there are predicates then we need to see if e.nextElement meets the standard of the predicate
         TreeReference treeRef = treeReferences.get(i);

         //test the predicate on the treeElement
-        EvaluationContext evalContext = rescope(treeRef, (firstTimeCapture ? treeRef.getMultLast() : i));
+        EvaluationContext evalContext = rescope(treeRef, i);
```
Member

I'm having trouble convincing myself that the whole condition is ok to delete. Can you explain the reasoning please? Or is it just that there are no tests for it?

Member Author

It had no tests and I couldn't explain the reasoning for it. Agreed it feels scary to delete, but it's also completely unexplained code. Do you feel like we need to add the logic back in? I couldn't come up with an example at the time, but I was heavily down in the weeds. Maybe we'd have more success reverse engineering a test now.

```java
context = new EvaluationContext(evalContext, new IdempotentInMemPredicateCache());
List<PredicateFilter> filters = Stream.concat(
        customPredicateFilters.stream(),
        Stream.of(indexPredicateFilter, cachingPredicateFilter, new IdempotentPredicateCache())
```
Member

The first two caches are instantiated at form load time and the third is instantiated at the start of an evaluation cascade. This is really fundamental to how they're intended to work and it feels scary to me that there's no explanation of this distinction anywhere in the code. When I first saw this line, my reaction was that all three should be constructed in the same way.

The good news is that repeatsUsedInCalculatesStayUpToDate and similarCmpAndEqExpressionsDoNotGetConfused do fail if the last cache is instantiated globally rather than just for a single cascade.

Not sure there's action to take here but did want to flag it. Maybe the last cache type's header comment should include that it's only intended for use in a single evaluation cascade. Another idea would be to build a single cache and explicitly reset it at the beginning or end of each cascade evaluation.
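
One way that explicitly-reset alternative could look, sketched with hypothetical names rather than JavaRosa classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: one cache instance that is explicitly cleared per
// cascade instead of being constructed fresh for each cascade.
class ResettableCascadeCache {

    private final Map<String, List<String>> cache = new HashMap<>();

    // Called at the start (or end) of each evaluation cascade so stale
    // entries can never leak across cascades.
    void reset() {
        cache.clear();
    }

    List<String> get(String key, Supplier<List<String>> compute) {
        return cache.computeIfAbsent(key, k -> compute.get());
    }
}
```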

Member

Side note, why does similarCmpAndEqExpressionsDoNotGetConfused fail in that case? Are there cases where the expressions could get confused within a single cascade?

Member Author

> Not sure there's action to take here but did want to flag it. Maybe the last cache type's header comment should include that it's only intended for use in a single evaluation cascade. Another idea would be to build a single cache and explicitly reset it at the beginning or end of each cascade evaluation.

There is currently a comment on IdempotentPredicateCache:

> Caches down stream evaluations (in the {@link PredicateFilter} chain) for "idempotent" (with respect to current form state) predicates. Can only be used for static instances or in cases where form state won't change - will cause clashes otherwise. Repeated evaluations are fetched in O(1) time.

Does that feel like enough or do you think it needs more details?

@seadowg (Member, Author) Jul 4, 2023

> Side note, why does similarCmpAndEqExpressionsDoNotGetConfused fail in that case? Are there cases where the expressions could get confused within a single cascade?

Not sure! The cache definitely isn't designed to work between evaluation runs, so I'd imagine that's messing with something. I believe the repeats test was one that drove out the cache only being used for single evaluations.

```diff
-            return filtered;
-        }
-    }
+    return filterWithPredicate(sourceInstance, treeReference, predicate, children, 0);
```
Member

The iterative approach felt simpler to me! I'd be interested to briefly discuss why this feels clearer to you.

Member Author

It's been a while now since I made the change, but I guess I was anticipating that we needed to "fall through" and then let values "bubble" back up through these filters. In CompareChildToAbsoluteExpressionFilter for example, you need to let the lower parts of the chain evaluate and then use that to build the cache. For me, that felt easiest to model recursively. Does that make sense?
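
Roughly, the shape I mean (hypothetical names, not the real classes):

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of the "fall through, then bubble back up" shape that a
// recursive filter chain gives you.
interface FilterSketch {
    List<String> filter(List<String> children, Supplier<List<String>> next);
}

class FilterChainSketch {

    // Each filter may call next.get() to let the rest of the chain run, use
    // the result (e.g. to populate a cache), and then return it upwards.
    static List<String> run(List<FilterSketch> filters, int i, List<String> children) {
        if (i == filters.size()) {
            return children; // end of the chain: nothing left to filter
        }
        return filters.get(i).filter(children, () -> run(filters, i + 1, children));
    }
}
```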

```java
if (absoluteSide instanceof XPathPathExpr) {
    return ((XPathPathExpr) getAbsoluteSide()).eval(sourceInstance, evaluationContext).unpack();
} else {
    return absoluteSide.eval(sourceInstance, evaluationContext);
```
Member

This accounts for literals? Yes, confirmed through debugging tests.

@lognaturel mentioned this pull request Oct 18, 2023