Feature/sorted sets #523

zonotope · 2023-07-05T22:05:28Z

This patch replaces clojure.data.avl with persistent-sorted-set as the sorted set implementation backing the 5 core indexes.

A number of changes had to be made to the internal indexing APIs to make them compatible with persistent-sorted-set:

- persistent-sorted-set does not support sorted maps, so our index branch nodes now use a sorted set instead of a sorted map to organize the branch's children.
  - There is a new index node comparator, derived from our standard flake comparators, to compare branch nodes. The node comparator treats nodes whose flake ranges overlap as equal, and disjoint branch nodes are compared by comparing their branch intervals.
  - In order to find a range of a branch node's children between two flakes, we now construct two dummy "nodes" out of those flakes, one for each flake endpoint, whose :first-flake and :rhs attributes are each set to the specified endpoint. The index node comparator defined above will return the correct range using these two nodes.
- persistent-sorted-set supports efficiently creating sequential subranges called slices, which are not themselves sorted sets, unlike clojure.data.avl, which directly creates subsets from its subrange api. This patch ensures that no set operations are ever performed directly on slices.
- persistent-sorted-set does not support > or < order checks in its slice api, opting instead to only implicitly support >= and <=. This means there could be two index nodes which share an endpoint (the :rhs of one is equal to the :first-flake of the next). I don't think this will be an issue in practice, because it should only matter when integrating novelty into existing nodes, and all the flakes in novelty should have at least a different t value than the flakes in any of the existing nodes, so we shouldn't run into nodes that actually share endpoints. The other place this might matter is while re-balancing leaf nodes when the overflow limit is hit, but this patch includes code to explicitly drop overlapping flakes after a split operation.
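To make the comparator and the dummy-node lookup concrete, here is a minimal sketch in Python rather than the project's Clojure. It is only an illustration under stated assumptions: flakes are stood in for by integers, `Node`, `compare_nodes`, and `children_between` are hypothetical names, and a binary search over a sorted list plays the role of the sorted set's slice operation.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an index node; integers stand in for flakes."""
    first: int  # plays the role of :first-flake
    rhs: int    # plays the role of :rhs


def compare_nodes(a: Node, b: Node) -> int:
    """Nodes whose intervals overlap compare equal; disjoint nodes are
    ordered by the position of their intervals."""
    if a.rhs < b.first:
        return -1
    if b.rhs < a.first:
        return 1
    return 0


def children_between(children, start, end):
    """Return the children whose intervals touch [start, end].

    `children` must already be sorted and pairwise disjoint. Each endpoint
    is wrapped in a dummy node whose first and rhs are both that endpoint,
    and the comparator alone locates the range by binary search.
    """
    key = cmp_to_key(compare_nodes)
    keys = [key(c) for c in children]
    lo = bisect_left(keys, key(Node(start, start)))
    hi = bisect_right(keys, key(Node(end, end)))
    return children[lo:hi]


children = [Node(0, 9), Node(10, 19), Node(20, 29), Node(30, 39)]
print(children_between(children, 12, 25))
# [Node(first=10, rhs=19), Node(first=20, rhs=29)]
```

Because overlapping intervals compare as equal, the dummy node for each endpoint "matches" whichever child contains that endpoint, so both bounds are effectively inclusive, which mirrors the >= and <= behavior described above.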

Some of the behavior of the internal APIs has also changed to make use of the efficiencies of persistent-sorted-set. We now pass a collection to sorted-set-by instead of individual flakes, which avoids the need for apply. Also, sorted-set-by now expects the supplied collection of flakes to already be sorted, because this kind of set creation is heavily optimized in persistent-sorted-set. All of the flake collections we had were already sorted anyway, so this shouldn't prove to be an issue.
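As a rough illustration of that calling convention, the sketch below (Python, not the actual Clojure code) shows a constructor that takes one already-sorted collection and trusts its order instead of sorting element by element. `SortedFlakes` and its `slice` method are hypothetical names, and integers stand in for flakes.

```python
from bisect import bisect_left, bisect_right


class SortedFlakes:
    """Toy sorted collection built from a single, already-sorted iterable."""

    def __init__(self, sorted_items):
        items = list(sorted_items)
        # Verify the order in one linear pass instead of re-sorting;
        # a set also has no duplicates, so require strictly increasing input.
        if any(a >= b for a, b in zip(items, items[1:])):
            raise ValueError("expected strictly increasing input")
        self._items = items

    def slice(self, start, end):
        """Return the items with start <= item <= end as a plain list.

        Both bounds are inclusive, and the result is a sequence,
        not another SortedFlakes.
        """
        lo = bisect_left(self._items, start)
        hi = bisect_right(self._items, end)
        return self._items[lo:hi]


flakes = SortedFlakes([1, 3, 5, 7])  # one collection argument, no apply
print(flakes.slice(3, 5))            # [3, 5]
```

The point of the sketch is only the shape of the api: the caller hands over one ordered collection, and construction cost stays linear because no comparison-based sorting is repeated.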

Besides those changes, I took the time to simplify the api of tree-chan to cut down on the transducers we have to supply to it. Instead of always needing a transducer to limit the returned flakes to those between the endpoints, that is now done automatically by passing the endpoints in as arguments. Also, there's no longer a need to pass in an include? function as an argument, since that can be accomplished with a filtering transducer.
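The shape of the simplified api can be sketched as follows. This is a Python approximation, not the real tree-chan: `walk` and `tree_flakes` are hypothetical names, nodes are plain dicts, integers stand in for flakes, a generator stands in for the channel, and `xf` is any function from an iterable of flakes to an iterable of flakes, standing in for a transducer.

```python
def walk(node, start, end):
    """Yield the flakes within [start, end], in order.

    Children whose interval lies entirely outside the range are skipped.
    This toy version scans the children linearly; the patch instead uses
    sorted-set slices to find the relevant children.
    """
    if "flakes" in node:  # leaf node
        for flake in node["flakes"]:
            if start <= flake <= end:
                yield flake
        return
    for child in node["children"]:  # branch node
        if child["rhs"] < start or child["first"] > end:
            continue
        yield from walk(child, start, end)


def tree_flakes(root, start, end, xf=None):
    """Endpoints are plain arguments; any extra filtering goes in xf,
    so no separate include? predicate is needed."""
    flakes = walk(root, start, end)
    return xf(flakes) if xf else flakes


def leaf(lo, hi):
    return {"first": lo, "rhs": hi, "flakes": list(range(lo, hi + 1))}


root = {"first": 0, "rhs": 29,
        "children": [leaf(0, 9), leaf(10, 19), leaf(20, 29)]}

print(list(tree_flakes(root, 8, 12)))  # [8, 9, 10, 11, 12]
evens = lambda fs: (f for f in fs if f % 2 == 0)
print(list(tree_flakes(root, 8, 12, xf=evens)))  # [8, 10, 12]
```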

Lastly, I removed a lot of unused and unnecessary code, and moved the logback-test.xml into a new test-resources directory so it wouldn't be active under the development profile.

I might be forgetting some other changes here because it's a lot.

zonotope · 2023-07-06T14:56:58Z

Converted to draft because I've found some issues with queries using the file connection and loaded dbs.

zonotope · 2023-07-13T19:00:38Z

@fluree/core I'm still characterizing the indexing and querying performance on large ledgers, but this patch is ready for review

dpetran · 2023-07-14T19:25:06Z

The changes look good to me, thanks for making your commits so focused - it made reviewing much simpler. Since this is a large change I want to give others some chances to review and also wait until you've got a feel for any perf regressions.

zonotope · 2023-07-16T06:53:03Z

Fixes fluree/core#18

mpoffald

Overall LGTM. I appreciate the cleanup, and I'm happy to see the tree-chan api get a little simpler as a bonus. 😄

I know you're still measuring performance implications, so I will also hold off on a formal 👍 for now, but wanted to comment so it's clear I did get a chance to review.

zonotope · 2023-07-22T16:26:21Z

Unfortunately, persistent-sorted-set shows at least a 2x slowdown over clojure.data.avl when accessing the results of sorted set subscans. pss returns a lazy seq while avl returns a fully realized sorted set, and we pay the penalty whenever we consume the entire seq, which we always do.

Because of this performance regression, I'm closing this without merging.

zonotope added 30 commits June 22, 2023 10:06

add persistent-sorted-set dependency

cf2f43c

remove unused fns

dbff487

use subrange fn defined in flake ns

116569a

remove unused fns and references to network attribute

3382a18

ledger-id -> ledger-alias

871a4c3

remove references to network/ledger-id in storage

0421797

fluree.db.storage.core -> fluree.db.storage

d04bef9

add accessors and a comparator for child entries

7ea67f5

add a minimum flake value

395383d

use slice instead of subrange

231322e

fix recursive invocation arity

e2868a8

use persistent-sorted-set instead of clojure.data.avl for flake sets

921657e

don't use apply when creating sorted sets and maps

77ca187

ensure node children remains a sorted map; use map entries in slice fns

1e77e31

remove unused functions

2fab4e6

use nil for indeterminate flake boundaries

996de33

remove unnecessary start/end test opts

d0e37d4

use sorted sets for node children; remove now unnecessary sorted map

14e7cf6

remove unused clojure.data.avl dependency

0398837

remove unused namespace

3ff5aab

fix typo

410707c

Merge remote-tracking branch 'origin/main' into feature/sorted-sets

e7bae2c

change node comparator to consider entire interval of flakes

2bd331c

If two nodes overlap, then they are equal. Otherwise, compare them based on the flake interval they contain.

add an rslice fn; fix max/min flakes' metadata components as integer

d51ac0f

add endpoints to tree-chan api to filter out irrelevant children

0bef8ab

Use persistent sorted set's efficient slices so we don't have to consider all of a node's children (linear time) and we can instead limit it to only the children we care about in logarithmic time.

remove unused flake ranking fns

a5c5908

remove unused resolved-leaf? fn

551067b

add docstring to node comparator

aebfb93

remove include? fn in favor of passed in transducer in tree-chan

296193d

iterate through flakes only once by combining transducers

0410d48

zonotope requested a review from a team July 5, 2023 22:05

zonotope self-assigned this Jul 5, 2023

zonotope marked this pull request as draft July 6, 2023 14:56

zonotope added 9 commits July 10, 2023 12:13

Merge remote-tracking branch 'origin/main' into feature/sorted-sets

0cb168d

formatting

1639a06

prioritize including n and xf over start/end flakes in tree-chan

d8d30e6

pretty-print flakes for (some) repl formatters to work

a247bbe

resolve empty branches too

4120855

don't clobber :first attr when add/removing flakes to/from leaves

2d3f1ae

Merge remote-tracking branch 'origin/main' into feature/sorted-sets

b728b63

use pre-existing util/sequential fn

0d1f520

Merge remote-tracking branch 'origin/main' into feature/sorted-sets

48044b7

zonotope marked this pull request as ready for review July 13, 2023 18:59

zonotope added 7 commits July 13, 2023 21:23

unresolve child nodes before attaching them to branches

42e1709

unresolve the root node as well after indexing is complete

b41c839

use the original :first attribute when beginning to rebalance leaves

f99f883

add widely used ns to dev repl env

cbab0e9

Merge remote-tracking branch 'origin/fix/id-map-resolution' into feat…

9390d6d

…ure/sorted-sets

Merge remote-tracking branch 'origin/main' into feature/sorted-sets

04584a2

don't resolve empty nodes unless necessary

5ffb8c7

Merge remote-tracking branch 'origin/main' into feature/sorted-sets

56f930b

add some docstrings

69255a9

mpoffald reviewed Jul 17, 2023

zonotope mentioned this pull request Jul 21, 2023

A few improvements #536

Merged

zonotope closed this Jul 22, 2023
