Convert Cayley indexing to an append-only log #113

barakmich · 2014-08-11T02:34:18Z

Fixes #70

In order to provide for replication, graph history and other neat clustering features, I created the interface QuadWriter which can take care of all the replication logic; TripleStore (soon, QuadStore, I'm guessing) should store the things that have been accepted. Further, it should be kept as a log -- a follow-up CL which adds GetRange(from, to int64) []*graph.Delta to the TripleStore interface is forthcoming.
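To make the log idea concrete, here is a minimal sketch of an append-only delta log with a range read. It is not the code in this PR: the `Quad`, `Action` and `Log` types, the `Delta` field names, the value (rather than pointer) return, and the half-open `[from, to)` interval are all assumptions made for illustration.

```go
package main

import "fmt"

// Quad is a simplified stand-in for Cayley's quad type (assumption).
type Quad struct {
	Subject, Predicate, Object, Label string
}

// Action says whether a Delta adds or removes a quad.
type Action int

const (
	Add Action = iota
	Delete
)

// Delta is one accepted change in the log. Field names are hypothetical.
type Delta struct {
	ID     int64
	Quad   Quad
	Action Action
}

// Log is a hypothetical in-memory append-only log of accepted deltas.
type Log struct {
	deltas []Delta
}

// Append records an accepted change and returns its 1-based ID.
// Nothing is ever overwritten or removed.
func (l *Log) Append(q Quad, a Action) int64 {
	id := int64(len(l.deltas)) + 1
	l.deltas = append(l.deltas, Delta{ID: id, Quad: q, Action: a})
	return id
}

// GetRange returns a copy of the deltas with from <= ID < to.
// Out-of-range bounds are clamped; an empty range yields nil.
func (l *Log) GetRange(from, to int64) []Delta {
	n := int64(len(l.deltas))
	if from < 1 {
		from = 1
	}
	if to > n+1 {
		to = n + 1
	}
	if from >= to {
		return nil
	}
	out := make([]Delta, to-from)
	copy(out, l.deltas[from-1:to-1])
	return out
}

func main() {
	l := &Log{}
	l.Append(Quad{Subject: "alice", Predicate: "follows", Object: "bob"}, Add)
	l.Append(Quad{Subject: "alice", Predicate: "follows", Object: "bob"}, Delete)
	l.Append(Quad{Subject: "bob", Predicate: "follows", Object: "carol"}, Add)
	for _, d := range l.GetRange(2, 4) {
		fmt.Println(d.ID, d.Action, d.Quad.Subject)
	}
}
```

A replica that has seen everything up to some ID could, under these assumptions, catch up by asking for the range starting at the next ID.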

Questions about moving, renaming and ownership are certainly valid.

This also turns Cayley into a full quadstore for all backends, which it was heading toward anyway. This PR finalizes that.

Anyway, this means that nothing really gets deleted, but that indexed quads get extra data about when they're valid, and the triplestore^Hquadstore iterators should check the validity. This adds some amount of slowdown for Next(), but how much?
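A rough sketch of what the validity check in Next() amounts to. The `entry` layout (a list of log IDs that alternate add, delete, add, ...) and the odd-length rule for "currently valid" are assumptions for illustration, not a description of the on-disk format in this PR.

```go
package main

import "fmt"

// entry is a hypothetical index record: a quad key plus the log IDs of
// every add/delete applied to it, assumed to alternate add, delete, add...
type entry struct {
	quad    string
	history []int64
}

// live reports whether the most recent event was an add, which under the
// alternation assumption means an odd number of events.
func (e entry) live() bool {
	return len(e.history)%2 == 1
}

// Iterator walks the index and skips entries that are not currently valid.
type Iterator struct {
	entries []entry
	pos     int
	result  string
}

// Next advances to the next live entry. Dead entries cost one cheap check
// each; there is no extra lookup per skipped entry.
func (it *Iterator) Next() bool {
	for it.pos < len(it.entries) {
		e := it.entries[it.pos]
		it.pos++
		if e.live() {
			it.result = e.quad
			return true
		}
	}
	return false
}

// Result returns the quad found by the last successful Next.
func (it *Iterator) Result() string { return it.result }

func main() {
	it := &Iterator{entries: []entry{
		{quad: "a", history: []int64{1}},       // added
		{quad: "b", history: []int64{2, 3}},    // added, then deleted
		{quad: "c", history: []int64{4, 5, 6}}, // added, deleted, re-added
	}}
	for it.Next() {
		fmt.Println(it.Result())
	}
}
```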

$ benchcmp nolog.txt prelog.txt
benchmark                                   old ns/op       new ns/op       delta
BenchmarkNamePredicate                      1075967         1144500         +6.37%
BenchmarkLargeSetsNoIntersection            48881903        56925194        +16.45%
BenchmarkVeryLargeSetsSmallIntersection     509579408       757046655       +48.56%
BenchmarkHelplessContainsChecker            26454974514     27333222110     +3.32%
BenchmarkNetAndSpeed                        15422744        16408261        +6.39%
BenchmarkKeanuAndNet                        12878175        13006483        +1.00%
BenchmarkKeanuAndSpeed                      14542854        14950108        +2.80%
BenchmarkKeanuOther                         61897336        76940891        +24.30%
BenchmarkKeanuBullockOther                  78667479        81221650        +3.25%
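
For anyone reading these tables for the first time: the delta column is the relative change in ns/op from the old run to the new one, so a positive number means slower. A small sketch of the arithmetic (the function name is made up here; it is not part of benchcmp):

```go
package main

import "fmt"

// deltaPercent returns the relative change from oldNs to newNs as a
// percentage; positive means the new run is slower.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	// BenchmarkNamePredicate from the table above.
	fmt.Printf("%+.2f%%\n", deltaPercent(1075967, 1144500)) // prints +6.37%
}
```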

Mostly a few percent, with a couple of outliers (up to +48.56% on VeryLargeSetsSmallIntersection), and not noise. Incidentally, this is exacerbated by the Materialize iterator that recently went in; it does a lot of Next()ing, so on things where it aborts (HelplessContainsChecker) the impact is less of a problem. So, the wins from Materialize are still large, just less so now.

So what's the overhead on queries going through the addition/deletion/readdition cycle, other than using more persistence space? This now gets tested.

The answer to that, at least with the memstore is:

$ benchcmp prelog.txt postlog.txt
benchmark                                   old ns/op       new ns/op       delta
BenchmarkNamePredicate                      1144500         1108439         -3.15%
BenchmarkLargeSetsNoIntersection            56925194        57843995        +1.61%
BenchmarkVeryLargeSetsSmallIntersection     757046655       747045447       -1.32%
BenchmarkHelplessContainsChecker            27333222110     28253011444     +3.37%
BenchmarkNetAndSpeed                        16408261        17083855        +4.12%
BenchmarkKeanuAndNet                        13006483        13495442        +3.76%
BenchmarkKeanuAndSpeed                      14950108        15693748        +4.97%
BenchmarkKeanuOther                         76940891        68439409        -11.05%
BenchmarkKeanuBullockOther                  81221650        74245237        -8.59%

Meh. Not too much.

Obviously it'll be a bit slower for larger backends, but it'll still be fairly small because it fails fairly fast; we don't have to go through the whole tree for things that are invalid, we just go to the next thing in the underlying persistence iterator.

A migration tool to collapse the history (i.e., keep only the additions of things that are currently valid) might make sense in the future, but it should be a separate tool.
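As a sketch of what such a collapse could do (no such tool exists in this PR; `Delta`, `Action` and `Collapse` are hypothetical names): replay the log, track which quads end up live, and keep only the additions that made them live.

```go
package main

import "fmt"

// Action says whether a Delta adds or removes a quad.
type Action int

const (
	Add Action = iota
	Delete
)

// Delta is a simplified log record; the quad is reduced to a string key.
type Delta struct {
	Quad   string
	Action Action
}

// Collapse returns, in log order, only the additions whose quads are
// still live at the end of the log. Deletes and superseded adds are dropped.
func Collapse(log []Delta) []Delta {
	liveAt := make(map[string]int) // quad -> index of the add that made it live
	for i, d := range log {
		switch d.Action {
		case Add:
			if _, ok := liveAt[d.Quad]; !ok {
				liveAt[d.Quad] = i
			}
		case Delete:
			delete(liveAt, d.Quad)
		}
	}
	var out []Delta
	for i, d := range log {
		if at, ok := liveAt[d.Quad]; ok && at == i {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	log := []Delta{
		{"a", Add}, {"b", Add}, {"a", Delete},
		{"a", Add}, {"b", Delete}, {"c", Add},
	}
	for _, d := range Collapse(log) {
		fmt.Println(d.Quad)
	}
}
```

The trade-off is the obvious one: the collapsed log is smaller, but the graph history that the append-only log was meant to provide is gone.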

/cc @kortschak @pbnjay

$ benchcmp gollrb.bench b-gen.bench
benchmark                                   old ns/op       new ns/op       delta
BenchmarkNamePredicate                      1731218         1693373         -2.19%
BenchmarkLargeSetsNoIntersection            81290360        70205277        -13.64%
BenchmarkVeryLargeSetsSmallIntersection     768135620       442906243       -42.34%
BenchmarkHelplessContainsChecker            39477086024     35260603748     -10.68%
BenchmarkNetAndSpeed                        22510637        21587975        -4.10%
BenchmarkKeanuAndNet                        18018886        17795328        -1.24%
BenchmarkKeanuAndSpeed                      20336586        20560228        +1.10%
BenchmarkKeanuOther                         85495040        80718152        -5.59%
BenchmarkKeanuBullockOther                  95457792        83868434        -12.14%

Code gen from $GOPATH/src/github.com/cznic/b:

make generic \
    | sed -e 's/KEY/int64/g' -e 's/VALUE/struct{}/g' \
    > $GOPATH/src/github.com/google/cayley/graph/memstore/b/keys.go

key_test.go manually edited.

kortschak · 2014-08-12T23:46:45Z

graph/leveldb/iterator.go

@@ -41,7 +43,7 @@ type Iterator struct {
 	result         graph.Value
 }

-func NewIterator(prefix string, d quad.Direction, value graph.Value, qs *TripleStore) *Iterator {
+func NewIterator(prefix string, d quad.Direction, value graph.Value, qs *TripleStore) graph.Iterator {


Are you hiding this for a reason?

Just to return nil. It works alright if it's a closed leveldb iterator as well. Fixed.

Can't you just return nil for the *Iterator?
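
One Go detail that bears on this exchange: a nil pointer returned through a concrete type compares equal to nil, but the same nil pointer converted to an interface value does not. The sketch below uses made-up names (`levelIterator`, `newConcrete`, `newWrapped`, `newInterface`) and is not the leveldb code under review.

```go
package main

import "fmt"

// Iterator is a cut-down stand-in for graph.Iterator.
type Iterator interface {
	Next() bool
}

// levelIterator is a hypothetical concrete iterator.
type levelIterator struct{}

// Next never dereferences the receiver, so it is safe on a nil pointer.
func (it *levelIterator) Next() bool { return false }

// newConcrete returns the concrete pointer type; a nil result compares
// equal to nil at the call site.
func newConcrete(ok bool) *levelIterator {
	if !ok {
		return nil
	}
	return &levelIterator{}
}

// newWrapped converts the concrete result to the interface type. A nil
// *levelIterator becomes a non-nil interface holding a nil pointer.
func newWrapped(ok bool) Iterator {
	return newConcrete(ok)
}

// newInterface returns an untyped nil directly, which is a true nil interface.
func newInterface(ok bool) Iterator {
	if !ok {
		return nil
	}
	return &levelIterator{}
}

func main() {
	fmt.Println(newConcrete(false) == nil)  // true
	fmt.Println(newWrapped(false) == nil)   // false
	fmt.Println(newInterface(false) == nil) // true
}
```

So returning nil for the `*Iterator` is fine as long as callers keep the concrete type; the surprise only shows up once the value is stored in an interface and compared against nil.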

kortschak · 2014-08-12T23:59:46Z

graph/quadwriter.go

+	h.QuadWriter.Close()
+}
+
+var ErrQuadExists = errors.New("Quad exists")


Write as a var block.
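
For reference, the grouped form being asked for looks roughly like this. `ErrQuadExists` and its message come from the diff above; `ErrQuadNotExist`, its message, and the `addQuad` helper are assumptions added only to make the example complete.

```go
package main

import (
	"errors"
	"fmt"
)

// Package-level errors grouped in a single var block.
var (
	ErrQuadExists   = errors.New("Quad exists")
	ErrQuadNotExist = errors.New("Quad does not exist") // hypothetical
)

// addQuad is a hypothetical helper showing how a sentinel error is returned.
func addQuad(set map[string]bool, q string) error {
	if set[q] {
		return ErrQuadExists
	}
	set[q] = true
	return nil
}

func main() {
	set := map[string]bool{}
	fmt.Println(addQuad(set, "a")) // <nil>
	fmt.Println(addQuad(set, "a")) // Quad exists
}
```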

kortschak · 2014-08-13T00:12:50Z

LGTM with comments.

Has this been tested on leveldb and mongo?

barakmich · 2014-08-13T00:27:12Z

Been watching your comments come in, all should be done (including concretization of Delta, because the pointer was out of habit, not speed). Extensively tested on LevelDB (also the extant LevelDB tests), and some on Mongo but it's probably worth hacking the integration test (-short) to prove both.

pbnjay · 2014-08-13T04:07:32Z

I don't feel too qualified to comment on this... I may be missing some of the surrounding conversations here.

But, FWIW, my high-level viewpoint is that this seems to muddy cayley functionality (as a graph query framework) with the backend implementation details. Postgres for example handles transactions and write logging just fine by itself.

barakmich · 2014-08-13T05:31:36Z

One thing I've thought about doing is changing to

type Handle struct { // in package graph
  TripleStore // soon QuadStore
  QuadWriter
}

It's a simple union of the two, and their method sets should not overlap. The prior API (e.g., AddTriple) would keep working and give API users the same interface.
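
A minimal sketch of how the embedding plays out, with cut-down interfaces. The method sets (`Size`, `AddQuad`, `Close`), the `memStore` and `singleWriter` types, and `newMemoryHandle` are assumptions for illustration, not the real graph package API.

```go
package main

import "fmt"

// TripleStore is a cut-down read-side interface (assumed method set).
type TripleStore interface {
	Size() int64
}

// QuadWriter is a cut-down write-side interface (assumed method set).
type QuadWriter interface {
	AddQuad(q string) error
	Close() error
}

// Handle embeds both interfaces, so their methods are promoted and
// callers see one value with the union of the two method sets.
type Handle struct {
	TripleStore
	QuadWriter
}

// memStore is a hypothetical in-memory store.
type memStore struct {
	quads []string
}

func (s *memStore) Size() int64 { return int64(len(s.quads)) }

// singleWriter is a hypothetical writer that feeds accepted quads to a store.
type singleWriter struct {
	store *memStore
}

func (w *singleWriter) AddQuad(q string) error {
	w.store.quads = append(w.store.quads, q)
	return nil
}

func (w *singleWriter) Close() error { return nil }

// newMemoryHandle wires a store and a writer together behind one Handle.
func newMemoryHandle() *Handle {
	s := &memStore{}
	return &Handle{TripleStore: s, QuadWriter: &singleWriter{store: s}}
}

func main() {
	h := newMemoryHandle()
	defer h.Close()
	if err := h.AddQuad("alice follows bob"); err != nil {
		fmt.Println("add failed:", err)
		return
	}
	fmt.Println(h.Size()) // 1
}
```

If the two interfaces ever did share a method name, the promoted call through the Handle would become ambiguous and stop compiling, which is why the "no overlap" condition matters.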

Yes, API helpers with a higher-level API seem right. To make an empty store ought to be as easy as
func cayley_api.MakeMemoryGraph() (graph.Handle, error)
Or, similarly:
func cayley_api.Dial(cayley_http_endpoint string) (graph.Handle, error)

As for backend implementors that handle transactions -- this is meant to help with that. Yeah, Postgres does it fine, but what if there's another Cayley connected to the same server? How does one know that the data is there? Blind ignorance (and trusting the backend) was the order of the day before, but that doesn't hold up for questions of distributed consistency.

By dealing in "accepted" deltas instead of dealing with pending and whatnot, all the backend features you want can be used. So yeah, batch together the Postgres transactions and use all the cool indexing features you like; it's free rein.

The further good news is the consistency gets managed by the QuadWriter now, and TripleStore is allowed to be mostly ignorant of this fact, so it's not much worse there.

Anyway, that's a rough explanation as to the why and the sort of effect it has. It's kind of important, and worth a very little more mud for implementors (if you did AddTripleSet and RemoveTriple before, it's nigh on a wrapper function). And as a framework, none too shabby.
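
To illustrate the "nigh on a wrapper function" remark, here is a sketch of a backend that already has the old write methods and gains a delta entry point by dispatching to them. `ApplyDeltas`, `Delta`, `Action` and `legacyStore` are names assumed for this example, and quads are reduced to strings.

```go
package main

import (
	"errors"
	"fmt"
)

// Action says whether a Delta adds or removes a quad.
type Action int

const (
	Add Action = iota
	Delete
)

// Delta is a simplified accepted change.
type Delta struct {
	Quad   string
	Action Action
}

var errUnknownAction = errors.New("unknown action")

// legacyStore stands in for a backend that already has the old write methods.
type legacyStore struct {
	quads map[string]bool
}

// AddTripleSet is the pre-existing bulk add.
func (s *legacyStore) AddTripleSet(qs []string) {
	for _, q := range qs {
		s.quads[q] = true
	}
}

// RemoveTriple is the pre-existing single removal.
func (s *legacyStore) RemoveTriple(q string) {
	delete(s.quads, q)
}

// ApplyDeltas is the hypothetical new entry point: it replays accepted
// deltas in order by calling the old methods.
func (s *legacyStore) ApplyDeltas(ds []Delta) error {
	for _, d := range ds {
		switch d.Action {
		case Add:
			s.AddTripleSet([]string{d.Quad})
		case Delete:
			s.RemoveTriple(d.Quad)
		default:
			return errUnknownAction
		}
	}
	return nil
}

func main() {
	s := &legacyStore{quads: map[string]bool{}}
	err := s.ApplyDeltas([]Delta{{"a", Add}, {"b", Add}, {"a", Delete}})
	fmt.Println(err, len(s.quads)) // <nil> 1
}
```

A real backend would more likely batch the adds into one transaction rather than call the old methods one quad at a time; the sketch only shows that the shape of the change is small.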

$ benchcmp gollrb.bench b-gen.bench
benchmark                                   old ns/op       new ns/op       delta
BenchmarkNamePredicate                      1369329         1444990         +5.53%
BenchmarkLargeSetsNoIntersection            72329029        64975716        -10.17%
BenchmarkVeryLargeSetsSmallIntersection     890824761       408784476       -54.11%
BenchmarkHelplessContainsChecker            35314797618     30673240485     -13.14%
BenchmarkNetAndSpeed                        19694146        19486797        -1.05%
BenchmarkKeanuAndNet                        15340756        15317415        -0.15%
BenchmarkKeanuAndSpeed                      17902709        18042030        +0.78%
BenchmarkKeanuOther                         53452058        50984817        -4.62%
BenchmarkKeanuBullockOther                  90827780        86536510        -4.72%

benchmark                                   old allocs      new allocs      delta
BenchmarkNamePredicate                      1339            1339            +0.00%
BenchmarkLargeSetsNoIntersection            22603           22674           +0.31%
BenchmarkVeryLargeSetsSmallIntersection     65787           65860           +0.11%
BenchmarkHelplessContainsChecker            1713541         1713669         +0.01%
BenchmarkNetAndSpeed                        17135           17146           +0.06%
BenchmarkKeanuAndNet                        15802           15802           +0.00%
BenchmarkKeanuAndSpeed                      16397           16396           -0.01%
BenchmarkKeanuOther                         30148           30149           +0.00%
BenchmarkKeanuBullockOther                  35542           35544           +0.01%

benchmark                                   old bytes       new bytes       delta
BenchmarkNamePredicate                      96226           95842           -0.40%
BenchmarkLargeSetsNoIntersection            1165914         119725          +2.69%
BenchmarkVeryLargeSetsSmallIntersection     2760072         2777798         +0.64%
BenchmarkHelplessContainsChecker            84388448        84351168        -0.04%
BenchmarkNetAndSpeed                        1414837         1425752         +0.77%
BenchmarkKeanuAndNet                        1247249         1247453         +0.02%
BenchmarkKeanuAndSpeed                      1275522         1275243         -0.02%
BenchmarkKeanuOther                         2021107         2021497         +0.02%
BenchmarkKeanuBullockOther                  2682243         2683250         +0.04%

Conflicts: graph/memstore/iterator.go graph/memstore/triplestore.go

barakmich · 2014-08-14T06:24:58Z

You were right about Mongo having some issues -- it's all better now, but it wasn't indexing. (Plus I removed some duplication and whatnot)

PTAL, I'm signing off for sleep though.

Here's the delta against master:

$ benchcmp masterbench.txt memlog.txt
benchmark                                   old ns/op       new ns/op       delta
BenchmarkNamePredicate                      892522          891003          -0.17%
BenchmarkLargeSetsNoIntersection            120452689       119235335       -1.01%
BenchmarkVeryLargeSetsSmallIntersection     466922201       594306798       +27.28%
BenchmarkHelplessContainsChecker            16471396973     14392818665     -12.62%
BenchmarkNetAndSpeed                        27157535        24395951        -10.17%
BenchmarkKeanuAndNet                        15147774        13962188        -7.83%
BenchmarkKeanuAndSpeed                      25653885        23129695        -9.84%
BenchmarkKeanuOther                         92172791        80948439        -12.18%
BenchmarkKeanuBullockOther                  248606767       228117419       -8.24%

For fun, here's the three backends on my machine, after log-ization:

$ cat memlog.txt leveldblog.txt mongolog.txt
PASS
BenchmarkNamePredicate      2000            891003 ns/op
BenchmarkLargeSetsNoIntersection              20         119235335 ns/op
BenchmarkVeryLargeSetsSmallIntersection        2         594306798 ns/op
BenchmarkHelplessContainsChecker               1        14392818665 ns/op
BenchmarkNetAndSpeed         100          24395951 ns/op
BenchmarkKeanuAndNet         100          13962188 ns/op
BenchmarkKeanuAndSpeed       100          23129695 ns/op
BenchmarkKeanuOther           20          80948439 ns/op
BenchmarkKeanuBullockOther            10         228117419 ns/op
ok      github.com/google/cayley        51.053s
PASS
BenchmarkNamePredicate       200           6191220 ns/op
BenchmarkLargeSetsNoIntersection               1        1251704355 ns/op
BenchmarkVeryLargeSetsSmallIntersection        1        21464633270 ns/op
BenchmarkHelplessContainsChecker               1        419149679194 ns/op
BenchmarkNetAndSpeed          50          70967738 ns/op
BenchmarkKeanuAndNet          50          58774665 ns/op
BenchmarkKeanuAndSpeed        20          73371978 ns/op
BenchmarkKeanuOther           10         240545715 ns/op
BenchmarkKeanuBullockOther             5         296443733 ns/op
ok      github.com/google/cayley        457.961s
PASS
BenchmarkNamePredicate       100          11653367 ns/op
BenchmarkLargeSetsNoIntersection               1        1215239008 ns/op
BenchmarkVeryLargeSetsSmallIntersection        1        5214089953 ns/op
BenchmarkHelplessContainsChecker               1        463718010910 ns/op
BenchmarkNetAndSpeed           2         946459782 ns/op
BenchmarkKeanuAndNet           5         484922502 ns/op
BenchmarkKeanuAndSpeed         5         521168853 ns/op
BenchmarkKeanuOther            1        1721563675 ns/op
BenchmarkKeanuBullockOther             1        2205499232 ns/op
ok      github.com/google/cayley        487.024s

Comparison of b against GoLLRB (as at d5f020).

$ benchcmp gollrb.bench b-gen.bench
benchmark                                   old ns/op       new ns/op       delta
BenchmarkNamePredicate                      1631932         1409531         -13.63%
BenchmarkLargeSetsNoIntersection            190792654       63748682        -66.59%
BenchmarkVeryLargeSetsSmallIntersection     896154437       373475843       -58.32%
BenchmarkHelplessContainsChecker            20719182678     14078301640     -32.05%
BenchmarkNetAndSpeed                        32519019        20188665        -37.92%
BenchmarkKeanuAndNet                        18319247        15224988        -16.89%
BenchmarkKeanuAndSpeed                      30849568        18744134        -39.24%
BenchmarkKeanuOther                         105552525       107620648       +1.96%
BenchmarkKeanuBullockOther                  295395338       115193002       -61.00%

benchmark                                   old allocs      new allocs      delta
BenchmarkNamePredicate                      1339            1341            +0.15%
BenchmarkLargeSetsNoIntersection            22585           23632           +4.64%
BenchmarkVeryLargeSetsSmallIntersection     65776           69396           +5.50%
BenchmarkHelplessContainsChecker            1713541         2036316         +18.84%
BenchmarkNetAndSpeed                        17104           17240           +0.80%
BenchmarkKeanuAndNet                        15816           15855           +0.25%
BenchmarkKeanuAndSpeed                      16368           16493           +0.76%
BenchmarkKeanuOther                         30134           30634           +1.66%
BenchmarkKeanuBullockOther                  35510           36454           +2.66%

benchmark                                   old bytes       new bytes       delta
BenchmarkNamePredicate                      96162           96294           +0.14%
BenchmarkLargeSetsNoIntersection            1172356         1249872         +6.61%
BenchmarkVeryLargeSetsSmallIntersection     2810080         2992409         +6.49%
BenchmarkHelplessContainsChecker            89233264        104999088       +17.67%
BenchmarkNetAndSpeed                        1388793         1428110         +2.83%
BenchmarkKeanuAndNet                        1263145         1250079         -1.03%
BenchmarkKeanuAndSpeed                      1246956         1281546         +2.77%
BenchmarkKeanuOther                         2021312         2024727         +0.17%
BenchmarkKeanuBullockOther                  2671448         2742968         +2.68%

Conflicts: graph/memstore/triplestore.go

Conflicts: graph/leveldb/triplestore.go graph/mongo/triplestore.go

Use cznic/b B+tree implementation in place of GoLLRB for memstore

barakmich · 2014-08-15T02:07:20Z

Faster memstore using b is merged, all the other bits seem to be in place and the build is green. I'm going for a beer and then sleeping, but barring objection, I'll merge this behemoth tomorrow. (I've got some others depending on it)

Convert Cayley indexing to an append-only log

barakmich added 22 commits July 24, 2014 16:43

add replication interface

426e0b6

add replication registry and local replication

768ca5c

update the triplestore interface and local replication

929b4f5

lint

9793096

wip

7a8d419

single writer

e13e65d

rename

1b24d66

update to master

cedaac3

rename to quads

81b3bf9

Make Memstore work with the QuadWriter

dcb495d

convert leveldb to log-structure

d4e5eea

speedup and cleanup

c3bd164

merge to master

c64acab

convert to using real quads

6d4738c

Merge with new Next() interface

a1e5a53

add config options and graph.Handle

8821c19

first swing at mongo indexing (iterator todo)

ff148f5

add iterator check for mongo

6d22037

test clean

3770190

Mongo log works (and bug fixed)

48711af

add removal test

9ce35ae

Merge branch 'master' into log_database

664b37b

barakmich mentioned this pull request Aug 11, 2014

Write log, as_of time and prep for replication #70

Closed

barakmich mentioned this pull request Aug 11, 2014

Provide informative error messages on web UI failures #110

Closed

kortschak reviewed Aug 12, 2014

kortschak and others added 5 commits August 13, 2014 16:01

Merge branch 'log_database' into b

4a92ae9

Conflicts: graph/memstore/iterator.go graph/memstore/triplestore.go

merge with master

fe0569c

comments and concretized deltas

f967b36

fix mongo indexing name mismatch

d2026ea

barakmich and others added 5 commits August 14, 2014 02:36

Merge branch 'master' into log_database

d5f020b

Merge hash pool in from master

8720e17

Conflicts: graph/leveldb/triplestore.go graph/mongo/triplestore.go

Merge pull request #1 from kortschak/b

3b83845

Use cznic/b B+tree implementation in place of GoLLRB for memstore

add test dep for travis

0ffb244

barakmich added a commit that referenced this pull request Aug 16, 2014

Merge pull request #113 from barakmich/log_database

e1e95b9

Convert Cayley indexing to an append-only log

barakmich merged commit e1e95b9 into cayleygraph:master Aug 16, 2014

barakmich deleted the log_database branch February 8, 2015 22:09
