Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert Cayley indexing to an append-only log #113

Merged
merged 33 commits into from
Aug 16, 2014

Conversation

barakmich
Copy link
Member

Fixes #70

In order to provide for replication, graph history and other neat clustering features, I created the interface QuadWriter which can take care of all the replication logic; TripleStore (soon, QuadStore, I'm guessing) should store the things that have been accepted. Further, it should be kept as a log -- a follow-up CL which adds GetRange(from, to int64) []*graph.Delta to the TripleStore interface is forthcoming.

Some question of moving, renaming and ownership is certainly valid.

This also turns Cayley into a full quadstore for all backends, which it was heading toward anyway. This PR finalizes that.

Anyway, this means that nothing really gets deleted, but that indexed quads get extra data about when they're valid, and the triplestore^Hquadstore iterators should check the validity. This adds some amount of slowdown for Next(), but how much?

$ benchcmp nolog.txt prelog.txt
benchmark                                   old ns/op       new ns/op       delta
BenchmarkNamePredicate                      1075967         1144500         +6.37%
BenchmarkLargeSetsNoIntersection            48881903        56925194        +16.45%
BenchmarkVeryLargeSetsSmallIntersection     509579408       757046655       +48.56%
BenchmarkHelplessContainsChecker            26454974514     27333222110     +3.32%
BenchmarkNetAndSpeed                        15422744        16408261        +6.39%
BenchmarkKeanuAndNet                        12878175        13006483        +1.00%
BenchmarkKeanuAndSpeed                      14542854        14950108        +2.80%
BenchmarkKeanuOther                         61897336        76940891        +24.30%
BenchmarkKeanuBullockOther                  78667479        81221650        +3.25%

A couple percent, but not noise. Incidentally, this is exacerbated by the Materialize iterator that recently went in; it does a lot of Next()ing, so on things where it aborts (HelplessContainsChecker) it doesn't have as problematic an impact. So, the wins from Materialize are still large, just less so now.

So what's the overhead on queries going through the addition/deletion/readdition cycle, other than using more persistence space? This now gets tested.

The answer to that, at least with the memstore is:

$ benchcmp prelog.txt postlog.txt
benchmark                                   old ns/op       new ns/op       delta
BenchmarkNamePredicate                      1144500         1108439         -3.15%
BenchmarkLargeSetsNoIntersection            56925194        57843995        +1.61%
BenchmarkVeryLargeSetsSmallIntersection     757046655       747045447       -1.32%
BenchmarkHelplessContainsChecker            27333222110     28253011444     +3.37%
BenchmarkNetAndSpeed                        16408261        17083855        +4.12%
BenchmarkKeanuAndNet                        13006483        13495442        +3.76%
BenchmarkKeanuAndSpeed                      14950108        15693748        +4.97%
BenchmarkKeanuOther                         76940891        68439409        -11.05%
BenchmarkKeanuBullockOther                  81221650        74245237        -8.59%

Meh. Not too much.

Obviously it'll be a bit slower for larger backends, but it'll still be fairly small because it fails fairly fast; we don't have to go through the whole tree for things that are invalid, we just go to the next thing in the underlying persistence iterator.

A migration tool to collapse the history (ie, only keep additions of things that are valid) might make sense in the future, but should be a separate tool.

/cc @kortschak @pbnjay

$ benchcmp gollrb.bench b-gen.bench
benchmark                                   old ns/op       new ns/op	delta
BenchmarkNamePredicate                      1731218         1693373		-2.19%
BenchmarkLargeSetsNoIntersection            81290360        70205277	-13.64%
BenchmarkVeryLargeSetsSmallIntersection     768135620       442906243	-42.34%
BenchmarkHelplessContainsChecker            39477086024     35260603748	-10.68%
BenchmarkNetAndSpeed                        22510637        21587975	-4.10%
BenchmarkKeanuAndNet                        18018886        17795328	-1.24%
BenchmarkKeanuAndSpeed                      20336586        20560228	+1.10%
BenchmarkKeanuOther                         85495040        80718152	-5.59%
BenchmarkKeanuBullockOther                  95457792        83868434	-12.14%

Code gen from $GOPATH/src/github.com/cznic/b:

  make generic \
| sed -e 's/KEY/int64/g' -e 's/VALUE/struct{}/g' \
> $GOPATH/src/github.com/google/cayley/graph/memstore/b/keys.go

key_test.go manually edited.
@@ -41,7 +43,7 @@ type Iterator struct {
result graph.Value
}

func NewIterator(prefix string, d quad.Direction, value graph.Value, qs *TripleStore) *Iterator {
func NewIterator(prefix string, d quad.Direction, value graph.Value, qs *TripleStore) graph.Iterator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you hiding this for a reason?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to return null. It works alright if it's a closed leveldb iterator as well. Fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you just return nil for the *Iterator?

h.QuadWriter.Close()
}

var ErrQuadExists = errors.New("Quad exists")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write as a var block.

@kortschak
Copy link
Contributor

LGTM with comments.

Has this been tested on leveldb and mongo?

@barakmich
Copy link
Member Author

Been watching your comments come in, all should be done (including concretization of Delta, because the pointer was out of habit, not speed). Extensively tested on LevelDB (also the extant LevelDB tests), and some on Mongo but it's probably worth hacking the integration test (-short) to prove both.

@pbnjay
Copy link
Contributor

pbnjay commented Aug 13, 2014

I don't feel too qualified to comment on this... I may be missing some of the surrounding conversations here.

But, FWIW, my high-level viewpoint is that this seems to muddy cayley functionality (as a graph query framework) with the backend implementation details. Postgres for example handles transactions and write logging just fine by itself.

@barakmich
Copy link
Member Author

One thing I've thought about doing is changing to

type graph.Handle struct {
  graph.TripleStore // soon QuadStore
  graph.QuadWriter
} 

As it's a simple union type which should have no overlap. The prior API would work fine (eg, AddTriple) and provide API users the same interface.

Yes, API helpers with a higher-level API seem right. To make an empty store ought to be as easy as
func cayley_api.MakeMemoryGraph() (graph.Handle, error)
Or, similarly:
func cayley_api.Dial(cayley_http_endpoint string) (graph.Handle, error)

As for backend implementors what handle transactions -- this is to help aid that fact. Yeah, Postgres does it fine, but perhaps there's another Cayley connected to the same server? How does one know that the data is there? Blind ignorance (and trusting the backend) was the order of the day before, but that doesn't hold for questions of distributed consistency.

By dealing in "accepted" deltas instead of dealing with pending and whatnot, all the backend features you want can be used. So yeah, batch together the Postgres transactions and use all the cool indexing features you like; it's free reign.

The further good news is the consistency gets managed by the QuadWriter now, and TripleStore is allowed to be mostly ignorant of this fact, so it's not much worse there.

Anyway, that's a rough explanation as to the why and the sort of effect it has. It's kind of important, and worth a very little more mud for implementors (if you did AddTripleSet and RemoveTriple before, it's nigh on a wrapper function). And as a framework, none too shabby.

kortschak and others added 5 commits August 13, 2014 16:01
$ benchcmp gollrb.bench b-gen.bench
benchmark                                   old ns/op       new ns/op	delta
BenchmarkNamePredicate                      1369329         1444990	+5.53%
BenchmarkLargeSetsNoIntersection            72329029        64975716	-10.17%
BenchmarkVeryLargeSetsSmallIntersection     890824761       408784476	-54.11%
BenchmarkHelplessContainsChecker            35314797618     30673240485	-13.14%
BenchmarkNetAndSpeed                        19694146        19486797	-1.05%
BenchmarkKeanuAndNet                        15340756        15317415	-0.15%
BenchmarkKeanuAndSpeed                      17902709        18042030	+0.78%
BenchmarkKeanuOther                         53452058        50984817	-4.62%
BenchmarkKeanuBullockOther                  90827780        86536510	-4.72%

benchmark                                   old allocs     new allocs	delta
BenchmarkNamePredicate                      1339           1339		+0.00%
BenchmarkLargeSetsNoIntersection            22603          22674	+0.31%
BenchmarkVeryLargeSetsSmallIntersection     65787          65860	+0.11%
BenchmarkHelplessContainsChecker            1713541        1713669	+0.01%
BenchmarkNetAndSpeed                        17135          17146	+0.06%
BenchmarkKeanuAndNet                        15802          15802	+0.00%
BenchmarkKeanuAndSpeed                      16397          16396	-0.01%
BenchmarkKeanuOther                         30148          30149	+0.00%
BenchmarkKeanuBullockOther                  35542          35544	+0.01%

benchmark                                   old bytes     new bytes	delta
BenchmarkNamePredicate                      96226         95842		-0.40%
BenchmarkLargeSetsNoIntersection            1165914       119725	+2.69%
BenchmarkVeryLargeSetsSmallIntersection     2760072       2777798	+0.64%
BenchmarkHelplessContainsChecker            84388448      84351168	-0.04%
BenchmarkNetAndSpeed                        1414837       1425752	+0.77%
BenchmarkKeanuAndNet                        1247249       1247453	+0.02%
BenchmarkKeanuAndSpeed                      1275522       1275243	-0.02%
BenchmarkKeanuOther                         2021107       2021497	+0.02%
BenchmarkKeanuBullockOther                  2682243       2683250	+0.04%
Conflicts:
	graph/memstore/iterator.go
	graph/memstore/triplestore.go
@barakmich
Copy link
Member Author

You were right about Mongo having some issues -- it's all better now, but it wasn't indexing. (Plus I removed some duplication and whatnot)

PTAL, I'm signing off for sleep though.

Here's the delta against master:

$ benchcmp masterbench.txt memlog.txt
benchmark                                   old ns/op       new ns/op       delta
BenchmarkNamePredicate                      892522          891003          -0.17%
BenchmarkLargeSetsNoIntersection            120452689       119235335       -1.01%
BenchmarkVeryLargeSetsSmallIntersection     466922201       594306798       +27.28%
BenchmarkHelplessContainsChecker            16471396973     14392818665     -12.62%
BenchmarkNetAndSpeed                        27157535        24395951        -10.17%
BenchmarkKeanuAndNet                        15147774        13962188        -7.83%
BenchmarkKeanuAndSpeed                      25653885        23129695        -9.84%
BenchmarkKeanuOther                         92172791        80948439        -12.18%
BenchmarkKeanuBullockOther                  248606767       228117419       -8.24%

For fun, here's the three backends on my machine, after log-ization:

$ cat memlog.txt leveldblog.txt mongolog.txt
PASS
BenchmarkNamePredicate      2000            891003 ns/op
BenchmarkLargeSetsNoIntersection              20         119235335 ns/op
BenchmarkVeryLargeSetsSmallIntersection        2         594306798 ns/op
BenchmarkHelplessContainsChecker               1        14392818665 ns/op
BenchmarkNetAndSpeed         100          24395951 ns/op
BenchmarkKeanuAndNet         100          13962188 ns/op
BenchmarkKeanuAndSpeed       100          23129695 ns/op
BenchmarkKeanuOther           20          80948439 ns/op
BenchmarkKeanuBullockOther            10         228117419 ns/op
ok      github.com/google/cayley        51.053s
PASS
BenchmarkNamePredicate       200           6191220 ns/op
BenchmarkLargeSetsNoIntersection               1        1251704355 ns/op
BenchmarkVeryLargeSetsSmallIntersection        1        21464633270 ns/op
BenchmarkHelplessContainsChecker               1        419149679194 ns/op
BenchmarkNetAndSpeed          50          70967738 ns/op
BenchmarkKeanuAndNet          50          58774665 ns/op
BenchmarkKeanuAndSpeed        20          73371978 ns/op
BenchmarkKeanuOther           10         240545715 ns/op
BenchmarkKeanuBullockOther             5         296443733 ns/op
ok      github.com/google/cayley        457.961s
PASS
BenchmarkNamePredicate       100          11653367 ns/op
BenchmarkLargeSetsNoIntersection               1        1215239008 ns/op
BenchmarkVeryLargeSetsSmallIntersection        1        5214089953 ns/op
BenchmarkHelplessContainsChecker               1        463718010910 ns/op
BenchmarkNetAndSpeed           2         946459782 ns/op
BenchmarkKeanuAndNet           5         484922502 ns/op
BenchmarkKeanuAndSpeed         5         521168853 ns/op
BenchmarkKeanuOther            1        1721563675 ns/op
BenchmarkKeanuBullockOther             1        2205499232 ns/op
ok      github.com/google/cayley        487.024s

barakmich and others added 5 commits August 14, 2014 02:36
Comparison of b against GoLLRB (as at d5f020).

$ benchcmp gollrb.bench b-gen.bench
benchmark                                   old ns/op       new ns/op	delta
BenchmarkNamePredicate                      1631932         1409531	-13.63%
BenchmarkLargeSetsNoIntersection            190792654       63748682	-66.59%
BenchmarkVeryLargeSetsSmallIntersection     896154437       373475843	-58.32%
BenchmarkHelplessContainsChecker            20719182678     14078301640	-32.05%
BenchmarkNetAndSpeed                        32519019        20188665	-37.92%
BenchmarkKeanuAndNet                        18319247        15224988	-16.89%
BenchmarkKeanuAndSpeed                      30849568        18744134	-39.24%
BenchmarkKeanuOther                         105552525       107620648	+1.96%
BenchmarkKeanuBullockOther                  295395338       115193002	-61.00%

benchmark                                   old allocs     new allocs	delta
BenchmarkNamePredicate                      1339           1341	+0.15%
BenchmarkLargeSetsNoIntersection            22585          23632	+4.64%
BenchmarkVeryLargeSetsSmallIntersection     65776          69396	+5.50%
BenchmarkHelplessContainsChecker            1713541        2036316	+18.84%
BenchmarkNetAndSpeed                        17104          17240	+0.80%
BenchmarkKeanuAndNet                        15816          15855	+0.25%
BenchmarkKeanuAndSpeed                      16368          16493	+0.76%
BenchmarkKeanuOther                         30134          30634	+1.66%
BenchmarkKeanuBullockOther                  35510          36454	+2.66%

benchmark                                   old bytes     new bytes	delta
BenchmarkNamePredicate                      96162         96294	+0.14%
BenchmarkLargeSetsNoIntersection            1172356       1249872	+6.61%
BenchmarkVeryLargeSetsSmallIntersection     2810080       2992409	+6.49%
BenchmarkHelplessContainsChecker            89233264      104999088	+17.67%
BenchmarkNetAndSpeed                        1388793       1428110	+2.83%
BenchmarkKeanuAndNet                        1263145       1250079	-1.03%
BenchmarkKeanuAndSpeed                      1246956       1281546	+2.77%
BenchmarkKeanuOther                         2021312       2024727	+0.17%
BenchmarkKeanuBullockOther                  2671448       2742968	+2.68%

Conflicts:
	graph/memstore/triplestore.go
Conflicts:
	graph/leveldb/triplestore.go
	graph/mongo/triplestore.go
Use cznic/b B+tree implementation in place of GoLLRB for memstore
@barakmich
Copy link
Member Author

Faster memstore using b is merged, all the other bits seem to be in place and the build is green. I'm going for a beer and then sleeping, but barring objection, I'll merge this behemoth tomorrow. (I've got some others depending on it)

barakmich added a commit that referenced this pull request Aug 16, 2014
Convert Cayley indexing to an append-only log
@barakmich barakmich merged commit e1e95b9 into cayleygraph:master Aug 16, 2014
@barakmich barakmich deleted the log_database branch February 8, 2015 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Write log, as_of time and prep for replication
3 participants