batch operations #107

Closed
dominictarr opened this issue Jul 5, 2013 · 8 comments

@dominictarr commented Jul 5, 2013

Having a batch operation that atomically processes a series of puts and dels is vital for building reliable applications.

For example, say I wanted to implement a map-reduce, or some other async transformation on inserted data: on inserting the data, also insert a job item, and when that is durably written, start performing the job, and when the result of the job is written, delete the job.

Thus, if the system crashes while data is being written, you either get no data and no job, or the data and then, eventually, the job!
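
In levelup terms, something like this (a rough sketch; `runJob` and the key prefixes are made up):

```js
var levelup = require('levelup')
var db = levelup('./db')

function insertWithJob (id, doc, callback) {
  // atomically write the data and a pending job marker
  db.batch([
    { type: 'put', key: 'data!' + id, value: JSON.stringify(doc) },
    { type: 'put', key: 'job!' + id, value: 'pending' }
  ], function (err) {
    if (err) return callback(err)
    // both writes are durable, or neither is, so it's safe to run the job
    runJob(id, function (err, result) {
      if (err) return callback(err)
      // record the result and delete the job, again atomically
      db.batch([
        { type: 'put', key: 'result!' + id, value: JSON.stringify(result) },
        { type: 'del', key: 'job!' + id }
      ], callback)
    })
  })
}
```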

I have been doing a lot of work building node.js databases based on leveldb ( https://github.com/rvagg/node-levelup ), and have been looking for a suitable database primitive implemented in javascript, so that it is not necessary to compile anything.

Then, strata will be usable with the growing number of level-* modules
https://github.com/rvagg/node-levelup/wiki/Modules

The other thing that is needed is reverse iteration.

Browsing your code, I'm thinking that to get a batch operation, you'd need an extended checksum that validates the entire batch, or some other way of grouping a whole batch of writes into one line. Maybe some other separator, or a symbol meaning that the line is part of a batch, with the checksum applying to the batch as a whole?
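
For example (a totally hypothetical format), the batch's lines could be followed by a trailer line whose checksum covers the whole group, so a torn write invalidates the batch as a unit:

```js
var crypto = require('crypto')

// append the batch's lines, then a trailer line, "BATCH <count> <checksum>",
// whose checksum covers all of them; a torn write invalidates every line
function frameBatch (records) {
  var lines = records.map(function (r) { return JSON.stringify(r) })
  var body = lines.join('\n')
  var sum = crypto.createHash('sha1').update(body).digest('hex')
  return body + '\nBATCH ' + lines.length + ' ' + sum + '\n'
}

// on read, a group is only applied if its trailer checksum verifies
```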

Cheers, Dominic

ghost assigned bigeasy Jul 5, 2013
@bigeasy (Owner) commented Jul 5, 2013

@dominictarr Reverse iteration would be implemented optimistically. Forward iteration is performed using links, but for reverse iteration you start from the right-most leaf and work your way backwards, descending the tree to the preceding leaf each time. This is the intended design. The locking strategy for Strata is designed for liveness. The locking required to maintain reverse links would reduce liveness. It is assumed that descents are cheap in Strata, and quick to make progress, because exclusive locks are not global.

Batch operations are a requirement for a log structured tree. Currently, every insert into Strata is an append. Nothing is lost. However, each write opens and closes the leaf page. Yes, I know, horrible, but now that I understand the requirements, it's easy enough to use a WriteStream to do a leaf page append. Not brilliant, but a step in the right direction for the performance you're looking for.
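
Something like this, roughly (a sketch, not Strata's actual code):

```js
var fs = require('fs')

// keep one append-mode stream open per leaf page instead of
// re-opening the file on every write
function LeafAppender (file) {
  this.stream = fs.createWriteStream(file, { flags: 'a' })
}

// each insert becomes a single write to the open stream
LeafAppender.prototype.append = function (record, callback) {
  this.stream.write(JSON.stringify(record) + '\n', callback)
}

LeafAppender.prototype.close = function (callback) {
  this.stream.end(callback)
}
```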

But, yes, durability was there at the start. Absurdly durable. The file open and close was in the hopes that it would get the filesystem to flush. You can probably tell me a lot about how to properly sync Node.js I/O and I'd be happy to learn.

Note that Strata doesn't store the user's record as-is. It is a b-tree primitive, and the record that goes into Strata is not the one provided by the user. It was always assumed that in order to build a useful structure for a database, you would include database metadata in a record. For example, Strata does not support duplicate index values, but it was designed so that you could easily fake duplicates with a hidden increment. The increment makes the key unique, but you don't share that with the user.
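
A sketch of the hidden increment idea (the names here are made up):

```js
var seq = 0

// the stored record carries a hidden sequence number, so the stored
// key (key, seq) is unique even when user keys collide
function wrap (key, value) {
  return { key: key, seq: ++seq, value: value }
}

// the collation function compares the user key first, then the
// hidden sequence, which the user never sees
function compare (a, b) {
  if (a.key < b.key) return -1
  if (a.key > b.key) return 1
  return a.seq - b.seq
}
```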

With the right record, or a hidden record, you could fulfill your atomicity requirement. It is my intention to build an MVCC NoSQL database on top of Strata as a proof of concept. (Your concepts would be powerful proof too, so maybe I'll focus on that at the outset.) The MVCC database would use Strata to store an object/document along with a version number that gets incremented for each connection. This version number is stored, hidden from the user, in the record that contains the user's object/document. When a descent visits a leaf page, it can inspect the version numbers of the other records with the same key. There is a common collection of outstanding transactions. If one of the versions belongs to an outstanding transaction, we know that when we commit, if the outstanding transaction has already committed, we have a conflict and we can give a constraint error to the user.

Thus, to get a batch, you might use a separate Strata b-tree that represents a WAL. Each record would represent a component of your batching data structure, instead of trying to fit it all into one record. You would key the records on a batch number.
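
Something along these lines (hypothetical; `wal.appendAll` stands in for whatever the real insert API would be):

```js
// each component of the batch is its own record, keyed on
// (batch number, step); a final commit record seals the batch
function writeBatch (wal, batch, ops, callback) {
  var records = ops.map(function (op, i) {
    return { key: [batch, i], op: op }
  })
  records.push({ key: [batch, ops.length], commit: true })
  wal.appendAll(records, callback)
}

// on recovery, any batch found without its commit record is discarded
```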

@bigeasy (Owner) commented Jul 5, 2013

I'm prettyrobots on Skype if you would like to talk.

Let me know where you believe I should insert strata into the LevelDB ecosystem and I'll fork and submit a Pull Request. I can have a look myself, I suppose, to see if there is a backing storage API.

@dominictarr (Author) commented

No pull request is needed. Instead, you just need to support the api of https://github.com/rvagg/node-abstract-leveldown
It gives you api stubs, and test cases that are shared with the other leveldown implementations (there are a handful of them now).
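
The shape is roughly this (`StrataDOWN` is a name I just made up, and the bodies are stubs):

```js
var util = require('util')
var AbstractLevelDOWN = require('abstract-leveldown').AbstractLevelDOWN

function StrataDOWN (location) {
  AbstractLevelDOWN.call(this, location)
}
util.inherits(StrataDOWN, AbstractLevelDOWN)

// implement the underscore methods; the abstract class handles argument
// checking and exposes the public open/put/get/del/batch, and the shared
// test suite exercises them
StrataDOWN.prototype._open = function (options, callback) { /* ... */ }
StrataDOWN.prototype._put = function (key, value, options, callback) { /* ... */ }
StrataDOWN.prototype._get = function (key, options, callback) { /* ... */ }
StrataDOWN.prototype._del = function (key, options, callback) { /* ... */ }
StrataDOWN.prototype._batch = function (operations, options, callback) { /* ... */ }
```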

Hmm, I still think that batch operations should be lower in the abstraction hierarchy - but I'm not gonna press the point.
I don't understand enough about how b-trees work to be confident that one way is simpler than another.

Hmm, I get your point about the write-ahead log - in fact, that is pretty much how leveldb does it. The first level is an in-memory cache that is backed by a log file - and when this gets large, it's compacted into a sorted file.

Since this is relatively small, you could use the same approach on top of strata... hmm, it wouldn't even need to be a b-tree... it could just be a small file that is appended to, with all the data kept in memory, and then when it gets to a certain size, you'd write it out to the b-tree and start a new file.

To get batches, you'd just write each collection of records out inside an array, so if it crashed part way through, you'd get invalid JSON on that line.
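
i.e., something like this sketch of the recovery side:

```js
var fs = require('fs')

// one batch per line, as a JSON array; a crash mid-write leaves a
// truncated last line, which simply fails to parse
function recover (file) {
  var batches = []
  var lines = fs.readFileSync(file, 'utf8').split('\n')
  for (var i = 0; i < lines.length; i++) {
    if (!lines[i]) continue
    try {
      batches.push(JSON.parse(lines[i]))
    } catch (e) {
      break // torn write: everything before this line is intact
    }
  }
  return batches
}
```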

hmm, this is inspiring me to continue with my pure js leveldown that I started a while back...
I've got an sst implementation https://npmjs.org/package/json-sst and a log db https://npmjs.org/package/json-logdb
but haven't gotten around to integrating them yet...

@bigeasy (Owner) commented Jul 7, 2013

Strata is going to create a file-backed table of sorted records as you've described. If you do not want it to be a b-tree, then don't balance it. But you probably do want it to be a b-tree, and balanced frequently, because then, if you're inserting random data, operations can be performed in parallel, as different clients are updating different leaves of the tree. I don't know if LevelDB supports this or if it is a single-writer API. (I'll look harder tomorrow.)

Strata has always been intended for use in an MVCC database, and I've always wanted to allow concurrent writes to a b-tree.

I'll look harder at the abstraction too. I'm not sure I see the support for batched operations. I'm looking forward to that directory of benchmarks in the LevelDOWN repository.

In any case, this is an open issue, because using those benchmarks, I'll be designing a batching strategy for Strata (along with a binary storage strategy, since it appears that everyone feels that is necessary).

@dominictarr (Author) commented

leveldb is threadsafe, but may only be opened by a single process, and you can do parallel writes.

For binary data, maybe you can just use msgpack?
You'd need a length-delimited format instead of a line-separated one,
but you could probably just drop that in, changing very little.
I have this, which may serve as a useful start https://npmjs.org/package/msgpack-stream
that I adapted from @creationix's https://npmjs.org/package/msgpack-js

Another option for off-the-shelf serialization would be the redis protocol.
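
Length-delimited framing is simple enough to drop in. A sketch:

```js
// 4-byte big-endian length prefix before each msgpack (or whatever) blob
function frame (payload) { // payload is a Buffer
  var header = new Buffer(4)
  header.writeUInt32BE(payload.length, 0)
  return Buffer.concat([header, payload])
}

// read the frame at `offset`, returning the record and the next offset
function readFrame (buf, offset) {
  var length = buf.readUInt32BE(offset)
  var start = offset + 4
  return { record: buf.slice(start, start + length), next: start + length }
}
```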

@bigeasy (Owner) commented Jul 7, 2013

I'd do checksum and length, but with delimiters, so that you could scan through the file and salvage records. One nice attribute of Strata is that the leaves are lines of JSON, so if worst came to worst, you could run through a ruined b-tree reading the leaves, saving what you can. Also, there is no operation in Strata that isn't either a copy or an append. One of the open issues is to archive dead pages instead of deleting them: an option that will allow Strata some time to build trust, and to recover from violations of that trust.

So, some sort of sentry to find where the records begin and end. From there, a length and a checksum retrieve the record. Of course, if there is no damage, there is no scanning. I'm only thinking about the damage.
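
Hypothetically, a record could be laid out like this (the sentry string and the layout are made up):

```js
var crypto = require('crypto')

var SENTRY = '\t\t' // some byte sequence that can't occur in a JSON line

// sentry, then length and checksum on one line, then the payload
function writeRecord (payload) {
  var sum = crypto.createHash('sha1').update(payload).digest('hex')
  return SENTRY + payload.length + ' ' + sum + '\n' + payload + '\n'
}

// the salvage pass: scan for sentries and keep every record whose
// checksum verifies; an undamaged read never does this
function salvage (text) {
  return text.split(SENTRY).slice(1).map(function (chunk) {
    var eol = chunk.indexOf('\n')
    var header = chunk.slice(0, eol).split(' ')
    var payload = chunk.slice(eol + 1, eol + 1 + Number(header[0]))
    var sum = crypto.createHash('sha1').update(payload).digest('hex')
    return sum === header[1] ? payload : null
  }).filter(Boolean)
}
```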

Then storage is just a blob, so it could be anything, and Msgpack is a good option at the next level of abstraction. I'm not sure what LevelDB stores as binary. I only see binary. I have also been asked for binary by Kevin Swiber for his https://github.com/argo/medea database. I know I don't do binary. I suppose the right API is to offer up a buffer, or a write stream that accepts buffers.

Currently, storage is just a JSON blob. The user provides a collation function and a key extraction function. There might be a vivify function added to the mix, possibly a no-op, that allows the user to transform their buffer into a more meaningful intermediate state, like a JSON object from serialized JSON or Msgpack or other.

I'm going to need a faster checksum. I was going to implement Murmur 3. I'm looking forward to benchmarks blasting away at my assumptions. I know I need better batching and a better delayed-write strategy, but I didn't want to bother implementing them without a real application.
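
For scale, FNV-1a (not Murmur 3, but the same class of non-cryptographic hash) is only a handful of lines of javascript:

```js
// FNV-1a, 32-bit: a few ops per byte over the record's Buffer,
// much cheaper than a crypto hash
function fnv1a (buf) {
  var hash = 0x811c9dc5
  for (var i = 0; i < buf.length; i++) {
    hash ^= buf[i]
    // multiply by the FNV prime 0x01000193, decomposed into shifts so
    // the arithmetic stays exact in javascript's doubles
    hash += (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)
  }
  return hash >>> 0
}
```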

Also, I'm obviously going to need a merging iterator.

@dominictarr (Author) commented

Oh yeah, I also have this:

https://npmjs.org/package/json-buffer

It just converts buffers into base64, but you could use this and Strata would have binary, albeit not the most performant possible. I appreciate the simplicity you get from a line-separated format, but the downside is that you have to escape the data. The difference, if you can avoid mem copies, is measurable.
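
The trick is just a tagged string (roughly what json-buffer does; the exact prefix may differ):

```js
// encode Buffers as prefixed base64 strings so they survive JSON
function encode (value) {
  return JSON.stringify(value, function (key, val) {
    // `this[key]` is the raw value, before any toJSON() has run
    return Buffer.isBuffer(this[key])
      ? ':base64:' + this[key].toString('base64')
      : val
  })
}

function decode (text) {
  return JSON.parse(text, function (key, val) {
    return typeof val === 'string' && val.indexOf(':base64:') === 0
      ? new Buffer(val.slice(8), 'base64')
      : val
  })
}
```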

Leveldb uses Protocol Buffers, and Snappy.

just found this
https://github.com/sirikata/protojs

Snappy is fast compression (byte oriented, instead of bit oriented, optimised for speed first and compression second).

There doesn't seem to be a js implementation of snappy (it can be disabled in leveldb), but it would be pretty cool.
Sooner or later some level-* person will probably tackle that!

@bigeasy (Owner) commented Aug 9, 2013

This is actually related to good old #4.

bigeasy pushed a commit that referenced this issue Apr 14, 2014
Extract record formatting to a separate function so that it can be
invoked by a streams based delayed write.

See #107.
See #4.
bigeasy pushed a commit that referenced this issue May 28, 2014
See #356.
See #355.
See #107.
See #4.
bigeasy pushed a commit that referenced this issue May 28, 2014
Convert final use of `writePositions` to write footer separately.

See #356.
See #355.
See #107.
See #4.
bigeasy pushed a commit that referenced this issue May 28, 2014
Implement `writeBranch` using the streaming entry writing functions.

Rework `Transcript` to correctly convert streams errors to callback
errors, I hope, hard to tell if I've managed to capture every possible
error condition.

See #356.
See #355.
See #107.
See #4.
bigeasy pushed a commit that referenced this issue Jun 1, 2014
Closes #356.
Closes #355.
Closes #354.
See #107.
See #4.
bigeasy closed this as completed in e284fb4 Aug 9, 2014