batch operations #107

Closed
dominictarr opened this issue Jul 5, 2013 · 8 comments

@dominictarr commented Jul 5, 2013

Having a batch operation that atomically processes a series of puts and dels is vital for building reliable applications.

For example, say I wanted to implement a map-reduce, or some other async transformation on inserted data: on inserting the data, also insert a job item, and when that is durably written, start performing the job, and when the result of the job is written, delete the job.

Thus, if the system crashes while data is being written, you either get no data and no job, or the data and then, eventually, the job!
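
In levelup terms, something like this (a rough sketch; `runJob` and the key prefixes are made up):

```js
var levelup = require('levelup')
var db = levelup('./db')

function insertWithJob (id, doc, callback) {
  // atomically write the data and a pending job marker
  db.batch([
    { type: 'put', key: 'data!' + id, value: JSON.stringify(doc) },
    { type: 'put', key: 'job!' + id, value: 'pending' }
  ], function (err) {
    if (err) return callback(err)
    // both writes are durable, or neither is, so it's safe to run the job
    runJob(id, function (err, result) {
      if (err) return callback(err)
      // record the result and delete the job, again atomically
      db.batch([
        { type: 'put', key: 'result!' + id, value: JSON.stringify(result) },
        { type: 'del', key: 'job!' + id }
      ], callback)
    })
  })
}
```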

I have been doing a lot of work building node.js databases based on leveldb ( https://github.com/rvagg/node-levelup ), and have been looking for a suitable database primitive implemented in javascript, so that it is not necessary to compile anything.

Then, strata will be usable with the growing number of level-* modules
https://github.com/rvagg/node-levelup/wiki/Modules

The other thing that is needed is reverse iteration.

Browsing your code, I'm thinking that to get a batch operation, you'd need an extended checksum that validates the entire batch, or some other way of grouping a whole batch of writes into one line. Maybe some other separator, or a symbol meaning that the line is part of a batch, with the checksum applying to the batch as a whole?
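
For example (a totally hypothetical format), the batch's lines could be followed by a trailer line whose checksum covers the whole group, so a torn write invalidates the batch as a unit:

```js
var crypto = require('crypto')

// append the batch's lines, then a trailer line, "BATCH <count> <checksum>",
// whose checksum covers all of them; a torn write invalidates every line
function frameBatch (records) {
  var lines = records.map(function (r) { return JSON.stringify(r) })
  var body = lines.join('\n')
  var sum = crypto.createHash('sha1').update(body).digest('hex')
  return body + '\nBATCH ' + lines.length + ' ' + sum + '\n'
}

// on read, a group is only applied if its trailer checksum verifies
```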

Cheers, Dominic

ghost assigned bigeasy Jul 5, 2013
@bigeasy (Owner) commented Jul 5, 2013

@dominictarr Reverse iteration would be implemented optimistically. Forward iteration is performed using links, but for reverse iteration you start from the right-most leaf and work your way backwards, descending the tree to the preceding leaf each time. This is the intended design. The locking strategy for Strata is designed for liveness. The locking required to maintain reverse links would reduce liveness. It is assumed that descents are cheap in Strata, and quick to make progress, because exclusive locks are not global.

Batch operations are a requirement for a log structured tree. Currently, every insert into Strata is an append. Nothing is lost. However, each write opens and closes the leaf page. Yes, I know, horrible, but now that I understand the requirements, it's easy enough to use a WriteStream to do a leaf page append. Not brilliant, but a step in the right direction for the performance you're looking for.
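
Something like this, roughly (a sketch, not Strata's actual code):

```js
var fs = require('fs')

// keep one append-mode stream open per leaf page instead of
// re-opening the file on every write
function LeafAppender (file) {
  this.stream = fs.createWriteStream(file, { flags: 'a' })
}

// each insert becomes a single write to the open stream
LeafAppender.prototype.append = function (record, callback) {
  this.stream.write(JSON.stringify(record) + '\n', callback)
}

LeafAppender.prototype.close = function (callback) {
  this.stream.end(callback)
}
```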

But, yes, durability was there at the start. Absurdly durable. The file open and close was in the hopes that it would get the filesystem to flush. You can probably tell me a lot about how to properly sync Node.js I/O and I'd be happy to learn.

Note that Strata doesn't store the user's record as-is. It is a b-tree primitive, and the record that goes into Strata is not the one provided by the user. It was always assumed that in order to build a useful structure for a database, you would include database metadata in a record. For example, Strata does not support duplicate index values, but it was designed so that you could easily fake duplicates with a hidden increment. The increment makes the key unique, but you don't share that with the user.
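
A sketch of the hidden increment idea (the names here are made up):

```js
var seq = 0

// the stored record carries a hidden sequence number, so the stored
// key (key, seq) is unique even when user keys collide
function wrap (key, value) {
  return { key: key, seq: ++seq, value: value }
}

// the collation function compares the user key first, then the
// hidden sequence, which the user never sees
function compare (a, b) {
  if (a.key < b.key) return -1
  if (a.key > b.key) return 1
  return a.seq - b.seq
}
```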

With the right record, or a hidden record, you could fulfill your atomicity requirement. It is my intention to build an MVCC NoSQL database on top of Strata as a proof of concept. (Your concepts would be powerful proof too, so maybe I'll focus on that at the outset.) The MVCC database would use Strata to store an object/document along with a version number that gets incremented for each connection. This version number is stored, hidden from the user, in the record that contains the user's object/document. When a descent visits a leaf page, it can inspect the version numbers of the other records with the same key. There is a common collection of outstanding transactions. If one of the versions belongs to an outstanding transaction, we know that when we commit, if the outstanding transaction has already committed, we have a conflict and we can give a constraint error to the user.

Thus, to get a batch, you might use a separate Strata b-tree that represents a WAL. Each record would represent a component of your batching data structure, instead of trying to fit it all into one record. You would key the records on a batch number.
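
Something along these lines (hypothetical; `wal.appendAll` stands in for whatever the real insert API would be):

```js
// each component of the batch is its own record, keyed on
// (batch number, step); a final commit record seals the batch
function writeBatch (wal, batch, ops, callback) {
  var records = ops.map(function (op, i) {
    return { key: [batch, i], op: op }
  })
  records.push({ key: [batch, ops.length], commit: true })
  wal.appendAll(records, callback)
}

// on recovery, any batch found without its commit record is discarded
```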

@bigeasy (Owner) commented Jul 5, 2013

I'm prettyrobots on Skype if you would like to talk.

Let me know where you believe I should insert strata into the LevelDB ecosystem and I'll fork and submit a Pull Request. I can have a look myself, I suppose, to see if there is a backing storage API.

@dominictarr (Author) commented

No pull request is needed. Instead, you just need to support the api of https://github.com/rvagg/node-abstract-leveldown
It gives you api stubs, and test cases that are shared with the other leveldown implementations (there are a handful of them now).
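
The shape is roughly this (`StrataDOWN` is a name I just made up, and the bodies are stubs):

```js
var util = require('util')
var AbstractLevelDOWN = require('abstract-leveldown').AbstractLevelDOWN

function StrataDOWN (location) {
  AbstractLevelDOWN.call(this, location)
}
util.inherits(StrataDOWN, AbstractLevelDOWN)

// implement the underscore methods; the abstract class handles argument
// checking and exposes the public open/put/get/del/batch, and the shared
// test suite exercises them
StrataDOWN.prototype._open = function (options, callback) { /* ... */ }
StrataDOWN.prototype._put = function (key, value, options, callback) { /* ... */ }
StrataDOWN.prototype._get = function (key, options, callback) { /* ... */ }
StrataDOWN.prototype._del = function (key, options, callback) { /* ... */ }
StrataDOWN.prototype._batch = function (operations, options, callback) { /* ... */ }
```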

Hmm, I still think that batch operations should be lower in the abstraction hierarchy - but I'm not gonna press the point.
I don't understand enough about how b-trees work to be confident that one way is simpler than another.

Hmm, I get your point about the write-ahead log - in fact, that is pretty much how leveldb does it. The first level is an in-memory cache that is backed by a log file - and when this gets large, it's compacted into a sorted file.

Since this is relatively small, you could use the same approach on top of strata... hmm, it wouldn't even need to be a b-tree... it could just be a small file that is appended to, with all the data kept in memory, and then when it gets to a certain size, you'd write it out to the b-tree and start a new file.

To get batches, you'd just write each collection of records out inside an array, so if it crashed part way through, you'd get invalid JSON on that line.
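
i.e., something like this sketch of the recovery side:

```js
var fs = require('fs')

// one batch per line, as a JSON array; a crash mid-write leaves a
// truncated last line, which simply fails to parse
function recover (file) {
  var batches = []
  var lines = fs.readFileSync(file, 'utf8').split('\n')
  for (var i = 0; i < lines.length; i++) {
    if (!lines[i]) continue
    try {
      batches.push(JSON.parse(lines[i]))
    } catch (e) {
      break // torn write: everything before this line is intact
    }
  }
  return batches
}
```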

hmm, this is inspiring me to continue with my pure js leveldown that I started a while back...
I've got an sst implementation https://npmjs.org/package/json-sst and a log db https://npmjs.org/package/json-logdb
but haven't gotten around to integrating them yet...

@bigeasy (Owner) commented Jul 7, 2013

Strata is going to create a file-backed table of sorted records as you've described. If you do not want it to be a b-tree, then don't balance it. But you probably do want it to be a b-tree, and balanced frequently, because then, if you're inserting random data, operations can be performed in parallel, as different clients are updating different leaves of the tree. I don't know if LevelDB supports this or if it is a single-writer API. (I'll look harder tomorrow.)

Strata has always been intended for use in an MVCC database, and I've always wanted to allow concurrent writes to a b-tree.

I'll look harder at the abstraction too. I'm not sure I see the support for batched operations. I'm looking forward to that directory of benchmarks in the LevelDOWN repository.

In any case, this is an open issue, because using those benchmarks, I'll be designing a batching strategy for Strata (along with a binary storage strategy, since it appears that everyone feels that is necessary).

@dominictarr (Author) commented

leveldb is threadsafe, but may only be opened by a single process, and you can do parallel writes.

For binary data, maybe you can just use msgpack?
You'd need a length-delimited format instead of a line-separated one,
but you could probably just drop that in, changing very little.
I have this, which may serve as a useful start https://npmjs.org/package/msgpack-stream
that I adapted from @creationix's https://npmjs.org/package/msgpack-js

Another option for off-the-shelf serialization would be the redis protocol.
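
Length-delimited framing is simple enough to drop in. A sketch:

```js
// 4-byte big-endian length prefix before each msgpack (or whatever) blob
function frame (payload) { // payload is a Buffer
  var header = new Buffer(4)
  header.writeUInt32BE(payload.length, 0)
  return Buffer.concat([header, payload])
}

// read the frame at `offset`, returning the record and the next offset
function readFrame (buf, offset) {
  var length = buf.readUInt32BE(offset)
  var start = offset + 4
  return { record: buf.slice(start, start + length), next: start + length }
}
```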

@bigeasy (Owner) commented Jul 7, 2013

I'd do checksum and length, but with delimiters, so that you could scan through the file and salvage records. One nice attribute of Strata is that the leaves are lines of JSON, so if worst came to worst, you could run through a ruined b-tree reading the leaves, saving what you can. Also, there is no operation in Strata that isn't either a copy or an append. One of the open issues is to archive dead pages instead of deleting them: an option that will allow Strata some time to build trust, and to recover from violations of that trust.

So, some sort of sentry to find where the records begin and end. From there, a length and a checksum retrieve the record. Of course, if there is no damage, there is no scanning. I'm only thinking about the damage.
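
Hypothetically, a record could be laid out like this (the sentry string and the layout are made up):

```js
var crypto = require('crypto')

var SENTRY = '\t\t' // some byte sequence that can't occur in a JSON line

// sentry, then length and checksum on one line, then the payload
function writeRecord (payload) {
  var sum = crypto.createHash('sha1').update(payload).digest('hex')
  return SENTRY + payload.length + ' ' + sum + '\n' + payload + '\n'
}

// the salvage pass: scan for sentries and keep every record whose
// checksum verifies; an undamaged read never does this
function salvage (text) {
  return text.split(SENTRY).slice(1).map(function (chunk) {
    var eol = chunk.indexOf('\n')
    var header = chunk.slice(0, eol).split(' ')
    var payload = chunk.slice(eol + 1, eol + 1 + Number(header[0]))
    var sum = crypto.createHash('sha1').update(payload).digest('hex')
    return sum === header[1] ? payload : null
  }).filter(Boolean)
}
```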

Then storage is just a blob, so it could be anything, and Msgpack is a good option at the next level of abstraction. I'm not sure what LevelDB stores as binary. I only see binary. I have also been asked for binary by Kevin Swiber for his https://github.com/argo/medea database. I know I don't do binary. I suppose the right API is to offer up a buffer, or a write stream that accepts buffers.

Currently, storage is just a JSON blob. The user provides a collation function and a key extraction function. There might be a vivify function added to the mix, possibly a no-op, that allows the user to transform their buffer into a more meaningful intermediate state, like a JSON object from serialized JSON or Msgpack or other.

I'm going to need a faster checksum. I was going to implement Murmur 3. I'm looking forward to benchmarks blasting away at my assumptions. I know I need better batching and a better delayed-write strategy, but I didn't want to bother implementing them without a real application.
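
For scale, FNV-1a (not Murmur 3, but the same class of non-cryptographic hash) is only a handful of lines of javascript:

```js
// FNV-1a, 32-bit: a few ops per byte over the record's Buffer,
// much cheaper than a crypto hash
function fnv1a (buf) {
  var hash = 0x811c9dc5
  for (var i = 0; i < buf.length; i++) {
    hash ^= buf[i]
    // multiply by the FNV prime 0x01000193, decomposed into shifts so
    // the arithmetic stays exact in javascript's doubles
    hash += (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)
  }
  return hash >>> 0
}
```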

Also, I'm obviously going to need a merging iterator.

@dominictarr (Author) commented

Oh yeah, I also have this:

https://npmjs.org/package/json-buffer

It just converts buffers into base64, but you could use this and Strata would have binary, albeit not the most performant possible. I appreciate the simplicity you get from a line-separated format, but the downside is that you have to escape the data. The difference, if you can avoid mem copies, is measurable.
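
The trick is just a tagged string (roughly what json-buffer does; the exact prefix may differ):

```js
// encode Buffers as prefixed base64 strings so they survive JSON
function encode (value) {
  return JSON.stringify(value, function (key, val) {
    // `this[key]` is the raw value, before any toJSON() has run
    return Buffer.isBuffer(this[key])
      ? ':base64:' + this[key].toString('base64')
      : val
  })
}

function decode (text) {
  return JSON.parse(text, function (key, val) {
    return typeof val === 'string' && val.indexOf(':base64:') === 0
      ? new Buffer(val.slice(8), 'base64')
      : val
  })
}
```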

Leveldb uses Protocol Buffers, and Snappy.

just found this
https://github.com/sirikata/protojs

Snappy is fast compression (byte oriented, instead of bit oriented, optimised for speed first and compression second).

There doesn't seem to be a js implementation of snappy (it can be disabled in leveldb), but it would be pretty cool.
Sooner or later some level-* person will probably tackle that!

@bigeasy (Owner) commented Aug 9, 2013

This is actually related to good old #4.

bigeasy pushed a commit that referenced this issue Apr 14, 2014
Extract record formatting to a separate function so that it can be
invoked by a streams based delayed write.

See #107.
See #4.
bigeasy pushed a commit that referenced this issue May 28, 2014
See #356.
See #355.
See #107.
See #4.
bigeasy pushed a commit that referenced this issue May 28, 2014
Convert final use of `writePositions` to write footer separately.

See #356.
See #355.
See #107.
See #4.
bigeasy pushed a commit that referenced this issue May 28, 2014
Implement `writeBranch` using the streaming entry writing functions.

Rework `Transcript` to correctly convert streams errors to callback
errors, I hope, hard to tell if I've managed to capture every possible
error condition.

See #356.
See #355.
See #107.
See #4.
bigeasy pushed a commit that referenced this issue Jun 1, 2014
Closes #356.
Closes #355.
Closes #354.
See #107.
See #4.
bigeasy closed this as completed in e284fb4 Aug 9, 2014