
roadmap: Blob storage #243

Open
justinsb opened this issue Dec 21, 2014 · 15 comments


@justinsb commented Dec 21, 2014

First off, I think this is awesome. I did a series of blog posts about a year ago (http://blog.justinsb.com/blog/categories/cloudata/) that posited the existence of a reliable key-value store and a blob store, and built all sorts of cool things on top of it. For example, a git store would store only the refs in consistent storage, but would store the data itself in the blob store. Similarly a filesystem: metadata in the key-value store; file data in blob storage. It might be fun for me to update that blog series using cockroachdb.

By blob storage, I mean immutable storage, where blobs are stored by their SHA-256 hash or similar. Blobs can be created and are replicated for reliable storage, but then the only "modification" allowed is deletion.

Do you have any plans/pointers for integrating blob storage? I could always just rely on S3 or similar, but this seems like something you might well be planning on supporting directly. (I presume it's not a good idea to just store these (multi-MB) blobs as values in the k-v store.)
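
A minimal sketch of the content-addressing idea described above, in Go using only the standard library; the blobKey helper is purely illustrative and not part of any existing API:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blobKey returns a content address for an immutable blob: the hex-encoded
// SHA-256 of its bytes. Blobs stored under such keys can only be created or
// deleted, never modified in place.
func blobKey(blob []byte) string {
	sum := sha256.Sum256(blob)
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(blobKey([]byte("hello, blob store")))
}
```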

@jamesgraves (Contributor) commented Dec 22, 2014

If you are willing to live with the restrictions for immutable blobs, that opens up a lot of design possibilities. I can imagine a couple different kinds of blob storage that might be useful in various situations. For example, some applications might work well in a peer-to-peer fashion, instead of using centralized servers.

I presume it's not a good idea to just store these (multi-MB) blobs as values in the k-v store.

Performance can potentially be much better with external blob storage. For a particular key-value pair, you are only talking to one node (the currently elected leader), which may not be located nearby; it could even be in another datacenter. The leader may also be busy handling transactions from other clients.

Presumably, if you are replicating blobs, the client can choose the closest and/or least busy server to download from. It is also easier to scale up the serving of static content.

@spencerkimball (Member) commented Dec 22, 2014

While at Google, I worked on Colossus, which was the successor to GFS. It used bigtable to scale out file metadata and a server at each disk to store chunks of data. Aside from increased scale, it used Reed Solomon encoding for more efficient storage. I'd of course like to do something similar on top of Cockroach at some point.

You definitely wouldn't want to store multi-MB blobs as values in the KV store. I believe that leveldb internally will store anything larger than 64K as a separate file.

@xiang90 (Contributor) commented Dec 24, 2014

@spencerkimball

It used bigtable to scale out file metadata and a server at each disk to store chunks of data.

So the metadata servers are also distributed, and each of them only puts part of the metadata into memory?

Aside from increased scale, it used Reed Solomon encoding for more efficient storage.

Reed-Solomon encoding/decoding is expensive as far as I know. Is this a client-driven thing, or is it done on the server side?

I'd of course like to do something similar on top of Cockroach at some point.

I would love to see this happen.

@rbastic commented Dec 31, 2014

Hi,

I've just released https://github.com/photosrv/photosrv, which is aimed at handling immutable blob storage. It still needs work in some areas and does not do any kind of encoding/decoding, but it has already been proven to scale effectively to hundreds of millions of files.

Maybe there is room for an integration layer between photosrv and CockroachDB?

Cheers,

@spencerkimball (Member) commented Jan 13, 2015

@xiang90

The metadata was distributed, but it didn't hold the data in memory, except through bigtable's normal caching facility.

Reed-Solomon encoding/decoding can be made blazingly fast with appropriate optimizations, both in how the Cauchy encoding matrix is chosen and with carefully tuned low-level inner-loop instructions. At Google, encoding was done exclusively on the client side, as was on-the-fly decoding whenever requested data blocks were not available, necessitating reads of parity blocks. Decoding was also done server-side for permanent reconstructions when machines went missing, data corruption was identified, or disks died.

@Frank-Jin commented Feb 7, 2015

@spencerkimball
Hi, I have a question.
I see that Google's Spanner was built on top of Colossus, but Cockroach doesn't depend on any distributed FS. What's the difference between them?

@spencerkimball (Member) commented Feb 7, 2015

Running on top of Colossus is a very efficient proposition since it uses erasure-coded storage. It means that if you store a replica of part of your key range in each of five datacenters, you end up with a 5 * 1.67x ≈ 8.3x encoding rate. Back in the days of GFS, the same configuration would mean a 5 * 3x = 15x encoding rate for triplicated storage, or 5 * 2x = 10x for just two replicas per datacenter (GFS r=2 encoding).
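
A back-of-the-envelope sketch of that arithmetic in Go; the 6-data/4-parity split is an assumption chosen only to reproduce the ~1.67x per-copy overhead mentioned above:

```go
package main

import "fmt"

func main() {
	const datacenters = 5.0

	// Erasure-coded storage: per-copy overhead is (data+parity)/data.
	// 6 data + 4 parity shards give ~1.67x, matching the figure above.
	const data, parity = 6.0, 4.0
	fmt.Printf("erasure-coded: %.1fx\n", datacenters*(data+parity)/data) // ~8.3x

	// Replicated storage: per-copy overhead is the replication factor.
	fmt.Printf("GFS r=3:       %.0fx\n", datacenters*3) // 15x
	fmt.Printf("GFS r=2:       %.0fx\n", datacenters*2) // 10x
}
```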

Cockroach doesn't use a separate distributed file system as a dependent layer, so with five datacenters you'd either end up with a 5x encoding rate (one replica in each) or you could increase the Raft replication factor to two replicas per datacenter (or even three, though I don't think that would make as much sense). Using only a single replica per datacenter means you need to use inter-datacenter bandwidth in order to recover a lost disk or machine. Using two replicas per datacenter means you could very often rely on intra-datacenter bandwidth for recovery. The cost of using two replicas in each datacenter is in encoding efficiency (2x) and in write latencies, which increase both from the additional bandwidth required on writes and probably from a change in the latency histogram due to more replicas participating in consensus. You could possibly get really clever with the inter-datacenter bandwidth by sending to only one replica per datacenter and having that replica responsible for forwarding to the alternate, but that would be a really onerous bit of complexity to add.

Eventually Cockroach could just as easily run on a distributed file system, but at this stage it would be a mistake to require such a complex external dependency. Most Cockroach clusters will start as single-datacenter deployments, which makes the inter-datacenter bandwidth issue a moot point. Further, a very smart and cheap way to mitigate the costs of recovery in multi-datacenter clusters would be to use more reliable storage than Google uses internally, for example hardware RAID.

I'm quite serious about the next step being CockroachFS, which would introduce erasure encoding; CockroachDB could bootstrap itself and run on top of it, making it closer to the Spanner architecture where storage is concerned.

@Frank-Jin commented Feb 9, 2015

I see, thank you.

@cloudmode commented Mar 28, 2015

We just ported our old FoundationFS demo to cockroachdb as an exercise to learn/understand cockroachdb. Overall, it was a good experience, and we intend to keep this project up to date as cockroachdb moves into beta and production. You can check it out at https://github.com/cloudmode/roachclip-fs. It's an attempt to replicate mongodb's gridfs interface.

@cloudmode commented Mar 28, 2015

Just to clarify, roachclip-fs is for file storage; it's not a file system.

@xiang90 (Contributor) commented Apr 10, 2015

@spencerkimball

Reed-Solomon encoding/decoding can be made blazingly fast with appropriate optimizations, both in how the Cauchy encoding matrix is chosen and with carefully tuned low-level inner-loop instructions.

I just found https://github.com/tsuraan/Jerasure and https://github.com/catid/longhair. I am not sure if it is worth a rewrite in Go.

At Google, encoding was done exclusively on the client side, as was on-the-fly decoding whenever requested data blocks were not available, necessitating reads of parity blocks. Decoding was also done server-side for permanent reconstructions when machines went missing, data corruption was identified, or disks died.

I did some research on this recently, and at the moment I am more interested in permanent reconstruction on the server side.

My understanding is that this has to be done on the server side since we want to reduce bandwidth on the client side. Reconstructing a missing block might involve up to k transmissions for a (k, m) coding.

But does this also mean that the server side has to know the original encoding matrix? And that reconstruction needs some help from the upper level (or perhaps needs to remember the matrix)?

@spencerkimball (Member) commented Apr 10, 2015

@xiang90 At Google, on Colossus we did both client-side and server-side reconstructions. Waiting for the server when accessing via the client would result in unacceptable latency. There are also some common situations where an optimized data layout would result in no additional bandwidth to reconstruct (e.g. when scanning a file a "stripe" at a time). The encoding matrix is just a binary string of length k * m; we'd store that with the file metadata so the server could reconstruct at its leisure.

Here's a description of a possible data layout:

[Figure: data layout for two stripes, each with 8 data chunks and 4 code chunks; each chunk is a vertical column of 1 MB blocks totalling 8 MB]

The above diagram shows the data layout for two "stripes", a convenient concept for describing pieces of a larger file. In this example there are 8 data chunks and 4 code chunks per stripe (another common format might be 6.4). The metadata for a larger file was broken down by stripes, and each stripe here has 12 "chunks". In the diagram, a chunk is represented by a vertical column containing 8 MB of non-contiguous data. A stripe thus contains 64 MB of data and 32 MB of parity/code blocks, and a "block" is one of the cells in the diagram.

Why lay out the data this way? There are some compelling reasons. First, you expect many files to be relatively small. If you have files flirting with sizes within an order of magnitude of the stripe size, you would suffer quite expensive overhead in code blocks whenever a file's size modulo the stripe size is (say) 16 MB. In that case, the final stripe would only contain 16 MB of data blocks. If you laid out data contiguously along chunks, you'd still need all 32 MB of code blocks, so your final stripe of data would have an encoding rate of 3x. Not good. With the layout in the diagram, you spread the 16 MB of data blocks across only the first two block rows and use only 8 MB of code blocks (also just those along the first two "mini" stripes), giving you an encoding rate of 1.5x.
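
A small sketch of that final-stripe arithmetic in Go, assuming the 8-data/4-code layout with 1 MB blocks and 8 MB chunks from the diagram; purely illustrative:

```go
package main

import "fmt"

func main() {
	const (
		dataChunks = 8.0
		codeChunks = 4.0
		blockMB    = 1.0
		chunkMB    = 8.0
	)
	const finalStripeDataMB = 16.0 // file size modulo the 64 MB stripe size

	// Contiguous layout: the partial stripe still needs every code chunk in full.
	contiguousCodeMB := codeChunks * chunkMB // 32 MB
	fmt.Printf("contiguous layout: %.1fx\n",
		(finalStripeDataMB+contiguousCodeMB)/finalStripeDataMB) // 3.0x

	// Striped layout: data fills one 8 MB block row at a time, so only the
	// rows actually used need matching code blocks.
	rowsUsed := finalStripeDataMB / (dataChunks * blockMB) // 2 rows
	stripedCodeMB := rowsUsed * codeChunks * blockMB       // 8 MB
	fmt.Printf("striped layout:    %.1fx\n",
		(finalStripeDataMB+stripedCodeMB)/finalStripeDataMB) // 1.5x
}
```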

For similar reasons (and critically for performance), on-the-fly recovery of data is greatly enhanced by this data layout. If you're trying to read 2 MB of data and one of the chunks is unavailable, you'll likely end up reading 8 MB of data to reconstruct the missing 1 MB (the 1 MB from the 2 MB which is available, plus 7 additional blocks from the remaining 10 chunks). This is a 4x read blowup. On the other hand, with the contiguous block layout, missing 2 MB of data would mean reading 16 MB (2 MB from each of 8 of the remaining 11 chunks), for an 8x read blowup. Further, if you read in 8 MB increments, a missing chunk with this data layout actually incurs no read blowup at all (say you read from 7 of the 8 data chunks; then you just need 1 MB from one of the 4 code chunks). For the contiguous chunk model, you'd need to read 64 MB. Ouch--an 8x read blowup.
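
And the read-blowup arithmetic from the same scenario, again assuming a single unavailable data chunk in the 8+4 layout; only an illustration of the numbers above:

```go
package main

import "fmt"

// blowup = MB actually read / MB requested, for one unavailable data chunk.
func main() {
	// Striped layout, 2 MB request spanning two chunks, one of them missing:
	// 1 MB is readable directly, plus 7 more 1 MB blocks from that row.
	fmt.Printf("striped, 2 MB read:    %.0fx\n", (1.0+7.0)/2.0) // 4x

	// Contiguous layout, 2 MB request inside the missing chunk:
	// the matching 2 MB range from 8 surviving chunks is needed.
	fmt.Printf("contiguous, 2 MB read: %.0fx\n", 8*2.0/2.0) // 8x

	// Striped layout, 8 MB request (a full block row): 7 MB of data plus
	// 1 MB from a code chunk -- no blowup at all.
	fmt.Printf("striped, 8 MB read:    %.0fx\n", (7.0+1.0)/8.0) // 1x

	// Contiguous layout, 8 MB request covering the whole missing chunk:
	// 8 MB from each of 8 surviving chunks.
	fmt.Printf("contiguous, 8 MB read: %.0fx\n", 8*8.0/8.0) // 8x
}
```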

@CooCooCaCha commented Dec 27, 2015

I recently found this Go library for Reed-Solomon encoding. It claims 1 GB/s per CPU core.
https://github.com/klauspost/reedsolomon
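
For anyone curious, here is a minimal sketch of using that library; the 8+4 split is chosen only to mirror the stripes discussed above, and this is an illustration, not an integration proposal:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataShards, parityShards = 8, 4

	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		log.Fatal(err)
	}

	data := bytes.Repeat([]byte("blob"), 1<<20) // some blob contents

	// Split into 8 data shards (parity shards are allocated too), then encode.
	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing two shards, then reconstruct them from the survivors.
	shards[0], shards[9] = nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	// Reassemble the original bytes.
	var out bytes.Buffer
	if err := enc.Join(&out, shards, len(data)); err != nil {
		log.Fatal(err)
	}
	fmt.Println("recovered:", bytes.Equal(out.Bytes(), data))
}
```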

@petermattis petermattis modified the milestone: Post 1.0 Feb 14, 2016

@petermattis petermattis changed the title Feature request: Blob storage sql: Feature request: Blob storage Mar 31, 2016

@knz knz changed the title sql: Feature request: Blob storage Feature request: Blob storage May 9, 2018

@knz knz changed the title Feature request: Blob storage roadmap: Blob storage Jul 23, 2018

@tim-o (Contributor) commented Aug 12, 2018

Zendesk ticket #2743 has been linked to this issue.

@Dmole commented Apr 6, 2019

OID / BLOB support would bring CockroachDB closer to a drop-in replacement for PostgreSQL.
