
Implement chunking for blobs, at least #17

Closed
cmasone-attic opened this issue Jul 8, 2015 · 3 comments
Comments

@cmasone-attic
Contributor

We want to split large pieces of data up into chunks. Initially we'll do this for blobs, but we'll probably want to expand beyond that.

We'll use a rolling hash function to figure out where to chunk things, and will need to deal with chunks that are still too big.

A blob value will really just contain a list of Refs to the chunks that make it up, and the Ref of the blob itself will be some kind of hash-of-hashes. This does introduce some complexity to the Reader() of a blob, since it now needs to find and decode all the chunks that make up the blob. Aaron proposes a "Resolver" interface that all composite types will have a pointer to, which looks like this:

type Resolver struct{}
func (res *Resolver) Resolve(r ref.Ref) Value
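
For illustration, here is a minimal, self-contained sketch of how a compound blob's Reader() might use such a Resolver. The Ref, Value, blobLeaf, and mapResolver types below are stand-ins invented for this sketch, not the real noms types:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Stand-ins for the real noms types (ref.Ref, Value); this is only a sketch.
type Ref string

type Value interface{ Bytes() []byte }

type blobLeaf []byte

func (b blobLeaf) Bytes() []byte { return b }

// Resolver turns a Ref into the Value it identifies, e.g. by reading and
// decoding a chunk from the chunk store.
type Resolver interface {
	Resolve(r Ref) Value
}

type mapResolver map[Ref]Value

func (m mapResolver) Resolve(r Ref) Value { return m[r] }

// compoundBlob stores only the Refs of its chunks; Reader asks the Resolver
// for each chunk and concatenates them.
type compoundBlob struct {
	chunks   []Ref
	resolver Resolver
}

func (cb compoundBlob) Reader() io.Reader {
	readers := make([]io.Reader, len(cb.chunks))
	for i, r := range cb.chunks {
		readers[i] = bytes.NewReader(cb.resolver.Resolve(r).Bytes())
	}
	return io.MultiReader(readers...)
}

func main() {
	res := mapResolver{"sha1-hi": blobLeaf("hi"), "sha1-bye": blobLeaf("bye")}
	cb := compoundBlob{chunks: []Ref{"sha1-hi", "sha1-bye"}, resolver: res}
	data, _ := io.ReadAll(cb.Reader())
	fmt.Println(string(data)) // hibye
}
```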

@aboodman
Contributor

aboodman commented Jul 8, 2015

Note that this resolver thing is also needed to implement incremental decoding (issue #11)

@arv
Contributor

arv commented Jul 30, 2015

I'm making good progress...

As a first step I'm only doing one level of nesting for the compound blobs. Making it into a tree is going to be similar to how we would do chunking for lists etc.

arv referenced this issue in arv/noms-old Aug 4, 2015
At this point the compoundBlob only contains blob leaves, but a future
change will create multiple tiers. Both of these implement the new Blob
interface.

The splitting is done using a rolling hash over the last 64 bytes;
when that hash ends with 13 consecutive ones, we split the data.

Issue #17
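
As a rough, self-contained illustration of that split rule (not the actual noms code), the sketch below rolls a simple polynomial hash over a 64-byte window and cuts a chunk whenever the low 13 bits of the hash are all ones, which yields roughly 8 KB chunks on average; the real implementation may use a different hash function:

```go
package main

import (
	crand "crypto/rand"
	"fmt"
)

const (
	windowSize = 64
	pattern    = uint32(1)<<13 - 1 // split when the low 13 bits are all ones
	prime      = 16777619          // multiplier for the toy rolling hash
)

// chunk splits data at content-defined boundaries using a rolling hash over
// the last windowSize bytes. Oversized chunks are not handled in this sketch.
func chunk(data []byte) [][]byte {
	var chunks [][]byte
	var h uint32
	var pow uint32 = 1 // prime^windowSize, used to drop the byte leaving the window
	for i := 0; i < windowSize; i++ {
		pow *= prime
	}
	start := 0
	for i, b := range data {
		h = h*prime + uint32(b)
		if i >= windowSize {
			h -= uint32(data[i-windowSize]) * pow
		}
		if h&pattern == pattern {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := make([]byte, 1<<20) // 1 MiB of random data
	crand.Read(data)
	fmt.Println("chunks:", len(chunk(data))) // roughly 128 chunks on average
}
```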
arv referenced this issue in arv/noms-old Aug 4, 2015
This adds a test to ensure that we generate the same blob leaves when
we prepend and append to the data.

Issue #17
arv referenced this issue in arv/noms-old Aug 4, 2015
This is similar to io.MultiReader, but it does not dereference the
Future until it is needed.

Issue #17
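
A minimal sketch of that idea, modeling a Future as a func() io.Reader that is only forced when reading actually reaches it (the real Future type in noms is different):

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// futureReader behaves like io.MultiReader, except each element is a future
// (modeled here as a func() io.Reader) that is only forced when reading
// actually reaches it.
type futureReader struct {
	futures []func() io.Reader
	current io.Reader
}

func (f *futureReader) Read(p []byte) (int, error) {
	for {
		if f.current == nil {
			if len(f.futures) == 0 {
				return 0, io.EOF
			}
			f.current = f.futures[0]() // deref the future only now
			f.futures = f.futures[1:]
		}
		n, err := f.current.Read(p)
		if err == io.EOF {
			f.current = nil
			if n > 0 {
				return n, nil
			}
			continue
		}
		return n, err
	}
}

func main() {
	lazy := func(name, data string) func() io.Reader {
		return func() io.Reader {
			fmt.Println("fetching", name) // only printed once reading reaches this chunk
			return strings.NewReader(data)
		}
	}
	r := &futureReader{futures: []func() io.Reader{lazy("sha1-hi", "hi"), lazy("sha1-bye", "bye")}}
	out, _ := io.ReadAll(r)
	fmt.Println(string(out)) // hibye
}
```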
arv referenced this issue in arv/noms-old Aug 4, 2015
This is in preparation for Seek

Issue #155, #17
arv referenced this issue in arv/noms-old Aug 5, 2015
- Put the length last
- Skip the initial 0 since first blob is always at 0

Issue #17
arv added a commit that referenced this issue Aug 6, 2015
This allows us to only read the relevant chunks

Issue #17, #155
arv referenced this issue in arv/noms-old Aug 7, 2015
The JSON serialization now only contains the length of each individual
blob child.

The Go representation of this still uses offsets, but they are end
offsets (the offset just past each child).

For "hi" "bye" we get

{"cb", [{"ref": "sha1-hi"}, 2, {"ref": "sha1-bye"}, 3]}

compoundBlob{[2, 5], [sha1-hi, sha1-bye]}

Keeping the lengths in the serialization leads to smaller
serializations.

Using the end offsets leads to a simpler binary search and allows us to
use the last entry as the total length.

Issue #17
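
A small sketch of the lookup this enables, assuming the in-memory representation stores cumulative end offsets so the last entry doubles as the total length; the type and field names here are illustrative, not the actual noms code:

```go
package main

import (
	"fmt"
	"sort"
)

// compoundBlob keeps the cumulative end offset of each child, so the last
// entry is the total length and a byte offset can be located by binary search.
type compoundBlob struct {
	offsets []uint64 // end offset of each child, e.g. [2, 5] for "hi" + "bye"
	refs    []string // e.g. ["sha1-hi", "sha1-bye"]
}

func (cb compoundBlob) Len() uint64 { return cb.offsets[len(cb.offsets)-1] }

// childAt returns the index of the child containing byte offset off, and the
// offset of that child's first byte within the whole blob.
func (cb compoundBlob) childAt(off uint64) (idx int, childStart uint64) {
	// First child whose end offset is greater than off contains that byte.
	idx = sort.Search(len(cb.offsets), func(i int) bool { return cb.offsets[i] > off })
	if idx > 0 {
		childStart = cb.offsets[idx-1]
	}
	return idx, childStart
}

func main() {
	cb := compoundBlob{offsets: []uint64{2, 5}, refs: []string{"sha1-hi", "sha1-bye"}}
	fmt.Println(cb.Len()) // 5
	idx, start := cb.childAt(3)
	fmt.Println(cb.refs[idx], start) // sha1-bye 2
}
```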
arv referenced this issue in arv/noms-old Sep 3, 2015
After a compound blob is created, we try to chunk it again, in a similar
way to how we chunk Lists. We use the refs of the sub blobs and compute
a rolling hash over these. If the hash matches a pattern, we split
the existing compound blob into a new compound blob with sub blobs
that are slices of the original compound blob.

Issue #17
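
A simplified sketch of that second-level split. For brevity it hashes each child ref individually rather than rolling a hash across them, and uses a short bit pattern so the toy input actually splits; the real change uses its own rolling hash and Ref type:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// tierPattern is deliberately short (3 one bits) so this small example splits.
const tierPattern = uint32(1)<<3 - 1

// splitRefs groups a compound blob's child refs into runs; each run would
// become one sub compound blob in the next tier of the tree.
func splitRefs(refs []string) [][]string {
	var groups [][]string
	start := 0
	for i, r := range refs {
		h := fnv.New32a()
		h.Write([]byte(r))
		if h.Sum32()&tierPattern == tierPattern {
			groups = append(groups, refs[start:i+1])
			start = i + 1
		}
	}
	if start < len(refs) {
		groups = append(groups, refs[start:])
	}
	return groups
}

func main() {
	refs := make([]string, 32)
	for i := range refs {
		refs[i] = fmt.Sprintf("sha1-%04d", i)
	}
	for _, group := range splitRefs(refs) {
		fmt.Println(len(group), group)
	}
}
```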
@arv
Contributor

arv commented Sep 17, 2015

This is done.

@arv arv closed this as completed Sep 17, 2015