
Implement chunking for blobs, at least #17

Closed
cmasone-attic opened this issue Jul 8, 2015 · 3 comments
Comments

@cmasone-attic
Contributor

We want to split large pieces of data up into chunks. Initially we'll do this for blobs, but we'll probably want to expand beyond that.

We'll use a rolling hash function to figure out where to chunk things, and will need to deal with chunks that are still too big.

A blob value will really just contain a list of Refs to the chunks that make it up, and the Ref of the blob itself will be some kind of hash-of-hashes. This does introduce some complexity to the Reader() of a blob, since it now needs to find and decode all the chunks that make up the blob. Aaron proposes a "Resolver" interface that all composite types will have a pointer to, which looks like this:

type Resolver struct{}
func (res *Resolver) Resolve(r ref.Ref) Value
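
For illustration, here is a minimal, self-contained sketch of how a compound blob's Reader() might use such a Resolver. The Ref, Value, blobLeaf, and mapResolver types below are stand-ins invented for this sketch, not the real noms types:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Stand-ins for the real noms types (ref.Ref, Value); this is only a sketch.
type Ref string

type Value interface{ Bytes() []byte }

type blobLeaf []byte

func (b blobLeaf) Bytes() []byte { return b }

// Resolver turns a Ref into the Value it identifies, e.g. by reading and
// decoding a chunk from the chunk store.
type Resolver interface {
	Resolve(r Ref) Value
}

type mapResolver map[Ref]Value

func (m mapResolver) Resolve(r Ref) Value { return m[r] }

// compoundBlob stores only the Refs of its chunks; Reader asks the Resolver
// for each chunk and concatenates them.
type compoundBlob struct {
	chunks   []Ref
	resolver Resolver
}

func (cb compoundBlob) Reader() io.Reader {
	readers := make([]io.Reader, len(cb.chunks))
	for i, r := range cb.chunks {
		readers[i] = bytes.NewReader(cb.resolver.Resolve(r).Bytes())
	}
	return io.MultiReader(readers...)
}

func main() {
	res := mapResolver{"sha1-hi": blobLeaf("hi"), "sha1-bye": blobLeaf("bye")}
	cb := compoundBlob{chunks: []Ref{"sha1-hi", "sha1-bye"}, resolver: res}
	data, _ := io.ReadAll(cb.Reader())
	fmt.Println(string(data)) // hibye
}
```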

@aboodman
Contributor

aboodman commented Jul 8, 2015

Note that this resolver thing is also needed to implement incremental decoding (issue #11)

@arv
Contributor

arv commented Jul 30, 2015

I'm making good progress...

As a first step I'm only doing one level of nesting for the compound blobs. Making it into a tree is going to be similar to how we would do chunking for lists etc.

arv referenced this issue in arv/noms-old Aug 4, 2015
At this point the compoundBlob only contains blob leaves, but a future
change will create multiple tiers. Both of these implement the new Blob
interface.

The splitting is done using a rolling hash over the last 64 bytes;
when that hash ends with 13 consecutive ones, we split the data.

Issue #17
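
As a rough, self-contained illustration of that split rule (not the actual noms code), the sketch below rolls a simple polynomial hash over a 64-byte window and cuts a chunk whenever the low 13 bits of the hash are all ones, which yields roughly 8 KB chunks on average; the real implementation may use a different hash function:

```go
package main

import (
	crand "crypto/rand"
	"fmt"
)

const (
	windowSize = 64
	pattern    = uint32(1)<<13 - 1 // split when the low 13 bits are all ones
	prime      = 16777619          // multiplier for the toy rolling hash
)

// chunk splits data at content-defined boundaries using a rolling hash over
// the last windowSize bytes. Oversized chunks are not handled in this sketch.
func chunk(data []byte) [][]byte {
	var chunks [][]byte
	var h uint32
	var pow uint32 = 1 // prime^windowSize, used to drop the byte leaving the window
	for i := 0; i < windowSize; i++ {
		pow *= prime
	}
	start := 0
	for i, b := range data {
		h = h*prime + uint32(b)
		if i >= windowSize {
			h -= uint32(data[i-windowSize]) * pow
		}
		if h&pattern == pattern {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := make([]byte, 1<<20) // 1 MiB of random data
	crand.Read(data)
	fmt.Println("chunks:", len(chunk(data))) // roughly 128 chunks on average
}
```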
arv referenced this issue in arv/noms-old Aug 4, 2015
This adds a test to ensure that we generate the same blob leaves when
we prepend and append to the data.

Issue #17
arv referenced this issue in arv/noms-old Aug 4, 2015
This is similar to io.MultiReader, but it does not dereference the
Future until it is needed.

Issue #17
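
A minimal sketch of that idea, modeling a Future as a func() io.Reader that is only forced when reading actually reaches it (the real Future type in noms is different):

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// futureReader behaves like io.MultiReader, except each element is a future
// (modeled here as a func() io.Reader) that is only forced when reading
// actually reaches it.
type futureReader struct {
	futures []func() io.Reader
	current io.Reader
}

func (f *futureReader) Read(p []byte) (int, error) {
	for {
		if f.current == nil {
			if len(f.futures) == 0 {
				return 0, io.EOF
			}
			f.current = f.futures[0]() // deref the future only now
			f.futures = f.futures[1:]
		}
		n, err := f.current.Read(p)
		if err == io.EOF {
			f.current = nil
			if n > 0 {
				return n, nil
			}
			continue
		}
		return n, err
	}
}

func main() {
	lazy := func(name, data string) func() io.Reader {
		return func() io.Reader {
			fmt.Println("fetching", name) // only printed once reading reaches this chunk
			return strings.NewReader(data)
		}
	}
	r := &futureReader{futures: []func() io.Reader{lazy("sha1-hi", "hi"), lazy("sha1-bye", "bye")}}
	out, _ := io.ReadAll(r)
	fmt.Println(string(out)) // hibye
}
```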
arv referenced this issue in arv/noms-old Aug 4, 2015
This is in preparation for Seek

Issue #155, #17
arv referenced this issue in arv/noms-old Aug 5, 2015
- Put the length last
- Skip the initial 0 since first blob is always at 0

Issue #17
arv added a commit that referenced this issue Aug 6, 2015
This allows us to only read the relevant chunks

Issue #17, #155
arv referenced this issue in arv/noms-old Aug 7, 2015
The JSON serialization now only contains the length of each individual
blob child.

The Go representation of this still uses offsets, but they are end
offsets (the offset just past each child).

For "hi" "bye" we get

{"cb", [{"ref": "sha1-hi"}, 2, {"ref": "sha1-bye"}, 3]}

compoundBlob{[2, 5], [sha1-hi, sha1-bye]}

Keeping the lengths in the serialization leads to smaller
serializations.

Using the end offsets leads to a simpler binary search and allows us to
use the last entry as the total length.

Issue #17
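
A small sketch of the lookup this enables, assuming the in-memory representation stores cumulative end offsets so the last entry doubles as the total length; the type and field names here are illustrative, not the actual noms code:

```go
package main

import (
	"fmt"
	"sort"
)

// compoundBlob keeps the cumulative end offset of each child, so the last
// entry is the total length and a byte offset can be located by binary search.
type compoundBlob struct {
	offsets []uint64 // end offset of each child, e.g. [2, 5] for "hi" + "bye"
	refs    []string // e.g. ["sha1-hi", "sha1-bye"]
}

func (cb compoundBlob) Len() uint64 { return cb.offsets[len(cb.offsets)-1] }

// childAt returns the index of the child containing byte offset off, and the
// offset of that child's first byte within the whole blob.
func (cb compoundBlob) childAt(off uint64) (idx int, childStart uint64) {
	// First child whose end offset is greater than off contains that byte.
	idx = sort.Search(len(cb.offsets), func(i int) bool { return cb.offsets[i] > off })
	if idx > 0 {
		childStart = cb.offsets[idx-1]
	}
	return idx, childStart
}

func main() {
	cb := compoundBlob{offsets: []uint64{2, 5}, refs: []string{"sha1-hi", "sha1-bye"}}
	fmt.Println(cb.Len()) // 5
	idx, start := cb.childAt(3)
	fmt.Println(cb.refs[idx], start) // sha1-bye 2
}
```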
arv referenced this issue in arv/noms-old Sep 3, 2015
After a compound blob is created, we try to chunk it again, in a similar
way to how we chunk Lists. We use the refs of the sub blobs and compute
a rolling hash over these. If the hash matches a pattern, we split
the existing compound blob into a new compound blob with sub blobs
that are slices of the original compound blob.

Issue #17
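
A simplified sketch of that second-level split. For brevity it hashes each child ref individually rather than rolling a hash across them, and uses a short bit pattern so the toy input actually splits; the real change uses its own rolling hash and Ref type:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// tierPattern is deliberately short (3 one bits) so this small example splits.
const tierPattern = uint32(1)<<3 - 1

// splitRefs groups a compound blob's child refs into runs; each run would
// become one sub compound blob in the next tier of the tree.
func splitRefs(refs []string) [][]string {
	var groups [][]string
	start := 0
	for i, r := range refs {
		h := fnv.New32a()
		h.Write([]byte(r))
		if h.Sum32()&tierPattern == tierPattern {
			groups = append(groups, refs[start:i+1])
			start = i + 1
		}
	}
	if start < len(refs) {
		groups = append(groups, refs[start:])
	}
	return groups
}

func main() {
	refs := make([]string, 32)
	for i := range refs {
		refs[i] = fmt.Sprintf("sha1-%04d", i)
	}
	for _, group := range splitRefs(refs) {
		fmt.Println(len(group), group)
	}
}
```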
@arv
Contributor

arv commented Sep 17, 2015

This is done.

@arv arv closed this as completed Sep 17, 2015