More Compact Riak Object Binary Format #479

Merged
Merged 5 commits into master on Feb 27, 2013

Contributor

jrwest commented Feb 7, 2013

This PR changes how riak_objects are serialized before being stored on disk, with the goal of reducing overhead. The current term_to_binary approach results in a significant amount of unnecessary overhead, largely due to the metadata dict; this is especially true for small objects. It also means we store the bucket and key twice for each object (once as the actual key in the backend, and once inside the serialized object). The new format addresses these issues by building its own binary that stores the metadata dictionary more compactly and omits the bucket and key.
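For reference, the v1 layout shown in the diff (`<<?MAGIC:8, ?V1_VERS:8, VclockLen:32, VclockBin, SibCount:32, SibsBin>>`, with one length-prefixed value/metadata pair per sibling) can be modelled in a few lines of Python. This is only a sketch: the concrete value of `?MAGIC` is an assumption here, and v0 (term_to_binary) payloads are out of scope since Python has no Erlang term encoder built in.

```python
import struct

MAGIC = 53     # assumed value of the ?MAGIC macro; the diff only shows the name
V1_VERS = 1

def encode_maybe_binary(value):
    # Mirrors encode_maybe_binary/1: raw binaries get a 1-byte tag of 1;
    # any other term would be tagged 0 and run through term_to_binary,
    # which has no direct Python equivalent, so it is left out of this sketch.
    if isinstance(value, bytes):
        return b"\x01" + value
    raise NotImplementedError("non-binary terms need an Erlang term encoder")

def encode_v1(vclock_bin, siblings):
    # Header: <<MAGIC:8, V1_VERS:8, VclockLen:32, VclockBin, SibCount:32, ...>>
    out = struct.pack(">BBI", MAGIC, V1_VERS, len(vclock_bin)) + vclock_bin
    out += struct.pack(">I", len(siblings))
    # Each sibling, per bin_content/1: <<ValLen:32, ValBin, MetaLen:32, MetaBin>>
    for value, meta_bin in siblings:
        val_bin = encode_maybe_binary(value)
        out += struct.pack(">I", len(val_bin)) + val_bin
        out += struct.pack(">I", len(meta_bin)) + meta_bin
    return out
```

Note that the bucket and key never appear in this encoding, which is where the per-object savings over term_to_binary come from.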

To support the new format, several other changes were made: riak_kv_vnode uses a new capability to determine when the new format is supported and encodes objects for writes using the version negotiated by capabilities. Several sub-systems (e.g. AAE, backup, handoff) were also changed to account for the new version. The commit notes have specific details on what was changed and why.

Additionally, in order for rolling downgrades to be possible after new objects have been written, riak_kv_format is included in this PR, with the intention that it be run on a live node before starting the downgrade. A riak_test for this can be found here. riak_kv_format is general enough that it can also be used to upgrade a node's data.

Finally, this PR includes some general cleanup of the riak_object module.

There are several possible improvements, such as some potential binary optimizations, that could be made before this PR is merged or in a separate PR. I figured those were best left to discussion with a reviewer.

EDIT: many thanks to @argv0 @buddhisthead and @seancribbs, who also worked on this branch. The author history was lost as part of rebasing many commits from the Portland meetup.

@jrwest jrwest referenced this pull request in basho/riak_test Feb 7, 2013

Merged

rolling downgrade with different on-disk format test #197

@jrwest jrwest was assigned Feb 8, 2013

Contributor

buddhisthead commented Feb 20, 2013

+1 on the tests. I can't really be the main reviewer since I contributed to the early code changes.

@russelldb russelldb was assigned Feb 21, 2013

@russelldb russelldb commented on the diff Feb 21, 2013

src/riak_kv_backup.erl
Pid.
result_collector(PPid) ->
receive
stop ->
PPid ! stop;
- {backup, M} when is_binary(M) ->
- disk_log:log(?TABLE, M),
+ {backup, {{B, K}, M}} when is_binary(M) ->
+ %% make sure binary is encoded using term_to_binary (v0)
@russelldb

russelldb Feb 21, 2013

Member

One day someone will ask "why?", maybe that day is today.

@jrwest

jrwest Feb 21, 2013

Contributor

Ah, "why?" is somewhere in the commit notes but probably best as a comment. will add that.

@russelldb russelldb and 1 other commented on an outdated diff Feb 21, 2013

src/riak_kv_index_hashtree.erl
@@ -102,10 +102,11 @@ insert(Id, Key, Hash, Tree) ->
insert(Id, Key, Hash, Tree, Options) ->
gen_server:cast(Tree, {insert, Id, Key, Hash, Options}).
-%% @doc Add a term_to_binary encoded riak_object associated with a given
+%% @doc Add a encoded (binary) riak_object associated with a given
@russelldb

russelldb Feb 21, 2013

Member

super_picky "Add an encoded…"

@jrwest

jrwest Feb 21, 2013

Contributor

proper use of a/an is always good :). will fix when i make changes to the kv backup comment

@russelldb russelldb commented on the diff Feb 21, 2013

src/riak_kv_vnode.erl
+ IndexSpecs = riak_object:index_specs(RObj);
+ false ->
+ IndexSpecs = []
+ end,
+ PutArgs = #putargs{returnbody=false,
+ bkey=BKey,
+ reqid=undefined,
+ index_specs=IndexSpecs},
+ case perform_put({true, RObj}, State, PutArgs) of
+ {{fail, _, _}, UpdState} ->
+ Reply = {error, backend_error};
+ {_, UpdState} ->
+ Reply = ok
+ end
+ end,
+ {Reply, UpdState}.
@russelldb

russelldb Feb 21, 2013

Member

Stylistic thing: how do you feel about exporting Reply and UpdState from case statements? I try and avoid it just 'cos my editor complains. Personally I think it requires more reading to track down the source of the bindings.

@jrwest

jrwest Feb 21, 2013

Contributor

I actually prefer it too and get the same warning, but most of our code I was looking at did it and I wasn't sure if it was a whitespace/indentation thing. Happy to change it if you prefer.

@russelldb

russelldb Feb 21, 2013

Member

I'm easy either way…I agree it is good to have the code in keeping with the general conventions of the larger corpus. I prefer not to export from case statements. Really up to you. I guess leave for now.

@reiddraper

reiddraper Feb 21, 2013

Contributor

I'll throw my opinion in the ring and say exporting from case is A Bad Idea.

@russelldb russelldb and 1 other commented on an outdated diff Feb 21, 2013

src/riak_kv_vnode.erl
@@ -876,6 +894,37 @@ perform_put({true, Obj},
end,
{Reply, State#state{modstate=UpdModState}}.
+do_reformat({Bucket, Key}=BKey, State=#state{mod=Mod, modstate=ModState}) ->
+ case Mod:get(Bucket, Key, ModState) of
+ {error, not_found, _UpdModState} ->
+ Reply = {error, not_found},
+ UpdState = State;
+ {ok, ObjBin, _UpdModState} ->
+ %% since it is assumed capabilities have been properly set
+ %% to the desired version, to reformat, all we need to do
+ %% is submit a new write
+ RObj = riak_object:from_binary(Bucket, Key, ObjBin),
+ {ok, Capabilities} = Mod:capabilities(Bucket, ModState),
+ IndexBackend = lists:member(indexes, Capabilities),
@russelldb

russelldb Feb 21, 2013

Member

Why do we need to re-write the indexes?

@jrwest

jrwest Feb 21, 2013

Contributor

I was just extracting them here because they are necessary to pass to perform_put which, now that I think about it, probably does rewrite the indexes. That's not ideal, although probably not harmful. The reason I didn't want to add yet another put function is we have so many already. It should be changed, I think, though.

@russelldb

russelldb Feb 21, 2013

Member

Just pass the empty list and all will remain as before. (edit) Erm, I think; best to double check.

@jrwest

jrwest Feb 21, 2013

Contributor

thanks for the tip. will verify and give it a shot

@jrwest

jrwest Feb 21, 2013

Contributor

@rdb so it looks like handling of the IndexSpecs is delegated to the kv backend in Mod:put. If leveldb doesn't update the index, we may be able to take advantage of that by passing [], but I'm not sure that is safe to rely upon. Thoughts?

@russelldb

russelldb Feb 21, 2013

Member

@jrwest so, index specs are of the form {add, name, value} and {remove, name, value}; the backend uses these as instructions for what to do (write or delete indexes). In this case, isn't it that only adds will be generated? And aren't those adds for indexes that are already added (since you're just reading and writing the object locally)? So I guess the question is: 'is the empty list a valid index spec to pass for all backends?' I'm pretty sure the answer is 'yes'.
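The semantics under discussion can be sketched as a toy Python model of a backend interpreting index specs (this is hypothetical illustration, not the actual backend code): `add`/`remove` tuples mutate the index, and the empty spec list is a no-op, which is why passing [] would leave existing indexes untouched.

```python
def apply_index_specs(index, specs):
    # index: {index_name: set(values)}; specs: list of (op, name, value)
    # tuples modelling Erlang {add, Name, Value} / {remove, Name, Value}.
    for op, name, value in specs:
        if op == "add":
            index.setdefault(name, set()).add(value)
        elif op == "remove":
            index.get(name, set()).discard(value)
    return index
```

Re-adding an already-present entry is idempotent in this model, matching the observation that the adds generated by a local read/re-write are for indexes that already exist.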

@russelldb russelldb commented on the diff Feb 21, 2013

src/riak_object.erl
@@ -129,13 +138,29 @@ equal_contents([C1|R1],[C2|R2]) ->
andalso (C1#r_content.value =:= C2#r_content.value)
andalso equal_contents(R1,R2).
-%% @spec reconcile([riak_object()], boolean()) -> riak_object()
+
@russelldb

russelldb Feb 21, 2013

Member

I see that the PR description includes a mention of "general cleanup". I find it harder to review with these non-pertinent changes mixed in. IM(not so)HO a separate PR for all these @spec to -spec changes would have been easier to review.

@jrwest

jrwest Feb 21, 2013

Contributor

ah my bad, you are correct. OCD got the best of me after staring at the file for so long. It should be easy enough with a rebase to back out those changes if you want... There is only one thing from the last commit that really should be in this PR (the sorting of vnode entries in riak_object:hash/1).

@russelldb russelldb and 1 other commented on an outdated diff Feb 21, 2013

src/riak_object.erl
+ KeyBin = encode_maybe_binary(Key),
+ KeyLen = byte_size(KeyBin),
+ MetaBin = <<KeyLen:32/integer, KeyBin/binary, ValueLen:32/integer, ValueBin/binary>>,
+ {Elems, <<RestBin/binary, MetaBin/binary>>}.
+
+encode_maybe_binary(Bin) when is_binary(Bin) ->
+ <<1, Bin/binary>>;
+encode_maybe_binary(Bin) ->
+ <<0, (term_to_binary(Bin))/binary>>.
+
+decode_maybe_binary(<<1, Bin/binary>>) ->
+ Bin;
+decode_maybe_binary(<<0, Bin/binary>>) ->
+ binary_to_term(Bin).
+
+contents_size(Vsn, Contents) ->
@russelldb

russelldb Feb 21, 2013

Member

Sorry if I am the world's pickiest reviewer, but this would be so much easier to read if it were at L570+ (directly where it is used in calculating size).

@jrwest

jrwest Feb 21, 2013

Contributor

I can move it up there. I always have a hard time separating private functions from public API ones for that reason. No objection to moving it up though

@russelldb russelldb commented on the diff Feb 21, 2013

src/riak_object.erl
+ VclockBin = term_to_binary(Vclock),
+ VclockLen = byte_size(VclockBin),
+ SibCount = length(Siblings),
+ SibsBin = bin_contents(Siblings),
+ <<?MAGIC:8/integer, ?V1_VERS:8/integer, VclockLen:32/integer, VclockBin/binary, SibCount:32/integer, SibsBin/binary>>.
+
+bin_content(#r_content{metadata=Meta, value=Val}) ->
+ ValBin = encode_maybe_binary(Val),
+ ValLen = byte_size(ValBin),
+ MetaBin = meta_bin(Meta),
+ MetaLen = byte_size(MetaBin),
+ <<ValLen:32/integer, ValBin:ValLen/binary, MetaLen:32/integer, MetaBin:MetaLen/binary>>.
+
+bin_contents(Contents) ->
+ F = fun(Content, Acc) ->
+ <<Acc/binary, (bin_content(Content))/binary>>
@russelldb

russelldb Feb 21, 2013

Member

I've never seen this (bin_content(Content)) before, is it to save setting an intermediate variable?

@jrwest

jrwest Feb 21, 2013

Contributor

yea, need the extra parens to make erlc happy

@russelldb russelldb commented on the diff Feb 21, 2013

src/riak_object.erl
+ VTagLen = byte_size(VTagBin),
+ LastModBin = case LastModVal of
+ undefined -> <<0:32/integer, 0:32/integer, 0:32/integer>>;
+ {Mega,Secs,Micro} -> <<Mega:32/integer, Secs:32/integer, Micro:32/integer>>
+ end,
+ <<LastModBin/binary, VTagLen:8/integer, VTagBin:VTagLen/binary,
+ Deleted:1/binary-unit:8, RestBin/binary>>.
+
+fold_meta_to_bin(?MD_VTAG, Value, {{_Vt,Del,Lm},RestBin}) ->
+ {{Value, Del, Lm}, RestBin};
+fold_meta_to_bin(?MD_LASTMOD, Value, {{Vt,Del,_Lm},RestBin}) ->
+ {{Vt, Del, Value}, RestBin};
+fold_meta_to_bin(?MD_DELETED, true, {{Vt,_Del,Lm},RestBin})->
+ {{Vt, <<1>>, Lm}, RestBin};
+fold_meta_to_bin(?MD_DELETED, "true", Acc) ->
+ fold_meta_to_bin(?MD_DELETED, true, Acc);
@russelldb

russelldb Feb 21, 2013

Member

Why not the same as the previous clause to save the call?

@jrwest

jrwest Feb 21, 2013

Contributor

stylistic thing i guess. Usually I prefer to have only one version of the function that returns when converting a value (like "true" -> true) because if the return value changes it only requires changing one function clause. Can be convinced otherwise if you have strong feelings the other way.

@russelldb

russelldb Feb 21, 2013

Member

Ah, good reason. The 'optimisation' of skipping the call is probably worth far less than the readability/maintainability. I like your way. Let's leave it.
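The normalization pattern agreed on above (the `"true"` clause re-dispatches to the `true` clause, so the flag byte is produced in exactly one place) can be illustrated with a small Python analog; the tuple shape mirrors the `{Vt, Del, Lm}` accumulator in the diff, but this is a sketch, not the real fold:

```python
def fold_meta_deleted(value, acc):
    # acc models {VTag, Deleted, LastMod} from fold_meta_to_bin/3.
    vtag, deleted, lastmod = acc
    if value == "true":
        # Normalize "true" -> True and re-dispatch: only one clause
        # ever encodes the deleted flag byte.
        return fold_meta_deleted(True, acc)
    if value is True:
        return (vtag, b"\x01", lastmod)
    return acc
```

If the encoding of the flag ever changes, only the single `True` branch needs editing, which is the maintainability argument made in the thread.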

@russelldb russelldb and 1 other commented on an outdated diff Feb 21, 2013

src/riak_object.erl
+ {error, bad_object_format}
+ end.
+
+sibs_of_binary(Count,SibsBin) ->
+ sibs_of_binary(Count, SibsBin, []).
+
+sibs_of_binary(0, <<>>, Result) -> lists:reverse(Result);
+sibs_of_binary(0, _NotEmpty, _Result) ->
+ {error, corrupt_contents};
+sibs_of_binary(Count, SibsBin, Result) ->
+ {Sib, SibsRest} = sib_of_binary(SibsBin),
+ sibs_of_binary(Count-1, SibsRest, [Sib | Result]).
+
+sib_of_binary(<<ValLen:32/integer, ValBin:ValLen/binary, MetaLen:32/integer, MetaBin:MetaLen/binary, Rest/binary>>) ->
+ <<LMMega:32/integer, LMSecs:32/integer, LMMicro:32/integer, VTagLen:8/integer, VTag:VTagLen/binary, Deleted:1/binary-unit:8, MetaRestBin/binary>> = MetaBin,
+ DeletedVal = case Deleted of <<1>> -> true; _False -> false end,
@russelldb

russelldb Feb 21, 2013

Member

I re-factored this section to

russelldb/riak_kv@basho:jrw-pdx-binrobj-rebased2...russelldb:rdb-refctor-sib-of-binary

Dunno if you want me to push that, or if you prefer yours. Again it is just a stylistic thing, but there for discussion.

@jrwest

jrwest Feb 21, 2013

Contributor

wfm, reads much nicer
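For the decode side discussed in this thread, the sibling-peeling loop from `sibs_of_binary/3` (including the corrupt-contents check for leftover bytes) can be modelled in Python as a sketch; as before, tag-0 (term_to_binary) payloads are out of scope since they need an Erlang term decoder:

```python
import struct

def decode_maybe_binary(b):
    # Mirrors decode_maybe_binary/1: tag byte 1 means a raw binary follows.
    tag, rest = b[0], b[1:]
    if tag == 1:
        return rest
    raise NotImplementedError("tag 0 payloads are Erlang term_to_binary data")

def sibs_of_binary(count, sibs_bin):
    # Peel off `count` siblings of the form <<ValLen:32, ValBin, MetaLen:32,
    # MetaBin>>; bytes left over after the last sibling indicate corruption,
    # matching the {error, corrupt_contents} clause in the diff.
    result, rest = [], sibs_bin
    for _ in range(count):
        (val_len,) = struct.unpack(">I", rest[:4])
        val_bin, rest = rest[4:4 + val_len], rest[4 + val_len:]
        (meta_len,) = struct.unpack(">I", rest[:4])
        meta_bin, rest = rest[4:4 + meta_len], rest[4 + meta_len:]
        result.append((decode_maybe_binary(val_bin), meta_bin))
    if rest:
        raise ValueError("corrupt_contents")
    return result
```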

Member

russelldb commented Feb 21, 2013

Pending answers to questions +1 from me. Code works. Tests pass.

Basho Bench runs on my limited setup, on both leveldb and bitcask, show no discernible performance degradation; maybe there was even a slight perf increase (especially for small object values). Also, for a million 1k-ish objects, the leveldb folder was nearly a gig smaller on each node.

Contributor

jrwest commented Feb 21, 2013

Force-pushed an update to the branch w/ comments addressed. @russelldb I squashed your changes into the cleanup commit.

Member

russelldb commented Feb 26, 2013

WFM, +1. Nice work.

@jrwest jrwest was assigned Feb 26, 2013

buddhisthead and others added some commits Nov 20, 2012

@buddhisthead @jrwest buddhisthead new riak_object binary format
goal is to be versionable and smaller on disk than using term_to_binary.
removes duplication of bucket/key in encoded value and bloat from the metadata
dictionary primarily.
b89add3
@jrwest jrwest new object format all the things
* register {riak_kv, object_format} capability
* writes should only be serialized to v1 format when it is
  supported across the cluster
* before sending data between nodes (handoff) the binary format
  should be changed to the current version supported across the
  cluster. This is to ensure an older node does not receive data
  in the new format after being downgraded
* update aae to support new format and use riak_object:hash/1
  which always uses the v0 (term_to_binary) format to calculate the
  hash
* kv_backup restore calls binary_to_term on the binary encoded object.
  Since the bucket and key are not available at this time, when backing
  up the data the binary encoding is changed to term_to_binary (v0).
* use riak_object:to_binary/1 in yessir backend
7393cc3
@jrwest jrwest support for reformatting objects on disk on a running node
* Reformatting is primarily intended for downgrading a cluster
that does not support a binary version used by a newer riak
* riak_kv_reformat:run/2 folds over each partition reformatting
each object it encounters whose binary version does not match
the desired one (the first argument to run). riak_kv_vnode:reformat_object/2
does the actual reformatting by simply fetching the object and rewriting it if
it exists. This assumes that capabilities have been set to the desired
version before the call to reformat_object/2. run/2 does this and either kills
or waits for handoffs to terminate on the node (based on the second argument
to run) before proceeding. This is necessary for the node to safely continue operation.
The function folds over all partitions at once, which may not be ideal
for large datasets or highly active clusters. Based on testing it may be
prudent to introduce some concurrency control. Additionally, due to the use of pmap,
crashing folds will simply cause the calling process to hang
* This commit does not include a method for how the function will be run against
a node
* added riak_object:binary_version/1
13945b9
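The reformat loop described in the commit note above (fold over stored objects, rewrite any whose binary version differs from the negotiated target) can be sketched as a hypothetical Python model; `reencode` is a stand-in for the real fetch-and-rewrite path through riak_kv_vnode:reformat_object/2, and the MAGIC value is assumed:

```python
MAGIC = 53  # assumed value of the ?MAGIC macro in riak_object.erl

def binary_version(obj_bin):
    # v1 binaries start with <<MAGIC:8, Version:8, ...>>; anything else is
    # treated as a v0 term_to_binary blob, per riak_object:binary_version/1.
    if len(obj_bin) >= 2 and obj_bin[0] == MAGIC:
        return obj_bin[1]
    return 0

def reformat_partition(backend, target_version, reencode):
    # Model of the per-partition fold: rewrite each object whose on-disk
    # version does not match the desired one; matching objects are skipped,
    # so a second pass is a no-op.
    changed = 0
    for bkey, obj_bin in list(backend.items()):
        if binary_version(obj_bin) != target_version:
            backend[bkey] = reencode(obj_bin, target_version)
            changed += 1
    return changed
```

The skip-if-matching check is what makes the operation safe to re-run, which matters if a reformat pass is interrupted and restarted.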
@jrwest jrwest move size calculation into riak_object and account for v1 metadata 4cd7351
@jrwest jrwest general cleanup of riak_object module
* move sorting of vnode entries into hash/1
* replace edoc specs with module attribute specs
* move some things around
* fix some issues caught by dialyzer
a4b33ea

@jrwest jrwest added a commit that referenced this pull request Feb 27, 2013

@jrwest jrwest Merge pull request #479 from basho/jrw-pdx-binrobj-rebased2
More Compact Riak Object Binary Format
faa72cd

@jrwest jrwest merged commit faa72cd into master Feb 27, 2013

1 check passed: the Travis build passed.

@seancribbs seancribbs deleted the jrw-pdx-binrobj-rebased2 branch Apr 1, 2015
