New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track "user data" size #42

Closed
wants to merge 10 commits into
from

Conversation

Projects
None yet
2 participants
@bdionne

bdionne commented Feb 28, 2011

These three commits, in conjunction with a request from Fabric, add functionality to surface real data sizes
via both db_info and group_info. The size information is added when the documents are flushed and computed in couch_key_tree from the non_deleted leaves. Both doc sizes and attachment sizes are counted.

The first commit adds a foldl function to couch_key_tree that is generally useful, the second handles sizes for the views, and the third for the documents.

Robert Dionne added some commits Feb 28, 2011

Robert Dionne
BugzID:9995 Add a tree foldl function to couch_key_tree.
This function can be used by many of the existing functions
in couch_key_tree that need to walk the revision tree, apply
a function at each branch or leaf, and then accumulate results.
Robert Dionne
BugzID:9995 report real data size of views.
The reduce function now stores the sizes of the KV pairs
and the reductions.
Robert Dionne
BugzID:9995 surface actual data_size in db_info.
Added data_size to full_doc_info records, modified couch_db_updater. This required
tweaks to btree_by_id_split and btree_by_id_join to preserve data_size accross
calls to couch_btree:add_remove. The actual data is added in flush_rev_trees.
Show outdated Hide outdated apps/couch/src/couch_view.erl
size(term_to_binary(Value)) +
Acc
end,0,KVList) +
size(term_to_binary(Reduction)).

This comment has been minimized.

@kocolosk

kocolosk Mar 1, 2011

Member

Super-trivial, you could use size(t2b(Reduction)) as the initial value for the foldl instead of adding it in after.

@kocolosk

kocolosk Mar 1, 2011

Member

Super-trivial, you could use size(t2b(Reduction)) as the initial value for the foldl instead of adding it in after.

This comment has been minimized.

Show outdated Hide outdated apps/couch/src/couch_db.erl
@@ -274,6 +280,7 @@ get_db_info(Db) ->
{purge_seq, couch_db:get_purge_seq(Db)},
{compact_running, Compactor/=nil},
{disk_size, Size},
{other, {data_size, DataSize}},

This comment has been minimized.

@kocolosk

kocolosk Mar 1, 2011

Member

This looks like it won't encode the way you want it to. Don't you need it to be {other, {[{data_size, DataSize}]}}?

@kocolosk

kocolosk Mar 1, 2011

Member

This looks like it won't encode the way you want it to. Don't you need it to be {other, {[{data_size, DataSize}]}}?

This comment has been minimized.

@bdionne

bdionne Mar 1, 2011

I think this is ok, the extra wrapping is handled in the fabric layer. Ah, I see, this will break when used over the 5986 port, you're right. Will fix

@bdionne

bdionne Mar 1, 2011

I think this is ok, the extra wrapping is handled in the fabric layer. Ah, I see, this will break when used over the 5986 port, you're right. Will fix

Show outdated Hide outdated apps/couch/src/couch_key_tree.erl
false ->
Acc ++ [{P,Len}]
end
end,TotalAttSizes,AttSizes).

This comment has been minimized.

@kocolosk

kocolosk Mar 1, 2011

Member

I'm wondering if this add_att_sizes function might be more clearly implemented using umerge/2, e.g. umerge(TotalAttSizes, lists:sort(AttSizes))

@kocolosk

kocolosk Mar 1, 2011

Member

I'm wondering if this add_att_sizes function might be more clearly implemented using umerge/2, e.g. umerge(TotalAttSizes, lists:sort(AttSizes))

This comment has been minimized.

@bdionne

bdionne Mar 1, 2011

I can try it. I think for sure it will be simpler, more efficient? I suppose the sizes of the lists are always small enough to not matter. A bad habit I have is to implement everything from a few primitives, rather than check the apis.

@bdionne

bdionne Mar 1, 2011

I can try it. I think for sure it will be simpler, more efficient? I suppose the sizes of the lists are always small enough to not matter. A bad habit I have is to implement everything from a few primitives, rather than check the apis.

This comment has been minimized.

@kocolosk

kocolosk Mar 1, 2011

Member

Yeah, analyzing the efficiency of the two choices is not completely straightforward. The current approach does two O(n) BIFs for every iteration of the foldl, so O(n^2). The umerge pays an upfront sorting cost of O(n log n), and after that it's a simple O(n) merge. On the other hand, the sort/umerge approach is all pure Erlang while the current approach uses BIFs extensively.

@kocolosk

kocolosk Mar 1, 2011

Member

Yeah, analyzing the efficiency of the two choices is not completely straightforward. The current approach does two O(n) BIFs for every iteration of the foldl, so O(n^2). The umerge pays an upfront sorting cost of O(n log n), and after that it's a simple O(n) merge. On the other hand, the sort/umerge approach is all pure Erlang while the current approach uses BIFs extensively.

Show outdated Hide outdated apps/couch/src/couch_view_group.erl
{ok, {_,DataSize,_}} -> DataSize;
{ok, {_,_}} -> 0
end
end,ViewList)).

This comment has been minimized.

@kocolosk

kocolosk Mar 1, 2011

Member

A foldl instead of sum(map()) would be cleaner and more efficient here. Something like

lists:foldl(fun(#view{btree=Bt}, Acc) ->
    case couch_btree:full_reduce(Bt) of
    {ok, {_, DataSize, _}} -> DataSize+Acc;
    {ok, {_, _}} -> Acc
    end
end, 0, ViewList).
@kocolosk

kocolosk Mar 1, 2011

Member

A foldl instead of sum(map()) would be cleaner and more efficient here. Something like

lists:foldl(fun(#view{btree=Bt}, Acc) ->
    case couch_btree:full_reduce(Bt) of
    {ok, {_, DataSize, _}} -> DataSize+Acc;
    {ok, {_, _}} -> Acc
    end
end, 0, ViewList).

This comment has been minimized.

@bdionne

bdionne Mar 1, 2011

missed this one, fixed

@bdionne

bdionne Mar 1, 2011

missed this one, fixed

Show outdated Hide outdated apps/couch/src/couch_view_group.erl
user_reds ->
% always the last element
element(tuple_size(Red), Red)
end end, Reds).

This comment has been minimized.

@kocolosk

kocolosk Mar 1, 2011

Member

I think this function could be written more cleanly using separate clauses for the various values of Element. A minor point to be sure.

@kocolosk

kocolosk Mar 1, 2011

Member

I think this function could be written more cleanly using separate clauses for the various values of Element. A minor point to be sure.

Show outdated Hide outdated apps/couch/src/couch_key_tree.erl
foldl(Fun, Acc, [{Pos, Branch} | Rest]) ->
Acc1 = foldl_simple(Fun, Pos, [Branch], Acc),
foldl(Fun, Acc1, Rest).

This comment has been minimized.

@kocolosk

kocolosk Mar 1, 2011

Member

This function looks good and useful. Maybe just call it fold/3, though? I'm trying to picture the order in which tree nodes are visited and I'm not sure if there's a clear L -> R preference.

@kocolosk

kocolosk Mar 1, 2011

Member

This function looks good and useful. Maybe just call it fold/3, though? I'm trying to picture the order in which tree nodes are visited and I'm not sure if there's a clear L -> R preference.

This comment has been minimized.

@bdionne

bdionne Mar 1, 2011

I agree the name isn't quite accurate. I chose it because I also have a foldr in that couchdb branch 432 that's similar but calls the function differently

@bdionne

bdionne Mar 1, 2011

I agree the name isn't quite accurate. I chose it because I also have a foldr in that couchdb branch 432 that's similar but calls the function differently

Robert Dionne
BugzID:9995 Correct json formatting of db_info.
Additional small refactorings from code review to
make code cleaner and more efficient.
@kocolosk

This comment has been minimized.

Show comment
Hide comment
@kocolosk

kocolosk Mar 3, 2011

Member

Hi Bob, thanks. I think the lists:sort(TotalAttSizes) is superfluous, but harmless.

I've been thinking that a lot of the case statements in this patch that deal with upgrading tuples on the fly will really only be active in hot code upgrades. Is that correct? It seems that once couch_db_updater:rev_tree/1 expands the revision tree with ?DOC_SIZE_MISSING fields (and we do the analogous thing for view indexes) everything else should just be able to use the new format.

On another note, I wonder if the key tree code would benefit from the introduction of a new record (#leaf?) instead of the multi-element tuple. For example

-record(leaf, {deleted, ptr, seq, size, atts})

Of course that'd still be a little awkward because the key tree would then store {Rev, #leaf{}} pairs instead of just storing #leaf{} records directly. Adding the rev to the #leaf would be cool but would probably require some more serious refactoring.

Member

kocolosk commented Mar 3, 2011

Hi Bob, thanks. I think the lists:sort(TotalAttSizes) is superfluous, but harmless.

I've been thinking that a lot of the case statements in this patch that deal with upgrading tuples on the fly will really only be active in hot code upgrades. Is that correct? It seems that once couch_db_updater:rev_tree/1 expands the revision tree with ?DOC_SIZE_MISSING fields (and we do the analogous thing for view indexes) everything else should just be able to use the new format.

On another note, I wonder if the key tree code would benefit from the introduction of a new record (#leaf?) instead of the multi-element tuple. For example

-record(leaf, {deleted, ptr, seq, size, atts})

Of course that'd still be a little awkward because the key tree would then store {Rev, #leaf{}} pairs instead of just storing #leaf{} records directly. Adding the rev to the #leaf would be cool but would probably require some more serious refactoring.

Robert Dionne and others added some commits Mar 3, 2011

Represent key tree leafs using #leaf records
Additionally, remove some special-case code that would only take effect
if the data size code was hot-loaded into a running system.

BugzID: 9995
Remove superfluous sorting of TotalAttSizes
They're already sorted.

BugzID: 9995
@kocolosk

This comment has been minimized.

Show comment
Hide comment
@kocolosk

kocolosk Mar 11, 2011

Member

Ok, I added the #leaf records, I think it makes the code cleaner. I also removed some of the special-casing code since it would only take effect if we hot-loaded the .beam files into a running system. On restart everything gets loaded from disk, so the rev_tree and btree_by_id_join functions can take care of upgrading the data structures.

Next steps: I think we can remove more special-casing code with a careful construction of the reduce functions. Basically, reduce and rereduce need to always return tuples in the new format. A bigger issue is that I think we're not updating data sizes correctly during compaction. We don't update them at all in couch_db_updater:copy_docs, so the argument that we'll get the right answer for data_size after compaction doesn't hold.

Member

kocolosk commented Mar 11, 2011

Ok, I added the #leaf records, I think it makes the code cleaner. I also removed some of the special-casing code since it would only take effect if we hot-loaded the .beam files into a running system. On restart everything gets loaded from disk, so the rev_tree and btree_by_id_join functions can take care of upgrading the data structures.

Next steps: I think we can remove more special-casing code with a careful construction of the reduce functions. Basically, reduce and rereduce need to always return tuples in the new format. A bigger issue is that I think we're not updating data sizes correctly during compaction. We don't update them at all in couch_db_updater:copy_docs, so the argument that we'll get the right answer for data_size after compaction doesn't hold.

@kocolosk

This comment has been minimized.

Show comment
Hide comment
@kocolosk

kocolosk Aug 13, 2011

Member

Closing this one out, it did get merged but I guess it was done in a way that the auto-merge missed it

Member

kocolosk commented Aug 13, 2011

Closing this one out, it did get merged but I guess it was done in a way that the auto-merge missed it

@kocolosk kocolosk closed this Aug 13, 2011

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment