refactor: Integrate count and sum feat sq into the interaction generation routine #2987
Conversation
vowpalwabbit/gd_mf.cc
ec.num_features -=
    ec.feature_space[static_cast<int>(i[0])].size() * ec.feature_space[static_cast<int>(i[1])].size();
ec.num_features += ec.feature_space[static_cast<int>(i[0])].size() * d.rank;
ec.num_features += ec.feature_space[static_cast<int>(i[1])].size() * d.rank;
ec.num_features_from_interactions +=
    ec.feature_space[static_cast<size_t>(i[0])].size() * ec.feature_space[static_cast<size_t>(i[1])].size();
this is calculated on line 107 too
Yeah, gd_mf does its interactions manually using quadratics only and does some weird stuff with feature counting. It's not clear to me why it subtracts that value. I can calculate it once and reuse the value, though.
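A minimal sketch of the "calculate once and reuse" idea (`fake_example` and `count_quadratic` are illustrative names, not VW's actual API; the real code reads the sizes out of `ec.feature_space`):

```cpp
#include <cstddef>

// Sketch: compute the quadratic pair's feature product once and reuse it,
// instead of recomputing size(left) * size(right) in two places.
struct fake_example
{
  size_t num_features = 0;
  size_t num_features_from_interactions = 0;
};

void count_quadratic(fake_example& ec, size_t left_size, size_t right_size, size_t rank)
{
  const size_t pair_size = left_size * right_size;  // computed once
  ec.num_features -= pair_size;
  ec.num_features += left_size * rank;
  ec.num_features += right_size * rank;
  ec.num_features_from_interactions += pair_size;  // reused here
}
```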
finished run
number of examples = 4
weighted example sum = 4.000000
weighted label sum = 0.000000
average loss = 0.000000
total feature number = 32
total feature number = 48
What is the explanation behind these two counters? For example, how does one arrive at 8 or 12 features in this one example?
ccb shared |User b
ccb action |Action d
ccb action |Action e
ccb slot 0:0:0.2 |Slot h
For multi_ex, most of the time it is wrong at the moment; it is just an approximation. This is because the multi_ex abstraction and the output example didn't really solve how to count features. The feature count reporting should happen at the base learner.
For that example: the slot is merged into shared, which is in turn merged into each action, which is then sent down the stack. There are additionally the constant feature and the slot feature, which are hidden as well but do contribute to the count.
Adding --noconstant to this cmd halves the total features for that line (in this case to 4), so it's probably just counting d, e, b, h.
total_sum_feat_sq_calculated = false;
}

friend void VW::copy_example_data(example* dst, const example* src);
Instead of making these friends, could we add copy methods/setters for the private members?
I toyed with copy methods for example a while ago, but there are multiple different versions: with and without labels, metadata only, etc. So I opted to just let the existing one have access rather than try to encapsulate it.
Yeah, I agree on copy_example. It makes sense for it to be a friend, and if we keep using the permutation we can think about making it public in the future.
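For reference, the friend approach keeps the counters private while granting exactly one free function write access, so no general setters leak out. A simplified illustration (this is not VW's actual example type, just the pattern):

```cpp
// Sketch of the friend pattern: only copy_example_data may touch the
// private cached state; there is no public setter.
class example_t
{
public:
  explicit example_t(float sum = 0.f) : _total_sum_feat_sq(sum) {}
  float get_total_sum_feat_sq() const { return _total_sum_feat_sq; }

  friend void copy_example_data(example_t* dst, const example_t* src);

private:
  float _total_sum_feat_sq;
  bool _total_sum_feat_sq_calculated = false;
};

void copy_example_data(example_t* dst, const example_t* src)
{
  // Allowed despite the members being private, because of the friend declaration.
  dst->_total_sum_feat_sq = src->_total_sum_feat_sq;
  dst->_total_sum_feat_sq_calculated = src->_total_sum_feat_sq_calculated;
}
```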
}
ec.num_features_from_interactions = num_features_from_interactions;
Should this assignment of num_features_from_interactions be happening at the spot where they are calculated (since the example is also passed down)? I am worried that we are going to forget that this is needed at some future point when we call foreach or gd::inline_predict.
Valid concern, I thought the same. But I think there are situations where foreach_feature is called and you don't want to set num_features_from_interactions. Additionally, it prevents examples from being const where they otherwise could be.
Do you think it would be worth wrapping those calls?
(a) Where examples can be const, there is a non-const overload that calls the const version and then sets the number of features on the example, and
(b) if a foreach call doesn't want to set the number of features, we could again overload and just not set the example with the result.
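A rough sketch of the wrapper idea in (a), with purely illustrative names and signatures (not a proposal for the actual API):

```cpp
#include <cstddef>

// Illustrative only: a const "compute" function plus a non-const wrapper
// that stores the resulting count back on the example.
struct fake_example
{
  size_t num_features_from_interactions = 0;
};

size_t generate_interactions(const fake_example& /*ec*/)
{
  // ... iterate interactions; return how many features were generated.
  return 12;  // placeholder result for the sketch
}

// Non-const overload: delegates to the const version, then records the count.
size_t generate_interactions(fake_example& ec)
{
  const size_t n = generate_interactions(static_cast<const fake_example&>(ec));
  ec.num_features_from_interactions = n;
  return n;
}
```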
The feature count is a very non-critical operation, and it's only used as a measure of work. If it's missed, really all that happens is the progressive validation log will show fewer features. There are already quite a few overloads of foreach_feature, and this PR is adding more overloads there. I don't think it would be worth it.
float partial_prediction = 0.f;  // shared data for prediction.
float updated_prediction = 0.f;  // estimated post-update prediction.
float loss = 0.f;
float total_sum_feat_sq = 0.f;  // precomputed, cause it's kind of fast & easy.

float total_sum_feat_sq = 0.f;
Can you add a comment here about how this value is used and what will invalidate it? I see it getting reset a lot in this PR and I don't understand why places that used to update the value now reset it.
Anything that used to change it now just becomes a cache invalidation
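A small sketch of that cache-invalidation pattern (simplified, not the actual example members): writers only flip a flag, and the sum is recomputed lazily on the next read.

```cpp
#include <vector>

// Sketch of a lazily recomputed cache: mutations invalidate, reads recompute.
class cached_sum
{
public:
  void add_feature_sq(float x_sq)
  {
    _values.push_back(x_sq);
    _calculated = false;  // any mutation just invalidates the cache
  }

  float total_sum_feat_sq()
  {
    if (!_calculated)
    {
      _total = 0.f;
      for (float v : _values) { _total += v; }
      _calculated = true;
    }
    return _total;
  }

private:
  std::vector<float> _values;
  float _total = 0.f;
  bool _calculated = false;
};
```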
Added comment
@@ -149,6 +150,7 @@ inline void generate_interactions(namespace_interactions& interactions, bool per
auto begin = second.audit_cbegin();
if (same_namespace) { begin += (PROCESS_SELF_INTERACTIONS(ft_value)) ? i : i + 1; }
auto end = second.audit_cend();
num_features += std::distance(begin, end);
It looks like num_features isn't used by the callers all the time. Is it better to compute the distance every time than to have num_features be an optional input?
I'm assuming the distance calls are relatively expensive since the iterators are more complex than raw pointers
The std::distance call here is constant time (and very cheap) since the iterators are random access iterators. The addition of this extra logic hasn't affected the benchmarks in a meaningful way. Making it optional would require a branch instead of simple arithmetic.
The other option is an overload without this out parameter. But that results in a lot of code duplication with no noticeable difference in perf.
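For reference, std::distance dispatches on the iterator category: for random access iterators it is a single subtraction, while for weaker categories it has to walk the range. A minimal illustration:

```cpp
#include <iterator>
#include <list>
#include <vector>

// Random access iterators (e.g. std::vector): O(1), equivalent to end - begin.
long vector_length(const std::vector<int>& v) { return std::distance(v.cbegin(), v.cend()); }

// Bidirectional iterators (e.g. std::list): std::distance must walk, O(n).
long list_length(const std::list<int>& l) { return std::distance(l.cbegin(), l.cend()); }
```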
@@ -225,7 +223,7 @@ void predict_or_learn_multi(nn& n, single_learner& base, example& ec)

CONVERSE:  // That's right, I'm using goto. So sue me.

n.output_layer.total_sum_feat_sq = 1;
n.output_layer.reset_total_sum_feat_sq();
The original code sets the sum_feat_sq to 1 while the function sets it to 0. Is that ok?
Yes, since it will get calculated from scratch when needed, and that 0 won't make it out.
LGTM: my understanding is that the calculation of interaction features is still evolving (especially for ccb) and that it isn't always exact right now. That is OK since we are moving towards correct calculations (especially compared to before).
This change updates the way features are counted and the way the sum of features squared is calculated. For some multi_ex reductions, ccb and slates in particular, the number of features reported to the logger is incorrect. It was incorrect before this change, and this change doesn't fix it. The way features are counted needs to be updated for the multi_ex abstraction.