Redundant rules in linear model #28
As an aside, if my intuition is true, then rules A and B are also equivalent in terms of the linear model if transform(A) = 1 - transform(B), i.e. each rule is equivalent to its own negation. This holds when the linear model contains an intercept (which is the default in RuleFit). Depending on the research question, such redundancies may not be desired in cases where no intercept is used. These positive-negative redundancies can be found in a similar way:

```python
# Group each rule under both its own column values and their negation,
# so that a rule and the negation of another rule land in the same class.
equivalence_classes_with_negative = {}
for rule, col in transformed_x.T.iterrows():  # iterate over rule columns
    key_pos = tuple(col)
    key_neg = tuple(1 - col)
    equivalence_classes_with_negative.setdefault(key_pos, []).append(('+', rule))
    equivalence_classes_with_negative.setdefault(key_neg, []).append(('-', rule))

redundant_rules_with_neg = {
    k: v for k, v in equivalence_classes_with_negative.items() if len(v) > 1
}
for group in redundant_rules_with_neg.values():
    print('Redundant rules')
    for sign, rule in group:
        print('{}{}: support={:.3f}, prediction={:.3f}'.format(
            sign, rule, rule.support, rule.prediction_value))
```

we get rule groups like this:
Arguably we would need only one of these features in the linear model (as the other can be absorbed into the intercept). I'm pretty sure this is a special case of the problem of constructing a linearly independent categorical encoding, and I wonder if something like this should be implemented here as well.
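The intercept-absorption point is easy to check numerically. A minimal sketch (synthetic data, numpy only): fitting an intercept plus a rule column gives exactly the same fitted values as an intercept plus the negated column.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.integers(0, 2, size=100).astype(float)       # a binary rule column
y = 3.0 + 2.0 * r + rng.normal(0.0, 0.1, size=100)   # synthetic target

ones = np.ones_like(r)
X_pos = np.column_stack([ones, r])         # intercept + rule
X_neg = np.column_stack([ones, 1.0 - r])   # intercept + negated rule

beta_pos, *_ = np.linalg.lstsq(X_pos, y, rcond=None)
beta_neg, *_ = np.linalg.lstsq(X_neg, y, rcond=None)

# Identical fitted values: the negation is absorbed into the intercept
# (intercept_neg = intercept_pos + slope_pos, slope_neg = -slope_pos).
assert np.allclose(X_pos @ beta_pos, X_neg @ beta_neg)
assert np.isclose(beta_neg[0], beta_pos[0] + beta_pos[1])
```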
Very nice observation. Lasso does tend to randomly pick a feature (rule) in the case of collinearity! For the OR option, do you mean carrying all equivalent rules, say R1, R2, R3, into a composite R* = (R1 or R2 or R3)? Would a simpler solution be to whip out Occam's razor and simply pick the simplest rule (lowest number of conditions), or pick at random if they are equal? Or do you think it is important to let the user see the options for equivalent rules...
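For reference, the composite-OR merge being discussed could look roughly like this. This is only a sketch: the `transformed_x` DataFrame and the rule names are made up for illustration, not taken from rulefit's actual data structures.

```python
import pandas as pd

# Hypothetical 0/1 rule-transformed matrix; rule names are illustrative.
transformed_x = pd.DataFrame({
    'R1: x < 5':    [1, 0, 1, 0],
    'R2: x < 5.01': [1, 0, 1, 0],   # equivalent to R1 on this training set
    'R3: y > 2':    [0, 1, 0, 0],
})

group = ['R1: x < 5', 'R2: x < 5.01']
merged = transformed_x[group].max(axis=1)   # element-wise OR for 0/1 columns
transformed_x = transformed_x.drop(columns=group)
transformed_x['(R1: x < 5) OR (R2: x < 5.01)'] = merged
```

On the training set the OR changes nothing (the merged columns are identical there), but on unseen data the composite fires whenever any member rule fires.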
My concern is that the simplest rule might overgeneralise on the training dataset. For instance, for something like the following dataset of (x, y) pairs, the following rules are redundant on the training data, but for a hypothetical test observation they need not agree. Ideally I think something like merging the rule conditions would be needed.
Not entirely sure how to implement this, though. Maybe sympy? Related to my second comment in this issue: I have been thinking about it, and either I don't understand something about the decision rule generation as described in Friedman and Popescu, or it could be improved. In my opinion, only one side of each split (condition == True or condition == False) needs to be incorporated at any given level of the tree, as including both adds no extra predictive power to the linear model downstream. Let me see if I can draft a proof of this.
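The one-sided-split intuition can be illustrated directly: at any split, the two child indicators sum to the parent indicator, so one child is an exact linear combination of the parent rule and the other child. A small sketch with made-up data (not a proof):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
parent = (x > 1.5).astype(float)     # rule for the parent node
left   = parent * (x <= 4.0)         # parent AND split condition
right  = parent * (x > 4.0)          # parent AND negated split condition

# Exact collinearity: right = parent - left, so a linear model that already
# contains the parent rule (or the intercept, at the root) gains nothing
# from including both children.
assert np.array_equal(left + right, parent)
```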
The current implementation of `rulefit` can sometimes produce redundant features that are then fed into the lasso. This comes from the stochastic nature of the random trees and the lack of rule pruning. To illustrate this, let's have a look at the Boston example in the readme:
The inner workings of `RuleFit` rely on an L1-penalised linear model which acts on the rule-transformed input variables, i.e.:

Now what happens (even with the Boston dataset) is that this `transformed_x` contains columns that are to all intents and purposes duplicated, given the training set. This results from the different decision trees having slightly different splits that end up having the same effect on X. To see this we can construct a hashtable of rules based on the `transformed_x` values:

Out of the groups in the output, some are quite obviously redundant, such as:
Others are perhaps less intuitive:
I guess the difference in support comes from random subsampling during gradient boosting.
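The hashtable grouping described above can be sketched as follows. This is a minimal reconstruction, assuming `transformed_x` is a pandas DataFrame of 0/1 rule columns; the rule names are illustrative only.

```python
import pandas as pd

# Hypothetical rule-transformed matrix; names are illustrative only.
transformed_x = pd.DataFrame({
    'ruleA': [1, 0, 1],
    'ruleB': [1, 0, 1],   # identical to ruleA on this training set
    'ruleC': [0, 1, 0],
})

# Hash each rule by the tuple of its column values: identical columns
# collide into the same equivalence class.
equivalence_classes = {}
for rule, col in transformed_x.T.iterrows():
    equivalence_classes.setdefault(tuple(col), []).append(rule)

redundant = {k: v for k, v in equivalence_classes.items() if len(v) > 1}
# redundant == {(1, 0, 1): ['ruleA', 'ruleB']}
```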
This makes the linear model badly specified and, in the best case, will cause model instability, as L1 will keep picking one rule from each equivalent group at random.
To solve this one would probably need to merge these rules. I'm happy to give a PR that does this, but I think it needs to be discussed on what would be the best way to approach this.
The logical option is probably an OR operator.
This would handle the cases where the rules might not be equivalent in other datasets that are not training sets.
The question, of course, is what to do with the support and prediction values during the merge.
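On recalculating support: since a rule's support is just the fraction of training rows where it fires, the merged rule's support could be recomputed directly from the OR of the member columns. A sketch with made-up columns (r2 differs in one row to show the general, non-equivalent case):

```python
import numpy as np

r1 = np.array([1, 0, 1, 0, 0], dtype=float)   # illustrative rule columns
r2 = np.array([1, 0, 1, 0, 1], dtype=float)

merged = np.maximum(r1, r2)     # element-wise OR for 0/1 columns
merged_support = merged.mean()  # fraction of rows where the merged rule fires

assert merged_support == 0.6
```

For rules that are exactly equivalent on the training set, this recomputed support equals each member's support.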
If I understand the code correctly, the `prediction_value` is not used anywhere. Can `support` be recalculated at the linear model step?