Added Gini Importances #154

mepland · 2022-12-29T21:30:23Z

Hi @csinva how do the new Gini importances look?

I based the calculation off sklearn's code from here and here, though it needed to be made recursive as we do not have arrays of all the nodes and their properties.

There is a demo of the new code in the FIGS_viz_demo.ipynb notebook. I am a bit concerned with the None impurity in the root node of the second tree:

node_id: 0, left.node_id: 1, right.node_id: 2, impurity: None

I filled it with 0 for the calculation for now:

                importance_data_tree[node.feature] += (
                    np.sum(node.value_sklearn) * (node.impurity if node.impurity is not None else 0.) -
                    np.sum(node.left.value_sklearn) * node.left.impurity -
                    np.sum(node.right.value_sklearn) * node.right.impurity
                )

Is None expected if the tree has just one split?
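For readers following along, here is a minimal sketch of the recursive accumulation described above. The `Node` dataclass is a hypothetical stand-in, not the library's class; only the attribute names (`feature`, `impurity`, `value_sklearn`, `left`, `right`) are taken from the snippet in this comment, and the toy tree and its counts are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Node:
    """Hypothetical stand-in for a fitted tree node (not the library's class)."""
    impurity: float
    value_sklearn: np.ndarray          # per-class sample counts reaching the node
    feature: Optional[int] = None      # split feature index; None for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def accumulate_importances(node: Node, importances: np.ndarray) -> None:
    """Recursively add each split's weighted impurity decrease to its feature."""
    if node.left is None or node.right is None:
        return  # leaf: no split, nothing to add
    importances[node.feature] += (
        np.sum(node.value_sklearn) * node.impurity
        - np.sum(node.left.value_sklearn) * node.left.impurity
        - np.sum(node.right.value_sklearn) * node.right.impurity
    )
    accumulate_importances(node.left, importances)
    accumulate_importances(node.right, importances)


# Toy tree: 10 samples split on feature 0, then the left child split on feature 1.
leaf_a = Node(impurity=0.0, value_sklearn=np.array([3, 0]))
leaf_b = Node(impurity=0.0, value_sklearn=np.array([0, 2]))
left = Node(impurity=0.48, value_sklearn=np.array([3, 2]), feature=1,
            left=leaf_a, right=leaf_b)
right = Node(impurity=0.32, value_sklearn=np.array([1, 4]))
root = Node(impurity=0.48, value_sklearn=np.array([4, 6]), feature=0,
            left=left, right=right)

importances = np.zeros(2)
accumulate_importances(root, importances)
```

In this toy tree every parent impurity is at least the weighted average of its children, so both accumulated values are non-negative.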

Also, after taking the mean and normalizing most of the importances are negative. I think this is fine, as we just care about the relative order of the features, but wanted to get your opinion as well:

BTW I noticed that we have an unused variable in plot():

criterion = "squared_error" if isinstance(self, RegressorMixin) else "gini"

Is this needed for anything, or should we delete it?

mepland · 2022-12-29T21:31:40Z

Addresses #127 for FIGS.

mepland · 2022-12-29T21:34:51Z

Also, I didn't add support for sample_weight during this initial pass: #89

csinva · 2022-12-30T15:15:00Z

criterion = "squared_error" if isinstance(self, RegressorMixin) else "gini"

Yeah good catch we can drop this.

csinva · 2022-12-30T15:19:03Z

Also, after taking the mean and normalizing most of the importances are negative. I think this is fine, as we just care about the relative order of the features, but wanted to get your opinion as well:

I think we should make all the importances positive (just flip the sign at the end), to stay consistent with sklearn.

csinva · 2022-12-30T15:25:57Z

Your calculation looks right to me -- I actually don't understand the None impurity issue. It seems to me that every new added stump should be assigned an impurity (L181) and if it's None we should see an error or early stop (L235).

mepland · 2022-12-30T15:49:34Z

I think we should make all the importances positive (just flip the sign at the end), to stay consistent with sklearn.

@csinva if we do that by multiplying by -1, the order would flip too, which is not what we want, right? Unless normalizing by that negative sum has effectively already flipped the order. Actually, I can see that: Glucose concentration test is used twice at the start of the first tree, so it should be the most important.

Maybe I should normalize by avg_feature_importances / abs(np.sum(avg_feature_importances))? But then the sum is -1...
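A small arithmetic sketch of the two normalizations being weighed here. The values in `avg` are made up for illustration and are not the importances from the notebook; the point is only what dividing by a negative sum does to signs and ordering.

```python
import numpy as np

# Hypothetical averaged importances where most entries are negative.
avg = np.array([-3.0, -1.0, 0.5])

# Dividing by the (negative) sum flips every sign and reverses the ranking.
by_sum = avg / np.sum(avg)

# Dividing by the absolute sum keeps signs and ranking, but the total is -1.
by_abs_sum = avg / abs(np.sum(avg))

order_raw = np.argsort(avg)          # ascending order of the raw values
order_by_sum = np.argsort(by_sum)    # ascending order after dividing by the sum
order_by_abs = np.argsort(by_abs_sum)
```

Under these assumed values, `by_sum` totals 1 with the ranking reversed relative to `avg`, while `by_abs_sum` keeps the ranking and totals -1.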

mepland · 2022-12-30T15:59:39Z

BTW sklearn has an additional normalization step in the code, but it is not ultimately used.

mepland · 2022-12-30T16:05:36Z

Your calculation looks right to me -- I actually don't understand the None impurity issue. It seems to me that every new added stump should be assigned an impurity (L181) and if it's None we should see an error or early stop (L235).

Actually, I think we should try to figure this out first as the negative importance for Diabetes pedigree function is basically being caused by that None impurity being set as 0. Fixing this should also fix the negatives issue.
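To illustrate why substituting 0 for a missing parent impurity pushes a split's contribution negative, here is a short sketch. The sample counts and impurity values are illustrative assumptions, and `weighted_decrease` is a hypothetical helper that mirrors the expression from the first comment.

```python
def weighted_decrease(n_parent, imp_parent, n_left, imp_left, n_right, imp_right):
    """Weighted impurity decrease of one split (same form as the PR snippet)."""
    return n_parent * imp_parent - n_left * imp_left - n_right * imp_right


# Illustrative numbers only: 100 samples split 60/40.
n_parent, n_left, n_right = 100, 60, 40
imp_left, imp_right = 0.128, 0.148

# Parent impurity replaced by 0: only the negative child terms remain.
with_zero = weighted_decrease(n_parent, 0.0, n_left, imp_left, n_right, imp_right)

# Parent impurity set to an assumed real value of 0.2.
with_value = weighted_decrease(n_parent, 0.2, n_left, imp_left, n_right, imp_right)
```

With the parent term zeroed out, the result is negative whenever any child has a positive impurity, regardless of how good the split is.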

mepland · 2022-12-30T16:31:57Z

@csinva I did some debugging and I think that the None impurity node at the root of the second tree is coming from this line:

                # add new root potential node
                node_new_root = Node(is_root=True, idxs=np.ones(X.shape[0], dtype=bool),
                                     tree_num=-1)
                potential_splits.append(node_new_root)

The default param for impurity in Node() is None and then it is never filled with a real value. I'll paste my debugging output below so you can see it in practice.

How would you like to fix this? Seems to be a problem in the underlying FIGS fitting algo itself, not the impurity or importance calculations.
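One way to catch this kind of gap outside of print debugging is a small recursive check, sketched below. The `Node` class here is a hypothetical minimal version that only mirrors the `impurity=None` default mentioned above, and `nodes_missing_impurity` is an illustrative helper, not part of the library.

```python
from typing import List, Optional


class Node:
    """Hypothetical minimal node; only the impurity=None default mirrors the thread."""

    def __init__(self, impurity: Optional[float] = None,
                 left: Optional["Node"] = None,
                 right: Optional["Node"] = None):
        self.impurity = impurity
        self.left = left
        self.right = right


def nodes_missing_impurity(node: Optional[Node], path: str = "root") -> List[str]:
    """Return the paths of all nodes whose impurity was never filled in."""
    if node is None:
        return []
    missing = [path] if node.impurity is None else []
    missing += nodes_missing_impurity(node.left, path + ".left")
    missing += nodes_missing_impurity(node.right, path + ".right")
    return missing


# A one-split tree whose root was created with the default impurity.
tree = Node(left=Node(impurity=0.128), right=Node(impurity=0.148))
missing = nodes_missing_impurity(tree)
```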

Debugging output:

DEBUG starting new tree:
tree_num: -1, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2239312065972222
DEBUG in progress tree
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2239312065972222
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06320022981901752
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24828989767652213
DEBUG in progress tree
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2239312065972222
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06320022981901752
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24828989767652213
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2385204081632653
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.1541950113378685
DEBUG in progress tree
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2239312065972222
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06320022981901752
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24828989767652213
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2385204081632653
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.19996537396121883
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2307098765432099
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.1541950113378685
DEBUG in progress tree
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2239312065972222
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06320022981901752
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24828989767652213
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2385204081632653
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.19996537396121883
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06035379812695109
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24395061728395062
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2307098765432099
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.1541950113378685
DEBUG starting new tree:
tree_num: -1, node_id: None, left.node_id: None, right.node_id: None, impurity: None
DEBUG in progress tree
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2239312065972222
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06320022981901752
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24828989767652213
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2385204081632653
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.19996537396121883
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06035379812695109
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24395061728395062
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2307098765432099
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.1541950113378685
tree_num: 1, node_id: None, left.node_id: None, right.node_id: None, impurity: None
tree_num: 1, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.12839368044187147
tree_num: 1, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.1477618114868653
DEBUG in progress tree
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2239312065972222
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06320022981901752
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24828989767652213
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2385204081632653
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.19996537396121883
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06035379812695109
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24395061728395062
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.21345881926419968
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.18349552685909898
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2307098765432099
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.1541950113378685
tree_num: 1, node_id: None, left.node_id: None, right.node_id: None, impurity: None
tree_num: 1, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.12839368044187147
tree_num: 1, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.1477618114868653
DEBUG fitted tree
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2239312065972222
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06320022981901752
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24828989767652213
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2385204081632653
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.19996537396121883
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.06035379812695109
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.24395061728395062
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.21345881926419968
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.18349552685909898
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.2307098765432099
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.1541950113378685
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.09754575404204457
tree_num: 0, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.009377186611138793
tree_num: 1, node_id: None, left.node_id: None, right.node_id: None, impurity: None
tree_num: 1, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.12839368044187147
tree_num: 1, node_id: None, left.node_id: None, right.node_id: None, impurity: 0.1477618114868653

csinva · 2022-12-30T16:50:09Z

Ah you're right, when that new potential node is updated at this line, impurity_reduction is updated but not impurity (since it wasn't used up till now). It's an easy fix, let me push it now.
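The shape of the fix described here might look something like the sketch below. This is an assumption about the general idea, not the code from the actual commit: `PotentialNode`, `SplitResult`, and `update_from_split` are hypothetical names, and only `impurity` and `impurity_reduction` come from the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PotentialNode:
    """Hypothetical candidate node; both fields start unset, as in the thread."""
    impurity: Optional[float] = None
    impurity_reduction: Optional[float] = None


@dataclass
class SplitResult:
    """Hypothetical container for what evaluating a candidate split produces."""
    impurity: float
    impurity_reduction: float


def update_from_split(node: PotentialNode, split: SplitResult) -> PotentialNode:
    """Copy both values onto the node, not just the reduction."""
    node.impurity_reduction = split.impurity_reduction
    node.impurity = split.impurity  # the field that was previously left as None
    return node


new_root = update_from_split(PotentialNode(), SplitResult(0.21, 0.05))
```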

csinva · 2022-12-30T16:58:10Z

Okay I think it should be fixed now

mepland · 2022-12-30T17:07:32Z

All set!

csinva · 2022-12-30T17:09:10Z

Perfect!

mepland · 2022-12-30T17:12:29Z

@csinva ready to merge!

mepland · 2022-12-30T17:13:56Z

@csinva BTW I want to make one more PR to clean up that notebook and fix spelling mistakes, but then we can make a new release if you'd like.

added Gini importances

512f045

mepland requested a review from csinva December 29, 2022 21:30

remove unused criterion var

5b606b0

update impurity in new splits

c9f8394

mepland added 2 commits December 30, 2022 12:06

remove None check

88f790b

updated notebook

1ff794f

mepland marked this pull request as ready for review December 30, 2022 17:07

revert notebook to allow merge

c15c5c1

csinva added 2 commits December 30, 2022 13:16

resolve FIGS_viz_demo conflicts

2b59d5d

reset FIGS_viz_demo nb

77ba65f

csinva merged commit 4382106 into master Dec 30, 2022

mepland mentioned this pull request Dec 30, 2022

Create mean decrease in impurity (MDI) feature importances for tree-based models #127

Open

mepland deleted the figs_gini_importance branch January 3, 2023 23:06
