How are unimportant features/words being handled? #6

pidahbus · 2020-03-20T11:52:48Z

Hi @csinva, in linear or logistic regression we calculate the standard error or run a t-test to check whether a variable is significant. Let me formulate my question mathematically. Assume I fit a linear regression with two variables (the variables can also be words), so the fitted equation is:

y_hat = b0_hat + b1_hat * x1 + b2_hat * x2

and let's say that b1_hat = 100 and b2_hat = 2. Because b1_hat is larger than b2_hat, we might think that x1 has the larger positive effect on the output. However, despite its large coefficient, x1 may turn out to be insignificant under a t-test of b1_hat because of its high standard error.
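To make the scenario concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the synthetic data, the noise levels, and the helper names `invert` and `ols_with_t` are hypothetical and come from neither the paper nor the repository. It fits ordinary least squares with two variables and reports each coefficient's standard error and t statistic. Because x1 barely varies in the simulated data, b1_hat is estimated with a far larger standard error than b2_hat.

```python
import random


def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_with_t(X, y):
    """Return (coefficients, standard errors, t statistics) for y ~ X."""
    n, k = len(X), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
    sigma2 = sum(r * r for r in resid) / (n - k)  # residual variance estimate
    se = [(sigma2 * xtx_inv[i][i]) ** 0.5 for i in range(k)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, se, t


# Synthetic data (assumed): x1 barely varies, x2 varies a lot.
rng = random.Random(0)
n = 50
x1 = [rng.gauss(0.0, 0.01) for _ in range(n)]
x2 = [rng.gauss(0.0, 10.0) for _ in range(n)]
y = [100.0 * a + 2.0 * b + rng.gauss(0.0, 5.0) for a, b in zip(x1, x2)]

X = [[1.0, a, b] for a, b in zip(x1, x2)]  # leading 1.0 is the intercept column
beta, se, t = ols_with_t(X, y)
for name, b, s, tv in zip(["b0_hat", "b1_hat", "b2_hat"], beta, se, t):
    print(f"{name}: estimate={b:.3f}  std_err={s:.3f}  t={tv:.2f}")
```

Under these assumptions the magnitude of a coefficient alone says little about its significance; the t statistic divides it by its standard error.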

From reading your paper, I understood that the ACD/CD score is somewhat similar to the b (beta, or coefficient) values of the linear (or logistic) equation above. So even when a word or phrase has a very high (or low) CD/ACD score, that word or phrase may still be unimportant/insignificant to the model. Could you please tell me how your method handles this scenario?


csinva · 2020-03-20T17:19:17Z

Thanks for the detailed question!

One quick thing to clarify: the ACD/CD score is actually not the same as the coefficient, but is more like the coefficient * the feature value. In your example above, the ACD/CD score for the feature x1 is like b1_hat * x1. As a result, it takes into account both the value of the feature and the coefficient when deciding how significant something is.
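As a rough illustration of that analogy (a sketch for a plain linear model only, not the ACD/CD implementation), the per-feature contribution is the coefficient times the feature value. The intercept, the feature values, and the helper name `linear_contributions` below are hypothetical.

```python
def linear_contributions(coefs, x):
    """Per-feature contribution of a linear model: coefficient * feature value."""
    return [b * v for b, v in zip(coefs, x)]


# Coefficients from the example above; the intercept and inputs are assumed.
b0_hat, b1_hat, b2_hat = 0.5, 100.0, 2.0
x1, x2 = 0.25, 30.0

contribs = linear_contributions([b1_hat, b2_hat], [x1, x2])
y_hat = b0_hat + sum(contribs)

print("contribution of x1:", contribs[0])
print("contribution of x2:", contribs[1])
print("y_hat:", y_hat)
```

With these assumed inputs, x2 contributes more to the prediction than x1 even though b2_hat is much smaller than b1_hat, because the score depends on the feature value as well as the coefficient.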

pidahbus · 2020-03-23T08:56:35Z

Hi @csinva,
your last answer raised two questions in my mind.

1. I think that to check whether x1 is important or not, we need to consider the standard error of b1_hat. The values of x1 and b1_hat alone are not sufficient.
2. If the ACD/CD score is analogous to b1_hat * x1, then the feature importance calculation also takes the value of the feature into account. This means your method will always give a higher ACD/CD score to outliers.

csinva · 2020-03-23T15:25:03Z

Thanks for the questions!

Re: standard error - this is a good point and something that would be good to look into! As far as I know there are basically no techniques to explicitly do this, although one might be able to co-opt some popular techniques for estimating neural network uncertainty (e.g. test-time dropout or model ensembling).
Re: outliers - this is often true (although unlike a linear model, the contribution of extreme values of features can often saturate / behave nonlinearly). It's also usually worth comparing the relative contributions of different features for a single prediction (and, if comparing across different predictions, normalizing by the prediction values for each data point).
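A minimal sketch of both ideas, under loud assumptions: bootstrap-resampled one-feature linear models stand in for a neural network ensemble, the contribution score is taken to be slope * feature value as in the linear analogy, and all names and data here are hypothetical. This is not something the ACD/CD code does. The spread of the score across ensemble members plays a role loosely similar to a standard error, and `normalize_by_prediction` shows one simple way to put contributions from different data points on a comparable scale.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Least-squares slope for a one-feature model through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def ensemble_contributions(xs, ys, x_query, n_models=200, seed=0):
    """Fit one model per bootstrap resample; return each model's slope * x_query."""
    rng = random.Random(seed)
    idx = range(len(xs))
    scores = []
    for _ in range(n_models):
        sample = [rng.choice(idx) for _ in idx]  # resample rows with replacement
        slope = fit_slope([xs[i] for i in sample], [ys[i] for i in sample])
        scores.append(slope * x_query)
    return scores


def normalize_by_prediction(contribs, prediction):
    """Express each contribution as a fraction of the prediction for that data point."""
    return [c / prediction for c in contribs]


# Synthetic data (assumed): noisy linear relationship with true slope 3.
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [3.0 * x + rng.gauss(0.0, 2.0) for x in xs]

scores = ensemble_contributions(xs, ys, x_query=0.8)
mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)
print(f"contribution score: mean={mean_score:.3f}  spread={spread:.3f}")

# Normalizing two contributions by their sum (a prediction with no intercept).
fractions = normalize_by_prediction([25.0, 60.0], 85.0)
print("fractions of the prediction:", fractions)
```

A score whose mean is small relative to its spread across ensemble members would be a candidate for "high value but not reliable", which is the case the original question asks about.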

Hope that helps!

csinva added the "question" label (Further information is requested) on Mar 20, 2020

csinva closed this as completed Sep 7, 2020
