How unimportant features/words are being handled? #6

Closed
pidahbus opened this issue Mar 20, 2020 · 3 comments
Labels
question Further information is requested

Comments

@pidahbus

Hi @csinva, in linear or logistic regression we compute the standard error of each coefficient, or run a t-test on it, to decide whether the corresponding variable is significant. Let me formulate my question mathematically. Suppose I fit a linear regression with two variables (the variables could also be words), so the fitted equation is:

y_hat = b0_hat + b1_hat * x1 + b2_hat * x2

Say b1_hat = 100 and b2_hat = 2. Because b1_hat is larger than b2_hat, we might conclude that x1 has the stronger positive effect on the output. But despite its large coefficient, x1 may turn out to be insignificant under a t-test of b1_hat because of its high standard error.
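To make the concern concrete, here is a minimal sketch of the t-statistic arithmetic. The coefficients come from the example above; the standard errors (80 and 0.5) and the 1.96 cutoff (roughly the two-sided 5% threshold for large samples) are assumptions chosen only for illustration.

```python
def t_statistic(coef, std_err):
    """t-statistic for testing whether a coefficient differs from zero."""
    return coef / std_err


def is_significant(t_value, critical=1.96):
    """Compare |t| to a critical value (1.96 is an assumed large-sample cutoff)."""
    return abs(t_value) > critical


b1_hat, se_b1 = 100.0, 80.0  # standard error is hypothetical
b2_hat, se_b2 = 2.0, 0.5     # standard error is hypothetical

t1 = t_statistic(b1_hat, se_b1)
t2 = t_statistic(b2_hat, se_b2)

print(f"x1: t = {t1:.2f}, significant = {is_significant(t1)}")
print(f"x2: t = {t2:.2f}, significant = {is_significant(t2)}")
```

Under these assumed standard errors, the variable with the much larger coefficient fails the test while the one with the smaller coefficient passes it.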

My understanding from your paper is that the ACD/CD score is somewhat similar to the b (beta, or coefficient) values in the linear (or logistic) equation above. So even when a word or phrase has a very high (or very low) CD/ACD score, that word or phrase may still be unimportant or insignificant to the model. Could you please explain how your method handles this scenario?

csinva added the question label on Mar 20, 2020
@csinva
Owner

csinva commented Mar 20, 2020

Thanks for the detailed question!

One quick thing to clarify: the ACD/CD score is not the same as the coefficient; it is more like the coefficient times the feature value. In your example above, the ACD/CD score for the feature x1 is like b1_hat * x1. As a result, it takes both the feature's value and its coefficient into account when deciding how significant something is.
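The linear-model analogy can be sketched as follows. This is not the CD/ACD algorithm itself, only the `coefficient * feature value` analogy from the comment above; the intercept and the feature values for x1 and x2 are hypothetical.

```python
def linear_contributions(coefs, features):
    """Per-feature contribution to a linear prediction: coefficient * feature value."""
    return [b * x for b, x in zip(coefs, features)]


b0_hat = 1.0            # hypothetical intercept
coefs = [100.0, 2.0]    # b1_hat, b2_hat from the example in the question
features = [0.02, 3.0]  # hypothetical values of x1, x2 for one data point

contributions = linear_contributions(coefs, features)
y_hat = b0_hat + sum(contributions)

# Index of the feature with the largest absolute contribution.
largest = max(range(len(contributions)), key=lambda i: abs(contributions[i]))

print("contributions:", contributions)
print("prediction:", y_hat)
print("largest contributor: x%d" % (largest + 1))
```

For this data point x2 contributes more than x1 even though b1_hat is fifty times larger than b2_hat, because the contribution depends on the feature value as well as the coefficient.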

@pidahbus
Author

Hi @csinva,
your last answer raised two questions in my mind.

  1. I think that to check whether x1 is important, we need to consider the standard error of b1_hat. The values of x1 and b1_hat alone are not sufficient.

  2. If the ACD/CD score is analogous to b1_hat * x1, then the feature importance calculation also takes the value of the feature into account. This means your method will always give higher ACD/CD scores to outliers.

@csinva
Owner

csinva commented Mar 23, 2020

Thanks for the questions!

  1. Re: standard error: this is a good point and something that would be good to look into! As far as I know, there are basically no techniques that do this explicitly, although one might be able to co-opt some popular techniques for estimating neural network uncertainty (e.g. test-time dropout or model ensembling).

  2. This is often true (although, unlike in a linear model, the contribution of extreme feature values can often saturate or behave nonlinearly). It's also usually worth comparing the relative contributions of different features within a single prediction (and, if comparing across different predictions, normalizing by the prediction value for each data point).
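Both points can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_attribution` is a saturating stand-in and not the CD/ACD score, the ensemble is just a list of hypothetical weights, and the spread across members is only a rough proxy for a standard error.

```python
import math
import statistics


def toy_attribution(weight, x):
    """Toy stand-in for an attribution score; tanh makes it saturate for large |weight * x|."""
    return math.tanh(weight * x)


def attribution_with_spread(weights, x):
    """Mean and sample standard deviation of the toy score across ensemble members."""
    scores = [toy_attribution(w, x) for w in weights]
    return statistics.mean(scores), statistics.stdev(scores)


def normalize_by_prediction(score, prediction):
    """Express a score as a fraction of its data point's prediction."""
    if prediction == 0:
        raise ValueError("prediction must be nonzero to normalize")
    return score / prediction


# Hypothetical ensemble: one weight per independently trained model.
ensemble_weights = [0.8, 1.0, 1.2, -0.1, 1.1]

mean_typical, std_typical = attribution_with_spread(ensemble_weights, 1.0)
mean_outlier, std_outlier = attribution_with_spread(ensemble_weights, 100.0)

print(f"typical x=1:   mean={mean_typical:.3f}, spread={std_typical:.3f}")
print(f"outlier x=100: mean={mean_outlier:.3f}, spread={std_outlier:.3f}")

# Scores from two data points with very different prediction scales (hypothetical numbers).
print(normalize_by_prediction(5.0, 10.0), normalize_by_prediction(50.0, 100.0))
```

In this toy setting the outlier's score stays bounded despite the feature value being 100 times larger, and a large spread relative to the mean flags a score the ensemble members disagree on. The two normalized scores come out equal even though the raw scores differ by a factor of ten.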

Hope that helps!

csinva closed this as completed on Sep 7, 2020

2 participants