Linear regression and categorical variables: X'X singular. (Chapter 2) #167
Comments
I solved this by creating a DummyCoder transformer (based on CategoricalEncoder) that creates, for each feature with k categories, (k-1) columns of ones and zeros:
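The original transformer code is not included in this thread; below is a minimal sketch of what such a DummyCoder might look like, written here against the standard `BaseEstimator`/`TransformerMixin` interface rather than the commenter's CategoricalEncoder-based version:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DummyCoder(BaseEstimator, TransformerMixin):
    """Encode each categorical column as (k-1) dummy columns,
    dropping the first category to avoid collinearity with an
    intercept column. Illustrative sketch, not the original code."""

    def fit(self, X, y=None):
        # Remember the sorted categories seen in each column.
        self.categories_ = [np.unique(X[:, i]) for i in range(X.shape[1])]
        return self

    def transform(self, X):
        cols = []
        for i, cats in enumerate(self.categories_):
            # Skip the first category -> k-1 columns of ones and zeros.
            for cat in cats[1:]:
                cols.append((X[:, i] == cat).astype(float))
        return np.column_stack(cols)
```

The dropped category is represented by a row of all zeros, so no dummy column duplicates the intercept.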
Moreover, I found out that the results of linear regression may change if a sparse matrix is used instead of a dense array for fitting.
Hi @igor-vyr,
Dear Aurélien,
I'm currently reading your book and it's great. However, I have one question regarding regression and categorical features: in chapter 2, you use 1-hot encoding, which means that you replace the categorical feature column (with 5 possible strings as values) with 5 columns (with ones and zeros as values).
However, these 5 columns sum to the new intercept column x_0 (with all values equal to 1) that is added to the matrix X (of preprocessed data) in linear regression (if fit_intercept=True), so the columns of X are linearly dependent. Therefore the matrix X'X should be singular and not invertible, and linear regression should fail. However, this does not happen. My question is: how is this possible?
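The rank deficiency described above is easy to check numerically; this is a toy design matrix for illustration, not data from the book:

```python
import numpy as np

# Toy design matrix: intercept column plus a full one-hot encoding
# of a 3-category feature over 5 samples.
onehot = np.eye(3)[[0, 1, 2, 0, 1]]        # 5 samples, 3 dummy columns
X = np.column_stack([np.ones(5), onehot])  # prepend intercept x_0

# The dummy columns sum to the intercept column, so X is
# rank-deficient and X'X is singular.
print(np.linalg.matrix_rank(X))            # 3, not 4
print(np.linalg.det(X.T @ X))              # ~0
```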
Some more observations:
The code
produces the same output three times, which confirms the redundancy of either the intercept or one of the 1-hot columns.
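The code block referenced above was lost in this page capture; the following sketch guesses at the kind of check involved, comparing fitted values from the redundant (full one-hot) and reduced (k-1 columns) encodings on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
cats = rng.integers(0, 3, size=50)
onehot = np.eye(3)[cats]               # full one-hot: 3 columns
num = rng.normal(size=(50, 1))         # one numeric feature
y = rng.normal(size=50)

X_full = np.column_stack([num, onehot])         # redundant w/ intercept
X_red = np.column_stack([num, onehot[:, 1:]])   # drop first dummy column

p_full = LinearRegression().fit(X_full, y).predict(X_full)
p_red = LinearRegression().fit(X_red, y).predict(X_red)
# The fitted values are the projection of y onto the column span,
# which is identical for both designs once the intercept is included.
print(np.allclose(p_full, p_red))      # True
```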
For a dataset composed only of the categorical variable, this problem seems more critical; the code
returns unrealistic coefficients (on the order of 10^16).
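Again the referenced code is not shown; a sketch of the described setup (only the full one-hot feature, default intercept) would look like the following. The coefficient magnitudes obtained on a rank-deficient design depend on the solver and scikit-learn version, so the 10^16 values are not asserted here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
cats = rng.integers(0, 3, size=50)
X = np.eye(3)[cats]                 # only the one-hot categorical feature
y = rng.normal(size=50)

# fit_intercept=True by default, so X plus the intercept is rank-deficient.
reg = LinearRegression().fit(X, y)
print(reg.coef_)                    # magnitudes vary with solver/version
```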
fit_intercept=False seems like a good workaround; however, it only works if we have a single categorical feature. A solution would be a 1-hot encoding transformer returning (k-1) columns for k possible categories. Do you know of a transformer that does this?
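For reference, one readily available option for the (k-1)-column encoding asked about here is pandas' get_dummies with drop_first=True (the column name and values below are illustrative, loosely based on the book's ocean_proximity feature):

```python
import pandas as pd

df = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "ISLAND",
                                       "INLAND", "NEAR BAY"]})
# k = 3 categories -> k-1 = 2 dummy columns; the dropped category is
# encoded as all zeros, so nothing is collinear with an intercept.
dummies = pd.get_dummies(df["ocean_proximity"], drop_first=True)
print(dummies.shape)  # (5, 2)
```

In later scikit-learn versions, `OneHotEncoder(drop="first")` offers the same behavior as a pipeline-compatible transformer.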