
Linear regression and categorical variables: X'X singular. (Chapter 2) #167

Closed
igor-vyr opened this issue Jan 29, 2018 · 2 comments

igor-vyr commented Jan 29, 2018

Dear Aurélien,

I'm currently reading your book and it's great. However, I have one question regarding linear regression and categorical features: in Chapter 2 you use one-hot encoding, which means the categorical feature column (with 5 possible string values) is replaced by 5 columns (with ones and zeros as values).

However, these 5 columns are linearly dependent with the intercept column x_0 (all values equal to 1) that is added to the matrix X of preprocessed data in linear regression (if fit_intercept=True): the 5 one-hot columns sum to the intercept column. Therefore the matrix X'X should be singular (not invertible) and linear regression should fail, yet it does not. My question is: how is this possible?
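
To make the dependence concrete, here is a small NumPy sketch (toy data, not from the notebook) showing that adding an intercept column to a full set of one-hot columns makes X rank-deficient:

import numpy as np

# toy design matrix: intercept column + one-hot encoding of a 3-category feature
categories = np.array([0, 1, 2, 0, 1, 2])
one_hot = np.eye(3)[categories]              # 6x3 one-hot matrix
X = np.hstack([np.ones((6, 1)), one_hot])    # 6x4 design matrix with intercept

# the 3 one-hot columns sum to the intercept column, so X loses one rank
print(np.linalg.matrix_rank(X))              # 3, not 4
print(np.linalg.det(X.T @ X))                # ~0: X'X is singular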

Some more observations:
The code

# run in the Chapter 2 notebook, where housing, full_pipeline, housing_prepared,
# housing_labels and LinearRegression are already defined/imported
some_data = housing.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

# 1) all prepared features, with an intercept
lin_reg1 = LinearRegression(fit_intercept=True)
lin_reg1.fit(housing_prepared, housing_labels)
print("Predictions 1:", lin_reg1.predict(some_data_prepared))

# 2) all prepared features, without an intercept
lin_reg2 = LinearRegression(fit_intercept=False)
lin_reg2.fit(housing_prepared, housing_labels)
print("Predictions 2:", lin_reg2.predict(some_data_prepared))

# 3) drop the last one-hot column, with an intercept
lin_reg3 = LinearRegression(fit_intercept=True)
lin_reg3.fit(housing_prepared[:, :-1], housing_labels)
print("Predictions 3:", lin_reg3.predict(some_data_prepared[:, :-1]))

produces the same output three times, which confirms that either the intercept or one of the one-hot columns is redundant.

For a dataset composed only of the categorical variable, the problem seems more critical; the code

# fit only on the 5 one-hot columns (the last 5 columns of housing_prepared)
lin_reg = LinearRegression(fit_intercept=True)
lin_reg.fit(housing_prepared[:, -5:], housing_labels)
lin_reg.coef_, lin_reg.intercept_

returns unrealistic coefficients (on the order of 10^16).

fit_intercept=False seems like a good workaround, but it only works if there is a single categorical feature. A better solution would be a one-hot encoding transformer that returns (k-1) columns for k possible categories. Do you know of a transformer that does this?

igor-vyr (Author) commented:

I solved this by creating a DummyCoder transformer (based on CategoricalEncoder) that creates, for each feature with k categories, (k-1) columns of ones and zeros:

from sklearn.base import BaseEstimator, TransformerMixin
# CategoricalEncoder comes from the scikit-learn dev version (or the Chapter 2 notebook).

class DummyCoder(BaseEstimator, TransformerMixin):
    """
    Encodes categorical features as numerical arrays of ones and zeros.
    A feature with N categories is encoded with (N-1) columns if one_dummy_less=True,
    and with N columns if one_dummy_less=False.
    Dependency: sklearn.preprocessing.CategoricalEncoder.
    """
    def __init__(self, handle_unknown='ignore', encoding='onehot-dense', one_dummy_less=True):
        self.handle_unknown = handle_unknown
        self.one_dummy_less = one_dummy_less
        self.encoding = encoding
        self.subencoder_for_fit = CategoricalEncoder(handle_unknown=self.handle_unknown,
                                                     encoding=self.encoding)
        self.subencoder_for_transform = CategoricalEncoder(handle_unknown=self.handle_unknown,
                                                           encoding=self.encoding)

    def fit(self, X, y=None):
        # fit the first CategoricalEncoder to discover the categories of each feature
        self.subencoder_for_fit.fit(X)
        n_features = len(self.subencoder_for_fit.categories_)
        # drop the first category of each feature (it becomes the baseline)
        self.categories_ = [self.subencoder_for_fit.categories_[k][1:]
                            for k in range(n_features)]
        # fit the second CategoricalEncoder with the reduced category lists
        self.subencoder_for_transform.categories = self.categories_
        self.subencoder_for_transform.fit(X)
        return self

    def transform(self, X):
        if self.one_dummy_less:   # use the second CategoricalEncoder: (N-1) columns
            return self.subencoder_for_transform.transform(X)
        else:                     # use the first CategoricalEncoder: N columns
            return self.subencoder_for_fit.transform(X)

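A minimal usage sketch (assuming the Chapter 2 housing DataFrame and the notebook's CategoricalEncoder are available):

# ocean_proximity has 5 categories, so we expect 4 dummy columns
housing_cat = housing[["ocean_proximity"]]
dummy_coder = DummyCoder(one_dummy_less=True)
housing_cat_dummies = dummy_coder.fit_transform(housing_cat)
print(housing_cat_dummies.shape)  # (len(housing), 4)
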
Moreover, I found out that the results of linear regression may change if a sparse matrix is used instead of a dense array for fitting.

ageron (Owner) commented Feb 6, 2018

Hi @igor-vyr,
You are quite right, it's not great to have redundant (or highly correlated) input features when performing linear regression. The LinearRegression class actually performs an SVD; it does not directly try to compute the inverse of X.T.dot(X). The singular values of X are available in the singular_ instance variable, and the rank of X is available as rank_. When some features are highly correlated, some singular values will be very low. When some features are redundant, the rank will be lower than the number of features and some singular values will be almost zero. In both cases, the coefficients may be insanely large. I like your DummyCoder!
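
For example, one can inspect these attributes after fitting (a minimal sketch, assuming housing_prepared and housing_labels from the Chapter 2 notebook):

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

print("rank:", lin_reg.rank_)                               # lower than the number of features if some are redundant
print("smallest singular value:", lin_reg.singular_.min())  # close to zero when X is rank-deficient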

ageron closed this as completed Mar 15, 2018