
Linear regression and categorical variables: X'X singular. (Chapter 2) #167

Closed
igor-vyr opened this issue Jan 29, 2018 · 2 comments

igor-vyr commented Jan 29, 2018

Dear Aurélien,

I'm currently reading your book and it's great. However, I have one question regarding linear regression and categorical features: in Chapter 2 you use one-hot encoding, which means the categorical feature column (with 5 possible string values) is replaced by 5 columns (with ones and zeros as values).

However, these 5 columns are linearly dependent with the intercept column x_0 (all values equal to 1) that is added to the matrix X of preprocessed data in linear regression (if fit_intercept=True): the 5 one-hot columns sum to the intercept column. Therefore the matrix X'X should be singular (not invertible) and linear regression should fail, yet it does not. My question is: how is this possible?
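
To make the dependence concrete, here is a small NumPy sketch (toy data, not from the notebook) showing that adding an intercept column to a full set of one-hot columns makes X rank-deficient:

import numpy as np

# toy design matrix: intercept column + one-hot encoding of a 3-category feature
categories = np.array([0, 1, 2, 0, 1, 2])
one_hot = np.eye(3)[categories]              # 6x3 one-hot matrix
X = np.hstack([np.ones((6, 1)), one_hot])    # 6x4 design matrix with intercept

# the 3 one-hot columns sum to the intercept column, so X loses one rank
print(np.linalg.matrix_rank(X))              # 3, not 4
print(np.linalg.det(X.T @ X))                # ~0: X'X is singular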

Some more observations:
The code

# run in the Chapter 2 notebook, where housing, full_pipeline, housing_prepared,
# housing_labels and LinearRegression are already defined/imported
some_data = housing.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

# 1) all prepared features, with an intercept
lin_reg1 = LinearRegression(fit_intercept=True)
lin_reg1.fit(housing_prepared, housing_labels)
print("Predictions 1:", lin_reg1.predict(some_data_prepared))

# 2) all prepared features, without an intercept
lin_reg2 = LinearRegression(fit_intercept=False)
lin_reg2.fit(housing_prepared, housing_labels)
print("Predictions 2:", lin_reg2.predict(some_data_prepared))

# 3) drop the last one-hot column, with an intercept
lin_reg3 = LinearRegression(fit_intercept=True)
lin_reg3.fit(housing_prepared[:, :-1], housing_labels)
print("Predictions 3:", lin_reg3.predict(some_data_prepared[:, :-1]))

produces the same output three times, which confirms that either the intercept or one of the one-hot columns is redundant.

For a dataset composed only of the categorical variable, the problem seems more critical; the code

# fit only on the 5 one-hot columns (the last 5 columns of housing_prepared)
lin_reg = LinearRegression(fit_intercept=True)
lin_reg.fit(housing_prepared[:, -5:], housing_labels)
lin_reg.coef_, lin_reg.intercept_

returns unrealistic coefficients (on the order of 10^16).

fit_intercept=False seems like a good workaround, but it only works if there is a single categorical feature. A better solution would be a one-hot encoding transformer that returns (k-1) columns for k possible categories. Do you know of a transformer that does this?

igor-vyr (Author) commented:

I solved this by creating a DummyCoder transformer (based on CategoricalEncoder) that creates, for each feature with k categories, (k-1) columns of ones and zeros:

from sklearn.base import BaseEstimator, TransformerMixin
# CategoricalEncoder comes from the scikit-learn dev version (or the Chapter 2 notebook).

class DummyCoder(BaseEstimator, TransformerMixin):
    """
    Encodes categorical features as numerical arrays of ones and zeros.
    A feature with N categories is encoded with (N-1) columns if one_dummy_less=True,
    and with N columns if one_dummy_less=False.
    Dependency: sklearn.preprocessing.CategoricalEncoder.
    """
    def __init__(self, handle_unknown='ignore', encoding='onehot-dense', one_dummy_less=True):
        self.handle_unknown = handle_unknown
        self.one_dummy_less = one_dummy_less
        self.encoding = encoding
        self.subencoder_for_fit = CategoricalEncoder(handle_unknown=self.handle_unknown,
                                                     encoding=self.encoding)
        self.subencoder_for_transform = CategoricalEncoder(handle_unknown=self.handle_unknown,
                                                           encoding=self.encoding)

    def fit(self, X, y=None):
        # fit the first CategoricalEncoder to discover the categories of each feature
        self.subencoder_for_fit.fit(X)
        n_features = len(self.subencoder_for_fit.categories_)
        # drop the first category of each feature (it becomes the baseline)
        self.categories_ = [self.subencoder_for_fit.categories_[k][1:]
                            for k in range(n_features)]
        # fit the second CategoricalEncoder with the reduced category lists
        self.subencoder_for_transform.categories = self.categories_
        self.subencoder_for_transform.fit(X)
        return self

    def transform(self, X):
        if self.one_dummy_less:   # use the second CategoricalEncoder: (N-1) columns
            return self.subencoder_for_transform.transform(X)
        else:                     # use the first CategoricalEncoder: N columns
            return self.subencoder_for_fit.transform(X)

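A minimal usage sketch (assuming the Chapter 2 housing DataFrame and the notebook's CategoricalEncoder are available):

# ocean_proximity has 5 categories, so we expect 4 dummy columns
housing_cat = housing[["ocean_proximity"]]
dummy_coder = DummyCoder(one_dummy_less=True)
housing_cat_dummies = dummy_coder.fit_transform(housing_cat)
print(housing_cat_dummies.shape)  # (len(housing), 4)
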
Moreover, I found out that the results of linear regression may change if a sparse matrix is used instead of a dense array for fitting.

ageron (Owner) commented Feb 6, 2018

Hi @igor-vyr,
You are quite right, it's not great to have redundant (or highly correlated) input features when performing linear regression. The LinearRegression class actually performs an SVD; it does not directly try to compute the inverse of X.T.dot(X). The singular values of X are available in the singular_ instance variable, and the rank of X is available as rank_. When some features are highly correlated, some singular values will be very low. When some features are redundant, the rank will be lower than the number of features and some singular values will be almost zero. In both cases, the coefficients may be insanely large. I like your DummyCoder!
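
For example, one can inspect these attributes after fitting (a minimal sketch, assuming housing_prepared and housing_labels from the Chapter 2 notebook):

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

print("rank:", lin_reg.rank_)                               # lower than the number of features if some are redundant
print("smallest singular value:", lin_reg.singular_.min())  # close to zero when X is rank-deficient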

ageron closed this as completed Mar 15, 2018