# Linear Regression with OneHotEncoding done manually (without a Pipeline)

In this notebook we show how OneHotEncoding is done manually (without a Pipeline).

We recommend doing the OneHotEncoding in a Pipeline, as done in `4_linear_regression.ipynb`. However, the manual version shows the mechanism behind OneHotEncoding more clearly and is closer to the exercises.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Prepare data

In [2]:
# Load the train data
train_data = pd.read_csv('../data/houses_train.csv', index_col=0)

In [3]:
# Split data into features and labels.
X_data = train_data.drop(columns='price')
y_data = train_data['price']

In [4]:
# Split features and labels into train (X_train, y_train) and validation set (X_val, y_val).
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, stratify=X_data['object_type_name'], test_size=0.1)

## Define and train model

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

Here we manually construct a DataFrame in which the features `zipcode`, `object_type_name` and `municipality_name` are one-hot-encoded.

1. We define three `OneHotEncoder`s: one for `zipcode`, one for `object_type_name` and one for `municipality_name`.
2. We "train" (`fit`) and apply (`transform`) each of them on its feature. We can think of `fit` as fixing the mapping, e.g. which `zipcode` becomes the $i$th column in the output of the OneHotEncoder, and of `transform` as actually doing the OneHotEncoding.
3. We combine the remaining numerical features with all features from the OneHotEncoders.

We OneHotEncode the `zipcode` because, even though it is a number, it is a categorical feature: the `zipcode` `8000` is not "bigger than" (and not "double") the `zipcode` `4000` in any meaningful sense.
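
The `fit` / `transform` mechanism described above can be sketched on a few made-up zip codes. The `toy_*` names and values are assumptions for illustration only and are not taken from the house data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not taken from houses_train.csv
toy_train = pd.DataFrame({"zipcode": ["4000", "8000", "3000", "8000"]})
toy_new = pd.DataFrame({"zipcode": ["8000", "9999"]})  # "9999" was never seen in fit

toy_ohe = OneHotEncoder(handle_unknown="ignore")
toy_ohe.fit(toy_train[["zipcode"]])  # fixes the mapping: category -> output column

# The learned mapping: one output column per category, in sorted order
toy_categories = list(toy_ohe.categories_[0])
print(toy_categories)  # ['3000', '4000', '8000']

# transform applies the fixed mapping; the unknown "9999" becomes an all-zero row
toy_new_encoded = toy_ohe.transform(toy_new[["zipcode"]]).toarray()
print(toy_new_encoded)
```

With `handle_unknown='ignore'`, a category that did not occur during `fit` is not an error; it is simply encoded as a row of zeros.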

In [6]:
from scipy import sparse

# 1. Define the OneHotEncoders
ohe_zipcode = OneHotEncoder(handle_unknown='ignore')
ohe_object_type_name = OneHotEncoder(handle_unknown='ignore')
ohe_municipality_name = OneHotEncoder(handle_unknown='ignore')

# Transform the zipcode to a string, otherwise sklearn warns us later.
X_train['zipcode'] = X_train['zipcode'].astype("string")

# 2. Train and apply them on the individual feature.
# zipcode
X_train_ohe_zipcode = ohe_zipcode.fit_transform(X_train[['zipcode']])
X_train_zipcode = pd.DataFrame(data=X_train_ohe_zipcode.toarray(), index=X_train.index, columns=ohe_zipcode.categories_[0])
# object_type_name
X_train_ohe_object_type_name = ohe_object_type_name.fit_transform(X_train[['object_type_name']])
X_train_object_type_name = pd.DataFrame(data=X_train_ohe_object_type_name.toarray(), index=X_train.index, columns=ohe_object_type_name.categories_[0])
# municipality_name
X_train_ohe_municipality_name = ohe_municipality_name.fit_transform(X_train[['municipality_name']])
X_train_municipality_name = pd.DataFrame(data=X_train_ohe_municipality_name.toarray(), index=X_train.index, columns=ohe_municipality_name.categories_[0])

# 3. Combine the numerical features (e.g. `living_area`) together with the OneHotEncoder outputs
X_train_ohe = pd.concat([
    X_train.drop(columns=['zipcode', 'object_type_name', 'municipality_name']),  # numerical features
    X_train_zipcode,  # zipcode OneHot features
    X_train_object_type_name,  # object_type_name OneHot features
    X_train_municipality_name  # municipality_name OneHot features
], axis=1)

`X_train_ohe` now contains all features, with `zipcode`, `object_type_name` and `municipality_name` one-hot-encoded.

In [7]:
print(X_train_ohe.shape)

(18281, 4322)


Now we can train our `LinearRegression` model on the one-hot-encoded data:

In [8]:
model = LinearRegression()
_ = model.fit(X_train_ohe, y_train)

## Predict and evaluate prices for the validation set

The trained model is now applied to the validation set. Note that we have to reuse the `OneHotEncoder`s from above, because we have to apply the same mapping "learned" in `fit`, so that the columns are in the same order as during training.

Therefore, we do the following steps:

1. Prep validation data with OneHotEncoders from training.
2. Predict prices for prepared validation data.
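
Why the encoders have to be reused can be seen in a small sketch. The `demo_*` data and the column name `kind` are hypothetical; the second encoder is fitted on the validation data only to show what goes wrong:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the validation set contains only two of the three categories
demo_train = pd.DataFrame({"kind": ["b", "a", "c", "a"]})
demo_val = pd.DataFrame({"kind": ["c", "a"]})

enc_train = OneHotEncoder(handle_unknown="ignore").fit(demo_train[["kind"]])
enc_val_wrong = OneHotEncoder(handle_unknown="ignore").fit(demo_val[["kind"]])  # do NOT do this

# Reusing the training encoder keeps the training column layout (a, b, c)
reused = enc_train.transform(demo_val[["kind"]]).toarray()
# Re-fitting on validation data yields a different layout (a, c)
refit = enc_val_wrong.transform(demo_val[["kind"]]).toarray()

print(reused.shape)  # (2, 3)
print(refit.shape)   # (2, 2)
```

A model trained on three columns cannot be applied to the two-column result, and even with matching column counts the columns could mean different categories.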

In [9]:
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
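
To get a feeling for the scale of this metric, here is a small worked example on made-up numbers. The `*_demo` arrays are assumptions for illustration, and the function is repeated only so that the snippet runs on its own:

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Hypothetical prices and predictions
y_true_demo = np.array([100.0, 200.0, 400.0])
y_pred_demo = np.array([110.0, 180.0, 400.0])

# Relative errors are 10%, 10% and 0%, so the mean is about 6.67%
mape_demo = mean_absolute_percentage_error(y_true_demo, y_pred_demo)
print(mape_demo)

# A single wildly wrong prediction (40000 instead of 400) dominates the mean
y_pred_outlier = np.array([110.0, 180.0, 40000.0])
mape_outlier = mean_absolute_percentage_error(y_true_demo, y_pred_outlier)
print(mape_outlier)
```

A value of 100 therefore means that predictions are off by the full true price on average; values far above 100 indicate that at least some predictions are off by orders of magnitude.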

In [10]:
X_val['zipcode'] = X_val['zipcode'].astype("string")

# 1. Prep validation data
X_val_ohe_zipcode = ohe_zipcode.transform(X_val[['zipcode']])
X_val_zipcode = pd.DataFrame(data=X_val_ohe_zipcode.toarray(), index=X_val.index, columns=ohe_zipcode.categories_[0])
X_val_ohe_object_type_name = ohe_object_type_name.transform(X_val[['object_type_name']])
X_val_object_type_name = pd.DataFrame(data=X_val_ohe_object_type_name.toarray(), index=X_val.index, columns=ohe_object_type_name.categories_[0])
X_val_ohe_municipality_name = ohe_municipality_name.transform(X_val[['municipality_name']])
X_val_municipality_name = pd.DataFrame(data=X_val_ohe_municipality_name.toarray(), index=X_val.index, columns=ohe_municipality_name.categories_[0])

X_val_ohe = X_val.drop(columns=['zipcode', 'object_type_name', 'municipality_name'])
X_val_ohe = pd.concat([X_val_ohe, X_val_zipcode], axis=1)
X_val_ohe = pd.concat([X_val_ohe, X_val_object_type_name], axis=1)
X_val_ohe = pd.concat([X_val_ohe, X_val_municipality_name], axis=1)

# 2. Predict for prepared validation data
y_val_pred = model.predict(X_val_ohe)
print(mean_absolute_percentage_error(y_val, y_val_pred))

1686425.256605689


**Oh no, what happened? Our model has a huge error! Our model is awful!** All this work wasted...? Do we have a bug in the code?

Looking at the performance on the train set shows that it is **not a bug**. We are good on the train data, but bad on new data (validation data): we **overfitted**! We overfitted extremely!

In [11]:
y_train_pred = model.predict(X_train_ohe)
print("Train Set:", mean_absolute_percentage_error(y_train, y_train_pred))
y_val_pred = model.predict(X_val_ohe)
print("Val Set:", mean_absolute_percentage_error(y_val, y_val_pred))

print("------")

print("Number of features: ", len(X_train_ohe.columns))
print("Number of samples: ", len(X_train_ohe))

Train Set: 26.039938676254344
Val Set: 1686425.256605689
------
Number of features:  4322
Number of samples:  18281


We probably overfit because we have `4322 features` and only `18281 samples`. There is a rule of thumb that you should have at least 10 times more samples than features.
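
As a quick check of this rule of thumb, the numbers printed above can be plugged in directly (a minimal sketch; the factor 10 is the rule of thumb mentioned in the text, not a hard limit):

```python
n_samples, n_features = 18281, 4322  # shape of X_train_ohe printed above

ratio = n_samples / n_features
print(f"{ratio:.2f} samples per feature")  # 4.23 samples per feature

# Number of samples the rule of thumb would ask for with this many features
needed = 10 * n_features
print(needed)  # 43220
```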

If we look at the learned $\vec{\beta}$, we can see that the learned values are huge positive and negative values, a typical property when overfitting.

In [12]:
print(f"{np.min(model.coef_)=}")
print(f"{np.max(model.coef_)=}")

np.min(model.coef_)=-1072709201879814.9
np.max(model.coef_)=21447039238626.008


What can we do?

* Remove some Features
* Regularization, so the model does not learn those huge $\vec{\beta}$

## Remove some Features

Here we only encode the first two digits of the `zipcode` and the `object_type_name`, resulting in 104 new features rather than more than 4000. The `municipality_name` feature is simply dropped (not used).
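
The reduction to the first two digits is a plain integer division by 100. A small sketch on made-up four-digit zip codes (the values in `demo_zip` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical four-digit zip codes
demo_zip = pd.Series([8000, 8004, 8057, 4053, 3012], name="zipcode")

# Integer division by 100 keeps the first two digits; cast to string so it is treated as a category
demo_zip_2 = (demo_zip // 100).astype("string")

print(demo_zip_2.tolist())  # ['80', '80', '80', '40', '30']
print(demo_zip.nunique(), "->", demo_zip_2.nunique())  # 5 -> 3
```

Nearby zip codes now share one category, so the encoder produces far fewer columns and each column is backed by more samples.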

In [13]:
from scipy import sparse

# 1. Define the OneHotEncoders
ohe_zipcode_2 = OneHotEncoder(handle_unknown='ignore')
ohe_object_type_name = OneHotEncoder(handle_unknown='ignore')

# Keep only the first two digits of the zipcode (as a string, otherwise sklearn warns us later).
X_train['zipcode_2'] = (X_train['zipcode'].astype("int") // 100).astype("string")

# 2. Train and apply them on the individual feature.
# zipcode
X_train_ohe_zipcode_2 = ohe_zipcode_2.fit_transform(X_train[['zipcode_2']])
X_train_zipcode_2 = pd.DataFrame(data=X_train_ohe_zipcode_2.toarray(), index=X_train.index, columns=ohe_zipcode_2.categories_[0])
# object_type_name
X_train_ohe_object_type_name = ohe_object_type_name.fit_transform(X_train[['object_type_name']])
X_train_object_type_name = pd.DataFrame(data=X_train_ohe_object_type_name.toarray(), index=X_train.index, columns=ohe_object_type_name.categories_[0])

# 3. Combine the numerical features (e.g. `living_area`) together with the OneHotEncoder outputs
X_train_ohe = pd.concat([
    X_train.drop(columns=['zipcode', 'zipcode_2', 'object_type_name', 'municipality_name']),  # numerical features
    X_train_zipcode_2,  # zipcode (first two digits only) OneHot features
    X_train_object_type_name  # object_type_name OneHot features
], axis=1)

model = LinearRegression()
_ = model.fit(X_train_ohe, y_train)

In [14]:
X_val['zipcode_2'] = (X_val['zipcode'].astype("int") // 100).astype("string")

# 1. Prep validation data
X_val_ohe_zipcode_2 = ohe_zipcode_2.transform(X_val[['zipcode_2']])
X_val_zipcode_2 = pd.DataFrame(data=X_val_ohe_zipcode_2.toarray(), index=X_val.index, columns=ohe_zipcode_2.categories_[0])
X_val_ohe_object_type_name = ohe_object_type_name.transform(X_val[['object_type_name']])
X_val_object_type_name = pd.DataFrame(data=X_val_ohe_object_type_name.toarray(), index=X_val.index, columns=ohe_object_type_name.categories_[0])

X_val_ohe = pd.concat([
    X_val.drop(columns=['zipcode', 'zipcode_2', 'object_type_name', 'municipality_name']),
    X_val_zipcode_2,
    X_val_object_type_name
], axis=1)

# 2. Predict for prepared validation data
y_val_pred = model.predict(X_val_ohe)
print(mean_absolute_percentage_error(y_val, y_val_pred))

33.13233061981303


Now the value is reasonable. The model no longer seems to overfit, or at least not as obviously.

In [15]:
# Clean up, remove features added in above cell
X_train = X_train.drop(columns=['zipcode_2'])
X_val = X_val.drop(columns=['zipcode_2'])

## Regularization

Another approach is to use regularization. The idea is to keep the features, but restrict the flexibility of the model: if a $\beta$ becomes large, we penalize the model in the cost function (see slides). This "motivates" the model to keep the $\beta$ values small, which helps against the danger of overfitting. Note that the cells below reuse `X_train_ohe` and `X_val_ohe` as defined in the previous section, i.e. with the reduced feature set.
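
The shrinking effect can be sketched on synthetic data. This is only an illustration under stated assumptions: the data is random, `alpha=10.0` is an arbitrary choice, and the objective referred to is the one documented for sklearn's `Ridge`, $\|y - X\vec{\beta}\|_2^2 + \alpha \|\vec{\beta}\|_2^2$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 30, 25  # deliberately few samples per feature
X_demo = rng.normal(size=(n_samples, n_features))
# Only the first feature matters; the rest is noise the model can overfit to
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=n_samples)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)

# Compare the overall size of the learned coefficient vectors
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"{ols_norm=:.3f}  {ridge_norm=:.3f}")
```

The penalty term makes large coefficients expensive, so the `Ridge` coefficient vector comes out smaller than the unregularized one on the same data.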

In [16]:
from sklearn.linear_model import Ridge
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

numerical_features = list(X_train.drop(columns=['zipcode', 'object_type_name', 'municipality_name']).columns)

model_regularized = Pipeline([
    ('std', make_column_transformer((StandardScaler(), numerical_features), remainder='passthrough', sparse_threshold=0.0)),
    ('reg', TransformedTargetRegressor(
        regressor=Ridge(),  # regressor=LinearRegression(), # Try LinearRegression to see how big betas are even with StandardScaler on y
        transformer=StandardScaler()
    ))
])
_ = model_regularized.fit(X_train_ohe, y_train)

In [17]:
y_train_pred = model_regularized.predict(X_train_ohe)
print("Train Set:", mean_absolute_percentage_error(y_train, y_train_pred))
y_val_pred = model_regularized.predict(X_val_ohe)
print("Val Set:", mean_absolute_percentage_error(y_val, y_val_pred))

Train Set: 32.80037545125168
Val Set: 33.0673564906153


In [18]:
print("Model Regularized:", np.min(model_regularized['reg'].regressor_.coef_), np.max(model_regularized['reg'].regressor_.coef_))

Model Regularized: -0.9803635305174393 2.409262474770106


### Hyperparameter Selection for `alpha` with GridSearchCV

`Ridge` has the hyperparameter `alpha`, which represents the regularization strength (`alpha` is called `lambda` on the slides).

Rather than taking the default value of `alpha=1.0`, we can try out different values and pick the one that performs best.

This can be done with a grid search (`GridSearchCV`) or a randomized search (`RandomizedSearchCV`).

Here we use `GridSearchCV`, which does a grid search and uses k-fold cross-validation to measure how good a parameter value is.
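
What `GridSearchCV` computes can be inspected on a small synthetic problem. This is a sketch under assumptions: the data comes from `make_regression`, the candidate values in `alphas` are arbitrary, and since no `scoring` is passed the regressor's default score ($R^2$) is used:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic regression problem
X_demo, y_demo = make_regression(n_samples=60, n_features=5, noise=10.0, random_state=0)

alphas = [0.01, 0.1, 1, 10]
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=3)
search.fit(X_demo, y_demo)

# One mean validation score per candidate alpha, averaged over the 3 folds
mean_scores = search.cv_results_["mean_test_score"]
for alpha, score in zip(alphas, mean_scores):
    print(alpha, round(score, 4))

# The candidate with the highest mean score wins and is refit on all the data
print(search.best_params_)
```

Every candidate is trained and scored `cv` times, so the cost grows with the number of candidates times the number of folds.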

In [19]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'reg__regressor__alpha': [0.01, 0.1, 1, 5, 10, 20, 50]
}

rs = GridSearchCV(model_regularized, param_grid, cv=3)
_ = rs.fit(X_train_ohe, y_train)
print(rs.best_params_)

{'reg__regressor__alpha': 0.1}


In [20]:
y_train_pred = rs.predict(X_train_ohe)
print("Train Set:", mean_absolute_percentage_error(y_train, y_train_pred))
y_val_pred = rs.predict(X_val_ohe)
print("Val Set:", mean_absolute_percentage_error(y_val, y_val_pred))

Train Set: 32.87849947433423
Val Set: 33.12139122338599


## (Extra) Sparse vs. Dense

An interesting special behavior of sklearn is that `LinearRegression` uses a different optimization algorithm on sparse data than on dense data.

So if we transform `X_train_ohe` to be sparse, the model does not overfit (as strongly).

In [21]:
# Eval model trained on dense data
model_dense = LinearRegression()
_ = model_dense.fit(X_train_ohe, y_train)
y_train_pred = model_dense.predict(X_train_ohe)
print("Train Set on model_dense:", mean_absolute_percentage_error(y_train, y_train_pred))
y_val_pred = model_dense.predict(X_val_ohe)
print("Val Set on model_dense:", mean_absolute_percentage_error(y_val, y_val_pred))

print("-------")

# Eval model trained on sparse data
X_train_ohe_sparse = sparse.csr_matrix(X_train_ohe.to_numpy())  # Make data sparse
model_sparse = LinearRegression()
_ = model_sparse.fit(X_train_ohe_sparse, y_train)
y_train_pred = model_sparse.predict(X_train_ohe.to_numpy())
print("Train Set on model_sparse:", mean_absolute_percentage_error(y_train, y_train_pred))
y_val_pred = model_sparse.predict(X_val_ohe.to_numpy())
print("Val Set on model_sparse:", mean_absolute_percentage_error(y_val, y_val_pred))

Train Set on model_dense: 32.89321126280548
Val Set on model_dense: 33.13233061981303
-------
Train Set on model_sparse: 32.90761536725446
Val Set on model_sparse: 33.14018109827843


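Note that on the full one-hot-encoded feature set from the beginning of the notebook, the `LinearRegression` on dense data clearly overfits, whereas on sparse data it does not overfit as clearly.
In the output above, `X_train_ohe` is still the reduced feature set from the section "Remove some Features", so both variants perform almost identically.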

Even though `X_train_ohe_sparse` and `X_train_ohe` contain **the same** data (in a different representation) and the problem is convex (single best solution!), different models are learned because of the different optimization algorithms.

This is very `sklearn` specific and there is no good theoretical reason why this is so. However, it is still interesting to point out.
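
That the sparse and the dense variant really hold the same values can be checked on a tiny one-hot matrix (a sketch; `dense_demo` is a made-up array, not the house data):

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot matrix: 4 samples, 3 categories
dense_demo = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
])
sparse_demo = sparse.csr_matrix(dense_demo)

# The sparse format only stores the non-zero entries
print(sparse_demo.nnz, "stored values instead of", dense_demo.size)  # 4 stored values instead of 12

# Converting back yields exactly the same matrix
same = np.array_equal(sparse_demo.toarray(), dense_demo)
print(same)  # True
```

Only the storage format differs, which is why any difference between the two fitted models has to come from the solver and not from the data.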