# Linear Regression: Outliers

This notebook is more advanced. It shows an example of how to implement preprocessing steps inside a sklearn pipeline.

First we develop a custom transformer (a subclass of sklearn's `TransformerMixin`) called "ValueClipper" to clip illegal 'build_year' values inside our pipeline.

Second we develop a custom transformer called "RoomCleaner". The RoomCleaner solves the 'num_rooms' problem discussed in data-analysis.

Third we apply the "ValueClipper" to suspicious 'living_area' values.

In the end we evaluate all three steps with cross-validation.

Note that we do not remove the outliers; we clip them (fix the suspicious values). Other strategies, such as removing the outliers, are also possible.
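
The difference between the two strategies can be sketched on a handful of made-up values. The `toy` frame below is purely illustrative and not part of the housing data; the thresholds 1900 and 2020 are the ones used later for `build_year`. Clipping keeps every row, while removing shrinks the data set.

```python
import pandas as pd

# Made-up build years, two of them outside the plausible range.
toy = pd.DataFrame({"build_year": [1850, 1950, 2010, 2200]})

# Strategy 1: clip suspicious values to the nearest threshold (all rows are kept).
clipped = toy.assign(build_year=toy["build_year"].clip(lower=1900, upper=2020))

# Strategy 2: remove the rows whose value falls outside the thresholds.
removed = toy[toy["build_year"].between(1900, 2020)]

print(clipped["build_year"].tolist())  # [1900, 1950, 2010, 2020]
print(len(clipped), len(removed))      # 4 2
```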

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Prepare data

In [2]:
# Load the train data
train_data = pd.read_csv('../data/houses_train.csv', index_col=0)

In [3]:
# Split data into features and labels.
X_data = train_data.drop(columns='price')
y_data = train_data['price']

In [4]:
# Split features and labels into train (X_train, y_train) and validation set (X_val, y_val).
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, stratify=X_data['object_type_name'], test_size=0.1)

# ValueClipper

We implement a ValueClipper that clips values below a minimum threshold or above a maximum threshold to the respective threshold.
The thresholds are chosen by us; they are hyperparameters.

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin
class ValueClipper(BaseEstimator, TransformerMixin):
    """Clip the values of `column` to the range [min_threshold, max_threshold]."""
    def __init__(self, column, min_threshold, max_threshold):
        self.column = column
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X.loc[X[self.column] < self.min_threshold, self.column] = self.min_threshold
        X.loc[X[self.column] > self.max_threshold, self.column] = self.max_threshold
        return X
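
Because the thresholds are plain constructor arguments on a `BaseEstimator` subclass, sklearn exposes them like any other hyperparameter. The sketch below uses a hypothetical `ToyClipper` that mirrors the class above (redefined here only so the snippet runs on its own, and written with pandas' `Series.clip` as an assumed equivalent of the two `.loc` assignments). It shows how a threshold can be read and changed through the pipeline's `get_params` and `set_params`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ToyClipper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the ValueClipper, for illustration only."""

    def __init__(self, column, min_threshold, max_threshold):
        self.column = column
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X[self.column] = X[self.column].clip(self.min_threshold, self.max_threshold)
        return X


pipe = Pipeline([
    ("vc", ToyClipper("build_year", 1900, 2020)),
    ("reg", LinearRegression()),
])

# Nested parameters are addressed as <step name>__<parameter name>.
params = pipe.get_params()
print(params["vc__max_threshold"])

# The same naming scheme lets us change a threshold without rebuilding the pipeline.
pipe.set_params(vc__max_threshold=2025)
print(pipe.named_steps["vc"].max_threshold)
```

The same `vc__max_threshold` style of name is what a parameter search over the thresholds would use, should one want to tune them instead of handpicking them.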

We apply the ValueClipper to the `build_year` column with handpicked thresholds (1900 and 2020).

In [7]:
build_year_model = Pipeline([
    ('vc', ValueClipper('build_year', 1900, 2020)),  # We add the ValueClipper here
    ('ohe', make_column_transformer((OneHotEncoder(handle_unknown='ignore'), ['zipcode', 'municipality_name', 'object_type_name']), remainder='passthrough')),
    ('reg', LinearRegression())
])

We check how many data points are affected by the ValueClipper on `build_year`, to better understand its impact.

In [8]:
# How many are affected by ValueClipper
build_year_data_len = len(train_data[(train_data['build_year'] < 1900) | (train_data['build_year'] > 2020)])
print(f'{round(build_year_data_len / len(train_data) * 100, 3)}% of our data points have such a build year.')

4.391% of our data points have such a build year.


In [9]:
# Train (fit) the model with the train data.
_ = build_year_model.fit(X_train, y_train)

In [10]:
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
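
A small worked example may help to read this metric. The numbers below are made up: both predictions are off by 10% of the true value, so the mean absolute percentage error is 10. Note that the formula divides by `y_true`, so it assumes that no true value is zero.

```python
import numpy as np


def mean_absolute_percentage_error(y_true, y_pred):
    # Same definition as in the notebook, repeated so the snippet runs on its own.
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


y_true_toy = np.array([100.0, 200.0])  # made-up true prices
y_pred_toy = np.array([110.0, 180.0])  # made-up predictions, each 10% off

mape_toy = mean_absolute_percentage_error(y_true_toy, y_pred_toy)
print(round(mape_toy, 2))  # 10.0
```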

In [11]:
# Predict on the validation data with the model.
y_val_pred = build_year_model.predict(X_val)

In [12]:
# How good are we on the validation data?
print(mean_absolute_percentage_error(y_val, y_val_pred))

29.0121756565058


# RoomCleaner

We implement a room cleaner that specifically cleans the `num_rooms` column.

If the average room size (`living_area / num_rooms`) is smaller than 5 m², we assume that `num_rooms` is too large by a factor of 10 due to a typo (for example, a missing decimal point). We correct this by dividing `num_rooms` by 10.
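
The rule can be sketched on two made-up rows. The `toy` frame and its values are assumptions for illustration only: the first row has a plausible average room size, the second one (35 rooms on 90 m²) falls below 5 m² per room and is corrected.

```python
import pandas as pd

# Made-up listings; num_rooms is stored as float so that 3.5 rooms is representable.
toy = pd.DataFrame({
    "living_area": [100.0, 90.0],
    "num_rooms": [4.5, 35.0],
})

# Average room size per listing: roughly 22.2 m² and 2.6 m².
too_small = toy["living_area"] / toy["num_rooms"] < 5

# Only the suspicious rows are divided by 10; all other rows stay untouched.
toy.loc[too_small, "num_rooms"] /= 10

print(toy["num_rooms"].tolist())  # [4.5, 3.5]
```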

In [13]:
class RoomCleaner(BaseEstimator, TransformerMixin):
    """Divide `num_rooms` by 10 where the average room size is below 5 m²."""
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X.copy()
        X.loc[X.living_area / X.num_rooms < 5, 'num_rooms'] /= 10
        return X

We add the RoomCleaner to the pipeline; it corrects the `num_rooms` column.

In [14]:
room_cleaner_model = Pipeline([
    ('vc', ValueClipper('build_year', 1900, 2020)),
    ('rc', RoomCleaner()),
    ('ohe', make_column_transformer((OneHotEncoder(handle_unknown='ignore'), ['zipcode', 'municipality_name', 'object_type_name']), remainder='passthrough')),
    ('reg', LinearRegression())
])

We check how many data points are affected by the RoomCleaner, to better understand its impact.

In [15]:
# How many are affected by RoomCleaner
room_data_len = len(train_data[(train_data.living_area / train_data.num_rooms) < 5])
print(f'{round(room_data_len / len(train_data) * 100, 3)}% of our data points have such an average room size.')

0.044% of our data points have such an average room size.


In [16]:
# Train (fit) the model with the train data.
_ = room_cleaner_model.fit(X_train, y_train)

In [17]:
# Predict on the validation data with the model.
y_val_pred = room_cleaner_model.predict(X_val)

In [18]:
# How good are we on the validation data?
print(mean_absolute_percentage_error(y_val, y_val_pred))

29.183103395638334


## Get a better estimate with cross-validation

Were the results of the ValueClipper and RoomCleaner good?

Let's get a better estimate with cross-validation.

We check the effect of handling each kind of outlier individually by evaluating the following models:

1. LinearRegression (Baseline for comparison)
2. LinearRegression with ValueClipper on `build_year`
3. LinearRegression with RoomCleaner
4. LinearRegression with ValueClipper on `living_area`
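
Each of these models is scored in the same way: `cross_val_predict` returns one out-of-fold prediction per sample, and the error is then computed over all samples at once. A minimal sketch of that pattern on synthetic data follows; `X_toy`, `y_toy` and the numbers used to generate them are made up and unrelated to the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic "living area" feature and a noisy, strictly positive "price".
X_toy = rng.uniform(50, 200, size=(100, 1))
y_toy = 3000 * X_toy[:, 0] + rng.normal(0, 10000, size=100)

# With cv=5 every sample is predicted by a model that did not see it during training.
y_oof = cross_val_predict(LinearRegression(), X_toy, y_toy, cv=5)

# Mean absolute percentage error over all out-of-fold predictions.
mape = np.mean(np.abs((y_toy - y_oof) / y_toy)) * 100
print(y_oof.shape, mape)
```

Since every sample ends up in a validation fold exactly once, this estimate uses all of the training data and is less dependent on one particular train/validation split.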

In [19]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import make_scorer

X = train_data.drop(columns='price')
y = train_data['price']

# 1.
lr_model = Pipeline([
    ('ohe', make_column_transformer((OneHotEncoder(handle_unknown='ignore'), ['zipcode', 'municipality_name', 'object_type_name']), remainder='passthrough')),
    ('reg', LinearRegression())
])

y_val_pred = cross_val_predict(lr_model, X, y, cv=5)
print("lr_model", mean_absolute_percentage_error(y, y_val_pred))

# 2.
build_year_model = Pipeline([
    ('vc', ValueClipper('build_year', 1900, 2020)),
    ('ohe', make_column_transformer((OneHotEncoder(handle_unknown='ignore'), ['zipcode', 'municipality_name', 'object_type_name']), remainder='passthrough')),
    ('reg', LinearRegression())
])

y_val_pred = cross_val_predict(build_year_model, X, y, cv=5)
print("build_year_model", mean_absolute_percentage_error(y, y_val_pred))

# 3.
room_cleaner_model = Pipeline([
    ('rc', RoomCleaner()),
    ('ohe', make_column_transformer((OneHotEncoder(handle_unknown='ignore'), ['zipcode', 'municipality_name', 'object_type_name']), remainder='passthrough')),
    ('reg', LinearRegression())
])

y_val_pred = cross_val_predict(room_cleaner_model, X, y, cv=5)
print("room_cleaner_model", mean_absolute_percentage_error(y, y_val_pred))

# 4.
living_area_cleaner_model = Pipeline([
    ('out', ValueClipper('living_area', 0, 400)),
    ('ohe', make_column_transformer((OneHotEncoder(handle_unknown='ignore'), ['zipcode', 'municipality_name', 'object_type_name']), remainder='passthrough')),
    ('reg', LinearRegression())
])

y_val_pred = cross_val_predict(living_area_cleaner_model, X, y, cv=5)
print("living_area_cleaner_model", mean_absolute_percentage_error(y, y_val_pred))

lr_model 31.69741252129228
build_year_model 30.14880717885878
room_cleaner_model 31.63280997879636
living_area_cleaner_model 28.563194269420617


Check the effect of all three steps together:

In [20]:
model = Pipeline([
    ('vc', ValueClipper('build_year', 1900, 2020)),
    ('rc', RoomCleaner()),
    ('out', ValueClipper('living_area', 0, 400)),
    ('ohe', make_column_transformer((OneHotEncoder(handle_unknown='ignore'), ['zipcode', 'municipality_name', 'object_type_name']), remainder='passthrough')),
    ('reg', LinearRegression())
])

y_val_pred = cross_val_predict(model, X, y, cv=5)
print("final model", mean_absolute_percentage_error(y, y_val_pred))

final model 27.678069872913134


In [21]:
_ = model.fit(X_train, y_train)

# Predict prices for test set

Note that `ValueClipper` and `RoomCleaner` do **not remove any samples**; therefore we still make a prediction for **all samples** in the test set. We did **not simplify the problem**, so this result is comparable to the results we get in the other notebooks.

In [22]:
# Load the test set
test_data = pd.read_csv('../data/houses_test.csv', index_col=0)

In [23]:
# Split data into features and labels.
X_test = test_data.drop(columns='price')
y_test = test_data['price']

In [24]:
y_test_pred = model.predict(X_test)

In [25]:
# How good are we on the test data?
print(mean_absolute_percentage_error(y_test, y_test_pred))

27.87488842180928
