# Robust Regression

One of the main reasons to move from ordinary linear regression to robust regression is that linear regression, in its usual least-squares form, is heavily affected by the presence of outliers (points that lie far away from the bulk of the data). Such points may well be junk, yet in ordinary linear regression they still pull on the slope and intercept of the fitted line. So it's time to move on!

For a better understanding of what is going on, see: [http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html](http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html)
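
The pull of a single outlier on a least-squares fit can be sketched with a few lines of NumPy. The data here is synthetic and purely illustrative (an assumption, not the housing data used in this notebook): ten points lying exactly on a line, one of which is then pushed far away from the rest.

```python
import numpy as np

# Synthetic data (an assumption for illustration): points exactly on y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Ordinary least-squares fit of a straight line; polyfit returns [slope, intercept].
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Corrupt a single observation so that it becomes a gross outlier.
y_outlier = y.copy()
y_outlier[-1] += 50.0

slope_dirty, intercept_dirty = np.polyfit(x, y_outlier, 1)

print(f"slope without the outlier: {slope_clean:.2f}")
print(f"slope with one outlier:    {slope_dirty:.2f}")
```

One bad point out of ten is enough to move the fitted slope well away from the true value of 2, which is the behaviour robust methods try to avoid.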

Preparing the data, same as before.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
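# the file is whitespace-separated; sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas
df = pd.read_csv('housing.data', sep=r'\s+', header=None)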

In [None]:
df.head()

| Code   | Description   |
|:---|:---|
|**CRIM** | per capita crime rate by town |
|**ZN**  | proportion of residential land zoned for lots over 25,000 sq.ft. | 
|**INDUS**  | proportion of non-retail business acres per town | 
|**CHAS**  | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) | 
|**NOX**  | nitric oxides concentration (parts per 10 million) | 
|**RM**  | average number of rooms per dwelling | 
|**AGE**  | proportion of owner-occupied units built prior to 1940 | 
|**DIS**  | weighted distances to five Boston employment centres | 
|**RAD**  | index of accessibility to radial highways | 
|**TAX**  | full-value property-tax rate per $10,000 | 
|**PTRATIO**  | pupil-teacher ratio by town | 
|**B**  | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | 
|**LSTAT**  | % lower status of the population | 
|**MEDV**  | Median value of owner-occupied homes in \$1000's | 

In [None]:
# create a list for feature labels 
col_name = ['CRIM', 'ZN' , 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

In [None]:
# rename the default numeric column labels to more meaningful names
df.columns = col_name

In [None]:
# take another look, now with proper column names
df.head()

***

## RANdom SAmple Consensus (RANSAC) Algorithm

The idea is simple: exclude the outliers from the regression calculation!
The algorithm fits the model using the inlier points only.

For further reading, refer to the scikit-learn documentation: [http://scikit-learn.org/stable/modules/linear_model.html#ransac-regression](http://scikit-learn.org/stable/modules/linear_model.html#ransac-regression)

Each iteration performs the following steps:

1. Select `min_samples` random samples from the original data and check whether the set of data is valid (see `is_data_valid`).

2. Fit a model to the random subset (`estimator.fit`, called `base_estimator.fit` in older scikit-learn releases) and check whether the estimated model is valid (see `is_model_valid`).

3. Classify all data as inliers or outliers by calculating the residuals to the estimated model (`estimator.predict(X) - y`). All data samples with absolute residuals smaller than the `residual_threshold` are considered inliers.

4. Save fitted model as best model if number of inlier samples is maximal. In case the current estimated model has the same number of inliers, it is only considered as the best model if it has better score.
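
The four steps above can be sketched in plain NumPy for the special case of a straight-line fit. This is a simplified illustration, not scikit-learn's implementation: the function name `ransac_line`, its parameters, and the synthetic data are assumptions made for this example, the validity checks are reduced to skipping degenerate subsets, and the tie-breaking by score in step 4 is left out.

```python
import numpy as np

def ransac_line(x, y, min_samples=2, residual_threshold=1.0, max_trials=100, seed=0):
    """Simplified RANSAC for a 1-D line fit (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float)
    best_mask, best_count = None, 0

    for _ in range(max_trials):
        # Step 1: draw a random minimal subset and check that it is usable.
        idx = rng.choice(len(x), size=min_samples, replace=False)
        if x[idx].max() == x[idx].min():
            continue  # all x values identical: no unique line through them

        # Step 2: fit a line to the subset only.
        slope, intercept = np.polyfit(x[idx], y[idx], 1)

        # Step 3: classify every sample by its absolute residual.
        residuals = np.abs(slope * x + intercept - y)
        mask = residuals < residual_threshold

        # Step 4: keep the candidate with the largest number of inliers.
        if mask.sum() > best_count:
            best_count, best_mask = int(mask.sum()), mask

    if best_mask is None:
        raise ValueError("no valid subset was found")

    # Final model: refit on all inliers of the best candidate.
    slope, intercept = np.polyfit(x[best_mask], y[best_mask], 1)
    return slope, intercept, best_mask

# Synthetic data (an assumption for illustration): a noise-free line with 5 gross outliers.
x_demo = np.linspace(0, 10, 50)
y_demo = 3.0 * x_demo - 2.0
y_demo[::10] += 40.0

slope, intercept, mask = ransac_line(x_demo, y_demo)
print(slope, intercept, int(mask.sum()))
```

Because every candidate line is fitted on only `min_samples` points, a subset made up entirely of inliers yields a line that the outliers never touched, and that is the candidate that collects the most inliers.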

In [None]:
# reshape the feature into a 2-D matrix, as scikit-learn expects
X = df['RM'].values.reshape(-1,1)
# our dependent variable (the values we want to model!)
y = df['MEDV'].values

In [None]:
# luckily, RANSAC is already implemented in scikit-learn!
from sklearn.linear_model import RANSACRegressor

In [None]:
# instantiate the model
ransac = RANSACRegressor()

In [None]:
ransac.fit(X, y)
# everything is left with default values

In [None]:
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

In [None]:
# in case np.arange is new to you: evenly spaced values from 3 up to (but not including) 10
np.arange(3, 10, 1)

In [None]:
line_X = np.arange(3, 10, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))

In [None]:
# set up the plot style
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(15,10));

# draw the inliers (good points) in blue
plt.scatter(X[inlier_mask], y[inlier_mask], 
            c='blue', marker='o', label='Inliers')
# draw the outliers (bad points) in brown
plt.scatter(X[outlier_mask], y[outlier_mask],
            c='brown', marker='s', label='Outliers')

plt.plot(line_X, line_y_ransac, color='red')

# visualize the content
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend()
# and fire!
plt.show()

In [None]:
ransac.estimator_.coef_

In [None]:
ransac.estimator_.intercept_

### Actually, we can tune the algorithm by playing with its hyperparameters...
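
When `residual_threshold` is left at its default of `None`, scikit-learn uses the median absolute deviation (MAD) of the target values `y` as the threshold. The sketch below, on synthetic data (an assumption for illustration, not the housing data), computes that MAD by hand and compares the inlier counts of a default fit and a fit with a much looser threshold. `random_state` is fixed only so that the random sampling is repeatable.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Synthetic data (an assumption for illustration): a noisy line plus gross outliers.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[::10] += 30.0                      # every 10th point becomes an outlier
X = x.reshape(-1, 1)

# The default threshold: median absolute deviation of the target values.
mad = np.median(np.abs(y - np.median(y)))

# Default threshold versus a deliberately loose one.
default_fit = RANSACRegressor(random_state=0).fit(X, y)
loose_fit = RANSACRegressor(residual_threshold=50, random_state=0).fit(X, y)

n_default = int(default_fit.inlier_mask_.sum())
n_loose = int(loose_fit.inlier_mask_.sum())

print(f"MAD of y: {mad:.2f}")
print(f"inliers with default threshold: {n_default}")
print(f"inliers with residual_threshold=50: {n_loose}")
```

A larger threshold lets more points count as inliers, so the final fit drifts back towards plain least squares; a smaller one is stricter but may throw away good data.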

In [None]:
# define the data again, reshaping the feature into a 2-D matrix
X = df['RM'].values.reshape(-1,1)
# our dependent variable (the values we want to model!)
y = df['MEDV'].values

In [None]:
# instantiate the model
ransac_m = RANSACRegressor(residual_threshold=10)

In [None]:
ransac_m.fit(X, y)
# modifying the residual_threshold

In [None]:
inlier_mask_m = ransac_m.inlier_mask_
outlier_mask_m = np.logical_not(inlier_mask_m)

In [None]:
line_y_ransac_m = ransac_m.predict(line_X.reshape(-1, 1))

In [None]:
# DRAW IT AGAIN!

# set up the plot style
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(15,10));

# draw the inliers (good points) in blue
plt.scatter(X[inlier_mask_m], y[inlier_mask_m], 
            c='blue', marker='o', label='Inliers')
# draw the outliers (bad points) in brown
plt.scatter(X[outlier_mask_m], y[outlier_mask_m],
            c='brown', marker='s', label='Outliers')

plt.plot(line_X, line_y_ransac_m, color='red')

# visualize the content
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend()
# and fire!
plt.show()

***

## Another Example

In [None]:
X = df['LSTAT'].values.reshape(-1,1)
y = df['MEDV'].values
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
line_X = np.arange(0, 40, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))

In [None]:
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,10));
plt.scatter(X[inlier_mask], y[inlier_mask], 
            c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
            c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('% lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper right')
plt.show()

***