# Robust Regression

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn import linear_model

## Outliers

In [None]:
# Let's use the same data from earlier this morning and add some error
n = 100
x = np.arange(1,n).reshape(-1,1)
y = np.array([(i**2)+(10*i)*(np.sin(i)+1) for i in x])
plt.scatter(x,y)

In [None]:
# Add some problem data points
noutlier = 20

Making the number of outliers a variable makes it easy to adjust and rerun the notebook to see how the number of outliers affects the ability to fit a model.
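
To make that kind of experiment repeatable, the outlier step could be wrapped in a small helper. The sketch below is only one way to do it: the function name `add_outliers`, the `seed` argument and the names `x_clean`, `y_clean` are illustrative and not part of the notebook. The ranges (0 to 100 for x, 0 to 8000 for y) are copied from the next cell.

```python
import numpy as np

def add_outliers(x, y, noutlier, seed=0):
    """Append `noutlier` uniformly random points to column vectors x and y.

    Hypothetical helper: the fixed seed makes reruns comparable.
    """
    rng = np.random.default_rng(seed)
    x_out = rng.uniform(0, 100, size=(noutlier, 1))
    y_out = rng.uniform(0, 8000, size=(noutlier, 1))
    return np.vstack([x, x_out]), np.vstack([y, y_out])

# Same noise-free data as in the cell above, written in vectorised form
n = 100
x_clean = np.arange(1, n).reshape(-1, 1)
y_clean = x_clean**2 + 10 * x_clean * (np.sin(x_clean) + 1)

# Try a few outlier counts and check how many points result
for noutlier in (0, 20, 50):
    x_all, y_all = add_outliers(x_clean, y_clean, noutlier)
    print(noutlier, x_all.shape, y_all.shape)
```

Because the clean data are never overwritten, each value of `noutlier` starts from the same 99 points.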

In [None]:
x = np.vstack([x,np.random.uniform(0,100,noutlier).reshape(noutlier,1)])
y = np.vstack([y,np.random.uniform(0, 8000,noutlier).reshape(noutlier,1)])

In [None]:
plt.scatter(x,y)

## Regression Model

Be sure to apply the transformation that we found earlier, then split the data into training and test sets.
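
As a rough sketch of this step, the snippet below assumes the transformation is a square root. That is an inference from the last cell of this notebook, which squares the predictions to get back to the original scale; it is not stated here. The `test_size` and `random_state` values, and the name `y_sqrt`, are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Noise-free data as defined at the top of the notebook (vectorised form)
n = 100
x = np.arange(1, n).reshape(-1, 1)
y = x**2 + 10 * x * (np.sin(x) + 1)

# Assumption: the transformation "found earlier" is a square root
y_sqrt = np.sqrt(y)

# Hold out part of the data for testing; 30% is an arbitrary choice
x_train, x_test, y_train, y_test = train_test_split(
    x, y_sqrt, test_size=0.3, random_state=0
)
print(x_train.shape, x_test.shape)
```

Note that `train_test_split` returns the pieces in the order train/test for each input array, which is why `x_test` comes before `y_train` in the unpacking.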

In [None]:
# Transform y variable
y = 

# Create train and test data sets: x_train, y_train, x_test, y_test


## 1) General regression model

In [None]:
model = 
model.fit()
print("R^2: ",model.score(x_test, y_test))
print("Slope: ", model.coef_)
print("Intercept: ", model.intercept_)

# Predict on test data
pred_test = 

# Residuals for test data
res_test = 

# Predict on all data (i.e. all x)
pred = 

# Residuals for all data
res = 
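
For reference, here is a minimal sketch of the fit, predict and residual pattern on made-up data. The names `ols`, `x_demo`, `y_demo`, `pred_demo` and `res_demo` are illustrative and chosen so they do not clash with the notebook's own variables; the straight-line data are invented for the example.

```python
import numpy as np
from sklearn import linear_model

# Made-up data: a straight line with a little Gaussian noise
rng = np.random.default_rng(0)
x_demo = np.arange(1, 51, dtype=float).reshape(-1, 1)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(0, 0.5, size=x_demo.shape)

# Ordinary least squares fit
ols = linear_model.LinearRegression()
ols.fit(x_demo, y_demo)

# y_demo is 2D with shape (50, 1), so the predictions are too
pred_demo = ols.predict(x_demo)

# Residual = actual - predicted
res_demo = y_demo - pred_demo

# Caution: mixing a (50, 1) array with a (50,) array broadcasts to (50, 50)
bad = y_demo - pred_demo.ravel()
print(res_demo.shape, bad.shape)
```

The broadcasting pitfall matters later in this notebook, where some models are fitted on `y_train.ravel()` and therefore return 1D predictions; calling `.ravel()` on both sides before subtracting avoids it.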

Create two plots side by side (1 row, 2 columns) to visualise the results for the test dataset. In the first, plot the actual data and the fitted line for the test data. In the second, plot the residuals for the test data. The functions ```sns.scatterplot()``` and ```sns.lineplot()``` may be useful.

You may need to reshape the data into the right format, e.g. ```x_test.reshape(-1)```

In [None]:
import seaborn as sns 
fig, ax = plt.subplots(1, 2)
### YOUR CODE HERE ###

Create another pair of plots (1 row, 2 columns), this time to visualise the results for the entire dataset. Use ```x``` and ```y``` instead of ```x_test``` and ```y_test```.

In [None]:
import seaborn as sns 
fig, ax = plt.subplots(1, 2)
### YOUR CODE HERE ###

We will produce the same plots in each of the next three sections. Instead of copying the code above three times, write a function that produces the visualisations you have just made. Ideally, every variable used inside the function should be passed in as an argument (but you can skip this in the interest of time).
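
As an illustration of the "pass everything in as arguments" version, here is a minimal sketch. The function name `plot_fit_and_residuals` and its signature are hypothetical, and it uses plain matplotlib calls rather than seaborn; `sns.scatterplot(..., ax=ax[0])` and `sns.lineplot(..., ax=ax[0])` could be substituted.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_fit_and_residuals(x, y, pred, title=None):
    """Plot the data with the fitted line (left) and the residuals (right)."""
    # Flatten everything to 1D so (n, 1) and (n,) inputs behave the same
    x = np.asarray(x).reshape(-1)
    y = np.asarray(y).reshape(-1)
    pred = np.asarray(pred).reshape(-1)
    res = y - pred

    # Sort by x so the fitted line is drawn from left to right
    order = np.argsort(x)

    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    ax[0].scatter(x, y, label="Actual")
    ax[0].plot(x[order], pred[order], color="red", label="Fit")
    ax[0].set_xlabel("x")
    ax[0].set_ylabel("y")
    ax[0].legend()

    ax[1].scatter(x, res)
    ax[1].axhline(0, color="grey", linestyle="--")
    ax[1].set_xlabel("x")
    ax[1].set_ylabel("Residual")

    if title is not None:
        fig.suptitle(title)
    return fig, ax
```

With this signature, a call such as `plot_fit_and_residuals(x_test, y_test, pred_test, 'OLS Regression')` depends only on what is passed in, not on names defined elsewhere in the notebook.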

In [None]:

def make_plots(title=None):
    """
    Plot the model fit and the residuals. Note: this function relies on
    variables defined outside it (global names), which is bad practice.
    """
    ### YOUR CODE HERE ###

In [None]:
make_plots('OLS Regression')

## 2) RANSAC

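RANSAC (RANdom SAmple Consensus): repeatedly fit the model to a small random subset of the points, count how many points lie close to that fit (the inliers), and keep the fit with the most inliers. The final model is then refit on those inliers only.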
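
The sampling idea can be sketched in a few lines of NumPy. This is a toy version for a straight line in one dimension, not scikit-learn's implementation; the function name `ransac_line` and the parameters `n_trials`, `threshold` and `seed` are illustrative, and the demo data are made up.

```python
import numpy as np

def ransac_line(x, y, n_trials=200, threshold=1.0, seed=0):
    """Toy RANSAC for a line y = m*x + c, with 1D arrays x and y."""
    rng = np.random.default_rng(seed)
    best_mask = None
    best_count = -1
    for _ in range(n_trials):
        # 1. Draw a minimal sample: two distinct points define a line
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue  # vertical line, skip this draw
        m = (y[j] - y[i]) / (x[j] - x[i])
        c = y[i] - m * x[i]
        # 2. Points within `threshold` of the line count as inliers
        mask = np.abs(y - (m * x + c)) < threshold
        # 3. Keep the largest consensus set seen so far
        if mask.sum() > best_count:
            best_count = mask.sum()
            best_mask = mask
    # 4. Refit by least squares using the inliers only
    m, c = np.polyfit(x[best_mask], y[best_mask], deg=1)
    return m, c, best_mask

# Made-up data: an exact line with five gross outliers
x_demo = np.arange(50, dtype=float)
y_demo = 3.0 * x_demo + 2.0
y_demo[::10] += 100.0

m, c, inliers = ransac_line(x_demo, y_demo)
print(m, c, inliers.sum())
```

After fitting the scikit-learn estimator, the boolean array `ransac.inlier_mask_` plays the same role as `inliers` here and can be used to colour inliers and outliers differently in a plot.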

In [None]:
ransac = linear_model.RANSACRegressor()
ransac.fit(x_train, y_train.ravel())
print("R^2: ",ransac.score(x_test, y_test))
pred_test = 
res_test = 
pred = 
res = 

In [None]:
make_plots('RANSAC Regression')

## 3) Theil-Sen


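Theil-Sen: compute the slope of the line through every possible pair of points and take the median of those slopes. Then compute the intercept implied by each point (its y value minus the median slope times its x value) and take the median of those intercepts.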
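
For a single predictor, the recipe translates almost directly into code. The sketch below is a toy version with made-up data; the function name `theil_sen_line` is illustrative. scikit-learn's `TheilSenRegressor` is more general (it also handles several predictors), so its estimates need not match this toy version exactly.

```python
import numpy as np
from itertools import combinations

def theil_sen_line(x, y):
    """Toy Theil-Sen estimator for y = m*x + c, with 1D arrays x and y."""
    # Slope of the line through every pair of points with distinct x
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    ]
    m = np.median(slopes)
    # Intercept implied by each point, given the median slope
    c = np.median(y - m * x)
    return m, c

# Made-up data: an exact line with three gross outliers
x_demo = np.arange(20, dtype=float)
y_demo = 0.5 * x_demo + 4.0
y_demo[[3, 11, 17]] += 50.0

m, c = theil_sen_line(x_demo, y_demo)
print(m, c)
```

Because most pairs involve two clean points, the median slope ignores the pairs that include an outlier. The number of pairs grows quadratically with the number of points, which is why this brute-force form only suits small datasets.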


In [None]:
theil_sen = linear_model.TheilSenRegressor(random_state=3)
theil_sen.fit(x_train, y_train.ravel())
print("R^2: ",theil_sen.score(x_test, y_test))
pred_test = 
res_test = 
pred = 
res = 

In [None]:
make_plots('Theil-Sen Regression')

## 4) Huber 

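Huber regression: fits the model by minimising the Huber loss, which is quadratic (squared loss) for small residuals and linear (absolute loss) for large ones, so outliers pull on the fit less than they do under ordinary least squares.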
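
To make the "mix" concrete, here is a minimal sketch of the standard Huber loss as a function of the residual. The function name `huber_loss` and the parameter name `delta` (the switch point between the two regimes) are illustrative. In scikit-learn, `HuberRegressor` controls the switch point through its `epsilon` parameter, which is applied to scaled residuals, so the numbers are not directly interchangeable.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond that."""
    r = np.abs(residual)
    quadratic = 0.5 * r**2
    # The linear branch is offset so the two pieces join smoothly at |r| = delta
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

res_demo = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(huber_loss(res_demo))        # Huber loss
print(0.5 * res_demo**2)           # squared loss, for comparison
```

For the residual of 3 the Huber loss is 2.5, compared with 4.5 under the squared loss, and the gap widens as residuals grow; this is what limits the influence of outliers.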

In [None]:
huber = linear_model.HuberRegressor()
huber.fit(x_train, y_train.ravel())
print("R^2: ",huber.score(x_test, y_test))
pred_test = 
res_test = 
pred = 
res = 

In [None]:
make_plots('Huber Regression')

## Comparison

In [None]:
print("R^2 OLS: ",model.score(x_test, y_test))
print("R^2 RANSAC: ",ransac.score(x_test, y_test))
print("R^2 Theil-Sen: ",theil_sen.score(x_test, y_test))
print("R^2 Huber: ",huber.score(x_test, y_test))

In [None]:
# Compare the four fitted lines on the transformed scale
plt.scatter(x,y,label='Transformed Data')
xseq = np.linspace(0,100,num=100).reshape(-1, 1)
plt.plot(xseq,model.predict(xseq),label='OLS')
plt.plot(xseq,ransac.predict(xseq),label='RANSAC')
plt.plot(xseq,theil_sen.predict(xseq),label='Theil-Sen')
plt.plot(xseq,huber.predict(xseq),label='Huber')
plt.legend()

In [None]:
# Square the data and predictions to undo the transformation and compare on the original scale
plt.scatter(x,y**2,label='Data')
xseq = np.linspace(0,100,num=100).reshape(-1, 1)
plt.plot(xseq,model.predict(xseq)**2,label='OLS')
plt.plot(xseq,ransac.predict(xseq)**2,label='RANSAC')
plt.plot(xseq,theil_sen.predict(xseq)**2,label='Theil-Sen')
plt.plot(xseq,huber.predict(xseq)**2,label='Huber')
plt.legend()