## Linear Regression

Regression deals with the modelling of **continuous values**. Linear regression creates a model that assumes a linear relationship between the inputs and outputs i.e. - the higher the inputs the higher (or lower) the outputs are.

What adjusts how *strong* the relationship is and the *direction* of this relationship is are our **coefficients**. Our first coefficient without an input is the **intercept**. Our goal is to calculate the optimal coefficients.

Univariate lin-reg comes in the form
$$
\hat{y} = \beta_0 + \beta_1 x + \epsilon
$$

Where:
* $\hat{y}$: Predicted output
* $\beta_0$ and $\beta_1$: Coefficients
* $x$: input
* $\epsilon$: error term

### Comparing predictions against reality

We use **residuals** to calculate model accuracy. **Residuals** are the differences between the actual values and our predicted values.

Now, we don't inspect all of the residuals as is - we typically use some type of **summary statistic** to evaluate the predictive power of our model.

To illustrate some of these methods, we'll load the boston housing dataset from sklearn:

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use(['ggplot'])

from sklearn.datasets import load_boston
boston = load_boston()
bos = pd.DataFrame(boston.data, columns=boston.feature_names)
bos['PRICE'] = boston.target

X = bos.drop('PRICE', axis=1)
y = bos['PRICE']

### R Squared
Also known as the *coefficient of determination* his is pretty much the default scoring method for regression techniques.

This metric indicates how well model predictions approximates the true values where 1 indicates a perfect fit and 0 would be a regressor that always predicts the mean of the data.

**What's good**: it has an intuitive scale that doesn't depend on the units of the target variable.

**What's bad**: it says nothing about hte prediction error of the model (which is quite important).

#### Working Example

In [28]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=100)

model = LinearRegression()
model.fit(X, y)
model.score(X_test, y_test)

0.7694140401906062

### Mean Absolute Error (MAE)

We calculate MAE by taking the absolute value of the residual for every data point, so that negative and positive residuals do not cancel out. We then take the average of all these residuals.

A small MAE suggests the model is great at prediction, while a large MAE suggests the model may have trouble in certain areas.

Effectively, MAE describes the *typical* magnitude of the residuals. Because we use the the absolute value of the residual, the MAE does not indicate **underperformance** or **overperformance** of the model. Each residual contributes proportionally to the total amount of error, meaning that larger errors will contribute linearly to the overall error.

While MAE is easily interpretable, using the absolute value of the residual is often not as desirable as **squaring** this difference. Depending on how you want your model to treat **outliers**, you may want to bring more attention to these.

Equation:
$$
MAE\; = \; \frac{1}{n} \sum \vert \; y - \hat{y} \; \vert
$$

Using sklearn:

In [31]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test, model.predict(X_test))

3.17978701672763

### Mean Squared Error (MSE)

MSE is just like the MAE, but it *squares* the difference before summing them all instead of using the absolute value.

**Consequences of the Square Term**
Because we are squaring the difference, MSE will almost always be bigger than MAE (meaning we can't directly compare the two). The effect of the square term is most apparent with the **presence of outliers in our data**. While each residual in MAE contributes **proportionally** to the total error, the error grows **quadratically** in MSE. This ultimately means that outliers in our data will contribute to much higher total error in the MSE than they would in the MAE. In turn, this means our model would be penalized more for making predictions that differ greatly from the corresponding actual value.

**The choice between MAE and MSE really comes down to how important we consider outliers**

Equation:
$$
MSE\; = \; \frac{1}{n} \sum ( \; y - \hat{y} \; )^2
$$

Using sklearn:

In [32]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, model.predict(X_test))

22.273296310266296

### Root Mean Squared Error (RMSE)

RMSE is just the square root of the MSE. Because the MSE is squared, its units do not match that of the original output. RMSE is used in order to convert the error metric back into similar units, making interpretation easier. *They're both similarly affected by outliers**

The RMSE is analogous to the standard deviation (MSE to variance) and is a measure of how large your residuals are spread out.

Both MAE and MSE can range from 0 to positive infinity, so as both of these measurements get higher, it becomes harder to interpret model performance. Another way we can summarize our collection of residuals is by using percentages so that each prediction is scaled against the value it's supposed to estimate.

Using sklearn:

In [33]:
np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

4.719459323933865