# Linear Regression
Rather than guessing or using brute force, we can use a library to find the optimal regression coefficients.

First, let's define our problem:

>### *Problem*
>
>Find $w_0, w_1$ such that $w_1 \cdot \text{OverallQual} + w_0$ gives the most accurate predictions (as measured by mean absolute error).


This is a problem of model fitting, not model evaluation, so we use the training data set.
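
To make the objective concrete, here is a minimal sketch of how the mean absolute error of one candidate pair $(w_0, w_1)$ could be computed. The quality scores, prices, and the helper name `mean_absolute_error_for_line` are made up for illustration and are not taken from the housing data.

```python
import numpy as np

# Toy data (hypothetical values, not from the housing data set)
overall_qual = np.array([4, 5, 6, 7, 8])
sale_price = np.array([100000, 130000, 160000, 210000, 270000])

def mean_absolute_error_for_line(w0, w1, x, y):
    """Return the mean absolute error of the predictions w1 * x + w0 against y."""
    predictions = w1 * x + w0
    return np.mean(np.abs(predictions - y))

# Compare two candidate lines: the lower value is the better fit
mae_a = mean_absolute_error_for_line(-60000, 40000, overall_qual, sale_price)
mae_b = mean_absolute_error_for_line(0, 25000, overall_qual, sale_price)
print(mae_a, mae_b)
```

Guessing or brute force would mean repeating this calculation for many candidate pairs and keeping the best one.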


## Scikit-Learn library

In the code below we use the Scikit-Learn library, which provides functions to create and train a model.

Use the [Scikit-Learn Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf) to help you understand the code below.

In [None]:
# Load modules
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

In [None]:
# Use same evaluate_model function as before
def evaluate_model(model_fn, print_result=False):
    '''
    Consumes a function model_fn and evaluates its predictive accuracy
    against the housing prices test set.
    If print_result is True, prints a human-readable summary;
    otherwise returns the mean absolute error as an unrounded float.
    '''
    test_data = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/test_data.csv")
    actual_values = test_data['SalePrice']
    # Pass in all columns except SalePrice
    test_input = test_data.filter(regex='^(?!SalePrice$).*')
    predicted_saleprice = model_fn(test_input)
    mae = np.mean(np.abs(predicted_saleprice-actual_values))
    if print_result:
        print("The model is inaccurate by $%.2f on average." % mae)
    else:
        return mae

In [None]:
# Load training set from Eliiza's GitHub (split and saved already)
training_set = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/training_data.csv")

## Linear Regression: OverallQual

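We will use the linear regression class `sklearn.linear_model.LinearRegression`. As in the previous notebook's workflow, we need to write a function (`quality_linear_model`) that wraps the model and returns predictions when given input data, so that we can use the pre-written `evaluate_model()` function. Note that `LinearRegression` fits its coefficients by ordinary least squares, that is, by minimising squared error, so the fitted line is not guaranteed to minimise the mean absolute error that we use for evaluation.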

In [None]:
# Fit a linear regression model on the training set, using 'OverallQual' as the feature
# and 'SalePrice' as the target, then wrap it in a prediction function
    
# Create feature column and train model
training_features = training_set[['OverallQual']]
predictor = linear_model.LinearRegression()
predictor.fit(training_features, training_set['SalePrice'])
    
# Define a function which returns predictions using this model when given input data
def quality_linear_model(input_data):
    return predictor.predict(input_data[['OverallQual']])
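
The coefficients $w_0$ and $w_1$ from the problem statement can be read off a fitted model through its `intercept_` and `coef_` attributes. The sketch below uses a small synthetic data frame (hypothetical values that lie exactly on a line) and the made-up names `toy_training_set` and `toy_predictor`, so that it runs without downloading the housing data; the same attributes should be available on `predictor` above.

```python
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for the training set (hypothetical values lying exactly on a line)
toy_training_set = pd.DataFrame({"OverallQual": [3, 4, 5, 6, 7, 8]})
toy_training_set["SalePrice"] = 40000 * toy_training_set["OverallQual"] - 50000

toy_predictor = linear_model.LinearRegression()
toy_predictor.fit(toy_training_set[["OverallQual"]], toy_training_set["SalePrice"])

# coef_ holds one slope per feature column (w_1); intercept_ is the constant term (w_0)
w1 = toy_predictor.coef_[0]
w0 = toy_predictor.intercept_
print("w1 = %.2f, w0 = %.2f" % (w1, w0))
```

Because the toy prices were generated from a line with slope 40000 and intercept -50000, the fit should recover those values up to floating-point error.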

In [None]:
# Evaluate model
# Pass print_result=True to print a readable message instead of returning the raw number
evaluate_model(quality_linear_model, print_result=True)

Let's examine the differences between the predicted and actual values.

In [None]:
# Create copy of training set data frame and add columns with the predictions from the linear model above
# and the difference between the predictions and actual prices
quality_example = training_set.copy()
quality_example['Predicted'] = quality_linear_model(quality_example)
quality_example['Error'] = quality_example['Predicted'] - quality_example['SalePrice']
quality_example['Error'].plot() 
plt.xlabel("House_id")
plt.ylabel("Error")

We can inspect the outliers by sorting the dataframe by `Error`.

In [None]:
# Sort by Error (ascending): the most negative errors come first, the most positive last.
# With pandas' default display settings, a long data frame is shown truncated to its first and last 5 rows.
quality_example.sort_values("Error")
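
If you would rather select the extreme rows explicitly than rely on how the notebook truncates its output, pandas provides `nsmallest` and `nlargest`. This is a minimal sketch on a toy data frame; `toy_errors` and its values are hypothetical stand-ins for `quality_example`.

```python
import pandas as pd

# Toy errors (hypothetical values) standing in for quality_example['Error']
toy_errors = pd.DataFrame({
    "Error": [1200.0, -45000.0, 300.0, 80000.0, -150.0, 52000.0, -9000.0],
})

# The two rows with the most negative errors, most negative first
most_negative = toy_errors.nsmallest(2, "Error")
# The two rows with the most positive errors, most positive first
most_positive = toy_errors.nlargest(2, "Error")
print(most_negative)
print(most_positive)
```

On the real data, the equivalent calls would be `quality_example.nsmallest(5, "Error")` and `quality_example.nlargest(5, "Error")`.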

**Question:** Is a positive error value related to over- or under-estimating the price of the house?

## Exercise
Why do you think the houses with the largest errors (outliers) are so cheap or expensive compared with our predictions? Remember that you can look at individual columns by using `quality_example[["column1", "column2", ...]]`.

**Hint:** Some useful attributes to examine might be the number of bedrooms, the number of bathrooms and the age of the house. But it might not be possible to explain every outlier!
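
As a sketch of the column-selection syntax, the snippet below sorts a toy data frame by `Error` and keeps only a few columns for the most extreme rows. The column names `Bedrooms` and `YearBuilt` and all values are placeholders chosen for illustration; check `quality_example.columns` for the names actually present in the housing data.

```python
import pandas as pd

# Toy data frame (hypothetical column names and values, for illustration only)
toy_example = pd.DataFrame({
    "Bedrooms": [3, 2, 5, 4],
    "YearBuilt": [1995, 1920, 2008, 1978],
    "SalePrice": [180000, 90000, 420000, 200000],
    "Error": [5000.0, 60000.0, -110000.0, -2000.0],
})

columns_of_interest = ["Bedrooms", "YearBuilt", "SalePrice", "Error"]
# Sort by Error, keep only the chosen columns, and take the two most negative errors
lowest_error_rows = toy_example.sort_values("Error")[columns_of_interest].head(2)
print(lowest_error_rows)
```

Replacing `.head(2)` with `.tail(2)` would show the rows at the other end of the sorted data frame.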

**Optional:** Try contrasting the outliers with more accurate predictions (`Error` values close to zero).

## Exercise

Build a linear regression model for predicting `SalePrice` as a function of `GrLivArea`, another variable that is highly correlated with `SalePrice`.

Plot the errors and examine the outliers as above for this new model.