# Brute Force

We can easily make new "Bedroom Heuristics" simply by changing the number of dollars we multiply the number of bedrooms by.

In [None]:
from utils import evaluate_model
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt


def generate_bedroom_heuristic(a):
    """Return a model that predicts price as `a` dollars per bedroom."""
    def heuristic(input_data):
        prediction = a * input_data['BedroomAbvGr']
        return prediction
    return heuristic


The question is now: what is the best dollar amount to multiply by?

Let's start to answer this by trial and error, using a for loop.


In [None]:
# Try multipliers of $0, $1,000, ..., $199,000 per bedroom and record each score.
model_scores = []
for i in range(200):
    score = evaluate_model(generate_bedroom_heuristic(i*1000))
    model_scores.append(score)

In [None]:
plt.plot(model_scores)
plt.xlabel("Dollars per bedroom (thousands)")
plt.ylabel("Mean absolute error")

In [None]:
# Find the minimum score.
models = pd.DataFrame()
models['Score'] = model_scores
models.loc[models.Score == models.Score.min()] 

One issue with this approach is choosing an appropriate level of granularity. In this example we stepped through increments of \$1,000. But how do we know the optimal value really is \$54,000 and not \$53,765.08? We could shrink the step size, but the number of models to evaluate grows in proportion: stepping by \$0.01 instead of \$1,000 means 100,000 times as many evaluations.
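
One way around this trade-off is a coarse-to-fine search: scan a coarse grid, then repeat the scan on a narrower range around the best value found. The sketch below is only an illustration under stated assumptions. It uses made-up data rather than the housing data set, and `mean_absolute_error` and `coarse_to_fine_search` are hypothetical helpers, not part of `utils`. It also relies on the error curve having a single valley, which holds for the mean absolute error of this one-parameter heuristic.

```python
import numpy as np


def mean_absolute_error(multiplier, bedrooms, prices):
    """MAE of the heuristic: price = multiplier * bedrooms."""
    return np.mean(np.abs(multiplier * bedrooms - prices))


def coarse_to_fine_search(bedrooms, prices, low=0.0, high=200_000.0,
                          points=21, rounds=6):
    """Grid-search [low, high], then zoom in around the best point and repeat."""
    best = low
    for _ in range(rounds):
        candidates = np.linspace(low, high, points)
        errors = [mean_absolute_error(c, bedrooms, prices) for c in candidates]
        best = float(candidates[int(np.argmin(errors))])
        step = float(candidates[1] - candidates[0])
        # The next round covers one grid step either side of the current best.
        low, high = max(best - step, 0.0), best + step
    return best


# Made-up data for illustration only (assumption: not the housing data set).
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 6, size=500)
prices = 53_765.08 * bedrooms + rng.normal(0, 5_000, size=500)

best_multiplier = coarse_to_fine_search(bedrooms, prices)
print(round(best_multiplier, 2))
```

With these settings, six rounds of 21 candidates (126 evaluations) shrink the grid step from \$10,000 to \$0.10.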

# Intermezzo: Optimisation

The science of doing the above effectively is called optimisation. There are some brilliant algorithms for optimisation; the most popular in machine learning is minibatch gradient descent. You can see an outline of this algorithm [here](https://cs.brown.edu/courses/csci1951-a/assignments/regression.html).

The idea behind the algorithm is to use calculus. As you may recall from high school maths, a smooth curve can only reach a minimum or maximum where the gradient $dy/dx = 0$. Gradient descent uses the gradient to decide which way is downhill and takes repeated small steps in that direction until the gradient is close to zero. This is illustrated in the following diagram.

<img src="https://cs.brown.edu/courses/csci1951-a/assignments/images/gd.png" width="800">
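
As a rough illustration of the idea, the sketch below runs plain (full-batch) gradient descent, not the minibatch variant, on a one-parameter model. It makes several assumptions that are not in the text above: the data are made up, the loss is mean squared error (chosen because its derivative is simple and smooth), and `gradient_descent`, `learning_rate` and `steps` are hypothetical names.

```python
import numpy as np


def gradient_descent(x, y, learning_rate=0.01, steps=200):
    """Minimise mean((a * x - y) ** 2) over a with plain gradient descent."""
    a = 0.0
    for _ in range(steps):
        gradient = 2.0 * np.mean(x * (a * x - y))  # d(loss)/da at the current a
        a -= learning_rate * gradient              # step downhill
    return a


# Made-up data for illustration only (assumption: not the housing data set).
rng = np.random.default_rng(0)
x = rng.integers(1, 6, size=500).astype(float)
y = 50_000.0 * x + rng.normal(0, 5_000, size=500)

a_gd = gradient_descent(x, y)
# Setting the derivative to zero and solving for a gives the exact minimiser.
a_exact = np.sum(x * y) / np.sum(x * x)
print(a_gd, a_exact)
```

`a_gd` should agree closely with `a_exact`, the value at which the derivative of the loss is exactly zero. A learning rate that is too large would make the steps overshoot and diverge, so the value used here is tuned to this toy data.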


# Linear Regression
Rather than brute-forcing the answer, we can let a library do the optimisation for us.

First, let's define our problem:

*Problem*

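Find $w_0, w_1$ such that $w_1 \cdot bedrooms + w_0$ gives the most accurate predictions. Note that scikit-learn's `LinearRegression` fits by ordinary least squares, so it chooses $w_0, w_1$ to minimise squared error; we still use mean absolute error to evaluate the fitted model.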


This is a problem of model fitting, not model evaluation, so we use the training data set.
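
For a single feature, the least-squares values of $w_0$ and $w_1$ can be written down directly, which is a useful check on what a library fit produces. The sketch below is a minimal illustration on made-up data; `fit_line_least_squares` is a hypothetical helper, not part of scikit-learn, and it minimises squared error (as ordinary least squares does), not mean absolute error.

```python
import numpy as np


def fit_line_least_squares(x, y):
    """Return (w_0, w_1) minimising the sum of squared errors of w_1 * x + w_0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x.
    w_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    w_0 = y_mean - w_1 * x_mean
    return w_0, w_1


# Made-up data lying exactly on price = 20,000 + 50,000 * bedrooms.
bedrooms = [1, 2, 3, 4, 5]
prices = [70_000, 120_000, 170_000, 220_000, 270_000]

w_0, w_1 = fit_line_least_squares(bedrooms, prices)
print(w_0, w_1)
```

Because the toy data lie exactly on a line, the fit recovers the intercept and slope used to generate them.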

In [None]:
def train_bedroom_linear_model(training_set):
    """Fit SalePrice against BedroomAbvGr and return the fitted model as a function."""
    predictor = linear_model.LinearRegression()
    predictor.fit(training_set[['BedroomAbvGr']], training_set['SalePrice'])

    def bedroom_linear_model(input_data):
        return predictor.predict(input_data[['BedroomAbvGr']])
    return bedroom_linear_model

In [None]:
training_set = pd.read_csv("housing_price_data/training_data.csv")
bedroom_linear_model = train_bedroom_linear_model(training_set)
evaluate_model(bedroom_linear_model)


Let's examine the differences between the predicted and actual values.

In [None]:
bedroom_example = training_set.copy()
bedroom_example['Predicted'] = bedroom_linear_model(bedroom_example)
bedroom_example['Error'] = bedroom_example['Predicted'] - bedroom_example['SalePrice']
bedroom_example['Error'].plot()
plt.xlabel("id")
plt.ylabel("Difference between predicted and actual value")

We can inspect the outliers by sorting the dataframe.

In [None]:
bedroom_example.sort_values("Error").head()

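Exercise: The rows above have the most negative errors, meaning these houses sold for far more than the model predicted. Why do you think the model prices them so cheaply? Remember you can look at individual columns by using
`bedroom_example[["column1","column2",...]]`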

Exercise: Build a linear regression model for predicting SalePrice as a function of LotArea.