This notebook has been edited from its original form for the *HTP Bootcamp* on Sept. 28, 2019. Please see the original notebook [here](https://github.com/fchollet/deep-learning-with-python-notebooks).

Before we begin, go to "Runtime" > "Change runtime type" and select "GPU" as your hardware accelerator.

In [0]:
import keras
keras.__version__

# Predicting house prices: a regression example

This notebook contains the code samples found in Chapter 3, Section 6 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

## The Boston Housing Price dataset


We will be attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the 
suburb at the time, such as the crime rate, the local property tax rate, etc.

This dataset has very few data points, only 506 in 
total, split between 404 training samples and 102 test samples, and each "feature" in the input data (e.g. the crime rate is a feature) has 
a different scale. For instance, some values are proportions, which take values between 0 and 1; others take values between 1 and 12, 
and others between 0 and 100.

Let's start with this data first to help you understand the structure of a neural network model. 

Let's take a look at the data:

In [0]:
from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) =  boston_housing.load_data()

In [0]:
train_data.shape

In [0]:
test_data.shape

In [0]:
# View the data
import pandas as pd
view_train_data = pd.DataFrame(train_data)
view_train_data.head(10)


As you can see, we have 404 training samples and 102 test samples. The data comprises 13 features. The 13 features in the input data are as 
follows:

1. Per capita crime rate.
2. Proportion of residential land zoned for lots over 25,000 square feet.
3. Proportion of non-retail business acres per town.
4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. Nitric oxides concentration (parts per 10 million).
6. Average number of rooms per dwelling.
7. Proportion of owner-occupied units built prior to 1940.
8. Weighted distances to five Boston employment centres.
9. Index of accessibility to radial highways.
10. Full-value property-tax rate per $10,000.
11. Pupil-teacher ratio by town.
12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.
13. % lower status of the population.

The targets are the median values of owner-occupied homes, in thousands of dollars:

In [0]:
train_targets


The prices are typically between \$10,000 and \$50,000. If that sounds cheap, remember this was the mid-1970s, and these prices are not 
inflation-adjusted.

## Preparing the data


It would be problematic to feed into a neural network values that all take wildly different ranges. The network might be able to 
automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal 
with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we 
will subtract the mean of the feature and divide by the standard deviation, so that the feature will be centered around 0 and will have a 
unit standard deviation. This is easily done in NumPy (we also did this with scikit-learn in our first bootcamp):

In [0]:
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

test_data -= mean
test_data /= std

Let's see how the training data looks now. 

In [0]:
view_train_data = pd.DataFrame(train_data)
view_train_data.head(10)


Note that the quantities that we use for normalizing the test data have been computed using the training data. We should never use in our 
workflow any quantity computed on the test data, even for something as simple as data normalization.
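
As a small illustration of that rule, here is a minimal sketch on made-up data. The arrays `toy_train` and `toy_test` are hypothetical stand-ins for `train_data` and `test_data`; the point is only that the mean and standard deviation come from the training split and are then reused on the test split.

```python
import numpy as np

# Toy stand-ins for train_data / test_data (values are made up for illustration).
toy_train = np.array([[0.1, 300.0],
                      [0.4, 500.0],
                      [0.7, 400.0]])
toy_test = np.array([[0.5, 450.0]])

# Statistics are computed on the training split only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std   # reuse the training mean and std

print(toy_train_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_train_scaled.std(axis=0))   # approximately 1 for every column
print(toy_test_scaled)                # not necessarily centered, and that is fine
```

The scaled test data will generally not have a mean of exactly 0 or a standard deviation of exactly 1, because it is measured against the training statistics.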

## Linear Regression Example
Linear Regression is usually the first supervised learning method data scientists learn. It will output a model that looks like: 

*price* = 

*w1* * crime +

*w2* * land_zone +

...


*w13* * lower_status + 

*b*

In linear regression, the goal is to find the weights/coefficients (the *w*'s) and the intercept/bias (*b*) that most closely map the inputs to the output, *price*. The "learning" part of machine learning refers to finding these values. 
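
To make "finding the best weights and bias" concrete, here is a minimal sketch using NumPy's least-squares solver on synthetic data. The data, `true_w`, and `true_b` are invented for this illustration and have nothing to do with the housing dataset; because the data is noiseless, the solver should recover the values used to generate it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, generated from known weights and bias.
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b   # noiseless targets, so an exact fit exists

# Append a column of ones so the bias is learned as one more coefficient.
X_with_ones = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution: the coefficients that minimize the sum of squared errors.
solution, *_ = np.linalg.lstsq(X_with_ones, y, rcond=None)
w, b = solution[:-1], solution[-1]

print(w, b)   # should match true_w and true_b up to rounding
```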

Using scikit-learn in Bootcamp 1, we saw how quickly and easily we can compute the "best" weights and bias value. 

In [0]:
# Simple Regression Model
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(train_data, train_targets)

In [0]:
# Print the coefficients
coeff_df = pd.DataFrame(lin_reg.coef_, columns=['Coefficient'])  
coeff_df

These values represent the best weights for a linear model. We could use them to predict the prices of houses we have not seen. 

In [0]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

test_predictions = lin_reg.predict(test_data)
lin_mse = mean_squared_error(test_targets, test_predictions)
lin_mae = mean_absolute_error(test_targets, test_predictions)

print(lin_mse, lin_mae)

The number on the right is the mean absolute error. Because the targets are in thousands of dollars, a value of about 3.46 means that the predictions are off, on average, by \$3,464. 

### Linear Regression as a Neural Network

Now, let's connect the idea of linear regression to a neural network. We will build a simple neural network, and you'll see that it is essentially the same as a linear regression model!

VIEW SLIDES

## Building our first network

Our workflow will be as follows: first we will present our neural network with the training data, `train_data` and `train_targets`. The 
network will then learn to associate the input features with the prices. Finally, we will ask the network to produce predictions for `test_data`, and we 
will check how close these predictions are to the prices in `test_targets`.

Let's build our network -- again, remember that you aren't supposed to understand everything about this example just yet.

The core building block of neural networks is the "layer", a data-processing module which you can think of as a "filter" for data. Some data comes in, and comes out in a more useful form. 

Let's begin by importing modules from Keras. 

In [0]:
from keras import models
from keras import layers

A neural network consists of layers that can be added sequentially. To start, we will use just one layer. The visual in the PowerPoint will help clarify what is happening. 

In [0]:
# Building blocks of a model
reg_model = models.Sequential()
reg_model.add(layers.Dense(1, input_shape=(train_data.shape[1],)))

The model has one layer. It takes as input data with 13 numbers, and connects those thirteen numbers to a layer with 1 "node" (i.e. `Dense(1)`).

The goal is to find the best thirteen numbers (*weights*) and one constant term (*bias*) to map the input data to the output data. This is exactly what happened in regression!
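
The computation a single `Dense(1)` layer performs can be sketched in plain NumPy. The names `W`, `b`, and `batch` below are hypothetical, and the values are random rather than learned; the sketch only shows the shapes involved and the fact that the output is a weighted sum plus a bias.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 13
batch = rng.normal(size=(5, n_features))   # 5 made-up samples

# Hypothetical parameters of a Dense(1) layer: one weight per feature plus one bias.
W = rng.normal(size=(n_features, 1))
b = np.zeros(1)

# With no activation, the layer computes a weighted sum plus the bias.
predictions = batch @ W + b

print(predictions.shape)   # (5, 1): one number per sample
print(W.size + b.size)     # 14 learnable numbers
```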

To make our network ready for training, we need to pick three more things, as part of the "compilation" step:

- A loss function: this is how the network measures how good a job it is doing on its training data, and thus how it steers itself in the right direction.
- An optimizer: this is the mechanism through which the network updates itself based on the data it sees and its loss function.
- Metrics to monitor during training and testing. Here we will only track the mean absolute error (`mae`), the average size of the prediction errors.

We will discuss the exact purpose of the loss function and the optimizer in later notebooks. 
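
For a rough feel of how a loss function and an optimizer work together, here is a minimal sketch of gradient descent on a mean squared error loss for a linear model. It is a simplification, not what Keras does internally: it uses the whole dataset at every step (Keras's `sgd` works on mini-batches), the data is synthetic, and the learning rate of 0.1 is an arbitrary choice for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs and targets generated from a known linear rule.
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 1.0

w = np.zeros(2)        # weights start at zero
b = 0.0                # bias starts at zero
learning_rate = 0.1    # arbitrary value chosen for this sketch

def mse_loss(weights, bias):
    """Mean squared error of the linear model on the synthetic data."""
    return np.mean((X @ weights + bias - y) ** 2)

initial_loss = mse_loss(w, b)

for step in range(200):
    error = X @ w + b - y                # prediction minus target
    grad_w = 2 * X.T @ error / len(y)    # gradient of the MSE w.r.t. the weights
    grad_b = 2 * error.mean()            # gradient of the MSE w.r.t. the bias
    w -= learning_rate * grad_w          # optimizer step: move against the gradient
    b -= learning_rate * grad_b

final_loss = mse_loss(w, b)
print(initial_loss, final_loss)   # the loss should shrink substantially
print(w, b)                       # should approach [3, -2] and 1
```

The loss tells the optimizer how wrong the current weights are, and the gradient tells it which direction reduces that error.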

In [0]:
reg_model.compile(optimizer = 'sgd',
              loss='mse', 
              metrics=['mae'])

Similar to a scikit-learn model, we use `fit` to begin the learning process. The model will fit the training data to the training targets by making 100 passes (epochs) over the training data, adjusting the weights and bias as it goes. 
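
As a back-of-the-envelope sketch of how much updating that involves: assuming Keras's default batch size of 32 (we do not pass `batch_size` to `fit`), each epoch is split into several weight updates.

```python
import math

n_samples = 404    # training samples in this dataset
batch_size = 32    # assumed: the Keras default when batch_size is not passed to fit
epochs = 100

updates_per_epoch = math.ceil(n_samples / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)   # 13
print(total_updates)       # 1300
```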

In [0]:
reg_model.fit(train_data, 
          train_targets,
          epochs=100)

Let's see the weights learned from this process and compare them to the weights learned in linear regression. 

In [0]:
import numpy as np
weights, biases = reg_model.layers[0].get_weights()
coeff_df = pd.DataFrame({'LR Coefficients': lin_reg.coef_, 'NN Weights': weights[:,0]})  
coeff_df

They are close, just as expected. And because the weights are close, we expect the neural network to give similar predictions on the test set. 

In [0]:
reg_model.evaluate(test_data, test_targets)

To recap, the model had to learn 14 numbers: 13 weights and 1 bias. View this with the code below. 

In [0]:
reg_model.summary()

## Building a Better Neural Network

In that example, we made the simplest neural network possible. It had one layer and one node. Let's add another layer (a *hidden* layer) with 64 nodes! We hope these 64 nodes represent the input data in different ways. 

In [0]:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
model.add(layers.Dense(1))

Notice the `Dense(64)` in the middle layer. This means the hidden layer will have 64 nodes that are fully connected to the input data. Thus, the input data's 13 features will be transformed into 64 values, hopefully identifying complex relationships in the housing data that the simple model missed. 

You also see a new argument, `activation`, which sets the layer's activation function. This function converts the value of each node into a new value. In some cases, the new value will become zero, meaning the node will not "fire" and will not send information further into the network. Let's not worry too much about this now. We will be using the `relu` activation function from here on out. 
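
For reference, `relu` itself is a very small function. A minimal NumPy sketch (the input values are arbitrary):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keep positive values, replace negatives with zero."""
    return np.maximum(x, 0.0)

node_values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(node_values)

print(activated.tolist())   # [0.0, 0.0, 0.0, 1.5, 3.0]
```

The negative inputs become zero (the node does not "fire"), while positive inputs pass through unchanged.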

In the last layer, you see one node again, `Dense(1)`. The 64 hidden nodes connect to this node with weights and a bias to predict the final output, price. 
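
Putting the two layers together, a forward pass through this architecture can be sketched in NumPy. The parameters `W1`, `b1`, `W2`, and `b2` are hypothetical, randomly initialized stand-ins for what Keras would learn; the sketch only illustrates the shapes and the number of learnable parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 13, 64
x = rng.normal(size=(1, n_features))           # one made-up sample

# Randomly initialized stand-ins for the parameters Keras would learn.
W1 = rng.normal(size=(n_features, n_hidden))   # input -> hidden weights
b1 = np.zeros(n_hidden)                        # one bias per hidden node
W2 = rng.normal(size=(n_hidden, 1))            # hidden -> output weights
b2 = np.zeros(1)                               # output bias

hidden = np.maximum(x @ W1 + b1, 0.0)          # Dense(64, activation='relu')
price = hidden @ W2 + b2                       # Dense(1), no activation

n_params = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, price.shape)   # (1, 64) (1, 1)
print(n_params)                    # 961
```

The count breaks down as 13 × 64 weights plus 64 biases in the hidden layer (896), and 64 weights plus 1 bias in the output layer (65).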

Our network ends with a single unit, and no activation (i.e. it will be a linear layer). 
This is a typical setup for scalar regression (i.e. regression where we are trying to predict a single continuous value). 
Applying an activation function would constrain the range that the output can take; for instance if 
we applied a `sigmoid` activation function to our last layer, the network could only learn to predict values between 0 and 1. Here, because 
the last layer is purely linear, the network is free to learn to predict values in any range.
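
A quick sketch of why a `sigmoid` output would be a problem here. The `raw_outputs` values are arbitrary numbers standing in for what the last layer might compute before any activation:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-x))

raw_outputs = np.array([-30.0, -1.0, 0.0, 1.0, 30.0])   # arbitrary pre-activation values

squashed = sigmoid(raw_outputs)
print(squashed.min(), squashed.max())         # both stay between 0 and 1
print(raw_outputs.min(), raw_outputs.max())   # a linear output keeps the full range
```

Since the targets are prices in thousands of dollars (values such as 10 to 50), an output capped at 1 could never match them.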

Next, let's compile the network with an optimizer and a loss. 

In [0]:
model.compile(optimizer='sgd', 
              loss='mse', 
              metrics=['mae'])

Before we fit the model, let's see how many more parameters the network needs to learn:

In [0]:
model.summary()

Finally, let's train the model, again using 100 epochs to find the weights and biases. 

In [0]:
# Train the model
model.fit(train_data, 
          train_targets,
          epochs=100)




Note that we compiled the network with the `mse` loss function -- Mean Squared Error, the average of the squared differences between the 
predictions and the targets, a widely used loss function for regression problems.

We are also monitoring the `mae` metric during training. This stands for Mean Absolute Error. It is simply the average of the absolute values of the 
differences between the predictions and the targets. For instance, a MAE of 0.5 on this problem would mean that our predictions are off by 
\$500 on average.
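
Both quantities are easy to compute by hand. A minimal sketch with made-up targets and predictions (in thousands of dollars, like the dataset's targets):

```python
import numpy as np

# Made-up targets and predictions, in thousands of dollars.
targets = np.array([20.0, 35.0, 15.0, 50.0])
predictions = np.array([22.0, 30.0, 15.5, 46.0])

errors = predictions - targets   # [2.0, -5.0, 0.5, -4.0]
mse = np.mean(errors ** 2)       # average squared error
mae = np.mean(np.abs(errors))    # average absolute error

print(mse)          # 11.3125
print(mae)          # 2.875
print(mae * 1000)   # 2875.0, i.e. off by $2,875 on average
```

Squaring makes MSE penalize large errors more heavily, while MAE stays in the same units as the targets, which makes it easier to interpret.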

In [0]:
model.evaluate(test_data, test_targets)

In this case, we are off by \$2,500 on average, which is still significant considering that the prices range from \$10,000 to \$50,000. But this is a big improvement over the linear regression model, which had an error of \$3,464.  


## Wrapping up


Here's what you should take away from this example:

* The most basic Neural Network is similar to Linear Regression.
* Neural Networks can find complex relationships using hidden layers and nodes. 
* Regression is done using different loss functions from classification; Mean Squared Error (MSE) is a commonly used loss function for 
regression.
* Similarly, evaluation metrics to be used for regression differ from those used for classification; naturally the concept of "accuracy" 
does not apply for regression. A common regression metric is Mean Absolute Error (MAE).
* When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.