# Chapter 11 - Neural Networks

### 11.1 - Introduction

* The central idea is to extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features.

### 11.2 - Projection Pursuit Regression
* The PPR model has the form $f(X) = \sum_{m=1}^{M} g_m(w_m^T X)$, where $X$ is an input vector with $p$ components.
* The non-linear function $g_m$ is called a ridge function. It varies only in the direction defined by the weight vector $w_m$.
* The linear combination $w_m^T X$ is the projection of $X$ onto the unit vector $w_m$.
* We seek $w_m$ so that the model fits well (hence the term "pursuit").
* This model can approximate any continuous function in $\mathbb{R}^p$ arbitrarily well, provided $M$ is large enough and the $g_m$ are chosen appropriately. Such a class of models is called a universal approximator. This flexibility comes at a cost: the fitted model is usually very hard to interpret.
* A model with $M=1$ is more interpretable.
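
As a rough illustration of the PPR form above, the sketch below only evaluates $f(X) = \sum_m g_m(w_m^T X)$ for given directions and ridge functions. It does not fit anything: in actual PPR the $g_m$ and $w_m$ are estimated from data, whereas here the function name `ppr_predict`, the two directions, and the choice of `np.square` and `np.tanh` as ridge functions are all assumptions made for the example.

```python
import numpy as np

def ppr_predict(X, directions, ridge_functions):
    """Evaluate f(X) = sum_m g_m(w_m^T X) for each row of X."""
    X = np.asarray(X, dtype=float)
    total = np.zeros(X.shape[0])
    for w, g in zip(directions, ridge_functions):
        w = np.asarray(w, dtype=float)
        w = w / np.linalg.norm(w)   # normalise so w_m is a unit vector
        total += g(X @ w)           # g_m only sees the projection onto w_m
    return total

# Hypothetical M = 2 model in p = 2 dimensions (illustrative choices only).
directions = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]
ridge_functions = [np.square, np.tanh]

X = np.array([[1.0, 1.0], [2.0, 0.0]])
y = ppr_predict(X, directions, ridge_functions)
```

Both rows of `X` have the same projection onto the first direction, so the first ridge function contributes the same amount to each prediction; only the second term differs.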

### 11.3 - Neural Networks

* While there is a lot of hype around neural networks, they are just non-linear statistical models.
* A neural network is a two-stage regression or classification model.
* In the first stage, derived features $Z_m$ are created by applying an activation function to linear combinations of the inputs.
* In the second stage, the targets are modelled as a function of linear combinations of the $Z_m$ (via the output function).
* The units in the middle of the network, computing the derived features $Z_m$, are called hidden units since they are not directly observed.
* There can be more than one hidden layer.
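
The two stages can be sketched as a forward pass in NumPy. This is a minimal illustration, assuming a sigmoid activation and a softmax output function for a $K$-class problem; the parameter names `alpha0`, `alpha`, `beta0`, `beta` and the sizes used are chosen for the example and do not come from the notes above.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(t):
    t = t - t.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(t)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, alpha0, alpha, beta0, beta):
    # Stage 1: derived features Z_m from linear combinations of the inputs.
    Z = sigmoid(alpha0 + X @ alpha)        # shape (N, M)
    # Stage 2: linear combinations of the Z_m, passed through the output function.
    T = beta0 + Z @ beta                   # shape (N, K)
    return Z, softmax(T)

# Assumed sizes: N observations, p inputs, M hidden units, K classes.
rng = np.random.default_rng(0)
N, p, M, K = 4, 3, 5, 2
X = rng.normal(size=(N, p))
alpha0, alpha = np.zeros(M), rng.normal(scale=0.1, size=(p, M))
beta0, beta = np.zeros(K), rng.normal(scale=0.1, size=(M, K))

Z, probs = forward(X, alpha0, alpha, beta0, beta)
```

For regression, the softmax would typically be replaced by the identity, so the output is just the linear combination of the hidden units.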

### 11.4 - Fitting Neural Networks

* The unknown parameters in the neural network model are often called "weights".
* There are two sets of weights: those forming the linear combinations of the inputs that feed the activation function, and those forming the linear combinations of the hidden units that feed the output function.
* The generic approach to fitting is gradient descent, which is called back-propagation in this setting.
* The gradient descent updates can be implemented with a two-pass algorithm. In the forward pass, the current weights are fixed and the predicted values are computed. In the backward pass, the errors are computed at the output layer and then back-propagated to give the errors at the hidden layer. Both sets of errors are then used to compute the gradients for the updates.
* The back-propagation procedure is also called the "delta rule".
* In the back-propagation algorithm, each hidden unit passes and receives information only to and from units that share a connection. Hence it can be implemented efficiently on a parallel architecture computer.
* A training epoch refers to one sweep through the entire training set.
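
The two-pass algorithm can be sketched for a small regression network. This is a minimal sketch rather than a reference implementation: it assumes one hidden layer with sigmoid units, an identity output, mean squared error, and full-batch updates. The function name `fit_backprop`, the learning rate, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def fit_backprop(X, y, n_hidden=4, epochs=500, lr=0.1, seed=0):
    """Batch gradient descent for a one-hidden-layer regression network."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    alpha0 = np.zeros(n_hidden)
    alpha = rng.normal(scale=0.1, size=(p, n_hidden))  # small random start
    beta0 = 0.0
    beta = rng.normal(scale=0.1, size=n_hidden)
    losses = []
    for _ in range(epochs):                  # one epoch = one sweep through the data
        # Forward pass: weights fixed, compute predictions.
        Z = sigmoid(alpha0 + X @ alpha)      # hidden units, shape (n, n_hidden)
        f = beta0 + Z @ beta                 # predictions, shape (n,)
        losses.append(np.mean((y - f) ** 2))
        # Backward pass: output-layer errors, then hidden-layer errors.
        delta = -2.0 * (y - f) / n                   # d(loss)/d(f_i)
        s = np.outer(delta, beta) * Z * (1.0 - Z)    # back-propagated errors
        # Gradient descent updates using both sets of errors.
        beta0 -= lr * delta.sum()
        beta -= lr * (Z.T @ delta)
        alpha0 -= lr * s.sum(axis=0)
        alpha -= lr * (X.T @ s)
    return losses

# Synthetic data, invented for the example.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
losses = fit_backprop(X, y)
```

Each hidden-layer error in `s` depends only on the output error and the weight connecting that hidden unit to the output, which is the locality property that makes the procedure easy to parallelise.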

### 11.5 - Some Issues in Training Neural Networks

* Training neural networks can be difficult -- the model is generally overparametrized, and the optimization problem is nonconvex and unstable unless certain guidelines are followed.
* Usually weights are initialized to be random values near zero (so that the model starts out nearly linear).
* Neural networks with too many weights overfit easily. An early stopping rule can be used to halt training well before the global minimum is approached.
* A regularization penalty can also be added to the loss function.
* Standardize inputs to have mean zero and standard deviation one.
* Generally speaking, it is better to have too many hidden units than too few. Typically the number of hidden units is somewhere in the range of 5 to 100, with the number increasing with the number of inputs and number of training cases.
* Try a variety of random starting weight configurations, and choose the solution giving the lowest (penalized) error. 
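
Two of these guidelines can be sketched in a few lines. The sketch assumes the regularization penalty is a weight-decay term (a multiple `lam` of the sum of squared weights) added to a sum-of-squares error; the helper names `standardize` and `penalized_error` and the toy numbers are hypothetical.

```python
import numpy as np

def standardize(X_train, X_other):
    """Scale both sets using the training mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0                      # leave constant columns unscaled
    return (X_train - mean) / std, (X_other - mean) / std

def penalized_error(residuals, weights, lam):
    """Sum of squared errors plus lam times the sum of squared weights."""
    penalty = sum(np.sum(w ** 2) for w in weights)
    return np.sum(residuals ** 2) + lam * penalty

# Toy values, invented for the example.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0]])
Xs_train, Xs_test = standardize(X_train, X_test)

err = penalized_error(np.array([1.0, -1.0]), [np.array([1.0, 2.0])], lam=0.1)
```

Standardizing with the training statistics keeps the test set from leaking into the fit, and puts all inputs on a comparable scale so that a single penalty weight treats them evenly.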

### 11.6 - Example - Simulated Data

* A radial function is the most difficult case for a neural net, as it is spherically symmetric, with no preferred directions.

### 11.7 - Example - ZIP Code Data

* If each of the units in a single 8 × 8 feature map shares the same set of nine weights (but has its own bias parameter), this forces the extracted features in different parts of the image to be computed by the same linear functional. Consequently, these networks are sometimes known as convolutional networks.
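
Weight sharing can be illustrated with a small NumPy sketch. It assumes a 16 × 16 input image, a 3 × 3 patch per unit (nine weights), a stride of 2, and one row and column of zero padding so that an 8 × 8 map fits; the input size, stride, and padding are assumptions made for the example, as is the function name `shared_weight_feature_map`. The activation function is omitted to keep the focus on the shared linear functional.

```python
import numpy as np

def shared_weight_feature_map(image, kernel, bias, stride=2):
    """Apply one shared kernel to every patch; each unit keeps its own bias."""
    k = kernel.shape[0]
    out = bias.shape[0]
    needed = (out - 1) * stride + k              # extent covered by all patches
    pad = max(needed - image.shape[0], 0)        # assumes a square image
    padded = np.pad(image, ((0, pad), (0, pad)))
    fmap = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            r, c = i * stride, j * stride
            patch = padded[r:r + k, c:c + k]
            fmap[i, j] = np.sum(patch * kernel) + bias[i, j]
    return fmap

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))    # assumed input size
kernel = rng.normal(size=(3, 3))     # the nine shared weights
bias = np.zeros((8, 8))              # one bias per unit

fmap = shared_weight_feature_map(image, kernel, bias)

n_shared = kernel.size + bias.size           # parameters with weight sharing
n_unshared = bias.size * kernel.size + bias.size   # separate weights per unit
```

Under these assumptions the shared map needs 73 parameters, against 640 if each of the 64 units had its own nine weights.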

### 11.8 - Discussion

* Neural networks and PPR are especially effective in problems with a high signal-to-noise ratio and settings where prediction without interpretation is the goal. 
* They are less effective for problems where the goal is to describe the physical process that generated the data and the roles of individual inputs. 

### 11.9 - Bayesian Neural Nets

* Certain modifications to neural nets are possible, such as boosting, bagging, and Bayesian fitting.

### 11.10 - Computational Considerations

* With $N$ observations, $p$ predictors, $M$ hidden units and $L$ training epochs, a neural network fit typically requires $O(NpML)$ operations.
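
To get a feel for the scaling, the product can be evaluated for some hypothetical sizes. The helper name `approx_fit_operations` and the numbers below are invented for the example, and the result is an order-of-magnitude figure that ignores constant factors.

```python
def approx_fit_operations(n_obs, n_inputs, n_hidden, n_epochs):
    """Order-of-magnitude operation count N * p * M * L (constants ignored)."""
    return n_obs * n_inputs * n_hidden * n_epochs

# Hypothetical problem size: N = 1000, p = 13, M = 10, L = 50.
ops = approx_fit_operations(1000, 13, 10, 50)
ops_doubled = approx_fit_operations(1000, 13, 20, 50)
```

Doubling the number of hidden units doubles the estimate, as does doubling any one of the other three quantities.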

### Keras Example

```python
# Import libraries and modules
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the toy data set (load_boston is only available in scikit-learn < 1.2)
data, target = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(data, target)

# Define model architecture: 13 inputs -> 13 hidden -> 5 hidden -> 1 output
def nn_model():
    model = Sequential()
    model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))  # linear output for regression

    model.compile(loss='mean_squared_error', optimizer='adam')

    return model

# Wrap the Keras model so it exposes the scikit-learn fit/predict interface
kr = KerasRegressor(build_fn=nn_model, epochs=50, batch_size=5, verbose=1)
kr.fit(X_train, y_train)

train_predictions = kr.predict(X_train)
test_predictions = kr.predict(X_test)
```

```
Epoch 1/50
...
Epoch 50/50
```


```python
print(f"Train MSE: {mean_squared_error(y_train, train_predictions)}")
print(f"Test MSE: {mean_squared_error(y_test, test_predictions)}")
```

```
Train MSE: 25.811425258867953
Test MSE: 35.797897180585295
```
