# Challenge of Training Deep Learning
https://machinelearningmastery.com/a-gentle-introduction-to-the-challenge-of-training-deep-learning-neural-network-models/

- **Neural Nets Learn a Mapping Function**: 
    - We assume that a true mapping function exists that best maps input variables to output variables, and that a neural network model can do a reasonable job of approximating this unknown underlying function.
    - We can describe the broader problem that neural networks solve as “function approximation.” They learn to approximate an unknown underlying mapping function given a training dataset. They do this by learning the weights (the model parameters), given a specific network structure that we design.

- **Learning Network Weights Is Hard**:
    - Finding parameters for many machine learning algorithms involves solving a convex optimization problem: that is an error surface that is shaped like a bowl with a single best solution. *This is not the case for deep learning neural networks.*
    - It is not a simple bowl shape with a single best set of weights that we are guaranteed to find. Instead, there is a landscape of peaks and valleys with many good and many misleadingly good sets of parameters that we may discover.

- **Navigating the Non-Convex Error Surface**: 
    - Neural network models can be thought to learn by navigating a non-convex error surface.
    - Backpropagation refers to a technique from calculus to calculate the derivative (e.g. the slope or the gradient) of the model error for specific model parameters, allowing model weights to be updated to move down the gradient.
    - Stochastic gradient descent can be used to find the parameters for other machine learning algorithms, such as linear regression, and it is used when working with very large datasets, although if there are sufficient resources, then convex-based optimization algorithms are significantly more efficient.
- **Components of the Learning Algorithm**:
    - Loss Function. The function used to estimate the performance of a model with a specific set of weights on examples from the training dataset.
    - Weight Initialization. The procedure by which the initial small random values are assigned to model weights at the beginning of the training process.
    - Batch Size. The number of examples used to estimate the error gradient before updating the model parameters.
    - Learning Rate. The amount that each model parameter is updated per cycle of the learning algorithm.
    - Epochs. The number of complete passes through the training dataset before the training process is terminated.
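
The five components above can be seen together in a minimal sketch of mini-batch gradient descent. This is an illustration only, not the article's code: the function name `train_linear_model`, the toy data, and the default values are assumptions, and a one-weight linear model stands in for a neural network so that the gradient can be written by hand.

```python
import random

def train_linear_model(xs, ys, learning_rate=0.05, batch_size=8, epochs=200, seed=1):
    """Mini-batch gradient descent for y = w * x + b with a mean squared error loss."""
    rng = random.Random(seed)
    # weight initialization: small random starting values
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)
    indices = list(range(len(xs)))
    for _ in range(epochs):  # epochs: complete passes through the training data
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]  # batch size: samples per update
            # loss function: mean squared error; these are its gradients for w and b
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # learning rate: how far each parameter moves down the gradient
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# toy data drawn from y = 3x + 1 (no noise), with x in [-1, 1]
xs = [i / 20 - 1 for i in range(41)]
ys = [3 * x + 1 for x in xs]
w, b = train_linear_model(xs, ys)
print('w=%.3f b=%.3f' % (w, b))
```

Because this toy problem is convex, the loop should recover values close to w=3 and b=1. A neural network uses the same loop, but backpropagation supplies the gradients and the error surface is no longer a bowl.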



# Configure Capacity With Nodes and Layers
https://machinelearningmastery.com/how-to-control-neural-network-model-capacity-with-nodes-and-layers/

*Including the number of nodes in a layer and the number
of layers used to define the scope of functions that can be learned by the model*

- Neural network model capacity is controlled both by the number of nodes and the number of layers in the model.
- A model with a single hidden layer and a sufficient number of nodes has the capability of learning any mapping function, but the chosen learning algorithm may or may not be able to realize this capability.
- Increasing the number of layers provides a short-cut to increasing the capacity of the model with fewer resources, and modern techniques allow learning algorithms to successfully train deep models.
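
One rough way to compare the resources used by wide and deep models is to count their trainable parameters. The helper `count_parameters` is a hypothetical name, and parameter count is only a proxy for cost, not a measure of capacity. The sizes assume 100 inputs and 20 classes, as in the blobs problem used in this section.

```python
def count_parameters(n_inputs, hidden_layers, n_outputs):
    """Count weights and biases in a fully connected network."""
    total = 0
    previous = n_inputs
    for width in list(hidden_layers) + [n_outputs]:
        total += (previous + 1) * width  # +1 for the bias of each node
        previous = width
    return total

# the same 250 hidden nodes arranged as one wide layer or as five layers of 50
wide = count_parameters(100, [250], 20)
deep = count_parameters(100, [50, 50, 50, 50, 50], 20)
print('wide=%d deep=%d' % (wide, deep))
```

In this particular arrangement the deep model has fewer parameters than the wide one, although the relationship depends on the layer widths chosen.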

In [None]:
# study of mlp learning curves given different number of nodes for multi-class classification
from sklearn.datasets import make_blobs
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import to_categorical
from matplotlib import pyplot
 
# prepare multi-class classification dataset
def create_dataset():
	# generate classification dataset with 100 input features and 20 classes
	X, y = make_blobs(n_samples=1000, centers=20, n_features=100, cluster_std=2, random_state=2)
	# one hot encode output variable
	y = to_categorical(y)
	# split into train and test
	n_train = 500
	trainX, testX = X[:n_train, :], X[n_train:, :]
	trainy, testy = y[:n_train], y[n_train:]
	return trainX, trainy, testX, testy
 
# fit model with given number of nodes, returns test set accuracy
def evaluate_model(n_nodes, trainX, trainy, testX, testy):
	# configure the model based on the data
	n_input, n_classes = trainX.shape[1], testy.shape[1]
	# define model
	model = Sequential()
	model.add(Dense(n_nodes, input_dim=n_input, activation='relu', kernel_initializer='he_uniform'))
	model.add(Dense(n_classes, activation='softmax'))
	# compile model
	opt = SGD(learning_rate=0.01, momentum=0.9)
	model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
	# fit model on train set
	history = model.fit(trainX, trainy, epochs=100, verbose=0)
	# evaluate model on test set
	_, test_acc = model.evaluate(testX, testy, verbose=0)
	return history, test_acc
 
# prepare dataset
trainX, trainy, testX, testy = create_dataset()

## Change Model Capacity With Nodes

We can see that as the number of nodes is increased, the model is better able to decrease the loss, i.e. to learn the training dataset. This plot shows the direct relationship between model capacity, as defined by the number of nodes in the hidden layer, and the model’s ability to learn.

In [None]:
# evaluate model and plot learning curve with given number of nodes
num_nodes = [1, 2, 3, 4, 5, 6, 7]
for n_nodes in num_nodes:
	# evaluate model with a given number of nodes
	history, result = evaluate_model(n_nodes, trainX, trainy, testX, testy)
	# summarize final test set accuracy
	print('nodes=%d: %.3f' % (n_nodes, result))
	# plot learning curve
	pyplot.plot(history.history['loss'], label=str(n_nodes))
# show the plot
pyplot.legend()
pyplot.show()

## Change Model Capacity With Layers
- Increasing the number of layers can often greatly increase the capacity of the model, acting like a computational and learning shortcut to modeling a problem.
- The danger is that a model with more capacity than is required is likely to overfit the training data, and as with a model that has too many nodes, a model with too many layers will likely be unable to learn the training dataset, getting lost or stuck during the optimization process.

In [None]:
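# evaluate model and plot learning curve of model with given number of layers
# fit model with n_layers hidden layers (assumed fixed width of 10 nodes each)
def evaluate_layers(n_layers, trainX, trainy, testX, testy):
	n_input, n_classes = trainX.shape[1], testy.shape[1]
	model = Sequential()
	model.add(Dense(10, input_dim=n_input, activation='relu', kernel_initializer='he_uniform'))
	for _ in range(1, n_layers):
		model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))
	model.add(Dense(n_classes, activation='softmax'))
	opt = SGD(learning_rate=0.01, momentum=0.9)
	model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
	history = model.fit(trainX, trainy, epochs=100, verbose=0)
	_, test_acc = model.evaluate(testX, testy, verbose=0)
	return history, test_acc
num_layers = [1, 2, 3, 4, 5]
for n_layers in num_layers:
	# evaluate model with a given number of layers
	history, result = evaluate_layers(n_layers, trainX, trainy, testX, testy)
	print('layers=%d: %.3f' % (n_layers, result))
	# plot learning curve
	pyplot.plot(history.history['loss'], label=str(n_layers))
pyplot.legend()
pyplot.show()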

# Batch and an Epoch
https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/

- **What Is a Sample?**: A sample is a single row of data. It contains inputs that are fed into the algorithm and an output that is used to compare to the prediction and calculate an error.
- **What Is a Batch?**: The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.
    - Batch Gradient Descent. Batch Size = Size of Training Set
    - Stochastic Gradient Descent. Batch Size = 1
    - Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

- **What Is an Epoch?**: The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.
- **What Is the Difference Between Batch and Epoch?**: 
    - The batch size is the number of samples processed before the model is updated.
    - The number of epochs is the number of complete passes through the training dataset.
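
The arithmetic linking batch size and epochs to the number of weight updates can be sketched as follows. The helper `count_updates` is a hypothetical name, and it assumes the final batch of an epoch is simply smaller when the dataset does not divide evenly.

```python
import math

def count_updates(n_samples, batch_size, epochs):
    """Return (batches per epoch, total weight updates) for a training run."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch, batches_per_epoch * epochs

# 500 training samples and 100 epochs, with three different batch sizes
for batch_size in (1, 32, 500):
    per_epoch, total = count_updates(n_samples=500, batch_size=batch_size, epochs=100)
    print('batch_size=%d: %d updates per epoch, %d in total' % (batch_size, per_epoch, total))
```

With a batch size of 1 (stochastic gradient descent) the model is updated 500 times per epoch. With a batch size of 500 (batch gradient descent) it is updated once per epoch.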

# Control the Stability With Batch Size
https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/

*Including exploring whether variations such as batch, stochastic
(online), or minibatch gradient descent are more appropriate*
- Batch size controls the accuracy of the estimate of the error gradient when training neural networks.
- Batch, Stochastic, and Minibatch gradient descent are the three main flavors of the learning algorithm.
- There is a tension between batch size and the speed and stability of the learning process.
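
The first point can be illustrated by measuring how much the gradient estimate varies from one batch to the next. This is a sketch under assumptions: the function `gradient_spread`, the synthetic 1D regression data, and the choice to evaluate at `w = 0` are all hypothetical and chosen only to keep the example small.

```python
import random
import statistics

def gradient_spread(batch_size, n_trials=200, seed=1):
    """Standard deviation of mini-batch gradient estimates at a fixed weight."""
    rng = random.Random(seed)
    # hypothetical 1D regression data: y = 2x plus Gaussian noise
    xs = [rng.uniform(-1, 1) for _ in range(500)]
    ys = [2 * x + rng.gauss(0, 0.5) for x in xs]
    w = 0.0  # weight at which the gradient is estimated
    estimates = []
    for _ in range(n_trials):
        batch = rng.sample(range(len(xs)), batch_size)
        # gradient of mean squared error for the model y = w * x
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        estimates.append(grad)
    return statistics.pstdev(estimates)

for batch_size in (1, 32, 500):
    print('batch_size=%d: spread=%.4f' % (batch_size, gradient_spread(batch_size)))
```

Larger batches give a steadier estimate of the gradient, and a batch equal to the whole training set gives the same estimate every time. The cost is fewer weight updates per epoch.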

In [None]:
# mlp for the blobs problem with minibatch gradient descent with varied batch size
from sklearn.datasets import make_blobs
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import to_categorical
from matplotlib import pyplot
 
# prepare train and test dataset
def prepare_data():
	# generate 2d classification dataset
	X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
	# one hot encode output variable
	y = to_categorical(y)
	# split into train and test
	n_train = 500
	trainX, testX = X[:n_train, :], X[n_train:, :]
	trainy, testy = y[:n_train], y[n_train:]
	return trainX, trainy, testX, testy
 
# fit a model and plot learning curve
def fit_model(trainX, trainy, testX, testy, n_batch):
	# define model
	model = Sequential()
	model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
	model.add(Dense(3, activation='softmax'))
	# compile model
	opt = SGD(learning_rate=0.01, momentum=0.9)
	model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
	# fit model
	history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=n_batch)
	# plot learning curves
	pyplot.plot(history.history['accuracy'], label='train')
	pyplot.plot(history.history['val_accuracy'], label='test')
	pyplot.title('batch='+str(n_batch), pad=-40)
 
# prepare dataset
trainX, trainy, testX, testy = prepare_data()
# create learning curves for different batch sizes
batch_sizes = [4, 8, 16, 32, 64, 128, 256, 450]
for i in range(len(batch_sizes)):
	# determine the plot number
	plot_no = 420 + (i+1)
	pyplot.subplot(plot_no)
	# fit model and plot learning curves for a batch size
	fit_model(trainX, trainy, testX, testy, batch_sizes[i])
# show learning curves
pyplot.show()

# Loss and Loss Functions
https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/

*Including understanding the way different loss functions
must be interpreted and whether an alternate loss function would be appropriate for your
problem*
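
As a rough illustration of how two common loss functions are read, the sketch below implements mean squared error and binary cross-entropy in plain Python. The function names and the clipping constant `eps` are assumptions for this example, not taken from the article.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference, in squared units of the target variable."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """Average negative log-likelihood of the true class, in nats."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and correct
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # confident and wrong
```

Both losses are zero for perfect predictions. Cross-entropy grows quickly when the model is confident and wrong, whereas mean squared error grows with the square of the distance from the target.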

# Configure What to Optimize With Loss Functions
https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/


# Understand the Impact of Learning Rate
https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/

*Including understanding the effect of different learning rates
on your problem and whether modern adaptive learning rate methods such as Adam would
be appropriate*
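
The effect of the learning rate can be seen on the simplest possible error surface. The sketch below runs plain gradient descent on f(w) = w**2. The function name and the three learning rates are assumptions chosen to show slow, fast, and unstable behaviour, and adaptive methods such as Adam are not covered here.

```python
def gradient_descent(learning_rate, n_steps=50, w=1.0):
    """Minimize f(w) = w**2 from a starting weight and return the final weight."""
    for _ in range(n_steps):
        w -= learning_rate * 2 * w  # the derivative of w**2 is 2w
    return w

for lr in (0.001, 0.1, 1.1):
    print('learning_rate=%s: final w=%g' % (lr, gradient_descent(lr)))
```

A learning rate that is too small barely moves the weight in 50 steps, a moderate one approaches the minimum at zero, and one that is too large overshoots further on every step and diverges.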

# Configure Speed of Learning With Learning Rate
https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/


# Stabilize Learning With Data Scaling
https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/

*Including the sensitivity that small network weights have to
the scale of input variables and the impact that large errors in the target variable have on
weight updates*
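
Two common ways to rescale a variable are normalization and standardization. The sketch below is a plain Python illustration with hypothetical function names. It assumes the values are not all identical, since a constant column would cause a division by zero.

```python
import statistics

def normalize(values):
    """Rescale values to the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

raw = [10.0, 20.0, 30.0, 40.0, 50.0]
print(normalize(raw))
print(standardize(raw))
```

In practice the minimum, maximum, mean, and standard deviation would be computed on the training set and then reused to scale the test set.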

# Introduction to the Rectified Linear Unit (ReLU)
https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/

# Fix Vanishing Gradients With ReLU
https://machinelearningmastery.com/how-to-fix-vanishing-gradients-using-the-rectified-linear-activation-function/

*Training deep multi-layered networks can leave the layers close to the input layer
with weights that are barely updated; this can be addressed using
modern activation functions such as the rectified linear activation function*
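
A simplified way to see why gradients vanish is to multiply the activation derivative across layers, as the chain rule does during backpropagation. The sketch below ignores the weights entirely (they are assumed to be 1) and uses hypothetical function names, so it shows the trend rather than a real network.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid function, which never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chained_gradient(grad_fn, z, n_layers):
    """Product of activation derivatives across n_layers (weights assumed to be 1)."""
    return grad_fn(z) ** n_layers

for n_layers in (1, 5, 10, 20):
    sig = chained_gradient(sigmoid_grad, 0.0, n_layers)
    rel = chained_gradient(relu_grad, 1.0, n_layers)
    print('layers=%d: sigmoid=%g relu=%g' % (n_layers, sig, rel))
```

Even at its steepest point the sigmoid derivative shrinks the signal by a factor of four per layer, while ReLU passes the gradient through unchanged for active nodes.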

# Fix Exploding Gradients With Gradient Clipping
https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/

*Large weight updates can cause a numerical overflow or underflow,
making the network weights take on NaN or Inf values; this can be addressed using
gradient scaling or gradient clipping*
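
The two remedies can be sketched in plain Python. The function names are hypothetical, and the gradient is treated as a single flat list of numbers. Keras optimizers accept `clipvalue` and `clipnorm` arguments for the same purpose.

```python
import math

def clip_by_value(gradients, limit):
    """Gradient clipping: force each element into the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradients]

def scale_by_norm(gradients, max_norm):
    """Gradient scaling: shrink the whole vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    return [g * max_norm / norm for g in gradients]

print(clip_by_value([3.0, -4.0, 0.5], 1.0))
print(scale_by_norm([3.0, -4.0], 1.0))
```

Scaling by the norm keeps the direction of the gradient and only shortens it, whereas clipping by value can change the direction because each element is limited independently.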

# Introduction to Batch Normalization
https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/

# Accelerate Learning With Batch Normalization
https://machinelearningmastery.com/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization/

# Deeper Models With Greedy Layer-Wise Pretraining
https://machinelearningmastery.com/greedy-layer-wise-pretraining-tutorial/

*Where layers are added one at a time to a model,
learning to interpret the output of prior layers and permitting the development of much
deeper models: a milestone technique in the field of deep learning*

# Jump-Start Training With Transfer Learning
https://machinelearningmastery.com/how-to-improve-performance-with-transfer-learning-for-deep-learning-neural-networks/

*Where a model is trained on a different, but somehow related,
predictive modeling problem and then used to seed the weights or used wholesale as a
feature extraction model to provide input to a model trained on the problem of interest*

# Adam Optimization Algorithm
https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/