# Choosing the number of hidden neurons
* The number of hidden neurons should be between the size of the input layer and the size of the output layer.
* The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
* The number of hidden neurons should be less than twice the size of the input layer.

https://www.heatonresearch.com/2017/06/01/hidden-layers.html
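
The three rules of thumb above can be turned into numbers for a given network. The sketch below is only illustrative: the function name `hidden_neuron_heuristics` is hypothetical, and the sizes 784 and 10 are assumed from the MNIST example used later in this notebook.

```python
def hidden_neuron_heuristics(n_inputs, n_outputs):
    """Return the three rule-of-thumb values for a single hidden layer."""
    low, high = sorted((n_inputs, n_outputs))
    return {
        # Rule 1: somewhere between the output size and the input size
        "between_in_and_out": (low, high),
        # Rule 2: 2/3 of the input size plus the output size
        "two_thirds_in_plus_out": round(2 * n_inputs / 3 + n_outputs),
        # Rule 3: strictly less than twice the input size (largest allowed value)
        "less_than_twice_in": 2 * n_inputs - 1,
    }

heuristics = hidden_neuron_heuristics(784, 10)
print(heuristics)
```

The rules give different answers, so they are best read as starting points for a search rather than as a prescription.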

# Choosing the number of hidden layers
* In artificial neural networks, hidden layers are required if and only if the data must be separated non-linearly.

https://towardsdatascience.com/beginners-ask-how-many-hidden-layers-neurons-to-use-in-artificial-neural-networks-51466afa0d3e

# Activation Functions
In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input. This is similar to the behavior of the linear perceptron in neural networks. However, only nonlinear activation functions allow such networks to compute nontrivial problems using only a small number of nodes. Such activation functions are called nonlinearities.
## Non-Linear Activation Functions
Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data, such as images, video, audio, and data sets which are non-linear or have high dimensionality. Almost any process imaginable can be represented as a functional computation in a neural network, provided that the activation function is non-linear.

Non-linear functions address the problems of a linear activation function:
* They allow backpropagation because they have a derivative function which is related to the inputs.
* They allow “stacking” of multiple layers of neurons to create a deep neural network. Multiple hidden layers of neurons are needed to learn complex data sets with high levels of accuracy.
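
A small NumPy check may help show why stacking only pays off with a non-linearity. In this sketch (the matrices are arbitrary illustrative values, not taken from any model in this notebook), two layers with a linear activation collapse into a single linear layer, while inserting ReLU between them breaks that equivalence.

```python
import numpy as np

x = np.array([[1.0, -1.0], [2.0, 0.5]])    # two samples, two features
W1 = np.array([[1.0, -1.0], [0.0, 2.0]])   # first layer weights
W2 = np.array([[1.0], [1.0]])              # second layer weights

# Two stacked layers with a linear (identity) activation...
stacked_linear = (x @ W1) @ W2
# ...are the same as one layer whose weight matrix is W1 @ W2.
single_linear = x @ (W1 @ W2)
linear_collapses = np.allclose(stacked_linear, single_linear)

# With ReLU between the layers the result differs from any such single product.
stacked_relu = np.maximum(x @ W1, 0.0) @ W2
relu_collapses = np.allclose(stacked_relu, single_linear)

print(linear_collapses, relu_collapses)
```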
### Sigmoid / Logistic
**Advantages:**
* Smooth gradient, preventing “jumps” in output values.
* Output values bound between 0 and 1, normalizing the output of each neuron.
* Clear predictions: for X above 2 or below -2, the Y value (the prediction) is pushed to the edge of the curve, very close to 1 or 0.

**Disadvantages:**
* Vanishing gradient: for very high or very low values of X, there is almost no change to the prediction. This can result in the network failing to learn further, or being too slow to reach an accurate prediction.
* Outputs not zero centered.
* Computationally expensive.
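
The vanishing gradient can be seen directly from the derivative, which for the sigmoid is s(x)(1 - s(x)). This is a minimal NumPy sketch; the sample points in `xs` are chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
outputs = sigmoid(xs)
grads = sigmoid_grad(xs)
print(outputs)  # values squeezed into (0, 1)
print(grads)    # largest at x = 0, close to zero at both extremes
```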
### TanH / Hyperbolic Tangent
**Advantages:** Zero centered, making it easier to model inputs that have strongly negative, neutral, and strongly positive values. Otherwise like the Sigmoid function.

**Disadvantages:** Like the Sigmoid function.
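
One way to see the "otherwise like the Sigmoid" remark is the identity tanh(x) = 2·sigmoid(2x) - 1: TanH is a rescaled, shifted sigmoid. The sketch below checks this numerically on a symmetric grid of inputs, which is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-3.0, 3.0, 7)   # symmetric inputs: -3, -2, ..., 3
tanh_values = np.tanh(xs)
rescaled_sigmoid = 2.0 * sigmoid(2.0 * xs) - 1.0

# On symmetric inputs tanh averages to 0, while sigmoid averages to 0.5.
tanh_mean = tanh_values.mean()
sigmoid_mean = sigmoid(xs).mean()
print(tanh_mean, sigmoid_mean)
```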
### ReLU (Rectified Linear Unit)
**Advantages:**
* Computationally efficient, allowing the network to converge very quickly.
* Non-linear: although it looks like a linear function, ReLU has a derivative function and allows for backpropagation.

**Disadvantages:** The Dying ReLU problem: when a neuron's inputs are negative, the gradient of the function is zero, so no gradient flows back through that neuron and its weights stop updating. A neuron stuck in this state cannot learn.
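
A minimal NumPy sketch of ReLU and its gradient shows where the gradient disappears. Treating the gradient at exactly zero as 0 is an assumption made here for simplicity; frameworks may use a different convention.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Assumed convention: gradient is 0 at x == 0.
    return (x > 0).astype(x.dtype)

pre_activations = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
relu_out = relu(pre_activations)
relu_slope = relu_grad(pre_activations)
print(relu_out)    # negative inputs are clipped to zero
print(relu_slope)  # zero gradient wherever the input is not positive
```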

### Parametric ReLU
**Advantages:** Allows the negative slope to be learned. Unlike leaky ReLU, where the slope of the negative part is a fixed constant, this function treats the slope α as a learnable parameter. It is, therefore, possible to perform backpropagation and learn the most appropriate value of α. Otherwise like ReLU.

**Disadvantages:** May perform differently for different problems.
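
To make "learning α" concrete, the sketch below writes out the function together with its derivative with respect to α, which is what backpropagation would use to update it. The helper names and the value α = 0.1 are illustrative assumptions.

```python
import numpy as np

def prelu(x, alpha):
    # Identity for positive inputs, slope alpha for the rest.
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # d prelu / d alpha: x on the negative side, 0 on the positive side.
    return np.where(x > 0, 0.0, x)

inputs = np.array([-2.0, -1.0, 0.5, 3.0])
prelu_out = prelu(inputs, alpha=0.1)
alpha_grad = prelu_grad_alpha(inputs)
print(prelu_out)
print(alpha_grad)
```

Because the derivative with respect to α is non-zero for negative inputs, a gradient step can change the slope, which is not possible with the fixed slope of leaky ReLU.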
### Softmax
**Advantages:**
* Able to handle multiple classes, where other activation functions handle only one: Softmax exponentiates the output for each class and divides by the sum over all classes, so every output lies between 0 and 1 and the outputs sum to 1. Each output can be read as the probability of the input belonging to a specific class.
* Useful for output neurons: typically Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories.
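
A short NumPy sketch of Softmax follows. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; the example logits are made up for illustration.

```python
import numpy as np

def softmax(logits):
    # Shift by the row maximum so np.exp cannot overflow.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])  # second row would overflow without the shift
probs = softmax(logits)
print(probs)
print(probs.sum(axis=-1))  # each row sums to 1
```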
### Swish
Swish is a new, self-gated activation function discovered by researchers at Google. According to their paper, it performs better than ReLU with a similar level of computational efficiency. In experiments on ImageNet with identical models running ReLU and Swish, the new function achieved top-1 classification accuracy 0.6-0.9% higher.
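
The text above does not give the formula. The sketch below assumes the commonly cited form swish(x) = x · sigmoid(x), where "self-gated" means the input is multiplied by a gate computed from itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # Assumed form: the input gated by its own sigmoid.
    return x * sigmoid(x)

samples = np.array([-1.0, 0.0, 10.0])
swish_out = swish(samples)
print(swish_out)  # slightly negative for small negative x, close to x for large x
```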

## Notes
Recent research by Franco Manessi and Alessandro Rozza attempted to find ways to automatically learn which is the optimal activation function for a certain neural network and to even automatically combine activation functions to achieve the highest accuracy. This is a very promising field of research because it attempts to discover an optimal activation function configuration automatically, whereas today, this parameter is manually tuned.

## Derivatives of activation function
The derivative (also known as the gradient) of an activation function is extremely important for training the neural network. Neural networks are trained using a process called backpropagation: an algorithm which traces back from the output of the model, through the different neurons which were involved in generating that output, back to the original weight applied to each neuron. At each step it applies the chain rule, multiplying by the derivative of the activation function, to obtain the gradient of the loss with respect to each weight. The optimizer then uses these gradients to adjust the weights so that predictions become more accurate.
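
As a sketch of how the activation derivative enters backpropagation, consider a single sigmoid neuron with a squared-error loss. The weights, input, and target below are arbitrary illustrative values; the analytic gradient from the chain rule is compared against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    return 0.5 * (sigmoid(w @ x) - y) ** 2

def loss_grad(w, x, y):
    a = sigmoid(w @ x)
    # Chain rule: dL/da * da/dz * dz/dw, with da/dz the sigmoid derivative.
    return (a - y) * a * (1.0 - a) * x

w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
y = 1.0

analytic = loss_grad(w, x, y)
eps = 1e-6
numeric = np.array([
    (loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(len(w))
])
print(analytic, numeric)
```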

## References
1. https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
2. https://arxiv.org/pdf/1801.09403.pdf

In [37]:
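# Activation function names that can be passed as strings to Keras layers.
# ('deserialize', 'get' and 'serialize' are module helpers in keras.activations,
# not activations, so they are not listed.)
activation_functions = [
    'elu', 
    'exponential', 
    'hard_sigmoid', 
    'linear', 
    'relu', # max(x, 0): piecewise linear, hence non-linear overall
    'selu', 
    'sigmoid', 
    'softmax', 
    'softplus', 
    'softsign', 
    'swish', 
    'tanh',
]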

# Optimizers
Optimizers are algorithms or methods used to change the attributes of your neural network, such as weights and learning rate, in order to reduce the losses. Optimization algorithms or strategies are responsible for reducing the losses and providing the most accurate results possible.
### Gradient Descent
The most basic but most used optimization algorithm. It’s used heavily in linear regression and classification algorithms. Backpropagation in neural networks also uses a gradient descent algorithm.
### Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of Gradient Descent that updates the model’s parameters more frequently: the parameters are altered after computing the loss on each training example. So, if the dataset contains 1000 rows, SGD will update the model parameters 1000 times in one pass over the dataset instead of once, as in Gradient Descent.
### Mini-Batch Gradient Descent
Mini-batch gradient descent is often considered the best of the gradient descent variants, as it improves on both SGD and standard gradient descent. The dataset is divided into batches, and the model parameters are updated after every batch.
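
The three variants differ only in how many rows feed each update. The sketch below is a toy linear regression on synthetic data (all names and values are illustrative assumptions). It uses one loop for all three and counts the updates made in a single pass over 1000 rows.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.05, epochs=1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n_updates = 0
    for _ in range(epochs):
        order = rng.permutation(len(X))            # shuffle rows each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)  # gradient of half the mean squared error
            w -= lr * grad
            n_updates += 1
    return w, n_updates

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5])

_, updates_batch = gradient_descent(X, y, batch_size=1000)  # standard gradient descent
_, updates_sgd = gradient_descent(X, y, batch_size=1)       # stochastic gradient descent
_, updates_mini = gradient_descent(X, y, batch_size=50)     # mini-batch gradient descent
print(updates_batch, updates_sgd, updates_mini)
```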
### Nesterov Accelerated Gradient
Momentum may be a good method, but if the momentum is too high the algorithm may overshoot the local minimum and continue up the other side. The Nesterov Accelerated Gradient (NAG) algorithm was developed to resolve this issue. It is a look-ahead method. We know we’ll be using γV(t−1) to modify the weights, so θ−γV(t−1) approximately tells us the future location. We then calculate the cost gradient at this future position rather than at the current one. Nesterov momentum overshoots slightly less than standard momentum because it takes a "gamble, then correct" approach.
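
The look-ahead step can be sketched on a one-dimensional quadratic, f(θ) = θ²/2, whose gradient is simply θ. The function name, the learning rate, and γ = 0.9 are illustrative assumptions; the only difference from classical momentum is that the gradient is evaluated at θ − γV(t−1).

```python
def nag_minimize(grad_fn, theta, lr=0.1, gamma=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        lookahead = theta - gamma * velocity                   # approximate future position
        velocity = gamma * velocity + lr * grad_fn(lookahead)  # gradient taken at the look-ahead point
        theta = theta - velocity
    return theta

theta_nag = nag_minimize(lambda t: t, theta=5.0)
print(theta_nag)
```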
### Adagrad
One disadvantage of all the optimizers explained so far is that the learning rate is constant for all parameters and for each cycle. Adagrad changes the learning rate ‘η’ for each parameter and at every time step ‘t’. It is a first-order method: it works on the derivative of the error function and scales each parameter's step by the accumulated squared gradients for that parameter. It makes big updates for infrequent parameters and small updates for frequent parameters, so it is well suited to sparse data. The main benefit of Adagrad is that we don’t need to tune the learning rate manually; most implementations use a default value of 0.01 and leave it at that. Its main weakness is that its learning rate is always decreasing, and can eventually become too small for learning to continue.
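
The per-parameter scaling can be sketched in a few lines. In this illustrative setup (the gradient pattern is made up), the first parameter receives a gradient at every step and the second only at every tenth step, so the rarely updated parameter keeps a larger effective learning rate.

```python
import numpy as np

def adagrad_step(theta, g, accum, lr=0.01, eps=1e-8):
    accum = accum + g ** 2                            # running sum of squared gradients
    theta = theta - lr * g / (np.sqrt(accum) + eps)   # per-parameter scaled step
    return theta, accum

theta = np.zeros(2)
accum = np.zeros(2)
for t in range(100):
    # Parameter 0 is "frequent", parameter 1 is "infrequent" (sparse).
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, accum = adagrad_step(theta, g, accum)

effective_rates = 0.01 / (np.sqrt(accum) + 1e-8)
print(accum, effective_rates)
```

Since `accum` can only grow, the effective rates can only shrink, which is the weakness noted above.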
### AdaDelta
AdaDelta is an extension of AdaGrad that addresses its decaying learning rate problem. Instead of accumulating all past squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size w. In practice an exponentially moving average is used rather than the sum of all the gradients. Another property of AdaDelta is that we don’t even need to set a default learning rate.
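
A rough sketch of the update, assuming the usual formulation with a decay rate ρ and two running averages (of squared gradients and of squared updates). No learning rate appears: the step size comes from the ratio of the two averages. The quadratic objective and all values are illustrative.

```python
import numpy as np

def adadelta_minimize(grad_fn, theta, rho=0.95, eps=1e-6, steps=500):
    avg_sq_grad = 0.0    # exponentially decaying average of squared gradients
    avg_sq_delta = 0.0   # exponentially decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(theta)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g ** 2
        delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
        theta = theta + delta
    return theta

# Gradient of f(theta) = theta**2 / 2 is theta; the minimum is at 0.
theta_adadelta = adadelta_minimize(lambda t: t, theta=5.0)
print(theta_adadelta)  # has moved from 5.0 toward the minimum
```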
### Adam
Adam (Adaptive Moment Estimation) works with moments of first and second order. The intuition behind Adam is that we don’t want to roll so fast that we jump over the minimum; we want to decrease the velocity a little for a careful search. In addition to storing an exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying average of past gradients M(t).
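
The two running averages can be sketched as follows, assuming the standard bias-corrected form of the update. The hyperparameter values are common defaults apart from the learning rate, which is raised to 0.1 here only so the toy problem converges quickly.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g         # decaying average of past gradients, M(t)
    v = beta2 * v + (1 - beta2) * g ** 2    # decaying average of past squared gradients
    m_hat = m / (1 - beta1 ** t)            # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = theta**2 / 2, whose gradient is theta.
theta, m, v = 5.0, 0.0, 0.0
history = []
for t in range(1, 501):
    theta, m, v = adam_step(theta, g=theta, m=m, v=v, t=t, lr=0.1)
    history.append(theta)
print(history[0], history[-1])
```

The first step has a magnitude of about `lr` regardless of the gradient's scale, because the bias-corrected ratio `m_hat / sqrt(v_hat)` is close to ±1.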
## Conclusions
* Adam is generally the best choice among these optimizers: if one wants to train the neural network in less time and more efficiently, then Adam is the optimizer to use.
* For sparse data, use optimizers with a dynamic learning rate.
* If you want to use a gradient descent algorithm, then mini-batch gradient descent is the best option.
* TL;DR: Adam works well in practice and outperforms other adaptive techniques.
* Use SGD+Nesterov for shallow networks, and either Adam or RMSprop for deep nets.

## References
1. https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6
2. https://www.dlology.com/blog/quick-notes-on-how-to-choose-optimizer-in-keras/

In [None]:
optimizers = [
    'SGD', 
    'RMSprop', 
    'Adam', 
    'Adadelta', 
    'Adagrad', 
    'Adamax', 
    'Nadam', 
    'Ftrl', 
    'rmsprop',
]

# Adam = RMSprop + Momentum
# keras.optimizers.SGD(lr=0.01, nesterov=True)  <-- SGD + Nesterov

In [None]:
losses = [
    'BinaryCrossentropy',
    'CategoricalCrossentropy',
    'SparseCategoricalCrossentropy',
    'Poisson',
    'binary_crossentropy',
    'categorical_crossentropy',
    'sparse_categorical_crossentropy',
    'poisson',
    'KLDivergence',
    'kl_divergence',
]

In [None]:
learning_rates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

# Momentum
Momentum was invented to reduce the high variance of SGD and smooth its convergence. It accelerates convergence in the relevant direction and reduces fluctuation in irrelevant directions. This method introduces one more hyperparameter, known as momentum and symbolized by ‘γ’.
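
A minimal sketch of the classical momentum update on a one-dimensional quadratic, f(θ) = θ²/2, is given below. The function name and the values of the learning rate and γ are illustrative assumptions. The velocity accumulates a decaying sum of past gradients, and the parameter moves by that velocity.

```python
def momentum_minimize(grad_fn, theta, lr=0.1, gamma=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        velocity = gamma * velocity + lr * grad_fn(theta)  # decaying sum of past gradients
        theta = theta - velocity
    return theta

theta_momentum = momentum_minimize(lambda t: t, theta=5.0)
print(theta_momentum)
```

Setting `gamma=0.0` reduces this to plain gradient descent, which is why the grid of momentum values in the next cell starts at 0.0.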

In [None]:
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]

In [1]:
# TensorFlow and tf.keras
import numpy as np
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

print("Load MNIST data set:")
(x_train, y_train), (x_test, y_test)= mnist.load_data()
print('x_train:\t{}'.format(x_train.shape))
print('y_train:\t{}'.format(y_train.shape))
print('x_test:\t\t{}'.format(x_test.shape))
print('y_test:\t\t{}'.format(y_test.shape))

print()
print("Encode y data:") 
y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(y_test)
print("First ten entries of y_train:\n {}\n".format(y_train[0:10]))
print("First ten rows of one-hot y_train:\n {}".format(y_train_encoded[0:10,]))
print()
print('y_train_encoded shape: ', y_train_encoded.shape)
print('y_test_encoded shape: ', y_test_encoded.shape)
x_train_reshaped = np.reshape(x_train, (60000, 784))
x_test_reshaped = np.reshape(x_test, (10000, 784))
print('x_train_reshaped shape: ', x_train_reshaped.shape)
print('x_test_reshaped shape: ', x_test_reshaped.shape)


print()
print('Min-Max Normalization:')
x_train_norm = x_train_reshaped.astype('float32') / 255
x_test_norm = x_test_reshaped.astype('float32') / 255
print(set(x_train_norm[0]))

Load MNIST data set:
x_train:	(60000, 28, 28)
y_train:	(60000,)
x_test:		(10000, 28, 28)
y_test:		(10000,)

Encode y data:
First ten entries of y_train:
 [5 0 4 1 9 2 1 3 1 4]

First ten rows of one-hot y_train:
 [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]

y_train_encoded shape:  (60000, 10)
y_test_encoded shape:  (10000, 10)
x_train_reshaped shape:  (60000, 784)
x_test_reshaped shape:  (10000, 784)

Min-Max Normalization:
{0.0, 0.011764706, 0.53333336, 0.07058824, 0.49411765, 0.6862745, 0.101960786, 0.6509804, 1.0, 0.96862745, 0.49803922, 0.11764706, 0.14117648, 0.36862746, 0.6039216, 0.6666667, 0.043137256, 0.05490196, 0.03529412, 0.85882354, 0.7764706, 0.7137255, 0.94509804, 0.3137255, 0.6117647, 0.4

In [53]:
# Use scikit-learn to grid search the learning rate and momentum
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
#from keras.optimizers import SGD

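# Function to create model, required for KerasClassifier.
# Note: learn_rate and momentum are accepted so the grid search can pass them,
# but they are not used below: the optimizer is selected by name and keeps its default settings.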
def create_model(learn_rate=0.6, momentum=0, optimizer='adam', activation='relu', loss='binary_crossentropy'):
    # Create model
    model = Sequential()
    model.add(Dense(name = "hidden_layer", units=512, input_shape = (784,), activation=activation))
    model.add(Dense(name = "output_layer", units=10, activation='sigmoid'))
    # Compile model
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    #model.summary()
    return model

model = KerasClassifier(build_fn=create_model)

# grid search parameters
learn_rate = [0.6]#[0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
momentum = [0]#[0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
optimizers = ['rmsprop']
epochs = np.array([10])
batches = np.array([10])
activation = ['relu']#,'tanh']#,'softmax']
loss = ['binary_crossentropy'] #,'categorical_crossentropy']
#units_1 = [10,20,30,512,]
#units_2 = [0.3]#[0.001, 0.01, 0.1, 0.2, 0.3]

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=10).fit(x_train_norm, y_train_encoded)



In [7]:
from sklearn.decomposition import PCA
pred_classes = np.argmax(model.predict(x_train_norm), axis=-1)
pixel_data = {'pred_class':pred_classes}
for k in range(0,784): 
    pixel_data[f"pix_val_{k}"] = x_train_norm[:,k]
pixel_df = pd.DataFrame(pixel_data)
pixel_df.head()

# from sklearn.decomposition import PCA

# Separating out the features
features = [*pixel_data][1:] # ['pix_val_0', 'pix_val_1',...]
x = pixel_df.loc[:, features].values 

pca = PCA(n_components=154)
principalComponents = pca.fit_transform(x_train_norm)

In [54]:
pd.set_option('Display.max_rows', None)
para = pd.DataFrame.from_dict(grid_result.cv_results_['params'])
mean = pd.DataFrame(grid_result.cv_results_['mean_test_score'],columns=['mean_test_score'])
stds = pd.DataFrame(grid_result.cv_results_['std_test_score'],columns=['std_test_score'])
df = para.join(mean.join(stds)).sort_values(['mean_test_score'], ascending=False)
df.reset_index().drop(columns=['index'])

Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score
0,relu,10,0.6,binary_crossentropy,0,10,rmsprop,0.9647,0.003841


In [51]:
# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.6, momentum=0, optimizer='adam', activation='relu', loss='binary_crossentropy'):
    # Create model
    model = Sequential()
    model.add(Dense(name = "hidden_layer", units=512, input_shape = (154,), activation=activation))
    model.add(Dense(name = "output_layer", units=10, activation='sigmoid'))
    # Compile model
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    #model.summary()
    return model

model = KerasClassifier(build_fn=create_model)
#history = model.fit(principalComponents,y_train_encoded, epochs=20,validation_split=0.10,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(principalComponents, y_train_encoded)



In [55]:
para = pd.DataFrame.from_dict(grid_result.cv_results_['params'])
mean = pd.DataFrame(grid_result.cv_results_['mean_test_score'],columns=['mean_test_score'])
stds = pd.DataFrame(grid_result.cv_results_['std_test_score'],columns=['std_test_score'])
df = para.join(mean.join(stds)).sort_values('mean_test_score', ascending=False)
df.reset_index().drop(columns=['index'])

Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score
0,relu,10,0.6,binary_crossentropy,0,10,rmsprop,0.9647,0.003841


In [52]:
para = pd.DataFrame.from_dict(grid_result.cv_results_['params'])
mean = pd.DataFrame(grid_result.cv_results_['mean_test_score'],columns=['mean_test_score'])
stds = pd.DataFrame(grid_result.cv_results_['std_test_score'],columns=['std_test_score'])
df = para.join(mean.join(stds)).sort_values('mean_test_score', ascending=False)
df.reset_index().drop(columns=['index'])

Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score
0,relu,10,0.6,binary_crossentropy,0,10,rmsprop,0.96685,0.003866
