# Experiment #1
Choosing the number of hidden neurons
* The number of hidden neurons should be between the size of the input layer and the size of the output layer.
* The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
* The number of hidden neurons should be less than twice the size of the input layer.
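
As a rough illustration, the three rules of thumb above can be evaluated for the network used later in this notebook (784 inputs, 10 outputs). The function name `hidden_neuron_heuristics` is hypothetical and the rounding in the second rule is an assumption. Treat the numbers as starting points for experiments, not as requirements.

```python
def hidden_neuron_heuristics(n_inputs, n_outputs):
    """Evaluate three common rules of thumb for the hidden layer size."""
    low, high = sorted((n_inputs, n_outputs))
    return {
        # Rule 1: somewhere between the output size and the input size.
        "between": (low, high),
        # Rule 2: 2/3 of the input size plus the output size (rounded; an assumption).
        "two_thirds_rule": round(2 * n_inputs / 3 + n_outputs),
        # Rule 3: strictly less than twice the input size.
        "upper_bound": 2 * n_inputs,
    }


rules = hidden_neuron_heuristics(n_inputs=784, n_outputs=10)
print(rules)
```

The 512 hidden units used in the models below satisfy rules 1 and 3 and are close to the 533 suggested by rule 2.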

https://www.heatonresearch.com/2017/06/01/hidden-layers.html

# Experiment #2
Choosing the number of hidden layers
* In artificial neural networks, hidden layers are required if and only if the data must be separated non-linearly.
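
XOR is the classic example of data that cannot be separated linearly. The sketch below searches a coarse grid of weights for a single threshold unit that reproduces XOR (it finds none, which illustrates the point but is not a proof) and then shows a hand-wired network with one hidden layer that does. All names and weight values are illustrative assumptions.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0


def single_unit(w1, w2, b, x):
    """A network without hidden layers: one threshold unit on the raw inputs."""
    return step(w1 * x[0] + w2 * x[1] + b)


def two_layer(x):
    """One hidden layer with two units, followed by one output unit."""
    h_or = step(x[0] + x[1] - 0.5)    # fires if at least one input is on
    h_and = step(x[0] + x[1] - 1.5)   # fires only if both inputs are on
    return step(h_or - h_and - 0.5)   # "OR but not AND" is XOR


# Coarse grid of candidate weights and biases: -4.0, -3.5, ..., 4.0
grid = [v / 2 for v in range(-8, 9)]
linear_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(single_unit(w1, w2, b, x) == y for x, y in XOR.items())
]

print("single-unit solutions found:", len(linear_solutions))
print("hidden-layer network solves XOR:",
      all(two_layer(x) == y for x, y in XOR.items()))
```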

https://towardsdatascience.com/beginners-ask-how-many-hidden-layers-neurons-to-use-in-artificial-neural-networks-51466afa0d3e

# Experiment #3
Varying all other hyperparameters (Optimizer, Activation, Loss, Learning Rate, Momentum) 

# Experiment #4
Varying the DropOut and other 

# Experiment #5
GridSearch best hyperparameters

# Experiment #6
Data augmentation (e.g. combining groups of similar clothes)

# Experiment #7
Ensembled CNN

# Activation Functions
In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input. This is similar to the behavior of the linear perceptron in neural networks. However, only nonlinear activation functions allow such networks to compute nontrivial problems using only a small number of nodes, and such activation functions are called nonlinearities.
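
To make the role of the nonlinearity concrete, the sketch below implements three of the activations listed in this notebook in plain Python and compares two stacked linear maps with and without a ReLU in between. The function names and the toy coefficients are assumptions chosen for illustration.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    shifted = [z - max(zs) for z in zs]   # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def linear_stack(x):
    """Two stacked linear 'layers' collapse into one linear map: 6x - 1."""
    return 3.0 * (2.0 * x + 1.0) - 4.0


def relu_stack(x):
    """The same two layers with a ReLU in between: the result has a kink."""
    return 3.0 * relu(2.0 * x + 1.0) - 4.0


for x in (-2.0, -1.0, 0.0, 1.0):
    print(x, linear_stack(x), relu_stack(x))
```

For inputs where the inner layer is positive the two stacks agree. For negative pre-activations the ReLU version flattens out, which is the kind of behavior a purely linear network cannot represent.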

## References
1. https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
2. https://arxiv.org/pdf/1801.09403.pdf

In [21]:
activation_functions = [
    # 'deserialize', 'get' and 'serialize' are utility functions of
    # keras.activations, not activation functions, so they are left out.
    'elu', 'exponential', 'hard_sigmoid', 'linear', 'relu', 'selu',
    'sigmoid', 'softmax', 'softplus', 'softsign', 'swish', 'tanh',
]

# Optimizers
Optimizers are algorithms that adjust the trainable attributes of a neural network, such as its weights, in order to reduce the loss. Some of them also adapt the learning rate during training. The choice of optimizer affects how quickly the loss decreases and how accurate the final model is.
## Conclusions
* Adam is a good default optimizer: it tends to train a network in less time and more efficiently than the alternatives.
* For sparse data, use an optimizer with a dynamic (adaptive) learning rate.
* If you want to use plain gradient descent, mini-batch gradient descent is usually the best option.
* TL;DR: Adam works well in practice and often outperforms other adaptive techniques.
* Use SGD+Nesterov for shallow networks, and either Adam or RMSprop for deep networks.
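
Adam is often summarized as RMSprop plus momentum. The sketch below shows that combination for a single parameter in plain Python: a running mean of gradients (the momentum part) divided by the square root of a running mean of squared gradients (the RMSprop part). The function name, the toy objective, and the hyperparameter values are assumptions for illustration. This is not the Keras implementation.

```python
import math


def adam_minimize(grad, x0, learning_rate=0.1, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-8, steps=1000):
    """Minimize a one-dimensional function with the Adam update rule."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta_1 * m + (1 - beta_1) * g        # momentum-like mean of gradients
        v = beta_2 * v + (1 - beta_2) * g * g    # RMSprop-like mean of squared gradients
        m_hat = m / (1 - beta_1 ** t)            # bias correction for the zero start
        v_hat = v / (1 - beta_2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3); the minimum is at x = 3.
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=5.0)
print(x_final)
```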

## References
1. https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6#:~:text=Optimizers%20are%20algorithms%20or%20methods,order%20to%20reduce%20the%20losses.&text=Optimization%20algorithms%20or%20strategies%20are,the%20most%20accurate%20results%20possible.
2. https://www.dlology.com/blog/quick-notes-on-how-to-choose-optimizer-in-keras/

In [22]:
optimizers = [
    'SGD', 
    'RMSprop', 
    'Adam', 
    'Adadelta', 
    'Adagrad', 
    'Adamax', 
    'Nadam', 
    'Ftrl', 
]

# Adam = RMSprop + Momentum
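# keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  <-- SGD + Nesterov
# (momentum=0.9 is an example value; Nesterov has no effect when momentum is 0)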

In [23]:
losses = [
    'BinaryCrossentropy',
    'CategoricalCrossentropy',
    'SparseCategoricalCrossentropy',
    'Poisson',
    'binary_crossentropy',
    'categorical_crossentropy',
    'sparse_categorical_crossentropy',
    'poisson',
    'KLDivergence',
    'kl_divergence',
]

In [24]:
learning_rates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

In [25]:
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]

In [26]:
# TensorFlow and tf.keras
import numpy as np
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

print("Load MNIST data set:")
(x_train, y_train), (x_test, y_test)= mnist.load_data()
print('x_train:\t{}'.format(x_train.shape))
print('y_train:\t{}'.format(y_train.shape))
print('x_test:\t\t{}'.format(x_test.shape))
print('y_test:\t\t{}'.format(y_test.shape))

print()
print("Encode y data:") 
y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(y_test)
print("First ten entries of y_train:\n {}\n".format(y_train[0:10]))
print("First ten rows of one-hot y_train:\n {}".format(y_train_encoded[0:10,]))
print()
print('y_train_encoded shape: ', y_train_encoded.shape)
print('y_test_encoded shape: ', y_test_encoded.shape)
x_train_reshaped = np.reshape(x_train, (60000, 784))
x_test_reshaped = np.reshape(x_test, (10000, 784))
print('x_train_reshaped shape: ', x_train_reshaped.shape)
print('x_test_reshaped shape: ', x_test_reshaped.shape)


print()
print('Min-Max Normalization:')
x_train_norm = x_train_reshaped.astype('float32') / 255
x_test_norm = x_test_reshaped.astype('float32') / 255
print(set(x_train_norm[0]))

Load MNIST data set:
x_train:	(60000, 28, 28)
y_train:	(60000,)
x_test:		(10000, 28, 28)
y_test:		(10000,)

Encode y data:
First ten entries of y_train:
 [5 0 4 1 9 2 1 3 1 4]

First ten rows of one-hot y_train:
 [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]

y_train_encoded shape:  (60000, 10)
y_test_encoded shape:  (10000, 10)
x_train_reshaped shape:  (60000, 784)
x_test_reshaped shape:  (10000, 784)

Min-Max Normalization:
{0.0, 0.011764706, 0.53333336, 0.07058824, 0.49411765, 0.6862745, 0.101960786, 0.6509804, 1.0, 0.96862745, 0.49803922, 0.11764706, 0.14117648, 0.36862746, 0.6039216, 0.6666667, 0.043137256, 0.05490196, 0.03529412, 0.85882354, 0.7764706, 0.7137255, 0.94509804, 0.3137255, 0.6117647, 0.4

In [27]:
x_train.shape

(60000, 28, 28)

In [31]:
# Use scikit-learn to grid search the learning rate and momentum
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D
from keras.wrappers.scikit_learn import KerasClassifier
#from keras.optimizers import SGD

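# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.6, momentum=0, optimizer='adam', activation='relu', loss='binary_crossentropy'):
    # NOTE: learn_rate and momentum are accepted so that GridSearchCV can pass them,
    # but they are not used below: the optimizer is selected by name and keeps its
    # default learning rate and momentum settings.
    # Create model: one hidden layer of 512 units on the flattened 28x28 image
    model = Sequential()
    model.add(Dense(name = "hidden_layer", units=512, input_shape = (784,), activation=activation))
    # Sigmoid scores each class independently; softmax is the usual choice
    # when the loss is categorical cross-entropy.
    model.add(Dense(name = "output_layer", units=10, activation='sigmoid'))
    # Compile model
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    #model.summary()
    return model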

model = KerasClassifier(build_fn=create_model)
# Overfitting is a persistent problem in machine learning;
# we will face it all the time. Methods to address it include Dropout.
# Research other ways to mitigate overfitting.
# How much dropout (20%, 10%)? Vary the dropout rate as part of hyperparameter tuning.
# The number of nodes should be varied too, possibly through param_grid?
# Plot the results.
# Document that the number of nodes was NOT varied.
# scikit-learn began as a Google Summer of Code project; TensorFlow was developed at Google.
# Does it matter where in the model the layers go?
# squeeze between layers
# Vary the architecture of the model programmatically!

# Flatten layer <-- it essentially creates the 784-element input vector
# Dense layer <-- this is the hidden layer

#units_1 = [10,20,30,512,]
#units_2 = [0.3]#[0.001, 0.01, 0.1, 0.2, 0.3]

# RMSprop and Adam will differentiate based on batching...
# some optimizers are better for 
# When you change between CPU, GPU and TPU, each is optimized for certain things.
# If you are pre-batching data, you will run a cycle where everything going through the processing units is zeros.

results_df = pd.DataFrame(
    columns=[
        'activation', 'batch_size', 'learn_rate', 'loss',
        'momentum', 'nb_epoch', 'optimizer', 'mean_test_score',
        'std_test_score', 'mean_fit_time'
    ])
results_df

Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
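
`create_model` above accepts `learn_rate` and `momentum` but hands the optimizer to `compile` by name, so those two values never reach the optimizer. One possible way to wire them in is sketched below. The helper names `optimizer_kwargs` and `build_optimizer` are hypothetical, and the sketch assumes the TensorFlow 2 class names from the `optimizers` list (`'SGD'`, `'Adam'`, ...) rather than lowercase aliases such as `'adam'`.

```python
def optimizer_kwargs(name, learn_rate, momentum):
    """Constructor arguments for a tf.keras optimizer class with the given name."""
    kwargs = {"learning_rate": learn_rate}
    # Of the optimizers listed in this notebook, SGD and RMSprop take a
    # `momentum` argument; Adam-style optimizers use beta_1 instead.
    if name in ("SGD", "RMSprop"):
        kwargs["momentum"] = momentum
    return kwargs


def build_optimizer(name, learn_rate=0.01, momentum=0.0):
    """Create the optimizer object so the grid values actually take effect."""
    import tensorflow as tf  # imported lazily; only needed when a model is built
    optimizer_class = getattr(tf.keras.optimizers, name)
    return optimizer_class(**optimizer_kwargs(name, learn_rate, momentum))


print(optimizer_kwargs("SGD", 0.01, 0.9))
print(optimizer_kwargs("Adam", 0.01, 0.9))
```

Inside `create_model`, the compile step could then read `model.compile(loss=loss, optimizer=build_optimizer(optimizer, learn_rate, momentum), metrics=['accuracy'])`. A `units` argument could be added to `create_model` and to `param_grid` in the same way to vary the number of hidden nodes.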


In [34]:
def print_results(grid_result):
    """Collect the GridSearchCV results into a DataFrame sorted by mean test score."""
    para = pd.DataFrame.from_dict(grid_result.cv_results_['params'])
    mean = pd.DataFrame(grid_result.cv_results_['mean_test_score'], columns=['mean_test_score'])
    stds = pd.DataFrame(grid_result.cv_results_['std_test_score'], columns=['std_test_score'])
    time = pd.DataFrame(grid_result.cv_results_['mean_fit_time'], columns=['mean_fit_time'])

    # One row per parameter combination, best score first. The index is kept,
    # so it still gives the position of each combination in the grid.
    df = para.join(mean.join(stds)).join(time).sort_values('mean_test_score', ascending=False)
    return df

# Activation Functions

In [35]:
# grid search parameters
learn_rate = [0.01]
momentum = [0]
optimizers = ['adam']
epochs = [10]
batches = [100]
activation = [
    'elu', 'exponential', 'hard_sigmoid', 'linear', 
    'relu', 'selu', 'sigmoid', 'softmax', 'softplus', 
    'softsign', 'swish', 'tanh',
]
loss = ['binary_crossentropy']

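# NOTE: `nb_epoch` is the Keras 1 name for the number of epochs. Newer Keras
# versions call this argument `epochs`, so this entry may be silently ignored,
# in which case every fit runs for the default number of epochs.
param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)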

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
results_df = pd.concat([results_df, df])  # DataFrame.append was removed in pandas 2.0
results_df.head()



Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
1,exponential,100,0.01,binary_crossentropy,0,10,adam,0.9448,0.001855,1.156888
4,relu,100,0.01,binary_crossentropy,0,10,adam,0.94165,0.00139,1.184122
10,swish,100,0.01,binary_crossentropy,0,10,adam,0.9308,0.002165,1.584021
11,tanh,100,0.01,binary_crossentropy,0,10,adam,0.9227,0.00152,1.233272
9,softsign,100,0.01,binary_crossentropy,0,10,adam,0.920583,0.001511,1.468389


# Momentum
Momentum was introduced to reduce the high variance of SGD updates and to smooth convergence. It accelerates progress along the direction in which successive gradients agree and damps fluctuations in the other directions. The method adds one more hyperparameter, the momentum coefficient, usually written ‘γ’.
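
The update rule can be sketched in a few lines of plain Python. The velocity accumulates a decaying sum of past gradients, weighted by γ, and the parameter moves by the velocity instead of by the raw gradient. The function name, the toy objective f(x) = x², and the values of `learning_rate` and `gamma` are assumptions for illustration.

```python
def sgd_momentum(grad, x0, learning_rate=0.01, gamma=0.9, steps=100):
    """Gradient descent with classical momentum on a single parameter."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        velocity = gamma * velocity + learning_rate * grad(x)
        x -= velocity
    return x


def grad_square(x):
    """Gradient of f(x) = x^2."""
    return 2.0 * x


plain = sgd_momentum(grad_square, x0=1.0, gamma=0.0)          # ordinary gradient descent
with_momentum = sgd_momentum(grad_square, x0=1.0, gamma=0.9)  # same steps, with momentum

print("distance from the minimum without momentum:", abs(plain))
print("distance from the minimum with momentum:   ", abs(with_momentum))
```

With the same small learning rate and the same number of steps, the run with momentum ends up much closer to the minimum at 0 on this toy problem.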

In [36]:
# grid search parameters
learn_rate = [0.01]
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
optimizers = ['adam']
epochs = [10]
batches = [100]
activation = ['relu']
loss = ['binary_crossentropy']

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
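results_df = pd.concat([results_df, df])  # DataFrame.append was removed in pandas 2.0
results_df.head()  # first rows of the accumulated results, not only this run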



Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
1,exponential,100,0.01,binary_crossentropy,0,10,adam,0.9448,0.001855,1.156888
4,relu,100,0.01,binary_crossentropy,0,10,adam,0.94165,0.00139,1.184122
10,swish,100,0.01,binary_crossentropy,0,10,adam,0.9308,0.002165,1.584021
11,tanh,100,0.01,binary_crossentropy,0,10,adam,0.9227,0.00152,1.233272
9,softsign,100,0.01,binary_crossentropy,0,10,adam,0.920583,0.001511,1.468389


# Learning Rate
In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain.

In setting a learning rate, there is a trade-off between the rate of convergence and overshooting. While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction. A too high learning rate will make the learning jump over minima but a too low learning rate will either take too long to converge or get stuck in an undesirable local minimum.

In order to achieve faster convergence, prevent oscillations and getting stuck in undesirable local minima the learning rate is often varied during training either in accordance to a learning rate schedule or by using an adaptive learning rate. The learning rate and its adjustments may also differ per parameter, in which case it is a diagonal matrix that can be interpreted as an approximation to the inverse of the Hessian matrix in Newton's method. The learning rate is related to the step length determined by inexact line search in quasi-Newton methods and related optimization algorithms.
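
The trade-off between slow convergence and overshooting is easy to see on a toy problem. The sketch below runs plain gradient descent on f(x) = x² with several of the learning rates from the grid in the next cell. The function name and the toy objective are assumptions. On this function each step multiplies x by (1 − 2 · learning rate), so small rates shrink x slowly, 0.5 lands on the minimum in one step, 1.0 bounces between +1 and −1 forever, and 1.5 diverges.

```python
def gradient_descent(learning_rate, x0=1.0, steps=20):
    """Minimize f(x) = x^2 (gradient 2x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x


for lr in (0.01, 0.1, 0.5, 1.0, 1.5):
    print(f"learning rate {lr}: x after 20 steps = {gradient_descent(lr)}")
```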

In [37]:
# grid search parameters
learn_rate = [0.01, 0.1, 0.5, 1.0, 1.5, 2.0]
momentum = [0.0]
optimizers = ['adam']
epochs = [10]
batches = [100]
activation = ['relu']
loss = ['binary_crossentropy']

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
results_df = pd.concat([results_df, df])  # DataFrame.append was removed in pandas 2.0
results_df.head(20)



Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
1,exponential,100,0.01,binary_crossentropy,0.0,10,adam,0.9448,0.001855,1.156888
4,relu,100,0.01,binary_crossentropy,0.0,10,adam,0.94165,0.00139,1.184122
10,swish,100,0.01,binary_crossentropy,0.0,10,adam,0.9308,0.002165,1.584021
11,tanh,100,0.01,binary_crossentropy,0.0,10,adam,0.9227,0.00152,1.233272
9,softsign,100,0.01,binary_crossentropy,0.0,10,adam,0.920583,0.001511,1.468389
0,elu,100,0.01,binary_crossentropy,0.0,10,adam,0.918433,0.003433,1.193996
5,selu,100,0.01,binary_crossentropy,0.0,10,adam,0.914817,0.002321,1.191628
8,softplus,100,0.01,binary_crossentropy,0.0,10,adam,0.914367,0.001969,1.461108
6,sigmoid,100,0.01,binary_crossentropy,0.0,10,adam,0.9014,0.002526,1.182006
3,linear,100,0.01,binary_crossentropy,0.0,10,adam,0.89835,0.003012,1.154006


# Optimizers

In [38]:
# grid search parameters
learn_rate = [0.01]
momentum = [0.0]
optimizers = ['SGD', 'RMSprop', 'Adam', 'Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'Ftrl']
epochs = [10]
batches = [100]
activation = ['relu']
loss = ['binary_crossentropy']

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
results_df = pd.concat([results_df, df])  # DataFrame.append was removed in pandas 2.0
results_df.head(20)



Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
1,exponential,100,0.01,binary_crossentropy,0.0,10,adam,0.9448,0.001855,1.156888
4,relu,100,0.01,binary_crossentropy,0.0,10,adam,0.94165,0.00139,1.184122
10,swish,100,0.01,binary_crossentropy,0.0,10,adam,0.9308,0.002165,1.584021
11,tanh,100,0.01,binary_crossentropy,0.0,10,adam,0.9227,0.00152,1.233272
9,softsign,100,0.01,binary_crossentropy,0.0,10,adam,0.920583,0.001511,1.468389
0,elu,100,0.01,binary_crossentropy,0.0,10,adam,0.918433,0.003433,1.193996
5,selu,100,0.01,binary_crossentropy,0.0,10,adam,0.914817,0.002321,1.191628
8,softplus,100,0.01,binary_crossentropy,0.0,10,adam,0.914367,0.001969,1.461108
6,sigmoid,100,0.01,binary_crossentropy,0.0,10,adam,0.9014,0.002526,1.182006
3,linear,100,0.01,binary_crossentropy,0.0,10,adam,0.89835,0.003012,1.154006


# Loss Function
A loss function is used to optimize the parameter values in a neural network model. Loss functions map a set of parameter values for the network onto a scalar value that indicates how well those parameters accomplish the task the network is intended to do.
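
As a small worked example, the two cross-entropy losses compared in the next cell can be written in plain Python for a single one-hot target. The function names mirror the Keras loss names, but these are simplified sketches for one sample, without the clipping and batching that a library implementation would add. The prediction vectors are made-up values.

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


def binary_crossentropy(y_true, y_pred):
    """Mean of the per-output binary cross-entropy terms."""
    terms = [t * math.log(p) + (1 - t) * math.log(1 - p)
             for t, p in zip(y_true, y_pred)]
    return -sum(terms) / len(terms)


y_true = [0, 0, 1]          # one-hot target: the true class is the third one
good = [0.1, 0.2, 0.7]      # most weight on the true class
bad = [0.7, 0.2, 0.1]       # most weight on a wrong class

for name, pred in (("good", good), ("bad", bad)):
    print(name,
          "categorical:", categorical_crossentropy(y_true, pred),
          "binary:", binary_crossentropy(y_true, pred))
```

Both losses map a prediction to a single scalar that is smaller for the better prediction, which is the property the optimizer relies on.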

In [39]:
# grid search parameters
learn_rate = [0.01]
momentum = [0.0]
optimizers = ['adam']
epochs = [10]
batches = [100]
activation = ['relu']
loss = [
    'BinaryCrossentropy', 'CategoricalCrossentropy', #'SparseCategoricalCrossentropy',
    'Poisson', 'KLDivergence'
]

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
results_df = pd.concat([results_df, df])  # DataFrame.append was removed in pandas 2.0
results_df.head(20)



Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
1,exponential,100,0.01,binary_crossentropy,0.0,10,adam,0.9448,0.001855,1.156888
4,relu,100,0.01,binary_crossentropy,0.0,10,adam,0.94165,0.00139,1.184122
10,swish,100,0.01,binary_crossentropy,0.0,10,adam,0.9308,0.002165,1.584021
11,tanh,100,0.01,binary_crossentropy,0.0,10,adam,0.9227,0.00152,1.233272
9,softsign,100,0.01,binary_crossentropy,0.0,10,adam,0.920583,0.001511,1.468389
0,elu,100,0.01,binary_crossentropy,0.0,10,adam,0.918433,0.003433,1.193996
5,selu,100,0.01,binary_crossentropy,0.0,10,adam,0.914817,0.002321,1.191628
8,softplus,100,0.01,binary_crossentropy,0.0,10,adam,0.914367,0.001969,1.461108
6,sigmoid,100,0.01,binary_crossentropy,0.0,10,adam,0.9014,0.002526,1.182006
3,linear,100,0.01,binary_crossentropy,0.0,10,adam,0.89835,0.003012,1.154006


# Batches

In [40]:
# grid search parameters
learn_rate = [0.01]
momentum = [0.0]
optimizers = ['adam']
epochs = [10]
batches = [10,50,100]
activation = ['relu']
loss = ['BinaryCrossentropy']

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
results_df = pd.concat([results_df, df])  # DataFrame.append was removed in pandas 2.0
results_df.head(20)



Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
1,exponential,100,0.01,binary_crossentropy,0.0,10,adam,0.9448,0.001855,1.156888
4,relu,100,0.01,binary_crossentropy,0.0,10,adam,0.94165,0.00139,1.184122
10,swish,100,0.01,binary_crossentropy,0.0,10,adam,0.9308,0.002165,1.584021
11,tanh,100,0.01,binary_crossentropy,0.0,10,adam,0.9227,0.00152,1.233272
9,softsign,100,0.01,binary_crossentropy,0.0,10,adam,0.920583,0.001511,1.468389
0,elu,100,0.01,binary_crossentropy,0.0,10,adam,0.918433,0.003433,1.193996
5,selu,100,0.01,binary_crossentropy,0.0,10,adam,0.914817,0.002321,1.191628
8,softplus,100,0.01,binary_crossentropy,0.0,10,adam,0.914367,0.001969,1.461108
6,sigmoid,100,0.01,binary_crossentropy,0.0,10,adam,0.9014,0.002526,1.182006
3,linear,100,0.01,binary_crossentropy,0.0,10,adam,0.89835,0.003012,1.154006


# ALL PARAMETERS

In [41]:
# grid search parameters
learn_rate = [
    0.01, 0.1, 0.5, 1.0,# 1.5, 2.0
]
momentum = [
    0.0,# 0.2, 0.4, 0.6, 0.8, 0.9
]
optimizers = [
    'RMSprop', 'Adam', 'Nadam',
]
epochs = [10]
batches = [100]
activation = [
    'relu', 'exponential', 'swish', #'tanh',
]
loss = [
    'BinaryCrossentropy', 'CategoricalCrossentropy', 'Poisson'
]

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
pd.set_option('display.max_rows', None)
results_df = pd.concat([results_df, df])  # DataFrame.append was removed in pandas 2.0
results_df.head(20)









Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
1,exponential,100,0.01,binary_crossentropy,0.0,10,adam,0.9448,0.001855,1.156888
4,relu,100,0.01,binary_crossentropy,0.0,10,adam,0.94165,0.00139,1.184122
10,swish,100,0.01,binary_crossentropy,0.0,10,adam,0.9308,0.002165,1.584021
11,tanh,100,0.01,binary_crossentropy,0.0,10,adam,0.9227,0.00152,1.233272
9,softsign,100,0.01,binary_crossentropy,0.0,10,adam,0.920583,0.001511,1.468389
0,elu,100,0.01,binary_crossentropy,0.0,10,adam,0.918433,0.003433,1.193996
5,selu,100,0.01,binary_crossentropy,0.0,10,adam,0.914817,0.002321,1.191628
8,softplus,100,0.01,binary_crossentropy,0.0,10,adam,0.914367,0.001969,1.461108
6,sigmoid,100,0.01,binary_crossentropy,0.0,10,adam,0.9014,0.002526,1.182006
3,linear,100,0.01,binary_crossentropy,0.0,10,adam,0.89835,0.003012,1.154006


In [42]:
results_df.to_csv('results.csv',index=False)
results_df_read = pd.read_csv('results.csv')
results_df_read.head()

Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
0,exponential,100,0.01,binary_crossentropy,0.0,10,adam,0.9448,0.001855,1.156888
1,relu,100,0.01,binary_crossentropy,0.0,10,adam,0.94165,0.00139,1.184122
2,swish,100,0.01,binary_crossentropy,0.0,10,adam,0.9308,0.002165,1.584021
3,tanh,100,0.01,binary_crossentropy,0.0,10,adam,0.9227,0.00152,1.233272
4,softsign,100,0.01,binary_crossentropy,0.0,10,adam,0.920583,0.001511,1.468389


In [43]:
results_df_read.head(20)

Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score,mean_fit_time
0,exponential,100,0.01,binary_crossentropy,0.0,10,adam,0.9448,0.001855,1.156888
1,relu,100,0.01,binary_crossentropy,0.0,10,adam,0.94165,0.00139,1.184122
2,swish,100,0.01,binary_crossentropy,0.0,10,adam,0.9308,0.002165,1.584021
3,tanh,100,0.01,binary_crossentropy,0.0,10,adam,0.9227,0.00152,1.233272
4,softsign,100,0.01,binary_crossentropy,0.0,10,adam,0.920583,0.001511,1.468389
5,elu,100,0.01,binary_crossentropy,0.0,10,adam,0.918433,0.003433,1.193996
6,selu,100,0.01,binary_crossentropy,0.0,10,adam,0.914817,0.002321,1.191628
7,softplus,100,0.01,binary_crossentropy,0.0,10,adam,0.914367,0.001969,1.461108
8,sigmoid,100,0.01,binary_crossentropy,0.0,10,adam,0.9014,0.002526,1.182006
9,linear,100,0.01,binary_crossentropy,0.0,10,adam,0.89835,0.003012,1.154006


In [18]:
# grid search parameters
learn_rate = [
    0.01,# 0.1, 0.5, 1.0,# 1.5, 2.0
]
momentum = [
    0.0, 
    0.2, 
    0.4, 
    0.6, 
    0.8, 
    0.9
]
optimizers = [
    'Adam', 'Nadam',#'RMSprop', 
]
epochs = [10]
batches = [100]
activation = [
    'relu',# 'exponential',# 'swish', #'tanh',
]
loss = [
    #'BinaryCrossentropy', 
    'CategoricalCrossentropy'
]

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
results_df.head(20)



Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score
3,relu,100,0.01,CategoricalCrossentropy,0.2,10,Nadam,0.9507,0.001071
0,relu,100,0.01,CategoricalCrossentropy,0.0,10,Adam,0.949417,0.002112
1,relu,100,0.01,CategoricalCrossentropy,0.0,10,Nadam,0.94915,0.001205
10,relu,100,0.01,CategoricalCrossentropy,0.9,10,Adam,0.94865,0.002225
5,relu,100,0.01,CategoricalCrossentropy,0.4,10,Nadam,0.94855,0.001975
8,relu,100,0.01,CategoricalCrossentropy,0.8,10,Adam,0.948267,0.002913
9,relu,100,0.01,CategoricalCrossentropy,0.8,10,Nadam,0.947983,0.00133
7,relu,100,0.01,CategoricalCrossentropy,0.6,10,Nadam,0.947917,0.001993
11,relu,100,0.01,CategoricalCrossentropy,0.9,10,Nadam,0.947883,0.003086
4,relu,100,0.01,CategoricalCrossentropy,0.4,10,Adam,0.9469,0.000204


In [20]:
# grid search parameters
learn_rate = [
    0.01,# 0.1, 0.5, 1.0,# 1.5, 2.0
]
momentum = [
    0.0, 
    #0.2, 
    #0.4, 
    #0.6, 
    #0.8, 
    #0.9
]
optimizers = [
    'Adam', 'Nadam',#'RMSprop', 
]
epochs = [10]
batches = [10]
activation = [
    'relu',# 'exponential',# 'swish', #'tanh',
]
loss = [
    'BinaryCrossentropy', 
    'CategoricalCrossentropy'
]

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
df



Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score
0,relu,10,0.01,BinaryCrossentropy,0.0,10,Adam,0.9646,0.001467
1,relu,10,0.01,BinaryCrossentropy,0.0,10,Nadam,0.964117,0.00156
3,relu,10,0.01,CategoricalCrossentropy,0.0,10,Nadam,0.9597,0.002724
2,relu,10,0.01,CategoricalCrossentropy,0.0,10,Adam,0.958667,0.001921


In [33]:
# grid search parameters
learn_rate = [
    0.01,# 0.1, 0.5, 1.0,# 1.5, 2.0
]
momentum = [
    0.0, 
    #0.2, 
    #0.4, 
    #0.6, 
    #0.8, 
    #0.9
]
optimizers = [
    'Adam', 'Nadam',#'RMSprop', 
]
epochs = [10]
batches = [10, 50, 100]
activation = [
    'relu',# 'exponential',# 'swish', #'tanh',
]
loss = [
    'BinaryCrossentropy', 
    'CategoricalCrossentropy'
]

param_grid = dict(
    learn_rate=learn_rate,
    momentum=momentum,
    optimizer=optimizers,
    nb_epoch=epochs,
    batch_size=batches,
    activation=activation,
    loss=loss
)

#history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
df = print_results(grid_result)
df



Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score
1,relu,10,0.01,BinaryCrossentropy,0.0,10,Nadam,0.964167,0.001376
0,relu,10,0.01,BinaryCrossentropy,0.0,10,Adam,0.963517,0.001573
3,relu,10,0.01,CategoricalCrossentropy,0.0,10,Nadam,0.960817,0.001818
2,relu,10,0.01,CategoricalCrossentropy,0.0,10,Adam,0.95705,0.003228
6,relu,50,0.01,CategoricalCrossentropy,0.0,10,Adam,0.954683,0.003081
5,relu,50,0.01,BinaryCrossentropy,0.0,10,Nadam,0.953583,0.002616
7,relu,50,0.01,CategoricalCrossentropy,0.0,10,Nadam,0.952333,0.002605
4,relu,50,0.01,BinaryCrossentropy,0.0,10,Adam,0.951867,0.001096
11,relu,100,0.01,CategoricalCrossentropy,0.0,10,Nadam,0.94825,0.002002
10,relu,100,0.01,CategoricalCrossentropy,0.0,10,Adam,0.946233,0.001998


In [13]:
# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0, optimizer='adam', activation='relu', loss='binary_crossentropy'):
    # Create model
    model = Sequential()
    model.add(Dense(name = "hidden_layer", units=512, input_shape = (784,), activation=activation))
    model.add(Dense(name = "output_layer", units=10, activation='sigmoid'))
    # Compile model
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    #model.summary()
    return model

model = KerasClassifier(build_fn=create_model)
history = model.fit(x_train_norm,y_train_encoded, epochs=10,validation_split=0.20,batch_size = 10)
#grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
#df = print_results(grid_result)
#df

Epoch 1/10
Epoch 2/10
Epoch 3/10

KeyboardInterrupt: 

# reconfigure to Conv2D

In [14]:
from keras import backend as K
import keras
from keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten, Activation
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# input image dimensions
img_rows, img_cols = 28, 28
num_classes = 10
input_shape = (28, 28, 1)

if K.image_data_format() == 'channels_first': 
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols) 
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else: 
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1) 
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'training samples')
print(X_test.shape[0], 'testing samples')


# The pixel values were already scaled to [0, 1] above; scaling a second
# time would shrink them by another factor of 255.
print(input_shape)

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

X_train shape: (60000, 28, 28, 1)
60000 training samples
10000 testing samples
(28, 28, 1)


In [17]:
# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0, optimizer='adam', activation='relu', loss='categorical_crossentropy'):
    # Create model
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',input_shape=input_shape))
    model.add(Dropout(0.25))
    model.add(Conv2D(32, kernel_size=(3, 3), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, kernel_size=(3, 3), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(name = "hidden_layer", units=512, activation=activation))
    model.add(Dense(name = "output_layer", units=10, activation='sigmoid'))
    # Compile model
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    #model.summary()
    return model

model = KerasClassifier(build_fn=create_model)
history = model.fit(X_train,y_train, epochs=30,validation_split=0.20,batch_size = 50)
#grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
#df = print_results(grid_result)
#df


Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [31]:
from sklearn.decomposition import PCA
pred_classes = np.argmax(model.predict(x_train_norm), axis=-1)
pixel_data = {'pred_class':pred_classes}
for k in range(0,784): 
    pixel_data[f"pix_val_{k}"] = x_train_norm[:,k]
pixel_df = pd.DataFrame(pixel_data)
pixel_df.head()

# from sklearn.decomposition import PCA

# Separating out the features
features = [*pixel_data][1:] # ['pix_val_0', 'pix_val_1',...]
x = pixel_df.loc[:, features].values 

pca = PCA(n_components=154)
principalComponents = pca.fit_transform(x_train_norm)

In [45]:
# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0, optimizer='adam', activation='relu', loss='binary_crossentropy'):
    # Create model
    model = Sequential()
    model.add(Dense(name = "hidden_layer", units=512, input_shape = (784,), activation=activation))
    model.add(Dense(name = "output_layer", units=10, activation='sigmoid'))
    # Compile model
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    #model.summary()
    return model

model = KerasClassifier(build_fn=create_model)
history = model.fit(x_train_norm,y_train_encoded, epochs=20,validation_split=0.20,batch_size = 10)
#grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(principalComponents, y_train_encoded)
#grid_result = GridSearchCV(estimator=model, param_grid=param_grid, cv=3).fit(x_train_norm, y_train_encoded)
#df = print_results(grid_result)
#df

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [29]:
para = pd.DataFrame.from_dict(grid_result.cv_results_['params'])
mean = pd.DataFrame(grid_result.cv_results_['mean_test_score'],columns=['mean_test_score'])
stds = pd.DataFrame(grid_result.cv_results_['std_test_score'],columns=['std_test_score'])
df = para.join(mean.join(stds)).sort_values('mean_test_score', ascending=False)
df.reset_index().drop(columns=['index'])

Unnamed: 0,activation,batch_size,learn_rate,loss,momentum,nb_epoch,optimizer,mean_test_score,std_test_score
0,relu,10,0.01,BinaryCrossentropy,0.0,10,Adam,0.964167,0.000655


