### INTRODUCTION - This notebook extends a previous basic model built for MNIST image classification with neural networks. It also explains the basic concepts of data loading, neural network model building, metric evaluation and prediction. The model is sequential with a single dense layer. It differs from the earlier version in two ways: the kernel initializer is left at its default, and the loss used in the compilation phase is changed from sparse_categorical_crossentropy to categorical_crossentropy, since the labels here are one-hot encoded.

In [None]:
import tensorflow as tf
import numpy as np
from tensorflow import keras

### In neural networks, one epoch means one forward pass and one backward pass of all the training examples. Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
### Reference - https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
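As a small sketch of the arithmetic above: the number of iterations in one epoch is the number of training examples divided by the batch size, rounded up. The helper name `iterations_per_epoch` is made up for this illustration and is not part of Keras.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of batches (weight updates) needed to see every example once."""
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(1000, 500))  # 2
```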

In [None]:
EPOCHS = 200

### Batch size stands for the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need. Batch size defines the number of samples that will be propagated through the network.
#### For instance, let's say you have 1050 training samples and you set batch_size to 100. The algorithm takes the first 100 samples (1st to 100th) from the training dataset and trains the network. Next, it takes the second 100 samples (101st to 200th) and trains the network again. This procedure repeats until all samples have been propagated through the network.
#### Advantages of using a batch size < number of all samples:
#### •	It requires less memory. Since you train the network using fewer samples, the overall training procedure requires less memory. That's especially important if you are not able to fit the whole dataset in your machine's memory.
#### •	Typically networks train faster with mini-batches. That's because we update the weights after each propagation. In our example we've propagated 11 batches (10 of them had 100 samples and 1 had 50 samples) and after each of them we've updated our network's parameters. 

#### Reference - https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
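The 1050-sample example above can be sketched as a generator of index ranges, one per mini-batch. The function name `batch_ranges` is hypothetical, and this only illustrates how the data is sliced; it is not how Keras implements batching internally.

```python
def batch_ranges(num_samples, batch_size):
    """Yield (start, end) index pairs, one per mini-batch; end is exclusive."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)

ranges = list(batch_ranges(1050, 100))
print(len(ranges))   # 11 batches in total
print(ranges[0])     # (0, 100)     -> first 100 samples
print(ranges[-1])    # (1000, 1050) -> last batch has only 50 samples
```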

In [None]:
BATCH_SIZE = 128

### The verbose argument (0, 1 or 2) controls how the training progress is displayed for each epoch.
#### verbose=0 shows nothing (silent)
#### verbose=1 shows an animated progress bar
#### verbose=2 prints one line per epoch
#### Reference - https://stackoverflow.com/questions/47902295/what-is-the-use-of-verbose-in-keras-while-validating-the-model

In [None]:
VERBOSE = 1

### NB_CLASSES is the number of outputs of the network, one per digit class (0 to 9)

In [None]:
NB_CLASSES = 10   # number of outputs = number of digits

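### N_HIDDEN is a positive integer giving the number of units in a hidden dense layer, i.e. the dimensionality of that layer's output space. It is defined here but is not used by the single-layer model built in this notebook.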

In [None]:
N_HIDDEN = 128

### VALIDATION_SPLIT is the fraction of the training data held out to validate the model during training.

In [None]:
VALIDATION_SPLIT=0.2 # how much TRAIN is reserved for VALIDATION
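A minimal sketch of what this fraction means for the 60,000 MNIST training images. The Keras documentation states that the validation data is taken from the last samples of the arrays passed to `fit`, before shuffling. The exact rounding used below and the helper name `split_sizes` are assumptions made for illustration.

```python
def split_sizes(num_samples, validation_split):
    """Return (training count, validation count) for a given split fraction."""
    split_at = int(num_samples * (1.0 - validation_split))
    return split_at, num_samples - split_at

n_train, n_val = split_sizes(60000, 0.2)
print(n_train, n_val)  # 48000 12000
```

With a batch size of 128, the 48,000 remaining training samples correspond to 375 weight updates per epoch.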

### Loading MNIST dataset
### The split between train and test is 60,000 and 10,000 samples respectively
### The labels are loaded as integers (0 to 9); one-hot encoding is applied in a later step

In [None]:
mnist = keras.datasets.mnist
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


### RESHAPED is the length of a flattened 28x28 image (28 * 28 = 784)
### X_train is 60000 rows of 28x28 values --> reshaped to 60000 x 784

In [None]:
RESHAPED = 784

### This reshapes the data from one form to another. X_train is 60000 rows of 28x28 values, so we reshape it to 60000 x 784. Likewise, X_test is 10000 rows of 28x28 values, so we reshape it to 10000 x 784. Reshaping is needed because the Dense layer used here expects each sample as a flat vector of features, unlike Keras convolution layers, which work with the higher-dimensional image shape.

In [None]:
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
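As a hedged variation on the cell above, the sample count does not have to be hard-coded: passing `-1` lets NumPy infer it. The array `images` below is a zero-filled stand-in used only to show the shapes, not real MNIST data.

```python
import numpy as np

# Stand-in for a batch of 5 grayscale 28x28 images (zeros, for shape only).
images = np.zeros((5, 28, 28), dtype=np.uint8)

# -1 asks NumPy to infer the number of rows from the total element count.
flat = images.reshape(-1, 28 * 28)
print(flat.shape)  # (5, 784)
```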

### This converts the pixel values from integers to 32-bit floats, so that they can hold fractional values once scaled.

In [None]:
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

### Here, the training and testing data are normalized to the range 0 to 1 by dividing by 255, since every pixel value lies between 0 and 255.

In [None]:
#normalize in [0,1]
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

60000 train samples
10000 test samples
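The following sketch shows why the float conversion comes before the division: NumPy refuses to store the result of an in-place true division in an integer array. The three pixel values are arbitrary examples.

```python
import numpy as np

pixels = np.array([0, 128, 255], dtype=np.uint8)

try:
    pixels /= 255   # in-place division cannot write float results into uint8
except TypeError as err:
    print('in-place division failed:', type(err).__name__)

# Convert to float first, then scale into [0, 1].
scaled = pixels.astype('float32') / 255
print(scaled.min(), scaled.max())
```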


One-hot encoding (OHE)

We are going to use OHE as a simple tool to encode information used inside neural networks. In many applications it is convenient to transform categorical (non-numerical) features into numerical variables. For instance, the categorical feature "digit" with value d in [0, 9] can be encoded into a binary vector with 10 positions, which is 0 everywhere except at the d-th position, where a 1 is present.

For example, the digit 3 can be encoded as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. This type of representation is called One-hot encoding, or sometimes simply one-hot, and is very common in data mining when the learning algorithm is specialized in dealing with numerical functions.

In [None]:
#one-hot
Y_train = tf.keras.utils.to_categorical(Y_train, NB_CLASSES)
Y_test = tf.keras.utils.to_categorical(Y_test, NB_CLASSES)
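For illustration only, the encoding described above can be reproduced by hand in NumPy. The function `one_hot` is a hypothetical helper written for this sketch and is not the Keras implementation of `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype='float32')
    encoded[np.arange(labels.size), labels] = 1.0   # set the d-th position
    return encoded

print(one_hot([3], 10)[0])  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```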

### Sequential groups a linear stack of layers into a tf.keras.Model. Sequential provides training and inference features on this model. A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
#### A Sequential model is not appropriate when:
#### •	Your model has multiple inputs or multiple outputs
#### •	Any of your layers has multiple inputs or multiple outputs
#### •	You need to do layer sharing
#### •	You want non-linear topology (e.g. a residual connection, a multi-branch model)
#### Reference - https://keras.io/guides/sequential_model/

In [None]:
#build the model
model = tf.keras.models.Sequential()

### The final (and here the only) layer is a dense layer of 10 neurons with activation function "softmax", which is a generalization of the sigmoid function. A sigmoid function output is in the range (0, 1) when the input varies in the range (−∞, ∞). Similarly, a softmax "squashes" a K-dimensional vector of arbitrary real values into a K-dimensional vector of real values in the range (0, 1), so that they all add up to 1. In our case, it turns the 10 scores produced by the layer's 10 neurons into one probability per digit.

In [None]:
model.add(keras.layers.Dense(NB_CLASSES,
   input_shape=(RESHAPED,),
   name='dense_layer', 
   activation='softmax'))
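A minimal NumPy sketch of the softmax "squashing" described above. The input scores are arbitrary example values, and subtracting the maximum before exponentiating is a common numerical-stability step added here, not something taken from this notebook.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow in exp."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # arbitrary real-valued scores
probs = softmax(scores)
print(probs, probs.sum())            # each value in (0, 1); they add up to 1
```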

In [None]:
# summary of the model
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_layer (Dense)          (None, 10)                7850      
Total params: 7,850
Trainable params: 7,850
Non-trainable params: 0
_________________________________________________________________
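The parameter count in the summary can be checked by hand: a dense layer has one weight per (input, neuron) pair plus one bias per neuron. The constants are repeated here so the snippet runs on its own.

```python
RESHAPED = 784    # inputs per sample
NB_CLASSES = 10   # output neurons

weights = RESHAPED * NB_CLASSES   # one weight per (input, neuron) pair
biases = NB_CLASSES               # one bias per neuron
print(weights + biases)           # 7850
```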


### The model.compile() method configures the neural network model for training.
### There are a few choices to be made during compilation. First, we need to select an optimizer, which is the specific algorithm used to update weights while we train our model. Second, we need to select an objective function, which is used by the optimizer to navigate the space of weights (objective functions are frequently called either loss functions or cost functions). Third, we need to choose the metrics used to evaluate the trained model.
### optimizer - Optimizers are algorithms or methods used to change the attributes of your neural network, such as its weights, in order to reduce the loss.
### Various optimizers - https://medium.com/@sdoshi579/optimizers-for-training-neural-network-59450d71caf6
### Loss - The loss function is one of the important components of a neural network. The loss is the prediction error of the network, and the method used to calculate it is called the loss function. In simple words, the loss is the quantity whose gradients are used to update the weights.
#### Reference - https://towardsdatascience.com/understanding-different-loss-functions-for-neural-networks-dd1ed0274718
### Metrics - A metric is a function that is used to judge the performance of your model. Metric functions are similar to loss functions, except that the results from evaluating a metric are not used when training the model. Note that you may use any loss function as a metric.
#### Different metrics and reference - https://keras.io/api/metrics/

In [None]:
# compiling the model
model.compile(optimizer='SGD', 
              loss='categorical_crossentropy',
              metrics=['accuracy'])
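The choice of `categorical_crossentropy` follows from the one-hot labels produced earlier: it expects one-hot targets, while `sparse_categorical_crossentropy` expects integer labels. The sketch below computes both forms by hand in NumPy for one hypothetical prediction to show they give the same value. It is an illustration of the formula, not the Keras implementation.

```python
import numpy as np

probs = np.array([0.05, 0.05, 0.7, 0.2])   # hypothetical softmax output
label = 2                                   # integer label (sparse form)
one_hot = np.eye(len(probs))[label]         # same label in one-hot form

# Categorical form: sum over classes of -target * log(probability).
categorical = -np.sum(one_hot * np.log(probs))
# Sparse form: pick the log-probability of the true class directly.
sparse = -np.log(probs[label])
print(categorical, sparse)                  # the two values agree
```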

In [None]:
#training the model
model.fit(X_train, Y_train,
        batch_size=BATCH_SIZE, epochs=EPOCHS,
        verbose=VERBOSE, validation_split=VALIDATION_SPLIT)

Epoch 1/200
[per-epoch progress lines for Epoch 2/200 through Epoch 77/200 omitted]
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7feab0046550>

In [None]:
#evaluate the model
test_loss, test_acc = model.evaluate(X_test, Y_test)
print('Test accuracy:', test_acc)

Test accuracy: 0.9223999977111816


In [None]:
# making prediction
predictions = model.predict(X_test)
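`model.predict` returns one row of class probabilities per sample. A sketch of turning such rows into digit labels and an accuracy figure follows. The arrays are small made-up examples with 4 classes, not output from the model above.

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples and 4 classes.
sample_predictions = np.array([[0.1, 0.7, 0.1, 0.1],
                               [0.6, 0.2, 0.1, 0.1],
                               [0.2, 0.2, 0.5, 0.1]])
# One-hot ground truth for the same 3 samples.
sample_truth = np.array([[0, 1, 0, 0],
                         [0, 0, 1, 0],
                         [0, 0, 1, 0]])

predicted_labels = np.argmax(sample_predictions, axis=1)  # most probable class
true_labels = np.argmax(sample_truth, axis=1)             # undo the one-hot encoding
accuracy = np.mean(predicted_labels == true_labels)       # fraction of correct rows
print(predicted_labels, accuracy)
```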

**Observations**

We achieved a training accuracy of 90.4%, a validation accuracy of 91.1% and a test accuracy of 92.24% with a single dense layer and 200 epochs.