# MNIST Dataset

The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image). 

The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes. 

Our goal is to build a neural network with 2 hidden layers.

In [147]:
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds
# these datasets will be stored in C:\Users\*USERNAME*\tensorflow_datasets\...
# the first time you download a dataset, it is stored in the respective folder
# every other time, the copy already on your computer is loaded automatically

#satha is my username

## Data

In [148]:
mnist_dataset,mnist_info = tfds.load(name='mnist',with_info = True, as_supervised = True)
# as_supervised=True loads the data in a 2-tuple structure (input, target)
# with_info=True additionally returns an info object describing the version, features and number of samples in the dataset



The MNIST dataset has only train (60,000) and test (10,000) splits and no validation split.
So we take a percentage of the training set as the validation set.

In [149]:
mnist_train , mnist_test = mnist_dataset['train'],  mnist_dataset['test']


no_of_train_sample = mnist_info.splits['train'].num_examples
no_of_validation_sample = 0.1 * no_of_train_sample
no_of_validation_sample = tf.cast(no_of_validation_sample,tf.int64)
# 10% of the sample count is not guaranteed to be a whole number, and a non-integer count may cause an error while splitting
# so we cast it to an integer

no_of_test_sample = tf.cast(mnist_info.splits['test'].num_examples,tf.int64)
# Notice that we merged two lines of code into one, i.e. taking split value and casting it into integer

Now we scale our data to make it numerically stable, with inputs between 0 and 1.
The method 'dataset.map(*func*)' applies a custom transformation to a given dataset. It takes a function as input.
Note: the function passed to map must take an image and a label as input and return an image and a label.

In [150]:
def scale(image,label):
    image = tf.cast(image,tf.float32)/255.     # the trailing '.' makes 255 a float, so the result stays a float
    return image,label

scaled_train_and_validation = mnist_train.map(scale)
scaled_test = mnist_test.map(scale)

#### Explanation of above code

Our data is in the form of a 28 * 28 matrix of pixels, each containing a value between 0 and 255 (representing 256 shades of grey), where 0 is absolute black and 255 is absolute white. Hence we divide by 255 to bring every value into the range 0 to 1.
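As a quick sanity check of the scaling step, here is a minimal NumPy sketch. The 2x2 "image" is made up for illustration and is not real MNIST data; it only shows that casting to float and dividing by 255 maps the range 0..255 onto 0..1.

```python
import numpy as np

# Hypothetical 2x2 greyscale "image" with uint8 pixel values (not real MNIST data).
image = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Cast first, then divide: this mirrors tf.cast(image, tf.float32) / 255.
scaled = image.astype(np.float32) / 255.0

print(scaled.min(), scaled.max())
```

Casting before dividing matters: the result is a float array whose smallest and largest values are 0 and 1.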

#### Why is Shuffling needed?

Our dataset may be sorted in ascending/descending order of the targets, so when making batches, samples with the same target could end up in one batch, which hinders learning. So we must shuffle the data as uniformly as possible.

When we are dealing with an enormous dataset, we can't shuffle the whole of it at once (as we cannot load the entire dataset into memory). To solve this problem we define a 'buffer_size': TensorFlow fills a buffer with that many samples, draws samples at random from the buffer, and refills the buffer with the next samples from the dataset as it goes.

In [151]:
#buffersize = 1 -> no shuffling takes place
#buffersize >= no. of samples  -> shuffled at once , uniformly
#1 < buffersize < no. of samples -> optimal computing power
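
The three cases above can be illustrated with a small pure-Python simulation of a shuffle buffer. This is a simplified sketch of the idea, not TensorFlow's actual implementation; the function name `buffered_shuffle` and the toy data are assumptions made for illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a randomly chosen buffered element
        buffer[idx] = item               # and replace it with the incoming one
    rng.shuffle(buffer)                  # input exhausted: drain what is left
    yield from buffer

data = list(range(10))
print(list(buffered_shuffle(data, buffer_size=1)))    # order is unchanged
print(list(buffered_shuffle(data, buffer_size=3)))    # only locally shuffled
print(list(buffered_shuffle(data, buffer_size=10)))   # shuffled over the whole list
```

With a buffer of 3, the first emitted element can only come from the first 3 items, which is why a small buffer on a sorted dataset still leaves similar targets close together.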

In [152]:
buffer_size = 10_000

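# by default shuffle() reshuffles on every pass over the data, so take()/skip() would pick different
# samples each epoch and validation samples could leak into training; fixing the order avoids this
shuffled_train_and_validation = scaled_train_and_validation.shuffle(buffer_size, reshuffle_each_iteration=False)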

# now we split the training data to get validation dataset
validation_data = shuffled_train_and_validation.take(no_of_validation_sample)
# takes the first 'no_of_validation_sample' from the dataset

# now we need training data without containing any validation data
train_data = shuffled_train_and_validation.skip(no_of_validation_sample)
# skips the first 'no_of_validation_sample' from the dataset and selects remaining data
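
To make the take/skip split concrete, here is a minimal sketch on a plain Python list, using `itertools.islice` as a stand-in for the dataset methods. The 20 integers are a hypothetical stand-in for shuffled samples.

```python
from itertools import islice

samples = list(range(20))                  # hypothetical stand-in for 20 shuffled samples
n_validation = int(0.1 * len(samples))     # 10% of the samples, cast to an integer

validation = list(islice(samples, n_validation))      # like .take(n): the first n samples
train = list(islice(samples, n_validation, None))     # like .skip(n): everything after them

print(len(validation), len(train))
```

The two parts never overlap as long as both are cut from the same fixed ordering.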


To train our model we will be using mini-batch gradient descent.

In [153]:
# batch_size = 1 -> Stochastic Gradient Descent in the strict sense (one sample per update: noisy and slow, rarely useful)
# batch_size = no. of samples -> (full-batch) Gradient Descent (exact gradient, but time consuming)
# 1 < batch_size < no. of samples  -> Mini-batch gradient descent (in practice this is what is usually called Stochastic Gradient Descent)

In [154]:
batch_size = 100

# we use 'dataset.batch(batch_size)' to combine the consecutive elements of a dataset into batches
# we also overwrite train_data, as the unbatched version is no longer required

train_data = train_data.batch(batch_size)
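
As a small worked example of what batching implies here: with 60,000 training samples, 10% held out for validation and a batch size of 100, one epoch consists of 540 batches, which matches the `540/540` step counter in the training log further down. The helper `batches` is a hypothetical illustration of how consecutive elements are grouped, not the TensorFlow implementation.

```python
import math

no_of_train_sample = 60_000                            # MNIST train split size
no_of_validation_sample = no_of_train_sample // 10     # 10% held out for validation
batch_size = 100

remaining = no_of_train_sample - no_of_validation_sample
steps_per_epoch = math.ceil(remaining / batch_size)
print(remaining, steps_per_epoch)  # 54000 540

def batches(items, size):
    """Group consecutive elements into lists of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

toy_batches = list(batches(list(range(250)), 100))
print([len(b) for b in toy_batches])  # [100, 100, 50]
```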

#### What about validation data?

When batching we compute the AVERAGE loss over each batch, but during validation and testing we want the exact values, so we take all the data at once. We do not backpropagate on validation data, we only forward propagate, so we don't really need to batch it. BUT our model expects both the validation and the test data to be in batches.

In [155]:
validation_data = validation_data.batch(no_of_validation_sample)
test_data = scaled_test.batch(no_of_test_sample)
# here both are of single batch
# batching adds a new leading dimension to our tensors, indicating that the model should take the entire dataset at once when it uses it

In [156]:
validation_inputs, validation_targets = next(iter(validation_data))
# our validation data must have the same shape and object properties as the train and test data.
# MNIST data is iterable and in 2 tuple format(as_supervised = True)
# so we must extract and convert the validation inputs and targets appropriately

# iter() creates an object which can be iterated one element at a time
# next() loads the next batch, as there is only one batch it will load inputs and targets
# next() for only one batch is unnecessary but it is done to follow a consistent pattern when working with datasets
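
The `next(iter(...))` pattern is plain Python and can be tried on any iterable. A minimal sketch, using a hypothetical one-element list in place of the single-batch dataset:

```python
# Hypothetical dataset holding a single (inputs, targets) batch.
single_batch_dataset = [([0.1, 0.5, 0.9], [7, 2, 1])]

iterator = iter(single_batch_dataset)   # iterator over the batches
inputs, targets = next(iterator)        # first (and only) batch, unpacked into two names

# A second next() would raise StopIteration; passing a default avoids the exception.
leftover = next(iterator, None)
print(inputs, targets, leftover)
```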

## Model

##### Understanding the initials 
We have 784 inputs in our neural network, because each 28 * 28 pixel image is flattened into one vector. We also have 10 outputs, one per digit 0 to 9. In this case we take 2 hidden layers of 200 units each. Since both the width and the depth of a neural network are hyperparameters, we can change these values to look for more accurate results.

In [157]:
input_size = 784
output_size = 10
hidden_layer_size = 200
# we have taken the size of all hidden layers to be the same, i.e. 200
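
To get a feel for how large this network is, the number of trainable parameters can be worked out by hand from the sizes set in the cell above. The helper `dense_params` is a hypothetical name used only for this sketch.

```python
input_size = 784
hidden_layer_size = 200
output_size = 10

def dense_params(n_in, n_out):
    # One weight per input-output pair plus one bias per output unit.
    return n_in * n_out + n_out

layer_sizes = [input_size, hidden_layer_size, hidden_layer_size, output_size]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [157000, 40200, 2010] 199210
```

Most of the parameters sit in the first dense layer, because it connects all 784 pixels to every hidden unit.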

In [158]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28,1)),  # input layer: flattens each 28x28x1 image into a vector of 784 values
    tf.keras.layers.Dense(hidden_layer_size,activation='relu'),  # from the input layer to the first hidden layer
    tf.keras.layers.Dense(hidden_layer_size,activation='tanh'),  # first hidden layer to second hidden layer
    tf.keras.layers.Dense(output_size,activation='softmax'),  # second hidden layer to output layer; softmax turns the outputs into class probabilities
])
# tf.keras.Sequential() lays down the model by stacking layers
# first layer -> input layer
#     our data is 28*28*1 (a tensor of rank 3), so we need to flatten each image into a single vector
# second layer onwards -> each consecutive layer of the neural network
#     it takes the inputs, calculates the dot product of the inputs and the weights, adds the bias and applies the activation function
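
What the stacked layers compute can be sketched in plain NumPy. This is an illustrative forward pass only: the weights are random (an assumption, untrained), the 4 input "images" are random noise rather than MNIST digits, and the helper names are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, activation):
    # Dot product of inputs and weights, plus bias, then the activation.
    return activation(x @ w + b)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical batch of 4 random "images" and randomly initialised weights.
images = rng.random((4, 28, 28, 1))
w1, b1 = rng.normal(0, 0.05, (784, 200)), np.zeros(200)
w2, b2 = rng.normal(0, 0.05, (200, 200)), np.zeros(200)
w3, b3 = rng.normal(0, 0.05, (200, 10)), np.zeros(10)

x = images.reshape(len(images), -1)         # Flatten: (4, 28, 28, 1) -> (4, 784)
h1 = dense(x, w1, b1, relu)                 # first hidden layer
h2 = dense(h1, w2, b2, np.tanh)             # second hidden layer
probs = dense(h2, w3, b3, softmax)          # output layer: each row sums to 1

predicted_digits = probs.argmax(axis=1)     # argmax turns probabilities into a class
print(probs.shape, predicted_digits.shape)
```

Softmax is the activation of the output layer, while argmax is applied afterwards to read off the predicted digit.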

##### Now we optimise our model and choose suitable loss function

In [159]:
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
# Adam optimizer
#     it adapts the learning rate for each parameter during training and also uses 'momentum'
# Cross entropy
#     'sparse' is used because the targets are integer labels (0 to 9) rather than one-hot vectors
#     the usual squared residual learns slowly on classification problems; cross entropy punishes confident wrong predictions with large loss values, which the squared residual does not
# metrics=['accuracy']
#     throughout the training and testing processes it reports the 'accuracy'
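
The difference between the two loss functions can be seen on a toy prediction. The sketch below is a hand-rolled illustration, not the Keras implementation; the probability vectors are made up, and the squared error is computed against a one-hot version of the target.

```python
import math

def sparse_cross_entropy(probs, target):
    # Loss is the negative log of the probability assigned to the true class.
    return -math.log(probs[target])

def squared_error(probs, target):
    # Squared residual against a one-hot target vector.
    return sum((p - (1.0 if i == target else 0.0)) ** 2 for i, p in enumerate(probs))

target = 3
confident_right = [0.01] * 10
confident_right[3] = 0.91      # model is sure the digit is 3, and it is
confident_wrong = [0.01] * 10
confident_wrong[8] = 0.91      # model is sure the digit is 8, but the truth is 3

for name, probs in [("right", confident_right), ("wrong", confident_wrong)]:
    print(name, round(sparse_cross_entropy(probs, target), 3),
          round(squared_error(probs, target), 3))
```

For the confident wrong prediction the cross entropy is about 4.6 and grows without bound as the true-class probability approaches 0, whereas the squared error can never exceed 2.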

##### Now we fit the data into the model

In [160]:
no_of_epochs = 5
# we dedicate variables to hyperparameters like input, output, buffer sizes to easily spot them when we fine tune the model
model.fit(train_data, epochs=no_of_epochs,validation_data=(validation_inputs,validation_targets),verbose = 2)

Epoch 1/5
540/540 - 4s - loss: 0.2554 - accuracy: 0.9246 - val_loss: 0.1212 - val_accuracy: 0.9658 - 4s/epoch - 7ms/step
Epoch 2/5
540/540 - 3s - loss: 0.0977 - accuracy: 0.9701 - val_loss: 0.0966 - val_accuracy: 0.9712 - 3s/epoch - 6ms/step
Epoch 3/5
540/540 - 3s - loss: 0.0654 - accuracy: 0.9797 - val_loss: 0.0706 - val_accuracy: 0.9795 - 3s/epoch - 6ms/step
Epoch 4/5
540/540 - 3s - loss: 0.0474 - accuracy: 0.9849 - val_loss: 0.0621 - val_accuracy: 0.9802 - 3s/epoch - 6ms/step
Epoch 5/5
540/540 - 3s - loss: 0.0356 - accuracy: 0.9885 - val_loss: 0.0433 - val_accuracy: 0.9875 - 3s/epoch - 6ms/step


<keras.callbacks.History at 0x14c054212a0>

##### What happens inside an epoch

1. At the beginning of each epoch, the training loss will be set to 0.
2. The algorithm will iterate over a preset number of batches, all from train_data.
3. The weights and biases will be updated as many times as there are batches.
4. We will get a value for the loss function, indicating how the training is going.
5. We will also see a training accuracy, due to the addition of metrics=['accuracy'].
6. At the end of the epoch, the algorithm will forward propagate the whole validation set.
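
The steps above can be sketched as a hand-written training loop. This is a simplified illustration, not what Keras does internally: it trains a single softmax layer with plain gradient descent on made-up toy data, and every name and hyperparameter in it is an assumption chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 300 samples, 5 features, 3 classes (not MNIST).
x = rng.normal(size=(300, 5))
y = np.argmax(x[:, :3], axis=1)               # the label depends on the features
x_train, y_train = x[:240], y[:240]
x_val, y_val = x[240:], y[240:]

w, b = np.zeros((5, 3)), np.zeros(3)
batch_size, learning_rate, updates = 40, 0.5, 0

def forward(inputs):
    z = inputs @ w + b
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_and_accuracy(probs, targets):
    loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
    return loss, (probs.argmax(axis=1) == targets).mean()

for epoch in range(3):
    running_loss, n_batches = 0.0, 0          # step 1: reset the training loss
    for start in range(0, len(x_train), batch_size):    # step 2: iterate over the batches
        xb = x_train[start:start + batch_size]
        yb = y_train[start:start + batch_size]
        probs = forward(xb)
        batch_loss, _ = loss_and_accuracy(probs, yb)
        grad = probs.copy()
        grad[np.arange(len(yb)), yb] -= 1.0   # gradient of the cross entropy w.r.t. the logits
        w -= learning_rate * xb.T @ grad / len(yb)       # step 3: one update per batch
        b -= learning_rate * grad.mean(axis=0)
        updates += 1
        running_loss += batch_loss
        n_batches += 1
    # step 6: forward pass only on the whole validation set, no weight update
    val_loss, val_acc = loss_and_accuracy(forward(x_val), y_val)
    print(epoch + 1, running_loss / n_batches, val_loss, val_acc)
```

Note that the reported training loss is an average over batches computed while the weights are still changing, whereas the validation loss is computed once with the weights as they stand at the end of the epoch.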

In [161]:
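# Note:
# validation accuracy is the better estimate of the model's true accuracy, as validation data is never used to update the weights
# training accuracy is the average across the batches of an epoch
# validation accuracy is computed on the whole validation set at the end of the epoch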
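
For reference, accuracy itself is simply the fraction of samples whose most probable class equals the target. A minimal NumPy sketch with made-up probabilities for 4 samples and 3 classes:

```python
import numpy as np

# Hypothetical predicted probabilities for 4 samples over 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
targets = np.array([0, 1, 2, 1])

predictions = probs.argmax(axis=1)            # most probable class per sample
accuracy = (predictions == targets).mean()    # fraction of correct predictions
print(predictions, accuracy)
```

Here three of the four predictions match their targets, so the accuracy is 0.75.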

### Playing with hyperparameters

1. hidden layer width = 200 -> accuracy = 98.78 %
2. 3 hidden layers -> accuracy = 98.67 %
3. 5 hidden layers -> width = 50 (accuracy = 96.88 %), width = 100 (accuracy = 98.22 %), width = 200 (accuracy = 98.45 %), width = 300 (accuracy = 98.20 %)
4. both hidden layers, activation function = 'sigmoid' -> accuracy = 97.35 %
5. 2nd hidden layer, activation function = 'tanh' -> accuracy = 97.42 %
6. both hidden layers, activation function = 'tanh' -> accuracy = 97.03 %
7. both hidden layers, activation function = 'tanh' with hidden layer size = 200 -> accuracy = 98.68 %
8. learning rate = 0.0001 instead of 'adam' -> accuracy = 97.78 %
9. learning rate = 0.0001 instead of 'adam' -> accuracy = 97.85 %

#### Note:

1. The above is just validation accuracy; we still need to check the final accuracy on the test data.
2. We train on the training data and validate with the validation data; this checks that the weights and biases do not overfit the training data.
3. Once we have trained our first model, we fiddle with the hyperparameters (what we did when we were 'playing' with hyperparameters), i.e. we try to find the best hyperparameters.
4. By playing with hyperparameters we didn't find the best hyperparameters in general BUT the hyperparameters that fit our validation dataset best.
5. By fine-tuning them we were overfitting the validation dataset.
6. Test data is the reality check for our model.

## Testing the model

In [163]:
test_loss, test_accuracy = model.evaluate(test_data)



In [165]:
print('Test loss: {0:.2f}\nTest accuracy: {1:.2f}%'.format(test_loss,test_accuracy*100))

Test loss: 0.07
Test accuracy: 97.65%


# Important Note

1. Now we are no longer allowed to change the model.
2. If we start changing the model after this point, the test data will no longer be data that our model has never seen.
3. If we had got a test accuracy of 50 to 60 %, we would know that our model had overfit.
4. But we got a test accuracy (97.65 %) close to the validation accuracy (98.75 %), hence we did not overfit.

# Done