# Deep Neural Network for MNIST Classification

The MNIST problem is a classic problem for getting started with neural networks. It is often referred to as the "Hello World" of deep learning because for most students it is the first deep learning algorithm they see.

The dataset is called MNIST and is used for handwritten digit recognition. You can find more about it on Yann LeCun's website (Director of AI Research, Facebook). He is one of the pioneers of what we've been talking about and of more complex approaches that are widely used today, such as convolutional neural networks (CNNs).

The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image). 

The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.

There are many hyperparameters you can adjust to try to improve the accuracy of the model. Some of them are: 1) hidden layer size (width); 2) number of hidden layers; 3) activation function of the hidden layers (relu, sigmoid, softmax, etc.); 4) batch size; 5) learning rate.

After adjusting the parameters a bit, I was able to reach 98.5% accuracy, but the model then takes 3h40min to train, so I am leaving the original configuration, which reaches roughly 96% accuracy and takes almost no time to train. If you want to try it yourself, here is what I changed:

1) BATCH_SIZE: changed from 100 to 150;

2) hidden_layer_size: changed from 50 to 5000;

3) number of hidden layers: increased from 2 to 10, all of them using the 'relu' activation function;

4) NUM_EPOCHS: increased from 5 to 10.
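
To get a feel for why the larger configuration trains so much more slowly, it helps to count trainable parameters. The sketch below is not part of the notebook: `count_dense_parameters` is a hypothetical helper, and it assumes a plain stack of fully connected layers with biases, fed by 784 inputs and ending in 10 outputs.

```python
def count_dense_parameters(input_size, hidden_layer_size, num_hidden_layers, output_size):
    """Count the weights and biases in a stack of fully connected layers."""
    layer_sizes = [input_size] + [hidden_layer_size] * num_hidden_layers + [output_size]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # One weight per (input, output) pair plus one bias per output unit.
        total += fan_in * fan_out + fan_out
    return total


small_model = count_dense_parameters(784, 50, 2, 10)
large_model = count_dense_parameters(784, 5000, 10, 10)

print(f"2 hidden layers of 50 units:    {small_model:,} parameters")
print(f"10 hidden layers of 5000 units: {large_model:,} parameters")
```

The original configuration has about 42 thousand parameters, while the larger one has about 229 million, which goes a long way toward explaining the difference in training time.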

## Quick notes

This project was made by a data science student, so I wrote all the comments to help myself remember what I am studying, along with explanations of things I should keep in mind while building deep learning models. Many of the comments will not be useful if you are already an experienced data scientist, but they can be really useful to newcomers (such as myself).

Thank you for your attention and have fun.

## Reminders

This project was developed by a data science student and, as you can see, the code is written entirely in English (including the comments), because that is how I am learning about Deep Learning, which makes it easier to write everything in English.
If you have any questions about the content, feel free to ask me directly on GitHub or by email (luccavick31@hotmail.com). Thank you for your attention and have fun.

## Import the relevant packages

In [2]:
import numpy as np
#TensorFlow includes a data provider for MNIST that we'll use.
import tensorflow as tf
import tensorflow_datasets as tfds

#These datasets will be stored in C:\Users\*USERNAME*\tensorflow_datasets\...
#The first time you download a dataset, it is stored in the respective folder.
#On every later run, the copy on your computer is loaded automatically.

## Data

This is where we load and preprocess the data.
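
As a rough preview of the two main preprocessing ideas used in this section (scaling pixels to the range 0 to 1, and holding out 10% of the training data for validation), here is a minimal NumPy sketch. It runs on randomly generated stand-in images rather than on MNIST, and all variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for MNIST: 1,000 fake grayscale images with integer labels 0-9.
images = rng.integers(0, 256, size=(1000, 28, 28, 1), dtype=np.uint8)
labels = rng.integers(0, 10, size=1000)

# Scale pixel values from [0, 255] to [0, 1].
scaled_images = images.astype(np.float32) / 255.0

# Shuffle once, then hold out 10% for validation and keep the rest for training.
order = rng.permutation(len(scaled_images))
num_validation = len(scaled_images) // 10
validation_idx, train_idx = order[:num_validation], order[num_validation:]

validation_images, validation_labels = scaled_images[validation_idx], labels[validation_idx]
train_images, train_labels = scaled_images[train_idx], labels[train_idx]

print(train_images.shape, validation_images.shape)
print(scaled_images.min(), scaled_images.max())
```

Because the indices are shuffled once and then sliced, the training and validation sets in this sketch never share an example.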

In [3]:
#tfds.load loads a dataset (or downloads and then loads if that's the first time you use it) 
#The name of the dataset is the only mandatory argument.
mnist_dataset, mnist_info = tfds.load(name='mnist', with_info=True, as_supervised=True)
#with_info=True makes tfds.load return a tuple (dataset, info), where info holds the version, features and number of samples.
#We store that information in the variable mnist_info.
#as_supervised=True loads the dataset in a 2-tuple structure (input, target); with as_supervised=False,
#each example would be returned as a dictionary instead.

#It's preferable to have our inputs and targets separated.

#Extract the training and testing datasets using the built-in split names.
mnist_train, mnist_test = mnist_dataset['train'], mnist_dataset['test']

#By default, the dataset comes with training and testing splits but no validation split, so we must create one ourselves.

#Start by defining the number of validation samples as a % of the train samples.
#This is also where we make use of mnist_info (we don't have to count the observations).
num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples
#Let's cast this number to an integer, as a float may cause an error along the way.
num_validation_samples = tf.cast(num_validation_samples, tf.int64)

#Store the number of test samples in a dedicated variable (instead of reading it from mnist_info each time).
num_test_samples = mnist_info.splits['test'].num_examples
#Cast it to tf.int64 so that it has the same type as num_validation_samples.
num_test_samples = tf.cast(num_test_samples, tf.int64)


#Normally, we would scale the data in some way to make the result more numerically stable; in this case
#we simply want the inputs to be between 0 and 1.
#Define a function called scale that takes an MNIST image and its label.
def scale(image, label):
    #Make sure the values are floats.
    image = tf.cast(image, tf.float32)
    #The possible values for the inputs are 0 to 255 (256 different shades of grey),
    #so if we divide each element by 255, all elements will be between 0 and 1.
    image /= 255.

    return image, label


#The method .map() allows us to apply a custom transformation to a given dataset.
#Scale the training data; the validation data will be taken from it further down.
scaled_train_and_validation_data = mnist_train.map(scale)

#Scale the test data (it is batched further down).
#It has to be on the same scale as the training and validation data.
#There is no need to shuffle it, because we won't be training on the test data.
#It will later be placed in a single batch, equal to the size of the test data.
test_data = mnist_test.map(scale)


#Shuffle the data (except the test data).

#A buffer size is needed when dealing with enormous datasets, because the whole dataset may not fit in memory
#when we shuffle it.
#If BUFFER_SIZE = 1 => no shuffling will actually happen.
#If BUFFER_SIZE >= number of samples => shuffling is uniform.
#If BUFFER_SIZE is in between => a computational optimization that approximates uniform shuffling.
BUFFER_SIZE = 10000

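#There is a shuffle method readily available; we just need to specify the buffer size.
#Note: by default shuffle() reshuffles on every pass over the data, so the take()/skip() split below is
#redrawn each epoch; passing reshuffle_each_iteration=False would keep the split fixed.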
shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE)

#Once we have scaled and shuffled the data, we can extract the training and validation sets.
#Our validation data is equal to 10% of the training set.
#We use the .take() method to take that many samples.
#Further down, it is batched as a single batch equal to the total number of validation samples.
validation_data = shuffled_train_and_validation_data.take(num_validation_samples)

#Similarly, the train_data is everything else, so we skip as many samples as there are in the validation dataset
train_data = shuffled_train_and_validation_data.skip(num_validation_samples)

#Set the batch size.
BATCH_SIZE = 100

#Batch the train data.
train_data = train_data.batch(BATCH_SIZE)

validation_data = validation_data.batch(num_validation_samples)

#Batch the test data.
test_data = test_data.batch(num_test_samples)


#Take the next batch (it is the only batch).
#Because as_supervised=True, we get a 2-tuple structure.
validation_inputs, validation_targets = next(iter(validation_data))
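
The effect of BUFFER_SIZE described in the comments above can be illustrated with a toy shuffle buffer in plain Python. This is a simplified model of the idea, not TensorFlow's actual implementation; `buffered_shuffle` is a hypothetical helper written only for this illustration, and it assumes `buffer_size` is at least 1.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in the order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


data = list(range(10))
print(list(buffered_shuffle(data, buffer_size=1)))   # buffer of 1: order unchanged
print(list(buffered_shuffle(data, buffer_size=3)))   # small buffer: only a partial shuffle
print(list(buffered_shuffle(data, buffer_size=10)))  # buffer holds everything: full shuffle
```

With a buffer of 1 there is only ever one candidate to emit, so the input order is preserved. Once the buffer is as large as the dataset, every element is in the buffer before anything is emitted, and the result is an ordinary uniform shuffle.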

## Model

### Outline the model
When thinking about a deep learning algorithm, we mostly imagine building the model.

In [4]:
input_size = 784
output_size = 10
#Use the same hidden layer size for both hidden layers. Not a necessity.
#You can experiment with the hidden layer size: increasing it makes each epoch take longer to train, but the model
#tends to become more accurate, up to a point where the accuracy no longer improves significantly while each epoch
#takes a long time to train.
hidden_layer_size = 50
    
#Define what the model will look like.
model = tf.keras.Sequential([
    
    #The first layer (input layer).
    #Each observation is 28x28x1 pixels, therefore it is a tensor of rank 3.
    #Since this is not a CNN, we cannot feed such an input into our net directly, so we must flatten the images.
    #There is a convenient layer, 'Flatten', that simply takes our 28x28x1 tensor and orders it into a
    #(28x28x1,) = (784,) vector per image (Keras reports the shape as (None, 784), where None is the batch dimension).
    #This allows us to actually create a feed-forward neural network.
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)), # input layer
    
    #We want to create only 2 hidden layers.
    #tf.keras.layers.Dense basically implements: output = activation(dot(input, weight) + bias).
    #It takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function.
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    #You can create more hidden layers by copying and pasting the line above.
    
    #The final layer is no different; we just make sure to activate it with softmax.
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])
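
To make the formula output = activation(dot(input, weight) + bias) concrete, here is a minimal NumPy sketch of the same architecture (flatten, two ReLU hidden layers of 50 units, a softmax output of 10 units). It is only an illustration of the forward pass: the weights are random rather than trained, and the helper names `relu`, `softmax` and `dense` are defined here, not taken from Keras.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def softmax(x):
    # Subtract the row-wise maximum for numerical stability.
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def dense(inputs, weights, bias, activation):
    return activation(inputs @ weights + bias)


rng = np.random.default_rng(seed=0)
batch = rng.random((4, 28, 28, 1))       # 4 fake images with values in [0, 1)
flat = batch.reshape(len(batch), -1)     # Flatten: (4, 28, 28, 1) -> (4, 784)

# Randomly initialised weights stand in for trained ones.
sizes = [784, 50, 50, 10]
weights = [rng.normal(0.0, 0.05, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

hidden_1 = dense(flat, weights[0], biases[0], relu)
hidden_2 = dense(hidden_1, weights[1], biases[1], relu)
probabilities = dense(hidden_2, weights[2], biases[2], softmax)

print(flat.shape, probabilities.shape)
print(probabilities.sum(axis=1))
```

Each row of `probabilities` sums to 1, which is why softmax is used on the output layer: the 10 outputs can be read as the model's probability for each digit.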

### Choose the optimizer and the loss function

In [5]:
#compile() accepts the optimizer, the loss function and the metrics as parameters.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
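
The "sparse" in sparse_categorical_crossentropy means that the targets are plain integer class indices (0 to 9 here) rather than one-hot vectors. A minimal NumPy sketch of the underlying calculation is shown below; it is a simplified illustration (for instance, it does not clip probabilities away from zero), and the function name is just a local helper.

```python
import numpy as np


def sparse_categorical_crossentropy(targets, probabilities):
    """Mean of -log(p[true class]) over the batch; targets are integer class indices."""
    rows = np.arange(len(targets))
    true_class_probabilities = probabilities[rows, targets]
    return float(np.mean(-np.log(true_class_probabilities)))


# Three made-up predictions over 3 classes (each row sums to 1) and their integer labels.
probabilities = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
])
targets = np.array([0, 1, 2])

loss = sparse_categorical_crossentropy(targets, probabilities)

# The same loss computed from one-hot targets, for comparison.
one_hot_targets = np.eye(3)[targets]
one_hot_loss = float(np.mean(-np.sum(one_hot_targets * np.log(probabilities), axis=1)))

print(f"sparse loss: {loss:.4f}, one-hot loss: {one_hot_loss:.4f}")
```

Both versions give the same number; the sparse form simply avoids building the one-hot matrix.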

### Training
This is where we train the model we have built.

In [6]:
#Determine the maximum number of epochs
NUM_EPOCHS = 5

#Fit the model, specifying the training data, the total number of epochs, and the validation data we just created
#in the format (inputs, targets).
model.fit(train_data, epochs=NUM_EPOCHS, validation_data=(validation_inputs, validation_targets), verbose=2)

Epoch 1/5
540/540 - 8s - loss: 0.4171 - accuracy: 0.8835 - val_loss: 0.2250 - val_accuracy: 0.9360
Epoch 2/5
540/540 - 7s - loss: 0.1944 - accuracy: 0.9430 - val_loss: 0.1608 - val_accuracy: 0.9525
Epoch 3/5
540/540 - 7s - loss: 0.1453 - accuracy: 0.9572 - val_loss: 0.1262 - val_accuracy: 0.9637
Epoch 4/5
540/540 - 7s - loss: 0.1177 - accuracy: 0.9650 - val_loss: 0.1115 - val_accuracy: 0.9640
Epoch 5/5
540/540 - 7s - loss: 0.0988 - accuracy: 0.9708 - val_loss: 0.1032 - val_accuracy: 0.9697


<tensorflow.python.keras.callbacks.History at 0x24d58b5a9c8>
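
The `540/540` in the log is the number of batches processed per epoch. It follows from the sizes used above: the standard MNIST training split has 60,000 images, 10% of which are held out for validation, and the rest are batched 100 at a time. The arithmetic can be checked with a few lines of Python (the variable names are illustrative):

```python
import math

num_train_examples = 60_000   # size of the MNIST 'train' split
validation_fraction = 0.1
batch_size = 100

num_validation_samples = int(validation_fraction * num_train_examples)
num_training_samples = num_train_examples - num_validation_samples
steps_per_epoch = math.ceil(num_training_samples / batch_size)

print(num_validation_samples, num_training_samples, steps_per_epoch)
```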

## Test the model

After training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.

It is very important to realize that repeatedly adjusting the hyperparameters based on validation results gradually overfits the validation dataset.

The test is the absolute final instance. You should not test before you are completely done with adjusting your model.

If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.

In [7]:
test_loss, test_accuracy = model.evaluate(test_data)



In [8]:
#Print the loss and accuracy on the test dataset.
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))

Test loss: 0.12. Test accuracy: 96.19%
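
The accuracy reported above is the fraction of test images whose most probable class matches the label. A minimal NumPy sketch of that calculation, using made-up probabilities for a 3-class problem (the helper name `accuracy_from_probabilities` is hypothetical):

```python
import numpy as np


def accuracy_from_probabilities(probabilities, targets):
    """Fraction of rows whose highest-probability class equals the target."""
    predicted_classes = probabilities.argmax(axis=1)
    return float(np.mean(predicted_classes == targets))


probabilities = np.array([
    [0.05, 0.90, 0.05],   # predicts class 1
    [0.60, 0.30, 0.10],   # predicts class 0
    [0.20, 0.50, 0.30],   # predicts class 1
    [0.10, 0.20, 0.70],   # predicts class 2
])
targets = np.array([1, 0, 2, 2])

print(accuracy_from_probabilities(probabilities, targets))  # 3 of 4 correct -> 0.75
```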


Using the initial model and hyperparameters given in this notebook, the final test accuracy should be roughly 96%.

Each time the code is rerun, we get a slightly different accuracy, because the batches are shuffled differently, the weights are initialized differently, etc.

This is still a suboptimal solution, so it still has room to improve.