# TensorFlow Deep Net with Optimizers (Lesson 3)

- We are building a digit classifier that extends lesson 1's shallow net with more hidden layers and better-optimized training techniques
    - ### Layer 1: input layer (28x28 pixels, or 784 inputs)
        - Each pixel has an 8-bit value representing its intensity (0 = white, 255 = black)

    - ### Layers 2-4: three hidden layers of 64 ReLU neurons each

    - ### Layer 5: 10 softmax neurons (output layer)


### Load the required dependencies:

In [1]:
import tensorflow
from tensorflow.keras.datasets import mnist #MNIST handwritten-digit dataset bundled with Keras
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input #added input layer import for specifying input sizes
from tensorflow.keras.layers import Dropout, BatchNormalization #added dropout/batch normalization features
from tensorflow.keras.optimizers import SGD
from matplotlib import pyplot as plt #for visualization purposes

### Load data:

In [2]:
#X holds the inputs (images), y holds the outputs (labels) in the training/validation datasets
(X_train, y_train), (X_valid, y_valid) = mnist.load_data() 
print("successfully loaded MNIST data")

successfully loaded MNIST data


### Preprocessing the data:

In [3]:
#flatten each 28x28 image into a 784-element vector and convert the values from 8-bit unsigned integers to 32-bit floats
X_train = X_train.reshape(60000, 784).astype('float32') 
X_valid = X_valid.reshape(10000,784).astype('float32')

In [4]:
#We just converted the integers to floats so that we can normalize each pixel to a float between 0 and 1
# inputs on a small, consistent scale generally make gradient-based training more stable
X_train /= 255
X_valid /= 255
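
As a small illustration of the two preprocessing steps above, the sketch below applies the same flatten-then-scale sequence to a made-up pair of 2x2 "images" instead of MNIST. The array `tiny_images` is a hypothetical stand-in and is not part of the lesson's data.

```python
import numpy as np

# Hypothetical stand-in for MNIST: two 2x2 "images" of 8-bit pixel values
tiny_images = np.array([[[0, 255], [128, 64]],
                        [[255, 255], [0, 0]]], dtype=np.uint8)

# Flatten each image into a vector and cast to float32 (same idea as cell 3)
flat = tiny_images.reshape(2, 4).astype('float32')

# Scale every pixel into the range [0, 1] (same idea as cell 4)
flat /= 255

print(flat.shape, flat.dtype)
print(flat.min(), flat.max())
```

The cast has to come before the division: an in-place `/=` on a `uint8` array cannot store fractional results.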

In [5]:
X_valid[0] #should see the normalized values (all 784 values in this 1D array are now between 0 and 1)

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [6]:
num_digits = 10 # the number of possible digits (0-9)

# Convert the integer label into a one-hot encoding
y_train = to_categorical(y_train, num_digits)
y_valid = to_categorical(y_valid, num_digits)


In [7]:
y_valid[0] #7 is now encoded by an array of size 10, where all elements are 0 except for index 7, which is 1

array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.])

We use one-hot encoding because it is the ideal (target) output of the neural network when it is fed an image of a 7 (or whatever digit we give it). We can interpret this format as a probability of 1 (100%) that the input digit is a 7, while every other digit has a probability of 0.
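
To see what the encoding does without Keras, here is a minimal NumPy sketch. The helper `one_hot` is a hypothetical name used only for illustration; it reproduces the result of the encoding step for integer labels, and is not a claim about how `to_categorical` is implemented internally.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a 1 at each label's index."""
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype='float32')[np.asarray(labels)]

encoded = one_hot([7, 2, 1], 10)
print(encoded[0])      # one-hot vector for the label 7
print(encoded.shape)
```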

#### Design Neural Network Architecture

In [8]:
model = Sequential()

# input layer:
model.add(Input(shape=(784,)))

#shape specifies how many inputs the model should expect (784 for the 28x28 size input digits)

#first hidden layer:
model.add(Dense(64, activation='relu')) #64 relu neurons 
model.add(BatchNormalization()) #add batch normalization (a Keras layer that normalizes these activations before they reach the next layer)


#second hidden layer:
model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())


#third hidden layer:
model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2)) #only apply dropout after the final hidden layer
# dropout randomly zeroes 20% of these activations during training, which makes it harder for the model to memorize the training data


#output layer:
model.add(Dense(10, activation='softmax')) #10 softmax neurons
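
Batch normalization standardizes each feature across the mini-batch and then applies a learned scale and shift. The sketch below shows only the training-time forward pass in NumPy, under stated assumptions: `batch_norm_forward` is a hypothetical helper, `gamma` and `beta` are fixed at their initial values (1 and 0), and the moving averages that Keras uses at inference time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Training-time batch norm: standardize each column, then scale and shift."""
    mean = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # eps avoids division by zero
    return gamma * x_hat + beta

# Hypothetical activations: a batch of 128 samples with 64 features each
rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(128, 64))

out = batch_norm_forward(activations, gamma=np.ones(64), beta=np.zeros(64))
print(out.mean(axis=0)[:3])  # close to 0 for every feature
print(out.std(axis=0)[:3])   # close to 1 for every feature
```

Whatever scale the previous layer produces, the next layer sees inputs centered near 0 with a spread near 1.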

In [9]:
model.summary()
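
`model.summary()` lists each layer with its parameter count. Those counts can be cross-checked by hand, as in the rough sketch below. It assumes Keras' default `BatchNormalization`, which keeps four values per feature (gamma, beta, moving mean, moving variance), with the two moving statistics counted as non-trainable. The helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    """gamma, beta, moving mean and moving variance for each feature."""
    return 4 * n_features

layer_sizes = [784, 64, 64, 64]  # input followed by the three hidden layers

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    total += dense_params(n_in, n_out) + batch_norm_params(n_out)
total += dense_params(64, 10)    # output layer; Dropout has no parameters

non_trainable = 3 * 2 * 64       # moving mean + variance in 3 batch norm layers
trainable = total - non_trainable
print(total, trainable, non_trainable)
```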

#### Compiling the model: 

In [10]:
# loss: measures how wrong the model's predictions are (categorical cross-entropy compares the predicted probabilities with the one-hot labels)
# optimizer: nadam, a variant of adam (adaptive moment estimation) that adds Nesterov momentum to gradient descent
# accuracy: % of correct guesses that model makes


model.compile(loss='categorical_crossentropy', optimizer='nadam', metrics=['accuracy'])
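
To give a feel for the loss being minimized, here is a minimal NumPy sketch of categorical cross-entropy on a hypothetical 3-class example. The function below is an illustration of the formula, not the Keras implementation, and the clipping constant `eps` is an assumption chosen only to avoid `log(0)`.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample loss: -sum(y_true * log(y_pred)) over the class axis."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# Hypothetical one-hot label: the true class is index 2
y_true = np.array([[0.0, 0.0, 1.0]])

confident = np.array([[0.05, 0.05, 0.90]])  # most probability on the right class
unsure = np.array([[0.40, 0.30, 0.30]])     # probability spread across classes

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Because the label is one-hot, only the probability assigned to the true class contributes, so the loss shrinks toward 0 as that probability approaches 1.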


In [11]:
X_train.shape

(60000, 784)

#### Training the model (post-compilation)

In [12]:
model.fit(X_train, y_train, batch_size=128, epochs=10, verbose=1, validation_data=(X_valid, y_valid)) 
# verbose=1 will produce outputs as model trains

Epoch 1/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.7702 - loss: 0.7573 - val_accuracy: 0.9504 - val_loss: 0.1685
Epoch 2/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9512 - loss: 0.1635 - val_accuracy: 0.9610 - val_loss: 0.1234
Epoch 3/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9635 - loss: 0.1182 - val_accuracy: 0.9670 - val_loss: 0.1056
Epoch 4/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9722 - loss: 0.0920 - val_accuracy: 0.9684 - val_loss: 0.1046
Epoch 5/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9765 - loss: 0.0762 - val_accuracy: 0.9690 - val_loss: 0.1007
Epoch 6/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9799 - loss: 0.0640 - val_accuracy: 0.9706 - val_loss: 0.0974
Epoch 7/10
469/469

<keras.src.callbacks.history.History at 0x17c388220>
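
The `469/469` in each progress line is the number of mini-batches per epoch. It follows directly from the training set size and the `batch_size` passed to `model.fit`, as this small calculation shows:

```python
import math

num_samples = 60000   # rows in X_train
batch_size = 128      # value passed to model.fit

# The last batch may be smaller, so round up rather than down
steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch_size)
```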

### Changes made:
- Added two additional hidden layers in the network (three hidden layers of 64 ReLU neurons each)
    - Now our architecture can be considered a deep learning model (5 layers in total, 3 of them hidden)
- Added batch normalization after each hidden layer, and dropout after the third
- Changed the optimizer from stochastic gradient descent (SGD) to Nadam, a variant of Adam

#### Evaluate the model's overall performance:

In [13]:
model.evaluate(X_valid, y_valid)

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 469us/step - accuracy: 0.9698 - loss: 0.1006


[0.08738873153924942, 0.9736999869346619]

With the improved architecture detailed above, we are able to achieve similar accuracy while training for only half of the original epochs (10 vs. 20).

#### Perform inference (check what model will predict for a given input)

In [14]:
valid_0 = X_valid[0].reshape(1, 784) #just get one input which is 784 (28x28) in size

In [15]:
model.predict(valid_0) #input valid_0 into our deep neural network

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step


array([[5.4791769e-07, 1.5223231e-06, 1.2411983e-05, 5.0981715e-05,
        5.3807480e-07, 9.0136064e-06, 6.1187274e-07, 9.9962544e-01,
        7.1625345e-07, 2.9811155e-04]], dtype=float32)

From the output array, 9.9962544e-01 (99.96%) is the highest probability, which corresponds to index 7. This value is our y-hat output: the model's confidence that the input digit was a 7.

This is a significantly higher y-hat value than in our initial model (92%). All other values are even closer to 0 than before, which is also a sign of improvement.
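
The output can be read as a set of probabilities because the final layer uses softmax, which turns raw scores into positive values that sum to 1. A minimal NumPy sketch follows; the `logits` are hypothetical numbers chosen for illustration, not values taken from the trained model.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 along the last axis."""
    # Subtracting the max does not change the result but avoids overflow in exp
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical raw scores for 4 classes; index 3 is far larger than the rest
logits = np.array([[1.0, 2.0, 0.5, 9.0]])
probs = softmax(logits)

print(probs)
print(probs.sum())
```

One score that clearly dominates the others ends up with nearly all of the probability mass, which is the pattern seen for index 7 above.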

In [16]:
import numpy as np
np.argmax(model.predict(valid_0), axis=-1) #gets the index of the highest probability in the output array, i.e. the predicted digit

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step


array([7])
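
The same `argmax` step extends to a whole batch, and comparing the predicted indices with the label indices gives the accuracy metric by hand. The sketch below uses small hypothetical arrays (3 samples, 3 classes) in place of real model output.

```python
import numpy as np

# Hypothetical predicted probabilities for 3 samples over 3 classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.3, 0.5]])

# Hypothetical one-hot labels for the same 3 samples
labels_one_hot = np.array([[0.0, 1.0, 0.0],
                           [1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])

predicted = np.argmax(probs, axis=-1)         # most likely class per sample
actual = np.argmax(labels_one_hot, axis=-1)   # undo the one-hot encoding

accuracy = np.mean(predicted == actual)       # fraction of correct predictions
print(predicted, actual, accuracy)
```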