# TensorFlow Intermediate Net (Lesson 2)

- We are building a digit classifier based on lesson 1's shallow net, using a deeper architecture and better-suited training settings:
    - ### Layer 1: input layer (28x28 pixels, or 784 inputs)
        - Each pixel is an 8-bit value representing its colour (0 = white, 255 = black)

    - ### Layer 2: 64 ReLU neurons (first hidden layer)

    - ### Layer 3: 64 tanh neurons (second hidden layer)

    - ### Layer 4: 10 softmax neurons (output layer)


### Load the required dependencies:

In [1]:
import tensorflow
from tensorflow.keras.datasets import mnist #MNIST handwritten digit dataset bundled with Keras
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input #added input layer import for specifying input sizes
from tensorflow.keras.optimizers import SGD
from matplotlib import pyplot as plt #for visualization purposes

### Load data:

In [2]:
#x is the inputs, y is the outputs in the training/validation datasets
(X_train, y_train), (X_valid, y_valid) = mnist.load_data() 
print("successfully loaded MNIST data")

successfully loaded MNIST data


### Preprocessing the data:

In [15]:
#flatten each 28x28 image into a 784-element vector and convert from 8-bit unsigned integers to 32-bit floats
X_train = X_train.reshape(60000, 784).astype('float32')
X_valid = X_valid.reshape(10000, 784).astype('float32')

In [16]:
#The pixels were converted to floats above so that dividing by 255 yields fractional values between 0 and 1
#Scaling inputs to a small, consistent range helps gradient descent train the network
X_train /= 255
X_valid /= 255
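The flatten-and-scale step can be tried in isolation without downloading MNIST. The sketch below is only an illustration: the random `images` array is a made-up stand-in for the real data, and the variable names are assumptions, not part of the notebook.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 3 random 28x28 images of 8-bit pixels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Flatten each image to a 784-long vector and cast to float32 before dividing.
flat = images.reshape(len(images), 28 * 28).astype("float32")

# Scale pixel intensities from [0, 255] to [0, 1].
scaled = flat / 255

print(scaled.shape, scaled.dtype)
print(scaled.min(), scaled.max())
```

The cast matters because an in-place `/=` on a `uint8` array cannot store fractional results, so NumPy refuses the operation.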

In [17]:
X_valid[0] #should see the normalized values (all 784 values in this 1D array are now between 0 and 1)

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [18]:
num_digits = 10 # the number of possible digits (0-9)

# Convert the integer label into a one-hot encoding
y_train = to_categorical(y_train, num_digits)
y_valid = to_categorical(y_valid, num_digits)


In [19]:
y_valid[0] #7 is now encoded by an array of size 10, where all elements are 0 except for index 7, which is 1

array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.])

We use one-hot encoding because it matches the ideal output of the neural network: when the input image is a 7, the target is a probability of 1 (100%) for the digit 7 and 0 for every other digit. The softmax output layer produces a vector of the same shape, so the loss function can compare the two directly.
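To make the encoding concrete, here is a minimal NumPy sketch of what one-hot encoding does. The helper name `one_hot` is hypothetical and is not how `to_categorical` is necessarily implemented; it only mirrors the behaviour described above.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Set the column given by each label to 1 in that label's row.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([7, 2, 1], num_classes=10)
print(encoded[0])              # 1.0 at index 7, zeros elsewhere
print(encoded.argmax(axis=1))  # argmax recovers the original labels
```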

#### Design Neural Network Architecture

In [20]:
model = Sequential()

#input layer: each sample is a flattened 28x28 image (784 values)
model.add(Input(shape=(784,)))

#hidden layers:
model.add(Dense(64, activation='relu')) #64 ReLU neurons
model.add(Dense(64, activation='tanh')) #64 tanh neurons

#output layer:
model.add(Dense(10, activation='softmax')) #10 softmax neurons
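For intuition about the three activations named in the model, the following NumPy sketch applies each of them to a small made-up vector `z`. The functions are simplified illustrations written for this note, not the Keras implementations.

```python
import numpy as np

def relu(z):
    # Negative inputs become 0, positive inputs pass through unchanged.
    return np.maximum(0.0, z)

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

z = np.array([-2.0, 0.0, 3.0])  # hypothetical pre-activation values
print(relu(z))      # negatives clipped to 0
print(np.tanh(z))   # squashed into the open interval (-1, 1)
print(softmax(z))   # non-negative values that sum to 1
```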

In [21]:
model.summary()

#### Explanation of the numbers from model.summary():

<b>Total # of parameters in a layer = total # of weights + total # of biases</b>

(784 inputs/neuron * 64 neurons in dense layer) + 64*(1 bias/neuron) = <b>50240 parameters</b> for first dense layer


(64 inputs/neuron * 64 neurons in dense_1 layer) + 64*(1 bias/neuron) = <b>4160 parameters</b> for second dense layer

(64 inputs/neuron * 10 neurons in dense_2 layer) + 10*(1 bias/neuron) = <b>650 parameters</b> for the output layer

Therefore, the total parameters = sum of all parameters in all layers = <b>55,050 parameters</b>
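The same arithmetic can be checked in a few lines of plain Python. The helper name `dense_params` is hypothetical; it just encodes the weights-plus-biases rule stated above.

```python
def dense_params(n_inputs, n_neurons):
    """Weights (one per input per neuron) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

# Input size followed by the size of each dense layer in the model.
layer_sizes = [784, 64, 64, 10]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [50240, 4160, 650]
print(sum(per_layer))  # 55050
```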



#### Compiling the model: 

In [22]:
# loss: measures how far the model's predictions are from the true labels
# SGD: stochastic gradient descent
# learning_rate: step size used for each weight update
# accuracy: fraction of inputs the model classifies correctly


model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=0.1), metrics=['accuracy'])

              



In [23]:
X_train.shape

(60000, 784)

#### Training the model (post-compilation)

In [24]:
model.fit(X_train, y_train, batch_size=128, epochs=20, verbose=1, validation_data=(X_valid, y_valid)) 
# verbose=1 will produce outputs as model trains

Epoch 1/20
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 954us/step - accuracy: 0.7989 - loss: 0.7554 - val_accuracy: 0.9270 - val_loss: 0.2531
Epoch 2/20
469/469 ━━━━━━━━━━━━━━━━━━━━ 0s 865us/step - accuracy: 0.9304 - loss: 0.2367 - val_accuracy: 0.9412 - val_loss: 0.1983
Epoch 3/20
469/469 ━━━━━━━━━━━━━━━━━━━━ 0s 875us/step - accuracy: 0.9466 - loss: 0.1827 - val_accuracy: 0.9512 - val_loss: 0.1603
Epoch 4/20
469/469 ━━━━━━━━━━━━━━━━━━━━ 0s 819us/step - accuracy: 0.9543 - loss: 0.1541 - val_accuracy: 0.9575 - val_loss: 0.1377
Epoch 5/20
469/469 ━━━━━━━━━━━━━━━━━━━━ 0s 741us/step - accuracy: 0.9634 - loss: 0.1271 - val_accuracy: 0.9620 - val_loss: 0.1266
Epoch 6/20
469/469 ━━━━━━━━━━━━━━━━━━━━ 0s 733us/step - accuracy: 0.9679 - loss: 0.1084 - val_accuracy: 0.9604 - val_loss: 0.1247
Epoch 7/20
...

<keras.src.callbacks.history.History at 0x162349100>
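The 469/469 counter in the training log is the number of mini-batches per epoch. As a quick sanity check, assuming the final partial batch is kept rather than dropped, it follows from the training set size and the batch size passed to model.fit:

```python
import math

train_samples = 60000  # rows in X_train
batch_size = 128       # value passed to model.fit
epochs = 20

# The last batch is smaller than 128, so round up.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 469
print(total_updates)    # 9380 weight updates over the whole run
```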

### Changes made:
- Added a second hidden layer to the network
    - Changed the first hidden layer from sigmoid to ReLU, and added a tanh layer after it
- Changed the loss function from mean squared error to categorical cross-entropy
- Increased the SGD learning rate 10x (from 0.01 to 0.1)
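One way to see why the loss change helps is to compare both losses on a confident correct guess and a confident wrong guess. The sketch below uses made-up three-class predictions and simplified NumPy versions of the two losses; the Keras implementations may differ in details such as clipping and batch averaging.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an assumed small constant.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])      # true class is index 1
good = np.array([0.05, 0.90, 0.05])     # confident and correct
bad = np.array([0.90, 0.05, 0.05])      # confident and wrong

for name, pred in [("good", good), ("bad", bad)]:
    print(name, mse(y_true, pred), categorical_crossentropy(y_true, pred))
```

Mean squared error stays below 1 for any prediction here, while cross-entropy grows without bound as the probability assigned to the true class approaches 0, so confident mistakes produce a much stronger training signal.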


#### Evaluate the model's overall performance:

In [25]:
model.evaluate(X_valid, y_valid)

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 372us/step - accuracy: 0.9728 - loss: 0.0941


[0.08191093802452087, 0.9763000011444092]

With the changes detailed above, validation accuracy improved from 86% to 97.63% (the value returned by model.evaluate) while training for only 1/10th of the original epochs (20 vs. 200).

#### Perform inference (check what model will predict for a given input)

In [26]:
valid_0 = X_valid[0].reshape(1, 784) #just get one input which is 784 (28x28) in size

In [27]:
model.predict(valid_0) #input valid_0 into our neural network

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 69ms/step


array([[1.0979418e-06, 1.1746988e-06, 1.2268937e-05, 3.3117918e-05,
        4.2451244e-08, 1.3391328e-07, 3.2943679e-11, 9.9991727e-01,
        1.9236336e-06, 3.2876113e-05]], dtype=float32)

From the output array, 9.9991727e-01 (99.99%) is the highest probability, and it sits at index 7. This value is the network's y-hat output: its confidence that the input digit is a 7.

This is a significantly higher y-hat value than in our initial model (92%). All other values are also closer to 0 than before, which is another sign of improvement.

In [28]:
import numpy as np
np.argmax(model.predict(valid_0), axis=-1) #gets the index of the highest probability in the output array

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step


array([7])
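The prediction vector can also be inspected without the model. The sketch below copies the probabilities printed above into a NumPy array (the variable names are assumptions made for this example) and reads off the predicted digit, the runners-up, and the total probability mass.

```python
import numpy as np

# Softmax output copied from the model.predict(valid_0) result above.
probs = np.array([[1.0979418e-06, 1.1746988e-06, 1.2268937e-05, 3.3117918e-05,
                   4.2451244e-08, 1.3391328e-07, 3.2943679e-11, 9.9991727e-01,
                   1.9236336e-06, 3.2876113e-05]], dtype=np.float32)

predicted = np.argmax(probs, axis=-1)   # index of the largest probability per row
top3 = np.argsort(probs[0])[::-1][:3]   # three most likely digits, best first

print(predicted)    # [7]
print(top3)         # [7 3 9]
print(probs.sum())  # approximately 1, as expected from softmax
```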