
# Neural Network Example

Build a fully connected neural network with two hidden layers (a.k.a. multilayer perceptron) with TensorFlow v2.

This example builds the model step by step with the Keras Sequential API, to make the mechanics behind building neural networks and the training process easier to follow.

## Neural Network Overview

<img src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg" alt="nn" style="width: 400px;"/>

## MNIST Dataset Overview

This example is using MNIST handwritten digits. The dataset contains 60,000 examples for training and 10,000 examples for testing. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels) with values from 0 to 255. 

In this example, each image will be converted to float32, normalized to [0, 1] and flattened to a 1-D array of 784 features (28*28).
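
The preprocessing described above can be sketched on a small stand-in array. This is only an illustration: the `images` array below is randomly generated (an assumption, not real MNIST data), and the names `images` and `features` are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for a batch of MNIST images:
# 3 grayscale images of 28x28 pixels with uint8 values in [0, 255].
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Convert to float32, scale to [0, 1], and flatten each image to 784 features.
features = images.astype("float32") / 255.0
features = features.reshape(len(images), 28 * 28)

print(features.shape)  # (3, 784)
print(features.dtype)  # float32
```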

![MNIST Dataset](http://neuralnetworksanddeeplearning.com/images/mnist_100_digits.png)

In this section we will build a network that can recognize handwritten digits. To achieve this goal, we'll use MNIST (http://yann.lecun.com/exdb/mnist/), a database of handwritten digits made up of a training set of 60,000 examples and a test set of 10,000 examples. The training examples are annotated by humans with the correct answer. For instance, if the handwritten digit is the number "3", then 3 is simply the label associated with that example.

In machine learning, when a dataset with correct answers is available, we can perform a form of supervised learning. In this case we use the training examples to improve our net. Test examples also have the correct answer associated with each digit. Here, however, the idea is to pretend that the label is unknown, let the network make its prediction, and only then compare it against the label to evaluate how well our neural network has learned to recognize digits. The test examples are used only to measure the performance of our net.

Each MNIST image is in grayscale and consists of 28*28 pixels. A subset of these images is shown in the figure above.

## One-hot encoding (OHE)

We are going to use OHE as a simple tool to encode information used inside neural networks. In many applications it is convenient to transform categorical (non-numerical) features into numerical variables. For instance, the categorical feature "digit" with value d in [0, 9] can be encoded into a binary vector with 10 positions, which is 0 everywhere except at the d-th position (counting from 0), where a 1 is present. For example, the digit 3 can be encoded as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. This type of representation is called one-hot encoding, or sometimes simply one-hot, and is very common in data mining when the learning algorithm is specialized in dealing with numerical functions.
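
As a minimal sketch of the encoding just described, the hypothetical helper `one_hot` below builds the binary vectors with NumPy. It is an illustration only; the notebook itself relies on `to_categorical` for this step.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Row i gets a 1 in the column given by labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

print(one_hot([3])[0].astype(int).tolist())  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```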

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow import keras
from tensorflow.keras.utils import model_to_dot  # import from tf.keras rather than standalone keras
from IPython.display import SVG


In [2]:
# Intuitively, EPOCHS defines how many passes over the training data are made,
# BATCH_SIZE is the number of samples you feed in to your network at a time,
# VALIDATION_SPLIT is the fraction of the training data reserved for checking the validity of the training process.

# network and training
EPOCHS = 5
BATCH_SIZE = 256
VERBOSE = 1
NB_CLASSES = 10   # number of outputs = number of digits
N_HIDDEN = 2048
VALIDATION_SPLIT = 0.999  # fraction of TRAIN reserved for VALIDATION; 0.999 leaves only about 60 of the 60,000 samples for training (0.1 to 0.2 is more typical)
DROPOUT = 0.3

In [3]:
def data_summary(X_train, y_train, X_test, y_test):
    """Summarize current state of dataset"""
    print('Train images shape:', X_train.shape)
    print('Train labels shape:', y_train.shape)
    print('Test images shape:', X_test.shape)
    print('Test labels shape:', y_test.shape)
    print('Train labels:', y_train)
    print('Test labels:', y_test)

In [4]:
# Prepare MNIST data.
# Load the dataset (MNIST, using Keras' dataset utilities);
# the data_summary() function above is then used to inspect it.
# The split between train and test is 60,000 and 10,000 examples respectively.
# Labels are returned as integer class indices (0-9); one-hot encoding is applied in a later cell.
mnist = keras.datasets.mnist
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [5]:
# Check state of dataset
data_summary(X_train, Y_train, X_test, Y_test)

Train images shape: (60000, 28, 28)
Train labels shape: (60000,)
Test images shape: (10000, 28, 28)
Test labels shape: (10000,)
Train labels: [5 0 4 ... 5 6 8]
Test labels: [7 2 1 ... 4 5 6]


In [6]:
# X_train is 60000 rows of 28x28 values --> reshaped to 60000 x 784
RESHAPED = 784

# Flatten images to 1-D vectors of 784 features (28*28).
# To feed MNIST instances into a fully connected network, they need to be reshaped
# from a 2-dimensional image representation to a 1-dimensional sequence.
# The class vectors are converted to a binary matrix (using to_categorical) in a later cell.

X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)

# Convert to float32.
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

In [7]:
# Normalize image values from [0, 255] to [0, 1].
X_train, X_test = X_train / 255.0, X_test / 255.0
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

60000 train samples
10000 test samples


In [8]:
# One-hot encode the labels.
# to_categorical(Y_train, NB_CLASSES) converts the integer class vector Y_train into
# a binary matrix with as many columns as there are classes. The number of rows stays the same.
Y_train = tf.keras.utils.to_categorical(Y_train, NB_CLASSES)
Y_test = tf.keras.utils.to_categorical(Y_test, NB_CLASSES)


In [9]:
# Build the model
# Both of the required data transformations have been accomplished.
# Now it's time to build, compile, and train a neural network.
# The Sequential class is used to define a linear stack of network layers which,
# collectively, constitute a model.
# We first call the Sequential constructor, then add layers to the model with the add() method.
# The first such call adds a layer of type Dense ("Just your regular densely-connected NN layer").
# This Dense layer has an output of size N_HIDDEN (2048) and an input of size RESHAPED (784).
# Note that only the first layer of the model requires the input dimension to be explicitly stated;
# the following layers infer it from the previous layer in the stack.
# Following standard practice, the rectified linear unit (ReLU) activation function is used for this layer.

model = tf.keras.models.Sequential()
model.add(keras.layers.Dense(N_HIDDEN,
                             input_shape=(RESHAPED,),
                             name='dense_layer', activation='relu'))
#Now our baseline is 90.81% on the training set, 91.40% on validation, and 91.18%
#on test. A second improvement is very simple. We decide to randomly drop – with
#the DROPOUT probability – some of the values propagated inside our internal dense
#network of hidden layers during training. In machine learning this is a well-known
#form of regularization. Surprisingly enough, this idea of randomly dropping a few
#values can improve our performance. The idea behind this improvement is that
#random dropout forces the network to learn redundant patterns that are useful
#for better generalization:

model.add(keras.layers.Dropout(DROPOUT))

# An initial improvement is to add additional layers to our network because these
# additional neurons might intuitively help it to learn more complex patterns in the
# training data. In other words, additional layers add more parameters, potentially
# allowing a model to memorize more complex patterns. 
#So, after the input layer, we have a first dense layer with N_HIDDEN neurons and an activation function "ReLU."
# This additional layer is considered hidden because it is not directly connected either
# with the input or with the output. After the first hidden layer, we have a second
# hidden layer again with N_HIDDEN neurons followed by an output layer with 10
# neurons, each one of which will fire when the relative digit is recognized. The
# following code defines this new network:

model.add(keras.layers.Dense(N_HIDDEN,
                             name='dense_layer_2', activation='relu'))
model.add(keras.layers.Dropout(DROPOUT))
# The final layer is a Dense layer with NB_CLASSES (10) neurons and the activation function
# "softmax", which is a generalization of the sigmoid function. A sigmoid function
# output is in the range (0, 1) when the input varies in the range (−∞, ∞). Similarly,
# a softmax "squashes" a K-dimensional vector of arbitrary real values into a
# K-dimensional vector of real values in the range (0, 1), so that they all add up to 1.
# In our case, it turns the 10 scores produced by the output neurons into one probability per digit.
# What we have just described is implemented with the following code:

model.add(keras.layers.Dense(NB_CLASSES,
                             name='dense_layer_3', activation='softmax'))
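
To make the softmax description concrete, here is a small NumPy sketch. The function `softmax` and the example `scores` are hypothetical and written for illustration; this is not how Keras implements the activation internally.

```python
import numpy as np

def softmax(z):
    """Map a vector of real scores to values in (0, 1) that sum to 1."""
    z = np.asarray(z, dtype="float64")
    shifted = z - z.max()  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs for 3 classes
probs = softmax(scores)
print(probs)        # the largest score receives the largest probability
print(probs.sum())  # the probabilities add up to 1 (up to rounding)
```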

In [10]:
# summary of the model
model.summary()



Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_layer (Dense)          (None, 2048)              1607680   
_________________________________________________________________
dropout (Dropout)            (None, 2048)              0         
_________________________________________________________________
dense_layer_2 (Dense)        (None, 2048)              4196352   
_________________________________________________________________
dropout_1 (Dropout)          (None, 2048)              0         
_________________________________________________________________
dense_layer_3 (Dense)        (None, 10)                20490     
Total params: 5,824,522
Trainable params: 5,824,522
Non-trainable params: 0
_________________________________________________________________
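
The parameter counts in the summary above can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size, two hidden layers, output layer (Dropout layers add no parameters).
layer_sizes = [784, 2048, 2048, 10]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [1607680, 4196352, 20490]
print(sum(counts))  # 5824522
```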


In [11]:
# compiling the model
# Once we define the model, we have to compile it so that it can be executed by
# TensorFlow 2.0. There are a few choices to be made during compilation. 
# Firstly, we need to select an optimizer, which is the specific algorithm used to update
# weights while we train our model.
# Second, we need to select an objective function, which is used by the optimizer to navigate the space of weights (frequently,
# objective functions are called either loss functions or cost functions and the process of
# optimization is defined as a process of loss minimization). 
# Third, we need to evaluate the trained model.

#Some common choices for objective functions are:
#• MSE, which defines the mean squared error between the predictions and the true values.
#• binary_crossentropy, which defines the binary logarithmic loss.
#• categorical_crossentropy, which defines the multiclass logarithmic loss.

# Some common choices for metrics are:
# • Accuracy, which defines the proportion of correct predictions with respect to the targets
# • Precision, which defines how many of the selected items are relevant for a multi-label classification
# • Recall, which defines how many of the relevant items are selected for a multi-label classification

# With both the training data and the model defined, it's time to configure the learning process.
# This is accomplished with a call to the compile() method of the Sequential model class.
# Compilation requires 3 arguments: an optimizer, a loss function, and a list of metrics.
# In our example, set up as a multi-class classification problem, we will use the Adam optimizer,
# the categorical crossentropy loss function, and include solely the accuracy metric.

# The optimizer is the algorithm used to reduce the mistakes made by the neural network during training.
# TensorFlow implements a fast variant of gradient descent known as SGD and many
# more advanced optimization techniques such as RMSProp and Adam. RMSProp
# and Adam include the concept of momentum.

model.compile(optimizer='Adam', 
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# With the call to compile() made with these arguments, our model now has its learning process configured.
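
As a sketch of what the categorical crossentropy loss measures for a single example, the hypothetical function below compares a one-hot target with a predicted probability vector. The `eps` clipping value is an assumption made here to avoid taking log(0); the example vectors are made up.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # one-hot target for the digit 3
confident = [0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
unsure = [0.1] * 10

print(categorical_crossentropy(y_true, confident))  # about 0.094
print(categorical_crossentropy(y_true, unsure))     # about 2.303, i.e. ln(10)
```

The loss is small when the probability assigned to the correct class is close to 1 and grows as that probability shrinks.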

In [12]:
# Training the model
# At this point we have training data and a fully configured neural network to train with said data.
# All that is left is to pass the data to the model for the training process to commence, a process which is completed by iterating on the training data.
# Training begins by calling the fit() method.
# At minimum, fit() requires 2 arguments: input and target tensors.
# If nothing more is provided, a single pass over the training data is performed, which generally won't do you any good.
# Therefore, it is conventional to define, at a practical minimum,
# a pair of additional arguments: batch_size and epochs. Our example passes these, plus verbose and validation_split.

#Once the model is compiled, it can then be trained with the fit() method, which specifies a few parameters:
# • epochs is the number of times the model is exposed to the training set. At
# each iteration the optimizer tries to adjust the weights so that the objective function is minimized.
# • batch_size is the number of training instances observed before the
# optimizer performs a weight update; there are usually many batches per epoch.
# Training a model in TensorFlow 2.0 is very simple:

model.fit(X_train, Y_train,
          batch_size=BATCH_SIZE, epochs=EPOCHS,
          verbose=VERBOSE, validation_split=VALIDATION_SPLIT)


# Note that accuracy is expected to be modest here: with VALIDATION_SPLIT = 0.999,
# only about 60 of the 60,000 training images are actually used to fit the model.

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fbdb2544438>
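
The interaction between validation_split and batch_size can be estimated with a little arithmetic. The helper `steps_per_epoch` below is hypothetical and only approximates how the split is made (Keras' own rounding may differ by a sample); the 0.2 split in the second call is an assumed comparison value, not one used in this notebook.

```python
import math

def steps_per_epoch(n_samples, validation_split, batch_size):
    """Return (training samples, weight updates per epoch) after the hold-out."""
    n_val = round(n_samples * validation_split)
    n_train = n_samples - n_val
    return n_train, math.ceil(n_train / batch_size)

print(steps_per_epoch(60000, 0.999, 256))  # (60, 1)
print(steps_per_epoch(60000, 0.2, 256))    # (48000, 188)
```

With the settings used here, each epoch amounts to a single weight update on roughly 60 images, so 5 epochs give the optimizer very few updates.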

In [13]:
# Evaluate the model
# Once the model is trained, we can evaluate it on the test set that contains new
# examples never seen by the model during the training phase.
# Note that, of course, the training set and the test set are rigorously separated. There
# is no point evaluating a model on an example that was already used for training.
# In TensorFlow 2.0 we can use the method evaluate(X_test, Y_test) to compute the
# test_loss and the test_acc:
test_loss, test_acc = model.evaluate(X_test, Y_test)
print('Test accuracy:', test_acc)
print('Test loss:', test_loss)


Test accuracy: 0.6654000282287598
Test loss: 1.0234118700027466


In [14]:
# Make predictions on the test set (one row of 10 class probabilities per image)
predictions = model.predict(X_test)
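
Each row returned by predict() holds 10 class probabilities, so the predicted digit is the index of the largest value. The sketch below uses made-up probabilities and labels (both hypothetical) rather than the real model output. In the notebook, Y_test is one-hot encoded, so `Y_test.argmax(axis=1)` would play the role of `true_labels`.

```python
import numpy as np

# Hypothetical softmax outputs for 3 test images over 10 classes.
predictions = np.array([
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.91, 0.01, 0.01],
    [0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.025, 0.025],
    [0.10, 0.55, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.025, 0.025],
])
true_labels = np.array([7, 2, 3])  # hypothetical ground-truth digits

# The predicted class is the position of the highest probability in each row.
predicted_labels = predictions.argmax(axis=1)
accuracy = (predicted_labels == true_labels).mean()

print(predicted_labels.tolist())  # [7, 2, 1]
print(accuracy)                   # 2 of 3 correct
```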

In [15]:
# Summary of neural network
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_layer (Dense)          (None, 2048)              1607680   
_________________________________________________________________
dropout (Dropout)            (None, 2048)              0         
_________________________________________________________________
dense_layer_2 (Dense)        (None, 2048)              4196352   
_________________________________________________________________
dropout_1 (Dropout)          (None, 2048)              0         
_________________________________________________________________
dense_layer_3 (Dense)        (None, 10)                20490     
Total params: 5,824,522
Trainable params: 5,824,522
Non-trainable params: 0
_________________________________________________________________
