# TensorFlow Basics Tutorial with the MNIST Dataset

TensorFlow is a powerful library for large-scale numerical computation. One of the tasks at which it excels is implementing and training deep neural networks.

### Importing the Libraries

In [2]:
import tensorflow as tf
import numpy as np
from sklearn.datasets import fetch_mldata

In [2]:
mnist1= fetch_mldata('MNIST original')

In [5]:
#print(mnist1)

### Loading the MNIST dataset

The two lines below load the MNIST dataset into our code. Here `mnist` is an object that stores the training, validation, and test sets as NumPy arrays.

MNIST ships with the TensorFlow examples. The `input_data.read_data_sets` function reads the data (downloading it into the `MNIST_data` directory if needed), and `one_hot=True` specifies that the labels are returned as one-hot arrays.

In [3]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_data\train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_data\train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting MNIST_data\t10k-images-idx3-ubyte.gz
Extracting MNIST_data\t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


### Starting TF interactive session

TensorFlow does its computation in a C++ backend, which is far more efficient than Python. The connection between TensorFlow's Python front end and this C++ backend is called a session.

Generally, we first build a dataflow graph in TensorFlow and then launch it in a session, either feeding it the required placeholders or simply evaluating it.

Here we use the `InteractiveSession` class, which lets us interleave operations that build the computational graph with operations that run it, so we can keep adding to the graph after the session has started. Otherwise, we would have to build the entire computational graph before starting the session and launching the graph.


In [4]:
sess = tf.InteractiveSession()

### Computational Graph

Libraries like NumPy do their heavy numerical computation outside Python, in code written in other languages such as C++. But switching back to Python after every operation adds overhead, which is especially costly when computing on GPUs or in a distributed manner, where transferring data is expensive.

TensorFlow also computes outside Python, but instead of running a single expensive operation independently from Python, it lets us describe a graph of interacting operations that runs entirely outside Python, which avoids this overhead.

## Softmax Regression Model

We build a softmax regression model with a single linear layer.

### Placeholders

`x` is the input and `y_` holds the labels. These are the nodes in the computational graph for the input images and the target output classes (one-hot labels). `x` and `y_` are placeholders: values that we supply when we ask TensorFlow to run a computation.

Each MNIST image has shape 28x28 and has been flattened to one dimension, giving 784 input features, one per pixel. The first dimension of `x` is the batch size; `None` means it is left unspecified, so a batch can be of any size.
The target labels, `y_`, also form a 2D tensor, where each row is a 10-dimensional one-hot vector. The output classes in the dataset range from 0 to 9.

In [5]:
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
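
To make the label layout concrete, here is a minimal NumPy sketch of one-hot encoding. The helper name `one_hot` and the sample digits are illustrative assumptions and are not part of the notebook; the dataset loader already does this for us when `one_hot=True`.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a [len(labels), num_classes] float32 array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Set the column given by each label to 1 in the matching row.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Three hypothetical digit labels (assumed for illustration)
example = one_hot([3, 0, 9])
print(example.shape)  # (3, 10)
print(example[0])     # 1.0 at index 3, zeros elsewhere
```

Each row has the same length as a row of `y_`, so a batch of such rows can be fed to the `y_` placeholder.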

### Variables

Weights and biases are the variables here. Unlike placeholders, they are not fed in from outside. Rather, they are values that live in TensorFlow's computational graph, and they can be used and even modified by the computation.

`tf.Variable()` creates a variable with the given initial value (all zeros here). `W` is a 784x10 matrix (784 inputs and 10 outputs for the single layer) and `b` is a 10-dimensional vector, since we have 10 classes (10 neurons in the single layer).

In [6]:
W = tf.Variable(tf.zeros([784,10]))#input to output dimensions of the weight matrix
b = tf.Variable(tf.zeros([10]))

Before variables can be used in a session, they must be initialized in that session with their initial values. A single variable can be initialized with `sess.run(variable.initializer)`. To initialize all of them at once, we use `tf.global_variables_initializer()`, which returns an op that initializes all global variables.

In [7]:
sess.run(tf.global_variables_initializer())

### Forward Pass

In [8]:
y = tf.matmul(x,W) + b

### Predicted class and loss (cost) function

The scores produced by the final layer before softmax is applied are called logits. `tf.nn.softmax_cross_entropy_with_logits` takes these raw scores (often written z; `y` in our code) as its input and returns the cross entropy as its output.

`tf.reduce_mean` averages all the elements of a tensor, and hence computes the mean cost over all the images in the batch.

`tf.nn.softmax_cross_entropy_with_logits` computes the softmax cross entropy between `logits` and `labels`. It measures the probability error in discrete classification tasks in which the classes are mutually exclusive. It returns a 1-D `Tensor` of length `batch_size`, of the same type as `logits`, holding the softmax cross entropy loss for each example.

In other words, `tf.nn.softmax_cross_entropy_with_logits` internally applies the softmax to the model's unnormalized predictions and sums across all classes, and `tf.reduce_mean` takes the average over these sums.
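
As a rough illustration of what these two calls compute, here is a NumPy sketch. It is not TensorFlow's implementation; the function name `softmax_cross_entropy` and the toy logits and labels are assumptions made for this example.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross entropy between one-hot labels and softmax(logits)."""
    # Subtract the row-wise max for numerical stability; softmax is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Sum over classes: only the true class contributes for one-hot labels.
    return -(labels * log_probs).sum(axis=1)

# Two hypothetical examples with three classes each (assumed values)
logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(logits, labels)  # shape (2,), one loss per example
mean_loss = per_example.mean()                        # the role played by tf.reduce_mean
print(per_example, mean_loss)
```

The second example has uniform logits, so its loss is log(3): the model assigns probability 1/3 to the true class.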


In [9]:
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.



### Training the model / The Backward Pass

Because TensorFlow knows the whole computational graph, it can use automatic differentiation to find the gradients of the cost with respect to the parameters (`dC/dW` and `dC/db`). Here we use a gradient descent optimizer, which trains the model by minimizing the cross entropy (cost). It does so by adding operations to the graph that compute these gradients and apply the parameter updates `W = W - alpha * dC/dW` and `b = b - alpha * dC/db`.
Here `alpha` is the learning rate.
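
The update rule can be sketched by hand in NumPy for a tiny softmax regression. This is only an illustration under assumed toy data (8 random examples, 4 features, 3 classes) with a hand-derived gradient; TensorFlow obtains the same gradients through automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                # 8 hypothetical examples, 4 features
labels = np.eye(3)[rng.integers(0, 3, 8)]  # random one-hot labels, 3 classes
W = np.zeros((4, 3))
b = np.zeros(3)
alpha = 0.5                                # learning rate

def loss_and_grads(W, b):
    """Mean softmax cross entropy and its gradients with respect to W and b."""
    logits = x @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(labels * np.log(probs), axis=1))
    dlogits = (probs - labels) / len(x)          # dC/dlogits for the mean loss
    return loss, x.T @ dlogits, dlogits.sum(axis=0)

loss_before, dW, db = loss_and_grads(W, b)
W = W - alpha * dW                         # W = W - alpha * dC/dW
b = b - alpha * db                         # b = b - alpha * dC/db
loss_after, _, _ = loss_and_grads(W, b)
print(loss_before, loss_after)
```

One step should lower the loss on this toy batch; repeating the step is what the training loop does.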

In [10]:
alpha= tf.Variable(0.5)
sess.run(tf.global_variables_initializer())

In [11]:
train_step = tf.train.GradientDescentOptimizer(alpha).minimize(cross_entropy)

In [12]:
#?tf.train.GradientDescentOptimizer().minimize()

The returned operation train_step, when run, will apply the gradient descent updates to the parameters. Training the model can therefore be accomplished by repeatedly running train_step.

The `for` loop represents the training iterations. In each iteration we load 100 training examples using `mnist.train.next_batch(100)` and store them in `batch`. Position 0 of `batch` holds the images and position 1 holds the corresponding labels (target classes). These are fed to the placeholders `x` and `y_` through the `feed_dict` argument.

Finally, we run `train_step` in the session. Running any op requires feeding the placeholders it depends on, and TensorFlow first executes every operation that op needs. So, in effect, the graph is executed step by step up to the backward pass: the inputs are fed to `x` and the labels to `y_`; the forward pass runs using the current weights and biases (which were initialized earlier), computing the output needed by the cost function; the cross entropy is computed, since the backward pass needs it to minimize; and finally `train_step` applies the update.

The whole process is repeated as many times as specified in the `for` loop.

In [13]:
for _ in range(1000):
    batch = mnist.train.next_batch(100)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
   

`tf.argmax` is an extremely useful function that gives the index of the highest entry in a tensor along some axis. We can use `tf.equal` to check whether our prediction matches the truth. This gives an array of booleans, whose sum is the total number of correct predictions. Casting it to floats and taking the mean gives our accuracy.

In [14]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

In [15]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
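
The same argmax, compare, cast, and average steps can be sketched in NumPy. The four rows of logits and labels below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical logits for 4 examples and 3 classes (assumed values)
logits = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.1],
                   [0.3, 0.2, 0.9],
                   [0.8, 0.1, 0.4]])
# One-hot labels: true classes are 1, 0, 1, 0
labels = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Index of the largest entry in each row, compared element-wise
correct = np.argmax(logits, axis=1) == np.argmax(labels, axis=1)
# Cast booleans to floats and average to get the fraction that is correct
accuracy = correct.astype(np.float32).mean()
print(accuracy)  # 0.75
```

Three of the four predictions match here, so the mean of the cast booleans is 0.75.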

### Evaluation

As soon as we call `.eval` on any variable or step in the model graph, it runs all the preceding steps required to compute it.

In [16]:
print(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

0.9167


In [17]:
print((x.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels})).shape)

(10000, 784)


# Convolutional Neural Network

### Weight Initialization

The following two functions initialize the weights and biases of our CNN. The weights are initialized to nonzero values with a small amount of noise, to break symmetry and to prevent zero gradients. The biases are initialized to a slightly positive value, since we use ReLU neurons, which can otherwise end up "dead" (always outputting 0).

Both weights and biases are created as variables using `tf.Variable`.

`tf.truncated_normal` generates normally distributed random values with mean 0 and standard deviation 0.1; any value more than two standard deviations from the mean (beyond ±0.2) is dropped and redrawn.

The biases are set to a constant 0.1.

The shape is passed as a parameter to these functions, so the weight and bias arrays take whatever shape we give when calling them.

In [18]:
def weight_variable(shape):
    # Truncated normal: values outside [-0.2, 0.2] (two standard deviations) are redrawn
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    # Constant tensor of 0.1 with whatever shape we give it
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
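
The redrawing behaviour of a truncated normal can be sketched in NumPy. This is an illustrative approximation of the idea, not TensorFlow's implementation; the function name, the `seed` parameter, and the resampling loop are assumptions.

```python
import numpy as np

def truncated_normal(shape, stddev=0.1, seed=0):
    """Sample N(0, stddev) and redraw anything beyond two standard deviations."""
    rng = np.random.default_rng(seed)
    values = rng.normal(0.0, stddev, size=shape)
    outside = np.abs(values) > 2 * stddev
    # Keep redrawing the out-of-range entries until none remain.
    while outside.any():
        values[outside] = rng.normal(0.0, stddev, size=outside.sum())
        outside = np.abs(values) > 2 * stddev
    return values

# Same shape as the first convolutional filter bank in this notebook
sample = truncated_normal((5, 5, 1, 32))
print(sample.shape, np.abs(sample).max())
```

Every sampled value ends up inside [-0.2, 0.2], so no weight starts unusually large.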

### Convolution and Pooling

The `conv2d` function wraps `tf.nn.conv2d` and fixes our convolution parameters. It computes a 2D convolution of a 4D input with the given filter. Its parameters are the input tensor `x` (a 4D tensor of shape `[batch_size, input_height, input_width, input_channels]`), the filter (also called kernel) tensor `W` of shape `[filter_height, filter_width, input_channels, output_channels]`, the stride along each of the 4 dimensions, and the padding. With `padding='SAME'`, zeros are added around the original image so that, at stride 1, the output has the same height and width as the input.
It returns a `Tensor` of the same type as the input.

`tf.nn.max_pool` performs max pooling on the input. Here it takes the maximum over 2x2 blocks with stride 2, halving the height and width.

In [19]:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')
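
With `padding='SAME'`, the output height and width depend only on the input size and the stride: the output size is the input size divided by the stride, rounded up. The small helper below (a hypothetical name, for illustration) traces a 28x28 image through two convolution and pooling pairs built from the functions above.

```python
import math

def same_padding_output_size(input_size, stride):
    """Spatial output size for padding='SAME': ceil(input_size / stride)."""
    return math.ceil(input_size / stride)

size = 28
size = same_padding_output_size(size, 1)  # conv2d with stride 1 keeps 28
size = same_padding_output_size(size, 2)  # max_pool_2x2 halves it to 14
size = same_padding_output_size(size, 1)  # a second conv2d keeps 14
size = same_padding_output_size(size, 2)  # a second max_pool_2x2 gives 7
print(size)  # 7
```

This is why two such pairs reduce a 28x28 input to 7x7.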


In [20]:
#tf.nn.conv2d

In [21]:
#?tf.nn.max_pool

### First Convolutional Layer

Here we define the weights and biases of our first convolutional layer, which are simply the parameters of the filters we convolve over the image.
`W_conv1` is the first filter weight tensor. Each filter is 5x5 with a channel depth of 1 (only grayscale images are used here). We apply 32 such randomly initialized filters to the image to obtain a feature map of depth 32.
`b_conv1` holds the bias variables, one per filter.

In [22]:
W_conv1 = weight_variable([5, 5, 1, 32])#5X5 filter 1 is the in channel (gray-scale) and there are 32 feature maps
b_conv1 = bias_variable([32])#32 feature maps

To apply the layer, we reshape `x` from a 2D tensor of flattened images, `[batch_size, 784]`, into the 4D tensor that `conv2d` expects, `[batch_size, 28, 28, 1]`. The final dimension, 1, is the number of color channels, which has to be made explicit for grayscale images.
The -1 in the first position tells `tf.reshape` to infer that dimension so that the total number of elements stays the same; here it works out to the batch size. If the element count cannot be divided evenly into the remaining dimensions, the reshape fails with an error.
`x_image` therefore holds the batch of images, each of shape 28x28 with 1 grayscale channel.

In [23]:
x_image = tf.reshape(x, [-1, 28, 28, 1])
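
NumPy's `reshape` treats -1 the same way, which makes the inference easy to check. The batch of 100 zero-filled images below is an assumed stand-in for real data.

```python
import numpy as np

# Hypothetical batch of 100 flattened 28x28 images (assumed for illustration)
batch = np.zeros((100, 784), dtype=np.float32)

# -1 asks reshape to infer the first dimension from the total element count
images = batch.reshape(-1, 28, 28, 1)
print(images.shape)  # (100, 28, 28, 1)

# An incompatible element count raises an error instead of guessing
try:
    np.zeros(785).reshape(-1, 28, 28, 1)
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```

The inferred dimension is 100 because 100 * 784 elements fill exactly 100 images of 28 * 28 * 1 values.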

In [24]:
#?tf.reshape

We convolve our image, `x_image`, with the filters we just defined, using the function `conv2d`, and add the biases `b_conv1`. This produces a feature map of depth 32 for each image. We then apply the ReLU activation to this feature map using `tf.nn.relu`.

Next, we apply 2x2 max pooling to the output using the `max_pool_2x2` function, which reduces each feature map to 14x14, giving a tensor of shape `[batch_size, 14, 14, 32]`.

In [25]:
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)#64X28X28X32. Here 64 is the batch size (assumed)
h_pool1 = max_pool_2x2(h_conv1)#max pooling 64X14X14X32

### Second Convolutional layer

The second convolutional layer applies 5x5 filters to the 32 input channels and produces 64 output features.

In [26]:
W_conv2 = weight_variable([5, 5, 32, 64])#5X5 filter, 32 input channels, 64 feature maps
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)#64X14X14X64
h_pool2 = max_pool_2x2(h_conv2)#64X7X7X64

Now that the image size has been reduced to 7x7, we add a fully connected layer with 1024 neurons to allow processing on the entire image.
The outputs are thus mapped to 1024 neurons, whose weights are learned by backpropagation along with the weights of the filters we apply to the images.
This fully connected layer has weights `W_fc1` (of shape `[7*7*64, 1024]`: a 7x7 feature map with 64 channels as input, and 1024 outputs) and biases `b_fc1`.

We reshape the tensor from the last (second) max pooling layer into a 2D batch of vectors of shape `[-1, 7*7*64]`, where -1 is again inferred as the batch size.
We then multiply this flattened input by the weight matrix using `tf.matmul`, add the biases, and apply the ReLU activation.

In [27]:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

### Dropout 

To reduce overfitting, we apply dropout before the readout layer. We create a placeholder for the probability that a neuron's output is kept during dropout. This allows us to turn dropout on during training and turn it off during testing. TensorFlow's `tf.nn.dropout` op automatically handles scaling neuron outputs in addition to masking them, so dropout just works without any additional scaling.

In [28]:
keep_prob = tf.placeholder(tf.float32)#probability of keeping each neuron's output; 1 means no neuron is dropped
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
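
The masking and scaling can be sketched in NumPy as follows. This is a simplified illustration of the idea, not TensorFlow's implementation; the function name `dropout`, the random generator, and the all-ones activations are assumptions.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each entry with probability 1 - keep_prob and scale the rest by 1 / keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((1000, 100))          # hypothetical layer outputs, all equal to 1

dropped = dropout(h, 0.5, rng)    # training: about half the entries become 0, the rest 2
unchanged = dropout(h, 1.0, rng)  # testing: keep_prob of 1.0 leaves the input as is
print(dropped.mean(), unchanged.mean())
```

Because the kept values are scaled up, the average activation stays close to its original value, which is why no extra rescaling is needed at test time.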

### ReadOut Layer

Finally, we add a readout layer, just like the single layer in the softmax regression above.

In [29]:
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

In [30]:
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

In [31]:
train_step = tf.train.GradientDescentOptimizer(1e-4).minimize(cross_entropy)

In [32]:
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

In [None]:
with tf.Session() as sess:#open a session; inside the with block it is the default session
    sess.run(tf.global_variables_initializer())#initialize all the variables we use
    for i in range(1000):#number of training iterations
        batch = mnist.train.next_batch(100)#take the next batch of 100 examples from the MNIST training set
        if i % 100 == 0:#report accuracy every 100 iterations
            train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})#training accuracy computed with dropout turned off
            print('step %d, training accuracy %g' % (i, train_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})#keep probability of 0.5 during training
    print('test accuracy %g' % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))#test accuracy with dropout turned off


step 0, training accuracy 0.06
step 100, training accuracy 0.15
step 200, training accuracy 0.22
step 300, training accuracy 0.27


shalinigakhar7@gmail.com