# Let's flow some tensors for multi class logistic regression

In [None]:
import tensorflow as tf # need to import the right package
import numpy as np 


Here we will build a simple logistic regression model to classify the MNIST data set.

Task: **classify the handwritten digit label $\in \{0, 1, \cdots, 9 \}$ using the pixels of the digit image.**

This is a classification (discrete label output) problem, and multi-class logistic regression can be used here.

See [Yann LeCun](http://yann.lecun.com/)'s website, http://yann.lecun.com/exdb/mnist/, for details about the MNIST dataset.



In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

## Verify everything looks good

In [None]:
print('{} training example with shape {}'.format(mnist.train.num_examples, mnist.train.images[0].shape))
print('{} test example with shape {}'.format(mnist.test.num_examples, mnist.test.images[0].shape))
mnist.train.images.shape

The dataset contains 55000 training and 10000 test examples of handwritten digits in vectorized form (take the pixels along the rows and stack them in a column vector).

Images are 28x28 grayscale. After vectorization we have a 784-dimensional vector.
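
As a minimal sketch of what "vectorized form" means, the NumPy snippet below flattens a toy 28x28 array row by row and reshapes it back. The array `image` is made up for illustration and is not taken from the MNIST files.

```python
import numpy as np

# A toy 28x28 "image" whose pixel values are 0..783, so positions are easy to track.
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# Row-major flattening: all of row 0 first, then row 1, and so on.
vector = image.reshape(-1)
print(vector.shape)

# Pixel (row=1, col=0) lands right after the 28 pixels of row 0.
print(vector[28] == image[1, 0])

# Reshaping back recovers the image, which is what we do before plotting.
restored = vector.reshape(28, 28)
print(np.array_equal(restored, image))
```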

In [None]:
28*28 == 784

## Let's do some sanity checks

In [None]:

NUM_CLASSES = 10
X_DIM = 28
Y_DIM = 28
import numpy as np
unique_label = np.unique(np.argmax(mnist.train.labels, 1))
print(unique_label)
assert NUM_CLASSES == len(unique_label), 'number of label does not match'
assert X_DIM*Y_DIM == mnist.train.images[0].size, 'total pixel does not match'

## Let's randomly view some of them

In [None]:
# magic command so that images are inline in notebook
%matplotlib inline 
import matplotlib.pyplot as plt # visualization package in python

In [None]:
np.random.seed(0) # to make sure we have deterministic results on each iteration
NUM_FIG_DISP = 16
nrow = int(np.sqrt(NUM_FIG_DISP)) # a 4 x 4 grid holds exactly 16 images
ncol = NUM_FIG_DISP // nrow
rand_ind = np.random.randint(mnist.train.num_examples, size=NUM_FIG_DISP)
print(rand_ind)
plt.figure(1,figsize= (16, 16))
plt.gray()
for idx, image_index in enumerate(rand_ind):
    plt.subplot(nrow,ncol,idx +1)
    # Have to reshape to 28x28 for display
    plt.imshow(np.reshape(mnist.train.images[image_index], (X_DIM, Y_DIM)) )
    plt.title('label is {}'.format(np.argmax(mnist.train.labels[image_index])))  
    plt.axis('off')
plt.show()

## Define the multi-class logistic regression model

### Remember that for each class $c \in \{1, 2, \cdots, K = 10\}$ we need to compute

 the probability
 <font size = 6>
 $P(y = c|x) = \frac{\exp(w_{oc} + w_c^Tx)}{\sum_{i=1}^K \exp(w_{oi} + w_i^Tx) }$
 </font>
 
 
 Note that one can either append a constant 1 to the features and work with $(D+1)$-dimensional features, or add a class bias term $w_{oc}$ for each class explicitly, as done in the formula above.
 
 We can keep each weight vector $[w_{c}]_{D \times 1}$ in a $D\times K$ matrix $W$, 
 $$W = {\begin{bmatrix} w_1, w_2, \cdots, w_K \end{bmatrix}}_{D \times K}$$
 
 and the class biases in a vector $W_o = {\begin{bmatrix} w_{o1}\\ w_{o2} \\ \vdots \\ w_{oK} \end{bmatrix}}_{K \times 1}$. $D$ is the data dimension.
 
 
 Using matrix operations we can calculate every class probability for a given example $x_{784 \times 1}$ by computing
 
 $softmax(W_o + W^T x)$, where 
 
 $$W_o + W^Tx = \begin{bmatrix} w_{o1}\\ \vdots \\ w_{oK} \end{bmatrix} + \begin{bmatrix}  w_1^T\\ \vdots \\  w_K^T \end{bmatrix} x = \begin{bmatrix}w_{o1} + w_1^Tx\\ \vdots \\ w_{oK} + w_K^Tx \end{bmatrix}$$
 
 Also remember that 
 <font size = 5>
 
 $$softmax(\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_K  \end{bmatrix}) = \begin{bmatrix} \frac{\exp(z_1)}{\sum_{i=1}^K \exp(z_i)} \\ \frac{\exp(z_2)}{\sum_{i=1}^K \exp(z_i)} \\ \vdots \\ \frac{\exp(z_K)}{\sum_{i=1}^K \exp(z_i)}  \end{bmatrix} $$ </font>
 
 Convince yourself that after applying softmax we will get the formula at the beginning of the cell.
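
 One way to convince yourself is a quick numerical check. The NumPy sketch below (plain NumPy, not the TensorFlow graph built later) compares $softmax(W_o + W^T x)$ with the per-class formula computed one class at a time. The sizes `D = 4`, `K = 3` and the random values are assumptions chosen only to keep the example small.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D score vector; subtracting the max avoids overflow in exp."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

rng = np.random.default_rng(0)
D, K = 4, 3                      # toy sizes (assumed), not the MNIST 784 and 10
W = rng.normal(size=(D, K))      # one weight column per class
W_o = rng.normal(size=(K, 1))    # one bias per class
x = rng.normal(size=(D, 1))      # a single example as a column vector

scores = W_o + W.T @ x           # shape (K, 1)
probs = softmax(scores.ravel())

# Per-class formula, computed one class at a time.
denominator = sum(np.exp(W_o[i, 0] + W[:, i] @ x[:, 0]) for i in range(K))
probs_by_formula = np.array(
    [np.exp(W_o[c, 0] + W[:, c] @ x[:, 0]) / denominator for c in range(K)]
)

print(np.allclose(probs, probs_by_formula))  # both routes agree
print(probs.sum())                           # probabilities sum to 1
```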

# Let's code the above equations as a TensorFlow computational graph

When we define a placeholder, we don't need to specify the number-of-examples dimension.
See below, where we use **None**.

## Q1 (1 point) Write the placeholder for the true labels Y

In [None]:
PIXELS_PER_SAMPLE = X_DIM*Y_DIM

X =  tf.placeholder(tf.float32, [None, PIXELS_PER_SAMPLE])
Y = ## Write code here

print(X.shape, Y.shape)



## Q2 (1 point) Create a variable using get_variable for the vector $W_o$ as defined above and use the zeros initializer

In [None]:
with tf.variable_scope("multi_class_logistic_model", reuse=tf.AUTO_REUSE):
    W = tf.get_variable('Weight_matrix', initializer = tf.random_normal(shape = (X_DIM*Y_DIM, NUM_CLASSES)))    
    W_o= ### Write your code here
    print(X.shape, W.shape, W_o.shape)
    # we have to transpose X because examples are along the rows and we need them along the columns

    Y_pred = tf.matmul(tf.transpose(W), tf.transpose(X))  + W_o
    
print('shape of prediction vector is {}'.format(Y_pred.shape))

# ? represents the free dimension, so let's transpose again to keep the free dimension first

In [None]:

Y_pred = tf.transpose(Y_pred)
print('shape of prediction vector is {}'.format(Y_pred.shape))



<font color = "red">The above code can also be written without so many transpose operations (this is how you will see most TensorFlow code written), but then we have to think of the above equations in the transposed sense.</font> 

It doesn't matter how you keep examples/weights in a matrix (row-wise or column-wise). Just be careful about the interpretation.
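
A small NumPy sketch of this point: with examples along the rows of `X`, the column-style computation used above and the row-style computation give the same score matrix. The toy sizes and random values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 5, 4, 3                  # toy sizes (assumed)
X = rng.normal(size=(N, D))        # examples along rows, like mnist.train.images
W = rng.normal(size=(D, K))
W_o = rng.normal(size=(K, 1))

# Column convention: scores for each example are a column, then transpose back.
scores_col = (W.T @ X.T + W_o).T   # shape (N, K)

# Row convention: no transposes of X, the bias is broadcast as a row.
scores_row = X @ W + W_o.T         # shape (N, K)

print(scores_col.shape, scores_row.shape)
print(np.allclose(scores_col, scores_row))
```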

## Let's convert this score vector of 10 into probability vector using softmax

## Q3 (1 point)  Use softmax function from tensorflow to convert Y_pred to probability vector.

In [None]:
Y_pred_prob = ### Write your code here

## build a loss/cost/objective function to measure how well we are doing

We use cross entropy as discussed in class.
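
As a reminder of the quantity involved, here is a hedged NumPy sketch of cross entropy between one-hot true labels and predicted probabilities, averaged over examples. The arrays `y_true` and `y_prob` and the small constant `eps` are made up for illustration; this is not the TensorFlow code asked for in the question.

```python
import numpy as np

# Two toy examples with three classes (assumed values).
y_true = np.array([[0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]])       # one-hot true labels
y_prob = np.array([[0.1, 0.2, 0.7],
                   [0.5, 0.3, 0.2]])       # predicted probabilities, rows sum to 1

eps = 1e-12                                # guards against log(0)

# Cross entropy per example: -sum_c y_c * log(p_c); only the true class contributes.
per_example = -np.sum(y_true * np.log(y_prob + eps), axis=1)

# Average over the examples to get a single loss value.
mean_cross_entropy = per_example.mean()

print(per_example)
print(mean_cross_entropy)
```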

## Q4 (1 point) Using the log and reduce_mean functions from TensorFlow, combine the predicted probability tensor Y_pred_prob and the true probability placeholder tensor Y to calculate cross entropy.

In [None]:
loss = ## Write your code here

## build an accuracy measure

We can also build the accuracy calculation in the graph. We will run this when we need to calculate accuracy.

### Q5 (1 point) Build accuracy in the graph

Hint: use the equal, argmax (picking the index of the highest probability), cast, and reduce_mean functions from TensorFlow. You may have to write 3 lines of code. Keep the final tensor name as accuracy.

In [None]:
accuracy = ## Write your code here


In [None]:
accuracy.get_shape()

## Let's create an optimizer

Remember that sometimes there is no closed-form solution for the parameters $W$ that maximize the likelihood or log-likelihood function (**MLE estimation procedure**), as in logistic regression. Maximizing the log likelihood is equivalent to minimizing the cost function $C(W)$ defined below. We showed that if the function $C(W)$ is differentiable,
one can use an iterative procedure called **gradient descent (GD)** to update the parameters.

$W_{k+1} = W_k - \eta \frac{dC(W)}{dW}$

where $\eta$ is the learning rate and $C(W) = \sum_{i=1}^{N}$cross_entropy(true_probability_i, machine_predicted_probability_i)

- To update the parameters we have to compute $C(W)$ for all the examples at every step. This can be
computationally expensive if we have millions of examples.

One extreme is to shuffle all the examples and use one example (the cost/loss of one example) at a time. This is called

**Stochastic GD**

for i in range(N):
    $W_{k+1} = W_k - \eta \frac{dC(W)_i}{dW}$
    
As you can guess, with this approach the search path for the parameters can be quite noisy. There are a lot of other mathematical questions we need to ask, like
- Will the learned parameters be correct?
- Will the cost decrease or not?

In short, there are guarantees if the cost function is convex.
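
The difference between full gradient descent and stochastic GD can be sketched on a tiny convex problem. In the NumPy example below, the per-example cost is $(w - a_i)^2$ and the full cost is its mean over the data (a mean rather than a sum, to keep the step size simple). The data `a`, the learning rates, and the step counts are assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(loc=3.0, scale=1.0, size=200)   # toy data (assumed)

def grad_full(w):
    """Gradient of the mean cost over all examples."""
    return np.mean(2.0 * (w - a))

def grad_single(w, i):
    """Gradient of the cost of example i only."""
    return 2.0 * (w - a[i])

# Full gradient descent: every step uses all the examples.
w_gd = 0.0
for _ in range(100):
    w_gd = w_gd - 0.1 * grad_full(w_gd)

# Stochastic GD: shuffle at the start of each pass, then one example per step.
w_sgd = 0.0
for epoch in range(5):
    for i in rng.permutation(len(a)):
        w_sgd = w_sgd - 0.01 * grad_single(w_sgd, i)

# The minimizer of this convex cost is the mean of a.
print(w_gd, w_sgd, a.mean())
```

Full GD settles on the minimizer, while the stochastic estimate keeps jittering around it because each step follows the gradient of a single example.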

We can take a midway approach called **mini-batch gradient descent**, where we use a small portion of the total examples at each step to update the parameters $W$.

The lingo we need to get used to is 

- The size of a mini-batch is called the batch size.
- As you can guess, we are not using all the examples in each iteration/step, hence we need to make multiple
  iterations of batch size so that the optimization algorithm has seen all the examples.
- An epoch is a complete pass through all the training data. An epoch contains multiple iterations of batch-size examples.
- We use multiple passes through the training examples (epochs) to learn the value of the parameters $W$. Also, we need to shuffle the examples at the start of a new epoch.
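
The terms above can be sketched with a small NumPy helper. `iterate_minibatches` is a hypothetical function written for this illustration (the notebook itself relies on `mnist.train.next_batch` for batching), and the sizes are toy values.

```python
import numpy as np

def iterate_minibatches(num_examples, batch_size, rng):
    """Yield index arrays covering a freshly shuffled dataset, one mini-batch at a time."""
    order = rng.permutation(num_examples)      # shuffle at the start of the epoch
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
num_examples, batch_size, num_epochs = 1000, 100, 3   # toy values (assumed)

for epoch in range(num_epochs):
    seen = []
    num_iterations = 0
    for batch_idx in iterate_minibatches(num_examples, batch_size, rng):
        # A real training loop would run one parameter update on this batch here.
        seen.extend(batch_idx.tolist())
        num_iterations += 1
    # One epoch = num_examples // batch_size iterations, each example seen once.
    print(epoch + 1, num_iterations, len(set(seen)))
```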

    
    

## Q 6 (1 point) Use GradientDescentOptimizer from TensorFlow to create the optimizer op in the graph. You need to specify the learning rate and the loss to minimize. You already built the loss (using cross entropy).



In [None]:
opt = ## Write your code here



## When we run the opt node above, it calculates the loss and the gradient and updates the model weights

In [None]:
print(mnist.train.images[0].dtype)
print(mnist.train.labels[0].dtype)

In [None]:
BATCH_SIZE = 100
NUM_EPOCHS = 20

## Let's run the model and see how it is performing

Note that I am using the test data as validation data here.

## Q7 (1 point) Run the opt, loss, and accuracy nodes using sess in the following code.

 You also need to feed actual data for the X and Y placeholders.
 
 Write your code in the "Write your code here" line.

In [None]:
train_losses, val_losses = [], []
train_accuracies, val_accuracies = [], []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for eidx in range(NUM_EPOCHS):
        epoch_acc, epoch_loss = [], []
        for bidx in range(mnist.train.num_examples// BATCH_SIZE):
            xs, ys = mnist.train.next_batch(BATCH_SIZE)
            xs = xs.astype(np.float32)
            _, train_loss, train_acc= ### Write your code here
            if (bidx+1)%100 == 0: # print result every 100 batch
                print('epoch {} training batch {} loss {} accu {}'.format(eidx +1 , bidx +1, train_loss, train_acc))
            epoch_acc.append(train_acc)
            epoch_loss.append(train_loss)
        print('##################################')
        val_acc, val_loss = sess.run([accuracy, loss],
            feed_dict= {X:mnist.test.images.astype(np.float32), Y: mnist.test.labels})
        print('epoch {} # test accuracy {} $ test loss {}'.format(eidx +1, val_acc, val_loss ))
        print('##################################') 
        # Let's keep epoch-level values for plotting
        train_losses.append(np.mean(epoch_loss))
        train_accuracies.append(np.mean(epoch_acc))
        val_losses.append(val_loss)
        val_accuracies.append(val_acc)
                

In [None]:
plt.plot(range(1, NUM_EPOCHS+1),  train_losses)
plt.plot(range(1, NUM_EPOCHS+1),  val_losses)
plt.legend(['train loss','validation loss'])
plt.show()

## Q 8 (1 point) plot train and val accuracies

In [None]:
## write code here

# In the deep learning section we will beat the current accuracy numbers