# Lab 2 - Logistic Regression with MNIST


# Model Overview
In this tutorial we will build and train a Multiclass Logistic Regression model using the MNIST data. 

The MNIST data consists of hand-written digits with little background noise, making it a standard dataset for creating, experimenting with, and learning deep learning models using reasonably small computing resources.



In [1]:
from IPython.display import Image
Image(url= "http://3.bp.blogspot.com/_UpN7DfJA0j4/TJtUBWPk0SI/AAAAAAAAABY/oWPMtmqJn3k/s1600/mnist_originals.png", width=200, height=200)


[Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) (LR) is a fundamental machine learning technique that uses a linear weighted combination of features and generates probability-based predictions of different classes.

There are two basic forms of LR: **binary LR** (with a single output that can predict two classes) and **multinomial LR** (with multiple outputs, each of which is used to predict a single class).

![LR-forms](http://www.cntk.ai/jup/cntk103b_TwoFormsOfLR-v3.png)

In **Binary Logistic Regression** (see top of figure above), the input features are each scaled by an associated weight and summed together.  The sum is passed through a squashing (aka activation) function and generates an output in [0,1].  This output value (which can be thought of as a probability) is then compared with a threshold (such as 0.5) to produce a binary label (0 or 1).  This technique supports only classification problems with two output classes, hence the name binary LR.  In the binary LR example shown above, the [sigmoid][] function is used as the squashing function.

[sigmoid]: https://en.wikipedia.org/wiki/Sigmoid_function
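The binary case described above can be sketched in a few lines of NumPy. This is only an illustration: the weights, bias, and feature values are hypothetical, and `predict_binary` is a helper name introduced here, not part of CNTK.

```python
import numpy as np

def sigmoid(t):
    """Squash a real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_binary(x, w, b, threshold=0.5):
    """Weighted sum of the features -> sigmoid -> thresholded 0/1 label."""
    probability = sigmoid(np.dot(w, x) + b)
    label = int(probability >= threshold)
    return probability, label

# Hypothetical weights, bias and features, chosen only for illustration.
w = np.array([0.5, -1.0, 2.0])
b = -0.5
x = np.array([1.0, 0.0, 1.0])

probability, label = predict_binary(x, w, b)
print(probability, label)
```

Here the weighted sum is 2.0, so the sigmoid output is well above the 0.5 threshold and the predicted label is 1.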

In **Multiclass Logistic Regression** (see bottom of figure above), two or more output nodes are used, one for each output class to be predicted. Each summation node uses its own set of weights to scale the input features and sum them together. Instead of passing the summed output of the weighted input features through a sigmoid squashing function, the output is often passed through a `softmax` function. In addition to squashing, like the `sigmoid`, the `softmax` normalizes each node's output value by the sum of the exponentiated values of all nodes, so that the outputs sum to 1.

$$ h(\textbf{z})_i = \frac{e^{z_i}}{\sum_{k=1}^C{e^{z_k}}} \text{,  where  } \textbf{z} = \textbf{W} \times \textbf{x}  + \textbf{b} $$
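As a minimal sketch of the formula above, the following NumPy snippet computes the softmax for a small, hypothetical example with C = 3 classes and 2 features. Subtracting the maximum score before exponentiating is a common numerical-stability trick that we assume here; it does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D score vector z."""
    # Shifting by the maximum avoids overflow in exp() and leaves the
    # normalized result unchanged.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# Hypothetical parameters: 3 output classes, 2 input features.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.zeros(3)
x = np.array([2.0, 1.0])

z = W @ x + b      # unnormalized scores, here [2, 1, 3]
h = softmax(z)     # probabilities that sum to 1
print(h, h.sum(), np.argmax(h))
```

The class with the largest score (index 2 here) also receives the largest probability, so taking the `argmax` of either `z` or `h` yields the same prediction.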

In this tutorial, we will use multinomial LR to classify the MNIST digits (0-9) using 10 output nodes (one for each of our output classes).



The figure below summarizes the model in the context of the MNIST data.

![mnist-LR](https://www.cntk.ai/jup/cntk103b_MNIST_LR.png)

The goal of training is to find values of the parameters **W** and **b** that fit the training samples and generalize well to samples outside of the training set.

The trainer strives to reduce the `loss` function, a function that expresses the "difference" between model predictions and the training ground-truth labels, using one of several optimization approaches, [Stochastic Gradient Descent][] (`sgd`) being one of the most popular. Typically, one starts with a random initialization of the model parameters. The `sgd` optimizer uses [gradient descent][gradient-decent] to generate a new set of model parameters in each iteration.

The "classical" stochastic gradient descent uses a single observation to calculate gradient estimates and update the model parameters. This approach is attractive since it does not require the entire data set (all observations) to be loaded in memory and requires gradient computation over fewer data points, thus allowing training on large data sets. However, the updates generated using a single observation at a time can vary wildly between iterations. A middle ground is to use a small set of observations and use the average of the `loss` or error over that set to update the model parameters. This subset is called a *minibatch*.

With minibatches, we often sample observations from the larger training dataset. We repeat the process of updating the model parameters using different combinations of training samples and, over time, minimize the `loss` (and the error). When the error rate is no longer changing significantly, or after a preset maximum number of minibatches, we consider the model trained.

One of the key optimization parameters is the `learning_rate`. For now, we can think of it as a scaling factor that modulates how much we change the parameters in any iteration.
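To make the minibatch procedure concrete, here is a small NumPy sketch of minibatch SGD for a softmax classifier. It is not how CNTK implements its learner; the synthetic two-class data, the zero initialization, and the helper names `softmax_rows` and `mean_loss` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: 2 features, 2 classes, linearly separable.
num_samples, num_features, num_classes = 1000, 2, 2
X = rng.normal(size=(num_samples, num_features))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(num_classes)[y]                      # one-hot labels

W = np.zeros((num_features, num_classes))
b = np.zeros(num_classes)

def softmax_rows(Z):
    """Row-wise softmax of a 2-D score matrix."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def mean_loss(X, Y, W, b):
    """Average cross-entropy of the model over the given samples."""
    P = softmax_rows(X @ W + b)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

learning_rate = 0.2
minibatch_size = 64
num_minibatches = 200

initial_loss = mean_loss(X, Y, W, b)
for _ in range(num_minibatches):
    # Sample one minibatch from the larger training set.
    idx = rng.choice(num_samples, size=minibatch_size, replace=False)
    Xb, Yb = X[idx], Y[idx]
    # Gradient of the minibatch-averaged cross-entropy w.r.t. the scores.
    grad_Z = (softmax_rows(Xb @ W + b) - Yb) / minibatch_size
    # The learning rate scales how far each update moves the parameters.
    W -= learning_rate * (Xb.T @ grad_Z)
    b -= learning_rate * grad_Z.sum(axis=0)
final_loss = mean_loss(X, Y, W, b)

print(initial_loss, final_loss)
```

With all-zero parameters every class gets probability 0.5, so the initial loss is ln 2; the loss after the updates is lower.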

[optimization]: https://en.wikipedia.org/wiki/Category:Convex_optimization
[Stochastic Gradient Descent]: https://en.wikipedia.org/wiki/Stochastic_gradient_descent
[gradient-decent]: http://www.statisticsviews.com/details/feature/5722691/Getting-to-the-Bottom-of-Regression-with-Gradient-Descent.html

# Code Walkthrough

## Initialize environment

In [64]:
# Import the relevant components

import sys
import os
import time
import numpy as np
import cntk as C
from cntk.logging.progress_print import ProgressPrinter



## Data reading

In this tutorial we use the MNIST data pre-processed into the CNTK text format (CTF). The dataset has 50,000 training images, 10,000 validation images, and 10,000 test images, each 28 x 28 pixels. The number of features is therefore 784 (= 28 x 28), one per pixel.

The data is in the following format:

    |labels 0 0 0 0 0 0 0 1 0 0 |features 0 0 0 0 ... 
                                                  (784 integers each representing a pixel)
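To see how one such line maps to a label and a feature vector, here is a plain-Python sketch of a parser. It handles only the dense layout shown above and is not a substitute for the CNTK deserializer; `parse_ctf_line` is a hypothetical helper, and the sample line is shortened to four feature values instead of 784.

```python
def parse_ctf_line(line):
    """Split one dense CTF text line into a dict: stream name -> list of floats."""
    streams = {}
    for chunk in line.split("|"):
        tokens = chunk.split()
        if not tokens:
            continue  # skip the empty piece before the first "|"
        name, values = tokens[0], tokens[1:]
        streams[name] = [float(v) for v in values]
    return streams

# Shortened, hypothetical sample: a real MNIST line has 784 feature values.
sample = "|labels 0 0 0 0 0 0 0 1 0 0 |features 0 0 128 255"
parsed = parse_ctf_line(sample)

# The label is one-hot encoded: the position of the 1 is the digit.
digit = parsed["labels"].index(1.0)
print(digit, parsed["features"])
```

In this sample the 1 sits at position 7 of the label vector, so the line describes an image of the digit 7.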
    


In [65]:
# Fix the NumPy random seed so that runs are reproducible
np.random.seed(0)

# Read a CTF-formatted text file (format shown above) using the CTF deserializer
def create_reader(path, is_training, input_dim, num_label_classes):
    # Map the "labels" and "features" fields of each line to named streams
    labelStream = C.io.StreamDef(field='labels', shape=num_label_classes)
    featureStream = C.io.StreamDef(field='features', shape=input_dim)
    deserializer = C.io.CTFDeserializer(path, C.io.StreamDefs(labels = labelStream, features = featureStream))
    # Training: randomize and repeat indefinitely; otherwise: one ordered pass over the data
    return C.io.MinibatchSource(deserializer,
       randomize = is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)


## Network definition and training
In CNTK, a computational network (e.g. a neural network) is a **function object**. On one hand a computational network in CNTK is just a function that you can call to apply to data. On the other hand, a computational network contains learnable parameters that can be accessed like object members. Complicated networks can be composed as hierarchies of simpler ones, which, for example, represent layers. 

CNTK function objects are represented internally as graph structures in C++ that encode the computation. This graph structure is wrapped in the Python class `Function`, which exposes the necessary interface so that other Python functions can call it and access its members (such as learnable parameters).

The function object is CNTK's single abstraction used to represent different operations from simple **basic operations** without learnable parameters to  **layers**, **recurrent step functions**, **complete models**, **criterion functions** and more.


### Define the network


In [94]:
# Define the data dimensions
input_dim = 784
num_output_classes = 10

# Create inputs for features and labels
# (dividing by 255 scales the raw pixel values from [0, 255] to [0, 1])
features = C.input_variable(input_dim)/255
labels = C.input_variable(num_output_classes, is_sparse=True)

# Define parameters
W = C.parameter(shape=(input_dim, num_output_classes))
b = C.parameter(shape=(num_output_classes))

# And the network
z = C.times(features, W) + b
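The shapes involved in this network can be checked with a NumPy sketch. The random minibatch of 4 images and the all-zero parameter values are assumptions for illustration only. Note that the features form row vectors here, so `W` has shape (784, 10), the transpose of the layout used in the formula in the Model Overview.

```python
import numpy as np

input_dim = 784
num_output_classes = 10

rng = np.random.default_rng(0)

# Hypothetical minibatch of 4 images with raw pixel values in 0..255.
raw_pixels = rng.integers(0, 256, size=(4, input_dim)).astype(np.float32)
features = raw_pixels / 255          # scale to [0, 1], as in the cell above

# Zero-valued parameters, used here only to illustrate the shapes.
W = np.zeros((input_dim, num_output_classes), dtype=np.float32)
b = np.zeros(num_output_classes, dtype=np.float32)

# (4, 784) @ (784, 10) + (10,) -> (4, 10): one score per class per image.
z = features @ W + b
print(z.shape)
```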


### Define the criterion function


In [95]:
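# Loss to minimize: cross-entropy between softmax(z) and the labels
loss = C.cross_entropy_with_softmax(z, labels)
# Metric to report: fraction of misclassified samples
metric = C.classification_error(z, labels)
# Bundle loss and metric into a single criterion function
criterion = C.combine([loss, metric])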
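To show what these two quantities measure, here is a NumPy sketch that mirrors them for a tiny, hypothetical minibatch of 3 samples and 3 classes. The functions below reuse the CNTK names for readability but are plain re-implementations written for this illustration, not the library's code.

```python
import numpy as np

def cross_entropy_with_softmax(z, one_hot_labels):
    """Per-sample cross-entropy between softmax(z) and one-hot labels."""
    shifted = z - z.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_softmax).sum(axis=1)

def classification_error(z, one_hot_labels):
    """1.0 where the highest-scoring class differs from the label, else 0.0."""
    return (z.argmax(axis=1) != one_hot_labels.argmax(axis=1)).astype(float)

# Hypothetical scores for 3 samples and 3 classes.
z = np.array([[2.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 3.0, 1.0]])
labels = np.eye(3)[[0, 2, 2]]        # true classes: 0, 2, 2

loss = cross_entropy_with_softmax(z, labels)
error = classification_error(z, labels)
print(loss.mean(), error.mean())
```

Only the first sample is classified correctly, so the average error is 2/3. The second sample has uniform scores, which gives a loss of ln 3 for that sample.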


### Train the model using the SGD learner



In [96]:
learner = C.sgd(z.parameters, C.learning_rate_schedule(0.2, C.UnitType.minibatch))

progress_writer = ProgressPrinter()

# Create the reader to the training data set
train_file = "../Data/MNIST_train.txt"
reader_train = create_reader(train_file, True, input_dim, num_output_classes)

progress = criterion.train(minibatch_source = reader_train,
                    streams = (reader_train.streams.features, reader_train.streams.labels),
                    minibatch_size = 64,
                    epoch_size = 6400,
                    max_epochs = 70,
                    parameter_learners=[learner],
                    callbacks = [progress_writer])



Learning rate per minibatch: 0.2
Finished Epoch[1]: loss = 0.773105 * 6400, metric = 19.03% * 6400 1.162s (5507.7 samples/s);
Finished Epoch[2]: loss = 0.461877 * 6400, metric = 12.06% * 6400 0.115s (55652.2 samples/s);
Finished Epoch[3]: loss = 0.380197 * 6400, metric = 10.44% * 6400 0.187s (34224.6 samples/s);
Finished Epoch[4]: loss = 0.389870 * 6400, metric = 11.11% * 6400 0.116s (55172.4 samples/s);
Finished Epoch[5]: loss = 0.344490 * 6400, metric = 9.73% * 6400 0.123s (52032.5 samples/s);
Finished Epoch[6]: loss = 0.363797 * 6400, metric = 10.62% * 6400 0.138s (46376.8 samples/s);
Finished Epoch[7]: loss = 0.359560 * 6400, metric = 9.97% * 6400 0.114s (56140.4 samples/s);
Finished Epoch[8]: loss = 0.340928 * 6400, metric = 9.69% * 6400 0.144s (44444.4 samples/s);
Finished Epoch[9]: loss = 0.324644 * 6400, metric = 8.83% * 6400 0.285s (22456.1 samples/s);
Finished Epoch[10]: loss = 0.330513 * 6400, metric = 9.34% * 6400 0.199s (32160.8 samples/s);
Finished Epoch[11]: loss = 0.318
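The training settings above imply some simple bookkeeping, worked out below. Note that `epoch_size` here is a number of samples used for progress reporting (6400, as the log lines show), not a full pass over the 50,000 training images.

```python
minibatch_size = 64
epoch_size = 6400
max_epochs = 70
training_set_size = 50000   # from the "Data reading" section

# Number of parameter updates per reported epoch.
minibatches_per_epoch = epoch_size // minibatch_size

# Totals over the whole training run.
total_minibatches = minibatches_per_epoch * max_epochs
total_samples = epoch_size * max_epochs

# How many times, on average, each training image is seen.
sweeps_over_training_set = total_samples / training_set_size

print(minibatches_per_epoch, total_minibatches, total_samples, sweeps_over_training_set)
```

So each reported epoch covers 100 minibatches, and the full run of 70 epochs amounts to roughly nine sweeps over the training set.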

### Evaluate the model

In [97]:
validation_file = "../Data/MNIST_validate.txt"
reader_validate = create_reader(validation_file, False, input_dim, num_output_classes)

validation_metric = criterion.test(minibatch_source = reader_validate,
                                  minibatch_size = 64,
                                  streams = (reader_validate.streams.features, reader_validate.streams.labels),
                                  callbacks = [progress_writer])

Finished Evaluation [1]: Minibatch[1-157]: metric = 8.11% * 10000;
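The numbers in the evaluation line can be sanity-checked with a little arithmetic. This sketch assumes that the reader returns a smaller final minibatch when the sample count is not a multiple of the minibatch size.

```python
import math

num_validation_samples = 10000
minibatch_size = 64
error_rate = 0.0811   # metric reported in the line above

# 10000 / 64 is not a whole number, so the last minibatch is partial.
num_minibatches = math.ceil(num_validation_samples / minibatch_size)
last_minibatch_size = num_validation_samples - (num_minibatches - 1) * minibatch_size

# Translate the error rate into counts and accuracy.
num_misclassified = round(error_rate * num_validation_samples)
accuracy = 1 - error_rate

print(num_minibatches, last_minibatch_size, num_misclassified, accuracy)
```

This matches the `Minibatch[1-157]` range in the log and corresponds to 811 misclassified validation images.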


# Hackathon

Try to improve the performance of the model. 

Hints:
- Play with the learning rate, minibatch size and the number of epochs
- You can look at regularization - check the `l1_regularization_weight` and `l2_regularization_weight` hyperparameters of the `sgd` learner
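As a rough intuition for the regularization hint, the sketch below shows one SGD step with a standard L2 penalty in NumPy. It assumes the textbook formulation, where a term `(l2_weight / 2) * ||W||^2` is added to the loss; how CNTK applies and scales its regularization internally is not covered in this lab, and `sgd_step_with_l2` is a hypothetical helper.

```python
import numpy as np

def sgd_step_with_l2(W, grad, learning_rate, l2_weight):
    """One SGD step on loss + (l2_weight / 2) * ||W||^2."""
    # The penalty adds l2_weight * W to the gradient, which pulls
    # every weight toward zero on each update.
    return W - learning_rate * (grad + l2_weight * W)

# Hypothetical weights; a zero data gradient isolates the shrinkage effect.
W = np.array([1.0, -2.0])
grad = np.zeros(2)

W_new = sgd_step_with_l2(W, grad, learning_rate=0.2, l2_weight=0.1)
print(W_new)
```

With these values each weight shrinks by a factor of 1 - 0.2 * 0.1 = 0.98 per step, which discourages large weights and can help generalization.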

## Final testing


DON'T CHEAT. DON'T USE MNIST_test.txt FOR MODEL TRAINING OR SELECTION. DON'T EXECUTE THE CELL BELOW UNTIL YOU ARE READY FOR THE FINAL TEST.


In [98]:
test_file = '../Data/MNIST_test.txt'
reader_test = create_reader(test_file, False, input_dim, num_output_classes)
test_metric = criterion.test(minibatch_source = reader_test,
                           minibatch_size = 64,
                           streams = (reader_test.streams.features, reader_test.streams.labels),
                           callbacks = [progress_writer])

Finished Evaluation [2]: Minibatch[1-157]: metric = 7.74% * 10000;
