# Lab 1 - Logistic Regression with MNIST


# Lab Overview
In this lab we will build and train a Multiclass Logistic Regression model using the MNIST data. 

The lab comprises two parts. During the first part, the instructor will walk you through the code to define, train, and evaluate the initial version of the MLR model. In the second part, you will compete with other students to improve the performance of the model.

The MNIST data consists of hand-written digits with little background noise, which makes it a standard dataset for building, experimenting with, and learning about deep learning models using reasonably small computing resources.



In [None]:
from IPython.display import Image
Image(url= "http://3.bp.blogspot.com/_UpN7DfJA0j4/TJtUBWPk0SI/AAAAAAAAABY/oWPMtmqJn3k/s1600/mnist_originals.png", width=200, height=200)


[Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) (LR) is a fundamental machine learning technique that uses a linear weighted combination of features and generates probability-based predictions of different classes.

There are two basic forms of LR: **binary LR** (with a single output that can predict two classes) and **multinomial LR** (with multiple outputs, each of which is used to predict a single class).

![LR-forms](http://www.cntk.ai/jup/cntk103b_TwoFormsOfLR-v3.png)

In **Binary Logistic Regression** (see top of figure above), the input features are each scaled by an associated weight and summed together.  The sum is passed through a squashing (aka activation) function and generates an output in [0,1].  This output value (which can be thought of as a probability) is then compared with a threshold (such as 0.5) to produce a binary label (0 or 1).  This technique supports only classification problems with two output classes, hence the name binary LR.  In the binary LR example shown above, the [sigmoid][] function is used as the squashing function.

[sigmoid]: https://en.wikipedia.org/wiki/Sigmoid_function
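
The binary case described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the example weights, and the bias term (which the description above does not mention) are assumptions, not part of the lab code.

```python
import math

def sigmoid(t):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_binary(features, weights, bias=0.0, threshold=0.5):
    """Return (probability, label) for one sample under binary LR."""
    # Scale each feature by its weight and sum; the bias is an assumed extra term
    weighted_sum = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(weighted_sum)
    # Compare against the threshold to obtain a hard 0/1 label
    return probability, int(probability >= threshold)

# Hypothetical feature and weight values, chosen only for illustration
probability, label = predict_binary([1.0, 2.0], [0.5, 0.25])
```

Here the weighted sum is 1.0, so the sigmoid output is above the 0.5 threshold and the predicted label is 1.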

In **Multiclass Logistic Regression** (see bottom of figure above), two or more output nodes are used, one for each output class to be predicted.  Each summation node uses its own set of weights to scale the input features and sum them together. Instead of passing the summed output of the weighted input features through a sigmoid squashing function, the output is often passed through a `softmax` function. Like the `sigmoid`, the `softmax` squashes its inputs; in addition, it normalizes each node's exponentiated output by the sum of the exponentiated outputs of all nodes, so that the results sum to 1.

$$ h(\textbf{z})_i = \frac{e^{z_i}}{\sum_{k=1}^C{e^{z_k}}} \text{,  where  } \textbf{z} = \textbf{W} \times \textbf{x}  + \textbf{b} $$
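
As a minimal sketch of the formula above, the following plain-Python code computes $\textbf{z} = \textbf{W} \times \textbf{x} + \textbf{b}$ and applies the softmax. The helper names and the small 3-class, 2-feature example are hypothetical; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def linear_scores(W, x, b):
    """Compute z = W x + b, with W given as one row of weights per class."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def softmax(z):
    """Normalize a list of raw scores into probabilities that sum to 1."""
    shift = max(z)  # shifting all scores equally leaves the softmax unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-class, 2-feature example
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
x = [2.0, 1.0]
z = linear_scores(W, x, b)   # raw scores, one per class
probs = softmax(z)           # probabilities, one per class
```

Classes with equal raw scores receive equal probabilities, and the class with the lowest score receives the smallest probability.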

In this lab, we will use multinomial LR for classifying the MNIST digits (0-9) using 10 output nodes (1 for each of our output classes).

The figure below summarizes the model in the context of the MNIST data.

![mnist-LR](https://www.cntk.ai/jup/cntk103b_MNIST_LR.png)

During model training, the values of the parameters **W** and **b** are optimized so that the model fits the training samples and generalizes well to samples outside the training set.

This is achieved by iteratively reducing the `loss` function - a function that expresses the "difference" between model predictions and training ground-truth labels. 

`Cross-entropy` is a popular function to measure the loss. It is defined as:

$$ H(\textbf{W},\textbf{b}) = - \sum_{j=1}^C y_j \log (h(\textbf{z})_j ) \text{, where }  y_j \text{ is a ground truth label} $$  
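
To make the definition concrete, here is a small plain-Python sketch of the cross-entropy for a single sample with a one-hot label. The function name, the `eps` guard against `log(0)`, and the example probabilities are assumptions made for illustration.

```python
import math

def cross_entropy(y_true, probs, eps=1e-12):
    """H = -sum_j y_j * log(h_j) for one sample."""
    # eps keeps log() finite if a predicted probability is exactly 0
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, probs))

# Hypothetical one-hot label for the third of three classes
y = [0, 0, 1]
confident_right = [0.05, 0.05, 0.90]
confident_wrong = [0.90, 0.05, 0.05]

loss_right = cross_entropy(y, confident_right)  # equals -log(0.90)
loss_wrong = cross_entropy(y, confident_wrong)  # equals -log(0.05)
```

With a one-hot label only the true class contributes, so the loss is small when the model assigns high probability to the correct class and large when it does not.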

Various optimization approaches can be utilized, Stochastic Gradient Descent (`sgd`) being one of the most popular ones. 

The "classical" stochastic gradient descent uses a single observation to calculate gradient estimates and update the model parameters. This approach is attractive since it does not require the entire data set (all observations) to be loaded in memory and computes each gradient over fewer data points, thus allowing for training on large data sets. However, the updates generated using a single observation at a time can vary wildly between iterations. An intermediate ground is to use a small set of observations and use an average of the `loss` or error from that set to update the model parameters. This subset is called a *minibatch*.

With minibatches, we often sample observations from the larger training dataset. We repeat the parameter update using different combinations of training samples and, over time, minimize the `loss` (and the error). When the error rate no longer changes significantly, or after a preset maximum number of minibatches, we consider the model trained.

In the lab we will use the minibatch form of SGD.
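
The minibatch procedure described above can be sketched with NumPy on a tiny synthetic problem. Everything here is an assumption made for illustration (the 2-D cluster data, the learning rate, the minibatch size, and the helper names); the lab itself relies on CNTK's learner rather than a hand-written loop.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax; the max is subtracted for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(W, b, X, Y):
    """Average cross-entropy of the model over all rows of X."""
    probs = softmax_rows(X @ W + b)
    return float(-np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1)))

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 well-separated 2-D clusters, 100 points each
centers = np.array([[0.0, 4.0], [4.0, -4.0], [-4.0, -4.0]])
class_ids = np.repeat(np.arange(3), 100)
X = centers[class_ids] + rng.normal(scale=0.5, size=(300, 2))
Y = np.eye(3)[class_ids]          # one-hot labels

W = np.zeros((2, 3))
b = np.zeros(3)
learning_rate, minibatch_size, num_minibatches = 0.1, 32, 200

initial_loss = mean_cross_entropy(W, b, X, Y)
for _ in range(num_minibatches):
    # Sample a minibatch of observations from the training set
    idx = rng.choice(len(X), size=minibatch_size, replace=False)
    xb, yb = X[idx], Y[idx]
    probs = softmax_rows(xb @ W + b)
    # Gradient of the minibatch-averaged loss with respect to the scores
    grad_z = (probs - yb) / minibatch_size
    W -= learning_rate * (xb.T @ grad_z)
    b -= learning_rate * grad_z.sum(axis=0)

final_loss = mean_cross_entropy(W, b, X, Y)
accuracy = float(np.mean((X @ W + b).argmax(axis=1) == class_ids))
```

Each iteration uses a different random subset of the data, which keeps the per-step cost low while averaging out much of the noise of single-observation updates.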





# Code Walkthrough

## Initialize environment

In [None]:
# Import the relevant components

import sys
import os
import time
import numpy as np
import cntk as C
from cntk.logging.progress_print import ProgressPrinter



## Data reading

In this tutorial we use the MNIST data pre-processed into the CNTK Text Format (CTF).


    |labels 0 0 0 0 0 0 0 1 0 0 |features 0 0 0 0 ... 
                                                  (784 integers each representing a pixel)
                                                 

Each line in the file contains two key-value pairs, also referred to as streams. The `labels` stream is the one-hot encoded representation of a digit 0-9. The `features` stream is a 784-element vector of integers in the range 0-255 representing a 28 x 28 pixel grayscale image.
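
To illustrate the layout of a single line, the following plain-Python sketch splits a line of this form into its streams. It is a hypothetical helper that handles only the dense layout shown above and is not a replacement for CNTK's own deserializer; the sample line is shortened to four made-up pixel values.

```python
def parse_ctf_line(line):
    """Split one dense CTF line into a dict mapping stream name -> list of floats."""
    streams = {}
    # Each stream starts with '|', followed by its name and then its values
    for chunk in line.strip().split("|"):
        tokens = chunk.split()
        if not tokens:
            continue  # skip the empty piece before the first '|'
        name, values = tokens[0], tokens[1:]
        streams[name] = [float(v) for v in values]
    return streams

# Shortened, made-up sample: 4 pixels instead of 784
sample = "|labels 0 0 0 0 0 0 0 1 0 0 |features 0 128 255 0"
parsed = parse_ctf_line(sample)

# The digit is the position of the 1 in the one-hot label vector
digit = parsed["labels"].index(1.0)
```

For the sample line, the one-hot vector has its 1 in position 7, so the encoded digit is 7.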

Our dataset includes three files: the training file with 50,000 images, the validation file with 10,000 images, and the testing file with 10,000 images.

To read/sample the files, we define a `create_reader` function that configures and returns the CNTK MinibatchSource object.
    


In [None]:
# Ensure we always get the same amount of randomness
np.random.seed(0)

# Read a CTF formatted text (as mentioned above) using the CTF deserializer from a file
def create_reader(path, is_training, input_dim, num_label_classes):
    labelStream = C.io.StreamDef(field='labels', shape=num_label_classes)
    featureStream = C.io.StreamDef(field='features', shape=input_dim)
    deserializer = C.io.CTFDeserializer(path, C.io.StreamDefs(labels = labelStream, features = featureStream))
    return C.io.MinibatchSource(deserializer,
       randomize = is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)


## Network definition and training
In CNTK, a computational network (e.g. a neural network) is a **function object**. On one hand a computational network in CNTK is just a function that you can call to apply to data. On the other hand, a computational network contains learnable parameters that can be accessed like object members. Complicated networks can be composed as hierarchies of simpler ones, which, for example, represent layers. 

CNTK function objects are represented internally as graph structures in C++ that encode the computation. This graph structure is wrapped in the Python class `Function`, which exposes the necessary interface so that other Python functions can call it and access its members (such as learnable parameters).

The function object is CNTK's single abstraction used to represent different operations from simple **basic operations** without learnable parameters to  **layers**, **recurrent step functions**, **complete models**, **criterion functions** and more.


### Define the network


In [None]:
# Define the data dimensions
input_dim = 784
num_output_classes = 10

# Create inputs for features and labels
features = C.input_variable(input_dim)/255
labels = C.input_variable(num_output_classes, is_sparse=True)

# Define parameters
W = C.parameter(shape=(input_dim, num_output_classes))
b = C.parameter(shape=(num_output_classes))

# And the network
z = C.times(features, W) + b


### Define the criterion function


In [None]:
# We will use cross entropy as a loss function
loss = C.cross_entropy_with_softmax(z, labels)

# And a standard classification error as an error metric
metric = C.classification_error(z, labels)

criterion = C.combine([loss, metric])
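
As a rough illustration of what a classification error metric measures, the plain-Python sketch below counts the fraction of samples whose highest-scoring class differs from the true class. It is a conceptual stand-in with hypothetical names and data, not CNTK's implementation.

```python
def classification_error(scores, labels):
    """Fraction of samples whose highest-scoring class differs from the true class."""
    wrong = 0
    for sample_scores, true_class in zip(scores, labels):
        # Index of the largest score is the predicted class
        predicted = max(range(len(sample_scores)), key=sample_scores.__getitem__)
        wrong += predicted != true_class
    return wrong / len(labels)

# Hypothetical scores for four samples and three classes
scores = [[2.0, 0.1, -1.0], [0.2, 0.3, 0.9], [1.5, 1.6, 0.0], [0.0, 3.0, 1.0]]
labels = [0, 2, 0, 1]
error = classification_error(scores, labels)
```

In this example only the third sample is misclassified, so the error is one in four.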


### Train the model using the SGD learner



In [None]:
# Define an SGD learner
learner = C.sgd(z.parameters, C.learning_rate_schedule(0.2, C.UnitType.minibatch))

# Define a helper function to report on training progress
progress_writer = ProgressPrinter()

# Create the reader to the training data set
train_file = "../Data/MNIST_train.txt"
reader_train = create_reader(train_file, True, input_dim, num_output_classes)

# Initiate training
progress = criterion.train(minibatch_source = reader_train,
                    streams = (reader_train.streams.features, reader_train.streams.labels),
                    minibatch_size = 64,
                    epoch_size = 6400,
                    max_epochs = 70,
                    parameter_learners=[learner],
                    callbacks = [progress_writer])



### Evaluate the model

In [None]:
# Create the reader on the validation data set
validation_file = "../Data/MNIST_validate.txt"
reader_validate = create_reader(validation_file, False, input_dim, num_output_classes)

# Score the validation set and calculate the classification error metric
validation_metric = criterion.test(minibatch_source = reader_validate,
                                  minibatch_size = 64,
                                  streams = (reader_validate.streams.features, reader_validate.streams.labels),
                                  callbacks = [progress_writer])

# Hackathon

Try to improve the performance of the model. 

Hints:
- Play with the learning rate, minibatch size and the number of epochs
- You can look at regularization - check the `l1_regularization_weight` and `l2_regularization_weight` hyperparameters of the `sgd` learner
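
For intuition about the regularization hint, the NumPy sketch below shows how L1 and L2 penalties are commonly added to a data loss. The function name, the example values, and the 0.5 factor on the L2 term are assumptions; how CNTK scales these terms internally is not covered here.

```python
import numpy as np

def regularized_loss(data_loss, W, l1_weight=0.0, l2_weight=0.0):
    """Add L1 and L2 weight penalties to an already computed data loss."""
    l1_penalty = l1_weight * np.abs(W).sum()          # encourages sparse weights
    l2_penalty = 0.5 * l2_weight * np.square(W).sum() # discourages large weights
    return float(data_loss + l1_penalty + l2_penalty)

# Hypothetical weights and loss value
W_example = np.array([[1.0, -2.0], [3.0, 0.0]])
plain = regularized_loss(0.5, W_example)
with_l1 = regularized_loss(0.5, W_example, l1_weight=0.1)
with_l2 = regularized_loss(0.5, W_example, l2_weight=0.01)
```

Larger regularization weights penalize large parameter values more strongly, which can reduce overfitting at the cost of a slightly worse fit to the training data.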

## Final testing


DON'T CHEAT. DON'T USE MNIST_test.txt FOR MODEL TRAINING AND SELECTION. DON'T EXECUTE THE BELOW CELL TILL YOU ARE READY FOR THE FINAL TEST


In [None]:
# Create the reader on the testing data set
test_file = '../Data/MNIST_test.txt'
reader_test = create_reader(test_file, False, input_dim, num_output_classes)

# Score the testing data set and calculate the classification error metric
test_metric = criterion.test(minibatch_source = reader_test,
                           minibatch_size = 64,
                           streams = (reader_test.streams.features, reader_test.streams.labels),
                           callbacks = [progress_writer])