In [None]:
import pandas as pd
import numpy as np

import requests
import io

import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib import pyplot
#rcParams['figure.figsize'] = 12, 4

# 10 Introduction to the MNIST Digit Dataset (Extra Chapter)
This is an extra chapter based on the famous MNIST dataset, which consists of images of handwritten digits.

* The data can be found on Kaggle: 
https://www.kaggle.com/vikramtiwari/mnist-numpy
* Part of this tutorial was borrowed from Richard Corrado:
https://richcorrado.github.io/MNIST_Digits-overview.html

The MNIST digit database is a very popular database for studying machine learning techniques, especially pattern-recognition methods. There are many reasons for this. In particular, a large amount of preprocessing has already been done in translating the images into a machine encoding consisting of greyscale pixel values. The student or researcher can therefore investigate further preprocessing, but can also apply machine learning techniques directly to a very clean dataset. Furthermore, it is a standardized dataset that allows results to be compared with published approaches and error rates.

### 10.1 Loading NumPy Data

Unlike the previous data, this dataset is stored as a NumPy `.npz` file. This need not worry us, because the following routine reads the data in easily.

In [None]:
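def load_data(path, isURL=False):
    """Load the MNIST arrays from a local .npz file or, if isURL is True, from a URL."""
    if isURL:
        # download the file and wrap its bytes in a file-like object for np.load
        response = requests.get(path)
        response.raise_for_status()
        path = io.BytesIO(response.content)
    
    # the archive stores four named arrays: images and labels for training and test
    with np.load(path) as f:
        x_train, y_train = f['x_train'], f['y_train']
        x_test, y_test = f['x_test'], f['y_test']
        return (x_train, y_train), (x_test, y_test)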
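
To get a feeling for what an `.npz` file is, the following minimal sketch writes two small arrays into an in-memory archive and reads them back. The array contents and the key names `x_demo` and `y_demo` are hypothetical and chosen only for illustration; they are not part of the MNIST file.

```python
import io
import numpy as np

# Hypothetical toy arrays standing in for image data and labels
images = np.zeros((3, 2, 2), dtype=np.uint8)
labels = np.array([7, 1, 4], dtype=np.uint8)

# np.savez bundles several named arrays into one .npz archive;
# here we write to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
np.savez(buffer, x_demo=images, y_demo=labels)
buffer.seek(0)

# np.load returns a dict-like object; its keys are the names given to np.savez
with np.load(buffer) as archive:
    keys = sorted(archive.files)
    restored_images = archive["x_demo"]
    restored_labels = archive["y_demo"]

print(keys)
print(restored_images.shape, restored_labels.tolist())
```

The attribute `archive.files` lists the stored names, which is a convenient way to find out which keys an unfamiliar `.npz` file offers.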

In [None]:
# 2 Options for loading the MNIST data:

# filepath for local execution:
(x_train, y_train), (x_test, y_test) = load_data('../data/mnist.npz')

# filepath for online (colab) execution: 
#(x_train, y_train), (x_test, y_test) = load_data('https://speicherwolke.uni-leipzig.de/index.php/s/5yaXwWDkq9SsqGj/download/mnist.npz', isURL=True)


Since we received multidimensional NumPy arrays instead of a pandas DataFrame, we'll dive into NumPy a bit and come back to pandas later. First, let's look at the training data:

In [None]:
x_train

Well, except for a lot of zeros there is not really anything to see here. Maybe we should get a broader overview first. Let's have a look at the structure of the data.

* **! TODO:** Use the NumPy attribute `.shape` to display the dimensions of the four loaded data structures. 

* **! TODO:** Answer the following questions:
1. How many samples are there in the training data and in the test data? 
2. How many features does a sample consist of in the training data and in the test data? 
3. How are the labels (categories) of the samples stored?
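
As a hint, here is a minimal sketch of how `.shape` behaves on a small toy array. The arrays `toy_images` and `toy_labels` are hypothetical stand-ins and do not have the dimensions of the MNIST data.

```python
import numpy as np

# Hypothetical toy data: 5 "images" of 4 x 3 pixels and one label per image
toy_images = np.zeros((5, 4, 3), dtype=np.uint8)
toy_labels = np.array([3, 1, 4, 1, 5], dtype=np.uint8)

# .shape is a tuple with one entry per axis
print(toy_images.shape)
print(toy_labels.shape)

# the first axis indexes the samples, the remaining axes describe one sample
n_samples = toy_images.shape[0]
n_features = toy_images[0].size
print(n_samples, n_features)
```

Note that `.shape` is an attribute, not a method, so it is written without parentheses.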

In [None]:
## YOUR CODE STARTS HERE#


#### your answer:
Number of Samples:  
Train Data: 
Test Data: 

Number of Features:  
Train Data:  
Test Data:

How are the labels (categories) of the samples stored?  
-> 

### 10.2 Visualizing The Data

Now that we have learned a little more about the structure of the data, let's look at individual handwritten digits. For this dataset, we can select individual samples using their index. For example, the next cell prints the sample at index 14810 using the `print()` function.

In [None]:
index = 14810
single_sample = x_train[index]
single_label = y_train[index]

print(single_sample)

This is the format of a typical numpy array, but it is a bit large to display well in the notebook. If we drop all of the rows and columns that are all zeros, we can display the nonzero part of the matrix in a fairly compact form:

In [None]:
cropped = single_sample[~np.all(single_sample == 0, axis=1)]
cropped = cropped[:,~np.all(cropped == 0, axis=0)]
print(cropped)

If you look carefully, you might be able to recognize the digit that's been drawn.
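
To make the two masking steps more transparent, here is a minimal sketch on a small matrix. The matrix `toy` is a hypothetical example, not a sample from the dataset.

```python
import numpy as np

# Hypothetical 4x4 "image" with an empty border
toy = np.array([
    [0, 0, 0, 0],
    [0, 5, 0, 0],
    [0, 3, 9, 0],
    [0, 0, 0, 0],
])

# axis=1 checks each row: True where the whole row is zero
empty_rows = np.all(toy == 0, axis=1)
# axis=0 checks each column: True where the whole column is zero
empty_cols = np.all(toy == 0, axis=0)

# ~ inverts the masks, so only rows/columns with a nonzero entry are kept
toy_cropped = toy[~empty_rows][:, ~empty_cols]
print(toy_cropped)
```

Be aware that this approach removes every all-zero row or column, not only those at the border, so a blank line inside a digit would be dropped as well.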

We can get a better visualization of the digit by using matplotlib. Matplotlib conveniently provides a function `matshow()` to make a 2d plot of the entries of a matrix:

In [None]:
plt.matshow(single_sample, cmap=plt.cm.gray)
plt.suptitle("Digit Label: %d" % single_label)
plt.show()

The code below does the following:

1. Chooses 16 random integers between `0` and the `number of training samples - 1` as indices.
2. Uses the matplotlib `subplots` function to generate a grid of 4 rows and 4 columns of subfigures.
3. Loops over the cells of this grid, filling each cell with a plot and giving it a title that corresponds to the known label of the digit.

In [None]:
# generate 16 random indices into x_train (drawn with replacement)
rand_idx = np.random.randint(0, len(x_train), size=16)
# generate a 4x4 grid of subplots
fig, axs = plt.subplots(nrows=4, ncols=4, figsize=(10,10))

# axs is a 4x4 array, so we flatten it and pair each subplot with one random index
for ax, idx in zip(axs.reshape(-1), rand_idx):
    # the title is the known label of the digit at this index
    ax.set_title("Digit Label: %d" % y_train[idx])
    ax.matshow(x_train[idx], cmap=plt.cm.gray)
    ax.axis('off')
# tight_layout adjusts the spacing so that subplots and titles do not overlap
plt.tight_layout()
# tell matplotlib to draw all of the previously generated objects
plt.show()
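
Because `np.random.randint` draws with replacement, the same index can in principle appear more than once in the grid. If that is not desired, one possible alternative is NumPy's `Generator.choice` with `replace=False`. The sketch below uses a hypothetical sample count `n_samples` and an arbitrary seed; neither is taken from the dataset.

```python
import numpy as np

# Hypothetical number of samples, used only for this illustration
n_samples = 1000

# a seeded Generator makes the selection reproducible
rng = np.random.default_rng(seed=42)

# replace=False guarantees that no index is drawn twice
unique_idx = rng.choice(n_samples, size=16, replace=False)
print(unique_idx)
```

In the plotting cell, `n_samples` would correspond to `len(x_train)`.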

### 10.3 Zero Padding

When we ran the plot commands, we could also see a consistent padding of blank pixels around the images. In practical terms, this means that some of our features simply take the value 0, with no variation over the training and/or test set. Depending on your machine learning approach, those features may not be useful for later predictions. In a sense, we might waste some of our learning budget on useless features.

Using NumPy's `.all()` method, we can determine which pixels are 0 over the entire training set `x_train`. Again, we can visualize the resulting pixel matrix using `matshow()`; with the gray colormap, pixels that are always 0 appear white.

In [None]:
plt.matshow((x_train == 0).all(axis=0), cmap=plt.cm.gray)
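
To see what the reduction over `axis=0` computes, here is a minimal sketch on a stack of three tiny images. The array `toy_stack` is a hypothetical example, not real data.

```python
import numpy as np

# Hypothetical stack of three 3x3 "images" (axis 0 indexes the images)
toy_stack = np.array([
    [[0, 1, 0], [0, 2, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 3, 4], [0, 0, 0]],
    [[0, 5, 0], [0, 0, 0], [0, 0, 0]],
])

# True at a pixel position only if that pixel is 0 in every image of the stack
always_zero = (toy_stack == 0).all(axis=0)
print(always_zero)

# summing a boolean array counts the True entries
n_constant = int(always_zero.sum())
print(n_constant)
```

The same `.sum()` call applied to the mask of `x_train` would count how many pixel positions never vary in the training set.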

### 10.4 Frequency Distribution of Digits

A data analysis question we might ask is whether each digit appears equally often in the dataset or whether some digits are favored over others. This matters for our machine learning problem: if a particular digit were very rare in the dataset, it might be relatively hard to learn how to distinguish that digit from the others.

To answer this question, we return to familiar waters and convert the labels into pandas Series. 

In [None]:
y_train_pandas = pd.Series(y_train, name="train")
y_test_pandas = pd.Series(y_test, name="test")


* **! TODO:** Compare the frequency distributions of the digits between training and test data. To do this, perform the steps below:
1. Obtain the normalized frequency of each digit for both training and test data.
2. Combine those two statistics into a single dataframe.
3. Plot the derived dataframe as a bar chart.
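
As a hint for the first two steps, here is a minimal sketch on two small label Series. The Series `toy_train` and `toy_test` are hypothetical, and this is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical toy labels, not the MNIST labels
toy_train = pd.Series([0, 1, 1, 2, 2, 2], name="train")
toy_test = pd.Series([0, 0, 1, 2], name="test")

# normalize=True returns relative frequencies instead of absolute counts;
# sort_index orders the result by label instead of by frequency
train_freq = toy_train.value_counts(normalize=True).sort_index()
test_freq = toy_test.value_counts(normalize=True).sort_index()

# building the frame from a dict aligns both Series on the label index
toy_freq = pd.DataFrame({"train": train_freq, "test": test_freq})
print(toy_freq)
```

A dataframe of this form can be passed to `.plot.bar()` to obtain grouped bars, one group per label.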

In [None]:
## YOUR CODE STARTS HERE#
