# Data Preparation

**DIVE into Deep Learning**
___

In [None]:
%matplotlib inline
import pprint as pp
import tensorflow_datasets as tfds
import tensorflow.compat.v2 as tf
from matplotlib import pyplot as plt
import numpy as np
from IPython import display
# produce vector inline graphics
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('svg')

## Loading Data

**What is an example in a dataset?**

A neural network learns from many examples collected together as a *dataset*. For instance, the [MNIST (Modified National Institute of Standards and Technology)](https://en.wikipedia.org/wiki/MNIST_database) dataset consists of labeled handwritten digits.$\def\abs#1{\left\lvert #1 \right\rvert}
\def\Set#1{\left\{ #1 \right\}}
\def\mc#1{\mathcal{#1}}
\def\M#1{\boldsymbol{#1}}
\def\R#1{\mathsf{#1}}
\def\RM#1{\boldsymbol{\mathsf{#1}}}
\def\op#1{\operatorname{#1}}
\def\E{\op{E}}
\def\d{\mathrm{\mathstrut d}}
$

<a title="Josef Steppan, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:MnistExamples.png"><img alt="MnistExamples" src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png"></a>

A dataset is a sequence 

\begin{align}
(\RM{x}_1,\R{y}_1),(\RM{x}_2,\R{y}_2), \dots\tag{dataset}
\end{align}

of *tuples/instances* $(\RM{x}_i,\R{y}_i)$, each of which consists of

- an *input feature vector* $\RM{x}_i$ such as an image of a handwritten digit and

- a *label* $\R{y}_i$ such as the digit type of the handwritten digit.

The goal is to classify the digit type of a handwritten digit.
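As a rough sketch of this structure, a dataset can be pictured as a Python list of `(features, label)` tuples. The tiny 2x2 "images" and their labels below are invented for illustration and are not MNIST data.

```python
# A toy dataset: each instance is a (feature vector, label) tuple.
# The 2x2 "images" and labels are invented for illustration only.
toy_dataset = [
    ([[0, 255], [0, 255]], 1),      # (x_1, y_1)
    ([[255, 255], [255, 255]], 0),  # (x_2, y_2)
    ([[255, 0], [0, 255]], 7),      # (x_3, y_3)
]

for i, (x, y) in enumerate(toy_dataset, start=1):
    print(f"instance {i}: features={x}, label={y}")

features = [x for x, _ in toy_dataset]  # inputs only
labels = [y for _, y in toy_dataset]    # labels only
```

Separating the inputs from the labels in this way is what a classifier needs: it sees the features and is asked to predict the label.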

**How to load the MNIST dataset?**

We first specify the folder in which to store the downloaded data.  
Press `Shift+Enter` to evaluate the following cell:

In [None]:
import os

user_home = os.path.expanduser("~")  # get user home directory, even if HOME is unset
data_dir = os.path.join(user_home, "data")  # create download folder path

data_dir # show the path

The MNIST dataset can be obtained in many ways due to its popularity in image recognition.  
One way is to use the package [`tensorflow_datasets`](https://blog.tensorflow.org/2019/02/introducing-tensorflow-datasets.html).

In [None]:
import tensorflow_datasets as tfds  # give a shorter name tfds for convenience

ds, ds_info = tfds.load(
    'mnist',
    data_dir=data_dir,   # download location
    as_supervised=True,  # separate input features and label
    with_info=True,      # return information of the dataset
)

ds

- The function `tfds.load` downloads the data to `data_dir` and prepares it for loading through the variable `ds`.
- The data are loaded as [`Tensor`s](https://www.tensorflow.org/guide/tensor), which can be processed faster on a GPU or TPU than on a CPU.

The dataset is split into 
- a training set `ds["train"]` and
- a test set `ds["test"]`.

`tfds.load?` shows more information about the function. E.g., we can control the split ratio using the argument [`split`](https://www.tensorflow.org/datasets/splits).

**Why split the data?**

The test set is used to evaluate the performance of a neural network trained on the training set, which is kept separate from the test set.

The purpose of separating the test set from the training set is to avoid an *overly optimistic* performance estimate. Why?

Suppose the final exam questions (test set) are the same as the previous homework questions (training set). 
- Students may get a high exam score simply by studying the model answers to the homework instead of understanding the entire subject.
- The exam score is therefore an overly optimistic estimate of the students' understanding of the subject.
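The exam analogy can be mimicked with a deliberately naive "model" that only memorizes its training examples. This is a minimal sketch under made-up assumptions: the task (label an integer as odd or even), the helper names, and the default guess are all hypothetical.

```python
def fit_memorizer(examples):
    """'Train' by storing every (input, label) pair in a lookup table."""
    return dict(examples)


def accuracy(table, examples, default=0):
    """Fraction of examples whose stored (or default) label is correct."""
    correct = sum(table.get(x, default) == y for x, y in examples)
    return correct / len(examples)


# Hypothetical task: the label is 1 if the input is odd, else 0.
train = [(x, x % 2) for x in range(0, 10)]   # inputs 0..9 (the "homework")
test = [(x, x % 2) for x in range(10, 20)]   # unseen inputs 10..19 (the "exam")

model = fit_memorizer(train)
train_acc = accuracy(model, train)  # 1.0: every answer was memorized
test_acc = accuracy(model, test)    # 0.5: the default guess is right only for even inputs
print(train_acc, test_acc)
```

Evaluating on the training inputs reports a perfect score even though the model has learned nothing about odd and even numbers. Only the held-out inputs reveal that.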

**How large are the training set and test set?**

Both the training and test sets are loaded as [`Dataset` objects](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).
- The loading is lazy, i.e., the data is not yet in memory, so we cannot count the number of instances directly.  
- Instead, we obtain such information from `ds_info`.
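Lazy loading can be compared to a Python generator, which produces items only when asked. The sketch below is a plain-Python analogy with made-up examples. It does not show how `Dataset` objects are implemented.

```python
def lazy_examples(n):
    """Yield (features, label) pairs one at a time instead of building a list."""
    for i in range(n):
        yield ([i, i + 1], i % 10)  # made-up features and label


stream = lazy_examples(1000)  # nothing has been generated yet

try:
    len(stream)  # a generator has no length
    has_len = True
except TypeError:
    has_len = False

first = next(stream)                # examples are produced only on demand
remaining = sum(1 for _ in stream)  # counting requires a full pass
print(has_len, first, remaining)
```

Counting by iteration works, but it consumes the whole stream. Reading the size from stored metadata avoids that pass.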

**Exercise** Assign to `train_size` and `test_size` the numbers of instances in the training set and test set respectively.

Replace `raise NotImplementedError()` in the solution cell by the following code with the blanks filled with the desired numbers:

```Python
train_size = ___
test_size = ___
```

**Hint** Open a scratchpad with `CTRL+B` and evaluate 
- `ds_info` or
- `dir(ds_info.splits["train"])` and `dir(ds_info.splits["test"])`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
train_size, test_size

In [None]:
# tests
assert 0 < train_size < 100000
assert 0 < test_size < 50000
# hidden tests will be run to check your answers precisely after submission

Note that the training set is often much larger than the test set, especially for deep learning, because 
- training a neural network requires many examples but
- estimating its performance does not.
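One way to see why a smaller test set can suffice: if the test examples are assumed to be independent, the measured accuracy behaves like a binomial proportion, whose standard error is $\sqrt{p(1-p)/n}$ for true accuracy $p$ and test size $n$. The accuracy 0.95 and the test sizes below are illustrative assumptions, not MNIST results.

```python
import math


def accuracy_std_error(p, n):
    """Standard error of an accuracy estimate from n independent test examples."""
    return math.sqrt(p * (1 - p) / n)


p = 0.95  # assumed true accuracy (illustrative)
for n in (100, 1_000, 10_000):
    print(f"n={n:>6}: accuracy {p:.2f} +/- {accuracy_std_error(p, n):.4f}")
```

The uncertainty shrinks only as $1/\sqrt{n}$, so beyond a few thousand test examples each extra example adds little precision to the estimate.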

## Data Visualization

The following retrieves an example from the training set.

In [None]:
for image, label in ds["train"].take(1):
    print(
        f'image type: {type(image)} shape: {image.shape} element dtype: {image.dtype}'
    )
    print(f'label dtype: {label.dtype}')

The for loop above takes one example from `ds["train"]` using the method [`take`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take) and prints its data types. 
- The handwritten digit is represented by a 28x28x1 [`EagerTensor`](https://www.tensorflow.org/guide/eager), which is essentially a 2D array of bytes (8-bit unsigned integers `uint8`). 
- The digit type is an integer.
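To make the 28x28x1 layout concrete, here is a NumPy stand-in for one image. The pixel pattern is made up, and a NumPy array is used instead of a tensor purely for illustration.

```python
import numpy as np

# A stand-in for one MNIST image: height x width x channels, 8-bit pixels.
image = np.zeros((28, 28, 1), dtype=np.uint8)
image[10:18, 13:15, 0] = 255  # draw a made-up vertical stroke

print(image.shape, image.dtype)  # (28, 28, 1) uint8
print(np.iinfo(image.dtype).min, np.iinfo(image.dtype).max)  # 0 255

# The last axis holds a single grayscale channel, so dropping it
# leaves an ordinary 2D array of pixel values.
image_2d = image[:, :, 0]
print(image_2d.shape)  # (28, 28)
print(image.nbytes)    # 784: one byte per pixel
```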

The following code plots the image using the `imshow` function from `matplotlib.pyplot`.

In [None]:
import matplotlib.pyplot as plt

for image, label in ds["train"].take(1): # take 1 example from training set
    plt.imshow(image) # plot the image
    plt.title(label.numpy()) # show digit type as plot title

- The method `numpy()` is needed to convert the label to the correct integer type for `matplotlib`.

The following function plots the image in grayscale and can annotate each pixel with its value:

In [None]:
def plot_mnist_image(example, ax=None, pixel_format=None):
    (image, label) = example
    if ax is None:
        ax = plt.gca()
    ax.imshow(image, cmap="gray_r")  # show image
    ax.title.set_text(label.numpy())  # show digit type as plot title
    # Major ticks
    ax.set_xticks(np.arange(0, 28, 3))
    ax.set_yticks(np.arange(0, 28, 3))
    # Minor ticks
    ax.set_xticks(np.arange(-.5, 28, 1), minor=True)
    ax.set_yticks(np.arange(-.5, 28, 1), minor=True)
    if pixel_format is not None:
        for i in range(28):
            for j in range(28):
                ax.text(
                    j,
                    i,
                    pixel_format.format(image[i, j,
                                              0].numpy()),  # show pixel value
                    va='center',
                    ha='center',
                    color='white',
                    fontweight='bold',
                    fontsize='small')
        ax.grid(color='lightblue', linestyle='-', linewidth=1, which='minor')
        ax.set_xlabel('2nd dimension')
        ax.set_ylabel('1st dimension')
        ax.title.set_text('Image with label ' + ax.title.get_text())


if input('Execute? [Y/n]').lower() != 'n':
    plt.figure(figsize=(11, 11), dpi=80)
    for example in ds["train"].take(1):
        plot_mnist_image(example, pixel_format='{}')
    plt.show()

- We set the parameter `cmap` to `gray_r` so the color is darker if the pixel value is larger.

**Exercise** Complete the following code to generate a matrix plot of the first 50 examples from the training set.  
The parameters `nrows` and `ncols` specify the number of rows and columns respectively. Your code may look like
```Python
...
        for ax, example in zip(axes.flat, ds["train"].____(nrows * ncols)):
            plot_mnist_image(_______, ax)
            ax.axes.xaxis.set_visible(False)
            ax.axes.yaxis.set_visible(False)
...
```
and the output image should look like
![mnist_examples](mnist_examples.svg)

In [None]:
if input('Execute? [Y/n]').lower() != 'n':
    def plot_mnist_image_matrix(ds, nrows=5, ncols=10):
        fig, axes = plt.subplots(nrows=nrows, ncols=ncols)

        # YOUR CODE HERE
        raise NotImplementedError()

        fig.tight_layout()  # adjust spacing between subplots automatically
        return fig, axes


    fig, axes = plot_mnist_image_matrix(ds, nrows=5)
    fig.set_figwidth(9)
    fig.set_figheight(6)
    fig.set_dpi(80)
    # plt.savefig('mnist_examples.svg')
    plt.show()

## Data Preprocessing

We will use the [`tensorflow`](https://www.tensorflow.org/) library to process the data and train the neural network. (Another popular library is [PyTorch](https://pytorch.org/).)

In [None]:
import tensorflow.compat.v2 as tf  # explicitly use tensorflow version 2

Each pixel is stored as an integer from $\{0,\dots,255\}$ ($2^8$ possible values). However, for computations by the neural network, we need to convert it to a floating point number. We will also normalize each pixel value to be within the unit interval $[0,1]$:

\begin{align} 
v \mapsto \frac{v - v_{\min}}{v_{\max} - v_{\min}} = \frac{v}{255}\tag{min-max normalization}
\end{align}

![mnist_example](mnist_example_normalized.svg)
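The formula can be checked on a few values in plain Python. The pixel values below are made up, and `min_max_normalize` is a hypothetical helper written only for this illustration.

```python
def min_max_normalize(values, v_min, v_max):
    """Map each value v to (v - v_min) / (v_max - v_min)."""
    return [(v - v_min) / (v_max - v_min) for v in values]


pixels = [0, 51, 102, 255]  # made-up 8-bit pixel values
normalized = min_max_normalize(pixels, v_min=0, v_max=255)
print(normalized)  # [0.0, 0.2, 0.4, 1.0]
```

With $v_{\min}=0$ and $v_{\max}=255$, the general formula reduces to a division by 255, and Python's `/` already returns floating point numbers.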

**Exercise** Using the function `map`, normalize each element of an image to the unit interval $[0,1]$ after converting it to `tf.float32` using [`tf.cast`](https://www.tensorflow.org/api_docs/python/tf/cast).

Your code may look like
```Python
...
        ds_n[part] = ds[part].map(
                    lambda image, label: (_____(image, _____) / ___, label),
                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
...
```
`map` applies the conversion to each example in the dataset.
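Python's built-in `map` offers a loose analogy, sketched below with a different conversion (inverting pixel values) on made-up examples. One difference to keep in mind: the built-in passes each example as a single tuple, whereas the dataset version in the template above receives `image` and `label` as separate arguments.

```python
def invert(image):
    """Flip dark and light in a list of 8-bit pixel values."""
    return [255 - v for v in image]


# Made-up (image, label) examples with flattened 3-pixel "images".
examples = [([0, 128, 255], 3), ([10, 20, 30], 8)]

# Apply the conversion to the image of every example; labels pass through.
converted = map(lambda ex: (invert(ex[0]), ex[1]), examples)

result = list(converted)  # the built-in map is lazy until it is consumed
print(result)  # [([255, 127, 0], 3), ([245, 235, 225], 8)]
```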

In [None]:
def normalize_mnist(ds):
    """
    Returns:
        MNIST Dataset with image pixel values normalized to float32 in [0,1].
    """
    ds_n = dict.fromkeys(ds.keys())  # initialize the normalized dataset
    for part in ds.keys():
        # normalize pixel values to [0,1]
        # YOUR CODE HERE
        raise NotImplementedError()
    return ds_n


ds_n = normalize_mnist(ds)
ds_n

In [None]:
# Plot the normalized digit
if input('Execute? [Y/n]').lower() != 'n':
    plt.figure(figsize=(11, 11), dpi=80)
    for example in ds_n["train"].take(1):
        plot_mnist_image(example,
                         pixel_format='{:.2f}')  # show pixel value to 2 d.p.s
    # plt.savefig('mnist_example_normalized.svg')
    plt.show()

In [None]:
# tests

To train efficiently and to help avoid overfitting, the training of a neural network uses *stochastic gradient descent*, which
- divides the training into many steps where
- each step uses a *randomly* selected minibatch of samples 
- to improve the neural network *bit-by-bit*.
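These steps can be sketched on a toy problem in plain Python. Everything below is an illustrative assumption: the noiseless data from $y=3x$, the single-weight "model", the learning rate, and the batch size. A real neural network has many more parameters, but each step has the same shape.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Made-up noiseless data from y = 3x; the "model" is a single weight w.
data = [(x, 3 * x) for x in (i / 20 for i in range(1, 21))]

w = 0.0  # initial guess
learning_rate = 0.5
batch_size = 4

for step in range(300):
    batch = random.sample(data, batch_size)  # randomly selected minibatch
    # Gradient of the mean squared error (w*x - y)^2 over the minibatch.
    grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= learning_rate * grad  # one small improvement

print(round(w, 3))
```

Each step sees only 4 of the 20 examples, yet the weight drifts toward 3, the value that generated the data.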

In [None]:
def batch_mnist(ds_n):
    ds_b = dict.fromkeys(ds_n.keys())  # initialize the batched dataset
    for part in ds_n.keys():
        ds_b[part] = (
            ds_n[part]
            .cache()  # cache the normalized examples before any random step
            .shuffle(
                ds_info.splits[part].num_examples,
                reshuffle_each_iteration=True)  # reshuffle examples each epoch
            .batch(128)  # use a minibatch of 128 examples for each step
            .prefetch(tf.data.experimental.AUTOTUNE)  # preload subsequent batches
        )
    return ds_b


ds_b = batch_mnist(ds_n)
ds_b

The above code 
- specifies the batch size (128) and 
- enables caching and prefetching to reduce the latency in loading examples repeatedly for training and testing.
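To see what shuffling and batching do to a stream of examples, here is a plain-Python sketch that shuffles individual examples and then groups them into minibatches. The 12 integer "examples", the batch size of 4, and the helper name are made up. Caching and prefetching are not modeled, and this is not how `tf.data` is implemented.

```python
import random


def shuffled_batches(examples, batch_size, rng):
    """Shuffle the examples, then group them into consecutive minibatches."""
    order = list(examples)
    rng.shuffle(order)  # a new random order on every call (every epoch)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


rng = random.Random(0)      # fixed seed for a repeatable illustration
examples = list(range(12))  # stand-ins for 12 training examples

for epoch in range(2):
    batches = shuffled_batches(examples, batch_size=4, rng=rng)
    print(f"epoch {epoch}: {batches}")
```

Every example appears exactly once per epoch, but it lands in a different minibatch each time, which is what makes the gradient steps stochastic.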

**Exercise** The output of the cell that calls `batch_mnist` should look like
```Python
{'test': <PrefetchDataset shapes: ((None, 28, 28, 1), (None,)), types: (tf.float32, tf.int64)>,
 'train': <PrefetchDataset shapes: ((None, 28, 28, 1), (None,)), types: (tf.float32, tf.int64)>}
```
with a new first dimension of unknown size `None`. Why?

*Hint:* Is the total number of examples divisible by the batch size?

YOUR ANSWER HERE

## Release Memory

A notebook cannot run if there is insufficient memory. When you are done with a notebook, shut down its kernel to release the memory:  
- `Kernel`->`Shut Down Kernel`.

The JupyterLab interface also contains tools to help you monitor your memory consumption.