# The dataset

<img src="https://tensorflow.org/images/fashion-mnist-sprite.png">


(Image source: [www.tensorflow.org](https://www.tensorflow.org/tutorials/keras/basic_classification))

Today's dataset is [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) -- a drop-in replacement for MNIST, but with clothing items instead of digits. The categories are:

+ 0 -- T-shirt/top
+ 1 -- Trouser
+ 2 -- Pullover
+ 3 -- Dress
+ 4 -- Coat
+ 5 -- Sandal
+ 6 -- Shirt
+ 7 -- Sneaker
+ 8 -- Bag
+ 9 -- Ankle boot

# Loading the dataset

Just run the code below to download and load the data.

* `images` will be the array with the image ("input") data
* `labels` will be the array with the label ("target") data
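
For orientation: the loader in the next cell skips a small binary header in each file before reading the raw bytes. The sketch below is illustrative and not part of the original exercise. It builds a tiny in-memory file in the IDX layout used by MNIST-style data and parses its header. The helper name `parse_idx_header` and the toy data are assumptions made for this example.

```python
import gzip
import struct

import numpy as np


def parse_idx_header(raw):
    """Return (dtype_code, shape, header_size) for raw IDX bytes (hypothetical helper)."""
    # Bytes 0-1 are zero, byte 2 encodes the data type, byte 3 the number of dimensions.
    _, _, dtype_code, n_dims = struct.unpack(">BBBB", raw[:4])
    # Each dimension size is stored as a 4-byte big-endian unsigned integer.
    shape = struct.unpack(">" + "I" * n_dims, raw[4:4 + 4 * n_dims])
    return dtype_code, shape, 4 + 4 * n_dims


# Toy "image file": 2 images of 28x28 pixels, all zeros (0x08 means unsigned byte).
toy_pixels = np.zeros((2, 28, 28), dtype=np.uint8)
toy_raw = struct.pack(">BBBBIII", 0, 0, 0x08, 3, 2, 28, 28) + toy_pixels.tobytes()
toy_gz = gzip.compress(toy_raw)

raw = gzip.decompress(toy_gz)
dtype_code, shape, header_size = parse_idx_header(raw)
# Skip the header and flatten every image into one row of 784 values.
toy_images = np.frombuffer(raw, dtype=np.uint8, offset=header_size).reshape(shape[0], -1)

print(dtype_code, shape, header_size)
print(toy_images.shape)
```

An image file has three dimensions, so its header is 4 + 3 * 4 = 16 bytes. A label file has one dimension, so its header is 4 + 4 = 8 bytes. These are the `offset` values used in `load_mnist`.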

In [None]:
def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    # The docstring has to be the first statement of the function to be recognized as such
    import os
    import gzip
    import numpy as np

    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte.gz'
                               % kind)

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels

In [None]:
!wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
!wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz

import numpy as np
np.random.seed(12)  # Fixing the random seed is good practice for reproducibility

images, labels = load_mnist(".")

images.shape

In [None]:
labels.shape

## Task: Look at the images!

In [None]:
from matplotlib import pyplot as plt

# First get the _first_ image of the images data and reshape it to the visible 28x28 shape
# (Hint: the image is a numpy.ndarray, which has a method for that...)

# Show the image using pyplot.imshow
# Colormap should be "Greys"


reshaped_image = ...
...
plt.show()

## Not a task: just look at the first 30 images

In [None]:
from math import ceil

def show_images(images):
    """Show images in a grid with 10 columns."""
    n_rows = ceil(len(images) / 10)
    # squeeze=False keeps `ax` two-dimensional even if there is only a single row
    fig, ax = plt.subplots(n_rows, 10, figsize=(15, 1.5 * n_rows), squeeze=False,
                           subplot_kw={'xticks':[], 'yticks':[]},
                           gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, image in enumerate(images):
        ax[i // 10, i % 10].imshow(image.reshape(28, 28), cmap='Greys')

show_images(images[:30])
plt.show()

In [None]:
labels[:30]

# Task: PCA

We will use Principal Component Analysis (PCA) to come up with a decomposed version of the dataset.

PCA works in two steps. First we find the components (a step scikit-learn calls "fit"). Then we express the original dataset in terms of these new components, which is called "transform" in scikit-learn's parlance and results in a new representation of the original data.
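
To make the two steps concrete, here is a minimal NumPy-only sketch on toy data. It is not the scikit-learn solution asked for in the task below; it only illustrates conceptually what finding components and transforming mean. All variable names and the toy data are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples with 5 correlated features (not the Fashion-MNIST images).
toy_data = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

# Step 1, find the components: center the data and compute its SVD.
mean = toy_data.mean(axis=0)
centered = toy_data - mean
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
n_components = 2
components = vt[:n_components]              # shape (n_components, n_features)
# Share of the total variance captured by each direction, largest first.
explained_ratio = singular_values**2 / np.sum(singular_values**2)

# Step 2, transform: express every sample in terms of the components.
toy_transformed = centered @ components.T   # shape (n_samples, n_components)

# Going back: map the compressed representation to the original feature space.
toy_reconstructed = toy_transformed @ components + mean

print(toy_transformed.shape, toy_reconstructed.shape)
print(explained_ratio[:n_components].sum())
```

The reconstruction is only an approximation of the original data, because the information carried by the discarded directions is lost.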

In [None]:
# Import and use principal components analysis from Scikit
# generate enough components to explain more than 50% of the total variation
# use the PCA model to transform the image data and store the transformed form in the variable below

...

pca = ... #This should be the model...
    
transformed = ...

In [None]:
## Nothing to do here, just run the code!

# Check that the explained variance of the components matches the ones below. Hopefully, "random seed" works well for this. :-)
variance_rate = pca.explained_variance_ratio_

np.testing.assert_array_almost_equal(variance_rate, np.array([ 0.29039228,  0.1775531 ,  0.06019222]))
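# Side note: a plain Python `assert a == b` does not work for NumPy arrays, because
# `==` compares element-wise and the truth value of the resulting array is ambiguous.
# Floating-point rounding also rules out exact checks such as np.testing.assert_equal,
# so we stick with the approach above, which compares up to a number of decimals (default=6).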

variance_rate

## Task: Back transform

In [None]:
# Please transform "back" all the images with the help of the fitted pca object
# Use the documentation of Scikit - and google ;-) 
# to find the appropriate single function of the PCA object!
# Show the first 30 "back transformed" images here
# You can use the displaying function we have written above to show the first 30 of them...

...

plt.show()

## Task: Observe the reconstructed images, let's discuss what we learn!

## Task: Analyze the components

... And let us see the components!

In [None]:
# Construct the "plot" of the components!
#
# The method is: 
# 1. construct one synthetic image with exactly one component being "strong"
# - ie. with a super high (arbitrary) numeric value in one component
# and all other values 0 (yes, really, just put a biiig number of your choosing in there :-)
# (for this you have to know how many components we have, and create an array of that shape)
#
# 2. Use "back transform" as above, and then 
#
# 3. display the image with Pyplot and imshow
# One separate plot per component is ok, but a complex plot with subplots is more desirable
# https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html - for bonus happiness ;-)

...

plt.show()

And now a 2D and a 3D visualisation with class labels.

In [None]:
from matplotlib import pylab

categories = ["T-shirt/top",
              "Trouser",
              "Pullover",
              "Dress",
              "Coat",
              "Sandal",
              "Shirt",
              "Sneaker",
              "Bag",
              "Ankle boot"]

In [None]:
fig = plt.figure(figsize=(12, 9))

# Plot a 2D scatterplot of the _first two_ dimensions of the transformed data.
# Ideally you already have the transformed data in a numpy array, so you
# basically want "all rows of the first column" vs.
# "all rows of the second column" as the variables of the scatterplot.
# Numpy arrays have no built-in plotting like pandas DataFrames do, so you have to use
# the scatter function from the pyplot (plt) namespace.
# No hassle, the import is already done.
# Use the labels for coloring (argument c) and colormap "tab10" (argument cmap) - yes, just so, as a string. Because it's nice. :-)

...

cb = plt.colorbar()
loc = np.arange(0.5,9.5,9/10)
cb.set_ticks(loc)
cb.set_ticklabels(categories)

plt.show()

## Task: Let's observe the distribution of the original images plotted at the positions defined by their PCA transformation!

What can we learn from it?

In [None]:
# Nothing to do here, just observe the results! (Will take some time to run.)

from matplotlib.offsetbox import OffsetImage, AnnotationBbox

def visualize_scatter_with_images(coords, images, figsize=(45,45), image_zoom=3):
    # From https://www.kaggle.com/gaborvecsei/plants-t-sne
    fig, ax = plt.subplots(figsize=figsize)
    for xy, i in zip(coords, images):
        x0, y0 = xy
        img = OffsetImage(i.reshape(28,28), zoom=image_zoom, cmap="Greys")
        ab = AnnotationBbox(img, (x0, y0), xycoords='data', frameon=False)
        ax.add_artist(ab)
    ax.update_datalim(coords)
    ax.autoscale()
    plt.show()


visualize_scatter_with_images(transformed[:10000,:2], images[:10000])

## Task: Construct a 3D scatterplot

In [None]:
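from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions

fig = plt.figure(figsize=(12, 9))
# Recent matplotlib versions no longer attach Axes3D(fig) to the figure automatically,
# so we create the 3D axes through the figure and set the viewing angle afterwards
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=-150, azim=30)

# Use the "eerily similar" approach to above to construct a 3D scatterplot
# Nearly exactly the same as the 2D scatterplot you did before, same data source...
path = ax...


cb = plt.colorbar(path)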
loc = np.arange(0.5,9.5,9/10)
cb.set_ticks(loc)
cb.set_ticklabels(categories)

plt.show()

# Task: NMF method

In [None]:
# Import and use the non-negative matrix factorization function from Scikit
# Use 3 components to represent the data
# Plot the first 30 "back transformed" images (store them in the variable images_nmf)
# Heavily copy from the approaches above! Nothing new under the sun! :-)

...

nmf = ...

transformed = ...

...

show_images(images_nmf[:30])
plt.show()

## Task: Plot the components!

In [None]:
# Plot the three components as above!
# Copy, copy, modify... Just to get used to what is what...

...


## Task: Observe the behavior of NMF, let's discuss results!

In [None]:
fig = plt.figure(figsize=(12, 9))
plt.scatter(transformed[:,1], transformed[:,2], c=labels, cmap="tab10", s=4)

cb = plt.colorbar()
loc = np.arange(0.5,9.5,9/10)
cb.set_ticks(loc)
cb.set_ticklabels(categories)

plt.show()

**Observe that everything is squeezed into the non-negative region. This is unsurprising, since NMF constrains all values to be non-negative.**
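
Why the factors stay non-negative can be seen in a small NumPy sketch. It uses the classic multiplicative-update rule on toy data, chosen here purely for illustration; scikit-learn's NMF may use a different solver, and all names below are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_data = rng.random((50, 20))             # non-negative toy data, not the real images
n_components = 3
eps = 1e-9                                  # avoids division by zero

# Random non-negative starting factors, so that toy_data is roughly weights @ parts.
weights = rng.random((50, n_components))
parts = rng.random((n_components, 20))

initial_error = np.linalg.norm(toy_data - weights @ parts)
for _ in range(200):
    # Each update multiplies by a ratio of non-negative numbers,
    # so no entry of the factors can ever become negative.
    parts *= (weights.T @ toy_data) / (weights.T @ weights @ parts + eps)
    weights *= (toy_data @ parts.T) / (weights @ parts @ parts.T + eps)
final_error = np.linalg.norm(toy_data - weights @ parts)

print(weights.min() >= 0, parts.min() >= 0)
print(final_error < initial_error)
```

In this sketch `weights` plays the role of the transformed data and `parts` the role of the components, which is why the scatterplot above has no points with negative coordinates.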

In [None]:
# Please implement the 3D visualization of the transformed space as above!

...

# t-SNE

Because t-SNE is slow, we will work with a random sample of the data.
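
The key idea of sampling rows is that one random index array can subset several arrays consistently. Here is a small sketch on toy arrays, not on the real data; the names `toy_features`, `toy_targets` and `toy_index` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(12)
toy_features = np.arange(20).reshape(10, 2)   # 10 rows, 2 columns
toy_targets = np.arange(10) * 10              # one target per row

# Draw 4 distinct row indices (replace=False means no index is drawn twice).
toy_index = rng.choice(len(toy_features), size=4, replace=False)

# The same index array subsets the rows of both arrays, keeping them aligned.
features_subset = toy_features[toy_index]
targets_subset = toy_targets[toy_index]

print(features_subset.shape, targets_subset.shape)
```

Because both arrays are indexed with the same `toy_index`, every selected row still belongs to its original target.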

## Task: create a random subset of the data - only with Numpy

In [None]:
# Create a subsample of the data
# use Numpy's sampling methods - google + documentation / examples helps!
# 5000 samples and no replacement
# The way to solve this elegantly is to create a random index and use it for subsetting the labels and data

n_sample = 5000

sample_indexes = ... #hint, hint: numpy has something like "choice"

images_sample = ... # an indexes variable can be used for subsetting _ROWS_!!! (NOT columns...)

assert images_sample.shape == (5000, 784)

images_sample.shape

## Task: fit t-SNE

In [None]:
# Please import and use Scikit's t-SNE
# Run the fitting with the subsample of the data that you managed to create, called images_sample
# It would be fun to do it with all the data, but that would be rather slow.

...

images_tsne = ...

# Please implement the scatterplot of the transformed data
# use the labels for the sample indices
# Please bear in mind that you don't need all the labels stored in the "labels" variable
# for coloring, just the same subsample that you have created!
# Re-use the index!
# And be brave, use the subsetted labels as "c" value in the scatter function
# use colormap "tab10"
# https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html

...

cb = plt.colorbar()
loc = np.arange(0.5,9.5,9/10)
cb.set_ticks(loc)
cb.set_ticklabels(categories)

plt.show()

# Just some fun: we display the t-SNE results - No code task

In [None]:
visualize_scatter_with_images(images_tsne, images[sample_indexes])

## Task: Please observe the distribution of the images!

**What are some interesting learnings?**