# Unsupervised learning

As we already know from our lectures, the absence of labels, or of an external "gold standard", is what characterizes unsupervised learning. This makes it even more important to develop your own feel for the data.
In this notebook we want to analyze Zalando's Fashion-MNIST dataset with unsupervised learning methods.
Our approach:
1. Collecting Data 
2. Visualisation
3. Clustering
4. Evaluation
5. Outlier Detection

## Collecting Data

Since we use a standard dataset, this step is very simple. Keras (and other toolkits) provide functions to load such datasets easily. Because the full dataset with 70,000 images is quite large, we initially work only with the test set (10,000 images). From now on we call the pictures "instances".


In [None]:
from keras.datasets import fashion_mnist

# load_data returns a nested tuple, so we can assign the four arrays
# to separate variables directly when calling it.
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

# Since we're only using test, we can just get rid of long names
x = x_test
y = y_test

# y is an array of length 10000 containing a label in each row
print ("y = {}, y.shape = {}".format(y, y.shape))

# x is also an array, in this case its shape is 10000x28x28. Each row contains 
# a 2-dimensional array (an image) of shape 28x28 with pixel values.

print ("x.shape = {}".format(x.shape))

# In IPython/Jupyter, appending "?" to a name shows detailed information about it
x?

## Looking at the data

Using `matplotlib.pyplot.imshow` we can display one of the pictures (here we only look at the first one). The default colormap is not really intended for images, but we can switch it to gray values.

Most methods cannot deal with the 2-dimensional structure of an image. They expect each instance as a flat vector in which every value spans its own dimension. We therefore need to reshape the 28 x 28 pixels into a vector of 784 values (28*28).


In [None]:
from matplotlib import pyplot as plt

# Display picture
picture_nr = 0
plt.imshow(x_test[picture_nr])
plt.show()

# Flatten each image into a vector of 784 separate dimensions
x = x_test.reshape(len(x_test), 28*28)
print ("reshaped: {} -> {}".format(x_test.shape,x.shape))

# Of course we are not able to display our image like this anymore.
# We need to reverse the transformation first. This time we also use gray values.
plt.imshow(x[picture_nr].reshape(28,28), cmap='binary', interpolation='nearest')
plt.show()

## Clusters with K-Means


The basic idea of K-Means clustering is simple and easy to reimplement.
For each (initially randomly generated) cluster, a "centroid" is calculated as the mean value of all instances belonging to the cluster. Then every instance is reassigned to the cluster whose centroid is closest to it. These two steps are iterated until there are no more changes.

K stands for the number of clusters and must be defined beforehand. It is either determined experimentally (at what value do the results no longer improve?) or, as here, given in advance.

First step: Find a good value for *k* by looking at how many different labels there are in the dataset.

***Hint:*** `numpy.bincount()` counts how often each element occurs in an array.
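
To illustrate what `numpy.bincount()` returns, here is a minimal sketch on a made-up label array. The names `toy_labels`, `counts` and `n_classes` are only placeholders for this illustration and are not part of the exercise.

```python
import numpy as np

# Toy label array (hypothetical values, not taken from Fashion-MNIST)
toy_labels = np.array([0, 2, 2, 1, 2, 0])

# counts[i] is the number of times the value i occurs in the array
counts = np.bincount(toy_labels)
print(counts)  # [2 1 3]

# Number of distinct labels that actually occur
n_classes = np.count_nonzero(counts)
print(n_classes)  # 3
```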

In [None]:
# How many classes do we actually have?
import numpy as np

print( ... ) # -> Please fill in 

Fashion-MNIST has 10 classes and each class appears exactly 1000 times in the test data. So we try to divide the data into 10 groups according to their brightness values:

***Hint***
- you can refer to this pseudocode
 ![kmeans](../data/kmeans-alg.png)
 
 
- However, for this dataset it makes more sense to initialize the centroids with completely random data instead of sampling instances.
- rely on numpy for the distance calculation (Euclidean norm), finding minima and the like.
- after you have implemented it yourself: you can compare it to the results of `sklearn.cluster.KMeans` (`KMeans.fit_predict()`). While the basic algorithm is simple, there are many optimizations possible (foremost: better initialization). Thus it usually makes sense to stick to a framework implementation.
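
As a point of comparison, the two alternating steps described above could be sketched like this on toy 2-D data. This is only one possible reading of the algorithm, not the reference solution: the function name `kmeans_sketch`, its parameters and the toy blobs are assumptions made for this illustration.

```python
import numpy as np

def kmeans_sketch(data, k, n_iter=100, seed=0):
    """Plain K-Means: alternate between assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with random values inside the data range
    low, high = data.min(axis=0), data.max(axis=0)
    centers = rng.uniform(low, high, size=(k, data.shape[1]))
    labels = np.full(len(data), -1)

    for _ in range(n_iter):
        # Assignment step: Euclidean distance of every instance to each centroid
        dists = np.stack([np.linalg.norm(data - c, axis=1) for c in centers], axis=1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no instance changed its cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data (hypothetical): two blobs in 2-D
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                 rng.normal(5.0, 0.5, size=(50, 2))])
toy_labels, toy_centers = kmeans_sketch(toy, k=2)
print(np.bincount(toy_labels, minlength=2))
```

With purely random initialization a centroid can end up without any members, which is one reason why framework implementations use smarter initialization.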

In [None]:
from sklearn.cluster import KMeans

# -> Please complete

# some constants
...

# some variables
...

# init centroids
centers = ...

# main loop
...

print (y_kmeans)
print (centers)

## Let's see how realistic our result is

Let's see how the distribution fits the ground truth. If you have saved the membership array (the actual clustering result) in the variable `y_kmeans`, the following code block creates a bar chart of the cluster sizes:

In [None]:
plt.bar(range(10), np.bincount(y_kmeans))
plt.show()

This doesn't look so good, because our evenly distributed dataset is split into clusters of very different sizes. Let's look further into the actual model: the centroids. Since they live in the same "data space" as the instances, we can simply display them as images to get a feeling for what the model "understood" about the data.

Plot the centroids using the same plotting code as in the beginning of this exercise.

***Hint***
- Don't forget the `reshape`.

In [None]:
# -> Please insert your contribution


Well, this doesn't look so bad, except for some classes with "ghostly" shirts and shoes as centroids. To investigate further, we luckily have the correct answers in the variable `y`. Unfortunately, we cannot compare them directly, as the numbering of our groups is arbitrary.

Now we have to go through our groups and find the ground truth label that fits each one best.

***Hint***
- `numpy.zeros_like(X)` creates an array of zeros with the same shape and data type as *X*
- `scipy.stats.mode(X)` calculates the mode ( = most common value) of the array *X*
- `(X == y)` yields a boolean mask marking the positions of the array *X* that hold the (scalar) value *y*
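
The idea behind these hints can be illustrated on a made-up example. The arrays `toy_clusters` and `toy_truth` and the name `mapping` are hypothetical, and the sketch uses `np.bincount(...).argmax()` as a stand-in for `scipy.stats.mode` to stay within numpy. It assumes non-negative integer labels and that every cluster has at least one member.

```python
import numpy as np

# Toy example (hypothetical values): cluster ids from a clustering run
# and the true labels of the same six instances
toy_clusters = np.array([0, 0, 1, 1, 1, 2])
toy_truth = np.array([3, 3, 7, 7, 3, 7])

# mapping[c] will hold the most common true label inside cluster c
n_clusters = toy_clusters.max() + 1
mapping = np.zeros(n_clusters, dtype=toy_truth.dtype)
for c in range(n_clusters):
    members = toy_truth[toy_clusters == c]      # true labels found in cluster c
    mapping[c] = np.bincount(members).argmax()  # most common of those labels

# Translate every cluster id into its mapped label
toy_pred = mapping[toy_clusters]
print(mapping)   # [3 7 7]
print(toy_pred)  # [3 3 7 7 7 7]
```

Note that clusters 1 and 2 both map to label 7 here, which is the same effect as several groups pointing to one class.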

In [None]:
# -> Please insert your contribution

print(class_mapping)

With the mapping in hand, you can see that some groups point to the same label. For instance, two different variants of *Ankle Boots* (index 9) were distinguished, while other classes were lost.

Now we can also calculate how many mistakes we actually made (see also the extra exercise file *Evaluation*).
You can count the true positives by comparing the truth (`y`) with the mapped assignment (from the code block above), or have scikit-learn calculate the *accuracy*.
Beyond the plain number of correct answers, a confusion matrix gives better insight into which classes cause problems.

***Hint***
- `sklearn.metrics` contains the evaluation functions
- `matplotlib.pyplot.matshow` can be used to visualize the confusion matrix.
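
To make the relation between accuracy and confusion matrix concrete, here is a small numpy-only sketch on made-up labels. The arrays `toy_truth` and `toy_pred` are hypothetical. Rows stand for the true class and columns for the predicted class, which is also the convention of `sklearn.metrics.confusion_matrix`.

```python
import numpy as np

# Toy labels (hypothetical values) for three classes
toy_truth = np.array([0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 0])

n_classes = 3
confusion = np.zeros((n_classes, n_classes), dtype=int)
# Count each (true, predicted) pair; np.add.at handles repeated pairs correctly
np.add.at(confusion, (toy_truth, toy_pred), 1)

# Correct answers sit on the main diagonal
accuracy = np.trace(confusion) / confusion.sum()
print(confusion)
print(accuracy)
```

The accuracy computed this way is the same as `np.mean(toy_truth == toy_pred)`.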

In [None]:
# -> Please complete

import ...

# Option 1 from scratch:
...

# Option 2 sklearn:
...

# Output
...
plt.show()

We're missing some white spots on the main diagonal. These are the classes that cause the most problems. *Sweaters* and *dresses* seem hard to tell apart from *T-shirts* and *tops*.

## Detecting outliers

In order to examine outliers we now limit ourselves to a rather homogeneous class, so that the outliers become more obvious:

In [None]:
index = 1 # Trousers 

# Reduce to one class
x_oc = x[y_test == index]
print(x_oc.shape)

Searching for outliers is always an attempt to define the "normal". A simple solution is to look at the neighbours. If I am like my neighbours, then I am "normal". The *Local Outlier Factor* measures how isolated I am from my neighbours in relation to how close my neighbours are to each other.

***Hint:***
- `sklearn.neighbors.LocalOutlierFactor` can be used like KMeans (`fit_predict`) and returns -1 for outliers and 1 for inliers
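
How the call behaves can be tried on toy data first. The following is a minimal sketch: the 2-D points and the choice `n_neighbors=10` are assumptions made for this illustration, not recommended settings for the trousers class.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 50 toy points around the origin plus one far-away point (hypothetical data)
toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=10)
toy_flags = lof.fit_predict(toy)  # 1 = inlier, -1 = outlier

# Flag of the far-away point (last row)
print(toy_flags[-1])
# The more negative the score, the more isolated the instance
print(lof.negative_outlier_factor_.argmin())
```

The attribute `negative_outlier_factor_` can also be used to sort instances from most to least isolated, instead of relying only on the -1 / 1 flags.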

In [None]:
# -> Please insert your contribution
...
print(np.histogram(...))

Show some pictures of the outlier class and some of the normal class:

In [None]:
# -> Please insert your contribution

...
plt.show()

<details> 
  <summary>Spoiler Warning! The result - Click here to show </summary>
   Among the outliers are bell bottoms, two pairs of trousers in one picture, and even full-body images. Up to now, we have calculated all this only on gray values. By putting more effort into an optimal representation of the data, we could improve the results even more.
</details>

Feel free to analyse other classes, try other clustering algorithms (`DBSCAN`), or familiarize yourself with dimensionality reduction.