# Unsupervised Deep Embedding Clustering for Zooniverse

This notebook follows the approach from [Unsupervised Deep Embedding for Clustering Analysis](https://arxiv.org/abs/1511.06335), using the Keras implementation [here](https://github.com/XifengGuo/DEC-keras).

The algorithm uses greedy layer-wise pretraining of a [deep denoising autoencoder](https://blog.keras.io/building-autoencoders-in-keras.html) to learn an initial embedding that minimises reconstruction loss. Cluster centres are then initialised by running K-means on the embedded data. Training examples (subjects) are encoded by the autoencoder and given a soft assignment to each cluster based on their proximity to its centre. The network parameters and cluster centres are then trained further by minimising the KL divergence between the soft cluster assignments and an auxiliary distribution. The auxiliary distribution assumes that the examples lying closest to a cluster centre have a high confidence of belonging to the same class (or at least share some relationship that is worth reinforcing). It also normalises the loss contribution of each centroid so that the loss is not completely dominated by the largest clusters.
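To make the soft assignment and the auxiliary distribution concrete, here is a minimal NumPy sketch of the two formulas from the DEC paper: a Student's t kernel for the soft assignment `q`, and a squared, frequency-normalised version of `q` for the auxiliary distribution `p`. The function names and the toy data are my own for illustration; they are not the DEC-keras API.

```python
import numpy as np

def soft_assign(z, centres, alpha=1.0):
    """Student's t soft assignment q[i, j] of embedded point i to cluster j."""
    # Squared Euclidean distance between every point and every centre.
    d2 = ((z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalise each row so the assignments of one point sum to 1.
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Auxiliary distribution p: square q, divide by the soft cluster frequency."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL(P || Q) summed over all points and clusters."""
    return float(np.sum(p * np.log(p / q)))

# Toy embedding: 6 points and 3 centres in 2 dimensions (illustrative only).
rng = np.random.default_rng(0)
z_demo = rng.normal(size=(6, 2))
centres_demo = rng.normal(size=(3, 2))

q_demo = soft_assign(z_demo, centres_demo)
p_demo = target_distribution(q_demo)
loss_demo = kl_divergence(p_demo, q_demo)
```

Squaring `q` sharpens confident assignments, and dividing by the column sums is the per-cluster normalisation that stops large clusters from dominating the loss.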

The experiments below apply this to Zooniverse.

#### Experiment 1
Run the MNIST experiment from the original paper.  This is our completely unsupervised benchmark.  The clusters currently have no meaning.

The original paper uses the entire MNIST data set of 70000 images. We should experiment with how varying the size of the initial training set affects performance. For the rest of our experiments we will need a test set, so we should at least hold out the 10000 MNIST test images, reducing the training set to 60000 subjects.
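A hold-out split with a variable training size could look like the sketch below. It assumes the test images are the last rows of the array, which should be checked against the repository's `datasets.py` before applying it to the real `x` and `y`. The helper name `split_and_subsample` is hypothetical and the data is a toy stand-in.

```python
import numpy as np

def split_and_subsample(x, y, n_test, n_train, seed=0):
    """Hold out the last n_test rows, then draw n_train rows from the rest."""
    x_pool, y_pool = x[:-n_test], y[:-n_test]
    x_test, y_test = x[-n_test:], y[-n_test:]
    rng = np.random.default_rng(seed)
    # Sample without replacement so no subject appears twice in training.
    idx = rng.choice(len(x_pool), size=n_train, replace=False)
    return (x_pool[idx], y_pool[idx]), (x_test, y_test)

# Toy stand-in for MNIST: 70 "images" of 4 pixels, 10 classes.
x_demo = np.arange(70 * 4, dtype=float).reshape(70, 4)
y_demo = np.arange(70) % 10

(x_tr, y_tr), (x_te, y_te) = split_and_subsample(x_demo, y_demo, n_test=10, n_train=30)
```

For MNIST the corresponding call would use `n_test=10000` and a range of `n_train` values up to 60000.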

#### Experiment 2
Simulate querying cluster labels from volunteers for the clustering in experiment 1. Tack a layer that maps each cluster to a label onto the network created in experiment 1, then train the network with the MNIST labels. This replicates querying labels from perfect classifiers for every subject in the MNIST data set. How does this performance compare to the unsupervised benchmark? How does it compare to training a network directly from scratch on the labels? We should observe a shorter time to convergence and potentially better performance. Experiment with differing levels of volunteer classification noise and with which subjects we should query labels for.
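One simple way to simulate imperfect volunteers is to flip a fixed fraction of the true labels to a different class chosen uniformly at random. This is only one possible noise model (real volunteer confusion is unlikely to be uniform), and the helper name `add_label_noise` is hypothetical.

```python
import numpy as np

def add_label_noise(y, noise_rate, n_classes, seed=0):
    """Return a copy of integer labels y with roughly noise_rate of them flipped."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    # Each label is flipped independently with probability noise_rate.
    flip = rng.random(len(y)) < noise_rate
    # An offset in 1..n_classes-1 guarantees the new label differs from the true one.
    offsets = rng.integers(1, n_classes, size=int(flip.sum()))
    y_noisy[flip] = (y[flip] + offsets) % n_classes
    return y_noisy

y_clean = np.arange(1000) % 10
y_20 = add_label_noise(y_clean, noise_rate=0.2, n_classes=10)
observed_rate = float(np.mean(y_20 != y_clean))
```

Sweeping `noise_rate` from 0 upwards gives a series of label sets on which the supervised fine-tuning can be repeated.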

#### Experiment 3
Repeat experiments 1 and 2 for the Supernova Hunters data.

#### Experiment 4
Repeat experiments 1 and 2 for the CIFAR-10, CIFAR-100 and STL-10 data sets. These are more similar to ecology projects. STL-10 in particular might be interesting, as it has only 500 training images and 800 test images per class but 100000 unlabelled images. It could be used to test serendipitous discovery and the presence of uninteresting classes. We could also test transfer learning between data sets.

#### Experiment 5
Repeat experiments 1 and 2 but with the Marcos Ecology Project data set.

#### Experiment 6
Use Marcos pretrained CNN to replace the deep autoencoder in experiment 1.  This investigates transfer learning applied to our clustering approach to gathering labels.

#### Experiment 7
How can the idea of dissolving clusters be incorporated into the network architecture? Should we just 'delete' the cluster, in which case all its members will be assigned to their next best cluster? Should we randomly reinitialise the cluster (this runs the risk of undoing earlier training)? Or should we divide the cluster into 2 new clusters, with the new cluster centres initialised based on some information we have about where different classes lie within the original cluster?
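Two of these options can be sketched directly on the array of cluster centres, outside the network. The snippet below is a NumPy illustration on toy data, not a change to the DEC-keras `ClusteringLayer`; the function names and the boolean `hint` array (standing in for whatever class information we have about members of the cluster) are hypothetical.

```python
import numpy as np

def nearest_centre(z, centres):
    """Hard assignment: index of the closest centre for each embedded point."""
    d2 = ((z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def delete_cluster(centres, k):
    """Option 1: drop centre k; its members fall to their next best cluster."""
    return np.delete(centres, k, axis=0)

def split_cluster(z, assign, centres, k, hint):
    """Option 3: replace centre k by the means of its two hinted sub-groups."""
    members = assign == k
    centre_a = z[members & ~hint].mean(axis=0)
    centre_b = z[members & hint].mean(axis=0)
    return np.vstack([np.delete(centres, k, axis=0), centre_a, centre_b])

# Toy embedding with three clusters; cluster 2 sits between the other two.
z_toy = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.], [4., 0.], [4., 1.]])
centres_toy = np.array([[0., 0.5], [10., 0.5], [4., 0.5]])
assign_toy = nearest_centre(z_toy, centres_toy)

after_delete = nearest_centre(z_toy, delete_cluster(centres_toy, 2))

hint_toy = np.array([False, False, False, False, False, True])
after_split = nearest_centre(
    z_toy, split_cluster(z_toy, assign_toy, centres_toy, 2, hint_toy))
```

In the network itself the equivalent step would mean resizing the clustering layer's weight matrix, which this sketch does not attempt.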

## Setup

In [34]:
import sys
from time import time

In [32]:
from keras.optimizers import SGD

Import the DEC implementation and various helper functions from the DEC-keras repository.

In [2]:
sys.path.insert(0,'../DEC-keras/')

In [24]:
from DEC import DEC, cluster_acc

## Experiment 1

Load the MNIST data set, normalised as in the DEC paper.

In [8]:
from datasets import load_mnist

In [9]:
x, y = load_mnist()

MNIST samples (70000, 784)


Define some constants from the paper.

In [25]:
n_clusters = 10 # this is chosen based on prior knowledge of classes in the data set.
batch_size = 256
lr         = 0.01 # learning rate
momentum   = 0.9
tol        = 0.001 # tolerance - clustering stops if less than this fraction of the data changes cluster between updates
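The tolerance check amounts to comparing the hard cluster labels from two consecutive updates and measuring the fraction that changed. A minimal sketch with toy labels follows; the function name is my own, not the one used inside DEC-keras.

```python
import numpy as np

def label_change_fraction(y_prev, y_curr):
    """Fraction of subjects whose cluster label changed between two updates."""
    return float(np.mean(y_prev != y_curr))

tol_demo = 0.001

# Toy labels for 10000 subjects; 5 of them move to a different cluster.
y_prev_demo = np.arange(10000) % 10
y_curr_demo = y_prev_demo.copy()
y_curr_demo[:5] = (y_curr_demo[:5] + 1) % 10

delta_demo = label_change_fraction(y_prev_demo, y_curr_demo)
should_stop = delta_demo < tol_demo
```

Here 5 of 10000 labels change, a fraction of 0.0005, which is below the tolerance, so training would stop.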

Define some constants for this implementation

In [28]:
maxiter         = 2e4
update_interval = 140
save_dir        = '../DEC-keras/results/dec'

In [11]:
# prepare the DEC model
dec = DEC(dims=[x.shape[-1], 500, 500, 2000, 10], n_clusters=n_clusters, batch_size=batch_size)

I have already run the greedy layer-wise pretraining, so tell DEC where to find the saved autoencoder weights.

In [21]:
ae_weights = '../DEC-keras/ae_weights.h5'

In [22]:
dec.initialize_model(optimizer=SGD(lr=lr, momentum=momentum),
                     ae_weights=ae_weights,
                     x=x)

Display a summary of the model architecture.

In [23]:
dec.model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 784)               0         
_________________________________________________________________
encoder_0 (Dense)            (None, 500)               392500    
_________________________________________________________________
encoder_1 (Dense)            (None, 500)               250500    
_________________________________________________________________
encoder_2 (Dense)            (None, 2000)              1002000   
_________________________________________________________________
encoder_3 (Dense)            (None, 10)                20010     
_________________________________________________________________
clustering (ClusteringLayer) (None, 10)                100       
Total params: 1,665,110.0
Trainable params: 1,665,110.0
Non-trainable params: 0.0
____________________________________________________________
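As a sanity check, the parameter counts in the summary can be reproduced by hand: each dense layer has `n_in * n_out` weights plus `n_out` biases, and the clustering layer holds one centre of embedding dimension 10 per cluster. The variable names below are local to this check.

```python
dims_check = [784, 500, 500, 2000, 10]
n_clusters_check = 10

# Weights plus biases for each consecutive pair of layer sizes.
dense_params = [n_in * n_out + n_out
                for n_in, n_out in zip(dims_check[:-1], dims_check[1:])]

# One cluster centre per cluster, each living in the embedding space.
clustering_params = n_clusters_check * dims_check[-1]

total_params = sum(dense_params) + clustering_params
print(dense_params, clustering_params, total_params)
# [392500, 250500, 1002000, 20010] 100 1665110
```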

In [36]:
t0 = time()
y_pred = dec.clustering(x, y=y, tol=tol, maxiter=maxiter,
                        update_interval=update_interval, save_dir=save_dir)
print('acc:', cluster_acc(y, y_pred))
print('clustering time: ', (time() - t0))

Update interval 140
Save interval 1367.1875
Initializing cluster centers with k-means.
Iter 0 : Acc 0.87557 , nmi 0.8581 , ari 0.82627 ; loss= 0
saving model to: ../DEC-keras/results/dec//DEC_model_0.h5
Iter 140 : Acc 0.87553 , nmi 0.85805 , ari 0.82621 ; loss= 0.03464
delta_label  0.000542857142857 < tol  0.001
Reached tolerance threshold. Stopping training.
saving model to: ../DEC-keras/results/dec//DEC_model_final.h5
acc: 0.875528571429
clustering time:  69.14353013038635


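The above gives our unsupervised benchmark accuracy of 87.55% on the entire MNIST data set (70000 subjects). Note that training stopped at the first update interval (iteration 140) because only about 0.054% of subjects changed cluster, so this figure is essentially unchanged from the accuracy at iteration 0 (87.56%), immediately after the K-means initialisation on the pretrained embedding.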
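Because cluster indices are arbitrary, an unsupervised accuracy like this has to be scored under the best one-to-one mapping between clusters and classes. The sketch below shows the usual way to do that with the Hungarian algorithm via SciPy's `linear_sum_assignment`. It is my own illustration of the idea, and the repository's `cluster_acc` may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one matching of cluster ids to class labels."""
    n = int(max(y_true.max(), y_pred.max())) + 1
    # counts[c, k] = number of subjects in cluster c whose true class is k.
    counts = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    # The solver minimises cost, so negate the counts to maximise agreement.
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / len(y_true)

y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([1, 1, 0, 0, 2, 1])
acc_demo = clustering_accuracy(y_true_demo, y_pred_demo)
```

In the toy example clusters 0 and 1 are swapped relative to the classes and one subject is misplaced, so the best matching scores 5 of 6 correct.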