# Machine Augmented Classification

The basic idea is to take advantage of underlying structure in an initially unlabelled data set.  Subjects are grouped into similar clusters based on proximity in feature space.

Volunteers can then decide what meaning each cluster has or if a cluster should be "dissolved".

Volunteers are provided with a list of predefined labels they can apply either to an entire cluster (assigning a meaning to the cluster), or to individual subjects within a cluster ("dissolving" a cluster that groups subjects belonging to multiple classes).
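The two kinds of volunteer decision can be sketched as below. This is only an illustration: the function `apply_volunteer_decisions`, its dictionary arguments and the use of `-1` for "not yet labelled" are assumptions made for the sketch, not part of any existing pipeline.

```python
import numpy as np

def apply_volunteer_decisions(cluster_ids, cluster_decisions, subject_decisions):
  """Return one label per subject (-1 where no label has been given).

  cluster_ids       : cluster assignment of every subject
  cluster_decisions : {cluster id: label} for clusters labelled as a whole
  subject_decisions : {subject index: label} for subjects in dissolved clusters
  """
  labels = np.full(cluster_ids.shape[0], -1, dtype=int)
  # a cluster-level label is copied to every member of that cluster
  for cluster, label in cluster_decisions.items():
    labels[cluster_ids == cluster] = label
  # per-subject labels come from dissolved clusters and take precedence
  for subject, label in subject_decisions.items():
    labels[subject] = label
  return labels

# six subjects in two clusters: cluster 0 is labelled as a whole,
# cluster 1 is dissolved and two of its three members are labelled one by one
cluster_ids = np.array([0, 0, 0, 1, 1, 1])
labels = apply_volunteer_decisions(cluster_ids, {0: 7}, {3: 4, 5: 9})
print(labels.tolist())  # [7, 7, 7, 4, -1, 9]
```

Subject 4 stays at `-1` because its cluster was dissolved and nobody has labelled it yet.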

After a pass over the data volunteers will have assigned labels to the data set.  An expert can review cluster labels and decide whether to merge or dissolve clusters based on domain knowledge.

A machine can be trained based on these labels.  The aim of this machine is to transform the data into a new feature space such that subjects from dissolved clusters now lie in distinct regions of the new feature space based on the new labels.  Well defined clusters may become even more tightly clustered in the new space.

Performance tracking can still be used here: volunteers can be asked to label gold standard clusters, subsets of clusters, or artificially contaminated clusters.
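One way an artificially contaminated cluster could be built and scored is sketched here. The helper names (`contaminate`, `score_volunteer`), the subject ids and the simulated volunteer are all hypothetical; the two scores are just one plausible choice of performance measure.

```python
import numpy as np

def contaminate(pure_indices, other_indices, n_contaminants, rng):
  """Build a test cluster from a pure cluster plus known off-class subjects."""
  contaminants = rng.choice(other_indices, size=n_contaminants, replace=False)
  members = np.concatenate((pure_indices, contaminants))
  is_contaminant = np.concatenate((np.zeros(len(pure_indices), dtype=bool),
                                   np.ones(n_contaminants, dtype=bool)))
  order = rng.permutation(len(members))  # hide the contaminants among the rest
  return members[order], is_contaminant[order]

def score_volunteer(flagged, is_contaminant):
  """Fraction of contaminants found, and fraction of clean subjects wrongly flagged."""
  found = np.mean(flagged[is_contaminant])
  false_alarms = np.mean(flagged[~is_contaminant])
  return found, false_alarms

rng = np.random.RandomState(0)
pure = np.arange(0, 20)       # hypothetical subject ids known to share one class
others = np.arange(100, 150)  # hypothetical subject ids known to be other classes
members, is_contaminant = contaminate(pure, others, 5, rng)

# a simulated volunteer who spots every contaminant but also flags one clean subject
flagged = is_contaminant.copy()
flagged[np.where(~is_contaminant)[0][0]] = True
found, false_alarms = score_volunteer(flagged, is_contaminant)
print(found, false_alarms)
```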

The labels that the machine is learning need not be exactly what the research team are looking for, but they can assign their own meaning on top of the volunteer labels.

Data sets that machines and humans are good at dealing with should fall out naturally.  A machine will naturally find it easy to classify classes that lie far from each other in feature space.  Humans will be good at identifying the "odd ones out" in clusters that confuse classes.

This Jeremy Howard [TED talk](https://youtu.be/t4kyRyKyOpo?t=729) captures the idea.

In [1]:
import os
import plotly
import numpy as np
import scipy.io as sio

In [2]:
plotly.tools.set_credentials_file(username=os.environ["PLOTLY_USERNAME"], api_key=os.environ["PLOTLY_KEY"])

In [3]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

In [4]:
import matplotlib.pyplot as plt

Define a function for interactive data visualisation

In [5]:
def threeDPlot(X, indices, not_indices, label, not_label):
  trace1 = go.Scatter3d(
    x=X[indices,0],
    y=X[indices,1],
    z=X[indices,2],
    name=label,
    mode='markers',
    marker=dict(
      size=5,
      color='#1E2EDE',
      line=dict(
        color='rgb(204, 204, 204)',
        width=0.1
      ),
      opacity=0.8
    )
  )

  trace2 = go.Scatter3d(
    x=X[not_indices,0],
    y=X[not_indices,1],
    z=X[not_indices,2],
    name=not_label,
    mode='markers',
    marker=dict(
      color='#F5B841',
      size=5,
      symbol='circle',
      line=dict(
        color='rgb(204, 204, 204)',
        width=0.1
      ),
      opacity=0.8
    )
  )

  data = [trace1, trace2]

  layout = go.Layout(
    margin=dict(
      l=0,
      r=0,
      b=0,
      t=0
    )
  )
    
  fig = go.Figure(data=data, layout=layout)
  return fig

## MNIST data set

Let's take the MNIST data set as an example.

In [6]:
from keras.datasets import mnist
 
# Load pre-shuffled MNIST data into train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# flatten the images for PCA
x_train_flattened = np.reshape(x_train, (x_train.shape[0], x_train.shape[1]*x_train.shape[2]))

# limit the number of examples to 10000 so we can work with plotly interactive plots
x_train = x_train[:10000]
x_train_flattened = x_train_flattened[:10000]
y_train = y_train[:10000]

Using TensorFlow backend.


Use PCA to project this data into 3 dimensions so we can visualise it.

In [7]:
from sklearn.decomposition import PCA

In [8]:
pca = PCA(n_components=3)
x_train_pca = pca.fit_transform(x_train_flattened)
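As a rough illustration of what this projection does, the sketch below computes a 3-component PCA directly from an SVD in numpy, on synthetic data rather than MNIST. It is a stand-in for the scikit-learn call above, not a replacement: scikit-learn may flip the sign of individual components, so coordinates can differ by a sign.

```python
import numpy as np

rng = np.random.RandomState(0)
# synthetic data: 200 examples, 10 features with unequal variances
X = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

X_centred = X - X.mean(axis=0)                  # PCA works on mean-centred data
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
X_3d = np.dot(X_centred, Vt[:3].T)              # coordinates along the first 3 components
explained = (S ** 2) / np.sum(S ** 2)           # fraction of variance per component
print(X_3d.shape, explained[:3].sum())
```

The fraction of variance kept by the three components is worth checking before reading much into a 3D plot; in scikit-learn the same quantity is available as `pca.explained_variance_ratio_`.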

Visualise the data singling out images labelled zero.

In [9]:
fig = threeDPlot(x_train_pca, np.where(y_train==0)[0], np.where(y_train!=0)[0], '0', 'not 0')
py.iplot(fig, filename='mnist_pca3_label0')

Use hierarchical clustering (which is unsupervised) to group the subjects based on euclidean distance in the original 784 dimensional pixel space.  Choose 100 clusters for no particular reason.

In [10]:
from sklearn.cluster import AgglomerativeClustering

In [11]:
n_clusters = 100
clustering = AgglomerativeClustering(n_clusters=n_clusters)
clustering.fit(x_train_flattened)

AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='ward', memory=None, n_clusters=100,
            pooling_func=<function mean at 0x110915950>)

In [12]:
from keras.utils import np_utils

one_hot_encoded = np_utils.to_categorical(y_train, 10)

# collect the accepted clusters in lists so this cell can be run (and re-run)
# without relying on variables left over from an earlier run
x_chunks, label_chunks, index_chunks = [], [], []

for cluster in range(n_clusters):
  cluster_indices = np.where(clustering.labels_ == cluster)[0]
  n_assigned_examples = cluster_indices.shape[0]
  cluster_labels = one_hot_encoded[cluster_indices]
  cluster_label_fractions = np.mean(cluster_labels, axis=0)
  dominant_cluster_class = np.argmax(cluster_label_fractions)
  print(cluster, n_assigned_examples, dominant_cluster_class, cluster_label_fractions[dominant_cluster_class])
  # assign labels based on >= 90% class membership, mimicking human labelling
  # I'm assuming that if a cluster is dominated by a single class volunteers
  # will assign that class to it.
  if cluster_label_fractions[dominant_cluster_class] >= 0.9:
    x = x_train[cluster_indices]
    l = np.zeros((x.shape[0], 10))
    l[:,dominant_cluster_class] += 1
    x_chunks.append(x)
    label_chunks.append(l)
    index_chunks.append(cluster_indices)

x_labelled = np.concatenate(x_chunks)
labels = np.concatenate(label_chunks)
labelled_indices = np.concatenate(index_chunks)

print(x_labelled.shape)
print(labels.shape)

# shuffle the labelled examples and add a channel axis for the CNN
m = x_labelled.shape[0]
order = np.random.permutation(m)
x_labelled = x_labelled[order]
x_labelled = x_labelled[:,:,:,np.newaxis]
labels = labels[order]

# every subject that was not in an accepted cluster remains unlabelled
unlabelled_indices = np.setdiff1d(np.arange(x_train.shape[0]), labelled_indices)

The cluster id, number of examples assigned to each cluster, the dominant cluster class and the proportion of the cluster belonging to the dominant cluster class are printed out.

To replicate volunteer labelling, a cluster is assigned the label of its dominant class if that class makes up at least 90% of the cluster.  This assumes that a volunteer shown a cluster in which at least 90% of the subjects belong to one class will assign that class to the whole cluster.
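The acceptance rule can also be written without a Python loop. The sketch below is an equivalent formulation on toy data, assuming integer class targets; `cluster_purity` is a name introduced here for illustration.

```python
import numpy as np

def cluster_purity(cluster_ids, y, n_clusters, n_classes):
  """Return the dominant class, its fraction and the size of every cluster."""
  counts = np.zeros((n_clusters, n_classes))
  # counts[c, k] = number of subjects of class k assigned to cluster c
  np.add.at(counts, (cluster_ids, y), 1)
  sizes = counts.sum(axis=1)
  dominant = counts.argmax(axis=1)
  purity = counts.max(axis=1) / np.maximum(sizes, 1)  # guard against empty clusters
  return dominant, purity, sizes

# toy example: cluster 0 holds ten 3s, cluster 1 holds six 5s and four 8s
cluster_ids = np.array([0] * 10 + [1] * 10)
y = np.array([3] * 10 + [5] * 6 + [8] * 4)
dominant, purity, sizes = cluster_purity(cluster_ids, y, 2, 10)
accepted = purity >= 0.9   # the 90% rule used in this notebook
print(dominant.tolist(), accepted.tolist())  # [3, 5] [True, False]
```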

This gives us a data set of 6943 labelled examples.  There is an upper limit on the label contamination in this data set of 10%.
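The 10% bound follows because every accepted cluster is at least 90% pure, so the size-weighted average of the impurities cannot exceed 10%. A small sketch of that arithmetic, using the sizes and rounded dominant-class fractions quoted in this notebook for clusters 0, 13, 38 and 2 (the function name `label_contamination` is an assumption):

```python
import numpy as np

def label_contamination(sizes, purity, threshold=0.9):
  """Fraction of wrongly labelled subjects among the accepted clusters."""
  accepted = purity >= threshold
  wrong = np.sum(sizes[accepted] * (1 - purity[accepted]))
  return wrong / np.sum(sizes[accepted])

sizes = np.array([160, 121, 74, 90])          # clusters 0, 13, 38 and 2
purity = np.array([0.98, 0.98, 0.55, 0.48])   # rounded dominant-class fractions
contamination = label_contamination(sizes, purity)
print(contamination)
```

Only clusters 0 and 13 pass the threshold here, so the contamination of this small subset is about 2%, well under the bound.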

Define some functions to visualise the data assigned to each cluster.

In [None]:
def getDimensions(n):
  dim = int(np.ceil(np.sqrt(n)))
  return (dim, dim)

In [None]:
def plotCluster(cluster_labels, cluster, X, image_dim, limit=200, cmap='gray_r'):
  indices = np.where(cluster_labels == cluster)[0] # get the examples assigned to this cluster

  n = indices.shape[0]
  print(n)
  if n > limit:
    indices = indices[:limit]
    n = limit
    
  dims = getDimensions(n)
    
  fig = plt.figure(figsize=(20,20))
  for i in range(n):
    ax = fig.add_subplot(dims[0],dims[1],i+1)
    ax.imshow(np.reshape(X[indices[i]], (image_dim,image_dim), order='C'), cmap=cmap)
    plt.axis('off')
  plt.show()

Visualise cluster 0 containing 160 subjects with 98% labelled 0.

In [None]:
plotCluster(clustering.labels_, 0, x_train_flattened, 28)

Visualise cluster 13 containing 121 subjects with 98% labelled 5.

In [None]:
plotCluster(clustering.labels_, 13, x_train_flattened, 28)

Visualise cluster 38 containing 74 subjects with 55% labelled 9.

In [None]:
plotCluster(clustering.labels_, 38, x_train_flattened, 28)

Visualise cluster 2 containing 90 subjects with 48% labelled 3.

In [None]:
plotCluster(clustering.labels_, 2, x_train_flattened, 28)

Train a machine to classify the labelled data set.

Using a CNN here, but other architectures might be better.

In [None]:
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential

In [None]:
def calculateAccuracy(model, x, y, n_classes):
  preds = model.predict(x)
  return 100*np.sum(np.argmax(preds, axis=1)== \
                    np.argmax(np_utils.to_categorical(y, n_classes), axis=1))/ \
          len(preds)

In [None]:
# build the CNN
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding='valid', \
                   activation='relu', input_shape=(28,28,1)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding='valid', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding='valid', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(GlobalAveragePooling2D('channels_last'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
# fit the model to the data
model.fit(x_labelled, labels, epochs=20, batch_size=500)

In [None]:
# determine if test set classes are balanced
print(np.sum(np_utils.to_categorical(y_test, 10), axis=0)/np.sum(np_utils.to_categorical(y_test, 10)))

Calculate the accuracy of this classifier on the test set

In [None]:
test_accuracy = calculateAccuracy(model, x_test[:,:,:,np.newaxis], y_test, 10)
print('Test accuracy: %.4f%%' % test_accuracy)

Determine the effect of running this classifier for double the number of epochs.  This is important for later as we want to ensure that any future improvements to this model are not just because we add epochs.

In [None]:
# clone the above model and load its weights after 20 epochs
model2 = Sequential()
model2.add(Conv2D(filters=16, kernel_size=2, padding='valid', \
                    activation='relu', input_shape=(28,28,1), \
                    weights=model.layers[0].get_weights())) # load the learned weights from the previous model
model2.add(MaxPooling2D(pool_size=2))
model2.add(Conv2D(filters=32, kernel_size=2, padding='valid', \
                    activation='relu',weights=model.layers[2].get_weights()))
model2.add(MaxPooling2D(pool_size=2))
model2.add(Conv2D(filters=64, kernel_size=2, padding='valid', \
                    activation='relu',weights=model.layers[4].get_weights()))
model2.add(MaxPooling2D(pool_size=2))
model2.add(GlobalAveragePooling2D('channels_last'))
model2.add(Dense(10, activation='softmax', weights=model.layers[7].get_weights()))
model2.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
# train the cloned model for an additional 20 epochs
model2.fit(x_labelled, labels, epochs=20, batch_size=500)

In [None]:
test_accuracy = calculateAccuracy(model2, x_test[:,:,:,np.newaxis], y_test, 10)
print('Test accuracy: %.4f%%' % test_accuracy)

Make a clone of the original model, stripping off the output layer so we can project data into the feature space learned by the CNN.

In [None]:
# make a clone of the model above stripping off the output layer and loading the 
# trained weights after 20 epochs
model3 = Sequential()
model3.add(Conv2D(filters=16, kernel_size=2, padding='valid', \
                    activation='relu', input_shape=(28,28,1), \
                    weights=model.layers[0].get_weights())) # load the learned weights from the previous model
model3.add(MaxPooling2D(pool_size=2))
model3.add(Conv2D(filters=32, kernel_size=2, padding='valid', \
                    activation='relu',weights=model.layers[2].get_weights()))
model3.add(MaxPooling2D(pool_size=2))
model3.add(Conv2D(filters=64, kernel_size=2, padding='valid', \
                    activation='relu',weights=model.layers[4].get_weights()))
model3.add(MaxPooling2D(pool_size=2))
model3.add(GlobalAveragePooling2D('channels_last'))

Project the entire data set (10000 subjects) into the CNN feature space.
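The last layer kept in the stripped model is a global average pooling layer, so each image ends up described by one number per filter of the final convolution (64 here). The numpy sketch below illustrates this on random stand-in feature maps. The `output_size` helper is hypothetical and simply traces the spatial size through the three conv (kernel 2, 'valid') and max-pool (size 2) blocks used above.

```python
import numpy as np

def output_size(size, n_blocks=3, kernel=2, pool=2):
  """Spatial size after n_blocks of a 'valid' convolution followed by max pooling."""
  for _ in range(n_blocks):
    size = (size - kernel + 1) // pool
  return size

print(output_size(28), output_size(20))  # 2 1

rng = np.random.RandomState(0)
side = output_size(28)
feature_maps = rng.rand(5, side, side, 64)   # stand-in for the last pooling output
features = feature_maps.mean(axis=(1, 2))    # average each feature map over its spatial axes
print(features.shape)  # (5, 64)
```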

In [None]:
activations = model3.predict(x_train[:,:,:,np.newaxis]) # encode the images
print(activations.shape)

Reduce this feature representation to 3 dimensions for visualisation.

In [None]:
pca = PCA(n_components=3)
activations_pca = pca.fit_transform(activations)

Plot the original data set transformed into the new feature space, distinguishing subjects that were labelled from those that remain unlabelled.

In [None]:
fig = threeDPlot(activations_pca, labelled_indices, unlabelled_indices, 'labelled', 'unlabelled')
py.iplot(fig, filename='mnist_activations_pca3_labelled')

Now perform the clustering again in the new feature space, again arbitrarily looking for 100 clusters.

In [None]:
clustering_activations = AgglomerativeClustering(n_clusters=n_clusters)
clustering_activations.fit(activations)

In [None]:
for cluster in range(n_clusters):
  cluster_indices = np.where(clustering_activations.labels_ == cluster)[0]
  n_assigned_examples = cluster_indices.shape[0]
  cluster_labels = one_hot_encoded[cluster_indices]
  cluster_label_fractions = np.mean(cluster_labels, axis=0)
  dominant_cluster_class = np.argmax(cluster_label_fractions)
  print(cluster, n_assigned_examples, dominant_cluster_class, cluster_label_fractions[dominant_cluster_class])
  # assign labels based on >= 90% class membership, mimicking human labelling
  if cluster_label_fractions[dominant_cluster_class] >= 0.9:
    a = activations[cluster_indices]
    l = np.zeros((a.shape[0], 10))
    l[:,dominant_cluster_class] += 1
    try:
      a_labelled = np.concatenate((a_labelled, a))
      labels = np.concatenate((labels, l))
      labelled_indices = np.concatenate((labelled_indices, cluster_indices))
    except NameError:
      a_labelled = a
      labels = l
      labelled_indices = cluster_indices
        
print(a_labelled.shape)
print(labels.shape)

m = a_labelled.shape[0]
order = np.random.permutation(m)
a_labelled = a_labelled[order]
labels = labels[order]

unlabelled_indices = np.array([x for x in range(x_train.shape[0]) if x not in labelled_indices])

As before, clusters with at least 90% dominant class membership are assigned the label of the dominant class.  The previous set of labels has been forgotten, but a mechanism to take advantage of it might help.  This time we get a labelled training set with 4974 subjects.

Visualise some of these clusters.

Cluster 0 with 151 subjects and 99% labelled 4.

In [None]:
plotCluster(clustering_activations.labels_, 0, x_train, 28)

Cluster 36 with 100 subjects and 29% labelled 8.

In [None]:
plotCluster(clustering_activations.labels_, 36, x_train, 28)

In [None]:
model4 = Sequential()
model4.add(Dense(500, activation='relu', input_shape=(activations.shape[1],)))
model4.add(Dense(10, activation='softmax'))
model4.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
# fit the model to the data
model4.fit(a_labelled, labels, epochs=20, batch_size=500)

In [None]:
test_activations = model3.predict(x_test[:,:,:,np.newaxis])
test_accuracy = calculateAccuracy(model4, test_activations, y_test, 10)
print('Test accuracy: %.4f%%' % test_accuracy)

This is an improvement on the 62% achieved above. This suggests that the data has been transformed into a more discriminant feature space.  Although model4 is trained on ~2000 fewer subjects than model2 it is 18% more accurate.

Confused clusters such as cluster 36 visualised above would be an example of a cluster to dissolve.  Volunteers would be asked to assign a label to each subject in the cluster if the cluster did not appear to capture anything significant.

## 3Pi image data

Let's try this with the PS1 3pi data set.

This is more difficult.  The data is labelled into two classes but there is more underlying structure, such as different artefact types and signal-to-noise.  The classes are also skewed, with 3 times more bogus than real subjects, whereas the MNIST classes are balanced.

In [None]:
path = '/Users/dwright/dev/zoo/data/'
file = '3pi_20x20_skew2_signPreserveNorm.mat'
data = sio.loadmat(path+file)

In [None]:
x_train = data['X'] # load the pixel data
y_train = np.squeeze(data['y']) # load the targets
x_test  = data['testX'] # load the pixel data
y_test  = np.squeeze(data['testy']) # load the targets

Cluster subjects into 20 groups using hierarchical clustering.

In [None]:
n_clusters = 20
clustering_threepi = AgglomerativeClustering(n_clusters=n_clusters)
clustering_threepi.fit(x_train)

In [None]:
del x_labelled
one_hot_encoded = np_utils.to_categorical(y_train, 2)

for cluster in range(n_clusters):
  cluster_indices = np.where(clustering_threepi.labels_ == cluster)[0]
  n_assigned_examples = cluster_indices.shape[0]
  cluster_labels = one_hot_encoded[cluster_indices]
  cluster_label_fractions = np.mean(cluster_labels, axis=0)
  dominant_cluster_class = np.argmax(cluster_label_fractions)
  print(cluster, n_assigned_examples, dominant_cluster_class, cluster_label_fractions[dominant_cluster_class])
  # assign labels based on >= 95% class membership, mimicking human labelling
  # I'm assuming that if a cluster is dominated by a single class volunteers
  # will assign that class to it.
  if cluster_label_fractions[dominant_cluster_class] >= 0.95:
    x = x_train[cluster_indices]
    if dominant_cluster_class == 0:
      l = np.zeros((x.shape[0],))
    elif dominant_cluster_class == 1:
      l = np.ones((x.shape[0],))
    else:
        raise ValueError
    # append this cluster to the labelled set (the first accepted cluster starts it)
    try:
      x_labelled = np.concatenate((x_labelled, x))
      labels = np.concatenate((labels, l))
      labelled_indices = np.concatenate((labelled_indices, cluster_indices))
    except NameError:
      x_labelled = x
      labels = l
      labelled_indices = cluster_indices

m = x_labelled.shape[0]
order = np.random.permutation(m)
x_labelled = x_labelled[order]
x_labelled = np.reshape(x_labelled, (m,20,20), order='F')
x_labelled = x_labelled[:,:,:,np.newaxis]
labels = labels[order]

print(x_labelled.shape)
print(labels.shape)

unlabelled_indices = np.array([x for x in range(x_train.shape[0]) if x not in labelled_indices])

Use a more stringent label assignment criterion of 95% dominant class membership for this data set as there are only 2 classes.  In MNIST cluster contamination was likely a mixture of a subset of the other 9 classes.

This creates a labelled training set of 1688 subjects, but no cluster with >= 95% class membership has a label of real.  We therefore can't train a machine.  We need to explore the cluster dissolving step.

Visualise some of these clusters

Cluster 0 with 113 subjects and 100% bogus subject membership

In [None]:
plotCluster(clustering_threepi.labels_, 0, x_train, 20, cmap='hot')

This cluster could be labelled bogus or 'high signal-to-noise artefacts'.  Alternatively, although this cluster contains 100% bogus subjects, it might be worth dissolving it into 2 further classes, something like 'masked on the right hand side' and 'saturated source subtraction off centre to the left'.  These classes could be easier for the machine to learn, and an expert could assign a label of bogus to each of them without the machine having to force them into the same output neuron.
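One way an expert mapping could sit on top of fine-grained machine outputs is sketched below: the predicted probabilities of all fine classes that map to the same coarse class are summed. The probabilities are made up, and the mapping itself is an assumption based on the two example labels suggested above.

```python
import numpy as np

# fine-grained labels (the two artefact names come from the text above)
fine_labels = ['real',
               'masked on the right hand side',
               'saturated source subtraction off centre to the left']
# hypothetical expert decision: which coarse class each fine label belongs to
expert_mapping = {'real': 'real',
                  'masked on the right hand side': 'bogus',
                  'saturated source subtraction off centre to the left': 'bogus'}
coarse_labels = ['bogus', 'real']

# mapping matrix M[i, j] = 1 if fine class i belongs to coarse class j
M = np.zeros((len(fine_labels), len(coarse_labels)))
for i, fine in enumerate(fine_labels):
  M[i, coarse_labels.index(expert_mapping[fine])] = 1

# made-up softmax outputs over the fine classes for two subjects
fine_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
coarse_probs = fine_probs.dot(M)   # sum the probabilities of grouped classes
print(coarse_probs)
```

The machine keeps one output per fine class; the real/bogus decision is only made afterwards, by the mapping.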

Visualise cluster 3 with 1264 subjects, 56% of which are labelled real.

In [None]:
plotCluster(clustering_threepi.labels_, 3, x_train, 20, cmap='hot')

This cluster could be dissolved based on labels of 'real' and 'burntool artefact'.

For now, to approximate the process of dissolving a cluster we use the labels of real and bogus to dissolve some of the most confused clusters.  Add a small amount of label noise to replicate volunteer errors.

In [None]:
def askClusterLabels(cluster_labels, cluster, X, y, image_dims=None, noise_level=None):
  cluster_indices = np.where(cluster_labels == cluster)[0]
  m = cluster_indices.shape[0]
  if image_dims:
    x = np.reshape(X[cluster_indices], (m, image_dims[0], image_dims[1]), order='F')
    x = x[:,:,:,np.newaxis]
  else:
    x = X[cluster_indices]
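  # use the targets passed in as y (fancy indexing returns a copy, so y is not modified)
  cluster_labels = y[cluster_indices]
  if noise_level:
    # flip a random fraction of these labels to mimic volunteer errors
    flip = np.random.permutation(m)[:int(noise_level*m)]
    cluster_labels[flip] = cluster_labels[flip] != 1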
  return x, cluster_labels, cluster_indices
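The label-noise step can be checked in isolation. The sketch below uses a hypothetical helper, `flip_labels`, and synthetic binary targets to confirm that exactly `int(noise_level * m)` labels end up flipped.

```python
import numpy as np

def flip_labels(y, noise_level, rng):
  """Return a copy of the binary labels y with a fraction noise_level flipped."""
  noisy = y.copy()
  m = noisy.shape[0]
  flip = rng.permutation(m)[:int(noise_level * m)]  # distinct indices, so no double flips
  noisy[flip] = 1 - noisy[flip]
  return noisy

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=1000)   # synthetic binary targets
noisy = flip_labels(y, 0.1, rng)
print(np.mean(noisy != y))  # 0.1
```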

In [None]:
noise_level=0.1

# dissolve the most confused clusters, adding their (noisy) per-subject labels
# to the labelled set
for cluster in (2, 3, 7, 8, 13):
  x, cluster_labels, cluster_indices = askClusterLabels(clustering_threepi.labels_,
                                                        cluster,
                                                        x_train,
                                                        y_train,
                                                        (20, 20),
                                                        noise_level
                                                       )

  labelled_indices = np.concatenate((labelled_indices, cluster_indices))
  x_labelled = np.concatenate((x_labelled, x))
  labels = np.concatenate((labels, cluster_labels))

m = labels.shape[0]
order = np.random.permutation(m)

unlabelled_indices = np.array([x for x in range(x_train.shape[0]) if x not in labelled_indices])

x_labelled = x_labelled[order]
labels = labels[order]
labels = np_utils.to_categorical(labels, 2)

print(x_labelled.shape)
print(labels.shape)
print(labels)
print(np.sum(labels))

We now have a labelled data set of 3638 bogus subjects and 2124 real subjects.

Reduce the 400 dimensional pixel space to 3 dimensions and visualise the labelled and unlabelled data sets.

In [None]:
pca = PCA(n_components=3)
x_train_pca = pca.fit_transform(x_train)

In [None]:
fig = threeDPlot(x_train_pca, labelled_indices, unlabelled_indices, 'labelled', 'unlabelled')
py.iplot(fig, filename='threepi_x_train_pca3')

Train a generic CNN on the new labelled data set as we did for MNIST.  The architecture is exactly the same here, nothing has been tweaked.

In [None]:
model5 = Sequential()

model5.add(Conv2D(filters=16, kernel_size=2, padding='valid', \
                   activation='relu', input_shape=(20,20,1)))
model5.add(MaxPooling2D(pool_size=2))
model5.add(Conv2D(filters=32, kernel_size=2, padding='valid', activation='relu'))
model5.add(MaxPooling2D(pool_size=2))
model5.add(Conv2D(filters=64, kernel_size=2, padding='valid', activation='relu'))
model5.add(MaxPooling2D(pool_size=2))
model5.add(GlobalAveragePooling2D('channels_last'))
model5.add(Dense(2, activation='softmax'))
model5.compile(loss='binary_crossentropy', optimizer='adam')

In [None]:
# fit the model to the data
model5.fit(x_labelled, labels, epochs=20, batch_size=500)

In [None]:
m = x_test.shape[0]
x_test = np.reshape(x_test, (m, 20, 20), order='F')
x_test = x_test[:,:,:,np.newaxis]

In [None]:
# determine the test set class balance
print(np.sum(np_utils.to_categorical(y_test, 2), axis=0)/np.sum(np_utils.to_categorical(y_test, 2)))

Classes are skewed so accuracy is not the best measure.  Calculate the all zeros benchmark as the number to beat.

In [None]:
# determine the all zeros benchmark
preds = np.zeros(y_test.shape)
all_zeros = 100*np.sum(np.argmax(np_utils.to_categorical(preds, 2), axis=1)== \
                    np.argmax(np_utils.to_categorical(y_test, 2), axis=1))/ \
          len(preds)
print('All zeros accuracy: %.4f%%' % all_zeros)
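Since accuracy is a weak measure on skewed classes, per-class recall could be reported alongside it. The sketch below uses synthetic targets with a 3:1 bogus-to-real skew, not the 3Pi test set, and `per_class_recall` is a name introduced here. It shows that an all zeros classifier looks reasonable on accuracy while never finding a real subject.

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
  """Fraction of each true class that was predicted correctly."""
  return np.array([np.mean(y_pred[y_true == k] == k) for k in range(n_classes)])

# synthetic targets: 75 bogus (0) and 25 real (1)
y_true = np.array([0] * 75 + [1] * 25)
all_zeros_preds = np.zeros_like(y_true)

accuracy = np.mean(all_zeros_preds == y_true)
recall = per_class_recall(y_true, all_zeros_preds, 2)
balanced_accuracy = recall.mean()
print(accuracy, recall.tolist(), balanced_accuracy)  # 0.75 [1.0, 0.0] 0.5
```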

In [None]:
test_accuracy = calculateAccuracy(model5, x_test, y_test, 2)
print('Test accuracy: %.4f%%' % test_accuracy)

As with MNIST, clone the above network so we can project the data into the learned feature space.

In [None]:
# make a clone of the model above stripping off the output layer and load the weights
model6 = Sequential()
model6.add(Conv2D(filters=16, kernel_size=2, padding='valid', \
                    activation='relu', input_shape=(20,20,1), \
                    weights=model5.layers[0].get_weights())) # load the learned weights from the previous model
model6.add(MaxPooling2D(pool_size=2))
model6.add(Conv2D(filters=32, kernel_size=2, padding='valid', \
                    activation='relu',weights=model5.layers[2].get_weights()))
model6.add(MaxPooling2D(pool_size=2))
model6.add(Conv2D(filters=64, kernel_size=2, padding='valid', \
                    activation='relu',weights=model5.layers[4].get_weights()))
model6.add(MaxPooling2D(pool_size=2))
model6.add(GlobalAveragePooling2D('channels_last'))

In [None]:
m = x_train.shape[0]
x_train = np.reshape(x_train, (m, 20, 20), order='F')
x_train = x_train[:,:,:,np.newaxis]
activations = model6.predict(x_train) # encode the images
print(activations.shape)

Visualise the new data projections

In [None]:
pca = PCA(n_components=3)
activations_pca = pca.fit_transform(activations)

In [None]:
fig = threeDPlot(activations_pca, np.where(y_train==1)[0], np.where(y_train==0)[0], 'real', 'bogus')
py.iplot(fig, filename='threepi_activations_pca3')

Repeat the clustering but in the new feature space

In [None]:
clustering_activations = AgglomerativeClustering(n_clusters=n_clusters)
clustering_activations.fit(activations)

In [None]:
del labels
for cluster in range(n_clusters):
  cluster_indices = np.where(clustering_activations.labels_ == cluster)[0]
  n_assigned_examples = cluster_indices.shape[0]
  cluster_labels = one_hot_encoded[cluster_indices]
  cluster_label_fractions = np.mean(cluster_labels, axis=0)
  dominant_cluster_class = np.argmax(cluster_label_fractions)
  print(cluster, n_assigned_examples, dominant_cluster_class, cluster_label_fractions[dominant_cluster_class])
  # assign labels based on >= 95% class membership, mimicking human labelling
  if cluster_label_fractions[dominant_cluster_class] >= 0.95:
    a = activations[cluster_indices]
    l = np.zeros((a.shape[0], 2))
    l[:,dominant_cluster_class] += 1
    try:
      a_labelled = np.concatenate((a_labelled, a))
      labels = np.concatenate((labels, l))
      labelled_indices = np.concatenate((labelled_indices, cluster_indices))
    except NameError:
      a_labelled = a
      labels = l
      labelled_indices = cluster_indices
        
print(a_labelled.shape)
print(labels.shape)

m = a_labelled.shape[0]
order = np.random.permutation(m)
a_labelled = a_labelled[order]
labels = labels[order]

unlabelled_indices = np.array([x for x in range(x_train.shape[0]) if x not in labelled_indices])    

Again, cluster labels are assigned based on 95% cluster class membership.

Dissolve the most confused clusters.

In [None]:
noise_level=0.1

# dissolve the most confused clusters in the new feature space
for cluster in (3, 11, 12, 16, 17):
  x, cluster_labels, cluster_indices = askClusterLabels(clustering_activations.labels_,
                                                        cluster,
                                                        activations,
                                                        y_train,
                                                        noise_level=noise_level
                                                       )

  labelled_indices = np.concatenate((labelled_indices, cluster_indices))
  a_labelled = np.concatenate((a_labelled, x))
  # pass the number of classes explicitly so a single-class cluster still gives 2 columns
  labels = np.concatenate((labels, np_utils.to_categorical(cluster_labels, 2)))
print(a_labelled.shape)
print(labels.shape)

This produces a labelled data set with 9179 subjects with classes skewed as shown below.

In [None]:
print(np.sum(labels, axis=0)/np.sum(labels))

Build a fully connected Neural Net with a single hidden layer to learn these labels.

In [None]:
model7 = Sequential()
model7.add(Dense(500, activation='relu', input_shape=(activations.shape[1],)))
#model.add(Dropout(0.3))
model7.add(Dense(2, activation='softmax'))
model7.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
# fit the model to the data
model7.fit(a_labelled, labels, epochs=20, batch_size=500)

In [None]:
test_activations = model6.predict(x_test)
test_accuracy = calculateAccuracy(model7, test_activations, y_test, 2)
print('Test accuracy: %.4f%%' % test_accuracy)

This is an improvement on the 91.4% we had above, although only a small one.  The weakness might be that subjects are assigned to only one of 2 classes.