In [None]:
// boring imports
var {loadUnlabelledWine, grid2, plotClustersWithLabels} = require('./utils')
var Plot = require('plotly-notebook-js');
var table = require('text-table');

# Unsupervised Learning

![Unsupervised Learning slide](images/slide_unsupervised.png)


# K Means Clustering


![K-means algorithm](images/slide_kmeans.png)

### Setup

Load our dataset and pick out the two features of interest, the ones we were looking at on the previous page.

In [14]:
var {features, dataset} = loadUnlabelledWine({ verbose: true });

our dataset has 178 rows and  13 columns
Alcohol | Malic Acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline | Class
---------------------------
14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8  | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065
 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4  | 1050
13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8  | 3.24 | 0.3  | 2.81 | 5.68 | 1.03 | 3.17 | 1185
14.37 | 1.95 | 2.5  | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8  | 0.86 | 3.45 | 1480
13.24 | 2.59 | 2.87 | 21   | 118 | 2.8  | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735


In [15]:
var input = dataset.map(d => [d[0], d[11]]);

### Clustering

Now run the algorithm. The docs are [here](https://mljs.github.io/kmeans/).

In [16]:
var KMEANS = require('ml-kmeans');

var K = 3;

var options = {
    maxIterations: 100,
    tolerance: 1e-6,
    withIterations: false,
    // distanceFunction: () => {}, // can specify our own distance but may not converge
    initialization: 'random'
}

var ans = KMEANS(input, K, options)

var {converged, clusters, centroids, iterations} = ans;

if (converged) {
    console.log("Converged after", iterations, "iterations")
}
else {
    console.log("Did not converge after", iterations, "iterations")
}

Converged after 8 iterations
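
For intuition, the loop that a K-means run performs can be sketched in plain JavaScript. This is a minimal illustration of the assign-then-update cycle, not the `ml-kmeans` source; the names `sketchKMeans`, `nearestIndex` and `squaredDistance`, and the toy points, are made up for this example.

```javascript
// Minimal sketch of the K-means loop (illustrative only, not the ml-kmeans source).
function squaredDistance(a, b) {
    return a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0);
}

// index of the centre closest to `point`
function nearestIndex(centers, point) {
    let best = 0;
    for (let i = 1; i < centers.length; i++) {
        if (squaredDistance(centers[i], point) < squaredDistance(centers[best], point)) {
            best = i;
        }
    }
    return best;
}

function sketchKMeans(points, initialCenters, maxIterations = 100) {
    let centers = initialCenters.map(c => c.slice());
    let labels = [];
    for (let iter = 1; iter <= maxIterations; iter++) {
        // assignment step: label each point with its nearest centre
        const newLabels = points.map(p => nearestIndex(centers, p));
        // update step: move each centre to the mean of its members
        centers = centers.map((c, k) => {
            const members = points.filter((_, i) => newLabels[i] === k);
            if (members.length === 0) return c; // leave an empty cluster's centre in place
            return c.map((_, dim) => members.reduce((s, m) => s + m[dim], 0) / members.length);
        });
        // stop once the labels no longer change between iterations
        const unchanged = labels.length === newLabels.length &&
            newLabels.every((l, i) => l === labels[i]);
        labels = newLabels;
        if (unchanged) return { labels, centers, iterations: iter, converged: true };
    }
    return { labels, centers, iterations: maxIterations, converged: false };
}

// toy data: two obvious groups, with hand-picked starting centres
var toyPoints = [[1, 1], [1.2, 0.8], [5, 5], [5.2, 4.8]];
var toyResult = sketchKMeans(toyPoints, [[0, 0], [6, 6]]);
console.log(toyResult.labels, toyResult.centers);
```

The sketch stops when the labels stop changing, whereas the `tolerance` option above suggests the library checks how far the centroids move; treat the stopping rule here as a simplification.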


### Plot the results

We now have a class label in `ans.clusters` for each feature vector in our `input` array. Let's draw a scatter plot of the results, coloring each point by its class label.
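
Since the labels array runs parallel to `input`, a quick sanity check before plotting is to count how many points landed in each cluster. The sketch below uses a made-up `exampleLabels` array and a hypothetical helper `clusterSizes`; in the notebook you would pass `clusters` and `K` instead.

```javascript
// Count how many points were assigned to each of the k clusters.
// Assumes labels are integers in the range 0..k-1.
function clusterSizes(labels, k) {
    const sizes = new Array(k).fill(0);
    labels.forEach(label => { sizes[label] += 1; });
    return sizes;
}

var exampleLabels = [0, 2, 1, 0, 0, 2]; // hypothetical labels, one per input row
console.log(clusterSizes(exampleLabels, 3)); // [ 3, 1, 2 ]
```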

#### TODO: now add the class centroids to the plot with larger markers

In [17]:
var trace = { 
    x: input.map(d => d[0]),
    y: input.map(d => d[1]),
    mode: 'markers',
    marker: { 
        color: clusters, // <- here are our results
        size: 8,
        colorbar: {
            xpad: 100
        }
    },
    type: 'scatter'
};

var centroidsTraces = centroids.map(d => ({
    x: [d.centroid[0]], y: [d.centroid[1]],
    mode: 'markers',
    marker: { 
        size: 20,
        line: { width: 2, color: '#000' },
        opacity: 1
    },
    opacity: 0.3,
    type: 'scatter',
}));

var layout = { width: 800, height: 700, xaxis: { title: features[0] }, yaxis: { title: features[11] }};

$$html$$ = Plot.createPlot([trace, ...centroidsTraces], layout).render()

### Plot the feature space partitioning

Now let's look at the decision boundaries in the feature space. 

K-means has already labelled our training set for us, but if we need to determine the class of a datapoint that we have not seen yet, then we need to use a classifier with the `centroids` that K-means gave us.

This is known as a `forward pass`. K-means has used the training set to *learn* the K class centroids, but to *generalise* to data we have not seen before we need to run a classifier.

Luckily we just created one in the last notebook. Either copy the function definition in here or create a new `.js` file, and `require` that (if you do, remember to restart the kernel otherwise the notebook won't see the new file).
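
As a reminder of what that classifier does for a single point, here is a minimal nearest-centroid sketch in plain JavaScript. The helper name `classifyPoint` and the centre and wine values are hypothetical, chosen only for illustration.

```javascript
// Nearest-centroid classification of one point (a sketch, plain arrays only).
function classifyPoint(centers, point) {
    let bestIndex = -1;
    let bestDistance = Infinity;
    centers.forEach((center, i) => {
        // Euclidean distance between this centre and the point
        const d = Math.hypot(...center.map((v, dim) => v - point[dim]));
        if (d < bestDistance) {
            bestDistance = d;
            bestIndex = i;
        }
    });
    return { index: bestIndex, distance: bestDistance };
}

var exampleCenters = [[12.3, 2.8], [13.7, 3.1], [13.2, 1.7]]; // hypothetical centres
var unseenWine = [13.5, 3.0];                                 // hypothetical new sample
console.log(classifyPoint(exampleCenters, unseenWine).index);
```

With the real K-means result, each entry of `centroids` is an object, so you would pass `centroids.map(c => c.centroid)` as the first argument.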


In [6]:
function myClassifier(centroids, inputs) {
    var distance = require('ml-distance').distance; 
    // `inputs` is a 2D grid: an array of rows, each row an array of [x, y] points
    return inputs.map(row => {
        
        return row.map(point => {
            
            // distance from this grid point to each class centroid
            var dists = centroids.map(c => {
                return distance.euclidean(c, point);
            })

            // the class is the index of the nearest centroid
            var mindist = Math.min(...dists);
            var foundClass = dists.indexOf(mindist);

            //return [...point, foundClass+1];
            return foundClass+1;
        })
    })
}



// NOTE: assumes grid2 returns rows of [x, y] points, as the nested map above expects
var g = grid2([0,10]);
// ml-kmeans wraps each centroid in an object, so unwrap the coordinates first
var H = myClassifier(centroids.map(c => c.centroid), g); 

var trace = {
    z: H, // one class label per grid point
    type: "heatmap"
}

var layout = {
    title: "decision space",
    width: 700,
    height: 700
}

$$html$$ = Plot.createPlot([trace], layout).render()

### Time to play around

Time to try a few different things out and see the effect on the clustering and classification:

- try different initialisations
  - `mostDistant`
  - use the class centers that we picked manually
- try different values of K. What happens? Why?
- try fewer points. How does restricting the training set affect the class positions?
- (time allowing) try a different dataset
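
To get started on the initialisation experiments, the option objects might look like the sketch below. It assumes, as I understand the `ml-kmeans` docs, that `initialization` accepts either a strategy name or an array of starting centers; the `manualCenters` values are hypothetical placeholders, so substitute the ones you picked by hand.

```javascript
// Shared settings, copied from the clustering cell above.
var baseOptions = { maxIterations: 100, tolerance: 1e-6 };

// Variant 1: let the library pick mutually distant starting points.
var mostDistantOptions = Object.assign({}, baseOptions, { initialization: 'mostDistant' });

// Variant 2: start from hand-picked centers (hypothetical values, one [x, y] per cluster).
var manualCenters = [[12.3, 2.8], [13.7, 3.1], [13.2, 1.7]];
var manualOptions = Object.assign({}, baseOptions, { initialization: manualCenters });

// In the notebook, re-run the clustering with either variant, e.g.
//   var ans = KMEANS(input, manualCenters.length, manualOptions);
console.log(mostDistantOptions, manualOptions);
```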

### Confidence

So we have assigned every sample in our training set to one of K classes, and we've used the resulting centroids to configure a classifier. We can now classify wines that we've never seen before: we've generalised!


But are all wines in a class equal? How can we measure confidence of membership in any one class? (Hint: look back at our scatter plot.)


##### TODO Think of a confidence measure, compute it and display a new scatter plot with marker sizes adjusted for confidence

So for each of our N entries in `input` and `clusters`, we'll want a new N-entry list, `confidence`.
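
If you get stuck, one possible measure (a sketch, certainly not the only answer) compares the distance to the nearest centroid with the distance to the second nearest: a point sitting on a centroid scores 1, and a point halfway between two centroids scores 0. The helper names and the demo data below are made up for illustration, and the sketch assumes at least two centers.

```javascript
function euclidean(a, b) {
    return Math.hypot(...a.map((v, i) => v - b[i]));
}

// confidence = 1 - (distance to nearest center) / (distance to second-nearest center)
function confidenceScores(points, centers) {
    return points.map(p => {
        const dists = centers.map(c => euclidean(c, p)).sort((a, b) => a - b);
        const nearest = dists[0];
        const second = dists[1];
        return second === 0 ? 0 : 1 - nearest / second;
    });
}

var demoCenters = [[0, 0], [10, 0]];       // hypothetical centers
var demoPoints = [[0, 0], [5, 0], [2, 0]]; // on a center, on the boundary, in between
var demoConfidence = confidenceScores(demoPoints, demoCenters);
console.log(demoConfidence); // [ 1, 0, 0.75 ]
```

To show it on the scatter plot, Plotly accepts an array for `marker.size`, so something like `demoConfidence.map(c => 4 + 12 * c)` would scale the markers.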


In [69]:
// derive a confidence measure and compute it

'use strict'

In [70]:
// grab the scatter plot code from above and customise it

'use strict'

#### Further Reading

More sophisticated techniques can produce different results:

 - Bayes learning and 2nd order Bayes classifiers
 - Gaussian mixture modelling
 - measuring performance in unsupervised learning