# Chapter 19. Clustering

In [6]:
from __future__ import division
from linear_algebra import squared_distance, vector_mean, distance
import math, random
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

Most of the algorithms in this book are what's known as [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning), meaning that they train with a set of labeled data and use that as the basis for making predictions about new, unlabeled data.  
Clustering, however, is an example of [unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning), in which we work with data that has no labels (or whose labels we ignore).

## The Idea

Whenever you look at some sort of data, it's likely that the data will somehow form [clusters](https://en.wikipedia.org/wiki/Cluster_analysis).  
A data set showing where millionaires live probably has clusters in places like San Francisco, Beverly Hills, and Manhattan.  
A data set showing how many hours people work each week probably has a cluster around 40.  
Unlike some of the problems that this book has examined, there is generally no "correct" clustering.  
An alternative clustering scheme might group some of the "bored retirees" with "avid golfers" and others with "RV owners".  
Neither scheme is necessarily more correct; instead, each is likely closer to optimal with respect to its own "how good are the clusters?" metric.  
Furthermore, clusters won't label themselves.  
You will have to do that by looking at the data underlying each one.

## The Model

For us, each `input` will be a vector in $d$-dimensional space (which, as usual, we will represent as a list of numbers).  
Our goal will be to identify clusters of similar inputs and (sometimes) to find a representative value for each cluster.  
For example, each input could be (a numeric vector that somehow represents) the title of a blog post, in which case the goal might be to find clusters of similar posts, perhaps in order to understand what our users are blogging about.  
Or imagine that we have a picture containing thousands of (`red`, `green`, `blue`) colors and that we need to screen-print a 10-color version of it.  
Clustering can help us choose 10 colors that will minimize the total "color error."

One of the simplest clustering methods is [k-means](https://en.wikipedia.org/wiki/K-means_clustering), in which the number of clusters $k$ is chosen in advance, after which the goal is to partition the inputs into sets $S_1, S_2, \ldots, S_k$ in a way that minimizes the total sum of squared distances from each point to the mean of the assigned cluster.  
There are many ways to assign $n$ points to $k$ clusters, which means that finding an optimal clustering is a challenging problem.  
For instructive purposes, we can settle for an iterative algorithm that usually finds a good clustering:
1. Start with a set of $k$ *means*, which are points in $d$-dimensional space.
2. Assign each point to the mean to which it is closest.
3. If no point's assignment has changed, stop and keep the clusters.
4. If some point's assignment has changed, recompute the means and return to step 2.
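
The quantity that k-means tries to minimize can be written down in a few lines. The following is a minimal sketch, not the book's code: `total_squared_error` is a hypothetical helper name, `squared_distance` is a stand-in for the function imported from `linear_algebra`, and the toy points are invented for illustration. It assumes each point is assigned to its nearest mean, as in step 2.

```python
def squared_distance(v, w):
    """Sum of squared componentwise differences (stand-in for the book's helper)."""
    return sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w))

def total_squared_error(inputs, means):
    """Each point contributes its squared distance to the nearest mean
    (hypothetical helper, assuming nearest-mean assignment)."""
    return sum(min(squared_distance(p, m) for m in means) for p in inputs)

# Hypothetical toy data: two tight groups on a line.
points = [[0.0], [1.0], [9.0], [10.0]]

good_means = [[0.5], [9.5]]   # one mean per group
poor_means = [[0.0], [1.0]]   # both means inside the left group

print(total_squared_error(points, good_means))  # 1.0
print(total_squared_error(points, poor_means))  # 145.0
```

The lower total for `good_means` is what "a good clustering" means here: the iterative steps above try to drive this number down, although they are not guaranteed to reach its smallest possible value.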

Using the `vector_mean` function from Chapter 4, we can create a class that implements this iterative algorithm:

In [7]:
class KMeans:
    """ performs k-means clustering """
    def __init__(self, k):
        self.k = k         # number of means
        self.means = None  # means of the clusters
        
    def classify(self, input):
        """ return the index of the cluster closest to the input """
        return min(range(self.k), key=lambda i: squared_distance(input, self.means[i]))
    
    def train(self, inputs):
        # choose k random points as the initial means
        self.means = random.sample(inputs, self.k)
        assignments = None
        
        while True:
            # find new assignments (list() makes them comparable and reusable
            # on Python 3, where map returns a lazy iterator)
            new_assignments = list(map(self.classify, inputs))
            # if no assignments have changed, we're done
            if assignments == new_assignments:
                return
            # otherwise, keep the new assignments
            assignments = new_assignments
            # and calculate new means based on the new assignments
            for i in range(self.k):
                # find all of the points assigned to cluster i
                i_points = [p for p, a in zip(inputs, assignments) if a == i]
                # make sure i_points is not empty so we're not dividing by zero
                if i_points:
                    self.means[i] = vector_mean(i_points)
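
To see the assign-and-recompute loop converge without any randomness, here is a small trace under stated assumptions: the starting means are fixed by hand rather than sampled, the data are invented, `assign` and `update` are hypothetical function names, and `squared_distance` and `vector_mean` are stand-ins for the `linear_algebra` helpers so that the snippet runs on its own.

```python
def squared_distance(v, w):
    """Stand-in for the book's helper of the same name."""
    return sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w))

def vector_mean(vectors):
    """Componentwise mean of equal-length vectors (stand-in helper)."""
    n = len(vectors)
    return [sum(component) / float(n) for component in zip(*vectors)]

def assign(inputs, means):
    """Step 2: index of the closest mean for every input."""
    return [min(range(len(means)), key=lambda i: squared_distance(p, means[i]))
            for p in inputs]

def update(inputs, assignments, means):
    """Step 4: recompute each mean; an empty cluster keeps its old mean."""
    new_means = []
    for i, old_mean in enumerate(means):
        i_points = [p for p, a in zip(inputs, assignments) if a == i]
        new_means.append(vector_mean(i_points) if i_points else old_mean)
    return new_means

inputs = [[1.0], [2.0], [3.0], [10.0], [11.0]]   # hypothetical 1-D data
means = [[1.0], [2.0]]                           # deliberately poor fixed start

history = []
assignments = None
while True:
    new_assignments = assign(inputs, means)
    if new_assignments == assignments:   # step 3: nothing changed, so stop
        break
    assignments = new_assignments
    means = update(inputs, assignments, means)
    history.append((assignments, means))

for step in history:
    print(step)
# ([0, 1, 1, 1, 1], [[1.0], [6.5]])
# ([0, 0, 0, 1, 1], [[2.0], [10.5]])
```

Even from a start with both means in the left group, two updates are enough here for the means to settle on the centers of the two groups. With other data or other starting means the loop may stop at a worse clustering.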

Let's take a look at how this works.

## Example: Meetups