# Part 4: Clustering

Clustering is an unsupervised ML method that groups data points based on their features alone, without the observed group labels used in supervised classification. Most clustering algorithms therefore seek to group points by their distance in the high-dimensional space defined by the provided features.

## 1) K-means clustering  

In this section we will cover k-means clustering using `scikit-learn`. The scikit-learn documentation for clustering is found [here](http://scikit-learn.org/stable/modules/clustering.html).

First we'll import `KMeans`, `numpy` so that we can make our arrays, and `matplotlib` for plotting. The `%matplotlib inline` magic makes our plots show up within the notebook.

In [None]:
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

We'll start off with a few points. Remember, as with classification and regression, our data should be in a numpy array.

In [None]:
X = np.array([[0,1], [1,2], [1, 0], [-1, -3],
             [15, 21], [18, 30], [20, 20], [22, 19],
             [45, 50], [42, 48], [60, 40], [50, 50]])

If we plot them we can see that they appear to be arranged roughly in three groups.

In [None]:
plt.scatter(*X.T)
plt.ylabel('random points')
plt.show()

To get our clusters, all we have to do is specify how many we want, and then fit the model to the data. We'll choose 3. We can also specify the maximum number of iterations of the k-means algorithm, which you may want to do with a much larger dataset.

First things first: **set a random seed!**

In [None]:
np.random.seed(10)
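
Seeding NumPy's global generator works because `KMeans` falls back to it when no seed is passed. As an alternative, `KMeans` also accepts a `random_state` argument that fixes the seed for that estimator only. The sketch below assumes the same twelve points as above; the names `km_a` and `km_b` and the values `random_state=10` and `n_init=10` are illustrative choices, not part of the original walkthrough.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 1], [1, 2], [1, 0], [-1, -3],
              [15, 21], [18, 30], [20, 20], [22, 19],
              [45, 50], [42, 48], [60, 40], [50, 50]])

# random_state fixes the centroid initialization for this estimator only,
# independent of the global NumPy seed
km_a = KMeans(n_clusters=3, n_init=10, random_state=10).fit(X)
km_b = KMeans(n_clusters=3, n_init=10, random_state=10).fit(X)

# Two fits with the same random_state should give the same centers
same_centers = np.allclose(km_a.cluster_centers_, km_b.cluster_centers_)
print("Identical centers:", same_centers)
```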

Now we can create the model:

In [None]:
kmeans = KMeans(n_clusters=3,
               max_iter=300 #default
               ).fit(X)

We can access the centers of the clusters through the `cluster_centers_` attribute. To get the labels (i.e. the corresponding cluster) we use `labels_`.

In [None]:
print("Centers")
print(kmeans.cluster_centers_)
print()

print("Labels")
print(kmeans.labels_)
print()

for point, label in zip(X, kmeans.labels_):
    print("Coordinates:", point, "Label:", label)
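
To see where the labels come from, it can help to recompute them by hand: k-means assigns each point to the cluster whose center is closest. The following is a minimal sketch under that assumption, using the same points as above; `model`, `distances`, and `nearest` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 1], [1, 2], [1, 0], [-1, -3],
              [15, 21], [18, 30], [20, 20], [22, 19],
              [45, 50], [42, 48], [60, 40], [50, 50]])

model = KMeans(n_clusters=3, n_init=10, random_state=10).fit(X)

# distances[i, j] is the Euclidean distance from point i to center j
diffs = X[:, None, :] - model.cluster_centers_[None, :, :]
distances = np.linalg.norm(diffs, axis=2)

# The index of the closest center for each point
nearest = distances.argmin(axis=1)

print("Nearest center per point:", nearest)
print("Matches labels_:", np.array_equal(nearest, model.labels_))
```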

Now let's also plot our cluster centers along with the points.

In [None]:
fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(*X.T, s=50, c='b', label='original points')
ax1.scatter(*kmeans.cluster_centers_.T, s=50, c='r', label='cluster centers')
plt.legend(loc='upper left')
plt.show()

If we want to see to which cluster a new point would belong, we simply use the `predict` method.

In [None]:
new_points = np.asarray([[0, 4],
                        [19, 25],
                        [40, 50]])

print("Predictions:")
print()

print("0, 4")
print("Cluster:", kmeans.predict([[0, 4]]))
print()

print("19, 25")
print("Cluster:", kmeans.predict([[19, 25]]))
print()

print("40, 50")
print("Cluster:", kmeans.predict([[40, 50]]))

#plot new points

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(*X.T, s=50, c='b', label='original points')
ax1.scatter(*kmeans.cluster_centers_.T, s=50, c='r', label='cluster centers')
ax1.scatter(*new_points.T, s=50, c='g', label='new points')
plt.legend(loc='upper left')
plt.show()
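
Since `predict` takes a 2D array, the three calls above can also be collapsed into a single call on the whole `new_points` array. A short sketch, assuming the same data as above (`model` and `predicted` are illustrative names):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 1], [1, 2], [1, 0], [-1, -3],
              [15, 21], [18, 30], [20, 20], [22, 19],
              [45, 50], [42, 48], [60, 40], [50, 50]])
new_points = np.asarray([[0, 4],
                         [19, 25],
                         [40, 50]])

model = KMeans(n_clusters=3, n_init=10, random_state=10).fit(X)

# One call returns one cluster index per row of new_points
predicted = model.predict(new_points)

for point, label in zip(new_points, predicted):
    print("Coordinates:", point, "Cluster:", label)
```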

## 2) Agglomerative clustering

Now we'll show an example of agglomerative clustering, which is a type of hierarchical clustering. The documentation is [here](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering) in case you want to know more about the parameters. We'll use some of scikit-learn's toy datasets.

In [None]:
from sklearn import datasets

n_samples = 1500

noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)[0]
blobs = datasets.make_blobs(n_samples=n_samples, random_state=0)[0]

plt.scatter(*noisy_moons.T)
plt.ylabel('noisy moons')
plt.show()

plt.scatter(*blobs.T)
plt.ylabel('blobs')
plt.show()

We'll use two clusters this time, and use ward linkage.

In [None]:
from sklearn.cluster import AgglomerativeClustering

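ward = AgglomerativeClustering(n_clusters=2,
                               linkage='ward') #linkage can be ward (default), complete, average, or single
#the distance metric is left at its default (euclidean), which is required if linkage=ward;
#the `affinity` argument was renamed to `metric` in newer scikit-learn releases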

Now we'll fit the clustering model on the dataset.

In [None]:
ward.fit(noisy_moons)

Here we'll sort the points by label and then plot them.

In [None]:
zero = np.array([point for label, point in zip(ward.labels_, noisy_moons) if label == 0])
one = np.array([point for label, point in zip(ward.labels_, noisy_moons) if label == 1])

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(*zero.T, s=50, c='b', label='zero')
ax1.scatter(*one.T, s=50, c='r', label='one')
plt.legend(loc='upper left')
plt.show()
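
The list comprehensions above can also be written with NumPy boolean masks, which select all rows with a given label in one step. This is only a sketch on a smaller, seeded copy of the moons data; `moons`, `labels`, `zero_pts`, and `one_pts` are illustrative names, and `n_samples=200` and `random_state=0` are assumptions made to keep the example quick and repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

moons = datasets.make_moons(n_samples=200, noise=.05, random_state=0)[0]
labels = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(moons).labels_

# labels == 0 is a boolean array; indexing with it keeps the matching rows
zero_pts = moons[labels == 0]
one_pts = moons[labels == 1]

print("Cluster sizes:", len(zero_pts), len(one_pts))
```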

Now we'll do the same with the blobs dataset.

In [None]:
ward.fit(blobs)

ward.labels_

zero = np.array([point for label, point in zip(ward.labels_, blobs) if label == 0])
one = np.array([point for label, point in zip(ward.labels_, blobs) if label == 1])

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(*zero.T, s=50, c='b', label='zero')
ax1.scatter(*one.T, s=50, c='r', label='one')
plt.legend(loc='upper left')
plt.show()

## Challenge: DBSCAN 


It looks like our agglomerative clustering model did not cluster the noisy moons dataset the way we might have wanted. For the challenge, use [`DBSCAN`](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) to cluster noisy moons. Then plot the results and see what it looks like. Try an `eps` value of .2. The `eps` parameter sets the maximum distance between two samples for them to be considered in the same neighborhood.
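
To get a feel for `eps` before starting, here is a small sketch on a hypothetical toy array (not the noisy moons data, so the challenge is left for you). The array `toy` and the choice `min_samples=2` are assumptions made for this tiny example. Points that have no close neighbors are treated as noise and receive the label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point
toy = np.array([[0, 0], [0, 0.1], [0.1, 0],
                [5, 5], [5, 5.1], [5.1, 5],
                [20, 20]])

# Points within eps of each other share a neighborhood;
# min_samples=2 is chosen only because the groups here are so small
db = DBSCAN(eps=0.2, min_samples=2).fit(toy)

for point, label in zip(toy, db.labels_):
    print("Coordinates:", point, "Label:", label)
```

Unlike k-means, the number of clusters is not specified up front; it follows from `eps` and `min_samples`.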