<h1>Clustering Algorithms</h1>
<h3>Unsupervised learning</h3>
<ul>
<li>The algorithm tries to group similar data points together (clusters)
<li>Similarity is judged only from the values in the feature space; no labels are used
</ul>
<h3>K-Means Clustering</h3>
<ul>
<li>Partitions the data space into k clusters
<li>Minimizes the sum of squared distances between each data point and the mean (centroid) of its cluster
<li>The desired number of clusters must be chosen in advance
</ul>
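The bullets above can be sketched in a few lines of NumPy. This is a minimal illustration of the classic assign-then-update loop, not scikit-learn's implementation; the function name `kmeans_sketch`, its parameters, and the two-blob demo data are all made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not the sklearn code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Demo data (an assumption for illustration): two well-separated blobs
demo_rng = np.random.default_rng(1)
blob_a = demo_rng.normal(loc=0.0, scale=0.5, size=(20, 2))
blob_b = demo_rng.normal(loc=10.0, scale=0.5, size=(20, 2))
X_demo = np.vstack([blob_a, blob_b])
centers, labels = kmeans_sketch(X_demo, k=2)
```

Note that `k` is passed in by the caller: the loop has no way of discovering the number of clusters on its own.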

<h2>Image recognition dataset</h2>
<ul>
<li>Handwritten digits 0-9, each stored as an 8x8 image (64 pixels)
<li>Each value is a pixel intensity from 0 to 16, indicating how heavily that pixel is shaded
</ul>
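To make the 64-value layout concrete, the sketch below builds a made-up 8x8 image and flattens it row by row into a 64-element feature vector, which is the same layout used by `digits.images` and `digits.data`. The stroke drawn here is invented for illustration and is not taken from the dataset.

```python
import numpy as np

# A hypothetical 8x8 "digit": a vertical stroke, roughly like a 1
image = np.zeros((8, 8), dtype=int)
image[1:7, 3] = 16          # 16 is the darkest intensity in this dataset

# Flatten row by row into the 64 features a model would see
features = image.reshape(-1)

# The pixel at row r, column c ends up at index r * 8 + c
stroke_index = 1 * 8 + 3
```

Reshaping `features` back to `(8, 8)` recovers the image, which is why the notebook can cluster on `digits.data` and still plot `digits.images`.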


<h2>Do imports</h2>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

<h2>Load data</h2>

In [None]:
digits = load_digits()
digits

In [None]:
print(digits['DESCR'])


<h2>Standardize the data (zero mean, unit variance per feature)</h2>

In [None]:
data = scale(digits.data)

In [None]:
data
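As a rough picture of what standardization does, the sketch below subtracts each column's mean and divides by its standard deviation. The helper `standardize` is a hypothetical stand-in written for this example. It leaves constant columns at zero, which is an assumption made here to avoid division by zero (the digits data does contain pixels that are always 0).

```python
import numpy as np

def standardize(X):
    """Hypothetical helper: per-column (x - mean) / std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have std 0; divide by 1 instead so they become all zeros
    std[std == 0] = 1.0
    return (X - mean) / std

# Tiny made-up matrix: the first column is constant, like an always-blank pixel
demo = np.array([[0.0, 2.0, 5.0],
                 [0.0, 4.0, 7.0],
                 [0.0, 6.0, 9.0]])
standardized = standardize(demo)
```

Standardizing changes the location and spread of each feature but not the shape of its distribution, so the result is not necessarily normally distributed.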

<h2>Render the digit images and their associated values</h2>

In [None]:
def print_digits(images,y,max_n=10):
    # set up the figure size in inches
    fig = plt.figure(figsize=(12, 12))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1,
           hspace=.05, wspace=.5)
    i = 0
    while i <max_n and i <images.shape[0]:
        # plot the images in a matrix of 20x20
        p = fig.add_subplot(20, 20, i + 1, xticks=[],
              yticks=[])
        p.imshow(images[i], cmap=plt.cm.bone)
        # label the image with the target value
        p.text(0, 14, str(y[i]))
        i = i + 1
print_digits(digits.images, digits.target, max_n=10)

<h2>Training and testing samples</h2>

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, images_train,images_test = train_test_split(
        data, digits.target, digits.images,  test_size=0.25, 
          random_state=42)

n_samples, n_features = X_train.shape
n_digits = len(np.unique(y_train))
labels = y_train

In [None]:
len(np.unique(y_train))

<h2>Create the model and fit the data</h2>

In [None]:
from sklearn import cluster
clf = cluster.KMeans(init='k-means++',n_clusters=10, random_state=42)
clf.fit(X_train)

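k-means++ is an initialization scheme: before the standard k-means iterations begin, it chooses starting centroids that tend to be spread far apart, which usually speeds up convergence and helps avoid poor solutions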
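A minimal sketch of the textbook k-means++ seeding idea follows: the first centroid is a random data point, and each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_pp_init` and the demo data are assumptions for illustration; scikit-learn's own routine may differ in detail.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Hypothetical helper: pick k starting centroids from the rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First centroid: a uniformly random data point
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        chosen = np.array(centers)
        # Squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to be picked next
        # (assumes the points are not all identical, so d2.sum() > 0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

# Made-up demo data: 60 random 2-D points
X_demo = np.random.default_rng(3).normal(size=(60, 2))
init_centers = kmeans_pp_init(X_demo, k=3)
```

A point that has already been chosen has distance 0 to itself, so it cannot be drawn a second time.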

In [None]:
images_train

<h2>Call print_digits with training images and computed labels</h2>
<ul>
<li>Returned labels are cluster numbers, not digit values</li>
</ul>

In [None]:
print_digits(images_train, clf.labels_, max_n=20)

<h2>Use test sample to generate predictions</h2>

In [None]:
y_pred=clf.predict(X_test)
y_pred

In [None]:
def print_cluster(images, y_pred, cluster_number):
    images = images[y_pred==cluster_number]
    y_pred = y_pred[y_pred==cluster_number]
    print_digits(images, y_pred,max_n=15)

# Show up to 15 test images from each of the 10 clusters
for i in range(10):
    print_cluster(images_test, y_pred, i)


<h1>Evaluating the model</h1>
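<ul>
<li>Adjusted Rand index: a measure of the similarity between two groupings of the same samples, corrected for chance</li>
<li>We'll use it to see how similar the y_test actuals and the predicted clusters are; it does not depend on how the clusters are numbered</li>
<li>http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html</li>
<li>Values close to 0.0 indicate that any agreement is no better than random labeling would produce; the score can be negative when agreement is worse than chance</li>
<li>1.0 indicates that the two groupings are identical, up to a renaming of the groups</li>
</ul>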
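To show where these values come from, here is a small pure-Python sketch of the adjusted Rand index computed from pair counts. The function name `adjusted_rand_index` is made up for this example, it assumes at least two samples, and it is meant as an illustration rather than a replacement for `metrics.adjusted_rand_score`.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Hypothetical helper: ARI from the pair-counting formula (needs n >= 2)."""
    n = len(labels_true)
    # Pairs of samples that share a group in both labelings
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a group in each labeling separately
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0   # degenerate case, e.g. everything in one group in both labelings
    return (sum_cells - expected) / (max_index - expected)

# Same grouping with the group names swapped
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that split every true group in half
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

In this toy case `same_grouping` is 1.0 even though the label values differ, and `crossed_grouping` is negative, which is why the score suits cluster numbers that have no fixed meaning.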


In [None]:
from sklearn import metrics
print("Adjusted rand score: {0:.2f}".format(metrics.adjusted_rand_score(y_test, y_pred)))

<h2>Confusion matrix</h2>
<ul>
<li>Each row corresponds to an actual digit (y_test)</li>
<li>Each column corresponds to a predicted cluster number (y_pred)</li>
<li>Each cell counts how many test samples of that digit were assigned to that cluster</li>
<li>For example, 0 is fully assigned to cluster 2 (Row 0, Column 2)</li>
<li>8 is assigned to cluster 0 21 times (Row 8, Column 0)</li>
<li>7, whose main cluster is cluster 6, is assigned to it 34 times (Row 7, Column 6)</li>
</ul>


In [None]:
print(metrics.confusion_matrix(y_test, y_pred))
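Because cluster numbers are arbitrary, one way to read such a matrix is to give each cluster the digit that occurs most often inside it. The sketch below does this on a tiny made-up example; the helper `majority_label_per_cluster` and the demo arrays are assumptions for illustration and are not part of scikit-learn.

```python
import numpy as np

def majority_label_per_cluster(y_true, y_cluster, n_clusters):
    """Hypothetical helper: map each cluster id to its most frequent true label."""
    mapping = {}
    for c in range(n_clusters):
        members = y_true[y_cluster == c]
        if members.size == 0:
            continue  # an empty cluster gets no label
        mapping[c] = int(np.bincount(members).argmax())
    return mapping

# Made-up labels: 3 classes, cluster ids deliberately shuffled
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_cluster_demo = np.array([2, 2, 2, 0, 0, 1, 1, 1])

mapping = majority_label_per_cluster(y_true_demo, y_cluster_demo, n_clusters=3)
# Translate cluster ids into class labels and compare with the truth
relabeled = np.array([mapping[c] for c in y_cluster_demo])
accuracy = float((relabeled == y_true_demo).mean())
```

In the notebook the same idea would use `y_test` and `y_pred`. Several clusters may map to the same digit, so this is a reading aid rather than a proper classifier.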

<h2>Graphical view of the clusters</h2>

<ul>
<li>First reduce the X dimensions to 2 using principal component analysis</li>
<li>https://en.wikipedia.org/wiki/Principal_component_analysis</li>
<li>Then figure out the range of values and define the grid</li>
<li>Run k-means on the reduced (2 component) data set</li>
<li>Draw a color map and plot the PCA points on this map</li>
<li>Find the cluster centroids and plot them on the color map</li>
</ul>
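For intuition about the first step, the sketch below reduces data to 2 dimensions by centering it and projecting onto the top two right singular vectors. The helper `pca_project` and the random demo data are assumptions for illustration. `decomposition.PCA` is what the notebook actually uses, and the signs of the components may differ between the two.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Hypothetical helper: project X onto its top principal directions via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    return X_centered @ components.T, components

# Made-up 5-D data where some axes vary much more than others
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
projected, components = pca_project(X_demo)
```

The projected points keep as much of the data's variance as any 2-D linear projection can, which is what makes the scatter plot a reasonable summary of the 64-dimensional data.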



In [None]:
from sklearn import decomposition
pca = decomposition.PCA(n_components=2).fit(X_train)
reduced_X_train = pca.transform(X_train)
# Step size of the mesh. 
h = .01     
# Bounds of the mesh [x_min, x_max] x [y_min, y_max], padded by 1 so no points are clipped
x_min, x_max = reduced_X_train[:, 0].min() - 1, reduced_X_train[:, 0].max() + 1
y_min, y_max = reduced_X_train[:, 1].min() - 1, reduced_X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), 
    np.arange(y_min, y_max, h))
kmeans = cluster.KMeans(init='k-means++', n_clusters=n_digits, 
    n_init=10)
kmeans.fit(reduced_X_train)
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest', extent=(xx.min(), xx.max(), yy.min(), yy.max()), cmap=plt.cm.Paired, aspect='auto', origin='lower')
plt.plot(reduced_X_train[:, 0], reduced_X_train[:, 1], 'k.', 
    markersize=2)
# Plot the centroids as white dots
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],marker='.', 
    s=169, linewidths=3, color='w', zorder=10)
plt.title('K-means clustering on the digits dataset (PCA reduced data)\nCentroids are marked with white dots')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()