# Digits Dataset Using K-Means Clustering

We are going to use the data set from sklearn.datasets.load_digits() to perform K-Means Clustering. We will use K-Means to group the data into 10 clusters, and then use the cluster centers to visualize what each cluster looks like.

In [149]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits


In [150]:
digits = load_digits()
print(digits.data)

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]


After loading and inspecting the data, we find that each entry is an 8x8 image, stored in digits.data as a flattened row of 64 pixel values. There are a total of 1797 entries.
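As a quick check of the shapes described above, the following sketch compares digits.data with digits.images. The variable names first_as_image and same are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of digits.data is one flattened 8x8 image (64 features).
print(digits.data.shape)    # (1797, 64)
print(digits.images.shape)  # (1797, 8, 8)

# Reshaping a row back to 8x8 should recover the corresponding image.
first_as_image = digits.data[0].reshape(8, 8)
same = np.array_equal(first_as_image, digits.images[0])
print(same)
```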

Now that we know what our data looks like, we can look at the target data.

In [151]:
target = digits.target
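To see which label values actually occur, a small sketch like the one below can help. It counts the distinct values in digits.target; the names classes and counts are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Distinct label values and how many samples carry each one.
classes, counts = np.unique(digits.target, return_counts=True)
print(classes)  # [0 1 2 3 4 5 6 7 8 9]
print(counts)
```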

The target data shows that there are 10 possible classes, representing the digits 0-9, so we will ask K-Means for 10 clusters. Let's visualize one of the images to see what it looks like.

In [152]:
plt.gray()
plt.matshow(digits.images[9])
plt.show()

# Create the model and train it

In [153]:
from sklearn.cluster import KMeans

In [154]:
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(digits.data)
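Once the model is fitted, its learned attributes can be inspected. The sketch below is one way to do that; it passes n_init=10 explicitly, which is an assumption added here so that the behavior does not depend on the scikit-learn version's default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

# n_init=10 is set explicitly here; the original cell relies on the default.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42).fit(digits.data)

# One 64-dimensional center per cluster.
print(kmeans.cluster_centers_.shape)  # (10, 64)
# One cluster index (0-9) per training sample.
print(kmeans.labels_.shape)  # (1797,)
# Sum of squared distances of samples to their closest center.
print(kmeans.inertia_)
```

Note that the cluster indices in labels_ are arbitrary: cluster 3 is not necessarily the digit 3.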



# Visualize the data

In [155]:
fig = plt.figure(figsize=(8, 3))
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

Text(0.5, 0.98, 'Cluster Center Images')

We create a for loop that plots each cluster center.

In [156]:
for i in range(10):
    # Add a subplot at position i + 1 in a 2x5 grid
    ax = fig.add_subplot(2, 5, 1 + i)
    # Display images
    ax.imshow(kmeans.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
fig.savefig('digits.png')

# Testing the model

The following array represents a group of 4 digits. The correct order should be 1 9 9 8. Let's see if our model can predict the correct order.

In [157]:
testing_data = np.array([
    [0.00,0.00,0.00,1.78,0.51,0.00,0.00,0.00,0.00,0.00,3.83,12.49,6.88,0.00,0.00,0.00,0.00,4.34,12.62,12.75,7.64,0.00,0.00,0.00,2.68,12.24,9.82,11.72,7.64,0.00,0.00,0.00,4.46,10.33,1.41,12.62,6.63,0.00,0.00,0.00,0.00,0.00,0.00,12.36,6.88,0.00,0.00,0.00,0.00,0.00,0.00,11.47,7.65,0.00,0.00,0.00,0.00,0.00,0.00,11.34,8.03,0.00,0.00,0.00],
    [0.00,0.00,0.00,0.77,2.55,1.02,0.00,0.00,0.00,0.00,3.45,11.99,12.75,12.62,4.98,0.00,0.00,0.00,11.73,10.84,3.70,11.35,9.18,0.00,0.00,0.00,12.37,10.45,7.14,12.75,10.33,0.00,0.00,0.00,4.85,10.71,11.47,11.86,11.73,0.00,0.00,0.00,0.00,0.00,0.00,4.97,12.75,1.91,0.00,0.00,0.00,0.00,0.00,1.53,12.62,7.02,0.00,0.00,0.00,0.00,0.00,0.00,9.06,8.92],
    [0.00,2.17,9.18,11.22,7.53,0.00,0.00,0.00,0.89,11.73,11.73,9.31,12.75,7.14,0.00,0.00,3.57,12.75,7.65,11.47,12.75,11.22,0.00,0.00,0.76,10.46,12.75,12.75,12.62,11.47,0.00,0.00,0.00,0.00,0.00,0.00,8.67,10.58,0.00,0.00,0.00,0.00,0.00,0.00,8.92,10.19,0.00,0.00,0.00,0.00,0.00,0.00,7.01,12.24,1.15,0.00,0.00,0.00,0.00,0.00,2.17,9.94,1.92,0.00],
    [0.00,7.02,12.75,12.75,12.62,2.94,0.00,0.00,0.51,12.37,8.16,0.90,11.48,9.05,0.00,0.00,1.27,12.75,7.39,3.19,10.46,11.47,0.00,0.00,0.00,8.80,12.75,12.75,12.75,12.50,6.00,0.00,0.00,0.00,9.82,12.37,8.16,9.57,12.62,0.64,0.00,0.89,12.75,7.27,0.38,4.08,12.75,2.55,0.00,0.25,11.09,12.75,11.60,10.71,12.37,1.15,0.00,0.00,1.41,7.27,9.94,10.20,4.85,0.00]
])

In [158]:
new_labels = kmeans.predict(testing_data)


In [159]:
# Map each arbitrary cluster index to the digit it is taken to represent.
label_to_digit = {
    0: 0, 1: 9, 2: 2, 3: 1, 4: 6,
    5: 8, 6: 4, 7: 5, 8: 7, 9: 3,
}

# Print the predicted digit for each test sample on a single line.
for label in new_labels:
    print(label_to_digit[int(label)], end='')

8199
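The cluster numbers that K-Means returns are arbitrary, so a mapping from cluster number to digit has to come from somewhere. One possible way to derive it, not necessarily how the mapping above was obtained, is a majority vote: for each cluster, take the most common true label among its training samples. The sketch below assumes n_init=10, and the names label_to_digit, mapped and accuracy are illustrative. The resulting mapping can differ from the hand-written one, and two clusters may end up mapped to the same digit.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42).fit(digits.data)

# For each cluster, pick the most frequent true digit among its members.
label_to_digit = {}
for cluster in range(10):
    members = digits.target[kmeans.labels_ == cluster]
    label_to_digit[cluster] = int(np.bincount(members, minlength=10).argmax())

# Translate every cluster index to a digit and compare with the true labels.
mapped = np.array([label_to_digit[c] for c in kmeans.labels_])
accuracy = float((mapped == digits.target).mean())

print(label_to_digit)
print(f"agreement with true labels: {accuracy:.2%}")
```

Because this uses digits.target, it is only a way to interpret and sanity-check the clusters; the clustering itself remains unsupervised.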

# Conclusion

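The model printed 8199, while the expected sequence was 1998, so only the third digit matches. Keep in mind that K-Means is unsupervised: the cluster numbers it assigns are arbitrary, and the mapping from cluster number to digit used above has to be worked out separately, so it can change with a different random_state. This is a very simple example of how K-Means Clustering works.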
