# Python Clustering with Scikit-learn

**(C) 2017 by [Damir Cavar](http://cavar.me/damir/)**

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This is a tutorial related to the discussion of clustering in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/).

This tutorial was developed as part of the material for my course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/).

## K-means Clustering

We will use the *array* objects from the Python module *numpy*:

In [12]:
import numpy

X = numpy.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
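
Scikit-learn estimators generally expect the data as a two-dimensional array with one row per datapoint and one column per feature. As a quick check, the following sketch inspects the shape of the array defined above; it only repeats the definition of `X` so that it can be run on its own.

```python
import numpy

X = numpy.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Six datapoints (rows), each with two features (columns).
print(X.shape)   # (6, 2)

# The first feature of every datapoint: three points at x=1, three at x=4.
print(X[:, 0])   # [1 1 1 4 4 4]
```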

To use the *K-means* clustering algorithm from [Scikit-learn](http://scikit-learn.org/), we import it and specify the number of clusters (that is, the *k*) and a random state, which seeds the random number generator used to initialize the cluster centroids and so makes the result reproducible. We assume that the data can be grouped into two clusters:

In [13]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0)
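
To illustrate what fixing the random state buys us, here is a minimal sketch that fits two separately constructed models with the same settings and compares their cluster labels. The variable names `first`, `second`, and `same_labels` are only illustrative; `fit` itself is discussed right after this.

```python
import numpy
from sklearn.cluster import KMeans

X = numpy.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Two separate models with identical settings, including the seed.
first = KMeans(n_clusters=2, random_state=0).fit(X)
second = KMeans(n_clusters=2, random_state=0).fit(X)

# With the same seed, both runs start from the same initial centroids
# and therefore end with the same labels.
same_labels = numpy.array_equal(first.labels_, second.labels_)
print(same_labels)
```

Without a fixed `random_state`, repeated runs may start from different initial centroids, so the cluster numbering (and on harder data the clustering itself) can differ between runs.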

We can now apply the clustering algorithm to the datapoints in $X$:

In [14]:
kmeans.fit(X)

print(kmeans.labels_)

[0 0 0 1 1 1]


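The output above shows the assignment of datapoints to clusters: the first three datapoints in $X$ belong to cluster 0 and the last three to cluster 1. The cluster numbers themselves are arbitrary labels, so a different random state or Scikit-learn version may swap them.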
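
To give an idea of what happens during fitting, the following is a minimal sketch of the classic K-means iteration (Lloyd's algorithm) in plain numpy. It is not Scikit-learn's implementation, which by default uses the k-means++ initialization and further optimizations. The function `lloyd_kmeans` and its parameters are hypothetical names introduced for this illustration, and the initial centroids are chosen by hand (the first and the fourth datapoint) to keep the example deterministic.

```python
import numpy

def lloyd_kmeans(X, initial_centroids, n_iter=100):
    """Hypothetical helper: plain Lloyd iterations, not scikit-learn's code."""
    centroids = numpy.asarray(initial_centroids, dtype=float)
    for _ in range(n_iter):
        # Assignment step: distance of every point to every centroid,
        # then pick the index of the nearest centroid for each point.
        distances = numpy.linalg.norm(
            X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (a centroid without points stays where it is).
        new_centroids = numpy.array([
            X[labels == j].mean(axis=0) if numpy.any(labels == j)
            else centroids[j]
            for j in range(len(centroids))
        ])
        # Stop as soon as the centroids no longer move.
        if numpy.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = numpy.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Assumption for this sketch: start from the datapoints [1, 2] and [4, 2].
labels, centroids = lloyd_kmeans(X, initial_centroids=X[[0, 3]])
print(labels)
print(centroids)
```

The result depends on the starting centroids: other starting points can converge to a different, worse grouping, which is one reason why library implementations choose the initialization carefully and may try several of them.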

We can now use the fitted model to predict the cluster of other datapoints, that is, to assign each of them to the cluster with the nearest centroid:

In [15]:
print(kmeans.predict([[0, 0], [4, 4]]))

[0 1]
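
The prediction can be reproduced by hand: for each new datapoint we compute the Euclidean distance to every centroid and take the index of the closest one. The sketch below refits the model so that it runs on its own; `new_points`, `distances`, and `manual_labels` are illustrative names.

```python
import numpy
from sklearn.cluster import KMeans

X = numpy.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

new_points = numpy.array([[0, 0], [4, 4]])

# Euclidean distance from every new point to every centroid
# (rows: new points, columns: centroids).
distances = numpy.linalg.norm(
    new_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)

# The predicted cluster is the index of the nearest centroid.
manual_labels = distances.argmin(axis=1)

print(manual_labels)
print(kmeans.predict(new_points))
```

Both printed arrays should agree, whatever numbering the clusters received.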


We can also output the centroids of the two clusters:

In [11]:
print(kmeans.cluster_centers_)

[[ 1.  2.]
 [ 4.  2.]]
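
Each centroid is the mean of the datapoints assigned to its cluster. As a small check under that reading, the sketch below recomputes the centroids from the labels and also recomputes the sum of squared distances of all datapoints to their centroids, which Scikit-learn exposes as `inertia_`. The names `members`, `recomputed`, and `manual_inertia` are illustrative.

```python
import numpy
from sklearn.cluster import KMeans

X = numpy.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# Recompute every centroid as the mean of the points in its cluster.
recomputed = []
for label in range(kmeans.n_clusters):
    members = X[kmeans.labels_ == label]
    recomputed.append(members.mean(axis=0))
recomputed = numpy.array(recomputed)
print(recomputed)

# Sum of squared distances of each point to the centroid of its cluster.
manual_inertia = ((X - kmeans.cluster_centers_[kmeans.labels_]) ** 2).sum()
print(manual_inertia, kmeans.inertia_)
```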

