The following function implements the $k$-means algorithm. The initial centroids are chosen at random from the data points.

In [0]:
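import numpy as np

# Assumes a SageMath session: sample(), mean(), and the vector method .norm() are Sage built-ins.
def kmeans(data, k, N):
    # Choose k distinct data points at random as the initial centroids.
    centroids = sample(data, k)
    for _ in range(N):
        # Assign every datum to the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for datum in data:
            distances = [(datum - centroids[j]).norm() for j in range(k)]
            clusters[np.argmin(distances)].append(datum)
        # Move each centroid to the mean of its cluster; an empty cluster keeps its old centroid.
        centroids = [mean(clusters[j]) if clusters[j] else centroids[j] for j in range(k)]
    return [clusters, centroids]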
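
For readers without a Sage session, the same assign-then-update loop can be sketched in plain Python. This is only an illustration, not the notebook's implementation: the name `kmeans_sketch`, the `seed` parameter, and the toy points are assumptions introduced here, and `math.dist` (Python 3.8+) stands in for the vector norm.

```python
import math
import random

def kmeans_sketch(points, k, n_iter, seed=None):
    """Plain-Python k-means on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    # Pick k data points at random as the starting centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical toy data: two well-separated groups of three points each.
toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
toy_clusters, toy_centroids = kmeans_sketch(toy, 2, 10, seed=0)
toy_sizes = sorted(len(c) for c in toy_clusters)
```

Because the starting centroids are random, the order of the clusters can change between runs; passing a fixed `seed` makes a run repeatable.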
        

Measurement data for the various iris species will be read in from a comma-separated (.csv) file. For each flower, there are four measurements: sepal length, sepal width, petal length, and petal width. For purposes of visualization in $\mathbb{R}^3$, we will consider only the first three of these values.
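
The reading step can be tried without the data file. The sketch below parses a small in-memory sample and keeps three of the four measurements per row. The helper name `read_measurements`, its `columns` parameter, the sample text, and the trailing species label in each row are all assumptions made for illustration.

```python
import csv
import io

# Two illustrative rows: four measurements followed by an assumed species label,
# plus a blank line of the kind that often ends a data file.
SAMPLE = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n\n"

def read_measurements(handle, columns=(0, 1, 2)):
    """Return one tuple of floats per non-empty row, keeping only `columns`."""
    rows = []
    for record in csv.reader(handle):
        if not record:  # csv.reader yields an empty list for a blank line
            continue
        rows.append(tuple(float(record[i]) for i in columns))
    return rows

points = read_measurements(io.StringIO(SAMPLE))
```

Passing a different `columns` tuple, for example `(0, 2, 3)`, would select another three-dimensional projection of the same rows.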

In [6]:
import csv
data = []
with open("iris.data") as f:
    for flower in csv.reader(f):
        if flower:  # skip blank lines, such as an empty line at the end of the file
            data.append(vector([float(flower[0]), float(flower[1]), float(flower[2])]))
list_plot(data, size=20)

In [7]:
data

[(5.1, 3.5, 1.4),
 (4.9, 3.0, 1.4),
 (4.7, 3.2, 1.3),
 (4.6, 3.1, 1.5),
 (5.0, 3.6, 1.4),
 (5.4, 3.9, 1.7),
 (4.6, 3.4, 1.4),
 (5.0, 3.4, 1.5),
 (4.4, 2.9, 1.4),
 (4.9, 3.1, 1.5),
 (5.4, 3.7, 1.5),
 (4.8, 3.4, 1.6),
 (4.8, 3.0, 1.4),
 (4.3, 3.0, 1.1),
 (5.8, 4.0, 1.2),
 (5.7, 4.4, 1.5),
 (5.4, 3.9, 1.3),
 (5.1, 3.5, 1.4),
 (5.7, 3.8, 1.7),
 (5.1, 3.8, 1.5),
 (5.4, 3.4, 1.7),
 (5.1, 3.7, 1.5),
 (4.6, 3.6, 1.0),
 (5.1, 3.3, 1.7),
 (4.8, 3.4, 1.9),
 (5.0, 3.0, 1.6),
 (5.0, 3.4, 1.6),
 (5.2, 3.5, 1.5),
 (5.2, 3.4, 1.4),
 (4.7, 3.2, 1.6),
 (4.8, 3.1, 1.6),
 (5.4, 3.4, 1.5),
 (5.2, 4.1, 1.5),
 (5.5, 4.2, 1.4),
 (4.9, 3.1, 1.5),
 (5.0, 3.2, 1.2),
 (5.5, 3.5, 1.3),
 (4.9, 3.1, 1.5),
 (4.4, 3.0, 1.3),
 (5.1, 3.4, 1.5),
 (5.0, 3.5, 1.3),
 (4.5, 2.3, 1.3),
 (4.4, 3.2, 1.3),
 (5.0, 3.5, 1.6),
 (5.1, 3.8, 1.9),
 (4.8, 3.0, 1.4),
 (5.1, 3.8, 1.6),
 (4.6, 3.2, 1.4),
 (5.3, 3.7, 1.5),
 (5.0, 3.3, 1.4),
 (7.0, 3.2, 4.7),
 (6.4, 3.2, 4.5),
 (6.9, 3.1, 4.9),
 (5.5, 2.3, 4.0),
 (6.5, 2.8, 4.6),
 (5.7, 2.8

The cluster sizes after 20 iterations using $k=3$ clusters:

In [8]:
clusters, centroids = kmeans(data, 3, 20)
[len(clusters[i]) for i in range(len(clusters))]


[50, 40, 60]

A plot of the same points, colored by cluster. The centroids are shown in black:

In [0]:
# Use the name `fig` so that Sage's built-in plot() function is not shadowed.
fig = list_plot(centroids, color='black', size=50)
colors = ['red', 'yellow', 'magenta']
for j in range(len(clusters)):
    fig = fig + list_plot(clusters[j], color=colors[j], size=20)
fig