In [2]:
import numpy as np

def centroids(x, y):
    # Per-class feature means; assumes the labels in y are exactly 0, 1, and 2.
    c0 = x[y == 0].mean(axis=0)
    c1 = x[y == 1].mean(axis=0)
    c2 = x[y == 2].mean(axis=0)
    return [c0, c1, c2]


The `centroids` function takes two arguments: `x`, a NumPy array holding the iris feature matrix (one row per sample), and `y`, a NumPy array holding the corresponding class labels.

The `centroids` function calculates the mean of each feature for each class in the iris dataset. Specifically, it calculates the mean of the features for samples where the target variable `y` is equal to 0, 1, and 2, respectively. These means are then returned as a list of three arrays, where each array contains the mean of the features for one of the three classes.

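The resulting means serve as the class centroids of a nearest-centroid classifier: each one marks the center of one iris class in feature space. Because they are computed from the labels in `y`, this is supervised classification rather than k-means clustering, which would estimate its centroids without labels.

Overall, the `centroids` function is a helper that computes one centroid per class from the training data; these centroids are then used to classify new samples.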
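The function above hard-codes the three iris labels. As a minimal sketch, assuming only that `y` holds integer class labels, the same per-class mean can be computed for any number of classes. The name `class_centroids` and the toy arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def class_centroids(x, y):
    # Hypothetical generalization: one centroid per distinct label in y.
    labels = np.unique(y)
    # Boolean masks select the rows of each class; the mean is taken per feature.
    cents = np.stack([x[y == k].mean(axis=0) for k in labels])
    return labels, cents

# Toy data (not the iris dataset): two classes, two features.
x_demo = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]])
y_demo = np.array([0, 0, 1, 1])

labels_demo, cents_demo = class_centroids(x_demo, y_demo)
print(labels_demo)
print(cents_demo)
```

Row `i` of the returned array is the centroid for `labels_demo[i]`, so the result can be unpacked in the same way as the list returned by `centroids`.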

In [4]:

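def predict(c0, c1, c2, x):
    # One predicted label (0, 1, or 2) per row of x.
    p = np.zeros(x.shape[0], dtype="uint8")
    for i in range(x.shape[0]):
        # Squared Euclidean distance from sample i to each centroid.
        d = [((c0-x[i])**2).sum(),
             ((c1-x[i])**2).sum(),
             ((c2-x[i])**2).sum()]
        # The index of the nearest centroid is the predicted class.
        p[i] = np.argmin(d)
    return p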


The `predict` function takes four arguments: `c0`, `c1`, and `c2` are NumPy arrays holding the centroids of the three classes, and `x` is a NumPy array holding the feature matrix of the samples to classify.

For each sample in `x`, the function computes the squared Euclidean distance to each of the three centroids and assigns the sample to the class whose centroid is closest. Squared distances are sufficient here because taking the square root would not change which centroid is nearest.

The function returns a NumPy array `p` of type `uint8` with one predicted label per sample: 0 for the class of `c0`, 1 for the class of `c1`, and 2 for the class of `c2`.

Overall, the `predict` function performs the classification step of a nearest-centroid classifier. It does not update the centroids, so it is not an iteration of k-means.
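The per-sample loop in `predict` can also be written with NumPy broadcasting. The following is a sketch under the assumption that all centroids have the same length as the rows of `x`; the name `predict_vectorized` and the toy values are hypothetical and do not come from the original notebook.

```python
import numpy as np

def predict_vectorized(centroid_list, x):
    # Stack the centroids into a (k, d) array.
    c = np.asarray(centroid_list, dtype=float)
    # Broadcasting: (n, 1, d) - (1, k, d) gives (n, k, d) differences.
    diff = x[:, np.newaxis, :] - c[np.newaxis, :, :]
    # Squared Euclidean distance from every sample to every centroid: (n, k).
    d2 = (diff ** 2).sum(axis=2)
    # Index of the nearest centroid for each sample.
    return d2.argmin(axis=1).astype("uint8")

# Toy centroids and samples (not the iris dataset).
c_demo = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([10.0, 0.0])]
x_demo = np.array([[0.5, 0.2], [4.0, 6.0], [9.0, 1.0]])

pred_demo = predict_vectorized(c_demo, x_demo)
print(pred_demo)
```

This builds an intermediate array of shape `(n, k, d)`, which is negligible for a dataset the size of iris but may matter for much larger inputs.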

In [8]:


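# Load the iris features and labels.
x = np.load("../data/iris/iris_features.npy")
y = np.load("../data/iris/iris_labels.npy")
# Use the first N samples for training and the remainder for testing.
N = 120
x_train = x[:N]; x_test = x[N:]
y_train = y[:N]; y_test = y[N:]
# Compute the class centroids on the training split, then classify the test split.
c0, c1, c2 = centroids(x_train, y_train)
p = predict(c0,c1,c2, x_test)
# Accuracy = correct / (correct + wrong).
nc = np.count_nonzero(p == y_test)
nw = np.count_nonzero(p != y_test)
acc = nc / (nc + nw)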
print("predicted:", p)
print("actual   :", y_test)
print("test accuracy = %0.4f" % acc)





predicted: [0 0 1 1 0 1 2 1 1 1 0 2 1 0 0 1 2 2 1 1 0 2 2 0 1 0 2 1 2 0]
actual   : [0 0 1 2 0 1 2 1 1 1 0 2 2 0 0 2 2 2 1 1 0 2 2 0 1 0 2 1 2 0]
test accuracy = 0.9000
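
The two printed arrays above are enough to see which classes are being confused, not just how many samples are wrong. A minimal sketch, with both arrays copied verbatim from the output; the helper name `confusion` is hypothetical and not part of the original notebook.

```python
import numpy as np

# Copied from the printed output of the cell above.
predicted = np.array([0,0,1,1,0,1,2,1,1,1,0,2,1,0,0,1,2,2,1,1,0,2,2,0,1,0,2,1,2,0])
actual    = np.array([0,0,1,2,0,1,2,1,1,1,0,2,2,0,0,2,2,2,1,1,0,2,2,0,1,0,2,1,2,0])

def confusion(actual, predicted, n_classes=3):
    # cm[i, j] counts samples whose true class is i and predicted class is j.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly when an (i, j) pair occurs more than once.
    np.add.at(cm, (actual, predicted), 1)
    return cm

cm = confusion(actual, predicted)
print(cm)

# The diagonal holds the correct predictions, so this reproduces the accuracy.
acc_demo = np.trace(cm) / cm.sum()
print("accuracy = %0.4f" % acc_demo)
```

For these arrays, all three errors are class 2 samples predicted as class 1; classes 0 and 1 are classified without error.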


In [18]:


# Train on the augmented training set; the whole file is used, so no split is needed.
x_train = np.load("../data/iris/iris_train_features_augmented.npy")
y_train = np.load("../data/iris/iris_train_labels_augmented.npy")
c0, c1, c2 = centroids(x_train, y_train)

# Evaluate on the separate test files.
x_test = np.load("../data/iris/iris_test_features_augmented.npy")
y_test = np.load("../data/iris/iris_test_labels_augmented.npy")

print(f"x_train shape: {x_train.shape}, y_train shape: {y_train.shape}")

p = predict(c0,c1,c2, x_test)
# Accuracy = correct / (correct + wrong).
nc = np.count_nonzero(p == y_test)
nw = np.count_nonzero(p != y_test)
acc = nc / (nc + nw)
print("predicted:", p)
print("actual   :", y_test)
print("test accuracy = %0.4f" % acc)





x_train shape: (1200, 4), y_train shape: (1200,)
predicted: [0 0 1 1 0 1 2 1 1 1 0 2 2 0 0 1 2 2 1 1 0 2 2 0 1 0 2 1 2 0]
actual   : [0 0 1 2 0 1 2 1 1 1 0 2 2 0 0 2 2 2 1 1 0 2 2 0 1 0 2 1 2 0]
test accuracy = 0.9333
