# K-means Clustering of Numerical Data

Source: https://www.kaggle.com/code/khotijahs1/k-means-clustering-of-iris-dataset

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.[[1](https://en.wikipedia.org/wiki/Iris_flower_data_set#:~:text=The%20Iris%20flower%20data%20set,example%20of%20linear%20discriminant%20analysis.)] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".[3] Fisher's paper was published in the journal, the Annals of Eugenics, creating controversy about the continued use of the Iris dataset for teaching statistical techniques today.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

![https://miro.medium.com/max/2550/0*GVjzZeYrir0R_6-X.png](https://miro.medium.com/max/2550/0*GVjzZeYrir0R_6-X.png)

Image source: https://miro.medium.com/max/2550/0*GVjzZeYrir0R_6-X.png

In this study, we cluster the Iris dataset using K-means.

[Attribute Information](https://archive.ics.uci.edu/ml/datasets/iris):
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica

Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.cluster import KMeans 
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

Read the dataset

In [24]:
from sklearn import datasets
from sklearn.metrics import pairwise_distances

# Load the Iris dataset
iris = datasets.load_iris()
data = iris.data
print(f"Number of features: {len(data[0])}\nNumber of data points: {len(data)}")
x = data


Number of features: 4
Number of data points: 150


# K-Means

[K-means](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/) is a centroid-based, distance-based algorithm: each cluster is represented by a centroid, and every point is assigned to the cluster whose centroid is closest to it.

# How to Implement K-Means Clustering

1. Choose the number of clusters k.
2. Select k random points from the data as the initial centroids.
3. Assign each point to its closest cluster centroid.
4. Recompute the centroid of each newly formed cluster.
5. Repeat steps 3 and 4 until the centroids stop changing or a maximum number of iterations is reached.
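
The steps above can be sketched in plain NumPy. This is only an illustrative sketch, not the scikit-learn implementation used in the rest of this notebook: the function name `kmeans_sketch`, its `max_iter` and `seed` parameters, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Illustrative K-means loop (hypothetical helper, not scikit-learn's KMeans)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs in 2-D (hypothetical, not the Iris data)
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The cluster numbering depends on which points are drawn as initial centroids, so only the grouping is meaningful, not the label values themselves.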


In [19]:
# Fit K-means for k = 2..8 and keep the labels of each run so the clusterings can be compared later
from sklearn.cluster import KMeans

cluster_results = []
for k in range(2,9):
    # Initialize the KMeans model
    kmeans = KMeans(n_clusters=k, random_state=42)
    # Fit the model to the data
    kmeans.fit(data)
    cluster_results.append((k,kmeans.labels_))
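
The loop above stores the labels for each k but does not by itself pick a value of k. One common option, sketched here, is to compare the mean silhouette coefficient across the candidate values using `silhouette_score`, which the notebook imports but does not otherwise use. The names `features`, `silhouette_by_k`, and `best_k`, and the explicit `n_init=10`, are assumptions of this sketch.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

features = datasets.load_iris().data

silhouette_by_k = {}
for k in range(2, 9):
    # n_init=10 is set explicitly (an assumption of this sketch) so the result
    # does not depend on the default of the installed scikit-learn version
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(features)
    # The mean silhouette coefficient lies in [-1, 1]; higher values indicate
    # tighter, better separated clusters
    silhouette_by_k[k] = silhouette_score(features, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette at k={best_k}")
```

The silhouette coefficient is computed without the true species labels, so the k it favors need not equal the number of species in the data.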

# AntClust

In [21]:
# ----------------------
#       imports
# ----------------------
import sys
from importlib import reload

from sklearn.metrics import pairwise_distances

# make the AntClust directory importable
sys.path.append("../AntClust")
# import AntClust
from AntClust import AntClust
import distance_classes
reload(distance_classes)  # re-import so local edits to distance_classes take effect
# import the rule set
from rules import labroche_rules

# Compute pairwise Euclidean distances
distances = pairwise_distances(data, metric='euclidean')

# Get the minimum and maximum distances.
# Note: the zeros on the diagonal (each point's distance to itself) are not
# excluded here, so min_distance is 0.0
min_distance = np.min(distances)
max_distance = np.max(distances)

# Each sample is passed as a one-element list, matching the single similarity function in f_sim
f_sim = [distance_classes.similarity_euclid(min_distance, max_distance)]
ant_clust = AntClust(f_sim, labroche_rules())
ant_clust.fit([[d] for d in data])
clusters_found = ant_clust.get_clusters()

AntClust: phase 1 of 3 -> meeting ants
Meeting 11250 / 11250
Meeting 10125 / 11250
Meeting 9000 / 11250
Meeting 7875 / 11250
Meeting 6750 / 11250
Meeting 5625 / 11250
Meeting 4500 / 11250
Meeting 3375 / 11250
Meeting 2250 / 11250
Meeting 1125 / 11250
AntClust: phase 2 of 3 -> shrink nests
AntClust: phase 3 of 3 -> reassign ants


# Metrics

In [23]:
from sklearn import metrics


def clustering_scores(true_labels, predicted_labels):
    """Compare predicted cluster labels with the true species labels."""
    return {
        'Homogeneity': metrics.homogeneity_score(true_labels, predicted_labels),
        'Completeness': metrics.completeness_score(true_labels, predicted_labels),
        'V-measure': metrics.v_measure_score(true_labels, predicted_labels),
        'Adjusted Rand-Index': metrics.adjusted_rand_score(true_labels, predicted_labels),
    }


# One row of scores per clustering; stored in `rows` so that the Iris
# feature matrix in `data` is not overwritten
rows = {"AntClust (euclidean distance)": clustering_scores(iris.target, clusters_found)}
for k, k_label in cluster_results:
    rows[f"K-means (k={k})"] = clustering_scores(iris.target, k_label)

df = pd.DataFrame.from_dict(rows, orient="index")
df


| Method | Homogeneity | Completeness | V-measure | Adjusted Rand-Index |
|---|---|---|---|---|
| AntClust (euclidean distance) | 0.667307 | 0.664601 | 0.665951 | 0.544441 |
| K-means (k=2) | 0.522322 | 0.883514 | 0.656519 | 0.539922 |
| K-means (k=3) | 0.751485 | 0.764986 | 0.758176 | 0.730238 |
| K-means (k=4) | 0.808314 | 0.652211 | 0.72192 | 0.649818 |
| K-means (k=5) | 0.823883 | 0.599287 | 0.693863 | 0.607896 |
| K-means (k=6) | 0.823883 | 0.520492 | 0.637954 | 0.447534 |
| K-means (k=7) | 0.914483 | 0.524576 | 0.666707 | 0.474661 |
| K-means (k=8) | 0.92556 | 0.513151 | 0.660247 | 0.463783 |
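
As a reading aid for the table: with scikit-learn's default `beta=1.0`, the V-measure is the harmonic mean of homogeneity and completeness. The short check below recomputes it for three rows copied from the results above; the helper name `v_measure` and the dictionary `reported` are assumptions of this sketch.

```python
# (homogeneity, completeness, reported V-measure) copied from the results table
reported = {
    "AntClust (euclidean distance)": (0.667307, 0.664601, 0.665951),
    "K-means (k=2)": (0.522322, 0.883514, 0.656519),
    "K-means (k=3)": (0.751485, 0.764986, 0.758176),
}

def v_measure(h, c):
    """Harmonic mean of homogeneity h and completeness c (beta = 1)."""
    return 2 * h * c / (h + c)

for name, (h, c, v) in reported.items():
    print(f"{name}: recomputed={v_measure(h, c):.6f}, reported={v:.6f}")
```

This explains why K-means with k=2 has high completeness but only a moderate V-measure: its low homogeneity pulls the harmonic mean down.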
