The **Iris dataset** is a classic dataset in machine learning and statistics, often used to test classification and clustering algorithms. It contains 150 samples, 50 from each of three Iris species (Setosa, Versicolour, and Virginica), with four measurements per flower: sepal length, sepal width, petal length, and petal width. The British statistician and biologist Ronald Fisher introduced it in 1936 as an example of discriminant analysis.

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, AgglomerativeClustering
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


`load_iris()` returns a dictionary-like `Bunch` object. Its `data` attribute holds the feature matrix (one row per flower, one column per measurement), which is assigned to the variable `X` below.

In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
print(iris)



In [None]:
print(X)
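
Printing the whole `Bunch` object or the full array is verbose. As a minimal sketch, assuming only the standard fields that `load_iris` returns, the following summarizes the parts this notebook relies on:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # feature matrix: one row per flower, one column per measurement

print(X.shape)             # (150, 4)
print(iris.feature_names)  # column names for X
print(iris.target_names)   # species names, indexed by iris.target
print(X[:5])               # first five rows only
```

The clustering steps that follow use only `X`; the species labels in `iris.target` are not shown to the algorithms.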

In [None]:
# Elbow Method to determine the number of clusters for KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)



In [None]:
# Plotting the Elbow Method results to visualize the optimal number of clusters
plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
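
Reading an elbow off a curve is somewhat subjective. One possible cross-check, not part of the original analysis, is the mean silhouette score, which measures how close each point is to its own cluster compared with the nearest other cluster. The sketch below assumes `silhouette_score` from `sklearn.metrics` and an arbitrary range of 2 to 6 clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# The silhouette is undefined for a single cluster, so the range starts at 2
silhouette = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette[k] = silhouette_score(X, labels)

for k, score in silhouette.items():
    print(f"k={k}: silhouette={score:.3f}")
```

Higher scores indicate better-separated clusters. The elbow and the silhouette need not agree: on Iris the silhouette typically favours two clusters because two of the species overlap, so domain knowledge (three species) still matters when choosing `k`.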



In [None]:
# Assuming the optimal number of clusters is 3 (based on the Elbow Method)
optimal_clusters = 3



In [None]:
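# KMeans Clustering with determined number of clusters
# n_init is set explicitly to match the elbow loop; its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=optimal_clusters, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)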

# Hierarchical Clustering with the same number of clusters
hier_cluster = AgglomerativeClustering(n_clusters=optimal_clusters)
hier_labels = hier_cluster.fit_predict(X)

# The 'kmeans_labels' and 'hier_labels' variables contain the cluster labels assigned to each sample in the dataset
# Convert the dataset to a DataFrame and add the cluster labels
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['Cluster_Label'] = kmeans_labels

# Visualize the results using seaborn's pairplot
sns.pairplot(iris_df, hue='Cluster_Label', palette='bright')
plt.show()
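
The cell above computes `hier_labels` but only plots the KMeans result. As a hedged sketch of how the two clusterings could be compared, the adjusted Rand index (ARI) from `sklearn.metrics` scores the agreement between two partitions regardless of how the cluster ids are numbered. Comparing against the true species in `iris.target` is an addition here and is only possible because Iris happens to ship with labels:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X = iris.data

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# ARI is 1.0 for identical partitions and close to 0 for unrelated ones
ari_between = adjusted_rand_score(kmeans_labels, hier_labels)
ari_kmeans = adjusted_rand_score(iris.target, kmeans_labels)
ari_hier = adjusted_rand_score(iris.target, hier_labels)

print(f"KMeans vs hierarchical: {ari_between:.3f}")
print(f"KMeans vs species:      {ari_kmeans:.3f}")
print(f"Hierarchical vs species:{ari_hier:.3f}")
```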

The clustering results on the Iris dataset using KMeans have been visualized with a pair plot. This plot uses seaborn's pairplot function to display pairwise relationships in the dataset. Each plot in the grid represents a pair of features from the dataset, and the points are colored based on the cluster labels assigned by the KMeans algorithm.

In [None]:
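# KMeans cluster ids are arbitrary, so name each cluster after the species
# that is most common among its members instead of assuming 0/1/2 line up
species = pd.Series(iris.target_names[iris.target])
cluster_names = species.groupby(kmeans_labels).agg(lambda s: s.mode()[0]).to_dict()
iris_df['Cluster_Label'] = pd.Series(kmeans_labels, index=iris_df.index).map(cluster_names)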

# Visualize the results again with the updated cluster names
sns.pairplot(iris_df, hue='Cluster_Label', palette='bright')
plt.show()
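
KMeans numbers its clusters arbitrarily, so cluster 0 is not guaranteed to be Setosa. A contingency table of true species against cluster id shows how the two actually line up. This is a minimal sketch using `pd.crosstab`; the variable names `species`, `clusters`, and `table` are illustrative:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Rows: true species; columns: cluster id assigned by KMeans
species = pd.Series(iris.target_names[iris.target], name="species")
clusters = pd.Series(kmeans_labels, name="cluster")
table = pd.crosstab(species, clusters)
print(table)
```

Each row sums to 50 (the number of flowers per species). A row concentrated in a single column means that species was recovered as one cluster; a row split across columns shows where species were mixed.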

### What the pair plots show and the insights they provide

**Pairwise Relationships:** Each plot in the grid represents a pairwise relationship between two features of the Iris dataset. Since there are four features (sepal length, sepal width, petal length, and petal width), the grid consists of a 4x4 matrix of plots.

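**Distribution of Individual Features:** The diagonal plots are not scatter plots but univariate distributions of each feature. Because `hue` is set, seaborn by default draws one kernel density curve per cluster here (pass `diag_kind='hist'` for histograms). For instance, you can see how sepal length is distributed within each cluster and how much those distributions overlap.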

**Scatter Plots for Feature Combinations:** The off-diagonal plots are scatter plots representing the relationship between pairs of features. For example, one plot might show sepal length (x-axis) vs. sepal width (y-axis). These plots help in understanding how these features correlate with each other.

**Cluster Identification:** Points in each scatter plot are colored based on the cluster label assigned by the KMeans algorithm. This coloring helps to visually identify how well the algorithm has clustered the data. Ideally, points of the same color (i.e., from the same cluster) should form distinct groups, indicating that the algorithm has successfully identified patterns in the data.

**Insights on Feature Influence in Clustering:** By observing how points are grouped in different scatter plots, one can infer which features contribute more to the separation between clusters. For instance, if points are well-separated in the plot of petal length vs. petal width but not as much in sepal length vs. sepal width, it suggests that petal measurements are more influential in defining these clusters.

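**Potential Overlaps and Ambiguous Assignments:** Areas where differently colored points overlap mark regions where the clusters are not well separated and where individual points could plausibly belong to either cluster. Because clustering is unsupervised, these are not misclassifications in the strict sense unless the labels are compared with the true species. Such overlap can be a signal to try different parameters in the clustering algorithm or to consider a different model.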
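
The visual reading of feature influence described above can be checked numerically. One simple option, offered as a sketch and not as the only way to measure this, is the share of each feature's total variance that lies between cluster means: a value near 1 means the clusters differ strongly on that feature, and a value near 0 means they barely differ.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)

# Replace every value by the mean of its cluster, per feature
cluster_means = df.groupby(labels).transform("mean")

# Between-cluster sum of squares divided by total sum of squares, per feature
between = ((cluster_means - df.mean()) ** 2).sum()
total = ((df - df.mean()) ** 2).sum()
explained = (between / total).sort_values(ascending=False)
print(explained.round(2))
```

Features at the top of this list separate the clusters most clearly, which should match the scatter plots where the colored groups are most distinct.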

### Iris 3D Visualisation

In [None]:
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import matplotlib.pyplot as plt

# Though the following import is not directly being used, it is required
# for 3D projection to work with matplotlib < 3.2
import mpl_toolkits.mplot3d  # noqa: F401
import numpy as np

from sklearn import datasets
from sklearn.cluster import KMeans

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = [
    ("k_means_iris_8", KMeans(n_clusters=8, n_init="auto")),
    ("k_means_iris_3", KMeans(n_clusters=3, n_init="auto")),
    ("k_means_iris_bad_init", KMeans(n_clusters=3, n_init=1, init="random")),
]

fig = plt.figure(figsize=(10, 8))
titles = ["8 clusters", "3 clusters", "3 clusters, bad initialization"]
for idx, ((name, est), title) in enumerate(zip(estimators, titles)):
    ax = fig.add_subplot(2, 2, idx + 1, projection="3d", elev=48, azim=134)
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float), edgecolor="k")

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel("Petal width")
    ax.set_ylabel("Sepal length")
    ax.set_zlabel("Petal length")
    ax.set_title(title)

# Plot the ground truth
ax = fig.add_subplot(2, 2, 4, projection="3d", elev=48, azim=134)

for name, label in [("Setosa", 0), ("Versicolour", 1), ("Virginica", 2)]:
    ax.text3D(
        X[y == label, 3].mean(),
        X[y == label, 0].mean(),
        X[y == label, 2].mean() + 2,
        name,
        horizontalalignment="center",
        bbox=dict(alpha=0.2, edgecolor="w", facecolor="w"),
    )
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor="k")

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel("Petal width")
ax.set_ylabel("Sepal length")
ax.set_zlabel("Petal length")
ax.set_title("Ground Truth")

plt.subplots_adjust(wspace=0.25, hspace=0.25)
plt.show()
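
The "bad initialization" panel can also be examined numerically. The sketch below, which uses an arbitrary choice of 20 seeds, compares the inertia (within-cluster sum of squares) of single random initializations with that of k-means++ using 10 restarts. Lower inertia means a tighter fit for the same number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# One random initialization per seed: each run may stop in a different local optimum
single_run_inertias = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(20)
]

# k-means++ seeding with 10 restarts keeps the best of the 10 runs
multi = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print(f"single random init: min={min(single_run_inertias):.2f}, "
      f"max={max(single_run_inertias):.2f}")
print(f"k-means++ with 10 restarts: {multi.inertia_:.2f}")
```

A large gap between the minimum and maximum of the single runs indicates that some initializations converge to a noticeably worse solution, which is the effect the third panel illustrates.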