# Unsupervised Learning with the Iris Dataset

This notebook demonstrates various unsupervised learning models using the Iris dataset. We'll cover:
1. Data loading and exploration
2. Dimensionality reduction (PCA)
3. Clustering algorithms
4. Model evaluation and visualization

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Set random seed for reproducibility
np.random.seed(42)

# Set style for plots (recent Matplotlib releases no longer accept the bare 'seaborn' style name)
sns.set_theme(style='darkgrid', palette='husl')

In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Create a DataFrame for easier visualization
df = pd.DataFrame(X, columns=feature_names)
df['species'] = [target_names[i] for i in y]

# Display basic information
print(f"Number of samples: {len(df)}")
print(f"Number of features: {len(feature_names)}")
print(f"Target classes: {target_names}")
print("\nFirst few rows of the dataset:")
df.head()

In [None]:
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
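
A quick sanity check on the standardization step, as a minimal sketch: after `StandardScaler`, each feature column should have a mean close to 0 and a standard deviation close to 1. The names `col_means` and `col_stds` are illustrative and not part of the notebook above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Per-feature mean and population standard deviation (ddof=0),
# which is the convention StandardScaler uses internally
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print("Column means:", np.round(col_means, 6))
print("Column stds: ", np.round(col_stds, 6))
```

Scaling matters here because K-means, DBSCAN, and Ward linkage all rely on Euclidean distances, so features with larger raw ranges would otherwise dominate.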

In [None]:
# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create a DataFrame with PCA results
pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
pca_df['species'] = df['species']

# Plot the PCA results
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue='species', data=pca_df)
plt.title('PCA of Iris Dataset')
plt.show()
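
To see how much information the 2D projection keeps, one option is to fit PCA with all components and inspect `explained_variance_ratio_`. This is a sketch; `pca_full`, `ratios`, and `cumulative` are illustrative names that do not appear in the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all four components so every ratio is available
pca_full = PCA().fit(X_scaled)
ratios = pca_full.explained_variance_ratio_
cumulative = ratios.cumsum()

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of variance (cumulative {c:.3f})")
```

The cumulative value at PC2 is the share of total variance retained by the scatter plot above.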

In [None]:
# K-means clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Plot K-means results
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue=kmeans_labels, data=pca_df)
plt.title('K-means Clustering Results')
plt.show()

In [None]:
# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)

# Plot DBSCAN results
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue=dbscan_labels, data=pca_df)
plt.title('DBSCAN Clustering Results')
plt.show()
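
DBSCAN does not take a cluster count as input and marks low-density points as noise with the label `-1`, so it helps to summarize what it actually returned. The sketch below reuses the same `eps` and `min_samples` as the cell above; `n_clusters` and `n_noise` are illustrative names.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# Label -1 is DBSCAN's marker for noise, not a real cluster
n_noise = int(np.sum(labels == -1))
cluster_ids = sorted(set(labels.tolist()) - {-1})
n_clusters = len(cluster_ids)

print(f"Clusters found: {n_clusters}")
print(f"Noise points:   {n_noise}")
for cid in cluster_ids:
    print(f"  cluster {cid}: {int(np.sum(labels == cid))} points")
```

If many points end up as noise, a larger `eps` or a smaller `min_samples` may be worth trying.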

In [None]:
# Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=3)
hierarchical_labels = hierarchical.fit_predict(X_scaled)

# Plot hierarchical clustering results
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue=hierarchical_labels, data=pca_df)
plt.title('Hierarchical Clustering Results')
plt.show()
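
`AgglomerativeClustering` uses Ward linkage by default. As a rough cross-check, the same merge tree can be built with SciPy and cut into three clusters. This is a sketch that assumes SciPy is installed; `Z`, `scipy_labels`, and `agreement` are illustrative names.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Build the full Ward merge tree: one row per merge (n_samples - 1 rows)
Z = linkage(X_scaled, method="ward")

# Cut the tree so that at most three flat clusters remain
scipy_labels = fcluster(Z, t=3, criterion="maxclust")

# Compare with scikit-learn's result; label values differ, so use ARI
sk_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)
agreement = adjusted_rand_score(sk_labels, scipy_labels)
print(f"Agreement between SciPy and scikit-learn (ARI): {agreement:.4f}")
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to plot the merge tree.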

In [None]:
# Model evaluation
print("Clustering Evaluation Metrics:")
print("\nK-means:")
print(f"Silhouette Score: {silhouette_score(X_scaled, kmeans_labels):.4f}")
print(f"Adjusted Rand Score: {adjusted_rand_score(y, kmeans_labels):.4f}")

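print("\nDBSCAN:")
# Exclude noise points (label -1) so they are not scored as if they were a cluster
not_noise = dbscan_labels != -1
print(f"Silhouette Score (noise excluded): {silhouette_score(X_scaled[not_noise], dbscan_labels[not_noise]):.4f}")
print(f"Adjusted Rand Score: {adjusted_rand_score(y, dbscan_labels):.4f}")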

print("\nHierarchical Clustering:")
print(f"Silhouette Score: {silhouette_score(X_scaled, hierarchical_labels):.4f}")
print(f"Adjusted Rand Score: {adjusted_rand_score(y, hierarchical_labels):.4f}")
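
The adjusted Rand score condenses agreement with the true species into one number. A contingency table shows where the disagreements fall. The sketch below uses K-means as the example; `table` is an illustrative name, and the cluster numbering is arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Rows are true species, columns are cluster ids assigned by K-means
species = pd.Series(iris.target_names[iris.target], name="species")
clusters = pd.Series(labels, name="cluster")
table = pd.crosstab(species, clusters)
print(table)
```

A species whose row is spread across two columns is being split between clusters.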

In [None]:
# Elbow method for K-means
inertia = []
for k in range(1, 11):
    # Use a separate name so the fitted 3-cluster model above is not overwritten
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertia.append(km.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()
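
Inertia always decreases as k grows, so the elbow can be hard to read. A complementary check, sketched here, is the mean silhouette score for each k. `scores` and `best_k` are illustrative names; the silhouette is undefined for k = 1, so the range starts at 2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Mean silhouette score per candidate k (higher means better-separated clusters)
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.4f}")
print(f"Highest silhouette at k={best_k}")
```

The silhouette and the elbow need not agree. When two species overlap, the silhouette may prefer fewer clusters than the number of true classes.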

## Conclusion

In this notebook, we explored various unsupervised learning techniques using the Iris dataset:

1. **Dimensionality Reduction**:
   - PCA reduced the 4-dimensional data to 2 dimensions while preserving most of the variance (roughly 96% for the standardized features)
   - The 2D visualization shows setosa clearly separated from the other two species, while versicolor and virginica partially overlap

2. **Clustering Algorithms**:
   - K-means found three clusters that broadly match the true species, with most of the disagreement expected between the two overlapping species
   - DBSCAN found two main clusters, which is reasonable given the overlap between versicolor and virginica
   - Hierarchical clustering (Ward linkage) found clusters similar to those from K-means

3. **Model Evaluation**:
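   - K-means and hierarchical clustering had similar silhouette and adjusted Rand scores; the adjusted Rand score relies on the true labels, so it serves only as an external check
   - The elbow plot is consistent with choosing 3 clusters, though the elbow is a visual heuristic and not a formal test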

4. **Key Insights**:
   - The Iris dataset is well-suited for clustering analysis
   - Setosa forms a distinct cluster in the feature space, while versicolor and virginica are harder to separate without labels
   - Dimensionality reduction helps visualize the clustering structure

This notebook serves as a good starting point for understanding unsupervised learning techniques and their application to real-world datasets. 