<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_3/Section_4_Python_Example__K_means_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 4 - K-means clustering with Python

K-means clustering is a popular and straightforward algorithm for partitioning data into K distinct, non-overlapping clusters. It alternates between two steps: assigning each data point to its nearest cluster center (centroid), and moving each centroid to the mean of the points assigned to it. Repeating these steps reduces the within-cluster sum of squared distances until the assignments stop changing. This section walks through an example using Python's scikit-learn library, applying K-means to a synthetic dataset to identify distinct groups based on their features.
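
To make the assign-then-update loop concrete, here is a minimal NumPy sketch of it. The function name `kmeans_sketch`, its parameters, and the toy array `X_demo` are illustrative assumptions; this is not scikit-learn's implementation, which adds smarter initialization and several restarts.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Illustrative assign/update loop; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Initialise the centers by picking k distinct data points at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignments are stable
        centers = new_centers
    return centers, labels

# Hypothetical toy data: two tight, well-separated groups of 2-D points
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans_sketch(X_demo, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.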

1. Setting Up the Environment:

Ensure Python and the necessary libraries are installed. This example needs scikit-learn and NumPy, plus Matplotlib for the visualizations. Install them using pip if they are not already installed:

In [None]:
%pip install numpy matplotlib scikit-learn

2. Importing Required Libraries:

Start by importing the libraries necessary for creating the dataset, performing the clustering, and visualizing the results:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

3. Generating Synthetic Data:

To demonstrate K-means, we'll create a synthetic dataset with clear clusters using make_blobs, which draws points from isotropic Gaussian blobs around a chosen number of centers:

In [None]:
# Generate synthetic data with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Plot the data
plt.scatter(X[:,0], X[:,1], s=50)
plt.title("Synthetic Data for Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
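
Before clustering, it can help to check what `make_blobs` actually returned. The sketch below regenerates the same data and prints its shape, the number of points per blob, and per-blob summary statistics. The variable names `label` and `pts` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as in the cell above
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

print(X.shape)         # (300, 2): 300 samples, 2 features
print(np.bincount(y))  # [100 100 100]: samples are split evenly across blobs

# Mean and standard deviation of each feature within each true blob
for label in np.unique(y):
    pts = X[y == label]
    print(label, pts.mean(axis=0).round(2), pts.std(axis=0).round(2))
```

The true labels `y` are only available because the data is synthetic; K-means itself never sees them.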

4. Applying K-means Clustering:

We'll apply K-means clustering to this dataset, specifying the number of clusters we expect (which is 3 in this case):

In [None]:
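# Create a K-means clustering model
# (n_init restarts and a fixed random_state make the result reproducible)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)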

# Fit the model to the data
kmeans.fit(X)

# Predict the cluster labels
y_kmeans = kmeans.predict(X)
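
As a side note, fitting and labelling can be done in one call with `fit_predict`, and a fitted model can assign unseen points to the nearest learned centroid with `predict`. The sketch below is self-contained; the `new_points` array is made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# fit_predict fits the model and returns a cluster label for every row of X
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(np.bincount(labels))  # how many points landed in each cluster

# Hypothetical unseen points, each placed near a different blob
new_points = np.array([[1.0, 4.0], [2.0, 1.0], [-1.5, 3.0]])
new_labels = kmeans.predict(new_points)
print(new_labels)
```

The cluster numbers are arbitrary identifiers, so they need not match the blob numbering used by `make_blobs`.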

5. Visualizing the Clusters:

Visualize the resulting clusters and the centroids to understand how well the algorithm has performed:

In [None]:
# Scatter plot of the data with the color indicating the cluster
plt.scatter(X[:,0], X[:,1], c=y_kmeans, s=50, cmap='viridis', marker='o', alpha=0.6, label='Data Points')
centers = kmeans.cluster_centers_

# Scatter plot for the centroids
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X', label='Centroids')
plt.title("K-means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

6. Evaluation and Analysis:

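After visualizing the clusters, evaluate the model's effectiveness. A common approach is to look at the within-cluster sum of squares (inertia), which K-means tries to minimize. For a fixed K, lower inertia means tighter clusters. Because inertia keeps decreasing as K grows, it is not comparable across different values of K on its own: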

In [None]:
print("Model Inertia:", kmeans.inertia_)
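
To see what this number measures, the following sketch recomputes inertia by hand as the sum of squared Euclidean distances from each point to its assigned centroid. The names `assigned_centers` and `manual_inertia` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)

print("Manual inertia:", manual_inertia)
print("Model inertia: ", kmeans.inertia_)
```

The two values should agree up to floating-point rounding.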

7. Conclusion:

This example illustrates how K-means clustering works and how it can be implemented in Python using scikit-learn. The algorithm is efficient and effective for a wide range of simple clustering problems. However, it requires the number of clusters to be chosen in advance and implicitly assumes that the clusters are roughly isotropic (spherical). K-means can perform poorly on elongated or non-convex shapes, or on clusters of very different sizes and densities. For real-world applications, it's important to preprocess the data appropriately (for example, scaling features so that no single one dominates the Euclidean distance) and to consider the algorithm's assumptions when interpreting the results.
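
When the number of clusters is not known, one common heuristic (an addition here, not part of the example above) is the "elbow" method: fit K-means for several values of K and look for the point where inertia stops dropping sharply. A minimal sketch on the same synthetic data, with the range of K chosen arbitrarily:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Fit one model per candidate K and record its inertia
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)
    print(k, round(model.inertia_, 1))
```

On this dataset the drop in inertia should flatten noticeably after K = 3, which matches the number of blobs that were generated. On real data the elbow is often less clear-cut, so treat it as a guide rather than a rule.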