In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("assignment4.ipynb")

In [2]:
%pip install scikit-learn -U

Note: you may need to restart the kernel to use updated packages.


# Clustering in Practice and K-Means Clustering

Please complete the following assignment. Run all cells and submit your completed notebook through Gradescope.

- Clustering is a fundamental unsupervised learning technique used to group similar data points based on their features. It helps in discovering patterns and structures within datasets without predefined labels.

- The k-means algorithm aims to partition a dataset into k clusters, assigning each data point to the cluster whose centroid is closest to that point.
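
The assign-then-update loop described above can be sketched in plain NumPy. This is a minimal illustration of Lloyd-style iterations, not scikit-learn's implementation; the function name `lloyd_kmeans`, the uniform random initialization, and the stopping rule are assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign points, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen uniformly at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so the labels match the returned centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = d2[np.arange(len(X)), labels].sum()
    return centers, labels, wcss
```

Calling `lloyd_kmeans(X, 3)` on a 2-D array `X` returns the centroids, the label of each point, and the within-cluster sum of squares. The result depends on the starting centroids, which is why the questions below fix `random_state`.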

#### This assignment tests your understanding of clustering fundamentals, specifically K-Means clustering.


<!-- BEGIN QUESTION -->

#### 1. Which of the following statements about k-means clustering is true?
    A) k-means clustering can only be used for categorical data.

    B) k-means clustering aims to minimize the within-cluster sum of squares.

    C) The k-means algorithm always converges to the global optimum.

    D) k-means clustering is a hierarchical clustering algorithm.

In [3]:
def ans1():
    return 'B'

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### 2. What is the main advantage of using the k-means++ initialization method over the standard k-means initialization?

    A) It selects initial centroids that are maximally distant from each other, ensuring convergence to the global optimum.

    B) It reduces the computational complexity of k-means from quadratic to linear time with respect to the number of data points.

    C) It allows k-means clustering to automatically determine the optimal number of clusters without prior specification.

    D) It probabilistically selects initial centroids based on data density, leading to faster convergence and improved clustering results.

In [4]:
def ans2():
    return 'D'

<!-- END QUESTION -->
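
For reference, the seeding idea behind k-means++ can be sketched as follows. In its standard formulation, each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen. The function name `kmeans_pp_init` is hypothetical, the sketch assumes the data contains at least `k` distinct points, and scikit-learn's own seeding routine may differ in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids by squared-distance sampling (k-means++ seeding)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First centroid: a data point chosen uniformly at random
    centers = [X[rng.integers(n)]]
    # Squared distance from each point to its nearest chosen centroid
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Sample the next centroid with probability proportional to d2,
        # so points far from the existing centroids are favored
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```

A point that has already been chosen has squared distance zero, so it cannot be drawn again. This spreads the initial centroids out without guaranteeing a global optimum.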

#### 3. Clustering on Synthetic Blob Data
- In this question, you are given a synthetic dataset generated with scikit-learn's `make_blobs` function.
- Apply K-Means clustering to this data and plot the clustered dataset, marking the cluster centers.
- Use `k = 3`.
- Return the **centers, clustered data point labels (classes), and the WCSS** (within-cluster sum of squares, exposed by scikit-learn as `inertia_`).
- Please use `random_state = 42` when defining the `KMeans` model.
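
The WCSS requested above is the sum of squared distances from each point to its assigned cluster center. The sketch below recomputes it by hand and compares it with the fitted model's `inertia_` attribute. The helper name `manual_wcss` is hypothetical, and `n_init=10` is an assumption added only to keep the example explicit across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def manual_wcss(X, centers, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centers[labels]) ** 2).sum())

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
# n_init=10 is an assumption for this sketch, not part of the assignment
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
wcss = manual_wcss(X, kmeans.cluster_centers_, kmeans.labels_)
print(wcss, kmeans.inertia_)  # the two values should agree up to rounding
```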

In [5]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
def ans3():
    X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
    kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
    return kmeans.cluster_centers_, kmeans.labels_, kmeans.inertia_

In [6]:
grader.check("q3")
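
Question 3 also asks for a plot of the clustered data with the centers marked. One possible way to draw it is sketched below. The helper name `plot_clusters`, the colors, and the marker choices are all assumptions, and `n_init=10` is added only for this example.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def plot_clusters(X, labels, centers, ax=None):
    """Scatter the points colored by cluster and mark the cluster centers."""
    if ax is None:
        _, ax = plt.subplots()
    # Data points, colored by their assigned cluster label
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
    # Cluster centers drawn on top as red crosses
    ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=120, label="centers")
    ax.set_xlabel("feature 1")
    ax.set_ylabel("feature 2")
    ax.legend()
    return ax

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
# n_init=10 is an assumption for this sketch, not part of the assignment
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
ax = plot_clusters(X, kmeans.labels_, kmeans.cluster_centers_)
```

In a notebook the figure renders when the cell finishes; call `plt.show()` when running the code as a script.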

#### 4. Clustering on funky-looking data
- In this question, you are given a synthetic, non-convex dataset of two concentric circles generated with scikit-learn's `make_circles` function.
- Apply K-Means clustering to this data and plot the clustered dataset, marking the cluster centers.
- Return the **predicted labels, true labels, and the Adjusted Rand Index**.
- Please use `random_state = 42` when defining the `KMeans` model.

In [7]:
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def ans4():
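    # Generate a dataset of two concentric circles (a large circle containing a smaller one)
    X, y_true = make_circles(n_samples=300, noise=0.05, random_state=42)
    # fit_predict fits the model and returns the labels in a single call
    kmeans = KMeans(n_clusters=2, random_state=42)
    pred = kmeans.fit_predict(X)
    # adjusted_rand_score takes (labels_true, labels_pred); the score is symmetric
    return pred, y_true, adjusted_rand_score(y_true, pred)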

In [8]:
grader.check("q4")
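
The Adjusted Rand Index used in Question 4 compares two partitions rather than two sets of label values, so renaming the clusters does not change the score. The small example below illustrates this with hand-made labels; the variable names are chosen only for this sketch.

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 1, 1]
renamed = [1, 1, 0, 0]  # same grouping as y_true, cluster ids swapped
mixed = [0, 1, 0, 1]    # each predicted cluster mixes both true classes

ari_renamed = adjusted_rand_score(y_true, renamed)
ari_mixed = adjusted_rand_score(y_true, mixed)
print(ari_renamed)  # 1.0: the partitions are identical
print(ari_mixed)    # -0.5: worse than chance agreement on this tiny example
```

This is why the K-Means labels can be compared with `y_true` directly, even though K-Means numbers its clusters arbitrarily.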

#### 5. Comparing K-means with Random Initialization and K-means++ Initialization

- In this question you are tasked with **comparing** the k-means clustering using random initialization and k-means++ initialization.
- The function should return the number of iterations and the WCSS for both techniques, in the order `n_iter_random`, `n_iter_kpp`, `wcss_random`, `wcss_kpp`.
- Please use `random_state = 42` when defining the `KMeans` model.


In [9]:
def ans5():
    # Generate a synthetic dataset with 5 clusters
    X, y_true = make_blobs(n_samples=1000, centers=5, cluster_std=0.60, random_state=42)
    kmeans = KMeans(n_clusters=5, init='random', random_state=42).fit(X)
    kmeans_pp = KMeans(n_clusters=5, init='k-means++', random_state=42).fit(X)
    return kmeans.n_iter_, kmeans_pp.n_iter_, kmeans.inertia_, kmeans_pp.inertia_

In [10]:
grader.check("q5")

#### 6. Determining Optimal K Using Elbow Method
- In this question, run k-means clustering with `k` values from 1 to 10 and compute the WCSS for each `k`.
- Use `k-means++` as the initialization method.
- Compute the Rand score (computed here with `adjusted_rand_score`) and the WCSS for each `k`.
- Return the optimal `k` along with the Rand scores and WCSS values. **[return in this order: k, rand_scores (list), wcss (list)]**
- Please use `random_state = 42` when defining the `KMeans` model.

In [11]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def ans6():
    rand_scores = []
    wcss = []
    optimal_k = 0
    optimal_score = 0
    # Generate the blob dataset
    X, y_true = make_blobs(n_samples=1000, centers=5, cluster_std=0.60, random_state=42)
    for k in range(1, 11):
        # Note: n_init equals k here, so larger k values get more restarts
        kmeans_pp = KMeans(n_clusters=k, init='k-means++', random_state=42, n_init=k)
        pred = kmeans_pp.fit_predict(X)  # fits once and returns the labels
        score = adjusted_rand_score(y_true, pred)
        rand_scores.append(score)
        wcss.append(kmeans_pp.inertia_)
        # Keep the k with the highest adjusted Rand score (uses the true labels)
        if score > optimal_score:
            optimal_k = k
            optimal_score = score
    return [optimal_k, rand_scores, wcss]

In [12]:
grader.check("q6")
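
When true labels are not available, the elbow can be read from the WCSS curve alone. One common heuristic, sketched below, picks the `k` whose point lies farthest below the straight line joining the first and last points of the curve. The function name `elbow_k` is hypothetical, the sketch assumes the WCSS values decrease from the first `k` to the last, and other elbow criteria may give a different answer.

```python
import numpy as np

def elbow_k(ks, wcss):
    """Return the k whose (k, WCSS) point lies farthest below the chord
    joining the first and last points of the curve."""
    ks = np.asarray(ks, dtype=float)
    wcss = np.asarray(wcss, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the comparison
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (wcss - wcss[-1]) / (wcss[0] - wcss[-1])
    # The chord runs from (0, 1) to (1, 0); 1 - x - y is proportional to
    # how far each point sits below that chord
    gap = 1.0 - x - y
    return int(ks[np.argmax(gap)])
```

For example, `elbow_k(range(1, 11), wcss)` can be applied to the `wcss` list returned by `ans6()` and compared with the `k` chosen by the Rand score.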

#### Submit the completed notebook to Gradescope to view your result!