<a href="https://colab.research.google.com/github/anyuanay/INFO213/blob/main/INFO213_Week10_more_clustering_lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 213: Data Science Programming 2
___

### Week 10: More on Clustering Analysis


### Overview

- [Grouping objects by similarity using k-means](#Grouping-objects-by-similarity-using-k-means)
  - [Using the elbow method to find the optimal number of clusters](#Using-the-elbow-method-to-find-the-optimal-number-of-clusters)
  - [Quantifying the quality of clustering via silhouette plots](#Quantifying-the-quality-of-clustering-via-silhouette-plots)
- [Organizing clusters as a hierarchical tree](#Organizing-clusters-as-a-hierarchical-tree)
  - [Grouping clusters in bottom-up fashion](#Grouping-clusters-in-bottom-up-fashion)
  - [Performing hierarchical clustering on a distance matrix](#Performing-hierarchical-clustering-on-a-distance-matrix)
  - [Attaching dendrograms to a heat map](#Attaching-dendrograms-to-a-heat-map)
  - [Applying agglomerative clustering via scikit-learn](#Applying-agglomerative-clustering-via-scikit-learn)
- [Locating regions of high density via DBSCAN](#Locating-regions-of-high-density-via-DBSCAN)

<br>
<br>

# Create Simple 2-D Dataset

```python
from sklearn.datasets import make_blobs


X, y = make_blobs(n_samples=150,
                  n_features=2,
                  centers=3,
                  cluster_std=0.5,
                  shuffle=True,
                  random_state=0)
```

## Visualize the data set

```python
import matplotlib.pyplot as plt


plt.scatter(X[:, 0], X[:, 1],
            c='white', marker='o', edgecolor='black', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.grid()
plt.tight_layout()
#plt.savefig('figures/10_01.png', dpi=300)
plt.show()
```

## Using the elbow method to find the optimal number of clusters

- One of the main challenges in unsupervised learning is that we do not know the definitive answer.
- We don’t have ground-truth class labels in our dataset, so we cannot apply the techniques for model evaluation and selection that are used in supervised learning.
- To quantify the quality of clustering, we need to use intrinsic metrics, such as the within-cluster SSE (distortion).
- Conveniently, we don’t need to compute the within-cluster SSE explicitly when we are using scikit-learn, as it is already accessible via the `inertia_` attribute after fitting a `KMeans` model:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3,
            init='random',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)

y_km = km.fit_predict(X)
```

```python
print(f'Distortion: {km.inertia_:.2f}')
```
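As a quick sanity check, the distortion can also be recomputed by hand: it is the sum of squared Euclidean distances between each example and the centroid of its assigned cluster. The sketch below regenerates the same blobs so that it runs on its own; the names `X_demo`, `km_demo`, and `manual_sse` are introduced here for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same toy data as above, regenerated so the snippet is self-contained
X_demo, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                       cluster_std=0.5, shuffle=True, random_state=0)

km_demo = KMeans(n_clusters=3, init='random', n_init=10,
                 max_iter=300, tol=1e-04, random_state=0)
labels_demo = km_demo.fit_predict(X_demo)

# Look up the centroid assigned to each example, then sum the squared distances
assigned_centroids = km_demo.cluster_centers_[labels_demo]
manual_sse = np.sum((X_demo - assigned_centroids) ** 2)

print(f'Manual SSE: {manual_sse:.2f}')
print(f'inertia_:   {km_demo.inertia_:.2f}')
```

Both printed values should agree up to floating-point rounding.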

- Based on the within-cluster SSE, we can use a graphical tool, the so-called elbow method, to estimate
the optimal number of clusters, k, for a given task.
- As k increases, the distortion decreases.
- This is because the examples will be closer to the centroids they are assigned to.
- The idea behind the elbow method is to identify the value of k beyond which the distortion stops dropping rapidly and the curve flattens out (the "elbow"), which will become clearer if we plot the distortion for different values of k:

```python
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
#plt.savefig('figures/10_03.png', dpi=300)
plt.show()
```

- As you can see, the elbow is located at k = 3, so this is supporting evidence that k = 3 is indeed a good choice for this dataset.
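The elbow can also be read off numerically by looking at how much the distortion drops each time one more cluster is added. The following is a minimal sketch on the same blobs (regenerated so it runs on its own); `distortions_demo` and `drops` are illustrative names, and inspecting successive drops is just one informal way to locate the elbow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                       cluster_std=0.5, shuffle=True, random_state=0)

ks = list(range(1, 11))
distortions_demo = []
for k in ks:
    km_k = KMeans(n_clusters=k, init='k-means++', n_init=10,
                  max_iter=300, random_state=0)
    km_k.fit(X_demo)
    distortions_demo.append(km_k.inertia_)

# Reduction in distortion when going from k to k + 1 clusters
drops = -np.diff(distortions_demo)
for k, drop in zip(ks[:-1], drops):
    print(f'k={k} -> k={k + 1}: distortion drops by {drop:.2f}')
```

Large drops followed by consistently small ones indicate where the curve bends.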

## Quantifying the quality of clustering via silhouette plots

- Another intrinsic metric to evaluate the quality of a clustering is silhouette analysis.
- Silhouette analysis can be used as a graphical tool to plot a measure of how tightly grouped the examples
in the clusters are.
- To calculate the silhouette coefficient of a single example in our dataset, we can apply the following three steps:

1. Calculate the cluster cohesion, $a^{(i)}$, as the average distance between an example, $x^{(i)}$, and all
other points in the same cluster.
2. Calculate the cluster separation, $b^{(i)}$, from the next closest cluster as the average distance between the example, $x^{(i)}$, and all examples in the nearest cluster.
3. Calculate the silhouette, $s^{(i)}$, as the difference between cluster cohesion and separation divided
by the greater of the two, as shown here:
$$
s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\{b^{(i)}, a^{(i)}\}}
$$
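The three steps can be traced by hand for a single example. The sketch below is an illustration under simple assumptions (Euclidean distance, every cluster has more than one member); it regenerates the blobs, clusters them with k-means, and compares the hand-computed value with scikit-learn's `silhouette_samples` function. The names `a_i`, `b_i`, and `s_i` mirror the symbols in the formula.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X_demo, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                       cluster_std=0.5, shuffle=True, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10,
                     random_state=0).fit_predict(X_demo)

i = 0  # index of the example to inspect
dists = np.linalg.norm(X_demo - X_demo[i], axis=1)

# Step 1: cohesion = average distance to the *other* points in the same cluster
same = labels_demo == labels_demo[i]
a_i = dists[same].sum() / (same.sum() - 1)  # the self-distance is 0

# Step 2: separation = smallest average distance to any other cluster
b_i = min(dists[labels_demo == c].mean()
          for c in np.unique(labels_demo) if c != labels_demo[i])

# Step 3: silhouette coefficient
s_i = (b_i - a_i) / max(a_i, b_i)

s_sklearn = silhouette_samples(X_demo, labels_demo, metric='euclidean')[i]
print(f'manual: {s_i:.4f}, scikit-learn: {s_sklearn:.4f}')
```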

- The silhouette coefficient is bounded in the range –1 to 1.
- The silhouette coefficient is 0 if the cluster separation and cohesion are equal ($b^{(i)} = a^{(i)}$).
- The coefficient approaches the ideal value of 1 if $b^{(i)} \gg a^{(i)}$, that is, when an example is far from the nearest other cluster relative to its distance to the members of its own cluster.
- The silhouette coefficient is available as `silhouette_samples` from scikit-learn’s `metrics` module, and optionally, the `silhouette_score` function can be imported for convenience.
- The `silhouette_score` function calculates the average silhouette coefficient across all examples, which is equivalent to `numpy.mean(silhouette_samples(...))`.
- By executing the following code, we will now create a plot of the silhouette coefficients for a k-means clustering with k = 3:

```python
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples


km = KMeans(n_clusters=3,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)

cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]

silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)

    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)

silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")

plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')

plt.tight_layout()
#plt.savefig('figures/10_04.png', dpi=300)
plt.show()
```

- Through a visual inspection of the silhouette plot, we can quickly scrutinize the sizes of the different
clusters and identify clusters that contain outliers.
- As you can see in the preceding silhouette plot, the silhouette coefficients are not close to 0 and are approximately equally far away from the average silhouette score, which is, in this case, an indicator of good clustering.
- Furthermore, to summarize the goodness of our clustering, we added the average silhouette coefficient to the plot (dashed red line).

- To see what a silhouette plot looks like for a relatively bad clustering, let’s seed the k-means algorithm
with only two centroids:

```python
km = KMeans(n_clusters=2,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)

plt.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            s=50,
            c='lightgreen',
            edgecolor='black',
            marker='s',
            label='Cluster 1')
plt.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            s=50,
            c='orange',
            edgecolor='black',
            marker='o',
            label='Cluster 2')

plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            s=250, marker='*', c='red', label='Centroids')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.legend()
plt.grid()
plt.tight_layout()
#plt.savefig('figures/10_05.png', dpi=300)
plt.show()
```

```python
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)

    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)

silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")

plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')

plt.tight_layout()
#plt.savefig('figures/10_06.png', dpi=300)
plt.show()
```

- As you can see in the resulting silhouette plot, the silhouettes now have visibly different lengths and widths, which is evidence of a relatively bad or at least suboptimal clustering.
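The visual comparison can be backed up with a single number per clustering. The sketch below, which assumes the same blob data and k-means settings as above, computes the average silhouette coefficient for k = 2 and k = 3 with scikit-learn's `silhouette_score` and also confirms that it matches the mean of `silhouette_samples`. The dictionary `avg_scores` is an illustrative name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X_demo, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                       cluster_std=0.5, shuffle=True, random_state=0)

avg_scores = {}
for k in (2, 3):
    labels_k = KMeans(n_clusters=k, init='k-means++', n_init=10,
                      max_iter=300, tol=1e-04,
                      random_state=0).fit_predict(X_demo)
    score = silhouette_score(X_demo, labels_k, metric='euclidean')
    # Average of the per-example coefficients, computed separately
    mean_of_samples = np.mean(
        silhouette_samples(X_demo, labels_k, metric='euclidean'))
    avg_scores[k] = (score, mean_of_samples)
    print(f'k={k}: silhouette_score={score:.3f}, '
          f'mean of samples={mean_of_samples:.3f}')
```

A higher average for k = 3 than for k = 2 is consistent with what the two silhouette plots show.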

# Organizing clusters as a hierarchical tree

- One advantage of the hierarchical clustering algorithm is that it allows us to plot dendrograms
(visualizations of a binary hierarchical clustering), which can help with the interpretation of
the results by creating meaningful taxonomies.
- Another advantage of this hierarchical approach is
that we do not need to specify the number of clusters upfront.

## Grouping clusters in bottom-up fashion

- The two standard algorithms for agglomerative hierarchical clustering are single linkage and complete
linkage.
- Using single linkage, we compute the distances between the most similar members for each pair of clusters and merge the two clusters for which the distance between the most similar members is the smallest.
- The complete linkage approach is similar to single linkage but, instead of comparing the most similar members in each pair of clusters, we compare the most dissimilar members to perform
the merge.

<img src="https://github.com/rasbt/machine-learning-book/blob/main/ch10/figures/10_07.png?raw=true" width="600px" />
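To make the difference concrete, the two linkage criteria can be evaluated for one pair of clusters. The tiny clusters below are made up for illustration; `cdist` from SciPy returns all pairwise distances between members of the two clusters.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters on a line, chosen so the distances are easy to check
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# Rows: members of cluster_a, columns: members of cluster_b
pairwise = cdist(cluster_a, cluster_b, metric='euclidean')

single_link = pairwise.min()    # distance between the most similar members
complete_link = pairwise.max()  # distance between the most dissimilar members

print(f'single linkage distance:   {single_link}')
print(f'complete linkage distance: {complete_link}')
```

Single linkage reports 2.0 (between the points at x = 1 and x = 3), while complete linkage reports 5.0 (between x = 0 and x = 5).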

### Agglomerative Clustering
- Hierarchical complete linkage clustering is an iterative procedure that can be summarized by the following steps:
1. Compute a pair-wise distance matrix of all examples.
2. Represent each data point as a singleton cluster.
3. Merge the two closest clusters based on the distance between the most dissimilar (distant)
members.
4. Update the cluster linkage matrix.
5. Repeat steps 3-4 until one single cluster remains.
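The procedure above can be sketched as a deliberately naive loop. This is an illustration only, not how SciPy implements it: the code recomputes cluster distances from the pairwise matrix in every iteration, which is fine for five points but inefficient for real data. The names `pts`, `clusters`, and `merge_heights` are introduced here; the resulting merge distances are compared with SciPy's `linkage` as a check.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(123)
pts = rng.random_sample([5, 3]) * 10

# Pair-wise distance matrix of all examples
D = squareform(pdist(pts, metric='euclidean'))

# Every data point starts as a singleton cluster (lists of row indices)
clusters = [[i] for i in range(len(pts))]

merge_heights = []
# Keep merging the two closest clusters until only one remains
while len(clusters) > 1:
    best = None
    for p in range(len(clusters)):
        for q in range(p + 1, len(clusters)):
            # Complete linkage: distance between the most distant members
            d = D[np.ix_(clusters[p], clusters[q])].max()
            if best is None or d < best[0]:
                best = (d, p, q)
    d, p, q = best
    merge_heights.append(d)
    clusters[p] = clusters[p] + clusters[q]
    del clusters[q]  # q > p, so index p is unaffected

Z = linkage(pdist(pts, metric='euclidean'), method='complete')
print(np.round(merge_heights, 4))
print(np.round(Z[:, 2], 4))
```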


```python
import pandas as pd
import numpy as np


np.random.seed(123)

variables = ['X', 'Y', 'Z']
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']

X = np.random.random_sample([5, 3])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df
```

## Performing hierarchical clustering on a distance matrix

```python
from scipy.spatial.distance import pdist, squareform


row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
                        columns=labels,
                        index=labels)
row_dist
```

- We can either pass a condensed distance matrix (upper triangular) from the `pdist` function, or we can pass the "original" data array and define the `metric='euclidean'` argument in `linkage`.
- However, we should not pass the squareform distance matrix, which would yield different distance values although the overall clustering could be the same.

```python
# 1. incorrect approach: Squareform distance matrix

from scipy.cluster.hierarchy import linkage


row_clusters = linkage(row_dist, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=[f'cluster {(i + 1)}'
                    for i in range(row_clusters.shape[0])])
```

```python
# 2. correct approach: Condensed distance matrix

row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=[f'cluster {(i + 1)}'
                    for i in range(row_clusters.shape[0])])
```

```python
# 3. correct approach: Input matrix

row_clusters = linkage(df.values, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=[f'cluster {(i + 1)}'
                    for i in range(row_clusters.shape[0])])
```
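The three approaches can also be compared programmatically. The sketch below regenerates the random data with the same seed and checks that the condensed-matrix and raw-data inputs produce identical linkage matrices, whereas the squareform input yields different merge distances, because `linkage` then treats each row of the distance matrix as a feature vector. The variable names are illustrative, and the warning SciPy may emit for the square input is silenced on purpose.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(123)
data = rng.random_sample([5, 3]) * 10

condensed = pdist(data, metric='euclidean')  # 5 * 4 / 2 = 10 distances
square = squareform(condensed)               # symmetric 5 x 5 matrix

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    z_square = linkage(square, method='complete', metric='euclidean')
z_condensed = linkage(condensed, method='complete')
z_raw = linkage(data, method='complete', metric='euclidean')

print('condensed == raw data:', np.allclose(z_condensed, z_raw))
print('square == condensed:  ',
      np.allclose(z_square[:, 2], z_condensed[:, 2]))
```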

```python
from scipy.cluster.hierarchy import dendrogram


# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])

row_dendr = dendrogram(row_clusters,
                       labels=labels,
                       # make dendrogram black (part 2/2)
                       # color_threshold=np.inf
                       )
plt.ylabel('Euclidean distance')
plt.tight_layout()
#plt.savefig('figures/10_11.png', dpi=300,
#            bbox_inches='tight')
plt.show()
```

## Applying agglomerative clustering via scikit-learn

```python
from sklearn.cluster import AgglomerativeClustering


ac = AgglomerativeClustering(n_clusters=3,
                             metric="euclidean",
                             linkage="complete"
                            )

labels = ac.fit_predict(X)
print(f'Cluster labels: {labels}')
```

```python
# Two clusters
ac = AgglomerativeClustering(n_clusters=2,
                             metric="euclidean",
                             linkage="complete"
                            )

labels = ac.fit_predict(X)

print(f'Cluster labels: {labels}')
```
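As a cross-check, the scikit-learn labels can be compared with a flat cut of the SciPy linkage matrix. The sketch below uses SciPy's `fcluster` with `criterion='maxclust'` to cut the tree into three clusters and measures agreement with `adjusted_rand_score`, which ignores how the cluster IDs happen to be numbered. The data is regenerated with the same seed, the variable names are illustrative, and the default Euclidean distance is assumed for `AgglomerativeClustering`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.RandomState(123)
data = rng.random_sample([5, 3]) * 10

# SciPy: build the full tree, then cut it into at most 3 flat clusters
z = linkage(data, method='complete', metric='euclidean')
scipy_labels = fcluster(z, t=3, criterion='maxclust')

# scikit-learn: request 3 clusters directly (Euclidean distance by default)
sk_labels = AgglomerativeClustering(n_clusters=3,
                                    linkage='complete').fit_predict(data)

agreement = adjusted_rand_score(scipy_labels, sk_labels)
print(f'SciPy labels:        {scipy_labels}')
print(f'scikit-learn labels: {sk_labels}')
print(f'Adjusted Rand index: {agreement:.2f}')
```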

# Locating regions of high density via DBSCAN

- Density-Based Spatial Clustering of Applications with
Noise (DBSCAN) does not make assumptions about spherical clusters like k-means, nor does
it partition the dataset into hierarchies that require a manual cut-off point.
- As its name implies, density-based clustering assigns cluster labels based on dense regions of points.
- In DBSCAN, the notion of density is defined as the number of points within a specified radius, $\epsilon$ .
- According to the DBSCAN algorithm, a special label is assigned to each example (data point) using
the following criteria:
    - A point is considered a core point if at least a specified number (MinPts) of neighboring points fall within the specified radius, $\epsilon$.
    - A border point is a point that has fewer neighbors than MinPts within $\epsilon$, but lies within the $\epsilon$ radius of a core point.
    - All other points that are neither core nor border points are considered noise points.

- After labeling the points as core, border, or noise, the DBSCAN algorithm can be summarized in two simple steps:
    1. Form a separate cluster for each core point or connected group of core points. (Core points are connected if they are no farther away than $\epsilon$.)
    2. Assign each border point to the cluster of its corresponding core point.

<img src="https://github.com/rasbt/machine-learning-book/blob/main/ch10/figures/10_13.png?raw=true" width="600px" />
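The core, border, and noise criteria can be applied by hand with a pairwise distance matrix. The following is a minimal sketch on half-moon data, assuming scikit-learn's convention that `min_samples` counts the point itself; the masks `is_core`, `is_border`, and `is_noise` are illustrative names, and the result is compared with a fitted `DBSCAN` model.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
eps, min_pts = 0.2, 5

# Boolean neighborhood matrix; each point is its own neighbor (distance 0)
within_eps = cdist(X_moons, X_moons, metric='euclidean') <= eps

# Core: at least min_pts points (including itself) within eps
is_core = within_eps.sum(axis=1) >= min_pts
# Border: not core, but within eps of at least one core point
is_border = ~is_core & within_eps[:, is_core].any(axis=1)
# Noise: everything else
is_noise = ~is_core & ~is_border

db_demo = DBSCAN(eps=eps, min_samples=min_pts,
                 metric='euclidean').fit(X_moons)

print(f'core: {is_core.sum()}, border: {is_border.sum()}, '
      f'noise: {is_noise.sum()}')
```

In scikit-learn, the core points are exposed as `core_sample_indices_` and noise points receive the label `-1`.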

```python
from sklearn.datasets import make_moons


X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1])

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.tight_layout()
#plt.savefig('figures/10_14.png', dpi=300)
plt.show()
```

- K-means and hierarchical clustering:

```python

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

km = KMeans(n_clusters=2, random_state=0)
y_km = km.fit_predict(X)
ax1.scatter(X[y_km == 0, 0], X[y_km == 0, 1],
            edgecolor='black',
            c='lightblue', marker='o', s=40, label='Cluster 1')
ax1.scatter(X[y_km == 1, 0], X[y_km == 1, 1],
            edgecolor='black',
            c='red', marker='s', s=40, label='Cluster 2')
ax1.set_title('K-means clustering')

ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')

ac = AgglomerativeClustering(n_clusters=2,
                             metric='euclidean',
                             linkage='complete')
y_ac = ac.fit_predict(X)
ax2.scatter(X[y_ac == 0, 0], X[y_ac == 0, 1], c='lightblue',
            edgecolor='black',
            marker='o', s=40, label='Cluster 1')
ax2.scatter(X[y_ac == 1, 0], X[y_ac == 1, 1], c='red',
            edgecolor='black',
            marker='s', s=40, label='Cluster 2')
ax2.set_title('Agglomerative clustering')

ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')

plt.legend()
plt.tight_layout()
#plt.savefig('figures/10_15.png', dpi=300)
plt.show()
```

- Density-based clustering:

```python
from sklearn.cluster import DBSCAN


db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
y_db = db.fit_predict(X)
plt.scatter(X[y_db == 0, 0], X[y_db == 0, 1],
            c='lightblue', marker='o', s=40,
            edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_db == 1, 0], X[y_db == 1, 1],
            c='red', marker='s', s=40,
            edgecolor='black',
            label='Cluster 2')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.legend()
plt.tight_layout()
#plt.savefig('figures/10_16.png', dpi=300)
plt.show()
```
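Because `make_moons` also returns the true moon membership, the three clusterings can be compared numerically, which is only possible here because the data is synthetic. The sketch below uses the adjusted Rand index from scikit-learn as one possible agreement measure; the dictionary names are illustrative, and `n_init=10` is set explicitly for k-means as an assumption to keep the result independent of version defaults.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)

models = {
    'k-means': KMeans(n_clusters=2, n_init=10, random_state=0),
    'agglomerative': AgglomerativeClustering(n_clusters=2,
                                             linkage='complete'),
    'DBSCAN': DBSCAN(eps=0.2, min_samples=5, metric='euclidean'),
}

# Adjusted Rand index: 1.0 means perfect agreement with the true moons
ari = {name: adjusted_rand_score(y_moons, model.fit_predict(X_moons))
       for name, model in models.items()}

for name, value in ari.items():
    print(f'{name:>14}: ARI = {value:.2f}')
```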

- The DBSCAN algorithm can successfully detect the half-moon shapes, which highlights one of the strengths of DBSCAN—clustering data of arbitrary shapes.
- However, we should also note some of the disadvantages of DBSCAN. With an increasing number of
features in our dataset, the negative effect of the
curse of dimensionality increases.
- This is especially a problem if we are using the Euclidean distance metric.
- However, the problem of the curse of dimensionality is not unique to DBSCAN: it also affects
other clustering algorithms that use the Euclidean distance metric, for example, k-means and hierarchical
clustering algorithms.
- In addition, we have two hyperparameters in DBSCAN (MinPts and $\epsilon$) that need to be optimized to yield good clustering results.
- Finding a good combination of MinPts and $\epsilon$ can be problematic if the density differences in the dataset are relatively large.
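The sensitivity to $\epsilon$ is easy to see by refitting DBSCAN with a few radii while keeping MinPts fixed. The values 0.05, 0.2, and 2.0 below are arbitrary choices for illustration, and `summary` is an illustrative name; each entry records the number of clusters found and the number of points labeled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

summary = {}
for eps in (0.05, 0.2, 2.0):
    labels_eps = DBSCAN(eps=eps, min_samples=5,
                        metric='euclidean').fit_predict(X_moons)
    n_noise = int(np.sum(labels_eps == -1))
    # The label -1 marks noise and is not a cluster
    n_clusters = len(set(labels_eps)) - (1 if n_noise > 0 else 0)
    summary[eps] = (n_clusters, n_noise)
    print(f'eps={eps}: {n_clusters} cluster(s), {n_noise} noise point(s)')
```

A radius that is too small leaves many points without enough neighbors, while a radius that is too large merges everything into one cluster.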

# Summary

- In practice, it is not always obvious which clustering algorithm will perform best on a given
dataset, especially if the data comes in multiple dimensions that make it hard or impossible to visualize.
- Furthermore, it is important to emphasize that a successful clustering does not only depend on
the algorithm and its hyperparameters; rather, the choice of an appropriate distance metric and the
use of domain knowledge that can help to guide the experimental setup can be even more important.
- In the context of the curse of dimensionality, it is thus common practice to apply dimensionality
reduction techniques prior to performing clustering.
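One way to combine dimensionality reduction with clustering is a scikit-learn pipeline. The sketch below is an illustration under stated assumptions: the 50-dimensional blobs are synthetic, and the choice of standardization, PCA with two components, and k-means with three clusters is arbitrary rather than a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical high-dimensional dataset for illustration
X_high, _ = make_blobs(n_samples=150, n_features=50, centers=3,
                       cluster_std=1.0, random_state=0)

pipe = make_pipeline(
    StandardScaler(),                      # put features on a common scale
    PCA(n_components=2, random_state=0),   # compress 50 features to 2
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels_high = pipe.fit_predict(X_high)

# The steps before the final estimator give the reduced representation
X_reduced = pipe[:-1].transform(X_high)
print(X_reduced.shape, np.unique(labels_high))
```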
