# Density-Based Clustering

**CS5483 Data Warehousing and Data Mining**

___

In [None]:
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from ipywidgets import interact
from sklearn import datasets, preprocessing
from sklearn.cluster import DBSCAN
from sklearn.pipeline import make_pipeline

%matplotlib inline

## DBSCAN with scikit-learn

[DBSCAN (Density-based spatial clustering of applications with noise)](https://en.wikipedia.org/wiki/DBSCAN) is a clustering algorithm that identifies clusters as regions of densely populated instances.

````{admonition} Definition 

Given the parameters $\varepsilon$ and $\operatorname{MinPts}$, a point $\M{p}\in D$ in the dataset is called a *core point* if it satisfies

$$
\begin{align}
|D\cap N_{\varepsilon}(\M{p})|&\geq \operatorname{MinPts} \quad \text{where}\\
N_{\varepsilon}(\M{p})&:= \Set{\M{q}\in D| \operatorname{dist}(\M{p},\M{q})\leq \varepsilon}.
\end{align}
$$ (core-points)

The core points form the *pillars* of the clusters to be generated. More precisely, clusters are grown from core points to the points in their neighborhoods $N_{\varepsilon}(\M{p})$, and onward from any core points reached this way. The points reached are referred to as *density-reachable* points:

- A point is regarded as noise if it has no cluster assignment, i.e., it is not *density-reachable* from any core point.
- A non-core point that is density-reachable from a core point is called a *border point*.

````
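
As a rough illustration of the definition, the following sketch labels each point of a small toy dataset as a core, border, or noise point. The helper `classify_points` and the toy coordinates are hypothetical and not part of `sklearn`. The sketch assumes Euclidean distance and counts a point as a member of its own neighborhood, as in the definition above.

```python
import numpy as np


def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise' (illustrative only)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances: dist[i, j] = ||X[i] - X[j]||.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Each neighborhood contains the point itself because dist[i, i] = 0 <= eps.
    in_neighborhood = dist <= eps
    is_core = in_neighborhood.sum(axis=1) >= min_pts
    # A non-core point is a border point if it lies in the neighborhood
    # of at least one core point; otherwise it is noise.
    near_core = in_neighborhood[:, is_core].any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))


# Hypothetical toy data: four nearby points and one isolated point.
toy = [[0, 0], [0, 1], [1, 0], [2, 0], [5, 5]]
kinds = classify_points(toy, eps=1.1, min_pts=3)
print(kinds.tolist())
```

The quadratic-size distance matrix is only reasonable for small datasets; practical implementations typically use spatial indexing for the neighborhood queries.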

We will create synthetic datasets using the [sample generators](https://scikit-learn.org/stable/modules/classes.html#samples-generator) of `sklearn`. In particular, we first create spherical clusters using [`sklearn.datasets.make_blobs`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs):

In [None]:
def XY2df(X, Y):
    """Return a DataFrame for 2D data with 2 input features X and a target Y."""
    df = pd.DataFrame(columns=["feature1", "feature2", "target"])
    df["target"] = Y
    df[["feature1", "feature2"]] = X
    return df


@interact
def generate_blobs(
    n_samples=widgets.IntSlider(value=200, min=10, max=1000, continuous_update=False),
    centers=widgets.IntSlider(value=3, min=1, max=4, continuous_update=False),
    cluster_std=widgets.FloatSlider(
        value=0.5, min=0, max=5, step=0.1, continuous_update=False
    ),
):
    df = XY2df(
        *datasets.make_blobs(
            n_samples=n_samples,
            centers=centers,
            cluster_std=cluster_std,
            random_state=0,
        )
    )
    fig, ax = plt.subplots()
    ax.set_aspect("equal")
    sns.scatterplot(data=df, x="feature1", y="feature2", hue="target", ax=ax)

We will use the dataset `df_spherical` created with the default parameters specified above:

In [None]:
df_spherical = XY2df(
    *datasets.make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
)

To create non-spherical clusters, one way is to use [`sklearn.datasets.make_circles`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html).

In [None]:
df_nonspherical = XY2df(
    *datasets.make_circles(n_samples=200, factor=0.1, noise=0.1, random_state=0)
)

**Exercise** Complete the following code by assigning `X` and `Y` to the respective arrays of input features and target generated using [`sklearn.datasets.make_circles`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html). Set `random_state=0` and use the parameters `n_samples`, `factor`, and `noise` specified by the widgets.

In [None]:
@interact
def generate_circles(
    n_samples=widgets.IntSlider(value=200, min=10, max=1000, continuous_update=False),
    factor=widgets.FloatSlider(
        value=0.1, min=0, max=0.99, step=0.01, continuous_update=False
    ),
    noise=widgets.FloatSlider(
        value=0.1, min=0, max=1, step=0.1, continuous_update=False
    ),
):
    df = pd.DataFrame(columns=["feature1", "feature2", "target"])
    # your python code here
    # end of python code
    
    df["target"] = Y
    df[["feature1", "feature2"]] = X
    fig, ax = plt.subplots()
    ax.set_aspect("equal")
    sns.scatterplot(data=df, x="feature1", y="feature2", hue="target", ax=ax)

To normalize the features before applying [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html), we create a pipeline as follows:

```python
from sklearn.cluster import DBSCAN
```

In [None]:
dbscan_minmax_normalized = make_pipeline(
    preprocessing.MinMaxScaler(), DBSCAN(eps=0.3, min_samples=3)
)
dbscan_minmax_normalized
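
To see what the normalization step does to the distances that `eps` is compared against, the sketch below applies the min-max formula $x' = (x - \min)/(\max - \min)$ to each feature of a small, made-up array. The array values are hypothetical. The computation is intended to mirror what `MinMaxScaler` does with its default feature range of $[0, 1]$.

```python
import numpy as np

# Hypothetical data with two features on very different scales.
X = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]])

# Rescale each column to the range [0, 1].
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)

# After rescaling, two points in 2D are at most sqrt(2) apart.
max_dist = np.linalg.norm(X_scaled[:, None, :] - X_scaled[None, :, :], axis=-1).max()
print(max_dist <= np.sqrt(2))
```

Since DBSCAN runs on the rescaled features, `eps` is measured in these normalized units rather than in the units of the original scatter plot.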

To generate the clustering solution, we can again use the `fit_predict` method as follows:

In [None]:
feature1, feature2 = df_spherical.columns[0:2]

cluster_labels = dbscan_minmax_normalized.fit_predict(
    df_spherical[[feature1, feature2]]
)

plt.figure(figsize=(10, 5))
_ = plt.subplot(121, title="Cluster assignment", xlabel=feature1, ylabel=feature2)
plt.scatter(df_spherical[feature1], df_spherical[feature2], c=cluster_labels)
plt.subplot(122, title="Ground truth", xlabel=feature1, sharey=_)
plt.scatter(df_spherical[feature1], df_spherical[feature2], c=df_spherical["target"])
plt.show()
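
Besides plotting, the cluster labels can be summarized numerically. The following self-contained sketch rebuilds the same data and pipeline and counts the clusters and noise points. It relies on the `sklearn` convention that DBSCAN assigns the label `-1` to noise points and exposes the indices of core points in `core_sample_indices_`. The variable names are chosen here for illustration only.

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.cluster import DBSCAN
from sklearn.pipeline import make_pipeline

# Same data and parameters as in the text above.
X, _ = datasets.make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
pipeline = make_pipeline(preprocessing.MinMaxScaler(), DBSCAN(eps=0.3, min_samples=3))
labels = pipeline.fit_predict(X)

# Noise points carry the label -1; all other labels identify clusters.
n_noise = int(np.sum(labels == -1))
cluster_ids, cluster_sizes = np.unique(labels[labels != -1], return_counts=True)

# The fitted DBSCAN step is the last step of the pipeline.
core_idx = pipeline[-1].core_sample_indices_

print("clusters:", len(cluster_ids), "sizes:", cluster_sizes.tolist())
print("noise points:", n_noise, "core points:", len(core_idx))
```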

**Exercise** The clustering solution above is incorrect because the points at the top (`feature2 > 2`) have the same cluster label but may belong to two different classes. Explain how we should change the parameters, `eps = 0.3` and `min_samples = 3`, to improve the solution. 

```{caution}
The pairwise distance of points in different clusters appears larger than 1.
```

YOUR ANSWER HERE

**Exercise** Complete the following code to apply DBSCAN to the different datasets with different choices of parameters.

In [None]:
@interact(
    cluster_shape=["spherical", "non-spherical"],
    eps=widgets.FloatSlider(
        value=0.3, min=0.01, max=1, step=0.01, continuous_update=False
    ),
    min_samples=widgets.IntSlider(value=3, min=1, max=10, continuous_update=False),
)
def cluster_regions_dbscan(cluster_shape, eps, min_samples):
    df = {"spherical": df_spherical, "non-spherical": df_nonspherical}[cluster_shape]
    feature1, feature2 = df.columns[0:2]
    # your python code here
    # end of python code
    
    plt.figure(figsize=(10, 5))
    _ = plt.subplot(121, title="Cluster assignment", xlabel=feature1, ylabel=feature2)
    plt.scatter(df[feature1], df[feature2], c=cluster_labels)
    plt.subplot(122, title="Ground truth", xlabel=feature1, sharey=_)
    plt.scatter(df[feature1], df[feature2], c=df["target"])
    plt.show()

**Exercise** Is it possible to tune `eps` to cluster the generated datasets correctly with `min_samples = 1`?

```{note}
DBSCAN reduces to the single-linkage algorithm when `min_samples` is 1.
```

YOUR ANSWER HERE

## OPTICS with Weka

For DBSCAN, the parameters $\varepsilon$ and $\operatorname{MinPts}$ must be chosen properly. One needs to know how dense is dense enough to grow clusters, but this can be difficult, especially for high-dimensional data. A simpler alternative is to use OPTICS:

````{admonition} Definition 

[OPTICS (Ordering points to identify the clustering structure)](https://en.wikipedia.org/wiki/OPTICS_algorithm) starts at an arbitrary point and visits other points based on a priority queue that prioritizes a point $\M{q}$ with smaller

$$
\begin{align}
\operatorname{reachability-distance}(\M{q}) &:= \max \Set{\operatorname{dist}(\M{p}, \M{q}), \operatorname{core-distance}(\M{p})} \quad \text{where}\\
\operatorname{core-distance}(\M{p}) &:= \min\Set{0\leq \varepsilon' \leq \varepsilon| |D\cap N_{\varepsilon'}(\M{p})|\geq \operatorname{MinPts}} 
\end{align}
$$ (optics)

and $\M{p}$ is a core point that is visited before $\M{q}$ and that yields the smallest reachability distance, but with the following exceptions: 

- If a point $\M{p}$ is not a core point, its core distance is undefined.
- If no such point $\M{p}$ exists for $\M{q}$, the reachability distance of $\M{q}$ is undefined. 

````
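
The two distances in the definition can be sketched directly in code. The helpers `core_distance` and `reachability_distance` below are hypothetical, not Weka or `sklearn` functions. They represent an undefined distance by `None`, which is a choice made only for this illustration, and they count a point as a member of its own neighborhood.

```python
import numpy as np


def core_distance(dist_row, eps, min_pts):
    """Smallest radius <= eps that captures min_pts points, or None if undefined.

    dist_row holds the distances from a point p to every point in the
    dataset, including the zero distance from p to itself.
    """
    if len(dist_row) < min_pts:
        return None
    # The min_pts-th smallest distance is the smallest radius that works.
    kth = float(np.sort(dist_row)[min_pts - 1])
    return kth if kth <= eps else None


def reachability_distance(dist_pq, core_dist_p):
    """Return max{dist(p, q), core-distance(p)}, or None if p is not a core point."""
    if core_dist_p is None:
        return None
    return max(float(dist_pq), core_dist_p)


# Hypothetical 1D dataset: three nearby points and one far away.
x = np.array([0.0, 1.0, 2.0, 10.0])
dist = np.abs(x[:, None] - x[None, :])

cd_first = core_distance(dist[0], eps=5.0, min_pts=3)
cd_last = core_distance(dist[3], eps=5.0, min_pts=3)
reach = reachability_distance(dist[0, 1], cd_first)
print(cd_first, cd_last, reach)
```

In this toy example, the point at `1.0` lies closer to the point at `0.0` than the core distance of the latter, so its reachability distance is raised to that core distance.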

We will use the package `optics_dbScan` in Weka for the density-based clustering algorithms. The package can be installed from Weka GUI -> `tools` -> `Package manager`.

Open the explorer interface and load the `iris.arff` dataset (not `iris.2D.arff`). Under the `Cluster` panel:
    
1. Choose `OPTICS` as the `Clusterer`.
1. Choose `Use training set` as the `Cluster mode`.
1. Ignore the `class` attribute using the `Ignore attributes` button.
1. Click `Start`.

The OPTICS Visualizer will appear. The `Table` tab shows the list of data points in the order visited by the algorithm:

![OPTICS](images/optics.png)

**Exercise** The reachability distance is always undefined for the first point visited. Why?

YOUR ANSWER HERE

The `Graph` tab shows the stem plots of core and reachability distances. We need to increase the `Vertical adjustment` in the `General Settings` panel to see the variations more clearly:

![](images/reachability1.png)

**Exercise** Note that, by definition, $\operatorname{reachability-distance}(\M{q}) \geq \operatorname{core-distance}(\M{p})$. Does this condition hold for the last two visited points $\M{p}$ and $\M{q}$? Why or why not?

YOUR ANSWER HERE

Change the `General Settings` to give the reachability plot below:

![](images/reachability2.png)

The above stem plot is called the reachability plot. To obtain clusters from the plot,

1. specify a threshold to clip the reachability distance from above, and
1. identify a cluster as a "valley" of consecutively visited points with reachability distances below the threshold, except that the first point of the valley should have a reachability distance above the threshold.
1. All other points not assigned to a cluster are regarded as noise.
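
The extraction rule above can be sketched as a single pass over the reachability distances in visiting order. The function `extract_clusters` and the reachability values are made up for illustration and do not come from Weka. The sketch assumes that an undefined reachability distance is represented by infinity, that the first visited point has an undefined reachability distance, and that "below the threshold" includes equality.

```python
import math


def extract_clusters(reachability, threshold):
    """Label points, given in visiting order, by cutting the plot at threshold.

    Returns one label per point: 0, 1, ... for clusters and -1 for noise.
    """
    labels = []
    cluster = -1
    for i, r in enumerate(reachability):
        if r > threshold:
            # A point above the threshold starts a new valley only if the
            # next visited point falls below the threshold; else it is noise.
            nxt = reachability[i + 1] if i + 1 < len(reachability) else math.inf
            if nxt <= threshold:
                cluster += 1
                labels.append(cluster)
            else:
                labels.append(-1)
        else:
            # Points below the threshold continue the current valley.
            labels.append(cluster)
    return labels


# Hypothetical reachability distances in visiting order.
reach_toy = [math.inf, 0.2, 0.3, 0.9, 0.25, 0.2, 1.5, 1.2]
labels_toy = extract_clusters(reach_toy, threshold=0.5)
print(labels_toy)  # [0, 0, 0, 1, 1, 1, -1, -1]
```

Raising or lowering `threshold` merges or splits the valleys, which is how one reachability plot can yield clusterings at different density levels.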

**Exercise** Assign to `eps_cl` a threshold value that results in $2$ clusters and no noise points.

```{hint}
You can see the reachability distance of a stem in the reachability plot by hovering the mouse over the stem.
```

In [None]:
# your python code here
# end of python code

eps_cl

In [None]:
# hidden tests

**Exercise** Assign to `eps_cl` a threshold value that results in $3$ clusters. In particular, choose the threshold value that leads to as few noise points as possible.

In [None]:
# your python code here
# end of python code

eps_cl

In [None]:
# hidden tests

**Exercise** To evaluate the 3 clusters obtained from a particular threshold using an extrinsic measure, run DBSCAN with 
- the parameter `epsilon` set to the threshold you obtained and
- cluster mode set to `Classes to clusters evaluation`.

Assign to `error_rate` the fraction of incorrectly classified instances and to `miss_rate` the fraction of instances not assigned to a cluster.

In [None]:
# your python code here
# end of python code

error_rate, miss_rate

In [None]:
# hidden tests

```{note}
Points that DBSCAN labels as noise may in fact form clusters at a different level of density. The reachability plot of OPTICS can help identify the thresholds for clusters with different densities.
```
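
For readers who prefer to stay in Python, `sklearn` also provides an OPTICS implementation that exposes the visiting order and the reachability distances. The sketch below is an optional cross-check, not a replacement for the Weka steps. Its parameters (`min_samples=5`, unnormalized features) are assumptions and need not match the Weka defaults, so the distances should not be expected to agree with the Weka output.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import OPTICS

# Iris features without the class attribute.
X_iris = datasets.load_iris().data

# min_samples=5 is an assumed setting; it plays the role of MinPts.
optics = OPTICS(min_samples=5).fit(X_iris)

# ordering_ lists the point indices in the order visited;
# reachability_ is indexed by point, so reorder it for the plot.
reach_in_order = optics.reachability_[optics.ordering_]

# sklearn stores an undefined reachability distance as infinity.
print(np.isinf(reach_in_order[0]))
print(reach_in_order[1:6])
```

Plotting `reach_in_order` as a bar or stem plot gives the `sklearn` counterpart of the reachability plot.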