# Unsupervised learning - Clustering
## DBSCAN

Follow:
- _Introduction to Machine Learning with Python_ [Chapter 3](https://github.com/amueller/introduction_to_ml_with_python/blob/master/03-unsupervised-learning.ipynb) **Section 3.5.3 DBSCAN** (pp. 189-193)



>The idea behind DBSCAN is that clusters form dense regions of data, separated by regions that are relatively empty

DBSCAN assigns points to clusters automatically, so there is no need to choose the number of clusters in advance.



Two parameters: `eps` and `min_samples`

>Points that are within a dense region are called core samples (or core points), and they are defined as follows. There are two parameters in DBSCAN: min_samples and eps. If there are at least min_samples many data points within a distance of eps to a given data point, that data point is classified as a core sample. Core samples that are closer to each other than the distance eps are put into the same cluster by DBSCAN
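
The definition above can be checked by hand. The sketch below counts, for each point, how many points lie within `eps` (scikit-learn includes the point itself in that count) and compares the result with the `core_sample_indices_` attribute of a fitted `DBSCAN`. The toy data and the values `eps=1.5` and `min_samples=3` are illustrative assumptions, not values from the text.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(random_state=0, n_samples=12)
eps, min_samples = 1.5, 3  # illustrative values, chosen only for this sketch

# Euclidean distance between every pair of points (12 x 12 matrix)
diffs = X_demo[:, None, :] - X_demo[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Number of points within eps of each point; the point itself is included
neighbor_counts = (distances <= eps).sum(axis=1)
manual_core = np.flatnonzero(neighbor_counts >= min_samples)

fitted = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)
print("Core samples (by hand):", manual_core)
print("Core samples (sklearn):", fitted.core_sample_indices_)
```

If the two index arrays agree, the hand-written rule matches what DBSCAN treats as core samples for these parameters.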

Points that do not belong to any cluster are treated as *noise*; scikit-learn gives them the label `-1`.
>If there are less than min_samples points within distance eps of the starting point, this point is labeled as noise, meaning that it doesn’t belong to any cluster
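
Because noise is reported as the label `-1`, it has to be separated from the real clusters when counting them. A minimal sketch, assuming a small toy dataset and illustrative parameter values (`eps=1.5`, `min_samples=3`):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(random_state=0, n_samples=12)
# Parameter values are illustrative, not tuned
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X_demo)

noise_mask = labels == -1                    # scikit-learn marks noise with -1
n_noise = int(noise_mask.sum())
n_clusters = len(set(labels[~noise_mask]))   # real clusters only, noise excluded

print("Noise points: {}".format(n_noise))
print("Clusters (noise excluded): {}".format(n_clusters))
```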


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
X, y = make_blobs(random_state=0, n_samples=12)

dbscan = DBSCAN()
clusters = dbscan.fit_predict(X)
print("Cluster memberships:\n{}".format(clusters))

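With the default parameters (`eps=0.5`, `min_samples=5`), every point in this small dataset is labeled as noise (`-1`), so the parameters **need** to be tuned to the data.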
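
One way to see the effect of the two parameters without any plotting helper is to refit on the same 12 points for a few combinations and print the labels. The grid values below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(random_state=0, n_samples=12)

results = {}
for min_samples in [2, 3, 5]:        # illustrative grid
    for eps in [1, 1.5, 2, 3]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
        results[(min_samples, eps)] = labels
        print("min_samples={} eps={} labels={}".format(min_samples, eps, labels))
```

For a fixed `min_samples`, increasing `eps` can only turn noise points into cluster members, never the reverse, so the number of `-1` labels should shrink or stay the same along each row of output.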

In [None]:
import mglearn
mglearn.plots.plot_dbscan()

In [None]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

dbscan = DBSCAN()
clusters = dbscan.fit_predict(X_scaled)
# plot the cluster assignments
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm2, s=60)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1");
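
Since `make_moons` also returns the true half-moon membership, the plot can be complemented with a numeric sanity check. The sketch below counts the clusters found with the default parameters and compares them with the ground truth using the adjusted Rand index; using ARI here is an assumption of this sketch, one possible check among several.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)
X_moons_scaled = StandardScaler().fit_transform(X_moons)

# Default parameters, as in the cell above
moon_labels = DBSCAN().fit_predict(X_moons_scaled)

n_found = len(set(moon_labels) - {-1})            # clusters, noise excluded
ari = adjusted_rand_score(y_moons, moon_labels)   # 1.0 means a perfect match

print("Clusters found: {}".format(n_found))
print("Adjusted Rand index: {:.2f}".format(ari))
```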

## DBSCAN on the Wine dataset

In [None]:
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True, as_frame=True)
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

In [None]:
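for eps in [0.5, 1, 1.5, 2, 2.5, 3]:
    for min_samples in range(2, 11):
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X)
        # np.unique also counts the noise label -1 (if present), so this is
        # the number of distinct labels, not strictly the number of clusters
        num = len(np.unique(labels))
        if 1 < num < 7:
            print("\neps={} min_samples={}".format(eps, min_samples))
            print("Number of labels (clusters + noise): {}".format(num))
            # labels + 1 shifts noise (-1) to index 0, so the first count is noise
            print("Sizes (noise first): {}".format(np.bincount(labels + 1)))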
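
The printed output of a grid like this gets long quickly. As a sketch of an alternative, the same search can be collected into a DataFrame, with noise counted separately from the clusters. The column names and the variable `summary` are hypothetical choices for this example.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_wine = StandardScaler().fit_transform(load_wine().data)

rows = []
for eps in [0.5, 1, 1.5, 2, 2.5, 3]:
    for min_samples in range(2, 11):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_wine)
        rows.append({
            "eps": eps,
            "min_samples": min_samples,
            "n_clusters": len(set(labels) - {-1}),   # noise excluded
            "n_noise": int((labels == -1).sum()),
        })

summary = pd.DataFrame(rows)
# Show only the settings that produce at least two real clusters
print(summary[summary["n_clusters"] >= 2])
```

Sorting or filtering `summary` makes it easier to pick a setting that balances the number of clusters against the number of points discarded as noise.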

In [None]:
dbscan = DBSCAN(eps=2, min_samples=8)
X['clusters'] = dbscan.fit_predict(X)
X['clusters'] = X['clusters'].astype('category')

In [None]:
from sklearn.decomposition import PCA

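pca = PCA(n_components=2)
# Project only the scaled features; the 'clusters' label column must not enter the PCA
features = X.drop(columns='clusters')
X_2D = pca.fit_transform(features)
sns.scatterplot(x=X_2D[:, 0], y=X_2D[:, 1], hue=X['clusters'])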