# Practical session: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) 

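DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples in regions of high density and expands clusters from them. It works well when the clusters have similar density, it can recover clusters of arbitrary shape, and it labels points in low-density regions as noise instead of forcing them into a cluster.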

<img width="300" alt="image" src="https://dashee87.github.io/images/DBSCAN_tutorial.gif">
<img width="283" alt="image" src="https://media.geeksforgeeks.org/wp-content/uploads/20190418023034/781ff66c-b380-4a78-af25-80507ed6ff26.jpeg">

**Parameters**

- `eps` (Epsilon): The maximum distance between two points for one to be considered a neighbor of the other.
It sets the radius of the neighborhood around each point: if the distance between two points is less than or equal to `eps`, each lies in the other's neighborhood. Choosing `eps` well is crucial. If it is too small, DBSCAN may label most points as noise; if it is too large, distinct clusters may be merged into one.

- `min_samples`: The minimum number of points that must lie within `eps` of a point for that point to count as part of a dense region.
It sets the density threshold rather than a strict minimum cluster size. A point is classified as a core point if at least `min_samples` points (including itself) lie within a distance of `eps`. A point with fewer than `min_samples` neighbors within `eps` is either a border point (if it lies in the neighborhood of a core point) or noise.
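One common heuristic for choosing `eps` is the k-distance curve: compute each point's distance to its k-th nearest neighbor, sort the values, and look for the "elbow" where the curve bends sharply upward. The sketch below is only an illustration under stated assumptions: it runs on synthetic data (`X_demo`), the helper `k_distances` is a hypothetical name, and the choice `k = min_samples - 1` reflects the fact that scikit-learn counts the point itself in `min_samples`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor (self excluded)."""
    # Ask for k + 1 neighbors: when querying the training data, each point
    # is returned as its own nearest neighbor at distance 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


# Synthetic data (an assumption for illustration): two dense blobs plus sparse background points.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(100, 2)),
    rng.uniform(low=-2.0, high=7.0, size=(10, 2)),
])

min_samples_demo = 5
kd = k_distances(X_demo, k=min_samples_demo - 1)

# To inspect the curve visually:
# import matplotlib.pyplot as plt
# plt.plot(kd); plt.xlabel("points sorted by k-distance"); plt.ylabel("k-distance")
```

Points in dense regions have small k-distances, while isolated points sit at the steep end of the curve, so a value of `eps` near the bend is a reasonable starting point to refine by trial.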

**Definitions**

- **Core Point**: A point that has at least `min_samples` points (including itself) within its `eps` neighborhood.
- **Border Point**: A point that has fewer than min_samples points within eps but is within the neighborhood of a core point.
- **Noise Point**: A point that does not belong to any cluster (it is neither a core nor a border point).
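The three definitions above can be checked on a tiny example. The following sketch uses a hypothetical one-dimensional dataset (`X_toy`) and illustrative parameter values; it relies on two attributes of scikit-learn's fitted `DBSCAN` object, `core_sample_indices_` and `labels_`, where the label `-1` marks noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (an assumption for illustration): four points spaced 1 apart, plus one far away.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])

model = DBSCAN(eps=1.1, min_samples=3).fit(X_toy)
toy_labels = model.labels_

# Core points are listed explicitly by the fitted model.
is_core = np.zeros(len(X_toy), dtype=bool)
is_core[model.core_sample_indices_] = True

# Noise points get the label -1; border points are clustered but not core.
is_noise = toy_labels == -1
is_border = ~is_core & ~is_noise

kinds = np.where(is_core, "core", np.where(is_noise, "noise", "border"))
for x, kind, label in zip(X_toy[:, 0], kinds, toy_labels):
    print(f"x={x:4.1f}  label={label:2d}  {kind}")
```

Here the points at 1 and 2 each have three points within `eps` (themselves and both neighbors), so they are core points. The points at 0 and 3 have only two, but each lies next to a core point, so they are border points. The point at 10 has no neighbors and is noise.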

## Overview

## Dataset

We will use the `Wholesale customers data` dataset, which records each customer's annual spending in monetary units (m.u.) on several product categories. It contains 440 customers, each described by 8 attributes.

- Dataset info: [../data/Wholesale-customers/](../data/Wholesale-customers/README.md)

## Data loading


In [1]:
## Importing Libraries 

# Base libraries
import numpy as np
import pandas as pd
import os
import datetime
import warnings
warnings.filterwarnings('ignore')

# Visualisation 
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Models
from sklearn.cluster import DBSCAN

## Default options and global variables
# Display floats with two decimal places
pd.set_option("display.float_format", lambda x: "%.2f" % x)
pd.set_option('display.precision', 2)
SEED = 2024


In [None]:
# Set the data path 
DATA_PATH="../data/Wholesale-customers/"

# Data loading
df = pd.read_csv(DATA_PATH + "Wholesale_customers.csv")

# Data dimension 
print("Dataset:",df.shape[0],"rows,",  df.shape[1], "columns")

## Quick exploration

In [None]:
# Quick exploration 
df.head()

In [None]:
df.describe()

In [None]:
plt.scatter(df.Fresh, df.Milk)
plt.xlabel('Fresh (m.u.)')
plt.ylabel('Milk (m.u.)')

## Unsupervised classification

### Data preparation 



In [7]:
# Drop categorical variables 
X = df.drop(columns=['Channel', 'Region'])

In [None]:
X.head()

In [None]:
# No hue is set here, so a palette has no effect and is omitted
sns.pairplot(X)

### Clustering

In [86]:
# Exercise: replace each ? with a value of your choice before running this cell
epsilon = ?
min_samples = ?
dbscan = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X)
labels = dbscan.labels_

In [None]:
unique_labels, data_counts = np.unique(labels, return_counts=True)
sns.barplot(x=unique_labels, y=data_counts)
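When reading the bar plot, keep in mind that scikit-learn assigns the label `-1` to noise points, so that bar is not a cluster. A minimal sketch of how the labels could be summarized is shown below; the helper `summarize_labels` and the array `example_labels` are hypothetical and used only for illustration.

```python
import numpy as np


def summarize_labels(labels):
    """Count clusters and noise points in a DBSCAN label array (-1 means noise)."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    # Every distinct label other than -1 is one cluster.
    n_clusters = len(set(labels.tolist()) - {-1})
    return {
        "n_clusters": n_clusters,
        "n_noise": n_noise,
        "noise_fraction": n_noise / labels.size,
    }


# Hypothetical labels, standing in for dbscan.labels_
example_labels = [0, 0, 1, -1, 1, 0, -1, 2]
summary = summarize_labels(example_labels)
print(summary)
```

A very high noise fraction usually suggests that `eps` is too small or `min_samples` too large for the data, while a single large cluster suggests the opposite.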

In [None]:
X_labeled = X.copy()
X_labeled['label'] = dbscan.labels_
sns.pairplot(X_labeled, hue='label',palette='dark')

## Revision
...