# DBSCAN Project  

## The Data


Source: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers

Margarida G. M. S. Cardoso, margarida.cardoso '@' iscte.pt, ISCTE-IUL, Lisbon, Portugal




Attribute Information:

    1) FRESH: annual spending (m.u.) on fresh products (Continuous);
    2) MILK: annual spending (m.u.) on milk products (Continuous);
    3) GROCERY: annual spending (m.u.) on grocery products (Continuous);
    4) FROZEN: annual spending (m.u.) on frozen products (Continuous);
    5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
    6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
    7) CHANNEL: customer channel - Horeca (Hotel/Restaurant/Café) or Retail channel (Nominal);
    8) REGION: customer region - Lisbon, Oporto or Other (Nominal).
 

Relevant Papers:

Cardoso, Margarida G.M.S. (2013). Logical discriminant models – Chapter 8 in Quantitative Modeling in Marketing and Management. Edited by Luiz Moutinho and Kun-Huang Huarng. World Scientific. p. 223-253. ISBN 978-9814407717

Jean-Patrick Baudry, Margarida Cardoso, Gilles Celeux, Maria José Amorim, Ana Sousa Ferreira (2012). Enhancing the selection of a model-based clustering with external qualitative variables. Research Report N° 8124, October 2012, Project-Team SELECT. INRIA Saclay - Île-de-France, Projet select, Université Paris-Sud 11



-----

**COMPLETE THE REQUIRED TASKS:**


## EDA

**TASK: Create a scatterplot showing the relation between MILK and GROCERY spending, colored by Channel column.**

**TASK: Use seaborn to create a histogram of MILK spending, colored by Channel. Can you figure out how to use seaborn to "stack" the channels, instead of having them overlap?**

**TASK: Create an annotated clustermap of the correlations between spending on different categories.**

**TASK: Create a PairPlot of the dataframe, colored by Region.**

In [None]:
from io import StringIO

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a DataFrame (the rows marked "..." are elided here)
data = """Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
2,3,12669,9656,7561,214,2674,1338
2,3,7057,9810,9568,1762,3293,1776
...
...
"""

# pd.compat.StringIO is not available in current pandas; use io.StringIO instead
df = pd.read_csv(StringIO(data), sep=",")

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Milk', y='Grocery', hue='Channel', data=df)
plt.title('Milk vs Grocery Spending (Colored by Channel)')
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='Milk', hue='Channel', multiple='stack')
plt.title('Histogram of Milk Spending (Stacked by Channel)')
plt.show()

In [None]:
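# Correlations between the spending categories only
corr = df.drop(['Channel', 'Region'], axis=1).corr()

# clustermap is a figure-level function: it creates its own figure,
# so the size is passed in directly and the title is set on that figure
g = sns.clustermap(corr, annot=True, cmap='coolwarm', figsize=(10, 8))
g.fig.suptitle('Correlations Between Spending Categories', y=1.02)
plt.show()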

In [None]:
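# pairplot is a figure-level function and creates its own figure,
# so a preceding plt.figure() call would only leave an empty figure behind
sns.pairplot(df, hue='Region')
plt.show()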

## DBSCAN

**TASK: Since the values of the features are in different orders of magnitude, let's scale the data. Use StandardScaler to scale the data.**

**TASK: Use DBSCAN and a for loop to create a variety of models testing different epsilon values. Set min_samples equal to 2 times the number of features. During the loop, keep track of and log the percentage of points that are outliers. For reference, the solutions notebook uses the following range of epsilon values for testing:**

    np.linspace(0.001,3,50)

**TASK: Create a line plot of the percentage of outlier points versus the epsilon value choice.**
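
As a rough illustration of why scaling matters before a distance-based method such as DBSCAN, the sketch below standardizes two columns by hand using only the standard library. The `standardize` helper and the spending figures are hypothetical and not taken from the dataset; the population standard deviation is used because that is the convention StandardScaler follows.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), as StandardScaler uses
    return [(v - mean) / std for v in values]


# Hypothetical spending figures on very different scales (not from the dataset).
fresh = [12000.0, 7000.0, 6500.0, 13500.0]
frozen = [200.0, 1800.0, 2400.0, 6400.0]

for name, column in (("fresh", fresh), ("frozen", frozen)):
    scaled = standardize(column)
    print(name, [round(v, 2) for v in scaled])
```

After this step both columns have mean 0 and standard deviation 1, so neither dominates the Euclidean distances that DBSCAN compares against epsilon.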

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

# Scale the data
scaler = StandardScaler()
X = scaler.fit_transform(df.drop(['Channel', 'Region'], axis=1))

# Set the number of features
n_features = X.shape[1]

# Initialize lists to store outlier percentages and epsilon values
outlier_percentages = []
epsilon_values = np.linspace(0.001, 3, 50)

# Loop over different epsilon values
for eps in epsilon_values:
    # Create a DBSCAN model
    dbscan = DBSCAN(eps=eps, min_samples=2 * n_features)
    
    # Fit the model
    clusters = dbscan.fit_predict(X)
    
    # Calculate the number of outliers
    n_outliers = (clusters == -1).sum()
    
    # Calculate the percentage of outliers
    outlier_percentage = n_outliers / len(X) * 100
    outlier_percentages.append(outlier_percentage)

# Plot the outlier percentage vs epsilon values
plt.figure(figsize=(10, 6))
plt.plot(epsilon_values, outlier_percentages)
plt.xlabel('Epsilon Value')
plt.ylabel('Percentage of Outliers')
plt.title('Outlier Percentage vs Epsilon Value')
plt.show()
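
To make the shape of the outlier curve easier to interpret, here is a dependency-free sketch of how the noise label arises. It is not scikit-learn's implementation: `noise_fraction` is a hypothetical helper, the toy points are invented, and it assumes the usual DBSCAN definitions (a core point has at least `min_samples` points within `eps`, counting itself; a point is noise when it is neither core nor within `eps` of a core point).

```python
import math


def noise_fraction(points, eps, min_samples):
    """Fraction of points that DBSCAN's definitions would mark as noise."""
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbours) if len(nbrs) >= min_samples}
    # Noise: not a core point, and no core point lies within eps.
    noise = [
        i for i, nbrs in enumerate(neighbours)
        if i not in core and not any(j in core for j in nbrs)
    ]
    return len(noise) / len(points)


# Toy data: a tight group of five points plus two isolated points.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (3.0, 3.0), (-4.0, 2.0),
]

for eps in (0.01, 0.2, 10.0):
    print(f"eps={eps}: {noise_fraction(points, eps, min_samples=4):.0%} noise")
```

With a very small epsilon every point is noise, and with a very large one none are. A reasonable epsilon is usually sought where the curve flattens between those extremes.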

## DBSCAN with Chosen Epsilon

**TASK: Based on the plot created in the previous task, retrain a DBSCAN model with a reasonable epsilon value. Note: For reference, the solutions use eps=2.**

**TASK: Create a scatterplot of Milk vs Grocery, colored by the discovered labels of the DBSCAN model.**

**TASK: Create a scatterplot of Milk vs. Detergents Paper colored by the labels.**

**TASK: Create a new column on the original dataframe called "Labels" consisting of the DBSCAN labels.**

**TASK: Compare the statistical mean of the clusters and outliers for the spending amounts on the categories.**

**TASK: Normalize the dataframe from the previous task using MinMaxScaler so the spending means go from 0-1 and create a heatmap of the values.**

**TASK: Create another heatmap similar to the one above, but with the outliers removed.**

**TASK: In which spending category were the two clusters most different?**
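
The normalization task rescales the per-label means, not the raw rows, so that each category runs from 0 to 1 across the labels. A minimal standard-library sketch of that idea follows; the `min_max_scale` helper and the numbers are made up for illustration and are not results from the dataset.

```python
def min_max_scale(values):
    """Map values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical mean spending per DBSCAN label (-1 is the noise group).
# Invented numbers for illustration only.
cluster_means = {
    "Milk": {-1: 15000.0, 0: 8000.0, 1: 3000.0},
    "Frozen": {-1: 9000.0, 0: 1500.0, 1: 3000.0},
}

scaled_means = {}
for category, means in cluster_means.items():
    labels = list(means)
    scaled = min_max_scale([means[label] for label in labels])
    scaled_means[category] = dict(zip(labels, scaled))

print(scaled_means)
```

Each category is scaled independently, which mirrors how MinMaxScaler treats each column. It also shows why a noise group with extreme means can compress the differences between the real clusters in the first heatmap.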

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Retrain DBSCAN model with chosen epsilon value
dbscan = DBSCAN(eps=2, min_samples=2 * n_features)
labels = dbscan.fit_predict(X)

# Create a scatterplot of Milk vs Grocery, colored by DBSCAN labels
plt.figure(figsize=(8, 6))
plt.scatter(df['Milk'], df['Grocery'], c=labels)
plt.xlabel('Milk')
plt.ylabel('Grocery')
plt.title('Milk vs Grocery Spending (Colored by DBSCAN Labels)')
plt.show()

# Create a scatterplot of Milk vs Detergents_Paper, colored by DBSCAN labels
plt.figure(figsize=(8, 6))
plt.scatter(df['Milk'], df['Detergents_Paper'], c=labels)
plt.xlabel('Milk')
plt.ylabel('Detergents_Paper')
plt.title('Milk vs Detergents_Paper Spending (Colored by DBSCAN Labels)')
plt.show()

# Add DBSCAN labels to the original dataframe
df['Labels'] = labels

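# Compare the mean spending per category for each cluster and for the outliers (label -1).
# Channel and Region are nominal codes, so they are left out of the averages.
cluster_means = df.drop(['Channel', 'Region'], axis=1).groupby('Labels').mean()
print('Cluster Means:\n', cluster_means)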

# Normalize the cluster means with MinMaxScaler so each category's means span 0-1
spending_means = cluster_means.drop(columns=['Channel', 'Region'], errors='ignore')
norm_scaler = MinMaxScaler()
norm_df = pd.DataFrame(
    norm_scaler.fit_transform(spending_means),
    index=spending_means.index,
    columns=spending_means.columns,
)

# Create a heatmap of normalized values
plt.figure(figsize=(10, 8))
sns.heatmap(norm_df, annot=True, cmap='coolwarm')
plt.title('Normalized Spending Means by Cluster')
plt.show()

# Create a heatmap without outliers (DBSCAN labels noise points as -1)
inlier_df = norm_df.drop(index=-1, errors='ignore')
plt.figure(figsize=(10, 8))
sns.heatmap(inlier_df, annot=True, cmap='coolwarm')
plt.title('Normalized Spending Means by Cluster (Outliers Removed)')
plt.show()

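# Identify the category in which the clusters differ most.
# The outlier group (label -1) and the nominal columns are excluded first.
inlier_means = cluster_means.drop(index=-1, errors='ignore').drop(columns=['Channel', 'Region'], errors='ignore')
# Use the spread relative to the category mean so large-valued categories do not dominate
category_differences = (inlier_means.max() - inlier_means.min()) / inlier_means.mean()
most_different_category = category_differences.idxmax()
print(f'The spending category with the most significant difference between clusters is: {most_different_category}')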