# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [None]:
# Import your libraries:

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings                                              
from sklearn.exceptions import DataConversionWarning          
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

# Challenge 1 - Import and Describe the Dataset

In this lab, we will use a dataset containing information about customer preferences. We will look at how much each customer spends in a year on each subcategory in the grocery store and try to find similarities using clustering.

The origin of the dataset is [here](https://archive.ics.uci.edu/ml/datasets/wholesale+customers).

In [None]:
# loading the data: Wholesale customers data
import pandas as pd

# Load the CSV into a pandas DataFrame.
# Forward slashes avoid the invalid escape sequences '\d' and '\W' and work on every OS.
file_path = '../data/Wholesale customers data.csv'
df = pd.read_csv(file_path)

# Display first few rows of the DataFrame
print(df.head())

# Display basic information about the DataFrame
# (df.info() prints directly and returns None, so it is not wrapped in print)
df.info()

# Display summary statistics
print(df.describe())



#### Explore the dataset with mathematical and visualization techniques. What do you find?

Checklist:

* What does each column mean?
* Any categorical data to convert?
* Any missing data to remove?
* Column collinearity - any high correlations?
* Descriptive statistics - any outliers to remove?
* Column-wise data distribution - is the distribution skewed?
* Etc.

Additional info: Over a century ago, an Italian economist named Vilfredo Pareto discovered that roughly 20% of the customers account for 80% of the typical retail sales. This is called the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle). Check if this dataset displays this characteristic.
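
The Pareto check described above can be sketched as follows. This is a minimal illustration, not the lab's reference solution: the helper `top_share` and the toy data are hypothetical, and it assumes that "sales" means the sum of all spending columns per customer.

```python
import numpy as np
import pandas as pd


def top_share(totals: pd.Series, top_fraction: float = 0.2) -> float:
    """Return the share of overall spending contributed by the top customers."""
    ranked = totals.sort_values(ascending=False)
    # Number of customers in the top group (at least one)
    n_top = max(1, int(np.ceil(len(ranked) * top_fraction)))
    return ranked.iloc[:n_top].sum() / ranked.sum()


# Hypothetical toy data: 10 customers, two of whom spend far more than the rest
toy = pd.DataFrame({
    "Fresh": [900, 700, 20, 30, 10, 40, 25, 15, 35, 25],
    "Milk": [100, 100, 5, 10, 5, 10, 5, 5, 5, 5],
})
toy_totals = toy.sum(axis=1)  # total spending per customer
pareto_share = top_share(toy_totals)
print(f"Top 20% of customers account for {pareto_share:.1%} of spending")
```

On the lab data, the same helper could be called on the row-wise sum of the six spending columns (`Fresh` through `Delicassen`); a result near 0.8 would be consistent with the Pareto principle.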

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. Calculate Column Collinearity (Correlation Matrix)
correlation_matrix = df.corr()

# 2. Descriptive Statistics
descriptive_stats = df.describe(include='all')

# 3. Column-wise Data Distribution (Value Counts for Categorical, Histogram for Numeric)
distribution = {}
for column in df.columns:
    if df[column].dtype == 'object' or df[column].dtype.name == 'category':
        distribution[column] = df[column].value_counts()
    else:
        distribution[column] = df[column].describe()


# 1. Correlation Heatmap
plt.figure(figsize=(10, 6))
plt.matshow(correlation_matrix, fignum=1, cmap='coolwarm')  # reuse the matrix computed above
plt.colorbar()
plt.title('Correlation Heatmap', fontsize=14, pad=20)
plt.xticks(ticks=range(df.shape[1]), labels=df.columns, rotation=45, ha='left')
plt.yticks(ticks=range(df.shape[1]), labels=df.columns)
plt.show()

# 2. Histograms for Numerical Columns
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
for column in numeric_columns:
    plt.figure(figsize=(7, 4))
    plt.hist(df[column], bins=30, alpha=0.7, edgecolor='black')
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.show()

# 3. Bar Plots for Categorical Columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
for column in categorical_columns:
    plt.figure(figsize=(7, 4))
    df[column].value_counts().plot(kind='bar', alpha=0.7, edgecolor='black')
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.show()



In [None]:
skewness = df.skew()

# Convert the pandas series to a DataFrame
skewness_df = skewness.reset_index()
skewness_df.columns = ['Variable', 'Skewness']

# Determine skewness type based on value
def determine_skewness(value):
    if value < -1:
        return 'Highly negatively skewed'
    elif -1 <= value < -0.5:
        return 'Moderately negatively skewed'
    elif -0.5 <= value <= 0.5:
        return 'Approximately symmetric'
    elif 0.5 < value <= 1:
        return 'Moderately positively skewed'
    else:
        return 'Highly positively skewed'

# Add Skewness Type column
skewness_df['Skewness Type'] = skewness_df['Skewness'].apply(determine_skewness)

print("\n--- Skewness Analysis from Series ---")
print(skewness_df.to_string(index=False))


In [None]:
from scipy import stats

# Z-score method
z_scores = stats.zscore(df.select_dtypes(include=['int64', 'float64']))
outliers = (abs(z_scores) > 3).any(axis=1)

# Show outliers
print("\n--- Outliers Detected (Z-score > 3) ---")
print(df[outliers])

# Remove outliers
df_no_outliers = df[~outliers]
print("\n--- DataFrame After Removing Outliers ---")
print(df_no_outliers.shape)


* What does each column mean?
  * FRESH: annual spending (m.u.) on fresh products (continuous)
  * MILK: annual spending (m.u.) on milk products (continuous)
  * GROCERY: annual spending (m.u.) on grocery products (continuous)
  * FROZEN: annual spending (m.u.) on frozen products (continuous)
  * DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (continuous)
  * DELICATESSEN: annual spending (m.u.) on delicatessen products (continuous)
  * CHANNEL: customer channel, either Horeca (Hotel/Restaurant/Café) or Retail (nominal)
  * REGION: customer region, one of Lisbon, Oporto or Other (nominal)

* Any categorical data to convert?

`Channel` and `Region` are categorical (nominal) variables stored as integer codes.

* Any missing data to remove?

No missing data. 

* Column collinearity - any high correlations?

Grocery & Detergents_Paper: Very strong correlation. 
Milk & Grocery: Strong positive correlation.

* Descriptive statistics - any outliers to remove?

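Outliers detected (z-score > 3). The pasted output is truncated after the `Detergents_Paper` column:

| Row | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper |
|---|---|---|---|---|---|---|---|
| 23 | 2 | 3 | 26373 | 36423 | 22019 | 5154 | 4337 |
| 39 | 1 | 3 | 56159 | 555 | 902 | 10002 | 212 |
| 47 | 2 | 3 | 44466 | 54259 | 55571 | 7782 | 24171 |
| 56 | 2 | 3 | 4098 | 29892 | 26866 | 2616 | 17740 |
| 61 | 2 | 3 | 35942 | 38369 | 59598 | 3254 | 26701 |
| 65 | 2 | 3 | 85 | 20959 | 45828 | 36 | 24231 |
| 71 | 1 | 3 | 18291 | 1266 | 21042 | 5373 | 4173 |
| ... | | | | | | | |

The last line of the pasted output reads `413        2435`.
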
* Column-wise data distribution - is the distribution skewed?

Skewness analysis from `df.skew()`:

| Variable | Skewness | Skewness Type |
|---|---|---|
| Channel | 0.760951 | Moderately positively skewed |
| Region | -1.283627 | Highly negatively skewed |
| Fresh | 2.561323 | Highly positively skewed |
| Milk | 4.053755 | Highly positively skewed |
| Grocery | 3.587429 | Highly positively skewed |
| Frozen | 5.907986 | Highly positively skewed |
| Detergents_Paper | 3.631851 | Highly positively skewed |
| Delicassen | 11.151586 | Highly positively skewed |
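
The collinearity answer in the checklist can also be checked numerically rather than read off the heatmap. The sketch below is illustrative only: `strongest_pairs`, the 0.7 threshold, and the toy data are assumptions, not part of the lab.

```python
import numpy as np
import pandas as pd


def strongest_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs[pairs.abs() > threshold].sort_values(ascending=False)


# Hypothetical toy data: b tracks a closely, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
high = strongest_pairs(toy)
print(high)
```

Calling `strongest_pairs(df)` on the lab data would list any column pairs above the chosen threshold, which is one way to confirm the `Grocery` and `Detergents_Paper` observation.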

# Challenge 2 - Data Cleaning and Transformation

If your conclusion from the previous challenge is the data need cleaning/transformation, do it in the cells below. However, if your conclusion is the data need not be cleaned or transformed, feel free to skip this challenge. But if you do choose the latter, please provide rationale.

In [None]:
# Besides removing outliers, the data needs no further cleaning before scaling.

**Your comment here**

- Besides removing outliers, the data needs no further cleaning before scaling.

In [None]:
# Your import here:

from sklearn.preprocessing import StandardScaler

# Select the numeric columns
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numeric columns
customers_scale = df.copy()
customers_scale[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Display the first few rows of the scaled dataset
print("\n--- Scaled DataFrame (using StandardScaler) ---")
print(customers_scale.head())

# Show summary statistics of scaled data
print("\n--- Summary Statistics of Scaled Data ---")
print(customers_scale.describe())


# Challenge 4 - Data Clustering with K-Means

Now let's cluster the data with K-Means first. Initiate the K-Means model, then fit your scaled data. In the data returned from the `.fit` method, there is an attribute called `labels_` which is the cluster number assigned to each data record. What you can do is to assign these labels back to `customers` in a new column called `customers['labels']`. Then you'll see the cluster results of the original data.
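
As a minimal sketch of the workflow just described, the example below fits K-Means on scaled values and writes `labels_` back to the unscaled frame. The toy data and the names `toy_customers`, `toy_scaled`, and `toy_model` are hypothetical; `n_clusters=2` is chosen only because the toy data has two groups.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two well-separated groups of customers
rng = np.random.default_rng(42)
toy_customers = pd.DataFrame({
    "Fresh": np.concatenate([rng.normal(100, 5, 20), rng.normal(500, 5, 20)]),
    "Milk": np.concatenate([rng.normal(50, 5, 20), rng.normal(300, 5, 20)]),
})

# Scale first so that no single column dominates the distance computation
toy_scaled = StandardScaler().fit_transform(toy_customers)

# Fit on the scaled data, then attach the labels to the unscaled frame
toy_model = KMeans(n_clusters=2, random_state=42, n_init=10).fit(toy_scaled)
toy_customers["labels"] = toy_model.labels_

print(toy_customers["labels"].value_counts())
```

Keeping the labels on the unscaled frame means later group summaries are expressed in the original monetary units.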

In [None]:
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


# Determine Optimal Number of Clusters Using Elbow Method
inertia = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customers_scale)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(7, 4))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()

# Fit K-Means with Optimal Number of Clusters (Assume k=3)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(customers_scale)

# Add cluster labels to dataframe
df['Cluster'] = clusters

plt.figure(figsize=(7, 5))
plt.scatter(
    customers_scale['Fresh'],    # Scaled Fresh spending (columns 0 and 1 are Channel and Region)
    customers_scale['Milk'],     # Scaled Milk spending
    c=df['Cluster'].values,      # Use cluster labels for coloring
    cmap='viridis',
    edgecolors='k'
)
plt.title('K-Means Clustering (Fresh vs. Milk)')
plt.xlabel('Fresh (scaled)')
plt.ylabel('Milk (scaled)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()


In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# trying out PCA to visualize the clusters

# Fit PCA with 2 principal components
pca_2d = PCA(n_components=2)
pca_2d_data = pca_2d.fit_transform(customers_scale) 

# 2. Plot the 2D PCA result
plt.figure(figsize=(7, 5))
plt.scatter(
    pca_2d_data[:, 0],
    pca_2d_data[:, 1],
    c=df['Cluster'].values,  # color by cluster labels
    cmap='viridis',
    edgecolors='k'
)
plt.title('K-Means Clusters in 2D (PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# (Optional) Print explained variance ratio
print("Explained Variance Ratio (2D):", pca_2d.explained_variance_ratio_)


In [None]:
from mpl_toolkits.mplot3d import Axes3D  


# trying pca with 3 principal components, to visualize the clusters in 3D

# Fit PCA with 3 principal components
pca_3d = PCA(n_components=3)
pca_3d_data = pca_3d.fit_transform(customers_scale)

# Create a 3D plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(
    pca_3d_data[:, 0],
    pca_3d_data[:, 1],
    pca_3d_data[:, 2],
    c=df['Cluster'].values,
    cmap='viridis',
    edgecolors='k'
)
ax.set_title('K-Means Clusters in 3D (PCA)')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()

# (Optional) Print explained variance ratio
print("Explained Variance Ratio (3D):", pca_3d.explained_variance_ratio_)


### Looking at the elbow plot, we can choose 2 as the number of clusters

In [None]:
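# Fit K-Means with k=2; random_state and n_init are fixed so the labels are reproducible
kmeans_2 = KMeans(n_clusters=2, random_state=42, n_init=10).fit(customers_scale)

# labels_ holds the cluster assigned to each row during fitting
labels = kmeans_2.labels_

clusters = labels.tolist()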

In [None]:
clean_customers = df.copy()

clean_customers['Label'] = clusters

Count the values in `labels`.

In [None]:
from collections import Counter

label_counts = Counter(labels)
print(label_counts)


# Challenge 3 - Data Preprocessing

One problem with the dataset is that the value ranges differ remarkably across categories (e.g. `Fresh` and `Grocery` compared to `Detergents_Paper` and `Delicassen`). If you made this observation in the first challenge, you've done a great job! This means you not only completed the bonus questions in the previous Supervised Learning lab but also dug deep into [*feature scaling*](https://en.wikipedia.org/wiki/Feature_scaling). Keep up the good work!

Diverse value ranges in different features could cause issues in our clustering. The way to reduce the problem is through feature scaling. We'll use this technique again with this dataset.

#### We will use the `StandardScaler` from `sklearn.preprocessing` and scale our data. Read more about `StandardScaler` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

*After scaling your data, assign the transformed data to a new variable `customers_scale`.*
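
To make the effect of scaling concrete, the sketch below compares `StandardScaler` with the manual formula z = (x - mean) / std on a hypothetical toy matrix. The array `toy` and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two features on very different scales
toy = np.array([
    [1000.0, 1.0],
    [2000.0, 2.0],
    [3000.0, 3.0],
    [4000.0, 6.0],
])

scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: per-column mean and population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.mean(axis=0))  # per-column mean after scaling
print(scaled.std(axis=0))   # per-column standard deviation after scaling
```

After scaling, both columns have mean 0 and unit variance, so a distance-based algorithm such as K-Means weighs them equally.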

# Challenge 5 - Data Clustering with DBSCAN

Now let's cluster the data using DBSCAN. Use `DBSCAN(eps=0.5)` to initiate the model, then fit your scaled data. In the data returned from the `.fit` method, assign the `labels_` back to `customers['labels_DBSCAN']`. Now your original data have two labels, one from K-Means and the other from DBSCAN.
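
The following sketch shows the same steps on hypothetical toy points, including how DBSCAN marks noise with the label `-1`. The data and the names `toy_points`, `toy_db`, and `toy_frame` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two dense blobs plus one far-away point
rng = np.random.default_rng(0)
dense_a = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
dense_b = rng.normal(loc=3.0, scale=0.1, size=(30, 2))
outlier = np.array([[10.0, 10.0]])
toy_points = np.vstack([dense_a, dense_b, outlier])

# eps is the neighborhood radius; min_samples keeps its default of 5
toy_db = DBSCAN(eps=0.5).fit(toy_points)

# Assign the labels back to a DataFrame column, as the lab asks
toy_frame = pd.DataFrame(toy_points, columns=["x", "y"])
toy_frame["labels_DBSCAN"] = toy_db.labels_

# Noise points carry the label -1 and are not counted as a cluster
n_clusters = len(set(toy_db.labels_)) - (1 if -1 in toy_db.labels_ else 0)
n_noise = int((toy_db.labels_ == -1).sum())
print(n_clusters, n_noise)
```

Unlike K-Means, the number of clusters is not passed in; it follows from `eps` and `min_samples`.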

In [None]:
from sklearn.cluster import DBSCAN
from collections import Counter
import matplotlib.pyplot as plt

# Initialize DBSCAN with eps=0.5
dbscan = DBSCAN(eps=0.5)

# Fit DBSCAN on the scaled data
dbscan.fit(customers_scale)

# Retrieve cluster labels and store them in labels_DBSCAN (-1 indicates noise)
labels_DBSCAN = dbscan.labels_

# Count how many points fall into each cluster (including noise)
label_counts = Counter(labels_DBSCAN)
print("DBSCAN cluster counts:", label_counts)

# Visualize the clusters using the scaled Fresh and Milk columns
# (the first two columns of customers_scale are the Channel and Region codes)
plt.figure(figsize=(7, 5))
plt.scatter(
    customers_scale['Fresh'],    # Scaled Fresh spending
    customers_scale['Milk'],     # Scaled Milk spending
    c=labels_DBSCAN,             # Color by cluster label (-1 is noise)
    cmap='viridis',
    edgecolors='k'
)
plt.title("DBSCAN Clustering (eps=0.5)")
plt.xlabel("Fresh (scaled)")
plt.ylabel("Milk (scaled)")
plt.grid(True)
plt.show()



Count the values in `labels_DBSCAN`.

In [None]:
from collections import Counter

label_counts = Counter(labels_DBSCAN)
print("DBSCAN cluster counts:", label_counts)


# Challenge 6 - Compare K-Means with DBSCAN

Now we want to visually compare how K-Means and DBSCAN have clustered our data. We will create scatter plots for several columns. For each of the following column pairs, plot a scatter plot using `labels` and another using `labels_DBSCAN`. Put them side by side to compare. Which clustering algorithm makes more sense?

Columns to visualize:

* `Detergents_Paper` as X and `Milk` as y
* `Grocery` as X and `Fresh` as y
* `Frozen` as X and `Delicassen` as y

In [None]:
# Define the column pairs to compare
column_pairs = [
    ('Detergents_Paper', 'Milk'),
    ('Grocery', 'Fresh'),
    ('Frozen', 'Delicassen')
]

Visualize `Detergents_Paper` as X and `Milk` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
import matplotlib.pyplot as plt

# Create a figure with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('Clustering Comparison: Detergents_Paper vs. Milk', fontsize=16)

# Left subplot: K-Means Clustering
axes[0].scatter(df['Detergents_Paper'], df['Milk'], c=labels, cmap='viridis', edgecolors='k')
axes[0].set_title('K-Means Clustering')
axes[0].set_xlabel('Detergents_Paper')
axes[0].set_ylabel('Milk')
axes[0].grid(True)

# Right subplot: DBSCAN Clustering
axes[1].scatter(df['Detergents_Paper'], df['Milk'], c=labels_DBSCAN, cmap='viridis', edgecolors='k')
axes[1].set_title('DBSCAN Clustering')
axes[1].set_xlabel('Detergents_Paper')
axes[1].set_ylabel('Milk')
axes[1].grid(True)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()


Visualize `Grocery` as X and `Fresh` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
import matplotlib.pyplot as plt

# Create a figure with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('Clustering Comparison: Grocery vs. Fresh', fontsize=16)

# Left subplot: K-Means Clustering
axes[0].scatter(df['Grocery'], df['Fresh'], c=labels, cmap='viridis', edgecolors='k')
axes[0].set_title('K-Means Clustering')
axes[0].set_xlabel('Grocery')
axes[0].set_ylabel('Fresh')
axes[0].grid(True)

# Right subplot: DBSCAN Clustering
axes[1].scatter(df['Grocery'], df['Fresh'], c=labels_DBSCAN, cmap='viridis', edgecolors='k')
axes[1].set_title('DBSCAN Clustering')
axes[1].set_xlabel('Grocery')
axes[1].set_ylabel('Fresh')
axes[1].grid(True)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()


Visualize `Frozen` as X and `Delicassen` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
import matplotlib.pyplot as plt

# Create a figure with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('Clustering Comparison: Frozen vs. Delicassen', fontsize=16)

# Left subplot: K-Means Clustering
axes[0].scatter(df['Frozen'], df['Delicassen'], c=labels, cmap='viridis', edgecolors='k')
axes[0].set_title('K-Means Clustering')
axes[0].set_xlabel('Frozen')
axes[0].set_ylabel('Delicassen')
axes[0].grid(True)

# Right subplot: DBSCAN Clustering
axes[1].scatter(df['Frozen'], df['Delicassen'], c=labels_DBSCAN, cmap='viridis', edgecolors='k')
axes[1].set_title('DBSCAN Clustering')
axes[1].set_xlabel('Frozen')
axes[1].set_ylabel('Delicassen')
axes[1].grid(True)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()


Let's use a groupby to see how the mean differs between the groups. Group `customers` by `labels` and `labels_DBSCAN` respectively and compute the means for all columns.
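
A minimal sketch of that comparison, assuming both label sets are stored as columns named `labels` and `labels_DBSCAN` (the toy frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with both sets of cluster labels attached
toy_customers = pd.DataFrame({
    "Fresh": [10, 12, 200, 220, 15],
    "Milk": [5, 7, 90, 110, 6],
    "labels": [0, 0, 1, 1, 0],
    "labels_DBSCAN": [0, 0, -1, -1, 0],
})

# Restrict the means to the feature columns so the label columns
# themselves are not averaged.
feature_cols = ["Fresh", "Milk"]
kmeans_means = toy_customers.groupby("labels")[feature_cols].mean()
dbscan_means = toy_customers.groupby("labels_DBSCAN")[feature_cols].mean()

print(kmeans_means)
print(dbscan_means)
```

In the DBSCAN summary, the row indexed `-1` describes the noise points rather than a genuine cluster, which matters when interpreting its means.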

In [None]:
# Group by K-Means clusters and compute mean values
kmeans_group_means = clean_customers.groupby(labels).mean()
print("Mean values by K-Means clusters:")
print(kmeans_group_means)

# Group by DBSCAN clusters and compute mean values
dbscan_group_means = clean_customers.groupby(labels_DBSCAN).mean()
print("\nMean values by DBSCAN clusters:")
print(dbscan_group_means)


Which algorithm appears to perform better?

**Your observations here**

K-Means appears to perform better. It creates two clear groups with distinct average behaviors.
The differences in means between clusters are easier to interpret and can be directly tied to customer purchasing behaviors.

# Bonus Challenge 2 - Changing K-Means Number of Clusters

As we mentioned earlier, we don't need to worry about the number of clusters with DBSCAN because it automatically decides that based on the parameters we send to it. But with K-Means, we have to supply the `n_clusters` param (if you don't supply `n_clusters`, the algorithm will use `8` by default). You need to know that the optimal number of clusters differs case by case based on the dataset. K-Means can perform badly if the wrong number of clusters is used.

In advanced machine learning, data scientists try different numbers of clusters and evaluate the results with statistical measures (read [here](https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation)). We are not using statistical measures today but we'll use our eyes instead. In the cells below, experiment with different numbers of clusters and visualize with scatter plots. What number of clusters seems to work best for K-Means?
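
For readers who want a number to go with the visual check, one commonly used measure is the silhouette coefficient. The lab does not prescribe it; the sketch below is an assumption about how such a check might look, run on hypothetical toy data with three well-separated blobs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated blobs of 40 points each
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy_points = np.vstack([c + rng.normal(scale=0.3, size=(40, 2)) for c in centers])

# Silhouette needs at least 2 clusters, so the search starts at k=2
scores = {}
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, random_state=42, n_init=10).fit(toy_points)
    scores[k] = silhouette_score(toy_points, candidate.labels_)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

The same loop could be run on `customers_scale`; the score is a guide rather than a verdict, so it is worth reading alongside the scatter plots.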

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# different k values to experiment with
k_values = [2, 3, 4, 5]

# Create subplots for each k value
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for i, k in enumerate(k_values):
    # Initialize and fit K-Means with k clusters
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_labels = kmeans.fit_predict(customers_scale)
    
    # Create a scatter plot using Detergents_Paper as X and Milk as Y
    axes[i].scatter(df['Detergents_Paper'], df['Milk'], c=kmeans_labels, cmap='viridis', edgecolors='k')
    axes[i].set_title(f'K-Means Clustering with k = {k}')
    axes[i].set_xlabel('Detergents_Paper')
    axes[i].set_ylabel('Milk')
    axes[i].grid(True)

plt.tight_layout()
plt.show()


**Your comment here**

From a purely visual standpoint, k=3 seems to strike a good balance between capturing the major distinctions in the data and avoiding overfragmentation.

# Bonus Challenge 3 - Changing DBSCAN `eps` and `min_samples`

Experiment with changing the `eps` and `min_samples` params for DBSCAN. See how the results differ with scatter plot visualization.

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from collections import Counter

# Example ranges for eps and min_samples
eps_values = [0.3, 0.5, 0.7]
min_samples_values = [5, 10]

# Create a grid of subplots: rows = len(min_samples_values), cols = len(eps_values)
fig, axes = plt.subplots(nrows=len(min_samples_values), ncols=len(eps_values), figsize=(15, 8))

for row, ms in enumerate(min_samples_values):
    for col, e in enumerate(eps_values):
        # Initialize and fit DBSCAN with the current eps and min_samples
        dbscan = DBSCAN(eps=e, min_samples=ms)
        db_labels = dbscan.fit_predict(customers_scale)
        
        # Plot the results on the corresponding subplot
        ax = axes[row, col]
        scatter = ax.scatter(
            df['Detergents_Paper'],
            df['Milk'],
            c=db_labels,
            cmap='viridis',
            edgecolors='k'
        )
        ax.set_title(f"DBSCAN (eps={e}, min_samples={ms})")
        ax.set_xlabel('Detergents_Paper')
        ax.set_ylabel('Milk')
        ax.grid(True)
        
        # Count the clusters (including noise, labeled -1)
        cluster_counts = Counter(db_labels)
        print(f"eps={e}, min_samples={ms}, Cluster Counts: {cluster_counts}")

plt.tight_layout()
plt.show()


**Your comment here**

Increasing `eps` generally reduces the number of noise points and merges neighboring clusters.
Increasing `min_samples` raises the density requirement, so fewer points qualify as core points; this typically yields fewer clusters and more noise points.