<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-your-start:" data-toc-modified-id="Before-your-start:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before your start:</a></span></li><li><span><a href="#Challenge-1---Import-and-Describe-the-Dataset" data-toc-modified-id="Challenge-1---Import-and-Describe-the-Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Challenge 1 - Import and Describe the Dataset</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?" data-toc-modified-id="Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?-2.0.0.1"><span class="toc-item-num">2.0.0.1&nbsp;&nbsp;</span>Explore the dataset with mathematical and visualization techniques. What do you find?</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-2---Data-Cleaning-and-Transformation" data-toc-modified-id="Challenge-2---Data-Cleaning-and-Transformation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Challenge 2 - Data Cleaning and Transformation</a></span></li><li><span><a href="#Challenge-3---Data-Preprocessing" data-toc-modified-id="Challenge-3---Data-Preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Challenge 3 - Data Preprocessing</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here." data-toc-modified-id="We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here.-4.0.0.1"><span class="toc-item-num">4.0.0.1&nbsp;&nbsp;</span>We will use the <code>StandardScaler</code> from <code>sklearn.preprocessing</code> and scale our data. 
Read more about <code>StandardScaler</code> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler" target="_blank">here</a>.</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-4---Data-Clustering-with-K-Means" data-toc-modified-id="Challenge-4---Data-Clustering-with-K-Means-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Challenge 4 - Data Clustering with K-Means</a></span></li><li><span><a href="#Challenge-5---Data-Clustering-with-DBSCAN" data-toc-modified-id="Challenge-5---Data-Clustering-with-DBSCAN-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Challenge 5 - Data Clustering with DBSCAN</a></span></li><li><span><a href="#Challenge-6---Compare-K-Means-with-DBSCAN" data-toc-modified-id="Challenge-6---Compare-K-Means-with-DBSCAN-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Challenge 6 - Compare K-Means with DBSCAN</a></span></li><li><span><a href="#Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters" data-toc-modified-id="Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Bonus Challenge 2 - Changing K-Means Number of Clusters</a></span></li><li><span><a href="#Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples" data-toc-modified-id="Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Bonus Challenge 3 - Changing DBSCAN <code>eps</code> and <code>min_samples</code></a></span></li></ul></div>

# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import your libraries:

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings                                              
from sklearn.exceptions import DataConversionWarning          
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

# Challenge 1 - Import and Describe the Dataset

In this lab, we will use a dataset containing information about customer preferences. We will look at how much each customer spends in a year on each subcategory in the grocery store and try to find similarities using clustering.

The origin of the dataset is [here](https://archive.ics.uci.edu/ml/datasets/wholesale+customers).

In [None]:
# loading the data: Wholesale customers data
df = pd.read_csv('Wholesale customers data.csv')

# Display first few rows of the dataset
df.head()

#### Explore the dataset with mathematical and visualization techniques. What do you find?

Checklist:

* What does each column mean?
* Any categorical data to convert?
* Any missing data to remove?
* Column collinearity - any high correlations?
* Descriptive statistics - any outliers to remove?
* Column-wise data distribution - is the distribution skewed?
* Etc.

Additional info: Over a century ago, an Italian economist named Vilfredo Pareto discovered that roughly 20% of the customers account for 80% of the typical retail sales. This is called the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle). Check if this dataset displays this characteristic.
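The Pareto check described above can be reduced to a single number: the share of total spend contributed by the top 20% of customers. The sketch below is only an illustration; the helper `top_share` and the toy values are hypothetical and not part of the lab.

```python
import numpy as np
import pandas as pd

def top_share(total_spend, top_fraction=0.2):
    """Fraction of overall spend contributed by the top `top_fraction` of customers."""
    ordered = np.sort(np.asarray(total_spend, dtype=float))[::-1]  # largest first
    n_top = max(1, int(np.ceil(top_fraction * len(ordered))))
    return ordered[:n_top].sum() / ordered.sum()

# Toy data (hypothetical): 10 customers, two of them very large
toy = pd.Series([500, 300, 40, 30, 30, 25, 25, 20, 15, 15])
share = top_share(toy)
print(f"Top 20% of customers account for {share:.0%} of spend")
```

On the real data, the same helper could be applied to a per-customer total of the spending columns; a value close to 0.8 would be consistent with the Pareto principle.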

In [None]:
# 1. Understanding the columns
print("Column Names:")
print(df.columns)

# 2. Checking for categorical data
df.info()

# 3. Checking for missing data
print("\nMissing values:")
print(df.isnull().sum())

# 4. Correlation between columns
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# 5. Descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe())

# 6. Detecting outliers using boxplot
plt.figure(figsize=(15, 10))
sns.boxplot(data=df)
plt.title('Boxplot for Outlier Detection')
plt.xticks(rotation=45)
plt.show()

# 7. Distribution of each column
plt.figure(figsize=(15, 10))
for i, column in enumerate(df.columns, 1):
    plt.subplot(3, 3, i)
    sns.histplot(df[column], kde=True)
    plt.title(f'Distribution of {column}')
plt.tight_layout()
plt.show()

# Pareto Principle Analysis
# Sum only the spending columns: `Channel` and `Region` are category codes, not amounts
spend_cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
df['Total_Spend'] = df[spend_cols].sum(axis=1)

# Sort customers by their total spending, largest first
df_sorted = df.sort_values(by='Total_Spend', ascending=False).reset_index(drop=True)

# Calculate cumulative percentage of total spending
df_sorted['Cumulative_Percentage'] = df_sorted['Total_Spend'].cumsum() / df_sorted['Total_Spend'].sum() * 100

# Plot cumulative percentage
plt.figure(figsize=(10, 6))
plt.plot(df_sorted.index, df_sorted['Cumulative_Percentage'], marker='o')
plt.axhline(y=80, color='r', linestyle='--')
plt.xlabel('Number of Customers')
plt.ylabel('Cumulative Percentage of Total Spend')
plt.title('Pareto Principle Analysis - Cumulative Spend by Customers')
plt.show()

**Your observations here**

- No missing values were found.
- Outliers are present, which may affect the clustering results. We will proceed by removing outliers.
- `Channel` and `Region` are categorical but already stored as integer codes, so no conversion is needed to run the analysis.
- Correlation analysis shows some features are highly correlated, indicating potential for dimensionality reduction using PCA.



# Challenge 2 - Data Cleaning and Transformation

If your conclusion from the previous challenge is the data need cleaning/transformation, do it in the cells below. However, if your conclusion is the data need not be cleaned or transformed, feel free to skip this challenge. But if you do choose the latter, please provide rationale.

In [None]:
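# Removing outliers
# Using the IQR method on the spending columns only
# (`Channel` and `Region` are category codes; `Total_Spend` is derived from the others)
spend_cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
Q1 = df[spend_cols].quantile(0.25)
Q3 = df[spend_cols].quantile(0.75)
IQR = Q3 - Q1

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and drop any row containing one
is_outlier = (df[spend_cols] < (Q1 - 1.5 * IQR)) | (df[spend_cols] > (Q3 + 1.5 * IQR))
# .copy() avoids SettingWithCopyWarning when label columns are added later
df_filtered = df[~is_outlier.any(axis=1)].copy()

print(f"\nNumber of rows before removing outliers: {df.shape[0]}")
print(f"Number of rows after removing outliers: {df_filtered.shape[0]}")

# Updating the dataset
df = df_filtered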

# Standardizing the data for clustering
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop(columns=['Total_Spend']))

print("\nData cleaned and standardized for further analysis.")

**Your comment here**

- I removed outliers using the IQR method, which resulted in fewer data points but potentially improved the clustering results.
- The dataset has been standardized, which is important for distance-based clustering methods.
- Moving forward, we will proceed with clustering techniques like KMeans to identify customer groups based on spending habits.
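As a small illustration of the IQR rule mentioned above, the sketch below flags rows where any column falls outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`. The helper name `iqr_mask` and the toy values are hypothetical; only the column names are borrowed from the dataset.

```python
import pandas as pd

def iqr_mask(frame, columns, k=1.5):
    """Boolean mask: True for rows with no value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    outside = (frame[columns] < q1 - k * iqr) | (frame[columns] > q3 + k * iqr)
    return ~outside.any(axis=1)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Fresh": [10, 12, 11, 13, 12, 200],   # 200 is an obvious outlier
    "Milk":  [5, 6, 5, 7, 6, 6],
})
mask = iqr_mask(toy, ["Fresh", "Milk"])
kept = toy[mask].copy()
print(f"rows before: {len(toy)}, after: {len(kept)}")
```

Because a row is dropped when any single column is out of range, the number of removed rows can grow quickly as more columns are included.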

# Challenge 3 - Data Preprocessing

One problem with the dataset is that the value ranges differ remarkably across categories (e.g. `Fresh` and `Grocery` compared to `Detergents_Paper` and `Delicassen`). If you made this observation in the first challenge, you've done a great job! This means you not only completed the bonus questions in the previous Supervised Learning lab but also researched deeply into [*feature scaling*](https://en.wikipedia.org/wiki/Feature_scaling). Keep up the good work!

Diverse value ranges in different features could cause issues in our clustering. The way to reduce the problem is through feature scaling. We'll use this technique again with this dataset.
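To make the effect of scaling concrete, here is a minimal sketch on a toy matrix (the values are made up). It shows that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its population standard deviation, so every feature ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix (hypothetical values): two features on very different scales
X = np.array([[12000.0, 30.0],
              [ 3000.0, 10.0],
              [ 9000.0, 50.0],
              [ 6000.0, 20.0]])

scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled.mean(axis=0).round(6))  # approximately 0 per column
print(scaled.std(axis=0).round(6))   # approximately 1 per column
```

After this transformation, a large-valued feature no longer dominates the Euclidean distances that K-Means and DBSCAN rely on.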

#### We will use the `StandardScaler` from `sklearn.preprocessing` and scale our data. Read more about `StandardScaler` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

*After scaling your data, assign the transformed data to a new variable `customers_scale`.*

In [None]:
# Your import here:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
customers_scale = scaler.fit_transform(df.drop(columns=['Total_Spend']))

# Challenge 4 - Data Clustering with K-Means

Now let's cluster the data with K-Means first. Initiate the K-Means model, then fit your scaled data. In the data returned from the `.fit` method, there is an attribute called `labels_` which is the cluster number assigned to each data record. What you can do is to assign these labels back to `customers` in a new column called `customers['labels']`. Then you'll see the cluster results of the original data.

In [None]:
from sklearn.cluster import KMeans

# Initialize KMeans with 4 clusters (arbitrary choice, you can adjust this later based on the elbow method)
kmeans = KMeans(n_clusters=4, random_state=42)

# Fit the model to the scaled data
kmeans.fit(customers_scale)

# Assign cluster labels to the original dataframe
df['labels'] = kmeans.labels_

# Display first few rows with the cluster labels
df.head()

### Looking at the elbow, we can choose 2 as the number of clusters
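The elbow plot itself is not shown in this notebook. One way it could be produced is sketched below: fit K-Means for a range of `k` and record `inertia_` (the within-cluster sum of squares). The helper `inertia_by_k` is hypothetical, and the synthetic blobs only stand in for `customers_scale` so that the snippet runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def inertia_by_k(X, k_values, random_state=42):
    """Fit K-Means once per k and return the within-cluster sum of squares (inertia_)."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_
        for k in k_values
    }

# Synthetic stand-in for the scaled data: three blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
inertias = inertia_by_k(X_demo, range(1, 8))
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting these values against `k` gives the elbow curve; the "elbow" is the point after which adding clusters reduces inertia only slightly. Reading it is a judgment call rather than an exact rule.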

In [None]:
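# Refit with 2 clusters; random_state makes the assignment reproducible
kmeans_2 = KMeans(n_clusters=2, random_state=42).fit(customers_scale)

# labels_ already holds the cluster of each fitted row, so a separate predict() call is not needed
clusters = kmeans_2.labels_

In [None]:
# Overwrite the earlier 4-cluster labels so that `labels` reflects the chosen 2-cluster model
df['labels'] = clusters

Count the values in `labels`.

In [None]:
label_counts = df['labels'].value_counts()
print(label_counts)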

# Challenge 5 - Data Clustering with DBSCAN

Now let's cluster the data using DBSCAN. Use `DBSCAN(eps=0.5)` to initiate the model, then fit your scaled data. In the data returned from the `.fit` method, assign the `labels_` back to `customers['labels_DBSCAN']`. Now your original data have two labels, one from K-Means and the other from DBSCAN.

In [None]:
from sklearn.cluster import DBSCAN 

# Initialize DBSCAN model
dbscan = DBSCAN(eps=0.5)

# Fit the model to the scaled data
dbscan.fit(customers_scale)

# Assign DBSCAN cluster labels to the original dataframe
df['labels_DBSCAN'] = dbscan.labels_

Count the values in `labels_DBSCAN`.

In [None]:
labels_dbscan_counts = df['labels_DBSCAN'].value_counts()
print(labels_dbscan_counts)
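When reading the counts above, keep in mind that scikit-learn's DBSCAN assigns the label `-1` to noise points, so `-1` is not a cluster. The toy example below (made-up points, not the lab data) shows one way to count clusters and noise separately.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy points (hypothetical): two tight groups plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1], [5.05, 5.05],
              [20.0, 20.0]])

labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_

n_noise = int(np.sum(labels == -1))                         # points not assigned to any cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude the noise label
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

If most rows of the real data end up with label `-1`, that is usually a hint that `eps` is too small for the scaled feature space.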

# Challenge 6 - Compare K-Means with DBSCAN

Now we want to visually compare how K-Means and DBSCAN have clustered our data. We will create scatter plots for several columns. For each of the following column pairs, plot a scatter plot using `labels` and another using `labels_DBSCAN`. Put them side by side to compare. Which clustering algorithm makes better sense?

Columns to visualize:

* `Detergents_Paper` as X and `Milk` as y
* `Grocery` as X and `Fresh` as y
* `Frozen` as X and `Delicassen` as y

Visualize `Detergents_Paper` as X and `Milk` as y by `labels` and `labels_DBSCAN` respectively

In [30]:
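def plot(x, y, hue, title):
    """Scatter df[x] against df[y] on the current axes, colored by the labels in df[hue]."""
    sns.scatterplot(data=df, x=x, y=y, hue=hue)
    plt.title(title)
    # No plt.show() here: calling it after the first subplot would split the side-by-side figure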

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plot('Detergents_Paper', 'Milk', 'labels', 'K-Means: Detergents_Paper vs Milk')
plt.subplot(1, 2, 2)
plot('Detergents_Paper', 'Milk', 'labels_DBSCAN', 'DBSCAN: Detergents_Paper vs Milk')

Visualize `Grocery` as X and `Fresh` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plot('Grocery', 'Fresh', 'labels', 'K-Means: Grocery vs Fresh')
plt.subplot(1, 2, 2)
plot('Grocery', 'Fresh', 'labels_DBSCAN', 'DBSCAN: Grocery vs Fresh')

Visualize `Frozen` as X and `Delicassen` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plot('Frozen', 'Delicassen', 'labels', 'K-Means: Frozen vs Delicassen')
plt.subplot(1, 2, 2)
plot('Frozen', 'Delicassen', 'labels_DBSCAN', 'DBSCAN: Frozen vs Delicassen')

Let's use a groupby to see how the mean differs between the groups. Group `customers` by `labels` and `labels_DBSCAN` respectively and compute the means for all columns.

In [None]:
kmeans_group_means = df.groupby('labels').mean()
dbscan_group_means = df.groupby('labels_DBSCAN').mean()

print("K-Means Group Means:\n", kmeans_group_means)
print("\nDBSCAN Group Means:\n", dbscan_group_means)

Which algorithm appears to perform better?

**Your observations here**

- K-Means creates more clearly defined clusters, especially when the data points are well-separated, but it may struggle with complex shapes or varying densities.
- DBSCAN, on the other hand, can identify clusters of arbitrary shape and labels low-density points as noise (`-1`). However, it is sensitive to the parameters chosen (e.g., `eps`), and a poor choice can leave many points marked as noise.
- Overall, DBSCAN appears to perform better in this dataset due to the irregular density and presence of noise. It is more robust in defining meaningful clusters, especially when the data distribution is non-uniform.
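Beyond visual comparison, a cross-tabulation of the two label columns shows how the assignments overlap and how many points DBSCAN treats as noise. The sketch below uses hypothetical label values in place of the real `labels` and `labels_DBSCAN` columns.

```python
import pandas as pd

# Hypothetical label columns, standing in for `labels` and `labels_DBSCAN`
toy = pd.DataFrame({
    "labels":        [0, 0, 0, 1, 1, 1, 1, 0],
    "labels_DBSCAN": [0, 0, -1, 1, 1, -1, 1, 0],
})

# Rows: K-Means cluster, columns: DBSCAN cluster (-1 is noise)
table = pd.crosstab(toy["labels"], toy["labels_DBSCAN"])
noise_share = (toy["labels_DBSCAN"] == -1).mean()
print(table)
print(f"share of points DBSCAN marks as noise: {noise_share:.0%}")
```

A large noise share is worth checking before concluding that one algorithm performs better, since noise points are excluded from every DBSCAN cluster.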

# Bonus Challenge 2 - Changing K-Means Number of Clusters

As we mentioned earlier, we don't need to worry about the number of clusters with DBSCAN because it automatically decides that based on the parameters we send to it. But with K-Means, we have to supply the `n_clusters` param (if you don't supply `n_clusters`, the algorithm will use `8` by default). You need to know that the optimal number of clusters differs case by case based on the dataset. K-Means can perform badly if the wrong number of clusters is used.

In advanced machine learning, data scientists try different numbers of clusters and evaluate the results with statistical measures (read [here](https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation)). We are not using statistical measures today but we'll use our eyes instead. In the cells below, experiment with different numbers of clusters and visualize with scatter plots. What number of clusters seems to work best for K-Means?
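For reference, one commonly used statistical measure is the silhouette score, an internal measure available as `sklearn.metrics.silhouette_score`. It is not required for this lab; the sketch below only illustrates the idea on synthetic blobs that stand in for `customers_scale`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)  # ranges from -1 to 1, higher is better

best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()}, "best:", best_k)
```

The score needs at least two clusters, which is why the loop starts at `k=2`. It tends to favor compact, well-separated clusters, so it is best read alongside the scatter plots rather than instead of them.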

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

# Experiment with different numbers of clusters
for n_clusters in range(2, 7):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(customers_scale)
    df[f'labels_kmeans_{n_clusters}'] = kmeans.labels_
    
    # Plotting
    plt.figure(figsize=(14, 6))
    sns.scatterplot(x='Grocery', y='Fresh', hue=f'labels_kmeans_{n_clusters}', data=df, palette='viridis')
    plt.title(f'K-Means with {n_clusters} Clusters: Grocery vs Fresh')
    plt.show()

**Your comment here**

- After experimenting with different numbers of clusters, it appears that using 3 or 4 clusters yields the most meaningful separation in the data, especially when visualizing features such as `Grocery` vs `Fresh`.
- Using fewer clusters (e.g., 2) tends to group too many data points together, leading to less insight into the variability between customer segments.
- Using more clusters (e.g., 5 or 6) may lead to overfitting, where clusters start representing minor variations rather than meaningful customer groups. Therefore, a balance between simplicity and meaningful segmentation is key, and in this dataset, 3 or 4 clusters seem to work best.

# Bonus Challenge 3 - Changing DBSCAN `eps` and `min_samples`

Experiment changing the `eps` and `min_samples` params for DBSCAN. See how the results differ with scatter plot visualization.

In [None]:
from sklearn.cluster import DBSCAN

# Experiment with different values for `eps` and `min_samples`
eps_values = [0.3, 0.5, 0.7]
min_samples_values = [3, 5, 10]

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        dbscan.fit(customers_scale)
        df[f'labels_dbscan_eps_{eps}_min_{min_samples}'] = dbscan.labels_
        
        # Plotting
        plt.figure(figsize=(14, 6))
        sns.scatterplot(x='Grocery', y='Fresh', hue=f'labels_dbscan_eps_{eps}_min_{min_samples}', data=df, palette='viridis')
        plt.title(f'DBSCAN with eps={eps}, min_samples={min_samples}: Grocery vs Fresh')
        plt.show()

**Your comment here**

- After experimenting with different values for `eps` and `min_samples`, it was observed that increasing `eps` leads to larger clusters and potentially fewer noise points. However, too large an `eps` may cause distinct groups to merge.
- Similarly, adjusting `min_samples` changes the minimum density required for a group to form a cluster. A higher value for `min_samples` often leads to more points being considered noise, whereas a lower value makes the algorithm more sensitive to slight groupings in the data.
- Overall, the choice of `eps` and `min_samples` is crucial in DBSCAN for identifying meaningful clusters while limiting noise points, and these parameters should be tuned carefully based on the dataset characteristics.
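A common heuristic for picking a starting `eps`, instead of trying values blindly, is the k-distance curve: sort every point's distance to its k-th nearest neighbour and look for the bend. The sketch below assumes synthetic blobs in place of `customers_scale`, and the helper `k_distances` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbour (excluding itself)."""
    # k + 1 because each point is returned as its own nearest neighbour at distance 0
    distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return np.sort(distances[:, -1])

# Synthetic stand-in for the scaled data
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)
kd = k_distances(X_demo, k=4)
print(kd[:3], kd[-3:])
```

Since `min_samples` in scikit-learn counts the point itself, `k = min_samples - 1` is a natural choice here. Plotting `kd` and reading `eps` off the point where the curve bends sharply upward gives a starting value, not a guaranteed optimum.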