<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-your-start:" data-toc-modified-id="Before-your-start:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before your start:</a></span></li><li><span><a href="#Challenge-1---Import-and-Describe-the-Dataset" data-toc-modified-id="Challenge-1---Import-and-Describe-the-Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Challenge 1 - Import and Describe the Dataset</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?" data-toc-modified-id="Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?-2.0.0.1"><span class="toc-item-num">2.0.0.1&nbsp;&nbsp;</span>Explore the dataset with mathematical and visualization techniques. What do you find?</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-2---Data-Cleaning-and-Transformation" data-toc-modified-id="Challenge-2---Data-Cleaning-and-Transformation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Challenge 2 - Data Cleaning and Transformation</a></span></li><li><span><a href="#Challenge-3---Data-Preprocessing" data-toc-modified-id="Challenge-3---Data-Preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Challenge 3 - Data Preprocessing</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here." data-toc-modified-id="We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here.-4.0.0.1"><span class="toc-item-num">4.0.0.1&nbsp;&nbsp;</span>We will use the <code>StandardScaler</code> from <code>sklearn.preprocessing</code> and scale our data. 
Read more about <code>StandardScaler</code> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler" target="_blank">here</a>.</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-4---Data-Clustering-with-K-Means" data-toc-modified-id="Challenge-4---Data-Clustering-with-K-Means-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Challenge 4 - Data Clustering with K-Means</a></span></li><li><span><a href="#Challenge-5---Data-Clustering-with-DBSCAN" data-toc-modified-id="Challenge-5---Data-Clustering-with-DBSCAN-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Challenge 5 - Data Clustering with DBSCAN</a></span></li><li><span><a href="#Challenge-6---Compare-K-Means-with-DBSCAN" data-toc-modified-id="Challenge-6---Compare-K-Means-with-DBSCAN-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Challenge 6 - Compare K-Means with DBSCAN</a></span></li><li><span><a href="#Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters" data-toc-modified-id="Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Bonus Challenge 2 - Changing K-Means Number of Clusters</a></span></li><li><span><a href="#Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples" data-toc-modified-id="Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Bonus Challenge 3 - Changing DBSCAN <code>eps</code> and <code>min_samples</code></a></span></li></ul></div>

# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [None]:
# Ensures that the plots appear within the notebook, not in a separate window.
%matplotlib inline  

# Importing libraries
import matplotlib.pyplot as plt  # For plotting graphs and visualizations
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation and reading datasets
import seaborn as sns  # For advanced statistical visualizations
import warnings  # For controlling warning messages
from sklearn.exceptions import DataConversionWarning  # To specifically catch data conversion warnings

# Suppress warnings that are not relevant to the analysis to keep the output clean
warnings.filterwarnings(action='ignore', category=DataConversionWarning)


# Challenge 1 - Import and Describe the Dataset

In this lab, we will use a dataset containing information about customer preferences. We will look at how much each customer spends in a year on each subcategory in the grocery store and try to find similarities using clustering.

The origin of the dataset is [here](https://archive.ics.uci.edu/ml/datasets/wholesale+customers).

In [None]:
# loading the data: Wholesale customers data

# Reading the dataset into a DataFrame with a relative file path
customers = pd.read_csv('../data/Wholesale customers data.csv')

# Displaying the first few rows of the dataset
customers.head()

#### Explore the dataset with mathematical and visualization techniques. What do you find?

Checklist:

* What does each column mean?
* Any categorical data to convert?
* Any missing data to remove?
* Column collinearity - any high correlations?
* Descriptive statistics - any outliers to remove?
* Column-wise data distribution - is the distribution skewed?
* Etc.

Additional info: Over a century ago, an Italian economist named Vilfredo Pareto observed that roughly 20% of the population owned about 80% of the land in Italy. The same pattern is often applied to retail, where roughly 20% of the customers are said to account for 80% of sales. This is called the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle). Check if this dataset displays this characteristic.
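
A customer-level Pareto check can be sketched as follows. This is a minimal illustration rather than a prescribed solution: the helper name `top_customer_share` and the toy totals are assumptions made for the example, and the input is taken to be one total-spending figure per customer.

```python
import numpy as np

def top_customer_share(total_spend, top_frac=0.2):
    """Share of overall spending contributed by the top `top_frac` of customers."""
    spend = np.sort(np.asarray(total_spend, dtype=float))[::-1]  # largest first
    n_top = max(1, int(round(top_frac * spend.size)))            # at least one customer
    return spend[:n_top].sum() / spend.sum()

# Toy example: five customers, so the top 20% is the single biggest spender.
toy_totals = [100, 50, 30, 10, 10]
print(top_customer_share(toy_totals))  # 100 / 200 -> 0.5
```

On the lab data, the per-customer totals could be obtained with something like `customers.drop(columns=['Channel', 'Region']).sum(axis=1)`. A result near 0.8 would support the 80/20 pattern; a clearly lower value would not.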

In [None]:
# Exploring the dataset with mathematical and visualization techniques

# 1. What is the shape of the dataframe (rows, columns)?
# Displaying the shape of the dataframe (rows, columns)
print("\nShape of the DataFrame:")
print(customers.shape)

# 2. Any missing data to remove?
# Checking for missing data in the dataset
print("\nMissing Data:")
print(customers.isnull().sum())

# 3. Duplicate row check
# Checking if there are any duplicate rows in the dataset
print("\nChecking for Duplicate Rows:")
print(customers.duplicated().sum())

# 4. What does each column mean?
# Displaying the column names (their meanings are covered in the observations)
print("\nColumn Names:")
print(customers.columns)

# 5. Any categorical data to convert?
# Checking the data types and identifying categorical columns
print("\nData Types (Check for categorical columns):")
print(customers.dtypes)

# 6. What are the unique values in each column? Are any of the columns categorical?
# Displaying the unique values of each column to help decide if they are categorical or quantitative
print("\nUnique Values for Each Column:")
for column in customers.columns:
    print(f"{column}: {customers[column].unique()}")

### Your observations here

- **Shape of the DataFrame**:
  - The dataset contains **440 rows** and **8 columns**, meaning it has data for 440 customers and 8 different attributes (features).

- **Missing Data**:
  - There are **no missing values** in any of the columns. All 440 rows have complete data for each feature, which is a positive sign for data integrity.

- **Checking for Duplicate Rows**:
  - There are **no duplicate rows** in the dataset, ensuring that each customer record is unique.

- **Column Names and Descriptions**:
  - The dataset consists of the following columns:
    - `Channel`: Likely represents different sales channels (categorical).
    - `Region`: Likely represents geographic regions (categorical).
    - `Fresh`: Spending on fresh products (quantitative).
    - `Milk`: Spending on milk (quantitative).
    - `Grocery`: Spending on grocery products (quantitative).
    - `Frozen`: Spending on frozen products (quantitative).
    - `Detergents_Paper`: Spending on detergents and paper products (quantitative).
    - `Delicassen`: Spending on delicatessen products (quantitative).

- **Data Types**:
  - The data types of the columns are as follows:
    - **Quantitative** (spending values): `Fresh`, `Milk`, `Grocery`, `Frozen`, `Detergents_Paper`, `Delicassen`
    - **Categorical** (representing categories or groups): `Channel`, `Region`
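
Because `Channel` and `Region` are stored as integer codes, pandas reports them as numeric. The sketch below shows one way to mark them as categorical on a toy frame. The label names are an assumption based on the UCI dataset description and should be verified against it before use; the toy values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the notebook's `customers` DataFrame.
toy = pd.DataFrame({"Channel": [1, 2, 1], "Region": [3, 1, 2], "Fresh": [100, 250, 80]})

# ASSUMPTION: label names follow the UCI description; verify before relying on them.
channel_labels = {1: "Horeca", 2: "Retail"}
region_labels = {1: "Lisbon", 2: "Oporto", 3: "Other"}

# Replace the integer codes with labels and store them as a categorical dtype.
toy["Channel"] = toy["Channel"].map(channel_labels).astype("category")
toy["Region"] = toy["Region"].map(region_labels).astype("category")

# One-hot encode only if the categorical columns are to be used as model inputs.
encoded = pd.get_dummies(toy, columns=["Channel", "Region"])
print(toy.dtypes)
print(encoded.columns.tolist())
```

Whether to encode these columns at all is a design choice: the rest of this lab clusters on the six spending columns only, so the categorical columns can also simply be kept aside for interpreting the clusters afterwards.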





In [None]:
# Analyzing Categorical Variables: Channel and Region

# 1. Frequency counts for the categorical variables
print("\nFrequency Counts for Categorical Variables:")
print(customers['Channel'].value_counts())
print(customers['Region'].value_counts())

# 2. Bar plots to visualize the distribution of the categorical variables
plt.figure(figsize=(10, 5))
sns.countplot(x='Channel', data=customers)
plt.title('Distribution of Channel')
plt.show()

plt.figure(figsize=(10, 5))
sns.countplot(x='Region', data=customers)
plt.title('Distribution of Region')
plt.show()

# 3. Cross-tabulation of Channel and Region (if applicable)
print("\nCross-tabulation of Channel and Region:")
print(pd.crosstab(customers['Channel'], customers['Region']))


**Your observations here**

### 1. Frequency Tables:
- The frequency counts for `Channel` and `Region` show:
  - **Channel** has two values: `1` and `2`. About two thirds of the customers (298 of 440) are in `Channel 1`; the remaining 142 are in `Channel 2`.
  - **Region** has three values: `1`, `2`, and `3`. Most customers are from `Region 3` (316), followed by `Region 1` (77) and `Region 2` (47).

### 2. Distribution of Channel:
- A **bar plot** has shown that **Channel 1** has a significantly higher count of customers than **Channel 2**. This suggests that **Channel 1** is more commonly used by the customers in this dataset.

### 3. Distribution of Region:
- The **bar plot** of `Region` indicates that the vast majority of customers belong to **Region 3**, while **Region 1** has a smaller proportion of customers, and **Region 2** has even fewer customers.

### 4. Cross-tabulation of Channel and Region:
- The **cross-tabulation** between `Channel` and `Region` gives us a more detailed view of the distribution of customers across both variables. Here's a breakdown:
  - **For Channel 1**:
    - 59 customers belong to **Region 1**.
    - 28 customers belong to **Region 2**.
    - 211 customers belong to **Region 3**.
  - **For Channel 2**:
    - 18 customers belong to **Region 1**.
    - 19 customers belong to **Region 2**.
    - 105 customers belong to **Region 3**.
  
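  **Channel 1** is the larger channel in every region. In absolute terms its lead is widest in **Region 3** (211 vs. 105 customers); proportionally it is strongest in **Region 1** (59 vs. 18) and weakest in **Region 2** (28 vs. 19).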
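
Raw counts make regions of different sizes hard to compare. As a small sketch, the counts from the cross-tabulation above can be turned into within-region shares. The variable names here are chosen for illustration only.

```python
import pandas as pd

# Counts copied from the cross-tabulation reported above
# (rows: Channel, columns: Region).
counts = pd.DataFrame(
    {1: [59, 18], 2: [28, 19], 3: [211, 105]},
    index=pd.Index([1, 2], name="Channel"),
)
counts.columns.name = "Region"

# Divide each column by its total to get the channel mix within each region.
channel_share_by_region = counts / counts.sum(axis=0)
print(channel_share_by_region.round(3))
```

On the notebook's DataFrame, `pd.crosstab(customers['Channel'], customers['Region'], normalize='columns')` should produce the same shares directly.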


In [None]:
# Analyzing Quantitative Variables

# Selecting the quantitative columns
quantitative_columns = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']

# 1. Descriptive statistics - including mean, standard deviation, min, 25%, 50%, 75%, and max
print("\nDescriptive Statistics for Quantitative Variables:")
print(customers[quantitative_columns].describe())

# 2. Distribution of each quantitative variable (using histograms)
print("\nDistribution of Quantitative Variables:")
customers[quantitative_columns].hist(figsize=(12, 10), bins=20)
plt.suptitle('Histograms of Quantitative Columns')
plt.show()

# 3. Boxplots for each quantitative variable (to check for outliers)
print("\nBoxplots of Quantitative Variables:")
customers[quantitative_columns].plot(kind='box', figsize=(12, 8), vert=False)
plt.suptitle('Boxplots of Quantitative Columns')
plt.show()

# 4. Outlier detection using IQR method (based on the 25th and 75th percentiles)
Q1 = customers[quantitative_columns].quantile(0.25)
Q3 = customers[quantitative_columns].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = ((customers[quantitative_columns] < (Q1 - 1.5 * IQR)) | (customers[quantitative_columns] > (Q3 + 1.5 * IQR)))
print("\nOutliers (using IQR method):")
print(outliers_iqr.sum())  # Shows how many outliers exist for each column

# 5. Outlier detection using z-score method
from scipy.stats import zscore

# Calculating z-scores for each column
z_scores = customers[quantitative_columns].apply(zscore)
outliers_zscore = (z_scores > 3) | (z_scores < -3)  # Consider outliers with z-score > 3 or < -3
print("\nOutliers (using Z-score method):")
print(outliers_zscore.sum())  # Shows how many outliers exist for each column

# 6. Skewness for each quantitative variable
print("\nSkewness of Each Quantitative Variable:")
skewness = customers[quantitative_columns].skew()
print(skewness)

# 7. Share of total spending by category (largest first)
# Note: this ranks the six product categories, not the customers, so it shows
# which categories dominate spending rather than testing the Pareto principle itself.
print("\nCumulative Share of Total Spending by Category:")
# Calculate the total spending for each category
total_sales = customers[quantitative_columns].sum(axis=0).sort_values(ascending=False)
cumulative_sales = total_sales.cumsum() / total_sales.sum()
print(cumulative_sales)

# Visualizing the cumulative share across categories
cumulative_sales.plot(kind='line', marker='o', figsize=(10, 6))
plt.title('Cumulative Share of Total Spending by Category')
plt.xlabel('Categories')
plt.ylabel('Cumulative Share of Total Spending (0 to 1)')
plt.show()

# 8. Correlation matrix for quantitative variables
print("\nCorrelation Matrix for Quantitative Variables:")
correlation_matrix = customers[quantitative_columns].corr()
print(correlation_matrix)

# Plotting the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of Quantitative Variables')
plt.show()

# 9. Pairwise plots for the quantitative variables to visualize relationships
# (seaborn was already imported as sns at the top of the notebook)

print("\nPairwise Plots of Quantitative Variables:")
sns.pairplot(customers[quantitative_columns])
plt.suptitle('Pairwise Plots of Quantitative Variables', y=1.02)
plt.show()


### Your observations here

- **Descriptive Statistics for Quantitative Variables**:
  The summary statistics for the quantitative variables are as follows:
  - `Fresh`: Mean = 12000.30, Std = 12647.33, Min = 3, Max = 112151
  - `Milk`: Mean = 5796.27, Std = 7380.38, Min = 55, Max = 73498
  - `Grocery`: Mean = 7951.28, Std = 9503.16, Min = 3, Max = 92780
  - `Frozen`: Mean = 3071.93, Std = 4854.67, Min = 25, Max = 60869
  - `Detergents_Paper`: Mean = 2881.49, Std = 4767.85, Min = 3, Max = 40827
  - `Delicassen`: Mean = 1524.87, Std = 2820.11, Min = 3, Max = 47943

- **Distribution of Quantitative Variables**:
  The histograms show the distribution of each quantitative variable. Most variables show a skewed distribution.

- **Boxplots of Quantitative Variables**:
  The boxplots display the spread of the data for each quantitative variable. All variables have outliers, indicated by the points outside the whiskers of the boxplot. Variables like `Fresh`, `Milk`, and `Grocery` have significant outliers.

- **Outliers**:
  - Using the IQR method, the following outlier counts are observed for each quantitative variable:
    - `Fresh`: 20
    - `Milk`: 28
    - `Grocery`: 24
    - `Frozen`: 43
    - `Detergents_Paper`: 30
    - `Delicassen`: 27

  - Using the Z-score method, the following outlier counts are observed:
    - `Fresh`: 7
    - `Milk`: 9
    - `Grocery`: 7
    - `Frozen`: 6
    - `Detergents_Paper`: 10
    - `Delicassen`: 4

  These outlier counts indicate that there are several extreme values in these features, which may require further attention or removal based on the analysis.

- **Skewness of Each Quantitative Variable**:
  The skewness values for the quantitative variables indicate a right skew in most variables:
  - `Fresh`: 2.56 (highly skewed)
  - `Milk`: 4.05 (highly skewed)
  - `Grocery`: 3.59 (highly skewed)
  - `Frozen`: 5.91 (highly skewed)
  - `Detergents_Paper`: 3.63 (highly skewed)
  - `Delicassen`: 11.15 (extremely skewed)

  Skewness above 1 is commonly read as strong right skew. Every variable here exceeds that threshold, which suggests applying a transformation (for example, a log transform) before clustering.

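- **Pareto Principle (Top 20% Cumulative Contribution to Total Sales)**:
  The Pareto principle concerns customers: roughly 20% of the customers are expected to account for about 80% of sales. The code above ranks the six product categories instead, so it shows which categories dominate spending rather than testing the principle itself:
  - *Fresh*, *Grocery*, and *Milk* (three of the six categories) together contribute about 77% of total spending, based on the column means reported above.
  - *Frozen*, *Detergents_Paper*, and *Delicassen* make up the remaining share, with *Delicassen* the smallest.
  - A customer-level check (total spending per customer, sorted from largest to smallest) is still needed to answer the Pareto question.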

- **Correlation Matrix for Quantitative Variables**:
  The correlation matrix for the quantitative variables indicates the following:
  - `Fresh` has a moderate positive correlation with `Frozen` (0.35) and a low positive correlation with `Delicassen` (0.24).
  - `Milk` has a high positive correlation with `Grocery` (0.73) and `Detergents_Paper` (0.66), suggesting that these variables tend to increase together.
  - `Grocery` has a very high positive correlation with `Detergents_Paper` (0.92), indicating that spending in these categories is closely related.
  - `Frozen` has a low negative correlation with `Detergents_Paper` (-0.13), showing a weak inverse relationship.
  - `Delicassen` has moderate positive correlations with `Milk` (0.41) and `Frozen` (0.39).

- **Correlation Matrix Plot**:
  The correlation matrix heatmap visually represents the strength of relationships between the quantitative variables. With the `coolwarm` colormap, strong positive correlations appear in red, while weak or negative correlations appear in blue.

- **Pairwise Plots of Quantitative Variables**:
  The pairwise plots show the scatterplots and histograms for each pair of quantitative variables. These plots visually demonstrate the relationships between each pair of features, with clear signs of skewness in most of the quantitative variables. The diagonal histograms show the individual distribution of each variable. Additionally, the scatter plots suggest some potential relationships, such as the strong association between `Milk` and `Grocery`, which is also reflected in the correlation matrix.
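
The IQR rule flags far more points than the z-score rule in the outlier counts above. One likely reason is that extreme values inflate the mean and standard deviation that the z-score depends on, while quartiles barely move. The toy sample below (invented values, not the lab data) sketches this effect.

```python
import numpy as np

# Toy right-skewed sample: 18 ordinary values plus two extreme ones.
x = np.array(list(range(1, 19)) + [150, 200], dtype=float)

# IQR rule: fences are built from quartiles, which extreme values barely move.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Z-score rule: the mean and std are both pulled up by the extreme values,
# so the second-largest value no longer looks extreme.
z = (x - x.mean()) / x.std()  # population std (ddof=0), as scipy.stats.zscore uses by default
z_flags = np.abs(z) > 3

print(iqr_flags.sum(), z_flags.sum())  # IQR flags both extremes, z-score only the largest
```

For heavily skewed columns like the ones in this dataset, it is therefore reasonable to treat the two counts as different views rather than expecting them to agree.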


# Challenge 2 - Data Cleaning and Transformation

If your conclusion from the previous challenge is the data need cleaning/transformation, do it in the cells below. However, if your conclusion is the data need not be cleaned or transformed, feel free to skip this challenge. But if you do choose the latter, please provide rationale.

In [None]:
# Your code here

# Challenge 2 - Data Cleaning and Transformation
# From the previous analysis (Challenge 1), we noticed that:
# 1. There are outliers in several quantitative columns (e.g., Fresh, Milk, Grocery, etc.).
# 2. Some variables like Fresh, Milk, Grocery, and others are highly skewed.
# 3. There are no missing values or duplicate rows in the dataset.
# Based on these insights, we may consider transforming skewed variables and/or handling outliers.
# Below, we will:
# - Apply transformations if necessary (e.g., log transformation for skewed variables)
# - Handle outliers using different strategies (e.g., removal, capping)

# Import necessary libraries
import numpy as np
import pandas as pd

# Log transformation for highly skewed columns (if necessary)
skewed_columns = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
customers_transformed = customers.copy()

# Apply log transformation to reduce skewness
# np.log1p(x) computes log(1 + x), which stays defined if a value is 0
customers_transformed[skewed_columns] = np.log1p(customers_transformed[skewed_columns])

# Visualize the distribution after transformation
print("\nDistribution of Quantitative Variables after Transformation:")
customers_transformed[skewed_columns].hist(figsize=(12, 10), bins=20)
plt.suptitle('Histograms of Quantitative Columns (After Transformation)')
plt.show()

# Handling outliers (IQR method)
Q1 = customers_transformed[skewed_columns].quantile(0.25)
Q3 = customers_transformed[skewed_columns].quantile(0.75)
IQR = Q3 - Q1

# Define a threshold for identifying outliers
outlier_threshold = 1.5

# Remove outliers based on IQR method (capping or removal could be an option here as well)
customers_no_outliers = customers_transformed[~((customers_transformed[skewed_columns] < (Q1 - outlier_threshold * IQR)) | 
                                                (customers_transformed[skewed_columns] > (Q3 + outlier_threshold * IQR))).any(axis=1)]

# Checking the shape of the dataset after transformations
print(f"\nShape after transformation and outlier removal: {customers_no_outliers.shape}")

# Leave space for better readability
print("\n")  # Adding space after shape print

# Display a few rows to verify the changes
# (print is required here because only a cell's last expression is displayed automatically)
print(customers_no_outliers.head())

# Checking the descriptive statistics again after the transformation
print("\nDescriptive Statistics for Transformed Quantitative Variables:")
print(customers_no_outliers[skewed_columns].describe())

# Rechecking skewness after transformation
print("\nSkewness of Transformed Quantitative Variables:")
skewness_after_transformation = customers_no_outliers[skewed_columns].skew()
print(skewness_after_transformation)


### Your observations here

- **Descriptive Statistics for Transformed Quantitative Variables**:
  The summary statistics for the transformed quantitative variables are as follows:
  - `Fresh`: Mean = 8.93, Std = 1.12, Min = 5.55, Max = 11.63
  - `Milk`: Mean = 8.12, Std = 1.01, Min = 5.31, Max = 10.90
  - `Grocery`: Mean = 8.42, Std = 1.01, Min = 5.41, Max = 11.44
  - `Frozen`: Mean = 7.43, Std = 1.13, Min = 4.52, Max = 10.46
  - `Detergents_Paper`: Mean = 6.79, Std = 1.61, Min = 1.79, Max = 10.62
  - `Delicassen`: Mean = 6.80, Std = 1.03, Min = 3.85, Max = 9.71

- **Distribution of Quantitative Variables after Transformation**:
  The histograms show the distribution of each variable after the log transformation and before outlier removal. Most variables look more symmetric and less skewed than before, although `Frozen`, `Milk`, and `Delicassen` still display some mild skewness. Because the skewness values below are computed after outlier removal, they need not match these histograms exactly.

- **Skewness of Transformed Quantitative Variables**:
  The skewness values for the transformed quantitative variables indicate that the transformations were successful in reducing skewness:
  - `Fresh`: -0.68 (moderately skewed to the left)
  - `Milk`: -0.04 (approximately normal)
  - `Grocery`: 0.03 (approximately normal)
  - `Frozen`: -0.13 (slightly skewed to the left)
  - `Detergents_Paper`: -0.02 (approximately normal)
  - `Delicassen`: -0.35 (slightly skewed to the left)

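  These values are all much closer to 0 than before the transformation, so the distributions are now far more symmetric. K-Means does not require normally distributed features, but because it relies on Euclidean distances it benefits from features without long tails and extreme values.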

- **Shape After Transformation and Outlier Removal**:
  The dataset now has 398 rows and 8 columns. The log transformation does not change the shape; outlier removal dropped 42 of the 440 rows (about 9.5%).

---

The transformation and outlier removal steps seem to have made significant improvements to the data's distribution and shape, setting the dataset up for further analysis or machine learning tasks.
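
The effect of the log transformation on skewness can be sketched on synthetic data. The sample below is drawn from a log-normal distribution purely for illustration (it is not the lab data), and `sample_skewness` is a hypothetical helper using the simple moment-based definition, so its values will differ slightly from the bias-corrected figure that pandas `.skew()` reports.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
# Synthetic right-skewed "spending" values, not the real dataset.
spend = rng.lognormal(mean=8.0, sigma=1.0, size=1000)

logged = np.log1p(spend)     # log(1 + x); defined even when x = 0
restored = np.expm1(logged)  # inverse transform, back to the original units

print(sample_skewness(spend), sample_skewness(logged))
```

The inverse `np.expm1` is useful later on, for example when cluster centers found on the transformed data need to be reported in the original spending units.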


# Challenge 3 - Data Preprocessing

One problem with the dataset is that the value ranges differ remarkably across categories (e.g. `Fresh` and `Grocery` compared to `Detergents_Paper` and `Delicassen`). If you made this observation in the first challenge, you've done a great job! This means you not only completed the bonus questions in the previous Supervised Learning lab but also dug deep into [*feature scaling*](https://en.wikipedia.org/wiki/Feature_scaling). Keep up the good work!

Diverse value ranges in different features could cause issues in our clustering. The way to reduce the problem is through feature scaling. We'll use this technique again with this dataset.

#### We will use the `StandardScaler` from `sklearn.preprocessing` and scale our data. Read more about `StandardScaler` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

*After scaling your data, assign the transformed data to a new variable `customers_scale`.*

In [None]:
# Your import here:

# Importing the StandardScaler class from sklearn.preprocessing
# StandardScaler is used to standardize the features by removing the mean and scaling to unit variance.
# This is especially useful for machine learning models that are sensitive to the scale of the data, such as clustering algorithms.
from sklearn.preprocessing import StandardScaler

# Your code here:

# Applying StandardScaler to scale the data

# Select the columns to scale (quantitative variables)
quantitative_columns = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']

# Initialize the StandardScaler
# The StandardScaler will scale the data so that each feature will have a mean of 0 and a standard deviation of 1.
scaler = StandardScaler()

# Fit the scaler to the data and transform the data in one step
# The fit method computes the mean and standard deviation for each feature,
# and the transform method scales the data accordingly.
customers_scale = customers_no_outliers[quantitative_columns].copy()
customers_scale[quantitative_columns] = scaler.fit_transform(customers_scale[quantitative_columns])

# Checking the shape of the scaled data
# This ensures that the number of rows and columns has not changed after scaling.
print(f"Shape after scaling: {customers_scale.shape}")

# Leave space for better readability
print("\n")  # Adding space after shape print

# Displaying the first few rows to verify the scaling
# This will show the scaled values for a quick check.
customers_scale.head()

### Challenge 3 - Data Preprocessing: Feature Scaling

In this challenge, we addressed the issue of different value ranges across various categories (e.g., `Fresh` and `Grocery` compared to `Detergents_Paper` and `Delicassen`). This discrepancy can impact models like clustering, where the scale of features can significantly affect the results.

#### Why Scale Features?
Scaling brings all features onto a comparable scale so that each one contributes comparably to the distance calculations that clustering relies on. Without scaling, variables with larger ranges like `Fresh` and `Grocery` could dominate those distances, while features like `Detergents_Paper` and `Delicassen` would have little influence.

#### Standardization: The Approach We Used
We used **StandardScaler** from `sklearn.preprocessing` to standardize the data. Standardization scales the data by removing the mean and scaling to unit variance. This transforms the data so that it has a mean of 0 and a standard deviation of 1, which is especially important for machine learning algorithms that are sensitive to the scale of data.

- **Formula for Standardization:**
  $$
  z = \frac{(X - \mu)}{\sigma}
  $$

  Where:
  - $X$ is the value of the feature,
  - $\mu$ is the mean of the feature,
  - $\sigma$ is the standard deviation of the feature.
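
The formula can be applied by hand on a toy frame, as a minimal sketch. The values are illustrative only. The one detail worth noting is the choice of $\sigma$: `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas defaults to the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Toy two-feature frame; values are illustrative only.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 10.0, 20.0, 40.0]})

# z = (X - mu) / sigma, with sigma as the population std (ddof=0),
# which is the convention StandardScaler follows.
z = (toy - toy.mean()) / toy.std(ddof=0)

print(z.mean().round(12))  # approximately 0 for every column
print(z.std(ddof=0))       # 1 under the same ddof=0 convention
print(z.std())             # pandas default (ddof=1): slightly above 1
```

This is why `describe()` on standardized data typically shows a standard deviation marginally above 1 rather than exactly 1.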

#### Effect of Scaling on the Data
After applying the **StandardScaler**, every feature is on a common scale: by construction, each column has a mean of 0 and a standard deviation of 1. The cell above prints only the shape and the first rows, so this can be confirmed with `customers_scale.describe()`; note that pandas reports the sample standard deviation, which will be marginally above 1.

#### Shape and Transformation
We checked the shape of the dataset after scaling: **398 rows** and **6 columns** (the quantitative variables only; `Channel` and `Region` were left out). The scaled data is now ready for modeling tasks that require standardized features.

We also printed the first few rows of the scaled data to verify the transformations.

This transformation ensures that the models will treat all features equally, without any one feature dominating the learning process due to scale differences.

Next steps will involve applying clustering or other machine learning models, depending on your goals.


### Transition from Feature Scaling to Data Quality Check

After completing the feature scaling in **Challenge 3**, where we used **StandardScaler** to standardize the data, the dataset is now ready for modeling. However, before we proceed with clustering or any further analysis, it is important to ensure that the scaled data is clean and free from any issues that might affect the results.

#### Why Check for Missing or Erroneous Values?
While we performed the necessary transformations in **Challenge 3**, it is good practice to verify that the scaling process has not introduced missing, infinite, or implausibly extreme values. Such values can distort the clustering results, so we will perform a final check.

#### What We Will Do:
- **Check for missing values**: We will verify if there are any missing values in the scaled dataset.
- **Check for negative values**: Standardization centers each feature at 0, so negative values are expected and simply mark customers below the feature mean. We count them as a sanity check; a column with no negative values would indicate that scaling was not applied.
- **Check for extreme values**: We will inspect the minimum and maximum values to ensure they fall within a reasonable range.

We will also visualize the relationships between the scaled features to confirm that they look consistent after scaling. This will help us understand the structure of the data before we move on to clustering.

Let's now perform these checks to ensure the data is clean and ready for further analysis.
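
As a compact sketch of what "clean" means for standardized data, the hypothetical helper below summarizes three things: whether all values are finite, how far the column means are from 0, and what share of values is negative. Negative values are expected after standardization, so the last figure is informational rather than an error count. The helper name and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def summarize_scaled(frame):
    """Hypothetical helper: basic sanity figures for a standardized DataFrame."""
    values = frame.to_numpy(dtype=float)
    return {
        "all_finite": bool(np.isfinite(values).all()),             # no NaN or +/-inf
        "max_abs_mean": float(np.abs(values.mean(axis=0)).max()),  # should be near 0
        "share_negative": float((values < 0).mean()),              # expected to be well above 0
    }

# Toy frame whose columns already have mean 0.
toy = pd.DataFrame({"a": [-1.0, 0.0, 1.0], "b": [1.0, -1.0, 0.0]})
print(summarize_scaled(toy))
```

Applied to the notebook's data, the call would be `summarize_scaled(customers_scale)`.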


In [None]:
# Checking for missing values in the scaled data
print("\nChecking for Missing Values in Scaled Data:")
print(customers_scale.isnull().sum())

# Counting negative values in the scaled data
# StandardScaler centers each feature at 0, so negative values are expected:
# they mark customers below the feature mean and do not indicate an error.
print("\nCount of Negative (Below-Mean) Values in Scaled Data:")
print((customers_scale < 0).sum())

# Checking for any other anomalies, such as extremely large or small values
# You can define thresholds if needed, but let's just inspect the max and min values
print("\nMaximum and Minimum Values in Scaled Data:")
print(customers_scale.min())
print(customers_scale.max())

# If necessary, you can also visualize the scaled data using pairplots to inspect the relationships visually
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(customers_scale)
plt.suptitle('Pairwise Plots of Scaled Quantitative Variables', y=1.02)
plt.show()


### Scaling and Visualizing the Data

In this section, we reviewed the scaled data after applying the **StandardScaler** from **Challenge 3**. The goal of scaling was to standardize the data, ensuring that each feature has a mean of 0 and a standard deviation of 1. This transformation is crucial for many machine learning algorithms, like K-Means clustering, that are sensitive to the scale of the data.

#### 1. **Checking for Missing and Negative Values**
We first checked for any missing values in the scaled dataset. As expected, there were no missing values (all counts were 0). Then, we examined whether any values in the scaled dataset were negative:
- **Negative values**: These are expected in standardized data, because scaling subtracts each feature's mean. Every feature therefore has negative values, which simply mark data points below that feature's mean.

#### 2. **Range of Values in the Scaled Data**
Next, we checked the minimum and maximum values for each feature:
- The range of values in the scaled data shows that the features have been standardized, with negative values for minimums and positive values for maximums. This confirms that the scaling worked as intended, and the data is now on a comparable scale.

#### 3. **Pairwise Plots of Scaled Quantitative Variables**
To visualize the relationships between the scaled features, we created **pairwise plots**:
- The **diagonal histograms** show the distribution of each scaled feature.
- The **scatter plots** off the diagonal display the relationships between pairs of features. Some pairs, like `Milk` and `Grocery`, show clear linear relationships, which corresponds with their high correlation in the correlation matrix.

These steps and visualizations confirm that the data has been successfully scaled and is now ready for clustering or other analyses.

- **Next steps**: We will proceed to clustering the data in **Challenge 4** by applying K-Means.


### Checking Multicollinearity in the Scaled Data

After scaling the data in **Challenge 3**, it is worth checking whether the features are strongly correlated with one another. Correlated features do not break distance-based algorithms such as K-Means, but they carry overlapping information, so the shared signal is effectively counted more than once in the distance calculation and can dominate the clustering.

#### 1. **Correlation Matrix of Scaled Quantitative Variables**
We will compute the **correlation matrix** to check how strongly each feature is correlated with others. High correlations between features can indicate potential multicollinearity, which might require us to remove one of the correlated features (like **Milk** and **Grocery**, which had a high correlation earlier).

#### 2. **Variance Inflation Factor (VIF)**
Additionally, we'll calculate the **Variance Inflation Factor (VIF)** for each feature. VIF quantifies how much the variance of a regression coefficient is inflated due to collinearity with the other features. As a rule of thumb, a VIF above 5 signals noteworthy multicollinearity, and a VIF above 10 signals a serious problem.
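
For reference, the standard definition of the VIF for feature $j$ is given below, where $R_j^2$ is the coefficient of determination obtained by regressing feature $j$ on all the other features:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

Under this definition, $R_j^2 = 0.8$ corresponds to a VIF of 5 and $R_j^2 = 0.9$ to a VIF of 10, which is where the usual thresholds come from.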

Let's now proceed with computing the correlation matrix and VIF to evaluate the multicollinearity in the scaled dataset.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Step 1: Compute the correlation matrix
correlation_matrix = customers_scale.corr()

# Print the correlation matrix of the scaled quantitative variables
print("\nCorrelation Matrix of Scaled Quantitative Variables:")
print(correlation_matrix)

# Step 2: Plot the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of Scaled Quantitative Variables')
plt.show()

# Step 3: Calculate Variance Inflation Factor (VIF)
X = add_constant(customers_scale)  # Adding constant for intercept

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print("\nVariance Inflation Factor (VIF) for each feature:")
print(vif_data)


### Checking Multicollinearity in the Scaled Data

In this section, we assessed the correlation between the features in the scaled data to understand how correlated the variables are and whether there might be multicollinearity issues that could affect the clustering algorithms (K-Means and DBSCAN).

#### 1. **Correlation Matrix of Scaled Quantitative Variables**:
The correlation matrix was computed to analyze the relationships between the scaled quantitative features. The matrix shows how strongly each feature is correlated with others. Here are some key observations:
- **Milk** and **Grocery** have a high positive correlation (0.78), indicating that when one increases, the other tends to increase as well.
- **Detergents_Paper** and **Grocery** also show a strong positive correlation (0.79), suggesting they move together.
- There are no features with perfect correlation (1), but some moderate correlations exist, particularly between related categories like **Frozen** and **Fresh**.

#### 2. **Correlation Matrix Plot**:
The heatmap visually confirms the correlation results. With the `coolwarm` colormap, strong positive correlations appear in red and the lowest values in blue. As expected, the **Milk** and **Grocery** cells are distinctly red, indicating a strong positive relationship.

#### 3. **Variance Inflation Factor (VIF)**:
VIF quantifies multicollinearity by measuring how well each feature can be predicted from the other features. High VIF values indicate redundancy among features; for a distance-based method like K-Means, this means the shared information is weighted more heavily than it should be.
- **VIF values** below 5 generally indicate low multicollinearity. In this case, all features have relatively low VIFs:
  - The highest VIF is for **Grocery** (3.99), which indicates some degree of multicollinearity with other features, particularly **Milk** and **Detergents_Paper**.
  - The VIF values for **Fresh**, **Frozen**, **Delicassen**, and **Detergents_Paper** are low, indicating they are not highly collinear with other features.
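
As a cross-check on values like these, the VIFs can also be read off the diagonal of the inverse correlation matrix. The sketch below uses NumPy only and synthetic features (`a`, `b`, `c` are made-up names, not columns of the dataset); it also verifies one value against a direct regression.

```python
import numpy as np

# Synthetic features (assumption): b is built to correlate with a, c is independent
rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)
c = rng.normal(size=n)
X = np.column_stack([a, b, c])

# VIF_j equals the j-th diagonal entry of the inverse correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

# Cross-check feature 0 via 1 / (1 - R^2), regressing a on b and c with an intercept
Z = np.column_stack([np.ones(n), X[:, 1:]])
coef, *_ = np.linalg.lstsq(Z, X[:, 0], rcond=None)
resid = X[:, 0] - Z @ coef
r2 = 1.0 - resid.var() / X[:, 0].var()
vif_0_check = 1.0 / (1.0 - r2)

print("VIF from inverse correlation matrix:", np.round(vif, 3))
print("VIF for feature 0 via regression:   ", round(vif_0_check, 3))
```

The results should agree with `variance_inflation_factor` from statsmodels when a constant column is included, as it is in the cell above.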

Given that the VIFs are all below 5, multicollinearity is not severe enough to require action before clustering. However, the strong correlations between **Milk** and **Grocery** (0.78) and between **Grocery** and **Detergents_Paper** (0.79) suggest that these features share information, so removing one feature from each pair may still make sense to reduce redundancy.

These checks provide confidence that the scaled data is in good shape. The one open question, explored next, is whether to reduce this redundancy before proceeding to Challenge 4.


### Exploratory Clustering with PCA

After assessing multicollinearity, we can proceed with **exploratory clustering** to visualize how the data behaves in lower dimensions. As we observed, **Milk** and **Grocery** are highly correlated (0.78), and **Grocery** and **Detergents_Paper** also show a strong correlation (0.79), suggesting that these features share similar information. This redundancy could affect clustering performance, so exploring these relationships using **Principal Component Analysis (PCA)** can help us understand how the features interact and contribute to the clustering process.

#### 1. **Why PCA?**
PCA helps reduce the dimensionality of the data by transforming the original features into a smaller set of uncorrelated components (principal components) that capture most of the variance. By visualizing the data in 2D or 3D, we can assess whether natural clusters emerge and gain insights into how features like **Milk**, **Grocery**, **Detergents_Paper**, and other variables behave together.

#### 2. **Applying PCA**:
We will apply PCA to the scaled data and reduce it to 2 dimensions to visualize the distribution and possible clustering of the data. This allows us to explore whether the highly correlated features (e.g., **Milk**, **Grocery**, and **Detergents_Paper**) contribute similarly to the clustering.

#### 3. **Next Steps**:
- We will apply **PCA** to the scaled dataset.
- Visualize the results in 2D and examine whether distinct groups or patterns emerge in the data.
- This will give us further insights before applying the clustering algorithms themselves: **K-Means** in **Challenge 4** and **DBSCAN** in **Challenge 5**.

Let's proceed with applying **PCA** and visualizing the results.


In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd

# Step 1: Apply PCA to reduce the data to 2 dimensions
pca = PCA(n_components=2)
pca_components = pca.fit_transform(customers_scale)

# Step 2: Create a DataFrame to hold the PCA components
pca_df = pd.DataFrame(data=pca_components, columns=['PCA1', 'PCA2'])

# Step 3: Visualize the PCA results in a 2D scatter plot (all features considered)
plt.figure(figsize=(10, 6))
plt.scatter(pca_df['PCA1'], pca_df['PCA2'], c='blue', edgecolors='k', alpha=0.7)
plt.title('PCA of Scaled Data - 2D Visualization', fontsize=14)
plt.xlabel('PCA1', fontsize=12)
plt.ylabel('PCA2', fontsize=12)
plt.grid(True)
plt.show()

# Step 4: Explained variance to understand how much information each principal component explains
explained_variance = pca.explained_variance_ratio_
print("\nExplained Variance Ratio for the 2 Principal Components:")
print(explained_variance)

# Step 5: Analyze the component loadings (for all features)
loadings = pca.components_  # The loadings give us the weight of each feature in the components
features = customers_scale.columns

# Create a DataFrame of the loadings for better visualization
loadings_df = pd.DataFrame(loadings.T, columns=['PC1', 'PC2'], index=features)
print("\nComponent Loadings:")
print(loadings_df)

# Visualizing the loadings for each feature
plt.figure(figsize=(10, 6))
plt.bar(loadings_df.index, loadings_df['PC1'], alpha=0.6, label='PC1')
plt.bar(loadings_df.index, loadings_df['PC2'], alpha=0.6, label='PC2')
plt.title('PCA Component Loadings for Scaled Features', fontsize=14)
plt.xlabel('Features', fontsize=12)
plt.ylabel('Loading Value', fontsize=12)
plt.xticks(rotation=45)
plt.legend()
plt.show()


### Exploratory Clustering with PCA

In this section, we applied **Principal Component Analysis (PCA)** to reduce the dimensionality of the data and better visualize how the features interact with each other. PCA helps to uncover underlying patterns in the data by projecting the original features onto a smaller set of uncorrelated components (principal components), which capture the majority of the variance in the data.

#### 1. **PCA of Scaled Data - 2D Visualization**:
The first plot visualizes the data in two dimensions after applying PCA. The points are projected onto the first two principal components (PC1 and PC2). The distribution of the points in the plot does not reveal any clear clusters, but it provides an overview of how the data is spread across the two components. Despite the spread, it’s worth analyzing the explained variance ratio to assess how well these two components represent the original data.

#### 2. **Explained Variance Ratio for the 2 Principal Components**:
The explained variance ratio indicates how much of the total variance is explained by each principal component. In this case:
- **PC1** explains **44.69%** of the variance.
- **PC2** explains **25.52%** of the variance.

Together, **PC1** and **PC2** explain **70.21%** of the total variance in the data, meaning these two components capture a significant portion of the variance in the original features.
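
To judge whether two components are enough, it can help to look at the cumulative explained variance across all components. The sketch below is illustrative only: the six synthetic features are an assumption and will not reproduce the percentages reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Six synthetic features (assumption); the first three share a common factor
rng = np.random.default_rng(2)
base = rng.normal(size=(300, 1))
X = np.hstack([base + 0.5 * rng.normal(size=(300, 3)), rng.normal(size=(300, 3))])
X = StandardScaler().fit_transform(X)

# Fit PCA with all components to see how the variance accumulates
pca_full = PCA().fit(X)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.2%} (cumulative {c:.2%})")
```

With the notebook's data, the same loop could be run on `customers_scale` in place of `X`.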

#### 3. **Component Loadings**:
The component loadings show the relationship between the original features and the principal components. Here are some key observations:
- **Milk** and **Grocery** contribute significantly to **PC1**, reflecting their high correlation.
- **Frozen** and **Delicassen** contribute more to **PC2**, which explains the variance not captured by **PC1**.
  
#### 4. **PCA Component Loadings for Scaled Features**:
The bar plot of the PCA component loadings shows the contributions of each feature to **PC1** and **PC2**:
- **Grocery** and **Milk** are the most influential features for **PC1**, while **Frozen** and **Delicassen** are important for **PC2**.

#### 5. **Conclusion**:
- **Milk** and **Grocery** are highly correlated, and we can consider removing one of them to reduce redundancy.
- **Detergents_Paper** also shares some variance with **Grocery**, but it is less dominant in explaining the variance in **PC1**.
- **Frozen** and **Delicassen** offer unique contributions to **PC2** and should be kept for further analysis.

The PCA results suggest that we may reduce the feature set by removing one feature from each highly correlated pair (**Milk**/**Grocery** and **Grocery**/**Detergents_Paper**) to simplify the dataset without losing too much important variance.


### Reducing Features Based on PCA Insights

Based on the **PCA results** and our analysis of multicollinearity, we have observed that the features **Milk** and **Grocery** are highly correlated (0.78), and **Grocery** and **Detergents_Paper** also share a strong correlation (0.79). 

PCA indicated that **Grocery** is a key contributor to the variance explained by **PC1**. Given this, we can keep **Grocery** in the dataset and drop the redundant features (**Milk** and **Detergents_Paper**), as they are less essential for capturing the variance in the data.

#### Why Drop Milk and Detergents_Paper?
- **Milk** and **Grocery** both contribute significantly to **PC1**, so retaining **Grocery** allows us to keep the critical information without redundancy.
- **Detergents_Paper** has a high correlation with **Grocery** and contributes less to the variance than **Grocery**, making it a candidate for removal.

#### Next Step:
We will drop the features **Milk** and **Detergents_Paper** from the dataset, resulting in a reduced feature set that is easier to work with for clustering models like K-Means and DBSCAN.

Let's proceed with dropping these features and preparing the data for clustering.


In [None]:
# Dropping the 'Milk' and 'Detergents_Paper' columns from the dataset
customers_cleaned = customers_scale.drop(columns=['Milk', 'Detergents_Paper'])

# Checking the shape of the dataset after dropping the columns
print(f"Shape of the dataset after feature removal: {customers_cleaned.shape}")

# Displaying the first few rows of the cleaned dataset
# (print is needed here because only the last expression in a cell is displayed automatically)
print("\nFirst rows of the cleaned data:")
print(customers_cleaned.head())

# Numerical summary for the cleaned data to check for any abnormalities
print("\nNumerical Summary for the Cleaned Data:")
print(customers_cleaned.describe())

# Visualizing the cleaned dataset with a pairplot
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting pairplot to visualize relationships in the cleaned dataset
sns.pairplot(customers_cleaned)
plt.suptitle('Pairwise Plots of Cleaned Data', y=1.02)
plt.show()

# Checking for correlations after dropping the features to ensure no multicollinearity issues remain
correlation_matrix_cleaned = customers_cleaned.corr()
print("\nCorrelation Matrix of Cleaned Data:")
print(correlation_matrix_cleaned)

# Plotting the correlation matrix to visualize any remaining correlations
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_cleaned, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of Cleaned Data')
plt.show()


### Feature Removal and Data Cleaning

In this section, we removed the highly correlated features **Milk** and **Detergents_Paper** to address the multicollinearity issue that was identified earlier. After this step, we verified the quality of the cleaned data through a numerical summary and visual inspections.

#### 1. **Shape of the Dataset After Feature Removal**:
The dataset now has **398 rows** and **4 columns** after removing the correlated features. This reduction in the number of features simplifies the model and reduces redundancy.

#### 2. **Numerical Summary for the Cleaned Data**:
The numerical summary shows the key statistics for the remaining features **Fresh**, **Grocery**, **Frozen**, and **Delicassen**. Each feature has a mean close to 0 (as expected after scaling), and the standard deviation is around 1, indicating that the data is well-centered and scaled.

#### 3. **Pairwise Plots of the Cleaned Data**:
The pairwise plots provide insights into the relationships between the remaining features. These plots display the distribution of each feature on the diagonal and scatter plots between feature pairs off the diagonal. Notably, there seems to be some relationship between **Fresh** and **Frozen**, as well as between **Grocery** and **Delicassen**.

#### 4. **Correlation Matrix of the Cleaned Data**:
The correlation matrix shows the relationships between the cleaned features:
- **Fresh** has a moderate positive correlation with **Frozen** (0.34), indicating that these features might be related in some way.
- **Grocery** has a positive correlation with **Delicassen** (0.32), suggesting some similarity in behavior.
- There are no high correlations between the features after removing the highly correlated pairs.

#### 5. **Correlation Matrix Plot**:
The heatmap visually confirms these results: no strong correlations remain among the four features, and the values reported above (0.34 and 0.32) are well below the 0.78 to 0.79 seen before the feature removal. This suggests that the multicollinearity issue has been resolved.
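
Rather than reading the heatmap by eye, the strongest remaining pair can be extracted programmatically. This is a minimal sketch on synthetic data; the DataFrame `df` reuses the four column names for readability, but its values and the induced correlation are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data (assumption), with one induced correlation
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["Fresh", "Grocery", "Frozen", "Delicassen"])
df["Frozen"] = 0.7 * df["Fresh"] + df["Frozen"]

# Absolute correlations, keeping each unordered pair once (upper triangle, no diagonal)
corr = df.corr().abs()
cols = corr.columns
i_idx, j_idx = np.triu_indices(len(cols), k=1)
pairs = pd.Series(
    corr.to_numpy()[i_idx, j_idx],
    index=[f"{cols[i]}-{cols[j]}" for i, j in zip(i_idx, j_idx)],
).sort_values(ascending=False)

print(pairs)
```

Applied to `correlation_matrix_cleaned`, the first entry of `pairs` would give the largest remaining correlation directly.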

### Conclusion:
The dataset is now cleaned and ready for clustering analysis, with no multicollinearity concerns. We have reduced the dimensionality by removing one feature from each pair of highly correlated variables. The cleaned data is now better prepared for **Challenge 4** and **Challenge 5**, where we will perform clustering using K-Means and DBSCAN respectively.

We can now proceed to the next challenges with confidence that the data is in good shape.


# Challenge 4 - Data Clustering with K-Means

Now let's cluster the data with K-Means first. Initiate the K-Means model, then fit your scaled data. In the data returned from the `.fit` method, there is an attribute called `labels_` which is the cluster number assigned to each data record. What you can do is to assign these labels back to `customers` in a new column called `customers['labels']`. Then you'll see the cluster results of the original data.

In [None]:
# Importing KMeans from sklearn.cluster
# KMeans is a popular clustering algorithm that assigns data points to clusters based on the proximity to the cluster centroids.
# We will use this algorithm to cluster the cleaned and scaled data and assign labels to each data point in the dataset.

from sklearn.cluster import KMeans

# Your code here:

# Step 1: Initialize the K-Means model with the desired number of clusters (e.g., 3)
# n_clusters=3 means we want to divide the data into 3 clusters.
# random_state is set for reproducibility of the results (it ensures the same result every time the code is run).

kmeans = KMeans(n_clusters=3, random_state=42)  # Removed n_jobs as it is no longer valid

# Step 2: Fit the K-Means model to the cleaned and scaled data
# The fit method computes the optimal cluster centroids based on the data and assigns labels to the data points.
kmeans.fit(customers_cleaned)

# Step 3: Assign the cluster labels to the customers_cleaned DataFrame
# The labels_ attribute of the KMeans object stores the cluster label for each data point.
# We store these labels in a new 'labels' column of customers_cleaned.
customers_cleaned['labels'] = kmeans.labels_

# Step 4: Check the first few rows of the dataset to see the cluster labels
# The head() method displays the first five rows of the DataFrame along with the newly assigned cluster labels.
customers_cleaned.head()


### K-Means Clustering Results

After applying the **K-Means clustering** algorithm, we assigned each data point to one of the three clusters. The results are stored in the `labels` column of the `customers_cleaned` dataset.

#### 1. **Cluster Assignment**:
The clustering algorithm assigns each customer a label corresponding to the cluster they belong to. The first few rows of the output, which now include the assigned labels, suggest the following:
- **Cluster 0**: Customers with lower values in features like `Fresh` and `Grocery`.
- **Cluster 1**: Customers with higher values in `Frozen`, `Delicassen`, and some other features.
  
#### 2. **Cluster Distribution**:
It’s important to inspect how many customers fall into each cluster. This can help to understand the cluster distribution and whether the algorithm produced balanced or skewed clusters.

#### 3. **Next Steps**:
The next step involves evaluating the clusters formed and determining whether they make sense in the context of your business problem. Visualizing the clusters and checking for any patterns or trends in the cluster assignments will be useful for analysis.

Now let's move on to visualizing the clusters and checking their characteristics.


### Determining the Number of Clusters with the Elbow Method

To determine the optimal number of clusters, we can use the **elbow method**. This method calculates the sum of squared distances (inertia) between samples and their cluster centroids for different numbers of clusters. By plotting the inertia for a range of cluster numbers, we can visually identify the "elbow" point, which indicates the ideal number of clusters.

In this section, we use the elbow method as the guiding idea for choosing the number of clusters for the K-Means algorithm.

As a first candidate, we try 2 clusters and evaluate the results.
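
The inertia sweep behind the elbow method could be sketched as follows. This is an illustration under stated assumptions: the data comes from `make_blobs` rather than the customer dataset, and the range of `k` from 1 to 8 is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (assumption, for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    # inertia_ is the sum of squared distances of samples to their closest centroid
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` (for example with `plt.plot(k_values, inertias, marker='o')`) shows the elbow as the point where adding clusters stops reducing inertia substantially. On the notebook's data, the four feature columns of `customers_cleaned` would take the place of `X`.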


In [None]:
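# KMeans with 2 clusters

# Fit on the four feature columns only. customers_cleaned also carries the
# 'labels' column from the 3-cluster run, which must not be treated as a feature.
feature_cols = ['Fresh', 'Grocery', 'Frozen', 'Delicassen']
kmeans_2 = KMeans(n_clusters=2, random_state=42).fit(customers_cleaned[feature_cols])

# Predicting on the training data normally reproduces the assignments in kmeans_2.labels_
labels = kmeans_2.predict(customers_cleaned[feature_cols])

# Storing the cluster labels into a list
clusters = kmeans_2.labels_.tolist()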

In [None]:
# Assigning the cluster labels to the customers_cleaned DataFrame
# This assigns the calculated cluster labels to the 'Label' column of the DataFrame.
customers_cleaned['Label'] = clusters

# Display the first few rows to check the labels
customers_cleaned.head()


### K-Means Clustering with 2 Clusters

In this section, we applied **K-Means clustering** with **2 clusters** to the cleaned and scaled data.

#### 1. **KMeans with 2 Clusters**:
We initialized the **KMeans** model with 2 clusters and applied it to the scaled dataset (`customers_cleaned`). The model assigns each data point to one of the two clusters based on proximity to the cluster centroids.

- After fitting the model, we used the `.predict()` method to assign each data point to a cluster.
- The labels were stored in a list and then assigned to the `customers_cleaned` DataFrame in a new column called `Label`.

#### 2. **Viewing the Cluster Labels**:
We displayed the first few rows of the dataset, which now includes the cluster labels assigned by K-Means. The `Label` column reflects the cluster assignment (either 0 or 1) for each data point.

#### 3. **Cluster Results**:
From the output, we can see that the data has been assigned to two clusters:
- The first few rows show how the labels (0 and 1) are distributed across different features (`Fresh`, `Grocery`, `Frozen`, `Delicassen`).

The next step will involve counting the values in the `Label` column to understand the distribution of the clusters.



### Counting the Values in Clusters

In this step, we count the number of data points assigned to each cluster by K-Means. This allows us to understand how the data is distributed across the two clusters.

#### **1. Counting the Values in `Label`**:
The `Label` column, which was created by the 2-cluster K-Means run, indicates the cluster assignment for each data point (either 0 or 1). By counting the values in this column, we can see how many data points belong to each cluster.

We will use the `.value_counts()` method to count the occurrences of each cluster label and observe the distribution of the data points.



In [None]:
# Your code here:

# Counting the number of data points in each cluster
cluster_counts = customers_cleaned['Label'].value_counts()

# Displaying the cluster counts
print("\nCluster counts:")
print(cluster_counts)


### Cluster Counts

The K-Means algorithm has assigned the data points into two clusters. The cluster distribution is as follows:

- **Cluster 0**: 224 data points
- **Cluster 1**: 174 data points

This shows that the data is not evenly distributed across the clusters, with Cluster 0 containing more data points than Cluster 1. Understanding the distribution of the clusters can help with further analysis and interpretation, as it may indicate different patterns or groupings in the data.

This concludes the clustering process using K-Means with 2 clusters. We can proceed with further steps, such as visualizing the clusters or using other clustering techniques like DBSCAN.


# Challenge 5 - Data Clustering with DBSCAN

Now let's cluster the data using DBSCAN. Use `DBSCAN(eps=0.5)` to initiate the model, then fit your scaled data. In the data returned from the `.fit` method, assign the `labels_` back to `customers['labels_DBSCAN']`. Now your original data have two labels, one from K-Means and the other from DBSCAN.

In [None]:
from sklearn.cluster import DBSCAN 

# Your code here

# Step 1: Initialize the DBSCAN model
# eps=0.5 is the maximum distance between two points for them to be considered neighbors.
# min_samples=5 is the minimum number of samples in a neighborhood to form a core point.

dbscan = DBSCAN(eps=0.5, min_samples=5)

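# Step 2: Fit DBSCAN to the cleaned and scaled feature columns
# Only the four features are passed in; the K-Means label columns already stored in
# customers_cleaned are excluded so they do not distort the distance calculation.
# DBSCAN assigns a cluster label to each data point; noise points are labeled -1.
feature_cols = ['Fresh', 'Grocery', 'Frozen', 'Delicassen']
dbscan.fit(customers_cleaned[feature_cols])

# Step 3: Assign the cluster labels to the customers_cleaned DataFrame
# We assign the labels returned by DBSCAN to the 'labels_DBSCAN' column of the customers_cleaned DataFrame.
customers_cleaned['labels_DBSCAN'] = dbscan.labels_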

# Check the first few rows of the dataset to see the DBSCAN labels
customers_cleaned.head()


### DBSCAN Clustering

After applying DBSCAN to the cleaned and scaled data, we stored the cluster labels in the `labels_DBSCAN` column. DBSCAN marks points that do not belong to any dense region as noise (label `-1`). Here are some key observations:

- The dataset now carries both sets of labels. For comparison, the 2-cluster K-Means run stored in `Label` split the data into 224 and 174 points.
- **Noise points**: Points labeled `-1` are considered noise by DBSCAN. These points are not assigned to any cluster and are treated as outliers.

#### Example Data:
- **Fresh** and **Grocery** values indicate the feature values for each data point.
- **labels**: Cluster labels from K-Means.
- **labels_DBSCAN**: Cluster labels from DBSCAN.

This indicates that DBSCAN has found some outliers, as seen with the `-1` values in the `labels_DBSCAN` column.


### Counting the Values in `labels_DBSCAN`

After running DBSCAN clustering on the dataset, it is important to count how many data points were assigned to each cluster, including those that DBSCAN considers as noise (labeled as -1). In this step, we will count the occurrences of each label assigned by DBSCAN.


In [None]:
# Your code here

# Count the values in the DBSCAN labels
# This will show how many data points are assigned to each cluster, including the noise points labeled as -1.
customers_cleaned['labels_DBSCAN'].value_counts()


### DBSCAN Cluster Counts

The results from DBSCAN clustering show the distribution of data points across different clusters, including noise points:

- **Cluster -1 (Noise)**: 344 points (considered as outliers or noise)
- **Cluster 0**: 6 points
- **Cluster 1**: 6 points
- **Cluster 2**: 8 points
- **Cluster 3**: 5 points
- **Cluster 4**: 5 points
- **Cluster 5**: 10 points
- **Cluster 6**: 4 points
- **Cluster 7**: 5 points
- **Cluster 8**: 5 points

DBSCAN labeled the majority of the points (344 of 398) as noise and placed only a handful of points in each of nine small clusters. With `eps=0.5` and `min_samples=5`, most points evidently have too few close neighbors to form a dense region, which shows how sensitive DBSCAN is to the density threshold set by these two parameters.

#### Conclusion:
- DBSCAN detected several small, dense pockets of customers rather than a few large clusters.
- The large number of noise points (`-1` label) suggests that many data points don't belong to any defined cluster according to DBSCAN's criteria at these settings.
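
To see how strongly the noise count depends on `eps`, one could sweep a few values and tally the results. The sketch below runs on synthetic blobs, and the `eps` values are arbitrary illustrations rather than recommendations for this dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized 2D data (assumption; not the customer dataset)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

results = {}
for eps in (0.05, 0.2, 0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_noise = int((labels == -1).sum())               # points labeled as noise
    n_clusters = len(set(labels.tolist()) - {-1})     # distinct labels, noise excluded
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

For a fixed `min_samples`, increasing `eps` can only reduce the number of noise points, so a sweep like this gives a quick sense of where clusters begin to form.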


# Challenge 6 - Compare K-Means with DBSCAN

Now we want to visually compare how K-Means and DBSCAN have clustered our data. We will create scatter plots for several columns. For each of the following column pairs, plot a scatter plot using `labels` and another using `labels_DBSCAN`. Put them side by side to compare. Which clustering algorithm makes better sense?

Columns to visualize:

* `Detergents_Paper` as X and `Milk` as y
* `Grocery` as X and `Fresh` as y
* `Frozen` as X and `Delicassen` as y

Visualize `Detergents_Paper` as X and `Milk` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Function to create scatter plots for given feature pairs using the cluster labels
def plot(x, y, hue):
    sns.scatterplot(x=x, 
                    y=y,
                    hue=hue)  # Coloring the points based on cluster labels
    plt.title(f'{x.name} vs {y.name}')  # Setting plot title dynamically using feature names
    plt.xlabel(x.name)  # Setting the x-axis label dynamically
    plt.ylabel(y.name)  # Setting the y-axis label dynamically
    plt.show()  # Display the plot


### Explanation: Why We Are Not Visualizing Detergents_Paper and Milk

In this analysis, we decided not to include **Detergents_Paper** and **Milk** in the scatter plots for clustering comparison. This is because, during the previous steps, we identified high multicollinearity between **Milk** and **Grocery**, as well as between **Detergents_Paper** and **Grocery**. To mitigate the redundancy and potential bias in our clustering results, we removed **Milk** and **Detergents_Paper** from the dataset before proceeding with the clustering analysis.

Given this, we have selected **Grocery**, **Fresh**, **Frozen**, and **Delicassen** as the features for our clustering comparison. By doing this, we aim to visualize and assess the performance of K-Means and DBSCAN clustering using more independent and diverse features.


Visualize `Grocery` as X and `Fresh` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Your code here:

# Visualizing Grocery vs Fresh by labels and labels_DBSCAN respectively

# We use the `plot()` function to visualize how the clusters formed by K-Means (labels) and DBSCAN (labels_DBSCAN) are distributed
# in the feature space of 'Grocery' and 'Fresh'.

plot(customers_cleaned['Grocery'], customers_cleaned['Fresh'], customers_cleaned['labels'])  # K-Means Clusters
plot(customers_cleaned['Grocery'], customers_cleaned['Fresh'], customers_cleaned['labels_DBSCAN'])  # DBSCAN Clusters


### Comparison of Grocery vs Fresh - K-Means vs DBSCAN

In this visualization, we are comparing the clustering results of **K-Means** and **DBSCAN** for the features **Grocery** and **Fresh**.

- The **top plot** shows how K-Means has grouped the data into 3 clusters (labels 0, 1, and 2). Each color corresponds to a different cluster.
- The **bottom plot** shows the DBSCAN clustering results, where the labels represent clusters found by DBSCAN. Note that DBSCAN identifies some points as noise, which are labeled as **-1** (the purple points in the bottom-left region). Other clusters are labeled with distinct numbers (0, 1, 3, 4, 6, 7).

In the top plot, the clusters formed by K-Means seem to have relatively well-defined regions in the feature space. However, DBSCAN (bottom plot) appears to produce more varied clusters, with some data points labeled as noise. This difference suggests that DBSCAN might be sensitive to the density of the points in the feature space.

Next, we will continue to explore the other feature pairs to further compare the clustering behavior of K-Means and DBSCAN.
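
Since the exercise asks for the two plots side by side, a helper along the following lines may be convenient. It is a sketch, not the notebook's `plot()` function: `plot_side_by_side` is a hypothetical name, it uses matplotlib directly instead of seaborn, and the demo DataFrame is synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_side_by_side(df, x, y, label_cols):
    """Draw one scatter plot per label column, all in a single row."""
    fig, axes = plt.subplots(1, len(label_cols), figsize=(6 * len(label_cols), 5),
                             sharex=True, sharey=True, squeeze=False)
    for ax, col in zip(axes[0], label_cols):
        ax.scatter(df[x], df[y], c=df[col], cmap="viridis", s=20)  # color by cluster label
        ax.set_title(f"{x} vs {y} by {col}")
        ax.set_xlabel(x)
        ax.set_ylabel(y)
    fig.tight_layout()
    return fig


# Synthetic demo data (assumption): random features and random cluster labels
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "Grocery": rng.normal(size=100),
    "Fresh": rng.normal(size=100),
    "labels": rng.integers(0, 3, size=100),
    "labels_DBSCAN": rng.integers(-1, 4, size=100),
})
fig = plot_side_by_side(demo, "Grocery", "Fresh", ["labels", "labels_DBSCAN"])
```

In the notebook, the call would take `customers_cleaned` in place of `demo`, followed by `plt.show()`.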


Visualize `Frozen` as X and `Delicassen` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Your code here:

# Visualizing Frozen vs Delicassen by labels and labels_DBSCAN respectively

# Again, we use the `plot()` function to compare how the K-Means and DBSCAN clustering results vary when visualizing 'Frozen' vs 'Delicassen'.

plot(customers_cleaned['Frozen'], customers_cleaned['Delicassen'], customers_cleaned['labels'])  # K-Means Clusters
plot(customers_cleaned['Frozen'], customers_cleaned['Delicassen'], customers_cleaned['labels_DBSCAN'])  # DBSCAN Clusters


### Comparison of Frozen vs Delicassen - K-Means vs DBSCAN

In this final visualization, we compare the clustering results of **K-Means** and **DBSCAN** for the features **Frozen** and **Delicassen**.

- The **top plot** shows how K-Means has grouped the data into 3 clusters (labels 0, 1, and 2). Each color corresponds to a different cluster, and the data points are distributed relatively evenly across the feature space.
- The **bottom plot** shows the DBSCAN clustering results. As in the previous plots, DBSCAN identifies some points as noise, which are labeled as **-1** (these points are located more sparsely in the plot). Other clusters are identified with distinct labels (0, 1, 3, 4, 6, 7).

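When comparing the two algorithms, K-Means assigns every point to one of three clusters, while DBSCAN groups only points in sufficiently dense regions and labels the rest as noise. DBSCAN is well suited to dense clusters of arbitrary shape and to flagging outliers, but because it applies a single `eps` to the whole dataset, it tends to struggle when density varies across the data, which appears to be the case here.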

By visualizing these feature pairs, we can observe how each clustering algorithm performs with respect to the same data, giving us a clearer view of their clustering capabilities.

We can now proceed to group the customers by their labels and examine how the means differ between the groups in the next steps.


Let's use a groupby to see how the mean differs between the groups. Group `customers` by `labels` and `labels_DBSCAN` respectively and compute the means for all columns.

In [None]:
# Your code here:

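# Grouping by K-Means labels (the DBSCAN label column is dropped first,
# because the mean of cluster ids is not meaningful)
kmeans_grouped = customers_cleaned.drop(columns='labels_DBSCAN').groupby('labels').mean()

# Grouping by DBSCAN labels (the K-Means label column is dropped for the same reason)
dbscan_grouped = customers_cleaned.drop(columns='labels').groupby('labels_DBSCAN').mean()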

# Display the results
print("Means for each group (K-Means):")
print(kmeans_grouped)

print("\nMeans for each group (DBSCAN):")
print(dbscan_grouped)


Which algorithm appears to perform better?

**Your observations here**

### Observations on the Performance of K-Means and DBSCAN

After clustering the data using both K-Means and DBSCAN, we can compare the mean values of the features for each cluster.

#### K-Means:
- K-Means divided the data into 3 clusters (labeled 0, 1, and 2).
- Cluster 0 seems to have positive values for `Grocery`, `Frozen`, and `Delicassen`, with a negative value for `Fresh`.
- Cluster 1 shows positive values for `Fresh` and `Frozen`, and negative values for `Grocery` and `Delicassen`.
- Cluster 2 appears to have extreme values in the `Fresh` and `Frozen` columns and negative values in `Grocery` and `Delicassen`.
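
The positive and negative values above make sense only relative to the overall mean, which is what you get when the features have been standardized. The sketch below illustrates this on a small made-up table (the numbers and the `labels` assignments are hypothetical, not taken from the customers dataset): after `StandardScaler`, a negative cluster mean simply means "below the overall average".

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical spending figures, not taken from the customers dataset
raw = pd.DataFrame({
    "Fresh":   [12000, 3000, 500, 20000],
    "Grocery": [2000, 9000, 15000, 1000],
})
# Hypothetical cluster assignments for the four rows
labels = pd.Series([1, 0, 0, 1], name="labels")

# StandardScaler centers every column at 0 and scales it to unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

# Cluster means of scaled data are expressed in standard deviations
# above (positive) or below (negative) the overall column mean
scaled_means = scaled.groupby(labels).mean()
print(scaled_means.round(2))

# The same grouping on the raw data gives the means in original units
raw_means = raw.groupby(labels).mean()
print(raw_means)
```

Reading both tables side by side is often easier than reading the scaled means alone.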

#### DBSCAN:
- DBSCAN, on the other hand, identified several noise points (labeled -1).
- DBSCAN formed more clusters, with a wider variety of features in each group.
- Some DBSCAN clusters (like 1 and 5) appear to have distinctly separated values compared to the K-Means clusters.
- The DBSCAN model also created some smaller clusters that K-Means couldn't identify, suggesting that DBSCAN may be more sensitive to finer structures in the data.
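
To back up statements like these with numbers, it helps to count the clusters, the noise points, and the cluster sizes directly from the label array. The following is a minimal sketch on a made-up label array (`labels_dbscan` here is hypothetical; in the notebook you would pass the `labels_DBSCAN` column instead).

```python
import numpy as np

# Hypothetical DBSCAN output; -1 marks noise, other integers are cluster ids
labels_dbscan = np.array([0, 0, 1, -1, 1, 2, -1, 0, 2, 2])

# Noise points are the ones labeled -1
n_noise = int(np.sum(labels_dbscan == -1))

# Number of clusters, not counting the noise label
n_clusters = len(set(labels_dbscan.tolist()) - {-1})

# Size of each real cluster
cluster_ids, sizes = np.unique(labels_dbscan[labels_dbscan != -1], return_counts=True)
cluster_sizes = dict(zip(cluster_ids.tolist(), sizes.tolist()))

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("cluster sizes:", cluster_sizes)
```

Very small clusters deserve a second look, since the mean of a handful of points is easily dominated by a single customer.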

#### Comparison:
- K-Means works well when the data forms compact, spherical clusters, but DBSCAN might perform better in identifying clusters of arbitrary shapes and isolating noise.
- DBSCAN's ability to identify outliers (labeled as -1) is a key advantage over K-Means, which assigns every point to a cluster.
- K-Means has a clearer distinction between cluster labels, whereas DBSCAN shows a more complex cluster structure with several smaller clusters and noise points.

In conclusion, the choice between K-Means and DBSCAN depends on the data structure and goals. K-Means is better suited for well-separated and evenly-sized clusters, while DBSCAN is more effective in handling outliers and irregular clusters.
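
The difference in how the two algorithms treat outliers can be illustrated on synthetic data. The sketch below builds two dense grids of points plus one isolated point (all values are made up for illustration, as are the `eps` and `min_samples` choices): K-Means has to place the isolated point in one of its clusters, while DBSCAN labels it `-1`.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Synthetic data: two dense 5x5 grids of points plus one isolated point
grid = np.array([[i * 0.1, j * 0.1] for i in range(5) for j in range(5)])
X = np.vstack([grid, grid + 10.0, [[5.0, 5.0]]])

# K-Means assigns every point, including the isolated one, to a cluster
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN leaves points without enough close neighbors unassigned (-1)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("K-Means label of the isolated point:", kmeans_labels[-1])
print("DBSCAN label of the isolated point:", dbscan_labels[-1])
```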



# Bonus Challenge 2 - Changing K-Means Number of Clusters

As we mentioned earlier, we don't need to worry about the number of clusters with DBSCAN because it automatically decides that based on the parameters we send to it. But with K-Means, we have to supply the `n_clusters` param (if you don't supply `n_clusters`, the algorithm will use `8` by default). You need to know that the optimal number of clusters differs case by case based on the dataset. K-Means can perform badly if the wrong number of clusters is used.

In advanced machine learning, data scientists try different numbers of clusters and evaluate the results with statistical measures (read [here](https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation)). We are not using statistical measures today; we'll use our eyes instead. In the cells below, experiment with different numbers of clusters and visualize the results with scatter plots. What number of clusters seems to work best for K-Means?
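
For reference, one statistical measure that needs no ground-truth labels is the silhouette coefficient, available in scikit-learn as `silhouette_score`. The sketch below is not part of the lab: it runs on synthetic blobs generated with `make_blobs` (the centers and spread are arbitrary choices), and simply picks the `k` with the highest score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)

# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

On real data the scores are usually much closer together, so treat the result as one piece of evidence alongside the plots.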

In [None]:
# Your code here:

# Trying different values for n_clusters in K-Means

# Use only the feature columns: earlier cells added label columns
# (names starting with 'labels') that must not be fed back into the model
feature_cols = [col for col in customers_cleaned.columns if not str(col).startswith('labels')]

# Experiment with 4, 5 and 6 clusters
for k in (4, 5, 6):
    # Fit K-Means with k clusters on the feature columns
    kmeans_k = KMeans(n_clusters=k, random_state=42)
    kmeans_k.fit(customers_cleaned[feature_cols])

    # Assigning the labels to the dataframe (labels_4, labels_5, labels_6)
    customers_cleaned[f'labels_{k}'] = kmeans_k.labels_

    # Plotting Grocery vs Fresh for k clusters
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x=customers_cleaned['Grocery'], y=customers_cleaned['Fresh'], hue=customers_cleaned[f'labels_{k}'], palette="deep", s=100, edgecolor='black')
    plt.title(f"K-Means with {k} Clusters - Grocery vs Fresh")
    plt.xlabel("Grocery")
    plt.ylabel("Fresh")
    plt.legend(title='Cluster')
    plt.show()


**Your comment here**

### Observations on K-Means Clusters

Based on the scatter plots, we can see how the number of clusters affects the distribution of data points in the space defined by **Grocery** and **Fresh**:

- **4 Clusters (First Image)**: The data points fall into four visibly distinct groups. However, some points sit on the boundary between two clusters, which indicates overlap or suggests that 4 may not be the ideal number of clusters.

- **5 Clusters (Second Image)**: With 5 clusters, the separation becomes slightly more refined. A few points still lie close to a neighboring cluster, but overall the groups are better separated than with 4 clusters.

- **6 Clusters (Third Image)**: With 6 clusters, the plot shows even finer divisions, which might group similar data points more tightly. However, the extra cluster might also split a naturally occurring group into smaller parts, which makes it harder to tell whether the additional clusters are genuinely meaningful or an artifact of asking for too many.

#### Conclusion:
Looking at these plots, **5 clusters** seem to strike a good balance between separating the data into distinct groups and avoiding over-segmentation. The clusters are reasonably distinct, with little overlap between points. Since this judgment rests on a single feature pair (`Grocery` vs `Fresh`), 5 is a plausible choice for K-Means here rather than a proven optimum.
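
A common way to cross-check a visual choice of `k` is the elbow heuristic: K-Means inertia (the within-cluster sum of squared distances, exposed as `inertia_`) generally falls as `k` grows, so the interesting point is where the drop flattens out. The sketch below uses synthetic blobs rather than the customers data, so the numbers are only illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)

# Record the inertia for each candidate number of clusters
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

# Print the inertia and how much it dropped compared with k - 1
for k in range(2, 7):
    drop = inertias[k - 1] - inertias[k]
    print(f"k={k}: inertia={inertias[k]:.1f}, drop from k-1={drop:.1f}")
```

In this synthetic case the drop is large up to `k=3` and small afterwards; plotting `inertias` against `k` shows the same thing as a bend in the curve.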


# Bonus Challenge 3 - Changing DBSCAN `eps` and `min_samples`

Experiment changing the `eps` and `min_samples` params for DBSCAN. See how the results differ with scatter plot visualization.
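
If you want a starting point for `eps` rather than pure trial and error, a widely used heuristic is the k-distance curve: for each point, compute the distance to its k-th nearest neighbor (with k equal to `min_samples`), sort those distances, and look for the value where they start rising sharply. The sketch below shows the computation on synthetic data; the dataset and the choice `min_samples = 5` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

min_samples = 5

# Querying the training data returns each point as its own first neighbor
# (distance 0), which matches DBSCAN counting the point itself in min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from every point to its min_samples-th neighbor, sorted ascending
k_distances = np.sort(distances[:, -1])

# A few percentiles give a rough idea of candidate eps values
for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_distances, q):.3f}")
```

Plotting `k_distances` with `plt.plot` makes the bend easier to see than the percentiles do.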

In [None]:
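# Your code here:

from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import seaborn as sns

# Experimenting with different values of eps and min_samples for DBSCAN

# Create DBSCAN instance with chosen eps and min_samples values
dbscan = DBSCAN(eps=0.7, min_samples=5)

# Fit DBSCAN on the feature columns only: the label columns added in
# earlier cells (names starting with 'labels') are excluded from the input
feature_cols = [col for col in customers_cleaned.columns if not str(col).startswith('labels')]
dbscan.fit(customers_cleaned[feature_cols])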

# Assign the labels to the DataFrame
customers_cleaned['labels_DBSCAN'] = dbscan.labels_

# Plot the data with DBSCAN labels
plt.figure(figsize=(10, 6))
sns.scatterplot(x=customers_cleaned['Grocery'], 
                y=customers_cleaned['Fresh'], 
                hue=customers_cleaned['labels_DBSCAN'], 
                palette='viridis', 
                marker='o', s=70)
plt.title('DBSCAN Clustering with eps=0.7 and min_samples=5 - Grocery vs Fresh')
plt.xlabel('Grocery')
plt.ylabel('Fresh')
plt.legend(title='Cluster', bbox_to_anchor=(1, 1), loc='upper left')
plt.show()


**Your comment here**

### DBSCAN Clustering with eps=0.7 and min_samples=5: Results

- In this scatter plot, we applied DBSCAN clustering with the parameters `eps=0.7` and `min_samples=5`. 
- The plot shows how DBSCAN has grouped the data into multiple clusters, with some points marked as outliers (label `-1`).
- There is a noticeable concentration of points in certain regions, and the points are distributed across different clusters, each represented by a different color.
- As we can see, DBSCAN successfully identified clusters and labeled outliers where no clear cluster was found, especially in the areas with low density of data points.
- The comparison with K-Means clustering in previous steps helps us understand the behavior of both algorithms, with DBSCAN focusing more on the density of points rather than the distance from cluster centroids, which makes it well-suited for identifying irregular-shaped clusters.
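
A single parameter setting shows only one point of the picture. The sketch below sweeps a small grid of `eps` and `min_samples` values and reports the number of clusters and noise points for each combination. It runs on synthetic data, and the grid values are arbitrary choices for illustration, not recommendations for the customers dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

eps_values = (0.1, 0.3, 0.5, 0.7)
min_samples_values = (3, 5, 10)

# results[(min_samples, eps)] = (number of clusters, number of noise points)
results = {}
for min_samples in min_samples_values:
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels.tolist()) - {-1})
        n_noise = int(np.sum(labels == -1))
        results[(min_samples, eps)] = (n_clusters, n_noise)
        print(f"min_samples={min_samples}, eps={eps}: "
              f"{n_clusters} clusters, {n_noise} noise points")
```

For a fixed `min_samples`, a larger `eps` can only keep or reduce the number of noise points, so a table like this quickly shows the range where the clustering stops being mostly noise and before everything merges into one cluster.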

### Conclusion

- **K-Means** tends to form spherical clusters based on the centroid of the points, making it more sensitive to outliers and less flexible when dealing with irregular-shaped clusters.
- **DBSCAN**, on the other hand, successfully handles noise and can detect clusters of arbitrary shape, as shown in the results from our experiment. This makes DBSCAN a great choice when working with real-world data that may contain noise or unusual cluster shapes.
- Both algorithms have their strengths and weaknesses, and the choice between K-Means and DBSCAN depends on the data and the problem you're trying to solve.
 